We are interested in two broad areas of research. First, we develop computational methods for predicting the structure, function, and evolution of proteins from sequence. We develop statistical methods that enable us to make use of the vast amount of sequence information that is becoming available at an ever-increasing pace. The goal is to provide life scientists with more and more powerful tools for predicting the functions and structures of proteins in order to guide their experimental work.


Second, we want to understand how transcriptional regulation, which represents the most important level of cellular regulation, is encoded in each gene's regulatory regions. We develop computational methods to analyse regulatory sequences and to detect regulatory motifs. We also want to predict transcription rates, using probabilistic modeling, statistical physics, and machine learning techniques. We collaborate extensively with experimental groups to elucidate the molecular processes regulating transcription initiation, elongation, mRNA processing, and chromatin states.

We develop and employ machine learning, statistical, and algorithmic methods, both to create tools for the wider biological community, and to investigate biological questions in the above areas.


 Automatic protein structure and function prediction

Proteins are key players in life, and understanding their functions is crucial to unraveling the workings of our cells and the origin of diseases. Yet the molecular functions of most human proteins are only very partially understood. Since proteins are molecular machines, knowing their 3D structure often helps a lot in understanding their behaviour and involvement in cellular processes. But while the sequences of essentially all human proteins and more than 10 million from other organisms are known, determining the 3D structure of proteins is very costly and often fails. Hence methods to predict protein structures from their sequence have proven valuable, e.g. by suggesting functional hypotheses that focus experimental efforts to investigate their functions. Traditionally, protein structure prediction has been done in a time-consuming manual process by specialists. We are developing sensitive and reliable methods and servers for automatic structure prediction, which allow non-specialists and bench biologists to apply these powerful structure prediction techniques to their work.

We regularly participate in the community-wide blind protein structure prediction competition CASP (Critical Assessment of Techniques for Protein Structure Prediction), which takes place every two years. In the latest CASP competition, which took place in 2010, our HHpred server was assessed as one of the best out of the 81 servers, in particular in template-based structure prediction, the category most relevant for biological applications (see Mariani, 2011, CASP9 results table, and figure below). This is remarkable for two reasons: First, with ~4 minutes median response time, HHpred is about 100 times faster than other top servers. Second, most top servers use our freely available software HHsearch to find templates and to align target to template sequences. For CASP8 results, please refer to our publication in Proteins (2009). An interactive version of HHpred, which is geared towards non-specialists and offers much more functionality than the fully automatic CASP version, can be found here: HHpred.

Official CASP9 cumulative Z-scores for model quality in the template-based modeling category, plotted over the median response time of the 81 participating servers. (See Table)

Also, knowing the functional sites (e.g. ligand-binding residues or catalytic residues) of already characterized proteins allows annotation to be transferred to many homologous proteins based on the presence or absence of functional amino acids. Using structural models of proteins, we predict their functional sites from sequence alone. These predictions can then be validated by genetic or biochemical experiments.


 Fast, sensitive and accurate sequence alignment

Two trends are going to drive the explosion of sequence information in the near future: the development of fast, highly parallelized next- and third-generation sequencing technologies (Bentley 2007) and the growth of metagenomics (Tringe 2005, Gabor 2007, Schmeisser 2007), which aims to study organisms in their natural habitat by sequencing large amounts of environmental samples, obviating the need for prior cultivation (see, e.g., Yooseph 2007, Kurokawa 2007). The riches of sequence information that these developments promise will not be utilized well by present-day methods for sequence-based structure and function prediction. Also, the sheer amount of data will create substantial performance challenges for present-day sequence comparison methods.

We are developing novel and more sensitive sequence search techniques that will help to make sense of this vast amount of information, reducing the number of unannotatable sequences, improving protein structure prediction, and enhancing the quality of multiple sequence alignments, a critical factor for almost all sequence-based analysis methods. We are working on several approaches to make single sequence searches more sensitive without loss of speed, e.g. by taking sequence context into account, by developing an iterative search method based on pairwise comparison of profile Hidden Markov Models (HMMs), and by generalizing HMM-HMM comparison. In another project, we are developing methods for ultrafast sequence comparison to improve speeds over BLAST and PSI-BLAST by a factor of 100 to 1000. In the longer term, we would like to combine all of these methods to build an intelligent sequence repository that organizes all known sequences by their functional, structural, and evolutionary domains. By instilling sense into the raw sequences filling current databases, such a next-generation domain-based database could be invaluable for more efficiently elucidating the functions of proteins.
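To illustrate what comparing two sequence profiles column by column can look like, here is a minimal sketch of a log-sum-of-odds column score, log2 of the sum over amino acids a of p(a) q(a) / f(a). It is a simplification and not the HHsearch implementation: it ignores insert and delete states and transition probabilities, and the pseudocount of 1, the uniform background frequencies, and the function names `column_profile` and `column_score` are assumptions made for this example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(column, pseudocount=1.0):
    """Estimate amino acid probabilities for one alignment column.

    Gap characters and unknown symbols are ignored. The flat pseudocount
    is an illustrative assumption, not a recommended scheme.
    """
    counts = {a: pseudocount for a in AMINO_ACIDS}
    for residue in column:
        if residue in counts:
            counts[residue] += 1.0
    total = sum(counts.values())
    return {a: counts[a] / total for a in AMINO_ACIDS}


def column_score(p, q, background):
    """Log-sum-of-odds score between two profile columns, in bits."""
    return math.log2(sum(p[a] * q[a] / background[a] for a in AMINO_ACIDS))


# Assumption: uniform background frequencies, chosen only for simplicity.
background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}

hydrophobic_a = column_profile("LLLIL")
hydrophobic_b = column_profile("LLIVL")
acidic = column_profile("DDEDE")

similar = column_score(hydrophobic_a, hydrophobic_b, background)
dissimilar = column_score(hydrophobic_a, acidic, background)
print(f"similar columns: {similar:.3f} bits, dissimilar columns: {dissimilar:.3f} bits")
```

Columns with similar amino acid preferences score above zero and columns with unrelated preferences score below zero, which is what lets a profile-profile alignment reward matching positions.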



 Protein evolution


In collaboration with Andrei Lupas from the Max-Planck-Institute for Developmental Biology in Tübingen, we use our sensitive methods for homology detection to investigate the evolution of proteins and, in particular, the origin of protein domains from smaller fragments (Söding and Lupas 2003). We are gathering evidence for our hypothesis that protein domains were recombined and duplicated from shorter peptides of 20 to 50 residues in length, which in turn evolved in the RNA world as cofactors to enhance RNA's catalytic and binding capabilities (Alva 2007, Coles 2005).

As an example, this galaxy of folds shows domains of known structure as dots colored by fold type. They were clustered by their pairwise sequence similarities, a property that reflects common descent. Most folds are monophyletic and cluster together, but interestingly, several clusters contain domains of different fold types. These are linked by recurrent fragments with similar sequences and structures, which may be descendants of an ancestral pool of peptide modules from which the first folded proteins arose (Alva et al. 2010).


 Transcriptional regulation


Chimpanzees are over 98 percent genetically identical to humans, yet their phenotype is surprisingly different. In recent years it has become clear that it is the regulation of our genes and not the genes themselves that makes us different (see, e.g., McLean et al., Nature 2011; Maurano et al., Science 2012). The importance of transcriptional regulation is seen in the development of a multicellular organism: Beginning with the fertilized egg, networks of transcriptionally regulated and regulating genes orchestrate the miraculous development from a single cell into an adult organism according to a master plan set out in the genomic DNA sequence. Transcription factors regulate the transcription of target genes by binding to regulatory motifs in enhancer and promoter regions on the DNA. But given the importance of transcriptional regulation for the development of complex organisms, cellular responses, homeostasis, and the origin of many diseases, surprisingly little is known about the mechanisms, molecular protagonists, and molecular processes involved in switching transcription on and off.

We develop tools to analyse promoters in order to understand and predict their regulatory behaviour. To gain insights into the functional constraints during the evolution of regulatory regions, we are developing methods that describe in more detail the evolution of such regions. This should allow us to improve the multi-species alignment of genomic sequences. This in turn will help to identify functional regions by their conservation signature.
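The idea of finding functional regions by their conservation signature can be sketched with a toy example: score each column of a multi-species alignment by how many sequences agree, then report windows whose average score is high. This is only an illustration under simple assumptions; the scoring rule, the window length, the threshold, the function names, and the made-up three-sequence alignment are all hypothetical and do not represent the evolutionary models described above.

```python
def column_identity(alignment):
    """Fraction of sequences sharing the most common non-gap base per column."""
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        bases = [b for b in column if b != "-"]
        if not bases:
            # A column consisting only of gaps carries no conservation signal.
            scores.append(0.0)
            continue
        most_common = max(bases.count(b) for b in set(bases))
        scores.append(most_common / n_seqs)
    return scores


def conserved_windows(scores, window=5, threshold=0.9):
    """Start positions of windows whose mean score reaches the threshold."""
    hits = []
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        if mean >= threshold:
            hits.append(start)
    return hits


# Hypothetical aligned promoter fragments from three species.
alignment = [
    "ACGTATAAAGGCTA",
    "TCGAATAAAGCCTT",
    "AGGTATAAAGGATA",
]

scores = column_identity(alignment)
print(conserved_windows(scores, window=5, threshold=1.0))
```

With a threshold of 1.0 only windows in which all three sequences agree at every position are reported; lowering the threshold tolerates a few diverged columns.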

We analyze the core promoter structure and its function in transcriptional regulation. We are further interested in understanding the interactions between DNA regulatory elements and specific transcriptional activators and repressors on the one hand, and among these activators and repressors themselves on the other. These investigations will interface with our efforts to identify and characterize the entire set of DNA- and RNA-binding proteins in eukaryotic genomes. Here, we can draw on our expertise in protein structure modeling and remote homology detection.

We are also pursuing a quantitative approach to transcriptional regulation, with the goal of developing thermodynamically inspired models that are able to predict the transcriptional activity of genes given the concentrations of regulatory factors, nucleosomes, and their modifications. We would like to model how the concentrations and activities of transcription factors and nucleosomes affect their binding to DNA, and how the bound factors affect transcription at target genes that are sometimes kilobases away.
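As a rough illustration of what a thermodynamically inspired model can look like in its simplest form, the sketch below treats a single activator binding site at equilibrium: the probability that the site is occupied is c / (c + Kd), and the transcription rate is interpolated between a basal and a maximal value according to that occupancy. The single-site assumption, the linear link between occupancy and rate, and all parameter names and values are assumptions made for this example; nucleosomes, cooperativity, and long-range effects are deliberately left out.

```python
def occupancy(concentration, dissociation_constant):
    """Equilibrium probability that a single binding site is occupied."""
    return concentration / (concentration + dissociation_constant)


def transcription_rate(activator_conc, kd, basal_rate=1.0, max_rate=10.0):
    """Interpolate between basal and maximal rate according to site occupancy.

    The linear dependence on occupancy is an illustrative assumption.
    """
    bound = occupancy(activator_conc, kd)
    return basal_rate + (max_rate - basal_rate) * bound


# Hypothetical concentrations and Kd, in the same arbitrary units.
for conc in (0.0, 1.0, 10.0, 100.0):
    rate = transcription_rate(conc, kd=10.0)
    print(f"activator = {conc:6.1f}  ->  rate = {rate:.2f}")
```

The rate rises from the basal value towards the maximal value as the activator concentration passes the dissociation constant, which is the saturating behaviour such occupancy-based models are built around.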

We develop our models in close collaboration with the groups of Patrick Cramer, Ulrike Gaul, and others, who study transcriptional regulation in the model organisms yeast and fruit fly using high-throughput, genome-wide measurements. Since components of the transcription regulatory machinery, such as RNA Polymerase II, the general transcription factors, and macromolecular regulatory complexes such as the Mediator, are well conserved from yeast to human, these model organisms are ideal for understanding the fundamental principles of gene regulation.