Research
Our research interests are in applying and developing novel graph-theoretic, statistical and machine learning techniques to solve problems in computational biology. These techniques address many challenges in the field because they offer a natural way to integrate different types of data and to handle large amounts of noisy information.
Our research has mainly focused on four areas:
- Inference and analysis of large-scale Protein-Protein Interaction networks.
- Protein Function Prediction.
- Inferring relationships between Genotype, Phenotype and Environment.
- Analysis of Biological Processes from co-expression networks.
Inference and analysis of large-scale protein-protein interaction networks
Proteins carry out their molecular functions by interacting with other molecules, mainly other proteins. For this reason protein interactions provide an important step toward understanding protein function and cell behaviour. Systematically mapping the set of all protein-protein interactions within an organism – the interactome – has therefore become a major challenge in post-genomic biology. Recent developments in experimental procedures (e.g. co-affinity purification followed by mass spectrometry, AP-MS) have resulted in the publication of many high-quality protein-protein interaction datasets for different organisms ranging from the yeast Saccharomyces cerevisiae to Homo sapiens.
An interactome has a natural representation as an undirected graph, often called a protein-protein interaction (PPI) network, where nodes represent proteins and edges represent interactions between pairs of proteins. An estimate of the reliability of each interaction is often available and is included as an edge label (weight). Interactomes have a modular structure, meaning that there are sets of proteins that interact with each other more frequently than with the rest of the network. These densely connected regions are typically interpreted as protein complexes, and their identification is crucial to deepening our understanding of cellular processes. Identifying protein complexes from PPI data is therefore commonly cast as the problem of detecting dense regions with many connections in PPI networks (or regions with large total weight if the networks are weighted).
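As an illustration of the idea of a dense region in a weighted PPI network, the sketch below grows a candidate complex greedily from a seed protein. It is a toy example under stated assumptions, not the method from any of the publications listed on this page: the protein names, the weights, the density definition (sum of internal weights divided by the number of possible pairs) and the `min_density` cut-off are all hypothetical choices made for illustration.

```python
from itertools import combinations


def build_graph(edges):
    """Return {protein: {neighbour: weight}} for an undirected weighted edge list."""
    graph = {}
    for a, b, weight in edges:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def weighted_density(graph, nodes):
    """Sum of internal edge weights divided by the number of possible pairs."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0.0
    total = sum(graph[a].get(b, 0.0) for a, b in combinations(nodes, 2))
    possible_pairs = len(nodes) * (len(nodes) - 1) / 2
    return total / possible_pairs


def grow_dense_region(graph, seed, min_density=0.5):
    """Greedily add the neighbour that gives the densest region.

    Stops when no neighbour is left or when the best extension would
    push the density below min_density (an assumed, tunable cut-off).
    """
    region = {seed}
    while True:
        frontier = {n for p in region for n in graph[p]} - region
        if not frontier:
            return region
        best = max(sorted(frontier),
                   key=lambda n: weighted_density(graph, region | {n}))
        if weighted_density(graph, region | {best}) < min_density:
            return region
        region.add(best)


# Hypothetical reliability-weighted interactions (toy data).
toy_edges = [
    ("A", "B", 0.9), ("A", "C", 0.8), ("B", "C", 0.85),
    ("C", "D", 0.2), ("D", "E", 0.9),
]
toy_graph = build_graph(toy_edges)
print(sorted(grow_dense_region(toy_graph, "A", min_density=0.6)))
```

In this toy network the tightly connected triangle A, B, C is kept together, while the weak C-D link is not strong enough to pull D into the region. A greedy procedure of this kind finds one region per seed; regions grown from different seeds may share proteins.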
In our lab, research on large-scale PPI networks has been funded by the BBSRC (grant BB/F00964X/1) and the Royal Society (grant NF080750). We have worked on methods for:
- Detecting potentially overlapping protein complexes from protein-protein interaction networks [Nature Methods, accepted; a platform-independent, open-source implementation of the method is available].
- Integrating different protein interaction experiments into one PPI network [Nature, vol. 440, 7084, pp. 637-643, 2006].
- De-noising large-scale protein-protein interaction experiments by exploiting the topology of the PPI network [Bioinformatics, vol. 22, 7, pp. 823-829, 2006].
- Predicting protein-protein interactions from genomic features [Genome Research, vol. 15, 7, pp. 945-953, 2005].
Protein Function Prediction
In recent years, numerous large-scale sequencing projects have generated enormous amounts of sequence data. This has led to the identification of thousands of previously unknown genes whose function awaits characterization. A precise definition of protein function is difficult, as in general the meaning of the term “function” depends on the context under consideration. The current dominant solution to this problem is the use of ontologies, which consist of terms in a controlled vocabulary organized in a hierarchical structure through a set of well-defined relationships.
Standard ontologies usually have a structure that can be modeled by a rooted, directed tree or, more generally, by a directed acyclic graph, as in the Gene Ontology, which is becoming the standard. Even with function defined through ontologies, about a third of the proteins in the best characterized model organisms have unknown function. A fundamental goal is therefore to identify the function of uncharacterized genes on a genomic scale. It is difficult to design functional assays for uncharacterized genes, so a major challenge in bioinformatics is to devise algorithmic methods that, given a gene, can propose a hypothesis for its function that can then be validated experimentally.
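To make the difference between a tree and a directed acyclic graph concrete, the sketch below stores a small ontology as a child-to-parents mapping and collects all ancestors of a term. The term names and gene names are hypothetical and do not come from the Gene Ontology; the only property illustrated is that a term may have more than one parent, and that annotating a gene with a term implicitly annotates it with every ancestor of that term.

```python
def ancestors(term, parents):
    """Return all terms reachable from `term` via parent links (term excluded)."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Extend each gene's direct annotations with all ancestor terms."""
    propagated = {}
    for gene, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        propagated[gene] = full
    return propagated


# Hypothetical toy ontology: "dna_helicase" has two parents, so this is a
# directed acyclic graph rather than a tree.
toy_parents = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "dna_helicase": ["dna_binding", "catalysis"],
}

# Hypothetical direct annotations.
toy_annotations = {"gene1": {"dna_helicase"}, "gene2": {"binding"}}

for gene, terms in sorted(propagate(toy_annotations, toy_parents).items()):
    print(gene, sorted(terms))
```

A prediction method that outputs terms in such a hierarchy can be scored at different levels of specificity: predicting "binding" for gene1 is correct but less informative than predicting "dna_helicase".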
In our lab, research on protein function prediction has been funded by the BBSRC (grant BB/F00964X/1). We have worked on methods for:
- Improving any semantic similarity measure on GO by exploring the ontology beneath the terms and modeling uncertainty [Bioinformatics, under review].
- Grouping proteins according to consensus architecture [Nucleic Acids Research, under review].
- Predicting protein function in E. coli by combining proteomic and genomic context data [PLoS Biology (2009); 7(4) (this paper was highlighted in Nature Methods 2009; 6, 402-403)].
- Clustering protein sequences according to their evolutionary relationships with spectral methods based on sequence distances [Nucleic Acids Research 34:1571, 2006], [BMC Bioinformatics 11:120, 2010], [Proceedings of IJCNN 2003; LNCS 2859 (2003)].
Inferring relationships between genotype, phenotype and environment
An important problem in biology is to uncover the links between the genetic makeup of an organism (genotype) and its observable physical or biochemical characteristics (phenotype). For example, this would increase our ability to rapidly characterize an unknown microorganism, which is critical both in responding to infectious disease and in biodefense. To do this, we need some way of anticipating an organism’s phenotype based on the molecules encoded by its genome.
At the same time, how specific sequences link distinct environmental conditions with specific biological processes is not well understood. Thus, another important challenge is to understand how the usage of particular pathways and subnetworks reflects the adaptation of microbial communities across environments and habitats, i.e., how network dynamics relate to environmental features. We have worked on methods for:
- Quantifying environmental adaptation of metabolic pathways in metagenomics [Proc Natl Acad Sci USA (2009) Feb 3;106(5):1374-9. This paper was highlighted in Science 2009; 323 (5918) and Nature Genetics 2009; 41 (275)].
- Predicting essential genes in fungal genomes [Genome Research, 16, 1126 (2006)].
- Integrating curated databases to identify genotype-phenotype associations [BMC Genomics, vol. 7, p. 257, 2006].
Analysis and detection of biological processes from co-expression networks
Gene expression experiments measure the activity of thousands of genes in response to different conditions. Generally, genes involved in a particular biological mechanism tend to exhibit similar expression patterns and form groups. An important question in this area is that of detecting from transcriptomics data which biological processes are activated in a given condition.
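The notion that genes with similar expression patterns form groups can be illustrated by building a small co-expression network: each pair of genes is linked when the Pearson correlation of their expression profiles exceeds a threshold. This is a minimal sketch with hypothetical gene names, made-up expression values and an arbitrary threshold of 0.9; it is not the procedure used in the publications listed on this page.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def coexpression_edges(expression, threshold=0.9):
    """Return (gene1, gene2, r) for every pair with correlation >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical expression levels of four genes across four conditions.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.1, 2.0, 2.9, 4.2],
}

for g1, g2, r in coexpression_edges(toy_expression):
    print(f"{g1} - {g2}: r = {r:.3f}")
```

Here geneA, geneB and geneD rise together and end up mutually connected, while geneC, which falls as the others rise, is left out. Groups of connected genes in such a network are the candidates one would then test for association with a biological process.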
Another problem is that of selecting marker genes that can represent such specific mechanisms. These markers can be used as readouts to help understand the mechanisms, monitor the interactions between them and track the physiological effects they may exert. For example, as yeast cells grow, genes involved in various hormone pathways exhibit similar expression patterns and form groups. Sensitive and specific markers that can track and report the dynamics of each group are important for investigating the mechanisms of response to each hormone, cross-talk between hormone pathways and the relationship between hormones and phenotypic effects.
In our lab, research on the analysis of transcriptomics data has been funded by the BBSRC (grant BB/F00964X/1) and Royal Holloway, through the Agnes Grace Ellen Endowment. We have developed methods for:
- Computational selection of transcriptomics experiments so as to improve Guilt-by-Association analyses [PLoS ONE, under review].
- Detecting process-representative genes by integrating data from multiple sources [a patent application on this topic is currently being filed].
- Clustering Pseudomonas aeruginosa transcriptomes [BMC Genomics, 2006 Jun 26;7:162].