Spectral clustering of protein sequences

Clustering protein sequences based on their evolutionary relationship is important for sequence annotation as structural and functional relationships can potentially be inferred. Most of the existing methods were based on simply thresholding a measure related to the distance between sequences. We mapped this problem into that of clustering the nodes of a weighted undirected graph in which each node corresponds to a protein sequence and the weights on the edges correspond to a measure of distance between two sequences. E-values within and between superfamiliesThe goal is to partition such a graph into a set of discrete clusters whose members are homologs. By formulating the problem this way, it was possible to analyze the limitations that one might expect from earlier methods and we were able to provide a theoretical explanation of why spectral clustering methods should perform better for this problem. Then, we introduced a spectral method that partitions the graph into clusters by considering the random walk formulation on the graph and analyzing the perturbations to the stationary distribution of a Markov relaxation process. This is done by looking at the eigenvectors of the Markov transition matrix. Comparison of protein families obtained by various methodsThe weights on the edges of the graphs that we used are a nonlinear transformation of the pairwise BLAST E-values, and the parameters of such nonlinear transformation were learned. We tested this algorithm on difficult sets of proteins whose relationships are known from the SCOP database [2, 4, 5]. Our method correctly identified many of the family/superfamily relationships. We found that the eigenvalue spectrum and in particular, the eigengaps provide useful indication on the number of clusters in the dataset. Using this approach the results obtained were much better than those obtained using other methods on the same datasets. In fact, we consistently observed that, the number of clusters obtained for a given set of proteins was close to the number of superfamilies in that set; there were fewer singletons; and the method correctly grouped most remote homologs. In our experiments, the quality of the clusters as quantified by a measure that combines sensitivity and specificity was consistently better (on average, improvements were: 84% over hierarchical clustering, 34% over connected component analysis (similar to GeneRAGE), and 72% over TribeMCL.

In collaboration with Rajkumar Sashidaran (Stanford University), we have recently released SCPS, a fast multi-platform implementation of the spectral method that enables us to detect protein families on a genome-wide scale [1]. We are currently investigating how to improve these results using sequence-profiles scores [3].

[1] T. Nepusz, R. Sasidharan, A. Paccanaro (2010). SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale. BMC Bioinformatics 11:120. (highly accessed) [read online] [pubmed]

[2] A. Paccanaro, J. A. Casbon, M. A. S. Saqi (2006). Spectral clustering of protein sequences. Nucleic Acids Research 34(5):1571-80 [pubmed]

[3] R. Sasidharan, M. Gerstein, A. Paccanaro (2006) Spectral clustering of protein sequences using sequence-profile scores. In: Proceedings of ICNPSC 2006 — 3rd International Conference on Neural Parallel and Scientific Computations, Atlanta, USA.

[4] A. Paccanaro, C.Chennubhotla, J. A. Casbon, M. A. S. Saqi (2003). Spectral Clustering of Protein Sequences. IJCNN 2003 – International Joint Conference on Neural Networks, Portland, Oregon, USA.

[5] C. Chennubhotla, A. Paccanaro (2003). Markov Analysis of Protein Sequence Similarities Lecture Notes in Computer Science series, Vol. 2859, 278-286, Springer-Verlag.