Species-specific analysis of protein sequence motifs using mutual information
© Hummel et al; licensee BioMed Central Ltd. 2005
Received: 03 March 2005
Accepted: 29 June 2005
Published: 29 June 2005
Protein sequence motifs are by definition short fragments of conserved amino acids, often associated with a specific function. Accordingly protein sequence profiles derived from multiple sequence alignments provide an alternative description of functional motifs characterizing families of related sequences. Such profiles conveniently reflect functional necessities by pointing out proximity at conserved sequence positions as well as depicting distances at variable positions. Discovering significant conservation characteristics within the variable positions of profiles mirrors group-specific and, in particular, evolutionary features of the underlying sequences.
We describe the tool PROfile analysis based on Mutual Information (PROMI), which enables comparative analysis of user-classified protein sequences. PROMI is implemented as a web service using Perl and R as well as other publicly available packages and tools on the server side. On the client side, platform independence is achieved by relying on widely used internet delivery standards. As one possible application, an analysis of the zinc finger C2H2-type protein domain is introduced to illustrate the functionality of the tool.
The web service PROMI is intended to assist researchers in detecting evolutionary correlations in protein profiles of defined biological sequences. It is available at http://promi.mpimp-golm.mpg.de where additional documentation can be found.
Here we describe PROMI, a system for discovering group-specific conservation characteristics in the amino acid distribution of profiles. We assume that each sequence forming a general profile is associated with a user-defined biological classification label, where the number of distinct labels should be much smaller than the number of rows in the profile. Specifically, the relations between the profile columns and the group affiliation of the sequences forming the profile are investigated. These relations become apparent as significant amino acid conservations, leading either to distinct amino acid consensus patterns in the analyzed groups or to knowledge about the affinity between the groups.
To this end, the mutual information (MI) is used as a measure of interdependence between the random variables X_i and Y [2–5]. The interdependence between X_i (in our case a column of the profile X) and Y (here the group affiliation) is understood as the knowledge one gains about Y if X_i is known, and vice versa [6, 7]. Small values imply a small gain of knowledge, whereas high values indicate a larger gain. The MI-profile calculated for the whole alignment of all k groups, together with all pairwise MI-profiles and the computed sequence logos, allows conclusions about group-specific amino acid positions at which the distributions differ significantly, so that groups can be discriminated on the basis of a single profile position. Moreover, the mean value of each pairwise MI-profile yields an elementary distance matrix D: a low mean indicates high molecular similarity between two groups of sequences, whereas a higher mean indicates a larger molecular distance between the underlying groups. Further, by applying hierarchical clustering to D, a phylogenetic tree reflecting the distances between its constituents can be constructed.
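The two extremes of "small" and "high" MI values can be made concrete with a short worked example. It is only an illustration: the base-2 logarithm (values in bits) and the toy alignment of two equally sized groups are assumptions, not taken from the PROMI implementation.

```latex
% Standard definitions (Shannon entropy, base-2 logarithm assumed):
\[
  H(X_i) = -\sum_{x} p(x)\,\log_2 p(x), \qquad
  I(X_i, Y) = H(X_i) + H(Y) - H(X_i, Y)
\]
% General bounds:
\[
  0 \;\le\; I(X_i, Y) \;\le\; \min\bigl(H(X_i),\, H(Y)\bigr)
\]
% Two groups of equal size, so H(Y) = 1 bit.
% (a) Fully conserved column (same residue in every row):
%     H(X_i) = 0 and H(X_i, Y) = H(Y) = 1
\[
  I(X_i, Y) = 0 + 1 - 1 = 0 \text{ bits}
\]
% (b) Perfectly discriminating column (residue A in group 1, S in group 2):
%     H(X_i) = 1 and H(X_i, Y) = 1
\[
  I(X_i, Y) = 1 + 1 - 1 = 1 \text{ bit} = H(Y)
\]
```

In case (a) knowing the column says nothing about the group, while in case (b) the column alone determines the group, which is the situation in which group discrimination from a single profile position is possible.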
In the following we use "class" and "classification" synonymously with "group" and "group affiliation".
Results and discussion
Visualisation of mutual information profiles
Although it may seem a naive and rough approach that disregards protein steric structures as well as amino acid proximities, the method is coherent. MI as a distance measure satisfies the four axioms of a metric: non-negativity, identity of indiscernibles, symmetry and triangle inequality.
As mentioned, the method is not limited to a species-specific context: the classification may equally be a phylogenetic identifier, gene expression values or any other classification chosen by the user. The influence of potentially adverse effects, for example false positive matches when PROSITE motifs are used for the sequence search, or a possible bias caused by highly differing numbers of matching sequence fragments per group, can be reduced with little extra effort. For this purpose, a user-prepared multiple alignment, derived for example with BLAST and CLUSTAL, can conveniently be supplied. The number of selected instances per group can be balanced by limiting the fragments to a suitable number. Regardless of these issues, the proposed method is a fast and convenient approach for the motif-specific analysis of hundreds of sequences derived by homology from orthologous or paralogous gene and protein domain families.
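Balancing the number of instances per group could look roughly like the following Python sketch. The text does not say how fragments are selected when a group is too large, so the seeded random subsampling, the function name `balance_groups` and the parameter `max_per_group` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance_groups(fragments, labels, max_per_group, seed=0):
    """Limit every group to at most max_per_group fragments.

    Assumption: surplus fragments are dropped by random subsampling;
    PROMI may use a different selection rule.
    """
    rng = random.Random(seed)  # seeded so the selection is reproducible
    by_group = defaultdict(list)
    for fragment, label in zip(fragments, labels):
        by_group[label].append(fragment)

    out_fragments, out_labels = [], []
    for label in sorted(by_group):
        members = by_group[label]
        if len(members) > max_per_group:
            members = rng.sample(members, max_per_group)
        out_fragments.extend(members)
        out_labels.extend([label] * len(members))
    return out_fragments, out_labels


# Toy data: five "human" fragments but only two "yeast" fragments.
fragments = ["CAKC", "CAKC", "CARC", "CAKC", "CTKC", "CSRC", "CSRC"]
labels = ["human"] * 5 + ["yeast"] * 2
balanced_fragments, balanced_labels = balance_groups(fragments, labels, max_per_group=2)
```

After the call each group contributes at most two fragments, so a single over-represented group no longer dominates the column frequencies.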
Multiple alignment and mutual information
Given a motif describing sequence fragments of length n, the m matching sequence fragments of length n can be arranged as a multiple alignment, as shown in Figure 1. Each of the n columns corresponds to one position of the motif. Furthermore, the m fragments belong to k groups reflecting the user-applied classification (m >> k). This structure yields n m-dimensional vectors X_i (with i = 1...n), each consisting of characters from the amino acid one-letter code plus one additional character for gaps. All X_i combine into the matrix X = (X_1, X_2, ..., X_i, ..., X_n), which represents the profile. Additionally, an m-dimensional vector Y is formed from the classification. It consists of a discrete set of group labels, where the label in each row gives the group affiliation of the sequence fragment in the same row of X.
A convenient measure of correlation between two discrete random variables, in our case X_i and Y, is the mutual information I(X_i, Y), which is defined in terms of the entropies H(X_i) and H(Y) and the joint entropy H(X_i, Y):
$$I(X_i, Y) = H(X_i) + H(Y) - H(X_i, Y)$$
The entropy H(X_i) is calculated from the relative frequency of each character x within the vector X_i of length m. H(Y) is calculated in the same way from the group labels. The joint entropy H(X_i, Y) is derived by concatenating the characters x and the group labels y row by row and again calculating the relative frequencies of all combinations that occur. Furthermore, each MI value is multiplied by a factor reflecting the likelihood of gaps in the respective column. This heuristic was adapted from the C4.5 machine learning algorithm. By iterating over each column i (i = 1...n) of the profile and calculating I(X_i, Y), a so-called mutual information profile (MI-profile) incorporating all MI values is obtained.
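The data structure described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_profile`, the gap symbol `-` and the toy fragments are assumptions, and PROMI itself is implemented in Perl and R.

```python
GAP = "-"  # assumed gap symbol; the text only says "one additional letter for gaps"


def build_profile(fragments, labels):
    """Return (columns, y) for m aligned fragments of equal length n.

    columns[i] is the m-dimensional vector X_i (one character per fragment),
    y is the m-dimensional vector of group labels, row-aligned with X.
    """
    if not fragments:
        raise ValueError("at least one fragment is required")
    if len(fragments) != len(labels):
        raise ValueError("exactly one group label per fragment is required")
    n = len(fragments[0])
    if any(len(fragment) != n for fragment in fragments):
        raise ValueError("all fragments must have the motif length n")
    columns = [[fragment[i] for fragment in fragments] for i in range(n)]
    return columns, list(labels)


# Toy alignment: m = 4 fragments, n = 4 positions, k = 2 groups.
fragments = ["CAKC", "CAKC", "CSRC", "C" + GAP + "RC"]
labels = ["human", "human", "yeast", "yeast"]
columns, y = build_profile(fragments, labels)
```

Here `columns[1]` holds the second motif position across all four fragments, including the gap character of the last fragment, and `y` carries the matching group label for every row.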
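The calculation of an MI-profile can be sketched as follows. Several details are assumptions, because the text does not fix them: the base-2 logarithm, the gap symbol `-`, and in particular the gap factor, which is taken here to be the fraction of non-gap characters in a column (in the spirit of how C4.5 scales the gain by the fraction of known values). The function names are hypothetical.

```python
import math
from collections import Counter

GAP = "-"  # assumed gap symbol


def entropy(values):
    """Shannon entropy (in bits) from the relative frequencies of the values."""
    m = len(values)
    return -sum((c / m) * math.log2(c / m) for c in Counter(values).values())


def mutual_information(column, labels):
    """I(X_i, Y) = H(X_i) + H(Y) - H(X_i, Y)."""
    joint = list(zip(column, labels))  # "concatenate" character and group label
    return entropy(column) + entropy(labels) - entropy(joint)


def mi_profile(fragments, labels, gap_weighting=True):
    """MI value for every column of the alignment."""
    n = len(fragments[0])
    profile = []
    for i in range(n):
        column = [fragment[i] for fragment in fragments]
        mi = mutual_information(column, labels)
        if gap_weighting:
            # Assumption: weight by the fraction of non-gap characters.
            mi *= 1 - column.count(GAP) / len(column)
        profile.append(mi)
    return profile


fragments = ["CAKC", "CAKC", "CSRC", "CSRC"]
labels = ["human", "human", "yeast", "yeast"]
profile = mi_profile(fragments, labels)
```

In this toy alignment the second and third positions separate the two groups perfectly and reach 1 bit each, while the two conserved cysteine positions carry no information about the group.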
Distance matrix and phylogenetic tree
Given sequences affiliated to k groups, k(k-1)/2 unordered pairwise combinations of groups can be formed. Calculating an MI-profile for each of these combinations yields the so-called pairwise MI-profiles. As one potential measure, the (k × k)-dimensional distance matrix D is formed by calculating the mean over all columns of each pairwise MI-profile, as shown schematically in Figure 3. This mean describes the distance in amino acid conservation between two groups of sequences. In our study these groups are species-derived, but any other classification can be used. Given r and s as one possible combination of group labels, the entry
$$D_{r,s} = \frac{1}{n} \sum_{i=1}^{n} I(X_i, Y)_{r,s}$$
can be calculated, where the MI values are computed on (X, Y)_{r,s}, that is, on all sequence fragments belonging either to group r or to group s. Finally, the phylogenetic tree reflecting the distances can be constructed by applying hierarchical clustering (complete linkage clustering) to D.
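The construction of D and the subsequent clustering can be illustrated with a small, dependency-free Python sketch. It is a simplified stand-in, not the PROMI code: the gap weighting is omitted for brevity, the base-2 logarithm is assumed, and the naive complete-linkage loop as well as all function names are illustrative choices.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(values):
    m = len(values)
    return -sum((c / m) * math.log2(c / m) for c in Counter(values).values())


def mean_mi(fragments, labels):
    """Mean of the MI-profile over all n columns (gap weighting omitted)."""
    n = len(fragments[0])
    total = 0.0
    for i in range(n):
        column = [fragment[i] for fragment in fragments]
        joint = list(zip(column, labels))
        total += entropy(column) + entropy(labels) - entropy(joint)
    return total / n


def distance_matrix(fragments, labels):
    """Symmetric matrix D of mean pairwise MI, stored as nested dicts."""
    groups = sorted(set(labels))
    dist = {g: {h: 0.0 for h in groups} for g in groups}
    for r, s in combinations(groups, 2):
        # Keep only the rows belonging to group r or group s.
        rows = [(f, y) for f, y in zip(fragments, labels) if y in (r, s)]
        d = mean_mi([f for f, _ in rows], [y for _, y in rows])
        dist[r][s] = dist[s][r] = d
    return groups, dist


def complete_linkage(groups, dist):
    """Naive agglomerative clustering; returns (cluster_a, cluster_b, height)."""
    def linkage(a, b):
        return max(dist[x][y] for x in a for y in b)

    clusters = [(g,) for g in groups]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


fragments = ["CAKC", "CAKC", "CAKC", "CARC", "CSRC", "CSRC"]
labels = ["human", "human", "mouse", "mouse", "yeast", "yeast"]
groups, dist = distance_matrix(fragments, labels)
merges = complete_linkage(groups, dist)
```

In this toy data set the human and mouse groups differ at only one position in one fragment, so they have the smallest mean MI and are merged first; the yeast group joins last. The list of merges with their heights is the information a dendrogram would display.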
Availability and requirements
Project name: PROMI
Project home page: http://promi.mpimp-golm.mpg.de/
Operating system: server side Linux; client side platform independent
Programming language: Perl, R
Other requirements: EXPASY ScanProsite tool, R, Bioperl, RSPerl, RSvgDevice, Berkeley weblogo software
License: GNU GPL
Any restrictions to use by non-academics: no
The source code of the web service script and the R script is available upon request. The other software components used are available from their respective sources.
References
- Weckwerth W, Selbig J: Scoring and identifying organism-specific functional patterns and putative phosphorylation sites in protein sequences using mutual information. Biochemical & Biophysical Research Communications 2003, 307: 516–521.
- Beerenwinkel N, Schmidt B, Walter H, Kaiser R, Lengauer T, Hoffmann D, Korn K, Selbig J: Diversity and complexity of HIV-1 drug resistance: A bioinformatics approach to predicting phenotype from genotype. Proceedings of the National Academy of Sciences of the United States of America 2002, 99(12):8271–8276.
- Hannenhalli SS, Russell RB: Analysis and prediction of functional sub-types from protein sequence alignments. Journal of Molecular Biology 2000, 303(1):61–76.
- Mirny LA, Gelfand MS: Using orthologous and paralogous proteins to identify specificity-determining residues in bacterial transcription factors. Journal of Molecular Biology 2002, 321(1):7–20.
- Weisser D, Klein-Seetharaman J: Identification of fundamental building blocks in protein sequences using statistical association measures. Proceedings of the 2004 ACM symposium on Applied computing 2004, 154–161.
- Hartley RVL: Transmission of Information. The Bell System Technical Journal 1928, 3: 535–564.
- Shannon CE: A Mathematical Theory of Communication. The Bell System Technical Journal 1948, 27: 379–423, 623–656.
- EXPASY Scan PROSITE Tool [ftp://us.expasy.org/databases/prosite/tools/ps_scan]
- NCBI non-redundant protein database [ftp://ftp.ncbi.nih.gov/blast/db/nr.tar.gz]
- R-Project Home page [http://www.r-project.org/]
- RSPerl Home page [http://www.omegahat.org/RSPerl/]
- RSvgDevice Home page [http://www.darkridge.com/~jake/RSvg/]
- Adobe SVG-Viewer Download page [http://www.adobe.com/svg/viewer/install/]
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka ED, Wilkinson M, Birney E: The Bioperl Toolkit: Perl modules for the life sciences. Genome Research 2002, 12(10):1611–8. [http://www.bioperl.org/]
- Berkeley Weblogo Tool Home page [http://weblogo.berkeley.edu/]
- Wolfe SA, Nekludova L, Pabo CO: DNA recognition by Cys2His2 zinc finger proteins. Annu Rev Biophys Biomol Struct 1999, 3: 183–212.
- Kraskov A, Stögbauer H, Andrzejak RG, Grassberger P: Hierarchical Clustering Based on Mutual Information. 2003. [http://arxiv.org/abs/q-bio/0311039]
- Quinlan JR: C4.5: Programs for Machine Learning. Morgan Kaufmann 1993, 27–29.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.