Enhanced protein domain discovery using taxonomy
© Coin et al 2004
Received: 05 February 2004
Accepted: 11 May 2004
Published: 11 May 2004
Skip to main content
© Coin et al 2004
Received: 05 February 2004
Accepted: 11 May 2004
Published: 11 May 2004
It is well known that different species have different protein domain repertoires, and indeed that some protein domains are kingdom specific. This information has not yet been incorporated into statistical methods for finding domains in sequences of amino acids.
We show that by incorporating our understanding of the taxonomic distribution of specific protein domains, we can enhance domain recognition in protein sequences. We identify 4447 new instances of Pfam domains in the SP-TREMBL database using this technique, equivalent to the coverage increase given by the last 8.3% of Pfam families and to a 0.7% increase in the number of domain predictions. We use PSI-BLAST to cross-validate our new predictions. We also benchmark our approach using a SCOP test set of proteins of known structure, and demonstrate improvements relative to standard Hidden Markov model techniques.
Explicitly including knowledge about the taxonomic distribution of protein domains can enhance protein domain recognition. Our method can also incorporate other context-specific domain distributions – such as domain co-occurrence and protein localisation.
Protein domains are the structural, functional and evolutionary units of proteins. Several statistical techniques are currently used for detecting protein domains. In particular, Profile hidden Markov models (profile HMMs) have been successfully applied to this problem [1, 2], and form the basis for databases such as Pfam . Profile HMMs can be more sensitive than methods which look for pairwise homology . Our ability to detect distant homology is limited by noise. This is due to the divergence of the amino acid sequence too far away from the profile to detect the similarity, despite the preservation of structure and function. We attempt to take into account extra information concerning the patterns of occurrence of domains in order to recognize distant homology. We have previously discovered that using the probabilities of domains occurring together in a sequence as contextual information significantly enhances domain detection . In this paper we investigate using the species distribution of domains to enhance detection.
For each sequence fragment A from a species S, our approach is to annotate the sequence as domain D if the probability P(D|A,S) is sufficiently high. This probability can be split (using Bayes' rule) into an amino-acid based and species based term:
Taking logs and approximating P(A|D,S) by P(A|D), we obtain
In the Pfam annotation , a domain D annotating the sequence fragment A is recognized as real if the domain log-odds ratio is greater than a manually curated threshold,
The log-odds ratio is calculated using the HMMER package . In this formulation P(A|D) represents the probability that the profile HMM representing D generated the sequence A. The term P(A|R(D)) represents the probability that the sequence was generated independently residue by residue according to a baseline composition model. The composition model is derived from the domain model by looking at the average composition of sequences generated by the model. This removes spurious matches based on composition alone.
In eq. 2 we replace the species composition model P(A|S) with a domain specific compositional model P(A|R(D)). By comparing eqs. 2 and 3, we see that the taxonomic adjustment is the right-hand bracket of eq. 2, and so our taxonomic score is:
The procedure we used to estimate the right-hand adjustment term is described in detail in the Methods section.
SCOP is a database which classifies all proteins of known structure . SCOP classifies protein domains: multi-domain proteins are split into component protein domains which are classified hierarchically in four levels: family, superfamily, fold and class. Sequences belonging to the same family share sequence similarity, suggesting a common function and implying a clear common evolutionary origin; families are clustered into superfamilies on the basis of structural similarity, suggesting a probable common evolutionary origin; superfamilies are grouped into folds on the basis of similar secondary structure topology. ASTRAL is a database of protein sequences of known structure, annotated with SCOP family classifications . ASTRAL provides protein sequences filtered to various levels of sequence similarity. Our test set consisted of ASTRAL sequences filtered so that no pair of sequences has more than 40% identity.
Pfam is a database of multiple sequence alignments and hidden Markov models . Pfam classifies all sequences in SWISS-PROT/TrEMBL on the basis of sequence similarity. This study focuses on extending homology detection of Pfam models at the SCOP superfamily level, as this represents a 'hard' test of homology recognition. Our set of test models consists of 869 Pfam families each of which overlap one and only one SCOP superfamily. We classify proteins in the same superfamily as homologous. We classify proteins which are in different folds as non-homologous. Proteins which are in the same fold but different superfamily are not classified.
We apply our method for detecting protein domains to the Pfam database. We consider whether we can improve the coverage of existing Pfam models by incorporating taxonomic information. In contrast with the SCOP/ASTRAL benchmark, we use manually curated thresholds from Pfam, rather than e-value thresholds, as we wish to compare our new annotation with the existing Pfam annotation. Our rationale is that the Pfam annotation is based on these thresholds, and we wish to evaluate the effect of taxonomically adjusting log-odds scores independently of modifying the thresholding technique. By rescanning SWISS-PROT/TrEBML, we found 4447 extra occurrences of Pfam families in proteins from SWISS-PROT/TrEMBL in the Pfam database, with sequence coverage equivalent to the last 8.3% of Pfam families (401 of 4,832 families in release 7.7). This corresponds to a 0.7% increase in the number of domain predictions. The new occurrences are limited to 461 Pfam families, of which 242 families contribute 95% of new hits.
We find a novel antistasin domain on the theromin protein (THBI_THETS) in Theromyzon tessulatum, a leech. This protein has important medical applications as a potent thrombin inhibitor, and is found in the head of the leech . The antistasin family is an inhibitor of trypsin-family proteases and is often found in anti-coagulants. Again we find that the function of the protein concurs with the novel domain occurrence. We also find a novel occurrence of the toxin_2, or scorpion short toxin domain on the ErgToxin protein (Q9GQ92) in Centuroides noxius (Mexican scorpion). The ErgToxin protein blocks the ERG-K+-channels of nerve, heart and endocrine cells . Other members of the toxin_2 family also inhibit potassium channels.
Finally, in the fertilization 18 kda protein (Q25063) in Haliotis fulgens (Green Abalone), we identify a novel Egg_lysin domain. Egg_lysin is found in other Haliotidae, as well as other Archaeogastropoda. The 18 kda fertilization protein acts in conjunction with a paralogous 16 kda lysin protein on the egg vitelline envelope. The 16 kda protein creates a hole in the vitelline envelope. The 18 kda protein is a potent fusagen of liposomes, and is thought to mediate membrane fusion between the gametes, a step in gamete recognition which is important in restricting heterospecific fertilization in other species . These authors also found very high divergence amongst the group of orthologous 18 kda proteins in California abalone; together with a high frequency of non-synonymous to synonymous substitution, indicating a high selective pressure toward differentiation between species and thus furthering the gamete recognition hypothesis. Furthermore, the 18 kda protein exhibits a rate of evolution 2–3× that of the 16 kda protein. The 18 kda protein in Haliotis fulgens is the most distantly related of this group (with 27%–34% identity to the others), and hence standard profile methods fail to detect the similarity. We see that the taxonomic score adjustment helps to correctly identify a distantly related domain.
Blast Results For New Positives Predicted By Model.
PSI-BLAST does not find match in Pfam Family
Majority of matches to correct Pfam family
Majority of matches to incorrect family
Has 1 match to correct family
Has matches to related family
All matches to unrelated families
We compare these results with our previous results detected using a variable order Markov model to detect domain co-occurrence . We reported there 15,263 new domain occurrences, equivalent to the last 15.6% of Pfam families. Furthermore, we find that of the two sets of new domain occurrences, 3803 of the new occurrences occur in both. This suggests that domain co-occurrence and taxonomic distribution often reinforce each other. The new domain occurrences from the paper are available via the Pfam website, and we will incorporate the incremental new detections of the method of this paper into Pfam in a similar manner.
We have demonstrated that taxonomic distribution can be used to enhance protein domain detection. Furthermore, we have found several examples in which the increased predictive power has discovered domains which are biologically important. From a theoretical point of view, this method is significant in that it provides a way of evaluating in a probabilistic fashion the appropriate trade-off between amino-acid signal strength and species information. Lastly, from a pragmatic perspective, the method increases sequence coverage. The taxonomic adjustment is a general technique which can be applied to other similarity searches, including BLAST and PSI-BLAST .
We can also see this more broadly as an example of a general method which can be used to integrate contextual information into a similarity detection algorithm. In particular, we have already used similar techniques to integrate information pertaining to domain co-occurrence. One line of future development is to use both types of information together. Another possible form of contextual information is protein localisation. There are, however, challenges in deciding the best way to integrate this information into a single model, and this remains a basis for ongoing research.
The first term in eq. 4 is pre-calculated using HMMER. We keep all those hits which have HMMER e-value less than 1000 to search with the taxonomic adjustment.
The probability P(D) is estimated from currently annotated domains in Pfam:
where N(D) represents the count of domain D in the database, and N tot the total number of domains in the database.
We also need to estimate the probabilities for each domain conditional on each species, P(D|S), which is complicated by sparsity of the data set. To combat data sparsity, and to avoid constructing thresholds which preclude the possibility of observing certain species-domain pairs, we recursively interpolate frequencies of taxa and domain combinations along a guide species tree. For i = 0,..., m - 1, where S 0 = S is the species and S i is the i th parent taxon, we write
We denote by S m the kingdom, and write
The parameter α represents the degree to which the estimation is based on nodes higher up in the taxonomy, rather than the leaves. We have found that the choice α = 0.5 works well in practice. Note that this strategy is a smoothing strategy which recursively interpolates counts of species which are similar according to the taxonomy. The taxonomy is taken from SWISS-PROT.
The protein database used is SWISS-PROT40 + TrEMBL18. Release 7.7 of the Pfam database was used both for training the model and for searching against the protein database. Release 7.7 contains 4,832 families, with matches to 74% proteins in SWISS-PROT40/TrEMBL18, and sequence coverage of 53%.
For the SCOP test, we used SCOP release 1.63 and the test set consisted of proteins from ASTRAL release 1.63 filtered to 40% identity, which consisted of 5226 proteins.
The Wellcome Trust Sanger Institute is supported by the Wellcome Trust. LC holds a Leslie Wilson Scholarship from Magdelene College Cambridge and a Cambridge Australia Trust studentship.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.