ProfNet, a method to derive profile-profile alignment scoring functions that improves the alignments of distantly related proteins
© Ohlson and Elofsson; licensee BioMed Central Ltd. 2005
Received: 09 May 2005
Accepted: 14 October 2005
Published: 14 October 2005
Profile-profile methods have been used for some years now to detect and align homologous proteins. The best such methods use information from the background distribution of amino acids and substitution tables either when constructing the profiles or in the scoring. This makes the methods dependent on the quality and choice of substitution table as well as the construction of the profiles.
Here, we introduce a novel method called ProfNet that is used to derive a profile-profile scoring function.
The method optimizes the discrimination between scores of related and unrelated residues and it is fast and straightforward to use. This new method derives a scoring function that is mainly dependent on the actual alignment of residues from a training set, and it does not use any additional information about the background distribution.
It is shown that ProfNet improves the discrimination of related and unrelated residues. Further it can be used to improve the alignment of distantly related proteins.
The best performance is obtained using superfamily related proteins in the training of ProfNet, and a classifier that is related to the distance between the structurally aligned residues. The main difference between the new scoring function and a traditional profile-profile scoring function is that conserved residues on average score higher with the new function.
Alignment of proteins is one of the fundamental methods in bioinformatics. Alignments are used to detect homology and to study evolutionary events. The ability to align distantly related proteins can be improved significantly by the inclusion of evolutionary information [1, 2] or predicted features. Although significant improvements in alignment quality have been seen recently in CASP, it is not clear how much of the improved performance is due to better alignment methodologies and how much is due to the increased number of sequences and structures that can be used to span the distance between a query protein and a target structure. However, in a recent study we have shown that the average alignment quality, as measured by MaxSub, improved by 10% at the family level and 50% at the superfamily level by the use of profile-profile scoring instead of sequence-profile scoring. These findings are comparable to the ones found in a number of recent studies [7–9].
Profile-profile alignments can be implemented in several different ways [10–16]. The fundamental difference between profile-profile alignment methods lies in how they calculate the score between two profile vectors. A profile, as defined in this study, can be seen as a set of vectors where each vector contains the frequency of each amino acid at a particular position in a multiple sequence alignment. In traditional sequence-profile alignments the score is calculated by extracting (the log of) the probability for an amino acid in this vector. However, in profile-profile alignments it is necessary to compare two vectors, and this can be done in several different ways, including calculating the sum of pairs, the dot product or a correlation coefficient between the two vectors. In addition, information about the background frequency and substitution probabilities can be used. Although it has been shown that profile-profile methods using a probabilistic model seem to be superior to other methods [6, 8], it is quite likely that better profile-profile scoring functions could be developed.
Here we present ProfNet, a method to develop novel profile-profile scoring functions. ProfNet is based on the ability to separate related from unrelated residue pairs, and it uses an artificial neural network (ANN) trained to identify pairs of residues from structurally aligned proteins. We show that ProfNet provides significantly better identification of related residues than prob_score and that it can also be used to provide a slight improvement of the alignment of distantly related proteins. Another advantage of this approach is that it makes it trivial to add additional information to the scoring function.
It could be expected that a good profile-profile scoring function should provide high scores if two profile vectors have similar amino acid distributions that differ from the background distribution. In addition, the score should include information about which amino acids are more likely to be exchanged with each other. In an earlier study we found that one profile-profile scoring method, prob_score, performed these tasks quite well. However, it is quite possible that a better function could be found. In order to develop such a function we have developed the method ProfNet, which separates residue pairs that should and should not be aligned. Here, it is assumed that residue pairs aligned in a structural alignment should also be aligned by the profile-profile scoring function, while residue pairs belonging to proteins from different folds should not be aligned at all. Therefore, the scoring function was trained to identify pairs of structurally aligned residues. Finally, the ability to correctly align protein pairs using this novel scoring function was tested.
Identification of related residues
MCC-values and the corresponding Z-scores for prob_score and ProfNet versions trained on different datasets. The ProfNet versions were trained on profile vector pairs from unrelated proteins and protein positions related at family (ProfNet_fam), superfamily (ProfNet_su), fold (ProfNet_fold), and all SCOP levels (family, superfamily and fold) (ProfNet_all). The training of ProfNet_S was done using superfamily related profile vector pairs as positive examples, and classified by the S-score instead of the binary classifiers used in the other cases. The results are shown for protein pairs related on family, superfamily and fold level. The best results are shown in bold.
In the above tests, all aligned positions were treated equally. However, some of the aligned positions in the structural alignment are certainly more closely aligned than others. Therefore, we also used a continuous function related to the distance between the two residues after the structural superposition. To measure the distance between two residues we used the S-score. This ProfNet version, called ProfNet_S, performed quite well at the superfamily and fold levels, but did not distinguish the family related residues optimally. The Z-score for the fold related scores shows an improvement over prob_score of 60%, and all the curves show a Gaussian-like distribution; see Table 1 and Figure 1.
Alignment quality results for prob_score and the ProfNet versions trained on different datasets. The ProfNet versions were trained on profile vector pairs from unrelated proteins and protein positions related at family (ProfNet_fam), superfamily (ProfNet_su), fold (ProfNet_fold), and all SCOP levels (ProfNet_all). The training of ProfNet_S was done using superfamily related profile vector pairs as positive examples, and classified by the S-score instead of the binary classifiers used in the other cases. The average MaxSub scores are listed for test sets with proteins related at the family, superfamily or fold levels. The best results are shown in bold.
Unfortunately, we did not see any significant improvement in the ability to detect related proteins using these novel scoring functions. The failure to increase fold recognition indicates that there is still work to do to find the optimal profile-profile scoring function. Quite likely, the construction of the negative training set was not done optimally.
Both prob_score and ProfNet provide a score for two profile vectors, that should be related to the similarity between the two (profile) positions. In the following sections we will compare these two functions, where ProfNet_S is used as a representative of the ProfNet method.
Average Z-scores for prob_score and ProfNet_S for the scores for different types of conserved residue pairs. ProfNet_S was trained on superfamily related profile vector pairs and using the S-score as a classifier. The pairs are grouped into pairs with identical residues, positive, zero and negative BLOSUM scores. Finally, the Z-scores for a pair containing one conserved and one non-conserved residue and two non-conserved residues are shown. The highest scores are shown in bold.
Correlation between different substitution tables and the profile-profile scores. Three tables, BLOSUM62, GONNET and JTT, are derived from sequence alignments while SDM is a structure based table. Three tables derived from ProfNet are also included, using the all (trained on data from all SCOP levels), su (trained on superfamily related data) and S-score (trained on superfamily related data and using the S-score as a classifier instead of a binary classifier) versions. STRUCTAL is a substitution table constructed from the residue matches found in the structurally aligned superfamily-related training set. The highest correlations are shown in bold.
The scores from the substitution tables GONNET, JTT and SDM were also compared with the ProfNet derived substitution table, see Table 4. The first two substitution tables are based on sequence alignments, while SDM is a structurally derived substitution table, i.e. based on structural alignments. Overall, prob_score showed a higher correlation to the substitution tables than to ProfNet, and ProfNet showed higher correlation with prob_score than with the substitution tables. This shows that ProfNet captures some of the substitution table information and some of the conservation information used in prob_score. It can also be seen that prob_score and ProfNet show comparable correlation with a substitution table created directly from the structurally aligned superfamily-related dataset.
Here, we have only used the most obvious information from the profiles, i.e. the frequencies in the profile vectors for the development of the scoring function. One possible advantage of the ProfNet method is that it is easy to include other types of information, such as gap-information and predicted features, into the scoring functions.
A novel method, ProfNet, to derive a profile-profile scoring function is shown to improve the discrimination between related and unrelated residue pairs. Further, ProfNet can be used to marginally improve the alignment quality of proteins related at the fold level. One benefit of this method is that it is easy to use and fast to evaluate, while one drawback is that a good and well balanced training set has to be used, and it is slower than prob_score. When choosing the training set, it seems as if the family related set is too focused on sequence similarity, while the fold related training set on the other hand does not seem to include enough closely related pairs. The superfamily related training set could be seen as an intermediate, where the network will learn the features in the residue pairs that are essential when scoring unseen residue pairs. It was also found that using a binary classifier is not the best way to classify the training data; instead, some continuous classifier could be used. When using the superfamily related training data and ProfNet_S we see an improvement over prob_score of 31% in MCC (60% in Z-score) and 14% in average alignment quality for the fold related proteins. Interestingly, ProfNet clearly scores all conserved residues higher than prob_score does.
We used the log-odds profiles obtained after ten iterations of PSI-BLAST version 2.2.2, using an E-value cutoff of $10^{-3}$ and all other parameters at default settings. The search was performed against nrdb90 from EBI. The frequency profiles used in prob_score were back-calculated from the log-odds profiles obtained from PSI-BLAST, as described previously. The profiles used in ProfNet were created from the log-odds profiles using a simple transformation as in PSI-PRED, $\frac{1}{1 + e^{-x}}$, where x is the value from the log-odds profile. In this study a profile is a matrix of dimensions 20 × L, where L is the length of the target or query sequence. The term "profile vector", also known as "profile column", refers to a 20 × 1 dimensional vector with values corresponding to the occurrence of each amino acid, as calculated from the PSI-BLAST log-odds profiles, at a certain position in the profile.
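The transformation of a log-odds profile into ProfNet input can be sketched as below. The sketch assumes that the PSI-PRED-style transformation is the standard logistic function 1/(1 + e^(-x)); the function names and the list-of-columns layout are hypothetical.

```python
import math

def logistic(x):
    """Standard logistic function, mapping a log-odds value into (0, 1).

    PSI-BLAST log-odds values are small integers, so math.exp does not
    overflow here; very large negative inputs would need extra care.
    """
    return 1.0 / (1.0 + math.exp(-x))

def transform_profile(log_odds_profile):
    """Apply the logistic transformation to every value of a profile.

    The profile is given as a list of columns (profile vectors), each
    holding one log-odds value per amino acid (assumed layout).
    """
    return [[logistic(x) for x in column] for column in log_odds_profile]
```

A log-odds value of zero maps to 0.5, strongly negative values approach 0 and strongly positive values approach 1, so every input to the network lies in the same bounded range.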
Scoring of two profiles
The input to the ProfNet scoring function is two transformed profile vectors, see above. The score between two profiles was calculated by first filling the dynamic programming matrix using ProfNet as a scoring function. After the matrix is filled, standard dynamic programming is used, with affine gap penalties. The number of calculations for each cell in the dynamic programming matrix for ProfNet is hn × in, where hn = # hidden nodes in the ANN and in = # input nodes (= 2 × 20), typically 20 × 40. The number of calculations for each cell in the dynamic programming matrix for prob_score is 2 × r × (1 + x), where r = # residues in the alphabet (= 20), and x is the number of calculations for a logarithm, i.e. 2 × 20 × (1 + x). In our implementation ProfNet is almost three times slower than prob_score.
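The matrix-filling step described above can be sketched as a Smith-Waterman style local alignment with affine gap penalties. This is a minimal sketch under stated assumptions, not the Palign implementation: `score_fn` stands in for ProfNet or prob_score, a gap of length k is assumed to cost gap_open + (k - 1) × gap_extend, and `toy_score` and all names are hypothetical.

```python
def local_affine_score(profile_a, profile_b, score_fn,
                       shift, gap_open, gap_extend):
    """Best local alignment score between two profiles.

    score_fn(col_a, col_b) scores one pair of profile vectors; `shift`
    is added to every pair score.
    """
    n, m = len(profile_a), len(profile_b)
    neg_inf = float("-inf")
    # H: best score of a local alignment ending at cell (i, j).
    # E / F: best score ending at (i, j) with a gap in profile_a / profile_b.
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Open a new gap or extend an existing one.
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            pair = score_fn(profile_a[i - 1], profile_b[j - 1]) + shift
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0.0, H[i - 1][j - 1] + pair, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

def toy_score(a, b):
    """Stand-in scoring function for the example: +1 match, -1 mismatch."""
    return 1.0 if a == b else -1.0
```

In the usage below, plain strings stand in for profiles so that the result can be checked by hand. For example, `local_affine_score("AAAATTTT", "AAAACTTTT", toy_score, 0.0, 1.0, 0.1)` aligns the eight matching positions around a single one-residue gap.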
Mittelman et al. showed that probabilistic scoring functions are significantly better than other scoring functions, and Wang & Dunbrack (2004) stated that, with optimized gap penalties, most scoring functions behave similarly to one another in alignment accuracy. Taking all this into account, we chose to use the probabilistic scoring function prob_score instead of, for example, COMPASS or PICASSO3, since it was used in our previous study, where it was shown to be one of the best methods.
A subset of SCOP version 1.57, class a to e, where no two protein domains have more than 75% sequence identity was used in the training of the artificial neural networks. For the positive training examples, protein pairs were structurally aligned using STRUCTAL and all pairs of residues within 3 Å separation were used, while a set of negative training examples was created from randomly selected residue pairs from proteins of different folds. For the positive and negative data sets no more than 15 aligned positions from the same protein pair were used.
In an attempt to clarify which dataset to use in the ANN training we used five different datasets. The datasets consist of pairs of profile vectors corresponding to the aligned residues between protein pairs from the same family, superfamily (where no two proteins came from the same family), fold (where no two proteins came from the same superfamily), and a combination of family, superfamily and fold as positive examples, with randomly chosen vector pairs from unrelated protein positions as negative examples. The ANNs were trained to classify the profile vector pairs as related or unrelated (0 or 1). We also trained an ANN with the superfamily related set as positive examples, trained to classify the profile vector pair according to the S-score. The rmsd used in the S-score is calculated between the Cα atoms of the aligned residues.
The ratio between the positive and negative examples was not adjusted, instead all examples were used in the training as this was shown to produce the best alignment quality results for the superfamily related training set (data not shown). The size of the datasets ranges from 20 000 examples for the fold related and the negative dataset to 100 000 for the S-score trained examples.
Matthews correlation coefficient
When comparing how well a method can separate positive and negative examples, such as the scores for related and unrelated profile positions, the Matthews correlation coefficient (MCC) is a useful fitness measure. MCC takes into account both over-prediction and under-prediction, as well as imbalanced data sets. It is defined as $\mathrm{MCC} = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$. True positives (tp) are correctly predicted related scores, true negatives (tn) are correctly predicted unrelated scores, false negatives (fn) are related scores wrongly predicted as unrelated, and false positives (fp) are unrelated scores wrongly predicted as related. The MCC score lies in the interval [-1, 1], where one indicates a perfect separation and zero is the expected value for random scores. Three subsets (family, superfamily, and fold level) of the SCOP version 1.57 dataset that were not used in the training were used to calculate the MCC-values for each method.
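The MCC calculation can be sketched as follows, using the standard definition of the coefficient. The function names and the score threshold used to turn scores into predictions are assumptions for illustration; the text does not state how the threshold was chosen.

```python
import math

def confusion_counts(scores, labels, threshold):
    """Count tp, tn, fp, fn for scores thresholded into predictions.

    labels: 1 for related and 0 for unrelated profile positions.
    A score at or above `threshold` is predicted as related.
    """
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_related = score >= threshold
        if predicted_related and label == 1:
            tp += 1
        elif predicted_related and label == 0:
            fp += 1
        elif not predicted_related and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when any marginal count is zero
    return (tp * tn - fp * fn) / denom
```

A perfect separation gives 1, a completely inverted prediction gives -1, and equal counts in all four cells give 0.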
Artificial neural network training
The artificial neural networks (ANNs) were trained on 80% of the dataset, where a protein is only present in either the training or the test set. The neural network package Netlab in MatLab was used for the ANN training [31, 32]. A linear activation function was chosen, and the training was carried out using the scaled gradient algorithm. Given two residues that should be aligned according to the training data, the ANN functions extracted their respective residue vectors from the transformed PSI-BLAST profiles, see above. The training of the ANNs was done using a grid search over the number of hidden nodes and number of training cycles. After the initial grid search, the search procedure was tuned to the area that produced the best results. At least 49 sets of parameters were tested for each ANN. The ANN-based scoring functions were chosen by selecting the ANN with the highest MCC-value and the minimum number of training cycles and hidden nodes. In the next step the ANNs were used for the alignment quality test. The ProfNet scoring functions were implemented in the Palign [1, 33] package.
In summary, the ANNs were trained to identify related and unrelated profile vectors. The ANNs use two transformed profile vectors, as described above, as input. The network should output a high score if two vectors are related and a low score otherwise. The ANNs that use a binary classifier output values in the range (-0.6, 1.7), and the ANN that uses the continuous S-score as a classifier outputs scores in the range (0, 1). In a sense, the network is trying to find a function that best explains and correlates the training examples, i.e. the 40 numbers from the two profile vectors, with their output values. For the S-score trained network, the classification of the training examples is related to the rmsd between the Cα atoms of the two aligned residues. With this strategy, the ANN is trained to predict the distance between the two residues, and hence whether they should be aligned or not.
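How a trained network turns two profile vectors into a score can be sketched as a single forward pass. This is a hedged sketch, not the Netlab code: it assumes one hidden layer with tanh units and a linear output unit, which is one plausible reading of the network described here, and the weights passed in are placeholders rather than trained ProfNet parameters.

```python
import math

def ann_score(vec_a, vec_b, w_hidden, b_hidden, w_out, b_out):
    """Score two transformed profile vectors with a one-hidden-layer network.

    vec_a, vec_b: the two 20-value profile vectors (40 inputs in total).
    w_hidden: one list of 40 weights per hidden node (assumed layout).
    b_hidden: one bias per hidden node.
    w_out, b_out: weights and bias of the single linear output unit.
    """
    inputs = list(vec_a) + list(vec_b)
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out
```

With 20 hidden nodes the hidden layer needs 20 × 40 multiplications per pair of positions, which matches the per-cell cost quoted for ProfNet in the scoring section.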
This dataset was also constructed from the same subset of SCOP version 1.57, class a to e, where no two protein domains have more than 75% sequence identity. From this dataset we included no more than 5 proteins from the same superfamily and no more than one model per domain target. In total, we used 799 family, 672 superfamily and 602 fold related protein pairs. Among the superfamily related proteins, no proteins from the same family were included, and among the fold related proteins, no proteins from the same superfamily were included.
Throughout this study, only local alignments were used. For each alignment we created a model of the query protein and compared the structure of this model with the correct structure. We used MaxSub, which finds the largest subset of Cα atoms of a model that superimpose well over the experimental model. We only report the MaxSub score because we noted in our earlier study that the results obtained using other methods, such as LGscore, were almost identical. The parameters for the best MaxSub scores on superfamily and fold level are not always the same; therefore we show the results for a choice of parameters with MaxSub scores reasonably high at all levels.
The gap and shift parameters have to be optimized to get good performance in the alignment quality test. The shift value is added to the score so that the average score is negative, and the gap-opening (GO) and gap-extension (GE) penalties are used to penalize the inclusion of a gap in the sequence. The gap parameters in the alignment quality test were optimized with the constraint that the gap-extension penalty should be 5 or 10% of the gap-opening penalty. Ideally, other ratios should be tried as well, but as this would take too long and we found reasonably good results using the GO/GE ratios described above, we did not spend more time on the optimization. In addition, this ratio between GO and GE has been seen to perform well in many other scoring schemes, such as PSI-BLAST and prob_score. By using this rule we only had to search two two-dimensional parameter landscapes. We searched a grid of GO = (0.1, 0.2, ..., 1.5) and shift = (-0.5, -0.45, ..., 1.5) for the ProfNet methods, and GO = (0.2, 0.3, ..., 3.5) and shift = (-0.5, -0.45, ..., 1.5) for prob_score. The set of parameters with the best MaxSub score was then chosen.
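The parameter search described above can be sketched as an exhaustive grid search. This is a sketch under assumptions: the function names are hypothetical, and `evaluate` stands in for a routine that would align the test set with the given parameters and return the average MaxSub score.

```python
def frange(start, stop, step):
    """Inclusive list of evenly spaced values, rounded to avoid float drift."""
    n = int(round((stop - start) / step))
    return [round(start + k * step, 10) for k in range(n + 1)]

def parameter_grid(go_values, shift_values, ge_ratios=(0.05, 0.10)):
    """Yield every (GO, GE, shift) combination with GE tied to GO.

    One two-dimensional GO x shift landscape is produced per GE/GO ratio.
    """
    for ratio in ge_ratios:
        for go in go_values:
            for shift in shift_values:
                yield {"gap_open": go,
                       "gap_extend": round(go * ratio, 10),
                       "shift": shift}

def best_parameters(grid, evaluate):
    """Return the parameter set with the highest evaluation score."""
    return max(grid, key=evaluate)

# Grids quoted in the text for the ProfNet methods and for prob_score.
profnet_grid = list(parameter_grid(frange(0.1, 1.5, 0.1),
                                   frange(-0.5, 1.5, 0.05)))
prob_score_grid = list(parameter_grid(frange(0.2, 3.5, 0.1),
                                      frange(-0.5, 1.5, 0.05)))
```

Tying GE to GO keeps the search two-dimensional: with the ranges above, the ProfNet search covers 2 × 15 × 41 parameter sets and the prob_score search 2 × 34 × 41.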
In the two ROC-plots, the error rate is plotted against the sensitivity (= tp/(tp + fn)). In Figure 2 the error rate and sensitivity were calculated from scores of related and unrelated profile positions, i.e. from the MCC analysis data. In Figure 3, the alignment quality was calculated for the superfamily related set that was used in the alignment quality test and a negative dataset. The negative dataset consists of 1000 unrelated protein pairs from SCOP version 1.57, class a to e, where no two protein domains have more than 75% sequence identity.
In the comparison of ProfNet and prob_score, a conserved residue was defined as a residue with "frequency" (calculated from the converted PSI-BLAST profiles) above a certain cutoff in the profile frequency vector. In a non-conserved vector, no residue has a frequency above 0.10. To analyze how the methods score conserved residues, the average score was calculated between the conserved residues related at superfamily level from the test set used in the MCC test. To make the comparison more straightforward, the scores were transformed into Z-scores according to Z-score(x) = (x - μ)/σ, where μ is the average score over many randomly chosen examples and σ is the standard deviation. From these scores, substitution table-like matrices were derived for the methods. All different ProfNet versions produced the same outliers (data not shown).
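The Z-score transformation and the conservation test can be sketched as below. The sketch assumes the population standard deviation for σ, since the text does not say which estimator was used; the conserved cutoff is left as a parameter because its value is not given, and the function names are hypothetical.

```python
import statistics

def z_score(x, background_scores):
    """Z-score of x relative to scores of randomly chosen examples."""
    mu = statistics.mean(background_scores)
    sigma = statistics.pstdev(background_scores)  # population std (assumption)
    return (x - mu) / sigma

def is_conserved(freq_vector, cutoff):
    """True if some residue frequency exceeds the chosen cutoff."""
    return max(freq_vector) > cutoff

def is_non_conserved(freq_vector, limit=0.10):
    """True if no residue has a frequency above 0.10."""
    return max(freq_vector) <= limit
```

Expressing both scoring functions as Z-scores puts them on a common scale, which is what allows the average scores for conserved and non-conserved pairs to be compared directly between prob_score and ProfNet.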
This work was supported by grants from the Swedish Natural Sciences Research Council. We wish to thank Bob MacCallum and Björn Wallner for valuable support and discussions.
- Elofsson A: A study on how to best align protein sequences. Proteins 2002, 15(3):330–339. doi:10.1002/prot.10043
- Wallner B, Fang H, Ohlson T, Frey-Skött J, Elofsson A: Using evolutionary information for the query and target improves fold recognition. Proteins 2004, 54(2):342–350. doi:10.1002/prot.10565
- Rost B, Sander C: Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 1993, 232: 584–599. doi:10.1006/jmbi.1993.1413
- Moult J, Hubbard T, Bryant SH, Fidelis K, Pedersen JT: Critical assessment of methods of protein structure prediction (CASP): Round II. Proteins (Suppl) 1997, 1: 2–6. doi:10.1002/(SICI)1097-0134(1997)1+<2::AID-PROT2>3.0.CO;2-T
- Siew N, Elofsson A, Rychlewski L, Fischer D: MaxSub: An automated measure to assess the quality of protein structure predictions. Bioinformatics 2000, 16(9):776–785. doi:10.1093/bioinformatics/16.9.776
- Ohlson T, Wallner B, Elofsson A: Profile-profile methods provide improved fold-recognition: A study of different profile-profile alignment methods. Proteins 2004, 57: 188–197. doi:10.1002/prot.20184
- Marti-Renom M, Madhusudhan M, Sali A: Alignment of protein sequences by their profiles. Protein Sci 2004, 13: 1071–1087. doi:10.1110/ps.03379804
- Wang G, Dunbrack RL: Scoring profile-to-profile sequence alignments. Protein Sci 2004, 13(6):1612–1626. doi:10.1110/ps.03601504
- Edgar R, Sjolander K: A comparison of scoring functions for protein sequence profile alignment. Bioinformatics 2004, 20(8):1301–1308. doi:10.1093/bioinformatics/bth090
- Fischer D: Hybrid Fold Recognition: Combining sequence derived properties with evolutionary information. In Pacific Symposium on Biocomputing. Volume 5. Edited by: Altman R, Dunker A, Hunter L, Klien T. World Scientific; 2000:116–127.
- Rychlewski L, Jaroszewski L, Li W, Godzik A: Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci 2000, 9(2):232–241.
- von Öhsen N, Sommer I, Zimmer R: Profile-profile alignments: a powerful tool for protein structure prediction. In Pacific Symposium on Biocomputing. Edited by: Altman RB, Dunker AK, Hunter L, Jung TA, Klein TE. 2003, 252–263.
- Yona G, Levitt M: Within the twilight zone: A sensitive profile-profile comparison tool based on information theory. J Mol Biol 2002, 315: 1257–1275. doi:10.1006/jmbi.2001.5293
- Sadreyev R, Grishin N: COMPASS: A Tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 2003, 326: 317–336. doi:10.1016/S0022-2836(02)01371-2
- Edgar R, Sjolander K: SATCHMO: sequence alignment and tree construction using hidden Markov models. Bioinformatics 2003, 19(22):1404–1411. doi:10.1093/bioinformatics/btg158
- Pei J, Sadreyev R, Grishin NV: PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics 2003, 19: 427–428. doi:10.1093/bioinformatics/btg008
- Mittelman D, Sadreyev R, Grishin N: Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments. Bioinformatics 2003, 19: 1531–1539. doi:10.1093/bioinformatics/btg185
- Cristobal S, Zemla A, Fischer D, Rychlewski L, Elofsson A: A study of quality measures for protein threading models. BMC Bioinformatics 2001, 2(5).
- Edgar R: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5(32).
- Tress M, Jones D, Valencia A: Predicting reliable regions in protein alignments from sequence profiles. J Mol Biol 2003, 330(4):705–718. doi:10.1016/S0022-2836(03)00622-3
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci 1992, 10915–10919.
- Gonnet G, Cohen M, Benner S: Exhaustive matching of the entire protein sequence database. Science 1992, 257(5077):1609–1610.
- Jones D, Taylor W, Thornton J: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 1992, 8(3):275–282.
- Prlic A, Domingues F, Sippl M: Structure derived substitution matrices for alignment of distantly related sequences. Protein Engineering 2000, 13(8).
- Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. doi:10.1093/nar/25.17.3389
- Holm L, Sander C: Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 1998, 14: 423–429. doi:10.1093/bioinformatics/14.5.423
- Jones D: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292(2):195–202. doi:10.1006/jmbi.1999.3091
- Murzin A, Brenner S, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540. doi:10.1006/jmbi.1995.0159
- Gerstein M, Levitt M: Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins. Protein Sci 1998, 7: 445–456.
- Matthews B: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 1975, 405: 442–451.
- Bishop CM: Neural Networks for Pattern Recognition. Great Clarendon St, Oxford OX2 6DP, UK: Oxford University Press; 1995.
- Nabney I, Bishop C: NetLab: Netlab neural network software. 1995. [http://www.ncrg.aston.ac.uk/netlab/]
- Elofsson A, Ohlson T: palign. [http://www.sbc.su.se/~arne/palign/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.