Blast sampling for structural and functional analyses
© Friedrich et al; licensee BioMed Central Ltd. 2007
Received: 08 August 2006
Accepted: 23 February 2007
Published: 23 February 2007
The post-genomic era is characterised by a torrent of biological information flooding the public databases. As a direct consequence, similarity searches starting with a single query sequence frequently lead to the identification of hundreds, or even thousands of potential homologues. The huge volume of data renders the subsequent structural, functional and evolutionary analyses very difficult. It is therefore essential to develop new strategies for efficient sampling of this large sequence space, in order to reduce the number of sequences to be processed. At the same time, it is important to retain the most pertinent sequences for structural and functional studies.
An exhaustive analysis on a large scale test set (284 protein families) was performed to compare the efficiency of four different sampling methods aimed at selecting the most pertinent sequences. These four methods sample the proteins detected by BlastP searches and can be divided into two categories: two customisable methods where the user defines either the maximal number or the percentage of sequences to be selected; two automatic methods in which the number of sequences selected is determined by the program. We focused our analysis on the potential information content of the sampled sets of sequences using multiple alignment of complete sequences as the main validation tool. The study considered two criteria: the total number of sequences in BlastP and their associated E-values. The subsequent analyses investigated the influence of the sampling methods on the E-value distributions, the sequence coverage, the final multiple alignment quality and the active site characterisation at various residue conservation thresholds as a function of these criteria.
The comparative analysis of the four sampling methods allows us to propose a suitable sampling strategy that significantly reduces the number of homologous sequences required for alignment, while at the same time maintaining the relevant information concerning the active site residues.
Recent developments in whole genome sequencing, assembly techniques and expressed sequence tag (EST) methods have led to a vast amount of sequence data flooding the protein and DNA databases. Over 390 complete genomes are now referenced on the GOLD web site with many others in the sequencing and assembly stages. In addition, the recent emergence of high throughput functional genomics techniques has increased the rate at which genome and sequence products are being functionally characterized. As a consequence, the majority of new sequences have homologues in the public databases, and new functional or structural data may often be inferred for the sequence under study using database mining.
Thus, sequence database mining and analysis have become essential first steps for a wide range of applications in molecular biology. One of the most widely used methods for detecting homologous sequences is Blast. The Blast suite of programs is used to find local sequence similarities, which might lead to evolutionary clues about the structure and/or function of the query sequence. The detected sequences can then be used e.g. to build a multiple alignment of complete sequences (MACS), which represents an ideal workbench to study all the information related to a set of homologous sequences. Indeed, by placing a sequence in the context of its overall family, the MACS permits not only a "horizontal" analysis of the sequence along its complete length, but also a "vertical" view of its evolution among different organisms. MACS are typically used to perform comparative analyses at the genome level, to define the phylogenetic relationships between organisms in evolutionary studies, to identify conserved functional residues, motifs or domains and to predict protein or RNA secondary and tertiary structures.
As a direct consequence of the recent database growth, Blast searches frequently lead to the identification of hundreds to thousands of potential homologues for a single query sequence. Dealing with so much data can be detrimental, not only in terms of computational and human analysis time, but also in terms of the accuracy and the significance of the results. Problems, such as sequencing or intron/exon prediction errors, redundancy or the presence of partial sequences, may represent a significant source of noise, depending on the biological question under study.
It is therefore essential to develop novel strategies to reduce the set of sequences to be processed at the earliest possible stage of an analysis, which is generally during the sequence database search. There are clearly two possibilities: an a priori reduction of the sequence database search space or an a posteriori sampling of the sequences detected by the database search program. Some types of studies have intrinsic a priori sequence filters; e.g. the construction of a phylogenetic distribution of proteins from complete genomes or the analysis of proteins belonging to specific clades. Another a priori strategy is the use of a pre-processed non-redundant database, where sequences are clustered by means of their percent identities, such as the UniRef database series (UniRef90, UniRef50). The a posteriori sampling methods are generally based on sequence similarity criteria and frequently require user intervention. For example, UniqueProt is a fast and simple method that reduces the redundancy of the dataset by removing over-represented sequences, based on a user-defined percent identity threshold. This method works reasonably well when the proteins have similar domain architectures. A similar strategy is incorporated in BLAST Filter, which generates smaller sequence sets by filtration of Blast results based on 15 distinct user-configurable rules requiring a complex pre-scanning of the Blast results. These methods are therefore not suitable for automatic, high-throughput projects. A more recent study describes a Monte-Carlo sequence selection strategy to improve the detection of residues belonging to a functional surface in the context of a multiple alignment of proteins of known structure. However the latter study samples sequences after the construction of the multiple alignment, which may incur a large time penalty. 
If possible, it is clearly advantageous in terms of processing time, to address the relationship between sequence sampling and the information content of the resulting alignment during the initial Blast search step.
In this study, we compare four methods for sampling the sequences detected by BlastP:
- two customisable methods with user-defined parameters that determine either the maximal number of sequences to be selected (the strips method, sm), or the percent reduction rate (the random method, rm);
- two automatic methods based only on the E-values calculated by BlastP: the mean method (mm) and the second derivative method (sdm). These methods automatically determine the number of sequences in the sampled set.
We focused our analysis on the potential information contained in the sampled sets of BlastP sequences, using the MACS as our main validation tool. Our analysis focused more precisely on the conservation of residues implicated in the active sites of 284 proteins with known and annotated 3D structure selected from the Protein Data Bank (PDB). A good sampling method should not alter the global quality of the resulting alignments, and should preserve the relevant structural and functional information, e.g. the conservation of active site residues. This analysis allows us to propose a suitable strategy to sample homologous sequences, while keeping the pertinent information in the associated MACS.
Results and discussion
For each protein of the 284 protein dataset:
- a BlastP search was performed in the UniRef90 database to identify the set of potential homologous sequences. Of the 284 BlastP searches, 36 detected more than 1000 sequences with E-value ≤ 0.001, which illustrates the necessity for new strategies that are capable of reducing the number of sequences to process in subsequent analyses;
- each sampling method (the mean method mm, the second derivative method sdm, the strips method sm and the random method rm) was independently applied to the set of detected sequences, resulting in a sampled sequence set containing a reduced number of sequences.
Five sets of sequences were thus associated with each initial protein: the unsampled set of sequences detected by BlastP and the 4 sampled sequence sets. These sets of sequences were then multiply aligned (when necessary, we limited the alignment to 500 sequences with the lowest E-values before or after sampling), resulting in five multiple alignments of complete sequences (MACS): respectively MACS_init containing the top 500 sequences detected by BlastP, MACS_mm, MACS_sdm, MACS_sm and MACS_rm.
The first part of the analysis studies the reduction rate associated with the different sampling methods, depending on the initial number of sequences in the BlastP results and their E-value distribution. We also studied the amount of sequence coverage between the different methods, in order to estimate their redundancy or complementarity. The second part then studies the effect of the sampling methods on the MACS information content by considering the quality of the MACS and the conservation of documented active site residues.
Large scale comparison of sampled sequence sets
We studied the behaviour of the different sampling methods for a large set of diverse BlastP searches (concerning 284 protein families). We analysed the effect of the BlastP results on the ability of the sampling methods to effectively reduce the number of sequences according to two criteria: the total number of sequences detected and their E-value distribution. We also compared the sequence coverage between the different sampling methods.
Sequence reduction rate
Table 1. Mean reduction rate (%) associated with the sampling methods, reported for the global dataset (284 seq.) and for three subsets:
- subset-100: 91 proteins for which BlastP detected fewer than 100 sequences
- subset100–500: 114 proteins for which BlastP detected between 100 and 500 sequences
- subset+500: 79 proteins for which BlastP detected more than 500 sequences
In summary, the reduction rate of mm depends on both the number of BlastP sequences and their E-value distribution. In contrast, sdm and sm depend mainly on the number of sequences. Nevertheless, sdm and sm behave differently in relation to the BlastP E-value distribution: the sdm reduction rate is relatively constant, while the sm reduction rate is much more variable.
Table 2. Sequence coverage between the sampling methods. The method in the first column is taken as the reference.
The sequence coverage rate was calculated by considering the number of sequences common to 2, 3 or 4 methods compared to one method chosen as a reference (left column in Table 2). A total of 1938 sequences were selected by all 4 methods, whereas 6721 sequences are common to mm, sm and sdm (the 1938 sequences selected by all 4 methods plus the 4783 sequences selected only by mm, sdm and sm). The difference between these 2 values can be explained by the numerous sequences selected only by rm. The mean coverage rates observed for mm and sdm are quite similar at around 75% (the 10076 sequences selected specifically by these two methods, plus the 6721 sequences mentioned above and the 3230 sequences selected by mm, sdm and rm). This similarity might be expected, since both methods are fully automatic and entirely based on the E-values.
We also noticed that the sequences common to mm, sdm and sm sampled sets are usually located at positions in the E-value distribution where large differences occur (see Additional file 1: Sequences selected in the case of the 1QJ4 protein). Thus, we conclude that mm, sm and sdm select mainly variable sequences which may supplement the structural and/or functional information of the sampled set of sequences.
Impact of the sampling on the potential information in the sequence set
To estimate the impact of the methods on the potential information in the sampled sets, we used multiple alignment of complete sequences (MACS) as the main tool. We analysed the diversity of the sequences included in the MACS, the global quality of the MACS and the extent to which active site residues were observed in conserved columns in the different MACS.
At the structural and functional level, closely related sequences may not add relevant information, whereas diversity is usually more informative. In the context of Blast searches, sequences detected with nearly the same E-values, especially in the case of low E-values, are more likely to be similar, and inversely, a difference in the E-values usually represents a sequence divergence.
As the information content of a set of homologous sequences is generally related to their diversity, we investigated the SDS (Sampled Distant Sequences), whose selection increases the diversity of the sampled MACS compared to the MACS_init (see Methods).
Table 3. Proportion of SDS selected by sampling methods (total Blast set (*); total number of sequences; number of SDS; SDS proportion, %).
Table 4. Proportion of good quality MACS (norMD ≥ 0.3, %) and mean norMD, reported for subset-100 (91 seq.), subset100–500 (114 seq.) and subset+500 (79 seq.).
In order to investigate the relationship between MACS quality and the number of sequences detected by BlastP, we also studied the quality of the MACS obtained in the three subsets of comparable size defined in the section Sequence reduction rate. For subset-100, 95% of the MACS_init can be considered to be of good quality, with a high mean norMD value (0.78) as shown in Table 4. Sampling the sequences using any of the 4 methods increases both the proportion of good quality MACS (from 98 to 100% compared to 95%) and the mean norMD (from 0.86 to 0.96 compared to 0.78). Similar results were observed for subset100–500, where the sampling methods again increased the proportion of good quality MACS (82 to 96% compared to 68% for MACS_init) and the mean norMD (0.55 to 0.63 compared to 0.51). For subset100–500, sm, the method that reduces the set of aligned sequences the most, results in a higher proportion of good quality MACS and a higher mean norMD. Furthermore, for subset+500, sm is the only sampling method able to improve the MACS global quality compared to MACS_init, both in terms of proportion (89% compared to 78%) and mean norMD value (0.53 compared to 0.49). It is important to note that the high proportion of SDS added in the context of subset+500 by the mm, sdm and rm methods corrupts the resulting alignments: the proportion of good quality MACS falls to 49, 43 and 37% respectively. Increasing the sequence diversity in a MACS should normally improve the information content, but including too many distant sequences can also be harmful in terms of quality, so that the MACS becomes less informative. This seeming contradiction clearly reflects the current limitations of the algorithms used to construct multiple alignments.
From a quality point of view, we conclude that sm is the most appropriate sampling method since a higher proportion of good quality MACS is obtained after sampling, as well as an increased mean norMD value. Moreover, by significantly reducing the number of sequences to be aligned, the sm method also reduces the computation time required to construct the MACS.
MACS information content
The information content of a MACS is difficult to measure objectively. We therefore decided to investigate the residues annotated in the PDB database as being involved in functional active sites. These residues are usually well conserved in a protein family [18, 19] and well characterized both biochemically and structurally. As an estimate of the information content of a MACS, we calculated the number of known active sites that were detected in conserved columns of the alignment. Given a conservation threshold cut-off x, a column is considered to be "conserved" if at least x% of the residues, including gaps, are identical in the column. The sensitivity and specificity of the active site detection can then be computed (see Methods).
In this study, we only considered those tests for which init, mm, sdm and sm all resulted in good quality MACS, which represents 192 of the 284 proteins in the dataset. The rm method has been excluded from this study based on the results of the MACS quality analysis (see above). In subset+500, low quality MACS were obtained after rm sampling. Furthermore, preliminary analyses of active site detection using rm indicated that the mean sensitivity is much lower compared to all the other methods (see Additional file 3: G-mean results associated with detailed sensitivity and specificity when considering all proteins in subset+500 using the 80% threshold), indicating that the informational content was not conserved.
Table: G-mean results associated with detailed sensitivity and specificity (column conservation thresholds, in table order: 80%, 90%, 75% and 80%).
We then considered the three subsets defined in the section Sequence reduction rate separately, and the corresponding MACS_init ROC curves are shown in Figure 6.
For subset-100, the most suitable column conservation threshold for active site discrimination is 90%. G-mean values decrease with the application of any of the sampling methods: the MACS_init G-mean value (0.81) is slightly reduced after sm (0.79) and a larger reduction is observed after mm and sdm (both 0.76). This decrease is caused by a loss of specificity of the active site detection, directly linked to the reduction of sequence diversity after sampling. Indeed, when only a small number of homologues are detected by BlastP, the variability between the sequences is usually relatively low. Consequently the associated MACS contains a higher proportion of conserved columns, and more false positive predictions are obtained. However, this is not a serious problem as the unsampled MACS_init alignments for this subset are generally of high quality and the small number of sequences (<100) in the BlastP results means that reduction of the sequence set is not necessary for computational purposes.
For subset100–500, the most suitable column conservation threshold is 75%. The highest sensitivity for active site discrimination was obtained using sm sampling (0.89). However, the sensitivity and specificity scores are quite similar for all the sampling methods (G-mean values are all between 0.86 and 0.87) and the differences observed between the methods cannot be considered to be significant. Nevertheless, we observed previously that in this subset, the sm sampling is more accurate in terms of reduction rate and MACS quality, and consequently sm seems to be the most suitable sampling method.
Finally, for subset+500, an 80% conservation threshold was determined. The sensitivity of active site determination after mm and sdm sampling decreases drastically (0.72 and 0.68 respectively compared to 0.80 with no sampling), whereas the sensitivity and specificity of the sm sampled set are both close to the values obtained for MACS_init (Se = 0.79/Sp = 0.85 and Se = 0.80/Sp = 0.82 respectively). This leads to similar G-mean values for MACS_init and mm (0.81 and 0.80 respectively) and a small decrease for sdm (0.78), whereas sm shows a better G-mean value (0.82), indicating a better accuracy for active site detection. These observations correlate with the MACS quality results in which the proportion of good quality MACS is higher after application of sm sampling compared to the other methods. The study of sequence coverage showed that the mm and sdm sampled sets both contain a higher proportion of SDS compared to sm (Table 3). Moreover, the sm sampling resulted in a higher reduction rate than the mm and sdm methods under these conditions (Table 1). Without sequence sampling, the average time to construct a multiple alignment for the set of 79 alignments with more than 500 proteins was 995 seconds (maximum time: 4740 seconds). After sampling with the sm method, the average time for the same set of alignments was 17 seconds (maximum time: 125 seconds). Thus, all these observations converge towards the conclusion that sm is the most suitable sampling method for the effective reduction of the number of sequences detected by BlastP, while maintaining the information content of the subsequent MACS.
The rapid accumulation of numerous homologues in the sequence databases is a problem for which no unique solution exists. This study demonstrates that it is possible to sample the homologous sequences detected by BlastP while at the same time retaining the relevant information concerning the active site residues inside the sampled set of sequences.
We showed that on average 30% of the detected sequences are sufficient to efficiently maintain the relevant functional information; however, the sequence selection cannot be performed randomly.
The reduction of the sequence set is not necessary for proteins having few homologues in sequence databases (fewer than 100). In this case, the variability between the sequences is usually relatively low and sampling the sequences results in a loss of information.
The strips sampling (sm) is the most suitable sampling method for the effective reduction of the sequence set when more than 100 sequences are detected by BlastP searches. This method maintains the potential structural and functional information in the sampled set and by defining the maximal number of sequences (set here to 100) the computation time remains reasonable.
In conclusion, regardless of the size of the initial BlastP results, our sampling strategy produces a set of sequences that is computationally and humanly manageable.
In the future, we will study the conservation of other kinds of information that can be extracted from a set of homologous sequences, such as secondary structure information or motif conservation.
284 protein dataset
We defined a set of 284 distinct proteins, which we refer to as the "284 protein dataset" using a similar methodology to that developed by Aloy and co-workers for the creation of a protein test set for the prediction of functional sites. To build our 284 protein dataset, we selected protein sequences sharing less than 70% identity with one another and having an annotated active site, from the March 2005 release of the PDB. When several polypeptide chains existed for a single PDB entry, the chain containing the most annotated catalytic residues was selected first, and then the longest one. The information concerning the active site residues was extracted from the SITE records description when "active" or "catalytic" words were found in the associated definition. The 284 proteins consisted of a total of 96403 residues, of which 1045 represented active site residues (from 1 to 20 residues per protein).
The 284 protein dataset covers a large part of the protein fold space according to the CATH classification. Only 9 proteins have not been classified and 126 proteins have been defined as multi-domain proteins. 440 domains are represented, of which 68 belong to class 1 (mainly alpha), 119 to class 2 (mainly beta), 251 to class 3 (mixed alpha-beta) and 2 to class 4 (few secondary structures).
These 284 proteins correspond to 257 enzymes and 27 non-enzymes. According to the official Enzyme Nomenclature, they can be classified as follows: 49 oxidoreductases (EC 1), 26 transferases (EC 2), 148 hydrolases (EC 3), 23 lyases (EC 4), 9 isomerases (EC 5) and 2 ligases (EC 6). The non-enzyme proteins are mostly toxins, binding proteins and inhibitors.
The full list of PDB names is available (see Additional file 4: List of the PDB identifier constituting the 284 protein dataset).
The BlastP searches were performed on the UniRef90 database (2005/05/23 version), a non redundant database based on UniProt, for which sequences sharing more than 90% identity are clustered in one single entry corresponding to a representative sequence for this cluster. We chose a non redundant database in order to avoid the over-representation of identical or nearly identical sequences resulting from closely related genome sequencing projects, etc. Such very closely related sequences were ignored as they do not add any significant information in terms of catalytic functional residues. The standard version 2.2.10 of BlastP has been used, and parameters e, v, and b were set to 0.001, 5000 and 5000 respectively, allowing the retrieval of up to 5000 sequences and alignments.
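The search settings above can be written down as a command line. The following is a minimal sketch, assuming the legacy NCBI `blastall` front end (the one shipped with the 2.2.x releases) and hypothetical file and database names; it only builds the argument list and does not run the search.

```python
def build_blastp_command(query_fasta, database, output_path):
    """Return the argument list for a legacy blastall BlastP run.

    The file names and the database name are placeholders chosen for
    illustration; only the -e, -v and -b values come from the text.
    """
    return [
        "blastall",
        "-p", "blastp",       # program: protein query against protein database
        "-d", database,       # formatted database, e.g. a local UniRef90 copy
        "-i", query_fasta,    # query sequence in FASTA format
        "-e", "0.001",        # E-value cut-off
        "-v", "5000",         # maximal number of one-line descriptions
        "-b", "5000",         # maximal number of alignments
        "-o", output_path,    # report file
    ]


command = build_blastp_command("query.fasta", "uniref90", "query.blastp")
print(" ".join(command))
```

The list can be passed to `subprocess.run` on a machine where `blastall` and a formatted database are installed.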
Characterisation of the BlastP E-value distribution
All the 153128 sequences detected with E-value ≤ 0.001 by BlastP for the 284 protein dataset were pooled and sorted according to their respective E-values. This list was then divided into 10 equally populated intervals and the E-values corresponding to the boundaries of each interval were defined from this cutting (interval 1: 1 × 10^-200 to 4 × 10^-67; interval 2: 4 × 10^-67 to 1 × 10^-39; interval 3: 1 × 10^-39 to 7 × 10^-30; interval 4: 7 × 10^-30 to 1 × 10^-23; interval 5: 1 × 10^-23 to 1 × 10^-18; interval 6: 1 × 10^-18 to 2 × 10^-14; interval 7: 2 × 10^-14 to 6 × 10^-11; interval 8: 6 × 10^-11 to 8 × 10^-8; interval 9: 8 × 10^-8 to 2 × 10^-5; interval 10: 2 × 10^-5 to 0.001). For each individual BlastP result, the percentage of sequences in each interval was calculated and we thus obtained for each BlastP, a list of 10 values ranging from 0 to 100% characterising the E-value distribution for this BlastP search.
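The profile construction described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the boundaries are taken as the largest E-value of each equally populated slice, and ties that straddle a boundary are all assigned to the lower interval, which is an assumption of this sketch.

```python
import bisect


def interval_boundaries(pooled_evalues, n_intervals=10):
    """Upper E-value boundary of each equally populated interval."""
    ordered = sorted(pooled_evalues)
    size = len(ordered)
    if size < n_intervals:
        raise ValueError("need at least one E-value per interval")
    # Index of the last E-value falling in interval k (k = 1..n_intervals).
    return [ordered[(k * size) // n_intervals - 1]
            for k in range(1, n_intervals + 1)]


def distribution_profile(evalues, boundaries):
    """Percentage of the E-values of one BlastP search in each interval."""
    counts = [0] * len(boundaries)
    for evalue in evalues:
        # First interval whose upper boundary is >= evalue.
        index = bisect.bisect_left(boundaries, evalue)
        counts[min(index, len(boundaries) - 1)] += 1
    total = len(evalues)
    if total == 0:
        return [0.0] * len(boundaries)
    return [100.0 * count / total for count in counts]


# Toy pooled list (made-up values, not data from the study).
pooled = [10 ** -k for k in range(1, 21)]
bounds = interval_boundaries(pooled)
print(distribution_profile([1e-20, 1e-19, 1e-2, 1e-1], bounds))
```

Each BlastP search then yields one list of 10 percentages, which is the input expected by a clustering step.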
The 284 lists were then clustered using two classification programs: a Dirichlet mixture algorithm and the Secator program. The same global tendencies were observed with both methods with different degrees of resolution (data not shown). We chose to work with Secator's classification which avoided the creation of poorly populated groups. The chosen classification resulted in 10 groups that we named E-clusters (Figure 3). E-cluster 1 corresponds to a highly populated interval 1, i.e. a majority of highly similar sequences detected by BlastP. E-cluster 10 corresponds to a highly populated interval 10, i.e. the BlastP result contains a majority of weakly related sequences.
Ballast processes the Blast results and determines anchors, called LMS (Local Maximum Segments) based on the high scoring-segment pairs detected by Blast.
DbClustal uses the LMS as anchors to create a MACS. Sequence fragments are eliminated, and the number of sequences to be aligned is limited to 500 (corresponding to the 500 lowest E-values before or after sampling).
Rascal scans the complete alignment and corrects locally misaligned regions.
norMD objectively estimates the MACS quality. It combines the advantages of a column-scoring technique with the sensitivity of methods incorporating residue similarity scores. It also incorporates gap information and ab initio sequence information, such as the number, length and similarity of the aligned sequences. A norMD score ≥ 0.3 is assumed to indicate a good quality alignment.
the mean method (mm). A threshold is defined as the mean difference between successive E-value logarithms. Let $n$ be the number of sequences detected with an E-value ≤ 0.001, $E_n$ the E-value associated with the $n$th sequence and $E_1$ the lowest printed E-value.
the second derivative method (sdm). The second derivative of the E-value as a function of rank is computed and the sequences corresponding to its inflexion points are selected. Let $V$ be the variation function of the E-value curve, i.e. $V_{i+1} = E_{i+1} - E_i$. A sequence is selected if $f''(V_i) < 0$ (Figure 7(b)).
the strips method (sm), for which the maximal number of sequences to be selected is fixed. The logarithmic graph of the Blast E-values is divided into a preset number $x$ of strips of equal width ($x = 100$ in this study). Let $E_1$ be the smallest E-value from the Blast search results and $E_n$ the highest one ($E_n \le 0.001$).
$\mathrm{Width} = \dfrac{\log(E_1) - \log(E_n)}{x}$
the random method (rm), for which the associated reduction rate was defined as 70%, after initial analysis of the mean reduction rates of the 3 other sampling methods. Sequences are randomly selected on this basis. This method is a control in our study, used to estimate the relevance of a selection according to the sequence dispersion.
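The mm, sm and rm descriptions above can be illustrated with a short sketch. Several points are assumptions of this sketch rather than statements from the text: the selection rules (mm keeps a hit when the gap to the previous log E-value exceeds the mean gap; sm keeps the first hit of each non-empty strip), the floor applied to E-values reported as 0, the use of a positive strip width, and all function names. The sdm method is not sketched.

```python
import math
import random

E_FLOOR = 1e-200  # assumption: gives E-values reported as 0.0 a finite logarithm


def log_evalues(evalues):
    """Sorted log10 E-values; returned indices below refer to this order."""
    return [math.log10(max(e, E_FLOOR)) for e in sorted(evalues)]


def mean_sample(evalues):
    """mm sketch: keep hits preceded by a gap larger than the mean gap."""
    logs = log_evalues(evalues)
    if len(logs) < 2:
        return list(range(len(logs)))
    threshold = (logs[-1] - logs[0]) / (len(logs) - 1)  # mean successive difference
    return [0] + [i for i in range(1, len(logs))
                  if logs[i] - logs[i - 1] > threshold]


def strips_sample(evalues, n_strips=100):
    """sm sketch: keep the first hit of each non-empty strip (at most n_strips)."""
    logs = log_evalues(evalues)
    if not logs:
        return []
    width = (logs[-1] - logs[0]) / n_strips  # positive strip width on the log scale
    if width == 0:
        return [0]
    selected, seen = [], set()
    for i, value in enumerate(logs):
        strip = min(int((value - logs[0]) / width), n_strips - 1)
        if strip not in seen:
            seen.add(strip)
            selected.append(i)
    return selected


def random_sample(evalues, reduction_rate=0.70, seed=0):
    """rm sketch: keep a random (1 - reduction_rate) fraction of the hits."""
    if not evalues:
        return []
    keep = max(1, round(len(evalues) * (1 - reduction_rate)))
    return sorted(random.Random(seed).sample(range(len(evalues)), keep))


hits = [1e-50, 1e-49, 1e-48, 1e-20, 1e-19, 1e-5]  # made-up E-values
print(mean_sample(hits), strips_sample(hits, n_strips=4))
```

Both E-value based sketches favour hits located where the E-value curve jumps, whereas the random control ignores the curve entirely.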
Evaluation of the sequence sets selected by the 4 sampling methods
Two different tests were designed to evaluate and compare the set of sequences selected by each of the four different sampling methods.
Sequence reduction rate
The reduction rate estimates the relationship between the total number of sequences detected by a BlastP search with an E-value ≤ 0.001 and the reduced number of sequences after sampling. For each query protein X of the 284 protein dataset:
For a given sampling method, the mean reduction rate is the average of the 284 individual reduction rates:
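As a worked illustration, the sketch below assumes that the reduction rate of a protein is the percentage of detected sequences removed by the sampling, which is consistent with the 70% rate fixed for rm and the 30% of sequences retained on average. The function names and the counts in the example are hypothetical.

```python
def reduction_rate(n_detected, n_sampled):
    """Percentage of BlastP sequences (E-value <= 0.001) removed by sampling."""
    if n_detected <= 0:
        raise ValueError("n_detected must be positive")
    return 100.0 * (n_detected - n_sampled) / n_detected


def mean_reduction_rate(counts):
    """Average of the individual rates; counts is a list of (detected, sampled)."""
    rates = [reduction_rate(detected, sampled) for detected, sampled in counts]
    return sum(rates) / len(rates)


# Two made-up query proteins: 1000 -> 300 sequences and 200 -> 100 sequences.
print(mean_reduction_rate([(1000, 300), (200, 100)]))
```

Averaging per-protein rates, rather than pooling all sequences, gives each query protein the same weight regardless of the size of its BlastP result.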
The sampling method coverage at the sequence level (coverage rate) corresponds to the number of sequences selected in common by the sampling methods. The sequence coverage was calculated by considering the number of sequences jointly selected by 2, 3 or 4 methods compared to one method chosen as the reference.
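The coverage computation amounts to a set intersection. The sketch below is a minimal illustration that assumes the rate is expressed as a percentage of the reference set; the function name and the identifiers are made up.

```python
def coverage_rate(reference, *others):
    """Percentage of the reference set also selected by every other method."""
    reference = set(reference)
    if not reference:
        return 0.0
    common = reference.intersection(*(set(other) for other in others))
    return 100.0 * len(common) / len(reference)


# Made-up sequence identifiers selected by three hypothetical sampled sets.
set_mm = {"seq1", "seq2", "seq3", "seq4"}
set_sdm = {"seq1", "seq2", "seq9"}
set_sm = {"seq2", "seq7"}
print(coverage_rate(set_mm, set_sdm))          # coverage between 2 methods
print(coverage_rate(set_mm, set_sdm, set_sm))  # coverage between 3 methods
```

Because the denominator is the reference set, the rate is not symmetric: swapping the reference method generally changes the value.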
Evaluation of the MACS information content
Three tests were designed to evaluate the potential information content associated with a MACS.
Sampled distant sequences
The quality of the test alignments used in this study was evaluated using the norMD (normalized Mean Distance) objective function. As stated in previous studies, a norMD score greater than 0.3 indicates a good quality MACS.
Identification of active site residues
In order to estimate the impact of each sampling method on the structural information content of a MACS, we determined the number of active site residues that were found in conserved columns in the MACS. We tested 9 different conservation thresholds: 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95% and 100%. Let x be the considered threshold: a column is considered as a "conserved column" when at least x% of the residues (including gaps) are identical at this position.
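The conserved-column test can be sketched as below. One reading of the definition is assumed here: gaps count in the column size (the denominator), but the dominant symbol must be a residue rather than a gap. The function names and the toy alignment are illustrative only.

```python
from collections import Counter

GAP_SYMBOLS = {"-", "."}


def is_conserved(column, threshold_percent):
    """True if at least threshold_percent of the column is one identical residue."""
    residues = Counter(symbol for symbol in column if symbol not in GAP_SYMBOLS)
    if not residues:
        return False
    top_count = residues.most_common(1)[0][1]
    # Gaps are kept in the denominator, so gappy columns are penalised.
    return 100.0 * top_count / len(column) >= threshold_percent


def conserved_columns(alignment, threshold_percent):
    """One boolean per column for a list of equal-length aligned sequences."""
    return [is_conserved(column, threshold_percent) for column in zip(*alignment)]


toy_alignment = ["ACD-", "ACE-", "ACDK", "AGDK"]  # made-up sequences
for threshold in (75, 80):
    print(threshold, conserved_columns(toy_alignment, threshold))
```

Raising the threshold from 75% to 80% removes the two columns that are only 3/4 identical, which is the trade-off explored across the 9 tested thresholds.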
The sensitivity (Se) and the specificity (Sp) of each sampling method are defined as $Se = \dfrac{TP}{TP + FN}$ and $Sp = \dfrac{TN}{TN + FP}$, where:
TP (True Positive) = number of active site residues in conserved columns,
FP (False Positive) = number of non-active site residues in conserved columns,
TN (True Negative) = number of non-active site residues in non-conserved columns,
FN (False Negative) = number of active site residues in non-conserved columns.
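The four counts and the two rates derived from them can be computed per alignment as in the following sketch, which uses the standard definitions Se = TP / (TP + FN) and Sp = TN / (TN + FP). The function names and the example flags are hypothetical.

```python
def confusion_counts(conserved, active_site):
    """Count TP, FP, TN, FN from per-column booleans of equal length."""
    tp = fp = tn = fn = 0
    for is_conserved, is_active in zip(conserved, active_site):
        if is_conserved and is_active:
            tp += 1      # active site residue in a conserved column
        elif is_conserved:
            fp += 1      # non-active site residue in a conserved column
        elif is_active:
            fn += 1      # active site residue in a non-conserved column
        else:
            tn += 1      # non-active site residue in a non-conserved column
    return tp, fp, tn, fn


def sensitivity_specificity(tp, fp, tn, fn):
    """Return (Se, Sp); a rate with an empty denominator is reported as 0.0."""
    se = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    return se, sp


conserved = [True, True, False, False, True]     # made-up column flags
active_site = [True, False, False, True, True]   # made-up annotations
print(sensitivity_specificity(*confusion_counts(conserved, active_site)))
```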
ii) ROC curve
A Receiver Operating Characteristic curve (ROC curve) is a graph of the true positive rate (sensitivity) versus the false positive rate (1 – specificity), while varying the conservation threshold. It measures the potential of a classifier to discriminate between the two classes, and allows the determination of the most suitable threshold for discrimination: the inflexion of the ROC curve near the top and left axis corresponds to the best classifier performance. Thus, the area under the ROC curve (AUC) provides a single metric that can be used to judge the overall discriminative ability of a classification method. An AUC of 0.5 indicates a random prediction; between 0.7 and 0.8 indicates acceptable discrimination; between 0.8 and 0.9 indicates excellent discrimination, and above 0.9 indicates outstanding discrimination.
The most suitable conservation threshold for the discrimination of active site residues was determined by computing ROC curves while varying the classification threshold from 60 to 100% in the context of the MACS_init.
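A ROC analysis over the conservation thresholds can be sketched as follows. Two choices here are assumptions of the sketch, not the authors' procedure: the curve is closed with the points (0, 0) and (1, 1) before a trapezoidal AUC is computed, and the "most suitable" threshold is taken as the point closest to the top-left corner. The (Se, Sp) values are invented for illustration.

```python
import math


def roc_points(results):
    """results maps threshold -> (Se, Sp); returns sorted (FPR, TPR) points."""
    points = {(0.0, 0.0), (1.0, 1.0)}  # assumed end points of the curve
    points.update((1.0 - sp, se) for se, sp in results.values())
    return sorted(points)


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def best_threshold(results):
    """Threshold whose ROC point lies closest to the corner (FPR=0, TPR=1)."""
    return min(results,
               key=lambda t: math.hypot(1.0 - results[t][1], 1.0 - results[t][0]))


# Invented (Se, Sp) pairs for three conservation thresholds.
toy_results = {60: (1.0, 0.0), 80: (0.8, 0.8), 100: (0.0, 1.0)}
print(auc(roc_points(toy_results)), best_threshold(toy_results))
```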
iii) G-mean accuracy
To assess the relevance of the sampling methods in terms of MACS information content, we compared the sensitivity and specificity results obtained with and without sampling. We studied the so-called confusion matrix, which includes predicted and true active site classifications, and from which several metrics can be obtained. We have "imbalanced classes" in this study: i.e. columns containing active site residues represent only a small minority of the total number of MACS columns, which means that we cannot use metrics such as accuracy or precision which are not suitable for this kind of data. We therefore used the geometric mean of accuracies as a comparison metric defined as: $\text{G-mean} = \sqrt{Se \times Sp}$
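The metric can be checked against the sensitivity and specificity values reported for subset+500. This sketch takes the geometric mean of accuracies in its usual form, the square root of Se × Sp; the function name is illustrative.

```python
import math


def g_mean(sensitivity, specificity):
    """Geometric mean of the accuracies on the two classes."""
    return math.sqrt(sensitivity * specificity)


# Se/Sp pairs quoted in the Results for subset+500.
print(round(g_mean(0.79, 0.85), 2))  # sm sampled set
print(round(g_mean(0.80, 0.82), 2))  # MACS_init
```

Unlike plain accuracy, this metric drops to zero as soon as one of the two classes is never predicted correctly, which is why it suits the small minority of active site columns.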
List of abbreviations
MACS: multiple alignment of complete sequences
PDB: protein data bank
sampled distant sequences
ROC: receiver operating characteristic
AUC: area under the curve
The authors thank Patrice Koehl and Frédéric Plewniak for stimulating discussions and are grateful to Julie D. Thompson for careful reading of the manuscript and helpful comments. The authors also wish to thank the referees for their constructive comments. This work was funded by the Institut National de la Santé Et de la Recherche Médicale, the Centre National de la Recherche Scientifique, the Université Louis Pasteur from Strasbourg, the Réseau National des Génopoles from Strasbourg and the Décrypthon program, initiated by the Association Française contre les Myopathies, IBM and the CNRS.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.