FRAGS: estimation of coding sequence substitution rates from fragmentary data

Background Rates of substitution in protein-coding sequences can provide important insights into evolutionary processes that are of biomedical and theoretical interest. Increased availability of coding sequence data has enabled researchers to estimate more accurately the coding sequence divergence of pairs of organisms. However the use of different data sources, alignment protocols and methods to estimate substitution rates leads to widely varying estimates of key parameters that define the coding sequence divergence of orthologous genes. Although complete genome sequence data are not available for all organisms, fragmentary sequence data can provide accurate estimates of substitution rates provided that an appropriate and consistent methodology is used and that differences in the estimates obtainable from different data sources are taken into account. Results We have developed FRAGS, an application framework that uses existing, freely available software components to construct in-frame alignments and estimate coding substitution rates from fragmentary sequence data. Coding sequence substitution estimates for human and chimpanzee sequences, generated by FRAGS, reveal that methodological differences can give rise to significantly different estimates of important substitution parameters. The estimated substitution rates were also used to infer upper-bounds on the amount of sequencing error in the datasets that we have analysed. Conclusion We have developed a system that performs robust estimation of substitution rates for orthologous sequences from a pair of organisms. Our system can be used when fragmentary genomic or transcript data is available from one of the organisms and the other is a completely sequenced genome within the Ensembl database. As well as estimating substitution statistics our system enables the user to manage and query alignment and substitution data.


Background
Substitution rates of coding sequences provide a valuable means of characterising the evolutionary divergence of homologues. A significant excess in the rate of non-synonymous substitution (K a ) compared to the rate of nearly neutral synonymous substitution (K s ) is widely used as evidence that a sequence has evolved under positive selec-tive pressure [1]. The identification of individual genes under positive selection or strong negative selection has important implications for the understanding of processes such as the evolution of drug resistance and evasion of the immune system by pathogens [2,3]. In the case of genes not evolving under positive selection the relative rates of non-synonymous to synonymous substitutions (K a /K s ) provide an indication of selective constraint and a means of estimating the strength of purifying selection acting on the sequences.
The average value of K a /K s from pairwise alignments of coding sequences has been used to compare the evolutionary rates of duplicated and unduplicated genes in order to investigate the affect of gene duplication on the rate of sequence evolution [4][5][6]. More recently, coding sequence substitution rates of collinear and rearranged chromosomes from closely related organisms have been proposed to contain clearly detectable traces of the speciation process [7]. Coding sequence substitution rate estimates have also permitted investigation of the relationship between level and breadth of gene expression and the rate of coding sequence evolution [8,9]. All of these comparisons have required the construction of large datasets of pair-wise in-frame alignments of orthologous sequences. FRAGS automates the process of generating these kinds of datasets and provides a framework for the management of the results generated.
Databases designed specifically to contain molecular evolutionary information have recently been established, prompted by the necessity to manage large quantities of data generated by genomic studies. The Adaptive Evolution Database, TAED [10], which contains coding sequence substitution rates estimated using the approximate method of Li et al. (Li,), is one example of these developments. Software for genomic scale molecular evolutionary studies has also been released, e.g. GenomeHistory [14], which uses a Maximum Likelihood estimation method following Muse and Gaut [15], and Goldman and Yang's method [16]. Unfortunately, it is often difficult to obtain both the software and databases used in these kinds of studies, and consequently the reproduction of results can be both laborious and inexact. Database resources such as Ensembl [17][18][19], Hovergen [20] and TOGA [21] are being constructed with eukaryotic comparative genomics in mind, but still need further development to provide adequate evolutionary information. Implementation FRAGS (the FRAGment Substitution rate estimation system) is a Python [22] based system, designed for obtaining and managing coding sequence substitution data from in-frame alignments of sequence fragments, and coding sequences from a related completely sequenced genome within the Ensembl database [23]. FRAGS uses existing Open Source or academically available software, including modules and code from the Biopython project [24]. Embedded SQL statements are used for database manipulation.
In order to obtain coding sequence substitution distances it was necessary to develop a system that was able to produce in-frame alignments from any available fragmentary sequences (such as shotgun sequences, BAC end sequences and ESTs) and orthologous sequences from a completely sequenced organism. The requirement of a completely sequenced reference organism comes from our approach to identifying orthologues. Sequence fragments with similarity to just one coding sequence in the reference organism and that do not have significant similarity to non-coding regions are considered orthologous (see below). The complete strategy used to generate inframe alignments from fragmentary data is shown in The input data to FRAGS consists of sequence fragments in fasta format. The sequence fragments are filtered for repeats and vector contamination using RepeatMasker [25] and BLASTN [26] against the Univec_core database [27]. The system is also capable of filtering for low-quality sequence regions, using an adjustable cut-off, if Phred [28] quality scores are provided. The sequence fragments are searched against all known and predicted proteins in the Ensembl database using BLASTX [26] with the default parameters. In order to eliminate paralogous matches that would distort estimates of substitution rates, nucmer, from the MUMmer package [29,30], is used to search the fragments against the complete genome sequence underlying the Ensembl database. Sequence fragments that do not match to the genome sequence with a similarity and match-length above pre-defined cut-off values, or sequences that have more than one match to the genome, are excluded from further analysis. The parameters that determine whether a fragment has matched unambiguously are decided by the user. The co-ordinates of unambiguous matches of fragments to the genome are compared to the genomic co-ordinates of the Ensembl protein matching the fragment. If these co-ordinates are within a predetermined distance, the fragment and protein are retained for further analysis (Figure 1.).
In-frame alignments of protein coding sequences are produced through a two-step procedure using GeneWise [31,32]. First the nucleotide sequence fragment is aligned against the Ensembl protein and then the corresponding Ensembl transcript is aligned against the putative translation of the nucleotide fragment. Maximum likelihood estimates of coding sequences substitution distances are then derived from the in-frame alignments using the codeml program from the Paml [33] package. Substitution distances are estimated from individual sequence alignments as well as from a concatenated alignment of all of the sequences.
Flow diagram illustrating the strategy used to generate in-frame alignments from fragmentary sequences Figure 1 Flow diagram illustrating the strategy used to generate in-frame alignments from fragmentary sequences.

Data sets and program parameters
We used FRAGS to estimate coding sequence substitution distances for human and chimpanzee orthologue-pairs using genomic, EST and complete coding sequence datasets from chimpanzee and the Ensembl Human database (version 8.3 based on the June 2002 Golden Path build -NCBI30). In total 148,102 genomic sequence fragments were obtained from the Riken Chimpanzee Sequencing Initiative [34,35]. A separate set of 6930 chimpanzee cDNA sequences was obtained from the Genbank EST division. A smaller set of complete chimpanzee coding sequences from Genbank was also obtained [4].
BLASTX matches between chimpanzee sequence fragments and human proteins from the Ensembl database were required to be more than 60 amino acids in length and to have at least 90% identity. The chimpanzee sequence fragments were searched against the human genome (June 2002 Golden Path genome assembly [36]) using nucmer [29,30]. Chimpanzee sequence fragments that did not match the genome in the same location as the homologous human protein were omitted from further analysis. Sequence fragments that had a second match to the genome, with at least 90% of the number of identical residues as the best match, were also removed from the analysis.
The pairwise comparison mode of codeml (runmode = -2) was used to provide maximum likelihood estimates of substitution distances from the in-frame alignments generated by GeneWise (see Implementation). Estimates were carried out separately for alignments of individual fragments as well as for a concatenated alignment of all fragments. Ambiguity characters were removed from the data by codeml before the maximum likelihood estimates were performed (clean_data = 1). Default values were used for the remaining parameters of codeml.
Estimates of sampling error for the concatenated alignments were produced using a non-parametric bootstrap resampling procedure, rather than from the output of codeml, as error estimates from codeml may not be reliable [37]. A sampling error sufficient to include 90% of the resampled data was estimated from 100 bootstrap replicates of the concatenated alignments.

Database and graphical interface
MySQL ® [38] was used as the management system for the relational database. FRAGS was developed for Linux and FreeBSD systems. In addition to the MySQL ® SQL interpreter, a number of graphical interfaces for data manipulation are available under Open Source licenses. The MySQL ® Control centre and the database component of OpenOffice.org [39] are cross-platform GUI's that can be used for data manipulation. The databases can also be accessed via the phpMyAdmin [40] interface ([41]; Figure  2.) which can be used for querying, sorting and export of data from a web browser.

Results and Discussion
We used FRAGS to estimate substitution rates from alignments of human and chimpanzee coding sequences. Three different sources of chimpanzee sequence data were analysed separately, i.e. genomic sequence fragments, ESTs and annotated coding sequences from Genbank. We applied a maximum likelihood [16] as well as an approximate method [42] to infer substitution rates from individual alignments as well as from concatenated alignments consisting of all sequences of a specific type. Estimates derived from the approximate method were similar to the estimates derived from the maximum likelihood method and can be obtained from our web site [43]. Sequence quality scores were available for the genomic sequences and for these we report substitution rates derived from a quality filtered dataset and for the complete dataset.
Sequence pairs with zero substitutions at synonymous sites were omitted from substitution rate calculations for individual sequences, because K a /K s is undefined ( Table  1). Estimates of K a /K s from concatenated sequences are not affected by individual sequences with K s = 0. Furthermore concatenated alignments have been shown to give more accurate estimates of genetic distances than estimates derived by averaging over individual genes or proteins [44]. In spite of this substitution rates derived from concatenated and individual sequences were very similar. The results presented in Table 2. are for concatenated alignments and represent the most accurate estimates of substitution rates that we have produced. Intermediate data generated prior to the final estimates of substitution rates can be accessed from the MySQL databases [43].
We have found that, at least in the case of human and chimpanzee, the source of the sequence data used has a larger impact on the estimate of substitution rates than the methodology employed. Our estimate of the non-synonymous substitution distance derived from concatenated alignments of genomic sequences was approximately twice the estimate derived from the EST sequences, even when the genomic sequences were filtered for sequence quality (K a = 0.012 and 0.006 for genomic and EST sequences respectively; Table 2.). The estimate of the synonymous substitution distance was also slightly elevated in the genomic sequences (K s = 0.033 compared to 0.026 for EST sequences), but, in spite of the increase in K s the estimate of K a /K s remained far higher in the genomic sequences compared to the ESTs (K a /K s = 0.37 and 0.24 respectively). The estimate of K a /K s that we derived from the Genbank coding sequences was far higher than the Screenshot of the user interface to the MySQL database Figure 2 Screenshot of the user interface to the MySQL database. The screenshot shows a subset of the data available from the substitution statistics table and also illustrates the query functionality available from the web-based phpMyAdmin interface [41]. one obtained from either the ESTs or the quality filtered genomic sequences (K a /K s = 0.47, 0.37, 0.24 for the concatenated Genbank, genomic and EST sequences respectively; Table 2). These higher estimates could be a consequence of the over-representation of immune system proteins, with relatively high K a values, in this data set [45].

Data Set
The average value of K a /K s that we obtained from the Genbank EST sequences using FRAGS (K a /K s = 0.24) was in good agreement with two previous estimates. The estimate of Hellmann and colleagues (K a /K s = 0.20; [45]) was based on 1226 coding sequences obtained from an EST dataset. A similar figure of 0.20 was obtained by Sakate et al. [46] from 226 consensus sequences derived from a slightly larger set of 5' cDNA sequence ends from brain, skin, and liver sequence libraries. However, in a study that focussed on differences in the rates of evolution between rearranged and unrearranged chromosomes, Navarro and Barton [7] have recently published an estimate of K a /K s derived from human and chimpanzee coding sequence orthologues from Genbank. Their estimate of 61% is considerably higher than any previous estimate and is also higher than the one derived here from a similar Genbank dataset.
The values of K a /K s derived from EST sequences, both here and previously [45], are significantly lower than the values derived from the other sources. Highly expressed genes are known to experience higher levels of purifying selection and to have lower average values of K a /K s [8,9,47,48]. Hellmann et al. [45] have suggested that low values of K a /K s estimated from EST data may result from over-representation of highly expressed genes in cDNA libraries. Because highly expressed genes are over-represented in EST data, estimates of K a /K s derived from EST data should not be compared to estimates derived from genomic data or from a random set of complete coding sequences.
Methods used to align sequences have also been shown to contribute to the variability in estimates of substitution rates. In one comparison of mouse and human ortho-logues [49] sequence identity was found to be 5-6% higher than in an earlier estimate [50]. The higher estimate was proposed to result from the use of local rather than global alignments [49]. Our study uses identical alignment strategies for the different kinds of data studied and therefore results derived from the different data sources are more easily compared.
Errors in estimation of substitution can be introduced by inclusion of low-quality sequence (sequence that is more likely to be incorrectly identified by base calling software). Because sequencing errors are independent of translation, they tend to result in values of K a /K s that are closer to unity. The average value of K a /K s across all genes in a pair of organisms is normally far less than one and, as a result, sequence error normally increases the estimate of the average of K a /K s . The substantial decrease in the average value of K a /K s derived from quality-filtered BES sequences compared to the unfiltered sequences (Table 2) indicates that a significant amount of sequence error was removed through filtering. Interestingly, in spite of being errorprone, EST sequences provide the lowest values of K a /K s .
No publicly accessible quality data was available for the EST sequences, however, an upper-bound on the rate of sequencing error in the ESTs was estimated from the value of K a /K s . Because sequencing error is random, the number of synonymous sequence errors per synonymous site is expected to be equal to the number of non-synonymous sequence errors per non-synonymous site. In the worstcase all non-synonymous substitutions in the EST data are sequencing errors. For the EST data studied here this implied a maximum error-rate of 0.0044, or one sequencing error per 225 base-pairs. Given that non-synonymous substitutions do occur, this estimate is very conservative and the true error-rate is likely to be much lower than the upper-bound estimated here. A similar calculation, based on a comparison of the number of non-synonymous substitutions per non-synonymous site in the filtered and unfiltered genomic data, yielded a high estimate of one error per 100 base-pairs in the BES data. It is important to emphasize that FRAGS is not intended to replace existing tree-based methods of inferring positive selection. When multiple homologous coding sequences are available for a gene of interest, tree-based methods are expected to be the most sensitive way to infer positive selection. Swanson et al. [51] have shown, however, that pairwise comparisons of fragmentary coding sequences can provide an initial screen for genes evolving under positive selection and can reveal differing selective pressures acting on different gene classes.

Future Prospects
In order to stimulate further modular development we will continue to integrate this software more tightly within the frameworks of existing projects (Biopython [24], BioSQL [52] and Ensembl [23]. This will assist in the accommodation of software components (plugins), that are 'wrapped' by modules that enforce adherence to the underlying database structure, making it more easily adaptable for use with different alignment strategies and methods of estimating substitution rates. Currently FRAGS uses codeml [33] to estimate substitution rates from the in-frame alignments, however we intend to extend the software to incorporate additional methods for estimating substitution rates that are released under Academic or Open Source licenses [53].
Module tests are being written to enable repeatability of 'experiments', and will be included in subsequent releases of the package. We hope that other researchers using this software will contribute to the development of these module tests, and that they may, in future, serve as the basis for standardizing these types of analyses.
In order to deal with larger quantities of sequence data, especially data derived from shotgun sequencing projects, time-consuming data processing stages of the software will be fully parallelized. Analysis of larger quantities of data will also benefit from clustering and assembly of sequence data in order to reduce redundancy of the alignments and improve speed.

Conclusions
Different methodologies and data sources have yielded significantly different estimates of the important evolutionary parameter K a /K s for pairs of organisms. The difficulty in obtaining an accurate estimate of K a /K s for orthologous sequences is particularly acute for organisms for which limited amount of fragmentary data is available, as different data types (genomic or transcript) and genes from different functional classes can yield very different estimates. For such organisms, the limited set of genes for which complete coding sequences are available are often highly functionally biased.
Our software facilitates significantly the task of estimating substitution rates of coding sequences from fragmentary data and enables the researcher to repeat the analysis for different datasets, or with different parameter values. This software can be of considerable utility to researchers who are performing initial characterisation of sequenced fragments from organisms for which a related completely sequenced organism is available. As the number of completely sequenced and well-annotated organisms in databases such as Ensembl increases, we anticipate that our software will become even more widely applicable.

Authors' contributions
ES conceived and coded the software and prepared the manuscript draft. CS supervised the research and the development of the software, and assisted in preparation of the manuscript draft. WH provided user feedback, reviewed the development of the system, edited the draft and provided guidance as to the applicability and design of the system. All authors read and approved the final manuscript.