Representative transcript sets for evaluating a translational initiation sites predictor

Background Translational initiation site (TIS) prediction is a very important and actively studied topic in bioinformatics. In order to complete a comparative analysis, it is desirable to have several benchmark data sets which can be used to test the effectiveness of different algorithms. An ideal benchmark data set should be reliable, representative and readily available. Preferably, proteins encoded by members of the data set should also be representative of the protein population actually expressed in cellular specimens. Results In this paper, we report a general algorithm for constructing a reliable sequence collection that only includes mRNA sequences whose corresponding protein products present an average profile of the general protein population of a given organism, with respect to three major structural parameters. Four representative transcript collections, each derived from a model organism, have been obtained following the algorithm we propose. Evaluation of these data sets shows that they are reasonable representations of the spectrum of proteins obtained from cellular proteomic studies. Six state-of-the-art predictors have been used to test the usefulness of the construction algorithm that we proposed. Comparative study which reports the predictors' performance on our data set as well as three other existing benchmark collections has demonstrated the actual merits of our data sets as benchmark testing collections. Conclusion The proposed data set construction algorithm has demonstrated its property of being a general and widely applicable scheme. Our comparison with published proteomic studies has shown that the expression of our data set of transcripts generates a polypeptide population that is representative of that obtained from evaluation of biological specimens. Our data set thus represents "real world" transcripts that will allow more accurate evaluation of algorithms dedicated to identification of TISs, as well as other translational regulatory motifs within mRNA sequences. The algorithm proposed by us aims at compiling a redundancy-free data set by removing redundant copies of homologous proteins. The existence of such data sets may be useful for conducting statistical analyses of protein sequence-structure relations. At the current stage, our approach's focus is to obtain an "average" protein data set for any particular organism without posing much selection bias. However, with the three major protein structural parameters deeply integrated into the scheme, it would be a trivial task to extend the current method for obtaining a more selective protein data set, which may facilitate the study of some particular protein structure.


Background
The accurate recognition of translational initiation sites (TISs) in mRNA sequences is crucial to identifying the primary structure of the corresponding proteins. A number of computational approaches have been proposed which aim at predicting TISs without human intervention. In order to better evaluate the merit of a newly proposed approach, it is essential to conduct a thorough comparative study that involves several existing approaches. This requires the existence of some high quality benchmark data sets that can be easily used for testing most existing methods.
So far, only a few data sets used for this purpose have been made available and a description of them is in order. Pedersen and Nielsen [1] generated a vertebrate sequence collection and an Arabidopsis thaliana collection. The former consists of sequences from eight model vertebrate organisms and the latter contains sequences from Arabidopsis thaliana (thale cress, a dicot plant). The original sequences were the nuclear genes with annotated start codon which were selected from GenBank release 95. Each of them was spliced by eliminating the possible introns and joining the remaining exon parts. Sequences that contain less than 10 nucleotides upstream of the start codon, or less than 150 nucleotides downstream, or have non-nucleotide symbols in the intervals were eliminated. The resulting data sets then underwent a redundancy elimination process which removes redundant sequences that are from homologous genes. The final data sets contain 3312 vertebrates sequences and 523 Arabidopsis thaliana sequences. The vertebrates collection has been widely used in academia as a benchmark collection for a certain TIS recognition method. However, originating from the GenBank, the sequences in this collection are susceptible to sequencing errors. The authors also did not discuss about the quality of the TIS annotation provided by Gen-Bank. As well, due to excessive trimming, on average, a sequence is only about 160 nt long, and it generally does not contain the complete ORF. This may hinder the application of the classifiers which rely on the information about ORFs. Also due to trimming, there exist only about four AUGs in one sequence (within a complete ORF, there are usually around 20 AUGs), which potentially leads to an overly optimistic estimation of the performance of a given algorithm. Both data sets are downloadable from the Internet http://www.cbs.dtu.dk/services/NetStart/.
Another existing testing data set was constructed by Nadershahi et al. [2], which is also accessible through the public domain (http://www.biomedcentral.com/content/ supplementary/1471-2105-5-14-S1.txt and http:// www.biomedcentral.com/content/supplementary/1471-2105-5-14-S2.txt). The authors had compiled a list of 100 human EST sequences, 50 of which contain complete ORFs and 50 do not. Originally a search in the GenBank database for human genomic sequences containing the annotation of "complete CDS" was used. No further discussion regarding the annotation's reliability was given by the authors. A filter for NCBI RefSeq entries was then applied to obtain a non-redundant collection. They were further filtered by using the UniGene clusters as reference. In the end, 50 clusters containing ProtEST links were randomly selected to construct the final data set. Although the authors used this data set to evaluate several state-ofthe-art TIS predictors in their paper, it has not been widely used by other researchers, possibly due to the following two factors. Firstly, the collection is fairly small containing only 100 sequences, and secondly, ESTs may include errors leading to frame shifts and they only represent segments of the parent cDNA, therefore require additional curation before the determination of corresponding protein sequence.
Hatzigeorgiou [3] also generated a human cDNA sequence collection to test her neural network TIS classifier. The protein database SwissProt was used as the source, where all the human proteins whose N-terminal sites are sequenced at the amino acid level were collected and manually checked. Then corresponding full-length mRNAs with verified TISs are selected. In the end, 475 derived human cDNAs were obtained. This is one of the few published data sets for TIS prediction that contain highly reliable sequences. However, the collection construction scheme requires human intervention, and the data set is not accessible through the public domain.
Saeys et al. [4] compiled a data set from the consensus CDS (CCDS) database by selecting all the genes with a consensus TIS (i.e., the triplet of ATG). The authors' justification of using the CCDS repository is that the core set of human protein coding regions included in the database are consistently annotated and of high quality. Annotation updates represent genes that are defined by a mixture of manual curation and automated computational processing. The quality tests performed include consistency in cross-species analysis, analyses to identify putative pseudogenes, retrotransposed genes, consensus splice sites, supporting transcripts and protein homology. The resulting data collection contains 13917 sequences. There are several limitations of using this collection to test a TIS classifier: a) the data set is too large: the file that has undergone compression is almost 40 mega bytes in size, making it quite impractical for efficient testing; b) the database itself only contains genomic sequences, which deviates from the type of nucleotide sequences that are actually used by the translational mechanism (mRNAs); and c) up to today, CCDS only investigates the core CDS sets of two organisms: human and mouse, which greatly limits the applicability of the collection construction scheme proposed by the authors.
Hu et al. [5] proposed a computational approach for predicting TIS in prokaryotic genomes. The method models a positional weight matrix (PWM) of aligned sequences around predicted TISs in terms of a linear combination of three elementary PWMs. Based upon the best predictions output by the constituents of the computational model, an annotated TIS database called SupTISA is constructed. The annotation of the data set has been shown to present a high quality when validated on experimentally verified TISs set EcoGene [6], a genome sequence database for Escherichia coli. However, there are several drawbacks of this database. Firstly, it is only proposed for prokaryotic genomes, which limits its applicability. Secondly, it is completely built upon an existing TIS predictor, which may involve some selection bias. Therefore, we believe it does not meet the necessary criteria of a benchmark data set.
Cai et al. [7] proposed a web-based database called ATID which consists of 300 genes from Homo sapiens, Mus musculus and other species, where each gene is shown to have multiple translational initiation sites. In our current version of the algorithm, genes that are associated with alternative translational initiation events are excluded. We believe that the ATID database may serve as a good complement to the data sets generated by our algorithm.
Our review of the literature has inspired us to propose a general scheme for constructing high quality, reliable, representative, non-redundant and easily accessible data sets that can facilitate the evaluation of algorithms used to identify gene characteristics such as TIS prediction. In this paper, we describe such an algorithm. We used four model organisms as an example and reported characteristics on the four sample data sets based on the analysis of the selected proteins' molecular weight, isoelectric point and hydrophobicity profile. These results have led to some interesting observations that may be valuable for protein studies in general. Some open-ended discussion has been provided as well with the hope of inviting further input on the subject from the audience of this article.

Source Sequence Repository
To locate the most reliable sequence source available, we reviewed the following state-of-the-art molecular biology databases. The GenBank is a collection of publicly available annotated nucleotide sequences, including mRNA sequences with coding regions, segments of genomic DNA with a single gene or multiple genes as well as ribosomal RNA gene clusters. Though complete, it includes all sequence data submitted, possibly containing erroneous ones. Therefore the repository is susceptible to redundancy and errors. UniProt is a protein sequence database that was formed through the merger of SwissProt, TrEMBL and PIR-PSD. It provides a collection of functional information on proteins with rich annotation. However the entries in UniProt are represented by amino acid sequences, which cannot be directly mapped to nucleotide sequences. Consequently the database is not suitable for constructing benchmark data sets for TIS predictor evaluation in that almost all of the existing methods only analyze nucleotide sequences. The Reference Sequence (RefSeq) database is a curated collection of DNA, RNA, and protein sequences built by NCBI. It aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences for taxonomically diverse organisms ranging from eukaryotes to bacteria to viruses. However, some possible limitations of the annotation procedure applied to yield sequences included in RefSeq are implied in Nielsen and Krogh's study [8] which employs an automatic prokaryotic gene predictor to validate the annotation quality of GenBank and RefSeq respectively using several model prokaryotic organisms. Although we believe that mistakes may exist in some of the RefSeq's annotation, to the best of our knowledge, the sequence repository still ranks higher than its peers in terms of accuracy and reliability. Therefore, we still decide to use RefSeq as the source database to construct the data sets in our experiments. It is worth noting that the algorithm that is to be described in subsequent sections is a general procedure which is independent from the repository that is chosen.

Implementation
Our data construction algorithm consists of three phases. In Phase I, we start with retrieving a complete mRNA sequence collection for a given organism from the RefSeq database, then we eliminate the sequences that have at least one of the following problems: a) the corresponding protein contains non-standard amino acid(s), b) the start codon is not AUG, and c) there exists some data format error. The cardinality of a data set at a particular phase (denoted by Card.) and the size portion of the Phase I collection that has been retained in this phase (denoted by Ratio).
Phase II intends to select the most representative mRNA sequences out from the Phase I data set. By representative, we mean the mRNAs whose corresponding protein products are commonly found in the organism proteomics. Three different characteristics are considered in our implemenation: the molecular weight (MW), the isoelectric point (pI) and the hydrophobicity index (HI) profile. Assuming some appropriate settings are found for the parameters of MW and pI, then our approach can identify a dataset that corresponds to a perceived "average" expressed protein population as determined by current proteomic investigations. Due to technical reasons (since analyzing the HI profile is a more computationally intensive task than the previous two), we divide the Phase II selection process into two stages and let us denote the intermediate data collection that include sequences satisfying conditions a and b by the Phase II-Intermediate collection.
It is a trivial task to compute the molecular weight for a given protein whose primary structure is known. Eq. 1 shows the formula, where n refers to the number of amino acids in the protein and aim i refers to the average isotopic mass of the i-th amino acid.
To facilitate the introduction to our solution for calculating pI, let us quickly review the definition of the isoelectric point of a protein -the pH value in which the net charge (1) Presentation of the MW-pI plot for Phase I data sets Figure 1 Presentation of the MW-pI plot for Phase I data sets. Each sequence entry's corresponding protein is represented by a (MW, pI) tuple, therefore, the entire collection can be plotted onto a two-dimensional space. To highlight the difference of population densities among different regions of the graph, a variable called density is employed. It controls the color of the scattered points on the plot, where the plot space is viewed as a grid that consists of many small cells each of which has a length of 10 KDa along the X-axis and a width of 0.1 pH along the Y-axis. Each cell's local population (the number of points that fall into that region) is recorded and its ratio over the entire population is calculated. Since most of the ratios are very small numbers, they have been multiplied by 10000 before they are used in plotting the graph. where pK n is the acid dissociation constant of a negatively charged amino acid or the C-terminal and pK p is the acid dissociation constant of a positively charged amino acid or the N-terminal. The net charge of the protein equals to the sum of y n and y p . Therefore, by experimenting with different pH values, we can locate the one that results in the charge sum to be zero. This pH value is then the isoelectric point of the protein under investigation. In our implementation, Nozaki and Tanford's pK table [9] has been used.
To take the factor of hydrophobicity index profile into consideration, we initiate a statistical analysis on the sequences from the Phase II-Intermediate data set to facilitate the selection of the sequences whose HI fit in the average HI profile for the expressed proteins in this organism. The procedure is explained as follows: 1. For each sequence in Phase II-Intermediate data collection, obtain its hydropathy plot using the Kyte- Illustrates the MW-pI plot for Phase II data sets Figure 2 Illustrates the MW-pI plot for Phase II data sets. Compared to Figure 1, narrower value ranges are used for MW and pI, where MW is in between 20-70 KDa, and pI is in between 5 to 9 pH. 2. For each of the five metrics mentioned above, calculate the mean and the sample variance S using Eq. 4 and Eq. 5 respectively.
3. Filter the sequences in the Phase II-Intermediate collection so that a qualifiable sequence has a hydrophobicity profile whose values for the above 5 variables lie within the range of for any individual variable. It is worth noting that according to the empirical rule [11], in the case of a normal distribution, 95% of the population will lie within this range.
In Phase III, we further eliminate some sequences that are not suitable for TIS predictor evaluation from Phase II if one meets at least one of the following two conditions: a) the number of the nucleotides in the 5' untranslated region (UTR) is less than 12, and b) if the protein that corre- This table reports the statistics of the molecular weight of the proteins corresponding to the transcript sequences included in Phases I, II, III data sets (Unit: KDa). To facilitate the analysis, the entire MW value range has been divided into several smaller ranges (e.g., 0-20 KDa, 20-40 KDa, etc). Throughout the table, two measures have been employed: PR, short for Population Ratio, is a horizontal comparison which compares the protein population within a group against other groups in the same phase; whereas LR, short for Loss Ratio, is a vertical comparison which, for a certain group, considers the portion of the protein population that have been lost through phase transitions. More specifically, given a particular aspect X under investigation (e.g. MW, pI, etc), PR calculates the portion of the number of proteins whose value on X is within a particular range (e.g, [20, 40) KDa, [8, 9) pH, etc) over that of the entire protein population in the same phase. LR computes the ratio of the number of eliminated proteins whose value on X is within a particular range Y over that of the original (Phase I) protein population within that range Y. The data that have been highlighted in italic and bold font are the outstanding numbers within the row it belongs to.
sponds to the mRNA sequence belongs to a homologous protein family and an mRNA that represents this group of proteins has already been included in the data set.
The reason for having condition a is to facilitate the usage and increase the efficiency of the data set for TIS prediction because if untrimmed, the transcript sequence collection, when tested on most of the established TIS classifiers, may require considerably more computational resources than that is normally able to be offered. Since most existing systems consider a neighborhood window that is upstream to a putative start codon, which usually spans from positions -12 to -1 [12,13], we only consider the sequences that have a 5' UTR which at least contains 12 nucleotides. We realize that such a strategy may not be valid for testing any approach which relies on a 5' UTR of a wider range. Should such a demand becomes a reality, it is trivial to reconstruct the data sets by following the generalized post-processing algorithm that is presented in the ensuing text. We denote the data set that include only the sequences having 5' UTR of 12 or more nucleotides by the Phase III-Intermediate collection.
Having a check on condition b avoids selecting redundant transcript sequences that lead to homologous proteins. We used the BLAST tool [14]http:// blast.ncbi.nlm.nih.gov/Blast.cgi to analyze homology between a pair of sequences. The procedure is described as follows: 1. Construct a BLAST database using only the protein sequences that correspond to the entries in the Phase III-Intermediate collection.
2. Initialize two empty entry ID lists for the qualifiable and the redundant sequences, which are denoted by L q and L r respectively.
3. For each sequence in the collection under investigation, check whether its ID is in L r . If so, skip this sequence and continue with the next one in the data set. Otherwise, use blastp to locate the entries that are homologous to the query entry (with a threshold value for e of 1e-30). Add those entry IDs to L r .

Repeat
Step 3 until every sequence has been scanned.
To complete the construction of the Phase III sequence collection, we have employed a post-processing procedure. Although a complete mRNA sequence contains the 5' cap, the 5' UTR, the 3' UTR and a poly-A tail in addition to the CDS, not all of these regions are of critical importance to identify the initiation site for translation. Therefore we conduct a randomized trimming process on the sequences in the collection. The details of the postprocessing strategy follows: 1. Set max u and min u to be the maximal and the minimal numbers of nucleotides in the 5' UTR after trimming. Set max d and min d to be the maximum and the minimum of the size of the 3' UTR after trimming.

For 5' UTR:
(a) obtain a random number rand u ranging from min u to max u

Data Sets
We have applied the aforementioned algorithms to generate four representative transcript sequence collections for the following model organisms: Homo sapiens (human), Mus musculus (house mouse), Caenorhabditis elegans (roundworm) and Drosophila melanogaster (fruit fly). A quick summary of the number of sequences in the data sets at different phases is shown in Table 1. As we desired to obtain data sets of transcripts whose expressed polypeptide products were representative of proteins most commonly expressed in cells, comparison with the protein universe identified in proteomic studies was in order. Unfortunately, published studies showing the physical distribution of proteins identified by proteomic analysis are relatively uncommon, far outnumbered by those evaluating specific protein classes and/or concerned with protein or protein subdomain identification and classification [15,16]. A very thorough proteomic analysis of 8 different human tissues did show distributions of proteins within protein structural parameters such as molecular weight, pI and hydrophobicity that we could use for comparison [17]. From their data, approximately 80% of expressed proteins distribute between 10-75 KDa MW. Approximately 78% of proteins segregate between pI 5-9. As well, a very good normal distribution of protein hydrophobicity was observed, with approximately 94% of proteins distributing within the middle ranges (GRAVY score 20-70) of hydrophobicity. In a study of the mouse liver proteome using advanced mass spectrometry [18], similar findings were also revealed. From the 3D plot showing the UniProt complete mouse proteome, a majority of the expressed proteins lie in the range of 10-100 KDa MW and 5-9 for pI. Although it is premature to assume that these ranges would apply to any organism, they have however reflected the actual mammalian proteomic data. Having a focus of presenting a general algorithm for selecting the transcript sequences that yieid a set of proteins satisfying particular parameter settings, we decide to adopt a sample scheme for configuring the MW and pI parameters based upon our experience with the proteomic literature concerning human proteins where MW is within the range of 20-70 KDa and pI is between 5-9. It is worth noting that the filtration of the data sets in Phase II is arbitrary and it is not intended to serve as a universal strategy for every organism.
In order to provide a reference of the characteristics of the average protein groups in each of these organisms, we conducted some statistical analysis on them based on the measurements of MW, pI and the hydropathy plots. Figures. 1, 2, 3 visualize the transcripts' c orresponding proteins' MW vs pI plots for each of the three phases. Table 2 reports the statistics related to a variety of MW ranges given the data sets of three phases. Table 3 summarizes the results of the statistical analysis conducted from the perspective of isoelectric point. All of the remaining tables are dedicated to the protein hydropathy analysis. Since for a given protein, a continuous hydropathy plot can be obtained, we try to capture its characteristics by analyzing the following aspects. Tables 4, 5 refer to the analysis conducted on the maximal and minimal hydrophobicity values within the hydrophobicity plot. Tables 6, 7 illustrate the results regarding the average positive hydrophobicity values and the average negative hydrophobicity values. Table 8 reports a series of statistics that are relevant to the percentage of hydrophobic residues within the plot.
Some observations can be made from these illustrations. Interestingly, the significant density differences observed within the MW-pI plots ( Figures. 1, 2, 3) show the nonrandom nature of the protein expression spectrum within the "average" mRNA set. For instance, from the Phase I plot, we can see that for all of the four model organisms, the region with the biggest density (shown in yellow hue) is within the range of [20K, 70K] Da for MW and [5,9] pH for pI. This verifies the hypothesis we made earlier about the average profile of proteins in terms of MW and pI. As well, in a vertical comparison among the population, as the value of molecular weight gets bigger, the value range of the corresponding pI becomes narrower, demonstrating a tendency of converging to the [5,9] range as MW increases. Since a majority of the sequences have been eliminated from the Phase I data set due to the subsequent selection procedure, Figures 2, 3 are much more sparse than Figure 1. It is especially easy to n otice two horizontal gaps in both plots where the pI value is in between [7.5, 8) and (8, 8.5] pH. This is an indicator of the sparse distribution of the protein population whose pI lies within those two ranges. A table that provides the counts of the proteins fall under those two categories would suffice to verify our interpretation of the visual data (see Table 9). From the table we can see that in either Phase II or Phase III, since only 3 to 8 percent of the protein population have a pI value falling under those two ranges, the density in those horizontal areas is fairly small, thus the gap is shown. In order to verify our hypothesis that the gap is a result of some statistically significant reason other than a coincidence, we have conducted a paired statistical t-test. The procedure is summarized as follows. Assume the population of the organism are distributed at a random manner within the pI range 5 to 9. Therefore, for a horizontal stripe whose pI value ranges from 7.5 to 7.9 or 8.1 to 8.5, roughly 1/10 of the entire population should fall into that stripe. If we consider both regions ([7.5, 7.9] and [8. 1,8.5]), we should expect 2/10 of the population. Using the four model organisms' related data as the sample data, a The table reports the statistics of the data sets at different phases with the focus on the proteins' hydropathy plot. The particular metric called average positive hydrophobicity is computed by averaging all the hydrophobicity index values that are higher than or equal to 0 within one plot. This parameter is yielded by averaging over all the hydrophobicity values that are less than 0 within one plot. As well, more than 90% of the proteins satisfy the pI condition of ranging between 5 and 14 pH, with the majority fall under the range of 6-8. The data related to the Phase II and Phase III data sets reflect that consistently about 45% of the proteins have an isoelectric point that is in between 6 and 8 pH, making us believe in these four model organisms, a majority of the proteins that fall under an average profile are neutral.
Tables 4 to 8 have shown the hydrophobicity profile analysis (using Kyte-Doolittle scale) conducted on the data sets from different phases. In Phase I, generally speaking, A visual presentation of the Phase III data sets' MW-pI plot Figure 3 A visual presentation of the Phase III data sets' MW-pI plot. In obtaining the data sets at this phase, more restrictions are applied, making the distributions of scattered points even more sparse than Figure 2. more than 85% of the proteins have a hydrophobicity profile whose peak value is in between 2 and 4, more than 62% of the proteins have a HI profile with a lowest value that is in between -4 and -3, the average values for positive and negative hydropathy lie in the ranges of [0, 1] and [-2, -1] respectively, and the average percentage of hydrophobic residues falls under the range of [20%, 60%] for a majority of the protein population. Due to the selection process, collections in subsequent phases demonstrate that value ranges on the aforementioned five aspects have been narrowed. For instance, now in most of the cases, the maximal hydropathy value is 2 to 3 and the minimal one is -4 to -3. Although these information are still quite limited for providing a comprehensive representation of the hydrophobicity profiles associated with all of the proteins under investigation, they do offer some means of interpreting the hydropathy index plots.
It is important to note that so far for all of the four organisms, we have used a general range configuration for both MW and pI, which are 20-70 KDa and 5-9 respectively. Although this setting tends to reflect the characteristics of mammalian proteomic data (e.g. H. sapiens and M. musculus), it is more or less an arbitrary choice for C. elegans and D. melanogaster. However, the benchmark generation procedure that we present in this paper is independent of the values of the parameters that are predefined by the user.
To tailor towards the characteristics of the expressed proteins in a particular species, a thorough investigation of the molecular weight and isoelectric point of the proteins may be necessary. If such investigative operation is not desired, the user may choose to follow a strategy described as follows: the complete valid ranges for MW and pI can be divided up into several groups according to the convention observed in the proteomic study (Table 2 and Table  3 offer an example of this tactic), then each of these subranges are used individually to specify the parameter setting for the benchmark generation procedure and in the end, all of the resulting data sets are put together to comprise the representative sequence collection for the given organism. To demonstrate the generality and applicability of the proposed approach, we have used the aforementioned strategy to construct a second set of transcript collection for C. elegans and we refer to it as the C. elegans II. During the production process, we have used rather extensive ranges for the parameters MW (0-160 KDa) and pI (0-14) and divided them up into smaller sub-ranges which are consistent with the ones seen from Table 2 and  Table 3. In the end, we combine all of the subgroups together to arrive at the complete representative sequence collections. Figure 4 illustrates the MW-pI plots for each of the three phases that the data set construction process has undergone. From the plots we can observe that eventually (in Phase III) the most populated regions converge to certain ranges, i.e., MW in between 20-60 KDa and pI in between 4 to 10. A bimodal pattern can also be observed. The applicability of our strategy is therefore well demonstrated.

Evaluation
In order to demonstrate the applicability of the proposed algorithm, we have employed several state-of-the-art predictors for further evaluation,  We have employed four evaluation metrics to measure the performance of the TIS predictor: sensitivity (Sn), specificity (Sp), Adjusted Accuracy (AA) and Overall Accuracy (OA). A high sensitivity indicates a big portion of true TISs have been predicted. A high specificity means a big portion of pseudo-TISs have been accurately recognized by the predictor. Due to the evaluation bias each of these two metrics imposes, it is prudent not to discuss them in isolation. Instead, adjusted accuracy, which is the average of the two, serves as a more comprehensive metric. Each of these Presentation of the MW-pI plots for the C. elegans II data sets in three phases Figure 4 Presentation of the MW-pI plots for the C. elegans II data sets in three phases. These plots represent the distribution of the Phase I, II, III data sets for C. elegans II for the purpose of demonstrating the applicability of the proposed data set generation method in the absence of a priori parameter configuration scheme. We suggest the user to divide the entire ranges of the parameters (e.g. MW and pI) into subgroups and use the proposed algorithm to handle each subgroup individually. Ultimately a representative data set is obtained by combining all of the subgroups together. From the plots we can observe that eventually (in Phase III) the most populated regions converge to some certain ranges, i.e., MW in between 20-60 KDa and pI in between 4 to 10. A bimodal pattern can also be observed.  Table 10 is available. The formulas are listed as follows: A brief comparison between the data shown in Table 11 and Table 12 identifies the following: (1) GENSCAN, as a leading gene predictor, fails to perform adequately on the vertebrates and Arabidopsis thaliana collections while its performance on our H. sapiens data set seems more reasonable. This may be an indication of the extra trimming procedure conducted on the sequences included in the former two collections (or NetStart). Our explanation is described as follows. First of all, GENSCAN is mainly dedicated to predicting genes in genomic sequences. Therefore for any predicted gene, at most one AUG will be classified as TIS. The sequences in the NetStart collections have undergone extensive trimming process, resulting in the average length of these sequences to be only 150 nucleotides long. Therefore, if the GENSCAN software fails to identify a sequence in NetStart as the start of a gene, it will not make a positive TIS prediction, thus yielding a small true positive and a large false negative, which implies a small sensitivity. Secondly, GENSCAN relies heavily on the statistics obtained by analyzing long stretch of genomic sequences and it also takes advantage of compositional bias between exons and introns. However the NetStart sequences are simply too short and none of them   contain an entire open reading frame, which consequently lead to the surprisingly low performance offered by GENSCAN; (2) other than CUB, which demonstrates consistent performance on all of the data sets shown in Table 11 and Table 12, most of the peer predictors including NetStart, GENSCAN and TISHunter have shown inconsistent prediction performance on different genomes.

Conclusion
In this paper, we have presented a general procedure for constructing representative mRNA sequence collections for the purpose of testing translational initiation site recognition approaches. We believe an ideal benchmark test set should contain high quality mRNA sequences that are error-free and contain complete ORFs. It should also be highly representative, non-redundant and easily accessible. One definition for a "representative" data set would be for the expressed proteins to be functionally representative of what is obtained from proteomic analysis. Thus, the expressed protein data set should approximate the published distribution of proteins from proteomic analysis.
To the best of our knowledge, very few data sets have been widely used for evaluating an existing TIS prediction approach. Most of the collections proposed in the literature violate one or more of the aforementioned rules. To ensure the reliability of our data sets, we only selected the mRNA sequences from the NCBI RefSeq database, which is a high quality, non-redundant sequence repository that has been curated either manually or by specialized programs. We analyze the characteristics associated with each mRNA's corresponding protein product and conduct further elimination on the data set to remove the sequences that do not correspond to proteins which present an average profile for a given organism. In order to avoid the possibility of exaggerating the performance of an approach to be tested, for a family of homologous proteins, only one corresponding mRNA sequence is retained to represent the group of homologous sequences. It is worth noting that though redundancy elimination through homology examination is a common routine for benchmark constructions, the homology in proteins does not necessarily indicate similar context around TIS, which may distort the accuracy of a predictor. However, with the intention of proposing a general strategy for benchmark collection generation, we still choose to use homology as a guideline to eliminate redundant copies of homologous proteins, hoping that the resulting collections can be used to more general areas such as proteomic investigation.
Four model organisms have been used as an example to illustrate the construction procedure and to demonstrate its property of being a general and widely applicable scheme. Every data set in every phase has been analyzed in terms of the molecular weight, isoelectric point and hydrophobicity profile of the corresponding protein products. Colored plots have also been used to help visualize the data yielded by our analysis. Our comparison with published proteomic studies has shown that expression of our data set of transcripts generates a polypeptide population that is representative of that obtained from evaluation of biological specimens. Hence, we believe that our data set represents "real world" transcripts that will allow more accurate evaluation of algorithms dedicated to identification of TISs, as well as other translational regulatory motifs within mRNA sequences.
Although during the process of constructing these four data sets, we have selected a specific range for the parameters (MW and pI), we have also proposed a strategy that can be coupled with the proposed algorithm when the user does not intend to impose any pre-defined parameter setting. A sample collection on C. elegans has been constructed to demonstrate the applicability of the proposal.
Several state-of-the-art predictors have been employed to evaluate the actual merits of the data collections as benchmark testing sets. All of the systems have demonstrated rational performance using the H. sapiens and M. musculus data sets constructed by us. Comparative study has also been conducted which reports the predictors' performance on three other existing benchmark data sets. Inconsistent behavior has been observed on some of the approaches. In particular, GENSCAN appears to perform inadequately when tested on two of the existing benchmark sets. This further justifies the need for some TIS benchmark collections of higher quality.

Availability and Requirements
The data sets derived from four model organisms will be made available to the public through the following URL: http://www.cpsc.ucalgary.ca/~jzeng/tis_datasets/