Erratum to: genetic algorithm learning as a robust approach to RNA editing site prediction
- James Thompson^{1} and
- Shuba Gopal^{1}
https://doi.org/10.1186/1471-2105-7-406
© Thompson and Gopal; licensee BioMed Central Ltd. 2006
Received: 16 August 2006
Accepted: 06 September 2006
Published: 06 September 2006
The original article was published in BMC Bioinformatics 2006 7:145
After the publication of [1], we were alerted to an error in our data. The error was an off-by-one miscalculation in the extraction of position information for our set of true negatives. Our data set should have used randomly selected non-edited cytosines (C) as true negatives, but the data generation phase instead produced a set of nucleotides that were each one nucleotide downstream of known, unedited cytosines. This error changes our results, although the general conclusions presented in our original publication remain largely unchanged.
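A slip of this kind can be illustrated with a short sketch. It assumes, purely for illustration, that the shift came from using a 1-based annotation coordinate as a 0-based string index; the text above does not state the actual cause, and the toy sequence and function names are hypothetical.

```python
def base_at(genome: str, pos_1based: int) -> str:
    """Return the nucleotide at a 1-based (GenBank-style) position."""
    return genome[pos_1based - 1]


def base_at_buggy(genome: str, pos_1based: int) -> str:
    """Hypothetical bug: the 1-based position is used directly as a
    0-based index, so the base one nucleotide downstream is returned."""
    return genome[pos_1based]


genome = "ATGCATCGTA"          # toy sequence, not real data
unedited_c_positions = [4, 7]  # 1-based positions of cytosines in the toy sequence

correct = [base_at(genome, p) for p in unedited_c_positions]
shifted = [base_at_buggy(genome, p) for p in unedited_c_positions]

print(correct)  # the intended true negatives are all cytosines
print(shifted)  # the shifted set holds the downstream neighbours instead
```

In the toy sequence the correct lookup returns cytosines, while the shifted lookup returns whatever base happens to follow each cytosine.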
Modifications to implementation
Changes to data sets
After correcting for the off-by-one error in the data generation phase, we re-evaluated the data sets for all three of the genomes analyzed. Since the publication of our original work, the mitochondrial genomes of all three species have been updated. We therefore decided to revise our data sets using the new (as of April 2006) GenBank files for Arabidopsis thaliana, Brassica napus and Oryza sativa ([GenBank: NC_001284, GenBank: AP006644, GenBank: BA000029]).
As before, we focused on those edit sites associated with coding regions. In reviewing these updated GenBank files, we determined that certain edit sites were ambiguous for one of three reasons. Some C → U editing sites could not be reliably assigned to one coding region, while others were not on the same strand as the annotated coding region. A smaller proportion of annotated edit sites were not cytosines (C) in the genomic sequence on the strand containing the relevant coding region. In addition, a few coding regions involved complex processes such as trans-splicing, and the annotated CDS coordinates did not yield a coding sequence that could be translated to the reported protein sequence. These discrepancies were of some concern to us since we could not independently confirm the presence or absence of editing. We therefore selected the subset of annotated edit sites that were unambiguous and could be reliably assigned to a coding region whose translation exactly matched the annotated entry. From the set of 455 annotated edit sites in the A. thaliana mitochondrial genome, we retained 344 edit sites as unambiguous (see Additional File 1). For the B. napus genome, we retained 397 edit sites out of 428 annotated sites (see Additional File 2), and in the O. sativa genome, we utilized 419 edit sites out of the 485 annotated sites (see Additional File 3). For each set of true positives selected from the annotated edit sites, we chose an equivalent number of true negatives after correcting for the off-by-one error.
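The strand and cytosine checks, together with the selection of a matching number of true negatives, might be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to build the Additional Files: it handles a single unspliced coding region, omits the translation check, and all names and the toy sequence are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def base_on_strand(genome: str, pos_1based: int, strand: int) -> str:
    """Base at a 1-based position, read on the given strand (+1 or -1)."""
    base = genome[pos_1based - 1]
    return base if strand == 1 else base.translate(COMPLEMENT)


def unambiguous_sites(genome, sites, cds_start, cds_end, cds_strand):
    """Keep annotated sites that lie inside the CDS, share its strand,
    and are a cytosine on that strand."""
    kept = []
    for pos, strand in sites:
        inside = cds_start <= pos <= cds_end
        if inside and strand == cds_strand and base_on_strand(genome, pos, strand) == "C":
            kept.append(pos)
    return kept


def sample_true_negatives(genome, cds_start, cds_end, cds_strand, edited, n, seed=0):
    """Randomly pick n cytosines in the CDS that are not annotated as edited."""
    candidates = [
        p for p in range(cds_start, cds_end + 1)
        if p not in edited and base_on_strand(genome, p, cds_strand) == "C"
    ]
    return random.Random(seed).sample(candidates, n)


# Toy data (hypothetical): one CDS spanning the whole forward strand.
genome = "ATGCCGTCAACGCTA"
annotated = [(4, 1), (7, 1), (8, -1), (11, 1)]  # (1-based position, strand)

kept = unambiguous_sites(genome, annotated, 1, len(genome), 1)
negatives = sample_true_negatives(genome, 1, len(genome), 1, set(kept), n=len(kept))
```

Here position 7 is dropped because it is not a cytosine and position 8 because it is annotated on the opposite strand; the negatives are drawn only from cytosines that were not retained as edit sites.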
As before, we used the set of true positives and negatives from A. thaliana to train our genetic algorithm (GA) and tested its performance using cross-validation. We made one minor change to the method: we now use 10-fold cross-validation. This process involves reserving a randomly selected 10% of the known edited and unedited sites for testing. The remaining 90% of the data are used for training the GA. Ten such iterative splits are conducted, with training and testing occurring after each split. This has been demonstrated to reliably sample the entire data space in a data set of this size [2]. The results reported are the average of performance across all ten iterative splits.
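The splitting scheme described above can be sketched in a few lines. The sketch assumes the ten test sets are non-overlapping folds of a single shuffle, which is the usual reading of 10-fold cross-validation; `train_fn` and `score_fn` are hypothetical stand-ins for GA training and evaluation.

```python
import random


def ten_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs; each item is in exactly one test fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(items, train_fn, score_fn, k=10, seed=0):
    """Average a performance score over all k train/test splits."""
    scores = []
    for train, test in ten_fold_splits(items, k, seed):
        model = train_fn(train)  # hypothetical: train the GA on ~90% of sites
        scores.append(score_fn(model, test))  # hypothetical: evaluate on ~10%
    return sum(scores) / len(scores)


# 344 edited plus 344 unedited A. thaliana sites give 688 labelled examples.
splits = list(ten_fold_splits(range(688)))
```

With 688 examples, each test fold holds 68 or 69 sites and each training set holds the rest.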
Changes to GA development and training
In the process of reviewing our results with the corrected data, we had to modify our fitness function to improve performance. Our new fitness function is derived from the effect size statistic (also known as Cohen's d), a measure of how far apart the means of two distributions are [3]. In this instance, the two distributions represent the GA scores for known true positives and known true negatives respectively (Figure 1). By using the effect size statistic, we could maximize the distance between these two distributions' means. In other words, we could obtain the best classification by ensuring that the means of the two distributions were as far apart as possible. The effect size statistic is calculated as follows:
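For reference, the standard form of the effect size statistic is the difference between the two group means divided by the pooled standard deviation, d = (mean_1 - mean_2) / s_pooled. The sketch below implements that textbook form; whether REGAL's fitness function uses exactly this expression is an assumption, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(pos_scores, neg_scores):
    """Standardized mean difference between two samples of GA scores,
    using the pooled sample standard deviation."""
    n1, n2 = len(pos_scores), len(neg_scores)
    pooled_var = (
        (n1 - 1) * variance(pos_scores) + (n2 - 1) * variance(neg_scores)
    ) / (n1 + n2 - 2)
    return (mean(pos_scores) - mean(neg_scores)) / sqrt(pooled_var)


# Toy scores (not real data): the larger d is, the better separated the
# score distributions of true positives and true negatives are.
d = cohens_d([4, 5, 6], [1, 2, 3])
```

A GA that maximizes this quantity is rewarded for pushing the two score distributions apart relative to their spread, which matches the motivation given above.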
The objective values for each of the six variables remain as before (see Additional File 4).
Based on this new fitness function, we identified the best organism during 10-fold cross-validation on the A. thaliana genome. This GA organism has a GA genome with the following structure:
010100111110111101101100100010001000111101111000110010101001010000100101000011100100011000000001
The above GA organism is now encoded in the updated version of REGAL (RNA Editing site prediction by Genetic Algorithm Learning) included here (see Additional File 5).
Changes to REGAL output
In the course of reviewing our analysis, one aspect of the assessment of performance seemed to be somewhat limited in applicability. In our assessment of performance [1], we used sensitivity and specificity to demonstrate the ability of our classifier to make reliable predictions. That analysis provided an overall measure of the likelihood that predictions are correct. However, we did not assign an individual likelihood to each prediction so that users might immediately assess the likelihood that any given prediction is correct. We have now added an additional feature to the REGAL software that allows for an estimate of the likelihood that any given prediction is correct.
Since our analysis relies on Bayesian probability, the intervals we report are 90% credible intervals [4]. We can interpret these as roughly similar to the 90% confidence levels in a frequentist statistical analysis [5, 6]. In other words, when REGAL predicts that a site is edited, and the score assigned to that site is greater than 33,000, we have at least 90% confidence that the prediction is true. Similarly, if REGAL were to assign a score less than 20,000 for a cytosine, we would have 90% or greater confidence that the site was unedited. In comparing the performance of REGAL with the other methods for predicting edit sites in these genomes, we consider only those predictions that are in the 90% credible interval range. Considering results from a set of credible intervals is a well-established and accepted practice in the statistical analysis of classifiers [2, 5, ?, 7]. It allows us to assess the performance of REGAL based on those predictions that have the greatest confidence.
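The two score cut-offs quoted above can be applied as a simple filter. This is a minimal sketch that uses only the 33,000 and 20,000 thresholds from the text; the labels, function names and example scores are hypothetical, and the sketch does not reproduce how REGAL derives the scores.

```python
EDITED_90 = 33_000    # scores above this: edited, with at least 90% confidence
UNEDITED_90 = 20_000  # scores below this: unedited, with at least 90% confidence


def confidence_band(score: float) -> str:
    """Assign a GA score to one of three bands based on the two cut-offs."""
    if score > EDITED_90:
        return "edited"
    if score < UNEDITED_90:
        return "unedited"
    return "below 90% confidence"


def high_confidence_calls(scores: dict) -> dict:
    """Keep only the predictions that fall in the 90% credible range."""
    calls = {site: confidence_band(s) for site, s in scores.items()}
    return {site: c for site, c in calls.items() if c != "below 90% confidence"}


# Toy scores keyed by a hypothetical site identifier.
example = {"site_a": 41_500, "site_b": 27_000, "site_c": 12_300}
calls = high_confidence_calls(example)
```

Sites scoring between the two cut-offs are still predicted by REGAL, but they are excluded from the comparisons restricted to the 90% credible range.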
Corrected results
Using the optimized weights, we scored each cytosine in the test data sets for A. thaliana, as well as the data sets from B. napus and O. sativa. REGAL now has an overall accuracy of 77%, with a sensitivity of 81% and a specificity of 74%. In the 90% credible interval range, the overall accuracy is 86%, with sensitivity of 89% and specificity of 83%. This is similar to our previously reported results, with sensitivity actually higher with the new organism. Specificity is somewhat reduced compared to our previously reported level. Nevertheless, the overall accuracy in the 90% credible intervals remains identical to our previous findings.
The output from REGAL now includes two values. The first is a score for a given cytosine assigned by the GA. Figure 1 shows the distribution of scores generated by REGAL for one of the test data sets from A. thaliana. The second output from REGAL is the posterior probability that the prediction is correct.
This value is estimated from the false positive and false negative rates, as described in Implementation. In Figure 1, the 90% credible intervals, based on this estimated posterior probability, are indicated by the dashed lines. In the subsequent description of results and in comparisons to other methods, we consider only the results from the 90% credible intervals. As discussed in Implementation, this is an accepted and well-established practice in evaluating the performance of classifiers [2, 5, ?, 7].
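One generic way to turn false positive and false negative rates into a posterior probability is Bayes' rule, sketched below. This is not taken from the Implementation section and may differ from REGAL's estimate; the equal prior reflects only that the data sets contain equal numbers of edited and unedited sites, and the function name is hypothetical.

```python
def posterior_edited(fp_rate: float, fn_rate: float, prior: float = 0.5) -> float:
    """P(site is edited | classifier predicts edited), by Bayes' rule.

    fp_rate: fraction of unedited sites predicted as edited (1 - specificity)
    fn_rate: fraction of edited sites predicted as unedited (1 - sensitivity)
    prior:   assumed fraction of edited sites among all candidates
    """
    sensitivity = 1.0 - fn_rate
    true_hits = sensitivity * prior
    false_hits = fp_rate * (1.0 - prior)
    return true_hits / (true_hits + false_hits)


# Using the overall rates reported in the 90% credible range
# (sensitivity 0.89, specificity 0.83) and a balanced prior.
p = posterior_edited(fp_rate=0.17, fn_rate=0.11)
```

With a balanced prior this quantity reduces to the positive predictive value; with a smaller prior, as in a whole genome where most cytosines are unedited, the same error rates yield a lower posterior.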
Overall performance of REGAL on A. thaliana.
| | Known Edited Sites (Total: 17 – 26) | Known Unedited Sites (Total: 18 – 28) | Performance |
|---|---|---|---|
| Predicted Edited Site | True positive 19.4 (± 3.4) | False positive 3.3 (± 1.2) | Sensitivity: 0.91 (± 0.06); Specificity: 0.85 (± 0.06) |
| Predicted Unedited Site | False negative 2.0 (± 1.1) | True negative 19.7 (± 3.8) | PPV: 0.86 (± 0.05); Accuracy: 0.88 (± 0.05) |
Overall performance of REGAL on B. napus.
| | Known Edited Sites (Total: 258) | Known Unedited Sites (Total: 263) | Performance |
|---|---|---|---|
| Predicted Edited Site | True positive 229 | False positive 51 | Sensitivity: 0.89; Specificity: 0.81 |
| Predicted Unedited Site | False negative 29 | True negative 212 | PPV: 0.82; Accuracy: 0.85 |
Overall performance of REGAL on O. sativa.
| | Known Edited Sites (Total: 262) | Known Unedited Sites (Total: 287) | Performance |
|---|---|---|---|
| Predicted Edited Site | True positive 228 | False positive 52 | Sensitivity: 0.87; Specificity: 0.82 |
| Predicted Unedited Site | False negative 34 | True negative 235 | PPV: 0.81; Accuracy: 0.84 |
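The summary statistics in these tables follow directly from the four cell counts. The sketch below recomputes them for B. napus and O. sativa using the standard definitions; the function name is hypothetical.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard performance measures from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # edited sites correctly predicted
        "specificity": tn / (tn + fp),  # unedited sites correctly predicted
        "ppv": tp / (tp + fp),          # predicted-edited sites that are edited
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts taken from the tables above.
b_napus = metrics(tp=229, fp=51, fn=29, tn=212)
o_sativa = metrics(tp=228, fp=52, fn=34, tn=235)

for name, m in (("B. napus", b_napus), ("O. sativa", o_sativa)):
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Rounded to two decimal places, the recomputed values agree with the sensitivity, specificity, PPV and accuracy reported for both genomes.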
Comparing REGAL to other methods
Comparison of REGAL vs. Classification Trees.
| Genome | Classification Trees: Sensitivity | Classification Trees: Specificity | Classification Trees: Accuracy | REGAL: Sensitivity | REGAL: Specificity (PPV) | REGAL: Accuracy |
|---|---|---|---|---|---|---|
| A. thaliana | 0.65 | 0.89 | 0.71 | 0.91 | 0.85 (0.86) | 0.88 |
| B. napus | 0.63 | 0.89 | 0.69 | 0.89 | 0.81 (0.82) | 0.85 |
| O. sativa | 0.64 | 0.88 | 0.71 | 0.87 | 0.82 (0.81) | 0.84 |
| Overall | 0.64 | 0.89 | 0.70 | 0.89 | 0.83 (0.83) | 0.86 |
Comparison of REGAL vs. Random Forests.
| Genome | Random Forests: Sensitivity | Random Forests: Specificity | Random Forests: Accuracy | REGAL: Sensitivity | REGAL: Specificity (PPV) | REGAL: Accuracy |
|---|---|---|---|---|---|---|
| A. thaliana | 0.70 | 0.81 | 0.74 | 0.91 | 0.85 (0.86) | 0.88 |
| B. napus | 0.73 | 0.81 | 0.77 | 0.89 | 0.81 (0.82) | 0.85 |
| O. sativa | 0.72 | 0.81 | 0.72 | 0.87 | 0.82 (0.81) | 0.84 |
| Overall | 0.72 | 0.81 | 0.74 | 0.89 | 0.83 (0.83) | 0.86 |
Comparison of REGAL vs. PREP-Mt.
| Genome | PREP-Mt: Sensitivity | PREP-Mt: Positive Predictive Value | PREP-Mt: Accuracy | REGAL: Sensitivity | REGAL: Specificity (PPV) | REGAL: Accuracy |
|---|---|---|---|---|---|---|
| A. thaliana | 0.79 | 0.86 | 0.82 | 0.91 | 0.85 (0.86) | 0.88 |
| B. napus | 0.87 | 0.87 | 0.87 | 0.89 | 0.81 (0.82) | 0.85 |
| O. sativa | 0.81 | 0.85 | 0.83 | 0.87 | 0.82 (0.81) | 0.84 |
| Overall | 0.82 | 0.86 | 0.84 | 0.89 | 0.83 (0.83) | 0.86 |
We regret any inconvenience the error in the data generation phase may have caused. We wish to thank Jeffrey P. Mower for bringing this error to our attention, and Saria Awadalla for conducting an independent review of the software prior to publication of this correction.
References
- Thompson J, Gopal S: Genetic algorithm learning as a robust approach to RNA editing site prediction. BMC Bioinformatics 2006, 7: 145. 10.1186/1471-2105-7-145
- Ewens WJ, Grant GR: Statistical Methods in Bioinformatics: An Introduction. New York: Springer-Verlag; 2001.
- Lipsey M, Wilson D: Practical meta-analysis. Thousand Oaks, CA: Sage; 2001.
- Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. 2nd edition. Boca Raton, FL: Chapman and Hall/CRC; 2004.
- Altham P: Exact Bayesian analysis of a 2*2 contingency table, and Fisher's "exact" significance test. Journal of the Royal Statistical Society, Series B 1969, 31: 261–269.
- Gopal S, Awadalla S, Gaasterland T, Cross GA: A computational investigation of Kinetoplastid trans-splicing. Genome Biology 2005, 6: R95. 10.1186/gb-2005-6-11-r95
- Venables W, Ripley B: Modern Applied Statistics with S-Plus. 3rd edition. Heidelberg: Springer-Verlag; 1999.
- Cummings MP, Myers DS: Simple statistical models predict C-to-U edited sites in plant mitochondrial RNA. BMC Bioinformatics 2004, 5: 132. [http://www.biomedcentral.com/1471-2105/5/132] 10.1186/1471-2105-5-132
- Mower JP: PREP-Mt: predictive RNA editor for plant mitochondrial genes. BMC Bioinformatics 2005, 6: 96. [http://www.biomedcentral.com/1471-2105/6/96] 10.1186/1471-2105-6-96
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.