Screening non-coding RNAs in transcriptomes from neglected species using PORTRAIT: case study of the pathogenic fungus Paracoccidioides brasiliensis
© Arrial et al; licensee BioMed Central Ltd. 2009
Received: 15 August 2008
Accepted: 4 August 2009
Published: 4 August 2009
Transcriptome sequences complement structural genomic information and provide snapshots of an organism's transcriptional profile. Such sequences also represent an alternative method for characterizing neglected species that are not expected to undergo whole-genome sequencing. One difficulty for transcriptome sequencing of these organisms is the low quality of reads and incomplete coverage of transcripts, both of which compromise further bioinformatics analyses. Another complicating factor is the lack of known protein homologs, which frustrates searches against established protein databases. This lack of homologs may be caused by divergence from well-characterized and over-represented model organisms. Another explanation is that non-coding RNAs (ncRNAs) may be caught during sequencing. NcRNAs are RNA sequences that, unlike messenger RNAs, do not code for protein products and instead perform unique functions by folding into higher order structural conformations. NcRNA screening software specific for transcriptome sequences is available, but these tools are optimized for transcriptomes that are well represented in protein databases, and they also assume that input ESTs are full-length and of high quality.
We propose an algorithm called PORTRAIT, which is suitable for ncRNA analysis of transcriptomes from poorly characterized species. Sequences are translated by software that is resistant to sequencing errors, and the predicted putative proteins, along with their source transcripts, are evaluated for coding potential by a support vector machine (SVM). Either of two SVM models may be employed: if a putative protein is found, a protein-dependent SVM model is used; if it is not found, a protein-independent SVM model is used instead. Only ab initio features are extracted, so that no homology information is needed. We illustrate the use of PORTRAIT by predicting ncRNAs from the transcriptome of the pathogenic fungus Paracoccidioides brasiliensis and five other related fungi.
PORTRAIT can be integrated into pipelines, and provides a low computational cost solution for ncRNA detection in transcriptome sequencing projects.
Proteins are recognized as the most important players in cell homeostasis. Due to their importance and relatively straightforward characterization, it is expected that the main focus of transcriptome projects will be transcripts that code for proteins. To meet this demand, several specific computational tools have been created, both for absolute characterization and comparative analysis of these molecules. Only recently has attention begun to turn to those transcripts ignored or rejected by protein-oriented software packages: the so-called non-coding RNAs (ncRNAs). Classical, textbook examples of ncRNAs include ribosomal and transfer RNAs. More recently, other classes have been unveiled, such as microRNAs, siRNAs, piRNAs, asRNAs and the long, mRNA-like ncRNAs, widespread among all Domains, with evidence of ubiquitous tissue expression in plants and animals [1, 2].
Demand is now arising for specific tools for working with these molecules. A combination of new computational tools and advances in biological knowledge has allowed the development of specific software for this purpose. Currently, it is not difficult to find software designed for the identification and characterization of individual ncRNA classes (as we will discuss later). However, the task is still considered complex and remains an open topic in bioinformatics.
Machine learning algorithms represent a solution for highly accurate detection and characterization of ncRNA patterns, and more improvements are expected as ncRNA biological properties are determined by biochemical and molecular experiments. Successful implementations have been reported for siRNA and miRNA. The mRNA-like ncRNA, on the other hand, is arguably a class that is harder to identify because of its resemblance to mRNA molecules: these transcripts may be capped, may undergo splicing, and may even harbor polyadenylation and ORF signals. Screening of mRNA-like ncRNA is possible on prokaryotic genomes using RNAGENiE. For transcriptome contexts, there are two notable implementations: CONC and CPC. Both algorithms can distinguish mRNA from ncRNA with high accuracy. CONC showed that putative proteins from ncRNA are distinguishable from those translated from mRNA, and CPC improved on this idea by focusing heavily on homology information. However, their high accuracy relies on the quality of homology information (especially CPC), and both expect full-length sequences given the ORF translation schemes employed (especially CONC). These two assumptions hinder the use of these programs for analysis of transcriptomes from poorly characterized organisms because many of their sequences lack known protein homologs and are commonly built from low-quality, single-pass reads. Such drawbacks require special procedures to be employed for accurate analysis because canonical translation signals are often missing. The result is a bias toward false negatives when the input consists of low-quality sequences because most transcripts code for unusual or truncated (but functional) proteins. Moreover, despite the advances reported for CPC, the required computational processing power and running time remain prohibitive for labs with limited budgets.
In summary, these programs may be inappropriate for transcriptomes from neglected species. We propose new Support Vector Machine-based software to overcome these obstacles. EST sequencing errors, frameshifts and truncations are taken into account and corrected by a specially designed program, and a shunt is imposed on sequences without a predicted ORF, which are then analyzed separately. Database representation bias is eliminated by avoiding homology information and using only ab initio features. In addition, only computationally light programs were chosen for calculating features, so that the software can be pipelined from transcriptome sequencing projects with fewer demands on computational processing power.
Putative EST translation
The ANGLE software package was chosen for translation of ESTs because it accounts for sequencing errors in the input sequences and has superior performance when dealing with short sequences. ANGLE implements a hybrid method composed of a sliding-window CDS classifier using a weak learner, a hidden Markov model coupled to dynamic programming for determining the optimal ORF path, and a frameshift detector. The dynamic programming (DP) algorithm evaluates and scores putative proteins translated from the six frames; among all alternatives, the putative ORF with the highest DP score is taken as the protein product coded by the transcript. Transcripts are separated into two groups: those with translated proteins and those that lack any putative ORF. A user-friendly interface for ANGLE was developed in PERL and is available from the authors upon request.
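The separation of transcripts into the two groups can be illustrated with a deliberately naive stand-in for the translation step. The sketch below is not the ANGLE algorithm: it has no HMM, no DP scoring and no frameshift detection, and it simply takes the longest stop-free peptide over the six frames. The function names and the `min_len` threshold are hypothetical choices made for illustration.

```python
# Naive six-frame translation; a stand-in for illustration, NOT ANGLE.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def six_frames(seq):
    """Yield the translation of each of the six reading frames."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    for strand in (seq, reverse):
        for offset in range(3):
            codons = (strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3))
            # Codons with ambiguous bases are translated as "X".
            yield "".join(CODON_TABLE.get(c, "X") for c in codons)


def longest_orf(seq, min_len=20):
    """Longest stop-free peptide over all frames, or None if too short.

    min_len is a hypothetical threshold, not a value from the paper.
    """
    best = max((pep for frame in six_frames(seq) for pep in frame.split("*")),
               key=len, default="")
    return best if len(best) >= min_len else None


def partition(transcripts, min_len=20):
    """Split {name: sequence} into transcripts with and without a putative ORF."""
    with_orf, without_orf = {}, {}
    for name, seq in transcripts.items():
        peptide = longest_orf(seq, min_len)
        if peptide is None:
            without_orf[name] = seq
        else:
            with_orf[name] = (seq, peptide)
    return with_orf, without_orf
```

In this sketch the first group would go to the protein-dependent model and the second to the protein-independent one.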
Support Vector Machines settings
The Support Vector Machine (SVM) is a state-of-the-art machine learning algorithm developed from a solid statistical basis. SVMs have been shown to be successful and useful in bioinformatics and several other fields.
We used the LIBSVM v2.84 implementation with the Radial Basis Function kernel, which was shown to be the best kernel for this task (Liu et al, 2006). It was configured as a C-SVM for a binary classification problem, with the two classes being coding (positive set) and non-coding (negative set) RNA. Optimization of the parameters (C and gamma) occurred in two runs using the accompanying grid.py script with 20,000 randomly selected instances from the main training set. Two models were induced separately: a protein-dependent one induced with dbTR_OP as training data, and a nucleotide-only one using dbTR_OA for training [see Additional file 1].
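For readers unfamiliar with the two parameters being tuned, the following minimal sketch shows the RBF kernel and an exponentially spaced grid of (C, gamma) candidates. It does not call LIBSVM or grid.py. The exponent ranges are illustrative assumptions, since the ranges actually searched are not reported here.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def parameter_grid(log2c=(-5, 15, 2), log2g=(3, -15, -2)):
    """Return (C, gamma) pairs on a grid of powers of two.

    Each argument is (begin, end, step) in log2 units, end inclusive.
    The default ranges are assumptions made for illustration only.
    """
    def span(begin, end, step):
        values, exponent = [], begin
        while (step > 0 and exponent <= end) or (step < 0 and exponent >= end):
            values.append(2.0 ** exponent)
            exponent += step
        return values

    return list(product(span(*log2c), span(*log2g)))
```

Each pair would be scored by cross-validation, and a second, finer pass could be made by calling `parameter_grid` again with narrower ranges around the best pair.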
Compared programs settings
PORTRAIT was benchmarked against two other classifiers: Naïve Bayes and CPC. Naïve Bayes (nB) is a machine learning algorithm used when a wealth of examples (or instances, or realizations) of a random variable is available and a model is needed that can explain the distribution of these data. The induced model may then be used to classify data not yet seen by the classifier. Although very simplistic, nB is also known to be fast and reliable, sometimes even surpassing more sophisticated machine learning algorithms.
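As a rough illustration of how such a model is induced and applied, here is a minimal Gaussian naïve Bayes sketch for numeric feature vectors. It is not the implementation used in this work; the per-class normal distributions and the `var_floor` smoothing constant are assumptions of the sketch.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a log prior and per-feature mean/variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor (hypothetical) avoids division by zero for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one instance."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances))
    return max(model, key=log_posterior)
```

The "naïve" part is the sum over features in `log_posterior`, which treats every feature as independent given the class.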
Bayesian models were induced using the software package BC with default parameters. Training was done with the same sets, features and normalization schemes used for SVM.
CPC was installed locally and always executed with default parameters. CPC comes pre-installed with a classification model that its authors built using the database created by the authors of CONC.
Efficiency formulas, points for plotting ROC curves and areas under ROC curves were calculated using both PERL scripts and the PERF software.
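The kind of calculation involved can be sketched as follows, taking coding RNA as the positive class (label 1) and non-coding RNA as the negative class (label 0). The particular measures shown (sensitivity, specificity, accuracy) are an assumption about what "efficiency formulas" covers, and the pairwise AUC is a simple quadratic-time formulation rather than what PERF does internally.

```python
def efficiency(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels (1 = coding).

    Assumes both classes are present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


def roc_auc(y_true, scores):
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen positive instance
    scores higher than a randomly chosen negative one (ties count 0.5).
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```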
Cross-validation is a traditional machine learning technique for estimating classifier performance. The training set is split into n equally sized subsets by sampling without replacement. A model is then trained on n-1 subsets and evaluated on the remaining one, and this process is repeated n times so that each subset is used for testing exactly once. We used ten-fold cross-validation, which was carried out using LIBSVM for SVM, and a custom PERL script for naïve Bayes.
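A minimal sketch of the standard n-fold scheme, in which each fold is held out once for testing while the remaining folds train the model, is given below. The `fit` and `predict` callables are hypothetical placeholders for any classifier, and the fixed `seed` is only there to make the shuffle reproducible.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Striding gives k disjoint folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for held_out, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != held_out
                 for j in fold]
        yield train, test


def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Fraction of instances classified correctly while held out."""
    correct = 0
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return correct / len(X)
```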
EST sequences of organisms phylogenetically related to P. brasiliensis (Ajellomyces capsulatus, Aspergillus niger, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Cryptococcus neoformans var. neoformans) were downloaded from the Entrez Nucleotide Database and stored as FASTA-formatted files. After removing transcripts shorter than 80 or longer than 65,335 letters, these sequences composed the dbFG set, comprising 137,629 entries.
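The length filter can be sketched as below. The parser is a minimal one written for this illustration (not the scripts used in this work), and the function names are hypothetical; the two length limits are those given in the text.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=80, max_len=65335):
    """Keep records whose sequence length lies within [min_len, max_len]."""
    return [(header, seq) for header, seq in records
            if min_len <= len(seq) <= max_len]
```

With a file on disk, the records could be read with `filter_by_length(read_fasta(open(path)))`, where `path` is whatever FASTA file is being screened.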
Results and discussion
Training set construction
Seeking to discriminate between ncRNA and mRNA, we used Support Vector Machines (SVM) for induction of a classification model. SVM is a supervised machine learning method, and as such, it requires previously labeled data – the training set – for model induction (see Methods for details). In this work, mRNAs (the positive set) and ncRNAs (the negative set) compose the SVM training set, called dbTR (TRaining DataBase).
Files containing sequences from RNAdb, NONCODE and Rfam, currently the three most comprehensive ncRNA databases, were downloaded in October 2006, comprising a total of 213,849 sequences. Nucleotide redundancy removal was done using BLASTCLUST with L = 0.5, S = 0.5 and W = 18. ORF prediction and redundancy elimination were carried out in the same way as for the positive set. The resulting 70,667 transcripts with ORFs and corresponding proteins were added to the dbTR_OP set, while the remaining transcripts were merged into the dbTR_OA set. This process is shown in the rightmost part of Figure 1.
Feature vector description. Cited references either support the coding/non-coding discrimination power of the feature or describe the corresponding program.
- Individual nucleotide frequency divided by total nucleotide frequency
- Binary coding: length intervals < 100, 400, 900 and > 900.
- Amino acid composition§: individual amino acid frequency divided by total amino acid frequency
- Binary coding: length intervals < 20, 60, 100 and > 100.
- Value divided by 14
- Amount of low complexity residues divided by sequence length
- Summed means from sliding 3nt window
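Two of the normalization schemes listed above, composition frequencies and binary coding of length intervals, can be sketched as follows. The handling of interval boundaries (for example, whether a length of exactly 100 falls in the first or second interval) and the order of features in the vector are assumptions of this sketch; the remaining features depend on external programs and are omitted.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq, alphabet):
    """Frequency of each symbol divided by the total count over the alphabet."""
    seq = seq.upper()
    counts = [seq.count(symbol) for symbol in alphabet]
    total = sum(counts)
    return [count / total if total else 0.0 for count in counts]


def length_bins(length, bounds):
    """One-hot coding of a length over four intervals.

    For bounds (a, b, c) the intervals are < a, [a, b), [b, c] and > c.
    The boundary handling is an assumption of this sketch.
    """
    low, mid, high = bounds
    if length < low:
        index = 0
    elif length < mid:
        index = 1
    elif length <= high:
        index = 2
    else:
        index = 3
    return [1 if i == index else 0 for i in range(4)]


def transcript_features(seq, peptide=None):
    """Concatenate nucleotide features and, if available, protein features."""
    features = composition(seq, "ACGT") + length_bins(len(seq), (100, 400, 900))
    if peptide:
        features += composition(peptide, AMINO_ACIDS)
        features += length_bins(len(peptide), (20, 60, 100))
    return features
```

Transcripts without a putative protein yield the shorter, nucleotide-only vector, which mirrors the split between the two models.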
SVM optimization, training and testing
dbTR_OP and dbTR_OA were further randomly subdivided into optimization, training and testing subsets, comprising, respectively, 20,000, 30,000 and 23,976 instances for dbTR_OP, and 10,000, 20,000 and 22,002 instances for dbTR_OA. The optimization set was used to obtain the best pair of values for two crucial parameters of the SVM Radial Basis Function (RBF) kernel, gamma and C, determined from a 10-fold cross-validation grid search. Training sets were used to induce SVM models, and test sets (from now on called dbTS_OP and dbTS_OA) were used to estimate the performance of the induced models.
Model performance was estimated by traditional methods, such as efficiency formulae, cross-validation, ROC curves and running time comparison between related programs.
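A random subdivision of this kind can be sketched as a shuffle followed by consecutive slices. The function name and the fixed `seed` are hypothetical; the subset sizes in the usage example are those reported above for dbTR_OP.

```python
import random


def split_dataset(items, sizes, seed=0):
    """Randomly partition items into disjoint subsets of the given sizes."""
    shuffled = list(items)
    if sum(sizes) != len(shuffled):
        raise ValueError("subset sizes must add up to the number of items")
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for size in sizes:
        parts.append(shuffled[start:start + size])
        start += size
    return parts


# Usage with the dbTR_OP sizes; integers stand in for the actual instances.
optimization, training, testing = split_dataset(range(73976),
                                                (20000, 30000, 23976))
```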
Efficiency formulas and runtime comparisons
For estimation of classifier accuracy, we used cross-validation (CV) with dbTR_OP and dbTR_OA as training/testing sets. Figures obtained for PORTRAIT and naïve Bayes (nB) were compared to those reported in the literature for CPC.
Speed performance (in minutes), standard efficiency measures and cross validation accuracy. Indices were calculated from the mean of predictions of the classifiers regarding dbTS_OP and dbTS_OA sets.
Induced classifiers were used to evaluate the coding potential of transcripts from three test sets. The first one is dbRD, comprising 3,000 randomly generated transcripts with lengths varying from 80 to 3,000 nt. Another set is dbPB, which harbors 6,022 assembled ESTs generated during transcriptome sequencing of the pathogenic fungus Paracoccidioides brasiliensis. The third set is dbFG, composed of 137,629 transcript sequences from organisms phylogenetically related to P. brasiliensis: Ajellomyces capsulatus and Aspergillus niger, and as outgroups, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Cryptococcus neoformans var. neoformans.
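A set like dbRD can be generated along the following lines. The text does not say how lengths or bases were sampled, so the uniform sampling of both used here is an assumption, as are the function name and the `seed` parameter.

```python
import random


def random_transcripts(n=3000, min_len=80, max_len=3000, seed=0):
    """Generate n random nucleotide sequences.

    Lengths are drawn uniformly from [min_len, max_len] and bases uniformly
    from ACGT; both choices are assumptions of this sketch.
    """
    rng = random.Random(seed)
    return ["".join(rng.choices("ACGT", k=rng.randint(min_len, max_len)))
            for _ in range(n)]
```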
Proportion of transcripts predicted as ncRNA by three classifiers.
Analysis of dbPB transcripts classified as ncRNA
Among the dbPB ESTs, PORTRAIT classified 16% as potential ncRNAs. Of those, 83% were unannotated sequences, which provides parallel evidence that those transcripts may indeed not code for proteins [see also Additional file 3]. It is important to note that this result corroborates non-coding status as an independent diagnostic only for PORTRAIT and nB, because these are the only ab initio classifiers. Additionally, 81% of the transcripts predicted as ncRNAs were singletons, corroborating evidence that ncRNAs are expressed at lower levels than mRNAs and thus tend to be assembled as singletons.
In this work we report an algorithm for identifying non-coding RNAs in a transcriptome context. The distinguishing characteristic of our approach is its focus on non-model organisms, achieved by using an ORF translation program that tolerates low-quality EST sequences and by choosing only ab initio features. Even if the input sequence has been disrupted by frameshifts or indels to the extent that ORF identification is compromised, the query transcript may still be classified as protein-coding by the protein-independent SVM model of PORTRAIT. Therefore, our predictions are not biased toward classifying as ncRNA those transcripts that may actually code for novel proteins that are rare in, or even absent from, the databases. This may be a factor contributing to the high specificity of PORTRAIT (Table 2). Also, our training set includes several recent ncRNAs and mRNAs from all Domains of life, including prokaryotic and eukaryotic sequences. These factors make our program well suited to the analysis of neglected or poorly characterized species.
Differences arising from the ab initio approach also show up in the number of transcripts predicted to be non-coding in comparison to the other classifiers (Table 3). Compared to SVM, the nB algorithm is notably less complex and less robust to inconsistencies in the training set. Thus, when looking at the number of predicted ncRNAs in the dbPB and dbFG sets, one may infer that the rules derived by this algorithm for identifying ncRNAs are far too simple, leading a significant number of ncRNAs to be misclassified as mRNA (too many false positives). On the other hand, CPC classifies all transcripts from dbRD as being non-coding. At a glance, this result seems consistent; however, some of these randomly generated sequences could be "real" mRNA transcripts encoding novel proteins not found in the databases (false negatives). This scenario is plausible for sequences from the transcriptomes of neglected organisms, for which very little is known and where there is the potential for novelty. Taking this hypothesis into account, CPC may not be suitable for this situation because it may be biased toward classifying as non-coding those transcripts lacking good hits in protein databases. PORTRAIT emerges as a compromise between nB and CPC: it predicts as ncRNA a reasonable number of the transcripts from dbPB and dbFG, and also classifies some dbRD transcripts as mRNA, despite not having come into contact with similar sequences in the training phase.
We propose PORTRAIT, a software tool for ncRNA screening in transcriptomes. Our method is tailored to the analysis of neglected organisms: 1) we use a 6-frame translation scheme that takes into account sequencing errors and is optimized for small or truncated sequences; 2) no homology information is used; 3) only lightweight programs are used, so the method is suitable for less powerful computers. The output of the program may also provide insights or a second opinion about the coding status of known protein-coding transcripts. Subsequent homology analyses are up to the researcher and constitute an independent, parallel experiment.
Availability and requirements
Project name: PORTRAIT
Project home page: http://bioinformatics.cenargen.embrapa.br/portrait
Operating system(s): LINUX
Programming language: PERL
Other requirements: LIBSVM 2.84, CAST 1.0, ANGLE
License: GNU GPL
Any restrictions to use by non-academics: PORTRAIT is free for commercial use, but third-party authors of programs used by PORTRAIT must be contacted.
RTA is supported by a grant from the National Council for Scientific and Technological Development (CNPq), Brazil. The authors thank EMBRAPA for lending computers for this study. RTA acknowledges Dr. Kana Shimizu for providing specially designed ANGLE software.
- Ravasi T, Suzuki H, Pang KC, Katayama S, Furuno M, Okunishi R, Fukuda S, Ru K, Frith MC, Gongora MM, Grimmond SM, Hume DA, Hayashizaki Y, Mattick JS: Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Res 2006, 16: 11–19. doi:10.1101/gr.4200206
- Mattick JS: RNA regulation: a new genetics? Nat Rev Genet 2004, 5: 316–323. doi:10.1038/nrg1321
- Jossinet F, Ludwig TE, Westhof E: RNA structure: bioinformatic analysis. Curr Op Microbiol 2007, 10: 279–285. doi:10.1016/j.mib.2007.05.010
- Teramoto R, Aoki M, Kimura T, Kanaoka M: Prediction of siRNA functionality using generalized string kernel and support vector machine. FEBS Lett 2005, 579(13):2878–2882. doi:10.1016/j.febslet.2005.04.045
- Xue C, Li F, He T, Liu G-P, Li Y, Zhang X: Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics 2005, 6: 310–317. doi:10.1186/1471-2105-6-310
- Rymarquis LA, Kastenmayer JP, Hüttenhofer AG, Green PJ: Diamonds in the rough: mRNA-like non-coding RNAs. Trends in Plant Science 2008, 13(7):329–334. doi:10.1016/j.tplants.2008.02.009
- Carter RJ, Dubchak I, Holbrook SR: A computational approach to identify genes for functional RNAs in genomic sequences. Nucleic Acids Res 2001, 29: 3928–3938.
- Liu J, Gough J, Rost B: Distinguishing protein-coding from non-coding RNAs through support vector machines. PLoS Genet 2006, 2: e29-e36. doi:10.1371/journal.pgen.0020029
- Kong L, Zhang Y, Ye Z-Q, Liu X-Q, Zhao S-Q, Wei L, Gao G: CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res 2007, 35: W345-W349. doi:10.1093/nar/gkm391
- Shimizu K, Adachi J, Muraoka Y: ANGLE: a sequencing errors resistant program for predicting protein coding regions in unfinished cDNA. J Bioinfo Comp Biol 2006, 4(3):649–664. doi:10.1142/S0219720006002260
- Witten IH, Frank E: Data Mining: Practical machine learning tools and techniques. San Francisco, Morgan Kaufmann; 2005.
- Noble WS: What is a support vector machine? Nat Biotech 2006, 24(12):1565–1567. doi:10.1038/nbt1206-1565
- Chang CC, Lin CJ: LIBSVM: a library for support vector machines. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
- Borgelt C: Full and Naive Bayes classifiers. [http://www.borgelt.net/bayes.html]
- PERF software package [http://kodiak.cs.cornell.edu/kddcup/software.html]
- NCBI Entrez Nucleotide Database [http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucleotide]
- Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Mazumder R, O'Donovan C, Redaschi N, Suzek B: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 2006, 34: D187-D191. doi:10.1093/nar/gkj161
- Li W, Godzik A: CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22(13):1658–1659. doi:10.1093/bioinformatics/btl158
- Cochrane G, Aldebert P, Althorpe N, Andersson M, Baker W, Baldwin A, Bates K, Bhattacharyya S, Browne P, Broek A, Castro M, Duggan K, Eberhardt R, Faruque N, Gamble J, Kanz C, Kulikova T, Lee C, Leinonen R, Lin Q, Lombard V, Lopez R, Mchale M, McWilliam H, Mukherjee G, Nardone F, Pastor MPG, Sobhany S, Stoehr P, Tzouvara K, Vaughan R, Wu D, Zhu W, Apweiler R: EMBL nucleotide sequence database: developments in 2005. Nucleic Acids Res 2006, 34: D10-D15. doi:10.1093/nar/gkj130
- Harte N, Silventoinen V, Quevillon E, Robinson S, Kallio K, Fustero X, Patel P, Jokinen P, Lopez P: Public web-based services from the European Bioinformatics Institute. Nucleic Acids Res 2004, 32: W3-W9. doi:10.1093/nar/gkh405
- McGinnis S, Madden TL: BLAST: At the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 2004, 32: W20-W25. doi:10.1093/nar/gkh435
- Pang KC, Stephen S, Engström PG, Tajul-Arifin K, Chen W, Wahlestedt C, Lenhard B, Hayashizaki Y, Mattick JS: RNAdb – a comprehensive mammalian noncoding RNA database. Nucleic Acids Res 2005, 33: D125-D130. doi:10.1093/nar/gki089
- He S, Liu C, Skogerbø G, Zhao Y, Wang J, Liu T, Bai B, Zhao Y, Chen R: NONCODE v2.0: decoding the non-coding. Nucleic Acids Res 2008, 36: D170-D172. doi:10.1093/nar/gkm1011
- Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 2005, 33: D121-D124. doi:10.1093/nar/gki081
- Fickett JW, Tung C-S: Assessment of protein coding measures. Nucleic Acids Res 1992, 20(24):6441–6450. doi:10.1093/nar/20.24.6441
- Numata K, Kanai A, Saito R, Kondo S, Adachi J, Wilming LG, Hume DA, Hayashizaki Y, Tomita M, RIKEN GER Group, GSL members: Identification of putative noncoding RNAs among the RIKEN mouse full-length cDNA collection. Genome Res 2003, 13: 1301–1306. doi:10.1101/gr.1011603
- Otaki JM, Ienaka S, Gotoh T, Yamamoto H: Availability of short amino acid sequences in proteins. Protein Sci 2005, 14: 617–625. doi:10.1110/ps.041092605
- Frith MC, Bailey TL, Kasukawa T, Mignone F, Kummerfeld SK, Madera M, Sunkara S, Furuno M, Bult CJ, Quackenbush J, Kai C, Kawai J, Carninci P, Hayashizaki Y, Pesole G, Mattick JS: Discrimination of non-protein-coding transcripts from protein-coding mRNA. RNA Biol 2006, 3(1):40–48.
- Rice P, Longden I, Bleasby A: EMBOSS: The European molecular biology open software suite. Trends Genet 2000, 16: 276–277. doi:10.1016/S0168-9525(00)02024-2
- Promponas VJ, Enright AJ, Tsoka S, Kreil DP, Leroy C, Hamodrakas S, Sander S, Ouzounis C: CAST: an iterative algorithm for the complexity analysis of sequence tracts. Bioinformatics 2000, 16(10):915–922. doi:10.1093/bioinformatics/16.10.915
- Kyte J, Doolittle RF: A Simple Method for Displaying the Hydropathic Character of a Protein. J Mol Biol 1982, 157: 105–132. doi:10.1016/0022-2836(82)90515-0
- Felipe MS, Andrade RV, Arraes FBM, Nicola AM, Maranhão AQ, Torres FAG, Silva-Pereira I, Poças-Fonseca MJ, Campos EG, Moraes LMP, Andrade PA, Tavares AHFP, Silva SS, Kyaw CM, Souza DP, PbGenome Network, Pereira M, Jesuíno RSA, Andrade EV, Parente JA, Oliveira GS, Barbosa MS, Martins NF, Fachin AL, Cardoso RS, Passos GAS, Almeida NF, Walter MEMT, Soares CMA, Carvalho MJA, Brígido MM: Transcriptional profiles of the human pathogenic fungus Paracoccidioides brasiliensis in mycelium and yeast cells. J Biol Chem 2005, 280: 24706–24714. doi:10.1074/jbc.M500625200
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.