ViralmiR: a support-vector-machine-based method for predicting viral microRNA precursors

Abstract

Background

microRNAs (miRNAs) play a vital role in development, oncogenesis, and apoptosis by binding to mRNAs to regulate the posttranscriptional level of coding genes in mammals, plants, and insects. Recent studies have demonstrated that the expression of viral miRNAs is associated with the ability of the virus to infect a host. Identifying potential viral miRNAs from experimental sequence data is valuable for deciphering virus-host interactions. Thus far, a specific predictive model for viral miRNA identification has yet to be developed.

Methods and results

Here, we present ViralmiR for identifying viral miRNA precursors on the basis of sequencing and structural information. We collected 263 experimentally validated miRNA precursors (pre-miRNAs) from 26 virus species and generated sequencing fragments from virus and human genomes as the negative dataset. Support vector machine and random forest models were established using 54 features from RNA sequences and secondary structural information. The results show that ViralmiR achieved a balanced accuracy higher than 83%, which is superior to that of previously developed tools for identifying pre-miRNAs.

Conclusions

The easy-to-use ViralmiR web interface has been provided as a helpful resource for researchers to use in analyzing and deciphering virus-host interactions. The web interface of ViralmiR can be accessed at http://csb.cse.yzu.edu.tw/viralmir/.

Introduction

microRNAs (miRNAs) are non-protein-coding RNAs, approximately 22 nucleotides long, that trigger the degradation of mRNAs by binding complementarily to the 3' untranslated regions of target genes. Recent studies have demonstrated that miRNAs play a vital role in development, oncogenesis, and apoptosis by binding to mRNAs to regulate the posttranscriptional level of coding genes in mammals, plants, and insects. In addition, host miRNAs modulate viral existence in plants and animals by targeting viruses [1, 2]. Conversely, viruses also produce their own miRNAs [3]. Recent studies have demonstrated that the expression of viral miRNAs is associated with the ability of the virus to infect a host [2, 4]. Additionally, studies have reported that viral miRNAs are associated with human diseases [2–8]. For example, the Epstein-Barr virus, hepatitis B and C viruses, and human papillomavirus are highly associated with gastric and nasopharyngeal carcinoma, liver cancer, and cervical cancer, respectively.

Recently, several approaches have been developed for computationally identifying miRNA precursors (pre-miRNAs) [9–15]. Various classifiers, such as the support vector machine (SVM), random forest, relaxed variable kernel density estimator (RVKDE), and bootstrap aggregating, were applied while different features generated from sequences and secondary structural information were employed. The characteristics of the approaches are summarized in Table 1.

Table 1 Characteristics of tools for identifying pre-miRNAs.

Triplet-SVM [9] involves applying an SVM to human data by using features from local, contiguous structure-sequence information for distinguishing the hairpins of real pre-miRNAs from pseudo pre-miRNAs. Each hairpin is encoded as a set of 32 triplet elements. MiPred [10] involves applying a random forest machine learning algorithm to human data by using a hybrid feature set, which consists of the 32 features used in Triplet-SVM and the minimum free energy (MFE) of the secondary structure, and using a P-value randomization test to distinguish the real pre-miRNAs from other hairpin sequences with similar stem-loops (pseudo pre-miRNAs). The de novo SVM classifier miPred [11] identifies pre-miRNAs without relying on phylogenetic conservation; 17 primary sequence features, 5 secondary structural features, and 7 normalized features are used in the model. The miR-KDE tool [12] was developed using the novel RVKDE classifier, which exploits local information, and is particularly suitable for predicting species-specific pre-miRNAs. Each hairpin-like sequence is summarized as a 33-dimensional feature vector, including the 29 features used in miPred and 4 stem-loop features. The microPred tool [13] uses effective machine learning methods for distinguishing human pre-miRNA hairpins from both pseudohairpins and other ncRNAs. Each hairpin is encoded using the 29 features from miPred, 6 RNAfold-related features, 4 Mfold-related features, 7 base-pair features, and 2 MFE-related features. The classification results showed reliability in both sensitivity (SN) and specificity (SP). The MiRenSVM tool [14] is an ensemble-SVM classification system for detecting miRNA genes, especially those with multiloop secondary structures; 8 triplet structural features, 8 base-pair group features, and 16 thermodynamic group features are considered in feature extraction from hairpin-like sequences.
The miR-BAG tool [15] uses a bootstrap-aggregation-based machine learning approach to identify miRNA candidate regions in genomes by scanning sequences. Comparative analysis results showed that miR-BAG performed more favorably than the six earlier tools did. A next-generation sequencing module was combined with miR-BAG to provide high-throughput data analysis. Vir-Mir db [16] is a database of predicted viral miRNA candidate hairpins in virus genomes, obtained by using the prediction filters of Srnaloop, sequences and structures, and open reading frames.

Most of the previously developed approaches emphasized identifying pre-miRNAs in humans, plants, and other animals. Thus far, a method designed specifically for identifying viral pre-miRNAs has not been developed. Therefore, we collected experimentally validated viral pre-miRNA data and constructed a predictive model by using several sequence and structural features. This model can assist biological researchers who study virus-host interactions in identifying potential viral miRNAs in experimental sequencing data.

Materials and methods

Datasets

The positive dataset was collected from miRBase (Version 19) and comprised 263 pre-miRNAs, including 437 mature miRNAs, from 26 virus species. The negative dataset consisted of three types of sequences, namely the virus genome, human pre-miRNAs, and Pseudo-8494. The virus genome dataset was composed of 789 randomly selected 120-nt fragments of the virus genome; fragments containing positive data were removed. The human pre-miRNA dataset contained 1600 human pre-miRNAs collected from miRBase, from which redundant or highly similar sequences had been removed. The third negative dataset was obtained from Xue et al. [9] and has been used in miPred, miR-KDE, and other tools. We named this benchmark negative dataset "Pseudo-8494" because it was composed of 8494 fragments from the coding regions of human chromosome 19.
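The construction of the virus genome negative set can be sketched as follows. This is only an illustration under stated assumptions: the function name, the toy genome, the toy coordinates, and the interpretation of "containing positive data" as "overlapping a known pre-miRNA interval" are all hypothetical and are not taken from the ViralmiR implementation.

```python
import random


def sample_negative_fragments(genome, positive_regions, n_fragments, length=120, seed=0):
    """Sample fixed-length windows that do not overlap any positive region.

    positive_regions is a list of half-open (start, end) intervals marking
    known pre-miRNAs (hypothetical coordinates in this sketch).
    """
    rng = random.Random(seed)
    max_start = len(genome) - length
    if max_start < 0:
        raise ValueError("genome is shorter than the fragment length")
    fragments = []
    attempts = 0
    # Rejection sampling with a cap so the loop always terminates.
    while len(fragments) < n_fragments and attempts < 100 * n_fragments:
        attempts += 1
        start = rng.randint(0, max_start)
        end = start + length
        overlaps = any(start < p_end and p_start < end
                       for p_start, p_end in positive_regions)
        if not overlaps:
            fragments.append((start, genome[start:end]))
    return fragments


# Toy data: a random 2000-nt "genome" with two made-up pre-miRNA loci.
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGU") for _ in range(2000))
toy_positives = [(300, 380), (1200, 1290)]
negatives = sample_negative_fragments(toy_genome, toy_positives, n_fragments=10)
print(len(negatives), "negative fragments sampled")
```

Rejection sampling keeps the sketch short; for genomes densely covered by positive loci, enumerating the allowed start positions first would be more efficient.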

Feature extraction and selection

The SVM and random forest classification methods were applied to develop predictive models for viral pre-miRNA identification. The minimum free energy and base-pair-related information were obtained using RNAfold from the Vienna RNA package [17]. Fifty-four features were selected from previous research [9, 11, 13], and the feature score (F-score) [14] was used to evaluate the discriminative power of each feature. The features used in our model are described in the following paragraphs.
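As an illustration of how an F-score can rank a single feature, the sketch below uses the widely used two-class definition (squared deviations of the class means from the overall mean, divided by the sum of the within-class sample variances). The paper does not spell out its formula, so this definition, the function name, and the toy values are assumptions.

```python
def f_score(pos_values, neg_values):
    """Two-class F-score of one feature (assumed definition, see lead-in)."""
    n_pos, n_neg = len(pos_values), len(neg_values)
    if n_pos < 2 or n_neg < 2:
        raise ValueError("each class needs at least two samples")
    mean_pos = sum(pos_values) / n_pos
    mean_neg = sum(neg_values) / n_neg
    mean_all = (sum(pos_values) + sum(neg_values)) / (n_pos + n_neg)
    # Unbiased sample variances within each class.
    var_pos = sum((x - mean_pos) ** 2 for x in pos_values) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg_values) / (n_neg - 1)
    numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    denominator = var_pos + var_neg
    if denominator == 0:
        # Convention for this sketch: constant features carry no information
        # unless the two constants differ, in which case separation is perfect.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


# Toy feature values for positive and negative samples (made up).
toy_score = f_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(toy_score)
```

A larger value indicates that the feature separates the two classes more clearly, so features can be sorted by this score and filtered with a threshold.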

32 triplet elements

The local contiguous triplet-structure composition was defined as in Triplet-SVM [9]. In the predicted secondary structure, a parenthesis, "(" or ")", denotes a paired nucleotide and a dot "." denotes an unpaired nucleotide. Generally, the opening parenthesis "(" represents a paired nucleotide located near the 5′-end that can be paired with another nucleotide at the 3′-end, which is denoted by the closing parenthesis ")". This study used "(" for both situations. For any three adjacent nucleotides, there are eight (2^3) possible triplet-structure compositions: "(((", "((.", "(.(", ".((", "(..", ".(.", "..(", and "...". Considering the middle nucleotide, there are 32 (4 × 8) possible structure-sequence combinations, which are denoted as "C(((", "A.((", etc.
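The encoding described above can be sketched in a few lines. The function name is hypothetical, and normalizing the 32 counts to frequencies is an assumption of this sketch rather than a detail stated in the text.

```python
from itertools import product


def triplet_features(sequence, structure):
    """Frequencies of the 32 (middle nucleotide + 3-position structure) elements."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = sequence.upper().replace("T", "U")
    # Both kinds of paired position are written as "(".
    struct = structure.replace(")", "(")
    keys = [nuc + "".join(t) for nuc in "ACGU" for t in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    for i in range(1, len(seq) - 1):
        key = seq[i] + struct[i - 1:i + 2]
        if key in counts:  # skips ambiguous bases such as N
            counts[key] += 1
    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in keys}


# Toy hairpin: two base pairs closing a two-nucleotide loop.
toy_triplets = triplet_features("GCAUGC", "((..))")
print({k: v for k, v in toy_triplets.items() if v > 0})
```

For the toy input, the four interior positions yield the elements "C((.", "A(..", "U..(", and "G.((", each with frequency 0.25.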

4 sequential features

The GC content ratio (%G+C), sequence length, hairpin length, and loop length, derived from the RNA sequence and its secondary structure, were used in the model.

8 thermodynamic features

The dP, dG, zP, zG, MFEI1, and MFEI2 features were chosen from miPred [11], and the MFEI3 and MFEI4 features were selected from microPred [13]. The feature dP is the total number of base pairs present in the RNA secondary structure S divided by the length L in nucleotides; dG is the ratio of the MFE to L, which measures the thermodynamic stability of the RNA structure S. The features zP and zG are the normalized variants of dP and dG, computed against 1000 random sequences generated for each original sequence. The feature MFEI1 is the ratio of dG to %G+C; MFEI2 is the ratio of dG to the number of stems, which are structural motifs containing more than three contiguous base pairs; MFEI3 is the ratio of dG to the number of loops in the secondary structure; and MFEI4 is the ratio of the MFE to the total number of base pairs in the secondary structure.
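The ratios defined above reduce to simple arithmetic once a structure and its MFE are available. The following sketch assumes that the MFE and the stem and loop counts have already been obtained elsewhere (for example from RNAfold output), that %G+C is expressed as a percentage, and that a zero denominator yields 0.0; the function name and the toy MFE value are hypothetical. The normalized features zP and zG are omitted because they require folding shuffled sequences.

```python
def thermodynamic_features(sequence, structure, mfe, n_stems, n_loops):
    """Compute dP, dG, and the four MFEI ratios from precomputed inputs."""
    seq = sequence.upper()
    length = len(seq)
    if length == 0 or length != len(structure):
        raise ValueError("sequence and structure must be non-empty and equal in length")
    n_pairs = structure.count("(")  # one "(" per base pair in dot-bracket notation
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / length
    d_p = n_pairs / length
    d_g = mfe / length

    def ratio(numerator, denominator):
        # Convention for this sketch only: undefined ratios become 0.0.
        return numerator / denominator if denominator else 0.0

    return {
        "dP": d_p,
        "dG": d_g,
        "MFEI1": ratio(d_g, gc_percent),
        "MFEI2": ratio(d_g, n_stems),
        "MFEI3": ratio(d_g, n_loops),
        "MFEI4": ratio(mfe, n_pairs),
    }


# Toy hairpin with a made-up MFE of -6.0 kcal/mol, one stem, and one loop.
toy_thermo = thermodynamic_features("GGGGAAAACCCC", "((((....))))", -6.0, 1, 1)
print(toy_thermo)
```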

8 base-pair-related features

The features |A-U|/L, |G-C|/L, |G-U|/L, %(A-U)/n_stems, %(G-C)/n_stems, %(G-U)/n_stems, consecutive base pairs (ConsecBP), and Avg_BP_stem were introduced in microPred [13], where |X-Y| is the number of (X-Y) base pairs in the secondary structure, (X-Y) ∈ {(A-U), (G-C), (G-U)}. The feature ConsecBP represents the longest pairing stretch observed in a given structural sequence, and Avg_BP_stem is the ratio of the total number of base pairs to the number of stems in the secondary structure, where a stem is a structural motif containing more than three contiguous stacked base pairs, as defined in miPred.
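A minimal sketch of how such base-pair features might be derived from a dot-bracket structure is given below. It assumes that the three stem-normalized features cover A-U, G-C, and G-U pairs, that %(X-Y) means the fraction of all base pairs that are X-Y pairs, and that a stem is a run of at least four stacked pairs ("more than three"); all function names are hypothetical.

```python
def pair_table(structure):
    """Map each opening position to its closing partner (dot-bracket input)."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            partner[stack.pop()] = i
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def stacked_runs(partner):
    """Lengths of maximal runs of directly stacked base pairs."""
    runs = []
    for i in sorted(partner):
        # Pair (i, j) extends the previous run if (i - 1, j + 1) is also a pair.
        if partner.get(i - 1) == partner[i] + 1:
            runs[-1] += 1
        else:
            runs.append(1)
    return runs


def base_pair_features(sequence, structure, min_stem_len=4):
    seq = sequence.upper().replace("T", "U")
    partner = pair_table(structure)
    counts = {"AU": 0, "CG": 0, "GU": 0}  # keys are the sorted pair letters
    for i, j in partner.items():
        key = "".join(sorted(seq[i] + seq[j]))
        if key in counts:
            counts[key] += 1
    runs = stacked_runs(partner)
    n_pairs, length = len(partner), len(seq)
    n_stems = sum(1 for r in runs if r >= min_stem_len)

    def per_stem(count):
        return (count / n_pairs) / n_stems if n_pairs and n_stems else 0.0

    return {
        "|A-U|/L": counts["AU"] / length,
        "|G-C|/L": counts["CG"] / length,
        "|G-U|/L": counts["GU"] / length,
        "%(A-U)/n_stems": per_stem(counts["AU"]),
        "%(G-C)/n_stems": per_stem(counts["CG"]),
        "%(G-U)/n_stems": per_stem(counts["GU"]),
        "ConsecBP": max(runs) if runs else 0,
        "Avg_BP_stem": n_pairs / n_stems if n_stems else 0.0,
        "n_stems": n_stems,
    }


toy_bp = base_pair_features("GGGGAAAACCCC", "((((....))))")
print(toy_bp)
```

In the toy hairpin all four pairs are G-C and form a single stem, so ConsecBP and Avg_BP_stem both equal 4.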

2 RNAfold-related features

The frequency of the MFE structure (Freq) and the structural diversity (Diversity) were included. These features were generated using the RNAfold program [17] with the "-p" option at 37 °C, which calculates the partition function and the base-pairing probability matrix according to the algorithms proposed in [18].

Classification and performance evaluation

SVM is a machine learning approach for solving classification and regression problems. It constructs a set of hyperplanes in a high- or infinite-dimensional space and has been widely applied to biological sequence classification. Random forest is a nonparametric tree-based ensemble method that is broadly applied in machine learning and can account for interactions and correlations among features. It constructs multiple decision trees during training time and outputs the class that is the mode of the classes output by individual trees. Here, LIBSVM [19] and random forest approaches [20] were adopted to develop the predictive models for viral pre-miRNA identification.

Fivefold cross validation was applied to evaluate the performance of the predictive models. The SN, SP, precision (PRE), accuracy (ACC), balanced ACC, and Matthews correlation coefficient (MCC) were used to measure the classification performance and were defined as follows: SN = TP/(TP + FN); SP = TN/(TN + FP); PRE = TP/(TP + FP); balanced ACC = (SN + SP)/2; ACC = (TP + TN)/(TP + FP + TN + FN); and MCC = (TP × TN − FP × FN)/√((TP + FN) × (TP + FP) × (TN + FP) × (TN + FN)), where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives [21], respectively. The MCC value is between -1 and 1, where 0 is a completely random prediction, 1 is a perfect prediction, and -1 is a perfectly inverse correlation.
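The measures defined above can be computed directly from a confusion matrix, using the standard form of the MCC with a square root in the denominator. The function name and the counts in the example are hypothetical and chosen only to illustrate an imbalanced test set; they are not results reported in this study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return SN, SP, PRE, ACC, balanced ACC, and MCC from confusion counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    pre = tp / (tp + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "SN": sn,
        "SP": sp,
        "PRE": pre,
        "ACC": acc,
        "balanced_ACC": (sn + sp) / 2,
        "MCC": mcc,
    }


# Hypothetical counts: 63 positives and 189 negatives (a 1:3 imbalance).
toy_metrics = classification_metrics(tp=50, fp=25, tn=164, fn=13)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

With three times as many negatives as positives, ACC is dominated by the negative class, which is why the balanced ACC, the plain average of SN and SP, is the more informative summary for such data.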

Results

Classification results of the SVM and random forest models using different features

Table 2 shows the feature scores of the 54 features as sorted by F-score in descending order. The features with the highest F-scores, namely 1.09, 1.08, and 1.04, were "G(((", "C(((", and "G((.", respectively. The performance results from the fivefold cross validation of the SVM and random forest models conducted using different negative datasets are shown in Tables 3 and 4. The classification results showed that the SVM model had superior performance when applied to the Pseudo-8494 and human pre-miRNA datasets and that random forest had superior performance when applied to the virus genome dataset.

Table 2 F-scores of the 54 features.
Table 3 Classification results of the SVM model.
Table 4 Classification results of the random forest model.

Additionally, a compact model was constructed using the features with F-scores higher than 0.6; 40 features were used in this model. The classification results for the SVM and random forest models are shown in Tables 5 and 6. The results showed that the performance of both models increased after feature selection. The performance of the SVM model was superior to that of the random forest model for all datasets, achieving ACC values of 86.02%, 97.85%, and 90.23% when applied to the negative virus genome, Pseudo-8494, and human pre-miRNA datasets, respectively. Therefore, the SVM model was chosen as our final predictor for viral pre-miRNA identification.

Table 5 Classification results of the SVM model using the 40 features with the highest F-scores.
Table 6 Classification results of the random forest model using the 40 features with the highest F-scores.

Comparison with previous studies using an independent dataset

To compare our approach with previously proposed approaches, 63 viral pre-miRNAs from the positive dataset and 189 sequences from the virus genome dataset were set aside as an independent testing dataset, and our model for the comparison was constructed using the remaining data. Seven tools, namely Triplet-SVM, MiPred, miPred, miR-KDE, microPred, MiRenSVM, and miR-BAG, were used for the comparison. The classification results are shown in Table 7. The results showed that miPred had the highest SP (93.65%), ACC (86.9%), and MCC (0.63). Our approach had the highest SN (79.36%) as well as the highest balanced ACC (83.06%), a measure that avoids the inflated performance estimates caused by an imbalanced dataset. Some of the testing data may have been used by the previous approaches when constructing their models, which could inflate their prediction performance.

Table 7 Performance comparison with previous studies using a partial dataset.

In addition to the use of the partial dataset for independent testing, 32 newly released virus pre-miRNAs from miRBase (Version 20) were collected as the positive data, and 96 randomly selected fragments from the virus genome were generated as the negative dataset. The classification results are shown in Table 8. The results showed that Triplet-SVM had the highest specificity (91.67%) and shared the highest ACC (85.94%) with our model, ViralmiR. ViralmiR also had the highest sensitivity (78.13%), balanced ACC (83.33%), and MCC (0.64). The results showed that ViralmiR exhibited favorable performance in viral pre-miRNA identification and was superior to related predictors.

Table 8 Performance comparison with previous studies using newly released data from miRBase.

Web interface

A ViralmiR web interface was developed for identifying viral pre-miRNAs in RNA sequences. As shown in Figure 1, the ViralmiR web page provides a user-friendly interface and information related to predictive results. Users of the website can submit a sequence in the FASTA format to identify potential viral pre-miRNA. The positive dataset and three negative datasets used in this study are also provided on the website. The web server is available at http://csb.cse.yzu.edu.tw/viralmir/.

Figure 1 Web interface of ViralmiR.

Discussion and conclusion

As shown in Table 2, the features with high F-scores are related to triplet elements, with "G(((" and "C(((" having the highest F-scores. Other base-pair- and thermodynamic-related features, such as ConsecBP, dP, and %(G-C)/stems, also had high F-score values. The results showed that the G-C base-pair-related features play vital roles in viral pre-miRNA identification. In our analysis pipeline, the secondary structure of each sequence was derived using RNAfold. However, in many instances, the structure predicted using the MFE may not resemble the real structure; thus, the predicted structure of a viral pre-miRNA may not form a hairpin-like shape, affecting the performance of the predictive models. An examination of our positive dataset showed that only 219 of the 263 viral pre-miRNAs were folded into hairpin-like shapes by using the MFE. The other pre-miRNAs were folded into other shapes. Table 9 shows the number of hairpin-like and non-hairpin-like shapes in true-positive and false-negative predictions. The results show that, in the SVM model, most true-positive sequences (more than 93%) formed hairpin-like shapes, whereas most false-negative sequences (more than 77%) did not. A similar pattern was observed for the random forest model, showing that the prediction performance is highly associated with structural prediction. Therefore, further analysis of various folding parameters and window sizes is warranted to obtain a more suitable parameter combination for predicting the secondary structure of viral pre-miRNAs.

Table 9 Number of hairpin-like shapes and non-hairpin-like shapes in prediction results

A tool for predicting viral pre-miRNAs in sequences can benefit biomedical researchers who study interactions between viral miRNAs and host genes. In this study, we present a virus-specific pre-miRNA prediction model, ViralmiR, based on sequence and RNA secondary-structure information. ViralmiR achieved a balanced ACC higher than 83%, which is superior to that of previously developed predictors. The easy-to-use ViralmiR web interface has been provided as a helpful resource for researchers to use in analyzing and deciphering virus-host interactions.

Availability and requirements

The ViralmiR system is freely available at http://csb.cse.yzu.edu.tw/viralmir/.

References

  1. Perez-Quintero AL, et al: Plant microRNAs and their role in defense against viruses: a bioinformatics approach. BMC Plant Biol. 2010, 10: 138-10.1186/1471-2229-10-138.

  2. Carl JW, Trgovcich J, Hannenhalli S: Widespread evidence of viral miRNAs targeting host pathways. BMC Bioinformatics. 2013, 14 (Suppl 2): S3-

  3. Pfeffer S, et al: Identification of virus-encoded microRNAs. Science. 2004, 304 (5671): 734-6. 10.1126/science.1096781.

  4. Lecellier CH, et al: A cellular microRNA mediates antiviral defense in human cells. Science. 2005, 308 (5721): 557-60. 10.1126/science.1108784.

  5. Hansen A, et al: KSHV-encoded miRNAs target MAF to induce endothelial cell reprogramming. Genes Dev. 2010, 24 (2): 195-205. 10.1101/gad.553410.

  6. David R: Viral infection: miRNAs help KSHV lay low. Nature Reviews Microbiology. 2010, 8 (3): 158-158.

  7. Kim do N, et al: Expression of viral microRNAs in Epstein-Barr virus-associated gastric carcinoma. J Virol. 2007, 81 (2): 1033-6. 10.1128/JVI.02271-06.

  8. Kincaid RP, Sullivan CS: Virus-encoded microRNAs: an overview and a look to the future. PLoS Pathog. 2012, 8 (12): e1003018-10.1371/journal.ppat.1003018.

  9. Xue C, et al: Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics. 2005, 6: 310-10.1186/1471-2105-6-310.

  10. Jiang P, et al: MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res. 2007, 35 (Web Server issue): W339-44.

  11. Ng KL, Mishra SK: De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics. 2007, 23 (11): 1321-30. 10.1093/bioinformatics/btm026.

  12. Chang DT, Wang CC, Chen JW: Using a kernel density estimation based classifier to predict species-specific microRNA precursors. BMC Bioinformatics. 2008, 9 (Suppl 12): S2-10.1186/1471-2105-9-S12-S2.

  13. Batuwita R, Palade V: microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics. 2009, 25 (8): 989-995. 10.1093/bioinformatics/btp107.

  14. Ding J, Zhou S, Guan J: MiRenSVM: towards better prediction of microRNA precursors using an ensemble SVM classifier with multi-loop features. BMC Bioinformatics. 2010, 11 (Suppl 11): S11-10.1186/1471-2105-11-S11-S11.

  15. Jha A, et al: miR-BAG: bagging based identification of microRNA precursors. PLoS One. 2012, 7 (9): e45782-10.1371/journal.pone.0045782.

  16. Li SC, Shiau CK, Lin WC: Vir-Mir db: prediction of viral microRNA candidate hairpins. Nucleic Acids Res. 2008, 36 (Database issue): D184-9.

  17. Lorenz R, et al: ViennaRNA Package 2.0. Algorithms Mol Biol. 2011, 6: 26-10.1186/1748-7188-6-26.

  18. McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29 (6-7): 1105-19. 10.1002/bip.360290621.

  19. Chang C-C, Lin C-J: LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol. 2011, 2 (3): 1-27.

  20. Breiman L: Random Forests. Machine Learning. 2001, 45: 5-32. 10.1023/A:1010933404324.

  21. Han J, Kamber M: Data mining: concepts and techniques. 2006, Morgan Kaufmann, 2nd edition.

Acknowledgements

The authors would like to thank the Ministry of Science and Technology of the Republic of China for financially supporting this research under grant nos. MOST 103-2221-E-038-013-MY2, 103-2221-E-155-020-MY3, and 103-2633-E-155-002.

Declarations

The authors approved the submission of this paper to BMC Bioinformatics for publication. The payment of publishing charges to BioMed Central for this article was supported by the Ministry of Science and Technology of the Republic of China, grant nos. MOST 103-2221-E-038-013-MY2, 103-2221-E-155-020-MY3, and 103-2633-E-155-002.

This article has been published as part of BMC Bioinformatics Volume 16 Supplement 1, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S1

Author information

Corresponding author

Correspondence to Tzu-Hao Chang.

Additional information

Competing interests

The authors declare that there are no competing interests.

Authors' contributions

TYL and THC conceived and supervised the study and drafted the manuscript. KYH and YCT were responsible for the design, computational analyses, and implementation of the system. All authors read and approved the final manuscript.

Kai-Yao Huang, Tzong-Yi Lee contributed equally to this work.

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Huang, KY., Lee, TY., Teng, YC. et al. ViralmiR: a support-vector-machine-based method for predicting viral microRNA precursors. BMC Bioinformatics 16 (Suppl 1), S9 (2015). https://doi.org/10.1186/1471-2105-16-S1-S9
