ViralmiR: a support-vector-machine-based method for predicting viral microRNA precursors

Background
microRNAs (miRNAs) play a vital role in development, oncogenesis, and apoptosis by binding to mRNAs to regulate the posttranscriptional level of coding genes in mammals, plants, and insects. Recent studies have demonstrated that the expression of viral miRNAs is associated with the ability of the virus to infect a host. Identifying potential viral miRNAs from experimental sequence data is valuable for deciphering virus-host interactions. Thus far, a specific predictive model for viral miRNA identification has yet to be developed.

Methods and results
Here, we present ViralmiR for identifying viral miRNA precursors on the basis of sequencing and structural information. We collected 263 experimentally validated miRNA precursors (pre-miRNAs) from 26 virus species and generated sequencing fragments from virus and human genomes as the negative dataset. Support vector machine and random forest models were established using 54 features from RNA sequences and secondary structural information. The results show that ViralmiR achieved a balanced accuracy higher than 83%, which is superior to that of previously developed tools for identifying pre-miRNAs.

Conclusions
The easy-to-use ViralmiR web interface has been provided as a helpful resource for researchers to use in analyzing and deciphering virus-host interactions. The web interface of ViralmiR can be accessed at http://csb.cse.yzu.edu.tw/viralmir/.

Introduction
microRNAs (miRNAs) are non-protein-coding RNAs, approximately 22 nucleotides long, that induce the degradation of mRNAs by complementarily binding to the 3' untranslated regions of target genes. Recent studies have demonstrated that miRNAs play a vital role in development, oncogenesis, and apoptosis by binding to mRNAs to regulate the posttranscriptional level of coding genes in mammals, plants, and insects. In addition, miRNAs modulate viral existence in plants and animals by targeting viruses [1,2]. Conversely, viruses also produce miRNAs of their own [3]. Recent studies have demonstrated that the expression of viral miRNAs is associated with the ability of the virus to infect a host [2,4].
Recently, several approaches have been developed for computationally identifying miRNA precursors (pre-miRNAs) [9][10][11][12][13][14][15]. Various classifiers, such as the support vector machine (SVM), random forest, relaxed variable kernel density estimator (RVKDE), and bootstrap aggregating, have been applied, each with different features generated from sequences and secondary structural information. The characteristics of the approaches are summarized in Table 1.
Triplet-SVM [9] involves applying an SVM to human data by using features from local, contiguous structure-sequence information for distinguishing the hairpins of real pre-miRNAs from pseudo pre-miRNAs. Each hairpin is encoded as a set of 32 triplet elements. MiPred [10] involves applying a random forest machine learning algorithm to human data by using a hybrid feature, which consists of the 32 features used in Triplet-SVM and the minimum free energy (MFE) of the secondary structure, and using a P-value randomization test to distinguish the real pre-miRNAs from other hairpin sequences with similar stem-loops (pseudo pre-miRNAs). The de novo SVM classifier miPred [11] identifies pre-miRNAs without relying on phylogenetic conservation; 17 primary sequence features, 5 secondary structural features, and 7 normalized features are used in the model. The miR-KDE tool [12] was developed using the novel RVKDE classifier, which exploits local information, and is particularly suitable for predicting species-specific pre-miRNAs. Each hairpin-like sequence is summarized as a 33-dimensional feature vector, including the 29 features used in miPred and 4 stem-loop features. The microPred tool [13] uses effective machine learning methods for distinguishing human pre-miRNA hairpins from both pseudohairpins and other ncRNAs. Each hairpin is encoded using the 29 features from miPred, 6 RNAfold-related features, 4 Mfold-related features, 7 base-pair features, and 2 MFE-related features. The classification results showed reliability in both sensitivity (SN) and specificity (SP). The MiRenSVM tool [14] is an ensemble-SVM classification system for detecting miRNA genes, especially those with multiloop secondary structures; 8 triplet structural features, 8 base-pair group features, and 16 thermodynamic group features are extracted from hairpin-like sequences.
The miR-BAG tool [15] uses a bootstrap-aggregation-based machine learning approach to identify miRNA candidate regions in genomes by scanning sequences. Comparative analysis results showed that miR-BAG performed more favorably than the previous six tools did. A next-generation sequencing module was combined with miR-BAG to provide high-throughput data analysis. Vir-Mir db [16] is a database of predicted viral miRNA candidate hairpins, which were identified in virus genomes by using prediction filters based on Srnaloop, sequences and structures, and open reading frames.
Most of the previously developed approaches emphasized identifying pre-miRNAs in humans, plants, and other animals. Thus far, a method designed specifically for identifying viral pre-miRNAs has not been developed. Therefore, we collected experimentally validated viral pre-miRNA data and constructed a predictive model by using several sequence and structural features. This model can assist biological researchers who study virus-host interactions in identifying potential viral miRNAs in experimental sequencing data.

Datasets
The positive dataset was collected from miRBase (Version 19) and comprised 263 pre-miRNAs, containing 437 mature miRNAs, from 26 virus species. The negative dataset consisted of three types of sequences, namely the virus genome, human pre-miRNAs, and Pseudo-8494. The virus genome dataset was composed of 789 randomly selected fragments of 120 bp from the virus genome; fragments containing positive data were removed. The human pre-miRNA dataset contained 1600 human pre-miRNAs collected from miRBase, from which redundant or highly similar sequences had been removed. The third negative dataset was obtained from Xue et al. [9] and has been used in miPred, miR-KDE, and other tools. We named this benchmark negative dataset "Pseudo-8494" because it was composed of 8494 fragments from the coding regions of human chromosome 19.
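The construction of the virus genome negative set can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function name, the half-open coordinate convention for positive regions, the fixed random seed, and the choice to reject a window on any overlap with a positive region are all assumptions made for the example.

```python
import random


def sample_negative_fragments(genome, positive_regions, n_fragments,
                              length=120, seed=0, max_tries=100000):
    """Draw random fixed-length windows that avoid known pre-miRNA loci.

    genome           -- nucleotide string (hypothetical input)
    positive_regions -- list of (start, end) half-open intervals (assumed convention)
    """
    if len(genome) < length:
        raise ValueError("genome is shorter than the fragment length")
    rng = random.Random(seed)
    fragments = []
    tries = 0
    while len(fragments) < n_fragments and tries < max_tries:
        tries += 1
        start = rng.randrange(0, len(genome) - length + 1)
        end = start + length
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < p_end and p_start < end
               for p_start, p_end in positive_regions):
            continue  # assumption: discard any window touching a positive locus
        fragments.append((start, genome[start:end]))
    return fragments


toy_genome = "ACGU" * 100          # 400-nt toy sequence, not real data
toy_positives = [(100, 180)]       # hypothetical pre-miRNA locus
negatives = sample_negative_fragments(toy_genome, toy_positives, n_fragments=5)
```

Rejection sampling keeps the sketch short; for genomes densely covered by positive loci, enumerating the admissible start positions first would be the more predictable choice.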

Feature extraction and selection
The SVM and random forest classification methods were applied to develop predictive models for viral pre-miRNA identification. The minimum free energy and base-pair-related information were obtained using RNAfold from the Vienna RNA package [17]. Fifty-four features were selected from previous research [9,11,13], and the feature score (F-score) [14] was used to evaluate the discriminative power of each feature. The features used in our model are described in the following paragraphs.
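The text does not spell out the F-score formula. The sketch below assumes the commonly used two-class definition, in which the squared distances of the class means from the overall mean are divided by the sum of the within-class sample variances. The function names and the toy feature values are hypothetical; only the 0.6 cutoff is taken from the Results.

```python
def f_score(pos, neg):
    """Two-class F-score of one feature (assumed definition, see lead-in)."""
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    # Numerator: separation of each class mean from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sum of the unbiased within-class variances.
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1)
    within = var_pos + var_neg
    if within == 0:
        return 0.0 if between == 0 else float("inf")
    return between / within


def select_features(feature_values, threshold=0.6):
    """Return (name, score) pairs above the threshold, highest score first."""
    scored = [(name, f_score(pos, neg))
              for name, (pos, neg) in feature_values.items()]
    kept = [(name, score) for name, score in scored if score > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy values for two hypothetical features: (positive class, negative class).
toy = {
    "feature_a": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "feature_b": ([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]),
}
selected = select_features(toy)
```

A larger F-score means the feature separates the two classes better on its own; the measure ignores interactions between features, which is one reason the classifiers are still trained on the selected set as a whole.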

triplet elements
The local contiguous triplet-structure composition was defined as in Triplet-SVM [9]. In the predicted secondary structure, a parenthesis, "(" or ")", denotes a paired nucleotide and a dot "." denotes an unpaired nucleotide. Generally, the opening parenthesis "(" represents a paired nucleotide located near the 5′-end that can be paired with another nucleotide at the 3′-end, which is denoted by the closing parenthesis ")". This study used "(" for both situations. For any three adjacent nucleotides, there are eight ($2^3$) possible structure compositions; combined with the identity of the middle nucleotide (A, C, G, or U), these give the 32 triplet elements, such as "G(((" and "C(((".
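The encoding described above can be illustrated with a short sketch. It assumes that the sequence and its dot-bracket structure are already available (for example, from RNAfold), that every interior position contributes one window, and that counts are normalized to frequencies. The function name and these normalization details are illustrative assumptions, not taken from the paper.

```python
from itertools import product


def triplet_features(seq, structure):
    """Frequencies of the 32 triplet elements (middle nucleotide + 3 structure symbols)."""
    seq = seq.upper().replace("T", "U")
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    # Both "(" and ")" mark a paired nucleotide, so ")" is folded into "(".
    paired = structure.replace(")", "(")
    keys = [nt + "".join(sym) for nt in "ACGU" for sym in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    for i in range(1, len(seq) - 1):
        key = seq[i] + paired[i - 1:i + 2]
        if key in counts:          # skips windows with non-ACGU symbols
            counts[key] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Toy hairpin: three G-C pairs closing a three-nucleotide loop.
feats = triplet_features("GGGAAACCC", "(((...)))")
```

For the toy hairpin, the seven interior windows yield seven distinct elements ("G(((", "G((.", "A(..", "A...", "A..(", "C.((", and "C((("), each with frequency 1/7.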

sequential features
The GC content ratio (%G+C), sequence length, hairpin length, and loop length, derived from the RNA sequence and its secondary structure, were used in the model.

thermodynamic features
The dP, dG, zP, zG, MFEI1, and MFEI2 features were chosen from miPred [11], and the MFEI3 and MFEI4 features were selected from microPred [13]. The feature dP is the total number of base pairs present in the RNA secondary structure S divided by the length L in nucleotides; dG is the ratio of the MFE to L, which measures the thermodynamic stability of the RNA structure S. The features zP and zG are the normalized variants of dP and dG; for each original sequence, 1000 random sequences were generated as the background for normalization. The feature MFEI1 is the ratio of dG to %G+C; MFEI2 is the ratio of dG to the number of stems, which are structural motifs containing more than three contiguous base pairs; MFEI3 is the ratio of dG to the number of loops in the secondary structure; and MFEI4 is the ratio of the MFE to the total number of base pairs in the secondary structure.

... given structural sequence, and Avg_BP_stem is the ratio of the total number of base pairs to the number of stems in the secondary structure, where a stem is a structural motif containing more than three contiguous stacked base pairs, as defined in miPred.
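The ratios defined above translate directly into code. The sketch below assumes that the MFE, the number of stems, and the number of loops are supplied by an upstream folding step, and that %G+C is expressed as a percentage; both are assumptions of this example, as are the function name and the toy input values. The normalized features zP and zG are omitted because they depend on the shuffling procedure.

```python
def thermodynamic_features(seq, structure, mfe, n_stems, n_loops):
    """Compute dP, dG, and MFEI1 to MFEI4 from a folded sequence.

    mfe, n_stems, and n_loops are assumed to come from an external tool.
    """
    def ratio(numerator, denominator):
        # Undefined ratios are reported as NaN instead of raising.
        return numerator / denominator if denominator else float("nan")

    length = len(seq)
    n_bp = structure.count("(")                    # one "(" per base pair
    gc_percent = 100.0 * sum(seq.upper().count(b) for b in "GC") / length
    dP = n_bp / length                             # base pairs per nucleotide
    dG = mfe / length                              # MFE per nucleotide
    return {
        "dP": dP,
        "dG": dG,
        "MFEI1": ratio(dG, gc_percent),            # dG / %G+C
        "MFEI2": ratio(dG, n_stems),               # dG / number of stems
        "MFEI3": ratio(dG, n_loops),               # dG / number of loops
        "MFEI4": ratio(mfe, n_bp),                 # MFE / number of base pairs
    }


# Toy hairpin; the MFE of -1.8 kcal/mol is a made-up illustrative value.
thermo = thermodynamic_features("GGGAAACCC", "(((...)))", mfe=-1.8,
                                n_stems=1, n_loops=1)
```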

RNAfold-related features
The frequency of the MFE structure (Freq) and the structural diversity (Diversity) were included. These features were generated using the RNAfold program [17] with the "-p" option at 37°C, which calculates the partition function and the base-pairing probability matrix according to the algorithms proposed in [18].

Classification and performance evaluation
SVM is a machine learning approach for solving classification and regression problems. It constructs a set of hyperplanes in a high- or infinite-dimensional space and has been widely applied to biological sequence classification. Random forest is a nonparametric tree-based ensemble method that is broadly applied in machine learning and can account for interactions and correlations among features. It constructs multiple decision trees during training and outputs the class that is the mode of the classes output by the individual trees. Here, LIBSVM [19] and random forest approaches [20] were adopted to develop the predictive models for viral pre-miRNA identification. Predictive performance is reported in terms of sensitivity (SN), specificity (SP), accuracy (ACC), and the Matthews correlation coefficient (MCC), which in their standard forms are

$$\mathrm{SN} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives [21], respectively. The MCC value is between -1 and 1, where 0 is a completely random prediction, 1 is a perfect prediction, and -1 is a perfectly inverse correlation.
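The performance measures used in this study (SN, SP, ACC, and MCC) can be computed from the four confusion-matrix counts as in the sketch below. Balanced ACC is taken here to be the mean of SN and SP, which is the usual definition and an assumption of this example. The counts in the usage lines (TP = 25, FN = 7, TN = 85, FP = 11) are back-calculated from the percentages reported for ViralmiR on the miRBase Version 20 test set (32 positives, 96 negatives); they are not stated in the text.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fp + tn + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "SN": sn,
        "SP": sp,
        "ACC": acc,
        "BalancedACC": (sn + sp) / 2,        # assumed definition: mean of SN and SP
        "MCC": mcc,
    }


# Inferred counts (see lead-in): 32 positives, 96 negatives.
m = classification_metrics(tp=25, fp=11, tn=85, fn=7)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because the negative class is three times larger than the positive class in this test set, ACC is dominated by SP; the balanced ACC weights both classes equally.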

Results
Classification results of the SVM and random forest models using different features
Table 2 shows the feature scores of the 54 features as sorted by F-score in descending order. The features with the highest F-scores, namely 1.09, 1.08, and 1.04, were "G(((", "C(((", and "G((.", respectively. The performance results from the fivefold cross validation of the SVM and random forest models conducted using different negative datasets are shown in Tables 3 and 4. The classification results showed that the SVM model had superior performance when applied to the Pseudo-8494 and human pre-miRNA datasets and that random forest had superior performance when applied to the virus genome dataset.
Additionally, a compact model was constructed using the features with F-scores higher than 0.6; 40 features were used in this model. The classification results for the SVM and random forest models are shown in Tables 5 and 6.
The results showed that the performance of both models increased after feature selection. The performance of the SVM model was superior to that of the random forest model for all datasets, achieving ACC values of 86.02%, 97.85%, and 90.23% when applied to the negative virus genome, Pseudo-8494, and human pre-miRNA datasets, respectively. Therefore, the SVM model was chosen as our final predictor for viral pre-miRNA identification.

Comparison with previous studies using an independent dataset
For a comparison of our approach with previously proposed approaches, 63 viral pre-miRNAs from the positive dataset and 189 sequences from the virus genome dataset were set aside as an independent testing dataset, and our model for the comparison was constructed using the remaining data. Seven tools, namely Triplet-SVM, MiPred, miPred, miR-KDE, microPred, MiRenSVM, and miR-BAG, were used for the comparison. The classification results are shown in Table 7. The results showed that miPred had the highest SP (93.65%), ACC (86.9%), and MCC (0.63). Our approach had the highest SN (79.36%) as well as the highest balanced ACC (83.06%), a measure that corrects for the inflation of performance estimates caused by the use of an imbalanced dataset. Some of the testing data may have been used by the previous approaches when constructing their models, potentially inflating their prediction performance. Therefore, in addition to the partial dataset held out for independent testing, 32 newly released virus pre-miRNAs from miRBase (Version 20) were collected as the positive data, and 96 randomly selected fragments from the virus genome were generated as the negative dataset. The classification results are shown in Table 8. The results showed that Triplet-SVM had the highest specificity (91.67%) and ACC (85.94%). Our model, ViralmiR, had the highest sensitivity (78.13%), ACC (85.94%), balanced ACC (83.33%), and MCC (0.64). The results showed that ViralmiR exhibited favorable performance in viral pre-miRNA identification and was superior to related predictors.

Web interface
A ViralmiR web interface was developed for identifying viral pre-miRNAs in RNA sequences. As shown in Figure 1, the ViralmiR web page provides a user-friendly interface and information related to predictive results. Users of the website can submit a sequence in the FASTA format to identify potential viral pre-miRNAs. The positive dataset and three negative datasets used in this study are also provided on the website. The web server is available at http://csb.cse.yzu.edu.tw/viralmir/.

Discussion and conclusion
As shown in Table 2, the features with high F-scores are related to triplet elements, with "G(((" and "C(((" having the highest F-scores. Other base-pair- and thermodynamic-related features, such as Con-secBP, dP, and %(G-C)/stems, also had high F-scores. These results show that G-C base-pair-related features play vital roles in viral pre-miRNA identification. In our analysis pipeline, the secondary structure of each sequence was derived using RNAfold. However, in many instances, the structure predicted using the MFE may not resemble the real structure; the predicted structure of a viral pre-miRNA may then fail to form a hairpin-like shape, which affects the performance of the predictive models. An examination of our positive dataset showed that only 219 of the 263 viral pre-miRNAs were folded into hairpin-like shapes by using the MFE; the others were folded into other shapes. Table 9 shows the number of hairpin-like and non-hairpin-like shapes in true-positive and false-negative predictions. In the SVM model, most true-positive sequences (more than 93%) were folded into hairpin-like shapes, whereas most false-negative sequences (more than 77%) were not. A similar situation was observed in the random forest model, showing that the prediction performance is highly associated with structural prediction. Therefore, further analysis of various folding parameters and window sizes is warranted to obtain a more suitable parameter combination for predicting the secondary structure of viral pre-miRNAs.
A tool for predicting viral pre-miRNAs in sequences can benefit biomedical researchers who study interactions between viral miRNAs and host genes. In this study, we present a virus-specific pre-miRNA prediction model, ViralmiR, based on sequence and RNA secondary-structure information.
ViralmiR achieved a balanced ACC higher than 83%, which is superior to that of previously developed predictors. The easy-to-use ViralmiR web interface has been provided as a helpful resource for researchers to use in analyzing and deciphering virus-host interactions.