A genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs

Li, Dingfang; Luo, Longqiang; Zhang, Wen; Liu, Feng; Luo, Fei

doi:10.1186/s12859-016-1206-3

Research article
Open access
Published: 31 August 2016

A genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs

Dingfang Li¹,
Longqiang Luo¹,
Wen Zhang^2,3,
Feng Liu⁴ &
…
Fei Luo^2,3

BMC Bioinformatics volume 17, Article number: 329 (2016) Cite this article

5022 Accesses
56 Citations
2 Altmetric
Metrics details

Abstract

Background

Predicting piwi-interacting RNA (piRNA) is an important topic in the small non-coding RNAs, which provides clues for understanding the generation mechanism of gamete. To the best of our knowledge, several machine learning approaches have been proposed for the piRNA prediction, but there is still room for improvements.

Results

In this paper, we develop a genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs. We construct datasets for three species: Human, Mouse and Drosophila. For each species, we compile the balanced dataset and imbalanced dataset, and thus obtain six datasets to build and evaluate prediction models. In the computational experiments, the genetic algorithm-based weighted ensemble method achieves 10-fold cross validation AUC of 0.932, 0.937 and 0.995 on the balanced Human dataset, Mouse dataset and Drosophila dataset, respectively, and achieves AUC of 0.935, 0.939 and 0.996 on the imbalanced datasets of three species. Further, we use the prediction models trained on the Mouse dataset to identify piRNAs of other species, and the models demonstrate the good performances in the cross-species prediction.

Conclusions

Compared with other state-of-the-art methods, our method can lead to better performances. In conclusion, the proposed method is promising for the transposon-derived piRNA prediction. The source codes and datasets are available in https://github.com/zw9977129/piRNAPredictor.

Background

Non-coding RNAs (ncRNAs) are defined as the important functional RNA molecules which are not translated into proteins [1, 2]. According to lengths, ncRNAs are classified into two types: long ncRNAs and short ncRNAs. Usually, long ncRNAs consists of more than 200 nucleotides [3, 4]. Short ncRNAs having 20 ~ 32 nt are defined as small ncRNAs, such as small interfering RNA (siRNA), microRNA (miRNA) and piwi-interacting RNA (piRNA) [5]. piRNA is a distinct class of small ncRNAs expressed in animal cells, especially in germline cells, and the length of piRNA sequences ranges from 26 to 32 in general [6–8]. Compared with miRNA, piRNA lacks conserved secondary structure motifs, and the presence of a 5’ uridine is usually observed in both vertebrates and invertebrates [5, 9, 10].

piRNAs play an important role in the transposon silencing [11–15]. About nearly one-third of the fruit fly and one-half of human genomes are transposon elements. These transposons move within the genome and induce insertions, deletions, and mutations, which may cause the genome instability. piRNA pathway is an important genome defense mechanism to maintain genome integrity. Loaded into PIWI proteins, piRNAs serve as a guide to target the transposon transcripts by sequence complementarity with mismatches, and then the transposon transcripts will be cleaved and degraded, producing secondary piRNAs, which is called ping-pong cycle in fruit fly [13–17]. Therefore, predicting transposon-derived piRNAs provides biological significance and insights into the piRNA pathway.

The wet method utilizes immunoprecipitation and deep sequencing to identify piRNAs [18]. Since piRNAs are diverse and non-conserved, wet methods are time-consuming and costly [5, 9, 10]. Since the development of information science, the piRNA prediction based on the known data becomes an alternative. As far as we know, several computational methods have been proposed for piRNA prediction. Betel et al. developed the position-specific usage method to recognize piRNAs [19]. Zhang et al. utilized a k-mer feature, and adopted support vector machine (SVM) to build the classifier (named piRNApredictor) for piRNA prediction [20]. Wang et al. proposed a method named Piano to predict piRNAs, by using piRNA-transposon interaction information and SVM [21]. These methods exploited different features of piRNAs, and build the prediction models by using machine learning methods.

Motivated by previous works, we attempt to differentiate transposon-derived piRNAs from non-piRNAs based on the sequential and physicochemical features. As far as we know, there are several critical issues for developing high-accuracy models. Firstly, the accuracy of models is highly dependent on the diversity of features. In order to achieve high-accuracy models, we should consider as many sequence-derived features as possible. Secondly, how to effectively combine various features for high-accuracy models is very challenging. In the previous work [22], we adopted the exhaustive search strategy to combine five sequence-derived features to predict piRNAs, and used the AUC scores of individual feature-based models as weights in the ensemble learning. However, the method can’t effectively integrate a great amount of features (NP-hard complexity: 2^N-1 combinations of features, N is the number of features), and the determination of weights is arbitrary.

In this paper, we develop a genetic algorithm-based weighted ensemble method (GA-WE) to effectively integrate twenty-three discriminative features for the piRNA prediction. Specifically, individual features-based models are constructed as base learners, and the weighted average of their outputs is adopted as the final scores in the stage of prediction. Genetic algorithm (GA) is to search for the optimal weights for the base learners. Moreover, the proposed method can determine the weights for each base learner in a self-tune manner.

We construct datasets for three species: Human, Mouse and Drosophila. For each species, we compile the balanced dataset and imbalanced dataset, and thus obtain six datasets to build and evaluate prediction models. In the 10-fold cross validation experiments, the GA-WE method achieves AUC of 0.932, 0.937 and 0.995 on the balanced Human dataset, Mouse dataset and Drosophila dataset, respectively, and achieves AUC of 0.935, 0.939 and 0.996 on the imbalanced datasets of three species. Further, we use the prediction models trained on the Mouse dataset to identify piRNAs of other species. The results demonstrate that the models can produce good performances in the cross-species prediction. Compared with other state-of-the-art methods, our method produces better performances as well as good robustness. Therefore, the proposed method is promising for the transposon-derived piRNA prediction.

Methods

Datasets

In this paper, we construct datasets for three species: Human, Mouse and Drosophila, and use them to build prediction models and make evaluations.

As shown in Table 1 , raw real piRNAs, raw non-piRNA ncRNAs and transposons are downloaded from NONCODE version 3.0 [23], UCSC Genome Browser [24] and NCBI Gene Expression Omnibus [18, 25]. NONCODE is an integrated knowledge database about non-coding RNAs [23]. The UCSC Genome Browser is an interactive website offering access to genome sequence data from a variety of vertebrate and invertebrate species, integrated with a large collection of aligned annotations [24]. The NCBI Gene Expression Omnibus is the largest fully public repository for high-throughput molecular abundance data, primarily gene expression data [18, 25].

Table 1 Raw data about three species

Full size table

The datasets are compiled from the raw data (Table 1). By aligning raw real piRNAs to transposons with SeqMap (three mismatches at most) [26], the aligned real piRNAs are transposon-matched piRNAs, and they are adopted as the set of real piRNAs. The length of real piRNAs ranges from 16 to 35. To meet the length range of real piRNAs, we remove non-piRNA ncRNAs shorter than 16, and cut non-piRNA ncRNAs longer than 35 by simulating length distribution of real piRNAs. The cut sequences are then aligned to transposons, and the aligned ones are used as the set of pseudo piRNAs. The real piRNAs and the pseudo piRNAs for three species are shown in Table 2. In order to the build prediction models, we build the datasets based on real piRNAs and pseudo piRNAs. To avoid the data bias caused by different size of positive instances and negative instances, we construct both balanced datasets and imbalanced datasets for three species. For balanced datasets, all real piRNAs are adopted as positive instances, and we sample the same number of pseudo piRNAs as negative instances. For imbalanced datasets, we use all real piRNAs and pseudo piRNAs as positive instances and negative instances.

Table 2 Number of real piRNAs and pseudo piRNA

Full size table

Features of piRNAs

For prediction, we should explore informative features that can characterize piRNAs and convert variable-length piRNA sequences into fixed-length feature vectors. Here, we consider various potential features that are widely used in biological sequence prediction. Among these features, six features have been utilized for the piRNA prediction, while the rest are taken into account for the first time. These sequence-derived features are briefly introduced as follows.

Spectrum profile: k-spectrum profile, also named k-mer feature, is to count the occurrences of k-mers (k-length contiguous strings) in sequences (k ≥ 1), and its success has been proved by numerous bioinformatics applications [27–30].

Mismatch profile: (k, m)-mismatch profile also counts the occurrences of k-mers, but allows max m (m < k) inexact matching, which is the penalization of spectrum profile [30, 31].

Subsequence profile: (k, w)-subsequence profile considers not only the contiguous k-mers but also the non- contiguous k-mers, and the penalty factor w (0 ≤ w ≤ 1) is used to penalize the gap of non-contiguous k-mers [30, 32].

Reverse compliment k-mer (k-RevcKmer): k-RevcKmer is a variant of the basic k-mer, in which the k-mers are not expected to be strand-specific [29, 33, 34].

Parallel correlation pseudo dinucleotide composition (PCPseDNC): PCPseDNC is proposed to avoid losing the physicochemical properties of dinucleotides. PCPseDNC of a sequence consists of two components, the first component represents the occurrences of different dinucleotides, while the other component reflects the physicochemical properties of dinucleotides [28, 29, 35].

Three features: parallel correlation pseudo trinucleotide composition (PCPseTNC), series correlation pseudo dinucleotide composition (SCPseDNC) and series correlation pseudo trinucleotide composition (SCPseTNC) are similar to the PCPseDNC. PCPseTNC considers the occurrences of trinucleotides and their physicochemical properties, and SCPseDNC and SCPseTNC consider series correlations of physicochemical properties of dinucleotides or trinucleotides [28, 29, 35, 36].

Sparse profile [37] and position-specific scoring matrix (PSSM) [38–40] are usually generated from the fixed-length sequences. Here, we use a simple strategy to process the variable-length sequences, and obtain the features. We truncate the first d nucleotides of long sequences which lengths are more than d, and extend short sequences which lengths are less than d by adding the null character. Here, ‘E’ represent the null character, which are added to the short sequences to meet the length d. In this way, all variable-length sequences are converted into fixed-length sequences, and the fixed-length sequences consist of five letters {A, C, G, T, E}. For the sparse profile, by encoding each letter of sequence as a 5-bit vector with 4 bits set to zero and 1 bit set to one, the sparse profile of a sequence is obtained by merging the bit vector for its letters. For the PSSM feature, PSSM can be calculated on the fixed-length sequences consisted of five letters {A, C, G, T, E} [38–40]. Given a new sequence, it is truncated or extended, and then is encoded by PSSM as the feature vector. The PSSM representation of sequence x = R ₁ R ₂ … R _d is defined as:

$$ {f_d}^{PSSM}(x)=\left( score\;\left({R}_1\right),\; score\left({R}_2\right),\dots,\;score\left({R}_d\right)\right) $$

where

$$ score\left({R}_k\right)=\left\{\begin{array}{l}m\left({R}_k\right),\kern0.75em {R}_k\in \left\{A,C,G,T\right\}\\ {}0,\kern2.75em {R}_k=E\end{array}\right.,k=1,2,\dots, d $$

and m(R _k) represents the score of R _k in the k-th column of PSSM, if R _k ∈ {A, C, G, T}, k = 1, 2, …, d.

Local structure-sequence triplet elements (LSSTE): LSSTE adopts the piRNA-transposon interaction information to extract 32 different triplet elements, which contain the structural information of transposon-piRNA alignment as well as the piRNA sequence information [21, 41, 42].

A total of twenty-three feature vectors are finally obtained, and they are summarized in Table 3.

Table 3 Twenty-three sequence-derived features

Full size table

The GA-based weighted ensemble method

In the view of information science, a variety of features can bring diverse information, and the combination of various features can lead to better performance than individual features [22, 43–46]. Ensemble learning is a sophisticated feature combination technique widely used in bioinformatics. Its success has been proved by numerous bioinformatics applications, such as the prediction of B-cell epitopes [44] and the prediction of immunogenic T-cell epitopes [45].

To the best of our knowledge, there are two crucial issues for designing good ensemble systems, i.e. base learners and combination rules. First, the training sequences are encoded into different feature vectors, respectively, and multiple base learners are constructed on these feature vectors by using classification engines. We compare two most popular classification methods, random forest (RF) [47] and support vector machine (SVM) [48] (results are given in the section ‘Results and Discussion’), and finally adopt RF as the basic classification engine because of its high efficiency and high accuracy. Then, how to combine the outputs of base learners is crucial for the success of our ensemble system. Our ensemble learning adopts the weighted average of outputs from base learners as the final score. However, the determination of weights is difficult. In this paper, we develop a genetic algorithm (GA)-based weighted ensemble method, which can automatically determine the optimal weights and construct high-accuracy prediction models.

Given N features, we can construct N base learners: f ₁, f ₂, …, f _N on training set. w ₁, w ₂, …, w _N (∑ ^N_i = 1 w _i, 0 ≤ w _i ≤ 1, i = 1, 2, …, N) represent the corresponding weights. For a testing sequence x, f _i(x) ∈ [0, 1] represents the probability of predicting x as real piRNA, i = 1, 2, …, N, and the final predicted results of the weighted ensemble model is given as:

$$ F(x)={\displaystyle {\sum}_{i=1}^N{w}_i{f}_i(x)} $$

As discussed above, the optimal weights are very important for the weighted ensemble model. We consider the determination of weights as an optimization problem and adopt the genetic algorithm (GA) to search the optimal weights. GA is a search approach that simulates the process of natural selection. It can effectively search the interesting space and easily solve complex problems without requiring the prior knowledge about the space. Here, we use the adaptive genetic algorithm [49]. In the adaptive genetic algorithm, crossover probability and mutation probability are dynamically adjusted according to the fitness scores of chromosomes. The size of an initial population is 1000 chromosomes, and the iteration number is 500.

The flowchart of the GA-WE method is shown in Fig. 1. In each training-testing process, the dataset is split into the training set, the validation set and the testing set. In the GA optimization, a chromosome represents weights. For each chromosome (weights), the weighted ensemble model is constructed on the training set, and makes predictions for the validation set. The AUC score of the weighted ensemble model on the validation set is taken as the fitness of the chromosome. After randomly generating an initial population, the population is updated by three operators: selection, crossover and mutation, and the best individual with a chromosome will be obtained. Finally, the weighted ensemble model with the optimal weights is used to make predictions for the testing set.

Results and discussion

Performance evaluation metrics

The proposed methods are evaluated by the 10-fold cross validation (10-CV). In the 10-CV, a dataset is randomly split into 10 subsets with equal size. For each round of 10-CV, 8 subsets are used as the training set, 1 subset is used as the validation set and the rest one is considered as the testing set. Prediction models are constructed on the training set, and the parameters or optimal weights of models are determined on the validation set. Finally, optimized prediction models are adopted to predict the testing set. This processing is repeated until all subsets are ever used for testing.

Here, we adopt several metrics to assess the performances of prediction models, including the accuracy (ACC), sensitivity (SN), specificity (SP) and the AUC score (the area under the ROC curve). These metrics are defined as:

$$ SN=\frac{TP}{TP+FN} $$

$$ SP=\frac{TN}{TN+FP} $$

$$ ACC=\frac{TP+TN}{TP+TN+FP+FN} $$

Where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives, respectively. The ROC curve is plotted by using the false positive rate (1-specificity) against the true positive rate (sensitivity) for different cutoff thresholds. Here, we consider the AUC as the primary metric, for it assesses the performance regardless of any threshold.

Parameters of various features

As shown in Table 3, we consider twenty-three sequence-derived features to develop prediction models. Since subsequence profile, PCPseDNC, PCPseTNC, SCPseDNC, SCPseTNC, sparse profile and PSSM have parameters, we discuss how to determine the parameters based on the balanced Human dataset, and use them in the following studies. Considering the parameter λ and d refer to the length of piRNAs, we analyze the length distribution of piRNAs in three species (Human, Mouse and Drosophila). As shown in Fig. 2, the length of piRNAs ranges from 16 to 35, and reaches the peak at 30 for Human and Mouse, and 25 for Drosophila. Here, the impacts of parameters are evaluated according to the 10-CV performances of corresponding models.

In the mismatch profile, the parameter m represents the max mismatches. Here, we assume that m does not exceed one third of length of k-mers. Therefore, (3, 1)-mismatch profile, (4, 1)-mismatch profile and (5, 1)-mismatch profile are used.

In the subsequence profile, the parameter w represents the gap penalty of non-contiguous k-mers. As shown in Fig. 3 (a), w = 1 produces the best AUC scores for (3, w)-subsequence profile, (4, w)- subsequence profile and (5, w)-subsequence profile. Therefore, (3, 1)-subsequence profile, (4, 1)-subsequence profile and (5, 1)-subsequence profile are finally adopted in the following study.

In the PCPseDNC, PCPseTNC, SCPseDNC and SCPseTNC, the parameter λ represents the highest counted rank of the correlation, 1 ≤ λ ≤ L − 2 (for the PCPseDNC and SCPseDNC); 1 ≤ λ ≤ L − 3 (for the PCPseTNC and SCPseTNC) [28, 29, 35, 36]. L is the length of shortest piRNA sequences, and is 16 according to Fig. 2. To test the impact of the parameter λ on the four features, we construct the prediction models by using different values. As shown in Fig. 3 (b). λ = 1 leads to the best AUC scores for PCPseDNC, PCPseTNC, SCPseDNC and SCPseTNC. Therefore, the best parameters are adopted for the final prediction models.

In the sparse profile and PSSM, the parameter d represents the fixed length of sequences. As show in Fig. 2, the lengths of piRNAs range from 16 to 35. Therefore, the prediction models are constructed based on different lengths. As shown in Fig. 3 (c), d = 35 produces the best AUC scores for the sparse profile and PSSM feature. Therefore, we set the parameter d as 35 for the sparse profile feature and the PSSM feature.

Evaluation of various features

After discussing feature parameters, we compare the capabilities of various features for the piRNA prediction. Here, individual feature-based models are constructed on balanced Human dataset and imbalanced Human dataset by using classification engines, and the performances of these models are evaluated by 10-CV.

To test different classifiers, we respectively adopt the random forest (RF) and support vector machine (SVM) to build the individual feature-based prediction models. Here, we use the python package “scikit-learn” to implement RF and SVM, and default values are adopted for parameters. The results demonstrate that RF can produce better performances in most cases (13 out of the 23 individual feature-based models). Moreover, RF runs much faster than SVM, and it is very important for implementing the following experiments. Results of RF models and SVM models are provided in the Additional files 1 and 2. For these reasons, RF is adopted in the following study.

To test the impacts of the ratio of positive instances versus negative instances, we build the individual feature-based prediction models based on the balanced human datasets and the imbalanced human dataset. As shown in Table 4 and Table 5, the prediction models produce similar results on the balanced dataset and imbalanced dataset, indicating that they are robust to the different datasets. The performances of individual feature-based models help to rank the importance of features. According to Table 4 and Table 5, the sparse profile yields the best results among these features, and the performance of LSSTE is much poorer than that of other features. Therefore, we adopt features indexed from F1 to F22 (“F1 ~ F22”) for the final ensemble models.

Table 4 The performances of individual feature-based models on balanced Human dataset

Full size table

Table 5 The performances of individual feature-based models on imbalanced Human dataset

Full size table

Performances of GA-based weighted ensemble method

The GA-based weighted ensemble (GA-WE) method integrates sequence-derived features and constructs high-accuracy prediction models. We evaluate the performances of the GA-WE model on the datasets of three species. Moreover, we carry out the cross-species prediction, in which we build prediction models on Mouse species, and make prediction for other species.

Results of GA-WE models on three species

As show in Table 6, the GA-WE models achieve AUC of 0.932, accuracy of 0.839, sensitivity of 0.858 and specificity of 0.820 on the balanced Human dataset. Compared with the best individual features-based model (the sparse profile-based model), the GA-WE model improves AUC of >3%, indicating the GA-WE model can effectively combine various features to enhance performances. The proposed method also performs accurate prediction on balanced Mouse dataset, achieving AUC of 0.937. Compared with the piRNA prediction on mammalian: Human and Mouse, the prediction on Drosophila is much better, achieving AUC of 0.995. Similarly, the GA-WE model performs high-accuracy prediction on the imbalanced datasets of the three species, achieves AUC of 0.935, 0.939 and 0.996, respectively, which demonstrates that the GA-WE model has not only high accuracy but also good robustness.

Table 6 The performances of the GA-WE model on three species (Human, Mouse and Drosophila)

Full size table

Further, we investigate the optimal weights for the GA-WE model in each fold of 10-CV. Taking Human dataset as an example, the optimal weights of “F1 ~ F22” for the GA-WE model are visualized by the heat map (Fig. 4). We can draw several conclusions from the results. Firstly, different features have different weights in each fold of 10-CV, and the optimal weights can lead to the best ensemble model. Secondly, optimal weights reflect the contributions of the corresponding features for the ensemble model, and the feature having the best performances for piRNA prediction always makes the greatest contribution to the ensemble model. For example, the sparse profile (F21) performs the highest contribution to the ensemble model in each fold of 10-CV, for the sparse profile has the best predictive ability among all features. Thirdly, the optimal weights for the ensemble model depend on the training set, and determining the optimal weights is necessary for building high-accuracy models.

Results of cross-species prediction

Considering that Mouse instances are more than Human instances and Drosophila instances, we construct the GA-WE model on Mouse dataset, and make predictions for Human dataset and Drosophila dataset.

As shown in Table 7, the GA-WE model trained with Mouse dataset achieves AUC of 0.863 and 0.687 on the balanced Human and Drosophila datasets, and achieves AUC of 0.868 and 0.746 on the imbalanced datasets of the two species. Compared with the experiments on a same species, the cross-species experiments produce lower scores, indicating that piRNAs derived from different species may have different patterns. Moreover, the results on Human dataset are better than the results on Drosophila dataset, and the possible reason is that the length distribution of Mouse piRNAs is similar to that of Human piRNAs, and is different from that of Drosophila piRNAs (shown in Fig. 2). Therefore, we’d better train models and make predictions based on a same species.

Table 7 The performances of cross-species prediction

Full size table

Comparison with other state-of-the-art methods

Here, three latest methods: piRNApredictor [20], Piano [21] and our previous work [22] are adopted as the benchmark methods, for they build prediction models based on machine learning methods. piRNApredictor used k-mer feature (i.e, spectrum profile), k = 1, 2, 3, 4, 5, and Piano used the LSSTE feature. piRNApredictor and Piano adopted the support vector machine (SVM) to construct prediction models. Our previous work adopted the exhaustive search strategy to combine five sequence-derived features to predict piRNAs. We implement piRNApredictor obtain the results. Since the source codes of Piano are available at http://ento.njau.edu.cn/Piano.html, we can run the program on the benchmark datasets. The proposed methods and three benchmark methods are evaluated on six benchmark datasets by using 10-CV.

As shown in Table 8, our previous work, piRNApredictor and Piano achieve AUC of 0.920, 0.894 and 0.592 on the balanced Human dataset, respectively. Our GA-WE model produces AUC of 0.932 on the dataset. The proposed method also yields much better performances than piRNApredictor and Piano on the balanced Mouse dataset and balanced Drosophila dataset. There are several reasons for the superior performances of our method. Firstly, various useful features can guarantee the diversity for the GA-WE model. Secondly, the GA-WE model automatically determines the optimal weights on validation set.

Table 8 Performances of GA-WE and the state-of-the-art methods on three species

Full size table

Further, we compare the capabilities of the GA-WE method with the state-of-the-art methods in the cross-species prediction. All models are constructed on Mouse dataset, and make predictions for Human and Drosophila dataset. As shown in Table 9, our GA-WE model trained with Mouse dataset performs better than the state-of-the-art methods on the Human datasets, but performs worse than piRNApredictor on the Drosophila dataset. Moreover, the performances on Human dataset are always better than that on Drosophila dataset regardless of any method, and the possible reason is that the length distribution of Mouse piRNAs is similar to that of Human piRNAs, and is different from that of Drosophila piRNAs (shown in Fig. 2). In general, our method can produce satisfying results in the cross-species prediction.

Table 9 Performances of GA-WE and the state-of-the-art methods in the cross-species prediction

Full size table

Conclusions

In this paper, we develop the GA-based weighted ensemble method, which can automatically determine the importance of different information resources and produce high-accuracy performances. We compile the Human, Mouse and Drosophila datasets from NONCODE version 3.0, UCSC Genome Browser and NCBI Gene Expression Omnibus. In the computational experiments, the GA-based weighted ensemble method achieves AUC of >93% by 10-CV. Compared with other state-of-the-art methods, our method produces better performances as well as good robustness. In conclusion, the proposed method is promising for transposon-derived piRNA prediction. The source codes and datasets are available in https://github.com/zw9977129/piRNAPredictor.

Abbreviations

“F1~F22”:: The features indexed from F1 to F22
10-CV:: 10-fold cross validation
ACC:: Accuracy
AUC:: Area under ROC curve
GA:: Genetic algorithm
GA-WE:: Genetic algorithm-based weighted ensemble
LSSTE:: Local structure-sequence triplet elements
PCPseDNC:: Parallel correlation pseudo dinucleotide composition
PCPseTNC:: Parallel correlation pseudo trinucleotide composition
PSSM:: Position-specific scoring matrix
RF:: Random forest
SCPseDNC:: Series correlation pseudo dinucleotide composition
SCPseTNC:: Series correlation pseudo trinucleotide composition
SN:: Sensitivity
SP:: Specificity
SVM:: Support vector machine

References

Jean-Michel C. Fewer genes, more noncoding RNA. Science. 2005;309(5740):1529–30.
Article Google Scholar
Mattick JS. The functional genomics of noncoding RNA. Science. 2005;309(5740):1527–8.
Article CAS PubMed Google Scholar
Chaoyong X, Jiao Y, Hui L, Ming L, Guoguang Z, Dechao B, Weimin Z, Wei W, Runsheng C, Yi Z. NONCODEv4: exploring the world of long non-coding RNA genes. Nucleic Acids Res. 2014;42(D1):D98–103.
Article Google Scholar
Huang Y, Liu N, Wang JP, Wang YQ, Yu XL, Wang ZB, Cheng XC, Zou Q. Regulatory long non-coding RNA and its functions. J Physiol Biochem. 2012;68(4):611–8.
Article CAS PubMed Google Scholar
Meenakshisundaram K, Carmen L, Michela B, Diego DB, Gabriella M, Rosaria V. Existence of snoRNA, microRNA, piRNA characteristics in a novel non-coding RNA: x-ncRNA and its biological implication in Homo sapiens. J Bioinformatics Seq Anal. 2009;1(2):31–40.
CAS Google Scholar
Alexei A, Dimos G, Sébastien P, Mariana LQ, Pablo L, Nicola I, Patricia M, Brownstein MJ, Satomi KM, Toru N. A novel class of small RNAs bind to MILI protein in mouse testes. Nature. 2006;442(7099):203–7.
Google Scholar
Lau NC, Seto AG, Jinkuk K, Satomi KM, Toru N, Bartel DP, Kingston RE. Characterization of the piRNA Complex from rat testes. Science. 2006;313(5785):363–7.
Article CAS PubMed Google Scholar
Grivna ST, Ergin B, Zhong W, Haifan L. A novel class of small RNAs in mouse spermatogenic cells. Genes Dev. 2006;20(13):1709–14.
Article CAS PubMed PubMed Central Google Scholar
Seto AG, Kingston RE, Lau NC. The coming of age for Piwi proteins. Mol Cell. 2007;26(5):603–9.
Article CAS PubMed Google Scholar
Ruby JG, Jan C, Player C, Axtell MJ, Lee W, Nusbaum C, Ge H, Bartel DP. Large-scale sequencing reveals 21U-RNAs and additional Micro-RNAs and endogenous siRNAs in C. elegans. Cell. 2007;127(6):1193–207.
Article Google Scholar
Cox DN, Chao A, Baker J, Chang L, Qiao D, Lin H. A novel class of evolutionarily conserved genes defined by piwi are essential for stem cell self-renewal. Genes Dev. 1998;12(23):3715–27.
Article CAS PubMed PubMed Central Google Scholar
Klattenhoff C, Theurkauf W. Biogenesis and germline functions of piRNAs. Development. 2008;135(1):3–9.
Article CAS PubMed Google Scholar
Brennecke BJ, Aravin A, Stark A, Dus M, Kellis M, Sachidanandam R, Hannon G. Discrete small RNA-Generating Loci as master regulators of transposon activity in drosophila. Cell. 2007;128(6):1089–103.
Article CAS PubMed Google Scholar
Thomson T, Lin H. The biogenesis and function of PIWI proteins and piRNAs: progress and prospect. Annu Rev Cell Dev Biol. 2009;25(1):355–76.
Article CAS PubMed PubMed Central Google Scholar
Houwing S, Kamminga LM, Berezikov E, Cronembold D, Girard A, Elst HVD, Filippov DV, Blaser H, Raz E, Moens CB. A role for Piwi and piRNAs in germ cell maintenance and transposon silencing in Zebrafish. Cell. 2007;129(1):69–82.
Article CAS PubMed Google Scholar
Das PP, Bagijn MP, Goldstein LD, Woolford JR, Lehrbach NJ, Sapetschnig A, Buhecha HR, Gilchrist MJ, Howe KL, Stark R. Piwi and piRNAs act upstream of an endogenous siRNA pathway to suppress Tc3 transposon mobility in the caenorhabditis elegans germline. Mol Cell. 2008;31(1):79–90.
Article CAS PubMed PubMed Central Google Scholar
Nicolas R, Lau NC, Sudha B, Zhigang J, Katsutomo O, Satomi KM, Blower MD, Lai EC. A broadly conserved pathway generates 3′UTR-directed primary piRNAs. Curr Biol. 2009;19(24):2066–76.
Article Google Scholar
Hang Y, Haifan L. An epigenetic activation role of Piwi and a Piwi-associated piRNA in Drosophila melanogaster. Nature. 2007;450(7167):304–8.
Article Google Scholar
Betel D, Sheridan R, Marks DS, Sander C. Computational analysis of mouse piRNA sequence and biogenesis. Plos Computational Biology. 2007;3(11):e222.
Article PubMed PubMed Central Google Scholar
Zhang Y, Wang X, Kang L. A k-mer scheme to predict piRNAs and characterize locust piRNAs. Bioinformatics. 2011;27(6):771–6.
Article CAS PubMed PubMed Central Google Scholar
Wang K, Liang C, Liu J, Xiao H, Huang S, Xu J, Li F. Prediction of piRNAs using transposon interaction and a support vector machine. BMC Bioinformatics. 2014;15(1):1–8.
Article Google Scholar
Luo L, Li D, Zhang W, Tu S, Zhu X, Tian G. Accurate prediction of transposon-derived piRNAs by integrating various sequential and physicochemical features. PLoS One. 2016;11(4):e0153268.
Article PubMed PubMed Central Google Scholar
Bu D, Yu K, Sun S, Xie C, Skogerbø G, Miao R, Hui X, Qi L, Luo H, Zhao G. NONCODE v3.0: integrative annotation of long noncoding RNAs. Nucleic Acids Res. 2012;40(D1):D210–5.
Article CAS PubMed Google Scholar
Karolchik D, Barber G, Casper J, et al. The UCSC genome browser database: 2014 update. Nucleic Acids Res. 2014;42 suppl 1:D590–8.
Google Scholar
Barrett T, Suzek TO, Troup DB, et al. NCBI GEO: mining millions of expression profiles—database and tools. Nucleic Acids Res. 2005;33(D1):D562–6.
CAS PubMed Google Scholar
Jiang H, Wong WH. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics. 2008;24(20):2395–6.
Article CAS PubMed PubMed Central Google Scholar
Leslie C, Eskin E, Noble WS. The spectrum kernel: a string kernel for SVM protein classification. Biocomputing. 2002;7:564–75.
Google Scholar
Liu B, Liu FL, Wang XL, Chen JJ, Fang LY, Chou KC. Pse-in-One: A web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015;43(W1):W65–71.
Article PubMed PubMed Central Google Scholar
Liu B, Liu FL, Fang LY, Wang XL, Chou KC. repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics. 2015;31(8):1307–9.
Article PubMed Google Scholar
El-Manzalawy Y, Dobbs D, Honavar V. Predicting flexible length linear B-cell epitopes. Computational Syst Bioinformatics. 2008;7:121–32.
Article Google Scholar
Leslie CS, Eskin E, Cohen A, Weston J, Noble WS. Mismatch string kernels for discriminative protein classification. Bioinformatics. 2004;20(4):467–76.
Article CAS PubMed Google Scholar
Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N, Watkins C. Text classification using string kernels. J Mach Learn Res. 2002;2(3):563–9.
Google Scholar
Noble WS, Kuehn S, Thurman R, Yu M, Stamatoyannopoulos J. Predicting the in vivo signature of human gene regulatory sequences. Bioinformatics. 2005;21 suppl 1:i338–43.
Article CAS PubMed Google Scholar
Gupta S, Dennis J, Thurman RE, Kingston R, Stamatoyannopoulos JA, Noble WS. Predicting human nucleosome occupancy from primary sequence. Plos Computational Biology. 2008;4(8):e1000134.
Article PubMed PubMed Central Google Scholar
Chen W, Lei T, Jin D, et al. PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition. Anal Biochem. 2014;456(1):53–60.
Article CAS PubMed Google Scholar
Qiu WR, Xiao X, Chou KC. iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components. Int J Mol Sci. 2014;15(2):1746–66.
Article PubMed PubMed Central Google Scholar
Zhang W, Xiong Y, Zhao M, et al. Prediction of conformational B-cell epitopes from 3D structures by random forests with a distance-based feature. BMC Bioinformatics. 2011;12(2):341.
Article CAS PubMed PubMed Central Google Scholar
Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000;16(1):16–23.
Article CAS PubMed Google Scholar
Sinha S. On counting position weight matrix matches in a sequence, with application to discriminative motif finding. Bioinformatics. 2006;22(14):e454–63.
Article CAS PubMed Google Scholar
Xia X. Position weight matrix, Gibbs sampler, and the associated significance tests in Motif characterization and prediction. Scientifica. 2012;917540–917555.
Xue C, Fei L, Tao H, Liu GP, Li Y, Zhang X. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics. 2005;6(2):1–7.
Google Scholar
Tafer H, Hofacker IL. RNAplex: a fast tool for RNA-RNA interaction search. Bioinformatics. 2008;24(22):2657–63.
Article CAS PubMed Google Scholar
Hu X, Mamitsuka H, Zhu S. Ensemble approaches for improving HLA class I-peptide binding prediction. J Immunol Methods. 2011;374(1-2):47–52.
Article CAS PubMed Google Scholar
Zhang W, Niu Y, Xiong Y, Zhao M, Yu R, Liu J. Computational prediction of conformational B-Cell Epitopes from antigen primary structures by ensemble learning. PLoS One. 2012;7(8):e43575.
Article CAS PubMed PubMed Central Google Scholar
Zhang W, Niu Y, Zou H, Luo L, Liu Q, Wu W. Accurate prediction of immunogenic T-Cell epitopes from epitope sequences using the genetic algorithm-based ensemble learning. PLoS One. 2015;10(5):e0128194.
Article PubMed PubMed Central Google Scholar
Zhang W, Liu J, Xiong Y, Ke M, Zhang K. Predicting immunogenic T-cell epitopes by combining various sequence-derived features. In IEEE International Conference on Bioinformatics and Biomedicine. Shanghai: IEEE Computer Society; 2013. p. 4–9.
Breiman L. Random forests. Machine Learning. 2001;45:5–32.
Article Google Scholar
Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
Google Scholar
Srinivas M, Patnaik LM. Adaptive probabilities of crossover and mutation in genetic algorithms. IEEE Trans Syst Man Cybern. 1994;24(4):656–67.
Article Google Scholar

Download references

Acknowledgements

The authors thank Dr. Fei Li and Dr. Kai Wang for the codes of Piano.

Funding

The National Natural Science Foundation of China (61103126, 61271337, 61402340 and 61572368), Shenzhen Development Foundation (JCYJ20130401160028781) and Natural Science Foundation of Hubei Province of China (2014CFB194) support this work. These funding has no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Availability of data and materials

The datasets and source codes are available in https://github.com/zw9977129/piRNAPredictor.

Authors’ contribution

WZ, DL and LL designed the study. LL implemented the algorithm. LL and WZ drafted the manuscript. FL (Fei) and FL (Feng) helped to prepare the data and draft the manuscript. All authors have read and approved the final version of the manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Author information

Authors and Affiliations

School of Mathematics and Statistics, Wuhan University, Wuhan, 430072, China
Dingfang Li & Longqiang Luo
State Key Lab of Software Engineering, Wuhan University, Wuhan, 430072, China
Wen Zhang & Fei Luo
School of Computer, Wuhan University, Wuhan, 430072, China
Wen Zhang & Fei Luo
International School of Software, Wuhan University, Wuhan, 430072, China
Feng Liu

Authors

Dingfang Li
View author publications
You can also search for this author in PubMed Google Scholar
Longqiang Luo
View author publications
You can also search for this author in PubMed Google Scholar
Wen Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Feng Liu
View author publications
You can also search for this author in PubMed Google Scholar
Fei Luo
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wen Zhang.

Additional file

Additional file 1: Table S1.

The performances of RF models and SVM models on the balanced Human dataset. (XLSX 13 kb)

Additional file 2: Table S2.

The performances of RF models and SVM models on the imbalanced Human dataset. (XLSX 13 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Li, D., Luo, L., Zhang, W. et al. A genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs. BMC Bioinformatics 17, 329 (2016). https://doi.org/10.1186/s12859-016-1206-3

Download citation

Received: 18 March 2016
Accepted: 24 August 2016
Published: 31 August 2016
DOI: https://doi.org/10.1186/s12859-016-1206-3

A genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs