- Methodology article
- Open access
- Published:
mLoc-mRNA: predicting multiple sub-cellular localization of mRNAs using random forest algorithm coupled with feature selection via elastic net
BMC Bioinformatics volume 22, Article number: 342 (2021)
Abstract
Background
Localization of messenger RNAs (mRNAs) plays a crucial role in the growth and development of cells. Particularly, it plays a major role in regulating spatio-temporal gene expression. The in situ hybridization is a promising experimental technique used to determine the localization of mRNAs but it is costly and laborious. It is also a known fact that a single mRNA can be present in more than one location, whereas the existing computational tools are capable of predicting only a single location for such mRNAs. Thus, the development of high-end computational tool is required for reliable and timely prediction of multiple subcellular locations of mRNAs. Hence, we develop the present computational model to predict the multiple localizations of mRNAs.
Results
The mRNA sequences from 9 different localizations were considered. Each sequence was first transformed to a numeric feature vector of size 5460, based on the k-mer features of sizes 1–6. Out of 5460 k-mer features, 1812 important features were selected by the Elastic Net statistical model. The Random Forest supervised learning algorithm was then employed for predicting the localizations with the selected features. Five-fold cross-validation accuracies of 70.87, 68.32, 68.36, 68.79, 96.46, 73.44, 70.94, 97.42 and 71.77% were obtained for the cytoplasm, cytosol, endoplasmic reticulum, exosome, mitochondrion, nucleus, pseudopodium, posterior and ribosome respectively. With an independent test set, accuracies of 65.33, 73.37, 75.86, 72.99, 94.26, 70.91, 65.53, 93.60 and 73.45% were obtained for the respective localizations. The developed approach also achieved higher accuracies than the existing localization prediction tools.
Conclusions
This study presents a novel computational tool for predicting the multiple localization of mRNAs. Based on the proposed approach, an online prediction server “mLoc-mRNA” is accessible at http://cabgrid.res.in:8080/mlocmrna/. The developed approach is believed to supplement the existing tools and techniques for the localization prediction of mRNAs.
Background
The discovery of asymmetrical distribution of \(\beta\)-actin mRNA in ascidian embryos and eggs by Jeffery et al. [1] laid the foundation for mRNA localization study. However, valuable understandings of mRNA localization are stemmed from the research undertaken thereafter in fungi and animals [2,3,4,5,6]. Localization of mRNAs plays a prominent role in spatio-temporal regulation of gene expression, which is crucial for different cellular and developmental processes including asymptotic cell division, cell migration, embryonic patterning and cellular adaptation to stress [2, 3, 7, 8]. Localization of mRNAs further facilitates sub-cellular localization of proteins that helps to establish and maintain the cell polarity [2]. In the nervous system of mammals, localization of mRNAs plays vital roles ranging from axon and dendrite path finding to synapse generation [9, 10]. Maintenance of synaptic plasticity responsible for long-lasting learning and memory also depends upon the localization of mRNAs [11,12,13]. Deregulation of mRNA stability and localization may cause different genetic disorders including cancer [9]. More details on the adverse impact of the deregulation of mRNA localizations in animals can be found in existing studies [14,15,16].
A crucial factor for the localization of mRNAs is the recognition of cis-acting signals (or zipcodes) by the trans-acting factors (mostly RNA binding proteins)[17]. The cis-acting elements are mostly present in 3’UTR of mRNAs [3, 18,19,20]. Localization of mRNAs takes place through three mechanisms i.e., vectorial transport from nuclei, localized protection from degradation and directional transport on the cytoskeleton via molecular motors [2, 3]. Since a single localized mRNA can produce multiple protein copies, the localization of mRNAs saves energy for a cell [21]. Besides, the locally translated proteins prevent themselves from interaction with the other proteins as well as from synthesis in the incorrect locations that is harmful to the cell [22]. The mRNA localization also facilitates the assembly of macromolecular protein complexes [23] and regulates the differential translation as well [24, 25].
Although in situ hybridization is a reliable experimental technique for mRNA localization, it is a slow and laborious approach. The in situ hybridization allows the rapid labeling of mRNAs [26, 27] but it is limited to certain tissues [28]. With the advancement of in vivo mRNA localization techniques such as MS2-PP7 [29] system, total internal reflection microscopy (TIRF) [30] and 3D structured illumination microscopy (3D-SIM) [31], a large number of localized mRNAs are available in the public domain. Hence, the development of computational tools for mRNA localization prediction has now become feasible.
Predicting the localization of mRNAs can be formulated as both supervised and unsupervised learning problems. In the case of unsupervised learning, homology-based methods such as BLAST [32] and HMM [33] can be used for predicting the localization of mRNAs. By using BLAST, a database of all the localized mRNA sequences can be built and based on the blast search each query sequence can be assigned to a particular localization depending upon the sequence similarity found. Similarly, an HMM profile can be created by using all the localization datasets and the localization of the query sequence can be searched against the created HMM profile. However, by using such models only a single localization can be predicted for any mRNA. On the other hand, if different databases/profiles are created for different localizations to predict multiple localizations, probability of getting a large number of false positives is always there. This may be the reason the existing tools are based on the supervised learning model. The supervised computational tools such as RNATracker [34], iLoc-mRNA [35], and mRNALoc [36] have been developed for predicting the mRNA localization. It is a well-known fact that a single mRNA could be present in more than one localization, whereas the existing tools are meant for predicting single localization only. In other words, only a single localization can be predicted for each mRNA by using the existing tools. The use of CeFra-Seq [37] and APEX-RIP [38] datasets in RNATracker further limits its application. Also, the numbers of localizations considered in the existing tools are not more than five.
The above-stated factors prompted us to develop a new computational model for predicting the multiple localization of mRNAs. The k-mer features were generated to translate the mRNA sequences into vectors of numeric elements. The Elastic Net [39] statistical model was used for the selection of important k-mer features. The Random Forest [40] supervised learning model was employed for predicting the localizations using the selected k-mer features. The performance of the model was evaluated by following the cross-validation and also by using the independent datasets. The proposed model achieved higher accuracy in most of the localizations, while compared using the independent test datasets. We have also developed an online prediction server for the multiple subcellular localization prediction of mRNAs. The developed model is expected to supplement the existing localization prediction tools in particular, and the wet-lab experiments in general.
Methods
Collection and processing of localization dataset
The mRNA sequences of 9 localizations i.e., cytoplasm (17,053), cytosol (18,173), endoplasmic reticulum (6179), exosome (1623), mitochondrion (525), nucleus (25,981), pseudopodium (923), posterior (402) and ribosome (16,602) were collected from the RNALocate database [41] (order of the locations will remain same hereafter). Except for the posterior and mitochondrion, sequences of the remaining localizations were found to be overlapping (Fig. 1a). In other words, the sequences were seen to be present in more than one localization and up to a maximum of 4 localizations. For instance, 2809 sequences were found common in the cytoplasm, cytosol, nucleus and ribosome (represented as purple dots in Fig. 1a). After removing the overlapped sequences, 2730, 4412, 2008, 1205, 488, 6522, 670, 385 and 3537 mRNA sequences were obtained for the respective localizations. Since a higher degree of sequence similarity may lead to overestimation of the prediction accuracy, CD-HIT [42] was employed with an 80% cut-off to remove the redundant sequences from each localization. The 80% threshold has also been used in earlier studies [35, 43]. After removing the redundancy, 1804, 2158, 1020, 843, 457, 3304, 216, 187 and 1838 sequences were obtained for the corresponding localizations. Variability in the sequence length was also observed to be different for different localizations, with the highest variability for the nucleus and lowest for the mitochondrion (Fig. 1b).
Preparation of the positive, negative and independent data sets
A single mRNA can be present in multiple subcellular locations. So, by using a single multi-class prediction model trained with 9 different classes (corresponding to 9 localizations), only a single localization (localization with the highest probability) can be predicted for each mRNA sequence. Therefore, it is required to establish multiple binary classifiers instead of a single multi-class classifier to predict the multiple localizations of mRNAs. Specifically, 9 binary classifiers are required for 9 localizations. Bu using 9 binary classifiers, the probability of each mRNA predicted in nine different localizations can be obtained. Then, with the probability threshold value 0.5, the localization (one or more) can be easily decided for the given mRNA sequence. Keeping above in mind, positive and negative datasets were prepared for each of the 9 binary classifiers. The sequence datasets obtained after removing the redundancy were randomly divided into 6 equal-sized subsets for each localization, where one randomly drawn subset from each localization was kept aside to use them as an independent dataset. The remaining 5 subsets of each localization were used to train and validate the classifiers, keeping in mind the five-fold cross-validation. In particular, 300, 360, 170, 140, 76, 550, 36, 31, 306 sequences were used as an independent set (Independent test set-I) and the remaining 1504, 1798, 850, 703, 381, 2754, 180, 156, 1532 sequences of the respective localizations were used for training and validating the classifiers (Fig. 2) through five-fold cross validation. Further, the number of training sequences for the pseudopodium (180) and posterior (156) were not large. Thus, if more sequences would be considered in the test set then the number of sequences for the training would be less. Similarly, the numbers of test sequences for these two localizations (36, 31) were also less. So, if more sequences would be taken in the training set, then a very few sequences would be left out for the test set. Thus, we did not consider any other combination of the training and test sets. Nevertheless, more stable accuracy can be obtained by training the model with a large dataset. For a given localization, sequences of that localization were used as the positive set and the sequences of the remaining 8 localizations were utilized as the negative set. For instance, 1504 sequences were used as the positive set for the cytoplasm and 8354 (1798 + 850 + 703 + 381 + 2754 + 180 + 156 + 1532) sequences of the remaining localizations were utilized as the negative set. Similarly, the positive and negative datasets were also prepared for other localizations. Besides Independent test set-I, we prepared another independent dataset (i.e., Independent test set-II) with the CD-HIT filtered-out sequences after removing redundancy at 80% threshold. In the Independent test set-II, there were 490, 1037, 485, 185, 14, 1266, 79, 121, 798 sequences for the respective localizations. A graphical representation with the steps involved for preparing the positive, negative and independent datasets is shown in Fig. 2. A summary of the number of positive and negative datasets used in training and test sets are provided in Additional file 1: Table S1.
K-mer feature generation
The k-mer features have been effectively and widely used in many bioinformatics applications including sequence alignment, genome assembly, characterization of microbial community and others [44,45,46]. These features have also been widely adopted for the recognition of regulatory elements on the genomic DNA/RNA [47,48,49]. The k-mer is a fragment of sequence containing k oligonucleotides, where the total number of possible k-mer is 4 k. We employed k-mer sizes 1, 2, 3, 4, 5, 6, and hence the total number of features generated was 5460 (41 + 42 + 43 + 44 + 45 + 46). Besides the “curse of dimensionality”, using such a large number of features may result in low prediction accuracy due to the presence of many irrelevant/redundant features [48, 50,51,52]. Over fitting of the model may be another issue with a large number of features, which results in low generalization predictive ability of the model [43]. These limitations could be overcome to a large extent by applying the feature selection technique.
Feature selection using elastic net
The Elastic Net [39] statistical model, which is a combination of the LASSO [53] and Ridge regression [54] algorithms, was employed for the selection of important k-mer features. Consider the generalized linear model
where \({\boldsymbol{Y}}_{\boldsymbol{n}\times 1}\) is the response for n observations, \({\boldsymbol{X}}_{\boldsymbol{n}\times \boldsymbol{p}}\) is the design matrix for p variables, \(\beta\) is the vector of coefficients and \(\epsilon\) is the vector of random errors. For this model, the estimates using the Elastic Net method can be obtained as
The parameter \(\lambda (\ge 0)\) controls the amount of shrinkage, where \(\lambda\)=0 represents an ordinary least square solution, and all the coefficients shrink to zero with \(\lambda =\infty\). The regularization parameter \(\alpha\) controls both the Ridge and LASSO, where \(\alpha =0\) and 1 deduce the Ridge and LASSO models respectively. In Elastic Net, the coefficients of the least important variables are shrunk to zero and the variables with the non-zero coefficients are considered important. We implemented the Elastic Net using “glmnet” R-package [55], where the classification task was performed with binary response variable by invoking the binomial distribution with parameter family = “binomial”. By using repeated five-fold cross-validation approach, the optimum value of \(\alpha\) (based on highest classification accuracy) was selected out of 10 different values of \(\alpha\) i.e., 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1. Using the optimum value of \(\alpha\), the function cv.glmnet (with nfolds = 5) was then employed to find the optimum value of \(\lambda\) (with respect to classification accuracy) through a grid search approach. Parameters were optimized with 50% random sample observations of the training set, having same number of instances from both the positive and negative classes.
Prediction using random forest
We employed the Random Forest (RF) [40] supervised learning algorithm for binary classification. The RF is an ensemble of several un-pruned classification trees (Fig. 3a), where each classification tree is grown upon a bootstrap sample of the training dataset [56, 57]. The decision tree starts with the root node containing all the observations and ends with the terminal nodes having the labels of observations (Fig. 3a). In RF, randomization is introduced at two stages, (1) while drawing the bootstrap sample and, (2) during the splitting of the nodes of the classification tree. Each classification tree of RF votes for every such instance that does not participate in the construction of the classifier and such instances are called out-of-bag (OOB) instances. The OOB observations are the source to estimate the classification error in RF [58], where the prediction label of each instance is decided by the majority voting scheme. The steps for building an RF classifier are shown in Fig. 3b. The RF is a non-parametric technique, handles large size datasets, provides higher accuracy even for noisy data and overcomes the problem of over-fitting with bootstrapping and ensemble strategy [59]. The ntree (number of classification trees) and mtry (number of variables to be selected for node splitting) are the parameters duo that are needed to be optimized to achieve higher accuracy. Since default parametric values often produce higher accuracy [60], we used ntree = 500 (default ntree) and default mtry i.e., square root of the number of features. Further, a larger number of instances for the negative class as compared to its positive counterpart (Fig. 2) may lead to the prediction bias towards the negative class. To overcome this problem, we employed 5 RF classifiers (instead of one) for each localization. In each RF classifier, all the instances of the positive class and an equal number of instances randomly drawn from the negative class were utilized. A five-fold cross-validation approach was adopted to measure the accuracy for each RF and a majority voting strategy was applied for the final prediction (Fig. 3c). In other words, if an instance was predicted to a certain class in 3 out of 5 RF classifiers the instance was said to be predicted in that class. This procedure was followed for all the 9 binary classifiers. A flow diagram showing the steps involved in the proposed approach is shown in Fig. 4.
Performance metrics
Five different performance metrics were used to measure the classification accuracy. The metrics are
-
sensitivity = \(\frac{{N}^{+}-{N}_{-}^{+}}{{N}^{+}}\),
-
specificity = \(\frac{{N}^{-}-{N}_{+}^{-}}{{N}^{-}}\),
-
accuracy = \(\frac{{N}^{-}-{N}_{+}^{-}+{N}^{+}-{N}_{-}^{+}}{{N}^{-}+{N}^{+}}\),
-
Matthew’s correlation coefficient (MCC) = \(\frac{1-\left(\frac{{N}_{-}^{+}}{{N}^{+}}+\frac{{N}_{+}^{-}}{{N}^{-}}\right)}{\sqrt{\left(1+\frac{{N}_{-}^{+}-{N}_{+}^{-}}{{N}^{+}}\right)\left(1+\frac{{N}_{+}^{-}-{N}_{-}^{+}}{{N}^{-}}\right)}}\) and
-
F1-score = \(\frac{{N}^{+}-{N}_{-}^{+}}{\left({N}^{+}-{N}_{-}^{+}\right)+\frac{1}{2}\left({N}_{+}^{-}+{N}_{-}^{+}\right)}\),
where \({N}^{+}\), \({N}^{-}\), \({N}_{-}^{+}\) and \({N}_{+}^{-}\) represent the number of observations of the positive class, number of observations of the negative class, number of positive observations misclassified in the negative class and the number of negative observations misclassified in the positive class respectively [61,62,63,64]. We also computed area under receiver-operating-characteristics (ROC) [65] curve (aucROC) and area under precision-recall (PR) curve (aucPR) [66] to measure the classification accuracy. All the performance metrics were computed by following the five-fold cross-validation. For the classification using RF, the sensitivity, specificity, accuracy, MCC and F1-score were computed based on the majority voting scheme over 5 RF classifiers. On the other hand, the aucROC and aucPR were computed by taking average over the 5 RFs. Prediction for the independent test set was made by passing it through all the 9 binary classifiers, where each binary classifier was built with 5 RF classifiers. The prediction label for each test instance was decided based on the majority voting of the 5 RF classifiers.
Results
Feature generation and selection analysis
Out of 10 different values of \(\alpha\) in Elastic Net, higher aucROC were obtained for \(\alpha =0.1\), irrespective of the localizations (Fig. 5a). Thus, the optimum value of \(\alpha\) was determined as 0.1. With an increase in the value of \(\alpha\), accuracy was declined and stabilized after \(\alpha =0.8\) for all the localizations (Fig. 5a). Using \(\alpha\)=0.1, the optimum values of aucROC i.e., 0.752, 0.721, 0.748, 0.691, 0.985, 0.788 0.718, 0.981 and 0.769 were obtained with \(\lambda\)=0.1644, 0.2079, 0.2196, 0.2298, 0.4932, 0.1329, 0.2013, 2.389 and 0.1939 respectively for the corresponding localizations (Fig. 5b). With the determined optimum values of \(\alpha\) and \(\lambda\), the number of k-mer features with non-zero coefficients were observed to be 339, 213, 311, 266, 180, 368, 509, 32 and 241 for the respective localizations. Distribution of the selected features for different k-mer sizes is shown in Fig. 6a. For instance, the number of selected features for the cytoplasm with k-mer sizes 2, 3, 4, 5 and 6 were 4, 8, 38, 92 and 197 respectively. For k-mer size 1, even a single feature was not found with non-zero coefficient for all the 9 localizations. Out of 2459 selected features across localizations, 1812 were found to be non-redundant. The k-mer CGAT was observed as the most important feature among all the 1812 k-mers because it was selected in 7 out of the 9 localizations (Fig. 6b). Further, 14, 29, 83 and 317 k-mer features were selected in 5, 4, 3 and 2 localizations respectively and the remaining features were found to be important in one location only. It was also observed that 9 out of the top 15 k-mer features were of k-mer size 4 (Fig. 6b). The 1812 non-redundant k-mer features were employed for the localization prediction using RF.
Cross-validated prediction analysis using random forest
Cross-validated performance metrics are presented in Table 1, whereas the ROC and PR curves for all the 5 RF classifiers are displayed in Fig. 7. Accuracies of 70.87, 68.32, 68.36, 68.79, 96.46, 73.44, 70.94, 97.42 and 71.77% were obtained for the respective localizations (Table 1). The average values of aucROC and aucPR over all the 5 RFs were observed to be ~ 75–98% and ~ 71–99% respectively (Table 1). The highest accuracy was found for the mitochondrion followed by the posterior, whereas the lowest accuracy was observed for the cytosol. Higher accuracies for the mitochondrion and posterior may be attributed to the fact that the mRNA sequences of these localizations were exclusively present in the respective localizations (Fig. 1a). More stable accuracies were observed for the larger size datasets (cytoplasm, cytosol, nucleus and ribosome) as compared to the smaller size datasets (endoplasmic reticulum, exosome, mitochondrion and pseudopodium). Although the sample size was small for the posterior, the accuracies were still found to be more stable as compared to the other localization with the small sample sizes. The probable reason may be that the sequences of the posterior are more conserved as compared to the other localizations which was evident from the fact that the sequences of the posterior were not present in other localizations (Fig. 1a).
Comparative analysis with other bootstrapping methods
We have compared the performance of RF with other bootstrapping methods i.e., Bagging [67] and Boosting [68]. Performances were computed in the same way as done in the case of RF, by using the same dataset. The Bagging and Boosting algorithms were implemented by using the ipred [69] and adabag [70] R-packages respectively. Default parameters setting were employed for prediction. The results are shown in Fig. 8. Accuracies of the RF for the cytosol (68.32), exosome (68.79), nucleus (73.44), posterior (97.42) and ribosome (71.77) were higher than that of Bagging (68.3, 68.3, 73.2, 95.1, 71.4) and Boosting (67.4, 68.2, 72.8, 97.0, 70.4). On the other hand, the accuracies of the Boosting algorithm were higher for the cytoplasm (73.1), endoplasmic reticulum (68.7), mitochondrion (97.7) and pseudopodium (71.3) than that of Bagging (71.1, 68.3, 94.4, 68.6) and RF (70.87, 68.36, 96.46, 70.94) methods. Similarly, the values of MCC of the RF were higher for the localizations cytosol (36.75), exosome (37.81), nucleus (46.88), posterior (94.85) and ribosome (43.55) than that of Bagging (36.7, 37.8, 46.4, 90.2, 42.7) and Boosting (34.8, 36.5, 45.6, 94.2, 40.9). On the other hand, Boosting algorithm achieved higher MCC for the cytoplasm (46.2), endoplasmic reticulum (37.5), mitochondrion (95.5) and pseudopodium (42.7) than that of Bagging (41.3, 36.8, 89.0, 37.1) and RF (41.8, 36.95, 91.26, 41.93) methods.
Performance analysis of random forest with different k-mer features
To study the effect of different k-mer sizes, the performance of RF was analyzed with six different k-mer features i.e., k = 1, k = 2, k = 3, k = 4, k = 5 and k = 6. Thus, the numbers of features utilized were 4, 16, 64, 256, 1024 and 4096 respectively. The accuracies were measured in terms of aucROC and aucPR following five-fold cross validation (Fig. 9). Except exosome, mitochondrion and pseudopodium, the accuracies are increased up to k = 3 for rest of the 6 localizations, whereas the accuracies for the exosome, mitochondrion and pseudopodium are found to be increased upto k = 4, k = 5 and k = 6 respectively (Fig. 9). Higher accuracy with k = 3 may be attributed to the formation of codon structure that resulted in better discrimination of mRNA localizations. On the other hand, less accuracy with k-mer sizes 4, 5 and 6 may be due to the generation of redundant features (mostly contained 0 s). This implies that the model performance may not increase with further increment in the k-mer size.
Evaluation with independent test set-I
Classification accuracies for the independent test set-I were obtained as 65.33, 71.37, 75.86, 72.99, 94.26, 70.91, 65.53, 93.60 and 73.45% respectively (Table 2). The aucROC values (~ 73–98%) were observed at par with the cross-validation accuracy, whereas the aucPR values (~ 7–75%) were much lower than that of cross-validation accuracy because of the highly imbalanced test dataset (Table 2). This is because ROC is independent of the class distribution and PR is dependent upon the number of observations present in different classes. The MCC also takes into account the class distribution and hence its values were also observed to be less because of imbalanced dataset (Table 2).
Evaluation with independent test set-II
Performance metrics for the independent test set-II are given in Table 3. Accuracies for the respective localizations were observed to be 69.34, 75.91, 74.91, 78.10, 95.28, 80.91, 93.16 and 78.69% respectively. Prediction accuracies were also found to be higher than that of independent test set-I, because independent test set-II shares a higher degree of similarity with the training set. The aucROC values were also much higher and it may be because ROC does not account for the class distribution.
Comparative analysis with the existing methods
There are three tools such as RNATracker [34], iLoc-mRNA [35] and mRNALoc [36] available for predicting mRNA localizations. Since the RNATracker has been trained with the CeFra-Seq/APEX-RIP dataset involving gene expression and coordinate files, it was not included for comparison. Between iLoc-mRNA and mRNALoc, the latter one has been shown to outperform the former one. Thus, we compared with the mRNALoc which is also the latest one in the series. The comparison was made using two datasets, i.e., Independent test set-I (Test set-I) and independent dataset of mRNALoc (Test set-II). It was also ensured that sequences of the mRNALoc test set were not present in our training set and sequences of the Independent test set-I were not present in the training set of mRNALoc. Furthermore, the comparison was made with the 4 localizations (cytoplasm, endoplasmic reticulum, mitochondrion and nucleus) that were common between the mRNALoc and our study. The performance metrics were computed by considering the sequences of a given localization as the positive set and the remaining three localizations as the negative set. For instance, in test set-I, the number of positive instances for the nucleus was 83 and the number of negative instances was 142 (86 + 31 + 25). For the test set-I, the proposed approach achieved higher accuracies (67.9, 60.7) and MCC (36.1, 23.8) for the localizations cytoplasm and nucleus respectively (Table 4). On the other hand, mRNALoc achieved higher accuracies (70.2, 70.9) and MCC (42.1, 47.5) for the localizations endoplasmic reticulum and mitochondrion respectively (Table 4). For the test set-II, the developed approach achieved higher accuracies for all the four localizations i.e., cytoplasm (64.3), endoplasmic reticulum (83.0), mitochondrion (91.3) and nucleus (77.6). Except for mitochondrion, the MCC values of the proposed approach were also higher for the rest of the three localizations i.e., cytoplasm (28.6), mitochondrion (62.4) and nucleus (21.4). Taking both the test sets into account, the proposed approach may be said to perform better than that of mRNALoc.
Prediction server mLoc-mRNA
An online prediction server “mLoc-mRNA” is freely accessible at http://cabgrid.res.in:8080/mlocmrna/ for predicting the multiple subcellular localization of mRNAs. The front-end of the server has been designed by using HTML (hypertext markup language), whereas the developed R-programs run at the back end with the help of PHP (hypertext preprocessor). There are provisions for both uploading the FASTA file of mRNA sequences as well as pasting of mRNA sequences in the text area by the user (Fig. 10a). The results were shown in terms of probabilities with which each mRNA sequence was predicted in 9 different localizations (Fig. 10b). The value 0.5 was treated as a threshold above which the test instance was predicted in the positive class and the negative class otherwise.
Discussion
Localization of mRNAs is an evolutionarily conserved phenomenon, found in plants, animals as well as in unicellular organisms [8]. Coupled with the translational regulation, subcellular localization of mRNAs provides means to control the time and location of the protein synthesis and their functions [2, 3, 8, 28]. Genome-wide association analysis [71,72,73] suggests that the establishment of functionally distinct compartments happens through sub-cellular targeting of mRNAs by polarized cells. Keep in view the numerous advantages of mRNA localization, this study presents a new computational model for predicting the multiple subcellular localization of mRNAs.
The mRNA localization datasets were collected from the RNALocate database. Besides the considered 9 localizations, mRNA sequences were also available for other localizations such as anterior, apical, axon and others. However, after removing the overlapped and redundant sequences, very few sequences were left for these localizations and thus not considered in this study. Though the cytosol is the aqueous part of the cytoplasm, it has its own function i.e., cytosol concentrates the dissolved molecules into the correct position for efficient metabolism, and a large number of proteins are localized in the cytosol [74]. Therefore, both cytoplasm and cytosol were considered despite the fact that the cytosol is the aqueous part of the cytoplasm. Furthermore, there were overlapped sequences between these two localizations, and we utilized the dataset after removing not only the overlapped sequences but also the sequences that shared more than 80% sequence identity with any other sequences in each localization.
We utilized k-mer features for translating the nucleotide sequences into numeric vectors, where the important k-mer features were selected by using the Elastic Net model. The LASSO eliminates the redundant/irrelevant features and hence reduces the over-fitting of the model. On the other hand, Ridge assigns less weight to the features that are less relevant for predicting the response. Elastic Net combines the features of both LASSO and the Ridge to improve the prediction performance [39]. Thus, the Elastic Net may be more effective as compared to the LASSO and Ridge. Hence, we employed the Elastic Net for the feature selection.
The RF algorithm was employed for the prediction purpose because it is an ensemble learning method that tends to produce higher accuracy as compared to the single base classifier [40]. The performance of the developed computational model was evaluated by following a five-fold cross-validation procedure as well as by using the independent test sets. Higher accuracies were obtained for the mitochondrion and posterior as compared to the other localizations because the sequences of these two localizations share a lesser degree of similarity with other localizations which is evident from the fact that the sequences of these localizations were present in the remaining localizations (Fig. 1a). The RF model is not only useful for the classification/prediction but also for the selection of important features through RF variable importance measure. There are two types of RF variable importance measure i.e., mean decrease in impurity (Gini importance) and mean decrease in accuracy (permutation importance). For the permutation importance of a variable, the prediction accuracy on the OOB sample is first computed. Then, the values of the variable in the OOB samples are permuted keeping all other variables the same and the accuracy with the permuted data is again computed. The difference in the prediction accuracy caused by the permutation is measured and averaged over all the tree classifiers of RF. The idea of permutation importance is that if a variable is important to achieve high prediction accuracy, shuffling that variable should lead to an increase in the error and vice-versa. Mean decrease in impurity is another variable importance measure of RF. The Gini impurity is used as a measure to choose the variable for splitting of parental node into daughter nodes. For a particular variable, the decrease in Gini impurity is accumulated for each tree every time the variable is chosen to split a node. The sum is averaged over all the trees in the forest to get the importance of each such variable. Though the Gini importance is a biased measure, minimal computation is required for the Gini impurity. On the other hand, though permutation-based importance employs an exhaustive permutation, it is a reasonably efficient technique than that of Gini importance.
Besides RF, we have also used the ensemble learning methods Bagging and Boosting. In terms of accuracy and MCC, Bagging achieved less accuracy than that of RF and Boosting methods. Out of 9 localizations, RF achieved higher accuracy and MCC in 5 localizations whereas Boosting achieved higher accuracy and MCC in the other four localizations. In other words, RF achieved higher accuracy in majority of the localization. However, the difference in accuracies between RF and Boosting algorithm were not much higher. Further, Boosting algorithm is more sensitive to noisy data and over fitting as compared to the RF because the classification trees are grown independently in RF whereas the classification trees are correlated in Boosting. Also, less numbers of parameters are needed to be tuned in RF as compared to the Boosting algorithm. Keeping all the above points in mind, we preferred RF for prediction in all the 9 localizations. In Bagging, randomization is only introduced while drawing the bootstrap sample, whereas in RF randomization is introduced at two stages i.e., during drawing the bootstrap sample and splitting of the nodes of the classification tree [40]. This may be the reason RF achieved higher accuracy than that of Bagging. As far as Boosting method is concerned, it works sequentially and iteratively adjusts the weights of observations until a better learner is achieved. On the other hand, in Bagging each classification tree is grown independently without adjusting the weights of observations [75]. This may be the reason for the better accuracy of the Boosting over Bagging method. Since RF was preferred for prediction in 9 localizations as compared to other bootstrapping methods, the RF was only used for prediction of the independent dataset. The cross validation accuracy and the accuracy with the independent dataset were found to be similar for the developed approach.
Performance of the mLoc-mRNA was also compared with the state-of-art single localization prediction tool mRNALoc using two different independent datasets. The mLoc-mRNA achieved higher accuracy in two localizations with the test set-I and all the four localizations with the test set-II. Thus, the proposed approach may achieve higher accuracy than that of mRNALoc with regard to predicting the single localization of mRNAs. The performance was compared with that of mRNALoc only because it has outperformed the other existing methods such as RNATracker [34] and iLoc-mRNA [35]. We have used the localization dataset of the RNALocate database that contains data from 65 organisms with most of the data from the model organisms like, Homo sapiens, Mus musculus and Saccharomyces cerevisiae. On the other hand, the existing tools such as RNATracker and iLoc-mRNA have been developed by using only localization data of human mRNA. Thus, the developed tool mLoc-mRNA is a more general one with regard to predicting the eukaryotic mRNA subcellular localization. Besides, the datasets i.e., CeFra-Seq [36], APEX-RIP [37] used in the RNATracker are sometimes noisy and inaccurate [38]. On the other hand, the dataset used in our study was collected from the RNALocate which contains a manually curated mRNA localization dataset with experimental evidence. Further, existing localization prediction tools such as RNATracker, iLoc-mRNA and mRNALoc have been developed for predicting the single localization of mRNAs. On the other hand, the mLoc-mRNA is meant for predicting both single as well as multiple subcellular localizations.
We have also developed an online prediction tool mLoc-mRNA (http://cabgrid.res.in:8080/mlocmrna/) for multiple localization prediction of mRNAs. Using the tool, the prediction label for the test instance is decided with the majority voting by 5 RF classifiers. The developed tool is expected to supplement the existing single localization prediction tools as well as the in situ hybridization technique for mRNA localization study.
Conclusions
In this article, we have developed a new computational method for predicting multiple subcellular localization of mRNAs. By using the developed method, user can easily predict the probability of any mRNA sequence being predicted in 9 different localizations. The developed web server is expected to be of great help for researchers from the non-computational background. The proposed methodology will certainly supplement the existing studies towards the localization prediction of mRNAs.
Availability of data and materials
All the datasets used in this study are available at http://cabgrid.res.in:8080/mlocmrna/dataset.html.
Abbreviations
- RNA:
-
Ribonucleic acid
- mRNA:
-
Messenger RNA
- UTR:
-
Untranslated region
- TIRF:
-
Total internal reflection microscopy
- SIM:
-
Structured illumination microscopy
- RF:
-
Random forest
- OOB:
-
Out-of-bag
- LASSO:
-
Least absolute shrinkage and selection operator
- ROC:
-
Receiver operating characteristics
- PR:
-
Precision-recall
- aucROC:
-
Area under ROC curves
- aucPR:
-
Area under PR curves
- HTML:
-
Hypertext markup language
- PHP:
-
Hypertext preprocessor
References
Jeffery WR, Tomlinson CR, Brodeur RD. Localization of actin messenger RNA during early ascidian development. Dev Biol. 1983;99:408–17.
Holt CE, Bullock SL. Subcellular mRNA localization in animal cells and why it matters. Science. 2009;326(5957):1212–6.
Medioni C, Mowry K, Besse F. Principles and roles of mRNA localization in animal development. Development. 2012;139:3263–76.
Weatheritt RJ, Gibson TJ, Babu MM. Asymmetric mRNA localization contributes to fidelity and sensitivity of spatially localized systems. Nat Struct Mol Biol. 2014;21:833–9.
Lazzaretti D, Bono F. mRNA localization in metazoans: a structural perspective. RNA Biol. 2017;14(11):1473–84.
Teimouri H, Korkmazhan E, Stavans J, Levine E. Sub-cellular mRNA localization modulates the regulation of gene expression by small RNAs in bacteria. Phys Biol. 2017;14:56001.
Martin KC, Ephrussi A. mRNA localization: gene expression in the spatial dimension. Cell. 2009;136:719–30.
Tian L, Chou HL, Fukuda M, et al. mRNA localization in plant cells. Plant Physiol. 2020;182(1):97–109.
Di Liegro CM, Schiera G, Di Liegro I. Regulation of mRNA transport, localization and translation in the nervous system of mammals. Int J Mol Med. 2014;33(4):747–62.
Wang ET, Taliaferro JM, Lee JA, et al. Dysregulation of mRNA localization and translation in genetic disease. J Neurosci. 2016;36(45):11418–26.
Wang DO, Martin KC, Zukin RS. Spatially restricting gene expression by local translation at synapses. Trends Neurosci. 2010;33:173–82.
Mauger DM, Siegfried NA, Weeks KM. The genetic code as expressed through relationships between mRNA structure and protein function. FEBS Lett. 2013;587(8):1180–8.
Jung H, Gkogkas CG, Sonenberg N. Remote control of gene function by local translation. Cell. 2014;157:26–40.
Cody NA, Iampietro C, Lécuyer E. The many functions of mRNA localization during normal development and disease: from pillar to post. Wires Dev Biol. 2013;2(6):781–96.
Fallini C, Donlin-Asp PG, Rouanet JP, et al. Deficiency of the survival of motor neuron protein impairs mRNA localization and local translation in the growth cone of motor neurons. J Neurosci. 2016;36:3811–20.
Chin A, Lécuyer E. RNA localization: making its way to the center stage. Biochimica et Biophysica Acta (BBA). 2017;1861(11):2956–70.
Hervé C, Mickleburgh I, Hesketh J. Zipcodes and postage stamps: mRNA localization signals and their trans-acting binding proteins. Brief Funct Genom. 2004;3(3):240–56.
Besse F, Ephrussi A. Translational control of localized mRNAs: restricting protein synthesis in space and time. Nat Rev Mol Cell Biol. 2008;9:971–80.
Meignin C, Davis I. Transmitting the message: intracellular mRNA localization. Curr Opin Cell Biol. 2010;22:112–9.
Tian L, Okita TW. mRNA-based protein targeting to the endoplasmic reticulum and chloroplasts in plant cells. Curr Opin Plant Biol. 2014;22:77–85.
Kloc M, Zearfoss NR, Etkin LD. Mechanisms of subcellular mRNA localization. Cell. 2002;108(4):533–44.
Doyle M, Kiebler MA. A zipcode unzipped. Genes Dev. 2012;26(2):110–3.
Mingle LA, Okuhama NN, Shi J, et al. Localization of all seven messenger RNAs for the actin-polymerization nucleator Arp2/3 complex in the protrusions of fibroblasts. J Cell Sci. 2005;118:2425–33.
Andreassi C, Riccio A. To localize or not to localize: mRNA fate is in 3′UTR ends. Trends Cell Biol. 2009;19:465–74.
Jung H, Yoon BC, Holt CE. Axonal mRNA localization and local protein synthesis in nervous system assembly, maintenance and repair. Nat Rev Neurosci. 2012;13(5):308–24.
Buxbaum AR, Wu B, Singer RH. Single β-actin mRNA detection in neurons reveals a mechanism for regulating its translatability. Science. 2014;343(6169):419–22.
Little SC, Tkačik G, Kneeland TB, et al. The formation of the Bicoid morphogen gradient requires protein movement from anteriorly localized mRNA. PLoS Biol. 2011;9(3):e1000596.
Parton RM, Davidson A, Davis I, Weil TT. Subcellular mRNA localization at a glance. J Cell Sci. 2014;127:2127–33.
Wu B, Chao JA, Singer RH. Fluorescence fluctuation spectroscopy enables quantitative imaging of single mRNAs in living cells. Biophys J. 2012;102:2936–44.
Sinsimer KS, Lee JJ, Thiberge SY, et al. Germ plasm anchoring is a dynamic state that requires persistent trafficking. Cell Rep. 2013;5(5):1169–77.
Weil TT, Parton RM, Davis I. Making the message clear: visualizing mRNA localization. Trends Cell Biol. 2010;20(7):380–90.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Johnson LS, Eddy SR, Portugaly E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinform. 2010;11(1):1–8.
Yan Z, Lecuyer E, Blanchette M. Prediction of mRNA subcellular localization using deep recurrent neural networks. Bioinformatics. 2019;35:i333–42.
Zhang ZY, Yang YH, Ding H, et al. Design powerful predictor for mRNA subcellular location prediction in Homo sapiens. Brief Bioinform. 2020. https://doi.org/10.1093/bib/bbz177.
Garg A, Singhal N, Kumar R, Kumar M. mRNALoc: a novel machine-learning based in-silico tool to predict mRNA subcellular localization. Nucleic Acids Res. 2020;48(W1):W239–43.
Bouvrette LPB, Cody NA, Bergalet J, et al. CeFra-seq reveals broad asymmetric mRNA and noncoding RNA distribution profiles in Drosophila and human cells. RNA. 2018;24(1):98–113.
Kaewsapsak P, Shechner DM, Mallard W, et al. Live-cell mapping of organelle-associated RNAs via proximity biotinylation combined with protein-RNA crosslinking. Elife. 2017;6:e29224.
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol. 2005;67(2):301–20.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Zhang T, Tan P, Wang L, et al. RNALocate: a resource for RNA subcellular localizations. Nucleic Acids Res. 2017;45(D1):D135–8.
Huang Y, Niu B, Gao Y, et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680–2.
Su ZD, Huang Y, Zhang ZY, et al. iLoc-lncRNA: predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC. Bioinformatics. 2018;34(24):4196–204.
Melsted P, Pritchard JK. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinform. 2011;12:333.
Han GB, Cho DH. Genome classification improvements based on k-mer intervals in sequences. Genomics. 2019;111(6):1574–82.
Manekar SC, Sathe SR. A benchmark study of k-mer counting methods for high-throughput sequencing. GigaScience. 2018;7(12):giy125.
Zhu PP, Li WC, Zhong ZJ, et al. Predicting the subcellular localization of mycobacterial proteins by incorporating the optimal tripeptides into the general form of pseudo amino acid composition. Mol BioSyst. 2015;11(2):558–63.
Zhao YW, Su ZD, Yang W, et al. IonchanPred 20: a tool to prediction channels and their types. Int J Mol Sci. 2017;18(9):1838.
Feng P, Yang H, Ding H, et al. iDNA6mA-PseKNC: identifying DNA N6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC. Genomics. 2019;111(1):96–102.
Yang H, Tang H, Chen XX et al. Identification of secretory proteins in mycobacterium tuberculosis using pseudo amino acid composition. Biomed Res Int. 2016:1
Liu B, Fang L, Wang S, et al. Identification of microRNA precursor with the degenerate K-tuple or Kmer strategy. J Ther Biol. 2015;385:153–9.
Lai HY, Chen XX, Chen W, et al. Sequence-based predictive modeling to identify cancerlectins. Oncotarget. 2017;8(17):28169–75.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Stat Methodol. 1996;58(1):267–88.
Hoerl AE, Kannard RW, Baldwin KF. Ridge regression: some simulations. Commun Stat Theory Methods. 1975;4(2):105–23.
Friedman J, Hastie T, Tibshirani R. glmnet: lasso and elastic-net regularized generalized linear models. R Pack Ver. 2009;1(4):1–24.
Díaz-Uriarte R. Supervised methods with genomic data: a review and cautionary view. In: Azuaje F, Dopazo J, editors. Data analysis and visualization in genomics and proteomics. New York: Wiley; 2005. p. 193–214.
Hua J, Xiong Z, Lowey J, et al. Optimal number of features as a function of sample size for various classification rules. Bioinformatics. 2005;21:1509–15.
Meher PK, Sahu TK, Rao AR. Prediction of donor splice sites using random forest with a new sequence encoding approach. BioData Min. 2016;9(1):4.
Chen X, Ishwaran H. Random forests for genomic data analysis. Genomics. 2012;99(6):323–9.
Liaw A, Wiener M. Classification and regression by randomForest. Rnews. 2002;2:18–22.
Cheng X, Xiao X, Chou KC. pLoc-mEuk: predict subcellular localization of multi-label eukaryotic proteins by extracting the key GO information into general PseAAC. Genomics. 2018;110(1):50–8.
Liu B, Yang F, Huang DS, et al. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics. 2018;34(1):33–40.
Meher PK, Sahu TK, Saini V, et al. Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC. Sci Rep. 2017;7(1):1–12.
Meher PK, Sahu TK, Gahoi S, Rao AR. ir-HSP: improved recognition of heat shock proteins, their families and sub-types based on g-spaced di-peptide features and support vector machine. Front Genet. 2018;8:235.
Fawcett T. An introduction to ROC analysis. Pattern Recognit Lett. 2006;27(8):861–74.
Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd international conference on machine learning. 2006, pp 233–240
Breiman L. Bagging predictors. Technical Report 421, Department of Statistics, UC Berkeley (1994)
Drucker H, Cortes C, Jackel LD, LeCun Y, Vapnik V. Boosting and other ensemble methods. Neural Comput. 1994;6(6):1289–301.
Peters A, Hothorn T, Hothorn MT. Package ‘ipred’. R Package, 2009
Alfaro E, Gámez M, Garcia N. adabag: an R package for classification with boosting and bagging. J Stat Softw. 2013;54(2):1–35.
Andreassi C, Zimmermann C, Mitter R, et al. An NGF-responsive element targets myo-inositol monophosphatase-1 mRNA to sympathetic neuron axons. Nat Neurosci. 2010;13(3):291.
Gumy LF, Yeo GS, Tung YC, et al. Transcriptome analysis of embryonic and adult sensory axons reveals changes in mRNA repertoire localization. RNA. 2011;17(1):85–98.
Cajigas IJ, Tushev G, Will TJ, et al. The local transcriptome in the synaptic neuropil revealed by deep sequencing and high-resolution imaging. Neuron. 2012;74(3):453–66.
Clegg JS, Barrios MB. The “Cytosol”: A Neglected and Poorly Understood Compartment of Eukaryotic Cells. In: Cañedo LE, Todd LE, Packer L, Jaz J, editors. Cell Function and Disease. Boston: Springer; 1988.
Banfield RE, Hall LO, Bowyer KW, Kegelmeyer WP. A comparison of decision tree ensemble creation techniques. IEEE Trans Pattern Anal Mach Intell. 2006;29(1):173–80.
Acknowledgements
The authors sincerely acknowledge the Advanced Super Computing Hub for Omics Knowledge in Agriculture (ASHOKA) facility at ICAR-IASRI, New Delhi, India. The assistance of Mr. Jai Bhagwan in maintaining the web-server is thankfully acknowledged.
Funding
This study was supported by ICAR CABin Scheme Network project on Agricultural Bioinformatics and Computational Biology (F.No. Agril.Edn. 14/2/2017-A&P dated 02.08.2017), received from Indian Council of Agricultural Research (ICAR), New Delhi. The funder had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Authors and Affiliations
Contributions
PKM conceived the study. PKM, AR and ARR collected and analyzed the data. PKM developed the prediction method and wrote the manuscript. AR and ARR corrected the draft manuscript. All authors reviewed and approved the manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The author declares that they have no potential conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file1: Table S1
. Summary of the training data set, independent test set-I and II and comparison test set-I and II.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Meher, P.K., Rai, A. & Rao, A.R. mLoc-mRNA: predicting multiple sub-cellular localization of mRNAs using random forest algorithm coupled with feature selection via elastic net. BMC Bioinformatics 22, 342 (2021). https://doi.org/10.1186/s12859-021-04264-8
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859-021-04264-8