mLoc-mRNA: predicting multiple sub-cellular localization of mRNAs using random forest algorithm coupled with feature selection via elastic net

Background Localization of messenger RNAs (mRNAs) plays a crucial role in the growth and development of cells. In particular, it plays a major role in regulating spatio-temporal gene expression. In situ hybridization is a promising experimental technique for determining the localization of mRNAs, but it is costly and laborious. It is also known that a single mRNA can be present in more than one location, whereas the existing computational tools are capable of predicting only a single location for such mRNAs. Thus, a high-end computational tool is required for reliable and timely prediction of the multiple subcellular locations of mRNAs. Hence, we developed the present computational model to predict the multiple localizations of mRNAs. Results The mRNA sequences from 9 different localizations were considered. Each sequence was first transformed into a numeric feature vector of size 5460, based on k-mer features of sizes 1–6. Out of the 5460 k-mer features, 1812 important features were selected by the Elastic Net statistical model. The Random Forest supervised learning algorithm was then employed to predict the localizations with the selected features. Five-fold cross-validation accuracies of 70.87, 68.32, 68.36, 68.79, 96.46, 73.44, 70.94, 97.42 and 71.77% were obtained for the cytoplasm, cytosol, endoplasmic reticulum, exosome, mitochondrion, nucleus, pseudopodium, posterior and ribosome respectively. With an independent test set, accuracies of 65.33, 73.37, 75.86, 72.99, 94.26, 70.91, 65.53, 93.60 and 73.45% were obtained for the respective localizations. The developed approach also achieved higher accuracies than the existing localization prediction tools. Conclusions This study presents a novel computational tool for predicting the multiple localizations of mRNAs. Based on the proposed approach, an online prediction server “mLoc-mRNA” is accessible at http://cabgrid.res.in:8080/mlocmrna/.
The developed approach is believed to supplement the existing tools and techniques for the localization prediction of mRNAs. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04264-8.


Background
The discovery of the asymmetrical distribution of β-actin mRNA in ascidian embryos and eggs by Jeffery et al. [1] laid the foundation for the study of mRNA localization. Much of our present understanding of mRNA localization, however, stems from the research undertaken thereafter in fungi and animals [2][3][4][5][6]. Localization of mRNAs plays a prominent role in the spatio-temporal regulation of gene expression, which is crucial for different cellular and developmental processes including asymmetric cell division, cell migration, embryonic patterning and cellular adaptation to stress [2,3,7,8]. Localization of mRNAs further facilitates the sub-cellular localization of proteins, which helps to establish and maintain cell polarity [2]. In the mammalian nervous system, localization of mRNAs plays vital roles ranging from axon and dendrite pathfinding to synapse generation [9,10]. Maintenance of the synaptic plasticity responsible for long-lasting learning and memory also depends upon the localization of mRNAs [11][12][13]. Deregulation of mRNA stability and localization may cause different genetic disorders including cancer [9]. More details on the adverse impact of deregulated mRNA localization in animals can be found in existing studies [14][15][16].
A crucial factor for the localization of mRNAs is the recognition of cis-acting signals (or zipcodes) by trans-acting factors (mostly RNA-binding proteins) [17]. The cis-acting elements are mostly present in the 3'UTR of mRNAs [3,[18][19][20]. Localization of mRNAs takes place through three mechanisms, i.e., vectorial transport from nuclei, localized protection from degradation, and directional transport on the cytoskeleton via molecular motors [2,3]. Since a single localized mRNA can produce multiple protein copies, the localization of mRNAs saves energy for the cell [21]. Besides, local translation prevents proteins from interacting with other proteins and from being synthesized in incorrect locations, which would be harmful to the cell [22]. The mRNA localization also facilitates the assembly of macromolecular protein complexes [23] and regulates differential translation as well [24,25].
Although in situ hybridization is a reliable experimental technique for mRNA localization, it is a slow and laborious approach. Improved in situ hybridization protocols allow rapid labeling of mRNAs [26,27], but they are limited to certain tissues [28]. With the advancement of in vivo mRNA localization techniques such as the MS2-PP7 system [29], total internal reflection fluorescence (TIRF) microscopy [30] and 3D structured illumination microscopy (3D-SIM) [31], a large number of localized mRNAs are available in the public domain. Hence, the development of computational tools for mRNA localization prediction has now become feasible.
Predicting the localization of mRNAs can be formulated as both a supervised and an unsupervised learning problem. In the unsupervised case, homology-based methods such as BLAST [32] and HMM [33] can be used to predict the localization of mRNAs. With BLAST, a database of all the localized mRNA sequences can be built, and each query sequence can then be assigned to a particular localization depending upon the sequence similarity found in the BLAST search. Similarly, an HMM profile can be created from all the localization datasets, and the localization of a query sequence can be determined by searching against the created profile. However, such models can predict only a single localization for any mRNA. On the other hand, if separate databases/profiles are created for different localizations in order to predict multiple localizations, there is always the possibility of obtaining a large number of false positives. This may be why the existing tools are based on supervised learning. Supervised computational tools such as RNATracker [34], iLoc-mRNA [35] and mRNALoc [36] have been developed for predicting mRNA localization. However, although a single mRNA can be present in more than one localization, the existing tools are meant for predicting a single localization only; in other words, only one localization can be predicted for each mRNA with these tools. The use of the CeFra-Seq [37] and APEX-RIP [38] datasets in RNATracker further limits its application. Also, the number of localizations considered in the existing tools is not more than five.
The above-stated factors prompted us to develop a new computational model for predicting the multiple localizations of mRNAs. k-mer features were generated to translate the mRNA sequences into vectors of numeric elements. The Elastic Net [39] statistical model was used for the selection of important k-mer features. The Random Forest [40] supervised learning model was employed to predict the localizations using the selected k-mer features. The performance of the model was evaluated through cross-validation and also by using independent datasets. The proposed model achieved higher accuracy in most of the localizations when compared on the independent test datasets. We have also developed an online prediction server for the multiple subcellular localization prediction of mRNAs. The developed model is expected to supplement the existing localization prediction tools in particular, and the wet-lab experiments in general.

Collection and processing of localization dataset
The mRNA sequences of 9 localizations, i.e., cytoplasm (17,053), cytosol (18,173), endoplasmic reticulum (6179), exosome (1623), mitochondrion (525), nucleus (25,981), pseudopodium (923), posterior (402) and ribosome (16,602), were collected from the RNALocate database [41] (this order of localizations is maintained hereafter). Except for the posterior and mitochondrion, sequences of the remaining localizations were found to overlap (Fig. 1a). In other words, sequences were present in more than one localization, up to a maximum of 4 localizations. For instance, 2809 sequences were found common to the cytoplasm, cytosol, nucleus and ribosome (represented as purple dots in Fig. 1a). After removing the overlapping sequences, 2730, 4412, 2008, 1205, 488, 6522, 670, 385 and 3537 mRNA sequences were obtained for the respective localizations. Since a higher degree of sequence similarity may lead to overestimation of the prediction accuracy, CD-HIT [42] was employed with an 80% cut-off to remove redundant sequences from each localization. The 80% threshold has also been used in earlier studies [35,43]. After removing the redundancy, 1804, 2158, 1020, 843, 457, 3304, 216, 187 and 1838 sequences were obtained for the corresponding localizations. Sequence-length variability also differed across localizations, being highest for the nucleus and lowest for the mitochondrion (Fig. 1b).

Preparation of the positive, negative and independent data sets
A single mRNA can be present in multiple subcellular locations. With a single multi-class prediction model trained on 9 different classes (corresponding to the 9 localizations), only one localization (the one with the highest probability) can be predicted for each mRNA sequence. Therefore, multiple binary classifiers, rather than a single multi-class classifier, are required to predict the multiple localizations of mRNAs; specifically, 9 binary classifiers for the 9 localizations. By using 9 binary classifiers, the probability of each mRNA belonging to each of the nine localizations can be obtained. Then, with a probability threshold of 0.5, the localization (one or more) can easily be decided for a given mRNA sequence. Keeping the above in mind, positive and negative datasets were prepared for each of the 9 binary classifiers. The sequence datasets obtained after removing the redundancy were randomly divided into 6 equal-sized subsets for each localization, where one randomly drawn subset from each localization was kept aside for use as an independent dataset. The remaining 5 subsets of each localization were used to train and validate the classifiers through five-fold cross-validation. In particular, 300, 360, 170, 140, 76, 550, 36, 31 and 306 sequences were used as an independent set (Independent test set-I), and the remaining 1504, 1798, 850, 703, 381, 2754, 180, 156 and 1532 sequences of the respective localizations were used for training and validating the classifiers (Fig. 2). Further, the numbers of training sequences for the pseudopodium (180) and posterior (156) were not large; thus, if more sequences had been assigned to the test set, fewer sequences would have remained for training. Similarly, the numbers of test sequences for these two localizations (36 and 31) were also small.
So, if more sequences had been assigned to the training set, very few sequences would have been left for the test set. Thus, we did not consider any other split of the training and test sets. Nevertheless, more stable accuracy can be obtained by training the model with a larger dataset. For a given localization, sequences of that localization were used as the positive set and the sequences of the remaining 8 localizations were utilized as the negative set. For instance, 1504 sequences were used as the positive set for the cytoplasm and 8354 (1798 + 850 + 703 + 381 + 2754 + 180 + 156 + 1532) sequences of the remaining localizations were utilized as the negative set. The positive and negative datasets were prepared similarly for the other localizations. Besides Independent test set-I, we prepared another independent dataset (Independent test set-II) from the sequences filtered out by CD-HIT during redundancy removal at the 80% threshold. Independent test set-II contained 490, 1037, 485, 185, 14, 1266, 79, 121 and 798 sequences for the respective localizations. A graphical representation of the steps involved in preparing the positive, negative and independent datasets is shown in Fig. 2. A summary of the sizes of the positive and negative datasets used in the training and test sets is provided in Additional file 1: Table S1.
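The one-vs-rest thresholding scheme described above can be sketched as follows. This is an illustrative Python sketch, not the actual implementation of the study; the probability values are hypothetical stand-ins for the outputs of the 9 binary classifiers.

```python
# The 9 localizations, in the order used throughout this study
LOCATIONS = ["cytoplasm", "cytosol", "endoplasmic reticulum", "exosome",
             "mitochondrion", "nucleus", "pseudopodium", "posterior", "ribosome"]

def call_localizations(probs, threshold=0.5):
    """Report every localization whose binary classifier probability
    exceeds the threshold -- one mRNA may therefore get several calls."""
    return [loc for loc, p in zip(LOCATIONS, probs) if p > threshold]

# Hypothetical probabilities for a single query mRNA
print(call_localizations([0.81, 0.67, 0.12, 0.05, 0.02, 0.74, 0.10, 0.03, 0.55]))
# ['cytoplasm', 'cytosol', 'nucleus', 'ribosome']
```

Because each classifier is thresholded independently, an mRNA with no probability above 0.5 is simply assigned to no localization, and an mRNA with several high probabilities receives multiple calls.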

K-mer feature generation
k-mer features have been effectively and widely used in many bioinformatics applications, including sequence alignment, genome assembly, characterization of microbial communities and others [44][45][46]. These features have also been widely adopted for the recognition of regulatory elements on genomic DNA/RNA [47][48][49]. A k-mer is a sequence fragment of k nucleotides, and the total number of possible k-mers is 4^k. We employed k-mer sizes 1, 2, 3, 4, 5 and 6, and hence the total number of features generated was 5460 (4^1 + 4^2 + 4^3 + 4^4 + 4^5 + 4^6). Besides the "curse of dimensionality", using such a large number of features may result in low prediction accuracy due to the presence of many irrelevant/redundant features [48,[50][51][52]. Overfitting of the model may be another issue with a large number of features, which results in low generalization ability of the model [43]. These limitations can be overcome to a large extent by applying a feature selection technique.
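As an illustration, a minimal Python sketch of the k-mer transformation follows. The paper does not state whether raw counts or frequencies were used, so normalization to relative frequencies is an assumption here.

```python
from itertools import product

def kmer_features(seq, kmax=6):
    """Map an mRNA sequence to a fixed-length numeric vector of k-mer
    frequencies for k = 1..kmax (4 + 16 + ... + 4096 = 5460 features)."""
    seq = seq.upper().replace("U", "T")   # treat RNA and DNA alphabets alike
    vec = []
    for k in range(1, kmax + 1):
        counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:            # skip windows with ambiguous bases (e.g. N)
                counts[kmer] += 1
        total = max(len(seq) - k + 1, 1)
        # fixed lexicographic order so every sequence yields the same layout
        vec.extend(counts[m] / total for m in sorted(counts))
    return vec

v = kmer_features("AUGGCUAAG")
print(len(v))  # 5460
```

Each block of the vector is ordered lexicographically, so the same position always corresponds to the same k-mer across sequences.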

Feature selection using elastic net
The Elastic Net [39] statistical model, which is a combination of the LASSO [53] and Ridge regression [54] algorithms, was employed for the selection of important k-mer features. Consider the generalized linear model Y = Xβ + ε, where Y (n × 1) is the response for n observations, X (n × p) is the design matrix for p variables, β is the vector of coefficients and ε is the vector of random errors. For this model, the Elastic Net estimates can be obtained as β̂ = argmin_β {‖Y − Xβ‖² + λ[α‖β‖₁ + (1 − α)‖β‖₂²/2]}. The parameter λ (≥ 0) controls the amount of shrinkage, where λ = 0 represents the ordinary least-squares solution and all the coefficients shrink to zero as λ → ∞. The regularization parameter α balances the Ridge and LASSO penalties, where α = 0 and α = 1 reduce to the Ridge and LASSO models respectively. In the Elastic Net, the coefficients of the least important variables are shrunk to zero, and the variables with non-zero coefficients are considered important. We implemented the Elastic Net using the "glmnet" R-package [55], where the classification task was performed with a binary response variable by invoking the binomial distribution with the parameter family = "binomial". Using a repeated five-fold cross-validation approach, the optimum value of α (based on the highest classification accuracy) was selected from 10 different values, i.e., 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1. With the optimum value of α, the function cv.glmnet (with nfolds = 5) was then employed to find the optimum value of λ (with respect to classification accuracy) through a grid search. The parameters were optimized on a 50% random sample of the training set observations, having the same number of instances from the positive and negative classes.
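For readers who prefer a runnable illustration outside R, the same selection idea can be sketched with scikit-learn, whose l1_ratio corresponds to glmnet's α and whose C plays the role of 1/λ. The toy data and parameter values below are assumptions for illustration, not the study's actual settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # toy stand-in for the 5460 k-mer features
# binary response driven by the first two features only
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Elastic-net-penalized logistic regression: l1_ratio ~ glmnet's alpha,
# C ~ 1/lambda; "saga" is the solver that supports this penalty
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.1, C=0.5, max_iter=5000).fit(X, y)

# Features whose coefficients were not shrunk to zero are kept
selected = np.flatnonzero(model.coef_[0])
print(len(selected), "features selected")
```

In the study itself the binomial-family glmnet model plays this role, with α tuned over a grid and λ chosen by cv.glmnet.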

Prediction using random forest
We employed the Random Forest (RF) [40] supervised learning algorithm for binary classification. RF is an ensemble of several un-pruned classification trees (Fig. 3a), where each classification tree is grown on a bootstrap sample of the training dataset [56,57]. A decision tree starts with the root node containing all the observations and ends with the terminal nodes holding the labels of the observations (Fig. 3a). In RF, randomization is introduced at two stages: (1) while drawing the bootstrap sample and (2) during the splitting of the nodes of the classification tree. Each classification tree of the RF votes for every instance that did not participate in the construction of that tree; such instances are called out-of-bag (OOB) instances. The OOB observations are used to estimate the classification error in RF [58], where the prediction label of each instance is decided by a majority voting scheme. The steps for building an RF classifier are shown in Fig. 3b. RF is a non-parametric technique that handles large datasets, provides high accuracy even for noisy data and mitigates over-fitting through its bootstrapping and ensemble strategy [59]. The ntree (number of classification trees) and mtry (number of variables considered for node splitting) are the two parameters that need to be optimized to achieve higher accuracy. Since the default parameter values often produce high accuracy [60], we used ntree = 500 (the default) and the default mtry, i.e., the square root of the number of features. Further, the larger number of instances in the negative class compared to its positive counterpart (Fig. 2) may bias the prediction towards the negative class. To overcome this problem, we employed 5 RF classifiers (instead of one) for each localization. In each RF classifier, all the instances of the positive class and an equal number of instances randomly drawn from the negative class were utilized.
A five-fold cross-validation approach was adopted to measure the accuracy of each RF, and a majority voting strategy was applied for the final prediction (Fig. 3c). In other words, if an instance was predicted to belong to a certain class in at least 3 out of the 5 RF classifiers, the instance was assigned to that class. This procedure was followed for all 9 binary classifiers. A flow diagram of the steps involved in the proposed approach is shown in Fig. 4.
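A minimal sketch of the balanced-ensemble idea described above (5 RFs trained on down-sampled negatives, combined by majority vote) might look like the following. The synthetic data and its class imbalance are stand-ins for the actual sequence features, and the sketch omits the cross-validation loop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced toy data standing in for one localization's positive/negative sets
X, y = make_classification(n_samples=1200, n_features=40, weights=[0.85],
                           random_state=1)
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]

rng = np.random.default_rng(1)
forests = []
for _ in range(5):  # 5 balanced RFs, as in the paper
    # all positives + an equal-sized random draw of negatives
    bal = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
    rf = RandomForestClassifier(n_estimators=500)  # default mtry = sqrt(p)
    forests.append(rf.fit(X[bal], y[bal]))

# An instance is called positive if at least 3 of the 5 forests vote for it
votes = np.sum([rf.predict(X) for rf in forests], axis=0)
pred = (votes >= 3).astype(int)
```

Down-sampling the negatives inside each forest keeps every classifier balanced, while the vote over five differently sampled forests recovers most of the information in the discarded negatives.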

Performance metrics
Five different performance metrics were used to measure the classification accuracy:

Sensitivity = (N+ − N+−)/N+
Specificity = (N− − N−+)/N−
Accuracy = (N+ + N− − N+− − N−+)/(N+ + N−)
MCC = [(N+ − N+−)(N− − N−+) − N+−N−+] / √[(N+ − N+− + N−+)(N− − N−+ + N+−)N+N−]
F1-score = 2(N+ − N+−) / [2(N+ − N+−) + N+− + N−+]

where N+, N−, N+− and N−+ represent the number of observations of the positive class, the number of observations of the negative class, the number of positive observations misclassified into the negative class and the number of negative observations misclassified into the positive class, respectively [61][62][63][64]. We also computed the area under the receiver operating characteristic (ROC) curve [65] (aucROC) and the area under the precision-recall (PR) curve (aucPR) [66] to measure the classification accuracy. All the performance metrics were computed following five-fold cross-validation. For the classification using RF, the sensitivity, specificity, accuracy, MCC and F1-score were computed based on the majority voting scheme over the 5 RF classifiers, whereas the aucROC and aucPR were computed by averaging over the 5 RFs. Prediction for the independent test set was made by passing it through all 9 binary classifiers, where each binary classifier was built with 5 RF classifiers. The prediction label for each test instance was decided by the majority vote of the 5 RF classifiers.
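Given the notation above, the five threshold-based metrics can be computed as in this short sketch (the example counts are hypothetical):

```python
from math import sqrt

def metrics(n_pos, n_neg, pos_as_neg, neg_as_pos):
    """Compute the five metrics from the paper's notation:
    n_pos = N+, n_neg = N-, pos_as_neg = N+-, neg_as_pos = N-+."""
    tp, fn = n_pos - pos_as_neg, pos_as_neg   # correctly / wrongly classified positives
    tn, fp = n_neg - neg_as_pos, neg_as_pos   # correctly / wrongly classified negatives
    sens = tp / n_pos
    spec = tn / n_neg
    acc = (tp + tn) / (n_pos + n_neg)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    f1 = 2 * tp / (2 * tp + fp + fn)
    return sens, spec, acc, mcc, f1

# e.g. 100 positives with 10 missed, 100 negatives with 20 false alarms
print(metrics(100, 100, 10, 20))
```

Note how MCC and F1 involve the counts of both classes, which is why they (and aucPR) drop on imbalanced test sets while aucROC does not.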

Feature generation and selection analysis
Out of the 10 different values of α in the Elastic Net, the highest aucROC was obtained for α = 0.1, irrespective of the localization (Fig. 5a). Thus, the optimum value of α was determined to be 0.1. With an increase in the value of α, the accuracy declined and stabilized after α = 0.8 for all the localizations (Fig. 5a). The distribution of the selected features over the k-mer sizes is shown in Fig. 6a. For instance, the numbers of selected features for the cytoplasm with k-mer sizes 2, 3, 4, 5 and 6 were 4, 8, 38, 92 and 197 respectively. For k-mer size 1, not a single feature with a non-zero coefficient was found for any of the 9 localizations. Out of the 2459 features selected across the localizations, 1812 were found to be non-redundant. The k-mer CGAT was observed to be the most important feature among all the 1812 k-mers, as it was selected in 7 out of the 9 localizations (Fig. 6b). Further, 14, 29, 83 and 317 k-mer features were selected in 5, 4, 3 and 2 localizations respectively, and the remaining features were found to be important in one localization only. It was also observed that 9 out of the top 15 k-mer features were of k-mer size 4 (Fig. 6b). The 1812 non-redundant k-mer features were employed for the localization prediction using RF.

Cross-validated prediction analysis using random forest
Cross-validated performance metrics are presented in Table 1, whereas the ROC and PR curves for all the 5 RF classifiers are displayed in Fig. 7. The highest accuracy was found for the mitochondrion followed by the posterior, whereas the lowest accuracy was observed for the cytosol (Table 1). The higher accuracies for the mitochondrion and posterior may be attributed to the fact that the mRNA sequences of these localizations were exclusively present in the respective localizations (Fig. 1a). More stable accuracies were observed for the larger datasets (cytoplasm, cytosol, nucleus and ribosome) than for the smaller ones (endoplasmic reticulum, exosome, mitochondrion and pseudopodium). Although the sample size was small for the posterior, its accuracies were still more stable than those of the other localizations with small sample sizes. The probable reason may be that the sequences of the posterior are more conserved than those of the other localizations, as evident from the fact that the posterior sequences were not present in any other localization (Fig. 1a).

Comparative analysis with other bootstrapping methods
We compared the performance of RF with that of other bootstrapping methods, i.e., Bagging [67] and Boosting [68]. Performances were computed in the same way as for RF, using the same dataset. The Bagging and Boosting algorithms were implemented using the ipred [69] and adabag [70] R-packages respectively, with default parameter settings. The results are shown in Fig. 8.

Performance analysis of random forest with different k-mer features
To study the effect of different k-mer sizes, the performance of RF was analyzed with six different k-mer feature sets, i.e., k = 1, 2, 3, 4, 5 and 6, for which the numbers of features utilized were 4, 16, 64, 256, 1024 and 4096 respectively. The accuracies were measured in terms of aucROC and aucPR following five-fold cross-validation (Fig. 9). Except for the exosome, mitochondrion and pseudopodium, the accuracies increased up to k = 3 for the remaining 6 localizations, whereas the accuracies for the exosome, mitochondrion and pseudopodium increased up to k = 4, k = 5 and k = 6 respectively (Fig. 9). The higher accuracy with k = 3 may be attributed to the codon structure, which results in better discrimination of mRNA sequences.

Evaluation with independent test set-I
Performance metrics for the Independent test set-I are given in Table 2. The aucROC values (~ 73-98%) were at par with the cross-validation accuracy, whereas the aucPR values (~ 7-75%) were much lower than the cross-validation accuracy because of the highly imbalanced test dataset (Table 2). This is because ROC is independent of the class distribution, whereas PR depends upon the number of observations present in the different classes. The MCC also takes the class distribution into account, and hence its values were likewise lower on the imbalanced dataset (Table 2).

Evaluation with independent test set-II
Performance metrics for the independent test set-II are given in Table 3. The sensitivities were much higher than those for Independent test set-I, because this dataset shares more than 80% sequence similarity with the training set; the overall accuracies were at par with those for the training dataset. Prediction accuracies were also higher than those for Independent test set-I for the same reason. The aucROC values were much higher as well, which may be because ROC does not account for the class distribution.

Comparative analysis with the existing methods
Three tools, i.e., RNATracker [34], iLoc-mRNA [35] and mRNALoc [36], are available for predicting mRNA localizations. Since RNATracker has been trained on the CeFra-Seq/APEX-RIP dataset involving gene expression and coordinate files, it was not included in the comparison. Between iLoc-mRNA and mRNALoc, the latter has been shown to outperform the former. Thus, we compared our approach with mRNALoc, which is also the latest tool in the series. The comparison was made using two datasets, i.e., Independent test set-I (Test set-I) and the independent dataset of mRNALoc (Test set-II). It was ensured that sequences of the mRNALoc test set were not present in our training set and that sequences of Independent test set-I were not present in the training set of mRNALoc. Furthermore, the comparison was made on the 4 localizations (cytoplasm, endoplasmic reticulum, mitochondrion and nucleus) that were common between mRNALoc and our study. The performance metrics were computed by considering the sequences of a given localization as the positive set and those of the remaining three localizations as the negative set. For instance, in Test set-I, the number of positive instances for the nucleus was 83 and the number of negative instances was 142 (86 + 31 + 25). For Test set-I, the proposed approach achieved higher accuracies (67.9, 60.7) and MCC (36.1, 23.8) for the cytoplasm and nucleus respectively (Table 4), whereas mRNALoc achieved higher accuracies (70.2, 70.9) and MCC (42.1, 47.5) for the endoplasmic reticulum and mitochondrion respectively (Table 4). For Test set-II, the developed approach achieved higher accuracies for all four localizations, i.e., cytoplasm (64.3), endoplasmic reticulum (83.0), mitochondrion (91.3) and nucleus (77.6). Except for the mitochondrion, the MCC values of the proposed approach were also higher for the remaining three localizations, i.e., cytoplasm (28.6), endoplasmic reticulum (62.4) and nucleus (21.4).
Taking both the test sets into account, the proposed approach may be said to perform better than that of mRNALoc.

Prediction server mLoc-mRNA
An online prediction server "mLoc-mRNA" is freely accessible at http://cabgrid.res.in:8080/mlocmrna/ for predicting the multiple subcellular localizations of mRNAs. The front-end of the server has been designed using HTML (hypertext markup language), whereas the developed R-programs run at the back end with the help of PHP (hypertext preprocessor). The user can either upload a FASTA file of mRNA sequences or paste mRNA sequences into the text area (Fig. 10a).
The results are shown in terms of the probabilities with which each mRNA sequence is predicted in the 9 different localizations (Fig. 10b). A value of 0.5 is treated as the threshold, above which the test instance is predicted in the positive class and in the negative class otherwise.

Discussion
Localization of mRNAs is an evolutionarily conserved phenomenon, found in plants and animals as well as in unicellular organisms [8]. Coupled with translational regulation, subcellular localization of mRNAs provides a means to control the time and location of protein synthesis and function [2,3,8,28]. Genome-wide analyses [71][72][73] suggest that polarized cells establish functionally distinct compartments through the sub-cellular targeting of mRNAs. Keeping in view the numerous advantages of mRNA localization, this study presents a new computational model for predicting the multiple subcellular localizations of mRNAs. The mRNA localization datasets were collected from the RNALocate database. Besides the 9 localizations considered, mRNA sequences were also available for other localizations such as the anterior, apical, axon and others. However, after removing the overlapping and redundant sequences, very few sequences were left for these localizations, and they were therefore not considered in this study. Though the cytosol is the aqueous part of the cytoplasm, it has its own function: the cytosol concentrates dissolved molecules into the correct positions for efficient metabolism, and a large number of proteins are localized in the cytosol [74]. Therefore, both the cytoplasm and the cytosol were considered as separate localizations. Furthermore, there were overlapping sequences between these two localizations, and we utilized the dataset after removing not only the overlapping sequences but also the sequences that shared more than 80% sequence identity with any other sequence in each localization.
We utilized k-mer features to translate the nucleotide sequences into numeric vectors, and the important k-mer features were selected using the Elastic Net model. The LASSO eliminates redundant/irrelevant features and hence reduces overfitting of the model, whereas Ridge assigns less weight to the features that are less relevant for predicting the response. The Elastic Net combines the strengths of both LASSO and Ridge to improve the prediction performance [39] and may thus be more effective than either alone. Hence, we employed the Elastic Net for feature selection.
The RF algorithm was employed for prediction because it is an ensemble learning method that tends to produce higher accuracy than a single base classifier [40]. The performance of the developed computational model was evaluated through a five-fold cross-validation procedure as well as on the independent test sets. Higher accuracies were obtained for the mitochondrion and posterior than for the other localizations because the sequences of these two localizations share a lesser degree of similarity with the other localizations, as evident from the fact that their sequences were not present in the remaining localizations (Fig. 1a). The RF model is useful not only for classification/prediction but also for the selection of important features through the RF variable importance measures. There are two types of RF variable importance measures, i.e., the mean decrease in impurity (Gini importance) and the mean decrease in accuracy (permutation importance). For the permutation importance of a variable, the prediction accuracy on the OOB sample is first computed. Then, the values of the variable in the OOB samples are permuted while keeping all other variables the same, and the accuracy on the permuted data is computed again. The difference in prediction accuracy caused by the permutation is measured and averaged over all the tree classifiers of the RF. The idea behind permutation importance is that if a variable is important for achieving high prediction accuracy, shuffling its values should lead to an increase in the error, and vice versa. The mean decrease in impurity is the other variable importance measure of RF. The Gini impurity is used as the criterion for choosing the variable on which to split a parental node into daughter nodes. For a particular variable, the decrease in Gini impurity is accumulated for each tree every time the variable is chosen to split a node.
The sum is averaged over all the trees in the forest to obtain the importance of the variable. Though the Gini importance is a biased measure, it requires minimal computation. The permutation-based importance, in contrast, involves an exhaustive permutation procedure and is computationally more demanding, but it is generally a more reliable measure than the Gini importance.
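Both importance measures described above are available in common RF implementations. The sketch below uses scikit-learn on synthetic data; note that scikit-learn's permutation importance is computed on the supplied dataset rather than on the OOB sample described in the text, so it is only a close analog of the procedure above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy data: 10 features, only 3 of which are informative
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Gini importance: impurity decreases accumulated over all node splits,
# averaged over the trees (normalized to sum to 1)
gini = rf.feature_importances_

# Permutation importance: mean accuracy drop after shuffling one feature
# at a time, repeated n_repeats times
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(gini.argmax(), perm.importances_mean.argmax())
```

In practice the two rankings usually agree on the strongest features, but the Gini ranking can be inflated for high-cardinality variables, which is the bias mentioned above.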
Besides RF, we also evaluated the ensemble learning methods Bagging and Boosting. In terms of accuracy and MCC, Bagging performed worse than both RF and Boosting. Out of the 9 localizations, RF achieved higher accuracy and MCC in 5, whereas Boosting achieved higher accuracy and MCC in the other 4. In other words, RF achieved higher accuracy in the majority of the localizations, although the differences in accuracy between RF and Boosting were not large. Further, the Boosting algorithm is more sensitive to noisy data and overfitting than RF, because the classification trees are grown independently in RF whereas they are correlated in Boosting. Also, fewer parameters need to be tuned in RF than in Boosting. Keeping all these points in mind, we preferred RF for prediction in all the 9 localizations. In Bagging, randomization is introduced only while drawing the bootstrap sample, whereas in RF randomization is introduced at two stages, i.e., while drawing the bootstrap sample and while splitting the nodes of the classification tree [40]. This may be why RF achieved higher accuracy than Bagging. As far as Boosting is concerned, it works sequentially and iteratively adjusts the weights of the observations until a better learner is achieved, whereas in Bagging each classification tree is grown independently without adjusting the weights of the observations [75]. This may be the reason for the better accuracy of Boosting over Bagging. Since RF was preferred over the other bootstrapping methods for prediction in the 9 localizations, only RF was used for prediction on the independent datasets. The cross-validation accuracy and the accuracy on the independent dataset were found to be similar for the developed approach.
Performance of mLoc-mRNA was also compared with that of the state-of-the-art single-localization prediction tool mRNALoc using two different independent datasets. mLoc-mRNA achieved higher accuracy in two localizations with test set-I and in all four localizations with test set-II. Thus, the proposed approach may achieve higher accuracy than mRNALoc even for predicting the single localization of mRNAs. The performance was compared only with mRNALoc because it has been shown to outperform other existing methods such as RNATracker [34] and iLoc-mRNA [35]. We used the localization dataset of the RNALocate database, which contains data from 65 organisms, with most of the data coming from model organisms such as Homo sapiens, Mus musculus and Saccharomyces cerevisiae. In contrast, the existing tools RNATracker and iLoc-mRNA were developed using localization data of human mRNAs only. Thus, the developed tool mLoc-mRNA is more general with regard to predicting eukaryotic mRNA subcellular localization. Besides, the datasets used in RNATracker, i.e., CeFra-Seq [36] and APEX-RIP [37], are sometimes noisy and inaccurate [38], whereas the dataset used in our study was collected from RNALocate, which contains manually curated mRNA localization data with experimental evidence. Further, the existing localization prediction tools RNATracker, iLoc-mRNA and mRNALoc were developed for predicting a single localization of mRNAs, whereas mLoc-mRNA is meant for predicting both single and multiple subcellular localizations.
We have also developed an online prediction tool, mLoc-mRNA (http://cabgrid.res.in:8080/mlocmrna/), for multiple-localization prediction of mRNAs. In the tool, the prediction label for a test instance is decided by majority voting among five RF classifiers. The developed tool is expected to supplement the existing single-localization prediction tools as well as the in situ hybridization technique for studying mRNA localization.
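A minimal sketch of such a majority-voting scheme is given below; the five classifiers here differ only in their random seeds, and the data, sizes and seeds are illustrative assumptions rather than the server's actual configuration.

```python
# Sketch: majority voting among five RF classifiers on a test instance.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=30, random_state=2)
X_train, y_train, X_test = X[:350], y[:350], X[350:]

# Train five RF classifiers that differ only in their random seeds.
classifiers = [RandomForestClassifier(n_estimators=50, random_state=s)
               .fit(X_train, y_train) for s in range(5)]

def majority_vote(x):
    """Return the label predicted by most of the five classifiers."""
    votes = [clf.predict(x.reshape(1, -1))[0] for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

preds = [majority_vote(x) for x in X_test]
```

With five voters and nine binary one-versus-rest decisions, ties cannot occur within a single localization's vote, which is one practical advantage of an odd ensemble size.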

Conclusions
In this article, we have developed a new computational method for predicting the multiple subcellular localizations of mRNAs. Using the developed method, users can easily obtain the probability of any mRNA sequence being localized to each of the 9 localizations. The developed web server is expected to be of great help for researchers from non-computational backgrounds. The proposed methodology will certainly supplement the existing studies on the localization prediction of mRNAs.