Comprehensive machine-learning-based analysis of microRNA–target interactions reveals variable transferability of interaction rules across species

Background
MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression post-transcriptionally via base-pairing with complementary sequences on messenger RNAs (mRNAs). Owing to the technical challenges involved in applying high-throughput experimental methods, datasets of direct bona fide miRNA targets exist for only a few model organisms. Machine learning (ML)-based target prediction models have been successfully trained and tested on some of these datasets. There is a need to further apply the trained models to organisms for which experimental training data are unavailable. However, it is largely unknown how the features of miRNA–target interactions evolve and whether some features have remained fixed during evolution, raising questions about the general, cross-species applicability of currently available ML methods.

Results
We examined the evolution of miRNA–target interaction rules and used data science and ML approaches to investigate whether these rules are transferable between species. We analyzed eight datasets of direct miRNA–target interactions in four species (human, mouse, worm, cattle). Using ML classifiers, we achieved high accuracy for intra-dataset classification and found that the most influential features of all datasets overlap significantly. To explore the relationships between datasets, we measured the divergence of their miRNA seed sequences and evaluated the performance of cross-dataset classification. Both measures coincide with the evolutionary distance between the compared species.

Conclusions
The transferability of miRNA-targeting rules between species depends on several factors, most notably the composition of seed families and the evolutionary distance. Furthermore, our feature-importance results suggest that some miRNA–target features have evolved while others remained fixed during the evolution of the species.
Our findings lay the foundation for the future development of target prediction tools that could be applied to “non-model” organisms for which minimal experimental data are available.

Availability and implementation
The code is freely available at https://github.com/gbenor/TPVOD.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12859-021-04164-x.

seed conservation, seed base-pairing stability, and target site location. In a subsequent study 7, the model was refined with 96 features.

Training and testing datasets
All the above-mentioned studies trained and tested their models on a dataset of chimeric interactions from human cells generated with the CLASH method 5 , referred to as h1 in the main manuscript. In some of the studies, the dataset was filtered based on the location of the sites, seed-pairing pattern, or functional evidence; in other studies, it was complemented with additional interactions from other experiments.
DeepMirTar 14 and mirLSTM 10 filtered this dataset to include only canonical and non-canonical sites located in 3'UTRs. They complemented this dataset with a small number of additional interactions retrieved from miRecords 15. TarPmir 3 used all the interactions available from the human CLASH dataset h1. chimiRic 8 was trained on a combination of the human CLASH dataset and seed-containing sites from AGO-CLIP data. The model was tested on specific miRNA families by excluding their interactions from the training set. For miRAW 11, the training set was built by intersecting the human CLASH, AGO-CLIP, and TargetScanHuman 1 datasets with miRTarBase 2 and DIANA TarBase 12 to include only validated functional interactions. The initial evaluation was performed on experimentally verified miRNA targets that were excluded from the training set. miRTarget 13 was trained and tested on human CLASH data combined with chimeras generated by endogenous ligation in human AGO-CLIP experiments 4. miRTarget v4 7 combined human chimeras with miRNA overexpression data to identify common features that are characteristic of both miRNA binding and target downregulation.

Independent testing datasets
For additional independent testing, the above-mentioned methods employed various datasets, not all of which were derived from ligation-based experiments. DeepMirTar 14 was tested on a human PAR-CLIP dataset. mirLSTM 10 was tested on a small experimental dataset. TarPmir 3 was tested on two human PAR-CLIP datasets and a mouse HITS-CLIP dataset. The chimiRic 8 model was tested on chimeric interactions from C. elegans 4 and mouse 9. miRAW 11 performed additional evaluations on microarray datasets reporting mRNA changes after transfecting miRNAs into HeLa cells. miRTarget 13 and miRTarget v4 7 tested their models on mRNA microarray datasets from miRNA knockdown experiments. miRTarget v4 performed additional testing on HITS-CLIP data from the mouse brain.

Negative datasets
For negative datasets, DeepMirTar 14 and mirLSTM 10 used mock miRNAs to generate a negative interaction for each positive one. TarPmir 3 generated negative target sites on positive mRNAs, such that they do not overlap with positive sites and their nucleotide composition is similar to that of the positive sites. miRAW 11 built a negative dataset based on experimentally verified data. chimiRic 8 generated a negative set from canonical miRNA seed matches that are not AGO-bound based on CLIP data, together with (miRNA, site) pairs in which an AGO-bound site is paired with an incorrect miRNA based on CLASH chimeras. miRTarget 13 and miRTarget v4 7 generated negative examples based on CLIP data, selecting non-target sites according to a set of criteria, e.g., no overlap with a positive site, detectable expression of the transcript based on microarrays, or the existence of a perfect seed match to one of the miRNAs expressed in the cells.
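The mock-miRNA strategy mentioned above can be illustrated with a minimal sketch: each positive target site is re-paired with a shuffled version of its real miRNA, preserving nucleotide composition. This is a simplified illustration, not the published tools' exact procedure (which applies additional filters, e.g., checking that the mock sequence does not match any known miRNA); the function names are ours.

```python
import random

def mock_mirna(mirna_seq, seed=0):
    """Create a mock miRNA by shuffling the real sequence.

    Shuffling preserves the nucleotide composition, so the negative
    example differs from the positive one only in sequence order.
    """
    rng = random.Random(seed)
    nts = list(mirna_seq)
    rng.shuffle(nts)
    return "".join(nts)

def negatives_from_positives(positive_pairs):
    """Pair each positive target site with a mock miRNA, yielding one
    negative example per positive (miRNA, site) pair."""
    return [(mock_mirna(mirna, seed=i), site)
            for i, (mirna, site) in enumerate(positive_pairs)]
```

In practice, a generated mock would also be checked against the known miRNA complement of the species before being accepted as a negative.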

TRAINING AND TESTING RANDOM DATASET SPLIT
The procedure of splitting the data into training and testing sets plays a crucial role in the evaluation of machine-learning models. In the miRNA-target prediction task, there is no pre-defined split into training and testing sets, as is common in other fields. In this work, we used a stratified training-testing split that ensures the same distribution of miRNA sequences in both the training and the testing sets. To assess how this type of split affects classifier performance, we repeated the analysis with the XGBoost classifier, this time using a random split strategy (control split) to generate the training and testing sets. Our results showed almost no difference between the results achieved with the stratified and the control split methods. As with the stratified split method, we obtained very low standard deviation values, confirming that the split step did not bias our results. The accuracy results for the control splits are provided in Table S3.
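A minimal sketch of such a stratified split, assuming interactions are given as (miRNA, record) pairs (a hypothetical format, not the one used in the repository): interactions are grouped by miRNA, and each group is divided proportionally, so every miRNA is represented in both sets at roughly the same ratio.

```python
import random
from collections import defaultdict

def stratified_split(interactions, test_frac=0.2, seed=42):
    """Split (mirna_id, record) pairs so that each miRNA's interactions
    are distributed between the training and testing sets in roughly
    the same proportion (test_frac)."""
    rng = random.Random(seed)
    by_mirna = defaultdict(list)
    for mirna, rec in interactions:
        by_mirna[mirna].append(rec)

    train, test = [], []
    for mirna, recs in by_mirna.items():
        rng.shuffle(recs)                     # randomize within each miRNA group
        n_test = round(len(recs) * test_frac) # proportional test quota per miRNA
        test += [(mirna, r) for r in recs[:n_test]]
        train += [(mirna, r) for r in recs[n_test:]]
    return train, test
```

The control (random) split would instead shuffle the full interaction list and cut it at the desired fraction, ignoring miRNA identity.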

DESCRIPTION OF THE CLASSIFICATION PERFORMANCE METRICS
The following metrics were used for an in-depth performance analysis of the XGBoost classifier:
AUC - The area under the curve is a performance measure for classification problems at different threshold settings. The AUC reflects the capability of a model to differentiate between classes; AUC values range from 0 to 1, with 1 indicating a model with perfect predictions.
TPR and TNR - The true positive rate (Equation S2) and the true negative rate (Equation S3) are the percentages of actual positive or negative results, respectively, that are correctly identified. For ideal classifiers, these metrics are close to 1.

MCC - The Matthews correlation coefficient (Equation S4) measures the quality of classifications; a coefficient of +1 indicates a perfect prediction, 0 an average random prediction, and -1 an inverse prediction.

[Figure: Distribution of miRNA-target duplexes across six classes according to seed type (canonical or non-canonical) and base-pairing density (low: fewer than 11 bp; medium: 11-16 bp; high: more than 16 bp). The number above each bar indicates the total number of interactions.]

[Table captions: The cells contain the means and standard deviations (in brackets) acquired from 20 models built on different random training/validation/testing splits of the dataset; from 5 models trained and evaluated on different random dataset splits; and from 20 models trained and evaluated on different stratified training-testing dataset splits. In the cross-dataset table, each cell (i, j) represents the mean accuracy and standard deviation, ACC(i, j), of the 20 classifiers that were trained on dataset i and tested on dataset j.]

ACC = (TP + TN) / (TP + FP + FN + TN)  (Equation S1)