Cross disease analysis of co-functional microRNA pairs on a reconstructed network of disease-gene-microRNA tripartite

Background MicroRNAs always function cooperatively in their regulation of gene expression. Dysfunctions of these co-functional microRNAs can play significant roles in disease development. We are interested in those multi-disease associated co-functional microRNAs that regulate their common dysfunctional target genes cooperatively in the development of multiple diseases. The research is potentially useful for human disease studies at the transcriptional level and for the study of multi-purpose microRNA therapeutics. Methods and results We designed a computational method to detect multi-disease associated co-functional microRNA pairs and conducted cross disease analysis on a reconstructed disease-gene-microRNA (DGR) tripartite network. The DGR tripartite network was constructed by integrating newly predicted disease-microRNA associations with the relationships of diseases, microRNAs and genes maintained by existing databases. The prediction method uses a set of reliable negative samples of disease-microRNA associations and a precomputed kernel matrix instead of kernel functions. From this reconstructed DGR tripartite network, multi-disease associated co-functional microRNA pairs were detected together with their common dysfunctional target genes and ranked by a novel scoring method. We also conducted proof-of-concept case studies on cancer-related co-functional microRNA pairs as well as on non-cancer disease-related microRNA pairs. Conclusions With the prioritization of the co-functional microRNAs that relate to a series of diseases, we found that the co-function phenomenon is not unusual. We also confirmed that the regulation by microRNAs in the development of cancers is more complex and has more unique properties than in non-cancer diseases. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1605-0) contains supplementary material, which is available to authorized users.


Introduction to the Supplementary information
There are eight supplementary files in total. Supplementary file 1 lists the disease-miRNA associations that we used to construct the DGR tripartite networks. Supplementary file 2 introduces the details of optimizing our disease-miRNA association prediction models and their comparison with other existing methods. Several Supplementary Tables list the results of the case studies of the disease-miRNA association prediction model, and the top 50 candidate multi-disease associated co-functional miRNA pairs for the cancer-related DGR tripartite network and for the non-cancer disease-related DGR tripartite network. In Supplementary file 3, we provide the code and datasets of our methods. As the data files of the similarities between 2802 diseases and the similarities between 551 miRNAs are too big to include, anyone needing these datasets should contact Hui Peng (email: hui.peng-2@student.uts.edu.au). Supplementary file 4 shows the three data sets applied by Chen's method, Xu's method and Jiang's method, which we used for the comparisons with these three existing methods. In Supplementary files 5 and 6, we list the disease-gene associations and miRNA targets that we adopted to compute the disease similarities and miRNA similarities and to find co-functional miRNA pairs. Supplementary file 7 contains the GSE accession ids from the GEO database, which we used to obtain the reliable negative disease-miRNA association samples. The final samples of disease-miRNA associations from different databases are listed in Supplementary file 8, including the mapped disease DO ids, mature miRNA ids, the negative sample set built from the analysis of miRNA expression level fold changes, and four positive sample sets.
The five MATLAB code files, ComKerMat.m, PreDisRNA.m, DMMD.m, CoFunScore.m and CoFunScoreAll.m, implement our methods. All the code is pasted in the fifth section of this supplementary file. The first two files are used for predicting disease-miRNA associations, while the latter three are for prioritizing the multi-disease associated co-functional miRNA pairs. The input data can be found in the Supplementary files. The notes in the code files explain the meaning of the inputs, outputs and data structures. For the prediction of disease-miRNA associations, PreDisRNA.m is the main interface; CoFunScoreAll.m is the main interface for prioritizing the multi-disease associated co-functional miRNA pairs.

The optimal precomputed kernel matrix and the prediction performance
During the optimization of our method and the comparison of different prediction methods, the following seven performance metrics were computed: specificity, recall (or sensitivity), precision, accuracy, F1, mcc and AUC (area under the ROC curve). The definition of mcc is given by:

mcc = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively.
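The metric definitions above can be sketched as follows. The original implementation is in MATLAB (Supplementary file 3); this is an illustrative Python re-implementation, and the toy confusion-matrix counts are arbitrary:

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Compute six of the evaluation metrics from confusion-matrix counts.

    Standard definitions; AUC is excluded because it needs ranked
    scores rather than counts.
    """
    specificity = tn / (tn + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Matthews correlation coefficient (mcc)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"specificity": specificity, "recall": recall,
            "precision": precision, "accuracy": accuracy,
            "f1": f1, "mcc": mcc}

# Toy example on 200 samples (hypothetical counts)
m = confusion_metrics(tp=90, tn=80, fp=20, fn=10)
```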
Two parameters can be tuned to build an optimal prediction model in this work: a weight parameter α and a kernel matrix type KMT. Parameter α mediates between the similarities of diseases and the similarities of miRNAs, while KMT selects a kernel matrix type for the support vector machine (SVM) to make accurate classifications. A detailed explanation of α and KMT can be found in Methods. Experiments for the proper selection of α and KMT were conducted in three steps: (1) construction of training data. We extracted 1487 known disease-miRNA associations between 107 diseases and 276 miRNAs from the miR2Disease database and used them as the set of positive training samples (denoted positive_miR). We also constructed a set of 4638 negative samples between 53 diseases and 538 miRNAs after a comprehensive analysis of the GSE accessions (denoted negative_expression). We randomly selected 1487 negative samples from negative_expression to construct a balanced training data set; (2) prediction model construction. This step has two nested loops. The outer loop changes the value of α from 0 to 1 with a step of 0.1, while the inner loop sets KMT = 1, 2, or 3, representing the three different types of kernel matrices (i.e., the average type, the square root type and the center distance type). A prediction model was constructed for each combination of α and KMT; (3) performance evaluation. We implemented 10-fold cross-validation on the balanced data set with each α and KMT and computed the seven performance metrics (specificity, recall, precision, accuracy, F1, mcc and AUC). We ran the experiment 100 times and took the averages of the seven indices over the 100 runs. Supp. Fig. 1 shows the AUC values and F1 scores.
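The two nested loops of step (2) can be outlined as below. This is a hypothetical Python sketch, not the authors' MATLAB code: `cross_validation_auc` is a dummy placeholder standing in for the real 10-fold cross-validation with the precomputed-kernel SVM (ComKerMat.m / PreDisRNA.m):

```python
import itertools
import random

def cross_validation_auc(alpha, kmt):
    """Placeholder for the real evaluation.

    Hypothetical stand-in: the real code builds a kernel matrix of
    type `kmt` with weight `alpha` and runs 10-fold CV with an SVM.
    Here a deterministic dummy score keeps the loop runnable.
    """
    random.seed((round(alpha, 1), kmt).__hash__())
    return random.uniform(0.80, 0.95)

def grid_search():
    # Outer loop: alpha from 0 to 1 in steps of 0.1.
    # Inner loop: KMT = 1 (average), 2 (square root), 3 (center distance).
    candidates = [(a / 10.0, kmt)
                  for a, kmt in itertools.product(range(11), (1, 2, 3))]
    return max(candidates, key=lambda p: cross_validation_auc(*p))

alpha_star, kmt_star = grid_search()
```

In the real experiment, the inner score would of course come from the SVM, and the whole search would be repeated 100 times with re-sampled negative sets.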
The square root type of KMT outperforms the other two types. When α increases, the AUC and F1 score first increase and then drop, suggesting that the integration of different types of similarities can improve the prediction performance. Furthermore, even when α = 0 or α = 1, the average type and the square root type still achieve AUC values around 0.92 and F1 scores around 0.9, meaning that our precomputed kernel matrix method performs well even with just one kind of similarity information. Comparing the curves in Supp. Fig. 1, the best AUCs and F1 scores are achieved when α is around 0.8. Thus, we chose the square root type of KMT and set α = 0.8 for our prediction model. To evaluate whether our prediction performance was obtained by chance, we conducted a permutation test as Jiang et al. [3] did: instead of using the true labels of the samples (positive or negative), we assigned the labels randomly, then implemented the 10-fold cross-validation and observed the changes in performance. The positive_miR data set was adopted as the positive samples and balanced training data sets were built. The normal predictions (true labels) were considered the control group while the permutation tests were regarded as the test group. Both groups of experiments were repeated 10 times. The ROC curves of the test group and control group are shown in Supp. Fig. 2. The ROC curve of the test group nearly overlaps the random line, while the ROC curve of the control group achieves an AUC value of 0.97, which indicates that the performance of our prediction model was not obtained by chance but reflects biological significance.
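The label-shuffling scheme of the permutation test can be sketched as follows. This is an illustrative Python outline with toy scores, not the authors' MATLAB code; the AUC here is the standard rank-based estimate:

```python
import random

def auc_from_scores(scores, labels):
    """Rank-based AUC: probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_test(scores, labels, runs=10, seed=0):
    """Compare AUC on true labels (control) vs shuffled labels (test).

    If the model's performance were due to chance, shuffling labels
    should give a similar AUC (i.e., about 0.5).
    """
    rng = random.Random(seed)
    true_auc = auc_from_scores(scores, labels)
    permuted = []
    for _ in range(runs):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        permuted.append(auc_from_scores(scores, shuffled))
    return true_auc, sum(permuted) / runs

# Toy prediction scores from a classifier that separates classes well
scores = [0.9, 0.8, 0.7, 0.65, 0.4, 0.35, 0.3, 0.2]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
true_auc, perm_auc = permutation_test(scores, labels, runs=200)
```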

Effect of the size of the negative samples on the prediction performance
To investigate whether the number of negative samples affects the performance of our predictions, we fixed the number of positive samples to the size of the positive_miR data set and varied the number of negative samples in the training data set. All the negative samples were randomly selected from the negative_expression data set. We varied the number of negative samples from 3 times and 2 times the number of positive samples, to an equal number, and down to 80%, 60%, 40% and 20% of the number of positive samples.
In addition, the positive samples from positive_HMDD (4041 positive samples in total, extracted from the HMDD database), excluding those samples already in the positive_miR data set, were adopted to build the validation data set. There are 3484 positive samples in this validation data set. Again, 10-fold cross-validation was implemented on the training data, and the prediction model was then tested on the validation data set. We find that the AUC values show nearly no change under different size ratios between negative and positive samples. However, the accuracy of the prediction on the validation data set drops as the number of negative samples increases, while the mcc value increases until the number of negative samples equals that of positive samples and then stays at the same level even when more negative samples are added. As mcc is a more comprehensive performance index than accuracy, we suggest that a balanced training data set of positive and negative samples should be adopted to infer new disease-miRNA associations, as we did in this work.

Supp. Fig. 2 The ROC curves of the permutation test. The experiment includes a test group and a control group. The test group used permuted labels for the training samples while the control group used the original labels of the same training data set. Both parts of the experiment adopted our optimal prediction model.
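The construction of training sets with different negative-to-positive ratios can be sketched as below. The pair data here are hypothetical placeholders; the real positives come from positive_miR and the negative pool from negative_expression:

```python
import random

def build_training_set(positives, negative_pool, ratio, seed=0):
    """Build a training set with |negatives| = ratio * |positives|.

    Negatives are drawn at random, without replacement, from the
    reliable negative pool, as described in the text.
    """
    rng = random.Random(seed)
    n_neg = int(round(ratio * len(positives)))
    negatives = rng.sample(negative_pool, n_neg)
    samples = positives + negatives
    labels = [1] * len(positives) + [0] * n_neg
    return samples, labels

# Hypothetical disease-miRNA pairs standing in for the real data
positives = [("disease%d" % i, "miR-%d" % i) for i in range(100)]
pool = [("disease%d" % i, "miR-x%d" % i) for i in range(1000)]

# The seven ratios used in the experiment above
for ratio in (3.0, 2.0, 1.0, 0.8, 0.6, 0.4, 0.2):
    samples, labels = build_training_set(positives, pool, ratio)
```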

Performance comparison when changing the approach of selecting negative samples
The negative samples of disease-miRNA relationships used in this work for training the prediction model were randomly selected from the negative_expression data set. There are other ways to construct negative data sets, such as random selection from the unconnected disease-miRNA pairs. We compared the performances of our prediction model when the approach to selecting negative samples was changed. The positive samples were always the same, i.e., the data set positive_miR containing 1487 known disease-miRNA associations.
The negative data set formed by a random selection from the unconnected disease-miRNA pairs is named negative_random (26704 disease-miRNA pairs in total). We conducted two experiments. In the first, we used all 1487 positive samples from positive_miR and 1487 negative samples randomly selected from the negative_expression data set to build the training data set. The second experiment is the same except that the 1487 negative samples were randomly selected from negative_random. 10-fold cross-validation was conducted on the training data sets. To measure test performance, we also used the above validation data set to test the prediction models. All these experiments were repeated 100 times, and the average performance was taken to reduce the bias of the predictions (Supp. Tab. 1). It is clear that the 10-fold cross-validation performance when selecting negative samples from negative_expression significantly outperformed the other approach. Of the 3484 samples in the validation data set, 73.15% were correctly predicted by the model when the negative samples were selected from negative_expression, while the negative_random based model could accurately predict only 50.77% of the 3484 disease-miRNA associations. This comparison indicates that the approach to selecting negative samples has a significant impact on the prediction performance; the best choice is to select negative samples based on the analysis of expression data.
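The negative_random pool described above can be built as follows. This is an illustrative Python sketch with toy disease and miRNA names (the real pool has 26704 pairs):

```python
import itertools
import random

def unconnected_pairs(diseases, mirnas, known_associations):
    """Enumerate disease-miRNA pairs with no known association.

    Every possible (disease, miRNA) pair minus the known positive
    associations, i.e., the negative_random pool of the text.
    """
    known = set(known_associations)
    return [p for p in itertools.product(diseases, mirnas)
            if p not in known]

# Toy inputs (hypothetical; the real data come from miR2Disease)
diseases = ["asthma", "melanoma", "glaucoma"]
mirnas = ["miR-21", "miR-155", "let-7a", "miR-34a"]
known = [("asthma", "miR-21"), ("melanoma", "miR-155")]

pool = unconnected_pairs(diseases, mirnas, known)
negatives = random.Random(1).sample(pool, 2)  # random negative draw
```

Note that such pairs are merely *unconfirmed*, which is exactly why the text finds them less reliable than expression-derived negatives.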

Comparison with other methods
As several methods have been proposed to infer disease-miRNA relationships, it is necessary to compare our prediction model with the existing methods. Here we chose one representative non-machine-learning based prediction model and two machine-learning based prediction models for the comparisons. The first model is RLSMDA, proposed by Chen et al. [1]. As the authors did not provide the source code of their prediction model, we implemented their model using the data they provided and the algorithm introduced in their paper. Chen et al. reported local leave-one-out cross-validation (local LOOCV) and global leave-one-out cross-validation (global LOOCV) on their dataset of 1395 known disease-miRNA associations, covering 271 miRNAs and 137 diseases. We mapped the diseases and miRNAs to DO and miRBase v21, finally obtained 1184 disease-miRNA pairs, and mapped these pairs to our data set. We implemented the global LOOCV on RLSMDA and on our prediction method. We randomly selected 1184 disease-miRNA pairs from negative_expression as the reliable negative samples. The LOOCV was repeated 10 times for our model, with 10 different negative sample sets; the indices are the average values of those 10 runs. The ROC curves of our method and of RLSMDA are shown in Supp. Fig. 4.
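The global LOOCV protocol can be sketched as below. This is a generic Python outline: `toy_model` is a deliberately trivial stand-in for the real scoring models (RLSMDA or ours), and the sample pairs are hypothetical:

```python
def global_loocv(samples, labels, train_and_score):
    """Global leave-one-out cross-validation.

    Each sample is held out in turn, the model is trained on the rest,
    and the held-out sample is scored. `train_and_score` is a
    caller-supplied function standing in for the real model.
    """
    scores = []
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        scores.append(train_and_score(train, train_y, samples[i]))
    return scores

def toy_model(train, train_y, test):
    # Trivial example model: fraction of training positives
    # that share the held-out sample's disease.
    hits = [y for s, y in zip(train, train_y) if s[0] == test[0]]
    return sum(hits) / len(hits) if hits else 0.0

samples = [("d1", "m1"), ("d1", "m2"), ("d2", "m3"), ("d2", "m4")]
labels = [1, 1, 1, 0]
scores = global_loocv(samples, labels, toy_model)
```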
According to Supp. Fig. 4, our prediction model achieves better performance than RLSMDA based on the same positive samples and leave-one-out cross-validation (AUC = 0.9896 for our model versus AUC = 0.9475 for RLSMDA).
Supp. Fig. 4 The ROC curves of our model compared with RLSMDA based on the same positive samples

There are two other SVM-based prediction models: the methods proposed by Xu et al. [2] and Jiang et al. [3]. Both methods adopt the traditional approach of using the feature vectors of disease-miRNA pairs as the input for training and testing. However, the feature vectors of Xu's method are hard to collect, as they used the topological properties of the miRNA target-dysregulated network and the fold change of the miRNA expression level; this makes it hard for the method to predict miRNAs related to different kinds of diseases simultaneously. They applied 37 prostate cancer miRNAs as the positive samples and selected another 44 tissue-specific miRNAs with low expression levels as the negative samples. To compare with this method, we used these 37 prostate cancer miRNAs as the positive samples, randomly selected 37 disease-miRNA pairs from negative_expression as the negative samples, and then implemented 5-fold cross-validation. As the sample size is small, we repeated the 5-fold cross-validation 1000 times with 1000 randomly selected negative sample sets; the final evaluation indices are the average values of the 1000 runs. The ROC curves are listed in Supp. Fig. 5. The ROC curves of Xu's method are also shown in the right part of Supp. Fig. 5 (the curves were obtained from Ref. [2]).
According to Supp. Fig. 5, based on the same positive samples and 5-fold cross-validation, our model achieves an AUC value of 0.9854, better than that of Xu's method at 0.9189. Although this experiment is based on a small sample set of fewer than 100 samples and just one kind of disease, we can conclude that our model achieves better performance on this particular dataset. We did not re-implement Xu's method, so we could not compare our model with Xu's on more diseases.
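The repeated k-fold scheme used above for small sample sets can be sketched as follows. This is an illustrative Python outline with a toy metric; in the real experiment, the metric is the SVM's fold-level performance and the negative sample set is also re-drawn on each repeat:

```python
import random

def kfold_indices(n, k, rng):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_cv_mean(evaluate, n, k=5, repeats=1000, seed=0):
    """Average a fold-level metric over many repeated k-fold splits.

    With tiny sample sets a single k-fold split is noisy, so the split
    is re-drawn `repeats` times and the metric averaged.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        for fold in kfold_indices(n, k, rng):
            total += evaluate(fold)
    return total / (repeats * k)

# Toy metric: fraction of even indices in the held-out fold.
# 74 samples mimics the 37 + 37 design above.
mean = repeated_cv_mean(lambda f: sum(i % 2 == 0 for i in f) / len(f),
                        n=74, k=5, repeats=50)
```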
To compare with Jiang's method in Ref. [3], we also downloaded the positive samples that they adopted to evaluate the performance of their model. There are 270 disease-miRNA pairs in this positive sample set.

Supp. Fig. 5 The ROC curves of our model and Xu's based on the same positive sample set and 5-fold cross-validation
After mapping the diseases to DO and the miRNAs to miRBase v21, 263 disease-miRNA pairs could finally be mapped to our disease and miRNA sets. We adopted these 263 disease-miRNA pairs as the positive samples and randomly selected 263 negative samples from negative_expression to form the gold standard data set. 10-fold cross-validation was implemented on this data set 100 times. The ROC curves of our method and of Jiang's (the latter obtained from Ref. [3]) are shown in Supp. Fig. 6. The indices are listed in Supp. Tab. 2 for comparison of the two prediction methods.

Supp. Tab. 2 The comparison of our method and Jiang's method in Ref. [3] based on their positive sample set with 10-fold cross-validation

From Supp. Fig. 6, we can find that, based on the same positive sample set, all the evaluation indices of our model are higher than those of Jiang's method. Furthermore, their negative samples were randomly selected from the disease-miRNA pairs excluding the positive samples, which in fact are not reliable enough.
Above all, we can conclude that our method achieves better performance than the existing methods for inferring disease-miRNA relationships, based on the cross-validations with different sample sets. All the datasets for the cross-validations can be found in Supplementary file 3.

Supp. Fig. 6 The ROC curves of our method and Jiang's method based on their positive sample set. The right part of the figure was obtained from the corresponding published article.

Note: a: # represents that the association has been confirmed by the HMDD database.