 Research
 Open Access
 Published:
Integrating microRNA target predictions for the discovery of gene regulatory networks: a semisupervised ensemble learning approach
BMC Bioinformatics volume 15, Article number: S4 (2014)
Abstract
Background
MicroRNAs (miRNAs) are small noncoding RNAs which play a key role in the posttranscriptional regulation of many genes. Elucidating miRNAregulated gene networks is crucial for the understanding of mechanisms and functions of miRNAs in many biological processes, such as cell proliferation, development, differentiation and cell homeostasis, as well as in many types of human tumors. To this aim, we have recently presented the biclustering method HOCCLUS2, for the discovery of miRNA regulatory networks. Experiments on predicted interactions revealed that the statistical and biological consistency of the obtained networks is negatively affected by the poor reliability of the output of miRNA target prediction algorithms. Recently, some learning approaches have been proposed to learn to combine the outputs of distinct prediction algorithms and improve their accuracy. However, the application of classical supervised learning algorithms presents two challenges: i) the presence of only positive examples in datasets of experimentally verified interactions and ii) unbalanced number of labeled and unlabeled examples.
Results
We present a learning algorithm that learns to combine the score returned by several prediction algorithms, by exploiting information conveyed by (only positively labeled/) validated and unlabeled examples of interactions. To face the two related challenges, we resort to a semisupervised ensemble learning setting. Results obtained using miRTarBase as the set of labeled (positive) interactions and mirDIP as the set of unlabeled interactions show a significant improvement, over competitive approaches, in the quality of the predictions. This solution also improves the effectiveness of HOCCLUS2 in discovering biologically realistic miRNA:mRNA regulatory networks from largescale prediction data. Using the miR1792 gene cluster family as a reference system and comparing results with previous experiments, we find a large increase in the number of significantly enriched biclusters in pathways, consistent with miR1792 functions.
Conclusion
The proposed approach proves to be fundamental for the computational discovery of miRNA regulatory networks from largescale predictions. This paves the way to the systematic application of HOCCLUS2 for a comprehensive reconstruction of all the possible multiple interactions established by miRNAs in regulating the expression of gene networks, which would be otherwise impossible to reconstruct by considering only experimentally validated interactions.
Background
MicroRNAs (miRNAs) are small noncoding RNA molecules (~22 nucleotides in length) representing one of the most interesting class of gene regulators. Since their discovery in 1993 [1], the number of scientific reports on their functional characterization in a great variety of organisms has been growing at an impressive rate. They regulate cell cycle, modulate cell development and differentiation, are involved in the maintenance of cell homeostasis and apoptosis, and ultimately can influence the development and progression of many types of human tumors [2]. The growing amount of evidence of their key role in cancer and recent evidence of their presence in body fluids, such as serum and plasma, has further sparked the interest of the scientific community, emphasizing the possibility of using them as therapeutic targets and noninvasive biomarkers of diseases and of therapy response [3]. However, the full potential of their possible applications in the clinical domain depends on the understanding of their mechanisms and functions. Basically, miRNAs are posttranscriptional regulators that inhibit translation of messenger RNAs (mRNAs) by binding to complementary short sequences (68 nt in length), located inside the 3' untranslated regions (3'UTRs) of transcripts. Depending on perfect or only partial complementarity between the miRNA seed sequence and its target site on the mRNA, the RNAiinduced silencing complex (RISC) associated to the miRNA can mediate the inhibition of translation initiation and/or mRNA decay [4]. More recent experimental evidence of miRNA functional targeting in gene promoter regions suggests that miRNAs may also play an important role in the transcriptional regulation of a great number of genes [5]. Moreover, the discovery of miRNAs' functional targeting in the 5' untranslated region (5'UTR) and in the coding sequence (CDS) of mRNAs further complicates the understanding of their mechanisms.
According to current knowledge, the ability of miRNAs to act as a balance for a large variety of biological processes relies on their capacity to coordinately orchestrate cell signaling pathways by the multiple binding of many key effectors. Therefore, the identification of individual miRNA:mRNA interactions is not sufficient to catch the capacity of miRNAs to regulate complex gene networks. For this reason, much of the research in this field focuses on the development and application of biclustering algorithms [6, 7].
In [7] we have recently proposed a method to identify significant miRNA:mRNA networks, by exploiting a novel biclustering algorithm. However, experiments performed on both experimentally validated and predicted interactions revealed that, although the latter provides a much larger amount of data to analyze, the significance of the networks obtained can be substantially affected by the reliability of the predictions. Indeed, prediction algorithms exhaustively analyze all the possible miRNA:mRNA pairs, searching for structural evidence that could suggest the existence of an interaction.
Examples of such algorithms are RNAhybrid [8], miRanda [9], TargesScan [10], DIANAmicroT [11] and picTar [12]. Although these approaches are significantly cheaper than those based on experimental validation, results of these methods are in many cases uncorrelated to each other and their degree of overlap is poor. Their weakness depends on many factors, especially on the impossibility to incorporate in a single model all the possible interplaying variants that can influence the effects of the miRNA targeting, especially in mammals. Different results can also depend on the approach used and on the rules considered for the miRNA targeting, as well as on the type of resource of sequences they use as a reference dataset [13, 14].
Furthermore, in [15] the authors showed that the reliability of such algorithms, in terms of precision and recall values computed against validated interactions, is, in general, very low. One of the approaches to overcome this issue consists in the combination of the predictions of several algorithms. In [16], some different approaches for combining predictions were compared, i.e. majority vote, ranking aggregation and Bayesian network classification. This last strategy represents one of the first attempts to exploit machine learning approaches to learn to combine predictions and, in this way, to identify a more reliable set of predicted interactions. In particular, the authors proposed the application of a supervised learning algorithm, i.e. a Bayesian network learner, to distinct sets of features considered in three wellknown prediction algorithms, i.e. RNAhybrid, miRanda and TargesScan.
Although they are promising, existing machine learning solutions for learning to combine predictions are still at an embryonic stage [17]. For example, the applicability of the method proposed in [16] is limited to those scenarios in which a large number of both positive and negative examples is available. In general, when exploiting machine learning approaches to learn to combine interactions predicted, some issues have to be taken into account: i) Very few interactions are experimentally validated and can be considered as "stable" training examples. ii) Only positive examples of interactions are available, whereas negative examples are not generally available and, when available, their number is relatively small. iii) Prediction algorithms consider similar features and their simple combination can lead to the socalled collinearity problem [18].
All these issues are considered in this paper. In order to face i), we propose a semisupervised learning algorithm, which takes into account both (positively) labeled examples and the huge amount of unlabeled (unknown) instances during the learning phase. In order to overcome issue ii), the proposed learning algorithm is able to learn from only positive examples. As for iii), the collinearity problem can be alleviated by considering as features the scores (outputs) obtained by several prediction algorithms (instead of the original features), resorting to a solution which is similar to those adopted in metalearning algorithms. The advantage of applying machine learning techniques to the outputs of several prediction algorithms consists in automatically adapting to unknown patterns of the outputs and associating more reliable prediction values when these patterns occur.
The proposed learning algorithm can be used either as a standalone software or in combination with the system HOCCLUS2 (an extension of the algorithm HOCCLUS [19]), in order to discover more complete and realistic miRNA:mRNA regulatory networks.
Related work
The research reported in this paper has its roots in work which studies semisupervised learning algorithms for learning from only positive examples. It also originates from work which studies the opportunity of combining the results of distinct miRNA target prediction algorithms, with the goal of obtaining more reliable predictions.
Learning a classifier from only positive and unlabeled training examples
The problem of learning a classifier in a semisupervised setting (or in a transductive setting [20]) and, in particular, from only positively labeled examples, has already been investigated in many research papers. In general, as reported in [21], two main approaches have been followed in previous works. The most common consists in the identification of the most likely negative examples from the unlabeled set and in the application of a standard supervised learning algorithm [22–25]. This approach is sometimes extended to identify also additional positive examples from the unlabeled set [26].
The less common approach consists in assigning weights to unlabeled examples and then training a classifier which interprets them as weighted negative examples. This approach is used for instance in [27, 28] and has been recently considered in [21], which inspired the method proposed in the present paper. The peculiarities of this last work are: first, it provides a principled way of choosing weights; second, it assigns a different weight to each unlabeled example, instead of assigning the same weight to every unlabeled example. However, contrary to our solution, the authors assume that each unlabeled example can be viewed as being both a weighted negative example and a weighted positive example, where the weights represent the probability that an unlabeled example is negative/positive. Since the two probabilities are not independent, this solution may generate redundancy in the representation. The second difference is that in [21], balancing is assumed between positive and unlabeled examples. This assumption does not hold in our case, where the number of miRNA:mRNA validated interactions is significantly lower than the number of possible miRNA:mRNA pairs. This last aspect motivates the use of the ensemble learning approach we have adopted, as explained in the rest of the paper.
Combining the output of miRNA target prediction algorithms
In [29], the authors identified two distinct approaches for data integration: the "lowlevel" approach, which directly deals with multifactorial raw data and the "highlevel" approach, which combines multiple sametype results from different studies. Following this classification, in [15] the authors evaluate a highlevel solution that combines predictions provided by several existing algorithms. An interaction is considered reliable if at least k algorithms confirm it. In this case, however, the decision is taken on the basis of a simple counting of the algorithms that confirm a prediction. This means that this solution does not identify patterns of the outputs and does not adapt the final prediction to them. Finally, it is highly sensitive to the collinearity problem: algorithms that work on the same features will produce similar predictions, affecting the counting.
Similarly, StarBase [30], a recently developed database for exploring miRNA:mRNA interaction maps from argonauta CLIPSeq (highthroughput sequencing of RNA isolated by crosslinking immunoprecipitation) and degradomeseq data (parallel analysis of RNA ends  PARE), intersects experimental results with predictions from six target prediction algorithms, to enhance precision and recall and identify miRNAtarget regulatory relationships in six different organisms.
In [16], the authors evaluated the performance of single target prediction algorithms and of some high and lowlevel integration approaches to improve prediction accuracy. In particular, for highlevel approaches they propose a simple majority voting solution and a ranking aggregation solution. As regards lowlevel approaches, the authors propose the application of a machine learning algorithm (i.e. Bayesian Network classification algorithm), which is able to provide a high level of adaptivity. The considered sets of features (lowlevel approach) are generated through combinatorially combining the sets of features taken into account by each single algorithm. Although the basic idea is similar to ours, the application of the machine learning algorithm to basic (possibly redundant) features could cause collinearity problems.
In addition to [16], in [31] the authors propose improving prediction capabilities through the application of machine learning solutions. However, in this case, a mixed high/lowlevel approach is followed. In particular, the authors propose the application of a Naïve Bayes classifier on a dataset of possible interactions represented by 57 structural features. The goal is to filter the output of the prediction algorithm miRanda, in order to decrease the amount of false positives. The problem of the absence of negative examples is solved by randomly generating dummy miRNAs and dummy interactions. The drawback of this solution is that the learned model is not deterministically determined and might be subject to some biases implicitly introduced in the artificially generated negative set.
Finally, in [32], the authors propose a highlevel approach to learn a logistic regression model from the output of several miRNA target prediction algorithms. The proposed approach works in the classical supervised learning setting and does not exploit information from unlabeled examples (potential interactions) during the learning phase. This makes the approach difficult to apply when only few labeled interactions are available during learning and a huge amount of possible interactions have to be predicted. Moreover, the problem of negative examples does not apply in this case, since TarBase [33] + LCL [34], which contains both positive and negative examples, is used. Although the use of these datasets is, in this respect, beneficial, it limits the training set to a small number of interactions which is not comparable to the number of interactions we take into account during the learning phase (thousands vs. millions).
Methods
The learning solution we present in this section is framed in the semisupervised learning setting which, in addition to positive examples, takes advantage from unlabeled examples. Indeed, since we do not have negative examples in the training set, it becomes necessary to resort to this learning setting.
Before formally introducing the problem we intend to solve, some useful definitions are necessary. Let:

$\mathcal{M}$ and $\mathcal{G}$ be the sets of miRNAs and mRNAs, respectively;

$x=\u27e8m,g\u27e9\in \left(\mathcal{M}\times \mathcal{G}\right)$ be a (possible) interaction between miRNA m and mRNA g;

p_{ k }(x) be the prediction score for the interaction x returned by the kth target prediction algorithm, 1 ≤ k ≤ s;

p(x) = [p_{1}(x), p_{2}(x), ..., p_{ s }(x)] be the vector of prediction scores for the interaction x;

l(x) be a function which returns 1 if x is a labeled (experimentally validated) interaction, 0 otherwise;

f(x) be an ideal function which returns 1 if x represents a true interaction, 0 otherwise;

$L=\left\{xx\in \left(\mathcal{M}\times \mathcal{G}\right)\wedge l\left(x\right)=1\right\}$ be the subset of labeled interactions;

$U=\left(\mathcal{M}\times \mathcal{G}\right)L$ be the subset of unlabeled interactions.
In our case, since only positive interactions are labeled, the following equation holds:
The goal is to learn a function ${f}^{\prime}\left(p\left(x\right)\right)$ which approximates the probability that f(x) = 1, that is ${f}^{\prime}\left(p\left(x\right)\right)\approx P\left(f\left(x\right)=\mathsf{\text{1}}\right)$. As suggested in [21], it can be learned by exploiting (1) in the following steps:
This means that
In this equation, both the numerator and the denominator can be estimated by an adhoc probabilistic classifier specifically used for this purpose. In [21], this classifier is called nontraditional classifier. The following subsection is devoted to explaining how this classifier is used.
Estimating P(l(x) = 1) and P(l(x) = 1  f(x) = 1))
In our work, the nontraditional classifier is learned through the LIBSVM algorithm [35] with Platt scaling [36], in order to get probability estimates. We choose a Support Vector Machine (SVM)based algorithm mainly for the following reasons: 1) they have a (relatively) good computational efficiency, especially in the prediction phase which is based on a very limited number of examples (support vectors); 2) they are robust to noise and to feature redundancy [37]; 3) their effectiveness (with Platt scaling) has already been evaluated and proved in the semisupervised setting described in the paper which inspired our research [21]. However, it is noteworthy that every other algorithm that exhibits similar properties can be plugged into our framework.
LIBSVM is applied in order to solve the following problem:
Given: a set of training examples {(p(x), l(x))}_{ x }, where p(x) is the vector of prediction scores associated to the interaction x and l(x) (1 if the example is labeled, 0 otherwise) represents the class for the nontraditional classifier;
Find: a probability function $g:{\mathbb{R}}^{s}\to \mathbb{R}$ which takes as its input a vector of prediction scores p(x) and returns the probability that the interaction x is labeled. In this way, g(p(x)) ≈ P(l(x) = 1).
In the way we use LIBSVM, we do not have testing examples and g(p(x)) represents the posterior class probability that a training example p(x) is classified as positive (that is, labeled), according to the optimal separating hyperplane of the nontraditional classifier.
As for the denominator, we assume that all labeled positive examples are taken completely randomly from all positive examples. Formally:
In other words, $P(l\left(x\right)=\text{1}f\left(x\right)=\text{1}))$ is independent of the specific interaction x. This assumption is essential for the purpose of learning from only positive examples and is coherent with the "selected completely at random" assumption in [21, 38]. In this particular domain, this assumption could appear too much strong, since mRNA:miRNA pairs are generally not chosen randomly for biological validation. However, this happens in many other application domains, where examples are chosen on the basis of the trainer/expert's background knowledge. Moreover, it is noteworthy that this assumption is similar to that which is typically made for the classical classification task, where we assume that the underlying distribution of (labeled) positive and (labeled) negative examples in the training set is similar to that of the examples to be classified. This analogous assumption is typically considered also in the application of classifiers for prediction tasks in the biological domain (e.g. protein function prediction).
Assumption (3) allows us to use g(p(x)) also in the computation of $P(l\left(x\right)=\text{1}f\left(x\right)=\text{1}))$. In particular, since a possible estimator of $P\left(l(\xb7)=\text{1}f(\xb7)=\text{1}\right))$ is the average value of g(p(x)) for all labeled positive examples, we have:
Differently from [21], in our case, we have to deal with the problem of unbalanced class distributions when learning the nontraditional classifier to obtain g(p(x)). Indeed, since the ratio between labeled and unlabeled examples is about 1/2000 (see Section "Results and Discussion"), LIBSVM would always learn a classifier which predicts all the interactions as unlabeled, independently of the considered interaction. In order to solve this problem, we resort to a sampling solution which is illustrated in the following.
Ensemble learning g(·)
The sampling procedure considered in this work is similar to that used in bootstrap estimation of the value of an evaluation measure (e.g., predictive accuracy of a classifier) [39], as well as in some ensemble data mining methods, such as bagging [40], which combine multiple models to achieve better prediction accuracy than any of the individual models.
More precisely, LIBSVM is run K times. At each execution, it is applied to the set of examples $L\cup {U}^{j}\left(j=\mathsf{\text{1}},\mathsf{\text{2}},...,K\right),$ that is, to all the labeled examples L and to a subset U^{j} of the unlabeled set U (Figure 1). In this way, we learn K nontraditional classifiers g_{ j }(p(x)), j = 1, ..., K that are combined to obtain g(p(x)).
The K subsets of unlabeled examples are built by randomly sampling, with replacement, n examples from U. The proportion of unlabeled examples in each U^{j} is $\frac{n}{\leftU\right}$.
It is noteworthy that the K samples U^{j} are neither mutually exclusive nor exhaustive, i.e., they do not partition the original data set, so, for instance, even K = 10 samples with n = 0.1·U do not generally cover the entire set of unlabeled examples U. The probability that a particular unlabeled example is in ∪_{ j }U^{j}is the following:
When K = U/n, the above probability approximates 1  e^{1} for large U, where e is Euler's number (≈ 2.7183). Since e^{1} ≈ 0.368, this means that the expected number of unlabeled examples in ${\cup}_{j}{U}^{j}$ is 63.2% of those in U.
Since we are interested in covering a given proportion γ of negative examples (e.g. 90%), we rewrite (5) in terms of the expected number of samples necessary to cover at least γ unlabeled examples:
Differently from data partitioning, which is affected by only one parameter K (the number of partitions), the data sampling procedure used in this work is controlled by two parameters: n and γ. The first parameter represents the number of unlabeled examples in each sample and can be reasonably chosen on the basis of the number of labeled examples, so that the unbalancing problem is mitigated. The second parameter represents the percentage of unlabeled examples we intend to cover.
Once the K classifiers are learned, each classifier g_{ j }(p(x)) is applied to obtain an estimate of P(l(x) = 1) for all the examples in ${U}^{j}$. Since the same unlabeled example can belong to more than one sample, the following equation is used:
Ensemble learning ${f}^{\prime}(\xb7)$
In order to identify the function ${f}^{\prime}\left(p\left(x\right)\right)$, a straightforward solution would be to directly apply Equation (2). However, as empirically proved in [21], a more effective solution consists in the computation of a weight for each example and in training a further (traditional) classifier.
Specifically, we compute the probability that an unlabeled example x represents a positive example as:
which, according to Equation (4), can be approximated to:
where $c=\frac{1}{\leftL\right}{\sum}_{x\in L}g\left(p\left(x\right)\right)$ and P(l(x) = 1) is approximated to g(p(x)).
The training set for the traditional classifier which is in charge of learning f'(p(x)) is then built as follows:
${f}^{\prime}\left(p\left(x\right)\right)$ is learned by applying a variant of LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances) which allows us to specify a weight for each example. In general, in this algorithm, the weight assigned to an example represents the cost of misclassifying it, which is then exploited in the SVM optimization process. In our case, the strategy adopted to compute the weight exploits the probability (Equation (11)) that the assigned label (Equation (10)) is correct. In this way, intuitively, the misclassification cost for a given example will be proportional to the confidence we have in the assigned label.
The strategy we adopt for learning the traditional classifier differs from that adopted in [21], which, as previously stated, represents each unlabeled example as both a positive example with weight $P\left(f\left(x\right)=\mathsf{\text{1}}l\left(x\right)=0\right)$ and a negative example with weight $\mathsf{\text{1}}P\left(f\left(x\right)=\mathsf{\text{1}}l\left(x\right)=0\right).$ This generates redundancy in the representation and possibly prevents the algorithm from learning a good separating hyperplane.
Similarly to the nontraditional classifier, also in this case, we solve the class unbalancing problem (in this case, unbalancing is between positive examples and negative examples, instead of labeled and unlabeled examples), by resorting to the same bagging procedure described in the previous section. The procedure in this case is still necessary since the number of true miRNA:mRNA interactions (positive examples) is significantly smaller than the number of remaining possible miRNA:mRNA pairs (negative examples).
A final remark is made to explain how our algorithm faces the collinearity problem [18]. Collinearity is the problem according to which if some features are (nearly) linearly dependent on the others, a predictive model may not be well estimated. In our case, it is possible that used prediction algorithms consider highly overlapping characteristics and, although we work on their outputs and not directly on the characteristics, the collinearity problem may still be present. In this respect, an important advantage introduced by our algorithm is that it automatically adapts to highly redundant characteristics and does not require a preliminary feature selection step. This important property is achieved through the use of an SVMbased solution, which, as in other scientific fields, has proved to be robust to noise and to highly redundant features [37].
Results and discussion
In this section, we present the considered datasets, define the experimental setting, introduce evaluation measures and present a discussion about obtained results.
Datasets
In order to evaluate our approach, we have considered as data sources a set of experimentally verified miRNA:mRNA interactions, i.e. miRTarBase [41], as well as the set of miRNA target predictions in mirDIP [15]. Interactions from miRTarBase have been used as positive/labeled examples and interactions from mirDIP, but not present in miRTarBase, have been considered as unlabeled examples. In the learning phase, examples are represented according to the standardized scores returned by the algorithms that mirDIP integrates (standardization is performed by mirDIP; although standardization makes scores comparable for the human expert, our algorithm does not strictly require it). Furthermore, we have used TarBase [33] as a testing set because it contains both positive and negative experimentally verified miRNA:mRNA interactions. It is noteworthy that TarBase can also be used in the training phase. However, in this work we have decided to use it in the testing phase, in order to provide good estimates of the algorithm performance on a valid independent test set.
The miRTarBase ver. 3.5 dataset (http://mirtarbase.mbc.nctu.edu.tw/) contains 4,867 experimentally verified miRNAtarget interactions between 729 miRNAs and 2,789 target genes among 17 species. miRNAtarget interactions are collected by manually surveying pertinent literature after applying text mining techniques to filter research articles related to functional studies of miRNAs. Generally, the collected interactions are validated experimentally by reporter assay, western blot, or microarray experiments with overexpression or knockdown of miRNAs. In our study, we only consider the human dataset.
The mirDIP dataset (http://ophid.utoronto.ca/mirDIP/) is an integrated database which includes miRNA target predictions of twelve different datasets. In this study, we consider only the predictions which refer to the 3' UTR region, i.e. those returned by DIANAmicroT [11], microCosm [42], miRanda [9], picTar 4way and picTar 5way [12], PITA All Targets and PITA Top Targets [43], TargetScan Conserved and TargetScan NonConserved [10] and RNA22 3' UTR [44]. As anticipated in the previous section, the high redundancy among the features considered by these datasets motivates the SVMbased solution. The mirDIP dataset used in our experiments contains approximately 5 million predicted interactions between 934 miRNAs and 30,875 mRNAs. The number of predictions returned by each algorithm is reported in Table 1.
TarBase 6.0 (http://www.microrna.gr/tarbase) is the largest available manually curated target database, indexing more than 65,000 miRNAgene interactions. The database includes targets derived from genespecific and high throughput experiments.
Experimental setting
The main goal of the experiments is twofold: a) To evaluate the accuracy of the predictions provided by our algorithm by taking as input unlabeled (a large set of predicted miRNA:mRNA interactions) and positive examples. b) To evaluate whether our algorithm can improve the identification of meaningful regulatory networks. Indeed, as shown in [7], working with a large set of interactions does not always lead to the improvement of the quality of the obtained results. On the contrary, especially when the input data are affected by a huge amount of false positives and false negatives, the significance of the obtained regulatory networks may be compromised. The complete workflow of the experiments is reported in Figure 2.
As for a), we have used TarBase [33] as a test set, which, although limited in the number of included interactions, contains both positive and negative examples. In order to guarantee a fair comparison, we have removed from TarBase all the examples that are also reported in miRTarBase, since they could give an advantage to our approach because used in the learning phase. We also removed inconsistent examples, that is, interactions labeled as both positive and negative. At the end, the considered test set contains 29,091 positive examples and 3,910 negative examples of interactions.
In this study, we compare our approach with several alternative solutions:

Single prediction algorithms (DIANAmicroT, microCosm, miRanda, picTar 4way, picTar 5way, PITA All Targets, PITA Top Targets, TargetScan Conserved, TargetScan NonConserved, RNA22 3' UTR).

Score averaging (SA): a simple algorithm that equally weights the contribution of each single prediction algorithm.

Score averaging  three best (SA3B): an algorithm that equally weights the contribution of the best three prediction algorithms (TargetScan Conserved, PITA Top Hits and picTar 5way), according to [15].

Weighted score averaging  three best (WSA3B): an algorithm that weights the contribution of the best three prediction algorithms (TargetScan Conserved, PITA Top Hits and picTar 5way). Weights are proportional to the reliability (computed on the basis of the FScore) of each algorithm, according to [15].
The last two solutions have been proposed in [7]. It is noteworthy that these combination strategies can be considered a finer variant of the majority vote [16] and counting [15] strategies, since the scores of the predictions are taken into account.
As regards b), in order to identify regulatory networks, we used the system HOCCLUS2 [7], which is based on a biclustering algorithm that has been proved to be a valid tool for supporting biologists in this task. In particular, HOCCLUS2 is able to extract cohesivenesspreserving biclusters, when compared with competitive approaches, containing mRNAs which are statistically more functionally similar than mRNAs which belong to different biclusters. HOCCLUS2 requires two parameters: α, which is the minimum cohesiveness value (see next section for details about the cohesiveness) a bicluster must satisfy after merging, and β, which is the minimum score that must be associated to an interaction to be considered as reliable. Experiments have been conducted with different values of α and β. This is necessary in order to understand if, with the proposed approach, the quality of discovered interaction networks depends on their values.
Coherently with the experimental setting a), HOCCLUS2 has been applied to different data sets of predicted interactions, obtained by applying our approach and three other combination strategies, that is, SA, SA3B, WSA3B.
Evaluation measures
In order to evaluate the accuracy of the predictive models learned by the proposed algorithm we consider the Area Under the ROC Curve (AUC) [45].
In order to evaluate the quality of extracted biclusters, we use the average biclustering cohesiveness and a statistical measures based on the Student's ttest. The average biclustering cohesiveness measures the average strength of the intrabicluster connections: ${\mu}_{q}\left({L}_{j},A\right)=\frac{1}{{\sum}_{{C}_{i}\in {L}_{j}}\left{C}_{i}\right}{\sum}_{{C}_{i}\in {L}_{j}}\left{C}_{i}\rightq\left({C}_{i},A\right)$, where L_{ j } is the set of biclusters obtained at the jth hierarchy level, C_{ i } is a bicluster, C_{ i } is the number of miRNAs and mRNAs in C_{ i } and q is defined as $q\left(C,A\right)=\frac{{\sum}_{x\in C}\left(miRNA\right){\sum}_{y\in C}\left(mRNA\right)\phantom{\rule{2.77695pt}{0ex}}{A}_{x,y}}{\left{C}^{\left(miRNA\right)}\right\cdot \left{C}^{\left(mRNA\right)}\right}$. In the definition of q(C, A), C^{(miRNA)}is the set of miRNAs in the biscluster C, C^{(mRNA)}is the set of mRNAs in C and A_{ x,y } is the score of the interaction (in mirDIP) between the miRNA x and the mRNA y. This function measures the weighted (i.e. by considering the score of the interactions) percentage of interactions in a bicluster, normalized by the maximum number of possible interactions.
In addition to μ_{ q }(), we also use an evaluation measure which is based on the statistical properties of the obtained biclusters. In particular, we use the independent twosample Student's ttest to evaluate the null hypothesis ${H}_{0}:{\mu}_{0}^{\prime}\left({L}_{j}\right)={\mu}^{\prime}\left({L}_{j}\right)$ against the alternative hypothesis ${H}_{1}:{\mu}_{0}^{\prime}\left({L}_{j}\right)={\mu}^{\prime}\left({L}_{j}\right)$, where ${\mu}_{0}^{\prime}\left({L}_{j}\right)$ is the average intrabicluster functional similarity ${\mu}_{0}^{\prime}\left({L}_{j}\right)=\frac{1}{\left{L}_{j}\right}{\sum}_{C\in {L}_{j}}{\mu}_{0}\left(C\right),{\mu}^{\prime}\left({L}_{j}\right)$ is the average interbicluster functional similarity defined as
and
In (12) and (13) SimGIC [46] is a semantic similarity measure computed between two genes, according to the UniProt Homo sapiens GO annotations.
The lower the pvalue (obtained by the twosample Student's ttest), the higher the difference between the average intrafunctional similarity and the average interfunctional similarity. We use both GO Biological Process (BP) and GO Molecular Function (MF) hierarchies to compute SimGIC. Henceforth we will refer to the pvalues computed on BP and MF as p_{ BP } and p_{ M F }, respectively.
Results
In Table 2 we report AUC results obtained by our approach with different values of the sampling parameters n and γ. As expected, the higher the value of γ the better the performance of the algorithm, since a larger amount of unlabeled examples is considered. On the other hand, setting γ to a value which is close to 1 leads to an infinite number of samples (according to Equation (6), $\underset{\gamma \to {1}^{}}{lim}\frac{1}{n}\cdot \frac{log\left(1\gamma \right)}{log\left(1\frac{1}{\leftU\right}\right)}=+\infty $). This means that γ has to be set by keeping in mind a good balance between effectiveness and efficiency.
Moreover, changing the number of unlabeled examples in each sample n does not lead to a significant difference in the results, although results with n = 10,000 outperform those obtained with other values of n. For the experiments reported in the rest of the paper we selected the parameters which let us obtain the best results, i.e. n = 10,000 and γ = 0.9.
In Table 3 we report AUC results for all the considered algorithms/approaches. They clearly show that results obtained with our approach outperform those obtained with all the single prediction algorithms. This confirms previous findings [15] and, in particular, that combined approaches, in general, are able to outperform single algorithms. The only algorithm which is able to produce results which are comparable to those obtained by our approach is PITA All Targets. This result is motivated by the high number of interactions in this dataset (see Table 1), which make the predictor more informed about TarBase interactions (test set). However (as we will argue later), it is not able to generate highquality biclusters, due to the large amount of false positives it predicts.
Figure 3 provides additional details on results reported in Table 3. In fact, as it can be seen, the most conservative algorithms are those for which the ROC curve provides good True Positive rate (TPr) values for small False Positive rate (FPr) values (concentrating on the bottomleft corner of the chart, see Figure 3(b)). This provides a way to refine what is suggested in [15]. In particular, one might suggest using the prediction algorithms as follows: 1) when looking for confirmatory evidence of a particular interaction, it is better to use a database with superior recall, such as TargetScan Conserved or TargetScan NonConserved (which have high TPr for low FPr). Contrary to results reported in [15], in our case, microCosm does not appear to satisfy these properties (see Figure 3(b)). 2) When identifying any possible targets for a particular microRNA to form the basis for in vitro or in vivo experiments, it would be best to consult a conservative algorithm, that is, an algorithm which returns a limited number of (possibly reliable) interactions, such as picTar 5way (see Table 1). 3) When finding in silico evidence for an interaction of a microRNA and a gene of a certain family or function, it is best to use an algorithm with a more even balance between precision and recall such as PITA All Targets and TargetScan NonConserved (which have high AUC, graphically, the area under the curve). This last conclusion is different from that drawn in [15], where the use of PITA Top Targets is suggested. A possible motivation for differences between our conclusions and those reported in [15] can be the use of a different test set (we use TarBase, while in [15] results of 15 publicly available microRNA overexpression/knockdown experiments are considered) and different evaluation measures (we use AUC instead of fixed thresholdbased precision and recall).
If we consider combined approaches, we see that, in general, they are able to reach predictive capabilities (in terms of AUC) which are comparable to the best prediction algorithms (see Figure 4). They are also able to work well for low FPr values. If we consider the specific case of our approach, it is able to outperform of a great margin all the combined approaches and all the prediction algorithms.
The results reported in Table 4 refer to the problem of identifying regulatory networks in the form of biclusters extracted by HOCCLUS2. In particular, Table 4 reports the quantitative results obtained for the first hierarchy level, the last hierarchy level and the best hierarchy level (according to p_{ BP } and p_{ MF } values), using different values of α and β. Observing the best level, it is possible to see that the proposed approach always leads to the identification of at least one level with very low p_{ BP } and p_{ MF } values, independently of the choice of the parameters of HOCCLUS2. Moreover, comparing the results with those obtained with the SA approach (which is the best among the considered competitors), it is noteworthy that our approach always lets HOCCLUS2 extract a smaller number of biclusters, grouping less miRNAs and mRNAs. This is due to the fact that our approach is able to better filter out false positives and allows HOCCLUS2 to focus only on more reliable interactions (lower FPr, for a given TPr).
Different considerations can be drawn from the analysis of results obtained by HOCCLUS2 on the single prediction algorithm which shows the best AUC value, that is, PITA All Targets. In fact, results with α = 0.2 and β = 0.5 (best configuration of HOCCLUS2) lead to p_{ MF }, p_{ BP } and μ_{ q } results which are, for the levels of the hierarchy whose results can profitably used by the expert (≥ 2), not comparable to those obtained by combined approaches (see Table 5). Indeed, the higher the level, the worse the performance in terms of all the considered evaluation measures. This is motivated by the high number of false positives of PITA All Targets which leads to a degeneration of the quality of the extracted interaction networks. It is noteworthy that this issue appears only during this evaluation, since HOCCLUS2 works on the whole set of interactions of the dataset, whereas the AUC value is computed only on the interactions that are reported in TarBase. For these reasons, in the following we will focus only on the results obtained by combined approaches.
In Table 6, we report the distribution of biclusters with p_{ BP } ≤ 0.05 over different levels of the hierarchy. From the table it is possible to see that, for all levels of the hierarchy, the only approach that is comparable to ours is SA which, however, does not reach the same number of biclusters with p_{ BP } ≤ 0.05. In particular, at lower and higher levels of the hierarchy, the difference between the results obtained in SA and our approach increases. Moreover, it is noteworthy that there is no degeneration (possibly due to the presence of false positives and false negatives) introduced in the merging phase of HOCCLUS2. This is motivated by the more reliable predictions provided by our algorithm with respect to other approaches.
Evaluation of biological consistency of extracted biclusters
In this subsection we report some examples of extracted biclusters and discuss the results of their biological analysis. Biclusters have been selected according to the statistical ranking returned by HOCCLUS2. Biological consistency of biclusters has been evaluated by considering: i) structural and functional properties of miRNAs; ii) functional clustering and pathway mapping of target genes; iii) information available from the literature supporting the functional miRNA:mRNA relationships suggested by the biclustering results. A series of web resources, such as miRBase [42] and GeneCards [47], have been used to retrieve information on gene families, gene clusters and gene functions. Set Distiller [48], from the GeneCards tool suite, and Reactome [49] have been used for functional clustering and pathway mapping of miRNA target genes, respectively. Other resources, such as STRING [50], have been used for networkbased enrichment analysis of target genes, on the basis of known and predicted protein interactions and functional relationships.
Quantitative comparison of HOCCLUS2 results on the miR1792 gene cluster family: new approach vs SA3B
In our previous work [7], HOCCLUS2 was tested by using two benchmarks, experimentally validated miRNA:mRNA interactions, i.e., miRTarBase [51], and miRNA target site predictions from mirDIP [15].
In order to prove the effectiveness of the new approach on mirDIP data, we focus on biclusters containing members of the miR1792 gene cluster family and its paralogs, miR106b25 and miR106a363. They have been chosen because of the wealth of information available from the current literature, which can be exploited to verify whether the obtained biclusters suggest biologically realistic miRNA:mRNA regulatory networks. Furthermore, different types of experimental evidence suggest that miRNAs belonging to miR1792 may perform specific functions, either individually or in combination, in a coordinated rather than in an additive manner [52]. Due to this peculiar feature, the miR1792 gene cluster family is, among all the possible candidates, the best for proving the ability of HOCCLUS2 to discover miRNA contextspecific regulatory modules at different granularity levels, according to the hierarchy of biclusters.
Looking at Tables 3 and 4, it would seem to be natural to compare our results with those of SA, since it shows a good AUC value as well as good p_{ BP } and p_{ MF } values. However, a preliminary qualitative analysis of the extracted biclusters revealed that significant biclusters (in terms of p_{ BP } and p_{ MF }) appear mainly at high levels in the hierarchy. Analyzing such biclusters with Reactome and STRING, we notice that they do not show the expected biological consistency. Furthermore, they group too many miRNAs (also belonging to different families) in the same bicluster. Although in principle this is a coherent behavior, such situation does not allow the researchers to distinguish between specific and general interactions at different granularity levels, which is the main goal of the task of discovering interaction networks organized in a hierarchy. This behavior is mainly due to the fact that SA averages the scores of all the algorithms, including also unreliable predictions. The consequence is that this algorithm tends to "flatten" the score of all the interactions and, consequently, to affect the possibility that HOCCLUS2 focuses only on reliable interactions. For these reasons, we compare our results with those obtained with the SA3B setting that, although generally showing worse results in terms of AUC, p_{ BP } and p_{ MF }, allowed us (also in the experiments conducted in our previous work [7]) to perform an analysis starting from the lowest (most specific) levels of the hierarchy.
Hereafter, the whole set of biclusters obtained on the basis of SA3B setting and with the new approach will be referred to as mirDIPA and mirDIPB, respectively. Comparison takes into account the number of biclusters, their biological statistical significance (p_{ BP } value) and cohesiveness values (μ_{ q }). In particular, we focus on biclusters extracted by HOCCLUS2 with α = 0.2 and β = 0.5.
As shown in Table 7, the results obtained in the two experiments are significantly different. In particular, they show a considerable improvement of the system performance with the new approach with respect to the SA3B setting. Indeed, among mirDIPA biclusters, the total number of biclusters containing miR17 is 13 and, among them, only one (about 8% of the total) has p_{ BP } ≤ 0.05 (i.e., bicluster 511512 for which p_{ BP } = 9.85 E  5). In mirDIPB, the total number of biclusters including miR17 is 26 and, among them, 16 (more than 60%) have a significant p_{ BP } value. This result is even more surprising if we consider the total number of biclusters obtained by the two experiments at all levels of the hierarchy. Indeed, despite the doubling of biclusters containing miR17, from mirDIPA to mirDIPB, the total number of mirDIPB biclusters is smaller (996) than that of mirDIPA biclusters (1192), with a size decrease in mirDIPB of about 27%. This is due to better precision and recall capabilities provided by the new algorithm, which lead to an improvement of HOCCLUS2's sensitivity in detecting, among all those possible, the miRNA:mRNA biclusters which are more functionally related.
From an overall evaluation of cohesiveness values, we can see that they are generally higher in mirDIPB biclusters. This result, combined with lower p_{ BP } and p_{ MF } values, indicates a higher biological consistency of biclusters extracted by HOCCLUS2 when exploiting interactions identified by the new algorithm.
Finally, the smaller dimension of biclusters in mirDIPB and the balanced distribution of significant biclusters among the different levels of the HOCCLUS2 hierarchy allow us to interpret better the results. In particular, this provides the necessary information to detect alternative cotargeting of miRNAs on different and potentially coregulated groups of target genes.
Biological evaluation of miR1792 biclusters in mirDIPA and mirDIPB
In the previous experiments (reported in [7]), we identified a series of highlyranked biclusters extracted from miRTarBase, containing the members of the miR1792 gene cluster family (see Table 8 in [7]). We also extensively discussed miRNA functions and multiple associations that might be consistent with functions and mechanisms of miR1792 reported in the literature. We were also able to demonstrate how the functional associations suggested by the analysis of HOCCLUS2 provide new clues on potential cooperative interactions of some members of miR1792 with other miRNAs that could be, in turn, the determining factors for a contextspecific activity of miR1792.
In spite of a good result obtained from miRTarBase, the biological evaluation of miR1792 biclusters extracted from mirDIP with the SA3B setting was quite disappointing. Indeed, as shown in Table 7, mirDIPA contains only one bicluster including miR17 with p_{ BP } ≤ 0.05. This bicluster (i.e., 511512) groups six different members of the miR1792 gene cluster family (i.e., miR17, miR93, miR20a and b, and miR106a and b) that potentially cotarget 116 different genes. Although the analysis of this bicluster with Reactome (see Additional File 1) does not provide a mapping for 68 out of the 116 genes, their overrepresentation analysis proves to be consistent with many of the known functions of miR1792 [52]. However, although significantly better than results obtained in the preliminary analysis of SA, they still return a picture that is too general, because of the high number of target genes included in the bicluster. On the other hand, the unavailability of enough biclusters with a statistical functional significance at different levels of the hierarchy affects the possibility of detecting alternative contributions of each member of the family on specific events or pathways i) in different combinations with other members of miR1792 in the same bicluster and ii) with other members not included in the bicluster.
The analysis of miR1792 biclusters in mirDIPB shows how the new approach helps to overcome these limitations. Indeed, as reported in the previous subsection, the approach presented in this paper has allowed HOCCLUS2 to identify many biclusters with a significant p_{ BP } value. The functional analysis of these biclusters demonstrates that they group together functionally related miRNAs and target genes. A significant example, among many that could be reported, is represented by bicluster 379, that is one of the topranked biclusters at level 1 of the hierarchy (see Table 7). This bicluster shows a significant enrichment in the TGFβ/BMP pathway, which regulates embryonic and adult cell proliferation and differentiation, and that is a wellknown target of miR1792. Bicluster 379 groups together TGFBR2, BMPR2, SMAD and PTEN as targets of miR17, miR19 and miR20a, which are members of the miR1792 gene cluster. BMPR2 and TGFBR2 are key factors for the activation of TGFβ/BMP receptor complexes and for the transduction of the signal from the cell surface to the cytosol. SMAD4 is essential for the transduction of the signal to the nucleus and the transcriptional activation of a series of effectors. PTEN is another key component of the TGFβ signaling cascade and, like other genes in this bicluster, it is a validated target of miR1792 [53]. This bicluster is particularly interesting because it mimes bicluster 66, obtained in our experiment on miRTarBase data, as reported in [7]. This result is a good indicator of the higher functional cohesiveness that is obtained by the use of the new algorithm on miRNA target site predictions.
Moreover, at level 2 of the hierarchy, bicluster 379 is merged with bicluster 405, which groups together miR17 and miR20a (belonging to the miR1792 gene cluster), with miR20b (belonging to the miR160a363 gene cluster). As shown in Table 7, the p_{ BP } value of bilcuster 405 is not significant. Indeed, its target genes (i.e., BCL2, CRTC3, MUC17, VEGFA, WDFY2, C6ORF151, KIAA1462) do not show any evident functional relationship. However, they appear functionally related after merging them with genes of bicluster 379. Indeed, analyzing bicluster 379405 with STRING, we have found that BCL2, CRTC3, VGFA and WDFY2 are included in the interaction network of all the genes in bicluster 379 (see Additional File 2). This result shows a potential cooperation of miR106a363 with miR1792 in mediating specific events, functionally related to the general control of miR1792 on the TGFβ signaling pathway, that could be context or tissuespecific. A further confirmation of this observation comes from the analysis of genes excluded by the interaction network in the STRING analysis, i.e. MUC17, C6ORF151 and KIAA1462. In particular, MUC17 is a membrane mucin that probably plays a role in maintaining homeostasis on mucosal surfaces and that is mainly expressed in the digestive tract. It may conduct signals in response to external stimuli that lead to cellular responses, including proliferation, differentiation, apoptosis or secretion of cellular products, such as other membranebound mucin members [54]. According to [54], this gene is a validated target of miR17, miR20a, miR20b. As for C6ORF151 and KIAA1462, the only information that can be retrieved is that C6ORF151 is a nuclear ribonucleoprotein and that KIAA1462 is a junctional protein associated with coronary artery disease. Although these two last genes do not appear to be directly related to the others in the bicluster, we cannot exclude that their potential functional relationships are not detected because of the still poor availability of functional data or of missing annotations in the main web resources. Indeed, as demonstrated for MUC17, neither Reactome nor STRING analysis have been able to detect its functional relationship with other genes in bicluster 319405. Finally, it is important to underline that MUC17 was not associated with miR1792 in the previous analysis [7] of miRTarBase.
Another interesting example that we can provide is represented by bicluster 415. It is another topranked bicluster at level 1 of the hierarchy which, similarly to bicluster 379, mimes a bicluster obtained by the previous analysis [7] on miRTarBase, i.e. bicluster 72. Bicluster 415 is highly enriched in genes specifically involved in the cell cycle. Namely, it includes six (i.e., E2F3, RB1, RBL2, CCND2, WEE1, CCND1) out of 13 genes in the mitotic G1G1/S phases as specific targets of miR17, miR20a and miR106b. In addition, bicluster 415 includes three genes that Reactome does not annotate, that are BECN1, C20ORF82 (i.e., p300 or KAT3B) and FAIM2. Similarly to bicluster 379405, the overrepresentation analysis of Reactome significantly maps only the genes involved in the cell cycle (Additional File 3), whereas STRING is able to find functional relationships among 10 out of the 13 genes included in the bicluster (see Additional File 4).
Many other significant examples could be reported, but the discussion of all the biological implications that they highlight would require too much space in the context of the present paper. Just as a last example, other interesting observations arise in the analysis of miR1792 biclusters at higher levels of the hierarchy in mirDIPB. Indeed, the functional analysis of biclusters at levels 4, 5 and 6 of the identified hierarchy, that observed a degeneration in mirDIPA, has surprisingly shown in mirDIPB a good distribution and a statistical overrepresentation in pathways that are perfectly consistent with miR1792 known biological functions. What is of more interest is that these biclusters group together miR1792 gene cluster members with those belonging to another important miRNA gene cluster, i.e. miR520. This finding is functionally related to the role of miR1792 in the cell cycle, development and differentiation. Indeed, the functional interrelationship between miR1792 and miR520 has been experimentally demonstrated in a study for investigating the molecular mechanisms responsible for the simultaneous maintenance of human embryonic stem (hES) cells, their selfrenewal properties and undifferentiated state [55]. The elucidation of the coordinated activity of miR1792 and miR520 miRNAs, as well as of the regulatory networks that they are able to establish with their target genes, can largely contribute to i) the understanding of the physiology of hES cells development and differentiation and to ii) the exploitation of their potential as best candidate resources for both cell replacement therapy and development research. The association of miR1792 with miR520 was not detected either in mirDIPA or in biclusters extracted from miRTarBase.
The conclusions that arise from the reported analysis clearly show the effectiveness of the proposed approach in improving the performance of HOCCLUS2 on mirDIP data under many aspects. In particular, it gives HOCCLUS2 the ability to extract biologically realistic biclusters, which appear more related at different levels of the hierarchy and, more importantly, which represent consistent functional interactions not detected on experimental data. This last aspect demonstrates that, in general, the use of largescale prediction data of miRNAs target sites can reveal functional connections otherwise impossible to detect from experimental data that are usually contextspecific and, hence, lack a comprehensive view of the system.
Conclusions
In this work we have investigated the possibility to improve the reliability of miRNA:miRNA predicted interactions. In particular, we have proposed the application of a machine learning technique, in order to learn to combine the outputs of several prediction algorithms. Since the domain in hand is characterized by the availability of a small number of labeled examples and a very large number of unlabeled examples, the proposed approach relies on a semisupervised algorithm, which exploits information conveyed by both positive/labeled and unlabeled examples. Moreover, the unbalancing between the number of labeled and unlabeled examples is tackled by adopting an ensemble learning approach.
The effectiveness of the proposed approach has been evaluated according to many criteria. First, the predictive performance of the proposed approach on an independent set of experimentally validated interactions is higher than that obtained by single prediction algorithms and by other baseline combination strategies. Second, HOCCLUS2 has been applied to different datasets of predicted interactions, according to different combination strategies. Results prove that the proposed approach is able to better filter out false positives and allows HOCCLUS2 to focus on only reliable interactions. This leads to the identification of more precise and significant interaction networks. Finally, an in depth biological analysis of some examples of extracted biclusters has been performed. This analysis shows how the proposed approach leads to the discovery of a hierarchy with a balanced distribution of significant biclusters among different levels, which, in general, improves the possibility to interpret results from a biological viewpoint. Moreover, we focused on biclusters that group together members of the miR1792 gene cluster family. In this case, we have observed that the functional analysis of biclusters at higher levels of the hierarchy, that appears highly degenerated with the other combination strategies, surprisingly shows a good distribution and a statistical overrepresentation in pathways that are perfectly consistent with the known miR1792 biological functions. Above all, HOCCLUS2 was able to group together, at high levels of the hierarchy, members of the miR1792 gene cluster with those belonging to the miR520 gene cluster. Its relation with mir1792 has been experimentally proved and was not identified either from experimentally validated interactions or from predicted interactions originating from other combination strategies.
These results prove that the contribution of the proposed approach is, in general, fundamental in the computational discovery of reliable miRNA:mRNA interactions. In particular, it is essential for the extraction of biological realistic networks of interactions between miRNAs and their target genes from prediction data. This last aspect opens up the possibility to expand the application of HOCCLUS2 on a "genomescale" dimension for a comprehensive reconstruction of all the possible multiple interactions established by miRNAs to regulate the expression of gene networks, which are otherwise impossible to identify when only experimentally validated interactions are considered.
For future work, we intend to investigate the possibility of integrating lowlevel features in the learning phase, with the aim of improving the predictive capabilities of the proposed approach.
Availability of supporting data
Project Home Page: http://www.di.uniba.it/~ceci/micFiles/systems/semisupervised_HOCCLUS2/index.html
Available resources: The proposed system, all the datasets and all the obtained results.
Abbreviations
 5' UTR:

5' Untranslated Region
 AUC:

Area Under the ROC Curve
 BP:

Biological Process
 CDS:

Coding Sequence
 CLIPSeq:

CrossLinking ImmunoprecipitationHighThroughput Sequencing
 FPr:

False Positive rate
 hES cells:

Human Embryonic Stem Cells
 MF:

Molecular Function
 miRNA:

microRNA
 mRNA:

messenger RNA
 PARE:

Parallel Analysis of RNA ends
 RISC:

RNAiInduced Silencing Complex
 SA:

Score Averaging
 SA3B:

Score averaging  Three Best
 SVM:

Support Vector Machine
 TPr:

True Positive rate
 WSA3B:

Weighted score averaging  Three Best.
References
 1.
Lee RC, Feinbaum RL, Ambros V: The C. elegans heterochronic gene lin4 encodes small RNAs with antisense complementarity to lin14. Cell. 1993, 75 (5): 843854. 10.1016/00928674(93)90529Y.
 2.
Huang Y, Shen X, Zou Q, Wang S, Tang S, Zhang G: Biological functions of microRNAs: a review. J Physiol Biochem. 2011, 67: 129139. 10.1007/s1310501000506.
 3.
Roth C, Rack B, Müller V, Janni W, Pantel K, Schwarzenbach H: Circulating microRNAs as bloodbased markers for patients with primary and metastatic breast cancer. Breast Cancer Res. 2010, 12: 18.
 4.
Jacek Krol WF, Inga Loedige: The widespread regulation of microRNA biogenesis, function and decay. Nature Reviews Genetics. 2010, 11 (9): 597610.
 5.
Piriyapongsa J, Bootchai C, Ngamphiw C, Tongsima S: microPIR: An Integrated Database of MicroRNA Target Sites within Human Promoter Sequences. PLoS ONE. 2012, 7: 3388810.1371/journal.pone.0033888.
 6.
Zhang SH, Li Q, Liu J, Zhou XJ: A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNAgene regulatory modules. Bioinformatics [ISMB/ECCB]. 2011, 27 (13): 401409. 10.1093/bioinformatics/btr206.
 7.
Pio G, Ceci M, D'Elia D, Loglisci C, Malerba D: A Novel Biclustering Algorithm for the Discovery of Meaningful Biological Correlations between microRNAs and their Target Genes. BMC Bioinformatics. 2013, 14 (Suppl 7): 810.1186/1471210514S7S8.
 8.
Krüger J, Rehmsmeier M: RNAhybrid: microRNA target prediction easy, fast and flexible. Nucl Acids Res. 2006, 34 (WebServer): 451454.
 9.
Enright A, John B, Gaul U, Tuschl T, Sander C, Marks D: MicroRNA targets in Drosophila. Genome Biol. 2003, 5: R110.1186/gb200351r1.
 10.
Lewis B, Burge C, Bartel D: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005, 120: 1520. 10.1016/j.cell.2004.12.035.
 11.
Maragkakis M, Alexiou P, Papadopoulos G, Reczko M, Dalamagas T, Giannopoulos G, Goumas G, Koukis E, Kourtis K, Simossis V, Sethupathy P, Vergoulis T, Koziris N, Sellis T, Tsanakas P, Hatzigeorgiou A: Accurate microRNA target prediction correlates with protein repression levels. BMC Bioinformatics. 2009, 10: 110. 10.1186/14712105101.
 12.
Grün D, Wang YL, Langenberger D, Gunsalus KC, Rajewsky N: microRNA Target Predictions across Seven Drosophila Species and Comparison to Mammalian Targets. PLoS Comput Biol. 2005, 1: e1310.1371/journal.pcbi.0010013.
 13.
Ritchie W, Flamant S, Rasko JE: Predicting microRNA targets and functions: traps for the unwary. Nat Methods. 2009, 6 (6): 397398. 10.1038/nmeth0609397.
 14.
Witkos TM, Koscianska E, Krzyzosiak WJ: Practical Aspects of microRNA Target Prediction. Curr Mol Med. 2011, 11 (2): 93109. 10.2174/156652411794859250.
 15.
Shirdel EA, Xie W, Mak TW, Jurisica I: NAViGaTing the Micronome  Using Multiple MicroRNA Prediction Databases to Identify Signalling PathwayAssociated MicroRNAs. PLoS ONE. 2011, 6 (2): e1742910.1371/journal.pone.0017429.
 16.
Zhang Y, Verbeek FJ: Comparison and Integration of target prediction algorithms for microRNA studies. J Integr Bioinform. 2010, 7 (3): 127
 17.
Pio G, Ceci M, Loglisci C, Malerba D, D'Elia D: The integration of microRNA target data by biclustering techniques opens new roads for signaling networks analysis. EMBnet journal. 2012, 18 (B): 142144. 10.14806/ej.18.B.582.
 18.
Draper NR, Smith H: Applied Regression Analysis (Wiley Series in Probability and Statistics). 1998, WileyInterscience, 3
 19.
Pio G, Ceci M, Loglisci C, D'Elia D, Malerba D: Hierarchical and Overlapping CoClustering of mRNA: miRNA Interactions. ECAI, Frontiers in Artificial Intelligence and Applications. 2012, IOS Press, 242: 654659.
 20.
Malerba D, Ceci M, Appice A: A relational approach to probabilistic classification in a transductive setting. Eng Appl Artif Intell. 2009, 22: 109116. 10.1016/j.engappai.2008.04.005.
 21.
Elkan C, Noto K: Learning classifiers from only positive and unlabeled data. Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 2008, KDD '08, New York, NY, USA: ACM, 213220.
 22.
Wang C, Ding CHQ, Meraz RF, Holbrook SR: PSoL: a positive sample only learning algorithm for finding noncoding RNA genes. Bioinformatics. 2006, 22 (21): 25902596. 10.1093/bioinformatics/btl441.
 23.
Wu F, Weld DS: Autonomously semantifying wikipedia. Proceedings of the sixteenth ACM conference on Conference on information and knowledge management. 2007, CIKM '07, New York, NY, USA: ACM, 4150.
 24.
Yu H: SingleClass Classification with Mapping Convergence. Mach Learn. 2005, 61 (13): 4969. 10.1007/s1099400511227.
 25.
Yu H, Han J, Chang KCC: PEBL: Web Page Classification without Negative Examples. IEEE Trans on Knowl and Data Eng. 2004, 16: 7081. 10.1109/TKDE.2004.1264823.
 26.
Fung GPC, Yu JX, Lu H, Yu PS: Text Classification without Negative Examples Revisit. IEEE Trans on Knowl and Data Eng. 2006, 18: 620.
 27.
Lee WS, Liu B: Learning with Positive and Unlabeled Examples Using Weighted Logistic Regression. Proc of the Twentieth International Conference on Machine Learning (ICML 2003). 2003, AAAI Press, 448455.
 28.
Liu Z, Shi W, Li D, Qin Q: Partially supervised classification: based on weighted unlabeled samples support vector machine. Proceedings of the First international conference on Advanced Data Mining and Applications. 2005, ADMA'05, Berlin, Heidelberg: SpringerVerlag, 118129.
 29.
Lin S, Ding J: Integration of ranked lists via cross entropy Monte Carlo with applications to mRNA and microRNA studies. Biometrics. 2009, 65: 918. 10.1111/j.15410420.2008.01044.x.
 30.
Yang JH, Li JH, Shao P, Zhou H, Chen YQ, Qu LH: starBase: a database for exploring microRNAmRNA interaction maps from Argonaute CLIPSeq and DegradomeSeq data. Nucl Acids Res. 2011, 39 (suppl 1): D202D209.
 31.
Yousef M, Jung S, Kossenkov AV, Showe LC, Showe MK: Naïve Bayes for microRNA target predictionsmachine learning for microRNA targets. Bioinformatics. 2007, 23 (22): 29872992. 10.1093/bioinformatics/btm484.
 32.
Gamazon ER, Im HK, Duan S, Lussier YA, Cox NJ, Dolan ME, Zhang W: ExprTarget: An Integrative Approach to Predicting Human MicroRNA Targets. PLoS ONE. 2010, 5 (10): e13534+10.1371/journal.pone.0013534.
 33.
Sethupathy P, Corda B, Hatzigeorgiou AG: TarBase: A comprehensive database of experimentally supported animal microRNA targets. RNA (New York, NY). 2006, 12 (2): 192197.
 34.
The International HapMap Consortium: The International HapMap Project. Nature. 2003, 426 (6968): 789796. 10.1038/nature02168.
 35.
Chang CC, Lin CJ: LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol. 2011, 2 (3): 27:127:27.
 36.
Platt JC: Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers. 1999, 6174.
 37.
Vapnik VN: Statistical learning theory. 1998, Wiley, 1
 38.
Ward G, Hastie T, Barry S, Elith J, Leathwick JR: Presenceonly data and the em algorithm. Biometrics. 2009, 65 (2): 554563. 10.1111/j.15410420.2008.01116.x.
 39.
Efron B, Tibshirani R: An Introduction to the Bootstrap. 1994, London: Chapman and Hall
 40.
Breiman L: Bagging Predictors. Mach Learn. 1996, 24 (2): 123140.
 41.
Hsu SD, Lin FM, Wu WY, Liang C, Huang WC, Chan WL, Tsai WT, Chen GZ, Lee CJ, Chiu CM, Chien CH, Wu MC, Huang CY, Tsou AP, Huang HD: miRTarBase: a database curates experimentally validated microRNAtarget interactions. Nucl Acids Res. 2011, 39: 163169. 10.1093/nar/gkq1107.
 42.
Kozomara A, GriffithsJones S: miRBase: integrating microRNA annotation and deepsequencing data. Nucl Acids Res. 2011, 39: D152D157. 10.1093/nar/gkq1027.
 43.
Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E: The role of site accessibility in microRNA target recognition. Nat Genet. 2007, 39 (10): 12781284. 10.1038/ng2135.
 44.
Miranda KC, Huynh T, Tay Y, Ang YS, Tam WL, Thomson AM, Lim B, Rigoutsos I: A PatternBased Method for the Identification of MicroRNA Binding Sites and Their Corresponding Heteroduplexes. Cell. 2006, 126 (6): 12031217. 10.1016/j.cell.2006.07.031.
 45.
Provost F, Fawcett T: Robust Classification for Imprecise Environments. Mach Learn. 2001, 42 (3): 203231. 10.1023/A:1007601015854.
 46.
Pesquita C, Faria D, Bastos HP, Ferreira AEN, Falc˜ao AO, Couto FM: Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics. 2008, 9 (S5): 4
 47.
Safran M, Dalah I, Alexander J, Rosen N, Iny Stein T, Shmoish M, Nativ N, Bahir I, Doniger T, Krug H, SirotaMadi A, Olender T, Golan Y, Stelzer G, Harel A, Lancet D: GeneCards Version 3: the human gene integrator. Database (Oxford). 2010, baq020
 48.
Stelzer G, Inger A, Olender T, InyStein T, Dalah I, Harel A, Safran M, Lancet D: GeneDecks paralog hunting and geneset distillation with GeneCards annotation. OMICS. 2009, 13 (6): 477487. 10.1089/omi.2009.0069.
 49.
Haw R, Hermjakob H, D'Eustachio P, Stein L: Reactome pathway analysis to enrich biological discovery in proteomics data sets. Proteomics. 2011, 11 (18): 35983613. 10.1002/pmic.201100066.
 50.
Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, Mering Cv: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucl Acids Res. 2010, 39: D561D568.
 51.
Hsu CW, Juan HF, Huang HC: Characterization of microRNAregulated proteinprotein interaction network. Proteomics. 2008, 8 (10): 19751979. 10.1002/pmic.200701004.
 52.
Olive V, Jiang I, He L: mir1792, a cluster of miRNAs in the midst of the cancer network. Int J Biochem Cell Biol. 2010, 42 (8): 13481354. 10.1016/j.biocel.2010.03.004.
 53.
Ventura A, Young AG, Winslow MM, Lintault L, Meissner A, Erkeland SJ, Newman J, Bronson RT, Crowley D, Stone JR, Jaenisch R, Sharp PA, Jacks T: Targeted deletion reveals essential and overlapping functions of the miR17 through 92 family of miRNA clusters. Cell. 2008, 132 (5): 875886. 10.1016/j.cell.2008.02.019.
 54.
Kitamoto S, Yamada N, Yokoyama S, Houjou I, Higashi M, Goto M, Batra S, Yonezawa S: DNA methylation and histone H3K9 modifications contribute to MUC17 expression. Glycobiology. 2011, 21 (2): 247256. 10.1093/glycob/cwq155.
 55.
Ren J, Jin P, Wang E, Marincola F, Stroncek D: MicroRNA and gene expression patterns in the differentiation of human embryonic stem cells. J Transl Med. 2009, 7: 2010.1186/14795876720.
Acknowledgements
We would like to acknowledge the support of the European Commission through the project MAESTRA  Learning from Massive, Incompletely annotated, and Structured Data (Grant number ICT2013612944). This work was also funded by the project "PON01 02589  MicroMap" project "Caratterizzazione su larga scala del profilo metatrascrittomico e metagenomico di campioni animali in diverse condizioni fisiopatologiche". The authors thank Lynn Rudd for reading through the paper.
Declarations
Publication of this article was supported by the project "MBLab: The Molecular Biodiversity LABoratory Initiative" (MIUR DM 19410).
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 1, 2014: Integrated BioSearch: Selected Works from the 12th International Workshop on Network Tools and Applications in Biology (NETTAB 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S1.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
MC and GP contributed to the definition of the method. DD contributed to the conception of the biological investigation. GP and MC contributed to the software design. GP and DD took care of the review and selection of bioinformatic resources. GP implemented the system and ran the experiments. DD performed the biological analysis and validation of the results. GP and MC performed the analysis of the results, from the computer science point of view. MC, GP and DD contributed to the manuscript drafting. MC, GP, DD and DM contributed to the manuscript finalization. DM and MC supervised the study. All the authors read and approved the final manuscript.
Electronic supplementary material
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Pio, G., Malerba, D., D'Elia, D. et al. Integrating microRNA target predictions for the discovery of gene regulatory networks: a semisupervised ensemble learning approach. BMC Bioinformatics 15, S4 (2014). https://doi.org/10.1186/1471210515S1S4
Published:
Keywords
 Prediction Algorithm
 Collinearity Problem
 mRNA Interaction
 Target Prediction Algorithm
 mRNA Pair