A negative selection heuristic to predict new transcriptional targets

Background: Supervised machine learning approaches have recently been adopted for the inference of transcriptional targets from high-throughput transcriptomic and proteomic data, showing major improvements with respect to the state of the art of reverse gene regulatory network methods. Unlike traditional unsupervised techniques, a supervised classifier learns, from known examples, a function that is able to recognize new relationships in new data. In the context of gene regulatory inference a supervised classifier is forced to learn from positive and unlabeled examples, as negative examples are unavailable or hard to collect. Such a condition can limit the performance of the classifier, especially when the number of training examples is low.

Results: In this paper we improve the supervised identification of transcriptional targets by selecting reliable negative examples from the unlabeled set. We introduce a heuristic based on the known topology of transcriptional networks that restores the conventional positive/negative training condition and shows a significant improvement in classification performance. We empirically evaluate the proposed heuristic with the experimental datasets of Escherichia coli and show an example of application in the prediction of BCL6 direct core targets in normal germinal center human B cells, obtaining a precision of 60%.

Conclusions: The availability of only positive examples in learning transcriptional relationships negatively affects the performance of supervised classifiers. We show that the selection of reliable negative examples, a practice adopted in text mining approaches, improves the performance of such classifiers, opening new perspectives in the identification of new transcriptional targets.


Background
An important challenge of computational biology is the reconstruction of large biological networks from high throughput genomic and proteomic data. Biological networks are used to represent and model molecular interactions between biological entities, such as genes and proteins in a given biological context.
In this paper we focus on the identification of new transcriptional targets, i.e. coding DNA regions directly regulated by transcription factors. Transcription factors are proteins, coded by specific genes, that, alone or with other proteins in a complex, bind the cis-regulatory regions of their targets and control the target transcriptional activity by promoting or blocking the recruitment of RNA polymerase.
In identifying the interactions between transcription factors and genes from experimental data, two broad classes of computational methods can be distinguished in the literature [1,2]: those that rely on the physical interaction between molecules (gene-to-sequence interaction), which relate transcription factors to sequence motifs found in promoter regions; and algorithms based on the influence interaction, which try to relate the expression of a gene to the expression of the other genes in the cell (gene-to-gene interaction). Most of the approaches of the second class are basically unsupervised and model the reconstruction of transcriptional relationships as a classification problem, where the basic decision is the presence or absence of a relationship between a given pair of genes [3][4][5][6]. These methods can be grouped into: i) gene relevance network models, which detect gene-gene interactions with a similarity measure and a threshold, such as ARACNE [7], TimeDelay-ARACNE [8], and CLR [9], which infer the network structure with a statistical score derived from the mutual information and a set of pruning heuristics; ii) Boolean network models, which adopt a binary variable to represent the state of a gene activity and a directed graph, where edges are represented by Boolean functions (e.g. REVEAL [10]); iii) differential and difference equation models, which describe gene expression changes as a function of the expression level of other genes with a set of ordinary differential equations (ODE) [11]; and iv) Bayesian models, or more generally graphical models, which adopt Bayes rules and consider gene expressions as random variables [12].
The experimental validation of predicted transcriptional regulations is performed with ChIP-on-chip [13], a technique used to investigate interactions between proteins and DNA in vivo by combining chromatin immuno-precipitation (ChIP) with microarray technology (chip). Specifically, it allows the identification of the cistrome, sum of binding sites, for DNA-binding proteins on a genome-wide basis. Whole-genome analysis can be performed to determine the locations of binding sites for almost any protein of interest, in particular transcription factors. The goal of ChIP-on-chip is to localize protein binding sites that may help identify functional elements in the genome. For example, in the case of a transcription factor as a protein of interest, one can determine its transcription factor binding sites throughout the genome.
A recent trend in computational biology aims to reconstruct large biological networks with supervised approaches [5,6,14]. Supervised methods require a training set, which in our context means a set of transcriptional targets for which it is known in advance that they are regulated by a transcription factor. Training targets are used to estimate a function that is able to discriminate whether a new transcriptional interaction exists. The machine learning literature has proposed several supervised algorithms: neural networks, decision trees, logistic models, and Support Vector Machines (SVM) [15]. Among these, SVMs have given promising results in the reconstruction of biological networks [16][17][18]. For example, SIRENE adopted an SVM classifier to reconstruct the regulatory network of Escherichia coli, and obtained more accurate results than unsupervised methods based on mutual information (e.g. CLR and ARACNE) [16]. Compared to unsupervised methods, supervised methods are potentially more accurate, but they need an initial set of known regulatory connections. This is in principle not a restriction, as many regulations are progressively discovered and shared among researchers through public regulatory databases. Some examples are: RegulonDB (http://regulondb.ccg.unam.mx), KEGG (http://www.genome.jp/kegg/), TRRD (http://www.mgs.bionet.nsc.ru/mgs/gnw), Transfac (http://www.generegulation.com), IPA (http://www.ingenuity.com).
In general a supervised binary classifier needs both positive and negative examples to learn effectively. In the context of gene regulatory networks this condition is not satisfied, as negative examples are not available or may be hard to collect. In functional genomics, information about negative examples is in fact not available, as the input is usually a finite list of genes known to have a given function or to be associated with a given disease, and the goal is to identify new genes sharing the same property. Thus, from a machine learning perspective, the supervised inference of new transcriptional targets falls into a class of semi-supervised learning problems that consists of learning from positive and unlabeled data. The training set is composed only of known positive examples (positive set) and the goal is to predict the unknown positive examples in the unlabeled set.
Learning from only positive and unlabeled data is a hot topic in the literature of data mining, where three main families of approaches can be distinguished [19]: i) methods that reduce the problem to the traditional two-class learning by selecting reliable negative examples from the unlabeled set [20][21][22][23][24][25]; ii) methods that do not need labeled negative examples and basically adjust the probability of being positive estimated by a traditional classifier trained with positive and unlabeled examples [14,26]; and iii) methods that treat the unlabeled set as noisy negative examples [27].
In this paper we focus on the first class of approaches, which rely on an initial selection of negative examples. The main problem is that some of the selected negative examples could in fact be positives embedded in the unlabeled data, reducing the prediction capability of a binary classifier. We empirically evaluate this effect by simulating the positive contamination inside the negative training set, showing that the performance of the classifier improves when the positive contamination is low. Such a result calls for an approach that is able to generate a sufficiently large negative training set without positive contamination.
We propose NOIT (NOn Indirect Targets), a method to select reliable negative training examples by exploiting the known gene regulatory network topology in the specific context of predicting new transcriptional targets. The method is an extension, to a specific context, of approaches recently published in [28] and [29], where the selection of reliable negatives benefits from the over-representation, in the currently known gene regulatory networks, of typical network motifs [30]. We introduce a new heuristic that still exploits the known regulatory network topology, but not in terms of network motifs, because in the specific context of transcriptional target prediction the relationships between transcription factors and their targets do not exhibit significant network patterns. The NOIT method gives less importance to indirect targets, i.e. targets of a transcription factor regulated indirectly through other transcription factors. The idea is based on the observation that genes controlled directly by a transcription factor, or indirectly through other transcription factors, are likely to belong to the same family of functions, thus representing unreliable negative candidates. This is supported by the fact that transcription factors evolved in the service of specific biological functions and are usually classified according to their regulatory function [31] and sequence similarity [32,33]. Moreover, the activity of downstream targets is usually modulated by regulatory circuits involving small groups of transcription factors organized in typical network motifs.
We compare NOIT with other negative selection approaches known in the literature. For this purpose we adopt the dataset of Escherichia coli, where almost all transcriptional regulations are known and a large amount of experimental data is available for benchmarking (e.g. Faith et al. [34]). Furthermore, we provide an example of application in the prediction of BCL6 direct targets in normal germinal center human B cells by adopting the results of Basso et al. [13], showing that NOIT predicts 29 correct targets in the top 50 ranked genes, outperforming other supervised and unsupervised methods, which predict fewer than 10 correct targets.

The paper is organized as follows. The next section (Methods) introduces the NOIT heuristic, overviews the literature methods that are based on a reliable negative selection procedure, and describes the empirical procedures aimed at evaluating the performance of the negative selection heuristics. The Results section reports and discusses the outcomes of the study, and the last section concludes the paper, outlining directions for future work.

Problem formulation
In a binary classification problem, given a set of training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \in X \times \{+1, -1\}$, the goal is to determine a function $f(x): X \to \{-1, +1\}$ that is able to predict the label $y \in \{+1, -1\}$ of a new observation $x \in X$.

We propose NOIT (NOn Indirect Targets), a negative selection heuristic that exploits the known regulatory network topology by giving less importance to indirect targets. It is formalized as follows. Let $G$ be the set of all genes in an organism and $TF \subset G$ the set of transcription factors. Given a transcription factor $tf_i \in TF$, the goal is to infer a function, $f_{tf_i}(\phi(g)): G \to \{-1, +1\}$, from a set of target genes, $P_{tf_i} = \{(g_1, +1), (g_2, +1), \ldots, (g_n, +1)\} \subset (G \setminus TF)$, that are known to be regulated directly by $tf_i$ (i.e. positive examples). The function should be able to predict the label $y$ of a new gene $g \in U_{tf_i} = G \setminus (TF \cup P_{tf_i})$ from the unlabeled set. The transformation function $\phi$ describes each gene with an $m$-dimensional real-valued feature vector, $\phi(g): G \to \mathbb{R}^m$, such as expression values measured in $m$ different experimental conditions.
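The set definitions above can be illustrated with a minimal sketch. The gene and transcription factor names are hypothetical placeholders, not data from the study; the only purpose is to show how the unlabeled set $U_{tf_i} = G \setminus (TF \cup P_{tf_i})$ is derived from the other sets.

```python
def unlabeled_set(genes, tfs, positives):
    """Return U = G minus (TF union P): genes that are neither
    transcription factors nor known direct targets."""
    return set(genes) - (set(tfs) | set(positives))

# Hypothetical toy organism (names are illustrative only).
G = {"tfA", "tfB", "g1", "g2", "g3", "g4"}
TF = {"tfA", "tfB"}
P_tfA = {"g1", "g2"}   # known direct targets of tfA (positive examples)

U_tfA = unlabeled_set(G, TF, P_tfA)
print(sorted(U_tfA))   # ['g3', 'g4']
```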
The goal of a negative selection heuristic is to select from the unlabeled set $U_{tf_i}$ a sufficiently large negative training set without positive contamination. Our aim is to propose a method based on the assumption that an unlabeled gene $g \in U_{tf_i}$ is a bad negative candidate if it is indirectly controlled by $tf_i$ through other transcription factors. Such information can be extracted from the known gene regulatory network or, when such information is not available, it can be estimated with binding site promoter analysis [32] and/or unsupervised gene regulatory prediction [7,9].
We introduce a probability mass function $pmf_{tf_i}(g)$ over negative candidates to estimate the probability that an example $g \in U_{tf_i}$ is a good negative candidate. We compute $pmf_{tf_i}(g)$ as:

where $k \in [1, |TF|]$ is the minimum number of transcription factors, $tf_{i+1}, tf_{i+2}, \ldots, tf_{i+k}$, that link $tf_i$ to $g$, i.e. for every $j = i, \ldots, i + k - 1$, $tf_{j+1}$ is a known target of $tf_j$. The term $1/|U_{tf_i}|$ serves to scale the probability mass function to sum to 1. When a path linking $tf_i$ and $g$ through a set of known transcription factors does not exist, we assume that $k = |TF|$. In that case the probability is maximum, whereas it is minimum when at least one $tf_k$ exists such that $g$ is regulated by $tf_k$ and $tf_k$ is regulated by $tf_i$ (Figure 1).

The hypothesis is that the expression profiles of genes regulated by $tf_i$ are more correlated with genes selected as bad negatives than with those selected as good negatives. This is confirmed with a bootstrapping experiment where we repeatedly (e.g. 1000 times) selected two random genes, $g_1$ and $g_2$, belonging to the targets of a transcription factor, and two genes, $g_{good}$ and $g_{bad}$, belonging respectively to the good and bad negative candidates as selected by the NOIT procedure. We computed the correlation between $g_1$-$g_2$, $g_1$-$g_{good}$, and $g_1$-$g_{bad}$, obtaining the three distributions shown in Figure 2. The black curve shows the distribution of correlation between genes within the same set of targets, while the red and green curves show the distributions of correlation between targets and negative candidates, one for good and one for bad candidates. A two-sample Mann-Whitney test between the latter two distributions shows a significant difference ($W = 5940280284$, p-value $< 2.2 \times 10^{-16}$), suggesting that the NOIT procedure is able to select negatives that are more distant, in terms of correlation, from targets.
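The quantity $k$, i.e. the minimum number of intermediate transcription factors linking $tf_i$ to a gene, can be obtained with a breadth-first search over the known regulatory network. The following is a minimal sketch under stated assumptions: the function name `min_tf_links`, the dictionary representation of the network, and the toy network itself are hypothetical, and the sketch computes only $k$, not the probability mass function.

```python
from collections import deque

def min_tf_links(network, tf_set, source_tf, gene):
    """Minimum number k of intermediate transcription factors that link
    source_tf to gene; returns len(tf_set) when no such path exists.

    network maps each regulator to the set of its known direct targets.
    """
    # Breadth-first search that only walks through transcription factors.
    queue = deque([(source_tf, 0)])   # (current TF, intermediates used so far)
    seen = {source_tf}
    while queue:
        tf, k = queue.popleft()
        for target in network.get(tf, ()):
            if target == gene and k >= 1:
                return k              # first hit is the shortest chain
            if target in tf_set and target not in seen:
                seen.add(target)
                queue.append((target, k + 1))
    return len(tf_set)

# Hypothetical toy network (regulator -> known direct targets).
network = {"tfA": {"g1", "tfB"}, "tfB": {"g3", "tfC"}, "tfC": {"g4"}}
tf_set = {"tfA", "tfB", "tfC"}
for g in ("g3", "g4", "g5"):
    print(g, min_tf_links(network, tf_set, "tfA", g))   # k = 1, 2, 3
```

In this toy case `g5` has no path from `tfA`, so it receives $k = |TF| = 3$ and would be the most reliable negative candidate, while `g3` ($k = 1$) would be the least reliable.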
With a learning scheme similar to SIRENE [16] we divide the unlabeled set $U_{tf_i}$ into three random folds. The labels of each fold are predicted with a binary classifier trained with the known positives and a selection of negative examples drawn from the other two folds. SIRENE adopts a method, known as PU learning (Positive Unlabeled learning), that is strongly affected by the positive contamination of unlabeled examples, as all unlabeled examples are considered good negative candidates. We limit such contamination by selecting the top $NC$ negative candidates scored by the probability mass function $pmf_{tf_i}(g)$ introduced above. The number of negative candidates, $NC$, depends on the number of known positives: $NC = K \cdot |P_{tf_i}|$. The parameter $K$ may affect the performance of the classifier. With an experiment performed in the context of Escherichia coli we observed on the independent test set that the best performance is obtained with $K$ around 10 (Figure 3).
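The selection of the top $NC = K \cdot |P_{tf_i}|$ negative candidates can be sketched as follows. This is an illustration only: `scores` stands for any per-gene score that orders candidates like the probability mass function, the gene names and values are hypothetical, and breaking ties alphabetically is an assumption made here to keep the output deterministic.

```python
def select_negatives(scores, n_positives, K=10):
    """Return the NC = K * n_positives unlabeled genes with the highest score.

    scores maps each unlabeled gene to its negative-candidate score.
    """
    nc = min(K * n_positives, len(scores))   # cannot select more than |U|
    # Highest score first; ties broken by gene name (an assumption).
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return ranked[:nc]

# Hypothetical scores for four unlabeled genes, one known positive, K = 2.
scores = {"g3": 1, "g4": 2, "g5": 3, "g6": 3}
print(select_negatives(scores, n_positives=1, K=2))   # ['g5', 'g6']
```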

Negative selection methods in literature
In this Section we briefly review the most important positive only classification methods that include a reliable negative selection step in their classification schema.

Spy-SVM
Spy-SVM is a technique proposed in [20] that works as follows. A percentage of known positives, $\{s_1, s_2, \ldots, s_k\}$, randomly selected from $P_{tf_i}$, that act as 'spies', are sent to the unlabeled set $U_{tf_i}$. An SVM classification algorithm is

PSoL -Positive Sample only Learning
PSoL selects strong negative examples using the Euclidean distance measure [21]. The algorithm starts with a negative candidate that is the farthest unlabeled example from $P_{tf_i}$, calculated as the maximum of the minimum distance from the elements of $P_{tf_i}$. More negative candidates are selected from the unlabeled set $U_{tf_i}$, satisfying the constraint that they are different from the known positive examples and farthest from the previously selected negative ones. The algorithm assumes that the negative examples in the unlabeled set are located far from positives and

Rocchio-SVM
Rocchio-SVM is based on a technique adopted in information retrieval to improve the recall of pertinent documents through relevance feedback [22]. It identifies the set of reliable negatives by adopting two prototypes, one for the positive class, $c_P$, and one for the unlabeled ones, $c_U$, computed as follows:

Figure 3. Effect of the NOIT parameter K on classifier performance. The parameter K determines the amount of negative candidates that will be included in the training set. The figure shows the classifier performance in terms of AUROC for different values of K. Each curve refers to a different percentage of known positives. The optimal value can be observed around K = 10.

Bagging -SVM
Bagging SVM is an ensemble technique that generally improves the performance of individual classifiers when they are unstable or not correlated with each other. Positive-only learning has a particular structure that leads to unstable classifiers, due to the positive contamination of the unlabeled set, which can be advantageously exploited by a bagging-like procedure [36,37]. The approach collects the outcome of a large number of classification runs (e.g. 1000), where each classifier, $F_i$, is trained with the known positive examples, $P_{tf_i}$, and a random set of $NC$ negative candidates drawn uniformly from $U_{tf_i}$, considered as negative examples. The ensemble classifier, $F$, scores an unlabeled example $g$ by averaging the scores obtained by that example at each run:

$$F(g) = \frac{1}{|T_g|} \sum_{F_i \in T_g} F_i(g)$$

where $g$ is a member drawn from $U_{tf_i}$, $F_i$ is the $i$-th classifier, and $T_g$ is the set of partial classifiers that were not trained with $g$, i.e. the unlabeled example $g$ was not drawn by the random selection.
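The bagging scheme described above can be sketched in a few lines. To keep the example self-contained, a toy nearest-centroid scorer is swapped in for the SVM; this substitution, the function names, and the toy data are assumptions for illustration and not the setup used in the study. The part that mirrors the text is the out-of-bag averaging: an unlabeled example is scored only by the runs that did not draw it as a negative.

```python
import random

def centroid_scorer(positives, negatives):
    """Toy stand-in for an SVM: higher score means closer to the positives."""
    def centroid(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    cp, cn = centroid(positives), centroid(negatives)
    return lambda x: dist(x, cn) - dist(x, cp)

def bagging_scores(positives, unlabeled, nc, n_runs=100,
                   train=centroid_scorer, seed=0):
    """Average, for each unlabeled gene, the scores of the runs
    whose training set did not contain that gene."""
    rng = random.Random(seed)
    names = sorted(unlabeled)
    totals = {g: 0.0 for g in names}
    counts = {g: 0 for g in names}
    for _ in range(n_runs):
        drawn = set(rng.sample(names, nc))        # random negatives for this run
        model = train(positives, [unlabeled[g] for g in drawn])
        for g in names:
            if g not in drawn:                    # this run belongs to T_g
                totals[g] += model(unlabeled[g])
                counts[g] += 1
    return {g: totals[g] / counts[g] for g in names if counts[g] > 0}

# Hypothetical 2-dimensional expression profiles.
positives = [[1.0, 1.0], [0.9, 1.1]]
unlabeled = {"u1": [1.0, 0.9], "u2": [-1.0, -1.0],
             "u3": [-0.9, -1.1], "u4": [-1.1, -0.8]}
result = bagging_scores(positives, unlabeled, nc=2, n_runs=50)
print(max(result, key=result.get))   # 'u1', the gene that resembles the positives
```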

Empirical evaluation methods
In this section we introduce the datasets, the basic learning algorithm, and the methods we adopted to empirically evaluate to what extent a negative selection heuristic improves the performance of a classifier trained to infer new transcriptional targets.

Datasets
To test our approach we adopt the well known dataset of Escherichia coli provided by Faith et al. [34], and a dataset that was adopted by Basso et al. [13] to predict BCL6 direct target genes in normal germinal center human B cells.
The dataset of Escherichia coli consists of 445 different Affymetrix Antisense2 microarray expression profiles for 4345 genes. The transcriptional regulatory network of Escherichia coli is the most complete annotated network consisting of 3293 experimentally confirmed relationships between 154 transcription factors and 1211 direct targets extracted from RegulonDB (version 5) [38].
The dataset of Basso et al. is deposited in the Gene Expression Omnibus database and is accessible through GEO series accession number GSE12195. It consists of 136 expression profiles of 73 B-cell lymphoma biopsies, 10 purified tonsillar germinal center B cells, 10 naive and memory B cells, 38 follicular lymphoma biopsies, and 5 lymphoblastoid cell lines. We normalized the dataset from CEL files according to the RMA procedure [39] and filtered out probes with low inter-experiment variability by means of the varFilter function of the genefilter Bioconductor package. The final dataset is composed of 136 samples and 9876 genes. Basso et al. identified a group of 120 new core targets down-regulated by BCL6 with an integrated biochemical-computational-functional approach (see Supplemental Table S2 of [13]), validated through ChIP-on-chip.
We show that those 120 new core targets can be predicted with a supervised learning approach starting from a positive training set of 171 targets annotated as down-regulated by BCL6 in a previous work by Ci et al. [40]. For the NOIT negative selection procedure we rely on 47 transcription factors known to be regulated by BCL6 according to TRANSFAC sequence motif analysis, which considers those that exhibit a BCL6-bound enrichment in their promoter regions as reported in [13]. Their targets were preliminarily predicted with ARACNE as reported in the supplemental Table 5 of reference [13].

Basic Learning algorithm
We use the Support Vector Machine (SVM), with Platt scaling [41], to estimate the probability that a target is regulated by a transcription factor. In particular we use the SVM implementation provided by kernlab [42], a package for kernel-based machine learning methods in R. The basic element of an SVM algorithm is a kernel function $K(x_1, x_2)$, where $x_1$ and $x_2$ are feature vectors of two gene targets. The idea is to construct a separating hyperplane between two classes, +1 and -1, such that the distance of the hyperplane to the points closest to it is maximized. The kernel function implicitly maps the original data into some high dimensional feature space, in which the optimal hyperplane can be found. In our experiment we adopt an SVM classifier for each transcription factor $tf_i \in TF$, trained with the known positive targets and the reliable selection of negative examples performed with a negative selection approach. Such a classifier is then used to score the set of genes $g \in G \setminus TF$ according to their probability of being regulated by $tf_i$. We used C-support vector classification (C-SVC), which solves the following problem:

$$\min_{\alpha} \; \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - e^T \alpha$$

subject to $y^T \alpha = 0$ and $0 \le \alpha_i \le C$, $i = 1, \ldots, 2n$, where $y_i \in \{+1, -1\}$ is the class of vector $x_i$, $e$ is a vector with all elements equal to one, and $K(x_i, x_j)$ is a kernel function. We adopt a radial basis kernel function defined as:

$$K(x_i, x_j) = \exp\left(-\gamma \, \|x_i - x_j\|^2\right)$$

where $C$ and $\gamma$ are parameters that we set empirically inside the training loop [43].
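As a small illustration of the radial basis kernel, the sketch below assumes the common Gaussian form $K(x, y) = \exp(-\gamma \|x - y\|^2)$, with `gamma` playing the role of the kernel parameter tuned in the training loop. The function name and the input vectors are hypothetical.

```python
import math

def rbf_kernel(x, y, gamma):
    """Gaussian radial basis kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Identical profiles give the maximum similarity of 1.0;
# similarity decays towards 0 as the profiles move apart.
print(rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.5))   # 1.0
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=1.0))   # exp(-1), about 0.368
```

A larger `gamma` makes the similarity decay faster with distance, which is why it is usually tuned together with the cost parameter `C`.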

Cross validation and performance measures
To estimate the unknown performance of a classifier designed for discrimination we adopt a workflow consisting of 5 steps (Figure 4). For each transcription factor $tf_i \in TF$ we partition the original dataset into 10 random folds. In turn, 9 folds are used for training, while the remaining fold is used for testing (step 2). Each fold contains a density of positives that is approximately equal to the density of positives in the original dataset. The known targets regulated by $tf_i$ belonging to the current training set are split into a positive set $P_{tf_i}$, assumed to be the known positive training set, and an unknown set $Q_{tf_i}$, which together with the true negatives $N_{tf_i}$ forms the current unlabeled set $U_{tf_i}$ (step 3). The size of $P_{tf_i}$ is incremented linearly starting from 2, or according to the fraction $|P_{tf_i}| / |P_{tf_i} \cup Q_{tf_i}|$. To limit the selection bias we re-sample $P_{tf_i}$ 100 times. The negative training set is extracted from the unlabeled set, $U_{tf_i}$ (step 4), and adopted, together with the current known positives, to train an SVM classifier (step 5).

Figure 4. Evaluation procedure. A negative selection method is evaluated by adopting a completely labeled dataset and a stratified k-fold cross validation procedure, where the number of known positives is varied linearly starting from 2 or according to its percentage with respect to the unknown positives (from 10% to 100%). To limit the selection bias of known positives, within each k-fold, the percentage of known positives is re-sampled 100 times.

Genes belonging to the test set are scored according to the current classifier, and the accuracy of classification is evaluated at different ranking levels in terms of precision and recall as follows:

$$\mathrm{Precision}(n) = \frac{TP_n}{n}, \qquad \mathrm{Recall}(n) = \frac{TP_n}{|targets(tf_i)|}$$

where $TP_n$ is the number of true positives appearing in the top $n$ ranked targets, and $targets(tf_i)$ is the set of $tf_i$ targets we want to predict in each test set. True positive rates and false positive rates are instead computed as:

$$\mathrm{TPR}(n) = \frac{TP_n}{|targets(tf_i)|}, \qquad \mathrm{FPR}(n) = \frac{FP_n}{\#\text{true negatives}}$$

where $FP_n$ is the number of false positives appearing in the top $n$ ranked targets and #true negatives is the number of true negatives in the test set. From those measures we also compute aggregate performance measures, such as AUROC (area under the ROC curve) and AUPR (area under the precision/recall curve). For a given selection of known positives, performance measures are averaged over all folds, all positive sampling runs, and all transcription factors, obtaining an overall performance estimate of the classifier.
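The ranking-based measures can be illustrated with a short sketch. It assumes the usual definitions: precision at $n$ is the number of true targets in the top $n$ divided by $n$, recall at $n$ divides the same count by the number of targets to predict, and AUROC is the fraction of (target, non-target) pairs in which the target is ranked higher. The function names and the toy ranking are hypothetical, and the sketch assumes a ranking without ties in which every target appears.

```python
def precision_recall_at(ranked, targets, n):
    """Precision and recall computed over the top-n ranked genes."""
    tp = sum(1 for g in ranked[:n] if g in targets)
    return tp / n, tp / len(targets)

def auroc(ranked, targets):
    """Area under the ROC curve for a full ranking (best-scored gene first)."""
    n_pos = len(targets)
    n_neg = len(ranked) - n_pos
    # Count (target, non-target) pairs where the target is ranked higher.
    concordant, negatives_below = 0, n_neg
    for g in ranked:
        if g in targets:
            concordant += negatives_below
        else:
            negatives_below -= 1
    return concordant / (n_pos * n_neg)

# Hypothetical ranking of five test genes, two of which are true targets.
ranked = ["a", "b", "c", "d", "e"]
targets = {"a", "c"}
print(precision_recall_at(ranked, targets, 2))   # (0.5, 0.5)
print(auroc(ranked, targets))                    # 5/6, about 0.833
```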

Effect of positive contamination
The contamination of the training set with positive examples wrongly considered as negatives affects the performance of a classifier. We define the level of positive contamination as the fraction $\rho_Q$ of unknown positives, with respect to the total number of unknown positives ($Q$), selected wrongly as negatives. Figure 5 shows the effect, in terms of AUROC (on the left) and AUPR (on the right), of positive contamination in two extreme conditions: a training set with full positive contamination ($\rho_Q = Q/Q = 100\%$) and a training set with no positive contamination ($\rho_Q = 0/Q = 0\%$). In the first, all unknown positives have been selected (wrongly) as negatives, $U = Q + N$. In the second, instead, the training set is composed only of true negatives, $U = N$, and represents an ideal classifier with a perfect negative selection heuristic. In principle the actual performance of a negative selection heuristic should lie within the area delimited by the two curves.
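The two extreme conditions are special cases of a negative training set built for an arbitrary contamination level. A minimal sketch of how such a set could be simulated is given below; drawing the contaminating positives uniformly at random, as well as the function and gene names, are assumptions made for illustration.

```python
import random

def contaminated_negatives(unknown_positives, true_negatives, rho, seed=0):
    """Negative training set in which a fraction rho of the unknown
    positives is (wrongly) included among the negatives."""
    rng = random.Random(seed)
    n_wrong = round(rho * len(unknown_positives))
    wrong = rng.sample(sorted(unknown_positives), n_wrong)
    return sorted(true_negatives) + wrong

# Hypothetical sets: 4 unknown positives (Q) and 3 true negatives (N).
Q = {"q1", "q2", "q3", "q4"}
N = {"n1", "n2", "n3"}
print(len(contaminated_negatives(Q, N, rho=0.0)))   # 3: only true negatives
print(len(contaminated_negatives(Q, N, rho=1.0)))   # 7: all of Q plus N
```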
Both classifiers have been trained in the context of Escherichia coli with the procedure depicted in Figure 4 at different levels of known positives (on the x-axis, between 0.1 and 1). The main effect is that the performance of both contaminated and uncontaminated classifiers decreases as the fraction of known positives decreases, although the decline is steeper for the classifier trained with full positive contamination. When the fraction of known positives is minimum (0.1) the difference between contaminated and uncontaminated classifiers is maximum.

Effect of the negative selection approach
The performance of a negative selection approach is affected by the proportion of known positives available in the training set. With the evaluation procedure depicted in Figure 4 we evaluated the performance of a negative selection approach by varying both the relative fraction and the absolute number of known positives. The latter is closer to practical use, as users only know the total number of positives they have. Figure 6 reports, for each method, the average AUROC computed at different fractions of known positives (on the left) and at different numbers of known positives (in logarithmic scale, on the right). On average the performance of each method increases with the quantity of known positives. With the exception of Rocchio, each method reaches its maximum performance (AUROC around 0.8) when the training set is completely labeled, i.e. the percentage of known positives is maximum (100%). At low levels of known positives the difference among methods is more pronounced. Up to 60% of known positives, or up to 20 known positives, in the training set, the NOIT procedure significantly outperforms all other methods. At low levels of known positives the worst performance is registered by PU, which in fact does not adopt any negative selection approach. At high levels of known positives, instead, the worst performance is registered by Rocchio. Table 1 summarizes the performance of each method in terms of average Recall computed at 60% and 80% of precision. The table reports, at different fractions of known positives, the 95% confidence intervals of Recall measures and the statistical significance (corrected with the Benjamini & Hochberg procedure) obtained with a pairwise t-test performed between NOIT and each other method.
The adoption of the t-test was preliminarily justified, as Recall measures follow a normal distribution (Shapiro test, p-value $< 2.2 \times 10^{-16}$) and the one-way ANOVA test showed that Recall measures among methods are significantly different (ANOVA, p-value $< 2.2 \times 10^{-16}$). At low levels of known positives (precisely at 10% and 30%) the NOIT procedure significantly outperforms all other methods (with the exception of Bagging, which exhibits a marginally significant difference when the precision is set to 60%). The increment in Recall can be estimated at around 10% with respect to Bagging, which is the current state of the art in supervised inference of gene regulatory connections [16,37].
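For readers unfamiliar with the Benjamini & Hochberg correction applied to the pairwise tests, the sketch below shows the standard step-up adjustment of a list of p-values. The p-values are hypothetical and are not taken from Table 1.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Visit p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top              # 1-based rank in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min             # enforce monotonicity
    return adjusted

# Hypothetical p-values from four pairwise comparisons.
p = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(p)
print([round(a, 4) for a in adj])   # [0.04, 0.0533, 0.0533, 0.2]
```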

Prediction of BCL6 core targets in GC human B cells
In order to illustrate an example of application we predict BCL6 core targets in GC human B cells, adopting data and results provided by Basso et al. [13]. Figure 7 shows the number of true BCL6 core targets appearing in the top n genes ranked by an SVM classifier trained with different negative selection approaches. Each classifier has been trained by using the previously known targets provided by Ci et al. [40], and the predicted ranked set of genes has been compared with the BCL6 new core targets published by Basso et al. [13]. For the NOIT selection procedure we rely on 47 transcription factors, reported in the Supplemental Table S5 of Basso et al. [13], known to be controlled by BCL6 by means of TRANSFAC sequence motif analysis. The figure also includes the result obtained with ARACNE [7], an unsupervised method adopted by Basso et al. [13] that ranks genes according to their mutual information with BCL6. It is noticeable that supervised reverse engineering methods perform better than unsupervised ones, a result already confirmed in the literature [16]. Among supervised methods, instead, there is a remarkable difference in the top 50 ranked genes, where NOIT predicts 29 correct targets (about 60% precision), outperforming other methods, which predict fewer than 10 correct targets. Beyond the first 200 ranked genes the Bagging method exhibits the best performance, reaching a correct prediction of 66 targets in the first 1000 ranked genes, whereas NOIT predicts only 51 and the others fewer than 45.
We remark that with this experiment we predicted an interesting number of BCL6 targets without the integrated approach, consisting of wide-spectrum genomics experiments, adopted by Basso et al. [13]. Furthermore, among supervised techniques, the NOIT procedure can take advantage of supplemental transcriptional information, which is available in many contexts.

Conclusions
The availability of only positive examples negatively affects the performance of supervised classifiers. This is particularly evident in the context of learning transcriptional relationships. We showed that the selection of reliable negative examples, a practice adopted in text mining approaches, can improve the performance of such classifiers, opening new perspectives in predicting new transcriptional targets. We introduced a new negative selection heuristic, NOIT, that promotes, as negative candidates of a transcription factor, genes that are not regulated indirectly through other transcription factors. The method has been tested against other negative selection procedures, showing that it is able to improve the average performance by almost 10%, in terms of AUROC and AUPR, especially when the number of known positives is low. We provided an example of application in the context of prediction of BCL6 direct core targets in normal germinal center human B cells by adopting the results of Basso et al. [13]. We showed that in the top 50 genes, ranked with the NOIT method, 29 targets out of 120 are those experimentally demonstrated by Basso et al. [13]. This is promising, as such targets have been predicted without intersecting the results of ChIP-on-chip assays, ARACNe outcomes, and transcriptional repression in GC experiments.
Threats to external validity, concerning the possibility to generalize our findings, affect the study as we