Learning a Markov Logic network for supervised gene regulatory network inference
 Céline Brouard^{1},
 Christel Vrain^{2},
 Julie Dubois^{1, 2},
 David Castel^{3},
 Marie-Anne Debily^{3, 4} and
 Florence d’Alché-Buc^{1, 5}
https://doi.org/10.1186/1471-2105-14-273
© Brouard et al.; licensee BioMed Central Ltd. 2013
Received: 16 August 2012
Accepted: 3 September 2013
Published: 12 September 2013
Abstract
Background
Gene regulatory network inference remains a challenging problem in systems biology despite the numerous approaches that have been proposed. When substantial knowledge on a gene regulatory network is already available, supervised network inference is appropriate. Such a method builds a binary classifier able to assign a class (Regulation/No regulation) to an ordered pair of genes. Once learnt, the pairwise classifier can be used to predict new regulations. In this work, we explore the framework of Markov Logic Networks (MLN) that combine features of probabilistic graphical models with the expressivity of first-order logic rules.
Results
We propose to learn a Markov Logic network, i.e. a set of weighted rules that conclude on the predicate “regulates”, starting from a known gene regulatory network involved in the switch proliferation/differentiation of keratinocyte cells, a set of experimental transcriptomic data and various descriptions of genes, all encoded into first-order logic. As training data are unbalanced, we use asymmetric bagging to learn a set of MLNs. The prediction of a new regulation can then be obtained by averaging predictions of individual MLNs. As a side contribution, we propose three in silico tests to assess the performance of any pairwise classifier in various network inference tasks on real datasets. A first test consists of measuring the average performance on a balanced edge prediction problem; a second one deals with the ability of the classifier, once enhanced by asymmetric bagging, to update a given network. Finally, our main result concerns a third test that measures the ability of the method to predict regulations with a new set of genes. As expected, MLN, when provided with only numerical discretized gene expression data, does not perform as well as a pairwise SVM in terms of AUC-PR. However, when a more complete description of gene properties is provided by heterogeneous sources, MLN achieves the same performance as a black-box model such as a pairwise SVM while providing relevant insights on the predictions.
Conclusions
The numerical studies show that MLN achieves very good predictive performance while opening the door to some interpretability of the decisions. Besides the ability to suggest new regulations, such an approach makes it possible to cross-validate experimental data with existing knowledge.
Background
Gene regulatory network inference has received a lot of attention over the last decade due to the abundance of high-throughput data. A gene regulatory network (see for instance [1]) usually refers to a set of genes whose expression varies over time due to the inhibitive or inductive roles of regulators. Deciphering the regulations at work in the cell will provide a thorough understanding of the cell behaviour and will eventually aid in controlling or repairing it when needed. Inference of gene regulatory networks as a problem of empirical inference fits the framework of machine learning as described in [2]. Three main families of inference algorithms have been developed so far: (1) unsupervised model-free approaches that use information theory to extract a non-oriented graph of dependence between variables, (2) unsupervised reverse-modeling approaches that model the network behavior as a (dynamical) system [3] and (3) supervised edge prediction approaches that focus on the graph of regulation and only predict the presence/absence of regulations [4-7]. In the first family, relevance networks like ARACNE [8], CLR [9] and TDARACNE [10] use a mutual information score between the expression profiles of each pair of genes and, given a threshold, decide whether to predict an interaction or not. The second family is based on a model of the behavior of the network, either static or dynamic. In the case of static models devoted to steady-state data, Gaussian Graphical Models (GGM) [11, 12] build a linear regression model that expresses how one gene can be predicted from the set of remaining genes. Interestingly, GGMs build a network using partial correlation coefficients, providing a stronger measure of dependence than the correlation coefficients used in relevance networks.
A powerful approach to regression and network inference based on an ensemble of randomized regression trees [13] has also proven to outperform competitors in inferring gene regulatory networks in recent DREAM competitions. Bayesian networks [14] provide another important approach in static modeling. Learning a Bayesian network involves learning the acyclic oriented graph that describes the parental relations between variables and the conditional probabilities that govern the behavior of the network. While appropriate for gene regulation cascades, Bayesian networks cannot, however, model cycles in the network. Other models incorporating dynamical modeling have therefore been proposed in the literature: dynamical Bayesian networks and differential equations [15-17].
Taking a different angle, supervised edge prediction methods build a decision function that associates a class label to a pair of vertices (genes or proteins) without searching for a model of the network behavior. These methods assume that the network to infer is partially known and that information on the vertices is available. They have been mainly developed for protein-protein interaction network inference, using kernel methods [18-23]. The principle underlying [20, 21] is to build pairwise Support Vector Machines (SVM) with an appropriate definition of kernels between pairs of proteins from a kernel defined between individual proteins. Pairwise kernels can also be combined into a linear combination (usually an average) to deal with multiple sources of information. In [23], another point of view is taken: local models (still SVMs) are attached to each target protein in order to predict whether a candidate protein interacts with the considered target, and these models are then combined. Recently, the work of [22] has shown that the local model is equivalent to a pairwise SVM considering a local definition of a pairwise kernel.
In the case of gene regulatory network inference, the supervised setting of edge prediction has been explored less. It was first introduced by Qian et al. [4] using gene expression as the unique descriptor and further developed by Mordelet et al. with the SIRENE method [5]. Similarly to [23], SIRENE estimates a local model for each transcription factor and then combines all local models together. The method requires a list of known transcription factors that serve as targets. Other advances in supervised edge inference concern the problem of the lack of true negative examples and therefore focus on learning from positive-only and unlabeled examples. Some methods develop strategies to select reliable negative examples from the unlabeled set and then solve a classical balanced binary classification problem [24, 25]; others adjust the probability of being positive estimated by a classifier trained on positive and unlabeled examples [6, 7, 26].
Choosing between the three kinds of network inference methods, namely model-free, model-driven and supervised approaches, depends on the goal of the study. Model-free approaches give a good first network approximation when only one kind of data is available. Reverse-modeling delivers a model of the network that can be used to predict its behavior but requires a sufficient amount of observations, if possible acquired under different initial conditions or perturbations. Supervised edge prediction is relevant when a sufficiently large set of regulations is known a priori and various sources of gene annotations are available. It is especially meaningful when the biologist wants to increase the corpus of existing knowledge.
This paper deals with the latter prediction problem. We assume that a directed graph of regulations is partially known for a target set of genes; for instance, it is the result of the biologists' experience and careful mining of the literature. Besides the graph structure, we also suppose that various descriptors of the genes and their products are available for the target set of genes, such as gene expression data, Gene Ontology (GO) annotations, protein-protein interactions and the location of genes on chromosomes. Our goal is to build a decision function that predicts whether an ordered pair of regulator and regulee candidates belongs to the class Regulation or No Regulation.
In this work, we address four issues raised by supervised edge prediction and implement the whole approach on a new experimental dataset related to the ID2 genetic regulatory network in human keratinocytes. The first issue concerns the available sources of information about genes and proteins. These sources provide multiple views of the data, which are by definition heterogeneous and very often highly structured. The second issue is related to the interpretability of network inference: many of the proposed methods are black boxes, while biologists are interested in how the predictions have been obtained. The third issue, as raised by many authors, deals with imbalanced data: very few positive examples of “regulation” are available compared to the huge number of negative examples of “no regulation”. Finally, the fourth issue we tackle in this paper concerns the performance assessment of a supervised edge prediction tool. Although the best performance assessment comes when biologists go back to the experimental laboratory to test predictions of new regulations with additional and independent experiments, there is a lot of room for in silico studies to measure the ability of an edge prediction tool to provide evidence for regulations. The first and the second issues call for a common framework of representation for all the views of the data. For that purpose, we use first-order logic to represent both data and background knowledge. In order to benefit from the tools of statistical learning and to avoid some of the weaknesses of pure inductive logic programming raised, for instance, in [27], we choose a Markov Logic network (MLN) [28, 29] as the edge predictor. An MLN makes predictions using a set of weighted first-order logic rules, thus providing interesting insights on decisions.
The third issue is systematically solved by using asymmetric bagging [30, 31], a well-known and generic method that adapts a classifier devoted to well-balanced tasks to unbalanced ones, which was also discussed in [6] among other approaches. It is worth noticing that we do not solve the issue of false negatives, i.e. the fact that among the “no regulation” examples, there might be “regulation” examples that have not been validated yet. The reader interested in this issue is invited to study the works of Cerulo et al. [7] and Mordelet & Vert [6]. Finally, as a fourth contribution, we define and perform three typical numerical studies that can be drawn in order to test a machine learning method devoted to edge prediction: the first is a basic test with artificially balanced samples in which we simply test the ability of the learning method to obtain good performance; the second one consists of building a regulation predictor in a realistic setting from unbalanced datasets using asymmetric bagging and measuring its ability to discover regulations that were not known before; in the third study, we proceed in the same way but test the ability of the classifier to correctly label pairs of genes combining genes from the training network and genes coming from a new candidate set. In order to assess the performance of the MLN-based approach, we define a pairwise Support Vector Machine (SVM) devoted to ordered pairs of genes and use it as a baseline, using a straightforward simplification of the tensor product pairwise kernel. Kernel-based methods as well as first-order logic provide a framework to take into account different sources and features of the data: in this study, two simple definitions of pairwise kernels that combine multiple pairwise kernels expressing heterogeneous information are proposed.
While the goal of the study is to take advantage of the heterogeneity of features to describe a pair of genes, we also study the behavior of MLN compared to pairwise SVM in the case of a single source of quantitative information such as gene expression.
In order to show the interest of solving these four issues, we have applied our approach to the ID2 genetic regulatory network in human keratinocytes and a new dataset of gene expression using RNA interference. The ID2 protein (Inhibitor of Differentiation 2) acts as a negative regulator of basic helix-loop-helix transcription factors. Previous studies have suggested a potential role for ID2 in epidermis homeostasis, reflected by the high expression level of ID2 in proliferating keratinocytes and its downregulation upon the onset of differentiation [32]. However, the precise implications of ID2 in this process, and in particular its genetic interactions, remain largely unknown. In an attempt to decipher the ID2 genetic regulation network in human keratinocytes, we conducted a transcriptomic analysis by microarray experiments of HaCaT cells presenting stable overexpression or transient knockdown, achieved by RNA interference, of ID2 expression. As a starting point, we retrieved the regulatory networks associated with the differentially expressed genes in cells with high and low levels of ID2 from the Ingenuity Pathway Analysis (IPA) database. We selected a subset of these networks with ontologies of interest for the biologists (cell cycle regulation, cancer, gene expression and signal transduction), merged the corresponding networks and kept only the transcriptional/expression regulations between the genes. The resulting network was finally used to label pairs of genes, forming a training set.
Methods
Learning directed edges from a set of known regulations
Let $\mathcal{G}$ be the set of genes of interest. We want to learn a function h that takes the descriptors of a gene G_{1} and a gene G_{2} and predicts whether the gene G_{1} regulates G_{2}. Two types of descriptors are considered: descriptors of genes, for instance protein locations within the cell, and relationships between genes reflecting, for instance, whether two genes are located on the same chromosome. Let us denote by $\mathcal{X}$ the set of descriptors on genes and by $\mathcal{R}$ the set of relations. A special descriptor expresses the class: given an ordered pair of genes G_{1} and G_{2}, it is true when G_{1} regulates G_{2}.
In this work we have chosen to use a first-order logic representation, which allows for an easy representation of several objects (here, genes) and their relationships. Facts representing information about objects and their relations are expressed by atomic expressions, called atoms. They are written P(t_{1},…,t_{ n }), where P is a predicate and t_{1},…,t_{ n } are terms, a term being either a variable or a constant. In the remainder, strings corresponding to constants will start with uppercase letters and strings corresponding to variables with lowercase letters. An atom is said to be ground if all its variables are set to specific values. A ground atom can be true or false, depending on the truth value of the property it expresses. It can therefore be seen as a Boolean variable.
The decision function h predicts the class Regulation for an ordered pair (G_{1},G_{2}) when the posterior probability of the regulation exceeds a threshold:

$P(\mathit{\text{Regulates}}({G}_{1},{G}_{2})=\mathit{\text{true}}\mid {\mathcal{X}}_{1}={x}_{1},{\mathcal{X}}_{2}={x}_{2},{\mathcal{R}}_{12}={r}_{12},\mathcal{B}=b)\ge \theta$

where ${\mathcal{X}}_{i}={x}_{i}$ represents the description of G_{ i }, ${\mathcal{R}}_{12}={r}_{12}$ represents the relations between G_{1} and G_{2} and $\mathcal{B}=b$ represents the background knowledge. θ is a threshold, whose value will be discussed in the experiments. As this formalization shows, the learning framework we consider goes beyond the classical framework of machine learning in which data is represented only by attributes; it belongs to the ILP (Inductive Logic Programming) domain, a subfield of machine learning that studies relational learning in first-order formalisms [33].
The model we have chosen is a Markov Logic Network, as introduced in [29]. Such a model is defined as a set of weighted first-order formulas. In this paper, we consider only a subset of first-order logic, composed of rules A_{1}∧…∧A_{ n } ⇒ Regulates(g_{1},g_{2}), where A_{1}, …, A_{ n } are atoms. Such restrictions correspond to Horn clauses. The left-hand side of the rule (A_{1}∧…∧A_{ n }) is called the body of the rule whereas the right-hand side is called the head of the rule.
Learning a Markov Logic network
Statistical relational learning (SRL) is a subfield of machine learning that combines first-order logic rules with probabilistic graphical frameworks. Among the promising approaches to SRL, Markov Logic Networks (MLNs), introduced by Richardson and Domingos [28, 29], are an appealing model. An MLN $\mathcal{M}$ is defined by a set of formulas F = {f_{ i } | i = 1,…,p} and a weight vector w of dimension p, where the clause f_{ i } has an associated weight w_{ i } (negative or positive) that reflects its importance. Therefore, an MLN provides a way of softening first-order logic and encapsulating the weight learning into a probabilistic framework.
As an illustration, consider the clause Processbio(b,Cell_proliferation) ∧ Processbio(a,Negative_regulation_of_cell_proliferation) ⇒ Regulates(a,b), where the predicate Processbio(Gname,Proc) says that gene G is involved in the biological process annotation Proc of Gene Ontology [34]. Grounding this clause over the constants A and B yields four instantiated clauses:
 1.
Processbio(B,Cell_proliferation) ∧ Processbio(A,Negative_regulation_of_cell_proliferation) ⇒ Regulates(A,B)
 2.
Processbio(A,Cell_proliferation) ∧ Processbio(B,Negative_regulation_of_cell_proliferation) ⇒ Regulates(B,A)
 3.
Processbio(A,Cell_proliferation) ∧ Processbio(A,Negative_regulation_of_cell_proliferation) ⇒ Regulates(A,A)
 4.
Processbio(B,Cell_proliferation) ∧ Processbio(B,Negative_regulation_of_cell_proliferation) ⇒ Regulates(B,B)
The probability of a world x (an assignment of truth values to all ground atoms) under an MLN is

$P(X=x)=\frac{1}{Z}\phantom{\rule{0.2em}{0ex}}\mathit{\text{exp}}\left({\sum}_{i}{w}_{i}{n}_{i}(x)\right)$

where n_{ i }(x) is the number of true groundings of the clause f_{ i } in the world x, and $Z={\sum}_{{x}^{\prime}}\mathit{\text{exp}}\left({\sum}_{i}{w}_{i}{n}_{i}({x}^{\prime})\right)$ is the partition function used for normalization.
For instance, if we consider a world where Processbio(B,Cell_proliferation) and Processbio(A,Negative_regulation_of_cell_proliferation) are true and the other ground atoms are false, then the first instantiated clause is false in this world, whereas all the other instantiated clauses are true (because their premises are false, which makes the implications vacuously true). Thus, the number of true groundings of the clause in this world is 3.
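This count can be checked mechanically. The following sketch (an illustration only, not part of the authors' pipeline) enumerates the four groundings of the clause over the constants A and B and counts those that are true in the world just described:

```python
from itertools import product

# World: the set of true ground atoms; every other ground atom is false.
true_atoms = {
    ("Processbio", "B", "Cell_proliferation"),
    ("Processbio", "A", "Negative_regulation_of_cell_proliferation"),
}

constants = ["A", "B"]

def holds(atom):
    return atom in true_atoms

def clause_true(a, b):
    # Body: Processbio(b, Cell_proliferation) ∧ Processbio(a, Negative_regulation...)
    body = (holds(("Processbio", b, "Cell_proliferation"))
            and holds(("Processbio", a, "Negative_regulation_of_cell_proliferation")))
    head = holds(("Regulates", a, b))
    # An implication is false only when its body is true and its head is false.
    return (not body) or head

n_true = sum(clause_true(a, b) for a, b in product(constants, repeat=2))
print(n_true)  # 3: only the grounding (a=A, b=B) is false
```

Only the grounding with a = A and b = B has a true body and a false head, so 3 of the 4 instantiated clauses are true.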
In the discriminative setting, the conditional distribution of the query atoms Y given the evidence atoms X is

$P(Y=y\mid X=x)=\frac{1}{{Z}_{x}}\phantom{\rule{0.2em}{0ex}}\mathit{\text{exp}}\left({\sum}_{i}{w}_{i}{n}_{i}(x,y)\right)$

where n_{ i }(x,y) is the number of true groundings of f_{ i } in the world (x,y) and Z_{ x } normalizes over the possible assignments of Y.
Learning the candidate rules with Aleph
The system Aleph, developed by Srinivasan [35], is a well-known ILP learner that implements the method proposed in [36]. Aleph, like other relational learners, takes as input ground atoms corresponding to positive and negative examples and background knowledge. It also needs language biases, which restrict the set of clauses that can be generated, thus reducing the size of the search space. These restrictions can correspond to information specified on the predicates, such as the place where they occur in the rule, the types of their arguments or the way they will be used (instantiated or not). In our case, we specified that the predicate Regulates occurs in the head of a rule and the other predicates in the body. Other constraints, such as the maximum number of atoms in a clause or the number of variables, can be defined in order to restrict the form of the rules that can be learned. Given these inputs, Aleph builds a set of rules with the following covering loop:
 1.
Select a positive example not yet covered by a rule
 2.
Build the most specific clause r that covers this example and that satisfies the language biases from the background knowledge. This clause is called the “bottom clause”.
 3.
Search a clause more general than the bottom clause: perform a topdown search (from the most general to the most specific clause) in the search space bounded by r.
 4.
Add the clause with the best score to the current theory and prune redundant clauses.
 5.
Repeat until all positive examples are covered.
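The covering loop above can be sketched in a toy propositional setting, where each example is a set of facts and a clause body covers an example when the body is a subset of its facts; the greedy generalisation step below stands in for Aleph's top-down search (a hypothetical simplification, not Aleph's actual algorithm):

```python
# Toy covering loop in the spirit of Aleph. Facts are strings, an example is a
# frozenset of facts, and a clause body covers an example when it is a subset.
def covers(body, example):
    return body <= example

def generalise(bottom, negatives):
    # Greedily drop literals from the bottom clause while no negative example
    # becomes covered (stand-in for Aleph's bounded top-down search).
    body = set(bottom)
    for lit in sorted(bottom):
        trial = body - {lit}
        if not any(covers(trial, n) for n in negatives):
            body = trial
    return frozenset(body)

def learn_theory(positives, negatives):
    theory, uncovered = [], list(positives)
    while uncovered:
        seed = uncovered[0]                   # 1. select an uncovered positive
        bottom = seed                         # 2. bottom clause = all its facts
        rule = generalise(bottom, negatives)  # 3. generalise the bottom clause
        theory.append(rule)                   # 4. add the clause to the theory
        uncovered = [e for e in uncovered if not covers(rule, e)]  # 5. repeat
    return theory

pos = [frozenset({"cp(B)", "neg(A)", "expr_up(A)"}),
       frozenset({"cp(B)", "neg(A)", "expr_down(A)"})]
neg = [frozenset({"cp(B)", "expr_up(A)"})]
theory = learn_theory(pos, neg)
print(theory)  # one rule whose body is {"neg(A)"} covers both positives
```

A single generalised rule covers both positive examples while rejecting the negative one, so the loop terminates after one iteration.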
Weight learning
Richardson & Domingos [29] proposed performing generative weight learning for a fixed set of clauses by optimizing the pseudo log-likelihood. Several approaches have been proposed for discriminative learning, where the conditional log-likelihood is optimized instead [37, 38]. Huynh & Mooney [39] introduced a weight learning algorithm that targets the case of MLNs containing only non-recursive clauses. In this particular case, each clause contains only one target predicate, so the grounding of a clause contains only one grounded target predicate. This means that the query atoms are all independent given the background atoms. Because of this special assumption on the structure of the model, their approach can perform exact inference when calculating the expected number of true groundings of a clause. Recently, Huynh & Mooney [40] have introduced a discriminative weight learning method based on a max-margin framework.
Under this independence assumption, the conditional probability of each query atom Y_{ j } factorizes as

$P({Y}_{j}={y}_{j}\mid X=x)=\frac{\mathit{\text{exp}}\left({\sum}_{i\in {\mathcal{F}}_{{Y}_{j}}}{w}_{i}{n}_{i}(x,{y}_{[{Y}_{j}={y}_{j}]})\right)}{{\sum}_{{y}^{\prime}\in \{0,1\}}\mathit{\text{exp}}\left({\sum}_{i\in {\mathcal{F}}_{{Y}_{j}}}{w}_{i}{n}_{i}(x,{y}_{[{Y}_{j}={y}^{\prime}]})\right)}$

where ${\mathcal{F}}_{{Y}_{j}}$ is the set of clauses concluding on the target atom Y_{ j }, and ${n}_{i}(x,{y}_{[{Y}_{j}={y}_{j}]})$ is the number of true groundings of the ith clause when the atom Y_{ j } is set to the value y_{ j }. To find the vector of weights w optimizing this objective function, we used the limited-memory BFGS algorithm [41] implemented in the software ALCHEMY [42].
Materials for inference of the ID2 genetic regulatory network
Data
We conducted a transcriptomic analysis of microarray experiments of HaCaT cells presenting distinct expression levels of ID2. We analyzed three conditions: wild-type cells (wt), stable overexpression (prcID2) or transient knockdown achieved by RNA interference (siID2) of ID2 expression, and their corresponding controls. Differentially expressed genes in prcID2 or siID2 versus the corresponding control cells were identified by a t-test analysis using a p-value cutoff of 0.005, a fold-change threshold of 1.5 and the Benjamini & Hochberg multiple testing correction [43]. The resulting genes were mapped to genetic networks as defined by the IPA tools, and the significantly enriched networks associated with cell cycle regulation, cancer, gene expression and signal transduction were merged. In this merged network, only edges and their associated nodes (genes) corresponding to expression/transcriptional regulations were conserved. Genes with incomplete information for some of the features were removed. This process led to the selection of a network containing a set of 63 genes, denoted by ${\mathcal{G}}_{A}$.
In order to use MLNs, we need to describe known properties of genes within the first-order logic setting.
Encoding data

The predicate Expwt(Gname,L) states that the expression level of gene G in the wild-type cells is L. In the following results, expression values were discretized using equal-width discretization [44]: we divided the interval of gene expression values into 5 intervals of equal width.

Similarly, the predicate Expsiid2(Gname,L) states that the expression level of gene G is L when the expression of ID2 has been decreased, and Expprcid2(Gname,L) when it has been increased.
Three other predicates express an increase, a decrease or a lack of change of the expression level between the experiment on the wild-type cells and the other experiments: Expmore(Gname,Exp), Expless(Gname,Exp) and Expsame(Gname,Exp), where Exp is either Prcid2 or Siid2.
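Equal-width discretization into 5 bins can be sketched as follows (a minimal illustration, assuming the values span a non-degenerate interval and using the bin index as the level L in Expwt(Gname,L)):

```python
# Minimal sketch of equal-width discretization into n_bins levels.
def equal_width_bins(values, n_bins=5):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Values falling on the upper edge are assigned to the last bin.
    return [min(int((v - lo) // width), n_bins - 1) for v in values]

expr = [0.1, 0.4, 2.3, 3.9, 5.1]   # illustrative expression values
levels = equal_width_bins(expr)
print(levels)  # [0, 0, 2, 3, 4]
```

Each level would then be emitted as a ground atom such as Expwt(G, 2) for the corresponding gene.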
In order to characterize regulatory interactions, we used other features describing genes. Some of these features concern proteins and not directly genes.
● Physical interaction between proteins: Physical interactions between proteins can provide a hint about the role played by the genes coding for these proteins. In our study, we used the protein interaction data from the IntAct database [45]. We encoded the information of a physical interaction by a predicate containing the names of the genes that are assumed to code for the proteins: Inteprot(G_{1}name,G_{2}name).
● Subcellular localization of proteins: Another interesting piece of information about proteins is their localization in the cell. All proteins were analyzed using the Ingenuity Pathway Analysis Knowledge Base (Ingenuity Systems, www.ingenuity.com), and we encoded the information on the subcellular localization by a predicate ProtLocCell(Gname,Loc), where G is the name of the gene that codes for the protein and Loc is the name of the cellular compartment where the protein was found.
● Biological processes: We used Gene Ontology [34] to describe the genes by the biological processes in which they are involved. To do so, we defined a predicate Processbio(Gname,Proc), which says that a gene G is involved in the process Proc.
● Chromosomal location of genes: We extracted the location of genes on chromosomes and chromosomal bands from the Entrez Gene database [46]. This information is encoded by the predicates Locchro(Gname,Chro) and Locband(Gname,Arm_begin,Band_begin,Arm_end,Band_end). From these predicates, we built two other predicates that we used instead: Samechro(G_{1}name,G_{2}name) and Sameband(G_{1}name,G_{2}name). These predicates provide information on the proximity between the locations of G_{1} and G_{2}.
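Derived relational predicates of this kind can be computed directly from the per-gene location facts. A small sketch for Samechro (the chromosome assignments are real facts, e.g. ID2 and MYCN both lie on chromosome 2, but the three-gene set is only an illustration):

```python
# Locchro ground atoms represented as (gene, chromosome) tuples.
locchro = {("ID2", "2"), ("MYCN", "2"), ("MYC", "8")}

# Samechro(G1, G2) holds for every ordered pair of distinct genes
# located on the same chromosome.
samechro = {(g1, g2)
            for g1, c1 in locchro
            for g2, c2 in locchro
            if g1 != g2 and c1 == c2}
print(sorted(samechro))  # [('ID2', 'MYCN'), ('MYCN', 'ID2')]
```

Sameband is built in the same way from the Locband facts, with a band-overlap test instead of strict chromosome equality.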
Choice of a baseline for comparison
As a baseline, we used a pairwise SVM whose kernel between ordered gene pairs is built from a kernel k between individual genes as $K(({G}_{1},{G}_{2}),({G}_{3},{G}_{4}))=k({G}_{1},{G}_{3})\cdot k({G}_{2},{G}_{4})$. This pairwise kernel is the asymmetric version of the kernel proposed in [20, 22] for pairs of proteins to solve supervised protein-protein interaction network inference tasks. Alternative definitions of pairwise kernels have also been proposed, like the metric learning pairwise kernel [47] and the Cartesian kernel [48, 49].
To combine the six sources of information described above, we considered two definitions: in the first one (“pairwise sum”), the pairwise kernels built from each individual gene kernel are averaged; in the second one (“sum”), the pairwise kernel is built from the averaged gene kernel, where $\stackrel{\u0304}{k}({G}_{j},{G}_{k})=\frac{1}{6}{\sum}_{i=1}^{6}{k}_{i}({G}_{j},{G}_{k})$.
Let us notice that kernels are appropriate tools for gathering heterogeneous sources of information into the same framework and that combining multiple kernels enables data integration. However, once an SVM is built, it is hard to open the “black box” and interpret the decision function.
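The asymmetric tensor-product construction can be sketched as follows (our reading of the "straightforward simplification" mentioned above, not the authors' exact code): the similarity of two ordered pairs is the product of the regulator-regulator and target-target gene-kernel values.

```python
import numpy as np

# K: kernel matrix between individual genes (indices 0..n-1);
# a pair (a, b) is an ordered (regulator, target) candidate.
def pairwise_kernel(K, pairs1, pairs2):
    return np.array([[K[a, c] * K[b, d] for (c, d) in pairs2]
                     for (a, b) in pairs1])

# Toy 2-gene kernel and two ordered pairs.
K = np.array([[1.0, 0.5],
              [0.5, 1.0]])
pairs = [(0, 1), (1, 0)]
P = pairwise_kernel(K, pairs, pairs)
print(P)
```

Because the two factors are not symmetrized, the pair (G_1, G_2) and its reverse (G_2, G_1) are distinct inputs, which is what a directed-regulation classifier requires.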
Results and discussion
Description of the experimental studies
Summary of the three experimental studies
Study  Positive set  Negative set  Protocol

1  ${R}_{1}^{+}$  ${R}_{1,i}^{-}$, i = 1,…,30  10-CV on ${R}_{1}^{+}\cup {R}_{1,i}^{-}$
2  ${R}_{1}^{+}$  ${R}_{1,i}^{-}$, i = 1,…,30  AB on ${R}_{1}^{+}\cup {R}_{1,i}^{-}$, test on ${R}_{2}^{+}\setminus {R}_{1}^{+}$
3  ${R}_{2}^{+}$  ${R}_{2,i}^{-}$, i = 1,…,30  AB on ${R}_{2}^{+}\cup {R}_{2,i}^{-}$, test on R_{3}
In the first study, we considered the set of 106 regulations provided by Ingenuity in 2007 between the genes in ${\mathcal{G}}_{A}$, denoted by ${R}_{1}^{+}$. All the unknown regulations ($\left|\overline{{R}_{1}^{+}}\right|=3863$) were considered as negative examples. The goal of this first study was to test a Markov Logic Network on a well-balanced classification task.
For the second study, we considered the set ${R}_{2}^{+}$ of regulations provided by Ingenuity in 2009 for the same set of genes ${\mathcal{G}}_{A}$. We found that 51 new regulations had been discovered by Ingenuity between 2007 and 2009, and we were interested in the prediction task on the updated network. Usual bagging applied to an unbalanced dataset would provide biased classifiers. To build a classifier appropriate for an unbalanced prediction task, we used asymmetric bagging [30, 31].
In supervised classification, asymmetric bagging consists of performing random sampling only on the over-represented class, such that the number of examples in the subsample is equal to the number of examples in the under-represented class. This way, each generated predictor is trained on a balanced dataset, and the predictions on the test set are combined to provide a single prediction. Studies described in [30, 31] have shown that asymmetric bagging provides better results than normal bagging on unbalanced datasets.
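A minimal sketch of this scheme follows; the `train`/`predict_score` pair is a stand-in for any base learner (an MLN or a pairwise SVM here), and the mean-threshold classifier is purely illustrative:

```python
import random

def train(pos, neg):
    # Toy base learner: threshold halfway between the class means.
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_score(threshold, x):
    return 1.0 if x > threshold else 0.0

def asymmetric_bagging(pos, neg, n_bags=30, seed=0):
    rng = random.Random(seed)
    # Subsample ONLY the over-represented (negative) class, once per bag,
    # down to the size of the positive class.
    models = [train(pos, rng.sample(neg, len(pos))) for _ in range(n_bags)]
    # Final prediction: average of the individual predictors' scores.
    return lambda x: sum(predict_score(m, x) for m in models) / n_bags

pos = [0.8, 0.9, 0.7]                                   # few positives
neg = [0.1, 0.2, 0.3, 0.15, 0.25, 0.05, 0.2, 0.1]       # many negatives
score = asymmetric_bagging(pos, neg)
print(score(0.85), score(0.1))  # 1.0 0.0
```

The averaged score plays the role of the posterior probability thresholded by θ in the studies below.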
In the last study, we solved a network completion task in real conditions. We selected a new set of genes ${\mathcal{G}}_{B}$ and tried to infer the known regulations between the genes of ${\mathcal{G}}_{B}$ and ${\mathcal{G}}_{A}$. Asymmetric bagging was also applied.
The lists of genes in ${\mathcal{G}}_{A}$ and ${\mathcal{G}}_{B}$ are given in the Additional file 1 and details on Aleph parameters are available in the Additional file 2. Regarding Alchemy, we used the implementation of the discriminative weight learning procedure and tested different values of the regularization parameter λ.
Evaluation metric
We used the area under the ROC (resp. Precision-Recall) curve as evaluation metric, denoted by AUC-ROC (resp. AUC-PR). These curves were obtained by varying the threshold θ from 0 to 1 in order to predict regulations from posterior probabilities. A ROC curve shows the behavior of the True Positive Rate (also called recall), $\mathit{\text{TPR}}=\frac{\mathit{\text{tp}}}{p}$, according to the value of the False Positive Rate, $\mathit{\text{FPR}}=\frac{\mathit{\text{fp}}}{n}$, while a PR curve assesses the behavior of the precision, $\mathit{\text{Precision}}=\frac{\mathit{\text{tp}}}{\mathit{\text{tp}}+\mathit{\text{fp}}}$, according to the value of the recall. A ROC curve expresses the price to be paid in terms of wrongly predicted negative examples when correctly retrieving a number of positive cases. A PR curve, usually plotted in information retrieval tasks, puts emphasis on the confidence of positive predictions. We standardized our precision-recall curves similarly to what was proposed in [50].
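The quantities behind both curves, computed at a single threshold θ on predicted posterior probabilities, can be sketched as (illustrative scores and labels, not the study's data):

```python
# TPR (recall), FPR and precision at one threshold theta.
def rates(scores, labels, theta):
    tp = sum(s >= theta and y for s, y in zip(scores, labels))
    fp = sum(s >= theta and not y for s, y in zip(scores, labels))
    p = sum(labels)          # number of positives
    n = len(labels) - p      # number of negatives
    tpr = tp / p             # recall
    fpr = fp / n
    precision = tp / (tp + fp) if tp + fp else 1.0
    return tpr, fpr, precision

scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
print(rates(scores, labels, 0.5))
```

Sweeping θ over [0, 1] and collecting (FPR, TPR) points traces the ROC curve, while (recall, precision) points trace the PR curve.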
Average cross-validation measurements on balanced samples
We first tested the performance of an MLN and compared it to that of a pairwise SVM on a well-balanced classification task. To do so, we subsampled the negative example set and generated subsamples of negative examples of the same size as the positive example set.
Averaged AUCs for cross-validation measurements on balanced samples using MLNs
MLN  

λ  AUC-ROC  AUC-PR 
20  80.8 ± 6.1  82.7 ± 5.4 
50  84.3 ± 3.5  85.5 ± 4.0 
100  84.4 ± 2.8  86.2 ± 3.2 
500  83.4 ± 2.7  86.0 ± 2.7 
750  83.3 ± 2.8  85.8 ± 2.8 
Averaged AUCs for cross-validation measurements on balanced samples using SVMs
Pairwise SVM  

C  Pairwise sum  Sum  
AUC-ROC  AUC-PR  AUC-ROC  AUC-PR  
0.001  70.9 ± 3.5  73.1 ± 3.4  82.5 ± 2.3  84.3 ± 2.1 
0.01  70.9 ± 3.5  73.1 ± 3.4  82.5 ± 2.3  84.3 ± 2.1 
0.1  70.9 ± 3.5  73.1 ± 3.4  82.5 ± 2.3  84.3 ± 2.1 
1  76.4 ± 3.1  78.7 ± 3.0  85.2 ± 2.8  87.3 ± 2.5 
10  77.5 ± 3.2  79.4 ± 3.5  84.3 ± 3.4  86.3 ± 3.1 
100  77.5 ± 3.2  79.4 ± 3.5  84.3 ± 3.4  86.3 ± 3.1 
1000  77.5 ± 3.2  79.4 ± 3.5  84.3 ± 3.4  86.3 ± 3.1 
Prediction on the updated graph
In this second study, we addressed a network completion task while keeping the same set of nodes. Two years after the dataset described previously was obtained, the Ingenuity tool was used again to provide an updated set ${R}_{2}^{+}$ of regulations between the 63 genes of interest at that date. We noticed that 51 new regulations had been discovered by Ingenuity between these two dates. We were therefore interested in the prediction task of the updated graph, i.e. to see if we could retrieve these new regulations from the data of 2007. We used the dataset ${R}_{1}^{+}$ from 2007 containing 106 regulations as positive training set and tried to infer the 51 new regulations in ${R}_{2}^{+}\setminus {R}_{1}^{+}$ using asymmetric bagging. To that end, we randomly sampled 30 negative training sets ${R}_{1,i}^{-}$, i = 1,…,30 with ${R}_{1,i}^{-}\subseteq \overline{{R}_{1}^{+}}\setminus {R}_{2}^{+}$ and $\left|{R}_{1,i}^{-}\right|=\left|{R}_{1}^{+}\right|$.
We selected the threshold maximizing the averaged F_{1}-measure, that is, the value giving the best trade-off between precision and recall.
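Threshold selection by the F_{1}-measure (the harmonic mean of precision and recall) can be sketched as follows, with illustrative scores rather than the study's data:

```python
# F1 at a given threshold theta on predicted scores.
def f1_at(scores, labels, theta):
    tp = sum(s >= theta and y for s, y in zip(scores, labels))
    fp = sum(s >= theta and not y for s, y in zip(scores, labels))
    fn = sum(s < theta and y for s, y in zip(scores, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Pick, among the observed scores, the threshold maximizing F1.
def best_threshold(scores, labels):
    return max(sorted(set(scores)), key=lambda t: f1_at(scores, labels, t))

scores = [0.9, 0.7, 0.4, 0.3]
labels = [1, 1, 0, 1]
print(best_threshold(scores, labels))  # 0.3
```

In the studies, the F_{1}-measure is averaged over the bagged predictors before maximizing, but the principle is the same.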
Prediction of regulations on the updated network:

Bagged MLNs

| λ    | TPR  |
|------|------|
| 50   | 64.7 |
| 100  | 72.6 |
| 500  | 80.4 |
| 750  | 84.3 |
| 1000 | 90.2 |
| 2000 | 88.2 |
| 5000 | 84.3 |

Bagged pairwise SVMs

| C     | Pairwise sum TPR | Sum TPR |
|-------|------------------|---------|
| 0.001 | 90.2 | 58.8 |
| 0.01  | 88.3 | 58.8 |
| 0.1   | 88.3 | 58.8 |
| 1     | 74.5 | 52.9 |
| 10    | 64.7 | 43.1 |
| 100   | 64.7 | 43.1 |
| 1000  | 64.7 | 43.1 |
Prediction with a new set of genes
For the third statistical analysis, we addressed a network completion task in which new candidate nodes are added. We used a dataset refined in the biology laboratory: 209 high-confidence genes differentially expressed in prcID2 versus the corresponding control cells were identified. From these genes, we selected 37 that were not part of ${\mathcal{G}}_{A}$ and for which we had an annotation for each predicate. These genes were also chosen among those having at least one regulation link with one of the genes from ${\mathcal{G}}_{A}$ or with another gene of this new set. From these 37 genes, we selected a subset of 24 genes, called ${\mathcal{G}}_{B}$, that had at least one GO biological process annotation in common with genes from ${\mathcal{G}}_{A}$. The goal of this study was to complete a known network using an additional set of candidate genes, which is usually the problem of interest for biologists. We used Ingenuity to retrieve the known regulations between genes from ${\mathcal{G}}_{A}$ and ${\mathcal{G}}_{B}$, being aware that when no regulation is mentioned in the literature, it does not mean that the regulation does not exist but only that it has not been discovered yet. We called this set ${R}_{3}^{+}$.
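The GO-based filtering used to build ${\mathcal{G}}_{B}$ can be sketched as follows; the function name and the data layout (gene mapped to a set of GO term identifiers) are assumptions for illustration:

```python
def shared_go_filter(candidates, ga_annotations, candidate_annotations):
    """Keep candidate genes sharing at least one GO biological process
    annotation with the reference gene set G_A (hypothetical layout:
    each dict maps a gene name to a set of GO term identifiers)."""
    ga_terms = set().union(*ga_annotations.values())
    return [g for g in candidates
            if candidate_annotations.get(g, set()) & ga_terms]

# Illustrative toy annotations (not the study's actual data).
ga = {"ID2": {"GO:0008283"}, "TP53": {"GO:0006915"}}
cand_ann = {"MYC": {"GO:0008283"}, "KRT1": {"GO:0031424"}}
kept = shared_go_filter(["MYC", "KRT1"], ga, cand_ann)
```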
Prediction of regulations between the sets of genes ${\mathcal{G}}_{A}$ and ${\mathcal{G}}_{B}$:

Bagged MLNs

| λ     | AUC-ROC | AUC-PR |
|-------|---------|--------|
| 50    | 72.8 | 6.7 |
| 100   | 73.1 | 7.7 |
| 500   | 73.2 | 9.2 |
| 750   | 73.4 | 9.5 |
| 1000  | 73.1 | 9.5 |
| 5000  | 73.0 | 9.8 |
| 10000 | 72.8 | 9.5 |

Bagged pairwise SVMs

| C     | Pairwise sum AUC-ROC | Pairwise sum AUC-PR | Sum AUC-ROC | Sum AUC-PR |
|-------|------|-----|------|-----|
| 0.001 | 62.8 | 4.0 | 66.2 | 7.8 |
| 0.01  | 62.8 | 4.0 | 66.2 | 7.8 |
| 0.1   | 62.8 | 4.0 | 66.2 | 7.8 |
| 1     | 65.3 | 7.7 | 67.4 | 8.6 |
| 10    | 65.4 | 6.1 | 67.5 | 8.3 |
| 100   | 65.4 | 6.1 | 67.5 | 8.3 |
| 1000  | 65.4 | 6.1 | 67.5 | 8.3 |
Although each predictor was trained on a balanced dataset, with the same numbers of positive and negative examples of regulation, this test was performed under realistic conditions: we considered the whole set of positive ($|R_3^+| = 55$) and negative examples ($|\overline{R_3^+}| = 2969$) to assess the prediction performance. On the test interactions, the predictor built with bagged MLNs performed quite well, with an AUC-ROC of about 0.73. This is a strong result, implying little degradation in performance; in particular, the false positive rate only slightly increases. The AUC values obtained with bagged MLNs are above those obtained with the two bagged SVMs. We performed a statistical test to compare the AUC-ROC values obtained with the different classifiers, using the non-parametric test based on Mann-Whitney statistics developed by [51] and the implementation provided by the R package pROC [52]. The resulting p-values are given in Additional file 4. We observe from these results that the p-values are less than 0.05, and therefore that the AUC-ROC values of bagged MLNs and bagged pairwise sum SVMs are significantly different. Regarding the comparison between bagged MLNs and bagged sum SVMs, the difference between AUC-ROC values is not significant, indicating similar predictive performance.
In terms of AUC-PR, bagged MLNs outperform the best pairwise SVM. Therefore, in a real prediction task, i.e. a network completion task, MLN exhibits a very interesting behaviour, even if the AUC-PR still needs to be improved.
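The AUC-ROC compared above is exactly the normalized Mann-Whitney U statistic: the probability that a random positive pair is scored above a random negative pair. A minimal sketch (not the pROC implementation used in the paper):

```python
def auc_roc(scores, labels):
    """AUC-ROC as the normalized Mann-Whitney U statistic: the fraction
    of (positive, negative) pairs where the positive outranks the
    negative, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```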
Prediction of regulations between the sets of genes ${\mathcal{G}}_{A}$ and ${\mathcal{G}}_{B}$ when using only gene expression data as descriptors:

Bagged MLNs

| λ     | AUC-ROC | AUC-PR |
|-------|---------|--------|
| 50    | 61.5 | 2.4 |
| 100   | 62.5 | 2.5 |
| 500   | 59.5 | 2.3 |
| 750   | 64.6 | 2.5 |
| 1000  | 64.9 | 2.5 |
| 5000  | 64.0 | 2.5 |
| 10000 | 62.7 | 2.4 |

Bagged pairwise SVMs

| C     | Pairwise sum AUC-ROC | Pairwise sum AUC-PR | Sum AUC-ROC | Sum AUC-PR |
|-------|------|-----|------|-----|
| 0.001 | 60.2 | 3.0 | 62.8 | 3.9 |
| 0.01  | 60.2 | 3.0 | 62.8 | 3.9 |
| 0.1   | 60.2 | 3.0 | 62.8 | 3.9 |
| 1     | 62.8 | 4.2 | 64.8 | 6.4 |
| 10    | 60.9 | 4.8 | 64.0 | 6.1 |
| 100   | 60.9 | 4.8 | 64.0 | 6.1 |
| 1000  | 60.9 | 4.8 | 64.0 | 6.1 |
To conclude, we have shown in this section that the bagged sum SVM performs well in Task 1 and Task 3, while the bagged pairwise sum SVM performs well in Task 2. Contrary to the SVM classifiers, MLNs behaved well in all three tasks. Another interesting criterion for choosing a network inference method is its ability to provide insight into the decisions it makes.
Resulting logical rules
 1. ProtLoccell(g_{2},Plasma_membrane) ∧ Expsiid2(g_{2},Level3) ∧ Expsiid2(g_{1},Level3) ⇒ Regulates(g_{1},g_{2})
 2. Processbio(g_{2},Cell_proliferation) ∧ Processbio(g_{1},Negative_regulation_of_cell_proliferation) ⇒ Regulates(g_{1},g_{2})
 3. Expsiid2(g_{1},Level3) ∧ Expprcid2(g_{1},Level4) ∧ Expsiid2(g_{2},Level4) ∧ Expprcid2(g_{2},Level5) ⇒ Regulates(g_{1},g_{2})
 4. Expprcid2(g_{1},Level5) ∧ Expwt(g_{2},Level2) ∧ Expprcid2(g_{2},Level4) ⇒ Regulates(g_{1},g_{2})
The first rule means that a gene overexpressed under transient knock-down of ID2 regulates genes that are overexpressed in the same condition and that code for proteins located in the plasma membrane. Obviously, this rule alone is far too general, but within a set of rules with positive and negative weights it brings a piece of evidence for regulation. The second rule may seem trivial, but it was retrieved from data: it says that genes involved in negative regulation of cell proliferation regulate genes involved in cell proliferation. The third rule means that an increase of the expression of G_{1} and G_{2} in the condition of overexpression of ID2, compared to transient knock-down of ID2, indicates a regulation between G_{1} and G_{2}. The last rule indicates that a high expression value of G_{1} in the prcID2 condition, together with an increase of the expression of G_{2} between the wild-type condition and prcID2, implies the existence of a regulation between these two genes.
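The way weighted rules jointly contribute evidence can be caricatured as follows. The weights, the fact encoding, and the `regulation_score` helper are invented for illustration; real MLN inference computes probabilities over ground atoms rather than applying a simple sigmoid to a sum of rule weights:

```python
import math

# Hypothetical weighted rules: (weight, premise), where a premise is a
# predicate test over the facts known about the ordered pair (g1, g2).
RULES = [
    (1.2, lambda f: f.get("expsiid2_g1") == "Level3"
                    and f.get("expsiid2_g2") == "Level3"),
    (0.7, lambda f: f.get("processbio_g1") == "Negative_regulation_of_cell_proliferation"
                    and f.get("processbio_g2") == "Cell_proliferation"),
]

def regulation_score(facts):
    """Sum the weights of satisfied rules and squash to (0, 1); a crude
    stand-in for MLN inference over the 'regulates' atom."""
    total = sum(w for w, premise in RULES if premise(facts))
    return 1.0 / (1.0 + math.exp(-total))
```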
These rules are examples of what was obtained in a first attempt to build a complete strategy for supervised edge prediction. However, the quality of the learnt rules strongly depends on the nature of the chosen predicates and on the ILP learning phase. We note that a substantial improvement can be reached in terms of rules if the biologist makes some constraints on the rules explicit. For instance, one might want rules whose premises include relations on both input genes. We will favor this research direction in the future.
Further information can be extracted from the learnt MLN by examining how often each predicate appears in the premises of the rules. In our experimental studies, the chromosomal location of genes did not appear to be an important property for concluding about regulation.
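Tallying predicate occurrences across rule premises, as done for the chromosomal-location observation, can be sketched as follows (the rule encoding as premise lists is an assumed illustration):

```python
from collections import Counter

def predicate_counts(rules):
    """Count how often each predicate symbol occurs in the premises of
    the learnt rules, e.g. to spot rarely used gene properties."""
    counts = Counter()
    for premises, _conclusion in rules:
        for atom in premises:
            counts[atom.split("(")[0]] += 1
    return counts

# Two of the learnt rules, encoded as (premise atoms, conclusion).
rules = [
    (["ProtLoccell(g2,Plasma_membrane)", "Expsiid2(g2,Level3)",
      "Expsiid2(g1,Level3)"], "Regulates(g1,g2)"),
    (["Expsiid2(g1,Level3)", "Expprcid2(g1,Level4)"], "Regulates(g1,g2)"),
]
counts = predicate_counts(rules)
```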
Conclusions
Recent years have witnessed the pre-eminence of numerical and statistical machine learning tools in computational biology. Among them, kernel-based methods present numerous advantages, including the ability to deal with heterogeneous sources of information by encoding them into similarities (kernels). On top of that, multiple kernel learning allows one to select sources of information through the learning of a sparse linear combination of kernels [19, 53, 54]. However, kernel-based methods remain black boxes: with nonlinear kernels, the decision function learnt by an SVM is not interpretable. This is an inherent drawback of SVMs, because biologists are generally interested not only in the prediction made by a classifier but also in the reason why an example has been labeled in a given way.
This work explores another direction through a new hybrid tool based on first-order logic representation and probabilistic graphical modeling. Once learnt, an MLN provides a weighted set of rules that conclude on the target predicate, here the regulates predicate. To our knowledge, this work is the first application of MLNs to gene regulatory network inference and one of the very first real applications of MLNs to noisy, medium-scale biological data. As described in the previous sections, learning an MLN involves several steps, including data encoding, the choice of constraints and hyperparameters in the ILP learner and the weight learner, and an appropriate learning protocol for the task at hand. All these steps require a high level of collaboration between biologists and computer scientists, which is facilitated by the common language of first-order logic. On the one hand, the encoding process can therefore be seen as a limitation, since each new application requires specific work on the choice and definition of the predicates to be used; compared to kernel design, this step is expensive. On the other hand, it produces a corpus of interpretable facts and rules encoding the nature of the relationships between genes that the biologist can inspect. Moreover, it is worth pointing out that, in this context, it is relatively easy to impose known rules or to perform incremental learning at the level of the rule learner. There is also much relevant information describing genes that we did not incorporate. For instance, adding knowledge of regulatory motifs of genes and DNA-binding sites of regulatory proteins could improve the performance of the predictor. This means that a proper representation of sequences should be provided, either directly in first-order logic as was done in [55], or using an extension of first-order logic with sequence variables such as that of [56]. This is certainly a direction to be explored in future work.
Another issue is scalability to larger networks composed of thousands of genes. This would also be a concern for pairwise kernel-based methods, for which, in the latter task, computing the Gram matrix between training and test data becomes expensive. For MLNs, scaling to thousands of genes should be made possible by the latest improvements in MLN learning implemented in FELIX [57], which uses dual decomposition.
Another interesting question is how MLNs compare with decision trees. Decision trees are usually built from attribute-value representations but have been extended to first-order logic in [58]. They also provide a set of interpretable rules, but in a less general form than MLNs: in a decision tree, rules are factorisable and a given example to be classified satisfies only one rule, whereas in an MLN devoted to supervised classification a given example can satisfy many rules. Interestingly, combining decision trees with MLN learning to obtain compact representations of MLNs has recently been proposed in [59].
Finally, the biologist interested in the ID2 genetic regulatory network in human keratinocytes gets two main results from this work, in addition to a set of facts and rules describing the network. First, learning such a supervised pairwise classifier can be seen as a cross-validation of both the experiments and the existing literature. The ability of the learning algorithm to build a good edge prediction tool shows that text mining and careful curation can produce consistent networks; conversely, the experimental data measured in the wet laboratory are shown to make sense. Second, the last in silico study provides a list of predicted regulations involving new candidate genes; some of these are known, but others, currently considered false positives, may involve new regulators and new targets. This calls for wet-lab experimental validation to test the relevance of the potential new regulations.
Declarations
Acknowledgements
We thank Xavier Gidrol (CEA) and Vincent Frouin (CEA) for their fruitful comments at the beginning of the study. This work was supported by the Agence Nationale de la Recherche [grant ANR-05-ARA]. The work of CB and FAB was completed using funding from [ANR-09-SYSC-009].
References
1. Levine M, Davidson EH: Gene regulatory networks for development. PNAS. 2005, 102 (14): 4936-4942.
2. Lawrence N, Girolami M, Rattray M, Sanguinetti G (Eds): Learning and Inference in Computational Systems Biology. 2010, Cambridge: MIT Press.
3. Sima C, Hua J, Jung S: Inference of gene regulatory networks using time-series data: a survey. Curr Genomics. 2009, 10 (6): 416-429.
4. Qian J, Lin J, Luscombe NM, Yu H, Gerstein M: Prediction of regulatory networks: genome-wide identification of transcription factor targets from gene expression data. Bioinformatics. 2003, 19 (15): 1917-1926.
5. Mordelet F, Vert JP: SIRENE: supervised inference of regulatory networks. Bioinformatics. 2008, 24 (16): i76-i82.
6. Mordelet F, Vert JP: A bagging SVM to learn from positive and unlabeled examples. ArXiv e-prints. 2010.
7. Cerulo L, Elkan C, Ceccarelli M: Learning gene regulatory networks from only positive and unlabeled data. BMC Bioinformatics. 2010, 11: 228.
8. Margolin A, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera R, Califano A: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006, 7 (Suppl 1): S7.
9. Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski, Cottarel G, Kasif S, Gardner TS: Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007, 5: e8.
10. Zoppoli P, Morganella S, Ceccarelli M: TimeDelay-ARACNE: reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinformatics. 2010, 11: 154.
11. Schafer J, Strimmer K: A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol. 2005, 4: Article 32.
12. de la Fuente A, Bing N, Hoeschele I, Mendes P: Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics. 2004, 20 (18): 3565-3574.
13. Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P: Inferring regulatory networks from expression data using tree-based methods. PLoS ONE. 2010, 5: e12776.
14. Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. J Comput Biol. 2000, 7: 601-620.
15. Gardner TS, di Bernardo D, Lorenz D, Collins JJ: Inferring genetic networks and identifying compound mode of action via expression profiling. Science. 2003, 301 (5629): 102-105.
16. Chen KC, Wang TY, Tseng HH, Huang CYF, Kao CY: A stochastic differential equation model for quantifying transcriptional regulatory network in Saccharomyces cerevisiae. Bioinformatics. 2005, 21 (12): 2883-2890.
17. Bansal M, Gatta GD, di Bernardo D: Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics. 2006, 22 (7): 815-822.
18. Yamanishi Y, Vert JP, Kanehisa M: Protein network inference from multiple genomic data: a supervised approach. Bioinformatics. 2004, 20: i363-i370.
19. Kato T, Tsuda K, Asai K: Selective integration of multiple biological data for supervised network inference. Bioinformatics. 2005, 21 (10): 2488-2495.
20. Ben-Hur A, Noble WS: Kernel methods for predicting protein-protein interactions. Bioinformatics. 2005, 21 (suppl 1): i38-i46.
21. Martin S, Roe D, Faulon JL: Predicting protein-protein interactions using signature products. Bioinformatics. 2005, 21: 218-226.
22. Hue M, Vert JP: On learning with kernels for unordered pairs. Proceedings of the 27th International Conference on Machine Learning; Haifa, Israel. Edited by: Fürnkranz J, Joachims T. 2010, Omnipress, 463-470.
23. Bleakley K, Biau G, Vert JP: Supervised reconstruction of biological networks with local models. Bioinformatics. 2007, 23 (13): i57-i65.
24. Ceccarelli M, Cerulo L: Selection of negative examples in learning gene regulatory networks. Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine Workshop. 2009, Washington: IEEE Computer Society, 56-61.
25. Cerulo L, Paduano V, Zoppoli P, Ceccarelli M: A negative selection heuristic to predict new transcriptional targets. BMC Bioinformatics. 2013, 14 (Suppl 1): S3.
26. Elkan C, Noto K: Learning classifiers from only positive and unlabeled data. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08; Las Vegas, NV, USA. 2008, New York: ACM, 213-220.
27. Giordana A, Saitta L: Phase transitions in relational learning. Mach Learn. 2000, 41 (2): 217-251.
28. Richardson M, Domingos P: Markov Logic: a unifying framework for statistical relational learning. Introduction to Statistical Relational Learning. Edited by: Getoor L, Taskar B. 2007, Cambridge: The MIT Press, 339-371.
29. Richardson M, Domingos P: Markov Logic networks. Mach Learn. 2006, 62 (1-2): 107-136.
30. Kubat M, Matwin S: Addressing the curse of imbalanced training sets: one-sided selection. Proceedings of the 14th International Conference on Machine Learning; Nashville, Tennessee, USA. Edited by: Douglas HF. 1997, San Francisco: Morgan Kaufmann, 179-186.
31. Tao D, Tang X, Li X, Wu X: Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Trans Pattern Anal Mach Intell. 2006, 28 (7): 1088-1099.
32. Langlands K, Down GA, Kealey T: Id proteins are dynamically expressed in normal epidermis and dysregulated in squamous cell carcinoma. Cancer Res. 2000, 60: 5929-5933.
33. De Raedt L: Logical and Relational Learning. 2008, Berlin, Heidelberg: Springer.
34. The Gene Ontology Consortium: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29.
35. Srinivasan A: The Aleph manual. 2007, [http://www.comlab.ox.ac.uk/activities/machinelearning/Aleph/aleph.html]
36. Muggleton S, De Raedt L: Inductive logic programming: theory and methods. J Logic Program. 1994, 19-20: 629-679.
37. Singla P, Domingos P: Discriminative training of Markov Logic Networks. Proceedings of the 20th National Conference on Artificial Intelligence, Volume 2; Pittsburgh. 2005, Menlo Park: The AAAI Press, 868-873.
38. Lowd D, Domingos P: Efficient weight learning for Markov Logic Networks. Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases; Warsaw, Poland. 2007, Berlin Heidelberg: Springer, 200-211.
39. Huynh T, Mooney R: Discriminative structure and parameter learning for Markov Logic Networks. Proceedings of the 25th International Conference on Machine Learning. 2008, Helsinki: Omnipress, 416-423.
40. Huynh TN, Mooney RJ: Max-margin weight learning for Markov Logic Networks. Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part I; Bled, Slovenia. 2009, Berlin Heidelberg: Springer, 564-579.
41. Liu DC, Nocedal J: On the limited memory BFGS method for large scale optimization. Math Program. 1989, 45 (3): 503-528.
42. Kok S, Sumner M, Richardson M, Singla P, Poon H, Lowd D, Wang J, Domingos P: The Alchemy system for statistical relational AI. Tech. rep., Department of Computer Science and Engineering, University of Washington, Seattle, WA; 2009. http://alchemy.cs.washington.edu
43. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol). 1995, 57: 289-300.
44. Li Y, Liu L, Bai X, Cai H, Ji W, Guo D, Zhu Y: Comparative study of discretization methods of microarray data for inferring transcriptional regulatory networks. BMC Bioinformatics. 2010, 11: 520.
45. Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Kerrien S, Khadake J, Kerssemakers J, Leroy C, Menden M, Michaut M, Montecchi-Palazzi L, Neuhauser N, Orchard S, Perreau V, Roechert B, van Eijk K, Hermjakob H: The IntAct molecular interaction database in 2010. Nucleic Acids Res. 2010, 38 (Database issue): 525-531.
46. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2011, 39 (suppl 1): D52-D57.
47. Vert JP, Qiu J, Noble W: A new pairwise kernel for biological network inference with support vector machines. BMC Bioinformatics. 2007, 8 (Suppl 10): S8.
48. Kashima H, Kato T, Yamanishi Y, Sugiyama M, Tsuda K: Link propagation: a fast semi-supervised learning algorithm for link prediction. Proceedings of the 9th SIAM International Conference on Data Mining; Sparks, Nevada, USA. 2009, SIAM, 1099-1110.
49. Kashima H, Oyama S, Yamanishi Y, Tsuda K: Cartesian kernel: an efficient alternative to the pairwise kernel. IEICE Trans Inf Syst. 2010, E93-D (10): 2672-2679.
50. Davis J, Goadrich M: The relationship between precision-recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning; Pittsburgh. 2006, Omnipress, 233-240.
51. DeLong ER, DeLong DM, Clarke-Pearson DL: Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988, 44 (3): 837-845.
52. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Muller M: pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011, 12: 77.
53. Lanckriet GRG, Deng M, Cristianini N, Jordan MI, Noble WS: Kernel-based data fusion and its application to protein function prediction in yeast. Proceedings of the Pacific Symposium on Biocomputing; Hawaii. 2004, Singapore: World Scientific, 300-311.
54. Gönen M, Alpaydin E: Multiple kernel learning algorithms. J Mach Learn Res. 2011, 12: 2211-2268.
55. Muggleton S, King RD, Stenberg MJ: Protein secondary structure prediction using logic-based machine learning. Protein Eng. 1992, 5 (7): 647-657.
56. Kutsia T, Buchberger B: Predicate logic with sequence variables and sequence function symbols. MKM, Volume 3119 of Lecture Notes in Computer Science. Edited by: Asperti A, Bancerek G, Trybulec A. 2004, Berlin Heidelberg: Springer-Verlag, 205-219.
57. Niu F, Zhang C, Re C, Shavlik J: Scaling inference for Markov Logic via dual decomposition. Proceedings of the 12th IEEE International Conference on Data Mining (ICDM); Brussels, Belgium. 2012, Washington: IEEE Computer Society, 1032-1037.
58. Blockeel H, De Raedt L: Top-down induction of first-order logical decision trees. Artif Intell. 1998, 101 (1-2): 285-297.
59. Khosravi H, Schulte O, Hu J, Gao T: Learning compact Markov logic networks with decision trees. Mach Learn. 2012, 89 (3): 257-277.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.