### Mixture of linear regressions

In the following we want to model the observed expression level of all *N* genes, using different linear combinations of the *M* different regulatory signals associated with the promoters (i.e. binding affinities for various TFs and different histone modifications). To this end let *y*_{
i
} be the gene expression level of gene *i* (the dependent variable) and *x*_{
i
} be a corresponding vector of *M* regulatory signals (the regressor variables). The single linear regression model is then defined as

*y*_{
i
}*=b*_{0}*+ x*_{
i
}*B*^{T} + *∊*_{i}, (1)

where *B* is a vector (*b*_{1}, …,*b*_{
M
}) representing regression coefficients and *∊*_{i} is an error term. For mathematical convenience, we redefine the vector with the regressor variables to be *x*_{
i
} = (1,*x*_{
i
}_{1}, …,*x*_{
iM
}) and include the bias parameter *b*_{0} in the beginning of *B*, that is *B =* (*b*_{0}, *b*_{1},…, *b*_{
M
}). Assuming the error *∊* follows a Normal distribution with standard deviation *σ*^{2}, the linear regression model has the following distribution

ℙ(*y*_{
i
}*|x*_{
i
}, *B*,*σ*^{2} ) *=* **N**(*y*_{
i
}*|x*_{
i
}*B*^{T},σ^{2} ). (2)

A mixture of linear regression models is defined as a convex summation of *K* distributions

where Π = (*π*_{1}, …,*π*_{
K
}) are the mixture coefficients, which respect *π*_{
k
} ≥ 0 and , and Θ are the model parameters (Π, *B*_{1},…, *B*_{
K
}, , …, ).

For a given data *X* and *Y*, where *X* is a set on *N* observations *x*_{
i
} and *Y* a vector with *N* observations *y*_{
i
}, the mixture of linear regression models can be estimated with the Expectation-Maximization algorithm [14, 24]. We resort to Maximum-a-posteriori (MAP) estimates of the parameters, as described in the next section, to avoid over-fitting [25]. The EM works by finding estimates Θ maximizing the posterior distribution over the data *X* and *Y*

ℙ(Θ|*X*, *Y*, *Z*) ≈ ℙ(*Y*, *Z*|*X*,Θ)ℙ(Θ) (4)

where *Z* is the vector of hidden variables with *z*_{
i
} ∈ {1,…, *K*} indicating which linear model an observation *i* belongs to. ℙ(Θ) is the prior distribution over the model parameters (see next section for the definition of the prior). ℙ(*Y*, *Z|X*, Θ) is the complete data likelihood and is given by:

where *r*_{
ik
} is the posterior probability (or responsibility) [25] that observation *i* belongs to the linear model *k* and is given by:

For further details on mixture models we refer the reader to [25].

The EM algorithm works by iteratively estimating the model assignments (*r*_{
ik
}) and the model parameters Θ until some convergence criteria is reached. In the context of the mixture of linear regression models, we need estimates of the linear regression parameters (*B*_{
k
}, ) for a particular model *k*, and all other parameters (*r*_{
ik
},Π) follow the usual EM algorithm [25].

Once the mixture model is estimated, the predicted value *ŷ*_{
i
} for a particular regressor observation *x*_{
i
} is given by

That is, the linear regression prediction is a mixture of the predictions of each individual component times the posterior probability of the observation *i* to belong to the model *k.* In our particular application problem, we are interested in estimating the models which corresponds to an unsupervised learning problem, that is, the coefficients indicating whether a regulatory signal plays an important repressive or activating role. The predictions *ŷ* can thereby be used for evaluating the fit of our model. In cases where one wants estimate the expression level of genes, that is, estimation of *ŷ* (supervised learning problem), the above equation should not be used, as the posterior probabilities are based on the response variable *y*, which is usually unknown in a predictive scenario. In such a context, methods for combinations of predictors, such as [26], are required.

### Bayesian linear regression estimates

We resort to Bayesian approach for obtaining MAP estimates of the linear regression models as proposed in [27]. Therefore, we avoid problems related to over-fitting which usually occur with the EM algorithm and mixture models [25]. More formally, the prior distribution in Eq. 4 can be decomposed as

We use the following conjugate prior for the regression coefficient *B*_{
k
}

ℙ(*B*_{
k
}) = *N*(*B*_{
k
}|0, *β*_{
k
}**I**), (9)

where 0 is a vector with *M* zeros, **I** is a *M* x *M* identity matrix and *β*_{
k
} is the hyper-parameter.

Let *r*_{
k
} be an *N* dimensional vector (*r*_{1}_{
k
}, …,*r*_{
Nk
}) containing the posterior probabilities of the observations belonging to model *k* and let *W*_{
k
} *=* diag(*r*_{
k
}), then the estimates from model *k* maximizing Eq. 4 are defined as

with

From Eq. 10, we can see that *β*_{
k
} works by shrinking the regression coefficients. Small *β*_{
k
} imposes a higher shrinkage on the regression coefficients. Furthermore, for *β*_{
k
} → ∞ we have a non-informative prior and the regression coefficients are the maximum likelihood estimates.

We estimate the hyper-parameter *β*_{
k
} in an Empirical Bayes approach with

where

and *λ*_{
j
} is the *jth* eigenvalue of the PCA decomposition of matrix (see [27] for details).

Note that *β*_{
k
} requires the definition of *B*_{
k
}, which in our context is taken from the previous iteration of the EM algorithm.

For the mixture mixing coefficients, we use a symmetric Dirichlet distribution as prior

ℙ(Π) = Dirichlet(Π|α), (14)

where *α* is the hyper-parameter. Hence, the mixing coefficients estimates used by the EM algorithm are

We use a prior of *α =* 2, which avoids models with a low number of observations assigned to it.

### Transcription factor affinity

TF binding motifs are traditionally described in the form of position frequency matrices (PFMs). PFMs show how often a certain base occurs at a given position in the alignments of known binding sites of the TF. To predict the binding strength of a given TF to a promoter sequences we utilize the TRAP method [17]. In contrast to motif matching algorithms which make a binary distinction between binding sites and non-binding sites, TRAP avoids this artificial separation and instead computes the probability of a TF to bind site *i* in the sequence using the following equation

where *δE*_{
i
}(*λ*) is the energy difference between the state in which the factor is bound to site *i* and the state in which the factor is bound to its consensus site. This so called mismatch energy is scaled by a parameter *λ* which was previously determined to have an optimal value of 0.7 [17]. The second transcription factor dependent parameter *R*_{0} determines both, the binding energy between the factor and its consensus site as well as the TF concentration. *R*_{0} is derived for each PFM individually as

*R*_{0} = exp(0.6 • *W* – 6), (17)

where *W* is the number of columns in the PFM with information content exceeding 0.1 bits. Matrix positions which fall below this entropy cutoff also do not contribute to the mismatch energy in Eq. 16. The nucleotide dependent mismatch energies for each site in the promoter sequence are computed by

where *v*_{
i
}_{,}_{
max
} is the frequency of the consensus base at position *i* in the PFM and *v*_{
i
}_{,}_{
α
}, is the frequency of the observed base *α* at position *i* in the PFM. Eventually, TRAP obtains the expected number *N* of TFs bound to the promoter by summing over the individual probabilities from all *L* sites in the sequence:

As input, TRAP requires for each TF a PFM suitable for computing the mismatch energies and a DNA sequence of interest (see [17] for details).

For our study we use a selection of 102 PFMs from the Transfac database version 11.1 [28], which correspond to TFs involved in lymphoid development (see Additional File 1 for TF list). As we are mainly interested in binding sites near the promoter, the analysis was based on the 200 base pairs upstream of the transcription start site (TSS) of the genes. We restrict the analysis to genes with normalized CpG content < 0.5 in their promoter sequence [21], as such genes tends to be expressed in a tissue and stage specific way. In the end, we calculate the affinity (Eq. 19) for all the selected genes and PFMs. This yields the matrix *X* containing the TF binding data, where *x*_{
i
}_{,}_{
j
} corresponds to the affinity of TF *j* to the promoter of gene *i.*

### T-cell gene expression and histone modification data

We use the gene expression and histone modification data from Th1, Th2, Th17 and iTreg cells published by [16]. The histone modification data was measure with the Chip-Seq Illumina platform. We used the Cisgenome tool [29] to align sequence data and to detect peaks. As we are only interested in the modifications near the promoter, we consider the region of 8000 bps upstream and 2000 bps downstream of the TSS and kept the tag counts of the highest peak. Finally, we added a pseudo count to avoid zero values and applied a log transform. This yields the matrix *X* containing the histone modification data, where *x*_{
i
}_{,}_{
j
} corresponds to the number of ChIP-seq tags derived from a particular histone modification *j* that are being mapped to the promoter of gene *i.*

The expression data was measured with Affymetrix 430 chips. The raw data has been normalized using the variance stabilization method of [30] and normalized the tissues to have mean expression equal to zero.Microarray probes were mapped to ENSEMBL gene identifiers with the help of the biomart tool [31]. We thereby kept all genes that had their expression measured by multiple probe sets. In the following, we restrict our analysis to those 6154 genes with low CpG content for which both, gene expression as well as histone modification data is available. The final data sets used in this analysis can be found at http://www.cin.ufpe.br/~igcf/MixLin.

### Experimental design

We model gene expression from four different T helper cell types (Th1, Th2, Th17 and iTreg) with the use of either transcription factor affinities (TF), histone modifications (HM) or both regulatory signals combined (HM+TF). As parameter of our method, we vary the number of linear models, *K*, from 1 to 6. In order to select the optimal model for each cell type, we first perform 10 fold cross-validation on each parameter setting and then estimate the Mean Square Errors (MSE) from the validation sets. As the MSE tends to decrease with higher *K*[32], we use a model selection procedure, the Bayesian Information Criteriun (BIC) [25], to indicate the optimal number of models. The method has been implemented with Pymix [33] and is freely available at http://www.pymix.org.