Volume 12 Supplement 1

## Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011)

# Predicting gene expression in T cell differentiation from histone modifications and transcription factor binding affinities by linear mixture models

- Ivan G Costa
^{1}Email author, - Helge G Roider
^{2}, - Thais G do Rego
^{1}and - Francisco de AT de Carvalho
^{1}

**12(Suppl 1)**:S29

https://doi.org/10.1186/1471-2105-12-S1-S29

© Costa et al; licensee BioMed Central Ltd. 2011

**Published: **15 February 2011

## Abstract

### Background

The differentiation process from stem cells to fully differentiated cell types is controlled by the interplay of chromatin modifications and transcription factor activity. Histone modifications or transcription factors frequently act in a multi-functional manner, with a given DNA motif or histone modification conveying both transcriptional repression and activation depending on its location in the promoter and other regulatory signals surrounding it.

### Results

To account for the possible multi functionality of regulatory signals, we model the observed gene expression patterns by a mixture of linear regression models. We apply the approach to identify the underlying histone modifications and transcription factors guiding gene expression of differentiated CD4+ T cells. The method improves the gene expression prediction in relation to the use of a single linear model, as often used by previous approaches. Moreover, it recovered the known role of the modifications H3K4me3 and H3K27me3 in activating cell specific genes and of some transcription factors related to CD4+ T differentiation.

## Keywords

## Background

All cells in a multi-cellular organism arise from the same zygote and thus carry the same genetic information. However, complex regulatory programs allow stem cells to differentiate into distinct cell types. For instance, in response to different infectious agents Naive CD4+ T cells differentiate into at least four types of T helper cells—Th1, Th2, Th17, and inducible regulatory T cells (iTregs) [1]. While all of these cell types are involved in the adaptive immune response they serve distinct roles by secreting different cytokines. For example, Th1 acts against mycobacterial infections by releasing IFNγ, which activates the response of macrophages [1] while Th2 cells secrete various interleukins helping B-cells to induce humoral immunity.

On the transcriptional level, the differentiation process from stem cells to fully differentiated cell types is controlled by the interplay of chromatin modifications and transcription factor activity [2]. Chromatin structure is shaped primarily by histones. The presence or absence of these large globular protein complexes determines the accessibility of the promoter regions for the transcriptional machinery and thus performs a high-level control on gene expression [3, 4]. The affinity of histones to DNA is modified by the cell via a large repertoire of post-translational protein modifications including acetylations and methylations.

The resulting epigenetic histone code appears highly intricate, with a given histone frequently carrying several different modifications at a time. Despite this complexity, it has become clear that certain modifications, such as the trimethylation of the lysine 4 residue in the tail of histone H3 (abbreviated H3K4me3) are mainly associated with active promoters while other modifications such as H3K27me3 tend to be associated with inactive promoters [5]. The importance of histone modifications for the differentiation of Naive CD4 T-cells into Th1 cells has recently been verified at [6], which demonstrated that IFNγ expression is controlled by the histone methylation status of its promoter.

Aside from chromatin structure, transcription factors (TFs), play an essential role in controlling cell differentiation by guiding the transcriptional machinery to its target promoters and facilitating the initiation of transcription. For instance, in T-cell differentiation, in vitro studies demonstrated that either high levels of the transcription factor GATA3 or strong signalling via the transcription factor STAT5 is sufficient to determine the Th2 cell fate [1].

Particularly in the context of genome wide studies, computational biology analysis have become an essential component of elucidating the regulatory signals underlying observed gene expression patterns. Usually, the problem of identifying the promoter elements guiding differentiation and cell type specific gene expression is tackled by first selecting the genes which are most specifically expressed in the particular cell type and then performing motif over-representation analysis on their promoter sequences as in [7, 8] (see [9] for a recent review). While such methods allow identifying potentially regulating transcription factors they have the intrinsic drawback of requiring a previous grouping of genes and of being able to explain only the expression of the genes with highest specificity for the condition.

In contrast, linear regression models, as first proposed by [10, 11], combine all regulatory signals in order to explain the expression pattern of the genes. In their work, Bussemaker *e* t al. [10] focused on explaining gene expression based on combinations of predicted TF binding sites. The coefficients of the linear model indicate the importance of a particular regulatory signal. That is, signals which obtain large positive coefficients likely correspond to putative activators while signals with large negative coefficients likely act as suppressors. Recently, Karlic *e* t al. [12] also used a linear regression model in order to estimate promoter activity based on histone modification data. By design, the above approaches assume that a given regulatory signal exert the same regulatory effect on all its target genes.

However, transcription factors and thus their DNA binding motifs frequently act in a bi-functional manner, with a given DNA motif conveying both transcriptional repression and activation depending on its location with respect to the transcription start site (TSS) and the sequence motifs surrounding it. For instance, RUNX1 and RUNX3 have been shown to act both as repressors and activators in different tissues and are involved in determining T-cell fate [13].

In order to find the regression models best explaining the expression data, our method takes as input the matrix *Y* of observed expression profiles from all genes as well as a matrix *X*, containing the regulatory signals for the corresponding promoters (i.e. predicted TF binding affinities and presence of histone modifications). For each gene, it then estimates the coefficient vector *B*, representing the relative importance of each regulatory signal and its effect on gene expression (activation or repression).

We apply this novel approach to identify the underlying regulatory signals guiding gene expression in each of the four differentiated CD4+ T-helper cell types. As potential regulatory signals we consider both, histone modifications (HM) as measured by Chip-Seq [16] as well as predicted binding affinities [17] from a set of TFs related to lymphoiesis [1, 18–20]. As we are mainly interested in cell type specific signals, we restrict the analysis to genes with low CpG content in their promoters [21] as such genes tend to be expressed in a tissue and stage specific manner while genes with high CpG promoter content tend to be broadly expressed. Using this method we expect to improve the gene expression prediction in relation to the use of single linear model, but also to reveal the regulatory roles for histone modifications and transcription factors.

## Results and discussion

### Regulatory signals predicts expression

As a first step, we want to determine which set of regulatory signals, *X*, can explain the observed gene expression data, *Y* , best. To this end, as a first step we supply our algorithm with a matrix *X* containing only predicted TF binding affinities, only histone modification data or both sets of regulatory signals and assess how well the resulting regression models can capture the data. As measure of quality for the different models we thereby compute the mean square error between the predicted gene expression values and the actual measurements (see Methods for details).

*t*-test

*p*-value < 0.01) indicating the advantage of using mixtures to predict expression. The model selection procedure (see Methods) indicates that the data is optimally explained by the combination of 2-4 regression models (see Fig. 2) and that gene expression data can be well predicted based on histone modification data alone.

In contrast, using a single regression model to predict gene expression data based on TF binding affinities alone yields considerably larger MSEs across all cell types (average MSE = 2, see blue bars in Fig. 2). Interestingly, supplying our algorithm with the combined data from both histone modifications and TF binding affinities yields MSEs similar to the ones obtained with only histone modification data alone (see Fig. 2). This indicates that the utilized histone modification and TF binding data cover rather redundant than complementary information about gene expression. As histone modifications and HM+TF affinities yield the solutions with the lowest MSEs, in the following we will continue the analysis with the models based on these data sets.

### Control of Th1 gene expression

Having established that histone modification data together with a mixture of two regression models yields the most significant results we now investigate which type of modifications contribute most strongly to these models. We thereby restrict our analysis to the data from Th1 cells and the corresponding regulatory signals which obtain the largest absolute regression coefficients (results from other cell types closely resemble those from Th1 cells, see Additional File 1 for details).

For model 2, which explains the transcriptional activity of a small subset of highly expressed as well as most of the lowly expressed genes, we again find H3K4me3 to have the strongest positive regression coefficient (*b* = 0.71). This is reflected by a strong association of this modification with the most highly expressed genes of this set (see Fig. 3). In contrast, H3K27me3 obtains a regression coefficient of close to zero (*b* = 0.03) in this model as this modification appears with the same intensity in nearly all genes assigned to model 2.

### Comparison with previous studies

Several computational biology methods have been previously proposed for the use of linear models for predicting gene expression in the context transcription factor binding [10, 11] or histone modifications [12]. In all cases, distinct datasets were used and results are not directly comparable. In relation to [12], the analysis were based on Human naive CD4+ T cells and included 38 histone modifications. Their predicted model obtained a correlation coefficient of 0.72 on genes with low CpG content with HM H3K4me3 and H3K79me1, while our method had a coefficient at the range 0.64 – 0.68 for one model and 0.85 – 0.87 for two models for H3K4me3 and H3K27me3 data. The increase of the correlation coefficient from single linear models to two linear models is an indication that all these approaches would profit from the use of the mixture of linear regressions framework.

## Conclusion

Predicting gene expression from regulatory signals is an important but unmet goal in bioinformatics. In this study, we propose a novel approach which uses mixtures of linear regression models together with transcription factor binding and histone modification data for estimating transcriptional activity of CpG depleted promoters. In addition the approach allows to determine the functional activity of the various regulatory signals. We show that our approach obtains significantly smaller errors in predicting the expression of genes in comparison to simple linear regression models as used in previous approaches. For gene expression data from CD4+ T helper cells we find that both, histone modification data alone and histone modifications together with predicted TF binding affinities, yields the best expression predictors. In accordance with previous dedicated studies we recover the well known regulatory roles of H3K4me3 as an enhancing and H3K27me3 as a repressive regulatory signal for gene expression. Moreover, our predictions suggest that histone modifications act not in a binary on/off fashion but rather in a continuous way with levels of H3K4me3 and H3K27me3 steadily rising or falling over a large range of expression values in a non-linear way. With the use of TF binding affinities, we also partially recover the main factors such as the Stat family involved in T helper cell type specific gene expression. Interestingly, we observe a negative effect of cMyb on expression in all T helper cell types. This raises the question whether MyB, which has been recently showed to promote histone acetylation marks in hematopoiesis [22], could play a role in the down-regulation of genes in T helper cells types.

The advent of next generation sequencing provides an ever growing stock of high quality data for the full range of histone modifications, DNA methylation state and transcription factor occupancy across the entire genome from various cell types and differentiation stages. Several methodological improvements will be required to integrate this wealth of data in order to shed light on the complex interplay between the different regulatory signals acting in eukaryotics. Moreover, in an ideal case where all possible regulatory signals have been measured, advanced feature selection procedures such as postulated by [23], will be vital for the detection of all the players involved in determining gene expression.

## Methods

### Mixture of linear regressions

In the following we want to model the observed expression level of all *N* genes, using different linear combinations of the *M* different regulatory signals associated with the promoters (i.e. binding affinities for various TFs and different histone modifications). To this end let *y*_{
i
} be the gene expression level of gene *i* (the dependent variable) and *x*_{
i
} be a corresponding vector of *M* regulatory signals (the regressor variables). The single linear regression model is then defined as

*y*_{
i
}*=b*_{0}*+ x*_{
i
}*B*^{
T
} + *∊*_{i}, (1)

where *B* is a vector (*b*_{1}, …,*b*_{
M
}) representing regression coefficients and *∊*_{i} is an error term. For mathematical convenience, we redefine the vector with the regressor variables to be *x*_{
i
} = (1,*x*_{
i
}_{1}, …,*x*_{
iM
}) and include the bias parameter *b*_{0} in the beginning of *B*, that is *B =* (*b*_{0}, *b*_{1},…, *b*_{
M
}). Assuming the error *∊* follows a Normal distribution with standard deviation *σ*^{2}, the linear regression model has the following distribution

ℙ(*y*_{
i
}*|x*_{
i
}, *B*,*σ*^{2} ) *=* **N**(*y*_{
i
}*|x*_{
i
}*B*^{
T
},σ^{2} ). (2)

where Π = (*π*_{1}, …,*π*_{
K
}) are the mixture coefficients, which respect *π*_{
k
} ≥ 0 and
, and Θ are the model parameters (Π, *B*_{1},…, *B*_{
K
},
, …,
).

For a given data *X* and *Y*, where *X* is a set on *N* observations *x*_{
i
} and *Y* a vector with *N* observations *y*_{
i
}, the mixture of linear regression models can be estimated with the Expectation-Maximization algorithm [14, 24]. We resort to Maximum-a-posteriori (MAP) estimates of the parameters, as described in the next section, to avoid over-fitting [25]. The EM works by finding estimates Θ maximizing the posterior distribution over the data *X* and *Y*

ℙ(Θ|*X*, *Y*, *Z*) ≈ ℙ(*Y*, *Z*|*X*,Θ)ℙ(Θ) (4)

*Z*is the vector of hidden variables with

*z*

_{ i }∈ {1,…,

*K*} indicating which linear model an observation

*i*belongs to. ℙ(Θ) is the prior distribution over the model parameters (see next section for the definition of the prior). ℙ(

*Y*,

*Z|X*, Θ) is the complete data likelihood and is given by:

*r*

_{ ik }is the posterior probability (or responsibility) [25] that observation

*i*belongs to the linear model

*k*and is given by:

For further details on mixture models we refer the reader to [25].

The EM algorithm works by iteratively estimating the model assignments (*r*_{
ik
}) and the model parameters Θ until some convergence criteria is reached. In the context of the mixture of linear regression models, we need estimates of the linear regression parameters (*B*_{
k
},
) for a particular model *k*, and all other parameters (*r*_{
ik
},Π) follow the usual EM algorithm [25].

*ŷ*

_{ i }for a particular regressor observation

*x*

_{ i }is given by

That is, the linear regression prediction is a mixture of the predictions of each individual component times the posterior probability of the observation *i* to belong to the model *k.* In our particular application problem, we are interested in estimating the models which corresponds to an unsupervised learning problem, that is, the coefficients indicating whether a regulatory signal plays an important repressive or activating role. The predictions *ŷ* can thereby be used for evaluating the fit of our model. In cases where one wants estimate the expression level of genes, that is, estimation of *ŷ* (supervised learning problem), the above equation should not be used, as the posterior probabilities are based on the response variable *y*, which is usually unknown in a predictive scenario. In such a context, methods for combinations of predictors, such as [26], are required.

### Bayesian linear regression estimates

We use the following conjugate prior for the regression coefficient *B*_{
k
}

ℙ(*B*_{
k
}) = N(*B*_{
k
}|0, *β*_{
k
}**I**), (9)

where 0 is a vector with *M* zeros, **I** is a *M* x *M* identity matrix and *β*_{
k
} is the hyper-parameter.

*r*

_{ k }be an

*N*dimensional vector (

*r*

_{1}

_{ k }, …,

*r*

_{ Nk }) containing the posterior probabilities of the observations belonging to model

*k*and let

*W*

_{ k }

*=*diag(

*r*

_{ k }), then the estimates from model

*k*maximizing Eq. 4 are defined as

From Eq. 10, we can see that *β*_{
k
} works by shrinking the regression coefficients. Small *β*_{
k
} imposes a higher shrinkage on the regression coefficients. Furthermore, for *β*_{
k
} → ∞ we have a non-informative prior and the regression coefficients are the maximum likelihood estimates.

and *λ*_{
j
} is the *jth* eigenvalue of the PCA decomposition of matrix
(see [27] for details).

Note that *β*_{
k
} requires the definition of *B*_{
k
}, which in our context is taken from the previous iteration of the EM algorithm.

For the mixture mixing coefficients, we use a symmetric Dirichlet distribution as prior

ℙ(Π) = Dirichlet(Π|α), (14)

*α*is the hyper-parameter. Hence, the mixing coefficients estimates used by the EM algorithm are

We use a prior of *α =* 2, which avoids models with a low number of observations assigned to it.

### Transcription factor affinity

*i*in the sequence using the following equation

where *δE*_{
i
}(*λ*) is the energy difference between the state in which the factor is bound to site *i* and the state in which the factor is bound to its consensus site. This so called mismatch energy is scaled by a parameter *λ* which was previously determined to have an optimal value of 0.7 [17]. The second transcription factor dependent parameter *R*_{0} determines both, the binding energy between the factor and its consensus site as well as the TF concentration. *R*_{0} is derived for each PFM individually as

*R*_{0} = exp(0.6 • *W* – 6), (17)

*W*is the number of columns in the PFM with information content exceeding 0.1 bits. Matrix positions which fall below this entropy cutoff also do not contribute to the mismatch energy in Eq. 16. The nucleotide dependent mismatch energies for each site in the promoter sequence are computed by

*v*

_{ i }

_{,}

_{ max }is the frequency of the consensus base at position

*i*in the PFM and

*v*

_{ i }

_{,}

_{ α }, is the frequency of the observed base

*α*at position

*i*in the PFM. Eventually, TRAP obtains the expected number

*N*of TFs bound to the promoter by summing over the individual probabilities from all

*L*sites in the sequence:

As input, TRAP requires for each TF a PFM suitable for computing the mismatch energies and a DNA sequence of interest (see [17] for details).

For our study we use a selection of 102 PFMs from the Transfac database version 11.1 [28], which correspond to TFs involved in lymphoid development (see Additional File 1 for TF list). As we are mainly interested in binding sites near the promoter, the analysis was based on the 200 base pairs upstream of the transcription start site (TSS) of the genes. We restrict the analysis to genes with normalized CpG content < 0.5 in their promoter sequence [21], as such genes tends to be expressed in a tissue and stage specific way. In the end, we calculate the affinity (Eq. 19) for all the selected genes and PFMs. This yields the matrix *X* containing the TF binding data, where *x*_{
i
}_{,}_{
j
} corresponds to the affinity of TF *j* to the promoter of gene *i.*

### T-cell gene expression and histone modification data

We use the gene expression and histone modification data from Th1, Th2, Th17 and iTreg cells published by [16]. The histone modification data was measure with the Chip-Seq Illumina platform. We used the Cisgenome tool [29] to align sequence data and to detect peaks. As we are only interested in the modifications near the promoter, we consider the region of 8000 bps upstream and 2000 bps downstream of the TSS and kept the tag counts of the highest peak. Finally, we added a pseudo count to avoid zero values and applied a log transform. This yields the matrix *X* containing the histone modification data, where *x*_{
i
}_{,}_{
j
} corresponds to the number of ChIP-seq tags derived from a particular histone modification *j* that are being mapped to the promoter of gene *i.*

The expression data was measured with Affymetrix 430 chips. The raw data has been normalized using the variance stabilization method of [30] and normalized the tissues to have mean expression equal to zero.Microarray probes were mapped to ENSEMBL gene identifiers with the help of the biomart tool [31]. We thereby kept all genes that had their expression measured by multiple probe sets. In the following, we restrict our analysis to those 6154 genes with low CpG content for which both, gene expression as well as histone modification data is available. The final data sets used in this analysis can be found at http://www.cin.ufpe.br/~igcf/MixLin.

### Experimental design

We model gene expression from four different T helper cell types (Th1, Th2, Th17 and iTreg) with the use of either transcription factor affinities (TF), histone modifications (HM) or both regulatory signals combined (HM+TF). As parameter of our method, we vary the number of linear models, *K*, from 1 to 6. In order to select the optimal model for each cell type, we first perform 10 fold cross-validation on each parameter setting and then estimate the Mean Square Errors (MSE) from the validation sets. As the MSE tends to decrease with higher *K*[32], we use a model selection procedure, the Bayesian Information Criteriun (BIC) [25], to indicate the optimal number of models. The method has been implemented with Pymix [33] and is freely available at http://www.pymix.org.

## Authors contributions

IGC, RH, TGR implemented the approach and performed the experiments. IGC, RH and FATC designed the study and evaluated the results. All authors wrote the manuscript. All authors read and approved the final manuscript.

## Declarations

### Acknowledgements and funding

This work has been partially supported by Brazilian research agencies: FACEPE, CNPq and CAPES.

This article has been published as part of *BMC Bioinformatics* Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.

## Authors’ Affiliations

## References

- Zhu J, Paul WE: CD4 T cells: fates, functions, and faults. Blood 2008, 112(5):1557–1569. 10.1182/blood-2008-05-078154PubMed CentralView ArticlePubMedGoogle Scholar
- Goldberg AD, Allis CD, Bernstein E: Epigenetics: a landscape takes shape. Cell 2007, 128(4):635–638. 10.1016/j.cell.2007.02.006View ArticlePubMedGoogle Scholar
- Kouzarides T: Chromatin modifications and their function. Cell 2007, 128(4):693–705. 10.1016/j.cell.2007.02.005View ArticlePubMedGoogle Scholar
- Turner BM: Defining an epigenetic code. Nat Cell Biol 2007, 9: 2–6. 10.1038/ncb0107-2View ArticlePubMedGoogle Scholar
- Bibikova M, Laurent LC, Ren B, Loring JF, Fan JB: Unraveling epigenetic regulation in embryonic stem cells. Cell Stem Cell 2008, 2(2):123–134. 10.1016/j.stem.2008.01.005View ArticlePubMedGoogle Scholar
- Schoenborn JR, Dorschner MO, Sekimata M, Santer DM, Shnyreva M, Fitzpatrick DR, Stamatoyannopoulos JA, Stamatoyonnapoulos JA, Wilson CB: Comprehensive epigenetic profiling identifies multiple distal regulatory elements directing transcription of the gene encoding interferon-gamma. Nat Immunol 2007, 8(7):732–742. 10.1038/ni1474PubMed CentralView ArticlePubMedGoogle Scholar
- Costa IG, Roepcke S, Schliep A: Gene expression trees in lymphoid development. BMC Immunol 2007, 8: 25. 10.1186/1471-2172-8-25PubMed CentralView ArticlePubMedGoogle Scholar
- Costa IG, Roepcke S, Hafemeister C, Schliep A: Inferring differentiation pathways from gene expression. Bioinformatics 2008, 24(13):i156-i164. 10.1093/bioinformatics/btn153PubMed CentralView ArticlePubMedGoogle Scholar
- Bussemaker HJ, Foat BC, Ward LD: Predictive modeling of genome-wide mRNA expression: from modules to molecules. Annu Rev Biophys Biomol Struct 2007, 36: 329–347. 10.1146/annurev.biophys.36.040306.132725View ArticlePubMedGoogle Scholar
- Bussemaker HJ, Li H, Siggia ED: Regulatory element detection using correlation with expression. Nat Genet 2001, 27(2):167–171. 10.1038/84792View ArticlePubMedGoogle Scholar
- Keles S, van der Laan M, Eisen MB: Identification of regulatory elements using a feature selection method. Bioinformatics 2002, 18(9):1167–1175. 10.1093/bioinformatics/18.9.1167View ArticlePubMedGoogle Scholar
- Karlic R, Chung HR, Lasserre J, Vlahovicek K, Vingron M: Histone modification levels are predictive for gene expression. Proc Natl Acad Sci U S A 2010, 107(7):2926–2931. 10.1073/pnas.0909344107PubMed CentralView ArticlePubMedGoogle Scholar
- Woolf E, Xiao C, Fainaru O, Lotem J, Rosen D, Negreanu V, Bernstein Y, Goldenberg D, Brenner O, Berke G, Levanon D, Groner Y: Runx3 and Runx1 are required for CD8 T cell development during thymopoiesis. Proc Natl Acad Sci U S A 2003, 100(13):7731–7736. 10.1073/pnas.1232420100PubMed CentralView ArticlePubMedGoogle Scholar
- DeSarbo W, Cron W: A maximum likelihood methodology for clusterwise linear regression. Journal of Classification 1988, 5(2):249–282. 10.1007/BF01897167View ArticleGoogle Scholar
- Hinton GE, Revow M, Dayan P: Recognizing Handwritten Digits Using Mixtures of Linear Models. In NIPS. Edited by: Tesauro G, Touretzky DS, Leen TK. MIT Press; 1994:1015–1022.Google Scholar
- Wei G, Wei L, Zhu J, Zang C, Hu-Li J, Yao Z, Cui K, Kanno Y, Roh TY, Watford WT, Schones DE, Peng W, Sun HW, Paul WE, O’Shea JJ, Zhao K: Global mapping of H3K4me3 and H3K27me3 reveals specificity and plasticity in lineage fate determination of differentiating CD4+ T cells. Immunity 2009, 30: 155–167. 10.1016/j.immuni.2008.12.009PubMed CentralView ArticlePubMedGoogle Scholar
- Roider HG, Kanhere A, Manke T, Vingron M: Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics 2007, 23(2):134–141. 10.1093/bioinformatics/btl565View ArticlePubMedGoogle Scholar
- Barreda DR, Belosevic M: Transcriptional regulation of hemopoiesis. Dev Comp Immunol 2001, 25(8–9):763–789. 10.1016/S0145-305X(01)00035-0View ArticlePubMedGoogle Scholar
- Matthias P, Rolink AG: Transcriptional networks in developing and mature B cells. Nat Rev Immunol 2005, 5(6):497–508. 10.1038/nri1633View ArticlePubMedGoogle Scholar
- Rothenberg EV, Moore JE, Yui MA: Launching the T-cell-lineage developmental programme. Nat Rev Immunol 2008, 8: 9–21. 10.1038/nri2232PubMed CentralView ArticlePubMedGoogle Scholar
- Roider HG, Lenhard B, Kanhere A, Haas SA, Vingron M: CpG-depleted promoters harbor tissue-specific transcription factor binding signals-implications for motif overrepresentation analyses. Nucleic Acids Res 2009, 37(19):6305–6315. 10.1093/nar/gkp682PubMed CentralView ArticlePubMedGoogle Scholar
- Mo X, Kowenz-Leutz E, Laumonnier Y, Xu H, Leutz A: Histone H3 tail positioning and acetylation by the c-Myb but not the v-Myb DNA-binding SANT domain. Genes Dev 2005, 19(20):2447–2457. 10.1101/gad.355405PubMed CentralView ArticlePubMedGoogle Scholar
- Zou H, Hastie T: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B 2005, 67(2):301–320. 10.1111/j.1467-9868.2005.00503.xView ArticleGoogle Scholar
- Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1977, 39: 1–38.Google Scholar
- McLachlan GJ, Peel D: Finite Mixture Models. Wiley Series in Probability and Statistics., Wiley, New York; 2000.View ArticleGoogle Scholar
- Breiman L: Bagging Predictors. Machine Learning 1996, 123–140.Google Scholar
- MacKay DJC: Bayesian Interpolation. Neural Computation 1992, 4(3):415–447. 10.1162/neco.1992.4.3.415View ArticleGoogle Scholar
- Matys V, Fricke E, Geffers R, Gössling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DUU, Land S, Lewicki-Potapov B, Michael H, Münch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic acids research 2003, 31: 374–378. 10.1093/nar/gkg108PubMed CentralView ArticlePubMedGoogle Scholar
- Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol 2008, 26(11):1293–1300. 10.1038/nbt.1505PubMed CentralView ArticlePubMedGoogle Scholar
- Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18(Suppl 1):S96–104.View ArticlePubMedGoogle Scholar
- Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, Kasprzyk A: BioMart-biological queries made easy. BMC Genomics 2009, 10: 22. 10.1186/1471-2164-10-22PubMed CentralView ArticlePubMedGoogle Scholar
- Brusco MJ, Cradit JD, Steinley D, Fox GL: Cautionary Remarks on the Use of Clusterwise Regression. Multivariate Behavioral Research 2008, 43: 29–49. 10.1080/00273170701836653View ArticleGoogle Scholar
- Georgi B, Costa IG, Schliep A: PyMix - The Python mixture package - a tool for clustering of heterogeneous biological data. BMC Bioinformatics 2010, 11: 9. 10.1186/1471-2105-11-9PubMed CentralView ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.