Experimental data
Gene expression data are part of the Grand Challenges in Global Health Project: Grant Number 37772, “Biomarkers of protective immunity against Tuberculosis in the context of HIV/AIDS in Africa” (funded by the Bill & Melinda Gates Foundation through the Grand Challenges in Global Health Initiative; http://www.biomarkers-for-tb.net/. PBMC from 40 TB cases and from 40 healthy household contact controls were extracted and analyzed by flow cytometry for proportions of CD3+ T-lymphocytes and CD14+mononuclear phagocytes as described before [1]. All donors gave informed consent. This study was approved by local ethics committees in Stellenbosch (South Africa) (N05/11/187) and Berlin (EA 10 1/176/07, Germany).
Signals of gene expression in whole blood as well as in CD3+ cells, for the Human Whole Genome Oligo 44K Agilent arrays (GE2_44k_1005) were measured according to manufacturer's protocols. The microarray design was an independent swop design as recommended by Landgrebe et al. [13]: 50% of each group ("TB", "TST+", "TST-") were labelled with Cy3, the other half using Cy5. Pairs for hybridization on an array were chosen to match regarding age and gender. For validation of the deconfounding algorithm we used CD3+ proportions. CD3+ cells sorted by fluorescence-activated cell sorting (FACS) were subjected to RNA extraction and microarray measurements of gene expression following the same procedure as for the whole blood samples. More details about the observational field study as well as the gene expression dataset will be published separately (see http://www.biomarkers-for-tb.net/publications).
Gene expression data were normalized using R-package limma [14]: background correction using the method normexp [15], lowess normalization was applied for each array (within array normalisation), quantile normalisation on the set of all arrays (between array normalisation) as recommended [16]. As proposed by [17], gene expression intensities for both groups were obtained as in Equations 4 and 5 from re-parameterizing the normalized log-ratios (M) and mean log-intensities (A) resulting from the limma analysis.
Summarizing, for each of the 40 TB cases and the 40 healthy household contact controls we were able to analyze gene expression data of whole blood as well as for the sorted CD3+ cells of the same samples together with their FACS-measured cell type proportions for the CD3+ cell population.
Deconfounding algorithm: implementation and enhancements
The basis of our deconfounding algorithm was implemented as proposed by Venet et al. (2001) [10] and Lahdesmaki et al. (2005) [4] using R [18]:
input X and n
normalize columns of X (either centre, or by quantile normalization)
generate start values for S and C
apply constraints to S and C (see below)
(*) fix S, calculate C using lsqnonneg-algorithm
apply constraints for S
fix C, calculate S using lsqnonneg-algorithm
apply constraints for C
if | X - SC | < a or number iterations > b then EXIT and report S and C
else continue at (*)
where X is the gene expression matrix measured from heterogeneous tissue (rows: genes, columns: samples), S and C as in equation 2, iteration exit criteria were set a = 0.1 and b = 100. The Least squares non-negative matrix factorization algorithm is implemented as in the MATLAB function lsqnonneg [19]. The constraints are:
-
1.
S non-negative and normalized (either centered, or by quantile normalization [16])
-
2.
0 ≤ c
ij
≤ 1 for all elements of C (cell type i, sample j)
-
3.
∑
i
c
ij
= 1 for all samples j (i.e. cell type proportions sum to 100%)
Our implementation is available as an R-package and has additional options for using quantile normalization instead of global normalization proposed previously [10]. Moreover, it is possible to run the deconfounding on log-values of the normalized intensities or on non-log data. Finally, our implementation does not apply the de-correlation proposed by Venet et al. [10].
To assign the right cell type for each of the estimated profiles, our implementation relies on a majority count decision involving the estimated gene expression profiles from nmarker = 9 markers. Five of these markers are considered to be expressed exclusively for a specific cell type (positive marker genes) and the remaining four exclusively not in this cell type (negative marker genes). Marker genes were chosen according to a priori molecular immunological knowledge. For our experimental dataset we used CD3D, CD3E, CD3G, CD2 and CD7 as positive markers, and CD19, FCGR1A, CD14 and MARCO as negative markers for the CD3+ T cells.
Simulated data
Cell type specific gene expression profiles (columns of the signature matrix S) were simulated according to a gamma distribution such that expectation value and variance were those of the experimental data (shape a = 12.5 and scale b = 0.65):
As by Venet et al. [10], biological variance was modeled as multiplicative error term ϵ, technical variance as additive error term ϵ. For our experimental data, variation was found to increase with mean signal intensities. Therefore, we decided to model a constant coefficient of variation instead of standard deviation:
where η = 0.17 and ϵ ~ N(0, χ · Igene), using χ = 0.1 as estimated from our experimental data.
Gene expression values for negative marker genes had expression Xmarker, neg = 6.0, positive marker genes had Xmarker, pos = 12.0 in the expressing cell type - as observed for the marker genes in our experimental study. Cell type proportions, Csim, were drawn from the uniform distribution between cell type specific maximum and minimum values as in our experimental flow cytometry data. The simulated gene expression matrix, Xsim, was calculated from simulated cell type-specific gene expression profiles, Ssim, and simulated cell type proportions, Csim, corresponding to equation 2:
To investigate the algorithm's capabilities regarding detection of differential expression of single features and for classification, two groups of gene expression profiles were simulated, e.g. corresponding to TB patients and TST+ controls in our experimental data. We simulated nsample = 100 individuals in each group. For each gene expression profile ngenes ∈ {1000, 10000} genes were considered, with nmarkers = 10 and ndiff ∈ {20, 600} differentially expressed biomarkers.
Differential expression was simulated by adding Δdiff ∈ 1, 2, 5} to the expression values of the biomarker genes in the first cell type. Figure 3 illustrates the generation of simulated profiles.
Power study: valid biomarkers with and without deconfounding
We simulated a gene expression experiment with samples mixed out of two cell types (CD3 and other) for 10,000 genes, where 600 genes were differentially expressed. For the differentially expressed genes we simulated all eight possible combinations of NEUTRAL, UP and DOWN. Sample sizes of the two groups under comparison (alike TB and TST+ healthy control) varied from nsamples ∈ {4, 10, 20, 40, 80, 120}. Simulated gene expression data were analyzed as the experimental data. As for the latter we were able to analyze a simulated whole blood sample (mixture of the two cell types) as well as the two cell type-specific gene expression profiles after deconfounding. Simulated whole blood gene expression data were analyzed using the t-test, ranking candidates for differential expression using p-values and - to enable a direct comparison - considering the 100 top candidates as positive candidates for differential expression. The cell type-specific gene expression profiles (columns of the signature matrix
) estimated from deconfounding were ranked using absolute log-fold-change values. Also here the 100 top candidates were chosen.
Classification in the case of reversely regulated differentially expressed biomarkers
The worst-case scenario for biomarker detection in heterogeneous tissues arises when cell types involved express differentially regulated biomarkers in opposite directions. In this case, in the tissue RNA isolate, signals for differential expression likely cancel each other and hamper detection of respective biomarkers markedly. To identify a possible exit strategy, we conducted a simulation study for this worst-case scenario, again considering noise values estimated from the experimental data in this study.
To exemplify the worst-case classification task, we simulated differential gene expression as above, but also subtracted the same value from the expression values of the second cell type. This way, for all cells in the mixture averaged over all samples, no differential expression is expected, while for the single cell types it is more or less evident. Gene expression profiles for new samples, for validation of the trained classifiers in the classification scenario, were generated using the identical signature matrices, Ssim, as for the training step, but with new values for the concentration matrices as well as for the noise term realizations.
Canonical classification approach
For feature selection, t-tests were used to identify biomarker candidates from the simulated heterogeneous tissue gene expression data: The top ncand ϵ {10, 20} were chosen to train a linear discriminant function as classificator. Classification errors in a validation (500 new cases simulated) for this classical classification approach were then compared to a deconfounding ranking approach, which is described in the following.
Deconfounding ranking approach
For the training dataset, a deconfounding analysis was run and ncand candidates top ranked for differential expression were picked from gene-wise mean absolute differences between the corresponding columns of the estimated signature matrices,
, for the two groups. In addition, using the simulated whole blood expression data, Xsim, from the training dataset, a random forest predictor was trained to estimate the cell type proportions
, resulting from the deconfounding algorithm run from the same training-data [20, 21].
Input to this statistical learning step were the gene expression data in Xsim for the nmarkers = 20 marker genes. For each new individual during the validation part of the study, cell type proportions were estimated from the simulated whole blood gene expression profile using the trained random forest machine. Deconfounding results
of the training dataset for the two groups A and B were then multiplied with the estimated cell type proportions for the new individual, to result in group-specific gene expression profiles
and with
with the cell type proportions of the sample in question. The actual gene expression signals of the sample at the chosen ncand biomarker loci were then compared to these group-specific gene expression matrices and the following summary score computed:
Classification was based on choosing the group for which γgroup was minimal.
Implementation and availability
R-package deconf implementing the deconfounding algorithm and options, R-scripts for data simulation, data analysis and an anonymized part of the experimental dataset is available as additional file 1 (Windows R-package) and additional file 2 (tar-gz archive).