### Microarray production

RNA isolation, microarray production, and microarray hybridization were carried out as described previously in [21]. RNA from normal human lymph node was purchased from a commercial source (Stratagene, La Jolla, CA). Five *μ* g aliquots of total RNA from normal lymph node and RKO colon cancer cell line were reverse transcribed using Superscript II RT (Invitrogen, Carlsbad, CA) in conjunction with oligodT-T7 primers according to the manufacturer's suggested protocol. The second strand was synthesized using 10U *E. coli* DNA ligase (vendor), 40U *E. coli* DNA polymerase I (vendor), and 2U *E. coli* Rnase H (vendor). This reaction was stopped with EDTA and then cleaned with Qiagen's PCR Purification kit (Qiagen, Valencia, CA). The double stranded cDNA was then amplified by an *in vitro* transcription reaction (Ambion, Austin, TX) and cleaned with Qiagen's Rneasy kit (Qiagen, Valencia, CA). Each amplified cRNA sample was then quantitated using a Beckman DU640 spectrophotometer (Beckman, Fullerton, CA). Five *μ* g amplified cRNA from Stratagene's normal lymph node was labeled with Cy5 for each microarray hybridization. Mixtures of appropriate volumes of cRNA from normal lymph node and RKO were labeled with Cy3 in a reverse transcription reaction using Superscript II RT (see Table 3). Labeled samples were co-hybridized overnight at 60°C in a humidified incubator on a cDNA microarray containing 4704 human genes in duplicate produced in-house. The 4704 genes represent most of the known genes in the cDNA library we used to generate the microarrays. For the purpose of this study, the identity of the genes is not very important since we only study the general effect of sample heterogeneity. As the mixing effect is the same for all the genes, we expect to have similar results when the whole genome arrays are used. Slides were scanned with an LS-IV laser scanner (Genomic Solutions, Ann Arbor, MI). In total, five different heterogeneous mixtures were measured. The measured mixing percentages are shown in Table 3.

### Preprocessing

The microarray data consists of five different heterogeneous mixtures of lymph node and colon cancer samples which are hereafter abbreviated as normal and RKO, respectively. (For more details, see Microarray production Section above and Table 3.) The gene expression data set was preprocessed as follows. The replicated background-subtracted signal intensities were averaged and log_{2}-transformed, and the dye-bias effect was corrected in the log_{2}-domain using the standard lowess smoothing-based normalization (see e.g. [22]) with smoothing parameter *f* = 0.7. Because the averaging effect (source of heterogeneity) takes place on the molecular level, the phenomenon must be modeled using the absolute expression values. Therefore, after the correction of the dye-bias, the data were transformed back to the original domain using the inverse of the log_{2}-transformation. Correspondingly, single-channel data were used for further analysis. In order to mitigate the between array variability, the data were further standardized for each array and the two channels separately.

### Modeling sample heterogeneity

The two samples, RKO colon cancer cells and normal lymphocytes, are mixed at the extracted RNA level. Therefore, without any further verification, the model can be assumed to be linear. Lymphocytes were used because tumor tissues often contain infiltrating lymphocytes, especially in tumor metastases in the lymph nodes. Let

and denote the expression level of the *i* th gene in the colon cancer (RKO) and in the lymph node (normal) samples, respectively. Assuming only two different cell types are mixed, the sample heterogeneity is modeled by a simple linear model

where

denotes the expression value of the *i* th gene in the *k* th heterogeneous sample, and 0 ≤ *α*_{
k
}≤ *1* denotes the fraction of the colon cancer cells in the *kth* mixture. It is worth noting that we use the same mathematical model for the sample heterogeneity as in [9–11]. Also note that in Equation (1) it is assumed that the expression level in RKO () and normal () is "fixed" and does not change between heterogeneous measurements. In other words, the measurements come from the same heterogeneous sample with different mixing fractions. In order to allow variation in the expression values between different samples/treatments/time points, the same model can be applied separately to each set of measurements from the other samples/treatments/time points. The same model can also be extended to more than two cell types (for more details, see Selection of the number of cell types Section below).

### Inversion of sample heterogeneity

The first objective is to invert the mixing effect shown in Equation (1), that is, to obtain estimates for the expression values of the pure colon cancer cells and the pure lymphocytes. In practice, however, the measured expression values, *y*_{
i
}, include one or more sources of noise. By making some distributional assumptions, one could use standard model-based estimation methods. However, in order to avoid making additional modeling assumptions, we prefer to use a general purpose least squares method to estimate the gene expression levels corresponding to the pure samples.

Let the number of genes be *n* and assume that one has measured the expression values for *K* different heterogeneous mixtures. Thus, one has measurements , 1 ≤ *i* ≤ *n*, 1 ≤ *k* ≤ *K*. Let us also assume for now that the mixing percentages are known or have been measured. For the *i* th gene the sample heterogeneity can be expressed as (excluding all noise terms)

When including all *n* genes, the above model can be rewritten as

where **0** denotes the *K*-by-2 zero matrix. Let the block matrix in Equation (2) above be denoted as *Ã* Assuming the column rank of *A* is full, the well-known least squares solution is given by

where

. Due to the structure of the matrix *Ã*, the least squares solution can be obtained gene-wise as .

The Gauss-Markov theorem says that the standard least squares solution is indeed the best linear unbiased estimate if the noise in the measurements is additive and i.i.d. with constant variance. However, a common observation is that the homoscedasticity does not always hold for microarray data, but instead, the noise variance depends on the underlying signal intensity [23, 24]. Such heteroscedasticity may decrease the power of the inversion method shown in Equation (3). Fortunately, the structure of the matrix *Ã* ensures that the inversion can also be performed for each gene separately. Consequently, it is not necessary for the homoscedasticity to hold globally. Indeed, all we need to assume is that the noise variance is approximately constant for each gene separately.

Also note that, in this two cell type model, no prior knowledge about the expression values of either of the two cell types is needed since the method estimates the expression values for both of the two cell types. The same is also true for more general models including more cell types, assuming the model is sufficiently over-determined (see also Selection of the number of cell types Section below).

### Optimization of mixing percentages

In practice, the mixing percentages must be measured by some means. Therefore, they are also likely to contain some error. So, overall, one would like to estimate not only the expression values for the pure cell types but also the most likely value of the mixing percentages.

Assuming the model in Equation (2) is sufficiently over-determined, the mixing parameters can be adjusted computationally, too. Let us again consider the case where only two different cell types are mixed. Note that *K* denotes the number of different heterogeneous mixtures measured. Therefore, the regression matrix *Ã* in Equation (2) has only *K* free parameters. Since the number of expression values to be estimated is 2*n*, the total number of free parameters in Equation (2) is 2*n* + *K*. The number of equations in Equation (2) is *Kn*. Hence, the model is over-determined if *Kn* > 2*n* + *K*, which, for a fixed *n* ≥ 3, holds if *K* > 2. (Note that in our case we have measured five different heterogeneous mixtures, i.e., *K* = 5.) As above, no assumptions on the noise distributions are being made and we use the least squares method. This results in the following optimization problem

A similar optimization problem was also introduced in [11].

Because the objective function in Equation (4) above is minimized over both *Ã* and **x**, the objective function is not linear in the parameters anymore and, therefore, cannot be solved as in Equation (3). In general, any iterative optimization method can be used to get a solution. Iterative methods usually become inefficient/unstable as the number of parameters to be optimized increases. In this case, the number of free parameters in *Ã* and **x** is 2*n* + *K*. Therefore, we use a two-step approach in the optimization. In the first step, given proper initial value for *Ã*, the least squares solution for **x** is found using Equation (3). In the second step, the mixing percentages are optimized in the least squares sense (subject to the constraints 0 ≤ *α*_{
k
}≤ 1 for all 1 ≤ *k* ≤ *K*) using the previously found value for **x**. These two steps are then repeated, essentially resulting in a type of expectation-maximization (EM) approach. A similar iterative procedure was also proposed in [11], except with different constraints. Note that when Equation (4) is minimized over *Ã*, given the value of **x**, the optimization problem is again linear in its parameters. Assuming the constraints are not violated, the standard equation (similar to the one in Equation (3)) can be applied. If that is not the case, then any general-purpose constrained optimization method may be applied. Let (resp. *Â*^{(j)}) denote the value of **x** (resp. *Ã*) after the *j* th iteration. Details of the algorithm are shown in Figure 8. Clearly, at each iteration of steps 2 and 3, the value of the objective function is decreased. Thus, a minimum will be found.

### Confidence intervals

It is useful to assess the confidence intervals of the obtained expression estimates. As explained above, the Gauss-Markov theorem is applied gene-wise that greatly alleviates the issue of heteroscedasticity. Should the noise variance *σ*^{2} be constant, then the variance of the estimated expression values would be . Due to the special structure of the matrix *Ã* (i.e., the gene-wise inversion of the mixing effect), the variance of the estimated expression values for the *i* th gene can be expressed as

where

is the noise variance for the *i* th gene. A straightforward way of obtaining an estimate of the variance is to compute the sample noise variance for each gene and then apply Equation (5) to get . That would result in somewhat sensitive variance estimates since there are only *K* = 5 error residuals associated with each gene. A better alternative is to pool genes which have approximately the same average expression value and then compute the sample noise variance from the error residuals of the pooled genes. Although we do not assume Gaussian noise distribution, we can resort to the Gaussian approximation when computing the confidence intervals. For example, using the Gaussian approximation, the 1 - 2*α* confidence interval for estimated expression value of the *i* th gene in the colon cancer cells is , where Φ^{-1}(·) is the inverse of the standard normal cumulative distribution function and denotes the (1,1) element of the estimated variance matrix (similarly for the lymph node sample: ). Alternatively, the confidence intervals can be obtained using the non-parametric bootstrap framework [25]. Here we consider the method in which one re-samples the error residuals with replacement (within the set of pooled genes) and computes the confidence intervals directly from the *α* and 1 - *α* percentiles of the bootstrap distribution of the expression estimates.

### Selection of the number of cell types

Although it is known that only two cell types are mixed in our experiments there may be other experimental settings where the number of cell types may be unknown. Then it is useful to assess the validity of the model as well. As was mentioned above, the linear mixing model can be extended to incorporate more than just two cell types using a straightforward extension:

, where denotes the expression value of the *i* th gene in the *j* th cell type, and 0 ≤ ≤ 1 denotes the fraction of the *j* th cell type in the *k* th mixture. The mixing percentages must also satisfy for all *k*. The significance of different 'regression coefficients' could be tested using standard regression-based statistical tests. Since those tests apply only to Gaussian noise we recommend using a general purpose cross-validation for model selection (see e.g. [26]). Here we consider the leave-one-out cross-validation (LOOCV) and test the one, two, and three cell type models. Thus, each heterogeneous sample is left out from the training data at a time, the regression coefficients are estimated based on the remaining four samples, and the model is then tested on the sample which was left out from the training data set.