Let the microarray data be represented by a matrix *A*, where the *G* rows correspond to the genes and the *C* columns correspond to the samples. We previously described GMCimpute [10]. Briefly, the rows of the matrix are clustered into 1, 2, ..., *Q*-component Gaussian mixtures (*Q* is usually less than 10). For each *q*-component model, we assume that the expression data are generated from a mixture distribution

$$f(g) = \sum_{j=1}^{q} \pi_j \, \phi(g;\, \mu_j, \Sigma_j),$$

where *φ*(·; *μ*_{j}, Σ_{j}) is the *C*-variate Gaussian density, *π*_{j} is the mixing proportion, *μ*_{j} = (*μ*_{j}(1), …, *μ*_{j}(*C*)) is the *j*-th component mean expression profile across the *C* columns, and the *C* × *C* covariance matrix Σ_{j} summarizes the relationship among the *C* columns. The mixture models are fit to the data by the Classification Expectation-Maximization algorithm (CEM) [21], and the missing values are then estimated by the Expectation-Maximization algorithm [22]. For each missing value, the estimate by GMCimpute is the simple average of the *Q* estimates. If the CEM algorithm takes *I* iterations to converge, then GMCimpute takes *O*(*IQmn*) time. If gene *g* has a missing value in column *c*, we use the information in the other columns {*g*(*c'*), *c'* ≠ *c*}, and the estimated relationship among the columns, to impute the value *ĝ*(*c*) via a weighted average of the component-wise conditional expectations of *g*(*c*)|{*g*(*c'*), *c'* ≠ *c*}. That is,

$$\hat{g}(c) = \sum_{j=1}^{q} \tau_j(g) \Bigl\{ \mu_j(c) + \Sigma_j[c,-c] \, \Sigma_j[-c,-c]^{-1} \bigl( g(-c) - \mu_j(-c) \bigr) \Bigr\},$$

where *τ*_{j}(*g*) refers to the posterior probability of gene *g* with respect to component *j* in the *q*-component mixture model, Σ_{j}[*c*,-*c*] refers to the *c*-th row and all but the *c*-th column of the covariance of component *j*, and similarly for all other entries.
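The weighted average of component-wise conditional expectations can be sketched in a few lines of NumPy. This is a minimal illustration, not the GMCimpute implementation: it assumes the mixture parameters have already been fitted (the CEM step is not shown), and it assumes the posterior probabilities are computed from the observed entries of the gene only. The function name `impute_entry` and its arguments are hypothetical.

```python
import numpy as np

def impute_entry(g, c, weights, means, covs):
    """Impute g[c] from the other entries of g under a fitted Gaussian mixture.

    g       : length-C profile of one gene; g[c] is ignored (treated as missing)
    weights : mixing proportions, shape (q,)
    means   : component mean profiles, shape (q, C)
    covs    : component covariance matrices, shape (q, C, C)
    """
    g = np.asarray(g, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)

    obs = np.arange(g.size) != c              # selects every column except c
    q = weights.size
    log_post = np.empty(q)
    cond_mean = np.empty(q)
    for j in range(q):
        mu, S = means[j], covs[j]
        resid = g[obs] - mu[obs]
        S_oo = S[np.ix_(obs, obs)]            # covariance among the other columns
        solved = np.linalg.solve(S_oo, resid)
        # Conditional expectation of g(c) given the other columns, component j
        cond_mean[j] = mu[c] + S[c, obs] @ solved
        # Log of (mixing proportion x Gaussian density of the observed part);
        # the (2*pi) constant is identical across components and is dropped.
        # ASSUMPTION: posteriors use the observed entries only.
        _, logdet = np.linalg.slogdet(S_oo)
        log_post[j] = np.log(weights[j]) - 0.5 * (logdet + resid @ solved)

    post = np.exp(log_post - log_post.max())  # subtract max for stability
    post /= post.sum()                        # posterior probabilities
    return float(post @ cond_mean)
```

With a single component, the function reduces to the usual Gaussian conditional mean; with several components, genes that sit close to one component's mean profile are imputed almost entirely from that component's covariance structure.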

Let *A'* be the imputed data matrix, i.e., an estimate of *A*. The accuracy of *A'* is measured by the normalized root mean squared error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{\operatorname{mean}\bigl[(A' - A)^2\bigr]}{\operatorname{mean}\bigl[A^2\bigr]}},$$

where *A*^{2}, for example, is component-wise. We have found that GMCimpute is competitive in terms of imputation RMSE and in terms of its effect on downstream significance and clustering analysis, and that it is also computationally efficient [11]. Traditionally, *A'* is computed solely from *A*. In our meta-data imputation, when we apply GMCimpute to the missing values in the matrix *A*, the columns used by GMCimpute are not limited to those of *A*. Suppose we impute missing values in a column *c* ∈ {1, ..., *C*} of matrix *A*. We identify the *M* columns, from *A* or from the *database matrix D*, with the largest absolute Pearson correlation to column *c*, and use these *M* columns in GMCimpute.
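To make the accuracy measure concrete, the sketch below hides a random subset of known entries and scores an imputed matrix against the truth. It assumes the normalization is the mean square of the true values and that both means are taken over the hidden entries only; the source does not spell out either choice here. The helper names `mask_entries` and `normalized_rmse` are hypothetical.

```python
import numpy as np

def mask_entries(matrix, fraction, seed=0):
    """Hide a random fraction of entries; return (masked copy, boolean mask)."""
    truth = np.asarray(matrix, dtype=float)
    rng = np.random.default_rng(seed)
    n_hide = int(round(fraction * truth.size))
    picked = rng.choice(truth.size, size=n_hide, replace=False)
    hidden = np.zeros(truth.size, dtype=bool)
    hidden[picked] = True
    hidden = hidden.reshape(truth.shape)
    masked = truth.copy()
    masked[hidden] = np.nan                 # NaN marks an artificial missing value
    return masked, hidden

def normalized_rmse(truth, imputed, hidden):
    """Root mean squared error over hidden entries, normalized by the
    mean square of the true values at those entries (ASSUMPTION)."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    err = imputed[hidden] - truth[hidden]
    return float(np.sqrt(np.mean(err ** 2) / np.mean(truth[hidden] ** 2)))
```

Under this definition, a perfect imputation scores 0 and imputing every hidden entry as 0 scores exactly 1, which gives a natural reference point when comparing methods.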

In July 2006, we downloaded from SMD [19] the data of 1,082, 469, and 630 cDNA microarrays for yeast, worm, and plant, respectively. Raw data processing (background subtraction and normalization) was performed by SMD. Each entry in the data is the base-two logarithm of the ratio of the red and green intensities. If an experiment used the dye-swap design [17], the two channels were swapped back so that the numerators of the ratios were always the samples under study. The data deposited in SMD over the years came from different microarray platforms, so we need to establish the correspondence of genes across the platforms. Yeast has 6,300 or so nuclear open reading frames (ORFs), which are uniquely identified by their ORF systematic names promulgated by the Saccharomyces Genome Database [23]. For the worm, the genes are identified by the clone identifiers maintained by WormBase [24]. For the plant, the genes are identified by their GenBank Accession Numbers. The data we downloaded from SMD are used to construct the database matrix *D* (one for each organism).
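The two bookkeeping steps in this paragraph, orienting the log ratio and aligning arrays by gene identifier, might look like the following sketch. It assumes that in a dye-swapped array the sample under study sits in the green channel, and that each array arrives as a mapping from gene identifier to log ratio. Both function names and the numeric values in the usage note are illustrative only, not SMD data.

```python
import math

def log_ratio(red, green, dye_swapped=False):
    """Base-two log ratio with the sample under study in the numerator."""
    if dye_swapped:
        # ASSUMPTION: the sample under study is in the green channel here,
        # so the channels are swapped back before taking the ratio.
        red, green = green, red
    return math.log2(red / green)

def build_matrix(arrays):
    """Align arrays from different platforms on a shared gene identifier.

    arrays : list of dicts mapping gene identifier -> log ratio
    Returns (sorted identifiers, row-per-gene matrix); None marks a gene
    that is absent from, or flagged as missing on, a given array.
    """
    genes = sorted(set().union(*arrays))
    matrix = [[array.get(gene) for array in arrays] for gene in genes]
    return genes, matrix
```

For example, `build_matrix([{"YAL001C": 1.0, "YAL002W": -0.5}, {"YAL001C": 0.2}])` yields a two-row matrix in which the second array has a missing entry for `YAL002W`.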

To examine the performance of meta-data imputation, we need a gold-standard data set on which to conduct a simulation study. We thus create a data set without missing values. We then randomly mask values in the gold-standard data set, pretend that they are missing, and apply imputation. Since the true values of these artificial missing entries are known, we can compare and validate imputation methods. For the yeast data, we use yeast ORF systematic names as the row labels, and obtain a 6314 × 1082 matrix. Some entries in this matrix are flagged by the experimenters as missing, and we have to remove rows and columns with too many missing values. After their removal, we obtain a 6220 × 442 matrix. This matrix still has 107,093 missing values (3.9%), and we use the following steps to impute them: (1) We order the columns by the number of missing values in them, from the smallest to the largest (i.e., from the easiest to the most difficult to impute); (2) For each column *c*, in the order prescribed in Step 1, we identify the 40 columns (*c* excluded) that have the highest absolute Pearson correlation to *c*; (3) We impute missing values in *c* using these 40 columns by GMCimpute; (4) We repeat Steps 2 and 3 ten times (to reach convergence, as measured by the RMSE between consecutive iterations). Finally, the 6220 × 442 matrix with 3.9% imputed values is the database matrix *D* for yeast. When Steps 2 and 3 are iterated, the imputed values in a column *c*_{i} may be used to impute the missing values in a different column *c*_{j}. We order the columns as described in Step 1 to control the propagation of imputation errors, presumably from the smallest to the largest. The imputed values change from one iteration to the next, because they are used to impute one another. Thus we need to iterate Steps 2 and 3 until the change becomes small enough, presumably reaching a (local) optimum. The worm and plant database matrices are similarly prepared. The worm database matrix is 13338 × 381, with 2% imputed values; the plant database matrix is 7424 × 301, with 1.9% imputed values.
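Steps 1 to 4 can be sketched as a loop around a pluggable column imputer. The code below is an illustration under stated assumptions, not the authors' pipeline: correlations are computed on pairwise-complete observations, convergence is tracked as the plain RMSE of the imputed entries between consecutive passes, and `neighbor_mean` is a deliberately simple stand-in that is NOT GMCimpute. All function names are hypothetical.

```python
import numpy as np

def abs_corr(x, y):
    """Absolute Pearson correlation on pairwise-complete entries (ASSUMPTION)."""
    ok = ~np.isnan(x) & ~np.isnan(y)
    if ok.sum() < 3 or x[ok].std() == 0 or y[ok].std() == 0:
        return 0.0
    return abs(np.corrcoef(x[ok], y[ok])[0, 1])

def top_correlated(data, c, k):
    """Step 2: indices of the k columns most correlated with column c."""
    scores = [(abs_corr(data[:, c], data[:, j]), j)
              for j in range(data.shape[1]) if j != c]
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def neighbor_mean(current, c, neighbors, rows):
    """Stand-in imputer (NOT GMCimpute): row-wise mean over neighbor columns,
    falling back to the mean of the observed entries of column c."""
    block = current[rows][:, neighbors]
    counts = (~np.isnan(block)).sum(axis=1)
    sums = np.nansum(block, axis=1)
    fallback = np.nanmean(current[~rows, c])
    return np.where(counts > 0, sums / np.maximum(counts, 1), fallback)

def iterate_imputation(data, impute_column, k=40, n_iter=10):
    """Return (filled matrix, RMSE of imputed entries between passes)."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    # Step 1: columns with the fewest missing values come first
    order = np.argsort(missing.sum(axis=0), kind="stable")
    current = data.copy()
    previous, changes = None, []
    for _ in range(n_iter):                         # Step 4: repeat the sweep
        for c in order:
            rows = missing[:, c]
            if not rows.any():
                continue
            neighbors = top_correlated(current, c, k)                        # Step 2
            current[rows, c] = impute_column(current, c, neighbors, rows)   # Step 3
        filled = current[missing]                   # copy of the imputed entries
        if previous is not None:
            changes.append(float(np.sqrt(np.mean((filled - previous) ** 2))))
        previous = filled
    return current, changes
```

Because later columns in the sweep see values already imputed in earlier columns, the entries of `changes` are a direct view of how much the imputed values still move from one pass to the next; a real run would substitute a GMCimpute-based `impute_column` for the stand-in.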

In practice, researchers may submit their data for imputation without removing columns or rows with many missing values. However, we recommend a basic quality-control screening. If a sample (microarray) has an excessive number of missing values, it may be better to eliminate it from the logical set.
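A basic screen of this kind could be as simple as the sketch below. The cutoff `max_missing_fraction` is a hypothetical parameter: the text does not state what counts as an excessive number of missing values, so the default of 0.2 is purely illustrative.

```python
import numpy as np

def screen_columns(data, max_missing_fraction=0.2):
    """Drop samples (columns) whose fraction of missing values exceeds a cutoff.

    The cutoff is a HYPOTHETICAL choice for illustration; NaN marks a
    missing value. Returns (kept columns, indices of dropped columns).
    """
    data = np.asarray(data, dtype=float)
    missing_fraction = np.isnan(data).mean(axis=0)   # per-column fraction
    keep = missing_fraction <= max_missing_fraction
    return data[:, keep], np.flatnonzero(~keep)
```

Returning the dropped indices alongside the filtered matrix lets the submitter see which arrays were excluded before imputation is run.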