### Data integration

Recently, our group has developed a family of statistical molecular classification methods based on relative expression reversals [22, 33], and applied one variant based on a two-gene classifier to microarray data integration [23]. These methods only use the ranks of gene expression values within each profile and achieve impressive results in both molecular classification and microarray data integration. An important feature of rank-based methods is that they are invariant to monotonic transformations of the expression data within an array, such as those used in most array normalization and other pre-processing methods. This property makes these methods especially useful for combining data from separate studies since the nature of the primary features extracted from the data, namely comparisons of mRNA concentration between pairs of genes, eliminates the need to standardize the data before aggregation. Specifically, the ranks of gene expression values are invariant to monotonic data transformations within each microarray. Consequently, we directly merge gene expression data of the patients from three training data sets in Table 1, using the 22283 probe sets on Affymetrix HG-U133A microarray, to form an integrated training data set of 358 samples. After aggregation, we extract a list of pair-wise comparisons; each of these "features" corresponds to a pair of genes and is assigned the value zero or one depending on the observed ordering of expressions; see the following section. The number of features retained is much smaller than the number of genes on the array. The procedure is now described in more detail.

### Feature selection and transformation

Consider *G* genes whose expression values *X*= {*X*_{1}, *X*_{2}, ..., *X*_{
G
}} are measured using a DNA microarray and regarded as random variables. The class label *Y* for each profile *X* is a discrete random variable taking on one of two possible values corresponding to the two prognostic states or hypotheses of interest, namely "poor-outcome," denoted *Y* = 1, and "good-outcome," denoted *Y* = 2. The integrated training microarray data represent the observed values of *X* and comprise a *G* × *N* matrix *x*= [*x*_{
gn
}], *g* = 1, 2, ..., *G* and *n* = 1, 2, ..., *N*, where *G* is the number of genes in each profile and *N* is the number of samples (profiles) in the integrated data set. Each column *n* represents a gene expression profile of *G* genes with a class label *y*_{
n
}= 1 (poor-outcome) or *y*_{
n
}= 2 (good-outcome) for the two-class problem in our study. Among the *N* samples, there are *N*_{1} (respectively, *N*_{2}) samples labeled as class 1 (respectively, class 2) with *N* = *N*_{1} + *N*_{2}.

For each pair of genes (*i*, *j*), where *i*, *j* = 1, 2, ..., *G*, *i* ≠ *j*, let *P*(*X*_{
i
}<*X*_{
j
}|*Y* = *k*), *k* = 1,2, denote the conditional probability of the event {*X*_{
i
}<*X*_{
j
}*}* given *Y* = *k*. We define a score by*Δ*_{
j
}= |*P*(*X*_{
i
}<*X*_{
j
}|*Y* = 1) - *P*(*X*_{
i
}<*X*_{
j
}|*Y* = 2)|

and estimate the score of pair (*i*, *j*) based on the training set *x* by

{\Delta}_{ij}\approx \left|\frac{{N}_{ij}^{(1)}}{{N}_{1}}-\frac{{N}_{ij}^{(2)}}{{N}_{2}}\right|

(2)

where

\begin{array}{cc}{N}_{ij}^{(k)}=\left|\{n:1\le n\le N,{x}_{in}<{x}_{jn},{y}_{n}=k\}\right|,& k=1,2\end{array}.

In other words, the estimated score is simply the absolute difference between the fraction of poor-outcome patients for which gene *i* is expressed less than gene *j* and the same fraction in the good-outcome examples. The feature selection procedure consists of forming a list of gene pairs, sorted from the largest to the smallest according to their scores *Δ*_{
ij
}, and selecting the top *M* pairs. The *M* top-ranked gene pairs are then considered to be the most discriminating candidate gene pairs for breast cancer prognosis if only relative expressions are taken into account. During the process, we have transformed the original feature vector *X*= {*X*_{1}, *X*_{2}, ..., *X*_{
G
}*}*(*G* = 22283 in this study), each of which assumes scalar values, to a new ordered feature vector *Z*= {*Z*_{1}, *Z*_{2}, ..., *Z*_{
M
}*}*(usually, *M* <<*G*), each of which assumes only two values.

Suppose *Z*_{
m
}, *m* = 1, 2, ..., *M*, corresponds to the gene pair {*i*, *j*}. For convenience, the ordering (*i*, *j*) will signify which probability in Equation (1) is larger. The reason for this is to facilitate interpretation of the results, as will become apparent. If *P*(*X*_{
i
}<*X*_{
j
}|*Y = 1*) ≥ *P*(*X*_{
i
}<*X*_{
j
}|*Y = 2*), as estimated by the fractions in (2), we will write (*i*, *j*) and if *P*(*X*_{
i
}<*X*_{
j
}|*Y = 1*) <*P*(*X*_{
i
}<*X*_{
j
}|*Y = 2*) we will denote the pair by (*j*, *i*). The value assumed by *Z*_{
m
}is then set to be 1 if we observe *X*_{
i
}<*X*_{
j
}and set to 0 otherwise, i.e., if we observe *X*_{
i
}≥ *X*_{
j
}. Of course the same definition is applied to each feature in the training set. In this way, observing *Z*_{
m
}= 1 (resp., *Z*_{
m
}= 0) represents an indicator of the poor outcome (resp., good outcome) class in the sense that *p*_{
m
}= *P*(*Z*_{
m
}= 1|*Y* = 1) ≥ *q*_{
m
}= *P*(*Z*_{
m
}= 1|*Y* = 2). For all the features selected in our signature we in fact have *p*_{
m
}> 1/2 > *q*_{
m
}.

After this procedure, the original *G* × *N* data matrix is reduced to an *M* × *N* data matrix. The number of distinct genes in a prognostic signature is obviously fewer than *2M*. In our practice, there are always more than *M* distinct genes among the top *M* gene pairs. Given that the numbers of genes in published breast cancer prognostic signatures are mostly fewer than 100, we fix *M* = 200 in this study to maker sure we can identify a prognostic feature signature based on a reasonable number of genes.

### Likelihood ratio test

The classical likelihood ratio test (LRT) is a statistical procedure for distinguishing between two hypotheses, each constraining the distribution of a random vector *Z*= {*Z*_{1}, *Z*_{2}, ..., *Z*_{
M
}}. In our case the variables *Z*_{
m
}are the simple, binary functions of the gene expression profile defined above.

The LRT is based on the likelihood ratio

LR(z)=\frac{P(z|Y=1)}{P(z|Y=2)}

where *z*= {*z*_{1}, *z*_{2}, ..., *z*_{
M
}*}* are the observed values in a new sample. Notice that *z* assumes values in {0, 1}^{M}, the set of binary strings of length *M*. The LRT chooses hypothesis *Y* = 1 if *LR*(*z*) > *t* and chooses *Y* = 2 otherwise, i.e., if *LR*(*z*) ≤ *t*. The threshold *t* is adjusted to provide a desired tradeoff between type I error and type II error, i.e., between sensitivity and specificity. Choosing *t* small provides high sensitivity at the expense of specificity and choosing *t* large promotes the opposite effect.

#### Naive Bayes Classifier

In the special case in which the random variables *Z*_{1}, ..., *Z*_{
M
}are binary (as here) and are assumed to be conditionally independent given *Y*, the LRT has a particularly simple form. It reduces to comparing a linear combination of the variables to a threshold. Recall that *p*_{
m
}= *P*(*Z*_{
m
}= 1|*Y* = 1) and *q*_{
m
}= *P*(*Z*_{
m
}= 1|*Y* = 2), *m* = 1, 2, ..., *M*. In this case,

P(z|Y=1)={\displaystyle \prod _{m=1}^{M}{p}_{m}^{{z}_{m}}{(1-{p}_{m})}^{1-{z}_{m}}}

and a similar expression holds for *P*(*z*|*Y* = 2) with *p*_{
m
}replaced by *q*_{
m
}.

It follows that

LR(z)={\displaystyle \prod _{m=1}^{M}\frac{{p}_{m}^{{z}_{m}}{(1-{p}_{m})}^{1-{z}_{m}}}{{q}_{m}^{{z}_{m}}{(1-{q}_{m})}^{1-{z}_{m}}}}

and consequently

LLR(z)=\mathrm{log}\phantom{\rule{0.1em}{0ex}}LR(z)={\displaystyle \sum _{m=1}^{M}{z}_{m}\mathrm{log}\phantom{\rule{0.1em}{0ex}}\frac{{p}_{m}(1-{q}_{m})}{{q}_{m}(1-{p}_{m})}}+{\displaystyle \sum _{m=1}^{M}\mathrm{log}\phantom{\rule{0.1em}{0ex}}\frac{1-{p}_{m}}{1-{q}_{m}}}

The LRT then reduces to the form: Choose *Y* = 1 if

{\displaystyle \sum _{m=1}^{M}{\lambda}_{m}{z}_{m}}>t\text{'}

(3)

and choose *Y* = 2 otherwise, where

{\lambda}_{m}=\mathrm{log}\phantom{\rule{0.1em}{0ex}}\frac{{p}_{m}(1-{q}_{m})}{{q}_{m}(1-{p}_{m})}.

(4)

Since *p*_{
m
}> *q*_{
m
}, all these coefficients in Equation (4) are positive and the decision rule in Equation (3) reduces to weighted voting among the pair-wise comparisons: every observed instance of *z*_{
m
}= 1 is a vote for the poor outcome class with weight *λ*_{
m
}. Moreover, under the two assumptions of i) conditional independence and ii) equal *a priori* class probabilities (i.e., *P*(*Y* = 1) = *P*(*Y* = 2)), this is in fact the Bayes classifier (which is optimal) for the threshold *t* = 0.

#### Sensitivity vs. Specificity

Since our interest lies in high sensitivity at the expense of specificity if necessary, we do not choose *t* = 0. Since we want a very high likelihood of detecting the poor-outcome class, we choose the threshold *t* to achieve high sensitivity, defined to be above 90%. Let *t*_{
α
}denote the (largest) threshold achieving sensitivity 1 - α. That is, suppose

P({\displaystyle \sum _{m=1}^{M}{\lambda}_{m}{Z}_{m}}>{t}_{\alpha}|Y=1)=1-\alpha .

(We explain how to estimate *t*_{
α
}from the training data in the next sections.) Then, from the Neyman-Pearson lemma, we know that our decision rule achieves the maximum possible specificity at this level of sensitivity. More precisely, this threshold maximizes

P({\displaystyle \sum _{m=1}^{M}{\lambda}_{m}{Z}_{m}}\le {t}_{\alpha}|Y=2),

which is the probability of choosing good-outcome when in fact good-outcome is the true hypothesis.

Of course this is only a theoretical guarantee and depends very strongly on the conditional independence assumption which is surely violated in practice; indeed, some genes are common to several of the variables *Z*_{
m
}. Still, the LRT does provide a framework in which there are clearly stated hypotheses under which specificity can be optimized at a given sensitivity. Moreover, it provides a very simple test and the parameters *p*_{
m
}, *q*_{
m
}are easily estimated with available sample sizes. Most importantly, the decision procedure dictated by the LRT does indeed work well on independent test data (see 'Results').

### Signature identification and class prediction

In clinical practice, when selecting breast cancer patients for adjuvant systemic therapy, it is of evident importance to limit the number of poor-outcome patients assigned to the good-outcome category. The conventional guidelines (e.g., St. Gallen and NIH) for breast cancer treatment usually call for at least 90% sensitivity and 10–30% specificity. Therefore our objective in selecting the threshold *t* is to maintain high sensitivity (~90%); the specificity is then determined by the sample size and the information content in the features. In order to meet these criteria, we employ *k*-fold cross-validation to estimate the threshold which maximizes specificity at ~90% sensitivity for each signature size for our likelihood ratio test.

The idea is to use *k*-fold cross validation to estimate the sensitivity and specificity for each possible value of *m* = *5, 10, 15, ..., 200*, (the number of features in the LRT) and *t* = *1, 2, ..., 200* (the threshold in Equation (3)). For each such *m* we then compute the specificity at the largest threshold *t*(*m*) achieving at least 90% sensitivity; this function is plotted in Figure 1. (Obtaining 90% sensitivity can always be achieved by selecting a small enough threshold.) Finally, we then choose the smallest value *m*_{
opt
}which (approximately) maximizes specificity; the threshold is then *t*_{
opt
}= *t*(*m*_{
opt
}). From Figure 1, we see that *m*_{
opt
}= *80*.

Specifically, the steps are as follows: 1) Divide the integrated training data set into *k* disjoint subsets of approximately equal sample size; 2) Leave out one subset and combine the other *k*-1 subsets to form a training set; 3) Generate a feature list of *M* gene pairs ranked from most to least discriminating according the score defined in Equation (1) and compute the corresponding binary feature vector of length *M* for every training sample and every left-out sample; 4) Starting from the top five features, sequentially add five features at a time from the ranked list, generating series of 40 feature signatures of sizes *m* = *5, 10, ..., 200*; 5) For each signature in 4), classify the left-out samples using the LRT in Equation (3) for each possible integer threshold *t* = 1, 2, ..., 200 and record the numbers of misclassified poor-outcome and misclassified good-outcome samples; 6) Repeat steps 1)-5) exhaustively for all *k* divisions into training and testing in step 1); 7) Calculate the sensitivity and specificity for the prognostic LRT test for each of the 40 signatures, and keep only the largest threshold for which the sensitivity exceeds 90%. The optimal number of features, *m*_{
opt
}, is the smallest number which effectively maximizes specificity.

The final prognostic signature is the *m*_{
opt
}top-ranked features (gene pairs) generated from the original integrated training set. The final prognostic test is the LRT with these features and the corresponding threshold *t*_{
opt
}= *t*(*m*_{
opt
}); this is the classifier which is applied to the validation set and yields the error rates reported in 'Results'.

### Additional statistical analysis

We compute the odds ratio of our prognostic test for developing distant metastases within five years between the patients in the poor-outcome group and good-outcome group as determined by LRT classifier. The *p*-values associated with odds ratios are calculated by Fisher's exact text. We also plot the Kaplan-Meier curve of the signature on the independent test data with *p*-values calculated by log-rank test. The Mantel-Cox estimation of hazard ratio of distant metastases within five years for the signature is also reported. All the statistical analyses are performed using MATLAB.