Overview of the optimization model
For simplicity, we consider two classes to illustrate the optimization model. We note that both class discovery and class prediction for the two classes can be transformed into a sample labeling problem. In this section, the optimization model is formulated to find the best way to assign labels to the samples. The labeling problem for multiclass cases for class discovery and class prediction will be discussed in the next sections.
For twoclass cases, we denote one class by zero and the other class by one. Assume all the sample labels are continuous variables between zero and one. The objective of the optimization model is to assign similar labels to similar samples as much as possible. The formulations are given as follows:
where N is the total number of samples; s
_{
ij
} is the similarity score of samples x
_{
i
} and x
_{
j,
} which is calculated from the gene expression profiles; and f
_{
i
} is the unknown variable to be determined and represents the label of sample x
_{
i
}. A is a set of samples that are known to belong to Class Zero. B is a set of samples that are known to belong to Class One. The objective function in Equation (1) tends to assign similar labels to similar samples (
). Constraints in Equation (2) ensure that the resultant sample labels are consistent with the known information and that the final labels
are between zero and one.
The objective function (1) can be rewritten in vector form as
. Here f is the sample label vector (f
_{
i,} is the label of Sample i) and L is the Laplacian matrix of the similarity matrix S (s
_{
ij
}, the similarity score of samples i and j), i.e., L = D − S and D is a diagonal matrix with
. If s
_{
ij
} are all nonnegative, L is positive semidefinite. The objective function is convex and the constraints are linear. Thus the model (1–2) is a convex quadratic programming problem and a global optimal solution is guaranteed.
Due to the form of the objective function, our optimization model is tightly related to spectral clustering and semisupervised learning [21–23]. These links form the basis for class discovery and class prediction. Importantly, the constraints imposed in this model provide a few advantages for cutoff setting and outlier identification.
The sample similarity matrix
Usually the gene expression profile for n genes and m samples is mathematically denoted by an
matrix X. Each element x
_{
ij
} represents the expression level of gene i in sample j. x
_{
i
} is an mdimensional vector denoting the expression value of gene i. The construction of the sample similarity matrix is important because it is the only input for model (1–2) to fully utilize the gene expression data. Since the calculation of the similarity matrix and the solving of the optimization model are separated, various feature selection/extraction techniques and different measures of similarity can be applied here to incorporate prior information. A simple and straightforward method to construct a similarity matrix of samples based on the gene expression profiles is to calculate the Pearson correlation coefficients of each sample pair which provides a uniform measure between −1 and 1. To get nonnegative s
_{
ij
}, a linear transformation can be adopted to map [−1, 1] to [0, 1]. Because the Pearson correlation coefficients based on the gene expression profiles are calculated pairwisely between every two samples, it does not consider the similarities among samples globally. To provide a global similarity measure, a secondorder correlation similarity matrix can be constructed by exploiting the deduced sample correlation features (i.e., calculating the Pearson correlation coefficients of the sample correlation vectors). In this study we used secondorder correlation similarity matrices to identify the underlying structures of cancer gene expression data.
Setting for class discovery
Given the similarity matrix S, sets A and B are necessary to implement the class discovery task through Model (1–2). If A and B are not provided, i.e., without the corresponding constraints in Equation (2), the optimization model results in a trivial solution given nonnegative s
_{
ij
}. The trivial solution indicates that all the samples belong to one class, which is meaningless. To obtain a meaningful solution, A and B should be specified and intersection between A and B is not allowed. Usually for class discovery task, information about A and B is not available since all sample labels are unknown. Here we introduce a weak assumption to set up A and B. We name it here as the most dissimilar assumption. The assumption is that the two least similar samples should belong to different classes. Otherwise all samples should belong to one class. According to this assumption, the minimal s
_{
ij
} for
is identified, denoted by s
_{
ab
}. Let Sample x
_{
a
} be labeled with zero and x
_{
b
} be labeled with one, or vice versa. If there is more than one minimal value in S, the sample pair with minimal values in S^{n} (the power of similarity matrix S, where n > 1 is a positive integer) is also a candidate to determine set A and B. Model (1–2) is then well constructed and optimal labeling can be uniquely determined by solving the model.
Setting for class prediction
Class prediction tries to assign a set of particular samples to known classes. In this setting, goldstandard data are generally available and some gene expression profiles for samples are labeled with known classes. That is, A and B are available. Model (1–2) can therefore be implemented for class prediction.
A fast algorithm for largescale problems
Model (1–2) can be considered convex quadratic programming if all values of s
_{
ij
} are positive. It can be solved efficiently by the general solvers such as quadprog in Matlab and the sequential minimal optimization (SMO) algorithm which has been applied successfully to solve the optimization problems in support vector machine applications. Here, a simple customized algorithm is proposed to solve Model (1–2) quickly, even for very largescale problems by fully considering its particular characteristics.
The Lagrange function of optimization model (1–2) is:
Then the KarushKuhnTucker (KKT) conditions are:
These conditions can be reduced as:
We design the following algorithm to quickly find the solution:
Algorithm 1.
· Step 1: Let
and
for
,
for
and
for
.
· Step 2: Calculate
for
.
· Step 3: Let
. If
is less than a predefined threshold or t is larger than the maximal steps allowed, stop; otherwise, repeat Step 2 and Step 3.
Next, we prove the above algorithm is correct and convergent.
Theroem 1: Suppose Algorithm 1 gives rise to the sequence,
. It converges to
.
satisfies the KKT point of Model (1)(2).
Firstly, we prove that
Algorithm 1 is convergent. The Lagrangian function of our optimization model (1–2) is as follows,
Then an auxiliary function
is constructed for the Lagrangian function
where
L is the Laplacian matrix of the similarity matrix
S. The auxiliary function satisfies
. The second order derivative of
with respect to
is calculated as
where
is the Kronecker delta function, i.e.,
when i = j and
otherwise. Since
L is positive semidefinite,
is concave in
f. We can obtain global maxima when the first order derivative is zero.
Recalling the KKT condition and our iterative
Step 2 can be reformulated as,
By the property of the auxiliary function, we have
is monotonically increasing and is bounded from above. Thus our algorithm converges.
Secondly we show Algorithm 1 is correct. At convergence, the solution is
and satisfies
for
.
for
and
for
also hold. Then
satisfies the KKT condition (4)(5). This proves our algorithm correctly converges to a minimum satisfying KKT condition.
One advantage of our algorithm is that the computational complexity is low and it requires only a small amount of computer memory. So our algorithm can be applied to very large data sets.
Postprocessing the solutions
Each sample gets a continuous label between zero and one after the optimization model (1)(2) is solved. We can easily obtain the binary labels by applying a predefined threshold. If a training data set is available, this threshold can be learned from the training data by crossvalidation. Otherwise, the median of zero and one, 0.5, is a natural cutoff to convert the continuous labels into binary labels. If label
is close to zero, i.e.,
, the corresponding sample should be classified to Class Zero. Otherwise, if label
is close to one, i.e.,
, the corresponding sample will be classified to Class One. This is a great option compared to traditional spectral clustering methods in which the cutoff needs considerable human intervention. This advantage makes it much easier for clinicians and biologists to use.
Multipleclass cases
In practice, the samples may belong to more than two classes. For class discovery cases, the class labels can be obtained by recursively applying our model to classify samples into two groups on each step until some stopping criterion is satisfied. Here we propose an intuitive criterion and name it as the minimum similarity score criterion. Formally, the procedure for class discovery with multiple classes is described as follows:

Step 1: Classify samples into two classes by OTCC.

Step 2: Calculate the inner minimum similarity score for each class. If the minimum similarity score of some class is less than a predefined threshold, then repeat Step 1 to classify the samples of this class into two subclasses.

Step 3: repeat Step 2 until all the inner minimum similarity scores of the classes are above the threshold.
The procedure does not require the number of clusters but instead relies on the least tolerant similarity score within classes. Compared to the number of clusters which is generally required by many existing class discovery methods, our similarity score is tightly related to the expert’s knowledge and is expected to be defined by clinicians and biologists based on their knowledge. Alternatively, without predefining a stopping criterion, OTCC can be applied recursively until each sample is a single class. This outputs a binary tree in which all samples are leaves and the relationships among them are fully depicted. This property allows OTCC to reveal the fine structure of patient samples.
For class prediction cases, the relationship between multiple classes can be organized as a binary tree and then the model can be applied recursively according to the binary tree to obtain the labels of all samples. The binary tree should reflect the relationship of the classes. Otherwise wrong prior information will be introduced and mislead the class prediction results. When the class relationships are not available or all the classes are independent of each other, an arbitrary binary tree can be used. Onevsone or onevsall strategies can also be adopted to extend OTCC to multiclass cases.