Context based mixture model for cell phase identification in automated fluorescence microscopy

Background Automated identification of cell cycle phases of individual live cells in a large population captured via automated fluorescence microscopy technique is important for cancer drug discovery and cell cycle studies. Time-lapse fluorescence microscopy images provide an important method to study the cell cycle process under different conditions of perturbation. Existing methods are limited in dealing with such time-lapse data sets while manual analysis is not feasible. This paper presents statistical data analysis and statistical pattern recognition to perform this task. Results The data is generated from Hela H2B GFP cells imaged during a 2-day period with images acquired 15 minutes apart using an automated time-lapse fluorescence microscopy. The patterns are described with four kinds of features, including twelve general features, Haralick texture features, Zernike moment features, and wavelet features. To generate a new set of features with more discriminate power, the commonly used feature reduction techniques are used, which include Principle Component Analysis (PCA), Linear Discriminant Analysis (LDA), Maximum Margin Criterion (MMC), Stepwise Discriminate Analysis based Feature Selection (SDAFS), and Genetic Algorithm based Feature Selection (GAFS). Then, we propose a Context Based Mixture Model (CBMM) for dealing with the time-series cell sequence information and compare it to other traditional classifiers: Support Vector Machine (SVM), Neural Network (NN), and K-Nearest Neighbor (KNN). Being a standard practice in machine learning, we systematically compare the performance of a number of common feature reduction techniques and classifiers to select an optimal combination of a feature reduction technique and a classifier. A cellular database containing 100 manually labelled subsequence is built for evaluating the performance of the classifiers. The generalization error is estimated using the cross validation technique. The experimental results show that CBMM outperforms all other classifies in identifying prophase and has the best overall performance. Conclusion The application of feature reduction techniques can improve the prediction accuracy significantly. CBMM can effectively utilize the contextual information and has the best overall performance when combined with any of the previously mentioned feature reduction techniques.


Background
Quantitating the changes in cell cycle timing before and after drug treatment is useful for effective drug discovery research. Knowledge of the cell cycle progression, e.g., interphase, prophase, metaphase, and anaphase, is important to improving our understanding of the effects of various drugs on cancer cells [1][2][3][4]. Cell cycle progress can be identified by measuring changes in the nucleus as a function of time. Automated time-lapse fluorescence microscopy imaging provides an effective method to observe and study nuclei dynamically and is an important quantitative technique in the fields of cell biology and systems biology [2][3][4]. Nevertheless, the vast amount and complexity of image data acquired from automated microscopy renders manual analysis unreasonably timeconsuming. Accurate automatic classification of cell nuclei into interphase, prophase, metaphase, or anaphase, is an unresolved issue in cell biology studies using fluorescence microscopy. Murphy et al. [5][6][7][8][9] have proposed different feature extraction, feature reduction, and classification algorithms for a similar problem of classification of subcellular location patterns in fluorescence microscope images. Methods have also been proposed to identify the cell cycle phase recognition. Gallardo et al. [10] used Hidden Markov Models (HMMs) to classify the feature vector sequences that are extracted from the segmented, potential mitotic cells. Chen et al. [11] proposed an automated system to segment, classify, and track individuals in live cell population, in which the KNN classifier with a set of seven features was used. A novel hybrid fragments merging method based on watershed segmentation and HMMs is also proposed for cell phase identification [1,2].
In this work, an automated analytical system [1,2] is used to acquire images, track cell nuclei and generate features of each cell nucleus in a population of thousands of cells. In these specific time-lapse fluorescence microscopy images, nuclei are bright objects protruding out from a relatively uniform dark background; see an example in Figure 1. The cell nuclei are segmented from the acquired images and represented by a group of features for phase identification. To extract features from these time-lapse fluorescence images, four operating steps are conducted [1,2]: image preprocessing, cell nuclei segmentation, fragment merging, and cell nuclei tracking. After that, 145 features are extracted from each cell nucleus. But there are many noisy and functionally redundant features. Thus it is necessary to remove the noisy, irrelevant, and redundant features with feature reduction techniques. Many feature reduction methods have been proposed to improve the efficiency and effectiveness of cell phase identification [1,6,7,12,13].
Feature reduction techniques can be generally classified into feature extraction and feature selection approaches [13,[15][16][17][18][19]. Commonly used feature extraction algorithms, such as PCA [16,17], Linear Discriminant Analysis (LDA) [17,18], and Maximum Margin Criterion (MMC) [19] are investigated in this work. In PCA, the linear projections of the greatest variance from the top eigenvectors of the covariance matrix are computed, which works well when the data lies close to a flat manifold. LDA is one of the most commonly used supervised feature extraction algorithms. It is used to locate a lower dimensional space that best discriminates the samples from different classes. LDA explicitly utilizes the label information of the samples and thus is suitable for classification problems. However, it often suffers from small sample size when dealing with the high dimensional image data. Moreover, while LDA is guaranteed to find the best directions when each class has a Gaussian density with a common covariance matrix, it may fail if the class densities are more generalized. To solve the limitations of LDA, MCC, a supervised approach, has recently been proposed. The computation complexity of MMC is lower than LDA. In addition, the SDAFS is proposed as the best approach in feature selection for cell phase identification [7]. Other approaches, such as MIFS [15,20], GAFS [21], and T-test based Feature Selection (TFS) [22], are also evaluated. It is a NP-hard problem to determine the optimal feature subset for MIFS by global search. So we adopt a greedy searching algorithm, in which the features with the highest average mutual information are selected. Genetic algorithm [17,21] is a classical random optimization method, which mimics the evolutionary process of survival of the fittest.
A T-test is used to generate the initial individual feature subset, which belongs to the "Population" in the GA algorithm. To classify cell nuclei, some researchers have used KNN [11,17], BPNN [8], and SVM [23,24] to classify cell nuclei. Though acceptable results have been reported, both of them ignored the contextual information of timelapse microscopy. To utilize the contextual information, we propose a Context Based Mixture Model. Finally, as a standard practice in machine learning, a systematic comparison is conducted to select an optimal combination of a feature reduction technique and a classifier.
Feature reduction techniques can effectively improve the prediction accuracy, and more features do not necessarily guarantee better performance. Our finding indicates that CBMM outperforms SVM, BPNN, and KNN in identifying prophase and achieves the best overall performance.

Key Steps
The experiments consist of the following six steps.
Step 1. Image processing: Include image pre-processing, thresholding, fragment merging, and tracking. The cell nuclei are segmented from the background and tracked as cell sequences, refer to [1,2,[25][26][27] for a detailed description.
Step 2. Feature generation: Generate 145 features for each cell nucleus in all tracked sequences.
Step 4. Feature reduction: Use the six different approaches introduced in this paper to reduce the dimension of vector space.
Step 5. Classifier training: Train the classifiers using reduced training data, the four classifiers are trained using the reduced training data obtained from step 4.
Step 6. Phase identification: Identify cell cycle phases of the cells from the reduced testing data.

Materials
The data is generated from HeLa H2B GFP cells imaged during a 2-day period with images acquired 15 minutes apart using an automated time-lapse fluorescence microscopy. H2B GFP is a recombinant protein that localizes to DNA and is fluorescent. Each image has a resolution of 672*512 pixels. To get more reliable training data from each tracked cell sequence, we select the frames where the cell is in mitosis, thus including interphase, prophase, metaphase, and anaphase. In addition, the subsequence is supposed to start from a cell in interphase and end with a cell in anaphase. There are totally 100 manually labeled subsequences. A typical 200-frames sample of digital microscope images contains at least 18,000 interphase cells and the other types of cells sum up to less than 1,000. Obviously, the data sets are critically imbalanced. To handle this problem, we down-sample the interphase cells greatly while keeping other three classes of cells. But there are still serious problems with unequal distribution of the training examples, e.g., there are only 100 prophase cells and 306 anaphase cells.

Image Processing and Feature Generation
During cell phase identification, it is critical to separate nuclei from the background. Nuclei are bright objects protruding out from a relatively uniform dark background. Digital images usually require pre-processing to remove noise, discard undesirable features, and correct illumination artifacts. In a sequence of cell images, our preprocessing procedure includes four steps: image enhancement, adaptive shareholding, morphological filtering, and distance transformation [1][2][3][4]11]. Although the adaptive threshold can segment all the cells from the background effectively, it cannot separate touching nuclei clusters. To solve this problem, our system utilizes a watershed algorithm. Traditional watershed segmentation, however, will lead to over-segmentation. Thus, a hybrid fragment merging approach that combines the roughness score and Probability Distribution Function (PDF) score of each cell is used [1][2][3][4][25][26][27]. This algorithm can effectively segment separated nuclei and most of the touching ones. The dynamic behaviors of cell nuclei are tracked by distance and size. After tracking, the performance of fragment merging is improved by the contextual information. The revised segmentation results are then used to reinforce the tracking performance.
After obtaining the segmented nuclei, feature vectors that each contains 145 features are generated to represent the cells. They compose of twelve general image features about shape, size, and intensity (max intensity, min intensity, deviation of gray level, average intensity, length of long axis, length of short axis, long axis/short axis, area, perimeter) [11]; 14 Haralick co-occurrence textural features [6,28]; 49 Zernike moment features [6,29], and 70 features generated by Gabor transformation [2,30].

Parameters
To determine the dimensionality of the lower space for feature extraction algorithms, we vary the ratio of the energy preserved by feature extraction algorithms from 80% to 95% and compare the performance of different classifiers. Our results show that when the reduced dimen-sionality is 15, almost all the classifiers reach their best performance. Therefore, the reduced dimensionality for feature extraction algorithms is estimated as 15, with which 90% of the energy is preserved by PCA. With the FS algorithms, 10, 20, 30, 40, and 50 features are selected to compare the performance. We reduce the dimensions to 20 for all FS approaches. Both linear kernel and RBF kernel are used for the SVM classifier. The genetic algorithm is used in feature selection, and the parameters are as follows: populations size of 200, maximum generation size of 200, the portion of crossover is set to 0.5-th of the feature length, and the mutation rate is 0.3. One of the 200 populations is initialized with t-test feature selection method. The best performance of KNN is achieved by selecting K = 7 for cell phase identification. A BPNN [8] with a single hidden layer of 20 nodes is used to classify the four classes of cell phases, and is trained with back propagation algorithm. The best performance of SVM is always achieved by linear kernel except for the case without feature reduction. Only linear FE approaches are used in this paper due to their efficiency in contrast to nonlinear ones [31]. Thus, as indicated in Table 1 and 2, the best performance of SVM using linear kernel and RBF kernel is reported.

Measurements
We use precision and sensitivity as the measurements for our experimental results. Suppose TP, FP, and FN stand for the number of true positive, false positive and false negative samples respectively after the completion of cell phase identification. Precision is defined as precision = TP/(TP+FP), and sensitivity is defined as sensitivity = TP/ (TP+FN). In other words, precision is the portion of cells identified positively that are really positive. Sensitivity refers to the ability to identify positive cells correctly. We can calculate the precision and sensitivity for each class if we treat one class as positive and other classes as negative. The average precision and sensitivity of the four classes is used to indicate the overall performance. The ten-fold cross validation is used for testing the trained classifiers. The reduced dimensionality for feature extraction algorithm is 15, while the dimensionality for feature selection is 20. The best performance combination for each class and each classifier is displayed in bold.
To address the statistical significance of differences in classification result, the confidence intervals of classification accuracies are estimated to be at the 90% confidence level.

Testing Results
We first reduce the data to 15 dimensional vector spaces with different FE approaches and 20 for all FS approaches. Table 1 and 2 shows the results of all four classes (class1interphase, class2-prophase, class3-metaphase, and class4-anaphase). We describe the details of all four classes instead of only their average, since prophase and metaphase identifications are very important for drug discovery application [1][2][3][4]11].
According to Table 1 and 2, it can be observed that CBMM has the best overall performance regardless of which feature reduction algorithm is coupled with. PCA combined with the CBMM classifier outperforms other combina-tions. Although LDA achieves similar performance, its use is not recommended due to the singularity problem [18].
Since prophase is more important than the other three classes for drug screening, we show the specific measurement (precision and sensitivity) of prophase besides the average measurements of the four classes in Figure 2  On the other hand, SDAFS has been reported to be the best among all the FS approaches [7], its performance when combined with CBMM is shown in Table 3 and 4. From Table 3 and 4, it can be observed that the optimal dimension for CBMM is 20 and its performance cannot be enhanced significantly by the increasing of the subspace dimension.
These conclusions are drawn based on the preliminary analysis, which can serve as a guideline for future research. The reduced dimensionality for feature extraction algorithm is 15, while the dimensionality for feature selection is 20. The best performance combination for each class and each classifier is displayed in bold.
With the accumulation of new data, a more detailed and conclusive analysis will be presented in our future work.

Implementation
The functions for PCA, LDA, MMC, SDAFS, and t-test are developed in Matlab 6.5. MIFS is adapted from previously reported source code [1]. The feature generation tools are adapted from [2,11] and [30], whereas the twelve general features are generated by Matlab. We use Libsvm [23] as the SVM classifier and implement the CBMM, BPNN, and KNN classifier in Matlab 6.5.

Discussion
Our method can successively classify over 80% of the cell phases. Although such precision is acceptable for most biological applications, additional heuristic rules and online-training will improve the precision further. Since the cell cycle is confined by biological constraints, knowledge-driven heuristic rules can be applied to compensate for certain phase identification errors. For example, we are going to implement the following three biological rules to enhance the system performance: • Phase progression rule: Once a cell enters a defined cellcycle phase, it cannot go back to its previous phase.
• Phase timing rule: The time period that a cell stays in a phase also obey certain biological rules. Cells will usually stay in prophase no more than 45 minutes, metaphase for about 1 hour in untreated cell sequences, and anaphase under 1 hour. On the other hand, a cell can stay in interphase for more than 20 hours. In time-lapse sequences in a drug-treated cell population, certain cells can stay in metaphase for an even longer period of time.

Precision Sensitivity
• Phase continuation rule: Cells cannot skip the one cell cycle phase and enter next phase following the one it skipped, e.g., cells cannot jump from metaphase to interphase or from anaphase to metaphase.
It is worth noting that the problems with unequal distribution of training examples can be solved in a supervised framework while the unsupervised approach heavily depends on the distribution of training examples. For example, we may first down-sample the training sets with appropriate sampling method, then train the SVMs by assigning the training samples with different cost weights according to class size [23,24,32]. In the "weighted" SVM [23], the prediction accuracy of prophase can be increased by 10%~20% at the expense of slightly decrease of the classes with large samples using the reduced features. The weights for interphase, prophase, metaphase, and anaphase are 1, 10, 10, and 10 respectively.

Conclusion
This paper proposes a new Context-Based Mixture Model for dealing with the time-series cell cycle sequence information, which outperforms other traditional classifiers in identifying prophase. The application of feature reduction techniques can effectively improve the prediction accuracy, whereas more features do not necessarily guarantee better performance.

Feature Reduction
From the Feature Reduction (FR) perspective, the traditional and the state-of-the-art dimensionality reduction methods can be generally classified into Feature Extraction (FE) and Feature Selection (FS) [15][16][17][18][19][20][21]. FE aims to project high dimensional data to a lower dimensional space by algebraic transformation according to certain criteria while FS identifies a subset of the most representative features according to pre-defined criteria, thus the features are ranked according to their individual predictive power. We compare certain commonly used feature reduction approaches below.
Series of cell divisions can be represented in lineage trees. For this study, we only use one of the daughter cells after each division. This results in only one cell being tracked in any given frame, thus a sequence of cell nuclei are tracked for all of the frames in the current experiment. In other words, a sequence of these cell nuclei is mathematically represented by a n × d matrix X ∈ R n × d , where d is the number of features and n is the length of a cell sequence. Each cell nuclei is represented with a feature vector. X T is used to denote the transpose of matrix X. M sequences of cell nuclei (or M cell sequences) are denoted by a mn × d matrix ∈ R mn × d . Each cell is denoted by a row vector x i , i = 1, 2,ʜ, mn. Assume that these feature vectors belong to c different classes and the sample number of the j th class is n j , we use c j to represent class j, j = 1, 2,ʜ, c. The mean vector of c j is . The mean vector of all the cell nuclei is . The feature reduction problem can be framed as the problem of finding a function f : R d → R p according to an objective function J, where p is the dimension of data after the dimensionality reduction, so that an object

Classifiers
We select the established classifiers KNN [17], BPNN [8,17], and SVM [23,24] for comparison. In addition, the CBMM is also proposed to incorporate the contextual information. Figure 4 provides an example of cell mitosis process. The occurrence of phases in a sequence can be regarded as a stochastic process; hence, the cell sequence can be represented as a Markov chain where phases are in hidden states. The occurrence of the first phase in the sequence is characterized by the initial probability of the Markov chain. The occurrence of the other phases, which is given by the occurrence of its previous phase, is characterized by the transition probability. We calculate initial and transition probabilities for Markov chains with a set of training nuclei sequences. In addition, we assume each hidden state can generate a group of continuous visible states described by R Gaussian Mixtures. We optimize these  Changes in the appearance of a nucleus during cell mitosis Figure 4 Changes in the appearance of a nucleus during cell mitosis. From (a) to (h) consecutive image subframes form a sequence showing nuclei size and shape changes during cell mitosis.

Context based Mixture Model Classifier
The next issue is how to use this model to predict cell phases. According to traditional Continuous Gaussian Mixture HMM, the probability of a cell x t belonging to phase s m , i.e. θ t = s m , should only depend basically on x t and x t-1 . Thus the probabilities we need for the categorization of cells could be denoted as p(θ t = s m |x t , x t-1 ). Based on the Bayesian formula, we rewrite them as: where m = 1,2,...M, p(θ t = s j , θ t-1 = s i ) = p(θ t = s j |θ t-1 = s i )p(θ t-1 = s i ) = a ij π i . The p(x t , x t-1 |θ t = s j , θ t-1 = s i ) means given θ t = s j and θ t-1 = s i , the probability of Gaussian mixtures ϕ(x, μ jr , Σ jr ) and ϕ(x, μ kr , Σ kr ), r = 1, 2,..., R can generate vectors x t and x t-1 .
Traditional CHMM has utilized the information of the previous time point to predict the state of current time point. In our application, we know both the information of "left point" and "right point", since we have obtained the cell trace and the features of all cells. In contrast to the traditional CHMM introduced above, we propose to utilize the contextual information for cell phase identification, i.e., we propose to use Gaussian mixture models based on two-pixels. This model is called the Context Based Mixture Model Classifier.
where p(m, i, j) = p(θ t = s m , θ t-1 = s i , θ t+1 = s j ) is the prior probability defined as: Since the denominator in (2) is the same for each class, in implementation we can neglect this term, i.e.,  To estimate p(x t , x t-1 , x t+1 |θ t = s m , θ t-1 = s i , θ t+1 = s j ), the simplest way is to assume that the phase of three states are independent, then it can be simply approximated by the product of the three items p(x t |θ t = s m ), p(x t-1 |θ t-1 = s i ) and p(x t+1 |θ t+1 = s j ). This assumption is not always true in realistic applications. In this case, we assume that they are dependent. More complex models are trained than the standard continuous models described above. We collect the cells in status satisfying (θ t = s m , θ t-1 = s i , θ t+1 = s j ).