A graphical model approach to automated classification of protein subcellular location patterns in multi-cell images
© Chen and Murphy; licensee BioMed Central Ltd. 2006
Received: 28 September 2005
Accepted: 23 February 2006
Published: 23 February 2006
Knowledge of the subcellular location of a protein is critical to understanding how that protein works in a cell. This location is frequently determined by the interpretation of fluorescence microscope images. In recent years, automated systems have been developed for consistent and objective interpretation of such images so that the protein pattern in a single cell can be assigned to a known location category. While these systems perform with nearly perfect accuracy for single cell images of all major subcellular structures, their ability to distinguish subpatterns of an organelle (such as two Golgi proteins) is not perfect. Our goal in the work described here was to improve the ability of an automated system to decide which of two similar patterns is present in a field of cells by considering more than one cell at a time. Since cells displaying the same location pattern are often clustered together, considering multiple cells may be expected to improve discrimination between similar patterns.
We describe how to take advantage of information on experimental conditions to construct a graphical representation for multiple cells in a field. Assuming that a field is composed of a small number of classes, the classification accuracy can be improved by allowing the computed probability of each pattern for each cell to be influenced by the probabilities of its neighboring cells in the model. We describe a novel way to allow this influence to occur, in which we adjust the prior probabilities of each class to reflect the patterns that are present. When this graphical model approach is used on synthetic multi-cell images in which the true class of each cell is known, we observe that the ability to distinguish similar classes is improved without suffering any degradation in ability to distinguish dissimilar classes. The computational complexity of the method is sufficiently low that improved assignments of classes can be obtained for fields of twelve cells in under 0.04 second on a 1600 megahertz processor.
We demonstrate that graphical models can be used to improve the accuracy of classification of subcellular patterns in multi-cell fluorescence microscope images. We also describe a novel algorithm for inferring classes from a graphical model. The performance and speed suggest that the method will be particularly valuable for analysis of images from high-throughput microscopy. We also anticipate that it will be useful for analyzing the mixtures of cell types typically present in images of tissues. Lastly, we anticipate that the method can be generalized to other problems.
The location (or locations) of a protein within cells is an important attribute that can be largely independent of its structure, enzymatic activity, or level of expression. Systematic and comprehensive analysis of subcellular location is therefore needed as part of systems biology efforts to understand the behavior of all expressed proteins. Work in this area can be divided into experimental determination and computational prediction. Of course, the accuracy and utility of prediction methods is dependent on the accuracy, coverage and resolution of determination methods. This is because experimentally determined locations are the starting point for the machine learning methods at the heart of prediction systems [1–3]. Subcellular location is most frequently determined by visual interpretation of fluorescence microscope images, but such interpretations can be highly variable from observer to observer. We have therefore developed automated systems to recognize major subcellular patterns [4–6] and to learn new patterns directly from fluorescence microscope images [7, 8]. These systems utilize high resolution images and have been shown to be able to distinguish similar patterns better than visual examination .
Automated interpretation of subcellular patterns in micrographs
Using large collections of HeLa cell images containing ten distinct subcellular patterns, our systems have achieved classification accuracies as high as 92% and 98% for 2D and 3D single cell images, respectively [10, 11]. The patterns of dissimilar classes can be distinguished quite well; however, there is still room to improve the classification accuracy for similar classes (such as endosomal and lysosomal proteins and different Golgi proteins).
In order to improve the classification accuracy, one strategy is to incorporate additional or improved features and another is to combine more than one classifier using voting methods. The performance improvements we have obtained for 2D HeLa images, from 83% using a library of 84 features and a neural network classifier  to 92% using a library of 180 features and a majority-voting ensemble , resulted from implementing both of these strategies. A majority-voting ensemble combines the results from many different classifiers into a single decision, as illustrated in Figure 1b.
These improvements were obtained while considering the classification of patterns in single cells. An additional strategy is to utilize information from more than one cell from the same sample. For example, when sets of HeLa cells from the same slide were individually classified and allowed to vote for a single classification for the entire set, overall accuracy improved from 83% to 98% . The penalty for this improvement is that we give up the ability to identify more than one pattern in a given set. A possible improvement on this approach is therefore to first estimate the number of classes that are present from the frequencies of the classes (by ruling out classes that have a low frequency), and then assign each cell to one of the remaining classes. (If we rule out all but one class, this approach reduces to the previous one.) So that we can decide which classes to rule out, we assume that the "true" classes are present in roughly equal proportions. In this paper, we first evaluate this simple strategy. We then describe more sophisticated approaches that construct a graphical model representing pattern information for more than one cell in a field so that improved classification accuracy can be achieved while retaining the ability to classify each cell individually (and without the assumption that classes are present in equal frequencies).
Graphical models have been extensively applied to problems in the computer vision field, such as image segmentation and object recognition, where the pixels in an image can be segmented or classified into two (foreground and background) or more classes . Many classification problems where the labels of related objects must be consistent with each other, such as hypertext classification  and identification of protein functions in the protein-protein interaction network , can also utilize graph-based methods. To our knowledge, graphical models have not previously been applied to the recognition of subcellular patterns in multi-cell images. Large numbers of such images are increasingly being acquired both in projects aimed at determining the subcellular location of all proteins [8, 15–17] and in drug screening by high-throughput microscopy . Part of the motivation behind the work we describe here is the need to classify fields of cultured cells that may be expressing different tagged proteins (such fields arise when a population of cells is randomly tagged). An additional motivation is the desire to classify individual cell patterns in tissues that may consist of more than one cell type.
The problem to be solved using a graphical model is to infer the posterior probability of each class for each node (cell) using information about the likely classes of other nodes (cells). For some graphical models, an exact solution can be found using the belief propagation (BP) algorithm . However, BP can only calculate the posterior probability correctly on trees or forests, that is, on graphs where there is at most one path between any two nodes. If there are loops in the graph, the junction tree algorithm  can be used to convert a loopy graph into a tree by clustering nodes together. Exact inference can then be achieved by applying BP on the converted tree, but the running time is exponential in the size of the largest cluster in the converted graph. We therefore need approximate inference methods for cases where the size of the largest cluster is large. A commonly used approximate method is loopy belief propagation (LBP), which iteratively applies belief propagation updates on a graph with loops. LBP often gives good approximate inference when it converges , and often runs very quickly, but can fail to converge on some graphs. Other approximate inference algorithms, such as variational methods  and Monte Carlo methods , are also widely used. Running times for these approximate inference methods can be prohibitive for large graphs.
A graphical model consists of an algorithm for constructing the graph itself and an algorithm for making inferences given the graph. In this paper we describe how to construct graphs for the problem of subcellular location classification, and also present a novel algorithm, which we term prior updating, that permits inferences to be made for the (often large) resulting graphs.
Problem Statement: At the outset, we formalize our problem by describing our assumptions about the process used to create cell images. We assume that the process of creating a slide (or a well, plate or chamber) for imaging starts by creating a mixture of any number of cells from each of many possible classes. We further assume that cells are randomly distributed over the slide at some time t_plate before imaging, that the cells divide with an average generation time of t_g, and that the class of a cell is stably inherited by its daughters (the latter assumption can be relaxed slightly to allow for mutation without substantially changing our treatment). Lastly, we assume that we have accurate methods for segmenting multi-cell images into regions containing single cells, and classification methods that provide a likelihood for each possible class for each segmented cell. The task is: given an image of a field containing a number of cells meeting the assumptions above, assign a class to each cell as accurately as possible.
Equal-sized class model
As discussed above, the performance of a single cell classifier on a multi-cell image can be improved if the assumption can be made that all cells in the field should show the same pattern. This can be done by assigning the most frequent class in the image to all cells. While this assumption may be true in some cases, it is quite restrictive. The goal of the work in this paper is to improve the analysis of multi-cell images without the drastic assumption of homogeneity. We begin by considering a variation on this assumption, namely that each multi-cell image is composed of a small number of classes with roughly equal numbers of cells. In this case, one strategy is to decide upon the number of classes using a threshold on the observed frequencies of each class. We define T_n = 1/(1 + n) + β, where n is the number of classes and β is an adjustable parameter that ranges from -0.5 to 0.5. To find the number of classes, we find the smallest n for which the frequencies of exactly n classes are greater than T_n and record which classes those are. This definition is based on the assumption that the true classes are present in roughly equal proportion, and hence that the frequency of each should be greater than the expected frequency of a class if one more true class were present (plus a tolerance controlled by β). We consider an example to illustrate the approach. Using β = 0.1 results in T_1 = 0.6 and T_2 = 0.43. Given a field with three classes with frequencies (0.7, 0.2, 0.1), we would choose n = 1. However, if the frequencies were (0.45, 0.5, 0.05), n = 2 would be chosen. Once n is chosen, each cell in the trial field is assigned to whichever of the retained classes has the largest likelihood for that cell (as assigned by the single cell classifier). Note that this might not be the class with the highest likelihood overall if that class was not retained during the selection of the number of classes.
If no n meets the criterion, we simply keep the classification results from the single cell classifier. Note that as β decreases to -0.5, we increasingly favor finding only one class, and as β approaches 0.5 we increasingly favor making no changes to the original class assignments.
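The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `select_classes` and `reassign` are hypothetical, and each cell's likelihoods are assumed to be supplied as a dictionary mapping class name to likelihood.

```python
from collections import Counter


def select_classes(labels, beta=0.1):
    """Return the retained classes, or None if no n meets the criterion."""
    total = len(labels)
    freqs = {c: k / total for c, k in Counter(labels).items()}
    for n in range(1, len(freqs) + 1):
        threshold = 1.0 / (1 + n) + beta  # T_n = 1/(1 + n) + beta
        above = [c for c, f in freqs.items() if f > threshold]
        if len(above) == n:  # exactly n classes exceed T_n
            return sorted(above)
    return None


def reassign(likelihoods, beta=0.1):
    """likelihoods: one dict (class -> likelihood) per cell in the field."""
    # Initial labels from the single cell classifier (maximum likelihood).
    labels = [max(p, key=p.get) for p in likelihoods]
    kept = select_classes(labels, beta)
    if kept is None:
        # No n meets the criterion: keep the single cell results.
        return labels
    # Assign each cell to the retained class with the largest likelihood.
    return [max(kept, key=lambda c: p[c]) for p in likelihoods]
```

With β = 0.1, a field whose initial labels have frequencies (0.7, 0.2, 0.1) retains one class, and a field with frequencies (0.45, 0.5, 0.05) retains two, matching the worked example in the text.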
To illustrate and test approaches to multi-cell classification, we need multi-cell images in which the class of each cell is known with certainty. Since it is nearly impossible to collect such images (without, for example, using micro-manipulation to spot cells on a slide), we have simulated them by combining images from a large library of single cell images (the 2D HeLa cell image collection described in the Methods). The library contains images of ten subcellular pattern classes, and to classify individual cells we have used a multi-class support vector machine classifier whose outputs were converted to probabilities for each class.
Construction of graphical models
We next consider what information may be available about the likely class of a cell given information about its neighbors in the field, and how we can construct a graphical model to use that information. Two limit cases can be considered. These limits are based on the relative magnitudes of the constants t_plate and t_g defined in the problem statement above.
Feature space model
The first possibility is that t_plate is short relative to t_g, such that cells would not have time to undergo significant cell division prior to their being imaged. In this case, the proximity of cells does not provide any information about their likely similarity (i.e., whether they are derived from the same class). The only clues that we have about the number of classes present (and the number of cells in each) are the similarities between cells in the SLF feature space. In this case, we initially construct an undirected graph in which each cell is represented by a node and edges are created between each pair of nodes with length equal to the z-scored Euclidean distance between the feature vectors of the corresponding cells.
Physical space model
If, however, the amount of time that elapses between plating and imaging is significantly greater than the generation time (t_plate ≫ t_g), each original cell is expected to have divided a number of times and we may consider it likely that the class of cells adjacent to one another is the same. The rate (v_trans) at which daughter cells move away from each other relative to the rate at which they divide becomes the determining factor. Thus, if v_trans is high, we may consider physical proximity to be of little predictive value and are forced to use the feature space model described above. If, on the other hand, v_trans is low, we can construct an undirected graph using the Euclidean distance between the centers of cells in the field.
Initially, the graphs for both model types are fully connected. Each edge suggests that the two nodes it connects should influence each other's labels. Since we can assume that they should not influence each other if the distance between them is too large (and to improve computational efficiency), edges whose length is greater than a free parameter d_cutoff are removed. Note that the units of d_cutoff are different for the two types of models.
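A minimal sketch of the graph construction for both model types follows. The helper names are hypothetical, and one interpretation is assumed: for the feature space model, "z-scored Euclidean distance" is taken to mean the Euclidean distance computed after z-scoring each feature across the cells in the field. For the physical space model, the cell centers are passed in directly.

```python
import math


def zscore_columns(vectors):
    """z-score each feature across cells (constant features map to 0)."""
    n, dims = len(vectors), len(vectors[0])
    means = [sum(v[k] for v in vectors) / n for k in range(dims)]
    stds = [math.sqrt(sum((v[k] - means[k]) ** 2 for v in vectors) / n)
            for k in range(dims)]
    return [[(v[k] - means[k]) / stds[k] if stds[k] > 0 else 0.0
             for k in range(dims)] for v in vectors]


def build_graph(points, d_cutoff):
    """Adjacency dict: node -> {neighbor: distance}, edges longer than d_cutoff removed."""
    graph = {i: {} for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])  # Euclidean distance
            if d <= d_cutoff:
                graph[i][j] = d
                graph[j][i] = d
    return graph


# Physical space model: points are cell centers in image coordinates.
# Feature space model: points = zscore_columns(feature_vectors).
```

Because the same `build_graph` function serves both models, only the units of `d_cutoff` change between them, as noted above.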
Inference by prior updating
Feature space model
When α is zero, the priors are not updated so that cells do not influence each other. As α increases, the priors of classes that are present in the field are increased while others are decreased. As seen in Figure 4, classification accuracy also increases as α increases but roughly plateaus at α near 0.2. The results suggest that a large α usually gives good improvement in classification accuracy; however, the best α has to be found by applying cross-validation methods.
The d_cutoff parameter is designed to determine the neighbors of a cell. If d_cutoff is very small, the cell does not have any neighbors to influence and be influenced by. As d_cutoff gets larger, the cells start to be influenced by other similar cells, and so the classification accuracy can be improved. If d_cutoff is set to infinity, all the cells are connected to each other in the graph and so contribute to the updates of each other's priors. In this case, some dissimilar cells will affect each other's priors and the classification accuracy could be worse than when the best d_cutoff is used. The best d_cutoff can be found by applying cross-validation methods.
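The roles of α and the neighborhood graph can be illustrated with a small sketch. The update rule below is an assumption made purely for illustration and is not the update used in this work: each cell's prior for a class is taken to be proportional to a uniform prior plus α times the summed equal-prior posteriors of its neighbors. It reproduces only the qualitative behavior described above, namely that α = 0 leaves the priors unchanged and that larger α shifts the priors toward classes supported by neighboring cells. The function names are hypothetical.

```python
def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def prior_update(likelihoods, graph, alpha):
    """Illustrative one-pass prior update (assumed rule, see lead-in).

    likelihoods: one list of class likelihoods per cell.
    graph: dict node -> iterable of neighbor nodes.
    Returns the index of the most probable class for each cell.
    """
    n_classes = len(likelihoods[0])
    uniform = 1.0 / n_classes
    # Posteriors under equal priors are just the normalized likelihoods.
    posteriors = [normalize(lik) for lik in likelihoods]
    labels = []
    for i, lik in enumerate(likelihoods):
        # Support for each class from the neighbors of cell i.
        support = [sum(posteriors[k][c] for k in graph[i])
                   for c in range(n_classes)]
        priors = normalize([uniform + alpha * s for s in support])
        updated = normalize([l * p for l, p in zip(lik, priors)])
        labels.append(max(range(n_classes), key=lambda c: updated[c]))
    return labels
```

In this sketch a cell with no neighbors (for example, when d_cutoff is very small) keeps its uniform prior and therefore its original label, consistent with the discussion of d_cutoff above.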
Physical space model
Multiple classes test
Table: Results for multiple classes (columns: No. of classes, Classification Accuracy (%)).
Effect of training set size
Table: Results for different training set sizes (columns: No. of training data, Classification Accuracy (%)).
Our work has particular implications for classification of patterns in images obtained by high-throughput microscopy. Since high-throughput systems typically use low magnification, the number of cells per field is often high and the accuracy of single-cell classifiers is usually not perfect. By applying this method on multi-cell images made of real single cells and synthesized locations, we are able to verify that our scheme can be used for such systems to achieve significantly better performance.
Since we have proposed a new approximate inference algorithm, it is important to identify when this method works better than other approximate inference methods. This method is very fast compared to previously described graphical model algorithms: its runtime is linearly proportional to the number of cells in each trial field and to the number of classes it needs to choose from. Whether this method has better classification performance under different circumstances will be examined in future work. We anticipate that the method can be made more general so that it can be used for other applications, both for biomedical applications like classification of cell types in tissue images and for other applications like Internet link analysis.
This paper addresses a supervised learning problem in the domain of protein subcellular location determination. We have proposed a novel graphical representation where multiple cells in a field influence each other. Assuming that these cells are only composed of a small number of classes, the classification accuracies are improved by manipulating the prior distributions of classes. The improvement is largest for groups of classes which would be difficult for the base classifier to distinguish from one another.
We have also shown the robustness of our prior updating scheme. The accuracies for different classes were always improved under different assumptions about the distribution of cells in the field, different sizes of the two classes of cells present in the field, different numbers of classes, and different training set sizes.
The results are very encouraging: the prior updating method improves the overall accuracy of the base classifier by around 5 percentage points and the accuracy for similar classes by around 9 percentage points. The combination of the prior updating method and the base single cell classifier outperforms the majority-voting classifier, which, with an accuracy of 92.3%, had the best previously reported performance on this dataset.
2D HeLa cell image collection
Subcellular Location Features (SLF)
We have developed several sets of informative features to describe protein subcellular patterns. These features, termed Subcellular Location Features (SLFs), are of several types, including Zernike moment features, Haralick texture features, morphological features and wavelet features. The details of the different versions of SLFs have been reviewed previously. The best classification results obtained to date for the 2D HeLa dataset were with feature set SLF16, and we have therefore used the SLF16 feature set in this work. Each cell in the dataset is thus represented by a feature vector x of length d = 47.
Bayesian decision theory
Bayesian decision theory is a fundamental statistical approach to pattern classification problems. The Bayes formula can be expressed as:

$$p(w_j \mid x) = \frac{p(x \mid w_j)\, p(w_j)}{p(x)}$$
where w_j is the class with index j; p(w_j), termed the prior probability, is the probability of class j being observed in the absence of any other information; p(x | w_j), termed the likelihood, is the probability density function for an observed feature vector x given that the class is w_j; p(w_j | x), termed the posterior probability, is the probability of the class being w_j given that x has been observed; and p(x), termed the evidence, is a normalization that guarantees that the posterior probabilities sum to one. For n classes, the evidence can be formulated as

$$p(x) = \sum_{j=1}^{n} p(x \mid w_j)\, p(w_j)$$
A probabilistic classifier assigns an observation x to class i if
$$p(w_i \mid x) > p(w_j \mid x) \quad \forall j \neq i$$
That is, the classifier assigns x to the class with the maximum posterior probability.
In our previous work, each cell was classified independently. Since the priors were not known in advance, they were assumed to be equal. In this case, the classification with the "Maximum a Posteriori Probability" (MAP) is equivalent to the "Maximum Likelihood" (ML).
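The decision rule can be made concrete with a short sketch. The function names are hypothetical, and the likelihood values in the usage note are made up for illustration only.

```python
def posterior(likelihoods, priors):
    """Posterior p(w_j | x) for each class j via the Bayes formula."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x) = sum over j of p(x | w_j) p(w_j)
    return [j / evidence for j in joint]


def map_class(likelihoods, priors):
    """Index of the class with the maximum posterior probability."""
    post = posterior(likelihoods, priors)
    return max(range(len(post)), key=post.__getitem__)
```

With likelihoods (0.2, 0.5, 0.3) and equal priors, `map_class` returns the second class, which is also the maximum likelihood choice. With priors (0.7, 0.1, 0.2) the same likelihoods yield the first class, which shows how changing the priors alone can change the assignment.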
Classifier – Support Vector Machine
Support Vector Machines (SVMs) were originally designed for binary classification by finding a maximum-margin hyperplane between two classes. They can be extended to solve multi-class classification problems by combining several binary classifiers. There are several commonly used methods, such as one-against-all, one-against-one, and directed acyclic graph. Here we adopt the one-against-all method [26, 27], which constructs n SVM classifiers, where n is the number of classes. The ith SVM is trained using all of the examples in the ith class with positive labels and all others with negative labels. The test example is fed into these n SVMs and the one with the highest output score is selected as the final class. Each SVM used an exponential radial basis function kernel with C = 20 and σ = 7, where C mediates the trade-off between maximizing the margin and minimizing the training error, and σ is the parameter in the expression:

$$K(x, y) = \exp\left(-\frac{\lVert x - y \rVert}{2\sigma^2}\right)$$
The kernel function K is a distance function for two feature vectors x and y. The multi-class SVM produces uncalibrated scores that are expected to be positively correlated with the confidence of the assignment but which are not directly comparable between classes. Thus, we use a sigmoid function to calibrate the output scores of the SVM. The parameters of the function can be found by minimizing the negative log likelihood of the training data . The resulting probabilities are then comparable between different classes. We associate with each node an evidence vector consisting of the probabilities for each class and a label corresponding to the class with largest evidence. The confidence of this label is defined as the difference between the two highest class probabilities.
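The calibration and labeling steps can be sketched as follows. The sigmoid has the standard two-parameter form 1 / (1 + exp(A f + B)) for an SVM output score f. The function names are hypothetical, and the parameter values in the usage note are placeholders rather than fitted values. In practice A and B would be fitted for each one-against-all SVM by minimizing the negative log likelihood of the training data.

```python
import math


def platt_sigmoid(score, a, b):
    """Map an uncalibrated SVM score to a probability."""
    return 1.0 / (1.0 + math.exp(a * score + b))


def label_and_confidence(scores, params):
    """scores: one SVM output per class; params: one (A, B) pair per class.

    Returns the evidence vector, the label (class with the largest
    evidence), and the confidence (gap between the two highest values).
    """
    evidence = [platt_sigmoid(s, a, b) for s, (a, b) in zip(scores, params)]
    ranked = sorted(range(len(evidence)), key=lambda c: evidence[c],
                    reverse=True)
    confidence = evidence[ranked[0]] - evidence[ranked[1]]
    return evidence, ranked[0], confidence
```

For example, with scores (2.0, -1.0, 0.0) and placeholder parameters A = -1, B = 0 for every class, the first class receives the label and the confidence is the difference between its probability and that of the third class.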
Creation of synthetic multi-cell images
To synthesize multi-cell images, we used the 2D HeLa image set composed of 10 classes of major subcellular location patterns (described above). To meet the assumption that a field is composed of only a small number of classes, we constructed trial fields consisting of cells drawn from all possible pairs of the 10 classes in the 2D HeLa dataset. For each trial, N1 and N2 cells were randomly picked from two different classes, for a total of 12 cells. Separate trials were conducted for N1 from 0 to 6.
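The composition of a trial field can be sketched as below. The function name and the representation of the image pool (a dictionary mapping class name to a list of image identifiers) are hypothetical, and sampling without replacement within a single field is an assumption.

```python
import random
from itertools import combinations


def make_trial_field(pool, class_a, class_b, n1, total=12, rng=None):
    """Draw n1 cells from class_a and total - n1 cells from class_b."""
    rng = rng or random.Random()
    n2 = total - n1
    cells = [(class_a, img) for img in rng.sample(pool[class_a], n1)]
    cells += [(class_b, img) for img in rng.sample(pool[class_b], n2)]
    rng.shuffle(cells)  # the order of cells carries no class information
    return cells


def all_trials(pool, total=12):
    """Yield one field per class pair and per N1 from 0 to total // 2."""
    for class_a, class_b in combinations(sorted(pool), 2):
        for n1 in range(0, total // 2 + 1):
            yield make_trial_field(pool, class_a, class_b, n1, total)
```

With 10 classes there are 45 class pairs, and with N1 ranging from 0 to 6 each pair contributes seven field compositions.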
For cross-validation, we split the data into five folds: one fold for the testing pool and the other four folds for the training pool. In the training pool, 50 images from each class were randomly chosen and for each trial, N1 and N2 cells were randomly picked from all possible pairs of classes out of the testing pool. Each of the five folds was in turn used for testing and the remaining four for training a multi-class SVM classifier. The classification accuracies were averaged for each pair of classes over all five folds. Some of the images are used neither for training nor for testing in any one fold, but the testing images may be used more than once overall due to lack of data. Because of this reuse, this evaluation method is similar to the usual five-fold cross validation procedure but not the same. In expectation it will report the correct accuracy for the classifier, but the variance of its reported accuracy is difficult to compute. To reduce this variance as much as possible we average 10 trials by randomly assigning images in the testing and training pools.
The data and source code used for the work described in this paper are available from http://murphylab.web.cmu.edu/software.
We thank Geoffrey Gordon for helpful discussions and critical reading of the manuscript. This work was supported in part by NIH grant R01 GM068845, NSF grant EF-0331657, and a research grant from the Commonwealth of Pennsylvania Tobacco Settlement Fund.
1. Park KJ, Kanehisa M: Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003, 19(13):1656–1663.
2. Chou KC, Cai YD: Prediction and classification of protein subcellular location-sequence-order effect and pseudo amino acid composition. J Cell Biochem 2003, 90(6):1250–1260.
3. Lu Z, Szafron D, Greiner R, Lu P, Wishart DS, Poulin B, Anvik J, Macdonell C, Eisner R: Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics 2004, 20(4):547–556.
4. Boland MV, Markey MK, Murphy RF: Automated recognition of patterns characteristic of subcellular structures in fluorescence microscopy images. Cytometry 1998, 33(3):366–375.
5. Murphy RF, Boland MV, Velliste M: Towards a Systematics for Protein Subcellular Location: Quantitative Description of Protein Localization Patterns and Automated Analysis of Fluorescence Microscope Images. Proc Int Conf Intell Syst Mol Biol 2000, 8:251–259.
6. Boland MV, Murphy RF: A Neural Network Classifier Capable of Recognizing the Patterns of all Major Subcellular Structures in Fluorescence Microscope Images of HeLa Cells. Bioinformatics 2001, 17(12):1213–1223.
7. Chen X, Murphy RF: Objective Clustering of Proteins Based on Subcellular Location Patterns. J Biomed Biotechnol 2005, 2005(2):87–95.
8. Chen X, Velliste M, Weinstein S, Jarvik JW, Murphy RF: Location proteomics - Building subcellular location trees from high resolution 3D fluorescence microscope images of randomly-tagged proteins. In Proc SPIE. Volume 4962. San Jose, CA, USA; 2003:298–306.
9. Murphy RF, Velliste M, Porreca G: Robust Numerical Features for Description and Classification of Subcellular Location Patterns in Fluorescence Microscope Images. J VLSI Sig Proc 2003, 35(3):311–321.
10. Huang K, Murphy RF: Boosting Accuracy of Automated Classification of Fluorescence Microscope Images for Location Proteomics. BMC Bioinformatics 2004, 5:78.
11. Chen X, Murphy RF: Robust Classification of Subcellular Location Patterns in High Resolution 3D Fluorescence Microscopy Images. In Proc 26th Intl Conf IEEE Eng Med Biol Soc. San Francisco, CA; 2004:1632–1635.
12. Felzenszwalb PF, Huttenlocher DP: Efficient Belief Propagation for Early Vision. Proc 2004 IEEE Conf on Computer Vision Pattern Recognition 2004, 1:261–268.
13. Taskar B, Abbeel P, Koller D: Discriminative Probabilistic Models for Relational Data. Uncertainty in Artificial Intelligence 2002, 485–492.
14. Vazquez A, Flammini A, Maritan A, Vespignani A: Global protein function prediction from protein-protein interaction networks. Nat Biotechnol 2003, 21(6):697–700.
15. Kumar A, Agarwal S, Heyman JA, Matson S, Heidtman M, Piccirillo S, Umansky L, Drawid A, Jansen R, Liu Y, Cheung KH, Miller P, Gerstein M, Roeder GS, Snyder M: Subcellular localization of the yeast proteome. Genes Develop 2002, 16(6):707–719.
16. Ghaemmaghami S, Huh WK, Bower K, Howson RW, Belle A, Dephoure N, O'Shea EK, Weissman JS: Global analysis of protein expression in yeast. Nature 2003, 425(6959):737–741.
17. Conrad C, Erfle H, Warnat P, Daigle N, Lorch T, Ellenberg J, Pepperkok R, Eils R: Automatic Identification of Subcellular Phenotypes on Human Cell Arrays. Genome Res 2004, 14(6):1130–1136.
18. Perlman ZE, Slack MD, Feng Y, Mitchison TJ, Wu LF, Altschuler SJ: Multidimensional Drug Profiling by Automated Microscopy. Science 2004, 306(5699):1194–1198.
19. Pearl J: Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann; 1988.
20. Huang C, Darwiche A: Inference in belief networks: a procedural guide. Intl J Approximate Reasoning 1996, 15(3):225–263.
21. Murphy K, Weiss Y, Jordan M: Loopy Belief Propagation for Approximate Inference - an Empirical Study. Uncertainty in Artificial Intelligence 1999, 467–475.
22. Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK: An Introduction to Variational Methods for Graphical Models. Machine Learning 1998, 37(2):183–233.
23. Mackay DJC: Introduction to Monte Carlo methods. In Learning in graphical models. Cambridge, MA, MIT Press; 1998:175–204.
24. Duda RO, Hart PE: Pattern Classification and Scene Analysis. New York, John Wiley & Sons; 1973:482.
25. Cortes C, Vapnik V: Support vector networks. Machine Learning 1995, 20:1–25.
26. Vapnik V: Statistical Learning Theory. New York City, Wiley; 1998.
27. Hsu CW, Lin CJ: A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 2002, 13:415–425.
28. Platt J: Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers, MIT Press 1999, 61–74.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.