- Research article
- Open Access
A comparative study of cell classifiers for image-based high-throughput screening
© Abbas et al.; licensee BioMed Central Ltd. 2014
- Received: 7 August 2014
- Accepted: 29 September 2014
- Published: 21 October 2014
Millions of cells are present in the thousands of images created in high-throughput screening (HTS). Biologists could classify each of these cells into a phenotype by visual inspection, but with millions of cells this visual classification task becomes infeasible. Instead, biologists train classification models on a few thousand visually classified example cells and iteratively improve the training data by visual inspection of the important misclassified phenotypes. Classification methods differ in performance and performance evaluation time. We present a comparative study of the computational performance of gentle boosting, joint boosting as implemented in CellProfiler Analyst (CPA), support vector machines (linear and radial basis function) and linear discriminant analysis (LDA) on two data sets of HT29 and HeLa cancer cells.
For the HT29 data set we find that gentle boosting, SVM (linear) and SVM (RBF) are close in performance but SVM (linear) is faster than gentle boosting and SVM (RBF). For the HT29 data set the average performance difference between SVM (RBF) and SVM (linear) is 0.42 %. For the HeLa data set we find that SVM (RBF) outperforms other classification methods and is on average 1.41 % better in performance than SVM (linear).
Our study proposes SVM (linear) for iterative improvement of the training data and SVM (RBF) for the final classifier to classify all unlabeled cells in the whole data set.
- Support Vector Machine
- Radial Basis Function
- Linear Discriminant Analysis
- HT29 Cell
- Support Vector Machine Classifier
The technology of high-throughput screening has facilitated many biological fields and has become a widely used method in drug discovery. It assists scientists in conducting millions of chemical as well as genetic tests to study biological pathways. Cell biology is one of the fields currently focusing on the analysis of massive amounts of cell image data produced by high-throughput screening [1–4]. Biologists study the morphology of these cells and can classify their phenotypes by visual inspection under a microscope. The sheer volume of cell image data has triggered the need for automatic methods to handle it.
Machine learning and data mining have the potential to objectively and effectively analyze the massive amounts of image data. In recent years, many studies have shown advantages of using classification methods to classify images based on features derived from them [2, 6–11]. Examples of classification methods are the Support Vector Machine (SVM), the gentle boosting classifier, Linear Discriminant Analysis (LDA), the K-nearest neighbor (KNN) classifier, the multi-layered perceptron, Artificial Neural Networks (ANNs) and the decision tree classifier [11–16].
Usually, there are three steps involved in the classification of cells, as shown in Figure 1 by Jones et al. The first step is segmentation and feature calculation. The second step concerns the training of classification models on a training set and their performance evaluation with cross-validation. The training set is a subset of a few thousand cells visually classified by a biologist. The third step is the classification of the whole screen using the best performing classifier from step 2.
Open source tools for high-throughput screening
CellProfiler & CP Analyst 
Weighted Nearest Neighbor
Many image features
Enhanced CellClassifier 
Supervised Spectral Clustering
CellMorph, EBImage 
Link to machine
Hidden Markov Model
R package for SVM
Nearest Neighbor, Random Forest,
User friendly and extensible
SVM and Decision Trees
To the best of our knowledge, there is no study that compares the performance of different classification methods and their suitability in an iterative feedback and machine learning setting for high-throughput screening of images. In this paper we compare classification methods based on accuracy and cross-validation time. We also explore how performance and computational time vary with a different number of phenotypes. We use two data sets of HT29 and HeLa cancer cells that have different numbers of features and phenotypes. We investigate which classifier is a good choice in terms of performance and cross-validation time. Cross-validation time is important because it is the time needed to evaluate the performance of a classifier and cross-validation needs to be done many times in training a classifier in an iterative fashion. The next part describes the data sets, the classification methods and the approach used in this study. The last part consists of results and discussion.
For this study, we used two data sets. The first data set contains HT29 colon cancer cells; it was first published by Moffat et al. and is available as image set BBBC018v1 from the Broad Bioimage Benchmark Collection. Cells were stained for DNA, actin and phospho-histone proteins. DNA was stained with Hoechst 33342 fluorescent dye. Actin proteins were stained with a fluorescent phalloidin dye while phospho-histone proteins were stained with a fluorescently tagged antibody. Carpenter et al. developed the open source software package CellProfiler, with which they identified about 8.3 million cells in 40,000 images of the HT29 data set. Each cell has a set of 615 features, which are shape, intensity and texture features of the DNA, actin and phospho-histone (ph3) channels. These features consist of geometric (extension, eccentricity, axis lengths, size and size ratio between cell and nucleus etc.), Haralick (angular moments, contrast, correlation, variance and entropy etc.) and Zernike features. The HT29 data set contains linearly dependent features because some features were derived from other features. This linear dependency poses no problem for the SVM and boosting classifiers, but is problematic for standard LDA.
HT29 colon cancer cells with 14 phenotypes
Actin blebs (AB)
Actin dots (AD)
Anaphase-Telophase (AT)
Angular cell edges (ACE)
Crescent nuclei (CN)
Large spread cells (LSC)
Long projections (LP)
Peas in a pod (PIP)
Peripheral actin (PA)
Phospho-Histone H3 dots (PHD)
HeLa cancer cells with 10 phenotypes
Actin fiber (AF)
Big cells (BC)
Condensed cells (C)
Membrane blebbing (MB)
Normal cells (N)
Protrusion and elongation (P)
Tables 2 and 3 show that each phenotype is represented by a different number of cells which makes the data sets class imbalanced. The HT29 data set suffers from greater class imbalance than the HeLa data set. In case of the HT29 data set, the phenotype with the largest number of cells (metaphase) is about 16 times bigger than the phenotype with the smallest number of cells (peas in a pod), while in the case of the HeLa cells the phenotype with the largest number of cells (normal cells) is about 5 times bigger than the phenotype with the smallest number of cells (telophase). To make sure that the relative frequencies among phenotypes remain roughly the same across all folds, we used 20-fold cross-validation with stratified sampling on the class variables.
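The stratified fold assignment described above can be sketched as follows. This is a minimal illustration in plain Python, not the procedure used in the study (which was run from Matlab); the function name `stratified_folds` and the toy class counts are assumptions chosen only to mirror the roughly 16-to-1 imbalance mentioned in the text.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=20, seed=0):
    """Assign sample indices to n_folds folds so that every fold
    receives a near-equal share of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin over the folds; the offset
        # continues where the previous class stopped, which keeps the
        # overall fold sizes balanced as well.
        for i, idx in enumerate(indices):
            folds[(i + offset) % n_folds].append(idx)
        offset += len(indices)
    return folds

# Hypothetical toy labels: the largest class is 16 times the smallest.
labels = ["metaphase"] * 320 + ["peas_in_a_pod"] * 20 + ["actin_dots"] * 100
folds = stratified_folds(labels, n_folds=20)
rare_per_fold = [sum(labels[i] == "peas_in_a_pod" for i in f) for f in folds]
```

With plain random splitting, a rare phenotype could be absent from some folds; dealing each class separately guarantees that every fold in this toy example contains one cell of the rarest class.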
There is no single classification method which outperforms all other classification methods on all data sets. The list of classification methods is large and every method has its own strengths and limitations [12, 13]. In this study we include five classification methods: SVM (RBF), SVM (linear), gentle boosting, joint boosting (CPA) and LDA. We choose SVM (RBF), because it has been used in [8, 26] to classify the HeLa data set. Joint boosting (CPA) is included since it is part of the CellProfiler Analyst software applied in  to analyze the HT29 data set. The other three classifiers are included to check whether we can obtain similar performance with simpler classifiers. We include gentle boosting as a lean alternative to joint boosting (CPA) and SVM (linear) as an alternative to SVM (RBF). We include LDA because it is traditionally considered to be a good benchmark classifier. The details of the implementation and tuning of the parameters of the classifiers are as follows.
Joint boosting (CPA): A multi-class version of gentle boosting with shared regression stumps . This classifier learns to use common features shared across the phenotypes. The classifiers for each phenotype are trained jointly, rather than independently . CellProfiler Analyst (CPA) has implemented the idea of  without sharing features. In boosting, the classifiers are built using regression stumps. The learning time increases with increasing number of regression stumps. The manual of CellProfiler Analyst advises the use of 50 regression stumps and  has also used 50 regression stumps for the HT29 data set. In this study we also use 50 regression stumps for joint boosting (CPA). Since, as we will see below, the performance of joint boosting (CPA) with the recommended 50 regression stumps falls short, we also considered using the same method with 200 regression stumps. We will refer to those as joint boosting (CPA-50) and joint boosting (CPA-200). For joint boosting (CPA), we used CellProfiler Analyst 2.0 (r11710). This method uses the one-versus-all strategy for multi-class classification.
Gentle boosting: Boosting methods such as AdaBoost, Real AdaBoost, LogitBoost and gentle boost perform well on images or scenes cluttered with objects [15, 27, 28]. Boosting methods build a good classifier from many weak classifiers and are similar to decision trees in building classification rules [15, 28]. We use 50 regression stumps for gentle boosting. This method uses the one-versus-all strategy for multi-class classification and also uses multiple features with different thresholds and different weights for each phenotype [27, 28].
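To make the idea of boosting with regression stumps concrete, the following is a minimal sketch of binary gentle boosting in the spirit of Friedman et al. [28]. It is not the Matlab implementation used in this study; the function names and the toy data are illustrative assumptions, and a one-versus-all multi-class classifier would train one such model per phenotype.

```python
import math

def fit_stump(X, y, w):
    """Weighted least-squares regression stump:
    f(x) = a if x[j] > t else b. Returns (j, t, a, b).
    Assumes at least one feature takes two distinct values."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between distinct values,
        # so both sides of every split are non-empty.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            w_hi = sum(wi for row, wi in zip(X, w) if row[j] > t)
            w_lo = sum(wi for row, wi in zip(X, w) if row[j] <= t)
            a = sum(wi * yi for row, yi, wi in zip(X, y, w) if row[j] > t) / w_hi
            b = sum(wi * yi for row, yi, wi in zip(X, y, w) if row[j] <= t) / w_lo
            err = sum(wi * (yi - (a if row[j] > t else b)) ** 2
                      for row, yi, wi in zip(X, y, w))
            if best is None or err < best[0]:
                best = (err, j, t, a, b)
    return best[1:]

def gentle_boost(X, y, n_stumps=50):
    """Fit n_stumps regression stumps; labels y must be -1 or +1."""
    n = len(X)
    w = [1.0 / n] * n
    stumps = []
    for _ in range(n_stumps):
        j, t, a, b = fit_stump(X, y, w)
        stumps.append((j, t, a, b))
        # Re-weight: samples the new stump gets wrong gain weight.
        w = [wi * math.exp(-yi * (a if row[j] > t else b))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return stumps

def boost_score(stumps, row):
    """Sum of stump outputs; the sign gives the predicted class."""
    return sum(a if row[j] > t else b for j, t, a, b in stumps)

# Hypothetical toy data: feature 0 separates the two classes.
X_toy = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.2], [0.7, 0.3], [0.8, 0.7], [0.9, 0.1]]
y_toy = [-1, -1, -1, 1, 1, 1]
model = gentle_boost(X_toy, y_toy, n_stumps=5)
predictions = [1 if boost_score(model, r) > 0 else -1 for r in X_toy]
```

The exhaustive threshold search is written for readability rather than speed; its cost grows with the number of features and stumps, which is consistent with the observation that learning time increases with the number of regression stumps.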
Support vector machine with radial basis function (RBF): Generally, the SVM (RBF) classifier performs better than decision trees, neural networks and K-nearest neighbor classifiers and is more tolerant to irrelevant and interdependent features [9, 12, 13, 29, 30]. SVM (RBF) is a useful method when data is not linearly separable but is slower because of the optimization of the hyper parameters C and γ. The hyper parameter C is the cost parameter which gives a trade-off between training error and model complexity [31, 32]. The higher the value of C, the higher the cost for non-separable examples. The hyper parameter γ is the inverse of the width of the radial basis function. For selection of parameters C and γ, a grid search was performed on values $C \in \{2^{-1}, 2^{0}, \ldots, 2^{6}\}$ and $\gamma \in \{2^{-5}, 2^{-4}, \ldots, 2^{1}\}$ for both data sets. A 5-fold cross-validation was performed to select the hyper parameters. In this study, the LIBSVM 3.17 library is used, which implements the one-against-one strategy for multiclass classification.
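The size of the hyper parameter search explains much of the extra cost of SVM (RBF). The sketch below builds the two grids of powers of two described above and runs a generic grid search. The scoring function `toy_cv_accuracy` is a stand-in invented for illustration; in practice it would be replaced by a function that trains an SVM and returns its mean 5-fold cross-validation accuracy.

```python
import math
from itertools import product

# Powers of two covering the ranges stated in the text.
C_GRID = [2.0 ** k for k in range(-1, 7)]      # 2^-1 ... 2^6
GAMMA_GRID = [2.0 ** k for k in range(-5, 2)]  # 2^-5 ... 2^1

def grid_search(cv_accuracy, c_grid=C_GRID, gamma_grid=GAMMA_GRID):
    """Return (best_accuracy, best_C, best_gamma) over the full grid.

    cv_accuracy(C, gamma) is supplied by the caller and should return
    the mean cross-validation accuracy for that parameter pair."""
    best = None
    for C, gamma in product(c_grid, gamma_grid):
        acc = cv_accuracy(C, gamma)
        if best is None or acc > best[0]:
            best = (acc, C, gamma)
    return best

def toy_cv_accuracy(C, gamma):
    """Hypothetical stand-in score that peaks at C = 4, gamma = 0.25."""
    return 1.0 - 0.01 * (abs(math.log2(C) - 2) + abs(math.log2(gamma) + 2))

n_pairs = len(C_GRID) * len(GAMMA_GRID)  # parameter pairs to evaluate
n_fits = n_pairs * 5                     # model fits with 5-fold CV
best = grid_search(toy_cv_accuracy)
```

With 8 values of C and 7 values of γ, one tuning run evaluates 56 parameter pairs, that is 280 model fits under 5-fold cross-validation, compared with a single one-dimensional search over C for a linear SVM.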
Linear support vector machine (SVM linear): SVM (linear) is an alternative to SVM (RBF) for large data sets on which classification with and without a nonlinear mapping gives similar performance [12, 33]. SVM (linear) requires only one hyper parameter, C, which reduces the training and testing times. A 5-fold cross-validation was performed to select the hyper parameter. The search for the optimal hyper parameter C was performed on values $C \in \{2^{-5}, 2^{-4}, \ldots, 2^{6}\}$ for both data sets. In this study we used the Liblinear 1.94 library, which uses a one-vs-all approach for multiclass classification. This library has different versions of regularized linear classification. We used the L2 regularized linear classification with the L2 loss function because it is computationally fast. The performance was similar for the other loss functions.
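The two multi-class strategies mentioned for the SVM libraries differ in how many binary classifiers they train. As a small worked example (the helper name is an assumption), the counts for the two data sets in this study are:

```python
def n_binary_classifiers(n_classes, strategy):
    """Number of binary classifiers trained for a multi-class problem."""
    if strategy == "one-vs-all":
        # One classifier per class: that class against all others.
        return n_classes
    if strategy == "one-vs-one":
        # One classifier per unordered pair of classes.
        return n_classes * (n_classes - 1) // 2
    raise ValueError("unknown strategy: " + strategy)

# 14 phenotypes (HT29) and 10 phenotypes (HeLa), as stated in the text.
ht29 = (n_binary_classifiers(14, "one-vs-all"), n_binary_classifiers(14, "one-vs-one"))
hela = (n_binary_classifiers(10, "one-vs-all"), n_binary_classifiers(10, "one-vs-one"))
```

One-against-one trains more classifiers (91 versus 14 for HT29, 45 versus 10 for HeLa), but each of them sees only the cells of two phenotypes, so the counts alone do not determine which strategy is faster.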
Linear discriminant analysis (LDA): LDA is a useful method when features are linearly independent and normally distributed. LDA tries to maximize the separation between classes by estimating the class boundaries as a linear combination of the features. LDA does not require any parameter tuning. As the HT29 data set contained linearly dependent features, we used the Moore-Penrose pseudo inverse for the covariance matrix which is provided in the Matlab implementation of LDA.
For performance evaluation of each classifier, 20-fold cross-validation was performed. The performance (accuracy) of a classifier is defined as the number of correctly classified cells divided by the total number of cells. For the SVM (RBF), SVM (linear), gentle boosting and LDA classifiers, the time taken by the 20-fold cross-validation was recorded using the tic/toc functions available in Matlab. The tic/toc functions measure wall-clock time. The cross-validation time also includes the time needed to tune the parameters required by a classifier. The implementation of joint boosting (CPA) is in Python while the other classifiers are implemented in C++ and called from Matlab using wrapper functions. The Python implementation of joint boosting uses the time function, which is similar to the tic/toc functions of Matlab. Features of both data sets were normalized and then scaled between 0 and 1. The analysis was performed on a MacBook Pro with an Intel Core i5 CPU at 2.4 GHz, using Matlab version R2013a installed on OS X 10.9.3 (13D65).
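The evaluation ingredients described above (scaling features to the range 0 to 1, accuracy as the fraction of correctly classified cells, and wall-clock timing) can be sketched in a few lines. This is an illustrative Python analogue of the Matlab workflow, with `time.perf_counter` standing in for tic/toc; the function names and toy values are assumptions.

```python
import time

def min_max_scale(column):
    """Scale one feature column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def accuracy(y_true, y_pred):
    """Number of correctly classified cells divided by the total number."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

start = time.perf_counter()                # analogous to tic
scaled = min_max_scale([2.0, 4.0, 6.0, 10.0])
acc = accuracy(["AB", "AD", "AT", "AB"], ["AB", "AD", "AB", "AB"])
elapsed = time.perf_counter() - start      # analogous to toc
```

In a real cross-validation the scaling limits would normally be computed on the training folds only and then applied to the held-out fold; the text does not say which variant was used.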
To put the cross-validation time in perspective, we timed (1) image segmentation and feature extraction and (2) the labeling of all cells in a screen. The software packages and data related to the HeLa data set are available on . We took the data from this site and reran the analysis to find the time taken by segmentation and feature measurements. It took about 4321 seconds to segment and calculate features of 32778 cells in 516 images. Each image size was 670×510 pixels. Since we had unlabeled data of the HeLa data set, we trained the classifiers with the optimal parameters obtained through cross-validation and noted the time used by the classifiers to label all unlabeled data. Labeling about 1.6 million cells took about 7, 11, 20 and 324 seconds for gentle boosting, SVM (linear), LDA and SVM (RBF), respectively.
Joint boosting (CPA-50) has the worst performance of all classifiers under consideration. To find an explanation for the bad performance of joint boosting (CPA-50), we increased the number of regression stumps from 50, as used by  and advised by the CellProfiler manual, to 200. In case of 14 phenotypes of HT29 cells, joint boosting (CPA) with 200 regression stumps gives an accuracy of 86% in 19047 seconds. In case of 10 phenotypes of HeLa cells, joint boosting (CPA) with 200 regression stumps reaches an accuracy of 75% in 1631 seconds. We tried even more regression stumps, but did not find any further substantial performance improvement. In any case, by increasing the number of regression stumps, the accuracy of joint boosting (CPA) does become close to the other classifiers as shown by the line for joint boosting (CPA-200) in Figure 2. The increase in number of regression stumps increases the performance evaluation time considerably and makes joint boosting (CPA) an order of magnitude slower than its competitors.
LDA is the fastest among all classifiers in cross-validation but suffers from low performance, especially in the case of more than seven phenotypes. Cross-validation time is the same for SVM (linear) and gentle boosting, but gentle boosting suffers from lower performance in the case of the HeLa data set as shown in Figure 2. For the HT29 data set, SVM (linear) has an overall similar performance to SVM (RBF) and gentle boosting. SVM (RBF) is a slow method which consumes time in a grid search of hyper parameters, and there is little performance gain over other classifiers in the case of HT29 cells. For HT29 cells, the average performance difference between SVM (RBF) and SVM (linear) is 0.42%. On average across all numbers of phenotypes, SVM (linear) is about 15 times faster than SVM (RBF) in the case of the HT29 data set. For HeLa cells, SVM (RBF) is slower than SVM (linear), gentle boosting and LDA, but has better performance. For HeLa cells, the average difference in performance between SVM (RBF) and SVM (linear) is 1.41%. On average across all numbers of phenotypes, SVM (linear) is about 12 times faster than SVM (RBF) in the case of the HeLa data set.
Our study finds that the difference in performance is small between SVM (linear) and SVM (RBF) but that SVM (linear) is faster than SVM (RBF) on both data sets. This finding leads us to investigate further which of these two classifiers is suitable in the iterative approach of training classifiers and their performance evaluation using cross-validation. To answer this question, we investigated whether the cells misclassified by SVM (RBF) are a subset of the cells misclassified by SVM (linear). We ran 20-fold cross-validation 100 times on both data sets. We call a cell misclassified if it was wrongly classified in 80 or more of the 100 runs. For the HT29 data set, we find that 75% of the cells misclassified by SVM (RBF) are also misclassified by SVM (linear). For the HeLa data set, we find that 87% of the cells misclassified by SVM (RBF) are also misclassified by SVM (linear). Since the fraction of cells misclassified only by SVM (RBF) is relatively small, this suggests that it is safe to use the faster classifier in the iterative improvement of the classifier. Once biologists are satisfied with the labeled phenotypes of the training data and the classifier, they can use SVM (RBF) to classify all unlabeled cells in the whole data set. In this approach, the iterative phase would be fast with SVM (linear) and the final labeling (testing phase) would have the performance gain of SVM (RBF).
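The overlap analysis described above reduces to two set operations: select the cells that were wrong in at least 80 of the 100 runs, then compute which fraction of one classifier's set also appears in the other's. A minimal sketch follows; the cell identifiers and error counts are invented for illustration and are not data from the study.

```python
def consistently_misclassified(error_counts, min_runs=80):
    """Cells wrongly classified in at least min_runs cross-validation runs.

    error_counts maps a cell identifier to the number of runs (out of 100)
    in which that cell was misclassified."""
    return {cell for cell, n in error_counts.items() if n >= min_runs}

def shared_fraction(errors_a, errors_b):
    """Fraction of the cells in errors_a that are also in errors_b."""
    if not errors_a:
        return 0.0
    return len(errors_a & errors_b) / len(errors_a)

# Hypothetical error counts per cell for the two classifiers.
rbf_counts = {"c1": 95, "c2": 88, "c3": 80, "c4": 100, "c5": 12}
lin_counts = {"c1": 97, "c2": 90, "c3": 85, "c4": 40, "c6": 99}

rbf_errors = consistently_misclassified(rbf_counts)
lin_errors = consistently_misclassified(lin_counts)
fraction = shared_fraction(rbf_errors, lin_errors)
```

In this toy example three of the four cells misclassified by the RBF model are also misclassified by the linear model, giving a shared fraction of 0.75. The measure is not symmetric: swapping the arguments answers the reverse question.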
Several other studies have evaluated classification performance based on images obtained in high-throughput screening [4, 9, 10, 12, 38, 39]. Classification methods are mostly applied for the classification of sub-cellular protein localization, cell phase, cell phenotype and cellular compounds on data sets obtained in high-throughput screening [12, 39]. Previous studies have applied different methods for classification of different number of phenotypes with different number of features [9, 10, 12, 38, 39]. The geometric, Haralick and Zernike features are the most commonly used features for image-based high-throughput screening of cells in different software packages, but with different segmentation, feature selection and classification methods [5, 6, 24]. Our study recommends software packages to include both SVM (linear) and SVM (RBF) classifiers to help biologists in performing a fast and efficient analysis of high-throughput data.
We imagine a partition of labor of analyzing a high-throughput screen in three steps as presented by Jones et al. (2009) in Figure 1. The first step consists of image segmentation and feature calculation. This is a computation intensive step and took about 72 minutes for a subset of the HeLa data set consisting of 516 images of 670 by 510 pixels with 232K cells. While computation intensive, this step typically does not involve much manual labor. An investigator can try several image segmentation algorithms and judge the quality of the segmentation. Importantly, this step is independent of later steps.
The second step involves iterative training of a classifier. Here an investigator is presented with a set of randomly selected images and the investigator provides the phenotypes (labels) to the computer. From this initial set, the classifier is trained and its performance (accuracy) is computed with cross validation. This performance is evaluated by the investigator who can then decide to label more cells either randomly selected by the computer or selected from certain phenotypes in which the investigator is interested. Either way, as this iterative training of the classifier might be done many times, the classification algorithm should be relatively fast, possibly at the expense of a reduction of testing accuracy. As we have shown SVM (linear) to be 13 times faster than SVM (RBF) at the expense of a reduction in accuracy of 0.9% (average over both data sets and all number of phenotypes), we propose the use of SVM (linear) for this second step.
The third step is classification of the phenotypes of all cells in the screen. Given its small but clear benefit in classification accuracy, we advocate the use of SVM (RBF), as others have done [8–10, 26, 38]. As an extension, we investigated whether a classifier's notion of its own classification accuracy, expressed as posterior probabilities, can be used to screen for "high quality" cells. Indeed, as we show in Figure 5, thresholding the posterior probabilities improves the objective accuracy. Thus, in case an investigator has the luxury of a large number of cells of a particular phenotype in a particular experimental condition, he or she can decide to focus on the cells that have the particular phenotype with more certainty.
We did not draw any conclusion from the similarities among phenotypes shown in Figure 1. Some previous studies find cell-to-cell variations among cells of the same phenotype. In future studies it would be interesting to explore the performance of more classification methods on other image-based high-throughput data sets with more focus on the similarities between phenotypes and the cell-to-cell variations among cells of the same phenotype.
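The effect of thresholding posterior probabilities can be illustrated with a short sketch: keep only the cells whose predicted class has a posterior at or above a threshold, then report the accuracy on the kept cells together with the fraction of cells kept. The function name, labels and probabilities below are hypothetical and only demonstrate the trade-off.

```python
def accuracy_above_threshold(y_true, y_pred, posterior, threshold):
    """Accuracy and coverage restricted to confident predictions.

    posterior[i] is the classifier's probability for its predicted class
    of cell i. Returns (accuracy, coverage); accuracy is None when no
    cell passes the threshold."""
    kept = [(t, p) for t, p, q in zip(y_true, y_pred, posterior) if q >= threshold]
    if not kept:
        return None, 0.0
    acc = sum(t == p for t, p in kept) / len(kept)
    coverage = len(kept) / len(y_true)
    return acc, coverage

# Hypothetical labels, predictions and posterior probabilities.
y_true = ["N", "N", "BC", "MB", "N", "BC"]
y_pred = ["N", "BC", "BC", "MB", "N", "N"]
posterior = [0.95, 0.55, 0.90, 0.85, 0.70, 0.60]

all_cells = accuracy_above_threshold(y_true, y_pred, posterior, 0.0)
confident = accuracy_above_threshold(y_true, y_pred, posterior, 0.8)
```

Raising the threshold from 0 to 0.8 lifts the accuracy in this toy example from 4/6 to 1.0 while halving the number of cells retained, which is the trade-off an investigator with abundant cells can afford.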
In summary, our study advocates that among the considered classifiers and data sets in this study, SVM (linear) is the appropriate choice for high-throughput screening data sets in iterative training of the classifier while SVM (RBF) is the appropriate choice for the final classifier to classify all cells including unlabeled cells.
This research was funded by an HEC grant from Pakistan to Syed Saiden Abbas. Tjeerd Dijkstra was supported by an NWO Computational Life Sciences grant. We thank Joris Kraak for his early work. We are thankful to Anne Carpenter for giving early access to the HT29 image data set.
- Jones TR, Carpenter AE, Golland P, Sabatini DM: Methods for high-content, high-throughput image-based cell screening. MIAAB Workshop Proceedings. 2006, 65-72.
- Conrad C, Gerlich DW: Automated microscopy for high-content RNAi screening. J Cell Biol. 2010, 188 (4): 453-461. 10.1083/jcb.200910105.
- Moffat J, Grueneberg DA, Yang X, Kim SY, Kloepfer AM, Hinkle G, Piqani B, Eisenhaure TM, Luo B, Grenier JK, Carpenter AE, Foo SY, Stewart SA, Stockwell BR, Hacohen N, Hahn WC, Lander ES, Sabatini DM, Root DE: A lentiviral RNAi library for human and mouse genes applied to an arrayed viral high-content screen. Cell. 2006, 124 (6): 1283-1298. 10.1016/j.cell.2006.01.040.
- Buggenthin F, Marr C, Schwarzfischer M, Hoppe P, Hilsenbeck O, Schroeder T, Theis F: An automatic method for robust and fast cell detection in bright field images from high-throughput microscopy. BMC Bioinformatics. 2013, 14: 297. 10.1186/1471-2105-14-297.
- Shamir L, Delaney JD, Orlov N, Eckley DM, Goldberg IG: Pattern recognition software and techniques for biological image analysis. PLoS Comput Biol. 2010, 6 (11): e1000974. 10.1371/journal.pcbi.1000974.
- Zhou J, Lamichhane S, Sterne G, Ye B, Peng H: BIOCAT: a pattern recognition platform for customizable biological image classification and annotation. BMC Bioinformatics. 2013, 14: 291. 10.1186/1471-2105-14-291.
- Jones TR, Carpenter AE, Lamprecht MR, Moffat J, Silver SJ, Grenier JK, Castoreno AB, Eggert US, Root DE, Golland P, Sabatini DM: Scoring diverse cellular morphologies in image-based screens with iterative feedback and machine learning. Proc Natl Acad Sci USA. 2009, 106 (6): 1826-1831. 10.1073/pnas.0808843106.
- Fuchs F, Pau G, Kranz D, Sklyar O, Budjan C, Steinbrink S, Horn T, Pedal A, Huber W, Boutros M: Clustering phenotype populations by genome-wide RNAi and multiparametric imaging. Mol Syst Biol. 2010, 6: 370.
- Hamilton NA, Pantelic RS, Hanson K, Teasdale RD: Fast automated cell phenotype image classification. BMC Bioinformatics. 2007, 8: 110. 10.1186/1471-2105-8-110.
- Nanni L, Lumini A: A reliable method for cell phenotype image classification. Artif Intell Med. 2008, 43 (2): 87-97. 10.1016/j.artmed.2008.03.005.
- Gul-Mohammed J, Arganda-Carreras I, Andrey P, Galy V, Boudier T: A generic classification-based method for segmentation of nuclei in 3D images of early embryos. BMC Bioinformatics. 2014, 15: 9. 10.1186/1471-2105-15-9.
- Huang K, Murphy R: Boosting accuracy of automated classification of fluorescence microscope images for location proteomics. BMC Bioinformatics. 2004, 5: 78. 10.1186/1471-2105-5-78.
- Kotsiantis SB: Supervised machine learning: a review of classification techniques. Informatica. 2007, 31 (3): 249-268.
- Kiang MY: A comparative assessment of classification methods. Decis Support Syst. 2003, 35 (4): 441-454. 10.1016/S0167-9236(02)00110-0.
- Torralba A, Murphy KP, Freeman WT: Sharing visual features for multiclass and multiview object detection. IEEE Trans Pattern Anal Mach Intell. 2007, 29 (5): 854-869.
- Somfai G, Tatrai E, Laurik L, Varga B, Olvedy V, Jiang H, Wang J, Smiddy W, Somogyi A, DeBuc D: Automated classifiers for early detection and diagnosis of retinopathy in diabetic eyes. BMC Bioinformatics. 2014, 15: 106. 10.1186/1471-2105-15-106.
- Carpenter AE, Jones TR, Lamprecht MR, Clarke C, Kang IH, Friman O, Guertin DA, Chang JH, Lindquist RA, Moffat J, Golland P, Sabatini DM: CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biol. 2006, 7 (10): R100. 10.1186/gb-2006-7-10-r100.
- Orlov N, Shamir L, Macura T, Johnston J, Eckley DM, Goldberg IG: WND-CHARM: Multi-purpose image classification using compound image transforms. Pattern Recognit Lett. 2008, 29 (11): 1684-1693. 10.1016/j.patrec.2008.04.013.
- Misselwitz B, Strittmatter G, Periaswamy B, Schlumberger MC, Rout S, Horvath P, Kozak K, Hardt WD: Enhanced CellClassifier: a multi-class classification tool for microscopy images. BMC Bioinformatics. 2010, 11: 30. 10.1186/1471-2105-11-30.
- FARSIGHT toolkit. [http://www.farsight-toolkit.org/wiki/FARSIGHT_Toolkit]
- Pau G, Fuchs F, Sklyar O, Boutros M, Huber W: EBImage–an R package for image processing with applications to cellular phenotypes. Bioinformatics. 2010, 26 (7): 979-981. 10.1093/bioinformatics/btq046.
- Held M, Schmitz MH, Fischer B, Walter T, Neumann B, Olma MH, Peter M, Ellenberg J, Gerlich DW: CellCognition: time-resolved phenotype annotation in high-throughput live cell imaging. Nat Methods. 2010, 7 (9): 747-754. 10.1038/nmeth.1486.
- CellXpress. [http://www.cellxpress.org]
- Sommer C, Strähle C, Köthe U, Hamprecht FA: ilastik: interactive learning and segmentation toolkit. Eighth IEEE International Symposium on Biomedical Imaging (ISBI 2011). Proceedings. 2011, 230-233.
- Ljosa V, Sokolnicki KL, Carpenter AE: Annotated high-throughput microscopy image sets for validation. Nat Methods. 2012, 9 (7): 637. 10.1038/nmeth.2083.
- Coelho LP, Kangas JD, Naik AW, Osuna-Highley E, Glory-Afshar E, Fuhrman M, Simha R, Berget PB, Jarvik JW, Murphy RF: Determining the subcellular location of new proteins from microscope images using local features. Bioinformatics. 2013, 29 (18): 2343-2349. 10.1093/bioinformatics/btt392.
- Sebastien P: A Matlab code for Gentle adaBoost classifier with two different weak-learners: Decision Stump and Perceptron. Mathworks. 2011, [http://www.mathworks.nl/matlabcentral/fileexchange/22997-multiclass-gentleadaboosting]
- Friedman J, Hastie T, Tibshirani R: Additive logistic regression: a statistical view of boosting. Ann Stat. 2000, 95 (2): 337-407.
- Hua S, Sun Z: Support vector machine approach for protein subcellular localization prediction. Bioinformatics. 2001, 17: 721-728. 10.1093/bioinformatics/17.8.721.
- Chang CC, Lin CJ: LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol. 2011, 2: 27:1-27:27.
- Alpaydin E: Introduction to Machine Learning (Adaptive Computation and Machine Learning). 2004, The MIT Press, ISBN: 026201243
- Joachims T: Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. 2002, Norwell: Kluwer Academic Publishers
- Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ: LIBLINEAR: a Library for Large Linear Classification. J Mach Learn Res. 2008, 9: 1871-1874.
- CellMorph. [http://www.ebi.ac.uk/huber-srv/cellmorph/]
- Duin RPW, Tax DMJ: Advances in Pattern Recognition, Volume 1451. 1998, Springer Berlin Heidelberg
- Lin HT, Lin CJ, Weng R: A note on Platt's probabilistic outputs for support vector machines. Mach Learn. 2007, 68 (3): 267-276. 10.1007/s10994-007-5018-6.
- Platt JC: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers. 1999, MIT Press, 61-74.
- Kummel A, Selzer P, Beibel M, Gubler H, Parker CN, Gabriel D: Comparison of multivariate data analysis strategies for high-content screening. J Biomol Screen. 2011, 16 (3): 338-347. 10.1177/1087057110395390.
- Zhou X, Wong STC: Informatics challenges of high-throughput microscopy. IEEE Signal Process Mag. 2006, 23: 63-72.
- Altschuler SJ, Wu LF: Cellular heterogeneity: do differences make a difference?. Cell. 2010, 141 (4): 559-563. 10.1016/j.cell.2010.04.033.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.