cellXpress: a fast and user-friendly software platform for profiling cellular phenotypes
- Danai Laksameethanasan†1,
- Rui Zhen Tan†1,
- Geraldine Wei-Ling Toh1, 2 and
- Lit-Hsin Loo1, 3Email author
© Laksameethanasan et al.; licensee BioMed Central Ltd. 2013
Published: 22 October 2013
High-throughput, image-based screens of cellular responses to genetic or chemical perturbations generate huge numbers of cell images. Automated analysis is required to quantify and compare the effects of these perturbations. However, few of the current freely-available bioimage analysis software tools are optimized for efficient handling of these images. Even fewer of them are designed to transform the phenotypic features measured from these images into discriminative profiles that can reveal biologically meaningful associations among the tested perturbations.
We present a fast and user-friendly software platform called "cellXpress" to segment cells, measure quantitative features of cellular phenotypes, construct discriminative profiles, and visualize the resulting cell masks and feature values. We have also developed a suite of library functions to load the extracted features for further customizable analysis and visualization under the R computing environment. We systematically compared the processing speed, cell segmentation accuracy, and phenotypic-profile clustering performance of cellXpress to other existing bioimage analysis software packages or algorithms. We found that cellXpress outperforms these existing tools on three different bioimage datasets. We estimate that cellXpress could finish processing a genome-wide gene knockdown image dataset in less than a day on a modern personal desktop computer.
The cellXpress platform is designed to make fast and efficient high-throughput phenotypic profiling more accessible to the wider biological research community. The cellXpress installation packages for 64-bit Windows and Linux, user manual, installation guide, and datasets used in this analysis can be downloaded freely from http://www.cellXpress.org.
High-throughput, image-based phenotypic profiling enables multi-parameter measurements of cellular responses to large-scale genetic or chemical perturbations. These measurements are useful for unraveling complex changes in cellular morphology and protein subcellular localization , and have been used to study drug responses , cell division , cytoskeleton remodelling , and endocytosis . Several free software tools are currently available for analyzing microscopy images. They include CellProfiler , ImageJ , BioImageXD , Icy , OMERO , and EBImage .
Overall software design and architecture
The cellXpress platform has the following advantages over existing bioimage analysis software packages. First, to achieve higher processing speeds, we have developed the core processing engine of cellXpress completely in C/C++ based on a highly-optimized computer vision library, OpenCV , and efficient single-instruction-multiple-data (SIMD) instruction sets . To fully utilize all the processing units in modern multi-core processors, we have also developed a dynamic job scheduler based on the OpenMP interface . The scheduler manages a job queue for cell segmentation and feature extraction jobs, and automatically assigns pending jobs to free processors or cores. Thus, all the available processors will be fully ultilized by cellXpress.
Second, to extract diverse types of features, we have developed a new subcellular region detection algorithm that can automatically identify eight different subcellular regions based on the nuclear and cell masks from each segmented cell. These sub-cellular regions include whole-cell, nucleus, nuclear boundary, inner nucleus, peri-nucleus, cytoplasm, cytoplasmic boundary, and inner cytoplasm. The cellXpress platform automatically computes four different feature types, namely morphology, intensity, region-level intensity ratio and pixel-level intensity correlation, for each of the identified regions (see Feature Extraction). These diverse types of features will allow the quantification of complex protein subcellular localization patterns at the single-cell level.
Third, to perform profile construction or other custom data analyses, we have developed a suite of library functions called "cXlibrary" under the R computing environment . Users can import cellXpress data saved in HDF5 files into a R session and construct phenotypic profiles. The resulting data or profiles can then be used for supervised classification, unsupervised clustering, or other types of analysis. We also implemented phenotypic profiling algorithms to condense large numbers of raw features extracted in typical high-throughput studies into more concise and discriminative profiles, such as the support-vector-machine (SVM)-based "drug profiles" (d-profiles) [2, 14].
Seeded-watershed-based cell segmentation algorithm
We have implemented and optimized a seeded-watershed-based cell segmentation algorithm  for the cellXpress platform. The segmentation algorithm is general and was previously used to identify individual mouse fat cells, human cancer cells, and neutrophil-like cells from fluorescence microscopy images [2, 21, 22]. In brief, the algorithm consists of two major steps. The first step is to identify nuclear regions from the image background using a combination of h-dome operator , Laplacian-of-Gaussian edge detector, and Otsu's thresholding algorithm . Then, a watershed algorithm  is used to break apart connected nuclear regions. In the second step, a composite cell image obtained from the linear combination of the images of all fluorescence channels is used to identify cellular regions based on Otsu's thresholding algorithm. Finally, a seeded watershed algorithm that uses the nuclear regions as seeds is used to break apart connected cellular regions. In the future, we plan to include additional segmentation algorithms in the cellXpress platform.
The cellXpress platform has a flexible feature extraction module that can be used to measure cellular features based on different combinations of feature types, fluorescent markers, and subcellular regions. We have designed a user-friendly graphical interface to configure this feature extraction module (Figure 5) and automatically expand feature selections based on the chosen regions. For example, region-level intensity ratios will be computed for all possible pairs of the selected sub-cellular regions. Information about the markers and regions used in a feature is encoded in its name, which follows the following format: XXXX:YYYY:ZZZZ, where XXXX is the feature type, YYYY is the fluorescent markers used, and ZZZZ is the subcellular regions used. For example, "fraction_total_intensity:ERK:dna_region-cell_region" measures the fraction of total ERK intensity level in the nuclear region to the entire cellular region, and "total_intensity_ratio:Tubulin-ERK:nondna_region-dna_region" measures the ratio of total tubulin intensity level in the cytoplasmic region over total ERK intensity level in the nuclear region. The names of the extracted features are included in all cellXpress HDF5 data files or CSV export files. Users can load specific subsets of extracted features into the R environment by using the load_wells() function in the cXlibrary.
To transform extracted features into discriminative profiles, we implemented a support-vector-machine-(SVM)-based phenotypic profiling algorithm called "drug profiling" (d-profiling)  in the cellXpress platform. Given two sets of feature values for cells under treated and control conditions, respectively, the algorithm trains a linear SVM to obtain a hyperplane that optimally separates these two set of values in high-dimensional feature space. Then, the unit vector normal to the hyperplane is used as a profile to represent changes in the phenotypes of the treated cells with respect to the control cells . Our implementation is based on the LIBLINEAR library .
Comparisons with existing tools
To evaluate the performance of cellXpress, we considered several other alternative free biological image analysis software platforms (Figure 1). The functions of many of these platforms may be extended through third-party plugins or custom scripting/programming. However, most biological scientists will have limited resources or expertise in developing such custom plugins or programs. Therefore, we only considered built-in functions or plugins that are bundled with default installation packages. We chose to compare the performance of cellXpress (version pro 1.0)  to the Broad Institute's CellProfiler (version 2.0)  because they have the most similar functions (Figure 1). We also included NIH's ImageJ (version 1.47)  with plugins from the Fiji package  because it is a standard image analysis tool and widely used by biological scientists. We focused on evaluating the processing speed, cell segmentation accuracy, and profile clustering performance of these software packages.
Applications to Kc167, HT29, and HeLa datasets
To evaluate cell segmentation performance of cellXpress, we used two standard image benchmark datasets, namely Kc167 and HT29, which represent different cell types and numbers of image frames [31, 32]. The first dataset was collected from a Drosophila melanogaster cell line, Kc167. We used the dataset's DNA marker for detecting nuclear regions, and actin marker for detecting cellular regions. This dataset has three image frames (an image frame refers to an imaging position in a well), but we only used one of them for testing cell segmentation speed to mimic the situation when computation cannot be parallelized at the image-frame level. Each image has a resolution of 1000 × 1006 pixels, and there are ~200 cells per frame. The second dataset  was collected from a human colon cancer cell line, HT29. We only used the dataset's DNA marker for detecting nuclear regions, and actin marker for detecting cellular regions. The dataset was generated in a shRNA screen for finding mitotic gene regulators . It has 56 image frames, and was used to test cell segmentation when computation may be parallelized at the frame-level. Each image has a resolution of 512 × 512 pixels, and there are ~100 cells per frame. We followed the procedures recommended on the CellProfiler website , and used the original images and the provided pipeline without any further image pre-processing.
To evaluate the phenotypic-profiling performance of cellXpress, we used an image dataset from a previous high-throughput siRNA screen [34, 35] on HeLa cells stained for DNA, tubulin, and actin markers. The dataset was generated by transfecting HeLa cells with a genome-wide siRNA library for 48 hours, and used to predict functions of genes based on their knockdown phenotypes. There are four 670 × 510 pixel image frames per gene knockdown, each of which has around 50 cells. siRNAs for a non-human gene, renilla luciferase (Rluc), were used as negative controls. We selected 32 genes, which can be categorized into four groups representing structural components of actins or microtubules, or the synthesis machineries for RNAs or proteins (Additional file 1). The RNA and protein synthesis genes were selected from genes encoding the subunits of RNA polymerase II and ribosome, respectively. Microtubule structural components were selected from the α-tubulin, β-tubulin and γ-tubulin families. For structural components of actins, we included three actin isoforms (alpha, beta and gamma) and genes from the spectrin family, which are actin-crosslinking proteins that link the plasma membrane to the actin cytoskeleton .
Evaluation criteria for segmentation accuracy
where and are individual pixels within sets and , respectively; is the cardinality operator; and is the Euclidean norm.
where N is the total number of pixels in the segmentation mask M.
Generation of phenotypic profiles for HeLa dataset
To construct phenotypic profiles for HeLa cells, we first segmented the dataset using cellXpress. Actin and tubulin were used as cell markers and DNA as a nuclear marker for the watershed algorithm. Then, we measured the morphology, intensity, intensity ratio, and pixel-level intensity correlation features for actin and tubulin in the whole cell, nuclear and non-nuclear regions; and for DNA in the nuclear region only. In total, we measured 290 features for every cell (Additional file 2). Then, we constructed three different types of phenotypic profiles for the dataset. The first type of profiles is based on the arithmetic mean of each feature across all cells that have been treated with a specific siRNA. The second type of profiles is based on principal component analysis (PCA) . We kept the number of principal components needed to explain 95% of the variation in our data, and used the scores vector as the phenotypic profiles. The last type of profiles is the SVM-based "d-profiles"  (see Implementation Section).
Evaluation criteria for phenotypic profiling
where is the set of all profiles in the j- th group, and is the total number of groups.
where E() is the mean operator. This evaluation is repeated for different values of n.
Computer software and hardware platforms
The evaluations were performed on a desktop computer with a Intel Core i7 3.07 GHz processor, 8 GB of memory, 64-bit Windows 7 operating system, and Java version 7 Update 9 (build 1.7.0_09-b05). All image and data files were stored in a local harddrive. For the evaluation of processing speed and segmentation accuracy, we implemented a script in Matlab version R2007b (Mathworks, USA) to compute and compare both the boundary and Rand error indices. For the evaluation of phenotypic profiling, we generated multidimensional scaling (MDS) plots for all the constructed profiles using the MASS  and the rgl libraries  under the R computing environment (version 2.14.2).
Results and discussion
Processing speed and segmentation accuracy
A fast and accurate bioimage analysis software platform is required to process the huge amount of microscopy images generated from high-throughput phenotypic profiling experiments. We compared the processing speed and accuracy of cellXpress, CellProfiler , and ImageJ/Fiji  in segmenting the Kc167, HT29, and HeLa datasets. These three software packages implement variants of similar seeded watershed segmentation algorithms . For CellProfiler , we used the segmentation pipeline and optimized parameters included in the Kc167 dataset zip file . We disabled the image cropping function, "show all windows on run" option, and feature-extraction steps in the original pipeline. The pipeline identified DNA regions and the cellular regions using Otsu's thresholding algorithm . For Fiji , we implemented a macro script to perform watershed segmentation. The script identified cellular regions using Li's Minimum-Cross-Entropy  and Triangle-thresholding algorithms  implemented in Fiji for the Kc167 and HT29 datasets, respectively. We found that these two thresholding algorithms gave the best segmentation results for the respective datasets for Fiji.
To evaluate segmentation accuracy, we compared cell masks obtained automatically from the three software platforms to cell masks obtained from manual segmentation. For the Kc167 dataset, we manually segmented each individual cell based on the actin channel. For the HT29 dataset, we used the manual segmentation masks from the Broad Institute's website . The image frame "10779.DIB" was excluded from analysis, as suggested from the website, because of insufficient image quality. We found that the cellXpress had slightly better or similar segmentation accuracies than Fiji and CellProfiler (Figure 6b). The boundary error of cellXpress was significantly lower than CellProfiler (P<0.001), but the Rand errors of the three tested tools were not significantly different from each other (P>0.05, both using two-sided t-tests). Therefore, the faster speed of cellXpress does not come at the cost of segmentation accuracy.
Evaluation of phenotypic profiling
The better performance of d-profiles may be attributed to its ability to capture more informative features. Mean profiles are the arithmetic means of the extracted features across all cells, and PCA profiles are based on an orthogonal transformation of the features into a new set of linearly uncorrelated variables with descending variance (see Evaluation Methods). Both methods do not remove or penalize non-informative features that show high-variance but similar values in both siRNA-treated and control cells. However, d-profiles are based on SVM hyperplanes that optimally separate between treated and control cells, and thus will give lower weights to these non-informative features. Interestingly, we found that d-profiles could distinguish genes involved in the synthesis machineries of RNAs or proteins (Figure 7a), although the cells were only stained with markers for cytoskeleton components. This shows the potential of using morphological and intensity features of a small set of markers to distinguish genes with different biological functions.
The cellXpress platform is specifically designed to make fast and efficient high-throughput phenotypic profiling more accessible to the wider scientific community. Other biological image analysis software platforms may be more appropriate for analyzing time-lapse or 3D microscopy images, or managing large image databases (Figure 1). The cellXpress platform is actively maintained and updated. Future planned improvements include graphics-processing-unit (GPU)-based acceleration, and gene or chemical annotation analysis. The cellXpress software package can be downloaded from http://www.cellXpress.org.
We thank Cecilia Wei-Ling Tan for her comments and help in performing manual cell segmentation.
Publication of the work is supported by Biomedical Research Council of A*STAR (Agency for Science, Technology and Research), Singapore.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 16, 2013: Twelfth International Conference on Bioinformatics (InCoB2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S16.
- Boland MV, Murphy RF: A neural network classifier capable of recognizing the patterns of all major subcellular structures in fluorescence microscope images of HeLa cells. Bioinformatics. 2001, 17 (12): 1213-1223. 10.1093/bioinformatics/17.12.1213.View ArticlePubMedGoogle Scholar
- Loo LH, Wu LF, Altschuler SJ: Image-based multivariate profiling of drug responses from single cells. Nat Methods. 2007, 4 (5): 445-453.PubMedGoogle Scholar
- Neumann B, Walter T, Heriche JK, Bulkescher J, Erfle H, Conrad C, Rogers P, Poser I, Held M, Liebel U: Phenotypic profiling of the human genome by time-lapse microscopy reveals cell division genes. Nature. 2010, 464 (7289): 721-727. 10.1038/nature08869.PubMed CentralView ArticlePubMedGoogle Scholar
- Bakal C, Aach J, Church G, Perrimon N: Quantitative morphological signatures define local signaling networks regulating cell morphology. Science. 2007, 316 (5832): 1753-1756. 10.1126/science.1140324.View ArticlePubMedGoogle Scholar
- Collinet C, Stoter M, Bradshaw CR, Samusik N, Rink JC, Kenski D, Habermann B, Buchholz F, Henschel R, Mueller MS: Systems survey of endocytosis by multiparametric image analysis. Nature. 2010, 464 (7286): 243-249. 10.1038/nature08779.View ArticlePubMedGoogle Scholar
- Kamentsky L, Jones TR, Fraser A, Bray MA, Logan DJ, Madden KL, Ljosa V, Rueden C, Eliceiri KW, Carpenter AE: Improved structure, function and compatibility for CellProfiler: modular high-throughput image analysis software. Bioinformatics. 2011, 27 (8): 1179-1180. 10.1093/bioinformatics/btr095.PubMed CentralView ArticlePubMedGoogle Scholar
- Collins TJ: ImageJ for microscopy. Biotechniques. 2007, 43 (1 Suppl): 25-30.View ArticlePubMedGoogle Scholar
- Kankaanpaa P, Paavolainen L, Tiitta S, Karjalainen M, Paivarinne J, Nieminen J, Marjomaki V, Heino J, White DJ: BioImageXD: an open, general-purpose and high-throughput image-processing platform. Nature methods. 2012, 9 (7): 683-689. 10.1038/nmeth.2047.View ArticlePubMedGoogle Scholar
- de Chaumont F, Dallongeville S, Chenouard N, Herve N, Pop S, Provoost T, Meas-Yedid V, Pankajakshan P, Lecomte T, Le Montagner Y: Icy: an open bioimage informatics platform for extended reproducible research. Nature methods. 2012, 9 (7): 690-696. 10.1038/nmeth.2075.View ArticlePubMedGoogle Scholar
- Allan C, Burel JM, Moore J, Blackburn C, Linkert M, Loynton S, Macdonald D, Moore WJ, Neves C, Patterson A: OMERO: flexible, model-driven data management for experimental biology. Nature methods. 2012, 9 (3): 245-253. 10.1038/nmeth.1896.PubMed CentralView ArticlePubMedGoogle Scholar
- Pau G, Fuchs F, Sklyar O, Boutros M, Huber W: EBImage--an R package for image processing with applications to cellular phenotypes. Bioinformatics. 2010, 26 (7): 979-981. 10.1093/bioinformatics/btq046.PubMed CentralView ArticlePubMedGoogle Scholar
- Fourment M, Gillings MR: A comparison of common programming languages used in bioinformatics. BMC Bioinformatics. 2008, 9: 82-10.1186/1471-2105-9-82.PubMed CentralView ArticlePubMedGoogle Scholar
- Prechelt L: An empirical comparison of C, C++, Java, Perl, Python, Rexx and Tcl. IEEE Computer. 2000, 33: 23-29. 10.1109/2.876288.View ArticleGoogle Scholar
- Feng Y, Mitchison TJ, Bender A, Young DW, Tallarico JA: Multi-parameter phenotypic profiling: using cellular effects to characterize small-molecule compounds. Nat Rev Drug Discov. 2009, 8 (7): 567-578. 10.1038/nrd2876.View ArticlePubMedGoogle Scholar
- Bradski G: The OpenCV Library. Dr Dobb's Journal of Software Tools. 2000, 25 (11): 122-125.Google Scholar
- Raman SK, Pentkovski V, Keshava J: Implementing streaming SIMD extensions on the Pentium III processor. IEEE Micro. 2000, 20 (4): 47-57. 10.1109/40.865866.View ArticleGoogle Scholar
- Loo LH, Lin HJ, Singh DK, Lyons KM, Altschuler SJ, Wu LF: Heterogeneity in the physiological states and pharmacological responses of differentiating 3T3-L1 preadipocytes. J Cell Biol. 2009, 187 (3): 375-384. 10.1083/jcb.200904140.PubMed CentralView ArticlePubMedGoogle Scholar
- Loo LH, Lin HJ, Steininger RJ, Wang Y, Wu LF, Altschuler SJ: An approach for extensibly profiling the molecular states of cellular subpopulations. Nat Methods. 2009, 6 (10): 759-765. 10.1038/nmeth.1375.PubMed CentralView ArticlePubMedGoogle Scholar
- Vincent L: Morphological grayscale reconstruction in image analysis: applications and efficient algorithms. IEEE Trans on Image Processing. 1993, 2: 176-201. 10.1109/83.217222.View ArticleGoogle Scholar
- Otsu N: A threshold selection method from grey-level histograms. IEEE Trans Systems, Man and Cybernetics. 1979, 9: 62-66.View ArticleGoogle Scholar
- Vincent L, Soille P: Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans on Pattern Analysis and Machine Intelligence. 1991, 13: 583-598. 10.1109/34.87344.View ArticleGoogle Scholar
- Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, Lin C-J: LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research. 2008, 9: 1871-1874.Google Scholar
- (downloaded on 1st December 2012), [http://www.cellprofiler.org/linked_files/Examplezips/ExampleFlyImages.zip]
- (downloaded on 1st December 2012), [http://www.broadinstitute.org/bbbc/BBBC018]
- Moffat J, Grueneberg DA, Yang X, Kim SY, Kloepfer AM, Hinkle G, Piqani B, Eisenhaure TM, Luo B, Grenier JK: A lentiviral RNAi library for human and mouse genes applied to an arrayed viral high-content screen. Cell. 2006, 124 (6): 1283-1298. 10.1016/j.cell.2006.01.040.View ArticlePubMedGoogle Scholar
- (downloaded on 20 Dec 2010), [http://www.ebi.ac.uk/huber-srv/cellmorph]
- Fuchs F, Pau G, Kranz D, Sklyar O, Budjan C, Steinbrink S, Horn T, Pedal A, Huber W, Boutros M: Clustering phenotype populations by genome-wide RNAi and multiparametric imaging. Mol Syst Biol. 2010, 6: 370-PubMed CentralView ArticlePubMedGoogle Scholar
- Bennett V, Healy J: Membrane domains based on ankyrin and spectrin associated with cell-cell interactions. Cold Spring Harb Perspect Biol. 2009, 1 (6): a003012-10.1101/cshperspect.a003012.PubMed CentralView ArticlePubMedGoogle Scholar
- Unnikrishnan R, Pantofaru C, Hebert M: Toward objective evaluation of image segmentation algorithms. IEEE Trans Pattern Anal Mach Intell. 2007, 29 (6): 929-944.View ArticlePubMedGoogle Scholar
- Pearson K: On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine. 1901, 2 (11): 559-572. 10.1080/14786440109462720.View ArticleGoogle Scholar
- Loo LH, Wu LF, Altschuler SJ: Image-based multivariate profiling of drug responses from single cells. Nat Methods. 2007, 4 (5): 445-453.PubMedGoogle Scholar
- Venables WN, Ripley BD: Modern Applied Statistics with S. 2002, Springer, New York, 4View ArticleGoogle Scholar
- Schindelin J, Arganda-Carreras I, Frise E, Kaynig V, Longair M, Pietzsch T, Preibisch S, Rueden C, Saalfeld S, Schmid B: Fiji: an open-source platform for biological-image analysis. Nat Methods. 2012, 9 (7): 676-682. 10.1038/nmeth.2019.View ArticlePubMedGoogle Scholar
- Carpenter AE, Jones TR, Lamprecht MR, Clarke C, Kang IH, Friman O, Guertin DA, Chang JH, Lindquist RA, Moffat J: CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biol. 2006, 7 (10): R100-10.1186/gb-2006-7-10-r100.PubMed CentralView ArticlePubMedGoogle Scholar
- Li C, Lee C: An Iterative Algorithm for Minimum Cross Entropy Thresholding. Pattern Recognition. 1998, 26 (4): 617-625.View ArticleGoogle Scholar
- Zack G, Rogers W, Latt S: Automatic measurement of sister chromatid exchange frequency. J Histochem Cytochem. 1977, 25 (7): 741-753. 10.1177/25.7.70454.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.