CellProfiler Analyst: data exploration and analysis software for complex image-based screens
© Jones et al. 2008
Received: 11 July 2008
Accepted: 15 November 2008
Published: 15 November 2008
Image-based screens can produce hundreds of measured features for each of hundreds of millions of individual cells in a single experiment.
Here, we describe CellProfiler Analyst, open-source software for the interactive exploration and analysis of multidimensional data, particularly data from high-throughput, image-based experiments.
The system enables interactive data exploration for image-based screens and automated scoring of complex phenotypes that require combinations of multiple measured features per cell.
Visual analysis of cell samples has played a dominant role in the history of biology. The scientific community has only begun to scratch the surface of computationally extracting the rich information visible in fluorescence microscopy images of cell samples. This capability is increasingly important now that it is easy to systematically perturb cells with libraries of chemicals or gene-perturbing reagents, such as RNA interference or gene overexpression, and to collect hundreds of thousands of images of these cell samples [2, 3]. We recently developed open-source image analysis software, CellProfiler, which measures a rich set of cellular features in images, such as size, shape, and staining patterns including intensity, texture, and colocalization [4, 5] (http://www.cellprofiler.org). This tool has been useful for extracting image-based measurements to score sophisticated screens [6–8], with many more in progress.
The volume and richness of individual-cell data from large image-based screens is unprecedented and existing software is inadequate for the challenge of data analysis. For analysis of small or very simple experiments, spreadsheet programs like Microsoft Excel are sufficient, and useful open-source tools exist for analysis and exploration of data from high throughput screens in general [9–12]. Existing software packages targeted for image-based screening, however, have one or more limitations which prevent sophisticated visualization and extraction of information from image-based screens: (a) they are not designed for the hierarchical data structure inherent in image-based data (each treatment condition is replicated in several samples, each sample is usually represented by several images, each image contains a population of cells, and each cell has hundreds of associated measures), (b) they ignore the inherent biological variability of cell populations such that assays requiring subpopulation analysis cannot be scored, (c) they cannot handle the volumes of data typical in image-based experiments (e.g., ~500 measurements for each of ~100 million individual cells), (d) they provide limited linking to raw or processed image data or chemical structure data, (e) they allow only limited statistical analyses of the data, (f) they are proprietary and new methods cannot be easily added, (g) they are limited to data from a particular image analysis package, (h) they require expertise in statistics or programming, and/or (i) they require intense hands-on data management.
Given that no existing tools meet the specific needs of image-based screens, researchers have needed computational expertise to directly query databases of image-based information using command-line tools. Often, the researchers best able to explore and interpret the data lack these computational skills. These researchers are therefore less likely to make serendipitous discoveries (or identify quality-control issues) in their image-based screens, which inherently contain enormous amounts of information beyond that which is pertinent to the original, intended biological question. It is critical to provide exploration tools to screening researchers, tools that employ their understanding of the experiment in question and their creativity and ability to recognize and interpret patterns and relationships within data. These capabilities flourish when united with a computer's unique ability to store, retrieve, display, and quantitatively analyze billions of data points.
We therefore sought to develop a software system that would make high-dimensional image-based data exploration feasible for researchers who lack computational skills, and flexible for computer scientists who want to develop and add advanced new methods for image-based screening, such as machine learning-based phenotype scoring. We describe here the result of our work, an open-source software package called "CellProfiler Analyst".
Each data point in a plot can represent an individual cell or, by contrast, the mean value of the population of cells within an image. Data can also be grouped by characteristics the samples have in common (e.g., chemical name or dose). Multiple experiments that investigate the same set of treatment conditions (e.g., chemical compounds or RNA interference reagents) can be grouped together, which eases analysis of replicates. For all types of plots, the data to be displayed can be filtered, for example to plot data only from a single image, from a sample of data points at specified equal intervals, or data that satisfies certain criteria (specified in SQL "where" clauses like "CellCount > 100").
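As an illustration of the filtering described above, the following minimal sketch applies a "where" clause to a per-image table. It uses an in-memory SQLite database as a stand-in for the MySQL database the software expects, and the table name, column names, and the helper filter_images are hypothetical, chosen only for this example.

```python
import sqlite3

# Stand-in database: SQLite in memory instead of a remote MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE per_image ("
    "ImageNumber INTEGER PRIMARY KEY, Treatment TEXT, CellCount INTEGER)"
)
conn.executemany(
    "INSERT INTO per_image VALUES (?, ?, ?)",
    [(1, "control", 250), (2, "control", 80),
     (3, "compoundA", 140), (4, "compoundA", 95)],
)

def filter_images(connection, where_clause):
    """Return the image numbers whose rows satisfy a user-typed "where" clause."""
    # The clause is interpolated as-is because it is a trusted filter entered by
    # the analyst; do not do this with untrusted input.
    query = (
        "SELECT ImageNumber FROM per_image "
        f"WHERE {where_clause} ORDER BY ImageNumber"
    )
    return [row[0] for row in connection.execute(query)]

# Keep only images with more than 100 cells, as in the example clause in the text.
selected = filter_images(conn, "CellCount > 100")
```

Only the images that pass the filter would then contribute data points to a plot.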
Additionally, a data point or set of data points can be selected and a plot of the measurements of individual cells that were present in those images can be displayed as a separate subplot. This allows, for example, a DNA content histogram indicating cell cycle distribution of the cell population to be displayed for a particular image or set of images of interest (Figure 2c and Figure 3b). To investigate the identity of interesting samples, a simple list of the treatment conditions that produced a set of data points can be displayed to get an overview (Figure 2d). For further investigation, web-based information about each image's treatment condition can be launched in an external web browser (Figure 3f), if web addresses associated with each sample are stored in the database. All available measurements and other information for a particular sample can be displayed in a simple table and saved as a comma-delimited text file for analysis in another software package (Figure 3c).
If more than two features are needed to score a phenotype, sequential gating can be applied to the cell data. This approach proceeds as follows: (1) display the entire population of cells from an experiment in a density plot, (2) draw a gate around the data points representing potential cells of interest, (3) adjust the gate to include nearly all positive cells and exclude as many negative cells as possible, (4) plot the resulting gated subpopulation in a new density plot with two new measurement features as axes, (5) gate the subpopulation again based on these new features, and (6) calculate the percentage of each image's cells that fall within the final gate.
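The sequential gating steps above can be sketched in a few lines of Python. This is a simplified illustration, not the software's implementation: it assumes rectangular gates (a gate drawn interactively may have any shape), and the feature names, gate bounds, and function names are hypothetical.

```python
import numpy as np

def in_gate(x, y, x_range, y_range):
    """Boolean mask of cells falling inside a rectangular gate on two features."""
    return ((x >= x_range[0]) & (x <= x_range[1]) &
            (y >= y_range[0]) & (y <= y_range[1]))

def percent_in_final_gate(image_ids, features, gates):
    """Apply gates in sequence; return {image: percent of its cells in the final gate}.

    gates is a list of (feature_x, feature_y, x_range, y_range) tuples.
    """
    keep = np.ones(len(image_ids), dtype=bool)
    for feat_x, feat_y, x_range, y_range in gates:
        # A cell survives only if it passed every earlier gate and this one.
        keep &= in_gate(features[feat_x], features[feat_y], x_range, y_range)
    result = {}
    for image in np.unique(image_ids):
        cells = image_ids == image
        result[int(image)] = float(100.0 * keep[cells].sum() / cells.sum())
    return result

# Toy data: six cells from two images, four hypothetical features per cell.
image_ids = np.array([1, 1, 1, 1, 2, 2])
features = {
    "area":         np.array([10, 12, 30, 11, 10, 40]),
    "dna":          np.array([5, 6, 5, 20, 5, 5]),
    "eccentricity": np.array([0.9, 0.2, 0.9, 0.9, 0.8, 0.9]),
    "texture":      np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0]),
}
gates = [
    ("area", "dna", (5, 15), (0, 10)),                     # first density plot
    ("eccentricity", "texture", (0.5, 1.0), (0.5, 1.5)),   # second density plot
]
scores = percent_in_final_gate(image_ids, features, gates)
```

The denominator is every cell in the image, not only the cells that passed the first gate, which matches step (6) as worded above.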
Several groups have tested automated methods for scoring mitotic subphases [18–20]; these studies were accomplished by computational tools tailored to the specific assay and often relied on multiple cellular stains. Machine learning methods have been explored by our own group and others [21–26] (and see Conclusions), but we also wanted to explore allowing the user to manually select a small number of features of known biological relevance, followed by sequential gating on those features. This would give the researcher full control over the features used in the scoring, and the scoring would be more readily transferable from one experiment to the next because a small number of features are selected. We therefore wanted to score mitotic subphases using a DNA stain only, using supervised selection of measurements followed by sequential gating on those measurements, in the context of a software package usable by a non-computer scientist.
We screened genes using Drosophila RNA interference living cell microarrays [27–29] to identify gene "knockdowns" that yield a disproportionate number of cells in two sub-phases of mitosis: metaphase and anaphase/telophase (referred to as telophase for simplicity). We created and analyzed 5 replicates of a Drosophila array, with 1120 spots of dsRNA on a single microscope slide (Figure 5b), including three replicate spots for each of 288 genes (mostly kinases and phosphatases), plus 256 negative control spots lacking dsRNA. Some phenotypes produced in these Drosophila Kc167 cells (e.g. cell death) are visible at low resolution (5× lens; Figure 5c), but to identify telophase and metaphase nuclei we collected individual high resolution images within each spot on each slide (40× lens; small portion of one image shown in Figure 5d).
Table: Accuracy of scoring the metaphase and telophase phenotype. Quantities reported: number of positive cells (Observer 1); number of positive cells (Observer 2); mean number of positive cells (Observers 1 and 2); number of positive cells (CellProfiler); number of cells called positive by CellProfiler but not by Observers; number of cells called positive by Observers but not by CellProfiler; false positive rate; false negative rate.
We separately performed the same procedure for the metaphase phenotype (using four features to distinguish metaphase nuclei from all other nuclei); a complete list of the 288 genes tested and their scores for telophase and metaphase is shown in additional data file 2.
Interestingly, the only metaphase hit in this screen (Figure 6c, last row) is the B'/B56 subfamily regulatory subunit of PP2A (CG5643/widerborst), which at the time of our screen had not been linked to cell cycle regulation. The percentage of cells that were phospho-histone H3-positive was not much higher than normal (Figure 6c, fifth column). We confirmed by eye the metaphase-inducing phenotype of widerborst knockdown in the original images and in separate experiments with two other dsRNAs, including one that was non-overlapping with the original (Figure 8a). Widerborst is an essential gene involved in planar cell polarization and apoptosis [31, 32]. Notably, in other contexts (circadian clock protein cycling and sensory organ development) widerborst is indirectly linked to the B/PR55 subfamily member twins/aar, which is itself known to be required for the metaphase-to-anaphase transition. Our work therefore confirms, with non-overlapping dsRNAs, a recently reported cell cycle regulation role for widerborst; together, these results indicate that this phenotype is unlikely to be due to off-target effects [37, 38].
We have described here a software system for exploration and analysis of large, hierarchical, multi-dimensional data sets. While it is compatible with any type of data (e.g., players on teams, trees within forests), it is particularly capable of high-end exploration and analysis of measured features from high-throughput image-based screens for both quality control and identifying hits in a screen. Researchers are welcome to download the Java source code and add new types of plots and analysis tools (e.g., for normalizing screen data [9, 10]) to the system.
We have demonstrated the utility of this software for interactive data exploration and analysis – especially for intentionally selecting cells with particular measurement values in order to score complex visual phenotypes. Of course, often the features that successfully specify a particular phenotype are either unknown or so numerous as to make the sequential plotting shown here impractical, and choosing decision boundaries empirically may not be optimal to score the phenotype. For these reasons, we recently added machine-learning methodology to CellProfiler Analyst (TRJ, AEC, DMS, PG, unpublished data). Nonetheless, the complete control over features and thresholds offered by sequential gating is quite useful in some cases. Often a researcher needs to ignore certain features of positive control cells (for example, when a positive control treatment has pleiotropic effects on cells) and emphasize other, better-understood cellular features. Interactive observation of the original cellular images while making gating decisions to define a phenotype also leverages the biologist's intuition about a phenotype. Within the same open-source software infrastructure, both approaches (sequential gating and machine learning) can now be applied to large-scale imaging screens.
Project name: CellProfiler Analyst
Project home page: http://www.cellprofiler.org
Operating systems: Platform independent (Mac, Windows, and Unix)
Programming language: Java
Other requirements: Java 1.4.2 or greater. For full functionality, CellProfiler Analyst requires Java 1.5.0_6 or greater, Python version 2.5 or greater http://www.python.org/ and the NumPy Python package http://scipy.org.
License: GNU GENERAL PUBLIC LICENSE, Version 2
No additional restrictions to use by non-academics
CellProfiler Analyst can be downloaded for Mac, Windows, and Unix operating systems from the CellProfiler Project website http://www.cellprofiler.org, where it is distributed under an open-source license (GNU General Public License, version 2). An archived version is also available as additional data file 3, submitted with this article. The Examples page of the website provides demonstration movies showing the software in use, an example database and images, and links to an online forum where questions about the software are answered.
CellProfiler Analyst is designed to explore and analyze any MySQL database of image-based screening data that follows a simple format: at least one image table with rows corresponding to images and columns of image data (examples of columns are: the name of the treatment condition, total intensity of the entire image, mean cell area averaged over all cells in the image, path to the original image), and at least one object table with rows corresponding to objects (e.g., cells) and columns of object data (examples of columns are: area of the cell, intensity of DNA stain in the nucleus, location of the cell in the original image – the latter being important for viewing individual cells during exploration). This data format is automatically produced if images are analyzed with CellProfiler open-source cell image analysis software http://www.cellprofiler.org, using its ExportToDatabase module. The data should be normalized for plate-to-plate or spatial-layout variations prior to exploration in CellProfiler Analyst. While the software is designed to access remote databases because typical data sets are far too large to be stored in physical memory, the "Make Local Object Table" option allows particularly relevant measurements to be stored locally in memory to speed analysis while still allowing access to the full dataset in the remote database.
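The two-table layout described above can be illustrated with a small schema. The sketch below uses an in-memory SQLite database in place of MySQL, and all table and column names are hypothetical examples rather than the names written by the ExportToDatabase module.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for a remote MySQL database (assumption)
db.executescript("""
CREATE TABLE per_image (
    ImageNumber INTEGER PRIMARY KEY,
    Treatment   TEXT,   -- name of the treatment condition
    ImagePath   TEXT    -- path to the original image
);
CREATE TABLE per_object (
    ImageNumber  INTEGER REFERENCES per_image(ImageNumber),
    ObjectNumber INTEGER,
    Area         REAL,  -- area of the cell
    DNAIntensity REAL,  -- intensity of DNA stain in the nucleus
    LocationX    REAL,  -- position of the cell in the original image
    LocationY    REAL,
    PRIMARY KEY (ImageNumber, ObjectNumber)
);
""")
db.executemany(
    "INSERT INTO per_image VALUES (?, ?, ?)",
    [(1, "control", "/images/img1.tif"), (2, "dsRNA_X", "/images/img2.tif")],
)
db.executemany(
    "INSERT INTO per_object VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, 100.0, 2.0, 15.0, 20.0),
     (1, 2, 120.0, 4.1, 80.0, 42.0),
     (2, 1, 90.0, 2.2, 33.0, 61.0)],
)

# Per-image summary derived from the object table: cell count and mean cell area.
summary = db.execute("""
    SELECT i.ImageNumber, i.Treatment, COUNT(*), AVG(o.Area)
    FROM per_image AS i
    JOIN per_object AS o ON o.ImageNumber = i.ImageNumber
    GROUP BY i.ImageNumber
    ORDER BY i.ImageNumber
""").fetchall()
```

The shared image key is what links each cell back to its image, and through the image to its treatment condition.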
We prepared Drosophila Kc167 cells as previously described. In brief, cells were grown on living cell microarrays with spots of double-stranded RNA for 3 days. For confirmation of phenotypes in Drosophila, we grew cells on plain slides for 3 days after pre-treating them with dsRNA for 2 days. We used images of human HT29 cells as previously described.
For the screen of the metaphase and telophase phenotypes, each gene was tested in three replicate spots on five independently prepared cell array slides, and the results for all genes are shown in Additional file 2. Because the three replicate spots were near each other, cell counts for the groups of three were accumulated and not treated as independent samples. A p-value for each gene on each of the five slides was calculated based on the number of metaphase nuclei found and the number of cells total, relative to the average percentage of metaphase nuclei on the entire slide (i.e., as a Bernoulli random variable). To add stringency, we report results for second- and third-strongest scoring replicates only (shown on two separate sheets of Additional file 2). We required that two or three of the five scores were above a threshold that results in a combined p-value below 0.01. For Bonferroni-adjusted p-values from single experiments, these thresholds are 0.6 for two experiments (out of 5), and 5.2 for three (out of 5). De-enriched samples are listed with a p-value of 1 and samples with a p-value of 1 are ordered by enrichment. "Enrichment" is the fold-enrichment of the sample relative to all the samples.
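To make the per-slide scoring concrete, the sketch below treats each cell as a Bernoulli trial whose success probability is the slide-wide fraction of positive nuclei. It then computes a one-sided binomial tail probability and the fold-enrichment. The one-sided form of the test and the function names are assumptions made for illustration; the text above does not specify the exact test used.

```python
from math import exp, lgamma, log

def log_binomial_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p); log space avoids overflow for large n."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)

def fold_enrichment(n_positive, n_total, background_rate):
    """Fraction of positive cells in the sample relative to the slide-wide fraction."""
    return (n_positive / n_total) / background_rate

def enrichment_p_value(n_positive, n_total, background_rate):
    """Upper-tail probability P(X >= n_positive); requires 0 < background_rate < 1."""
    if n_positive / n_total < background_rate:
        return 1.0  # de-enriched samples are listed with a p-value of 1
    tail = sum(exp(log_binomial_pmf(k, n_total, background_rate))
               for k in range(n_positive, n_total + 1))
    return min(1.0, tail)

# Hypothetical sample: 30 metaphase nuclei among 1000 cells, 1% slide-wide rate.
p_value = enrichment_p_value(30, 1000, 0.01)
enrichment = fold_enrichment(30, 1000, 0.01)
```

In this toy case the sample is three-fold enriched relative to the slide.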
In the bar charts in the fourth column of Figure 6c, we statistically analyzed the cell cycle distribution and cell count for the screens' hits. To do this, we first gathered DNA content data (i.e., integrated nuclear DNA intensity) from the database for all cells on the slides where the hits occurred. Then, to normalize for illumination and staining variation between slides and between images, the DNA content measurements were log2-transformed and shifted so that the mode of the DNA content for each image (calculated by binning the log2-transformed DNA into 50 bins) was equal to 1. Based on this normalized log2(DNA intensity), cells were then counted as 2N, 4N, and 8N as follows:
[-0.5, 0.5) was categorized as "2N"
[0.5, 1.5) was categorized as "4N"
[1.5, 2.5) was categorized as "8N", although this includes ~6N to ~11N
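The normalization and binning just described can be sketched as follows for the cells of a single image. Taking the mode as the center of the most populated of the 50 histogram bins is an assumption, as are the function names and the "other" label for cells outside the three intervals.

```python
from collections import Counter

import numpy as np

def normalize_log2_dna(dna_intensity, n_bins=50):
    """Log2-transform DNA content and shift it so the per-image mode equals 1."""
    log_dna = np.log2(np.asarray(dna_intensity, dtype=float))
    counts, edges = np.histogram(log_dna, bins=n_bins)
    peak = int(np.argmax(counts))
    mode = 0.5 * (edges[peak] + edges[peak + 1])  # center of the fullest bin
    return log_dna - mode + 1.0

def classify_ploidy(normalized):
    """Label each cell "2N", "4N" or "8N" using the half-open intervals listed above."""
    labels = np.full(normalized.shape, "other", dtype=object)
    labels[(normalized >= -0.5) & (normalized < 0.5)] = "2N"
    labels[(normalized >= 0.5) & (normalized < 1.5)] = "4N"
    labels[(normalized >= 1.5) & (normalized < 2.5)] = "8N"
    return labels

# Toy image: most cells share one DNA content, with halves, doubles and an outlier.
intensities = [200] * 3 + [400] * 10 + [800] * 2 + [3200]
labels = classify_ploidy(normalize_log2_dna(intensities))
counts_by_class = Counter(labels.tolist())
```

Because the mode is shifted to 1, the most common DNA content in each image lands in the interval labeled "4N".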
Using the resulting cell counts for each subpopulation (2N, 4N, 8N), we calculated p-values as follows: first, subpopulation counts were converted to fractions for each image by dividing the subpopulation counts by the total number of cells in the image (taken as the sum of the 2N, 4N, and 8N subpopulations). Each fraction was then normalized by the median fraction for that subpopulation on that slide, to account for any per-slide biases in cell-cycle distribution. These normalized fractions were averaged across replicate samples for each gene. Lastly, these averaged normalized fractions were used to calculate p-values for each subpopulation in a 10,000-trial permutation test (where labels were permuted within slides, but not between slides, to ensure that the same number of images was taken from the slide as in the experiment). Cell-count p-values were calculated similarly: the total number of cells in each image was normalized by the median per-image cell count on that slide, to prevent biases for more densely populated slides, and a permutation test was performed on the average normalized cell-count. For cell cycle and cell count, p-values were Bonferroni-corrected for 20 experiments (5 genes examined for 4 populations: 2N, 4N, 8N, and count).
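A within-slide permutation test of this kind might look like the sketch below. It assumes a one-sided comparison (how often a random draw scores at least as high as the observed gene) and adds one to the numerator and denominator of the p-value, a common convention. Neither detail is stated in the text above, and the function names are hypothetical.

```python
import numpy as np

def permutation_p_value(values, slides, is_gene, n_trials=10_000, seed=0):
    """One-sided p-value for the mean normalized value of one gene's images.

    On each trial, as many images are drawn from every slide as the gene
    actually had on that slide, so labels never move between slides.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    slides = np.asarray(slides)
    is_gene = np.asarray(is_gene, dtype=bool)
    observed = values[is_gene].mean()
    per_slide = []
    for slide in np.unique(slides):
        on_slide = slides == slide
        n_gene_images = int(is_gene[on_slide].sum())
        if n_gene_images > 0:
            per_slide.append((values[on_slide], n_gene_images))
    hits = 0
    for _ in range(n_trials):
        drawn = np.concatenate(
            [rng.choice(v, size=k, replace=False) for v, k in per_slide]
        )
        if drawn.mean() >= observed:
            hits += 1
    return (hits + 1) / (n_trials + 1)

def bonferroni(p_value, n_tests=20):
    """Bonferroni correction, capped at 1 (20 tests: 5 genes x 4 populations)."""
    return min(1.0, p_value * n_tests)

# Toy data: two slides of five images; the gene's image is the highest on each.
values = [1, 1, 1, 1, 5, 1, 1, 1, 1, 5]
slides = ["A"] * 5 + ["B"] * 5
is_gene = [False] * 4 + [True] + [False] * 4 + [True]
p_raw = permutation_p_value(values, slides, is_gene, n_trials=2000)
p_adjusted = bonferroni(p_raw)
```

In the toy data a random draw matches the observed mean only when the top image is picked on both slides, so the raw p-value should be close to 1/25.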
The authors are grateful to Michael R. Lamprecht, David A. Guertin, Vebjørn Ljoså, Peggy Anthony, and Jason Moffat for technology development and technical support that made possible the software and biological experimentation presented in this paper.
This work was supported by the Broad Institute, the MIT EECS/Whitehead/Broad Training Program in Computational Biology (NIH grant DK070069-01) supporting TRJ, DOD TSC research program grant W81XWH-05-1-0318-DS (DMS), NIH NIGMS R01 GM0725555 (DMS), NIH NIAD RO1 AI047389 (DMS), NSF CAREER award 0642971 (PG), a Merck/CSBi postdoctoral fellowship (AEC), a L'Oreal for Women in Science fellowship (AEC), a Novartis fellowship from the Life Sciences Research Foundation (AEC), and a Society for Biomolecular Screening Academic grant (AEC).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.