Program feature 1: Training and interactive decision making
Enhanced CellClassifier is a novel application which efficiently integrates image analysis results from the open source program CellProfiler [4] with SVM classification algorithms [25]. Multi-class classification is a distinguishing feature of Enhanced CellClassifier. The current version supports five classes; a case study involving three classes is given in biological example 1 below: hepatocyte growth factor-induced ruffling of HeLa cells. Enhanced CellClassifier displays images in a browser which supports three channels, scaling, zooming, and image navigation. Images are randomly selected from user-defined image groups which correspond to the wells from which the images were derived. The class of an object is shown directly on the image; the color of the object outline indicates its class. Both the display of images and the presentation of objects can be customized.
During supervised learning, the user labels objects, thereby attributing a class to them (a cell, for instance, could be mitotic or non-mitotic). These objects (for instance cells) have first been identified and measured by CellProfiler, which extracts object attributes (object features) such as the mean intensity of the cell in the actin channel. In this way, a training data set containing the object attributes and the class label of each trained cell is assembled. The classification algorithm then derives a set of rules to predict the object labels from the measured object attributes. Several strategies to achieve this have been proposed and successfully applied, including decision tree based, Bayesian and nearest neighbor classifiers, neural networks, perceptrons and Support Vector Machines (SVM). Enhanced CellClassifier uses an SVM algorithm with a radial basis function (RBF) kernel for training; the open source program libsvm [25] is integrated in our tool and is used exclusively for these calculations. A detailed discussion of SVM and machine learning is given in the implementation section.
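As a minimal sketch of this step, the fragment below illustrates how object attributes exported from CellProfiler could be scaled and used to train a multi-class C-SVC with an RBF kernel via the MATLAB interface of libsvm; the variable names (attributes, labels, newAttributes) and the parameter values are illustrative only and do not reflect the actual Enhanced CellClassifier code.

```matlab
% Assumed inputs: "attributes" (one row per trained object, one column per
% object attribute), "labels" (numeric class labels, e.g. 1 = ruffling,
% 2 = non-ruffling, 3 = mitotic) and "newAttributes" (objects to classify).

% Scale every attribute to the range [0, 1], as commonly recommended for SVMs.
minA   = min(attributes, [], 1);
rangeA = max(attributes, [], 1) - minA;
rangeA(rangeA == 0) = 1;                       % avoid division by zero
scaled = bsxfun(@rdivide, bsxfun(@minus, attributes, minA), rangeA);

% Train a multi-class C-SVC with an RBF kernel (-t 2); libsvm handles more
% than two classes internally via one-against-one voting. C and gamma are
% placeholder values here.
model = svmtrain(labels, scaled, '-s 0 -t 2 -c 8 -g 0.5');

% Predict the class of new objects after applying the same scaling.
newScaled = bsxfun(@rdivide, bsxfun(@minus, newAttributes, minA), rangeA);
predicted = svmpredict(zeros(size(newScaled, 1), 1), newScaled, model);
```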
In Enhanced CellClassifier, training is done by directly clicking on the respective object in an image (Figure 2). Training strategies might critically influence the classification process; Enhanced CellClassifier therefore offers four intuitive training modes. Training in the exploratory ("default") mode provides maximum flexibility, allowing the user to freely select any object. However, in this mode the user might tend to avoid the frequently occurring borderline phenotypes. Therefore, a second mode, "random", exists. This is a forced-choice mode: the user is required to decide on the phenotype of randomly selected objects from a randomly selected image. This training mode thereby avoids any selection bias by the user. At any time, training can be interrupted for calculation of an SVM model.
In a later stage of the training process, the user might want to refine a preliminary model. Training more objects would obviously be useful. However, a more efficient strategy is to limit training to objects which had been difficult for the algorithm to classify. The predictions for these objects are either incorrect or only marginally correct; such objects are located close to the decision boundaries of the classifier. Therefore, in a third training mode, the "correction" mode, the predictions for all objects are displayed on the image, and only objects corrected by the user are memorized. Adding these borderline objects to the data set can greatly improve the model. Finally, in a fourth mode, the "decision boundaries" mode, only objects whose predictions are closest to the decision boundaries of the current model are presented; these objects are also the most valuable for further refinement. The user can freely switch between training modes; moreover, training can be performed in an "informed" or a "blinded" fashion, either displaying the image filenames or not.
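To illustrate the principle behind the "decision boundaries" mode, the sketch below ranks objects by the magnitude of their libsvm decision values for the simple two-class case; objects with the smallest absolute value lie closest to the separating hyperplane. This is an assumed simplification (the program itself handles the multi-class case) and the variable names are hypothetical.

```matlab
% Two-class sketch: rank objects by proximity to the decision boundary.
% "scaled" holds the scaled attributes of the objects of one image and
% "model" is the current SVM model (both assumed from the sketch above).
[~, ~, decVals] = svmpredict(zeros(size(scaled, 1), 1), scaled, model);
[~, order]  = sort(abs(decVals(:, 1)));         % most ambiguous objects first
borderline  = order(1:min(8, numel(order)));    % e.g. present eight of them
```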
Presentation of objects within the original images directly illustrates the biological process, the image segmentation and the performance of the classifier to the user. It might enhance training accuracy, since an experienced biologist can take the context of each object into account. On the other hand, the context might introduce a training bias, and image-based training might cause under-representation of objects from images with high cell densities. Therefore, training can also be conducted in a separate window; here, 10 individual objects from up to 10 different images of the 96- or 384-well plate are displayed. Eight of these objects are selected to be close to the decision boundaries of the current model; the remaining two illustrate the positive and negative phenotypes. This mode avoids any training bias and might enhance the efficiency of the training process by selecting the most informative objects from the whole plate. A similar training mode has recently been described [6].
Program feature 2: Data integration
Enhanced CellClassifier facilitates the integration of classification information with other CellProfiler data. During image analysis by CellProfiler, inter-object information can be calculated. For instance, two independent object types can be related to each other if they overlap (for instance cells and "spots", see below); one object is then labeled as the child of the other. With a different module, neighborhood information for objects of the same kind can be calculated. In CellProfiler, however, this information can only be exploited by custom programming.
Enhanced CellClassifier allows the user to define internal representations of the objects which we call "vectors". A vector is binary and is calculated for each image; it contains one entry per object of the specified kind. Classification, measurement and inter-object information can all be translated into binary vector information. Since new vectors can be generated from existing vectors using logical operations, the user is able to define any desired subgroup of objects. For example: if an image contains 5 cells, of which cells 1 and 3 are mitotic, the vector for mitotic cells for this image would be 1, 0, 1, 0, 0. If cells 1, 4 and 5 are calculated to be infected by a pathogen, the vector for infected cells would be 1, 0, 0, 1, 1. The vector for infected mitotic cells would be 1, 0, 0, 0, 0, for infected non-mitotic cells 0, 0, 0, 1, 1 and so on. This vector concept enables the user to handle cases combining classification with inter-object relationships or other object properties, which would otherwise only be possible by scripting. Feature integration is illustrated below in biological example 2, docking of Salmonella onto HeLa cells.
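A minimal sketch of the vector concept for the five-cell example above, written in MATLAB (the language of the tool); the vector values are taken from the text, the variable names are illustrative.

```matlab
% One logical entry per cell of the image.
mitotic  = logical([1 0 1 0 0]);    % cells 1 and 3 are mitotic
infected = logical([1 0 0 1 1]);    % cells 1, 4 and 5 are infected

infectedMitotic    = mitotic & infected;     % 1 0 0 0 0
infectedNonMitotic = ~mitotic & infected;    % 0 0 0 1 1
```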
Program feature 3: Dynamic data extraction
To ensure the greatest possible flexibility, three further internal representations can be defined by the user: "image variables", "well variables" and "plate variables". Image variables comprise just one number per image, for instance "number of nuclei", "number of infected cells" or "percent infected cells". In most situations they are calculated from vectors; however, Enhanced CellClassifier also allows CellProfiler data such as threshold information to be imported directly. Well variables are summaries of the image variables of one well. Well variables can also be the result of a calculation, for instance the normalization of the number of docked or ruffling cells by the total number of nuclei in that well. Plate variables are summaries of variables from wells chosen by the user. They are especially useful for normalizing all data on a plate or for bringing internal controls prominently to the attention of the user. All variables are defined via a graphical user interface where predefined choices avoid "impossible" settings.
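The following fragment sketches, under assumed variable names, how an image variable and a well variable could be derived from the per-image vectors; it is meant only to clarify the hierarchy of the representations, not to reproduce the program's internal data structures.

```matlab
% "infectedVectors" is assumed to be a cell array holding the logical
% "infected cells" vector of every image of one well.
numInfectedPerImage = cellfun(@sum,   infectedVectors);   % image variable
numCellsPerImage    = cellfun(@numel, infectedVectors);   % image variable

% Well variable: normalization over all images of the well.
percentInfectedWell = 100 * sum(numInfectedPerImage) / sum(numCellsPerImage);
```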
Program feature 4: Flexible output options
Most important for the user is the summary of the whole experiment in a comprehensible and human-readable format. Our program generates four different kinds of output data: outlined images, Excel files, graphical summaries and a Matlab-readable output. Outlined images visualize a vector or the result of the classification for a given image; if, for instance, the user wanted to visualize the vector "mitotic cells" in yellow, the outlines of all objects for which the vector is positive (i.e. all mitotic cells) would be drawn in yellow (for examples see Figures 3, 4). Outlined images allow for visual control of the final analysis and for documentation. Excel is probably the most popular data format for biologists; all image, well and plate variables are automatically exported to an Excel sheet. Well variables of the whole plate can be visualized as heat maps, histograms or scatter plots, allowing a quick overview of the whole experiment. For larger experiments, the user might want to perform further customized analysis; therefore, image and well variables of interest can be exported in a Matlab-readable format.
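As an illustration of two of these output options, the sketch below exports a well variable of a 96-well plate to an Excel file and draws it as a heat map; the matrix name, file name and labels are assumptions, not the program's internal conventions.

```matlab
% "plateValues" is assumed to be an 8x12 matrix (rows A-H, columns 1-12)
% holding one well variable, e.g. "percent infected cells".
xlswrite('plate_summary.xls', plateValues);    % Excel export

imagesc(plateValues); colorbar;                % heat map of the plate
set(gca, 'XTick', 1:12, 'YTick', 1:8, ...
    'YTickLabel', {'A','B','C','D','E','F','G','H'});
title('Well variable: percent infected cells');
```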
Enhanced CellClassifier supports automatic processing of multiple plates. To allow for the analysis of a high-content screen, the Enhanced CellClassifier output can be imported into the open source program HCDC-KNIME, which enables the analysis of large experiments as well as integration with the original images, experimental data and further advanced bioinformatics analysis.
Biological example 1: Hepatocyte growth factor induced ruffling
We chose ruffling of cells in response to hepatocyte growth factor (HGF) as an example for the automated identification of complex phenotypes. HGF is a cytokine which can stimulate cell motility, proliferation and morphogenesis. A visible sign of HGF activity is the appearance of pronounced "dorsal" ruffles on the surface of the cell [31]. Ruffles are driven by rapid actin polymerization. In this context, the Arp2/3 complex, a master regulator of actin polymerization, with its central component ACTR3, is known to play a major role. The intracellular signaling from the HGF receptor leading to ruffling is currently a subject of intense research [32].
We established an image-based ruffling assay using HeLa cells (Figure 3). HeLa cells were incubated for 5 minutes with HGF at a final concentration of 100 ng/ml, fixed and stained with 4',6-diamidino-2-phenylindole (DAPI) for visualization of nuclei and tetramethylrhodamine isothiocyanate (TRITC)-phalloidin for staining of the actin cytoskeleton. Where indicated, cells had been incubated with Lipofectamine 2000 and 20 nM of an siRNA directed against the mRNA of ACTR3 to down-regulate this protein prior to the assay. Twenty images were acquired per well with an ImageXpress microscope (Molecular Devices) using a 20× objective. In the microscopy images, nuclei and cells were identified in the DAPI and actin channels, respectively, using established CellProfiler modules.
Cells and nuclei were subsequently measured in both channels using CellProfiler tools for measurement of object texture and intensity. In brief, CellProfiler texture measurements include Haralick features, a set of statistical measures derived from the grey-level co-occurrence matrix of an object [33], and Gabor features, obtained after applying Gabor filters [34] in the x and y directions. Intensity measurements include the minimum, maximum, median and mean pixel intensities over an object and over its edge region, respectively. Both intensity and texture measurements were performed in the actin and DAPI channels for both the nucleus and the cell region.
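As a hedged illustration of such measurements, the fragment below computes four co-occurrence-based texture properties (a subset related to the Haralick features) and simple intensity statistics for a single object, using standard Image Processing Toolbox functions; the image and mask variables are assumed and the code is not the CellProfiler implementation.

```matlab
% "actinImage" is assumed to be a double image scaled to [0, 1] and
% "cellMask" the logical mask of one cell.
maskedPixels = actinImage;
maskedPixels(~cellMask) = NaN;          % graycomatrix ignores NaN pixels
glcm = graycomatrix(maskedPixels, 'NumLevels', 8, ...
                    'GrayLimits', [0 1], 'Symmetric', true);
textureStats = graycoprops(glcm, {'Contrast', 'Correlation', ...
                                  'Energy', 'Homogeneity'});

% Basic intensity measurements over the same object.
objectPixels    = actinImage(cellMask);
meanIntensity   = mean(objectPixels);
medianIntensity = median(objectPixels);
maxIntensity    = max(objectPixels);
```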
In order to improve the performance of the classifier, customized CellProfiler modules were developed. Our modules take advantage of the high intensity of a ruffle in the actin channel relative to the remainder of the cell and of its distinctive compact shape. In brief, in one strategy we determined the regions of the cell with the brightest intensities, either by applying a fixed circular mask or by thresholding with the Otsu algorithm [35]. Subsequently, features describing the contrast between the brightest area and the remaining area of the cell were extracted [36]. In an additional approach, we took advantage of the fact that the area of ruffles within a cell usually consisted of the brightest 5% of pixels of the cell. The shape of the regions thus identified was measured (solidity, eccentricity) as well as the contrast (difference, z-score) of the potential ruffle relative to the remainder of the cell.
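The sketch below illustrates the "brightest 5% of pixels" strategy under assumed variable names: the candidate ruffle region is defined, its shape is measured and its contrast to the rest of the cell is expressed as a difference and a z-score. It is a conceptual sketch, not the code of our CellProfiler modules.

```matlab
% Define the candidate ruffle as the brightest 5% of pixels of the cell.
sortedPixels = sort(actinImage(cellMask));
cutoff       = sortedPixels(ceil(0.95 * numel(sortedPixels)));
ruffleMask   = cellMask & (actinImage >= cutoff);

% Shape of the candidate ruffle region(s).
ruffleShape = regionprops(ruffleMask, 'Solidity', 'Eccentricity');

% Contrast of the candidate ruffle relative to the remainder of the cell.
rufflePixels = actinImage(ruffleMask);
restPixels   = actinImage(cellMask & ~ruffleMask);
contrastDiff = mean(rufflePixels) - mean(restPixels);   % difference
contrastZ    = contrastDiff / std(restPixels);          % z-score

% The alternative strategy thresholds within the cell using Otsu's method.
otsuLevel = graythresh(actinImage(cellMask));
```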
However, no single feature could clearly distinguish ruffling from non-ruffling cells (not shown). This was not entirely unexpected, since changes in actin polymerization also occur during normal cellular life, for instance at the entry into mitosis. Therefore, this problem required the identification of three different cell types: ruffling, non-ruffling and mitotic cells. This task could conveniently be achieved using Enhanced CellClassifier.
For identifying dorsal ruffles on HGF-treated cells, objects were trained in the "default" and "random" training modes. After training a preliminary SVM model, incorrectly classified cells were corrected in the "correction" mode, yielding a final data set of 782 objects. After a grid search over the parameters C and gamma of the RBF kernel, the 5-fold cross-validation accuracy was 87.7%. This slightly less than optimal performance is most likely due to the presence of weakly ruffling cells with a borderline phenotype which are difficult to classify, even for a human observer. In agreement with this interpretation, a detailed look at the confusion matrix of the 5-fold cross-validation procedure showed that mitotic and non-ruffling cells were mostly predicted correctly (94% and 91%, respectively), in contrast to ruffling cells (77%) which were frequently misclassified as non-ruffling (22%). In further tests with the same images, a data set containing 340 objects with exclusively strong phenotypes was classified almost perfectly (5-fold cross-validation accuracy 95%), while another data set from the same image set containing 700 objects trained in a strictly random and blinded manner yielded a 5-fold cross-validation accuracy of 85.7%.
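A minimal sketch of such a grid search, using libsvm's built-in cross-validation option ('-v 5' returns the 5-fold cross-validation accuracy); the grid ranges and variable names are assumptions and the actual search in Enhanced CellClassifier may differ.

```matlab
% "labels" and "scaled" are the class labels and scaled attributes of the
% 782 trained objects (assumed names, see the sketch above).
bestAcc = 0;
for log2c = -1:2:9                      % candidate values for C
    for log2g = -7:2:3                  % candidate values for gamma
        opts = sprintf('-s 0 -t 2 -v 5 -c %g -g %g', 2^log2c, 2^log2g);
        acc  = svmtrain(labels, scaled, opts);   % 5-fold CV accuracy
        if acc > bestAcc
            bestAcc = acc; bestC = 2^log2c; bestG = 2^log2g;
        end
    end
end

% Train the final model with the best parameter pair.
finalModel = svmtrain(labels, scaled, ...
                      sprintf('-s 0 -t 2 -c %g -g %g', bestC, bestG));
```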
To allow for an experimental comparison of the performance of different classifiers, our data set (782 objects) was exported to the open source program WEKA. We tested a large set of classifiers, of which only a few algorithms approached the accuracy of SVM with an RBF kernel (additional file 1). From these tests we conclude that, for this data set, the performance of libsvm with an RBF kernel and our settings cannot easily be outperformed by other algorithms.
When the different object features were ranked by WEKA for their ability to distinguish ruffling from non-ruffling cells using different algorithms, for instance SVM attribute selection [37], the object attributes describing texture in the actin channel consistently ranked best, followed by our customized object attributes. To determine the relationship between the number of available object attributes and the 5-fold cross-validation accuracy, we systematically tested the performance of our classifier with increasing numbers of object attributes. We started with one attribute, added further attributes in the order suggested by the SVM attribute selection algorithm and optimized the kernel parameters C and gamma for each subset. A set of 28 object attributes performed best, achieving a 5-fold cross-validation accuracy of 89% and thereby marginally exceeding the cross-validation accuracy of the whole set of object attributes. Object attribute selection (restricting training to a subset of object attributes) has the additional advantage of decreasing training time. Nevertheless, a model for this data set could be calculated in only 2 seconds on a desktop computer, so no further optimization was attempted. Object attribute selection algorithms are currently not supported by Enhanced CellClassifier but will be addressed in future developments.
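The incremental test can be sketched as follows, assuming the attribute ranking (e.g. from WEKA's SVM attribute selection) is available as an index vector; for brevity, fixed kernel parameters are used in this sketch, whereas in the actual test C and gamma were re-optimized for each subset.

```matlab
% "rankedIdx" is assumed to list the attribute columns in decreasing order
% of their ranking; "labels" and "scaled" as above.
cvAcc = zeros(numel(rankedIdx), 1);
for k = 1:numel(rankedIdx)
    subset   = scaled(:, rankedIdx(1:k));
    cvAcc(k) = svmtrain(labels, subset, '-s 0 -t 2 -v 5 -c 8 -g 0.5');
end
[bestAcc, bestSubsetSize] = max(cvAcc);   % best-performing number of attributes
```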
In summary, Enhanced CellClassifier could identify mitotic and non-ruffling cells with high accuracy and ruffling cells with moderate accuracy. Using our tool in the "correction" training mode, the biologist can directly visualize the predictions of the model on different images; this increases the confidence of the researcher in the analysis algorithm. Subsequently, the model was applied to the complete data set of the biological experiment. As shown in Figure 3, frequent ruffling was observed under control conditions; in contrast, without HGF, only background ruffling was observed. Moreover, elimination of a critical component of the cascade leading from HGF to actin polymerization also reduced ruffling: after siRNA-mediated knockdown of the ACTR3 component of the Arp2/3 complex, ruffling was reduced to background levels. Therefore, Enhanced CellClassifier can automatically analyze HGF-induced ruffling. This could be useful for the future identification of new intracellular proteins important for ruffling.
Biological example 2: Docking of Salmonella onto HeLa cells
Salmonella Typhimurium is an important food-borne pathogen causing diarrhea and, rarely, systemic disease. Central to the pathogenesis of Salmonella is its ability to invade epithelial cells [38, 39]. Docking onto cells is the first crucial step of infection by Salmonella. This process can be studied in tissue culture: cells were incubated with the non-invasive Salmonella Typhimurium strain M566 (SL1344 ΔSipA, ΔSopBEE2, [40]) for 6 minutes, washed and fixed. Nuclei were visualized using DAPI, and Salmonella by indirect immunofluorescence in the green channel using a rabbit antibody directed against the O-side chain of LPS (Difco). Four images per well were acquired with a 4× objective. Using CellProfiler modules, nuclei were identified in the DAPI images. Cells were defined by expansion of the area of the nucleus. Infectious "spots" representing single bacteria or small numbers of Salmonella cells were identified as independent objects in the green channel. During CellProfiler analysis, inter-object data were collected: the relationship of spots and cells was determined using the CellProfiler module "relate", whereby any spot overlapping with a cell was labeled the child of this cell. In addition, neighborhood information for cells was calculated.
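Conceptually, the parent/child relationship established by the "relate" module can be sketched as follows, using the label matrices produced by the segmentation; the variable names are assumptions and the CellProfiler implementation differs in detail.

```matlab
% "cellLabels" and "spotLabels" are assumed label matrices (0 = background,
% positive integers = object numbers) from the cell and spot segmentation.
numSpots     = max(spotLabels(:));
parentOfSpot = zeros(numSpots, 1);            % 0 = spot overlaps no cell
for s = 1:numSpots
    underSpot = cellLabels(spotLabels == s);  % cell labels under this spot
    underSpot = underSpot(underSpot > 0);
    if ~isempty(underSpot)
        parentOfSpot(s) = mode(double(underSpot));   % most overlapped cell
    end
end
```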
Inspection of the microscopy images revealed a striking preference of Salmonella for mitotic cells (Figure 4). Salmonella were also enriched on cells adjacent to mitotic cells. Therefore, when investigating the docking process, the researcher would like to quantify docking properties for three types of cells: mitotic cells, neighbors of mitotic cells and non-mitotic cells.
Nuclei of mitotic cells can easily be recognized in the DAPI channel by a human observer. For automated analysis, however, more than one object attribute was necessary (not shown). Therefore, the final analysis was done using Enhanced CellClassifier. Two classes were defined, mitotic and non-mitotic nuclei, and trained first in the "default" training mode, followed by refinement of the model in the "correction" training mode. The final data set contained 2001 objects from samples of forty 96-well plates from two independent experiments.
The object attributes measured by CellProfiler and available to the classifier were intensity and texture measurements of the nuclei in the DAPI channel (for explanation see biological example 1). The object attributes ranking best according to their ability to distinguish between the classes [37] included intensity measurements, followed by Gabor and Haralick features (not shown). The combination of these object attributes enabled the SVM algorithm to reliably distinguish between mitotic and non-mitotic nuclei with a 5-fold cross-validation accuracy of 96.0%. A data set of 2000 cells was not necessary to achieve reliable discrimination between the classes, since 5-fold cross-validation accuracies above 93% were consistently achieved with random samples as small as 250 objects. Nevertheless, the larger data set did not require extensive computational power, since the calculations needed only 2.5 seconds on a desktop computer.
Other classifiers performed equally well on this data set: after exporting the training data set to WEKA, 5-fold cross-validation accuracies ranging from 94% to 97% were obtained with the 20 algorithms tested (additional file 1).
For the summary of the data, three vectors were calculated: one with information about the cell cycle (mitotic, non-mitotic), a second with information about neighborhood to mitotic cells and a third with information about associated spots. Combining these three vectors yielded all the desired subtypes of cells (Figure 4). As shown in Figure 4C, mitotic cells, neighbors of mitotic cells and normal cells differ greatly in the percentage of docked Salmonella. To the best of our knowledge, this is the first demonstration of the docking preference of Salmonella for mitotic cells; however, the biological basis of this interesting phenomenon remains elusive. In any case, to investigate the docking process independently of the cell cycle, the user can now concentrate on the cell population purged of mitotic cells and their neighbors.
Experimental comparison of Enhanced CellClassifier with CP Analyst
We wanted to compare our new tool with existing software; among the available open-source software, only the program CellProfiler Analyst (CP Analyst) was developed with a scope similar to that of our tool: flexible image analysis using machine learning algorithms for biologists without the need for scripting.
We compared several aspects of the two programs, including the scope of the classifier, the training process, the user interface and the export options. Importantly, CP Analyst is limited to two-class classification, whereas Enhanced CellClassifier can manage up to five different classes. This is a limitation for the analysis of many complex biological phenotypes. For the training process, both CP Analyst and Enhanced CellClassifier provide innovative methods. In CP Analyst, the algorithm selects an adjustable number of objects for presentation, either randomly or of the "positive" or "negative" phenotype; these objects are chosen to be close to the decision boundaries of the current model and can quickly improve it. The different training options of our tool have been described above. The CP Analyst user interface is less friendly and offers very little flexibility: the objects are presented to the user as small image snippets which have to be sorted into bins of positive and negative objects. This makes visual judgment of the object phenotypes very difficult. In contrast, Enhanced CellClassifier offers many options for image presentation in order to ease visual inspection and training. Furthermore, Enhanced CellClassifier provides visual feedback of the current model on the whole image, which allows for immediate evaluation of its performance; CP Analyst lacks such a feature. Finally, CP Analyst uses a MySQL database for data retrieval, which facilitates quick summarization of data. However, its output options are severely limited; for example, the results of the two-class classification cannot be integrated with other object information, and customization or well-based data summaries are not supported. In comparison, Enhanced CellClassifier integrates results dynamically and with maximum flexibility, allowing the user to define an output with almost unlimited options (details discussed above).
For the experimental comparison, we chose the biological examples described above. For biological example 1, the classification had to be simplified since CP Analyst only supports two classes; mitotic cells therefore could not be identified simultaneously. However, the program could clearly distinguish ruffling from non-ruffling cells and recognized the phenotypes of the siRNAs tested in this experiment (not shown). In biological example 2, the program could learn to distinguish mitotic from non-mitotic cells; however, the differential analysis we performed with Enhanced CellClassifier to measure the percentage of infected cells among mitotic cells, their neighbors and interphase cells was not possible with CP Analyst.
In summary, while classification of biological objects is also possible in CP Analyst, the user is restricted to two-class classification and to an inflexible display and output which only provide the most basic analysis options. Most likely, these problems will be solved in the next version of this software, Classifier 2.0, which is not yet available for Windows.