Statistical and visual differentiation of subcellular imaging
© Hamilton et al. 2009
Received: 20 November 2008
Accepted: 22 March 2009
Published: 22 March 2009
Automated microscopy technologies have led to a rapid growth in imaging data on a scale comparable to that of the genomic revolution. High throughput screens are now being performed to determine the localisation of all proteins in a proteome. Closer to the bench, large image sets of proteins in treated and untreated cells are being captured on a daily basis to determine function and interactions. Hence there is a need for new methodologies and protocols to test for difference in subcellular imaging, both to remove bias and to enable throughput. Here we introduce a novel method of statistical testing, and supporting software, to give a rigorous test for difference in imaging. We also outline the key questions and steps in establishing an analysis pipeline.
The methodology is tested on a high throughput set of images of 10 subcellular localisations, and it is shown that the localisations may be distinguished to a statistically significant degree with as few as 12 images of each. Further, subtle changes in a protein's distribution between nocodazole treated and control experiments are shown to be detectable. The effect of outlier images is also examined and it is shown that, while the significance of the test may be reduced by outliers, this may be compensated for by utilising more images. Finally, the test is compared to previous work and shown to be more sensitive in detecting difference. The methodology has been implemented within the iCluster system for visualising and clustering bio-image sets.
The aim here is to establish a methodology and protocol for testing for difference in subcellular imaging, and to provide tools to do so. While iCluster is applicable to moderate (<1000) size image sets, the statistical test is simple to implement and will readily be adapted to high throughput pipelines to provide more sensitive discrimination of difference.
With applications such as drug discovery, the ability and the desire to experimentally determine protein localisation and trafficking is leading to a rapid growth in cell image data sets in need of analysis, on a scale comparable to that of the genomic revolution [2, 3]. A key problem in location proteomics is that the analysis and comparison of localisations is largely performed by the slow, coarse-grained and biased process of manual inspection. Just as algorithms such as BLAST have been developed to search, compare, cluster and draw conclusions from the sequence information of the genome revolution, it is critical that a similar suite of tools be developed for the flood of bio-imaging to maximise its benefit.
Towards this goal, image statistics have proved invaluable in the analysis of fluorescent subcellular imaging. Measures of features such as texture and morphology (for instance, the length of the perimeter of the object of interest), in combination with machine learning methods such as neural networks and support vector machines, have proved highly successful at classifying subcellular images of the major organelles of a cell, achieving near perfect accuracy [4–6]. However, a difficulty with such systems is that organelle structure can vary widely between cell types, and thus automated classification systems usually require re-training for each cell type, though research is ongoing to remove this limitation. Another difficulty is that subcellular localisation classes, and representative training images for each, need to be chosen prior to training. Hence automated classification is to some extent fitting an image into a pre-defined box which may not necessarily reflect the true diversity of protein expression within the cell. Despite these limitations, the question of "where is the protein in the cell?" can readily be answered using automated classification; these techniques have been applied to imaging of the whole yeast proteome and have demonstrated that automated classification can produce high confidence classifications on real world high throughput imaging.
The aims of the current work are three-fold. Firstly, we introduce a novel method of statistical testing, the centroid distance test, for comparing image sets and returning p-values for the null hypothesis, that is, that there is no change. Comparison to previous tests shows the centroid distance test to be significantly more sensitive in detecting difference in subcellular imaging. While the work we describe here has been implemented in iCluster, the statistical test is simple to implement and hence could readily be applied within other image analysis pipelines. Secondly, by examining the core issues in establishing an image analysis pipeline, such as "How many images are needed?", "Do cells need to be selected from the images?", "What is the effect of outliers?" and "How subtle an effect can be detected?", the aim is to outline a protocol for creating a workflow. Finally, by releasing the iCluster software the hope is that there will be a much wider uptake of quantitative methods within the bio-imaging community to truly enable the many benefits that the new high throughput microscope technologies offer.
iCluster is being released with this publication and is available for download under the GNU General Public Licence from http://icluster.imb.uq.edu.au/. It is available for Windows, GNU Linux and Mac OS X and includes source code. A Java applet demonstration is also available on the same site.
Depending on the application it may be beneficial to calculate image statistics for individually selected cells. For a screen in which cells are relatively uniform across the population, selection might not be required, while for transfection experiments in which cell populations may be more heterogeneous, selection may be recommended. Avoiding cell selection can be advantageous in that automated selection methods can give variable results, especially when cells are confluent on the slide. Selection will typically involve experimenting with a variety of software packages to find the one that best suits the assay.
One of the advantages of threshold adjacency statistics (TAS) (see Methods) is that they may be calculated either for images containing multiple cells or for images in which individual cells have been selected. It has previously been noted that classification accuracies using support vector machines with TAS were comparable for multiple cells per image and for selected cells. Hence images may be pre-processed before input to iCluster using dedicated cell selection software to give individual cells, or raw images containing multiple cells may be directly utilised.
To avoid confounding results by variability in the success of cell selection, here we test on images for which no pre-processing for selection or cropping has occurred.
Table 1: Worst case p-values for subsets of Image Set A, tabulating for each number of images n the worst p-value with outliers included and with outliers removed.
It can be seen that the inclusion of outlier images significantly increases the p-value for a given image set size, hence reducing the confidence with which the null hypothesis may be rejected. To achieve a 95% confidence level (p-value < 0.05) requires 19 images with outliers included, while only 12 images are needed when outliers have been removed. Hence outlier removal, while not essential if the number of outliers is relatively small, greatly improves confidence.
Table 2: Average p-values comparing plasma membrane and actin cytoskeleton image sets, tabulating for each number of images n the average p-value with and without outliers.
Again, it can be seen that outliers degrade confidence in rejecting the null hypothesis, though once 9 or more images are used both cases (on average) achieve 95% confidence. Overall the results of Tables 1 and 2 suggest that outlier removal is to be recommended and that a reasonable number of images to collect to differentiate image sets of these types would be 20, allowing that outlier removal might leave 15.
Two issues may arise in using image statistics to detect difference in imaging. The first potential problem is whether the statistics are able to detect relatively subtle but discernable differences. The second is whether the statistics are overly discriminating, that is, difference is detected where there is little or none, perhaps due to changes in imaging conditions rather than a redistribution of a protein within the cells. When testing for changes in a protein's subcellular localisation under treatment, over-sensitivity may be controlled by ensuring that microscope settings such as exposure time and imaging conditions are identical for all image sets compared.
To test whether the methodology might be over-sensitive and detect random variability, repeat experiments were performed. Using the procedure outlined in Methods for Image Set C, cells expressing fluorescently labelled LAMP1 were prepared. One set of cells was imaged on one day, and another on the consecutive day. The cells were divided into three separate populations corresponding to wells: two wells from day 1 and one well from day 2. The images from the 3 wells were then compared pair-wise by randomly selecting 12 images of each well and generating a p-value for the null hypothesis of no change. Repeating the random selection 100 times then gave an average p-value for each pair of wells. The wells imaged on the same day gave an average p-value of 0.392, while comparing wells imaged on distinct days gave p-values of 0.316 and 0.300. While the p-values are lower when comparing wells from distinct days, they would not give cause to reject the null hypothesis. Hence, with careful control of experimental conditions, the chance of detecting change where there is none is reduced.
It should be strongly emphasised that as image statistics become more sensitive there is a real danger of detecting differences in the imaging conditions or the hardware setup rather than real changes in localisation. Hence the ideal experiment is to compare image sets for which the classes to be compared are imaged at the same time on a single plate in distinct wells with identical technical specifications.
One potential problem with randomised permutation methods is that rejection of the null hypothesis may occur at too high a rate. To test the null hypothesis rejection rate, randomly chosen subsets of 15 images of the endoplasmic reticulum from Image Set A were selected. For two such (disjoint) sets, a p-value for the null hypothesis was calculated. This was repeated 10,000 times, to give 10,000 p-values. Of the 10,000 p-values, the null hypothesis was rejected at the 95% confidence level (p < 0.05) 510 times, which is close to the expected number of 500. Further, binning the p-values into intervals of length 0.05, each bin contained 500 ± 44, showing that the distribution of p-values is relatively flat. Hence it can be concluded that rejection of the null hypothesis occurs at approximately the correct rate.
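The calibration procedure above, drawing disjoint subsets from a single homogeneous pool and checking that the resulting p-values are approximately uniform, can be sketched as follows. This is a minimal illustration, not the published implementation: synthetic Gaussian vectors stand in for the per-image TAS of the endoplasmic reticulum images, and the repeat counts are reduced from the 10,000 used in the text.

```python
import numpy as np

def perm_pvalue(x1, x2, n_perm=200, rng=None):
    """Centroid distance permutation p-value (see Methods): the fraction of
    random relabellings whose centroid distance exceeds the observed one."""
    rng = rng if rng is not None else np.random.default_rng()
    obs = np.linalg.norm(x1.mean(axis=0) - x2.mean(axis=0))
    pooled = np.vstack([x1, x2])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        r1, r2 = pooled[perm[:len(x1)]], pooled[perm[len(x1):]]
        hits += np.linalg.norm(r1.mean(axis=0) - r2.mean(axis=0)) > obs
    return hits / n_perm

# Calibration check: disjoint subsets of one homogeneous pool should give
# roughly uniform p-values, rejecting the null at about the nominal 5% rate.
rng = np.random.default_rng(0)
pool = rng.normal(size=(200, 27))      # stand-in for per-image TAS vectors
pvals = []
for _ in range(200):                   # the paper uses 10,000 repeats
    idx = rng.permutation(len(pool))[:30]
    pvals.append(perm_pvalue(pool[idx[:15]], pool[idx[15:]], rng=rng))
rejection_rate = float(np.mean(np.array(pvals) < 0.05))
```

Binning `pvals` into intervals of length 0.05 and checking the bin counts, as in the text, is the same uniformity check in histogram form.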
As described in Methods, several statistical tests for difference have previously been compared, and the most sensitive for subcellular image statistics was shown to be the 3-neighbour test. Using around 40 images of individual cells of each subcellular localisation and applying this test, the null hypothesis could be rejected at a rate of 90%.
Table 3: Comparison of the centroid distance and 3-neighbour tests, tabulating the percentage of comparisons with p > 0.05.
To load the 500 images of Image Set A into iCluster and calculate TAS took 70 seconds. To calculate the spatial layout of the images (Sammon map) took approximately 5 minutes. It should be noted that while the calculation of TAS is essentially linear in the number of images, the calculation of the Sammon map is not; the spatial layout for 100 images may take only 2–5 seconds. Calculation of p-values (1000 repeats) for a pair of moderate size image sets (50 images each) is essentially instantaneous from the user's point of view. Hence for moderate size (less than 100) image sets, the images can be loaded, statistics and layout calculated, and p-values found in a few tens of seconds.
Testing was conducted on an Intel Core Duo 2 T5600 notebook with nVidia GeForce Go 7900 GS graphic card under the Fedora Core 8 Linux operating system.
The intention here has been to provide a new statistical test and a protocol for detecting difference in subcellular fluorescent microscopy imaging. It has been shown that the major subcellular localisations may readily be distinguished with as few as 12 images from high throughput microscopes, and that subtle shifts in localisation such as endosomal redistribution can be automatically detected. It has also been shown that outlier images may easily be detected from large image sets by visual inspection, and that their removal can significantly improve confidence in null hypothesis testing. In some experiments it may be the outliers that are the most interesting images, in that an unusually high number of cells are not expressing the protein in the expected manner. Further, the statistical testing framework utilising permutation testing has been rigorously evaluated to show that the p-values generated reject the null hypothesis at the expected rate and that the sensitivity is higher than previous approaches.
A significant advantage of the methodology outlined is in speed of computation. Previous comparison of computing time for TAS and the commonly used Haralick measures showed TAS to be 30 times faster to calculate. Few image statistics are as computationally simple as TAS. Hence for high throughput screening applications, an implementation of TAS with the centroid distance test could detect those experiments in which treatment has changed protein localisation in days rather than months of computational time. It is also worth noting that as a simple, fast and sensitive test, the centroid distance test could readily be implemented in high throughput screening pipelines without utilising iCluster. Indeed, we plan to apply TAS and the centroid distance test to screening applications in the near future. Another advantage over previous approaches is that the method can operate with or without cell selection, hence reducing computational expense and variability of results due to differing levels of success in the selection procedure.
It should be emphasised that care was taken to avoid human intervention in the preparation of the image sets, and to use microscopes and microscope settings commonly used for high throughput imaging. As far as we are aware this is the first study on testing for difference in subcellular imaging that utilises high throughput images that have not been selected by human intervention in any way. This gives strong confidence that the results obtained will be applicable and reproducible in "real" applications.
A feature of iCluster is that it may equally well operate with user supplied statistics. A simple text file format outlined in the user manual may be used to describe each image and a set of statistics associated with it. iCluster will then calculate spatial layout and perform statistical testing just as has been shown here for TAS. Similarly, iCluster can operate with user supplied statistics but without images being supplied, in which case each data point is represented as a simple sphere. Hence the methodology is not limited to subcellular localisation imaging and could be applied to any data or image set for which the researcher has generated some form of statistics.
As such we foresee many applications of iCluster to visual data exploration. As an example, in collaboration with other members of the Institute for Molecular Bioscience, iCluster has been used to explore data from tri-localisation experiments in cells (B. Woodcroft, L. Hammond, J. Stow, N. Hamilton: Automated organelle-based colocalisation in whole cell imaging, submitted). Each data point corresponded to an endosome from a cell, with 7 numbers describing the degree of overlap of each of 3 fluorescent markers on that endosome. With some 875 endosomes in one data set, iCluster was utilised to map the set of 7-dimensional vectors associated with the endosomes into 2 dimensions. In this representation the data naturally fell into a triangle, with each vertex of the triangle corresponding to one of the three markers used in the experiment, and points within the triangle corresponding to varying degrees of colocalisation of the proteins. In this way it was then possible to view and make sense of the whole data set and the diversity of the (co)localisations of the proteins marked on each of the endosomes in a way that was not possible by viewing a spreadsheet of the data. As bio-data sets become increasingly larger there is an urgent need for tools to explore and make sense of them, and we believe that iCluster will be invaluable in visual data exploration.
A nocodazole treated versus control image collection was generated by imaging endogenous sorting nexin 1 (SNX1) in A-431 (human epithelial carcinoma) cells treated with 10 μM nocodazole (Sigma Aldrich) or equivalent concentrations of the carrier (dimethyl sulfoxide) for 30 min (nocodazole treatment disrupts the microtubule network of the cell [20]). Endogenous SNX1 was detected with a monoclonal antibody raised against the first 108 amino acids of human SNX1 (BD Biosciences). Confocal Z-stacks (0.7 μm) of the entire volume of the monolayers were captured on a Zeiss LSM 510 confocal scanning microscope using a 63× oil objective. Maximum projections were generated using the LSM software (Zeiss). In total there were 17 treated and 16 untreated images captured at 512 × 512 resolution.
Repeat experiments of the LAMP1 marker were performed in the manner described for Image Set A. Imaging occurred on two distinct days. The image set consists of 64 images each from two distinct wells imaged on day 1, and a further 64 images from a single well captured on day 2.
Image sets are available for download from the LOCATE database home page.
A wide variety of classes of image statistics have been tested for their capacity to distinguish images of subcellular localisation, primarily for use in image classification. Conrad et al. tested 448 different image features and applied a variety of feature reduction and machine learning methods. Of those tested, Haralick texture measures, sometimes known as co-occurrence measures, were found to give the best performance. Subsequently, our group introduced threshold adjacency statistics (TAS), and found that these statistics in combination with machine learning methods could provide comparable classification accuracy (up to 95%) to the Haralick measures while being at least an order of magnitude faster to calculate. Further, TAS may be used with or without selecting individual cells from an image and do not require a separate image to identify the nuclear region. Hence for reasons of speed and simplicity, TAS are utilised here for visual and statistical testing of difference. Each image is then associated with a vector of 27 real numbers calculated from TAS.
Briefly, TAS are generated by first applying an adaptive threshold range to the image to create a binary image. Nine statistics are then calculated from the binary image: for each white pixel, the number of adjacent white pixels is counted; the first threshold statistic is the number of white pixels with no white neighbours, the second is the number with one white neighbour, and so forth up to the maximum of eight. The nine statistics are normalised by dividing each by the total number of white pixels in the threshold image. Two further sets of threshold adjacency statistics are calculated as above using two other threshold ranges, giving 27 statistics in total. Note that, so that each statistic is given equal weighting in the subsequent calculations, each is normalised by subtracting its mean over the image set and dividing by its standard deviation. Details may be found in our earlier work.
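The steps above can be sketched in Python. This is an illustrative sketch only: the three threshold ranges below are assumptions for demonstration, whereas the published method derives its adaptive ranges from the image intensity distribution, and the final per-image-set z-score normalisation is omitted.

```python
import numpy as np

def tas_for_range(img, lo, hi):
    """Nine adjacency statistics for one threshold range. A pixel is
    'white' if its intensity lies in [lo, hi]; for each white pixel the
    number of white pixels among its 8 neighbours is counted, and the
    tallies for 0..8 white neighbours are normalised by the total number
    of white pixels."""
    binary = ((img >= lo) & (img <= hi)).astype(int)
    h, w = binary.shape
    p = np.pad(binary, 1)  # zero padding so every pixel has 8 neighbours
    neigh = sum(
        p[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)
    )
    white = binary == 1
    n_white = white.sum()
    if n_white == 0:
        return np.zeros(9)
    counts = np.bincount(neigh[white], minlength=9)[:9]
    return counts / n_white

def threshold_adjacency_statistics(img):
    """27 TAS for a greyscale image: 9 statistics for each of 3 threshold
    ranges (illustrative ranges, not the published adaptive ones)."""
    mu = img[img > 30].mean() if (img > 30).any() else img.mean()
    ranges = [(mu - 30, mu + 30), (mu - 30, 255), (mu, 255)]
    return np.concatenate([tas_for_range(img, lo, hi) for lo, hi in ranges])
```

Each group of nine statistics sums to one whenever the corresponding threshold image contains any white pixels, since every white pixel has between zero and eight white neighbours.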
The Hotelling T² test, a multivariate form of Student's t-test, has previously been applied to the problem of statistical differentiation of subcellular imaging. However, there are a number of difficulties with this approach. The test assumes that each of the statistics has a normal distribution, which is often not the case for statistics generated from subcellular imaging. Further, the test requires there to be more images in each class being tested than the number of statistics generated for each image. With anywhere from 27 to 57 statistics generated for subcellular imaging, this can severely limit the application of the test. It has also been shown that, using around 40 images of cells, each of the major subcellular localisations could be differentiated using the Friedman-Rafsky test, which utilises minimal spanning trees, or by k-nearest neighbour testing. Briefly, in the k nearest neighbour test, the nearest neighbours of each of the data points are examined, the nearest neighbours of a data point being those data points that are closest to it as measured by the Euclidean distance. For a given data point, the number of its k nearest neighbours that are of the same class as the point is recorded. The test statistic is then the total number of k nearest neighbours of elements of a set that are also in that set.
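The k nearest neighbour test statistic just described can be sketched as follows; the function name is our own, the distance is Euclidean and each point is excluded from its own neighbour list. Significance would then be assessed, as for the other tests discussed, by comparing the statistic against its distribution under random permutations of the class labels.

```python
import numpy as np

def knn_same_class_statistic(points, labels, k=3):
    """Total, over all points, of the number of each point's k nearest
    neighbours (Euclidean distance, self excluded) that share that
    point's class label."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    total = 0
    for i in range(len(points)):
        nn = np.argsort(d[i])[:k]
        total += int((labels[nn] == labels[i]).sum())
    return total
```

For two well separated classes the statistic attains its maximum of n × k, while for thoroughly mixed classes it falls towards the value expected under random labelling.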
Both approaches use statistics on the classes of the neighbours of each image, and whether those neighbours are of the same class. Hence these tests are to some degree measuring the disjointness of the statistics of the image sets being compared.
Towards detecting shifts in the statistical centres of image sets rather than the discreteness of clusters, the approach taken here is a centroid distance test employing permutation testing. To test for difference between two image sets I1 and I2, 27 TAS are generated for each image in the sets. The mean statistics vectors μ(I1) and μ(I2) are then calculated for each, together with the Euclidean distance d(μ(I1), μ(I2)). The null hypothesis is that the image statistics of I1 and I2 are drawn from the same distribution, more specifically that the population means are the same: μ(I1) = μ(I2). To test this, the observations of I1 and I2 are randomly permuted to give sets R1 and R2 which have the same sizes as I1 and I2, respectively, but may contain statistics vectors from either. The distance d(μ(R1), μ(R2)) is then calculated. Repeating 1000 times, the fraction of the repeats for which d(μ(R1), μ(R2)) > d(μ(I1), μ(I2)) gives a p-value for the null hypothesis. For image sets for which there is a detectable difference, it would be expected that the mean vectors would be more separated, on average, than the randomisations, hence giving a small number of repeats for which d(μ(R1), μ(R2)) > d(μ(I1), μ(I2)).
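The test above can be sketched in a few lines of NumPy; this is a minimal illustration under the stated definitions (the function and argument names are our own), taking the per-image statistics vectors as already computed.

```python
import numpy as np

def centroid_distance_test(x1, x2, n_perm=1000, seed=0):
    """Permutation test for a shift in mean statistics vectors.

    x1, x2 : (n1, d) and (n2, d) arrays of per-image statistics vectors
    (e.g. the 27 TAS per image). Returns a p-value for the null
    hypothesis that both sets are drawn from the same distribution."""
    rng = np.random.default_rng(seed)
    # Observed distance between the two centroids.
    observed = np.linalg.norm(x1.mean(axis=0) - x2.mean(axis=0))
    pooled = np.vstack([x1, x2])
    n1 = len(x1)
    exceed = 0
    for _ in range(n_perm):
        # Randomly reassign the pooled vectors to two sets of the
        # original sizes and recompute the centroid distance.
        perm = rng.permutation(len(pooled))
        r1, r2 = pooled[perm[:n1]], pooled[perm[n1:]]
        if np.linalg.norm(r1.mean(axis=0) - r2.mean(axis=0)) > observed:
            exceed += 1
    return exceed / n_perm
```

For example, two sets of vectors drawn from well separated distributions yield a p-value near zero, while two sets drawn from the same distribution yield a p-value that gives no cause to reject the null hypothesis.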
TAS: threshold adjacency statistics. HeLa: cervical cancer cells named after their donor, Henrietta Lacks. A-431: human epithelial carcinoma cells.
This work was supported by funds from the Australian Research Council and the National Health and Medical Research Council of Australia. Confocal microscopy was performed at the Australian Cancer Research Foundation (ACRF)/Institute for Molecular Bioscience Dynamic Imaging Facility for Cancer Biology, which was established with the support of the ACRF.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.