Hybrid clustering for microarray image analysis combining intensity and shape features
© Rahnenführer and Bozinov 2004
Received: 29 October 2003
Accepted: 29 April 2004
Published: 29 April 2004
Skip to main content
© Rahnenführer and Bozinov 2004
Received: 29 October 2003
Accepted: 29 April 2004
Published: 29 April 2004
Image analysis is the first crucial step to obtain reliable results from microarray experiments. First, areas in the image belonging to single spots have to be identified. Then, those target areas have to be partitioned into foreground and background. Finally, two scalar values for the intensities have to be extracted. These goals have been tackled either by spot shape methods or intensity histogram methods, but it would be desirable to have hybrid algorithms which combine the advantages of both approaches.
A new robust and adaptive histogram type method is pixel clustering, which has been successfully applied for detecting and quantifying microarray spots. This paper demonstrates how the spot shape can be effectively integrated in this approach. Based on the clustering results, a bivalence mask is constructed. It estimates the expected spot shape and is used to filter the data, improving the results of the cluster algorithm. The quality measure 'stability' is defined and evaluated on a real data set. The improved clustering method is compared with the established Spot software on a data set with replicates.
The new method presents a successful hybrid microarray image analysis solution. It incorporates both shape and histogram features and is specifically adapted to deal with typical microarray image characteristics. As a consequence of the filtering step pixels are divided into three groups, namely foreground, background and deletions. This allows a separate treatment of artifacts and their elimination from the further analysis.
In DNA microarray experiments, genetic probes with known identity are affixed to a glass slide or another substrate at discrete spots. The probes are prepared for binding with cDNA or mRNA samples. Typically, the genetic composition of two such samples is compared. The two samples are labeled with red-fluorescent and green-fluorescent dye, respectively, mixed and competitively hybridized to the microarray containing the complementary probes. For early references of this technology see Schena et al.  and Shalon et al. .
Using a laser scanner, TIFF images of the microarray are obtained. The relative abundance of one or the other sample is represented by a red or green signal at the spot location. The two major objectives of microarray image analysis are therefore to find the discrete spot locations and to quantify the spot intensities. Many available tools provide algorithms to solve these problems; among these, GenePix (Axon Instruments ), Imagene (Biodiscovery, Inc. ), QuantArray (GSI Lumonics ) and ScanAlyze (Eisen ) are widely used. Most methods assume circular spot shapes and require manual alignment of the grid locations. Therefore, automated grid and spot finding as well as robust intensity quantification are highly desirable. For oligonucleotide fingerprinting and complex hybridizations, automated array processing has for example been presented by Steinfath et al. . The intensities are calculated based on a normal distribution model for every single spot. In cDNA microarray images, the assumption of a circular spot shape is usually not justifiable due to artifacts caused by the printing process and the hybridization technique. Generally, two main concepts dealing with this obstacle have been presented, namely pixel intensity histogram methods and shape detection methods. Histogram methods are widely used, for example by Imagene or QuantArray, see Chen et al.  for an early reference. The first effective shape method is implemented in Spot (Buckley ) and uses the 'seeded region growing' algorithm (Adams and Bischof ), see also Yang et al.  for further details. The Spot software is currently one of the most competitive software tools for microarray image analysis, as it successfully deals with different spot shapes and artifacts.
A new robust method is pixel clustering. The basic algorithm, introduced in Bozinov and Rah-nenführer , was developed for spot identification and intensity quantification. Subsequently, the clustering approach was extended for grid finding. This idea was first tested in Bozinov et al. , where it was applied to a single high-quality array. The Gridclus algorithm described in this paper represents an improvement that leads to satisfying results also for low-quality arrays.
The original pixel clustering algorithm is an adaptive histogram method without direct attention towards the spot shape. In the present paper, we describe the further development and improvements of this algorithm. The detailed algorithms are given in the Methods section. The two main features of the new approach are repeated clustering and mask matching. Repeated clustering is applied, if the clustering selects only very few pixels as foreground region, mainly in case of low foreground intensity and small bright artifacts. In such a case the outlier pixels are removed and clustering is repeated on the background pixels, until at least m pixels (e.g. m = 50) constitute for the spot foreground area. Mask matching integrates the spot shape into the algorithm. Based on the cluster results of all spots, a bivalence mask is constructed to estimate the average spot shape. The mask is used as a template to filter out low-quality parts of single microarray spots. This yields a genuine combination of the two central features 'histogram' and 'shape' and thus a favorable hybrid image analysis solution. An advantage of this method is the partitioning of pixels into foreground, background and discarded pixels. This allows the elimination of artifacts from the further analysis.
In the Methods section below, first the pixel clustering approaches for grid and spot detection are reviewed. Then the hybrid approach and the associated quality measure 'stability' are introduced. The stability is the proportion of pixels in the foreground area that are not deleted by the mask matching step. The complete spot detection algorithm is called HYBRID PXKMEANS.
In the next chapter, we report on the application of the new algorithms to real microarray images. The ability to produce reliable values in an experiment with replicates is compared, between the new hybrid solution and one of the most successful alternatives, the Spot software (Buckley ). The results show the competitiveness of the pixel clustering approach.
HYBRID PXKMEANS was applied to a real microarray image from the Microarray Core Facility of the University of Nebraska Medical Center. The array consisted of 24 rows and 36 columns of gene spots, in total 864 spots. With Gridclus the target areas could be perfectly identified.
After single clustering, the foreground area size was smaller than 25 pixels for 118 spots, approximately 14% of all 864 spots. For all others spots the area was bigger than 50 pixels. Clearly, in the 118 problematic cases, the algorithm identified just very few outlier pixels as foreground. After repeated clustering, all area sizes were bigger than 50 pixels, by construction of the algorithm. It is striking, that the distribution of the sizes of these 118 reorganized spot areas resembles the distribution of the spot areas identified after the first clustering. For 115 out of those 118 spots, the algorithm terminated immediately after the second clustering; for the other 3 spots, after the third clustering. These observations indicate, that the spot shape is already detected after the second clustering, and the foreground area size is not just continually increased by some random pixels. More evidence for this hypothesis is obtained by the distribution of the stability values. This will be explained in more detail below. The choice of the number m is not critical, since the only purpose of this step is to eliminate the influence of bright pixels belonging to small artifacts. Such pixels often constitute for the whole foreground area after initial clustering. In the example presented, any number between 25 and 50 would have led to identical results.
After single clustering of all spots in the present image, the number of foreground assignments was determined for every pixel as described in step 4 of HYBRID PXKMEANS. The average of these numbers over pixels was = 146.98. According to step 4 of HYBRID PXKMEANS, every pixel with f > is assigned to foreground in the bivalence mask. In other words, if in more than 147 out of 864 spots a pixel at a fixed position belonged to foreground, it was also set to foreground in the bivalence mask. The other pixels were assigned to background.
In Figures 7 and 8 we also see that parts of the deleted areas resemble almost ring-shaped regions between foreground and background. This is a designed property of many other microarray image analysis tools. Here, it is an unforced consequence of the algorithm, which adds to the confidence in the strength of this approach.
Figure 12 underlines the differences of the two approaches. The Spot algorithm produces only values with higher intensities, due to the treatment of the background, whereas HYBRID PXKMEANS leads to intensity estimates in the whole potential intensity range. This is an advantage of the pixel clustering method, since for the low-quality array used for the comparison, the true expression of many genes is in fact rather low. The estimated relative expression of such genes is regressed towards 1 by the Spot software. The clustering method thus leads to a finer resolution for low expression values, whereas Spot is less sensitive to systematic errors. Naturally, in the low intensity range we observe a worse reproducibility for the clustering method, see Figure 11. For intermediate intensities (in the range 8–10), the reproducibility is extremely similar for both algorithms (see Figure 12), and for high intensities, pixel clustering performs better. One has to take into account the smaller number of genes in this range, though.
To summarize, both algorithms create similar reproducibility values, which underlines the competitiveness of the new HYBRID PXKMEANS algorithm. One problem of pixel clustering should be discussed, though. In some cases, the deletion step through mask matching leads to spots with only very few pixels in the foreground area. This is the case for example, if a large artifact in the background contains most of the high-intensity pixels. This also is the reason for the four long lines in the bottom right corner of Figure 10. In all cases, for one of the two replicates the problem mentioned above appears and the spots should be flagged.
Finally, we stress the improvement of the algorithm in comparison with the original simple pixel clustering approach introduced in Bozinov and Rahnenführer . Without the extensions described in the present paper, about 20% of the spots in the low-quality array have to be flagged, because less then 5 pixels constitute for the foreground area. Again, this is caused by small artifacts with a very high intensity. The repetitive clustering eliminates this problem. The mask matching step is a good control, if now correct parts of the spot target area are assigned to foreground and background. For the low-quality array, only four spots are left to be flagged. This demonstrates the usefulness and the suitability of the proposed modifications.
The proposed HYBRID PXKMEANS algorithm represents a true hybrid microarray image analysis solution. Originally being a pure histogram method, the important shape aspect is included through the mask matching step. We use indicator values of the clustering results instead of original intensity values to construct the bivalence mask. This is favorable, since then all microarray spots weigh equally. Single low-quality spots with high intensities cannot impair the results, and many spots with low intensities add valuable information.
Using a high quality array, the 'stability' values provide a good check of the ability of the algorithm to identify spot areas. The median stability of 0.945 for our real data set demonstrates the feasibility of the algorithm. With this knowledge, we actually use the 'stability' as a reliable quality measure for the microarray image itself, both for single spots and for the whole array. Spots with a low 'stability' are flagged. Another criterion to flag spots could be a small absolute foreground area size relative to the bivalence mask.
On a low-quality microarray image with replicates, the new algorithm is compared to the Spot software. The results show the competitiveness of the extended pixel clustering algorithm with established methods. Like other approaches, this one can easily be generalized for different types of arrays, for example with non-competitive hybridizations. The only adjustment then, is to simply apply the cluster algorithm to the one-dimensional values.
The software used for the analyses in this paper is available upon request from the authors. The program was written in Java and thoroughly tested on Windows.
A fully automated microarray image analysis consists of spot target area determination, partitioning of these areas into foreground and background, and intensity extraction. We first describe the Gridclus algorithm that correctly identified grids in a variety of real microarray images. Then, the original pixel clustering algorithm for spot finding and quantification is described, as a basis for its extension, the new hybrid approach.
Two two-dimensional arrays of intensity values are given, one for the red channel and one for the green channel. For a pixel in row i 1 and column i 2, let R(i 1, i 2) and G(i 1, i 2) denote the intensity values for the red and the green channel, with i 1 = 1..N 1 and i 2 = 1..N 2. Here, N 1 and N 2 denote the total number of pixel rows and columns, respectively, on the entire microarray.
Iterative clustering of all pixels into foreground and background
Define m 0 := (min(R), min(G)) and m 1 := (max(R), max(G)). Apply the k-means cluster algorithm with k = 2 to the N 1 · N 2 two-dimensional values (R(i 1, i 2), G(i 1, i 2)). Choose m 0 and m 1 as starting points for k-means. The output is an indicator function I(i 1, i 2) that assigns a cluster membership value to every pixel: 1 (for foreground) to pixels in the cluster with the larger average intensity and 0 (for background) to pixels in the other cluster. Calculate the fraction of pixels f in the foreground by dividing the number of foreground pixels (I(i 1, i 2) = 1) by the total number of pixels N 1 · N 2. As long as f is smaller than a prefixed value (e.g. 0.2), repeat clustering the group of background pixels from the previous step into foreground and background pixels and reassign the former background pixels, increasing f in every step.
Assignment of pixel rows and columns to foreground or background
For every pixel row or column l, calculate the corresponding fraction of foreground pixels f l . Apply smoothing to the fractions f l with a fixed window size (e.g. w = 7), i.e. calculate averages of w contiguous fractions. Then, assign a line to foreground, if f l >f, and to background, if f l ≤ f. The output are two binary vectors, one for columns and one for rows, that assign all lines to foreground or to background.
Calculation of target area separation lines
Both for rows and columns, determine intervals of contiguous background lines. Calculate means of those intervals as spot target area separation lines and adjust leftmost and rightmost separation line based on median separation line distance.
Determination of fixed size spot target areas
Between each pair of contiguous target area separation lines, determine the maximum of the smoothed fractions f l . This is the estimated midpoint of a spot row or column, respectively. Choose a fixed size area (e.g. 25 × 25 pixels) around the estimated midpoints as final spot target areas.
In some cases, a further refinement is necessary. Microarray grids are often slightly rotated. Thus, out of a set of potential rotation angles one is chosen that maximizes an optimization criterion for best separation into foreground and background lines. The choice of the smoothing constant w = 7 is rather conservative and guaranteed for all tested examples, that contiguous sequences for the foreground and background regions are found. For high-quality arrays, a value of w = 3, in some cases even no smoothing can produce the same results. The fixed size area of 252 pixels has to be adjusted, if the true average spot size is significantly different. For the clustering algorithms it is most convenient to choose an area that consists of at least twice as many points as belong to the true average spot size.
The feasibility of a basic version of Gridclus was demonstrated on one real high-quality microarray image tiff-file (Bozinov et al. ). In this old version, there was no iteration in step 1. The iteration is needed only in the presence of few high-intensity outlier pixels to avoid an extreme unbalanced segmentation into foreground and background. Thus the specific value 0.2 for the minimum portion of foreground pixels is not critical and can be used for every array. Furthermore, step 4 was not present in the old version. The centralization is essential for the mask matching step in the hybrid spot clustering algorithm described below. The extensions in Gridclus allow the application to microarray images of a wider quality range.
The basic spot clustering approach uses the k-means algorithm. A more suitable version for typical impediments encountered in microarray spots was developed by Bozinov and Rahnenführer  using the algorithm PAM (Partitioning around medoids, introduced by Kaufman and Rousseeuw ) with a robust dissimilarity matrix.
Construction of initial representatives
Define starting midpoints m 1 = (R fg , G fg ) and m 2 = (R bg , G bg ), where R fg and G fg are the highest red and green intensity values and R bg and G bg the lowest ones.
Determination of local optimum of cluster problem (k-means)
Repeat alternating the following two steps until convergence:
Assign each data point to its closest of the two midpoints
Calculate two new midpoints as the means of all points assigned to the old midpoints. The outcome are two cluster midpoints m 1 and m 2
Let R fg and G fg be the values of the cluster mean with higher intensity values and R bg and G bg those of the other one. Calculate (R fg - R bg )/(G fg - G bg ) as the final relative abundance estimate.
Several refinements of this algorithm have been introduced in the original paper, especially the use of the median instead of the mean in step 3. This adjustment is not justified in theory. A more theoretically based approach replaces k-means by PAM. Here, steps 2 and 3 are inherently solved in one step. The cluster algorithm minimizes an objective function and returns the corresponding prototypes.
Calculation of dissimilarity matrix of spot pixels
Calculate the Manhattan distances between all pairs of pixels, i.e. the distance between two pixels is defined as the sum of the absolute differences of red and green intensities.
Construction of initial representatives
Build phase of PAM with Manhattan dissimilarity matrix.
Determination of local optimum of cluster problem
Swap phase of PAM.
See PXKMEANS, only cluster means are replaced by cluster medoid pixels.
The analysis of real microarray spots showed that both methods yield very similar final results in practice. Whereas the refined PXKMEANS represents a more heuristic solution, the theoretically more meaningful use of PAM is by far more time consuming and sometimes lacks stability. In the following we refer to pixel clustering as implemented in PXKMEANS. For related literature on the 2-center problem and cluster analysis we refer to Agarwal et al. , Gordon , Haralick and Shapiro  and Jain et al. .
Pixel clustering applied to real microarray images on a larger scale disclosed two drawbacks. To address these problems, the pixel clustering algorithm is extended by two steps. We first present the problems and then the improved algorithm.
Sometimes spots are comprised of a few outlier pixels with extreme high intensity, due to noise or generally low foreground intensities. Consequently, very few pixels (less then 20 in most cases) are assigned to foreground and the rest to background. Many true foreground pixels are missed, and the spot intensities are overestimated.
Most spot shapes are recovered well by pixel clustering. In some cases, however, regions within the spot target areas are obviously incorrectly assigned, taking into consideration the known theoretical circular shape of microarray spots. In particular, artifacts in the background are assigned to foreground and inner regions with low intensities are assigned to background.
Determination of starting values and applying k-means
See PX KMEANS .
Potential repeated clustering to increase number of foreground pixels
Choose a minimum number m of pixels for the foreground area, typically at least m = 50. As long as this number is not reached, the set of background pixels alone is continuously clustered into two groups and the brighter group is assigned to foreground.
Pixel reduction through mask matching
All target areas have the same size by construction of the grid clustering algorithm. For every pixel with a fixed position in the target area, count the number f, for how many spots of the array the pixel is considered to be in the foreground. Calculate the average of these values over all pixels. Define the 'bivalence mask' as follows. If f > , a pixel in the mask is assigned to foreground, otherwise to background. For a single spot, delete all pixels that are assigned to foreground in the spot and to background in the mask and vice versa.
See PX KMEANS , only use medians instead of means and omit pixels that are deleted due to mask matching in step 4.
Step 3 is designed to guarantee a minimum size for the foreground area. The idea of step 4 is to overlay all single spot clustering results in order to obtain more reliable information through the aggregated results. The bivalence mask represents a prototype for the average spot shape (for an example see Figure 3). Note that frequencies are not compared to the constant 1/2, but to the average value for all pixels. This can be interpreted as a statistical method, based on the comparison with the null hypothesis of random foreground or background membership. A convenient consequence of the mask matching step is that pixels are divided into three groups instead of two: Foreground, background and deletions. This is desirable, since it allows a separate treatment of artifacts and their elimination from the further analysis.
To demonstrate the practical feasibility of this proceeding, a suitable quality measure is introduced. In a perfect scenario, all spots have the same shape, can be fully identified by the clustering algorithm, and no pixels are deleted through mask matching. The deviation from this ideal case illustrates, how well the algorithm works for a specific array.
For a fixed spot on the array, the stability s is defined as the relative frequency of pixels in the foreground area that are not deleted by mask matching in step 4 of the HYBRID PXKMEANS algorithm.
For a whole microarray, the median stability is defined as the median of all single spot stabilities s.
Obviously, it holds 0 ≤ s ≤ 1 and 0 ≤ ≤ 1. Values close to 1 indicate a high quality for the respective spot or array. 'Stability' is well suited as a quality measure, since the original pixel clustering algorithm doesn't take the shape into account at all. Thus the 'stability' represents a good subsequent control for the correct assignment of the pixels. The 'median stability' is a meaningful scalar summary measure. Especially so-called black holes with higher background than foreground intensities can be easily detected. In such a situation, most pixels are deleted by the mask matching step and the stability yields a number close to 0. In general, low stability values can be caused both by a failure of the algorithm and by poor array quality.
This work was supported by the "Deutsche Forschungsgemeinschaft" (JR, RA 870/2-2), by the Nebraska Research Initiative Grant 31-3205-0502 (DB) and through a consultancy (JR) on NIH grant R01HD037804-04 (Claudia Kappen). We thank Andrea Krempler for carefully reading the manuscript. Parts of this work were done at the Department of Statistics, University of California, Berkeley.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.