A fully automatic gridding method for cDNA microarray images
- Luis Rueda^{1} and
- Iman Rezaeian^{1}
DOI: 10.1186/1471-2105-12-113
© Rueda and Rezaeian; licensee BioMed Central Ltd. 2011
Received: 17 September 2010
Accepted: 21 April 2011
Published: 21 April 2011
Abstract
Background
Processing cDNA microarray images is a crucial step in gene expression analysis, since any errors in early stages affect subsequent steps, leading to possibly erroneous biological conclusions. When processing the underlying images, accurately separating the sub-grids and spots is extremely important for subsequent steps that include segmentation, quantification, normalization and clustering.
Results
We propose a parameterless and fully automatic approach that first detects the sub-grids given the entire microarray image, and then detects the locations of the spots in each sub-grid. The approach, first, detects and corrects rotations in the images by applying an affine transformation, followed by a polynomial-time optimal multi-level thresholding algorithm used to find the positions of the sub-grids in the image and the positions of the spots in each sub-grid. Additionally, a new validity index is proposed in order to find the correct number of sub-grids in the image, and the correct number of spots in each sub-grid. Moreover, a refinement procedure is used to correct possible misalignments and increase the accuracy of the method.
Conclusions
Extensive experiments on real-life microarray images and a comparison to other methods show that the proposed method performs these tasks fully automatically and with a very high degree of accuracy. Moreover, unlike previous methods, the proposed approach can be used on various types of microarray images with different resolutions and spot sizes, and does not need any parameter to be adjusted.
Background
Microarrays are one of the most important technologies used in molecular biology to massively explore how the genes express themselves into proteins and other molecular machines responsible for the different functions in an organism. These expressions are monitored in cells and organisms under specific conditions, and have many applications in medical diagnosis, pharmacology and disease treatment, to mention just a few. We consider cDNA microarrays, which are produced on a chip (slide) by hybridizing sample DNA on the slide, typically in two channels. Scanning the slides at a very high resolution produces images composed of sub-grids of spots. Image processing and analysis are two important aspects of microarrays, since the aim of the whole experimental procedure is to obtain meaningful biological conclusions, which depends on the accuracy of the different stages, mainly those at the beginning of the process. The first task in the sequence is gridding [1–5], which, if done correctly, substantially improves the efficiency of the subsequent tasks that include segmentation [6], quantification, normalization and data mining. When producing cDNA microarrays, many parameters are specified, such as the number and size of spots, number of sub-grids, and even their exact locations. However, many physicochemical factors produce noise, misalignment, and even deformations in the sub-grid template, so that it is virtually impossible to know the exact location of the spots after scanning, at least with the current technology, without performing complex procedures. Roughly speaking, gridding consists of determining the spot locations in a microarray image (typically, in a sub-grid). The gridding process requires the sub-grids to be known in advance; finding them is called sub-gridding.
Many approaches have been proposed for sub-gridding and spot detection. The Markov random field (MRF) is a well-known approach that applies different constraints and heuristic criteria [1, 7]. Mathematical morphology is a technique used for the analysis and processing of geometric structures, based on set theory, topology, and random functions. It helps remove peaks and ridges from the topological surface of the images, and has been used for gridding microarray images [8]. Jain's [9], Katzer's [10], and Steinfath's [11] models are integrated systems for microarray gridding and quantitative analysis. A method for detecting spot locations based on a Bayesian model has been recently proposed, and uses a deformable template to fit the grid of spots using a posterior probability model for which the parameters are learned by means of a simulated-annealing-based algorithm [1, 3]. Another method for finding spot locations uses a hill-climbing approach to maximize the energy, seen as the intensities of the spots, which are fit to different probabilistic models [5]. Fitting the image to a mixture of Gaussians is another technique that has been applied to gridding microarray images by considering radial and perspective distortions [4]. A Radon-transform-based method that separates the sub-grids in a cDNA microarray image has been proposed in [12]. That method applies the Radon transform to find possible rotations of the image and then finds the sub-grids by smoothing the row or column sums of pixel intensities; however, it does not automatically find the correct number of sub-grids, and the process is subject to data-dependent parameters.
Another approach to cDNA microarray gridding performs a series of steps, including rotation detection, and compares the row or column sums of the top-most and bottom-most parts of the image [13, 14]. This method, which detects rotation angles with respect to one of the axes, either x or y, has not been tested on images having regions with high noise (e.g., when the bottom-most part of the image is quite noisy).
Another method for gridding cDNA microarray images uses an evolutionary algorithm to separate sub-grids and detect the positions of the spots [15]. The approach is based on a genetic algorithm that discovers parallel and equidistant line segments, which constitute the grid structure. Thereafter, a refinement procedure is applied to further improve the existing grid structure by slightly modifying the line segments. Maximum-margin gridding is another method for automatic gridding of cDNA microarray images, based on maximizing the margin between rows and columns of spots [16]. Initially, a set of grid lines is placed on the image in order to separate each pair of consecutive rows and columns of the selected spots. Then, the optimal positions of the lines are obtained by maximizing the margin between these rows and columns using a maximum margin linear classifier. Following the same idea, an SVM-based gridding method was used in [17]. In that method, the positions of the spots on a cDNA microarray image are first detected using image analysis operations. A set of soft-margin linear SVM classifiers is then used to find the optimal layout of the grid lines in the image. Each grid line corresponds to the separating line produced by one of the SVM classifiers, which maximizes the margin between two consecutive rows or columns of spots.
Results and Discussion
For testing the proposed method (called Optimal Multi-level Thresholding Gridding or OMTG), three different kinds of cDNA microarray images have been used. The images have been selected from different sources, and have different scanning resolutions, in order to study the flexibility of the proposed method to detect sub-grids and spots with different sizes and features.
The first test suite consists of a set of images drawn from the Stanford Microarray Database (SMD), and corresponds to a study of the global transcriptional factors for hormone treatment of Arabidopsis thaliana samples. The images can be downloaded from smd.stanford.edu, by selecting "Hormone treatment" as category and "Transcription factors" as subcategory. Ten images were selected, which correspond to channels 1 and 2 for experiment IDs 20385, 20387, 20391, 20392 and 20395. The images have been named using AT (which stands for Arabidopsis thaliana), followed by the experiment ID and the channel number (1 or 2).
The second test suite consists of a set of images from Gene Expression Omnibus (GEO) and corresponds to an Atlantic salmon head kidney study. The images can be downloaded from ncbi.nlm.nih.gov, by selecting "GEO Datasets" as category and searching for the name of the image. Eight images were selected, which correspond to channels 1 and 2 for experiment IDs GSM16101, GSM16389 and GSM16391, and also channel 1 of GSM15898 and channel 2 of GSM15899. The images have been named using GSM followed by the experiment ID, and the channel number (1 or 2).
Test suite used to evaluate the performance of the methods
Suite Name | SMD | GEO | DILN |
---|---|---|---|
Database Name | Stanford Microarray Database | Gene Expression Omnibus | Dilution Experiment |
Image Format | Tiff | Tiff | Tiff |
No. of Images | 10 | 8 | 2 |
Image Resolution | 1910 × 5550 | 1900 × 5500 | 600 × 2300 |
Sub-grid Layout | 12 × 4 | 12 × 4 | 5 × 2 |
Spot Layout | 18 × 18 | 13 × 14 | 8 × 8 |
Spot Resolution | 24 × 24 | 12 × 12 | from 12 × 12 to 3 × 3 |
Processing time of sub-grid and spot detection
Suite | Sub-grid Detection | Spot Detection |
---|---|---|
SMD | 379.1 | 10.8 |
GEO | 384.7 | 9.2 |
DILN | 62.3 | 3.8 |
Sub-grid and Spot Detection Accuracy
Accuracy of the proposed method on the SMD dataset
Image | Sub-grid: Incorrectly | Sub-grid: Marginally | Sub-grid: Perfectly | Spot: Incorrectly | Spot: Marginally | Spot: Perfectly |
---|---|---|---|---|---|---|
AT-20385-CH1 | 0.0% | 0.0% | 100% | 4.30% | 0.46% | 95.24% |
AT-20385-CH2 | 0.0% | 0.0% | 100% | 2.83% | 0.09% | 97.08% |
AT-20387-CH1 | 0.0% | 0.0% | 100% | 2.90% | 0.14% | 96.96% |
AT-20387-CH2 | 0.0% | 0.0% | 100% | 0.52% | 0.11% | 99.37% |
AT-20391-CH1 | 0.0% | 0.0% | 100% | 0.64% | 0.17% | 99.19% |
AT-20391-CH2 | 0.0% | 0.0% | 100% | 0.32% | 0.26% | 99.42% |
AT-20392-CH1 | 0.0% | 0.0% | 100% | 4.10% | 0.33% | 95.57% |
AT-20392-CH2 | 0.0% | 0.0% | 100% | 0.21% | 0.25% | 99.54% |
AT-20395-CH1 | 0.0% | 0.0% | 100% | 0.41% | 0.12% | 99.47% |
AT-20395-CH2 | 0.0% | 0.0% | 100% | 0.98% | 0.31% | 98.71% |
Accuracy of the proposed method on the GEO dataset
Image | Sub-grid: Incorrectly | Sub-grid: Marginally | Sub-grid: Perfectly | Spot: Incorrectly | Spot: Marginally | Spot: Perfectly |
---|---|---|---|---|---|---|
GSM15898-CH1 | 0.0% | 0.0% | 100% | 0.58% | 0.16% | 99.26% |
GSM15899-CH2 | 0.0% | 0.0% | 100% | 1.00% | 0.21% | 98.79% |
GSM16101-CH1 | 0.0% | 0.0% | 100% | 0.00% | 0.32% | 99.68% |
GSM16101-CH2 | 0.0% | 0.0% | 100% | 1.57% | 0.06% | 98.37% |
GSM16389-CH1 | 0.0% | 0.0% | 100% | 0.79% | 0.12% | 99.09% |
GSM16389-CH2 | 0.0% | 0.0% | 100% | 0.57% | 0.04% | 99.39% |
GSM16391-CH1 | 0.0% | 0.0% | 100% | 0.00% | 0.24% | 99.76% |
GSM16391-CH2 | 0.0% | 0.0% | 100% | 0.14% | 0.13% | 99.73% |
Accuracy of the proposed method on the DILN dataset
Image | Sub-grid: Incorrectly | Sub-grid: Marginally | Sub-grid: Perfectly | Spot: Incorrectly | Spot: Marginally | Spot: Perfectly |
---|---|---|---|---|---|---|
Diln4-3.3942.01A | 0.0% | 0.0% | 100% | 2.23% | 0.05% | 97.72% |
Diln4-3.3942.01B | 0.0% | 0.0% | 100% | 1.71% | 0.11% | 98.18% |
Effectiveness of the refinement procedure
Image | Without refinement: Incorrectly | Without refinement: Marginally | Without refinement: Perfectly | With refinement: Incorrectly | With refinement: Marginally | With refinement: Perfectly |
---|---|---|---|---|---|---|
AT-20385-CH1 | 4.73% | 0.79% | 94.48% | 4.30% | 0.46% | 95.24% |
AT-20387-CH2 | 0.93% | 0.54% | 98.53% | 0.52% | 0.11% | 99.37% |
AT-20391-CH2 | 0.71% | 0.58% | 98.71% | 0.32% | 0.26% | 99.42% |
AT-20395-CH2 | 1.37% | 0.76% | 97.87% | 0.98% | 0.31% | 98.71% |
GSM16101-CH2 | 2.13% | 0.21% | 97.66% | 1.57% | 0.06% | 98.37% |
GSM16389-CH1 | 0.93% | 0.19% | 98.88% | 0.79% | 0.12% | 99.09% |
GSM16391-CH2 | 0.47% | 0.26% | 99.27% | 0.14% | 0.13% | 99.73% |
Rotation Adjustment Accuracy
Accuracy of the proposed method on the rotated images
Rotation | AT-20395-CH1: Incorrectly | AT-20395-CH1: Marginally | AT-20395-CH1: Perfectly | GSM16391-CH2: Incorrectly | GSM16391-CH2: Marginally | GSM16391-CH2: Perfectly |
---|---|---|---|---|---|---|
none | 0.41% | 0.12% | 99.47% | 0.14% | 0.13% | 99.73% |
5° | 0.41% | 0.12% | 99.47% | 0.14% | 0.13% | 99.73% |
10° | 0.43% | 0.12% | 99.45% | 0.15% | 0.14% | 99.71% |
15° | 0.41% | 0.13% | 99.46% | 0.14% | 0.13% | 99.73% |
20° | 0.42% | 0.13% | 99.45% | 0.15% | 0.14% | 99.71% |
25° | 0.42% | 0.15% | 99.43% | 0.14% | 0.15% | 99.71% |
-5° | 0.41% | 0.12% | 99.47% | 0.14% | 0.13% | 99.73% |
-10° | 0.41% | 0.12% | 99.47% | 0.14% | 0.13% | 99.73% |
-15° | 0.42% | 0.13% | 99.45% | 0.14% | 0.14% | 99.72% |
-20° | 0.42% | 0.14% | 99.44% | 0.15% | 0.13% | 99.72% |
-25° | 0.42% | 0.16% | 99.42% | 0.14% | 0.15% | 99.71% |
Comparison with other methods
Conceptual comparison of different methods
Method | Parameters | Sub-grid Detection | Spot Detection | Automatic Detection No. of Spots | Rotation |
---|---|---|---|---|---|
RTSG | n: Number of sub-grids | √ | × | × | √ |
BSAG | α ,β: Parameters for balancing prior and posterior probability rates | × | √ | √ | √ |
GABG | μ, c: mutation and crossover rates; p_{max}: probability of maximum threshold; p_{low}: probability of minimum threshold; f_{max}: percentage of line with low probability to be a part of grid; T_{p}: refinement threshold | √ | √ | √ | √ |
HCG | λ, σ: Distribution parameters | × | √ | √ | × |
M^{3}G | c: Cost parameter | × | √ | √ | √ |
OMTG | None | √ | √ | √ | √ |
Comparison of the proposed method with GABG and HCG
Dataset | Method | Incorrectly | Marginally | Perfectly |
---|---|---|---|---|
SMD | OMTG | 1.72% | 0.22% | 98.06% |
SMD | GABG | 5.37% | 0.51% | 94.12% |
SMD | HCG | 2.12% | 1.23% | 96.65% |
GEO | OMTG | 0.58% | 0.16% | 99.26% |
GEO | GABG | 4.49% | 0.32% | 95.19% |
GEO | HCG | 2.55% | 0.74% | 96.71% |
DILN | OMTG | 1.97% | 0.08% | 97.95% |
DILN | GABG | 4.35% | 0.34% | 95.31% |
DILN | HCG | 3.78% | 0.65% | 95.57% |
Biological Analysis
Results on dilution experiments
Dilution steps | Diln4-3.3942.01A | Diln4-3.3942.01B |
---|---|---|
1 | 22.02 | 21.75 |
2 | 20.63 | 20.78 |
3 | 19.75 | 19.94 |
4 | 18.12 | 18.05 |
5 | 17.98 | 18.25 |
6 | 16.98 | 17.03 |
7 | 16.18 | 16.17 |
8 | 15.07 | 15.46 |
Conclusions
A new method for separating sub-grids and spot centers in cDNA microarray images is proposed. The method performs four main steps involving the Radon transform for detecting rotations with respect to the x and y axes, the use of polynomial-time optimal multilevel thresholding to find the correct positions of the lines separating sub-grids and spots, a new index for detecting the correct number of sub-grids and spots and, finally, a refinement procedure to increase the accuracy of the detection.
The proposed method has been tested on real-life, high-resolution microarray images drawn from three sources, the SMD, GEO and DILN. The results show that (i) the rotations are effectively detected and corrected by affine transformations, (ii) the sub-grids are accurately detected in all cases, even in abnormal conditions such as extremely noisy areas present in the images, (iii) the spots in each sub-grid are accurately detected using the same method, (iv) using the refinement procedure increases the accuracy of the method, and (v) because the algorithm is free of parameters, the method can be applied effectively to different microarray images in various situations, including images with various spot sizes and configurations. The results have also been biologically validated on dilution experiments.
Methods
A cDNA microarray image typically contains a number of sub-grids, and each sub-grid contains a number of spots arranged in rows and columns. The aim is to perform a two-stage process in which the sub-grid locations are found in the first stage, and the spot locations within each sub-grid are found in the second stage. Consider an image (matrix) A = {a_{ij}}, i = 1, ..., n and j = 1, ..., m, where a_{ij} ∈ ℤ^{+}, and A is a sub-grid of a cDNA microarray image (usually, a_{ij} is in the range [0..65,535] in a TIFF image). The method is first applied to a microarray image that contains a template of rows and columns of sub-grids. The aim of the first stage, sub-gridding, is to obtain vectors h = [h_{1}, ..., h_{p-1}]^{t} and v = [v_{1}, ..., v_{q-1}]^{t}, where v_{i} ∈ [1, m], h_{j} ∈ [1, n], and p and q are the number of horizontal and vertical sub-grids respectively. These horizontal and vertical vectors are used to separate the sub-grids.
Once the sub-grids are obtained, the gridding process, namely finding the locations of the spots in a sub-grid, can be defined analogously. The rectangular area between two adjacent horizontal lines h_{j} and h_{j+1} and two adjacent vertical lines v_{i} and v_{i+1} delimits the area corresponding to a spot (spot region). The aim of gridding is to find the corresponding spot locations given by the adjacent horizontal and vertical lines. Post-processing or refinement allows us to find a spot region for each spot, which is enclosed by four lines.
To perform the gridding procedure, our method does not need to know the number of sub-grids or spots. Although in many cases the number of sub-grids or spots is known from the layout of the printer pins, these numbers may be inaccurate or unavailable because of misalignments, deformations, artifacts or noise introduced while producing the microarray images. On the other hand, the optimal multi-level thresholding method needs the number of thresholds (sub-grids or spots) to be specified. Thus, we use an iterative approach that finds the gridding for every possible number of thresholds, and then evaluates each one with the proposed α index to find the best number of thresholds.
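As an illustration of how adjacent separators delimit regions, the following minimal sketch turns the separator vectors h and v into rectangular boxes. The function name `spot_regions`, the 1-based inclusive coordinates and the use of the image borders as outer separators are assumptions made for this example, not details taken from the method.

```python
def spot_regions(h, v, n_rows, n_cols):
    """Return (top, bottom, left, right) boxes delimited by adjacent separators.

    h: sorted row indices of the horizontal separating lines (1-based)
    v: sorted column indices of the vertical separating lines (1-based)
    Assumption: the image borders act as the outermost separators.
    """
    rows = [1] + list(h) + [n_rows + 1]
    cols = [1] + list(v) + [n_cols + 1]
    boxes = []
    for top, bottom in zip(rows, rows[1:]):
        for left, right in zip(cols, cols[1:]):
            # Each box covers rows top..bottom-1 and columns left..right-1.
            boxes.append((top, bottom - 1, left, right - 1))
    return boxes


# One horizontal and two vertical separators give 2 x 3 = 6 regions.
regions = spot_regions([5], [4, 8], n_rows=10, n_cols=12)
print(len(regions))   # 6
print(regions[0])     # (1, 4, 1, 3)
```

With p - 1 horizontal and q - 1 vertical separators the sketch yields p × q regions, which matches the way h and v are defined above.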
Rotation Adjustment
The rotation is corrected by an affine transformation whose angles are the best angles of rotation, with respect to the x and y axes, found by the Radon transform.
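To give a feel for how a projection-based search can recover a rotation angle, here is a rough sketch in plain Python. It computes a discrete projection of the image for each candidate angle (the idea underlying the Radon transform) and keeps the angle whose projection is sharpest. The sharpness criterion (sum of squared bin totals), the candidate range and all function names are assumptions for illustration; the selection rule actually used with the Radon transform is not given in the text.

```python
import math


def projection_profile(image, angle_deg):
    """Sum pixel intensities into bins along lines tilted by angle_deg."""
    theta = math.radians(angle_deg)
    s, c = math.sin(theta), math.cos(theta)
    bins = {}
    for r, row in enumerate(image):
        for col, value in enumerate(row):
            if value:
                u = round(-col * s + r * c)  # signed offset of the line through the pixel
                bins[u] = bins.get(u, 0) + value
    return bins


def estimate_rotation(image, candidates):
    """Return the candidate angle with the most concentrated projection."""
    def sharpness(angle):
        return sum(x * x for x in projection_profile(image, angle).values())
    return max(candidates, key=sharpness)


def striped_image(size, angle_deg, spacing=10):
    """Synthetic test image: thin parallel stripes tilted by angle_deg."""
    theta = math.radians(angle_deg)
    s, c = math.sin(theta), math.cos(theta)
    return [[1 if round(-col * s + r * c) % spacing == 0 else 0
             for col in range(size)] for r in range(size)]


image = striped_image(60, 5)
angle = estimate_rotation(image, range(-10, 11))
print(angle)
```

Once an angle has been estimated, the image would be rotated back by an affine transformation, which this sketch does not implement.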
Optimal Multilevel Thresholding
Although various parametric and non-parametric thresholding methods and criteria have been proposed, the three most important streams are Otsu's method, which aims to maximize the separability of the classes measured by means of the sum of between-class variances [19], the one that uses information theoretic measures in order to maximize the separability of the classes [20], and the minimum error criterion [21]. In this work, we use the between-class variance criterion [19].
Full details of the algorithm, whose worst-case time complexity is O(kn^{2}), can be found in [22].
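The between-class variance criterion can be optimized exactly by dynamic programming. The sketch below is a generic textbook-style formulation with O(kn^{2}) running time, written for illustration only; it is not the algorithm of [22], and the names `multilevel_otsu` and `objective` are hypothetical. It relies on the fact that maximizing the between-class variance is equivalent to maximizing the sum of ω_{j}μ_{j}^{2} over the classes, where ω_{j} and μ_{j} are the probability mass and mean of class j.

```python
def _prefix_sums(hist):
    """Prefix sums of the bin probabilities (P) and of their first moments (S)."""
    total = float(sum(hist))
    P, S = [0.0], [0.0]
    for i, h in enumerate(hist):
        P.append(P[-1] + h / total)
        S.append(S[-1] + i * h / total)
    return P, S


def _class_score(P, S, a, b):
    """omega * mu^2 for the class made of bins a..b-1."""
    w = P[b] - P[a]
    if w <= 0.0:
        return 0.0
    m = S[b] - S[a]
    return m * m / w


def objective(hist, thresholds):
    """Criterion value for thresholds given as the last bin of each lower class."""
    P, S = _prefix_sums(hist)
    bounds = [0] + [t + 1 for t in thresholds] + [len(hist)]
    return sum(_class_score(P, S, a, b) for a, b in zip(bounds, bounds[1:]))


def multilevel_otsu(hist, k):
    """Return k thresholds (bin indices) that maximize the criterion."""
    n, classes = len(hist), k + 1
    P, S = _prefix_sums(hist)
    neg = float("-inf")
    best = [[neg] * (n + 1) for _ in range(classes + 1)]
    back = [[0] * (n + 1) for _ in range(classes + 1)]
    best[0][0] = 0.0
    for c in range(1, classes + 1):
        for b in range(c, n + 1):          # first c classes cover bins 0..b-1
            for a in range(c - 1, b):      # the last of them starts at bin a
                if best[c - 1][a] == neg:
                    continue
                val = best[c - 1][a] + _class_score(P, S, a, b)
                if val > best[c][b]:
                    best[c][b], back[c][b] = val, a
    cuts, b = [], n
    for c in range(classes, 0, -1):        # walk back through the stored cuts
        b = back[c][b]
        cuts.append(b)
    return [t - 1 for t in sorted(cuts)[1:]]  # drop the leading 0


hist = [3, 7, 2, 1, 6, 9, 2, 1, 5, 8, 4]
print(multilevel_otsu(hist, 2))
```

The three nested loops make the O(kn^{2}) cost explicit: k + 1 classes, n possible end points and up to n possible start points for the last class.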
Automatic Detection of the Number of Sub-grids and Spots
In the α index, t_{i} is the i-th threshold found by optimal multilevel thresholding and p(t_{i}) is the corresponding probability value in the histogram.
To find the best number of thresholds, K*, we perform an exhaustive search on all positive values of K from 1 to δ and find the value of K that maximizes the α index. In our experiment we set δ to (cf. [25]).
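The exhaustive search over K can be sketched as a simple loop that accepts the thresholding routine and the validity index as arguments. Since the definition of the α index is not reproduced here, the example uses `toy_index`, a stand-in that merely rewards thresholds falling on low histogram values, and `evenly_spaced`, a placeholder for optimal multilevel thresholding. Both are hypothetical helpers, as is the function name `best_number_of_thresholds`.

```python
def best_number_of_thresholds(hist, delta, find_thresholds, validity_index):
    """Try K = 1..delta and keep the K whose thresholds score highest."""
    best_k, best_score, best_thresholds = None, float("-inf"), None
    for k in range(1, delta + 1):
        thresholds = find_thresholds(hist, k)
        score = validity_index(hist, thresholds)
        if score > best_score:
            best_k, best_score, best_thresholds = k, score, thresholds
    return best_k, best_thresholds


def evenly_spaced(hist, k):
    """Placeholder thresholding: k evenly spaced bin indices (not optimal)."""
    n = len(hist)
    return [int((i + 1) * n / (k + 1)) for i in range(k)]


def toy_index(hist, thresholds):
    """Stand-in index (NOT the alpha index): prefers thresholds in valleys."""
    return -sum(hist[t] for t in thresholds) / len(thresholds)


hist = [5, 5, 0, 5, 5, 0, 5, 5]
print(best_number_of_thresholds(hist, 3, evenly_spaced, toy_index))  # (2, [2, 5])
```

Passing the two routines as parameters keeps the search independent of how thresholds are computed and scored, so the real thresholding algorithm and index could be substituted without changing the loop.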
The Refinement Procedure
Declarations
Acknowledgements
This work has been partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada. The authors would like to thank L. Ramdas for providing the cDNA microarray images for the dilution experiments, the DILN dataset. The authors also thank the anonymous reviewers for their feedback.
Authors’ Affiliations
References
- Antoniol G, Ceccarelli M: A Markov Random Field Approach to Microarray Image Gridding. Proc. of the 17th International Conference on Pattern Recognition 2004, 550–553.
- Brandle N, Bischof H, Lapp H: Robust DNA microarray image analysis. Machine Vision and Applications 2003, 15: 11–28. 10.1007/s00138-002-0114-x
- Ceccarelli B, Antoniol G: A Deformable Grid-matching Approach for Microarray Images. IEEE Transactions on Image Processing 2006, 15(10):3178–3188.
- Qi F, Luo Y, Hu D: Recognition of Perspectively Distorted Planar Grids. Pattern Recognition Letters 2006, 27(14):1725–1731. 10.1016/j.patrec.2006.04.014
- Rueda L, Vidyadharan V: A Hill-climbing Approach for Automatic Gridding of cDNA Microarray Images. IEEE Transactions on Computational Biology and Bioinformatics 2006, 3: 72–83. 10.1109/TCBB.2006.3
- Qin L, Rueda L, Ali A, Ngom A: Spot Detection and Image Segmentation in DNA Microarray Data. Applied Bioinformatics 2005, 4: 1–12. 10.2165/00822942-200504010-00001
- Katzer M, Kummert F, Sagerer G: A Markov Random Field Model of Microarray Gridding. Proceeding of the 2003 ACM Symposium on Applied Computing 2003, 72–77.
- Angulo J, Serra J: Automatic Analysis of DNA Microarray Images Using Mathematical Morphology. Bioinformatics 2003, 19(5):553–562. 10.1093/bioinformatics/btg057
- Jain A, Tokuyasu T, Snijders A, Segraves R, Albertson D, Pinkel D: Fully Automatic Quantification of Microarray Data. Genome Research 2002, 12(2):325–332. 10.1101/gr.210902
- Katzer M, Kummert F, Sagerer G: Automatische Auswertung von Mikroarraybildern. Proc. of Workshop Bildverarbeitung für die Medizin 2002.
- Steinfath M, Wruck W, Seidel H: Automated Image Analysis for Array Hybridization Experiments. Bioinformatics 2001, 17(7):634–641. 10.1093/bioinformatics/17.7.634
- Rueda L: Sub-grid Detection in DNA Microarray Images. Proceedings of the IEEE Pacific-RIM Symposium on Image and Video Technology 2007, 248–259.
- Wang Y, Ma M, Zhang K, Shih F: A Hierarchical Refinement Algorithm for Fully Automatic Gridding in Spotted DNA Microarray Image Processing. Information Sciences 2007, 177(4):1123–1135. 10.1016/j.ins.2006.07.004
- Wang Y, Shih F, Ma M: Precise Gridding of Microarray Images by Detecting and Correcting Rotations in Subarrays. Proceedings of the 8th Joint Conference on Information Sciences 2005, 1195–1198.
- Zacharia E, Maroulis D: Microarray Image Gridding Via an Evolutionary Algorithm. IEEE International Conference on Image Processing 2008, 1444–1447.
- Bariamis D, Maroulis D, Iakovidis D: M^{3}G: Maximum Margin Microarray Gridding. BMC Bioinformatics 2010, 11: 49. 10.1186/1471-2105-11-49
- Bariamis D, Maroulis D, Iakovidis D: Unsupervised SVM-based gridding for DNA microarray images. Computerized Medical Imaging and Graphics 2010, 34(6):418–425. 10.1016/j.compmedimag.2009.09.005
- Ramdas L, Coombes KR, Baggerly K, Abruzzo L, Highsmith WE, Krogmann T, Hamilton SR, Zhang W: Sources of nonlinearity in cDNA microarray expression measurements. Genome Biology 2001, 2(11).
- Otsu N: A Threshold Selection Method from Gray-level Histograms. IEEE Trans. on Systems, Man and Cybernetics 1979, SMC-9: 62–66.
- Kapur J, Sahoo P, Wong A: A New Method for Gray-level Picture Thresholding Using the Entropy of the Histogram. Computer Vision Graphics and Image Processing 1985, 29: 273–285. 10.1016/0734-189X(85)90125-2
- Kittler J, Illingworth J: Minimum Error Thresholding. Pattern Recognition 1986, 19: 41–47. 10.1016/0031-3203(86)90030-0
- Rueda L: An Efficient Algorithm for Optimal Multilevel Thresholding of Irregularly Sampled Histograms. Proceedings of the 7th International Workshop on Statistical Pattern Recognition 2008, 612–621.
- Maulik U, Bandyopadhyay S: Performance Evaluation of Some Clustering Algorithms and Validity Indices. IEEE Trans. on Pattern Analysis and Machine Intelligence 2002, 24(12):1650–1655. 10.1109/TPAMI.2002.1114856
- Theodoridis S, Koutroumbas K: Pattern Recognition. 4th edition. Elsevier Academic Press; 2008.
- Duda R, Hart P, Stork D: Pattern Classification. 2nd edition. New York, NY: John Wiley and Sons, Inc; 2000.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.