The proposed method consists of the steps illustrated in the flowchart of Fig. 1.
Rotation estimation
The 16-bit greyscale microarray image (Fig. 2a) is initially analyzed by the Radon transform (Eq. 1), which is applied to estimate the image rotation angle. The Radon transform has been the method of choice for microarray rotation estimation in approaches such as [10].
In the above equation, the grey level intensity of the image in pixel (x, y) is denoted by I(x, y). In the transformed image illustrated in Fig. 2b, the intensity of each pixel with coordinates (a, r) is equal to the integral of the image brightness over a straight line with an angle a to the x axis and a distance r to the origin. The rotation angle θ of the microarray image is estimated by locating the column with the highest mean brightness in the transformed image, which is denoted by the arrow. The image is subsequently counter-rotated by angle θ as illustrated by Fig. 2c.
Preprocessing
The next step involves the pre-processing of the microarray image by linearly normalizing it so as to fit the intensity histogram to the full dynamic range of the 16-bit image. The image is then quantized to 256 grey levels in order to reduce the computational complexity of the next steps. Subsequently, the edges of the spots are detected by the application of the Sobel operator [20] on the normalized image. A threshold is determined automatically using the Otsu method [21] in order to binarize the image and isolate the sharpest edges that correspond to spots, as illustrated in Fig. 2d.
Since the image has been normalized to the full dynamic range and the result of the preprocessing step is a binarized image, the quantization error is expected to affect a very small number of pixels. Indeed, in the data set used for the evaluation of the proposed method (see Results), the percentage of the affected pixels in each binarized image was less than 0.01%, having no effect on the subsequent grid placement.
Spot detection
The binarized image is then analyzed so as to locate all the groups of consecutive white pixels that reside on the same spot edge. Each of these groups is characterized by the location of the center pixel and the size of the group. Fig. 2e illustrates the detected groups by representing them as rectangles that circumscribe the pixels of each group. Ideally, each rectangle should contain the edge pixels of a single microarray spot, however depending on the threshold used, it might also include artefacts or multiple merged spots, due to the noise present in the image and the inter-spot proximity. A spot selection step is introduced in the next subsection, in order to refine the spots based on their shape and size characteristics resulting in the rectangles shown in Fig. 2f.
Spot selection
The spot selection process aims to the removal of false spots introduced by noise and artefacts. In the spot selection step, the aspect ratios of the detected pixel groups are first evaluated. Considering that the ideal spot shape is circular, the rectangles (Fig. 2e) should not deviate much from being square, so that each rectangle contains only one microarray spot. Therefore, the aspect ratio of each spot must be close to unity. Then, a lower bound s
min
and an upper bound s
max
of the spot sizes are calculated so as to maximize the similarity of the spot size distribution to the normal distribution. The spots that have sizes out of the calculated bounds are considered false and discarded.
In order to quantify the similarity of the distribution of the spot sizes to the normal distribution, it has to be taken into account that the spot sizes can only be positive, in contrast to the normal distribution N (x; μ, σ) (Eq. 2) that also spans into the negatives. The comparison should therefore be made to a normal distribution for which the negative values are explicitly set to zero. Such a distribution N
m
(x; μ, σ) (Eq. 3) can be derived by nullifying the probability of N (x; μ, σ) for x < 0 and scaling it accordingly so that the total probability remains equal to unity. The corresponding cumulative distributions C(x; μ, σ) and C
m
(x; μ, σ) are expressed by Eqs. 4 and 5 respectively.
A measure of dissimilarity E between the discrete probability distribution of the spot sizes and the continuous probability distribution N
m
(x; μ, σ) can be established based on their respective cumulative distribution functions. The cumulative histogram of the spot sizes C
h
(x) is defined as a function of the histogram h(x) as shown in Eq. 6. The dissimilarity E is defined as the total area between C
h
(x) and C
m
(x; μ, σ) as shown in Eq. 7.
The optimal bounds s
min
and s
max
are calculated so as to minimize the dissimilarity E defined above. By selecting the spots with sizes within the range defined by these bounds, the resulting cumulative spot size distribution closely resembles the normal distribution, as illustrated in the example of Fig. 3. In this case, any spot that is smaller than s
min
= 6.4 pixels or larger than s
max
= 17.1 pixels is considered false and discarded. It is evident that the cumulative histogram of the selected spots almost coincides with the cumulative normal distribution (Fig. 3d), whereas the original cumulative distribution (Fig. 3c) differs substantially from the respective cumulative normal distribution.
Distance estimation between consecutive rows and columns
The optimal distance between spot rows is calculated by segmenting the input microarray image into horizontal stripes with a height of d
r
pixels, as shown in Fig. 2g, which are then averaged. If d
r
is selected so that it is equal to the distance between the rows, the spots of all rows will be in the same relative positions in the horizontal stripes, therefore they will be highly overlapping in the resulting average stripe. Thus, the average stripe will contain well defined spot areas, as illustrated in Fig. 4a. If a suboptimal value of d
r
is selected, the spots will reside in different relative positions in the horizontal stripes and will thus blend with the background in the average stripe (Fig. 4b). The optimal value of d
r
is selected by maximizing the standard deviation of the pixel intensities of the average stripe. The standard deviation can be used as an effective measure of spot overlap, since high values of the standard deviation indicate distinct dark and bright areas, whereas low values of the standard deviation indicate abundant grey areas. Thus, the standard deviation should be maximized with respect to d
r
in order to obtain the optimal value of d
r
. The optimal column width d
c
is likewise estimated using vertical stripes.
A wide range of d
r
values is tested in order to find the optimal value, ensuring successful estimation without any user intervention. The standard deviation
of the average stripes is calculated forall values of d
r
within that range, using a small real valued step. From all the tested values of d
r
, those that result in local maxima of the standard deviation are selected. These local maxima are most often located on multiples of the optimal d
r
, since such an estimation also results in highly overlapping spots. For each of the selected d
r
values, the mean of the resulting standard deviation
in its neighbourhood is calculated. The neighbourhood for the calculation of each
is equal to the range between its adjacent local maxima. The value of d
r
that results in the highest value of the
ratio is selected as optimal.
Another method for estimating the distances between rows and columns of spots has been proposed by Ceccarelli et al. [22]. It employs the Orientation Matching (OM) and Radon transforms in order to extract the spot positions and grid rotation respectively. Subsequently, the spot locations are projected on the axes of the grid and the distance between rows and columns is estimated. This method requires prior knowledge about the radii of the spots and uses a filter for noise reduction. In contrast to this approach, the proposed method performs distance estimation without any parameters or arbitrarily selected filters, based on the maximization of the standard deviation of the average stripe. This maximization over a wide range of d
r
values allows successful estimation without any user-defined parameters, whereas the use of the average stripe acts as a low pass filter, allowing high tolerance to noise. An evaluation of the distance estimation for noisy images has been included in the Results section.
Maximum margin gridding
In order to determine the optimal grid lines, each spot that has been selected by the spot selection step (Fig. 2e) is represented by a vector
, i = 1... N, where N is the total number of selected spots in the image and the components of each vector
are the coordinates of the spot centre. These vectors are assigned into distinct rows and columns, based on the distances d
r
and d
c
, which were previously estimated. Each pair of consecutive rows or columns of spots can now be separated by a single separating line. The optimal separating line is positioned so as to maximize the margin between the rows or columns of the spots. For a pair of rows numbered k and k + 1, the vectors that belong to row k or to any row above it are assigned a class label c
i
= +1 and the vectors that belong to row k + 1 or to any row below it are assigned a class label c
i
= -1. These vectors
, along with their respective class labels c
i
are provided as a training set to a linear Support Vector Machine (SVM) classifier [23], which produces the maximum margin grid line.
In this particular application, the SVM classifier is provided with the aforementioned training set
. By solving a quadratic programming optimization problem, it produces the normal vector
and the parameter b of the separating line
, which maximizes the margin between vectors
of different classes, i.e. the margin between spots of distinct rows or columns. The width of the margin is equal to 2/||w||, therefore the widest margin is found by minimizing ||w|| under the constraints c
i
(
- b) ≥ 1, i.e. requiring that all the spots lie on the correct side of the resulting grid line.
The support vector machine described above is called a hard-margin SVM and does not take into account any outliers. One of its properties is that the separating line is solely determined by the vectors that that lie on the edges of the margin, called support vectors. In a linear SVM, a very small number of support vectors determine the separating line and the margin. In the case of outliers present inside the margin, the positioning of the separating line will be exclusively determined by the outliers and will thus be suboptimal for gridding. This problem can be solved using the soft-margin SVM, where a slack variable ξ
i
is introduced for each vector
. The constraints are then formulated as
and the separating hyperplane can be found by minimizing
where C is a cost parameter that determines the effect of outliers on the positioning of the resulting grid line. Large values of C result in a grid line that is mostly determined by any outliers, while on the other hand, smaller values of C result in a grid line that follows the general trend of the spot locations given to the classifier, virtually ignoring any outliers. The hard-margin classifier is equivalent to a soft-margin classifier with an infinitely large C[24].
The soft-margin SVM is employed, in order to diminish the effects of misdetected spots that result from artefacts or noise. A small fraction of these outliers might have a shape and size similar to valid spots and could therefore pass through the selection step without being discarded. The soft-margin SVM ensures that such outliers will not have an impact on the produced grid lines. For ideal microarray images, where all spots could be successfully detected and no outliers are present, a hard-margin SVM could be used as well, but a gridding application for real microarray images requires robustness against outliers. Furthermore, in ideal noiseless images, the training set for the SVM classifier would consist only of the necessary spots, i.e. those residing on rows k and k + 1. However, in real microarray images, there are cases where several consecutive spots might be very weakly expressed and therefore not detected. In order to cope with this problem, spots from rows above k and below k + 1 are included in the training set, providing redundant data to the classifier to ensure successful gridding. Using an algorithm based on the Sequential Minimal Optimization (SMO) to solve the SVM optimization problem [25], the additional data introduces only a small computational overhead, since such algorithms evaluate vectors that are far from the separating line in only the first few iterations of their outer loops [26, 27]. The SVM has been selected over similar methods for the determination of the grid lines, such as a least squares fit, because the soft margin SVM is adjustable with regards to its tolerance to outliers through the cost parameter C[28].
In the case that row k contains less than two detected spots, the two grid lines that separate this row from rows k - 1 and k + 1 cannot be determined by the use of the SVM classifier. This is a rather rare case considering that the image is normalized during the preprocessing step. To cope with this limitation, the endpoints of the two grid lines are positioned equidistantly between the endpoints of the first neighboring grid lines above and below them. In the case where the top or bottom rows of spots contain less than two spots, the endpoints of the grid lines that cannot be determined are positioned d
r
pixels further from the nearest grid lines.
Fig. 5 illustrates the case of gridding in the presence of an obvious outlier, denoted by the arrow. It is evident that for a small value for the cost parameter C (Fig. 5a) the margin is determined by the other spots in the row and the outlier is ignored, whereas for larger values of C (Fig. 5b) the outlier affects the positioning of the separating line by moving it significantly closer to the vectors of the top row, reducing the margin and rendering it suboptimal for gridding.