The design of Affymetrix SNP 6.0 array
The latest Affymetrix SNP Array 6.0 chip is a 2680 × 2572 oligonucleotide microarray, capable of simultaneously interrogating more than 900 K SNPs as well as over 900 K non-polymorphic (NP) loci that are dedicated for the detection of chromosomal copy-number variation across the human genome [11]. SNPs and NP sites are represented by clusters of identical 25-nucleotides oligomers immobilized at specific location on the microarray. A cluster of oligomers are commonly referred to as a probe. Each probe is manufactured to perfectly match nucleic acid sequences containing the corresponding SNP or NP locus. Because only two alleles are observed in nature for most SNPs, several pairs of SNP probes are used to examine a specific SNP. Probes in a probe pair differ by just one nucleotide in the position that corresponds to the SNP locus. Probes targeting the same SNPs or NP sites constitute a probeset.
Our approach leverages two key features of the design of the array. First, in contrast to some previous generations of Affymetrix SNP chips in which probes in a probeset may have different orientation (forward or reverse DNA strand) and offset (SNP resides at the centre or shifted by -4 to +4 base pairs), probe pairs in the non-control SNP probesets of the Affymetrix 6.0 arrays are strict technical replicates. In other words, all probes of a particular SNP allele have the same nucleotide sequences and should therefore exhibit identical hybridization characteristics. These probes are deposited on the chip in either triplicate or quadruplicate. Secondly, pairs in a probeset are distantly distributed on the chip, though the members of each probe pair are located in adjacent positions. These arrangements enable one to statistically identify location-dependent biases, and separate them from real biological information and noise. Note that SNP probe pairs used for quality control purposes and probes querying NP sites are not replicated within the Affymetrix 6.0 array. Strong heterogeneity in hybridization responses makes measurements of nonreplicated probes from the same probeset not directly comparable and less useful for within-array spatial biases detection. Nonreplicated probes are, therefore, not utilized in our spatial normalization procedure.
Datasets
We shall make use of three publicly available sets of Affymetrix 6.0 cell intensity files to train, test and validate our methods. The first set consists of 5 replicated chips for one of the HapMap Phase I + II sample [12–14]. These files will be used for demonstration and training purposes in the remainder of this manuscript. The second set includes 270 chips for all HapMap Phase I + II individuals. Cel files for both Set 1 and Set 2 were obtained from Affymetrix [15]. The last set, downloadable from the SNP Affycomp website [16, 17], refers to a single 'first-pass' experiment with 96 HapMap Phase I + II samples. Sets 1 and 2 are considered to be of high quality whereas Set 3 represents 'typical' quality. Sets 2 and 3 will be utilized in the Results Section for testing and validation.
Spatial normalization procedure
Affymetrix SNP 6.0 arrays uses fluorescent labeling to quantify the amount of genomic DNAs hybridized to each probe. The levels of fluorescent signals for each chip are stored in a separate cell intensity file (.CEL) which is used as input to the proposed algorithm. Our approach to spatial normalization relies on two assumptions: 1.) that any discrepancies in intensity values between replicates are necessarily attributable to stochastic fluctuations in conjunction with spatial biases, and 2.) that the same spatial artifact affects many probes in close vicinity in a similar fashion. For probesets with replication, we decompose the measured intensities I
xy
, or their transformation, at cell (x,y) for SNP j with allele k (k = 1, 2) into
where A
jk
is the intensity resulting from both specific and nonspecific hybridization. Fluorescence emitted by the DNA molecules represented by the probe is specific, while signals from other DNA molecules and background are considered nonspecific. The second component S
xy
denotes the spatial bias effects. Additional sources of experimental variation are captured by the spatially uncorrelated error terms ε
xy
assumed to obey a zero mean symmetrical distribution.
Model (1) is meant to be array-specific since both allelic intensities and spatial biases are likely to vary from sample to sample (see also Section 3.1). Our single-chip estimation and correction scheme is briefly described below. The motivation and justification of each step will be detailed in the coming sub-sections.
-
(1)
Apply a generalized logarithmic (glog) transformation on I
xy
.
-
(2)
For each SNP allele, initialize Â
jk
with the median intensity of the replicates.
-
(3)
Estimate Ŝ
xy
= E(S
xy
) by a two-dimensional wavelet surface fitted to I
xy
- Â
jk
.
-
(4)
Update Â
jk
with the corresponding means of the bias-adjusted signals, i.e. I
xy
- Ŝ
xy
.
-
(5)
Iterate steps 2 and 3 until convergence.
-
(6)
Remove outliers, i.e. probes with extreme local biases and revise Â
jk
.
Glog transformation
As evidenced in Figure 1a, the variance of ε
xy
in Model (1) is not constant across the value of Â
jk
if the very first step above is skipped. This implies that probes with more noisy intensity levels would have a relatively larger impact on the estimation of the spatial effects. The amount of systematic information that can be borrowed across neighboring probes would also be compromised. Although the log transformation is often considered as a method for achieving constant variance, it tends to produce high variance at low regions of intensity. Glog is a family of transformations [18–20] that has been proposed for addressing the problem of heteroscedasticity that avoids this effect by bounding the logarithmic argument away from zero. Following Durbin and Rocke [21], it is defined as
We adopt the algorithm suggested by the authors for estimating c. In order to reduce the computational burden, we perform the estimation using only 50,000 randomly selected SNP alleles. Visual inspection of Figure 1b indicates that a satisfactorily stabilized variability is achieved across the intensity range.
Median or mean adjustment
Dust particles, bubbles and scratches are typical contaminants that disturb intensity signals greatly but affect only limited areas of a microarray. Hence it is reasonable to assume that most of the replicates that are constructed to be distant from each other on the chip are free from these strong perturbations. As a first approximation, we choose to calculate Â
jk
in Step 2 using the median because of its robustness against outliers. However, once the fitting of S
xy
absorbs the spatial disturbances, we conjecture that inferring A
jk
with the mean intensity value will be much more efficient since the probes are duplicated only 3 to 4 times. This motivates the use of the mean in Step 4.
Wavelet de-noising
Step 3 intends to reconstruct the spatial biases robustly without assuming a particular structure for the underlying signal. The justifications for choosing wavelet-based methods [22] rely mainly on their theoretical properties and computational efficiency. Regional defects can show complex and discontinuous behaviors. Wavelets, a family of nonparametric techniques, are particularly suited for capturing abrupt local changes as well as smooth global trends from noisy data. We utilize maximal overlap discrete wavelet transform (MODWT) in this study. MODWT is translation invariant (x-y origin is arbitrary), can compute all coefficients in just O(N log2N) multiplications, and does not require the number of probes along the x- or y-direction to be a power of 2. However, as with other discrete wavelet transformations, it needs a complete data array. Affymetrix 6.0 chips have 'holes' for the locations that do not contain replicated SNP probes. For example, probes allocated for alignment and control purposes or the NP probes. As a solution, we fill in these 'holes' with the average of neighboring points (see Additional File 1). A large portion of the NP probes are allocated into a horizontal and vertical stripes across the chip (see Figure 2). Rather than interpolate across these broad bands, the wavelet basis was fitted separately to each of the four corners.
The reconstruction of regional biases is performed using the 'denoise.mowdt.2d' function implemented in the R package 'waveslim', version 1.6.1. Several parameters need to be specified. We use the simple Haar wavelet and enforce the universal soft thresholding rule as a way of reducing the level of noise while preserving the significant features of the true signal. In light of the hierarchical nature of wavelet transforms, we select a decomposition level (also known as resolution level or scale) that optimized the overall reproducibility of Â
jk
across the replicated samples in our training data, i.e. Set 1. To reflect the symmetrical role played by these training arrays, we measure reproducibility in a pairwise manner through
in which λ
1
denotes the major eigenvalue of each pairwise correlation matrix (see Additional File 2). Notice that R2is simply the coefficient of determination of a simple linear Deming's model [23]: the higher the R2, the better the reproducibility. We set the level of decomposition to 3 as additional resolutions did not lead to a better average R2value - based on 10 possible pairings from Set 1 (data not shown). Other wavelet functions were also used during the testing phase of the procedure but none gave better results.
An iterative procedure
Figure 2a shows the pattern (or lack thereof) of the glog transformed intensities of a particular chip. It demonstrates the point that strong differences in probe hybridization performances coupled with variation introduced by true biological signals make the raw intensities useless for detecting all but the most prominent defects. We need to estimate A
jk
first in order to isolate the effects of regional biases. Figures 2b and 2c display the Â
jk
adjusted intensities and the denoised results of applying wavelets, Ŝ
xy
, respectively. By updating in an iterative manner, each new estimations of A
jk
and S
xy
should result in a more accurate assessment of the other effects. Utilizing the replicated chips from Set 1, we observed that almost all of the improvement in average R2value is achieved after only one iteration (see Additional File 3). This is perhaps not surprising since A
jk
and S
xy
are independent by design. Notice in Model (1) that adding a constant to Â
jk
would cause Ŝ
xy
to reduce by a similar amount, i.e. Â
jk
+ Ŝ
xy
= {Â
jk
+ c} + {Ŝ
xy
- c}. Thus the parameters A
jk
and S
xy
are not 'identifiable'. One might consider imposing a constraint on Ŝ
xy
to make Model (1) identifiable. We bypass this in our applications because a shift in intensity values would have no effect on the final genotyping algorithm - specifically the quantile scaling procedure.
Outliers
An ideal experiment with no spatial bias would generate a zero theoretical value for all Ŝ
xy
. Non-balanced thermal conditions, uneven distribution of fluids and curvatures of chip surfaces would give rise to gradual perturbation of intensity measurements. Conversely, extreme localized deviations of Ŝ
xy
from 0 are indicative of the presence of physical artifacts such as debris, blobs and scratches. Intensities of the areas affected by the latter may retain no or very little biological information thereby making a recovery of the original signals impossible. This suggests the potential need for detecting and eliminating aberrant probe signals. This can be achieved, for example, by considering Ŝ
xy
lying n median absolute deviation (MAD) away from the median (of all Ŝ
xy
) as outliers. One could also refine the grouping approach by reinstating spatially isolated aberrations and discarding previously approved probes that were surrounded by ≥50% immediate outliers. The value of n was tested on Set 1 over the range of 2 to 10 with the increment of 1. On the basis of the reproducibility measure R2, we estimated that n = 3 is sufficient to filter out 'unreliable' probes. A more comprehensive use of the neighborhood structure might improve the grouping of outliers even further. The reader is referred to Suárez-Farinãs et. al.[5] for an example of this approach. Â
jk
can then be recalculated using all but these unreliable probes and serve as input for genotyping analyses. Notice that the exclusion of probes with outlying Ŝ
xy
from a certain region does not automatically lead to the elimination of the corresponding SNPs, on account of the scattered allocation of probe replicates on the array.