The user should refer to the Affymetrix web site http://www.affymetrix.com and the documentations that come with the MAS4 software for technical details regarding Affymetrix array and MAS4 algorithm.
Affymetrix Oligonucleotide Chip
At the time of this study, Affymetrix synthesizes 25-mer oligonucleotides on a 640 × 640 array with 20 μm feature size using photolithographic fabrication techniques (some data used in this study were collected from some previous generation arrays of 540 × 540 cells and 24 μm feature size). Each cell (also called a probe) in the array contains the same oligonucleotide sequences. As shown in Figure 9, typically a set of 16–20 probes are designed for each targeting cluster (called a probe set, or sometimes a "gene".). An expression value will be derived per probe set as the result of analysis. Probes taken as fragments from a target sequence are called perfect matches (PM). Multiple matches per set serve as independent signal detectors and provide a possibility to capture statistical uncertainties. For each match, there is also a corresponding mismatch (MM) probe, whose sequence differs from its match by a single base in the middle. Mismatches are meant to detect cross hybridization components of their corresponding matches, and are used by MAS4 for probe-specific background subtraction. There are also several Affymetrix control sets on each chip used for quality references, which were used in the spiking experiments to validate and refine hybridization protocols.
Intensity Distribution
In an ideal case, all the probes in a set should give similar signals and serve as replicate measurements. One common rule in probe design is to select probes with similar melting temperatures to minimize the variances among probes. The chemistry involved in a hybridization experiment is often too complicated to be predicted computationally at the current stage, therefore in reality probe intensity distribution for a set is usually fairly wide and has a long tail, which causes all kinds of difficulty for data analysis.
One usual attraction in expression analysis is to assume a normal distribution for probe intensities. This hypothesis can be examined by the following calculation. For each probe set, we applied a linear transformation to probe intensities, so that the mean and the standard deviation of the intensities in the set were normalized to zero and one, respectively. Then all the resulting match intensities were used to generate an overall distribution. Our calculation shows the normal distribution assumption is not an appropriate one. When similar analysis was applied to mismatch intensities and mismatch-subtracted match intensities, the conclusion stays more or less the same.
Despite the fact that the intensity distribution of a set is non-Gaussian, it is still crucial to find out the distribution function for match or mismatch intensities to generate meaningful statistics tests, if such distributions actually exist. The authors tried several analytical functions on match and mismatch probe intensities. Unfortunately, the distributions seemed to be quite "bad", the tail portions even fade out slower than the extreme value distribution. It was found that distributions of mismatch signals also significantly depended on the sample and experimental conditions. One may reasonably suspect that cross hybridization, intensity saturation, overall sample concentration, chip production irregularities, and miscellaneous noise may all cause a change in the signal distribution. The study seems to suggest the distribution of cross hybridization should be taken from each experiment as a result of measurement instead of using an a priori determined analytical expression. This guideline is used in MOID algorithm.
MOID algorithm: hypothesis and principles
As mentioned in the intensity distribution analysis, MOID assumes cross hybridization behavior of probes is mainly non-probe-specific, and therefore all the match cells can share the same cross hybridization background. MOID primarily aims at saving 50% of the chip space currently dedicated to mismatch cells in the Affymetrix design under this study. It was also discussed before that cross hybridization distributions should be taken from experimental data instead of from some analytical function. Based upon the general belief that there is always a significant amount of genes unexpressed in any mRNA experiment, we will use the match cell signals from the 5% darkest probe sets in the current public Affymetrix arrays to prove the point that it is possible for MOID not to use any mismatch signal. For the next generation gene chip designed based on MOID algorithm, some designated probes may be used as generic mismatches to collect non-probe-specific cross hybridization data. In fact, on GNF-HS1, 16,460 probes from 476 viral genes are used for such a purpose.
As discussed, the distributions of quantities in expression analysis are usually asymmetric and contain a long tail, regardless whether it is match intensity, mismatch intensity, or average intensity difference. For such abnormal distributions, concepts like mathematical average and standard deviation are not the most appropriate statistical terms to use. Pre-filtering the data, like what MAS4 does, certainly helps reshape the distribution, but by no means this can bring the distribution back to normal without throwing out a significant portion of measurements. The MOID algorithm is specifically designed to avoid these problems by using percentile instead of mean and using confidence intervals instead of standard deviation in all possible cases.
MOID algorithm for significance of gene presence
A non-probe-specific integral background distribution (representing noise, cross hybridization, and possible other factors), B(I), is derived from the intensities of the 5% darkest probe sets (or from generic mismatches in the future). I stands for intensity; B satisfies boundary conditions: B(0) = 0 and B(∞) = 1. For each probe set k, the match signals are sorted and the integral distribution, S
k
(I), is generated as well. Figure 7 is a schematic diagram.
If the gene represented by a probe set is absent, the intensity data from matches are due to pure background contribution, therefore S
k
is likely to be close to B. We choose the Kolmogorov-Smirnov test to determine likelihood that the observed signal distribution, S
k
, could be explained by B. If we define dk max as the maximum vertical distance between B and S
k
, according to K-S statistics, the probability of observing discrepancies greater than dk max is determined by a P value [11],
where
n
k
is the effective number of data points in the K-S statistics, which is the number of PM for gene k in our case. P carries the meaning of probability, therefore is a number between 0 and 1. Practically, Log10P, a negative value, is used as our final representation. In this way, MOID uses a continuous LogP criterion to replace MAS4 absolute calls. Those signal sets that can be easily explained by noise are assigned a LogP value closer to zero.
MOID algorithm for expression level calculation
MOID uses the horizontal distance between S
k
and B to represent gene expression level (Figure 7). Since the darkest probes may be more likely caused by their poor binding properties to the target gene, and the brightest probes may be more likely caused by serious cross hybridization issues, different parts of the integral distribution tend to have different qualities. Therefore, the MOID algorithm uses the horizontal distance measured at a certain percentile, pct, instead of the whole curve. Based on the analysis of the spiking experiments discussed later, it is empirically determined that 70% is the optimal percentile for signal retrieval. We also tried to use the average of horizontal distances from several pct, but without significant improvement. Therefore the expression level E
k
, related to gene concentration for gene k, is defined as
E
k
= max(I|
Sk
=
pct
- I|B = pct, 0).
Instead of using the standard deviation, we take confidence intervals directly from the distribution curve S
k
, with the background subtracted. E.g., 80% confidence intervals can be represented by a lower bound (E
kl
) at 10 percentile and an upper bound (E
ku
) at 90 percentile of the distribution, i.e.,
E
kl
0 (0.1) = max (I|Sk = 0.1 - I|B = pct, 0)
and
E
ku
0 (0.9) = max (I|Sk = 0.9 - I|B = pct, 0).
As one might expect, increment of the probe set size helps narrowing down the confidence intervals in a manner determined by the statistics of probe distribution. However, this piece of information is unknown and we took an ad hoc approach by modifying the boundaries formulae as the following:
and
MOID algorithm for normalization
The most common way of understanding gene functions by expression profiling is to study the change of their expression levels in various disease states and cellular environments. The observed expression level is a product of complicated protocols, and is subjected to various factors from sample preparation to final image scanning. Therefore, it is a common practice to normalize expression data drawn from multiple profiles before making any reliable interpretation.
The normalization procedure in MAS4 aims at normalizing the average intensities of probe sets (excluding the top and bottom 2%). To avoid possible signal contamination, MOID takes similar approach by normalizing the probe sets between 10 and 90 percentiles. However, it is noticed that an ideal normalization factor should be derived from only those genes that do not change their expression levels between the two experiments. MAS software provides the possibility to use a list of housekeeping genes for such purposes; however, it certainly requires careful downstream research to validate any such a list. Some common normalization criteria were summarized in a recent study by Zien et al. [12]. MOID uses a heuristic bootstrap method to identify an approximate list of unchanged genes between two data sets:
Step 1: include all genes as the normalization gene list.
Step 2: sort the E
k
values (already background subtracted) for all the genes in the list; the interesting portion of expression values (between 10% and 90% in our case) is retrieved and average intensity is calculated for each experiment, respectively. The initial normalization factor nf is calculated by the ratio of the two average intensities.
Step 3: normalize the data sets using nf obtained in the previous step; refine the gene list by further excluding those genes, whose intensity values changed by more than a certain fold, >fmax or < 1/fmax, between the two data sets.
Step 4: repeat Step 2 and update nf to nf', until one of the following conditions is met:
-
1)
the maximum number of iterations, Itrmax, is reached;
-
2)
|nf/nf' - 1| ≤
max, where
max is a small predefined threshold;
-
3)
size of the normalization gene list drops below a predefined threshold, Szmin.
In practice, a single iteration is usually sufficient for most comparisons. User can adjust threshold parameters towards their preferred stringent levels. In this study we use fmax = 2.0,
max = 0.05, Itrmax = 5, Szmin = 1000.
MOID algorithm for comparison analysis
Different probes within the same probe set generally respond very differently to mRNA sample fragments. We examined various probe properties such as melting temperature, nucleotide base composition, possible sequence motifs, and potential secondary structures in order to understand the cause of such diverse response properties. That study did not find any conclusive factor that explains observed probe signal distribution satisfactory. Although it seems rather difficult to pre-compute and predict probe responses, it is possible to derive some features afterwards for a probe based on a significant amount of expression data where the target gene of interest is present [8]. Instead of using a modeling approach to explain various signal contributions, MOID uses a new approach to avoid the affect of diversities of probe response factors in calculating expression fold changes.
Let us assume any two probes in a probe set k of size n
k
always respond to their target gene concentration with factor r1 and r2, respectively. That is, if the gene is present at concentration c1 and c2 in two hybridizations, the two probes in average give signal intensities r1c1 and r2c1 in the first experiment, r1c2 and r2c2 in the second experiment. As in MAS4, fold change is calculated by the ratio between r1c2 + r2c2 and r1c1 + r2c1, where the result is essentially dominated by the probes with the largest response factors, and the statistics of the ratio becomes difficult to estimate. However, we observe that by taking the ratio of the two signals for each probe individually, one essentially is looking at n
k
independent measurements of c2/c1 for that probe set. This opens a possibility for obtaining a distribution of fold change values.
If one assumes most probes have a rather stable intrinsic response factor, r, a fold change number can be calculated from each probe i in the set independently, which hopefully removes the affect of the unknown response factors r
i
. However, it seems the response factor of each probe has a non-negligible intrinsic spread as well; this may be further complicated by alternative splicing and tissue-specific cross hybridization issues. In addition, the n
k
ratio numbers calculated for a probe set are not normally distributed (it is well known that even the ratio of two normally distributed values is no longer normally distributed). The current background subtraction algorithm may not take fully into account response factor-unrelated signals, therefore could further increase the non-normal component. To overcome this difficulty, MOID uses a percentile, pct
f
, of the integral distribution formed by the n
k
fold change numbers, F
k
(f), which satisfied the boundary conditions: F(0) = 0 and F(∞) = 1. Empirically, 70% is used for pct
f
as determined by spiking experiments discussed later. The confidence intervals are also directly taken from relevant percentiles in the distribution corresponding to the lower bound and upper bound, respectively. The effect of the number n
k
is taken into account in the same heuristic manner as we did previously for absolute analysis. The final formulae are:
MAS4 Algorithm for Absolute Analysis
After a sample is hybridized to probes on a chip, the chip is scanned and fluorescent signals are collected and stored in an Affymetrix DAT file. For cell i, after filtering out boundary pixels, N
i
numbers of pixels are used to calculate an average intensity value I
i
and standard deviation σ
i
. Hereafter we index a probe set by k, which contains n
k
probe pairs. The calculated intensity values for thej th pair in the k th probe set are denoted as PM
kj
for PM and MM
kj
for MM. Background intensity, bg, is derived from the average intensity of the 2% darkest cells. Noise level, Q, is determined as
where average is done over all background cells.
MAS4 assumes that the cross hybridization component of a match is captured by its mismatch companion, therefore the difference, PM
k
- MM
k
, is used to derive expression levels. MAS4 also uses a "super-Olympic scoring" procedure to iteratively filter out outliers, defined as those probe pairs with difference values more than three times the standard deviation away from the mean. After filtering, the average difference value for a probe set, D
k
, is used to represent the expression level of that particular gene.
The significance of gene presence is determined by an empirical absolute call decision matrix coupled with proprietary MAS4 algorithms, parts of which are described in the Microarray Suite documentation. As the result, a probe set is marked as either "Present", "Absent", or "Marginal".
MAS4 calculates average difference by
D
k
= <PM
kj
- MM
kj
>
j
,
which is essentially
D
k
= <PM
kj
>
j
- <MM
kj
>
j
,
where the average is taken over all the probe pairs surviving the super-Olympic filtering process mentioned above. Based on this observation, the one-one correspondences between PM
kj
and MM
kj
, typically with an average correlation coefficient around 0.8, are essentially lost during such averaging. Since MAS4 is successful in various applications despite the fact it averages out probe-specific mismatch signals, we hypothesize that contributions of the mismatch probes are mainly probe-nonspecific, if the pairwise correspondence between PM
kj
and MM
kj
are not enforced. This observation led to the idea of only using match probes in the MOID algorithm discussed above.
MAS4 Algorithm for Normalization
MAS4 introduces a normalization factor (a scaling factor in MAS4 serves the same purpose). The goal of normalization is to reduce false positives in the expression fold change calculation and ensure housekeeping genes are marked as unchanged. Since it is usually unknown a priori what genes sustain their expression level during the two experiments, different heuristic normalization schemes may be adopted and it is unclear at this point which particular approach is optimal. MAS4 normally derives the normalization factor by scaling average D
k
for all genes with expression levels within the central 96% to a certain target intensity constant. This algorithm is based on the assumption that the total copy of mRNA per cell is a conserved number among experiments.
MAS4 Algorithm for Comparison Analysis
In comparison of two chips, we follow Affymetrix convention of calling the reference chip measurement as the base (b), the other one as the experiment (e). After the above normalization procedure, the fold change value for gene k, f
k
, is essentially calculated by dividing D
k
(e) with D
k
(b). MAS4 uses a filtering process to ensure that a common subset of "good" probe pairs between b and e are used to recalculate D
k
. Very weak expression signals (sometimes negative) are rounded to the noise level mentioned before. The final formula for f is:
f
k
= [D
k
(e) + max (Q - D
k
(b), 0)] / max (D
k
(b), Q), if D
k
(e) ≥ D
k
(b),
or
f
k
= max (D
k
(e), Q) / [D
k
(b) + max (Q - D
k
(e), 0)], otherwise.
An empirical difference call decision matrix and a proprietary MAS4 algorithm, partly described in the Microarray Suite 4.0 documentation, determine the significance of the fold change. As a result, the probe set is marked as "Increase", "No Change", "Decrease", "Marginally Increase", or "Marginally Decrease". It is obvious that the fold change number is only meaningful if the probe set is clearly present at a level above the noise in both data sets.