Microarray data analysis
Background correction
Array2BIO follows the original Affymetrix procedure of background correction. The array of probes is divided into 16 zones (a 4 × 4 grid). Raw intensities within each zone are ranked, and the background level of the zone is defined by the lowest 2% of its intensities. The distance from each probe to the zone center is used to estimate the background level at each probe location, which is then subtracted from the raw probe intensity.
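The zone-based correction described above can be sketched as follows. This is a minimal illustration and not Array2BIO's own code: the function names are hypothetical, and two details the text does not specify are assumed here, namely that a zone's background is the mean of its lowest 2% of intensities and that the background at a probe is an inverse-squared-distance weighted average over all zone centers with a smoothing constant.

```python
def zone_backgrounds(intensity, grid=4, fraction=0.02):
    """Return (center_row, center_col, background) for every zone.

    `intensity` is a list of equal-length rows with at least `grid` rows and
    columns. Assumption: the zone background is the mean of the lowest
    `fraction` of its intensities (at least one value is always used).
    """
    n_rows, n_cols = len(intensity), len(intensity[0])
    zones = []
    for zr in range(grid):
        r0, r1 = zr * n_rows // grid, (zr + 1) * n_rows // grid
        for zc in range(grid):
            c0, c1 = zc * n_cols // grid, (zc + 1) * n_cols // grid
            values = sorted(intensity[r][c]
                            for r in range(r0, r1) for c in range(c0, c1))
            k = max(1, int(len(values) * fraction))
            background = sum(values[:k]) / k
            zones.append(((r0 + r1 - 1) / 2, (c0 + c1 - 1) / 2, background))
    return zones


def background_at(row, col, zones, smooth=100.0):
    """Distance-weighted background at one probe location.

    Assumption: weights are 1 / (squared distance + smooth).
    """
    weights = [1.0 / ((row - zr) ** 2 + (col - zc) ** 2 + smooth)
               for zr, zc, _ in zones]
    weighted = sum(w * b for w, (_, _, b) in zip(weights, zones))
    return weighted / sum(weights)


def correct_background(intensity, grid=4, fraction=0.02, smooth=100.0):
    """Subtract the estimated local background from every raw intensity."""
    zones = zone_backgrounds(intensity, grid, fraction)
    return [[value - background_at(r, c, zones, smooth)
             for c, value in enumerate(row)]
            for r, row in enumerate(intensity)]
```

Weighting by distance lets the background vary smoothly across the array instead of jumping at zone boundaries.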
Filtering out non-specific hybridization
Each probe is measured as a pair of intensities: a perfect match (PM) intensity and a mismatch (MM) intensity, where the MM intensity estimates the cross-reactivity with other genes. Array2BIO excludes all probes with a PM intensity less than 1.25 × MM. It also calculates the fraction of probes with specific hybridization that pass this filter. For the remaining probes, the MM intensity is subtracted from the PM intensity, so that the raw intensity is measured as the relative (PM-MM) intensity.
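A minimal sketch of this filter is given below. The function name and the representation of probes as (PM, MM) tuples are assumptions made for illustration.

```python
def filter_specific(pm_mm_pairs, ratio=1.25):
    """Drop probes with PM < ratio * MM and return (PM - MM values, fraction kept).

    `pm_mm_pairs` is a list of (PM, MM) intensity tuples (hypothetical layout).
    """
    kept = [pm - mm for pm, mm in pm_mm_pairs if pm >= ratio * mm]
    fraction = len(kept) / len(pm_mm_pairs) if pm_mm_pairs else 0.0
    return kept, fraction
```

For example, with the pairs (200, 100), (120, 100), (125, 100) and (50, 80), only the first and third probes pass the 1.25 × MM threshold, so half of the probes are retained.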
Normalization and Log2 transformation
The median (PM-MM) array intensity is calculated for the probes that remain after the filtering step. Individual (PM-MM) probe intensities $I_i$ undergo normalization and a base 2 logarithmic transformation:
$EP_i = \log_2(I_i / I_{med})$,
where $I_{med}$ denotes the median (PM-MM) array intensity.
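The median normalization and log2 transformation can be sketched as follows. The function name is hypothetical, and the sketch assumes that all (PM-MM) intensities are strictly positive, which holds after the PM/MM filter whenever MM is positive.

```python
import math
from statistics import median


def normalize_log2(intensities):
    """Divide each (PM - MM) intensity by the array median and take log2."""
    med = median(intensities)
    return [math.log2(value / med) for value in intensities]
```

A probe at the array median therefore receives a value of 0, and each doubling of intensity adds 1.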
Probe to tag mapping
Affymetrix .CDF files are used to map individual probe intensities $EP_i$ onto Affymetrix gene tags $GP_j$. Usually each tag accumulates ~10 good probes that span the corresponding gene transcript.
Averaging experiment replicas
Several experimental replicas can be averaged in comparative analysis to reliably estimate signal and background gene expression levels.
Filtering out the outliers
It is common for the expression level of several probes of a gene to differ significantly from the median level of transcript expression $P_j$. To filter out these outliers, Array2BIO excludes transcript probes whose expression values differ from $P_j$ by more than x standard deviations $\sigma_j$, where the threshold x is defined by the user. A strict filter ($1\sigma_j$) and a medium-stringency filter ($2\sigma_j$) are set as defaults for the comparative and clustering analyses, respectively.
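The outlier filter can be illustrated with the short sketch below. The function name is hypothetical, and the use of the population standard deviation of the probe values is an assumption, since the text does not state how the standard deviation is estimated.

```python
from statistics import median, pstdev


def filter_outliers(probe_values, n_sigma=2.0):
    """Keep probes within n_sigma standard deviations of the transcript median.

    Assumption: sigma is the population standard deviation of the probe values.
    """
    center = median(probe_values)
    sigma = pstdev(probe_values)
    return [v for v in probe_values if abs(v - center) <= n_sigma * sigma]
```

With n_sigma=1.0 the filter corresponds to the strict default of the comparative analysis, and with n_sigma=2.0 to the medium-stringency default of the clustering analysis.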
Statistical methods (comparative analysis)
Handling low-expressors
The significance of a fold-difference in intensity values (i.e. expression) varies dramatically between low- and high-expressor genes. This occurs because dividing one small number by another (as happens for low-expressors) can produce a large fold-difference simply by chance. Array2BIO utilizes local mean normalization and local variance correction across intensities to handle low- and high-expressors differently and to define separate fold-difference thresholds for different intensity levels. Array2BIO employs an approach highly similar to the previously described SNOMAD method (Colantuoni et al. 2002), a 'pooled local variance' approach with 100 bins of gene tags. First, fold-expression levels of Affymetrix tags are ordered by their average expression level across signal and control data. Then gene tags are binned into 100 groups by average expression level, and the local variation of fold-expression is calculated for each group. This allows one to compute the local standard deviation ($\sigma_i$) and subsequently the local z-score ($z_j$) of the fold-difference for each j-th gene tag within the i-th group to which it belongs:
$z_j = (F_j - \bar{F}_i) / \sigma_i$,
where $F_j$ is the fold-difference in expression of the j-th gene tag and $\bar{F}_i$ is the average fold-difference in expression of the i-th group. Differentially expressed tags identified by a z-score greater than 2.0 are selected for further analysis (Figure 3).
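The binning and local z-score calculation can be sketched as follows. This is an illustration under stated assumptions rather than the Array2BIO or SNOMAD implementation: the function name is hypothetical, the bins are assumed to hold equal numbers of tags, and the population standard deviation is used within each bin.

```python
from statistics import mean, pstdev


def local_z_scores(avg_expression, fold_difference, n_bins=100):
    """Local z-score of each tag's fold-difference within its expression bin.

    Tags are ordered by average expression and split into `n_bins` groups of
    (nearly) equal size; each fold-difference is standardized with the mean
    and standard deviation of its own group.
    """
    n = len(avg_expression)
    order = sorted(range(n), key=lambda j: avg_expression[j])
    z = [0.0] * n
    for i in range(n_bins):
        members = order[i * n // n_bins:(i + 1) * n // n_bins]
        if not members:
            continue  # more bins than tags
        folds = [fold_difference[j] for j in members]
        mu, sigma = mean(folds), pstdev(folds)
        for j in members:
            # A bin with no variation carries no evidence of a difference.
            z[j] = (fold_difference[j] - mu) / sigma if sigma > 0 else 0.0
    return z
```

Because each tag is compared only with tags of similar average expression, a given fold-difference yields a smaller z-score in the noisy low-expression bins than in the stable high-expression bins.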
Welch's t-test of differential expression significance
Signal and control tags that survive the balance analysis of low- and high-expressors are next subjected to statistical testing using Welch's t-test. The test is performed on the average signal and control tag expression, using the standard deviations of their probe expression distributions. A p-value is assigned to every differentially expressed tag, and tags with p-values less than 0.05 are selected for multiple testing correction analyses.
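For reference, the Welch statistic and its degrees of freedom (the Welch-Satterthwaite approximation) can be computed from summary statistics as in the sketch below. The function name and argument layout are assumptions; the sketch stops short of the p-value, which would be taken from Student's t-distribution with the returned degrees of freedom.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Inputs are the mean, standard deviation, and number of observations
    for the signal (1) and control (2) groups.
    """
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df
```

Unlike Student's t-test, this form does not assume equal variances in the signal and control groups.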
Mapping Affymetrix tags onto UCSC known genes
Array2BIO first identifies a set of unique (non-overlapping) genes in the genome matching the original .CEL file by using the 'known genes' annotation provided by the UCSC Genome Browser database (Karolchik et al. 2003). Next, Affymetrix tags are mapped onto (and grouped by) UCSC 'known genes'. Accession numbers for the corresponding mRNA sequences and their genomic locations are retrieved for each gene during the mapping process. This information is then used to dynamically link genes to the NCBI database and to the ECR Browser.
Gene Ontology (GO) and KEGG analyses of biological functions and gene interactions
Array2BIO utilizes locally installed versions of the Gene Ontology (GO) (Harris et al. 2004) and KEGG (Ogata et al. 1999) databases to contrast the distribution of differentially expressed genes across functional categories with the average distribution in the corresponding genome. Observed and expected category population values are compared, and the statistical 'enrichment' (or 'depletion') of a category is quantified by using hypergeometric distribution statistics. Functional categories with p-values smaller than 0.05 are selected for subsequent multiple testing correction analyses. The GO database provides a biological classification of gene function through membership in functional categories that relate to biological processes, molecular functions, or cellular components. The KEGG database combines information on gene interactions that are grouped into (1) metabolism, (2) genetic information processing, (3) environmental information processing, (4) cellular processes, and (5) human diseases categories.
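The enrichment test can be sketched with the upper tail of the hypergeometric distribution. The function and parameter names below are hypothetical; depletion would be tested analogously with the lower tail.

```python
from math import comb


def hypergeom_enrichment_p(genome_size, category_size, selected, observed):
    """P(X >= observed) for a hypergeometric draw.

    genome_size   -- number of annotated genes in the genome
    category_size -- number of those genes belonging to the category
    selected      -- number of differentially expressed genes
    observed      -- number of selected genes that fall in the category
    """
    total = comb(genome_size, selected)
    upper = min(category_size, selected)
    tail = sum(comb(category_size, k)
               * comb(genome_size - category_size, selected - k)
               for k in range(observed, upper + 1))
    return tail / total
```

For instance, with a genome of 10 genes of which 4 belong to a category, drawing 3 genes and observing at least 2 from the category has probability 40/120 = 1/3.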
Correction for multiple testing
Array2BIO performs correction for multiple testing to exclude false positive predictions that arise because the statistical testing of differential tag expression, or of enrichment/depletion in GO and KEGG categories, is performed many times. Array2BIO provides two statistical methods to correct for multiple testing and also allows the correction to be omitted if the user does not want to apply it. The default method is the medium-stringency Benjamini-Hochberg correction (Benjamini and Hochberg 1995), which controls the false discovery rate (FDR), the expected proportion of false discoveries among the rejected hypotheses. In general it provides a good balance between the discovery of statistically significant differences and the limitation of false positive occurrences. Alternatively, the Bonferroni correction method can be applied. The latter is one of the most stringent multiple testing correction methods and can be used to select the most outstanding overexpressor genes or enriched/depleted functional categories.
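Both corrections can be expressed as adjusted p-values, as in the sketch below. The function names are hypothetical, and the step-up formulation of Benjamini-Hochberg shown here is the standard textbook form rather than a description of Array2BIO's code.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: each p multiplied by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A test is then called significant when its adjusted p-value falls below the chosen threshold, for example 0.05. The Bonferroni values are never smaller than the Benjamini-Hochberg values, which reflects its greater stringency.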
Clustering analysis
Microarray data clustering
Array2BIO utilizes the Unix version of the Cluster tool (Eisen et al. 1998). Cluster's hierarchical analysis is integrated into Array2BIO; it allows clustering of genes and/or conditions and provides 9 distance measures and 4 clustering methods. Due to Cluster limitations, Array2BIO restricts the number of clustered transcripts to fewer than 2500 genes. Genes are ranked by their standard deviation in expression across the different conditions, and the genes with the largest variation from their average expression across all conditions are selected for clustering.
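The selection of genes for clustering can be sketched as follows. The function name and the dictionary layout (gene name mapped to its expression values across conditions) are assumptions, as is the use of the population standard deviation.

```python
from statistics import pstdev


def select_most_variable(expression_by_gene, max_genes=2499):
    """Return the genes with the largest standard deviation across conditions.

    `expression_by_gene` maps a gene name to a list of expression values,
    one per condition. At most `max_genes` names are returned, most
    variable first.
    """
    ranked = sorted(expression_by_gene,
                    key=lambda gene: pstdev(expression_by_gene[gene]),
                    reverse=True)
    return ranked[:max_genes]
```

The default of 2499 reflects the stated limit of fewer than 2500 clustered genes.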
Interactive tree visualization
Array2BIO provides an interactive web utility for visualizing clustering results, which is similar in graphical display and operation to Java TreeView (Saldanha 2004). Clustered gene expression across multiple conditions is visualized in a matrix format. The tree of clustering relationships is given to the left of the gene expression image (Figure 4A). A mouse click on a tree branch generates a 'zoom in' image of that branch and gives a detailed description of related genes (including gene names, accession numbers, corresponding Affymetrix tags, and genomic locations) (Figure 4B).
Interconnection with external tools
ECR Browser – evolutionary conservation analysis
The ECR Browser (Ovcharenko et al. 2004) is a dynamic whole-genome navigation tool for visualizing and studying evolutionary relationships among genomes. Evolutionary Conserved Regions (ECRs) are extracted from genome alignments, mapped to genomes, and graphically visualized in relation to the genes that have been annotated in the reference genome.
Creme 2.0 – identification of clusters of transcription factor binding sites in promoters
Crème 2.0 (Sharan et al. 2004) relies on a database of putative transcription factor binding sites that have been carefully annotated across the human genome using evolutionary conservation with the mouse and rat genomes. An efficient search algorithm is applied to this data set to identify combinations of transcription factors whose binding sites tend to co-occur in close proximity to the start site of the input gene set. These combinations are statistically evaluated, and significant combinations are reported and visualized.
NCBI – detailed sequence information
Detailed mRNA transcript information, including nucleotide and protein sequences, related publications, and gene annotation, is provided through the dynamic interconnection to the NCBI database.