Dataset
This research used 802 whole genomes from the Alzheimer’s Disease Neuroimaging Initiative (ADNI): 279 controls, 191 cases, and 332 individuals of unknown status, comprising 444 males and 358 females. The genomes were processed by ADNI using the Burrows-Wheeler aligner (BWA) [15] and the best practices of the Genome Analysis Toolkit (GATK) [16]. Genomes were obtained from the ADNI database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD. For up-to-date information, see www.adni-info.org.
Single-SNP GWAS analysis
Single-SNP GWAS uses a modified form of linkage disequilibrium (LD) to infer relationships between single SNPs and observed phenotypes. To understand this approach, it is necessary to understand LD and how it is usually applied in genetic analyses. LD is a measure of how often two genomic features are inherited together within a population of interest, compared to how often they “should” be inherited together (i.e., the difference between the observed co-occurrence of two SNPs and the co-occurrence expected if the two SNPs were inherited independently). The LD coefficient D for the co-occurrence of events (SNPs) $A_{1}$ and $B_{1}$ (as opposed to $A_{2}$ and $B_{2}$, respectively) is calculated as:
$$D = p(A_{1} \land B_{1}) - p(A_{1}) \ast p(B_{1}) $$
In the simple case often seen in genetics, these events are different nucleotides (A, C, T, or G) that exist at specific positions in the genome. We would like our LD to reflect the likelihood of observing both SNPs together (e.g., if $A_{1}$ occurs then we know confidently that $B_{1}$ also occurs, and vice versa). Unfortunately, D alone does not provide this information, because its attainable range depends on the allele frequencies. However, a second measure of LD, $r^{2}$, based on D, is a measure of how closely related the two events (SNPs) are. A value of $r^{2} = 1.0$ means they provide the exact same information, or always co-occur. D is converted to the Pearson correlation coefficient r by the following:
$$r = D / \sqrt{p(A_{1}) \ast p(A_{2}) \ast p(B_{1}) \ast p(B_{2})} $$
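As an illustration of the two formulas above, the following is a minimal Python sketch that computes D, r, and $r^{2}$ for two biallelic sites from a list of haplotypes. The function name `ld_stats` and the encoding of alleles as 1 and 2 are assumptions made for this example and are not taken from the original analysis.

```python
from math import sqrt


def ld_stats(haplotypes):
    """Return (D, r, r_squared) for two biallelic sites.

    `haplotypes` is a list of (a, b) pairs, one per haplotype, where
    a is 1 or 2 (allele A1 or A2) and b is 1 or 2 (allele B1 or B2).
    This encoding is an assumption of the sketch.
    """
    n = len(haplotypes)
    p_a1 = sum(1 for a, _ in haplotypes if a == 1) / n
    p_b1 = sum(1 for _, b in haplotypes if b == 1) / n
    p_a1b1 = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    # D = p(A1 and B1) - p(A1) * p(B1)
    d = p_a1b1 - p_a1 * p_b1

    # r = D / sqrt(p(A1) * p(A2) * p(B1) * p(B2)), with p(A2) = 1 - p(A1)
    denom = sqrt(p_a1 * (1 - p_a1) * p_b1 * (1 - p_b1))
    if denom == 0:
        raise ValueError("r is undefined when either site is monomorphic")
    r = d / denom
    return d, r, r * r


# Two sites that always co-occur, then two independent sites.
linked = [(1, 1), (1, 1), (2, 2), (2, 2)]
independent = [(1, 1), (1, 2), (2, 1), (2, 2)]
print(ld_stats(linked))
print(ld_stats(independent))
```

For the perfectly linked haplotypes, D is 0.25 and $r^{2}$ is 1; for the independent haplotypes, both are 0. This shows why $r^{2}$, rather than D, is the quantity compared across SNPs.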
In GWAS, LD is used to select which SNPs from the genome to analyze. For example, if two SNPs provide the exact same information (i.e., they always or almost always are inherited together), then only one of the SNPs is analyzed. This reduces the total number of tests (i.e., preserves statistical power) by eliminating redundant tests.
In our research, rather than using p-values from a regression to assess the relationship between a SNP (or multiple SNPs) and AD case/control status (i.e., which SNPs are correlated with AD), we calculated $r^{2}$ between each SNP in the dataset and AD case/control status. In this approach, the two events whose co-occurrence we measure are a SNP and AD case/control status (i.e., do case status and a particular SNP co-occur with high confidence). We accomplished this by writing our own algorithm that computes LD between each SNP and AD case/control status.
An outline of our single-SNP GWAS algorithm is given in Algorithm 1. Note that there are multiple genotype values of $A_{i}$ because there are two haplotypes. The computation of Pearson’s r is as described above given the computed probabilities. This approach on a set of individuals, S, and a set of SNPs, L, runs in O(∥S∥×∥L∥) time and with O(1) space.
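The single-SNP scan described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a reproduction of Algorithm 1: the names `r_squared_binary` and `single_snp_scan` are hypothetical, alleles are encoded as 0/1 flags, and each individual's two haplotypes are treated as two observations that share the individual's case/control status.

```python
def r_squared_binary(xs, ys):
    """r^2 between two equal-length 0/1 sequences, computed from D."""
    n = len(xs)
    p_x = sum(xs) / n
    p_y = sum(ys) / n
    p_xy = sum(1 for x, y in zip(xs, ys) if x and y) / n
    denom = p_x * (1 - p_x) * p_y * (1 - p_y)
    if denom == 0:
        # Monomorphic SNP or single-class phenotype: r is undefined.
        # Returning 0.0 is a convention chosen for this sketch.
        return 0.0
    d = p_xy - p_x * p_y
    return d * d / denom


def single_snp_scan(snp_rows, is_case):
    """Yield (snp_id, r^2) for each SNP against case/control status.

    snp_rows: iterable of (snp_id, genotypes), where genotypes[i] is a
              pair (h1, h2) of 0/1 allele flags for individual i.
    is_case:  list of 0/1 flags, one per individual.
    """
    # Assumption: both haplotypes of an individual carry the same status.
    status = [c for c in is_case for _ in range(2)]
    for snp_id, genotypes in snp_rows:
        alleles = [h for pair in genotypes for h in pair]
        yield snp_id, r_squared_binary(alleles, status)


rows = [
    ("snp_a", [(1, 1), (1, 1), (0, 0), (0, 0)]),  # tracks status exactly
    ("snp_b", [(1, 0), (1, 0), (1, 0), (1, 0)]),  # unrelated to status
]
for snp_id, r2 in single_snp_scan(rows, is_case=[1, 1, 0, 0]):
    print(snp_id, r2)
```

Each SNP is visited once and each individual once per SNP, which matches the O(∥S∥×∥L∥) running time stated above. The sketch processes one SNP at a time but does hold one SNP's alleles and the status vector in memory.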
Multi-SNP GWAS analysis
To extend to multi-SNP GWAS, we calculate LD between two SNPs and a phenotype. Comparing the co-occurrence of two SNPs and a trait results in eight different calculations for D of the form:
$$D_{i,\,j,\,k} = p(A_{i} \land B_{j} \land C_{k}) - p(A_{i}B_{j}) \ast p(C_{k}) $$
Unlike in the single-SNP case, the magnitude of D may differ between each of the eight comparisons. Because we are concerned with the possibility of a specific combination of alleles impacting the trait, we take the maximum among all eight values for D. Next we calculate $r^{2}$ using the specific combination of alleles as one event and any other combination (three possibilities, in the two-SNP case) as the alternate outcome for that event:
$$d_{i,\,j,\,k} = \sqrt{p(A_{i}B_{j}) \ast (1-p(A_{i}B_{j})) \ast p(C_{1}) \ast p(C_{2})} $$
This results in the following equation for r:
$$r = \max_{i,\,j,\,k} (D_{i,\,j,\,k} / d_{i,\,j,\,k}) $$
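The two-SNP computation can be illustrated with a short Python sketch that evaluates the ratio $D_{i,j,k} / d_{i,j,k}$ for all eight combinations and returns the maximum, following the equation for r above. The function name `multi_snp_r` and the encoding of alleles and phenotype classes as 1 and 2 are assumptions made for this example.

```python
from itertools import product
from math import sqrt


def multi_snp_r(observations):
    """Maximum of D_ijk / d_ijk over the eight (i, j, k) combinations.

    observations: list of (a, b, c) triples with a, b, c in {1, 2}:
    the allele at SNP A, the allele at SNP B, and the phenotype class C
    (an encoding assumed for this sketch).
    Returns None if no combination has a defined ratio.
    """
    n = len(observations)
    p_c = {k: sum(1 for _, _, c in observations if c == k) / n
           for k in (1, 2)}
    best = None
    for i, j, k in product((1, 2), repeat=3):
        p_ab = sum(1 for a, b, _ in observations if (a, b) == (i, j)) / n
        p_abc = sum(1 for obs in observations if obs == (i, j, k)) / n
        # d_ijk treats "A_i and B_j" as one event versus every other
        # allele combination as the alternate outcome.
        denom = sqrt(p_ab * (1 - p_ab) * p_c[1] * p_c[2])
        if denom == 0:
            continue  # ratio undefined for this combination
        ratio = (p_abc - p_ab * p_c[k]) / denom
        if best is None or ratio > best:
            best = ratio
    return best


# The combination (A1, B1) occurs only with phenotype class 1.
obs = [(1, 1, 1), (1, 1, 1), (1, 2, 2), (2, 1, 2), (2, 2, 2), (2, 2, 2)]
print(multi_snp_r(obs))
```

For a fixed allele pair, the values of D for the two phenotype classes are equal in magnitude and opposite in sign, so taking the maximum over k selects the non-negative one.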
Calculating the correlation of every SNP with every other SNP would have exceeded our computational resources. We therefore calculated correlations between all SNPs and a subset of the SNPs that were located in genes with the strongest and most consistent associations with AD (Table 1). Even this matrix was too large to fit into memory, so we computed subsections of the matrix in parallel on different machines. We found 2101 pairs of SNPs with an $r^{2}$ correlation greater than 0.04. We plotted these using R(Studio) to look for genes that had a (relatively) high correlation with other SNPs, as the relationship between two genes could provide important insights into disease processes.
An outline of our multi-SNP parallelized GWAS algorithm is given in Algorithm 2. The required START and END allow the task to be partitioned and run in parallel. There are multiple genotype values of $A_{i}$ and $B_{j}$ because there are two haplotypes. Our solution on a set of individuals, S, and a set of SNPs, L, runs in O(∥S∥×∥L∥²/k) time and with O(∥L∥) space for each of k parallel runs (for our run, k=201, with each run allocated 32 GB of RAM and a wall time of 6 h).
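One way to derive START and END values for k parallel runs is to split an index range into contiguous, near-equal chunks. The sketch below is a hypothetical illustration: the text does not state what START and END index, so the function `partition` and the assumption that they delimit half-open index ranges are ours.

```python
def partition(n_items, k):
    """Split range(n_items) into k contiguous [start, end) chunks.

    Chunk sizes differ by at most one; the first (n_items % k) chunks
    receive the extra item.
    """
    base, extra = divmod(n_items, k)
    bounds = []
    start = 0
    for idx in range(k):
        end = start + base + (1 if idx < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


# Ten items split across three runs.
print(partition(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Each (start, end) pair could then be passed to one independent run, so that no run needs to hold the full correlation matrix in memory.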