MI-GWAS: a SAS platform for the analysis of inherited and maternal genetic effects in genome-wide association studies using log-linear models

Background Several platforms for the analysis of genome-wide association data are available. However, these platforms focus on the evaluation of the genotype inherited by affected (i.e. case) individuals, whereas for some conditions (e.g. birth defects) the genotype of the mothers of affected individuals may also contribute to risk. For such conditions, it is critical to evaluate associations with both the maternal and the inherited (i.e. case) genotype. When genotype data are available for case-parent triads, a likelihood-based approach using log-linear modeling can be used to assess both the maternal and inherited genotypes. However, available software packages for log-linear analyses are not well suited to the analysis of typical genome-wide association data (e.g. including missing data). Results An integrated platform, Maternal and Inherited Analyses for Genome-wide Association Studies (MI-GWAS) for log-linear analyses of maternal and inherited genetic effects in large, genome-wide datasets, is described. MI-GWAS uses SAS and LEM software in combination to appropriately format data, perform the log-linear analyses and summarize the results. This platform was evaluated using existing genome-wide data and was shown to perform accurately and relatively efficiently. Conclusions The MI-GWAS platform provides a valuable tool for the analysis of association of a phenotype or condition with maternal and inherited genotypes using genome-wide data from case-parent triads. The source code for this platform is freely available at http://www.sph.uth.tmc.edu/sbrr/mi-gwas.htm.


Background
Candidate gene, and more recently, genome-wide association (GWA) studies have been used to identify associations between several complex diseases and the genotype of affected individuals (i.e. cases) [1][2][3][4][5][6]. However, for some phenotypes (e.g. birth defects, perinatal outcomes, pediatric cancers), the maternal genotype may also directly contribute to risk, via an effect on the in utero environment [7]. However, despite increasing recognition of the importance of maternal genetic effects in genetic epidemiology studies [8][9][10][11][12][13][14][15][16][17], nearly all GWA studies to date have ignored maternal genetic effects [18], which could partially explain why GWA studies have not identified the majority of the genetic contribution to common diseases. Failure to account for maternal genetic effects in previous studies may be due both to lack of appropriate data (i.e. maternal DNA is not collected in traditional case-control studies) and lack of a readily available platform for performing analyses that account for maternal genetic effects using typical GWA data.
The most common study designs used in GWA studies include the case-control and case-parent triad designs. Though the case-control approach has been used more frequently, distinguishing between maternal and inherited genetic effects using this study design requires the addition of samples from mothers of both cases and controls, resulting in increased genotyping costs [7]. Moreover, maternal DNA has not been collected for most existing case-control studies. By contrast, maternal genetic effects can be directly assessed in existing data from case-parent triad GWA studies. However, in the majority of these studies, data have been analyzed using the transmission disequilibrium test (TDT) [19,20], which does not assess maternal genetic effects.
The most commonly used method for assessing maternal and inherited genetic effects using case-parent triad data is a log-linear, likelihood-based approach [21,22]. In addition to evaluating maternal genetic effects, this approach has the advantage (over the TDT) of providing estimates of effect size and, similar to the TDT, does not require the assumption of Hardy-Weinberg equilibrium [21][22][23][24]. Further, using this approach, data from incomplete triads can be included in the analysis by use of the expectation-maximization (EM) algorithm [23]. These log-linear analyses can be conducted using standard software (e.g. SAS ® , SAS Institute Inc.), but, when data from incomplete triads are included, require programming of the EM, which is cumbersome. The specialized program, LEM [25,26], can also be used to conduct these analyses, and has the advantage over other programs (e.g. SAS) in that it does not require programming of the EM and can be easily programmed to explore a variety of additional types of effects (e.g., gene-gene interaction, gene-environment interaction). However, because LEM requires individual data and program files for each SNP, it is not feasible to conduct a GWA study analysis using LEM as a stand-alone program.
To address the need for an efficient approach for analyzing maternal and inherited genetic effects using GWA data from case-parent triad studies, a novel computational platform was created. This platform uses SAS to prepare the data in an LEM-compatible format, calls LEM to evaluate each SNP, and extracts and summarizes relevant data from the LEM output files. The performance of this platform was evaluated using existing, case-parent triad GWA data.

Log-linear Modeling
The theoretical distribution (assuming Mendelian inheritance and mating symmetry) of case-parent triad genotype data (defined by the combination of father, mother, and child genotypes) can be fitted to the observed triad counts by maximum likelihood using the following log-linear model: The values of E(n F,M,C ) correspond to the expected count of each genotype combination (i.e., father, mother, and child genotypes). The γ j term stratifies on parental mating type (i.e., each combination of possible parental genotypes), and I is an indicator variable that equals 0, 1, or 2, corresponding to the number of high-risk alleles present in the child's or mother's genotype (C and M respectively). The offset, w F,M,C , accounts for the heterozygous genotype being twice as likely as either homozygous genotype in offspring of double heterozygous matings (assuming Mendelian transmission). Under this model, the risk corresponding to a genotype with one copy of the "high-risk" allele compared to no copies can be estimated by exp(β 1 ) for the inherited genotype and exp(α 1 ) for the maternal genotype, and the risk corresponding to a genotype with two copies compared to no copies can be estimated by exp(β 2 ) for the inherited genotype and exp(α 2 ) for the maternal genotype. The significance of the inherited genetic effect can be evaluated using a two degree of freedom likelihood ratio test (LRT) to compare the log-likelihood of the full model (i. e. Model 1) to that of a reduced model in which β 1 = β 2 = 0. The null hypothesis for this test is that conditional on parental mating type, the distribution of affected offspring agrees with Mendelian expectation. Similar to the TDT, this LRT provides a test of linkage in the presence of linkage disequilibrium that is not vulnerable to confounding due to population stratification [21,27]. Likewise, the significance of the maternal genetic effect can be evaluated by using the LRT to compare the log-likelihood of the full model to that of a reduced model in which α 1 = α 2 = 0. The null hypothesis for this test is that reciprocal parental mating types (e.g. Aa × AA and AA × Aa) occur at the same frequency among parents of affected individuals [21][22][23]. This LRT is vulnerable to a specific form of population stratification that violates the underlying assumption of mating symmetry. Mating asymmetry occurs when reciprocal mating types for a given allele do not occur with equal frequency in the population (e.g. Aa × AA > AA × Aa). The potential for this type of bias can be limited by restricting analyses to matings from the same race and ethnicity [7]. Data from incomplete triads can also be included in these analyses by use of the EM algorithm.
Model 1 is a general (i.e. unrestricted) model, in which no constraints are placed on the relationships between alleles (i.e. no constraints on β 1 and β 2 or α 1 and α 2 ). However, in some circumstances power can be increased by imposing constraints on the general model. For example, a linear constraint can be imposed upon the full model for the genetic effect being tested, leaving the terms for the other genetic effect unconstrained. This allows for a one degree of freedom LRT. The use of a linear constraint has been shown to perform well under a variety of circumstances [7,[21][22][23]28].

LEM Software
The log-linear modeling approach with EM can be implemented using the LEM program, which can be downloaded at: http://spitswww.uvt.nl/~vermunt/#Software. LEM requires individual data and program files for each association test. Thus, for each SNP, it requires one data file and four program files: one for the full model with a linear constraint imposed for inherited effects, one for the full model with a linear constraint imposed for maternal effects, one for the reduced model to test for inherited effects, and one for the reduced model to test for maternal effects. Further, LEM data files must be formatted so that each row contains genotypes for the three members of the triad, whereas it is common for genome-wide genotype data to be outputted as a single file with one individual per row and one allele per column (e.g. PLINK [29] format) ( Table 1).

Maternal and Inherited Analyses for Genome-wide Association Studies (MI-GWAS)
SAS version 9.2 was used to develop a platform for the efficient analysis of maternal and inherited genetic effects in GWA data, using LEM to apply the log-linear modeling likelihood-based approach. This platform is called Maternal and Inherited Analyses for Genomewide Association Studies (MI-GWAS) and is freely available at http://www.sph.uth.tmc.edu/sbrr/mi-gwas.htm.
Briefly, MI-GWAS: 1. Reads data from a single PLINK ped file that contains a separate row of data for each triad member. Each row contains all genotype data for a single individual, and each genotype is coded as two alleles (i.e. two columns of data per genotype). By default, MI-GWAS treats the allele coded as "2" under the Illumina 1/2 allele coding system as the high-risk allele.
2. Converts the PLINK data into one LEM data file per SNP, with a single row of genotype data for each triad, with genotypes in the order of mother, father, and child. These genotypes are coded as 0,   . Sorts results by the LRT p-value, separately for the inherited and maternal genotype, and outputs two corresponding data files. 10. Sorts results by relative risk estimates, separately for the inherited and maternal genotype, and outputs two corresponding data files.
To increase computation efficiency, repeating sections of code (i.e. macros) are used to split these steps into consecutive blocks that process subsets of SNPs (by default 1,000). Each consecutive subset of SNPs is processed separately, and after processing of the last subset is complete, all results are merged. A visual overview of the structure of these steps of the platform is provided in Figure 1.
The use of SAS, a popular statistical software package, allows for flexibility in the MI-GWAS platform with simple programming changes. For example, the user can specify a range of successive SNPs to analyze rather than running all SNPs at once. This allows the user to split the analysis between multiple runs or between multiple computers. The program also includes code that can be modified to allow the user to specify specific individuals, triads, or SNPs to be excluded from the analysis. Further, the log-linear models can be modified to accommodate other family-based association designs (e. g. study designs that incorporate grandparents' genotypes [7]), other relationships between alleles (e.g. dominant inheritance), and to include other effects such as gene-gene and gene-environment interactions and parent of origin effects. By default, a linear constraint is imposed for the effect that is being tested (e.g. inherited genotype) and no constraint is imposed for the other effect (e.g. maternal genotype). The platform can also process imputed data that has been converted to PLINK format. Though by default MI-GWAS processes PLINKformatted genotype data with Illumina 1/2-based allele coding, the SAS code can be modified to process genotype data that are in other formats.

Evaluation
To validate the MI-GWAS platform and evaluate its performance, it was used to analyze a large, unpublished GWA study dataset derived from case-parent triads ascertained through cases with a conotruncal heart defect. The subject recruitment methods for this study have been previously described [17]. Briefly, case-parent triads were recruited through the Cardiac Center at the Children's Hospital of Philadelphia (CHOP) from 1997-2007 and all participants provided informed consent under a protocol approved by the Institutional Review Boards for the Protection of Human Subjects. Cases had a nonsyndromic, classic conotruncal defect (i.e. tetalogy of Fallot, D-transposition of the great arteries, double outlet right ventricle, truncus arteriosus or interrupted aortic arch) or a related malformation (i.e. perimembranous or posterior malalignment type ventricular septal defect or an isolated aortic arch anomaly). The genotype data were generated from blood or saliva samples using the Illumina HumanHap550 or Human610-Quad Bead-Chip Platforms at the Center for Applied Genomics at CHOP. Only SNPs present on both platforms were analyzed.
To verify MI-GWAS results for inherited genetic effects using the CHOP dataset, selected SNPs were evaluated for complete triads using both PLINK and MI-GWAS and the chi-square values for the TDT and LRT, respectively, were compared. (The TDT approach under PLINK cannot assess maternal genetic effects and does not incorporate data from incomplete triads into evaluations of inherited genetic effects.) To verify MI-GWAS results for maternal genetic effects using this dataset, selected SNPs were evaluated using both MI-GWAS and LEM run outside of MI-GWAS, and the resulting LRT pvalues were compared. In addition, MI-GWAS was used to replicate the findings of a candidate gene study conducted in this study population, using LEM run outside of MI-GWAS [17]. To be consistent with the previous analyses, MI-GWAS was modified to use an unrestricted model of inheritance for both genotypes (i.e. a two degree of freedom LRT), for the latter analysis.

Results
The evaluation dataset included data on 530,350 SNPs from 837 case-parent triads (497 complete triads and 340 triads with one or two members missing). To confirm MI-GWAS results for the test of inherited genetic effect, MI-GWAS was used to analyze all SNPs in the subset of complete triads, and chi-square values for the LRT were compared to TDT chi-square values obtained using PLINK to analyze the same data. The chi-square values for the LRTs of inherited genetic effects generated by MI-GWAS were quite similar to the TDT chisquare values generated using PLINK (Table 2). There were subtle differences in the chi-square values for some of the most significant SNPs, likely due to differences between the two platforms in rounding and/or how numbers are stored (e.g., floating point representation). In addition, identical results were obtained when program and data files generated by MI-GWAS were analyzed using the MI-GWAS platform and LEM run outside of MI-GWAS (data not shown). Finally, the results for maternal and inherited genetic effects from a previously published candidate gene study from the same study population [17] were replicated using MI-GWAS (data not shown).
As expected, MI-GWAS running times were somewhat faster on computers with better specifications. For the same 60,000 SNPs, running times ranged from 11 hours 35 minutes to 22 hours 48 minutes for four computers with differing specifications (Table 3).

Discussion
Computational platforms that can evaluate maternal as well as inherited genetic effects in genome-wide data from case-parent triads have not been previously described. By automating the implementation of such analyses, MI-GWAS provides such a platform. Comparison of results from MI-GWAS and LEM run outside of MI-GWAS, as well as results from MI-GWAS and PLINK, indicate that this platform performs as intended. Further, MI-GWAS performs relatively efficiently. From  the observed running times (Table 3), it can be inferred that on a single average modern consumer computer, it may take approximately one week to run a GWA study analysis of~500,000 SNPs using MI-GWAS. Such an analysis may run overnight if split between around eight average consumer computers. The MI-GWAS platform has the advantage of using readily available software (i.e. SAS and LEM) and reading a common GWA data input format (i.e. PLINK format). Further, unlike the TDT approach under PLINK, analysis under MI-GWAS uses a log-linear approach that provides estimates of effect size, allows use of data from incomplete triads [21][22][23], and, most importantly, allows estimation of the significance of maternal effects in addition to inherited effects.

Conclusions
For some conditions, maternal genetic effects may influence the risk of disease in offspring via an effect on the in utero environment. However, maternal genetic effects have not been widely evaluated in GWA studies, at least partially due to lack of a platform designed to analyze maternal genetic effects using GWA data from case-parent triads. The application of the MI-GWAS platform to GWA analyses expands the types of genetic effects that can be evaluated with triad GWA data, which may lead to new insights regarding the etiology of common diseases. Future developments of the MI-GWAS platform will involve improving the efficiency of the platform, and incorporating analyses of additional types of effects (e.g. parental imprinting, interactions).