PREMIM and EMIM: tools for estimation of maternal, imprinting and interaction effects using multinomial modelling.

BACKGROUND
Here we present two new computer tools, PREMIM and EMIM, for the estimation of parental and child genetic effects, based on genotype data from a variety of different child-parent configurations. PREMIM allows the extraction of child-parent genotype data from standard-format pedigree data files, while EMIM uses the extracted genotype data to perform subsequent statistical analysis. The use of genotype data from the parents as well as from the child in question allows the estimation of complex genetic effects such as maternal genotype effects, maternal-foetal interactions and parent-of-origin (imprinting) effects. These effects are estimated by EMIM, incorporating chosen assumptions such as Hardy-Weinberg equilibrium or exchangeability of parental matings as required.


RESULTS
In application to simulated data, we show that the inference provided by EMIM is essentially equivalent to that provided by alternative (competing) software packages such as MENDEL and LEM. However, PREMIM and EMIM (used in combination) considerably outperform MENDEL and LEM in terms of speed and ease of execution.


CONCLUSIONS
Together, EMIM and PREMIM provide easy-to-use command-line tools for the analysis of pedigree data, giving unbiased estimates of parental and child genotype relative risks.


Background
Genomewide association studies have popularized the use of the case/control design to detect effects associated with an individual's own genotype, however many diseases (especially those related to pregnancy outcomes) may in fact be due to more complex effects such as maternal genotype effects, maternal-fetal genotype interactions or parent-of-origin (imprinting) effects. To detect such effects it is necessary to collect genotype data from one or both parents of cases, in addition to genotyping the cases themselves. Two existing popular approaches analyse either genetic data from affected offspring and their mothers (case/mother duos), along with an appropriate control sample [1][2][3], or else analyse genetic data from affected offspring and both parents (case/parent trios), without use of controls [4][5][6]. In contrast, our software EMIM uses a multinomial modelling *Correspondence: heather.cordell@ncl.ac.uk Institute of Genetic Medicine, Newcastle University, Central Parkway, Newcastle upon Tyne, NE1 3BZ, UK approach [7] that allows the simultaneous consideration of both case/mother duos and/or case/parent trios, with additional child and parent genotype data (such as individual cases and controls, case/father duos and control matings) included when available. The child-parent genotype data can be extracted from standard PLINKformat [8] pedigree files using our companion software PREMIM.
Full details and evaluation of the multinomial modelling approach used by EMIM have been described previously [7]. The early beta version of EMIM described in [7] allowed a more limited set of child-parent configurations than are supported in the current version, and did not include the current full range of optional likelihood assumptions (such as conditioning on parental genotypes (CPG) [6,9]). Most importantly, the companion program PREMIM was not available, limiting the ease with which EMIM could be applied to real data. http://www.biomedcentral.com/1471-2105/13/149

PREMIM: Pedigree file conversion
For each SNP in turn, PREMIM performs a simple algorithm to select from each pedigree the most informative sub-unit of child-parent genotype data. Different pedigree sub-units are chosen in order of preference as listed in Table 1.
There are a number of options that may be given to PRE-MIM. In particular, it is possible to override the default choice of individuals by stating a proband subject for certain pedigrees. These proband subjects are then chosen as cases (with parents where available). This may be useful to avoid possible bias when larger pedigrees have been ascertained on the basis of a specific affected individual. For larger pedigrees, it is also possible to select multiple case/parent trios or multiple control matings from each pedigree, potentially increasing the power to detect genetic effects. This option does have the potential to generate bias (depending on the analysis options chosen [6,10]), and so results should be interpreted with caution, although we anticipate that most people will apply these types of method to small pedigrees such as child/parent trios, making this issue less of a concern in practice. (Alternative methods for dealing with larger pedigrees, valid under the assumptions of random mating and/or Hardy-Weinberg equilibrium (HWE), have been described by [10,11]).

EMIM methodology
The basic principle behind EMIM is simple: to test for the existence of (and estimate) genotype relative risk parameters that increase (or decrease) the probability that a child is affected. By default, PREMIM chooses the minor allele to be considered as the 'risk' allele, although this option can be overridden if required. We denote by R 1 (R 2 ) the factor by which an individual's disease risk is multiplied if they possess one (two) risk alleles at a given locus. We denote by S 1 (S 2 ) the factor by which an individual's disease risk is multiplied if their mother possesses one (two) risk alleles at that locus. We denote by I m (I p ) the factor by which an individual's disease risk is multiplied if they inherit a risk allele from their mother (father). Lastly, to test for mother-child interactions, we denote by γ ij the factor by which an individual's disease risk is multiplied if the mother carries i risk alleles and the child carries j risk alleles. A summary of these relative risk parameters is shown in Table 2. A variety of restrictions may be made on the parameters as desired.
For example, a multiplicative model for the effects of the alleles in the mother (S 2 = S 2 1 ) or child (R 2 = R 2 1 ) may be imposed. In addition, EMIM also supports several alternative previously-proposed paramaterizations for the imprinting and interaction effects [4,5] (see [7] for more details).
As an example, denote the major and minor alleles by 1 and 2, then for a case/parent trio where the genotypes of the mother, father and child are 22, 11, 12, respectively, the penetrance is modelled as: where α is the baseline probability of disease and g m , g f and g c are the genotypes of the mother, father and child.
EMIM uses a multinomial model to estimate the relative risk parameters on the basis of observed counts of genotype combinations in case/parent trios as shown in Table 3. EMIM models the 15 different cell probabilities (corresponding to the 15 possible combinations of Mother has one minor allele and child has one minor allele (mother-child interaction effect) γ 12 Mother has one minor allele and child has two minor alleles (mother-child interaction effect) γ 21 Mother has two minor alleles and child has one minor allele (mother-child interaction effect) γ 22 Mother has two minor alleles and child has two minor alleles (mother-child interaction effect)

I m
The child receives a minor allele from the mother (maternally operating imprinting effect)

I p
The child receives a minor allele from the father (paternally operating imprinting effect) http://www.biomedcentral.com/1471-2105/13/149 genotypes that are consistent with Mendelian inheritance) in terms of the desired genotype relative risk parameters (R 1 , R 2 , S 1 , S 2 , I m , I p , γ 11 , γ 12 , γ 21 , γ 22 ). A maximum of 7 parameters are estimable, meaning that not all of these parameters can be estimated simultaneously. Cordell et al. [12] suggested building up models from simpler to more complex via a series of nested hypothesis tests. Given a model for the penetrances in terms of the genotype relative risk parameters, the overall likelihood for the data in Table 3 may be written represent the genotypes of a mother, father and child in genotype combination i. The probabilities P(g m i , g f i , g c i |child diseased) may be written in terms of the genotype relative risk parameters of interest and six nuisance parameters μ 1 − μ 6 (corresponding to mating type stratification parameters as indexed in Table 3, see [4,7,13] for details). If any of the subjects are missing, we no longer have 15 genotype counts as shown in Table 3, but instead we must collapse together rows to express the data in terms of counts of observed genotype combinations. For example, given data for case/mother duos (i.e. all fathers missing), the 7 observable counts are as shown in Table 4. The likelihood for the data in this table may be written where (g m i , g c i ) represent the genotypes of a mother and child in (Table 4) genotype combination i. Table 4 Observed genotype combinations in case/mother duos In practice, at any given SNP, we observe genotype counts (some of which may equal 0) for the following types of unit: case/parent trios (15 possible genotype combinations); parents of cases (9 possible genotype combinations); case/mother duos (7 possible combinations); case/father duos (7 possible combinations); mothers of cases (3 possible combinations); fathers of cases (3 possible combinations); cases (3 possible combinations). The data for each unit creates a table corresponding to a (possibly collapsed) version of Table 3, and the overall likelihood to be maximized may be constructed as the product of the likelihoods for the individual tables. Similarly, we may add in data for controls (either unaffected individuals or population-based controls of unknown disease status) by further multiplying the likelihood by the product of the likelihoods for a similar set of control tables. EMIM makes use of the following types of control unit: parents of controls (9 possible genotype combinations); control/mother duos (7 possible combinations); control/father duos (7 possible combinations); controls (3 possible combinations). Furthermore, EMIM assumes that the frequencies of the different genotype combinations in control units correspond to those in the general population. This is equivalent to making a rare disease assumption, in the event that the controls are all genuinely unaffected.
By default, EMIM assumes 'mating symmetry' [13] (equivalent to a 'conditional on exchangeable parental genotypes' (CEPG) [12] model), which corresponds to assuming that parental matings (g m = i, g f = j) are as likely as matings (g m = j, g f = i). This results in the estimation of six mating type stratification parameters [13] μ 1 − μ 6 (see Table 3). Two more restricted (and therefore potentially more powerful) models are also available in EMIM: 1. A model that assumes parental allelic exchangeability (PAE) [2] (which corresponds in this context to assuming that μ 4 = μ 3 ) 2. A model that assumes Hardy-Weinberg equilibrium (HWE) and random mating, estimating a single allele frequency parameter in place of the six mating type stratification parameters.
In addition to these more restricted models, a less restricted 'conditional on parental genotypes' (CPG) [2,9,12] model (that results in the estimation of nine mating type stratification parameters μ 1 − μ 9 , see Table 3) is also available. This model would be expected to be less powerful than the CEPG, PAE or HWE models, but should be more robust to any departure from mating symmetry, PAE or HWE.
EMIM reads in genotype data from input files created by PREMIM. In addition, there are two other files required by EMIM. Firstly, a file 'emimmarkers.dat' , which provides the minor allele frequencies for each SNP (used as starting values in the maximization algorithm). These can optionally be estimated by PREMIM using the pedigree data, although other (e.g. population-based) sources for this information may be preferred where available. (See [7] for an investigation of EMIM's sensitivity to misspecification of the assumed or estimated allele frequencies). The other required file is a parameter file 'emimparams.dat' , describing the type of analysis that EMIM should perform, which parameters to estimate, and which assumptions (such as HWE or PAE) should be made.

Implementation
PREMIM is written in C++ and for a binary pedigree file with 913 pedigrees, 1730 subjects and 45323 SNPs it takes 19 seconds to process on a Six-Core AMD Opteron TM Processor with 2.6 GHz CPUs. EMIM is written in FORTRAN 77 and makes use of a subroutine MAXFUN, originally written as part of the S.A.G.E. [14] package. For these same data (pre-processed by PREMIM) on the same machine, EMIM takes 1 minute and 22 seconds to perform an analysis to test for multiplicative child genotype effects, assuming HWE. For larger data sets, EMIM and PREMIM have options that allow easy parallel processing by dividing the SNPs to analyse into different batches.

Example analysis using simulated data
We used the program SimPed [15] to generate a single replicate of simulated data for 200 case/parent trios, 200 case/mother duos, 200 control/mother duos and 1000 unrelated controls at 8000 SNPs across a chromosome. We used a simplified linkage disequilibrium (LD) model that assumed LD operated in haplotype blocks, each of length 8 SNPs. We simulated child genotype effects (R 1 = 1.5 and R 2 = 2.25) at SNP 76 and maternal genotype effects (S 1 = 2 and S 2 = 3) at SNP 6004. We then used EMIM to test for maternal effects, with and without allowing for child genotype effects ( Figure 1C, Figure 1A), and to test for child genotype effects, with and without allowing for maternal effects ( Figure 1D, Figure 1B). In all four analyses, we see a strong signal at the correct location, with the high significance probably due to the relatively large effect sizes assumed.
A tutorial for this example (with a listing of the required commands) is available on the PREMIM and EMIM website: http://www.staff.ncl.ac.uk/richard.howey/emim/ example.html

Comparison of HWE, PAE, CEPG and CPG likelihoods
The power to detect genetic effects can vary depending on the assumptions made. As a demonstration, we http://www.biomedcentral.com/1471-2105/13/149  simulated 1000 replicates of data at a single SNP for a sample consisting of 50 of each of the following units: case/parent trios, case/mother duos, case/father duos, control matings, control/mother duos and control/father duos. We assumed either a child genotype effect (R 2 = 2), a maternal genotype effect (S 2 = 2), or a maternal imprinting effect (I m = 1.8). PREMIM and EMIM were used to estimate the parameters R 1 , R 2 , S 1 , S 2 and I m for each different likelihood assumption and for each set of simulated data. Figure 2(A-C) shows that the power to detect the relevant effect decreases as one makes less restrictive (but potentially more robust) assumptions, while Figure 2(D-E) shows that unbiased parameter estimation is achieved using the most restrictive assumption (HWE) (provided that assumption is correct). Similar unbiased parameter estimation is achieved for the other likelihood assumptions, when they are met (data not shown).

Effect of missing data on power
As a demonstration of the effect that missing data has on the power, we performed analyses at a single SNP using simulated data (10,000 replicates, each replicate consisting of 100 case/parent trios and 100 control/parent trios) and assuming a range of probabilities of missing genotype data. We assumed a maternal genotype effect (S 1 = 1.5, S 2 = 2.25). The expected proportion of pedigree units of different types remaining in the analysis are shown in Figure 3A and Figure 3B respectively. The trios are all present when there is no missing data, but the expected proportion quickly decreases when the probability of missing genotype data is increased. The expected proportion of the other pedigree types then increases, but subsequently decreases and converges to 0 as the probability of missing data approaches 1. The power to detect the maternal genetic effects (when correctly modelled) also decreases with increasing proportion of missing data http://www.biomedcentral.com/1471-2105/13/149  ( Figure 3C). An advantage of the EMIM framework is that it makes efficient use of data from all possible available individuals, allowing one to recover information even from incompletely genotyped trios.

HWE PAE CEPG CPG
Buyske [16] pointed out that maternal genotype effects can masquerade as child genotype effects, if analysed as such. If the maternal genetic effects are incorrectly modelled as child genetic effects ( Figure 3D), we find limited power to detect these effects even when there is no missing data. Increasing the proportion of missing data has little effect on the power of this analysis, until the probability of missing genotype data becomes very large (e.g. more than 80%).

Comparison with MENDEL
Several other software packages exist that allow testing and estimation of genotype relative risk parameters similar to those tested in EMIM. One such package is MENDEL [17]. MENDEL most easily allows the estimation and testing of mother-child interaction effects via the maternal-fetal genotype incompatibility (MFG) test [5], although a "Generalized Risk" analysis that allows implementation of more complex user-defined paramaterizations (through the imposition of various parameter restrictions) is also available.
We used computer simulations (500 replicates each with 200 case parent trios) to compare the performance of MENDEL and EMIM under three different comparable models: 1. Model 1. This model has been used to test for RhD incompatibility [18] and estimates the relative risk corresponding to the mother having no risk alleles and the child one risk allele. MENDEL was used to estimate this one relative risk parameter by setting the sex-specific effects (parameters MFG M and MFG F in MENDEL) to be equal. The equivalent single parameter γ 01 (corresponding to the parametrization of [5,18]) was estimated in EMIM. The data were simulated assuming γ 01 = 2. 2. Model 2. This model has been used to test for noninherited maternal antigens (NIMA) on rheumatoid arthritis (RA) [19] and consists of three parameters (ignoring sex-specific MFG testing): a relative risk parameter (γ 10 ) for MFG when the mother has one risk allele and the child has no risk alleles, and two parameters for child effects when the child has one or two risk alleles.  effect γ 10 = 2. The power to detect the the MFG effect in either MENDEL or EMIM was calculated by considering twice the difference between the negative log likelihood from a model that includes all three parameters (R 1 , R 2 and the MFG parameter) and that from a model where the MFG parameter has been removed. 3. Model 3. This MENDEL model is a general MFG test consisting of one relative risk parameter for each of the 7 mother/child genotype combinations. The relative risk parameter denoted U 00 in the MENDEL documentation (corresponding to the situation where the mother and child have no risk alleles) was set to 1 and not estimated to avoid over-parametrization. The other 6 parameters, U 22, U 21, U 12, U 11, U 10, U 01, were estimated. The 6 parameters estimated by EMIM were R 1 , R 2 , S 1 , S 2 , γ 11 and γ 22 . These parameters are not indvidually equivalent to the 6 MENDEL parameters, but the models as a whole can be shown to be equivalent. Data for this comparison were simulated assuming R 1 = S 1 = γ 11 = γ 22 = 1.5 and R 2 = S 2 = 2.25. One difference between EMIM and MENDEL was the time taken to perform the analysis, with EMIM performing considerably quicker than MENDEL. For example, the time to run model 3 (with 200 case/parent trios) showed that PREMIM and EMIM combined took 0.0257 seconds and MENDEL took 6.45 seconds (averaged over 300 runs). This shows that PREMIM and EMIM combined were approximately 250 times faster than MENDEL in this example. The same analysis with 400 case/parent trios gave times of 0.0302 seconds for PREMIM and EMIM combined and 14. 300 runs), showing PREMIM and EMIM to be approximately 472 times faster then MENDEL. A possible reason for the difference in running times is the fact that the extended MFG model [11] implemented in MENDEL is a slightly more complicated model than the parent/offspring trio model implemented in EMIM (thus providing MENDEL with the ability to analyse larger pedigrees).

Comparison with LEM
Another program with the capability to analyse complex genetic effects (most notably mother/child/imprinting effects) is LEM [20]. LEM is a Windows-based log-linear modelling program designed primarily to be used via a graphical user interface, although it is possible to run it from the DOS command line, in order to implement scripts that allow the analysis of large numbers of loci or replicates. LEM takes an input parameter file which defines the model, the parameters to be estimated and the name of the input data file. We created input parameter and data files based on examples provided by the authors of LEM [20] for case/parent trios and by [21] for case/mother and control/mother duos.
1. Case/parent trios. SimPed [15] was used to simulate a single replicate of data at 8000 SNPs across a chromosome for 4000 case/parent trios. Child effects (R 1 = 1.5, R 2 = 2.25) were simulated at SNP number 1004 and maternal effects (S 1 = 2, S 2 = 3) were simulated at SNP number 6004. In both EMIM and LEM we tested for maternal effects while allowing for child and maternal imprinting effects (i.e. we compared an alternative 5-parameter model (R 1 , R 2 , S 1 , S 2 , I m ) with a null 3-parameter model (R 1 , R 2 , I m )). We calculated the p-value for LEM on the basis of the reported log likelihoods by using the Wald http://www.biomedcentral.com/1471-2105/13/149 statistic as a χ 2 value with 2 degrees of freedom. (The p-value reported by LEM was not suitable as it is only given to 3 decimal places, which was insufficient for SNPs with p-values less than 10 −3 ). 2. Case/mother duos and control/mother duos.
Again, data were simulated at 8000 SNPs but this time for 2000 case/mother duos and 2000 control/mother duos. Child effects (R 1 = 1.5, R 2 = 2.25) were simulated at SNP number 1000 and maternal effects (S 1 = 2, S 2 = 3) were simulated at SNP number 6004. In both EMIM and LEM we tested for maternal and child effects i.e. we compared a null model with no fitted parameters to an alternative model with parameters (R 1 , R 2 , S 1 , S 2 ).
A comparison of EMIM versus LEM for the case/mother and control/mother duos is shown in Figure 6. show that R 2 and S 2 are also approximately equal, but with more variability. Figure 7 shows the same plots for the case/parent trios, but with the addition of estimates for the extra parameter I m . We see that the p-values and parameter estimates provided by the two programs are virtually indistinguishable.
These results indicate that the inference provided by LEM and EMIM is essentially identical. This is as expected given the mathematical equivalence [7,22] between the multinomial model fit by EMIM and the log linear model fit by LEM. The main difference between the programs is the time taken to perform the analysis, with EMIM performing considerably quicker than LEM. For example, the http://www.biomedcentral.com/1471-2105/13/149  The improved speed for the LEM trios analysis was most likely due to the fact that it took fewer steps than the duos analysis during the likelihood maximization process (possibly on account of the fact that the example parameter file we were using requested the program to switch to using a Newton-Raphson algorithm following 10 iterations of an EM algorithm). It is possible that differences between maximization algorithms and convergence criteria could account for some of the differences in speed between PREMIM/EMIM and LEM; we found it difficult to determine how to obtain precise control over such factors in LEM and were forced to use input files that very closely matched the examples provided by [20,21]. Another factor influencing speed could be the fact that LEM does not (as far as we are aware) allow the input of multiple SNPs simultaneously, meaning that we had to create and read into LEM a separate input file for each SNP analysed.

Conclusions
Here we have presented two new computer tools, PRE-MIM and EMIM, for the estimation of parental and child genetic effects, based on genotype data from a variety of different child-parent configurations. The current version of EMIM improves upon the early beta version described in [7] by allowing a larger set of possible childparent configurations, a larger range of optional likelihood assumptions, and by the development of the companion program, PREMIM, for generating the required input files from standard PLINK-format files, considerably improving the ease with which EMIM can be applied to real data. In application to simulated data, we have shown that the inference provided by EMIM is essentially equivalent to that provided by alternative (competing) software packages such as MENDEL and LEM. EMIM does have the advantage of allowing easy implementation of a wider class of models than are most easily implemented in MENDEL and LEM, although the expert MENDEL/LEM user could probably achieve the same model flexibility through judicious choice of parameter restrictions. However, PREMIM and EMIM (used in combination) considerably outperform MENDEL and LEM in terms of speed of execution, an advantage that is likely to be all the more important when applying these approaches to large-scale data sets such as those generated in genome-wide association studies. To allow further increases in speed, PREMIM and EMIM also have the advantage of allowing easy parallel processing (e.g. on a computer cluster) by dividing the SNPs to analyse into different batches.
Limitations of PREMIM and EMIM include the fact that larger pedigrees are divided into case/parent or control/parent trios (or smaller sub-units) prior to analysis, and the fact that SNPs are analysed one at a time, without borrowing information from neighbouring markers (e.g. on the basis of regional linkage disequilibrium patterns). Methods for dealing with larger pedigrees, valid under the assumptions of random mating and/or Hardy-Weinberg equilibrium (HWE), have been described by [10,11], while [23] present an approach that models haplotypes rather than individual SNPs, allowing the borrowing of information (including information on parent-of-origin or missing genotype data) across neighbouring SNPs. Both of these features would be valuable additions to future releases of our software. Nevertheless, the current versions of EMIM and PREMIM provide easy-to-use command-line tools for the analysis of pedigree data, allowing testing and estimation of a variety of parental and child genotype relative risks.

Availability and requirements
Project name: EMIM and PREMIM Project home page: http://www.staff.ncl.ac.uk/richard. howey/emim/ Operating systems: Windows and Linux executables; FORTRAN and C++ source code Programming language: FORTRAN and C++ Other requirements: None Licence: GNU General Public License Any restrictions to use by non-academics: None