A quantitative genetic and epigenetic model of complex traits

Background: Despite our increasing understanding of the mechanisms that specify and propagate epigenetic states of gene expression, how epigenetic modifications contribute to the overall genetic variation of a phenotypic trait remains largely elusive. Results: We construct a quantitative model to explore the effect of epigenetic modifications that occur at specific rates on the genome. This model, which extends the traditional quantitative genetic theory founded on Mendel's laws, allows questions concerning the prevalence and importance of epigenetic variation to be incorporated and addressed. Conclusions: The model provides a new avenue for bringing chromatin inheritance into the realm of complex traits, facilitating our understanding of the means by which phenotypic variation is generated.


Background
Systematic or stochastic changes in chromatin states, such as DNA methylation, chromatin remodeling, histone modification and RNA interference, have been thought to provide an additional driving force for phenotypic variation in complex traits and diseases [1][2][3][4][5][6][7][8][9]. Different chromatin states, called epialleles, that occur in the same sequence allele cannot be captured by an analysis based on DNA sequence alone [10]. The increasing availability of epigenome technologies offers an unprecedented opportunity to understand the role of epiallelic variants in inducing and maintaining functional variation that helps organisms buffer against environmental perturbations. This calls for quantitative models that can characterize the amount and pattern of quantitative variation determined by epialleles. Integrated with linkage or association mapping strategies, these models can recover epigenetic variation that cannot presently be estimated [10][11][12][13].
Several methods for detecting epigenetic variation have been published [14][15][16][17]. Johannes and Colome-Tatche [16] proposed an approach for estimating epigenetic variation in experimental crosses derived from epigenomically perturbed isogenic lines. This approach can model the effects of epiallelic instability, recombination, parent-of-origin effects, and transgressive segregation on phenotypic variation across generations. Tal et al. [15] derived an expression for the covariances between relatives due to epigenetic transmissibility. A statistical model based on multiple testing procedures has been developed to identify genomic regions of epigenetic variability among individuals from genome-wide DNA methylation data [18]. These models, in combination with empirical studies, can be used to test the hypothesis that epigenetic variation arising directly or indirectly from chromatin modifications of DNA is an important contributor to the missing heritability [17,19].
Despite these advances, it remains unclear how much of the phenotypic variation is contributed by epigenetic modifications and, more importantly, through which mechanisms epialleles exert their effects on phenotypic values. The motivation of this article is to develop a quantitative model for estimating and testing the contribution of epigenetic variants to quantitative trait variation. The model predicts how much genetic variation is produced by a change in the rate of occurrence of epigenetic mutation and by the effect of epigenetic factors in a natural population. We particularly discuss how the epigenetic effect interacts with other genetic effects, such as additive and dominance effects, to affect phenotypic traits. When implemented in genome-wide association studies [19], the proposed model provides useful guidance for designing efficient and effective molecular experiments to characterize a comprehensive picture of the epigenetic variation of complex traits or diseases in different organisms.

Occurrence rate of methylation
Consider an epigenetic study population of $n$ individuals randomly drawn from a natural population, in which a nucleotide site with two alleles, $A_1$ and $A_2$, is thought to affect a phenotypic trait. Let $p$ and $q$ ($p + q = 1$) denote the frequencies of alleles $A_1$ and $A_2$, respectively, in the natural population at Hardy-Weinberg equilibrium (HWE). The frequencies of genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ at the nucleotide site studied are then $p^2$, $2pq$, and $q^2$, respectively [20,21].
At the nucleotide site studied, some cytosines within a CpG dinucleotide are methylated by the addition of a methyl group to the 5 position of the cytosine pyrimidine ring. Without loss of generality, we assume that allele $A_1$ is a cytosine that is methylated, at a rate $u$, into a new "allele" called the epiallele, denoted $A_e$. After DNA methylation, the population frequencies of the non-methylated allele $A_1$, the epiallele $A_e$, and allele $A_2$ are $(1-u)p$, $up$, and $q$, respectively. Current technologies allow epialleles to be distinguished from non-methylated alleles. The process of methylation and the resulting frequencies of the six distinguishable genetic and epigenetic types are given by equation (1), where $D_{12}$, $D_{1e}$, and $D_{2e}$ are the coefficients of Hardy-Weinberg disequilibrium (HWD) due to a non-random association between alleles $A_1$ and $A_2$, between allele $A_1$ and epiallele $A_e$, and between allele $A_2$ and epiallele $A_e$, respectively. It is possible that the previous equilibrium of the population is violated by DNA methylation, leading to the HWD quantified by $D_{12}$, $D_{1e}$, and $D_{2e}$. Thus, the genotype and epigenotype frequencies are determined by the allele and epiallele frequencies and the HWD coefficients. Let $n_{11}$, $n_{1e}$, $n_{ee}$, $n_{12}$, $n_{2e}$, and $n_{22}$ ($n_{11} + n_{1e} + n_{ee} + n_{12} + n_{2e} + n_{22} = n$) denote the observed numbers of the corresponding genotypes/epigenotypes (1) in the study population. Based on the frequencies of these genotypes/epigenotypes, we formulate a multinomial likelihood from which to obtain the maximum likelihood estimates (MLEs) of the allele frequencies, the occurrence rate of methylation, and the HWD coefficients; for example, the estimator of $D_{2e}$ is
$$\hat{D}_{2e} = \hat{u}\hat{p}\hat{q} - \frac{n_{2e}}{2n}. \tag{6}$$
We are interested in investigating whether there is significant occurrence of DNA methylation at the nucleotide site. This can be tested by formulating a null hypothesis, $H_0: u = 0$, vs. an alternative hypothesis, $H_1: u \neq 0$, under which the likelihoods $L_0$ and $L_1$ are calculated, respectively.
However, because the value $u = 0$ under $H_0$ lies on the boundary of the parameter space, the log-likelihood ratio, $LR = -2\ln(L_0/L_1)$, may not follow a standard chi-square distribution. Self and Liang [22] showed that the null distribution of the LR test statistic is a mixture of projections of chi-square variables onto surfaces, with mixture weights that can be derived analytically only in special cases. By establishing the asymptotic null and alternative distributions of the quasi-likelihood ratio, rescaled quasi-likelihood ratio, Wald, and score tests, Andrews [23] suggested the use of these test statistics to test the boundary value of a model parameter.
While the first three test statistics are easy to compute, the score test is more difficult because it requires the first- and second-order derivatives of the alternative log-likelihood. Similar tests can be performed for the individual HWD coefficients $D_{1e}$, $D_{2e}$, or $D_{12}$, or for their combinations, by formulating the corresponding null hypotheses. Under the alternative hypothesis $H_1$ associated with each null hypothesis considered, the likelihood is calculated. The resulting LR value is assumed to be asymptotically chi-square distributed with degrees of freedom equal to the difference in the number of parameters to be estimated between the alternative and null hypotheses.
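The estimation step can be sketched in Python. This is a minimal illustration under an assumed parameterization, not necessarily the form used in equation (1): each heterozygote frequency is taken to be its Hardy-Weinberg expectation minus twice the corresponding HWD coefficient, and each homozygote gains the two coefficients that involve its allele. This choice is made only because it reproduces estimator (6) and preserves the allele frequencies $(1-u)p$, $up$, and $q$. The function names `genotype_frequencies` and `estimate_parameters` are hypothetical, and the estimates are obtained by allele counting, which should coincide with the MLEs when they fall inside the parameter space.

```python
def genotype_frequencies(p, u, d12, d1e, d2e):
    """Expected frequencies of the six genotype/epigenotype classes.

    ASSUMPTION: standard multi-allelic HWD parameterization, chosen to be
    consistent with estimator (6); not copied from the paper's equation (1).
    """
    q = 1.0 - p
    p1, pe = (1.0 - u) * p, u * p  # non-methylated A1 and epiallele Ae
    return {
        "11": p1 * p1 + d12 + d1e,
        "1e": 2 * p1 * pe - 2 * d1e,
        "ee": pe * pe + d1e + d2e,
        "12": 2 * p1 * q - 2 * d12,
        "2e": 2 * pe * q - 2 * d2e,
        "22": q * q + d12 + d2e,
    }


def estimate_parameters(counts):
    """Allele-counting estimates of p, q, u and the three HWD coefficients."""
    two_n = 2.0 * sum(counts.values())
    # Allele frequencies: homozygotes carry two copies, heterozygotes one.
    f1 = (2 * counts["11"] + counts["1e"] + counts["12"]) / two_n  # (1-u)p
    fe = (2 * counts["ee"] + counts["1e"] + counts["2e"]) / two_n  # up
    q = (2 * counts["22"] + counts["12"] + counts["2e"]) / two_n
    p = f1 + fe
    u = fe / p if p > 0 else 0.0
    return {
        "p": p,
        "q": q,
        "u": u,
        "D12": f1 * q - counts["12"] / two_n,
        "D1e": f1 * fe - counts["1e"] / two_n,
        "D2e": fe * q - counts["2e"] / two_n,  # same form as estimator (6)
    }


if __name__ == "__main__":
    # Hypothetical genotype counts, for illustration only.
    observed = {"11": 245, "1e": 105, "ee": 16, "12": 364, "2e": 104, "22": 166}
    print(estimate_parameters(observed))
```

Feeding the expected frequencies from `genotype_frequencies` back into `estimate_parameters` recovers the input parameters, which is a quick consistency check on the assumed parameterization.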

Genetic and epigenetic effect
We assume that the study population is investigated under a uniform condition so that the phenotypic variation can be simply partitioned into genetic/epigenetic components and errors. There are only three genotypes, $A_1A_1$, $A_1A_2$, and $A_2A_2$, prior to DNA methylation. Let $a$ denote the additive effect of the nucleotide site due to the substitution of allele $A_1$ by $A_2$ or vice versa, and let $d$ denote the dominance effect due to the interaction between the two alleles. Expressed as deviations from the midpoint of the two homozygotes, the values of the three genotypes are $a$ for $A_1A_1$, $d$ for $A_1A_2$, and $-a$ for $A_2A_2$. As described above, allele $A_1$ is assumed to be methylated into the epiallele $A_e$. The values of the six distinguishable genetic and epigenetic types are given by equation (9), in which the genotypic value of the trait is decomposed into different components, i.e., the overall mean ($\mu$), the additive effects due to the substitution of allele $A_1$ ($a_1$) and epiallele $A_e$ ($a_e$) by allele $A_2$, and the dominance effects due to the interaction between allele $A_1$ and epiallele $A_e$ ($d_{1e}$), between allele $A_1$ and allele $A_2$ ($d_{12}$), and between allele $A_2$ and epiallele $A_e$ ($d_{2e}$).
Let $y_i$ denote the phenotypic value of the trait for individual $i$ ($i = 1, \ldots, n$) in the study population. The MLE of the genotypic value for each genotype/epigenotype can be obtained by simply taking the mean over all individuals belonging to that genotype/epigenotype (9). The genetic and epigenetic effects can then be estimated by solving the system of equations (9) for the genotypic values, which yields the estimators (10)-(14). Each of these effects can be tested by the log-likelihood ratio approach. For an epigenetic study, we are more interested in testing the epigenetic effect of the nucleotide site, $a_e$, and the dominance effects due to the interactions between the alleles and the epiallele, $d_{1e}$ and $d_{2e}$.
The log-likelihood ratio test statistic for each hypothesis test is assumed to be asymptotically chi-square distributed with degrees of freedom equal to the difference in the number of parameters to be estimated between the alternative and null hypotheses.
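A joint test of the three epigenetic effects can be sketched as follows. The sketch makes two assumptions that are not stated in the text: residuals are normal with a common variance, and the null hypothesis of no epigenetic effect means that the epiallele is phenotypically indistinguishable from the non-methylated $A_1$ allele, so that the six classes collapse to three. Under those assumptions the alternative fits six class means and the null fits three, giving three degrees of freedom. The names `COLLAPSE`, `residual_ss`, `chi2_sf_3df`, and `epigenetic_lrt` are hypothetical.

```python
import math
from collections import defaultdict

# ASSUMED null model: each epigenotype is merged with its non-methylated counterpart.
COLLAPSE = {"11": "11", "1e": "11", "ee": "11", "12": "12", "2e": "12", "22": "22"}


def residual_ss(labels, y):
    """Residual sum of squares after fitting one mean per label."""
    groups = defaultdict(list)
    for label, value in zip(labels, y):
        groups[label].append(value)
    rss = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        rss += sum((v - mean) ** 2 for v in values)
    return rss


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 df (closed form)."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


def epigenetic_lrt(genotypes, y):
    """Return (LR, p-value) for the joint test of epigenetic effects.

    For normal errors with a common variance, -2 ln(L0 / L1) = n * ln(RSS0 / RSS1).
    The 3-df reference distribution assumes all six classes are observed
    and that the full-model residual sum of squares is positive.
    """
    n = len(y)
    rss_full = residual_ss(genotypes, y)
    rss_null = residual_ss([COLLAPSE[g] for g in genotypes], y)
    lr = n * math.log(rss_null / rss_full)
    return lr, chi2_sf_3df(lr)
```

Because the null model is nested in the full model, the statistic is never negative; comparing it with the 3-df chi-square distribution follows the asymptotic argument in the text.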

Genetic and epigenetic variation
We first give the genetic variance explained by the nucleotide site studied prior to DNA methylation. By defining a new parameter called the average effect, $\alpha = a + (q - p)d$ [20], we derive the overall genetic variance of the trait due to this site as
$$\sigma_g^2 = \sigma_a^2 + \sigma_d^2, \tag{15}$$
where $\sigma_a^2 = 2pq\alpha^2$ is the additive genetic variance, which depends on both $a$ and $d$, and $\sigma_d^2 = (2pqd)^2$ is the dominance genetic variance, which depends only on $d$. Both variances are affected by the relative magnitudes of the allele frequencies $p$ and $q$. The dominance variance reaches its maximum when the two alleles $A_1$ and $A_2$ occur at the same frequency, as does the additive variance when there is no dominance ($d = 0$).
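The decomposition for the unmethylated site can be checked numerically. The sketch below assumes the usual coding in which genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ take the values $a$, $d$, and $-a$ with Hardy-Weinberg frequencies, and compares the sum of the additive and dominance variances with the variance computed directly from the three genotypic values. The function names are hypothetical.

```python
def additive_dominance_variances(p, a, d):
    """Additive and dominance variances from the average effect alpha."""
    q = 1.0 - p
    alpha = a + (q - p) * d          # average effect of allele substitution
    var_additive = 2 * p * q * alpha ** 2
    var_dominance = (2 * p * q * d) ** 2
    return var_additive, var_dominance


def direct_genetic_variance(p, a, d):
    """Variance of genotypic values computed directly under HWE."""
    q = 1.0 - p
    freqs = (p * p, 2 * p * q, q * q)  # A1A1, A1A2, A2A2
    values = (a, d, -a)                # ASSUMED coding around the midpoint
    mean = sum(f * v for f, v in zip(freqs, values))
    return sum(f * (v - mean) ** 2 for f, v in zip(freqs, values))


if __name__ == "__main__":
    va, vd = additive_dominance_variances(p=0.3, a=1.0, d=0.5)
    print(va, vd, va + vd, direct_genetic_variance(0.3, 1.0, 0.5))
```

For any allele frequency and effect sizes, the two routes should agree up to floating-point error.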
In what follows, we model how the epigenetic change contributes to the genetic variance of a complex trait based on the frequencies (1) and values (9) of the genotypes/epigenotypes. The total genetic variance among the six genotypes/epigenotypes is given by equation (16), in which $m$ denotes the population mean. It can be seen from equation (16) that the total genetic variance comprises 15 different parts. Here, we define a new heritability, called the epigenetic heritability, which describes the proportion of the phenotypic variance explained by the effect of the epiallele and its interactions with the other effects. We also use the proportion of the epigenetic variance to the total genetic variance to describe the relative contribution of epigenetic methylation to the overall genetic variance. These two parameters can be used to assess the contribution of DNA methylation to the total phenotypic variation of a quantitative trait.
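The population mean and total genetic variance over the six classes follow directly from their frequencies and genotypic values. The sketch below computes both, together with a broad-sense heritability for a given residual variance. It does not attempt the 15-part decomposition, and the class labels, function names, and numerical values are hypothetical illustrations rather than quantities from the paper.

```python
def genetic_mean_and_variance(freqs, values):
    """Frequency-weighted mean and variance of genotypic values."""
    mean = sum(freqs[k] * values[k] for k in freqs)
    variance = sum(freqs[k] * (values[k] - mean) ** 2 for k in freqs)
    return mean, variance


def heritability(genetic_variance, residual_variance):
    """Proportion of the phenotypic variance that is genetic."""
    return genetic_variance / (genetic_variance + residual_variance)


if __name__ == "__main__":
    # Hypothetical class frequencies (summing to 1) and genotypic values.
    freqs = {"11": 0.2454, "1e": 0.1052, "ee": 0.0154,
             "12": 0.3640, "2e": 0.1040, "22": 0.1660}
    values = {"11": 1.0, "1e": 1.2, "ee": 1.5,
              "12": 0.4, "2e": 0.6, "22": -1.0}
    m, var_g = genetic_mean_and_variance(freqs, values)
    print(m, var_g, heritability(var_g, residual_variance=1.0))
```

Varying the frequencies or the values of the epigenotype classes in such a calculation is one simple way to see how the total genetic variance responds to methylation.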

Numerical analysis
In this section, we perform numerical analyses to investigate how epigenetic marks contribute to the heritability of a complex trait. The occurrence of epigenetic marks is described by population genetic parameters, including the occurrence rate of the epiallele and its Hardy-Weinberg disequilibria with unmarked alleles. The effect of epigenetic marks is specified by quantitative genetic parameters, including the epigenetic effect of the epiallele and its interactions with other effects. As analyzed above, the population genetic parameters ($p$, $q$, $u$, $D_{1e}$, $D_{2e}$, $D_{12}$) and quantitative genetic parameters ($a_1$, $a_e$, $d_{1e}$, $d_{2e}$, $d_{12}$) contribute to the genetic variance in a complex way (16). We analyze the contribution of epigenetic marks by separately investigating how these population and quantitative genetic parameters affect $R_e^2$.

Population genetic effect
Suppose there is a study population in which methylated sites are observed for a phenotypic trait. Consider a nucleotide site with two alleles, $A_1$ and $A_2$, one of which, say $A_1$, is methylated at a rate $u$ ($u$ takes any value in $[0,1]$). This methylation may violate the previous HWE assumption. Based on a simple algebraic analysis, we obtain the admissible intervals of $D_{1e}$, $D_{2e}$, and $D_{12}$. Because of DNA methylation, the genetic variance explained by the site changes. By fixing the quantitative genetic parameters, we quantitatively examined the impacts of different occurrence rates of methylation and different HWD coefficients on the epigenetic variance. A small occurrence rate may lead to substantial epigenetic variance, although this depends on the degree of disequilibrium between the two original alleles that arises following methylation (Figure 1). The epigenetic variance is also positively associated with the degree of disequilibrium between the unmarked alleles and the epiallele (Figure 2).

Quantitative genetic effect
By fixing the population genetic parameters, we investigated the influence of the genetic effects triggered by the epiallele. Even a small additive effect $a_e$ of the epiallele brings about considerable epigenetic variance (Figure 3), and this influence increases with increasing $a_e$. The epigenetic variance is also markedly affected by the dominance effects between the original alleles and the epiallele (Figure 4). These effect parameters also contribute to the epigenetic variance through their complex interactions.

Computer simulation
Our model allows the estimation and testing of epigenetic effects. We carried out simulation studies to examine the statistical properties of the model. A study population was simulated by assuming a set of population and quantitative genetic parameters and a normally distributed residual error with mean zero and variance scaled under a range of trait heritabilities. As expected, the estimation precision increases with increasing sample size and heritability. A sample size of 400 is sufficient to provide reasonable estimates of all population genetic parameters (Table 1). Note that the estimation precision of the population parameters does not depend on the heritability. In general, reasonable estimation of the quantitative genetic parameters, especially the dominance effects, needs a much larger sample size, say 1000 (Table 1). As expected, the estimation precision of genetic effects is sensitive to heritability. In practice, every effort should be made to measure the phenotypic trait precisely in order to increase the heritability.
We also investigated the power of detecting epiallelic HWD and epigenetic effects, as well as the false positive rates for epigenetic effect identification, under different heritabilities and sample sizes (Table 2). Given a medium sample size of 400, the model possesses adequate power (> 0.95) for the detection of small epiallelic HWD coefficients, along with small false positive rates (< 0.10). The power of the model to detect epigenetic effects was calculated by testing the hypothesis $H_0: a_e = d_{1e} = d_{2e} = 0$ vs. $H_1$: at least one of the effects in $H_0$ is not equal to zero, and comparing the resulting log-likelihood ratio test statistic with the critical threshold of a chi-square distribution with three degrees of freedom. The proportion of simulation replicates that reject the null hypothesis is used as the empirical power of the model. The power of epigenetic effect detection is very sensitive to the magnitude of the epigenetic effect, the heritability, and the sample size (Table 2). When the epigenetic effect is small, the model has low power to detect it, although the power increases with increasing heritability and sample size. To detect a small epigenetic effect, a large sample size (2000 or more) is required even for a precisely measured phenotype (with a large heritability). For a medium-sized epigenetic effect, a sample size of 1000 may be adequate for its detection if the phenotype is precisely measured. In general, the model has reasonably small false positive rates even for a medium sample size (Table 2).
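The data-generating step of such a simulation can be sketched as follows. The sketch draws genotype/epigenotype classes from their assumed frequencies and adds a normal residual whose variance is scaled so that the genetic variance accounts for a chosen fraction $H^2$ of the phenotypic variance, i.e., $\sigma_e^2 = \sigma_g^2 (1 - H^2)/H^2$. The function name, class labels, and parameter values are hypothetical and are not the settings behind Tables 1 and 2.

```python
import random


def simulate_population(n, freqs, values, heritability, seed=None):
    """Simulate class labels and phenotypes for n individuals.

    freqs and values map each class label to its frequency and genotypic
    value; heritability is the target ratio of genetic to phenotypic variance.
    """
    if not 0.0 < heritability <= 1.0:
        raise ValueError("heritability must lie in (0, 1]")
    rng = random.Random(seed)
    classes = list(freqs)
    mean = sum(freqs[k] * values[k] for k in classes)
    var_g = sum(freqs[k] * (values[k] - mean) ** 2 for k in classes)
    # Residual variance chosen so that var_g / (var_g + var_e) = heritability.
    var_e = var_g * (1.0 - heritability) / heritability
    sd_e = var_e ** 0.5
    genotypes = rng.choices(classes, weights=[freqs[k] for k in classes], k=n)
    phenotypes = [values[g] + rng.gauss(0.0, sd_e) for g in genotypes]
    return genotypes, phenotypes


if __name__ == "__main__":
    freqs = {"11": 0.2454, "1e": 0.1052, "ee": 0.0154,
             "12": 0.3640, "2e": 0.1040, "22": 0.1660}
    values = {"11": 1.0, "1e": 1.2, "ee": 1.5,
              "12": 0.4, "2e": 0.6, "22": -1.0}
    g, y = simulate_population(400, freqs, values, heritability=0.4, seed=1)
    print(g[:5], y[:5])
```

Repeating this simulation many times and recording how often a test rejects its null hypothesis gives an empirical power or false positive rate of the kind summarized in the text.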

Implementing the epigenetic model into GWAS
The proposed epigenetic model can be incorporated into genome-wide association studies (GWAS). In GWAS, it is likely that a million methylated sites are detected throughout the entire genome on a much smaller number of samples. Moreover, samples collected for human GWAS are highly heterogeneous in terms of genetic background, gender, age, race, and many other demographic characteristics. These demographic factors should be modeled as covariates. For a single methylated site, we can build a linear model to describe the phenotypic value of individual $i$ by considering its multifactorial determinants, expressed as
$$y_i = \mu + \xi_{i1}a_1 + \xi_{i2}a_e + \xi_{i3}d_{1e} + \xi_{i4}d_{12} + \xi_{i5}d_{2e} + \sum_{r=1}^{R}\alpha_r u_{ir} + \sum_{s=1}^{S}\sum_{l=1}^{L_s} v_{sl}x_{isl} + e_i, \tag{19}$$
where $\mu$ is the overall mean, $\xi_{i1}, \ldots, \xi_{i5}$ are the indicator variables for subject $i$ that correspond to the specific genetic or epigenetic effects at a methylated site, $u_{ir}$ ($r = 1, \ldots, R$) is the value of the $r$th continuous covariate, such as age and BMI, for subject $i$, $\alpha_r$ is the effect of the $r$th continuous covariate, $v_{sl}$ ($l = 1, \ldots, L_s$; $s = 1, \ldots, S$) is the effect of the $l$th level of the $s$th discrete covariate, such as race, gender, and treatment, with $\sum_{l=1}^{L_s} v_{sl} = 0$, where $L_s$ is the number of levels of the $s$th discrete covariate, $x_{isl}$ is an indicator variable of whether subject $i$ receives the $l$th level of the $s$th discrete covariate, and $e_i$ is a random error.
A standard multiple linear regression approach can be used to estimate all the effects described in model (19). If the test is made individually for each of the methylated sites, the significance of each effect should be adjusted by multiple comparison approaches such as Bonferroni or FDR.
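The multiple-comparison step can be sketched as follows. The text does not specify which FDR procedure is intended; the Benjamini-Hochberg step-up procedure is used here as one common choice, alongside the Bonferroni correction. Both functions take the per-site p-values and return adjusted p-values in the original order; the function names are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (one way to control the FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    site_pvalues = [0.01, 0.04, 0.03, 0.5]  # hypothetical per-site p-values
    print(bonferroni(site_pvalues))
    print(benjamini_hochberg(site_pvalues))
```

A site would then be declared significant when its adjusted p-value falls below the chosen level, for example 0.05.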
Analyzing one methylated site at a time is of limited use for drawing a comprehensive picture of the genetic and epigenetic architecture of complex phenotypes. The best way to obtain such a picture is to analyze all sites simultaneously. Li et al. [24] proposed a new approach that incorporates the least absolute shrinkage and selection operator (lasso) [25] to simultaneously analyze a large number of variables using a much smaller sample size. A detailed algorithm for the Bayesian lasso has been derived [24] and can be readily applied to GWAS aimed at identifying epigenetic variants.
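To illustrate how a lasso penalty selects a few effects among many, the sketch below implements the ordinary (non-Bayesian) lasso by cyclic coordinate descent. It is a swapped-in simplification and not the Bayesian lasso algorithm of Li et al. [24]. It assumes that the response and the columns of the design matrix have been centered, so that no intercept is needed, and it minimizes $\frac{1}{2n}\sum_i (y_i - \sum_j x_{ij}\beta_j)^2 + \lambda \sum_j |\beta_j|$. The function names and the penalty value are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Ordinary lasso by cyclic coordinate descent (X: list of rows)."""
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    resid = list(y)  # residuals y - X beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(m)]
    for _ in range(n_iter):
        for j in range(m):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes its own fit.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


if __name__ == "__main__":
    # Tiny centered example with two orthogonal predictors.
    X = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    y = [2.0, -2.0, 0.1, -0.1]
    print(lasso_coordinate_descent(X, y, lam=0.2))
```

In the small example, the weak second predictor is shrunk exactly to zero while the strong first one is retained with a reduced coefficient, which is the selection behavior that makes lasso-type methods attractive when sites far outnumber samples.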

Discussion
Epigenetic alterations have been increasingly recognized to play an important role in generating and maintaining quantitative genetic variation for complex phenotypes underlying physiology and disease [6,7,9,26-28]. Preliminary estimates in plants suggest that they can account for up to 30% of the variation in commonly studied phenotypes such as height and flowering time [8]. Many theoretical models are available to analyze the contributions of epigenetic marks to missing heritability in genome-wide association studies (GWAS) [14][15][16][17][18]. In this article, we extended Mendelian inheritance-based genetic principles to derive a quantitative framework with which to analyze how DNA methylation contributes to overall genetic variance. By defining several epigenetic effect parameters, the analytical framework allows the mechanistic characterization of epigenetic actions within the quantitative genetic context.
Numerical analysis shows that a small incidence of DNA methylation, as well as a small effect due to methylation alterations, could lead to a substantial increase in genetic variance, suggesting that epigenetic marks may be an important cause of genetic diversity in nature. Given this finding, the neglect of epigenetic variants in many current GWAS may partly explain the problem of missing heritability [17]. Simulation studies suggest that the model can provide reasonable estimates of epigenetic effect parameters with a sample size of 200-400, even when the trait studied has a small heritability. It should be pointed out, however, that this conclusion is based on a well-controlled study with little background noise. For GWAS in humans, the estimated genetic variation is likely to be confounded by many factors, such as population structure, heterogeneous genetic background, demographic complexity, and highly noisy phenotypic measurements, among others. To remove these confounding effects from genetic and epigenetic analysis, a considerably larger sample size may be needed.
The model considers only a single methylated site. However, there is no technical difficulty in extending the model to explore two or more sites at the same time, which may interact with each other to produce a complex network of epistasis [29]. For two methylated sites, a total of 25 interaction parameters are formed between the parameter sets ($a_1$, $a_e$, $d_{1e}$, $d_{2e}$, $d_{12}$) of the two sites. In this case, an exponentially larger sample size and more precise phenotypic measurement (aimed at increasing the trait's heritability) are needed. In the methylated population, the original HWE assumption may be violated, in which case it is not possible to use gametic linkage disequilibria to specify the association between the two sites. Wu et al. [30] proposed a robust approach to analyze the marker-marker association by deriving a so-called zygotic linkage disequilibrium model. Their approach can be incorporated to identify the contribution of epigenetic marks at two sites to the overall genetic variance.
Epigenetic changes may be an adaptation to environmental perturbations [5,17,28]. Thus, it is crucial to incorporate the epigenetic model into a genotype-environment interaction study. By doing so, we can identify which epigenetic effects interact with the environment, and how, to determine final phenotypes, so that the genetic etiology of quantitative variation can be better elucidated. In addition, there is a considerable body of evidence that epigenetic effects may be transmitted from one generation to the next [31,32], although other studies have found reprogramming of epigenetic effects during meiosis [5,33,34]. By embedding our epigenetic model into a family-based design, we can develop a powerful approach to test the relative importance of these two phenomena in trait control [35][36][37]. Traditional models analyze the inheritance of quantitative traits based on Mendel's laws and therefore fail to capture the contribution of epigenetic modifications.
In addition, many GWAS are based on a case-control design in which genotype frequencies are compared between two groups. To study the association between epigenetic effects and a particular disease, such as cancer, we can incorporate the quantitative epigenetic model described by equations (10)-(14) into a case-control framework, allowing each effect to be tested. The integration of general quantitative genetic models and a case-control design has been discussed, and its statistical properties have been investigated through analytical derivations and computer simulations [38][39][40]. With these extensions, the new model proposed in this article, which integrates traditional quantitative genetic theory with the latest discoveries of epigenetic effects, will allow geneticists to chart a more comprehensive picture of the genetic landscape of complex phenotypes underlying agricultural production, physiology, and human diseases.