Rapid screening for phenotype-genotype associations by linear transformations of genomic evaluations

Gualdrón Duarte, Jose L; Cantet, Rodolfo JC; Bates, Ronald O; Ernst, Catherine W; Raney, Nancy E; Steibel, Juan P

doi:10.1186/1471-2105-15-246

Methodology article
Open access
Published: 19 July 2014

Rapid screening for phenotype-genotype associations by linear transformations of genomic evaluations

Jose L Gualdrón Duarte¹,
Rodolfo JC Cantet¹,
Ronald O Bates²,
Catherine W Ernst²,
Nancy E Raney² &
…
Juan P Steibel^2,3

BMC Bioinformatics volume 15, Article number: 246 (2014) Cite this article

4753 Accesses
51 Citations
4 Altmetric
Metrics details

Abstract

Background

Currently, association studies are analysed using statistical mixed models, with marker effects estimated by a linear transformation of genomic breeding values. The variances of marker effects are needed when performing the tests of association. However, approaches used to estimate the parameters rely on a prior variance or on a constant estimate of the additive variance. Alternatively, we propose a standardized test of association using the variance of each marker effect, which generally differ among each other. Random breeding values from a mixed model including fixed effects and a genomic covariance matrix are linearly transformed to estimate the marker effects.

Results

The standardized test was neither conservative nor liberal with respect to type I error rate (false-positives), compared to a similar test using Predictor Error Variance, a method that was too conservative. Furthermore, genomic predictions are solved efficiently by the procedure, and the p-values are virtually identical to those calculated from tests for one marker effect at a time. Moreover, the standardized test reduces computing time and memory requirements.

The following steps are used to locate genome segments displaying strong association. The marker with the highest − log(p-value) in each chromosome is selected, and the segment is expanded one Mb upstream and one Mb downstream of the marker. A genomic matrix is calculated using the information from those markers only, which is used as the variance-covariance of the segment effects in a model that also includes fixed effects and random genomic breeding values. The likelihood ratio is then calculated to test for the effect in every chromosome against a reduced model with fixed effects and genomic breeding values. In a case study with pigs, a significant segment from chromosome 6 explained 11% of total genetic variance.

Conclusions

The standardized test of marker effects using their own variance helps in detecting specific genomic regions involved in the additive variance, and in reducing false positives. Moreover, genome scanning of candidate segments can be used in meta-analyses of genome-wide association studies, as it enables the detection of specific genome regions that affect an economically relevant trait when using multiple populations.

Background

The availability of high density genotypes of single nucleotide polymorphism (SNP) markers for plants and livestock species, in conjunction with phenotypic data for complex traits, allows the calculation of: 1) estimates of genomic breeding values (GEBVs) [1, 2] for genomic evaluation [3], and 2) estimates of the effects of genomic regions associated with the genetic variability in genome wide association studies (GWAS) [2, 4, 5].

There is an increasing number of GWAS data sets analyzed by mixed models and multiple testing procedures [6], after fitting all individual effects of genomic regions into the model [4]. The model may be difficult to fit when both, the number of individuals and SNP effects, are large. We propose to use a linear transformation of genomic breeding values to estimate the marker effects from a simpler equivalent mixed model, and then testing those effects using a standardized test statistic that employs the variance (rather than prediction error variance) of the same effects.

The method of genomic selection proposed by Meuwissen et al. [7] to estimate GEBVs starts by fitting the SNP effects to a given data set. Next is to estimate GEBV of any individual using its genotype (SNP), by adding across the entire genome those solutions corresponding to the individual's SNP. The mixed model employed conveys vectors of fixed effects, and random effects of markers or SNPs ( g ) assumed to be normally distributed with null mean and a covariance matrix proportional to the identity matrix times the variance of SNP effects $(I σ_{g}^{2})$ . Errors are assumed to be Gaussian, independent and identically distributed with null mean and covariance matrix $I σ_{e}^{2}$ . An equivalent mixed model discussed by Garrick [8] and Stranden [9] is fitted after the linear transformation a = Z g where a is a random vector of breeding values, and Z an incidence matrix that relates elements in a to those in g. Each column of Z is associated with a given SNP and the elements are standardized by functions of SNP allele frequencies and by the total number of SNP. It is worth noting that the same Z is used in our implementation of the model of Meuwissen et al. [7] to relate the vector of marker effects in g to the data phenotypes. Moreover, GEBVs in the equivalent model have variance-covariance matrix $G σ_{A}^{2} = Z Z' σ_{g}^{2}$ . The procedure requires that the variances are equal, i.e. $σ_{A}^{2} = σ_{g}^{2}$ . Once the equivalent model is fit, SNP effects are calculated by the transformation g = Z^'G^− 1a, and individual SNP effects in g are divided by the square root of its variance (Var( g_j )) to get the so called SNP_ej test statistics. We also provide a formula to calculate Var( g_j ) without having to fit the model with SNP effects. The next step is to select genome segments that may be highly associated with the genetic variability of the trait for each chromosome. In doing so, we look for the SNP having the highest value of minus the logarithm of the p-value throughout the chromosome. Once the SNP is located, a segment of one Mb to the left and one to the right is defined, and a relationship matrix is calculated using only the information from those markers. The relationship matrix is used as the proportional variance-covariance of the segment effects in a model that also includes fixed effects and random GEBVs. In a final step, the likelihood ratio is calculated to test the significance of the largest effect segment of each chromosome by comparing against a reduced model with fixed effects and GEBVs. The critical value (size of the test) is adjusted by the Bonferroni correction. The algorithm not only delivers genome wide associations and genomic predictions efficiently, but it also minimizes computing time and memory requirements. Moreover, the specific variance of the SNP effects is used in calculating the test, thus taking into account the amount of information of any given marker. Instead, other testing approaches rely on a prior variance or a constant estimate of the additive variance.

Methods

Dataset

The experimental population was raised at the Michigan State University Swine Teaching and Research Farm, East Lansing, MI [10]. Parents from the initial generation (F₀) were four Duroc boars mated to 15 Pietrain sows by artificial insemination. From all resulting F₁ animals, 50 females and 6 males (progeny of 3 F₀ sires) were selected as parents for the F₂ generation, by avoiding full or half sib matings. A total of 1,259 F₂ piglets were born alive from 142 litters out of 11 farrowing groups. Phenotypic data for growth, carcass merit and meat quality traits were collected for approximately 950 F₂ pigs (for more details refer to Edwards et al. [10, 11]). Data used for the study were measures of the growth trait 13 week tenth rib backfat (mm) (bf10_13wk). The trait was chosen as it displays a sizable heritability (0.42) and a normal distribution. Animal protocols were approved by the Michigan State University All University Committee on Animal Use and Care (AUF# 09/03-114-00).

Genotyping and data editing

DNA was isolated from white blood cells using standard procedures as previously described for this population [10]. Quantity and quality of DNA samples were determined using a Qubit fluorometer (Invitrogen by Life Technologies, Carlsbad, CA, USA). The experimental population was genotyped with two marker SNP panels. 1) 411 animals were genotyped (4 F₀ Duroc boars, 15 F₀ Pietrain sows, 6 F₁ males, 50 F₁ females and 336 F₂ pigs) with a commercial panel, the Illumina PorcineSNP60 beadchip (60 K) [12] and 2) 612 F₂ animals were genotyped with a second panel composed of a 9 K tagSNP set referred to as the GeneSeek Genomic Profiler for Porcine LD (GGP-Porcine, GeneSeek a Neogen Company, Lincoln, NE) [13] . A set of 5,350 SNP out of M = 62,163, were eliminated from all analyses as their physical positions were unknown. Mendelian inconsistencies (≤0.01%) were taken as missing genotypes, and 21 animals (1 F₁ and 20 F₂) with more than 10% of SNP missing were not used for any analysis. By similar considerations, 2,978 SNP were removed from the analyses as they had more than 10% missing data. Additionally, 9,877 SNP were excluded as their minor allele frequency (MAF) was below 0.01. This editing procedure followed that of Badke et al. [14] and Gualdrón et al. [15], and the program PLINKv1.07 [16] was used for the task. F₂ animals genotyped with the 9 K panel were imputed to 60 K following procedures discussed by Gualdrón et. al [15], by means of the software AlphaImpute [17], resulting in imputation accuracy of around 0.99 [15]. Genotypes imputed in the F₂ had a second editing procedure by MAF < 0.01, which excluded 759 virtually monomorphic SNP. The editing policies and genotype imputation resulted in a data set with records from 1002 pigs (F₀, F₁ and F₂) having 44,055 SNP per animal.

Estimation of genomic relationship matrix

The genomic relationship matrix was estimated from observed and imputed high density (~44 K) SNP genotypes. Genotypes were expressed as allelic dosage [13, 15], such that genotypes were entered into a marker matrix M of dimension (n × m), where n is the number of animals and m the number of SNP, having elements in the interval [0, 2], i.e. the count of the allele used as reference. In the sequel, we will use the sub index i to refer to the individual. Matrix M was standardized to matrix Z that has generic elements equal to

Z_{ij} = \frac{M_{ij} - 2 p_{j}}{\sqrt{m (2 p_{j} (1 - p_{j}))}}

Elements of Z are then calculated by subtracting twice the frequency of the reference allele at the jth marker (p_j), to the corresponding element of M[18], and then dividing the resulting difference by the square root of the expected variance 2p_j(1 − p_j) of each element in the column multiplied by the number of columns (m) in M. The allele frequency p_j was calculated from the F₀ generation (19 animals). The genomic relationship matrix was finally calculated as:

G = Z Z'

(1)

Prediction model

Using the genomic relationship matrix from equation (1), the centered animal model for genomic evaluation can be written as:

y = X β + a + e

(2a)

where y is the phenotypic vector containing the data on 13-week tenth rib backfat (mm), X is the incidence matrix that relates records to the fixed effects of sex in β, vector a contains the random breeding values such that $a \sim N (0, G σ_{A}^{2})$ , e is the random error vector such that $e \sim N (0, I σ_{e}^{2})$ , and I is the identity matrix. Variance components were estimated with REML using the regress version 1.3-10 R package [19].

Following Stranden et al. [9] an equivalent model to (2a) is

y = X β + Z g + e

(2b)

Every element in (2b) is defined as before except for the vector g of SNP effects. To show that (2a) and (2b) are equivalent models, we employ the fact that a = Z g. Then, the variances of a and g are related in the following manner:

G σ_{A}^{2} = Var (a) = Var (Z g) = Z Var (g) Z' = Z Z' σ_{g}^{2}

Necessary conditions for models (2a) and (2b) to be equivalent (Henderson, 1984) are that G = Z Z ' and $σ_{A}^{2} = σ_{g}^{2}$ .

Variance of SNP effects

In this section, we describe the algorithm to calculate the variance of the estimated SNP effects g $(i . e . Var (\hat{g}))$ . The SNP effects were obtained from a linear transformation of breeding values in $\hat{a}$ [4, 9, 20, 21], as follows:

\begin{array}{l} BLUP (\hat{g}) = \hat{g} = cov (g, a') {[Var (a)]}^{- 1} \hat{a} \\ = cov (g, g') Z' G^{- 1} {(σ_{A}^{2})}^{- 1} \hat{a} \\ = (\frac{σ_{g}^{2}}{σ_{A}^{2}}) Z' G^{- 1} \hat{a} = Z' G^{- 1} \hat{a} \end{array}

(3)

The last step results from the fact that model equivalence involves $σ_{A}^{2} = σ_{g}^{2}$ . Now, from equation (3) $Var (\hat{g})$ is obtained as follows:

Var (\hat{g}) = Var (Z' G^{- 1} \hat{a}) = Z' G^{- 1} Var (\hat{a}) G^{- 1} Z

(4)

Now, we know that the predictor error variance (PEV) of $\hat{a}$ from model (2a) is equal to:

PEV (\hat{a}) = Var (a - \hat{a}) = C^{aa} = Var (a) - Var (\hat{a})

So that

Var (\hat{a}) = Var (a) - C^{aa} = G σ_{A}^{2} - C^{aa}

Matrix C^aa results from inverting the coefficient matrix of the mixed model equations [22] such that:

C^{aa} = σ_{e}^{2} {(I - X {(X' X)}^{- 1} X' + G^{- 1} λ)}^{- 1}, λ = \frac{σ_{e}^{2}}{σ_{A}^{2}}

Then, on replacing with the latter expression into $Var (\hat{a})$ (displayed in (4)), we have:

\begin{array}{l} Var (\hat{g}) & = Z' G^{- 1} (G σ_{A}^{2} - C^{aa}) G^{- 1} Z \\ = Z' G^{- 1} Z σ_{A}^{2} - Z' G^{- 1} C^{aa} G^{- 1} Z \end{array}

(5)

Expression (5) results in a large matrix of dimension (m × m) with m the number of SNP. However, we only need its diagonal elements. Also notice that the first term in (5), Z ' G^− 1 Z, can be computed and stored to be reused for the different traits, whereas C^aa has to be computed for each trait.

Standardization of SNP effects (SNP_ej)

The estimated SNP effects in (3) were standardized by dividing with their corresponding $Var ({\hat{g}}_{j})$ obtained from (5) as follows:

SN P_{e j} = \frac{{\hat{g}}_{j}}{\sqrt{Var ({\hat{g}}_{j})}}

(6)

P-values and genome screening

The p-values were assessed as 1 minus the cumulative probability density of the absolute value of SNP_e j, a number that was then multiplied by 2 so as to obtain:

p - valu e_{j} = 2 (1 - Φ (|SN P_{e j}|))

where Φ(x) is the cumulative density function of the normal distribution for the random variable x. When analyzing the trait 13 week tenth rib backfat (mm), the p-values for each SNP were plotted across the genome as –Log₁₀ (p-value) using the physical position of the SNP in Mega-bases (Mb).

Standardization of SNP effects using the PEV of the marker

A second standardization of the j^th SNP effect (3) was performed using the PEV $(\hat{g})$ as follows:

SN P_{ep j} = \frac{{\hat{g}}_{j}}{\sqrt{Var (g_{j}) - Var ({\hat{g}}_{j})}}

(7)

As discussed above, $σ_{g}^{2} = σ_{A}^{2}$ . The p-values and genome screening for SNP_{ep j} were assessed and plotted in the same fashion as for SNP_{e j}.

Simulation

A plasmode simulation was performed to compare how the standardized values SNP_{e j} and SNP_{ep j} affected the nominal size of the test for the effect to be equal to zero. Data on 928 animals with 44,055 SNP each were used for the study, and the 1018 SNP on chromosome 18 were reshuffled. Two scenarios were considered: 1) Dependency: rows of the genotype matrix were permuted for columns corresponding to SNP on chromosome 18, thus keeping Linkage Disequilibrium (LD) within chromosomes but breaking the relationship between genotypes and phenotypes for the 1018 SNP on the chromosome. 2) Independency: the genotype of any animal was permuted independently by marker (resulting in linkage equilibrium, or LE between markers) for those SNP on chromosome 18, and the relationship with the phenotype was broken too. For both scenarios model (2a) was fitted to the data, and two tests were calculated for each scenario: test1 = SNP_ej and test2 = SNP_epj. Permutations were repeated 200 times per scenario, and in each permutation the G matrix was calculated while fitting model (2a). As a result, the heritability of the trait was similar to the original heritability due to relationships in the other 17 chromosomes being kept intact, and p-values for those SNP (that are now non-associated) on chromosome 18 were obtained for the different tests. Under the null hypothesis and assuming independence (i.e., SNP are unlinked to the polymorphism controlling the trait), an approach that controls for type I error appropriately [23], the 1018 test p-values follow a uniform distribution. Consequently, to estimate the empirical quantiles of the distribution for the null hypothesis, we used a uniform density U ∼ (0, 1) to generate 200 replicated sets for the 1018 p-values.

SNP effects and tests obtained by a single marker model

The SNP effects were tested on a one by one basis. The model approach used for testing purposes is better known as “efficient mixed-model association” (EMMA) [24]. The model included fixed effects of sex and one-marker-at-a-time; random variable was the animal effect with variance-covariance equal to the genomic relationship matrix using all markers, which was calculated as described before. The R package rrBLUP [25] was used for fitting the different models and for calculating the tests and p-values.

Proportion of variance explained by segments with large effect

After the genome screen using model 2a, the SNP with the smallest p-values were selected to form SNP segments. These segments were defined by taking all SNP within one Mb upstream and one Mb downstream of the SNP with smallest p-value on each chromosome. The size of the segment was chosen using a criterion similar to the one employed by Hayes et al. [4]. The point of change in the rate of decay in linkage disequilibrium in this population was about r² = 0.2 at 1 Mb (data not shown), which essentially would imply a minimal contribution to the additive variance from markers located beyond such distance. Moreover, segment sizes about two Mb have been reported to be significant in association studies [20, 26–28]. The proportion of variance associated with each segment was estimated by building a genomic relationship matrix G₁ (as described in (1)) using all SNPs that belonged to the segment, whereas genomic relationship matrix G₂ was built using all remaining SNPs. The model fitted can be represented as:

y = X β + a_{1} + a_{2} + e

(8)

where a₁ is the vector of additive random effects associated with those SNP located in the segment, such that $a_{1} \sim N (0, G_{1} σ_{A_{1}}^{2})$ , and a₂ is the vector of additive random effects associated with all SNPs except those involved with a₁, such that $a_{2} \sim N (0, G_{2} σ_{A_{2}}^{2})$ . Model (8) assesses the proportion of variance explained by the segment of interest (local variance) from the genome variance explained by all SNPs (global variance). The variances estimated in (8) were compared with those estimates from model (2a). Hayes et al. [4] used a similar model to assess the segment variance. Applying either model (8), or the approach of Hayes et al. [4] gave similar estimated variance components. In practice, the advantage of fitting model (8) is that G₂ is computed by subtracting from G the columns of Z related to the segment being tested. Let Z_s be a matrix having as columns those related to the segment being tested, then $G_{2} = G - Z_{s} Z_{s}^{'}$ . On the contrary, in the model of Hayes et al. [4]Gis different from segment to segment. Additionally, the calculation of G₁ and $Z_{s} Z_{s}^{'}$ is fast and involves only those SNPs located in the segment.

To adjust the level of significance for multiple comparisons, a Bonferroni Correction (BC) was performed. In this context, if the pig genome is ~2800 Mb long and the average size of the segment is 2 Mb, there are 1400 segments along the genome with corresponding multiple tests. Thus, for α = 0.05, the BC was equal to 0.05/1400 = 3.571429e⁻⁰⁵ (adjusted α or critical value). Hence, in order to evaluate the significance of the segments, a second p-value for the Likelihood Ratio Test (p − value_LRT) was calculated to compare against BC. This p − value_LRT was assessed as 1 minus the distribution function of a chi-square (χ²) random variable with 0.5 degrees of freedom [29, 30] as follows:

p - {value}_{LRT} = 1 - Ω (LRT)

where Ω(x) is the distribution function of a random variable having the χ² as density, and LRT is the Likelihood Ratio Test obtained by contrasting appropriate models.

Results

Genome screening

The p-values of the 44055 SNP were obtained as described in the Methods section. First, the p-values for SNP_ej, i.e. using $Var ({\hat{g}}_{j})$ , were plotted along the genome (Manhattan plot in Figure 1) to identify genomic positions that are associated with variation in 13-week tenth rib backfat (mm). Large peaks (−Log₁₀(p-value) above 5 can be seen at chromosomes 6 and 3, suggesting noticeable genetic variation for the trait. On the other hand, p-values for SNP_epj (i.e. standardized with prediction error variance) were very large, with a maximum − Log₁₀( p-value2) of 0.20. In essence, the pattern observed in Figure 2 is the result of dividing the non-standardized SNP effects by a constant. Specifically, the normalizing value was [Var (g_j) − Var $({\hat{g}}_{j})$ ], with Var (g_j) = 2.6768. The use of the square root of the difference between those two values resulted in a practically constant denominator for the test-statistic that was equal to 2.66. Also, a look at Figure 2 suggests signals at chromosomes 1, 12, 14, and 18, a fact that is not observed in Figure 1. However, this might be an artefact of the constant denominator that tends to overestimate the true variability for some SNP, thus resulting in corresponding false positives across the genome.

In order to study the type I error rate of the two proposed tests we performed a plasmode simulation [31]. A plasmode is a dataset created from real data where some of the truth is known. In brief, our plasmode is a simulation that uses reshuffling in a portion of the data as explained in the methods section. We performed a simulation assuming independent SNP, and another one keeping the dependency between SNP (LD structure) intact. Simulation results were plotted into a Quantil-quantil plot graph (Figure 3) using the number –Log(p-value) for each case of standardization. First, the p-values for test1 (SNP_ej) obtained in the scenario under independent SNPs (scenario 2, LE) displayed an identical distribution of p-values when obtained by the reference distribution U ∼ (0, 1). In contrast, under dependency (scenario 1, LD) less extreme p-values were observed, a fact that was not reflected in a uniform distribution. This is a well known fact in human genetic epidemiology [32], where the implementation of the Bonferroni correction of p-values from associated SNP under the assumption of independence results in tests that are too conservative. On the other hand, for test 2 (SNP_epj) even p-values obtained for independent SNP (scenario 2, LE) displayed a distribution that was too conservative. Furthermore, the results from the dependent scenario (LD) were even more conservative than those from the independent scenario (results not displayed in the Q-Q plot), thus indicating that the use of the square root of $Var ({\hat{g}}_{j})$ as the denominator of the test-statistic results in a more powerful and not too liberal choice when compared to the use of the square root of PEV = $Var (g_{j}) - Var ({\hat{g}}_{j})$ .

SNP effects and tests obtained by the marker model

The analyses of one SNP tested at a time using the EMMA procedure [24] resulted in p-values that were almost identical (Additional file 1) to those of SNP_ej (Additional file 2). The time taken to compute 44055 SNP tests one at a time was 84 minutes. In comparison, the algorithm used to fit model (2a) and to perform the tests of standardized effects took a total time of 29 minutes (CPU and memory: Quad-core 2.7GHz AMD Opteron 8384, 256 GB). This time includes the computation of the G matrix, the fit of the animal model, the back transformation to calculate the SNP effects, and the calculation of the standard errors that are needed to compute the test-statistics.

Tests of segment effects

We also compared the results from our proposed method to those obtained with a segment-based likelihood ratio test that has been used by animal breeders [4]. Due to computational demand, we only performed the LRT to test for segment effects. Thus, the SNP with the smallest p-values (or highest − Log₁₀(p-values)) on each chromosome were chosen, whereas no segments were tested using LRT for regions with SNP_epj resulting in exceedingly low p-values. The three segments from chromosomes with the smallest p-values are displayed in Table 1, and the remaining segments from the 15 other chromosomes are shown in the additional files (Additional file 3). All segments measured 2 Mb (1 Mb on each side of the SNP with the smallest p-value). The estimates of the variance components and the LogLikelihood obtained from model equation (8) were compared with those from model equation (2a). These results are displayed in Table 2.

Table 1 SNP selected by smallest p-value per chromosome

Full size table

Table 2 Variance components and LogLikelihood for models with or without the segment

Full size table

Results from the LRT indicated that the segment on chromosome 6 was significant: p − value_{LRT ‒ 6} = 1.133459e⁻⁰⁹, a number smaller than the critical 0.05 Bonferroni threshold for 1400 segments (P_critical = 0.05/1400 = 3.571429e⁻⁰⁵). On the contrary, the segments located on all other chromosomes were not significant. The proportion of variance explained by the segment from chromosome 6 (−Log(p-value) = 8.02) was 11% of the total variance, a fact that was reflected in a similar reduction of the estimated additive variance $(σ_{A}^{2})$ in model (8): 1.952 + 0.698 = 2.650. This latter value is close to 2.678, i.e. the estimated value of $σ_{A}^{2}$ from model (2a) (see Table 2). For all other chromosomal segments, the estimated value of $σ_{A}^{2}$ did not decrease to a significant amount.

Discussion

The main goal of this research was to develop a novel procedure to perform a rapid genome scan, or GWAS analysis, from a genomic evaluation. Moreover, the sufficient statistics of our methodology are: the Best Linear Unbiased Prediction (BLUP) of the breeding values from an animal model, G as the covariance matrix (or H for a single step evaluation [33]), Z as the standardized marker effects matrix, variance components, and C^aa. This setting makes the implementation extremely feasible after the genomic evaluation has been performed as discussed by Legarra et al. [33].

Variance of the SNP effect

First, the SNP effects ${\hat{g}}_{j}$ were calculated by a linear transformation of $\hat{a}$ using expression (3). Then, we calculated $Var ({\hat{g}}_{j})$ using an expression derived from mixed model theory (see (4–5)). Next, we divided ${\hat{g}}_{j}$ by the square root of $Var ({\hat{g}}_{j})$ to standardize the effect, and referred the statistics as SNP_ej. The p-values for the tests of specific genome regions were calculated with a level of significance − Log₁₀(p-value) = 5. Additionally, Prediction Error Variance ( $PEV = Var (g_{j}) - Var ({\hat{g}}_{j})$ ) was employed for a second standardization, and it was called the SNP_epj statistic. After the analyses, we obtained higher p-values (maximum − Log₁₀(p-value) = 0.20) and detected stronger signals (higher peaks in the Manhattan plot) for SNP_epj than with SNP_ej. Furthermore, a simulation was carried out with the same structure of SNPs markers and animal data as in the current study, in order to compare the performance of empirical p-values of both standardized tests. The SNPs markers of chromosome 18 were reshuffled, and two scenarios were simulated: 1) Dependent genotypes (LD), and 2) Independent genotypes (LE). Neither scenario displayed a relationship with the phenotype, whereas both standardized tests were calculated at each scenario. The reference distribution for the p-values considered was the uniform. In the independent scenario (LE), standardization with $Var ({\hat{g}}_{j})$ gave an empirical distribution of p-values that resembled the uniform density, but in the dependent scenario (LD) the SNP_ej performed conservatively. Instead, the standardization with $[Var (g_{j}) - Var ({\hat{g}}_{j})]$ produced conservative results in the independent scenario (LE), and very conservative tests in the dependent scenario (LD). In this context, standardizing SNP effects with $Var ({\hat{g}}_{j})$ resulted in p-values that were closer to the simulated ones. Moreover, the performance of SNP_ej under LD was not too conservative, a scenario that could be extrapolated to the genotypes in the current study. In addition, the p-values calculated using the EMMA procedure [24] were similar to those obtained with SNP_ej. These results suggest that SNP_ej behaves reasonably to control type I error rate or false positives. Also, the computing time for fitting model (2a) and then calculating (6) using expressions (3)-(5) was 2.5 to 3 times less than the computing time for the EMMA model.

In order to identify SNP with important phenotypic associations [34], the calculation of SNP effects ${\hat{g}}_{j}$ from genomic breeding values $\hat{a}$ [8, 9, 34] has been used in several studies [5, 20, 21]. In this context, the variance of SNP effects has been estimated using different approaches. Wang et al. [21] employed the classical definition of the variance of additive effects from quantitative genetics [35], so that the variance for each jth marker was obtained as follows: $σ_{A, j}^{2} = {\hat{g}}^{2}_{j} 2 p_{j} (1 - p_{j})$ . Whereas, McClure et al. [20] proposed equating the variance of SNP effects to ${(2 \sum p_{j} q_{j})}^{- 1} σ_{A}^{2}$ , and then normalizing the SNP effects with the square root of this estimated and constant variance. This test performed similar to SNP_{ep j} (7), when the estimated SNP effects $({\hat{g}}_{j})$ was divided by a constant denominator, a value almost equal to the prior variance 2.67, and resulted in a very conservative test.

In contrast, the advantage of the standardized test (SNP_ej) presented here was that each SNP effect was scaled by its own (and different) standard deviation rather than the use of a prior variance [20] or by the square of each specific SNP effect ${\hat{g}}^{2}_{j}$ [21] as variance. Furthermore, the computation of SNP_ej, involves the same variance for the same SNPs markers and animals, i.e. $σ_{g}^{2} = σ_{A}^{2}$ , and the use of the standardized incidence matrix Z, a function of 2p_j(1 − p_j), takes into account this latter quantity into SNP_ej. Additionally, the matrix Z uses the allele frequencies from the F₀ generation calculated with unrelated individuals, and a proper expected variance by marker (see Methods section). In addition, the test statistics SNP_ej that standardizes SNP effects produces a p-value, a result that is appealing to many researchers that are more familiar with the method of testing one SNP at the time rather than with the proportion of additive variance that is explained by a genomic region. A further advantage of the method is that detection of many false positives are avoided, and genome positions with sizeable effects are highlighted.

Candidate segment approach

Later in the research, genome segments that expressed higher signals were located. To this purpose, SNPs with the smallest p-values from SNP_ej (6) were selected, and for each of these SNP a segment of 2 Mb long (1 Mb at each side) was created. The next step was to estimate the variance components and the Log-Likelihood from the centered animal models (2a) and (8). The latter model includes the random vector of SNP segments a₁. Lastly, we compare the performance of both models. Hayes et al. [4] used a similar model to (8), although the random SNP effect was taken from the breeding value and fitted as a separate segment effect. We observed similar results from the use of either approach. The advantage of fitting model (8) is that matrix G is the same for all segments, so that it was calculated only once, and stored in memory for the calculations, whereas in the model of Hayes et al. [13] a different G has to be calculated for each segment. This implies an extended computing time and higher requirements of CPU memory to obtain similar results to those from model (8).

To evaluate the significance of the segments, the effects of each chromosome segment were tested by the Likelihood Ratio Test. The size of the test was adjusted by the Bonferroni correction. As a result, the segment located on chromosome 6 (physical position 135 Mb-137 Mb) was significant, and explained 11% of the trait total variance. Previous studies by Edwards et al. [10] and Choi et al. [36], using microsatellites and a small number of SNP, found significant regions (physical positions between 135 and 139 Mb) on chromosome 6 for 13 week tenth rib backfat in the current population under study.

Additionally, forty eight markers between the physical position between 128 Mb and 139 Mb on chromosome 6 (http://www.animalgenome.org/QTLdb/pig.html), have been reported to be associated with the trait. Furthermore, recent studies showed the importance of chromosome 6 [37, 38] in the expression of the trait. Therefore, our results confirm the presence of genetic variability in the trait from chromosome 6.

Conclusions

Fast genome screening of SNP effects linearly transformed from genomic breeding values is advantageous, as a by-product of genomic evaluations for different species of farm animals. Moreover, the standardized tests of SNP effects using their own variance $(Var ({\hat{g}}_{j}))$ developed in this study helps in detecting specific genomic regions involved in the additive variation of the trait and reducing false positive locations using less computing time. Additionally, genome segments of about 2 Mb formed by surrounding the SNP with the smallest p-values on each chromosome, and tested with a standardized test involving $Var ({\hat{g}}_{j})$ and with the Bonferroni correction, could detect genome regions responsible for sizeable fractions of the trait genetic variance. This methodology involving genome scan and candidate segment approach is a useful method for meta-analyses of genome-wide association studies, as it enables the detection of specific genome regions that affect an economically relevant trait when using multiple populations. Code and data to obtain and reproduce the results presented is publicly available at https://www.msu.edu/~steibelj/JP_files/GBLUP.html.

References

Crossa J, Pérez P, de los Campos G, Mahuku G, Dreisigacker S, Magorokosho C: Genomic selection and prediction in plant breeding. J Crop Improv. 2011, 25: 239-261.
Article Google Scholar
Goddard ME, Hayes BJ: Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat Rev Genet. 2009, 10: 381-391.
Article PubMed CAS Google Scholar
Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME: Invited review: genomic selection in dairy cattle: progress and challenges. J Dairy Sci. 2009, 92: 433-443.
Article PubMed CAS Google Scholar
Hayes BJ, Pryce J, Chamberlain AJ, Bowman PJ, Goddard ME: Genetic architecture of complex traits and accuracy of genomic prediction: coat colour, milk-fat percentage, and type in Holstein cattle as contrasting model traits. PLoS Genet. 2010, 6: e1001139-
Article PubMed Central PubMed Google Scholar
Kumar S, Garrick DJ, Bink MC, Whitworth C, Chagné D, Volz RK: Novel genomic approaches unravel genetic architecture of complex traits in apple. BMC Genomics. 2013, 14: 393-
Article PubMed Central PubMed CAS Google Scholar
Zhou X, Stephens M: Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 2012, 44: 821-824.
Article PubMed Central PubMed CAS Google Scholar
Meuwissen TH, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157: 1819-1829.
PubMed Central PubMed CAS Google Scholar
Garrick DJ: Equivalent mixed model equations for genomic selection. J Bone Miner Res. 2007, 90 (Suppl): 376-(Abstr.)
Google Scholar
Strandén I, Garrick DJ: Technical note: derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J Dairy Sci. 2009, 92: 2971-2975.
Article PubMed Google Scholar
Edwards DB, Ernst CW, Tempelman RJ, Rosa GJM, Raney NE, Hoge MD, Bates RO: Quantitative trait loci mapping in an F2 Duroc x Pietrain resource population: I. Growth traits. J Anim Sci. 2008, 86: 241-253.
Article PubMed CAS Google Scholar
Edwards DB, Ernst CW, Raney NE, Doumit ME, Hoge MD, Bates RO: Quantitative trait locus mapping in an F2 Duroc x Pietrain resource population: II. Carcass and meat quality traits. J Anim Sci. 2008, 86: 254-266.
Article PubMed CAS Google Scholar
Ramos AM, Crooijmans RP, Affara NA, Amaral AJ, Archibald AL, Beever JE, Bendixen C, Churcher C, Clark R, Dehais P, Hansen MS, Hedegaard J, Hu Z-L, Kerstens HH, Law AS, Megens H-J, Milan D, Nonneman DJ, Rohrer GA, Rothschild MF, Smith TPL, Schnabel RD, Van Tassell CP, Taylor JF, Wiedmann RT, Schook LB, Groenen MA: Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLoS One. 2009, 4: e6524-
Article PubMed Central PubMed Google Scholar
Badke YM, Bates RO, Ernst CW, Schwab C, Fix J, Van Tassell CP, Steibel JP: Methods of tagSNP selection and other variables affecting imputation accuracy in swine. BMC Genet. 2013, 14: 8-
Article PubMed Central PubMed CAS Google Scholar
Badke YM, Bates RO, Ernst CW, Schwab C, Steibel JP: Estimation of linkage disequilibrium in four US pig breeds. BMC Genomics. 2012, 13: 24-
Article PubMed Central PubMed CAS Google Scholar
Gualdrón Duarte JL, Bates RO, Ernst CW, Raney NE, Cantet RJC, Steibel JP: Genotype imputation accuracy in a F2 pig population using high density and low density SNP panels. BMC Genet. 2013, 14: 38-
Article PubMed Central PubMed Google Scholar
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575.
Article PubMed Central PubMed CAS Google Scholar
Hickey JM, Kinghorn BP, Tier B, van der Werf JH, Cleveland MA: A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation. Genet Sel Evol. 2012, 44: 9-
Article PubMed Central PubMed Google Scholar
VanRaden PM: Efficient methods to compute genomic predictions. J Dairy Sci. 2008, 91: 4414-4423.
Article PubMed CAS Google Scholar
Clifford D, McCullagh P: The regress function. R News. 2006, 6: 6-10.
Google Scholar
McClure MC, Ramey HR, Rolf MM, McKay SD, Decker JE, Chapple RH, Kim JW, Taxis TM, Weaber RL, Schnabel RD, Taylor JF: Genome-wide association analysis for quantitative trait loci influencing Warner-Bratzler shear force in five taurine cattle breeds. Anim Genet. 2012, 43: 662-673.
Article PubMed Central PubMed CAS Google Scholar
Wang H, Misztal I, Aguilar I, Legarra A, Muir WM: Genome-wide association mapping including phenotypes from relatives without genotypes. Genet Res (Camb). 2012, 94: 73-83.
Article CAS Google Scholar
Henderson C: Applications of Linear Models in Animal Breeding. 1984, Guelph: University of Guelph
Google Scholar
Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S, Buckler ES: A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 2006, 38: 203-208.
Article PubMed CAS Google Scholar
Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, Eskin E: Efficient control of population structure in model organism association mapping. Genetics. 2008, 178: 1709-1723.
Article PubMed Central PubMed Google Scholar
Endelman JB: Ridge regression and other Kernels for genomic selection with R package rrBLUP. Plant Genome J. 2011, 4: 250-
Article Google Scholar
Rangkasenee N, Murani E, Brunner RM, Schellander K, Cinar MU, Luther H, Hofer A, Stoll M, Witten A, Ponsuksili S, Wimmers K: Genome-wide association identifies TBX5 as candidate gene for Osteochondrosis providing a functional link to cartilage perfusion as initial factor. Front Genet. 2013, 4: 78-
Article PubMed Central PubMed Google Scholar
Do DN, Ostersen T, Strathe AB, Mark T, Jensen J, Kadarmideen HN: Genome-wide association and systems genetic analyses of residual feed intake, daily feed consumption, backfat and weight gain in pigs. BMC Genet. 2014, 15: 27-
Article PubMed Central PubMed Google Scholar
Fan Y, Xing Y, Zhang Z, Ai H, Ouyang Z, Ouyang J, Yang M, Li P, Chen Y, Gao J, Li L, Huang L, Ren J: A further look at porcine chromosome 7 reveals VRTN variants associated with vertebral number in Chinese and Western pigs. PLoS One. 2013, 8: e62534-
Article PubMed Central PubMed CAS Google Scholar
Liang K-Y, Self SG: On the asymptotic behaviour of the Pseudolikelihood Ratio Test Statistic. J R Stat Soc Ser B. 1996, 58: 785-796.
Google Scholar
Self SG, Liang K-Y: Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J Am Stat Assoc. 1987, 82: 605-610.
Article Google Scholar
Vaughan LK, Divers J, Padilla M, Redden DT, Hemant K, Pomp D, Allison DB: The use of plasmodes as a supplement to simulations: a simple example evaluating individual admixture estimation methodologies. Comput Stat Data Anal. 2009, 53: 1755-1766.
Article PubMed Central PubMed Google Scholar
Klein RJ, Zeiss C, Chew EY, Tsai J, Sackler RS, Haynes C, Henning AK, Sangiovanni JP, Mane SM, Susan T, Bracken MB, Ferris FL, Ott J, Barnstable C, Hoh J: Complement factor H Polymorphism in age-related macular degeneration. Science (80-). 2006, 308: 385-389.
Article Google Scholar
Legarra A, Aguilar I, Misztal I: A relationship matrix including full pedigree and genomic information. J Dairy Sci. 2009, 92: 4656-4663.
Article PubMed CAS Google Scholar
Sun X, Fernando RL, Garrick DJ, Dekkers JCM: An iterative approach for efficient calculation of breed- ing values and genome-wide association analysis using weighted genomic BLUP. J Anim Sci. 2011, 89 (E–Suppl 2): e11-
Google Scholar
Falconer D, Mackay T: Introduction to quantitative genetics. 1996, New York: Longman
Google Scholar
Choi I, Steibel JP, Bates RO, Raney NE, Rumph JM, Ernst CW: Application of alternative models to identify QTL for growth traits in an F2 Duroc x Pietrain pig resource population. BMC Genet. 2010, 11: 97-
Article PubMed Central PubMed Google Scholar
Fan B, Onteru SK, Du Z-Q, Garrick DJ, Stalder KJ, Rothschild MF: Genome-wide association study identifies Loci for body composition and structural soundness traits in pigs. PLoS One. 2011, 6: e14726-
Article PubMed Central PubMed CAS Google Scholar
Switonski M, Stachowiak M, Cieslak J, Bartz M, Grzes M: Genetics of fat tissue accumulation in pigs: a comparative approach. J Appl Genet. 2010, 51: 153-168.
Article PubMed CAS Google Scholar

Download references

Acknowledgements

This project was supported by Agriculture and Food Research Initiative Competitive Grants no. 2010-65205-20342 and no. 2011-67015-30338from the USDA National Institute of Food and Agriculture and by funding from the National Pork Board Grant no. 11–042. Partial funding was also provided by the US Pig Genome Coordinator. Computer resources were provided by the Michigan State University High Performance Computing Center (HPCC). JLGD and RJCC were funded by UBACyT 20020100100861 from Universidad de Buenos Aires, and PIP 11220120100621CO from CONICET (Argentina).

Author information

Authors and Affiliations

Departamento de Producción Animal, Facultad de Agronomía, UBA-CONICET, Buenos Aires, Argentina
Jose L Gualdrón Duarte & Rodolfo JC Cantet
Department of Animal Science, Michigan State University, East Lansing, MI, USA
Ronald O Bates, Catherine W Ernst, Nancy E Raney & Juan P Steibel
Department of Fisheries and Wildlife, Michigan State University, East Lansing, MI, USA
Juan P Steibel

Authors

Jose L Gualdrón Duarte
View author publications
You can also search for this author in PubMed Google Scholar
Rodolfo JC Cantet
View author publications
You can also search for this author in PubMed Google Scholar
Ronald O Bates
View author publications
You can also search for this author in PubMed Google Scholar
Catherine W Ernst
View author publications
You can also search for this author in PubMed Google Scholar
Nancy E Raney
View author publications
You can also search for this author in PubMed Google Scholar
Juan P Steibel
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Juan P Steibel.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JPS, RJCC, JLGD: performed and supervised statistical and simulation analyses and wrote the manuscript. ROB, CWE: designed the resource population and led collection of phenotypic data. CWE, NER: performed DNA extraction and coordinated genotyping with commercial laboratory. JPS, ROB, CWE: designed high density genotyping scheme. All authors read and approved the paper.

Electronic supplementary material

12859_2014_6514_MOESM1_ESM.pdf

Additional file 1: Highest − Log₁₀(p-values) on each chromosome for trait 13-week tenth rib backfat (mm) by standardization SNP_ej and EMMA. The blue and red circle represents highest − Log₁₀(p-values) on each chromosome by the standardization SNP_ej and efficient mixed-model association (EMMA) using rrBLUP. respectively. (PDF 29 KB)

12859_2014_6514_MOESM2_ESM.pdf

Additional file 2: Dispersion plot of − Log₁₀(p-values) for trait 13-week tenth rib backfat (mm) by EMMA and standardization SNP_ej. Dispersion plot for 44055 –log₁₀ (p-values) by efficient mixed-model association (EMMA) using the rrBLUP R package in the x axis, and by the standardization SNP_ej in the y axis. Red straight line is the reference line 0–1. (PDF 255 KB)

12859_2014_6514_MOESM3_ESM.pdf

Additional file 3: Variance components and LogLikehood for models with or without the segment for all chromosomes. (Results for the 18 chromosomes). (PDF 181 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Gualdrón Duarte, J.L., Cantet, R.J., Bates, R.O. et al. Rapid screening for phenotype-genotype associations by linear transformations of genomic evaluations. BMC Bioinformatics 15, 246 (2014). https://doi.org/10.1186/1471-2105-15-246

Download citation

Received: 03 February 2014
Accepted: 07 July 2014
Published: 19 July 2014
DOI: https://doi.org/10.1186/1471-2105-15-246

Rapid screening for phenotype-genotype associations by linear transformations of genomic evaluations

Abstract

Background

Results

Conclusions

Background

Methods

Dataset

Genotyping and data editing

Estimation of genomic relationship matrix

Prediction model

Variance of SNP effects

Standardization of SNP effects (SNP ej )

P-values and genome screening

Standardization of SNP effects using the PEV of the marker

Simulation

SNP effects and tests obtained by a single marker model

Proportion of variance explained by segments with large effect

Results

Genome screening

SNP effects and tests obtained by the marker model

Tests of segment effects

Discussion

Variance of the SNP effect

Candidate segment approach

Conclusions

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Electronic supplementary material

12859_2014_6514_MOESM1_ESM.pdf

12859_2014_6514_MOESM2_ESM.pdf

12859_2014_6514_MOESM3_ESM.pdf

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us

Standardization of SNP effects (SNP_ej)