Since the goal is to search for SNPs associated with a quantitative trait, we will consider both detection and localization. The proposed analysis technique includes calculation of a score at each SNP site and an assessment of significance by performing hypothesis tests via permutation. In order to examine the performance of the methods, we use a novel data simulation technique so that we know the location of the SNP truly associated with the quantitative trait (if one exists). This yields an opportunity to compare the type I error and power of the proposed method with that of QBlossoc. We begin by giving the details of the method of analyzing the data, and then describe the simulation technique.
Data analysis
The evolutionary history at each SNP site can be represented by a local phylogenetic tree, Θ. At each SNP, the local tree topology is estimated using Blossoc’s approach [6]. The branch lengths of each tree are estimated using a modification of the algorithm in [13], which yields an approximation to the maximum likelihood estimate of the branch length. In the case of DNA sequence data, the Rogers-Swofford method [13] is based on the use of a fast heuristic method to approximate the state at each internal node of the tree. The distance between a pair of nodes in the tree, , is then calculated to be the proportion of the sites in the sequence that differ between the reconstructed states at each node. can then be used to obtain an estimate of the branch length under an appropriate model, such as the Jukes-Cantor model [16].
We modify this method to handle SNP data as follows. First, the same heuristic method as in the Rogers-Swofford method (based on a most parsimonious reconstruction) is applied to the phased 0-1 SNP data at each internal node of the tree. The proportion of SNPs that differ between each node, , is counted, and the branch length connecting the two nodes is estimated to be:
(1)
If for a branch, then we set . This distance equation is derived under the M2 model, a two-state Markov model for nucleotide sequence data, which is a specific case of the more general Mk model described in [17]. Notice that the branch length estimate, , increases as the proportion of differing SNPs between two nodes increases, as expected.
For each estimated local phylogeny, the branch lengths are used to estimate the variance-covariance matrix of the tree, V(Θ), as shown in the example in Figure 2. The covariance between two quantitative trait values is defined to be proportional to the time of shared evolutionary history between those two observations. Given the local estimated phylogeny at the SNP of interest, the quantitative trait data were assumed to follow a multivariate normal distribution, with covariance structure determined by the local phylogeny.
The proposed method accounts for only a focused portion of the evolutionary history among observations using a clustering technique. At each SNP site, the tree can be partitioned into k clusters using only the earliest (k−1) edges in the tree. An example of this clustering is shown in Figure 2. For a fixed partition of the tree into k clusters, we define a matrix, D, with elements:
(2)
for i=1,2,…,n and j=1,2,…,k, where n is the number of observations (for diploid individuals, this is twice the number of individuals in the study). Then the trait data, Y=(Y1,Y2,…,Y
n
), are assumed to follow a multivariate normal distribution along tree Θ:
(3)
Here, as in QBlossoc, each cluster has its own mean, denoted μ=(μ1,μ2,…,μ
k
). However, instead of assuming independence, the variance-covariance matrix of the tree, σ2V=σ2V(Θ), allows for covariance structure to be present among the quantitative trait observations.
Using the distribution in Equation 3, the maximum likelihood estimates of the parameters are straightforward to calculate,
(4)
(5)
Hypothesis testing is carried out using a likelihood framework. In particular, we use a penalized likelihood similar to that proposed by [8]. We define the Likelihood Score Statistic (LSS) to be
(6)
To calculate LSS, the maximum likelihood is penalized by subtracting a penalty as in the Bayesian Information Criterion (BIC). Calculation of the likelihood involves estimation of (k+1) parameters, including the mean trait value in each cluster and the variance, σ2. The BIC criterion penalizes for k of these parameters. At each SNP, a local tree is scored according to (6), for varying numbers of clusters, k=1,…,k
m
a
x
, and the resulting tree score is the maximum score over the number of clusters.
To address detection, after the score in Equation 6 is calculated for the phylogenetic tree at each locus along a chromosome, permutation testing based on this location-specific test statistic can be used to evaluate significance. Permutation of the observed trait values among the tips of the estimated phylogenetic tree yields permuted data sets, and the score in Equation 6 is calculated for each permuted data set. The p-value for detection at each locus is the proportion of data sets scoring higher than the observed data set at each particular locus. To address localization, the distance (in DNA base pairs) between the maximally-scored locus and the disease locus is calculated.
In addition to the tree topology, our method also requires an estimate of the covariance structure in the data. This covariance structure is estimated via estimation of the branch lengths along the topology. By using the clustered tree to consider only the broad-scale phylogenetic relationships among SNPs, our technique can account for the evolutionary history among genes without using all coalescent relationships. Using only broad-scale relationships enables a computationally feasible algorithm that is able to account for the most important aspects of the covariance structure among observations.
Data simulation
To assess the performance of the proposed likelihood technique, simulated data sets are used. This provides a setting where the presence and location of the SNP truly associated with the quantitative trait is known. In our simulation study, we simulate SNP data for 100 replicate data sets from a diploid population using the program ms (without selection) [18]. Each data set consists of the SNP data corresponding to one chromosome. For each simulated replicate, a single DNA base pair location is randomly chosen to be associated with the trait. This choice of “disease” locus is restricted so that the minor allele frequency is between 10% and 30%.
For each SNP, quantitative data is simulated along the phylogenetic tree at the disease locus according to a generalized version of the Ornstein-Uhlenbeck (OU) process described by [19],
(7)
where Y
i
(t) is the quantitative trait value for the ith lineage at time t, Θ is the target trait value, α is the strength of selection toward the target value, σ
Y
is the standard deviation of the process per unit time, and d B
i
(t) represents a Brownian Motion process for lineage i, so that values of d B
i
(t) for small time increments, dt, are independent, identically distributed random variables from a normal distribution with mean zero and variance dt. Thus, the OU process is a mean-reverting process with a deterministic component, α(θ−Y
i
(t))d t, modeling the selection of a trait toward the optimum target value, and a stochastic component, σ
Y
d B
i
(t), providing the “random noise” for the process. Notice that the deterministic portion of this process implies that the selection of the trait toward the target is proportional to the distance between the trait and the target value, Θ. When two observations share a portion of their evolutionary history, they share the trait value, Y
i
(t), for that portion of time. Along the corresponding phylogenetic tree, observations share an evolutionary history when they evolve along the same lineage.
When this process is applied in the setting of phylogenetics, the stochastic process gives the same value during the time when the evolutionary history is shared for any two observations. However, after two lineages split, their trait values evolve independently from one another. This implies that before a split, two observations are perfectly correlated, while after the split, they evolve in an uncorrelated manner.
For this study, a more flexible form of the Ornstein-Uhlenbeck process is used, the Generalized Hansen model [19]. This allows the trait to evolve toward a non-constant optimum (target) as follows,
(8)
The trait is simulated according to this stochastic process with the target trait value determined solely by the SNP state at any time in the evolutionary history at that SNP, where X
i
(t) is the SNP state for the ith observation at time t.
Using the Generalized Hansen Model leaves us with a quantitative trait value for each haplotype that has both a (deterministic) genetic component, determined by the SNP, and a stochastic component. This process imposes an evolutionary history of the quantitative trait which can be portrayed by the phylogenetic tree at the disease locus, and allows the two haplotypes of a diploid individual to evolve independently along the phylogeny at the disease locus. This is intuitive as long as two haplotypes for an individual are unrelated to the trait. In order to simulate data for each individual, or diplotype, based on the haplotypic data, we use an additive model. The trait value for each diplotype is the average trait value across the two copies of the trait for each individual at the disease location.
During the simulation studies, we simulate the SNP using these parameters in ms: the diploid population size was N0=20,000, the neutral mutation rate for each DNA base pair was μ=2.0×10−10, the rate of recombination per generation per DNA base pair was ν=10−8, and each simulated chromosome was 1,000,000 base pairs long. During simulation of the quantitative trait values, we vary the strength of selection, α, and the standard deviation of the quantitative trait per unit time, σ
Y
. The two target trait values we consider are θ1=80 and θ2=100.