TDT-HET: A new transmission disequilibrium test that incorporates locus heterogeneity into the analysis of family-based association data

Background Locus heterogeneity is one of the most documented phenomena in genetics. To date, relatively little work had been done on the development of methods to address locus heterogeneity in genetic association analysis. Motivated by Zhou and Pan's work, we present a mixture model of linked and unlinked trios and develop a statistical method to estimate the probability that a heterozygous parent transmits the disease allele at a di-allelic locus, and the probability that any trio is in the linked group. The purpose here is the development of a test that extends the classic transmission disequilibrium test (TDT) to one that accounts for locus heterogeneity. Results Our simulations suggest that, for sufficiently large sample size (1000 trios) our method has good power to detect association even the proportion of unlinked trios is high (75%). While the median difference (TDT-HET empirical power - TDT empirical power) is approximately 0 for all MOI, there are parameter settings for which the power difference can be substantial. Our multi-locus simulations suggest that our method has good power to detect association as long as the markers are reasonably well-correlated and the genotype relative risk are larger. Results of both single-locus and multi-locus simulations suggest our method maintains the correct type I error rate. Finally, the TDT-HET statistic shows highly significant p-values for most of the idiopathic scoliosis candidate loci, and for some loci, the estimated proportion of unlinked trios approaches or exceeds 50%, suggesting the presence of locus heterogeneity. Conclusions We have developed an extension of the TDT statistic (TDT-HET) that allows for locus heterogeneity among coded trios. Benefits of our method include: estimates of parameters in the presence of heterogeneity, and reasonable power even when the proportion of linked trios is small. Also, we have extended multi-locus methods to TDT-HET and have demonstrated that the empirical power may be high to detect linkage. Last, given that we obtain PPBs, we conjecture that the TDT-HET may be a useful method for correctly identifying linked trios. We anticipate that researchers will find this property increasingly useful as they apply next-generation sequencing data in family based studies.


Background
In genetics, heterogeneity is a major feature of human traits. Genetic heterogeneity occurs when the same or clinically indistinguishable phenotypes are caused by different genetic factors. This can be due to multiple variants located in the same locus (allelic heterogeneity) or to mutations located in different loci (locus heterogeneity).
The focus of this work is locus heterogeneity, specifically heterogeneity caused by having an unknown subset of pedigrees in a sample being unlinked to a disease locus while the rest are linked [1,2].
Locus heterogeneity can substantially affect the power of linkage and association analyses [18][19][20][21][22][23][24][25][26][27]. In linkage analysis, there are many examples of methods that address this issue. For example, we have: the M test [28] (also known as K-test [29,30]), a likelihood ratio test (LRT) that estimates the value of the (assumed fixed) recombination fraction (θ) for each pedigree in a sample; the B-test [29], which is a more powerful version of the M-test that assumes an underlying beta null distribution for each estimated θ; the admixture test (A-test), which is based on the difference between the log-likelihood of the admixture model (data are composed of linked and unlinked families) and the homogeneity model (families are all linked with a common θ) [2,[31][32][33][34][35][36]; the D-test [30], a combination of the A and B tests and finally, the C-test [19], which is based on the M-test and for which the underlying null probability distribution is determined by simulation. The M and B tests were originally developed to identify different values of θ for different pedigrees. For the A-test, families are grouped into two types: a proportion a that are linked to the disease locus (θ < 1/2) and a proportion 1-a that are unlinked (θ = 1/2) [1,2]. As contrasted with M and the B tests, which place pedigrees into classes a priori, the A test accounts for heterogeneity by maximizing the standard log-odds (LOD) score [37] over a and θ. That is, each pedigree has some probability of being in the linked or unlinked group. This statistic is known as the heterogeneity LOD score (HLOD) [38].
The A-test has been implemented in a suite of programs to test for heterogeneity vs. homogeneity (HOMOG) [38]. More complex heterogeneity scenarios are also available in this package: HOMOG1 allows for gender specific differences in θ. HOMOG2, HOMOG3, HOMOG4, distinguish two, three and four types of families respectively, each linked to different disease loci on the same chromosome. HOMOG3R is a special case of HOMOG3 where there are three family classes: the first class is linked to a given marker; the second is linked to another marker on a different chromosome and the third is linked to neither marker. Lastly, HOMOGM [39], an extension of HOMOG3R, allows for any number of disease loci.
It is important to mention linkage analysis methods for quantitative trait loci (QTL) that account for locus heterogeneity in the analysis. Yang et al. [40] proposed a QTL mapping model for sib pair data. Knight et al. [41] and Ekstrøm et al. [42] independently developed LRTbased models in which the underlying null probability distributions are determined by simulation while Wang and Peng [43] proposed three test statistics with known null asymptotic distributions. It appears that relatively fewer publications considering locus heterogeneity for association have been published as compared with heterogeneity for linkage. When using the search terms "(locus heterogeneity) AND (linkage)" in ISI Web of Knowledge, we retrieve a total of 2,418 titles. By contrast, using the using the search terms "(locus heterogeneity) AND (association)", we retrieve a total of 884 titles, an almost 67% reduction. Having documented that, we do note that methods to address locus heterogeneity for association-based methods have been developed.
Yang et al. [55] extended the Posterior Probability of Linkage (PPL) method to one that incorporates linkage disequilibrium information between marker and disease alleles. Huang et al. [56] extended the PPL method to case-control data. These methods maintain all the features of the original PPL method for linkage, namely, they do not require correction for multiple testing and they can sequentially update information across multiple data sets.
Wang and Huang [22] developed two LRT extensions of the HLOD: the LD-Het for general pedigrees and the LD-multinomial for affected sib pair data. Here, LD stands for linkage disequilibrium. Schmidt et al. [57] proposed using a two-stage linkage/association approach for affected sib pair data. Finally, Zhou and Pan [58] used a mixture model to allow for locus heterogeneity in a case-control design.
The purpose of this work is the development of a new test statistic that we call TDT-HET, that allows for locus heterogeneity when applying the TDT statistic. This work is largely motivated by the recent work of Zhou and Pan [58]. As in their paper, our statistic is based on an underlying mixture model. We apply an expectationmaximization (EM) algorithm to compute log-likelihoods of the data under null and alternative hypotheses. The EM algorithm also produces maximum likelihood estimates of parameters such as the probability that a heterozygous parent transmits the disease allele to an affected child, the probability that a trio (mother, father, affected child) is linked to the locus in question, and the probability that certain trio types (determined by the constellation of genotypes) are linked to the locus being studied. In addition, we extend our TDT-HET method to statistic that can evaluate multiple loci jointly. This extension is motivated by and similar to the work of Hoh, Ott, and colleagues. They called their method SumStat [59][60][61][62].
For both single-locus and multi-locus simulations, we evaluate the type I error rate and the power of the TDT-HET method to detect association. In addition, we apply the TDT-HET method to candidate loci from a study of idiopathic scoliosis trios to determine if there is any suggestion of locus heterogeneity at the loci considered, and whether the results suggest evidence for association in the presence of heterogeneity.

Notation
Much of the notation we use comes from the work of Zhou and Pan [58], who developed a test statistic for case-control data that allows for locus heterogeneity. Also, much of the TDT notation comes from the work of Schaid and Sommer [63]. Here we present notation used in the main body of this work. A fuller notation list may be found in the additional file 1, Appendix (Notation section). M = The disease allele at the putative disease SNP locus. N = The non-disease allele at the putative disease SNP locus.
x abc = The trio where parent 1, parent 2, and affected child have a, b, and c copies of the M allele at the putative disease locus (range for all copies: 0 -2). For example, x 222 is the trio with mating type MM × MM and affected child genotype MM. Throughout this work, we will use the notation abc interchangeably with x abc . n abc = The number of trios x abc in the sample. n = The total number of trios in the study. D = Event that the child in a trio is affected. A = Event that individual in a population is affected. j = Pr(A) = Disease prevalence. f i = Pr(A|i copies of M allele in individual's genotype) = Disease penetrances, i = 0,1,2.
R i = f i f 0 , i = 1,2 = genotype relative risks (GRR) [63]. R 1 corresponds to the heterozygote GRR and R 2 to the GRR for disease allele homozygote. We consider three kinds of disease modes of inheritance (MOI) in this work: R 1 = R 2 (dominant); R 1 = 1 (recessive); R 2 1 = R 2 (multiplicative). p = Pr(M) = Disease allele frequency (DAF). q = Pr(N) = 1p = Non-disease allele frequency. t = Pr(heterozygous parent transmits M allele to affected offspring). In this work, the null hypothesis, H 0 , is t = 0.5. The alternative hypothesis, H 1 , is t ≠ 0.5. μ k,i = Pr(Mating type = i|D, pop = k) = probability that the mating type is i given that the child is affected and the trio comes from the k th population, 1 ≤ k ≤ 2. Throughout this work, we shall use the notation k = 1 to indicate that the trio is in the linked population (t ≠ 0.5) and k = 2 to indicate that the trio is in the unlinked population (t = 0.5). Similar to Schaid and Sommer [63], we consider 6 mating types in this work. We recognize that other models, such as those considered by Weinberg and colleagues [64,65], require more than six mating type frequencies. We conjecture that our work extends to such situations. π 1 = Pr(trio is linked to trait locus) = Pr(t ≠ 0.5). In this work we specify that t is the same for all linked trios. This specification is also made for the recombination fraction in some tests of linkage allowing for heterogeneity (see, e.g., work by C. A. B. Smith [1,2] and Ott [38], specifically the method implemented in programs such as HOMOG [38], GENEHUNTER [66], SIMWALK2 [67], VITESSE [68], MERLIN [69], and other programs).
x = Maximum likelihood estimate (MLE) of the parameter x. This MLE is determined by means of the EM algorithm.
z k,j = The indicator variable for population k and trio x j , where the subscript j indicates the j th trio in the sample. τ (r) k,j = r th iteration step estimate that the j th trio is in the k th population, k = 1,2. Without loss of clarity, we will use sometimes write τ (r) k,abc , where abc refers to the trio x abc (see above).

TDT-HET Test Statistic
The TDT-HET statistic is a likelihood ratio statistic. Log-likelihoods under the null hypothesis, H 0 : t = 0.5 or π 1 = 0, and under the alternative hypothesis,H 1 : t ≠ 0.5 and π 1 ≠ 0, are computed by maximizing these parameters for the observed data. We compute the maximum likelihood estimates under H 0 and H 1 using the Expectation-Maximization method [70]. P-values are computed using permutation methods. Full details are provided in the additional file 1, Appendix (TDT-HET Statistic section).
All trios drawn from a population with one set of parental mating types

Single locus
To evaluate the type I error rate and power of the test statistic under different scenarios, we perform simulations. In this section, we describe simulations where we consider type I error rate and power for a single disease locus that has been genotyped. The parameter settings that we consider are presented in Table 1.
We comment that, in item 3 in Table 1, we specify that the disease locus is in Hardy Weinberg Equilibrium.
In our simulations, we use the value p to determine the mating type frequencies. Specifically, we specify random mating in the single-locus simulations, so that the mating-type frequencies μ i are the products of the parental genotype frequencies, which themselves are determined by the allele frequency p according to Hardy-Weinberg Equilibrium. For example, the frequency of the matingtype MN × NN is 2 × (2pq) × q 2 = 4pq 3 , where q = 1p is the frequency of the N allele. Schaid and Sommer provide similar results in their Table 1 [63]. While we do not simulate non-Hardy-Weinberg situations in our single-locus simulations, we do so in our multi-locus simulations (see below).

Multi-locus
To evaluate the TDT-HET statistic for multiple loci, we apply a slight variant of the "SumStat " procedure developed by Hoh, Ott, and colleagues [59][60][61][62]. While these researchers consider sums of ever-increasing number of SNPs, in this work, we consider just the full sum. Specifically, for each of the k loci, 1 ≤ k ≤ L, where L is the number of loci in the simulation, we compute TDT-HET(k), the value of the statistic at the k th locus. We then compute: Empirical significance levels are determined through permutation. Since each locus k has 500 permuted TDT-HET(k) statistics associated with it, we can compute the permuted SumStat statistics for each permutation number (1 to 500) (This value changes to 100,000 for the Idiopathic scoliosis Candidate Loci -see below). The empirical significance level is defined as the proportion of SumStat values that exceed the SumStat value for the observed data. Table 2 lists the parameter settings for our multi-locus simulations (those parameters not listed there are the same as listed in Table 1).
Here, we consider 4 correlated SNPs in each simulation. Mating types for the first locus (labeled MT [1] [i]) are determined using the disease allele frequency p (setting 2b in Table 2). As above, each mating type i = 1, ..., 6 will have its frequency determined using HWE proportions. See above (Simulations -Single Locus) for formulas determining mating type frequencies for the first locus. For each consecutive locus l, 2 ≤ l ≤ 4, the mating type frequencies for the i th mating type is determined in the following fashion. Number of EM steps per starting point 100 Penalty C in EM algorithm (Equation (1)  Note that, if r = 1 (perfect correlation), then the mating type frequencies for each locus are identical. If r = 0 (no correlation), then each locus has mating type frequencies that are essentially random numbers that sum to 1. In the Results section, power is computed at the 1% significance level (see below).

Idiopathic Scoliosis Candidate Loci
We applied our method to a dataset that included selected loci from our published genome-wide association study (GWAS) of adolescent idiopathic scoliosis (AIS) [71]. Briefly, AIS is a common spinal deformity with a prevalence of~3% in school age children worldwide. The underlying genetics of AIS are generally complex and heterogeneity is apparent [71,72]. In the work presented here we selected genotypes for five loci derived in a total of 447 trios (1849 samples) from 447 families that were included in our previous publication [71]. Of the five loci, four (rs1400180, rs10510181, rs1040315, and rs2222973) were selected due to their significance by TDT analysis, their evidence of clustering, and their proximity to genes of potential biological relevance. We also selected an additional locus, rs11770843, because of its proximity to haplotypes previously linked and associated with AIS [73].
While we keep a number of the settings fixed ( Table  1, settings 8-9), we alter the number of permutations per statistic to 100,000. Note that this number is much larger than the number performed in our simulation studies. The reason for this is that we are analyzing far fewer markers here than in our simulations, so time/ CPU constraints are not really an issue. Also, the Sum-Stat P-value is based on 100,000 permutations, since we have 100,000 permutation TDT-HET statistics for each locus.
As a comparison, we compute the TDT statistic [52] as implemented in the PLINK software [74]. We also compute point-wise and family-wise permutation pvalues (labeled Emp1 and Max(T), respectively by Purcell et al. [74]). The Max(T) permutation statistic is based on the maximum observed test per permutation and so accurately reflects the family-wise error rate in the presence of LD.
While this description is for a genome-wide study, we consider only the situation max(T) applied to 5 candidate SNPs. We compare the max(T) statistic to our Bonferroni-corrected maximum TDT-HET SumStat statistic (corrected over 2 chromosomes, since one chromosome has one locus).

Simulations
Null hypothesis (Type I error rate) For the situation where R 2 = 1.0, we present empirical type I error rates at the 5% and 1% levels in Figure 1. The minimum observed type I error rate at the 5% level for TDT-HET is 0.04 (-log(0.04) = 1.46), which occurs for the settings: j = 0.05, π 1 = 0.50, p = 0.10, and the maximum observed type I error rate is 0.07 (-log(0.07) = 1.17), which occurs for the settings: j = 0.15, π 1 = 0.50, p = 0.75. The median type I error rate is 0.05.
Given that the type I error rate is computed over 250 replicates for each simulation vector setting in Table 1, we can use the method implemented in the BINOM program [38] to compute exact 95% confidence intervals for each empirical type I error rate. For the minimum and maximum empirical rates presented above from Figure 1, BINOM indicates that 0.05 and 0.01 are contained in in each respective 95% confidence interval. In addition, in Figure 1 we include linear trend lines using the method implemented in the MS Office 2007 Excel Spreadsheet software. Note that the 5% and 1% trend lines are very close to the constant lines y = 1.30 and y = 2.00, which are the -log-transformed values of 0.05 and 0.01, respectively. This result suggests that the TDT-HET maintains the correct type I error rate under the null hypotheses.
As a confirmation of our simulation code, we comment that the minimum observed type I error rate at the 5% level for TDT is 0.03 (-log(0.03) = 1.55), the maximum observed type I error rate is 0.06 (-log(0.06) = 1.19), and the median type I error rate is 0.05. At the 1% level, the minimum observed type I error rate for TDT is 0.004 (-log(0.004) = 2.40), the maximum observed type I error rate is 0.02 (-log(0.02) = 1.72), and the median type I error rate is 0.01. These results suggest that our simulation code is correctly simulating null data.

Alternative hypotheses (Power) Single Locus
In Figures 2, 3 and 4 we present contour plots of empirical powers of the TDT-HET method for a single locus at the 5% significance level for dominant, multiplicative and recessive MOIs, respectively. The contour plots provide empirical power ranges as a function of the proportion of linked trios π 1 and DAF (p), where settings of the input values are stated in Table 1, items 4 and 5). Other parameter settings that are used in these simulations are also provided in Table 1 Studying these figures, we can draw a number of conclusions. First, we see that, independent of the disease MOI, as the proportion of linked trios π 1 increases, the empirical power increases as well. This result is not surprising. It is interesting to note that power for a fixed DAF is very much dependent upon disease MOI. For example, we see in Figure 2 that empirical power for a dominant MOI tends to be larger when p ≤ 0.50. For a multiplicative MOI (Figure 3), empirical power tends to be larger for 0.25 ≤ p ≤ 0.75. Finally, for a recessive MOI (Figure 4), MOI tends to be larger when p ≥ 0.50.
How does the TDT-HET statistic's power compare with that of the TDT in the presence of heterogeneity? Our previous work determining the non-centrality parameter of the TDT in the presence of heterogeneity [27] allows us to answer this question directly. However, we use TDT empirical power instead to compare "apples to apples" for power. In Figure 5, we present a Box and Whiskers plot [75] that reports a summary of the distribution of differences (TDT-HET empirical power -TDT empirical power) for the various MOIs (Dominant, Multiplicative, Recessive) and significance levels (5%, 1%). TDT empirical power is computed using the method we previously published [27].
While the median power difference is approximately 0 for all six categories, we see that there is a pattern associated with disease MOI. That is, for the dominant MOIs, TDT tends to have larger power than TDT-HET (gray quartile boxes below 0 in Figure 5), while for

Multi-locus
In Figures 6, 7 and 8, we present TDT-HET and TDT SumStat empirical power values (type I error rate values in Figure 6) for parameter settings listed in the Methods. The results presented in Figure 6 (R 2 = 1.0) indicate that our SumStat statistic appears to maintain the correct type I error rate for all simulation parameters considered, with the exception of the settings: TDT SumStat Statistic, π 1 = 0.25, DAF = 0.90, which gives an empirical type I error rate of 0.03 at the 1% level. According to BINOM, the exact 95% confidence interval for this value does not contain 0.01 (the lower bound of the interval is 0.02). However, all other simulations do contain 0.01 in their 95% confidence intervals.
Regarding empirical power, when R 1 = 1.5 (Figure 7), TDT-HET and TDT produce nearly identical powers. This can be seen from the fact that the hollow symbols of the TDT empirical powers do not seem to appear in Figure 7. The reason is that they are covered by the TDT-HET empirical power symbols. When R 1 = 3.0  Figures  6 and 7. We comment that, when R 1 = 3.0, we have very high power at the 1% significance level even with the proportion of linked trios is low (π 1 = 0.25; diamonds and triangles; Figure 8). This result suggests that genotype relative risk can "trump" locus heterogeneity. We have observed this phenomenon in previous studies, where genotype relative risk is the most significant factor in determining power [76], even in the presence of "missing data" (e.g., misclassification errors).

Idiopathic Scoliosis Candidate Loci
In Table 3, we present the results of our TDT-HET analysis for the five candidate loci mentioned in the Methods section. They are: RS1400180, RS10510181 (both on  exact 95% confidence intervals overlap (full results not shown). The one exception is for locus RS11770843 on Chromosome 7. For this locus, the upper bound of the exact 95% confidence interval of the TDT-HET permutation p-value as computed by BINOM is 5.6 × 10 -5 , while the lower bound of the exact 95% confidence interval of the PLINK TDT permutation p-value (Perm01) is 9.1 × 10 -5 . This result suggests that, for this marker locus, the TDT-HET has slightly more power.
As for the multi-locus results, the situation is quite similar. The Bonferroni corrected minimum p-value of the TDT-HET SumStat statistic is 0.00, on Chromosome 21. The upper bound of the exact 95% confidence interval is 3.0 × 10 -5 . The lower bound of the exact 95% confidence interval for the minimum max(T) p-value is 2.8 × 10 -5 , indicating that the p-values overlap. Thus power for each method is equivalent for this data set. While additional studies need to be performed, this result suggests that the SumStat method for TDT-HET may not be as advantageous when loci are in HWE and/or are in linkage disequilibrium.
If there is no gain in power for the TDT-HET method over the standard TDT method, what is its utility? We suggest that the value comes from the estimates of the transmission probability, the proportion of linked trios, and most especially, the estimates of the probabilities that each of the trios is linked to a particular locus. Similar information is available for the HLOD statistic in that we may obtain probability estimates that each family is linked to a particular locus [38]. We demonstrate this utility for locus RS2222973, on Chromosome 21. In Table 4, we present the posterior probability estimatesτ (r) 1,abc that each of the 10 coded trios abc (see Table 5) is linked to the locus. We note that in Table 3, the estimate of the overall proportion of linked trios, π 1 , is 0.87. As Ott points out [32] and [38] (page 224), this value can be used as a cutoff for determining coded trios that are linked to the locus as compared with coded trios that are not. Specifically, if τ (r) 1,abc > π 1 , then we conclude that the coded trio x abc is linked to the locus. Similarly, ifτ (r) 1,abc < π 1 , then we conclude that the coded trio x abc is unlinked. The results in Table 4 suggest that the coded trios x 000 , x 100 , x 110 , x 201 , x 211 , x 222 are linked to locus RS2222973, while the remaining 4 coded trios x 101 , x 111 , x 112 , x 212 are not. Of the coded trios for which at least one parent is heterozygous (i.e., either a or b equals 1), the linked coded trios are defined by the fact that the affected child always receives the "1" allele from a heterozygous parent. This result suggests that, for this locus, the "1" allele is the risk allele.

Discussion
In this work, we present a mixture model of linked and unlinked trios and develop a statistical method to estimate the probability t that a heterozygous parent transmits the disease allele at a di-allelic locus, as well as the probability π 1 that any trio is in the linked group. The null hypothesis is that t = 0.5. The purpose here is the development of a test, the TDT-HET, which extends the classic transmission disequilibrium test (TDT) to one that accounts for locus heterogeneity. Our results suggest that use of permutation p-values enable us to correctly maintain correct type I error rates at the 5% and 1% significance levels. Power simulations using disease MOIs suggest that power can be disease model dependent, with the TDT being slightly more powerful for dominant MOIs, and the TDT-HET being more power for recessive MOIs. Also, we find that our statistic can have high power, even in the presence of locus heterogeneity, when the GRR is larger. It is interesting to note that the value of the TDT-HET statistic and the corresponding permutation p-value appears to be about the same as that of ordinary TDT for the Idiopathic scoliosis Candidate Loci data set even though results of the TDT-HET analysis suggest that there is locus heterogeneity for several loci. Based on our simulations, we might conjecture that the singlelocus MOI for each SNP is multiplicative.
We computed parameters for the situation where linked and unlinked trio types come from populations with different sets of parental mating type frequencies, but apart from determining the r th iteration step estimates, we did not investigate this form of the TDT-HET statistic further. Given the extensive amount of work already present, we consider this work to be beyond scope of the present manuscript. We plan to follow up this research and report our findings in another manuscript.
As noted in Results, Idiopathic Scoliosis Candidate Loci section, Ott documents that a decision rule for determining whether a particular trio type x abc is linked to a locus is seeing whether the inequality: is satisfied. Having said that, Terwilliger and Ott [77] report that, for linkage, the conditional probabilities "... should be taken with a grain of salt, and they cannot ever be validly used to separate families for the remainder of a linkage study. It should be required that any further marker typings be done on all families combined..." Their rationale for this statement is that selectively typing only linked families would introduce bias and increase the type I error rate of the linkage statistic. However, this book was published in 1994, even before  Tables 1 and 2. In this figure, solid symbols represent TDT-HET SumStat empirical powers; hollow shapes represent TDT SumStat empirical powers. More specifically: Solid diamonds = TDT-HET SumStat empirical type I error rate at 1% significance level when π 1 = 0.25. Solid squares = TDT-HET SumStat empirical type I error rate at 1% significance level when π 1 = 0.75. Hollow diamonds = TDT SumStat empirical type I error rate at 1% significance level when π 1 = 0.25. Multiplication signs = TDT SumStat empirical type I error rate at 1% significance level when π 1 = 0.75. the advent of SNPs. We are now producing next generation sequence data, so that the causative variant may well be typed in the first set. It remains an open question whether one can use the parameter estimatesτ (r) 1,abc to find trios that contain the causative variant(s). We recognize that there are situations where parameter estimation may be quite difficult. Vieland and Logue [78] documented that when the genetic models at linked and unlinked loci differ, maximizing the HLOD yields incorrect parameter estimates. These authors found that the admixture parameter a does not even measure the proportion of linked families within the sample, as is commonly supposed.
We conjecture that having additional information on the posterior probabilitiesτ (r) 1,abc may increase the probability of correctly identifying linked trios. One of the advantages of the TDT-HET statistic is that it provides estimates that each of the 10 types of trios (Table 5) is linked/unlinked. We can use this information to create a decision rule about whether a particular trio type is linked (i.e., harbors the disease allele). One possible decision rule is the inequality documented by Ott [38] and listed above (2). Ott reports that, for linkage analysis allowing for locus heterogeneity, a decision rule for determining whether a particular family is linked to a locus is checking whether the posterior probability that the family is linked is larger than or equal to the overall estimate of the proportion of linked families. We can extend this rule to our work by making the decision rule be that a trio type x abc is linked to a locus if and only if the inequality is satisfied. Here r is the iteration step such that the log-likelihoods are less than the stopping criterion.
This decision rule potentially reduces the number of trios that we need consider when looking for linked trios. We can further reduce the number of trios considered by adding the condition that we only consider trios in which at least one parent is heterozygous. Thus, the two decision rules we consider here for selecting linked trios using the TDT-HET statistic are: (i) all trios that satisfy inequality (2); and (ii) all trios for which at least   Tables 1 and 2. Solid diamonds = TDT-HET SumStat empirical power at 1% significance level when π 1 = 0.25. Solid squares = TDT-HET SumStat empirical power at 1% significance level when π 1 = 0.75. Hollow diamonds = TDT SumStat empirical power at 1% significance level when π 1 = 0.25. Multiplication signs = TDT SumStat empirical power at 1% significance level when π 1 = 0.75. The headings for each of the columns are defined as follows: Chr = Human chromosome on which locus is located. Locus = Particular SNP genotyped in idiopathic scoliosis trios. BP = Base pair position of Locus. This position is based on the human reference sequence (NCBI Build 36.1/HG18). TDT-HET = Value of the TDT-HET statistic for particular locus genotype data in idiopathic scoliosis trios. P-value (Perm) = P-value of corresponding TDT-HET statistic, based on 100,000 random permutations. For a description of how the permutation p-value is computed, see Methods, P-values by permutation. t = EM-Algorithm estimate of the probability, t, that a heterozygous parent transmits a "1" allele. π 1 = EM-Algorithm estimate of the probability, π 1 , that a trio is linked to the locus in question.
TDT-HET SumStat = ∑ k TDT-HET (k), where k indexes the set of all loci on a chromosome and TDT-HET (k) is the value of the TDT-HET statistic at the particular locus. For example, in Table 3

t = T T + NT
= The maximum likelihood estimate of the probability, t, that a heterozygous parent transmits the disease allele. Here, T is the number of times a heterozygous parent transmits the disease allele, and NT = the number of times a heterozygous parent does not transmit the disease allele. It has been shown that, for the likelihood form of the TDT, this value is the maximum likelihood estimate of the transmission probability (see, e.g., [81][82][83]). Max(T) P-value (Perm02) = Permutation p-value computed by PLINK that controls the family-wise type I error rate. For more information, see Methods, Idiopathic Scoliosis Candidate Loci. one parent is heterozygous and that also satisfy inequality (2).
For the TDT statistic, our analogous decision rules are: (i) all trios; and (ii) all trios for which at least parent is heterozygous.
We plan to perform an extensive analysis to evaluate the empirical probabilities that each statistic can correctly identify linked trios. We can simulate linked and unlinked trios using the method implemented in the FASTSLINK software [79,80]. We can use different genetic model parameter settings, specifically, settings in which the genetic effect is small/large. Since FASTLINK produces pedigree files that indicate which pedigrees are linked or unlinked, we can directly test our decision rules. This is work in progress.
Given that next generation sequencing data applied to families is bound to identify large amounts of locus heterogeneity, any methods that increase the probability of identifying true disease variants should be welcome. We realize that, though, the probabilities of correctly identifying linked trios may be dependent upon the true proportion of linked trios. One way we can reduce heterogeneity is to look at larger family sizes. We plan to apply our statistic to such families and investigate its performance.

Conclusions
Motivated by the recent work of Zhou and Pan [58], we have developed a TDT statistic, TDT-HET, that allows for locus heterogeneity among coded trios. This method is an extension of TDT, in that our simulation results suggest it has approximately the same power as the original TDT. Results of our simulations suggest that our method maintains correct type I error for the null hypothesis (R 1 = 1.0). Benefits of our method include: estimates of parameters in the presence of heterogeneity (e.g., the proportion of linked coded trios, the posterior probabilities that a particular trio type is linked to a locus), and reasonable power even when the proportion of linked trios is lower. Also, we have extended Hoh, Ott, and colleagues' SumStat method to TDT-HET. The parameter estimation above, particular, estimation of the probability that a trio is linked will be useful as we enter the age of next-generation sequencing, where one can expect extensive levels of locus heterogeneity given the rare disease frequencies.

Additional material
Additional file 1: Appendix. Full details of the derivation of the TDT-HET statistic, including notation.  In this table, the high risk allele is M. Also, we define D to be the event that the child is affected. Note that 1 ≤ k ≤ 2. The last column is computed using the definition of conditional probability. Schaid and Sommer [63] also demonstrated this calculation. Note that Pr(x abc |D, pop = k) = f B (x abc ; θ k ) . Finally, t = Pr (heterozygous parent transmits an M allele to an affected child).