 Methodology article
 Open access
 Published:
Using the longest significance run to estimate regionspecific pvalues in genetic association mapping studies
BMC Bioinformatics volume 9, Article number: 246 (2008)
Abstract
Background
Association testing is a powerful tool for identifying disease susceptibility genes underlying complex diseases. Technological advances have yielded a dramatic increase in the density of available genetic markers, necessitating an increase in the number of association tests required for the analysis of disease susceptibility genes. As such, multipletests corrections have become a critical issue. However the conventional statistical corrections on locusspecific multiple tests usually result in lower power as the number of markers increases. Alternatively, we propose here the application of the longest significant run (LSR) method to estimate a regionspecific pvalue to provide an index for the most likely candidate region.
Results
An advantage of the LSR method relative to procedures based on genotypic data is that only pvalue data are needed and hence can be applied extensively to different study designs. In this study the proposed LSR method was compared with commonly used methods such as Bonferroni's method and FDR controlling method. We found that while all methods provide good control over false positive rate, LSR has much better power and false discovery rate. In the authentic analysis on psoriasis and asthma disease data, the LSR method successfully identified important candidate regions and replicated the results of previous association studies.
Conclusion
The proposed LSR method provides an efficient exploratory tool for the analysis of sequences of dense genetic markers. Our results show that the LSR method has better power and lower false discovery rate comparing with the locusspecific multiple tests.
Background
Recently whole genome association studies (WGA) with high density SNP data are becoming popular due to new technology in genotyping (e.g., Affymetrix and Illumina) [1–4]. Optimal study design in whole genome association remains unresolved although the twostage association test design has gained popularity [5, 6].
In order to improve power, many samples have been used in WGA studies; however the cost for such studies is still expensive even though the expense of genotyping has dropped significantly. One problem common of WGA studies is a high false positive rate if the direct method, based on the simple chisquare test for association, is performed without any correction for multiple testing. By far, the most commonly used remedy is the Bonferroni approach. However, its overly conservative correction might result in power reduction. There have been several methods proposed to circumvent the problem of the over correction of multiple testing procedures, e.g. false discovery rate (FDR) methods [7–9] that control the expected proportion of false rejections, or False Positive Report Probability (FPRP) [10] which requires explicit consideration of the prior probability for each hypothesis under association. Although the FPRP method aims to provide a higher power than traditional association test, the need for priors is not always possible. The FDR controlling method was originally designed for multiple comparisons of independent tests [7], and was then later extended to dependent cases under certain conditions, e.g. assuming the test statistics are either equally positively correlated and normally distributed, or having satisfied socalled the positive regression dependency [11]. As for the high density SNP data that were considered in this study, the aforementioned assumptions need to be justified in advance.
Most direct association tests are locusspecific and therefore seldom account for the association of different markers. Different markers usually have heterogeneous genetic backgrounds, such as allele frequency and marker characteristics. Singlepoint inference sometimes yields a misleading conclusion for an overall phenomenon. Methods considered multiple markers simultaneously and included logistic regression analysis, haplotype analysis [12, 13], global significance method [14], multivariate association analysis [15], and the consideration of genegene interactions [16].
Instead of testing the significance of a single marker, our novel method tests the significance of all markers within a defined region, and therefore can be regarded as a simultaneous test for multiple markers that account for the dependence of close genetic markers.
The proposed "longest significant run" (LSR) method is a twostage procedure. The first stage conducts conventional association tests, such as the chisquare test for the casecontrol design or the transmission disequilibrium test [17] for the familybased design. Based on the prespecified size of a given test, the pvalue of each test is converted into a zero/one indicator (1 for significance or 0 otherwise). In the second stage, this binary sequence is scanned for the longest region of consecutive 1s (hence "longest significant run") and the results determine whether or not the run is inordinately long or simply a random pattern.
This regionspecific testing procedure is motivated by the dependence of association tests on dense markers: if a disease susceptibility gene lies in a specific region, then the disease gene and the nearby markers will show a relatively positive trend of association (i.e., linkage disequilibrium (LD)). A special nonrandom pattern (i.e., a cluster of positive signals) indicates that disease genes may be included in the candidate region. The evidence supporting such a nonrandom pattern is then evaluated with the magnitude of the longest run of consecutive significance.
There is a long list of applications for longest run statistics [18, 19], one of which concerns the alignment and testing of the homology among DNA sequences. Considering two aligned sequences with length n, a match of locus between the sequences is assigned to be "1", and a mismatch is "0". More homologous sequences should have larger longest matching subsequences (run) than others [20], and its corresponding probability is used as an important reference to the homology between two sequences.
Given a similar concept, we use the longest significance run to find a region that is most likely to harbour a disease susceptibility gene in association studies. The dependent structure of the binary sequence is obtained from association tests by considering it as a discretetime Markov chain model. Using extensive simulations, we demonstrate that the LSR approach provides a reasonable model of dependence for association test results whereby the falsepositive and falsenegative rates are all controlled effectively. The program in R code for LSR method is available (Additional File 1).
Results
The program SIMLA version 3.1 (SIMulation of pedigree data for Linkage and Association studies)[21] was used to simulate familystructure genetic data. The program is available [22]. In the process, we generated 1000 trio families with genotypic data for SNP (single nucleotide polymorphism) markers. To determine the effects of the number of SNP markers, we considered two cases for 50 and 100 SNPs in a candidate region. To mimic the scenario of dependent markers, we considered a dense intermarker distance with a mean of 6 kb and a standard deviation of 10.7 kb. Note that in 500 K Affymetrix data, marker distance has an average of 5.8 kb and a standard deviation of 10.7 kb. In the simulation, the intermarker distance is generated from a lefttruncated normal distribution with mean 6 kb and standard deviation 10.7 kb. Under the setup, the distribution of intermarker distance is similar to that of real Affymetrix data in the aspects of mean and standard deviation. Furthermore, the prevalence of the disease in the simulation was set as 0.1, and genetic relative risk was set to be 2.0, 2.5 and 3.0. The allele frequencies were 0.075 and 0.1, and an additive disease model was assumed. Note that the use of high relative risk is for observing contrasting powers between the methods under the stringent definition we used for detecting true positive markers (described in next sections). Under a looser definition, the relative risk can be lowered to 1.5 or less.
In the simulation, the location of disease gene was assigned to be within the two central markers, and the disease haplotype was comprised by 15 markers clustering around the disease gene. We had also tried randomly assigning the location of disease genes to be within any nonboundary markers, and found similar results. We considered three haplotype scenarios: (1) three common haplotypes each with a frequency of 30%, (2) two common haplotypes each with a frequency of 45%, and (3) two haplotypes with frequencies of 85% and 5%. Since the results of the three scenarios were similar, we present the scenario of threecommonhaplotype.
During the first stage of the LSR method, the transmission disequilibrium test from the TDT procedure of SAS/GENETIC package release 8.2.39 [23] was used to test the locusspecific association for each marker. The output pvalues {{\widehat{p}}_{j}, j = 1, Λ n} were transformed into a sequence of significance indicators {X_{ j }, j = 1,Λ,n} based on a test size α_{1}, where X_{ j }is 1 if the test of the j^{th} marker is significant and 0 otherwise. In the second stage, we scanned this binary sequence to identify the longest perfect run of 1, L_{0}, the nearly perfect run of 1 allowing one zero within, namely the L_{1} and the run of 1 allowing two zeros within, namely the L_{2}. For each of these three LSR statistics, the corresponding tail probabilities were calculated using the method of Chang et al. [20].
Testing statistics
Although allowing more interruptions (larger k) seems to make the LSR approach more plausible, it is helpful only if it can bridge up nearby but separated clusters of "1's". On the other hand, containing irrelevant markers may complicate further analysis to identify the target gene. To compromise the situations we defined the testing statistics, LSR_{ {k} }, as L_{ i }with minimal tail probability, for i from 1 to k and considered only kϕ2. In the simulation, we compared our method with the Bonferroni method, the popular correction used for multiple tests, by evaluating three quantities, namely the falsepositive rate, power, and false discovery rate in the following three sections.
Falsepositive rate
According to our setup in the simulation, markers in a sequence can be divided into three categories: 2 target markers where the disease gene lain between them, 13 nearby markers in linkage disequilibrium (LD) with the 2 target markers, and other markers that are in linkage equilibrium (LE). For the purpose of stringency, a true positive detection was considered only if at least one of the two target markers was significant under the criterion of each method. For Bonferroni's method, the criterion is that a target marker's pvalue < α_{2}/n. For the FDR controlling method, Benjamini and Hochberg [7] proposed the following criterion. The n single marker pvalues were sorted from smallest to largest: P_{(1)} ⋯ P_{ (n) }. Starting from P_{(n)}, we compared P_{(i)}with i·α_{2}/n. This process was continued as long as P_{(i)}> i·α_{2}/n. If k is the turn around point, then significance is declared if a target marker corresponds to the k^{th} smallest pvalue. For our LSR method, the criterion was that a target marker is covered by a significant LSR (pvalue< α_{2}). On the other hand, a false positive case was considered if a LE marker was detected under each of the three criteria described above. If a nontarget LD marker was selected, neither a false positive nor a true positive would be counted.
The falsepositive rate is the probability that a test mistakenly rejects the null hypothesis. In our case, it is the rate at which a test falsely detects disease susceptibility genes where none exists. In our first part of the simulation, no disease gene was assumed. Table 1 lists the corresponding falsepositive rates for the two LSR approaches and Bonferroni approach based on 200 replications for each of the eight scenarios (2 disease gene allele frequencies, 2 relative risk, 2 marker numbers). The test size in the first stage was α_{1} = 0.1, and the test size was α_{2} = 0.05 in the second stage. We found that all the falsepositive rates were under or similar to the prespecified nominal test size α_{2}, indicating that all the methods adequately controlled the falsepositive rate.
Power and false discovery rate
Statistical power is the probability that a test correctly rejects the null hypothesis. In our case, it is the probability that a method correctly accesses at least one of 2 target markers with statistical significance. On the other hand, the false discovery rate is the fraction of false positives among all tests declared significant.
Figures 1, 2, 3, 4 illustrate the performance of Bonferroni and LSR methods given different nominal test sizes. The test size in the first stage for LSR was α_{1} = 0.05 and 0.1 and the test size in the second stage was α_{2} = 0.05. Higher level of significance (0.1) at screening stage resulted in better power than that of α_{1} = 0.05, in trade of a slightly higher false discovery rate. LSR_{2} had better performance in both power and false discovery rate than LSR_{1}. The stringent definition of a "run" for LSR_{1} yields a relatively narrow candidate region that was more likely to miss the target markers. Both LSR_{1} and LSR_{2} had better power and false discovery rate than those from the Bonferroni approach among all the scenarios that we tried. LSR_{1} has similar power to that of the FDR controlling method, but better false discover rate. In summary, in terms of power, LSR_{2} > LSR_{1}~ FDR controlling method > Bonferroni; and in terms of false discovery rate, LSR_{2}~LSR_{1} ~<FDR controlling method < Bonferroni. The detailed results are available in Tables 2 and 3.
The good performance of LSR does not indicate that it can completely replace the markerspecific test procedures like the Bonferroni approach. Rather, we consider LSR as a useful screening tool to find a smaller region that possibly contains the disease susceptibility gene. After the region is identified, each maker within the region still needs to be examined biologically. At this stage, LSR_{2}, although better than LSR_{1} with respect to both power and false discovery rate, pays a token for retaining more makers for further examining. Analogue can be inferred in comparing the performance of using α_{1} = 0.1 over α_{1} = 0.05 as the threshold in the first stage of LSR, where the former usually includes more markers.
The tail probabilities of LSR were calculated under the null hypothesis of no disease marker. In the simulation, Markov independence was assumed for the binary sequences under null hypothesis. Nonetheless, the method can also be applied to the scenario of first order Markovdependency, if the corresponding transition matrix can be assumed or estimated. We provide a robust approach to estimate the dependency structure of the sequence using the concept of sliding window, which is described in the method section. The method was applied on the following authentic data as illustration.
Demonstrating the applications of our method to two authentic data sets
Example of psoriasis data
We assessed the practical application of the LSR method using an authentic genetic data set collected for a psoriasis study [24]. Psoriasis is a common chronic skin disorder characterized by inflammation and scaling. Recent studies indicate a significant association of important psoriasis predisposing loci with chromosome 17 [24, 25]. Helms et al. [24] collected 242 European nuclear families comprising 572 psoriasis cases and genotyped 123 genetic markers on chromosome 17q25. The familybased association test TDTAE [26] was applied to locate the psoriasis susceptibility genes.
In the first stage of LSR procedure, we used α_{1} = 0.05 as a test size to convert the sequence of pvalues into a binary sequence. Under null hypothesis assuming Markov independency, the LSR obtained a significant pvalue <0.001, and the region of 12 consecutive markers identified by L_{0} captured two functional genes SLC9A3R1 and DKFZPP564C103, which are known to be psoriasisrelated. The results are consistent with the finding of Helms et al. [24].
However, when we proceeded with the sequential goodnessoffit tests proposed by Anderson and Goodman [27], the hypothetical assumption of {X_{ j }} being an independent sequence was rejected thereby supporting the hypothesis of it being a first order Markov dependence. Therefore, a sliding windows approach was applied with size of 40 markers and sliding distance of 10 markers to estimates of the transition matrix under the hypothesis of Markov dependency. The same region was identified by LSR with pvalue 0.007 under the dependency assumption of null hypothesis. Similar results were produced when minor variations in window size and sliding distance were used.
In this example, L_{1} and L_{2} did not further extend the region of L_{0}, therefore, L_{0} is sufficient to confirm this candidate region. On the other hand, in order to attain approximately evenly spaced intermarker distance for simplecount approach and avoid combining two distinct chromosome regions, we tried only using the results from the 5^{th} to the 86^{th} loci among the 123 markers. Moreover, we also excluded results from marker loci with less than 50 trios in the analysis to preclude possibly unreliable association results. It resulted in 78 loci included in the final analysis. The consequent analysis also came out with the same significant region with a pvalue <0.001. Of particular note, all of the 123 markers had pvalue >0.0001, therefore, none of them was significant after direct Bonferroni adjustment.
Example of asthma data
Asthma is a common chronic disease characterized by airway inflammation resulting in some symptoms such as the difficulty of breathing, attacks of wheezing and coughing, etc,. In the positionalcloning study of Allen et al. [28], there were 224 families containing 239 asthma children and 79 markers covering a region of 384 kb on chromosome 2q14. Immunoglobulin E often causes asthmatic inflammation and has been recognized to be an important concomitant factor of children asthma. Allen et al. [28] conducted the transmission disequilibrium tests to assess the relationship between marker loci with asthma and immunoglobulin E.
Using test size α_{1} = 0.05, the sequence of pvalues was converted into a sequence of twentythree 1s and fiftysix 0s:
X = [0000000011111110111110010011000100000101101011001000000000000000000000000000000].
The goodnessoffit test suggested that the sequence follows a first order Markov chain. In the second stage of LSR method, we estimated the transition probabilities η_{00} and η_{11} to be 0.647 and 0.7, respectively, and identified L_{1} (from marker 543WTC21P at 191388 to 543WTC91P at 252279, details are listed in Additional File 2) with length 13 and yielded the corresponding pvalue 0.0002. Note that L_{1} contains a significant candidate region of 60891 bp closed to an asthmasuspected gene DPP10 with functions of catalytic activity and dipetidylpeptidase IV activity. The conclusion is consistent with Allen et al. [28] who found the highly significant STRP marker D2S308 at 261056 in the vicinity of our identified region.
Discussion
As discussed by Rosenthal [29] and Zaykin et al. [30], a series of borderline significant results may together suggest significance. Therefore, it is likely that two consecutive pvalues of 0.06 may suggest evidence against the null than one isolated pvalue of 0.05. This phenomenon is often observed in the identification of candidate regions of complex disorders if the marginal effect of a disease allele is modest or minor with a few adjacent loci that are in LD. In this study, we propose the longest significant run, LSR, to estimate regionspecific pvalues while searching for disease susceptibility genes in gene mapping studies. The method transforms the pvalues from locusspecific association tests (e.g., the transmission disequilibrium test or any other association test) to a binary sequence with the value '1' representing significance and '0' otherwise, and determines the LSR to identify the location of the target disease susceptibility gene. The statistical significance of LSR method can be accessed by imbedding the sequence onto a Markov chain. A sequential goodnessof fittest [27] can be used to justify the assumption that a sequence of indicator of pvalues is Markov chain, simulations assuming no disease gene were carried out, and the test did not reject the null hypothesis of Markov independence 99.8% of the time (998 out of 1000). On the other hand, in the simulations assuming a disease gene with allele frequency of 0.075 and RR = 2.5 imbedded in a region flanking with dense markers, the test rejected the null hypothesis of Markov independence 92.4% of the time (924 Out of 1000) but did not reject the null hypothesis of first order of Markov chain 74.4% of the time (744 out of 1000). Likewise, the test did not reject the null hypothesis of first order Markov chain 82.2% of the time (822 out of 1000), assuming a disease susceptibility gene with allele frequency of 0.1. The allele frequencies in our simulations were assumed for the diseasecausing alleles. The allele frequency for most markers in genomewide association study is higher than 0.05, therefore we chose to use diseasecausing allele frequencies to be 0.075 and 0.1. Other frequencies were also tested and found that the power was positively correlated with the allele frequency. This disease related marker was removed in all our subsequent analyses. Only results from additive disease model were reported since it was more plausible for complex diseases; we also examined dominant model which showed higher power and recessive model which showed lower power.
The LSR method avoids the controversial multipletest adjustment required for locusspecific association tests, and thus has potential applications in exploratory data analysis. For example, if there were a few significant markers in a genomewide association found by using single locus test [31], our proposed methods can be applied to screen out the most likely region for further biological examination. However, for an isolated significant marker, it is likely that the LSR method will miss the signal. Therefore, singlemaker method and LSR method should be considered complementary.
In addition to the perfect LSR, we also propose a "nearly perfect" LSR in which a few interruptions (insignificance) are allowed within the run. Our simulation results suggest that the LSR s which allow for one or two interruptions within the run are generally more flexible and have some gain of testing power. The approach can be easily extended to allow more interruptions by modifying the imbedded transition matrix in Equation (5) in the method section.
An advantage of our proposed method relative to testing procedures based on genotypic data is that only pvalue data are needed. This method can be adapted extensively to different study designs and testing procedures if reliable pvalues are provided. An important application is metaanalysis which combines pvalues from different studies. Due to the high sensitivity of LSR method, many published results with summarized pvalues originally reported as insignificant can be reanalyzed by this method for more convincing conclusions.
To explore the genetic structure of human genome, we downloaded the first 3,000 markers of chromosomes 1, 5, 10, 15 and 20, respectively, from the HAPMAP website [32]. According to Gabriel's criterion (95% CI of D' = 0.7, 0.98) [33], there are 672 LD blocks among the 15,000 markers. The mean frequency and standard deviation of the top haplotype are 0.55 and 0.19. They are 0.24 and 0.09 for the second haplotype and 0.12 and 0.07 for the third one. If the top two or three haplotypes are defined as the "common haplotypes", about 62% of the 672 LD blocks have a cumulative frequency of 90% and about 99.4% of the blocks have a cumulative frequency of 50%. To compare the powers of our proposed method between optimal and modest LD scenarios, we used two settings of haplotype frequencies, (85%, 5%) and (30%, 30%, 30%) for the former and a setting of (35%, 10%, 5%), where the frequencies were roughly those mean frequencies minus one respective standard deviation, for the latter. As expected, the former scenario with optimal LD block resulted in better power for all methods in our simulation than the latter with modest LD block. The power of LSR_{2} for this modest LD scenario is about 77% assuming 0.1 disease allele frequency, RR = 3 and α_{1} = 0.1. The power is about 23% for the Bonferroni procedure. The results for the different LD scenarios comprised of 3 common haplotypes are also presented in Figure 5.
We further compared the power of our proposed methods with those of haplotype analyses carried out by the FBAT program [34] assuming RR = 2 and disease allele frequency of 0.075. For the optimal LD scenarios, the multipledegree of freedom (mdf) and singledegree of freedom (1df) haplotype tests yield powers of 0.46 and 0.232, respectively, for the setting of (85% and 5%). The powers for mdf and 1df are 0.35 and 0.176, respectively, for the setting of (30%, 30%, 30%). For the modest LD scenario of (35%, 10%, 5%), the powers for mdf and 1df are 0.18 and 0.126, respectively. The power of LSR_{2} is 0.44 with α_{1} = 0.1 and the power of Bonferroni procedure is 0.15. Therefore, the power of LSR_{2} is about the same as the power of haplotype analyses under optimal LD scenarios but outperforms under modest LD scenario. Powers using other values of RR were also calculated. In general, powers of both haplotype analyses and LSR increase assuming higher values of RR.
Our simulation study demonstrates that all the methods adequately control the falsepositive rate, however, the LSR methods, in particular, LSR_{2}, had better performance in both power and false discovery rate than those for the Bonferroni and FDR controlling methods. Moreover, as the number of markers increases, the power of both the Bonferroni and FDR controlling method drops significantly, whereas the power of LSR remains high and may even increase slightly.
Further studies for potential extensions of the LSR method are currently under development. Firstly, the LSR method transforms continuous pvalues to a dichotomous random sequence of 0s and 1s, which may cause a loss of information. This limitation implies that some isolated disease susceptibility loci cannot be identified by the LSR method. A model based on a more complex continuousstate stochastic process constitutes an alternative that would exploit more information. Secondly, in our simulation, we assumed that there is only one disease susceptibility gene in the region of interest. However, the LSR method will miss some isolated target genes when a region contains two or more. One remedy is to extend the LSR theory to incorporate the second or third longest significant run.
The remarkable advances in genotyping technology have facilitated the pursuit of genomewide association mapping. The increased density of available SNP markers provides finer resolution and higher statistical power for gene mapping. For example, the average distance between two SNPs using the 100 K assay of Affymetrix [35] is 24 kb. The intermarker distance is only 5.8 kb for the 500 K chips. We foresee that the multiple tests correction will remain a crucial issue as the number of genetic markers becomes very large. Our method can be applied to such data in combination with the sliding window method to screen out candidate regions associated with the disease genes.
Conclusion
In summary, the LSR method provides an efficient exploratory tool for the analysis of sequences of dense genetic markers thereby complementing current locusspecific methods. Our simulation study demonstrates that the LSR method has reasonable statistical power and avoids the overcorrection problem that plagues most of the locusspecific methods. When applied to actual genetic data, the LSR method successfully confirmed the location of two important psoriasisassociated genes and an asthmarelated gene. The application to genomewide screening studies may further enhance the proposed LSR method.
Methods
Suppose that there are a total of n genetic markers in a study. The LSR method contains two main stages, the first of which transforms the results from n locusspecific association tests into a sequence of significance indicators and the sequence is used in the second stage to locate the LSR and compute its tail probability.
The firststage procedure
Association tests are applied to examine the significance of associations between genetic markers and a disease susceptibility gene, and the corresponding pvalues (significance probabilities) of the tests are denoted by {{\widehat{p}}_{j}, j = 1, Λ n}. Under the prespecified test size, α_{1}, we consider an indicator function of the test on marker j: X_{ j }= I[{\widehat{p}}_{j}, <α_{1}] where I [A] = 1 if event A holds, and I [A] = 0 otherwise.
In the following hypothetical example of 15 genotyped genetic markers (n = 15), the corresponding pvalue vector of 15 association tests is \widehat{P}=[{\widehat{p}}_{1},\cdots ,{\widehat{p}}_{15}] = [0.30, 0.04, 0.03, 0.04, 0.12, 0.04, 0.02, 0.01, 0.35, 0.50, 0.02, 0.03, 0.04, 0.45, 0.04] Based on the setting of α_{1} = 0.05, we obtainX = [X_{1}, L, X_{15}] = [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1],
where markers 2, 3, 4, 6, 7, 8, 11, 12, 13, and 15 show significant association with the disease susceptibility gene.
The secondstage procedure
Before applying the LSR method, the assumption of Markov chain for the sequence of the indicator of pvalues can be checked by carrying out the goodnessoffit test of Markov Chain [27]. Assuming that the binary random sequence {X_{ j }, j = 1,Λ,n } to be a first order Markov chain stochastic process satisfyingPr{X_{ j }= s_{ j } X_{j1}= s_{j1},Λ, X_{1} = s_{1} = Pr{X_{ j }= s_{ j } X_{j1}= s_{j1}},
where s_{ k }∈ {0,1} = insignificance, significance}, k = 1,Λ, n . Let the initial probability be Pr{X_{1} = 1} = η_{1} and Pr{X_{1} = 0} = 1  η_{1}, then the transition probability matrix is
where η_{00j}+ η_{01j}= 1 and η_{10j}+ η_{11j}= 1, j = 2,Λ,n. This model assumes conditional independence and is well established in a variety of fields.
An inordinately long LSR indicates possible associations between markers and a disease susceptibility gene. Due to random error, low heterozygosity or other unknown reasons, not all markers linked to the disease gene show positive association. Hence, it is sensible to allow for a few interrupting 0s (insignificances) in a run. We denote the more flexible run as a "nearly perfect run" in contrast to a perfect run that does not contain 0s. A unifying notation of the length of an LSR is L_{ k }, where the subscript k denotes the number of interrupting 0s. When k = 0, the case reduces to a perfect run. In the example in Equation (1), L_{0} = 3, L_{1} = 7, L_{2} = 8 and L_{3} = 12.
Consider the following hypotheses
H_{0}: "there is no disease susceptibility gene in the sequence of markers", versus
H_{1}:"there is at least one disease susceptibility gene in the sequence".
According to the argument above, an inordinately long LSR (i.e. large L_{ k }) indicates a nonrandom cluster of significant associations with nearby markers, suggesting that disease susceptibility genes may exist in the candidate region. For a prechosen k, an intuitive rule to reject H_{0} is L_{ k }> m*, where the constant m* is the minimum integer of m satisfying the following inequalityPr{L_{ k }≥ m  H_{0}} ≤ α_{2}
for a test size α_{2}. To determine the critical region, we need to calculate the tail probability of testing statistic L_{ k }.
Distributions of LSR statistics
There has been a long history of studies on the distribution of the length of the longest run. Erdõs and Révész [36] considered a binary sequence from a cointossing game, with outcomes of head (coded as "1") and tail ("0"). The pvalue of L_{0} of LSR is analogous to the tail probability of their longest headrun. Fu and Koutras [19] have provided an algorithm to calculate the exact probability of L_{0}. Arratial et al. [37] and Karlin et al. [38] have provided asymptotic results on the distribution L_{ k }, however, the former method [37] is valid only for independent sequences, whereas the latter [38], although allowing dependency, yields a large bias for the estimation of tail probability when n is of moderate size (e.g., n = 200) [20].
Chang et al. [20] provided an algorithm and software to calculate the exact probability of L_{ k }by the following formula:
where u = [1,0,Λ,0], ν = [1,Λ,1,0] and the exact form of the imbedded transition matrix Λ_{ i }is a function of η_{1} and η_{ sti }s in Equation (3). By considering a homogeneous Markov chain, the subscript i can be suppressed and Λ_{ i }and η_{ sti }can be reduced to Λ and η_{ st }. We estimated the initial probability by the proportion of the occurrence of "1" in {X_{ j }, j = 1,Λ,n}, i.e., {\widehat{\eta}}_{1} = (number of "1")/n. Under null hypothesis H_{0} of independency, {X_{ j }is assumed to be of no special pattern and the transition probabilities η_{1t}and η_{0t}can be replaced by {\widehat{\eta}}_{1} and 1  {\widehat{\eta}}_{1}. For dependent sequences, we propose the following approach.
Estimation of dependency structure under null hypothesis
To examine the Markov chain assumption on sequence {X_{1}...X_{n}}, we used the following sequential goodnessoffit tests proposed by Anderson and Goodman [27]:

(i)
H_{0}: independent sequence vs. H_{1}: first order Markov chain

(ii)
H_{0}: first order Markov chain vs. H_{1}: second order Markov chain
If the first order Markov chain hypothesis is accepted, we suggest using the following approach to estimate the dependency structure and calculate the pvalue.
Consider a set of sliding windows of fixed size w and sliding distance d, d <w <n, under proper choices of d and w, the windows {X_{1}...X_{w}}, {X_{1+d}...X_{w+d}}, ..., {X_{1+sd}...X_{w+sd}} will cover nearly the whole sequence, where s is the largest integer smaller than or equal to (nw)/d. For the subsequence in the i th window, the transition probabilities η_{ st }(i) = P(X_{i+1}= t  X_{ i }= s), where s, t = 0,1, are estimated by its maximal likelihood estimator, namely, {\widehat{\eta}}_{11}(i) = (number of consecutive (1,1) pairs)/(number of 1s), and {\widehat{\eta}}_{00} (number of consecutive (0,0) pairs)/(number of 0s), for i = 1 to d+1. For a homogeneous firstorder Markov chain, we estimate {\widehat{\eta}}_{11} and {\widehat{\eta}}_{00} by the medians of {\widehat{\eta}}_{00}(i)'s and {\widehat{\eta}}_{11}(i)'s, respectively, and {\widehat{\eta}}_{10}=1{\widehat{\eta}}_{11} and {\widehat{\eta}}_{01}=1{\widehat{\eta}}_{00}.
The validity of the sliding window approach is based on the assumption that most of the widows contain no disease markers, and therefore the sequences within the set of sliding windows of fixed size w can be used to estimate the transition probabilities. After {\widehat{\eta}}_{st}(i)'s are estimated from each window, the robustness of median can diminish the influence from those few windows which might contain disease markers. With the estimated η_{ st }'s, tail probabilities of LSR statistics can be calculated using (5), and then the null hypothesis of no disease gene is rejected if L_{ k }≥ m_{ α }, where m_{ α }is a threshold such that Pr{L_{ k }≥ m_{ α }H_{0}} = α.
To summarize, we performed simulation studies to investigate the falsepositive rate, the power, and the false discovery rate of the proposed LSR method, and we compared the results with those from Bonferroni method.
References
Barrett JC, Cardon LR: Evaluating coverage of genomewide association studies. Nat Genet 2006, 38: 659–662. 10.1038/ng1801
Scott LJ, Mohlke KL, Bonnycastle LL, Willer CJ, Li Y, Duren WL, Erdos MR, Stringham HM, Chines PS, Jackson AU, Olsson LP, Ding CJ, Swift AJ, Narisu N, Hu T, Pruim R, Xiao R, Li XY, Conneely KN, Riebow NL, Sprau AG, Tong M, White PP, Hetrick KN, Barnhart MW, Bark CW, Goldstein JL, Watkins L, Xiang F, Saramies J, Buchanan TA, Watanabe RM, Valle TT, Kinnunen L, Abecasis GR, Pugh EW, Doheny KF, Bergman RN, Tuomilehto J, Collins FS, Boehnke M: A genomewide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 2007, 316: 1336–1341. 10.1126/science.1142382
Sladek R, Rocheleau G, Rung J, Dina C, Shen L, Serre D, Boutin P, Vincent D, Belisle A, Hadjadj S, Balkau B, Heude B, Charpentier G, Hudson TJ, Montpetit A, Pshezhetsky AV, Prentki M, Posner BI, Balding DJ, Meyre D, Polychronakos C, Frogue P: A genomewide association study identifies novel risk loci for type 2 diabetes. Nature 2007, 445: 881–885. 10.1038/nature05616
Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, Lango H, Timpson NJ, Perry JRB, Rayner NW, Freathy RM, Barrett JC, Shields B, Morris AP, Ellard S, Groves CJ, Harries LW, Marchini JL, Owen KR, Knight B, Cardon LR, Walker M, Hitman GA, Morris AD, Doney ASF;, The Wellcome Trust Case Control Consortium (WTCCC), McCarthy MI, Hattersley AT: Replication of genomewide association signals in U.K. samples reveals risk loci for type 2 diabetes. Science 2007, 316: 1336–1341. 10.1126/science.1142364
Herbert A, Gerry NP, McQueen MB, Heid IM, Pfeufer A, Illig T, Wichmann HE, Meitinger T, Hunter D, Hu FB, Colditz G, Hinney A, Hebebrand J, Koberwitz K, Zhu X, Cooper R, Ardlie K, Lyon H, Hirschhorn JN, Laird NM, Lenburg ME, Lange C, Christman MF: A common genetic variant is associated with adult and childhood obesity. Science 2006, 312: 279. 10.1126/science.1124779
Lin DY: Evaluating statistical significance in twostage genomewide association studies. Am J Hum Genet 2006, 78: 505–509. 10.1086/500812
Benjamini Y, Hochberg Y: Controlling the false discovery rate – a practical and powerful approach to multiple testing. J Roy Stat Soc B Met 1995, 57: 289–300.
Benjamini Y, Yekutieli D: Quantitative trait loci analysis using the false discovery rate. Genetics 2005, 171: 783–790. 10.1534/genetics.104.036699
Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. Ann Stat 2001, 29: 1165–1188. 10.1214/aos/1013699998
Wacholder S, Chanock S, GarciaClosas M, El Ghormli L, Rothman N: Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. Journal of the National Cancer Institute 2004, 96: 434–442.
Cardon LR, Bell JI: Association study designs for complex diseases. Nat Rev Genet 2001, 2: 91–99. 10.1038/35052543
Becker T, Schumacher J, Cichon S, Baur MP, Knapp M: Haplotype interaction analysis of unlinked regions. Genet Epidemiol 2005, 29: 313–322. 10.1002/gepi.20096
Tzeng JY, Devlin B, Wasserman L, Roeder K: On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am J Hum Genet 2003, 72: 891–902. 10.1086/373881
McIntyre LM, Martin ER, Simonsen KL, Kaplan NL: Circumventing multiple testing: a multilocus Monte Carlo approach to testing for association. Genet Epidemiol 2000, 19: 18–29. 10.1002/10982272(200007)19:1<18::AIDGEPI2>3.0.CO;2Y
Fan R, Knapp M: Genome association studies of complex diseases by casecontrol designs. Am J Hum Genet 2003, 72: 850–868. 10.1086/373966
Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB: Detection of gene × gene interactions in genomewide association studies of human population data. Human Heredity 2007, 63: 67–84. 10.1159/000099179
Spielman RS, McGinnis RE, Ewens WJ: Transmission test for linkage disequilibrium: the insulin gene region and insulindependent diabetes mellitus (IDDM). Am J Hum Genet 1993, 52: 506–516.
Ewens WJ, Grant GR: Statistical Methods in Bioinformatics. New York: Springer; 2001:184–189.
Fu JC, Koutras MV: Distribution theory of runs: A Markov chain approach. J Am Stat Assoc 1994, 89: 1050–1058. 10.2307/2290933
Chang CJ, Fann CSJ, Chou WC, Lian IB: On the tail probability of the longest wellmatching run. Stat and Probability Letters 2003, 63: 267–274. 10.1016/S01677152(03)000919
Schmidt M, Hauser ER, Martin ER, Schmidt S: Extension of the SIMLA package for generating pedigrees with complex inheritance patterns: environmental covariates, genegene and geneenvironment interaction. Stat Appl Genet Mol Biol 2005, 4: Article 15.
SIMLA Simulation Software Version 2.3[http://wwwchg.duhs.duke.edu/software/simla.html]
SAS Institute: SAS/Genetics User's Guide. Cary, North Carolina; 2002.
Helms C, Cao L, Krueger JG, Wijsman EM, Chamian F, Gordon D, Heffernan M, Daw JAW, Robarge J, Ott J, Kwok PY, Menter A, Bowcock AM: A putative RUNX1 binding site variant between SLC9A3R1 and NAT9 is associated with susceptibility to psoriasis. Nat Genet 2003, 35: 349–356. 10.1038/ng1268
Hwu WL, Yang CF, Fann CSJ, Chen CL, Tsai TF, Chien YH, Chiang SC, Chen CH, Hung SI, Wu JY, Chen YT: Mapping of psoriasis to 17q terminus. J Med Genet 2005, 42: 152–158. 10.1136/jmg.2004.018564
Gordon D, Heath SC, Liu X, Ott J: A transmission disequilibrium test that allows for genotyping errors in the analysis of single nucleotide polymorphism data. Am J Hum Genet 2001, 69: 371–380. 10.1086/321981
Anderson TW, Goodman LA: Statistical inference about Markov chains. Ann Math Stat 1957, 28: 89–110. 10.1214/aoms/1177707039
Allen M, Heinzmann A, Noguchi E, Abecasis G, Broxholme J, Ponting CP, Bhattacharyya S, Tinsley J, Zhang Y, Holt R, Jones EY, Lench N, Carey A, Jones H, Dickens NJ, Dimon C, Nicholls R, Baker C, Xue L, Townsend E, Kabesch M, Weiland SK, Carr D, von Mutius E, Adcock IM, Barnes PJ, Lathrop GM, Edwards M, Moffatt MF, Cookson WOCM: Positional cloning of a novel gene influencing asthma from Chromosome 2q14. Nat Genet 2003, 35: 258–263. 10.1038/ng1256
Rosenthal R: Combining results of independent studies. Psychological Bulletin 1978, 85: 185–193. 10.1037/00332909.85.1.185
Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS: Truncated product method for combining pvalues. Genet Epidemiol 2002, 22: 170–185. 10.1002/gepi.0042
Owen MJ, Williams NM, O'Donovan MC: The molecular genetics of schizophrenia: new findings promise new insights. Mol Psychiatry 2004, 9: 14–27. 10.1038/sj.mp.4001444
International HapMap Project[http://www.hapmap.org/downloads/index.html.en]
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, LiuCordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome. Science 2002, 296: 2225–2229. 10.1126/science.1069424
Horvath S, Xu X, Lake SL, Silverman EK, Weiss ST, Laird NM: Familybased tests for associating haplotypes with general phenotype data: application to asthma genetics. Genet Epidemiol 2004, 26: 61–69. 10.1002/gepi.10295
Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, Kulp D, SianiRose MA: NetAffx: Affymetrix probesets and annotations. Nucleic Acids Res 2003, 31: 82–86. 10.1093/nar/gkg121
Erdõs P, Révész P: On the length of the longest headrun. Colloq Math Soc Janos Bolyai 1977, 16: 219–228.
Arratial R, Gordan L, Waterman MS: The ErdosRenyi law in distribution for coin tossing and sequence matching. Ann Stat 1990, 18: 539–570. 10.1214/aos/1176347615
Karlin S, Ost F, Blaisdell BE: Mathematical Methods for DNA Sequences. Boca Raton: CRC Press; 1990.
Acknowledgements
This research was supported by National Science Council grants (NSC 923112B001014, 922320B018001) and an Academia Sinica grant (93IBMS2PPC) of Taiwan. We would like to thank three anonymous reviewers; their valuable comments have greatly improved this manuscript.
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
I–BL developed method and directed simulations. Y–HL and Y–CL conducted simulations and data analyses. H–CY drafted the manuscript. C–JC proposed the original idea and directed simulations. CSJF directed simulations, drafted manuscript and supervised the project. All authors have read and approved the final manuscript.
Electronic supplementary material
12859_2007_2231_MOESM2_ESM.doc
Additional file 2: The details of the markers in the 0–1 sequence. The marker position and marker name of the 0–1 sequence. (DOC 86 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Lian, IB., Lin, YH., Lin, YC. et al. Using the longest significance run to estimate regionspecific pvalues in genetic association mapping studies. BMC Bioinformatics 9, 246 (2008). https://doi.org/10.1186/147121059246
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147121059246