A note on generalized Genome Scan Meta-Analysis statistics

Koziol, James A; Feng, Anne C

doi:10.1186/1471-2105-6-32

Methodology article
Open access
Published: 17 February 2005

A note on generalized Genome Scan Meta-Analysis statistics

James A Koziol¹ &
Anne C Feng¹

BMC Bioinformatics volume 6, Article number: 32 (2005) Cite this article

3910 Accesses
5 Citations
Metrics details

Abstract

Background

Wise et al. introduced a rank-based statistical technique for meta-analysis of genome scans, the Genome Scan Meta-Analysis (GSMA) method. Levinson et al. recently described two generalizations of the GSMA statistic: (i) a weighted version of the GSMA statistic, so that different studies could be ascribed different weights for analysis; and (ii) an order statistic approach, reflecting the fact that a GSMA statistic can be computed for each chromosomal region or bin width across the various genome scan studies.

Results

We provide an Edgeworth approximation to the null distribution of the weighted GSMA statistic, and, we examine the limiting distribution of the GSMA statistics under the order statistic formulation, and quantify the relevance of the pairwise correlations of the GSMA statistics across different bins on this limiting distribution. We also remark on aggregate criteria and multiple testing for determining significance of GSMA results.

Conclusion

Theoretical considerations detailed herein can lead to clarification and simplification of testing criteria for generalizations of the GSMA statistic.

Background

Wise, Lanchbury and Lewis [1] introduced a rank-based statistical technique for meta-analysis of genome scans, the Genome Scan Meta-Analysis (GSMA) method, and derived its exact null distribution using a clever inclusion/exclusion argument. Koziol and Feng [2] provided an alternative derivation of the null distribution of the GSMA statistic via a combinatoric approach involving probability generating functions, and suggested an Edgeworth series approximation to its exact null distribution that improves upon the Wise [1] normal approximation.

Levinson [3] described two generalizations to the GSMA statistic: (i) a weighted version of the GSMA statistic, so that different studies could be ascribed different weights for analysis; and (ii) an order statistic approach, reflecting the fact that a GSMA statistic can be computed for each chromosomal region or bin across the various genome scan studies. Wise [1] had suggested that each chromosomal region (bin) be about 30 cM, leading to a total of about n = 120 bins spanning the entire genome, and correspondingly 120 GSMA statistics. Wise [1] and Koziol and Feng [2] had investigated the marginal distribution of any of these (exchangeable) GSMA statistics, whereas under the order statistic formulation of Levinson [3], the joint distribution of the entire set of GSMA statistics is taken into account. In this note, we consider both generalizations in turn. In particular, (i) we provide an Edgeworth approximation to the null distribution of the weighted GSMA statistic, analogous to that in Koziol and Feng [2]; and (ii) we examine the limiting distribution of the GSMA statistics under the order statistic formulation, and quantify the relevance of the pairwise correlations of the GSMA statistics across different bins on this limiting distribution. We conclude with remarks concerning the Levinson [3] aggregate criteria and multiple testing for determining significance of GSMA results.

Results

The GSMA statistics

We first introduce some notation. Let X_ij, i = 1, ..., m, j = 1, ..., n, denote the rank of any particular linkage test statistic (e.g., LOD score) in the j^thchromosomal region (bin) from the i^thstudy, with each study being ranked separately. Levinson [3] rank the bins from 1 = "best" to n = "worst" on the basis of, say, maximum LOD score or lowest p value observed within each bin, but the reverse ranking from 1 = "worst" to n = "best" is also feasible. In practice, m can be as few as 4 (e.g., [4, 5]); and, following Wise [1], n is generally about 120. The GSMA statistics are then S₁, ..., S_n, where . The exact (marginal) null distribution of each S_jwas derived in Wise [1]; in the notation of Levinson [3], P_AvgRnk, the "pointwise probability" of any S_j, is obtained from its marginal null distribution. The normal approximation to the exact distribution of the S_jis straightforward: the S_jare identically distributed, and each S_jhas an approximate normal distribution with mean and variance under the null hypothesis that ranks are randomly assigned within each study. Koziol and Feng [2] provided an Edgeworth correction to this approximation, and recommended that the correction be used, at least for m ≤ 12.

The weighted GSMA statistic

Levinson [3] proposed a weighted version of the GSMA statistic, namely, , with the weight w_iascribed to the i^thstudy reflecting the relative linkage information from that study. (We are temporarily omitting the j subscript for clarity.) The normal approximation to the marginal null distribution of S_wis straightforward, and depends on the two parameters and . The combinatorial argument utilized by Koziol and Feng [2] to derive the exact distribution of the unweighted GSMA statistic (which relies on probability generating functions) is generally no longer applicable in the weighted setting. Nevertheless, as in Koziol and Feng [2], we here provide an Edgeworth correction that may be applied to the weighted GSMA statistic. To this end, we equivalently consider the linear transform , where . We then have E[R_w] = 0, , and

(We have used Koziol and Feng [2], eqn. 11, for .)

The Edgeworth Type A series approximation to the density of

up to 4^th order terms, is f (z) = φ (z)[1 + c₄H₄ (z)], where φ(·) is the standard normal density function, , H₄ (·) is the 4^th degree Hermite-Chebyshev polynomial, H₄ (z) = z⁴ - 6z² + 3, and the constant c₄ given by (Stuart and Ord [6], eqn. 6.42). Furthermore, the cumulative distribution function of the Edgeworth series is given simply by

where Φ(·) denotes the cumulative distribution function of the standard normal distribution, c₄ and φ(·) are as above, and H₃(·) is the 3^rd degree Hermite-Chebyshev polynomial, H₃ (z) = z³ - 3z (Stuart and Ord [6], eqn. 6.43).

In practice, we would expect the Edgeworth series approximation to provide an adequate representation of the exact distribution of the weighted GSMA statistic, in a manner analogous to the unweighted case [2]. Here, we briefly investigate the adequacy of the Edgeworth approximation, using an example from Lewis et al. [7]. They had applied the GSMA methodology to data from m = 20 schizophrenia genome scans, and found strong evidence for linkage on chromosome 2q, as well as suggestive evidence for linkage at several other chromosomal locations. The rank data for each scan are available online at D.F. Levinson's website (accessed July 14, 2004) [8], and we use these data to reconstruct first the unweighted GSMA statistics S_j, j = 1, 2, ..., 120, corresponding to the 120 bins spanning the entire genome, then their preferred weighted versions. Lewis [7] had recommended weights for each individual study proportional to the square root of the number of affected cases for that study. From Levinson's website [8], the individual weights w_i, i = 1, 2, ..., 20, are 2.32, 1.77, 1.20, 1.17, 1.17, 1.16, 1.15, 1.08, 1.03, 1.01, 0.95, 0.88, 0.80, 0.80, 0.68, 0.67, 0.59, 0.54, 0.53, 0.51, greater than a four-fold range.

We simulated the null distribution of the weighted GSMA statistic

, where with m = 20, n = 120, by drawing each X_ias an independent random integer from 1 to 120 (that is, a uniform distribution of the integers from 1 to 120), then forming R_wwith the Levinson [8] weights. We used the random number generator in R [9], to produce 10,000 values for R_w. We then formed , and compared the empirical distribution of the 10,000 Z values with the Edgeworth approximation as described above; for comparative purposes, we also computed the normal approximation, which is based on matching the first two moments rather than the first four moments with Edgeworth. Figure 1 shows the resulting quantile-quantile plot of the empirical distribution of the weighted GSMA Z values with both a normal approximation, panel A, and the Edgeworth approximation, panel B. Note that, even in this setting of the weighted combination of m = 20 individual GSMA statistics, the normal approximation is particularly ill-fitting in the tails. Agreement in the tails would be of particular relevance in practical applications, as these represent the areas of potentially significant findings (p-values). The noticeable disagreement in the tails between the weighted GSMA statistic and its normal approximation is largely ameliorated with the Edgeworth approximation. With attendant computational savings, the Edgeworth approximation provides a practical means of determining significance of weighted GSMA results compared to simulation; tail probabilities derived from the normal approximation should only be used with extreme caution.

The GSMA order statistics

We turn next to order statistic considerations (and reintroduce the subscript j). The Levinson [3] order statistic approach to inference relating to the GSMA statistics takes into account the inherent ordering of the S_j: their P_ordrefers to the probability of any observed S_kgiven the k^thbin's place in the ordering of all of the S_j. We here derive approximations to this distribution. Let , j = 1, ..., n, and T₍₁₎ <T₍₂₎ < ... T_(n)denote the order statistics. We note first that the T_jhave (approximately) a singular symmetric multivariate normal distribution, with means 0, variances 1, and correlations . That the joint distribution is singular follows from the observation that for all i, hence, is identically 0. If we dismiss the correlations as negligible (of absolute magnitude < 0.01 for n > 100), then the T_jare (approximately) independent, identically distributed N(0,1) (standard normal) random variates, and the cumulative distribution function (cdf) F_kof the k^thorder statistic T_(k)is given by

with Φ(·) as above (David [10], eqn. 2.1.3).

We briefly examine whether correlations can be ignored when determining the distributions of the T_(j). Numerical computation of the distributions of the order statistics from a symmetric multivariate normal distribution is feasible in a number of cases; we here examine perhaps the most relevant case, concerning the extreme T_(n).Note that

Prob (T_(n)≤ x) = Prob (T₁ ≤ x,T₂ ≤ x, ...,T_n≤ x); (2)

this latter probability may be calculated in R using the mvtnorm package [9], based on methodology by Genz [11, 12]. With n = 120, we depict in Figure 2 a Q-Q plot of the (approximate) distribution of T_(n)under independence, eqn. (1), compared to the distribution from eqn. (2) with pairwise correlations . The independence model tends to agree quite closely to the correlation model in this particular case, especially in the critical right tail, and has the virtue of numerical simplicity. We remark that one might improve slightly on the normal independence model by incorporating the Edgeworth correction into the individual cumulative distribution functions in equation (1).

Aggregate criteria and multiple testing

Levinson [3] had proposed an aggregate criterion for detecting linkage based on both the marginal distributions and the order statistic distributions of the GSMA statistics. In particular, they argued that bins that have achieved both P_AvgRnk< 0.05 and P_ord< 0.05 "are the most likely to contain linked loci or to be adjacent to linked bins". Note that their criterion entails both the marginal distribution of the T_j, through P_AvgRnk, and the (joint) order statistic distribution of the T_j, through P_ord. We remark that there is some redundancy to the aggregate criterion {P_AvgRnk< 0.05 and P_ord< 0.05}, as can be seen through consideration of critical values relating to their aggregate criterion. With the normal approximation to the distribution of each normalized GSMA statistic T_j, the criterion {P_AvgRnk< 0.05} is equivalent to the criterion {T_j> 1.645}. The criterion {^Pord < 0.05} may be computed from eqn. (1), and depends on the ordering of the individual T_j. With n = 120, then for the ten largest order statistics T₍₁₂₀₎, T₍₁₁₉₎, ..., T₍₁₁₁₎, the criterion {P_AvgRnk< 0.05 and P_ord< 0.05} reduces to {P_ord< 0.05}, since their 95^th percentiles under their joint order statistic distribution exceed 1.645 [implying that, if {P_ord< 0.05} obtains, then {^PAvgRnk < 0.05} will automatically be satisfied]; and, for the remaining order statistics T₍₁₁₀₎, T₍₁₀₉₎, ..., T₍₁₎, the criterion {P_AvgRnk< 0.05 and P_ord< 0.05} reduces to {P_AvgRnk< 0.05}, equivalently, {T_(j)> 1.645}, as their 95^th percentiles under the order distribution, eqn. (1), are less than 1.645 [implying that, if {P_AvgRnk< 0.05} obtains, then {P_ord< 0.05} will automatically be satisfied].

We conclude with a remark concerning multiple testing. Levinson [3] suggested a simple Bonferroni correction for multiple testing when determining the significance of GSMA results. In particular, they used the criterion {P_AvgRnk< 0.000417} (0.05 corrected for 120 tests) for evidence that a bin is likely to contain a linked locus or loci. One can improve on this procedure by using Holm's [13] paradigm for multiple testing rather than Bonferroni. We illustrate Holm's [13] procedure by returning to the Lewis [7] study with m = 20 schizophrenia genome scans. As noted above, we used the online data to reconstruct the normalized unweighted GSMA statistics T_j,j = 1, 2, ..., 120, corresponding to the 120 bins spanning the entire genome. With m = 20 studies, we shall invoke the normal approximation to the distributions of the individual T_j.

Lewis [7] had extensively investigated various criteria for linkage from the 20 schizophrenia genome scans, and we shall not reproduce their analyses. Rather, we here illustrate a graphical procedure for the simultaneous evaluation of p-values from tests on the same data; this procedure is immediately applicable to the simultaneous consideration of the 120 GSMA statistics. The procedure, originally suggested by Schweder and Spjøtvoll [14], consists of a probability plot of the p-values versus the uniform distribution. Koziol [15] subsequently suggested that Holm's [13] paradigm for multiple testing be applied to Schweder and Spjøtvoll's [14] scenario, for a graphical determination of the number of true hypotheses.

Let us briefly review the Holm [13] method, which is an extension of the Bonferroni method for multiple comparisons. Suppose we compare the smallest p-value P₍₁₎ among n p-values with α/n and we find that the p-value is less than α/n. Then our multiple testing problem has been reduced by one test, and we should compare the next smallest p-value P₍₂₎ to . In general, we would compare P_(i)with . Holm's [13] step-down test begins with i = 1, comparing P_(i)with , and stops as soon as P_(i)exceeds , rejecting at overall level α all prior tests with smaller p-values. The Holm [13] method, like Bonferroni, makes no assumption on the dependence of tests, but beyond P₍₁₎ is less conservative than Bonferroni.

In Figure 3A we present a probability plot of the 120 p-values corresponding to the 120 individual T_jstatistics, which we have recomputed from the online Levinson dataset [8]. On this plot, the points corresponding to the "true" hypotheses of no linkage in individual bins should roughly fall along a straight line passing through the origin. We have also superimposed the Bonferroni and Holm boundaries for overall alpha level 0.05 and n = 120 p-values; but, the two boundaries are virtually indistinguishable. There is little indication of large departures from the global null hypothesis of no linkage.

In Figure 3B we rescale the y-axis, and focus solely on the Bonferroni and Holm boundaries. Differences are most readily apparent for the largest ordered p-values. On the other hand, with a large number of hypotheses (here 120), the improvement of Holm over Bonferroni at the smallest ordered p-values is marginal at best. As a reviewer has presciently remarked, the Holm procedure generally is most helpful (advantageous) relative to Bonferroni with only a small number of hypotheses.

In Figure 3C we zoom in on the part of the probability plot nearest the origin; we here have superimposed the Holm [13] boundary. In accord with Lewis [7], we find that only one GSMA statistic achieves statistical significance at overall alpha level 0.05, namely, the statistic corresponding to bin 2.5. [Recall that the Holm and Bonferroni boundaries coincide at the smallest p-value, P₍₁₎.] That is, in this particular instance, the unweighted GSMA statistics with either Bonferroni or Holm [13] correction for multiple testing identify statistically evidence for linkage on chromosme 2q.

Conclusion

For practitioners utilizing GSMA statistics, the question arises as to whether the methods proposed here as well as in Koziol and Feng [2] are merely of theoretical interest, or have practical import. If one utilizes solely the unweighted GSMA statistic, and chooses to consider its marginal distribution (corresponding to the P_AvgRnkformulation of Levinson [2]), then the exact null distribution of the GSMA statistic is available from Wise [1] or Koziol and Feng [2], and should be preferred over any approximate methods. If the exact null distribution is computationally intractable for practitioners, then the Edgeworth approximation of Koziol and Feng [2] provides a simple and accurate means of assessing significance; we would argue that the Edgeworth approximation is preferable to a normal approximation in this instance. When weights are introduced into the GSMA statistic, then the combinatoric arguments of Wise [1] and Koziol and Feng [2] will typically be insufficient to derive the exact null distribution [though we remark that a moment generating function approach patterned after the probability generating function formulation of Koziol and Feng [2] can be brought to bear on this problem.] One can either simulate the null distribution or derive an Edgeworth approximation: we do not believe either method enjoys global advantages over the other. We caution against simple reliance on a normal approximation: in the situation investigated here, Figure 1, the weighted combination of m = 20 individual GSMS statistics, the normal approximation is particularly ill-fitting in the tails. [Agreement in the tails is of particular relevance to practitioners, as these represent the areas of potentially significant findings (p-values).] As for the order statistic formulation and the aggregate criteria of Levinson [3], we believe that the theoretical considerations given in this paper can lead to clarification and simplification of testing criteria.

References

Wise LH, Lanchbury JS, Lewis CM: Meta-analysis of genome searches. Ann Hum Genet 1999, 63: 263–72. 10.1046/j.1469-1809.1999.6330263.x
Article CAS PubMed Google Scholar
Koziol JA, Feng AC: A note on the genome scan meta-analysis statistic. Ann Hum Genet 2004, 68: 376–380. 10.1046/j.1529-8817.2004.00103.x
Article CAS PubMed Google Scholar
Levinson DF, Levinson MD, Segurado R, Lewis CM: Genome scan meta-analysis of schizophrenia and bipolar disorder, part I: Methods and power analysis. Am J Hum Genet 2003, 73: 17–33. 10.1086/376548
Article PubMed Central CAS PubMed Google Scholar
Chiodini BD, Lewis CM: Meta-analysis of 4 coronary heart disease genome-wide linkage studies confirms a susceptibility locus on chromosome 3q. Arterioscler Thromb Vasc Biol 2003, 23: 1863–1868. 10.1161/01.ATV.0000093281.10213.DB
Article CAS PubMed Google Scholar
Demenais F, Kanninen T, Lindgren CM, Wiltshire S, Gaget S, Dandrieux C, Almgren P, Sjogren M, Hattersley A, Dina C, Tuomi T, McCarthy MI, Froguel P, Groop LC: A meta-analysis of four European genome screens (GIFT consortium) shows evidence for a novel region on chromosome 17p11.2-q22 linked to type 2 diabetes. Hum Mol Genet 2003, 12: 1865–1873. 10.1093/hmg/ddg195
Article CAS PubMed Google Scholar
Stuart A, Ord JK: Kendall's Advanced Theory of Statistics. 5th edition. New York: Oxford University Press; 1987. [Distribution Theory, Vol 1.]
Google Scholar
Lewis CM, Levinson DF, Wise LH, DeLisi LE, Straub RE, Hovatta I, Williams NM, Schwab SG, Pulver AE, Faraone SV, Brzustowicz LM, Kaufmann CA, Garver DL, Gurling HM, Lindholm E, Coon H, Moises HW, Byerley W, Shaw SH, Mesen A, Sherrington R, O'Neill FA, Walsh D, Kendler KS, Ekelund J, Paunio T, Lonnqvist J, Peltonen L, O'Donovan MC, Owen MJ, Wildenauer DB, Maier W, Nestadt G, Blouin JL, Antonarakis SE, Mowry BJ, Silverman JM, Crowe RR, Cloninger CR, Tsuang MT, Malaspina D, Harkavy-Friedman JM, Svrakic DM, Bassett AS, Holcomb J, Kalsi G, McQuillin A, Brynjolfson J, Sigmundsson T, Petursson H, Jazin E, Zoega T, Helgason T: Genome scan meta-analysis of schizophrenia and bipolar disorder, part II: Schizophrenia. Am J Hum Genet 2003, 73: 34–48. 10.1086/376549
Article PubMed Central CAS PubMed Google Scholar
Levinson Research Page[http://depressiongenetics.med.upenn.edu/meta-analysis/index.htm]
The R Project for Statistical Computing[http://www.r-project.org/]
David HA: Order Statistics. New York: John Wiley; 1970.
Google Scholar
Genz A: Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics 1992, 1: 141–150.
Google Scholar
Genz A: Comparison of methods for the computation of multivariate normal probabilities. Computing Science and Statistics 1993, 25: 400–405.
Google Scholar
Holm S: A simple sequentially rejective multiple test procedure. Scand J Statist 1979, 6: 65–70.
Google Scholar
Schweder T, Spjøtvoll E: Plots of p-values to evaluate many tests simultaneously. Biometrika 1982, 69: 493–502.
Article Google Scholar
Koziol JA: A note on plots of p-values to evaluate many tests simultaneously. Biom J 1989, 8: 969–972.
Article Google Scholar

Download references

Acknowledgements

This research was supported in part by NIH grant RR00833 to the Scripps General Clinical Research Center. We thank the reviewers for their insightful comments and suggestions.

Author information

Authors and Affiliations

Department of Molecular and Experimental Medicine, The Scripps Research Institute, 10550 North Torrey Pines Road, MEM216, La Jolla, CA, 92037, USA
James A Koziol & Anne C Feng

Authors

James A Koziol
View author publications
You can also search for this author in PubMed Google Scholar
Anne C Feng
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to James A Koziol.

Additional information

Authors' contributions

JAK conceived, designed and drafted the manuscript. ACF performed the statistical simulation.

Both authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Koziol, J.A., Feng, A.C. A note on generalized Genome Scan Meta-Analysis statistics. BMC Bioinformatics 6, 32 (2005). https://doi.org/10.1186/1471-2105-6-32

Download citation

Received: 24 November 2004
Accepted: 17 February 2005
Published: 17 February 2005
DOI: https://doi.org/10.1186/1471-2105-6-32

A note on generalized Genome Scan Meta-Analysis statistics