BMC Bioinformatics BioMed Central

Background Combining multiple independent tests, when all test the same hypothesis and in the same direction, has been the subject of several approaches. Besides the inappropriate (in this case) Bonferroni procedure, the Fisher's method has been widely used, in particular in population genetics. This last method has nevertheless been challenged by the SGM (symmetry around the geometric mean) and Stouffer's Z-transformed methods that are less sensitive to asymmetry and deviations from uniformity of the distribution of the partial P-values. Performances of these different procedures were never compared on proportional data such as those currently used in population genetics. Results We present new software that implements a more recent method, the generalised binomial procedure, which tests for the deviation of the observed proportion of P-values lying under a chosen threshold from the expected proportion of such P-values under the null hypothesis. The respective performances of all available procedures were evaluated using simulated data under the null hypothesis with standard P-values distribution (differentiation tests). All procedures more or less behaved consistently with ~5% significant tests at α = 0.05. Then, linkage disequilibrium tests with increasing signal strength (rate of clonal reproduction), known to generate highly non-standard P-value distributions are undertaken and finally real population genetics data are analysed. In these cases, all procedures appear, more or less equally, very conservative, though SGM seems slightly more conservative. Conclusion Based on our results and those discussed in the literature we conclude that the generalised binomial and Stouffer's Z procedures should be preferred and Z when the number of tests is very small. The more conservative SGM might still be appropriate for meta-analyses when a strong publication bias in favour of significant results is expected to inflate type 2 error.


Background
GC-content (% of guanine(G) or cytosine(C) bases) is known to vary along human chromosomes. To describe large genomic regions of homogeneous GC%, the term "isochore" was coined in 1980s [1]. Since then, the question has been intensively debated, why genomes contain GC-high and GC-low isochore regions. The initially proposed hypotheses was that GC-rich isochore constitute an adaptation to homeothermy in warm-blooded species [2], as well as favorable bendability and B-Z helix transition that lead to more open chromating and ease transcription [3]. This explanation fits well to the correlation between GC-content and gene density [4,5]. The second hypotheses to explain variation in GC-content is a mutation bias related to processes like DNA replication and repair [6,7]. The third explanation arose from the later discovery that local GC-content and recombination rate (number of crossing over events per meiosis per unit sequence length) are strongly correlated [8]. The molecular basis for this explanation is recombination associated biased gene conversion (BGC), which may act to increase GC-content [7,[9][10][11]. The availability of full genome sequences now allows to draw a more complex picture of GC-content variation than only separating the genome into a set of discrete isochore categories. Early after completion of the first human genome draft sequence, it was observed that seemingly homogeneous region at one length scale may not be homogeneous at shorter length scales and that it is possible to have "domains within a domain" [12,13]. More recently, a fine-grained picture also arose for variation of recombination rate along human chromosomes [14][15][16][17]. This facilitated the study of the relationship between GC-content and recombination rate on a much finer scale, showing that recombination hotspots are associated with local increases in GC-content [18] but do not significantly influence local substitution rate. In parallel, the BGC-hypothesis has been supported by several additional lines of evidence [19][20][21]. In a most recent study, recombination rates was found to be the major determinant of limiting-GC-content -the stationary GC-content towards which the human genome is currently evolving [22], strongly supporting recombination associated BGC as a major determinant of GC-content.
Nevertheless, it is not entirely clear how the two correlations of GC-content with both recombination rate and gene density relate to each other. In the simplest case, a third correlation between gene density and recombination rate would exist. In this case one could test whether increased GC-content in gene dense regions were a consequence of increased recombination. In the absence of a correlation between recombination rate and gene density, their shared relationship with GC-content remains to be explained. In particular, the correlation between GC-content and gene density is less understood. Thus, the true model of the evolution of genome-wide and regional GCcontent may have a neutral (non-Darwinian) and additionally a (positive and negative) selection component [23][24][25] or it may be void of this selection component. Because the correlation with gene density has been a major argument of evolutionary models that explain local GC-content as result of selection, a better understanding of the correlations between these variables is an important task.
To understand the relationship between recombination rate, gene density and GC-content, it is further important to note that even if BGC were the only reason for GC-content variation, this would not necessarily imply a purely neutral model of isochore evolution, because local recombination rate may itself evolve under the influence of natural selection. For instance, it has been observed that recombination is increased at human central nervous sys-tem genes and immune-system genes [26,27]. These gene categories had been observed before to be subject to accelerated or faster sequence evolution, respectively [28]. Because more recombination at a genetic locus may increase the effective strength of selection, this led to the suggestion that gene selection intensity might be one determinant of local recombination rate variation [26,27].
In the present study, we aim at the assignment of "direct" and "indirect" labels, as well as "cause" and "effect", whenever possible, to variables that are informative about local GC-content. We notice that many previous analyses are based on statistical correlation, whereas the causal relationship between them remains undecided. For instance, researchers who are interested in understanding the causes of recombination rate variation or gene sequence evolution, GC-content itself or hidden variables associated with GC-content may be seen as possibly confounding factors. On the other hand, for people who are interested in in GC-content variation, recombination and the associated gene conversion, and possibly mutation events, are a priori treated as causal variables.
When dealing with several correlated variables, a widely used statistical method is multiple regression. However, multiple regression is not always a good method to test for causal relationships, because the equality sign in a regression analysis does not have a direction. Thus, one can move an independent variable from the right-hand side of the equation to the left-hand side to be a dependent variable [29]. Moreover, two unconditionally independent variables can be correlated conditional on a common causal child, which is exactly what is carried out in a multiple regression [29]. Therefore, we propose to use techniques for inferring causal relationship by conditional correlation analysis to understand the relationship between GC-content, recombination rate, and gene density in the human genome.
To this end, we start representing a group of pairwise correlated variables by an undirected graph structure: nodes/ vertices represent variables and links/edges represent observed statistical correlations. In the next step, we remove all links that are inferred to be indirect associations, based on the absence of conditional correlation. Finally, we apply causal inference rules to assign causal arrows, if possible. In cases where the complete causal model cannot be inferred from the data, the result is a partially directed graph that optimally characterizes the relationship among the tested variables. Similar inference techniques have been previously applied to other genomics problems [30] and for studying relationships between human-disease related intermediate-phenotypes [31].

Results and discussion
Three variables: GC%, recombination rate, and distance to telomere In a recent study, it was shown by Arndt and Duret [22] that besides the positive correlation with recombination rate (RR), GC-content (GC%) is negatively correlated with the distance to telomere (DT). These results were mainly based on the analysis of noncoding sequence in a 1 Mb sized window that have high quality finished sequence available both in the chimpanzee and the macaque genome [22]. We start our analysis by using both their data and our own dataset of the same 1 Mb windows for the human genome sequence, regardless of coding and noncoding status or the existence of quality sequence in other organisms. The GC% in these two datasets is not totally identical, but highly correlated (ρ = 0.98). Similarly, the HapMap estimate of RR [27] in the two datasets is correlated with ρ = 0.82. We discarded windows, if the number of HapMap single-nucleotide-polymorphism (SNP) is less than 20 or more than 30% of genomic sequence are missing. In total, 2647 and 2668 1 Mb windows are available with information on GC%, RR and DT for the two datasets. We performed log-transformation of distance to telomere (DT), because the scatter plot showed a non-linear correlation between DT with the other two variables, and then multiplied it by -1 to change the negative correlation with GC% to positive. The unconditional and conditional Pearson's correlation coefficients between GC%, RR and DT are shown in Table 1. All correlation coefficients are highly significant (p-value = 0) and results from both datasets are highly similar. Because an earlier study had observed that the correlation between RR and GC% is maximal when both variables are measured in the 50 kb window [15], we also looked at a dataset where GC%, RR, DT are measured by using the window size of 50 kb. Due to the smaller window size (1/20 of the 1 Mb window), RR is fluctuating in a much wider range as can be seen from the quantile values in Table 2. We also note that a square-root transformation of RR under 50 kb window leads to a slightly better linear correlation with GC%, and a larger correlation coefficient (result not shown).
The correlation and partial correlation between the three variables from 50 kb window is shown in Table 3. In contrast to [15], we found the correlation between GC% and RR to be higher using the 1 Mb sized window than the 50 kb window. This discrepancy may result from the threefold higher SNP density provided by the HapMap phase II [27]. Importantly, the correlation between GC% and DT is less affected by the change of window size, although RR-DT correlation is far weaker in the 50 kb window than in the 1 Mb window. This change of the strength of the correlation of RR with GC% and DT from one window size to another may be related to the "domains within domains" phenomenon that had been found for GC-content variation and that may exist for fine-scale recombination rate variation too.
Because none of the pairwise correlations between GC%, RR and DT is rendered insignificant by conditioning on the third variable, it is not possible to remove any edge in the relationship graph for GC%, RR, and DT ( Figure  1(A)).

Chromosome-specific correlation and partial correlations
In the next step, we checked the chromosome-specific correlations and partial correlations between the three variables. Table 4 shows these result in form of correlation and partial correlation coefficients (and p-value if it is larger than 0.01) for our main dataset (1 Mb window including all available human genome sequence independent from its coding status). There are several notable observations: (1) RR-log(1/DT) correlation is unchanged by conditioning on GC% for non-acrocentric chromosomes, indicating that the position of the window already explains RR, rendering GC% unlikely to be causal. (2) For acrocentric chromosomes (13,14,15,21,22), the position of the window (DT) is only marginally correlated with RR. In contrast, DT is correlated with GC% for all chromosomes including the acrocentric chromosomes. (3) For some (3, 7,8,9,10,11,12,18,19), but not for all, chromosomes, the correlation between GC% and RR is weakened by conditioning on DT.(4) For chromosome 2 the positive correlation between RR and DT is not turned negative by conditioning on GC. This result is interesting, because chromosome 2 is known to result from a relatively recent fusion event of different chromosomes [32,33] To examine the robustness of these chromosome specific correlations (Table 4), we carried out the same correlation analysis using the noncoding sequence 1 Mb windows [22] and the 50 kb window ( Figure 2). Most of the correlations in Table 4 are confirmed in these two additional datasets. One interesting observation in Figure 2 is that the correlation between RR and DT is weaker for the 50 kb window, probably because finer details of recombination rate variation are revealed at this length scale and the dependence of RR on DT is no longer monotonic. Thus DT is primarily correlated with large scale recombination rate variation, which could relate to the proposed conservation of large-scale rates on longer time scales [15,22].
An example of chromosome specific patterns of recombination rate was recently discussed in the context of a putative gene that controls overall recombination rate [34]. This study illustrates the effect of a SNP on increasing the female recombination rates by almost the same amount on all chromosomes with the exception of chromosome 21. Another SNP reduces the male recombination rates by variable degrees for different chromosomes [34].

Three variables: GC%, recombination rate, and number of exons
Gene density constitutes a further variable that is known to be strongly correlated with GC% [4,5]. To better understand this relationship, we counted the number of exons within a 1 Mb window, as it reflects both the number of genes and the intron count. The correlation and partial correlations between GC%, RR, and the number of exons (NE) are listed in Table 5. Unlike the previous situation, where we had looked at the three variables RR, DT and GC%, the consideration of NE instead DT is bringing up an observation that allows us to infer a causal relationship: although no significant direct correlation exists between RR-NE, a negative correlation between RR and NE emerges after conditioning on GC%.
This result (Table 5) suggests the causal model in Figure  1(B). In this causal model, RR and NE are two independent causes of GC%. The inference of this causal structure is based on the known fact that conditioning on a common child variable creates a correlation between two previously uncorrelated causes of this child variable [29]. Or spoken more specifically, the relationship between NE and RR can be understood as follows: normally the two variables RR and NE do not contain any information about each other and are therefore uncorrelated. However, given the status of GC-content as third variable, this situation changes and RR and NE are now mutually informative. This mutual informativeness of NE and RR depending on GC% is explained by a model where both RR and NE are independent causes of GC%. When GC% in a region is high and RR is low, NE is more likely to be high. Vice versa, when NE is low, RR is more likely to be high. Thus, given the status of GC%, a previously invisible relationship between RR and NE emerges due to the causal influence of both variables on GC%.
Consistent with our present observation, a negative correlation between gene density and RR had been observed earlier in a multiple regression analysis when looking at 3 Mb windows, despite the fact that the unconditional RR/ gene count correlation was weakly positive [35]. Importantly, window size could be a factor that exerts some Causal graph models or their skeleton for GC-content, recombination-rate, number-of-exons, and distance-to-tel-omere Figure 1 Causal graph models or their skeleton for GC-content, recombination-rate, number-of-exons, and distance-to-telomere. (A) Relationship graph for GC%, RR,log(DT) that is inferred from the correlations in Table 1.

(B)
Causal graph for GC%, RR, NE that is inferred from the correlations in Table 5. (C) Partially directed graph for GC%, RR, -log(DT), and NE that is consistent with the result in Table 6. All edges/arrows are highly significant. (D) A hypothetical model including an extra variable NCO/R: proportion of non-crossing-over events. This model may help to orient the previously undirected edges. Recombinations tend to occur more often in physical proximity to genes, when compared to intergenic regions; but on the other hand, they also tend to occur away from exons on a finer scale [17]. It might be due to this subtle variation of RR at different length scales that the correlation between RR-NE is insignificant at the 1 Mb scale, but was weakly positive on the 3 Mb scale.
Nevertheless, when we repeated the chromosome-specific analysis using the variable NE (instead of DT), this confirmed the overall pattern of correlation between RR and NE. Unconditionally the correlation is not significant and can be both positive and negative. However, the partial correlations between NE and RR conditional on GC% are all negative with most of them being significant (results not shown). In principle, the absence of an unconditional correlation between RR and NE could also result from a phenomenon termed suppression [36][37][38]. Suppression refers to the situation, where different signs are obtained by following two paths with opposite effects from the same starting to the same ending node. However, the observed change of the correlation from insignificant to significant is inconsistent with suppression, because this conditional dependence indicates that both the NE and the RR link with GC% are pointing towards GC%.

Four variables: GC%, recombination rate, distance to telomere, and number of exons
In the final step, we extended our 3-variable analysis to a 4-variable analysis, which includes GC%, RR, -log(DT), and NE. Besides the previously calculated first-order partial correlation (conditional on one variable), we now also calculate the second-order partial correlations (conditional on two other variables). The result are shown in Table 6. When comparing the second-order partial correlations to the first-order partial correlations, we found that conditioning on GC% is mostly responsible for any change of correlation status. Conditioning on DT, RR or NE has only some quantitative effect, instead of introducing any qualitative changes into pairwise and first-order correlations. This implies a central position of GC% among these variables. Figure 1(C) depicts a partially directed graph that is consistent with the results in Table 6. Importantly, the inclusion of DT does not alter causal relationships that were inferred above in the 3-variable analysis of RR, NE and GC%. Also, the above correlations between RR, DT and GC% remain largely unaltered by the inclusion of NE. As mentioned above, telomere distance is inversely correlated with GC% and RR. Additionally, we see in the unconditional pairwise analysis that telomere distance is inversely correlated with NE too, although this correlation is of smaller magnitude. This correlation between DT and NE does not change substantially when conditioning on RR. However, when conditioning on GC%, the correlation between DT and NE changes its direction. Following a similar line of reasoning as above, this suggests a model where DT and NE are two independent causes of GC%.
On the contrary, this cannot be said for the influence of RR and DT on GC%, because the correlation RR and DT does not depend on conditioning on GC%.
To find the missing orientations of the links between RR, DT and GC% in the 4-variable model, we next applied the TETRAD program [39] that implements the PC-algorithm to create a causal model by a systematic search strategy (see Methods for details) [40,41]. The graphical result that we obtained from running TETRAD is essentially the same as the one depicted in Figure 1(C) and confirmed the direction of the two arrows that we had inferred for causative influence of both RR and NE on GC%. However, the additionally proposed orientations of the links RR →log(DT) and NE → -log(DT) are biologically counterintuitive, because telomere distance is unlikely to be an effect of any of the other variables. To explain the difficulty to infer the directions of these causal links between RR, DT and GC%, we hypothesize the causal model in Figure  1(D). This model includes as fifth hidden variable the proportion of recombination events that are resolved exclusively as gene conversion event without any crossingover event (NCO/R), a variable that was recently sug-Chromosome-specific correlations and partial correlations Figure 2 Chromosome-specific correlations and partial correlations. Chromosome-specific correlation and partial correlation for GC%-RR (top), GC%-log(1/DT) (middle), RR-log(1/DT) (bottom) in 3 datasets: 1 Mb, non-coding (black) [22]; 1 Mb, disregard coding/non-coding status (blue); and 50 kb, disregard coding/non-coding status (red). Acrocentric chromosomes are marked by yellow bars.
gested to be important [22]. In this model in Figure 1(D), NCO/R is a cause of GC% that does not fully depend on RR, but is influenced in its magnitude by RR. A similar relationship might connect NCO/R with NE. On the other hand, distance-to-telomere, similar to other variables measuring position or time, might play the role of providing a common environment for several other variables. In other words, one can draw a directed arrow from DT to all other variables under discussion. A similar situation is seen for the linkage disequilibrium between two neighboring genetic markers, where the position can be considered is a "cause" of both markers. However, we could not test the validity of the model in Figure 1(D) because NCO/R data are not available.

Conclusion
We apply partial correlation and graphical probabilistic model inference to several genomic variables that are correlated with GC-content in the human genome. We can show that recombination rate and exon density are two independent causes of GC% as measured on the 1 Mb scale. This observation adds some support to models that complement the influence of recombination rate on GCcontent with a component involving selection. In addition, it appears unlikely that GC% variation is a cause of variation in recombination rate or exon density. We observe some heterogeneity in the human genome, such as differences in the correlation of RR with the distance to telomere between acrocentric and non-acrocentric chro-mosomes. We also see indications of window-size dependent correlation pattern, which may reflect the subtle differences of the distribution of recombination near and within genes.

Terminology in relationship and causal graphs
A graph G = (V, E) contains vertex/node set V and edge/ link If there is an edge between node i and j, either directed or undirected, we say there is a "direct association/relationship" between the two nodes. If there is no edge between node i and node j, the two are still connected through multiple-step edges, as all our nodes are in one single graph; then we say the two nodes are "indirectly associated".
If all edges are directed, the graph is said to be "directed graph" (e.g. Figure 1(B)). If all edges are undirected, the graph is an "undirected graph" (e.g. Figure 1(A)). If some edges are directed and other edges are undirected, the graph is a "partially directed graph" (e.g. Figure 1(C)).

Partial correlations
For many situations, conditional correlation is equivalent to partial correlation [42] which is defined as follows (with one control variable z): where is the Pearson product-moment correlation coefficient. From the linear regression framework, partial correlation is the correlation after the main terms in regression over z are removed:  Partial correlation ρ xy.z is often lower than ρ xy , and a significantly lower partial correlation is an indication that the x -y correlation is indirect.
With more than 3 variables (x, y, z, w), the partial correlation can be defined by conditional on one variable (e.g. z, first order), or two variables (z, w, second order). Both Eq.(1) and Eq.(2) can be extended for calculating secondorder partial correlation: and Higher order partial correlation can be defined in an analog fashion. Figure 3 illustrates an example for inferring relationship and causal graph from data for three variables x, y, z. In Figure 3(A), we assume all pairwise correlations are significant, so all nodes are linked to other nodes. If the correlation between two of the variables is not due to a direct cause-effect relationship, but mediated via a third variable, then the correlation between the two conditional on that third variable will be greatly reduced. Accordingly, we would end up Figure 3(B), if assuming that the partial correlation Cor(x, y|z) becomes insignificant, while the other two partial correlations remain significant. In that case, partial correlation cannot determine the orientation of causal arrows. Except the first causal model on the top of Figure 3(C), all the three other causal models were possible. However, in a special situation, a directed causal model can be inferred uniquely. Suppose Cor(x, z) and Cor(y, z) are both significant, but Cor(x, y) is insignificant,

Establishing undirected, partially directed and directed graphs
Illustration of the procedures in establishing undirected, directed, or partially directed graphs Figure 3 Illustration of the procedures in establishing undirected, directed, or partially directed graphs. (A) If correlation Cor(i, j) is significant, draw an edge between node i and node j (i, j ∈ (x, y, z)). (B) The conditional correlation Cor(x, y|z) is insignificant, remove the edge (x, y). (C) Using other information to select one or few causal models that are consistent with the data.
then we start with the undirected graph in Figure 3(B) from the unconditional analysis. Further suppose Cor(x, y|z), Cor(x, z|y), Cor(y, y|x) are all significant. Then by the rule of d-separation [41] only the top model in Figure  3(C) is consistent with these assumptions.

TETRAD program for inference of causal models from partial correlations
The TETRAD program [39] implements the PC-algorithm to automatically infer causal relationships from partial correlation analysis [40]. This algorithm can be broken into two phases: an adjacency phase and an orientation phase. In the adjacency phase, a complete undirected graph over the variables is constructed and then edges X -Y are removed, if some set S among either the adjacents of X or the adjacents of Y can be found such that I(X, Y|S). Then the orientation phase is begun. The first step examines unshielded triples and considers to orient them as colliders. An unshielded triple is a triple (X, Y, Z) where X is adjacent to Y, Y is adjacent to Z, but X is not adjacent to Z. Since X is not adjacent to Z, the edge X -Z must have been removed during the adjacency search by conditioning on some set S xz ; (X, Y, Z) is oriented as a collider X → Y ← Z just in case Y is not in this S xz . Once all such unshielded triples have been oriented as colliders, a series of rules orients any edge whose orientation is implied by previous orientations.