Partial correlation analysis indicates causal relationships between GC-content, exon density and recombination rate in the human genome
© Freudenberg et al. 2009
Published: 30 January 2009
Skip to main content
© Freudenberg et al. 2009
Published: 30 January 2009
Several features are known to correlate with the GC-content in the human genome, including recombination rate, gene density and distance to telomere. However, by testing for pairwise correlation only, it is impossible to distinguish direct associations from indirect ones and to distinguish between causes and effects.
We use partial correlations to construct partially directed graphs for the following four variables: GC-content, recombination rate, exon density and distance-to-telomere. Recombination rate and exon density are unconditionally uncorrelated, but become inversely correlated by conditioning on GC-content. This pattern indicates a model where recombination rate and exon density are two independent causes of GC-content variation.
Causal inference and graphical models are useful methods to understand genome evolution and the mechanisms of isochore evolution in the human genome.
GC-content (% of guanine(G) or cytosine(C) bases) is known to vary along human chromosomes. To describe large genomic regions of homogeneous GC%, the term "isochore" was coined in 1980s . Since then, the question has been intensively debated, why genomes contain GC-high and GC-low isochore regions. The initially proposed hypotheses was that GC-rich isochore constitute an adaptation to homeothermy in warm-blooded species , as well as favorable bendability and B-Z helix transition that lead to more open chromating and ease transcription . This explanation fits well to the correlation between GC-content and gene density [4, 5]. The second hypotheses to explain variation in GC-content is a mutation bias related to processes like DNA replication and repair [6, 7]. The third explanation arose from the later discovery that local GC-content and recombination rate (number of crossing over events per meiosis per unit sequence length) are strongly correlated . The molecular basis for this explanation is recombination associated biased gene conversion (BGC), which may act to increase GC-content [7, 9–11]. The availability of full genome sequences now allows to draw a more complex picture of GC-content variation than only separating the genome into a set of discrete isochore categories. Early after completion of the first human genome draft sequence, it was observed that seemingly homogeneous region at one length scale may not be homogeneous at shorter length scales and that it is possible to have "domains within a domain" [12, 13]. More recently, a fine-grained picture also arose for variation of recombination rate along human chromosomes [14–17]. This facilitated the study of the relationship between GC-content and recombination rate on a much finer scale, showing that recombination hotspots are associated with local increases in GC-content  but do not significantly influence local substitution rate. In parallel, the BGC-hypothesis has been supported by several additional lines of evidence [19–21]. In a most recent study, recombination rates was found to be the major determinant of limiting-GC-content – the stationary GC-content towards which the human genome is currently evolving , strongly supporting recombination associated BGC as a major determinant of GC-content.
Nevertheless, it is not entirely clear how the two correlations of GC-content with both recombination rate and gene density relate to each other. In the simplest case, a third correlation between gene density and recombination rate would exist. In this case one could test whether increased GC-content in gene dense regions were a consequence of increased recombination. In the absence of a correlation between recombination rate and gene density, their shared relationship with GC-content remains to be explained. In particular, the correlation between GC-content and gene density is less understood. Thus, the true model of the evolution of genome-wide and regional GC-content may have a neutral (non-Darwinian) and additionally a (positive and negative) selection component [23–25] or it may be void of this selection component. Because the correlation with gene density has been a major argument of evolutionary models that explain local GC-content as result of selection, a better understanding of the correlations between these variables is an important task.
To understand the relationship between recombination rate, gene density and GC-content, it is further important to note that even if BGC were the only reason for GC-content variation, this would not necessarily imply a purely neutral model of isochore evolution, because local recombination rate may itself evolve under the influence of natural selection. For instance, it has been observed that recombination is increased at human central nervous system genes and immune-system genes [26, 27]. These gene categories had been observed before to be subject to accelerated or faster sequence evolution, respectively . Because more recombination at a genetic locus may increase the effective strength of selection, this led to the suggestion that gene selection intensity might be one determinant of local recombination rate variation [26, 27].
In the present study, we aim at the assignment of "direct" and "indirect" labels, as well as "cause" and "effect", whenever possible, to variables that are informative about local GC-content. We notice that many previous analyses are based on statistical correlation, whereas the causal relationship between them remains undecided. For instance, researchers who are interested in understanding the causes of recombination rate variation or gene sequence evolution, GC-content itself or hidden variables associated with GC-content may be seen as possibly confounding factors. On the other hand, for people who are interested in in GC-content variation, recombination and the associated gene conversion, and possibly mutation events, are a priori treated as causal variables.
When dealing with several correlated variables, a widely used statistical method is multiple regression. However, multiple regression is not always a good method to test for causal relationships, because the equality sign in a regression analysis does not have a direction. Thus, one can move an independent variable from the right-hand side of the equation to the left-hand side to be a dependent variable . Moreover, two unconditionally independent variables can be correlated conditional on a common causal child, which is exactly what is carried out in a multiple regression . Therefore, we propose to use techniques for inferring causal relationship by conditional correlation analysis to understand the relationship between GC-content, recombination rate, and gene density in the human genome.
To this end, we start representing a group of pairwise correlated variables by an undirected graph structure: nodes/vertices represent variables and links/edges represent observed statistical correlations. In the next step, we remove all links that are inferred to be indirect associations, based on the absence of conditional correlation. Finally, we apply causal inference rules to assign causal arrows, if possible. In cases where the complete causal model cannot be inferred from the data, the result is a partially directed graph that optimally characterizes the relationship among the tested variables. Similar inference techniques have been previously applied to other genomics problems  and for studying relationships between human-disease related intermediate-phenotypes .
Correlation and and partial correlation at 1 Mb windows. Correlation and and partial correlation between GC%, recombination rate (RR), and distance to telomere (DT) (negative log-transformed) for 1 Mb windows. Conditioning is performed on the respective third variable. (A) regardless of coding status; (B) non-coding only.
Quantile values of recombination rates. Quantile values for RR for the three datasets: 1 Mb non-coding, 1 Mb and 50 kb (in cM/Mb).
1 Mb, nc
Correlation and partial correlation at 50 kb windows. Correlation and partial correlation between GC%, RR, and DT (negative log transformed) for 50 kb windows.
Chromosome-specific correlation and partial correlation. Chromosome-specific correlation and partial correlation between GC%, RR, and DT (negative log-transformed) using the 1 Mb window. A p-value for testing zero-correlation is included only when the correlation is not significant. n is the number of windows per chromosome (i.e., sample size). Acrocentric chromosomes are marked by *.
ρ (p-value)/partial-ρ (p-value)
An example of chromosome specific patterns of recombination rate was recently discussed in the context of a putative gene that controls overall recombination rate . This study illustrates the effect of a SNP on increasing the female recombination rates by almost the same amount on all chromosomes with the exception of chromosome 21. Another SNP reduces the male recombination rates by variable degrees for different chromosomes .
adding number-of-exon variable. Correlation and partial correlation between GC%, RR, and number of exons (NE) in 1 Mb windows.
This result (Table 5) suggests the causal model in Figure 1(B). In this causal model, RR and NE are two independent causes of GC%. The inference of this causal structure is based on the known fact that conditioning on a common child variable creates a correlation between two previously uncorrelated causes of this child variable . Or spoken more specifically, the relationship between NE and RR can be understood as follows: normally the two variables RR and NE do not contain any information about each other and are therefore uncorrelated. However, given the status of GC-content as third variable, this situation changes and RR and NE are now mutually informative. This mutual informativeness of NE and RR depending on GC% is explained by a model where both RR and NE are independent causes of GC%. When GC% in a region is high and RR is low, NE is more likely to be high. Vice versa, when NE is low, RR is more likely to be high. Thus, given the status of GC%, a previously invisible relationship between RR and NE emerges due to the causal influence of both variables on GC%.
Consistent with our present observation, a negative correlation between gene density and RR had been observed earlier in a multiple regression analysis when looking at 3 Mb windows, despite the fact that the unconditional RR/gene count correlation was weakly positive . Importantly, window size could be a factor that exerts some influence on the magnitude of observed correlations. Recombinations tend to occur more often in physical proximity to genes, when compared to intergenic regions; but on the other hand, they also tend to occur away from exons on a finer scale . It might be due to this subtle variation of RR at different length scales that the correlation between RR-NE is insignificant at the 1 Mb scale, but was weakly positive on the 3 Mb scale.
Nevertheless, when we repeated the chromosome-specific analysis using the variable NE (instead of DT), this confirmed the overall pattern of correlation between RR and NE. Unconditionally the correlation is not significant and can be both positive and negative. However, the partial correlations between NE and RR conditional on GC% are all negative with most of them being significant (results not shown). In principle, the absence of an unconditional correlation between RR and NE could also result from a phenomenon termed suppression [36–38]. Suppression refers to the situation, where different signs are obtained by following two paths with opposite effects from the same starting to the same ending node. However, the observed change of the correlation from insignificant to significant is inconsistent with suppression, because this conditional dependence indicates that both the NE and the RR link with GC% are pointing towards GC%.
Correlation/partial correlation between GC%, RR, DT and NE. In addition, the first-order partial correlations for RR-NE and DT-NE pairs are shown, whereas the first order partial correlations between the other variables had been already shown above.
0.036/-0.28 (cond. on GC% and log(1/DT))
/-0.34 (cond. on GC%)
/-0.056 (cond. on log(1/DT))
0.17/-0.12 (cond. on GC% and RR)
/-0.23 (cond. on GC%)
/0.18 (cond. on RR)
Figure 1(C) depicts a partially directed graph that is consistent with the results in Table 6. Importantly, the inclusion of DT does not alter causal relationships that were inferred above in the 3-variable analysis of RR, NE and GC%. Also, the above correlations between RR, DT and GC% remain largely unaltered by the inclusion of NE. As mentioned above, telomere distance is inversely correlated with GC% and RR. Additionally, we see in the unconditional pairwise analysis that telomere distance is inversely correlated with NE too, although this correlation is of smaller magnitude. This correlation between DT and NE does not change substantially when conditioning on RR. However, when conditioning on GC%, the correlation between DT and NE changes its direction. Following a similar line of reasoning as above, this suggests a model where DT and NE are two independent causes of GC%. On the contrary, this cannot be said for the influence of RR and DT on GC%, because the correlation RR and DT does not depend on conditioning on GC%.
To find the missing orientations of the links between RR, DT and GC% in the 4-variable model, we next applied the TETRAD program  that implements the PC-algorithm to create a causal model by a systematic search strategy (see Methods for details) [40, 41]. The graphical result that we obtained from running TETRAD is essentially the same as the one depicted in Figure 1(C) and confirmed the direction of the two arrows that we had inferred for causative influence of both RR and NE on GC%. However, the additionally proposed orientations of the links RR → -log(DT) and NE → -log(DT) are biologically counterintuitive, because telomere distance is unlikely to be an effect of any of the other variables. To explain the difficulty to infer the directions of these causal links between RR, DT and GC%, we hypothesize the causal model in Figure 1(D). This model includes as fifth hidden variable the proportion of recombination events that are resolved exclusively as gene conversion event without any crossing-over event (NCO/R), a variable that was recently suggested to be important . In this model in Figure 1(D), NCO/R is a cause of GC% that does not fully depend on RR, but is influenced in its magnitude by RR. A similar relationship might connect NCO/R with NE. On the other hand, distance-to-telomere, similar to other variables measuring position or time, might play the role of providing a common environment for several other variables. In other words, one can draw a directed arrow from DT to all other variables under discussion. A similar situation is seen for the linkage disequilibrium between two neighboring genetic markers, where the position can be considered is a "cause" of both markers. However, we could not test the validity of the model in Figure 1(D) because NCO/R data are not available.
We apply partial correlation and graphical probabilistic model inference to several genomic variables that are correlated with GC-content in the human genome. We can show that recombination rate and exon density are two independent causes of GC% as measured on the 1 Mb scale. This observation adds some support to models that complement the influence of recombination rate on GC-content with a component involving selection. In addition, it appears unlikely that GC% variation is a cause of variation in recombination rate or exon density. We observe some heterogeneity in the human genome, such as differences in the correlation of RR with the distance to telomere between acrocentric and non-acrocentric chromosomes. We also see indications of window-size dependent correlation pattern, which may reflect the subtle differences of the distribution of recombination near and within genes.
A graph G = (V, E) contains vertex/node set V and edge/link set E ⊆ V × V. An edge (i, j) ∈ E is "directed" if (j, i) ∉ E; and is "undirected" if (j, i) ∈ E. If there is an edge between node i and j, either directed or undirected, we say there is a "direct association/relationship" between the two nodes. If there is no edge between node i and node j, the two are still connected through multiple-step edges, as all our nodes are in one single graph; then we say the two nodes are "indirectly associated".
If all edges are directed, the graph is said to be "directed graph" (e.g. Figure 1(B)). If all edges are undirected, the graph is an "undirected graph" (e.g. Figure 1(A)). If some edges are directed and other edges are undirected, the graph is a "partially directed graph" (e.g. Figure 1(C)).
Partial correlation ρ xy.z is often lower than ρ xy , and a significantly lower partial correlation is an indication that the x - y correlation is indirect.
Higher order partial correlation can be defined in an analog fashion.
The TETRAD program  implements the PC-algorithm to automatically infer causal relationships from partial correlation analysis . This algorithm can be broken into two phases: an adjacency phase and an orientation phase. In the adjacency phase, a complete undirected graph over the variables is constructed and then edges X - Y are removed, if some set S among either the adjacents of X or the adjacents of Y can be found such that I(X, Y|S). Then the orientation phase is begun. The first step examines unshielded triples and considers to orient them as colliders. An unshielded triple is a triple (X, Y, Z) where X is adjacent to Y, Y is adjacent to Z, but X is not adjacent to Z. Since X is not adjacent to Z, the edge X - Z must have been removed during the adjacency search by conditioning on some set S xz ; (X, Y, Z) is oriented as a collider X → Y ← Z just in case Y is not in this S xz . Once all such unshielded triples have been oriented as colliders, a series of rules orients any edge whose orientation is implied by previous orientations.
biased gene conversion
distance to telomere
guanine and cytosine content
non-crossing-over events among recombination events
number of exons
Peter (Spirtes) and Clark (Glymour) algorithm
single nucleotide polymorphism.
We thank Peter Arndt for sharing the data he used in his paper. Yaning Yang's work was supported by the Chinese Natural Science Foundation (No. 10671189).
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.