Minimizing recombinations in consensus networks for phylogeographic studies

Background: We address the problem of studying recombinational variations in (human) populations. In this paper, our focus is on one computational aspect of the general task: given two networks G1 and G2, with both mutation and recombination events, defined on overlapping sets of extant units, the objective is to compute a consensus network G3 with a minimum number of additional recombinations. We describe a polynomial time algorithm with a guarantee that the number of computed new recombination events is within ϵ = sz(G1, G2) of the optimal number of recombinations, where sz is a well-behaved function of the sizes and topologies of G1 and G2. To date, this is the best known result for a network consensus problem.
Results: Although the network consensus problem can be applied to a variety of domains, here we focus on the structure of human populations. With our preliminary analysis of a segment of human Chromosome X data, we are able to infer ancient recombinations, population-specific recombinations and more, which also support the widely accepted 'Out of Africa' model. These results have been verified independently using traditional manual procedures. To the best of our knowledge, this is the first recombination-based characterization of human populations.
Conclusion: We show that our mathematical model identifies recombination spots in the individual haplotypes; the aggregate of these spots over a set of haplotypes defines a recombinational landscape that has enough signal to detect a continental as well as a population divide based on a short segment of Chromosome X. In particular, we are able to infer ancient recombinations, population-specific recombinations and more, which also support the widely accepted 'Out of Africa' model. The agreement with mutation-based analysis can be viewed as an indirect validation of our results and the model. Since the model in principle gives us more information embedded in the networks, in our future work we plan to investigate more non-traditional questions via these structures computed by our methodology.


Background
Reconstructing the recombinational history of a DNA fragment has proved to be a difficult problem and can only be achieved at small scales. Nonetheless, the reconstruction of the history of long fragments is of great interest to geneticists. Although the mutational history of adjacent fragments is independent, this is not true for recombinational history: merging adjoining networks thus adds a new level of complexity in terms of the suite of recombination events that shape variations within and across populations (both population substructures as well as possible migratory history).
This paper explores the combinatorics involved in incorporating recombination events into the topology. While it is possible to give loose bounds on the number of recombination events using some convenient and clever variation of the Four Gamete Rule [1], the actual enumeration of the recombinations by a careful exploration of the underlying combinatorics will tighten this bound, as well as give additional information such as participating lineages, time-ordering of the recombination events and so on. However, it is important to note that the corresponding combinatorial optimization problem cannot be solved exactly unless P = NP [2,3]. Nevertheless, there have been various efforts to give a good estimate of a bound on this number (see [4] and citations therein).
In this paper, we address the problem of computing a consensus of a pair of phylogenetic networks G 1 and G 2 to give G 3 with a minimum number of new recombination events that jointly explain G 1 and G 2 . Such a network G 3 satisfies certain characteristics due to the very nature of its genesis: it is called a compatible network [5]. In [5] we presented a topology-based methodology to understand genetic variations in human haplotype data: We first cluster (possibly overlapping) haplotypes that display no evidence of recombinations, and a representative haplotype of each cluster is extracted for the next phase. Then, exploiting the coherence seen in such data, each haplotype is recoded using patterns of SNPs (patterns seen across different haplotypes). Finally, a network is constructed from the recoded representative haplotypes. Using a divide-and-conquer paradigm, the haplotype is segmented to give simple structures, and then these individual structures are merged to give a unified topology using the DSR Scheme (see Methods). Each stage is algorithmically non-trivial; however, optimizing the number of recombination events in the merging phase is a critical component, and this is our focus in this paper. The interested reader is directed to [5] for other details, including the rationale of the model.
In this paper, we analyze the performance of the DSR Scheme in two ways. Firstly, we give a mathematical evaluation of the algorithm. In other words, how far are we from the optimal number of new recombinations that explain the data? We show that the greedy polynomial time DSR based algorithm guarantees that the number of computed new recombination events is within ϵ = sz(G 1 , G 2 ) (see Eqn 2) of the optimal number of recombinations. To date, this is the best known result for a network consensus problem. Note that the computation of consensus trees (or networks) is a well-studied problem in the literature. Thus, although our model is derived from the special setting discussed above, the problem and its solution are of interest in a general context involving reticulation events; see for example [6,7]. The ideas in pairwise consensus are easily extendable to k-wise consensus.
Secondly, we examine how well the algorithm performs on real data. We apply the method on a 100 Kb segment of high SNP density in the recombining part of the X chromosome. With our preliminary analysis from a phylogeographic viewpoint, we are able to infer ancient recombinations, population-specific recombinations and more (see Experimental Results), which also support the widely accepted 'Out of Africa' model. These results are consistent with established mutation-based methods; thus this can be taken as an indirect validation of our analysis and the methodology.

Methods
Here we discuss the underlying mathematical model. We are given H units or extant individuals, each of which has F features. Each feature is a SNP (Single Nucleotide Polymorphism) and a unit is a haplotype. To keep the paper self-contained, in this section we reproduce the notation used in [5]. A network G is a directed acyclic graph (DAG) defined as follows. It has three kinds of nodes. A node with no incoming edge is a root node, and G may have multiple root nodes. A node with no outgoing edges is a leaf node. Each leaf node is labeled with a nonempty set of haplotype labels. Every other node is an internal node. A node has at most two incoming edges. When a node has exactly one incoming edge, it is called a mutation node and the incoming edge is called a mutation edge. When the node has two incoming edges, the node is called a recombination (or hybrid, or reticulation) node and the incoming edges are called recombination or reticulation edges. The direction of the edges is always towards the leaf nodes, which in the figures in this paper is downwards. To avoid clutter, only the recombination edges display the direction as arrows.
Associated with a network G is a segmentation S. The segmentation S is a partition of the F features into some k(≤ F) (nonoverlapping) subsets. When the features are ordered say as f 1 , f 2 , f 3 ,..., f F , they can be simply written as the closed interval [1, F], and the segmentation is a collection of non-overlapping intervals. For example, if F = 5, a possible segmentation of interval [1,5] is: S = { [1,2], [3,4], [5,5]}. The three individual segments are s 1 = [1,2], s 2 = [3,4] and s 3 = [5,5] with S = {s 1 , s 2 , s 3 }. For convenience, the three segments are denoted simply by the consecutive integer labels 1, 2 and 3 and to keep clarity of exposition, S is written simply as S = {1, 2, 3}. Then each feature f is written as s: f where s is the segment label that the feature f belongs to.
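As a concrete illustration of the interval bookkeeping above, the following is a minimal Python sketch (not part of the original method; the helper name segment_of is ours) that resolves which segment a given feature belongs to, i.e., the s in the s: f notation:

```python
# Sketch (hypothetical helper, not the authors' code): a segmentation S
# of the features [1, F] is a list of non-overlapping closed intervals;
# segment labels are the 1-based positions of the intervals in the list.

def segment_of(segmentation, f):
    """Return the segment label s such that feature f lies in segment s."""
    for s, (lo, hi) in enumerate(segmentation, start=1):
        if lo <= f <= hi:
            return s
    raise ValueError(f"feature {f} not covered by the segmentation")

# The example from the text: F = 5, S = {[1,2], [3,4], [5,5]}.
S = [(1, 2), (3, 4), (5, 5)]
print(segment_of(S, 4))  # feature 4 lies in segment 2, i.e., '2:4'
```
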
For a segmentation S, the labeling of the edges of G is as follows: (1) Mutation edge: Every mutation edge e incident on a node v has a non-empty label, lbl. Each member, s: f, of the label is interpreted as feature f in segment s. (Note that f itself may have the form '2:3' as in Figure 1. Now, if f is associated with segment 9, the label is written as 9:2:3.) Further, each element appears at most once in an edge label in G. (2) Recombination edge: The two recombination edges, e 1 and e 2 , incident on a recombination node v are labeled by sets of segment labels lbl 1 and lbl 2 with lbl 1 ≠ lbl 2 . For example, lbl 1 = {1, 3} denotes that e 1 is labeled with the segment labels 1 and 3. G is always associated with a segmentation S of the F features, hence written as (G, S). Note that G cannot be an arbitrary network: it must satisfy the following: for each s ∈ S, Restricted(G, s) is devoid of recombinations. Such a network is termed compatible. The Consensus Compatible Network Problem is defined as follows [5]: Given two compatible networks (G 1 , S 1 ) and (G 2 , S 2 ) with no common features (thus S 1 ∩ S 2 = ∅), the task is to compute a compatible network (G 3 , S 1 ∪ S 2 ), with the minimum number of additional recombination nodes, such that G 3 agrees with G 1 on the segments of S 1 and with G 2 on the segments of S 2 . In the remainder of the paper, we refer to (G, S) simply as G; the segmentation S will be clear from context.

The Dominant Subdominant Recombinant (DSR) framework
The DSR scheme to solve the problem and its proof of correctness were presented in [5]. The method is an iterative, bottom-up procedure, working at one level of G 1 and G 2 at a time. The level of a node is defined as its maximum distance to a leaf node.
The same level is also associated with any edge e incident on the node, written as level(e, G). A leaf is at level 0. The method gets its name from the need to give one of three possible assignments, Dominant (D), Subdominant (S) or Recombinant (R), to nodes at each iteration, which is central to this scheme.
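The level bookkeeping the scheme relies on is a standard bottom-up computation on the DAG. A small Python sketch (our own representation, not the authors' code; the DAG is given as child adjacency lists, with leaves mapped to empty lists):

```python
# Sketch: level(v) = maximum distance from v to a leaf, leaves at level 0,
# computed bottom-up (here via memoized recursion) on the DAG.

from functools import lru_cache

def levels(children):
    """children: dict node -> list of child nodes (empty list for leaves)."""
    @lru_cache(maxsize=None)
    def level(v):
        kids = children[v]
        return 0 if not kids else 1 + max(level(c) for c in kids)
    return {v: level(v) for v in children}

# A small DAG: r -> a, b; a -> x, y; b -> y (y has two parents, i.e., it
# is a recombination node).
g = {"r": ["a", "b"], "a": ["x", "y"], "b": ["y"], "x": [], "y": []}
print(levels(g))  # {'r': 2, 'a': 1, 'b': 1, 'x': 0, 'y': 0}
```
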

DSR assignment rules
The non-empty entries of X l are given a DSR assignment. The following two conditions are required for a viable compatible network G 3 . (Rule 1): Each row (column) in matrix X l has at most one dominant; if the row (column) has no dominant, then it has at most one subdominant. (Rule 2): A non-recombinant element can have another non-recombinant in its row or its column, but not both. As a result of the DSR assignments to the entries of X l , the rows and columns also get implicitly assigned, as follows.
A row (column) that has a dominant entry is assigned dominant. A row (column) that is not assigned dominant but has a subdominant in the row (column) is assigned subdominant. A row (column) that has only recombinants is assigned recombinant. Note that only dominant rows (columns) contribute to entries in X l' , l' > l. Figures 4 and 5 give two different assignments, giving the two different networks in Figure 3.
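The two rules can be checked mechanically. Below is a sketch (the helper valid_dsr and the matrix encoding are ours, not the paper's) that tests whether a DSR assignment of an X matrix satisfies Rules 1 and 2 as stated above, with entries 'D', 'S', 'R', or None for empty cells:

```python
# Sketch of a viability check for a DSR assignment of a rectangular
# matrix X (entries 'D', 'S', 'R', or None).

def valid_dsr(X):
    rows, cols = range(len(X)), range(len(X[0]))
    lines = [[X[i][j] for j in cols] for i in rows] + \
            [[X[i][j] for i in rows] for j in cols]
    # Rule 1: at most one 'D' per row/column; if no 'D', at most one 'S'.
    for line in lines:
        if line.count('D') > 1:
            return False
        if line.count('D') == 0 and line.count('S') > 1:
            return False
    # Rule 2: a non-recombinant ('D' or 'S') may share its row OR its
    # column with another non-recombinant, but not both.
    for i in rows:
        for j in cols:
            if X[i][j] in ('D', 'S'):
                row_mates = sum(X[i][k] in ('D', 'S') for k in cols) - 1
                col_mates = sum(X[k][j] in ('D', 'S') for k in rows) - 1
                if row_mates > 0 and col_mates > 0:
                    return False
    return True

print(valid_dsr([['D', 'R'], ['R', 'S']]))    # True
print(valid_dsr([['D', 'D'], [None, None]]))  # False: two dominants in a row
print(valid_dsr([['D', 'S'], ['S', None]]))   # False: violates Rule 2
```
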

Approximation factor of the greedy DSR scheme
In this section, we compute the approximation factor [8,9] of the greedy version of the DSR Scheme. Let the number of new recombination events produced by the DSR algorithm in G 3 be N DSR . Let the optimal number of new recombinations be N opt . We measure the true approximation by the excess N DSR - N opt of the computed over the optimal number of new recombinations, accumulated over all levels (excluding the leaf level). Note that if G 1 and G 2 are the same (isomorphic) graphs then N opt = 0.

Theorem 1
The number N DSR of new recombinations produced by the greedy DSR scheme is within sz(G 1 , G 2 ) (see Eqn 2) of the optimal, i.e., N DSR - N opt ≤ sz(G 1 , G 2 ).
Proof: Let N max (N min ) be the maximum (minimum) number of new recombinations produced by the DSR scheme over all possible DSR assignments. We first show that
N min ≤ N opt ≤ N max . (Eqn 3)
Clearly N opt ≤ N max holds (else it contradicts the optimality of N opt ). Next we have to show that N min ≤ N opt holds as well. For this we need a few more characterizations of the network.

Recombination node descriptor F 1 |F 2
Let Y be the set of all given haplotypes (or taxa). A split or bipartition is written as Z 1 |Z 2 where Z 1 and Z 2 are nonoverlapping subsets of Y with Y = Z 1 ∪ Z 2 . A tripartition Z 1 |Z 2 |Z 3 is defined similarly. In earlier works (see [7,2] and citations therein) a mutation event has been associated with a bipartition of Y and a recombination event with a tripartition. However, the latter requires certain additional conditions.
Figure 5 X-matrices of Network G 2 of Figure 3(b). Also see Figure 4 for a description of the matrices.
Figure 6 Consensus of a tree and a network.
Figure 7 Stepwise construction of G 3 of Figure 6(c) as consensus of T 1 and G 2 : (a)-(e) The X matrices and the DSR assignments. (f)-(j) The construction of G 3 using the DSR assignments of (a)-(e).
For an element s: f of G, let L s: f (G) denote the set of haplotype labels on the leaf nodes reachable from the mutation edge labeled with s: f; the analogous set for a node v is written as L v . Two compatible networks G 1 and G 2 on the same segmentation S are isomorphic (or identical), written as G 1 ≡ G 2 , if the following two conditions hold: (1) For each element s: f in G 1 , L s: f (G 1 ) = L s: f (G 2 ) and vice versa, and (2) for each recombination node v in G 1 with descriptor F 1 |F 2 , there exists a recombination node in G 2 with the same descriptor and vice versa.

Canonical form
It is possible to bubble up or down an element in the mutation edge label to obtain G' such that G' ≡ G. Our convention will be to bubble down the element of the mutation edge label, towards a leaf node. A network G is in the canonical form (1) if no node has only one outgoing edge and (2) if no element of any mutation edge label can be bubbled down to obtain G' with G' ≡ G. For example, see Figure 3. Since the levels of nodes in a canonical network are unique, the following can be readily verified (see also concrete examples in Figures 2 and 6).
Lemma 1 Let G 3 be a canonical compatible consensus network of G 1 and G 2 , whose height is bounded in terms of the heights of G 1 and G 2 . Then there exist some X-matrices, X 1 , X 2 , ..., whose DSR assignments produce G 3 . This is written as G 3 ≅ X 1 , X 2 , ....
Back to the proof: We have to show that N min ≤ N opt holds.
Assume the contrary, i.e., N opt < N min . In other words, the optimal number of new recombinations is even lower than the minimum produced by the algorithm over all possible choices. Consider a network with N opt new recombinations. By Lemma 1, there exists a sequence of X-matrices X 1 , X 2 , ..., with some DSR assignment for each X l , that produces this network. Since N min is the minimum over all possible DSR assignments, N min ≤ N opt must hold, leading to a contradiction.
Hence N min ≤ N opt , which completes the proof of Eqn 3. Next, we give a few characterizations of the DSR assignment to facilitate the counting of the new recombinations.

Type I & II (new) recombination events
Let v be a recombination node in G 3 with labels lbl 1 and lbl 2 on its two incoming recombination edges.

Chromosome X SNP data
We used a 100 Kb segment of high SNP density in the recombining part of the X chromosome, starting at genomic position 87,348,404 (Build 35). In HapMap II [11], this segment contains 194 sites, of which only 175 are polymorphic in at least one population. The recombination rate is ≈ 0.7 cM/Mb, slightly below the ≈ 1 cM/Mb genome-wide average. We chose this segment for two reasons. (1) It does not contain any genes. Thus variation in this region is less likely to have been shaped by natural selection and more likely to reflect purely genomic processes.
(2) The segment does not contain copy number variations or segmental duplications. These could induce genotyping errors, possibly producing ectopic recombination events, which are not accounted for in the downstream analysis.
Further, we used only the haplotypes in the hemizygous males to avoid any phasing errors; such errors would manifest as phantom recombination events. The populations used were Yorubans from Nigeria (YRI, N = 30), Europeans (CEU, N = 30), and a pooled sample of Chinese and Japanese (ASN, N = 45).

Statistical analysis (using p-value estimations)
As a preprocessing step, exploiting the coherence seen in the data, each haplotype is recoded using blocks of g SNPs. Based on the combinatorial model, a network is constructed from the recoded representative haplotypes.
Recall that first the haplotype is segmented to give simple structures and then these individual structures are merged with a small number of recombinations to give a unified topology. Here we discuss the choice of the value of g in our experiments. Let l be the number of distinct patterns of the g SNPs across the samples. Using this as a proxy for the extent of LD in this block, we estimate the p-value of the number l. Loosely speaking, when these g SNPs are in linkage equilibrium (or independent), l should be much larger than when they are in LD.
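The statistic l is simply the number of distinct g-sized SNP patterns per block across the sample. A small sketch (the helper distinct_patterns is ours; haplotypes are assumed to be equal-length binary strings):

```python
# Sketch: split each haplotype into consecutive blocks of g SNPs and
# count l, the number of distinct patterns, for each block.

def distinct_patterns(haplotypes, g):
    """haplotypes: list of equal-length 0/1 strings. Returns l per block."""
    F = len(haplotypes[0])
    counts = []
    for start in range(0, F, g):
        block = {h[start:start + g] for h in haplotypes}
        counts.append(len(block))
    return counts

haps = ["0110011001", "0110111001", "0010011101", "0110011001"]
print(distinct_patterns(haps, 5))  # → [3, 2]: l for each 5-SNP block
```

A small l relative to the number of samples indicates strong LD within the block; the p-value estimation below quantifies this.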
Figure 10 Distribution of l for g = 5. Recall that the RandV scheme is independent of the region, but the PermV scheme uses the population distribution of the region for a more realistic estimation.
Let the number of samples be H and let the number of SNPs be F. Further, let V be a column vector of size H.
Since the SNPs are assumed to be bi-allelic, V, which represents the value of a SNP in the H samples, is binary. We use two schemes, based on the mode of definition of the F vectors, to estimate the p-value. The range of values of l seen in our data is 2 ≤ l < 15, and we study the p-value estimates in this range using the two schemes.

RandV
In this scheme, V 1 , V 2 , ..., V F are defined randomly. In other words, each entry of each V is picked independently and uniformly from a set of two alleles. We use 10000 replicates, and the distribution of the number of g-sized patterns is shown in Fig 10. The p-values estimated based on this scheme are shown in the table below; they are ≈ 0.0 for every value of l.
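The RandV estimation can be sketched as follows; this is a simplified single-block version (the helper name randv_pvalue, the seed, and the one-sided "small l" criterion are our assumptions, not the paper's code):

```python
# Sketch of the RandV null model: each of g binary SNPs over H samples
# is filled i.i.d. uniformly; the null distribution of l (number of
# distinct g-sized patterns) is built from replicates, and the p-value
# of an observed l is the fraction of replicates at least as small
# (small l signals LD).

import random

def randv_pvalue(l_obs, H, g, replicates=10000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        patterns = {tuple(rng.randint(0, 1) for _ in range(g))
                    for _ in range(H)}
        if len(patterns) <= l_obs:
            hits += 1
    return hits / replicates

# With H = 45 samples and g = 5, an observed l = 5 is far below what
# independent SNPs produce, so the estimated p-value is ≈ 0.0.
print(randv_pvalue(5, H=45, g=5))
```
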

PermV
While the RandV scheme is not incorrect, we make some domain-dependent modifications to design another scheme. In the PermV scheme we (i) mimic the allele frequencies seen in the input data and (ii) use the population distribution, by ethnicity, of the screened samples in the chromosomal region. The individual V vectors are taken from the X chromosome in the database (but the SNPs span the entire chromosome), and any untyped SNP (i.e., N in the database) in a vector is given a value in agreement with the allele frequency of that column. Further, we use only those V's that have MAF ≥ 0.1. We again use 10000 replicates, and for each replicate we randomly permute the F vectors. The distribution of the number of g-sized patterns is shown in Fig 10(b).
If for a block, l has an insignificant p-value, then the subsequent analysis risks becoming unreliable. We then reduce the grain size. An alternative is to discard the offending SNPs of the block, thus fragmenting the region.
In our experiments we used a grain size g = 5, and the p-values obtained for this on all the regions were acceptable. The haplotypes are re-coded as a sequence of these SNP patterns for the combinatorial analysis discussed in the Methods section.

Result analysis
We show a sample network of a short segment of the chosen region in Figure 11. Here we summarize our observations from a phylogeographic viewpoint and answer only questions traditionally raised in this area. Table 1 shows the number of detected recombination events and how they are shared across populations. The observations (over the entire 100 Kb segment) are as follows. We discovered a total of 31 recombinations in the data. Seventeen recombinations are population-specific and can be used to infer the recombinational diversity within a population; assuming the recombination rate is constant across populations, this is a function of the effective population size of each population. Four recombinations are shared among pairs of populations and can be used as indicators of shared population ancestry. In this particular case, both Europeans and Asians share events with the African population, which is more recombinationally diverse. Ten recombinations are shared among all three populations, and they are presumably ancient events that occurred before the split of the three populations.
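The breakdown into population-specific, pairwise-shared and fully shared events is plain set arithmetic over the per-population event sets. A sketch with made-up event identifiers (the real events come from the inferred networks, not from this toy data):

```python
# Sketch: classify each recombination event by how many populations
# carry it (specific to one, shared by a proper subset of pairs, or
# shared by all).

def sharing_summary(events_by_pop):
    pops = list(events_by_pop)
    all_events = set().union(*events_by_pop.values())
    summary = {"specific": 0, "pairwise": 0, "all": 0}
    for e in all_events:
        k = sum(e in events_by_pop[p] for p in pops)
        if k == 1:
            summary["specific"] += 1
        elif k == len(pops):
            summary["all"] += 1
        else:
            summary["pairwise"] += 1
    return summary

# Toy event sets (hypothetical identifiers, for illustration only).
demo = {"YRI": {1, 2, 3, 4}, "CEU": {3, 4, 5}, "ASN": {4, 6}}
print(sharing_summary(demo))  # {'specific': 4, 'pairwise': 1, 'all': 1}
```
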
Figure 11 The network on segment Chr X: 87390235-87412114 of the three populations. The leaf nodes are labeled with (a set of) clusters of the input haplotypes. A label on an internal node is for reference purposes only. An element of the edge label is to be interpreted as segment-id:position-id:pattern-id.
Mutation-based studies of genetic diversity have shown exactly the same pattern: a larger diversity in Africans, and variation outside of Africa that is partially a subset of that in Africa. Our recombination-based results follow the same pattern and, like the mutation data, fit the 'Out of Africa' model [12] for the origin of anatomically modern humans. Consistency with the mutation data can be taken as an indirect validation of our analysis and the methodology. In our future work, we plan to investigate (raise as well as answer) more non-traditional questions.

Conclusion
We have addressed the problem of studying recombinational variations in human populations. One of the contributions of this work is a guaranteed upper bound on the approximation factor (ratio of discovered new recombination events to the true optimal) in a greedy polynomial time algorithm to construct a consensus network. Such an assurance is of significance when dealing with data where there are no other reasonable means of crosschecking results. To date, this bound is the best known result for this problem. We use this scheme to study recombinational imprints in an appropriate segment of X chromosome from three populations. While the upper bound on the approximation is our theoretical contribution, our second contribution is the results on this data: With our preliminary analysis, we are able to infer ancient recombinations, population-specific recombinations and more, which also support the widely accepted 'Out of Africa' model. The agreement with mutation-based analysis can be viewed as an indirect validation of our results and the methodology. In our future work, we plan to investigate more non-traditional questions via the networks computed by our methodology.