In this section, we first describe how to calculate multi-marker $r^2$ statistics, and then present the FastTagger algorithm. The FastTagger algorithm consists of two steps. In the first step, it generates tagging rules; in the second step, it uses a greedy approach to select tag SNPs using the rules generated in the first step.
Multi-marker tagging rules
Most SNPs have only two alleles, so we consider only bi-allelic SNPs. Given a population, the allele with the higher frequency in the population is called the major allele, and the allele with the lower frequency is called the minor allele. We use uppercase letters to denote the major alleles of SNPs and lowercase letters to denote the minor alleles. SNPs that are far apart from each other are usually not linked, so we require that the distance between every pair of SNPs in a rule not exceed a predefined distance threshold max_dist.
Given $k$ SNPs $S = \{\mathrm{SNP}_1, \mathrm{SNP}_2, \cdots, \mathrm{SNP}_k\}$, there are $2^k$ possible haplotypes over the $k$ loci. To calculate the $r^2$ statistic of rule $S \to \mathrm{SNP}_x$, we need to divide the $2^k$ haplotypes into two non-empty groups and map the two groups to the two alleles of $\mathrm{SNP}_x$. MultiTag [19] and MMTagger [21] use different methods to do the mapping.
The one-vs-the-rest model
MultiTag uses this model. In total, there are $2^{2^k} - 2$ possible ways to group the $2^k$ haplotypes into two non-empty groups. MultiTag considers only the $2^k$ ways in which one group contains a single haplotype and the other group contains all the other haplotypes. It calculates the $r^2$ statistic for each of the $2^k$ groupings and then selects the one with the highest $r^2$ statistic.
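The one-vs-the-rest computation can be illustrated with a minimal Python sketch. It is not MultiTag's implementation: the data layout (phased haplotypes as tuples of 0/1 allele codes) and the function names `r2_binary` and `one_vs_rest_r2` are assumptions made for illustration. Only haplotypes actually observed in the data are tried, since an unobserved haplotype would leave one group empty.

```python
def r2_binary(x, y):
    """r^2 between two 0/1 vectors of equal length (0.0 if either is constant)."""
    n = len(x)
    px = sum(x) / n
    py = sum(y) / n
    denom = px * (1 - px) * py * (1 - py)
    if denom == 0:
        return 0.0
    pxy = sum(1 for a, b in zip(x, y) if a and b) / n
    d = pxy - px * py
    return d * d / denom


def one_vs_rest_r2(haplotypes, lhs, rhs):
    """Best r^2 over groupings that isolate one observed LHS haplotype.

    haplotypes: list of tuples of 0/1 allele codes (assumed layout).
    lhs: indices of the SNPs in S; rhs: index of the tagged SNP.
    """
    target = [h[rhs] for h in haplotypes]
    projected = [tuple(h[i] for i in lhs) for h in haplotypes]
    best = 0.0
    for pattern in set(projected):
        # One group holds `pattern`; the other holds every other haplotype.
        indicator = [1 if p == pattern else 0 for p in projected]
        best = max(best, r2_binary(indicator, target))
    return best
```

For four equally frequent haplotypes in which the third SNP is the logical AND of the first two, `one_vs_rest_r2(haps, [0, 1], 2)` gives 1.0, whereas the first SNP alone gives 1/3.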
The co-occurrence model
MMTagger does the mapping based on the co-occurrences of the alleles of the SNPs on the left hand side and the alleles of the SNP on the right hand side. Let $H$ be a haplotype over the SNP set $S$ on the left hand side, $A$ and $a$ be the two alleles of $\mathrm{SNP}_x$ on the right hand side, and $f(H)$ be the frequency of $H$. We use $f(HA)$ to denote the frequency of $H$ and $\mathrm{SNP}_x = A$ occurring together, and $f(Ha)$ to denote the frequency of $H$ and $\mathrm{SNP}_x = a$ occurring together. If $f(HA) > f(Ha)$, we map haplotype $H$ to allele $A$ of $\mathrm{SNP}_x$; otherwise we map haplotype $H$ to allele $a$ of $\mathrm{SNP}_x$. Let $H_A$ be the set of haplotypes mapped to allele $A$, and $H_a$ be the set of haplotypes mapped to allele $a$. We convert SNP set $S$ to a bi-allelic marker with two "alleles" $H_A$ and $H_a$. Then we can calculate the $r^2$ statistic between $S$ and $\mathrm{SNP}_x$ as follows.
$$r^2(S, \mathrm{SNP}_x) = \frac{\left(P(H_A A) - P(H_A)\,P(A)\right)^2}{P(H_A)\,P(H_a)\,P(A)\,P(a)}$$
where $P(H_A)$, $P(H_a)$, $P(A)$, $P(a)$ and $P(H_A A)$ are the relative frequencies of $H_A$, $H_a$, $A$, $a$ and $H_A A$ respectively.
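A minimal Python sketch of the co-occurrence mapping and the resulting $r^2$ follows. It is an illustration under stated assumptions rather than MMTagger's code: haplotypes are assumed to be tuples of 0/1 allele codes with 0 standing for allele $A$ and 1 for allele $a$, and the function name `cooccurrence_r2` is hypothetical. Ties ($f(HA) = f(Ha)$) go to allele $a$, following the rule in the text.

```python
from collections import Counter


def cooccurrence_r2(haplotypes, lhs, rhs):
    """r^2 between SNP set S (indices `lhs`) and the SNP at index `rhs`.

    Assumed coding: 0 = allele A, 1 = allele a.
    """
    n = len(haplotypes)
    count_HA = Counter()  # occurrences of H together with allele A
    count_Ha = Counter()  # occurrences of H together with allele a
    for h in haplotypes:
        key = tuple(h[i] for i in lhs)
        if h[rhs] == 0:
            count_HA[key] += 1
        else:
            count_Ha[key] += 1

    # Map H to A when f(HA) > f(Ha); every other H is mapped to a.
    observed = set(count_HA) | set(count_Ha)
    mapped_to_A = {k for k in observed if count_HA[k] > count_Ha[k]}

    p_HA = sum(count_HA[k] + count_Ha[k] for k in mapped_to_A) / n  # P(H_A)
    p_A = sum(count_HA.values()) / n                                # P(A)
    p_HA_A = sum(count_HA[k] for k in mapped_to_A) / n              # P(H_A A)

    denom = p_HA * (1 - p_HA) * p_A * (1 - p_A)
    if denom == 0:
        return 0.0  # one of the two groups or alleles is empty
    return (p_HA_A - p_HA * p_A) ** 2 / denom
```

For four equally frequent haplotypes in which the third SNP is the exclusive-or of the first two, the pair of SNPs yields 1.0 while either SNP alone yields 0.0, because the mapping may place several haplotypes in the same group.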
We implemented both models in the FastTagger algorithm, and let users choose which model they want to use.
If the $r^2$ statistic between $S$ and $\mathrm{SNP}_x$ is no less than a predefined threshold min_r2, we say that $\mathrm{SNP}_x$ can be tagged by $S$, and $R: S \to \mathrm{SNP}_x$ is a tagging rule. As the size of $S$ increases, the haplotypes of $S$ partition the whole dataset into finer and finer groups. In an extreme case, every haplotype of $S$ occurs at most once, and the association between haplotypes of $S$ and alleles of $\mathrm{SNP}_x$ becomes unreliable. To prevent over-fitting, we put a constraint on the size of $S$: it should not exceed a predefined threshold max_size.
The $r^2$ statistics can be calculated from phased haplotype data directly. If the SNP data are in the form of unphased genotype data, we can use existing haplotype inference algorithms such as PHASE [22] to convert genotype data into phased haplotype data. We can also estimate k-marker haplotype frequencies directly from genotype data without phasing using the algorithms described in [23, 24]. The second approach is used in the LD-select algorithm [9].
Generating tagging rules
To generate all the tagging rules, we need to enumerate all the SNP sets that satisfy the maximum distance constraint and the maximum size constraint, and then calculate the $r^2$ statistics between these SNP sets and their nearby SNPs. The search space can be enormous when the number of SNPs is large. We use several techniques to reduce the number of rules to be tested.
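To make the size of the search space concrete, the following Python sketch enumerates every candidate left hand side that satisfies the two constraints. It is a brute-force illustration, not FastTagger's depth-first enumeration; the function name `candidate_sets` and the input layout (sorted base-pair positions) are assumptions.

```python
from itertools import combinations


def candidate_sets(positions, max_dist, max_size):
    """Yield index tuples of SNP sets whose pairwise distances are <= max_dist.

    positions: SNP coordinates in ascending order (assumed layout).
    Each set is produced once, anchored at its leftmost SNP.
    """
    n = len(positions)
    for first in range(n):
        # SNPs to the right of `first` that are still within max_dist of it.
        # Because positions are sorted, any two SNPs in this window are also
        # within max_dist of each other.
        window = [j for j in range(first + 1, n)
                  if positions[j] - positions[first] <= max_dist]
        for size in range(1, max_size + 1):
            for rest in combinations(window, size - 1):
                yield (first,) + rest
```

With positions `[0, 10, 25, 100]`, `max_dist = 30` and `max_size = 2`, this yields seven sets; the count grows combinatorially with SNP density and `max_size`, which is why the pruning techniques below matter.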
Merging equivalent SNPs
Given two SNPs $\mathrm{SNP}_i$ and $\mathrm{SNP}_j$, if $r^2(\mathrm{SNP}_i, \mathrm{SNP}_j) = 1$, which means that $\mathrm{SNP}_i$ and $\mathrm{SNP}_j$ can tag each other perfectly, then we say that $\mathrm{SNP}_i$ and $\mathrm{SNP}_j$ are equivalent. Two equivalent SNPs always have the same $r^2$ statistics with other SNPs, so the computation cost of the rules involving them can be shared by merging them together.
For each group of merged equivalent SNPs, a representative SNP is picked to represent this group. FastTagger generates tagging rules between representative SNPs only. The tagging rules generated in this way are called representative tagging rules. One representative tagging rule can actually represent multiple rules. Therefore, by merging equivalent SNPs, we are not only saving computation cost, but also reducing storage overhead.
Note that not every rule represented by a representative tagging rule is valid. Some of them may not satisfy the distance constraint. Equivalent SNPs that are separated by more than max_dist bases cannot appear in the same rule, and merging them together can produce many false rules. To reduce the number of false rules, FastTagger only merges equivalent SNPs that are within a distance of max_dist.
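The merging step might be sketched as follows. This is an illustrative assumption about how the grouping could be done, not FastTagger's actual procedure: the function name `merge_equivalent` is hypothetical, each SNP is given as a column of 0/1 allele codes across haplotypes, monomorphic SNPs are assumed to have been filtered out, and the first SNP of each group serves as its representative. For two polymorphic bi-allelic SNPs, $r^2 = 1$ holds exactly when the columns are identical or complementary.

```python
def merge_equivalent(columns, positions, max_dist):
    """Group equivalent SNPs that lie within max_dist of their representative.

    columns: one tuple of 0/1 allele codes per SNP (assumed layout).
    positions: SNP coordinates in ascending order.
    Returns a list of groups; each group is a list of SNP indices whose
    first element is the representative.
    """
    groups = []
    for i, col in enumerate(columns):
        col = tuple(col)
        flipped = tuple(1 - v for v in col)
        for group in groups:
            rep = group[0]
            if positions[i] - positions[rep] > max_dist:
                continue  # too far apart to ever share a rule
            if tuple(columns[rep]) in (col, flipped):
                group.append(i)
                break
        else:
            # No compatible group found: SNP i starts a new group.
            groups.append([i])
    return groups
```

Because positions are sorted and every member lies to the right of its representative, any two SNPs in a group are within `max_dist` of each other.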
Pruning redundant tagging rules
If a SNP $\mathrm{SNP}_x$ can be tagged by a SNP set $S$, then any rule $S' \to \mathrm{SNP}_x$ such that $S'$ is a proper superset of $S$ is redundant. FastTagger generates only non-redundant tagging rules to reduce running time and memory consumption. Non-redundant rules are defined as follows:
Definition 1 (Non-redundant tagging rule) Given a rule $S \to \mathrm{SNP}_x$ such that $\mathrm{SNP}_x$ can be tagged by $S$, if there does not exist another rule $S' \to \mathrm{SNP}_x$ such that $S'$ is a proper subset of $S$ and $\mathrm{SNP}_x$ can be tagged by $S'$, then $S \to \mathrm{SNP}_x$ is called a non-redundant tagging rule.
To prune redundant rules, before calculating the $r^2$ statistic between $S$ and $\mathrm{SNP}_x$, FastTagger checks whether there exists a proper subset $S'$ of $S$ such that $\mathrm{SNP}_x$ can be tagged by $S'$. FastTagger uses a depth-first strategy to enumerate SNP sets. This search strategy is adopted from a frequent generator mining algorithm [25], and it ensures that all the tagging rules whose left hand side is a subset of $S$ are generated before $S$ is processed.
Many tagging rules can be generated. To speed up the check operation, FastTagger divides the generated tagging rules into groups based on their right hand side SNP, that is, rules with the same right hand side SNP are in the same group. FastTagger then uses a hash map to index the rules in the same group, and the hashing key is the left hand side of the rules. To check whether $S \to \mathrm{SNP}_x$ is redundant, FastTagger searches the hash map of $\mathrm{SNP}_x$ for the subsets of $S$. If there is a subset of $S$ in the hash map of $\mathrm{SNP}_x$, the rule is redundant; otherwise, the $r^2$ statistic of the rule is calculated.
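The lookup described above can be sketched in Python as follows. The class name `RuleIndex` and its methods are hypothetical, and Python's built-in hashing of `frozenset` stands in for whatever hash function FastTagger uses. Enumerating the proper subsets of $S$ is affordable here because the size of $S$ is bounded by max_size.

```python
from collections import defaultdict
from itertools import combinations


class RuleIndex:
    """Tagging rules grouped by right hand side and hashed by left hand side."""

    def __init__(self):
        # rhs SNP -> set of left hand sides (each a frozenset of SNP ids)
        self._by_rhs = defaultdict(set)

    def add(self, lhs, rhs):
        """Record the tagging rule lhs -> rhs."""
        self._by_rhs[rhs].add(frozenset(lhs))

    def is_redundant(self, lhs, rhs):
        """True if some proper subset of lhs already tags rhs."""
        stored = self._by_rhs.get(rhs)
        if not stored:
            return False
        items = sorted(lhs)
        # Proper, non-empty subsets only: sizes 1 .. len(lhs) - 1.
        for size in range(1, len(items)):
            for subset in combinations(items, size):
                if frozenset(subset) in stored:
                    return True
        return False
```

For example, after `index.add({1, 2}, 7)`, the candidate rule `{1, 2, 3} -> 7` is reported as redundant, so its $r^2$ statistic never needs to be computed, while `{1, 3} -> 7` is not.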
Skipping rules
Even though merging equivalent SNPs and removing redundant tagging rules can reduce the number of tagging rules significantly, a large number of tagging rules may still be generated in the first step, which incurs high memory consumption in the second step. FastTagger uses a heuristic to further reduce the number of tagging rules generated: if a SNP $\mathrm{SNP}_x$ occurs at the right hand side of sufficiently many tagging rules, then $\mathrm{SNP}_x$ is no longer considered as a right hand side candidate in subsequent rule generation. The rationale behind this heuristic is that if a SNP can be tagged by many other SNPs, then during the tag SNP selection process, the SNP has a high probability of being covered by the selected tag SNPs.
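A minimal sketch of this heuristic is shown below. The class name `RhsSkipper` and the parameter `max_rules_per_snp` are hypothetical; the text does not state the threshold FastTagger uses or how it is configured.

```python
from collections import Counter


class RhsSkipper:
    """Stop generating rules for SNPs that already have enough of them."""

    def __init__(self, max_rules_per_snp):
        # Hypothetical threshold; the source gives no concrete value.
        self.max_rules_per_snp = max_rules_per_snp
        self._counts = Counter()

    def record(self, rhs):
        """Note that one more tagging rule with right hand side `rhs` exists."""
        self._counts[rhs] += 1

    def is_active(self, rhs):
        """True while `rhs` should still be tried as a right hand side."""
        return self._counts[rhs] < self.max_rules_per_snp
```

During rule generation, a candidate rule would be tested only when `is_active(rhs)` holds, and `record(rhs)` would be called each time a rule passes the min_r2 threshold.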
Selecting tag SNPs using a greedy approach
Finding the smallest set of tag SNPs is computationally expensive. FastTagger uses a greedy approach similar to the one proposed in [9, 19] to find a near optimal set of tag SNPs.
Let C be the set of candidate tag SNPs, T be the set of tag SNPs selected so far, and V be the set of SNPs not yet covered. A SNP is covered if either it is a tag SNP or it can be tagged by some SNP set S such that S ⊆ T. Initially, C and V contain all the SNPs, and T is empty.
FastTagger first identifies those SNPs that do not appear at the right hand side of any tagging rule; these SNPs must be selected as tag SNPs. FastTagger puts them into T and removes them from C and from V. Any remaining SNP in V that can be tagged by some SNP set S such that S ⊆ T is then removed from V as well.
Next, for each SNP $\mathrm{SNP}_i \in C$, FastTagger finds the set of SNPs in $V$ that are covered by $\mathrm{SNP}_i$. A SNP $\mathrm{SNP}_j$ in $V$ is covered by $\mathrm{SNP}_i$ if $\mathrm{SNP}_j$ is not tagged by any subset of $T$ and there exists a subset $S$ of $T$ such that $\mathrm{SNP}_j$ is tagged by $S \cup \{\mathrm{SNP}_i\}$.
FastTagger then picks a SNP from C that covers the largest number of SNPs in V as a tag SNP. This newly picked tag SNP is put into T and removed from C. All the SNPs that are covered by it including itself are removed from V. This process is repeated until V is empty, that is, all the SNPs have been covered. In each iteration, in order to find the set of SNPs covered by every candidate tag SNP in C, FastTagger needs to keep the tagging rules in memory. However, the number of rules generated can be very large. It is possible that the total size of tagging rules is too large to fit into the main memory. To solve this problem, we can break the whole chromosome into several chunks such that the rules over every chunk can fit into the main memory. We then select tag SNPs within each chunk.
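Setting the chunking aside, the greedy selection loop can be sketched in Python as follows. This is a simplified illustration, not FastTagger's implementation: the function name `greedy_tag_selection`, the rule representation (pairs of a left hand side set and a right hand side SNP id), and the tie-breaking rule (smallest SNP id wins) are all assumptions.

```python
def greedy_tag_selection(snps, rules):
    """Select tag SNPs greedily.

    snps: iterable of SNP ids.
    rules: iterable of (lhs, rhs) pairs, lhs being a set of SNP ids.
    Returns the set T of selected tag SNPs.
    """
    by_rhs = {}
    for lhs, rhs in rules:
        by_rhs.setdefault(rhs, []).append(frozenset(lhs))

    def tagged(snp, tag_set):
        # True if some rule S -> snp has S contained in tag_set.
        return any(lhs <= tag_set for lhs in by_rhs.get(snp, []))

    # SNPs that no rule can tag must be tag SNPs themselves.
    tags = {s for s in snps if s not in by_rhs}
    candidates = set(snps) - tags                       # C
    uncovered = {s for s in candidates if not tagged(s, tags)}  # V

    while uncovered:
        best, best_cover = None, set()
        for c in sorted(candidates):
            trial = tags | {c}
            cover = {s for s in uncovered if s == c or tagged(s, trial)}
            if best is None or len(cover) > len(best_cover):
                best, best_cover = c, cover
        tags.add(best)
        candidates.discard(best)
        uncovered -= best_cover
    return tags
```

Every uncovered SNP is itself a candidate that covers at least itself, so each iteration removes at least one SNP from `uncovered` and the loop terminates. This sketch rescans the rules in every iteration, which is the reason the rules need to be held in memory.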
When selecting tag SNPs within each chunk, only those tagging rules whose SNPs all fall into this chunk are used. To also utilize the rules that span chunks, we allow two adjacent chunks to overlap. The length of the overlap is determined by the max_dist threshold: the SNPs in one chunk that are within max_dist bases of the first SNP of the next chunk are also included in the next chunk, since they can tag or be tagged by SNPs in the next chunk. FastTagger processes the chunks from left to right. The tag SNPs selected in the current chunk that also belong to the next chunk are passed on to the next chunk as tag SNPs. Note that if the distance between two adjacent SNPs is larger than max_dist, then the gap between them is used as a breakpoint even if there is enough memory, because two such SNPs cannot tag each other or each other's neighbors.
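One way to compute such chunk boundaries is sketched below. It is an assumption-laden illustration: the function name `split_into_chunks` is hypothetical, and a fixed SNP count `max_snps` stands in for the real constraint, which is that the rules over a chunk fit into main memory.

```python
def split_into_chunks(positions, max_dist, max_snps):
    """Return (start, end) index ranges, end exclusive, over sorted positions.

    A chunk ends when it holds max_snps SNPs or when the next SNP is more
    than max_dist away. Each following chunk starts with the SNPs of the
    previous one that lie within max_dist of its first new SNP.
    """
    chunks = []
    n = len(positions)
    start = 0      # first SNP of the current chunk (overlap included)
    new_start = 0  # first SNP not yet assigned to an earlier chunk
    while new_start < n:
        end = new_start + 1
        while (end < n and end - start < max_snps
               and positions[end] - positions[end - 1] <= max_dist):
            end += 1
        chunks.append((start, end))
        if end >= n:
            break
        new_start = end
        # Extend leftwards to include SNPs within max_dist of the first new SNP.
        start = new_start
        while start > 0 and positions[new_start] - positions[start - 1] <= max_dist:
            start -= 1
    return chunks
```

With positions `[0, 10, 20, 30, 40, 1000, 1010]`, `max_dist = 15` and `max_snps = 3`, the result is `[(0, 3), (2, 5), (5, 7)]`: the second chunk re-uses the SNP at position 20, and the gap of 960 bases forces a break with no overlap.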
Using the above method, FastTagger can work on chromosomes containing more than 100 k SNPs with as little as 50 MB of memory, while an existing algorithm consumes more than 1 GB of memory even on chromosomes containing around 30 k SNPs.