CGTS: a site-clustering graph based tagSNP selection algorithm in genotype data

Background Recent studies have shown genetic variation is the basis of the genome-wide disease association research. However, due to the high cost on genotyping large number of single nucleotide polymorphisms (SNPs), it is essential to choose a small subset of informative SNPs (tagSNPs), which are able to capture most variation in a population, to represent the rest SNPs. Several methods have been proposed to find the minimum set of tagSNPs, but most of them still have some disadvantages such as information loss and block-partition limit. Results This paper proposes a new hybrid method named CGTS which combines the ideas of the clustering and the graph algorithms to select tagSNPs on genotype data. This method aims to maximize the number of the discarding nontagSNPs in the given set. CGTS integrates the information of the LD association and the genotype diversity using the site graphs, discards redundant SNPs using the algorithm based on these graph structures. The clustering algorithm is used to reduce the running time of CGTS. The efficiency of the algorithm and quality of solutions are evaluated on biological data and the comparisons with three popular selecting methods are shown in the paper. Conclusion Our theoretical analysis and experimental results show that our algorithm CGTS is not only more efficient than other methods but also can be get higher accuracy in tagSNP selection.


Background
Recent studies show that the abundance of single nucleotide polymorphisms (SNPs) and haplotypes can provide the most complete information for genome-wide association studies.Through the analysis of SNPs and haplotypes, most of the genetic variations among different people can be discovered.However, due to the excessive SNPs, which are about 10 million in the human genome [1][2][3], it is costly to genotyping and studying all SNPs in a candidate region for a large number of individuals.Thus the SNP selecting strategy is proposed to find only a subset of SNPs, which are called tagSNPs or tagging SNPs, to represent the whole SNP set.These tagSNPs have high linkage disequilibrium (LD) values with the rest SNPs [4], and the genetic variation information they have are enough to support the further study, such like disease association gene identification and population variation finding [5,6].Several computational methods have been proposed to solve the tagSNP selecting problem.One common approach is based on the haplotype blocks partitions.These methods delimit the human genome into a set of discrete blocks, where only a small set of common haplotypes exist.These methods search the minimum subsets of tagSNPs from each block.The selected tagSNPs can distinguish each pair of common haplotypes in the block [7], or at least most of them [8][9][10][11].However, there is no general solution on how the blocks are formed.And the lack of the inter-block association degrades the selection accuracy.The LD based methods use the pairwise associations of SNPs.TagSNPs are selected from the site clusters which consist of SNPs with strong LD association (measured by the pairwise LD value r 2 ) between each other [12][13][14][15].These tagSNPs can represent associated SNPs in long distance without the block restriction, but may lose some important information contained in the rest SNPs and fail to distinguish all haplotypes in a LD cluster.Bafna.et al. [16] proposed a somewhat different approach, which assumes tagSNPs can reconstruct the remaining SNPs of an unknown sample with high accuracy.TagSNPs are selected by the measure called informativeness, which quantifies how well the unselected SNPs are predicted and the complete haplotypes are reconstructed through the selected SNPs [17][18][19].These methods do not need prior block partitioning or limited haplotype diversity, but their performances are limited by the restrictions such as the fixed number of tagSNPs and the definitions of the informativeness.
In this paper, we present a new method based on clustering and graph, which is called CGTS, to select tagSNPs.Unlike the previous methods, our method integrates the information of the LD association and the genotype diversity using the site graphs without the information loss and the limit of block partition.Graph based algorithm uses the genotype information to discard the redundant SNPs, and does not need to define block or fix on the tagSNP number.To avoid high computational complexity of graph algorithm for large data, the clustering algorithm is proposed to process the genotype data.Compared to three existing popular methods, our method can give better performance in the experiments.

Implementation and data
We implemented our clustering and graph based tagSNP selection algorithm (CGTS) in C ++ and run the program on a PC with a 1.4 GHz CPU and 512 MB memory.A genotype phasing algorithm: PHASE [20] is used on the gen-otype data to obtain the haplotypes for other methods such as HapBlock and STAMPA compared in experiments.
Five public biological data were used for evaluation.These data include: the Hapmap data set of human chromosome 21 which has 20163 SNPs for 90 European persons [21]; the IBD 5q31 data set which have genotypes for 103 SNPs [8] from an inflammatory bowel disease study of father-mother-child trios, and we only used the children's data; three encode regions ENm013, ENr112 and ENr113 from HapMap [21], The number of SNPs genotyped in each region is 361, 412 and 515.All missing and ambiguous alleles are deleted from the test data.

Evaluation method
The evaluation method is based on the accuracy of nontagging SNP prediction by the tagSNPs.The prediction accuracy is determined by cross-validation [22].The data set is divided into ten subsets.The algorithm is run on the nine subsets to select a minimum set of tagSNPs.The nine subsets and the tagSNPs in the remaining set are used to predict the non-tagging SNPs in the remaining set.
The prediction of non-tagging SNPs is based on the assumption that given the genotype values of two SNPs, the probabilities of the values at any intermediate SNPs do not change by knowing the values of additional distal ones [18].It means that for each non-tagging SNP p, the value of p in given genotype sequence can be identified by the value of two closest tagSNPs in the same sequence.Formally, this assumption can be stated as:

For an unidentified p:
where a, b are the two closest tagSNPs of p. g i, x is the allele value of SNP x in genotype i. q is the tagSNP which is different from a, b.
The prediction precision of one subset is calculated by following equation: N c is the number of correctly predicted alleles and N a is the number of all predicted alleles.(1) The accuracy of the algorithm can be computed as: The 20163 SNPs are firstly divided into four subsets according to their physical distances.Each subset contains 5000 SNPs (the last has 5163 SNPs).CGTS run on four subsets separately and the final tagSNPs are the union of all results on these subsets.
Figure 1a, 1b plot the selected tagSNP number and the improved ratio of tagSNP number with respect to k.The xaxis stands for k and the y-axis stands for the selected tag-SNP number and the improved ratio of CGTS on chromosome 21 dataset.The change of running time with increasing k is plotted on figure 1c.The x-axis stands for k and the y-axis stands for the total running time of CGTS on chromosome 21 dataset.The accuracy decrease is also plotted in figure 1d.The x-axis stands for k and the y-axis stands for the accuracy of CGTS on chromosome 21 dataset.
It can be observed that the improved ratio grows rapidly following the increase of k until k = 200, after that a more slow improvement is observed.It can be explained by the reduced LD association between SNPs and the stability of information contained by SNP cluster.The elapsed time also increases as k becomes large.It is because the sites in the graph increase and more sites need to be investigated with the selecting algorithm to obtain the tagSNPs.On the other hand, the accuracy of CGTS decreases as k becomes large.One intuitive reason is that the CGTS discards some real tagSNPs in the selecting step.Following the increase of k, the graphs in selecting step become larger and noisy sites also become more in the graph.These noisy sites decrease the LD associations in the graphs and the genotype diversity information they have may interfere with the tagSNP selection.Some real tagSNPs are discarded due to this interference and pseudo-tagSNPs may be put into the result.
As a result, the parameter k of CGTS is best set to around 200 to obtain the best improvement in solutions, which still keeps the running time and accuracy within a reasonable period.

Experimental results on biological data
We compare our algorithm with three recent algorithms for tagSNP selection that are widely used: HapBlock [9], MLR-tagging [23], STAMPA [18] These three softwares represent three common methods for tagSNP selection.
HapBlock bases on the block-partition, uses dynamic programming and EM subroutine to get the tagSNPs for each blocks.MLR-tagging uses a multiple linear regression approach and selects tagSNPs based on LD associations.STAMPA uses the SNP prediction accuracy for other SNPs to select tagSNPs.HapBlock and the training step of STAMPA use the haplotype data as input.MLR-tagging and STAMPA need to fix on the tagSNP number.Therefore, to compare the performance of our method to these methods, the tagSNP number of MLR and STAMPA are same as the tagSNP number obtained by our algorithm.
The comparison results of the three methods on the ENm013, ENr112, ENr113 and 5q31 are shown in Table 1

Conclusion
We investigate a novel hybrid method for SNP-tagging, which is called CGTS.The efficiency and solutions of SNPtagging are improved by the combination of graph algorithm and site clustering in CGTS.As shown in the experimental results, our algorithm is able to get higher

Figure 2
Performances of four selection methods.

(a) (b)
prediction accuracy than other approaches with the same size of tagSNPs, and outperforms these approaches in terms of the tagSNP size.Computational time required by CGTS is quite reasonable and can be tailored to available computing resources as needed.The key component of CGTS is the SNP graph model which integrates the information of the LD association and the genotype diversity.This model makes the SNP-tagging problem transform to a graph pruning problem.Using this model can avoid the information loss of SNPs.Our method does no need to fix the tagSNP number.The tagSNPs are not restricted to any bounded location, and can easily be applied to cover the whole chromosome.In the algorithm, multiple tagSNPs are used to represent each untagged SNPs, which further reduce the number of selected tagSNPs.Another advantage of CGTS is that it is amenable to further computational improvements.For example, parallel programming could be used to search tagSNPs in separate precincts, and further speed up the computation.In addition, CGTS can also be used on multi-allelic genotype data and haplotype data.

SNP graph construction
We are given n genotype sequences, each genotype consist of m SNPs.If we assume there are no missing data in the sequences, the n sequences can form a n*m matrix M where rows are sequences and columns are SNPs, M[i, j] ∈ {0,1,2}, where 0 and 1 are homozygous types which represent the major and minor alleles, and 2 indicates the heterozygous type.A set G of genotype sequences distributes in the genome region of given populations follow a functions P(g i ).P(g i ) is the frequency of g i , where g i is a genotype in G. Like the haplotype, genotype sequences only have a few common patterns in a given genome region and these patterns can be distinguished by the tag-SNPs.According to the observation made by several biological studies [12,14], tagSNPs usually have strong LD associations with other relative SNPs, except some special tagSNPs which have no association with other sites.We can represent each genotype by a vector Ti ∈ {0, 1, 2} t , where t is the size of tagSNP set in the genotype, and T i belongs to i-th genotype.Thus, the tagSNPs at least have two important attributes: (1) T i can identify common patterns of given genotype set, and the frequency of T i : P(T i ) equals to the related P(g i ); (2) Each tagSNP has the strong LD value with its relative site set.Our goal is to find a minimum size set ST of tagSNPs for given genotype set, and these tagSNPs satisfy the two attributes described above.Formally, for a given set G and its matrix M, our objective is to find a set of tagSNPs ST with minimum size, and for each pair of common genotype pattern P i and P j in the G, these is a tagSNP In addition, the average LD value of s i with its relative sites is highest in the related site set.To achieve this goal, the main problem is how to represent the two features of tagSNPs and associate them to find the real tagSNPs.For a given genotype set G, all SNPs are distributed randomly as discrete sites in the space.If these sites are connected according to the genotype diversity information and the length of edge is set based on the LD value of two sites, the SNPs can form a graph.Through the analysis of several biological data, we discovered that the genetic-relative sites are bunched up in the graph and there are usually more sites around the real tagSNPs than the others except some special tagSNPs which have no association with any sites.That means it is feasible to construct a site graph to describe the SNP attributes and use a graph algorithm on it to find tagSNPs.
For a given genotype set G, the SNP set is S (|S| = m), and the LD value matrix is R.Each SNP has some information of the genotype diversity.Therefore, for each site s i , we define a diversity set X i = { } to represent genotype diversity information for SNP site s i .= {x r | x r = r, r ∈ [1, n]}, for j = 0,1,2, n is the size of the input genotype set size, r is the sequence number of a genotype in given set, x r ∈ means the r-th genotype have the value j at ith SNP.If X i has two Φ subsets, which means the SNP has only one value at all genotypes, we assume this site has no diversity information, and delete it before the graph construction.If two diversity sets X a = X b , we assume these two sites have the same diversity information, and these sites can combine to one site in the graph for their similar information.However, it is too complex to combine the whole SNP diversity sets and their LD associations in one graph.Thus, the X i is divided to three subsets , and three graphs are constructed for S. In each graph, only the information of one subset is represented.The selection results for three graphs are combined to get the final tagSNPs.For a given S and the diversity subset for each s i , j ∈ {0, 1, 2}, the site graph G j (V, E) can be constructed as following steps: (1) The site set V: each site v i of V represents a SNP s i of S.
Each v i corresponds to the diversity subset of s i .Each v i has a site weight ws i defined by the number of the sites similar to s i .For two SNPs s a and s b , s a is similar to s b in graph G j when .ws i is computed from two sets: the sites satisfy X a = X b and the sites satisfy X a ≠ X b and .
( (3) The edge weight set W: for edge e between sites a and b, the edge weight of the SNP a and the SNP b.The LD value r 2 is usually chosen to evaluate the correlation between two SNPs [24,25], and can be computed as follow: In these equations, p i denotes the frequency of i-th allele for SNP a, p j denotes the frequency of j-th allele for SNP b. m = n = 1, when the data are bi-allele haplotypes; m = n = 2 for the bi-allele genotype data.x ij is the frequency of observing i-th allele for SNP a and j-th allele for SNP b together in the same genotype (haplotype) sequence.D max is the maximum value of LD between the i-th allele for SNP a and j-th allele for SNP b.A r 2 value of 1 indicates the highest LD while 0 indicates no LD.The site graphs of a simple genotype data are shown in figure 3.

TagSNP selection
In this section we present our selection algorithm.The algorithm is based on the observation that the SNPs with same genotype diversity information and high correlation are congregated into a subgraph through the construction of SNP graphs.The SNPs which have high probability to be tagSNPs usually have a subgraph with higher density.In order to reduce the space and time complexity, we are interested in discarding the redundant sites instead of finding the tagSNPs.Therefore, for each site graph G j (V, E), the site which has the subgraph of the maximum density and the highest correlation is the top-priority site for tagSNP selection algorithm.For a given SNP set S and its site graph G j (V, E) for diversity subset , the algorithm details are described as follow: (1) Subgraph finding: in the graph G j , a set of subgraphs N(v i ) is defined to represent a neighbourhood for each site v i .It is induced by every vertex which has a directed edge from v i to it or has an bi-directed edge between v i and it, including v i itself.In order to find the top-priority site and the relative subgraph, the measures for subgraph density and site correlations information are defined as follow: For each N(v i )(VI, EI), the subgraph density is: where ||EI|| is the edge set size of the N(v i ) and ||VI|| is the vertex set size of the N(v i ).The information entropy of the subgraph is defined to describe the correlations information of each N(v i ) as: where c is the number of sites in N(v i ) which is equal to is the normalization factor, ws i is the weight for site v i in the subgraph, w ij is the weight of edge e ij for the site v i and v j in the subgraph.A high value of EP means high LD relations among the sites in the subgraph.
The top-priority site measure function is T(λ, EP): θ is the normalization factor, and the top-priority site is the v i which has the maximum T value.
(2) TagSNP candidate selection: for each chosen subgraph in the given G j , the tagSNP candidates are selected by evaluating the information overlapping among the sites.A site which is not the candidate is deleted from the graph.The selection algorithm is described in figure 4 and figure 5 (3) Final tagSNP identification: after the tagSNP candidate set TC j for each G j are obtained, the final tagSNPs are identified by a voting mechanism.For a SNP s i , a score function is defined as: The SNPs which have the highest score are decided to be real tagSNPs.The example of tagSNP selection is given in figure 6.The SNP data are the data used in the SNP graph construction section.

SNPs clustering
When the size of site graph is larger, the efficiency of our graph algorithm decreases, due to the time-consuming graph construction and tagSNP searching in low-relative SNPs.It is observed that the correlation between SNPs tends to decay as the physical distance increases [8,11,16,26].That means there are many SNPs have no correlation or less correlation among them and the graph score s ,TC ,TC ,TC f s ,TC f s ,TC f s ,TC f s ,TC An example of SNP graphs Figure 3 An example of SNP graphs.The dashed represents the bi-directed edge in the graph.Graph for X 0 Graph for X 1  Graph for X 2 Function Divide 2. while X 1 is not empty 3. do 4.
Get a X a j from X 1 ;

5.
for each X b j X 2 ; for each X b j X 2 ; 29. End if 30.
End for 31.
End if 33.End while 34.Combine the same elements in X 2 ; In the KNN clustering algorithm, parameter k is an integer less than the number of SNPs in the input SNP set S, the correlation between two SNPs is represented by the LD value r 2 calculated through equation (5).In this algorithm, r 2 is directly related to the distance of two sites.r 2 = 1, the distance of two SNP sites is 0, and the two SNPs are considered identical and only one of them is retained.r 2 = 0, the distance of two sites is ∞.d k i is the LD value of SNP s i and its k-th nearest neighbour.In each iteration, the SNP which has the highest d k i and its k nearest neighbours are deleted from S and selected to construct the SNP graphs for tagSNP selection.The obtained tagSNPs are put into the S and a new iteration is started.The obtained tagSNPs of each subset are combined to get the final set of tagSNPs.
The choice of k in the KNN algorithm controls the maximum information of the SNP set and the maximum size of site graphs.In general, different values of k result in different accuracy of the tagSNP selection.The bigger k, the smaller tagSNPs may be obtained, the more running time and space are cost.In the tagSNP selection, there are two possible ways to select k. (1) Select k so that the LD value between a chose SNP and its k-nearest neighbour is greater some threshold.(2) Select k to achieve desired prediction accuracy via cross-validation.The accuracy evaluation is given in more detail in the evaluation section.The tradeoff test of k is discussed in result section.

Complexity analysis
For a SNP subset which has the site set size n andcontains m genotype sequences, the running time for LD value computing is O(m).Thus, the total running time of the clustering step is O(n 2 m).For each cluster which has (k+1) SNPs, constructing site graphs takes O(k 2 ).Finding subgraph needs O(k).Selecting tagSNP candidates from a subgraph which has t sites (0 <t ≤ k + 1) takes O(t 2 ).Identifying the final tagSNPs needs O(k).The total running time of tagSNP selection is O(k 2 + kt 2 +k).Since t ≤ k+1, the total running time for whole algorithm is O(n 2 m + k 3 n/k) = O(n(nm + k 2 )) with taking into the account of the iteration number.In practice, the site set size n usuallyless than 5000 and k is controlled in interval [50, 500], the running time is less than expectation.Graph for X 0 Graph for X 1  Graph for X 2

Figure 6
An example of tagSNP selection.
Publish with Bio Med Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime."Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours -you keep the copyright Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp BioMedcentral BMC Bioinformatics 2009, 10(Suppl 1):S71 http://www.biomedcentral.com/1471-2105/10/S1/S71

results for trade-off test
and figure2.HapBlock gave no solution for ENr113 due to the memory overload.CGTS can perform better than HapBlock in prediction accuracy, get smaller tagSNP sets and use shorter times.It is because that there is no need to conduct block partition and the LD associations among SNPs are taken into consideration.CGTS is more accurate than MLR-tagging and STAMPA with the same number of selected tagSNPs.
When the SNP size of the input data is increasing, STAMPA is asymptotically faster but CGTS is more accurate with an acceptable running time.It is due to the increasing iteration number of CGTS.Appropriate choosing of k can reduce the running time of CGTS.And we chose the k = 250 because the CGTS can obtain better improvement in solutions but still keep the running time and accuracy within a reasonable period on the given biological data.

Table 1 : The comparison of the four methods
The edge set E: tagSNPs can identify all or at least most of the common genotype patterns.The diversity information is represented by X i .The associated SNPs have their X i