Assignment of chromosomal locations for unassigned SNPs/scaffolds based on pair-wise linkage disequilibrium estimates

Background Recent developments of high-density SNP chips across a number of species require accurate genetic maps. Despite rapid advances in genome sequence assembly and availability of a number of tools for creating genetic maps, the exact genome location for a number of SNPs from these SNP chips still remains unknown. We have developed a locus ordering procedure based on linkage disequilibrium (LODE) which provides estimation of the chromosomal positions of unaligned SNPs and scaffolds. It also provides an alternative means for verification of genetic maps. We exemplified LODE in cattle.
Results The utility of the LODE procedure was demonstrated using data from 1,943 bulls genotyped for 73,569 SNPs across three different SNP chips. First, the utility of the procedure was tested by analysing the masked positions of 1,500 randomly-chosen SNPs with known locations (50 from each chromosome), representing three classes of minor allele frequencies (MAF), namely >0.05, 0.01<MAF ≤ 0.05 and 0.001<MAF ≤ 0.01. The efficiency (percentage of masked SNPs that could be assigned a location) was 96.7%, 30.6% and 2.0%; with an accuracy (the percentage of SNPs assigned correctly) of 99.9%, 98.9% and 33.3% in the three classes of MAF, respectively. The average precision for placement of the SNPs was 914, 3,137 and 6,853 kb, respectively. Secondly, 4,688 of 5,314 SNPs unpositioned in the Btau4.0 assembly were positioned using the LODE procedure. Based on these results, the positions of 485 unordered scaffolds were determined. The procedure was also used to validate the genome positions of 53,068 SNPs placed on Btau4.0 bovine assembly, resulting in identification of problem areas in the assembly. Finally, the accuracy of the LODE procedure was independently validated by comparative mapping on the hg18 human assembly.
Conclusion The LODE procedure described in this study is an efficient and accurate method for positioning SNPs (MAF>0.05), for validating and checking the quality of a genome assembly, and offers a means for positioning of unordered scaffolds containing SNPs. The LODE procedure will be helpful in refining genome sequence assemblies, especially those being created from next-generation sequencing where high-throughput SNP discovery and genotyping platforms are integrated components of genome analysis.


Background
The last decade has seen a rapid expansion in the number of genomes from a diverse range of species being sequenced [1]. Further developments of high-throughput sequencing platforms are likely to accelerate the sequencing of potentially many more genomes [2].
Furthermore, such data sets may be coupled with high-throughput SNP-analysis platforms to undertake population diversity characterization [3,4]. The relatively short sequence reads from the high-throughput systems pose challenges in the creation and ordering of contigs and scaffolds in the absence of a mature reference genome. Ordering closely linked markers is also a challenge using linkage mapping. Assembly of the bovine genome sequence has recently been reported [5]. In the course of bovine sequencing to date, more than 2 million SNPs have been discovered and more SNPs are being added with additional sequencing efforts using next generation sequencing technologies [6], resulting in several high-density SNP-genotyping platforms for population-wide screening of genome diversity. Despite several genome builds, there are still a large number of scaffolds and SNPs that are not yet assigned to any chromosomes. For example, there are 11,869 un-ordered scaffolds in Btau4.0, constituting 9.72% (263.4 Mb) of the bovine genome. In order to improve the genome assembly, it would be useful to assign un-ordered scaffolds and SNPs to chromosomes, and to locations within chromosomes [7].
A number of strategies can be adopted to place polymorphic markers on chromosomes via linkage maps [8][9][10], Radiation Hybrid (RH) maps [11][12][13], FISH and integrated maps [14]. Linkage studies require genotypic information on specific families, and it is difficult to construct accurate or high-resolution linkage maps for high-density SNP data [10]. Alternatively, physical maps of SNPs, created by screening RH panels, enable high-resolution positioning of SNPs but require high-density anchoring of the physical genome to the assembly. However, a SNP can be given a chromosomal position based on linkage disequilibrium (LD) information of the SNP with other SNPs with known position in the genome. LD analysis does not rely on family information, and LD decays rapidly both across [15] and within [16] populations; as such, it can provide a means to accurately position SNPs based on LD relationships with other SNPs with known map positions. Miller et al. [17] applied an LD-based approach to map a test set of SNPs with known map positions. However, the utility of this approach for unmapped SNPs, or SNPs with ambiguous positions in the context of high-density SNP data, has not been demonstrated.
Recently we showed [18] that polymorphic markers can be ordered within a chromosome based on pairwise LD only and termed this procedure LODE (Locus Ordering by Dis-Equilibrium). A sorting algorithm (sorting points into neighbourhoods) [19] was applied. The procedure was successful in assigning a small number of unmapped SNPs to unique chromosomal locations but was found to be limited in terms of scaling up to large matrices representing dense SNP panels.
Here we modify the initial LODE procedure for assigning SNPs to chromosomes and positioning SNPs within chromosomes. First, the efficiency of using genome-wide LD information is investigated by using mapped SNPs as a test set. Next, the procedure is applied to assign positions for 4,688 out of 5,314 unpositioned SNPs on Btau4.0, which were either unassigned or assigned with ambiguity based on BLAST against Btau4.0, from a high-density SNP panel of 73,569 SNPs. We also suggest the chromosomal locations of un-ordered scaffolds. Finally, the LODE procedure is used to confirm the order of mapped SNPs across the genome as a means to check the quality of genome assembly.

Genotypic Data
Data from three SNP genotyping arrays, namely 15 k [20], 25 k (Affymetrix; http://www.affymetrix.com) and 54 k (Illumina; http://www.illumina.com/), used for genotyping 1,536, 441 and 377 Australian Holstein-Friesian (HF) bulls, respectively, were combined into a single dataset for the current analyses. There were duplicate samples and duplicate SNPs within and between datasets. Where duplicates occurred, only the unique sample or SNP with the higher call rate (% genotype assignment) was selected for inclusion in the final dataset. Any inconsistent genotype was set to unknown. The final combined dataset represented 73,569 unique SNPs and 1,943 bulls with an average of 628 bulls genotyped per SNP. The mean coefficient of coancestry among these 1,943 bulls was 0.025, with 0.0 and 0.035 for the first and third quartiles, respectively.

Position of SNPs
The location of each of the 73,569 SNPs in the bovine genome was assessed from BLAST alignment of SNP flanking sequences with the Btau4.0 assembly (ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Btaurus/fasta/Btau20070913-freeze/), which includes a considerable quantity of sequence (organised as either a set of scaffolds or as a pseudo-chromosome) that is not assigned to a chromosome (referred to as 'Un'). We used the Btau4.0 assembly to demonstrate the utility of the LODE procedure since the assembly contained a number of SNPs not assigned to chromosomes. LODE positions were also compared against another bovine assembly build, UMD3.0, which has recently become available.
SNP positions on Btau4.0 were categorised as follows: i) 'mapped' (single assignment to a chromosome); ii) 'ambiguous' (more than one assignment in the genome); iii) 'Un' (single assignment to 'Un' sequences only); iv) 'unassigned' (no assignments in the genome). Collectively, the last three categories (ambiguous, Un and unassigned) are here called 'unpositioned'.

LODE procedure
The location of each unpositioned SNP was estimated on the basis of its LD (estimated as r²) with mapped SNPs. The r² estimates were obtained using GOLD [21]. The genotypes for SNPs on the X-chromosome were considered as homozygous for the purpose of computing LD estimates. Only high quality LD estimates (significant at the 0.01 level, and estimated from a minimum of 100 observations) were used. The actual procedure used in the present study is an extension of the strategy first used by Miller et al. [17] and subsequently adapted to the LODE procedure by Sölkner et al. [18]. In the present study, the LODE procedure consisted of two main steps: A) assigning a SNP to a chromosome; B) estimating the position of the SNP within the assigned chromosome. After trialling many combinations of criteria, the following strategy was used (the relative accuracy of using different threshold combinations is shown in Additional file 1).
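As a rough illustration of the quality filter on pairwise LD, the sketch below computes r² as the squared Pearson correlation between two genotype vectors and discards estimates based on fewer than 100 jointly genotyped animals. This is an assumption made for illustration only: the study obtained r² from GOLD [21], whose estimator may differ, and the significance test at the 0.01 level is not reproduced here. The function name, the 0/1/2 genotype coding and the use of None for missing genotypes are hypothetical.

```python
def genotype_r2(snp_a, snp_b, min_obs=100):
    """Squared Pearson correlation between two genotype vectors.

    Genotypes are assumed to be coded 0/1/2 (allele counts), with None for
    a missing call. Returns None if fewer than min_obs animals are
    genotyped at both SNPs, or if either SNP is monomorphic among them.
    NOTE: a stand-in for the GOLD estimate used in the study, not a copy.
    """
    # Keep only animals genotyped at both SNPs
    pairs = [(a, b) for a, b in zip(snp_a, snp_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < min_obs:
        return None
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    s_aa = sum((a - mean_a) ** 2 for a, _ in pairs)
    s_bb = sum((b - mean_b) ** 2 for _, b in pairs)
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    if s_aa == 0 or s_bb == 0:
        return None  # no variation, correlation undefined
    return s_ab * s_ab / (s_aa * s_bb)
```

Returning None rather than 0 for low-quality pairs keeps "no usable estimate" distinct from "no LD", so such pairs can simply be skipped in later steps.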

A) Assigning a SNP to a chromosome
For each unpositioned SNP with MAF >0.01:
1. r² was estimated with all mapped SNPs.
2. From these estimates of r², two parameters were computed with respect to each chromosome, namely:
   a. maximum r² (r²_max, as an indicator of the strength of LD)
   b. number of mapped SNPs with r² > 0.1 (n_0.1, as an indicator of the number of mapped SNPs in LD with the unpositioned SNP)
3. Chromosomes were then ranked according to r²_max and n_0.1, in the latter case after excluding chromosomes for which n_0.1 < 3. A chromosome with top ranking for both parameters was identified as the candidate chromosome for that unpositioned SNP.
After trialling the above threshold combinations, SNPs with MAF ≤ 0.05 were found to require an additional check to improve accuracy of placement. In addition to the above strategy (steps 1-3), the chromosome with the next highest r²_max was identified. If the r²_max of this second chromosome exceeded two-thirds of the r²_max of the candidate chromosome, the SNP was not assigned to any chromosome. This improved the accuracy of assignment from 92.1% to 98.9% (Additional file 1). SNPs which did not meet these criteria were left unpositioned.
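The chromosome-assignment rules of step A can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the function name, the dictionary input and the handling of ties (a tie in n_0.1 is accepted, a tie in r²_max is resolved by input order) are hypothetical choices. The thresholds are those given in the text.

```python
def assign_chromosome(r2_by_chrom, maf, r2_cut=0.1, min_hits=3,
                      low_maf=0.05, ratio=2 / 3):
    """Return the candidate chromosome for one unpositioned SNP, or None.

    r2_by_chrom maps a chromosome name to the list of r2 values between
    the unpositioned SNP and the mapped SNPs on that chromosome.
    """
    if maf <= 0.01:
        return None  # the procedure is applied only to SNPs with MAF > 0.01
    # Per-chromosome summary: (r2_max, n_0.1)
    stats = {c: (max(v), sum(r > r2_cut for r in v))
             for c, v in r2_by_chrom.items() if v}
    # Chromosomes with n_0.1 < 3 are excluded from the n_0.1 ranking
    eligible = [c for c in stats if stats[c][1] >= min_hits]
    if not eligible:
        return None
    ranked = sorted(stats, key=lambda c: stats[c][0], reverse=True)
    candidate = ranked[0]
    best_n = max(stats[c][1] for c in eligible)
    # The candidate must rank first on both r2_max and n_0.1
    if candidate not in eligible or stats[candidate][1] != best_n:
        return None
    # Extra check for low-MAF SNPs: runner-up r2_max must not exceed
    # two-thirds of the candidate's r2_max
    if maf <= low_maf and len(ranked) > 1:
        if stats[ranked[1]][0] > ratio * stats[candidate][0]:
            return None
    return candidate
```

For example, a SNP with MAF 0.03 whose best two chromosomes have r²_max of 0.5 and 0.4 is left unassigned, because 0.4 exceeds two-thirds of 0.5; the same SNP with MAF 0.2 would be assigned to the first chromosome.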

B) Estimating position within an assigned chromosome
For each unpositioned SNP that could be assigned to a chromosome, its location on that chromosome was allocated the same position as that of the mapped SNP with which the unpositioned SNP has r²_max.
The above LODE procedure was first tested for its ability to determine the location of SNPs whose location was actually known. Three test sets involved determining the location of a total of 1,500 "masked" SNPs (50 from each of the 29 autosomes and the X chromosome, randomly selected from SNPs with known positions). Each set comprised SNPs with a different MAF class, namely 0.001<MAF ≤ 0.01 (300 SNPs, 10 from each chromosome); 0.01<MAF ≤ 0.05 (300 SNPs, 10 from each chromosome); >0.05 (900 SNPs, 30 from each chromosome). The extent to which the procedure was successful was assessed in terms of "efficiency" (the percentage of "masked" SNPs that were assigned a location), "accuracy" (the percentage of "masked" SNPs that were assigned to the correct chromosome), and "precision" (the difference in physical distance between the known position and the assigned position). After testing the LODE procedure with the above test sets, the same procedure was applied to unpositioned bovine SNPs.
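Step B and the three evaluation measures can be illustrated with a short sketch. The function names and data layout are hypothetical. Two details are assumptions that the text does not specify: accuracy is expressed relative to the SNPs that were placed, and precision is averaged as an absolute distance in kb over the SNPs assigned to the correct chromosome.

```python
def estimate_position(mapped_hits):
    """Step B: take the position of the mapped SNP with the highest r2.

    mapped_hits is a list of (position_bp, r2) pairs for mapped SNPs on
    the assigned chromosome.
    """
    position, _ = max(mapped_hits, key=lambda hit: hit[1])
    return position


def evaluate(masked, placed):
    """Efficiency (%), accuracy (%) and mean precision (kb) for a test set.

    masked: SNP id -> (true chromosome, true position in bp)
    placed: SNP id -> (assigned chromosome, assigned position in bp),
            containing only the masked SNPs that received a location.
    """
    correct = [s for s in placed if placed[s][0] == masked[s][0]]
    efficiency = 100.0 * len(placed) / len(masked)
    accuracy = 100.0 * len(correct) / len(placed) if placed else 0.0
    # Assumption: precision is averaged over correctly assigned SNPs only
    errors_kb = [abs(placed[s][1] - masked[s][1]) / 1000 for s in correct]
    precision_kb = sum(errors_kb) / len(errors_kb) if errors_kb else None
    return efficiency, accuracy, precision_kb
```

With four masked SNPs of which three are placed and two land on the correct chromosome, this returns an efficiency of 75% and an accuracy of about 66.7%.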

Comparative position on human genome
To provide further evidence of the utility of the LODE procedure, we used a comparative mapping approach to confirm the genome location of unpositioned bovine SNPs against the human genome assembly hg18, since this represents the most complete mammalian genome to date. This approach was considered helpful since the location of the unpositioned SNPs could not be validated on Btau4.0 directly.
The comparative position of bovine SNPs was estimated in the human genome using two approaches. Firstly, BLAST was used to align the flanking sequences of unpositioned SNPs with the hg18 assembly (ftp://hgdownload.cse.ucsc.edu//goldenPath/hg18/). Secondly, the 'LiftOver' tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver) was used with default settings to convert LODE positions from the bovine Btau4.0 assembly to the human hg18 assembly.

LODE as a means for checking genome assembly
The LODE procedure was used to recompute the positions of all mapped SNPs that were genotyped, met the minimum criteria for inclusion detailed above, and had MAF >0.05. The procedure was performed in batches, where the positions of 10% (every 10th) of the SNPs of a chromosome were masked. The positions of the masked SNPs were recomputed based on the LD information of the remaining SNPs in the genome. The chromosomal assignments and positions estimated by LODE were compared with the original positions on Btau4.0 and also with UMD3.0.
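The batching scheme can be sketched as below. It assumes, as one reading of "every 10th", that ten batches are run per chromosome so that each SNP is masked exactly once; the function name and the list input are hypothetical.

```python
def masking_batches(snps_in_map_order, n_batches=10):
    """Yield (masked, reference) lists for one chromosome.

    Batch k masks the SNPs at indices k, k + n_batches, k + 2*n_batches, ...
    so that each batch hides roughly 10% of the SNPs and, over all
    batches, every SNP is masked exactly once (an assumed design).
    """
    for k in range(n_batches):
        masked = snps_in_map_order[k::n_batches]
        reference = [snp for i, snp in enumerate(snps_in_map_order)
                     if i % n_batches != k]
        yield masked, reference
```

Masking evenly spaced SNPs, rather than a contiguous block, leaves mapped neighbours on both sides of every masked SNP, which is what the position estimate relies on.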

Validation of LODE procedure by test runs
A total of 870 (96.7%) of the 900 test SNPs with MAF>0.05 were allocated a chromosomal position by LODE. All but one (i.e. 869 = 99.9%) of the positions were the same as the Btau4.0 accepted assembly position. The comparison of estimated and known SNP positions (Additional file 2) shows strong agreement (mean Pearson's correlation = 0.98 across all chromosomes). The mean precision of localisation was 914 ± 130 kb (Table 1). The results from alternate criteria that were tested during the development of the preferred strategy are shown in Additional file 1. A total of 92 SNPs (30.6%) from the second test set (0.01<MAF ≤ 0.05) were positioned, with only one mis-assignment (1.1%). Comparison of the estimated LODE positions and the known positions showed high agreement (Additional file 3). Thus, the efficiency of positioning SNPs in this MAF range was much lower, but for those SNPs that could be positioned, the accuracy was very high. Rare SNPs (0.001<MAF ≤ 0.01) could not be positioned reliably (efficiency 2.0%, accuracy 33.3%; Table 1). Overall, it can be concluded that the LODE procedure can position SNPs with MAF>0.01 with high accuracy.

Application of LODE to unpositioned SNPs
In the Btau4.0 assembly, there are 6,470 'unpositioned' SNPs. Of these, 5,314 SNPs have MAF>0.01, making them suitable for LODE mapping (Additional file 4). Table 2 shows the number of SNPs positioned by LODE. Of the 5,314 'unpositioned' SNPs with MAF >0.01, 2,291 had ambiguous positions, 1,770 were aligned to 'Un' sequences, and 1,253 were unaligned. Using the LODE strategy, 4,688 of the 'unpositioned' SNPs were positioned. Of the 626 SNPs which didn't meet the thresholds of the LODE procedure, 231 had ambiguous positions, 271 had 'Un' sequences and 124 were unaligned. As expected from the test-set results, a higher proportion of the SNPs with MAF >0.05 (94.2%) than with 0.01<MAF ≤ 0.05 (27.6%) could be positioned. The proportions of SNPs placed in the two categories are comparable to the proportions observed in the two corresponding test sets.
Of 2,291 SNPs in the ambiguous category, 2,060 were positioned by LODE. The SNPs in this category had multiple hits when flanking SNP sequence was BLASTed against Btau4.0. Although it is possible that some of the sequence alignment positions in this category may be the result of errors in the Btau4.0 assembly, it is more likely that they are genuine genomic positions reflecting structural polymorphisms or segmental duplications. The SNP positions estimated by LODE are approximations and hence for the SNPs in this category it may be preferable to use LODE positions to discriminate between the multiple sequence-alignment results, and use the sequence alignment consistent with LODE for final positioning.
Of 1,770 SNPs belonging to 'Un' sequences, 1,499 were positioned by LODE. These SNPs belong to 494 unique "Un" unordered Btau4.0 scaffolds. Assignment of these SNPs to definite chromosomes suggests the assignment and positions of respective "Un" scaffolds to the same chromosome as well. Table 3 presents the number and length of these "Un" scaffolds assigned to different candidate chromosomes. These assigned scaffolds comprise 87.7 Mb of genome sequence in total. There were multiple SNPs on some of the "Un" scaffolds. Out of these, 210 "Un" scaffolds had two or more SNPs (mean = 5.04) with all the SNPs aligned to one chromosome (Additional file 5). These 210 "Un" scaffolds with multiple SNPs could be assigned and some of them could be oriented on the chromosome, based on the SNP position estimates. This approach may therefore be very useful for improving the bovine assembly, since it provides for a higher resolution assignment of SNPs and the scaffolds. There were 9 scaffolds with multiple SNPs that were given positions on two chromosomes by LODE. This may indicate problems in the assembly of these scaffolds themselves and may require the segments with separate SNPs to be placed separately for improved accuracy of genome assembly.
Table 1 Efficiency (proportion of SNPs placed), accuracy (proportion of SNPs placed correctly) and precision (kb location from draft assembly location) of the LODE procedure for placing SNPs with known location in three test runs with varying thresholds of MAF of SNPs to be placed.
Of 1,253 SNPs in the unaligned category, 1,129 were positioned by LODE. These sequences are missing from the Btau4.0 assembly, possibly because of the nature of whole-genome shotgun sequencing, or because they are within polymorphic regions not present in the two individuals which contributed to Btau4.0, but are present within the population with which we have worked.
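The scaffold-level bookkeeping described for the "Un" SNPs can be sketched as follows: a scaffold inherits a chromosome when all of its LODE-positioned SNPs agree, and is flagged when they disagree. The function name and the dictionary inputs are hypothetical, and the scaffold and SNP identifiers in the usage line are invented for illustration.

```python
from collections import defaultdict


def assign_scaffolds(snp_to_scaffold, snp_to_chrom):
    """Group LODE chromosome assignments by scaffold.

    snp_to_scaffold: SNP id -> scaffold id
    snp_to_chrom:    SNP id -> chromosome assigned by LODE (positioned
                     SNPs only)
    Returns (assigned, conflicting): scaffolds whose SNPs all point to one
    chromosome, and scaffolds whose SNPs point to two or more.
    """
    chroms = defaultdict(set)
    for snp, chrom in snp_to_chrom.items():
        chroms[snp_to_scaffold[snp]].add(chrom)
    assigned = {sc: next(iter(cs)) for sc, cs in chroms.items()
                if len(cs) == 1}
    conflicting = {sc: sorted(cs) for sc, cs in chroms.items()
                   if len(cs) > 1}
    return assigned, conflicting
```

Calling `assign_scaffolds({"s1": "ScfA", "s2": "ScfA", "s3": "ScfB", "s4": "ScfB"}, {"s1": "5", "s2": "5", "s3": "7", "s4": "12"})` places ScfA on chromosome 5 and flags ScfB as split between chromosomes 7 and 12, the situation reported for 9 scaffolds in the text.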
In summary, the LODE procedure has positioned 4,688 of 5,314 SNPs that are unpositioned in the Btau4.0 assembly.

Validation of LODE positions by comparative mapping
Unique (single location) positions on the human hg18 assembly were obtained from the BLAST and LiftOver procedures for 284 SNPs from the panel of 4,688 SNPs positioned by LODE. The chromosomal assignments for 230 (81%) of these SNPs were identical between BLAST and LiftOver. The remaining 54 (19%) of the 284 SNPs had different chromosomal assignments on hg18 by the two procedures, which may be due to the LODE positions being outside of conserved syntenic blocks between bovine and human chromosomes. Such blocks are normally very small and quite variable in length. Comparison of the chromosomal positions of the 230 SNPs with the same chromosomal assignments shows very strong agreement (cor = 0.95) (Figure 1) between the positions obtained through BLAST and LiftOver. These results support the accuracy and utility of the LODE procedure for positioning SNPs with MAF>0.01.
Table 3 Number of SNPs and unassigned scaffolds ("Un") assigned to different chromosomes by the LODE procedure. Columns: Chromosome; No. of SNPs assigned; No. of "Un" scaffolds assigned; Length of "Un" scaffolds in bp.
Table 4 shows distribution of these 81 SNPs mapped to different chromosomes. Out of these SNPs, 5 blocks can be noted as shown in Table 5. All the SNPs of these blocks were assigned to a different chromosome by LODE. The positions of these SNPs were compared with another recently released assembly of the bovine genome (UMD3.0), which agrees with the LODE assignments for the SNPs in the blocks (Table 5). These blocks suggest problem areas within the Btau4.0 assembly. The overall agreement between LODE positions and the positions of SNPs on Btau4.0 is shown in Figure 2 by way of an Oxford grid. The detailed alignment of LODE positions and Btau4.0 for each chromosome is shown in Additional file 6. This identifies chromosomal regions that may be problem areas in the Btau4.0 assembly. In particular, two regions (10-11 Mb and 90-120 Mb) on BTA5 suggest problem areas in the assembly of this chromosome (Figure 3). Similarly, the X chromosome shows several regions where a relatively higher number of SNPs differ between the original Btau4.0 positions and the LODE positions, which may suggest a general problem in the assembly of the X chromosome (Additional file 6).

Discussion
In this study we reported and validated a procedure to accurately and efficiently map SNPs based on LD information. The LODE procedure offers particular advantages in the positioning of problem SNPs for which no unambiguous assignment on a draft genome assembly could be made, as well as a means for positioning of unordered scaffolds containing SNPs. Miller et al. [17] used a genetic algorithm based approach and linkage disequilibrium to position a test set of bovine SNPs with known location, and applied a minimum threshold of r² > 0.4 between SNPs in their method. Application of such a threshold would have resulted in lower efficiency (71% for SNPs in test run 1, MAF >0.05) and slightly lower accuracy (2 mis-assignments) when compared to the thresholds adopted in our study (Table 1). The LODE procedure showed greater utility than the method described by Miller et al. [17], who did not demonstrate the placement of SNPs with MAF<0.05, SNPs with ambiguous assignments or unpositioned SNPs. The original LODE procedure of Sölkner et al. [18] was of similar accuracy and efficiency in small test runs, but has severe limitations in terms of computing time (Sölkner et al. in preparation) imposed by the matrix dimensions that come with high marker density, thus limiting application to full genome analyses. The MAF of the SNP to be placed has a significant effect on the efficiency of the LODE procedure, as shown in detail in the Results section by running the three different test sets of varying MAF (Table 1). However, despite the lower efficiency, the accuracy of the LODE procedure for SNPs with 0.01<MAF ≤ 0.05 was high. Another advantage of using the LODE procedure was that SNPs which showed deviation from Hardy-Weinberg equilibrium (HWE) could also be mapped. For example, in test set 1, out of the 56 SNPs showing HWE deviation (P < 0.0001), 54 could be given assignments and all of these assignments were correct. Linkage studies generally exclude such SNPs from analysis [10].
Finally, the LODE procedure can be used for checking the integrity of an assembly by sampling and reassigning the positions of SNPs, as shown in the Results section. The LODE procedure described here is complementary to other commonly used methods to assemble maps, including linkage maps and physical maps such as fluorescence in situ hybridization (FISH) and Radiation Hybrid mapping [13], but offers significant advantages over these methods since they are very laborious, may have limited resolution, and often require highly specialized resources [22][23][24][25]. The comparative advantages and limitations of using LODE mapping are discussed in detail below.
The building of linkage maps for genome assemblies has the advantage that de novo ordering of markers can result in robust framework maps, but such maps often require information from large and specific resource populations. Indeed, linkage maps have been assembled for many species including a broad range of markers (cattle [8], pig [26], sheep [27], mouse [28], chicken [29] and human [30]). In the case of mouse, SNP markers could be placed at a resolution and accuracy of 0.3 Mb by linkage mapping [31]. Most resource populations do not have sufficient power to treat each marker in a high density map as a framework reference point (anchor marker) as described by Ott et al. [32] in their guidelines for developing linkage maps. Recently Arias et al. [10] reported on the construction of a bovine hybrid linkage-map by combining linkage and physical map (Btau4.0) information. However, of the 9,713 SNPs genotyped, 2,946 (30.3%) could not be assigned to the linkage map for quality control reasons. Furthermore, 743 (9.4%) of the 7,822 markers assembled for mapping could not be positioned. In contrast, the LODE procedure was able to place 4,688 out of 5,314 SNPs in a data set of 73,569 SNPs, which is the largest panel of bovine SNPs that can currently be assembled from commercially available SNP arrays.
Integrated maps and comparative maps are frequently used to build interim maps for a species in the absence of a completed genome assembly [33,14,34]. The BLAST procedure is commonly used to align sequences and, when combined with LiftOver, can be used to infer marker position and order. However, this procedure is highly inefficient when compared to direct mapping such as LODE. For example, out of 4,688 SNPs successfully mapped by LODE, only 230 would have been mapped successfully using BLAST and LiftOver from the human assembly to the bovine assembly. Lewin et al. [7] highlighted the limitations and conundrum of using comparative mapping information for building maps and emphasised the importance of developing independent species-specific maps for discovery of conserved chromosome segments and evolutionary breakpoint regions.
Despite the array of tools available for constructing genetic and physical maps, a large number of SNPs and scaffolds remain unpositioned, which is likely to be common for most species in which genome assembly is being undertaken (chicken [35], dog [36], cat [37], pig [38] and many other species [1]). As such, the LODE mapping procedure offers a significant additional tool for completing genome maps and assemblies. The LODE procedure relies on linkage disequilibrium information from unrelated samples from the population and does not require a specific resource population. A reliable estimate of r² can be obtained from a minimum sample size of 75 unrelated individuals [16], which can be found in many diversity and association studies.
However, despite the high degree of accuracy of placement, the LODE procedure still only provides an approximation of the exact localisation (precision) of SNPs within a chromosome, since it is dependent on the accuracy of the prior genome assembly as a reference framework and the density of known SNPs to allow positioning of unknown SNPs. Hence, the precision of positioning SNPs with the LODE procedure will increase with increasing SNP density and accuracy of the sequence map. However, the quality of the assembly can be assessed by using the LODE procedure to confirm the location of SNP markers with assigned positions, which provides an independent cross-check as shown in the Results section. The initial density of marker maps required for the LODE procedure to be effective will depend upon the extent of LD in the population, which is often population specific. Where no reference positions are available, as in the case of de novo genome sequencing and mapping, using D' as a measure of LD will be useful for LODE mapping (Sölkner et al. in preparation). It is recommended to always test the LODE strategy on a panel of mapped SNPs with known positions before applying the procedure to unmapped SNPs and, if necessary, to alter some of the threshold criteria.
The LODE procedure can also be very helpful in refining sequence and genetic maps for species where comparative genome assemblies are used to build a virtual assembly for the species of interest, such as has recently been done for sheep [33]. Population-wide (across or within breeds) LD information from high density SNP data (see the ISGC website http://www.sheephapmap.org/) can be used to place and validate SNP locations, and to order unplaced scaffolds where they contain SNPs with appropriate genotype information. The LODE procedure is likely to be of significance in the future as developments in next-generation sequencing technologies are providing deep sequencing coverage at an affordable price [39][40][41]. These platforms generally provide enormous information on new SNPs from short sequence reads [4,6] but these short sequence reads, at present, can only be assembled into short scaffolds. Genotyping SNPs with the advent of ultra-high genotyping platforms [42] will allow for LODE to integrate these short sequence scaffolds into the existing map information.

Conclusion
The LODE procedure described in this study is an efficient and accurate procedure for positioning SNPs, and offers a means for positioning of unordered scaffolds containing SNPs. The LODE procedure will be helpful in refining genome sequence and checking assemblies, especially those being created from next-generation sequencing where high-throughput SNP discovery and genotyping populations are components of genome analysis.