Identification of polymorphic inversions from genotypes

Background Polymorphic inversions are a source of genetic variability with a direct impact on recombination frequencies. Given the difficulty of their experimental study, computational methods have been developed to infer their existence in a large number of individuals using genome-wide data of nucleotide variation. Methods based on haplotype tagging of known inversions attempt to classify individuals as having a normal or inverted allele. Other methods that measure differences between linkage disequilibrium attempt to identify regions with inversions but unable to classify subjects accurately, an essential requirement for association studies. Results We present a novel method to both identify polymorphic inversions from genome-wide genotype data and classify individuals as containing a normal or inverted allele. Our method, a generalization of a published method for haplotype data [1], utilizes linkage between groups of SNPs to partition a set of individuals into normal and inverted subpopulations. We employ a sliding window scan to identify regions likely to have an inversion, and accumulation of evidence from neighboring SNPs is used to accurately determine the inversion status of each subject. Further, our approach detects inversions directly from genotype data, thus increasing its usability to current genome-wide association studies (GWAS). Conclusions We demonstrate the accuracy of our method to detect inversions and classify individuals on principled-simulated genotypes, produced by the evolution of an inversion event within a coalescent model [2]. We applied our method to real genotype data from HapMap Phase III to characterize the inversion status of two known inversions within the regions 17q21 and 8p23 across 1184 individuals. Finally, we scan the full genomes of the European Origin (CEU) and Yoruba (YRI) HapMap samples. We find population-based evidence for 9 out of 15 well-established autosomic inversions, and for 52 regions previously predicted by independent experimental methods in ten (9+1) individuals [3,4]. We provide efficient implementations of both genotype and haplotype methods as a unified R package inveRsion.


Background
Inversions have long been known to play an important role in chromosomal evolution [5]. Indeed, large inversions are thought to contribute to speciation through reproductive isolation caused by reduced recombination between normal and inverted chromosomes [6]. Inversions of a wide range of sizes are abundant in mammalian lineages [7]. Additionally, extensive study of inversions in Drosophila revealed that inversions can leave genetic signatures, such as reduced nucleotide variation, within the inverted region [8][9][10][11]. More recently, many polymorphic inversions have also been found in humans [3,4,12]. A number of these have functional consequences; polymorphic inversions have been associated with genetic disorders [13], complex disorders such as asthma [14] and even positive selection [15].
Recent resequencing efforts of many human genomes continue to reveal the prevalence of structural variation in humans [16]. However, despite decreases in the cost of sequencing, genotyping microarrays remains the most cost-effective technology for analyzing entire genomes on thousands of individuals. Moreover, inversions have been traditionally difficult to study using experimental techniques. The typical presence of large inverted repeats at the breakpoints is a major challenge for their detection even with current next-generation sequencing techniques. While only few inversions (~15) have been experimentally validated [11,13,15,[17][18][19][20][21][22], a few studies have scanned the whole genome providing experimental evidence for a number of candidate regions. For example, Levy and colleagues determined inverted regions by the whole genome assembly of one subject [3], and Kidd el al. used fosmid paired-end mapping in nine individuals [4]. As such, the ability to accurately predict new inversions and infer their status on large number of subjects would provide a valuable tool for clinical and evolutionary studies of human populations.
Previous studies have used haplotype tagging to indirectly infer which chromosomes are most likely to have an inversion, assuming recombination suppression in inverted heterozygous. For instance, Steffanson et al. [15] showed that an inversion within 17q21 in the European population can be tagged with two different haplotype groups (H1/H2), each related to a polymorphic variant of the MAPT gene. Although haplotype tagging is performed on large groups, it is suitable for regions known to have inversions and known to exhibit divergence between the two arrangements. Three studies [13,23,24], for instance, identified two haplotype groups within a region containing the known 17q21 and 8p23 inversions, and then experimentally validated the tagging on selected samples of the HapMap population. Taken together, this small group of subjects can be used to validate newly developed methods that determine the status of inversions in individuals.
Using a different approach, based on differences in linkage between groups of SNPs, two other methods have been developed to discover the presence of inversions across the genome [1,25]. Bansal et al. [25] used differences in linkage disequilibrium (LD) to determine regions likely to be inverted. However, their method requires the human reference to contain the minor allele, and does not predict which chromosomes in the population are most likely to have the inversion; a factor that is essential for association studies. More recently, Sindi and Raphael [1] developed a probabilistic method that models the population as a mixture of normal and inverted haplotypes, and thus had increased power to detect inversions of lower frequency. Although their method accurately predicts inversion frequencies, it did not yield an accurate classification of individuals into normal and inverted subpopulations. The computationally intensive search of both methods have, in particular, failed to identify inversions like the one within 17q21 in the CEU population, for which a clear extended LD has been shown [13]. In addition, both methods were developed for haplotype (phased) data only.
One way to analyze genotype data, using this method, is to phase the entire genome and then apply it to the resulting haplotypes. This procedure is computationally demanding. For instance, it has been reported that compiled software like fastPHASE [26] can take up to 9 h to analyze 60 subjects in a 41,018 SNPs chromosome (3-GHz Xeon processor with 1 GB). Therefore, a method that directly analyzes genotypes, incorporating the limited phasing required by inversion detection, can substantially reduce this computational load and allow the complete implementation of the methodology in a single software tool to be used in standard up-to-date machines.
In this work, we propose a new methodology to (1) efficiently detect inversions across the genome by directly using genotype data and (2) accurately classify individuals in the population according to inversion status. In addition, our generalization of the inversion model for haplotypes [1], as a computational technique, allows us to treat the different problem of phasing haplotype blocks separated at any distance. Our new application of the inversion model, within the analysis of polymorphic inversion from genotypes, increases the applicability of the method to current GWAS, and enhances its usability by a higher computational efficiency. We provide an efficient implementation of the novel analysis of genotypes, and the new classification and search methods in the R package inveRsion, freely available through Bioconductor [27] (http://www.bioconductor.org/packages/devel/bioc/html/inveRsion.html). We also include a computationally improved version of the previous haplotype model, and the use of the Bayesian Information Criterion (BIC) to gather statistical evidence from neighboring regions.
Both prior LD studies [1,25] tested their methods by constructing "artificial" inversions by reversing the order of SNPs in phased haplotypes from HapMap. We provide a more rigorous test of our method by employing a recent software tool, invertFREGENE [2], that utilizes coalescent theory and suppression of recombination between inverted and normal chromosomes to produce artificial haplotype and genotype data. Lastly, we apply our method to Hap-Map Phase III data, where we compare the analysis of genotypes with that for haplotypes, assess the our classification accuracy in two validated inversions and search the whole genome for inversion signals.

The Inversion Model
The inversion model was first proposed by Sindi et al. 2010 [1] to predict inversion polymorphisms from phased data. Here we give a general formulation that allows us, in addition, to solve the distinct problem of phasing genotype data across inversion breakpoints, and thus reduce the computation load of phasing the whole genome.
We are given m independent observations of four random variables B 1 , B 2 , B 3 , and B 4 , ordered according to location in the reference genome. We are interested in assessing the likelihood of an inversion in the population with one breakpoint between markers B 1 and B 2 and the other breakpoint between markers B 3 and B 4 . In the case an inversion exists, we expect to observe more than one subject i for which the formation of the two pairs (B 1 (x i ), B 2 (x i )) and (B 3 (x i ), B 4 (x i )) is less probable than that of (B 1 (x i ), B 3 (x i )) and (B 2 (x i ), B 4 (x i )), within the population sample (i = 1...m). We denote the joint random variables as In a sample where no individual is inverted, we assume that the probability P fwd (x i |n 1,2 , n 3,4 ) of the observed values B 1 , B 2 , B 3 and B 4 at x i can be factored as the product between the probability of the pairs (B 1 ( where n 1,2 and n 3,4 indicate the respective probabilities of observations for B 1,2 and B 3,4 . The likelihood of all the m independent observations is then On the other hand, in a population where all observations are inverted, the probability P inv (x i |r 1,3 , r 2,4 ) of obtaining the specific values of the x i observation is where, as before, r 1,3 and r 3,4 correspond to the probabilities of observations for B 1,3 and B 2,4 .
In a real population sample, where we expect a mixture of forward and inverted individuals with mixture frequency π. In addition, given appropriate probabilities for all four joint random variables, f 1,2 , f 3,4 for the forward subpopulation and r 1,3 , r 2,4 for the inverted subpopulation, and ω = (f 1,2 , f 3,4 , r 1,3 , r 2,4 ), the probability of observing the values of B 1 , B 2 , B 3 and B 4 for an arbitrary x i is with population sample likelihood An important quantity in predicting the inversion status of an individual is the relative probability that the individual belongs to the forward subpopulation: this is also called the responsibility to the forward model. Similarly, the responsibility of an individual to the inversion model, r 1,i = 1r 0,i .
We test the hypothesis that π = 0 (null model) vs. π >0 (alternative model) with a maximum likelihood approach. We maximize the likelihood of the data under each model; that is, for the null model we use the empirical distribution of allele frequencies for the null model and for the alternative model we select the allele frequencies and inversion frequency π using the Expectation-Maximization (EM) algorithm for mixtures of probability distributions. The decision to reject the null model (no inversion) for the alternative model is based on a model selection test. In this work, we utilize the Bayesian Information Criterion (BIC) to both identify candidate inversions and classify individuals carrying the polymorphic inversions.

Inversion model for genotype data
In the analysis of genotype data, we apply the inversion model in two steps. In the first step, we use the inversion model to phase and pair haplotype blocks around potential breakpoints. This approach allows flexibility to phase two haplotype blocks at any distance, without phasing the whole genome or constantly re-phasing regions, when one of the breakpoints is changed. This use of block phasing is a novel application of the inversion model. The second step is the original application of the model [1] to identify which individuals are likely to have a genetic inversion.
In this work, we define a candidate breakpoint as a pair of consecutive SNPs. A segment tested for an inversion is thus defined by two candidate breakpoints (left and right). We flank the left and right candidate breakpoints by two blocks of N SNPs each, and perform local haplotype phasing with haplo.stats [28]. As a result, an individual i has two haplotypes, L 1 (x i ) and L 2 (x i ) of 2N SNPs each, containing the left breakpoint in the middle, and two 2N SNPs haplotypes R 1 (x i ) and R 2 (x i ), containing the right breakpoint in the middle. The first application of the inversion model is to sort out the chromosome pairing of the containing blocks, that is, we set B 1 = L 1 , B 2 = R 1 , B 3 = R 2 , B 4 = L 2 and m = n s (number of subjects) in the inversion model. The responsibilities for the alternative model tell us if the pairing of (L 1 , R 1 ) in one chromosome and (R 2 , L 2 ) [e.g. (L 2 , R 2 ) ordered by handedness] in the other chromosome is more (less) likely than the pairing (L 1 , R 2 ) and (R 1 , L 2 ) [(L 2 , R 1 )], for each subject in the sample. This first process phases the genotype data between any pair of blocks, surrounding candidate breakpoints.
The second application of the inversion model is to determine the presence of genetic inversions on phased data, as originally proposed in Sindi and Raphael 2010 [1]. In total, we have n c = (2 * n s ) haplotype blocks phased by the former pairing. We first aggregate the data into two variables one for each breakpoint, H L (L j (x i )) corresponds to the haplotype of subject i and chromosome j containing the left breakpoint, and H R (R j (x i )) to the haplotype of subject i and chromosome j containing the right breakpoint. We next split the left haplotype An important consideration in our analysis is whether the initial phasing step itself is confounded by the presence of an inversion. Thus, in the first step of our analysis, we perform two other (more complex) local haplotyping strategies. In the first strategy, we locally haplotype each N-SNP flanking block separately with haplo.stats and then use the inversion model to phase the internal flanking blocks, i.e. those assumed to be within the inversion segment. Finally we use the model again to phase the internal haplotype with each external block. A second alternative strategy is to perform the local haplotyping with haplo.stats once on the surrounding blocks following the forward population and twice following the inverted population, leaving two distinct haplotype data for each population. We compare the accuracy of each phasing strategy in the prediction of inversions on simulated data and contrast them with the accuracy of the prediction for already phased data.

Inversion Detection and Subject Classification
In addition to providing a novel method to analyze genotype data for polymorphic inversions directly, our work aims at improving the efficiency of detecting inversions, and the accuracy of classifying individuals with respect to inversion status.
Previous LD methods [1,25] considered a potential breakpoint between every pair of adjacent SNPs in the genome with sufficient diversity and physical distance between breakpoints. Thus, the total number of segments tested for inversions was very large, O(n!) where n corresponds to the number of candidate breakpoints or SNPs. To reduce the computational load, we decrease the number of regions tested by using a sliding window. We tested our approach on simulated inversions from invertFREGENE [2], and found our model still correctly identifies an inversion if the candidate segment tested is contained within the real inversion sequence and "large enough" compared to the real inversion. Consequently, we scan the chromosome with trial segments of fixed length (window size), as probes for detecting inversions with comparable length. We then reconstruct the true inversion by considering all trial segments with sufficiently large Bayesian Information Criterion (BIC). This simplification massively reduces the number of computations to O(n).
In addition, prior methods for haplotype data were unable to successfully identify individuals carrying the inversion, a necessity for genome association studies. We develop an accurate classification method based on the responsibility of each individual to the normal or inverted subpopulation in overlapping windows. For a fixed window w we determine the responsibility of chromosome i, to the forward model r w 0,i . Sindi and Raphael [1] attempted to classify individuals based on their dominant responsibility, normal if r w 0,i ≥ 0.5 and inverted otherwise; however, this did not yield a useful classifier because the classification from any single instance of the mixture model was poor. We find that combining classifications from adjacent, overlapping windows yields a substantial improvement in classification accuracy with the true inversion. More generally, we estimate the responsibility of individual i to the inversion I by a majority vote overall overlapping segments where BIC exceeds a given threshold BIC > t B : where H is the step Heaviside function, H(z) = 1 when z >0 and 0 otherwise. For haplotype data, we classify a chromosome i as inverted when r I 0,k (t B ) < 0.5 for a particular value of t B . For genotype data, we classify each individual as homozygous for the reference orientation, heterozygous for the inversion or homozygous for the inversion, depending on the classification obtained from the responsibilities of both chromosomes. We define the classification accuracy of our method as the number of individuals with correctly identified inversion status (i.e., fraction of true positives and true negatives). In the simulation results below, we compare the classification accuracy of our method to the true known categorization of individuals as function of t B . In practice, our classification accuracy increases with t B .

Datasets
We studied the power of our method, implemented as the R package inveRsion, to detect inversions on both haplotype and genotype data by analyzing previously known and simulated polymorphic inversions.
First, we applied our method to simulated haplotypes and genotypes to asses our performance at both detecting inversions and classifying individuals into normal and inverted subpopulations. Using recently developed software, invertFREGENE [2], we simulated inversions of varying lengths (0.1 -1.0 Mb) and varying frequencies π = 10 -90% in a populations of 1,000 individuals (2,000 chromosomes). The software provided haplotypes, genotypes and inversion status of each chromosome in the population. Parameters used in running the software, such as number of generations to achieve equilibrium, and rates of recombination and mutation, were taken without change from [2]. In particular, the segments simulated by invertFREGNE have a median fraction of heterozygous of 0.34 and overall SNP density of 0.33 Kbase/SNP on a 2 Mb segment.
To check the performance of our method on real data, we used the SNP data from HapMap phase III (http:// www.hapmap.org), totalling 1.5 M SNPs on 1184 individuals from 11 different populations. As a first step, we analyzed both haplotype and CEU and YRI populations from chromosomes 16 and 17, where well characterized polymorphic inversions have been found. Next, we performed a region of interest analysis, where we did a detailed search around two previously identified inversions (17q21 and 8p23), and classified subjects for all 11 HapMap III populations by predicted inversion status. We then validated the inversion genotypes on previously reported individuals. Finally, for the more demanding task of scanning the whole genome, we used the SNP genotype data of the joint CEU+YRI population (201 subjects). Significant regions were compared with a set of previously reported inversions. We used a first set of 15 well characterized and experimentally validated inversions, and then a second set of putative inversions predicted by whole genome assembly comparison [3] or paired-end mapping experiments [4].

Implementation
Some technical considerations were made in the implementation of the previously described models in the inveRsion R package. In analyzing genotypes, we removed SNPs with missing values for more than 10% of the individuals. For the remaining SNPs, missing values were imputed during local phasing with haplo. stats, we selected haplotypes with the highest posterior probability at each point, and did not considered the propagation of this error into the inversion model. To further limit the regions tested, we only considered breakpoints between SNPs where at least one had minor allele frequency ≥ 10%.
For both mixture models, we used the empirical distribution of allele frequencies as the initial condition for the EM algorithm. We conjectured that this choice would be appropriate to locate the global minimum, and verified this conjecture on simulated inversions by running the EM algorithm with randomly chosen initial conditions.

Simulated data Characterization of a single inversion
We first ran our inversion model for the genotype data produced by invertFREGENE [2] on one inversion. Using a fixed window size of 0.4 Mb, and 5 SNPs flanking each potential breakpoint, inveRsion successfully detected the true inversion between 0.75 -1.25 Mb, see Figure 1. As expected, many segments overlapping the true inversion favor the alternative (inversion) model over null model.
Because we know which chromosomes carry the inversion, we directly assess our classification accuracy (i.e., fraction of individuals with correctly identified inversion status). We show the median classification accuracy as function of t B (equation 7) for a run of 30 invertFREGENE simulations. Note that classifications for haplotype data are at the chromosome level and for genotype data are at the subject level. For some cases, using low values of t B our method identified extra or spurious regions of interest; however, in general as t B increased, the classification improved.
In Figure 2, we show the median classification accuracy for windows overlapping the inversion simulated in Figure 1. For each candidate window, we determined the responsibility of each individual, r w 0,k , and computed the classification of individuals for increasing values of the BIC threshold t B . For both haplotype and genotype data, our ability to successfully classify individuals improves with increasing threshold t B . Reduced accuracy for genotype data, compared to haplotype data, is expected because of errors accumulating from local phasing and incorrect pairing of local haplotypes. In this example, for haplotype data, high accuracy is quickly obtained with perfect classification for BIC~500. For genotype data we show the results of three local phasing strategies. We see that phasings based on the forward population and internal flanking blocks have comparable accuracies which are optimized at BIC~1700. The strategy of haplotyping the forward and inverted populations separately can be highly accurate (0.9) but is consistently lower than the other two. This can be due to the increased amount of computation this strategy requires. Overall, we found no accuracy improvement over our initial simplified phasing strategy.
Decreased accuracies for both genotype and haplotype data for the highest t B values may be attributed to fewer observations. Indeed, the decay of accuracy at highest t B depicts the classification difficulties from prior methods [1] that used only the single most likely predicted region to classify individuals. Overall, our weighted method of classification based on BIC values is more robust than selecting a classification based on a single segment with likelihood ratio of lowest p-value.

Classification and detection performance
We first analyzed the ability of our model to determine simulated inversions on haplotype data with five different SNP densities, matching some of those in current Illumina chip technologies. Figure 3 shows the classification accuracy of inversion status for the inversions specified in the previous sections. We can see that our method performs well at all the densities studied, with more accuracy at higher densities. At the level of the 500 k arrays we can still achieve high chromosome classification accuracy.
To assess the performance of our method to detect inversions and classify individuals in genotype data, we simulated samples varying the length of the inversion and its population frequency. We ran invertFREGENE 50 times for each condition, and computed classification accuracy for different t B values (Figure 4), segmental sensitivity of detection (percentage of the true segment identified -Additional file 1, Figure S1) and segmental false discovery rate (percentage of the segment identified that is not part of the true segment -Additional file 1, Figure S2). We used a window size of 60% of the inversion length and flanking blocks of 5 SNPs (N = 5). A total a 50(cases) × 5(lengths) × 9(frequencies) = 2250 simulations were run. Figure 4 shows the average accuracy of each subject's inversion genotype classification for 5 different lengths, and a population frequency of 60%. At this frequency, high accuracies >0.90 are achieved for lengths greater than 0.25 Mb when t B is large enough. In particular, our method is most accurate for larger inversions.
We also studied the classification accuracy at detecting inversions of all frequencies in relation to the age of the inversion, as treated in invertFREGENE. Figure 5 shows the age of the inversion as function of maximum accuracy across t B for each simulated case. We see that our method achieves highest accuracy for larger lengths and at a wide range of frequencies. Our method is less accurate at identifying older inversions with frequencies (>80%). Particularly, we observe that inversions at 90% frequency heavily cluster in a low predictive range, indicating a critical phenomena in the simulations for frequency increments at high values.

Haplotypes
Genotypes: Phasing flanking blocks based on forward population ordering Genotypes: Phasing internal flanking blocks first, then external blocks Genotypes: Phasing with forward and inverted ordering independently Additional file 1, Figures S1 and S2, illustrates the segmental sensitivity and false discovery rate (FDR) for these simulations, showing the ability of the method to accurately identify the region containing the inversion. We see that while we have high sensitivity in a large range of BIC thresholds (>0), the FDR decreases with increasing t B . Therefore, all the inversion lengths studied are detectable, and higher accuracy is achieved for inversions >0.5Mb. To assess false inversion detection, we ran simulations (250 cases) with no inversions and scanned them with three different window sizes (0.2, 0.4 and 0.6 Mb). In each case we identified the top 2% t B quantiles for which at least one region of interest (overlapping windows) was found. Histograms of for all three scans are shown in Additional file 1, Figure S3. In particular, we find that, in samples of 1000 subjects, an FDR of 0.05 is achieved at t B = 286 for scans of 0.6Mb, t B = 718 for 0.4Mb and t b = 2400 for 0.2Mb.

HapMap data Genotype and haplotype (phased) data
We analyzed the CEU and YRI populations from Hap-Map Phase III. We initially performed a scan on chromosome 16 for the CEU, and chromosome 17 for the CEU and YRI populations, to compare the results obtained for genotype and haplotype (phased) data. We   used two types of genotype data; the original genotypes before phasing and reconstructed genotypes from the phased haplotypes, with the encoding of 0, 1 and 2. We examined chromosomes 16 and 17 with two window sizes (0.4 Mb and 0.7 Mb), and used a BIC threshold of t B = 50. We computed the inversion frequency of the estimated inversion from the majority vote for each chromosome in the sample. Table 1 shows the results for the phased haplotype and genotype data for chromosome 16 of the CEU population. For the haplotype case, we find evidence on a experimentally validated inversion~28.2 -28.8 Mb [22], which is lost in the analysis of the genotype data.
However, notice that on each dataset our method predicts an inversion with breakpoints at roughly~34 -35 Mb that closely match an inversion reported by pairedend mapping in the CEU population at 16p11.2 -16p11.1 [4]. We achieved consistent results when analyzing the genotypes that were reconstructed from the reported haplotypes (Additional file 1, Table S1). This result endorses further experimental characterization of the region.
We analyzed chromosome 17 for both the CEU and YRI populations ( Table 2)   similar to 3 of the 4 predictions in the CEU population, but with fewer overlapping sliding windows and different predicted inversion frequencies. Importantly, the remaining inversion predicted in the CEU population corresponds to a known inversion at 17q21, ≈ 40.4 -42.4 Mb with inversion frequency of 20-25% in mixed European populations, and absent from the YRI HapMap sample [15]. To ensure the specificity of this predicted inversion to the CEU population, we reanalyzed the region 40 -43 Mb in greater detail by considering all possible segments with length ≥ 0.7Mb. As expected, we did not find an inversion in the YRI population, but did obtain improved information on the inversion location in the CEU popula-  [1,25].

Subject classification for inversions within 17q21 and 8p23
To accurately classify individuals according to inversion status for two known inversions (17q21 and 8p23) we  Figure S4, shows that windows favoring the inversion model (BIC >0) flank gaps in the SNP distribution. These gaps correspond to the segmental duplications which have been suggested to give rise to the inversion sequence that sits between them. We computed the classification of each chromosome in the sample with a BIC >600 that was the minimum BIC for which the classification was optimal for the subjects with known inversion status [13]. We correctly classified all 24 samples with known inversion status (7-CEU, 8-YRI, 4-CHB, 4-JPT) (Additional file 1, Table S2). That is, CEU subjects NA10847, NA12156, NA12813 were reported as heterozygous for the inversion and all other subjects as non-inverted homozygous. The inversion status of subjects from all 11 populations is shown in Table 3. We see that the chromosomal incidence of the inversion for the CEU in the joint population analysis is 26% consistent with the single population analysis. Finally, the cross tabulation in Table 4 between the inversion and the "rs1800547" SNP genotypes further validates the tagging of our inversion inference with this SNP (accuracy >0.99), as expected from their known relation to the H1/H2 haplotypes [15].
We next studied the more challenging classification of the recurrent inversion at 8p23 [13], for which no significant signal was obtained from the whole genome scan on the CEU+YRI sample. We performed an detailed search between 6-13 Mb of chromosome 8, and detected a signal (Maximum BIC = 345.349) within the inversion sequence. The region identified was between Left Breakpoint=(8.804291 Mb, 8.98084 Mb) and Right Breakpoint=(10.80516 Mb, 10.98203 Mb), see Additional file 1, Figure S5. We compared our classification with the true inversion status reported for 41 of the individuals from 4 of the 11 HapMap populations. (In Table 1 Inverted sequences found in chromosome 16 of the CEU population using haplotype (phased) and genotype data  One of the CEU predictions corresponds to a known inversion at 17q21, and was also found with window size 0.7Mb (see main text). Consistent with experimental evidence this is a CEU specific inversion, no inversions where predicted in the YRI population in this region with any window size.
Additional file 1, Table S3). We found that while the exact match of the subject genotypes is (70%), the subject-wise identification of an inversion presence is 88%. Population-wise, we see that the CEU are the best predicted with 91% accuracy, while the YRI sample has only 36% accuracy. In the experimental studies all 11 YRI subjects have an inversion, 7 of which are inversion heterozygous. None of the 4 JPT subjects in the experimental validation group have an inversion, and inveRsion classify them accordingly. However, we do not correctly identify any of the 6 CHB subjects carrying the inversion.

Genome-wide scan
Lastly, we scanned the whole genome of the combined CEU and YRI populations of HapMap III with three window sizes (1 Mb, 0.7 Mb and 0.4 Mb) ( Figure 6). Across all chromosomes, we found 263 candidate regions for inversions from overlapping windows with t B = 0. In Additional file 1, Table S4, we report the outer limits of the left and right break points of the inversions found. A number of inversions detected with window size of 1 Mb are also detected with lower sizes. Although the median size of inversion reported by Kidd and colleagues 2008 is 0.1 MB, 23% of those segments are at least above 0.4 Mb, and thus are potentially detectable with our method. We find a total of 52 inversions (20% of our reported regions) that overlap either those found by Kidd et al. [4] on nine subjects or by Levy and colleagues [3] on one subject, providing additional evidence of their existence. A higher proportion, 89 inversions (33%) are at least 1 MB distance from a reported inversion. Table 5 lists the 15 autosomal inversions that have been experimentally validated, 8 of which are detected in our genome-wide scan on genotype data. As mentioned above, the inversion on chromosome 16 reported by Martin et al. 2004 [22] was found on the analysis of haplotype data only. We note these inversions were previously identified in specific individuals, and our current studies now provide population-based evidence to these inversions. In particular, we fully recovered the inversions in the 17q21 and 15q24 regions and found candidate regions within 3 Mb distance from those within the 3q29, 5q13.3, 17q21.3 regions. We found independent evidence for each break point of the chromosome 10 inversion, and for the right break point of the inversion in 15q11. As we mentioned in the previous section, the evidence on 8p23 results from an extensive search on the region and not from a genome-wide scan.

Discussion
We proposed and implemented a method for detection of polymorphic inversions and classification of individuals into normal and inverted sub-populations for both phased haplotype and genotype data. We demonstrated the ability of our method to successfully detect inversions and classify individuals on simulations and showed how the performance of the method is impacted by the length and frequency of the inversion, SNP density, and the value of t B , the BIC threshold. Our findings demonstrate that genotype data can be used to detect inversions, although power for detection remains greater for haplotype data. Other approaches can be developed to perform inversion detection simultaneously with local haplotyping. In this work, we opted for a two-step strategy, which allows us to directly compare the impact of local haplotyping to inversion finding. As shown in Figure 2, the applicability of our approach is demonstrated The H1 and H2 haplotypes have been traditionally associated to the inversion status in the 17q21 inversion. From those haplotypes the SNP "rs1800547" from the MAPT gene has been used to tag the inversion in 17q21 [15]. As expected, our classification of individuals from all 11 HapMap populations agrees well with the their status for this SNP. Table 3 Frequency of the inversion in 17q21 across all HapMap III populations   genInv  ASW  CEU  CHB  CHD  GIH  JPT  LWK  MEX  MKK  TSI  YRI   Hom  81  59  100  100  82  99  100  73  89  44  100   invHet  18  41  0  0  17  1  0  23  9  by our performance tests showing, in particular, the small drop in accuracy we pay for treating genotype data, as compared to haplotype data. Our formulation has several advantages compared to previously published methods to detect polymorphic inversions from nucleotide variation data [1,25]. Our work represents the first method capable of analyzing both genotype and haplotype data. Importantly, ours is the first LD based method to accurately classify individuals into normal and inverted subpopulations. By leveraging information from adjacent windows, our method is able to partition individuals into normal and inverted categories with high accuracy. Finally, our implementation is computationally efficient. Our use of a fixed size sliding-window to analyze chromosomes greatly improves upon the search efficiency of previous methods. In addition, a relevant computational improvement is achieved by the local phasing procedure of the genotype model. We tested the local phasing step of our algorithm on 60 CEUs (HapMap phase III) in a section of the chromosome 7 (0, 85 MB) with 42,379 SNPs. Using a 2.33 GHz intel Core 2 Duo, 2 GB RAM the  Table S4.
analysis took only 40 minutes, which is~13 times faster than phasing the whole segment with fastPhase. This shows that the load of a complete phasing of the data can be greatly reduced in the detection of genetic inversions. The overall efficiency of our method is an attractive feature for the analysis of already available data. We illustrated the high accuracy (≥ 0.9) of our method over a wide range of inversion frequencies in principledsimulated data ( Figure 4). We had difficulties successfully classifying individuals from invertFREGENE populations with inversion frequencies >80%. Note that in this scenario, high frequency inversions are affected by the age of the inversion and not by a simple change of reference genome. Within invertFREGENE, high frequency inversions tend to be older inversions, which may then present a higher within-population variability that can impact our predictions. However, the sudden drop of accuracy at high frequencies seems to indicate an additional critical behavior of the simulations at this range. Thus, in relation to the age of the inversion, the parameters used to run invertFREGENE may have not been optimized for this scenario.
In relation to real inversions, we show that the inveRsion package can be used to identify the inversion genotypes of subjects within the region 17q21. The high accuracies achieved for this inversion and the invertFRE-GENE simulations suggest that our method is most powerful when haplotype divergence is present. We also studied the inversion within 8p23. Despite of not detecting an inversion in this region in a genome-wide scan; with an extensive search, we were able to detect a signal (a third of that for the 17q21 region) and classify the subjects of the CEU population to a high degree.
Consistently with Bosh et al. 2009, we saw that the most accurate predictions are made for the CEU population only. The 8p23 region has shown a number of complexities that might affect our inference to all HapMap groups. It has been shown that the region is unstable, showing mosiac rearrangements in its vicinity [24,29]. In addition, there is a lack of long distance LD, expected for two divergent haplotypes associated to each inverted status. Antonacci and colleagues suggested that this can be due to multiple inversion events. In addition, Deng et al. 2008 showed that the CEU population has an extended linkage disequilibrium in the region greater than that for the CBH+JPT and YRI populations. It is possible that our method particularly favors inversions with less variability within populations. Greater variability of populations within the inverted region could reduce our classification power because large groups of identical haoplotypes are harder to form. Further improvements of our method in this direction could improve performance.
An important practical consideration when using inveRsion is the threshold BIC value used to perform inversion genotyping. Larger values of BIC ( Figure 2) do not always mean higher accuracy. As above, we recommend careful benchmarking of threshold BIC values for assessing subject classification. In the case of inferences in large populations and when experimental data is available on few subjects, a suitable threshold can be set for the classification that maximizes the inference on those subjects. We use this approach in our analysis of the 17q21 region, where we conducted a detailed search in a region known to have the inversion. When experimental data is not available, the stability of inversion allele for each subject could be studied for different BIC thresholds, drawing more confidence from those individuals that are consistently classified. On a genome scale, initial scans can be conducted on several window sizes to identify regions of interest with positive BIC values where, as suggested by our simulations, we have high sensitivity. In our whole genome search for the CEU+YRI populations, we found recurrent inversions at different window sizes, which increases the confidence in the findings. This shows that using only three windows sizes, 0.4 -0.7 -1 Mb, can detect inversions from 0.4 Mb to~2 Mb with enrichment for detecting inversions with length 0.7 -1.4 Mb since more than one window have power to detect inversions in this range. The identification of recurrent inversions with different window sizes could also be used as an additional check to reduce false discoveries. In addition, these results suggest that scans with a few window sizes and as large a value of L as computationally feasible should be sufficient for a complete search for inversion signals in a given chromosome. Note that the window size is a probe for the detection of the inversion, thus it should be considered a function of the length of inversions to be detected. The accuracy of our method increases with higher SNP density. At the 500 k array density, our method can still achieve high classification accuracy. However, due to the high variability of the accuracy estimates, we recommend using imputed data in this case. To have a more accurate measure of the subject classification and length of the inversion, an extensive search for all possible lengths around the regions of interest can then be conducted.
Experimental information about the existence of inversions and their status in a sample of subjects is limited, constraining an exhaustive evaluation of the true positive and negative rates of our methodology with validated inversions. Inversions are difficult to characterize even with the newest sequencing data. In the study of structural variants in 1000 genomes, Durbin et al. 2010 [16] did not report inversions. Large inverted repeats at the breakpoints is a current obstacle for inversion detection using next generation sequencing data, in addition to the computational capacity to analyze hundreds of individuals. However, when inversion detection algorithms for sequence data are available, a necessary future evaluation of our methodology should be performed against this new technology.

Conclusions
In this article, we have presented a methodology to detect inversion polymorphisms using nucleotide variation data from SNP micro-arrays. The method, based on LD differences across break points, is able to classify individuals into their inversion genotype status, allowing the gathering of population-based evidence of known inversions. In addition, using a computationally efficient scan across the genome, it provides evidence on new candidate regions that might be present in the population from which genotyped data is collected.