Identification of polymorphic inversions from genotypes
- Alejandro Cáceres†1Email author,
- Suzanne S Sindi†3,
- Benjamin J Raphael4,
- Mario Cáceres5 and
- Juan R González1, 2Email author
© Cáceres et al; licensee BioMed Central Ltd. 2012
Received: 26 September 2011
Accepted: 9 February 2012
Published: 9 February 2012
Polymorphic inversions are a source of genetic variability with a direct impact on recombination frequencies. Given the difficulty of their experimental study, computational methods have been developed to infer their existence in a large number of individuals using genome-wide data of nucleotide variation. Methods based on haplotype tagging of known inversions attempt to classify individuals as having a normal or inverted allele. Other methods that measure differences between linkage disequilibrium attempt to identify regions with inversions but unable to classify subjects accurately, an essential requirement for association studies.
We present a novel method to both identify polymorphic inversions from genome-wide genotype data and classify individuals as containing a normal or inverted allele. Our method, a generalization of a published method for haplotype data , utilizes linkage between groups of SNPs to partition a set of individuals into normal and inverted subpopulations. We employ a sliding window scan to identify regions likely to have an inversion, and accumulation of evidence from neighboring SNPs is used to accurately determine the inversion status of each subject. Further, our approach detects inversions directly from genotype data, thus increasing its usability to current genome-wide association studies (GWAS).
We demonstrate the accuracy of our method to detect inversions and classify individuals on principled-simulated genotypes, produced by the evolution of an inversion event within a coalescent model . We applied our method to real genotype data from HapMap Phase III to characterize the inversion status of two known inversions within the regions 17q21 and 8p23 across 1184 individuals. Finally, we scan the full genomes of the European Origin (CEU) and Yoruba (YRI) HapMap samples. We find population-based evidence for 9 out of 15 well-established autosomic inversions, and for 52 regions previously predicted by independent experimental methods in ten (9+1) individuals [3, 4]. We provide efficient implementations of both genotype and haplotype methods as a unified R package inveRsion.
Inversions have long been known to play an important role in chromosomal evolution . Indeed, large inversions are thought to contribute to speciation through reproductive isolation caused by reduced recombination between normal and inverted chromosomes . Inversions of a wide range of sizes are abundant in mammalian lineages . Additionally, extensive study of inversions in Drosophila revealed that inversions can leave genetic signatures, such as reduced nucleotide variation, within the inverted region [8–11]. More recently, many polymorphic inversions have also been found in humans [3, 4, 12]. A number of these have functional consequences; polymorphic inversions have been associated with genetic disorders , complex disorders such as asthma  and even positive selection .
Recent resequencing efforts of many human genomes continue to reveal the prevalence of structural variation in humans . However, despite decreases in the cost of sequencing, genotyping microarrays remains the most cost-effective technology for analyzing entire genomes on thousands of individuals. Moreover, inversions have been traditionally difficult to study using experimental techniques. The typical presence of large inverted repeats at the breakpoints is a major challenge for their detection even with current next-generation sequencing techniques. While only few inversions (~ 15) have been experimentally validated [11, 13, 15, 17–22], a few studies have scanned the whole genome providing experimental evidence for a number of candidate regions. For example, Levy and colleagues determined inverted regions by the whole genome assembly of one subject , and Kidd el al. used fosmid paired-end mapping in nine individuals . As such, the ability to accurately predict new inversions and infer their status on large number of subjects would provide a valuable tool for clinical and evolutionary studies of human populations.
Previous studies have used haplotype tagging to indirectly infer which chromosomes are most likely to have an inversion, assuming recombination suppression in inverted heterozygous. For instance, Steffanson et al.  showed that an inversion within 17q21 in the European population can be tagged with two different haplotype groups (H1/H2), each related to a polymorphic variant of the MAPT gene. Although haplotype tagging is performed on large groups, it is suitable for regions known to have inversions and known to exhibit divergence between the two arrangements. Three studies [13, 23, 24], for instance, identified two haplotype groups within a region containing the known 17q21 and 8p23 inversions, and then experimentally validated the tagging on selected samples of the HapMap population. Taken together, this small group of subjects can be used to validate newly developed methods that determine the status of inversions in individuals.
Using a different approach, based on differences in linkage between groups of SNPs, two other methods have been developed to discover the presence of inversions across the genome [1, 25]. Bansal et al.  used differences in linkage disequilibrium (LD) to determine regions likely to be inverted. However, their method requires the human reference to contain the minor allele, and does not predict which chromosomes in the population are most likely to have the inversion; a factor that is essential for association studies. More recently, Sindi and Raphael  developed a probabilistic method that models the population as a mixture of normal and inverted haplotypes, and thus had increased power to detect inversions of lower frequency. Although their method accurately predicts inversion frequencies, it did not yield an accurate classification of individuals into normal and inverted subpopulations. The computationally intensive search of both methods have, in particular, failed to identify inversions like the one within 17q21 in the CEU population, for which a clear extended LD has been shown . In addition, both methods were developed for haplotype (phased) data only.
One way to analyze genotype data, using this method, is to phase the entire genome and then apply it to the resulting haplotypes. This procedure is computationally demanding. For instance, it has been reported that compiled software like fastPHASE  can take up to 9 h to analyze 60 subjects in a 41,018 SNPs chromosome (3-GHz Xeon processor with 1 GB). Therefore, a method that directly analyzes genotypes, incorporating the limited phasing required by inversion detection, can substantially reduce this computational load and allow the complete implementation of the methodology in a single software tool to be used in standard up-to-date machines.
In this work, we propose a new methodology to (1) efficiently detect inversions across the genome by directly using genotype data and (2) accurately classify individuals in the population according to inversion status. In addition, our generalization of the inversion model for haplotypes , as a computational technique, allows us to treat the different problem of phasing haplotype blocks separated at any distance. Our new application of the inversion model, within the analysis of polymorphic inversion from genotypes, increases the applicability of the method to current GWAS, and enhances its usability by a higher computational efficiency. We provide an efficient implementation of the novel analysis of genotypes, and the new classification and search methods in the R package inveRsion, freely available through Bioconductor  (http://www.bioconductor.org/packages/devel/bioc/html/inveRsion.html). We also include a computationally improved version of the previous haplotype model, and the use of the Bayesian Information Criterion (BIC) to gather statistical evidence from neighboring regions.
Both prior LD studies [1, 25] tested their methods by constructing "artificial" inversions by reversing the order of SNPs in phased haplotypes from HapMap. We provide a more rigorous test of our method by employing a recent software tool, invertFREGENE , that utilizes coalescent theory and suppression of recombination between inverted and normal chromosomes to produce artificial haplotype and genotype data. Lastly, we apply our method to HapMap Phase III data, where we compare the analysis of genotypes with that for haplotypes, assess the our classification accuracy in two validated inversions and search the whole genome for inversion signals.
The Inversion Model
The inversion model was first proposed by Sindi et al. 2010  to predict inversion polymorphisms from phased data. Here we give a general formulation that allows us, in addition, to solve the distinct problem of phasing genotype data across inversion breakpoints, and thus reduce the computation load of phasing the whole genome.
We are given m independent observations of four random variables B1, B2, B3, and B4, ordered according to location in the reference genome. We are interested in assessing the likelihood of an inversion in the population with one breakpoint between markers B1 and B2 and the other breakpoint between markers B3 and B4. In the case an inversion exists, we expect to observe more than one subject i for which the formation of the two pairs (B1(x i ), B 2 (x i )) and (B 3 (x i ), B4(x i )) is less probable than that of (B1(x i ), B 3 (x i )) and (B 2 (x i ), B4(x i )), within the population sample (i = 1...m). We denote the joint random variables as B1,2(x i ), B3,4(x i ), B1,3(x i ) and B2,4(x i ) respectively.
where, as before, r1,3 and r3,4 correspond to the probabilities of observations for B1,3 and B2,4.
this is also called the responsibility to the forward model. Similarly, the responsibility of an individual to the inversion model, r1,i= 1 - r0,i.
We test the hypothesis that π = 0 (null model) vs. π > 0 (alternative model) with a maximum likelihood approach. We maximize the likelihood of the data under each model; that is, for the null model we use the empirical distribution of allele frequencies for the null model and for the alternative model we select the allele frequencies and inversion frequency π using the Expectation-Maximization (EM) algorithm for mixtures of probability distributions. The decision to reject the null model (no inversion) for the alternative model is based on a model selection test. In this work, we utilize the Bayesian Information Criterion (BIC) to both identify candidate inversions and classify individuals carrying the polymorphic inversions.
Inversion model for genotype data
In the analysis of genotype data, we apply the inversion model in two steps. In the first step, we use the inversion model to phase and pair haplotype blocks around potential breakpoints. This approach allows flexibility to phase two haplotype blocks at any distance, without phasing the whole genome or constantly re-phasing regions, when one of the breakpoints is changed. This use of block phasing is a novel application of the inversion model. The second step is the original application of the model  to identify which individuals are likely to have a genetic inversion.
In this work, we define a candidate breakpoint as a pair of consecutive SNPs. A segment tested for an inversion is thus defined by two candidate breakpoints (left and right). We flank the left and right candidate breakpoints by two blocks of N SNPs each, and perform local haplotype phasing with haplo.stats . As a result, an individual i has two haplotypes, L1(x i ) and L2(x i ) of 2N SNPs each, containing the left breakpoint in the middle, and two 2N SNPs haplotypes R1(x i ) and R2(x i ), containing the right breakpoint in the middle. The first application of the inversion model is to sort out the chromosome pairing of the containing blocks, that is, we set B1 = L1, B2 = R1, B3 = R2, B4 = L2 and m = n s (number of subjects) in the inversion model. The responsibilities for the alternative model tell us if the pairing of (L1, R1) in one chromosome and (R2, L2) [e.g. (L2, R2) ordered by handedness] in the other chromosome is more (less) likely than the pairing (L1, R2) and (R1, L2) [(L2, R1)], for each subject in the sample. This first process phases the genotype data between any pair of blocks, surrounding candidate breakpoints.
The second application of the inversion model is to determine the presence of genetic inversions on phased data, as originally proposed in Sindi and Raphael 2010 . In total, we have n c = (2 * n s ) haplotype blocks phased by the former pairing. We first aggregate the data into two variables one for each breakpoint, H L (L j (x i )) corresponds to the haplotype of subject i and chromosome j containing the left breakpoint, and H R (R j (x i )) to the haplotype of subject i and chromosome j containing the right breakpoint. We next split the left haplotype into two N - SNPs blocks (H LL , H LR ) = H L each of which flank the potential left breakpoint, and (H RL , H RR ) = H R each of which flank the right breakpoint. In the inversion model, we set B1 = H LL , B2 = H LR , B3 = H RL , B4 = H RR and m = n c , as the total number of chromosomes in the sample. The responsibilities of this model now determine for which chromosomes the pairings (H LL , H LR ) and (H RL , H RR ) are more (less) likely than the pairings (H LL , H RL ) and (H LR , H RR ). That is, the model determines which chromosomes are likely to be inverted between the left and right breakpoints considered.
An important consideration in our analysis is whether the initial phasing step itself is confounded by the presence of an inversion. Thus, in the first step of our analysis, we perform two other (more complex) local haplotyping strategies. In the first strategy, we locally haplotype each N-SNP flanking block separately with haplo.stats and then use the inversion model to phase the internal flanking blocks, i.e. those assumed to be within the inversion segment. Finally we use the model again to phase the internal haplotype with each external block. A second alternative strategy is to perform the local haplotyping with haplo.stats once on the surrounding blocks following the forward population and twice following the inverted population, leaving two distinct haplotype data for each population. We compare the accuracy of each phasing strategy in the prediction of inversions on simulated data and contrast them with the accuracy of the prediction for already phased data.
Inversion Detection and Subject Classification
In addition to providing a novel method to analyze genotype data for polymorphic inversions directly, our work aims at improving the efficiency of detecting inversions, and the accuracy of classifying individuals with respect to inversion status.
Previous LD methods [1, 25] considered a potential breakpoint between every pair of adjacent SNPs in the genome with sufficient diversity and physical distance between breakpoints. Thus, the total number of segments tested for inversions was very large, O(n!) where n corresponds to the number of candidate breakpoints or SNPs. To reduce the computational load, we decrease the number of regions tested by using a sliding window. We tested our approach on simulated inversions from invertFREGENE , and found our model still correctly identifies an inversion if the candidate segment tested is contained within the real inversion sequence and "large enough" compared to the real inversion. Consequently, we scan the chromosome with trial segments of fixed length (window size), as probes for detecting inversions with comparable length. We then reconstruct the true inversion by considering all trial segments with sufficiently large Bayesian Information Criterion (BIC). This simplification massively reduces the number of computations to O(n).
where H is the step Heaviside function, H(z) = 1 when z > 0 and 0 otherwise. For haplotype data, we classify a chromosome i as inverted when for a particular value of t B . For genotype data, we classify each individual as homozygous for the reference orientation, heterozygous for the inversion or homozygous for the inversion, depending on the classification obtained from the responsibilities of both chromosomes. We define the classification accuracy of our method as the number of individuals with correctly identified inversion status (i.e., fraction of true positives and true negatives). In the simulation results below, we compare the classification accuracy of our method to the true known categorization of individuals as function of t B . In practice, our classification accuracy increases with t B .
We studied the power of our method, implemented as the R package inveRsion, to detect inversions on both haplotype and genotype data by analyzing previously known and simulated polymorphic inversions.
First, we applied our method to simulated haplotypes and genotypes to asses our performance at both detecting inversions and classifying individuals into normal and inverted subpopulations. Using recently developed software, invertFREGENE , we simulated inversions of varying lengths (0.1 - 1.0 Mb) and varying frequencies π = 10 - 90% in a populations of 1,000 individuals (2,000 chromosomes). The software provided haplotypes, genotypes and inversion status of each chromosome in the population. Parameters used in running the software, such as number of generations to achieve equilibrium, and rates of recombination and mutation, were taken without change from . In particular, the segments simulated by invertFREGNE have a median fraction of heterozygous of 0.34 and overall SNP density of 0.33 Kbase/SNP on a 2 Mb segment.
To check the performance of our method on real data, we used the SNP data from HapMap phase III (http://www.hapmap.org), totalling 1.5 M SNPs on 1184 individuals from 11 different populations. As a first step, we analyzed both haplotype and CEU and YRI populations from chromosomes 16 and 17, where well characterized polymorphic inversions have been found. Next, we performed a region of interest analysis, where we did a detailed search around two previously identified inversions (17q21 and 8p23), and classified subjects for all 11 HapMap III populations by predicted inversion status. We then validated the inversion genotypes on previously reported individuals. Finally, for the more demanding task of scanning the whole genome, we used the SNP genotype data of the joint CEU+YRI population (201 subjects). Significant regions were compared with a set of previously reported inversions. We used a first set of 15 well characterized and experimentally validated inversions, and then a second set of putative inversions predicted by whole genome assembly comparison  or paired-end mapping experiments .
Some technical considerations were made in the implementation of the previously described models in the inveRsion R package. In analyzing genotypes, we removed SNPs with missing values for more than 10% of the individuals. For the remaining SNPs, missing values were imputed during local phasing with haplo.stats, we selected haplotypes with the highest posterior probability at each point, and did not considered the propagation of this error into the inversion model. To further limit the regions tested, we only considered breakpoints between SNPs where at least one had minor allele frequency ≥ 10%.
For both mixture models, we used the empirical distribution of allele frequencies as the initial condition for the EM algorithm. We conjectured that this choice would be appropriate to locate the global minimum, and verified this conjecture on simulated inversions by running the EM algorithm with randomly chosen initial conditions.
Characterization of a single inversion
Because we know which chromosomes carry the inversion, we directly assess our classification accuracy (i.e., fraction of individuals with correctly identified inversion status). We show the median classification accuracy as function of t B (equation 7) for a run of 30 invertFREGENE simulations. Note that classifications for haplotype data are at the chromosome level and for genotype data are at the subject level. For some cases, using low values of t B our method identified extra or spurious regions of interest; however, in general as t B increased, the classification improved.
Decreased accuracies for both genotype and haplotype data for the highest t B values may be attributed to fewer observations. Indeed, the decay of accuracy at highest t B depicts the classification difficulties from prior methods  that used only the single most likely predicted region to classify individuals. Overall, our weighted method of classification based on BIC values is more robust than selecting a classification based on a single segment with likelihood ratio of lowest p-value.
Classification and detection performance
Additional file 1, Figures S1 and S2, illustrates the segmental sensitivity and false discovery rate (FDR) for these simulations, showing the ability of the method to accurately identify the region containing the inversion. We see that while we have high sensitivity in a large range of BIC thresholds (> 0), the FDR decreases with increasing t B . Therefore, all the inversion lengths studied are detectable, and higher accuracy is achieved for inversions > 0.5Mb. To assess false inversion detection, we ran simulations (250 cases) with no inversions and scanned them with three different window sizes (0.2, 0.4 and 0.6 Mb). In each case we identified the top 2% t B quantiles for which at least one region of interest (overlapping windows) was found. Histograms of for all three scans are shown in Additional file 1, Figure S3. In particular, we find that, in samples of 1000 subjects, an FDR of 0.05 is achieved at t B = 286 for scans of 0.6Mb, t B = 718 for 0.4Mb and t b = 2400 for 0.2Mb.
Genotype and haplotype (phased) data
We analyzed the CEU and YRI populations from HapMap Phase III. We initially performed a scan on chromosome 16 for the CEU, and chromosome 17 for the CEU and YRI populations, to compare the results obtained for genotype and haplotype (phased) data. We used two types of genotype data; the original genotypes before phasing and reconstructed genotypes from the phased haplotypes, with the encoding of 0, 1 and 2. We examined chromosomes 16 and 17 with two window sizes (0.4 Mb and 0.7 Mb), and used a BIC threshold of t B = 50. We computed the inversion frequency of the estimated inversion from the majority vote for each chromosome in the sample.
Inverted sequences found in chromosome 16 of the CEU population using haplotype (phased) and genotype data
Predicted inversions on chromosome 17 for the CEU and YRI populations on genotype data (window size 0.4Mb and t B = 50).
Subject classification for inversions within 17q21 and 8p23
To accurately classify individuals according to inversion status for two known inversions (17q21 and 8p23) we preformed extensive searches near reported breakpoints. In both cases, we obtain high accuracy at classifying individuals with known inversion status.
Frequency of the inversion in 17q21 across all HapMap III populations
Tagging of the inversion in 17q21 as detected by inveRsion and SNP "rs1800547"
We next studied the more challenging classification of the recurrent inversion at 8p23 , for which no significant signal was obtained from the whole genome scan on the CEU+YRI sample. We performed an detailed search between 6-13 Mb of chromosome 8, and detected a signal (Maximum BIC = 345.349) within the inversion sequence. The region identified was between Left Breakpoint=(8.804291 Mb, 8.98084 Mb) and Right Breakpoint=(10.80516 Mb, 10.98203 Mb), see Additional file 1, Figure S5. We compared our classification with the true inversion status reported for 41 of the individuals from 4 of the 11 HapMap populations. (In Additional file 1, Table S3). We found that while the exact match of the subject genotypes is (70%), the subject-wise identification of an inversion presence is 88%. Population-wise, we see that the CEU are the best predicted with 91% accuracy, while the YRI sample has only 36% accuracy. In the experimental studies all 11 YRI subjects have an inversion, 7 of which are inversion heterozygous. None of the 4 JPT subjects in the experimental validation group have an inversion, and inveRsion classify them accordingly. However, we do not correctly identify any of the 6 CHB subjects carrying the inversion.
Although the median size of inversion reported by Kidd and colleagues 2008 is 0.1 MB, 23% of those segments are at least above 0.4 Mb, and thus are potentially detectable with our method. We find a total of 52 inversions (20% of our reported regions) that overlap either those found by Kidd et al.  on nine subjects or by Levy and colleagues  on one subject, providing additional evidence of their existence. A higher proportion, 89 inversions (33%) are at least 1 MB distance from a reported inversion.
Experimentally validated inversions
Inv. Size (Mb)
within 3 Mb distance
each BP separately
right BP only
within 1 MB distance
haplotype data only CEU
within 1 Mb distance
We proposed and implemented a method for detection of polymorphic inversions and classification of individuals into normal and inverted sub-populations for both phased haplotype and genotype data. We demonstrated the ability of our method to successfully detect inversions and classify individuals on simulations and showed how the performance of the method is impacted by the length and frequency of the inversion, SNP density, and the value of t B , the BIC threshold. Our findings demonstrate that genotype data can be used to detect inversions, although power for detection remains greater for haplotype data. Other approaches can be developed to perform inversion detection simultaneously with local haplotyping. In this work, we opted for a two-step strategy, which allows us to directly compare the impact of local haplotyping to inversion finding. As shown in Figure 2, the applicability of our approach is demonstrated by our performance tests showing, in particular, the small drop in accuracy we pay for treating genotype data, as compared to haplotype data.
Our formulation has several advantages compared to previously published methods to detect polymorphic inversions from nucleotide variation data [1, 25]. Our work represents the first method capable of analyzing both genotype and haplotype data. Importantly, ours is the first LD based method to accurately classify individuals into normal and inverted subpopulations. By leveraging information from adjacent windows, our method is able to partition individuals into normal and inverted categories with high accuracy. Finally, our implementation is computationally efficient. Our use of a fixed size sliding-window to analyze chromosomes greatly improves upon the search efficiency of previous methods. In addition, a relevant computational improvement is achieved by the local phasing procedure of the genotype model. We tested the local phasing step of our algorithm on 60 CEUs (HapMap phase III) in a section of the chromosome 7 (0, 85 MB) with 42,379 SNPs. Using a 2.33 GHz intel Core 2 Duo, 2 GB RAM the analysis took only 40 minutes, which is ~ 13 times faster than phasing the whole segment with fastPhase. This shows that the load of a complete phasing of the data can be greatly reduced in the detection of genetic inversions. The overall efficiency of our method is an attractive feature for the analysis of already available data.
We illustrated the high accuracy (≥ 0.9) of our method over a wide range of inversion frequencies in principled-simulated data (Figure 4). We had difficulties successfully classifying individuals from invertFREGENE populations with inversion frequencies > 80%. Note that in this scenario, high frequency inversions are affected by the age of the inversion and not by a simple change of reference genome. Within invertFREGENE, high frequency inversions tend to be older inversions, which may then present a higher within-population variability that can impact our predictions. However, the sudden drop of accuracy at high frequencies seems to indicate an additional critical behavior of the simulations at this range. Thus, in relation to the age of the inversion, the parameters used to run invertFREGENE may have not been optimized for this scenario.
In relation to real inversions, we show that the inveRsion package can be used to identify the inversion genotypes of subjects within the region 17q21. The high accuracies achieved for this inversion and the invertFREGENE simulations suggest that our method is most powerful when haplotype divergence is present. We also studied the inversion within 8p23. Despite of not detecting an inversion in this region in a genome-wide scan; with an extensive search, we were able to detect a signal (a third of that for the 17q21 region) and classify the subjects of the CEU population to a high degree. Consistently with Bosh et al. 2009, we saw that the most accurate predictions are made for the CEU population only. The 8p23 region has shown a number of complexities that might affect our inference to all HapMap groups. It has been shown that the region is unstable, showing mosiac rearrangements in its vicinity [24, 29]. In addition, there is a lack of long distance LD, expected for two divergent haplotypes associated to each inverted status. Antonacci and colleagues suggested that this can be due to multiple inversion events. In addition, Deng et al. 2008 showed that the CEU population has an extended linkage disequilibrium in the region greater than that for the CBH+JPT and YRI populations. It is possible that our method particularly favors inversions with less variability within populations. Greater variability of populations within the inverted region could reduce our classification power because large groups of identical haoplotypes are harder to form. Further improvements of our method in this direction could improve performance.
An important practical consideration when using inveRsion is the threshold BIC value used to perform inversion genotyping. Larger values of BIC (Figure 2) do not always mean higher accuracy. As above, we recommend careful benchmarking of threshold BIC values for assessing subject classification. In the case of inferences in large populations and when experimental data is available on few subjects, a suitable threshold can be set for the classification that maximizes the inference on those subjects. We use this approach in our analysis of the 17q21 region, where we conducted a detailed search in a region known to have the inversion. When experimental data is not available, the stability of inversion allele for each subject could be studied for different BIC thresholds, drawing more confidence from those individuals that are consistently classified.
On a genome scale, initial scans can be conducted on several window sizes to identify regions of interest with positive BIC values where, as suggested by our simulations, we have high sensitivity. In our whole genome search for the CEU+YRI populations, we found recurrent inversions at different window sizes, which increases the confidence in the findings. This shows that using only three windows sizes, 0.4 - 0.7 - 1 Mb, can detect inversions from 0.4 Mb to ~ 2 Mb with enrichment for detecting inversions with length 0.7 - 1.4 Mb since more than one window have power to detect inversions in this range. The identification of recurrent inversions with different window sizes could also be used as an additional check to reduce false discoveries. In addition, these results suggest that scans with a few window sizes and as large a value of L as computationally feasible should be sufficient for a complete search for inversion signals in a given chromosome. Note that the window size is a probe for the detection of the inversion, thus it should be considered a function of the length of inversions to be detected. The accuracy of our method increases with higher SNP density. At the 500 k array density, our method can still achieve high classification accuracy. However, due to the high variability of the accuracy estimates, we recommend using imputed data in this case. To have a more accurate measure of the subject classification and length of the inversion, an extensive search for all possible lengths around the regions of interest can then be conducted.
Experimental information about the existence of inversions and their status in a sample of subjects is limited, constraining an exhaustive evaluation of the true positive and negative rates of our methodology with validated inversions. Inversions are difficult to characterize even with the newest sequencing data. In the study of structural variants in 1000 genomes, Durbin et al. 2010  did not report inversions. Large inverted repeats at the breakpoints is a current obstacle for inversion detection using next generation sequencing data, in addition to the computational capacity to analyze hundreds of individuals. However, when inversion detection algorithms for sequence data are available, a necessary future evaluation of our methodology should be performed against this new technology.
In this article, we have presented a methodology to detect inversion polymorphisms using nucleotide variation data from SNP micro-arrays. The method, based on LD differences across break points, is able to classify individuals into their inversion genotype status, allowing the gathering of population-based evidence of known inversions. In addition, using a computationally efficient scan across the genome, it provides evidence on new candidate regions that might be present in the population from which genotyped data is collected.
We would like to thank Clive Hoggart for providing us with a version of invertFREGENE that outputs the classification of each individual chromosome, and the reviewers for their useful comments. This work has been supported by the Spanish Ministry of Science and Innovation (MTM2008-02457) to JRG, the statistical genetics network (MTM2010-09526-E) to JRG and AC. BJR is supported by the National Institutes of Health (R01 HG5690) and a Burroughs Wellcome Career Award at the Scientific Interface. MC is supported by the ERC under the European Union Seventh Research Framework Programme (FP7) with Starting Grant (243212-INVFEST).
- Sindi SS, Raphael BJ: Identification and frequency estimation of inversion polymorphisms from haplotype data. J Comput Biol 2010, 17(3):517–531. 10.1089/cmb.2009.0185View ArticlePubMedGoogle Scholar
- O'Reilly PF, Coin LJM, Hoggart CJ: invertFREGENE: software for simulating inversions in population genetic data. Bioinformatics 2010, 26(6):838–840. 10.1093/bioinformatics/btq029View ArticlePubMedGoogle Scholar
- Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AWC, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, Venter JC: The diploid genome sequence of an individual human. PLoS Biol 2007, 5(10):e254. 10.1371/journal.pbio.0050254PubMed CentralView ArticlePubMedGoogle Scholar
- Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N, Teague B, Alkan C, Antonacci F, Haugen E, Zerr T, Yamada NA, Tsang P, Newman TL, Tüzün E, Cheng Z, Ebling HM, Tusneem N, David R, Gillett W, Phelps KA, Weaver M, Saranga D, Brand A, Tao W, Gustafson E, McKernan K, Chen L, Malig M, Smith JD, Korn JM, McCarroll SA, Altshuler DA, Peiffer DA, Dorschner M, Stamatoyannopoulos J, Schwartz D, Nickerson DA, Mullikin JC, Wilson RK, Bruhn L, Olson MV, Kaul R, Smith DR, Eichler EE: Mapping and sequencing of structural variation from eight human genomes. Nature 2008, 453: 56–64. 10.1038/nature06862PubMed CentralView ArticlePubMedGoogle Scholar
- Hoffmann A, Rieseberg L: Revisiting the Impact of Inversions in Evolution: From Population Genetic Markers to Drivers of Adaptive Shifts and Speciation? Annu Rev Ecol Evol Syst 2008, 39: 21–42. 10.1146/annurev.ecolsys.39.110707.173532PubMed CentralView ArticlePubMedGoogle Scholar
- Kirkpatrick M: How and Why Chromosome Inversions Evolve. PLoS Biol 2010, 8(9):e1000501. 10.1371/journal.pbio.1000501PubMed CentralView ArticlePubMedGoogle Scholar
- Chaisson M, Raphael B, Pevzner P: Microinversions in mammalian evolution. Proc Natl Acad Sci USA 2006, 103: 19824–19829. 10.1073/pnas.0603984103PubMed CentralView ArticlePubMedGoogle Scholar
- Andolfatto P, Depaulis F, Navarro A: Inversion polymorphisms and nucleotide variability in Drosophila. Genet Res 2001, 77: 1–8.View ArticlePubMedGoogle Scholar
- Kennington W, Partridge L, Hoffmann A: Patterns of diversity and linkage disequilibrium within the cosmopolitan inversion In(3R)Payne in Drosophila melanogaster are indicative of coadaptation. Genetics 2006, 172: 1655–1663.PubMed CentralView ArticlePubMedGoogle Scholar
- Navarro A, Barbadilla A, Ruiz A: Effect of inversion polymorphism on the neutral nucleotide variability of linked chromosomal regions in Drosophila. Genetics 2000, 155: 685–698.PubMed CentralPubMedGoogle Scholar
- Osborne L, Li M, Pober B, Chitayat D, Bodurtha J, Mandel A, Costa T, Grebe T, Cox S, Tsui L, et al.: A 1.5 million-base pair inversion polymorphism in families with Williams-Beuren syndrome. Nature genetics 2001, 29(3):321–325. 10.1038/ng753PubMed CentralView ArticlePubMedGoogle Scholar
- Korbel J, Urban A, Affourtit J, Godwin B, Grubert F, Simons J, Kim P, Palejev D, Carriero N, Du L, et al.: Paired-end mapping reveals extensive structural variation in the human genome. Science 2007, 318(5849):420. 10.1126/science.1149504PubMed CentralView ArticlePubMedGoogle Scholar
- Antonacci F, Kidd JM, Marques-Bonet T, Ventura M, Siswara P, Jiang Z, Eichler EE: Characterization of six human disease-associated inversion polymorphisms. Hum Mol Genet 2009.Google Scholar
- Tantisira K, Lazarus R, Litonjua A, Klanderman B, Weiss S: Chromosome 17: association of a large inversion polymorphism with corticosteroid response in asthma. Pharmacogenetics and genomics 2008, 18(8):733. 10.1097/FPC.0b013e3282fe6ebfPubMed CentralView ArticlePubMedGoogle Scholar
- Stefansson H, Helgason A, Thorleifsson G, Steinthorsdottir V, Masson G, Barnard J, Baker A, Jonasdottir A, Ingason A, Gudnadottir V, Desnica N, Hicks A, Gylfason A, Gudbjartsson D, Jonsdottir G, Sainz J, Agnarsson K, Birgisdottir B, Ghosh S, Olafsdottir A, Cazier J, Kristjansson K, Frigge M, Thorgeirsson T, Gulcher J, Kong A, Stefansson K: A common inversion under selection in Europeans. Nat Genet 2005, 37: 129–137. 10.1038/ng1508View ArticlePubMedGoogle Scholar
- Durbin R, Altshuler D, Abecasis G, Bentley D, Chakravarti A, Clark A, Collins F, Francisco M, Donnelly P, Egholm M, et al.: A map of human genome variation from population-scale sequencing. Nature 2010, 467(7319):1061–1073. 10.1038/nature09534View ArticlePubMedGoogle Scholar
- Giglio S, Calvari V, Gregato G, Gimelli G, Camanini S, Giorda R, Ragusa A, Guerneri S, Selicorni A, Stumm M, et al.: Heterozygous submicroscopic inversions involving olfactory receptor-gene clusters mediate the recurrent t (4; 8)(p16; p23) translocation. The American Journal of Human Genetics 2002, 71(2):276–285. 10.1086/341610View ArticlePubMedGoogle Scholar
- Feuk L, MacDonald J, Tang T, Carson A, Li M, Rao G, Khaja R, Scherer S: Discovery of human inversion polymorphisms by comparative analysis of human and chimpanzee DNA sequence assemblies. PLoS genetics 2005, 1(4):e56. 10.1371/journal.pgen.0010056PubMed CentralView ArticlePubMedGoogle Scholar
- Starke H, Seidel J, Henn W, Reichardt S, Volleth M, Stumm M, Behrend C, Sandig K, Kelbova C, Senger G, et al.: Homologous sequences at human chromosome 9 bands p12 and q13–21.1 are involved in different patterns of pericentric rearrangements. European journal of human genetics: EJHG 2002, 10(12):790. 10.1038/sj.ejhg.5200889View ArticlePubMedGoogle Scholar
- Gilling M, Dullinger J, Gesk S, Metzke-Heidemann S, Siebert R, Meyer T, Brondum-Nielsen K, Tommerup N, Ropers H, Tümer Z, et al.: Breakpoint cloning and haplotype analysis indicate a single origin of the common Inv (10)(p11. 2q21. 2) mutation among northern Europeans. The American Journal of Human Genetics 2006, 78(5):878–883. 10.1086/503632View ArticlePubMedGoogle Scholar
- Gimelli G, Pujana M, Patricelli M, Russo S, Giardino D, Larizza L, Cheung J, Armengol L, Schinzel A, Estivill X, et al.: Genomic inversions of human chromosome 15q11-q13 in mothers of Angelman syndrome patients with class II (BP2/3) deletions. Human molecular genetics 2003, 12(8):849. 10.1093/hmg/ddg101View ArticlePubMedGoogle Scholar
- Martin J, Han C, Gordon LA, Terry A, Prabhakar S, She X, Xie G, Hellsten U, Chan YM, Altherr M, Couronne O, Aerts A, Bajorek E, Black S, Blumer H, Branscomb E, Brown NC, Bruno WJ, Buckingham JM, Callen DF, Campbell CS, Campbell ML, Campbell EW, Caoile C, Challacombe JF, Chasteen LA, Chertkov O, Chi HC, Christensen M, Clark LM, Cohn JD, Denys M, Detter JC, Dickson M, Dimitrijevic-Bussod M, Escobar J, Fawcett JJ, Flowers D, Fotopulos D, Glavina T, Gomez M, Gonzales E, Goodstein D, Goodwin LA, Grady DL, Grigoriev I, Groza M, Hammon N, Hawkins T, Haydu L, Hildebrand CE, Huang W, Israni S, Jett J, Jewett PB, Kadner K, Kimball H, Kobayashi A, Krawczyk MC, Leyba T, Longmire JL, Lopez F, Lou Y, Lowry S, Ludeman T, Manohar CF, Mark GA, McMurray KL, Meincke LJ, Morgan J, Moyzis RK, Mundt MO, Munk AC, Nandkeshwar RD, Pitluck S, Pollard M, Predki P, Parson-Quintana B, Ramirez L, Rash S, Retterer J, Ricke DO, Robinson DL, Rodriguez A, Salamov A, Saunders EH, Scott D, Shough T, Stallings RL, Stalvey M, Sutherland RD, Tapia R, Tesmer JG, Thayer N, Thompson LS, Tice H, Torney DC, Tran-Gyamfi M, Tsai M, Ulanovsky LE, Ustaszewska A, Vo N, White PS, Williams AL, Wills PL, Wu JR, Wu K, Yang J, Dejong P, Bruce D, Doggett NA, Deaven L, Schmutz J, Grimwood J, Richardson P, Rokhsar DS, Eichler EE, Gilna P, Lucas SM, Myers RM, Rubin EM, Pennacchio LA: The sequence and analysis of duplication-rich human chromosome 16. Nature 2004, 432(7020):988–994. 10.1038/nature03187View ArticlePubMedGoogle Scholar
- Deng L, Zhang Y, Kang J, Liu T, Zhao H, Gao Y, Li C, Pan H, Tang X, Wang D, et al.: An unusual haplotype structure on human chromosome 8p23 derived from the inversion polymorphism. Human mutation 2008, 29(10):1209–1216. 10.1002/humu.20775View ArticlePubMedGoogle Scholar
- Bosch N, Morell M, Ponsa I, Mercader J, Armengol L, Estivill X: Nucleotide, cytogenetic and expression impact of the human chromosome 8p23. 1 inversion polymorphism. PloS one 2009, 4(12):e8269. 10.1371/journal.pone.0008269PubMed CentralView ArticlePubMedGoogle Scholar
- Bansal V, Bashir A, Bafna V: Evidence for large inversion polymorphisms in the human genome from HapMap data. Genome Res 2007, 17(2):219–230. 10.1101/gr.5774507PubMed CentralView ArticlePubMedGoogle Scholar
- Scheet P, Stephens M: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics 2006, 78(4):629–644. 10.1086/502802View ArticlePubMedGoogle Scholar
- Gentleman R, Carey V, Bates D, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome biology 2004, 5(10):R80. 10.1186/gb-2004-5-10-r80PubMed CentralView ArticlePubMedGoogle Scholar
- Schaid D, Rowland C, Tines D, Jacobson R, Poland G: Score tests for association between traits and haplotypes when linkage phase is ambiguous. The American Journal of Human Genetics 2002, 70(2):425–434. 10.1086/338688View ArticlePubMedGoogle Scholar
- Gonzalez JR, Rodriguez-Santiago B, Caceres A, Pique-Regi R, Rothman N, Chanock SJ: A fast and accurate method to detect allelic genomic imbalances underlying mosaic rearrangements using SNP array data. 2011, 12: 166.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.