npInv: accurate detection and genotyping of inversions mediated by non-allelic homologous recombination using long read sub-alignment

Detection of genomic inversions remains challenging. Many existing methods primarily target inversions with a non repetitive breakpoint, leaving inverted repeat (IR) mediated non-allelic homologous recombination (NAHR) inversions largely unexplored. We present npInv, a novel tool specifically for detecting and genotyping NAHR inversion using long read sub-alignment of long read sequencing data. We use npInv to generate a whole-genome inversion map for NA12878 consisting of 30 NAHR inversions (of which 15 are novel), including all previously known NAHR mediated inversions in NA12878 with flanking IR less than 7kb. Our genotyping accuracy on this dataset was 94%. We used PCR to confirm presence of two of these novel NAHR inversions. We show that there is a near linear relationship between the length of flanking IR and the size of the NAHR inversion.


23
Detecting and genotyping inversion 24 We present npInv, a novel tool designed specifically for detecting and genotyping NAHR mediated inversions from long read 25 sequencing data. The input to npInv is an alignment file in bam format generated from local aligner such as BWA-MEM 12 . 26 npInv's pipeline and pseudo code are shown in Figure 1 and in supplementary methods, respectively. In brief, npInv scans the 27 alignment file for reads that contain pairs of subread alignments mapping to the same chromosome but with different orientation 28 ( Figure 2). npInv records this subread alignment pair as an inversion signal. If a subread alignment pair overlap in the original 29 read, npInv records this overlapping sequence as an inverted repeat. npInv clusters and filters all the inversion signals in order to 30 detect into inversion event based on position and the number of inversion signals in the cluster. npInv reports both the number 31 of reads which support an inversion, as well as the number of reads supporting the non-inverted allele (reads which span the 32 inversion breakpoints). Finally npInv applies a binomial model 13 to genotype inversion from these read counts (see Methods). 33 npInv reports the position, mechanism and genotype of each inversion.
Once the inversion regions are defined, we estimated the number of inversion reads as well as the number of reads supporting the non-inverted (reference) allele, which are removed in the step (2). Horizontal black dash line indicates the classification of inversion and reference reads. Finally, the software predicts the inversion with position, mechanism and genotype.
Benchmarking the software using simulated data 36 We first benchmarked the software using simulation data. We simulated 61 NAHR, 100 short (<4 kb) and 100 long (>4 kb) 37 NHEJ non-overlapping inversions in reference GRCh37 chromosome 21. NAHR inversions were simulated based on the 38 location of IR of length above 500 bp in the reference chromosome 21 (which limited their number to 61). We randomly set the 39 genotype of inversion to be heterozygous or homozygous. Next, we used readsim 14 to simulate reads with an average read 40 length of 3 kb, 6 kb or 9 kb. Sequence substitution, insertion and deletion rates were set at 5.1%, 4.9% and 7.8%, respectively 41 based on previously described characteristics of nanopore sequence data 15 Figure 2. Illustration of effect of a NAHR mediated inversion on long read sub-alignments. Idealised NAHR inversion and reference are shown in first two panels. Inverted repeats are showed as dark and light blue. Orange, dark purple and blue hashed rectangles indicate unique sequence. The direction of the hashing indicates its orientation. The third panel (red) shows a read supporting the left breakpoint of the inversion. The large arrow indicates the original unmapped read. The smaller arrows indicate two sub-read alignments, with the direction of the arrow indicating the alignment orientation, and the horizontal dashed line indicating aligned and clipped sequence. The dot lines indicate the position of the subread alignment on the original read. The fourth panel (green) is similar to the third panel, except that it illustrates the read supporting the right breakpoint.

45
For simulated NAHR inversions, npInv demonstrated substantially better sensitivity (41% to 210%)than the next best 46 program (Sniffles) over all coverage and read-lengths simulated (Fig, 3a). Lumpy didn't predict any simulated NAHR inversions. 47 Instead, it reported inversion breakpoints as potential structural variation. This was likely because its algorithm didn't expect 48 repetitive sequence around inversions. The PPV of npInv was also highest across most coverage and read-lengths, although 49 Sniffles' PPV, which was slightly (2% to 5%) higher than npInv in low coverage long read datasets (Fig, 3d). npInv's PPV 50 remained high (> 90%) across all datasets, while its sensitivity depended on the depth and read length. The sensitivity was 51 good (> 80%) at 20 fold coverage and it did not improve significantly when the depth increased to 40 fold. The read length 52 didn't play a key role on both PPV and sensitivity, which was likely due to the fact that most of the background IR used to 53 simulate NAHR inversions are of length less than the shortest average simulated read length (of 3 kb).

54
For NHEJ inversions the difference between the algorithms was not as pronounced. For long (> 4kb) NHEJ inversions, 55 PPV for all 3 methods was more than 92%. The sensitivity of the three methods was similar (around 80%) for 20x coverage or 56 higher, but npInv had a higher sensitivity at lower coverage. For short (<4 kb) NHEJ inversions, the PPV for all 3 methods was 57 higher than 94%, but their sensitivity ranged from 26% to 89%. Lumpy's sensitivity was lower than previously reported using 58 simulations of highly accurate short paired-end reads 16 . For all 3 tools, the sensitivity decreased with increasing average read 59 length. This reflects limitations of existing alignment algorithms on long error-prone reads. When the aligners align long read 60 data, they have to decrease the penalty for gap opening and extending in order to adapt the relatively high sequencing error rate 61 in long read sequencing. As a result, aligners preferred to incorrectly align more sequence at the inversion breakpoint. Even 62 worse, when the inversion was short compared to the read length, the aligner might fully align the inversion spanning read to 63 the reference with wrong gap opening and extending at the inversion flipping sequence (Figure 4). In this case, an inversion 64 supporting read would be incorrectly regarded as a reference supporting read. 65 npInv was the only algorithm which reports the genotype for each inversion. To correctly genotype an inversion both the 66 inversion read and reference read should be detected correctly. npInv's genotype consistency was higher than 90% for long 67 NHEJ inversions and NAHR inversions, but was lower for short NHEJ inversions with low coverage and long reads (9kb) (Fig, 68  3g). The genotyping error is mainly caused by the limits of sensitivity in detecting reads supporting the inversion, and as a 69 result counting these reads as reference-supporting, leading to homozygous inversions being annotated as heterozygous. This

4/13
was particularly a problem in conjunction with the issues regarding alignment to short inversions as discussed in the previous 71 paragraph ( Figure 5). . The performance of genotyping inversion from simulated and real data Each dot is an inversion. X and Y axises are the base 10 logarithm of the estimated reference and inversion allele depth, respectively. "Hetero" and "Homo" are for heterozygous and homozygous inversion, respectively. Reference and inversion allele depth were estimated by npInv from (a) simulation data (40 folds and average read length 6 kb NAHR inversion) and (b) real data (NA12878 nanopore data 18 ). The genotype represented in the legend is the true genotype in the simulation (a) or validated database (b) 5 , respectively. The coloured ellipse indicates high confidence genotype prediction by npInv.
Benchmarking the software using real data 73 We aligned Nanopore high coverage human sequencing data on sample NA12878 18 to GRCh37 and identified 41 inversions 74 using npInv. We compared our results to a 'truth dataset' of inversions from InvFest 5 , which is a database of validated inversions 75 using various techniques including fluorescent in situ hybridization (FISH), polymerase chain reaction (PCR) [19][20][21][22][23][24][25][26][27][28] . We also 76 compared this result to short read sequencing result of NA12878 by Delly 10 and long read (Pacbio) assembly based on inversion 77 call set 29 ( Figure 6). npInv detected 18 (15 NAHR, 3 NHEJ) novel inversons and 23 (15 NAHR, 8 NHEJ) inversions overlapping 78 one or more dataset, of which 13 inversions are included in the truth dataset. npInv analysis of nanopore sequence data had the 79 largest overlap with the validated dataset compared to the PacBio assembly (5) and Delly Illumina analysis (8). This is because 80 npInv (mean inversion size 61 kb) can detect both short and long inversions, while assembly (mean 1.8 kb) and Delly (mean 81 2.3 kb) preferentially identify short inversions. Inversions from the InvFEST database which could not be detected by npInv 82 include inversions shorter than 2 kb (3), flanked by IR longer than 7 kb (5) or inversion with a deletion (1). In other words, 83 npInv could detect all nine detectable (IR< 7kb) validated NAHR inversion as well as four out of five validated NHEJ inversion 84 with size> 2kb . 85 We also used a set of validated 36 inversion sites in NA18278 (derived from InvFEST) to validate genotype consistency 86 of npInv. The genotype consistency for homozygous reference, heterozygous inversion and homozygous inversion are 87 100%(23/23), 83%(5/6) and 86%(6/7) in the real data, respectively. Overall it is 94%(34/36). 89 We selected three novel inversions of size > 1kb predicted by npInv which could be validated using a PCR based approach. As 90 this requires a PCR product which spans the inverted repeat, this placed an upper limit on the size of the IR to be less than 2 91 kb. Three of 18 novel inversions passed these criteria predicted from npInv. We checked inversion 4q35.2 (NHEJ), 3q21.3 92 (NAHR) and 10q11.22 (NHEJ) by PCR (Figure 7). Among these 3 inversions, there were 2 predicted heterozygous (4q35.2 93 and 3q21.3) and 1 homozygous inversion (10q11.22). We were able to validate predicted genotypes at both NAHR inversions. 94 However, the 4q35.2 NHEJ inversion, could not be validated by PCR. Visual inspection of aligned nanopore reads revealed 95 a clear structural variation breakpoint (Supplementary Figure 1a) which was also predicted to be an inversion by Sniffles 17 . 96 However, inspection of Pacbio 29 reads revealed almost no clipped reads, indicating an absence of an inversion (Supplementary 97 Figure 1b). We surmise that the inversion observed in the nanopore sequence data may be due to a somatic mutation which 98 5/13 occurred in a precursor cell to those used for Nanopore sequencing, however this is difficult to prove without access to the exact 99 cell-line used in sequencing.  29 , InvFest database of validated inversions 5 , as well as 103 novel inversions discovered by npInv. This resulted in a set of 87 known inversions, which we mapped to a karyogram ( (Figure 104  8). We observed that NAHR inversions (mean size 275kbp) are longer than NHEJ or FoSTeS inversions (3.8 kb) ( Figure 9)) . 105 Short read methods like Delly 10 primarily focus on NHEJ inversion or NAHR inversion for which IR size is shorter than library 106 insert size. Thus, it mainly reports the distribution of NHEJ inversions. On the other hand, the long read splitting method at IR 107 like npInv could extend the range of detection to longer NAHR inversion( Figure 9). 108 We classified inversions according to the size of flanking IR as short (<500bp), median (500-7000bp) and long (>7kb). 109 Short IR inversions can be detected by PEM based methods from short read sequencing data and local assembly 31 , particularly 110 as the local sequence structure is typically not repetitive around short variants. Median IR inversions are efficiently detected 111 using npInv as shown above. 113 We investigated the relationship between IR and NAHR inversion by summarizing all the background IR in the genome as well 114 as predicted and validated NAHR inversions ( Figure 10, see Methods). The background IRs mainly occur with length less 115 than 10 kb and between repeat distance ranging from 10 Mb to 100 Mb. There are two hotspots for IR at around 300bp and 116 6000bp, which is mainly due to the random distribution of short interspersed nuclear elements (SINEs) and long interspersed 117 nuclear elements (LINEs) in the chromosome. If the probability of a NAHR inversion occurring is equal amongst all the IRs, 118 the distribution of NAHR inversion should be the same as the distribution of background IR. However, we found the NAHR 119 inversion distribution is totally different from the background IR distribution. Surprisingly, there is an almost linear relationship 120 between the size of inverted repeat and the inversion, as well as an apparent empirical upper and lower bound on the size 121 of an IR which can mediate an inversion of a certain size. For example, a 1Mb inversion can only be mediated by an IR of 122 length greater than 50kb. This suggests only some IRs have the potential to mediate non-allelic homologous recombination 123 and become a NAHR inversion. Furthermore, for small (size<3 kb) NAHR inversion, IR sequence must almost have a 100% 124 identity. For larger (IR>50 kb) NAHR inversions, the size of first and second inverted repeat are not always the same and the 125 identity could be lower (0.90 to 0.99). As the size of IR increase, the tolerance of recombination also increases. 126 We use this observed relationship to map the potential location of inversion mediated by large IR (>7kb) in the human 127 genome, which are still not well characterized by sequencing based approaches. We filter all IR pairs with length greater than 128 7kb on the basis of the distance between the repeats (see Methods) to identify IR pairs which can mediate inversions ( Figure 129   10). This leaves 140 regions in which large IR inversions could occur (Figure 8 and supplementary table). All of the 5 known 130 NAHR inversions with IR greater than 7kb observed in NA18278 occur within one of these regions.

132
We developed a new tool, npInv, to detect and genotype inversion from long read sequencing data, with particular application 133 to data generated using Oxford Nanopore sequencing technologies devices. The application of npInv shows high accuracy 134 in both simulation and real data. We use npInv to uncover an almost linear relationship between inverted repeat and NAHR 135 inversion and show the potential of providing an individual inversion map. With the possible widespread adoption of long 136 read sequencing data, application of npInv could help extend our understanding of the extent of inversion polymorphism, their 137 evolutionary significance and their clinical impact. 138 We report the most comprehensive whole-genome inversion map to date, consisting of 87 inversions, of which 38 are 139 NAHR mediated inversions and the remained are NHEJ or FOSTES mediated. We exploited knowledge of potential sites of 140 NAHR inversion to identify a further potential 140 inversion loci with IR length greater than 7kb. An increase in the yield of 141 ultra-long (>100kb) sequence data on this sample, coupled with algorithmic improvements in alignment of long reads will help 142 refine the location of inversions flanked by these long IR.

151
For figure 8, we first filtered the IR (inferred from SD database 4 ) less than 7k to identify 2130 IRs. We carried out a linear 152 regression of IR length against the inversion size on all InvFEST and all npInv detected inversions in NA12878 respectively. 153

9/13
The regression parameters obtained are similar. We use the InvFEST regression parameters to build a predictive model of 154 the length of IR against the distance between IR (i.e. the minimum potential inversion size). We removed the IRs outside of 155 90% prediction interval by R to identify 1302 IRs. Then we applied BEDTools 34 to sort and merge these regions into 140 156 non-overlapping genomic regions.

Analysis of long read sub-alignment
158 npInv focuses on reads which have multiple sub-alignments. For each of these sub-read alignments, i, we sort the alignments by 159 its left-most location in the read. Then we record the (start i read , end i read ) co-ordinates of the alignment on the read, the (start i re f , 160 end i re f ) co-ordinates in the reference genome, as well as the reference orientation and chromosome. For a read containing 161 multiple sub-alignments (Figure 2), we perform the following analysis. We first filter alignments with length less than 500 162 bps or for which the alignment interval on the read is totally contained by another alignment interval. Next, for each pair 163 of read-adjacent alignment intervals (which are allowed to overlap), we keep pairs mapping to the same chromosome and 164 in different alignment orientation as potential inversion signals (A1-A2, B1-B2 in Figure 2). If the first sub alignment is in 165 forward strand, we record this signal as an inversion forward supporting signal. Otherwise, we record as reverse signal. If the 166 first alignment's location in reference is bigger than the second alignment, we record this signal as a left breakpoint inversion 167 supporting signal. Otherwise, we record this signal as right. If two alignment intervals are overlapping by more than 500bp (on 168 the read), this inversion signal is considered to be mediated by NAHR. The overlapping sequence could align to the inverted 169 repeat in both orientations (light blue and dark blue in Figure 2). Only one of these pair of read alignments includes sequence 170 from the inversion itself ( Figure 2). All inversion signals are sorted by chromosome and left-most start position on reference. 171

Analysis of inversion signal 172
After scanning the bam file by split read analysis, the software identifies numerous inversion signals. The user has the option 173 of providing a database of known IR pairs in the genome. If this is provided, the software creates a bin for each IR in the 174 database which is used to merge inversion signals. Each bin represents an inversion call. For each candidate inversion, we 175 check whether the left breakpoint and right breakpoint are within X bp (in practice, X=2000) in the IR database's left and right 176 repeat sequence. If true, group them in this IR bin and delete the binned signal. If the user does not provide an IR database, and 177 also for the remaining signals which cannot be clustered around known IR, inversion signals are grouped into the same bin if 178 their reference start and end are both less than X bp (default 2000) from each other. We then investigate whether each merged 179 inversion signal contains supporting reads on both forward and reverse strands, and also at both left and right breakpoints. 180 The output inversion's start and end are the mean value of the left breakpoints and right breakpoints, respectively. The output 181 inversion left and right breakpoint start and end are the minimum and maximum alignment position at left and right breakpoint 182 in the reference, respectively. We calculate the inversion supporting read R inv as the sum of reads supporting the left and right 183 breakpoint signal for genotyping inversion. If a read supports both left and right breakpoint, it will count as one left breakpoint 184 signal and one right breakpoint signal. 185 npInv annotates the inversion longer than L(parameter, default 1M) as long inversion. We consider that long inversion is 186 not reliable for either NHEJ (usually shorter than 1 Mb) or NAHR (likely with inverted repeat which is too long to be fully 187 spanned) inversion. 189 We calculate the average substitution, deletion and insertion rate and their standard deviations from the first min(10000, all) 190 primary alignments. For each alignment overlapping with inversion, we define its inversion region as (max(left breakpoint start, 191 alignment start), min(right breakpoint end, alignment end)). We calculate three error rates (substitution, deletion and insertion 192 rate) in its inversion region from the primary alignments. If the all three error rates are less than its average rate plus its one 193 standard deviation, we kept this alignment as a reference supporting alignment. We calculate the sum of reference supporting 194 read R re f for the next step. If a read spans both left and right breakpoint and passes the criteria, it will count as one reference 195 left breakpoint signal and one reference right breakpoint signal.

196
Inversion genotyping 197 For each binned inversion, we get the number of inversion and reference supporting reads (R inv ,R re f ) from the above step. Applying binomial model 13 on the genotyping inversion, the posterior probability P of genotype G = g re f re f , g re f inv , g invInv conditional on the observed read counts R re f and R inv could be written as below.

10/13
The likelihood P(R re f , R inv |G) could be written as where ε 1 and ε 2 are the error rates of incorrectly assigning an inversion-supporting read to a reference supporting read and vice-versa, respectively(in practice, we use ε 1 = ε 2 = 0.01, however with availability of more data it would be possible to infer specific mis-assignment rates). We assume an uniform prior such that P(G = g inv ) = P(G = g re f ) = 0.5 and then by Hardy-Weinberg equilibrium P(G = g re f re f ) = P(G = g invInv ) = 0.25 and P(G = g re f inv ) = 0.5. Then we choose the maximum posterior probability genotype as the genotype for the individual. The genotype quality Q is calculated as the second maximum posterior probability P 2nd divided by the maximum posterior probability P 1st in Phred quality score as below.
Inversion simulation and benchmarking 198 We chose the whole GRCh37 chromosome 21 as the reference. We grouped the inversions into three types, which were 199 NAHR, short (0-4 kb) NHEJ and long (4 kb to 1 Mb) NHEJ inversions. We simulated 61 NAHR, 100 short and 100 long 200 NHEJ non-overlapping inversions in reference chromosome 21. NAHR inversions were simulated based on the reference IR 201 (>500 bp) from IRF 32 and limited to 61 non-overlapping NAHR inversions on chromosome 21. We randomly set the genotype 202 of inversion as heterozygous or homozygous. Then we simulated a diploid chromosome 21 and flipped over the simulated 203 inversion interval in one or two chromosomes according to its genotype. Next, we used readsim 14 (version 1.6) to simulate reads 204 from this diploid chromosome with an average read length of 3 kb, 6 kb or 9 kb. Sequence substitution, insertion and deletion 205 rates were set at 5.1%, 4.9% and 7.8%, respectively based on previously described characteristics of nanopore sequence data 15 . 206 Sequence depth was set at 5, 10, 20 or 40 folds for different simulations. The readsim parameter is sim fa -rev strd on -tech 207 nanopore -read mu 3000,6000,9000 -read dist exp -cov mu 5,10,20,40 -err sub mu 0.051 -err in mu 0.049 -err del mu 208 0.078.

209
Simulation reads were aligned by BWA-MEM 12 (version 0.7.15-r1142-dirty) to chromosome 21. The BWA-MEM parameter 210 is -t 16 -x ont2d -M, which is suggested by Sniffles' readme. The alignment result was used for npInv (version 1.2), as well 211 as for software Lumpy 16 and Sniffles 17 . We run Lumpy (v0.2.13) from its executable file named lumpy with parameter -mw 212 4 -tt 1e-3 -sr bam file:BAMINPUT,back distance:20,weight:1,id:1,min mapping threshold:1. Sniffles (version 1.0.5) was 213 downloaded from https://github.com/fritzsedlazeck/Sniffles. We applied Sniffles directly to the simulation bam files with the 214 default parameter. Lumpy or Sniffles inversions were called when their vcf ALT fields are equal to <INV>. An inversion was 215 classified as positive predictive inversion when the true simulation inversion interval was 90% overlapping with the predictive 216 inversion interval, and vice-versa. Finally, the positive predictive value (PPV), sensitivity (S) and genotype consistency (GC) 217 were calculated for different datasets.

219
NA12878 raw data 18 (version rel3) was downloaded from https://github.com/nanopore-wgs-consortium/ 220 NA12878. We aligned it to GRCh37 by BWA-MEM 12 (version 0.7.15-r1142-dirty). The key parameter was -k11 -W20 -r10 221 -A1 -B1 -O1 -E1 -L0 -Y. We ran npInv(version 1.2) with default parameter. The predicted inversions are the inversions whose 222 "FILTER" field is equal to "PASS" in vcf 35 file. 224 For inversions from InvFEST, we accepted the mechanism from InvFEST. For the remaining unclassified inversions, we 225 checked whether the start and end were within inverted repeats from the SD database 4 . If an inverted repeat was found, we 226 classified the inversion as NAHR mediated with IR sizes and identity from SD database. Otherwise, we extracted the whole 227 inversion sequence and aligned it to itself by YASS 36 . If the YASS's dotplot showed inverted repeat sequence at both the start 228 and the end, we classified it into NAHR inversion. The IR sizes and identity were determined by the YASS's alignment result. 229 When the inversion was totally reverse complement, we classified it as Palindrome. We classified the remaining inversions into 230 NHEJ/FoSTeS inversion.

232
PCR was used to validate 3 inversions detected from the sequencing data. Two forward primers were designed to overlap 233 the inversion breakpoints, one to amplify the reference copy and a second primer to amplify the inverted copy with a shared 234 11/13 reverse primer. PCR reactions were performed using 1x HotStar Taq DNA Polymerase (Qiagen), 2.5mM MgCl 2 , 200nM of 235 forward primer (either to amplify the reference or the inverted sequence), 200nM reverse primer and 2ng/uL of DNA NA12878. 236 PCR conditions were optimized for each PCR target. The following PCR conditions were used: hot start at 95 o C for 15 237 minutes, 35 cycles of 95 o C for 30s, 60 o C for 30s and 72 o C for 4 minutes with a final extension of 10 minutes at 72 o C. An 238 annealing temperature of 55 o C was used to amplify the inverted sequence of 3q21.3. PCR products were analyzed by horizontal 239 electrophoresis on 1.5% agarose gel.

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.
Availability of data and materials npInv(version 1.2) is an open-source java runnable jar file available at https://github.com/haojingshao/npInv with license GPL-3.0. It was run and tested on Mac OS and Linux with java version 1.8.0 66 and 1.8.0 112.

Competing interests
LC received travel and accommodation expenses to speak at an Oxford Nanopore-organised conference.