### CNV detection

#### Discordant paired-end reads

Using next-generation sequencing, short reads are generated in pairs, where a short gap appears between the two reads and the distance of this gap is roughly known. The distance is called *insert length.*

Discordant read pairs occur when the mapping of a paired-end read is not what is expected, given that there were no structural variations in the donor genome. An example to illustrate discordant read pairs is shown in Figure 1. In this figure, the reference genome contains one copy of the highlighted sequence, while the donor genome contains two tandem copies. A read is sampled from the donor such that the forward read is read from the end of the first copy and the reverse read is read from the beginning of the second copy. When mapped to the reference genome, these reads will map in an unexpected orientation. That is, the reverse read will map upstream of the forward read. By finding reads of this type, we can interogate their regions of mapping to find possible CNVs.

#### Clustering discordant pairs

After mapping a complete set of paired-end reads to a reference genome, we will find many discordant pairs. Given the high coverage, we expect that many discordant pairs will come from each CNV. Therefore, to avoid making duplicate CNV predictions and to bolster our confidence in each predicted CNV, we cluster discordant pairs so that each resulting cluster represents a potential CNV.

We define a simple greedy clustering procedure that clusters discordant pairs that explain the same CNV. We start with the complete set of discordant reads. We select one discordant read *x* from the complete set of discordant reads and find the set {*y*} of all other discordant reads such that the difference in forward read mapping positions for *x* and *y* is no greater than *M*_{
I
}, which is the insert length of the paired-end reads. We call the set of reads *x* + {*y*} *explain* a potential CNV and remove them from the set of all discordant reads. Then we continue this processs until there are either no single discordant reads left in the set of all reads or there are no two single reads such that the difference in their forward read mapping positions is less than or equal to *M*_{
I
}. For each potential CNV, we now have a set of discordant reads and we estimate the boundaries by examining the set of mapping positions of these reads. We estimate the leftmost boundary *b* by taking the minimum of the reverse read mapping positions and the rightmost boundary *b* + *l* by taking the maximum of the forward read mapping positions.

#### Estimating copy-counts

Given a predicted CNV region *c* and the set of reads mapped to this region in the reference, we would like to determine how many copies *d* of the reference region are contained in the donor sequence. In order to determine *d*, we first define a function to calculate the likelihood of observing *r* reads within a region of length *l*, given a particular coverage level. For a region of length *l*, the probability of a read mapping to this region is where *G* is the length of the genome. We can calculate the probability of *r* reads mapping to the length *l* region by using the binomial distribution. Therefore, the expected number of reads mapped to a length *l* region is where *N* is the total number of reads generated from the whole genome. When *N* is very large, the binomial distribution can be very well approximated by the Poisson distribution.

Therefore, we can use the density function for the Poisson distribution with in order to calculate the probability of *r* reads mapping to a length *l* region.

We expect that the length *l* region is represented *d* times in the donor sequence. However, we do not know *d.* We have only observed the number of reads mapped to the length *l* region in the reference. Given this we define the following likelihood function.

By finding the *d* that maximizes the likelihood of *r* reads mapping to the reference region *c*, we determine the number of copies of *c* contained in the donor.

Medvedev et al. [24] applied the same likelihood function defined in the Formula 1 and illustrated that it works well on Yoruban HapMap individual NA18507. Next we show this likelihood function may not always work. If we define *length(donor)* as the total length of the donor CNV, we have where *d* is the copy-counts, *l*_{
i
} is the length of the *i*-th copy and we define *l* as the length of the reference CNV. The likelihood function assumes all the copies in the donor sequence are of similar length, namely *I*_{i}*≈ l* for all 1 ≤ *i* ≤ *d.* Then *length(donor) ≈ d x l.* However, when the length of the copies differs from each other too much, namely the prefix and suffix of some *i*-th copies are big enough such that *l*_{
i
} « l, *length(donor)* may decrease significantly from *d x l.* Then we may have *||((d − k) x l) − length(donor)|| < ||(d x l) − length(donor)||* for some *k* ≥ 1, which implies that there is at least one copy of length less than l. Otherwise *(d x l) − length(donor)* should be 0. If the donor CNV is of length much less than *d x l,* the true copy-counts would not maximize the likelihood function defined above and the copy-counts estimated will be inaccurate. We show later in our experiments that when the assumption that all the copies in the donor sequence are of similar length is violated, the likelihood function fails to identify the true copy-counts.

### CNV reconstruction

Once the CNV region is detected and the CNV copy-counts are estimated, we want to reconstruct the exact CNV copies as they appear in the donor sequence. We call the CNV in the reference sequence *reference CNV* and the CNV in the donor sequence *donor CNV.* The donor CNV can be denoted as [*p*_{1}, *s*_{1}][*p*_{2} , *s*_{2}]...[*p*_{
n
}, *s*_{
n
}], where *n* is the copy-counts, [*p*_{
i
}, *s*_{
i
}] indicates the *i*-th copy, *p*_{
i
} indicates the corresponding start position of the *i*-th copy in the reference sequence, *s*_{
i
} indicates the corresponding end position of the *i*-th copy in the reference sequence. We define s_{
i
}|*p*_{
i
}_{+1} as the junction between the *i*-th and *i* + 1-th copy. The CNV Reconstruction problem thus can be stated as following: Given copy-counts *n*, reference CNV region [*b*, *b* + *l*], where *b* is the starting position of the reference CNV and *l* is its length, identify all *p*_{
i
} and *s*_{
i
} for 1 ≤ *i* ≤ *n* and order them such that the number of mismatches is minimized when all the reads are mapped to the reconstructed sequence.

We use unmapped reads to detect the exact junctions of the copies. Given the high coverage ratio of the next generation sequencing data, each junction should be spanned by certain amount of reads. These reads won’t map to the reference sequence and they indicate the start and end positions of the corresponding copies. We can split these unmapped reads and try to map both split parts to the reference sequence. If both parts map perfectly to the reference sequence (for simplicity, we assume all the reads are error-free), the mapped position indicates the corresponding end and start positions of the two adjacent copies. We show a simple example in Figure 2. As we can see, the reference CNV “CTGTCG” is copied three times in the donor sequence. The unmapped read TCGCT doesn’t occur exactly in the reference sequence and it spans the junction between the first and the second CNV copy in the donor sequence. If we split the read into two substrings TCG and CT, both of them map perfectly to the reference CNV. The end mapping position of TCG indicates the first copy ends at position 14 in the reference. The start mapping position of CT indicates the second copy starts at position 9. Next we show the detailed reconstruction process.

#### Unmapped reads Identification

Since we are using unmapped reads to identify the boundaries of CNV copies, we first need to identify unmapped reads sampled from the CNV region. A naive method is to take all the unmapped reads, split them and map them to every identified CNV region. However, the sequencing technique may introduce a large number of unmapped reads due to sequencing errors. It is very inefficient if we take all of these reads and map them to every CNV region. The problem can be alleviated by using mapped reads. Although we don’t know which donor CNV these unmapped reads come from, we can utilize the other parts in their corresponding paired-end reads to validate which donor CNV they are from. We first identify all the reads mapped to the reference CNV region, which can be done efficiently using any reads mapping tool such as MAQ [26]. Next we check the other parts of these reads and select all the unmapped ones, which are highly possible generated from the corresponding donor CNV region. The number of such reads is much smaller than the total number of unmapped reads. We are then able to focus on only the unmapped reads generated from the correct donor CNV region.

#### Junction validation

We split the unmapped reads and map them back to the reference sequence, as shown in Figure 2. A junction is valid only if it is validated by at least one unmapped reads. Therefore we can split an unmapped read at each internal position, which results in two split substrings *r*_{1}, *r*_{2}. *r*_{1} is the suffix of the first copy and *r*_{2}is the prefix of the second copy. We then map the two substrings back to the reference CNV. If both substrings mapped perfectly to the reference CNV (again, assuming the reads are error-free), the mapping positions in the reference CNV of *r*_{1}and *r*_{2} indicate the corresponding end and start positions of the two adjacent copies.

One problem for the validation is if the length of the split substring is short, it’s highly possible that the split substring maps perfectly to the reference sequence by random chance and the substring may map to many positions. For example, in Figure 2, the unmapped read TCGCT is split to TCG and CT. The substring CT maps to both position 2 and 9. To address the problem, we define a *significant length threshold t* such that when we split the read, we only split the read at positions where the resulting two substrings are of length both no less than *t.* Alternatively speaking, we split the read at the positions in the range of [*t, m-t*] in the read, where *m* is the read length. Therefore a perfect map of a split substring to the reference sequence is highly unlikely occurring by chance. We call such a mapping a *significant mapping.* If a splitting results in two substrings which both have significant mappings, we call such a splitting a *significant splitting.* The significant threshold can be determined by the following formula:

where *L* is the length of the reference CNV region, *∈* is a small number, such as 0.05. The formula indicates that we use the minimum *t* such that the expected number of any length *t* substring occurring in the CNV region is no more than *∈*. For example, in Figure 2, *L*=17. If *∈* = 0.05, according to Formula 2, we obtain *t* ≥ 4. Therefore, the length of the substring as what we used in the example (length as 3 and 2 for the two split substrings) is too small to avoid possible mappings by random chance.

#### Donor sequence reconstruction

Given any order of the junctions, we are able to reconstruct a donor CNV. For example, if the order of the junctions is: *s*_{1}|*p*_{2}, *s*_{2}|*p*_{3}, ..., *s*_{n-1}|*p*_{
n
}*,* the reconstructed donor CNV is [*s, s*_{1}], [*p*_{2}, *s*_{2}], ..., [*p*_{n-1}, *S*_{n-1}], [*p*_{
n ,
} *e*], where *s* and *e* are the start and end positions of the CNV, respectively. However, not all the donor CNVs are valid. We need to follow a simple rule: for any two adjacent junctions *s*_{
i
}|*p*_{i+1}and *s*_{i+1}|*p*_{i+2}, we need to have *p*_{i+1}≤ *s*_{i+1}, namely the starting position of a copy should be no greater than the ending position of the copy. For example, if we only have two adjacent junctions 100|200 and 150|50, the order of the two junctions can be only 150|50, 100|200. The corresponding reconstructed donor CNV is thus [*s*, 150][50, 100][200, *e*], where *s* and *e* are the start and end positions of the CNV, respectively. If we have order as 100|200, 150|50, the reconstructed donor CNV is [*s*, 100][200, 150][50, *e*], where the middle copy [200, 150] is invalid.

Notice there can be multiple orders satisfying this rule and therefore the donor CNV may not be unique. For example, given junctions 100|200 and 300|50, *s* and *e* as the start and end positions of the CNV, we can have an order 100|200 and 300|50, where the reconstructed donor CNV is [*s*, 100][200, 300] [50, *e*], or an order 300|50 and 100|200, where the reconstructed donor CNV is [*s*, 300][50, 100][200, *e*]. Both reconstructed donor CNVs are valid. We are not able to find the true donor CNV given the paired-end reads only. Therefore, we simply output all the possible donor CNVs.

When the CNV region contains repeats, there might be false positive junctions predicted using the above method. We simply rank all the predicted junctions using the number of unmapped reads that can validate them. We pick the junctions with the highest ranks. The number of junctions selected depends on the copy counts.