 Research article
 Open access
 Published:
Copy number variation detection using next generation sequencing read counts
BMC Bioinformatics volume 15, Article number: 109 (2014)
Abstract
Background
A copy number variation (CNV) is a difference between genotypes in the number of copies of a genomic region. Next generation sequencing (NGS) technologies provide sensitive and accurate tools for detecting genomic variations that include CNVs. However, statistical approaches for CNV identification using NGS are limited. We propose a new methodology for detecting CNVs using NGS data. This method (henceforth denoted by mHMM) is based on a hidden Markov model with emission probabilities that are governed by mixture distributions. We use the ExpectationMaximization (EM) algorithm to estimate the parameters in the model.
Results
A simulation study demonstrates that our proposed mHMM approach has greater power for detecting copy number gains and losses relative to existing methods. Furthermore, application of our mHMM to DNA sequencing data from the two maize inbred lines B73 and Mo17 to identify CNVs that may play a role in creating phenotypic differences between these inbred lines provides results concordant with previous arraybased efforts to identify CNVs.
Conclusions
The new mHMM method is a powerful and practical approach for identifying CNVs from NGS data.
Background
Introduction
A copy number variation (CNV) is a variation between genomes in the number of copies of a genomic region that is 1,000 DNA bases (1 Kb) or larger [1]. A CNV is a type of structural variation (SV) because a CNV affects a relatively large region in a DNA molecule. Structural genomic duplications or deletions correspond to copy number gains or losses, respectively. CNVs play an important role in human hereditary illnesses [2] and in plant breeding and agricultural improvement [3].
Maize exhibits extensive variation in both genotype and phenotype relative to the variation seen in humans [4]. The genotypic diversity in maize species permits a variety of uses, such as human and animal food and fuel. The maize genotype B73 was sequenced in 2009 [3]. This accomplishment allows a further comparison and understanding in different types of maize. SwansonWagner et al. [5] and Belo et al. [6] compared a variety of maize inbreds with B73 using array comparative genomic hybridization (aCGH) and identified a considerable number of CNVs along the genome. Springer et al. [7] also analyzed the structural variance between the two maize genotypes B73 and Mo17 using aCGH.
Array comparative genomic hybridization, first proposed in 1997 [8], has served as a robust and effective approach for CNV screening [9]. Statistical methods for analyzing aCGH data are readily available and are described in review articles such as Wineinger et al. [10] and Medvedev et al. [11]. However, aCGH is expensive and has limited resolution and accuracy. Nowadays, rapidly developing next generation sequencing (NGS) technologies provide a sensitive and accurate alternative approach for accessing genomic variations. The quality, speed, and affordability give NGS a significant advantage over microarrays [12, 13].
Despite the advantages of NGS over aCGH, the use of NGS for CNV identification has been limited by a lack of available and effective statistical approaches. The welldeveloped aCGH data analysis methods cannot take the full advantage of NGS data, and thus, new statistical analysis methods for NGS data are needed. Most of the existing methods for CNV detection using NGS data can be classified into two categories: sliding window methods and hidden Markov model (HMM) methods. Sliding window methods include Segseq by Chiang et al. [14], Eventwise testing by Yoon et al. [15], rSWseq by Kim et al. [16] and JointSLM by Magi et al. [17], among others. This category of methods must simultaneously deal with a large number of tests of significance, and the results of such methods are highly dependent on the determination of critical values. One example of the methods using HMMs is CNAseg by Ivakhno et al. [18]. This method relies on borrowing information from additional samples of the two genomes in comparison. In most situations, only one sequenced sample from each genotype is available. The method proposed in this paper is able to detect copy number variation change points even with just one sequenced sample of each genotype.
Data collection and terminology
To understand the data and our model for the data, it is necessary to introduce some NGS data collection details and terminology.
First of all, we use reference genome to describe the genome of the genotype that has been fully sequenced using wholegenome sequencing technologies. In contrast, the target genome is the genome of a genotype of interest that has not been fully sequenced. The goal is to use NGS data from the reference and target genotypes to identify regions of copy number variation between the reference and target genomes. We say that a genomic region in the target genotype where the number of copies is amplified relative to the reference genotype has a copy number gain. A target genomic region present, but at a reduced copy number relative to the reference genome, is said to have a copy number loss. A region that is present in the reference genome but absent in the target genome is described as absent. These three states (copy number gain, copy number loss, and absent) represent copy number variations in the target genome relative to the reference. A region with no difference in the number of copies between the target and the reference genotypes is said to be normal in state. A genomic location where there is a change from one copy number state to another is called a copy number change point.
To identify copy number change points and copy number states, a DNA sample from each of the target and the reference genotypes is obtained. The DNA strands in a sample are fragmented into 100 to 1,000 base segments. At one end of every randomly sampled segment, a sequence of 100 to 300 bases is determined and recorded. Such a sequence of bases is called a read. Each of the reads is then aligned to the reference genome to determine its origin in the genome. The location of the first base of the read on the reference genome is recorded as the position of the read. The numbers of reads for the target genome and the reference genome are recorded as the target read counts and the reference read counts, respectively. If a location has a positive target read count or a positive reference read count, it is called a site. Thus, data from NGS technologies are small nonnegative integer counts with associated site positions on the reference genome.
Preliminary data processing
The data to be analyzed can be described as follows. Suppose the observed reference and target genome read counts are denoted by {o}_{i}^{[r]} and {o}_{i}^{[t]}, respectively. The corresponding genomic positions are denoted by h_{ i }, i = 1,…,I. The goal of mHMM is to find the segmentation {h}_{1}\equiv {h}_{{i}_{0}}<{h}_{{i}_{1}}<{h}_{{i}_{2}}<\dots <{h}_{I}\equiv {h}_{{i}_{J}} such that the copy number of the target genome changes between two consecutive segments, and remains the same within a same segment. Because the read counts take small nonnegative integer values, including a large number of zeros, it is difficult to carry out accurate modeling and inference using the original data. Thus, it is more practical to work with sums of counts rather than the original individual counts.
A common way to aggregate the data is to define windows with a specific width and calculate the sum of target and reference read counts within each window, so as to obtain shorter series of larger target and reference read counts. Kim et al. [16] defined windows using a fixed number of read counts in the reference genome. Chiang et al. [14], Xie et al. [19] and Ivakhno et al. [18] defined windows using a fixed genomic distance. These methods have an underlying assumption that the sites within a window share the same copy number state. Such an assumption may be reasonable because a CNV is a somewhat rare type of genomic mutation, and the closer any two sites are located on the genome, the less likely there is a CNV change point between these sites. However there is also a problem in implementing these methods. Sites are randomly located along a genomic sequence, with a high density in some parts of the genome and a low density in other parts of the genome. Rigidly defining windows with a fixed number of read counts has the potential to put sites physically far away from each other into one window, which increases the risk of including copy number change points in a window. Rigidly defining windows with a fixed genomic distance can produce high variation in the number of sites and in read counts across windows. This can lead to decreased accuracy for identifying CNVs.
In this paper, we propose a new grouping method, which takes into consideration both the numbers of sites in windows, and distances among sites within and between windows. We propose to use Kmeans clustering on the physical site positions to define windows. This method is implemented on each chromosome separately. We first divide each chromosome into several big parts, by defining breakpoints that correspond to the largest distances between adjacent sites. In real data set, we suggest to divide each chromosome into 20 parts. Then we perform Kmeans clustering for each of the 20 parts, where K is chosen for any particular part as the number of reference genomic sites in the part divided by a constant number C. Thus C is on average the number of reference genome sites in each window after grouping. In this paper, C = 40 is used in the simulation study and real maize data analysis. Finally, we obtain the windows, where each window is defined by a collection of sites in one cluster. The number of windows W may be different from one chromosome to another. In the maize application we present later in the paper, the number of windows per chromosome ranges from 2707 to 5831. For window w = 1,…,W, let g_{ w } denote the set of indices corresponding to sites in window w. Thus we obtain a new sequence of the target genome read counts
and a new sequence of the reference genome read counts
where the target and reference read counts for window w are the sum of the target read counts and the sum of the reference read counts within that window: {u}_{w}^{[t]}={\sum}_{i\in {g}_{w}}{o}_{i}^{[t]}, {u}_{w}^{[r]}={\sum}_{i\in {g}_{w}}{o}_{i}^{[r]}, w = 1,…,W. We use the median position of sites within a window as the location for that window: ℓ_{ w } = median{h_{ i },i ∈ g_{ w }}, w = 1,…,W and obtain a series of genomic locations
By this method, the sites that are closest together are more likely to be grouped together in a window, which results in a more reasonable grouping than previously used approaches.
Once sites are grouped into windows, we assume that the hidden states remain the same within windows, and our mHMM method is used to obtain a windowbased segmentation. Next, an adjustment procedure is further performed on the windowbased result to obtain more accurate copy number change points. We refer to the final segmentation result as the sitebased segmentation result.
MixtureHidden Markov Model (mHMM) for a window based result
In this section, we describe the first step of the proposed mixturehidden Markov model (mHMM) that we use to estimate the windowbased copy number change points along the genome. This CNVdetecting methodology is carried out separately on each chromosome.
HMM was described by Baum et al. [20–23]. A HMM is constructed by a bivariate random process {S_{ w },U_{ w }}, w = 1,…,W. One component of this random process, {S_{ w }} (w = 1,…,W), is an unobserved Markov chain with finite states. {S_{ w }} (w = 1,…,W) has Markov property, which means that given the "current state" S_{ w }, the "future state" S_{w+1} and the "past state" S_{w1} are independent, i.e., P(S_{w+1}S_{1},…,S_{ w }) = P(S_{w+1}S_{ w }), w = 1,…,W. The probability a_{ kl } = P(S_{w+1}S_{ w }) is called the transition probability, which determines the probability of the state of w + 1 based on the state of w. In mHMM, the unobserved copy number states of the windows along the chromosome are the hidden states. The copy number states take four values, where
1 = gain: copy number gain/amplification in the target relative to the reference,
2 = normal: no difference in copy number between the target and the reference,
3 = loss: region present in the target genome but at a reduced copy number relative to the reference,
4 = absent: region absent in the sample but present in the reference.
Another important component of a HMM is {U_{ w }} (w = 1,…,W), which is a sequence of observations for w = 1,…,W. Each hidden state generates an observation with specific probability, P(U_{ w }S_{ w }), which is called the emission probability. In mHMM, the observations are the target genome read count {u}_{w}^{[t]} and the reference read count {u}_{w}^{[r]} at each window w, i.e., U_{ w } with value {\mathit{u}}_{w}={({u}_{w}^{[t]},{u}_{w}^{[r]})}^{\prime}, w = 1,…,W. Detailed description of the mHMM is presented in the following two subsections.
The transition probabilities
For the copy number states k and l, the transition probability a_{ kl }(w) (k,l = 1,…,4) is defined as the conditional probability of the next window w+1 taking copy number state l, given the copy number state k for the current window w. Motivated by Marioni et al. [24], we define the transition matrix which takes the relative positions of windows into consideration. As the distance between adjacent windows increases, the transition probability from state k to state l ≠ k increases and approaches to a positive constant. As the distance between adjacent windows decreases, the probability of a difference in copy number states between windows diminishes.
For each window w, the transition probability is defined as a function of w given by A_{ w } = [a_{ kl }(w)]_{4×4}, where
for k,l = 1,…,4 and w = 1,…,W  1. Here the parameter p_{ kl } ∈ (0,1) affects the transition probabilities from state k to state l, and has the constraint {\sum}_{l\ne k}{p}_{\mathit{\text{kl}}}<1, k = 1,…,4. Constant d_{ w } denotes the physical distance on the genome between the location of window w and the location of window w + 1 (i.e., ℓ_{w+1}  ℓ_{ w }). The parameter ρ is a positivevalued parameter determining the effect of distance on the transition matrix. The distance effect diminishes as ρ approaches ∞. The parameter vector θ represents all the parameters in the model, including p_{ kl } (k,l = 1,…,4;l ≠ k), ρ and all the parameters in the emission distribution to be described in the following section.
The emission distributions
The emission distributions define emission probabilities, which are the conditional joint probabilities of reference and target read counts, given the copy number state of the window. Each window w has two observations, a reference read count {U}_{w}^{[r]}={u}_{w}^{[r]} and a target read count {U}_{w}^{[t]}={u}_{w}^{[t]}. We model the reference read count as {U}_{w}^{[r]}{\lambda}_{w}^{[r]}\sim \text{Poisson}\left({\lambda}_{w}^{[r]}\right), where {\lambda}_{w}^{[r]} follows a Gamma distribution with parameters α and β.
Conditional on the state of window w, one natural choice for the emission distribution of the target read count for window w is
Here K_{1} = 2, K_{2} = 1, K_{3} = 0.5 and K_{4} = 0 are the CNV effects of the four copy number states "gain", "normal", "loss" and "absent", respectively. Parameter c_{0} is a normalization factor that accounts for any discrepancy between the total number of reference and target reads in normal regions. However in real data sets, extra variation and misalignments along the genomic sequence data are inevitable. For example, in a normal genomic segment with no difference in copy number between the target and the reference, there still exist some windows with normalized target and reference read count ratios significantly higher or lower than 1; within a segment of copy number gain (or loss), we also find windows that have normalized target and reference read count ratios significantly lower than 2 (or higher than 0.5). The original HMM introduced above will not only identify true copy number variation signals, but also the local variations caused by random error. Without additional modification, the method tends to show more state changes than those justified by true CNV signals.
The problem of identifying too many CNV change points is also pointed out by Ivakhno et al. [18]. To address this problem, Ivakhno et al.’s CNAseg employs a merging adjustment procedure on the outcomes of the original HMM segmentations using Pearson’s χ^{2} statistics. However, CNAseg segmentation depends heavily on the determination of the merging threshold. In the method we propose, instead of using (2) to model the target read count distribution, we use a Poisson mixture model for each of the four copy number states k = 1,2,3,4:
where \mathit{Q}={[{q}_{\mathit{\text{kj}}}]}_{4\times 4}=\left(\begin{array}{cccc}{q}_{11}& 1{q}_{11}& 0& 0\\ \frac{1{q}_{22}}{2}& {q}_{22}& \frac{1{q}_{22}}{2}& 0\\ 0& \frac{1{q}_{33}}{2}& {q}_{33}& \frac{1{q}_{33}}{2}\\ 0& 0& 1{q}_{44}& {q}_{44}\end{array}\right) with q_{ kk } ∈ (0.5,1) for k = 1,…,4, and \mathit{V}={[{v}_{\mathit{\text{kj}}}]}_{4\times 4}=\left(\begin{array}{cccc}2& {v}_{12}& 0& 0\\ {v}_{21}& 1& {v}_{23}& 0\\ 0& {v}_{32}& 0.5& {v}_{34}\\ 0& 0& {v}_{43}& 0\end{array}\right) with v_{12},v_{21} ∈ (1,2), v_{23},v_{32} ∈ (0.5,1), v_{34},v_{43} ∈ (0,0.5). The diagonal elements v_{ kk } (k = 1,…,4) in V denote the effects of the four copy number states: gain, normal, loss and absent, and have fixed constant values v_{11} = 2, v_{22} = 1, v_{33} = 0.5 and v_{44} = 0. The offdiagonal elements v_{ kj } (k,j = 1,…,4;k ≠ j) account for the uncertainties due to random errors in the sequencing technology such as misalignments, and their values are estimated from the data set. More specifically, parameter v_{12} accounts for the uncertainty in the copy number gain state that allows observed copy numbers to vary between 1 and 2; parameters v_{21} and v_{23} accounts for uncertainty in the normal state that allows observed copy numbers to vary between 0.5 and 2; parameters v_{32} and v_{34} allow observed copy numbers to vary between 0 and 1 within the copy number loss state; and the parameter v_{43} allows observed copy numbers to vary between 0 and 0.5 when the state is the absent state. All other v_{ kj }’s are restricted to 0. As a result, matrix V is a tridiagonal matrix. The weight matrix Q is also a tridiagonal matrix with q_{ kj } (k,j = 1,…,4) being the weights for the Poisson components in the mixture distribution. Parameters q_{ kk } (k = 1,…,4) denote the Poisson component weights for the four copy number states and are restricted between 0.5 and 1. Accordingly, the offdiagonal q_{ kj }’s are less than 0.5. In this way, the effects of the four copy number states dominate the effects of random uncertainties.
Based on the model as specified so far, the joint distribution for the target and the reference read counts at window w, conditional on the hidden state for window w being k, is
with {u}_{w}^{[r]}\in \mathbb{Z}\phantom{\rule{0.3em}{0ex}}\setminus \phantom{\rule{0.3em}{0ex}}{\mathbb{Z}}^{} and {u}_{w}^{[t]}\in \mathbb{Z}\phantom{\rule{0.3em}{0ex}}\setminus \phantom{\rule{0.3em}{0ex}}{\mathbb{Z}}^{}. A detailed derivation for (4) is provided in Additional file 1: Appendix 1. Note that
which is greater than 0. However, it is not possible for both the target and the reference read counts to be zero in the same interval, i.e., the joint distribution of {u}_{w}^{[t]} and {u}_{w}^{[r]} is truncated in the sense that
Consequently, we multiply the j^{th} component of (4) by a constant \frac{{({v}_{\mathit{\text{kj}}}{c}_{0}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}1\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\beta )}^{\alpha}}{{({v}_{\mathit{\text{kj}}}{c}_{0}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}1\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}\beta )}^{\alpha}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}{\beta}^{\alpha}} (j = 1,2,3,4), and obtain the joint probability that satisfies (5). For the target and reference reads at window w given S_{ w } = k, the joint emission probability is
where {\mathit{u}}_{w}=\left(\begin{array}{c}{u}_{w}^{[t]}\\ {u}_{w}^{[r]}\end{array}\right)\in (\mathbb{Z}\setminus {\mathbb{Z}}^{})\times (\mathbb{Z}\setminus {\mathbb{Z}}^{})\setminus \left\{\left(\genfrac{}{}{0ex}{}{0}{0}\right)\right\} and {\mathit{U}}_{w}=\left(\begin{array}{c}{U}_{w}^{[t]}\\ {U}_{w}^{[r]}\end{array}\right).
Parameter estimation using the EM algorithm
We want to find the MLE for the parameter θ = {p,ρ,α,β,c_{0},q,v}, where p and ρ are transition probability parameters, and α, β, c_{0}, q = (q_{11},q_{22},q_{33},q_{44})^{′} and v = (v_{12},v_{21},v_{23},v_{32},v_{34},v_{43})^{′} are emission probability parameters. It is not easy to directly maximize the likelihood with respect to this 26dimensional parameter, so we use the ExpectationMaximization (EM) algorithm to iteratively maximize the likelihood
In (7), u = (u1′,…,u W′)^{′} is a series of the target and reference read counts for all the windows along the specific chromosome, s = (s_{1},…,s_{ W })^{′} ∈ {1,2,3,4}^{W} is a vector of unobserved states for all the windows along the chromosome, and {\pi}_{{s}_{1}}=P({S}_{1}={s}_{1}) is the probability that the copy number state in the first window is s_{1}.
Characterizing the E and M steps
In the EM algorithm, the likelihood function is also called the observed data likelihood {\mathcal{L}}_{\text{(obs)}}=\mathcal{L}(\mathit{\theta}\mathit{u}) because all the observations u are observed. If the hidden states were known, we have the complete data likelihood function as follows:
In these two likelihoods, {\pi}_{{s}_{1}}{\prod}_{w=1}^{W1}{a}_{{s}_{w}{s}_{w+1}}(w) is the probability of that the hidden states for all the windows on the chromosome are s_{1},…,s_{ W }, respectively, and {\prod}_{w=1}^{W}P({\mathit{u}}_{w}\mathit{s},\alpha ,\beta ,{c}_{0},\mathit{q},\mathit{v}) is the probability of the observed read counts u, given the hidden states s_{1},…,s_{ W }.
Given the parameter estimates θ^{(m)} from the iteration mof the EM algorithm, we use the Estep and the Mstep to update the parameter estimate.

The Estep: Evaluate the expectation of the complete data loglikelihood with respect to the conditional distribution of the hidden states S given the observed data u, with θ = θ^{(m)}, i.e., evaluate {E}_{\mathit{S}\mathit{u},{\mathit{\theta}}^{(m)}}\left(log\mathcal{L}(\mathit{\theta}\mathit{u},\mathit{s})\right).

The Mstep: Find θ^{(m+1)}, the θ value that maximizes {E}_{\mathit{S}\mathit{u},{\mathit{\theta}}^{(m)}}\left(log\mathcal{L}(\mathit{\theta}\mathit{u},\mathit{s})\right).
A detailed look at both the E and M steps is provided in the Additional file 1.
Initialization, convergence, and prediction of hidden states
The initial values for all the parameters are defined as follows. In the transition probabilities, we define the initial values {\mathit{p}}^{(0)}\stackrel{\mathrm{\Delta}}{=}{\left({p}_{12}^{(0)},\dots ,{p}_{43}^{(0)}\right)}^{\prime}=0.1\xb7{\mathbf{1}}_{12\times 1} and ρ^{(0)} = 0.5. For the parameters in the emission probabilities, we define {q}_{11}^{(0)}={q}_{22}^{(0)}={q}_{33}^{(0)}={q}_{44}^{(0)}=0.5, {v}_{12}^{(0)}={v}_{21}^{(0)}=1.5, {v}_{23}^{(0)}={v}_{32}^{(0)}=0.75, {v}_{34}^{(0)}={v}_{43}^{(0)}=0.25, and use the maximum likelihood estimates of α and β with all sites assigned with normal copy number state (state 2) as the initial values α^{(0)} and β^{(0)}.
We use
and
to denote the total target and reference aligned counts for sites in normal copy number regions, and {c}_{0}=\frac{{x}^{[t]}}{{x}^{[r]}} to denote the ratio between them. We obtain the initial estimate for x^{[t]} and x^{[r]} as follows. First we calculate the ratio between the read counts of the target genome and the reference genome for all the windows with positive reference genome counts through the whole genome across all chromosomes. Then we separate the ratios into three groups using Kmeans clustering to get three group means: M_{1}, M_{2} and M_{3}. Suppose M_{1} < M_{2} < M_{3}. Then x^{[t](0)} is the sum of the target read counts for the windows belonging to the cluster with mean M_{2}, and x^{[r](0)} is the corresponding sum of reference read counts for the windows belonging to the cluster with mean M_{2}. After obtaining x^{[t](0)} and x^{[r](0)}, we calculate {c}_{0}^{(0)} using \frac{{x}^{[t](0)}}{{x}^{[r](0)}}.
The probabilities of the four hidden states for the first window {\pi}_{{s}_{1}} (s_{1} = 1,2,3,4 and {\sum}_{{s}_{1}=1}^{4}{\pi}_{{s}_{1}}=1) are determined by the target and the reference read counts in the first window. We calculate
If R_{1} > 1.5, then π_{1} = 1, i.e., we assign copy number gain as the initial copy number state to the first window; if 0.75 < R_{1} ≤ 1.5, then π_{2} = 1, i.e., we assign normal as the initial copy number state to the first window; if 0 < R_{1} ≤ 0.75, then π_{3} = 1, i.e., we assign copy number loss as the initial state to the first window; if R_{1} = 0, then π_{4} = 1, i.e., we assign absent as the initial state to the first window.
For all the other windows S_{ w } (w ≥ 2), we use normal state (state 2) as the initial state. All the parameters are updated using the EM algorithm. The iteration stops when the difference between θ^{(m)} and θ^{(m+1)} is small enough, and we use the value of θ at the final iteration (denoted by \widehat{\mathit{\theta}}) as our estimate of θ.
After convergence, the conditional probability of each of the four states {P}_{k}(w)=P\left({S}_{w}=k\mathit{u},\widehat{\mathit{\theta}}\right) (k = 1,2,3,4) is calculated, and the conditional hidden state prediction for window w is given by the state that has the largest conditional probability, i.e., {\u015c}_{w}=\underset{k}{\text{arg max}}\phantom{\rule{0.3em}{0ex}}{P}_{k}(w) for w = 1,…,W.
Change point adjustment
Sites are grouped into windows initially, because we wish to have larger read counts to better capture the copy number variation signals with the mHMM, and it was assumed in the first step that the copy number state remains the same within a window. In reality, it is possible that the copy number change points occur within windows. Figure 1 demonstrates the relation between the estimated change point before adjustment and the true change point. Suppose window w is the window identified by the algorithm where the copy number state changes from state 1 to state 2 with genomic sites grouped into windows. In this plot, the brown color and the blue color segments represent the two sides of the window based segmentation cut point. And the vertical black bar is the boundary of the two windows (window w1 and window w). Suppose g_{ w } = {i_{ II },i_{ II } + 1,…,i_{ III }  1}, so i_{ II } is the index of the first site in window w, which would be identified as the change point if no adjustment is made. With g_{w1} = {i_{ I },i_{ I } + 1,…,i_{ II }  1} and g_{w+1} = {i_{ III },i_{ III } + 1,…,i_{ IV }1}, we have i_{ I } and i_{ III } being the first sites of window w  1 and window w + 1, respectively. The true change point i_{(true)} may happen between sites i_{ I } and i_{ III }1. In this plot, the true change point, represented by the vertical red bar, is within window w1. In order to obtain a more accurate result, the following algorithm makes the adjustment:

1.
For site i between site i _{ I } and site i _{ III }1, we obtain both the total target and the reference counts from i _{ I } to i  1, denoted as {z}_{i(L)}^{[t]} and {z}_{i(L)}^{[r]}; also obtain both the total target and the reference counts from i to i _{ III }  1, denoted as {z}_{i(R)}^{[t]} and {z}_{i(R)}^{[r]}.

2.
Calculate the Pearson’s χ ^{2} test statistic {\sum}_{\tau}{\sum}_{\gamma}\frac{{\left({z}_{i(\gamma )}^{[\tau ]}{\widehat{z}}_{i(\gamma )}^{[\tau ]}\right)}^{2}}{{\widehat{z}}_{i(\gamma )}^{[\tau ]}} using \{{z}_{i(L)}^{[t]},{z}_{i(L)}^{[r]},{z}_{i(R)}^{[t]},{z}_{i(R)}^{[r]}\}, where τ = t or r, γ = L or R, and {\widehat{z}}_{i(\gamma )}^{[\tau ]}=\frac{\left({\sum}_{\gamma}{z}_{i(\gamma )}^{[\tau ]}\right)\left({\sum}_{\tau}{z}_{i(\gamma )}^{[\tau ]}\right)}{{\sum}_{\gamma}{\sum}_{\tau}{z}_{i(\gamma )}^{[\tau ]}}.

3.
Do step 1 and 2 for every site i between i _{ I } and i _{ III }  1; the adjusted breakpoint i _{(adj)} is the one with the largest Pearson’s χ ^{2} test statistic value.
Results and discussions
Simulation studies
To set up golden standard test data sets with known CNV position to evaluate the performance of our method and peer tools, we conducted three simulation studies. In the first and the second studies, we simulated reads counts on each of the target and reference positions, then we simulated CNV segments on the target genome. The simulated sequences in the first two studies are low coverage with depths of 2X. In the third simulation study, we first randomly simulated copy number variation segments, then we simulated pair end Illumina reads and mapped back to the reference genome. The third simulation captures the sequencing process and with high coverage with depths of 30X.
Simulation study 1
The first simulation study is based on real DNA sequencing data from chromosome 4 of the lung cancer cell line NCIH2347 from Chiang et al. [14]. Simulation based on real data can best maintain the characteristic of the data including variation and errors that can affect actual data analyses. We first randomly simulated the genomic positions of the sites along the target and reference genomes using a uniform distribution. We generated the reference and target genome read counts by shuffling the reference genome read counts in chromosome 4 of NCIH2347. After that, we randomly picked 90 CNV segments on the simulated target genome. We considered three sizes for the CNV segments: 10 kb, 50 kb and 100 kb and generated 10 segments for each CNV type. We doubled the read counts for the segments with copy number gains, halved the read counts for the segments with copy number losses and set read counts to 0 for segments with no copies. This is a low coverage sequencing with depth of 2X. We compare the result of the mHMM with the mixture Poisson emission probability in (6), the result using the original HMM with the Poisson emission probability in (2), and the result using Segseq by Chiang et al. [14].
Table 1 is the comparison between the original HMM and the proposed mHMM, as well as the mHMM without change point adjustment. The four true copy number states are listed in the first column. Sensitivities, specificities, empirical false positive rate (EFPR) and empirical false negative rate (EFNR) are computed using (9) and listed in Table 1. From (9) we see that a more accurate method has greater values in sensitivity and specificity, and lower values in EFPR and EFNR.
From Table 1, both the mHMM with and without change point adjustment have greater sensitivities and specificity in all four copy number states than the original HMM. Also, the two mHMM methods have lower EFPR and EFNR in copy number gain, loss and normal states. The original HMM has a slightly smaller EFNR for the absent state, but the EPNRs in copy number gains and losses are far greater than other methods. This means that, with the mixture Poisson emission distribution, the mHMM is less affected by the errant variations in states and can capture true CNV signals better.
Comparing between the mHMM with and without change point adjustment, the sensitivities and specificity are all increased by using the adjustment. Also, the EFPR and EFNR are decreased in copy number gain, loss and normal states, while increased slightly in the absent state. In general, the change point adjustment procedure does give the mHMM better accuracies.
Table 2 lists a comparison between the mHMM and SegSeq. Segseq classifies the copy number states with three categories: normal, copy number gain, and copy number loss. Thus, Segseq does not distinguish between the copy number loss state and the no copy state. In Table 2, Segseq did not identify any sites with copy number gain state in this simulation study, so both the sensitivity and EFNR are 0 in the copy number gain state. Moreover, SegSeq has a large EFNR and a low sensitivity in the loss or absent state. Although SegSeq was slightly better in identifying normal states, this came with the cost of much poorer identification of gains, losses, and absent sites by SegSeq relative to the mHMM method.
Simulation study 2
In this simulation study, we examined the mHMM for copy number variation segments with different lengths. We simulated 30 copy number variation segments with each of the following lengths: 10, 20, 30, 50 and 100 kb. Within the 30 copy number variation segments of the same length, we have 10 segments with copy number gain, 10 segments with copy number loss and 10 segments absent in the sample but present in the reference. The simulation procedure is the same as Simulation Study 1. An identification is considered successful if there is a nonempty intersection between an identified segment and a simulated copy number variation segment with the correct variation type. The numbers of successful identified segments out of 10 are listed in Table 3. When the CNV segments are 30 kb or longer the mHMM method identifies almost all the copy number variation segments of any variation type.
Simulation study 3
In order to examine mHMM methodology with a simulation data set that captures the sequencing process, we conducted a third simulation study. First, we randomly introduced duplications (from 3 copies to 4 copies) and deletions (either homozygous 0 copy or heterozygous 1 copy) with difference sizes (1 kb, 5 kb, 10 kb) in the maize reference genome version 2 (chromosome 6 from website ftp://ftp.ensemblgenomes.org/pub/plants/release10/fasta/zea_mays/dna/) to produce a new genome with CNVs. Five segments were simulated for each of the 20 different combinations (4 different copies × 3 sizes). Consequently, we simulated 10 segments with copy number gains, and 10 segments with copy number losses or absence with each of the 3 CNV segment sizes. After that, we simulated randomly sampled paired end Illumina reads based on the new genome sequence with both read ends of length 100 bp, clone length of average 500 bp and standard deviation of 50 bp, coverage depth of 30X and sequence error rate of 0.002. Then, the simulated sequence reads are mapped back to reference genome using the same parameter as before to produce our test data set with known CNV copy number and position.
We compared the mHMM with SegSeq and CNVnator by Abyzov et al. [25]. The simulation results are listed in Table 4 and Table 5. Table 4 lists the number of correctly detected segments for copy number gain, loss and absent states in the first two rows, and the number of incorrectly detected segments for the true normal state in the last row. Because neither SegSeq nor CNVnator distinguish between copy number losses and absence, the table summarizes the two deletion states in one row. We have 10 segments of copy number gain or loss/absent for each of the three sizes. When the CNV lengths are at least 5 kb, the mHMM identifies all the CNV segments. CNVnator detected all the segments with losses or absent state, and almost all segments with copy number gains. SegSeq detected all the segments with copy number gains, but only detected about half of the loss/absent segments. When segment length equals 1 kb, the mHMM identifies all but one segment, while the detection accuracies for SegSeq and CNVnator are much decreased. The last row of Table 4 lists the number of true normal segments that are incorrectly detected by the methods. Smaller numbers in this row represent better performance. Our proposed mHMM has 3 incorrectly detected segments, which is fewer relative to the other two methods. CNVnator has a large number of false positive detections when the true copy number state is normal.
Table 5 lists the overlap rates of true CNV segments and detected CNV segments by mHMM with and without change point adjustment. The overlap rate is calculated using the overlap genomic length divided by the true CNV genomic length. We see that among all the corrected detected CNV segments, the overlap rates are quite high (over 70%). Also, the change point adjustment makes the overlap rate even higher (over 85%).
Application
We applied the proposed mHMM to compare the sequence data from two maize genotypes: B73 and Mo17. The previously sequenced B73 genome [3] was used as the reference genome with the Mo17 genome as the target. The original data was provided by Gore et al. [26], and was preprocessed by the lab of Schnable. In the data preparation stage, 44 base sequence reads from each of the B73 and Mo17 genomes were aligned to the reference B73 genome. We obtained 4.3 million aligned reads from B73 and 1.54 million aligned reads from Mo17, and 2.3 million genomic positions on the reference B73 genome had positive read counts. Using the mHMM, we found 1096 segments of 2000 bases or longer that have copy number variations that are at least 2 fold increasing/decreasing, among which, Mo17 has 14 segments with copy number gain state, 835 segments with copy number loss state and 247 segments with absent state, compared to B73.
We present the window based mHMM result for chromosome 1, 3, 6 and 10 in Figures 2, 3, 4 and 5. In each figure, the horizontal axis represents the genomic location on the reference B73 genome (in million bases, or Mb), and the vertical axis represents the log of the normalized count ratio between Mo17 and B73, i.e., log\left(\frac{\text{Mo17}}{{c}_{0}\text{B73}}\right) with the log ratio set to 7 when the Mo17 count is zero. The blue, teal, green and red colors represent copy number gain, normal, copy number loss and absent state, respectively.
There are several large segments with few copy number variations between Mo17 and B73. They are 121.3 Mb ∼ 130 Mb on chromosome 1 (Figure 2), 69.2 Mb ∼ 82.0 Mb and 84.8 Mb ∼ 95.9 Mb on chromosome 3 (Figure 3), and 49.5 Mb ∼ 61.5 Mb on chromosome 10 (Figure 5). These results are concordant with previous results using aCGH data [6, 7].
The most extreme copy number loss or absent state detected on Mo17 is in chromosome 6 (Figure 4), located from 42.2 million to 46.2 million bases on the reference genome. Within this segment, the reference genome B73 has 8152 total read counts, and the sample genome Mo17 has 542 total read counts. The copy number ratio is about 0.066, or 0.13 after taking into account the normalization factor c_{0}. Table 6 demonstrates the mHMM segmentation result between 35.3 million and 57.0 million bases. Two other long segments with copy number loss or absent state in chromosome 6 are from 26.5 million to 28.8 million bases, and from 47.8 million to 49.7 million bases. These identifications are also concordant with the result from other studies that compared DNA sequences between B73 and Mo17 using aCGH data [6, 7].
Table 7 lists the detected CNV segments that are longer than 2 Mb and show greater than 2fold normalized changes between B73 and Mo17. Plots for all the maize chromosomes are provided in the Additional file 2: Appendix 2.
Several challenging aspects of this application should be noted when interpreting our results. First, there exists a high level of divergence between the target genome Mo17 and the reference genome B73. Because of the high divergence, it is difficult to map the target reads to the reference genome in some regions. This might partially explain the reason that many of the identified CNVs are classified as copy number loss or absent in Mo17 relative to B73. Moreover, the short read length of 44 bp in this data set makes it even more difficult to correctly map the target reads to the reference genome. Secondly, the average coverage depth is also very low (less than 1X). This makes it difficult to identify small CNV segments. Although higher sequencing depth provides better CNV detection accuracy, Sims et al. [27] pointed out that reasonable detections can still be obtained for large CNVs when the coverage is at least 0.1X.
With fast developing sequencing technologies, higher coverage depths and longer reads are available in may of today’s sequencing data sets. Reads with 100 bp are standard, and it is common to see 150 to 300 bp in Illumina reads. Also, coverage depths of 30X and higher are becoming mainstream. With longer reads, more precise mapping and higher coverage depth, the mHMM is expected to provide better CNV detections with higher accuracy.
It is also important to note that DNA sequencing technologies are potentially affected by biases that arise from several biophysical and chemical features [28]. For example, favorable levels of GCcontent could amplify the coverage of the genome so as to affect CNV detections. Many recent methods have addressed GCcontent bias in CNV detections. Risso et al. [29] and Janevski, et al. [28] reviewed GCcontent normalization tools and methods that are currently being used. Our method is based on direct comparison of two genomes within genomic windows. As long as the withinwindow GC content remains similar across genomes, variation in GC content across the genomes should not bias our mHMM results.
Conclusions
In this paper, we proposed a new methodology to detect CNVs in the DNA sequences from two genotypes using next generation sequencing data. We used a hidden Markov model incorporating a mixture emission probability model to identify the copy number variation change points. The simulations study suggests that the mHMM has better sensitivities and specificities in CNV identifications comparing to other methodologies.
The proposed mHMM was applied to compare the two maize inbred lines B73 and Mo17, and identified CNV change points of the target sequence Mo17 relative to the reference sequence B73. The result of the mHMM is concordant with previous genomic studies using aCGH data by Springer et al. [7] and Belo et al. [6].
In addition, the mHMM can be used to compare two genotypes when only one sample of target reads and only one sample of reference reads are available. Many other existing methods require multiple read samples for both target and reference genotypes. Thus, the mHMM approach we have proposed may be especially useful when financial constraints limit data collection.
Availability of the software and supporting data set
The mHMM software and an example data set are available in the webpage https://www.stt.msu.edu/users/hengwang/mHMM.html. The maize sequencing data set supporting the results of this article can be downloaded through the link ftp://ftp.ensemblgenomes.org/pub/plants/release10/fasta/zea_mays/dna/.
References
Banerjee S, Oldridge D, Poptsova M, Hussain WM, Chakravarty D, Demichelis F: A computational framework discovers new copy number variants with functional importance. PLoS ONE. 2011, 6: 3
Gokcumen O, Lee C: Copy number variants (CNVs) in primate species using arraybased comparative genomic hybridization. Methods. 2009, 49: 1825. 10.1016/j.ymeth.2009.06.001.
Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves T, Minx P, Reily A, Courtney L, Kruchowski S, Tomlinson C, Strong C, Delehaunty K, Fronick C, Courtney B, Rock S, Belter E, Du F, Kim K, Abbott R, Cotton M, Levy A, Marchetto P, Ochoa K, Jackson S, Gillam B, et al: The B73 maize genome: complexity, diversity, and dynamics. Science. 2009, 326: 11121115. 10.1126/science.1178534.
Buckler ES, Gaut BS, McMullen MD: Molecular and functional diversity of maize. Curr Opin Plant Biol. 2006, 9: 172176. 10.1016/j.pbi.2006.01.013.
SwansonWagner RA, Eichten SR, Kumari S, Tiffin P, Stein JC, Ware D, Nathan M, Springer NM: Pervasive gene content variation and copy number variation in maize and its undomesticated progenitor. Genome Res. 2010, 20: 16891699. 10.1101/gr.109165.110.
Belo A, Beatty MK, Hondred D, Fengler KA, Li B, Rafalski A: Allelic Genome structural variation in maize detected by array comparative genome hybridization. Theor Appl Genet. 2010, 120: 355367. 10.1007/s0012200911289.
Springer NM, Ying K, Fu Y, Ji T, Yeh CT, Jia Y, Wu W, Richmond T, Kitzman J, Rosenbaum H, Iniguez AL, Barbazuk WB, Jeddeloh JA, Nettleton D, Schnable PS: Maize inbreds exhibit high levels of copy number variation (CNV) and presence/absence variation (PAV) in genome content. PLoS Genet. 2009, 5: 11
SolinasToldo S, Lampel S, Stilgenbauer S, Nickolenko J, Benner A, Dohner H, Cremer T, Lichter P: Matrixbased comparative genomic hybridization: biochips to screen for genomic imbalances. Gene Chromosome Canc. 1997, 20 (4): 399407. 10.1002/(SICI)10982264(199712)20:4<399::AIDGCC12>3.0.CO;2I.
Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet. 2006, 7: 8597.
Wineinger NE, Kennedy RE, Erickson SW, Wojczynski MK, Bruder CE, Tiwari HK: Statistical issues in the analysis of DNA copy number variations. Int J Comput Biol Drug Des. 2008, 1: 368395. 10.1504/IJCBDD.2008.022208.
Medvedev P, Stanciu M, Brudno M: Computational methods for discovering structural variation with nextgeneration sequencing. Nat Methods. 2009, 6: 1320. 10.1038/nmeth.1374.
Hurd PJ, Nelson CJ: Advantages of nextgeneration sequencing versus the microarray in epigenetic research. Brief Funct Genomics. 2009, 8 (3): 174183. 10.1093/bfgp/elp013.
Su Z, Li Z, Chen T, Li QZ, Fang H, Ding D, Ge W, Ning B, Hong H, Perkins RG, Tong W, Shi L: Comparing nextgeneration sequencing and microarray technologies in a toxicological study of the effects of aristolochic acid on rat kidneys. Chem Res Toxicol. 2011, 24 (9): 14861493. 10.1021/tx200103b.
Chiang DY, Getz G, Jaffe DB, O’Kelly MJ, Zhao X, Carter SL, Russ C, Nusbaum C, Meyerson M, Lander ES: Highresolution mapping of copynumber alterations with massively parallel sequencing. Nat Methods. 2009, 6: 677681. 10.1038/nmeth.1363.
Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2007, 19: 15861592.
Kim TM, Luquette LJ, Xi R, Park PJ: rSWseq: algorithm for detection of copy number alterations in deep sequencing data. BMC Bioinformatics. 2010, 11: 43210.1186/1471210511432.
Magi A, Benelli M, Yoon S, Roviello F, Torricelli F: Detecting common copy number variants in highthroughput sequencing data by using JointSLM algorithm. Nucleic Acids Res. 2011, 39: e6510.1093/nar/gkr068.
Ivakhno S, Royce T, Cox AJ, Evers DJ, Cheetham RK, Tavare S: CNAseg  a novel framework for identification of copy number changes in cancer from secondgeneration sequencing data. Bioinformatics. 2010, 26: 30513058. 10.1093/bioinformatics/btq587.
Xie C, Tammi MT: CNVseq, a new method to detect copy number variation using highthroughtput sequencing. BMC Bioinformatics. 2009, 10: 8010.1186/147121051080.
Baum LE, Petrie T: Statistical inference for probabilistic functions of finite state Markov chains. Ann Math Stat. 1966, 37 (6): 15541563. 10.1214/aoms/1177699147.
Baum LE, Eagon JA: An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology. Bull Am Math Soc. 1967, 73 (3): 360363. 10.1090/S000299041967117518.
Baum LE, Petrie T, Soules G, Weiss N: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat. 1970, 41 (1): 164171. 10.1214/aoms/1177697196.
Baum LE: An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities III: Proceedings of the Third Symposium on Inequalities. Edited by: Shisha O. 1972, New York: Academic Press, 18.
Marioni JC, Thorne NP, Tavare S: BioHMM: A heterogeneous Hidden Markov model for segmenting array CGH data. Bioinformatics. 2006, 22: 11441146. 10.1093/bioinformatics/btl089.
Abyzov A, Urban AE, Snyder M, Gerstein M: CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011, 21 (6): 974984. 10.1101/gr.114876.110.
Gore MA, Chia JM, Elshire RJ, Sun Q, Ersoz ES, Hurwitz BL, Peiffer JA, McMullen MD, Grills GS, RossIbarra J, Ware DH, Buckler ES: A firstgeneration haplotype map of maize. Science. 2009, 326 (5956): 11151117. 10.1126/science.1177837.
Sims D, Sudbery I, Ilott N, Heger A, Ponting C: Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet. 2014, 15: 121132. 10.1038/nrg3642.
Janevski A, Varadan V, Kamalakaran S, Banerjee N, Dimitrova N: Effective normalization for copy number variation detection from whole genome sequencing. BMC Genomics. 2012, 13 (Suppl 6): S1610.1186/1471216413S6S16.
Risso D, Schwartz K, Sherlock G, Dudoit S: GCcontent normalization for RNAseq data. BMC Bioinformatics. 2011, 12: 48010.1186/1471210512480.
Rabiner LR, Juang BH: An introduction to Hidden Markov Models. IEEE ASSP Mag. 1986, 3 (1): 416.
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant No. 1027527.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
HW designed the model and the algorithm, carried out the simulation studies and data application, and drafted the manuscript. DN supervised the study, participated in the model design, and helped draft the manuscript. KY provided the data preprocessing and the model testing. KY also simulated the data set for the third simulation study. All authors read and approved the final manuscript.
Electronic supplementary material
12859_2013_6359_MOESM1_ESM.pdf
Additional file 1: Appendix 1 – More Details in the EM Algorithm [30]. (PDF 172 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
About this article
Cite this article
Wang, H., Nettleton, D. & Ying, K. Copy number variation detection using next generation sequencing read counts. BMC Bioinformatics 15, 109 (2014). https://doi.org/10.1186/1471210515109
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1471210515109