Copy number variation detection using next generation sequencing read counts
 Heng Wang^{1, 2, 3}Email author,
 Dan Nettleton^{4} and
 Kai Ying^{5}
https://doi.org/10.1186/1471210515109
© Wang et al.; licensee BioMed Central Ltd. 2014
Received: 15 August 2013
Accepted: 2 April 2014
Published: 14 April 2014
Abstract
Background
A copy number variation (CNV) is a difference between genotypes in the number of copies of a genomic region. Next generation sequencing (NGS) technologies provide sensitive and accurate tools for detecting genomic variations that include CNVs. However, statistical approaches for CNV identification using NGS are limited. We propose a new methodology for detecting CNVs using NGS data. This method (henceforth denoted by mHMM) is based on a hidden Markov model with emission probabilities that are governed by mixture distributions. We use the ExpectationMaximization (EM) algorithm to estimate the parameters in the model.
Results
A simulation study demonstrates that our proposed mHMM approach has greater power for detecting copy number gains and losses relative to existing methods. Furthermore, application of our mHMM to DNA sequencing data from the two maize inbred lines B73 and Mo17 to identify CNVs that may play a role in creating phenotypic differences between these inbred lines provides results concordant with previous arraybased efforts to identify CNVs.
Conclusions
The new mHMM method is a powerful and practical approach for identifying CNVs from NGS data.
Keywords
Background
Introduction
A copy number variation (CNV) is a variation between genomes in the number of copies of a genomic region that is 1,000 DNA bases (1 Kb) or larger [1]. A CNV is a type of structural variation (SV) because a CNV affects a relatively large region in a DNA molecule. Structural genomic duplications or deletions correspond to copy number gains or losses, respectively. CNVs play an important role in human hereditary illnesses [2] and in plant breeding and agricultural improvement [3].
Maize exhibits extensive variation in both genotype and phenotype relative to the variation seen in humans [4]. The genotypic diversity in maize species permits a variety of uses, such as human and animal food and fuel. The maize genotype B73 was sequenced in 2009 [3]. This accomplishment allows a further comparison and understanding in different types of maize. SwansonWagner et al. [5] and Belo et al. [6] compared a variety of maize inbreds with B73 using array comparative genomic hybridization (aCGH) and identified a considerable number of CNVs along the genome. Springer et al. [7] also analyzed the structural variance between the two maize genotypes B73 and Mo17 using aCGH.
Array comparative genomic hybridization, first proposed in 1997 [8], has served as a robust and effective approach for CNV screening [9]. Statistical methods for analyzing aCGH data are readily available and are described in review articles such as Wineinger et al. [10] and Medvedev et al. [11]. However, aCGH is expensive and has limited resolution and accuracy. Nowadays, rapidly developing next generation sequencing (NGS) technologies provide a sensitive and accurate alternative approach for accessing genomic variations. The quality, speed, and affordability give NGS a significant advantage over microarrays [12, 13].
Despite the advantages of NGS over aCGH, the use of NGS for CNV identification has been limited by a lack of available and effective statistical approaches. The welldeveloped aCGH data analysis methods cannot take the full advantage of NGS data, and thus, new statistical analysis methods for NGS data are needed. Most of the existing methods for CNV detection using NGS data can be classified into two categories: sliding window methods and hidden Markov model (HMM) methods. Sliding window methods include Segseq by Chiang et al. [14], Eventwise testing by Yoon et al. [15], rSWseq by Kim et al. [16] and JointSLM by Magi et al. [17], among others. This category of methods must simultaneously deal with a large number of tests of significance, and the results of such methods are highly dependent on the determination of critical values. One example of the methods using HMMs is CNAseg by Ivakhno et al. [18]. This method relies on borrowing information from additional samples of the two genomes in comparison. In most situations, only one sequenced sample from each genotype is available. The method proposed in this paper is able to detect copy number variation change points even with just one sequenced sample of each genotype.
Data collection and terminology
To understand the data and our model for the data, it is necessary to introduce some NGS data collection details and terminology.
First of all, we use reference genome to describe the genome of the genotype that has been fully sequenced using wholegenome sequencing technologies. In contrast, the target genome is the genome of a genotype of interest that has not been fully sequenced. The goal is to use NGS data from the reference and target genotypes to identify regions of copy number variation between the reference and target genomes. We say that a genomic region in the target genotype where the number of copies is amplified relative to the reference genotype has a copy number gain. A target genomic region present, but at a reduced copy number relative to the reference genome, is said to have a copy number loss. A region that is present in the reference genome but absent in the target genome is described as absent. These three states (copy number gain, copy number loss, and absent) represent copy number variations in the target genome relative to the reference. A region with no difference in the number of copies between the target and the reference genotypes is said to be normal in state. A genomic location where there is a change from one copy number state to another is called a copy number change point.
To identify copy number change points and copy number states, a DNA sample from each of the target and the reference genotypes is obtained. The DNA strands in a sample are fragmented into 100 to 1,000 base segments. At one end of every randomly sampled segment, a sequence of 100 to 300 bases is determined and recorded. Such a sequence of bases is called a read. Each of the reads is then aligned to the reference genome to determine its origin in the genome. The location of the first base of the read on the reference genome is recorded as the position of the read. The numbers of reads for the target genome and the reference genome are recorded as the target read counts and the reference read counts, respectively. If a location has a positive target read count or a positive reference read count, it is called a site. Thus, data from NGS technologies are small nonnegative integer counts with associated site positions on the reference genome.
Preliminary data processing
The data to be analyzed can be described as follows. Suppose the observed reference and target genome read counts are denoted by ${o}_{i}^{[r]}$ and ${o}_{i}^{[t]}$, respectively. The corresponding genomic positions are denoted by h_{ i }, i = 1,…,I. The goal of mHMM is to find the segmentation ${h}_{1}\equiv {h}_{{i}_{0}}<{h}_{{i}_{1}}<{h}_{{i}_{2}}<\dots <{h}_{I}\equiv {h}_{{i}_{J}}$ such that the copy number of the target genome changes between two consecutive segments, and remains the same within a same segment. Because the read counts take small nonnegative integer values, including a large number of zeros, it is difficult to carry out accurate modeling and inference using the original data. Thus, it is more practical to work with sums of counts rather than the original individual counts.
A common way to aggregate the data is to define windows with a specific width and calculate the sum of target and reference read counts within each window, so as to obtain shorter series of larger target and reference read counts. Kim et al. [16] defined windows using a fixed number of read counts in the reference genome. Chiang et al. [14], Xie et al. [19] and Ivakhno et al. [18] defined windows using a fixed genomic distance. These methods have an underlying assumption that the sites within a window share the same copy number state. Such an assumption may be reasonable because a CNV is a somewhat rare type of genomic mutation, and the closer any two sites are located on the genome, the less likely there is a CNV change point between these sites. However there is also a problem in implementing these methods. Sites are randomly located along a genomic sequence, with a high density in some parts of the genome and a low density in other parts of the genome. Rigidly defining windows with a fixed number of read counts has the potential to put sites physically far away from each other into one window, which increases the risk of including copy number change points in a window. Rigidly defining windows with a fixed genomic distance can produce high variation in the number of sites and in read counts across windows. This can lead to decreased accuracy for identifying CNVs.
By this method, the sites that are closest together are more likely to be grouped together in a window, which results in a more reasonable grouping than previously used approaches.
Once sites are grouped into windows, we assume that the hidden states remain the same within windows, and our mHMM method is used to obtain a windowbased segmentation. Next, an adjustment procedure is further performed on the windowbased result to obtain more accurate copy number change points. We refer to the final segmentation result as the sitebased segmentation result.
MixtureHidden Markov Model (mHMM) for a window based result
In this section, we describe the first step of the proposed mixturehidden Markov model (mHMM) that we use to estimate the windowbased copy number change points along the genome. This CNVdetecting methodology is carried out separately on each chromosome.
HMM was described by Baum et al. [20–23]. A HMM is constructed by a bivariate random process {S_{ w },U_{ w }}, w = 1,…,W. One component of this random process, {S_{ w }} (w = 1,…,W), is an unobserved Markov chain with finite states. {S_{ w }} (w = 1,…,W) has Markov property, which means that given the "current state" S_{ w }, the "future state" S_{w+1} and the "past state" S_{w1} are independent, i.e., P(S_{w+1}S_{1},…,S_{ w }) = P(S_{w+1}S_{ w }), w = 1,…,W. The probability a_{ kl } = P(S_{w+1}S_{ w }) is called the transition probability, which determines the probability of the state of w + 1 based on the state of w. In mHMM, the unobserved copy number states of the windows along the chromosome are the hidden states. The copy number states take four values, where
1 = gain: copy number gain/amplification in the target relative to the reference,
2 = normal: no difference in copy number between the target and the reference,
3 = loss: region present in the target genome but at a reduced copy number relative to the reference,
4 = absent: region absent in the sample but present in the reference.
Another important component of a HMM is {U_{ w }} (w = 1,…,W), which is a sequence of observations for w = 1,…,W. Each hidden state generates an observation with specific probability, P(U_{ w }S_{ w }), which is called the emission probability. In mHMM, the observations are the target genome read count ${u}_{w}^{[t]}$ and the reference read count ${u}_{w}^{[r]}$ at each window w, i.e., U_{ w } with value ${\mathit{u}}_{w}={({u}_{w}^{[t]},{u}_{w}^{[r]})}^{\prime}$, w = 1,…,W. Detailed description of the mHMM is presented in the following two subsections.
The transition probabilities
For the copy number states k and l, the transition probability a_{ kl }(w) (k,l = 1,…,4) is defined as the conditional probability of the next window w+1 taking copy number state l, given the copy number state k for the current window w. Motivated by Marioni et al. [24], we define the transition matrix which takes the relative positions of windows into consideration. As the distance between adjacent windows increases, the transition probability from state k to state l ≠ k increases and approaches to a positive constant. As the distance between adjacent windows decreases, the probability of a difference in copy number states between windows diminishes.
for k,l = 1,…,4 and w = 1,…,W  1. Here the parameter p_{ kl } ∈ (0,1) affects the transition probabilities from state k to state l, and has the constraint ${\sum}_{l\ne k}{p}_{\mathit{\text{kl}}}<1$, k = 1,…,4. Constant d_{ w } denotes the physical distance on the genome between the location of window w and the location of window w + 1 (i.e., ℓ_{w+1}  ℓ_{ w }). The parameter ρ is a positivevalued parameter determining the effect of distance on the transition matrix. The distance effect diminishes as ρ approaches ∞. The parameter vector θ represents all the parameters in the model, including p_{ kl } (k,l = 1,…,4;l ≠ k), ρ and all the parameters in the emission distribution to be described in the following section.
The emission distributions
The emission distributions define emission probabilities, which are the conditional joint probabilities of reference and target read counts, given the copy number state of the window. Each window w has two observations, a reference read count ${U}_{w}^{[r]}={u}_{w}^{[r]}$ and a target read count ${U}_{w}^{[t]}={u}_{w}^{[t]}$. We model the reference read count as ${U}_{w}^{[r]}{\lambda}_{w}^{[r]}\sim \text{Poisson}\left({\lambda}_{w}^{[r]}\right)$, where ${\lambda}_{w}^{[r]}$ follows a Gamma distribution with parameters α and β.
Here K_{1} = 2, K_{2} = 1, K_{3} = 0.5 and K_{4} = 0 are the CNV effects of the four copy number states "gain", "normal", "loss" and "absent", respectively. Parameter c_{0} is a normalization factor that accounts for any discrepancy between the total number of reference and target reads in normal regions. However in real data sets, extra variation and misalignments along the genomic sequence data are inevitable. For example, in a normal genomic segment with no difference in copy number between the target and the reference, there still exist some windows with normalized target and reference read count ratios significantly higher or lower than 1; within a segment of copy number gain (or loss), we also find windows that have normalized target and reference read count ratios significantly lower than 2 (or higher than 0.5). The original HMM introduced above will not only identify true copy number variation signals, but also the local variations caused by random error. Without additional modification, the method tends to show more state changes than those justified by true CNV signals.
where $\mathit{Q}={[{q}_{\mathit{\text{kj}}}]}_{4\times 4}=\left(\begin{array}{cccc}{q}_{11}& 1{q}_{11}& 0& 0\\ \frac{1{q}_{22}}{2}& {q}_{22}& \frac{1{q}_{22}}{2}& 0\\ 0& \frac{1{q}_{33}}{2}& {q}_{33}& \frac{1{q}_{33}}{2}\\ 0& 0& 1{q}_{44}& {q}_{44}\end{array}\right)$ with q_{ kk } ∈ (0.5,1) for k = 1,…,4, and $\mathit{V}={[{v}_{\mathit{\text{kj}}}]}_{4\times 4}=\left(\begin{array}{cccc}2& {v}_{12}& 0& 0\\ {v}_{21}& 1& {v}_{23}& 0\\ 0& {v}_{32}& 0.5& {v}_{34}\\ 0& 0& {v}_{43}& 0\end{array}\right)$ with v_{12},v_{21} ∈ (1,2), v_{23},v_{32} ∈ (0.5,1), v_{34},v_{43} ∈ (0,0.5). The diagonal elements v_{ kk } (k = 1,…,4) in V denote the effects of the four copy number states: gain, normal, loss and absent, and have fixed constant values v_{11} = 2, v_{22} = 1, v_{33} = 0.5 and v_{44} = 0. The offdiagonal elements v_{ kj } (k,j = 1,…,4;k ≠ j) account for the uncertainties due to random errors in the sequencing technology such as misalignments, and their values are estimated from the data set. More specifically, parameter v_{12} accounts for the uncertainty in the copy number gain state that allows observed copy numbers to vary between 1 and 2; parameters v_{21} and v_{23} accounts for uncertainty in the normal state that allows observed copy numbers to vary between 0.5 and 2; parameters v_{32} and v_{34} allow observed copy numbers to vary between 0 and 1 within the copy number loss state; and the parameter v_{43} allows observed copy numbers to vary between 0 and 0.5 when the state is the absent state. All other v_{ kj }’s are restricted to 0. As a result, matrix V is a tridiagonal matrix. The weight matrix Q is also a tridiagonal matrix with q_{ kj } (k,j = 1,…,4) being the weights for the Poisson components in the mixture distribution. Parameters q_{ kk } (k = 1,…,4) denote the Poisson component weights for the four copy number states and are restricted between 0.5 and 1. Accordingly, the offdiagonal q_{ kj }’s are less than 0.5. In this way, the effects of the four copy number states dominate the effects of random uncertainties.
where ${\mathit{u}}_{w}=\left(\begin{array}{c}{u}_{w}^{[t]}\\ {u}_{w}^{[r]}\end{array}\right)\in (\mathbb{Z}\setminus {\mathbb{Z}}^{})\times (\mathbb{Z}\setminus {\mathbb{Z}}^{})\setminus \left\{\left(\genfrac{}{}{0ex}{}{0}{0}\right)\right\}$ and ${\mathit{U}}_{w}=\left(\begin{array}{c}{U}_{w}^{[t]}\\ {U}_{w}^{[r]}\end{array}\right)$.
Parameter estimation using the EM algorithm
In (7), u = (u1′,…,u W′)^{′} is a series of the target and reference read counts for all the windows along the specific chromosome, s = (s_{1},…,s_{ W })^{′} ∈ {1,2,3,4}^{ W } is a vector of unobserved states for all the windows along the chromosome, and ${\pi}_{{s}_{1}}=P({S}_{1}={s}_{1})$ is the probability that the copy number state in the first window is s_{1}.
Characterizing the E and M steps
In these two likelihoods, ${\pi}_{{s}_{1}}{\prod}_{w=1}^{W1}{a}_{{s}_{w}{s}_{w+1}}(w)$ is the probability of that the hidden states for all the windows on the chromosome are s_{1},…,s_{ W }, respectively, and ${\prod}_{w=1}^{W}P({\mathit{u}}_{w}\mathit{s},\alpha ,\beta ,{c}_{0},\mathit{q},\mathit{v})$ is the probability of the observed read counts u, given the hidden states s_{1},…,s_{ W }.
Given the parameter estimates θ^{(m)} from the iteration mof the EM algorithm, we use the Estep and the Mstep to update the parameter estimate.

The Estep: Evaluate the expectation of the complete data loglikelihood with respect to the conditional distribution of the hidden states S given the observed data u, with θ = θ^{(m)}, i.e., evaluate ${E}_{\mathit{S}\mathit{u},{\mathit{\theta}}^{(m)}}\left(log\mathcal{L}(\mathit{\theta}\mathit{u},\mathit{s})\right)$.

The Mstep: Find θ^{(m+1)}, the θ value that maximizes ${E}_{\mathit{S}\mathit{u},{\mathit{\theta}}^{(m)}}\left(log\mathcal{L}(\mathit{\theta}\mathit{u},\mathit{s})\right)$.
A detailed look at both the E and M steps is provided in the Additional file 1.
Initialization, convergence, and prediction of hidden states
The initial values for all the parameters are defined as follows. In the transition probabilities, we define the initial values ${\mathit{p}}^{(0)}\stackrel{\mathrm{\Delta}}{=}{\left({p}_{12}^{(0)},\dots ,{p}_{43}^{(0)}\right)}^{\prime}=0.1\xb7{\mathbf{1}}_{12\times 1}$ and ρ^{(0)} = 0.5. For the parameters in the emission probabilities, we define ${q}_{11}^{(0)}={q}_{22}^{(0)}={q}_{33}^{(0)}={q}_{44}^{(0)}=0.5$, ${v}_{12}^{(0)}={v}_{21}^{(0)}=1.5$, ${v}_{23}^{(0)}={v}_{32}^{(0)}=0.75$, ${v}_{34}^{(0)}={v}_{43}^{(0)}=0.25$, and use the maximum likelihood estimates of α and β with all sites assigned with normal copy number state (state 2) as the initial values α^{(0)} and β^{(0)}.
to denote the total target and reference aligned counts for sites in normal copy number regions, and ${c}_{0}=\frac{{x}^{[t]}}{{x}^{[r]}}$ to denote the ratio between them. We obtain the initial estimate for x^{[t]} and x^{[r]} as follows. First we calculate the ratio between the read counts of the target genome and the reference genome for all the windows with positive reference genome counts through the whole genome across all chromosomes. Then we separate the ratios into three groups using Kmeans clustering to get three group means: M_{1}, M_{2} and M_{3}. Suppose M_{1} < M_{2} < M_{3}. Then x^{[t](0)} is the sum of the target read counts for the windows belonging to the cluster with mean M_{2}, and x^{[r](0)} is the corresponding sum of reference read counts for the windows belonging to the cluster with mean M_{2}. After obtaining x^{[t](0)} and x^{[r](0)}, we calculate ${c}_{0}^{(0)}$ using $\frac{{x}^{[t](0)}}{{x}^{[r](0)}}$.
If R_{1} > 1.5, then π_{1} = 1, i.e., we assign copy number gain as the initial copy number state to the first window; if 0.75 < R_{1} ≤ 1.5, then π_{2} = 1, i.e., we assign normal as the initial copy number state to the first window; if 0 < R_{1} ≤ 0.75, then π_{3} = 1, i.e., we assign copy number loss as the initial state to the first window; if R_{1} = 0, then π_{4} = 1, i.e., we assign absent as the initial state to the first window.
For all the other windows S_{ w } (w ≥ 2), we use normal state (state 2) as the initial state. All the parameters are updated using the EM algorithm. The iteration stops when the difference between θ^{(m)} and θ^{(m+1)} is small enough, and we use the value of θ at the final iteration (denoted by $\widehat{\mathit{\theta}}$) as our estimate of θ.
After convergence, the conditional probability of each of the four states ${P}_{k}(w)=P\left({S}_{w}=k\mathit{u},\widehat{\mathit{\theta}}\right)$ (k = 1,2,3,4) is calculated, and the conditional hidden state prediction for window w is given by the state that has the largest conditional probability, i.e., ${\u015c}_{w}=\underset{k}{\text{arg max}}\phantom{\rule{0.3em}{0ex}}{P}_{k}(w)$ for w = 1,…,W.
Change point adjustment
 1.
For site i between site i _{ I } and site i _{ III }1, we obtain both the total target and the reference counts from i _{ I } to i  1, denoted as ${z}_{i(L)}^{[t]}$ and ${z}_{i(L)}^{[r]}$; also obtain both the total target and the reference counts from i to i _{ III }  1, denoted as ${z}_{i(R)}^{[t]}$ and ${z}_{i(R)}^{[r]}$.
 2.
Calculate the Pearson’s χ ^{2} test statistic ${\sum}_{\tau}{\sum}_{\gamma}\frac{{\left({z}_{i(\gamma )}^{[\tau ]}{\widehat{z}}_{i(\gamma )}^{[\tau ]}\right)}^{2}}{{\widehat{z}}_{i(\gamma )}^{[\tau ]}}$ using $\{{z}_{i(L)}^{[t]},{z}_{i(L)}^{[r]},{z}_{i(R)}^{[t]},{z}_{i(R)}^{[r]}\}$, where τ = t or r, γ = L or R, and ${\widehat{z}}_{i(\gamma )}^{[\tau ]}=\frac{\left({\sum}_{\gamma}{z}_{i(\gamma )}^{[\tau ]}\right)\left({\sum}_{\tau}{z}_{i(\gamma )}^{[\tau ]}\right)}{{\sum}_{\gamma}{\sum}_{\tau}{z}_{i(\gamma )}^{[\tau ]}}$.
 3.
Do step 1 and 2 for every site i between i _{ I } and i _{ III }  1; the adjusted breakpoint i _{(adj)} is the one with the largest Pearson’s χ ^{2} test statistic value.
Results and discussions
Simulation studies
To set up golden standard test data sets with known CNV position to evaluate the performance of our method and peer tools, we conducted three simulation studies. In the first and the second studies, we simulated reads counts on each of the target and reference positions, then we simulated CNV segments on the target genome. The simulated sequences in the first two studies are low coverage with depths of 2X. In the third simulation study, we first randomly simulated copy number variation segments, then we simulated pair end Illumina reads and mapped back to the reference genome. The third simulation captures the sequencing process and with high coverage with depths of 30X.
Simulation study 1
The first simulation study is based on real DNA sequencing data from chromosome 4 of the lung cancer cell line NCIH2347 from Chiang et al. [14]. Simulation based on real data can best maintain the characteristic of the data including variation and errors that can affect actual data analyses. We first randomly simulated the genomic positions of the sites along the target and reference genomes using a uniform distribution. We generated the reference and target genome read counts by shuffling the reference genome read counts in chromosome 4 of NCIH2347. After that, we randomly picked 90 CNV segments on the simulated target genome. We considered three sizes for the CNV segments: 10 kb, 50 kb and 100 kb and generated 10 segments for each CNV type. We doubled the read counts for the segments with copy number gains, halved the read counts for the segments with copy number losses and set read counts to 0 for segments with no copies. This is a low coverage sequencing with depth of 2X. We compare the result of the mHMM with the mixture Poisson emission probability in (6), the result using the original HMM with the Poisson emission probability in (2), and the result using Segseq by Chiang et al. [14].
Comparison among mHMM with and without change point adjustment, and original HMM in simulation study 1
mHMM  mHMM no adj.  original HMM  

Gain  Sensitivity  0.917  0.915  0.863 
EFNR  0.097  0.174  0.751  
Normal  Specificity  0.996  0.994  0.945 
EFPR  0.002  0.002  0.003  
Loss  Sensitivity  0.858  0.831  0.826 
EFNR  0.309  0.434  0.846  
Absent  Sensitivity  0.960  0.857  0.810 
EFNR  0.011  0.005  0.000 
From Table 1, both the mHMM with and without change point adjustment have greater sensitivities and specificity in all four copy number states than the original HMM. Also, the two mHMM methods have lower EFPR and EFNR in copy number gain, loss and normal states. The original HMM has a slightly smaller EFNR for the absent state, but the EPNRs in copy number gains and losses are far greater than other methods. This means that, with the mixture Poisson emission distribution, the mHMM is less affected by the errant variations in states and can capture true CNV signals better.
Comparing between the mHMM with and without change point adjustment, the sensitivities and specificity are all increased by using the adjustment. Also, the EFPR and EFNR are decreased in copy number gain, loss and normal states, while increased slightly in the absent state. In general, the change point adjustment procedure does give the mHMM better accuracies.
Comparison between mHMM and SegSeq in simulation study 1
mHMM  SegSeq  

Gain  Sensitivity  0.917  0.000 
EFNR  0.097  0.000  
Normal  Specificity  0.996  0.998 
EFPR  0.002  0.019  
Loss or absent  Sensitivity  0.898  0.076 
EFNR  0.223  0.678 
Simulation study 2
mHMM segmentation with different CNV lengths in simulation study 2
Lengths  10 kb  20 kb  30 kb  50 kb  100 kb 

Loss  0  4  9  10  10 
Gain  3  5  10  10  10 
Absent  2  9  10  10  10 
Simulation study 3
In order to examine mHMM methodology with a simulation data set that captures the sequencing process, we conducted a third simulation study. First, we randomly introduced duplications (from 3 copies to 4 copies) and deletions (either homozygous 0 copy or heterozygous 1 copy) with difference sizes (1 kb, 5 kb, 10 kb) in the maize reference genome version 2 (chromosome 6 from website ftp://ftp.ensemblgenomes.org/pub/plants/release10/fasta/zea_mays/dna/) to produce a new genome with CNVs. Five segments were simulated for each of the 20 different combinations (4 different copies × 3 sizes). Consequently, we simulated 10 segments with copy number gains, and 10 segments with copy number losses or absence with each of the 3 CNV segment sizes. After that, we simulated randomly sampled paired end Illumina reads based on the new genome sequence with both read ends of length 100 bp, clone length of average 500 bp and standard deviation of 50 bp, coverage depth of 30X and sequence error rate of 0.002. Then, the simulated sequence reads are mapped back to reference genome using the same parameter as before to produce our test data set with known CNV copy number and position.
Counts of correctly detected CNV segments and false positively detected normal segments, comparing between mHMM, SegSeq and CNVnator in simulation study 3
Detected  mHMM  SegSeq  CNVnator  

CNV segment counts  1 kb  5 kb  10kb  1 kb  5 kb  10 kb  1 kb  5 kb  10 kb  
Gain  Correct detection  10  10  10  7  10  10  6  9  10 
Loss or absent  Correct detection  9  10  10  7  4  6  8  10  10 
Normal  False positives  0  0  3  1  7  6  436  466  399 
Overlap rate of true CNV segments and detected CNV segments in simulation study 3, comparing between mHMM with and without change point adjustment
Overlap rate  CNV segment length  

1 kb  5 kb  10 kb  
mHMM without adj.  0.724  0.912  0.902 
mHMM  0.859  0.971  0.927 
Table 5 lists the overlap rates of true CNV segments and detected CNV segments by mHMM with and without change point adjustment. The overlap rate is calculated using the overlap genomic length divided by the true CNV genomic length. We see that among all the corrected detected CNV segments, the overlap rates are quite high (over 70%). Also, the change point adjustment makes the overlap rate even higher (over 85%).
Application
We applied the proposed mHMM to compare the sequence data from two maize genotypes: B73 and Mo17. The previously sequenced B73 genome [3] was used as the reference genome with the Mo17 genome as the target. The original data was provided by Gore et al. [26], and was preprocessed by the lab of Schnable. In the data preparation stage, 44 base sequence reads from each of the B73 and Mo17 genomes were aligned to the reference B73 genome. We obtained 4.3 million aligned reads from B73 and 1.54 million aligned reads from Mo17, and 2.3 million genomic positions on the reference B73 genome had positive read counts. Using the mHMM, we found 1096 segments of 2000 bases or longer that have copy number variations that are at least 2 fold increasing/decreasing, among which, Mo17 has 14 segments with copy number gain state, 835 segments with copy number loss state and 247 segments with absent state, compared to B73.
There are several large segments with few copy number variations between Mo17 and B73. They are 121.3 Mb ∼ 130 Mb on chromosome 1 (Figure 2), 69.2 Mb ∼ 82.0 Mb and 84.8 Mb ∼ 95.9 Mb on chromosome 3 (Figure 3), and 49.5 Mb ∼ 61.5 Mb on chromosome 10 (Figure 5). These results are concordant with previous results using aCGH data [6, 7].
The mHMM segmentation result for chromosome 6 from 35.3 Mb to 57.0 Mb
Start Position  End Position  Mo17 Read counts  B73 Read counts  Mo17/B73^{∗} 

35,274,511  36,847,009  1,511  2291  1.33 
36,847,028  36,986,070  81  242  0.67 
36,988,155  39,227,681  2,375  3850  1.21 
39,227,711  39,392,612  53  294  0.36 
39,392,816  39,998,180  605  810  1.51 
39,998,210  40,070,876  114  95  2.43 
40,071,001  42,219,831  1,978  2751  1.45 
42,221,884  42,272,394  3  305  0.01 
42,273,061  42,279,935  0  37  0.00 
42,281,369  42,467,432  6  484  0.02 
42,467,722  42,510,527  0  199  0.00 
42,520,362  42,588,228  10  134  0.15 
42,589,445  42,592,668  0  16  0.00 
42,600,850  42,949,605  41  734  0.11 
42,957,788  43,020,977  0  213  0.00 
43,021,808  43,021,808  1  0  50.00 
43,021,875  43,034,674  0  20  0.00 
43,050,012  43,156,464  7  214  0.06 
43,159,602  43,239,996  0  110  0.00 
43,240,069  43,240,069  4  1  8.11 
43,240,496  43,253,531  0  25  0.00 
43,270,468  43,367,135  27  396  0.13 
43,382,243  43,436,720  0  211  0.00 
43,437,634  43,437,634  1  0  50.00 
43,437,664  43,438,342  0  19  0.00 
43,440,089  43,466,314  8  146  0.11 
43,466,499  43,502,281  0  183  0.00 
43,502,441  43,502,441  1  0  50.00 
43,504,419  43,516,326  0  3  0.00 
43,532,757  43,802,344  29  815  0.077 
43,802,387  43,809,486  0  214  0.00 
43,809,952  43,809,952  4  0  50.00 
43,809,982  43,811,684  0  47  0.00 
43,812,791  43,828,685  1  128  0.01 
43,833,620  43,862,804  0  121  0.00 
43,863,025  43,863,025  1  0  50.00 
43,866,525  43,959,900  0  164  0.00 
43,959,976  43,960,274  2  10  0.40 
43,960,449  43,984,663  0  95  0.00 
43,991,670  44,242,537  24  498  0.09 
44,247,904  44,297,744  0  50  0.00 
44,297,961  44,297,961  1  0  50.00 
44,299,769  44,341,066  0  252  0.00 
44,341,857  46,215,928  371  2308  0.32 
46,215,958  46,834,832  572  895  1.29 
46,835,472  47,130,192  74  545  0.27 
47,130,202  47,668,306  642  833  1.56 
47,681,491  47,739,876  18  72  0.50 
47,740,273  47,740,273  0  2  0.00 
47,746,723  48,880,371  283  1356  0.42 
48,880,568  48,923,056  0  51  0.00 
48,923,160  49,722,664  213  1128  0.38 
49,722,844  50,132,724  231  298  1.57 
50,132,929  50,239,093  28  119  0.47 
50,260,756  52,852,085  2,487  3567  1.41 
52,852,089  52,958,397  155  130  2.41 
52,958,403  53,950,690  918  1453  1.28 
53,951,899  56,137,418  895  3196  0.56 
56,142,451  56,472,218  296  570  1.05 
56,472,614  56,863,324  102  570  0.36 
56,866,117  57,022,366  123  212  1.17 
This table lists the detected CNV segments that are longer than 2 Mb and presents greater than 2fold normalized read count differences between B73 and Mo17
Start Position  End Position  Mo17/B73^{∗}  

chr 1  7,674,194  10,130,535  0.47 
73,051,891  76,150,311  0.49  
78,765,566  80,973,937  0.40  
106,020,503  108,085,310  0.46  
183,346,370  185,514,363  0.34  
185,561,332  187,744,738  0.34  
206,016,332  208,795,988  0.49  
chr 2  14,327,370  17,154,616  0.49 
43,421,667  45,665,032  0.47  
53,360,846  57,062,635  0.48  
101,464,135  103,592,856  0.37  
173,361,552  175,485,424  0.46  
183,250,632  186,406,359  0.46  
209,163,585  211,227,635  0.48  
211,597,012  214,704,617  0.48  
chr 3  3,653,572  5,790,383  0.44 
45,465,224  48,749,064  0.40  
82,020,005  84,053,318  0.48  
176,248,175  179,361,747  0.48  
chr 4  645  2,454,064  0.45 
23,025,833  25,854,581  0.48  
136,792,400  140,387,061  0.36  
140,502,832  143,245,529  0.49  
146,672,844  148,733,733  0.44  
239,597,061  242,255,833  0.48  
243,186,165  245,433,834  0.49  
chr 5  753,941  4,311,600  0.48 
5,201,605  7,723,585  0.48  
11,328,383  14,574,771  0.49  
16,611,265  19,032,650  0.44  
68,583,278  71,949,086  0.40  
156,325,001  158,697,338  0.47  
168,069,569  170,816,075  0.39  
176,121,726  179,598,151  0.43  
191,459,521  195,013,031  0.48  
chr 6  25,898,308  28,476,332  0.33 
70,440,134  72,858,938  0.46  
76,981,074  80,570,101  0.46  
85,620,416  87,659,132  0.48  
102,350,617  104,422,628  0.49  
154,917,250  159,407,299  0.47  
162,488,577  165,722,159  0.48  
chr 7  18,304,500  22,673,932  0.49 
25,498,467  30,547,169  0.47  
78,344,496  81,198,663  0.40  
110,502,115  113,262,773  0.49  
117,913,056  119,946,052  0.49  
120,044,241  122,233,553  0.49  
chr 8  20,108,771  23,447,168  0.43 
160,011,805  166,514,647  0.45  
167,430,533  169,880,960  0.45  
chr 9  38,523,842  40,732,642  0.41 
46,734,889  49,382,812  0.38  
61,499,201  64,577,206  0.39  
83,491,462  88,345,837  0.46  
118,555,633  121,557,270  0.49  
143,196,239  145,630,853  0.45  
145,813,631  150,079,829  0.46  
chr 10  20,262,176  24,734,243  0.48 
39,046,734  41,852,436  0.46  
72,370,857  74,553,412  0.42  
111,247,775  116,238,001  0.46  
117,013,634  120,679,932  0.49  
121,011,209  124,203,304  0.49  
130,590,451  139,841,271  0.46 
Several challenging aspects of this application should be noted when interpreting our results. First, there exists a high level of divergence between the target genome Mo17 and the reference genome B73. Because of the high divergence, it is difficult to map the target reads to the reference genome in some regions. This might partially explain the reason that many of the identified CNVs are classified as copy number loss or absent in Mo17 relative to B73. Moreover, the short read length of 44 bp in this data set makes it even more difficult to correctly map the target reads to the reference genome. Secondly, the average coverage depth is also very low (less than 1X). This makes it difficult to identify small CNV segments. Although higher sequencing depth provides better CNV detection accuracy, Sims et al. [27] pointed out that reasonable detections can still be obtained for large CNVs when the coverage is at least 0.1X.
With fast developing sequencing technologies, higher coverage depths and longer reads are available in may of today’s sequencing data sets. Reads with 100 bp are standard, and it is common to see 150 to 300 bp in Illumina reads. Also, coverage depths of 30X and higher are becoming mainstream. With longer reads, more precise mapping and higher coverage depth, the mHMM is expected to provide better CNV detections with higher accuracy.
It is also important to note that DNA sequencing technologies are potentially affected by biases that arise from several biophysical and chemical features [28]. For example, favorable levels of GCcontent could amplify the coverage of the genome so as to affect CNV detections. Many recent methods have addressed GCcontent bias in CNV detections. Risso et al. [29] and Janevski, et al. [28] reviewed GCcontent normalization tools and methods that are currently being used. Our method is based on direct comparison of two genomes within genomic windows. As long as the withinwindow GC content remains similar across genomes, variation in GC content across the genomes should not bias our mHMM results.
Conclusions
In this paper, we proposed a new methodology to detect CNVs in the DNA sequences from two genotypes using next generation sequencing data. We used a hidden Markov model incorporating a mixture emission probability model to identify the copy number variation change points. The simulations study suggests that the mHMM has better sensitivities and specificities in CNV identifications comparing to other methodologies.
The proposed mHMM was applied to compare the two maize inbred lines B73 and Mo17, and identified CNV change points of the target sequence Mo17 relative to the reference sequence B73. The result of the mHMM is concordant with previous genomic studies using aCGH data by Springer et al. [7] and Belo et al. [6].
In addition, the mHMM can be used to compare two genotypes when only one sample of target reads and only one sample of reference reads are available. Many other existing methods require multiple read samples for both target and reference genotypes. Thus, the mHMM approach we have proposed may be especially useful when financial constraints limit data collection.
Availability of the software and supporting data set
The mHMM software and an example data set are available in the webpage https://www.stt.msu.edu/users/hengwang/mHMM.html. The maize sequencing data set supporting the results of this article can be downloaded through the link ftp://ftp.ensemblgenomes.org/pub/plants/release10/fasta/zea_mays/dna/.
Declarations
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant No. 1027527.
Authors’ Affiliations
References
 Banerjee S, Oldridge D, Poptsova M, Hussain WM, Chakravarty D, Demichelis F: A computational framework discovers new copy number variants with functional importance. PLoS ONE. 2011, 6: 3Google Scholar
 Gokcumen O, Lee C: Copy number variants (CNVs) in primate species using arraybased comparative genomic hybridization. Methods. 2009, 49: 1825. 10.1016/j.ymeth.2009.06.001.View ArticlePubMed CentralPubMedGoogle Scholar
 Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves T, Minx P, Reily A, Courtney L, Kruchowski S, Tomlinson C, Strong C, Delehaunty K, Fronick C, Courtney B, Rock S, Belter E, Du F, Kim K, Abbott R, Cotton M, Levy A, Marchetto P, Ochoa K, Jackson S, Gillam B, et al: The B73 maize genome: complexity, diversity, and dynamics. Science. 2009, 326: 11121115. 10.1126/science.1178534.View ArticlePubMedGoogle Scholar
 Buckler ES, Gaut BS, McMullen MD: Molecular and functional diversity of maize. Curr Opin Plant Biol. 2006, 9: 172176. 10.1016/j.pbi.2006.01.013.View ArticlePubMedGoogle Scholar
 SwansonWagner RA, Eichten SR, Kumari S, Tiffin P, Stein JC, Ware D, Nathan M, Springer NM: Pervasive gene content variation and copy number variation in maize and its undomesticated progenitor. Genome Res. 2010, 20: 16891699. 10.1101/gr.109165.110.View ArticlePubMed CentralPubMedGoogle Scholar
 Belo A, Beatty MK, Hondred D, Fengler KA, Li B, Rafalski A: Allelic Genome structural variation in maize detected by array comparative genome hybridization. Theor Appl Genet. 2010, 120: 355367. 10.1007/s0012200911289.View ArticlePubMedGoogle Scholar
 Springer NM, Ying K, Fu Y, Ji T, Yeh CT, Jia Y, Wu W, Richmond T, Kitzman J, Rosenbaum H, Iniguez AL, Barbazuk WB, Jeddeloh JA, Nettleton D, Schnable PS: Maize inbreds exhibit high levels of copy number variation (CNV) and presence/absence variation (PAV) in genome content. PLoS Genet. 2009, 5: 11View ArticleGoogle Scholar
 SolinasToldo S, Lampel S, Stilgenbauer S, Nickolenko J, Benner A, Dohner H, Cremer T, Lichter P: Matrixbased comparative genomic hybridization: biochips to screen for genomic imbalances. Gene Chromosome Canc. 1997, 20 (4): 399407. 10.1002/(SICI)10982264(199712)20:4<399::AIDGCC12>3.0.CO;2I.View ArticleGoogle Scholar
 Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet. 2006, 7: 8597.View ArticlePubMedGoogle Scholar
 Wineinger NE, Kennedy RE, Erickson SW, Wojczynski MK, Bruder CE, Tiwari HK: Statistical issues in the analysis of DNA copy number variations. Int J Comput Biol Drug Des. 2008, 1: 368395. 10.1504/IJCBDD.2008.022208.View ArticlePubMed CentralPubMedGoogle Scholar
 Medvedev P, Stanciu M, Brudno M: Computational methods for discovering structural variation with nextgeneration sequencing. Nat Methods. 2009, 6: 1320. 10.1038/nmeth.1374.View ArticleGoogle Scholar
 Hurd PJ, Nelson CJ: Advantages of nextgeneration sequencing versus the microarray in epigenetic research. Brief Funct Genomics. 2009, 8 (3): 174183. 10.1093/bfgp/elp013.View ArticleGoogle Scholar
 Su Z, Li Z, Chen T, Li QZ, Fang H, Ding D, Ge W, Ning B, Hong H, Perkins RG, Tong W, Shi L: Comparing nextgeneration sequencing and microarray technologies in a toxicological study of the effects of aristolochic acid on rat kidneys. Chem Res Toxicol. 2011, 24 (9): 14861493. 10.1021/tx200103b.View ArticlePubMedGoogle Scholar
 Chiang DY, Getz G, Jaffe DB, O’Kelly MJ, Zhao X, Carter SL, Russ C, Nusbaum C, Meyerson M, Lander ES: Highresolution mapping of copynumber alterations with massively parallel sequencing. Nat Methods. 2009, 6: 677681. 10.1038/nmeth.1363.View ArticleGoogle Scholar
 Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2007, 19: 15861592.View ArticleGoogle Scholar
 Kim TM, Luquette LJ, Xi R, Park PJ: rSWseq: algorithm for detection of copy number alterations in deep sequencing data. BMC Bioinformatics. 2010, 11: 43210.1186/1471210511432.View ArticlePubMed CentralPubMedGoogle Scholar
 Magi A, Benelli M, Yoon S, Roviello F, Torricelli F: Detecting common copy number variants in highthroughput sequencing data by using JointSLM algorithm. Nucleic Acids Res. 2011, 39: e6510.1093/nar/gkr068.View ArticlePubMed CentralPubMedGoogle Scholar
 Ivakhno S, Royce T, Cox AJ, Evers DJ, Cheetham RK, Tavare S: CNAseg  a novel framework for identification of copy number changes in cancer from secondgeneration sequencing data. Bioinformatics. 2010, 26: 30513058. 10.1093/bioinformatics/btq587.View ArticlePubMedGoogle Scholar
 Xie C, Tammi MT: CNVseq, a new method to detect copy number variation using highthroughtput sequencing. BMC Bioinformatics. 2009, 10: 8010.1186/147121051080.View ArticlePubMed CentralPubMedGoogle Scholar
 Baum LE, Petrie T: Statistical inference for probabilistic functions of finite state Markov chains. Ann Math Stat. 1966, 37 (6): 15541563. 10.1214/aoms/1177699147.View ArticleGoogle Scholar
 Baum LE, Eagon JA: An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology. Bull Am Math Soc. 1967, 73 (3): 360363. 10.1090/S000299041967117518.View ArticleGoogle Scholar
 Baum LE, Petrie T, Soules G, Weiss N: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat. 1970, 41 (1): 164171. 10.1214/aoms/1177697196.View ArticleGoogle Scholar
 Baum LE: An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities III: Proceedings of the Third Symposium on Inequalities. Edited by: Shisha O. 1972, New York: Academic Press, 18.Google Scholar
 Marioni JC, Thorne NP, Tavare S: BioHMM: A heterogeneous Hidden Markov model for segmenting array CGH data. Bioinformatics. 2006, 22: 11441146. 10.1093/bioinformatics/btl089.View ArticlePubMedGoogle Scholar
 Abyzov A, Urban AE, Snyder M, Gerstein M: CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011, 21 (6): 974984. 10.1101/gr.114876.110.View ArticlePubMed CentralPubMedGoogle Scholar
 Gore MA, Chia JM, Elshire RJ, Sun Q, Ersoz ES, Hurwitz BL, Peiffer JA, McMullen MD, Grills GS, RossIbarra J, Ware DH, Buckler ES: A firstgeneration haplotype map of maize. Science. 2009, 326 (5956): 11151117. 10.1126/science.1177837.View ArticlePubMedGoogle Scholar
 Sims D, Sudbery I, Ilott N, Heger A, Ponting C: Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet. 2014, 15: 121132. 10.1038/nrg3642.View ArticlePubMedGoogle Scholar
 Janevski A, Varadan V, Kamalakaran S, Banerjee N, Dimitrova N: Effective normalization for copy number variation detection from whole genome sequencing. BMC Genomics. 2012, 13 (Suppl 6): S1610.1186/1471216413S6S16.View ArticlePubMed CentralPubMedGoogle Scholar
 Risso D, Schwartz K, Sherlock G, Dudoit S: GCcontent normalization for RNAseq data. BMC Bioinformatics. 2011, 12: 48010.1186/1471210512480.View ArticlePubMed CentralPubMedGoogle Scholar
 Rabiner LR, Juang BH: An introduction to Hidden Markov Models. IEEE ASSP Mag. 1986, 3 (1): 416.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.