- Methodology article
- Open Access

# hsegHMM: hidden Markov model-based allele-specific copy number alteration analysis accounting for hypersegmentation

- Hyoyoung Choo-Wosoba
^{1}, - Paul S. Albert
^{1}and - Bin Zhu
^{1}Email author

**Received:**1 September 2017**Accepted:**9 October 2018**Published:**14 November 2018

## Abstract

### Background

Somatic copy number alternation (SCNA) is a common feature of the cancer genome and is associated with cancer etiology and prognosis. The allele-specific SCNA analysis of a tumor sample aims to identify the allele-specific copy numbers of both alleles, adjusting for the ploidy and the tumor purity. Next generation sequencing platforms produce abundant read counts at the base-pair resolution across the exome or whole genome which is susceptible to hypersegmentation, a phenomenon where numerous regions with very short length are falsely identified as SCNA.

### Results

We propose hsegHMM, a hidden Markov model approach that accounts for hypersegmentation for allele-specific SCNA analysis. hsegHMM provides statistical inference of copy number profiles by using an efficient E-M algorithm procedure. Through simulation and application studies, we found that hsegHMM handles hypersegmentation effectively with a t-distribution as a part of the emission probability distribution structure and a carefully defined state space. We also compared hsegHMM with FACETS which is a current method for allele-specific SCNA analysis. For the application, we use a renal cell carcinoma sample from The Cancer Genome Atlas (TCGA) study.

### Conclusions

We demonstrate the robustness of hsegHMM to hypersegmentation. Furthermore, hsegHMM provides the quantification of uncertainty in identifying allele-specific SCNAs over the entire chromosomes. hsegHMM performs better than FACETS when read depth (coverage) is uneven across the genome.

## Keywords

- Allele-specific somatic copy number alteration
- Hidden Markov model
- Hypersegmentation
- Next-generation sequencing
- The cancer genome Atlas study

## Background

Characterizing somatic copy number alterations (SCNAs) is important for understanding tumorgenesis [1], cancer etiology and prognosis [2]. In normal cells, two copies of chromosome are inherited from both parents. In contrast, tumor cells frequently contain alterations in copy numbers across the chromosomes, such as deletions, insertions, or amplifications among others [3, 4]. In addition, tumor tissues always contain normal cells (reduced tumor purity) and frequently show an abnormal number of chromosomes (aneuploidy). These characteristics of the cancer genome and tissue heterogeneity complicate the estimation of SCNAs, in contrast to germline copy number variations (CNVs) analysis where neither should be considered [5, 6]. Allele-specific SCNA analysis estimates the integer copy number for each allele instead of the total copy number, and is essential to identify the copy-neutral loss of heterozygosity (NLOH) [7, 8]. Based on the total copy numbers only, NLOH will be misidentified as normal regions with the copy number two, when one chromosome is duplicated but the corresponding homologous region is deleted [9].

In this paper, we consider next-generation sequencing (NGS) platform-based whole exome sequencing (WES) data for studying SCNAs. The NGS technology provides high resolution at the single base-pair, which comes with mapping bias and the tendency for hypersegmentation. Mapping bias occurs from higher mapping rates for the reference allele than those for the variant allele at heterozygous loci [10]. This bias leads to incorrect interpretations of allele-specific SCNAs. Hypersegmentation is also a major challenge in NGS-based allele-specific SCNA. The quality of such data depends on the sample preparation, the library preparation, and polymerase chain reaction (PCR) techniques from applying NGS technology, and the exome enrichment platforms from WES. It has been reported that capture efficiency could vary across the percentage of guanine or cytosine contained in DNA [11]. These technical procedures have a limit of accurate quantification of sequences, which potentially increase measurement errors that in turn result in excessive segmentations.

A number of papers have been proposed to address these challenges and complexities. While PennCNV [12] and QuantiSNP [13] are based on the assumption of 100% tumor purity, ASCAT [14], GPHMM [5], and MixHMM [15] account for both the tumor purity and the ploidy. However, these methods do not explicitly characterize the genotype at each allelic location. Furthermore, these methods use a B-allele frequency (BAF), which is sensitive to mapping bias. Shen and Seshan [16] developed FACETS that uses log Odds Ratio (logOR) instead of BAF, since the logOR of tumor versus normal cells provides unbiased allelic information. FACETS uses a genotype mixture model, providing an allele-specific tumor copy number profile adjusted for the tumor purity and ploidy. However, since the segmentation and genotype mixture modeling are conducted by separate algorithms, it is not possible to assess uncertainty in the estimation of allele-specific SCNA.

In this paper, we propose a novel hidden Markov modeling approach (hsegHMM) for allele-specific SCNA analysis accounting for the hypersegmentation. hsegHMM embeds logOR and logR (log R ratio) into a hidden Markov model (HMM) framework, and simultaneously conducts the segmentation and genotype mixture modeling required to identify SCNAs across chromosomes. Similar to FACETS, the logOR is applied instead of the BAF to adjust for mapping bias. Hypersegmentation, which may result from logR outliers, is accounted for by assuming a t-distribution for the distribution of logR. hsegHMM makes inference about allele-specific SCNAs using the E-M algorithm where we iterate between the E- and the M- step until convergence. The E-step is made tractable by using a recursive forward-backward algorithm that evaluates functions of the hidden locus-specific genotype states given the observed logR and logOR. Given a genotype state, the tumor purity and the ploidy are obtained in the M-step by maximizing the expectation of the conditional log-likelihood function.

We apply hsegHMM to a renal carcinoma cell dataset (TCGA-KL-8331) from TCGA (the Cancer Genome Atlas) project (https://cancergenome.nih.gov) to show the effectiveness of the proposed HMM framework. We also provide various simulation studies to show that hsegHMM is able to accurately detect genotype status across chromosomes that exhibit hypersegmentation. Further, through analysis and simulations, we compare hsegHMM with FACETS.

## Methods

### Hidden Markov model:hsegHMM

#### Modeling with logRatio and logOdds ratio as outcomes

*Z*

_{k}be the unknown genotype as the hidden state of the

*k*

^{th}genomic location and

*W*

_{k}and

*X*

_{k}be the corresponding logR and logOR, respectively. We specify 12 different states of

*Z*

_{k}, given in Table 1. The index for hidden states,

*j*is from 1 to

*J*(in this paper,

*J*=12) and the index for genomic locations,

*k*is from 1 to

*N*. The expectations of logR and logOR are given in Van Loo et al. [14] and Shen and Seshan [16], respectively. Specifically, we can write

Description of tumor genotype states and corresponding genotype of total copy number: homozygous deletion (HOMD), hemizygous deletion LOH (DLOH), copy neutral LOH (NLOH), diploid heterozygous (HET), gain of 1 allele (GAIN), amplified LOH (ALOH), allele-specific copy number amplification (ASCNA), balanced copy number amplification (BCNA), and unbalanced copy number amplification (UBCNA)

State ( | Genotype | Copy number ( | Allelic information |
---|---|---|---|

1 | 0 | 0 | HOMD |

2 | A | 1 | DLOH |

3 | AA | 2 | NLOH |

4 | AB | 2 | HET |

5 | AAB | 3 | GAIN |

6 | AAA | 3 | ALOH |

7 | AAAB | 4 | ASCNA |

8 | AABB | 4 | BCNA |

9 | AAAA | 4 | ALOH |

10 | AAAAB | 5 | ASCNA |

11 | AAABB | 5 | UBCNA |

12 | AAAAA | 5 | ALOH |

where *C*_{N} is the copy number of normal cells prespecified as *C*_{N}=2; *C*_{T,j} is the copy number of tumor cells at the *j*^{th} state; *α* is the tumor purity proportion of the tumor tissue over a range of 0 to 1; *ψ* is the ploidy. For example, if a tumor sample contains 100% tumor cells and is diploidy, then *μ*_{3}, the expectation of logR at the state *j*=3 is 0 along with *C*_{T,3}=2, *ψ*=2, and *α*=1.

where *m*_{j} and \(p_{_{j}}\) are the maternal and paternal copy numbers of the tumor at the *k*^{th} genomic location, respectively. We assume that Z={*Z*_{1}, *Z*_{2}, *Z*_{3}, ⋯, *Z*_{N}}, the genotype sequence across chromosomes follows a Markov chain with a transition probability, *P*_{ij}=*P*(*Z*_{k}=*j*|*Z*_{k−1}=*i*) and an initial probability, *r*_{0j}=*P*(*Z*_{1}=*j*). *P*_{ij} indicates the probability that the *j*^{th} genotype state occurs conditionally on the *i*^{th} genotype state at the previous location.

#### Joint emission probability with conditional distributions of logR and logOR

*W*

_{k}with a degree of freedom

*v*by a mixture of a normal distribution

*N*(

*W*

_{k}|

*μ*

_{j},

*κ*

^{2}/

*u*

_{k}) with a gamma distribution \(G\left (u_{k},\frac {v}{2},\frac {v}{2}\right)\),

We also specify a normal distribution for logR, *W*_{k}|*Z*_{k}=*j* ∼ *N*(*μ*_{j}, *σ*^{2}) to examine how differently these two distributions of logR behave in terms of handling hypersegmentation.

*ζ*

_{j}and

*τ*

^{2}, respectively. Finally, our joint emission probability with a t-distribution of logR at the

*k*

^{th}location given the state

*j*is

*N*(

*W*

_{k}|

*Z*

_{k}=

*j*,

*μ*

_{j},

*σ*

^{2}). Since logOR cannot be obtained in homozygous loci, the emission probability from Eq. 4 is re-formulated, depending on the existence of logOR at the

*k*

^{th}location (Eq. 5).

Here, *X*_{k}=*N**A* indicates that the *k*^{th} location is homozygous, and heterozygous otherwise.

#### Estimation with E-M algorithm

*P*(

*Z*

_{k}=

*j*|W, X

^{2}) and the joint probability,

*P*(

*Z*

_{k}=

*j*,

*Z*

_{k−1}=

*i*| W, X

^{2}), where W and X

^{2}indicate the sets of the logR and squared logOR values over chromosomes, respectively. These probabilities are evaluated by applying the forward-backward algorithm [19] resulting in the following conditional probabilities,

_{l:m}and \(\boldsymbol {X}_{\boldsymbol {l:m}}^{\boldsymbol {2}}\) indicate the observed values of logR and squared logOR from the

*l*

^{th}locus to

*m*

^{th}locus.

*a*

_{j}(

*k*) is evaluated by a forward recursion up to the

*k*

^{th}observation and the

*b*

_{j}(

*k*) by a backward recursion from the last to the (

*k*+1)

^{th}observation. The recursion equations are

*a*

_{j}(

*k*) and

*b*

_{j}(

*k*) may result in extremely small values that cause an underflow issue, particularly with large

*N*as in our application. Therefore, the scaled HMM [20] is implemented for all the analyses in this paper.

*θ*is a set of global parameters:

*θ*={

*α*,

*ψ*,

*κ*

^{2},

*τ*

^{2},

*v*}. Given

*γ*

_{k}and \(\eta ^{i,j}_{k}\) in Eq. 8, we can estimate all the parameters by maximizing

*Q*in the M-step. For estimating the initial probability

*r*

_{0j}, we maximize

*Q*

_{0}under the constraint that \(\sum _{j=1}^{J}P(Z_{1}=j)=1\), which is \(\widehat {r}_{0j}=\widehat {P}(Z_{1}=j)=\frac {\gamma _{1j}}{\sum _{l=1}^{J} \gamma _{1l}}\). Similarily, for estimating

*P*

_{ij}under the constraint \(\sum _{j=1}^{J}P(Z_{k}=j|Z_{k-1}=i)=1\), we obtain the closed form as

*Q*

_{θ}in terms of them by using the L-BFGS-B optimization algorithm in the R function optim.

## Simulations

Initially, we perform two simulation studies for assessing how accurately our proposed model identifies true genotype states when hypersegmentation occurs.

For these two studies, we generate 500 datasets with true genotype sequences of size *N*=10,000 based on a Markov chain with a given transition probability, using the R- package markovchain. We consider four genotypes as true states: A, AA, AAAB, and AAAAB. The first two genotypes are chosen for hemizygous deletion and neutral LOH (loss of heterozygosity), while the last two are chosen for the amplification events. Moreover, AAAB (*j*=7) and AAAAB (*j*=10) give similar expectations for both logR and logOR, making it more challenging to distinguish between these two genotypes. In this study where *ψ*=1.6 and *α*=0.9, *μ*_{7} and *μ*_{10} are 1.25 and 1.55 for logR; *ζ*_{7} and *ζ*_{10} are 1.03 and 1.31 for positive values of logOR. For each simulation study, the hsegHMM-N and the hsegHMM-T models are applied; The observed logOR _{j} is generated by *X*_{k}=*ζ*_{j}+*ε*_{k} where *ε*_{k} is normally distributed in these two simulation studies.

### t-distribution-based logR

In this simulation, we simulate hypersegmentation using a t-distribution for logR and examine both hsegHMM-N and hsegHMM-T models. We start with generating logR values from the t-distributions with *v*(=4) degrees of freedom. The squared values of logOR are generated from the chi-square distribution with one degree of freedom. Similar to the TCGA-KL-8331 dataset used in the TCGA study section, we assign 90% of loci to be homozygous.

^{th}locus is truly assigned to genotype AAAB, and the probability of identification for AAAB at the location is 0.488 with the hsegHMM-N model but 0.940 with the hsegHMM-T model. This indicates that AAAB is harder to correctly identify when we use hsegHMM-N model rather than the correct hsegHMM-T model.

_{H}are computed by averaging 500 asymptotic standard errors based on the Hessian matrices. The Hessian matrix for each dataset is numerically obtained by hessian in R package numDeriv. The estimates are unbiased even under the misspecified normal model. However, the asymptotic standard error estimates are sensitive to misspecification of the logR distribution (SE

_{H}is different from SEs under the misspecified hsegHMM-N model).

Summary of simulation studies with hsegHMM-N and hsegHMM-T models based on 500 simulated datasets

hsegHMM-N | hsegHMM-T | ||||||
---|---|---|---|---|---|---|---|

Simulation 1 | |||||||

True | Est | SEs | SE\(^{\ast }_{\mathrm {H}}\) | Est | SEs | SE | |

| 1.6 | 1.61 | 0.018 | 0.009 | 1.60 | 0.014 | 0.007 |

| 0.9 | 0.90 | 0.004 | 0.003 | 0.90 | 0.003 | 0.003 |

| 0.3 | N/A | 0.30 | 0.007 | 0.007 | ||

| 0.6 | 0.55 | 0.039 | 0.008 | 0.61 | 0.017 | 0.020 |

| 0.5 | 0.50 | 0.033 | 0.025 | 0.50 | 0.028 | 0.024 |

| 4 | N/A | 3.91 | 0.150 | 0.159 | ||

Simulation 2 | |||||||

True | Est | SEs | SE | Est | SEs | SE | |

| 1.6 | 1.62 | 0.028 | 0.015 | 1.60 | 0.013 | 0.011 |

| 0.9 | 0.90 | 0.003 | 0.003 | 0.90 | 0.003 | 0.003 |

| N/A | N/A | 0.64 | 0.017 | 0.018 | ||

| 0.65 | 1.46 | 0.055 | 0.023 | 2.28 | 0.133 | 0.159 |

| 0.5 | 0.48 | 0.026 | 0.024 | 0.49 | 0.025 | 0.024 |

| N/A | N/A | 2.79 | 0.076 | 0.093 |

### A mixture of normals-base logR

We examine the robustness of the t-distribution to alternative distributions that exhibit long tails. Specifically, we simulate under a mixture of normal distributions and examine the robustness of the hsegHMM-T model. Errors are generated by \(\epsilon _{k}\sim \pi _{1}\times N\left (\mu _{\epsilon 1},\,\sigma _{\epsilon 1}^{2}\right)+\pi _{2}\times N\left (\mu _{\epsilon 2},\,\sigma _{\epsilon 2}^{2}\right)\), where *π*_{1} and *π*_{2} indicate the mixture proportions of the first and the second distributions, respectively. The means and variances for those two normal distributions are chosen under the condition of *E*(*ε*_{k})=0 and \(V(\epsilon _{k})=\sigma _{\epsilon }^{2}\). In this simulation, *π*_{1} and *π*_{2} are set as 0.7 and 0.3 with *μ*_{ε1}=*μ*_{ε2}=0, \(\sigma _{\epsilon 1}^{2}=0.5\), and \(\sigma _{\epsilon 2}^{2}=0.5\times 3^{2}\). Then, the total error variance \(\sigma _{\epsilon }^{2}\) is 2.25.

_{H}(Simulation 2 in Table 2) are shown in the same way as described in the previous “t-distribution-based logR” section. Figure 2 shows that all the red lines based on the hsegHMM-T model appear noticeably closer to the black lines than the blue lines based on the hsegHMM-N model for all the genotype states. In particular, detecting both AAAB and AAAAB with the hsegHMM-T model performs much better than the hsegHMM-N model.

The results indicate that under mild amounts of misspecification (t-distribution rather than a mixture of normals) of the log R distribution, estimates of global parameters and their standard errors are both unbiased, and the accuracy of classifications are very good. Consequently, hsegHMM-T provides more accurate estimates of genotype status by managing hypersegmentation more effectively than hsegHMM-N.

A single dataset-based result is provided to investigate more closely how much more robust the hsegHMM-T model is than the hsegHMM-N model to cope with hypersegmentation (Additional files 1–4). The first two figures (Additional files 1 and 2) and the last two figures (Additional files 3 and 4) show the copy number profiles of a particular dataset from the first and second simulation scenarios, respectively. Similar to the analysis results, hsegHMM-T appears to handle hypersegmentation much better than hsegHMM-N. This same pattern was observed for all simulated datasets (data not shown).

*α*=0.3,0.5,0.7) and different numbers of reads (half and double related to the original from Fig. 3). Based on 500 simulated datasets, our proposed model performs better as the purity increases in terms of a higher probabilities of correct genotype identification (Additional file 5).

### A read counts-based simulation for the comparison of hsegHMM and FACETS

In this simulation, we compare our method, hsegHMM with FACETS which also constructs its model based on logR and logOR but with a segmentaion-based approach for allele-specific SCNA analysis. In order to make a fair comparison between these two methods, we generate datasets from read counts and read depth (coverage) in the beginning without relying on model assumptions of hsegHMM or FACETS. In this study, we choose the true profile of genotypes as A, AB, and AA orderly with different lengths over the entire chromosomes.

First, we consider the fact that the total coverages for normal cells and tumor cells are different. Thus, we assign different read depths for tumor and normal cells as 160 and 40 in this study. Then, for normal cells, the read depth is 40 and the read count of A allele is either 20 for AB (heterozygosity) or 40 for AA (homozygosity) regardless of the true genotypes for tumor. On the other hand, for tumor cells, both read depth and read counts are assigned depending on different genotypes. For the genotype AB, the read count and depth are set to be 80 and 160, respectively. For the genotype AA, the read count and depth are the same as 160. For the genotype A, the read count and depth are both 80 as the same number with a half of the total coverage since B allele is lost.

In order to generate read counts and depths, we use a uniform distribution with different intervals for normal and tumor samples. The intervals provide the variation occuring from measurement errors. For normal, read counts and depths are generated from a uniform distribution with the range of ±15 intervals for both the genotype AA and AB, and the range of ±20 for the genotype A. For tumor, read counts and depths are generated from either of these two uniform distributions with the range of ±30 and ±15 with probability of 70% and 30%, respectively. This setting provides logR and logOR values with different variances and asymmetric ranges between genotype A and the others, which makes more challenging to analyze. Finally, we round the decimal values of read counts and depths from these continuous uniform distributions to the nearest intergers.

Through the preprocedure of FACETS, we obtain total 4942 values of logR and logOR based on the integer values of read counts and depths, for which both hsegHMM and FACETS are applied. Figure 3 shows how different these two methods behave through the probability of identification plots for all the different genotype states. These plots are based on 500 simulated datasets. hsegHMM-N (red lines) and hsegHMM-T (blue lines) have almost the exact patterns of the identification plots with FACETS (green lines) for all the genotypes. On the other hand, when the read depth distribution was skewed (a rescaled beta distribution with shape parameter values of 1 and 6), FACETS did poorly in region identification as compared to hsegHMM. Specifically, we consider both the cases of short and wide region length. Additional file 6 shows the probabilities of correct identification for a short region based on 500 simulated datasets. hsegHMM-T identified the mutation in approximately 96% of the datasets as compared with 5% using FACETS (the left panel in Additional file 6). For the wider region, FACETS improved relative to hsegHMM, but there was still a marked improvements of our approach (the right panel in Additional file 6): 83% and 99% identification for FACETS and hsegHMM, respectively. These results show the advantage of hsegHMM as compared with FACETS for uneven coverage.

We also examine our method with different numbers of reads for read counts and depths by reducing half size and increasing double size of them (Additional files 7 and 8). For both the half and double read size cases, our model shows similar results to those from the original size of reads (Additional files 7 and 8). FACETS showed similar behavior when the read counts were altered (Additional files 7 and 8).

## Results

### TCGA-KL-8331 renal cell carcinoma dataset

TCGA project (https://cancergenome.nih.gov) is a cancer genomic collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). This project includes critical genomic information of 33 types of cancers with more than two petabytes of TCGA genomic dataset to contribute cancer etiology, treatment, and diagnosis. In this research, we apply hsegHMM to whole-exome sequencing data from a chromophobe renal cell carcinoma (RCC) sample (TCGA-KL-8331).

TCGA-KL-8331 dataset consists of read counts and total depths for both normal and tumor paired tissues from the same patient over the entire chromosomes. This dataset contains 1,217,407 single nucleotide variants (SNVs). Through FACETS pre-processing step, these 1.2 MB SNVs were reduced to 369,131 SNVs, which are limited to the germline polymorphic sites and filtered by low quality including lower depth coverage positions (see the details in the Data pre-processing section in [16]). Thus, observed logR and logOR are calculated for *N*=369,131 loci. In this RCC sample, we find that approximately 13% (47,660) of loci are heterozygous with the corresponding logOR available. For computational feasibility, we perform a thinning process which keeps every 10^{th} observation. This also reduces auto-correlations between observations, which helps alleviate hypersegmentation. We apply the hsegHMM procedure to the final dataset *N*=36,914.

_{A/AB}) fits data best with the smallest AIC (62923.90) and BIC (62992.03). We also apply the FACETS method to compare the result with our method. The hsegHMM-T with the two major genotype states have almost the same allele-specific copy number profiles with FACETS in Fig. 6.

*α*is estimated to be about 87–88% for all the methods, which indicates a high proportion of the tumor cells in the tumor tissue. The estimated ploidy, \(\widehat {\psi }\) (≈1.6) appears to be different from 2 in all the methods, which provides evidence for aneuploidy in this sequence. Note that the tumor purity, ploidy and variances of logR (

*V*(

*W*)) are similar in all the three hsegHMM models. This suggests that estimation of global parameters are robust to the distribution of logR and to an expanded genotype state space. This is in contrast to the allele specific genotype status that does appear to be sensitive to the distribution of logR and to the specification of an appropriate state space of genotypes, and hence to hypersegmentation.

Summary of hsegHMM-N, hsegHMM-T, and hsegHMM-T _{A/AB} models of a renal cell carcinoma sample: hsegHMM-T _{A/AB} indicates the hsegHMM-T with A and AB state space; Est and logL represent estimated values for parameters and log-likelihood function values given all the estimates, respectively ; *ψ* is the ploidy and *α* is the tumor purity; *κ*^{2} is the variance component of logR in hsegHMM-T; *V*(*W*) and *τ*^{2} are the variance of logR and logOR in both models, respectively; SE _{H} indicates the average asymptotic standard errors of estimates based on the Hessian matrices

hsegHMM-N | hsegHMM-T | hsegHMM-T | ||||
---|---|---|---|---|---|---|

Est | SE | Est | SE | Est | SE | |

| 1.62 | 0.003 | 1.61 | 0.003 | 1.60 | 0.003 |

| 0.87 | 0.002 | 0.88 | 0.002 | 0.88 | 0.002 |

| N/A | 0.16 | 0.002 | 0.17 | 0.003 | |

| 0.25 | 0.002 | 0.26 | 0.003 | 0.27 | 0.003 |

| 0.57 | 0.012 | 0.58 | 0.012 | 0.57 | 0.014 |

| N/A | 5.50 | 0.185 | 5.48 | 0.218 | |

AIC | 64682.17 | 63120.48 | 62923.90 | |||

BIC | 65934.07 | 64380.89 | 62992.03 |

## Discussion

We have shown that the hidden Markov modeling approach provides an effective way to identify allele-specific copy number alternations along the genome. As compared with FACETS, a segmentation-based approach, the hsegHMM provides an assessment of the uncertainty in parameter estimate (i.e. ploidy and purity), using likelihood-based estimate of variances as well as the ability to assess variability in copy number identification by computing posterior estimates of the genotype at each locus. It is also important to mention that hsegHMM is based on the output of WES, which in turn relies on the exome enrichment platforms where capture efficiency may still affect SCNA estimation.

A major focus of the paper was demonstrating that hypersegmentation in allele-specific SCNA data can be substantially reduced by incorporating a long-tailed emission distribution (hsegHMM-T model) into a HMM framework. We also found that hypersegmentation could occur by choosing a state space (possible genotypes) that is more expansive than necessary. Thus, we recommend that the most parsimonious model with a limited number of genotype states be chosen. Of course, the choice of this model should be based on using penalized likelihood methods such as AIC. Last, the hsegHMM assumes that logR and logOR measurements given genotype are independent across the entire chromosomes. This may not be true when loci are very close together, and failure of this assumption may lead to hypersegmentation. We therefore recommend thinning the sequence data (e.g., taking only one out of ten data points) to avoid this problem.

The application of hsegHMM can be extended in three future directions that have important applications in cancer genetics. First, hsegHMM can be applied to a population-based study where many subjects will be analyzed. In this case, we suggest that individual-specific analyses be conducted and the results combined in a final analysis. For example, evidence of a SCNA being related to a particular cancer may be suggested if a sizable proportion of the posterior probabilities of a genotype at a particular chromosome location are greater than a certain threshold (e.g. >80%). Second, the relationship between a genetic factor and a subject-specific covariate may be examined in a second stage regression. For example, by using all the individual ploidy estimates obtained from the population-based study, we can construct a linear regression of the log ploidy estimate, \(log\widehat {\psi }\) with a set of any covariates such as \(log\widehat {\psi }=\beta _{0}+\beta _{1}{age}+\beta _{2}\)BMI. As an illustration for a population-based study, we have analyzed all 316 renal cell carcinoma samples from TCGA with the proposed model based on 5 copy number of state-space. We obtained the distribution of estimated ploidy across all the samples for any major copy number alteration event across the chromosomes. We also estimated the distribution of purity which is a measure of the quality of the tissue samples. Furthermore, we created a cytoband-based stacked histogram of allele-specific SCNA events for integrating allele-specific SCNA profiles from all the 316 samples. Each sample has its own allele-specific SCNA profile with different genotypes and regions. To standardize genetic locations across the samples, we used a cytoband file format which has predefined positions of cytobands across the whole chromosomes. For each cytoband, the corresponding allele-specific SCNA event is assigned within individual sample, and the number of times each event occurs is counted. After counting the frequencies of all the cytobands, we found that the most frequent mutation is a hemizygous deletion (genotype A) that has highest frequency on Chromosomes 3 and 14. In addition, we found a high frequency of a Gain (genotype AAB) in the region between q21.3 and q35.3 on Chromosome 5 (Additional file 9).

Last, our model structure can be extended to infer tumor subclonal populations. In practice, a tumor sample contains a mixture of clones not just one main clone which is assumed in hsegHMM. Such an approach can also be embedded into a hidden Markov modeling framework, and is the subject of future research.

## Conclusions

In this paper, we propose a hidden Markov model framework (hsegHMM) for estimating genotype status as well as copy number at each locus, incorporating the complexities of tumor samples as well as hypersegmentation. Specifically, under certain type of data with more fluctuated or irregular observations, hsegHMM-T model performs better than hsegHMM-N model in terms of such a remarkable reduction of hypersegmentation. As a byproduct of the hsegHMM estimation procedure, we can compute the posterior probabilities of allele-specific genotype status (the Method section) as well as provide a rigorous comparison of different models (e.g. normal versus t-distribution) by using AIC and BIC (the Result section). Hence, hsegHMM provides a rigorous framework for statistical inference and model assessment. hsegHMM can also expand the genotype state space so that it can handle a more flexible range of copy number alterations. Specifically, this flexibility is useful for analyzing data from certain type of cancers with high-level amplification events. Simulation studies showed that hsegHMM-T performed much better than FACETS in situation where the coverage (read depth) is uneven across the genome.

In conclusion, hsegHMM offers an allele-specific SCNA analysis robust to hypersegmentation while accounting for tumor purity and ploidy. Such robustness enhances the accuracy of detecting genotype status at each locus in NGS-based platforms.

## Declarations

### Acknowledgements

We would like to acknowledge our usage of the data from The Cancer Genome Atlas (TCGA) supported by the National Cancer Institute and National Human Genome Research Institute: https://cancergenome.nih.gov. We would like to thank Bill Wheeler (Information Management Services) and Lei Song (Biostatistics Branch) for computational contributions.

### Funding

The work was supported by the Intramural Research Program of US National Institutes of Health, National Cancer Institute. This work utilized the computational resources of the NIH HPC Biowulf cluster. (http://hpc.nih.gov).

### Availability of data and materials

All the source codes of hsegHMM are available at https://dceg.cancer.gov/tools/analysis/hseghmm.

### Authors’ contributions

HCW, PSA and BZ all contributed to the model formulation, analysis, simulations, interpretation, and writing. All authors read and approved the final manuscript.

### Ethics approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Authors’ Affiliations

## References

- Yang L, Luquette LJ, Gehlenborg N, Xi R, Haseley PS, Hsieh C-H, Zhang C, Ren X, Protopopov A, Chin L, et al. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell. 2013; 153(4):919–29.View ArticleGoogle Scholar
- Roy DM, Walsh LA, Desrichard A, Huse JT, Wu W, Gao J, Bose P, Lee W, Chan TA. Integrated genomics for pinpointing survival loci within arm-level somatic copy number alterations. Cancer Cell. 2016; 29(5):737–50.View ArticleGoogle Scholar
- Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, Barretina J, Boehm JS, Dobson J, Urashima M, et al. The landscape of somatic copy-number alteration across human cancers. Nature. 2010; 463(7283):899–905.View ArticleGoogle Scholar
- Zack TI, Schumacher SE, Carter SL, Cherniack AD, Saksena G, Tabak B, Lawrence MS, Zhang C-Z, Wala J, Mermel CH, et al. Pan-cancer patterns of somatic copy number alteration. Nat Genet. 2013; 45(10):1134–40.View ArticleGoogle Scholar
- Li W, Lee A, Gregersen PK. Copy-number-variation and copy-number-alteration region detection by cumulative plots. BMC Bioinformatics. 2009; 10(1):67.View ArticleGoogle Scholar
- Shlien A, Malkin D. Copy number variations and cancer. Genome Med. 2009; 1(6):62.View ArticleGoogle Scholar
- Calhoun ES, Hucl T, Gallmeier E, West KM, Arking DE, Maitra A, Iacobuzio-Donahue CA, Chakravarti A, Hruban RH, Kern SE. Identifying allelic loss and homozygous deletions in pancreatic cancer without matched normals using high-density single-nucleotide polymorphism arrays. Cancer Res. 2006; 66(16):7920–8.View ArticleGoogle Scholar
- Kuga D, Mizoguchi M, Guan Y, Hata N, Yoshimoto K, Shono T, Suzuki SO, Kukita Y, Tahira T, Nagata S, et al. Prevalence of copy-number neutral loh in glioblastomas revealed by genomewide analysis of laser-microdissected tissues. Neuro-Oncol. 2008; 10(6):995–1003.View ArticleGoogle Scholar
- Kloth JN, Oosting J, van Wezel T, Szuhai K, Knijnenburg J, Gorter A, Kenter GG, Fleuren GJ, Jordanova ES. Combined array-comparative genomic hybridization and single-nucleotide polymorphism-loss of heterozygosity analysis reveals complex genetic alterations in cervical cancer. BMC Genomics. 2007; 8(1):53.View ArticleGoogle Scholar
- Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, Pritchard JK. Effect of read-mapping biases on detecting allele-specific expression from rna-sequencing data. Bioinformatics. 2009; 25(24):3207–12.View ArticleGoogle Scholar
- Meienberg J, Zerjavic K, Keller I, Okoniewski M, Patrignani A, Ludin K, Xu Z, Steinmann B, Carrel T, Röthlisberger B, et al. New insights into the performance of human whole-exome capture platforms. Nucleic Acids Res. 2015; 43(11):76.View ArticleGoogle Scholar
- Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M. Penncnv: an integrated hidden markov model designed for high-resolution copy number variation detection in whole-genome snp genotyping data. Genome Res. 2007; 17(11):1665–74.View ArticleGoogle Scholar
- Colella S, Yau C, Taylor JM, Mirza G, Butler H, Clouston P, Bassett AS, Seller A, Holmes CC, Ragoussis J. Quantisnp: an objective bayes hidden-markov model to detect and accurately map copy number variation using snp genotyping data. Nucleic Acids Res. 2007; 35(6):2013–25.View ArticleGoogle Scholar
- Van Loo P, Nordgard SH, Lingjærde OC, Russnes HG, Rye IH, Sun W, Weigman VJ, Marynen P, Zetterberg A, Naume B, et al. Allele-specific copy number analysis of tumors. Proc Natl Acad Sci. 2010; 107(39):16910–5.View ArticleGoogle Scholar
- Liu Z, Li A, Schulz V, Chen M, Tuck D. Mixhmm: inferring copy number variation and allelic imbalance using snp arrays and tumor samples mixed with stromal cells. PLoS ONE. 2010; 5(6):10909.View ArticleGoogle Scholar
- Shen R, Seshan VE. Facets: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput dna sequencing. Nucleic Acids Res. 2016; 44(16):131.View ArticleGoogle Scholar
- Peel D, McLachlan GJ. Robust mixture modelling using the t distribution. Stat Comput. 2000; 10(4):339–48.View ArticleGoogle Scholar
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B Methodol. 1977; 39:1–38.Google Scholar
- Baum LE, Petrie T, Soules G, Weiss N. A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. Ann Math Stat. 1970; 41(1):164–71.View ArticleGoogle Scholar
- Stamp M. A revealing introduction to hidden Markov models. 2011. http://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf.