Implementation
The presented algorithm is implemented in the CONVector software package, which uses Python for user input collection and data preprocessing and Java (≥ 1.7) for data analysis. The source code is licensed under GPLv2 and is deposited on GitHub: https://github.com/parseq/convector.
Model description
Amplification-based enrichment is a widely used strategy for targeted sequencing. Commercially available enrichment systems, such as AmpliSeq (Life Technologies), provide a way to selectively amplify genomic regions of interest. The number and location of targets are determined by the aim of the analysis and vary from hundreds to thousands per sequencing library, making the analysis highly multiplex. The quantity of PCR product from a given target region (N) generally depends on the initial amount of the target DNA segment in the sample (N_{0}), the number of PCR cycles (c), and the amplification efficiency (e), and is typically modelled as
$$ N=N_{0} \times (1 + e)^{c}. $$
(1)
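In code, Eq. (1) and its inversion used to recover the initial template amount read as follows (a minimal sketch; the function names are illustrative, not part of CONVector):

```python
def amplified_copies(n0: float, efficiency: float, cycles: int) -> float:
    """Expected product quantity N after a number of PCR cycles (Eq. 1)."""
    return n0 * (1.0 + efficiency) ** cycles

def initial_copies(n: float, efficiency: float, cycles: int) -> float:
    """Invert Eq. (1) to estimate the initial template amount N0 from N."""
    return n / (1.0 + efficiency) ** cycles
```

With perfect efficiency (e = 1), each cycle doubles the template, so three cycles turn 10 copies into 80.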
The main principle of coverage-based CNV detection is to estimate the N_{0}s from the Ns and to compare the estimated N_{0}s of targets in order to find underrepresented (deletions) or overrepresented (duplications) sequences. Detection of homozygous deletions is trivial: if N_{0} equals zero, then N is also zero, a condition detected by the absence of the target sequence in the sequencing results. Other cases, such as heterozygous deletions and homo- and heterozygous duplications, require deduction of the N_{0}s from the Ns, which can be done by counting reads from the corresponding amplicons. The number of reads (N), i.e. the amplicon coverage, depends strongly on the amplification efficiency, which is influenced by amplicon structure, primer characteristics, and reaction conditions. In a highly competitive multiplex PCR environment, the amplification balance can be shifted by even minor changes of conditions that are beyond the control of the researcher. The resulting imbalance makes it impossible to directly compare absolute or relative target region coverages between samples for CNV detection, except when the analyzed genomic region is covered by a large number of amplicons (thousands of bases). Nevertheless, detection of CNVs in relatively small portions of the genome, down to a single exon covered by one or a few amplicons, may have biological or clinical importance.
To achieve CNV detection at amplicon-scale resolution, we rely on the idea that in a large amplicon population, which is the case in targeted PCR, it is possible to find amplicons with a similar response to reaction conditions. Therefore, to detect changes in the initial amount of targets, we can compare the coverages of targets with similar amplification behavior across a series of samples, catching anomalies (under- or overrepresentation), i.e. CNVs. Since the described approach accounts for any factor affecting amplification efficiency (even hidden and uncontrolled ones), it is expected to be more efficient than normalization based only on amplicon or primer characteristics.
The analysis consists of two steps: grouping amplicons with highly correlated coverages in the series of samples and CNV detection by multiple coverage comparison between amplicons within a group. From this perspective, CNV detection can be formulated as anomaly detection using methods of robust statistics.
The analysis requires aligned reads from samples sequenced with the same set of primers under the same reaction conditions. The goal of the analysis is to determine the zygosity state of each amplicon in each sample of the dataset. The model suggests that if two amplicons have similar amplification efficiencies and the influence of stochastic effects is small enough, the coverages of these amplicons are correlated across a set of samples. Analysis of the intersection of confidence intervals for λ, obtained by applying the Box-Cox transformation to pairs of correlated amplicons together with model (1), suggests that a log transformation of amplicon coverage should be used to stabilize the variance and make the variables homoscedastic. We propose two algorithms, unsupervised and supervised, that can be used separately or in combination. The first (unsupervised) algorithm operates on amplicons, while the second (supervised) one operates on CNV sites that may include one or more amplicons. Since CNV breakpoints are usually located in intronic regions, it is useful to describe CNV sites in terms of affected exons [18]. For the first algorithm, we assume that the dataset may contain a number of samples carrying CNVs, but each particular target is affected by a CNV in only a subset of samples. The prerequisite for the second algorithm is the presence of at least 20 CNV-free samples in the test dataset, which can be satisfied by an a priori known control dataset or assumed on the basis of a low previously estimated population frequency of CNVs in the particular genes of interest. The frequency of CNVs in the dataset and the dataset size influence the robustness and effectiveness of the statistical model. The presented model is based on the assumption that each particular CNV is present in no more than 20 % of samples in the dataset, which holds for the majority of real-life cases.
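The variance-stabilizing effect of the log transform can be illustrated with a quick simulation. This is only a sketch under an assumed multiplicative-noise model (all names and parameters here are illustrative, not CONVector code): the raw spread grows with the mean coverage, while on the log scale it stays roughly constant.

```python
import math
import random

random.seed(1)

def sample_coverage(mean, n=20000, cv=0.1):
    # assumed multiplicative noise: standard deviation proportional to the mean
    return [random.gauss(mean, cv * mean) for _ in range(n)]

def stdev(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

low, high = sample_coverage(100), sample_coverage(10000)

# spread of raw coverages grows ~100x with the mean...
raw_ratio = stdev(high) / stdev(low)
# ...but after the log transform the spread is roughly the same
log_ratio = stdev([math.log(x) for x in high]) / stdev([math.log(x) for x in low])
```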
We used a robust linear regression model to estimate the linear relationships between amplicons.
For CNV zygosity determination, we assumed that amplification starts from equal numbers of maternal and paternal copies of the targets. The result of amplification, i.e. coverage, is a sum of independent random variables, so we can build a linear model for any zygosity state, except homozygous deletion (coverage is zero, so detection is trivial), by multiplying the coverage by a corresponding factor: 0.5 for heterozygous deletion, 1.0 for wild type, 1.5 for heterozygous duplication, 2.0 for homozygous duplication, and so on. Using the linearity of the variance of a sum of independent random variables, we can conclude that the variance decreases when the multiplier equals 0.5 and increases when it is greater than 1.0. This makes duplications harder to detect than deletions because of the lower power of the corresponding statistical tests. However, it is possible to build a statistical model for detecting CNV statuses inside a group of correlated amplicons and to use likelihood considerations of a certain CNV status for each amplicon in each sample, although the power to detect the exact number of copies decreases for higher ploidy states.
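The variance argument can be checked numerically. The snippet below is only an illustration (the Gaussian wild-type coverage model and all names are assumptions): scaling a coverage distribution by a zygosity factor m scales its variance by m², so the heterozygous-deletion model (m = 0.5) is tighter than wild type, and the duplication models (m ≥ 1.5) are wider.

```python
import random

random.seed(0)
# assumed wild-type coverage distribution, for illustration only
wild_type = [random.gauss(1000, 50) for _ in range(10_000)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

v_wt = variance(wild_type)
v_het_del = variance([0.5 * x for x in wild_type])  # factor 0.5 -> variance x 0.25
v_het_dup = variance([1.5 * x for x in wild_type])  # factor 1.5 -> variance x 2.25
```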
Our algorithms use several robust statistical techniques that require a comparatively large test dataset for accurate CNV detection. Using real sequencing data (described below), we determined the lower threshold for the number of samples in the test dataset to be 25. However, the tool can be used for smaller datasets if a control dataset of sufficient size is available.
CNV detection algorithms
The detection pipeline consists of two stages: unsupervised and supervised (Fig. 1). The unsupervised algorithm takes the coverages of amplicons in the samples as input and does not require a priori assumptions about the potential presence of CNVs, except for the maximum CNV frequency at any particular site (described above). The next (supervised) algorithm requires a control (CNV-free) dataset. The control dataset can be built from the results of the unsupervised algorithm or by alternative methods.
We use the following notation: a_{j} denotes the j-th amplicon in the dataset; cov(a_{j}) denotes a_{j}'s vector of coverages across the samples; a_{jk} denotes an amplicon whose vector of coverages correlates with cov(a_{j}), i.e., the correlation is higher than a fixed threshold T (Fig. 2). T can be chosen as the largest number from 0.0 to 1.0 such that all analysed exons have at least one amplicon that passes the quality control procedure (described below). cov(a_{j})[i] denotes the coverage of amplicon j in sample i. In order to exclude cases where both a_{j} and a_{jk} could be affected by the same CNV, we restrict the selection by setting a minimum physical distance between a_{j} and a_{jk}. When the target regions are spread across genes, it is possible to select a_{jk} from the set of target regions located on genes other than the one on which a_{j} is located. We also denote by M_{Normal}, M_{HetDel}, and M_{Dup} the statistical models for copy numbers 2, 1, and greater than or equal to 3, respectively.
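The selection of partner amplicons a_{jk} can be sketched as follows. This is a hypothetical illustration, not CONVector's API: `select_partners`, the gene labels, and the threshold value are all assumptions, and the gene comparison stands in for the minimum-distance constraint described above.

```python
def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def select_partners(target_cov, target_gene, amplicons, T=0.9):
    """amplicons: list of (gene, coverage_vector); keep amplicons on other
    genes whose coverage vectors correlate with the target above T."""
    partners = []
    for gene, cov in amplicons:
        if gene == target_gene:
            continue  # same gene: could share the same CNV, so skip
        if pearson(target_cov, cov) > T:
            partners.append((gene, cov))
    return partners
```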
We rely on the idea that standard corrections for features such as GC content and read length remove a large part of the variation, but for amplification-based sequencing there are many other sources of variation (for instance, primer properties), and it is usually impossible to infer all of them. Thus, we use a different type of normalisation: we construct statistical models within clusters of correlated amplicons. These models explain a large proportion of the variation in the response variable, which makes standard corrections unnecessary. Pearson's correlation coefficient is used as an efficient non-robust estimator of explained variance in situations where we are sure there are no outliers. Robust estimators of correlation and regression parameters are used as a less efficient but robust alternative when the possible presence of outliers is unknown. Finally, normalisation on correlated random variables makes the covariance matrix closer to diagonal and consequently robust to overfitting, which makes the use of regularized classification methods possible.
Unsupervised algorithm
CNV states are determined by building linear models using cov(a_{j}) as the response and cov(a_{jk}) as predictors (Fig. 2).
If cov(a_{j})[i] cannot be recognized as an outlier for the linear model M_{Normal}(a_{j}), we consider the amplicon to be in the wild-type state for sample i.
If cov(a_{j})[i] is lower than the detection threshold (i.e., close to 0), the amplicon is in the homozygous deletion state for sample i.
If cov(a_{j})[i] is lower than expected for a wild-type amplicon and is closer to the linear model for heterozygous deletion M_{HetDel}(a_{j}) (i.e., the corresponding residual is smaller, which means that the simple Bayes factor \(K = \frac {P(cov(a_{j}) \mid M_{HetDel}(a_{j}))}{P(cov(a_{j}) \mid M_{Normal}(a_{j}))}\) for these models is greater than one), but was not recognized as a homozygous deletion, we conclude that it is a heterozygous deletion (for sample i).
If cov(a_{j})[i] is detected as an outlier for sample i, but lies above the regression line and is closer to the linear model for heterozygous duplication, it is considered a duplication.
Our approach was developed for the detection of single-copy germline CNVs, so this set of models covers the vast majority of real-life cases, but the number of models can be increased for the detection of complex rearrangements involving quantitative changes of higher order (4, 5, etc.). Such ploidy states are not considered as separate cases because the prediction intervals of the models for different types of duplications largely intersect. It is possible to determine the exact ploidy of a region only when the region is large, but we considered the detection of CNVs at the highest possible resolution (down to one amplicon) more important than the detection of regions of higher possible ploidy at necessarily low resolution.
In order to avoid the influence of errors in the predictors, we determine the CNV state of an amplicon a_{j} by constructing the L linear models that explain the variance in cov(a_{j}) better than the others. To do this, we use the L other amplicons a_{jk} whose vectors of coverages in the series of samples being analyzed show the highest correlations with cov(a_{j}) in the same series of samples. We then assume that if the coverage of a_{j} in a particular sample is detected as a CNV in N of the L models, this indicates the presence of a CNV. L and N were empirically chosen as 5 and 4 for our experimental data, since larger or smaller numbers gave worse performance on the training dataset.
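The N-of-L voting rule reduces to a one-line check once the per-model outlier verdicts are available. A minimal sketch (the function name is illustrative; `flags` would come from the per-model outlier tests described below):

```python
def vote_cnv(flags, n_required=4):
    """flags: per-model outlier verdicts (True/False) for one amplicon in
    one sample, one per partner model. The paper uses L = 5 models and
    requires agreement of N = 4 of them to call a CNV."""
    return sum(flags) >= n_required
```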
The S_{n} correlation coefficient is used as a measure of similarity [19]:
$$r_{S_{n}} = \frac{{S_{n}^{2}}(u) - {S_{n}^{2}}(v)}{{S_{n}^{2}}(u) + {S_{n}^{2}}(v)}, $$
where u and v are the robust principal variables: \( u = \frac {x - med(x)}{\sqrt {2}\, MAD(x)} + \frac {y - med(y)}{\sqrt {2}\, MAD(y)}, \quad v = \frac {x - med(x)}{\sqrt {2}\, MAD(x)} - \frac {y - med(y)}{\sqrt {2}\, MAD(y)}.\) Here and below, S_{n} denotes Rousseeuw and Croux's estimator of standard deviation [20].
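The robust correlation above can be sketched directly from its definition. This is an illustration, not CONVector's implementation: a naive O(n²) form of S_{n} is used (with the inner median taken over all j, and the consistency constant omitted since it cancels in the ratio), and the function names are assumptions.

```python
import statistics

def mad(x):
    """Median absolute deviation (unscaled)."""
    m = statistics.median(x)
    return statistics.median([abs(v - m) for v in x])

def s_n(x):
    """Naive S_n: median over i of the median over j of |x_i - x_j|.
    The consistency constant c is omitted: it cancels in r_{S_n}."""
    return statistics.median(
        [statistics.median([abs(xi - xj) for xj in x]) for xi in x]
    )

def robust_corr(x, y):
    """r_{S_n} = (S_n^2(u) - S_n^2(v)) / (S_n^2(u) + S_n^2(v))."""
    sq2 = 2 ** 0.5
    mx, my = statistics.median(x), statistics.median(y)
    madx, mady = mad(x), mad(y)
    u = [(a - mx) / (sq2 * madx) + (b - my) / (sq2 * mady) for a, b in zip(x, y)]
    v = [(a - mx) / (sq2 * madx) - (b - my) / (sq2 * mady) for a, b in zip(x, y)]
    su2, sv2 = s_n(u) ** 2, s_n(v) ** 2
    return (su2 - sv2) / (su2 + sv2)
```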
We used the Theil-Sen estimator [21] as a robust linear model because of its high breakdown point (\(1 - \frac {1}{\sqrt 2} \approx 29.3~\%\)), satisfactory efficiency, and simplicity.
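A minimal Theil-Sen sketch (not the CONVector implementation; the naive O(n²) pairwise-slope form is used for clarity): the slope is the median of all pairwise slopes and the intercept is the median of the residual offsets, which is what gives the estimator its high breakdown point.

```python
import statistics

def theil_sen(x, y):
    """Robust line fit: median of pairwise slopes, then median intercept."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i in range(len(x))
        for j in range(i + 1, len(x))
        if x[j] != x[i]
    ]
    slope = statistics.median(slopes)
    intercept = statistics.median([yi - slope * xi for xi, yi in zip(x, y)])
    return slope, intercept
```

A single gross outlier (here the point (5, 100)) leaves the fitted line y = 2x untouched, which is exactly the behaviour needed when a CNV-carrying sample sits among the predictors.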
We used the standard formula for internally studentized residuals, with the standard deviation replaced by its robust analogue. We used an empirical level of significance for the Student's quantile as the threshold for outlier detection (α = 0.02 for deletions and α = 0.05 for duplications). The formula for the i-th studentized residual is:
$$\frac {\hat \varepsilon_{i}} {S_{n} \sqrt{1 - \left(\frac{1}{n} + \frac{(x_{i} - \bar{x})^{2}}{\sum_{j=1}^{n} (x_{j} - \bar{x})^{2}}\right)}}, $$
where \(\hat \varepsilon _{i}\) denotes the i-th residual and \(\bar {x}\) the sample mean of the predictor. The algorithm's pseudocode is available in Additional file 1.
Supervised algorithm
Using the output of the unsupervised stage, samples can be grouped by the following rule: if at least one amplicon is marked as an outlier, the sample is suspected of carrying a CNV and is added to the test dataset; if there is no outlier, the sample is added to the control dataset. Alternatively, grouping can be performed using a control dataset with an a priori known absence of CNVs.
In general, we should search for CNVs in any possible CNV site, which can be reformulated as a classification problem. We use ideas from linear discriminant analysis; however, the decision making is divided into three separate steps in order to make it configurable for particular purposes.
Given that a CNV site can be considered as a multivariate random variable, and that in practice we may have up to 100 samples per dataset (each sample consisting of hundreds of amplicons), direct use of a robust covariance matrix with thousands of predictors is useless because of overfitting. To handle this, we normalize the coverage values and use a block covariance matrix. In the presented work, we used a block size of 1 × 1.
The algorithm contains several steps:

1.
For each amplicon a_{j} we determine a cluster C of k other amplicons a_{jk}. The algorithm uses Pearson's correlation coefficient as the estimator of correlation when the control set is big enough (more than 20 samples, according to the bias and variance of permutation-based simulations) and a robust correlation estimator otherwise.

2.
Normalisation of amplicon coverages: \(x_{j} = \log (cov(a_{j})) - \Bigl (\frac {\sum _{k \in C} \log cov(a_{jk})} {|C|} \Bigr).\) In order to reduce the variance, we use a group of highly correlated amplicons located far enough from each other that they cannot be affected by the same CNV (i.e., amplicons from different genes or chromosomes) and normalise on their average instead of normalising on the total number of reads. For datasets with a large number of genes and possible CNVs, a more robust normalisation procedure can be applied: for example, instead of raw coverages, a sampled distribution of coverages predicted by the linear models from the first step can be created and used as the normalisation factor (the correlated amplicons should be taken uniformly from different genes).

3.
We construct several probabilistic models for each amplicon: M_{Normal}(a_{j}), M_{HetDel}(a_{j}), and M_{Dup}(a_{j}). Assuming that all x_{j} are normally distributed, we calculate the robust mean estimates for the models using the vectors log(cov(a_{j})), log(0.5 · cov(a_{j})), and log(1.5 · cov(a_{j})), and the robust scale estimate S_{n} for x_{j}. For simplicity, we assume that the number of reads produced from a fragment is distributed as Pois(N), where N is given by Eq. (1). Since we applied the log transformation, which is much stronger than the square root transformation (which keeps the variance approximately constant), we have to use correction factors for the variances. We can estimate the correction factors using a simulation procedure as \(\hat \sigma ^{2}(x_{jDel}) = 2 \cdot \hat \sigma ^{2}(\log (x_{j})), \hat \sigma ^{2}(x_{jDup}) = \frac {2}{3} \cdot \hat \sigma ^{2}(\log (x_{j}))\), where x_{jDel} and x_{jDup} denote the expected transformed coverages for M_{HetDel}(a_{j}) and M_{Dup}(a_{j}), respectively.

4.
For each region we ask three questions in sequence: a) can the region's vector of coverages be produced by the probabilistic models M_{HetDel} or M_{Dup}? b) is it highly probable that the region was produced by M_{Normal}? c) is M_{HetDel} or M_{Dup} the most probable explanation for the observed coverage of the region?
In order to answer the first question (a), we test whether each amplicon's coverage can be produced by these models, using the χ^{2} statistic with a low level of significance α (0.05 divided by the expected number of amplicons altered in copy number, which can be roughly estimated as the total number of amplicons in the test dataset divided by 2, given our assumption on CNV frequency). If all amplicons located inside a CNV site can be produced by one of these models, we mark the site as "suspicious".
Next, we try to reject the hypothesis that the normalised coverages within each "suspicious" CNV site can be produced by M_{Normal}. We use the normalised Euclidean distance as a test statistic (2). It follows the χ^{2} distribution because we can assume that the estimated covariance matrix is approximately diagonal after the normalisation step. The false discovery rate is controlled using the Benjamini–Hochberg procedure. The obtained adjusted p-values serve as the answer to the second question.
Finally, we check whether the region's Bayes factor (where H_{0} is the absence of a CNV and H_{1} is one of the CNV models) is larger than some constant value. Using prior knowledge of CNV frequency in our dataset, we chose 100 as the threshold and assumed equal prior probabilities of the models. The graphical representation is depicted in Fig. 3.

5.
A larger control set leads to more accurate statistical estimates and increases the quality of the output, so if some samples from the test set are found to be free of CNVs, we include them in the control set and repeat the procedure from the first step. Otherwise, we provide the final report on CNVs at CNV-site resolution (currently, we use exons as potential CNV sites for biological reasons).
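Steps 2–4 can be sketched for a single amplicon as follows. This is a hypothetical illustration, not CONVector's API: function names and parameters are assumptions, the models are plain Gaussians on the log scale with the variance correction factors from step 3, and a simple two-model Bayes factor with equal priors stands in for the full three-question decision.

```python
import math

def normalise(cov_aj, cluster_covs):
    """Step 2: subtract the mean log-coverage of the cluster C."""
    baseline = sum(math.log(c) for c in cluster_covs) / len(cluster_covs)
    return math.log(cov_aj) - baseline

def build_models(wild_type_cov, cluster_covs, log_var):
    """Step 3: (mean, variance) per model; zygosity factors shift the mean
    on the log scale, and the correction factors 2 and 2/3 rescale the
    wild-type log-scale variance for HetDel and Dup, respectively."""
    x_normal = normalise(wild_type_cov, cluster_covs)
    return {
        "Normal": (x_normal, log_var),
        "HetDel": (x_normal + math.log(0.5), 2.0 * log_var),
        "Dup":    (x_normal + math.log(1.5), (2.0 / 3.0) * log_var),
    }

def log_density(x, mean, var):
    """Log of the Gaussian density."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def bayes_factor(x, models, alt="HetDel"):
    """Step 4c: K = P(x | M_alt) / P(x | M_Normal); K > 1 favours the CNV."""
    return math.exp(log_density(x, *models[alt]) - log_density(x, *models["Normal"]))
```

For example, with a wild-type coverage of 1000 against a cluster averaging 1000, a sample covered at 500 lands on the HetDel mean and yields a Bayes factor far above the threshold of 100, while a sample at 1000 yields K < 1.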
Normalised Euclidean distance (NED) can be defined as:
$$ d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{N} \frac{(x_{i} - y_{i})^{2}}{{s_{i}^{2}}}} $$
(2)
where we used the S_{n} estimator of the standard deviation of the x_{i}−y_{i} values as s_{i}, and y is a vector of location estimates. We used the median of Walsh averages as the location estimate because of its robustness and efficiency.
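Eq. (2) and the median of Walsh averages (the Hodges-Lehmann location estimate) can be sketched as below. This is an illustration with assumed names; the per-coordinate scales s_i are passed in directly here, whereas the paper computes them with the S_{n} estimator.

```python
import statistics

def walsh_median(x):
    """Hodges-Lehmann estimate: median of all pairwise averages
    (x_i + x_j) / 2 with i <= j."""
    averages = [
        (x[i] + x[j]) / 2.0
        for i in range(len(x))
        for j in range(i, len(x))
    ]
    return statistics.median(averages)

def ned(x, y, s):
    """Normalised Euclidean distance of Eq. (2): each squared coordinate
    difference is divided by its own scale estimate s_i."""
    return sum((xi - yi) ** 2 / si ** 2 for xi, yi, si in zip(x, y, s)) ** 0.5
```

With unit scales, `ned` reduces to the ordinary Euclidean distance, so `ned([3, 4], [0, 0], [1, 1])` is 5.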
Input data quality control
Samples with low coverage may lead to unreliable results due to the high influence of stochastic effects on amplification and sequencing; we therefore filtered out samples containing fewer than 15,000 reads. We use a quality control procedure for amplicons as well: if an amplicon has an average coverage of less than 50 or does not have the required number of correlating amplicons for prediction in the unsupervised algorithm, we mark this amplicon as QC-failed and exclude it from consideration. If a whole exon is covered only by amplicons that did not pass QC, we exclude this exon from the analysis in all samples.
After the initial filtering, the algorithm defines all amplicons whose coverage is close to zero as homozygous deletions. We use a coverage of 10 as the threshold for homozygous deletion, since in practical cases the coverages of homozygous deletions do not exceed this limit. The algorithm then filters out samples that are not similar to the other samples in the dataset. We call such samples "irregular" because they have no analogues in amplification behavior as expressed in coverage. Statistical estimation on irregular samples and determination of their CNV states are inaccurate. The presence of a large CNV leads to abnormal coverage ratios; therefore, only samples deviating in more than n/2 genes, where n denotes the overall number of genes (n > 2), are considered irregular. The assumption is that the presence of n/2 large-scale CNVs across n genes in one sample is rare.
This quality control can be divided into several steps. After read counting, we perform library size normalisation, taking into account that the total number of reads can be influenced by a large CNV, which would bias the normalised data. To address this, normalisation is performed within genes: we divide each amplicon's coverage by the total number of reads aligned to the corresponding gene. For simplicity, we assume that the logarithms of the obtained values are normally distributed.
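The within-gene normalisation can be sketched in a few lines (names are illustrative, not CONVector's API): each amplicon's coverage is divided by the read total of its own gene, so a large CNV in one gene cannot bias the normalisation of the others.

```python
def normalise_within_gene(gene_coverages):
    """gene_coverages: dict mapping gene -> list of amplicon coverages.
    Returns per-gene coverage fractions that sum to 1 within each gene."""
    result = {}
    for gene, covs in gene_coverages.items():
        total = sum(covs)
        result[gene] = [c / total for c in covs]
    return result
```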
Next, we calculate the NED (2) on these normalised log-transformed data for each gene separately. Working with real data, we found that a significant proportion of amplicons can have extreme values, which leads to an unrealistically high NED for the whole gene. Because a small proportion of outliers does not corrupt the further analysis, thanks to the robustness of the estimators used, we use only the 80 % of coordinates (rounded up) with the smallest values for the NED calculation.
We compare the obtained NED value with the χ^{2} distribution with degrees of freedom equal to 80 % of the initial number of amplicons within the gene. A sample fails quality control if the test statistics for more than \(\frac {n}{2}\) genes exceed the predefined level of significance.