Combining calls from multiple somatic mutation-callers
© Kim et al.; licensee BioMed Central Ltd. 2014
Received: 16 February 2014
Accepted: 12 May 2014
Published: 21 May 2014
Accurate somatic mutation-calling is essential for insightful mutation analyses in cancer studies. Several mutation-callers are publicly available and more are likely to appear. Nonetheless, mutation-calling is still challenging and there is unlikely to be one established caller that systematically outperforms all others. Therefore, fully utilizing multiple callers can be a powerful way to construct a list of final calls for one’s research.
Using a set of mutations from multiple callers that are impartially validated, we present a statistical approach for building a combined caller, which can be applied to combine calls in a wider dataset generated using a similar protocol. Using the mutation outputs and the validation data from The Cancer Genome Atlas endometrial study (6,746 sites), we demonstrate how to build a statistical model that predicts the probability of each call being a somatic mutation, based on the detection status of multiple callers and a few associated features.
The approach allows us to build a combined caller across the full range of stringency levels, which outperforms all of the individual callers.
Somatic mutations are genetic changes that occur in somatic cells after conception. Cancer is driven by such somatic alterations, and thus cataloging somatic mutations is essential to understanding the genetic bases of cancer development. With the burst of high-throughput sequencing data generated in recent years, extensive efforts have been made towards accurate somatic mutation-calling. Many calling algorithms are now publicly available, including Shimmer, MuTect, Strelka, MutationSeq, JointSNVMix, and SomaticSniper. Additional in-house callers are likely to be under development for on-going studies. Nonetheless, many challenges remain to be addressed, for example, removing artifactual variants from multiple sources, detecting rare variants in highly heterogeneous tumor samples, and detecting variants at shallow sequencing coverage. Every caller must tackle these issues, but different callers are likely to be more successful on some of them and less so on others. As a consequence, finding the single best-performing caller is not easy, and perhaps not even feasible.
Having multiple callers at hand, anyone conducting a mutation analysis may want to apply all of them to his/her data with the aim of later constructing a list of final calls. In essence, combining calls from multiple callers amounts to developing a strategy for sorting the calls to be included as final calls. This can be done effectively if one can systematically assign, across the full list, a confidence measure of each call being a somatic mutation. In general, pursuing this goal requires a validation dataset to some extent. For example, Lower et al. presented a method to prioritize calls from three methods by assigning false discovery rate confidence values, but it requires the independent sequencing of at least one of the tumor or normal samples.
In our work, we consider a situation in which mutation-calling is done (by multiple callers) for many tumor-normal sequence pairs across large genomic regions such as the whole genome or exome, but only a limited resource is available for validation. For example, in practice, often only a small fraction of detected mutations can be validated, or a small subset of regions in a selected list of samples is re-sequenced for evaluation purposes. We aim to build a combined caller that is learned from the relatively small validation dataset but can be applied to a wider dataset generated under a similar protocol.
A large corpus of the statistical literature is dedicated to combining individual learners, see e.g. Chapter 16 of ; however, most of these methods, e.g., boosting, bagging and random forests, are based on building individual learners from descriptors rather than on combining the outputs of algorithms. Stacking was introduced as a means of combining such outputs. In this paper, we exploit this well-established framework to merge the outputs of different callers.
Specifically, we present a statistical approach for combining calls from multiple somatic mutation-callers, when validation is impartially done for all mutations detected by all callers in a selected set of regions or samples. For 194 tumor-normal exome-seq pairs from The Cancer Genome Atlas (TCGA) endometrial study, single nucleotide variant (SNV) type mutations (i.e., point mutations) were detected by three somatic mutation-callers. Validation through an independent re-sequencing was impartially done for all the mutations detected from 20 selected patients across the whole exome and for those mutations detected within 243 genes of interest across all 194 patients. We used these data to show how our statistical approach improves on individual callers and on a naive combination based on caller intersection. We also show that this improvement is maintained when the parameters of the model are estimated on a set of samples or regions different from the ones on which the performance is evaluated.
Our aim is to build a combined caller using the mutation outputs generated by multiple callers based on the same paired tumor-normal sequence data (BAMs), when the mutation calls are impartially validated. For illustration purposes, we assume K=3 callers (Caller A, B, and C) are used for mutation-calling. The most basic and key information available in each mutation output is the list of positions detected as point mutations. A mutation output may include additional features such as mutation substitution type, mutation quality score, and perhaps details of filters applied to remove artifactual or low-quality variants. When the raw sequence data are available, genomic features can be computed for each mutation site, such as the sequencing depth and the variant allele fraction (the fraction of reads carrying the variant allele) for each tumor and normal sample. The more information that is available, the more powerful the callers that can be constructed.
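For instance, the depth and variant allele fraction mentioned above can be computed per site and per sample from read counts. A minimal Python sketch (the function name and inputs are illustrative, not from the paper):

```python
# Illustrative sketch: two simple genomic features for one sample at one
# mutation site, derived from counts of reads supporting the reference
# and variant alleles.

def site_features(ref_reads, alt_reads):
    """Return (sequencing depth, variant allele fraction) for one site."""
    depth = ref_reads + alt_reads
    vaf = alt_reads / depth if depth > 0 else 0.0
    return depth, vaf
```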
Table: Validation results of the seven disjoint mutation sets shown in Figure 1 (sets defined by combination call status: Caller A only; Caller B only; Caller C only; Callers A and B only; Callers A and C only; Callers B and C only; all three callers).
If the features c_ik are binary, which is the case if the individual callers return binary decisions rather than continuous scores, the resulting classifier f(c_i) is the sum of the weights β_k over the callers that classified site i as a somatic mutation. It can only take 2^K - 1 distinct values on sites called by at least one caller. Each of these values corresponds to a unique combination of calls by the individual methods, which in turn corresponds to one of the disjoint subsets defined by the Venn diagram discussed in Section ‘Cumulatively adding mutation sets based on combination call status’. If the effects of the callers are additive, then the ranking of the sites defined by f is expected to essentially behave like the more naive one defined in Section ‘Cumulatively adding mutation sets based on combination call status’.
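For illustration (not from the paper), a small Python sketch of this sum-of-weights classifier, with hypothetical caller weights, enumerating all 2^K - 1 call combinations for K=3:

```python
# Sketch of the linear combination of binary calls: with K=3 callers,
# f(c_i) = sum_k beta_k * c_ik, which takes at most 2^K - 1 = 7 distinct
# values over sites called by at least one caller.
from itertools import product

def combined_score(calls, betas):
    """Linear score for one site given binary calls (0/1) per caller."""
    return sum(b * c for b, c in zip(betas, calls))

betas = [1.0, 0.5, 2.0]  # hypothetical weights, not estimates from the paper
scores = {combined_score(c, betas)
          for c in product([0, 1], repeat=3) if any(c)}
```

With these weights all seven sums are distinct, so the score induces a complete ordering of the seven disjoint subsets.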
The estimators defined by (1) and (3) combine the individual callers uniformly across all sites. It is, however, conceivable that some callers perform better for some types of sites, e.g., those with low coverage, and less well for others. We now assume that some descriptors g_ij, 1 ≤ j ≤ p, of each site i are available besides the detection status of the three callers and the validation status. These descriptors would typically be genomic features.
where the α_jk parameters are weights corresponding to the relevance of feature g_ij for measuring how predictive caller k is for site i. The weights β_k are therefore site-specific, accounting for the fact that the relevance β_k of a particular caller k may differ between sites with different genomic features.
Plugging weights (4) into the linear function (2) yields a different set of coefficients for each site i: h(c_i, g_i) = Σ_k Σ_j α_jk g_ij c_ik. h is now a linear function of the K×p products of the features g_ij and the calls c_ik, so FWLS equivalently amounts to:
describing each site by this extended set of features, and
estimating a linear classifier of mutation sites in this space.
where the expanded feature vector for site i contains all the products of calls and genomic features. The K×p parameters γ_l are the weights of the logistic regression. They are strictly equivalent to the α_jk parameters of (4); we use them only to emphasize that FWLS can be formulated as a regular logistic regression estimator in an expanded feature space.
In the experiments of this paper, we consider all combinations of call status defined in Section ‘Cumulatively adding mutation sets based on combination call status’, i.e., all products of single calls rather than the single calls themselves. Technically this can still be cast as a FWLS model, by adding all single calls and products of single calls to the set of features g_ij. In practice, our implementation relies on (5), i.e., on a logistic regression in an expanded feature space.
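As a concrete illustration of this expanded-feature formulation, the following Python sketch builds the K×p product features g_ij · c_ik and fits a logistic regression on them. The data are synthetic and the paper's actual implementation is in R (glmnet); this sketch only shows the shape of the computation:

```python
# Sketch of FWLS as logistic regression in an expanded feature space:
# each feature is the product of a genomic descriptor g_ij and a
# caller's binary call c_ik. Synthetic data for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, K, p = 200, 3, 2
calls = rng.integers(0, 2, size=(n, K))   # c_ik: binary call status
feats = rng.normal(size=(n, p))           # g_ij: genomic descriptors

# expanded features: all K*p products g_ij * c_ik for each site i
X = (calls[:, :, None] * feats[:, None, :]).reshape(n, K * p)
y = (calls.sum(axis=1) >= 2).astype(int)  # synthetic validation status

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]      # P(site i is a somatic mutation)
```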
Penalizing the ℓ1 norm of the parameter vector is known to lead to sparse estimators, and the penalty weight λ is used to adjust the level of sparsity.
The approach of building a combined caller by taking intersections or unions (Section ‘Taking intersections or unions’) does not require a training set, and evaluation of the caller can be done straightforwardly on a test set. The approach that cumulatively adds disjoint subsets (Section ‘Cumulatively adding mutation sets based on combination call status’) uses a training set to determine the order of subsets (by computing the validation rate of each subset), and evaluates the performance on a test set using that order. For the approach building a caller by fitting a logistic model (Section ‘Fitting logistic models using the call status and genomic features’), a training set is used to estimate the γ_l parameters of (6). In order to choose the hyperparameter λ, we perform 10-fold cross-validation on the training set for each candidate λ to estimate the error of the associated model. Then the most parsimonious model whose error is no more than one standard error above the error of the best model is chosen. Once λ is selected, we re-estimate γ_l using this λ on the whole training set, and evaluate its performance on the test set. Experiments were conducted using the R package glmnet, which implements penalized GLMs, in particular the ℓ1-penalized logistic regression of which (6) is an instance. The R scripts that contain our detailed implementation are included as Additional file 1.
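The one-standard-error rule described above can be sketched as follows. This is a Python illustration with hypothetical cross-validation results (the paper's implementation uses the R package glmnet, where this rule corresponds to the lambda.1se choice); larger λ means a sparser model:

```python
# One-standard-error rule: among candidate penalty weights, pick the
# most parsimonious (largest lambda) model whose cross-validation error
# is within one standard error of the best model's error.
import numpy as np

def one_se_lambda(lambdas, cv_errors, cv_ses):
    """lambdas, cv_errors, cv_ses: per-candidate CV results."""
    lambdas = np.asarray(lambdas, dtype=float)
    cv_errors = np.asarray(cv_errors, dtype=float)
    cv_ses = np.asarray(cv_ses, dtype=float)
    best = np.argmin(cv_errors)
    threshold = cv_errors[best] + cv_ses[best]
    # sparsest model whose error does not exceed the threshold
    return lambdas[cv_errors <= threshold].max()
```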
We have used the mutation datasets generated for the TCGA endometrial study. For 194 tumor-normal Illumina exome-sequence pairs, somatic-mutation calling was done by three centers whose algorithms are referred to here as Caller A, B, and C. In total, 51,648 single nucleotide variant (SNV) type mutations were detected. A large fraction of the mutations were targeted for custom capture validation. As explained in the Additional file 2: Supplementary Methods, these sites were captured using the NimbleGen technology and then re-sequenced independently using an Illumina HiSeq 2000. In particular, impartial validation (i.e., validating all calls from all callers) was carried out for all mutations in (1) a randomly selected 20 patients and (2) an additional 243 genes of interest from the remaining 174 patients. Validation status was successfully determined for all but a small fraction (less than 5%) of the validated mutations. For more details about the validation and determining the validation status, see Additional file 2: Supplementary Methods. Our final dataset consists of the successfully validated mutations: (1) 4,438 sites in the selected 20 patients and (2) an additional 2,308 sites within the 243 genes of interest. Note that almost all of these sites (> 95%) are included as example datasets in our software package (Additional file 1).
For each point mutation site in our final dataset, we know the validation status (‘somatic’ or ‘non-somatic’), the call status (i.e., whether or not it was detected) by each of the three callers, the mutation substitution type (combination of the reference allele and the variant allele), and the sequencing depth and the variant allele fraction in each tumor and normal sample based on the exome sequence data that was used for mutation-calling. A brief summary of our dataset is included as Table 1, Additional file 2: Table S1 and Figures S1–S4. Caller B provided more information besides the positions of the detected mutations. For a broader set of somatic variants (candidate mutations), it reported the mutation quality score as well as the pass/fail status of individual filters at each site. Although the detailed description of each filter was not available, the filter outcomes were available (Additional file 2: Table S2), which we were able to use for improving Caller B’s performance (Section ‘Improving a single caller’s performance using details of its filters’). In Section ‘Building and evaluating combined callers’, we demonstrate how to build a combined caller using the calling status of the three individual mutation callers and a few genomic features. In Section ‘Improving a single caller’s performance using details of its filters’, we show the potential for improving the performance of an individual caller using more detailed outputs, using Caller B as an instance.
We first used the mutations detected from the 20 selected patients (total: 4,438) to build and evaluate combined callers. Assuming (for illustrative purposes) that the characteristics of our mutations are not affected by sample-specific features, we randomly split the data into 50% training and 50% test sets. Other fractions were explored, but the qualitative conclusions were similar as long as there was enough data to train the model, e.g., more than 20% of the total.
The performances of individual callers and combined callers are summarized in Figure 2. Note that validation was done only for the mutations detected by at least one of the three callers, and therefore the union of all mutations comprises all true positives and all false positives. The results of the three individual callers are given as three points with different false positive rates, i.e., different stringency levels. Caller A is the most liberal in the sense that it detected many false positives (FP rate of 68%) but also detected most of the true positives (TP rate of 96%). Caller C has a very small FP rate (4%) but detected only 67% of the true positives. Caller B performs worse than Caller C, since it detected not only more false positives but also fewer true positives. The performance of the callers taking unions or intersections of the calls is marked as another set of points, inside the sub-panel on the lower right part of the main panel. The stringency levels of these callers are not necessarily ordered. For example, the set of mutations called by two or more callers (2orMore) is nested within any intersection of two callers (AB, AC, or BC), but no ordering exists among the latter three intersections. In contrast, the performance of the caller adding mutation sets cumulatively is shown as a connected set of blue dots because of the natural ordering determined by the validation rates. In reality, the ordering may not be the same between the training set and the test set. When the validation rates are very similar among the mutation subsets, or the number of mutations in each set is very small, sampling variation could easily result in a different ordering. In the training set, the validation rates of the mutation set called by A and C but not B, and of the set called by A and B but not C, are 97.99% and 97.96%, respectively.
Overall, our combined caller obtained by fitting a logistic model outperforms the individual callers and the other naive combinations. The ROC curve of this combined caller lies above all of the points representing the performance of individual callers, although sometimes only slightly so. Further, the combined caller allows us to assess the performance across the full range of stringency levels.
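This full-range evaluation comes from thresholding the combined caller's predicted probabilities: each threshold yields one (FP rate, TP rate) point, and sweeping the threshold from stringent to liberal traces the ROC curve. A minimal Python sketch (variable names are illustrative, not from the paper):

```python
# Sketch: one ROC point for a given probability threshold. probs are the
# combined caller's predicted probabilities; truth are the 0/1 validation
# statuses of the same sites.

def roc_point(probs, truth, threshold):
    """TP and FP rates when calling every site with prob >= threshold."""
    called = [p >= threshold for p in probs]
    tp = sum(c and t for c, t in zip(called, truth))
    fp = sum(c and not t for c, t in zip(called, truth))
    tpr = tp / sum(truth)
    fpr = fp / (len(truth) - sum(truth))
    return tpr, fpr
```

Sweeping the threshold from 1 down to 0 moves from the most stringent to the most liberal combined caller.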
For Caller B, mutation quality scores as well as the outcomes of individual filters were available for a broader set of somatic variants. (Note that for each caller, the detected mutations are the somatic variants that passed all the filters implemented by that caller.) In Figure 2, the performance of Caller B was shown as a single point. Here, we demonstrate how such extra details besides the call status can be used to improve the performance. Furthermore, to prove the validity of our approach in a wider dataset, we trained and tested on two different mutation datasets that were generated for the TCGA endometrial study using the same mutation calling algorithms, but constructed from different genomic regions as well as different tumor and normal samples. Specifically, we trained a model on the mutations from the 243 genes of interest from 174 patients (our second dataset described at the beginning of Section ‘Results’), then evaluated on the mutations from the whole exomes of the 20 patients (first dataset). A similar analysis was performed with the roles of the two datasets switched (Additional file 2: Figure S5).
By utilizing the outcomes of individual filters, Caller B’s performance has improved substantially (Figure 3). At a false positive rate of 33%, the true positive rate increases from 63% to 78%, detecting 520 more mutations. This highlights the importance of having the full details of all features involved in the final decision on a variant.
Furthermore, if similar details were available for Callers A and C, then we could generalize the logistic model of the previous section (Section ‘Building and evaluating combined callers’) to include the outcomes of individual filters from all callers, which could lead to higher power as well as better insight into the causes of mutation-calling errors.
In this paper, we present an approach for effectively building a combined caller using the outputs from three mutation callers. Our approach is valid with more than three callers or with less concordant mutation-call outputs, as long as impartial validation data are available for all calls from all mutation callers as training data, and the relative performance of the individual callers is expected to be consistent between the training set and the test set. The combining approach could be even more beneficial if the individual callers agreed less, assuming (i) they all had comparable individual performance and (ii) the set of loci on which each caller is right could be characterized in terms of genomic features. In this case, the FWLS approach could learn the type of locus on which each caller is typically right and output the best answer for each new locus, resulting in more accurate calling.
We have analyzed mutation sites that were successfully validated based on the criteria described in Additional file 2: Supplementary Methods. Those validation criteria may not be perfect, but we found them reasonable for demonstrating our approach. Changes in the validation criteria can result in changes in the individual callers’ performances and thus in the final model estimated. For example, more stringent criteria are likely to treat all very rare mutations as false calls, and thus, in our exercise, may reduce the sensitivity of Caller A to a large extent. However, our approach still provides a convenient framework for building the best combined model, given any validation status. In practice, determining validation status based on independent sequencing data can be very challenging, and developing a highly accurate validation method is itself another research topic. Working on better validation is out of the scope of this paper, but if the uncertainty in the validation could be quantified, it could be used in the logistic model fitting to give more weight to more accurate calls.
In practice, an effective validation strategy is essential for building a successful model. In principle, a training dataset is supposed to contain all sites characterizing a wider dataset for which one wishes to apply the estimated model. Therefore, a validation dataset needs to include enough sites to learn the behaviors of the mutation-calling algorithms across a broad spectrum of genomic features. Another important issue is to have impartially validated sites. If validation is done partially, then the composition of a training dataset is biased and thus the estimated parameters and the performance are also biased.
Our approaches provide a unified framework for dealing with multiple somatic-mutation callers. If the callers provide only the list of positions detected as mutations, then it is difficult to compare them, or to investigate the tradeoff between the stringency of the calling procedure and its power to detect true mutations. Our combined caller can be used to overcome these difficulties. It offers an evaluation of its performance across the full range of stringency levels as an ROC curve and, in addition, allows easy comparison with individual callers.
Furthermore, we have shown that it is feasible to build a combined caller that performs better than all the individual callers, one which could be better (even slightly) than a caller combining calls only based on the detection status. An even more powerful caller can possibly be built when more features associated with calling performance are available, such as individual details of the filters used by each caller or the measure of strand bias.
Finally, we demonstrate the potential for building a combined caller using a small validation dataset (generated for a subset of regions or samples in the original study), which can be applied to a wider dataset to assign a confidence measure for ranking the mutations from multiple callers. Our two mutation datasets, one from the selected 20 patients and the other from 243 genes of interest across 174 patients, share protocols (sample preparation, sequencing technology, alignment methods, and the applied mutation-calling algorithms) but differ in the genomic regions and in the tumor and normal samples used for calling. The results from training the model on one of the datasets and evaluating on the other suggest that the models estimated from these validation datasets are generally applicable to the mutations from the whole exomes of all 194 endometrial patients.
The Cancer Genome Atlas
Single nucleotide variant
Feature-weighted linear stacking
Receiver operating characteristic
We thank the TCGA mutation-calling group. Special thanks to David Haussler, Li Ding, David Wheeler, and Gad Getz for their leadership, and to Singer Ma, Cyriac Kandoth, and Kyle Chang for generating the mutation-calling outputs, particularly to Cyriac Kandoth for compiling the mutation outputs as well as the validation data. We would also like to thank Heidi Sofia and Kenna Shaw for coordination and valuable feedback, Paul Spellman for sharing computational facilities, and the members of the Speed lab for discussions and valuable comments.
The results published here are based upon data generated by The Cancer Genome Atlas project established by the NCI and NHGRI. Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at http://cancergenome.nih.gov.
We gratefully acknowledge support from NIH grant 5 U24 CA143799-04.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.