Our aim is to build a combined caller using the mutation outputs generated by multiple callers from the same paired tumor-normal sequence data (BAMs; [11]), when the mutation calls have been impartially validated. For illustration purposes, we assume K = 3 callers (Callers A, B, and C) are used for mutation calling. The most basic and key information available in each mutation output is the list of positions detected as point mutations. A mutation output may include additional features such as the mutation substitution type, a mutation quality score, and perhaps details of the filters applied to remove artifactual or low-quality variants. When the raw sequence data are available, genomic features can be computed for each mutation site, such as the sequencing depth and the variant allele fraction (the fraction of reads carrying the variant allele) in the tumor and normal samples. The more information is available, the more powerful the combined callers that can be constructed.
Fitting logistic models using the call status and genomic features
Stacked generalization was first introduced in the neural network community [12] and later adapted to the statistics literature [9], as a systematic way to combine classifiers.
Given a set of calls c_{ik} ∈ {0,1} for sites 1 ≤ i ≤ n and callers 1 ≤ k ≤ K, stacking aims at building a linear function of the calls for each site i which predicts its true status y_{i} as accurately as possible. In other words, we represent each site by its K calls from the different callers, and learn a new classifier of mutation sites in this feature space. Formally, given a set of n sites with known calls c_{ik} for all callers and known true status y_{i}, a linear stacking approach would solve:

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^K} \sum_{i=1}^{n} \Big( y_i - \sum_{k=1}^{K} \beta_k c_{ik} \Big)^2, \qquad (1)$$
i.e., a linear regression in the call space, estimating weights β_{k} such that a linear combination of the calls based on these weights is close to the true mutation status. The mutation status of a new site c_{i} = (c_{i1}, …, c_{iK}), defined by its calls from the K individual callers, would then be predicted via

$$f(c_i) = \sum_{k=1}^{K} \hat{\beta}_k\, c_{ik}. \qquad (2)$$
In practice, we use a logistic model rather than a linear one, because it is better suited to binary classification [8]: we only have binary mutation status {0,1}, not scores or continuous confidence measures. Our estimator therefore becomes:

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^K} \sum_{i=1}^{n} \log\Big( 1 + e^{-(2y_i - 1) \sum_{k=1}^{K} \beta_k c_{ik}} \Big), \qquad (3)$$

where 2y_{i} − 1 ∈ {−1, +1} recodes the binary status y_{i} ∈ {0,1}.
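As a concrete illustration, this logistic stacking step can be sketched with scikit-learn on simulated calls; the caller accuracies, sample size, and random seed below are arbitrary choices for a toy example, not values from our experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, K = 200, 3                              # hypothetical: 200 sites, 3 callers
y = rng.integers(0, 2, size=n)             # true mutation status y_i

# Simulate binary calls c_{ik}: each caller agrees with the truth
# with a different (assumed) accuracy.
acc = np.array([0.9, 0.8, 0.7])
C = np.where(rng.random((n, K)) < acc, y[:, None], 1 - y[:, None])

# Stacking: a logistic regression of y_i on the K calls, as in (3).
stacker = LogisticRegression()
stacker.fit(C, y)

# Predicted probability that a site called by callers 1 and 2 only
# is a true somatic mutation.
p = stacker.predict_proba(np.array([[1, 1, 0]]))[0, 1]
```

The learned coefficients play the role of the weights β_{k}: more accurate callers should receive larger weights, so their calls move the predicted probability more.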
If the features c_{ik} are binary, which is the case if the individual callers return binary decisions rather than continuous scores, the resulting classifier f(c_{i}) is the sum of the weights β_{k} of the callers which classified site i as a somatic mutation. It can therefore take at most 2^{K} − 1 distinct values on sites called by at least one caller. Each of these values corresponds to a unique combination of calls by the individual methods, which in turn corresponds to one of the disjoint subsets defined by the Venn diagram discussed in Section ‘Cumulatively adding mutation sets based on combination call status’. If the effects of callers are additive, then the ranking of the sites defined by f is expected to behave essentially like the more naive one defined in Section ‘Cumulatively adding mutation sets based on combination call status’.
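This counting argument can be sketched directly, using purely hypothetical weights β_{k}: enumerating all call patterns with at least one positive call shows that, for K = 3, f takes at most 2^{3} − 1 = 7 values.

```python
import itertools
import numpy as np

# Illustrative weights for K = 3 callers (hypothetical values, not
# estimated from data).
beta = np.array([1.2, 0.8, 0.5])

# All 2^K - 1 call patterns with at least one positive call.
patterns = [pat for pat in itertools.product([0, 1], repeat=3) if any(pat)]

# f(c) = sum of the weights of the callers that called the site; each
# pattern corresponds to one cell of the 3-way Venn diagram.
scores = {pat: float(np.dot(beta, pat)) for pat in patterns}
```

With these particular weights the 7 values are all distinct, so f induces a complete ranking of the Venn-diagram cells.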
The estimators defined by (1) and (3) combine the individual callers uniformly for all sites. It is however conceivable that some callers perform better for some types of sites, e.g., those with low coverage, and less well for others. We now assume that some descriptors g_{ij}, 1 ≤ j ≤ p, of each site i are available besides the detection status of the three callers and the validation status. These descriptors could typically be genomic features.
Feature-weighted linear stacking (FWLS, [13]) replaces each parameter β_{k} of the stacking regression estimator (3) by a linear combination of the descriptors g_{ij}:

$$\beta_k(i) = \sum_{j=1}^{p} \alpha_{jk}\, g_{ij}, \qquad (4)$$
where the α_{jk} parameters are weights capturing how relevant feature g_{ij} is for assessing how predictive caller k is at site i. The weights β_{k}(i) are therefore site-specific, accounting for the fact that the relevance of a particular caller k may differ between sites with different genomic features.
Plugging the weights (4) into the linear function (2) yields a different set of coefficients for each site i:

$$h(c_i, g_i) = \sum_{k=1}^{K} \Big( \sum_{j=1}^{p} \alpha_{jk}\, g_{ij} \Big) c_{ik} = \sum_{k=1}^{K} \sum_{j=1}^{p} \alpha_{jk}\, g_{ij}\, c_{ik}.$$

h is now a linear function of the K×p products of features g_{ij} and calls c_{ik}, so FWLS equivalently amounts to:

- describing each site by this extended set of features, and
- estimating a linear classifier of mutation sites in this space.
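The expanded feature construction can be sketched as follows; the dimensions and descriptors are hypothetical, and the broadcasting trick is just one convenient NumPy idiom for forming the K×p products g_{ij} c_{ik}.

```python
import numpy as np

# Hypothetical dimensions: K = 3 callers, p = 2 genomic descriptors
# per site (e.g. coverage and variant allele fraction).
n, K, p = 5, 3, 2
rng = np.random.default_rng(1)
C = rng.integers(0, 2, size=(n, K)).astype(float)  # calls c_{ik}
G = rng.random((n, p))                             # descriptors g_{ij}

# FWLS feature map: all K*p products g_{ij} * c_{ik} for each site i,
# built by broadcasting an outer product per site and flattening.
X = (G[:, :, None] * C[:, None, :]).reshape(n, K * p)
```

Each row of X describes one site in the expanded feature space; column j·K + k holds the product g_{ij} c_{ik}.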
Formally, after plugging (4) into our stacking estimator (3), we see that FWLS solves:

$$\hat{\gamma} = \operatorname*{arg\,min}_{\gamma \in \mathbb{R}^{Kp}} \sum_{i=1}^{n} \log\Big( 1 + e^{-(2y_i - 1)\, \gamma^{\top} x_i} \Big), \qquad (5)$$

where x_{i} = (g_{ij} c_{ik})_{1≤j≤p, 1≤k≤K} contains all the products of calls and genomic features for site i. The K×p parameters γ_{l} are the weights of the logistic regression. They are strictly equivalent to the α_{jk} parameters of (4); we use the new notation only to emphasize that FWLS can be formulated as a regular logistic regression estimator in an expanded feature space.
In the experiments of this paper, we consider all combinations of call status defined in Section ‘Cumulatively adding mutation sets based on combination call status’, i.e., all products of single calls rather than the single calls themselves. Technically this can still be cast as a FWLS model, by adding all single calls and products of single calls to the set of features g_{ij}. In practice, our implementation relies on (5), i.e., on a logistic regression in an expanded feature space.
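Building all products of single calls, one per non-empty subset of callers, can be sketched as below; the calls are simulated and the subset enumeration is one possible implementation of the combination call status.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, K = 6, 3                      # hypothetical: 6 sites, 3 callers
C = rng.integers(0, 2, size=(n, K))

# One feature per non-empty subset S of callers: prod_{k in S} c_{ik},
# the indicator that every caller in S called site i. These 2^K - 1
# columns correspond to the combination call status.
subsets = [s for r in range(1, K + 1)
           for s in itertools.combinations(range(K), r)]
P = np.column_stack([C[:, list(s)].prod(axis=1) for s in subsets])
```

The last column, for the full subset, is 1 exactly when all K callers agree on a positive call.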
Finally, since the resulting feature space can become large, we choose to use an ℓ_{1}-penalized version of (5):

$$\hat{\gamma} = \operatorname*{arg\,min}_{\gamma \in \mathbb{R}^{Kp}} \sum_{i=1}^{n} \log\Big( 1 + e^{-(2y_i - 1)\, \gamma^{\top} x_i} \Big) + \lambda \|\gamma\|_1. \qquad (6)$$

Penalizing the ℓ_{1} norm ‖γ‖_{1} of the parameter vector is known to lead to sparse estimators [14], and λ is used to adjust the level of sparsity.
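An ℓ_{1}-penalized logistic regression of this kind can be fit, e.g., with scikit-learn; note that its C parameter is the inverse of the penalty strength λ. The synthetic data below, in which only two features carry signal, are only meant to illustrate the induced sparsity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, d = 300, 20
X = rng.standard_normal((n, d))
# Only the first two features truly matter in this synthetic example.
logit = 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# L1-penalized logistic regression; a small C (large lambda) drives
# many coefficients exactly to zero.
sparse_fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
sparse_fit.fit(X, y)
n_nonzero = int(np.sum(sparse_fit.coef_ != 0))
```

Increasing λ (decreasing C) prunes more of the K×p expanded features, keeping the combined caller interpretable even when many caller-feature products are included.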