A grand challenge in biomedical research is improving diagnosis and therapy for various diseases by using genetic profiles from the patients. This can enable personalized medicine, where the therapy of a patient is influenced not only by the results of the conventional medical analysis, but also by the analysis resulting from the values of the individual genetic composition.
The mutagenetic trees mixture model introduced by Beerenwinkel et al. [1] provides an interpretable probabilistic framework for modeling multiple paths of ordered accumulation of permanent genetic changes. The model describes several alternative pathways of disease development, each in a different mixture component. Datasets used for estimating the mixture model consist of patterns of genetic events observed for single patients. From the mixture model, a genetic progression score (GPS) can be computed for each pattern [2] that incorporates correlations among genetic events and mean time intervals between their occurrences. The GPS gives an estimate of the progression of the disease process and can be used for specifying therapies or estimating survival times of the patients [2].
Both the mixture model itself and derived features like GPS values have been proven to improve the interpretation of disease progression and estimation of survival times in the context of two different diseases, namely HIV [1] and cancer [2]. However, due to the complexity of the mixture model, it is important to analyze the stability of the estimation process and of features that are calculated from the model.
In this manuscript, we present a simulation study for inspecting the stability of estimated mutagenetic trees mixture models. We compare simulated true mixture models with models fitted to observations drawn from these true models. The stability analysis refers to GPS values as well as to other features of the mixture model like induced probability distributions or tree topologies. A bootstrap method [3] is used for a more detailed analysis of the variability of the GPS values and for deriving GPS confidence intervals.
The EM-like learning algorithm used for estimating mutagenetic trees mixture models finds a starting solution by an initial clustering step. We extend the fitting procedure given in [1] by specifying different initial assignments of the responsibilities, i.e. a different starting solution for mixture model fitting. We introduce the diversity parameter d that describes the diversity of the initial tree components. Based on simulated models we specify an optimal value for d with regard to the reliability of the final estimated models.
HIV drug resistance
The mutagenetic trees mixture model can be used for modeling HIV evolution as a process of accumulation of mutations in the viral genome under drug pressure. The replication rate of the HI-Virus is extremely high. Under therapy, genetic mutations enable the virus to develop drug resistant mutants. The evolution of HIV drug resistance is conveyed by the erroneous reverse transcription during virus replication, the large diversity of the HIV population, and the natural selection of the fittest mutant under drug pressure. This results in a dynamic and highly adaptive virus population.
In this work we focus on HIV-1 drug resistance to the nucleoside reverse transcriptase (RT) inhibitor zidovudine. The most prevalent mutations in the HIV-1 genome that rise under zidovudine are M41L, D67N, K70R, L210W, T215F/Y, and K219E/Q [4, 5]. Typically, K70R and T215F/Y are the first mutations to occur [6]. The set of mutations 215F/Y, 41L, and 210W, also known as 215 – 41 pathway, occurs together. The same holds for 70R and 219E/Q (70 – 219 pathway). In our experiments we use the dataset from the Stanford HIV Drug Resistance Database [7] that comprises genetic measurements of 364 HIV patients treated only with the drug zidovudine. The data contains the six classical major zidovudine resistance mutations mentioned above.
Evolutionary tree models
In this section, we present the mutagenetic trees mixture models as means to model disease evolution. Furthermore, we describe the genetic progression score (GPS) derived from these models used for quantifying the stage of the disease.
Mutagenetic trees mixture models
The mutagenetic trees mixture model is a probabilistic model that can describe multiple pathogenetic routes of ordered genetic mutations in disease progression. A single mutagenetic tree is a weighted directed tree in which the genetic events are represented by nodes and weights on the edges correspond to the conditional probability of the child event happening given that the parent event has occurred. This tree structure provides a probabilistic model that annotates evolutionary paths of disease progression. A disadvantage of this model is that many subsets of genetic events cannot be explained by a single tree, since it can only represent those subsets of events which include events together with all their predecessors in the tree. All other subsets of events have likelihood zero in the probability distribution generated by a single tree model. A single tree also often fails to capture all diverse pathogenetic routes that can occur in disease progression. Formal details of the definition and estimation of single trees are presented in [8].
The mutagenetic trees mixture, proposed by Beerenwinkel et al. in [1], provides a probabilistic model that can capture multiple evolutionary processes conducting the disease evolution, each of them in a separate mixture component. We consider K mutagenetic trees T
k
, k = 1,...,K, on the set of vertices V = {0,...,l - 1} denoting l events with the null event r as root. The root comprises events that has initially occurred in all samples and accounts for the therapy naive patients. Moreover, it represents the starting point of disease progression. The tree T1 has a star-like topology and describes the genetic events as being independent of each other given the initial null event. According to the probability distribution it induces, the star tree assigns positive probability to all possible 2l-1patterns generated by a given set of l genetic events [1].
Formally, a mutagenetic trees mixture model M is a weighted sum of K mutagenetic trees , where α
k
∈ [0, 1] and . Every subset of genetic events, determined by the pattern x, has positive probability of being generated, given by:
Let V be the set of vertices corresponding to the genetic events, and let E = {(u, v) | u, v ∈ V} be a set of edges. For all edges e = (u, v) ∈ E, the edge weights indicate the conditional probability that the event v appears given that the event u has occurred, and are determined with the mapping p: E → [0, 1], such that:p(e) = Pr(X
v
= 1 | X
u
= 1).
The root vertex denoted by r is the node 0 and specifies the initial null event, such that Pr(X
r
= 1) = 1. The connected branching T = (V, E, p) formally captures the notion of a mutagenetic tree.
Let Ω = {0, 1}lbe the set of all possible patterns of genetic events of length l and x be an observation defined by the subset S ⊆ V of events that have occurred. Each observation x is a binary vector with ones indicating occurrence of mutations. For example, if the pattern (0, 0, 1, 0, 1) is associated with a specific patient this means that the genetic mutations 3 and 5 have occurred in the respective sample. The mutagenetic tree T = (V, E, p) induces a probability distribution on the set Ω. Accordingly, if there exists an edge subset E' ⊆ E, such that S is the set of vertices reachable from the root in the induced tree T' = (V', E', p), then the probability that the tree T generates the pattern x is:
Otherwise, the mutagenetic tree T cannot generate the pattern x, i.e. Pr(x | T) = 0.
Given the number of tree components K and a finite sample of N observations D = {x1,...,x
N
} of the binary vector X = (X1,...,X
l
) that indicates occurrence of subsets of genetic events, the mixture model can be estimated as follows. Assuming that for each sample the tree component from which that sample was generated is known, one can easily reconstruct the mixture model by using Edmonds' branching algorithm [9] K times on the respective observation sets. For large number of samples Edmonds' branching algorithm reconstructs the original mutagenetic tree with very high probability [9]. Since one does not know from which tree each sample was generated, one has to estimate it from the data. The goal is to find trees T1,...,T
K
and mixture parameters α1,...,α
k
that maximize the log-likelihood of the data. Having a mixture model, a standard procedure for maximum-likelihood estimation is the EM algorithm [10].
Let Δ1,...,Δ
K
be binary random variables where Pr(Δ
K
= 1) = α
k
. The responsibility of the k-th tree component T
k
for the i-th observation x
i
is the probability that x
i
was generated from T
k
given the mixture model M: γ
ik
= Pr(Δ
k
= 1 | x
i
, M). The EM-like algorithm for fitting mutagenetic trees mixture models is briefly described below. A detailed description is presented in Figure 1. The algorithm is only an "EM-like"-algorithm, since the tree reconstruction step with Edmonds' algorithm does not provide an exact maximum likelihood estimate. In practice, the solution of Edmonds' algorithm is close to the maximum-likelihood solution.
-
1.
Guess initial responsibilities: Run (K-1)-means clustering algorithm on the given dataset and assign samples to the corresponding (K – 1) clusters.
-
2.
Maximization-like step: Estimate the K tree components T1,...,T
K
from all patterns weighted with their responsibilities and compute the mixture parameters α1,...,α
k
.
-
3.
Expectation step: Recompute the responsibilities of the samples by using the tree components estimated in the previous step.
-
4.
Iterate the two previous steps until convergence.
The solution of an EM algorithm depends on the initial values used for the responsibilities. The mixture model tries to capture diverse paths of ordered genetic changes present in the data. Furthermore, only a small fraction of the dataset, that cannot be mapped to the nontrivial components of the mixture, should be mapped to the star component. The rest should be assigned to the nontrivial trees of the mixture model. In [1], the starting solution is determined by running a (K – 1)-means clustering algorithm on the set of observed patterns D by using the squared Euclidian distance as dissimilarity measure [11]. Here we propose initial assignments of the responsibilities depending on a diversity parameter d. This parameter controls the diversity of the initial tree components comprising the mixture model and, as a consequence, also the diversity of the components in the EM solution. The optimal value of d = 3 is chosen as a result of the simulation study presented in the Results section. This value is a good compromise between diversity of nontrivial branchings and quality of the fitted model.
The EM-like algorithm assumes that the number of trees K is known. The algorithm is efficient enough to estimate this model parameter in a cross-validation framework, as proposed in [1]. Yin et al. [12] introduce a modified Bayesian Information Criterion (BIC) for estimating the number of trees. The modified BIC incorporates a similarity measure for estimating the structural redundancy between tree components in the penalization term of the standard BIC. A mutagenetic trees mixture model estimated for HIV patients is illustrated in Figure 2. The model complies with the main experimental results about HIV-1 zidovudine resistance.
Genetic progression score (GPS)
Considering the tree structure of the components of the mixture model, under some additional assumptions waiting times can be mapped on the tree edges. Consequently, a progression score that incorporates correlations among events and time intervals among occurrences of events can be associated to the mixture model as proposed in [2].
The timed mixture model can be obtained by assuming independent Poisson processes for the occurrence of events on the edges of each mixture component and for the sampling time of the disease. The sampling time denotes the time interval between the onset of the disease and its discovery or analysis. Let the waiting time T
i
of the i-th event denote the time difference between the occurrences of its parent event pa(i) and the event itself. Let T
i
and the sampling time T
S
have exponential distributions with parameters λ
i
and λ
S
, respectively. The conditional probability p
i
assigned to the edge (pa(i), i) that enters the i-th event can be calculated as p
i
= λ
i
/(λ
i
+ λ
S
). In this framework the expected waiting time E(i) for an event i is given as follows:
Typically, the onset of the disease is not known for single patients. Thus λ
S
cannot be derived from the data and the waiting times E(T
i
) on the edges cannot be expressed in timescale of the process of genetic progression. Therefore the mean sampling time is normalized to 1, i.e. E(T
S
) = 1, and unitless waiting times are mapped on the edges of the trees of the mixture model.
It is easy to calculate the expected waiting time for a single event i and map it to the corresponding edge (pa(i), i). However, an explicit closed formula for computing the expected waiting times for an arbitrary pattern x
i
of genetic events cannot be derived. Using the timed mutagenetic trees mixture model M the expected waiting times of arbitrary patterns of events can be obtained by simulating the waiting process along the edges of each tree component and using the probability distribution induced by the mixture model.
When simulating the waiting process along a single tree from the mixture with sufficiently large number of simulations (≥ 106), first, waiting times for the events i on the edges (pa(i), i) are drawn from exponential distributions with parameters λ
i
= p
i
/(1 - p
i
). Since the tree structure captures the order in which the genetic abberations occurred, the waiting times of subsequent events along the tree topology are added.
Considering the simulation of the waiting process for the k-th tree component T
k
, the expected waiting time for the subset of events x
i
, denoted with (W (x
i
)), is the average of all waiting times at which the pattern x
i
was observed. Finally, the expected waiting time of the pattern x
i
with respect to the given mixture model, referred to as the GPS of the pattern, is a weighted sum of the expected waiting times of x
i
with respect to each of the K mixture components. The weights are the responsibilities of the respective tree components for the pattern of events x
i
.
Formally, the GPS of the pattern x
i
with respect to the mixture model M is given by
(1)
The computation of the GPS for the pattern x = {0, 70R, 219E/Q, 67N} from the HIV dataset is depicted in Figure 2.