A grand challenge in biomedical research is improving diagnosis and therapy for various diseases by using genetic profiles from the patients. This can enable personalized medicine, where the therapy of a patient is influenced not only by the results of the conventional medical analysis, but also by the analysis resulting from the values of the individual genetic composition.
The mutagenetic trees mixture model introduced by Beerenwinkel et al. [1] provides an interpretable probabilistic framework for modeling multiple paths of ordered accumulation of permanent genetic changes. The model describes several alternative pathways of disease development, each in a different mixture component. Datasets used for estimating the mixture model consist of patterns of genetic events observed for single patients. From the mixture model, a genetic progression score (GPS) can be computed for each pattern [2] that incorporates correlations among genetic events and mean time intervals between their occurrences. The GPS gives an estimate of the progression of the disease process and can be used for specifying therapies or estimating survival times of the patients [2].
Both the mixture model itself and derived features like GPS values have been proven to improve the interpretation of disease progression and estimation of survival times in the context of two different diseases, namely HIV [1] and cancer [2]. However, due to the complexity of the mixture model, it is important to analyze the stability of the estimation process and of features that are calculated from the model.
In this manuscript, we present a simulation study for inspecting the stability of estimated mutagenetic trees mixture models. We compare simulated true mixture models with models fitted to observations drawn from these true models. The stability analysis refers to GPS values as well as to other features of the mixture model like induced probability distributions or tree topologies. A bootstrap method [3] is used for a more detailed analysis of the variability of the GPS values and for deriving GPS confidence intervals.
The EMlike learning algorithm used for estimating mutagenetic trees mixture models finds a starting solution by an initial clustering step. We extend the fitting procedure given in [1] by specifying different initial assignments of the responsibilities, i.e. a different starting solution for mixture model fitting. We introduce the diversity parameter d that describes the diversity of the initial tree components. Based on simulated models we specify an optimal value for d with regard to the reliability of the final estimated models.
HIV drug resistance
The mutagenetic trees mixture model can be used for modeling HIV evolution as a process of accumulation of mutations in the viral genome under drug pressure. The replication rate of the HIVirus is extremely high. Under therapy, genetic mutations enable the virus to develop drug resistant mutants. The evolution of HIV drug resistance is conveyed by the erroneous reverse transcription during virus replication, the large diversity of the HIV population, and the natural selection of the fittest mutant under drug pressure. This results in a dynamic and highly adaptive virus population.
In this work we focus on HIV1 drug resistance to the nucleoside reverse transcriptase (RT) inhibitor zidovudine. The most prevalent mutations in the HIV1 genome that rise under zidovudine are M41L, D67N, K70R, L210W, T215F/Y, and K219E/Q [4, 5]. Typically, K70R and T215F/Y are the first mutations to occur [6]. The set of mutations 215F/Y, 41L, and 210W, also known as 215 – 41 pathway, occurs together. The same holds for 70R and 219E/Q (70 – 219 pathway). In our experiments we use the dataset from the Stanford HIV Drug Resistance Database [7] that comprises genetic measurements of 364 HIV patients treated only with the drug zidovudine. The data contains the six classical major zidovudine resistance mutations mentioned above.
Evolutionary tree models
In this section, we present the mutagenetic trees mixture models as means to model disease evolution. Furthermore, we describe the genetic progression score (GPS) derived from these models used for quantifying the stage of the disease.
Mutagenetic trees mixture models
The mutagenetic trees mixture model is a probabilistic model that can describe multiple pathogenetic routes of ordered genetic mutations in disease progression. A single mutagenetic tree is a weighted directed tree in which the genetic events are represented by nodes and weights on the edges correspond to the conditional probability of the child event happening given that the parent event has occurred. This tree structure provides a probabilistic model that annotates evolutionary paths of disease progression. A disadvantage of this model is that many subsets of genetic events cannot be explained by a single tree, since it can only represent those subsets of events which include events together with all their predecessors in the tree. All other subsets of events have likelihood zero in the probability distribution generated by a single tree model. A single tree also often fails to capture all diverse pathogenetic routes that can occur in disease progression. Formal details of the definition and estimation of single trees are presented in [8].
The mutagenetic trees mixture, proposed by Beerenwinkel et al. in [1], provides a probabilistic model that can capture multiple evolutionary processes conducting the disease evolution, each of them in a separate mixture component. We consider K mutagenetic trees T_{
k
}, k = 1,...,K, on the set of vertices V = {0,...,l  1} denoting l events with the null event r as root. The root comprises events that has initially occurred in all samples and accounts for the therapy naive patients. Moreover, it represents the starting point of disease progression. The tree T_{1} has a starlike topology and describes the genetic events as being independent of each other given the initial null event. According to the probability distribution it induces, the star tree assigns positive probability to all possible 2^{l1}patterns generated by a given set of l genetic events [1].
Formally, a mutagenetic trees mixture model M is a weighted sum of K mutagenetic trees M={\displaystyle {\sum}_{k=1}^{K}{\alpha}_{k}{T}_{k}}, where α_{
k
}∈ [0, 1] and {\displaystyle {\sum}_{k=1}^{K}{\alpha}_{k}=1}. Every subset of genetic events, determined by the pattern x, has positive probability of being generated, given by:
\mathrm{Pr}(xM)={\displaystyle \sum _{k=1}^{K}{\alpha}_{k}\mathrm{Pr}(x{T}_{k})}.
Let V be the set of vertices corresponding to the genetic events, and let E = {(u, v)  u, v ∈ V} be a set of edges. For all edges e = (u, v) ∈ E, the edge weights indicate the conditional probability that the event v appears given that the event u has occurred, and are determined with the mapping p: E → [0, 1], such that:p(e) = Pr(X_{
v
}= 1  X_{
u
}= 1).
The root vertex denoted by r is the node 0 and specifies the initial null event, such that Pr(X_{
r
}= 1) = 1. The connected branching T = (V, E, p) formally captures the notion of a mutagenetic tree.
Let Ω = {0, 1}^{l}be the set of all possible patterns of genetic events of length l and x be an observation defined by the subset S ⊆ V of events that have occurred. Each observation x is a binary vector with ones indicating occurrence of mutations. For example, if the pattern (0, 0, 1, 0, 1) is associated with a specific patient this means that the genetic mutations 3 and 5 have occurred in the respective sample. The mutagenetic tree T = (V, E, p) induces a probability distribution on the set Ω. Accordingly, if there exists an edge subset E' ⊆ E, such that S is the set of vertices reachable from the root in the induced tree T' = (V', E', p), then the probability that the tree T generates the pattern x is:
\mathrm{Pr}(xT)={\displaystyle \prod _{e\in {E}^{\prime}}p(e)}\cdot {\displaystyle \prod _{e\in S\times (V\backslash S)}(1p(e))}.
Otherwise, the mutagenetic tree T cannot generate the pattern x, i.e. Pr(x  T) = 0.
Given the number of tree components K and a finite sample of N observations D = {x_{1},...,x_{
N
}} of the binary vector X = (X_{1},...,X_{
l
}) that indicates occurrence of subsets of genetic events, the mixture model can be estimated as follows. Assuming that for each sample the tree component from which that sample was generated is known, one can easily reconstruct the mixture model by using Edmonds' branching algorithm [9] K times on the respective observation sets. For large number of samples Edmonds' branching algorithm reconstructs the original mutagenetic tree with very high probability [9]. Since one does not know from which tree each sample was generated, one has to estimate it from the data. The goal is to find trees T_{1},...,T_{
K
}and mixture parameters α_{1},...,α_{
k
}that maximize the loglikelihood of the data. Having a mixture model, a standard procedure for maximumlikelihood estimation is the EM algorithm [10].
Let Δ_{1},...,Δ_{
K
}be binary random variables where Pr(Δ_{
K
}= 1) = α_{
k
}. The responsibility of the kth tree component T_{
k
}for the ith observation x_{
i
}is the probability that x_{
i
}was generated from T_{
k
}given the mixture model M: γ_{
ik
}= Pr(Δ_{
k
}= 1  x_{
i
}, M). The EMlike algorithm for fitting mutagenetic trees mixture models is briefly described below. A detailed description is presented in Figure 1. The algorithm is only an "EMlike"algorithm, since the tree reconstruction step with Edmonds' algorithm does not provide an exact maximum likelihood estimate. In practice, the solution of Edmonds' algorithm is close to the maximumlikelihood solution.

1.
Guess initial responsibilities: Run (K1)means clustering algorithm on the given dataset and assign samples to the corresponding (K – 1) clusters.

2.
Maximizationlike step: Estimate the K tree components T_{1},...,T_{
K
}from all patterns weighted with their responsibilities and compute the mixture parameters α_{1},...,α_{
k
}.

3.
Expectation step: Recompute the responsibilities of the samples by using the tree components estimated in the previous step.

4.
Iterate the two previous steps until convergence.
The solution of an EM algorithm depends on the initial values used for the responsibilities. The mixture model tries to capture diverse paths of ordered genetic changes present in the data. Furthermore, only a small fraction of the dataset, that cannot be mapped to the nontrivial components of the mixture, should be mapped to the star component. The rest should be assigned to the nontrivial trees of the mixture model. In [1], the starting solution is determined by running a (K – 1)means clustering algorithm on the set of observed patterns D by using the squared Euclidian distance as dissimilarity measure [11]. Here we propose initial assignments of the responsibilities depending on a diversity parameter d. This parameter controls the diversity of the initial tree components comprising the mixture model and, as a consequence, also the diversity of the components in the EM solution. The optimal value of d = 3 is chosen as a result of the simulation study presented in the Results section. This value is a good compromise between diversity of nontrivial branchings and quality of the fitted model.
The EMlike algorithm assumes that the number of trees K is known. The algorithm is efficient enough to estimate this model parameter in a crossvalidation framework, as proposed in [1]. Yin et al. [12] introduce a modified Bayesian Information Criterion (BIC) for estimating the number of trees. The modified BIC incorporates a similarity measure for estimating the structural redundancy between tree components in the penalization term of the standard BIC. A mutagenetic trees mixture model estimated for HIV patients is illustrated in Figure 2. The model complies with the main experimental results about HIV1 zidovudine resistance.
Genetic progression score (GPS)
Considering the tree structure of the components of the mixture model, under some additional assumptions waiting times can be mapped on the tree edges. Consequently, a progression score that incorporates correlations among events and time intervals among occurrences of events can be associated to the mixture model as proposed in [2].
The timed mixture model can be obtained by assuming independent Poisson processes for the occurrence of events on the edges of each mixture component and for the sampling time of the disease. The sampling time denotes the time interval between the onset of the disease and its discovery or analysis. Let the waiting time T_{
i
}of the ith event denote the time difference between the occurrences of its parent event pa(i) and the event itself. Let T_{
i
}and the sampling time T_{
S
}have exponential distributions with parameters λ_{
i
}and λ_{
S
}, respectively. The conditional probability p_{
i
}assigned to the edge (pa(i), i) that enters the ith event can be calculated as p_{
i
}= λ_{
i
}/(λ_{
i
}+ λ_{
S
}). In this framework the expected waiting time E(i) for an event i is given as follows:
\text{E}({T}_{i})=\frac{1}{{\lambda}_{i}}=\frac{1{p}_{i}}{{p}_{i}}\frac{1}{{\lambda}_{S}}=\frac{1{p}_{i}}{{p}_{i}}\text{E}({T}_{S}).
Typically, the onset of the disease is not known for single patients. Thus λ_{
S
}cannot be derived from the data and the waiting times E(T_{
i
}) on the edges cannot be expressed in timescale of the process of genetic progression. Therefore the mean sampling time is normalized to 1, i.e. E(T_{
S
}) = 1, and unitless waiting times are mapped on the edges of the trees of the mixture model.
It is easy to calculate the expected waiting time for a single event i and map it to the corresponding edge (pa(i), i). However, an explicit closed formula for computing the expected waiting times for an arbitrary pattern x_{
i
}of genetic events cannot be derived. Using the timed mutagenetic trees mixture model M the expected waiting times of arbitrary patterns of events can be obtained by simulating the waiting process along the edges of each tree component and using the probability distribution induced by the mixture model.
When simulating the waiting process along a single tree from the mixture with sufficiently large number of simulations (≥ 10^{6}), first, waiting times for the events i on the edges (pa(i), i) are drawn from exponential distributions with parameters λ_{
i
}= p_{
i
}/(1  p_{
i
}). Since the tree structure captures the order in which the genetic abberations occurred, the waiting times of subsequent events along the tree topology are added.
Considering the simulation of the waiting process for the kth tree component T_{
k
}, the expected waiting time for the subset of events x_{
i
}, denoted with {\text{E}}_{{T}_{k}}(W (x_{
i
})), is the average of all waiting times at which the pattern x_{
i
}was observed. Finally, the expected waiting time of the pattern x_{
i
}with respect to the given mixture model, referred to as the GPS of the pattern, is a weighted sum of the expected waiting times of x_{
i
}with respect to each of the K mixture components. The weights are the responsibilities of the respective tree components for the pattern of events x_{
i
}.
Formally, the GPS of the pattern x_{
i
}with respect to the mixture model M is given by
\begin{array}{lll}\text{GPS}({x}_{i})\hfill & =\hfill & {\text{E}}_{M}(W({x}_{i}))={\displaystyle \sum _{k=1}^{K}\mathrm{Pr}({T}_{k}{x}_{i})\cdot {\text{E}}_{{T}_{k}}(W({x}_{i}))}\hfill \\ =\hfill & {\displaystyle \sum _{k=1}^{K}{\gamma}_{ik}\cdot {\text{E}}_{{T}_{k}}(W({x}_{i}))}.\hfill \end{array}
(1)
The computation of the GPS for the pattern x = {0, 70R, 219E/Q, 67N} from the HIV dataset is depicted in Figure 2.