Learning mutational graphs of individual tumor evolution from multi-sample sequencing data

Phylogenetic techniques quantify intra-tumor heterogeneity by deconvolving either clonal or mutational trees from multi-sample sequencing data of individual tumors. Most of these methods rely on the well-known infinite sites assumption, and are limited to process either multi-region or single-cell sequencing data. Here, we improve over those methods with TRaIT, a unified statistical framework for the inference of the accumula- tion order of multiple types of genomic alterations driving tumor development. TRaIT supports both multi-region and single-cell sequencing data, and output mutational graphs accounting for violations of the infinite sites assumption due to convergent evolution, and other complex phenomena that cannot be detected with phylogenetic tools. Our method displays better accuracy, performance and robustness to noise and small sample size than state-of-the-art phylogenetic methods. We show with single-cell data from breast cancer and multi-region data from colorectal cancer that TRaIT can quantify the extent of intra-tumor heterogeneity and generate new testable experimental hypotheses.


Introduction
Intra-tumor heterogeneity (ITH) is the final product of the complex interplay arising from competition, selection and neutral evolution of cancer cell subpopulations, and currently represents a major hurdle in the development of effective diagnostic and therapeutic strategies for most cancer (sub)types [1][2][3][4][5][6][7]. For this reason, in the last years an impressive number of computational approaches to reconstruct the evolutionary history of tumors have been devised (see [8,9] for recent reviews), taking advantage of the ever increasing resolution and availability of genomic data, as provided especially by multi-region and single-cell sequencing (SCS) experiments.
Such multi-sample datasets allow to overcome some limitations of singlesample bulk sequencing, which returns a noisy mixture of signals from the tumor subpopulations detected in the sequenced biopsy (e.g., the TCGA data [10]). In particular, the analysis of multiple spatially-separated regions of a tumor and its metastases is recently producing a better and clearer picture of ITH in various tumor types [11][12][13][14][15][16]. Accordingly, a number of algorithms to infer tumor phylogenies from allele frequencies, which were originally ideated for singlesample datasets, have been extended to multi-sample data [17][18][19][20][21][22][23][24].
Most of these methods rely on the Infinite Sites Assumption (ISA), according to which each mutation occurs at most once during the evolutionary history of a tumor, and is never lost. Unfortunately, violations of the ISA are more frequent than originally expected, due to chromosomal deletions and loss of heterozygosity, which could lead to back mutations, and to convergent evolution, in which the same mutation is observed in independent clones or lineages (i.e., parallel mutations) [34]. More in general, such methods usually depend on a large number of ad-hoc technical assumptions and parameters, e.g., noise model, search schemes, etc., which need to be opportunely tuned, often requiring computationally demanding automated procedures and/ or prior biological knowledge about the underlying phenomenon. Similarly, in many cases arbitrary heuristics are needed to disambiguate equivalently optimal solutions, e.g., when seeking for a maximum parsimony phylogenetic tree.
Here we introduce TRaIT (Temporal oRder of Individual Tumors), a computational framework to infer the order of accumulation of mutations in single tumors that unifies several distinct approaches.
• TRaIT is the first method that explicitly supports both multi-region and SCS data within a unique statistical framework, with remarkable performances with both data types without the need for a model of noise specific to them; • TRaIT extends mutational trees by accounting for violations of the ISA due to convergent evolution, for certain values of the observed probabilities. In Figure 5-B the analysis of a multi-region colorectal cancer dataset via TRaIT is shown, in which parallel mutations of a driver gene hit independent lineages. As the general validity of the ISA is debated [34], this represents a major advantage of TRaIT with respect to tree-based phylogenetic techniques.
• TRaIT can detect other complex phenomena underlying ITH such as the presence of multiple independent trajectories in the same tumor, or confounding factors annotated in the data. In the former case, these could be due either to multiple cells of origin [42], or to tumour initiation triggered by epigenetic states not annotated in the data (e.g., methylations). The latter relates to the annotation, in the data, of events that are unrelated to the progression.
• TRaIT can process any kind of genomic lesion, e.g., somatic mutations, copy number alterations, fusions, etc., allowing for an unprecedented information integration among data types; • TRaIT outperforms techniques specifically tailored for multi-region or SCS data, in terms of accuracy and robustness to data-specific errors and small sample size.
• TRaIT displays a significant improvement in terms of computational time and scalability, which represents another key advantage in anticipation of the increasing availability of large-scale studies/ datasets on single tumor evolution.
TRaIT's models are assembled by combining simple "building blocks": if a model contains edge x → y, then x + mutants are ancestral to y + ones, and y + mutants are statistically associated to x + . Such conditions describe the underlying clock among x and y, and are estimated via simple inequalities from data; their confidence is assessed via several statistical approaches, such as testing, bootstrap and cross-validation.
The simplicity of the underlying theoretical framework has several computational advantages, which we exploited to implement a suite of efficient algorithms that can model complex temporal structures, up to direct acyclic graphs (DAGs) with disconnected components, hence capturing different key aspects of tumor evolution and allowing for ISA violations (Figure 1-D). On the overall, despite supporting a wider range of data and models compared to standard phylogenies, our methods have state-of-the-art accuracy and more stable performance with small sample size and different data types, as well as lower computational complexity and higher scalability, as extensive tests on synthetic data suggest.
TRaIT can be used to infer the order of accumulating mutations in individual tumors, but not to deconvolve tumor clones' signatures. Thus, we show in this paper how to couple TRaIT to methods for clones detection, so to draw a new and all-encompassing pictures of tumor evolution and ITH, where one can infer which clones were present in the input data (signatures), and how mutations accumulated within each clone.

Materials and methods
TRaIT includes 4 optimal polynomial-time algorithms (Figure 2) that process a binary matrix D with n columns and m rows [37]. D stores n variables (mutations, CNAs, etc.) detected across m samples (single cells or multi-region samples). If an entry in D is 1, then the associated variable is detected in the sample. Missing data in SCS are handled by a standard EM procedure with multiple imputations [43]. A priori estimates of false positives/ negatives rates Multi-region bulk sequencing processes a signal mixed from different tumor subpopulations, with potential contamination of non-tumor cells. Thus, a sample will be likely annotated with lesions from different tumor lineages (green, red), creating spurious correlations in the data. In this case, we expect the rate of false positives and negatives in the calling to be symmetric. C. If we sequence genomes of single cells we can, in principle, have a precise signal from each subpopulation. However, the inference with these data is made harder by high levels of asymmetric noise, and errors in the calling. D. We are interested in studying temporal models of cancer progression in 4 possible scenarios. (i) when all annotated mutations are related to the progression, (ii) when data harbours confounding factors, (iii) when a tumor might have multiple cells of origin and, accordingly, multiple independent progressions and (iv) when independent evolutionary trajectories converge toward a certain mutation (i.e., a confluence). Case (iv) is when the Infinite Sites Assumption (ISA) is violated, as red mutations appear in two distinct evolutionary trajectories. + , − ≥ 0 in D can be provided to each algorithm. All the algorithms are implemented in R, in the TRONCO tool for Translational Oncology [44,45].
At their core, TRaIT's algorithms exploit a bootstrap procedure to compute 2 p-values per edge -one for temporal direction, one for association's strength, according to Suppes' theory of probabilistic causation [47] (see Methods). Two algorithms infer mutational trees (Edm, Gbw) and two DAGs (ChL, PRIM) (see Methods and Supplementary Information) -for this reason our framework supports mutational graphs. All algorithms can return a model with separate components, suggesting that data lacks statistically significant associations, or harbours multiple progressions which can be inputed to tumour triggers not annotated in the data (e.g., epigenetic lesions).
For these reasons, TRaIT can be used ( TRaIT's input data is a binary matrix that stores the presence/absence of a variable in a sample (e.g., a mutation, a CNA, or a persistent epigenomic lesion). Some of these observations harbour false positives and negatives (noise). B. We estimate via bootstrap the prima facie ordering relation Edm and Gbw infer models for (i, . . . , iii), whereas only ChL and PRIM can explicitly account for (iv); the latter case happens when the ISA is violated. One can choose which TRaIT's algorithm to use according to, e.g., research goals or prior knowledge on the evolutionary process. A rule of thumb might be the application of all TRaIT's algorithms, followed by a comparative analysis of their output models -as we show in the case studies. The creation of a consensus model could be also effective to this end. In the Results section we present results from simulations of different experimental conditions and data; these tests allowed to assess the performance of the algorithms in the four scenarios, thus providing indications to an appropriate choice according to the specific case.
The detailed mathematical description of all TRaIT's algorithm is included in the Supplementary Information, whereas in the following some details on the overall theoretical framework are provided.
Suppes' probabilistic causation [47] Let p(·) be multinomial estimates of the probabilities in D. For every pair of variables x and y in D, x is a plausible cause of y if p(x) > p(y) and p(y | x) > p(y | ¬x) . (1) The former condition acts as an Infinite Sites Assumption (ISA), as we are assuming that lesions are persistent; which can be weakened in certain situations (see below). So, we estimate temporal precedence by marginal probabilities. The latter implies condition statistical dependence: p(x, y) = p(x)p(y) [48]. For an edge to be part of our models, both conditions must be satisfied. When this is not the case, an edge is included and a non-significant p-value returned (see below). By iterating this approach, we can create models. This tool is the core ingredient of successful causal approaches for cancer evolutionary inference [46]. It represents a necessary but not sufficient estimator of selective advantage, and combined with a statistical frameworks to disentangle true from spurious associations, can detect selection [49]. With data from a single patient, we limit its power to predict just temporal orderings (see the Supplementary Information).

Working scenarios (Figure 1-D)
Theres is a huge deal of variability in cancer data types, in cancer ITH, as well as in our ability to call mutations etc. Besides, several aspects of cancer evolution are yet undeciphered, so we considered four working scenarios representative of different biologically and technologically-motivated assumptions, and defined corresponding algorithms.
The simplest setting is (i) when all D's variables are involved in the progression. Then, we generalize this to when (ii) some of D's variables are annotated in D, but irrelevant to tumor progression (e.g., calling uncertainty or other confounding factors). Besides, we account for a (iii) a tumor with multiple cells of origin, and we aim at identifying multiple independent models from a unique dataset. The fourth case is that of a tumor that shows (iv) selective pressures that converge towards variable x. It seems reasonable to consider (i, ii) more common than (iii, iv).

Algorithms (Figure 2)
TRaIT's algorithms use a non-parametric bootstrap strategy to assess Suppes' conditions among variables pairs, and include them in a direct graph G. Then, four different strategies can compute, from G, a model. Output models can be interpreted as Suppes-Bayes Causal Networks [50][51][52], an extension of Bayesian Networks, with maximum likelihood estimates of the parameters θ [53] -bearing in mind that usually such models are used in truly causal approaches. The output is the Maximum A Posteriori probabilistic model that best explains D.
The algorithms are inspired by (i − iv), which require us to infer (i) a mutational tree T , (ii) T and some detached nodes for the variables identified as confounding, (iii) a set of trees, {T i }, usually called a forest and (iv) a direct acyclic graph G (since confluent trajectories lead to a node with multiple incoming edges).
TRaIT implements two algorithms to infer trees: Edm (Edmonds) ( Figure  2-C), Gbw (Gabow) (Figure 2-E), based on weighted directed minimum spanning tree reconstruction. These algorithms scan G to identify the T that maximizes the edges' weights, which are computed via information-theoretic measures of the degree of association of variables -e.g., (pointwise) mutual information. Edm and Gbw differ from the way they order strongly connected components [54,55] that appear in G because of finite sample bias. Computationally, Gbw is more expensive and general than Edm.
Two additional algorithms, ChL , are available to infer direct acyclic graphs. ChL is a Bayesian model-selection method to factorize a joint distribution over the input variables [56]. PRIM is the equivalent to Edm for undirected structures, is applied by rendering G undirected, and weighting it with mutual information (which is symmetric). By assigning a posteriori an ordering to the undirected spanning trees that is consistent with the variables' frequency, we can retrieve confluent relations that capture violations of the ISA. In TRaIT these cases can be detected under certain conditions of the parameters: if -in the true progression -x and y converge to z, we will not detect two confluent trajectories if p(x) ≤ p(z) or p(y) ≤ p(z); see the Supplementary Information.
Detection of k independent progressions is a feature available for all algorithms, as it is enforced by the bootstrap when G has k disconnected components. Each group describes evolutions as triggered by multiple initiation cells. Thus, by G's estimation and by picking the proper TRaIT's algorithm, one can easily cover scenarios (i, . . . , iv).

Data types
TRaIT's algorithms work with both SCS and multi-region data. We expect D to contain noisy observations of the unknown true genotypes (Figure 1-A-C). The algorithms can be informed of the usual false positives and negatives rates + ≥ 0 and − ≥ 0, respectively. This adds no overhead to the computation, but prevents to learn noise rates from D, as it is instead possible with techniques as SCITE [37]. Since the algorithms show stable performance for slight variations in the input noise rates, avoidance of complex noise-estimation schemas seems a plus, especially when reasonable estimates of + and − are known a priori. This strategy is also used in OncoNEM [40].
For SCS with missing data we use a standard Expectation Maximization approach to input missing values; we repeat it n times, and then perform inference and select the MAP best model out of the n trials.

Code availability
All the algorithms included in TRaIT are implemented in R, and are available in the TRONCO tool for TRanslational ONCOlogy [44,45]. TRONCO is available under a GPL3 license at its webpage: https://sites.google.com/site/troncopackage or at Bioconductor: https://bioconductor.org.

Data availability
All data used in this paper are available from the supplementary material of [30] and [57]. To allow the reviewers to replicate our case studies we provide the source code as well as the input data at https://goo.gl/Ku13MM.

Additional file 1 -Supplementary Information
The detailed description of TRaIT's algorithmic framework and the results of extensive tests both on simulated data and several real datasets are provided in the Supplementary Information.

Simulations
We assessed the performance of TRaIT's algorithms with simulated single cell and multi-region data.
In particular, we generated multiple batches of independent synthetic datasets from random phylogenies (generative models), with 5 ≤ n ≤ 20 nodes and different levels of topological complexity (Figure 1-D). SCS datasets with 10 ≤ m ≤ 100 cells and multi-region datasets with 5 ≤ m ≤ 50 regions (accounting for sampling bias) were created. To test the robustness against imperfect data, false positives, false negatives (highly asymmetric for SCS) and/or missing data were introduced in the true genotypes, consistently with previous studies [37]. Multiple configurations of parameters were scanned, and we measured the ability to infer true edges (sensitivity), and discriminate false ones (specificity); further details on data generation are available as Supplementary Information. We compared our methods to SCITE, the state-of-the-art for phylogenetic inference of mutational trees from SCS data [37]. In the test, we also included previously developed approaches to causal inference from single-sample data (CAPRESE [48] and CAPRI [46]).
Full results are in Supplementary Figures S3 and S5-S15. Here we show four simulations in Figure 3; these settings are consistent with the results across all tests. Figure 3 displays the results for TRaIT and SCITE 1 in canonical settings of noise and sample size, for case (i) (SCS and multi-region data), for case (ii) (multi-region data), and for case (iii) (SCS data).
All the techniques achieve high sensitivity and specificity scores from SCS generated by phylogenies with drivers only -Edm and Gbw highlighting the best results (medians approx. 0.8 and 1). When we sampled multi-region data from the same topology, performances worsened for all methods likely due to the smaller sample size and the mixed bulk signal. The introduction of confounding factors (2 out of n = 13 variables), does not to impact the performance significantly, and all algorithms mostly discriminate the true generative model. Finally, the inference of tumors with multiple independent progressions proves to be a harder task, as sensitivity decreases and the performance of all methods are similar. Notice that SCITE, in all tests, achieves the lowest specificity; this might point at a mild-tendency to overfitting, probably due to the combination of its search scheme and noise-learning model (see also Conclusion).
General conclusions can be drawn from the whole set of tests that we carried out. As expected, performances improve with lower noise and larger datasets. In particular, with SCS data Gbw, Edm and SCITE seem the best algorithms; they generally achieve very similar sensitivity, even though the latter presents (on average) lower specificity. For SCS data, all the tested algorithms seem very efficient up to 20/30% of missing data, with SCITE showing a slightly greater robustness (Supplementary Figure S11).
Results on multi-region data display similar trends, with Gbw and Edm showing the overall best performance. In this case, however, SCITE is less effective in retrieving both the true and the false relations, especially with small datasets and/or low noise levels.
Interestingly, by systematically analyzing the impact of a variation of the input + and − with respect to the true noise values, we discovered that the performance is rather stable for all algorithms (in Figure 3-D we show Gbw algorithm); this supports our choice -in line with other tools [40]. -of not implementing sophisticate noise-learning strategies in TRaIT.
Finally, a computation time assessment allowed to record a 3× speedup of all the algorithms included in TRaIT with respect to SCITE, on standard CPUs (Supplementary Table 10).

SCS data: Triple-Negative Breast Cancer
We applied TRaIT to a SCS dataset of Triple-Negative Breast Cancer (patient TNBC in [30]). The input data consists of single-nucleus exome sequencing of  We estimate from simulations the rate of detection of true positives (sensitivity) and negatives (specificity), visualized as box-plots from 100 independent points. We compare TRaIT's algorithms and SCITE, the stateof-the-art for mutational trees inference. For each data type, here we show here a mild-noise setting with canonical sample size: in SCS data noise is Extensive results for different models, data type, noise and sample size are in Supplementary  We augment the setting in A-right with 2 random variables (with random marginal probabilty) to model confounding factors, and generated SCS data. C. We generated multi-region data from a tumor with n = 21 mutations, and a random number of 2 or 3 distinct cells of origin to model independent progressions. D. Spectrum of average sensitivity and specificity for Gbw algorithm estimated from 100 independent SCS datasets sampled from the generative model in Supplementary Figure 5 (m = 75, n = 11). The true noise rates are + = 5 × 10 −3 ; − = 5 × 10 −2 ; we scan input + and − in the ranges: + = (3, 4, 5, 6, 7) × 10 −3 and − = (3, 4, 5, 6, 7) × 10 −2 .
In [30], with a bulk sequencing control, mutations detected both in the bulk and in the majority of the cancer cells were annotated as clonal, whereas those undetected in the bulk as subclonal. The authors then manually curated a qualitative phylogenetic tree (Figure 4-B). We here run TRaIT with the mutational profiles of each single cell describing the presence/ absence of nonsynonymous point mutations of the 22 genes selected in [30]. The rate of missing values in this dataset is very low (around 1%); as suggested in [30], we used 9.73 × 10 −2 for allelic dropout rate and 1.24 × 10 −6 for false positive rate.
For these data, all TRaIT's algorithms return trees, suggesting consistency with the ISA (Supplementary Figures S16-S17); here, we discuss Edm's one since that algorithm achieved the best performance in the simulations with drivers and confounding factors (Figure 4-C). To improve the analysis, from the same dataset we also deconvolve the signatures of putative clones with OncoNEM, and compare the predictions (Figure 4-D).
TRaIT allows to characterize the qualitative phylogeny provided in [30] by identifying the gradual accumulation of point mutations, expectedly due to defects in DNA repair or replication machineries, both in the clonal and subclonal histories of the tumor.
On the one hand, Edm's model displays high-confidence branched evolution consistent with subclone A 1 (ppp2r1a, syne2 and aurka mutations), A 2 (ecm2, chrm5 and tgfb2 mutations), and H (nrrk1, aff4, ecm1, cbx4 mutations) [30]. On the other hand, TRaIT provides a notably higher resolution in the description of the mutations annotated as clonal in [30], e.g., pten, tbx3 and notch2, are suggested to trigger tumor initiation. These results are also consistent with the presence of different molecular clocks operating at different stages of tumour growth [30]. TRaIT allows to formulate new hypotheses about undetected subclones, possibly characterized by private mutations in akap9, or in jak1, setbp1 and cdh6, which however would require further experimental confirmations.
Clones' analysis via OncoNEM detects 10 clones, their lineages and temporal relations, thus refining the qualitative analysis of [30]. Remarkably, such results are mostly consistent with ours, as the mutational ordering predicted by OncoNEM (obtained by estimating the assignment of mutations to clones, as suggested in [40]) largely overlaps with that inferred via TRaIT. This is particularly evident for early events, and for most of the late subclonal ones; exception made for subclone H, which is not detected by OncoNEM. As mutations in araf, akap9, notch3 and jak1 have the same marginal probability, their temporal ordering can not univocally determined from these data. TRaIT, in fact, provides a p-value p > 0.05 for the direction of those edges, suggesting that any permutation of their ordering would be possible. For this reason, unless more sequenced cells were available, we can not univocally match the clonal signatures obtained with OncoNEM and the temporal orderings identified by TRaIT for these temporally-intermediate events.
This result proves that the concerted use of techniques for the inference of mutational ordering, together with clonal deconvolution approaches, can provide a picture of tumor evolution and ITH at an unprecedented resolution and accuracy.

Multiple-biopsy data: MSI-high Colorectal Cancer
We applied TRaIT to a moderately-differentiated MSI-High colon cancer characterized by a primary tumor and a right hepatic lobe metastasis, with no prior treatments (patient "P3" in [57]). For this patient, targeted DNA resequencing of three regions of the primary tumor (P3-1, P3-2, and P3-3) and of two metastatic regions (L-1 and L-2), allowed to identify 47 nonsynonymous point mutations and 11 indels [57] ( Figure 5-A).
To process this dataset with TRaIT, we first grouped the mutations with the same signature across the five regions, hence obtaining: (a) a clonal group, including the 34 mutations detected in all the samples, (b) a subclonal group, including the 3 mutations detected only in the L regions, and (c) 8 mutations with different mutational profiles. Our methods will not resolve their ordering since their signals are statistically undistinguishable. However, we will be able to order the groups against the mutations.
With these data both PRIM and ChL predict confluent evolutionary trajectories (Supplementary Figures S19), suggesting a violation of the ISA. In Figure  5-B we show PRIM's direct acyclic graph and Edm's tree. Both models predict branched tumor evolution and high ITH among the subclonal populations, consistently with the phylogenetic analysis carried out in [57].
First, the clonal lesions -the clonal root -trigger the first expansions of this tumor, with mutations in the key colorectal drivers apc, kras, pik3ca and tp53 [49]. These biomarkers are ubiquitous, and could not be used to disentangle the mutational spectrum of the primary tumor from the metastatic lesions, in accordance with [58].
Second, the models identify distinct branches outgoing from the trunk, which discriminate the different subclonal evolutions. In both models one subclonal trajectory is initiated by a stopgain SNV in the DNA damage repair gene atm. Edm, in particular, characterizes region P3-1 by a subsequent accumulation of inhba and cdkn2a nonsynonimous mutations, whereas P3-3 by smad4 (stopgain SNV) and kmt2c (frameshift). Conversely, PRIM infers a more complex model, in which two confluent trajectories anticipate common late mutations in different regions: mutations of either inhba or of tgfbr2 may precede mutations of cdkn2a in region P3-1, whereas in region P3-3 alterations of either inhba or of smad4 might precede alterations of kmt2c. Interestingly, the alterations of cdnk2a might point to a cell cycle arrest hallmark for this tumor. Notice that the model inferred via PRIM exactly fits in scenario (iv), which could not be identified with canonical phylogenetic approaches.
Third, in both models the subclonal metastatic expansion is originated by a stopgain SNV in gnaq, anticipating mutations in smad4, setd2, ar (i.e., the subclonal group) and ppp2r1a. The models suggest canonical convergent evolution (i.e., parallel) towards smad4 (a stopgain SNV in the primary tumor, and a nonsynonimous mutation in the metastasis). The transducer of transforming growth factor-β superfamily signaling smad4 regulates cell proliferation, differentiation and apoptosis [59], and its loss is usually correlated with colorectal metastases [60]. ar is a transcription factor that regulates cell migration and inhibites hepatocellular carcinoma metastases [61]; its splice variants are known to promote metastasis in several tumor types [62]. Similarly, gnaq is supposed be relevant in metastases development in certain tumor types [63]. ppp2r1a is a negative regulator of signal transduction, gene expression and cell cycle [64], and its mutation influences tumor-endothelium interaction in melanoma metastases [65]. Notice that many other genes that are supposed to characterize MSI-high progression are wildtype in this tumor, e.g., fbxw7, braf, arid1a, fam123b, etc., [49], as a further evidence of the high level of ITH even within the same tumor subtype.
We finally compared the ordering estimated by TRaIT to the predictions obtained by SCITE (Supplementary Figure S20). Both approaches predict the same formation of the metastatic lesion, yet some significant differences are present. First, SCITE predicts that the mutation of atm triggers tumor initiation, prior to the mutations included in TRaIT's clonal group, which are ordered in a 34 events-long linear chain. Yet, this specific order has score equivalent to several other models (Supplementary Figure S20) and, thus, might be unreliable.
Besides, in SCITE's model tgfbr2 is associated to region P3-1 (in accordance with PRIM, but not with Edm), gnaq's stopgain is upstream to both P3-3 and L branches, and some relations appear in inverted temporal ordering, e.g., between smad4 and kmt2c. Finally, by construction, SCITE can not infer any confluent evolutionary trajectory, as its statistical model relies on the ISA.

Conclusion
The increasing availability of high-resolution multi-sample sequencing data allows one to study ITH at an unprecedented resolution, to understand origination and development of tumors both at the genotype and phenotype level, and to better stratify and treat cancer patients.
Multi-region and SCS data harbours signals that can be informative of different aspects of tumor evolution. In fact, several techniques have been developed that either deconvolve clonal signatures, determine the ordering of growing clones or accumulating mutations, or estimate clonal fractions and cellular prevalence. The concerted application of these techniques allows to draw complex pictures of cancer evolution. As we show for a triple negative breast cancer, one can use the same dataset to detect both the signatures of the prevalent clones, and to infer the temporal precedence (i.e., ordering) of mutations that generated them. With the right tools, one can hence understand which clones are annotated in the data, and how they were shaped by evolutionary pressures.
The majority of techniques that perform such analyses ground their roots in standard phylogenetic theory, or in some of its cancer-specific derivations. These techniques are very effective, but sometimes they also implement a noteworthy deal of technical assumptions regarding sequence substitution models, alleles fixation, noise or search scheme etc. As a consequence, it could be hard to quantify how much the final predictions are shaped by the model and its assumptions, or actually suggested by the data. For instance, complex noise-learning models to leverage the imbalances of SCS data might resolve the ordering of clonal mutations in arbitrary ways. This manifests as long trunks whose actual order can not be estimated from current data, and the inclusion of subclonal mutations in dubious positions in the trunk (Supplementary Figure S19).
Similarly, when one seeks for a maximum parsimony phylogenetic tree of tumor evolution, several equivalent-scoring solutions could be returned. When that happens, one has to implement disambiguation heuristics to select one output model [11,12]. This could be one of the computed trees, or a new tree that is a combination of those (e.g., a bootstrap consensus [66]). Despite these routines are often adopted, they are somewhat arbitrary and some deal of care should be warned.
On top of these, recent evidences on the violation of the ISA suggest that this assumption might not be always appropriate; the ISA is preponderant in the derivation of most inferential techniques, and future methods should find a way to consistently account for its violations [35].
In this paper, we deviate from phylogenetic methods and present the TRaIT computational framework, whose methods give statistically robust estimates of mutational orderings in a variety of settings. Our models are simple, and can be interpreted straightforwardly: if an edge connects two mutations (i) it statistically resolves their temporal ordering and (ii) the mutations are statistically dependent. Both conditions are estimated from data without using complex inferential models, and assessed via statistical testing, which leads to p-values. Further assessment of models' confidence can be obtained by usual bootstrap or cross-validation approaches [49]. One should be warned that, as mutational trees or other phylogenies, TRaIT's models will display in output all mutations annotated in the input data. So, x and y could be passengers mutations observed by hijacking in this patient; nonetheless, their temporal relation can be disentangled, but for a more thorough carachterization of divers against passenger mutations, one would arguably need more complex tools and data combined with causal approaches [46,48,49].
The simplicity of our framework has major advantages, both from an evolutionary and a computational point of view. First of all, TRaIT's models can account for any variable that can be annotated in a tumor sample. Thus, with TRaIT one can introduce high-level information on pathways, hallmarks, phenotypic-triggering lesions or epigenetic states (e.g., methylations), as long as they are persistent during tumor evolution. Inclusion of these information in traditional sequence-based phylogenetic methods that work with sequences could be harder. Second, TRaIT implements four optimal (i.e., polynomial-time) algorithms that look for different types of signals in the sequencing data and can model more complex topologies than trees, such as direct acyclic graphs with disconnected components. Therefore, TRaIT can be used to investigate whether data suggest the presence of confounding factors, or if the tumor's data harbours several progressions, or if late mutations associate to multiple evolutionary trajectories. This latter case is a first attempt at performing inference when the ISA is violated by convergent evolution, a possibility that is missing in classical phylogenetic methods that are limited to estimating single trees from data 2 . Thus, with TRaIT's algorithms one can test a broad set of hypotheses on tumor evolution as we show in a colorectal cancer case study where we find convergent evolution towards cdkn2a, which might point to a cell cycle arrest hallmark for this tumor type.
The computational burden of our techniques is limited, compared to standard Bayesian approaches (which, however, include an estimation of uncertainty within the model). We do not compute a full posterior over our estimates, but rather a Maximum A Posteriori model constrained by Suppes' conditions. These conditions impose minimum levels of significance to the ordering predicted by our models, and are enforced as empirical Bayes priors. In light of the increasingly available data -especially from SCS -TRaIT's scalability properties represent an important algorithmic advancement over Bayesian computations that might become impractical with larger datasets. Our methods accommodate low-effort parallel implementations [67], which we provide in the TRONCO tool for TRanslational ONCOlogy [44,45].
TRaIT could be improved in several ways. For instance, we could pair bulk sequencing samples to either SCS or multi-region inputs; in fact, the combination of these has been recently shown to improve the estimation of the mutational ordering [38]. Furthermore, we could extend our framework to infer, besides mutational orderings, clonal signatures and architectures, in the attempt of defining a unified framework for cancer evolutionary inference. A general ex-tension to models where the ISA is violated could also be investigated. From a broader perspective, our methods build on our earlier contributions on tumor evolution from single-sample bulk sequencing data [46,48,49] (see the Supplementary Information for further discussion). These models allowed us to define the first automatic pipeline to quantify inter-tumor heterogeneity across multiple patients [49].
To conclude, we advocate the use of our methods as complementary to phylogenetic tools for clone deconvolution, in a joint effort to better quantify the extent of ITH. To this end TRaIT represents an innovative and powerful tool to raise precision and effectiveness of large-scale analyses of single tumor evolution.

Author contribution
DR, AG and GC designed the algorithmic framework. DR, LDS and GC implemented the tool. LDS carried out the simulations on synthetic data. Data gathering was performed by DR, AG, LDS and GC. All the authors analyzed the results and interpreted the models. DR, AG, MA and GC wrote the original draft of the paper, which all authors reviewed and revised in the final form.  Figure 4: A. Input data from single-nucleus sequencing of a triple-negative breast cancer [30] (32 cells). The rate of missing values for this dataset is very low (around 1%), allelic dropout has rate 9.73 × 10 −2 , and false discovery 1.24×10 −6 . B Manually curated phylogenetic tree estimated in [30]. Mutations are annotated to the trunk if they are ubiquitous across cells, and detected also in a bulk control sample [30]. Subclonal mutations are those appearing only in more than one cell. C. Temporal mutational tree obtained with Edm algorithm; p-values are obtained by 3 tests for Suppes' conditions and overlap (hypergeometric test), and edges annotated with a posteriori non-parametric bootstrap scores (100 estimates). For these data, all TRaIT's algorithms return trees (Supplementary Figure 16), consistently with the manually curated phylogeny (A). Most edges are highly confident (p < 0.05), expect for groups of variables with the same frequency which have unknown ordering (red edges). The ordering of mutations in subclones A 1 , A 2 and tumor initiation has high bootstrap estimates (> 75%). D. We perform a concerted analysis to estimate both clones' signatures and their formation, at least for mutations with a clear statistical signal in these data. We do this by computing a clonal tree with OncoNEM, which predicts 10 clones. Mutations are assigned to clones via maximum a posteriori estimates. The mutational ordering of the early clonal expansion of the tumor, which involves mutations in pten, tbx3, notch2, is consistent among both models. The same happens for most of the late subclonal events, e.g., mutations in map2k7, map3k4, ppp2r1a, syne2 and aurka in subclone A1. However, the temporal ordering of intermediate events has weaker support as they have the same marginal probability, and any permutation of their ordering would be equivalent for our method. 16 Figure 5: A. Multi-region sequencing data for a MSI-high colorectal cancer [57], with three regions of the primary cancer: p3-1, p3-2 and p3-3, and two of one metastasis: L-1 and L-2. To use this data with TRaIT we merge mutations that have the same signature across all regions, obtaining a clonal group (light blue) including 34 mutations and sublclonal group (light yellow) including: nonsynonymous SNVs of smad4 and setd2, and non-framshift insertion of AR. B. Models obtained by Edm and PRIM algorithms, with their confidence annotated and the overlap in the predicted ordering obtained by SCITE. PRIM predicts convergent evolution -which violates the ISA -towards a non-synonymous mutation in cdkn2a, which is also predicted by ChL (Supplementary Figure  18). All edges, in all models, are statistically significant for Suppes' conditions (temporal precedence and selection strengths). C. Four of the predicted ordering relations are consistently found across all TRaIT's algorithm, which gives a highconfidence explanation for the formation of the L2 metastasis. This finding is also in agreement with predictions by SCITE (Supplementary Figure 19).
in genomic instability processes defines lung cancer evolution.  (Figure 1), Davis and Navin distinguish three different approaches to infer cancer progression models from data of individual patients. The approaches are here summarized as Figure 1, and generally termed as op , mimicking the " -ordering problem".
Different versions of the op aim at providing different insights on the evolutionary aspects of cancer progression. In particular: • when = p, we solve a classical phylogenetic inference problem and aim at displaying a set of input cells as leaves of a phylogenetic tree; • when = c, we seek to identify a clonal lineage tree that models an ancestry-relation for a set of clonal signatures that we infer from data; • when = m, we seek to find the order of mutations that accumulate during cancer progression.
Hopefully, results from these approaches can be somehow reconciled, as the same data type is used to approach the problem. For instance, we might argue that the order of accumulating mutations should be consistent with the clonal lineage tree, which in turn should be consistent with a phylogenetic tree of the corresponding cells that we sequence. The efforts to solve these problems are ongoing, with different techniques and tools that are becoming popular to solve the cop and the pop versions of the problem, see, e.g., [2][3][4]. Here, consistently with earlier works of us on the inference of cancer progression [5][6][7][8][9][10], we focus on the mutational version of the problem, mop .

A framework based on probabilistic causation
Suppes' framework of probabilistic causation is the core of our approach to cancer progression inference. Here, we use it to derive algorithms that exploit optimal results from minimum spanning tree reconstruction and Bayesian inference.
We recall that Suppes' axioms provide a necessary but not sufficient set of conditions to make causal claims [6][7][8][9][10]. In our earlier works [5][6][7][8][9][10] we considered data from multiple patients, i.e., multiple observations of the tumor progressions, which allowed us to disentangle genuine causal relations from spurious ones. On the contrary here, while we can still quantify the statistical trends -between mutations -with Suppes' conditions, we restrict ourselves to multiple observations from the same patient and so we do not observe tumor progression multiple times. Thus, our claims can not be of causal natures, but even so we can provide an estimation of the temporal progression in the individual tumor. This is also reflected in the spanning tree assumption we make in our algorithms, which implies that one unique predecessor is assigned to every considered mutation. For these reasons, even if we may be depicting in our models spurious causal relations, such relations still represent valid temporal orderings. 3

Preliminaries
Input data. We consider a binary-valued dataset D with n variables and m observations where the columns of D are the n variables X = {x 1 , . . . , x n }, x i ∈ {0, 1} m , and z 1 , . . . , z m are the m samples. Variables refer to genomic events detected by sequencing of cancer genomes (i.e., (epi)genomic alterations of various types such as, e.g., single nucleotide or structural variants, or copy number alterations). Value x i,j = 1 means that event x j is detected in sample z i .
Assumptions. We require each event x i to measure a somatic alteration that is persistent during tumor evolution, e.g., a mutation or a copy number variant. Epigenetic states of expression or methylation could be used only if they fulfil this condition. A further technical assumption, not motivated by the phenomenon of cancer progression, is that no columns of D are either all zeros, or ones, and that no two columns exist that are indistinguishable. If this happens, D should be reshaped appropriately before applying the algorithms that we define.
Output model. From D, we want to estimate a joint distribution p(·) over X so that we can sample genotypes from our output model. In our formulation we use ideas from Bayesian Networks [11], a class of Graphical Models based on directed acyclic graphs augmented with parameters θ. A partially order set (poset) of a graph G over X is defined by a partial ordering : if x i x j than edge x i → x j is in G. For G to be acyclic we also require the transitive closure of to have no path that start and end in the same x i . We turn G into a generator for a conditional probability distribution by defining the parameters wherep is a distribution conditioned to the incoming edges of x j , often called conditional probability table [11]. |θ xj | is exponential in the number of edges incoming to x j , if variables are binary.
When we infer a model of cancer progression within the individual patient, our output model will be mot likely tree (but in some cases it could be a forest, or a general graph). A graph G is a tree if (i) it has one root node x * with no incoming edges, and (ii) all other x j = x * have one incoming edge, i.e., |{x i | x i x j }| = 1. Hence, for a tree, the parameters also simplify to θ xj =p(x j | x i ), i.e., they become linear in size. A graph G is a forest, if it can be partitioned into a set of trees. Trees and forests are acyclic, by definition.

Combining a causal graph with information theory
Causality conditions define a prior graph structure. We frame Suppes' probabilistic causation [12] within cancer progression [5,6], to create a partial ordering causal relation PF over X . We dub it prima facie, and we will use it to create the final output model's structure .
For any pair of variables x i and x j , we define In general, PF induces a cyclic graph. We interpret prima facie as a necessary condition for cancer progression, along the lines of [6]. So, we consider PF to provide us with a superset of the edges that will appear in our output models; derivation of such edges is discussed in the next sections.
To include a pair of variables in PF , we test two inequalities over distributions estimated from D. A statistical model of those marginal and joint/ conditional distributions over X can be created via non-parametric bootstrap [6]. Then, we can carry out a Mann-Whitney U test to compute a p-value for the alternative hypothesis that the distributions have different means: x i PF x j when both inequalities have confidence p < 0.05. This testing/ bootstrap schema can support prior information of noise in the data. If we are informed that D harbours false positives and negatives rates + and − , we can correct the marginal bootstrap estimates as where n i = k x k,i m , and proceed similarly for the joint estimates as follows where n i,j = k x k,i x k,j m . See Section 8 for the complete derivation of the error model.
Information-theoretic measures for associations' detection. The causal ordering PF is a super set of the ordering that we want to return as output; we thus need to subset PF .
To rank and select pairs in PF we can use a score function. If we interpret each x i as a random variable with binary outcomes, we can compute information-theoretic measures for the detection of its degree of association to other variables [13]. For each x i PF x j , we measure the point-wise mutual-information (pmi) that quantifies the discrepancy between x i and x j for their outcomes x and y. Here, to detect the association between alterations that accumulate during progression, we set x = y = 1.
In some cases, we will also use the expected value of pmi over all the possible outcomes of x i and x j , which is the mutual information (mi) These measures are standard [13], and could be used to derive alternative score functions (e.g., conditional pointwise or entropy). In this work, however, we limit our scope to pmi and pmi.

Strategies for structure selection and parameters' learning
The prima facie ordering PF induces a mapping to 2 | PF | potential models. Some of these are not trees or might be bad at capturing the distribution of the data. Thus, the problem of picking a particular to build G is non trivial. By combining PF with a pmi/mi score we have obtained a weighted graph. Thus, we can exploit algorithms to extract trees (or other types of models) with certain properties, from the input graph. We denote with pmi PF and respectively with mi PF the graphs weighted with pmi or mi. We devise two classes of structure selection strategies.
(Minimum) Spanning tree algorithms. These are a class of algorithms that aim at detecting the subset of pairs that (i) minimize the total structure's weight, and that (ii) display as a tree. The total weight is defined as the summation of the weights of the pairs that are selected. This is a well-known problem in graph theory, and we can exploit optimal solutions from the literature. The approaches are different according to the graph that these are given as input.
Edm Edmonds' optimum branching algorithm is a canonical solution to the task of inferring a spanning tree of minimum weight from a weighted directed graph, given a root node in input [14]. We use this algorithm with pmi PF as input, provided that we make it acyclic. Cycles/ loops breaking is a hard computational problem, and we resort on the heuristic that is used by the CAPRI algorithm [6]. This heuristics breaks loops according their confidence, defined as the combination of p-values for Suppes' conditions: edges with small confidence (i.e., high p-values) are deleted first to break loops. When pmi PF is made acyclic, to use Edmonds' algorithm and maximize our pmi scores, we can change their sign. The running time of this algorithm is O(|X | · | pmi PF |), which can be optimized to O(|X | log(| PF |)) for sparse 1 pmi PF [15]. PRIM Prim's algorithm is the equivalent of Edmonds' for undirected tree structures [16]. We can use if we forget the directionality of the edges in our prima facie ordering, i.e., if for every pair x i PF x j we also force the inclusion of x j PF x i no matter what the test statistics for Suppes' conditions. If we want to use this search strategy, however, we will use mi PF as input, as we need to use a measure that is symmetric and defined over all the support of the random variables (i.e., pmi is not accounting for the cases x i = 1 − x j = 1 and viceversa). The complexity of this algorithm, if it is implemented by using a binary heap and an adjacency list for G, is the same as Edmonds'. The final tree returned by this strategy is undirected, and so we orient it according to the marginal frequencies of the events. That is, for every final pair x i PF x j and x j PF x i we select x i → x j if p(x i ) > p(x j ). For this reason, the final model could contain confluent structures such as x i PF x j and x k PF x j -i.e., a model with two edges confluent in x j : x i → x j and x k → x j . We observe that, when this happen, the final model is not a tree as x j has more than one parent. The interpretation in terms of the induced distribution is still that of a Bayesian Network.
Gbw Cycles in pmi PF can be handled in another search schema by exploiting Gabow's algorithm [17], before using Edmond's algorithm to maximize the weight of the final tree. Gabow's algorithm algorithm is optimal to detect the strongly connected components of the directed graph pmi we thus create, for each strongly connected component, all the possible trees associated; then we select the tree at maximum weight for each such set of candidates. We can optimize this procedure by separating acyclic sub-graphs, if any, in the very beginning. This algorithm is optimal but more expensive than Edm, and shall be seen as an alternative way to deal with loops in the prima facie structure.
Bayesian model-selection strategies.
ChL A Chow-Liu tree is an optimal method for constructing a second-order product approximation of the joint probability distribution over X [18]. It is known that such a tree minimizes the Kullback-Leibler distance to the actual joint distribution, and can be interpreted as a Bayesian Network. For constructing the optimal tree, at each iteration the algorithm adds the maximum mi pair to the tree. This algorithm returns an undirected structure when we run it with mi PF as input, that we make directed as in the PRIM search strategy. It should be evident the similarity between the two algorithms, also in terms of complexity.
Learning parameters θ. Given a graph (or, as a special case a tree), we can fit its parameters as standard in the Bayesian Networks approach, by maximum likelihood estimation from D [11].
Compared algorithms. As motivated in the Main Text, SCITE and OncoNEM are at the stateof-the art for two orthogonal problems in single-cell phylogenetic inference (mutational vs clonal ordering). In our simulations we compared against SCITE, since its mutational tree is directly comparable to our models. We also attempted at comparing our methods against OncoNEM; however, the computational burden of OncoNEM did not allow extensive tests, see Table 10.

Synthetic data and performance measures
Comparison between the algorithms is based on large synthetic tests for different combinations of model type, size, number of samples, noise etc. We describe here the details of the approach, and provide the user with its R implementation in the TRONCO tool. TRONCO https : //sites.google.com/site/troncopackage is also available on Bioconductor.
Sampling from single-cell sequencing. Genotypes from single-cell sequencing are sampled by a phylogeny. We describe the simpler case of sampling from a single tree, more general cases are trivial extensions. A cartoon is shown in Figure 2 that shows some possible single-cell genotypes.
The following recursive procedure visits a tree, starting from its root x * (i.e., we set x * = 1 in the genotype), and outputs a sampled genotype according to its structure and parameters.
• if we are visiting a leaf x l (i.e., a node without outgoing edges) with incoming edge x i → x l , then we sample x l = 1 (in the genotype) with probability θ x l = p(x l | x i ), and 0 otherwise.
• if we are visiting a branching node x i (i.e., x i = 1 in the genotype) with children x i → x j and x i → x k we either sample only one of the children and we proceed recursively, or we stop. Notice that we forbid to sample a genotype with both children 2 , i.e., p(x j , x k | x i ) = 0, so To have consistent cell genotypes for the whole model, when we recursively proceed with x j (resp. x k ) we set x k (resp. x j ) and all its descendants equal to 0.
Notice that, by construction, genotypes are consistent with the phylogeny of the generative model, as We observe that: (i) we can easily generalize this procedure to an arbitrary amount of children per branching, and that (ii) this procedure generates only genotypes that correspond to cancer cells (because we start with x * = 1). If required, we can a posteriori add wild-type genotypes to a dataset to account for contamination of normal cells.
Sampling from multi-region sequencing. When we collect and sequence a bulk of tumor cells we get a signal that is a mixture of alterations found in different tumor sub-populations.
In terms of induced distribution and the branching structures described in single-cell sequencing sampling, this means that data will support p(x i = x j = x k ) > 0 as the sequenced samples will contain cells from both populations with signatures x i = x j = 1 − x k and x i = x k = 1 − x j . To create such a signal there are different ways. On one hand, one can change the effect of branchings on the induced distribution to account for p(x i = x j = x k ) > 0, on the other one can emulate a mixed signal by mixing a number of individual signals. The former approach requires more parameters in the generative model to account for the conditional probabilities of both children, given a parent node. We adopt the latter approach and sample c genotypes from a single-cell sequencing experiment: let z 1 , . . ., z c be such samples, we create a sample where each component of z * is 1 if at least one z i is 1. Then, we repeat the procedure to produce as many samples as we need, according to the number of regions that we want to simulate. This approach requires only one more parameter, c. If one interprets the c samples as c cells, one might be tempted to pick a very large c (e.g., c 10 6 ). If one does so, however, all z * will be similar, as for large c the proportions of the sampled genotypes will converge to the true ones. Thus, our dataset would have small variance across samples, biasing the data. Thus, to have more variance, we set c to be small (c < n) and interpret c as the probabilistic number of clones spread across the regions that we sequence.
Adding noise to syynthetic data. When we say that observed data are obtained by adding noise to sampled genotypes, we mean the usual introduction of false positives and negatives with rates + ≥ 0 and − ≥ 0, respectively [2,3,6]. Precisely, when we have sampled a dataset of samples D according either to (8) or (9), we apply to D an independent point-wise process that flips the matrix's entries according to the rates + / − .
To investigate the ideal performance of the algorithms we sometimes use noise-free data, that is + = − = 0. In more realistic setting, we use different models of noise according to the type of simulate sequencing. Sequencing of single cells is characterized by distinct errors, which usually take place in the DNA amplification phase: (i) allelic dropouts and (ii) false alleles, the former occurring at a significantly higher rate, thus leading to higher rates of false negatives. Accordingly, in the generation of noisy single-cell data we expect highly asymmetric noise parameters + − , that we simulate by assuming one-order of magnitude in their difference. Multi-region bulk sequencing data instead harbour more balanced noise effects, in that case we set + = − .
Performance measured. We want to measure the tendency to overfit or underfit of every algorithm, in particular circumstances of sample size, noise etc. Thus, the performances measured in each experiment are: • the rate at which true model edges are inferred sensitivity TP TP+FN ; • the rate at which false model edges are discarded specificity TN TN+FP ; where TP, TN, FP, FN are the number of true (T)/ false (F) positives (P) and negatives (N).
Algorithms' implementation. The official tools with the implementation of each algorithm that we test have been used. CAPRI, CAPRESE and the algorithms that we describe in this paper are available in the TRONCO tool for TRanslational ONCOlogy. SCITE was downloaded from its Github repository and OncoNEM from its Bitbucket one.

Testing a taxonomy of models
We define a "taxonomy" which describes different key phenomena in cancer progression ( Figure 2).
1. Inference with drivers: in which the generative models are phylogenetic trees whose nodes are the genomic alterations that have a functional role in the progression. These are called drivers and provide a selective advantage to the clones that harbours them; 2. Inference with drivers and confounding factors: in which the generative models are phylogenetic trees, as above, but the observed data also include uncorrelated random events. This is a handle to account for possible confounding factors, i.e., genomic alterations that have no functional role in the progression, and that we do not know a priori; 3. Inference with multiple cells of origin: in which the generative topology are multiple independent trees, grouped as a forest. This is a way to model tumors that originate from two or more cells or clones, a phenomenon also known as polyclonal tumor origin [19]. 4. Inference with convergent trajectories: in which the data are generated by sampling from DAG models, thus with confluences accounting for genomic events possibly occurring at a late stage in distinct clonal evolutionary trajectories of the same tumor, i.e., specific events that are independently gained by distinct cancer clones at different times. This scenario implies an explicit violation of the Infinite Sites Assumption (ISA), which underlies most standard phylogenetic methods. In this case, we adopt for most of the tests an interpretation of the confluence equivalent to that of the upstream branching, that is of exclusivity 3 .
According to this taxonomy, we perform a simple non-exhaustive test of the performance of the algorithms. We distinguish the test if data are from single-cell or multi-region sequencing. These tests are rather simple in size and variability of settings, and more detailed tests are carried out in the next sections. However, they still give a general idea of the general performance trend.

Single-cell data
Experiment I We test a set of fixed topologies according to the taxonomy in Figure 2: (i) a tree with n = 11 nodes (Inference with drivers), (ii) which we then augment with the addition of 2 disconnected nodes (Inference with drivers and confounding factors), (iii) a forest with two distinct trees that account for n = 7 nodes (Inference with multiple cells of origin), and (iv) a DAG with n = 8 nodes and 2 confluences (Inference with convergent trajectories). For each single test we generated 100 single-cell datasets with m = 75 samples and a mild noise setting, + = 0.005 − = 0.05.

Inference with drivers
The performances of all the algorithms are consistently similar and very good. Yet, Gbw and Edm with pmi reach highest performance, while SCITE displays a larger dispersion. All algorithm but CAPRI have similar median sensitivity, suggesting a comparable capability of inferring the true relations from data. Specificity scores, instead, suggest that SCITE tends to overfit more than other approaches (i.e., lower true negative rate). CAPRI displays a very high specificity, but that is possibly due its "regularization" approach that tends to return sparse models (thus, trading specificity for sensitivity) 4 .

Inference with drivers and confounding factors
A similar trend is observed also in this scenario, with SCITE displaying a very good sensitivity, but also the lowest specificity among all. We can observe that the best algorithm in this case seems to be PRIM, as it displays the same sensitivity of SCITE, but higher specificity. CAPRI's very high specificity is still due to the regularization terms. We finally show in Table 2 that our approach is capable to model the trend of selecting advantage among events, with the confounding factor consistently presenting a much lower trend compared to all the other events.

Inference with multiple cells of origin
In this case Gbw, Edm and SCITE present an identical performance with median values of both specificity and sensitivity equal to 100%, slightly outperforming the other algorithms. In particular, by looking at Table 3 one can notice that all the techniques are able to retrieve the two distinct roots of the progression, with the exception of CAPRESE and CAPRI in a few cases.

Inference with convergent trajectories
In this case, SCITE displays a better overall performance, especially in terms of sensitivity. Gbw and SCITE have 100% specificity, while Edm 95%. Thus, in this case the major problem seem to identify the true edges, which SCITE does with 65% sensitivity. Furthermore, by looking at Tables 4 and 5 one can see how may times one of the two hard-exclusivity confluences ( Table 4) and soft-exclusivity confluences (Table 5) shown in Fig. 3B are retrieved by the distinct algorithms, over 100 experiments. Furthermore, we provide an example of convergent evolution that is not modelled by our approach, i.e., when Suppes' conditions are violated, in Figure 4. Also in this settings SCITE displays the best performance, followed by Gbw and Edm. We notice that by construction only CHOW-LIU and PRIM algorithms can infere both arcs in the confluence.

Multi-region bulk-sequencing data.
Experiment I-MR This experiment reproduces Experiment I with multi-region data sampled from the same generative topologies in Figure 3. The parameters are identical to Experiment I, with the exception of the number of samples (in this case the number of biopsies, rather than cells), which is here set to m = 20.
Results: (Figure 5, Table 6 and Table 7) 1. Inference with drivers All the algorithms but CAPRI and SCITE display identical median values of sensitivity and specificity, with Gbw slightly outperforming the other techniques. However, all the algorithms struggle in retrieving a large number of true relations, as the median sensitivity ranges around 50% for most techniques. In this case, SCITE is the least accurate algorithm, showing a poor efficacy in retrieving both the true positives and true negatives, whereas CAPRI displays high specificity and low sensitivity.

Inference with drivers and confounding factors
As expected, the overall performance slightly worsens with respect to the previous scenario. In particular, the average sensitivity values are lower because the confounding factor complicate the inference problem by introducing spurious correlations. The general trend is however preserved, with Gbw and SCITE being respectively the most and the least accurate algorithms.

Inference with multiple cells of origin
In this case, CAPRI, PRIM and ChL show a slightly better trade-off between sensitivity and specificity, while SCITE seems less accurate when dealing with this mixed signal from multiple populations. More in detail, most algorithms are not able to infer the two distinct roots of the progression in most cases, with the exception of CAPRESE and CAPRI, which succeed in around half of the cases (Table 6). Edm and Gbw display a slightly better performance than the remaining techniques, yet remarkably worse than the single-cell case.

Inference with convergent trajectories
In this case, the median sensitivity is very similar and lower than 40% for all the techniques, highlighting the difficulty in inferring the true relations with this specific generative topology. All in all, Gbw and CAPRI present a slightly better combination of sensitivity and specificity. In Table 7 one can notice that all the algorithms, but CAPRESE and CAPRI, present a very similar capability in retrieving either one of the two confluences of the generative model, or both of them, yet with a significantly worse performance with respect to the single-cell case.

Detailed tests with Single-cell data
We show here results from more detailed comparative tests of the algorithms for single-cell data. In the following, we will be using synthetic data generated (i) from biologically plausible phylogenetic models, and (ii) from a large number of randomly generated topologies.

Inference with drivers
Experiment II We consider polyclonal tumors originating from a unique cell, in a single-cell sequencing experiment. To generate data we fix the phylogenetic trees in Figure 6 [20]. These trees have variable number of clones (n = 6, 11, 17); we use them to sample different number of sequenced cells (m = 10, 50, 100). Besides one ideal noise-free setting, we perturb data with plausible medium and high asymmetric noise rates ( + / − ), in order to mimic characteristics errors in sampling cells and calling mutations. We compare the performance with over 4000 independent tests. Figure 7): Reasonably, the overall performance of each algorithm is higher with lower levels of noise and larger datasets. In the ideal cases of noise-free data and 100 sampled cells, for instance, all algorithms converge to the true generative model. Noteworthy, in many realistic cases, median sensitivity and specificity measures are above 90%. The overall performance trend depends on model size (n), the smaller models being easier to infer as one might expect. For all algorithms, the ability to detect true relations (sensitivity) clearly drops for pathological settings (e.g., we infer 20% of the true edges for the 17-clones model, when we sequence 10 cells).

Results (
Gbw, Edm and SCITE, display a similar superior ability to infer the true relations (i.e., high sensitivity). However, SCITE seems to overfit (i.e., with a 10% loss of specificity for the 17-clones model, when we sequence 100 cells). This is particularly evident with small datasets and models. It also persists with larger models, most likely because of the larger search space for its MCMC heuristics. CAPRI, as expected, shows very high specificity but low sensitivity, due to BIC regularization. Experiment III We generalize Experiment II to 100 randomly sampled topologies with variable number of nodes (n = 5, 10,20). This shall avoid any bias induced by holding fixed the polyclonal topologies in Figure 6.
Results (Figure 8): In general, the results partially reflect those of Experiment II. All the algorithms display very good performances in most settings, in many cases converging to the generative topologies (especially with small models, i.e., with n = 5, 10 nodes). A very similar and optimal overall performance is that obtained by Gbw, Edm, CAPRESE and SCITE, with minor differences in the different parameter settings, yet with an evident tendency of SCITE in inferring denser models with more false positives (i.e., highlighting a lower specificity). Also in this case CAPRI show a very good specificity because of the regularization, but fails in capturing many true positives.
Experiment IV. In order to assess the robustness of the inference with respect to different rates of false positives and false negatives rates provided as input to the algorithms, we investigated the variation of the performance of two selected algorithms, namely Gbw and SCITE, on a dataset generated from the Medium phylogenetic tree in Figure 6 Results (Table 8 and Table 9): By looking at the performance with respect to the different combinations of + and − provided as input to the algorithms, we unexpectedly do not observe noteworthy variations, for both the algorithms. This result indicates that, if the value of noise provided as input to the algorithms is close to the real value (i.e., within a reasonable range), the inference accuracy is not remarkably perturbed. As a consequence, one might question about the usefulness of the usually computationally expensive techniques used for the inference of noise models, as done, for instance, by SCITE. We leave some further comments on this topic to the main text.

Inference with drivers and confounding factors
Here true variables are mixed to random 0/1 variables, totally unrelated to the progression. This could be a simple model of uncertainty in the calling, where we over-call variants that are not true drivers at a certain error rate. Experiment V We use data from Experiment II; to each dataset we add random binary columns. A column is a repeated sampling of a biased coin, with bias uniformly sampled among the marginals of all events. n × 10% random columns are inserted per dataset, where n is the true model size. Figure 9): Surprisingly, the results of this experiment reflects those of Experiment II, with minor differences in the performance of the various algorithms in the different settings. The overall performance is good, at least with sufficiently large datasets and sufficiently small levels of noise. Also in this case, SCITE tends to overfit, especially with larger models and small datasets.

Inference for tumors with multiple cells of origin
In this case the signal that we detect in the data is composed from different true signals, one per population of cells. So, we need to infer a forest with a number of trees equal to the number of 13 different progressions, which it seems reasonable to assume to be low, e.g., below 5.
Experiment VI We extend the sampling strategy in Experiment III to account for forests with fixed total number of nodes, i.e., n = 20. We perform the same procedure of that experiment.
Results ( Figure 10): All the algorithms display a very low sensitivity with small datasets (with 20% median value with m = 10 samples), remarkably increasing the performance with larger datasets (median values around 75% with m = 100 samples in the noise-free case). Gbw, Edm and CAPRESE show a good tradeoff between sensitivity and specificity, displaying a good and similar performance, whereas SCITE confirms the tendency to overfit for small datasets, yet being the most robust algorithm against noise in the data.

Inference with missing data
In addition to the false positives/negatives introduced in the data via allelic dropouts and false alleles, unobserved or missing data points represent another major problem when dealing with single-cell sequencing data. In the early works, around 60% of the data were missing due to the low quality of the sequencing technique. Even though the technology has remarkably improved during the last few years, leading to more reliable and usable data, we here investigate the influence of missing data on the inference, with respect to the considered algorithms. In particular, we performed simulated experiments with a specific generative topology and with different amounts of missing data, ranging from 10% to 40%, analyzing the variation in the inference accuracy.
Experiment VII In order to evaluate the impact of missing data on the inference accuracy, we chose 20 benchmark single-cell datasets generated from the Medium phylogenetic tree in Figure 6, with n = 11 nodes and m = 75 samples. 10 of these datasets were generated with + = − = 0, and the remaining 10 datasets with (ii) + = 0.005, − = 0.05. For each of the 20 datasets we generated 5 further datasets, with the following ratio of randomly included missing entries: r = (0, 0.1, 0.2, 0.3, 0.4), for a total of 100 distinct datasets. As SCITE naturally deals with datasets with missing data, we performed the inference with no further parameters. Instead, in order to perform the reconstruction with the remaining algorithms, we followed this procedure. For each one of the 80 datasets with missing data (we did not consider the case with r = 100), we filled the missing entries via a classical Expectation Maximization (EM) algorithm, and we repeated this step to create 100 complete datasets (for each incomplete datasets). We then performed the inference with all the algorithms on all the 100 datasets in each case, selecting the model with the best likelihood score, which was then used in the performance assessment.
Results ( Figure 11): As one can see in Figure 11, the performance of all the algorithms is profoundly affected by the presence of missing data, both in the noisy and in the noise-free cases. SCITE displays an overall more robust sensitivity than the other techniques, yet in spite of a worse specificity, which would point at a tendency toward overfitting also in this scenario. As expected, the performance of all the techniques is significantly better in the noise-free case and, in general, is maintained at acceptable levels up to values of missing data around 20%/30% according to the cases.

Detailed tests with Multi-region bulk-sequencing data
Analogously, we here present the results of detailed comparative tests of the algorithms on synthetic multi-region sequencing data.

Inference with drivers
Experiment II-MR Here we reproduce Experiment II in the case of multi-region data, and sample the polyclonal topologies shown in Figure 6, with symmetric noise rates ( − = + ). We compare the boxplot performance from 100 tests.
Results ( Figure 12): The accuracy of most algorithms is good in all the scenarios. They all reach high values of specificity, whereas satisfactory values of sensitivity are observed only with combination of sufficiently large datasets and sufficiently low noise. As expected, the overall performance worsens with larger and more complex generative models.
In this case Gbw and Edm display the best efficiency in retrieving both the true positives and negatives, Edm being slightly better in a certain number of parameter settings. Conversely, SCITE shows the worst performance, especially with small datasets and low levels of noise, yet proving a certain robustness to the increase in the noise level. CAPRI displays very good values of specificity even with these data type, yet most likely due to its regularization.
Experiment III-MR This experiment reproduces Experiment III with multi-region data, hence sampling the datasets from 100 randomly generated topologies with variable number of nodes.
Results ( Figure 13): The results of this experiment resemble those of Experiment III. An overall good performance is observed, yet with low sensitivity with small datasets and noisy data.
Gbw and Edm consistently display optimal and very similar trends, with Edm showing a better performance in a sightly larger number of settings. SCITE confirms to be poorly accurate when dealing with multi-region data, especially with small datasets, even when the level of noise is low.
We remark that we chose not to include an experiment analogous to Experiment IV, on inference robustness in the multi-region case, because with symmetrical noise rates we do not expect significant differences in the inference accuracy.

Inference with drivers and confounding factors
Experiment V-MR We reproduce Experiment V with multi-region data augmented with n × 10% random columns. We use the three topologies shown in Figure 6.
Results ( Figure 14): The results are in accordance with those of the analogous single-cell experiment, with an expected overall decrease in the accuracy due to the introduction of spuriously correlated events in the data. Edm proves to be the most accurate algorithm, slightly improving over Gbw, with SCITE being the least efficient in retrieving both the true and the false relations, especially with small datasets and/or low noise levels.

Inference for tumors with multiple cells of origin
Experiment VI-MR We reproduce Experiment VI with multi-region data, and with datasets sampled from random forests (see Figure 2). Figure 15): In accordance with the other results shown hereby, Gbw, Edm and CAPRESE appear to be the most accurate algorithms in this scenario. The former achieves the best sensitivity and specificity in most settings. PRIM proves to be very efficient in retrieving the true relations in many cases, especially with low levels of noise, yet presenting a certain tendency toward overfitting. Also in this case SCITE is the least accurate algorithm in most settings.

Parameters settings and computation time
Parameter settings. The notations to identify the parameters used in this manuscript and the values that we used are given here. when we sample random models. In random tests, 100 trees are generated for each configuration of n and m (see below), and one dataset per tree is sampled. In some tests, we fix n to 6, 11 and 17 ( Figure 6); we specify in the experiment description if that is the case. This corresponds to pairs of equal value that differ for an order of magnitude, e.g., ( + = 0.015, − = 0.15), consistently with the observation that false negative rates in single-cell data are much higher than false positives ones. Such values correspond to an overall error rate that is 2× + and 2× − . For multi-region sequencing, these errors are symmetric When these are sampled at random we impose the constraint that for any pair of variables X and Y it holds and for any marginal p(X) > 0.001. These values seem reasonable to avoid the introduction of biases in the sampling process. In some cases we assigned fixed values to the CPTs, which we report in the corresponding figures (e.g., in Figure 6). p = 0.05 significance of the Mann-Whitney test (p-value).
Computation time. In order to assess and compare the computation time of the distinct techniques we choose a specific parameter setting: we used the Medium phylogenetic tree in Figure 6 as generative topology, with n = 11 nodes, m = 75 samples, + = 0.005, − = 0.05, and we repeated the inference for 100 distinct experiments, on a single core of a Lenovo Thinkpad t430s with an Intel i7 3520M 4-core 2.90GHz processor and 16Gb Ram. In Table 10 one can notice that CAPRESE is the fastest algorithm, followed by the group of PRIM, Gbw, Edm, CAPRI and ChL, which display an almost identical running time, and around 7X slower than CAPRESE. SCITE is a remarkably slower algorithm (i.e., 25X slower than CAPRESE and more than 3X slower than the group of PRIM), whereas OncoNEM has the worst performance (i.e., 300X slower than CAPRESE, around 40X slower than the group of CAPRI, and 12x slower than SCITE), which also prevented us to perform more extensive experiments on such algorithm.

Case studies with real data
We executed TRaIT on 1 single cell (see [21]) and 1 multiple biopsy datasets (see [22]).

Triple-negative breast cancer single-cell data
We first analyzed the data from [21] of single-nucleus exome sequencing of 16 tumor cells from a triple-negative breast cancer. Only two states are called for each site: the presence or absence of a SNV, so we used this as a direct input to our analysis. In [21] the estimated error rates for this dataset are 9.72% for allelic dropout and 1.24 × 10 −6 for false discovery.
We show in figures 16 and 17 the inference for each algorithm. We refer to the main text for a detailed analysis of this dataset.

Colorectal cancer multi-region data
We considered the dataset of [22] of multi-regional primary colon carcinoma surgically resected between October 2013 and April 2014. In particular, we considered Patient 3 from the study [22] consisting of 3 regions (P3-1, P3-2, and P3-3) of the primary cancer and two metastatic regions.
We show in figure 18 (left) the original dataset of 40 alterations of different types and (right) the final input for TRaIT after a set of clonally indistinguishable events were merged. Furthermore, we show in figure 19 the progression models obtained by the different algorithms of TRaIT and in figure the model by SCITE 20.
We refer to the main text for a detailed analysis of this dataset.

Noise model
Here we show the derivation of the noise model for both marginal and joint probabilities.
Proof. Let us call + the probability of observing 1 when we had 0 (false positive) and − the other way around (false negative). We remark that we assume these probability to be strictly in [0, 0.5) with value 0.5 representing totally random entries. Then, for any event x i , we can write the expectation of the probability of observing it (here with the notation n i ), given its theoretical probability p(x i ) as follow.
and with some rearrangements, Joint probabilities.
Proof. Let us now consider the theoretical joint probability p(x i,j ) (in what follow also called p(x i , x j ) to make the elements of the probability explicit) of any two pair of events and the respective observed marginal and joint probabilitis n i , n j and n i,j . Then, the expectation of n i,j can be written as follow.
and being, with some rearrangements,

Figure 1:
The op problems (image taken from [1]). Davis and Navin define three variants to the problem of inferring cancer progression from single-cell data [1]. The pop (phylogenetic ordering problem), which is a classical phylogenetic inference problem and aims at displaying a set of input cells as leaves of a phylogenetic tree. The cop (clonal ordering problem), which seeks to identify a clonal lineage tree that models an ancestry-relation for a set of clonal signatures that we infer from data. The mop (mutational ordering problem), which seeks to find the order of mutations that accumulate during cancer progression. In this paper, we focus on the mop problem. Figure 2: A. Inference with drivers. This phylogenetic tree has 6 nodes (one for each cancer driver xi), and a unique variant/cell of origin, x1. This is the most common case in which inference is carried out. B. Inference with drivers and confounding factors. A phylogenetic model can be extended with spurious variables zi that confound the inference problem. The inference is hindered because we have to detect the spuriousness of the association between any zi and the true driver variants xi. C. Inference with multiple cells of origin.A forest of phylogenetic models can describe the presence of multiple independent progressions that start from different variants (x1 and x7), as it might happen with tumors that start from different cells of origin. D. Inference with convergent trajectories. A DAG can capture confluent evolution of selective pressures toward some event, here x4 (i.e., a violation of the Infinite Sites Assumption). Such confluences can be hard-exclusive if parents of a common later node are never observed together, while if this latter case may occur, we have a situation of soft-exclusivity. Algorithms that assume the true model to be a tree can not infer such models.
x 1 x 2 x 3 x 4 x 5 x 6 x 1 x 2 x 3 x 4 x 5 x 6 z i x 1 x 2 x 3 x 4 x 5 x 6 x 7 x 8 x 9 x 1 x 2 x 3 x 4 x 5 x 4 x 1 x 2 x 3 x 4 x 5 x 4 x 6   Figure 6: Polyclonal trees [20]. The polyclonal trees that were used to study the performance of BitPhylogeny in [20]. These have different number n of clones, which we classify as low, medium and high. We use them to generate synthetic data some tests, e.g., in Experiment II. The wildtype is a fake node that accounts for samples with no mutations annotated (in our models, we do not have that node). 1.00  Figure 7: Experiment II. Inference with drivers on fixed polyclonal topologies: singlecell data. All the algorithms are tested on datasets generated from the three phylogenetic trees shown in Figure 6, respectively including n = 6, 11, 17 distinct clones. Three incremental level of unbalanced noise in the data are assumed: the ideal noise-free case, and two cases at higher orders of magnitude ( + = 5 · 10 −3 , − = 5 · 10 −2 and + = 2 · 10 −2 , − = 2 · 10 −1 ). We test distinct sample set sizes (m = 10, 50, 100) and we generate 100 distinct datasets for each case, reporting the distributions of sensitivity and specificity. Figure 8: Experiment III. Inference with drivers on multiple random topologies: single-cell data. The algorithms are tested on single-cell datasets generated from a large number (i.e., 100 for each case) of random tree topologies, with n = 5, 10, 20 distinct clones. As in Experiment II, three levels of noise ( + = − = 0, + = 5 · 10 −3 , − = 5 · 10 −2 , + = 2 · 10 −2 , − = 2 · 10 −1 ), and three sample sizes (m = 10, 50, 100) are tested. Figure 9: Experiment V. Inference with drivers and confounding factors on fixed polyclonal topologies: single-cell data. n×10% random (0/1) columns are added to the single-cell datasets generated from the three phylogenetic trees shown in Figure 6. Also in this case, three levels of noise ( + = − = 0, + = 5 · 10 −3 , − = 5 · 10 −2 , + = 2 · 10 −2 , − = 2 · 10 −1 ), and three sample sizes (m = 10, 50, 100) are tested.     Inference with drivers and confounding factors on fixed polyclonal topologies: multi-region data. n × 10% random (0/1) columns are added to the multi-region datasets generated from the three phylogenetic trees shown in Figure 6. Three levels of noise and sample size are tested.   Figure 18: Colorectal cancer multi-region data -Part 1. Genomic alterations of Patient 3 by [22] consisting of 3 regions (P3-1, P3-2, and P3-3) of the primary cancer and two metastatic regions. We show on the left the original dataset of 40 alterations of different types and on the right the final input for TRaIT after a set of clonally indistinguishable events were merged.  Figure 19: Colorectal cancer multi-region data -Part 2. Analysis by TRaIT of the data from [22] consisting of 3 regions (P3-1, P3-2, and P3-3) of the primary cancer and two metastatic regions.  Figure 20: Colorectal cancer multi-region data -Part 3: Model inferred by SCITE A model inferred via SCITE on the data from [22] consisting of 3 regions (P3-1, P3-2, and P3-3) of the primary cancer and two metastatic regions. The events marked either in yellow or green are identically shared by all the samples or by a subgroup, respectively, and would be considered as indistinguishable in our framework. Due to its complex noise model, instead, SCITE retrieves extremely long linear chain of consecutive events, which are however non unique. SCITE, in fact, returns many equivalent models with the same data and, hence, we here show a heatmap displaying for each different model inferred with this dataset (i.e., column), for each event in the model (i.e., row), its parent event (each event is characterized by a unique color, the root being associated to white color). Considered that the rows are significantly non-uniform across the models for most events, and especially for those belonging to the aforementioned yellow and green groups, this result shows that no unique ordering can be identified by SCITE and that no sensible choice for a specific mutational tree instead of another could be done with such algorithmic framework.  Table 2: Experiment I. Inference with drivers and confounding factors. We here show an assessment of how confounding factors are handled by our framework. Specifically, we show (i) mean p-values of the probability raising condition for the in/out-coming arcs for each node in the causal prior graph structure (see Main Text), (ii) mean p-values of the probability raising condition for the in/out-coming arcs for each node in the topology inferred by Edm and (iii) same of (ii), but here we show the average number of in/out-coming arcs per node. The statistics are average on 100 different single-cell datasets generated from the low polyclonal tree topology with 1 confounding factor and a total of n = 7 nodes (see Figure 6). We used a noise-free configuration and 100 samples. From the experiment, it is clear that the confounding node consistently shows a lower selective advantage trend compared to the others.  Table 3: Experiment I. Inference with multiple cells of origin. Number of inferred models including respectively 1, 2 or 3 distinct roots, on 100 different single-cell datasets generated from the forest topology with two distinct roots, with n = 7 nodes, shown in Figure 3. Table 4: Experiment I. Inference with exclusive convergent trajectories. In this table is shown how may times one of the two hard-exclusivity confluences shown in Fig. 3 are retrieved by the distinct algorithms, over 100 experiments.  Fig. 3 are retrieved by the distinct algorithms, over 100 experiments. One should notive that, by construction, only CAPRI, CHOW-LIU and PRIM algorithms can infer confluences.  Table 8: Experiment IV. Noise-robustness test. Variation of the average sensitivity and specificity (and standard deviation) of Gbw on datasets generated from the Medium phylogenetic tree in   Table 9: Experiment IV. Noise-robustness test. Variation of the average sensitivity and specificity (and standard deviation) of SCITE on datasets generated from the Medium phylogenetic tree in Figure 6, with n = 11 nodes, m = 75 samples, + = 5 × 10 − 3, − = 5 × 10 − 2, and all the possible combinations of input + and − in the following ranges: + = (3, 4, 5, 6, 7) × 10 −3 and − = (3, 4, 5, 6, 7) × 10 −2 .

SCITE model
Average execution time  Table 10: Average execution time and total execution time on 100 distinct inference experiments, for the distinct algorithms and the parameter setting described in the Section 6.