Problem setup
Assume we have n number of training data pairs. Each data pair consists of a set of cell sequences and a true cell lineage tree. One example of such a pair is illustrated in Fig. 1.
For the ith data pair, let \(m_i\) be the number of cell sequences in the ith data pair and let t be the sequence length. Let the sequence information in data pair i be written as \(C^i\), an \(m_i\times t\) matrix. The matrix element \(C^i_{jk}\) describes the jth sequence and the kth letter of the ith training data pair. Furthermore, we represent \(L_i\), the cell lineage tree structure, in the form of a Newick format string.
For example, if Fig. 1 represents ith data pair, (a) shows the sequence information. It has \(m_i=4\) cell sequences, each sequence length has a length (t) of 10, and the first letter of the 3rd sequence is \(C^i_{3,1}=\text {E}\). The cell lineage tree is shown in Fig. 1b. In Newick format, \(L_i = ((1:0.5,2:0.5):2,(3:1.2,4:1):2)\).
We built our model (\(m(C;{\hat{\theta }})\)) with n training pairs. The input matrix, C is an \(m_i\times t\) sequence information matrix. The output is the Newick format string representing the tree structure while \(\theta\) represents the parameter set related to model \(m(C;\theta )\), and \({\hat{\theta }}\) represents the estimated parameter with n training data pairs.
To evaluate the model, we have l number of unused data. The loss is indicated as \(L(m(C^i;{\hat{\theta }}), L_i)\). The loss increases when the predicted tree structure (\(m(C;{\hat{\theta }})\)) differs from the true tree structure (\(L_i\)). We use the averaged loss metric (AL) defined below to evaluate the model \(m(C;{\hat{\theta }})\).
$$\begin{aligned} AL = \frac{\sum _{i=1}^{l} L(m(C^i;{\hat{\theta }}), L_i)}{l}. \end{aligned}$$
(1)
Next, we need to define the quantity \(L(L_1, L_2)\) that represents the dissimilarity between the two lineage trees, \(L_1\) and \(L_2\). The two widely used measures of this dissimilarity are the Robinson–Foulds (RF) distance [7] and the Triplet distance [8].
Figure 2 represents two lineage trees, \(L_1\) and \(L_2\). The possible separations for tree 1 are \(\{1,2\}, \{3,4,5\}\) and \(\{1,2,5\}, \{3,4\}\), and the possible separations for tree 2 are \(\{1,2\}, \{3,4,5\}\) and \(\{1,2,3\}, \{4,5\}\). Among the four possible separations, \(\{1,2\}, \{3,4,5\}\) in tree 1 and tree 2 are concordant separations. The RF distance is defined as the total number of concordant separations divided by the total number of separations. As an example: the RF score for Figure 2 is 2/4 = 0.5.
For triplet distance calculations, we sample three items among all items in the tree. We determine \(_5C_3=10\) possible cases. We check whether the tree structure of the three items in tree 1 and tree 2 are the same. For example, the five cases of \(\{1,2,3\}, \{1,2,4\}, \{1,2,5\}, \{1,4,5\}\) and \(\{2,4,5\}\) have the same tree structure in both tree 1 and tree 2. The triplet score is defined as the number of cases with the same tree structure divided by the number of possible cases. Accordingly, the triplet score for our example is 5/10 = 0.5.
We will use \(AL_{RF}\) to denote the RF distance and \(AL_{TP}\) to denote the triplet distance.
Overview of the modeling architecture
The two main issues that need to be answered in the lineage reconstruction problem are (1) how should the model \(m(C;\theta )\) be built and (2) how should \({\hat{\theta }}\) be estimated? Our modeling architecture for \(m(C;\theta )\) is described in Fig. 3. We divide the model into two parts: (1) estimating the distance between cells and (2) constructing a tree using a distance matrix. Let D(C) be a function for estimating the distance matrix for an \(m \times t\) input sequence matrix, C, and let t(D) be a function for predicting the lineage tree for an \(m \times m\) distance matrix, D. Note that a knowledge of the triangular components in D is sufficient for defining the distance matrix. Subsequently, the challenge becomes how \(D(\cdot )\) and \(t(\cdot )\) should be defined.
Choice of distance
One notion for calculating the distance is to define the distance function for the two sequences. Let \(d(C_{i\cdot }, C_{j\cdot }; \theta )=d_{ij}\) be the calculated distance between the ith cell and the jth cell obtained from the given cell information matrix C. The quantities \(C_{i\cdot }, i = 1,\cdots ,m\) represent the ith cell vector taken from C. The quantity \(d_{ij}\) is the (i, j)th element of the cell distance matrix D. The next challenge becomes how \(d(\cdot , \cdot )\) should be defined.
Hamming distance method
One naive approach to modeling \(d(\cdot , \cdot )\) is to compute the Hamming distance expressed in equation (2):
$$\begin{aligned} d_H(C_{i\cdot }, C_{j\cdot }) = \sum _{l=1}^{t} 1(C_{il}\ne C_{jl}), \end{aligned}$$
(2)
where \(1{(\text {condition})}\) is an indicator function with a value of 1 if the given condition is true and 0 otherwise. Note that the Hamming distance \(d_H(C_{i\cdot }, C_{j\cdot })\) simply counts unit differences between the two sequences \(C_{i\cdot }\) and \(C_{j\cdot }\).
The simple calculation of the Hamming distance does not meet the challenges of the present study. Consider the cell differentiation process illustrated in Fig. 4. Let the 2nd and the 3rd leaf cells (dotted) have \(C_{2\cdot } = \text {0AB-0}\) and \(C_{3\cdot }= \text {00CB0}\). The corresponding Hamming distance is \(d_H(C_{2\cdot }, C_{3\cdot }) = 3\). However, it is not reasonable to assign equal weights to each target position. This deficiency is addressed by the weighted Hamming approach described in the following sub-sub-section.
The code for using the Hamming distance method is available from the phangorn package [9] using the dist.hamming function.
Weighted Hamming distance (WHD) method
An example of the cell diffusion process is illustrated in Fig. 4. Consider the objective of reconstructing the lineage tree based on information about the leaf nodes. What would be an appropriate measure for calculating the distance between 0AB-0 and 00CB0 (see the dotted circles in Fig. 4)? The character difference between ‘A’ and ‘0’ would be less than the character difference between ‘B’ and ‘C’ as the initial state ‘0’ can be differentiated to ‘A’ or to any other outcome states, whereas states ‘B’ and ‘C’ cannot be differentiated to other outcome states. In addition, the missing state ’-’ maybe any other state. Seeking a mathematical formula to accommodate these nuances, we propose the weighted Hamming distance (WHD) method:
$$\begin{aligned} d_{WH1}(C_{i\cdot }, C_{j\cdot }) = \sum _{l=1}^{t} w_{C_{il}}w_{C_{jl}}1(C_{il}\ne C_{jl}), \end{aligned}$$
(3)
where, \(C_{il}\) is the lth character in the ith cell sequence, and \(w_{C_{il}}\) is a weight associated with the character \(C_{il}\). The code for calculating the WHD method is available as the dist_weighted_hamming function in the DCLEAR package. For the estimation of weight parameters, we used Bayesian hyperparameter optimization using the BayesianOptimization function in the rBayesianOptimization package [10].
k-mer replacement distance (KRD) method
The algorithm used to compute the k-mer replacement distance (KRD) method first uses the prominence of mutations in the character arrays to estimate the summary statistics used for the generation of the tree to be reconstructed. These statistics include the mutation rate, the mutation probability for each character in the array, the number of targets, and the number of cells. These estimated parameters were combined with pre-defined parameters, such as the number of cell divisions, to simulate multiple lineage trees starting from the non-mutated root. To generate trees with sizes and mutation distributions similar to the target tree, we generated 1000 lineage trees, each with 16 cell divisions of 216 leaves, a mutation rate of 0.1, arrays of 200 characters, 200 cells, and 30 states (‘A’-‘Z’ to ‘a’-‘c’, with an outcome probability following a Gamma distribution with a shape of 0.1 and a rate of 2). Different possibilities for the k-mer distance method were then estimated from the simulated lineage trees and used to compute the distances between the input sequences in the character arrays of internal nodes and tips. The KRD method is available from the DCLEAR package using the dist_replacement function.
Tree construction
We use existing methods such as Neighbor-Joining (NJ), UPGMA, and FastMe [11,12,13] for tree construction from the estimated distance matrix, D. The NJ method is implemented as the nj function in the Analysis of Phylogenetics and Evolution (ape) package, UPGMA is implemented as the upgma function in the phangorn package, and FastMe is implemented as fastme.bal, and fastme.ols in the ape package.