Selected articles from the 16th International Symposium on Bioinformatics Research and Applications (ISBRA20): bioinformatics
 Research
 Open Access
 Published:
Statistical evidence for the presence of trajectory in singlecell data
BMC Bioinformatics volume 23, Article number: 340 (2022)
Abstract
Background
Cells progressing from an early state to a developed state give rise to lineages in cell differentiation. Knowledge of these lineages is central to developmental biology. Each biological lineage corresponds to a trajectory in a dynamical system. Emerging singlecell technologies such as singlecell RNA sequencing can capture molecular abundance in diverse cell types in a developing tissue. Many computational methods have been developed to infer trajectories from singlecell data. However, to our knowledge, none of the existing methods address the problem of determining the existence of a trajectory in observed data before attempting trajectory inference.
Results
We introduce a method to identify the existence of a trajectory using three graphbased statistics. A permutation test is utilized to calculate the empirical distribution of the test statistic under the null hypothesis that a trajectory does not exist. Finally, a pvalue is calculated to quantify the statistical significance for the presence of trajectory in the data.
Conclusions
Our work contributes new statistics to assess the level of uncertainty in trajectory inference to increase the understanding of biological system dynamics.
Background
Inferring global topological patterns in multivariate data is a central step in understanding dynamical systems such as a living cell. Knowledge of global patterns in the data can be used as a basis for deciding the underlying mechanisms. This is especially relevant to developing biological systems. Dynamical processes such as cell differentiation are revealed in lineages of cells progressing from an early state to a developed state. Singlecell technologies that capture these dynamic processes enable our understanding of global topological patterns that can provide insight into developmental biology. The lineages of cells in a developing tissue are treated as trajectories in dynamical systems where each point represents a cell. Dozens of computational methods have been developed to infer trajectories from singlecell data [1]. Most trajectory inference methods attempt to infer trajectory from a graph representation of the singlecell data. Graphrepresentation learning [2] obtains a low manifold representation of the dynamics in the singlecell data onto which trajectory inference can then be applied. LISA2 employs k nearest neighbors (kNN) and builds a spanning tree with user specified root and leaves [3]. STREAM, a trajectory inference method that is applicable on both transcriptomic and epigenetic data employs principal graphs to trajectories [4]. Some methods incorporate additional information other than singlecell gene expression to infer trajectories. CellPath incorporates RNA velocity information and employs kNN graph to infer trajectories [5].
However, to our knowledge, none of the existing methods address the problem of determining the existence of a trajectory in observed data before attempting trajectory inference. To address this problem, we introduce graphbased statistics which quantify trajectory existence. Minimumspanningtree (MST) based statistics have been successfully used in analyzing global structures from large datasets. Krzewina and Saslaw provide a notable example in astronomy [6], where they introduced MSTbased statistics that captured filamentary structures in galaxy data. Although our approach is also MST based, we design different statistics to emphasize the existence of a global trajectory structure. Graphbased statistics have connections to topological data analysis (TDA) [7], which makes use of topological algebra and are thus deterministic in nature. As a result, most of the methods are effectively applicable only to datasets that have clearly defined structures such as the torus. Although the use of statistics have been recently introduced in TDA, the direction is still vastly underexplored. In complementary, our work could help answer the statistical significance question for TDA.
Results
We present a framework using three graph measures to quantify statistical significance for the presence of trajectory in a given dataset. The framework takes as input a data matrix such as a singlecell RNAseq dataset, in which each data point is a cell, and the genes are features. For each statistic, the framework gives as output a set of pvalues, each corresponding to a given number of clusters k, such that \(a \le k \le b\) where a and b are minimum and maximum number of clusters, respectively. From the set of pvalues, the median pvalue measures the statistical significance for the presence of trajectory in that particular dataset. Figure 1 gives an overview of the framework and illustrates the three test statistics.
Next, we delve into a detailed description of the main steps that the framework undertakes in computing the pvalues.
For an input multivariate data set X, we first perform dimensionality reduction on the data. For each k in a given range, a permutation is performed to calculate the distribution of the test statistic under the null hypothesis that there is no trajectory in the dataset. The next step calculates a pvalue that quantifies the statistical significance for the presence of trajectory in the data.
To capture any global topological structure, the data are first partitioned into k homogeneous regions using kmeans clustering. A weighted undirected graph G is created by the k cluster centers with edges connecting every pair weighted by their Euclidean distance. An MST H is then computed on graph G using Prim’s algorithm. We define an MST H by a set of k nodes \(V = \{v_1,v_2,\ldots ,v_k\}\) and \(k1\) edges \(E=\{(v_i,v_j)\) there is an edge between nodes \(v_i\) and \(v_j\) in H, for \(i,j = 1,2,\ldots ,k\)}. We now introduce three treebased statistics to characterize the presence of trajectory in the data. These are the number of degreeone nodes \(T_1(X) = D_1(H)\), the number of degreetwo nodes \(T_2(X) = D_2(H)\), and the length of a longest path \(T_3(X) = L_{\max }(H)\). Detailed definitions of the statistics and algorithms are given in the “Methods” section.
Evaluation on simulated and real singlecell data
We demonstrate how the three test statistics report existence of trajectory in singlecell RNAseq datasets that are simulated or real observations. For singlecell data, the raw dimensionality is typically in the order of tens of thousands. We visualize the data using 2dimensional principal component analysis (PCA) plots of the datasets. These 2dimensional data are the input to our analysis. We selected two real datasets, one with trajectory and the other one without trajectory [8,9,10]. Two simulated datasets were also similarly selected. For each dataset, a permutation test was performed to calculate the empirical distribution of the test statistic under the null hypothesis that there is no trajectory in the dataset. We also selected an additional four simulated datasets to demonstrate the effectiveness of our method in capturing different types of trajectories. The next step involved calculating a pvalue that quantifies the statistical significance for the presence of trajectory in the data for every value of k, the number of clusters, ranging from 5 to 35 clusters. The choice of the range of clusters is critical. Using fewer number of clusters may result in high pvalues even when there is a clear trajectory in the data. For instance, using three clusters would always result in an MST with the highest degree of linearity. Using too many clusters would also result in high pvalues even when there is a clearly defined trajectory, since the high number of clusters would result in a highly branched MST. Empirical evidence shows that the number of clusters k, such that \(5 \le k \le 35\), is reasonable enough to capture significant topological variations in the data. The optimal number of clusters in terms of separability may be too low to capture the statistical significance of the resulting topological structure. Additionally, since our goal is to capture homogeneous regions instead of clustering in terms of separability, we do not attempt to predetermine the optimal number of clusters to build the MST on. Therefore, the results for each dataset consist of 31 pvalues corresponding to the number of clusters in the range \(5 \le k \le 35\), with each pvalue quantifying the statistical significance for the presence of trajectory at k number of clusters. The median of those pvalues is then used to represent the overall statistical significance for the presence of trajectory.
Figures 2, 3, 4 and 5 show the empirical distributions of the statistics as well as the pvalues on each of the four datasets. Figure 6 demonstrates the performance of the 3 statistics on a more complex real fibroblast singlecell dataset. Figure 7 shows pvalues on additional four simulated datasets with different types of trajectories.
The shapes of the distributions of all three test statistics under the null hypothesis that there is no trajectory in the data are roughly similar, with differences in the range and mean values for each statistic. The distributions shown for each statistic were computed at the number of clusters, k, that corresponds with the median pvalue. The pvalues of the four datasets are consistent with the survey results. Across all three statistics, the median pvalues, as shown in the box plots, tend to be lower when there is a trajectory in the data, and much higher when the trajectory is not clearly defined. The pvalues for the number of degreetwo nodes statistic tend to be slightly lower in datasets that have no trajectory and slightly higher in datasets that have trajectory. These results demonstrate the effectiveness of these MSTbased statistics in reflecting nature and presence of trajectory in datasets.
In Fig. 2, the simulated dataset has no trajectory and the median pvalues across all statistics are relatively high. The median pvalues of each statistic are the output of the test: \(D_1\) has 0.815, \(D_2\) has 0.673 and \(L_{\max }\) has 0.664. The results are consistent with the expected output of the method. Since the data have no trajectory, the MST built on the cluster centers would be highly branched resulting in high pvalues. The dataset provides a good example in which the strength of linearity of the resulting MST is lowest. Since all the statistics can differentiate between extreme cases of linearity, it is no surprise that the pvalues are relatively high across all statistics. The null distributions of the statistics are relatively comparable, with a unimodal shape that roughly resembles binomial distributions. The main difference in the distributions is the range and mean values of the statistics.
In Fig. 3, the simulated dataset has a clearly defined trajectory. The median pvalues of the \(D_1\) and \(D_2\) statistics are 0.013 and 0.008 respectively, which depicts a strong statistical significance for the presence of trajectory. The median pvalue of the longestpath length statistic is comparably higher at 0.029. Even though the pvalue is comparably higher, the statistic fairly captures the trajectory. The dataset provides an example in which the degree of linearity of the resulting MST is relatively high. As expected, \(D_1\) and \(D_2\) statistics best capture the trajectory since they can best detect a high degree of linearity. The shapes of null distributions across all three statistics are also comparable.
In Fig. 4, the dataset is real, singlecell data with a cellcycle trajectory [11] but also strong noise. The trajectory is not observable on the data presented. The strength of linearity of the resulting MST is relatively low. Therefore, \(D_1\) and \(D_2\) statistics do not capture the trajectory well as depicted by the median pvalues. The median pvalues for \(D_1\) and \(D_2\) statistics are 0.555 and 0.337, respectively. On the other hand, the median pvalue of \(L_{\max }\) is comparably lower at 0.160. Therefore, the statistical evidence for this true trajectory is low due to strong noise.
In Fig. 5, the dataset is real, mesoderm singlecell data with a development trajectory [9, 10]. The strength of linearity of the resulting MST is high, with few branches. As expected all the statistics capture the trajectory, with relatively low median pvalues. The median pvalues of \(D_1\) and \(D_2\) statistics are 0.163 and 0.167 respectively. \(L_{\max }\) has the lowest median pvalue at 0.027. All the statistics accurately captured the presence of trajectory in the dataset relatively to the noisy cellcycle data in Fig. 4.
Figure 6 shows the threedimensional scatter plot and the pvalues of the three statistics for the trajectory in singlecell data of fibroblast reprogramming. To obtain the data, singlecell RNA sequencing was performed at multiple time points during reprogramming from mouse embryonic fibroblasts to neuronal cells [12]. The median pvalues of 0.065, 0.05 and 0.069 for \(D_1\), \(D_2\) and \(L_{\max }\), respectively, are relatively significant for all the statistics, demonstrating the effectiveness of the statistics in capturing trajectory presence.
We further demonstrate the effectiveness of the three trajectory presence statistics on four simulated singlecell datasets with three containing trajectories and one containing no trajectory (Fig. 7). Figure 7a shows a multifurcating trajectory. Our method effectively captures this type of trajectory with median pvalues of 0.048, 0.050 and 0.146 for \(D_1\), \(D_2\) and \(L_{\max }\), respectively, as given in the corresponding box plots. Even though the trajectory has some branches, a large enough number of clusters would capture the linearity of the trajectory. Figure 7b shows a convergingdiverging trajectory. The trajectory is captured with median pvalues of 0.029, 0.029 and 0.060 for \(D_1\), \(D_2\) and \(L_{\max }\) statistics, respectively. Figure 7c shows a disconnectedlooping trajectory. The statistical significance for the presence of this trajectory is the most significant, with a median pvalue of 0.002 for all the 3 statistics. The MST in that dataset is highly linear and resembles a path graph for the majority of k, the number of clusters that the MST is captured at. Figure 7d has no trajectory, and as expected, the median pvalues for all the three statistics are relatively high, at 0.898, 0.769 and 0.884 for \(D_1\), \(D_2\) and \(L_{\max }\), respectively.
Discussion
Effectiveness of MST statistics
Our work aims to identify statistics that optimally characterize and distinguish the trajectory structures of data between dynamic patterns and random data. All the three statistics are either maximized or minimized if and only if there is a linear trajectory. The statistics promote trajectory patterns, and nonrandomness is between linear pattern and star trees, when there is maximum branching. Empirical studies revealed that the statistics may miss some nonrandom patterns that are not trajectories such as discrete clusters. There is, therefore, relatively high confidence that most of the nonrandom patterns captured by the statistics are trajectories.
Clustering and median pvalue
We employ clustering to partition the data into homogeneous partitions, which are ideal for capturing trajectorylike structures. Intuitively, different numbers of partitions on the same data may capture distinct types of structures. However, when the trajectory is perfectly linear, different numbers of partitions capture the same underlying trajectory structure. By trying out different numbers of partitions, our empirical studies have demonstrated that the median pvalue is the best representative and more conservative value to quantify presence of trajectory. In the future, we plan to find a theoretical basis for the choice of the median pvalue.
Using kmeans clustering to capture homogeneous regions in the data works but is not ideal. We hypothesize that the accuracy may be improved if the partitioned regions closely resemble each other.
Null distribution
Computing the empirical distributions of the statistics is CPU intensive due to permutation testing. Our future work will focus on optimizing the computation process as well as identifying analytical null distributions of the test statistics to gain both precision and efficiency.
Conclusions
We address the problem of statistical evidence for the presence of trajectory on multivariate data. We employed three statistics based on minimum spanning trees to differentiate between data with and without a trajectory. We applied the framework on both real and simulated singlecell RNAseq datasets and the empirical distributions of the statistics are comparable. The pvalue measures the statistical significance for the presence of trajectory in a dataset. The effectiveness of the statistics in detecting trajectory existence lays the ground work for further development of efficient algorithms for high dimensional data such as those from single cell biology.
Methods
The method to obtain the pvalues for each statistic is summarized in the algorithms below. The main entry point is the TestTrajectoryPresence algorithm, which takes as input multivariate data and returns a set of pvalues. The TestStatistic algorithm calculates the test statistic. The NullDistribution algorithm obtains the null distribution of the test statistic by permutation.
The number of degreeone nodes or leaves
The first statistic is the number of degreeone nodes or leaves in a tree graph, \(D_1(H)\), defined by
The degree of a node in a tree is defined as the number of edges incident on that node. We define a leaf as a node with degree one in a tree. We hypothesize that if there is a trajectory in the data, the MST built on the cluster centers would have fewer branches and thus tends to be more linear. Number of leaves is minimized when the tree is completely linear, when there are no branches in the tree. Such linear trees are also called path graphs. On the other end of the spectrum, the statistic is maximized when the tree is a star tree. In a tree with k nodes, the minimum number of leaves is two and the maximum number of leaves is \(k1\). As a result, the number of leaves statistic is quite effective in differentiating between the extremes of linearity of the tree. However, the statistic is not discriminating for some cases in the middle of the spectrum where there might be a noisy global linear trajectory but the number of leaves is not minimized. As MST H is derived from the input data X, we also call the corresponding test statistic \(T_{1(X)}=D_{1(H)}\).
The number of degreetwo nodes
Another closely related test statistic is the number of degreetwo nodes in a tree, \(D_{2(H)}\), defined by
Similar to the number of leaves, the statistic attempts to capture the linearity of the tree. When a tree with k nodes is linear and has no branches, the number of degreetwo nodes is maximized at \(k2\). On the other hand, the statistic is minimized at 0, when the tree is a star tree with maximum branching. The statistic, however, is not theoretically optimal in characterizing tree linearity since there are varying degrees of linearity for which the statistic is minimized at 0. As a result, the statistic is sensitive only in the extreme cases when a trajectory in the data is highly linear or when there is no trajectory at all in the data. On the MST H, we also call the corresponding test statistic \(T_2(X)=D_2(H)\), as H is derived from X.
The length of a longest path
Another test statistic is the length of a longest path in the tree. We define a longest path as a simple path in the tree with the most number of edges. The length of a longest path is defined by the number of edges in the path:
where \(\pi \subseteq G\) is a path graph and also a subgraph of G. This statistic characterizes the linearity of a tree by capturing tree compactness. A more compact tree tends to have more branches and is representative of data with no trajectory patterns. The statistic is minimized at length two on a star tree; it is maximized at length \(k1\) when the tree is a path graph of k nodes. On input data X which gives MST H, the third test statistic is \(T_3(X)=L_{\max }(H)\).
Permutation test
The input data are permuted to generate a null distribution for each statistic. The pvalues for all the statistics were calculated using onetailed tests. We hypothesized that a smaller number of degreeone nodes in the MST implies the presence of strong dynamical patterns, and we therefore, performed a lower tail test to compute the pvalues for the statistic. Conversely, we performed upper tail tests to compute the pvalues for the number of degreetwo nodes and longestpath length statistics.
Availability of data and materials
Source code and data to reproduce all results are provided at https://www.cs.nmsu.edu/~joemsong/TrajTest/BMC_source_and_data.zip.
Abbreviations
 MST:

Minimum spanning tree
 TDA:

Topological data analysis
 PCA:

Principal component analysis
 RNA:

Ribonucleic acid
References
Saelens W, Cannoodt R, Todorov H, Saeys Y. A comparison of singlecell trajectory inference methods. Nat Biotechnol. 2019;37:547–54. https://doi.org/10.1038/s4158701900719.
Hetzel L, Fischer DS, Günnemann S, Theis FJ. Graph representation learning for singlecell biology. Curr Opin Syst Biol. 2021;28:100347. https://doi.org/10.1016/j.coisb.2021.05.008.
Chen Y, Zhang Y, Li JYH, Ouyang Z. LISA2: learning complex singlecell trajectory and expression trends. Front Genet. 2021. https://doi.org/10.3389/fgene.2021.681206.
Chen H, Albergante L, Hsu JY, Lareau CA, Lo Bosco G, Guan J, Zhou S, Gorban AN, Bauer DE, Aryee MJ, Langenau DM, Zinovyev A, Buenrostro JD, Yuan GC, Pinello L. Singlecell trajectories reconstruction, exploration and mapping of omics data with stream. Nat Commun. 2019;10(1):1903. https://doi.org/10.1038/s41467019096704.
Zhang Z, Zhang X. Inference of highresolution trajectories in singlecell RNAseq data by using RNA velocity. Cell Rep Methods. 2021;1(6):100095. https://doi.org/10.1016/j.crmeth.2021.100095.
Krzewina LG, Saslaw WC. Minimal spanning tree statistics for the analysis of largescale structure. Mon Not R Astron Soc. 1996;278(3):869–76. https://doi.org/10.1093/mnras/278.3.869.
Chazal F, Michel B. An introduction to topological data analysis: fundamental and practical aspects for data scientists. arXiv, 1710.04019 (2017). arXiv:1710.04019
Cannoodt R, Saelens W, Todorov H, Saeys Y. Singlecellomics datasets containing a trajectory. Zenodo. 2018. https://doi.org/10.5281/zenodo.1443566.
Loh KM, Chen A, Koh PW, Deng TZ, Sinha R, Tsai JM, Barkal AA, Shen KY, Jain R, Morganti RM, ShyhChang N, Fernhoff NB, George BM, Wernig G, Salomon REA, Chen Z, Vogel H, Epstein JA, Kundaje A, Talbot WS, Beachy PA, Ang LT, Weissman IL. Mapping the pairwise choices leading from pluripotency to human bone, heart, and other mesoderm cell types. Cell. 2016;166(2):451–67. https://doi.org/10.1016/j.cell.2016.06.011.
Koh PW, Sinha R, Barkal AA, Morganti RM, Chen A, Weissman IL, Ang LT, Kundaje A, Loh KM. An atlas of transcriptional, chromatin accessibility, and surface marker changes in human mesoderm development. Sci Data. 2016;3:160109. https://doi.org/10.1038/sdata.2016.109.
Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Computational analysis of celltocell heterogeneity in singlecell RNAsequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33(2):155–60. https://doi.org/10.1038/nbt.3102.
Treutlein B, Lee QY, Camp JG, Mall M, Koh W, Shariati SAM, Sim S, Neff NF, Skotheim JM, Wernig M, Quake SR. Dissecting direct reprogramming from fibroblast to neuron using singlecell RNAseq. Nature. 2016;534(7607):391–5. https://doi.org/10.1038/nature18323.
Acknowledgements
Not applicable.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 23 Supplement 8, 2022: Selected articles from the 16th International Symposium on Bioinformatics Research and Applications (ISBRA20): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume23supplement8.
Funding
This work was supported by US National Science Foundation Grant 1661331 and in part by USDA National Institute of Food and Agriculture Grant 20165118125408. Publication costs are funded by both grants. The funding agencies had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Authors and Affiliations
Contributions
LT designed algorithms, implemented software, generated results, and wrote the manuscript. MS conceived the project, interpreted results, and wrote the manuscript. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Tenha, L., Song, M. Statistical evidence for the presence of trajectory in singlecell data. BMC Bioinformatics 23 (Suppl 8), 340 (2022). https://doi.org/10.1186/s12859022048759
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859022048759
Keywords
 Trajectory inference
 Minimum spanning tree
 Graphbased statistics
 Singlecell sequencing
 Developmental biology