PhyliCS: a Python library to explore scCNA data and quantify spatial tumor heterogeneity

Background Tumors are composed by a number of cancer cell subpopulations (subclones), characterized by a distinguishable set of mutations. This phenomenon, known as intra-tumor heterogeneity (ITH), may be studied using Copy Number Aberrations (CNAs). Nowadays ITH can be assessed at the highest possible resolution using single-cell DNA (scDNA) sequencing technology. Additionally, single-cell CNA (scCNA) profiles from multiple samples of the same tumor can in principle be exploited to study the spatial distribution of subclones within a tumor mass. However, since the technology required to generate large scDNA sequencing datasets is relatively recent, dedicated analytical approaches are still lacking. Results We present PhyliCS, the first tool which exploits scCNA data from multiple samples from the same tumor to estimate whether the different clones of a tumor are well mixed or spatially separated. Starting from the CNA data produced with third party instruments, it computes a score, the Spatial Heterogeneity score, aimed at distinguishing spatially intermixed cell populations from spatially segregated ones. Additionally, it provides functionalities to facilitate scDNA analysis, such as feature selection and dimensionality reduction methods, visualization tools and a flexible clustering module. Conclusions PhyliCS represents a valuable instrument to explore the extent of spatial heterogeneity in multi-regional tumour sampling, exploiting the potential of scCNA data. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04277-3.


Supplementary Method 1: sequencing reads processing and scCNA calling
We downloaded, from the NCBI Sequence Read Archive (accesion number PRJNA629885), the sequencing reads of cells from a triple-negative breast cancer (TNBC) cell line (MDA-MB-231) and those resulting from the clonal expansion of 2 single daughter cells (MDA231-EX1 and MDA231-EX2) from the parental cell line for 19 cell doublings (995 and 897 cells, respectively). Concerning the parental dataset, two batches of cells where available: 508 obtained with single-end sequencing and 312 resulting from paired-end sequencing. To avoid a batch effect, we only download the cells sequenced using the single-end technique. We aligned the single cell reads against the GRCh38 reference genome using BWA (v0.7.17) [1]. We filtered out low quality reads (MAPQ < 20), secondary alignments and PCR duplicates using SAMtools (v1.9) [2]. After that, we produced the BED files using BEDTools (v2.27.1) [3]. We, finally, computed the CNA events, for the three datasets, separately, using a standalone version of Ginkgo [4], with variable binning (mean bin size = 500kb) and default options. To generate boundaries for variable-length bins, for the reference genome, we used the method outlined by Garvin et al. [4] and implemented at Ginkgo repository: basically, it consists in sampling 101bp reads from the reference genome and mapping it back to itself (BWA), looking for uniquely mapping reads. After that, for the given bin size, reads are assinged to bins such that each bin has the same number of uniquely mappable reads. Consequently, intervals with higher repeat contents and low mappability will be larger than intervals with highly mappable sequences, although they will both have the same number of uniquely mappable positions.

Data prepocessing
First of all, we performed dimensionality reduction with the UMAP method provided by PhyliCS, to improve scCNA data clustering performance [5]. Then, we chose the optimal number of clusters, using nk_clust() method, which computes clustering quality according to multiple indices (Silhouette coefficient, Calinski and Harabasz score or Davies-Bouldin score), for a given clustering algorithm and a specified range of k's. For the current use-case, the Silhouette Coefficient was computed to evaluate Agglomerative Clustering performance (ward linkage), with min k = 3 and max k = 10. As Supplementary Table 1 shows, the best coefficient was obtained with k=3.

Clustering
As Supplementary Figures 3a and 3b show, the clusters obtained with k = 3 are coherent with the distribution of the data. However, it is clear that cluster 0 may still be divided into subgroups: thus, we clustered data with k = 7, the second optimal choice according to the Silhouette index, and obtained a more refined clustering of the data ( Supplementary Figures 4a and 4b).
It can be noticed, in Figure 4b, that the two prevalent cluster of cells, clusters 3 and 4, are characterized by a CN profile very similar to that of MDA-MB-231-EX1 cells, confirming the results found during the multisample analysis. among each other. Consequently, when comparing two subsamples, with varying cardinalities, from the two cell lines, we expect the SHscore to remain, almost, stable.aa As first step, we computed the SHscore using full datasets. Supplementary Figure 5 shows that our hypothesis of internal genetic homogeneity is confirmed while the SHscore value, 0.832, indicates that the two cell lines are characterized by well distinguishable CN profiles.
After that, we generated 9 subsamples from each cell line by randomly selecting a fraction (10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%) of its cells. We computed the SHscore on all pairs made of one of the two full datasets (e.g MDA-MB-231-EX1) and one of the subsamples from the  Figure 6 shows, the resulting SHscores fluctuated of a small quantity with respect to the initial SHscore. Specifically, they where distributed in the interval [0.815, 0.850], with median = 0.833 and IQR = 0.015. The small fluctuations of the score should not surprise because, albeit being internally homogeneous, the two cell lines still present a subclonal structure [6], so the downsampling operation may have targeted cells belonging to different subclones. Anyhow, the fact that the median result is almost identical to the original SHscore indicates that the proposed method is robust to the cardinality of the datasets and may be used to compare samples of any size; additionally, it demonstrates that the SHscore is able to capture heterogeneity even when the input dataset cardinalities are very unbalanced. Fig. 6. Supplementary Figure 6: Downsampling experiment. In order to test the robustness of the proposed method we downsampled both daughter cell lines, producing 9 subsamples for both of them, each containing a fraction 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% of their cells. We computed the SHscore on pairs made of one of the two full datasets against each of the subsamples of the other dataset and observed their distribution: despite small fluctuations, due to the little amount of heterogeneity existing in the the cell lines, the computed SHscores (min = 0.815, max = 0.850, median = 0.833, IQR = 0.015) were comparable to that computed for the orginal dataset (0.832), confirming the proposed method is robust to the cardinality of the input datasets.