Indices for measuring numerical chromosomal variation
The instability index (I) is a metric that calculates the percentage of cells that contain a chromosomal aberration [7]. This metric does not directly depend on the number of chromosomes; however, measuring more chromosomes may increase the likelihood of detecting at least one chromosome that contains an abnormal number of copies.
The Average Number of Copy Number Alterations (ANCA) score has been applied in the context of colorectal and cervical cancer in an attempt to quantify the relationship between tumor aggressiveness and genomic instability [8, 9]. Previous studies have uncovered that more aggressive tumors have a higher ANCA score. However, one limitation of the ANCA score is that it does not account for the number of chromosomes examined. Within aneuvis, we introduce a derivative of the ANCA score, called the Normalized ANCA score, which accounts for the number of chromosomes measured and enables comparisons of this metric between experiments that utilize different numbers of probes.
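The instability index and the two ANCA variants can be illustrated with a minimal sketch. It assumes that each cell is a list of per-chromosome copy numbers, that two copies is the reference state, that ANCA is the number of altered chromosomes divided by the number of cells, and that the Normalized ANCA score divides ANCA by the number of chromosomes measured. The function names are hypothetical and are not taken from the aneuvis source.

```python
from typing import Sequence

EUPLOID = 2  # assumed reference copy number for every measured chromosome


def instability_index(cells: Sequence[Sequence[int]]) -> float:
    """Percentage of cells with at least one chromosome not at EUPLOID copies."""
    abnormal = sum(any(c != EUPLOID for c in cell) for cell in cells)
    return 100.0 * abnormal / len(cells)


def anca(cells: Sequence[Sequence[int]]) -> float:
    """Number of altered chromosomes divided by the number of cells."""
    alterations = sum(c != EUPLOID for cell in cells for c in cell)
    return alterations / len(cells)


def normalized_anca(cells: Sequence[Sequence[int]]) -> float:
    """ANCA divided by the number of chromosomes measured (assumed normalization)."""
    return anca(cells) / len(cells[0])


# Four cells, two chromosomes measured per cell.
cells = [[2, 2], [2, 3], [1, 3], [2, 2]]
print(instability_index(cells))  # 50.0
print(anca(cells))               # 0.75
print(normalized_anca(cells))    # 0.375
```

Under these assumptions, adding probes can only raise or preserve ANCA, whereas the normalized value stays on a per-chromosome scale.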
The aneuploidy (D) and heterogeneity (H) scores were derived from Bakker et al. and represent a pair of statistics that account for both the number of cells and the number of chromosomes tested [5]. The aneuploidy score increases as chromosome copy number increases, and it is the only score that takes the actual copy number of each chromosome into account. The heterogeneity score increases with the number of distinct chromosomal states observed, and is maximized when each cell has a distinct state. In contrast to the aneuploidy score, the heterogeneity score does not incorporate the chromosomal copy number. These statistics were derived for summarizing copy number data from whole genome single cell sequencing, though their flexible formulation enables them to be applied to other datasets.
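The sketch below shows one reading of these two scores. It assumes that D is the mean absolute deviation from a euploid copy number of two, and that H is computed per chromosome by ranking copy number states from most to least frequent and weighting the cell counts by rank. These formulas are our interpretation of Bakker et al. [5], not a verbatim transcription, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def aneuploidy_score(cells: Sequence[Sequence[int]], euploid: int = 2) -> float:
    """Mean absolute deviation from the euploid copy number over cells and chromosomes."""
    n_cells, n_chrom = len(cells), len(cells[0])
    total = sum(abs(c - euploid) for cell in cells for c in cell)
    return total / (n_cells * n_chrom)


def heterogeneity_score(cells: Sequence[Sequence[int]]) -> float:
    """Rank-weighted count of cells in each copy number state, averaged as above."""
    n_cells, n_chrom = len(cells), len(cells[0])
    total = 0
    for t in range(n_chrom):
        # Cell counts per state at chromosome t, most frequent state first (rank 0).
        counts = sorted(Counter(cell[t] for cell in cells).values(), reverse=True)
        total += sum(rank * m for rank, m in enumerate(counts))
    return total / (n_cells * n_chrom)


mixed = [[2, 2], [2, 2], [2, 2], [4, 4]]      # one tetraploid cell among diploid cells
uniform = [[4, 4], [4, 4], [4, 4]]            # every cell tetraploid
print(aneuploidy_score(mixed), heterogeneity_score(mixed))      # 0.5 0.25
print(aneuploidy_score(uniform), heterogeneity_score(uniform))  # 2.0 0.0
```

The uniform population shows how the two scores separate: every cell deviates from two copies, so D is high, but all cells share one state, so H is zero.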
In a cell, there are three possible states that a set of chromosomes can assume. Diploidy refers to the presence of two copies of each autosome in a cell, and is the physiologic state of most non-cancerous human cells. Polyploidy refers to an integer-valued increase in the number of chromosomes, often resulting from whole-genome duplication. Aneuploidy occurs when the copy number of one or more chromosomes differs from the others and is a feature of many cancers.
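These three states can be expressed as a simple rule over the copy numbers measured in one cell. The sketch below is illustrative only: the function name is hypothetical, and assigning a uniform copy number below two to the aneuploid category is an arbitrary choice made here, since the definitions above do not cover that case.

```python
from typing import Sequence


def classify_ploidy(copy_numbers: Sequence[int]) -> str:
    """Classify one cell from the copy numbers of its measured chromosomes."""
    if len(set(copy_numbers)) == 1:
        n = copy_numbers[0]
        if n == 2:
            return "diploid"
        if n > 2:
            return "polyploid"
    # Copy numbers differ between chromosomes, or are uniformly below two
    # (the latter assignment is an assumption of this sketch).
    return "aneuploid"


print(classify_ploidy([2, 2, 2, 2]))  # diploid
print(classify_ploidy([4, 4, 4, 4]))  # polyploid
print(classify_ploidy([2, 3, 2, 2]))  # aneuploid
```

Because only the measured chromosomes are inspected, a cell classified as diploid or polyploid here may still be aneuploid for a chromosome that was not probed.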
Bivariate percentage heatmap
The bivariate percentage heatmap is used for visualizing the covariation between the counts of two chromosomes in a population of single cells. Each square within the grid represents the percentage of cells observed with a certain number of chromosomes listed on the X and Y axes. This approach is appropriate for FISH data, where the ploidy of cells is inferred from chromosome-specific fluorescent probes. For FISH data that include measurements from more than two chromosomes, multiple bivariate plots are produced in aneuvis to account for all possible pairwise combinations of chromosomes. For example, a population of cells where 4 chromosomes were measured would generate \( \binom{4}{2} = 6 \) bivariate percentage plots.
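The values behind one such heatmap, and the enumeration of chromosome pairs, can be sketched as follows. The sketch assumes each cell is a mapping from chromosome name to signal count. The function names and chromosome labels are hypothetical, and no plotting is attempted.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def bivariate_percentages(
    cells: Sequence[Dict[str, int]], x: str, y: str
) -> Dict[Tuple[int, int], float]:
    """Percentage of cells observed with each (count of x, count of y) combination."""
    counts = Counter((cell[x], cell[y]) for cell in cells)
    return {pair: 100.0 * k / len(cells) for pair, k in counts.items()}


def chromosome_pairs(chromosomes: Sequence[str]) -> List[Tuple[str, str]]:
    """All unordered pairs of chromosomes, one bivariate plot per pair."""
    return list(combinations(chromosomes, 2))


cells = [
    {"chr9": 2, "chr12": 2},
    {"chr9": 2, "chr12": 2},
    {"chr9": 3, "chr12": 2},
    {"chr9": 2, "chr12": 4},
]
print(bivariate_percentages(cells, "chr9", "chr12"))
# {(2, 2): 50.0, (3, 2): 25.0, (2, 4): 25.0}
print(len(chromosome_pairs(["chr1", "chr9", "chr12", "chr17"])))  # 6
```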
Permutation testing
For a user-selected summary statistic, permutation tests are performed for all pairwise group comparisons by randomly shuffling the group labels associated with each observed cell across all groups. The number of permutations is set to 500 by default but can be adjusted by the user.
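A minimal sketch of this procedure is given below. It assumes that the test statistic for each pair of groups is the absolute difference in the summary statistic, and that the p-value is the proportion of permutations with a difference at least as large as the observed one, with one added to the numerator and denominator. Neither choice is specified above, and the function names are hypothetical.

```python
import random
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple


def percent_abnormal(cells: Sequence[Sequence[int]]) -> float:
    """Example summary statistic: percentage of cells with any non-diploid count."""
    return 100.0 * sum(any(c != 2 for c in cell) for cell in cells) / len(cells)


def permutation_test(
    cells: Sequence[Sequence[int]],
    labels: Sequence[str],
    statistic: Callable[[Sequence[Sequence[int]]], float],
    n_perm: int = 500,
    seed: int = 0,
) -> Dict[Tuple[str, str], float]:
    """Permutation p-value for every pair of groups, shuffling labels across all cells."""
    rng = random.Random(seed)
    groups = sorted(set(labels))
    pairs = list(combinations(groups, 2))

    def pair_differences(lbls: Sequence[str]) -> Dict[Tuple[str, str], float]:
        stats = {
            g: statistic([c for c, lab in zip(cells, lbls) if lab == g]) for g in groups
        }
        return {(a, b): abs(stats[a] - stats[b]) for a, b in pairs}

    observed = pair_differences(labels)
    exceed = {p: 0 for p in pairs}
    shuffled = list(labels)  # shuffling a copy keeps group sizes fixed
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        permuted = pair_differences(shuffled)
        for p in pairs:
            if permuted[p] >= observed[p]:
                exceed[p] += 1
    return {p: (exceed[p] + 1) / (n_perm + 1) for p in pairs}


cells = [[2, 2]] * 20 + [[2, 3]] * 20
labels = ["Control"] * 20 + ["Treatment A"] * 20
print(permutation_test(cells, labels, percent_abnormal))
```

With the default of 500 permutations, the smallest p-value this sketch can report is 1/501, which is one reason a user might raise the permutation count.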
Spectral karyotyping (SKY)
Copy number information is extracted using regular expressions from SKY karyotypes recorded in ISCN format within Microsoft Excel files.
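As a rough illustration of this kind of extraction, the sketch below parses only the simplest ISCN strings, such as "47,XX,+8". It assumes a diploid baseline for the autosomes and adjusts it for whole-chromosome gains and losses written as "+N" or "-N". The regular expressions and function name are hypothetical and are not those used in aneuvis. Structural rearrangements, sex chromosomes, and clonal notation are ignored.

```python
import re
from typing import Dict, Tuple

# Hypothetical patterns: a leading modal chromosome count, and a whole-autosome
# gain or loss such as "+8" or "-9".
MODAL_COUNT = re.compile(r"^\d+")
GAIN_LOSS = re.compile(r"^([+-])(\d{1,2})$")


def parse_karyotype(karyotype: str) -> Tuple[int, Dict[str, int]]:
    """Return the modal count and per-autosome copy numbers from a simple ISCN string."""
    fields = [f.strip() for f in karyotype.split(",")]
    modal = MODAL_COUNT.match(fields[0])
    if modal is None:
        raise ValueError(f"no chromosome count found in {karyotype!r}")
    copies = {str(i): 2 for i in range(1, 23)}  # assumed diploid baseline
    for field in fields[2:]:  # skip the count and the sex chromosome field
        m = GAIN_LOSS.match(field)
        if m and m.group(2) in copies:
            copies[m.group(2)] += 1 if m.group(1) == "+" else -1
    return int(modal.group()), copies


total, copies = parse_karyotype("45,XY,-9,t(9;22)(q34;q11)")
print(total, copies["9"], copies["22"])  # 45 1 2
```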
Single cell whole genome sequencing
Within aneuvis, copy number output in browser extensible data (BED) format is converted to a whole-chromosome summary copy number computed using a weighted average, where the inferred copy number at each bin along a chromosome is weighted by the size of that bin (in base pairs). The weighted average is rounded to the nearest integer to obtain the chromosome copy number.
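The weighted average can be sketched as follows, assuming each bin is given as a (chromosome, start, end, copy number) tuple with BED-style half-open coordinates. The function name is hypothetical, and rounding exact halves upward is an assumption, since the handling of ties is not specified above.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Bin = Tuple[str, int, int, int]  # (chromosome, start, end, inferred copy number)


def chromosome_copy_numbers(bins: Iterable[Bin]) -> Dict[str, int]:
    """Size-weighted mean copy number per chromosome, rounded to the nearest integer."""
    weighted_sum: Dict[str, float] = defaultdict(float)
    total_size: Dict[str, int] = defaultdict(int)
    for chrom, start, end, copy_number in bins:
        size = end - start  # bin size in base pairs
        weighted_sum[chrom] += copy_number * size
        total_size[chrom] += size
    # floor(x + 0.5) rounds halves upward; Python's round() would round them to even.
    return {
        chrom: math.floor(weighted_sum[chrom] / total_size[chrom] + 0.5)
        for chrom in weighted_sum
    }


bins = [
    ("chr1", 0, 100, 2),
    ("chr1", 100, 400, 3),  # larger bin dominates: (2*100 + 3*300) / 400 = 2.75
    ("chr2", 0, 200, 2),
]
print(chromosome_copy_numbers(bins))  # {'chr1': 3, 'chr2': 2}
```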
For the usage scenario, low-coverage (0.01x) single cell whole genome sequencing (sc-WGS) data were generated from 27 young and 56 senescent IMR90 cells (for a total of 83 cells) across two sequencing runs. IMR90 cells were obtained from American Type Culture Collection (ATCC) (CCL-186). BAM files generated from the Torrent Suite software were converted to .bed files using the bedtools2 bamToBed function. BED files were uploaded into Ginkgo’s user interface [16] and processed using variable bin sizes of approximately 2.5 megabases (Mb), based on simulations of 150 bp reads, with global segmentation [15]. The copy number matrix output from Ginkgo was used as input into aneuvis. Ginkgo copy number output and BED files are available at a Ginkgo-generated permalink [23].
Experimental cell culture and four-color interphase FISH
Young and senescent IMR90 cells were generated and analyzed by four-color interphase FISH, as described previously [12]. Images representing nuclei were randomly acquired and saved as .tiff composite files for both young (N = 406) and senescent (N = 396) cells. Images were visually inspected, and FISH signals for chromosomes 9 and 12 within each nucleus were counted manually in a blinded fashion, as described previously [12].
Example data
Example data using three treatment groups for each type of experimental input (FISH, SKY, and sc-WGS) are available through the aneuvis web application. Example FISH and SKY datasets represent ploidy counts that were manually generated to show varying degrees of severity across treatments. The example sc-WGS dataset is a breast cancer single cell dataset taken from Ginkgo [15, 24]. Artificial labels (Control, Treatment A, Treatment B) were added to all three example datasets to simulate treatments of varying severity.
Summary of Ginkgo output
BED files from 83 cells were uploaded into Ginkgo and processed as described in the “Single cell whole genome sequencing” section above. Screenshots were taken from each of the four sections of the Ginkgo output, described below. First, a “tree-display” within Ginkgo showcases a dendrogram of all cells based on genome-wide copy number status similarity (Fig. 6a, left side). Second, the “processed-data” section (Fig. 6a, right side) contains summarized copy number data in various formats that are available for download. The integer “copy number” state file from this section can be used as input into aneuvis for further statistical analysis and visualization, particularly if different treatment groups were part of the experimental design. Third, a series of heatmaps displays the copy number state or the number of reads from each cell at each bin in the genome (Fig. 6b). Fourth, a “summary” section shows a copy number scatterplot for each input .bed or .bam file alongside quality control summaries, such as the number of reads per file (Fig. 6c). Graphical outputs from selected files can also be generated from these copy number or quality control metrics. All visualizations are available in their original format at a Ginkgo-generated permalink [23].