ClusTRace, a bioinformatic pipeline for analyzing clusters in virus phylogenies

Background SARS-CoV-2 is the highly transmissible etiologic agent of coronavirus disease 2019 (COVID-19) and has become a global scientific and public health challenge since December 2019. Several new variants of SARS-CoV-2 have emerged globally raising concern about prevention and treatment of COVID-19. Early detection and in-depth analysis of the emerging variants allowing pre-emptive alert and mitigation efforts are thus of paramount importance. Results Here we present ClusTRace, a novel bioinformatic pipeline for a fast and scalable analysis of sequence clusters or clades in large viral phylogenies. ClusTRace offers several high-level functionalities including lineage assignment, outlier filtering, aligning, phylogenetic tree reconstruction, cluster extraction, variant calling, visualization and reporting. ClusTRace was developed as an aid for COVID-19 transmission chain tracing in Finland with the main emphasis on fast screening of phylogenies for markers of super-spreading events and other features of concern, such as high rates of cluster growth and/or accumulation of novel mutations. Conclusions ClusTRace provides an effective interface that can significantly cut down learning and operating costs related to complex bioinformatic analysis of large viral sequence sets and phylogenies. All code is freely available from https://bitbucket.org/plyusnin/clustrace/ Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04709-8.

pandemic that has caused numerous deaths and human suffering, delivery and workforce shortages, travelling limitations, and many other disturbances to both business and normal life activities [4].
All virus genomes change over time due to mutations introduced in the viral genome, primarily by errors made by viral polymerases during replication [7]. However, most changes have minor effect on the phenotype of viruses. However, some mutations may affect the key pathogenic properties of the virus, such as transmissibility and disease severity, or the performance of vaccines, therapeutic agents or diagnostic tools [7].
The rapid progress in sequencing technologies has provided an opportunity to study viral molecular epidemiology and evolution in nearly real-time [8]. The current COVID-19 is the first pandemic with the pathogen being under surveillance using full genome sequencing on a global scale and over an extensive time period [9]. Surveillance of the pandemic creates demand for fast and scalable sequencing, genome assembling, viral strain assignment, phylogenetic analysis, variant calling and molecular epidemiology to inform contact tracing and non-pharmaceutical interventions. Although bioinformatics offers an abundance of methods and tools for sequence analysis, their employment in virology and epidemiology can be hindered by the developer-user gap between bioinformatics and other fields [10]. This gap can be bridged by pipelines tailored specifically for the analysis of viral sequences and equipped with intuitive interface and output reporting.
SARS-CoV-2 is the causative agent of coronavirus disease 2019 (COVID-19) [11]. The SARS-CoV-2 pandemic has already infected more than 437 million people in 224 countries, causing nearly 6 million deaths globally as of 1st of March 2022 (https:// www. world omete rs. info/ coron avirus/). SARS-CoV-2 is a global challenge, which is further complicated by the continuous emergence of new Variants of Concern (VOCs) or Variants of Interest (VOI). Variants that have carried VOC status include Alpha (B.1.1.7) [12], Beta (B1.351) [13], Gamma (P.1) [14], Delta (B.1.617.2) [15] and, as of writing this, we are experiencing the spread of Omicron variant (B.1.1.529) [16]. These VOCs pose an increased public health risk due to having one or more of the following characteristics: higher transmissibility [17], immune escape properties for antibodies from previous infection [18], lower response towards current vaccines compared to the original wild type strains these vaccines were based on [19]. Detecting and monitoring these novel variants is essential in SARS-CoV-2 surveillance.
A number of bioinformatic software packages are already available to help with detection, tracking and tracing of SARS-CoV-2 variation e.g. Pangolin [20], Nextstrain [21], Nextclade [22], Jovian [23], HaVoC [24] and Lazypipe [25]. Such tools are certainly helping the global effort for COVID-19 surveillance, but they are not void of limitations. Tools like Pangolin and Nextclade are primarily designed for tracking large accumulations of mutation events that are rare and may be preceded by the less visible sub-lineage genetic changes. Nextstrain offers a comprehensive analysis, but is heavily dependent on sequence metadata and dataset pre-filtering. Here we introduce ClusTRace (https:// www2. helsi nki. fi/ en/ proje cts/ clust race), a novel bioinformatic pipeline for Unix/Linux environments that complements the existing toolkits with unsupervised clade or cluster analysis, intuitive visualizations and reporting. ClusTRace can help with surveillance of the current ongoing COVID-19 pandemic and for any upcoming future epidemic or pandemic.

Implementation
ClusTRace is a bioinformatic software package implemented primarily in Perl. Clus-TRace supports several tasks that can be executed one by one or combined into pipelines (Fig. 1).
The analysis starts with consensus genomic sequences output by a given sequencing platform (e.g., Illumina). In the first step, ClusTRace assigns genomic sequences to a dynamic Pango lineage classification with Pangolin [20]. Then, ClusTRace collects sequences assigned to different lineages into separate multi-fasta files, so that each multi-fasta contains all sequences assigned to a given Pango lineage. Although we use Pangolin as the default lineage assigner, classification file can be produced with any method preferred by the user (the pipeline will accept any csv-file that conforms to Pangolin output format). All downstream analyses are performed separately for each lineage represented by a multi-fasta file.
Multi-fasta files are then pruned from outliers with SeqKit [26]. By default, we remove all sequences that deviate more than 10% from the median length of the sequence set or that have more than 10% gaps (these parameters can be modified on the command line with -minlen, -maxlen and -maxgap).
In the next step, filtered sequence sets for each lineage are aligned with MAFFT v7 [27]. Multiple sequence alignments (MSAs) are then trimmed for gaps with trimAl [28]. Trimmed alignments are used to construct phylogenetic trees with IQ-TREE 2 [29]. IQ-TREE 2 supports a wide range of substitution models and will, by default, use Mod-elFinder to determine the best-fitting model [29]. The user can choose to create bootstrapped consensus trees with IQ-TREE 2 Ultra-Fast Bootstrapping (ClusTRace-ufboot option) [30]. For very large sequence sets, the user can choose to run VeryFastTree [31] with GTR model (ClusTRace-tree vftree option). By default, ClusTRace will use COVID-19 reference genome (NCBI acc NC_045512.2) as an outgroup sequence to reroot all output phylogenetic trees. There is also an option to specify a separate outgroup sequence for each run.
In the next step, sequence clusters are extracted with TreeCluster [32]. Clusters are extracted with MaxClade-method at several pairwise distance cut-offs. We use two cut-off thresholds that are scaled to the size of the input reference genome (e.g. SARS-CoV-2) and roughly correspond to twenty and thirty mutations between pairs of sequences. MaxClade-method and cut-off thresholds (0.0007 and 0.001) were selected ad hoc based on our previous work with SARS-CoV-2 phylogenies [33]. These values can be easily modified by the user. Next, ClusTRace creates custom nexus trees in which sequences are assigned labels and colours according to the assigned cluster.
ClusTRace can read date annotations from sequence ids and will accept common date formats (e.g. "|YYYY-MM-DD|"). For date annotated sequences ClusTRace will trace the speed of growth for the extracted clusters. This is done by assigning sequences to time periods (calendar months or weeks) and by tracing the number of sequences that are assigned to each cluster and that are dated up to the given time period. For each lineage ClusTRace will print a separate cluster summary file with detailed information on the extracted clusters. These spreadsheet summaries include clustSeqN, clustSeqId and clustGR data sheets. The first and second data sheets report the number and ids of sequences in each cluster for each time period, while the third reports cluster size, median and maximal growth rates, and support value for the corresponding sub-phylogeny for each cluster. Separate clustGR data sheets are printed for each cluster cut-off threshold (by default twenty and thirty). Median and maximal growth rates are measured based on absolute increment in sequence number assigned to each cluster between consecutive time periods.
In the last step, ClusTRace extracts MSA(s) and runs variant calling for the extracted clusters. Nucleotide mutations are called from these against a reference genome with MsaToVcf [34]. Nucleotide variants are filtered to exclude 100 nucleotides (nt) from the start and the end of the genome (to avoid noise related to sequencing errors commonly seen in terminal regions), as well as any regions that have over 30 nt continuous stretches of below 75% coverage (these are also assumed to represent sequencing errors) using trimAl [28]. We also exclude variants with support below 50%. These filtering options are specified in the pipeline default options and can be modified. Amino acid (aa) variants are called with snpEff [35]. Finally, aa variants in all clusters are parsed and added to the cluster spreadsheet summaries as clustMutations and clustMutationTable data sheets. The clustMutations sheet reports nt and aa mutations for each cluster, reference aa mutations and non-reference aa mutations. Reporting reference and non-reference mutations requires supplying reference mutations in a separate file. For genes of interest non-reference mutations can be reported separately (current version reports mutations for the S-gene). The clustMutationTable sheet reports aa mutations for the fastest growing clusters in a binary matrix. The top row lists aa mutations in genomic order with non-reference mutations highlighted in bold.
ClusTRace also supports extracting nt and aa, reference aa and non-reference aa mutations for lineage MSA(s) or for any other collection of MSA(s). Lineage mutations are reported with spreadsheet summary tables similar to the cluster mutation summaries.
ClusTRace also offers an interface to g3viz R library [36]. Using this interface in R, the user can generate interactive mutation plots for both cluster and lineage vcf-files. These interactive plots can be saved in the form of simple html files to complement spreadsheet reports.

Results
To illustrate the intended use of ClusTRace we analyzed a dataset of SARS-CoV-2 full genome sequences from patient samples collected in Finland from January to June 2021. We started by running ClusTRace Pangolin mapping to obtain 5379 sequences assigned to Alpha and 1051 sequences assigned to Beta variants of concern (VOC) (GISAID accessions are available in Additional file 1: Table S1). We then run ClusTRace multifasta construction, outlier filtering, alignment, phylogeny with ultrafast bootstrapping (-ufboot option), default clustering and variant calling for these two lineages. As our outgroup sequences we used EPI_ISL_601443 for the Alpha variant and EPI_ISL_660190 for the Beta variant. All files output by ClusTRace for this analysis are available in Additional file 2.
To get a quick summary on the lineage mutations, we start with g3viz visualisation (Fig. 2, interactive version available in Additional file 2). For Alpha we see that most high frequency aa mutations follow mutations that have been reported as characteristic for this lineage [37] (Fig. 2A). These include T1001I, A1708D, I2230T, 3675_3677del and P4715L in ORF1ab, 69_70del, N501Y, A570D, D614G, P681H, T716I, S982A and D1118H in S, D3L and S235F in N. For Alpha, there are just five aa variants specific for Finnish data with frequency 10% or higher: K5784R and E6272G in ORF1ab, N119H in ORF3a and G96S and RG203KP in N. Plotting all mutations found in at least ten sequences in Alpha (5379 sequences) and Beta (1051 sequences). Mutations that have been reported as characteristic for a given lineage [37,38] are plotted in purple, all other mutations are plotted in green. Numbers in cirles indicate the number of sequences with the given mutation. Graphics were created with the ClusTRace interface to g3viz [36] For Beta, approximately half of mutations with frequency 10% or higher were not covered by mutations that have been reported as characteristic for this lineage [38] (Fig. 2B). Mutations matching characteristic mutations for Beta were: T265I, K1655N, K3353R and P4715L in ORF1ab, D80A, D614G and A701V in S, Q57H and S171L in ORF3a, P71L in E, T205I in N, while the non-characteristic aa mutations with at least 10% frequency were: T3058I, A3209V, A3235S, D4459A, T5912I and A6976V in ORF1ab, T19I and I896L in S, M24V, I26V and I27V in ORF7b, K44R and I121L in ORF8. Note that Beta has non-characteristic mutations in Spike protein, which may potentially affect their receptor binding: T19I in 789 (75%) and I896L in 175 (16.7%) sequences.
Cluster analysis with TreeCluster [32] yielded 108 clusters for Alpha and nineteen clusters for Beta (Figs. 3 and 4, full consensus trees with clusters highlighted are available in files B.1.1.7.con.tree.mr = 30.nex and B.1.351.con.tree.mr = 30.nex in Additional file 2). We used the MaxClade method with a cut-off set to 0.001. Here we take a closer look at the ten clusters for Alpha and Beta that had the highest per month growth rate peaks over the analysed time period.
We start by discussing Alpha clusters. The ten fastest growing clusters covered 3,146 (58.5%) of all Alpha sequences. Cluster size varied in these ten clusters between 100 (1.9%) and 479 (8.9%) sequences (Fig. 5). Maximal growth rates ranged between 74 and 310 sequences per month and peak growth was during February and March. Number of non-characteristic aa mutations introduced in these clusters ranged from one to six. Solitary non-characteristic mutations in S-gene were found in clusters 56 (D80Y), 38 (D287G) and 22 (A701V) ( Table 1).

Fig. 3
Consensus tree for Finnish Alpha dataset with clusters collapsed. Bar plots on the right indicate the number of sequences in each cluster. For clarity, clusters with less than ten sequences and singletons were removed. Inner nodes with no large cluster descendants are plotted in grey The ten fastest growing clusters covered 979 (94.5%) of Beta sequences. Cluster size was between fourteen (1.3%) and 259 (24.6%) sequences (Fig. 6). Maximal growth rates ranged between 11 and 148 sequences per month and maximal growth was during February (clusters 3 and 8), March (clusters 1, 4, 7, 10, 17, 18 and 19) and April (cluster 9). Number of non-characteristic aa mutations introduced in these clusters ranged from three to eight. Several clusters had non-characteristic mutations in S-gene: L18F (cluster 1), T19I (clusters 8-10, 17 and 19) and I896L (cluster 9) ( Table 2).   23:196 Here, aa mutations with frequencies exceeding 50% are listed in genomic order *The first row depicts mutations characteristic for B.1.1.7 according to the lineage report [37]

Benchmarking time and memory efficiency
We benchmarked ClusTRace performance on two datasets with default settings on a Red Hat Enterprise Linux Server 7.9 on a single node with 32 × 2.1 GHz cores. The first dataset included 6,430 SARS-CoV-2 genomic sequences from patient samples collected in Finland from January to June 2021 (GISAID accession ids are available in Additional file 1: Table S1). This run completed in 48 h and 6 min and consumed 83.26 GB of memory. The second dataset included 3,568 genomic sequences for Delta variant sequenced from Finnish patient samples later the same year (GISAID accession ids are available in Additional file 3: Table S2). This run completed in 14 h and 16 min and consumed 75.44 GB of memory. Most time was spent within IQ-Tree calls. We see that execution time does seem to scale nonlinearly with dataset size but is kept within acceptable limits for moderately large datasets. The required memory usage for these datasets was well below available limits.

Discussion
The years 2020 and 2021 could arguably be referred to as a turning point in the history of global health. The COVID-19 pandemic has demonstrated that emerging pathogens can cause havoc in our globalised world. On the other hand, the pandemic has also accelerated the development of better sequencing technologies, bioinformatic tools, diagnostic tests, vaccines and many other fields. The ongoing pandemic has emphasised the need for fast, scalable and, ideally pipelined, analysis of viral genomic sequences. For health authorities, it is important to be able to streamline processing large amounts of genomic sequence data into various summaries and reports that can help to make rational decisions concerning e.g. restrictions, non-pharmaceutical interventions and border control measures to minimize further spread of SARS-CoV-2. Researchers also struggle with the continuous inflow of SARS-CoV-2  23:196 Annotation as in Table 1 Table 2 (continued) sequences that need to be organized into lineages, alignments and phylogenetic trees in order to make sense of the constantly evolving pandemic.
Here, we have presented ClusTRace, a novel bioinformatic pipeline for fast and scalable analysis of large collections of SARS-CoV-2 sequences. ClusTRace supports many types of relevant analyses. These include assigning sequences to lineages, collecting sequences by lineage, filtering outliers, creating multiple sequence alignments, creating phylogenetic trees, extracting phylogeny-based sequence clusters, estimating cluster growth rates, calling nt and aa variants for both lineages and clusters, as well generating a number of table-based and interactive reports. Although most of these steps can be performed separately with designated bioinformatic tools, pipelining with a high-level interface helps to cut down on the learning and operating costs of complex bioinformatic analysis. Several authors have commented on the developer-user gap between bioinformatics and other fields in biology and biomedical research [10]. In this context, high-level pipelines that are tailored to the need of virus research are an important way to bridge this gap.
Popular pipelines for tracking viral outbreak phylodynamics include Augur, Auspice, Nextstrain, Nextclade and Pangolin [20][21][22]39]. Here, we reflect on key similarities and differences of ClusTRace to these toolkits. Pangolin and Nextclade are primarily concerned with classifying viral genomes into lineages or clades, while ClusTRace is designed to track mutations within lineages. Nextclade also offers mutation calling for large clades, which is similar to ClusTRace mutation calling for lineages. Nextstrain is an integration of several toolkits, including Augur for analysing sequence and phylogeographical data, and Auspice for visualising results. Like ClusTRace, Augur offers functionalities for filtering, aligning, phylogenetic reconstruction, re-rooting and refinement of the obtained phylogenies, and offers functionalities to estimate mutation frequencies.
Unlike ClusTRace, Augur also infers sequences and ancestral traits for the ancestral tree nodes. Auspice is designed to visualise phylogenetic and phylogeographic data output by Augur in an interactive webpage format. In ClusTRace, we provide different visualizations, namely spreadsheet summaries and interactive g3viz plots for high growthrate and/or mutation-rate clades. Unlike Nextstrain/Auspice visualizations, ClusTRace focuses directly on parts of the phylogeny that are picked out by the unsupervised cluster analysis and provides no details on the likely origin of the mutations in the tree. However, this approach has its advantages, such as simplicity and speed; unlike Nextstrain/Augur, ClusTRace has no need for down sampling the sequence sets. ClusTRace analysis is also largely unsupervised, i.e. clades are selected and examined for mutations and growth-peaks automatically, in effect filtering clades with alarming features that can then be checked manually more in detail.
In this work, we illustrated the intended scenario for ClusTRace usage on Finnish Alpha and Beta variants of concern. Presented approach can be described as an unsupervised phylogeny-based cluster analysis and variant calling. ClusTRace uses automated unsupervised clustering coupled with cluster growth rate analysis and variant calling to scan through the phylogeny. Clusters that display elevated growth rates, elevated non-reference mutation content or mutations in genomic regions that are of accentuated concern, such as the S-gene, can then be flagged for downstream analysis. In this paper we focus on describing the method and do not attempt to link identified cluster to epidemiologic seeding events. However, in our other work on monitoring SARS-CoV-2 spread in Finland we have appleid identical clustering with some success. For example, in [33] we monitored clusters for Alpha and Beta lineages and in that work clustering suggested that these lineages have spread to Finland via multiple seeding events. In our analysis of Finnish Omicron sequences we were able to identify a single large cluster that most likely corresponded to a super-spreading event (n = 236, which is 27.1% of all Finnish cases) as well as numerous smaller clusters that indicate multiple seeding points [40].
The current SARS-CoV-2 pandemic might endure to the foreseeable future, and new viral variants will likely continue to emerge. Therefore, the global response must continue to adapt and improve to mitigate the costs of the pandemic. The progress made since the start of the pandemic in early 2020 with the global implementation of full genome sequencing can be consolidated by developing efficient and scalable bioinformatic tools that are specifically tailored for genomic surveillance of viral pathogens. These tools must deliver fast, scalable and, ideally, unsupervised analysis and reporting on the pandemic events of concern. Our pipeline, ClusTRace, adds to the available toolbox the option for fast, scalable and unsupervised screening and reporting of the within or local lineage events of concern, such as elevated growth and mutation rates. Clus-TRace can also be adapted for the surveillance of viral pathogens other than the SARS-CoV-2, which may prove useful in future epidemic emergencies.