Overview of pipeline
The pipeline for taxonomic assignment and ranking (Fig. 2) was implemented in several parts: (1) a multi-task CNN (MT-CNN) model for taxonomic assignment and estimation of the genomic region of 50-mer sequences, (2) a sliding window and ensembling algorithm for handling variable length sequences ≥ 50 bp for the MT-CNN model, and (3) a naïve Bayes network for taxonomic ranking based on number of assigned reads and genomic coverage derived from the MT-CNN model.
Development of the multi-task CNN (MT-CNN) model
A multi-task CNN (MT-CNN) model was developed to perform the taxonomic assignment and the estimation of the genomic region of 50-mer sequences (Fig. 3). For the taxonomic assignment task, the model was based on VGGNet (VGG-11) [11], initially developed for image classification but modified to process input DNA sequences that were one-hot encoded as A, T, C, G. Modifications included varying the number of convolutional layers, the number of neurons in each layer, altering the frame size of convolutional kernels, and converting the convolutional and max-pooling layers from 2 to 1D.
For the second task of estimating the genomic position of a sequence, the model was extended to predict the region of the sequence in a genome (out of 10 equipartitioned regions) that was taxonomically assigned by the first task. To implement this, the feature map from the last dense layer of the first taxonomic assignment task was concatenated with the predicted species labels used as the input for the second task, analogous to transfer learning [12]. For this task, a fully connected model was used in the classification of genomic regions.
In the final model, several batch normalization layers were included for improving the classification accuracy and accelerating the training process [13]. The activation function in each layer was ReLU [14], except for the two dense layers for prediction using Softmax [15].
For training, the cross-entropy loss function for both classification tasks was used and the sum of their losses was defined as the total loss of the model. Training was performed using the Adam optimizer [16] and hyper-parameter tuning was done using the Talos framework (https://github.com/autonomio/talos). The final MT-CNN model was implemented using Keras 2.2.4 (https://keras.io/) in Python 3.7 (https://www.python.org).
Sliding window strategy for predictions of sequences longer that 50 bases
To extend the model’s ability to perform predictions on longer sequences (typically 150–300 for current short-read sequences), a sliding-window strategy was implemented (Fig. 4).
Briefly, input reads with length N (N ≥ 50) were split into (N – 49) overlapping 50-mers with a 1-bp sliding window. Each 50-mer sequence were used for prediction with the MT-CNN model and the taxonomic and genomic region predictions for each window were ensembled to select the most likely overall taxonomic and genomic region label using a hard-voting strategy: predictions of ≥ 0.9 were allocated a vote, and the taxonomic and genomic region label with the most votes were assigned to the overall input read.
Taxonomic ranking using a naive Bayes approach
A naïve Bayes approach was used for taxonomic ranking of candidate viruses from the assigned reads and genomic coverage of reads derived from the MT-CNN model. From the MT-CNN outputs, the number of reads per genomic region (1–10) assigned to the same viral species were tabulated prior to estimation of the genomic coverage for each species. To account for errors in genome region assignments, only regions that met a pre-defined threshold were counted, which was set at 0.1 of the average number of expected hits per genomic region, assuming a uniform distribution:
$$ {\text{Threshold = (total}}\;{\text{number}}\;{\text{of}}\;{\text{ hits}}\;{\text{for}}\;{\text{all}}\;{\text{species)}}\; \times \;{\text{0}}{\text{.1/(total}}\;{\text{ number}}\;{\text{ of}}\;{\text{ covered}}\;{\text{ regions)}} $$
(1)
For the implementation of the naïve Bayes network, the number of assigned reads and the number of genomic regions per species were discretized (Additional file 1: table S1) for inputs as parent nodes. Joint probabilities in conditional probability tables (CPTs) of the naive Bayesian network for calculating the likelihood estimate of a taxonomic species label were generated [17] based on relative weights of influence by the number of assigned reads and coverage of assigned reads, set at 7:3. The network was implemented using a Python library pgmpy 0.1.10 [18] (https://pgmpy.org/).
Generation of training/testing datasets
The datasets for training/testing were derived from the International Committee on Taxonomy of Viruses (ICTV) (https://talk.ictvonline.org/) database that provides high-quality genomes of viruses with the standard taxonomy labels. To obtain only human viruses from ICTV, 434 human viral genomes from 187 species were extracted from ICTV by cross referencing them to ViralZone [19], a database that maintains a list of human viruses and their accession IDs.
For preprocessing of sequences prior to training the MT-CNN model, the viral genomes were split into ~ 10 million unique 50-mer sequences and then augmented using Mason2 (https://github.com/seqan/seqan/tree/master/apps/mason2) to generate additional datasets with different mismatch, insertion, and deletion rates (Additional file 1: table S2) in order to improve the generalizability of predictions. In total, the merged datasets comprised ~ 53 million unique k-mers. These k-mers were then split as training/validation/testing datasets at a ratio of 51:1:1 and the bulk of sequences (~ 51 million unique 50-mers) were used for training the MT-CNN model.
For baseline evaluation of the MT-CNN model, the 50-mer testing dataset of 1,000,000 sequences was used, while a 150-mer dataset of 100,000 sequences was generated from the 434 human viral genomes of ICTV for the evaluation of longer sequences. For all datasets, the species labels and the genomic region labels were used for evaluating the performance of the model on taxonomic assignment and prediction of genomic regions respectively.
Simulated datasets of divergent genomes (HIV-1)
For evaluation of performance on divergent sequences, divergent viral genomes of HIV-1 were generated using SANTA-SIM [20], which simulates viral sequence evolution dynamics under selection and recombination. Using the default settings (https://github.com/santa-dev/santa-sim/blob/master/examples/HIV-1.xml), three different HIV-1 datasets were simulated with different number of generations to reflect divergence as the virus evolves. Each dataset contained 1,000 simulated HIV-1 genomes after 100, 1000 and 10,000 generations, representing ~ 2–3 months, 1–3 years and 10–30 years of infection respectively (assuming a replication cycle of 1–2 days). Each HIV-1 genome dataset was then used to generate 100,000 reads of 150 bp in length using Mason2.
Real-world datasets (SARS-CoV-2)
Four recently published clinical RNAseq datasets of SARS-CoV-2 viruses from the University of Washington were used for evaluation of performance on real-world data [21] on a species that was not present in the training datasets. The RNAseq datasets had a maximum sequence length of 185 bp and they were filtered for reads shorter than 50 bp (0.55%) prior to analysis.
Benchmarking tools and metrics
For testing the performance of our MT-CNN model on the taxonomic assignment task, several state-of-the-art tools were chosen for benchmarking: Kraken 2, Centrifuge and Bowtie 2 [22]. These tools were installed with their default databases (MiniKraken2_v2_8GB database for Kraken 2 https://ccb.jhu.edu/software/kraken2/index.shtml?t=downloads, and compressed human, bacteria, archaea and viruses database for Centrifuge https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed%2Bh%2Bv.tar.gz). For Bowtie 2, human viral genomes from ICTV (n = 434) were used to build the reference database.
To evaluate the performance of taxonomic assignments, the following metrics were adopted for the different types of datasets: (1) for the simulated NGS reads from standard genomes of ICTV, Cohen’s kappa coefficient which measures inter-rater reliability for categorical items, and F1 score which is the harmonic mean of the precision and recall, were used as the metrics for multi-class prediction; (2) for the simulated divergent reads from HIV-1 datasets, prediction accuracy which represents the percentage of correctly identified reads in all reads was used; (3) for the real-world RNA-Seq datasets, the sensitivity of classification of SARS (the closest relation with SARS-CoV-2) was used.
To evaluate the performance of genomic position assignments, Cohen’s kappa coefficient and F1 score were adopted as the metrics for multi-class prediction of the genomic position of taxonomically labelled reads for the MT-CNN model and Bowtie2, as both Kraken and Centrifuge did not provide any positional assignments.
All performance evaluations were implemented and calculated using the scikit-learn library version 0.21.2 (https://scikit-learn.org/stable/).