TAR-VIR: a pipeline for TARgeted VIRal strain reconstruction from metagenomic data

Background Strain-level RNA virus characterization is essential for developing prevention and treatment strategies. Viral metagenomic data, which can contain sequences of both known and novel viruses, provide new opportunities for characterizing RNA viruses. Although there are a number of pipelines for analyzing viruses in metagenomic data, they have different limitations. First, viruses that lack closely related reference genomes cannot be detected with high sensitivity. Second, strain-level analysis is usually missing. Results In this study, we developed a hybrid pipeline named TAR-VIR that reconstructs viral strains without relying on complete or high-quality reference genomes. It is optimized for identifying RNA viruses from metagenomic data by combining an effective read classification method and our in-house strain-level de novo assembly tool. TAR-VIR was tested on both simulated and real viral metagenomic data sets. The results demonstrated that TAR-VIR competes favorably with other tested tools. Conclusion TAR-VIR can be used standalone for viral strain reconstruction from metagenomic data. Or, its read recruiting stage can be used with other de novo assembly tools for superior viral functional and taxonomic analyses. The source code and the documentation of TAR-VIR are available at https://github.com/chjiao/TAR-VIR. Electronic supplementary material The online version of this article (10.1186/s12859-019-2878-2) contains supplementary material, which is available to authorized users.


Sizes of common regions between human viruses and other microbial species
To evaluate the similarity between different microbes, we calculated the sizes of LCSs between human viruses and other microbial genomes. As bacteria infect humans as well as viruses, the sizes of LCSs between human viruses, human vs. non-human viruses, and human viruses vs. bacteria were calculated. The virus reference genomes were downloaded from NCBI Viruses (https://www.ncbi.nlm.nih.gov/genome/viruses/). To date (June 2018), there are in total 7,456 complete viral genomes, of which 481 have human as the natural host (denote as human viruses). The human bacterial reference genomes were downloaded from Human Microbiome Project (HMP) on NCBI. In total, 2,314 bacterial reference genomes were downloaded.
As there is a large number of microbial species available, we conducted LCS search for available microbial genomes by constructing generalized suffix array and the corresponding longest common prefix (LCP) array (Gusfield, 1997). First, we build a generalized SA (Rajasekaran and Nicolae, 2014). Then, the LCP array, which contains LCPs between each two adjacent suffixes, can be calculated in linear time (Kasai et al., 2001;Kärkkäinen and Sanders, 2003). By definition, the LCS for each two sequences is the maximum LCP between all pairs of suffixes from the two sequences. The following lemma is employed in order to avoid checking all the LCP values between two sequences. The results of the LCS histograms are shown in Figure S1(A-C). One may also examine whether the read recruitment process can incur contamination by using simulated or real sequencing data. However, the empirical studies using real data are limited to the viruses in the samples. Meanwhile, producing simulated sequencing data for all microbes is not practical. Using suffix-array based LCS computation allows us to obtain a more comprehensive view of the common regions between different microbes.
We also compared the sizes of the LCSs between different microbes with the ones within a viral population. As the characterized haplotypes for different RNA viruses are very limited, instead of computing the LCS using available data, we estimated the LCSs within a quasispecies using a probability model and dynamic programming (Chen et al., 2018). With the mutation rate of 3e-5 at each base during virus replication, the probability distribution of LCS length between two HIV strains that are n generations apart were calculated and the distribution of LCS probabilities is shown in Figure S1(D). and non-human viruses (B), and between human viruses and bacteria (C). The x-axis is the log10 of the LCS length. The y-axis is the number of pairs within the given range of LCS size. Only LCSs that are longer than 10 bp are presented. (D) Probability distribution for LCSs between two simulated HIV strains that are 50, 100, 200, and 500 generations apart. The x-axis is the length of LCS, with a range from 0 to 10,000 bp. The y-axis is the corresponding probabilities for those LCS sizes.
To gain guidance on appropriate overlap cutoffs for extension, the ROC curve for LCS thresholds is plotted in Figure S2. As there are 142,021,586 pairs of virus-vs-other sequences, near zero FPR (false positive rate) values are generated for many LCS cutoffs. Thus we also presented the actual number of virus-vs-other sequence pairs with LCS above a given cutoff in Figure S3. For each LCS cutoff, the corresponding TPR (true positive rate) and the number of false positive pairs are illustrated using two axes. To compute the TPR and FPR, we define a positive case as two sequences from the same quasispecies. The negative case refers to a pair of genomes from two different species. Thus, given an LCS cutoff, the FPR can be computed as the percentage of negative cases with LCS above the cutoff in all 142,021,586 examined pairs. The TRP measures the percentage of positive cases with LCS above the cutoff, which can be computed by sampling the distribution in Figure S1(D). In Figure S1(D), we have several distributions with different number of generations of replication. We generated 10 6 positive samples using the distribution with 100 generations. If one replication takes about 24 hours, this curve mimics the infection of about 3 months.
The AUC for the ROC curve is 0.999, which means sequences from different viral haplotypes in a quasispecies have different LCSs (larger) from sequences of different species. From Figure S3, when the overlap cutoff is between 100 bp to 500 bp, the FPR is close to 0 and TPR is close to 1. Users can choose overlap cutoff within this range but less than the read size. Taking into account the contamination from chimeric reads and the read size, an overlap threshold between 130 to 249 is advised. The actual cutoff still needs to be tested because different data sets have different coverage. For our experiments, the default cutoff is 150 for all MiSeq reads of 250 bp. 10 -9 10 -8 10 -7 10 -6 10 -5 10 -4 10 -3 10 -2 10 -1 10 0 False positive rate

Pseudocode for iterative search
Algorithm 1 Default mode: create BWT for all reads Input: seed read set R 0 , the input text T, the overlap threshold τ Output: Reads that are sequenced from the targeted viruses 1: output ← R 0 2: R ← R 0 3: Create BWT and RID for T : BW T (T ), RID(T ) 4: while R not empty do 5: Backward search on BW T (T ) to find all reads that overlap with reads in R 6: Save them to set R 7: output ← output ∪R 8: R ← R 9: return output

Read recruitment and assembly results for simulated SARS-Cov data
The assembly results on SARS-Cov aligned and recruited reads are shown in Table S1. bowtie2 -x Bat_coronavirus -f -a --score-min L,0,-0.6t -p 16 -S bat_corona_align.sam" sars_meta_whole. fa

Commands for running tools on SARS simulated data set
These are our recommended parameters for read mapping step.
(2) BWA  Table S1. Assembly results on SARS-CoV aligned and recruited metagenomic data. N50 is defined as the maximal length so that all contigs above this length contain at least 50% of all the contig bases. Genome coverage is the percentage of the five haplotypes' genomes being aligned by at least one contig. Mismatch rate is the percentage of mismatches between the aligned contigs and the references.