Testing the advantages and disadvantages of short- and long- read eukaryotic metagenomics using simulated reads

Background The first step in understanding ecological community diversity and dynamics is quantifying community membership. An increasingly common method for doing so is through metagenomics. Because of the rapidly increasing popularity of this approach, a large number of computational tools and pipelines are available for analysing metagenomic data. However, the majority of these tools have been designed and benchmarked using highly accurate short read data (i.e. Illumina), with few studies benchmarking classification accuracy for long error-prone reads (PacBio or Oxford Nanopore). In addition, few tools have been benchmarked for non-microbial communities. Results Here we compare simulated long reads from Oxford Nanopore and Pacific Biosciences (PacBio) with high accuracy Illumina read sets to systematically investigate the effects of sequence length and taxon type on classification accuracy for metagenomic data from both microbial and non-microbial communities. We show that very generally, classification accuracy is far lower for non-microbial communities, even at low taxonomic resolution (e.g. family rather than genus). We then show that for two popular taxonomic classifiers, long reads can significantly increase classification accuracy, and this is most pronounced for non-microbial communities. Conclusions This work provides insight on the expected accuracy for metagenomic analyses for different taxonomic groups, and establishes the point at which read length becomes more important than error rate for assigning the correct taxon.

throughput methods such as Illumina), and then classified using one of several available pipelines (e.g. QIIME, MEGAN, Mothur) [2][3][4]. Many of these pipelines have been designed around the analysis of bacterial datasets.
In contrast to metabarcoding, metagenomic approaches do not rely on the amplification of specific genomic sequences, which can introduce bias. Instead, they aim to quantify community composition based on the recovery and sequencing of all DNA from community samples. Metagenomic methods limit biases that can occur during the amplification steps of metabarcoding, and yield insight into the functional diversity present in ecosystems [5,6].
While metabarcoding approaches have been widely applied to both microbial and eukaryotic taxa, the vast majority of metagenomic studies have focused only on microbial communities. Unsurprisingly, the various advantages and disadvantages of using metagenomic analyses for microbial communities are well-documented [7][8][9]. There are likely several factors driving this microbe-centric application of metagenomics, including (1) the greater level of diversity of microbial taxa; (2) the considerable number of microbial taxa that are "unculturable," making it difficult to collect the requisite amount of DNA for genomic sequencing; (3) the availability of a multitude of nonmolecular methods for quantifying multicellular taxa; and (4) the relative paucity of genomic sequence for multicellular organisms in databases [10] (Supp. Figure S1). This latter factor is perhaps the single largest factor in driving the bias toward microbial metagenomics.
However, the amount and diversity of multicellular genomic sequence data is rapidly increasing. Although multicellular metabarcode databases are currently far more complete relative to genomic databases, this gap is closing quickly. For example, the Earth BioGenome project aims to sequence the genomes of upwards of one million eukaryotic species within the next decade [11]. Regardless of the success of this effort, there are a host of ongoing eukaryotic sequencing projects, including Bat 1 K [12], Bird 10 K (10,000 bird genomes [13]), G10K (10,000 vertebrate genomes [14]), and i5K (5000 arthropod genomes [15]), among others. This suggests that within the next 5 years, most multicellular organisms will have at least one member of their family present in genomic databases, with some groups of multicellular organisms being completely represented at the genus level.
This would increase the utility of metagenomics for assessing membership in plant and animal communities, especially for cases in which organisms are difficult to observe or degraded. This is frequently the case for diet studies [16], many invertebrate communities such as in treeholes [17] or algal holdfasts [18].

Analysis of short-read metagenomic data
Many metagenomic classification analyses rely on first pass classifiers to assign reads to one or more taxa, followed by second pass classifiers that can improve on the initial classification by taking into account the number and relationship of taxa identified in the first pass. This second step often relies on a lowest common ancestor algorithm [3,19,20], or by refining taxonomic representation by examining the results from the first pass classifier [21].
The most widely used first pass classifier is BLAST (basic local alignment search tool), and it is considered the gold standard [22]. However, BLAST is not computationally efficient enough to deal with tens or hundreds of millions of reads. Thus, algorithms for fast metagenomic classification have been the subject of intense research over the last few years, and include k-mer based approaches such as CLARK [23], Kraken and related tools (Kraken, Kraken2, and KrakenUniq) [19], Centrifuge [20], EnSVMB [24], and Kaiju [25], as well as reduced alphabet amino acid based approaches such as DIAMOND [26]. In almost all cases these have been designed and benchmarked using short read data [22].

Analysis of long-read data metagenomic data
The advent of "third generation" single molecule long read technologies (PacBio and Oxford Nanopore) has significant implications for metagenomic analyses, most notably for genome assembly [27,28]. These technologies allow read lengths of 10 kilobase pairs (Kbp) and beyond, in strong contrast with the approximately 300 base pairs (bp) limit of Illumina. However, both PacBio and Nanopore technologies have far higher error rates (88-94% accuracy for Nanopore [29] and 85-87% for PacBio [30]). The lower accuracy of Nanopore and PacBio (non-circular consensus) sequence reads may affect the success of current classification methods, and there are few algorithms designed to exploit long-read data.
As a first approach toward determining the use of long-read technologies for metagenomic applications, we would like to understand the relative advantages and disadvantages of using short accurate reads versus long error-prone reads. Recent work has shown that relatively high genus level classifications of approximately 93% have been achieved using Nanopore-based metagenomic analyses of a mock bacterial community [31]. Here we expand this analysis to allow direct comparison between short and long read approaches. In addition, we compare metagenomic classification success in microbial communities as compared to communities of multicellular organisms. We find that longer reads, despite their higher error rate, can considerably improve classification accuracy compared to shorter reads, and that this is especially true for specific taxa.

Results
We first looked only at short read lengths to quantify the effects of sequencing technology and classifier (BLAST or Kraken2) on recall at the level of genus. For both bacteria and fungi, we found that recall was at or above 99.9% for Illumina reads of any length (100 bp, 150 bp, or 300 bp), for both BLAST and Kraken2 (Fig. 1). Similarly, PacBio had very high recall for these groups at approximately 97% for 100 bp reads, and 99.98% for 300 bp reads. In strong contrast, for Nanopore data the recall was far lower; approximately 25% for 100 bp reads and increasing to 75% at 300 bp. In general, Kraken2 had slightly lower recall than BLAST.
However, for plants and animals, average recall was low regardless of sequencing technology. Average recall for Illumina reads peaked at approximately 55 and 75% for animals and plants, respectively (Fig. 1, black). Nanopore recall rates peaked at just over 20 and 35% for animals and plants, respectively, while for PacBio these values were 33 and 62% respectively. However, this was highly taxon-dependent, with some taxa consistently having recall near 100%, while others remained close to 0% regardless of sequencing technology or read length. Perhaps surprisingly, on average Kraken2 outperformed BLAST for Illumina reads for both plant and animal taxa. This difference may be due to the default BLAST parameters being sub-optimal for short reads.
We next quantified differences in classification success (the proportion of all classified reads that were correctly classified), again considering only short read lengths. For bacteria and fungi, all three sequencing methods exhibited high classification success, with the exception of Kraken2 classification of Nanopore reads (Fig. 2). For each sequencing method and classifier, classification success for plants and animals was low relative to bacteria and fungi. For both Illumina and Nanopore, BLAST resulted in approximately 87 and 97% of reads being correctly classified, for animals and plants respectively. While for PacBio these values were slightly lower at 82 and 95%. Kraken2 success was far lower, especially for Nanopore reads, peaking at 54% in animals (Fig. 2). PacBio reads exhibited slightly lower classification success for Kraken2 at 70 and 85% for animals and plants, Illumina reads were only slightly better for these taxa at 72 and 90%. Over this range of read lengths, we found only a weak relationship between read length and classification success, in contrast to the results for recall.
It is perhaps expected that highly accurate Illumina reads would result in more accurate taxonomic classification, followed by slightly lower accuracy PacBio circular consensus sequence (CCS) reads, and more error-prone Nanopore reads. However, it is possible to obtain PacBio and Nanopore reads far in excess of 300 bp (single Nanopore reads of up to 2 megabase pairs have been sequenced), so we next quantified recall and classification success for reads with lengths up to 4000 bp. Because such read lengths Fig. 1 Recall is consistently higher in bacteria and fungi than plants or animals for both short Illumina and Nanopore reads. Each panel shows recall for the different kingdoms. Median recall at the genus level across each panel is shown by blue (Nanopore) or red (PacBio), and black (Illumina). Dashed lines indicate recall rates for reads classified using Kraken2; solid for reads classified using BLAST. Illumina reads exhibit consistently higher recall, followed by PacBio; bacteria and fungi exhibit higher recall than plants or animals are not currently possible to obtain using Illumina technology, we did not measure recall and classification success for Illumina reads of similar lengths.
We observed a similar relationship between read length and recall for both BLAST and Kraken2. For bacteria and fungi, nanopore recall increased from~20% using 100 bp reads to almost 100% when using 1500 bp reads, while PacBio recall increased from 95% to~100% for these taxa. For animals and plants we observed similar trends, although at no point did recall approach 100%. However, long Nanopore reads surpassed the recall of even the longest Illumina reads (300 bp) classified with Kraken2, with crossover points at approximately 3000 bp for animals and 2500 bp for plants regardless of classifier (Fig. 3, black and blue solid lines). The crossover points between PacBio read lengths with Illumina 300 bp read lengths were approximately 700 bp (BLAST) and 900 bp (Kraken2) for animals, while for plants these values were 600 bp (BLAST) and 800 bp (Kraken2).
We also considered this metric at the level of family. In this case found that for animals, Nanopore reads surpassed Illumina reads only at lengths close to 4000 bp, reaching approximately 70% recall at this point (Supp. Figure S2). However, for plants Nanopore recall surpassed Illumina recall at 2500 bp, with 4000 bp reads yielding a recall of approximately 90%. When using PacBio reads at the family level, recall exceeded that of Illumina in all instances regardless of classifier (crossover points at 1500 bp for Kraken2 with animals, and 2500 bp for BLAST with animals). For plants the crossover points were 600 bp and 800 bp for animals and plants respectively. Classification success for short reads is weakly related to read length and strongly dependent on classification method. Each panel shows recall for the different kingdoms. Median classification success at the genus level across each panel is shown by blue (Nanopore) or red (PacBio), and black (Illumina). Dashed lines indicate recall rates for reads classified using Kraken2; solid for reads classified using BLAST. For bacteria and fungi, median classification rates of of all pairs except for Nanopore-Kraken2 are almost exactly 100% for all read lengths in Bacteria and Fungi, resulting in obscured lines at these higher levels We next examined classification success at longer read lengths. For BLAST we observed no relationship between classification success and read length for both PacBio and Nanopore (Fig. 4.). Bacteria and fungi both had consistently high classification success (median 100%), while animals and plants had lower classification success (median 82 and 96%, respectively). Interestingly, classification success in animals for PacBio reads was lower than that of Nanopore reads, at medians of 77 and 86% respectively. This is likely explained by Nanopore reads having more failed queries overall (lower recall). In all other taxa, classification success was virtually the same for both sequencing technologies.
In contrast to BLAST, for Kraken2 we observed a consistent increase in classification success as read length increased (Fig. 4). This was the case for both PacBio and Nanopore reads. However, for Nanopore reads, Kraken2 classification success never exceeded that which was observed for BLAST. This was not the case for PacBio reads, where Kraken2 classification success exceeded that of BLAST at 2600 bp (animals) and 2800 bp (plants).
Finally, we tested classification success at the level of Family. In this case, we observed that for BLAST, the classification success for plants was approximately 99% over all read lengths regardless of sequencing technology, while for Kraken2 only 4000 bp reads reached this level. For animals, BLAST classification success of nanopore reads was approximately 95% over all read lengths while for PacBio reads this value was~87%. For Kraken2, the highest classification success was 85% for nanopore reads and 91% for PacBio (Supp. Figure S3).

Discussion
Here we have compared the relative accuracy of taxon classification using simulated short accurate reads (Illumina) and long reads (Nanopore and PacBio) with known ground truth. We have used two simple metrics of success: recall (the ratio of correctly classified reads to all reads) and classification success (the ratio of correctly classified reads to all classified reads). We have tested taxon classification using a broad range of taxa, including bacteria, fungi, animals, and plants.
Recall for both BLAST and Kraken2 was improved by the use of long reads, especially in the case of animals and plants, for which recall improved almost three-fold as read length increased from 300 bp to 4000 bp. Generally, both Kraken2 and BLAST achieved similar levels of recall. The exception was for short reads for animals and plants, for which Kraken2 was more accurate than BLAST.
We found no relationship between classification success and read length for BLAST. This implies that when a read is classified as belonging to a taxa, the likelihood it was correctly classified remains relatively constant over different read lengths. However, the number of reads that are classified at all increases with read length (causing an increase in recall). The exception to this lack of relationship between classification success and read length was for Kraken2, for which the proportion of correctly classified reads increases with read length by more than 50% for both plants and animals. While relatively few studies have analysed the role of read length in classification succession, at shorter read lengths (such as those achieved through pyrosequencing) [32] or for metagenomic assembled fragments (those constructed through Illumina sequencing with an assembly step prior to classification) [33], similar results have been achieved. Our results are unique in that we show a similarly important role of read length, but for long reads without additional steps to improve read quality.
Our results also indicate that recall for long PacBio and Nanopore reads was equal to or higher than even the longest Illumina reads (300 bp). This was true regardless of kingdom, or classification method (Fig. 3). One implication of this is that a simple way to improve Nanopore classification accuracy is to impose minimum read lengths. This can be achieved by performing size selection during library preparation or during computational analyses. Additionally, these results indicate that recall for PacBio reads is consistently higher than Nanopore, and also rapidly surpasses that of Illumina. Like Nanopore reads, this was true regardless of kingdom, or classification methods.
At first glance, then, there appears to be a clear trade-off between short read Illumina and long read sequencing for metagenomic analyses. While both PacBio and Nanopore allow higher recall at long read lengths, this advantage is offset by the fact that Illumina generally provides more reads per run. This consideration is most significant for PacBio sequencing, which (while having greater accuracy than Nanopore due to the vast majority of reads being CCS), provides considerably less sequencing throughput (at most, approximately two million short reads short (e.g. < 4 Kbp) per 8 M SMRT cell). This contrasts with Nanopore PromethION throughput, which can yield well in excess of 20 million 4 Kbp reads per run. Even MinION throughput is easily in excess of 2 million reads per run for short 2Kbp reads.
The more critical question is whether the higher recall makes up for this deficit in read number. At most, recall for Nanopore improves recall 50% beyond 300 bp Illumina reads, while classification success is similar (using BLAST). Thus, if the read capacity of Illumina runs is 50% or more than Nanopore (or Pacbio), the number of correctly classified reads will be maximised using Illumina technology -on a per sequencing run basis. However, for many researchers the more relevant metric is cost per read, or in many cases, time, which may change this cost calculation. Given this, we find no clear advantage in using Illumina over Nanopore given the observed classification accuracy for long inaccurate Nanopore reads. However, the low number of PacBio reads, despite their high accuracy, may limit the effectiveness of this platform for metagenomic analyses.

Differences in accuracy between bacteria, fungi, animals, and plants
We find very large differences in classification accuracy (mostly in terms of recall) for bacteria and fungi versus plants and animals. The discrepancy between taxonomic groups likely arises from a variety of factors. Among these are the higher degree of divergence between bacterial species relative to animal and plant species, and the complexity of bacterial genomes compared to eukaryotic genomes. We discuss these factors below.
Bacterial taxa are often considered separate species once they have diverged by 6% ANI (Average Nucleotide Identity) on a genomic level [34,35]. The degree of nucleotide divergence between eukaryotic species is not standardised [36], and species are generally designated as such based on the biological species concept put forward by Mayr [37]. Additionally, divergence levels differ substantially between loci (as for bacteria). However, for some loci general ranges for eukaryotic species have emerged. For example, for mitochondrial COI, between-species divergence is usually greater than 3% [38,39]. These loci are among the fastest diverging loci in plant and animal genomes, and many other loci may differ by far less than 1% between species. Due to this low level of divergence, metagenomic classifiers may frequently classify animal and plant genera with lower accuracy than bacterial genera.
A second explanation for the increased classification success in bacteria and fungi is that these genomes contain fewer repetitive elements than animals or plants [40]. Although such repetitive regions are usually masked from classifiers (including BLAST and Kraken2), this masking may not be complete.
A third reason is that the genomic databases for plants and animals are far less complete than for bacteria and fungi. There is a large difference in the number of genomes and sequences available for different Kingdoms, with bacteria having significantly more species present than the next closest kingdom (See Supp. Figure S1). However, we expect this factor will be mitigated in the future as genomic databases continue to expand and computational search methods continue to improve.

Differences in accuracy between Kraken2 and BLAST
We observed similar levels of recall for BLAST and Kraken2 over most read lengths for Nanopore reads, while for PacBio reads Kraken2 appeared to perform better than BLAST in animals and plants. However, there were strong differences in classification success. For short reads, Kraken2 classification success was far lower than BLAST. As read lengths increased, Kraken2 classification success approached those of BLAST for Nanopore reads. Classification success for PacBio reads appeared to be lower than that of Nanopore reads when using BLAST, while when using Kraken2 the opposite was the case, and PacBio had consistently higher classification success than Nanopore. Part of this is likely due to longer reads allowing multiple k-mer matches, decreasing the probability of a false positive classification. One perhaps underappreciated advantage of Kra-ken2 over BLAST is that Kraken2 has reduced sensitivity to structural variation within reads. As Kraken2 allows multiple k-mers to match within a read, structural changes (e.g. inversions) are less likely to influence the outcome of Kraken2 matching. Such structural changes may influence BLAST due to the matching and extend algorithm. Thus for long reads, classifiers that are insensitive to synteny may be more successful, especially for taxa in which structural rearrangements are common.

Conclusions
Here we have shown both PacBio and Nanopore reads, despite being more error-prone than Illumina reads, are useful for metagenomic classification due to their increased length. For plant and animal communities, the classification accuracy of long Pacbio and Nanopore reads exceeds that of Illumina. We found that classification accuracy is more dependent on the set of taxa being considered than on the metagenomic classifier being used (Kraken2 or BLAST), and that this was true for both short accurate (Illumina and long (Nanopore and PacBio) sequence data). Together these data suggest that one consideration in selecting a metagenomic sequencing method (i.e. long or short read) is the taxonomic group of interest.

Genomic data
For each of four major taxonomic divisions (bacteria, fungi, animals, and plants), we downloaded 20 genomes from GenBank [41]. Within each of these divisions, we included genomes from a total of 22 classes, 46 orders, and 58 families ( Fig. 5; details in Supplementary Table S1). To select these genomes, we first constructed a filtered list of genomes that were represented at the chromosome level or greater in GenBank. Within divisions we then selected genomes randomly from this list (without replacement, such that no species was represented more than once).

Read simulation
We simulated Nanopore reads using NanoSim 2.0.0 [42] with the default error parameters for E. coli R9 1D data. This method uses a mixture model to produce simulated reads with indel and error rates similar to real datasets. The error model is applied equally to all parts of a read, and the read lengths are drawn from a distribution approximating real data. To create simulated read data of specific lengths, we truncated the simulated reads after the relevant number of basepairs using a custom bash script (i.e. to simulate 100 bp Nanopore reads, we truncated all reads in a simulated dataset to 100 bp (see example command below). We simulated PacBio reads using SimLORD. This method assumes that the number of passes along a read is chi-squared distributed and dependent on read length, as would occur in a standard SMRT run. Thus, shorter reads have more passes, with very short reads (e.g. < 1000 bp in length) having well over 10 passes on average; reads 2000 basepairs in length having a median of over four passes; and reads 3000 basepairs in length having a median of more than three passes. Thus, short reads are almost exclusively CCS ("HiFi") reads. For both Nanopore and PacBio, we simulated reads for lengths varying from 100 bp to 4000 bp at 100 bp intervals, simulating 1000 reads per interval for all taxa (a total of 40,000 reads for each taxon per read simulation program, and 3.2 million reads for all taxa and read lengths per read simulation program). (See example command below).
simlord −−read−reference Reference −n 1000 −f1 Length Output We simulated Illumina data using dwgsim 0.1.12 [43] with the following options: This implements errors to mirror those in Illumina data, with constant error rates of 1e-4 and no indels (which are extremely rare in Illumina data). We generated 1000 reads for each genome, at three read lengths: 100 bp, 150 bp, and 300 bp (a total of 240, 000 reads across all taxa and lengths), and used only single end reads for all analyses.

Sequence classification
We used BLAST 2.7.1 [44] and Kraken2 [19] for sequence classification. We created a local custom database consisting of the NCBI nt database (downloaded on Feb 8, 2019) and the genomes of the 80 taxa that we used to test classification success. We used the default alignment parameters for BLAST (word_size = 11, match/mismatches scores = 2,-3, gap costsexistence = 5, extension =2, filter = low complexity regions), except for implementing a maximum e-value of 0.1. We used the match with the highest bit score for all downstream analyses. For Kraken2 analyses we used the default parameters (in which the k-mer length is 35 bp and default minimiser length is 31 bp). For Kraken2 we used the taxon assigned by the lowest common ancestor (LCA) algorithm employed in Kraken2 for downstream analyses.

Accuracy metrics
To assess the effects of read length on classification accuracy we focus our analysis only on how often a read is assigned to the correct taxon. For our simulated reads there are three possible outcomes when querying a database (Table 1). We expect that taxa that are well represented in the database, and which have few closely related taxa, will have high rates of true matches. Taxa with many close relatives in the database will have many false matches. Taxa that are poorly represented in the database will have high rates of failed queries. Both of these latter results are in a class usually referred to as false negatives: we falsely infer taxon A is absent. However, they largely arise from different mechanisms. Importantly, as genomic databases become more complete, we expect the fraction of failed queries will decrease. At the same time we expect that the fraction of false matches may increase, as more and more closely related taxa become present in the database. The exact nature of this tradeoff is not well explored. Novel statistical approaches, such as Bayesian re-estimation of species frequencies, may mitigate the problem [21]; however, improved methods are required to address this problem [45].
There are other aspects of classification success that we do not focus on here. The first of these is the notion of a true negative: a sequence that is known to not arise from any taxa, should not return a match to any taxa. This is not a biologically realistic situation (all sequences arise from a taxon), although this aspect is useful when trying to assess the performance of different classifiers [46] and presenting the full truth table. The second aspect we do not consider here are false positives: if a read query matches taxon A, but does not arise from taxon A. We would thus falsely interpret taxon A as being present in a community. This metric is intrinsic to the composition of the community rather than just each taxon and the database. For example, if taxon A dominates the community, then it cannot have a high fraction of false positives relative to true positives simply because the vast majority of read queries from the community will be from taxon A and thus true positives. Conversely if taxon B is extremely rare, there will be a large number of false positives relative to true positives, as very few read queries will be from taxon B, resulting in a very small fraction of true positives.
Thus, we use a simplified set of metrics (see Table 1) that are not intrinsically related to community composition: true matches, false matches, and failed queries. We used M fail our simulated genomic sequence reads from 80 taxa to quantify these three outcomes at both the genus and family level. To assign genus and family from species, we used the NCBI taxonomy database [47] (which is used by BLAST as the default taxon classifier). We calculate two ratios from the three metrics in Table 1. The first is the fraction of true positives classified correctly (i.e. recall): The second is the ratio of true matches to false matches. This simply excludes failed queries from the equation. We term this second metric classification success.
The critical difference between these metrics is that taxa which are poorly represented in the database may nevertheless have high rates of classification success, although recall will necessarily be low. However, as the fraction of failed queries approaches zero (which we expect as genomic databases grow), these two metrics become equivalent.
Additional file 1: Figure S1. The number of species present in the NCBI RefSeq database has grown roughly exponentially over time. Figure S2. Recall at the family level. Figure S3. Classification success at the family level. Table S1. List of species include in the in silico mock community, with associated Kingdom and NCBI.