To highlight the benefit of applying Tamock for study design and method selection, we showcase the use of benchmark samples created by Tamock (v1.3.0). For genome-centric analysis of metagenomes, assembly and binning methods are frequently applied to extract metagenomic bins, resulting in ever-increasing numbers of novel genomes reconstructed from a metagenomic sample. We use Tamock benchmark data to assess the quality of results from an assembly and binning experiment, analysing contamination and presence of tRNA and rRNA genes as described in the MIMAG standards for high (HQ), medium (MQ) and low quality (LQ) metagenome-assembled genome bins [15], while following the workflow from Pasolli et al. [3].
We created benchmark samples for 18 urban metagenome samples from the MetaSUB Consortium [16,17,18] as well as 8 human microbiome samples from the integrative Human Microbiome Project (iHMP) [12]. All original samples and their corresponding simulated samples by Tamock were assembled using MetaSPAdes v3.13.1 [19] with default parameters. Relative changes in assembly performance from original to their corresponding Tamock benchmark samples (“simulated” samples) are shown in Fig. 2.
Assembled sequences were subsequently binned by MetaBAT v2.15 [20]. Following the MIMAG standards [15], we used CheckM v1.1.2 [21] for completeness and contamination values, barrnap [22] to predict rRNA genes and tRNA-Scan-SE v1.3.1 [23] for tRNA genes (Additional file 1: Methods).
High-quality (HQ) genome bins are required to fulfil the following requirements: > 90% completeness and < 5% contamination as well as the presence of 5S, 16S and 23S rRNA genes together with at least 18 different rRNA genes. Medium-quality bins (MQ) are only required to fulfil ≥ 50% completeness and < 10% contamination whereas all remaining bins are classified as low quality (LQ) following MIMAG standards.
No genome bins with high-quality (HQ) could be assembled and binned from both original as well as corresponding simulated samples. Resulting genome bins did not reach HQ due to the lack of all three rRNA genes present in the genome bins as required by MIMAG standards. One single bin did contain a copy of a 5S, 16S and 23S rRNA gene respectively, however due to insufficient completeness this genome bin only met the requirements for MQ. Genome bins fulfilling all criteria for HQ except the presence of rRNA genes were labelled as MQ* (Fig. 3).
The number of genome bins dropped from a total of 337 to 258 from original to simulated samples.
Due to incomplete reference sequence databases, especially for strains present in a sample but not in reference databases, alterations from an original sample towards its corresponding simulated sample are to be expected and unavoidable. Depending on the composition of a sample, mainly two effects can be observed.
-
(I)
If a sample contains multiple different strains for which only one reference genome is available, classified sequences will be replaced only from a single genome. This is most prominent for samples from MetaSUB Boston. All samples from this group represent urban metagenomes with high fractions of sequences classified to Homo sapiens (27 to 81%, Additional file 2: Table S1). As urban metagenomes harbour human sequences from multiple individuals, classification to and resampling from a single reference genome reduces assembly complexity and leads to a reduced total length and number of contigs due to reduced strain variance as well as a higher maximum length of contigs (Fig. 2).
-
(II)
However, if a strain is present in a sample with no direct, but a closely related reference genome in a database, only a subset of sequences will be classified to the respective reference genome with subsequent sampling, resulting in two strains (original and reference) present in the simulated where only one strain has been present in the original sample. Faecal samples from iHMP as well as MetaSUB samples from New York City (NYC) have consistently high fractions of sequences classified as bacterial (34 to 92%) as well as in average 89% sequences assigned to a reference genome (Additional file 2: Table S1). Multiple strains will not be present in a database but will be classified to closely related reference genomes and therefore increase strain diversity and assembly complexity in the simulated sample, leading to reduced average length of contigs and N50 values as well as an increase in the number of contigs (Fig. 2).
Ultimately, the aim of Tamock is to create benchmark data as similar to the original sample as possible by mirroring sample composition, sequence errors, depth and length while replacing the classified sequence fraction with sampled sequences from reference genomes to provide a ground truth for experiments. Despite the inherent limitations as described above for sampling genomes from known reference genomes for unknown metagenomes, most assembly parameters were stable over various metagenomic habitats, reflected in all variables near the value 1 for the comparison of assembly statistics from original and corresponding simulated samples in Fig. 2.
Results from simulated samples with available ground-truth for all sampled reference genomes can subsequently be utilized to assess methods and workflows of interest for an individual study. The abundance, genome coverage, and source for all resampled genomes in a simulated sample provide the base to assess a study design under realistic conditions based on the actual data of interest.
Applied to the presented show case, we can determine the genome coverage required to extract genome bins without any misassemblies compared to their reference genome. We applied MetaQUAST v5.0.2 [24] to assess any potential misassemblies in extracted genome bins. Tables with all sampled genomes and the corresponding abundance are available in the output of Tamock. We were able to successfully extract genome bins without any misassemblies, such as from the iHMP2 stool sample J00827 for Bifidobacterium adolescentis ATCC.
The extracted genome bin in the simulated and original data of sample J00827 (Additional file 3: Table S2) only failed HQ standards due to the lack of rRNA genes. In the corresponding simulated sample for J00827, sequences from B. adolescentis ATCC were sampled at 9 × coverage for the reference genome and the extracted genome bin covered 89.73% of the reference genome with 1.39% of all reads mapping to the respective genome bin (Additional file 4: Figs. S1–S4).
Comparison to benchmark data without unknown sequence fraction
The creation of benchmark data for metagenomic samples from complex environments can lead to high numbers of reference genomes required. Multiple hundred to thousands of reference genomes quickly lead to extensive computational resources required, either by long runtimes or high memory usage. The most recent developed CAMISIM, which has been used to create benchmark data for the CAMI Challenge [1], requires several hundred GB of RAM for higher numbers of genomes according to the documentation of CAMISIM.
In contrast to CAMISIM, relative abundances in Tamock are drawn directly from the unknown sample by classifying all sequences with Centrifuge. The usage of Centrifuge and subsequent processing also enables Tamock to process multiple thousands of reference genomes on a standard desktop due to the low memory usage of Centrifuge for indexes i.e. of RefSeq, with runtimes of a few hours for a sample with about 20 Mio sequences and about 4.000 reference genomes excluding the download time of reference genomes (only required once).
To enable a comparison with benchmark data created by current tools such as CAMISIM without inclusion of the unknown sequence fraction, we created benchmark data only from the classified sequence fraction with reference genomes available. Three versions of benchmark data were created. One set of benchmark data consisted only of sequences which are simulated by Tamock and used to replace all corresponding sequences in the original sample (“simonly” benchmark, Additional file 2: Table S1). The subset of original sequences which Tamock replaces with “simonly” is used to create the benchmark data set “orig-repl” for direct comparison with “simonly”. For the third set of benchmark data, the number of sequences which are simulated by Tamock are scaled to the sequence depth of the original sample, i.e. for a sample with 1500 reads of which 1000 reads are classified and assigned to a genome while the remaining 500 reads are unclassified, the second data set multiplied all counts by 1.5 × to create a sample with 1500 reads using relative abundances from the taxonomic profile of classified sequences (“simscaled” benchmark, Additional file 2: Table S1). The functionality to scale the number of classified sequences to a set sequence depth while maintaining relative abundances is available in Tamock via the option “-rn-sim”. This alters sequence depth and composition of the simulated sample compared to the original sample but enables users to create benchmark data for further experiments with different characteristics (see Additional file 1: Methods).
CAMISIM, representing the most current benchmark creation tool, allows taxonomic profiles to be used for benchmark data creation and utilizes ART as well to simulate Illumina sequences. A benchmark data set from Tamock consisting only of the classified sequence fraction with known abundances and a reference genome therefore is the equivalent of a benchmark dataset created by CAMISIM or other tools creating benchmark data from a taxonomic profile with set abundances which is provided by Tamock. No tool to our knowledge deduces the abundance profile directly from a metagenomic sample, simulates and replaces the known sequence fraction while maintaining the characteristics of the original sample by learning parameters such as sequence errors, length and depth for simulation from the original sample and keeping the unknown sequence fraction.
We assembled and binned the “orig-repl”, “simonly” and “simscaled” benchmark in the same manner as described above for original and simulated samples by Tamock. Since the unknown sequence fraction is not part of the benchmark, we observe a loss of total assembly length and number of contigs for all samples with substantial fractions of unclassified data as expected. This is most prominent for samples from MetaSUB Sacramento and iHMP Feces with only 35.4 and 43.4% of all sequences classified on average (Additional file 4: Fig. S5) and can be explained with the lower sequence depth using only the known sequence fraction. However, even in “simscaled” samples with the same sequence depth but reduced sample complexity, we observe a drop in total sequence length for MetaSUB Boston and NYC whereas Sacramento shows a strong increase (Additional file 4: Fig. S6). Comparing only the two benchmark data sets “orig-repl” and “simonly”, the sequence fractions which Tamock effectively exchanges to create a simulated Tamock benchmark sample, assembly statistics such as maximum length, total length and number of contigs increased from “orig-repl” to “simonly”. Since classification of sequences is incomplete, the loss of the unknown sequence fraction will lead to the loss of fractions of sequences for assembly, whereas sampling from a reference genomes for the exact same sequence depth as done for all counts in “orig-repl” to create “simonly” will improve assembly performance. As described above, this will lead to a slight reduction in sample complexity, particularly for assembly. This is especially true for samples with large fractions of eukaryotic sequences such as from iHMP Nasal Cavity, where e.g. for sequences classified to human only one reference genome is sampled (Additional file 4: Fig. S7).
Samples with fairly high fractions of classified data such as from MetaSUB NYC with 78.2% of sequences classified in average show a drop in total assembly length for purely simulated benchmark data, whereas the simulated sample by Tamock showed only a slight increase in total length compared to the assembly of original samples (Fig. 2). The loss of total assembly length indicates that substantial parts of the unknown sequence fraction did belong to partially classified sequences, leading to the observed loss of sequence coverage during assembly, particularly for MetaSUB NYC samples.
Since a substantial part of the original sample is missing, there is a considerable loss in the number of resulting genome bins in purely simulated benchmarks. The complete “simonly” dataset is sampled from reference genomes, as such sample diversity is lowered, reducing overall assembly and binning difficulty, shown by the less severe loss of near high-quality genome-drafts (MQ*), however the total number of resulting genome bins dropped strongly from 337 from the original samples to 101 for “simonly” (Fig. 4). This is very close to the number of resulting genome bins from “orig-repl” with 96 bins, supporting the exchangeability of these two sequence fractions as performed by Tamock to create benchmark data. A slight increase of both MQ* and LQ quality genome bins is to be expected due to the both the likely loss of non-classified sequences belonging to a genome with classified sequences while reducing assembly complexity due to the selection of reference genomes for replacement. Assembly complexity can be reduced by e.g. the merging of closely related strains into one reference genome if only one is present in the reference database.
For “simscaled” samples, a slight increase in genome draft bins of MQ* quality can be observed, which is based on the increased coverage by scaling up all counts. Nonetheless, the diversity of the original samples is still lost and even with less complexity, there is a drop in the total number of resulting genome bins from 337 to 211.
Ultimately, benchmark samples created by Tamock showed performance closest to the original samples in both assembly and binning performance, underlying the importance to include unknown sequence fractions for metagenomic data with unclassified data. Considering only the respective sequence fractions Tamock alters in an original sample creating a simulated sample by Tamock, we could show the consistent number of resulting genome bins for only these sequence fractions, supporting the approach of Tamock to replace all classified sequences. Benchmark data based on only the known sequence fraction from a metagenome sample provided results deviating further from corresponding results of original samples as Tamock benchmark data.