Software implementation
The SqueezeMeta to anvi’o interface is implemented in python3, and consists of two scripts. Firstly, the anvi-load-sqm.py script will parse a whole SqueezeMeta project (annotated orfs, contigs and bins) into anvi’o [6], generating databases that can be directly used with the anvi-interactive or anvi-refine tools from the anvi’o suite. Secondly, the anvi-filter-sqm.py integrates a search engine for prefiltering the data before launching the anvi’o interactive session. The SQM to R interface is implemented in the SQMtools R package, which contains several utility functions. The main components of the package are described in the next four sections. All analyses were performed using SqueezeMeta v1.1.1 and SQMtools v0.4.5.
The SQM object
The SQM object is a custom R object (implemented as a S3 class) which contain all the relevant information from a metagenomic experiment (orf, contig and bin annotations, aggregated taxonomic and functional profiles, etc). The formal structure of the SQM object is defined in Supplementary Table S2. The loadSQM function from the SQMtools package will create a SQM folder by reading and parsing the results directory generated by the SqueezeMeta pipeline [5]. The SQM object can then be explored by a set of utility functions included in the SQMtools package (Fig. 1). Alternatively, expert users can directly access its content in order to perform custom analyses.
Normalization of sequencing data
For individual orfs, contigs and bins, we provide the following abundance metrics: mapped read count, mapped base counts, and TPM. The TPM (transcripts per million) metric was introduced by Wagner et al. [14] and its calculation is performed in two steps:
-
1.
Obtain the RPK (reads per kilobase) of each feature, by dividing the number of mapped reads to that feature in that sample by the feature length in kilobases.
-
2.
Calculate the TPM of each feature by dividing its RPK by the sum of RPKs of all the features in that sample, and multiplying by a million.
For the sake of being consistent with previous works, we maintain the nomenclature “TPM”, even when use it to measure the abundances of features other than transcripts.
For functional categories, we provide the following abundance metrics: mapped read counts, mapped base counts, TPM and copy number. The TPM of a given functional category is calculated as described above. The reads mapping to genes from that category are aggregated and divided by the average length of the genes from that category in the assembly. Copy number is calculated by dividing the aggregated coverage of the genes from that functional category by the coverage of COG0468 to the RecA/RadA recombinase, which is a universal single copy gene. An alternative way of calculating copy numbers using the median coverage of 15 different Universal Single Copy Genes is also available and described in the USiCGs section of the SQMtools manual.
The subset methods
The subset methods can be applied to every SQM object in order to generate smaller SQM objects containing the taxa or functions of interest. The subsetTax and subsetBins methods will select the contigs assigned to the requested taxon and bin, respectively, and the ORFs contained in those contigs. By default, they will rescale TPMs and copy numbers so that they are calculated with respect to the selected taxon/bins, rather than to the whole metagenome (Table 1).
The subsetFun method will directly select all the ORFs containing the requested functions. By default, functions are searched for by string matching against all the functional annotations. By default SqueezeMeta uses KEGG [20], COG [21] and PFAM [22], but also allows users to include additional custom databases. Users can also search within the KEGG and COG functional hierarchies (e.g. “Phenylalanine, Tyrosine and Tryptophan biosynthesis”). Finally, users can search for functions using regular expressions instead of string matching by adding the parameter fixed = F.
Figure generation
Heatmaps and barplots are generated using ggplot2. We also use KronaTools [23] for generating Krona charts, and pathview [24] for generating annotated KEGG pathway maps. Anvi’o plots were generated with anvi’o version 6.1 [6].
Multivariate and differential abundance analyses
Multivariate analysis was performed using the metaMDS and adonis functions from vegan version 2.5.6 [9]. Differential abundance analysis was performed with DESeq2 version 1.24.0 [8] using an adjusted p-value cutoff of 0.05. DESeq2 was chosen in this study as it was shown to have a good precision/recall balance in simulated datasets with 5 and 10, replicates per group [25], which is close to our 8 replicates per cohort.
Computational resources
All tests were ran in a server with a 24-core Intel Xeon E5–2620 v2 CPU at 2.10GHz / core and 256 Gb of RAM, using 12 processors. Under these conditions, SqueezeMeta (including assembly, annotation and binning of the 16 samples) ran in 113 h (c.a. 4 days and a half). Loading results into SQMtools took only a few minutes. Loading results into anvi’o took an additional 130 h (c.a. 5 days and a half).
Alternatively, we were also capable of running SqueezeMeta on the 16 samples in a laptop with an 8-core Intel Core i7 8750H CPU at 2.20 GHz / core, 16 Gb of RAM, and an extra 16 Gb of swap memory in a SSD disk. The process took 80 h. Results were loaded in SQMtools as described above, but we were unable to load them into anvi’o as the system ran out of memory, making the anvi-profile script to eventually stall.