Reverse engineering environmental metatranscriptomes clarifies best practices for eukaryotic assembly

Krinos, Arianna I.; Cohen, Natalie R.; Follows, Michael J.; Alexander, Harriet

doi:10.1186/s12859-022-05121-y

Research
Open access
Published: 03 March 2023

Reverse engineering environmental metatranscriptomes clarifies best practices for eukaryotic assembly

BMC Bioinformatics volume 24, Article number: 74 (2023) Cite this article

2357 Accesses
3 Citations
2 Altmetric
Metrics details

A Correction to this article was published on 05 May 2023

This article has been updated

Abstract

Background

Diverse communities of microbial eukaryotes in the global ocean provide a variety of essential ecosystem services, from primary production and carbon flow through trophic transfer to cooperation via symbioses. Increasingly, these communities are being understood through the lens of omics tools, which enable high-throughput processing of diverse communities. Metatranscriptomics offers an understanding of near real-time gene expression in microbial eukaryotic communities, providing a window into community metabolic activity.

Results

Here we present a workflow for eukaryotic metatranscriptome assembly, and validate the ability of the pipeline to recapitulate real and manufactured eukaryotic community-level expression data. We also include an open-source tool for simulating environmental metatranscriptomes for testing and validation purposes. We reanalyze previously published metatranscriptomic datasets using our metatranscriptome analysis approach.

Conclusion

We determined that a multi-assembler approach improves eukaryotic metatranscriptome assembly based on recapitulated taxonomic and functional annotations from an in-silico mock community. The systematic validation of metatranscriptome assembly and annotation methods provided here is a necessary step to assess the fidelity of our community composition measurements and functional content assignments from eukaryotic metatranscriptomes.

Peer Review reports

Background

Eukaryotic microbes play diverse and important roles in global ecosystems [1], including grazing processes, primary production, and acting as hosts for diverse and essential symbionts [2]. In ocean ecosystems in particular, the literature on the role of eukaryotic microbes in ecosystem processes continues to expand [3]. This literature provides further evidence that eukaryotic microbes are as important as prokaryotic counterparts when it comes to nutrient cycling and their outsized influence on food webs and community ecology [4,5,6,7], and necessitates renewed efforts of understanding the underlying mechanisms.

The ecological relevance of eukaryotic microbes requires careful study of their ecology and distribution, but this can be difficult to execute, both in situ and in the laboratory. The diversity of natural eukaryotic assemblages makes comprehensive surveys difficult to perform in a taxon-specific manner. Taxonomic diversity may be catalogued in the field using 18S rRNA gene amplicons or cell count data, although this neglects functional diversity, which may be leveraged to inform broader understanding of their biogeochemical and ecological roles. Complicating efforts, many eukaryotic microbes are not easily cultivatable in the lab [1, 8], and relying solely on the subset of eukaryotic microbes that we can grow in the laboratory may apply a biased filter to our understanding of these organisms [9]. For these reasons, the use of culture-independent environmental metatranscriptomic and metagenomic sequencing techniques has become a popular and successful method for uncovering new taxonomic and functional diversity in populations of eukaryotic microbes in diverse environments in the field [8, 10, 11].

Metatranscriptomics has become a widespread and promising approach to answer questions about microbial community activity in the environment without prior knowledge or bias [12], and may be used to identify underlying genetic mechanisms driving global phenomena like ocean biogeochemistry [13,14,15]. Metatranscriptomes provide an accessible means to look at the full suite of genes being expressed by a group of organisms, which may be partitioned by size, site, or phylogenetic origin [16]. Metatranscriptomes can be paired with metagenomes to provide community-level insight into gene expression, and may represent a functional complement to the increasing amount of community composition and novel binned genome data that is available for microbial eukaryotes [10, 11, 17]. However, despite the potential of this approach, the field is relatively new and standardized practices are immature. The first environmental transcriptome, targeting bacterioplankton, was sequenced in 2005 [18], and marine metatranscriptomes began to appear in the literature around 2008 [16, 19]. Metatranscriptomes offer a snapshot of the whole community at the time of sequencing, but the relative proportion of transcripts and their detectability may not always provide meaningful insights into true biological processes, in particular when sequencing depth is low or references are missing from the database [12]. For this reason, databases must be compiled and new computational approaches must and continue to be developed to process and interpret metatranscriptomic data. The collation of laboratory transcriptomic data to a single location and format by the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) [3, 20] began as a repository effort and became one of the most important databases enabling the identification of marine microbial eukaryotes from metatranscriptomic sequences (e.g. [21,22,23,24]). Substantial discoveries have been made using sequenced metatranscriptomes, including novel explanations for persistent gaps in ecological understanding, such as coexistence within a seemingly narrow niche [23], discovering new genes or putative organisms from previously unknown sequences [19], developing a molecular understanding of the basis of coral disease [25], and decoding the complexities of deep-sea hydrothermal vent microbial communities [26]. Metatranscriptomic data availability, in particular for eukaryotic phytoplankton, has been transformed by the Tara Oceans Project, which provided an unprecedented amount of sequence information, enabling us to better interpret the ocean genetic landscape on a global scale [27]. Still, metatranscriptome analysis tends to vary substantially between studies, and interpretation can suffer from biases inherent to the technology.

Reliable, reproducible, and broadly available approaches to metatranscriptome analysis have been lacking, particularly in eukaryotic microbial community assessment. Early transcriptome pipelines were designed in the last decade for conventional, well-studied organisms, such as humans and mice, and their microbial communities (e.g. [28]). These pipelines are unlikely to include user-downloadable software, often are focused on annotation, and do not include a mechanism for de novo assembly and processing [28]. A few years later, the first pipeline for uncharacterized microbial communities emerged, but it was presented as a description of the steps needed for metatranscriptome analysis, rather than as software products available to users [29]. The Simple Annotation of Metatranscriptomes by Sequence Analysis (SAMSA) tool, and its second released version, SAMSA2, are among the most recently updated metatranscriptome analysis tools [30]. While this tool is a complete package that can be downloaded and used by scientists, it focuses on rRNA gene removal steps, and does not include assembly steps [30]. In fields such as microbial oceanography, we often require de novo assembly of transcriptome sequences, as the identity of the organisms in environmental samples is not always known, and even for well-understood organisms, comprehensive references may not be available. To date, metatranscriptome pipelines have either lacked accompanying software products or assembly steps necessarily for de novo environmental analysis. As a consequence, the community remains in need of a reliable metatranscriptome analysis tool that is downloadable, reproducible, and includes de novo transcriptome assembly.

The landscape of de novo transcriptome assembly tools is wide, and often there is disagreement about which tool is best to use for a particular application or the average expression level for a sequenced transcript [31]. The Oyster River Protocol (ORP) software was published in 2018 as an answer to this problem, a tool designed for single transcriptomes and meant to combine assembly tools [32]. Using a collection of transcriptome assemblers, the ORP is designed to overcome the challenge of efficiently collating information from multiple assemblers [32]. Further, it uses a collection of k-mer sizes, where a k-mer is a k-sized portion of the transcript used to split the information up into more digestable pieces, to reduce the likelihood that less abundant transcripts would be favored during assembly due to small k-mer size or vice-versa with more abundant transcripts and large k-mer size. The ORP, however, is a standalone approach to transcriptome assembly, and does not allow the user to simultaneously process multiple samples, nor does it accommodate metatranscriptomes. The ORP does not integrate obviously with downstream annotation metrics, and rather is an approach to combining transcriptome assemblies built using varying k-mer lengths.

More recently, it has been demonstrated that de novo co-assembly using multiple transcriptome assemblers improves the quality of single organism-transcriptome assembly [33]. This was shown using a de novo transcriptome assembly pipeline with non-model organism expression data as input to recapitulate the transcriptome of a single species. A BUSCO (Benchmarking Universal Single-Copy Orthologs; [34]) score quality threshold of 50% recovery was used in order to assess the recovery of single organism transcriptomes [33]. BUSCO is a tool used to determine the proportion of lineage-specific single-copy genes found in a genome (or a transcriptome) in addition to initial statistics based only on sequence content rather than evolutionary lineage [34, 35]. When working with single-organism transcriptomes, metrics like BUSCO scores [34] are appropriate for evaluating the completeness of the sequence library of the organism. This differs from the metatranscriptomic context, wherein BUSCOs from potentially multiple organisms are at play for recovery. The authors of the original co-assembly study [33] note that multiple assemblers used at once for a larger co-assembly contribute to higher quality transcriptomic assemblies of RNAseq data, especially when some subset of the highest-performing assemblers is used [33]. These results may help inform multi-organisms metatranscriptomic community data, but they require a transition from consideration of single-organism BUSCO metrics to identification of key features of multiple organisms present in an environmental community. Specifically, rather than validating a single-organism transcriptome with its BUSCO completeness estimates, it is necessary to identify whether the multiple BUSCO-complete single-organism transcriptomes present in a community-wide sample can be accurately recovered. Identifying the most salient and appropriate metrics for the claim that a single organism has been accurately identified and its functions accurately described from a metatranscriptome poses a significant challenge for the field. This is particularly true for environmental community data in which taxonomic boundaries might not be fully resolved in the first place, and culture representatives may not be available. To further complicate matters, even when assembly products can be shown to be “accurate” relative to commonly used metrics such as contig length, percentage of the raw sequencing reads mapping back to the assembly, and the presence of annotated genes with homology to “core” reference genes, they are not guaranteed to offer the best solution to the assembly problem because of the lack of database representatives [36].

The question that remains from single-organism co-assembly studies is why individual transcriptomic assemblers sometimes produce higher-quality or more complete results, and whether redundancy within each transcriptomic assembly skews quality assessment. In order to answer this question, the assembled content shared in the output of multiple assemblers needs to be compared to the new content offered by combining assembly tools. When standardized approaches to assembly and use of appropriate parameters are used and benchmarked, novel insights about the unifying and diversifying aspects of microbial communities can be established [37, 38]. These are typically centered around one or both of the two essential avenues for annotation of sequence material: the taxonomic identity of the sequences and their functional role in the organism. One previous study that focuses mainly on functional role identification via transcript assembly [39] established a comparative workflow, CoMW, for assessing the success of the recovery of database genes from the human gut microbiome, and compared the effectiveness of the assembly-based process of CoMW to assembly-free methods for metatranscriptomes.

Here, we assess the ability of metatranscriptomic assembly methods, and specifically our co-assembler, multiple-sample co-assembly approach to recover all transcripts included from existing single-organism transcriptome assemblies. Rather than testing the recovery of identified database genes, we compare our metatranscriptome assemblies to annotated “designer” metatranscriptomes constructed from diverse transcriptome assemblies from a database created using the MMETSP database [3, 20]. This validation workflow is designed to answer the questions: Do studies that use metatranscriptomics to understand community diversity in eukaryotic microbes found in the environment (a) adequately recapitulate the taxonomic and functional diversity found in those communities? and (b) reproduce consistent sequences which could be reliably recovered with repeated sampling and assembly? Specifically, the workflow is intended to explore whether sequences for which we have yet to assign a functional annotation are also recovered after reverse engineering raw reads from previously assembled contigs, or whether some may be artifacts of assembly. In addition, we evaluate whether if some assembly tools outperform others, they more likely to outperform in terms of genes from the assembly which can be annotated, length distribution of contigs, or mapping efficiency in recapitulating the raw reads. Doing so is intended to address whether it is acceptable to forgo one or more of these in favor of lower computational requirements.

The community is in need of a protocol for validating popular metatranscriptomic assessments, and a set of recommendations for how best to manage the challenge of minimizing computational assembly costs while maximizing ecological insight extracted from these powerful data. To address these challenges, we have developed eukrhythmic, a pipeline which facilitates metatranscriptome assembly with multiple assembly tools and post-processing for environmental sequence analysis in an all-in-one workflow. Here, we describe the eukrhythmic pipeline and validate its performance via the construction of simulated metatranscriptomes using a tool we call jEUKebox, and apply it to the assembly and analysis of published metatranscriptomic datasets and simulated metatranscriptomes. Our benchmarking effort using eukrhythmic addresses whether assembling metatranscriptomes from a mixed environmental community is comparable to isolating and sequencing particular species or strains of eukaryotic marine microbes and sequencing their transcriptomes individually.

Methods

Throughout this paper, we use: “designer metatranscriptomes” to refer to the “gold standard” jEUKebox-simulated metatranscriptomic contigs generated from MMETSP reference transcriptomes with known taxonomic annotations, “simulated raw reads” to refer to simulated raw reads from the gold standard, and “reassembled products” to refer to the combined simulated output of metatranscriptome assembly using the eukrhythmic pipeline.

Eukrhythmic pipeline

Data cleaning and trimming

Trimming is performed using Trimmomatic version 0.39, a flexible tool that is specifically suited to paired-end next-generation sequencing data, with user-specifiable parameters [40], with a minimum read length of 50 basepairs, a sliding window of length 4 and quality score 2, and a standard list of Illumina adapters (ILLUMINACLIP:<adapter-list>:2:30:7 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:50). Optionally, the user may also choose to filter out spike-in sequences, if they were added during extraction, with bbmap [41].

Assembly

One major advantage of using the eukrhythmic pipeline is the flexibility to use as many (or as few) transcriptome assemblers as is appropriate for the data (Fig. 1). Many different metatranscriptome assemblers are available to researchers and commonly used, and it can be challenging to select the appropriate assembler, given that each often has its own advantages and disadvantages [42, 43]. In eukrhythmic, the user may select any combination of assemblers [36, 44,45,46,47,48], and the assembly process is conducted in parallel, as resources allow.

Merging and clustering

The consolidation of outputs from the constituent metatranscriptome assemblers is performed in two steps. First, assemblies from the same sample or user-defined “assembly group” (considered a single unit due to some shared characteristic) are concatenated. Inspired by the process adopted by Cerveau et al. (2016) [49], we used the MMSeqs clustering tool [50] to eliminate similar contigs from the combined assembly, first using a sequence similarity threshold of 100% for the shorter sequence in a local alignment to remove identical contigs recovered by multiple assemblers. Next, the pipeline branches into two output types. For the first output type, individual samples/assembly groups (“CAG”, or “clustered by assembly group”), which then undergo a second round of MMSeqs clustering to remove similar contigs at a 98% similarity threshold (defined the same way as above), accounting for potential sequencing errors [49]. Additionally, samples already merged from the assembly process are then merged between samples, such that one combined assembly is produced with all available data, labeled “multiple assembly consolidation” or abbreviated to “MAD” (“multi-assembler deduplicated assemblies”) in the text. We then cluster the combined assembly at the 98% level of similarity using MMSeqs2 as previously described.

Protein translation

To accommodate protein-space downstream analysis, such as protein families database (Pfam) annotation [51], protein translation with TransDecoder [52] is supported as part of eukrhythmic. Both the output individual sample/assembly group files from the two clustering steps and the single combined assembly are translated to protein sequences.

Annotation

While eukrhythmic is primarily designed for assembly, the user may optionally elect to annotate the assembly output as part of the pipeline. Presently, the pipeline provides annotation tools including phylogenetic assessment using EUKulele [21], and basic functional assessment using the companion tool eggNOG-mapper [53]. To characterize KEGG annotations [54], we grouped results by Kegg Orthology ID (KO). When multiple relevant annotations were associated with a single hit, we assigned counts evenly to the assigned annotations.

Design of simulated community schema

Communities

The six simulated communities were designed to have differing complexity and to represent community ecotypes that might be encountered in real-world metatranscriptomic studies. These configurations are summarized visually in Fig. 2 and in terms of their complexity in Table 1 and their taxonomic composition in Table 2. Community 1 was designed to resemble a community dominated by a single organism, thus has the lowest Shannon diversity index and species richness (see calculations in Section “Metrics for assessing community complexity”). Community 2 has a similar species richness value to Community 1 and only marginally higher diversity, since two strains of the same species make up the majority of the sample. Community 3 has the highest number of genes which are not shared between any of the organisms in the sample, but lower diversity than Communities 4 and 6, which have the highest total species diversity. Community 4 has more genes shared between two closely-related groups. Community 5 has the highest total number of reasonably related organisms and shared genes. For MMETSP group B, the list of MMETSP IDs to choose from was selected randomly, and individual community pairings were determined using fastANI similarity (see Section “Simulation of eukaryotic communities using jEUKebox”).

Table 1 Calculated diversity metrics for the six simulated MMETSP-based communities used in the analysis

Full size table

Table 2 MMETSP members of each simulated community with the number of orthologous groups which include each organism and the assessed BUSCO completeness of each transcriptome

Full size table

Metrics for assessing community complexity

The Shannon diversity index of each community was calculated using the following formula [55]:

$$\begin{aligned} \text {Shannon(community)} = \sum _{j=1}^n p_j \ln p_i \end{aligned}$$

where n is the total number of “types” of community members, and p is their proportion in their community. Total species richness was reported as the total number of types present in the community.

We used sourmash to compute the pairwise similarity of each MMETSP transcriptome within each community [56]. We additionally introduce another diversity metric to account for the potential similarity of the transcriptomes beyond their taxonomic annotations:

$$\begin{aligned} \text {sourmash composite score} = \sum _{i=1}^{n}\sum _{j=1}^n (1-\text {sourmash score}) \min (p_{i}, p_{j}) \end{aligned}$$

In other words, for each pair of transcriptomes in the community, we weight the sourmash similarity score of the pair of transcriptomes by the abundance of the less-abundant transcriptome in the pair. We report the sum of these weighted scores for each community in Table 1.

Simulation of eukaryotic communities using jEUKebox

Selection of transcriptomes

For each set of simulated eukaryotic communities, 12 transcriptomes from the MMETSP [3, 20] were used. These are summarized in Table 2 for the two selected communities. For “community A”, the IDs, but not the contigs selected, were included based on their features, including some MMETSP IDs of the same species and some of closely related strains. For “community B”, the MMETSP transcriptomes were selected randomly by jEUKebox, with the only constraint being the inclusion of some closely-related taxa. For the random selection built into the pipeline, the only requirement is that some subset of the organisms that went into the communities had to have a highly similar partner in the same community per computed nucleotide similarity score.

Similarity computation using fastANI

In order to select “closely-related” transcriptomes for the community specifications, we used fastANI [57] to calculate the average nucleotide-based sequence identity between transcriptomes and identify similar transcriptomes on the basis of having $\ge 80$% average nucleotide identity. Hence, for e.g. community 2 (see Fig. 2), two MMETSP transcriptomes with fastANI similarity $\ge 80$% would be selected.

Identifying putative evolutionary relationships with OrthoFinder

In order to test metatranscriptome assembly quality with respect to recovering genes with shared evolutionary origin, but different current annotated taxonomic identity, we used the tool OrthoFinder to identify orthologous groups between the MMETSP transcriptomes, and to include genes from both highly-conserved and relatively rare gene groups in the designer metatranscriptome [58, 59]. As summarized in Fig. 2, the jEUKebox pipeline automates this process by including 10% of all of the identified single-copy orthologs reported by OrthoFinder (orthologous groups with a single gene representative from every transcriptome in the community). Genes are then selected for each one of the organisms in the community according to the following procedure. For genes that do have a “highly related” partner with respect to computed similarity (see Section “Similarity computation using fastANI”; fastANI score $\ge 80$%), 75% of the contigs to be included in the designer transcriptome (as prescribed by the desired ratio of the candidate organism in the final metatranscriptomes) were taken from orthologous groups which included more than just the candidate. The remaining 25% were randomly selected from orthologous groups that only contained the candidate. For genes without a highly-related partner, 75% of genes were taken from exclusive orthologous groups containing only the candidate. The remaining 25% were randomly selected from orthologous groups shared with other MMETSP transcriptomes.

Simulation of raw reads

After creating the designer metatranscriptomes directly from informed random selection of contigs from the MMETSP transcriptomes, raw reads were simulated using the package Rsubread [60]. We chose a read length of 75 base pairs to enable the simReads function to use its inbuilt set of quality scores to randomly determine a sequencing error for the generated raw reads (via the simulate.sequencing.error parameter). We chose a mean fragment length of 180±40 base pairs and generated a 1 million base pair library for the paired sequencing reads which were simulated using the package for each community and trial.

Reassembly with eukrhythmic

The raw reads simulated using the simReads function were provided as input to the eukrhythmic pipeline. The pipeline was run with default settings as described in Section “Eukrhythmic pipeline” and listed in the configuration file upon downloading the pipeline. Four assemblers were used: rnaSPAdes [36], MEGAHIT [47], metaSPAdes [48], and Trinity [61]. We chose these four assemblers because they each are either designed specifically for the community-level eukaryotic transcriptome assembly problem, or they have been reported to perform particularly well on particular metatranscriptome assembly metrics [62]. rnaSPAdes is built for RNA assembly and performs well with respect to percentage mapping [36, 62], Trinity performs well with respect to number and length of contigs generated [61, 62], MEGAHIT is exceptionally fast [47], and metaSPAdes is designed for community-level sequencing data assembly [48].

Assessing reassembly quality

Assembly statistics

We used the Salmon mapping tool to both quantify the abundance of each contig with respect to the raw reads, and to assess what proportion of the raw reads were represented in the assembled contigs.

We report descriptive statistics for the contigs assembled as a proxy for the quality of the assembled sequences. These include minimum, maximum, mean, and standard deviation of contig length, as well as the N50 metric. We use the definition of the N50 metric as the minimum length among the set of contigs which together constitute 50% of the total length of all contigs in the assembly, as reported by QUAST [63].

Clustering reassembled metatranscriptomic proteins with MMETSP-derived designer metatranscriptomic proteins

To determine whether exact sequence matches were shared between the predicted proteins from the metatranscriptome assembly and the proteins from the MMETSP used to create the designer metatranscriptome, we performed mmseqs2 clustering between the two protein sets [50]. We chose the LINCLUST algorithm as implemented in mmseqs2 due to its exceeding low false discovery rate in clustering [64, 65]. In accordance with what was used by authors of mmseqs2, we report these results using minimum coverage of the target sequence (–cov-mode 1) of 90% and a minimum sequence identity of 90%, at which threshold fewer clusters are produced but there is very little chance of a false negative, i.e. two 90%-similar sequences in the dataset which mmseqs2 fails to report.

When evaluating the likelihood of contigs assembled using eukrhythmic to cluster with the designer contigs, we based the comparison on the protein predictions from TransDecoder [52] as clustered through mmseqs2. For each full nucleotide contig, we considered it to have “clustered with the designer metatranscriptome” if at least one ORF from TransDecoder was successfully co-clustered with a protein from the designer assembly, even though transcriptome assembly happens in nucleotide space. This enabled us to also quantify what proportion of the contigs from the eukrhythmic assembly were not assigned an ORF at all by the TransDecoder software.

Assessing metatranscriptomic proteins using BLAST all-by-all comparison

In addition to clustering, we performed an all-by-all BLAST search between the proteins from the original contigs from the MMETSP and the resulting predicted proteins from eukrhythmic. An e-value cutoff of $10^{-2}$ was used to catch the top match on the basis of bitscore, and then hits were classified according to their percentage identity and bitscore value.

Taxonomic annotations

As performed within the eukrhythmic pipeline, we generated taxonomic annotations for both the designer metatranscriptomes and the reassembled products from eukrhythmic with the EUKulele tool (version 2.0.3) using the default reference database of contigs from all MMETSP transcriptomes and the MarRef database [3, 20, 21, 66]. We report differences in the number of annotated species and genera from EUKulele in the reassembled products as compared to the sequences which were prescribed to be included in the designer metatranscriptome using the jEUKebox pipeline. We also compare the EUKulele annotations from the designer metatranscriptomes, including false matches on the basis of poor-quality sequences being present in the database and failing to be annotated to begin with, to the annotations of the reassembled products. We perform standard linear regression on the number of annotations for each species, genus, order, and phylum from the designer metatranscriptomes as compared to the reassembled products. We also categorize taxonomic annotations according to whether they were classified correctly, incorrectly (in conflict with the original annotations), or were not classified. We performed a Welch’s 2-sample T-test for independent samples as implemented in scipy [67] to compare the summed abundances of correctly and incorrectly classified and unclassified sequences.

Functional annotations

All functional annotations were determined using eggNOG-mapper (version 2.1.3) [53]. Similarly to taxonomic annotations, reported annotations of orthology terms (KOs) from the Kyoto Encyclopedia of Genes and Genomes (KEGG) were compared between the designer metatranscriptomes via annotation of the contigs from the MMETSP and the reassembled products which were retrieved as an output of the eukrhythmic pipeline.

Standard linear regression was performed to compare the abundance of KEGG orthology terms in the designer metatranscriptomes as compared to the reassembled products from eukrhythmic. The regression and associated probability value was calculated using the implementation in base R [68].

Assembling and evaluating environmentally relevant metatranscriptomes from the Tara Oceans project

We assembled metatranscriptomes from the Tara Oceans project [69, 70] as an environmental counterpart to the simulated sequence data. Metatranscriptome samples from three distinct ocean basins were assembled from the highly-diverse small-size fraction surface samples from the Tara project: the North Atlantic, Southern Ocean, and Mediterranean Sea; accession numbers are collated in Table 6. We assembled these metatranscriptomes using default parameters to the eukrhythmic pipeline and used MEGAHIT and rnaSPAdes, which proved to be the fastest and most accurate assemblers, respectively, in both the present work and other investigations [36, 47]. Three assemblers were selected so as to compare the mutual findings of the three assemblers to the unique sequence content identified by each. We assess the results of the metatranscriptome assembly via percentage mapping via Salmon using the default k-mer length of 31, automatic library type selection, and the –validateMappings flag [71], and taxonomic and functional annotations as provided by EUKulele (version 2.0.3) and eggNOG-mapper (version 2.1.3), respectively [21, 53].

These metatranscriptomes were previously analyzed by [22] with transcribed sequences of length $\ge$150 bases assembled using velvet [72] included as part of the “MATOU” database [22]. In order to compare the contigs generated and retained from our multi-assembler approach, we conducted a blastn [73] search with e-value cutoff of 1e-10 to find the top-scoring match of “MATOU” transcribed sequences against our sequences, and compared the contigs that were successfully matched to the database using this method to those that otherwise could be functionally and/or taxonomically annotated. Identified coding sequences of length $>150$ bases were retained for further analysis following [22].

Reassembling and evaluating previously explored metatranscriptomes from the Narragansett Bay time series

We assembled ten samples from a 2015 metatranscriptomic study from the Narragansett Bay time series [23]. These samples are stored under National Center for Biotechnology Information (NCBI) project accession number SRP055134 and samples were assigned individual accession numbers collated in Table 5. We assembled these metatranscriptomes using default parameters to the eukrhythmic pipeline and used MEGAHIT, rnaSPAdes, metaSPAdes, and Trinity [36, 47, 48, 61]. We compared the taxonomic and functional annotations between assemblers to the composition of major taxonomic groups reported by the 2015 study, which used raw read mapping to reference transcriptome assemblies rather than assembling the metatranscriptome itself [23]. We also compare the insights drawn from the simulated metatranscriptomes via jEUKebox to the patterns that emerge from using multiple assemblers on a previously analyzed environmental dataset.

Data processing and visualization

Output data from the described tools were processed using Python version 3.8.3 [74] and R version 4.1.0 [68]. Figures were generated using plotnine in Python [75] or ggplot2 [76] in R with organization into panels using patchwork 1.1.2 [77]. Statistical analysis on the data was performed with SciPy [67] or with R version 4.1.0 [68].

Results

Simulated raw reads were created using the jEUKebox pipeline described in the methods (Section “Simulation of eukaryotic communities using jEUKebox”) were processed with eukrhythmic. Briefly, reads were trimmed, underwent quality estimation, and were assembled using multiple software tools which were identified or shown in previous studies to perform well with transcribed mRNA sequences, metagenomic data, or both [36, 47, 61], were clustered, and then were functionally and taxonomically annotated with EUKulele (version 2.0.3) and eggNOG-mapper (version 2.1.3) [21, 53]. The full details of the jEUKebox and eukrhythmic pipelines are expanded upon in the Materials and Methods (Section “Eukrhythmic pipeline”).

jEUKebox pipeline generates simulated eukaryotic metatranscriptomes with varying sequence diversity

We developed the jEUKebox pipeline to facilitate the rapid creation of comprehensive mock metatranscriptomic datasets that may be be used to benchmark pipelines and software. Here, we construct marine eukaryotic metatranscriptomes with differing sequence diversity and community complexity leveraging reference data from the MMETSP [3, 20]. We treat the jEUKebox-simulated datasets as a gold standard to assess the performance of the eukrhythmic pipelines and the assemblers that it uses. More details about how the pipeline simulates raw reads that resemble the data type generated by marine metatranscriptomic surveys can be found in the Materials and Methods (Section “Simulation of eukaryotic communities using jEUKebox”). We chose two distinct groups of laboratory transcriptomes from the MMETSP [20] for the simulations to ensure that the results were not a product of the specific organisms we selected. For the random selection built into the pipeline, the only requirement is that some subset of the organisms that went into the communities had to have a highly similar partner in the same community per computed nucleotide similarity score (Section “Simulation of eukaryotic communities using jEUKebox”). We also designed the jEUKebox pipeline to include a balanced fraction of common transcripts that had an ortholog expressed by multiple organisms, and we implemented six distinct community configurations so as to simulate a range of species richness and evenness (Fig. 2).

Eukrhythmic products accurately represent the raw reads

The eukrhythmic pipeline produced reassembled products had similar raw read percentage mapping scores to the designer assemblies. The mapping of simulated raw reads against the eukrhythmic reassembled products was lower than against the designer metatranscriptomes against which they were simulated, with $87.5\pm 2.0$% of simulated raw reads mapping against the eukrhythmic reassembled products and $96.0\pm 0.2$% against the designer assembly (Fig. 3A-C; Table 3). This discrepancy is likely due to the error introduction step in the raw reads or conflicts between different raw read placements in candidates for reassembled products which could not be resolved by the assembler. These patterns were reproduced in the environmental dataset we tested [23]: both the MAD ($82.1\pm 3.8\%$) assembly and the multi-assembler clustered assembly (“CAG”; $77.6\pm 4.5\%$ mapped) outperformed any individual assembler with respect to percentage mapping (Fig. 3D-E). In our simulated data, rnaSPAdes had the highest average percentage mapping of any assembler, and MEGAHIT had the lowest (Fig. 3D), but patterns were slightly different in the environmental dataset [23]. While MEGAHIT still underperformed relative to the other assemblers with respect to percentage mapping (Fig. 3D, E), comparisons between the remaining assemblers were less straightforward. rnaSPAdes showed the highest individual performance ($75.0\pm 4.7\%$ mapped), followed by SPAdes ($70.5\pm 5.4\%$ percent mapped). However, Trinity performed better in some samples than in others, hence showed higher dispersion in percentage mapping values ($67.8\pm 8.6\%$).

Table 3 Resulting assembly size and taxonomic, functional, and core content recovery of jEUKebox outputs after raw read simulation and re-assembly with eukrhythmic

Full size table

Table 4 Comparison of the average length of sequences in the designer metatranscriptomes as compared to the eukrhythmic reassemblies

Full size table

Average contig length tended to be significantly shorter in the eukrhythmic reassemblies as compared to the designer metatranscriptomes, although there was considerable variability (Fig. 3). The average length of open reading frames (ORFs) predicted by the TransDecoder tool was also smaller in the eukrhythmic reassemblies as compared to the original sequences retrieved from the MMETSP transcriptomes (Table 4; Fig. 3). Although still substantially shorter than the designer metatranscriptomes, sequences in the eukrhythmic products that were recovered by more than one assembler according to mmseqs2 clustering had progressively longer length (median length 334 base pairs for clusters represented by a single assembler, median length 960 base pairs for clusters represented by all four assemblers; between-distribution t-test $p<0.001$; Fig. 3). These longer contigs had high fidelity with the raw reads, as evidenced by the agreement of multiple assembly approaches, hence were likely to be longer sequences interrupted by fewer conflict instances.

Less stringent clustering slightly decreases identified annotations

eukrhythmic reduces the redundancy of the identified contigs for the merged assembly via clustering, hence reducing computational complexity of downstream operations on the smaller multi-assembler, multi-sample assembly file. Applying clustering directly to the designer metatranscriptomes revealed that substantial protein-space clustering only slightly decreases the unique annotations extracted from the dataset. For example, mmseqs2 clustering with a sequence identity threshold of 0.6 and coverage threshold of 0.6 in coverage mode 1 reduced the number of contigs in the assembly by an average of 23.7% and reduced the assembly file size by an average of 21.7%, but only reduced average identified KEGG database functional annotations by 1.4% and did not result in the loss of any species from the dataset via clustering (Fig. 4). By default, eukrhythmic uses a conservative approach of 100% sequence identity and 98% coverage for the most lenient clustering step, but we found in this test that values of 80% for both coverage and sequence identity could considerably reduce total file size without considerably changing unique annotations (Additional file 1: Fig. S1). Given this substantial reduction in file size without loss of the majority of annotations, more stringent clustering thresholds may be warranted, especially in datasets with many samples or high sequencing depth.

Eukaryotic metatranscriptome assembly accurately recapitulates simulated taxonomic diversity

On the whole, all assemblers performed well with respect to recovery of the major taxonomic annotations from the simulated metatranscriptomes. 94.8±2.2% of all recovered contigs were assigned genus-level annotations by the EUKulele tool that matched genera found in the selected MMETSP transcriptomes used to simulate the metatranscriptomes (97.7±2.2% of annotated contigs). In general, the number of annotations in conflict with the genus-level annotation assigned based on the MMETSP was similar in the designer metatranscriptomes as compared to the reassemblies generated by eukrhythmic. The computed linear regression between the genus-level annotations from the designer assemblies and the eukrhythmic re-assemblies was nearly one-to-one: $\text {Reassembly} = -1353+1.02(\text {Designer})$; $R=0.95$; $p=<8.2e-184$; note that the intercept is relative to total abundances on the order of $10^{5}$. This indicated that the total abundance of each genus-level annotation as assessed by Salmon quantification matched well between the designer metatranscriptomes and the reassembled products from eukrhythmic.

Table 5 Sample IDs and accession numbers for Narragansett Bay samples. Descriptive information about the sample conditions are reproduced from [23]

Full size table

Table 6 Sample ID and accession numbers for the 15 Tara Oceans metatranscriptomes assembled as part of this project, including the ocean basin they were sampled from

Full size table

Despite this performance, some genus-level annotations were missing based on the contigs provided from the MMETSP. Between all trials, an average of 1.3 ± 1.9 genera out of an average of a total 6.1 ± 2.7 genera were not recovered by the eukrhythmic re-assembly, despite being present in the MMETSP transcriptomes which were used to create each community (see Table 2). As many of these annotations were also missing from the EUKulele annotations on the contigs from the MMETSP themselves (1.9±1.9 genera), these contigs may simply have not been sufficiently distinct from the transcriptomes of other organisms in the database to be annotated, potentially due to sequence length or specificity. Using the EUKulele annotations instead of the taxonomic annotation of the transcriptome from which the original contigs were taken, $2.8\pm 1.7$ genera were not found in the eukrhythmic reassembled results as compared to the original EUKulele annotations of the designer assemblies. An average of $39.3\pm 12.9$ distinct genus-level annotations were assigned in the eukrhythmic output as compared to $6.1\pm 2.7$ distinct MMETSP genera being used to generate the samples due to being successfully annotated as similar genera present in the MMETSP. These spuriously taxonomically annotated contigs constituted both a minority of the total assembled contigs as well as estimated abundance from the simulated raw reads (Additional file 1: Fig. S6), and the occurrence of these spurious annotations could be reduced with more stringent EUKulele parameters, though at the expense of some correct annotations.

The sequence annotations were classified according to whether they did or did not align with genus-level annotations from the MMETSP (Fig. 5). There was no statistically significant difference between the per-sample summed abundance of incorrectly-annotated contigs between the designer and the eukrhythmic reassembled products (T=−0.084; p=0.93), however correctly-annotated contigs were significantly more abundant in the designer assemblies (T=−5.28; p=8.3e−7) and unannotated contigs were significantly more abundant in the eukrhythmic reassemblies (T=5.43; p=4.5e−7).

Functional annotations from metatranscriptome assembly match abundance and diversity of functions in designer transcriptomes

Functional annotations were recovered with similar frequency and relative abundance in the eukrhythmic reassembled products as compared to the designer assemblies (Fig. 6; Additional file 1: Fig. S7), and also between assemblers (Additional file 1: Fig. S13). As an overall average across MMETSP groups and samples, 5820.6±349.6 KEGG orthology terms (KOs) were correctly recovered from the designer assemblies, 820.3±163.7 were “false positives” that were recovered in the eukrhythmic reassemblies but not in the original designer assemblies, and 473.8±107.6 were identified in the designer assembly but not recovered by eukrhythmic. However, the false positive and unrecovered KOs tended to have low abundance compared to those which were correctly identified: on average, there were 1566.5±321.3 total occurrences of annotations of false positive KOs per sample in the eukrhythmic reassemblies and 107.6±204.7 total occurrences of annotations of KOs that were not found in the eukrhythmic reassemblies in the designer assemblies, as compared to and average of 132751.9±10176.5 occurrences in the designer assembly and 116489.5±9961.0 occurrences in the eukrhythmic reassemblies of KOs that were mutually recovered before and after the reassembly process. A linear regression with an imposed y-intercept of zero as calculated in R [68] revealed a relationship of $\text {Reassembled KO abundance} = \text {Designer KO abundance} \cdot 0.96$ with an adjusted $R^2$ of 0.85 ($p=2.2e^{-16}$), indicating a nearly one-to-one relationship between the abundances of each KO in the designer assembly and in the reassembled products (including false positives and KOs missing from the eukrhythmic reassemblies; Fig. 6).

The vast majority of KOs also recovered in the designer assemblies were identified by all four assembly tools (5326.6±247.9 KOs across all samples). rnaSPAdes individually recovered the highest number of unique KOs that were also found in the designer assembly of any assembler (96.0±20.4), but rnaSPAdes also generated the highest number of KOs that were not found in the designer assemblies (176.4±26.5), nearly double the number that it uniquely recovered (Fig. 6). rnaSPAdes also had both the highest number of proteins that did and did not have a successful BLAST ([73, 78, 79], Fig. 7).

Applying the eukrhythmic pipeline to environmental metatranscriptomic datasets

To benchmark the eukrhythmic pipeline and to provide examples of the potential biological insights that can be extracted from the assembly approach, we assembled and annotated samples from two metatranscriptomic datasets. First, we chose two sets of samples from the Tara Oceans project as a representative general oceanographic dataset: one set from the Southern Ocean and one from the Mediterranean Sea, two ocean basins with contrasting levels of diversity (Fig. 8A–C). We found that eukrhythmic expands the total amount of protistan coding sequence data that can be recovered from anywhere in the global ocean. We also assembled a metatranscriptomic dataset from a previously published study in Narragansett Bay as a coastal example with a dominant taxonomic group (i.e. a canonical “bloom” scenario). We observe that while eukrhythmic recapitulates many of the general patterns from a direct read mapping-based study, the assembly approach outperforms direct read mapping with respect to the number of distinct diatom representatives recovered, the dominant taxonomic group (Bacillariophyta) in the samples.

Tara Oceans eukrhythmic assemblies contain coding sequences that lack representation in the “MATOU” gene atlas

More than one in six (17.5%) of the coding sequences we recovered from our multi-assembler eukrhythmic assemblies from the Tara Oceans metatranscriptomes had no significant hits to the “MATOU” composite gene atlas across all Tara Oceans metatranscriptomes curated by [22] (Fig. 8D-F; percentage mapping of these coding sequences to the raw reads as compared to all contigs shown in Additional file 1: Fig. S20). An average of 16.1% of all Mediterranean Sea coding sequences and 18.8% of all Southern Ocean sequences did not have any match to previously recovered coding sequence content in the MATOU database, which includes coding sequences from all major global ocean basins. These results indicate the expansion of coding sequences achieved by using eukrhythmic, but also that total number of coding sequences is not evenly expanded across samples (Fig. 8E); while in some samples >75% of coding sequences did not have a match to the MATOU database, in others it was <10%.

As much as 41.3% of the coding sequence products of the assembly for each sample could not be assigned a taxonomic annotation via EUKulele, but more so in samples from the Mediterranean Sea (Additional file 1: Fig. S21); mean Mediterranean Sea: $34.8\pm 5.2$%, mean Southern Ocean $28.0\pm 4.4$%). Among the fraction of sequences that had EUKulele taxonomic annotations and were not found in the MATOU database, dinoflagellates dominated the recorded number of coding sequences recovered in both basins (mean Mediterranean Sea: $12.0\pm 7.7$%, mean Southern Ocean: $34.0\pm 10.1$%; Fig. 8F). Dinoflagellates also dominated in terms of average proportion of total TPM (SO: 30.3±11.1%; MS: 8.6±6.6%), but not in terms of mean raw TPM assigned in the Southern Ocean (SO: Ochrophytes had the highest assigned TPM at 32853.1±66098.9, while dinoflagellates had 31853.4±82808.6 assigned TPM), because some samples dominated by dinoflagellates also had a relatively low number of reads assigned to coding sequences not found in the MATOU database (Fig. 8E, F). All of the taxonomic annotation information for the fraction of the eukrhythmic sequences that had a EUKulele annotation is summarized in Additional file 1: Fig. S22.

Our efforts expand the total coding sequence content available from global ocean metatranscriptomes, but also highlight the ongoing need for intercomparison of approaches. The average length of coding sequences that did not have a match to the MATOU database was 466.9±243.1 bases, while the average length of the coding sequences that did have a match was 613.5±438.0 bases (Additional file 1: Figs. S23, S24). The average length of the coding sequences with a match (i.e., recovered by both assembly efforts) was significantly longer (t=720.86; $p<$1e−16). The use of a k-mer size of 63 with the velvet assembler by [22] may also have contributed to this result: the rnaSPAdes assembler, for example, takes into account the varying coverage level of expression data by using a k-mer size that varies dynamically with read length [36]. This approach occasionally increases misassembly rate, but also protects rarely-expressed genes from being missed [36]. Because eukaryotic communities in the microbial ocean may be sparse and contain rare taxa, we argue that a more exhaustive assembly approach is warranted, even if the average length of assembled sequences is reduced.

Tara Oceans assembly raw read fidelity and non-coding sequences

Tara Oceans assemblies from the Mediterranean Sea and Southern Ocean varied in their overall composition as well as the accuracy of their recovery in the assembly process. While we focused the balance of our analysis on predicted coding sequences to compare to the analysis by [22], we note that via Salmon mapping, an average of 30.1±10.7% of the raw reads for the Mediterranean Sea samples mapped back to the coding sequences, as compared to 51.5±13.3% for the full assembly, while in the Southern Ocean samples, 51.5±11.6% of the raw reads mapped back to coding sequences as compared to 76.4±10.3% for the full assembly (Additional file 1: Fig. S22). This indicates that in both cases, a substantial fraction (>20%) of the original raw reads can be assembled into contigs, but appear to be non-coding. These non-coding sequences may be involved in important regulatory processes [80, 81], such as nutrient stress in diatoms [82], hence should not be excluded from consideration.

Multiple assemblers improve metatranscriptome assembly of Narragansett Bay phytoplankton

We benchmarked the eukrhythmic pipeline using a previously analyzed marine metatranscriptomic dataset [23] (Fig. 9). In particular, we were able to recapitulate the taxonomic composition of the diatom-dominated community described in [23]. Across all assemblers, representative from the phylum Ochrophyta were suggested to be dominant members of the community (Fig. 9A, D; Additional file 1: Fig. S19), and moreover the genera Skeletonema and Thalassiosira were recovered in expected proportions, with Skeletonema producing a numerically-dominant bloom determined via cell counts obtained from microscopy in sample S2 (Additional file 1: Fig. S15). Notably, our assembly recovered a greater diversity of diatom species than the raw read mapping method used previously (Fig. 9; [23]), including the recovery of multiple species of Skeletonema known to be present in this ecosystem ( [83]; Fig. 9D).

While broad patterns in taxonomic annotations were indistinguishable between the different assemblers and the majority of KEGG Orthology (KO) IDs were recovered by all four assemblers (Fig. 9B, C), the assemblers showed some differences with respect to the abundance of each functional annotation. In particular, MEGAHIT reported fewer instances of each functional gene grouping than rnaSPAdes, and fewer than Trinity approximately half of the time (Fig. 9; Additional file 1: Fig. S12). rnaSPAdes appeared to report a lower overall abundance of diatoms when the normalized TPM metric returned by Salmon [71] was used, but this pattern did not hold when non-normalized raw reads were used instead (Fig. 9; Additional file 1: Figs. 14, 15, 16). Contigs generated by the assemblers that were successfully annotated as Skeletonema or some other diatoms appeared to have longer average length than the average among all taxa (mean length of Skeletonema contigs with standard error of the mean: $618.6\pm 0.7$; overall mean: $396.6\pm 0.07$; two sample t-test t = 310.17 $p<2.2e-16$; Additional file 1: Fig. S18). rnaSPAdes produced a disproportionately high number of contigs relative to the other assemblers, many of these contigs belonging to non-diatom taxa (Additional file 1: Fig. S18). These contigs also tended to be shorter in the rnaSPAdes assemblies, both for non-diatom taxa (rnaSPAdes mean ± standard error = $377.7\pm 0.1$; overall mean = $421.1\pm 0.09$; t-test t=−367.17, $p<2.2e-16$) and for unannotated contigs (rnaSPAdes mean ± standard error = $264.7\pm 0.7$; overall mean = $300.8\pm 0.07$; t-test t=−498.5, $p<2.2e-16$). While these differences were universal for annotated non-diatom and unannotated contigs, rnaSPAdes produced shorter Skeletonema contigs than Trinity (t=−101.1; $p<2.2e-16$), but longer Skeletonema contigs than both MEGAHIT (t=41.6; $p<2.2e-16$) and SPAdes (t=64.0; $p<2.2e-16$).

Discussion

Metatranscriptome analysis has become a widespread approach for extracting taxonomic and functional information from protistan communities across a variety of environments ranging from coastal to open ocean marine ecosystems to soil ecosystems [12,13,14, 23, 69, 84]. Here, we designed a multi-assembler pipeline for metatranscriptomic assembly, eukrhythmic, and evaluated its performance on both simulated metatranscriptome data from the MMETSP [20] as well as on previously published metatranscriptome datasets [22, 23]. In doing this, we explored the relative performance of commonly-used assemblers, and determined that a multi-assembler approach improves the outcomes of metatranscriptome assembly with regard to recapitulating proteins and their taxonomic and functional annotations.

Scalable and reproducible pipelines like eukrhythmic enhance intercomparison and advance computational research

Eukrhythmic enables the simultaneous processing of many metatranscriptomes at once, and its modular design allows for reproducibly reprocessing the results of previous analyses as new tools become available. As datasets become larger, research questions now aim to tackle ambitious questions across space and time scales. Increasingly complex datasets necessitate careful workflow management [85, 86]. We have developed a pipeline that manages large metatranscriptomic datasets with the goal of assessing the diversity and function of marine protists, and have demonstrated the usefulness of our modular tool via the new insights the tool draws from previously analyzed, published metatranscriptomes [22, 23]. In particular, the reanalysis of data from [22, 23] highlights the expanded analytical insights that might be derived from a multi-assembler approach such as that provided by eukrhythmic, and these datasets can continue to be reproducibly re-analyzed with eukrhythmic as improved assembly tools become available. Further, despite the roughly sixteen discrete software steps that occur during a run of eukrhythmic, processes are run in parallel and can be deployed to the cluster, meaning that reanalysis that would ordinarily take multiple days per sample could now take the same amount of time for the entire project, resource-dependent.

Do metatranscriptomes capture the diversity of protistan communities?

Environmental metatranscriptomes are a community mosaic of ephemeral RNA-based signals of expression. Metatranscriptomes are increasingly a routine diagnostic tool for making important conclusions about community composition and function within marine systems [12,13,14, 23, 69, 87, 88], and are being applied to establish comparison at global scales [22, 69] and over long time periods [89,90,91]. Despite this, best practices for physical collection, molecular processing, and bioinformatic analyses have yet to be established [86]. Towards the standardization of computational approaches to marine protistan metatranscriptomics [86], we have shown that a multi-tiered eukaryotic metatranscriptome assembly pipeline recapitulates annotated contigs from a mock transcriptome community. Notably, the contigs produced from multiple assemblers tend to be of the highest quality with regard to their similarity to the original contigs from the transcriptome assemblies via clustering, taxonomic, and functional annotations. We find that metatranscriptomic approaches to assess community diversity and function in the environment are indeed adequately and reproducibly recapitulating the taxonomic and functional diversity of the RNA pool and of those environments when they use assembly approaches similar to that employed by eukrhythmic.

Our reassembly of environmental metatranscriptomic datasets further highlights the power of the multi-assembler approach in recovering novel gene content. In the samples from diatom-dominated Narragansett Bay [23], we recovered a greater diversity of diatoms than raw read mapping alone in the original analysis, a level of diversity which aligns with other studies from the region [83]. From the Tara Oceans samples, we found novel protein sequences not recovered and included in a comprehensive, global analysis effort using a single assembler [22], more than half of which had functional and/or taxonomic annotations. Even when the final coding sequences were clustered and only contigs of sufficient length were retained following [22], all samples contained previously unknown coding sequences, and some samples contained more unknown than known sequences. Though not all of these coding sequences could be annotated, recent progress has been made towards annotating genes of unknown function [92], which can be highly abundant in metatranscriptomic data. These results demonstrate the value in reassembling previously analyzed datasets using multiple tools with different underlying algorithms.

One important note is that even the best resource that we have available for consensus-based taxonomic annotation of de novo mixed community metatranscriptome assemblies constrains our efforts before we begin: laboratory-derived sequenced transcriptomes of single organisms cannot be fully reverse-annotated. In other words, even when we use sequence search tools to recover the taxonomic annotation of a contig present in the database, some of these sequences are too short or share a non-negligible percentage of sequences between organisms and cannot be annotated to a fine level of resolution, even in their unmodified state. In these cases, the fact that we can recover many, but not all, of originally annotated contigs after reverse-engineering the community tells us more about the limits of taxonomic annotation via short-read sequences than it does about the pitfalls of the assembly process. Hence, it is critical that we continue to consider the shortcomings of the annotation process as we analyze and re-analyze metatranscriptomic datasets.

Is there one best assembler for eukaryotic environmental metatranscriptomes?

An additional goal of our analysis was to compare the performance of different assemblers on eukaryotic metatranscriptome data, and to determine whether the use of multiple assemblers is warranted. According to our results, no single assembler we assessed (MEGAHIT [47], rnaSPADes [36], metaSPAdes [48], and Trinity [61]) is universally the best choice. De novo sequence assembly has both technical and practical considerations. Beyond simply balancing run time, memory requirements, and optimal accuracy, the performance of assemblers is difficult to evaluate. In particular in community assembly, low sequencing depth may complicate typical approaches used to reduce the effects of sequencing error. In our study, two assemblers stood out as the bookends of the spectrum of assembly approaches. MEGAHIT produced long contigs, but had the lowest percentage mapping of raw reads to the assembly, while rnaSPAdes routinely had the highest raw read mapping percentage and number of functional annotations (Fig. 9; Additional file 1: Fig. S12), but had shorter contigs on average, and a high incidence of transcripts that did not appear to be coding. These patterns held in both simulated and environmental datasets (Figs. 3 and 6; Additional file 1: Figs. S2–S5).

The spectrum of approaches adopted by assemblers also had a significant impact on the interpretation of assembly products. This effect may be clearest when considering how the average length of contigs recovered by an assembler (e.g. MEGAHIT and rnaSPAdes as in the example above) can directly skew interpretation of community composition. Shorter transcripts will recruit fewer reads, but will appear more abundant when a normalization that takes sequence length into account is used [93]. Because assemblers that work like rnaSPAdes produce a greater number of shorter contigs that may not be annotated, organisms or individual contigs or predicted genes with longer transcript length appear comparatively less abundant when reads are normalized, but not when non-normalized raw reads alone are used (Additional file 1: Figs. S15, S16, S17). For example, in the Narragansett Bay samples, we observe that diatom Skeletonema appears to have higher mean contig length - at least in the contigs which can be accurately assembled and taxonomically labeled by the EUKulele tool. However, conventional community composition metrics like TPM that normalize to contig length will penalize the recruitment of raw reads to these longer-than-mean contigs. Yet, as has been well described for transcriptomes, using raw reads leaves interpretation vulnerable to biases related to sequencing depth, sequencing approach, and transcript length, intuitively because longer transcripts are expected to recruit a greater number of raw reads by virtue of their size [93]. In a mixed community sample, and particularly in marine communities in which organisms are extraordinarily diverse, normalizations need to take the heterogeneity of the community into account.

Taken together, these results support the potential utility of merging the subtly different approaches taken by different assembly tools, in order to maximize gene recovery whilst also retaining the distinct signatures that make community composition interpretable. rnaSPAdes-like approaches improve functional recovery, while MEGAHIT-like assemblers produce longer sequences, which possibly have higher fidelity to the observed community. This observation further raises the question of how we can or should be extracting community composition insights from metatranscriptomes, especially when samples cannot or have not been normalized to housekeeping or spike-in sequences.

Should we be downsizing metatranscriptomic assemblies?

Computational constraints continue to limit the scale of metatranscriptomic analyses, as downstream tools for e.g. abundance quantification and functional annotation may have sizable memory requirements for excessively large assembly files [94]. Here, we advocate for a multi-assembler approach to metatranscriptome assembly. As we have discussed, the multi-assembler approach generates a greater number of total predicted coding sequences, and many of the additional coding sequences assembled from our simulated dataset are similar taxonomically, functionally, and via sequence identity to coding sequences from the designer assembly (Figs. 5, 6, 7; Additional file 1: Fig. S8). However, using a multi-assembler approach will create larger assemblies, and users need to be cognizant of the complexity of their dataset and memory usage requirements downstream. Researchers may address excessive assembly size by (1) intentionally limiting the assembly to downsized, high-quality content, or (2) by more strictly clustering assembly products, the choice between which comes down to the research question.

The contigs statistically most likely to contain detectable open reading frames and to closely resemble “true” sequence content in a real-world sample via blast sequence search and mmseqs2 clustering are those that multiple assemblers can agree upon (Fig. 7; Additional file 1: Figs. S8–S11). Researchers may choose to maximize confidence in assembly products by using only the contigs discovered by more than one assembler, or may expand the total number of recovered genes by leveraging multiple algorithms. While the intention of eukrhythmic is to combine the outputs of multiple independently-contributing tools, in an analysis in which the goal is to extract only the products which can be assumed to be of the highest quality, the smaller intersection between assembly tools may be retained. This would also substantially reduce the number of sequences, improving the computational feasibility of downstream analyses. For example, if a researcher was interested specifically in generating a core set of high-confidence genes for a site and then mapping raw reads back to a combined assembly to detect changes in expression over time and space, multiple assembly may furnish a set of transcripts more likely to both be accurately retrieved from the original samples as well as ecologically relevant for mapping. However, it is important to note two pitfalls of this approach. First, this substantially decreases the proportion of the raw reads that are represented in the final contigs after assembly (Fig. 7). Secondly, while some assemblers produce a higher number of sequence products that do not have detectable similarity to the “true” contigs from which raw reads were simulated, they also produce a number of unique sequences that are detectable in the original assembly and, crucially, not identified by any of the other assembly approaches (e.g., rnaspades).

Researchers may also opt to cluster the resulting assembly in accordance with their desired final file size or level of sequence redundancy. Using mmseqs2 clustering [50], we found that for our combined assemblies, the choice of clustering parameters is an important one, with potentially significant reductions in file size being possible without appreciable impact on the functional and taxonomic profile of the assembled metatranscriptome (Fig. 4; Additional file 1: Fig. S1). Approaches are now being developed to more reliably and efficiently cluster predicted genes of both known and unknown function, for example using tools like mmseqs2 coupled to functional domain information and probabilistic modeling [92]. Such approaches are particularly useful for fitting coding sequences from a single assembly into the context of expansive datasets in space and/or time, from which many millions of total coding sequences can be extracted, and computational processing becomes exceptionally limiting [92].

eukrhythmic: an approach for optimized multi-assembler metatranscriptome assembly

Metatranscriptome quality cannot be assessed using either genomic or single-organism metrics. Instead, the assembly products need to be considered as potentially novel genetic content when assessing assembly success. Here, we present eukrhythmic, a workflow for assembling environmental metatranscriptomes of eukaryotic communities by leveraging multiple assemblers. We evaluate our pipeline both using existing environmental metatranscriptomes and simulated community data that we generated using a second pipeline, jEUKebox. The flexible jEUKebox pipeline can be reused as additional reference sequences become available in order to test community ecology hypotheses for cultured and uncultured organisms. Simulating communities and testing their ability to be recovered is an essential step in ensuring the fidelity of metatranscriptome studies as the volume of taxonomic and functional data available to make predictions grows. In particular, we envision the construction of metacommunity data using uncultured organisms inferred from metagenomic sequences (metagenome-assembled genomes (MAGs)) [10]. Our inability to annotate some unmodified contigs from the original simulated community highlights crucial questions about the limits of annotation. Are some genes destined to remain difficult to annotate (taxonomically and functionally), either because they vary too dramatically between organisms, hence an organism-specific and highly complete genome is needed to accurately identify them, or because they are part of an indistinguishable cluster of highly similar genes? Can we be sure that these are true genes, or could they be artifacts of the assembler originally used to generate the reference assemblies? Rigorous simulations of communities may help to identify these difficult-to-annotate genes and to set thresholds that prevent erroneous annotations, in conjunction with new approaches for annotating unknown genes [95]. Computational simulations should be paired with laboratory curation of cultivated communities and accompanying metatranscriptomic sequencing which can be compared with count data. Promising plans for executing these steps are already in place [96].

Critical assessment of the accuracy and quality of metatranscriptome assembly and quantification of technical impacts, such as clustering similarity or the algorithms used to construct contigs, provide confidence for ecological interpretations. The eukrhythmic pipeline represents a reproducible roadmap towards assembling new eukaryotic environmental metatranscriptomes, and reassembling the growing repository of existing eukaryotic environmental metatranscriptomes with multiple assemblers. This flexible tool that researchers can use to standardize the crucial steps of metatranscriptome analysis is a step towards the standardization and validation eukaryotic metatranscriptome assembly. With the consistent use of software tools and pre- and post-processing steps that eukrhythmic enables, metatranscriptome assembly has the potential to unlock the functional roles of largely uncharacterized eukaryotic microbes that drive biogeochemistry across diverse natural ecosystems. Standardized workflows for eukaryotic metatranscriptome assembly like eukrhythmic and community simulations as powered by tools like jEUKebox are a vital means to validate these discoveries.

Availability of data and materials

All code is available in public GitHub repositories https://github.com/alexanderlabwhoi/eukrhythmic and https://github.com/alexanderlabwhoi/jeukebox. Sample output from eukrhythmic for the simulated trials is available at https://osf.io/te7sp/. Code used to generate figures is availability in a public GitHub repository at https://github.com/akrinos/2022-Krinos-eukrhythmic.

Change history

05 May 2023
A Correction to this paper has been published: https://doi.org/10.1186/s12859-023-05313-0

References

Massana R, Pedrós-Alió C. Unveiling new microbial eukaryotes in the surface ocean. Curr Opin Microbiol. 2008;11(3):213–8.
Article PubMed Google Scholar
Worden AZ, Follows MJ, Giovannoni SJ, Wilken S, Zimmerman AE, Keeling PJ. Rethinking the marine carbon cycle: factoring in the multifarious lifestyles of microbes. Science. 2015;347(6223):1257594.
Article PubMed Google Scholar
Caron DA, Alexander H, Allen AE, Archibald JM, Armbrust EV, Bachy C, Bell CJ, Bharti A, Dyhrman ST, Guida SM, et al. Probing the evolution, ecology and physiology of marine protists using transcriptomics. Nat Rev Microbiol. 2017;15(1):6–20.
Article CAS PubMed Google Scholar
Caron DA, Worden AZ, Countway PD, Demir E, Heidelberg KB. Protists are microbes too: a perspective. ISME J. 2009;3(1):4–12.
Article CAS PubMed Google Scholar
Lawler SP, Morin PJ. Food web architecture and population dynamics in laboratory microcosms of protists. Am Nat. 1993;141(5):675–86.
Article CAS PubMed Google Scholar
Stoecker DK. Conceptual models of mixotrophy in planktonic protists and some ecological and evolutionary implications. Eur J Protistol. 1998;34(3):281–90.
Article Google Scholar
Sherr EB, Sherr BF. Significance of predation by protists in aquatic microbial food webs. Antonie Van Leeuwenhoek. 2002;81(1):293–308.
Article CAS PubMed Google Scholar
Del Campo J, Guillou L, Hehenberger E, Logares R, López-García P, Massana R. Ecological and evolutionary significance of novel protist lineages. Eur J Protistol. 2016;55:4–11.
Article PubMed PubMed Central Google Scholar
Del Campo J, Balagué V, Forn I, Lekunberri I, Massana R. Culturing bias in marine heterotrophic flagellates analyzed through seawater enrichment incubations. Microb Ecol. 2013;66(3):489–99.
Article CAS PubMed Google Scholar
Alexander H, Hu SK, Krinos AI, Pachiadaki M, Tully BJ, Neely CJ, Reiter T. Eukaryotic genomes from a global metagenomic dataset illuminate trophic modes and biogeography of ocean plankton. bioRxiv. 2021.
Delmont TO, Gaia M, Hinsinger DD, Frémont P, Vanni C, Fernandez-Guerra A, Eren AM, Kourlaiev A, d’Agata L, Clayssen Q, et al. Functional repertoire convergence of distantly related eukaryotic plankton lineages abundant in the sunlit ocean. Cell Genom. 2022;2(5): 100123.
Article CAS PubMed PubMed Central Google Scholar
Gifford SM, Sharma S, Rinta-Kanto JM, Moran MA. Quantitative analysis of a deeply sequenced marine microbial metatranscriptome. ISME J. 2011;5(3):461–72.
Article PubMed Google Scholar
Becker KW, Harke MJ, Mende DR, Muratore D, Weitz JS, DeLong EF, Dyhrman ST, Van Mooy BA. Combined pigment and metatranscriptomic analysis reveals highly synchronized diel patterns of phenotypic light response across domains in the open oligotrophic ocean. ISME J. 2021;15(2):520–33.
Article CAS PubMed Google Scholar
Salazar G, Paoli L, Alberti A, Huerta-Cepas J, Ruscheweyh H-J, Cuenca M, Field CM, Coelho LP, Cruaud C, Engelen S, et al. Gene expression changes and community turnover differentially shape the global ocean metatranscriptome. Cell. 2019;179(5):1068–83.
Article CAS PubMed PubMed Central Google Scholar
Stewart FJ, Ulloa O, DeLong EF. Microbial metatranscriptomics in a permanent marine oxygen minimum zone. Environ Microbiol. 2012;14(1):23–40.
Article CAS PubMed Google Scholar
John DE, Zielinski BL, Paul JH. Creation of a pilot metatranscriptome library from eukaryotic plankton of a eutrophic bay (Tampa Bay, Florida). Limnol Oceanogr Methods. 2009;7(3):249–59.
Article CAS Google Scholar
Sunagawa S, Acinas SG, Bork P, Bowler C, Eveillard D, Gorsky G, Guidi L, Iudicone D, Karsenti E, Lombard F, et al. Tara Oceans: towards global ocean ecosystems biology. Nat Rev Microbiol. 2020;18(8):428–45.
Article CAS PubMed Google Scholar
Poretsky RS, Bano N, Buchan A, LeCleir G, Kleikemper J, Pickering M, Pate WM, Moran MA, Hollibaugh JT. Analysis of microbial gene transcripts in environmental samples. Appl Environ Microbiol. 2005;71(7):4121–6.
Article CAS PubMed PubMed Central Google Scholar
Gilbert JA, Field D, Huang Y, Edwards R, Li W, Gilna P, Joint I. Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PloS One. 2008;3(8):e3042.
Article PubMed PubMed Central Google Scholar
Keeling PJ, Burki F, Wilcox HM, Allam B, Allen EE, Amaral-Zettler LA, Armbrust EV, Archibald JM, Bharti AK, Bell CJ, et al. The marine microbial eukaryote transcriptome sequencing project (MMETSP): illuminating the functional diversity of eukaryotic life in the oceans through transcriptome sequencing. PLoS Biol. 2014;12(6):1001889.
Article Google Scholar
Krinos AI, Hu SK, Cohen NR, Alexander H. EUKulele: taxonomic annotation of the unsung eukaryotic microbes. J Open Source Softw. 2021;6(57):2817. https://doi.org/10.21105/joss.02817.
Article Google Scholar
Carradec Q, Pelletier E, Da Silva C, Alberti A, Seeleuthner Y, Blanc-Mathieu R, Lima-Mendez G, Rocha F, Tirichine L, Labadie K, et al. A global ocean atlas of eukaryotic genes. Nat Commun. 2018;9(1):1–13.
Article CAS Google Scholar
Alexander H, Jenkins BD, Rynearson TA, Dyhrman ST. Metatranscriptome analyses indicate resource partitioning between diatoms in the field. Proc Natl Acad Sci. 2015;112(17):2182–90.
Article Google Scholar
Johnson LK, Alexander H, Brown CT. Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. Gigascience. 2019;8(4):158.
Article Google Scholar
Daniels C, Baumgarten S, Yum LK, Michell CT, Bayer T, Arif C, Roder C, Weil E, Voolstra CR. Metatranscriptome analysis of the reef-building coral Orbicella faveolata indicates holobiont response to coral disease. Front Mar Sci. 2015;2:62.
Article Google Scholar
Lesniewski RA, Jain S, Anantharaman K, Schloss PD, Dick GJ. The metatranscriptome of a deep-sea hydrothermal plume is dominated by water column methanotrophs and lithotrophs. ISME J. 2012;6(12):2257–68.
Article CAS PubMed PubMed Central Google Scholar
Richter D. Metagenomics and metatranscriptomes of oceanic communities. Phycologia. 2017;56(4):158.
Google Scholar
Leimena MM, Ramiro-Garcia J, Davids M, van den Bogert B, Smidt H, Smid EJ, Boekhorst J, Zoetendal EG, Schaap PJ, Kleerebezem M. A comprehensive metatranscriptome analysis pipeline and its validation using human small intestine microbiota datasets. BMC Genomics. 2013;14(1):530.
Article CAS PubMed PubMed Central Google Scholar
Davids M, Hugenholtz F, dos Santos VM, Smidt H, Kleerebezem M, Schaap PJ. Functional profiling of unfamiliar microbial communities using a validated de novo assembly metatranscriptome pipeline. PloS One. 2016;11(1):e0146423.
Article PubMed PubMed Central Google Scholar
Westreich ST, Treiber ML, Mills DA, Korf I, Lemay DG. SAMSA2: a standalone metatranscriptome analysis pipeline. BMC Bioinform. 2018;19(1):175.
Article Google Scholar
Vijay N, Poelstra JW, Künstner A, Wolf JB. Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments. Mol Ecol. 2013;22(3):620–34.
Article CAS PubMed Google Scholar
MacManes MD. The Oyster River Protocol: a multi-assembler and kmer approach for de novo transcriptome assembly. PeerJ. 2018;6:5428.
Article Google Scholar
Ortiz R, Gera P, Rivera C, Santos JC. Pincho: a modular approach to high quality de novo transcriptomics. Genes. 2021;12(7):953.
Article CAS PubMed PubMed Central Google Scholar
Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–2.
Article PubMed Google Scholar
Jauhal AA, Newcomb RD. Assessing genome assembly quality prior to downstream analysis: N50 versus BUSCO. Molecular Ecology Resources. 2021.
Bushmanova E, Antipov D, Lapidus A, Prjibelski AD. rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data. GigaScience. 2019;8(9):100.
Article Google Scholar
Jiang Y, Xiong X, Danska J, Parkinson J. Metatranscriptomic analysis of diverse microbial communities reveals core metabolic pathways and microbiome-specific functionality. Microbiome. 2016;4(1):1–18.
Article Google Scholar
Almeida A, Mitchell AL, Tarkowska A, Finn RD. Benchmarking taxonomic assignments based on 16S rRNA gene profiling of the microbiota from commonly sampled environments. GigaScience. 2018;7(5):054.
Article Google Scholar
Anwar MZ, Lanzen A, Bang-Andreasen T, Jacobsen CS. To assemble or not to resemble-a validated comparative metatranscriptomics workflow (CoMW). GigaScience. 2019;8(8):096.
Article Google Scholar
Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–20.
Article CAS PubMed PubMed Central Google Scholar
Bushnell B. BBMap: a fast, accurate, splice-aware aligner. Technical report, Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). 2014.
Honaas LA, Wafula EK, Wickett NJ, Der JP, Zhang Y, Edger PP, Altman NS, Pires JC, Leebens-Mack JH, DePamphilis CW. Selecting superior de novo transcriptome assemblies: lessons learned by leveraging the best plant genome. PLoS One. 2016;11(1):0146062.
Article Google Scholar
Clarke K, Yang Y, Marsh R, Xie L, et al. Comparative analysis of de novo transcriptome assembly. Sci China Life Sci. 2013;56(2):156–62.
Article CAS PubMed PubMed Central Google Scholar
Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40(20):155–155.
Article Google Scholar
Simpson JT, Durbin R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 2012;22(3):549–56.
Article CAS PubMed PubMed Central Google Scholar
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29(7):644–52.
Article CAS PubMed PubMed Central Google Scholar
Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31(10):1674–6.
Article CAS PubMed Google Scholar
Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27(5):824–34.
Article CAS PubMed PubMed Central Google Scholar
Cerveau N, Jackson DJ. Combining independent de novo assemblies optimizes the coding transcriptome for nonconventional model eukaryotic organisms. BMC Bioinform. 2016;17(1):525.
Article Google Scholar
Mirdita M, Steinegger M, Söding J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics. 2019;35(16):2856–8.
Article CAS PubMed PubMed Central Google Scholar
El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, Qureshi M, Richardson LJ, Salazar GA, Smart A, et al. The Pfam protein families database in 2019. Nucleic Acids Res. 2019;47(D1):427–32.
Article Google Scholar
Haas B, Papanicolaou A. TransDecoder identifies candidate coding regions within transcript sequences. 2021.
Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, Von Mering C, Bork P. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Mol Biol Evol. 2017;34(8):2115–22.
Article CAS PubMed PubMed Central Google Scholar
Kanehisa M, et al., The KEGG database. In: Novartis Foundation Symposium, Wiley Online Library; 2002. pp. 91–100.
Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27(3):379–423.
Article Google Scholar
Brown CT, Irber L. sourmash: a library for MinHash sketching of DNA. J Open Source Softw. 2016;1(5):27.
Article Google Scholar
Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun. 2018;9(1):1–8.
Article Google Scholar
Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20(1):1–14.
Article Google Scholar
Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16(1):1–14.
Article CAS Google Scholar
Liao Y, Smyth GK, Shi W. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res. 2019;47(8):47–47.
Article Google Scholar
Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013;8(8):1494–512.
Article CAS PubMed Google Scholar
Hölzer M, Marz M. De novo transcriptome assembly: a comprehensive cross-species comparison of short-read RNA-Seq assemblers. Gigascience. 2019;8(5):039.
Article Google Scholar
Bushmanova E, Antipov D, Lapidus A, Suvorov V, Prjibelski AD. rnaQUAST: a quality assessment tool for de novo transcriptome assemblies. Bioinformatics. 2016;32(14):2210–2.
Article CAS PubMed Google Scholar
Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35(11):1026–8.
Article CAS PubMed Google Scholar
Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat Commun. 2018;9(1):1–8.
Article CAS Google Scholar
Klemetsen T, Raknes IA, Fu J, Agafonov A, Balasundaram SV, Tartari G, Robertsen E, Willassen NP. The MAR databases: development and implementation of databases specific for marine metagenomics. Nucleic Acids Res. 2018;46(D1):692–9.
Article Google Scholar
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261–72.
Article CAS PubMed PubMed Central Google Scholar
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. R Foundation for Statistical Computing. 2021. https://www.R-project.org/
Vorobev A, Dupouy M, Carradec Q, Delmont TO, Annamalé A, Wincker P, Pelletier E. Transcriptome reconstruction and functional analysis of eukaryotic marine plankton communities via high-throughput metagenomics and metatranscriptomics. Genome Res. 2020;30(4):647–59.
Article CAS PubMed PubMed Central Google Scholar
Sunagawa S, Coelho LP, Chaffron S, Kultima JR, Labadie K, Salazar G, Djahanschiri B, Zeller G, Mende DR, Alberti A, et al. Structure and function of the global ocean microbiome. Science. 2015;348(6237):1261359.
Article PubMed Google Scholar
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14(4):417–9.
Article CAS PubMed PubMed Central Google Scholar
Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–9.
Article CAS PubMed PubMed Central Google Scholar
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
Article CAS PubMed Google Scholar
Van Rossum G, Drake FL Jr. Python Reference Manual. Centrum voor Wiskunde en Informatica Amsterdam; 1995.
Kibirige H, Lamp G, Katins J, gdowding, austin, matthias-k, Funnell T, Finkernagel F, Arnfred J, Blanchard D, Astanin S, Chiang E, Kishimoto PN, Sheehan E, stonebig, Willers, B, Gibboni R, smutch, Halchenko, Y, Pavel, King, B, RK M, Collins J, zachcp, Anthony, Koopman, B, Grohmann CH, Becker D, Brown D, Saiz D. Has2k1/plotnine: V0.8.0. https://doi.org/10.5281/zenodo.4636791.
Wickham H. Ggplot2: elegant graphics for data analysis. Springer; 2016. (https://ggplot2.tidyverse.org).
Pedersen TL. patchwork: the composer of plots. R package version 1.1.1. 2020. https://CRAN.R-project.org/package=patchwork
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
Article CAS PubMed PubMed Central Google Scholar
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinform. 2009;10(1):1–9.
Article Google Scholar
Budak H, Kaya SB, Cagirici HB. Long non-coding RNA in plants in the era of reference sequences. Front Plant Sci. 2020;11:276.
Article PubMed PubMed Central Google Scholar
Rogato A, Richard H, Sarazin A, Voss B, Navarro SC, Champeimont R, Navarro L, Carbone A, Hess WR, Falciatore A. The diversity of small non-coding RNAs in the diatom Phaeodactylum tricornutum. BMC Genom. 2014;15(1):1–20.
Article CAS Google Scholar
Lopez-Gomollon S, Beckers M, Rathjen T, Moxon S, Maumus F, Mohorianu I, Moulton V, Dalmay T, Mock T. Global discovery and characterization of small non-coding RNAs in marine microalgae. BMC Genom. 2014;15(1):1–13.
Article Google Scholar
Canesi KL, Rynearson TA. Temporal variation of Skeletonema community composition from a long-term time series in Narragansett Bay identified using high-throughput DNA sequencing. Mar Ecol Prog Ser. 2016;556:1–16.
Article Google Scholar
Damon C, Lehembre F, Oger-Desfeux C, Luis P, Ranger J, Fraissinet-Tachet L, Marmeisse R. Metatranscriptomics reveals the diversity of genes expressed by eukaryotes in forest soils. PLoS One. 2012;7(1):28967.
Article Google Scholar
Reiter T, Brooks PT, Irber L, Joslin SE, Reid CM, Scott C, Brown CT, Pierce-Ward NT. Streamlining data-intensive biology with workflow systems. GigaScience. 2021;10(1):140.
Article Google Scholar
Cohen N, Alexander H, Krinos A, Hu SK, Lampe RH. Marine microeukaryote metatranscriptomics: sample processing and bioinformatic workflow recommendations for ecological applications. Front Marine Sci. 2022;858.
Gilbert JA, Meyer F, Schriml L, Joint IR, Mühling M, Field D. Metagenomes and metatranscriptomes from the L4 long-term coastal monitoring station in the Western English Channel. Stand Genom Sci. 2010;3(2):183–93.
Article Google Scholar
Nowinski B, Smith CB, Thomas CM, Esson K, Marin R, Preston CM, Birch JM, Scholin CA, Huntemann M, Clum A, et al. Microbial metagenomes and metatranscriptomes during a coastal phytoplankton bloom. Scientific Data. 2019;6(1):1–7.
Article CAS Google Scholar
Vislova A, Aylward F, Sosa O, DeLong E. Metatranscriptome sequence analysis reveals diel periodicity of microbial community gene expression in the ocean’s interior. Am Geophys Union. 2016;2016:44–0482.
Google Scholar
Ollison GA, Hu SK, Mesrop LY, DeLong EF, Caron DA. Come rain or shine: depth not season shapes the active protistan community at station ALOHA in the North Pacific Subtropical Gyre. Deep Sea Res Part I. 2021;170: 103494.
Article Google Scholar
Hu SK, Liu Z, Alexander H, Campbell V, Connell PE, Dyhrman ST, Heidelberg KB, Caron DA. Shifting metabolic priorities among key protistan taxa within and below the euphotic zone. Environ Microbiol. 2018;20(8):2865–79.
Article CAS PubMed Google Scholar
Vanni C, Schechter MS, Delmont TO, Eren AM, Steinegger M, Glöckner FO, Fernandez-Guerra A. AGNOSTOS-DB: a resource to unlock the uncharted regions of the coding sequence space. bioRxiv. 2021.
Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 2012;131(4):281–5.
Article CAS PubMed Google Scholar
Shakya M, Lo C-C, Chain PS. Advances and challenges in metatranscriptomic analysis. Front Genet. 2019;904.
Vanni C, Schechter MS, Acinas SG, Barberán A, Buttigieg PL, Casamayor EO, Delmont TO, Duarte CM, Eren AM, Finn RD, et al. Unifying the known and unknown microbial coding sequence space. Elife. 2022;11:67667.
Article Google Scholar
Berube P, Gifford S, Hurwitz B, Jenkins B, Marchetti A, Santoro A. Roadmap towards community-wide intercalibration and standardization of ocean nucleic acids ’omics measurements. https://doi.org/10.1575/1912/28054. https://hdl.handle.net/1912/28054

Download references

Acknowledgements

We gratefully acknowledge Margaret Mars Brisbin and Sarah Hu for their invaluable feedback on the pipeline and contribution to testing. We also thank Celeste Nobrega for her work on this project as a Guest Student at Woods Hole Oceanographic Institution. The Poseidon High Performance Computing cluster at Woods Hole Oceanographic Institution was used to run all analyses.

Funding

We gratefully acknowledge funding that supported this work from the Simons Collaboration on Computational Biogeochemical Modeling of Marine Ecosystems (CBIOMES) (grant #549931), a National Science Foundation grant (OCE-1948025 to HA), the Simons Early Career Investigator in Marine Microbial Ecology and Evolution Award (grant #931886 to HA), and the Department of Energy Computational Science Graduate Fellowship (DE-SC0020347 to AIK).

Author information

Authors and Affiliations

MIT-WHOI Joint Program in Oceanography and Applied Ocean Science and Engineering, Cambridge and Woods Hole, MA, USA
Arianna I. Krinos
Department of Biology, Woods Hole Oceanographic Institution, Woods Hole, MA, USA
Arianna I. Krinos & Harriet Alexander
Department of Earth, Atmospheric, and Planetary Science, Massachusetts Institute of Technology, Cambridge, MA, USA
Arianna I. Krinos & Michael J. Follows
Skidaway Institute of Oceanography, University of Georgia, Savannah, GA, USA
Natalie R. Cohen

Authors

Arianna I. Krinos
View author publications
You can also search for this author in PubMed Google Scholar
Natalie R. Cohen
View author publications
You can also search for this author in PubMed Google Scholar
Michael J. Follows
View author publications
You can also search for this author in PubMed Google Scholar
Harriet Alexander
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

HA and AIK conceived the idea for the initial workflow, with input from NRC and MJF. AIK developed the code for eukrhythmic and jEUKebox with support from NRC and HA. AIK and HA wrote the manuscript with contributions from NRC; all authors reviewed and edited the manuscript.

Corresponding authors

Correspondence to Arianna I. Krinos or Harriet Alexander.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original version of this article was revised: figure legends have been updated.

Supplementary Information

Additional file 1: Fig. S1.

Visual summary of the effect of clustering with mmseqs2 on recovered enera, KOs (functional annotations) and file size in bytes. While reducing the equence identity threshold for clustering results in up to a 30% average reduction in file size, he number of recovered genera and functional annotations are only modestly impacted, specially at high coverage. An intermediate sequence identity of 0.8 and coverage of .8 would result in a 15-25% average reduction in file size, but leave distinct functional and taxonomic annotations unchanged. Fig. S2. Main text figure facetted by simulated METSP assembly group (two different sets of organisms). Of note is that the eukrhythmic reassembly accurately recapitulates the bimodal distribution in GC-content observed in MMETSP group A’s designer metatranscriptomic sequences. Fig. S3. Protein sequence lengths in the reassemblies as compared to the designer. The 1-to-1 line shows where sequences would fall if the average length of recovered protein sequences via TransDecoder were identical between the designer assemblies and the eukrhythmic-derived reassembled products; the fact that all samples fall in the lower right half of the plot indicates that protein sequences were consistently larger in the designer assemblies as compared to the eukrhythmic reassembled products. Fig. S4. Mean contig length as a function of the number of assemblers that found a sequence that matched the given description. For the 4-assembler cluster, this means that all four assemblers tested identified a sequence that matched the sequence included in the distribution when clustered within eukrhythmic. Panel A corresponds to MMETSP Group A while Panel B corresponds to MMETSP Group B. Welch’s independent T-tests and Kolmogorov-Smirnoff tests for between-distribution goodness of fit computed on these length distributions reveals that the overall distribution of lengths for 1 vs. 2 vs. 3 vs. 4-assembler distributions are all statistically significantly different (p < 1e−6), with greater assemblers within a cluster leading to higheraverage length. Fig. S5. Salmon mapping percentages against the simulated raw reads when computed individually against each of the four assembly tools used by eukrhythmic. rnaSPAdes consistently outperformed the other assemblers with regard to percentage mapping, average length, and number of annotations. Fig. S6. Main text figure, but with A: TPM comparison for the designer assemblies and the reassembled products from euk rhythmic labelled by distinct groupings of simulations relative to their subset of MMETSP organisms; each point is filled according to its “MMETSP group”. Fig. S7. Main text figure with the results split by “MMETSP group” to demonstrate that different taxonomic groupings of organisms included in the simulation does not impact the general trends observed in the results. All three panels are divided by samples using the two “MMETSP groups” of individual organism transcriptomes. Fig. S8. Salmon percentage mapping by MMETSP group when contig products from all assemblers as in eukrhythmic are used (bottom distribution with one or more assembler) as compared to when only contigs agreed upon by multiple assemblers are used. Average percentage mapping decreases progressively with fewer contigs being included as the criteria for inclusion are made stricter. Fig. S9. Proportion of contigs from each clustering subset that had ORFs extracted from the sequence. The vast majority of contigs had a single predicted ORF, but rnaSPAdes alone had the greatest number of nucleotide contigs on which an ORF could not be detected. In practice, these sequences might be assumed to be non-coding. Fig. S10. Proportion of contigs from each clustering subset that had ORFs extracted from the sequence across all contigs identified by the assembler. rnaSPAdes had a higher number of contigs without an identified ORF. Fig. S11. Proportion of contigs from the designer assembly that had ORFs extracted from the sequence. In general, fewer contigs did not have an identified ORF, and a higher number of contigs had multiple predicted ORFs than in the reassembled products from euk rhythmic. Fig. S12. Comparison of the abundance of KO IDs within functional annotations across Narragansett Bay samples and different combinations of metatranscriptome assemblers. A black dotted line indicates a one-to-one relationship, meaning that the abundance of KOs that fall along this line are exactly as abundant using the assembler listed on the x-axis and using the assembler listed on the y-axis. On the top left, Trinity is compared to MEGAHIT, on the top right Trinity is compared to rnaSPAdes, on the bottom left MEGAHIT is compared to rnaSPAdes and on the bottom right MEGAHIT is compared to SPAdes. Each point corresponds to a single KO within a sample. While rnaSPAdes tended to report high abundances of each identified KO relative to the other assemblers, MEGAHIT reported fewer instances of each KO than the other three assemblers in most samples. This may be due to the approaches adopted by the two assemblers. Whereas in a typical assembly, k-mers that appear only once are assumed to be the result of error, these k-mers may represent real and important diversity in a set of low-abundance whole-community sequences [47]. The MEGAHIT assembler is an example of metagenomics-specific software that defines “mercy k-mers” which come into play in between two k-mers within a single read that are sequenced more than once. rnaSPAdes, as an example, does not employ a “mercy” strategy but instead decreases the coverage threshold significantly in comparison to genomic assembly [36]. One is adopting the meta-omic assumption that exceedingly low coverage is plausibly not artificial, while the other is being more generous with respect to coverage and memory use, but intending merely to minimize the influence of sequencing error. The strategy adopted by MEGAHIT may result in incomplete consideration of isoforms, which could have contributed to the relatively low recovery rate of multiple copies of the identified KO groups by the MEGAHIT assemblies of the Narragansett Bay metatranscriptome samples. Fig. S13. Proportion of total normalized TPM value of major taxonomic groups according to the four metatranscriptomic assemblers that were tested. rnaSPAdes had a lower proportion of TPM assigned to Ochrophyta (including diatoms), but further investigation appeared this to be in large part a consequence of the large number of small contigs produced by rnaSPAdes. Fig. S14. Proportion of total normalized TPM as reported by Salmon from the raw reads assigned to each taxonomic category in the assemblies produced by each of the four assemblers (facets). In particular, in sample S2 the taxonomic breakdown of the eukaryotic community differs importantly in the midst of a bloom of the diatom Skeletonema. Fig. S15. Proportion of assigned raw reads as reported by Salmon assigned to each taxonomic category for the four metatranscriptome assemblers, expressed as a proportion of the total. Fig. S16. Proportion of assigned raw reads as reported by Salmon assigned to each taxonomic category for the four metatranscriptome assemblers, expressed as a proportion of the total. Fig. S17. Average contig length by annotation of contigs generated by the four assembly tools across samples. Error bars show the standard error of the mean. Fig. S18. Total number of contigs generated by the four tested assemblers for each of the taxonomic groupings considered. rnaSPAdes tended to produce more contigs than the other four assemblers, but these contigs were often shorter and occasionally led to misleading community composition results. Fig. S19. Analogous stacked abundance plots by taxonomic grouping to the relative abundance plots presented in Alexander et al. [23] for the five in-situ sampling points collected from Narragansett Bay. Fig. S20. Salmon percentage mapping of coding sequences (left) vs. entire contigs (right) for the Tara Oceans samples. Mapping only to coding sequences from the assembly decreased the mean percentage mapped, as reported by Salmon. Fig. S21. Per-sample taxonomic annotations for all euk rhythmic CDSs, including those which were and were not found to have a match to the MATOU database. Fig. S22. Total TPM assigned per sample to sequences that were assigned a EUKulele annotation, but did not have a significant blast match to the MATOU database from Carradec et al. [22]. Fig. S23. Full length distribution of coding sequences recovered by the eukrhythmic assembly and were (left) not found in the MATOU database [22] vs. (right) found in the MATOU database. Fig. S24. Per-sample average length of coding sequences recovered by the eukrhythmic assembly and were (left) not found in the MATOU database [22] vs. (right) found in the MATOU database. Table S1. (Supplement) Effect of clustering designer assembly on assembly size and annotations. Clustering was performed on the original “designer metatranscriptome” set of contigs from the MMETSP references using the mmseqs2 tool (Mirdata et al. 2019). eukrhythmic uses a coverage level of 0.98 and sequence identity of 1 for mmseqs2 clustering. See Supplementary Figure 1 for a graphical summary of the influence of sequence identity and coverage on the size of the recovered assembly and its functional and taxonomic annotations

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Krinos, A.I., Cohen, N.R., Follows, M.J. et al. Reverse engineering environmental metatranscriptomes clarifies best practices for eukaryotic assembly. BMC Bioinformatics 24, 74 (2023). https://doi.org/10.1186/s12859-022-05121-y

Download citation

Received: 08 July 2022
Accepted: 21 December 2022
Published: 03 March 2023
DOI: https://doi.org/10.1186/s12859-022-05121-y

Reverse engineering environmental metatranscriptomes clarifies best practices for eukaryotic assembly

Abstract

Background

Results

Conclusion

Background

Methods

Eukrhythmic pipeline

Data cleaning and trimming

Assembly

Merging and clustering

Protein translation

Annotation

Design of simulated community schema

Communities

Metrics for assessing community complexity

Simulation of eukaryotic communities using jEUKebox

Selection of transcriptomes

Similarity computation using fastANI

Identifying putative evolutionary relationships with OrthoFinder

Simulation of raw reads

Reassembly with eukrhythmic

Assessing reassembly quality

Assembly statistics

Clustering reassembled metatranscriptomic proteins with MMETSP-derived designer metatranscriptomic proteins

Assessing metatranscriptomic proteins using BLAST all-by-all comparison

Taxonomic annotations

Functional annotations

Assembling and evaluating environmentally relevant metatranscriptomes from the Tara Oceans project

Reassembling and evaluating previously explored metatranscriptomes from the Narragansett Bay time series

Data processing and visualization

Results

jEUKebox pipeline generates simulated eukaryotic metatranscriptomes with varying sequence diversity

Eukrhythmic products accurately represent the raw reads

Less stringent clustering slightly decreases identified annotations

Eukaryotic metatranscriptome assembly accurately recapitulates simulated taxonomic diversity

Functional annotations from metatranscriptome assembly match abundance and diversity of functions in designer transcriptomes

Applying the eukrhythmic pipeline to environmental metatranscriptomic datasets

Tara Oceans eukrhythmic assemblies contain coding sequences that lack representation in the “MATOU” gene atlas

Tara Oceans assembly raw read fidelity and non-coding sequences

Multiple assemblers improve metatranscriptome assembly of Narragansett Bay phytoplankton

Discussion

Scalable and reproducible pipelines like eukrhythmic enhance intercomparison and advance computational research

Do metatranscriptomes capture the diversity of protistan communities?

Is there one best assembler for eukaryotic environmental metatranscriptomes?

Should we be downsizing metatranscriptomic assemblies?

eukrhythmic: an approach for optimized multi-assembler metatranscriptome assembly

Availability of data and materials

Change history

05 May 2023

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Supplementary Information

Additional file 1: Fig. S1.

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us