In the following sections, we will first provide a brief review of MetaVelvet-SL to establish the framework MetaVelvet-DL is built upon and reformulate the dBG partitioning problem as a three-class classification problem where the classes represent possible ways of partitioning a dBG. The proposed deep learning architecture uses a one-dimensional convolutional neural network (1D CNN) and biLSTM networks to predict class labels of candidate chimeric nodes.
Overview of MetaVelvet-SL
MetaVelvet-SL first constructs a dBG with Velvet and performs simplification and error removal of tips, bubbles, and erroneous connections using Velvet functions [3]. Typically, however, there remain nodes with multiple incoming and outgoing edges that cannot be resolved into a single path. These chimeric nodes represent possible repeat regions that occur at multiple locations in a genome and are confirmed by comparing the actual read coverage with the expected coverage. In the assembly of multispecies genomic data, however, such chimeric nodes do not necessarily represent repeat regions, but may represent a stretch of nucleotide sequence that is evolutionarily conserved across different species.
In MetaVelvet and MetaVelvet-SL, the dBG constructed in Velvet is assumed to be composed of subgraphs that represent the individual species in a metagenomic sample. Therefore, one only needs to partition the multi-species dBG at the correct nodes to construct subgraphs and single-species contig assemblies. In MetaVelvet, partitioning is performed by identifying read coverage peaks, where each peak is assumed to represent one species in the microbial community. Each node is then assigned to a species, and a subgraph is formed by partitioning adjacent nodes having the same species assignment. Repeat nodes are distinguished from chimeric nodes by pair-end read mapping and coverage differences. In MetaVelvet-SL, a candidate chimeric node is defined as a node with two incoming and two outgoing nodes, which are labeled as higher- and lower-coverage incoming and outgoing nodes. There are three classes of possible arrangements at each candidate chimeric node, as shown in Fig. 6: class 1, where the candidate node is chimeric, and the higher-coverage incoming and outgoing nodes belong to one species and the lower-coverage incoming and outgoing nodes belong to another species; class 2, where the candidate node is chimeric, and the higher-coverage incoming and lower-coverage outgoing nodes belong to one species and the lower-coverage incoming and higher-coverage outgoing nodes belong to another; and class 3, where the candidate node is not chimeric, but a repeat node. MetaVelvet-SL uses a three-class SVM to predict the class labels of candidate chimeric nodes; in addition to the pair-end mapping and coverage information, which is also used in MetaVelvet, node sequence information in the form of dinucleotide frequencies is included in the SVM feature vector. MetaVelvet-SL performs metagenome assembly using the following steps:
1. Construct a dBG that consists of multi-species genomes using Velvet functions.
2. Generate a list of candidate chimeric nodes from the dBG constructed in step 1 and obtain the nucleotide sequences as well as pair-end mapping and coverage information for these nodes and their incoming and outgoing nodes.
3. Train a three-class classification SVM model and predict the class labels of candidate chimeric nodes using dinucleotide frequencies and pair-end mapping and coverage information as input features.
4. Partition the multi-species dBG into subgraphs at candidate chimeric nodes that have been classified as class 1 or 2 in step 3.
The MetaVelvet-DL algorithm developed in this study follows the same steps as MetaVelvet-SL, but replaces the SVM model in step 3 with a deep learning architecture, taking advantage of the latter's automatic feature extraction ability to capture long-range patterns in same-species sequences and thereby resolve the graph partitioning problem. It should be noted that although candidate chimeric nodes with more than two incoming and outgoing nodes do exist, they account for only 1.79% of the candidate chimeric nodes in the human gut microbiome [11]. Thus, in this study we focused on modeling nodes with two incoming and two outgoing nodes. Another potential source of chimeric nodes is horizontal gene transfer between different species. While this phenomenon is not considered in this work, it is a future direction for extending the algorithm.
Deep learning classification model
We designed a deep learning architecture, illustrated in Fig. 7, to predict whether a candidate chimeric node is a true chimeric node or a repeat node. The architecture follows directly from the structure of the problem: there are two incoming and two outgoing nodes, and we need to predict which incoming–outgoing nodes form a correct pair, or whether none of them do. To this end, we used one 1D convolutional layer and four biLSTM layers on each node sequence, except for the candidate node. The architecture then combines the outputs of the biLSTM layers to represent both possible pairings and uses a series of fully connected layers to determine the correct partition. Hereafter, the higher-coverage incoming node, lower-coverage incoming node, higher-coverage outgoing node, and lower-coverage outgoing node are denoted as HCI, LCI, HCO, and LCO nodes, respectively.
Input
The input to the deep learning architecture comprises the sequences of a candidate chimeric node, its two incoming nodes, and its two outgoing nodes, together with the following pair-end mapping and coverage information:
1. Number of reads connecting the HCI and HCO nodes.
2. Number of reads connecting the HCI and LCO nodes.
3. Number of reads connecting the LCI and HCO nodes.
4. Number of reads connecting the LCI and LCO nodes.
5. Coverage ratios of the incoming nodes to the candidate chimeric node.
6. Coverage ratios of the outgoing nodes to the candidate chimeric node.
7. Coverage of the candidate chimeric node.
8. Length of the candidate chimeric node.
The coverage features are included as additional evidence for graph partitions. A higher number of pair-end reads mapped to a given pair of incoming–outgoing nodes provides stronger evidence for partitioning, and in cases where the incoming or outgoing nodes have highly similar sequences, the coverage features can be weighted more heavily in the decision process. For each of the eight features above, a histogram is computed and divided into 10 equally spaced quantiles. The pair-end and coverage feature values of each candidate chimeric node are then one-hot-encoded according to their respective quantiles and concatenated to form a binary vector. For sequence data, instead of assigning the four nucleotides ‘A’, ‘G’, ‘C’, and ‘T’ the numbers 1, 2, 3, and 4, or using one-hot encoding, we assigned every possible hexanucleotide a unique integer between 1 and 4096, e.g., ‘AAAAAA’ = 1, ‘AAAAAG’ = 2, and ‘TTTTTT’ = 4096. A 6-bp window is slid from the 5′ to the 3′ end of a sequence at a 1-bp step size, at each step assigning the observed hexanucleotide its corresponding integer, to obtain a numeric representation of the nucleotide sequence. This representation allows short-range patterns to be included in the input data, while leaving sufficient freedom for the deep learning model to find an optimal representation for the classification problem. As input to the deep learning architecture, we extracted 250 bp from the 3′ end of the incoming nodes and 250 bp from the 5′ end of the outgoing nodes and converted the nucleotide sequences to integer sequences as described above. If a node was shorter than 250 bp, its integer sequence was padded with zeros so that all sequences have the same length.
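As an illustration, a minimal Python sketch of this encoding is given below. The helper names are ours; the A, G, C, T ordering follows the examples above (‘AAAAAA’ = 1, ‘AAAAAG’ = 2, ‘TTTTTT’ = 4096), and the quantile binning is assumed to be estimated over the training data.

```python
import numpy as np

BASE_INDEX = {"A": 0, "G": 1, "C": 2, "T": 3}  # ordering implied by 'AAAAAA' = 1, 'AAAAAG' = 2

def encode_sequence(seq, window=6, segment_len=250):
    """Slide a 6-bp window along the sequence, map each hexanucleotide to an
    integer in 1..4096, and zero-pad so all encoded sequences have equal length."""
    codes = []
    for start in range(len(seq) - window + 1):
        code = 0
        for base in seq[start:start + window]:
            code = code * 4 + BASE_INDEX[base]
        codes.append(code + 1)                      # 'AAAAAA' -> 1, ..., 'TTTTTT' -> 4096
    out_len = segment_len - window + 1              # 245 windows for a full 250-bp segment
    codes = codes[:out_len]
    return np.pad(codes, (0, out_len - len(codes)), constant_values=0)

def one_hot_decile(values):
    """One-hot encode each feature value by the decile bin of its histogram
    (10 equally spaced quantiles estimated from the training data)."""
    edges = np.quantile(values, np.linspace(0.1, 0.9, 9))  # 9 inner decile boundaries
    bins = np.digitize(values, edges)                      # bin index 0..9 per value
    return np.eye(10, dtype=int)[bins]
```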
Embedding layer
The first layer of the proposed deep learning architecture is an embedding layer, which is applied only to the integer-transformed nucleotide sequences of the candidate chimeric node and its incoming and outgoing nodes. In natural language processing, a word embedding is a dense representation of words in the form of a numerical vector [24]; it serves both to reduce the dimensionality of the original word space and to encapsulate co-occurrence information [25]. In our architecture, the embedding layer maps each integer to a numerical vector of length 64, and the embeddings are learned during the training process.
Convolutional layers
After embedding, the dense representation of each of the incoming and outgoing nodes of a candidate chimeric node is passed through a 1D CNN layer. A nucleotide sequence is 1D data in which spatial relationships exist at various scales. The 1D CNN is used to learn features present within the nucleotide sequences, taking advantage of its strength in identifying spatial patterns in the input data [26]. A 1D CNN filter is a sliding window that moves across the input data with a step size of 1, convolving the filter weights with the input within the window at each step and producing a 1D feature map as output. In the architecture proposed here, each of the four 1D CNN layers (one per incoming/outgoing node) has 32 filters of size 64 × 12, where 64 is the size of the embedding layer output and 12 is the window length.
Batch normalization
The output of each 1D CNN layer is subjected to batch normalization, which transforms each activation in the feature map so that it has zero mean and unit variance. This step can speed up learning by allowing larger learning rates [27].
Maxpooling
Maxpooling is applied to the batch-normalized feature maps. Maxpooling with 1D input is simply a moving window that takes the maximum value within the window at each step. Such an operation downsizes the input and reduces dimensionality, but still allows the most prominent features in each window to be observed [28]. In the architecture proposed here, we set the maxpooling size to 4.
BiLSTM
For each of the incoming and outgoing nodes, the output from the maxpooling layer is used as input to four consecutive biLSTM layers. Because WGS metagenomic data generally cover complete genome sequences, and sequence patterns can occur in concert with upstream or downstream sequences, we used biLSTM [21] to process each maxpooling output both forward and backward, thereby incorporating up- and downstream sequence information into the output of a biLSTM unit. We stacked four biLSTM layers, where the input to the first layer is the output of the maxpooling layer and the input of each subsequent layer is the output of the previous one. Each subsequent layer learns sequence patterns at a larger scale than the previous one. In the architecture proposed here, we used biLSTM layers of sizes 128, 64, 32, and 16.
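Taken together, the embedding, convolution, batch normalization, maxpooling, and biLSTM layers described above form one branch per incoming/outgoing node. A minimal Keras-style sketch of such a branch is given below; the activation functions, the convolution padding mode, and the interpretation of the biLSTM sizes as units per direction are our assumptions, as they are not specified above.

```python
from tensorflow.keras import Input, Model, layers

def node_branch(seq_len=245):
    """One branch applied to a single incoming/outgoing node sequence (245 hexanucleotide
    integers for a 250-bp segment): embedding -> 1D CNN -> batch norm -> maxpooling -> 4 biLSTMs."""
    inp = Input(shape=(seq_len,))
    x = layers.Embedding(input_dim=4097, output_dim=64)(inp)  # 4096 hexanucleotides + 0 for padding
    x = layers.Conv1D(filters=32, kernel_size=12, padding="same", activation="relu")(x)  # 32 filters of size 64 x 12
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling1D(pool_size=4)(x)
    for units in (128, 64, 32):
        x = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(16))(x)  # final biLSTM layer returns a single feature vector
    return Model(inp, x)
```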
Fully connected layers
Fully connected layers, in which every neuron in one layer is connected to every neuron in the next, are used to aggregate the outputs of the previous layers and to generate a nonlinear combination of the learned features. Let us first denote the network stacks described above, from the integer sequence input through the embedding layer to the output of the final biLSTM layer, for the HCI, LCI, HCO, and LCO nodes as MHCI, MLCI, MHCO, and MLCO, respectively. We then concatenate pairs of network stacks to obtain the following:
1. M1A: the concatenation of MHCI and MHCO
2. M1B: the concatenation of MLCI and MLCO
3. M2A: the concatenation of MHCI and MLCO
4. M2B: the concatenation of MLCI and MHCO
and the output of each concatenation is connected to a separate single layer of 512 neurons. The outputs of the single layers are further concatenated as follows:
1. M1: the concatenation of M1A and M1B
2. M2: the concatenation of M2A and M2B
where M1 and M2 correspond to class 1 and 2 pairings, respectively. The outputs of M1 and M2 are again each connected to a single layer of 512 neurons. The one-hot-encoded features of pair-end and coverage information are next used as inputs to a single layer of 512 neurons. The three layers are then concatenated and used as input to three fully connected layers of sizes 512, 256, and 128. The output of the final fully connected layer is then fed into a softmax layer with three outputs corresponding to the three classes.
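The pairing and aggregation stages described in this subsection can be sketched as follows, building on the hypothetical node_branch defined in the earlier sketch; the relu activations, the 80-dimensional coverage input (8 features × 10 decile bins), and the optimizer and loss are assumptions rather than details taken from the text.

```python
from tensorflow.keras import Input, Model, layers

# Four separate per-node branches (see the node_branch sketch above) plus the
# one-hot-encoded pair-end and coverage features.
hci, lci, hco, lco = (node_branch() for _ in range(4))
cov_in = Input(shape=(80,))  # assumed layout: 8 features x 10 decile bins

def dense512(t):
    # a fresh 512-neuron layer for each concatenation
    return layers.Dense(512, activation="relu")(t)

# Class 1 pairing: HCI-HCO and LCI-LCO; class 2 pairing: HCI-LCO and LCI-HCO.
m1a = dense512(layers.concatenate([hci.output, hco.output]))
m1b = dense512(layers.concatenate([lci.output, lco.output]))
m2a = dense512(layers.concatenate([hci.output, lco.output]))
m2b = dense512(layers.concatenate([lci.output, hco.output]))

m1 = dense512(layers.concatenate([m1a, m1b]))   # class 1 subnetwork
m2 = dense512(layers.concatenate([m2a, m2b]))   # class 2 subnetwork
cov = dense512(cov_in)                          # pair-end/coverage subnetwork

x = layers.concatenate([m1, m2, cov])
for units in (512, 256, 128):
    x = layers.Dense(units, activation="relu")(x)
out = layers.Dense(3, activation="softmax")(x)  # class 1, class 2, or repeat (class 3)

model = Model([hci.input, lci.input, hco.input, lco.input, cov_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```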
The deep learning architecture presented above reflects two design goals. First, in dBGs constructed from metagenomic samples, we observed that node sequence lengths can range from tens to thousands of base pairs. To accurately capture sequence patterns at various resolutions, we included four biLSTM layers, with each layer capturing increasingly longer-range patterns. Second, the architecture contains two large subnetworks that are concatenated in the fully connected layers, where each subnetwork represents one possible partition of a chimeric node. The fully connected layers then combine the subnetworks and read coverage information to decide which partition is most probable for a candidate chimeric node.
Data
To compare the performance of the MetaVelvet-DL, MetaVelvet-SL, Megahit, and metaSPAdes assemblies, we used the low- and medium-complexity datasets from the first Critical Assessment of Metagenome Interpretation (CAMI) challenge, an effort to provide a standardized benchmark for comparing metagenomic data analysis tools. The low-complexity dataset comprises approximately 50,000,000 pair-end reads with an Illumina HiSeq error profile from 40 microbial and viral genomes and 20 circular elements simulating a single sample, for a total of 15 Gb. The medium-complexity dataset covers 132 genomes and 100 circular elements simulating two samples, for a total of 40 Gb. In both datasets, the pair-end reads have a read length of 150 bp, a mean insertion length of 270 bp, and a standard deviation of 27 bp.
Training strategy for unknown bacterial species in a metagenomic sample
We generated a training set of candidate chimeric nodes using gold standard bacterial species genomes for the CAMI low-complexity dataset. A dBG was constructed at a k-mer size of 31 bp using 150-bp pair-end reads with a 270-bp insertion length and 50× coverage. The processing pipeline we used is as follows:
1. Use the gold standard list of bacterial species for CAMI low-complexity dataset.
2. Generate simulated pair-end reads from the reference genomes of the gold standard species using DWGSIM (https://github.com/nh13/DWGSIM).
3. Construct a dBG from the simulated data and identify the candidate chimeric nodes.
4. BLAST the candidate chimeric, incoming, and outgoing node sequences to the reference genomes to determine the partition class label for each candidate chimeric node.
5. Train the deep learning model with the simulated dataset.
6. Construct a dBG for the unknown metagenomic sample and identify the candidate chimeric nodes.
7. Predict the partition classes of the candidate chimeric nodes in the unknown metagenomic sample and partition the dBG accordingly.
From the above pipeline, we created a training set of candidate chimeric nodes from dBGs constructed at a k-mer size of 31 bp using 150-bp pair-end reads with an insertion size of 270 bp and 50× coverage, simulated from the gold standard species list for CAMI low-complexity dataset. Based on the strategy used in MetaVelvet-SL, we BLASTed the node sequences to the reference genomes of the gold standard bacterial species to obtain the true labels of the candidate chimeric nodes for the training set. As in MetaVelvet-SL, we considered the highest-ranking match for each of the incoming/outgoing node sequences. When the HCI and HCO nodes had the same highest-ranking match and the LCI and LCO nodes had the same highest-ranking match different from that of the HCI and HCO nodes, we labeled the candidate chimeric node as class 1. When the HCI and LCO nodes had the same highest-ranking match and the LCI and HCO nodes had the same highest-ranking match different from that of the HCI and LCO nodes, we labeled the candidate chimeric node as class 2. All other candidate nodes were labeled as class 3. A total of 50,000 candidate chimeric nodes from each of classes 1, 2, and 3 were selected to create a training set of 150,000 samples, and another 1000 candidate chimeric nodes from each class were selected to create a validation set of 3000 samples. The models were trained with a batch size of 100 samples over 256 epochs. The generated datasets were used to train and validate a deep learning model and an SVM model, and their resulting assemblies are denoted as MetaVelvet-DL and MetaVelvet-SL in the discussion. For comparison, we also performed an assembly of CAMI low-complexity dataset in which we used the true labels of the candidate chimeric nodes, rather than the SL or DL model predictions, to partition the dBG. We refer to this as the MetaVelvet-gold standard in the discussion.
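The labeling rule above can be summarized by a small helper function, with hypothetical naming, that takes the species of the highest-ranking BLAST hit for each of the four neighboring nodes:

```python
def label_candidate_node(hci_hit, lci_hit, hco_hit, lco_hit):
    """Return the partition class (1, 2, or 3) of a candidate chimeric node from the
    species of the highest-ranking BLAST hit of its HCI, LCI, HCO, and LCO nodes."""
    if hci_hit == hco_hit and lci_hit == lco_hit and hci_hit != lci_hit:
        return 1  # higher-coverage pair and lower-coverage pair map to different species
    if hci_hit == lco_hit and lci_hit == hco_hit and hci_hit != lci_hit:
        return 2  # crossed pairing maps to different species
    return 3      # all other cases: treated as a repeat (non-chimeric) node
```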
In real-world applications, a workflow to train a model for chimeric node prediction would first use phylogenetic analysis software to predict the bacterial species in the unknown sample. A training dataset would then be created from simulations using the reference genomes of the predicted species. Here we trained a deep learning chimeric node prediction model using simulated read data generated from the genomes of a set of bacterial species predicted by Kraken [28] to be present in the CAMI low-complexity dataset, which we refer to as MetaVelvet-DL-Kraken. The training set contains 20,000 samples of each class, and the validation set contains 1000 samples of each class. The models were trained with a batch size of 100 samples over 100 epochs.
To demonstrate the robustness of MetaVelvet-DL, we also trained a deep learning model and an SVM model using a highly mismatched training dataset generated from an unpublished microbial community found in a marmoset (Callithrix jacchus) rectal sample. The bacterial species families in the CAMI and marmoset datasets are shown in Table 4; the two lists are mismatched, with very few families common to both datasets. The training set contains 20,000 samples of each of classes 1, 2, and 3, and the validation set contains 1000 samples of each class. The models were trained with a batch size of 100 samples over 100 epochs. Hereafter, these models are referred to as MetaVelvet-DL-Marmoset and MetaVelvet-SL-Marmoset.
Finally, to evaluate the performance of MetaVelvet-DL on more complex datasets, we took one of the CAMI medium-complexity samples and trained DL and SL models with the same steps used to generate the MetaVelvet-DL model.
Assessment
To focus the evaluation on chimeric assembly quality, we used MetaQUAST [29] to quantitatively evaluate the various MetaVelvet-based assemblers, whose models were trained with different training datasets, together with metaSPAdes and Megahit. In addition to providing standard quality statistics such as the N50 and mapped genome fraction, MetaQUAST also reports metrics such as the number of interspecies translocations and the number of misassembled contigs [29], which are important in metagenome assembly.
We also BLASTed the assembled contigs to the gold standard reference genomes. If non-overlapping parts of a contig are found to have hits in different species, that contig is considered a chimeric contig. By identifying chimeric contigs through BLAST, we compared the total chimeric contig length and the proportion of chimeric contig length for each assembler as a metric of assembly quality.
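A simple sketch of this check is shown below, under the assumption that the BLAST hits of a contig are reduced to (query start, query end, species) tuples and that "non-overlapping" means disjoint query intervals:

```python
def is_chimeric(blast_hits):
    """blast_hits: list of (query_start, query_end, species) tuples for one contig.
    The contig is flagged as chimeric if two hits to different species cover
    disjoint parts of the contig."""
    for i, (s1, e1, sp1) in enumerate(blast_hits):
        for s2, e2, sp2 in blast_hits[i + 1:]:
            if (e1 < s2 or e2 < s1) and sp1 != sp2:
                return True
    return False
```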
Another common metric for comparing genome assemblies is the N50 score, which is the length of the shortest contig such that contigs of that length or longer together account for at least 50% of the total assembly length. However, comparing the N50 of different assemblies is biased because the total contig lengths of the assemblies differ. Therefore, we used the following generalized score, termed the N-len(x) score:
$$ \mathrm{N}\text{-}\mathrm{len}(x) = \left|S_i\right| \;\ni\; \sum_{j=1}^{i} \left|S_j\right| \ge x \;\mathrm{and}\; \sum_{j=1}^{i-1} \left|S_j\right| \le x, $$
(1)
where L is the total length of all contigs, Sj denotes the j-th contig in the set of contigs sorted by length in decreasing order, and |Sj| denotes its length. In this formulation, the N50 measure is simply the special case of the N-len(x) score where x = L/2 [2]. Using the N-len(x) score, we can compare, across all assemblers, the length of the shortest contig in the smallest set of contigs whose total length just exceeds the same value x.
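For illustration, a short sketch of the N-len(x) computation (with hypothetical naming) is given below; N50 is recovered by setting x to half of the total assembly length L.

```python
def n_len(contig_lengths, x):
    """Length of the shortest contig in the smallest set of longest contigs
    whose total length reaches at least x."""
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= x:
            return length
    return 0  # x exceeds the total assembly length

# Example: N50 as the special case x = L/2
lengths = [500, 400, 300, 200, 100]          # toy contig lengths; L = 1500
n50 = n_len(lengths, sum(lengths) / 2)       # cumulative 500 + 400 = 900 >= 750, so N50 = 400
```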