 Methodology article
 Open access
 Published:
Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields
BMC Bioinformatics volumeÂ 21, ArticleÂ number:Â 402 (2020)
Abstract
Background
De Bruijn graphs are key data structures for the analysis of nextgeneration sequencing data. They efficiently represent the overlap between reads and hence, also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each kmer (resp. k+1mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage, however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, underutilising the information present in the sequencing data.
Results
To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner.
Conclusions
We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectationmaximisation parameter estimation. True kmers can be distinguished from erroneous kmers with a higher F_{1} score than existing methods. A C++11 implementation is available at https://github.com/biointec/detoxunder the GNU AGPL v3.0 license.
Background
De Bruijn graphs play a key role in many bioinformatics tools as a data structure to efficiently represent the overlap between sequences. Given a set of sequences S, the de Bruijn graphâ€™s nodes are defined by the kmers (subsequences of length k) present in S. Two nodes u and v are connected by a directed arc when a k+1mer exists in S for which the first k nucleotides coincide with u and the last k nucleotides coincide with v [1]. Often, linear (i.e. nonbranching) chains of nodes are contracted into a single node referred to as a unitig [2].
When S corresponds to a set of reads, one obtains the readbased de Bruijn graph for which we define the coverage of a node (resp. arc) as the number of times its kmer (resp. k+1mer) occurs in the set of reads. Similarly, when S contains a genomic sequence G, one obtains the genome de Bruijn graph for which we define the multiplicity of a node (resp. arc) as the number of times its kmer (resp. k+1mer) occurs in G. A readbased de Bruijn graph is a noisy estimate of the (often unknown) genome de Bruijn graph. In real sequencing experiments, readbased de Bruijn graphs contain many spurious nodes/arcs that emerge because of sequencing errors and may be missing nodes/arcs due to coverage gaps. Our goal is to accurately infer the genome de Bruijn graph from its readbased approximation. To this end, we wish to label each node (resp. arc) in the readbased graph with its corresponding multiplicity: true nodes and arcs should be labelled according to their repeat copy number in G while spurious nodes/arcs should be assigned multiplicity zero. The multiplicities of nodes and arcs are reflected in their coverage: repeated kmers in the genome often have a higher coverage than nonrepeated kmers, while erroneous kmers often have very low coverage.
The accurate inference of the multiplicities of nodes and arcs is nontrivial but nevertheless important: it reveals the sequencing errors (i.e. nodes/arcs that were assigned multiplicity zero) and provides insights into the repeat structure of the genomic sequence. This problem is therefore a common theme in numerous bioinformatics applications, such as de novo genome assembly [3], correction of both second and third generation reads [4â€“7], de Bruijn graphguided hybrid assembly methods [8â€“11] and several variant calling and variantaware assembly methods [12â€“15].
Using a conditional random field (CRF) model, we demonstrate how local coverage evidence observed at each node or arc individually as well as the evidence of the surrounding neighbourhood can be coupled in a highdimensional statistical model. Nonetheless, efficient computational methods exist to infer the individual probability distributions for each random variable present in the model. In the last decade, these types of models have been primarily used in the field of computer vision and image analysis, where they facilitated an evolution from heuristic algorithms to a systematic approach by incorporating contextual constraints [16]. We believe that the use of CRFs on de Bruijn graphs can facilitate a similar evolution in the analysis of sequencing data. Using our model, we consistently obtain more accurate multiplicity assignments on both real and simulated data. Moreover, we observe a more robust convergence when learning the model parameters in an expectationmaximisation (EM) setting. Additionally, because we use a statistical framework, we obtain a measure of certainty about the assignments. We believe this methodology can be of use for many tools that rely on de Bruijn graphs. Note that preliminary results of this work have been presented at the HiTSeq workshop of the 2019 ISMB conference [17].
kmer histogram based approach
Baseline methods for the inference of node and arc multiplicities rely on kmer histograms. A kmer histogram shows, for each kmer coverage, the number of kmers that occur at that coverage in the data. A mixture of distributions is fit to this histogram, where each component corresponds to a distinct multiplicity (see Fig. 1a). Each distribution models, for that multiplicity, the natural variability of coverage that occurs in real sequencing experiments. Using this mixture model, intervals of coverage are selected to point to a certain multiplicity. This idea is adopted by many tools. For example, certain Illumina error correction tools (see [18] for a review) rely on kmer histograms to determine whether or not a kmer is erroneous, i.e. whether or not its coverage falls within the zeromultiplicity interval. Some tools use perbase quality scores to weigh the contribution of each individual kmer occurrence to its coverage [19]. In the context of de Bruijn graphs, the information contained in the kmer histogram is sometimes supplemented with graphtopological features: sequencing errors often emerge in the graph as short dead ends (tips), when the sequencing error occurs towards the end of a read, or as short parallel paths (bubbles), when the error occurs near the centre of a read [20â€“22].
Nevertheless, as exemplified in Fig. 1, inferring multiplicities from a kmer histogram is errorprone. Figure 1a shows a mixture model of negative binomial (overdispersed Poisson) distributions, each corresponding to a specific multiplicity. Intervals of kmer coverage are established and assigned a particular multiplicity such that the probability for that multiplicity value is maximal in the interval. For example, a kmer with coverage [1â€¦3] is assigned multiplicity zero (sequencing error), coverage interval [4â€¦11] corresponds to multiplicity one (nonrepeated sequence), etc. In the example of Fig. 1b, the nodes and arcs of a de Bruijn graph are labelled with a sampled coverage, in order to mimic a sequencing experiment. If we now infer the multiplicities using the intervals of Fig. 1a, we obtain the assignments in Fig. 1c. Not all assignments are correct as some multiplicity estimates are either too high or too low. This is due to the overlap between distributions in the mixture model causing some nodes or arcs with ambiguous coverage (i.e. coverage near the interval boundaries) to be assigned the wrong multiplicity. The selection of hard cutoff intervals will therefore always lead to a certain fraction of incorrect multiplicity assignments, depending on the degree of overlap between distributions, which in turn depends on the sequencing depth and coverage variability of the dataset. To illustrate the practical consequences, a recent survey of Illumina error correction tools [23] demonstrated that many tools suffer from the deletion of lowcoverage true kmers, leading to more fragmented assemblies.
Conservation of flow of multiplicity
Within the context of the de Bruijn graph of Fig. 1c, one can immediately verify the presence of incorrect multiplicity assignments by checking the following criterion at each node: when all nodes and arcs are correctly labelled with their multiplicity, the multiplicity of each node equals both the sum of the multiplicities of its incoming arcs and the sum of the multiplicities of its outgoing arcs. This also holds in the presence of sequencing errors as the corresponding spurious nodes or arcs have been assigned multiplicity zero. This property was first noted by Pevzner and Tang [24], who proposed to use maximumflow algorithms on precorrected de Bruijn graphs to determine the multiplicity of repeats. A similar technique has been used on string graphs [25] and on errorfree weighted bidirected graphs [26].
This important property, which we will call â€˜conservation of flow of multiplicityâ€™, is also an important component of our proposed method. Using the conservation of flow property, one might decide that a node or arc with a relatively poor coverage that falls in the zeromultiplicity interval, does have a multiplicity greater than zero because it provides an essential link in the graph. An example of this is the arc between nodes h and r_{1} in Fig. 1b: r_{1} has strong evidence of being a multiplicity 2 repeated node, both by its own coverage and by the sum of the estimated multiplicities of its outgoing arcs. The multiplicities of its incoming arcs should hence also sum up to 2, which can be achieved by assigning multiplicity 1 to the arc from h to r_{1} despite its low coverage. As a second example, an erroneous kmer with relatively high coverage might still be assigned multiplicity 0 because of flow conservation considerations. In Fig. 1b, once we have established that the arc from node h to r_{1} should have multiplicity 1, we notice that the arc from h to j should be assigned multiplicity 0. This further results into erroneous node j to also be assigned multiplicity 0. By imposing the conservation of flow property across all nodes and arcs, we obtain the correct multiplicity assignment in Fig. 1d.
Methods
Conditional random fields
Conditional random fields (CRFs) are a type of probabilistic graphical models (PGM), i.e. graph models in which the nodes represent random variables and the edges represent direct probabilistic interactions between these variables [27]. Note that this graph is different from the de Bruijn graph, even though both graphs share some resemblance. In case of CRFs, the variables are further divided into a set of observed variables and a set of unobserved variables whose value we want to infer [28]. Here, the observed variables X={X_{1},â€¦,X_{N}} represent the coverage and the unobserved variables Y={Y_{1},â€¦,Y_{N}} represent the multiplicities of nodes and arcs in the de Bruijn graph.
Probabilistic interactions are represented by factors Ï†_{i}(D_{i}), where \({\mathbf {D}_{i} \subseteq \mathbf {X}\cup \mathbf {Y}}, {\mathbf {D}_{i} \nsubseteq \mathbf {X}}\), such that all variables in the scope D_{i} of Ï†_{i} form a clique in the CRF. The joint conditional probability over all variables is then calculated as
with Z(X) the partition function that normalises the product of factors such that P(YX) is a probability distribution [27]. By conditioning on X, the modelling of dependencies between these variables can be avoided and only the conditional dependencies between the Ys are modelled. We use two types of factors in our model. First, the â€˜singleton factorsâ€™ Ï† model the relationship between the coverage observed at a node (resp. arc) and its multiplicity. Given the coverage, the factor represents a categorical (also called: multinoulli) distribution over the different possible multiplicities. Second, the â€˜conservation of flow factorsâ€™ Ï†_{flow} impose the conservation of flow of multiplicity. Such factor has in its scope the multiplicity of a particular node as well as all multiplicities of either incoming or outgoing arcs and assigns a high probability when conservation of flow holds, and a low probability otherwise. They thus model the relationship between the multiplicity of a node and the multiplicities of its adjacent arcs. Collectively, the factors allow evidence (observed coverage) to propagate through the CRF graph. Therefore, the computed multiplicity assignments will be the result of an interplay between locally observed evidence, and evidence observed in the surrounding neighbourhood.
From de Bruijn graph to CRF
Because building a CRF for the entire de Bruijn graph would render computations infeasible for exact inference techniques, CRFs are built for smaller subgraphs: if we want to infer the multiplicity of a particular node n in the de Bruijn graph, a neighbourhood of size s is selected that consists of all nodes that are reachable from node n by a path of length â‰¤s along with all incoming and outgoing arcs of these nodes. For each arc a in the neighbourhood, the CRF contains variable nodes Y_{a} and X_{a}, the multiplicity and observed coverage at that arc, respectively. For each node m in this neighbourhood, the CRF contains a variable node Y_{m}, the multiplicity of that node. The average coverage of nodes that contain <2kkmers is highly correlated with the coverage observed at their adjacent arcs. Therefore, we only add a variable node X_{m} to represent the observed average coverage of node m if this node contains more than 2kkmers. Because the time and memory requirements of the subsequent inference algorithm depend on the number of possible values for the variables, we restrict Val(Y) to [ max(0,m_{opt}âˆ’Î±),m_{opt}+Î±], where Î± is a tuneable parameter (default value: Î±=2) and m_{opt} is the most likely multiplicity for Y based on the locally observed coverage X. Figure 2 shows a de Bruijn graph with a selected neighbourhood of size 1 and its corresponding CRF.
Singleton factors For each X and Y corresponding to an arc or node in the de Bruijn graph, a singleton factor \(\varphi (Y,X): Val(Y,X) \mapsto \mathbb {R}^{+}\) is created that reflects the probability that an arc/node has multiplicity m given the observed coverage C:
To estimate these probabilities, a mixture model of negative binomial distributions is fitted to the kmer histogram for nodes and the k+1mer histogram for arcs such that each component in the mixture corresponds with a multiplicity m. As opposed to Poisson distributions that take only a single parameter Î» and assume the mean and variance are equal, negative binomial distributions are more flexible as they are able to model data with larger variation. The negative binomial distribution is traditionally parameterised using parameters r and p, however, in the context of this work, we use parameters \(\lambda = \frac {pr}{1p}\) (mean) and \(f = \frac {1}{1p}\) (overdispersion factor). The variance is then equal to fÎ». As pâ†’0 (in which case fâ†’1) and râ†’âˆž such that \(\lambda = \frac {pr}{1p}\) remains constant, the negative binomial distribution converges to a Poisson distribution with mean Î».
For m=0 the mean of the negative binomial Î»_{0} is estimated as the mean coverage of the erroneous kmers. We also estimate Î», the mean coverage of kmers with multiplicity 1. For each multiplicity mâ‰¥1 the negative binomial then has mean mÎ». Additionally, overdispersion factors f_{0} for the errordistribution and f for the distributions representing mâ‰¥1 are estimated based on the variance in the data. Finally, each negative binomial is weighted according to the estimated number of kmers with that multiplicity in the dataset, w_{m}.
Conservation of flow factors For each Y_{n} of node n in the de Bruijn graph, two conservation of flow factors are created: one for its incoming arcs and one for its outgoing arcs. These factors have in their scope the corresponding CRFnode Y_{n} and all Y_{a} corresponding to incoming resp. outgoing arcs of node n. E.g. for the incoming arcs:
Factors \(\varphi _{\text {flow}}(Y_{n},\{Y_{a}\}_{a \in \text {in}(n)}): Val(Y_{n},\{Y_{a}\}_{a \in \text {in}(n)}) \mapsto \mathbb {R}^{+}\) assign a value of 1 to cases where the sum of multiplicities of the incoming resp. outgoing arcs is equal to the multiplicity of the node and a low value Îµâ‰ª1 (default: 10^{âˆ’7}) otherwise.
Whereas the singleton factors provide local evidence, the conservation of flow factors direct towards a multiplicity assignment that is consistent with the property of flow of multiplicity for all nodes and arcs in the CRF.
Inference
The product of all factors is the joint probability of the multiplicities over all nodes and arcs in the selected neighbourhood. To determine the probabilities over the possible multiplicities of a specific node n (or arc a) we need to calculate \(P(Y_{n}  \mathbf {X}) = \sum _{\mathbf {Y}\setminus Y_{n}} P(\mathbf {Y} \mathbf {X})\). We use the variable elimination algorithm for PGMs with a â€˜minneighbours criterionâ€™ to select the elimination order [27]. This algorithm performs an exact calculation of the probabilities. This solver was implemented in logspace to avoid numerical underflow. Supplementary Section 1, Additional File 1 contains an example.
Parameter estimation using expectationmaximisation
The parameters of the mixture model need to be estimated from the data. These parameters can be computed using the multiplicities of the nodes/arcs in the de Bruijn graph, which are also unknown. We therefore use the expectationmaximisation (EM) algorithm: during the Estep, the multiplicities of nodes and arcs are inferred using the current model parameter estimates; during the Mstep, the parameter values are updated using the newly inferred multiplicities, weighted by their probability. Convergence is obtained when the estimated multiplicity of a prespecified fraction (default: 0.001) of the nodes and arcs no longer changes between consecutive Esteps.
The parameters of the mixture model are the following: the means of the negative binomial distributions Î» and Î»_{0}, the overdispersion factors f and f_{0} that determine the variance, and the weights w_{m} for all multiplicities. Since nodes correspond to (concatenated) kmers, whereas the arcs correspond to k+1mers, the underlying distributions differ slightly. Therefore, all parameters are estimated separately for the nodes and the arcs of the de Bruijn graph. To avoid complex numerical computations associated with the maximum likelihood estimators, we make use of the method of moment estimators. The exact formulas are provided in Supplementary Section 2, Additional File 1.
The mixture model parameters are estimated using only a small subset of the nodes and arcs in the de Bruijn graph (default: 10.000 nodes and 10.000 arcs). This leads to a significant speedup and still provides accurate estimates for the model parameters.
Using quality scores
Instead of counting the occurrence of each kmer in the dataset, we use weighted counts based on the quality scores of the nucleotides in the kmer. Such weighted counts are called qmers and were first introduced in the read corrector Quake [19]:
Because erroneously called nucleotides often have a low quality score, kmers that overlap with such nucleotides will have a lower qmer count, whereas kmers containing only correct nucleotides will still have a qmer count close to one. As a consequence, the distributions representing the erroneous nodes/arcs and the distributions that represent multiplicity mâ‰¥1 overlap less.
Note that the negative binomial distribution is a discrete probability distribution and is thus only defined for integer counts. However, by replacing the factorials in the discrete probability mass function (pmf) with Gamma distributions we obtain a pmf that is applicable to continuous data and coincides with the original pmf in the discrete points.
The full pipeline
BCALM 2 [29] is used to build a compacted de Bruijn graph from raw sequencing data, i.e. a de Bruijn graph in which all linear paths have been concatenated into single nodes called unitigs. BCALM 2 is capable of constructing de Bruijn graphs even for large genomes with relatively low memory requirements. For all results presented in this paper we use kmer size k=21. Unless mentioned otherwise, we use BCALM 2 with the abundancemin=2 option to remove all kmers that occur only once in the reads. BCALM 2 outputs a de Bruijn graph which serves as input for Detox, the tool in which we implemented the proposed CRF model. Detox uses three stages. In stage 1, the qmer (or optionally: kmer) coverages of all nodes and arcs in the graph are computed. In the stage 2, the mixture model parameters are estimated using the expectationmaximisation algorithm. Finally, in stage 3, the multiplicities of all nodes and arcs in the de Bruijn graph are inferred using CRFs. These multiplicities are written to disk. Detox is implemented in C++11 and uses threads for parallelisation. We refer to Supplementary Section 7, Additional File 1 for a discussion of the choice and default value of the most important input parameters.
Results
Worked example
To gain some intuition, we present a specific case in which the CRF framework outperforms a kmer cutoff methodology. Figure 3 shows a subgraph of the de Bruijn graph, built from the P. aeruginosa dataset (real data, subsampled to 30Ã—). We infer the multiplicity of node n_{1} (with a true multiplicity of 1) using our proposed CRF methodology for different neighbourhood sizes s. At neighbourhood size s=0, the inferred multiplicity is solely based on the (average) kmer coverage observed at node n_{1} and hence, the CRF method degenerates to the kmer cutoff methodology. With larger neighbourhood sizes (sâ‰¥1), the CRF model also incorporates the coverage observed at nodes (and their adjacent arcs) in the neighbourhood of node n_{1}. After applying the variable elimination algorithm, we obtain a categorical (multinoulli) distribution over the different multiplicities for node n_{1} (see inset table of Fig. 3). The estimated multiplicity is then simply the one with the highest probability. These probabilities thus provide a measure of confidence.
The issue in the subgraph of Fig. 3 is that part of the path of the genome through n_{18},n_{9},n_{3},n_{1},n_{2},n_{5} has low coverage. If we only consider individual coverages, node n_{1} incorrectly appears to be a sequencing error whereas its true multiplicity is 1. Similarly, nodes n_{2} and n_{3} appear to have multiplicity 1 whereas their true multiplicity is 2. Because all three nodes suffer from low coverage, a CRF with a neighbourhood of size s=1 does not yet improve the multiplicity estimation. On the contrary, it increases the probability of a zero multiplicity for node n_{1} because the evidence of a multiplicity of 1 for nodes n_{2} and n_{3} is consistent with a zero multiplicity for node n_{1}. However, when using a neighbourhood of size sâ‰¥2, the CRF also incorporates information about parts of the path with higher coverage. Particularly, the coverage observed at arc a_{18â†’9} suggests a multiplicity 1 for node n_{9} (which itself has ambiguous coverage) and hence, in combination with evidence for multiplicity 1 for node n_{7}, a multiplicity 2 for node n_{3}. Similarly, it becomes clear that also node n_{2} has multiplicity 2. Ultimately, a consistent multiplicity assignment can only be obtained when node n_{1} is (correctly) assigned a multiplicity of 1. As we incorporate even more information at neighbourhood size s=3, this belief is further strengthened and the CRF model infers the correct multiplicity with a high degree of confidence.
Multiplicity assignment performance
To evaluate the overall performance of the proposed CRF methodology, we consider five real Illumina datasets (two bacterial and three eukaryotic â€“ see Table 1). To assess the impact of coverage depth, each dataset was downsampled to different coverage depths of 10Ã—,25Ã— and 50Ã—. Using BCALM 2, compacted de Bruijn graphs (k=21) were built for the 15 resulting datasets. BCALM 2 was configured to discard kmers that occur only once in the data. As such, we obtained de Bruijn graphs with a number of nodes ranging between 20 220 (S. enterica, 10Ã—) and 254.7 million (H. sapiens, 50Ã—). For all nodes and arcs, the quality score weighted coverage was computed.
We inferred the node multiplicities using the CRF methodology for different neighbourhood sizes s (0, 1, 3 and 5). For all organisms, except H. sapiens, we inferred the multiplicity of all nodes in the graph. To avoid excessive runtimes, the multiplicity for H. sapiens was inferred for one million randomly sampled nodes. Table 2 shows the node (resp. kmer) accuracy as the percentage of nodes (resp. kmers) that were assigned the correct multiplicity. For all organisms and for all coverage depths, the CRF methodology (neighbourhood size s>0) outperforms the baseline methodology (neighbourhood size s=0) by a significant margin. With increasing neighbourhood size s, the accuracy increases with which CRFs assign multiplicities. Most of this gain is already realised for small values of s (e.g. s=1), indicating that the coverage information of nodes and arcs within close proximity of a node (resp. arc) is most informative to correctly assign the multiplicity of that node (resp. arc).
With higher coverage depth, there is less overlap between the coverage distributions associated with the different multiplicities and hence, node multiplicities can be assigned more easily. The CRF method therefore proves especially useful at low or moderate coverage depths. For example, for C. elegans at 10Ã—, the node accuracy improves from 68.65% (s=0) to 81.29% (s=5), an improvement of 12.64 percentage points. Nevertheless, even at 50Ã— coverage depth, the use of CRFs yield improvements in node accuracy of 1.5 to 4 percentage points.
In all cases, the kmer accuracy is higher than the node accuracy (Table 2). Inferring the multiplicity is easier for nodes that contain more kmers, or in other words, nodes that represent longer unitig sequences. In those cases, the coverage of a node is averaged over a larger number of kmers, and hence, due to the central limit theorem, it has a higher probability of being closer to the mean coverage for the corresponding multiplicity.
Correctly inferring the multiplicity becomes increasingly difficult for higher multiplicity values (Supplementary Tables 1 and 5, Additional File 1). In Supplementary Table 1, Additional File 1, we provide confusion matrices for the different organisms, coverage depths (10Ã— and 50Ã—) and neighbourhood sizes (s=0 and s=5). Each column of a confusion matrix corresponds to all nodes with a specific true multiplicity (0 to 5) and shows how the estimated multiplicities are distributed. For all organisms, the majority of nodes in the de Bruijn graph represents either a sequencing error or a nonrepeated true sequence. Especially for those nodes, estimating the multiplicity appears relatively easy. However, the correct multiplicity inference for nodes that represent repeated sequences is difficult, even at a high coverage depth. For those nodes, the use of CRFs significantly boosts accuracy. For example, for C. elegans (50 Ã—), without the CRF method (s=0) only 54.5% (267 985 out of 429 137) of the nodes with true multiplicity 2 are correctly classified; this figure increases to 83.7% (411 764 out of 429 137) with CRFs (s=5). Hence, even if the coverage of a repeated node is ambiguous (or even misleadingly high or low), the CRF method can often still correctly infer the correct multiplicity from its surrounding context.
We also assessed the proposed CRF model using simulated Illumina data that was generated using the ART tool. We used the same organisms, coverage depths and parameter settings as for the real data. Supplementary Table 4, Additional File 1 shows the node and kmer accuracy. Again, the use of CRFs leads to improved multiplicity estimations in all cases and this effect increases with increasing values for the neighbourhood size s. In general, the accuracies obtained on simulated data are higher than those obtained for real data. There are several reasons for this. Real data may suffer from coverage biases that are not fully captured by read simulation tools. More importantly, establishing the groundtruth node and arc multiplicities is more difficult for real data. Due to the presence of genetic variation (single nucleotide polymorphism, copy number or structural variation), the sequenced individuals may not be a 100% match with their reference genome. Moreover, the samples may include reads that originate from extrachromosomal DNA (e.g. a plasmid in A. thaliana) for which the groundtruth multiplicity labelling cannot accurately be done using the reference genome. For example, from the confusion matrix for S. enterica (Supplementary Table 1, Additional File 1), one observes 2779 nodes with an estimated multiplicity >5 but for which the true multiplicity is 0. Clearly, these nodes represent DNA that is present in the sample, but not in the reference genome. One should take these considerations into account when interpreting the accuracy values in Table 2.
As a final note, the use of the CRF model reduces the number of iterations needed for the EM algorithm to converge (see Supplementary Tables 23, Additional File 1). During the Estep, node and arc multiplicities are inferred using the current estimation of the model parameters. Subsequently, during the Mstep, the model parameters are updated. Clearly, more accurate multiplicity inference during the Mstep leads to a more rapid convergence. In certain cases, we found that the EMalgorithm did not converge when not using the CRF methodology (s=0), e.g., for H. sapiens (10 Ã—). In that case, the distributions that correspond to the error nodes and multiplicity 1 nodes are difficult to distinguish. In contrast, when using CRFs (neighbourhood sizes s>1), the EMalgorithm always converged to the correct model, even for low coverage depths.
Comparison to existing methods
We are unaware of existing tools to determine multiplicities of nodes and arcs in de Bruijn graphs. To compare with existing methods we consider a subproblem of multiplicity determination, i.e. the classification of kmers into trusted (multiplicity mâ‰¥1) and untrusted (multiplicity m=0) kmers. This task is an essential subtask of many error correction tools.
Quake [19] discerns trusted from untrusted kmers by fitting a mixture of distributions to the qmer histogram. Next, a qmer cutoff value is determined below which a kmer has an Î± times higher chance to belong the error distribution than to the true distribution. We ran Quake v.0.3.5 with default parameters to produce a coverage cutoff value and then used Jellyfish v.1.1.12 to partition the kmers into trusted and untrusted subsets. We also compare to BayesHammer [34], a method that uses sequence similarity and Bayesian subclustering techniques on a Hamming graph of kmers. Trusted kmers are determined as the centre of these clusters. We used BayesHammer as a submodule of SPAdes v.3.13.1 with default parameters.
For each method, we compute the number of true positives (TP, erroneous kmers correctly classified), true negatives (TN, true kmers correctly classified), false positives (FP, true kmers incorrectly labelled as untrusted) and false negatives (FN, erroneous kmers incorrectly labelled as trusted). We then compute the sensitivity, specificity and F_{1} score as follows:
Table 3 shows the sensitivity, specificity and F_{1} score obtained by Quake, BayesHammer and our proposed CRF method (Detox). At 10Ã— coverage, BCALM 2 was configured to retain allkmers as we noted that filtering the kmers that occur only once caused a significant number of false positives. Also note that for H. sapiens, only the reads that map to chromosome 21 were used because BayesHammer required more memory than the available 256 GB RAM to handle human genome scale datasets.
For all organisms and coverage depths, Detox attains the highest F_{1} score. Both Quake and BayesHammer appear to be conservative methods: they have a high specificity at the cost of a low sensitivity. For Quake, a higher qmer cutoff value would result in a higher sensitivity, at the cost of a lower specificity. However, we emphasise that the higher F_{1} score for Detox is not solely due to the fact that Quake may select a (too) low qmer cutoff value. This is illustrated in Fig. 4, where we zoom in on the qmer histogram of the P. aeruginosa 25Ã— coverage dataset. The red line shows the qmer cutoff point that yields the highest F_{1} score. Methods such as Quake classify allkmers with a coverage below this cutoff value as untrusted and all other kmers as trusted. In contrast, our CRF model is able to infer that certain kmers with a coverage below the cutoff value have a true multiplicity =1. Conversely, the CRF model can deduce that kmers with a coverage above the cutoff value correspond to sequencing errors. In other words, a node that is assigned multiplicity zero might have a higher coverage than a node that is assigned multiplicity one. This unique property explains the higher classification accuracy obtained by Detox. Indeed, Fig. 4 illustrates that most kmers with ambiguous coverage are correctly classified.
Runtime performance
The runtime of the CRF methodology to infer the multiplicity of a node (or arc) depends on the total number of nodes and arcs in its neighbourhood as well as the degree of connectivity of this subgraph. Linear subgraphs give rise to CRFs that can be solved in a time proportional to the number of nodes and arcs in the subgraph, whereas densely connected subgraphs yield CRFs that may require a solution time that is exponential in the total number of nodes and arcs. We emphasise, however, that for a fixed value of the neighbourhood size s, the total number of nodes and arcs in a subgraph is bounded, and hence, the runtime to infer the multiplicity for a particular node or arc is O(1). Therefore, the runtime to infer the multiplicity of all nodes (resp. arcs) in de Bruijn graph scales linearly with the number of nodes (resp. arcs).
Table 4 lists the average number of nodes and arcs in a subgraph for different genomes (50Ã— coverage depth) and different neighbourhood sizes s as well as the total runtime to infer the multiplicity for 10 000 randomly selected nodes using a single core of a 24core AMD EPYC 7451 CPU with a base clock frequency of 2.3 GHz. With increasing neighbourhood size s, the number of nodes and arcs in the subgraph increases rapidly. Genomes with a complex repeat structure such as H. sapiens give rise to larger and more densely connected subgraphs for which the corresponding CRF solution requires a longer runtime.
Recall that most of the accuracy gains of the CRF methodology are already obtained for s=1 (see Table 2). In that case, the CRF includes coverage information of, depending on the genome, 3â€“5 nodes and 6â€“15 arcs. Using all 24 cores of the AMD EPYC 7451 CPU, the multiplicity of all 254.7 million nodes in the H. sapiens de Bruijn graph (50Ã— coverage depth) can be inferred in roughly 1 h and 20 min, illustrating that the CRF methodology is applicable to even the most challenging genomes. For such largescale de Bruijn graphs, the use of higher values for s should likely be restricted to a subset of nodes, e.g. the nodes for which the multiplicity is ambiguous, to avoid excessive runtimes. Bacterial genomes do not suffer from this restriction. Even for s=5, the multiplicity of all 169 028 nodes in the de Bruijn graph of P. aeruginosa can be inferred in only 27.2 s, using 24 CPU cores.
Discussion
Many genome analysis tools such as de novo genome assemblers, read correction tools and variant callers rely on de Bruijn graphs to represent the underlying genomic sequence. To obtain an informative de Bruijn graph representation, it is important to accurately determine the multiplicities of its nodes and arcs. Nodes/arcs that are assigned multiplicity zero reveal sequencing errors, while higherorder multiplicities provide insight into the repeat structure of the graph.
We used conditional random fields (CRFs) to incorporate contextual information on neighbouring nodes/arcs in a de Bruijn graph and demonstrated that the CRF model significantly improves the accuracy with which multiplicities are assigned for a wide range of organisms and different coverage depths. For the specific subtask of classifying kmers into trusted and untrusted subsets, the CRF model outperformed two existing methods: one based on qmer histograms and one based on clustering of kmers by sequence similarity. Moreover, the use of CRFs in an EM setting provides for a robust estimation of the parameters of the distributions that underlie the kmer or qmer histogram.
Several improvements to the CRF model can be considered as future work. First, the CRF model currently does not take the ploidy of the underlying genome into account. In the case of a diploid organism such as H. sapiens, kmers that represent heterozygous variants should be assigned a multiplicity of one half. In the current implementation, the flow of conservation rule forces the multiplicity of one allele to one and the other allele to zero. Second, the computational performance of the method could likely be improved. We demonstrated that the use of neighbourhood size s=1 is feasible for human genome scale datasets, however, runtime increases rapidly for larger values of s. Several options exist. One may consider a dynamic selection of s, where larger values for s are used only for a subset of nodes/arcs whose accurate multiplicity determination is difficult (e.g. because of ambiguous coverage) or crucial (e.g. during repeat resolution) and resort to smaller values of s (even s=0) for the cases where the multiplicity can be unambiguously derived from the locally observed coverage. Alternatively, the use of approximate inference techniques could be researched. As opposed to the variable elimination algorithm, approximate inference techniques may not yield exact numerical results, but can be more computationally efficient.
Conclusions
In this paper, we provided a method to improve one particular subtask of sequence analysis using de Bruijn graphs. It remains to be demonstrated that more accurate multiplicity determination can also lead to improved practical bioinformatics tools. We plan to develop a de Bruijn graph cleaning tool that makes use of the multiplicities inferred by the CRF model. In combination with an accurate sequencetograph alignment tool, this should yield highly accurate short read or hybrid long read error correction tools. Finally, it would be interesting to investigate two what extent accurate multiplicity estimates could improve the repeat resolution step during de novo genome assembly.
Availability of data and materials
The data that support the findings of this study are publicly available. TableÂ 1 in the manuscript lists the data set identifiers and references for the data that supports the results in the manuscript. A C++11 implementation is available at https://github.com/biointec/detoxunder the GNU AGPL v3.0 license.
Abbreviations
 CRF:

Conditional random field
 EM:

Expectationmaximisation
 FN:

False negative
 FP:

False positive
 PGM:

Probabilistic graphical model
 pmf:

Probability mass function
 TN:

True negative
 TP:

True positive
References
Pevzner PA, Tang HX, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci. 2001; 98(17):9748â€“53. https://doi.org/10.1073/pnas.171285098.
Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KHJ, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai ZW, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng XQ, Rubin GM, Adams MD, Venter JC. A wholegenome assembly of Drosophila. Science. 2000; 287(5461):2196â€“204. https://doi.org/10.1126/science.287.5461.2196.
Simpson JT, Pop M. The Theory and Practice of Genome Sequence Assembly In: Chakravarti A, Green E, editors. Annual Review of Genomics and Human Genetics, vol 16: 2015. p. 153â€“72. https://doi.org/10.1146/annurevgenom090314050032.
Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014; 30(24):3506â€“14. https://doi.org/10.1093/bioinformatics/btu538.
Miclotte G, Heydari M, Demeester P, Rombauts S, de Peer Y, Audenaert P, Fostier J. Jabba: hybrid error correction for long sequencing reads. Algoritm Mol Biol. 2016; 11. https://doi.org/10.1186/s1301501600757.
Limasset A, Flot JF, Peterlongo P. Toward perfect reads: selfcorrection of short reads via mapping on de bruijn graphs. Bioinformatics. 2020; 36(5):1374â€“81.
Morisse P, Lecroq T, Lefebvre A. Hybrid correction of highly noisy long reads using a variableorder de Bruijn graph. Bioinformatics. 2018; 34(24):4213â€“22. https://doi.org/10.1093/bioinformatics/bty521.
Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, Wang Z, Rasko DA, McCombie WR, Jarvis ED, Phillippy AM. Hybrid error correction and de novo assembly of singlemolecule sequencing reads. Nat Biotechnol. 2012; 30(7):693. https://doi.org/10.1038/nbt.2280.
Deshpande V, Fung ED, Pham S, Bafna V. Cerulean: A hybrid assembly using high throughput short and long reads. In: International Workshop on Algorithms in Bioinformatics. Berlin: Springer: 2013. p. 349â€“69.
Antipov D, Korobeynikov A, McLean JS, Pevzner PA. HYBRIDSPADES: an algorithm for hybrid assembly of short and long reads. Bioinformatics. 2016; 32(7):1009â€“15. https://doi.org/10.1093/bioinformatics/btv688.
Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017; 13(6):1â€“22. https://doi.org/10.1371/journal.pcbi.1005595.
Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 2012; 44(2):226â€“32. https://doi.org/10.1038/ng.1028.
Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SRF, Wilkie AOM, McVean G, Lunter G, Consortium W. Integrating mapping, assembly and haplotypebased approaches for calling variants in clinical sequencing applications. Nat Genet. 2014; 46(8):912â€“8. https://doi.org/10.1038/ng.3036.
GATK Dev Team. HC step 2: Local reassembly and haplotype determination. 2015. https://software.broadinstitute.org/gatk/documentation/article.php?id=4146. Accessed 02 Jan 2018.
Narzisi G, Corvelo A, Arora K, Bergmann EA, Shah M, Musunuri R, Emde AK, Robine N, Vacic V, Zody MC. Genomewide somatic variant calling using localized colored de Bruijn graphs. Commun Biol. 2018; 1(1):20. https://doi.org/10.1038/s4200301800239.
Li SZ. Markov Random Field Modeling in Image Analysis. Tokyo: Springer; 2001.
Steyaert A, Audenaert P, Fostier J. Determining node/arc multiplicities in de Bruijn graphs using conditional random fields [version 1; not peer reviewed]. F1000Research. 2020; 9(ISCB Comm J). https://doi.org/10.7490/f1000research.1117849.1.
Yang X, Chockalingam SP, Aluru S. A survey of errorcorrection methods for nextgeneration sequencing. Brief Bioinforma. 2013; 14(1):56â€“66. https://doi.org/10.1093/bib/bbs015.
Kelley DR, Schatz MC, Salzberg SL. Quake: qualityaware detection and correction of sequencing errors. Genome Biol. 2010; 11(11). https://doi.org/10.1186/gb20101111r116.
Chaisson MJ, Pevzner PA. Short read fragment assembly of bacterial genomes. Genome Res. 2008; 18(2):324â€“30. https://doi.org/10.1101/gr.7088808.
Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008; 18(5):821â€“9. https://doi.org/10.1101/gr.074492.107.
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. SPAdes: A New Genome Assembly Algorithm and Its Applications to SingleCell Sequencing. J Comput Biol. 2012; 19(5):455â€“77. https://doi.org/10.1089/cmb.2012.0021.
Heydari M, Miclotte G, Demeester P, Van de Peer Y, Fostier J. Evaluation of the impact of Illumina error correction tools on de novo genome assembly. BMC Bioinformatics. 2017; 18. https://doi.org/10.1186/s1285901717848.
Pevzner PA, Tang H. Fragment assembly with doublebarreled data. Bioinformatics. 2001; 17(suppl_1):225â€“33.
Myers EW. The fragment assembly string graph. Bioinformatics. 2005; 21(suppl_2):79â€“85.
Medvedev P, Brudno M. Maximum Likelihood Genome Assembly. J Comput Biol. 2009; 16(8):1101â€“16. https://doi.org/10.1089/cmb.2009.0047.
Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques. Cambridge: MIT press; 2009.
Sutton C, McCallum A. An Introduction to Conditional Random Fields. Mach Learn. 2011; 4(4):267â€“373.
Chikhi R, Limasset A, Medvedev P. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics. 2016; 32(12):201â€“8. https://doi.org/10.1093/bioinformatics/btw279.
USA Food and Drug Administration. GenomeTrakr. https://www.fda.gov/food/wholegenomesequencingwgsprogram/genometrakrnetwork. (accessed 18 May 2020).
KÃ¶ser C, Fraser L, Ioannou A, Becq J, Ellington M, Holden M, Reuter S, TÃ¶rÃ¶k M, Bentley S, Parkhill J, Gormley N, Smith G, Peacock S. Rapid singlecolony wholegenome sequencing of bacterial pathogens. J Antimicrob Chemother. 2013; 69. https://doi.org/10.1093/jac/dkt494.
Billmyre KK, Doebley AL, Spichal M, Heestand B, Belicard T, SatoCarlton A, Flibotte S, Simon M, Gnazzo M, Skop A, Moerman D, Carlton PM, Sarkies P, Ahmed S. The meiotic phosphatase gsp2/pp1 promotes germline immortality and small rnamediated genome silencing. PLoS Genet. 2019; 15(3):1â€“26. https://doi.org/10.1371/journal.pgen.1008004.
Eberle M, Fritzilas E, Krusche P, KÃ¤llberg M, Moore B, Bekritsky M, Iqbal Z, Chuang HY, Humphray S, Halpern A, Kruglyak S, Margulies E, McVean G, Bentley D. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a threegeneration 17member pedigree. Genome Res. 2016; 27. https://doi.org/10.1101/gr.210500.116.
Nikolenko SI, Korobeynikov AI, Alekseyev MA. BayesHammer: Bayesian clustering for error correction in singlecell sequencing. BMC Genomics. 2013; 14. https://doi.org/10.1186/1471216414S1S7.
Acknowledgements
Not applicable.
Funding
AS is funded by a Ph.D. grant of the Flanders Research Foundation (File number: 1174621N).
Author information
Authors and Affiliations
Contributions
AS, PA and JF designed the research study and developed the tool and scripts. AS conducted the benchmark experiments. All authors wrote the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
A patent application has been submitted by imec vzw based on this methodology.
Additional information
Publisherâ€™s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Supplementary material: accurate determination of multiplicities of nodes and arcs in a de bruijn graph. This document contains a worked example of the Variable Elimination algorithm in which we also highlight how the CRF improves multiplicity assignments by using contextual information. It also contains our parameter estimation formulas as well as some additional Figures and Tables referred to in the main text. Finally we present an exploration of the influence of two important parameters in our methodology: the â€˜conservation of flow strengthâ€™ and the size of the subset used for EM training in stage 2.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the articleâ€™s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the articleâ€™s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Steyaert, A., Audenaert, P. & Fostier, J. Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields. BMC Bioinformatics 21, 402 (2020). https://doi.org/10.1186/s1285902003740x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285902003740x