Accurate determination of node and arc multiplicities in de Bruijn graphs using conditional random fields

Background: De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. (k+1)-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and the presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data.
Results: To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner.
Conclusions: We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than with existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.

Each factor represents a categorical distribution (also called a multinoulli distribution) over the different multiplicities: for each multiplicity (first column), its probability is provided (second column). In this case, all three factors put the highest belief in multiplicity 0.
The singleton factors are connected by flow factors that express the conservation of flow of multiplicity. In this particular example, the flow factors associated with node n_1 express that the arc multiplicity should equal the node multiplicity:

φ_flow(Y_n1, Y_a) = 1 if Y_a = Y_n1, and φ_flow(Y_n1, Y_a) = ε otherwise,

where ε is a value ≪ 1. This way, a high belief is expressed in arc and node multiplicity combinations that agree with the conservation of flow of multiplicity, and a low belief otherwise. The strength with which we impose conservation of flow of multiplicity is determined by ε. We then multiply all factors, marginalise over (i.e. sum out) the arc variables and normalise to obtain the marginal distribution of Y_n1:

P(Y_n1) = (1/Z) Σ_{Y_a1→2} Σ_{Y_a3→1} φ_n1(Y_n1) φ_a1→2(Y_a1→2) φ_a3→1(Y_a3→1) φ_flow(Y_n1, Y_a1→2) φ_flow(Y_n1, Y_a3→1),

where Z is a normalisation constant. If we were to multiply all factors before summing out the relevant variables, the resulting intermediate factor would become very large. The variable elimination algorithm instead eliminates these variables one by one. We first eliminate variable Y_a1→2: to calculate τ_1 we multiply the factors φ_flow and φ_a1→2 and then marginalise over Y_a1→2 as follows:

τ_1(Y_n1) = Σ_{Y_a1→2} φ_flow(Y_n1, Y_a1→2) φ_a1→2(Y_a1→2).

Next, we eliminate variable Y_a3→1. Similarly, we obtain for τ_2:

τ_2(Y_n1) = Σ_{Y_a3→1} φ_flow(Y_n1, Y_a3→1) φ_a3→1(Y_a3→1).

The marginal then becomes P(Y_n1) = (1/Z) φ_n1(Y_n1) τ_1(Y_n1) τ_2(Y_n1). Because all three singleton factors put a high belief in multiplicity 0 for n_1, we observe a strong belief in Y_n1 = 0 in the final probability.
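The elimination steps above can be sketched in a few lines of Python. The multiplicity range, the factor values and the value of ε are made up for illustration; this is not code from the detox implementation:

```python
EPS = 1e-7  # assumed value for epsilon (<< 1); in the pipeline this is a tunable

def marginal_node(phi_n, phi_a12, phi_a31, eps=EPS):
    """Marginal P(Y_n1) via variable elimination: the two arc variables are
    summed out one by one, then the result is normalised by Z.
    Each phi_* is a list of factor values indexed by multiplicity; the flow
    factor assigns 1 to agreeing (node, arc) pairs and eps otherwise."""
    k = len(phi_n)

    def eliminate(phi_arc):
        # tau(y_n) = sum_{y_a} phi_flow(y_n, y_a) * phi_arc(y_a)
        total = sum(phi_arc)
        return [phi_arc[y] + eps * (total - phi_arc[y]) for y in range(k)]

    tau1 = eliminate(phi_a12)  # eliminates Y_a1->2
    tau2 = eliminate(phi_a31)  # eliminates Y_a3->1
    unnorm = [phi_n[y] * tau1[y] * tau2[y] for y in range(k)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Three singleton factors that all put their highest belief in multiplicity 0:
posterior = marginal_node([0.8, 0.15, 0.05], [0.7, 0.2, 0.1], [0.9, 0.08, 0.02])
```

As in the worked example, the product of three factors that each favour multiplicity 0 concentrates nearly all posterior mass on Y_n1 = 0.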
Finding the optimal order in which variables should be eliminated is NP-hard. We always eliminate the variable that has the fewest neighbours (and hence results in the smallest possible intermediate factor). Although this ordering is not guaranteed to be optimal, it works well in practice.
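A minimal sketch of this greedy heuristic (our own illustration, not the detox implementation), which repeatedly eliminates the variable with the fewest remaining neighbours and connects those neighbours, mirroring the fill-in created when variable elimination forms the intermediate factor:

```python
def min_degree_order(adj):
    """Greedy elimination ordering: repeatedly pick the variable with the
    fewest remaining neighbours (smallest possible intermediate factor),
    then connect its neighbours into a clique.
    adj: dict mapping each variable to the set of its neighbours."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    order = []
    while adj:
        # fewest neighbours first; break ties deterministically by name
        v = min(adj, key=lambda u: (len(adj[u]), u))
        order.append(v)
        nbrs = adj.pop(v)
        for u in nbrs:
            adj[u].discard(v)
            adj[u] |= nbrs - {u}  # connect the remaining neighbours
    return order
```

On a chain of variables, this eliminates from the ends inward, which keeps every intermediate factor over a single neighbouring variable.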

Parameter estimation formulas
The singleton-factor parameters that we need to determine in the M-step of stage 2 are the following: the means λ and λ_0 of the negative binomial distributions (for true and erroneous nodes/arcs, respectively), the overdispersion factors f and f_0 that determine the variances of these negative binomials, and the weights w_i for all multiplicities. To calculate the values of these parameters we use the following values, derived from the data: we denote the (quality score weighted) k-mer coverage of a node and the (k + 1)-mer coverage of an arc as cov(n) and cov(a), respectively. Note that we use a de Bruijn graph representation in which linear chains of nodes are concatenated into unitigs; cov(n) is thus the average coverage of all k-mers in the concatenated node. The current multiplicity estimates for nodes and arcs are denoted by mult(n) and mult(a), respectively. Additionally, we denote the soft multiplicity assignments obtained during the E-step as π_m(n) = P(mult(n) = m) for nodes and π_m(a) = P(mult(a) = m) for arcs. Since nodes correspond to (concatenated) k-mers, while arcs correspond to (k + 1)-mers, the underlying distributions differ slightly. This is why all parameters are estimated separately for the nodes and the arcs of the de Bruijn graph.
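To make the parameterisation concrete, the following sketch evaluates a negative binomial pmf given a mean λ and an overdispersion factor f. It assumes, as is common, that f is defined so that the variance equals f·λ (the paper does not spell out this convention here, so the conversion p = 1/f, r = λ/(f − 1) is our own and requires f > 1):

```python
import math

def nb_pmf(c, lam, f):
    """Negative binomial pmf with mean lam and overdispersion factor f,
    assuming the variance equals f * lam (f > 1).
    Equivalent standard parameters: p = 1/f, r = lam / (f - 1)."""
    p = 1.0 / f
    r = lam / (f - 1.0)
    log_pmf = (math.lgamma(c + r) - math.lgamma(r) - math.lgamma(c + 1)
               + r * math.log(p) + c * math.log(1.0 - p))
    return math.exp(log_pmf)
```

With f → 1 the distribution approaches a Poisson with mean λ; larger f inflates the variance, which is what makes the negative binomial a better fit for sequencing coverage than the Poisson.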
To estimate the parameters of the negative binomial error distribution, we fit a negative binomial distribution to count data generated by binning the observed q-mer coverages. For coverage c ∈ ℕ, each node distributes its weight over the two integer bins adjacent to its coverage:

count(c) = Σ_n [(1 − π_n) 1(⌊cov(n)⌋ = c) + π_n 1(⌊cov(n)⌋ + 1 = c)],

where 1(x) is an indicator function such that 1(x) = 1 if x is true and 1(x) = 0 otherwise, and π_n = cov(n) − ⌊cov(n)⌋ (analogously, π_a = cov(a) − ⌊cov(a)⌋ for arcs). As we are dealing with truncated data, due to unobserved zero-counts as well as the removal of low-coverage nodes and arcs, we use an EM procedure as follows:
• In the E-step, we estimate pseudo-counts for the unobserved coverage values c ∈ {0, 1, . . . , T} (with T the preprocessing coverage threshold) as the expected counts under the current model, i.e. count(c) = NB(c; λ_0, f_0) / (1 − Σ_{c'=0}^{T} NB(c'; λ_0, f_0)) · Σ_{c''>T} count(c'').
• In the M-step, we estimate the weight parameter (w_0), and the mean (λ_0) and overdispersion factor (f_0) of the negative binomial distribution based on the observed and pseudo-counts.
The E- and M-steps are iterated until the parameter estimates no longer change significantly.
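A runnable sketch of such a truncated-data EM procedure. It assumes the same mean/overdispersion parameterisation (variance = f·λ) and uses a method-of-moments M-step; the function names, defaults and iteration count are illustrative only:

```python
import math

def nb_pmf(c, lam, f):
    # NB pmf with mean lam and variance f * lam (f > 1): p = 1/f, r = lam/(f-1)
    p, r = 1.0 / f, lam / (f - 1.0)
    return math.exp(math.lgamma(c + r) - math.lgamma(r) - math.lgamma(c + 1)
                    + r * math.log(p) + c * math.log(1.0 - p))

def fit_truncated_nb(counts, T, lam0, f0, n_iter=200):
    """EM for a negative binomial fitted to a coverage histogram whose
    bins 0..T are unobserved (zero-counts and low-coverage removal).
    counts: dict coverage -> (possibly fractional) count, for c > T."""
    obs = {c: v for c, v in counts.items() if c > T}
    n_obs = sum(obs.values())
    lam, f = lam0, f0
    for _ in range(n_iter):
        # E-step: pseudo-counts for c in {0, ..., T}, proportional to the
        # current NB restricted to the truncated range
        probs = [nb_pmf(c, lam, f) for c in range(T + 1)]
        tail = 1.0 - sum(probs)
        pseudo = {c: n_obs * probs[c] / tail for c in range(T + 1)}
        # M-step: method-of-moments update on observed + pseudo-counts
        merged = {**pseudo, **obs}
        n = sum(merged.values())
        mean = sum(c * v for c, v in merged.items()) / n
        var = sum(v * (c - mean) ** 2 for c, v in merged.items()) / n
        lam, f = mean, max(var / mean, 1.0 + 1e-6)
    return lam, f
```

On a histogram generated from a known negative binomial with the low bins removed, this iteration recovers the original mean and overdispersion closely.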
For the negative binomial distributions representing the true nodes and arcs, we determine the method of moments estimators for the mean and variance for each multiplicity (up to a certain tuneable multiplicity m*), after which we determine λ̂ and f̂ as a weighted average of these estimates. Note that the weights w_m are calculated only up to m*, while we might need weights for higher-order multiplicities during the E-step. We define w_m = w_{m*} for all m > m*.

Table 1: Confusion matrices for different organisms (real data), coverage depths (10× and 50×) and neighbourhood sizes (s = 0 and s = 5). Each column contains all nodes with a specific true multiplicity and shows the distribution of the estimated multiplicities. For H. sapiens, 10^6 nodes were randomly sampled from the graph.

[Figure caption, partially recovered] The different classes of errors are: a) sequencing error labelled as unique region (0 → 1); b) unique region labelled as sequencing error (1 → 0); c) repeated region labelled as sequencing error (≥ 2 → 0); d) sequencing error labelled as repeated region (0 → ≥ 2); e) other nodes labelled one too low (other, −1); f) other nodes labelled one too high (other, +1); g) other nodes labelled more than one too low (other, −2); h) other nodes labelled more than one too high (other, +2).
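The weighted-average combination of the per-multiplicity estimates can be sketched as follows. The combination rule shown is our own illustration: it assumes the coverage of a multiplicity-m node has mean m·λ and variance f·(m·λ), so each class contributes mean_m/m to λ̂ and var_m/mean_m to f̂; the exact weighting in detox may differ:

```python
def combine_moments(stats, weights):
    """Combine per-multiplicity method-of-moments estimates into one (lam, f).
    stats:   dict m -> (mean_m, var_m), the estimates for nodes softly
             assigned multiplicity m (m >= 1)
    weights: dict m -> w_m, the multiplicity weights
    Assumes coverage at multiplicity m has mean m * lam and variance
    f * mean_m, so each class contributes mean_m / m and var_m / mean_m."""
    w_tot = sum(weights[m] for m in stats)
    lam = sum(weights[m] * stats[m][0] / m for m in stats) / w_tot
    f = sum(weights[m] * stats[m][1] / stats[m][0] for m in stats) / w_tot
    return lam, f
```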

Influence of parameter values
Based on simulated datasets we explore the influence of several parameters that can be user-defined in our pipeline. All experiments were run for two different neighbourhood sizes, 3 and 6, as well as for neighbourhood size 0, which determines multiplicities based on the negative binomial mixture model alone.
The 'flow conservation strength' parameter is defined as f = 1/ε, with ε the value that the flow factors assign to a multiplicity combination for which conservation of flow does not hold. Figure 1 shows the accuracy obtained using different values for this parameter, i.e. f = 10, 10^3, 10^5, 10^7, 10^9. To obtain these results, we used the full set of nodes and arcs to train the model in stage 2, so that no variance could arise from different training subsets. As can be seen in Figure 1, multiplicity determination accuracy increases as we impose a higher 'flow conservation strength'. However, for values f = 10^5 and higher the accuracy gain is minimal. Based on these results we set the default value of this parameter to 10^7.
The influence of the size of the subset used for EM training on the variability of the negative binomial parameter estimates, and on the accuracy obtained in stage 3, is shown in Figures 2 and 3. We performed EM training with subsets of 100, 1000, 10^4 and 10^5 nodes and arcs. From these results we conclude that it is not necessary to train EM on the full dataset to obtain consistent multiplicity assignments in stage 3. We can choose a subset small enough that the running time of the whole pipeline is dominated by stage 3, while still obtaining negative binomial parameter estimates that are representative of the whole dataset.