Insertion and deletion correcting DNA barcodes based on watermarks
 David Kracht^{1}Email author and
 Steffen Schober^{1}
https://doi.org/10.1186/s1285901504827
© Kracht and Schober; licensee BioMed Central. 2015
Received: 7 May 2014
Accepted: 29 January 2015
Published: 18 February 2015
Abstract
Background
Barcode multiplexing is a key strategy for sharing the rising capacity of nextgeneration sequencing devices: Synthetic DNA tags, called barcodes, are attached to natural DNA fragments within the library preparation procedure. Different libraries, can individually be labeled with barcodes for a joint sequencing procedure. A postprocessing step is needed to sort the sequencing data according to their origin, utilizing these DNA labels. The final separation step is called demultiplexing and is mainly determined by the characteristics of the DNA code words used as labels.
Currently, we are facing two different strategies for barcoding: One is based on the Hamming distance, the other uses the edit metric to measure distances of code words. The theory of channel coding provides wellknown code constructions for Hamming metric. They provide a large number of code words with variable lengths and maximal correction capability regarding substitution errors. However, some sequencing platforms are known to have exceptional high numbers of insertion or deletion errors. Barcodes based on the edit distance can take insertion and deletion errors into account in the decoding process. Unfortunately, there is no explicit codeconstruction known that gives optimal codes for edit metric.
Results
In the present work we focus on an entirely different perspective to obtain DNA barcodes. We consider a concatenated code construction, producing socalled watermark codes, which were first proposed by Davey and Mackay, to communicate via binary channels with synchronization errors. We adapt and extend the concepts of watermark codes to use them for DNA sequencing. Moreover, we provide an exemplary set of barcodes that are experimentally compatible with common nextgeneration sequencing platforms. Finally, a realistic simulation scenario is use to evaluate the proposed codes to show that the watermark concept is suitable for DNA sequencing applications.
Conclusion
Our adaption of watermark codes enables the construction of barcodes that are capable of correcting substitutions, insertion and deletion errors. The presented approach has the advantage of not needing any markers or technical sequences to recover the position of the barcode in the sequencing reads, which poses a significant restriction with other approaches.
Keywords
Background
Due to the steadily increasing throughput on platforms for nextgeneration sequencing and dropping prices of commercially available devices, DNA sequencing becomes broadly accessible for researchers. Since the output of bases in each sequencing run has reached giga to tera orders within the last few years, strategies for efficiently sharing the sequencing capacity has become of particular interest. Multiplexed sequencing is a major key technique, that makes sequencing devices accessible in parallel: DNA samples from different experiments can be pooled into batches and sequenced in parallel in a single sequencing run. Before joining different samples it is mandatory to uniquely label the DNA fragments. DNA barcodes, artificially synthesized sequences of nucleic acids, are used as labels to tag the fragments and to separate the output of the sequencers according to the input samples.
The robustness of multiplexing in general relies on the properties of the used barcodes and how well they are adapted to the underlying sequencing protocol and platform. Essential experimental preprocessing steps, which are needed to prepare the DNA material can cause errors on the target genomic sequence and the barcodes as well. Physical and chemical sequence modifications, e.g. fragmentation, ligation, or copy procedures are known sources of such errors. These errors lead to crosstalk during demultiplexing, i.e., sequences from different batches can not clearly be distinguished, which is of course highly undesirable.
Different constructions for barcodes have been proposed, for example those of Hamady et al. [1] and Bystrykh [2] are based on Hamming codes [3] or the approach of Krishnan et al. [4] based on BCH codes [5], to name just a few. For short lengths it is even feasible to apply bruteforce search techniques, e.g. Frank [6] or Mir et al. [7], where some of the resulting codes of the latter approach even reach fundamental bounds. The constructions mentioned so far are designed to correct substitution errors only. From a conceptional point of view all of them try to provide codes that maximize the socalled Hamming distance between the individual code words. The Hamming distance between a pair of sequences measure the minimal number of symbolwise substitution that are needed to transform them into each other. But, some specific sequencing platforms are known to have exceptional high numbers of insertion and deletion errors as reported for Roche 454 Pyrosequencing [8], PacBio sequencers [9] or Ion Torrent PGM [10]. See [1113] for a comparison of these sequencing techniques. Hence, especially for insertion and deletion prone devices one has to consider barcodes that are capable to correct indels.
Promising attempts to find barcodes that are robust to indels have been considered in [14,15] using the socalled edit or Levenshtein distance (see [16] for an overview). For calculating the distance between code words the Levenshtein distance takes insertion and deletion operations into account and is therefore better suited for applications where decoding based on Hamming distance fails. Unfortunately, there is no codeconstruction known that directly gives optimal codes in edit metric. Some greedy (later evolutionary) algorithms has been proposed in [17,18] to find sets of barcodes of moderate size with high minimal edit distance, additionally fulfilling biological constraints. However, a practical decoding step for the obtained barcodes has not been addressed in the mentioned papers. This was later done by [19], where it is stated that maximizing the edit distance for barcodes (within a sequence context) is a suboptimal or even wrong strategy. The context of a code words, which is simply the sequence that contains the DNA tag plays an important role. Due to indels the exact boundaries of code words can not be correctly recovered. This leads to additional errors, if the sequence context was not included in the code construction. The DNA context at one end of a code word can be taken into account by using an adapted SequenceLevenshtein distance, as proposed in [19].
In this manuscript, we provide an entire different perspective to obtain barcodes. We give codes based on concepts introduced by Davey and Mackay [20]. The original watermark approach is aimed to synchronize and decode a continuous stream of large binary datablocks. In the domain of DNA codes we face additional constraints, for which the original concept is adapted. We finally give an exemplary set of barcodes and provide an in silico application, which shows that demultiplexing based on the watermark concept is applicable in the field of nextgeneration sequencing. Basic concepts of watermark coding has already been considered for data embedding in DNA [21,22], which is closely related to the barcoding approach for DNA sequencing. But, the transmission of biologically compatible sequences through an evolutionary channel (in living cells) is only slightly similar to the approach we consider in the present manuscript.
Note, that search approaches like [16,19] can be used to find better codes in terms of code rate and minimal (sequence) edit distance, but we see two striking advantages of the watermark concept for barcoding. First, the watermark concept contains an implicit synchronization technique, that does not need preambles or markers to find boundaries of code words within an unknown sequence. Embedding of barcodes in an unknown context is not generalized in approaches considering barcodes based on (sequence) edit distance. A twoended embedding of sequences is not reflected in this metric. Furthermore, we are able to give an optimal decoding procedure, adapted to a specific error channel. In short, this enables a maximum degree of freedom for existing as well as future experimental settings. Decoding speed is the second important aspect of multiplexing approaches, as the number of barcodes and the available readlength dramatically increased within the last few years. The decoding of barcodes based on watermarks also provides an efficient method for fast decoding of large scale multiplexing approaches.
Methods
The main concepts presented in this section originates from the ideas first given by Davey and MacKay [20]. The fact, that quaternary watermark codes can be applied for DNA sequencing was preliminary outlined in [23], but sequence constraint on practical oligonucleotides can not be found in this conference paper. Let us denote some basic facts about barcodes first and postpone specific sequence constraints to the section about reasonable encoder settings.

a quaternary sequence. Therefore we will propose generalized nonbinary models and concepts here (the original channel was considers strictly binary).

in general embedded in other sequences. Hence we give an adapted transmission scheme and a novel approach to detect barcode boundaries (in [20] the watermark pattern is repetitive and symbols are decoded as stream).

lengthlimited. That consequently restricts the number of codeconstructions for which an adaption is practical (long LDPC codes are use in the original paper, not plausible for short barcodes).
We will stress additional contributions to the work of Davey and MacKay, where needed. Let us start with a model for the sequencing process and continue with encoding and decoding based on watermarks.
Sequencing model
We will first define a communications theoretic model to formally describe the barcoding application.
Substitution model
A simplistic communication theoretic model for the sequencing of barcodes has been proposed in [7]. Namely, a fixed barcode word \(\mathbf b \in \mathcal A^{N}\), with \(\mathcal {A} = \{A,G,C,T\}\) as set of possible symbols (nucleic acids), is entering a communication channel with output \(\mathbf {r} \in \mathcal A^{N}\), where the channel itself is described by the conditional probabilities to receive a string y if x was sent, for \(\mathbf {x}, \mathbf {y} \in \mathcal A^{N}\).
For our purposes this model has to be extended into two directions: First, a barcode word b is assumed to be embedded into a randomly chosen context to obtain the sequence t (details will be given below). Second, the sequence t is sent over a socalled Sequencing Channel resulting in the received word r. The channel not only substitutes symbols from t, but is also able to delete or insert symbols, hence the length of r and t may differ.
Embedding of barcodes
Barcoding and working hypotheses
In this paragraph we like to focus on two different paradigm for sequence embedding in our barcoding approach: First we like to address the total embedding of barcodes. Well, for most sequencing scenarios, we find the barcodes next to fixed technical sequences, which are more reliable due to sequence specific (biological) reactions, e.g. primer or adapter oligonucleotides. For sequencing experiments, where we can rely on the exact knowledge of the position of a barcode at one end, highly optimized code words has been proposed in [19]. As an extension of our work is possible to include t _{pre} or t _{post} as partially known, nevertheless we want to restrict ourselves to have no prior knowledge about the context. The main gain from this very strict premise is a striking freedom of experimental setups for which we can apply our concepts. For example on platforms with paired end sequencing we are able to decode barcodes independent of the direction of reads not restricting the reads to start with a barcode symbol.
The second aspect is the assumption of inherent barcoded sequences. In this manuscript we focus on the problem of decoding (discrimination of code word sequences), conditioned on the fact that a code word is present in the multiplexed sequences. The problem of detecting a barcode (if it is not guaranteed that every sequence contains one) is a more challenging problem, which we want to avoid in the present assay. On codes based on sequence edit distance this extended problem is addressed for, e.g. the PacBio SMRT platform in [24]. We conjecture that the detection problem of barcodes based on watermarks can be solved in future investigations. Such investigation might also lead beyond the barcoding for sequencing applications, which we will address at the end of this manuscript. Nevertheless, we rely on barcoded samples for the following considerations.
Sequencing channel model
We define a very simplified model for sequence errors and discuss some aspects of oversimplification in the next paragraph. Let us describe the processes involved in sequencing as a memoryless quaternary channel, i.e. each symbol is handled independently of others. This channel model is specified by a set of parameters S, p _{ i } and p _{ d } that are integrated as follows: A transition matrix \(\mathbf S \in \ensuremath {\mathbb R}^{4\times 4}\) which describes the substitution probabilities, and p _{ i } and p _{ d } to specify insertion and deletion probabilities. The channel is modeled as an infinite statemachine on symbol level, in which a symbol t _{ i } is queued to pass the channel and therefore will undergo one of the following three events: With probability p _{ i }, the symbol t _{ i } remains in the queue and the received stream is prolonged with a random inserted symbols where we assume an uniform distribution on \(\mathcal A = \{A,G,C,T\}\). With probability p _{ d }, the actual queued symbol is deleted. With probability p _{ t }=1−p _{ i }−p _{ d } the symbol t _{ i } is passed to a substitution channel which substitutes the symbol t _{ i } according to the transition matrix S, with transition probabilities S(r _{ j },t _{ i })=P r(r _{ j }t _{ i }). In order to downsize the number of parameters in our model, we will consider S as symmetric 4ary substitution matrix later, i.e., we consider substitutions from a base into another with a single error parameter (similar to the model proposed by Jukes and Cantor [25]). Nevertheless, the model considered here would be able to mimic a refined channel, if exact empirical parameters can be considered.
Empirical parameters for a channel model
Empirical data about sequence errors can be seen as the crux of all HMM based approaches in the field of DNA sequencing, as predictions can only perform as good as the underlying assumptions. But, aside from advertising error rate of the big vendors of sequencing devices, real estimates are rather rare to find in literature. Further, there is no real agreement about a common technique to mine such data in a correct way, e.g, using sequence alignment to predict error count is recently in critic of overfitting, due to predefined alignment costs with impact on the estimates. The commutability of substitutions, insertion and deletion events (a substitution is equal to a deletion followed by an insertion) made things even more difficult. Additionally, there are some indications, that the sequencing channel is more complex as the model we utilize in our approach: Sequencing errors seem to be highly depend on the sequence context, with extended implications on the distribution of symbols in sequenced reads. For Illumina this was shown, e.g. in [26]. We know that the DNA polymerase molecule is prone to bursts of insertions and deletions, if for example repetitive symbols (homopolymers) are present in the physical template sequence. Note, that the reliability of the approach presented here is sensitive to empirical channel parameters to obtain best estimates for demultiplexing samples. A stepwise refinement driven by the feedback of experimental studies is mandatory to adapt watermark barcodes for the demands of different sequencing platforms.
Demultiplexing problem
Barcode construction on watermarks
The second block consist of an inner encoder, which works in a complete inverse direction. An inner code \(\mathcal C_{2}\) is used to create barcode words that have a low Hamming distance to a watermark sequence. The similarity of all barcodes to this watermark pattern is utilized to gain synchronization as explained below. The inner code adds redundancy to the code words by mapping each outer symbols \(d_{j}\in \mathbb F_{q_{1}}\) to a sparse sequence representation \( \mathbf s^{(j)} \in \mathbb Z_{4}^{n_{2}}\). The set of sequences s ^{(j)} with low mean Hamming weight (number of nonzero symbols) can be seen as inner code \(\mathcal C_{2}\). The cardinality of this inner code set is q _{1} and the rate of the inner code can be stated as \(R_{2}=\tfrac {\log _{2} (q_{1}) }{ n_{2} \log _{2} (4) }\). By joining n _{1} of these inner code words, a sparse inner block \(\mathbf s = \mathbf s^{(1)}\mathbf s^{(2)} \dotsb \mathbf s^{(n_{1})} \) of length N=n _{1} n _{2} is generated.
A barcode word \(\mathbf b= \mathbf s \oplus \mathbf w \in \mathcal A^{N}\) is obtained via a symbolwise adding of an arbitrary watermark sequence w to the sparse inner block s using a fixed mapping of \(\mathcal A=\{A,G,C,T\}\) onto \(\ensuremath {\mathbb Z}_{4}\), where addition is defined modulo 4 (explicit mappings in Additional file 1 section). The final set of barcodes is denoted as \(\mathcal C = \mathcal C_{2} \circ \mathcal C_{1}\) throughout this manuscript. The code gives a code rate \(R=\tfrac {k_{1}}{n_{2}} \tfrac {\log _{2}(q_{1})}{\log _{2}(4)}\).
Decoding
The decoder consists of two blocks (see Figure 2): An inner decoder \(\mathcal D_{2}\), which utilizes a hidden Markov model (HMM) and channel parameters to provide symbolwise likelihoods. These are fed into the outer decoder \(\mathcal D_{1}\) that performs a maximum likelihood decoding to obtain an estimate \(\hat {\mathbf c}\) of the sent code word. We will first define the HMM and explain how this can be used to find the likeliest transmitted sequence (optimal decoding). Afterwards we give a modified HMM, that enables to estimate the boundaries of embedded barcodes and end this section with a suboptimal symbolwise decoding approach with lowered complexity.
HMM for decoding
The basic idea of the HMM presented in this paragraph refers to considerations in [20]. To explain the function of the HMM for decoding it is helpful to ignore the random context and the embedding of code words initially. Therefore we assume t _{pre} and t _{post} to be absent and the transmitted sequence is exactly one barcode, i.e. t=t _{1} t _{2}··t _{ M=N }=b.
If no insertions or deletions occur in a channel the received sequence r=r _{1} r _{2}⋯r _{ L } is as long as the sent sequence t, i.e. L=M, but some symbols r _{ i } might differ from t _{ i }. Assuming that errors occur independent of the position and equally distributed, we can use fixed substitution probabilities to describe the channel.
For channels with insertion and deletion events the symbolwise fixed association \(t_{i} \leftrightsquigarrow r_{i}\) is usually lost. For example, for a single insertion event at the ith symbol, t _{ i } will be associated to r _{ i+1}. A single deletion event before transmitting the ithe symbol, will shift t _{ i } to the symbol r _{ i−1} in the received sequence. Obviously, such errors accumulate during the transmission.
One of the main problems of decoding is to estimate the number of insertions and deletions given a received sequence r. Therefore we define the drift x _{ i } at the ith transmit symbol as (# insertions)  (# deletions) that occurred in the received sequence before taking t _{ i } into account. The drifts {x _{ i }} can be seen as the hidden states of an HMM.
Optimal decoding
HMM to estimate barcode boundaries
Up to this paragraph we have not discussed the role of the watermark in the transmission. We have just illustrated an HMM able to give the likeliest received sequences r conditioned to a hidden transmit sequence t and parameters . We will now take a closer look at the watermark and how the inner encoding can be understood from a communication theoretic point of view (see Figure 2).
Recall that a barcode is constructed via addition of two quaternary sequences \(\mathbf s, \mathbf w \in \mathbb {Z}_{4}^{N}\) as b=s⊕w. Due to the symmetry of the addition, there are two ways to perceive a data transmission: Apparently there is the transmission of \(\mathbf s \overset {\mathbf w}{\to } \mathbf b\) with w causing some distortion. A further valid perspective of the transmission is \(\mathbf w \overset {\mathbf s}{\to } \mathbf b\), which takes w as source with s causing substitution errors. Here we stick to the last concept treating the symbols s _{ i } as independent and identically (iid) distributed errors on w (which is used to find the boundaries of barcodes).
Finding barcode boundaries
For embedded code words we can understand the symbol b _{ i } as shifted transmit symbol t _{ δ+i } and thus b _{ i } has to be linked to the observable r ^{(δ+i)} in the HMM (see section Embedding of barcodes). But we can easily integrate the sequence offset δ as initial drift x _{0}. For the symbols b _{ i } we therefore redefine the drift states {x _{ i }} for the HMMs according to embedded symbols.
Symbolwise likelihoods
Using this suboptimal inner decoding approach, we are able to decrease the computational costs to q _{1} n _{1} evaluations of the forward metric π (compare section Optimal decoding). As we need an additional soft decoding for the outer code, there are further operations needed: For a maximum likelihood approach we have to consider \({q_{1}^{k}}\) code words and search for the likeliest solution.
Simulations
In order to perform an in silico application of barcodes based on the watermark concept, we first have to define some reasonable setting for encoding, which is already quite challenging.
Reasonable encoder settings
As stated before (see section Barcode construction on watermark), there are different parameters, which influences the concepts and for which we have to find a reasonable setup to run simulations. First there is an outer codes, which should in combination with the inner coding lead to short barcode words, because we do not want to produce exceptional overhead with multiplexing (tagging) target fragments. There is the minimum distance of outer code words and the sparsity of the inner encoder, which can independently be characterized. And finally we have the watermark sequence, which can randomly be chosen. We utilize the degree of freedom with the watermark to incorporate with additional sequence constraint, that barcodes should fulfill to be experimentally valid. Therefore we run a greedy search for the watermark pattern that maximizes the number of barcodes that meet all sequence constraint. But let us consider all particular setting one by one in the following paragraphs.
Suitable outer codes
For the construction of barcodes we focus on a targetlength of N=n _{1} n _{2}∈[12,..,25] symbols with n _{1}∈{3,4,5,6} and n _{2}∈{4,5,6,7,8}. Further we limit the outer code \(\mathcal C_{1}\left [\mathbb F_{q_{1}},n_{1},k_{1},\text {d}_{H}\right ]\) to the best known linear codes listed in [28] for several cardinalities of Galois fields \(\mathbb F_{q_{1}}\), for which we considered q _{1}∈{2,3,4,5,7,8,9}. Long LDPC codes has been used in the original approach of Davey and MacKay (see [20]), but as the construction of short LDPC codes would be somehow confusing for readers involved in channel coding, we decided to use the best known linear codes. But we might note, that the hamming distance and the ability for soft decoding is the only demand on outer codes here and LDPC codes are likely to perform equivalent. The minimal Hamming distance d_{ H } of the best known linear codes we used is either maximal or the highest known regarding given parameters. Additionally we bound the dimension k _{1} to achieve code set sizes \(48 < \mathcal C_{1}=q_{1}^{k_{1}} < 1000\). This guarantees a certain minimal code rate on one side and limit the computational effort of the outer decoding the other side. We end up at 263 possible code configurations, but most of the resulting codes perform disastrous with the watermark concept, because the watermark is heavily corrupted by inner code words.
Sparsity of the inner code
Sequence constraints
We filtered for code words with unbalanced counts of symbols, to respect limitations on the GC content of barcodes. Such filtering can be seen as a de facto standard in the construction of barcodes (see for example [1,7,19]) and is related to technical constraints due to the preparation and the sequencing of genomic material. The relative frequency of a subset of two symbols should not be below 40% and above 60% in each barcode, otherwise we excluded the barcode. We furthermore exclude barcodes with prefect selfcomplementation and more than two sequential repetitions of the same base (homopolymer length), similar to the restrictions stated in [24]. We consider this settings as sufficient and strict enough to avoid experimental problems during the preparation in real sequencing tasks. Discarding such inappropriate code words means an additional loss in code rate.
Increasing the mean edit distance
Estimating the decoding error
To evaluate the refined set of 73 codes, we give the following demultiplexing scenario. For each code we consider an error curve according the estimates of the decoding error probability on different channel settings. The estimation of a single point in the error curve is based on a set of 200,000 barcodes, which we refer to as batch. Each batch is processed with a certain symbol mutation probability p _{ mut }∈[.01,.16]. This value determines a symbolwise probability for an edit operation that can be caused by our channel model. Similar approach has been considered in [19], but we like to be a bit more precise with the description of the modifications of the channel model. Buschmann et al. [19] considered a minimal mutation probability of 10^{−1} for their evaluation, but others claim, that error rates are several orders lower [11]. We will take such indications into account.
Each barcode from a batch is embedded in a random context of variable length (compare section Embedding of barcodes). We use normal distributed random variables (with mean μ _{ l }=50 and variance σ _{ l }=5) to determine the length of t _{pre} as well as t _{post}. We simply take the nearest nonnegative integer to define the length of the random context, which is uniformly distributed on {A,C,G,T}.
We further use a state machine to produce erroneous received sequences based on the following four events: correct transmission C, substitution S, insertion I and deletion D. In slightly different notation to the former model (see. Sequencing Channel) we assign probabilities to the events C and S and do not use conditional probability like a substitution matrix S.
The probability for correct transmission C or substitution S is equal to p _{ t } in the former representation. The probability for C equals 1−p _{ mut } and the probability for any of the error events (S,I or D) is p _{ mut }. Every error event S, I or D is equal likely. To obtain an equal distribution among the mentioned events, it is easier to use the present notation, but equivalent behavior can be approached with both versions of the channel statemachines.
To save decoding time we iteratively pass each transmit sequence through the state machine, until we have at least one error event within the barcode region. The probability of having a defective code word of length N is Pr(def)=1−(1−p _{ mut })^{ N }. We further considered an errorfree barcode to be decoded perfectly, i.e. \(\text {Pr}(\texttt {err}  \overline {\texttt {def}})=0\), with err denoting the event of decoding error. Please note that this oversimplifies the false positive rate introduced by sporadic similarities of the context with dedicated barcode words. The rate is supposed to grow linear with the context length and the size of barcodes set. Nevertheless, the probability for false positive events exponentially tend to zeros with the length N of the code words. For barcodes of length ≥12 embedded in 100 random symbols we consider this error as marginal offset for the estimated decoding errors. Our evaluation finally end up in estimating the conditional error Pr(errdef), which gives an estimate for the unconditioned error probability as P _{ e }=Pr(errdef)Pr(def).
Results and discussion
First we give a rough overview on meaningful properties of watermarkbased barcodes with the considered 73 parameterizations. Furthermore, we provide a refined analysis and evaluation for certain codes in the already defined decoding scenario. To minimize the textual elements in the figures, we use the following notation: q _{1}k _{1}n _{1}n _{2} indicates the concatenated coding using an outer code \(\mathcal C_{1}\left [\mathbb F_{q_{1}},n_{1},k_{1}\right ]\) and an inner code \(\mathcal C_{2}: \mathbb F_{q_{1}} \to \mathbb Z_{4}^{n_{2}}\). The particular codes can be found in the Additional file 1 section of this paper.
Properties of barcodes based on watermarks
There are distinct blocks, where the influence of the inner code can be separately examined. With fixing the outer code, e.g. q _{1}23n _{2} or q _{1}34n _{2} for q _{1}∈{7,8,9} and increasing the length n _{2} of inner code words, one can deduce how code rate is exchanged for lowered mean density and increased mean distance. However, we see configurations, for example 434n _{2} for n _{2}∈{4,5,6}, where we have not been able to increase the mean distance by prolonging the inner code. We have observed several clusters, where similar effects can be found.
For the codes q _{1}k _{1}4n _{2} or q _{1}k _{1}5n _{2} and q _{1}k _{1}6n _{2} the effect of the outer code can be separated. We find clusters of star symbols at different mean distances (see darkened areas in Figure 4). These levels can be explained through the different minimum Hamming distance of the outer codes. We have Hamming distances 2,3 and 4 for the present outer codes at n _{1} equals to 4,5 and 6. For concatenated coding it is known that the minimum distances of inner and outer codes are multiplicative [31]. As the edit metric is upper bounded by the Hamming distance, we anticipate the described levels for edit distances. The mentioned leveling can consistently be found for all outer codes.
According to the very strict filtering (see Sequence constraints), we had to prune out the sets of \(q_{1}^{k_{1}}\) outer code words, what additional lowers the code rate. A detailed description about the excluded barcodes can be found in the Additional file 1.
Evaluation of decoding
A reasonable tradeoff between errorcorrecting capability and cardinality is provided for example by codes with parameters 83n _{1}n _{2} (Figures 4, 5 and 6, in blue). Although we are facing relative low code rates (compared to approaches like [19]) ranging from 0.188 to 0.281, more than 400 sequences available with quite surprising decoding capability.
Decoding complexity
According to [20] we utilized a fast decoding approach with reduced complexity for our simulations. We bounded the maximal number of possible drift states {x _{ i }} in all HMMs to Δ∈{5,..,20} according to the suggestion given by Davey and Mackay. The complexity for decoding a single embedded barcode is \(\mathcal O(MLI + q_{1} N \Delta I + q_{1}^{k_{1}})\), with N denoting the length of the barcode, that is embedded in a received sequence of length M. Decoding is based on the assumption that the channel can introduce maximal I inserted symbols and we consider the maximal drift between received and transmitted sequence to be limited by Δ. An order of MLI calculation are needed to estimate barcode boundaries, n _{2} Δ I operations are needed to provide softvalues for each of the q _{1} n _{1} possible outer code words (result in q _{1} N Δ I) and \(q_{1}^{k_{1}}\) final comparisons has to be spent for soft decoding the outer code in the most expensive case (maximum likelihood).
The prototype decoder with which we ran our simulations is implemented in MATLAB. We further parallelized the decoding procedure and created jobs of 10^{6} received sequences, that were processed by single cores (Opteron, 2.6 GHz). The average length of received sequences was in a range of 112 and 150 symbols, resulting in an average processing time of 6 hours for the tasks with lowest calculation costs (code parameters 7234). The longest time we needed to complete demultiplexing of 10^{6} sequences (code parameters 9355) was strictly below 24 hours (on a single core).
Future directions
Apart from the theoretical considerations we have given in this manuscript, there are lot of future direction starting from this initial point. Some of them are mandatory to enable an application in real biological experiments, others are modifications of the concepts for extended applications.
Let us first address the essential steps needed to establish an HMM based decoding in real experiments: The HMM, as core of the decoding system, is the most sensible part of the concept. It is mandatory to run experiments to gather reliable data about all channels, the concepts should be used for. From our point of view there is a lack of reliable data about insertions, deletions and substitution errors for possible channel models. For the sequencing application we assume that different platforms show a variety of sequencing channels, additionally affected by experimental parameters, e.g. the extend of PCR preprocessing of samples. To obtain an optimal suited decoder, the HMM should be adapted to the considered channels. As most of the channels show a correlation of errors, more complex HMMs should be considered, reflecting a channel model with memory. Finally, it might be possible to establish a selfadaptive algorithm to parametrize the HMMs without any prior knowledge about the ratio of errors in the channel. A suitable statistic and refined calibration steps should be invented. Another important point for estimating the error characteristics is the construction of watermark codes. Exact empirical parameters of the channel could be incorporated in the design of watermark codes to improve decoding steps, suited for special channels.
Further aspects that could be considered with the given concepts are the following: Aside from the synchronization aspect in this manuscript, it seems very promising to use the maximum likelihood decoding method for other sequences than watermark codes. Conditioned on good empirical parameters for an underlying HMM one could consider a reliable detection of barcodes based on the SequenceLevenshtein distance in a probabilistic way rather than based on sequence alignment.
In the presented approach we focused on the discrimination of code words, assuming codes are always present in inspected sequences. The detection of code words within DNA context is another big issue that should be solved for future investigations using an HMM based decoding. Resent research shows that even for sequencing approaches the detection of barcodes is quite challenging. In [24] they focus on a specific problem with certain setups on the PacBio SMRT platform. Caused by technical reasons, sporadically barcodes are not present in the sequence data. Another interesting field of application of an HMM based sequence detection could be clonal studies, where the sequenced genome could or could not contain a predefined sequence, which was introduced in ancestor organisms.
Conclusion
We proposed an adaption of the watermark concept of [20] for DNA barcoding. A generalized channel model for sequencing and suitable modifications of the decoder were defined. Moreover, we investigated in a strategy to choose watermark sequences and inner codes in a reasonable way to enable barcoding in line with common experimental requirements. We provide a code construction, considering the best known linear codes as outer codes and biological sequence constraints to filter for suitable code words, resulting in an exemplary set of 73 different code sets ranging from 12 to 24 nucleotides. The codes are illustrated in a comprehensive scheme, highlighting watermark specific parameters as well as the mean edit distance, to give an impression how watermark based barcodes could be characterized. For a reduced set of codes we finally evaluated the demultiplexing of sequences in a realistic simulation scenario. Within this in silico evaluation we could show that barcodes based on watermarks can theoretically be used for multiplexing. It is remarkable, that even with very short watermark patterns we are able to reliably find the barcodes boundaries in order to discriminate different code words with an HMM based decoder. The probability of decoding errors, which finally leads to the undesired crosstalk phenomenon was found to be very low. Other approaches that investigate barcodes with large (sequence) edit distance [16,19] show significant higher code rates for shorter barcodes, but we have given an entirely different concept that allows for large scale multiplexing approaches, also able to handle insertion and deletion errors.
Moreover, we can provide the markerless synchronization based on watermarks, to recover the barcode boundaries. This synchronization concept provides an ultimate degree of freedom for experimental sequencing setups as well as for future applications, also apart from the sequencing context.
Declarations
Acknowledgements
David Kracht is funded by the Deutsche Forschungsgemeinschaft (DFG) under the grant BO 867/301 in the priority program SPP 1395. Steffen Schober is supported by the DFG grant SCHO 1576/1.
The authors like to thank Mahmoud Almarashli for fruitful discussions and assistance in preliminary considerations and managing preparative simulations.
Authors’ Affiliations
References
 Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R. Errorcorrecting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nat Methods. 2008; 5(3):235–7.View ArticlePubMedPubMed CentralGoogle Scholar
 Bystrykh LV. Generalized dna barcode design based on hamming codes. PLoS One. 2012; 7(5):36852.View ArticleGoogle Scholar
 Hamming RW. Error detecting and error correcting codes. Bell Syst Tech J. 1950; 29:147–60.View ArticleGoogle Scholar
 Krishnan AR, Sweeney M, Vasic J, Galbraith DW, Vasic B. Barcodes for dna sequencing with guaranteed error correction capability. Electron Lett. 2011; 47(4):236–7.View ArticleGoogle Scholar
 Lin S, Costello DJ. Error control coding, vol. 123. Englewood Cliffs, New Jersey: Prenticehall; 2004.Google Scholar
 Frank DN. Barcrawl and bartab: software tools for the design and implementation of barcoded primers for highly multiplexed dna sequencing. BMC Bioinformatics. 2009; 10(1):362.View ArticlePubMedPubMed CentralGoogle Scholar
 Mir K, Neuhaus K, Bossert M, Schober S. Short barcodes for next generation sequencing. PLoS One. 2013; 8(12):82933.View ArticleGoogle Scholar
 Gilles A, Meglécz E, Pech N, Ferreira S, Malausa T, Martin JF. Accuracy and quality assessment of 454 gsflx titanium pyrosequencing. Bmc Genomics. 2011; 12(1):245.View ArticlePubMedPubMed CentralGoogle Scholar
 Carneiro MO, Russ C, Ross MG, Gabriel SB, Nusbaum C, DePristo MA. Pacific biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics. 2012; 13(1):375.View ArticlePubMedPubMed CentralGoogle Scholar
 Bragg LM, Stone G, Butler MK, Hugenholtz P, Tyson GW. Shining a light on dark sequencing: characterising errors in ion torrent pgm data. PLoS Comput Biol. 2013; 9(4):1003031.View ArticleGoogle Scholar
 Loman NJ, Misra RV, Dallman TJ, Constantinidou C, Gharbia SE, Wain J, et al. Performance comparison of benchtop highthroughput sequencing platforms. Nat Biotechnol. 2012; 30(5):434–9.View ArticlePubMedGoogle Scholar
 Shendure J, Ji H. Nextgeneration dna sequencing. Nat Biotechnol. 2008; 26(10):1135–45.View ArticlePubMedGoogle Scholar
 Yang X, Chockalingam SP, Aluru S. A survey of errorcorrection methods for nextgeneration sequencing. Brief Bioinform. 2013; 14(1):56–66.View ArticlePubMedGoogle Scholar
 Adey A, Morrison HG, Xun X, Kitzman JO, Turner EH, Stackhouse B, et al. Rapid, lowinput, lowbias construction of shotgun fragment libraries by highdensity in vitro transposition. Genome Biol. 2010; 11(12):119.View ArticleGoogle Scholar
 Qiu F, Guo L, Wen TJ, Liu F, Ashlock DA, Schnable PS. Dna sequencebased “bar codes” for tracking the origins of expressed sequence tags from a maize cdna library constructed using multiple mrna sources. Plant Physiol. 2003; 133(2):475–81.View ArticlePubMedPubMed CentralGoogle Scholar
 Faircloth BC, Glenn TC. Not all sequence tags are created equal: designing and validating sequence identification tags robust to indels. PLoS One. 2012; 7(8):42543.View ArticleGoogle Scholar
 Ashlock D, Guo L, Qiu F. Greedy closure evolutionary algorithms. In: Computational intelligence, proceedings of the world on congress on, vol. 2. Piscataway: IEEE: 2002. p. 1296–301.Google Scholar
 Ashlock D, Houghten SK. A novel variation operator for more rapid evolution of dna error correcting codes. In: Computational intelligence in Bioinformatics and computational biology, 2005. CIBCB’05. Proceedings of the 2005 IEEE symposium on. Piscataway: IEEE: 2005. p. 1–8.Google Scholar
 Buschmann T, Bystrykh LV. Levenshtein errorcorrecting barcodes for multiplexed dna sequencing. BMC Bioinformatics. 2013; 14(1):272–73.View ArticlePubMedPubMed CentralGoogle Scholar
 Davey MC, Mackay DJC. Reliable communication over channels with insertions, deletions, and substitutions. Inf Theory IEEE Trans. 2001; 47(2):687–98.View ArticleGoogle Scholar
 Haughton D, Balado F. Biocode: Two biologically compatible algorithms for embedding data in noncoding and coding regions of dna. BMC Bioinformatics. 2013; 14(1):121.View ArticlePubMedPubMed CentralGoogle Scholar
 Haughton D, Balado F. A modified watermark synchronisation code for robust embedding of data in dna. In: Acoustics, speech and signal processing (ICASSP), 2013 IEEE international conference on. Piscataway: IEEE: 2013. p. 1148–52.Google Scholar
 Kracht D, Schober S. Using the daveymackay code construction for barcodes in dna sequencing. In: Turbo codes and iterative information processing (ISTC), 2014 8th international symposium on. Piscataway: IEEE: 2014. p. 142–6.Google Scholar
 Buschmann T, Zhang R, Brash DE, Bystrykh LV. Enhancing the detection of barcoded reads in high throughput dna sequencing data by controlling the false discovery rate. BMC Bioinformatics. 2014; 15(1):264.View ArticlePubMedPubMed CentralGoogle Scholar
 Jukes TH, Cantor CR. Evolution of protein moleculese. Mamm Protein Metab. 1969; 3:21–132.View ArticleGoogle Scholar
 Minoche AE, Dohm JC, Himmelbauer H. Evaluation of genomic highthroughput sequencing data generated on illumina hiseq and genome analyzer systems. Genome Biol. 2011; 12(11):112.View ArticleGoogle Scholar
 Rabiner L, Juang BH. An introduction to hidden markov models. ASSP Mag IEEE. 1986; 3(1):4–16.View ArticleGoogle Scholar
 Grassl M. Searching for linear codes with large minimum distance In: Bosma W, Cannon J, editors. Discovering mathematics with magma — reducing the abstract to the concrete. Algorithms and computation in mathematics, vol. 19. Heidelberg: Springer: 2006. p. 287–313.Google Scholar
 Briffa JA, Schaathun HG. Improvement of the daveymackay construction. In: Information theory and its applications, 2008. ISITA 2008. international symposium on. Piscataway: IEEE: 2008. p. 1–4.Google Scholar
 Levenshtein VI. Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys Doklady. 1966; 10(8):707–10.Google Scholar
 Forney GD. Concatenated codes, vol. 11. Cambridge: MIT Press; 1966.Google Scholar
 MacWilliams FJ, Sloane NJA. The theory of errorcorrecting codes, vol. 16. Amsterdam, Netherlands: Elsevier; 1977.Google Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.