Due to its very high theoretical storage density of 455 exabytes per gram (using 2 bits per nucleotide) [1] and its extraordinary longevity, deoxyribonucleic acid (DNA) is a promising medium for long-term and high-density storage of important data. However, since DNA as a biological medium is susceptible to a variety of mutations and read/write errors, and the cost of synthesizing and sequencing DNA is still a decisive factor, it is essential to use adequate coding schemes to correct errors and to avoid unstable or error-prone DNA sequences when digital data is stored in DNA.
Several coding schemes have been proposed to correct read/write errors and avoid error-prone DNA sequences. For example, Church et al. [1] as well as Goldman et al. [2] used different overlapping strategies to map digital data into DNA strands and support error correction. In these coding schemes, each input symbol can be represented by more than one DNA base, which makes it possible to avoid error-prone patterns such as long repeats of a single base (so-called homopolymers). Heckel et al. [3] proposed an index-based coding scheme that stores data packets annotated with their corresponding index, without further modification of the data. Furthermore, near-optimal rateless erasure codes (NORECs), also called fountain codes, are particularly interesting coding methods for DNA storage. For example, Erlich and Zielinski used the Luby transform (LT) code to achieve high-capacity and low-error DNA storage [4]. In their work, they leveraged the special property of fountain codes of being able to generate a theoretically infinite number of different packets for a given input in order to find packets that satisfy the restrictions defined for their DNA storage approach. Since LT is the most basic NOREC, there is a large untapped potential for improvement in using NORECs for DNA storage.
Ping et al. [5] developed Chamaeleo, a framework that provides multiple DNA storage codes. While Ping et al. focus on a variety of conventional (non-fountain) codes presented in the literature, they also include the LT implementation used by Erlich and Zielinski. In contrast to our work, Chamaeleo does not provide means to change, adapt, or integrate modified or new error rules. Furthermore, since it lacks framework-wide error simulation, it does not permit an extensive comparison of how usable the implemented methods are for real DNA storage.
In this paper, we present NOREC4DNA, a software framework to use, test, compare, and improve coding schemes for DNA storage. While NOREC4DNA focuses on fountain codes, regular coding schemes can be compared as well. Besides multiple extensible restriction rules, we implemented the three most common fountain codes, namely LT [6], Online [7], and Raptor [8]. The contributions of this paper are as follows:
- We present a novel framework based on NORECs, called NOREC4DNA, to flexibly generate sequences for DNA storage that satisfy user-defined restrictions.
- NOREC4DNA allows detailed comparisons of various coding schemes for storing data in DNA.
- We show that NORECs belong to the most suitable codes for DNA storage; Raptor codes yield the best results across all tested metrics.
The paper is organized as follows. “Near-optimal rateless erasure codes” section gives an overview of the fountain codes and technologies used in NOREC4DNA. “Implementation” section presents the design and implementation of NOREC4DNA. We present experimental results generated using NOREC4DNA in “Results” section. Finally, “Conclusion” section concludes the paper and outlines areas of future work.
Near-optimal rateless erasure codes
NORECs can be used to generate a theoretically infinite number of coded symbols (in practice, some limitations apply). Furthermore, only \((1+\epsilon ) \cdot n\) encoded symbols, i.e., only slightly more than the n source symbols, have to be correctly received to fully reconstruct the original information. Since it does not matter which symbols are received (and in which order), as long as a sufficient number of symbols arrives, these codes are also known as fountain codes: just like a bucket under a fountain, the decoder does not care which drops of water it collects.
Typically, fountain codes are applied as follows. First, a distribution function is used to determine a degree for each packet. Each packet is then filled by XORing as many randomly selected chunks as its degree specifies, and the packets are transmitted to the recipient over an erasure channel. After receiving a sufficient subset of the transmitted packets, the receiver reconstructs the original data from the received packets and an indication of the chunks contained in them. To reconstruct the encoded data, the receiver either has to know the list of chunks mixed into a given packet (e.g., as part of the transmission) or has to know the distribution function as well as the seed used to initialize the random number generator during encoding.
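The following minimal Python sketch illustrates the seed-based variant; the function and parameter names are hypothetical and do not correspond to the actual NOREC4DNA API. Because encoder and decoder call the same function with the same seed and distribution function, both derive the same set of chunk indices:

```python
import random
from typing import Callable, Set

def derive_chunk_indices(seed: int, num_chunks: int,
                         degree_dist: Callable[[random.Random], int]) -> Set[int]:
    """Derive the chunk indices mixed into a packet from its seed.

    Encoder and decoder call this with the same seed and distribution
    function and therefore obtain the same index set.
    """
    rng = random.Random(seed)          # deterministic PRNG initialized with the seed
    degree = degree_dist(rng)          # degree drawn from the distribution function
    return set(rng.sample(range(num_chunks), degree))  # chunks chosen without repetition

# Example with a hypothetical uniform degree distribution over 1..4
uniform_dist = lambda rng: rng.randint(1, 4)
assert derive_chunk_indices(42, 10, uniform_dist) == derive_chunk_indices(42, 10, uniform_dist)
```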
Luby transform codes
The LT code [6] proposed by Luby is considered to be the most fundamental and pioneering fountain code. LT divides the original file into n equally long ‘chunks’ that are then combined into packets using the XOR operator, as shown in Fig. 1.
The encoding process can be summarized as follows:
- Choose the degree d from the probability distribution function p(.).
- Select d distinct chunks uniformly at random.
- Generate the resulting packet \(t_i\) by combining all selected chunks using the XOR operator.
- Add reconstruction information (the selected chunk numbers or the seed) to the packet.
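A minimal sketch of this encoding loop is shown below; it assumes equally long byte-string chunks and a caller-supplied degree distribution, and none of the names correspond to the actual NOREC4DNA implementation:

```python
import random
from typing import Callable, List, Tuple

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def lt_encode_packet(chunks: List[bytes], seed: int,
                     degree_dist: Callable[[random.Random], int]) -> Tuple[int, bytes]:
    """Create one LT packet: draw a degree (>= 1), pick that many distinct
    chunks uniformly at random, and XOR them together. The seed is returned
    as reconstruction information."""
    rng = random.Random(seed)
    degree = min(degree_dist(rng), len(chunks))
    selected = rng.sample(range(len(chunks)), degree)
    packet = chunks[selected[0]]
    for idx in selected[1:]:
        packet = xor_bytes(packet, chunks[idx])
    return seed, packet
```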
Luby presents two distribution functions for his LT coding scheme [6]. The first distribution function, called ‘IdealSoliton’ distribution, was proven by Luby to be mathematically ideal. It operates on integers from 1 to N, with N as the only parameter of the function. It is defined as follows:
$$\begin{aligned} \begin{aligned} \displaystyle \rho (1)&={\frac{1}{N}},\\ \displaystyle \rho (k)&=\frac{1}{k(k-1)}\qquad (k=2,3, \ldots ,N).\, \end{aligned} \end{aligned}$$
(1)
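For illustration, Eq. (1) can be implemented directly; this is a sketch, not the NOREC4DNA code:

```python
def ideal_soliton(N: int) -> list:
    """Return [rho(1), ..., rho(N)] according to Eq. (1)."""
    return [1.0 / N] + [1.0 / (k * (k - 1)) for k in range(2, N + 1)]

# The probabilities sum to 1: 1/N + sum_{k=2}^{N} 1/(k(k-1)) = 1/N + (1 - 1/N) = 1
assert abs(sum(ideal_soliton(100)) - 1.0) < 1e-9
```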
As shown in Fig. 2a, this function has a single mode at degree 2 and then flattens out towards the maximum degree N. Since N is the only selectable parameter, neither the position of the mode nor its value \(\rho (2) = 0.5\) can be changed.
According to Luby, this function is quite fragile and thus not suitable for practical applications. Therefore, he proposed a robust form of this distribution function, called ‘RobustSoliton’ distribution, which adds a set of values \(\tau (i)\) to the IdealSoliton distribution (see Eq. (2)), boosting the probability of degree 1 and introducing an additional peak. In addition to the parameter N already defined for the IdealSoliton distribution, two additional parameters K and \(\delta\) are introduced. While K with \(K < N\) defines the integer position of the additional peak, \(\delta\) describes the expected (real-valued) error probability.
$$\begin{aligned} \begin{aligned} \displaystyle \tau (i)&={\frac{1}{iK}},\qquad \qquad (i=1,2, \ldots ,K-1),\,\\ \displaystyle \tau (i)&={\frac{\ln (R/\delta )}{K}},\qquad (i=K),\,\\ \displaystyle \tau (i)&=0,\qquad \qquad \quad (i=K+1, \ldots ,N).\,\\ \displaystyle \text {with }&R = N/K\,\\ \end{aligned} \end{aligned}$$
(2)
Finally, as shown in Eq. (3), the values for \(\tau (i)\) are added to \(\rho (i)\) and normalized afterwards.
$$\begin{aligned} \begin{aligned} \mu (i) = \dfrac{\rho (i)+\tau (i)}{\sum _{j=1}^{N}(\rho (j)+\tau (j))} \end{aligned} \end{aligned}$$
(3)
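The RobustSoliton distribution follows directly from Eqs. (1)–(3); the sketch below uses the parameter names N, K, and \(\delta\) as introduced above and is not the NOREC4DNA implementation:

```python
import math

def robust_soliton(N: int, K: int, delta: float) -> list:
    """Return [mu(1), ..., mu(N)] according to Eqs. (1)-(3)."""
    R = N / K
    rho = [1.0 / N] + [1.0 / (i * (i - 1)) for i in range(2, N + 1)]   # Eq. (1)
    tau = [1.0 / (i * K) for i in range(1, K)]                         # i = 1, ..., K-1
    tau += [math.log(R / delta) / K]                                   # i = K
    tau += [0.0] * (N - K)                                             # i = K+1, ..., N
    total = sum(r + t for r, t in zip(rho, tau))                       # normalization constant
    return [(r + t) / total for r, t in zip(rho, tau)]                 # Eq. (3)
```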
Figure 2b shows the influence of the individual parameters on the two distribution functions. While the choice of N determines the maximum degree, the parameter K determines the position of the additional peak. With increasing \(\delta\), the probability mass shifts towards degree 2: small amounts of probability are taken from the second peak and from all other degrees \(i > 2\) and added to the mode at degree 2.
The packets generated during encoding can be decoded without prior sorting. For decoding, packets are first collected and then combined with each other (as far as possible) such that the degree of the involved packets is reduced in each step. An illustration of decoding a file with three chunks is shown in Fig. 3. As shown in Fig. 4, the decoder can still reconstruct the information if packet 2 gets lost.
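This iterative reduction (often called ‘peeling’ or belief-propagation decoding) can be sketched as follows; a packet is represented here as a pair of the set of chunk indices it contains and its XORed payload, and the helper is illustrative rather than the NOREC4DNA API:

```python
from typing import Dict, List, Set, Tuple

def peel_decode(packets: List[Tuple[Set[int], bytes]], num_chunks: int) -> Dict[int, bytes]:
    """Repeatedly use degree-1 packets to reduce all remaining packets."""
    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    solved: Dict[int, bytes] = {}
    packets = [(set(idx), data) for idx, data in packets]
    progress = True
    while progress and len(solved) < num_chunks:
        progress = False
        for idx_set, data in packets:
            if len(idx_set) == 1:                     # a degree-1 packet solves one chunk
                chunk_id = next(iter(idx_set))
                if chunk_id not in solved:
                    solved[chunk_id] = data
                    progress = True
        reduced = []                                  # reduce all packets by solved chunks
        for idx_set, data in packets:
            for chunk_id in list(idx_set):
                if chunk_id in solved:
                    idx_set.discard(chunk_id)
                    data = xor(data, solved[chunk_id])
            reduced.append((idx_set, data))
        packets = reduced
    return solved  # may be incomplete if no degree-1 packet is left
```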
Thanks to the special construction of fountain codes based on the XOR operator, the decoding of all NOREC methods can be represented as a linear equation system \(Ax = b\), as shown in Fig. 5. This ensures that ambiguities during the decoding process, as shown in the previous example, can be avoided and thus the packets can be optimally reduced to the original message.
In A, the 1s in row i indicate which chunks were combined into packet i by the XOR operation. An example is shown in Eq. (4).
$$\begin{aligned} \begin{aligned} A[i]&= \begin{pmatrix} 0&1&0&0&1&1 \end{pmatrix}&\Leftrightarrow \\ b[i]&= \veebar _{j = 0}^{|A[i]|-1} A[i,j] \cdot Chunk[j] = Chunk[1] \veebar Chunk[4] \veebar Chunk[5] \end{aligned} \end{aligned}$$
(4)
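A compact sketch of solving such a system over GF(2) by Gauss–Jordan elimination is given below; each row of A is stored as a set of chunk indices together with its XORed payload, and equal-length chunks as well as a solvable system are assumed:

```python
from typing import List, Optional, Set, Tuple

def solve_gf2(rows: List[Tuple[Set[int], bytes]], num_chunks: int) -> Optional[List[bytes]]:
    """Solve A x = b over GF(2): each row is (set of chunk indices, XORed payload)."""
    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    rows = [(set(r), d) for r, d in rows]
    pivots: List[Optional[int]] = [None] * num_chunks    # pivot row chosen per chunk/column
    for col in range(num_chunks):
        # find a row that still contains this chunk and is not yet used as a pivot
        pivot = next((i for i, (r, _) in enumerate(rows)
                      if col in r and i not in pivots), None)
        if pivot is None:
            return None                                  # not enough packets received
        pivots[col] = pivot
        # eliminate the chunk from every other row (row addition over GF(2) = XOR)
        for i, (r, d) in enumerate(rows):
            if i != pivot and col in r:
                rows[i] = (r ^ rows[pivot][0], xor(d, rows[pivot][1]))
    # after elimination, the pivot row of column c contains only chunk c
    return [rows[pivots[c]][1] for c in range(num_chunks)]
```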
Online codes
Online codes proposed by Maymounkov [7] improve LT codes. They address the main problem of LT codes: LT codes cannot guarantee that after \((1+\epsilon ) \cdot n\) generated packets, all n parts of the original message have been encoded sufficiently often. This problem is a manifestation of the coupon collector’s problem [9]. If a chunk of the original message has been encoded less frequently (e.g., only once or not at all), it is much more vulnerable during decoding, since the loss of the packets containing this chunk cannot be compensated. Since fountain codes pseudo-randomly combine several chunks into packets, the loss of a single packet can also prevent numerous chunks from being reconstructed. To prevent this problem, Online codes follow a two-stage approach. First, auxiliary (AUX) blocks are created in the so-called outer-encoding; together with the message chunks, they are then encoded into packets during the inner-encoding step. In particular, during the outer-encoding process, \(M = \lceil 0.55 \cdot q \cdot \epsilon \cdot F\rceil\) AUX packets are created, where F denotes the number of chunks of the original file. Then, each chunk is randomly mixed into q different AUX packets using XOR [7]. Like the chunks of the file, the resulting M AUX packets are treated as regular inputs in the inner-encoding step. In this phase, a packet is generated by randomly selecting a degree d from the distribution function; then d unique, uniformly selected elements from the union of the set of chunks and AUX blocks are mixed into this packet. This step is repeated indefinitely or until a predefined condition occurs, e.g., reaching a certain number of created packets or successfully decoding with a decoder running simultaneously.
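The outer-encoding step (creating the AUX packets) can be sketched as follows, using the formula for M stated above; the function name, the fixed seed, and the zero-initialization of the AUX packets are illustrative assumptions:

```python
import math
import random
from typing import List

def create_aux_packets(chunks: List[bytes], q: int, eps: float, seed: int = 0) -> List[bytes]:
    """Outer encoding of Online codes: every chunk is XORed into q of the
    M = ceil(0.55 * q * eps * F) auxiliary packets, where F = len(chunks)."""
    rng = random.Random(seed)
    F = len(chunks)
    M = math.ceil(0.55 * q * eps * F)
    chunk_len = len(chunks[0])
    aux = [bytes(chunk_len) for _ in range(M)]           # start with all-zero AUX packets
    for chunk in chunks:
        for aux_id in rng.sample(range(M), q):           # q distinct AUX packets per chunk
            aux[aux_id] = bytes(x ^ y for x, y in zip(aux[aux_id], chunk))
    return aux
```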
The process of decoding encoded content in Online codes is similar to the one described for LT codes. However, the decoding phases must be carried out in the reverse order of the encoding process.
If the mapping of chunks to AUX packets used during encoding is already available at the beginning of the decoding (e.g., after the first packet), the decoder can reduce the AUX packets and map them to the correct chunks already during the actual transmission. A reduction of the example shown in Fig. 6 is illustrated in Fig. 7.
Raptor codes
Raptor codes developed by Shokrollahi [8] are the first fountain codes with (theoretically) linear encoding and decoding time. As the name Raptor (for rapid tornado) suggests, they are based on tornado codes, which can correct erasures up to a fixed error rate. Tornado codes require a constant number of additional blocks compared to Reed–Solomon codes [10], but are much faster in encoding and decoding. By now, there are several variants of the original method, e.g., R10 [11] or the RaptorQ code [12]. Depending on the procedure, purpose, and area of use, various patents may apply [13, 14].
Similar to the previously mentioned Online codes, Raptor encoding is based on a two-step process consisting of an outer- and an inner-encoding. While the inner-encoding (just like the inner-decoding) consists of an LT code, the outer-encoding of the Raptor code consists of one or a series of erasure codes with a fixed rate. A possible procedure for this outer-encoding phase is an encoding using a Gray sequence followed by a low-density parity-check (LDPC) code. Alternatively, a Hamming code or any other fixed-rate erasure code can be used or combined. This approach combines the advantages of fixed-rate coding with codes of the NOREC class.
In contrast to the fountain codes described so far, the Raptor encoding has a (theoretical) fixed upper limit on the number of possible chunks. This upper limit also limits the maximum degree of a packet. For this reason, Raptor codes use a fixed (parameter-free) distribution function shown in Eq. (5). This function uses a random number from the range \([1, 2^{20} = 1048576]\) to determine the degree of each packet.
$$\begin{aligned} \begin{aligned} f(x)= {\left\{ \begin{array}{ll} 1 &{} \text {for } x< 10241\\ 2 &{} \text {for } x< 491582\\ 3 &{} \text {for } x< 712794\\ 4 &{} \text {for } x< 831695\\ 10 &{} \text {for } x< 948446\\ 11 &{} \text {for } x < 1032189\\ 40 &{} \text {else} \end{array}\right. } \end{aligned} \end{aligned}$$
(5)
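Eq. (5) translates directly into a threshold lookup over the random value \(v \in [1, 2^{20}]\); the following sketch merely transcribes the table above:

```python
def raptor_degree(v: int) -> int:
    """Map a random value v in [1, 2**20] to a packet degree according to Eq. (5)."""
    thresholds = [(10241, 1), (491582, 2), (712794, 3),
                  (831695, 4), (948446, 10), (1032189, 11)]
    for bound, degree in thresholds:
        if v < bound:
            return degree
    return 40
```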
Data is encoded in two steps. First, the original data is used to create additional information for reconstruction using codes with a fixed rate. Second, the generated information is encoded into many packets using the LT coding technique. Although various erasure codes can be used for the first step in practice, the combination of a Gray code with a subsequent LDPC code is most common. This can be explained by the simplicity of the Gray code and the properties of LDPC codes. With this combination (and especially since LDPC codes operate close to the Shannon capacity), it is possible to successfully reconstruct a message with a very small packet overhead (for 1000 chunks, approximately 1-2 additional packets are created using the non-systematic approach) with a nearly 100% chance. The number of intermediate blocks to be generated in the first step of the Raptor code is calculated by the formulas shown in Eq. (6) and depends on the number of chunks.
$$\begin{aligned} \begin{aligned} f(k) =&x \in {\mathbb {N}} \text { with } x \cdot (x-1) \ge 2 \cdot k \text { and }\\&\nexists j < x \text { with } j \cdot (j-1) \ge 2k\\ g(k) =&x \in {\mathbb {N}} \text { with } x \text { is prime and }\\&x \ge \lceil 0.01 \cdot k + f(k) \rceil \text { and }\\&\nexists j \text { with } j \text { is prime and }\\&j < x \text { and } j \ge \lceil 0.01 \cdot k + f(k) \rceil \\ h(k) =&x \in {\mathbb {N}} \text { with } \left( {\begin{array}{c}x\\ \lceil \frac{x}{2}\rceil \end{array}}\right) \ge f(k) + g(k) \text { and }\\&\nexists j \text { with } j < x \text { and } \left( {\begin{array}{c}j\\ \lceil \frac{j}{2}\rceil \end{array}}\right) \ge f(k) + g(k)\\ \end{aligned} \end{aligned}$$
(6)
While f(k) is only an auxiliary function for g(k), the function g(k) calculates the number of intermediate blocks to be created using the Gray code, and the function h(k) computes the number of intermediate blocks to be generated by LDPC. The sum \(l = k+g(k)+h(k)\) indicates the total number of intermediate blocks after the first step. The \(k+g(k)\) blocks are then used as inputs to generate the h(k) LDPC-encoded blocks. This ensures that all chunks and all previously generated intermediate blocks are captured during LDPC encoding. If the first phase of the implemented Raptor encoding consists of more than two fixed-rate codes, the number of intermediate blocks to be generated must be determined for each of these codes; then, according to the desired order, the chunks together with the preceding intermediate blocks serve as input for the following code.
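The block counts of Eq. (6) can be computed as in the following sketch, which is a direct transcription of the equations as stated above (smallest values satisfying the respective conditions), not the NOREC4DNA implementation:

```python
import math

def f(k: int) -> int:
    """Smallest x with x*(x-1) >= 2k (auxiliary value for g)."""
    x = 1
    while x * (x - 1) < 2 * k:
        x += 1
    return x

def g(k: int) -> int:
    """Smallest prime >= ceil(0.01*k + f(k)): number of Gray-code blocks (per Eq. (6))."""
    def is_prime(n: int) -> bool:
        return n > 1 and all(n % d for d in range(2, int(math.isqrt(n)) + 1))
    x = math.ceil(0.01 * k + f(k))
    while not is_prime(x):
        x += 1
    return x

def h(k: int) -> int:
    """Smallest x with C(x, ceil(x/2)) >= f(k) + g(k): number of LDPC blocks (per Eq. (6))."""
    target = f(k) + g(k)
    x = 1
    while math.comb(x, math.ceil(x / 2)) < target:
        x += 1
    return x

def total_intermediate_blocks(k: int) -> int:
    """Total number of intermediate blocks l = k + g(k) + h(k)."""
    return k + g(k) + h(k)
```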
The intermediate blocks generated in the first step, together with the original chunks (i.e., \(l = k+g(k)+h(k)\) blocks in total), are then encoded into packets using the LT code. The standard procedure is to select the degree of each packet with the predefined distribution function. In contrast to the fountain codes described so far, the chunks mixed into a packet are not selected uniformly at random, but according to a particular algorithm, which may vary depending on the version of the Raptor code.
Figure 8 shows the result of a run of this rateless encoding. The first packet is generated with the parameters \(d=4, a=5, b=1\): starting with the first chunk (\(b = 1\)), the index is repeatedly updated via \(b = (b + a) \bmod 11\) (skipping values that do not correspond to a valid chunk), which selects the fifth, the fourth, and the second chunk.
This procedure limits the number of allowed input chunks (and thus also the final size of the individual packets) to \(2^{20} - x\), where x is the number of intermediate blocks to be generated in the first step. To avoid this limitation, the authors of the Raptor encoding suggest in RFC 5053 [11] generating ‘sub-blocks’ in addition to the chunks (referred to as ‘source blocks’). These sub-blocks are processed by the algorithm as a separate run. To enable decoding, only the current chunk number and the number of sub-chunks used must be known. This allows each chunk to be encoded separately, offering a selection of almost any number of subdivisions.
The reconstruction of packets encoded by Raptor codes is quite similar to Online decoding. The mapping of each intermediate block must be generated in exactly the same way as during encoding. The decoder has to know the number of chunks, the fixed-rate codes, and their order. While the number and sequence of most implementation steps are standardized and can therefore be treated as given, a sender must transmit the number of original chunks (or sub-blocks). If the corresponding information is known, the decoding of the Raptor code can also be reduced to a linear equation system and solved using either Gaussian elimination [15, 16] or belief propagation [17, 18].
Common errors in DNA storage systems
To leverage the rateless property of the described codes, NOREC4DNA includes a variety of rules defining error probabilities of DNA sequences.
Mutations Depending on the synthesis or storage method, the individual DNA strands are subject to different mutations and mutation probabilities depending on their characteristics. Although all four bases are susceptible to simple mutations, there are differences in the effect, recoverability, and frequency of these mutations. One of the most common mutations is the formation of uracil from cytosine, which, like thymine, complements adenine and would thus produce a different sequence when sequenced later. A similar defect is the oxidation of guanine and the associated formation of 8-oxoguanine, which can bind to both cytosine and adenine and can therefore lead to an error. In addition to such substitutions, insertions and deletions might lead to further modifications of a DNA sequence.
Homopolymers Polymers that contain longer repetitions of the same base are referred to as homopolymers. These fragments are highly unstable during synthesis, polymerase chain reaction (PCR), and subsequent sequencing. On the one hand, due to how next-generation sequencing systems detect the presence of a nucleotide, homopolymers are difficult to sequence correctly [19]. On the other hand, a so-called ‘slippage’ of the enzyme in the region of the homopolymer might occur during PCR (and thus during synthesis and sequencing). Longer homopolymers greatly increase the probability of such an error and should therefore be avoided.
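Such a restriction can be expressed as a simple check on the longest run of identical bases; the following sketch is illustrative, and the threshold and function names are assumptions rather than NOREC4DNA defaults:

```python
def longest_homopolymer(sequence: str) -> int:
    """Length of the longest run of identical bases in the sequence."""
    if not sequence:
        return 0
    longest = current = 1
    for prev, base in zip(sequence, sequence[1:]):
        current = current + 1 if base == prev else 1
        longest = max(longest, current)
    return longest

def violates_homopolymer_rule(sequence: str, max_run: int = 3) -> bool:
    """Flag sequences containing homopolymers longer than max_run bases."""
    return longest_homopolymer(sequence) > max_run
```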
GC Content The GC content of a DNA strand as shown in Eq. (7) indicates the proportion (i.e., frequency) of the bases guanine and cytosine with respect to the length of the full strand:
$$\begin{aligned} \begin{aligned} \text {GC Content} = \dfrac{|G|+|C|}{|G|+|C|+|A|+|T|} \cdot 100\% \text {, }&\text {where |}\star \text {| is the frequency of} \\&\text {base} \star \text {in a given sequence.} \end{aligned} \end{aligned}$$
(7)
Since the base pair A-T always forms two hydrogen bonds, whereas the pair G-C always forms three hydrogen bonds, and GC pairs tend to be thermodynamically more stable due to stacking interactions, errors such as abortions and mutations occur statistically more frequently in sequences with low GC content [20, 21]. This effect can be observed in nature, where thermophilic organisms have a significantly higher GC content than comparable species. Depending on the organism, the GC content in nature varies between approximately 20% and almost 80%. To design a stable, synthesizable, and sequenceable DNA storage, a balanced GC content of 40–60% is advantageous. This applies to both the overall sequence and the GC content per window.
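Eq. (7) and the windowed variant mentioned above can be checked as in the following sketch; the window size and the 40–60% bounds are illustrative assumptions:

```python
def gc_content(sequence: str) -> float:
    """GC content in percent according to Eq. (7)."""
    return 100.0 * sum(base in "GC" for base in sequence) / len(sequence)

def gc_in_bounds(sequence: str, low: float = 40.0, high: float = 60.0,
                 window: int = 50) -> bool:
    """Check the GC content of the overall sequence and of every window."""
    if not low <= gc_content(sequence) <= high:
        return False
    return all(low <= gc_content(sequence[i:i + window]) <= high
               for i in range(0, max(1, len(sequence) - window + 1)))
```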
Micro-satellites Also known as ‘short tandem repeats’, micro-satellites are sequences of two to six base pairs that are repeated many times in a row within a DNA sequence. Although micro-satellites occur approximately one million times in human DNA, they are still considered unstable due to problems during the sequencing and reproduction process. The most common forms of these error sources are di- and trinucleotide repeats.
Belonging to the class of micro-satellites, dinucleotide repeats consist of a long repetition of two base pairs. The reason for mutations is the high chance of ‘slippage’ during the PCR process: the newly synthesized strand forms a loop, which is not cut out during the repair attempt but instead leads to an extension of the strand. Since these locations are not recognized by the proof-reading (correction) mechanism, they often lead to unrecoverable errors. The more often a sequence repeats itself, the higher the probability that loops form during PCR and subsequently change the DNA.
Trinucleotide repeats are similar to dinucleotide repeats and are likewise based on a repetition of nucleotide sequences; however, as the name implies, repetitions of three nucleotides are considered.
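Di- and trinucleotide repeats can be detected with a rule like the following sketch; the motif lengths and the repetition threshold are illustrative assumptions and do not reflect the NOREC4DNA defaults:

```python
def longest_tandem_repeat(sequence: str, motif_len: int) -> int:
    """Largest number of directly consecutive repetitions of any motif of the given length."""
    best = 0
    for start in range(len(sequence) - motif_len + 1):
        motif = sequence[start:start + motif_len]
        count = 1
        pos = start + motif_len
        while sequence[pos:pos + motif_len] == motif:   # count back-to-back copies of the motif
            count += 1
            pos += motif_len
        best = max(best, count)
    return best

def contains_microsatellite(sequence: str, max_repeats: int = 4) -> bool:
    """Flag di- or trinucleotide repeats with more than max_repeats repetitions."""
    return any(longest_tandem_repeat(sequence, n) > max_repeats for n in (2, 3))
```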