
# Efficient and low-complexity variable-to-variable length coding for DNA storage

*BMC Bioinformatics*
**volume 25**, Article number: 320 (2024)

## Abstract

### Background

Efficient DNA-based storage systems offer substantial capacity and longevity at reduced costs, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) a maximum of *h* consecutive identical bases (homopolymer constraint *h*), and 2) a GC ratio within \([0.5 - c_{GC}, 0.5 + c_{GC}]\) (GC content constraint \(c_{GC}\)). Sequencing and synthesis errors tend to increase when these constraints are violated.

### Results

In this research, we address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining complexity linear in the block length and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when \(h = 4\) and \(c_{GC} = 0.05\), the rate reached 1.988, close to the theoretical limit of 1.990. The associated code is available on GitHub.

### Conclusion

We propose a variable-to-variable-length encoding method that does not rely on concatenating short predefined sequences and achieves near-optimal rates.

## Background

In recent years, the rapid growth of data generated by social media, autonomous driving, and other internet-based industries has outpaced the capabilities of traditional storage technologies. DNA-based storage has emerged as a promising next-generation solution due to its high storage capacity, low maintenance costs, and long-term preservation without power consumption [1,2,3,4]. However, two critical constraints must be addressed for DNA storage: the homopolymer constraint, which limits runs of identical bases in the DNA sequence to at most *h* to avoid sequencing errors [5], and the GC content constraint, which requires the ratio of G and C bases to lie within \([0.5 - c_{GC}, 0.5 + c_{GC}]\) to prevent erroneous behavior during the polymerase chain reaction (PCR) process.

Numerous coding algorithms for DNA storage have been proposed to tackle these constraints. Goldman et al. [6] presented a 3-ary Huffman code that satisfies the homopolymer constraint but does not consider the GC content constraint. Song et al. [7] proposed a concatenation method that first designs short codes meeting the homopolymer constraint, then concatenates them into an extended code fulfilling both constraints. Similarly, Wang et al. [8] devised a short run-length constrained encoding that adheres to the homopolymer and GC content constraints, and then concatenates short sequences to generate the output DNA sequence. However, short-code concatenation results in a lower coding rate. For instance, with a homopolymer constraint of h = 3 and a GC content constraint of [0.4, 0.6], Song et al.'s approach yields 1.9 bits per nucleotide [7] and Wang et al.'s method delivers 1.917 bits per nucleotide [8], falling short of the theoretical optimum of 1.983 bits per nucleotide [2]. The gap arises because cascading short codes optimizes locally and discards codewords that would be admissible globally: to satisfy the GC content constraint, the concatenation method must restrict itself to fewer fragments, reducing the amount of binary data each fragment can represent. In other words, compared with a long-code approach, short-code concatenation requires longer outputs to convey the same information. Mishra et al. [9] introduced the minimum variance Huffman tree method for encoding binary data into DNA sequences. Park et al. [10] employed randomization to meet the GC content constraint and used a greedy algorithm to build a mapping addressing the homopolymer constraint; however, their approach does not fully utilize DNA base combinations.
Notably, JPEG-DNA [11] was proposed to compress images lossily under DNA storage constraints, mainly following the standard JPEG and utilizing DC and AC quantization indicators to match GC content constraints. Despite these efforts, most previous works have high encoding complexity or rely on short code concatenation which leads to sub-optimal rates.

In this paper, we present a novel variable-to-variable length scheme for DNA storage source coding that efficiently maps binary data to DNA sequences while adhering to both homopolymer and GC content constraints. Our algorithm tackles these constraints separately, encoding the input sequentially. First, a variable-to-fixed length encoder (homopolymer encoder) is applied to the binary input, generating a DNA sequence satisfying the homopolymer constraint by mapping every two bits to a nucleotide in a Markov manner, thus preventing runs of \(h + 1\) identical bases and satisfying the homopolymer constraint *h*. Subsequently, a straightforward fixed-to-variable length encoder (GC content encoder) adjusts the GC ratio by adding dummy bases. This method maintains linear complexity and achieves a high encoding rate, approaching the theoretical limit.

However, the proposed GC content encoder can result in long dummy sequences, potentially causing errors during the DNA sequence reading and writing process. To mitigate lengthy output DNA sequences, we incorporate an XOR encoder, applying XOR operations to the input binary sequences. Specifically, both the encoder and the decoder share four predefined pseudorandom binary sequences. For a given input sequence \({\textbf {X}}\), the encoder selects the pseudorandom sequence \({\textbf {Z}}\) that yields the shortest output DNA sequence and encodes \({\textbf {X}}\oplus {\textbf {Z}}\). The encoder also utilizes an additional nucleotide to indicate which pseudorandom sequence is being used.

Finally, we demonstrate the effectiveness of our proposed scheme through a series of experiments using both randomly generated data and existing files. These experiments allow us to rigorously test the performance of our encoder under various conditions and data types. The results highlight that our proposed encoder not only achieves a nearly optimal rate but also generates output DNA sequences with consistent lengths, thanks to the incorporation of the XOR encoder.

It is important to acknowledge that DNA storage is subject to other practical constraints, such as minimum Hamming distance [12, 13], error correcting capabilities [14,15,16], avoiding undesired motifs [17,18,19], and primer constraints [20, 21]. Nonetheless, our algorithm specifically targets the source coding problem under homopolymer and GC content constraints, with the aim of achieving optimal encoding as proposed by Erlich and Zielinski [2].

## Materials and methods

### Problem formulation

We focus on variable-to-variable length encoding, which maps a binary input sequence to a DNA output sequence. Specifically, we assume the input binary sequence \({\textbf {X}}\in \{0, 1\}^* = \cup _{k=1}^\infty \{0, 1\}^k\) can have any finite length and is generated following an i.i.d. Bernoulli(1/2) distribution. We implicitly presume that the binary input is already compressed using a suitable compression algorithm, such as gzip. The corresponding output sequence can also have any finite length of DNA bases, with the alphabet represented by \(\mathcal{Y}= \{\texttt {A},\texttt {C}, {\texttt {G}}, {\texttt {T}}\}\). A variable-to-variable length encoder *f* processes the first \(\ell _X\) bits from \({\textbf {X}}\) and generates a DNA sequence \({\textbf {Y}}\in \mathcal{Y}^*\) with a length of \(\ell _Y\). As both \(\ell _X\) and \(\ell _Y\) are random variables, our objective is to maximize the expected rate \(R = \frac{{\mathbb {E}\left[ \ell _X\right] }}{{\mathbb {E}\left[ \ell _Y\right] }}\) bits per nucleotide (bits/nt).

Due to biological limitations, the encoder’s output \({\textbf {Y}}= f({\textbf {X}})\) must satisfy both homopolymer and GC content constraints. Specifically, the maximum homopolymer run-length of \({\textbf {Y}}\) should not exceed *h*. Furthermore, the GC ratio of \({\textbf {Y}}\) must fall within the range \([0.5-c_{GC}, 0.5+c_{GC}]\). Our goal is to identify an encoder that maximizes the expected rate *R* while satisfying the homopolymer and GC content constraints.
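Both constraints are straightforward to check mechanically. The following sketch (illustrative; the function name is ours, not part of the released codec) verifies a candidate output sequence:

```python
import re

def satisfies_constraints(seq: str, h: int, c_gc: float) -> bool:
    """Check a DNA sequence against the two biological constraints.

    seq  : string over the alphabet {A, C, G, T}
    h    : maximum allowed homopolymer run length
    c_gc : allowed deviation of the GC ratio from 0.5
    """
    # Homopolymer constraint: no run of h + 1 identical bases.
    if re.search(r"(.)\1{%d}" % h, seq):
        return False
    # GC content constraint: ratio within [0.5 - c_gc, 0.5 + c_gc].
    gc_ratio = (seq.count("G") + seq.count("C")) / len(seq)
    return 0.5 - c_gc <= gc_ratio <= 0.5 + c_gc
```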

### Algorithm

The proposed encoder *f* comprises two encoders: the homopolymer encoder \(f_H\) and the GC content encoder \(f_{GC}\). Initially, the homopolymer encoder \(f_H\) operates as a variable-to-fixed length encoder, mapping the variable-length binary sequence \({\textbf {X}}\) to a fixed-length DNA sequence \({\textbf {U}}\) (i.e., \(f_H({\textbf {X}}) = {\textbf {U}}\)), where the target sequence length of \({\textbf {U}}\) is predetermined by \(\ell _U\). The primary objective of \(f_H\) is to generate a DNA sequence that complies with the homopolymer constraint. Subsequently, the GC content encoder \(f_{GC}\) functions as a fixed-to-variable length encoder, mapping the fixed-length DNA sequence \({\textbf {U}}\) to a DNA sequence \({\textbf {V}}\) with length \(\ell _V\) (i.e., \(f_{GC}({\textbf {U}}) = {\textbf {V}}\)). Finally, the XOR encoder is applied to produce the ultimate DNA output sequence \({\textbf {Y}}\), with details provided in Section XOR Encoder. Note that the proposed scheme does not encompass the encoding of DNA sequence indexes.

#### Homopolymer encoder

The core concept of the proposed homopolymer encoder \(f_H\) is to map every 2 bits to a single nucleotide, unless doing so would violate the homopolymer constraint. Specifically, the homopolymer encoder \(f_H\) processes 2 bits \(x_1x_2\) from binary data \({\textbf {X}}\) at each step and maps them to the corresponding base \(U_i = M(x_1x_2)\), where \(M(00) = {\texttt {A}}\), \(M(01) = {\texttt {C}}\), \(M(10) = {\texttt {G}}\), and \(M(11) = {\texttt {T}}\). However, if the last *h* bases are equal to base *B* (i.e., \(U_{i-h} = \dots = U_{i-1} = B\) at step \(i\)), then the next nucleotide generated by \(f_H\) should not be *B* again (i.e., \(U_i \ne B\) is required). In such cases, \(f_H\) takes a single bit \(x_1\) from binary data \({\textbf {X}}\) and maps it to a base other than *B*. If \(B \in \{{\texttt {A}}, {\texttt {T}}\}\), then the next base \(U_i = M_{GC}(x_1)\) is either \({\texttt {C}}\) or \({\texttt {G}}\), where \(M_{GC}(0) = {\texttt {C}}\) and \(M_{GC}(1) = {\texttt {G}}\). It is important to note that we limit \(U_i\) to two possibilities. For instance, when \(U_{i-h} = \dots = U_{i-1} = {\texttt {T}}\), the next base is either \({\texttt {C}}\) or \({\texttt {G}}\), instead of being selected from \(\{{\texttt {C}}, {\texttt {G}}, {\texttt {A}}\}\). This is because: 1) it offers a low-complexity scheme, since a single bit maps to a single base, and 2) it also balances the GC ratio, which we discuss in the following subsection. Similarly, if \(U_{i-h} = \dots = U_{i-1} = B \in \{{\texttt {G}}, {\texttt {C}}\}\), then \(U_i = M_{AT}(x_1)\), where \(M_{AT}(0) = {\texttt {A}}\) and \(M_{AT}(1) = {\texttt {T}}\). We repeat this process \(\ell _U\) times, after which \(f_H\) returns a DNA sequence \({\textbf {U}}= U_1\dots U_{\ell _U}\) of length \(\ell _U\). This procedure is outlined in Algorithm 1.
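The steps above can be sketched in a few lines of Python (an illustrative sketch, not the released implementation; the names are ours, and the function assumes the bit string is long enough):

```python
# Fixed mapping tables from the construction above.
M = {"00": "A", "01": "C", "10": "G", "11": "T"}
M_GC = {"0": "C", "1": "G"}  # next base after a run of A's or T's
M_AT = {"0": "A", "1": "T"}  # next base after a run of G's or C's

def homopolymer_encode(bits: str, ell_u: int, h: int):
    """Map binary data to a length-`ell_u` DNA sequence with no run
    longer than h. Returns (DNA string, number of bits consumed).
    Assumes `bits` holds at least 2 * ell_u bits."""
    out, pos = [], 0
    for _ in range(ell_u):
        if len(out) >= h and len(set(out[-h:])) == 1:
            # Last h bases are identical: consume one bit and pick a
            # base from the opposite (A/T vs G/C) group.
            table = M_GC if out[-1] in "AT" else M_AT
            out.append(table[bits[pos]])
            pos += 1
        else:
            out.append(M[bits[pos:pos + 2]])
            pos += 2
    return "".join(out), pos
```

For example, an all-zero input with \(h = 3\) produces `AAA`, then a forced `C` from the single-bit table, and then resumes two-bit steps.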

#### GC content encoder

The GC content encoder \(f_{GC}\) takes a DNA sequence \({\textbf {U}}\) of length \(\ell _U\) as input and adds dummy bases to meet the GC content constraint \([0.5-c_{GC}, 0.5+c_{GC}]\). Specifically, it adds \({\texttt {G}}\)’s and \({\texttt {C}}\)’s if the GC ratio is too low, or \({\texttt {A}}\)’s and \({\texttt {T}}\)’s if the GC ratio is too high. The objective is to minimize the length of the dummy sequence while still adhering to the homopolymer constraint. Let \(p_{GC}({\textbf {U}})\) represent the GC ratio of \({\textbf {U}}\). If \(p_{GC}({\textbf {U}}) < 0.5 - c_{GC}\), a dummy sequence \({\textbf {D}}\) consisting of \({\texttt {G}}\)’s and \({\texttt {C}}\)’s is generated to compensate for the GC ratio. The length of \({\textbf {D}}\) is the minimum \(\ell _{add}\) that satisfies

$$\frac{p_{GC}({\textbf {U}})\, \ell _U + \ell _{add}}{\ell _U + \ell _{add}} \ge 0.5 - c_{GC}.$$

The output of \(f_{GC}\) is a concatenation of \({\textbf {U}}\) and \({\textbf {D}}\), i.e., \({\textbf {V}}= \text{ CONCAT }({\textbf {U}}, {\textbf {D}})\). Note that \({\textbf {D}}\) can be either \({\texttt {G}}{\texttt {C}}{\texttt {G}}\dots \) or \({\texttt {C}}{\texttt {G}}{\texttt {C}}\dots \) depending on the last base of \({\textbf {U}}\), ensuring that \({\textbf {V}}\) satisfies the homopolymer constraint. Similarly, if \(p_{GC}({\textbf {U}}) > 0.5 + c_{GC}\), a dummy sequence \({\textbf {D}}\) composed of \({\texttt {A}}\)’s and \({\texttt {T}}\)’s can be generated accordingly. This procedure is outlined in Algorithm 2.
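A minimal sketch of this padding step (illustrative; the alternating tail starts with the base that differs from the last base of \({\textbf {U}}\), so the homopolymer constraint is preserved):

```python
def gc_content_encode(u: str, c_gc: float) -> str:
    """Append an alternating dummy tail so the GC ratio of the output
    falls in [0.5 - c_gc, 0.5 + c_gc]. Alternating bases keep every
    run in the tail at length 1."""
    n_gc = sum(b in "GC" for b in u)
    n, add = len(u), 0
    if n_gc / n < 0.5 - c_gc:
        # GC ratio too low: pad with G/C until it clears the bound.
        while (n_gc + add) / (n + add) < 0.5 - c_gc:
            add += 1
        pair = "CG" if u[-1] == "G" else "GC"
    elif n_gc / n > 0.5 + c_gc:
        # GC ratio too high: pad with A/T (the GC count stays fixed).
        while n_gc / (n + add) > 0.5 + c_gc:
            add += 1
        pair = "TA" if u[-1] == "A" else "AT"
    else:
        return u
    return u + (pair * add)[:add]
```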

#### XOR encoder

The two-stage encoder comprising the homopolymer encoder \(f_H\) and the GC content encoder \(f_{GC}\) offers a straightforward and effective encoding strategy. However, the nature of a variable-to-variable length encoder can result in output sequences \({\textbf {V}}\) of varying lengths. For instance, if we set \(\ell _U = 150\), most output sequences have lengths between 150 and 155, but some extreme outliers are longer than 165 (see Table 5). This non-uniform length may cause errors during reading and writing. Furthermore, DNA synthesis companies, such as Twist Bioscience, often employ a tiered pricing model based on the maximum sequence length. Under such circumstances, it becomes crucial to minimize this maximum sequence length. To this end, we introduce a straightforward approach, the XOR encoder, specifically designed to achieve this minimization.

Initially, we pre-generate four sufficiently long random binary sequences \({\textbf {Z}}_A\), \({\textbf {Z}}_C\), \({\textbf {Z}}_G\), \({\textbf {Z}}_T\), according to i.i.d. Bernoulli(1/2), which are provided to both the encoder and decoder beforehand. The idea is to apply \(f_{GC}\circ f_H\) to all possible XOR combinations, \({\textbf {Z}}_A\oplus {\textbf {X}}\), \({\textbf {Z}}_C\oplus {\textbf {X}}\), \({\textbf {Z}}_G\oplus {\textbf {X}}\), or \({\textbf {Z}}_T\oplus {\textbf {X}}\), and select the one that results in the shortest output sequence. More precisely, we find \(b^\star \in \mathcal{Y}= \{{\texttt {A}}, {\texttt {C}}, {\texttt {G}}, {\texttt {T}}\}\) such that

$$b^\star = \mathop {\mathrm {arg\,min}}\limits _{b\in \mathcal{Y}} \left| f_{GC}\left( f_H({\textbf {X}}\oplus {\textbf {Z}}_b)\right) \right| ,$$

where \(|\cdot |\) denotes the length of the sequence. For \(b\in \mathcal{Y}\), the binary sequence \({\textbf {Z}}_b\) is generated based on a random seed \({\textbf {S}}_b\), where the random seed is available to both the XOR encoder and decoder.

To indicate which random sequence is being used, the encoder adds the leading base \(b^\star \) at the beginning of the sequence. For example, if \(f_{GC}(f_H({\textbf {X}}\oplus {\textbf {Z}}_A))\) yields the shortest output sequence, the output sequence is a concatenation of \({\texttt {A}}\) and \(f_{GC}(f_H({\textbf {X}}\oplus {\textbf {Z}}_A))\), i.e., \({\textbf {Y}}= \text{ CONCAT }({\texttt {A}}, f_{GC}(f_H({\textbf {X}}\oplus {\textbf {Z}}_A)))\). In the following sections, we show that the XOR encoder dramatically reduces the number of long output sequences. This procedure is described in Algorithm 3.
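The selection can be sketched as follows (illustrative; the seed values are hypothetical, and `encode` stands in for the composition \(f_{GC}\circ f_H\)):

```python
import random

SEEDS = {"A": 11, "C": 22, "G": 33, "T": 44}  # hypothetical shared seeds

def make_prs(seed: int, length: int) -> str:
    """Regenerate a pseudorandom Bernoulli(1/2) bit string from a seed
    shared by the encoder and decoder."""
    rng = random.Random(seed)
    return "".join(rng.choice("01") for _ in range(length))

def xor_bits(x: str, z: str) -> str:
    """Bitwise XOR of two equal-length bit strings."""
    return "".join("1" if a != b else "0" for a, b in zip(x, z))

def xor_encode(x: str, encode) -> str:
    """Try all four masks, keep the shortest DNA output, and prepend
    one indicator base naming the mask used (ties keep the first)."""
    best_base, best_y = None, None
    for base, seed in SEEDS.items():
        y = encode(xor_bits(x, make_prs(seed, len(x))))
        if best_y is None or len(y) < len(best_y):
            best_base, best_y = base, y
    return best_base + best_y
```

In practice `encode` would be the two-stage encoder; the decoder reads the leading base and regenerates the same mask from the corresponding seed.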

#### Decoder

Upon receiving the output sequence \({\textbf {Y}}\) and the random seed \({\textbf {S}}_b\), the decoder checks the first base \(b^\star =Y_1\), which indicates the pseudorandom sequence in use. Recall that the decoder holds the seed \({\textbf {S}}_b\) and can thus recover the binary sequence \({\textbf {Z}}_{b^\star }\). Since \(f_{GC}\) only appends the dummy sequence, the decoder recovers \({\textbf {U}} = Y_2\dots Y_{\ell _U+1}\) by extracting the first \(\ell _U\) bases after the indicator. Then, by reversing the \(f_H\) process, the decoder obtains \({\textbf {X}}\oplus {\textbf {Z}}_{b^\star }\). Finally, by applying XOR with \({\textbf {Z}}_{b^\star }\), the decoder retrieves the original binary input \({\textbf {X}}\). In the absence of the XOR encoder, upon receiving the sequence \({\textbf {V}}={\textbf {Y}}\), the initial step is to remove the redundancy introduced by the GC content encoder \(f_{GC}\), which yields \({\textbf {U}} = V_1\dots V_{\ell _U}\). Subsequently, \(f_H\) is reversed to recover the original binary input \({\textbf {X}}\). Throughout this process, no information loss or code bit mismatch occurs, and our subsequent experiments did not encounter any decoding failures.
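Reversing \(f_H\) reads two bits per base, except immediately after a length-*h* homopolymer run, where a single bit is read; in the single-bit tables \({\texttt {C}}\)/\({\texttt {A}}\) decode to 0 and \({\texttt {G}}\)/\({\texttt {T}}\) to 1. A sketch (with our own naming, mirroring the homopolymer encoder described earlier):

```python
# Inverse of the 2-bit mapping table.
INV_M = {"A": "00", "C": "01", "G": "10", "T": "11"}

def homopolymer_decode(u: str, h: int) -> str:
    """Invert the homopolymer mapping: each base yields two bits,
    except a base right after a length-h run, which yields one bit."""
    bits = []
    for i, base in enumerate(u):
        if i >= h and len(set(u[i - h:i])) == 1:
            # Single-bit step: C/A decode to 0, G/T decode to 1.
            bits.append("0" if base in "AC" else "1")
        else:
            bits.append(INV_M[base])
    return "".join(bits)
```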

## Analysis

### Homopolymer constraint only

Erlich and Zielinski [2] determined a theoretical rate limit when biological constraints are present. First, when only the homopolymer constraint is present (with a maximum homopolymer run length of *h*), the highest achievable rate \(R_H(h)\) is given by:

For instance, with a homopolymer constraint \(h = 3\), the maximum achievable rate is 1.983 bits/nt.

In this section, we assess the rate of the variable-to-fixed length encoder \(f_H\) and compare it to the theoretical limit (3). For ease of notation, we use \(X^b_a = X_a\dots X_b\) as a subsequence of \(X_1X_2\dots \), omitting the subscript when \(a = 1\). Let \(p(i; \ell , U^h)\) represent the probability of an input being a length-*i* binary sequence \(X^i\), given that the output is a DNA sequence \(Y^\ell \) of length \(\ell \) and the last *h* bases are \(U^h\) (i.e., \(Y_{\ell -h+1}^{\ell }=U^h\)).

We can derive a recursive relationship for \(p(i;\ell , U^h)\) from the code construction of \(f_H\). First, the initial condition is \(p(2h; h, U^h) = 1/4^h\) for all \(U^h\), because every base in the first *h* steps consumes 2 bits and all length-*h* DNA sequences satisfy the homopolymer constraint.

If \(U_1=\dots =U_{h-1} = b\) and \(\{b, U_h\}\) belong to the same group (\(\{b, U_h\}\subseteq \{{\texttt {A}}, {\texttt {T}}\}\) or \(\{b, U_h\}\subseteq \{{\texttt {G}}, {\texttt {C}}\}\)), then \(U_h\) cannot have followed a full run of *b* (the single-bit tables always switch groups), so

$$p(i; \ell , U^h) = \frac{1}{4}\sum _{{\tilde{b}}\ne b} p(i-2; \ell -1, {\tilde{b}}U^{h-1}),$$

where \({\tilde{b}}U^{h-1} = \text{ CONCAT }({\tilde{b}}, U^{h-1})\). Conversely, if \(U_1=\dots =U_{h-1} = b\) but \(\{b, U_h\}\) do not belong to the same group,

$$p(i; \ell , U^h) = \frac{1}{2}\, p(i-1; \ell -1, bU^{h-1}) + \frac{1}{4}\sum _{{\tilde{b}}\ne b} p(i-2; \ell -1, {\tilde{b}}U^{h-1}).$$

Otherwise, if \(U_1, \dots , U_{h-1}\) are not identical, then no preceding window can form a run, and

$$p(i; \ell , U^h) = \frac{1}{4}\sum _{{\tilde{b}}\in \mathcal{Y}} p(i-2; \ell -1, {\tilde{b}}U^{h-1}).$$

Using the recursive relationship above, we can calculate the expected length of the input binary sequence \(\ell _X\) when the target sequence length is \(\ell _U\). Since \(\ell _U \le \ell _X \le 2\ell _U\), we have

$$\mathbb {E}\left[ \ell _X\right] = \sum _{i=\ell _U}^{2\ell _U} i \sum _{U^h} p(i; \ell _U, U^h).$$

In Table 1, we compare the rate of \(f_H\) with the theoretical limit. The values \(L_{100}\), \(L_{150}\), and \(L_{200}\) represent the rates when the target sequence lengths \(\ell _U\) are 100, 150, and 200, respectively, using our proposed method. Although the proposed scheme exhibits linear complexity, the results demonstrate that we achieve near-optimal rates.

### Homopolymer and GC-content constraints

In this section, we analyze the rate of the proposed encoder *f*, where the homopolymer encoder \(f_H\) is followed by the GC content encoder \(f_{GC}\). Erlich and Zielinski [2] also provide the theoretical rate limit when both homopolymer constraint *h* and GC content constraint \(c_{GC}\) are present. For output sequences of length \(\ell \), the maximum achievable rate \(R_{H,GC}\) is given by:

where \(\Phi (\cdot )\) is the cumulative distribution function of a standard normal distribution.

We conducted simulations to assess the rate of the proposed two-stage encoder using both \(f_H\) and \(f_{GC}\). We randomly generated binary sequences of length 10,000 according to Bernoulli(1/2) and averaged the rate over 1000 samples. Table 2 displays the average rate of the proposed scheme under practical settings of \(c_{GC}\in \{0.05, 0.1\}\), \(\ell _U\in \{100, 200, 400\}\), and \(h\in \{3, 4, 5\}\). The results indicate that the proposed scheme also achieves a rate close to the fundamental limit.

### XOR encoder

We offer a theoretical analysis to elucidate why the XOR encoder reduces the number of long output sequences. Given any binary sequence \({\textbf {X}}\) and an i.i.d. Bernoulli(1/2) sequence \({\textbf {Z}}\) of length *n*, the output of the XOR encoder \({\textbf {X}}\oplus {\textbf {Z}}\) is also an i.i.d. Bernoulli(1/2) sequence. According to Sanov’s theorem [22], the probability of the GC ratio exceeding (or falling below) a certain threshold is exponentially small. More precisely, let \(r_{GC}({\textbf {X}}\oplus {\textbf {Z}})\) denote the GC ratio of the encoded sequence; then

$$\Pr \left[ r_{GC}({\textbf {X}}\oplus {\textbf {Z}}) \ge 0.5 + c_{GC}\right] \approx 2^{-n D\left( 0.5 + c_{GC} \Vert 0.5\right) },$$

and similarly for the lower tail,

where \(D(p\Vert q) = p\log \frac{p}{q} + (1-p)\log \frac{1-p}{1-q}\). It is important to note that we did not consider the effect of \(f_H\) since it inserts the opposite base (i.e., \(f_H\) adds \({\texttt {A}}\) or \({\texttt {T}}\) after a \({\texttt {G}}\) homopolymer sequence), which favorably acts to balance the GC ratio.

On the other hand, if we choose the pseudorandom sequence that results in the shortest output sequence, the probability of a long output sequence becomes considerably smaller: since the four masked sequences are independent, all four must exceed the threshold, giving

$$\Pr \left[ \min _{b\in \mathcal{Y}} r_{GC}({\textbf {X}}\oplus {\textbf {Z}}_b) \ge 0.5 + c_{GC}\right] \approx 2^{-4n D\left( 0.5 + c_{GC} \Vert 0.5\right) }.$$

For instance, if \(n = 150\) and \(c_{GC} = 0.05\), the probability without the XOR encoder is on the order of 0.471, while with the XOR encoder it is on the order of 0.049. Although one additional base must be spent to indicate which pseudorandom sequence is used, the above results suggest that the XOR encoder significantly balances the GC ratio and reduces the number of long output sequences.
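These figures can be reproduced with a short computation (a sketch; `kl_bits` is the binary KL divergence \(D(p\Vert q)\) measured in bits):

```python
from math import log2

def kl_bits(p: float, q: float) -> float:
    """Binary KL divergence D(p || q), measured in bits."""
    return p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))

n, c_gc = 150, 0.05
# Tail probability without the XOR encoder (Sanov-style estimate).
p_single = 2 ** (-n * kl_bits(0.5 + c_gc, 0.5))
# With the XOR encoder, all four independent masks must exceed the bound.
p_xor = p_single ** 4
print(round(p_single, 2), round(p_xor, 2))  # → 0.47 0.05
```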

## Experiments

### Randomly generated data

In this section, we present additional experimental results on randomly generated data (generated according to i.i.d. Bernoulli(1/2)) using various combinations of the proposed schemes. We compare 1) \(f_H\) only, 2) \(f_H\) and \(f_{GC}\), and 3) \(f_H\) and \(f_{GC}\) with XOR encoder. We examine the cases of homopolymer constraint \(h \in \{3, 5\}\) and GC constraint \(c_{GC} \in \{0.05, 0.1\}\), where the target DNA sequence length is \(\ell _U = 100\).

Table 3 shows the rates (bits/nt) of \(f_H\) and \(f_H + f_{GC}\), which are in line with the theoretical results presented in Table 1 and Table 2, where we demonstrated nearly optimal rates. It should be noted that the scheme with the XOR encoder exhibits slightly lower rates in some cases, primarily due to the one additional base that indicates which pseudorandom sequence is being used. However, the primary objective of the XOR encoder is to reduce the number of tail events (long output sequences), which we examine in the subsequent sections.

### Encoding image data

We present experimental results on practical data to further validate the effectiveness of the proposed scheme. Note that there is no inherent difference among various types of input data as far as our proposed scheme is concerned. Our scheme can effectively handle any compressed binary data exhibiting a distribution similar to i.i.d. Bernoulli(1/2). The distinction was made in our work to follow convention and to allow for an easier comparison with previous studies, which have applied their schemes to a variety of data sources.

We encode a grayscale image, Jetplane (256\(\times \)256), sourced from the Standard Image Data Base^{Footnote 1}. First, we losslessly compress the image using WebP, then apply the proposed schemes. We calculate the overall rate based on the uncompressed image. For instance, an uncompressed Jetplane image has 8\(\times \)256\(\times \)256 bits, so the rate would be 8\(\times \)256\(\times \)256 divided by the length of the output DNA sequence. We would like to note that this approach to computing rates facilitates a fair comparison as previous works propose joint coding strategies that perform both compression and constraint coding [9, 10].

For comparison, we also encode image inputs using recently proposed schemes [9, 10]. Note that Park et al. [10] set the DNA sequence length to 200. We tested the coding results with a sequence length of \(\ell _U = 150\), a homopolymer constraint of \(h = 3\), a GC content constraint of \(c_{GC} = 0.05\), and the application of the XOR encoder. Table 4 presents the rates of the various encoding methods^{Footnote 2}, where the proposed scheme achieves the highest rate of 3.32.

### Encoding video data

In this section, we apply the proposed method to a video input. The video^{Footnote 3} is in MP4 format with a file size of 2.10 MB, and we apply the proposed scheme directly without further compression. Figure 1 presents the rates of the proposed scheme. Notably, thanks to the XOR encoder, the additional GC content encoder causes negligible rate loss.

To further investigate the effect of the XOR encoder, Figure 2 displays the statistics of the number of additional bases from \(f_{GC}\) without the XOR encoder. Among 1,226 output DNA sequences, 995 sequences do not require additional bases, and 199 sequences need 9 or fewer additional bases. However, some output sequences have more than 16 additional bases, which can lead to significant issues during sequencing or synthesis. Table 5 shows the number of additional symbols when the XOR encoder is applied: of the 1,223 output sequences, none requires additional bases beyond the leading indicator of the pseudorandom sequence. Note that we have fewer output sequences due to the XOR encoder. The above experiment implies that our assumption on the data distribution is valid, and demonstrates that the XOR encoder maintains a consistent output DNA sequence length. Moreover, Table 6 compares coding rates with recent works in the field. It is worth noting that the methods of Song et al. [7] and Wang et al. [8] meet a homopolymer constraint of 3 and a GC content constraint of [0.4, 0.6]. Our proposed method achieves the highest coding rate among the approaches^{Footnote 2}, with a rate of 1.956 when using XOR coding.

### Single block encoding

To rigorously evaluate our proposed method, we designed experiments to test its performance under an extreme setup: encoding long input sequences as a single block. This approach not only demonstrates the computational efficiency of our method but also highlights its applicability to long-read sequencing technologies. We conducted extensive experimentation with a variety of parameter settings, utilizing both existing and newly generated data. The experiments included practical data types such as images and videos, showcasing the versatility of our approach. For instance, we used a 256\(\times \)256 grayscale image (Jetplane) and a 2.10 MB MP4 video file. Additionally, we generated random binary sequences of length 10,000 using a Bernoulli(1/2) distribution and averaged the rate over 1,000 samples. Unlike previous studies that processed inputs of fixed length, our method encodes the entire input data into a single DNA strand. This is made possible by the linear complexity of our method, ensuring efficient processing even for large input sizes. The outcomes of these tests, detailed in Table 7, reveal that our method consistently achieves rates nearing the theoretical optimum. This efficiency is maintained across various conditions, with the performance metrics adjusted for the application of different compression algorithms. It is important to note that the table refrains from detailing the specific value of \(\ell _U\) since the encoder treats the entire input as a single block. Instead, we present \(\ell \) to represent the length of the generated sequences.

Furthermore, the table illustrates how the integration of the XOR encoder affects efficiency, with certain conditions leading to altered rates. This variance is a deliberate aspect of our method's design. The XOR encoder plays a crucial role in managing the length of the output sequences, in particular suppressing the elongation of sequences that could otherwise become outliers in length. This suppression mechanism is key to maintaining high efficiency across the board. While the XOR encoder does introduce an additional indicator base (unlike the GC content encoder, which adds bases only when the GC ratio requires it), the overall impact on coding rates is nuanced. For longer sequences, the XOR encoder proves exceptionally beneficial: it effectively controls the expansion of sequence length, allowing our encoding method to efficiently handle long input sequences as a single block and thereby demonstrating its computational efficiency. This careful balance of sequence length and efficiency, particularly the ability to mitigate the extension of long sequences, justifies the XOR encoder's strategic incorporation into our scheme.

## Conclusions

In this work, we examined a source coding problem in DNA storage, focusing on homopolymer and GC content constraints. We introduced a variable-to-variable-length coding approach that sidesteps the need for concatenating predefined short sequences. Our technique employs a sequence of simple schemes (the homopolymer and GC content encoders) coupled with an XOR encoder, all exhibiting linear complexity. Experimental evaluation on both randomly generated data and existing files showed that our proposed strategy achieves rates close to the theoretical optimum.

## Availability of data and materials

The code is released with an open-source license and can be accessed at https://github.com/gyfbianhuanyun/DNA_storage_channel_codec. In addition, all the datasets used in the article can be found in the “/codec/test_file/” folder.

## Notes

1. The image samples. http://asssy.sakura.ne.jp/idba.html. (Accessed on 16 Mar. 2023).
2. Note that for these proof-of-concept experiments, we omitted the final few bits that could not yield the designated sequence length of \(\ell _U\). However, considering that all the experiments generated more than 1000 output sequences, the impact of this omission is minimal and negligible.
3. Video data. https://www.pexels.com/video/video-footage-of-flying-seagulls-4713259/. (Accessed on 16 Mar. 2023).

## References

1. Church GM, Gao Y, Kosuri S. Next-generation digital information storage in DNA. Science. 2012;337(6102):1628.
2. Erlich Y, Zielinski D. DNA fountain enables a robust and efficient storage architecture. Science. 2017;355(6328):950–4.
3. Pääbo S, Poinar H, Serre D, Jaenicke-Després V, Hebler J, Rohland N, Kuch M, Krause J, Vigilant L, Hofreiter M. Genetic analyses from ancient DNA. Annu Rev Genet. 2004;38(1):645–79.
4. Bonnet J, Colotte M, Coudy D, Couallier V, Portier J, Morin B, Tuffet S. Chain and conformation stability of solid-state DNA: implications for room temperature storage. Nucleic Acids Res. 2010;38(5):1531–46.
5. Yazdi S, Gabrys R, Milenkovic O. Portable and error-free DNA-based data storage. Sci Rep. 2017;7(1):1–6.
6. Goldman N, Bertone P, Chen S, Dessimoz C, LeProust EM, Sipos B, Birney E. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature. 2013;494(7435):77–80.
7. Song W, Cai K, Zhang M, Yuen C. Codes with run-length and GC-content constraints for DNA-based data storage. IEEE Commun Lett. 2018;22(10):2004–7.
8. Wang Y, Noor-A-Rahim M, Gunawan E, Guan YL, Poh CL. Construction of bio-constrained code for DNA data storage. IEEE Commun Lett. 2019;23(6):963–6.
9. Mishra P, Bhaya C, Pal AK, Singh AK. Compressed DNA coding using minimum variance Huffman tree. IEEE Commun Lett. 2020;24(8):1602–6.
10. Park S-J, Lee Y, No J-S. Iterative coding scheme satisfying GC balance and run-length constraints for DNA storage with robustness to error propagation. J Commun Netw. 2022;24(3):283–91.
11. Dimopoulou M, San Antonio EG, Antonini M. A JPEG-based image coding solution for data storage on DNA. In: 2021 29th European Signal Processing Conference (EUSIPCO). IEEE; 2021. p. 786–90.
12. Benerjee KG, Banerjee A. On DNA codes with multiple constraints. IEEE Commun Lett. 2020;25(2):365–8.
13. Jeong J, Park S-J, Kim J-W, No J-S, Jeon HH, Lee JW, No A, Kim S, Park H. Cooperative sequence clustering and decoding for DNA storage system with fountain codes. Bioinformatics. 2021;37(19):3136–43.
14. Blawat M, Gaedke K, Huetter I, Chen X-M, Turczyk B, Inverso S, Pruitt BW, Church GM. Forward error correction for DNA data storage. Procedia Comput Sci. 2016;80:1011–22.
15. Ceze L, Nivala J, Strauss K. Molecular digital data storage using DNA. Nat Rev Genet. 2019;20(8):456–66.
16. Weber JH, De Groot JA, Van Leeuwen CJ. On single-error-detecting codes for DNA-based data storage. IEEE Commun Lett. 2020;25(1):41–4.
17. Press WH, Hawkins JA, Jones SK Jr, Schaub JM, Finkelstein IJ. HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. Proc Natl Acad Sci. 2020;117(31):18489–96.
18. Löchel HF, Welzel M, Hattab G, Hauschild A-C, Heider D. Fractal construction of constrained code words for DNA storage systems. Nucleic Acids Res. 2022;50(5):e30.
19. Welzel M, Schwarz PM, Löchel HF, Kabdullayeva T, Clemens S, Becker A, Freisleben B, Heider D. DNA-Aeon provides flexible arithmetic coding for constraint adherence and error correction in DNA storage. Nat Commun. 2023;14(1):628.
20. Yazdi ST, Kiah HM, Gabrys R, Milenkovic O. Mutually uncorrelated primers for DNA-based data storage. IEEE Trans Inf Theory. 2018;64(9):6283–96.
21. Wang Y, Noor-A-Rahim M, Zhang J, Gunawan E, Guan YL, Poh CL. Oligo design with single primer binding site for high capacity DNA-based data storage. IEEE/ACM Trans Comput Biol Bioinf. 2019;17(6):2176–82.
22. Sanov IN. On the probability of large deviations of random variables. United States Air Force: Office of Scientific Research, University of Michigan; 1958.

## Acknowledgements

Not applicable.

## Funding

This research was supported by the Pioneer Research Center Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT & Future Planning (NRF-2022M3C1A3081366).

## Author information

### Authors and Affiliations

### Contributions

A. N. and Y. G. conceived the project and wrote the paper. Y. G. wrote the code. A. N. supervised the study. Both authors read and approved the final manuscript.

### Corresponding author

## Ethics declarations

### Ethics approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

## About this article

### Cite this article

Gao, Y., No, A. Efficient and low-complexity variable-to-variable length coding for DNA storage.
*BMC Bioinformatics* **25**, 320 (2024). https://doi.org/10.1186/s12859-024-05943-y
