BioCode: Two biologically compatible Algorithms for embedding data in noncoding and coding regions of DNA
 David Haughton^{1}Email author and
 Félix Balado^{1}
DOI: 10.1186/1471210514121
© Haughton and Balado; licensee BioMed Central Ltd. 2013
Received: 31 July 2012
Accepted: 19 March 2013
Published: 9 April 2013
Abstract
Background
In recent times, the application of deoxyribonucleic acid (DNA) has diversified with the emergence of fields such as DNA computing and DNA data embedding. DNA data embedding, also known as DNA watermarking or DNA steganography, aims to develop robust algorithms for encoding nongenetic information in DNA. Inherently DNA is a digital medium whereby the nucleotide bases act as digital symbols, a fact which underpins all bioinformatics techniques, and which also makes trivial information encoding using DNA straightforward. However, the situation is more complex in methods which aim at embedding information in the genomes of living organisms. DNA is susceptible to mutations, which act as a noisy channel from the point of view of information encoded using DNA. This means that the DNA data embedding field is closely related to digital communications. Moreover it is a particularly unique digital communications area, because important biological constraints must be observed by all methods. Many DNA data embedding algorithms have been presented to date, all of which operate in one of two regions: noncoding DNA (ncDNA) or proteincoding DNA (pcDNA).
Results
This paper proposes two novel DNA data embedding algorithms jointly called BioCode, which operate in ncDNA and pcDNA, respectively, and which comply fully with stricter biological restrictions. Existing methods comply with some elementary biological constraints, such as preserving protein translation in pcDNA. However there exist further biological restrictions which no DNA data embedding methods to date account for. Observing these constraints is key to increasing the biocompatibility and in turn, the robustness of information encoded in DNA.
Conclusion
The algorithms encode information in near optimal ways from a coding point of view, as we demonstrate by means of theoretical and empirical (in silico) analyses. Also, they are shown to encode information in a robust way, such that mutations have isolated effects. Furthermore, the preservation of codon statistics, while achieving a nearoptimum embedding rate, implies that BioCode pcDNA is also a nearoptimum firstorder steganographic method.
Background
The potential of deoxyribonucleic acid (DNA) for use as a storage medium of digital data was realised just over a decade ago [1]. Many promising applications of this emerging field have been proposed, such as long term data storage [2] and genetic tagging [3]. It is likely that, with advancements in DNA sequencing and synthesising technologies, information embedding in the genome of living organisms will be routine in the near future. To date several data embedding algorithms have been proposed [1, 2, 48]. However, as we will see later, none of them fully comply with some recently highlighted biological restrictions. Not adhering to these restrictions can potentially be detrimental to the organism hosting the artificial informationcarrying DNA. Here we propose two novel algorithms jointly called BioCode, which, unlike any previous ones, produce informationencoded DNA more biologically compatible for the host organism, thus improving the robustness of the encoded message. In addition to operating under strict constraints, never dealt with before, they encode information in near optimal ways. This is to the extent that for one such algorithm the embedding rate (in information bits embedded per DNA component) is indistinguishable from the optimal theoretical bound.
Interest in using DNA for information storage (genetic memories) is growing, not surprisingly, as it is a highly compact and potentially durable medium with the ability to make replicas of information costing little energy. Stored information is passed from generation to generation when placed anywhere in the genome of asexual organisms. Data encoded in DNA is subject to errors caused by random mutations in the organism’s DNA, but if encoded correctly it may still be retrievable after millions of generations or more [7]. Encoding information in sexually reproducing organisms is more complicated due to the effects of genetic crossover. However this issue has been tackled by Heider et. al [9], who proposed embedding information in mitochondrial DNA (mtDNA). In most sexually reproducing species mtDNA is inherited from the mother alone, making it an ideal location for data embedding.
Another application of robust DNA data embedding algorithms is the genetic tagging of organisms. This would be of interest to individuals researching and working with artificial or genetically modified organisms, allowing them to embed “ownership watermarks”. This was the case in one recent, high profile experiment performed by the J Craig Venter Institute (JCVI). A watermarked DNA sequence, representing the researchers’ initials, was embedded in a chemically synthesized bacterial genome [10]. A further proposal is the application of DNA data embedding for tagging potentially hazardous viruses [11]. Unique watermarks could identify different laboratories handling viruses, and thus it would be possible to refute claims that some particular institution is the source of a viral outbreak.
Despite the different potential applications of DNA data embedding, all embedding algorithms should be designed based on some common principles. Many of the prior algorithm proposals have been made by researchers concerned primarily with the biological aspects of embedding an artificial DNA sequence, but which paid relatively little attention to the coding aspects of the problem. Instead we have designed the BioCode algorithms keeping in mind not only more stringent biological constraints, but also principles from digital communications. Firstly, the informationcarrying DNA sequence should not hinder the host organisms’ development (that is, it should be as biocompatible as possible). Secondly, the embedded data should be retrievable as close as possible to a theoretical threshold (Shannon’s capacity), determined by the number of generations a message has been transmitted along and the mutation rate between generations. Finally, the algorithms should make economical use of DNA in terms of data storage, that is, maximise the embeddable payload for a given sequence length. We will demonstrate these properties through an in silico empirical analysis, in conjunction with theoretical estimates of achievable embedding rate.
There exist two distinct regions within the genomes of living organisms: proteincoding (pcDNA) regions and nonprotein coding (ncDNA) regions. In the past, ncDNA was thought to have no function, however recent research suggests that up to 80% of ncDNA may be responsible for regulatory functions [12]. In the remaining 20% of ncDNA it is safe to assume that DNA can be freely overwritten. Indeed several authors have performed successful data embedding experiments in vivo in these regions [5, 6]. The ncDNA data embedding algorithm we propose here is also designed to operate in this non functional 20% of ncDNA.
On the other hand pcDNA regions are responsible for the encoding of proteins, which are the basic building blocks of life. It is possible to modify pcDNA regions to encode information; however the constraints which an algorithm must operate under are more restrictive. The goal of each of the two BioCode algorithms presented here is to optimally embed information within each of the two types of DNA regions that we have discussed.
Prior art
The DNA data embedding field was born a little over a decade ago with the seminal paper by Clelland et al. [1], in which the authors proposed and implemented a data embedding scheme. Alphanumeric data was embedded using a trivial assignment of base groupings to characters. The synthesised DNA in this case was embedded in vitro, but not subcloned into an organism’s genome. The work of Clelland et al. was built upon by Wong et al. [2], in which they performed in vivo embedding of data in bacterial ncDNA regions. Similar to Clelland et al’s encoding scheme, a base to alphanumeric translation table was used. Two bacteria were selected for embedding, E. coli and D. radiodurans. The latter has the ability to survive in harsh environments such as those containing high levels of ionizing radiation, implying that the encoded message would also be resilient under such conditions.
The first paper to discuss error correction for information encoded in DNA was by Smith et al[13]. Since any information embedded in DNA is replicated from generation to generation, any difference between encoded information may be resolved by examining copies obtained from different organisms. Also, there exists genetic machinery in the cell which maintains DNA, providing limited error correction. Despite such inherent error correction abilities, the use of error correction methods at the encoding stage is required to reliably retrieve information after many generations of a host organism.
Arita and Ohashi [4] developed an embedding algorithm which operates in pcDNA regions. The algorithm encodes binary data and was successfully tested in vivo. The main pitfall of this method is that it requires that the original DNA sequence be available at the decoder end in order to decode the embedded message.
One paper of significance was written by Heider and Barnekow [5], in which they proposed two versions of a data embedding algorithm, entitled “DNACrypt”. The ncDNA version of the DNACrypt algorithm is a trivial mapping of bits to bases. The authors also proposed a pcDNA version of their algorithm, and went on to test their proposal in vivo[14]. It was suggested that Hamming code be used in conjunction with DNACrypt to increase robustness under mutations, although note that error correction can actually be applied on any DNA data embedding method.
The use of repetition coding as an explicit DNA data embedding method was first proposed by Yachie et al[6]. The premise behind their algorithm is that errors may be corrected by embedding redundant copies of information throughout an organism’s genome. The authors performed in vivo embedding of binary data in multiple ncDNA regions. Also included was an in silico analysis of their method, showing the data recovery rate for a varying mutation rate. This work was expanded upon by Haughton and Balado [7].
The first paper to discuss performance analysis of data embedding algorithms and propose performance bounds was by Balado [15]. The achievable rate for both ncDNA and pcDNA under substitution mutations when codons are uniformly distributed was presented. Further bounds were proposed by Balado and Haughton in [16]. These are upper bounds on the possible embedding rate (bits per DNA component) that an algorithm can attain. Therefore we will compare the performance of the BioCode methods to these bounds.
For more information on DNA watermarking the reader is referred to the recent review by Heider and Barnekow [17].
Notation and framework
In this section we introduce the notation necessary for explaining the BioCode algorithms. We also present the framework used and a summary of basic biological facts that will be needed to explain the algorithms. Sets will be represented by calligraphic letters, for instance . The cardinality of a set, or the number of elements it contains, is denoted as $\left\mathcal{S}\right$. Elements of sets are represented by lower case letters, such as $v\in \mathcal{S}$. Vectors of elements are represented by bold letters, for instance v= [v_{1},v_{2},⋯,v_{ k }].
Inherently, DNA is a linear digital storage medium whose building blocks are four nucleotide bases, denoted in set notation by $\mathcal{X}\triangleq \left\{\text{A,C,T,G}\right\}$. These bases belong to two chemically distinct groups, purines $\mathcal{R}\triangleq \{\mathrm{A},\mathrm{G}\}$ and pyrimidines $\mathcal{Y}\triangleq \{\mathrm{T},\mathrm{C}\}$. We will represent a DNA strand comprising n bases by a vector x= [x_{1},x_{2},⋯,x_{ n }], with ${x}_{i}\in \mathcal{X}$. A dinucleotide DNA sequence is represented by a twoelement vector d= [x_{1},x_{2}]. The DNA molecule actually consists of two antiparallel strands, and either of the two strands completely defines the other by means of the socalled WatsonCrick base pairings A T and G C. This fact is of importance for the BioCode ncDNA method, as we will see later.
For reasons that will become clear next, DNA data embedding algorithms which target proteincoding DNA manipulate codons, as opposed to individual bases. A codon is a group of three consecutive bases, which we will denote as $\widehat{\mathrm{x}}=\phantom{\rule{1em}{0ex}}[{x}_{1},{x}_{2},{x}_{3}]\in {\mathcal{X}}^{3}$, with a vector of codons being for instance $\stackrel{\u0304}{\mathrm{x}}=\phantom{\rule{1em}{0ex}}[{\widehat{\mathrm{x}}}_{1},\cdots \phantom{\rule{0.3em}{0ex}},{\widehat{\mathrm{x}}}_{n}]$. Genes are simply pcDNA regions flanked by certain start and stop markers enclosing consecutive codons^{a} that can be translated into proteins by the genetic machinery. Every codon $\widehat{\mathrm{x}}$ uniquely translates to an amino acid $a=\mathit{\text{aa}}\left(\widehat{\mathrm{x}}\right)$, where the a a(·) function translates a codon (or codon sequence) to an amino acid (or amino acid sequence). Using their standard abbreviations, the set of all possible amino acids is $\mathcal{A}\triangleq \{$Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, Val, Stp }. Stp is included for notational convenience, although it is not an amino acid but just a “translation stop” command. The sequential concatenation of amino acids in a gene produces a protein. The relationship between codons and amino acids, represented by a a(·), is given by the nearuniversal genetic code. This is a redundant relationship since $\left{\mathcal{X}}^{3}\right=64$ but $\left\mathcal{A}\right=21$. The set of synonymous codons which translate the same amino acid $a\in \mathcal{A}$ is denoted ${\mathcal{S}}_{a}$. The superset of all codons is given by ${\mathcal{S}}_{\mathcal{A}}$, and each subset ${\mathcal{S}}_{a}$ is composed of the codons which translate the same amino acid, $\forall a\in \mathcal{A}{\mathcal{S}}_{a}\subset {\mathcal{S}}_{\mathcal{A}}$. This redundancy is also behind the different codon bias (or codon usage bias) exhibited by different organisms. Codon biases are characteristic frequencies of the appearance of codons associated with each amino acid. As we will see, this builtin redundancy of the genetic code lies at the foundations of all pcDNA algorithms, and therefore both the genetic code and codon bias are fundamental to these techniques.
Finally, note that taking into account the three bases in a codon and the two antiparallel strands in a DNA molecule, there are six different reading frames in which a DNA segment could be translated to proteins. A correct reading frame is determined by the presence of a start codon (the codon mapping to Met, and two codons mapping to Leu in eukaryotic organisms).
Constraints of DNA data embedding
It is essential that any data embedding process does not harm the functionality of the host organism, that is to say, the informationcarrying DNA strand y and the original xshould be biologically equivalent. In order to develop reliable data embedding algorithms the constraints which enable robust encoding must be clear. This section outlines important biological constraints which should be placed upon DNA modifications. The BioCode algorithms described in the following section abide by all of these constraints.

ncDNA constraint: no start codons — A modified ncDNA region (in order to embed information) should not be mistaken as a pcDNA region by the genetic machinery. This implies that start codons^{b} should not appear in the modifications. To the best of our knowledge BioCode ncDNA is the only algorithm strictly observing this constraint, however another method does acknowledge it to some extent. This algorithm was used by the JCVI to encode data in the artificially engineered synthetic bacterium and is disclosed in a patent [18]. This method does not completely guarantee that start codons will not be created; instead, it is designed such that the probability of start codons appearing is low. Moreover, this low likelihood only applies to one of the six possible reading frames of DNA, whereas BioCode ncDNA enforces the constraint in all six frames.

In any case it might still happen that a modified region which originally did not contain start codons may acquire them due to mutations accumulated over a number of generations. This is clearly a potentially unavoidable scenario for any method.

pcDNA constraints: primary structure preservation — The primary structure, i.e. protein translation, of a gene may not be altered, in effect meaning that $\mathit{\text{aa}}\left(\stackrel{\u0304}{\mathrm{y}}\right)=\mathit{\text{aa}}\left(\stackrel{\u0304}{\mathrm{x}}\right)$. Algorithms are restricted to encoding information by replacing codons synonymously (that is to say, with codons which translate the same amino acid). This greatly reduces capacity and increases the complexity of pcDNA algorithms over ncDNA algorithms. codon bias preservation (codon count preservation)— The second constraint which must be considered concerns the distribution of codons in organisms, or codon bias. There is a growing body of research pointing towards the codon bias usage of pcDNA regions dictating the gene expression levels in both eukaryotic and prokaryotic organisms, in particular, the speed at which genes are translated into proteins [19, 20]. Therefore it is desirable that the codon bias in a given pcDNA region be preserved when such a region is modified to embed data. This constraint may be especially important when encoding information extensively throughout an organism’s genome.

The empirical distribution of codons in a pcDNA region is given by its codon bias, which is just a normalised codon count. Hence, in practice preserving a codon bias amounts to preserving a codon count. In other words, the codon bias preservation constraint implies that the histogram of the codons in a pcDNA region must remain unchanged after the embedding process.

It should be noted that if the codon composition for a particular amino acid does not vary, i.e. the same codon translates a single amino acid every time in a pcDNA region, then any algorithm operating under this constraint cannot encode information using those codons. In practice we have not observed this extreme case and while codon compositions do not appear with equal frequency, they are sufficiently distributed to achieve high embedding rates.
The codon bias preservation constraint has been acknowledged, to some extent, in a DNA embedding algorithm created by Liss et al. [8]. This algorithm encodes information by first determining the frequency of each codon to be used for embedding. Codons are assigned to bit values in such a way as to mirror the bit frequencies of the message with the codon usage frequencies. It is a reasonable assumption to expect the binary message to be embedded, m to be approximately uniformly random as any data will appear so when compressed. Under the method by Liss et al., if we assume the binary message is uniformly random, and there is high variation in codon usage frequencies for an amino acid, the codon bias would not be preserved.
An even more stringent constraint for pcDNA embedding is the preservation of codon pairs. A recent study demonstrated that certain codon pairs were preferred in pcDNA regions, while others were avoided [21]. We have investigated this constraint when combined with the two constraints above and, for the genes used in this study, have determined that no information can be encoded when strictly enforced. In these genes there were no two amino acid pairs with differing codon compositions, meaning that no codon pairs could be swapped while maintaining the primary structure preservation constraint. Therefore this constraint will not be considered here. A further issue with this constraint is the preservation of codon pairs in different reading frames. If codon pairs in all reading frames were to be preserved, the DNA sequence could not be modified at all.
Method
As we will see, both algorithms proposed in this paper operate under conditions which vary depending upon the message encoding progress, and which take into account the aforementioned constraints on DNA modification. Both algorithms face the problem of statically or dynamically mapping a given set of available symbols (bases or codons) to message bits, and vice versa. For clarity, this common encoding principle which we call graduated mapping will be introduced next, before the actual BioCode algorithms are presented.
Graduated mapping
Given a set of available symbols , which in general are bases or codons, it is possible to map all of it’s elements to the elements of a second set of binary strings . Obviously both sets must have identical cardinality, denoted by $\mu =\left\mathcal{S}\right=\left\mathcal{M}\right$. Let $l\triangleq \lfloor \underset{2}{log}\mu \rfloor $ denote the minimum length of any binary string in .
First, let us consider the simplest case, that is, when l= log2μ. In this case is composed of μ lengthl binary strings, arranged in ascending order from zero to μ−1. The other case to consider is when l< log2μ. In this instance, to achieve a higher embedding rate, some of the binary strings in must be of length ⌊log2μ⌋+1 bits. The first 2^{ l } values from are assigned llength binary strings, in ascending order from 0 to 2^{ l }−1. The remaining values from the range 2^{ l }+1 to μ are first duplicated with the llength binary strings corresponding to the range 2^{l+1}−μ+1 to 2^{ l }. The strings in the former range are concatenated with a “1”, while the strings in the latter are concatenated with a “0”.
Dynamic graduated mapping
We will see that a special situation is the requirement that each of the elements from be used a specific amount of times due to biological constraints. If an element $s\in \mathcal{S}$ has been used as many times as permitted, then it will be removed from , decreasing μ by one unit. Every such removal prompts a remapping of $\mathcal{S}\iff \mathcal{M}$ in a graduated fashion, whereby is completely recreated using the new value of μ and the mapping method just described in the paragraph above.
As an example of the method, suppose that $\mathcal{S}=\{a,b,c,d,e\}$, then it would have the following mapping $\mathcal{S}\iff \mathcal{M}=\{00,01,10,110,111\}$. Now, if during execution of the algorithm the element d is used as many times as permitted, becomes $\mathcal{S}\setminus d$ and the set is remapped as $\mathcal{M}=\{00,01,10,11\}$.
As we will see in the following section, the two BioCode algorithms exploit the basic concept of graduated mapping in their own unique ways. Notice that the actual permutations used in the mappings may be kept as a secret shared by encoder and decoder, thus implementing the aforementioned secret key that precludes decoding by unauthorised third parties.
BioCode ncDNA
In this section we introduce BioCode ncDNA —a method to optimally embed information within ncDNA while observing the no start codons constraint. Firstly, observe that as $\left\mathcal{X}\right=4$ it is possible to encode information by trivially assigning a two bit sequence to each base. This is the foundation of the ncDNA embedding algorithm DNACrypt by Heider and Barnekow [5], among others. However such a static mapping of bits to DNA symbols does not take into account the no start codons constraint discussed in the previous section. Using such a mapping it is possible that some particular messages will produce start codons in the informationcarrying strand. One might think that simply avoiding messages which translate into start codons would bypass this problem. However, this is far from being a solution because there are three possible reading frames where the genetic machinery might find a start codon, plus three additional reading frames in the antiparallel complementary strand.
These duplets indicate that the next encoded symbol in a DNA sequence is a special case since a start codon may be produced if the wrong symbol is encoded. Such a situation is avoided by constantly examining the trailing dinucleotide sequence, d=[y_{i−2},y_{i−1}], where i represents the position of encoding within the informationcarrying DNA sequence y. If the concatenation of the previous two bases d with the current base y_{ i } has the potential to create a start codon (that is, if $\mathbf{d}\in \mathcal{D}$), then the algorithm restricts the choice of y_{ i } to a subset of bases ${\mathcal{S}}_{\mathbf{d}}$ such that no start codon can be produced. Otherwise y_{ i } can be freely chosen from . In order to reflect these conditions, a graduated mapping from the subset ${\mathcal{S}}_{\mathbf{d}}$ to message bits is used to encode the symbol y_{ i }. Note that the graduated mapping is different for different values of d, but static for any given d.
BioCode ncDNA
d  AT  CT  TT  CA  ${\mathcal{X}}^{\mathbf{2}}\setminus \mathcal{D}$ 

$\left{\mathcal{S}}_{\mathbf{d}}\right$  3  3  3  1  4 
${\mathcal{S}}_{\mathbf{d}}$  A  A  A  C  A 
T  T  T  T  
C  C  C  C  
G  
↓  Decode  
Encode  ↑  
${\mathcal{M}}_{\mathbf{d}}$  0  0  0  00  
10  10  10  01  
11  11  11  10  
11 
BioCode ncDNA guarantees that no start codon can be created in all reading frames in both sense and antisense directions. The algorithm can be easily modified in such a way as to prevent any other codon of choice from appearing. Decoding an embedded message is simply the reverse process of encoding, with one additional improvement. Since it is not possible for start codons to appear intentionally, if they do arise due to mutations it is possible to detect the corresponding message errors —and even in some cases to correct them.
Binary Codon equivalency
Binary to Codon translation table
(a)  

a  Ala  Cys  Asp  Glu  Phe  Gly  His  Ile  Lys  Leu  Met  Asn  Pro  Gln  Arg  Ser  Thr  Val  Trp  Stp  Tyr  
$\left{\mathcal{S}}_{a}\right$  4  2  2  2  2  4  2  3  2  6  1  4  2  2  6  6  4  4  1  3  2  
GCA  TGC  GAC  GAA  TTC  GGA  CAC  ATA  AAA  CTA  ATG  AAC  CCA  CAA  AGA  AGC  ACA  GTA  TGG  TAA  TAC  
GCC  TGT  GAT  GAG  TTT  GGC  CAT  ATC  AAG  CTC  AAT  CCC  CAG  AGG  AGT  ACC  GTC  TAG  TAT  
${\mathcal{S}}_{a}$  GCG  GGG  ATT  CTG  CCG  CGA  TCA  ACG  GTG  TGA  
GCT  GGT  CTT  CCT  CGC  TCC  ACT  GTT  
TTA  CGG  TCG  
TTG  CGT  TCT  
↑ Decode  
Encode ↓  
(b)  
00  0  0  0  0  00  0  0  0  00    00  0  0  00  00  00  00    0  0  
01  1  1  1  1  01  1  10  1  01  01  1  1  01  01  01  01  10  1  
${\mathcal{M}}_{a}$  10  10  11  100  10  100  100  10  10  11  
11  11  101  11  101  101  11  11  
110  110  110  
111  111  111 
BCE executes as follows: it initiates by translating the sequence of codons, $\stackrel{\u0304}{\mathrm{x}}=[{\widehat{\mathrm{x}}}_{1},{\widehat{\mathrm{x}}}_{2},\cdots \phantom{\rule{0.3em}{0ex}},{\widehat{\mathrm{x}}}_{n}]$ into its corresponding amino acid sequence $\mathbf{a}=\mathit{\text{aa}}\left(\stackrel{\u0304}{\mathrm{x}}\right)=[{a}_{1},{a}_{2},\cdots \phantom{\rule{0.3em}{0ex}},{a}_{n}]$ (primary structure). The encoded sequence, $\stackrel{\u0304}{\mathrm{y}}$ is then constructed by traversing a and choosing for each index i a messagedependent codon ${\widehat{\mathrm{y}}}_{i}$ such that $\mathit{\text{aa}}\left({\widehat{\mathrm{y}}}_{i}\right)={a}_{i}$. A lookup of Table 2 is performed to find the bit sequence matching the current message bit(s) m in ${\mathcal{M}}_{{a}_{i}}$. The codon ${\widehat{\mathrm{y}}}_{i}\in {\mathcal{S}}_{{a}_{i}}$ is selected corresponding to the position of that match.
BioCode pcDNA
The BioCode pcDNA algorithm preserves in $\stackrel{\u0304}{\mathbf{y}}$ not only the primary structure of the original host sequence $\stackrel{\u0304}{\mathbf{x}}$ —as BCE does already— but also its codon count. These two objectives are simultaneously achieved by means of a dynamic adaptation of the strategy followed by BCE. We have just seen that in BCE the cardinality of the codon set ${\mathcal{S}}_{{a}_{i}}$ corresponding to each amino acid a_{ i } is constant for all i=1,2,⋯,n, which allows the use of a static lookup table throughout the embedding process. However the additional constraint observed by BioCode pcDNA requires the cardinality of ${\mathcal{S}}_{a}$ to be varied during the embedding process.
The following is a step by step procedure of the algorithms’ operation made with reference to Figure 3.

Amino Acid Translation — As in BCE, the vector of codons, $\stackrel{\u0304}{\mathrm{x}}$ is converted into a vector of amino acids; $\mathbf{a}=\mathit{\text{aa}}\left(\stackrel{\u0304}{\mathrm{x}}\right)$.

Initialize Encoding Tables — Next, for every amino acid, all possible codon types in $\stackrel{\u0304}{\mathrm{x}}$ which translate that amino acid must be found. Given ${\mathcal{S}}_{c}$ is the set of k codons which translate a single amino acid, ${\mathcal{S}}_{c}$ will only contain the codon types which appear in $\stackrel{\u0304}{\mathrm{x}}$. If all k possible codon compositions are found in $\stackrel{\u0304}{\mathrm{x}}$, then ${\mathcal{S}}_{c}$ will contain all k codons. For example, given the amino acid Glycine we have the corresponding set ${\mathcal{S}}_{g}$. Four codons translate this amino acid which would normally yield ${\mathcal{S}}_{g}\triangleq \left\{\text{GGA, GGC, GGG, GGT}\right\}$. However if the codon GGT does not appear in $\stackrel{\u0304}{\mathrm{x}}$ and all other codons do, then the set will consists of ${\mathcal{S}}_{g}\triangleq \left\{\text{GGA, GGC, GGG,}\right\}$. This process of inserting all the codon types into their component amino acid sets continues until all the unique codons in $\stackrel{\u0304}{\mathrm{x}}$ have been classified. For each amino acid set, a set identical in size is created to contain the corresponding bit mappings. Given ${\mathcal{S}}_{c}$, a corresponding set ${\mathcal{M}}_{c}$ is populated using the cardinality $\mu =\left\mathcal{C}\right$ and the graduated method described in the previous section. There is then a mapping of ${\mathcal{S}}_{c}\mapsto {\mathcal{M}}_{c}$. ${\mathcal{S}}_{c}$ is contained within a superset of codon sets, ${\mathcal{S}}_{c}\in {\mathcal{S}}_{\mathcal{A}}$. If the full set of 64 codons are identified in the pcDNA region then the entire amino acid set ${\mathcal{S}}_{\mathcal{A}}$ and corresponding bit mappings ${\mathcal{M}}_{\mathcal{A}}$ would be identical to Tables 2(a) and (b). Once ${\mathcal{S}}_{\mathcal{A}}$ and ${\mathcal{M}}_{\mathcal{A}}$ have been initialized for each amino acid, they may be queried to determine the available codons and possible bit sequences that can be encoded. Continuing the example above for $\mathcal{G}\triangleq \left\{\text{GGA, GGC, GGG}\right\}$, the possible bit mappings for would be ${\mathcal{M}}_{g}\triangleq \{0,10,11\}$.

A codon count vector c is then created, which contains the number of times that each codon appears in a pcDNA region. This, along with ${\mathcal{S}}_{\mathcal{A}}$ and ${\mathcal{M}}_{\mathcal{A}}$ will be modified as the algorithm progresses.

Table Lookup — Construction of $\stackrel{\u0304}{\mathrm{y}}$ begins by examining the first amino acid a_{1} and the first 3 bits in the message sequence, [m_{1},m_{2},m_{3}]. If amino acid a_{1} is represented by the codon set ${\mathcal{S}}_{{a}_{1}}$ (all codons in ${\mathcal{S}}_{{a}_{1}}$ translate a_{1}), then the available bit sequences are given by ${\mathcal{M}}_{{a}_{1}}$. The bit sequence matching the current input is searched for in ${\mathcal{M}}_{{a}_{1}}$, if $\{{m}_{1},{m}_{2},{m}_{3}\}\notin {\mathcal{M}}_{{a}_{1}}$, then {m_{1},m_{2}} is located, if $\{{m}_{1},{m}_{2}\}\notin {\mathcal{M}}_{{a}_{1}}$ then {m_{1}} is located. The position at which the matching bit sequence is located corresponds to the codon to be selected for embedding from ${\mathcal{S}}_{{a}_{1}}$. That is to say, if the kth element in ${\mathcal{M}}_{{a}_{1}}$ is identical to the current input, then the kth codon of the same amino acid from ${\mathcal{S}}_{{a}_{1}}$ is used for encoding.

Decrease Codon Count — Once the codon $\widehat{\mathrm{y}}$ has been used for encoding, the count for that codon in c is decremented by one.

Adjust Tables — If the count for codon $\widehat{\mathrm{y}}$ reaches zero, then codon $\widehat{\mathrm{y}}$ is removed from ${\mathcal{S}}_{\mathcal{A}}$. In other words, if a codon has been used as many times as it appeared in the original sequence then that codon can no longer be used for embedding because the budget for that codon has been exhausted. The removal of $\widehat{\mathrm{y}}$ from ${\mathcal{S}}_{\mathcal{A}}$ also prompts a remapping of ${\mathcal{S}}_{\mathit{\text{aa}}\left(\widehat{\mathrm{y}}\right)}\mapsto {\mathcal{M}}_{\mathit{\text{aa}}\left(\widehat{\mathrm{y}}\right)}$ in a graduated fashion.

End — The algorithm loops back to the Table Lookup step, continuing its iteration through a to produce $\stackrel{\u0304}{\mathrm{y}}$, until the end of m or $\stackrel{\u0304}{\mathrm{x}}$ has been reached.
Decoding is the reverse procedure of embedding. Instead of performing a lookup using the message vector, a lookup is performed using codons to retrieve the message vector. All of the tables created for encoding must also be created at the decoder and are modified during execution in the same way. An example of BioCode pcDNA encoding with step by step procedure is demonstrated in an Additional file 1. This includes codon and amino acid statics for $\stackrel{\u0304}{\mathrm{x}}$ and $\stackrel{\u0304}{\mathrm{y}}$.
Information embedding rate of the BioCode Algorithms
In this section we analyse the information embedding rate of the BioCode algorithms in message bits/base or message bits/codon. In order to do so we will first discuss the embedding rate of the graduated mapping method, which assigns symbols (bases or codons) to bits in both BioCode methods. For simplicity we will assume that the message bits are uniformly distributed at random.
The equation above may be explained as the weighted average of the lower embedding rate, R_{ ↓ }, and the higher embedding rate, R_{ ↑ }, using as weights the probabilities of those rates being implemented by the encoder. The optimal achievable rate, independent of any method, is given by R= log2μ. There exists one method which attains this rate, called arithmetic encoding [3]. However arithmetic encoding presents error propagation issues at the decoder, which make it impossible to implement error correction effectively.
BioCode ncDNA
This embedding rate is not overly lower than the unconstrained rate of embedding of 2 bits/base. However this rate may only be attained when the message is long.
BioCode pcDNA
where we have used expression (2). In order to see that this rate is nearoptimum, observe that the maximum rate —independent of any method— may be calculated using the same formula above by replacing $R\left(\right{\mathcal{S}}_{a}\left\right)$ with $\underset{2}{log}\left{\mathcal{S}}_{a}\right$. This gives a rate of 1.7819 bits/codon, which is only 3% higher than the BCE rate.
Mutation channel model
In the following we will discuss the mutations model used to evaluate the performance of the BioCode methods. It must be emphasised that most previous authors proposing DNA data embedding did not provide decoding performance analyses of their algorithms, either by means of analyses or by means of in silico Monte Carlo simulations. An exception would be the work of Yachie et al. However such analyses are fundamental for understanding the expected performance of DNA data embedding methods when used in in vivo environments.
Performance analyses are important because the information embedded in the genome of an organism may contain errors caused by mutations accumulated after successive generations of the organism. That is, as shown in Figure 1, due to the effect of a “mutations channel” the informationcarrying DNA sequence (y) may be transformed into a “noisy” version of it (z) before reaching the decoding stage. These errors may impair or degrade the decoding of the embedded information, and hence it is fundamental to analyse the algorithms’ performance under mutations.
Following the communications simile, the mutations channel causing the errors can be characterised using a probabilistic model. The model used in our analysis will only consider base substitution mutations, which are the most prevalent mutations in the DNA of bacteria. In particular such mutations are the overwhelming majority in pcDNA regions [23]. These mutations randomly replace one base with an alternate base at different loci of a genome, and therefore can be modelled by means of a 4×4 transition probability matrix $\Pi \triangleq [Pr(z\lefty\right)]$, where $z,y\in \mathcal{X}$. As a simplification we will also consider that base substitution mutations happen independently at different loci. In reality it may happen that dependent mutations occur, for instance affecting a number of consecutive bases. However such dependencies can be easily broken by any information embedding method by means of a pseudorandom interleaver shared by encoder and decoder.
This model can reflect the higher probability of base transitions (mutations among purines or among pyrimidines) over base transversions (mutations between purines and pyrimidines) by setting γ<1. The γ parameter is a function of the ratio of transitions to transversions ε, and it is obtained from it as γ=3/(2(ε+1)). This model becomes the less realistic JukesCantor model when γ=1. For a more indepth explanation the reader is directed to [7].
Since mutation events occur from parent to child it is natural to model the mutation channel for the number of generations p elapsed between y and z. Assuming that Π gives the transition probability matrix for one generation, the model for p generations is easily found as Π^{ p }. We denote this straightforward extension as a “cascaded mutations model”.
At most, a mutation model can have nine parameters if it the property of time reversibility is to hold. The Kimura model is used in place of models with greater numbers of parameters because of the statistical problem of overfitting. If a mutation model has several parameters, some of which cannot be accurately estimated, the results obtained after many generations will be distorted. Reliable estimates of q and γ are available and therefore Π^{ p } can be accurately approximated. The Kimura model has been proven accurate in predicting the capacity of a DNA sequence when compared with a 12 parameter model [25].
Message bitframe resynchronisation
While performance will only be evaluated under the base substitution mutation channel just described, base errors may also occasionally confuse the decoder into inserting or removing message bits. If this happens the message bitframe common to encoder and decoder can become desynchronised, that is, the same index in m and m^{′} may no longer refer to the same message bit. We must stress that this issue not confined to BioCode, but common to all existing pcDNA data embedding algorithms. Therefore, the message bitframe must be resynchronised at the decoder, as otherwise the situation above may occasionally lead to a high message bit error.
We will employ two resynchronisation methods in order to deal with bitframe desynchronisation errors: marker codes and watermark codes. These strategies could actually be applied to resynchronise after insertion and deletion mutations on the level of DNA, which are not considered in this paper. Since they are applied on the bit level, not quaternary, the methods would lack channel information and as such can not decode optimally. Incorporating these methods fully for the DNA case is no trivial task because the embedding rate per base is not constant when operating under the restriction highlighted in this paper.
Marker codes
Marker codes were originally proposed by Sellers [26] in 1962, however they were not referred to as marker codes until much later [27]. These codes place a pilot signal at regular intervals in the binary message. The decoder expects the pilot signal to be located at specific points and if not found corrective action is taken. Suppose the pilot signal “001” is received by the decoder as “010”, it would infer that a deletion has occurred in the block preceding the pilot. The decoder resynchronises the remainder of the message by inserting a bit in the middle of the erroneous block. Marker codes, in the original proposal, are capable of correcting one desynchronisation error per block. They are not, however, designed to correct the block in which the error occurred.
Watermark codes
Watermark codes are a recently proposed resynchronisation method by Davey and MacKay that have been shown to achieve a high encoding rate [27]. Despite their name, they are not related to DNA watermarking, but may be applied here to correct bit insertions and deletions. The application a watermark code is as follows: firstly a “watermark” vector w is created which, for the purpose of our simulations was a uniformly random binary vector agreed upon by the encoder and decoder. The s p a r s e(·) function inserts zeros evenly throughout the input binary vector with the position of insertions known to both encoder and decoder. The message vector, m is sparsified and added modulo 2 to the watermarked vector, $\stackrel{~}{\mathbf{m}}=\mathit{\text{sparse}}\left(\mathbf{m}\right)+\mathbf{w}$, which is then embedded in a DNA sequence.
The next step differs in our implementation over Davey and MacKay’s. Under their method, after being transmitted across a channel, the received vector ${\stackrel{~}{\mathbf{m}}}^{\prime}$ is processed by the Forward Backward algorithm to correct insertions and deletions [27]. However under our method, after the DNA sequence has been decoded, possibly accumulating errors, the watermark decoder processes ${\stackrel{~}{\mathbf{m}}}^{\prime}$ by aligning it with w. This is done in a similar manner to the alignment process of the NeedlemanWunsch algorithm, however here the edit distance is used. One important factor must be incorporated into the alignment scoring; it is impossible for desynchronisation errors to occur in w. Differences between ${\stackrel{~}{\mathbf{m}}}^{\prime}$ and w, other than desynchronisation errors, are a result of encoding or mutations, which the decoder is unable to distinguish between. Therefore, when resolving differences caused by flips between ${\stackrel{~}{\mathbf{m}}}^{\prime}$ and w, the values in ${\stackrel{~}{\mathbf{m}}}^{\prime}$ are stored as the alignment.
The Forward Backward algorithm allows for the computation of probabilities which may then be passed to a substitution error correction decoder. The method we employ here does not incorporate the channel transition probabilities in the realignment process and because of this, is not as accurate as the Davey and MacKay’s algorithm. However, our method is greatly simplified and less computationally complex.
Results and discussion
In this section we describe the performance measures used to evaluate the BioCode algorithms. These evaluations are performed by means of Monte Carlo simulations, which implement the cascaded Kimura model as the mutations channel.
Performance measures
First of all we must establish relevant and objective criterion for evaluation. A very important figure of merit is the “raw” probability of bit error at the decoder (P_{ b }), which is the probability that a bit will be incorrectly decoded after transmission across the mutations channel. By “raw” we mean without error correction coding (ECC): observe that ECC can be applied to any DNA data embedding method in order to enhance performance, but it is the baseline raw probability of bit error that determines the effectiveness of such additional strategies.
If no bitframe resynchronisation is applied, it can happen that ${P}_{b}^{\mathrm{H}}$ is disproportionately high, even though only a few bits might have been inserted or deleted by the mutations channel.
where Pr(·) are empirical estimates of these probabilities computed from the Monte Carlo experiments. We note that I(M;M^{′}) must be scaled from bits/bit to bits/base (for ncDNA methods) or bits/codon (for pcDNA methods).
Monte Carlo simulations
The parameters used in the cascaded Kimura model are q=10^{−8} and γ=0.1, which are values used in prior work [7] and are based on realistic estimates obtained in [28]. The results for BioCode ncDNA were obtained using messages of length 10,000 bits. For BioCode pcDNA the message length varied depending on codon composition and host sequence length. All the graphs compare either the mutual information or probability of bit error (${P}_{b}^{\mathrm{H}}$) against the number of generations an encoded sequence has been transmitted along.
For the empirical analysis of BioCode pcDNA three different pcDNA regions were selected for embedding, two of which were used in prior works. The “ftsZ” region ^{c} in the B. subtilis genome was used for in vivo data embedding with Arita and Ohashi’s algorithm [4]. The “ypt7” region ^{d}, from a species of yeast known as S. cerevisiae, was used for in silico data embedding with the DNACrypt algorithm [5]. The other region used, “pSD1_197” is a plasmid gene of a bacteria belonging to the Shigella genus ^{e}, selected for its differing codon composition and larger sequence length.
Finally, the last set of graphs compare BCE with algorithms proposed by other authors. Notice that the constraints under which the BioCode algorithms operate have never fully been incorporated into any previous embedding method. Therefore direct comparisons with other methods are not appropriate (although comparisons against theoretical bounds are still possible). However BCE, which may be seen as a particular instance of BioCode pcDNA, can actually be compared to other pcDNA data embedding algorithms. Heider and Barnekow’s DNACrypt [5] and Arita and Ohashi’s method [4] are compared to BCE. These methods only maintain the primary structure preservation constraint.
Conclusions
In this paper we have introduced the BioCode algorithms for embedding information in DNA. These novel methods are designed to be more biologically compatible than any previous DNA data embedding algorithms, fully adhering to strict constraints. Furthermore they lay the foundation for information storage in DNA in a way that is both efficient and robust, as we have shown by means of in silico Monte Carlo simulations. The BioCode pcDNA algorithm preserves codon statics making it difficult to infer that information has been embedded. This aspect, in addition to BioCode pcDNA’s nearoptimum embedding rate, implies that BioCode pcDNA is a nearoptimum firstorder steganographic method. While DNA data embedding is currently in its infancy, it is likely that this field will grow considerably as technologies for synthesising and sequencing DNA become cheaper and faster. Therefore efficient data embedding techniques such as the BioCode algorithms can potentially find widespread applicability.
Endnotes
^{a}Possibly interspersed with noncoding regions (introns) in eukaryotic cells.^{b}Codons which mark the start of a gene in pcDNA.^{c}[GenBank:NC_000964.3 (1597832..1598980)]^{d}[GenBank:NC_001145.3 (267174..267800)]^{e}[GenBank:NC_007607.1 (170357..173665)]
Declarations
Acknowledgements
This publication has emanated from research conducted with the financial support of Science Foundation Ireland under grant number: 09/RFP/CMS2212.
Authors’ Affiliations
References
 Clelland CT, Risca V, Bancroft C: Hiding messages in DNA microdots. Nature. 1999, 399 (6736): 533534. 10.1038/21092.View ArticlePubMedGoogle Scholar
 Wong PC, Wong K, Foote H: Organic data memory using the DNA, approach. Comms ACM. 2003, 46: 9598.View ArticleGoogle Scholar
 Shimanovsky B, Feng J, Potkonjak M: Hiding data in DNA. Procs. of the 5th Intl. Workshop in Information Hiding Noordwijkerhout. 2002, The Netherlands, 373386.Google Scholar
 Arita M, Ohashi Y: Secret signatures inside genomic DNA. Biotechnol Prog. 2004, 20 (5): 16051607. 10.1021/bp049917i.View ArticlePubMedGoogle Scholar
 Heider D, Barnekow A: DNAbased Watermarks using the DNACrypt Algorithm. BMC Bioinformatics. 2007, 8 (176):Google Scholar
 Yachie N, Sekiyama K, Sugahara J, Ohashi Y, Tomita M: Alignmentbased approach for durable data storage into living organisms. Biotechnol Prog. 2007, 23 (2): 501505. 10.1021/bp060261y.View ArticlePubMedGoogle Scholar
 Haughton D, Balado F: Repetition coding as an effective error correction code for information encoded in DNA. Bioinformatic Bioeng, IEEE Int Symp. 2011, 0: 253260.Google Scholar
 Liss M, Daubert D, Brunner K, Kliche K, Hammes U: Embedding permanent watermarks in synthetic genes. PLoS ONE. 2012, 7 (8): e4246510.1371/journal.pone.0042465.PubMed CentralView ArticlePubMedGoogle Scholar
 Heider D, Kessler D, Barnekow A: Watermarking sexually reproducing diploid organisms. Bioinformatics. 2008, 24 (17): 19611962. 10.1093/bioinformatics/btn342.View ArticlePubMedGoogle Scholar
 Gibson D, Benders G, AndrewsPfannkoch C, Denisova E, BadenTillson H, Zaveri J, Stockwell T, Brownley A, D W Thomas MA, Merryman C, Young L, Noskov V, Glass J, Venter J, Hutchison C, Smith H: Complete chemical synthesis, assembly, and cloning of a mycoplasma genitalium genome. Science. 2008, 319: 12151219. 10.1126/science.1151721.View ArticlePubMedGoogle Scholar
 Jupiter DC, Ficht TA, Samuel J, Qin QM, de Figueiredo P: DNA, Watermarking of infectious agents: progress and prospects. PLoS Pathog. 2010, 6: e42465View ArticleGoogle Scholar
 The ENCODE Project Consortium: An integrated encyclopedia of DNA, elements in the human genome. Nature. 2012, 489: 5774. 10.1038/nature11247.View ArticleGoogle Scholar
 Smith GC, Fiddes CC, Hawkins J P Cox: Some possible codes for encrypting data in DNA. Biotech Lett. 2003, 25 (14): 11251130. 10.1023/A:1024539608706.View ArticleGoogle Scholar
 Heider D, Barnekow A: DNA watermarks: A proof of concept. BMC Mol Biol. 2008, 9 (40):Google Scholar
 Balado F: On the Shannon capacity of DNA data embedding. IEEE International Conference on Acoustics, Speech and Signal (ICASSP). 2010, Dallas, USA, 17661769.Google Scholar
 Balado F, Haughton D: Gene tagging and the data hiding rate. 23nd IET Irish Signals and Systems Conference. 2012, Ireland: MaynoothGoogle Scholar
 Heider D, Barnekow A: DNA Watermarking: Challenging perspectives for biotechnological applications. Curr Bioinformatics. 2011, 6 (3): 375382. 10.2174/157489311796904646.View ArticleGoogle Scholar
 Clyde A Hutchison: Encoding text into nucleic acid sequences. US Patent 12/916,344. 2010Google Scholar
 Lavner Y, Kotlar D: Codon bias as a factor in regulating expression via translation rate in the human genome. Gene. 2005, 345: 127138. 10.1016/j.gene.2004.11.035.View ArticlePubMedGoogle Scholar
 Das S, Roymondal U, Sahoo S: Analyzing gene expression from relative codon usage bias in Yeast genome: A statistical significance and biological relevance. Gene. 2009, 443 (12): 121131.View ArticlePubMedGoogle Scholar
 Tats A, Tenson T, Remm M: Preferred and avoided codon pairs in three domains of life. BMC Genomics. 2008, 9: 46310.1186/147121649463. [http://www.biomedcentral.com/content/supplementary/1471210514121S1.pdf],PubMed CentralView ArticlePubMedGoogle Scholar
 Haughton D, Balado F: Performance of DNA data embedding algorithms under substitution mutations. Bioinformatics and Biomedicine Workshops (BIBMW) 2010 IEEE International Conference on. 2010, 201206.View ArticleGoogle Scholar
 Chen J, Wu Y, Yang H, Bergelson J, Kreitman M, Tian D: Variation in the ratio of nucleotide substitution and Indel rates across genomes in mammals and Bacteria. J Mol Biol Evol. 2009, 26 (7): 15231531. 10.1093/molbev/msp063.View ArticleGoogle Scholar
 Kimura M: A simple method for estimating evolutionary rate in a finite population due to mutational production of neutral and nearly neutral base substitution through comparative studies of nucleotide sequences. J Mol Biol. 1980, 16: 111120.Google Scholar
 Balado F: Capacity of DNA Data embedding under substitution Mutations. IEEE Trans Inf Theory. 2013, 59 (2): 928941.View ArticleGoogle Scholar
 Sellers JF: Bit loss and gain correction code. Inf Theory, IRE Trans. 1962, 8: 3538.View ArticleGoogle Scholar
 Davey MC, MacKay DJC: Reliable communication over channels with insertions, deletions and substitutions. IEEE Trans Inf Theory. 2001, 47: 687698. 10.1109/18.910582.View ArticleGoogle Scholar
 Purvis A, Bromham L: Estimating the transition/transversion ratio from independent pairwise comparisons with an assumed phylogeny. J Mol Evol. 1997, 44: 112119. 10.1007/PL00006117.View ArticlePubMedGoogle Scholar
 Haughton D, Balado F: A modified watermark synchronisation code for robust embedding of data in DNA. IEEE 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2013Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.