DNA-based watermarks using the DNA-Crypt algorithm

Background: The aim of this paper is to demonstrate the application of watermarks based on DNA sequences to identify the unauthorized use of genetically modified organisms (GMOs) protected by patents. Predicted mutations in the genome can be corrected by the DNA-Crypt program, leaving the encrypted information intact. Existing DNA cryptographic and steganographic algorithms use synthetic DNA sequences to store binary information. Although these sequences can be used for authentication, they may change the target DNA sequence when introduced into living organisms.

Results: The DNA-Crypt algorithm and image steganography are based on the same watermark-hiding principle, namely using the least significant base in the case of DNA-Crypt and the least significant bit in the case of image steganography. DNA-Crypt can be combined with binary encryption algorithms like AES, RSA or Blowfish. It is able to correct mutations in the target DNA with several mutation correction codes such as the Hamming-code or the WDH-code. Mutations, which occur infrequently, may destroy the encrypted information; an integrated fuzzy controller therefore applies a set of heuristics based on three input dimensions and recommends whether or not to use a correction code. These three input dimensions are the length of the sequence, the individual mutation rate and the stability over time, which is represented by the number of generations. In silico experiments using Ypt7 in Saccharomyces cerevisiae show that the DNA watermarks produced by DNA-Crypt do not alter the translation of mRNA into protein.

Conclusion: The program is able to store watermarks in living organisms and can maintain the original information by correcting mutations itself. Pairwise or multiple sequence alignments show that DNA-Crypt produces few mismatches between the sequences, similar to all steganographic algorithms.


Background
Sensitive information, especially secret information, must be protected against unauthorized access. To achieve this, researchers have looked for new cryptographic or steganographic techniques. Existing algorithms encrypt or hide information in binary files; however, other media can be used as well. Several algorithms encode information into DNA sequences, for example the concepts of Clelland et al., Gehani et al., Leier et al., Wong et al. and Arita et al. [1][2][3][4][5]. These techniques can be used for authentication or to store data for a long time.

Clelland et al
Inspired by the micro-dots used during the Second World War, Clelland et al. developed an extension of this principle [1]. The scientists produced artificial DNA strands, which contained secret messages. A triplet encodes one character or number (Table 1). The Clelland algorithm is a simple substitution cipher, which encodes characters into DNA sequences using the encoding function given in Table 1. The receiver must know the decoding function and the primers to decode the message. The primers are used for the polymerase chain reaction, and in the last step the amplified DNA sequence has to be sequenced and decoded. To improve the security one can use dummy strands, which are not random but correspond to words out of a dictionary.

Gehani et al
The original One-Time pad uses the XOR operation (exclusive or, ⊕). In the case of DNA, XOR is impracticable, and it is therefore better to use the properties of DNA itself. Gehani et al. established a DNA One-Time pad by creating word pairs [2]. The first word is the plain text and the second one is the cipher text. After such a block of plain and cipher text, there is a stop codon (Figure 1). The DNA polymerase completes the plain and cipher text.
To encode a message, the plain text is mixed with the DNA sequences. It binds directly to the corresponding complementary sequence. The DNA polymerase creates the cipher text accordingly, and the decoding is functionally analogous: the cipher text binds to its complement and the DNA polymerase creates the plain text.
Figure 1: DNA One-Time pad.

Leier et al
Leier et al. encoded binary information into DNA sequences. A short DNA sequence represents the binary 1₂, another one represents 0₂ [3]. In addition there are two further short DNA sequences, which represent start and end. The fragments have sticky ends and can be ligated (Figure 2). All resulting sequences have the form s{0₂|1₂}e. The start and end markers carry primer sequences for the polymerase chain reaction on one side, which cannot be ligated.
Although it seems to be more complicated, it is very similar to the algorithm of Clelland et al. The resulting DNA sequence is mixed with dummy strands and can only be detected and isolated knowing the primer sequences.
Figure 2: DNA binary strands. Short DNA strands represent the binary 1₂ (light blue), 0₂ (white), and the start and end markers (dark blue). These sequences can be ligated to long strands by using the sticky ends. Modified from Leier et al. [3].

Wong et al
Wong et al. developed a steganographic algorithm based on DNA, which is able to store data in living organisms [4]. The data are translated into a DNA sequence, which is inserted into a vector. The insert sequence is flanked by two primer sequences, which do not yet exist in the genome. This vector is introduced into a cell of a living organism, where it coexists and is replicated with the genomic DNA. To extract the data they used a polymerase chain reaction.

Arita et al
A value of 0 means to keep the original base at the third position of a codon, while a value of 1 means to change the third base at that position (Table 2). Arita et al. added a parity bit to each letter, to keep it odd for possible error detection [5]. They encoded 'KEIO' into the ftsZ gene of Bacillus subtilis, which is essential for cell division, and demonstrated, as expected, that the changed codon sequences did not affect the cell division, colony morphology, growth rate and sporulation frequency of these bacteria. To extract the encoded message one has to know the original sequence, so that one can decide whether a codon is the original or the altered sequence.
The DNA-Crypt algorithm is based on small redundant regions comparable to least significant bits in the case of image steganography (Figure 3). The least significant bits encode a difference in colour of just one on the colour scale, not visible to the human eye, and can be used to hide information in images.

Comparison to DNA-Crypt
Text or binary information can also be encoded using any DNA-based encryption. However, unlike image steganography, DNA steganography does not lead to a loss of information if the targeted region is a protein-coding region. DNA-Crypt searches a genome for "synonymous codons" and introduces point mutations by changing their bases [see Additional file 1].
This algorithm offers the possibility to incorporate data into the genome of living organisms, using an alternative method to Wong et al. [4] (Figure 4). The algorithm is similar to the algorithm of Arita et al. [5], but DNA-Crypt has some important extensions, e.g. the use of several encryption and mutation correction codes, which allows encoding of binary information. These extensions are described in the next subsections. A comparative overview of the algorithms and their features is shown in Table 3.
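The principle of writing data into synonymous codons can be sketched as follows. This is a minimal illustration and not the DNA-Crypt implementation: it assumes that only the eight four-fold degenerate codon families of the standard genetic code are used, so that any third base leaves the amino acid unchanged, and that one payload base is written per usable codon. The function names `embed` and `extract` are hypothetical.

```python
# Codon prefixes whose third base is free in the standard genetic code
# (four-fold degenerate families: Ala, Gly, Pro, Thr, Val, Leu, Ser, Arg).
# Restricting the sketch to these families is an assumption.
FOURFOLD_PREFIXES = {"GC", "GG", "CC", "AC", "GT", "CT", "TC", "CG"}


def embed(cds, payload_bases):
    """Write payload bases into the third position of usable codons."""
    usable_end = len(cds) - len(cds) % 3
    payload = iter(payload_bases)
    out = []
    for i in range(0, usable_end, 3):
        codon = cds[i:i + 3]
        if codon[:2] in FOURFOLD_PREFIXES:
            base = next(payload, None)
            if base is not None:
                codon = codon[:2] + base  # synonymous point mutation
        out.append(codon)
    if next(payload, None) is not None:
        raise ValueError("payload does not fit into the sequence")
    return "".join(out) + cds[usable_end:]


def extract(cds, n_bases):
    """Read back the third base of the first n_bases usable codons."""
    found = [cds[i + 2] for i in range(0, len(cds) - 2, 3)
             if cds[i:i + 2] in FOURFOLD_PREFIXES]
    return "".join(found[:n_bases])


cds = "ATGGCTGGAAAACCCTAA"       # ATG GCT GGA AAA CCC TAA
marked = embed(cds, "CAT")       # three usable codons: GCT, GGA, CCC
print(marked, extract(marked, 3))
```

Under these assumptions the first two bases of every codon stay untouched, so the encoded protein is the same before and after embedding.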

Encoding binary information using DNA-Crypt
DNA-Crypt encodes binary information using a substitution cipher; a standard setting is given in Table 4.
Two bits can be encoded by one base, so one byte needs four bases for its encoding.
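The two-bits-per-base encoding can be sketched as below. The assignment A = 00, C = 01, G = 10, T = 11 is a hypothetical choice made only for illustration; the standard setting of DNA-Crypt is the one given in Table 4 and may differ. The function names are likewise illustrative.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11 (not taken from Table 4)


def bytes_to_bases(data):
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def bases_to_bytes(seq):
    """Decode a base string whose length is a multiple of four."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of four")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


encoded = bytes_to_bases(b"this is a test")
print(len(encoded), bases_to_bytes(encoded))
```

A 14-byte message therefore occupies 56 bases before any encryption or mutation correction is applied.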
Based on this binary encoding, several private and public key cryptographic algorithms are integrated in DNA-Crypt:
• One-Time pad [6]
• AES [7]
• Blowfish [6]
• RSA [8,9]
To use DNA-Crypt one has to register, so that DNA-Crypt can create AES, Blowfish and RSA keys for the user. These keys can be used to encrypt the binary information, which is then integrated into the genome. In addition it is possible to export and import these keys and to exchange them with other users. Furthermore, the user can create new keys in DNA-Crypt or delete old ones. Another possibility is to use a One-Time pad instead of an encryption key.

Mutation correction
Mutations do not occur very often, approximately 10^-10 to 10^-15 per cell division, but they can destroy the encrypted information in DNA sequences. To correct these failures DNA-Crypt uses correction codes based on binary error correction. One of them is the 8/4 Hamming-code and another one is the WDH-code [10]. The advantage of the WDH-code is that it can correct more mutations than the 8/4 Hamming-code. The n-times WDH-code repeats the encrypted DNA sequence n times, which allows failures to be corrected. All WDH-codes where n is an odd number are perfect.
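The repetition idea behind the WDH-code can be illustrated with a short sketch. It assumes that the n-times WDH-code is decoded by a majority vote at every sequence position; the text states only that the sequence is repeated n times, so the decoding rule and the names `wdh_encode` and `wdh_decode` are assumptions.

```python
from collections import Counter


def wdh_encode(seq, n):
    """Repeat the sequence n times."""
    return seq * n


def wdh_decode(coded, n):
    """Recover the sequence by a per-position majority vote over n copies."""
    if len(coded) % n:
        raise ValueError("coded length must be a multiple of n")
    length = len(coded) // n
    copies = [coded[i * length:(i + 1) * length] for i in range(n)]
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*copies))


coded = list(wdh_encode("CTCA", 5))
coded[0] = "G"   # mutation in the first copy
coded[4] = "G"   # same position mutated in the second copy
print(wdh_decode("".join(coded), 5))
```

With an odd n, a majority vote of this kind outvotes up to (n - 1) / 2 mutated copies at any one position.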
Table 3: Comparative overview of the algorithms (rows: Clelland et al., Gehani et al., Leier et al., Wong et al., DNA-Crypt). Legend: organism: the use of this algorithm in living organisms; affect: observation that the algorithm exerts an effect on the organism; error detection/correction: the algorithm shows an error detection/correction function; binary: binary information can be encoded; encryption: the use of a binary encryption algorithm like AES or RSA; utilization: storage utilization in a 100 bp DNA sequence; - = negative; + = positive.

The 8/4 Hamming-code can only correct ≤ 25% of the mutations. Four bits are used for information (b3, b2, b1, b0) and the other four bits as parity bits. A complete byte is represented by these eight bits, which are called h7, h6, h5, h4, h3, h2, h1, h0. To decode the byte, parity sums are built over these bits. If p = 0 there is 1 failure in the byte, which can be corrected using Table 5.
Only one of four bits can be corrected, so not all mutations can be corrected by the 8/4 Hamming-code. Failures in which a base changes in only one bit can be corrected, e.g. 00 ↔ 01 or 11 ↔ 10. Failures such as 00 ↔ 11 or 10 ↔ 01, which change both bits of a base, cannot be corrected.
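For illustration, single-error correction with four information bits and four parity bits can be sketched with the textbook extended Hamming code. The bit layout below (an overall parity bit at index 0 and Hamming(7,4) positions 1 to 7) and the status flags are assumptions; they are not the h7...h0 layout, the parity sums or the p flag of DNA-Crypt, whose conventions may differ.

```python
def hamming84_encode(nibble):
    """Encode 4 information bits into 8 bits (textbook extended Hamming code)."""
    code = [0] * 8  # index 0: overall parity, indices 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) % 2             # overall parity bit
    return code


def hamming84_decode(code):
    """Return (nibble, status); corrects one flipped bit, detects two."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if overall == 1:
        # Odd overall parity: exactly one bit flipped, located by the syndrome
        # (syndrome 0 points at the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        return None, "uncorrectable"  # two bits flipped
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = hamming84_encode(0b1011)
word[5] ^= 1  # simulate a single-bit mutation
print(hamming84_decode(word))
```

In this sketch a mutation that flips both bits of one base shows up as a double error, which is detected but not corrected, in line with the limitation described above.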
The limiting resource for mutation correction is not time, but space. The advantage of the 8/4 Hamming-code is that it is very compact. The space requirement of the 8/4 Hamming-code is f(n) = 2n ∈ Θ(n). In contrast, the n-times WDH-code repeats the sequence n times and needs correspondingly more space.
For example, to encode one byte, which means a DNA sequence of four bases, the 8/4 Hamming-code needs eight synonymous codons instead of twenty synonymous codons for the 5-times WDH-code. In contrast to the data published by Arita et al., we can exhibit not only error detection but also error correction, which enables us to maintain the data. This represents an important advantage.
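The space requirement in this example can be reproduced with a few lines of arithmetic. The sketch assumes one payload base per synonymous codon, as implied by the figures above (eight codons for the 8/4 Hamming-code, twenty for the 5-times WDH-code); the function name `codons_needed` is hypothetical.

```python
BASES_PER_BYTE = 4  # two bits per base


def codons_needed(n_bytes, code="none", repeats=1):
    """Synonymous codons needed to store n_bytes under a correction code."""
    bases = n_bytes * BASES_PER_BYTE
    if code == "none":
        return bases
    if code == "hamming":      # 8/4 Hamming-code doubles the length
        return 2 * bases
    if code == "wdh":          # n-times WDH-code repeats the sequence
        return repeats * bases
    raise ValueError("unknown correction code: " + code)


print(codons_needed(1, "hamming"), codons_needed(1, "wdh", repeats=5))
```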

Fuzzy controller
The integrated fuzzy controller decides and recommends whether to use the 8/4 Hamming-code, the WDH-code or no mutation correction for optimal performance [11][12][13][14] [see Additional file 2]. It uses singleton fuzzification and has three input dimensions, each separated into three triangular sets. The first dimension is the individual mutation rate (φ) of the DNA sequence containing the secret message (Figure 5). This is based on a standard mutation rate, by default 1 × 10^-7 for prokaryotes and 1 × 10^-10 for eukaryotes, which is changed by specific mutation rates (α_i) for each base pair. These changes are based on the transversion and transition rate and, in addition, on the stability (δ) of GC-rich regions.
The second input dimension is the length of the DNA sequence containing the secret message ( Figure 6).

Figure 5: The first input dimension. The first input dimension of the fuzzy controller is the mutation rate. It is separated into three triangular sets X_i = (a_m, a_λ, a_ρ). The first, called "low" = (0, 0, 6), describes a low mutation rate. The second, "middle" = (10, 4, 16), and the third, "high" = (20, 14, 20), describe a middle and a high mutation rate.
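The triangular sets of Figure 5 can be evaluated with a small membership function. The sketch reads each triple (a_m, a_λ, a_ρ) as (peak, left foot, right foot), which matches the three sets given in the caption; the scale of the axis (0 to 20) is taken as is, because its relation to the actual mutation rate is not stated here. The names `triangular` and `fuzzify` are illustrative.

```python
# Triangular sets from Figure 5, written as (peak, left foot, right foot).
MUTATION_RATE_SETS = {
    "low": (0, 0, 6),
    "middle": (10, 4, 16),
    "high": (20, 14, 20),
}


def triangular(x, peak, left, right):
    """Degree of membership of x in a triangular fuzzy set."""
    if x == peak:
        return 1.0
    if left < x < peak:
        return (x - left) / (peak - left)
    if peak < x < right:
        return (right - x) / (right - peak)
    return 0.0


def fuzzify(x, sets=MUTATION_RATE_SETS):
    """Singleton fuzzification: membership of a crisp value in every set."""
    return {label: triangular(x, *params) for label, params in sets.items()}


print(fuzzify(5))
```

A crisp value of 5 on this scale belongs partly to "low" and partly to "middle", which is what lets neighbouring rules of the controller fire at the same time.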
The three input dimensions are linked through a set of rules based on heuristics to one output dimension [see Additional file 3]. The maximum obtained for each correction code defines a cut on the y axis of its output set (Figure 8). In the next step the fuzzy controller decides whether to use an 8/4 Hamming-code, a WDH-code or no mutation correction by using the first-of-maximum method, and recommends it to the user.
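The first-of-maximum step can be sketched as follows, using the output sets and the example cut levels given in the caption of Figure 8. The sketch assumes that a cut clips a set at the given height, that the clipped sets are combined by taking the maximum, and that the output axis is scanned on an integer grid from 0 to 1000; these details and the function names are assumptions, and the rule base that produces the cuts is not reproduced.

```python
# Output sets from Figure 8, written as (peak, left foot, right foot).
OUTPUT_SETS = {
    "none": (0, 0, 400),
    "Hamming-code": (500, 100, 900),
    "WDH-code": (1000, 600, 1000),
}


def membership(x, peak, left, right):
    """Degree of membership of x in a triangular fuzzy set."""
    if x == peak:
        return 1.0
    if left < x < peak:
        return (x - left) / (peak - left)
    if peak < x < right:
        return (right - x) / (right - peak)
    return 0.0


def first_of_maximum(cuts, lo=0, hi=1000):
    """Return (x, label) of the first point where the clipped sets peak."""
    best_x, best_mu, best_label = lo, -1.0, None
    for x in range(lo, hi + 1):
        for label, params in OUTPUT_SETS.items():
            mu = min(cuts[label], membership(x, *params))
            if mu > best_mu + 1e-12:  # strictly higher: keeps the first maximum
                best_x, best_mu, best_label = x, mu, label
    return best_x, best_label


print(first_of_maximum({"none": 0.28, "Hamming-code": 0.67, "WDH-code": 0.45}))
```

With the cuts of the Figure 8 example, the first maximum lies on the rising edge of the "Hamming-code" set, so the sketch arrives at the same recommendation as the caption.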

Results
The program described above was tested by in silico experiments using the DNA sequence encoding Ypt7 in Saccharomyces cerevisiae.

Ypt7
The small GTPases termed Ypt in yeast and Rab in higher eukaryotes are molecular switches in cellular transport processes [21]. Each Ypt protein is localized to the membrane of specific intracellular compartments and is highly specific for a particular transport step [22].
The Ypt7 GTPase from S. cerevisiae is involved in late endosome-to-vacuole transport and vacuole fusion events [23,24]. Ypt7 is one of the 11 members of the S. cerevisiae Ypt family and is homologous to mammalian Rab7.
Analysis of the Ypt7 DNA sequence showed that 32% of the codons allow synonymous substitutions, resulting in a capacity of 16 bytes that can be encrypted (Table 6). The first steganogram contains the message "this is a test" and the second one "yet another test" [see Additional file 4].
The results of the analyses of these steganograms with the fuzzy controller are shown in Table 7. Translation with DNA-Crypt and the Expasy Translate Tool shows that the translated amino acid sequences are identical [25].
Figure 8: The output dimension of the fuzzy controller. The triangular sets are "none" = (0, 0, 400), "Hamming-code" = (500, 100, 900) and "WDH-code" = (1000, 600, 1000). The maximum of a triangular set, calculated by the set of heuristics of the fuzzy controller, means a cut on the y axis. A cut at 0.28 for no correction code, at 0.67 for the Hamming-code and at 0.45 for the WDH-code is shown. The first-of-maximum (fom) represents the recommended correction code; in this case the fuzzy controller recommends the Hamming-code.
The pairwise and the multiple sequence alignments show a few mismatches between the three sequences (Figures 9, 10, 11).
The pairwise sequence alignment was performed with Dotlet and the multiple sequence alignment was performed using ClustalW of the European Bioinformatics Institute with standard settings [26,27].

Discussion
DNA-Crypt produces few sequence mismatches, similar to the low noise in image steganography. In the case of image steganography one can look at the least significant bits to attack the steganographic algorithm. To attack DNA steganography one can perform pairwise or multiple sequence alignments with the original sequences.
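Such a comparison can be sketched without a full alignment tool. The sketch assumes that a steganogram has the same length as the original sequence, because only point mutations are introduced, so that a codon-by-codon comparison is sufficient; the function name `codon_mismatches` is hypothetical and does not replace Dotlet or ClustalW.

```python
def codon_mismatches(original, candidate):
    """Return the indices of codons that differ between two sequences."""
    if len(original) != len(candidate):
        raise ValueError("sequences must have the same length")
    return [i // 3 for i in range(0, len(original) - 2, 3)
            if original[i:i + 3] != candidate[i:i + 3]]


print(codon_mismatches("ATGGCTGGAAAACCCTAA", "ATGGCCGGAAAACCTTAA"))
```

An attacker who holds the original sequence can locate the altered codons in this way, which is why the payload itself should be encrypted.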

Conclusion
The DNA-Crypt algorithm can encode cryptic messages into DNA sequences, which can be used as watermarks for authentication. DNA-Crypt is a substantial extension of other DNA-based steganographic algorithms: it can be used in combination with a binary encryption algorithm such as AES, RSA or Blowfish and a mutation correction code such as the Hamming-code or the WDH-code. The most appropriate of these correction codes can be selected by a fuzzy controller, which uses three input dimensions.
Mutations which cause changes in the reading frame are problematic and are not appropriate for DNA steganography. Mutations which change a non-synonymous codon to a synonymous codon or vice versa are more important, as these mutations cause errors in the encrypted information. The relevance of these errors depends on the encrypted information. If the encrypted information is an image, e.g. a logo, there would only be a linear colour shift in the image, which is not very relevant and can be corrected very easily. However, if the encrypted information must remain correct, e.g. a password, the WDH-code must be used to detect these mutations.
We have not encountered any problems so far in performing our in silico analyses using DNA-Crypt watermarks in DNA coding regions. The use of DNA-Crypt in non-coding sequences, such as regulatory RNA, promoter and enhancer sequences, has to be tested in silico and in vivo. Further analyses to clarify whether alternative splicing events pose a problem for watermarks still have to be carried out. In conclusion, the DNA-Crypt algorithm represents an interesting tool for hiding authenticating watermarks within coding DNA sequences in silico and most probably in living organisms without affecting the process of protein translation and protein function.
Figure 9: Dotplot of Ypt7 and steganogram 1. Pairwise sequence alignment with Dotlet between the original sequence and the steganogram containing "this is a test" [26].
Figure 10: Dotplot of Ypt7 and steganogram 2. Pairwise sequence alignment with Dotlet between the original sequence and the steganogram containing "yet another test" [26].
Figure 11: Multiple sequence alignment. Multiple sequence alignment of the original sequence and two steganograms [27].