ICRPfinder: a fast pattern design algorithm for coding sequences and its application in finding potential restriction enzyme recognition sites
© Li et al; licensee BioMed Central Ltd. 2009
Received: 26 November 2008
Accepted: 11 September 2009
Published: 11 September 2009
Restriction enzymes can produce easily definable segments from DNA sequences by using a variety of cut patterns. There are, however, no software tools that can aid in gene building -- that is, modifying wild-type DNA sequences to express the same wild-type amino acid sequences but with enhanced codons, specific cut sites, unique post-translational modifications, and other engineered-in components for recombinant applications. A fast DNA pattern design algorithm, ICRPfinder, is provided in this paper and applied to find or create potential recognition sites in target coding sequences.
ICRPfinder is applied to find or create restriction enzyme recognition sites by introducing silent mutations. The algorithm can map existing cut sites and, more importantly, can generate new unique cut sites within a specified region that are guaranteed not to be present elsewhere in the DNA sequence.
Restriction enzymes in genetic engineering
Restriction enzymes and methylases are components of a bacterial mechanism aimed at resisting attack from bacteriophages and removing foreign viral DNA sequences. Each restriction endonuclease recognizes a specific DNA sequence and cuts the DNA molecule at a particular position relative to that recognition sequence, producing a blunt or overhanging (sticky) end at each side of the incision site, depending upon the enzyme chosen, without damaging the nitrogenous bases. A DNA ligase can then splice a cut end to that of another DNA molecule. While restriction enzymes typically recognize specific short DNA sequences, the genetic code is redundant, with most amino acids being represented by more than one codon. Therefore, by substituting synonymous codons, one can create restriction sites without changing the amino acid sequence that is encoded.
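The effect of synonymous codon substitution can be illustrated with a minimal Python sketch. The two DNA strings below are hypothetical sequences invented for this illustration (they are not taken from the paper's data), and the helper name translate is likewise an assumption. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a coding sequence codon by codon (trailing partial codon ignored)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


NOTI = "GCGGCCGC"                # NotI recognition sequence
wild_type = "ATCCGAGGACGTGAG"    # hypothetical sequence encoding I-R-G-R-E
recoded = "ATCCGCGGCCGCGAG"      # same protein, three synonymous substitutions

# Both sequences encode the same peptide, but only the recoded one carries the site.
same_protein = translate(wild_type) == translate(recoded)
site_created = NOTI in recoded and NOTI not in wild_type
changed = sum(a != b for a, b in zip(wild_type, recoded))
```

Here three single-base changes, each in the third position of a codon, introduce a NotI site while leaving the encoded peptide unchanged.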
A DNA sequence can be synthesized by assembling short synthetic oligonucleotides using PCR amplification. Using this approach, several oligonucleotides with overlapping end sequences can be assembled into a whole DNA sequence. DNA synthesis using this PCR-based ligation method, however, has some limitations. For example, it can only produce DNA sequences with length up to approximately 1.5 kbp. In long DNA segments, the primer extension will stop at an unpredictable position in the DNA sequence, and the probability of base pair mismatches will increase. Additionally, the proportion of GC content is limited to 40-60% by this approach. Due to such limitations, a long whole DNA cannot be synthesized by PCR technology alone. Using restriction enzyme cutting technology, several DNA segments can be connected to form a longer full-length DNA sequence [2–4]. There is an increasing interest in artificially synthesizing in vitro large DNA constructs and even whole genomes with restriction enzyme cleavage followed by ligation of segments [2, 5–12].
Informatics support for restriction site analysis
There are several informatics tools for finding existing restriction enzyme recognition sites in a DNA sequence. NEBcutter works in a browser-based client-server mode. The server maintains a restriction enzyme database, and accepts a DNA sequence and several parameters from a user. After the calculation, the server returns the locations of recognition and cleavage sites and displays them in the user's browser. The NEBcutter algorithm for locating restriction sites is written in C using the gcc environment. Filtering and queue management tasks are also implemented in C. Graphical rendering of results is performed by the GD libraries, which provide functions such as zoom in, zoom out, and automatic adjustment of the number of tags of recognized sites. The web user interface is dynamically generated by PHP scripts, and the web server is Apache. NEBcutter accepts a manually entered DNA sequence, a sequence loaded from a local file, or a sequence identifier. If a sequence identifier is used, the server retrieves the data from NCBI databases. Linear and circular input sequences are both supported. The current version also offers a list of commonly used genomes and plasmids as default input sequences.
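Locating existing recognition sites in a linear or circular sequence can be sketched in a few lines of Python. This is only an illustration of the general idea under simple assumptions (exact, unambiguous patterns; forward strand only, which suffices for palindromic sites such as NotI). It is not NEBcutter's implementation, and the function name and test sequence are hypothetical.

```python
def find_existing_sites(dna, site, circular=False):
    """Return 0-based start positions of every exact occurrence of site in dna.

    For a circular molecule the first len(site) - 1 bases are appended to the
    end so that occurrences spanning the origin are also reported.
    """
    text = dna + dna[:len(site) - 1] if circular else dna
    hits = []
    pos = text.find(site)
    while pos != -1:
        hits.append(pos)
        pos = text.find(site, pos + 1)  # step by one to allow overlapping hits
    return hits


# Hypothetical 22-base sequence: one internal NotI site, plus one that only
# exists when the ends are joined (the sequence ends in GCGG and starts with CCGC).
example = "CCGCAAAGCGGCCGCTTTGCGG"
linear_hits = find_existing_sites(example, "GCGGCCGC")
circular_hits = find_existing_sites(example, "GCGGCCGC", circular=True)
```

Treating the same sequence as circular reveals a second site across the origin that a linear scan cannot see.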
NEBcutter is a powerful tool to locate restriction enzyme recognition sites. It supports a wide range of restriction enzymes. The illustration of results is flexible and clear. But NEBcutter has several limitations. The first is the potential for network bottlenecks. If the input sequence is large, such as that of a plasmid or a genome, the network response time becomes long and the process may even error out. The second limitation is that NEBcutter only locates recognition sites in the original DNA sequences as input. It does not support recoding to create a new restriction site. Sometimes users may need a specific restriction enzyme site at a certain location. NEBcutter provides no solution for such situations.
WatCut, created at the University of Waterloo, is another online tool for restriction enzyme analysis. It supports functions both to locate restriction enzyme cleavage sites directly in a given DNA sequence and to search for potential restriction sites that can be created in a DNA sequence using silent mutations. Input DNA sequences can be typed in the online form or loaded from the local drive. Then six sequences, three in the forward strand and the other three in the reverse strand, are read with frame shifting. After one of the six sequences is chosen, candidate restriction sites are listed. Results can be displayed graphically or listed textually in a table. Although WatCut introduces support for silent mutations, the functionality is limited. It can only find restriction enzyme recognition sites in a DNA sequence of at most 100 nucleotides. Like NEBcutter, the tool needs a server with PHP support and runs in a client-server mode, with potential performance bottlenecks in server response and network bandwidth.
The increasing interest in artificially synthesizing in vitro large DNA constructs and even whole genomes with the aid of restriction enzymes underlines the importance of creating robust informatics tools that can not only identify existing restriction enzyme recognition sites but also support the design of new restriction sites at a desired location, using silent mutations so that the protein sequence is unchanged. Based on this rationale, we designed a novel fast pattern finding algorithm, named Inverse Codon Replacement Pattern finder (ICRPfinder). In this paper, we describe several key aspects of the ICRPfinder algorithm and provide sample results using ICRPfinder to find or create restriction enzyme recognition sites in target coding sequences. ICRPfinder is a web-based application and can be accessed using a standard internet browser or run as a stand-alone application on a local machine.
Challenges of brute force approach
Table: Possible DNA sequences required to be looked up in the brute force approach (possible codons at each position).
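To give a feel for why brute force scales poorly, the following Python sketch counts the DNA sequences that encode a given peptide (the product of the number of synonymous codons at each position) and then searches them exhaustively. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the standard genetic code is assumed, and the peptide "IRGRE" is the short target used in the worked example later in this section.

```python
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Reverse lookup: amino acid -> list of synonymous codons.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def count_encodings(protein):
    """Number of distinct DNA sequences that translate to this protein."""
    return prod(len(CODONS_FOR[aa]) for aa in protein)


def brute_force_has_site(protein, site):
    """Enumerate every encoding and report whether any contains the site."""
    for codons in product(*(CODONS_FOR[aa] for aa in protein)):
        if site in "".join(codons):
            return True
    return False


n_candidates = count_encodings("IRGRE")            # 3 * 6 * 4 * 6 * 2
found = brute_force_has_site("IRGRE", "GCGGCCGC")  # NotI
```

Even this five-residue peptide has 864 candidate encodings, and the count multiplies with every additional residue, which is what the non-brute-force approaches are designed to avoid.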
Proposed non-brute force approaches
We propose two approaches that avoid the generation of all possible DNA sequences as required by the brute force approach. For illustration, we continue using the example with NotI, with the GCGGCCGC restriction site pattern.
At each position (amino acid) of the target protein all possible codons are compared to the NotI sequence. If no codon matches then we proceed to the next position and start a new match. If one codon is matched, however, then we continue to check the neighboring amino acids for potential complete matches. As numerous locally unmatched DNA sequences are discarded, the computational complexity is decreased to O(n*3n).
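One possible reading of this first approach is sketched below in Python. For every alignment of the pattern against the coding sequence, the codons are checked one amino acid at a time, and the alignment is abandoned as soon as an amino acid has no codon compatible with the pattern. The function names and the returned data layout are assumptions made for illustration. The pattern is assumed to contain only unambiguous bases.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def find_sites_codonwise(protein, site):
    """Return (dna_start, {residue_index: codon}) for each feasible placement.

    dna_start is the 0-based nucleotide position where the site would begin.
    """
    hits = []
    for start in range(3 * len(protein) - len(site) + 1):
        chosen = {}
        first, last = start // 3, (start + len(site) - 1) // 3
        for idx in range(first, last + 1):
            # Bases that the site imposes on this codon (codon offset -> base).
            required = {k: site[3 * idx + k - start]
                        for k in range(3)
                        if 0 <= 3 * idx + k - start < len(site)}
            options = [c for c in CODONS_FOR[protein[idx]]
                       if all(c[k] == b for k, b in required.items())]
            if not options:
                break  # local mismatch: discard this placement early
            chosen[idx] = options[0]
        else:
            hits.append((start, chosen))
    return hits


hits = find_sites_codonwise("IRGRE", "GCGGCCGC")
```

For the peptide "IRGRE" and the NotI pattern, the only surviving placement starts at nucleotide 4 and requires the codons CGC, GGC and CGC for the residues R, G and R.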
An alternative approach would be to "translate" the NotI sequence into three amino acid sequences, one for each reading frame. For example, we shift and "translate" the NotI sequence as:
GCG GCC GC -> "AA" with "GC" right overhang;
G CGG CCG C -> "RP" with "G" left overhang and "C" right overhang;
GC GGC CGC -> "GR" with "GC" left overhang.
There might be "overhangs" of one or two nucleotides at the right, left or both sides. We compare the "translated" NotI sequence to the target protein sequence first. If the "translated" sequence and the target protein sequence are matched, continue to compare the left and right overhangs.
We compare "AA", "RP" and "GR" one by one to the target "IRGRE", and find "GR" in the target sequence ("IR GR E"). We then check whether the left overhang "GC" can also be matched by the amino acid to the left, "R" ("I R GRE"). There are six codons for the amino acid R, and one of them, CGC, ends with the left overhang "GC" (C GC). This means that either the NotI recognition sequence (GCGGCCGC) is already present in the original coding sequence of the target protein TP53, or a recognition site can be created there by introducing a few silent mutations. In this approach the computational complexity is O(n*3), so execution time falls dramatically. We chose Approach 2, which is more efficient, for implementation. Since we "translate" the object DNA pattern into an amino acid sequence and compare it to the target protein sequence, we call this algorithm Inverse Codon Replacement Pattern finder (ICRPfinder). The object pattern can be a restriction enzyme recognition sequence, or any other nucleic acid sequence of interest, such as a promoter sequence.
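The second approach can be sketched in Python as follows. This is a minimal illustration of the idea described above, not the ICRPfinder source: the function names are hypothetical, the standard genetic code is assumed, and the pattern is assumed to contain only unambiguous bases (degenerate recognition sequences would need extra handling).

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def frames(site):
    """Yield (left_overhang, translated_core, right_overhang) for each frame."""
    for shift in range(3):
        left, body = site[:shift], site[shift:]
        full = len(body) - len(body) % 3
        core = "".join(CODON_TABLE[body[i:i + 3]] for i in range(0, full, 3))
        yield left, core, body[full:]


def find_sites_icrp(protein, site):
    """Return (dna_start, left, core, right) for each placement that can be
    realised with synonymous codons."""
    hits = []
    for left, core, right in frames(site):
        pos = protein.find(core)
        while pos != -1:
            lo, hi = pos - 1, pos + len(core)
            # The left overhang must be the suffix of some codon of the
            # preceding residue; the right overhang the prefix of some codon
            # of the following residue.
            left_ok = not left or (lo >= 0 and any(
                c.endswith(left) for c in CODONS_FOR[protein[lo]]))
            right_ok = not right or (hi < len(protein) and any(
                c.startswith(right) for c in CODONS_FOR[protein[hi]]))
            if left_ok and right_ok:
                hits.append((3 * pos - len(left), left, core, right))
            pos = protein.find(core, pos + 1)
    return hits


noti_frames = list(frames("GCGGCCGC"))
noti_hits = find_sites_icrp("IRGRE", "GCGGCCGC")
```

For NotI the three frames reproduce the "AA", "RP" and "GR" translations with their overhangs, and the search over "IRGRE" reports a single placement starting at nucleotide 4, in agreement with the worked example.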
ICRPfinder includes 2406 common restriction site patterns, and additional sequence patterns can be input by the user. All information regarding the restriction enzymes used in ICRPfinder was retrieved from REBASE version 906.
Results and Discussion
The results are rendered graphically, and appear in text format if the target coding sequence is short or when the user zooms in on the results.
Create unique patterns around a specific position
Table: Performance benchmark for ICRPfinder, NEBcutter and WatCut. Columns: details of searching; response time (sec) for the TP53 CDS (1182 bp) and for a random CDS (12,000 bp). Search details listed: wild type, silent mutation; whole coding sequence; 246 (generic enzymes).
Typical response times for ICRPfinder were under 1 second, while for NEBcutter the response times were between 6 and 12 seconds. There could be two reasons why the response times of NEBcutter are long compared to ICRPfinder. The first is the delay due to network transport. Although the server has excellent network connectivity, the transport latency cannot be neglected. The second could be the GUI rendering of results. The GUI rendering in NEBcutter is relatively slow because NEBcutter has to choose and display partial results according to the width of the display region. The NEBcutter response time can also vary depending on the number of results to be displayed. It appears that in the zoomed-out display, NEBcutter only displays a fraction of the possible restriction enzyme cut sites, while ICRPfinder displays all the restriction sites. If NEBcutter were to display all the restriction sites, we presume that its response time would be even longer.
For WatCut, the response times were 2-3 seconds; however, one must note that only a small fraction of the sequence was used, since WatCut limits the input coding sequence to 100 bp. WatCut was run with the default set of enzymes (246), while ICRPfinder was run with a larger default set, of 2406 enzymes.
ICRPfinder is a powerful tool for finding or creating a specific DNA sequence in a given coding sequence. It translates both the target coding sequence and the object DNA pattern to amino acid sequences and finds the matches between the two amino acid sequences. In addition to finding restriction enzyme recognition sites in the given target coding sequence, ICRPfinder can create recognition sites without changing the translated protein sequence. As a general-purpose tool, ICRPfinder can also accept users' own DNA patterns and so help to analyze DNA-protein binding events for those patterns. The non-brute force approach with DNA translation can dramatically decrease the computation time for the pattern finding/creating process. Since ICRPfinder is a browser-based application, it can run on any platform, in on-line or off-line mode. This tool should be useful for experimental biologists who manipulate or synthesize large DNA sequences or even whole genomes.
Availability and requirements
Project name: ICRPfinder
Project home page: http://sourceforge.net/projects/icrpfinder/
Operating system(s): Platform independent
License: GNU GPL
Any restrictions to use by non-academics: See GNU GPL license for details
The authors would like to thank Jack Emery for comments that improved this manuscript. This work was supported by a start-up fund from ASU to Stephen Albert Johnston.
- Xiong AS, Yao QH, Peng RH, Duan H, Li X, Fan HQ, Cheng ZM, Li Y: PCR-based accurate synthesis of long DNA sequences. Nat Protoc 2006, 1(2):791–797. doi:10.1038/nprot.2006.103
- Kodumal SJ, Patel KG, Reid R, Menzella HG, Welch M, Santi DV: Total synthesis of long DNA sequences: synthesis of a contiguous 32-kb polyketide synthase gene cluster. Proc Natl Acad Sci USA 2004, 101(44):15573–15578. doi:10.1073/pnas.0406911101
- Reisinger SJ, Patel KG, Santi DV: Total synthesis of multi-kilobase DNA sequences from oligonucleotides. Nat Protoc 2006, 1(6):2596–2603. doi:10.1038/nprot.2006.426
- Richmond KE, Li MH, Rodesch MJ, Patel M, Lowe AM, Kim C, Chu LL, Venkataramaian N, Flickinger SF, Kaysen J, et al.: Amplification and assembly of chip-eluted DNA (AACED): a method for high-throughput gene synthesis. Nucleic Acids Res 2004, 32(17):5011–5018. doi:10.1093/nar/gkh793
- Itaya M, Fujita K, Kuroki A, Tsuge K: Bottom-up genome assembly using the Bacillus subtilis genome vector. Nat Methods 2008, 5(1):41–43. doi:10.1038/nmeth1143
- Cello J, Paul AV, Wimmer E: Chemical synthesis of poliovirus cDNA: generation of infectious virus in the absence of natural template. Science 2002, 297(5583):1016–1018. doi:10.1126/science.1072266
- Gibson DG, Benders GA, Andrews-Pfannkoch C, Denisova EA, Baden-Tillson H, Zaveri J, Stockwell TB, Brownley A, Thomas DW, Algire MA, et al.: Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome. Science 2008, 319(5867):1215–1220.
- Shevchuk NA, Bryksin AV, Nusinovich YA, Cabello FC, Sutherland M, Ladisch S: Construction of long DNA molecules using long PCR-based fusion of several fragments simultaneously. Nucleic Acids Res 2004, 32(2):e19. doi:10.1093/nar/gnh014
- Smith HO, Hutchison CA 3rd, Pfannkoch C, Venter JC: Generating a synthetic genome by whole genome assembly: phiX174 bacteriophage from synthetic oligonucleotides. Proc Natl Acad Sci USA 2003, 100(26):15440–15445. doi:10.1073/pnas.2237126100
- Holt RA, Warren R, Flibotte S, Missirlis PI, Smailus DE: Rebuilding microbial genomes. Bioessays 2007, 29(6):580–590. doi:10.1002/bies.20585
- Yount B, Curtis KM, Baric RS: Strategy for systematic assembly of large RNA and DNA genomes: transmissible gastroenteritis virus model. J Virol 2000, 74(22):10600–10611. doi:10.1128/JVI.74.22.10600-10611.2000
- Forster AC, Church GM: Towards synthesis of a minimal cell. Mol Syst Biol 2006, 2:45. doi:10.1038/msb4100090
- Vincze T, Posfai J, Roberts RJ: NEBcutter: A program to cleave DNA with restriction enzymes. Nucleic Acids Res 2003, 31(13):3688–3691. doi:10.1093/nar/gkg526
- WatCut: An on-line tool for restriction analysis, silent mutation scanning, and SNP-RFLP analysis [http://watcut.uwaterloo.ca/watcut/watcut/template.php]
- The Restriction Enzyme Database [http://rebase.neb.com/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.