An algorithm of discovering signatures from DNA databases on a computer cluster
© Lee and Sheu; licensee BioMed Central Ltd. 2014
Received: 22 April 2014
Accepted: 29 September 2014
Published: 5 October 2014
Signatures are short sequences that are unique in a database and not similar to any other sequence in it, so they can serve as a basis for identifying different species. Several signature discovery algorithms have been proposed, but they require the entire database to be loaded into memory, which restricts the amount of data they can process and makes them unable to handle large databases. These algorithms are also sequential, so their discovery speed leaves room for improvement.
In this research, we apply a divide-and-conquer strategy to signature discovery for the first time and propose a parallel signature discovery algorithm for a computer cluster. The divide-and-conquer strategy removes the limitation that prevents existing algorithms from processing large databases, and a parallel computing mechanism improves the efficiency of signature discovery. Even when run with only the memory of regular personal computers, the algorithm can process large databases, such as the human whole-genome EST database, that existing algorithms could not process.
The algorithm proposed in this research is not limited by the amount of usable memory and can rapidly find signatures in large databases, making it useful in applications such as Next Generation Sequencing and other large database analysis and processing. The implementation of the proposed algorithm is available at http://www.cs.pu.edu.tw/~fang/DDCSDPrograms/DDCSD.htm.
Mutations give diversity to DNA sequences, which has led to the evolution of a variety of species and to multiple species descending from the same ancestor. Although such species share similar DNA sequences inherited from a common ancestor, evolution has also given each of them unique DNA sequences, which may be understood as the signatures of those particular species and can be used to tell the species apart [1, 2]. For example, DNA signatures have already been used to identify 14 types of human pathogenic yeast.
Signatures are defined as DNA patterns that are significantly different from other sequences and appear only once in the sequence database. Thus, the purpose of signature discovery is to find all of the signatures in a database. Much research has already been conducted in signature discovery. Amin et al. integrated multiple bioinformatics tools, including CG and IslandPath, to determine horizontally transferred, pathotype-specific signature genes as targets for specific, high-throughput molecular diagnostic tools and reverse vaccinology screens. PrimerHunter can be used to select highly sensitive and specific primers for virus subtype identification. To guarantee high sensitivity and specificity, PrimerHunter selects primers such that they efficiently amplify one of the target sequences representing different isolates of the subtype of interest, and none of the non-target sequences representing isolates of closely related virus subtypes. Accurate estimates of the melting temperature of mismatches, based on a nearest-neighbor model and calculated via a fractional programming algorithm, are used in PrimerHunter to ensure the desired amplification properties. TOFI is a tool for identifying oligonucleotide fingerprints for microarray-based pathogen diagnostic assays, which combines genome comparison tools, probe design software, and sequence alignment programs [9, 10]. TOFI is typically used to design fingerprints for a single genome. An enhanced multiple-genome pipeline presented by Satya et al. allows for efficient design of microarray probes common to groups of target genomes. Insignia is a web-based tool for identifying genomic signatures that are perfectly conserved by all target genomes and absent from all background genomes, based on databases of bacterial and viral genomic sequences that comprise over 8300 distinct organisms [12, 13].
TOFI designs signatures for microarray-based assays, and Insignia finds unique sequence segments that can be used to design both PCR and microarray signatures. Insignia and TOFI have the ability to identify genomic signatures that are common to multiple target genomes. Insignia and TOFI perform similar computations, but Insignia can be run online and requires fewer computational resources. TOFI and Insignia both build consensus regions among multiple genomes through pairwise alignments between the target genomes. Insignia reports only the unique segments in the target genomes and provides an option for users to run Primer3, a PCR signature design software, on these unique segments. To quickly identify signatures in target and background genomes, Insignia has to maintain a specialized database containing pre-computed matches between every pair of genomes. However, this advantage in speed comes with the limitation that users are restricted to the target and background genomes that are part of the Insignia database, with no option to use other sequences as target or background genomes. TOPSI is a tool that extends the TOFI framework to design signatures for PCR-based pathogen diagnostic assays. Like Insignia, TOPSI identifies unique segments through pairwise alignments between the input genomes. However, TOPSI goes beyond identification of unique segments, and incorporates modules to design PCR signatures from the unique segments and perform extensive specificity analysis on the designed signatures. TOPSI can provide a list of PCR signatures common to all input targets without manual manipulation. CaSSiS is capable of computing comprehensive sets of sequence- and group-specific signatures that guarantee a predefined Hamming distance, the number of mismatches with non-target sequences, from collections of deeply hierarchically clustered sequences. CaSSiS tries to determine perfect group-covering signatures for every target group.
For groups lacking a perfect common signature, CaSSiS finds signatures with maximal group coverage within a user-defined specificity. Zheng’s algorithm uses the Hamming distance between sequences as the measure for signature discovery. Suppose l and d are two whole numbers. An l-pattern is a DNA sequence of length l. Two l-patterns are (l,d)-similar if the Hamming distance between them does not exceed d. If no (l,d)-similar l-pattern can be found for a pattern in the database, the pattern is defined as a signature under the discovery condition (l,d). Zheng’s algorithm can find all of the signatures in a database as defined above. The IMUS algorithm improved upon Zheng’s algorithm to give better discovery efficiency, but requires more memory. Based on mathematical analysis, if the discovery condition is set as (l = 24, d = 4), when discovering signatures in a uniformly distributed database with a size of 2^30, IMUS requires only 7.4% of the string comparisons made by Zheng’s algorithm but creates 256 times more entries in the index. CMD is designed to discover all implicit signatures from DNA databases, where implicit signatures are signatures that satisfy discovery conditions looser than a given discovery condition.
However, none of the above algorithms distribute the computation of the databases onto multiple computers in a cluster. To use the algorithms in such a way, additional scripts must be applied to control the distribution and collection of the databases and results. Unfortunately, many of these approaches do not provide a formal definition for their distribution strategies. Some of the approaches, for example Insignia and CaSSiS, provide strategies to distribute the computation of the databases onto multiple computers in a cluster, but the steps of distribution and collection are not automatic, so manual manipulation is still necessary. The match pipeline in Insignia applies strategies to reduce redundancy in sub-datasets, but relies mainly on preprocessing. PTPan, Jellyfish and DSK apply different strategies to avoid the necessity of loading the whole database into memory for searches. Each of the three approaches uses secondary storage. For example, Jellyfish and DSK use hash tables to compute the k-mers for a given k. Both algorithms achieve space efficiency by keeping most of the hash tables on disk. When counting k-mers over multiple hash tables, Jellyfish would need to store the intermediate k-mer counts on disk, which requires significantly more space, and the merge phase is not parallelized. This makes the algorithm time intensive for large databases. IMUS and Zheng’s algorithm both have two disadvantages. First, these algorithms require that the entire database to be processed (including all of the data structures that are used during computation) be loaded into memory, meaning that when the amount of data exceeds the memory capacity, these algorithms are unable to complete processing and cannot be used. Second, they are both sequential algorithms, so the time necessary for larger databases is extensive.
Due to these two disadvantages, neither IMUS nor Zheng’s algorithm is suitable for applications that require processing large databases. This is a particular problem with the development of Next Generation Sequencing (NGS), as the rate of creation of sequence data is increasing daily, leading also to larger databases. This renders both IMUS and Zheng’s algorithm, which are unable to process large amounts of data and require longer processing times, unsuitable for NGS data analysis.
Divide-and-conquer is a computational strategy for solving large, complicated problems and processing large amounts of data. The basic idea is as follows: suppose the amount of data that needs to be processed for a problem is represented by |D|. If |D| is small, the problem can be solved directly. Otherwise, the problem is divided into multiple smaller subproblems of the same form as the original problem. These subproblems are solved recursively, and their results are combined into a solution to the original problem. Each recursion therefore includes three main steps: (1) solve: if the problem is small and easy to solve, find a solution directly; (2) divide-and-recur: divide the original problem into multiple smaller subproblems of the same form, then recursively find the solution to each subproblem; (3) combine: combine the solutions of the subproblems into the solution to the original problem. In addition, as technology has matured, the price of multi-core CPUs has continued to fall, so using parallel processing on a computer cluster to enhance processing efficiency has become much more feasible. In fact, parallel processing technology is already used in many bioinformatics research fields, such as sequence alignment and analysis, protein structure prediction, and motif finding [24–31]. Applying the divide-and-conquer strategy and parallel processing technology to signature discovery would improve the efficiency of discovery in large databases.
In this research, we propose a signature discovery algorithm called the distributed divide-and-conquer-based signature discovery (DDCSD) algorithm. The DDCSD algorithm is designed specifically for discovering signatures on a computer cluster. It automates the steps of distributing the database and collecting the unique signatures, so the signatures are discovered from the database and provided to users without manual manipulation. The DDCSD algorithm uses the divide-and-conquer strategy to overcome the problem of processing large databases and compares multiple patterns in parallel to accelerate signature discovery. Therefore, the algorithm not only shortens the time needed for discovery but is also able to process large databases that could not be processed using IMUS and Zheng’s algorithm. In addition, by setting the threshold value of direct discovery, DDCSD can limit the memory required for discovery to the memory size of the computers in the cluster. In other words, the DDCSD algorithm can process any amount of data and is not limited by the amount of memory available. The DDCSD algorithm uses a basic divide-and-conquer strategy as its structure. First, it decides whether to do direct discovery based on the size of the database. If the database is too large to load in its entirety, it splits the database into two equal parts and recursively processes the parts. As the recursion proceeds, the amount of data in a single part gradually decreases until the part can be loaded entirely into the memory of one computer in the cluster. At the end, it combines the results found separately in the two parts to find the signatures in the original database. The DDCSD algorithm thus gives a formal recursive definition of the dataset distribution strategy, which is not provided by the previous approaches.
The DDCSD algorithm includes main and discovery routines. The main routine organizes discovery in a planned way. The discovery routine is used to find the unique patterns from a specified dataset in another dataset. The computation of discovery and collection in DDCSD is distributed onto discovery nodes for parallelization. Based on the experiments made on the human whole-genome EST database that has approximately 2.46G bases, the DDCSD algorithm proposed here can successfully process that database. Whereas previous algorithms could not process databases so large, the DDCSD algorithm took 1.89 hours to find all of the signatures under the discovery condition (30,2) on the cluster of ten discovery computers with 32 GB memory. The main contribution of this research is utilizing the divide-and-conquer strategy in signature discovery to process discovery in large databases, something previous algorithms were unable to do, and providing a parallel signature discovery algorithm on a cluster, that can process databases of any size regardless of the amount of memory available. This algorithm can be applied to NGS data analysis and other analysis of large databases.
Suppose that l and d represent the length and the number of allowed mismatches of signatures, respectively, and Λ is a dataset made up of l-patterns. We define signatures in Λ under a discovery condition (l,d) as patterns that exist in Λ and where there are no other (l,d)-similar patterns inside of Λ. The purpose of this research is to utilize a divide-and-conquer strategy to provide a parallel algorithm that can rapidly discover the signatures in datasets with massive amounts of data on a computer cluster.
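The definition above can be made concrete with a brute-force baseline. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and extracting l-patterns with a sliding window over each sequence is an assumption made only for illustration. It compares every pattern with every other pattern, which is exactly the quadratic, fully in-memory approach that the rest of the paper sets out to avoid.

```python
def hamming(p, q):
    """Number of positions at which two equal-length patterns differ."""
    if len(p) != len(q):
        raise ValueError("patterns must have the same length")
    return sum(a != b for a, b in zip(p, q))


def l_patterns(sequence, l):
    """All length-l substrings of a sequence (assumed sliding-window extraction)."""
    return [sequence[i:i + l] for i in range(len(sequence) - l + 1)]


def brute_force_signatures(patterns, d):
    """Patterns with no OTHER entry of the dataset within Hamming distance d.

    A pattern that occurs twice is not a signature, because its second
    occurrence is at distance 0 from it.
    """
    signatures = []
    for i, p in enumerate(patterns):
        if all(hamming(p, q) > d for j, q in enumerate(patterns) if j != i):
            signatures.append(p)
    return signatures
```

For example, with d = 1 the dataset AAAA, AAAT, GGGG has the single signature GGGG, because AAAA and AAAT are (4,1)-similar to each other.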
A pattern is considered unique in a subset Θ of Λ if no pattern (l,d)-similar to it can be found in Θ. According to this definition, if a pattern P is a signature in Λ, then P must be unique in every subset Θ of Λ. Therefore, if we divide Λ into two partitions of equal size (Λ i and Λ j), then P will be a signature in one of Λ i and Λ j and unique in the other. Thus, the combined signatures of Λ i and Λ j include all of the signatures of Λ, making them valid candidates when discovering signatures in Λ. Most importantly, this property holds no matter how many levels of recursion are applied, meaning that we can use the divide-and-conquer strategy on a computer cluster to overcome the inability of earlier signature discovery algorithms to process large databases. Using the above as the foundation, we designed a distributed divide-and-conquer-based signature discovery (DDCSD) algorithm that can rapidly discover the signatures that satisfy the discovery condition (l,d) in a large dataset on a computer cluster. The DDCSD algorithm includes main and discovery routines. The discovery routine accepts a candidate dataset and a source dataset, both made up of l-patterns, and finds the patterns in the candidate that are unique in the source. When the candidate and source are set to the same dataset, the patterns found by the discovery routine are the signatures of that dataset. Each of the computers in the cluster is called a node. The node that handles the main routine is called the main node, and those that handle the discovery routine are called discovery nodes.
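The recursion described above can be sketched in a few lines. This is a sequential, single-machine illustration under stated assumptions, not the authors' Java implementation: the names `unique_in` and `ddcsd_signatures` are hypothetical, the discovery routine is replaced by a plain pairwise Hamming comparison, and `threshold` plays the role of the threshold value of direct discovery.

```python
def hamming(p, q):
    """Number of positions at which two equal-length patterns differ."""
    return sum(a != b for a, b in zip(p, q))


def unique_in(candidates, source, d, same_dataset=False):
    """Stand-in for the discovery routine: candidates that are unique in source.

    When candidates and source are the same list, an entry is not compared
    with itself (same_dataset=True).
    """
    result = []
    for ci, p in enumerate(candidates):
        if all(hamming(p, q) > d
               for si, q in enumerate(source)
               if not (same_dataset and si == ci)):
            result.append(p)
    return result


def ddcsd_signatures(patterns, d, threshold):
    """Divide-and-conquer signature discovery over a list of l-patterns."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    # Solve: the dataset is small enough for direct discovery.
    if len(patterns) <= threshold:
        return unique_in(patterns, patterns, d, same_dataset=True)
    # Divide-and-recur: split into two halves and find each half's signatures.
    mid = len(patterns) // 2
    left, right = patterns[:mid], patterns[mid:]
    sig_left = ddcsd_signatures(left, d, threshold)
    sig_right = ddcsd_signatures(right, d, threshold)
    # Combine: a signature of one half survives only if it is also unique
    # in the other half.
    return unique_in(sig_left, right, d) + unique_in(sig_right, left, d)
```

The two recursive calls are independent of each other, as are the two cross checks in the combine step, which is what makes it natural to hand them to different discovery nodes.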
The symbols and their definitions in the DDCSD algorithm
l: The length of signatures
d: The number of allowed mismatches of signatures
N: The threshold value of direct discovery
Λ: The input dataset made up of l-patterns
Λ k: A partition of Λ, k = 1,2,…
Ω: The set of signatures in Λ
Ω k: The set of signatures in Λ k
The source in the discovery routine, corresponding to Λ or Λ k in the main routine
The candidate in the discovery routine, corresponding to Ω or Ω k in the main routine
An example of DDCSD
Processing time for the discovery processes shown in Table 2
Suppose that P and Q are two l-patterns. If P is divided into ⌈l/γ⌉ equal, non-overlapping γ-patterns, these γ-patterns are called the γ-segments of P. Pγ,i represents the i-th γ-segment in P. P is called (γ,i,δ)-matched to a γ-pattern Γ if Pγ,i is (γ,δ)-similar to Γ. We observe that if P and Q are (l,d)-similar, then for a given γ there will be at least one i such that P is (γ,i,⌊γ d/l⌋)-matched to Qγ,i. Using this observation as the foundation, we designed the discovery routine of DDCSD.
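The observation is a pigeonhole argument: at most d mismatches spread over ⌈l/γ⌉ segments leave at least one segment with no more than ⌊γ d/l⌋ mismatches. The sketch below shows how it can act as a cheap pre-filter before a full comparison. The function names are hypothetical, and the way the actual discovery routine indexes segments is not reproduced here.

```python
def hamming(p, q):
    """Number of positions at which two equal-length patterns differ."""
    return sum(a != b for a, b in zip(p, q))


def segments(pattern, gamma):
    """Non-overlapping gamma-segments of a pattern, in order."""
    return [pattern[i:i + gamma] for i in range(0, len(pattern), gamma)]


def shares_close_segment(p, q, d, gamma):
    """Necessary condition for p and q to be (l,d)-similar.

    Returns True if some aligned pair of gamma-segments differs in at most
    floor(gamma * d / l) positions. A False result proves that p and q are
    not (l,d)-similar; a True result still needs a full comparison.
    """
    delta = (gamma * d) // len(p)
    return any(hamming(a, b) <= delta
               for a, b in zip(segments(p, gamma), segments(q, gamma)))
```

With l = 8, d = 2 and γ = 4 the segment tolerance is 1. AAAAAAAA and AAATAAAT pass the filter (one mismatch in each segment), while AAAAAAAA and TTAATTAA are rejected without a full comparison because both segments differ in two positions.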
Results and discussion
The time complexity of the discovery routine used in the DDCSD algorithm is analyzed and the results are integrated, yielding the time complexity of the DDCSD algorithm. The memory consumption is also analyzed.
where and η = n η0.
The computational cost for data division and transmission, η|Λ|, is not too large in comparison with the computational cost for discovery, ζ|Λ|^2. The time complexity of using DDCSD to discover signatures from Λ is O(|Λ|^2).
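The quadratic bound can be checked in one line. This is a worked sketch that assumes the total cost T is simply the sum of the two terms named above and that ζ and η do not grow with |Λ|; neither assumption is stated explicitly in the surviving text.

```latex
T(|\Lambda|) \;=\; \zeta|\Lambda|^{2} + \eta|\Lambda|
\;\le\; (\zeta + \eta)\,|\Lambda|^{2}
\quad \text{for } |\Lambda| \ge 1,
\qquad \text{hence } T(|\Lambda|) = O\!\left(|\Lambda|^{2}\right).
```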
The discovery node handles the discovery routine in DDCSD. If there are too many patterns in the source and candidate datasets, so that they cannot be loaded into memory all at once, the discovery routine will split them into multiple parts and load and process the parts one at a time. In addition, the threshold value for direct discovery, N, is decided based on the memory space of discovery nodes so that the patterns in the datasets can be loaded into the memory. Thus, the number of patterns in each of the parts is on the order of N. According to the γ-segments included in the l-patterns in the parts, the l-patterns are assigned to (Υ,γ,i,l,d)-groups. In discovery nodes, the memory is mainly used to store the patterns in the (Υ,γ,i,l,d)-groups. Based on the above discussion about the total memory consumption in DDCSD, the memory consumption of each discovery node is τ|N|.
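The part-by-part processing described above can be illustrated as follows. This is a minimal sketch under stated assumptions: the names are hypothetical, list slices stand in for parts that a real discovery node would load from disk, and a plain Hamming comparison replaces the group-based matching. At any moment only one candidate part and one source part of at most `part_size` patterns are in use, which is why the memory of a node stays on the order of N.

```python
def hamming(p, q):
    """Number of positions at which two equal-length patterns differ."""
    return sum(a != b for a, b in zip(p, q))


def parts(items, part_size):
    """Yield consecutive parts of at most part_size items."""
    for start in range(0, len(items), part_size):
        yield items[start:start + part_size]


def discover_in_parts(candidates, source, d, part_size):
    """Candidates with no (l,d)-similar pattern in a separate source dataset.

    Candidates and source are processed one part at a time, so a part of
    each is all that has to be held in memory together.
    """
    survivors = []
    for cand_part in parts(candidates, part_size):
        alive = list(cand_part)
        for src_part in parts(source, part_size):
            # Drop every candidate that has a similar pattern in this part.
            alive = [p for p in alive
                     if all(hamming(p, q) > d for q in src_part)]
            if not alive:
                break  # nothing left to check in this candidate part
        survivors.extend(alive)
    return survivors
```

The result does not depend on `part_size`; a smaller value only trades memory for more passes over the source.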
The space complexity of using DDCSD to discover signatures in Λ is O(|Λ|). The space complexity of a discovery node is O(N).
The experimental platform was a cluster of eleven computers: one main node and ten discovery nodes. The main node was equipped with an Intel Core i7 CPU 870 at 2.93 GHz, 16 GB of memory and 1.5 TB of disk space. Each of the discovery nodes was equipped with an Intel Core i7 CPU 3770 K at 3.50 GHz, 32 GB of memory and 1 TB of disk space. The operating system was CentOS release 6.3, and the algorithm tested was coded in Java and compiled with JDK 1.6. In this experiment, we used the human whole-genome EST database with 2.46G bases to test the performance of the DDCSD algorithm. To keep such entries from affecting the tests, we deleted all remarks and all sequences shorter than 36 bases from the database and replaced every universal character in the sequences, for example ‘don’t care’, with an ‘A’.
When testing the DDCSD algorithm, each recursion only loads the beginning and ending position of the data partition and not the actual data. Only when the discovery needs to happen does it load the data completely into the memory in order to avoid taking up large amounts of memory. In the tests, each l-pattern is divided into 2 segments, with the γ value set to l/2.
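Carrying only partition boundaries through the recursion can be sketched as below. The function name and the half-open index convention are assumptions made for illustration. The recursion touches nothing but integer offsets, and pattern data would be loaded only for the ranges it finally returns.

```python
def leaf_ranges(start, end, threshold):
    """Halve the index range [start, end) until each piece is small enough.

    Only the beginning and ending positions are passed around; no pattern
    data is read while the partitions are being planned.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    if end - start <= threshold:
        return [(start, end)]
    mid = (start + end) // 2
    return (leaf_ranges(start, mid, threshold)
            + leaf_ranges(mid, end, threshold))
```

For ten patterns and a threshold of three, the call `leaf_ranges(0, 10, 3)` yields the ranges (0, 2), (2, 5), (5, 7) and (7, 10); each range would be loaded into memory only when its direct discovery starts.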
The discovery time for the DDCSD algorithm to discover signatures from the human whole-genome EST database under various discovery conditions (table body not recovered; surviving labels: l = 24, d = 2, d = 4)
The discovery time when various number of discovery nodes were used
In this research, we proposed a distributed divide-and-conquer-based signature discovery (DDCSD) algorithm. The DDCSD algorithm uses a divide-and-conquer strategy to overcome the problem of processing larger databases, thus solving the disadvantage of previous algorithms that could not process large databases. Also, a parallel computation mechanism on a computer cluster was used to accelerate the signature discovery. Therefore, this algorithm is not limited by the amount of memory available, and can rapidly find signatures in large databases, making it applicable to analysis of NGS and other large amounts of data.
The authors would like to thank the Ministry of Science and Technology of the Republic of China, Taiwan, for financially supporting this research under Grants [NSC102-2218-E-040-001, NSC102-2221-E-040-004 and MOST103-2218-E-040-001 to H.P. Lee] and [MOST103-2218-E-126-002 to T.F. Sheu]; and the anonymous reviewers for their constructive suggestions.
- Kaderali L, Schliep A: Selecting signature oligonucleotides to identify organisms using DNA arrays. Bioinformatics. 2002, 18 (10): 1340-1349. 10.1093/bioinformatics/18.10.1340.
- Francois P, Charbonnier Y, Jacquet J, Utinger D, Bento M, Lew D, Kresbach GM, Ehrat M, Schlegel W, Schrenzel J: Rapid bacterial identification using evanescent-waveguide oligonucleotide microarray classification. J Microbiol Methods. 2006, 65 (3): 390-403. 10.1016/j.mimet.2005.08.012.
- Kiryu BM, Kiryu CP: Rapid identification of Candida albicans and other human pathogenic yeasts by using oligonucleotides in a PCR. J Clin Microbiol. 1998, 73: 1634-1641.
- Li F, Stormo GD: Selection of optimal DNA oligos for gene expression arrays. Bioinformatics. 2001, 17: 1067-1076. 10.1093/bioinformatics/17.11.1067.
- Roten CA, Gamba P, Barblan JL, Karamata D: Comparative genometrics (CG): a database dedicated to biometric comparisons of whole genomes. Nucleic Acids Res. 2002, 30 (1): 142-144. 10.1093/nar/30.1.142.
- Hsiao W, Wan I, Jones SJ, Brinkman FS: IslandPath: aiding detection of genomic islands in prokaryotes. Bioinformatics. 2003, 19 (3): 418-420. 10.1093/bioinformatics/btg004.
- Amin HM, Hashem A-GM, Aziz RK: Bioinformatics determination of ETEC signature genes as potential targets for molecular diagnosis and reverse vaccinology. BMC Bioinformatics. 2009, 10: 7-10.1186/1471-2105-10-7.
- Duitama J, Kumar DM, Hemphill E, Khan M, Mandoiu II, Nelson CE: PrimerHunter: a primer design tool for PCR-based virus subtype identification. Nucleic Acids Res. 2009, 37: 2483-2492. 10.1093/nar/gkp073.
- Vijaya SR, Zavaljevski N, Kumar K, Reifman J: A high-throughput pipeline for designing microarray-based pathogen diagnostic assays. BMC Bioinformatics. 2008, 9: 185-10.1186/1471-2105-9-185.
- Tembe W, Zavaljevski N, Bode E, Chase C, Geyer J, Wasieloski L, Benson G, Reifman J: Oligonucleotide fingerprint identification for microarray-based pathogen diagnostic assays. Bioinformatics. 2007, 23 (1): 5-13. 10.1093/bioinformatics/btl549.
- Satya RV, Zavaljevski N, Kumar K, Bode E, Padilla S, Wasieloski L, Geyer J, Reifman J: In silico microarray probe design for diagnosis of multiple pathogens. BMC Genomics. 2008, 9: 496-10.1186/1471-2164-9-496.
- Phillippy AM, Mason JA, Ayanbule K, Sommer DD, Taviani E, Huq A, Colwell RR, Knight IT, Salzberg SL: Comprehensive DNA signature discovery and validation. PLoS Comput Biol. 2007, 3 (5): e98-10.1371/journal.pcbi.0030098.
- Phillippy AM, Ayanbule K, Edwards NJ, Salzberg SL: Insignia: a DNA signature search web server for diagnostic assay development. Nucleic Acids Res. 2009, 37 (2): 229-234.
- Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol. 2000, 132: 365-386.
- Satya RV, Kumar K, Zavaljevski N, Reifman J: A high-throughput pipeline for the design of real-time PCR signatures. BMC Bioinformatics. 2010, 11: 340-10.1186/1471-2105-11-340.
- Bader KC, Grothoff C, Meier H: Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets. Bioinformatics. 2011, 27: 1546-1554. 10.1093/bioinformatics/btr161.
- Zheng J, Close TJ, Jiang T, Lonardi S: Efficient selection of unique and popular oligos for large EST databases. Bioinformatics. 2004, 20: 2101-2112. 10.1093/bioinformatics/bth210.
- Lee HP, Sheu TF, Tsai YT, Shih CH, Tang CY: Efficient discovery of unique signatures on whole-genome EST databases. Proceeding of the 20th Annual ACM Symposium on Applied Computing (SAC2005). 2005, Santa Fe: Association for Computing Machinery, 100-104.
- Lee HP, Sheu TF, Tang CY: A parallel and incremental algorithm for efficient unique signature discovery on DNA databases. BMC Bioinformatics. 2010, 11: 132-10.1186/1471-2105-11-132.
- Eissler T, Hodges CP, Meier H: PTPan-overcoming memory limitations in oligonucleotide string matching for primer/probe design. Bioinformatics. 2011, 27: 2797-2805. 10.1093/bioinformatics/btr483.
- Marcais G, Kingsford C: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011, 27: 764-770. 10.1093/bioinformatics/btr011.
- Rizk G, Lavenier D, Chikhi R: DSK: k-mer counting with very low memory usage. Bioinformatics. 2013, 29 (5): 652-653. 10.1093/bioinformatics/btt020.
- Cormen TH, Leiserson CE, Rivest RL: Introduction to Algorithms. 2009, Cambridge: MIT Press.
- Grundy WN, Bailey TL, Elkan CP: ParaMEME: a parallel implementation and a web interface for a DNA and protein motif discovery tool. Bioinformatics. 1999, 12: 303-310.
- Ho ES, Jakubowski CD, Gunderson SI: iTriplet, a rule-based nucleic acid sequence motif finder. Algorithm Mol Biol. 2009, 29: 14.
- Green JR, Korenberg MJ, Aboul-Magd MO: PCI-SS: MISO dynamic nonlinear protein secondary structure prediction. BMC Bioinformatics. 2009, 10: 222-10.1186/1471-2105-10-222.
- Venkatesan A, Gopal J, Candavelou M, Gollapalli S, Karthikeyan K: Computational approach for protein structure prediction. Healthcare Inform Res. 2013, 19: 137-147. 10.4258/hir.2013.19.2.137.
- Chen Y, Wan A, Liu W: A fast parallel algorithm for finding the longest common sequence of multiple biosequences. BMC Bioinformatics. 2006, 7 (4): 4.
- Rognes T: PARALIGN: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Res. 2001, 29: 1647-1652. 10.1093/nar/29.7.1647.
- Ebedes J, Datta A: Multiple sequence alignment in parallel on a workstation cluster. Bioinformatics. 2004, 20 (7): 1193-1195. 10.1093/bioinformatics/bth055.
- Sun W, Al-Haj S, He J: Parallel computing in protein structure topology determination. Proceedings of 26th Army Science Conference. 2008, Orlando: Assistant Secretary of Army, cp8.
- Kurtz S, Narechania A, Stein JC, Ware D: A new method to compute k-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics. 2008, 9: 517-10.1186/1471-2164-9-517.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.