N-gram analysis of 970 microbial organisms reveals presence of biological language models
© Osmanbeyoglu and Ganapathiraju; licensee BioMed Central Ltd. 2011
Received: 18 May 2010
Accepted: 10 January 2011
Published: 10 January 2011
It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts such as "signature-style" word usage indicative of authors or topics, and that the algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of 'biological language modeling', statistical n-gram analysis has been applied for comparative analysis of whole proteome sequences of 44 organisms. It has been shown that a few particular amino acid n-grams are found in abundance in one organism but occurring very rarely in other organisms, thereby serving as genome signatures. At that time proteomes of only 44 organisms were available, thereby limiting the generalization of this hypothesis. Today nearly 1,000 genome sequences and corresponding translated sequences are available, making it feasible to test the existence of biological language models over the evolutionary tree.
We studied whole proteome sequences of 970 microbial organisms using n-gram frequencies and cross-perplexity, employing the Biological Language Modeling Toolkit and the Patternix Revelio toolkit. Genus-specific signatures were observed even in a simple unigram distribution. By taking the statistical n-gram model of one organism as reference and computing the cross-perplexity of all other microbial proteomes with it, cross-perplexity was found to be predictive of branch distance in the phylogenetic tree. For example, a 4-gram model from the proteome of Shigella flexneri 2a, which belongs to the Gammaproteobacteria class, showed a self-perplexity of 15.34, while the cross-perplexity of other organisms was in the range of 15.59 to 29.5 and was proportional to their branching distance in the evolutionary tree from S. flexneri. The organisms of this genus, which happen to be pathotypes of E. coli, also have the closest perplexity values to E. coli.
Whole proteome sequences of microbial organisms have been shown to contain particular n-gram sequences in abundance in one organism but occurring very rarely in other organisms, thereby serving as proteome signatures. Further it has also been shown that perplexity, a statistical measure of similarity of n-gram composition, can be used to predict evolutionary distance within a genus in the phylogenetic tree.
Microbes are the most diverse organisms on earth. Genomic and proteomic sequences of most major microbes are either already available or soon to be released; these sequences provide an almost overwhelming amount of information about the microbes and their genetic makeup. The first bacterial genome sequence was reported in 1995, and now more than 1,000 genome and proteome sequences of microbes, including plant, animal and human pathogens, are publicly available (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). With the rapidly increasing availability of whole genome and proteome sequences of microbes, large scale computational recognition and comparison of patterns in biological sequences could be a first step towards discovering and understanding the biology of microbes and their diversity. Understanding their diversity is important for progress in medicine, public health and agriculture, and possibly in exploring alternate energy sources. Currently, the widely accepted method for studying the phylogeny (diversity) of microbes is based on a comparison of genes that encode the small subunit ribosomal RNA (SSU rRNA). However, as more gene sequences become available, SSU rRNA based grouping has begun to produce results that conflict with those derived from alternative gene sets. The use of the whole genome/proteome is considered to provide more robust information for grouping organisms than the information provided by selected gene sets. However, comparison of whole genomes/proteomes may not be feasible for large sets of organisms using multiple sequence alignment (MSA) based methods, as only a small portion of genes is shared across all the organisms being compared. The main approaches for inferring whole-genome-based phylogeny of microbial organisms are: orthologous gene comparison, which requires correct selection of orthologous genes; protein sequence/structure domain comparison (e.g. as shown in [8, 9]), which requires the assignment of protein domains at the sequence/structure level; and whole genome/proteome sequence comparison, based either on pair-wise alignment or on alignment-free methods.
In their previous work, Ganapathiraju et al. suggested that genome or proteome sequences show characteristics typical of natural-language texts and that, drawing upon this analogy between biology and language, algorithms originally developed for natural language processing may be applied to study biological sequences: topic detection algorithms for secondary or transmembrane structure prediction, statistical n-grams for protein or proteome classification, etc.
N-grams are sequences of 'n' words in a running text. The different n-grams that occur in a document and the frequency of occurrence of each n-gram can be used to characterize the topic of the document or the style of its author. N-gram frequencies or more sophisticated statistical models of n-grams are widely used for text processing applications such as information retrieval, language identification, automatic text categorization and authorship attribution. In a biological context, n-grams can be sequences of amino acids or nucleotides. By employing this analogy between natural language texts and biological sequences, namely by applying 'biological language modeling', whole proteome sequences of microbial organisms have also been shown to contain n-gram genome-signatures.
First, Ganapathiraju et al. compared the n-gram frequencies of 44 different organisms using the simple Markovian unigram model (context independent amino acid model). For the proteins of Aeropyrum pernix, when the training and the test set were from the same organism, a perplexity of 16.6 was observed, whereas the perplexity for data from other organisms varied from 16.8 to 21.9. This showed that the differences between the 'sublanguages' of the different organisms were automatically detectable with even the simplest language model. They also demonstrated that a modified Zipf-like analysis could reveal specific differences in n-grams (proteome signatures) in different organisms. In other words, specific n-gram sequences were found in abundance in one organism but very rarely in other organisms, thereby serving as the proteome signature of that organism. Further, it was also proposed that a statistical model of n-grams (more specifically, perplexity) of proteome sequences varies from organism to organism. At the time the biological language modeling approach was proposed (2002), proteome sequences of only 44 organisms were available, thereby limiting the generalization of this hypothesis.
N-gram based methods have also been successfully applied in the biological domain. Karlin et al. introduced a "genomic signature" based on dinucleotide odds ratio (relative abundance) values, which appeared to reflect the species-specific properties of DNA modification, replication and repair mechanisms. Campbell et al. compared dinucleotide frequencies (genomic signatures) of prokaryote, plasmid, and mitochondrial DNA. They showed that plasmids and their hosts have substantially compatible nucleotide signatures. Mammalian mitochondrial genomes were very similar, and animal and fungal mitochondria were generally moderately similar, but they diverged significantly from plant and protist mitochondria sets. Passel et al. studied genome-specific relative frequencies of dinucleotides of 334 prokaryotic genome sequences. Intrageneric comparisons showed that in general the genomic dissimilarity scores were higher than in intraspecific comparisons. However, genera such as Bartonella spp., Bordetella spp., Salmonella spp. and Yersinia spp. had low average intrageneric genomic dissimilarity scores, and they suggested that members of these genera might be considered the same species. On the other hand, they observed high genomic dissimilarity values in intraspecific analyses for organisms such as Prochlorococcus marinus, Pseudomonas fluorescens, Buchnera aphidicola and Rhodopseudomonas palustris, and they suggested that different strains from the same species might actually represent different species. Recently, Pandit et al. identified the distinctive genomic signature associated with the DNA sequence organization in different HIV-1 subtypes.
Another early application is protein classification based on n-gram frequencies. Cheng et al. and Daeyaert et al. used the n-gram composition of amino acid sequences for protein classification [23, 24]. King et al. presented an n-gram-based Bayesian classifier that predicts the localization of a protein sequence. Recently, Maetschke et al. developed an alignment-free and visual approach to analyze sequence relationships of proteins. They used the number of shared n-grams between sequences as a measure of sequence similarity and rearranged the resulting affinity matrix by applying a spectral technique. They used heat maps of the affinity matrix to identify and visualize clusters of related sequences or outliers, and n-gram-based dot plots and conservation profiles to allow detailed analysis of similarities among selected sequences.
N-gram composition based approaches have also been applied to phylogenetic analysis. Stuart et al. used the singular value decomposition of a sparse 4-gram frequency matrix to represent the proteins of organisms uniquely and precisely as vectors in a high-dimensional space. They then used vectors of this kind to calculate pair-wise distance values based on the angle between the vectors, and generated phylogenetic trees of mitochondrial genomes based on the resulting distance values. Alternatively, Qi et al. developed a method to reconstruct phylogenetic trees based on n-gram frequencies from which a random background is subtracted, followed by application of the neighbor-joining method. Tomovic et al. also developed classification and unsupervised hierarchical clustering of genomes based on an n-gram profile similarity measure.
Diverse n-gram based methods for identification of compositionally different regions have been devised. For example, Mitic et al. reported genomic island determination via binary classification of islands based on n-gram frequency distribution [30, 31]. Rani et al. demonstrated n-gram based promoter prediction where n-grams are used to determine a special bias towards certain combinations of base pairs in the promoter sequences .
In language modelling, the most common metric for assessing an n-gram model is perplexity, which can be interpreted as the (geometric) average branching factor of the language according to the model. Perplexity is a function of both the language and the model. When considered a function of the model, it measures how good the model is (the better the model, the lower the perplexity). The higher the perplexity, the more branches need to be considered statistically. Perplexity has been used to test the performance of language models in a wide range of areas. Speech recognition tasks [33, 34], linguistic steganography detection and analysis of news coverage are some examples of its use. In biological sequence modelling, Buehler et al. used the perplexity metric as a measure of their success in showing that the use of "long distance" features can improve the maximum entropy based model of amino acid sequences.
In this study, we use Zipf-like analysis and the perplexity measure to study the diversity among proteome sequences of microbial organisms as first proposed by Ganapathiraju et al.  to address the question of whether or not the sequences in proteins of different organisms are statistically similar or whether organisms may be viewed to possess different languages. Today, with several ongoing genomics efforts, nearly 1,000 microbial genome sequences, and corresponding translated sequences are available, making it feasible to test the existence of biological language models over the evolutionary tree. Here, we extend the previous work  with 970 whole microbial proteome sequences and discuss how n-grams truly reveal proteomic signatures and demonstrate how the n-gram statistical language model could be indicative of evolutionary divergence at the genus level.
N-grams are sequences of n words. In a biological context, n-grams can be sequences of n amino acids or nucleotides. For instance, the sequence "AAANTSDSQKE" has two counts of the 2-gram AA, and one count each of the 2-grams AN, NT, TS, SD, DS, SQ, QK and KE. The formal definition of n-grams is given below:
Given a sequence of N words $S = s_1 s_2 \ldots s_N$ over the vocabulary A, and n a positive integer, an n-gram of the sequence S is any subsequence $s_i \ldots s_{i+n-1}$ of n consecutive words. There are $N-n+1$ such n-grams in S. For a vocabulary A with $|A|$ distinct words, there are $|A|^n$ possible unique n-grams.
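The definition above can be illustrated with a minimal Python sketch that counts overlapping n-grams in a sequence. The function name `count_ngrams` is a hypothetical helper introduced here for illustration; it is not part of BLMT or Patternix Revelio, which use suffix arrays rather than this naive sliding window.

```python
from collections import Counter


def count_ngrams(sequence: str, n: int) -> Counter:
    """Count every overlapping n-gram (window of n consecutive symbols)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sequence of length N yields N - n + 1 windows (none if N < n).
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


# The 11-residue example sequence from the text.
bigrams = count_ngrams("AAANTSDSQKE", 2)
print(bigrams["AA"])          # count of the 2-gram AA
print(sum(bigrams.values()))  # total number of 2-grams, N - n + 1
```

For whole proteomes, one would presumably count n-grams within each protein separately so that windows do not span protein boundaries; that choice is an assumption and is not stated in the text.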
Zipf's law is based on observations made by the linguist George Kingsley Zipf and states that the most frequent word in any kind of text is expected to be twice as frequent as the second most frequent word, etc. In this study, we used a modified Zipf-like analysis as employed by Ganapathiraju et al.  to explore the differences between n-gram usage in different organisms. First, amino acid n-grams of a given length are sorted in descending order by the frequency with which they occur in a reference organism of choice. In all the figures pertaining to this type of analysis, the frequencies of the reference organism are shown in bold line. For comparative analysis, the corresponding frequencies of these n-grams in all other organisms are shown in thin lines. For microbes that are associated with animal hosts, the lines are shown in red and those that are associated with plant hosts are shown in blue.
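The ranking step of this modified Zipf-like analysis can be sketched as follows. This is only an illustration under stated assumptions: the helper names `ngram_frequencies` and `zipf_like_table` and the toy sequences are hypothetical, and relative frequencies are used here although the original analysis may have plotted raw counts.

```python
from collections import Counter


def ngram_frequencies(sequence, n):
    """Relative frequency of each overlapping n-gram in a sequence."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def zipf_like_table(reference, others, n, top=10):
    """Rank n-grams by frequency in the reference, then look up the
    frequency of the same n-grams in every other organism."""
    ref_freq = ngram_frequencies(reference, n)
    # Descending frequency; ties broken alphabetically for reproducibility.
    ranked = sorted(ref_freq, key=lambda g: (-ref_freq[g], g))[:top]
    other_freqs = {name: ngram_frequencies(seq, n) for name, seq in others.items()}
    return [
        (gram, ref_freq[gram], {name: f.get(gram, 0.0) for name, f in other_freqs.items()})
        for gram in ranked
    ]


# Toy sequences standing in for whole proteomes (hypothetical data).
table = zipf_like_table("MKVLAAGIVGLLLAQ", {"organism_B": "MSTNPKPQRKTKRNT"}, n=1, top=5)
for gram, ref, rest in table:
    print(gram, round(ref, 3), rest)
```

An n-gram that ranks high in the reference row but has a near-zero frequency in the other organisms would be a candidate signature in the sense described above.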
In text processing, for a known corpus and its corresponding language model (for instance, a 4-gram model), how well the language model predicts a new text composed of unseen sentences can be estimated by computing its perplexity. The entropy of its words (H) determines the perplexity ($2^H$) of a text. We take the n-grams of the new text and compute the probability of generating each n-gram with respect to the n-gram distribution of the reference text. The lower the perplexity, the better the unseen text fits the known corpus. When applied to the amino acid sequences of whole proteomes of organisms, it can reveal how similar a new organism's sequence is to known organisms. This analysis can give us insight into the evolutionary relatedness of organisms. The formal definitions of perplexity and related terms are given below:
Let $p(x)$ be the probability mass function of a random variable $X$ over a discrete set of symbols (or alphabet) $\mathcal{X}$: $p(x) = P(X = x),\ x \in \mathcal{X}$
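For orientation, the standard textbook definitions of entropy, cross-entropy and perplexity for such a distribution are reproduced below. They are given as a reminder under the assumption that the paper follows the usual conventions (base-2 logarithms); the symbol $q$ for the model distribution is introduced here and the notation may differ from the original equations.

```latex
\begin{align}
H(X)   &= -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)
        && \text{(entropy of } X\text{)} \\
H(p,q) &= -\sum_{x \in \mathcal{X}} p(x) \log_2 q(x)
        && \text{(cross-entropy of model } q \text{ with respect to } p\text{)} \\
\mathrm{Perplexity} &= 2^{H}
        && \text{(with } H = H(X) \text{ or } H(p,q)\text{)}
\end{align}
```

For a uniform distribution over the 20 amino acids, $H(X) = \log_2 20$ and the perplexity is exactly 20, which is the upper bound for a unigram model of protein sequences.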
With respect to n-grams, perplexity is computed given the previous n-1 letters in a sequence and denotes how many different letters can occur, on average, in the nth position. For example, given any two consecutive letters in the sequence AACCTAACCTAACCTAACCTAACC..., the third letter is fully determined: only one of the 4 possible nucleotides can follow. In other words, the perplexity of guessing the 3rd letter given the two previous letters is only 1 (as opposed to 4 for a random sequence of nucleotides).
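The periodic-sequence example can be checked numerically. The sketch below estimates the conditional distribution of the nth letter from the sequence itself and evaluates the perplexity on that same sequence; `conditional_perplexity` is a hypothetical helper written for this illustration, not a function of the toolkits used in the study.

```python
import math
from collections import Counter


def conditional_perplexity(sequence, n):
    """Perplexity of the n-th letter given the previous n-1 letters,
    with probabilities estimated from (and evaluated on) the same sequence."""
    windows = range(len(sequence) - n + 1)
    ngrams = Counter(sequence[i:i + n] for i in windows)
    contexts = Counter(sequence[i:i + n - 1] for i in windows)
    total = sum(ngrams.values())
    # Entropy in bits: average of -log2 P(last letter | first n-1 letters).
    entropy = -sum(
        count / total * math.log2(count / contexts[gram[:-1]])
        for gram, count in ngrams.items()
    )
    return 2 ** entropy


periodic = "AACCT" * 8
p = conditional_perplexity(periodic, 3)
print(p)  # every 2-letter context has a single possible successor
```

In the repeated AACCT pattern, each pair of letters (AA, AC, CC, CT, TA) is always followed by the same letter, so every conditional probability is 1 and the entropy is 0 bits.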
In this study, perplexity is defined by frequencies of n-grams and n-1 grams computed as follows:
For each n-gram, denoted n-gram$_j$, its counts in the training and test set data are found and denoted $C_{train-n_j}$ and $C_{test-n_j}$, respectively.
The counts of the (n-1)-gram for n-gram$_j$ (i.e. the sequence of the first n-1 characters in n-gram$_j$) are also found and denoted $C_{train-(n-1)_j}$ and $C_{test-(n-1)_j}$,
where j represents the jth n-gram and N is the count of all the n-grams in the sequence.
Perplexity is computed as $2^E$.
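One plausible way to compute a cross-perplexity from these counts is sketched below. It is an assumption, not the paper's exact formula: the entropy is taken to be E = -(1/N) Σ_j C_test-nj · log2(C_train-nj / C_train-(n-1)j), i.e. a cross-entropy in which the conditional probability of each n-gram comes from the training (reference) counts and the weights come from the test counts. Add-one smoothing over a 20-letter amino acid alphabet is also an assumption, added so that n-grams unseen in the training data do not give a zero probability. The names `ngram_counts` and `cross_perplexity` and the toy sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def ngram_counts(sequence, n):
    """Counts of all overlapping n-grams in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def cross_perplexity(train, test, n, alphabet=AMINO_ACIDS):
    """Perplexity 2**E of `test` under an n-gram model estimated from `train`."""
    train_n = ngram_counts(train, n)
    # (n-1)-gram context counts: how often each prefix starts a training n-gram.
    train_ctx = Counter()
    for gram, count in train_n.items():
        train_ctx[gram[:-1]] += count
    test_n = ngram_counts(test, n)
    total = sum(test_n.values())  # N: number of n-grams in the test sequence
    if total == 0:
        raise ValueError("test sequence is shorter than n")
    vocab = len(alphabet)
    entropy = 0.0
    for gram, c_test in test_n.items():
        # Add-one smoothed conditional probability (an assumption, see lead-in).
        prob = (train_n[gram] + 1) / (train_ctx[gram[:-1]] + vocab)
        entropy -= c_test / total * math.log2(prob)
    return 2 ** entropy


# Toy sequences standing in for two proteomes (hypothetical data).
reference = "MKVLAAGIVGLLLAQ" * 20
other = "MSTNPKPQRKTKRNT" * 20
self_pp = cross_perplexity(reference, reference, 2)
cross_pp = cross_perplexity(reference, other, 2)
print(self_pp, cross_pp)
```

With this reading, the perplexity of a proteome under its own model (self-perplexity) is low, and a dissimilar proteome scores higher, which matches the qualitative behaviour reported for the 4-gram models in this study.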
Multinomial Logistic Regression
As seen from the above equation, one of the categories is used as the reference (baseline category). After estimating the coefficients of the model by maximum likelihood, the probability of each of the categories can be calculated. The final prediction is the category with the highest probability.
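The prediction step can be sketched in Python using the standard baseline-category form of multinomial logistic regression, in which the reference category has all coefficients fixed at zero. This is the textbook formulation and is assumed, not confirmed, to match the equation referred to above; the function names, category labels and coefficient values are hypothetical, and coefficient estimation by maximum likelihood is not shown.

```python
import math


def category_probabilities(features, coefficients, baseline="baseline"):
    """Probabilities of each category under a baseline-category logit model.

    coefficients maps each non-baseline category to (intercept, weights);
    the baseline category has a linear score fixed at 0.
    """
    scores = {baseline: 0.0}
    for category, (intercept, weights) in coefficients.items():
        scores[category] = intercept + sum(w * x for w, x in zip(weights, features))
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {category: math.exp(s - top) for category, s in scores.items()}
    norm = sum(exps.values())
    return {category: e / norm for category, e in exps.items()}


def predict(features, coefficients, baseline="baseline"):
    """Return the category with the highest probability."""
    probs = category_probabilities(features, coefficients, baseline)
    return max(probs, key=probs.get)


# Hypothetical coefficients for two non-baseline categories and one feature.
coefs = {"genus_1": (0.0, [2.0]), "genus_2": (0.0, [-1.0])}
print(category_probabilities([1.0], coefs))
print(predict([1.0], coefs))
```

Fixing the baseline score at zero removes the redundancy in the model: only K-1 sets of coefficients are needed for K categories, and the probabilities still sum to one.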
Suites of tools
Biological Language Modeling Toolkit (BLMT)  and Patternix Revelio (under review) are two suites of tools for proteome and genome sequence processing, developed by Ganapathiraju and others. The suites contain tools for computing n-gram frequencies and perplexity, and are designed to use data preprocessing in suffix arrays for efficient comparisons of large scale sequences. All of the computations presented here have been carried out with these two suites of tools.
Results and Discussion
Unigram signatures of whole proteomes
While there is a striking variation in the rank of certain n-grams in different organisms, n-grams that are rare in one organism are usually rare in all organisms. This was observed previously and explained by Poddar et al.'s analysis of unigram distributions of various proteomes, which showed that the amino acids coded by multiple codons occur more frequently than those coded by fewer codons. In the standard genetic code, even among those amino acids that are coded by only one codon, tryptophan (W) occurred less frequently than methionine (M). This could be linked to the fact that its codon (TGG), when changed at the third position, becomes a stop codon (TGA); such a change would be detrimental to the protein, and tryptophan is therefore usually not chosen by organisms during evolution. Similarly, among those amino acids that are coded by only two codons, cysteine (C) occurred less frequently. A change in the third position of the cysteine codons also leads to a stop codon. That tryptophan and cysteine are the least frequently occurring amino acids in all the proteomes of microorganisms implies that they are not incorporated in proteins unless they play a specific role. Our findings with a larger dataset further support the arguments of Poddar et al. described above.
Higher order n-gram analysis
Correlation coefficient of 4-gram frequencies across species.
Next, we grouped the microbes by their pathogenicity as animal-infecting or plant-infecting, and compared their n-gram distributions. However, we did not observe a significant difference between these two groups. In Figure 3, most of the pathogens infect animals, but some species of Burkholderia and Pseudomonas also infect plants. Plant pathogens that belong to these genera are shown with square markers. As seen in this figure, plant and animal pathogens within a particular genus do not show a large difference in their unigram distributions. This might be due to the fact that microbes share strategies for invading the host, whether plant or animal. Some examples of these strategies are: utilizing the type III protein secretion machinery to inject effectors into cells, having effectors that target defensive signal transduction pathways in host cells, or having a common targeting domain in their secreted proteins to enter host cells.
The average perplexity of generating a sequence based on the n-gram model of another sequence (cross-perplexity) will tell whether the two are similar to each other in terms of amino acid composition. The average perplexity of a test sequence is larger if the test sequence is dissimilar to the reference sequence. In this study, we investigate whether whole proteome cross-perplexity values are comparable among the same group of microbes. Perplexity models have been computed for many microbial proteomes and tested against all 970 microbial proteomes. Below is one example.
The ability to carry out large scale proteome analysis and cross-comparisons across proteomes leads to useful insights in biology, most prominent of them being evolutionary relations. Our analysis illustrates that unigram distribution of amino acids shows a fine resolution signature at the genus level (genus signature). We also demonstrated that genus level signatures are similar to each other within a given class. Biological language modeling for 970 microbial organisms illustrates significant preferences for particular combinations of amino acids thus strengthening the previous argument that different organisms use different vocabulary. An average cross-perplexity measure is shown to be proportional to evolutionary branch distance within a genus.
Further analysis of microbial genomes in comparison to the biological language models of their host organisms such as human, cow, mouse and plant may reveal further interesting observations.
MKG would like to thank her thesis advisors Dr. Judith Klein-Seetharaman and Dr. Raj Reddy for many discussions during her Ph.D regarding n-gram analysis and Biological Language Modelling. Authors acknowledge the contributions of Thahir Mohamed to the development of perplexity computation tools in Patternix Revelio, and Dr. Roger Day and Dr. George C. Tseng for discussions on statistical analyses. HUO wishes to thank Dr. Gregory Cooper and Dr. Wendy W. Chapman for helpful comments.
- Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al.: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. In Science. Volume 269. New York, NY; 1995:496–512. 10.1126/science.7542800
- Demain AL: Small bugs, big business: the economic power of the microbe. Biotechnology advances 2000, 18(6):499–514. 10.1016/S0734-9750(00)00049-5
- Demain AL: Biosolutions to the energy problem. Journal of industrial microbiology & biotechnology 2009, 36(3):319–332.
- Woese C, Fox G: Phylogenetic structure of the prokaryotic domain: The primary kingdoms. Proceedings of the National Academy of Sciences of the United States of America 1977, 74: 5088–5090. 10.1073/pnas.74.11.5088
- McInerney JO, Cotton JA, Pisani D: The prokaryotic tree of life: past, present... and future? Trends in ecology & evolution (Personal edition) 2008, 23(5):276–281.
- McFarlane DJ, Elhadad N, Kukafka R: Perplexity analysis of obesity news coverage. AMIA Annual Symposium proceedings/AMIA Symposium 2009, 2009: 426–430.
- Huson DH, Steel M: Phylogenetic trees based on gene content. In Bioinformatics. Volume 20. Oxford, England; 2004:2044–2049. 10.1093/bioinformatics/bth198
- Yang S, Doolittle RF, Bourne PE: Phylogeny determined by protein domain content. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(2):373–378. 10.1073/pnas.0408810102
- Fukami-Kobayashi K, Minezaki Y, Tateno Y, Nishikawa K: A tree of life based on protein domain organizations. Molecular biology and evolution 2007, 24(5):1181–1189. 10.1093/molbev/msm034
- Henz SR, Huson DH, Auch AF, Nieselt-Struwe K, Schuster SC: Whole-genome prokaryotic phylogeny. In Bioinformatics. Volume 21. Oxford, England; 2005:2329–2335. 10.1093/bioinformatics/bth324
- Pride DT, Meinersmann RJ, Wassenaar TM, Blaser MJ: Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome research 2003, 13(2):145–158. 10.1101/gr.335003
- Ganapathiraju M, Balakrishnan N, Reddy R, Klein-Seetharaman J: Computational Biology and Language. Lecture Notes in Artificial Intelligence, LNCS/LNAI 2004, 3345: 25–47.
- Heer TD: Experiments with syntactic traces in information retrieval. Inform Storage Retrieval 10 1974, 133–144. 10.1016/0020-0271(74)90015-1
- Schmitt JC: Trigram-based method of language identification. vol. U.S. Patent 5,062,143 1991.
- Cavnar WB, Trenkle JM: n-Gram-based text categorization. In Proceedings of the 1994 Symposium on Document Analysis and Information Retrieval 1994. University of Nevada, Las Vegas; 1994.
- Kešelj V, Peng F, Cercone N, Thomas C: n-Gram-based author profiles for authorship attribution. In Proceedings of the Conference Pacific Association for Computational Linguistics PACLING'03: 2003. Dalhousie University, Halifax, NS, Canada; 2003.
- Ganapathiraju M, Weisser D, Klein-Seetharaman J, Rosenfeld R, Carbonell J, Reddy R: Comparative n-gram analysis of whole-genome sequences. In HLT'02: Human Language Technologies Conference: 2002. San Diego; 2002.
- Karlin S, Burge C: Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 1995, 11(7):283–290. 10.1016/S0168-9525(00)89076-9
- Campbell A, Mrazek J, Karlin S: Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proceedings of the National Academy of Sciences of the United States of America 1999, 96(16):9184–9189. 10.1073/pnas.96.16.9184
- van Passel MW, Kuramae EE, Luyf AC, Bart A, Boekhout T: The reach of the genome signature in prokaryotes. BMC evolutionary biology 2006, 6: 84. 10.1186/1471-2148-6-84
- Pandit A, Sinha S: Using genomic signatures for HIV-1 sub-typing. BMC bioinformatics 11(Suppl 1):S26. 10.1186/1471-2105-11-S1-S26
- Solovyev VV, Makarova KS: A novel method of protein sequence classification based on oligopeptide frequency analysis and its application to search for functional sites and to domain localization. Comput Appl Biosci 1993, 9(1):17–24.
- Cheng BY, Carbonell JG, Klein-Seetharaman J: Protein classification based on text document classification techniques. Proteins 2005, 58(4):955–970. 10.1002/prot.20373
- Daeyaert F, Moereels H, Lewi PJ: Classification and identification of proteins by means of common and specific amino acid n-tuples in unaligned sequences. Computer methods and programs in biomedicine 1998, 56(3):221–233. 10.1016/S0169-2607(98)00031-5
- King BR, Guda C: ngLOC: an n-gram-based Bayesian method for estimating the subcellular proteomes of eukaryotes. Genome biology 2007, 8(5):R68. 10.1186/gb-2007-8-5-r68
- Maetschke SR, Kassahn KS, Dunn JA, Han SP, Curley EZ, Stacey KJ, Ragan MA: A visual framework for sequence analysis using n-grams and spectral rearrangement. In Bioinformatics. Volume 26. Oxford, England; 737–744. 10.1093/bioinformatics/btq042
- Stuart GW, Moffett K, Baker S: Integrated gene and species phylogenies from unaligned whole genome protein sequences. In Bioinformatics. Volume 18. Oxford, England; 2002:100–108. 10.1093/bioinformatics/18.1.100
- Qi J, Wang B, Hao BI: Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach. Journal of molecular evolution 2004, 58(1):1–11. 10.1007/s00239-003-2493-7
- Tomovic A, Janicic P, Keselj V: n-gram-based classification and unsupervised hierarchical clustering of genome sequences. Computer methods and programs in biomedicine 2006, 81(2):137–153. 10.1016/j.cmpb.2005.11.007
- Mitic NS, Pavlovic-Lazetic GM, Beljanski MV: Could n-gram analysis contribute to genomic island determination? Journal of biomedical informatics 2008, 41(6):936–943. 10.1016/j.jbi.2008.03.007
- Pavlovic-Lazetic GM, Mitic NS, Beljanski MV: n-Gram characterization of genomic islands in bacterial genomes. Computer methods and programs in biomedicine 2009, 93(3):241–256. 10.1016/j.cmpb.2008.10.014
- Rani TS, Bapi RS: Analysis of n-gram based promoter recognition methods and application to whole genome promoter prediction. In silico biology 2009, 9(1–2):S1–16.
- Bahl L, Baker J, Jelinek F, Mercer R: Perplexity - a measure of the difficulty of speech recognition tasks. Program of the 94th Meeting of the Acoustical Society of America J Acoust Soc Am: 1997 1997, 62: S63.
- Lee K: On large-vocabulary speaker-independent continuous speech recognition. Speech Communication 1988, 7(4):375–379. 10.1016/0167-6393(88)90053-2
- Meng P, Huang L, Chen Z, Yang W, Li D: Linguistic steganography detection based on perplexity. International Conference on MultiMedia and Information Technology: 2008 2008.
- Buehler E, Ungar L: Maximum entropy methods for biological sequence modeling. Workshop on Data Mining in Bioinformatics (BIOKDD 2001) 2001, 60–64.
- Tauritz D: Application of n-Grams. In Department of Computer Science. University of Missouri-Rolla; 2002.
- Manning CD, Schütze H: Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: MIT Press; 1999.
- Hosmer DW, Lemeshow S: Applied logistic regression. Wiley-Interscience Publication; 2000.
- Ganapathiraju M, Manoharan V, Klein-Seetharaman J: BLMT: statistical sequence analysis using N-grams. Applied bioinformatics 2004, 3(2–3):193–200. 10.2165/00822942-200403020-00013
- Poddar A, Chandra N, Ganapathiraju M, Sekar K, Klein-Seetharaman J, Reddy R, Balakrishnan N: Evolutionary insights from suffix array-based genome sequence analysis. Journal of biosciences 2007, 32(5):871–881. 10.1007/s12038-007-0087-z
- Engel P, Dehio C: Genomics of Host-Restricted Pathogens of the Genus Bartonella. Genome Dyn 2009, 6: 158–169.
- Rahme LG, Ausubel FM, Cao H, Drenkard E, Goumnerov BC, Lau GW, Mahajan-Miklos S, Plotnikova J, Tan MW, Tsongalis J, et al.: Plants and animals share functionally common bacterial virulence factors. Proceedings of the National Academy of Sciences of the United States of America 2000, 97(16):8815–8821. 10.1073/pnas.97.16.8815
- Hershberg R, Tang H, Petrov DA: Reduced selection leads to accelerated gene loss in Shigella. Genome biology 2007, 8(8):R164. 10.1186/gb-2007-8-8-r164
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.