- Open Access
A De-Novo Genome Analysis Pipeline (DeNoGAP) for large-scale comparative prokaryotic genomics studies
© The Author(s). 2016
- Received: 1 September 2015
- Accepted: 22 June 2016
- Published: 30 June 2016
Comparative analysis of whole genome sequence data from closely related prokaryotic species or strains is becoming an increasingly important and accessible approach for addressing both fundamental and applied biological questions. While a number of excellent tools have been developed for performing this task, most scale poorly when faced with hundreds of genome sequences, and many require extensive manual curation.
We have developed a de-novo genome analysis pipeline (DeNoGAP) for the automated, iterative and high-throughput analysis of data from comparative genomics projects involving hundreds of whole genome sequences. The pipeline is designed to perform reference-assisted and de novo gene prediction, homolog protein family assignment, ortholog prediction, functional annotation, and pan-genome analysis using a range of proven tools and databases. While most existing methods scale quadratically with the number of genomes since they rely on pairwise comparisons among predicted protein sequences, DeNoGAP scales linearly since the homology assignment is based on iteratively refined hidden Markov models. This iterative clustering strategy enables DeNoGAP to handle a very large number of genomes using minimal computational resources. Moreover, the modular structure of the pipeline permits easy updates as new analysis programs become available.
DeNoGAP integrates bioinformatics tools and databases for comparative analysis of a large number of genomes. The pipeline offers tools and algorithms for annotation and analysis of completed and draft genome sequences. The pipeline is developed using Perl, BioPerl and SQLite on Ubuntu Linux version 12.04 LTS. Currently, the software package is accompanied by a script for automated installation of the necessary external programs on Ubuntu Linux; however, the pipeline should also be compatible with other Linux and Unix systems once the necessary external programs are installed. DeNoGAP is freely available at https://sourceforge.net/projects/denogap/.
- Comparative genomics
- Gene prediction
- Gene annotation
- Ortholog identification
- Functional annotation
- Pan genome
- Core genome
- Flexible genome
Advances in next-generation sequencing technology have revolutionized the field of comparative genomics and enabled researchers to gain much greater resolution and insight into questions related to genome plasticity, molecular epidemiology, and evolution and diversity among closely related species and strains [1–5]. A wide range of powerful tools has been developed to help researchers perform whole genome comparisons; however, it is often difficult to automate these analyses [6–8]. The problem is exacerbated when dealing with draft genomes, since predictive and comparative analyses are often not designed to work with fragmented genes that arise due to sequencing or assembly errors. Consequently, it is usually prudent to use multiple methods that employ different underlying algorithms to minimize the occurrence of false positive or negative results due to algorithm bias or sequencing and assembly errors. While using multiple approaches enhances robustness, it also introduces another set of problems related to the integration of tools that more often than not rely on disparate data formats and structures.
Perhaps the biggest challenge faced during comparative genomic analysis is that most analysis approaches do not scale well when faced with hundreds of genomes. There is very high computational complexity associated with the management and analysis of large genomic datasets. The majority of comparative analytical approaches rely on pairwise sequence comparisons, which result in a quadratic relationship between the number of genomes analyzed and the computational time [11–13]. Such computational complexity is often a bottleneck for large-scale genome analysis projects. It is also becoming increasingly impractical to reanalyze an entire genome database every time new strains are added. As these databases expand to include thousands of strains, researchers will need the ability to iteratively add new genomes without reanalyzing the entire existing collection.
Given these challenges to large-scale comparative genomic analysis, we reasoned that a new approach might be needed that can reduce the complexity of automated prediction and annotation, streamline the analysis of large numbers of draft whole genome sequences, and permit iterative analysis. To achieve these goals, we developed the de-novo genome analysis pipeline (DeNoGAP), which integrates existing tools for prokaryotic gene prediction, homology prediction, and functional annotation for both intraspecific and interspecific genome comparison. Importantly, it employs an iterative clustering method to identify homologs and novel gene families using hidden Markov models. The iterative clustering process dramatically reduces the computational complexity of large-scale genome comparisons. DeNoGAP also creates SQLite databases to store analyzed genomic information and provides a graphical interface explorer for browsing and comparing the predicted information between multiple genomes. DeNoGAP provides a modular architecture that will allow researchers to perform large-scale comparative analysis, generate and test hypotheses, and create a well-annotated genome database for data analysis and exploration.
DeNoGAP is a command line tool built using the Perl scripting language for analysis of complete and draft prokaryotic genome sequences. The pipeline performs four primary analysis tasks: gene prediction, functional annotation, ortholog prediction, and pan-genome analysis. DeNoGAP works for both intraspecific (single species) and interspecific (multiple species) genome comparisons, although it was largely envisioned for the former.
List of software and databases incorporated in the DeNoGAP pipeline
Markov chain Clustering (MCL)
UniprotKB / SwissProt
DeNoGAP takes four input parameters from the command line: (1) a user-defined table of organism metadata (e.g. time and place of isolation, host, etc.); (2) the directory path where the SQLite database should be created; (3) the name of the SQLite database; and (4) a configuration file that defines options for processing input genomic data and performing analysis.
DeNoGAP can process genomic data in multiple formats, including GenBank files, fasta formatted genome sequences (chromosome, plasmid or contig), protein sequences, or coding gene sequences. DeNoGAP parses GenBank files and extracts gene coordinates, functional annotations, and sequence information for the genomes. If the input genomic data is in the form of a multi-fasta formatted genome sequence, DeNoGAP predicts gene coordinates and coding and protein sequences using the methods described in the “Genomic feature prediction” section.
DeNoGAP requires seeding with one or more reference genomes to identify the initial genomic features and sequences that form the basis for later comparative analyses and functional annotations. Although any genome sequence can act as a seed, we recommend using one or more fully closed and well-annotated genomes when possible since annotations carry forward through the analysis. Draft genomes can also be used as seeds when necessary. While these will likely have poorer quality gene predictions and annotations, this will not affect homolog clustering in later steps.
DeNoGAP stores the protein and coding sequences and genomic feature information for all genomes into the SQLite database prior to any downstream analysis. Additional genomes can be added to the analysis at any time. DeNoGAP appends new genomic data into the existing SQLite database and performs iterative comparison of new data with the existing information from previously analyzed genomes. The data is accessible via a basic graphical user interface (GUI).
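The append-only behaviour described above can be illustrated with a minimal Python sketch. The table and column names (`genome`, `protein`, `feature_id`) and the function `add_genome` are hypothetical and are not taken from the DeNoGAP schema; the sketch only shows how a new genome could be added to an existing SQLite database without touching previously stored genomes.

```python
import sqlite3


def add_genome(conn, genome_id, proteins):
    """Append one genome's proteins; return False if the genome is already stored.

    proteins: {feature_id: amino_acid_sequence}. Schema is hypothetical.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS genome (genome_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS protein ("
        "feature_id TEXT PRIMARY KEY, "
        "genome_id TEXT NOT NULL, "
        "sequence TEXT NOT NULL)"
    )
    known = conn.execute(
        "SELECT 1 FROM genome WHERE genome_id = ?", (genome_id,)
    ).fetchone()
    if known:
        return False  # previously analyzed genomes are left untouched
    with conn:  # commit on success, roll back on error
        conn.execute("INSERT INTO genome VALUES (?)", (genome_id,))
        conn.executemany(
            "INSERT INTO protein VALUES (?, ?, ?)",
            [(fid, genome_id, seq) for fid, seq in proteins.items()],
        )
    return True
```

Calling `add_genome` a second time with the same genome identifier is a no-op, which mirrors the idea that only new genomes trigger new comparisons.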
Genomic feature prediction
DeNoGAP predicts coding gene sequences from prokaryotic genome sequences using four gene prediction programs: Glimmer, GeneMark, FragGeneScan, and Prodigal [15–18]. Glimmer, GeneMark, and Prodigal use self-trained data to predict genes, while FragGeneScan uses sequencing error and codon usage models to predict genes in fragmented genome assemblies. The gene prediction results from all four programs are combined and parsed to identify reliable gene candidates. Predicted open reading frames (ORFs) are considered reliable if they are recovered by at least two programs, and the longest ORF is selected when the methods disagree. In some cases, gene prediction algorithms predict ORFs that overlap with one another over a few bases. To avoid predicting a large number of genes with overlapping and repeated sequences, DeNoGAP by default considers ORFs that overlap by more than 15 bases as a single ORF. The threshold value for the overlap region can be defined by the user in the configuration file.
Gene sequences predicted by only a single program may be the result of algorithm error or bias, and therefore require further verification before being included in the compiled set of reliable gene candidates. ORFs predicted by a single program are verified by BLAST against the UniProtKB/SwissProt database. Singleton ORFs (occurring in only one strain) are also verified by comparing the length of the sequence to the user-defined minimum gene length cut-off. We recommend that singleton ORFs be included in the set of reliable gene candidates only if they satisfy at least one of the two verification criteria. Nucleotide sequences of the predicted coding regions are translated into amino acid sequences using the transeq program from the EMBOSS software suite. The results from the gene prediction phase are stored in GenBank file format. All features are named according to the genome abbreviation and a feature identification number, which is a zero-padded sequential number unique to each feature (e.g. strain-code_00001).
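The consensus rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the DeNoGAP implementation: ORFs from different programs are treated as the same gene when they share strand and stop coordinate, and when two reliable ORFs overlap by more than the threshold the longer one is kept. The function names and the input layout are hypothetical.

```python
from collections import defaultdict


def consensus_orfs(predictions, min_support=2):
    """Keep ORFs recovered by at least `min_support` programs.

    predictions: {program: [(start, end, strand), ...]} with 1-based
    inclusive coordinates and start < end. ORFs sharing strand and stop
    coordinate are assumed to be the same gene (an assumption).
    """
    groups = defaultdict(dict)  # (stop, strand) -> {program: orf}
    for program, orfs in predictions.items():
        for start, end, strand in orfs:
            stop = end if strand == "+" else start
            best = groups[(stop, strand)].get(program)
            if best is None or (end - start) > (best[1] - best[0]):
                groups[(stop, strand)][program] = (start, end, strand)
    reliable = []
    for calls in groups.values():
        if len(calls) >= min_support:
            # When programs disagree on the start, keep the longest ORF.
            reliable.append(max(calls.values(), key=lambda o: o[1] - o[0]))
    return sorted(reliable)


def collapse_overlaps(orfs, max_overlap=15):
    """Treat ORFs overlapping by more than `max_overlap` bases as one ORF,
    keeping the longer of the two (an assumption about how they are merged)."""
    kept = []
    for orf in sorted(orfs):
        if kept:
            prev = kept[-1]
            overlap = min(prev[1], orf[1]) - max(prev[0], orf[0]) + 1
            if overlap > max_overlap:
                if (orf[1] - orf[0]) > (prev[1] - prev[0]):
                    kept[-1] = orf
                continue
        kept.append(orf)
    return kept
```

The default of 15 bases follows the text; in the pipeline this value is read from the configuration file.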
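As a small illustration of the two rules in this paragraph, the Python sketch below applies the "at least one of two criteria" check and generates zero-padded feature names. The function names, the boolean BLAST flag, and the five-digit padding width (inferred from the example strain-code_00001) are assumptions made for illustration.

```python
def passes_verification(has_swissprot_hit, orf_length, min_gene_length):
    """Accept a single-program ORF if it meets at least one criterion:
    a BLAST hit in UniProtKB/SwissProt, or length at or above the cut-off."""
    return bool(has_swissprot_hit) or orf_length >= min_gene_length


def feature_ids(genome_abbreviation, n_features, width=5):
    """Return sequential, zero-padded feature names for one genome."""
    return [
        f"{genome_abbreviation}_{i:0{width}d}"
        for i in range(1, n_features + 1)
    ]
```

For example, `feature_ids("PtoDC3000", 2)` yields names of the form `PtoDC3000_00001`.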
Prediction of homolog families and orthologs
Prediction of seed HMM model families
The parsed similarity information is subjected to the MCL algorithm, which clusters significantly similar protein sequences into protein families. Protein sequences with significant global alignments are grouped together into protein families. Singletons, partial sequences, and chimera-like protein sequences are clustered separately, with each forming a new protein family. We avoid grouping partial and chimera-like sequences with longer similar sequences at this point in the pipeline to prevent errors in the construction of the profile-HMM models. These sequences are reconnected later during clustering of profile-HMM models into homolog families.
Selection of diverse representative sequence and constructing HMM models
After clustering of protein sequences into globally similar protein families using MCL, an HMM profile is constructed to represent each family. Prior to construction of the HMM profile, each protein family is scanned to select diverse representative sequences. The group of diverse representative sequences from each model family is subjected to multiple sequence alignment using MUSCLE. Any sequences that are 100 % identical over their entire length are merged into one sequence for construction of the profile-HMM model. This step minimizes the effect of sampling bias in the construction of the HMM.
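The merging of fully identical sequences can be sketched in a few lines of Python. This covers only the 100 % identity step; how DeNoGAP selects diverse representatives beyond that is not specified here, and the function name and input layout are hypothetical.

```python
def merge_identical(sequences):
    """Collapse sequences that are identical over their entire length.

    sequences: {sequence_id: amino_acid_sequence}
    Returns {representative_id: [all ids sharing that exact sequence]},
    where the representative is the first id seen for each sequence.
    """
    ids_by_sequence = {}
    for seq_id, seq in sequences.items():
        ids_by_sequence.setdefault(seq.upper(), []).append(seq_id)
    return {ids[0]: ids for ids in ids_by_sequence.values()}
```

Only the representative of each group would then be passed to the aligner, so that a heavily sampled clone does not dominate the profile.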
The pipeline uses hmmalign when aligning new sequences to an existing HMM model. A profile-HMM model is constructed from the protein alignment of each model family using hmmbuild. All profile-HMM models are added to the profile-HMM database and formatted using hmmpress for sequence-profile comparisons. Singleton groups are also added to the singleton sequence database.
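A minimal Python sketch of the build-and-press step is given below, assuming the HMMER 3 programs `hmmbuild` and `hmmpress` are on the PATH. The function names, file layout, and the choice to name each model after its family identifier are assumptions for illustration; DeNoGAP itself drives these tools from Perl.

```python
import subprocess
from pathlib import Path


def build_commands(alignments, workdir):
    """Return one hmmbuild command per family.

    alignments: {family_id: path_to_alignment_file}
    Each model is written to <workdir>/<family_id>.hmm and named with -n.
    """
    return [
        ["hmmbuild", "-n", family_id,
         str(Path(workdir) / f"{family_id}.hmm"), str(alignment)]
        for family_id, alignment in sorted(alignments.items())
    ]


def build_database(alignments, workdir, db_path):
    """Build every profile, concatenate them, and index the database."""
    for command in build_commands(alignments, workdir):
        subprocess.run(command, check=True)
    with open(db_path, "w") as database:
        for family_id in sorted(alignments):
            database.write((Path(workdir) / f"{family_id}.hmm").read_text())
    # -f overwrites index files left over from a previous iteration.
    subprocess.run(["hmmpress", "-f", str(db_path)], check=True)
```

Rebuilding the pressed database after each batch of new models keeps it in step with the iterative addition of genomes.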
Iterative prediction of HMM model families in new genome
Clustering of HMM model families into homolog families
Because DeNoGAP is designed to construct HMM models from only globally similar protein sequences, truncated or chimera-like protein sequences form their own unique model families. As a result of this criterion, the number of predicted HMM model families is inflated and information about the relationships between these families may be lost. Therefore, after completion of the iterative prediction of HMM model families, DeNoGAP identifies links between model families where one or more members of one family share significant partial similarity with members of another model family. DeNoGAP does this by identifying pairs of related HMM families from the calculated similarity information such that at least one member of the shorter family shares a partial match with a member of the longer family. The HMM families are clustered using a single-linkage clustering approach implemented in customized R code within DeNoGAP. Model families linked with each other are clustered into a larger family, thereby reestablishing homolog relationships between truncated or chimeric sequences and their potential parent family.
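Single-linkage clustering of linked model families is equivalent to finding connected components in a graph whose edges are the partial-similarity links. The sketch below shows this with a union-find structure in Python; it is an illustration only (DeNoGAP uses R for this step), and the function name and inputs are hypothetical.

```python
def single_linkage(families, links):
    """Group model families connected by at least one partial-similarity link.

    families: iterable of family ids
    links: iterable of (family_a, family_b) pairs
    Returns a sorted list of sorted clusters.
    """
    families = list(families)
    parent = {family: family for family in families}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for family in families:
        clusters.setdefault(find(family), set()).add(family)
    return sorted(sorted(cluster) for cluster in clusters.values())
```

A single link is enough to join two families, so a fragment linked to its parent family and a second fragment linked to the first end up in the same homolog family.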
Prediction of ortholog and inparalog pairs
Orthologs are genes that descend from a common ancestor and arise due to speciation or diversification of that ancestor into independent species or strains. In contrast, paralogs are genes that are related through a duplication event, while inparalogs are paralogous loci that duplicated after a speciation event and are therefore found in the same species. One of the major goals of DeNoGAP is to break down homolog families into ortholog and paralog relationships. While there is no perfect way to accomplish this, we use pairwise smallest reciprocal amino acid distances from one or more outgroup genomes defined a priori by the user to predict orthologous relationships between pairs of protein sequences.
Choosing an appropriate outgroup genome is an important factor for reliable ortholog prediction. The selected outgroup genome(s) should be from a strain or species that is closely enough related to the target strains to have a high likelihood of sharing many homologous sequences, but divergent enough to minimize the likelihood of frequent recombination with these strains. While no rule will work in all cases, selecting distinct species from the same genus is usually a reasonable starting point. It is also possible to use the level of identity at the 16S rRNA locus, as distinct species are typically less than 97 % identical. A more thorough approach would require performing a phylogenetic analysis on a number of loci encoding housekeeping genes, such as is performed in multilocus sequence analysis.
Because orthology is not transitive, DeNoGAP clusters predicted ortholog and inparalog pairs into ortholog families using the MCL algorithm such that each protein sequence in the family shares significant sequence identity with at least one other protein in the family. As shown in Fig. 5, the MCL edge weight for each pair of ortholog and inparalog proteins is calculated by subtracting the pairwise amino acid distance from 1. Although a more sophisticated weighting scheme can be envisioned, this simple scheme for clustering protein sequences using amino acid distances generates results in good agreement with OrthoMCL (see the section on Validation of ortholog prediction below).
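The edge weighting can be written as w = 1 − d for a pair with amino acid distance d. The Python sketch below converts a table of pairwise distances into the three-column label format that MCL reads with its `--abc` option; the function name and the input dictionary are hypothetical, and distances are assumed to lie between 0 and 1.

```python
def mcl_abc_lines(pair_distances):
    """Convert pairwise amino acid distances into MCL 'abc' edge lines.

    pair_distances: {(protein_a, protein_b): distance}, distance in [0, 1]
    Each output line is 'protein_a<TAB>protein_b<TAB>weight' with
    weight = 1 - distance.
    """
    lines = []
    for (protein_a, protein_b), distance in sorted(pair_distances.items()):
        if not 0.0 <= distance <= 1.0:
            raise ValueError(f"distance out of range for {protein_a}, {protein_b}")
        weight = 1.0 - distance
        lines.append(f"{protein_a}\t{protein_b}\t{weight:.4f}")
    return lines
```

The resulting file could then be clustered with a command such as `mcl pairs.abc --abc -I 1.5 -o families.txt`, where the file names are placeholders.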
Identification of core and variable protein families
Studying gene gain and loss by examining the identity and distribution of core genes (i.e. those genes present in all strains) and variable genes (i.e. those “accessory” or “dispensable” genes that vary in their distribution among strains) can provide insights into strain evolution, plasticity and environmental adaptation [35, 36]. DeNoGAP generates a binary phylogenetic profile of presence and absence for protein families across all compared genomes based on predicted ortholog information. The phylogenetic profile is a binary matrix denoting the presence and absence of each locus across many genomes.
While the core genome is traditionally defined as those genes present in all strains within a defined group, the use of draft genomes can artifactually reduce the size of the core genome if a true core gene is disrupted due to an assembly issue. To compensate for this potential problem, DeNoGAP permits the user to define a minimum prevalence threshold (e.g. present in 95 % of strains) for the identification of core genes.
Once a core genome cutoff is defined, the multiple sequence alignment for each core gene is extracted from the alignment stored in the SQLite database. These alignments are then concatenated to create a core genome alignment, which can be used for the construction of a phylogenetic super-tree and downstream comparative analyses [29, 38, 39].
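The prevalence threshold can be illustrated with a short Python sketch that classifies families from a presence/absence profile. The function name, the profile layout, and the labels "flexible" and "unique" for non-core families are assumptions for illustration.

```python
def classify_families(profile, n_genomes, core_fraction=0.95):
    """Label each family as 'core', 'flexible' or 'unique'.

    profile: {family_id: set of genome ids in which the family is present}
    A family is core when it is present in at least `core_fraction`
    of the genomes; otherwise unique if found in a single genome.
    """
    labels = {}
    for family_id, genomes in profile.items():
        prevalence = len(genomes) / n_genomes
        if prevalence >= core_fraction:
            labels[family_id] = "core"
        elif len(genomes) == 1:
            labels[family_id] = "unique"
        else:
            labels[family_id] = "flexible"
    return labels
```

Setting `core_fraction=1.0` recovers the traditional strict definition, while a lower value tolerates genes missed in a few draft assemblies.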
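Concatenation of per-gene alignments can be sketched as below. The sketch assumes one aligned sequence per genome per core family and pads genomes that lack a family with gap characters, which is one plausible way to handle core genes defined with a prevalence threshold below 100 %; neither assumption is confirmed as DeNoGAP's behaviour, and the names are hypothetical.

```python
def concatenate_alignments(alignments, genomes):
    """Join per-family alignments into one core genome alignment.

    alignments: list of {genome_id: aligned_sequence}, one dict per family
    genomes: ordered list of genome ids to include
    Genomes missing from a family are filled with '-' of that family's width.
    """
    parts = {genome: [] for genome in genomes}
    for alignment in alignments:
        widths = {len(sequence) for sequence in alignment.values()}
        if len(widths) != 1:
            raise ValueError("aligned sequences must all have the same length")
        width = widths.pop()
        for genome in genomes:
            parts[genome].append(alignment.get(genome, "-" * width))
    return {genome: "".join(chunks) for genome, chunks in parts.items()}
```

Every genome ends up with a sequence of identical length, which is what tree-building programs expect as input.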
DeNoGAP performs functional annotation of protein families by assigning annotations to each protein sequence using InterProScan. The pipeline scans each protein sequence against ten different databases in the InterProScan standalone suite. The annotation resources in the InterProScan suite include InterPro, Pfam, SMART, TIGRFAM, ProDom, PANTHER, PIR, FingerPrintScan, Gene3D, HAMAP, MetaCyc, and KEGG database [41–52]. It also provides prediction of signal peptides and transmembrane domains for each protein sequence using SignalP, TMHMM, and Phobius [53–55]. InterProScan assigns protein sequences the Gene Ontology (GO) terms associated with their InterPro annotations.
Storing and querying analysis results
DeNoGAP uses three relational SQL databases for managing and post-processing the output from the different analysis phases. The databases are created using SQLite, which is an in-process library that implements a self-contained, server-less, zero-configuration, transactional SQL database engine. The architecture of the three SQLite databases created by DeNoGAP for storing results is shown in Additional file 2: Figure S2. The central database stores metadata for genomes, sequences, genomic features, functional annotations and sequence-profile similarities from the iterative addition of new genomes. The second database, with the prefix “HomologDB”, stores mapping information for each protein sequence and its respective HMM model and homolog family group predicted via the iterative clustering of full-length and partial homolog sequences. The third database, with the prefix “OrthologDB”, stores multiple alignments for homolog families, ortholog and inparalog pairs, sequence similarity between each pair of protein sequences in the homolog family, and phylogenetic profiles of presence and absence for ortholog families across compared genomes. The pipeline uses information stored in the database tables for iterative analysis of new genomes and updates the databases by adding newly analyzed information to the central database and creating new copies of the “HomologDB” and “OrthologDB” databases.
DeNoGAP also produces a script to create a searchable graphical user interface (GUI) table for genome information stored in the database. The GUI table allows the user to select groups of species for analyzing the pan-genome of selected species. It allows the user to compare presence and absence of ortholog protein families between selected groups of genomes and identify core, flexible or unique families present in different genomes. It also provides an option to fetch, display and edit annotation for each protein sequence from the database.
We tested DeNoGAP using a dataset consisting of 140 prokaryotic genomes, including 122 bacteria and 20 archaea strains (Additional file 3). This full dataset was used to evaluate the processing time of DeNoGAP versus OrthoMCL. Subsets of the full dataset were used to evaluate and demonstrate various components of DeNoGAP. For example, we selected five fully sequenced and manually annotated Pseudomonas genomes to evaluate the accuracy of the gene prediction module of DeNoGAP. We used 19 well-curated bacterial genomes that are listed as reference proteomes in the Quest for Orthologs database (questfororthologs.org) for benchmarking the ortholog prediction phase of DeNoGAP. Finally, we selected 32 genomes from the genus Pseudomonas, including 22 Pseudomonas syringae, two Pseudomonas aeruginosa, four Pseudomonas putida, three Pseudomonas fluorescens and one Pseudomonas entomophila, to illustrate results obtained from the entirety of the analysis pipeline. The 22 P. syringae strains were used as in-group strains, while the other Pseudomonads were used for outgroup comparisons. Pseudomonas syringae pv. tomato strain DC3000 was chosen as the seed reference genome for all datasets. All archaea strains were used as outgroup genomes for the full dataset.
Validation of gene prediction
Summary of gene prediction comparison and statistics
Combined (15 / 50) | Single (15 / 50) | Total (15 / 50)
5659 / 5716 | 695 / 721 | 6390 / 6437
5095 / 5135 | 662 / 673 | 5757 / 5808
5353 / 5416 | 634 / 661 | 5987 / 6077
4810 / 4927 | 3740 / 3837 | 8550 / 8764
4869 / 4953 | 4126 / 4230 | 8995 / 9183
Validation of ortholog prediction
To test the ortholog clustering accuracy of DeNoGAP relative to OrthoMCL, we compared ortholog clusters derived from 195,948 protein sequences from 32 genomes using a granularity parameter (I) of 1.5. DeNoGAP and OrthoMCL clustered the protein sequences into 19,914 and 14,377 groups, respectively. Of these, 8,703 groups were identical for both methods, representing 43.7 % of DeNoGAP groups and 60.5 % of OrthoMCL groups. We also found that 10,204 (70.9 %) of the OrthoMCL groups were a match or subset of DeNoGAP groups, while 18,796 (94.3 %) of the DeNoGAP groups were a match or subset of OrthoMCL groups. We believe that DeNoGAP generates a larger number of clusters than OrthoMCL because it is better able to separate highly similar inparalogs into different groups by accounting for gene loss in one or more genomes.
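The comparison reported above rests on two counts: groups that are identical between methods, and groups from one method that match or are contained in a group from the other. A minimal Python sketch of such a comparison is shown below; it assumes non-empty groups and that each protein belongs to at most one group per method, and the function name is hypothetical.

```python
def compare_clusterings(groups_a, groups_b):
    """Count groups of A that are identical to, or contained in, a group of B.

    groups_a, groups_b: iterables of non-empty collections of protein ids.
    Returns (n_identical, n_match_or_subset) from the point of view of A.
    """
    sets_a = [frozenset(group) for group in groups_a]
    sets_b = [frozenset(group) for group in groups_b]
    exact_b = set(sets_b)
    group_of = {protein: group for group in sets_b for protein in group}

    identical = sum(1 for group in sets_a if group in exact_b)
    contained = 0
    for group in sets_a:
        # Any member identifies the only B group that could contain A's group.
        candidate = group_of.get(next(iter(group)))
        if candidate is not None and group <= candidate:
            contained += 1
    return identical, contained
```

Running the function in both directions gives the asymmetric percentages of the kind quoted in the text, since a method that splits families more finely will have most of its groups contained in the other method's groups.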
Prediction of fragmented and chimeric protein families
The algorithm implemented in DeNoGAP for calculating similarity between query sequences and HMM models uses a high alignment coverage cut-off (>70 %) for iterative clustering of globally similar protein sequences. Due to this criterion, protein sequences that exhibit partial similarity with HMM models are clustered initially as new protein families. The analysis of 32 Pseudomonas genomes predicted 19,300 protein sequences that had partial similarity with at least one HMM protein family. Of these, 12,567 (65.1 %) displayed significant similarity (query coverage ≥ 70 %) with longer HMM models, suggesting fragmentation of the sequence, whereas 4,688 (24.2 %) showed similarity with HMM models shorter in length. We also found that 1,531 (7.9 %) of the protein sequences had significant similarity with both longer and shorter HMM models.
In addition to fragmented protein sequences, DeNoGAP also predicts evolutionarily divergent chimera-like protein sequences that are formed through the combination of portions of one or more protein sequences to produce new proteins. The pipeline predicted that 514 (2.8 %) of the protein sequences had N-terminal or C-terminal regions with significant similarity to another protein family.
To validate chimera prediction by DeNoGAP, we examined our results for six known chimeric proteins from P. syringae described in the literature. DeNoGAP correctly identified four of the six known chimeric proteins. Two of the identified chimeric proteins, HopK1 and HopD1, are type III secreted effector proteins present in P. syringae strain PtoDC3000. The pipeline identified partial similarity with the N-terminus of the type III effectors HopAQ1 and HopD2 (also known as HopAO1), respectively. The other two predicted chimeric proteins were the type III effector proteins HopBB1 and HopAE1 in the strain PavBPIC631, with N-terminal similarity to HopF2 and HopW1, respectively. These results suggest that DeNoGAP can be used efficiently for predicting novel chimeric proteins as well as families of known chimeric proteins in new genomes. However, the currently implemented method for chimera prediction also identifies proteins sharing common domains with multi-domain proteins; therefore, the pipeline can over-estimate the number of chimeric proteins in the genome. Consequently, we recommend that chimeric proteins undergo manual verification.
Clustering of HMM families into homolog families
Identification of core and variable protein families
DeNoGAP produces a concatenated MUSCLE-based multiple sequence alignment from all core protein families. The core genome alignment can be used as input to an external tree-building program for creating a core genome super-tree for inferring clonal phylogenetic relationships among strains [37, 38].
We functionally annotated each protein family predicted for the 32 validation-phase genomes by assigning InterPro annotations to the families using the annotation module in DeNoGAP. The analysis identified 11,364 (67 %) ortholog families and 6,423 (25 %) lineage-specific families with one or more InterPro annotations (Fig. 10c). The remaining families had no functional annotation. These results are consistent with the supposition that many lineage-specific families are assembly artifacts. The list of highly enriched InterPro annotations and their frequency in predicted ortholog families is given in Additional file 4.
Exploration and visualization of genomic data
DeNoGAP includes scripts for creating a local web-based database explorer that reads the three SQLite databases and builds a query platform for exploration and visualization of genomic information. The query platform allows users to select a subset of genomes from the database for comparison of core, flexible and unique protein families (Additional file 5: Figure S3). It provides users with an option to set thresholds for defining core protein families to account for genes missed due to assembly errors. It also permits annotation-specific searches. The program retrieves protein IDs and their associated annotation information based on the search query, and outputs the results in an HTML table. The user can further select individual feature IDs to visualize genomic information and annotations for each gene/protein sequence.
DeNoGAP provides a complete package integrating many bioinformatics tools for the analysis of large comparative genomic datasets. The pipeline offers tools and algorithms for the annotation and analysis of both complete and draft genome sequences, and performs analysis tasks including gene prediction, ortholog prediction, chimera prediction, functional annotation and pan-genome analysis. The modular design of the pipeline makes it relatively easy to add new analysis functionalities to the toolkit. One of the major goals while designing DeNoGAP was to provide an integrated and automated workflow for large-scale comparative genomics projects involving hundreds of sequenced genomes; therefore, we have focused on automating the execution of the necessary analysis modules, parsing and formatting the output from each analysis phase, and preparing input for the subsequent phase.
While the next-generation sequencing revolution has tremendously increased the number of available genomes for large-scale comparative genomics projects, the computational infrastructure needed for these analyses is often limited. We have designed DeNoGAP with the goal of making a sophisticated pipeline that can run on nearly any system with reasonable processing power, memory and disk space, and that easily scales to hundreds of genomes. DeNoGAP provides a streamlined workflow to rapidly analyze and annotate newly sequenced and assembled genomes in an iterative manner, and creates a new, or updates an existing, SQLite database. Finally, DeNoGAP provides a database exploration tool that allows researchers to parse and explore the analyzed information for the generation of new hypotheses.
Project name: De-Novo Genome Analysis Pipeline (DeNoGAP)
Project home page: https://sourceforge.net/projects/denogap/
Operating system: Unix, Linux (Ubuntu 12.04 LTS) or higher.
Programming Language: Perl
Other Requirements: Apache 2 or higher.
We acknowledge the valuable input of the Guttman and Desveaux labs in evaluating this work.
This work was supported by a grant from the Natural Sciences & Engineering Research Council of Canada to DSG and a Canada Research Chair to DSG. The funding bodies played no role in the design or execution of the study.
Conceived the pipeline: ST, DSG. Designed and tested the software: ST. Analyzed and interpreted data: ST. Drafted the manuscript: ST, DSG. All authors read and approved the final version of the manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Merhej V, Royer-Carenzi M, Pontarotti P, Raoult D. Massive comparative genomic analysis reveals convergent evolution of specialized bacteria. Biol Direct. 2009;4:13.View ArticlePubMedPubMed CentralGoogle Scholar
- Ilina E, Shitikov E, Ikryannikova L, Alekseev D, Kamashev D, Malakhova M, Parfenova T, Afanas’ev M, Ischenko D, Bazaleev N, Smirnova T, Larionova E, Chernousova L, Beletsky A, Mardanov A, Ravin N, Skryabin K, Govorun V. Comparative genomic analysis of Mycobacterium tuberculosis drug resistant strains from Russia. PLoS One. 2013;8:e56577.View ArticlePubMedPubMed CentralGoogle Scholar
- Read T, Joseph S, Didelot X, Liang B, Patel L, Dean D. Comparative analysis of Chlamydia psittaci genomes reveals the recent emergence of a pathogenic lineage with a broad host range. mBio. 2013;4(2):e00604-12.Google Scholar
- Green S, Studholme DJ, Laue BE, Dorati F, Lovell H, Arnold D, Cottrell JE, Bridgett S, Blaxter M, Huitema E. Comparative genome analysis provides insights into the evolution and adaptation of Pseudomonas syringae pv. aesculi on Aesculus hippocastanum. PLoS One. 2010;5:e10224.View ArticlePubMedPubMed CentralGoogle Scholar
- Tettelin H, Masignani V, Cieslewicz M, Donati C, Medini D, Ward N, Angiuoli S, Crabtree J, Jones A, Durkin A, DeBoy R, Davidsen T, Mora M, Scarselli M, Ros I, Peterson J, Hauser C, Sundaram J, Nelson W, Madupu R, Brinkac L, Dodson R, Rosovitz M, Sullivan S, Daugherty S, Haft D, Selengut J, Gwinn M, Zhou L, Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K, O’Connor K, Smith S, Utterback T, White O, Rubens C, Grandi G, Madoff L, Kasper D, Telford J, Wessels M, Rappuoli R, Fraser C. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial pan-genome. Proc Natl Acad Sci U S A. 2005;102:13950–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Chain P, Kurtz S, Ohlebusch E, Slezak T. An applications-focused review of comparative genomics tools: capabilities, limitations and future challenges. Brief Bioinform. 2003;4:105–23.View ArticlePubMedGoogle Scholar
- Teeling H, Glöckner FO. Current opportunities and challenges in microbial metagenome analysis--a bioinformatic perspective. Brief Bioinform. 2012;13:728–42.View ArticlePubMedPubMed CentralGoogle Scholar
- Ali A, Soares SC, Barbosa E, Santos AR. Microbial Comparative Genomics: An Overview of Tools and Insights Into The Genus Corynebacterium. J Bacteriol Parasitol. 2013;4:2.View ArticleGoogle Scholar
- Klassen JL, Currie CR. Gene fragmentation in bacterial draft genomes: extent, consequences and mitigation. BMC Genomics. 2012;13:14.View ArticlePubMedPubMed CentralGoogle Scholar
- Kislyuk AO, Katz LS, Agrawal S, Hagen MS, Conley AB, Jayaraman P, Nelakuditi V, Humphrey JC, Sammons SA, Govil D, Mair RD, Tatti KM, Tondella ML, Harcourt BH, Mayer LW, Jordan IK. A computational genomics pipeline for prokaryotic sequencing projects. Bioinformatics. 2010;26:1819–26.
- Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–89.
- Wall DP, Deluca T. Ortholog detection using the reciprocal smallest distance algorithm. Methods Mol Biol. 2007;396:95–110.
- Kuzniar A, Ham R, Pongor S, Leunissen J. The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 2008;24:539–51.
- Friedberg I. Automated protein function prediction--the genomic challenge. Brief Bioinform. 2006;7:225–42.
- Delcher AL, Harmon D, Kasif S, White O, Salzberg SL. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 1999;27:4636–41.
- Besemer J, Borodovsky M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 2005;33:W451–4.
- Hyatt D, Chen G-LL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119.
- Rho M, Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38:e191.
- Boutet E, Lieberherr D, Tognolli M, Schneider M, Bairoch A. UniProtKB/Swiss-Prot. Methods Mol Biol. 2007;406:89–112.
- Olson SA. Emboss opens up sequence analysis. Brief Bioinform. 2002;3:87–91.
- Deng X, Cheng J. Enhancing HMM-based protein profile-profile alignment with structural features and evolutionary coupling information. BMC Bioinformatics. 2014;15:252.
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
- Sharpton TJ, Jospin G, Wu D, Langille MG, Pollard KS, Eisen JA. Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource. BMC Bioinformatics. 2012;13:264.
- Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2012;9:173–5.
- Afrasiabi C, Samad B, Dineen D, Meacham C, Sjölander K. The PhyloFacts FAT-CAT web server: ortholog identification and function prediction using fast approximate tree classification. Nucleic Acids Res. 2013;41:W242–8.
- Szklarczyk R, Wanschers BF, Cuypers TD, Esseling JJ, Riemersma M, van den Brand MA, Gloerich J, Lasonder E, van den Heuvel LP, Nijtmans LG, Huynen MA. Iterative orthology prediction uncovers new mitochondrial proteins and identifies C12orf62 as the human ortholog of COX14, a protein involved in the assembly of cytochrome c oxidase. Genome Biol. 2012;13:R12.
- Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011;7:e1002195.
- Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30:1575–84.
- Edgar R. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–7.
- Koonin E. Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet. 2005;39:309–38.
- Glaeser SP, Kämpfer P. Multilocus sequence analysis (MLSA) in prokaryotic taxonomy. Syst Appl Microbiol. 2015;38:237–45.
- Lassmann T, Frings O, Sonnhammer ELL. Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res. 2009;37:858–65.
- Felsenstein J. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics. 1989;5:164–6.
- Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992;8:275–82.
- Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R. The microbial pan-genome. Curr Opin Genet Dev. 2005;15:589–94.
- Lapierre P, Gogarten J. Estimating the size of the bacterial pan-genome. Trends Genet. 2009;25:107–10.
- Pellegrini M, Marcotte EM. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A. 1999;96:4285–8.
- Price MN, Dehal PS, Arkin AP. FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5:e9490.
- Wolf YI, Rogozin IB, Grishin NV, Koonin EV. Genome trees and the tree of life. Trends Genet. 2002;18:472–9.
- Jones P, Binns D, Chang H-YY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, Pesseat S, Quinn AF, Sangrador-Vegas A, Scheremetjew M, Yong S-YY, Lopez R, Hunter S. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30:1236–40.
- Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, Sonnhammer ELL, Tate J, Punta M. Pfam: the protein families database. Nucleic Acids Res. 2014;42:D222–30.
- Lees JG, Lee D, Studer RA, Dawson NL, Sillitoe I, Das S, Yeats C, Dessailly BH, Rentzsch R, Orengo CA. Gene3D: Multi-domain annotations for protein sequence and comparative genome analysis. Nucleic Acids Res. 2014;42:D240–5.
- Schultz J, Copley RR, Doerks T, Ponting CP, Bork P. SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res. 2000;28:231–4.
- Corpet F, Servant F, Gouzy J, Kahn D. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 2000;28:267–9.
- Scordis P, Flower DR, Attwood TK. FingerPRINTScan: intelligent searching of the PRINTS motif database. Bioinformatics. 1999;15:799–806.
- Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A. The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res. 2005;33:D284–8.
- Pedruzzi I, Rivoire C, Auchincloss AH, Coudert E, Keller G, de Castro E, Baratin D, Cuche BA, Bougueleret L, Poux S, Redaschi N, Xenarios I, Bridge A. HAMAP in 2015: updates to the protein family classification and annotation system. Nucleic Acids Res. 2014;43:D1064–70.
- Wu CH, Yeh L, Huang H, Arminski L. The protein information resource. Nucleic Acids Res. 2003;31:345–7.
- Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Res. 2003;31:371–3.
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Biswas M, Bradley P, Bork P, Bucher P, Copley R, Courcelle E, Durbin R, Falquet L, Fleischmann W, Gouzy J, Griffith-Jones S, Haft D, Hermjakob H, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lopez R, Letunic I, Orchard S, Pagni M, Peyruc D, Ponting CP, Servant F, Sigrist CJ. InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief Bioinform. 2002;3:225–35.
- Caspi R, Altman T, Billington R, Dreher K, Foerster H, Fulcher CA, Holland TA, Keseler IM, Kothari A, Kubo A, Krummenacker M, Latendresse M, Mueller LA, Ong Q, Paley S, Subhraveti P, Weaver DS, Weerasinghe D, Zhang P, Karp PD. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res. 2014;42:D459–71.
- Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.
- Bendtsen J, Nielsen H, Heijne G, Brunak S. Improved prediction of signal peptides: SignalP 3.0. J Mol Biol. 2004;340:783–95.
- Sonnhammer E, von Heijne G, Krogh A. A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Intl Conf Intell Syst Mol Biol. 1998;6:175–82.
- Käll L, Krogh A, Sonnhammer ELL. A combined transmembrane topology and signal peptide prediction method. J Mol Biol. 2004;338:1027–36.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25–9.
- Dessimoz C, Gabaldón T, Roos DS, et al. Toward community standards in the quest for orthologs. Bioinformatics. 2012;28(6):900–4.
- Buell CR, Joardar V, Lindeberg M, Selengut J, Paulsen IT, Gwinn ML, Dodson RJ, Deboy RT, Durkin AS, Kolonay JF, Madupu R, Daugherty S, Brinkac L, Beanan MJ, Haft DH, Nelson WC, Davidsen T, Zafar N, Zhou L, Liu J, Yuan Q, Khouri H, Fedorova N, Tran B, Russell D, Berry K, Utterback T, Van Aken SE, Feldblyum TV, D’Ascenzo M, Deng W-LL, Ramos AR, Alfano JR, Cartinhour S, Chatterjee AK, Delaney TP, Lazarowitz SG, Martin GB, Schneider DJ, Tang X, Bender CL, White O, Fraser CM, Collmer A. The complete genome sequence of the Arabidopsis and tomato pathogen Pseudomonas syringae pv. tomato DC3000. Proc Natl Acad Sci U S A. 2003;100:10181–6.
- Feil H, Feil WS, Chain P, Larimer F, DiBartolo G, Copeland A, Lykidis A, Trong S, Nolan M, Goltsman E, Thiel J, Malfatti S, Loper JE, Lapidus A, Detter JC, Land M, Richardson PM, Kyrpides NC, Ivanova N, Lindow SE. Comparison of the complete genome sequences of Pseudomonas syringae pv. syringae B728a and pv. tomato DC3000. Proc Natl Acad Sci U S A. 2005;102:11064–9.
- Joardar V, Lindeberg M, Jackson RW, Selengut J, Dodson R, Brinkac LM, Daugherty SC, Deboy R, Durkin AS, Giglio MG, Madupu R, Nelson WC, Rosovitz MJ, Sullivan S, Crabtree J, Creasy T, Davidsen T, Haft DH, Zafar N, Zhou L, Halpin R, Holley T, Khouri H, Feldblyum T, White O, Fraser CM, Chatterjee AK, Cartinhour S, Schneider DJ, Mansfield J, Collmer A, Buell CR. Whole-genome sequence analysis of Pseudomonas syringae pv. phaseolicola 1448A reveals divergence among pathovars in genes involved in virulence and transposition. J Bacteriol. 2005;187:6488–98.
- Stover CK, Pham XQ, Erwin AL, Mizoguchi SD, Warrener P, Hickey MJ, Brinkman FS, Hufnagle WO, Kowalik DJ, Lagrou M, Garber RL, Goltry L, Tolentino E, Westbrock-Wadman S, Yuan Y, Brody LL, Coulter SN, Folger KR, Kas A, Larbig K, Lim R, Smith K, Spencer D, Wong GK, Wu Z, Paulsen IT, Reizer J, Saier MH, Hancock RE, Lory S, Olson MV. Complete genome sequence of Pseudomonas aeruginosa PAO1, an opportunistic pathogen. Nature. 2000;406:959–64.
- Silby MW, Cerdeño-Tárraga AM, Vernikos GS, Giddens SR, Jackson RW, Preston GM, Zhang X-XX, Moon CD, Gehrig SM, Godfrey SA, Knight CG, Malone JG, Robinson Z, Spiers AJ, Harris S, Challis GL, Yaxley AM, Harris D, Seeger K, Murphy L, Rutter S, Squares R, Quail MA, Saunders E, Mavromatis K, Brettin TS, Bentley SD, Hothersall J, Stephens E, Thomas CM, Parkhill J, Levy SB, Rainey PB, Thomson NR. Genomic and genetic analyses of diversity and plant interactions of Pseudomonas fluorescens. Genome Biol. 2009;10:R51.
- Sonnhammer ELL, Gabaldón T, da Silva AW S, Martin M, Robinson-Rechavi M, Boeckmann B, Thomas PD, Dessimoz C. Big data and other challenges in the quest for orthologs. Bioinformatics. 2014;30:2993–8.
- Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009;19:327–35.
- Stavrinides J, Ma W, Guttman DS. Terminal reassortment drives the quantum evolution of type III effectors in bacterial pathogens. PLoS Pathog. 2006;2:e104.
- O’Brien HE, Thakur S, Gong Y, Fung P, Zhang J, Yuan L, Wang PW, Yong C, Scortichini M, Guttman DS. Extensive remodeling of the Pseudomonas syringae pv. avellanae type III secretome associated with two independent host shifts onto hazelnut. BMC Microbiol. 2012;12:141.
- Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, Hahn MW. Extensive error in the number of genes inferred from draft genome assemblies. PLoS Comput Biol. 2014;10:e1003998.
- Baltrus DA, Nishimura MT, Romanchuk A, Chang JH, Mukhtar MS, Cherkis K, Roach J, Grant SR, Jones CD, Dangl JL. Dynamic evolution of pathogenicity revealed by sequencing and comparative genomics of 19 Pseudomonas syringae isolates. PLoS Pathog. 2011;7:e1002132.