ACID: annotation of cassette and integron data
© Joss et al; licensee BioMed Central Ltd. 2009
Received: 31 December 2008
Accepted: 21 April 2009
Published: 21 April 2009
Although integrons and their associated gene cassettes are present in ~10% of bacteria and can represent up to 3% of the genome in which they are found, very few have been properly identified and annotated in public databases. These genetic elements have been overlooked in comparison to other vectors that facilitate lateral gene transfer between microorganisms.
By automating the identification of integron integrase genes and of the non-coding cassette-associated attC recombination sites, we were able to assemble a database containing all publicly available sequence information regarding these genetic elements. Specialists manually curated the database and this information was used to improve the automated detection and annotation of integrons and their encoded gene cassettes. ACID (annotation of cassette and integron data) can be searched using a range of queries and the data can be downloaded in a number of formats. Users can readily annotate their own data and integrate it into ACID using the tools provided.
ACID is a community resource providing easy access to annotations of integrons and making tools available to detect them in novel sequence data. ACID also hosts a forum to prompt integron-related discussion, which can hopefully lead to a more universal definition of this genetic element.
Initially, integrons were thought of as specialized elements mostly involved in the accumulation of gene cassettes encoding antibiotic resistance determinants in pathogenic bacteria . The advent of genomics and the availability of numerous genome sequences from environmental bacteria made it clear that the integron is a more ancient and widespread gene capture system . Despite the fact that this genetic element is found in about 10% of all sequenced genomes and that cassette arrays can be as large as 150 kb , few integrons have been properly identified and annotated as such. Even when the integron integrase gene is annotated due to its sequence similarity to characterized homologs, the gene cassettes associated with it are labeled as simple open reading frames (ORFs). This is because attC sites, the most distinctive feature of gene cassettes, are non-coding regions and therefore, not recognized by standard automated genome annotation pipelines.
Integrons are a flexible and fast-evolving part of microbial genomes, and their associated cassette arrays represent a unique and segregated gene pool (the genes they carry are rarely found outside these genetic elements). However, because integrons (other than the specialized variants carrying antibiotic resistance genes) rarely have a detectable phenotype under laboratory conditions, they are seldom emphasized in genomic studies. Their proper identification and annotation could help us understand their role in microbial adaptation.
We have created ACID, which stands for Annotation of Cassette and Integron Data, as a resource for the biology community. ACID contains all integrons and gene cassette sequences available in public databases manually curated and accurately annotated. Users can freely access and download these data and/or automatically annotate and submit their own sequences. Tools for the visualization and comparison of cassette arrays are also made available.
Construction and Content
MySQL 5.0 is used as the database engine and a PHP 5.2 front end using the YUI set of AJAX tools is used for display and manipulation of the data [11–13]. The database forum uses phpBB . Design of the database relies on two major tables: 'Sequence' and 'Interest'. 'Sequence' contains data pertaining directly to the acquisition of the original DNA sequence. 'Interest', on the other hand, contains details related to the annotation of that sequence. These tables form a one-to-many relationship.
Definition of integron/gene cassette system components
The mission of ACID is to provide a consistent annotation framework for the description of the integron/gene cassette system by providing researchers with a platform to exchange their ideas as they pertain to this genetic element. For the initial construction of the database; we have developed basic definitions for the different elements of integrons: the integron-encoded integrase (intI), recombination sites attI and attC, gene cassettes, gene cassette open reading frames (CassORFs) and more generally, ORFs. The database however, has been built so it can easily be updated if scientific evidence suggests that modification to the definition of various integron components is required.
The intI is the sequence region that encodes or may have encoded (in the case of degenerate intIs) the integron integrase gene. Integron integrases, part of the tyrosine recombinase family, are different from other members of that protein family insofar as, they encode an additional functional domain required for integron gene cassette site-specific recombination [15–17]. In this initiative, integron integrases are identified by their sequence similarity to the previously characterized homologs in ACID. This comparison is made using BLASTX .
The attI region has only been defined functionally in a limited number of integron classes . It is most commonly located upstream from the intI gene and ends with the 7 bp attI core site (GTTRRRY) at the start of the first inserted cassette (R in Figure 1) [19–21]. Due to the lack of experimental data regarding the precise boundary of the attI site at the end closest to intI (in most cases), we define this feature as the region starting immediately upstream from the start/stop codon of the integrase gene (depending on its orientation with respect to the integrated gene cassettes) and ending with the first detectable 7 bp core site (R).
For the purpose of this database, the attC site comprises inner repeats L' and L", the variable region between these repeats, and the flanking R' and R" repeats (Figure 1). attCs are defined here in accordance with experimental data that has revealed the site of cassette integration . The unit of transfer, the gene cassette, comprises one attC and in most instances a single ORF. Specifically, we define a single gene cassette as the DNA-sequence region from the first 'T' of the R' repeat-region within a given attC, up to and including the first 'G' of the next, downstream R'. Therefore, the attC site that "belongs" to a given cassette is the sequence of the element as it appears in the circular form (Figure 1). We are using the L'/L" and R'/R" nomenclature for naming the attC imperfect repeats corresponding to the IntI-binding domains instead of the 1L/2L and 1R/2R system. This is to be consistent with the terminology used in the recent description of the gene cassette insertion/excision mechanism by the integron integrase [3, 23].
CassORFs are defined as any ORF located in the intervening region between an attI site and an adjacent attC site or, two adjacent attC sites. Some allowance for overlap with these enclosing attI and attC sites is made with at most 15 bp of overlap of the ORF and an attC/attI site being the upper limit.
Integron/gene cassette system annotation procedure
ACID's automated annotation tool first identifies the intI gene and then analyzes the neighboring sequence for the presence of attC like motifs. Two strategies (described in the algorithm section) are used to detect attC sites and the site must match the user-adjustable cut-off restrictions for both methods (described below) to be annotated as such. Given our definition of gene cassettes (above), the latter are automatically annotated based on the identified attCs. By default, the largest ORF located within each cassette, if it encodes a predicted product greater than seventy amino acids, is annotated as a CassORF.
We have also provided the user with the option of searching for ORFs in any given DNA sequence regardless of its nature. Therefore, such a search would retrieve all ORFs associated with a given integron sequence -IntIs, CassORFs and ORFs flanking the integron.
The results obtained at each of these stages can be inspected and edited by the user if desired. This is especially useful if the sequence proves to be more complex than usual (e.g. multiple intI genes are present in the sequence, extremely degenerate attC regions are known to exist or large insertion regions are located within the cassette array).
Data sources for ACID
The database has been compiled from a number of sources. The core of these data is manually annotated sequences identified as containing integrons according to the scientific literature . These sequences were used to compile a list of known intI genes. The amino acid sequences of these IntI proteins were then used to perform a similarity search against the NCBI's non-redundant nucleotide sequence database with TBLASTN . Any records that had a high degree of similarity to the previously identified intI genes were then added and automatically annotated. Unpublished sequence data from ongoing studies (genomic and metagenomic) was manually entered to complement the Genbank records.
Compilation of statistics on ACID content
Various statistics were compiled on the database content. The diversity of gene cassettes found in the database was evaluated using the EstimateS 8.0 statistics software package . For the analyses, we arbitrarily defined cassette-types as any two or more cassettes that have at least a 150 nucleotide overlap with a specified DNA identity. Cassette nucleotide identities of 70, 80, 90, 95, 97 and 99 percent were considered in the analyses.
Gene cassettes were functionally annotated with the Meta Genome Rapid Annotation using Subsystem Technology (MG-RAST) server . This website provides a SEED-based environment that allows users to upload metagenomes for automated analyses . In MG-RAST, novel sequences are screened for potential protein encoding genes via a BLASTX search against the SEED comprehensive non-redundant database (sourced from all major public databases). This similarity search allows the association of a novel sequence with a SEED subsystem, which represents a specific metabolic function.
Algorithms for the identification of attC sites
This section provides an overview of the general methodology used in the two approaches implemented within the database used for the identification of attC sites. These two methods are, String search and Feedback loop.
String search identification of attC sites
This method, however, statistically produces a large number of false positives, a 3 bp motif like AAC appears once every 64 bases (P(N) = 0.25 and P(N)3 = 0.15625 or, 1/64). To filter out false positives from the group of sequences that encoded the inverse core/core sites, we scored them by assessing other features commonly found in attC sites . The total score is composed of several individual categories that reflect separate structural components typical of attC sites. These structural components play a role in creating the single (bottom) strand folded conformation required for integrase-mediated recombination (Figure 2). Specifically, we assign 1 point per match between flanking repeats (core/inverse core sites) of the attC site (R' and R"). The total possible score of this property referred to as R, is 4 (Figure 2B). In addition, we score 1 point per match between the inner repeats of the attC site (L' and L"). We made the decision to weight these base-pairing events to make them more meaningful. Specifically, we considered the potential secondary structure of this simple site as two 3 bp regions. These properties, referred to as LA and LB, when multiplied, produce a maximum score of 9 (Figure 2C). These are multiplied so that pairing is required in each of these two regions to produce a score. For example, LA = 3 and LB = 0 would produce a score of 0 (3 × 0). However, if LA and LB each had 2 base-pairing events, the score would be 4.
In addition, we score 1 point for the presence of the typical extra-helical base, defined here as E and, and 1 pt for each mismatch in the 2 bp adjacent to R' referred to as site S (Figure 2D and 2E, respectively). Final scores are divided by the maximum possible total score and are reported as percentage values. We defined this cut off at 75%, which results in the overall lowest rate of false positives and false negatives. However, the user has the option to define his/her own cut off which may be more appropriate for a specific dataset.
Feedback loop identification of attC sites
The second strategy for attC identification uses a position weight matrix methodology. The idea here is to train the scoring algorithm so that it "learns" which components of the sequence are important for attC identification. All attC sites that had been annotated manually, in addition to a curated collection of attC sites identified using the search string strategy (defined above) were compiled and used to build the position weight matrix. The nucleotide frequencies at the 25 terminal positions at either end of the 2766 positive attC sites were considered since these encode the repeats required to form the IntI binding sites (R"/R' and L"/L' in Figure 1). The probability of a particular nucleotide occurring at each of the 25 terminal positions was calculated from these data. If the nucleotide in question was not important to the function of the attC site (i.e. did not facilitate secondary structure formation) we assumed that the probability of any given nucleotide observed to occur at this position would be close to random (25%). Each nucleotide position was compared to its binding partner within the attC site. If they base paired, a score of 1 was recorded. A score of 0 was recorded for a mismatch. Base pairing 3 positions up and down stream from each location was also considered. From these data, the probability of a particular base to pair with a base at the same position at the opposite end of the attC could be calculated. The percentage weightings were then linearly adjusted to consider the GC content of the sequence. Nucleotides that occurred more often than random frequencies were positively weighted and, conversely, nucleotides that present close to random frequencies were negatively weighted. Percentage weightings were then summed up for each nucleotide position across all known attC sites. The result is an algorithm that has "learned" which nucleotides should be considered important to integron attC sites when searching sequence space.
As new attC sites are identified, they add to the collection of positive examples, thus affecting the weightings for future attC sites. Therefore, as attC functional information grows, so does the accuracy of the feedback loop scoring method. It is important to stress however, that the user has the option to accept or reject the proposed attC sites identified by this algorithm. Rejecting false positives will ultimately lead to a more accurate dataset that is used to train the algorithm. Therefore, we propose that this method be used in combination with meticulous attention to detail on the part of the user. Also, it is possible that this algorithm may become too specific. For example, there is presently only a limited diversity of attCs in the database and therefore, an algorithm trained on this potentially biased dataset may not detect new, diverse attC s. For this reason, the user may choose a minimum score of zero for this method of cassette identification, making it possible to identify novel attC sites by the string search method alone.
Utility and Discussion
This section presents a brief description of the ACID database content and its usage. Detailed tutorials and instructions for the various functions of the database are available on the website.
Entry of new data in ACID
Novel data can be added to the database by manually entering the sequence information and accompanying details into a form. This form can be invoked from the 'add' option of the 'edit' scroll-down menu found at the top of any database entry form. Alternatively, a sequence and all its associated annotations can be imported directly from Genbank (the 'import' option in the 'edit' menu). This option is especially useful if the integron sequence is part of a complete or partial genome sequence.
Sequence annotation procedure
The database allows automatic annotation of any integron found in a proposed sequence. This is achieved in a three steps process. These may be performed individually (so the user can have input on the results, accepting or rejecting them manually) or all at once (using the 'Auto' option from the 'Annotate' drop-down menu). Annotation can only be performed on sequences entered by the user, not ones already present in the database.
The user must choose 'IntI' from the 'Annotate' drop-down menu. The proposed data entry is then used as a query sequence for the BLASTX algorithm  which compares the query to all IntI protein sequences identified previously. The region of the proposed sequence that produces a significant bit score is annotated as IntI.
The user must choose 'attC' from the 'Annotate' drop-down menu. The user has the option of choosing how to annotate the attC sites. Specifically, attC sites are identified using one or both strategies described in the algorithm section above. Cassettes are consequently annotated based on the positions of the attC sites (i.e. they are defined as the nucleotide sequence between attC sites).
The user must choose 'ORF' from the 'Annotate' drop-down menu. ORFs present in the proposed sequence can be annotated based on the following criteria (none or all of which can be selected by the user): present within gene cassettes (CassORFs); present within a specific portion of the sequence (nucleotide coordinates); displaying with standard or alternative start codons; those larger than a specified minimal amino acid length.
The user can either browse the database or perform more sophisticated searches. Using the browse option (from the drop-down 'Browse' menu) provides the user with the option of searching the database according to pre-defined categories (accession number, host species, sequence length, origin, etc). The database also encompasses extensive hierarchical search features. Two search modes (from the scroll-down 'Search' menu) are available to locate integron-related data, 'Sequence' and 'Interest'. The 'Sequence' search allows the user to retrieve complete sequence entries matching the search criteria selected. The 'Interest' search can be used to retrieve any annotated integron component (intI, attC, cassette, etc). The annotated regions/entries returned by either search mode can be downloaded in both amino acid and nucleotide FASTA format for transfer to alternative or complementary programs and further analyses.
A complete forum engine (phpBB) is included in the database to facilitate communication between scientists who are new to studying integrons or who are experienced veterans. The nature of the forum includes both general comments as they pertain to integrons, integron annotation and the gene cassette biology as well as comments specific to a particular annotation in the database. These specific annotation comments are linked back to the database so that relevant comments can be viewed for each sequence as it is being examined. We have included this option because we know that biological variation is often unpredictable and we do not propose to have accounted for all forms of integron-associated diversity. Through this open forum we hope to promote a collaborative environment promoting the sharing of integron-related knowledge.
Overview of database content
Diversity statistics for cassettes annotated with ACID
DNA sequence identity thresholds defining a cassette-type (%)
Number of cassette-types
Most integrons contain few cassettes (0–10), although some contain many (>50). These are almost always found in the marine bacterial genus Vibrio, with a few exceptions (such as the marine heterotroph Saccharophagus degradans) (Figure 3B). Most cassettes range between 200–800 bp in length, but their size can reach > 2000 bp (Figure 3C), probably due to the insertion of tranposon sequences or lysogenic phages. The attC sites associated with cassettes vary between sizes of 55 to 150 bp, with a significant number at 60 bp and between 126–129 bp (Figure 3D). The overrepresentation of these two sizes might be due to a sampling bias rather than functional properties, given that 60 bp is the typical length of an attC type common in class 1 integrons and that most attC sites found in Vibrio integrons are 126–129 bp long.
The majority (86%) of cassette ORFs did not have a homolog with an identified function, leaving a mere 14% for which a putative function could be ascribed. This is a recurring theme in both metagenomic and genomic/fosmid sequencing initiatives targeted at gene cassettes. This trend suggests an integron-specific gene cassette pool, distinct from the larger genomic gene pool. The continued accumulation of cassette sequence data will reveal the frequency at which these currently unique cassette-encoded ORFs will re-occur.
Functional identification of cassette-encoded ORFs in ACID
Gene cassette annotation
Number of cassettes
Phenazine biosynthesis-like protein
Plasmid maintenance system
Glyoxalase family protein
Beta-lactamase (EC 18.104.22.168)
Isochorismatase (EC 22.214.171.124)
Lac operon repressor type protein
Chloramphenicol resistance protein
Transposase OrfAB, subunit B
ATP-dependent Clp protease
Cell wall-associated hydrolases
DNA-damage-inducible protein J
Haemagglutinin associated protein
Microcin immunity protein MccF
Four bacterial phyla harbour integrons associated with gene cassettes (Proteobacteria, Chlorobi, Planctomycetes and Spirochaetes) . However, the vast majority of integron diversity is found in Proteobacteria. Among the 34 different bacterial genera containing integrons and gene cassettes, 17 are from the Proteobacteria phylum. This is likely to represent a sampling bias, given that 548 out of 1047 partial or complete bacterial genome sequences publicly available represent proteobacteria.
Very few integrons have been properly identified and annotated in public databases. These genetic elements have therefore been overlooked in comparison to other vectors that facilitate LGT between microorganisms. By automating the identification of integron intI genes and the non-coding attC sites associated with gene cassettes in DNA sequence data, a database has been assembled containing all publicly available sequence information regarding these genetic elements. The manually curated and publicly available web-based database can perform automated detection and annotation of integrons and gene cassettes enabling future sequence data to be incorporated easily. ACID provides an environment using consistent terms and definitions enabling simple comparison of all known examples of the integron/gene cassette system. The database can be searched using a range of queries and selected data can be downloaded in FASTA format convenient for export and integration with downstream applications. Users can readily annotate and save their own data and have the option of forwarding it to curators for addition to the main database. This database will help create more accurate annotations of bacterial genomes and facilitate a more representative sample of integron frequency and diversity.
Availability and Requirements
ACID can be accessed at http://integron.biochem.dal.ca/ACID/login.php. For instructions and tutorials, click on the help menu on the toolbar at the top of the ACID website.
We are grateful to Ian Paulsen (Macquarie University, Sydney, Australia) for critical reading of the manuscript. This database was created and is maintained using funds from The Canadian Institutes for Health Research (CIHR), the Australian Research Council (ARC) and the National Health and Medical Research Council (NHMRC) of Australia. JEK is also funded by a fellowship from the CIHR. MJJ is supported by a combined scholarship from the ARC and NSW Agriculture. YB is supported by the Merck-MIT Postdoctoral Fellowship Program of the Computational and Systems Biology Initiative at MIT.
- Recchia GD, Hall RM: Gene cassettes: a new class of mobile element. Microbiology 1995, 141(Pt 12):3015–3027.View ArticlePubMedGoogle Scholar
- Stokes HW, Hall RM: A novel family of potentially mobile DNA elements encoding site-specific gene-integration functions: integrons. Molecular Microbiology 1989, 3(12):1669–1683. 10.1111/j.1365-2958.1989.tb00153.xView ArticlePubMedGoogle Scholar
- MacDonald D, Demarre G, Bouvier M, Mazel D, Gopaul DN: Structural basis for broad DNA-specificity in integron recombination. Nature 2006, 440(7088):1157–1162. 10.1038/nature04643View ArticlePubMedGoogle Scholar
- Collis CM, Grammaticopoulos G, Briton J, Stokes HW, Hall RM: Site-specific insertion of gene cassettes into integrons. Mol Microbiol 1993, 9(1):41–52. 10.1111/j.1365-2958.1993.tb01667.xView ArticlePubMedGoogle Scholar
- Rowe-Magnus DA, Guerout AM, Mazel D: Super-integrons. Res Microbiol 1999, 150(9–10):641–651. 10.1016/S0923-2508(99)00127-8View ArticlePubMedGoogle Scholar
- Collis CM, Hall RM: Expression of antibiotic resistance genes in the integrated cassettes of integrons. Antimicrob Agents Chemother 1995, 39(1):155–162.PubMed CentralView ArticlePubMedGoogle Scholar
- Holmes AJ, Gillings MR, Nield BS, Mabbutt BC, Nevalainen KM, Stokes HW: The gene cassette metagenome is a basic resource for bacterial genome evolution. Environ Microbiol 2003, 5(5):383–394. 10.1046/j.1462-2920.2003.00429.xView ArticlePubMedGoogle Scholar
- Hall RM, Collis CM, Kim MJ, Partridge SR, Recchia GD, Stokes HW: Mobile gene cassettes and integrons in evolution. Ann N Y Acad Sci 1999, 870: 68–80. 10.1111/j.1749-6632.1999.tb08866.xView ArticlePubMedGoogle Scholar
- Mazel D: Integrons: agents of bacterial evolution. Nat Rev Microbiol 2006, 4(8):608–620. 10.1038/nrmicro1462View ArticlePubMedGoogle Scholar
- Boucher Y, Labbate M, Koenig JE, Stokes HW: Integrons: mobilizable platforms that promote genetic diversity in bacteria. Trends Microbiol 2007, 15(7):301–309. 10.1016/j.tim.2007.05.004View ArticlePubMedGoogle Scholar
- MySQL 5.0[http://www.mysql.com/]
- PHP 5.2[http://www.php.net/]
- Messier N, Roy PH: Integron integrases possess a unique additional domain necessary for activity. J Bacteriol 2001, 183(22):6699–6706. 10.1128/JB.183.22.6699-6706.2001PubMed CentralView ArticlePubMedGoogle Scholar
- Collis CM, Kim MJ, Stokes HW, Hall RM: Binding of the purified integron DNA integrase Intl1 to integron- and cassette-associated recombination sites. Mol Microbiol 1998, 29(2):477–490. 10.1046/j.1365-2958.1998.00936.xView ArticlePubMedGoogle Scholar
- Gravel A, Fournier B, Roy PH: DNA complexes obtained with the integron integrase IntI1 at the attI1 site. Nucleic Acids Res 1998, 26(19):4347–4355. 10.1093/nar/26.19.4347PubMed CentralView ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Collis CM, Kim MJ, Partridge SR, Stokes HW, Hall RM: Characterization of the class 3 integron and the site-specific recombination system it determines. J Bacteriol 2002, 184(11):3017–3026. 10.1128/JB.184.11.3017-3026.2002PubMed CentralView ArticlePubMedGoogle Scholar
- Partridge SR, Recchia GD, Scaramuzzi C, Collis CM, Stokes HW, Hall RM: Definition of the attI1 site of class 1 integrons. Microbiology 2000, 146(Pt 11):2855–2864.View ArticlePubMedGoogle Scholar
- Recchia GD, Stokes HW, Hall RM: Characterisation of specific and secondary recombination sites recognised by the integron DNA integrase. Nucleic Acids Res 1994, 22(11):2071–2078. 10.1093/nar/22.11.2071PubMed CentralView ArticlePubMedGoogle Scholar
- Stokes HW, O'Gorman DB, Recchia GD, Parsekhian M, Hall RM: Structure and function of 59-base element recombination sites associated with mobile gene cassettes. Mol Microbiol 1997, 26(4):731–745. 10.1046/j.1365-2958.1997.6091980.xView ArticlePubMedGoogle Scholar
- Bouvier M, Demarre G, Mazel D: Integron cassette insertion: a recombination process involving a folded single strand substrate. Embo J 2005, 24(24):4356–4367. 10.1038/sj.emboj.7600898PubMed CentralView ArticlePubMedGoogle Scholar
- Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R, et al.: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005, 33(17):5691–5702. 10.1093/nar/gki866PubMed CentralView ArticlePubMedGoogle Scholar
- Hall RM, Brookes DE, Stokes HW: Site-specific insertion of genes into integrons: role of the 59-base element and determination of the recombination cross-over point. Mol Microbiol 1991, 5(8):1941–1959. 10.1111/j.1365-2958.1991.tb00817.xView ArticlePubMedGoogle Scholar
- Michael CA, Gillings MR, Holmes AJ, Hughes L, Andrew NR, Holley MP, Stokes HW: Mobile gene cassettes: a fundamental resource for bacterial evolution. Am Nat 2004, 164(1):1–12. 10.1086/421733View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.