PILER-CR: Fast and accurate identification of CRISPR repeats
© Edgar; licensee BioMed Central Ltd. 2007
Received: 10 May 2006
Accepted: 20 January 2007
Published: 20 January 2007
Sequencing of prokaryotic genomes has recently revealed the presence of CRISPR elements: short, highly conserved repeats separated by unique sequences of similar length. The distinctive sequence signature of CRISPR repeats can be found using general-purpose repeat- or pattern-finding software tools. However, the output of such tools is not always ideal for studying these repeats, and significant effort is sometimes needed to build additional tools and perform manual analysis of the output.
We present PILER-CR, a program specifically designed for the identification and analysis of CRISPR repeats. The program executes rapidly, completing a 5 Mb genome in around 5 seconds on a current desktop computer. We validate the algorithm by manual curation and by comparison with published surveys of these repeats, finding that PILER-CR has both high sensitivity and high specificity. We also present a catalogue of putative CRISPR repeats identified in a comprehensive analysis of 346 prokaryotic genomes.
PILER-CR is a useful tool for rapid identification and classification of CRISPR repeats. The software is donated to the public domain. Source code and a Linux binary are freely available at http://www.drive5.com/pilercr.
Major parameters of the PILER-CR algorithm
Minimum length of a repeat. Smaller values might increase sensitivity, though repeats < 16 bases are not currently known.
Maximum length of a repeat. Larger values might increase sensitivity, though repeats > 64 bases are not currently known. Larger values may also tend to give more false positives, such as RNAs.
Minimum spacer length.
Minimum number of repeats in an array. A value < 3 will give large numbers of false positives. A value > 3 might increase specificity at the expense of missing shorter arrays.
Maximum spacer length.
Repeat length ratio (see caption for definition of ratio).
Spacer length ratio (see caption for definition of ratio).
A run of minarray repeats with at least mincons similarity is required for an array to be reported. The value is specified as fractional identity 0.0 .. 1.0, with 1.0 meaning identical sequences. Increasing this value will tend to increase specificity by reducing the number of degraded or mutated repeats reported.
Step 1: Local alignments
The first step is to find local alignments (hits) of the genome to itself; each hit is a candidate for being an alignment between two copies of a CRISPR repeat. As CRISPRs are separated by short distances, it suffices to search for alignments of two regions that are close to each other. Viewed as a self-similarity plot (dot-plot), this corresponds to searching for hits within a band around the main diagonal of the plot. PILER-CR uses a variant of Rasmussen et al.'s filter  to reduce the dynamic programming space. Searching within a band is accomplished by sliding a window along the genome. The filter utilizes a word index giving the sequence coordinates of each unique word. The index (occurrence table and lookup table, in Rasmussen et al.'s terminology) is updated in the obvious way as the window moves: words on the leading edge of the window are added to the index, words on the trailing edge are deleted. Regions of the dynamic programming matrix remaining after application of the filter are then explored for local alignments. Those local alignments that are too short or too long to plausibly align two CRISPRs are discarded. As an optimization, this is done during the dynamic programming step so that time is not wasted constructing long alignments.
Step 2: Pile construction
Step 3: Graph construction
Each hit connects two piles. Using hits as edges and piles as nodes, a graph is constructed and the connected components of the graph are identified. Each connected component is a candidate for containing one or more CRISPR arrays. This is not essential but significantly reduces the search space for the following step.
Step 4: Draft array identification
Step 5: Array refinement
Given output from step 4, several heuristics are applied in an attempt to improve the inferred arrays.
5a. Partial or degraded copies of the repeat at the beginning or end of the array are deleted. These are often found in practice, and deleting them may expose conserved columns in flanking regions that correctly belong in the repeat. A partial or degraded repeat is identified by having three or more gaps when aligned to the current consensus sequence for the array, or by having more than twice as many gaps as the average for the array. This is repeated until there are no such repeats.
5b. A 100% conserved column immediately preceding or following the current boundary is moved into the repeat. This is repeated until there are no such columns.
5c. A column at the repeat boundary that is < 90% conserved is deleted from the repeat, i.e. moved into the flanking region. This is repeated until there are no such columns.
Step 6: Adjacent array merge
Adjacent arrays with similar consensus sequences and similar spacer lengths are merged. Here, "merging" means that the two arrays are reported as a single array. The reason for this step is that it is quite common for there to be one or a small number of missing or degraded repeats embedded in an array. For two arrays to be merged, the following criteria must all be met. The ratios (a) of the repeat length and (b) spacer lengths for the two arrays must be ≥ 0.95. The two consensus sequences must be globally alignable, where global alignability is defined as having an optimal local alignment (Smith-Waterman) that covers ≥ 95% of the bases in the first array. Finally, the distance between the end of the first array and the beginning of the second array must be approximately k(R + S), where k is 1, 2, 3 or 4, R is the average repeat length of the first array, S is the average spacer length, and "approximately" means that the ratio must be ≥ 0.95.
Step 7: Clustering and alignment
It is often found that repeats in different arrays have recognizably similar sequences. The program therefore clusters consensus sequences into similar groups, meaning groups of sequences having estimated mutual identities of ≥ 50%. For speed, the identity of two (unaligned) sequences is estimated by counting the number of identical words of length 4. Multiple alignments are created for each group, again using fast options of MUSCLE.
Step 8: Report generation
Finally, a report is generated; for an example see Additional file 2. A detailed section specifies each repeat in an array, indicating any differences from the consensus sequence. Summary sections group arrays by estimated sequence similarity (see Step 7 above) and by position in the genome.
The time and space complexities of the algorithm are strictly O(L3) where L is the genome length. Steps 1 though 3 are O(L2) under reasonable assumptions; the cubic term arises in Step 4. In practice, time and space requirements scale approximately as O(L).
The most common error made by PILER-CR is to mis-identify the endpoints of a CRISPR repeat by one or more positions (rarely more than one). However, in most such cases the array is essentially correctly identified, and the correct endpoints are immediately evident to the user from the report. When comparing two repeat sequences, we therefore define three categories of match: an exact match, a close match and no match (meaning not exact or close). An exact match requires the two sequences to be identical. A close match is defined by the following procedure. A local alignment of the two sequences is made using the Smith-Waterman algorithm, and the number of matching bases m in this alignment is computed. If the longest of the two full sequences has length L, the identity of the two sequences is defined as s = m/L. (Note the difference between this identity and the conventionally defined identity of the local alignment, which in general will be shorter and thus have higher identity). If s exceeds a pre-defined threshold, which we choose to be 0.9, then we consider the match to be close.
While false positives cannot be reliably determined in all cases, due to the lack of known complete reference data, some can be identified by visual inspection or by searches of databases of other types of repeat. Some RNA genes occur in arrays that are remarkably similar to CRISPR repeats, with the gene sequence and spacer lengths 100% conserved; these are occasionally reported by PILER-CR as CRISPRs.
Comparison of PILER-CR results with the reference sets due to Jensen et al. and Godde and Bickerton
Jensen et al.
Godde and Bickerton
Additional file 1, Table 1 summarizes the agreement between CRISPR sequences reported by PILER-CR and those in the G and J sets (see Estimating Sensitivity in Methods). In some cases, we found no instances of the sequences reported by G or J in the genome, which is presumably explained by differences in the sequences used here versus the other studies. In other cases, all in the J set, PILER-CR found longer sequences that include the reference repeat as a subsequence, suggesting that Jensen et al. may have failed to identify the full repeat sequence. In a few cases, PILER-CR concatenates two arrays with similar repeat sequences, causing the consensus sequence to be truncated. There were no unambiguous false negatives. Thus, the sensitivity of PILER-CR with default parameters may approach 100%.
Again, Additional 1, file Table 1 is used (see Estimating Specificity in Methods). If longer lengths are allowed for CRISPR repeats, this may result in false positives due to RNA genes; the one example found in our experimental data with default algorithm settings is the longest repeat in Chromohalobacter salexigens. In Additional file 1, Table 1 we have marked clear false positives as FP and questionable repeats as Q. We found 8 questionable and 11 false positives in a total of 319 repeats. We assume that repeats classified as close or truncated (see methods), or are novel versus the reference data, are true positives as there is no reason to suppose otherwise. If we further assume that all the questionable repeats are errors, we get an estimated specificity of 94%. This estimate is necessarily very approximate, and may be quite optimistic.
We have presented PILER-CR, a new software program for the identification of CRISPR repeats, and have shown that the algorithm probably has both good sensitivity and good specificity. While other repeat-finding methods can be applied to finding CRISPRs, sometimes requiring post-processing of their output, PILER-CR provides an integrated package generating a comprehensive and readable report. The program executes rapidly, completing the analysis of 346 prokaryotic genomes in about 15 minutes on a 2 GHz desktop computer. The output is designed to facilitate manual analysis of the reported CRISPR arrays. In addition to a multiple alignment of the repeats in each array, flanking regions and spacers are shown, making it visually obvious when PILER-CR mis-identifies the repeat boundary by one or a few positions. Repeats are clustered by similarity, exposing relationships between different arrays and, where applicable, providing confirmatory evidence of marginal arrays (i.e., arrays with only 3 or 4 repeats that may not be perfectly conserved).
Availability and requirements
PILER-CR is donated to the public domain. Source code is written in C++ and is designed to be easily ported across platforms. It has been compiled and tested using Microsoft Visual C++ on Windows and gcc on i86 Linux. Source code and a Linux binary are freely available as Additional File 3, which contains the version used in this study, and at http://www.drive5.com/pilercr, where any updates and bug fixes will be posted.
The author funded this work. The author is grateful to Viktor Kunin for suggesting that PILER could be adapted to find CRISPR repeats and for helpful feedback on early versions of the program.
- Jansen R, van Embden JD, Gaastra W, Schouls LM: Identification of a novel family of sequence repeats among prokaryotes. Omics 2002, 6: 23–33. 10.1089/15362310252780816View ArticlePubMedGoogle Scholar
- Mojica FJ, Diez-Villasenor C, Soria E, Juez G: Biological significance of a family of regularly spaced repeats in the genomes of Archaea, Bacteria and mitochondria. Mol Microbiol 2000, 36: 244–246. 10.1046/j.1365-2958.2000.01838.xView ArticlePubMedGoogle Scholar
- Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, Blake JA, FitzGerald LM, Clayton RA, Gocayne JD, Kerlavage AR, Dougherty BA, Tomb JF, Adams MD, Reich CI, Overbeek R, Kirkness EF, Weinstock KG, Merrick JM, Glodek A, Scott JL, Geoghagen NS, Venter JC: Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 1996, 273: 1058–1073. 10.1126/science.273.5278.1058View ArticlePubMedGoogle Scholar
- DeBoy RT, Mongodin EF, Emerson JB, Nelson KE: Chromosome evolution in the Thermotogales: large-scale inversions and strain diversification of CRISPR sequences. J Bacteriol 2006, 188: 2364–2374. 10.1128/JB.188.7.2364-2374.2006PubMed CentralView ArticlePubMedGoogle Scholar
- Haft DH, Selengut J, Mongodin EF, Nelson KE: A Guild of 45 CRISPR-Associated (Cas) Protein Families and Multiple CRISPR/Cas Subtypes Exist in Prokaryotic Genomes. PLoS Comput Biol 2005, 1: e60. 10.1371/journal.pcbi.0010060PubMed CentralView ArticlePubMedGoogle Scholar
- Makarova KS, Grishin NV, Shabalina SA, Wolf YI, Koonin EV: A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol Direct 2006, 1: 7. 10.1186/1745-6150-1-7PubMed CentralView ArticlePubMedGoogle Scholar
- Pourcel C, Salvignol G, Vergnaud G: CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology 2005, 151: 653–663. 10.1099/mic.0.27437-0View ArticlePubMedGoogle Scholar
- Kurtz S, Schleiermacher C: REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics 1999, 15: 426–427. 10.1093/bioinformatics/15.5.426View ArticlePubMedGoogle Scholar
- Dsouza M, Larsen N, Overbeek R: Searching for patterns in genomic data. Trends Genet 1997, 13: 497–498. 10.1016/S0168-9525(97)01347-4View ArticlePubMedGoogle Scholar
- Jansen R, Embden JD, Gaastra W, Schouls LM: Identification of genes that are associated with DNA repeats in prokaryotes. Mol Microbiol 2002, 43: 1565–1575. 10.1046/j.1365-2958.2002.02839.xView ArticlePubMedGoogle Scholar
- Edgar RC, Myers EW: PILER: identification and classification of genomic repeats. Bioinformatics 2005, 21 Suppl 1: i152–8. 10.1093/bioinformatics/bti1003View ArticlePubMedGoogle Scholar
- Rasmussen KR, Stoye J, Myers EW: Efficient q-gram filters for finding all epsilon-matches over a given length. J Comput Biol 2006, 13: 296–308. 10.1089/cmb.2006.13.296View ArticlePubMedGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32: 1792–1797. 10.1093/nar/gkh340PubMed CentralView ArticlePubMedGoogle Scholar
- Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5: 113. 10.1186/1471-2105-5-113PubMed CentralView ArticlePubMedGoogle Scholar
- Godde JS, Bickerton A: The Repetitive DNA Elements Called CRISPRs and Their Associated Genes: Evidence of Horizontal Transfer Among Prokaryotes. J Mol Evol 2006.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.