Volume 10 Supplement 6
RegExpBlasting (REB), a Regular Expression Blasting algorithm based on multiply aligned sequences
© Rubino and Attimonelli; licensee BioMed Central Ltd. 2009
Published: 16 June 2009
One of the most frequent uses of bioinformatics tools concerns functional characterization of a newly produced nucleotide sequence (a query sequence) by applying Blast or FASTA against a set of sequences (the subject sequences).
However, in some specific contexts, it is useful to compare the query sequence against a cluster such as a MultiAlignment (MA). We present here the RegExpBlasting (REB) algorithm, which compares an unclassified sequence with a dataset of patterns defined by application of Regular Expression rules to a given-as-input MA datasets.
The REB algorithm workflow consists in
i. the definition of a dataset of multialignments
ii. the association of each MA to a pattern, defined by application of regular expression rules;
iii. automatic characterization of a submitted biosequence according to the function of the sequences described by the pattern best matching the query sequence.
An application of this algorithm is used in the "characterize your sequence" tool available in the PPNEMA resource. PPNEMA is a resource of Ribosomal Cistron sequences from various species, grouped according to nematode genera. It allows the retrieval of plant nematode multialigned sequences or the classification of new nematode rDNA sequences by applying REB. The same algorithm also supports automatic updating of the PPNEMA database. The present paper gives examples of the use of REB within PPNEMA.
The use of REB in PPNEMA updating, the PPNEMA "characterize your sequence" option clearly demonstrates the power of the method. Using REB can also rapidly solve any other bioinformatics problem, where the addition of a new sequence to a pre-existing cluster is required. The statistical tests carried out here show the powerful flexibility of the method.
One of the most frequent needs a Bioinformatics end-user has is to characterize a newly produced sequence. In addition, the design and implementation of specialised bio-databases is a very frequent research activity, as demonstrated simply by accessing the NAR Database Issue compilation , but the main problem is keeping bio-databases updated in time. Both these needs – new sequence characterization and bio-database updating, are not always easy to satisfy fully. Indeed, characterization of "newly produced sequences" is usually performed by applying database similarity searching methods, such as BLAST  and FASTA , which are based on heuristic algorithms. The results, besides being approximate, then require further semi-automatic management in order to characterize the new sequences clearly, as regards their taxonomic and functional location. A great advantage in the "new sequence" classification procedure may be achieved by comparing new sequences with a set of already classified and grouped multialigned sequences, the grouping being based on already assigned functions and species. Here we define this approach as the One-Sequence vs a Dataset of more(n) Multi-Alignments (OS/DnMA). The rationale of the OS/DnMA approach has been implemented in other programs such as HMMalign within PFAM  or HoSeqI and MultiHoSeqI , but both these approaches are protein oriented. As far as the nucleotide sequences, the already available programs based on the OS/DnMA approach are the Classifier available through the Ribosomal Database Project Portal  and the ChiSeqI  both designed to characterize 16srRNAs sequences only. This paper describes an algorithm that was designed to respond to requirements for implementing a tool for characterization of newly sequenced plant parasitic nematode rRNA genes. Indeed, after the PPNEMA database had been developed [7, 8], we had not only to demonstrate the advantage of having the PPNEMA database available for phytoparasite nematologists, but also to implement a PPNEMA updating procedure as efficient and automatic as possible. PPNEMA, a Plant-Parasitic Nematode bioinformatics resource, offers the scientific community a pre-processed archive of plant parasitic nematode sequences useful for nematologists: it is a tool for retrieving plant nematode multialigned sequences for phylogenetic studies or identifying a nematode by comparing its rDNA sequence with the PPNEMA available, genus specific MultiAlignments (MA). Classification of a new sequence by applying BLAST to all the sequences available through PPNEMA should have required an ab-initio procedure not considering already available multialigned groups. RegExpBlasting (REB), the algorithm proposed here, which compares a new sequence against a dataset of patterns derived from the application of Regular Expression coding to multialigned sequences, successfully solves this problem. In addition, as regards PPNEMA database updating, starting from the new rDNA cystronic phytoparasite nematode sequences extracted through the Entrez retrieval system , available at NCBI and not yet stored in PPNEMA, we implemented a protocol, based on REB, which allows each new sequence to be assigned to the best matching PPNEMA MA. RegExpBlasting is not implemented only for its specific application within PPNEMA but, as reported below, it can be more generically applied to any set of both nucleotide and amino-acidic multi-alignments with the aim to characterize a new sequence. However we have successfully tested its usage in the application to nucleotide sequences.
Characterization of a new phytoparasite sequence
Use of RegExpBlasting for the PPNEMA database updating
We tested the procedure using window values of 20, 30 and 40, and tested for each window steps 10, 15 and 20.
Quantitative report of the PPNEMA updating procedure
Number of selected GenBank entries
Number of PPNEMA Groups considered
Number of TP in Normal Search
Number of FN in Normal Search
Number of TP after Scan-RegExpBlasting
Number of True Negative (TN)
It has been emphasized that REB is neither an exact matching method nor a heuristic one but, thanks to careful selection of minimal length and cut-off values, as well as scanning parameters (w and s), results mainly fitting true results can be obtained. What are the limitations? One of the most difficult problems that can be avoided concerns the quality of the MA: non-optimised MA can give rise to increased numbers of False results, both positive and negative. In addition, a poor-quality MA may introduce a great number of gaps, implying reduction of minimal length, elimination of the cluster to be analysed from the dataset, and hence reduction of informative data necessary to allow positives to grow. The solution is to lower the minl value but this, as shown in Figure 4 causes an increase in false positives. Thus, in phase I it is worthwhile excluding MA which may reduce the minimal length or, more generally, the growth of false results. Lastly, as regards MA, a great number of gaps inside them may lead to an increase in the CPU time of the run.
The results here presented and the description of the algorithm clearly show that REB is a tool with a wide range of applicability. Besides the example shown here, we hypothesize that it may be used to definite advantage in Barcoding  procedures in order to identify the best matching group of sequences for a new sequence. Moreover we plan to introduce this code in the HmtDB resource [10–12] to contribute in the classification of human mitochondrial (mt) haplogroups when a partial sequence of the human mt genome is produced. Finally it will be tested in the case of sequences where alternative splicing is present, using the Scan option of the REB software. We have not yet verified this hypothesis, but we are confident that it would be successful.
Similarity searching algorithms are usually classified as exact or heuristic methods.
Both methods compare an anonymous sequence against a dataset of sequences available in bio-databases. Exact methods search for the perfect match between an anonymous sequence and dataset sequences. Heuristic methods, i.e., FASTA and Blast, compare sequences by searching for matching substrings.
However, biological sequences, due to their intrinsic features, the presence of SNPs, insertions and deletions, cannot be managed well with exact methods. And although heuristic methods provide many results, many of them may be approximate or false, so that end-users have to check all results, if possible discriminate false from true positives, and then select the result that gives the best match with the queried sequence from among the true positives.
RegExpBlasting is a valid alternative to exact matching methods because, if positive results are obtained, they will be formed by a set of sequences grouped into an MA described by a pattern written through regular expression rules. Thus, RegExpBlasting is based on the widely used Regular Expression syntax, by which better than with the consensus definition, it is possible to describe the degeneration of a biological pattern defining a specific function.
The starting point in the generation of the regular expression within biosequence management is a sequence MA.
The algorithm is made of two phases: one which generates regular expressions starting from the MA, and the other based on the search of the MA whose regular expression pattern best describes the sequence to be characterized. The algorithm is organized in two phases. RegExpBlasting phase I produces the RegExp pattern describing each of the considered multialignments by extracting the regular expression which represents all the variations of the group of sequences available through their MA. RegExpBlasting phase II allows the classification of a new sequence by comparing it with each of the patterns defined in phase I and reports as output the matching MA to which the new sequence can easily be added.
Phase I – Regular expression generation
Regular expressions are tools used to represent every possible character variation in a sequence alignment, and are much more powerful than the classic consensus sequences which leave out a great quantity of information about sites such as possible SNPs, insertions and deletions. RegExpBlasting is an algorithm which can characterize a new sequence simply by comparing it with the regular expressed patterns defining the queried MA. Starting from each of the MA to be browsed, the nucleotide or aminoacid composition of each variant MA site is calculated, inclusive of gaps: this datum allows the site to be translated into regular expression annotation. The rule is: the entire site composition is associated with a character class, within square brackets; if gaps occur in the site, a question mark is added to the right of the character class. If the site is invariant, it is represented by the nucleotide followed by "?" if gaps occur in the site in question. If the MA is made of non-conserved sequences, displaying large gapped regions due to indel events among the MA composing sequences, the regular expression in the indel regions allows any character, i.e., a wild card. For this whole region, RegExpBlasting implements two different options: a) a number of wild cards less than or equal to the length of the indel region or b) a number of wild cards of any length. From a biological viewpoint, this implies that the indel regions, being so variant, have no importance in the function of the multialigned sequences in question.
In this way, each MA is translated into a Regular Expression, representing every sequence composing that MA and thus generating the RegExp datasets against which "blasting" of the query sequence can be applied. In order to use RegExpBlasting to best effect, it is important for end-users to be aware of several aspects which lie at the basis of pattern generation. These are minimal length and the number of gapped sites. Minimal length minl is the shortest sequence, which can produce a match, and is calculated as the difference between alignment length and number of gapped sites. It is closely correlated to the number of false positives, as will be demonstrated in the description of phase II. The number of gapped sites strongly influences the processing time of the algorithm: the higher the number of gapped sites, the longer the processing time.
Phase II – Blasting of a query sequence against the RegExp dataset
In this phase, RegExpBlasting compares the query sequence against the RegExp dataset, and the resulting output is the MA associated with the best matching pattern. When no positive results are obtained, a further analysis can be performed through the Scan-RegExpBlasting procedure. This procedure scans each RegExp dataset-composing pattern, using a window w characters long and a step s characters long, thus generating n sub-patterns which can then be analyzed. Both w and s are parameters which can be modified by end-users. Thus, once cut-off cf (i.e., the minimum number of matching windows related to the total number of scanned windows which must be detected in a pattern in order to define it as a positive result) has been fixed, the algorithm selects the positive matches if the number of matched sub-patterns related to the total number of considered sub-patterns is equal or greater than the fixed cut-off. In the first application, several tests were carried out by changing the minimal length, window and step length; in order to evaluate parameter values that optimize the performance of the algorithm and define general criteria for end-users.
The algorithm is written in python  but a Web interface will soon be implemented allowing end-users to generate regular expressions and RegExp datasets starting from the MA they submit. The Web system will also allow the characterization of new sequences against end-users defined datasets. Obviously, default parameters will be fixed, but end-users will be allowed to modify them. At present the software can be downloaded and locally used through the tool section implemented within the PPNEMA resource .
List of abbreviations used
Regular Expression Blasting
This work was supported by the University of Bari and PRIN 2006 (project # 2006064992_003).
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 6, 2009: European Molecular Biology Network (EMBnet) Conference 2008: 20th Anniversary Celebration. Leading applications and technologies in bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S6.
- Galperin MY: The Molecular Biology Database Collection: 2008 Update. Nucleic Acid Res 2008, 36: D2-D4. Published on line, 2008 January. doi: 10.1093/nar/gkm1037. Published on line, 2008 January. doi: 10.1093/nar/gkm1037.PubMed CentralView ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.View ArticlePubMedGoogle Scholar
- Pearson WR: Rapid and sensitive sequence comparison with FASTP and FASTA. Meth Enzymol 1990, 183: 63–98.View ArticlePubMedGoogle Scholar
- Becker E, Cotillard A, Meyer V, Madaoui H, Guérois R: HMM-Kalign: a tool for generating sub-optimal HMM alignments. Bioinformatics 2007, 23(22):3095–7.View ArticlePubMedGoogle Scholar
- Arigon AM, Perrière G, Gouy M: Automatic identification of large collections of protein-coding or rRNA sequences. Biochimie 2008, 90(4):609–14.View ArticlePubMedGoogle Scholar
- Wang Q, Garrity GM, Tiedje JM, Cole JR: Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 2007, 73(16):5261–7.PubMed CentralView ArticlePubMedGoogle Scholar
- Rubino F, Voukelatou A, De Luca F, De Giorgi C, Attimonelli M: PPNEMA: A Resource of Plant-Parasitic Nematodes Multialigned Ribosomal Cistrons. Int J Plant Genomics 2008, 387812.Google Scholar
- Baxevanis AD: Searching the NCBI databases using Entrez. Curr Protoc Bioinformatics 2006., Chapter 1(Unit 1.3):Google Scholar
- Hebert PD, Cywinska A, Ball SL, deWaard JR: Biological identifications through DNA barcodes. Proc Biol Sci 2003, 270(1512):313–21.PubMed CentralView ArticlePubMedGoogle Scholar
- Attimonelli M, Accetturo M, Santamaria M, Lascaro D, Scioscia G, Pappadà G, Russo L, Zanchetta L, Tommaseo-Ponzetta M: HmtDB, a human mitochondrial genomic resource based on variability studies supporting population genetics and biomedical research. BMC Bioinformatics 2005, 6(Suppl 4):S4.PubMed CentralView ArticlePubMedGoogle Scholar
- Accetturo M, Santamaria M, Lascaro D, Rubino F, Achilli A, Torroni A, Tommaseo-Ponzetta M, Attimonelli M: Human mtDNA site-specific variability values can act as haplogroup markers. Hum Mutat 2006, 27(9):965–74.View ArticlePubMedGoogle Scholar
- PPNEMA tools[http://www.ppnema.uniba.it:9090/tools]
- Python software download[http://www.python.org]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.