Genome-scale ncRNA homology search using a Hamming distance-based filtration strategy
© Sun et al.; licensee BioMed Central Ltd. 2012
Published: 21 March 2012
Noncoding RNAs (ncRNAs) play important roles in many biological processes. Existing genome-scale ncRNA search tools identify ncRNAs in local sequence alignments generated by conventional sequence comparison methods. However, some types of ncRNA lack strong sequence conservation and tend to be missed or misaligned by conventional sequence comparison.
In this paper, we propose an ncRNA identification framework that is complementary to existing sequence comparison tools. By integrating a filtration step based on Hamming distance and ncRNA alignment programs such as FOLDALIGN or PLAST-ncRNA, the proposed ncRNA search framework can identify ncRNAs that lack strong sequence conservation. In addition, as the ratio of transition and transversion mutation is often used as a discriminative feature for functional ncRNA identification, we incorporate this feature into the filtration step using a coding strategy. We apply Hamming distance seeds to ncRNA search in the intergenic regions of human and mouse genomes and between the Burkholderia cenocepacia J2315 genome and the Ralstonia solanacearum genome. The experimental results demonstrate that a carefully designed Hamming distance seed can achieve better sensitivity in searching for poorly conserved ncRNAs than conventional sequence comparison tools.
Hamming distance seeds provide better sensitivity as a filtration strategy for genome-wide ncRNA homology search than the existing seeding strategies used in BLAST-like tools. By combining Hamming distance seed matching with ncRNA alignment, we are able to find ncRNAs with sequence similarities below 60%.
Identifying ncRNAs (non-coding RNAs), which function directly as RNAs rather than being translated into proteins, has drawn tremendous attention recently for two main reasons. First, besides their well-known functions in protein synthesis, small ncRNAs have been shown to play regulatory roles in gene regulation in a wide variety of species. Second, new members of annotated ncRNA families and novel ncRNAs have been identified thanks to advances in next-generation sequencing technologies and RNA-seq. Understanding ncRNAs plays a key role in elucidating the complexity of the regulatory networks of both complex and simple organisms.
The second problem of using BLAST-like tools for ncRNA identification is that they do not incorporate structural similarity. Deriving secondary structure from a pure sequence alignment has limited accuracy. Previous work has shown that the final alignments generated by BLAST and by structural alignment tools such as FOLDALIGN [13, 14] can be quite different.
In order to conduct ncRNA search efficiently and accurately, we propose a new approach that integrates a sensitive filtration step with a local ncRNA alignment step for identifying homologous ncRNAs. The filtration step locates substrings with Hamming distance smaller than a given threshold. By carefully choosing the length and distance threshold for Hamming distance, we can locate all regions within a range of sequence similarity. In the second step, the regions passing the filtration stage are used as input to ncRNA alignment programs, which are designed to incorporate both the sequence and structural similarities in ncRNAs. There are a number of ncRNA alignment tools available. As the output of the filtration stage does not indicate the exact starting or ending positions of putative ncRNAs, local alignment tools are needed. In this work, we used two types of ncRNA alignment programs for the second stage and compared their performance. The two types of programs are based on different methodologies. One folds and aligns sequences simultaneously to maximize both sequence and structural similarity. The other uses posterior probability alignment to boost homology search sensitivity. ncRNAs that may be missed by conventional sequence comparison tools are more likely to be identified using these alignment programs.
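The filtration idea can be illustrated with a brute-force sketch that reports every pair of equal-length windows whose Hamming distance is within a threshold. The function names, the toy sequences, and the use of "at most" rather than "smaller than" the threshold are assumptions made for this example; a genome-scale search would need a much faster method than this quadratic scan.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def filter_windows(s: str, t: str, length: int, threshold: int):
    """Yield (i, j, d) for each window pair with Hamming distance d <= threshold.

    Brute force: O(len(s) * len(t) * length). For illustration only.
    """
    for i in range(len(s) - length + 1):
        for j in range(len(t) - length + 1):
            d = hamming(s[i:i + length], t[j:j + length])
            if d <= threshold:
                yield i, j, d


# Toy input (hypothetical sequences, not from the paper's data sets).
hits = list(filter_windows("ACGUACGU", "UUACGAACGG", length=6, threshold=1))
# hits == [(0, 2, 1), (1, 3, 1)]
```

Each reported pair only marks a candidate region; the actual alignment boundaries are left to the local alignment step.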
We applied this approach to ncRNA homology search between intergenic regions in human and mouse genomes, and between the Burkholderia cenocepacia J2315 genome and the Ralstonia solanacearum genome. The experimental results demonstrate that our approach is efficient and is more sensitive than conventional sequence alignment tools for finding ncRNAs with sequence identity below 60%.
There are a number of ncRNA alignment tools that incorporate both sequence and structural similarities. However, most of them are based on global alignment, requiring known starting and ending positions of ncRNAs. Identifying ncRNAs in genomes or transcriptome data sets requires local ncRNA alignment. FOLDALIGN is a highly sensitive local structural alignment tool that can identify ncRNAs with very low sequence similarity (< 40%). Using heuristics such as dynamic programming matrix pruning, FOLDALIGN is faster than an exact implementation of the Sankoff algorithm. However, it is still CPU-intensive on large data sets. When it was applied to ncRNA search between the intergenic regions of the human and mouse genomes, FOLDALIGN took about 5 months on 70 2-GB-RAM nodes in a Linux cluster. Thus, it is not practical to directly apply FOLDALIGN to large sequence sets.
Because of the cost of structural alignment, existing genome-scale ncRNA search tools [2–4] still rely on conventional sequence alignment programs such as BLAST. As a seeded alignment tool, BLAST relies on its seeding heuristics to search efficiently for local similarity between long genomes. Both theoretical analysis and empirical experiments [9, 18] have shown that the choice of seeding heuristics affects the sensitivity of local alignments. While BLAST requires consecutive matching, PatternHunter allows spaced seeds, which can incorporate biological features of the underlying alignments. For example, spaced seeds designed for coding regions allow a mismatch following two exact matches, reflecting the less strictly specified base in a codon. However, it is much more difficult to design useful spaced seeds for ncRNA search because 1) ncRNAs do not preserve strong sequence characteristics; 2) we lack enough training sequences for seed design. A more advanced seed type than the spaced seed distinguishes transitions from transversions, as many functional genomic features including ncRNAs show a higher frequency of transitions than transversions [18–20]. This type of seed is adopted by the sequence comparison tool BLASTZ. It uses the optimal spaced seed designed by PatternHunter but allows a transition mutation (A-G, G-A, C-T, or T-C) at any one of the inspected positions in the seed.
Recently, a posterior-probability based ncRNA local alignment tool, PLAST-ncRNA, has been implemented. However, it is designed to align a relatively short query sequence with a long target sequence rather than to compare two genomes. Thus, it cannot be directly applied to genome-scale ncRNA search without manually dividing a long genome into numerous small segments.
In our work, we design a filtration strategy based on Hamming distance. There are a number of existing implementations that search for substrings satisfying a pre-defined Hamming distance threshold. For example, in the ungapped short read mapping problem, short reads generated from next-generation sequencing platforms are aligned to the reference genome by allowing a couple of mismatches. Techniques such as neighborhood generation and the pigeonhole principle have been applied to transform inexact matching into exact matching in order to improve the search speed. Although a number of efficient read mapping programs [22, 23] exist, they cannot be used as the filtration step in ncRNA search because read mapping usually allows only a very small number of mismatches. In addition, they are specifically designed to align a set of short reads with a long reference genome.
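The seed types discussed here can be made concrete with a small sketch. A seed is written as a string in which '1' marks an inspected position and '0' a don't-care position; an optional budget lets a limited number of inspected positions hold a transition instead of an exact match. The function name and the 9-position seed "110110110" are hypothetical choices for illustration and are not the actual PatternHunter or BLASTZ seeds.

```python
# Ordered base pairs that count as transitions (purine<->purine, pyrimidine<->pyrimidine).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_hit(a: str, b: str, seed: str, max_transitions: int = 0) -> bool:
    """Test whether two windows match under a spaced seed.

    '1' in `seed` is an inspected position, '0' is ignored. Up to
    `max_transitions` inspected positions may hold a transition.
    """
    if not (len(a) == len(b) == len(seed)):
        raise ValueError("windows and seed must have equal length")
    transitions = 0
    for x, y, s in zip(a, b, seed):
        if s == "0" or x == y:
            continue
        if (x, y) not in TRANSITIONS:
            return False  # a transversion at an inspected position
        transitions += 1
        if transitions > max_transitions:
            return False
    return True


# Mismatches only at every third position: the codon-style seed still hits.
print(seed_hit("ATGGCACTT", "ATAGCGCTC", "110110110"))  # True
# A contiguous seed of the same span does not.
print(seed_hit("ATGGCACTT", "ATAGCGCTC", "111111111"))  # False
```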
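A short sketch shows why the pigeonhole argument suits read mapping but not a filter that tolerates many mismatches. If two strings of equal length differ in at most k positions and are cut into k + 1 blocks, at least one block must be identical in both. The helper names below are hypothetical.

```python
def split_blocks(length: int, parts: int):
    """Return (start, end) bounds splitting range(length) into `parts` near-equal blocks."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for p in range(parts):
        end = start + base + (1 if p < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def exact_block_shared(a: str, b: str, max_mismatches: int) -> bool:
    """True if at least one of the max_mismatches + 1 blocks is identical in a and b."""
    blocks = split_blocks(len(a), max_mismatches + 1)
    return any(a[s:e] == b[s:e] for s, e in blocks)


# Two mismatches, three blocks: the first block is shared exactly.
print(exact_block_shared("ACGTACGTACGT", "ACGTACCTACGA", 2))  # True

# With 27 allowed mismatches in a 50-base window, the blocks shrink to 1 or 2 bases.
print(max(e - s for s, e in split_blocks(50, 28)))  # 2
```

Exact matches of one or two bases occur almost everywhere in a genome, so the exact-match stage filters out almost nothing when the mismatch budget is that large.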
In the remaining part of this section, we first describe the coding system that can distinguish transition from transversion in Hamming distance seeds. Then we present optimal HD seed generation.
Design a coding system to distinguish transition from transversion
Transition mutations are less likely to result in amino acid changes. Thus, it is expected that transitions are observed at higher frequency than transversions in homologous protein-coding genes. This fact has been adopted by sequence alignment tools such as BLASTZ to improve the performance of homology search. Similar observations have been made in homologous ncRNAs as well. In the score table RIBOSUM designed by Klein and Eddy, transitions in both single stranded regions and between base pairs have higher scores than transversions. Higgs reported that the substitution rate between a base pair (such as AU) and its double transition base pair (such as GC) is significantly higher than other mutations. Thus, it is desirable to distinguish transition from transversion in our HD seeds. However, the Hamming distance defined on DNA or RNA bases treats each mismatch equally. In order to favor transition over transversion in HD seeds, we formulate the following coding problem.
Converting bases into bits
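One way to realize such a coding, offered only as a sketch, is to map each base to a 4-bit word so that a transition changes fewer bits than a transversion. The code words below are a hypothetical choice and are not necessarily those used by the authors; they give a bit-level Hamming distance of 2 for every transition and 3 for every transversion, which is itself an assumption of this example.

```python
from itertools import combinations

# Hypothetical 4-bit code words (assumption, chosen for illustration).
CODE = {"A": "0000", "G": "1100", "C": "0111", "T": "1011"}
PURINES = {"A", "G"}


def encode(seq: str) -> str:
    """Encode a DNA/RNA string as a bit string; U is treated as T."""
    return "".join(CODE[b] for b in seq.upper().replace("U", "T"))


def bit_hamming(x: str, y: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    if len(x) != len(y):
        raise ValueError("bit strings must have equal length")
    return sum(p != q for p, q in zip(x, y))


for a, b in combinations("AGCT", 2):
    # Two purines or two pyrimidines form a transition; mixed pairs are transversions.
    kind = "transition" if (a in PURINES) == (b in PURINES) else "transversion"
    print(a, b, kind, bit_hamming(CODE[a], CODE[b]))
```

With such a coding, an ordinary Hamming distance computed on the bit strings penalizes a transversion more heavily than a transition.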
Hamming distance seed design
To design an HD seed, we need to determine L and T to maximize its matching probability in ncRNA homologs while keeping the matching probability to random sequences as low as possible. Given a pair of true ncRNA homologs, the probability that the input pair contains a match to the given HD seed is proportional to the sensitivity of the seed. Given a pair of random sequences, the probability that the input pair contains a match to the given seed is proportional to the false positive (FP) rate of the seed. Thus, computing the matching probability allows us to compare performance of different seeds. As there are a large number of valid combinations of L and T, an efficient method is needed for the matching probability computation. In this work, we use a simple i.i.d. model to describe distributions of exact matches, transitions, and transversions in a pair of sequences. The theoretical HD seed matching probability can be efficiently computed based on the i.i.d. model.
For an HD seed <L,T>, there are multiple combinations of x1, x2, and x3 satisfying the above equation. The matching probability must sum over all combinations. In the above equations, l is the number of bases in genomic sequences and L is the number of bits after coding.
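As a rough illustration of this kind of computation, the sketch below sums a multinomial probability over all counts of exact matches (x1), transitions (x2), and transversions (x3) in one fixed window of l bases whose coded distance stays within T. The per-base probabilities, the bit costs of 2 per transition and 3 per transversion, and the function name are assumptions made for this example and are not taken from the equations above. It covers a single aligned window only, not the probability that a sequence pair contains a match somewhere.

```python
from math import comb


def window_match_probability(l, T, p_match, p_ts, p_tv, ts_cost=2, tv_cost=3):
    """Probability that one window of l i.i.d. base pairs has coded distance <= T.

    p_match, p_ts, p_tv: per-position probabilities of an exact match,
    a transition, and a transversion (assumed to sum to 1).
    """
    total = 0.0
    for x3 in range(l + 1):                 # number of transversions
        for x2 in range(l - x3 + 1):        # number of transitions
            if ts_cost * x2 + tv_cost * x3 > T:
                continue
            x1 = l - x2 - x3                # number of exact matches
            ways = comb(l, x2) * comb(l - x2, x3)
            total += ways * p_match**x1 * p_ts**x2 * p_tv**x3
    return total


# Hypothetical per-base probabilities for a 50-base window coded into 200 bits.
p = window_match_probability(l=50, T=55, p_match=0.6, p_ts=0.25, p_tv=0.15)
```

Two sanity checks follow directly from the definition: with T = 0 the result reduces to p_match**l, and with a threshold large enough to admit every combination it equals 1.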
Based on the two figures, we determine L and T with the best tradeoff between the matching probability in true ncRNA homologs and the matching probability in random sequences. The chosen seed is <200,55>, which is highlighted in Figures 4 and 5. Its matching probability in true ncRNA homologs is 0.906 and its matching probability in random sequences is 1.45E-07. The seed <200,55> represents a similarity on coded bit strings. According to the coding in Table 1, for a genomic sequence of length 50 = 200/4, the seed <200,55> allows 26 transitions and 1 transversion. This combination gives the lowest DNA-level similarity, 46% = (50 - 26 - 1)/50. Thus, the chosen seed is able to detect highly structured ncRNAs that have very low sequence conservation.
Software for HD seed matching and local structural alignment
There are a number of tools that can implement HD seed matching. We chose the randomized algorithm LSH-ALL-PAIRS, which is based on locality-sensitive hashing. Although it is an approximation algorithm, it has achieved high sensitivity in detecting DNA homologs with similarity as low as 63%. More importantly, it is fast enough to apply to whole genomes even when the number of allowed substitutions (i.e. T in the HD seeds) increases.
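The arithmetic behind the 46% figure can be checked by enumerating how many mismatches fit into the bit budget. The sketch assumes 4 bits per base, as stated above, and bit costs of 2 per transition and 3 per transversion; these costs are an assumption that reproduces 26 × 2 + 1 × 3 = 55, and the function name is hypothetical.

```python
def max_mismatch_combos(bits=200, threshold=55, bits_per_base=4,
                        ts_cost=2, tv_cost=3):
    """Return (bases, max mismatches, combos) where each combo is
    (transitions, transversions) reaching the maximum within the threshold."""
    n_bases = bits // bits_per_base
    combos = [
        (ts, tv)
        for tv in range(n_bases + 1)
        for ts in range(n_bases - tv + 1)
        if ts_cost * ts + tv_cost * tv <= threshold
    ]
    most = max(ts + tv for ts, tv in combos)
    return n_bases, most, sorted(c for c in combos if sum(c) == most)


n_bases, most, combos = max_mismatch_combos()
print(n_bases, most, combos)          # 50 27 [(26, 1), (27, 0)]
print((n_bases - most) / n_bases)     # 0.46
```

Under these assumed costs, 26 transitions plus 1 transversion use the full budget of 55 bits, and 27 transitions alone use 54; both leave 23 of 50 bases identical.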
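The idea behind locality-sensitive hashing for Hamming distance can be sketched as follows: sample k positions at random, hash every string by the characters at those positions, and treat strings that land in the same bucket as candidate pairs. This is a minimal illustration with hypothetical names and parameters, not the LSH-ALL-PAIRS implementation.

```python
import random
from collections import defaultdict
from itertools import combinations


def lsh_candidates(strings, k, rounds, seed=0):
    """Return index pairs that agree on k sampled positions in at least one round.

    All strings must have the same length. Candidates still need to be
    verified by computing their true Hamming distance.
    """
    rng = random.Random(seed)
    length = len(strings[0])
    candidates = set()
    for _ in range(rounds):
        positions = sorted(rng.sample(range(length), k))
        buckets = defaultdict(list)
        for idx, s in enumerate(strings):
            buckets["".join(s[p] for p in positions)].append(idx)
        for bucket in buckets.values():
            candidates.update(combinations(bucket, 2))
    return candidates


windows = ["ACGTACGTAC", "ACGTACGTAC", "TGCATGCATG"]
print(lsh_candidates(windows, k=3, rounds=5))  # {(0, 1)}
```

Two strings of length n at Hamming distance d collide in a single round with probability roughly (1 - d/n)^k, so similar strings collide often and dissimilar ones rarely; repeating the sampling for several rounds raises the chance that a truly similar pair is caught at least once.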
For a pair of substrings that contain a match to the HD seed, we apply two types of local alignment programs. The first is FOLDALIGN, which can conduct local structural alignment. The second is PLAST-ncRNA, which uses posterior probabilities to conduct alignments. Both of these tools can detect homologous ncRNAs with low sequence similarities.
LSH-ALL-PAIRS, FOLDALIGN, and PLAST-ncRNA were downloaded from the authors' websites.
Experiments and results
For ncRNAs with high sequence similarity, BLAST and other seeded alignment tools suffice to identify them between related genomes. The goal of our tool is to provide an ncRNA identification method that is complementary to conventional sequence comparison tools. In this section, we focus on testing the ncRNA search performance of HD seeds in data sets with low sequence conservation.
Comparison of Hamming seeds, BLAST, and BLASTZ
NcRNA search in the Burkholderia cenocepacia J2315 genome
In the second experiment we focus on ncRNA identification in the Burkholderia cenocepacia J2315 genome by comparing it with the Ralstonia solanacearum genome. Burkholderia cenocepacia is clinically important because it can cause lung infections in cystic fibrosis (CF) patients. There are multiple members in Burkholderia cenocepacia. Coenye et al. conducted ncRNA search by applying BLAST and QRNA between B. cenocepacia strain J2315 and related genomes including the Ralstonia solanacearum genome. As BLAST can miss highly structured ncRNAs, we conducted a complementary analysis using HD seeds and ncRNA alignment programs including FOLDALIGN and PLAST-ncRNA. We applied both tools to regions around HD seed hits and compared the outputs of FOLDALIGN and PLAST-ncRNA. We downloaded the three chromosomes (accession IDs: NC_011000, NC_011001, NC_011002) of the Burkholderia cenocepacia J2315 genome from NCBI. Their sizes are 3,870,082 nt, 3,217,062 nt, and 875,977 nt, respectively. Similarly, we downloaded the Ralstonia solanacearum GMI1000 genome (NC_003295) from NCBI. The single chromosome has length 3,716,413 nt. Using BLAST and QRNA, Coenye et al. reported 78, 116, and 19 putative ncRNAs on the three chromosomes of J2315.
Comparison of the HD seed hits with putative ncRNAs reported by Coenye et al.
As the purpose of this experiment is to identify highly structured ncRNAs that might be missed by existing ncRNA homology search tools such as the combination of BLAST and QRNA, we are only interested in seed hits with identity no more than 60%. We extended each intergenic seed hit with identity no more than 60% by 100 bases to the left and right in each input sequence. Then local alignment was conducted between the extended substrings using FOLDALIGN or PLAST-ncRNA. As chromosome 1 and chromosome 2 are much larger than chromosome 3 and may have more putative ncRNAs, we only present results of search on chromosome 1 and chromosome 2. All programs were run on a 128-node cluster, where each node contains 2 dual-core AMD Opterons running at 2.2 GHz with 8 GB of memory. The running time of HD seed matching using LSH-ALL-PAIRS is 8,250 and 6,850 seconds for chromosome 1 and chromosome 2, respectively. The running times of FOLDALIGN on regions around seed matches are 15 hours and 14 hours for chromosome 1 and chromosome 2, respectively. The running times of PLAST-ncRNA on regions around seed matches on chromosome 1 and chromosome 2 are 697 seconds and 501 seconds, respectively. As FOLDALIGN is based on the computationally intensive structural alignment algorithm by Sankoff, it takes much longer to run than the posterior-probability based PLAST-ncRNA. However, FOLDALIGN can output both the alignment and the consensus secondary structure for each input pair, while PLAST-ncRNA does not provide secondary structure derivation. Additional ncRNA structure prediction programs are needed to process the output of PLAST-ncRNA when structure information is needed.
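The selection and extension of seed hits described above can be sketched as two small helpers. The names are hypothetical, and clipping the extended region at the sequence ends is an assumption, since the handling of hits near a boundary is not stated.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two equal-length windows."""
    if len(a) != len(b) or not a:
        raise ValueError("windows must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def extend_hit(start: int, length: int, seq_len: int, flank: int = 100):
    """Return (begin, end) of a hit extended by `flank` bases on each side.

    Coordinates are 0-based, end-exclusive, and clipped to [0, seq_len].
    """
    begin = max(0, start - flank)
    end = min(seq_len, start + length + flank)
    return begin, end


# A 50-base hit at position 500 of a 1,000-base sequence becomes a 250-base region.
print(extend_hit(500, 50, 1000))   # (400, 650)
# Hits close to either end are clipped rather than padded.
print(extend_hit(30, 50, 1000))    # (0, 180)
```

A hit would be kept for local alignment only when percent_identity of its two windows is at most 60, and the extended substrings from both inputs are then passed to the alignment program.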
For all output alignments by FOLDALIGN and PLAST-ncRNA, we remove an alignment if it satisfies one of the following conditions: 1) the alignment overlaps with adjacent protein-coding genes; 2) the alignment score is smaller than a given cutoff; and 3) the alignment length is smaller than 55. PLAST-ncRNA has a cutoff for average posterior probability, which is the posterior probability normalized over the length of an alignment. The default cutoff for PLAST-ncRNA is 0.1. There is no default score cutoff for FOLDALIGN when we conduct the alignment using "local" mode. The "scan" mode provides p-values, which convey the significance of alignment scores better than the raw scores. Following the assumption made by FOLDALIGN that the alignment scores follow an extreme-value distribution, we designed a score cutoff corresponding to a p-value of 10^-8. Specifically, we generated 50,000 random sequences of length 200 and aligned all pairs of them. Then we conducted curve-fitting using the random alignment scores and determined the score cutoff for the chosen p-value. The computed score cutoff for FOLDALIGN is 450.
Note that although the lowest sequence identity allowed by our chosen HD seed <200,55> is 46%, PLAST-ncRNA is applied to bigger regions around each seed hit. As a local structural alignment tool, PLAST-ncRNA can report highly structured alignments with very low sequence conservation. This is shown in the identity distribution in Figures 9 and 12. Many of the putative ncRNAs on chromosome 1 are longer than annotated small ncRNAs. This is consistent with the previous observation that small ncRNAs tend to have better sequence conservation than long ncRNAs.
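How a score cutoff follows from an extreme-value fit can be sketched with a Gumbel distribution. The sketch estimates the location and scale by the method of moments, which is a simplification chosen for this example and not necessarily the curve-fitting procedure used here; the function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Estimate Gumbel (location mu, scale beta) from sample mean and std. dev."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6) / math.pi   # since Var = (pi^2 / 6) * beta^2
    mu = mean - EULER_GAMMA * beta       # since Mean = mu + gamma * beta
    return mu, beta


def score_cutoff(mu, beta, p_value):
    """Score s with P(S >= s) = p_value under Gumbel(mu, beta).

    Solves 1 - exp(-exp(-(s - mu) / beta)) = p_value for s.
    """
    return mu - beta * math.log(-math.log1p(-p_value))


# Hypothetical parameters: the cutoff lies about 18.4 scale units above mu at p = 1e-8.
print(round(score_cutoff(100.0, 10.0, 1e-8), 2))  # 284.21
```

Given a list of alignment scores from random sequence pairs, the cutoff would be obtained as score_cutoff(*fit_gumbel_moments(scores), 1e-8).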
Properties of two putative ncRNAs on chromosome 1 of J2315
We applied FOLDALIGN and PLAST-ncRNA as the local alignment tools to regions around HD seed hits. Although both of these tools conduct local alignment, they are based on different rationales and have different optimization goals. FOLDALIGN tries to optimize both sequence and structural similarities. PLAST-ncRNA uses posterior probability to conduct sensitive alignment and does not directly incorporate secondary structure information. Yet, we found that the outputs of these two tools share a large overlap. This could indicate that the shared alignments are highly likely to contain true ncRNAs, as they achieved high scores using two very different alignment methodologies. On the other hand, there is a possibility that these two methods tend to have similar false positive hits. Thus, this poses further questions about how to distinguish functional ncRNAs from pseudo-ncRNAs, which can pass the default cutoffs of the alignment tools but lack real functions. Extra evidence beyond high alignment scores is needed. One type of computational evidence is base composition, which can be conveniently incorporated into homology search. Schattner applied base-composition statistics to ncRNA gene finding in a limited number of experiments. It is worth investigating whether these statistics can be applied to different species. Other useful evidence includes the availability of transcriptomic data, the translation potential, and the genomic context around the local alignments. Finally, if these local alignments can be found in a third related genome, this also provides strong evidence for functional ncRNAs.
In this work, we optimize the HD seeds using all known ncRNAs from different species as the training data. We are aware that different types of ncRNAs have different levels of sequence similarity. For example, tRNA and SECIS are more structural and often share lower sequence conservation than snoRNA and miRNA. If we divide our training set into different groups by average sequence similarity, we will have different optimal seeds for each group. However, there is one difficulty behind this strategy. The sizes of available training data can be quite different for homologous ncRNAs in different groups. For example, there are a large number of snoRNAs and miRNAs in the current Rfam database. As their average sequence similarities are high, we will have more training data in that group than in other groups. For ncRNAs lacking enough training data, the HD seed design may be highly biased. With the advances of next-generation sequencing technologies and ncRNA search techniques, we foresee that more and more ncRNAs will be revealed from different species. Enrichment of training data will enable us to design better seeds for ncRNAs with different ranges of sequence similarities in the future.
Our experimental results show that HD seed matching provides an effective and efficient filtration step for genome-scale ncRNA search. Compared to conventional sequence comparison tools, HD seed matching is more sensitive in identifying ncRNAs with low sequence conservation. By designing a long HD seed, we can control the matching probability to random sequences. Thus, integrating HD seed matching and a sensitive local structural alignment tool provides a complementary ncRNA search method to existing sequence alignment-based implementations. Besides FOLDALIGN and PLAST-ncRNA, other local ncRNA structural alignment tools or classification methods that integrate more features can be applied to examining HD seed hits.
We plan to apply this method to ncRNA identification in available transcriptome datasets. It has been reported that a large portion of transcript reads generated by RNA-seq cannot be mapped to annotated features such as protein-coding genes. It is unknown whether those reads are from functional ncRNAs. Our tool can be used to examine whether the transcribed regions have structural conservation in related genomes when BLAST-like tools fail. We also plan to integrate more biological features to remove hits that are not likely to be ncRNAs.
This work was supported, in part, by the NSF CAREER Grant DBI-0953738.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 3, 2012: ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S3.
- Bompfunewerer AF, Flamm C, Fried C, Fritzsch G, Hofacker IL, Lehmann J, Missal K, Mosig A, Muller B, Prohaska SJ, Stadler BM, Stadler PF, Tanzer A, Washietl S, Witwer C: Evolutionary patterns of non-coding RNAs. Theory Biosci. 2005, 123 (4): 301-369. 10.1016/j.thbio.2005.01.002.
- Rivas E, Eddy SR: Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics. 2001, 2: 8. 10.1186/1471-2105-2-8.
- Washietl S, Hofacker IL, Stadler PF: Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci USA. 2005, 102 (7): 2454-2459. 10.1073/pnas.0409169102.
- Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D: Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol. 2006, 2 (4): e33. 10.1371/journal.pcbi.0020033.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
- Lu ZJ, Yip KY, Wang G, Shou C, Hillier LW, Khurana E, Agarwal A, Auerbach R, Rozowsky J, Cheng C, Kato M, Miller DM, Slack F, Snyder M, Waterston RH, Reinke V, Gerstein MB: Prediction and characterization of noncoding RNAs in C. elegans by integrating conservation, secondary structure, and high-throughput sequencing and array data. Genome Res. 2011, 21: 276-285. 10.1101/gr.110189.110.
- Pang KC, Frith MC, Mattick JS: Rapid evolution of noncoding RNAs: lack of conservation does not mean lack of function. Trends Genet. 2005, 22: 1-5.
- Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 2005, 33 (Database issue): D121-D124.
- Ma B, Tromp J, Li M: PatternHunter: faster and more sensitive homology search. Bioinformatics. 2002, 18 (3): 440-445. 10.1093/bioinformatics/18.3.440.
- Buhler J, Keich U, Sun Y: Designing seeds for similarity search in genomic DNA. Proceedings of the Seventh Annual International Conference on Computational Molecular Biology. 2003, ACM Press, 67-75.
- Sun Y, Buhler J: Designing multiple simultaneous seeds for DNA similarity search. Proceedings of the Eighth Annual International Conference on Computational Molecular Biology (RECOMB '04). 2004, ACM Press, 76-84.
- Gardner P, Giegerich R: A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics. 2004, 5: 140. 10.1186/1471-2105-5-140.
- Havgaard JH, Lyngso RB, Stormo GD, Gorodkin J: Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics. 2005, 21 (9): 1815-1824. 10.1093/bioinformatics/bti279.
- Havgaard JH, Torarinsson E, Gorodkin J: Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix. PLoS Comput Biol. 2007, 3 (10): 1896-1908.
- Torarinsson E, Sawera M, Fredholm M, Gorodkin J: Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure. Genome Res. 2006, 16: 885-889. 10.1101/gr.5226606.
- Coenye T, Drevinek P, Mahenthiralingam E, Shah SA, Gill RT, Vandamme P, Ussery DW: Identification of putative noncoding RNA genes in the Burkholderia cenocepacia J2315 genome. FEMS Microbiol Lett. 2007, 276: 83-92. 10.1111/j.1574-6968.2007.00916.x.
- Sankoff D: Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J Appl Math. 1985, 45 (5): 810-825. 10.1137/0145048.
- Sun Y, Buhler J: Choosing the best heuristic for seeded alignment of DNA sequences. BMC Bioinformatics. 2006, 7: 133. 10.1186/1471-2105-7-133.
- Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome Res. 2003, 13: 103-107. 10.1101/gr.809403.
- Higgs PG: RNA secondary structure: physical and computational aspects. Q Rev Biophys. 2000, 33 (3): 199-253. 10.1017/S0033583500003620.
- Chikkagoudar S, Livesay DR, Roshan U: PLAST-ncRNA: Partition function Local Alignment Search Tool for non-coding RNA sequences. Nucleic Acids Res. 2010, 38 (Suppl 2): W59-W63.
- Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics. 2008, 24 (5): 713-714. 10.1093/bioinformatics/btn025.
- Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25. 10.1186/gb-2009-10-3-r25.
- Klein R, Eddy S: RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics. 2003, 4: 44. 10.1186/1471-2105-4-44.
- Buhler J: Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics. 2001, 17 (5): 419-428. 10.1093/bioinformatics/17.5.419.
- Schattner P: Searching for RNA genes using base-composition statistics. Nucleic Acids Res. 2002, 30 (9): 2076-2082. 10.1093/nar/30.9.2076.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.