- Open Access
CSA: An efficient algorithm to improve circular DNA multiple alignment
© Fernandes et al; licensee BioMed Central Ltd. 2009
- Received: 23 December 2008
- Accepted: 23 July 2009
- Published: 23 July 2009
The comparison of homologous sequences from different species is an essential approach to reconstruct the evolutionary history of species and of the genes they harbour in their genomes. Several complete mitochondrial and nuclear genomes are now available, increasing the importance of using multiple sequence alignment algorithms in comparative genomics. MtDNA has long been used in phylogenetic analysis, and errors in the alignments can lead to errors in the interpretation of evolutionary information. Although a large number of multiple sequence alignment algorithms have been proposed to date, they all deal with linear DNA and cannot directly handle circular DNA. Researchers interested in aligning circular DNA sequences must first rotate them to the "right" place using an essentially manual process before they can use multiple sequence alignment tools.
In this paper we propose an efficient algorithm that identifies the most interesting region to cut circular genomes in order to improve phylogenetic analysis when using standard multiple sequence alignment algorithms. This algorithm identifies the largest chain of non-repeated longest subsequences common to a set of circular mitochondrial DNA sequences. All the sequences are then rotated and made linear for multiple alignment purposes.
To evaluate the effectiveness of this new tool, three different sets of mitochondrial DNA sequences were considered. Other tests considering randomly rotated sequences were also performed. The software package Arlequin was used to evaluate the standard genetic measures of the alignments obtained with and without the use of the CSA algorithm with two well known multiple alignment algorithms, the CLUSTALW and the MAVID tools, and also the visualization tool SinicView.
The results show that a circularization and rotation pre-processing step significantly improves the efficiency of publicly available multiple sequence alignment algorithms when used in the alignment of circular DNA sequences. The resulting alignments lead to more realistic phylogenetic comparisons between species.
- Suffix Tree
- Circal Package
- Cyclic Sequence
- Longest Common Subsequence
- Common Block
Genomic sequence alignment tools have been playing an important role in comparative genomics and phylogenetic reconstruction. However, traditional sequence alignment algorithms based on dynamic programming are very inefficient when long genome sequences need to be aligned. To tackle this problem several heuristic-based methods have been proposed. The most popular progressive multiple sequence alignment (MSA) method is ClustalW [1, 2], to which access is provided by a number of web portals. Other methods like T-COFFEE [3], DIALIGN [4], MUSCLE [5], MLAGAN [6], MAVID [7], and MAUVE [8] are also widely used. Besides being heuristic and sometimes producing alignments of poor biological plausibility, these tools were developed to deal only with linear genomic sequences. When applied to circular genomes, the results become extremely sensitive to the exact place where the genomic sequence begins.
This limitation is very important since circular DNA sequence alignments are central to a number of biological problems. Every cell has some kind of genome that is circular. Prokaryotic genomes are circular and many bacteria possess extra circular DNA molecules, the plasmids. Eukaryotic cells also contain organelles which possess circular DNA molecules: the mitochondrial DNA (mtDNA) inside mitochondria in all cells; and chloroplast DNA inside chloroplasts in plant cells.
MtDNA has long been used for phylogenetic analyses. In fact, the absence of recombination in this genome enables an easy and direct inference of the phylogenetic evolution, and its fast mutation rate leads to a high discriminative power. Until recently, phylogenetic reconstructions were based on certain regions of the mtDNA molecule, mainly the protein-coding gene cytochrome b when comparing different species [9] and hypervariable regions of the D-loop when comparing human populations (e.g. [10]). But the recent high throughput of automatic sequencing techniques is offering the possibility to study complete mtDNA genomes in humans (see the review in [11]) and in other species (e.g. Mus musculus [12]). By the end of April 2009 there were around 5,650 complete human mtDNA genomes in GenBank [13] and 1,750 complete mtDNA genomes serving as reference sequences for diverse species in RefSeq [14]. The blind application of standard phylogenetic analyses to these massive datasets, without regard to the circularity of these molecules, will lead to the overestimation of genetic distances between species.
Sequencing, the technique employed to determine the bases constituting the DNA molecule, is performed in fragments, generally overlapping at the ends so that an order can be inferred for constructing the map of the molecule. But the place where a circular genome begins is totally irrelevant and arbitrary. For instance, the first team sequencing the human mtDNA [15] decided to begin numbering more or less in the middle of a region designated control region or D-loop; however, the chimpanzee mtDNA sequence has position number one placed in tRNA phenylalanine, which corresponds to position 577 in human mtDNA. Due to this arbitrary definition of the first position, a falsely high genetic distance would be obtained from the alignment between human and its closest species when using available sequence alignment tools. A total of 576 gaps would be added to the beginning of the chimpanzee sequence and around 563 gaps would be added to the end of the human sequence.
Algorithms for the problem of cyclic sequence alignment have already been proposed in the computer science research field. However, like optimal MSA methods, most existing optimal methods that handle this kind of sequences are very time consuming and seldom used. A simple extension of a general MSA dynamic programming algorithm can be used to compute the edit distance between two cyclic sequences, but requires a quadratic time computation complexity [16, 17]. Several other algorithms that explore suboptimal solutions have also been proposed [18, 19], reducing the practical execution time. However, these works present experiments that consider only the cyclic use of the Levenshtein metric .
Based on algorithms closely related to the ideas described above, two software packages have been recently proposed to align circular DNA genome sequences: the Circal package  and the Cyclope package . The algorithm implemented in the Circal package uses a complex gap cost function and can only deal with short sequences, less than a thousand characters, due to its time complexity. The Cyclope package includes an implementation of an exact and a heuristic method with time complexities that are prohibitive if it is used to align several sequences with several thousands of base pairs. For this particular package, the authors claim that it should be used only to obtain a rough first solution of the multiple alignment and that the sequences should be realigned with a standard linear alignment package like ClustalW.
In this paper, we present the CSA tool (http://kdbio.inesc-id.pt/software/csa/), a very efficient algorithm that finds the best rotation for a set of circular genome sequences that are to be aligned. First, the genomic sequences are circularized. In a second stage the best rotation is calculated based on the largest chain of non-repeated blocks that belongs to all the sequences. These maximum common blocks are obtained with the help of a generalized cyclic suffix tree, which is a new concept introduced in this work. At the end of the process, the users can visualize all the identified common blocks, obtaining a precise idea of how these regions are conserved along the genomic sequences. The new rotated sequences are made available for download and can be submitted to publicly available MSA tools. At the developed website, several commonly used MSA and visualization tools are proposed.
The purpose of the proposed algorithm is to find the best rotation among all the possible rotations of each circular genome sequence, in order to improve subsequent multiple sequence genome alignment. Unlike previous algorithms that pursued the same goal, the proposed algorithm is, to the best of our knowledge, the first one that is able to do this task in linear time. This was accomplished by employing the highly efficient suffix tree data structure .
In general terms, a suffix tree for a given string is an advanced data structure shaped like an upside down tree that stores all the suffixes of the string and that can be used to efficiently solve many complex string problems. An in-depth explanation of this data structure is outside the scope of this article, but a detailed overview of suffix trees including construction methods and applications can be found in a number of sources (e.g. ). In this work we follow many of the definitions presented in that reference. In particular, strings will be denoted as sequences, which correspond to DNA sequences.
Cyclic Suffix Trees
To be able to represent all the rotations of a cyclic sequence S of length n, we introduce the concept of cyclic suffix tree. A cyclic suffix tree is a suffix-tree-like structure which represents all the rotations of the sequence (instead of all its suffixes, as is normal for suffix trees). The construction algorithm follows Ukkonen's suffix tree construction method  but with some subtle modifications, namely in the implementation of suffix links and open leaves (see  for a detailed description of these concepts). In our case, suffix links at leaves are treated as connecting successive rotations instead of successive suffixes. The role of the open leaves is also changed so that the resulting path label from the root to the end of the leaf has always the same length, n. In this way, if a new leaf for the character at position i is created at depth d, the right pointer of that leaf will be i + (n - d - 1). The result is that all the leaves are at the same depth in the tree, which corresponds to the size of the original sequence. When accessing characters from the edge labels, if a pointer indicates a position k that reaches beyond the end of the original sequence (i.e., k > n), then we must subtract from that position the size of the sequence (i.e., k = k - n).
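The wrap-around rule for edge labels can be illustrated with a few lines of Python. This is only a sketch under our own assumptions: it uses 0-based positions (so the wrap is written as a modulo instead of "k = k - n"), and the function names leaf_right_pointer and edge_label are hypothetical, not taken from the CSA source code.

```python
def leaf_right_pointer(i, d, n):
    """Right pointer of a leaf created for the character at position i,
    at depth d, in a cyclic sequence of length n (formula from the text)."""
    return i + (n - d - 1)


def edge_label(seq, left, right):
    """Read the edge label seq[left..right], wrapping past the end of seq."""
    n = len(seq)
    # A position k beyond the end of the sequence maps back to k - n,
    # written here as k % n because positions are 0-based.
    return "".join(seq[k % n] for k in range(left, right + 1))


seq = "AAABA"
n = len(seq)
i, d = 3, 2                       # leaf for the 'B', hanging below path "AA"
right = leaf_right_pointer(i, d, n)
label = edge_label(seq, i, right)
# The leaf edge always completes the path label to exactly n characters.
assert d + len(label) == n
```

With these values the leaf edge reads past the end of the sequence and back to its first character, so the path from the root spells a full rotation of length n.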
The construction algorithm
Based on the previous definitions, the construction algorithm proceeds as originally proposed by Ukkonen, producing all the n rotations of the sequence in linear time with no additional effort. In some cases, as a final step, we still need to perform an additional pass through S. Take for example the sequence AAABA. The construction algorithm, as proposed by Ukkonen, would stop at the internal node with path label A, and would not report the rotation AAAAB. So, we need to match again all the characters from the beginning of the sequence until the (n-1)-th position to create the last missing rotation. This can easily be done by constructing the cyclic suffix tree of the sequence concatenated with itself (i.e., SS). Using this technique, the algorithm runs in time proportional to 2n, so the time complexity remains linear.
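The effect of working on the concatenation SS can be seen without building any tree: every rotation of S is a window of length n in SS. The Python sketch below only enumerates those windows (the helper name rotations is ours); it takes quadratic time overall and is not the linear-time tree construction described above.

```python
def rotations(seq):
    """All n rotations of seq, read as length-n windows of seq + seq."""
    n = len(seq)
    doubled = seq + seq
    return [doubled[i:i + n] for i in range(n)]


# The rotation AAAAB starts at the last position of AAABA and is only
# reachable by reading past the end of the sequence.
all_rotations = rotations("AAABA")
assert "AAAAB" in all_rotations
```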
Generalized Cyclic Suffix Trees
We have so far presented the algorithm to build a cyclic suffix tree for a single sequence, in linear time. However what we want is to efficiently obtain the best rotation for an entire set of cyclic sequences. For this we build a tree called a generalized cyclic suffix tree. A generalized cyclic suffix tree is a tree that stores all the rotations of a set of sequences. In this representation, a node can belong to several different sequences at the same time. Each node in the tree is marked, using a bit vector, with the identifiers of all the sequences that contain that node. A linear time algorithm for the construction of a generalized suffix tree for a set of sequences can be found in . The implementation used in this work is a generalization to deal with cyclic sequences.
Finding the best rotation
The general idea to obtain the best rotation for each sequence based on all the others is to find the largest chain of longest common subsequences that belongs to all the sequences and then use the position of that highly conserved block chain in each sequence to establish the cutting point. We start by retrieving all the nodes that belong simultaneously to all the sequences. The tree nodes carry a bit vector with this information, so we only need to perform a depth-first search on the tree starting at the root and count the number of sequences in each node. When the sequence count in a node falls below the total number of sequences, we don't need to search the children of that node because their count can only be lower than or equal to the count of the parent. Using the suffix links of those nodes, it is now possible to discard all the nodes whose path label corresponds to a suffix of the path label of another node. The result is the set of all maximal blocks common to all the sequences.
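The selection of maximal common blocks can be mimicked by brute force on very small inputs. The Python sketch below is not the suffix tree traversal described above: it enumerates every cyclic substring, marks each one with a bit mask of the sequences that contain it (playing the role of the bit vector on the tree nodes), and keeps the common substrings that are not contained in a longer common substring. The function names are hypothetical and the approach needs quadratic space, so it is only meant to make the definitions concrete.

```python
def cyclic_substrings(seq):
    """Every substring of the cyclic sequence seq, up to length len(seq)."""
    n = len(seq)
    doubled = seq + seq
    return {doubled[i:i + k] for i in range(n) for k in range(1, n + 1)}


def common_blocks(seqs):
    """Substrings present in all sequences, found with one bit per sequence."""
    full_mask = (1 << len(seqs)) - 1
    marks = {}
    for idx, seq in enumerate(seqs):
        for word in cyclic_substrings(seq):
            marks[word] = marks.get(word, 0) | (1 << idx)
    return {word for word, mask in marks.items() if mask == full_mask}


def maximal_blocks(common):
    """Drop every common block that lies inside a longer common block."""
    return {w for w in common
            if not any(w != v and w in v for v in common)}


blocks = maximal_blocks(common_blocks(["ACACG", "CGTGA", "TGAC"]))
assert blocks == {"GAC"}
```

For the three toy sequences ACACG, CGTGA and TGAC, the only maximal common block is GAC.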
To improve the results three other steps were included. The first step removes the nodes that appear more than once in at least one of the sequences. This is important because a repeated subsequence that appears in multiple positions inside a sequence could lead to wrong alignments. At this stage, we are left with all the unique maximal blocks common to all the sequences. Next, the second step groups together the blocks that appear consecutively and in the same order in all the sequences. The third and last step consists of taking the largest chain of these blocks and setting its start position on each circular sequence as the start position of the new linear sequence.
For the identification of closely related regions the minimum block size is not limited. However, since all the suffixes of common subsequences are automatically excluded when the suffix tree is analysed, the longest common subsequences are usually at least 5 or 6 nucleotides long. The only restriction that was included limits the maximum distance between two consecutive blocks in a chain to 10 nucleotides. Since the purpose of the algorithm is to identify the best rotation for all the sequences based on a similar region, this distance restriction is meant to prevent more complex biological events, like gene inversions, deletions or insertions, from playing a role at this stage. These events will be detected in the final multiple alignment. The length of the chain, or common region, is the sum of the lengths of each one of the intervening blocks.
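The filtering and chaining steps can be sketched as follows in Python. This is a simplified illustration under our own assumptions, not the CSA implementation: blocks are given as plain strings, positions are found by direct string search, the first sequence is used as the reference order, and two blocks are linked only when the second one starts at most max_gap bases after the end of the first one in every sequence. All function names are hypothetical.

```python
def cyclic_positions(seq, block):
    """0-based start of every occurrence of block in the cyclic sequence."""
    text = seq + seq[:len(block) - 1]
    return [i for i in range(len(seq)) if text.startswith(block, i)]


def unique_blocks(seqs, blocks):
    """Keep the blocks that occur exactly once in every sequence."""
    return [b for b in blocks
            if all(len(cyclic_positions(s, b)) == 1 for s in seqs)]


def chains(seqs, blocks, max_gap=10):
    """Group unique blocks that follow each other in all the sequences."""
    if not blocks:
        return []
    start = {(s, b): cyclic_positions(s, b)[0] for s in seqs for b in blocks}
    ordered = sorted(blocks, key=lambda b: start[(seqs[0], b)])
    k = len(ordered)

    def linked(a, b):
        # Cyclic distance from the end of a to the start of b, per sequence.
        return all((start[(s, b)] - start[(s, a)] - len(a)) % len(s) <= max_gap
                   for s in seqs)

    # links[i] tells whether block i is followed by block i + 1 (circularly).
    links = [k > 1 and linked(ordered[i], ordered[(i + 1) % k])
             for i in range(k)]
    if all(links):
        return [ordered]
    first = links.index(False) + 1   # start right after a broken link
    result, current = [], []
    for j in range(k):
        i = (first + j) % k
        current.append(ordered[i])
        if not links[i]:
            result.append(current)
            current = []
    return result


def chain_length(chain):
    """Length of a chain: the sum of the lengths of its blocks."""
    return sum(len(b) for b in chain)
```

The largest chain is then obtained with max(chains(seqs, blocks), key=chain_length), and the start of its first block gives the cutting point in each sequence.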
Consider the example represented in Figure 3 where sequences ACACG, CGTGA and TGAC correspond to a linearization of three DNA cyclic sequences. There are three common blocks to all these sequences: GAC, AC and C. Since AC and C are both suffixes of GAC, they are removed. The remaining block does not occur more than once in the same sequence, so repetitions are not observed. At the end, the single sequence GAC is reported as the largest block chain, leading to the following rotated and linear set of sequences: GACAC, GACGT and GACT.
Multiple Sequence Alignment
After finding the largest chain of unique common blocks belonging to all the sequences that were circularized, they are again made linear by cutting these sequences at the starting position of the first block from that common chain. The multiple alignment itself can then be easily performed by any linear multiple alignment algorithm. At the CSA tool web site several multiple alignment methods and visualization tools are suggested for further sequence analysis.
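The final cut is a plain rotation of each sequence at the position where the chosen block starts. A minimal Python sketch, assuming the block occurs exactly once in the cyclic sequence (the helper name cut_at_block is ours):

```python
def cut_at_block(seq, block):
    """Rotate the cyclic sequence so that it starts with block."""
    # Searching in seq + seq also finds a block that wraps around the end.
    start = (seq + seq).find(block)
    if start < 0:
        raise ValueError("block not found in the cyclic sequence")
    return seq[start:] + seq[:start]


rotated = [cut_at_block(s, "GAC") for s in ["ACACG", "CGTGA", "TGAC"]]
assert rotated == ["GACAC", "GACGT", "GACT"]
```

Applied to the three toy sequences of the example with the block GAC, this reproduces the rotated set GACAC, GACGT and GACT; the rotated sequences can then be written out in Multi-FASTA format and passed to any linear aligner.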
The CSA tool interfaces
The genomic sequences can be submitted in the Multi-FASTA format by uploading a file or by pasting the sequences in a text window. The size of the chain of blocks displayed in the output can be specified by the user. By default chains with size higher than fifteen bases are selected and displayed. The minimum size allowed for a chain with only one block is eight since finding conserved blocks of this size in genomic sequences with several thousand base pairs is still statistically significant.
The output page is divided into three main areas where data are displayed. At the beginning the user can find tool execution statistics, including the size and description of each input sequence, the algorithm running time, information about the 20 longest blocks, and any processing errors that may have occurred. After this general information, the alignment map of the blocks in all the sequences after their optimal rotation is presented, together with a table with all the block lengths and positions, following the same colour scheme as the alignment map. The colour of the blocks in the results page is not related to their length. The blocks are coloured in a rainbow-like fashion by their positions relative to the first (top-most) sequence. This position-based colouring is more useful than a length-based colouring because it makes block transpositions among sequences much easier to detect. Before clicking on the alignment image, the table on the right shows the blocks sorted by their length. By left-clicking over a specific section of one of the sequences in the alignment map, the positions table automatically sorts its rows to reflect the correct order of the blocks inside that sequence and scrolls to show the information of the blocks from the selected region/sequence. All the previously described data are available for download at the end of the page.
At the CSA tool web page the user can also find a user manual, the algorithm source code and some example sequences.
Table: Species, GenBank accession numbers and mtDNA genome size (bp) used for the tests. The sequences are grouped in three sets: First set (Primates); Second set (Mammals, including Canis lupus familiaris); Third set (Distantly related sequences, including the 16 Primates sequences of the first set).
In order to get a sense of the statistical significance of the CSA improvement relative to the random situation, tests with 50 sets of control sequences with random cuts (using the third set described) were also conducted. For these sets, the alignment scores and the consensus length obtained when using the ClustalW tool with and without the CSA algorithm were compared.
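Control sets of this kind can be produced by rotating every sequence at a random position. The Python sketch below shows one way to do it; the function names, the fixed seed and the uniform choice of the cut position are our assumptions, since the text does not state how the random cuts were drawn.

```python
import random


def random_cut(seq, rng):
    """Rotate a circular sequence at a uniformly chosen cut position."""
    k = rng.randrange(len(seq))
    return seq[k:] + seq[:k]


def control_sets(seqs, n_sets=50, seed=0):
    """Build n_sets copies of the data set, each with fresh random cuts."""
    rng = random.Random(seed)   # fixed seed so the controls are reproducible
    return [[random_cut(s, rng) for s in seqs] for _ in range(n_sets)]


controls = control_sets(["ACGTAC", "GGCAT"])
assert len(controls) == 50
```

Each control set contains the same circular molecules as the input, only linearized at different starting points, so any difference in alignment score is due to the cut positions alone.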
We also tried to compare the results obtained with the CSA tool followed by a linear DNA multiple sequence aligner with the results obtained by the algorithms available in the Cyclope package. However, this comparison could not be performed because this software package is not able to deal with sequences of a few thousand base pairs, such as the mtDNA sequences. The recommendation obtained from the tool's authors was to perform a manual rotation based on the gene positions and then use a linear DNA multiple sequence alignment algorithm.
The circularization and rotation in CSA
Note, however, that when more distantly related sequences are considered, the common blocks correspond to single blocks of small size (around 8 bp). Blocks of this size are still statistically significant for mitochondrial DNA, but not for bacterial genomes with millions of base pairs. Additional care must therefore be taken in the analysis of the sizes of the identified common blocks when using this algorithmic approach with larger circular genomes from distantly related organisms. In these cases we are dealing with the multiple sequence alignment of distantly related genomes, which is still an open problem.
Comparing alignment results between alignment tools and with or without CSA
Some biological properties can help in the evaluation of the alignment results: (1) a single deletion of multiple bases is more plausible than multiple deletions interspersed with individual nucleotides; (2) a higher transition/transversion ratio is expected, as substitutions between bases of the same type are more common than substitutions between different types; (3) the triplet constraint in protein-coding genes makes deletions in multiples of three more probable than deletions of other lengths.
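These properties can be checked on a pair of aligned sequences with very little code. The Python sketch below counts transitions and transversions and lists the lengths of the gap runs; it is an illustration with hypothetical helper names, it ignores columns containing gaps or ambiguity codes when counting substitutions, and it is not how Arlequin computes its measures.

```python
import re

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv(a, b):
    """Count transitions and transversions between two aligned sequences."""
    ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue                       # match, gap or ambiguity code
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1                        # same base type: transition
        else:
            tv += 1                        # purine <-> pyrimidine
    return ts, tv


def gap_runs(aligned):
    """Lengths of the consecutive runs of '-' in one aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned)]


assert ts_tv("ACGTACGTAC", "GCGTAAGT--") == (1, 1)
assert gap_runs("AC---GT-A") == [3, 1]
```

A plausible alignment would be expected to show few, long gap runs (property 1), more transitions than transversions (property 2) and, inside protein-coding genes, gap runs whose lengths are multiples of three (property 3).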
Genetic diversity standard measures for Primates, Mammals and Primates with more distantly related sequences, aligned with several alignment tools without and after circularization and rotation in CSA.
Measure | ClustalW | CSA + ClustalW | MAVID | CSA + MAVID
First set (Primates)
Mean no. of pairwise differences | 4303 +/- 1939 | 4084 +/- 1840 | 4391 +/- 1978 | 4223 +/- 1903
Nucleotide diversity | 0.239 +/- 0.120 | 0.234 +/- 0.118 | 0.239 +/- 0.121 | 0.237 +/- 0.120
Second set (Mammals)
Mean no. of pairwise differences | 5640 +/- 2592 | 5386 +/- 2475 | 6200 +/- 2849 | 6044 +/- 2778
Nucleotide diversity | 0.293 +/- 0.152 | 0.289 +/- 0.150 | 0.285 +/- 0.148 | 0.290 +/- 0.150
Third set (Primates + Drosophila melanogaster + Gallus gallus + Crocodylus niloticus)
Mean no. of pairwise differences | 5892 +/- 2631 | 5568 +/- 2486 | 7018 +/- 3133 | 6729 +/- 3004
Nucleotide diversity | 0.295 +/- 0.147 | 0.276 +/- 0.138 | 0.239 +/- 0.119 | 0.241 +/- 0.120
When comparing different alignment tools, ClustalW adds considerably fewer indels than MAVID. A probable cause for this is that MAVID was especially designed for the alignment of large genomes, while ClustalW is much more conservative and takes longer to achieve results.
Comparing alignment results in the 50 sets of control sequences with random cuts
We have demonstrated that the essential step of circularizing and rotating the mtDNA molecules prior to their alignment can significantly improve the efficiency of current multiple sequence alignment tools, developed for the alignment of linear DNA molecules. This pre-processing step leads to more accurate phylogenetic comparisons between species.
To the best of our knowledge the CSA tool is the only web-based tool that obtains the best rotation of a set of circular DNA sequences in a very efficient way. The new rotated sequences are made available for further processing, and a picture of all conserved blocks for all the sequences can be found on the result page and can be viewed as a first draft of a future multiple sequence alignment.
Future developments of alignment tools should include more real biological mutation constraints, enabling the use of different assumptions in the different parts of the molecules. It is clear that as sequencing strategies advance further, more information will be obtained for complete genomes, which have necessarily a diverse composition. This is clearly the case of the circular molecules of mtDNA and bacterial genomes being rapidly characterized. For instance, non-coding regions could have a less restrictive rate of gap opening; protein-coding genes should incorporate the rule of multiple of three gaps and be less restrictive for substitution at the third-codon position; third-dimension structure can give additional information for the rRNA and tRNA genes alignment.
Project name: CSA: Cyclic DNA Sequence Aligner
Project home page: http://kdbio.inesc-id.pt/software/csa/
Operating system(s): Platform independent
Programming language: C
Other requirements: none
Any restrictions to use by non-academics: Free downloads and usage for academics only
We thank INESC-ID for providing computing resources. This project was supported by the ARN project – Algorithms for the identification of genetic Regulatory Networks (PTDC/EIA/67722/2006) from the Portuguese Science Foundation (FCT). FCT partially supports IPATIMUP through Programa Operacional Ciência, Tecnologia e Inovação (POCTI), Quadro Comunitário de Apoio III.
- Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 2003, 31: 3497–3500.
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680.
- Notredame C, Higgins DG, Heringa J: T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000, 302(1):205–17.
- Brudno M, Chapman M, Göttgens B, Batzoglou S, Morgenstern B: Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics 2003, 4: 66.
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32(5):1792–1797.
- Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 2003, 13(4):721–731.
- Bray N, Pachter L: MAVID: constrained ancestral alignment of multiple sequences. Genome Res 2004, 14: 693–699.
- Darling AC, Mau B, Blattner FR, Perna NT: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 2004, 14: 1394–1403.
- Castresana J: Cytochrome b phylogeny and the taxonomy of great apes and mammals. Mol Biol Evol 2001, 18: 465–471.
- Richards M, Macaulay V, Hickey E, Vega E, Sykes B, Guida V, Rengo C, Sellitto D, Cruciani F, Kivisild T, Villems R, Thomas M, Rychkov S, Rychkov O, Rychkov Y, Gölge M, Dimitrov D, Hill E, Bradley D, Romano V, Calì F, Vona G, Demaine A, Papiha S, Triantaphyllidis C, Stefanescu G, Hatina J, Belledi M, Di Rienzo A, Novelletto A, Oppenheim A, Nørby S, Al-Zaheri N, Santachiara-Benerecetti S, Scozari R, Torroni A, Bandelt HJ: Tracing European founder lineages in the Near Eastern mtDNA pool. Am J Hum Genet 2000, 67: 1251–1276.
- Pereira L, Freitas F, Fernandes V, Pereira JB, Costa MD, Costa S, Máximo V, Macaulay V, Rocha R, Samuels DC: The diversity present in 5,140 human mitochondrial genomes. Am J Hum Genet 2009, 84: 628–640.
- Goios A, Pereira L, Bogue M, Macaulay V, Amorim A: mtDNA phylogeny and evolution of laboratory mouse strains. Genome Res 2007, 17: 293–298.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res 2008, (36 Database):D25–30.
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, (35 Database):D61–65.
- Anderson S, Bankier AT, Barrell BG, de Bruijn MH, Coulson AR, Drouin J, Eperon IC, Nierlich DP, Roe BA, Sanger F, Schreier PH, Smith AJ, Staden R, Young IG: Sequence and organization of the human mitochondrial genome. Nature 1981, 290: 457–465.
- Maes M: On a cyclic string-to-string correction problem. Inform Process Lett 1990, 35(2):73–78.
- Mollineda RA, Vidal E, Casacuberta F: Cyclic sequence alignments: approximate versus optimal techniques. Int J Patt Recogn and Artificial Intelligence 2002, 16(3):291–299.
- Bunke H, Bühler U: Applications of approximate string matching to 2D shape recognition. Patt Recogn 1993, 26(12):1797–1812.
- Mollineda RA, Vidal E, Casacuberta F: Efficient techniques for a very accurate measurement of dissimilarities between cyclic patterns. In Advances in Pattern Recognition. Volume 1876. Lecture Notes in Computer Science. Springer; 2000:337–346.
- Fritzsch G, Schlegel M, Stadler PF: Alignments of Mitochondrial Genome Arrangements: Applications to Metazoan Phylogeny. J Theor Biol 2006, 240: 511–520.
- Mosig A, Hofacker IL, Stadler PF: Comparative Analysis of Cyclic Sequences: Viroids and other Small Circular RNAs. Proceedings of the German Conference on Bioinformatics: 20–22 September 2006; Tübingen P-83: 93–102.
- Weiner P: Linear Pattern Matching Algorithm. Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory: 15–17 October 1973; Iowa 1973, 1–11.
- Gusfield D: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. New York: Cambridge University Press; 1997.
- Ukkonen E: On-line Construction of Suffix Trees. Algorithmica 1995, 14(3):249–260.
- Excoffier L, Laval G, Schneider S: Arlequin ver. 3.0: An integrated software package for population genetics data analysis. Evolutionary Bioinformatics Online 2005, 1: 47–50.
- Shih AC, Lee DT, Lin L, Peng CL, Chen SH, Wu YW, Wong CY, Chou MY, Shiao TC, Hsieh MF: SinicView: a visualization environment for comparisons of multiple nucleotide sequence alignment tools. BMC Bioinformatics 2006, 7: 103.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.