Rapid DNA barcoding analysis of large datasets using the composition vector method
© Chu et al. 2009
Published: 10 November 2009
Skip to main content
© Chu et al. 2009
Published: 10 November 2009
Sequence alignment is the rate-limiting step in constructing profile trees for DNA barcoding purposes. We recently demonstrated the feasibility of using unaligned rRNA sequences as barcodes based on a composition vector (CV) approach without sequence alignment (Bioinformatics 22:1690). Here, we further explored the grouping effectiveness of the CV method in large DNA barcode datasets (COI, 18S and 16S rRNA) from a variety of organisms, including birds, fishes, nematodes and crustaceans.
Our results indicate that the grouping of taxa at the genus/species levels based on the CV/NJ approach is invariably consistent with the trees generated by traditional approaches, although in some cases the clustering among higher groups might differ. Furthermore, the CV method is always much faster than the K2P method routinely used in constructing profile trees for DNA barcoding. For instance, the alignment of 754 COI sequences (average length 649 bp) from fishes took more than ten hours to complete, while the whole tree construction process using the CV/NJ method required no more than five minutes on the same computer.
The CV method performs well in grouping effectiveness of DNA barcode sequences, as compared to K2P analysis of aligned sequences. It was also able to reduce the time required for analysis by over 15-fold, making it a far superior method for analyzing large datasets. We conclude that the CV method is a fast and reliable method for analyzing large datasets for DNA barcoding purposes.
In 2003, Hebert et al  proposed to use a 648-bp region from the 5'-end of the cytochrome c oxidase subunit 1 (COI) gene as a DNA barcode for identifying all metazoan species. The final goal of DNA barcoding is to identify all eukaryotic species [2, 3]. Recently, DNA barcoding has been tested on several groups of organisms including birds , fishes  and amphibians  with promising results. Ratnasingham and Hebert  estimated that the barcode of life data system (BOLD) would eventually generate 100 million records (for COI-barcoding only) twice the current size of the GenBank database, and that enterprise-scale software would be needed to handle such a large dataset. The goal of BOLD is to generate a "taxon ID tree" based on a neighbour-joining (NJ) algorithm for every query sequence for easy identification. As in other traditional methods of tree construction, sequence alignment has to be performed before construction of the K2P/NJ-tree (i.e., a NJ tree based on the Kimura-2-parameter, K2P, distance). In BOLD, alignment is executed based on the hidden Markov models . Due to the high computational burden involved, BOLD has unfortunately been unable to incorporate all data records in constructing the K2P/NJ tree. The short-term solution is to divide the large barcode dataset into several "sub-projects" with a size limit of 5,000 specimens each for analysis . However, as an estimated 200,000 additional barcode records will be entered in the database each year , the limit of 5,000 specimens for each sub-project will be quickly saturated because closely related taxa (sequences) should not be divided into subsets but preferably analyzed together. The long-term solution is therefore to develop more efficient analytical methods as alternatives or supplements for handling such a large dataset.
COI has several claims to be a suitable DNA barcode marker, including ease in amplification across a wide variety of organisms and provision of enough information to enable organisms to be identified to the species level. But it also has its drawbacks, including inherent risks due to maternal inheritance (noticeably failure in detecting hybridization), the presence of pseudogenes (numts), and its inconsistent evolutionary rate among lineages [2, 9, 10]. These disadvantages continue to disappoint biologists hoping to rely on single gene as the sole marker for taxonomic identification [6, 11, 12]. The feasibility of using alternative and additional genes as DNA barcodes has been explored [9, 13, 14]. Ribosomal RNA (rRNA) genes and their internal transcribed spacer (ITS) [6, 15] show promise as DNA barcode markers for distinguishing different organisms and, as COI is highly conserved in plants, barcode markers such as the trnH-psbA spacer and rbcL gene [9, 11, 16] have also been proposed. Yet because of the high frequency of insertions/deletions in the rRNA and the other non-protein-coding markers, sequence alignment is a critical step in the analysis. This, in turn, requires a large amount of additional computational power during the alignment step, further pushing the analytical power of BOLD system toward its limit. The process of sequence alignment is not only very costly in computational power, but has also yet to be standardised. A lack of standard protocols and inconsistencies between the aligned datasets of different laboratories represent a drawback [13, 17–21] to the incorporation of any non-protein-barcoding markers in BOLD. Including markers such as rRNA in BOLD would mean that alignment will be executed automatically and any problems in alignment cannot be adjusted manually. The problems associated with the alignment procedure effectively limit the use of DNA barcoding .
Our previous study  demonstrated that the composition vector (CV) method  was an effective and efficient tree construction method for analyzing rRNA datasets, thereby facilitating the use of these genes as DNA barcodes. We believe that the CV approach can also sidestep the problems associated with sequence alignment in analyzing large datasets of both COI and non-protein-coding barcode markers, the latter of which usually involve sequence length variation even among closely related taxa. In the present study, we analyzed large DNA barcode datasets, each with more than 300 sequences (including non-COI datasets with variable sequence lengths) with the CV method. Sequences from three published DNA barcode datasets available from GenBank, namely birds , fishes  and nematodes , were analyzed. A 16S rRNA dataset of decapod crustaceans containing 466 GenBank sequences and 268 sequences generated in our laboratory was also analyzed. These datasets were chosen because they included the most common genetic markers that have been proposed as DNA barcodes, including COI, 16S rRNA and nuclear SSU rRNA genes, and contained the largest number of DNA sequences, ranging from 349 to 754, so far assembled. The aim of our study was to evaluate how well and how fast the CV method could handle these large barcode datasets without sequence alignment.
Details of datasets and K values used in CV analysis
Avg. sequence length (bp)
No. of classes
No. of orders
No. of families
No. of genera
No. of species
No. of sequences
Grouping and processing time using the K2P (with CLUSTAL W alignment) and CV methods
> 10 h
> 10 h
< 5 min
< 5 min
< 5 min
All the data analyses in this study were preformed on a 1.4 GHz notebook computer. With the K2P/NJ method, the alignment step using CLUSTAL W  alone took more than five hours to complete for the bird dataset, and required more than 10 hours to complete for the other three datasets (Table 2). When the other two faster algorithms MAFFT  and MUSCLE  were used, the processing time was reduced by 3.2 and 7.7 folds respectively, so that between one and two hours were needed for aligning the bird dataset. By contrast, with the CV method it took less than three minutes to analyze this bird dataset, and the analysis of the other three datasets was completed in five minutes (Table 2). Thus, while the CV/NJ method matched the resolving power of the K2P/NJ method in generating a highly similar tree, it took less than 6% of the time needed in sequence alignment on the same computer.
Qi et al  first introduced the CV method to analyze the phylogeny of prokaryotes based on the complete genomes, and this approach has subsequently been applied to analyze the chloroplast genomes [13, 27]. All these studies were based on protein sequences, and a procedure to subtract the random background from the frequencies of oligopeptide strings was used before computation of the CVs in order to diminish the influence of random neutral mutations at the molecular level and to highlight the shaping role of selective evolution. However, in adopting the CV method in analyzing short rDNA sequences for barcoding, Chu et al  have shown that a subtraction procedure for random background does not further enhance the reliability of the groupings. In the present study, we focus on how well the CV method could handle large DNA datasets, by comparing this method with the K2P/NJ method in analyzing four datasets including both rRNA and protein-coding genes. Although in each of the four datasets tested the topologies between the CV/NJ and K2P/NJ trees were not identical, this does not mean that one method obtained better results than the other. The rationale for generating the K2P/NJ trees in DNA barcoding studies or in BOLD is that the K2P/NJ tree construction is relatively simple , so that a query sequence can be rapidly identified to the species level. However, the K2P/NJ tree was not intended to reflect the phylogenetic relationships among the taxa analyzed . Thus any detailed comparison in tree topologies between the CV and the K2P methods is meaningless. In fact, both the CV and K2P methods performed equally well in identifying and discriminating taxa at the genus/species level in the datasets tested (Table 2).
The dataset of nematodes  was used to resolve the phylogenetic relationships among the species, and three different approaches (BI, MP and NJ) were applied. However, the classification of nematodes based on the molecular approach is never an easy task. The phylogenetic position of the family Choanolaimidae, for example, could not be determined based on any of the three approaches. Two species, Trichinella spiralis and Trichuris muris, could only be assigned into clade 2 based on the BI tree but not on the MP and NJ trees . The problem might be due to the inadequacy of the SSU rDNA sequences in resolving the phylogeny of this group. The evolutionary rates of this gene may also differ among different nematode clades . On the other hand, Floyd et al  proposed to use the number of SSU rDNA sequence differences to define "molecular operational taxonomic units" rather than adopting the classical species concepts in nematode biodiversity studies. It is not surprising to find inconsistent tree topologies based on different analytical methods, including the CV method, from this nematode dataset. In fact, the CV method could generate a tree topology similar to those generated by the three phylogenetic reconstruction methods, by clustering all of the taxa into 12 main clades (Table 2).
Besides its impressive resolving power, the main advantage of the CV/NJ method over the K2P/NJ algorithm is its speed. In every dataset analyzed in this study, the CV/NJ method could generate trees in less than five minutes on a 1.4 GHz notebook computer. The time needed is at least 15-fold more using the K2P/NJ method, which requires sequence alignment. Given its high analytical speed, the CV method could profitably serve as a quick barcoding identification tool capable of matching a query sequence against a pre-installed reference dataset from BOLD on a notebook computer for field workers who may not have internet access. The CV method not only saves time by omitting the alignment step, but also avoids the introduction of any human errors during the alignment process. Where no universal alignment parameters can be defined, every gap representing insertion/deletion that is assigned to a DNA sequence should be checked by eye carefully, and even this step might be subjective . Manual checking becomes a very tedious and laborious procedure when dealing with a large dataset generated by multiple alignments, but is a necessary step for verifying the reliability of the dataset and its suitability for further analysis. The CV method does away with this rate-limiting, tedious step in the tree construction procedure for DNA barcoding.
The CV method was first demonstrated to facilitate the use of rDNA datasets for barcoding purposes since no sequence alignment was necessary . In the present study, we further demonstrated the power of the CV method in analyzing large DNA barcode datasets, regardless of the type of gene markers used. In all cases tested, the CV/NJ method achieved tree topologies similar to those based on the traditional methods which involve sequence alignment, with compatible grouping effectiveness. Furthermore, when the CV method was used, the computational time was at least 15-fold shorter than that based on the K2P method. Besides its effective resolving power and very fast speed of analysis, the CV/NJ method can routinely generate reliable and reproducible trees by eliminating human errors in the multiple alignment process. To conclude, we propose that the CV/NJ method can be used as an effective and efficient tree construction algorithm in analyzing DNA barcode datasets.
Sequences from three published datasets, including birds , fishes  and nematodes  were downloaded from GenBank for analysis. We also assembled a 16S rRNA dataset of decapod crustaceans by including 466 sequences from GenBank and 268 sequences generated in our laboratory. This dataset is available from the first author upon request.
The principle and details of the composition vector (CV) method have been fully described previously [13, 22, 28, 29], and the program is publicly available at http://tlife.fudan.edu.cn/cvtree. In short, for a sequence of gene of length L, the frequency of the appearance of oligonucleotide strings of a fixed length K was determined. The K value used for each dataset was calculated by  and it ranged from 9 to 10 among the four datasets (Table 2). The total number of N possible types of the K strings was 4 K . The frequency of each of the N kinds in a given DNA sequence was determined. We then placed the frequencies of all possible K-strings in a fixed order to obtain a CV of dimension 4 K for each sequence. The correlation C(A, B) between two sequences A and B was determined by taking the projection of one vector on another, and the distance between the two was defined as D = (1 - C)/2. In this way, the sequences difference could be quantitatively evaluated. After constructing a distance matrix for all sequences in a dataset, the neighbour-joining (NJ)  analysis implemented in Phylip 3.63  was used to construct a profile tree for each dataset. The CV/NJ trees were then compared with the corresponding K2P/NJ trees constructed as follows. The DNA sequences in each dataset were first aligned using the multiple alignment program CLUSTAL W 1.5c , and the NJ trees were constructed using Mega 3  based on the Kimura-2-parameter distance model . We also attempted to align the sequences using two other algorithms, MAFFT  and MUSCLE . All the computation was performed using a 1.4 GHz notebook computer with 512 MB of RAM.
The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project no. CUHK4419/04M).
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 14, 2009: Biodiversity Informatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S14.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.