The reliability and accuracy of BLAST as an identification method depend on several factors. Firstly, the completeness of the reference database is important. Very few entire genomes of CITES-listed species have been sequenced: so far only 130 out of a total of 30713 species. Our pipeline is therefore not intended to handle Whole Genome Shotgun (WGS) data.
Secondly, not all CITES-listed taxa have so far been sequenced for the standard DNA barcoding markers. Species in diverse groups such as Orchidaceae or Primates are sometimes similar, and differences between their standard barcodes may therefore be small. To prevent both type I and type II errors in the identification of difficult-to-distinguish species, specialists of various CITES committees decided that for species that cannot be discriminated based on DNA barcodes, the entire genus (which can be recognized by DNA barcoding) rather than the individual species (which cannot) should be placed on the CITES Appendices. The CITES organization annually updates the contents of its appendices for this reason.
An example case is Cyclemys spp., a genus of freshwater turtles (Geoemydidae): one widespread species, C. dentata, is heavily exploited for food while other species in the genus are rarely traded. The entire genus was placed on Appendix II in 2013. The criteria for amendment of the appendices explicitly state that this action was carried out because enforcement officers are unlikely to be able to distinguish traded material of C. dentata from close relatives (look-alike criteria set out in Annex 2b). In response to this, the default settings of the HTS pipeline use a cut-off value of 3% sequence divergence (that is, a minimum of 97% sequence similarity) to distinguish species from each other by the DNA barcodes obtained. This approach generally works to keep endangered and non-CITES-protected close relatives apart from each other. Below, we explicitly state the cases to which this does not apply. This cut-off value was chosen based on earlier studies that found this divergence to be sufficient to keep the majority of plants and animals apart using the standard matK, rbcL and COI DNA barcoding markers [19, 20].
Thirdly, the quality of identification depends on the length of the DNA barcode sequence used for identification. Smaller fragments have been shown to lack the discriminatory power to distinguish between species in a genus or higher taxon. For this reason, the pipeline discards identifications obtained from matches shorter than 100 bp by default. Finally, to minimize the chance that identifications are based on an erroneous entry, the user should look, where possible, at multiple BLAST results and verify that they are in agreement with each other. The pipeline by default returns the 10 BLAST hits with the lowest e-value (after BLAST filtering); based on these multiple identifications per sequence, the end-user should validate whether an identification is reliable. We recommend that users select BLAST hits with the highest sequence similarity and match length wherever possible. If multiple hits of identical quality are obtained but are assigned to different species, the fragment lacks the discriminatory power to assign the hit to species level. In these cases the user should refrain from assigning a single species and report the genus instead.
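The recommendation above (prefer the hits with the highest similarity and match length, and fall back to the genus when equally good hits disagree) can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline's own code: the `Hit` class, the `consensus_identification` function and the example values are hypothetical, and taxon names are assumed to be plain binomials.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit (hypothetical structure, not the pipeline's own)."""
    species: str     # binomial name, e.g. "Dendrobium nobile"
    identity: float  # percent sequence similarity
    length: int      # match length in bp


def consensus_identification(hits):
    """Return (name, rank) for the best-supported identification.

    Only hits that share the highest (identity, length) pair are considered.
    If they name one species, that species is returned; if they name several
    species of one genus, the genus is returned; otherwise None.
    """
    if not hits:
        return None
    best_quality = max((h.identity, h.length) for h in hits)
    best = [h for h in hits if (h.identity, h.length) == best_quality]
    species = {h.species for h in best}
    if len(species) == 1:
        return species.pop(), "species"
    genera = {name.split()[0] for name in species}
    if len(genera) == 1:
        return genera.pop(), "genus"
    return None  # equally good hits from different genera: no assignment


# Illustrative values only, not results from the paper.
tied = [
    Hit("Dendrobium nobile", 99.0, 150),
    Hit("Dendrobium officinale", 99.0, 150),
    Hit("Dendrobium nobile", 97.5, 120),
]
print(consensus_identification(tied))
```

Ranking on similarity first and match length second is one possible reading of "highest sequence similarity and match length"; a user may prefer a different ordering or a combined score.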
In our experience, virtually no situations have yet occurred in which a non-CITES-protected species could be mixed up with a CITES-protected taxon. The only exceptions concern taxonomic groups that contain domesticated species from Bovidae (wild cattle, goats and sheep) and Canidae (wolves and foxes). The wild species in these taxonomic groups cannot always be distinguished from their domesticated relatives (cows, dogs, domestic goats and sheep) so identification using standard barcoding markers fails. Similar issues arise when trying to determine whether a species is cultivated or not, as standard barcodes do not provide the necessary resolution to distinguish cultivars from samples collected in the wild.
The HTS barcode checker pipeline is the first tool for automated searches for DNA barcodes of CITES-protected taxa in HTS data. On the CITES website, several other online tools are available, such as databases that can be queried for information about trade, management systems, export quota, publications, identification manuals and photographs, but none as yet to search for hits in HTS datasets. The Chinese Academy of Medical Sciences in Beijing produces DNA barcodes from ingredients from Traditional Chinese Medicines and lists these on its website, but here too automatic search tools are not provided.
To compare the speed of the pipeline to current practices, we presented a spreadsheet file with ten taxonomic names (among which two CITES-listed taxa) obtained from a TCM HTS dataset to ten colleagues and let them search for CITES-listed taxa by scrolling through the CITES Appendices using the 'search and find' option in Adobe Reader. Processing time ranged from a little over one minute to slightly under five minutes among the ten participants, and not all participants recovered every CITES-listed taxon. The HTS barcode checker pipeline processed the same dataset in less than ten seconds and successfully retrieved all protected species.
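The lookup that the participants performed by hand amounts to checking each name, and the genus it belongs to, against the list of protected taxa. The sketch below illustrates that idea under simplifying assumptions: `CITES_LISTED` is a two-entry toy excerpt rather than the real CITES checklist, `flag_cites_taxa` is a hypothetical helper, and listings above genus level (such as Orchidaceae) would need a taxonomy lookup that is not shown here.

```python
# Toy excerpt for illustration only; NOT the actual CITES checklist.
# Keys are listed taxa (species or genus), values are the appendix.
CITES_LISTED = {
    "Aquilaria": "II",
    "Cyclemys": "II",
}


def flag_cites_taxa(names, listed=CITES_LISTED):
    """Map each protected name to (matching listed taxon, appendix).

    A name is flagged when either the full name or its genus (first word)
    appears in the listing.
    """
    flagged = {}
    for raw in names:
        parts = raw.split()
        if not parts:
            continue  # skip empty cells
        name = " ".join(parts)  # normalise whitespace
        for candidate in (name, parts[0]):
            if candidate in listed:
                flagged[name] = (candidate, listed[candidate])
                break
    return flagged


# Illustrative names, not the spreadsheet used in the comparison.
names = [
    "Aquilaria sinensis",
    "Citrullus lanatus",
    "Cyclemys dentata",
    "Pseudomonas fluorescens",
]
for name, (taxon, appendix) in flag_cites_taxa(names).items():
    print(f"{name}: listed as {taxon} (Appendix {appendix})")
```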
Here we report the pipeline results for three sequence sets that were based on material confiscated by Dutch customs officials. For each sample the Internal Transcribed Spacer 1 (nrITS1) region was amplified and sequenced using the Ion Torrent PGM platform. The reads were clustered using CD-HIT at 97% sequence similarity. The clusters were identified with the HTS barcode checker pipeline under default settings (maximum e-value of 0.05, minimum of 97% sequence similarity and a hit length of at least 100 bp). The full pipeline results are available in Additional file 1. The clustered FASTA files for all cases are available with the pipeline distribution in the /data folder.
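For readers who want to see how the default settings interact, the following Python sketch applies the three thresholds named above and then keeps the ten hits with the lowest e-value, as described earlier in the text. It is an illustration under assumptions, not the pipeline's implementation: `BlastHit`, `filter_hits` and the constant names are hypothetical, and the example hits are invented.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    """Minimal BLAST hit record (hypothetical, for illustration)."""
    subject: str     # name of the matched reference sequence
    identity: float  # percent sequence similarity
    length: int      # hit length in bp
    evalue: float


# Defaults as stated in the text; the constant names are our own.
MAX_EVALUE = 0.05
MIN_IDENTITY = 97.0
MIN_LENGTH = 100
MAX_HITS = 10


def filter_hits(hits, max_evalue=MAX_EVALUE, min_identity=MIN_IDENTITY,
                min_length=MIN_LENGTH, max_hits=MAX_HITS):
    """Keep hits passing all thresholds, ordered by ascending e-value."""
    kept = [
        h for h in hits
        if h.evalue <= max_evalue
        and h.identity >= min_identity
        and h.length >= min_length
    ]
    kept.sort(key=lambda h: h.evalue)
    return kept[:max_hits]


# Invented example hits.
example = [
    BlastHit("ref_A", 99.2, 180, 1e-80),
    BlastHit("ref_B", 96.5, 200, 1e-70),  # fails similarity threshold
    BlastHit("ref_C", 98.0, 80, 1e-30),   # fails length threshold
    BlastHit("ref_D", 97.0, 100, 1e-40),  # passes: thresholds are inclusive
]
for hit in filter_hits(example):
    print(hit.subject, hit.identity, hit.length, hit.evalue)
```

Treating the thresholds as inclusive is an assumption; the text does not say how hits exactly at a threshold are handled.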
An incense cone was sequenced and clustered. Its manufacturer had provided us with a list of all ingredients, among which was a protected taxon (Aquilaria). Clustering produced a total of 175 non-singleton OCTUs. A total of 99 unique identifications could be obtained by BLASTing using the pipeline. The results, listed in Table 3, indicate that the cone indeed contained species of Aquilaria (Thymelaeaceae), which are all placed on CITES Appendix II. The non-protected plant species specified by the manufacturer were identified as well, thus validating the method.
Wood chips from a confiscated agarwood sample were sequenced. Clustering resulted in a total of 51 non-singleton OCTUs. A total of 26 unique identifications could be obtained by BLASTing the OCTUs, including an identification of an Aquilaria species; the genus is listed on CITES Appendix II. The majority of the other OCTU identifications were from Citrullus and Pseudomonas.
A confiscated Dendrobium stem was sequenced and clustered, which produced a total of 3845 non-singleton OCTUs. A total of 159 unique identifications could be obtained by BLASTing using the pipeline; these included three different Dendrobium species, listed in Table 3. The results indicate that the stem indeed belongs to a member of the genus Dendrobium, though the barcode lacks the discriminatory power to determine the exact species. Since all Orchidaceae are on CITES Appendix II, the sample was lawfully confiscated. Other sequence results included various fungal species.
Although the pipeline presented here is ready to use, several enhancements are possible that would increase usability and impact. For example, although incorrect taxonomic identifications of NCBI GenBank records have previously been noted, no community project exists to record and track such errors. The blacklist used by the HTS barcode checker could be used for communal record keeping, especially as our use of git as a decentralized revision control system provides the ideal infrastructure for this. Conversely, should an alternative community-wide blacklist of NCBI GenBank records come into existence, the HTS barcode checker could be modified to make use of it. We expect the number of users to grow once the HTS barcode checker project is linked from the CITES Virtual College, which would build a community that could 'crowd source' such a blacklist.
Though the HTS barcode checker can be set up to run via CGI or platforms such as Galaxy, a publicly hosted web service would make the pipeline accessible to non-expert users such as customs officers, as it would remove the need for local installations. In addition, this web application could be configured to update the local databases of additional names and the blacklist at frequent intervals, thereby guaranteeing that the user always operates on state-of-the-art knowledge.
Lastly, DNA barcodes of CITES-protected species collected from well-identified specimens should be uploaded in larger numbers to BoLD, where taxonomic names can be updated as needed by third parties. The number of CITES-protected species is currently 820 for mammals, 605 for birds, 722 for reptiles, 81 for amphibians, 20 for sharks, 132 for fishes, 3 for lungfishes, 1 for sea cucumbers, 25 for scorpions and spiders, 69 for insects, 2 for leeches, 37 for clams and mussels, 10 for snails and conches, 1636 for corals and sea anemones, 260 for sea ferns, fire corals and stinging medusae, and 26290 for plants (counts based on [28–37] and the latest proposed changes to the CITES Appendices). Of this total of 30713 CITES-protected species, roughly 16830 (55%) are present in NCBI GenBank with DNA barcodes, and 13883 (45%) remain to be sequenced. Multiple initiatives carried out at The Field Museum and Missouri Botanical Garden (USA), Naturalis Biodiversity Center (the Netherlands), Muséum National d’Histoire Naturelle (France), Smithsonian’s National Museum of Natural History (USA), Zoological Institute of the Russian Academy of Sciences (Russia) and University of Johannesburg (South Africa) are currently producing additional barcode sequences of CITES-listed species. We therefore expect that the current proportion of 45% not yet covered in NCBI GenBank or BoLD will decrease.
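The totals and percentages quoted above can be re-derived from the per-group counts. The short Python check below uses only figures given in the text; the variable names are our own.

```python
# Per-group counts of CITES-protected species, as given in the text.
cites_species = {
    "mammals": 820,
    "birds": 605,
    "reptiles": 722,
    "amphibians": 81,
    "sharks": 20,
    "fishes": 132,
    "lungfishes": 3,
    "sea cucumbers": 1,
    "scorpions and spiders": 25,
    "insects": 69,
    "leeches": 2,
    "clams and mussels": 37,
    "snails and conches": 10,
    "corals and sea anemones": 1636,
    "sea ferns, fire corals and stinging medusae": 260,
    "plants": 26290,
}

total = sum(cites_species.values())
in_genbank = 16830                 # species with DNA barcodes in NCBI GenBank
remaining = total - in_genbank     # species still to be sequenced

pct_covered = 100 * in_genbank / total
pct_remaining = 100 * remaining / total

print(f"total: {total}")
print(f"covered: {in_genbank} ({pct_covered:.1f}%)")
print(f"remaining: {remaining} ({pct_remaining:.1f}%)")
```

The per-group counts sum to the stated total of 30713, and the covered and remaining shares round to the 55% and 45% reported in the text.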