Tool development
Keanu is written in Python 3.5 and is available with an example dataset at https://github.com/IGBB/keanu (Fig. 1). It can be run on any system where Python is available, which includes Windows, macOS, and Linux. The program works by treating taxonomy as a tree. A root taxon in the NCBI taxonomy database forms the base of the tree and, as the tree branches out, classifications become more specific, from superkingdom to species. In the description of how Keanu works, “source” is used to refer to the classification immediately above the current classification, and “descendant” is used to refer to the classification immediately below the current classification. For example, species is the descendant of genus, and superkingdom is the source of kingdom.
Keanu’s implementation uses two classes to store the data pulled from NCBI. A Taxon object, referred to as a “node,” stores all data associated with an individual taxon: an NCBI-assigned unique identifying number, a name, a rank, a list of descendants, a source, and the number of hits assigned to the taxon. A Graph object, referred to as a “tree,” organizes how individual nodes are linked and makes traversing the links between nodes easier.
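A minimal sketch of how these two classes might be organized, based only on the fields described above, is shown below; the attribute and method names are illustrative rather than Keanu’s actual implementation.

class Taxon:
    """A single node in the taxonomy tree (field names are assumed)."""
    def __init__(self, tax_id, name, rank):
        self.tax_id = tax_id        # NCBI-assigned unique identifier
        self.name = name            # scientific name of the taxon
        self.rank = rank            # e.g. "genus" or "species"
        self.source = None          # classification immediately above
        self.descendants = []       # classifications immediately below
        self.hits = 0               # number of hits assigned to this taxon

class Graph:
    """Organizes Taxon nodes and the links between them."""
    def __init__(self):
        self.nodes = {}             # tax_id -> Taxon

    def add_link(self, source_id, descendant_id):
        """Record that one taxon is the source of another."""
        source, descendant = self.nodes[source_id], self.nodes[descendant_id]
        source.descendants.append(descendant)
        descendant.source = source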
Preparation of data
Keanu generates interactive web pages as output that can be explored to determine how to proceed with further analysis of the samples based on their contents. The visualization shows the abundance of each taxon, descending from the root node to the species level. Information about these taxa is pulled from NCBI’s taxonomy database. Only classifications in NCBI’s abbreviated lineage are displayed (i.e., superkingdom, kingdom, phylum, class, order, suborder, family, genus, species).
The input files must be generated before Keanu itself is run (see Fig. 2). Raw FASTQ sequence data must first be trimmed to remove adapter sequences and low-quality data. For the purposes of developing Keanu, sequence reads from the soil sample were trimmed using Trimmomatic 0.32 [7]. Optionally, a user can reduce the number of sequences (and files) used for alignment to known sequences by assembling the reads with an assembly program. Here, the trimmed sample data were assembled using ABySS 1.9.0 [8] with k-mers of length 70, 75, 80, and 85; the 85-mer assemblies were selected because they had the best contiguity. The identity and taxonomy of sequence reads or assemblies were then determined by alignment to the NCBI Nucleotide (NT) database [9] using BLAST 2.2.30+ [10], though any aligner can be used so long as the results contain taxonomic data. Before running Keanu, the BLAST results can be filtered during the alignment stage using BLAST’s own filtering methods (for example, to exclude certain types of environmental data) or afterwards using a scripting language of the user’s choice. The resulting alignment file(s) were processed to produce a file containing one query ID (a sequence read or a contig from the assembly) and its associated taxon ID per line. Finally, this file was reformatted so that all of the taxon data associated with a single query appear on one line, together with the number of times each query-taxon pairing was seen in the file. A sample line might look like the example below.
QUERY_NAME taxon_id_1 [count], taxon_id_2 [count], taxon_id_3 [count]
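As an illustration of this reformatting step only (the script distributed with Keanu may differ), the per-line query-taxon pairs can be collapsed into the format above with a few lines of Python:

from collections import defaultdict, Counter

def collapse_pairs(pair_file, out_file):
    """Collapse one 'query taxon_id' pair per line into one line per query."""
    counts = defaultdict(Counter)                  # query -> {taxon_id: count}
    with open(pair_file) as fh:
        for line in fh:
            query, taxon_id = line.split()
            counts[query][taxon_id] += 1
    with open(out_file, "w") as out:
        for query, taxa in counts.items():
            pairs = ", ".join("{} [{}]".format(t, n) for t, n in taxa.items())
            out.write("{} {}\n".format(query, pairs))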
In addition to the raw data from the samples, taxonomy data files were downloaded from NCBI’s FTP site. The files containing the names and relationships of the taxa were combined into one file for easier parsing later, and the files containing information about merged and deleted taxa were likewise combined. Combining these files reduces the number of files needed to run Keanu, and the resulting databases can be stored for reproducibility more easily than many individual files.
Creating the tree
Keanu follows three steps to create its output: first, it creates the tree; then it populates the tree with the data supplied by the user; and finally, it produces an interactive web page for the user to explore, based on publicly available D3 visualizations [11, 12]. To create the tree, Keanu reads the two files created earlier from the taxonomy database downloaded from NCBI. As Keanu parses the file containing the taxonomy data, a node is created for each line in the file. All of the information, except the number of times a node is encountered in the alignments, is present in the taxonomy database file. The second file, containing the merged and deleted taxa, is then parsed; a mapping from each original taxon identification number to its new, merged identifying number is created, and any deleted taxa are removed from the tree.
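Building on the Taxon and Graph sketch above, the tree-construction step can be illustrated as follows; the layout of the combined database files is assumed here (tax_id|name|rank|source_id for taxa, and old_id|new_id or old_id alone for merged and deleted taxa) and may differ from the files that Keanu’s scripts actually produce.

def build_tree(taxonomy_db, merged_deleted_db):
    """Create a node per taxon, link sources to descendants, apply merges/deletions."""
    tree, links = Graph(), []
    with open(taxonomy_db) as fh:
        for line in fh:
            # assumed layout: tax_id|name|rank|source_id
            tax_id, name, rank, source_id = line.rstrip("\n").split("|")
            tree.nodes[tax_id] = Taxon(tax_id, name, rank)
            links.append((source_id, tax_id))
    for source_id, tax_id in links:
        if source_id in tree.nodes and source_id != tax_id:   # the root is its own source
            tree.add_link(source_id, tax_id)

    merged = {}
    with open(merged_deleted_db) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("|")
            if len(fields) == 2:                   # merged taxon: old_id|new_id
                merged[fields[0]] = fields[1]
            else:                                  # deleted taxon: old_id
                tree.nodes.pop(fields[0], None)
    return tree, merged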
Populating the tree
The data about taxon abundance must be added to each node in the graph. Keanu gets these data by reading the file that contains a line for each query and its associated data. If a query has multiple taxa associated with it, a single definitive assignment cannot be made because the alignment of the sequence itself was ambiguous. Instead, a hit is assigned to the most specific classification shared by all the taxa associated with the query. For example, a query with hits to a wide variety of birds will have a hit assigned to Aves if Aves is the most specific classification shared by those hits, and a query whose taxa all share the same genus will have the hit assigned to that genus. Rather than make ambiguous assignments, Keanu makes lower-resolution but accurate assignments. Keanu determines the most specific shared classification by working backwards from the taxa associated with a query to the root of the tree and finding where the first branching into different classifications occurs. Once a hit is assigned, it is propagated upwards through the source of each node until the propagation reaches the root.
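Using the node structure sketched earlier, the most specific shared classification can be found by intersecting each taxon’s path to the root, as in this illustrative (non-Keanu) example:

def lineage(node):
    """Return the path from a node back to the root, most specific first."""
    path = []
    while node is not None:
        path.append(node)
        node = node.source
    return path

def assign_hit(taxa):
    """Assign one hit to the most specific classification shared by all taxa."""
    shared = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        shared &= set(lineage(taxon))            # keep only common ancestors
    lca = next(node for node in lineage(taxa[0]) if node in shared)

    node = lca                                   # propagate the hit upwards
    while node is not None:
        node.hits += 1
        node = node.source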
Traversing the tree
Having populated the tree with the number of hits to each taxon, Keanu traverses the tree using a recursive depth-first search. When Keanu first encounters a node, it visits the first descendant of that node, always selecting the first descendant until it reaches a node with no descendants. When a node with no descendants is discovered, Keanu backtracks to the previous node and visits the next descendant in the list. While there are easier ways to traverse the tree, this method allows Keanu to more easily create JSON-formatted data describing its path through the tree; this format is used to create the visualization later. Nodes that are not in the abbreviated lineage are not included in the JSON output, and nodes without hits are not visited. When Keanu is done traversing the tree, the JSON-formatted data is cleaned up so that it is valid JSON. Rather than providing the user with a JSON file for use with a separate HTML file, Keanu produces an HTML file with the JSON data embedded in it for easy distribution to users. The analysis takes about 30 min to run, although run time will depend on the processor, available memory, and the size of the input data to Keanu.
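The traversal itself can be sketched as a recursive depth-first walk that emits nested dictionaries for Python’s json module to serialize; Keanu builds its JSON output differently, so this is only meant to illustrate the idea, and the handling of ranks outside the abbreviated lineage (attaching their descendants to the nearest retained ancestor) is an assumption.

import json

ABBREVIATED_RANKS = {"superkingdom", "kingdom", "phylum", "class", "order",
                     "suborder", "family", "genus", "species"}

def to_json(node):
    """Depth-first walk emitting nested dicts for the D3 visualization."""
    children = []
    for child in node.descendants:
        if child.hits > 0:                        # nodes without hits are not visited
            children.extend(to_json(child))
    if node.source is None or node.rank in ABBREVIATED_RANKS:
        return [{"name": node.name, "rank": node.rank,
                 "hits": node.hits, "children": children}]
    return children                               # skip ranks outside the abbreviated lineage

# e.g. html_json = json.dumps(to_json(root)[0]), where root is the tree's root node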
Soil sampling, DNA extraction and sequencing
The soil sample was collected aseptically using a sterile shovel from ancient soil at an archeological site in Alaska in October 2015. The site was located on a 50 m high bluff overlooking the Delta River (63° 49′ 00″N and 14° 56′ 46″W). The bluff was composed of windblown silt and sand punctuated by a series of paleosols ranging from approximately 400 to 8000 calBP (calendar years before present). The sample was collected from an area where a krotovina was present; the krotovina itself, dated using radiocarbon methods, was 170 ± 30 calBP. Following collection, the soil sample was immediately placed onto dry ice and stored in the dark at 4 °C or − 80 °C for soil property analysis and DNA extraction, respectively. Genomic DNA was extracted from the soil using the PowerSoil® DNA Isolation Kit (MoBio, San Diego, CA) and quantified using a Qubit 3.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA). DNA was sent for sequencing to Global Biologics (Columbia, MO) and was run on the Illumina NextSeq platform (2 × 150 bp) as well as on the Illumina MiSeq platform (2 × 250 bp). FASTQ files generated by the sequencers were then used for assembly and subsequent analysis by Keanu.
Installing and running Keanu
Keanu can be downloaded from GitHub. Since Keanu is a Python script with accessory scripts for formatting databases and input, it does not need to be installed in a specific location on a computer.
Before running Keanu, some preparatory steps must be completed. Reads can be assembled using an assembler chosen by the user; assembling reads reduces sequence duplication, which reduces the amount of data that must be aligned in later steps. Next, the reads or assembled sequences must be aligned to a database that contains taxonomy information from NCBI’s taxonomy database. If the user is using BLAST, the output should be in output format 6 (a tabular format easily parsed by other programs), and taxon information should be included by adding the staxids field to the output format. If BLAST has already been run, perhaps for another part of the analysis, the taxonomy information can be extracted from the database and added to the BLAST results with a combination of UNIX commands, as sketched below. More detail is available in the README file on GitHub.
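For example, query-taxon pairs can be pulled out of tabular BLAST output with a short Python script such as the sketch below; the column positions are assumptions that depend on the user’s -outfmt string and should be adjusted accordingly.

def blast_to_pairs(blast_tab, out_file):
    """Write one 'query taxon_id' pair per line from tabular BLAST output."""
    with open(blast_tab) as fh, open(out_file, "w") as out:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            query, staxids = fields[0], fields[-1]   # assumes staxids is the last column
            for taxon_id in staxids.split(";"):      # staxids may hold several IDs
                out.write("{}\t{}\n".format(query, taxon_id))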
By design, no database is included with Keanu. Instead, Keanu includes a script to create an up-to-date database. The source of the database information is NCBI’s Taxonomy FTP site, where a user can download a compressed archive of the taxonomy information. Keanu’s make_db.py script can take the names file, the nodes file, the deleted nodes file, and the merged nodes file and combine them into a format used by Keanu.
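The kind of join make_db.py performs can be illustrated with the sketch below, which merges NCBI’s names.dmp and nodes.dmp files (whose fields are separated by a tab-pipe-tab delimiter); the output layout shown is an assumption, and the merged and deleted node files would be combined in the same way.

def read_dmp(path):
    """Yield the fields of an NCBI .dmp file."""
    with open(path) as fh:
        for line in fh:
            yield line.rstrip("\t|\n").split("\t|\t")

def combine_taxonomy(names_dmp, nodes_dmp, out_db):
    """Join scientific names onto taxonomy nodes, one taxon per output line."""
    names = {}
    for fields in read_dmp(names_dmp):
        tax_id, name_txt, _, name_class = fields[:4]
        if name_class == "scientific name":
            names[tax_id] = name_txt
    with open(out_db, "w") as out:
        for fields in read_dmp(nodes_dmp):
            tax_id, source_id, rank = fields[:3]
            # assumed output layout: tax_id|name|rank|source_id
            out.write("{}|{}|{}|{}\n".format(tax_id, names.get(tax_id, ""),
                                             rank, source_id))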
After the database is created, the BLAST results or another aligner’s results can be parsed and reformatted. BLAST results by default have an entry for every match that a query has; Keanu requires that all information about a query be on a single line. The included reformatting script takes a file containing queries and taxonomy information from the alignment and reformats it for Keanu.
Finally, Keanu itself is run, taking as input the reformatted results from the previous step, the two databases created by the database creation step, and the user’s preference of the tree view or the bilevel partition graph view. Keanu produces a single, self-contained HTML file (web page) that can be sent to collaborators and viewed in a web browser.