Skip to main content

PhytoPipe: a phytosanitary pipeline for plant pathogen detection and diagnosis using RNA-seq data

Abstract

Background

Detection of exotic plant pathogens and preventing their entry and establishment are critical for the protection of agricultural systems while securing the global trading of agricultural commodities. High-throughput sequencing (HTS) has been applied successfully for plant pathogen discovery, leading to its current application in routine pathogen detection. However, the analysis of massive amounts of HTS data has become one of the major challenges for the use of HTS more broadly as a rapid diagnostics tool. Several bioinformatics pipelines have been developed to handle HTS data with a focus on plant virus and viroid detection. However, there is a need for an integrative tool that can simultaneously detect a wider range of other plant pathogens in HTS data, such as bacteria (including phytoplasmas), fungi, and oomycetes, and this tool should also be capable of generating a comprehensive report on the phytosanitary status of the diagnosed specimen.

Results

We have developed an open-source bioinformatics pipeline called PhytoPipe (Phytosanitary Pipeline) to provide the plant pathology diagnostician community with a user-friendly tool that integrates analysis and visualization of HTS RNA-seq data. PhytoPipe includes quality control of reads, read classification, assembly-based annotation, and reference-based mapping. The final product of the analysis is a comprehensive report for easy interpretation of not only viruses and viroids but also bacteria (including phytoplasma), fungi, and oomycetes. PhytoPipe is implemented in Snakemake workflow with Python 3 and bash scripts in a Linux environment. The source code for PhytoPipe is freely available and distributed under a BSD-3 license.

Conclusions

PhytoPipe provides an integrative bioinformatics pipeline that can be used for the analysis of HTS RNA-seq data. PhytoPipe is easily installed on a Linux or Mac system and can be conveniently used with a Docker image, which includes all dependent packages and software related to analyses. It is publicly available on GitHub at https://github.com/healthyPlant/PhytoPipe and on Docker Hub at https://hub.docker.com/r/healthyplant/phytopipe.

Background

International trade and consumer demand have increased the worldwide movement of plants and plant parts. At the same time, the global distribution and exchange of plant germplasm that support the improvement and expansion of agricultural and horticultural industries have also grown dramatically in recent years [1,2,3,4]. Imported plant germplasm must be thoroughly tested, and proper phytosanitary measures should be followed to minimize the risk of introduction of new pests and pathogens of quarantine relevance. Therefore, the development of comprehensive diagnostic methods to identify both known and unknown plant pathogens, as well as novel variants, is an important goal for testing plant material distributed at the global and national levels.

Quarantine centers, certification programs, and plant diagnostic clinics have been using traditional virus diagnostic techniques to conduct virus detection, which includes biological indexing (mechanical transmission using herbaceous indicator plants such as Chenopodium quinoa and Nicotiana tabacum, etc.), enzyme-linked immunosorbent assay (ELISA), polymerase chain reaction (PCR), and loop-mediated isothermal amplification (LAMP) [5, 6]. More recently, high-throughput sequencing (HTS), also known as next-generation sequencing (NGS) or deep sequencing has been used by diagnosticians and researchers for detecting and identifying plant pathogens, which has resulted in a steady increase in the identification of plant pathogens affecting various crops [2, 7,8,9,10,11,12,13]. Since it does not require a priori phytosanitary status knowledge of the sample, HTS offers certain advantages when compared to targeted-diagnostic techniques such as ELISA or PCR [10, 11, 14,15,16,17,18]. The use of HTS is now becoming a gold standard across continents after the International Plant Protection Convention (IPPC) recommended it as a diagnostic tool for phytosanitary purposes in 2019 [19]. More recently, the European and Mediterranean Plant Protection Organization (EPPO) released a standard for plant health diagnostics using HTS in 2022 [20].

HTS-based plant pathogen detection involves two major strategies: amplicon sequencing, which uses the power of PCR to amplify specific standardized genetic marker(s), such as 16S rRNA gene for bacteria [21] or unique genomic regions of virus and viroid genomes [22]; or shotgun sequencing, which captures the complete nucleic acids present in a sample [20, 23]. Amplicon sequencing is popularly used for identification and comparison of entire microbial communities while shotgun sequencing has wider applications for uncovering novel and emerging pathogens [23]. Currently, total RNA sequencing (RNA-seq) and small RNA sequencing (sRNA-seq) are the two most widely used HTS shotgun approaches for the detection of plant viruses and viroids [27, 28]. sRNA-seq is designed for virus detection based on the plant viral response mechanism [29] while total RNA-seq, which had traditionally been used for the analysis of the transcriptomic landscape of the host, is now also used for the detection of plant pathogens such as bacteria (including phytoplasma), fungi, viruses, and viroids [8, 30,31,32]. Although RNA-seq is commonly used for virus and viroid detection by plant virologists and in routine diagnostic applications such as post-entry quarantine or certification [11, 33], non-viral pathogens or pests can also be detected from the same RNA-seq dataset used for virus detection [15]. Because of its broad detection spectrum and novel pathogen detection ability, it has recently become an increasingly popular tool for pathogen detection. It has been used in many crops, such as wheat [34], grapevine [35], citrus [36], fruit trees [37], cucurbit [38], sugarcane [10], grasses [7, 38] etc. for viral pathogen detection.

One of the drawbacks of the RNA-seq approach is that it requires higher titer levels of the pathogen expressed in the hosts, which corresponds to a high number of pathogen-derived reads in the data [15]. The RNA-seq approach could miss pathogens with low titer. Another obstacle for using RNA-seq to detect pathogens is its demanding bioinformatic requirements for microbiome analysis. To our knowledge, there are few broadly accepted standard methods for RNA-seq data generation, processing, and analysis [25, 39, 40]. The most used analysis methods or pipelines can fall into three main categories: (i) mapping sequence reads directly to reference genomes from known pathogens such as Pathoscope [41] and CAMAMED [42]; (ii) assembling sequence reads and annotating contigs such as VirFind [43], VSD toolkit [44], VirusDetect [45], and Virtool [37]; and (iii) read-based taxonomic assignments such as Kaiju [46], Kraken2 [47], and Kodoja [48]. However, these methods or pipelines do not offer an integrated sequence read quality control, read assembly, pathogen reference mapping, and read classification to identify known pathogens and discover novel species, which is a common occurrence during plant virus detection. Therefore, an integrative and comprehensive pipeline is needed to detect the presence of a wide range of potential plant pathogens in a specimen.

Here we present Phytosanitary Pipeline (PhytoPipe), an integrative pipeline for plant pathogen identification using RNA-seq data. The pipeline combines current tools for HTS read quality control, the host read filtering, read assembly, contig annotation, reference mapping, and taxonomic classification. PhytoPipe is equally capable of identifying known bacteria (including phytoplasma), fungi, oomycetes, viruses, viroids, and possible novel viruses. Furthermore, the use of the Snakemake workflow management system [49] allows for an efficient and automated deployment on a local multicore computer, computing cluster, or a cloud environment.

Implementation

The PhytoPipe framework uses the Snakemake workflow management system [49] to organize sequence data processing tools. These tools have been organized into four distinct modules: reads preprocessing (Fig. 1A), reads classification (Fig. 1B), assembly-based annotation (Fig. 1C), and reference-based mapping (Fig. 1D). Each module summarizes the results into HTML or Krona reports and tables (Fig. 2). PhytoPipe can be easily set up in a Linux or Mac environment or run using the PhytoPipe docker image [50] (https://hub.docker.com/r/healthyplant/phytopipe) on a Linux, Mac, or Windows system. PhytoPipe requires at least 300 GB of RAM, 1 TB of fast-speed storage and multi-cores (> 32) parallel computing environment. The complete usage and user options are outlined on the GitHub wiki page https://github.com/healthyPlant/PhytoPipe/wiki.

Fig. 1
figure 1

Flowchart describing processes and steps performed by the PhytoPipe workflow. The pipeline integrates reads preprocessing (A), reads classification (B), assembly-based annotation (C), and reference-based mapping (D) into a single workflow. The entire protocol can be run starting from raw sequence data in the form of single- or paired-end FASTQ files. The results are summarized in an HTML report file, tables, and Krona plots for an expert interpretation

Fig. 2
figure 2

Example output from the PhytoPipe workflow. RNA-seq data from two apple samples (NCBI BioProject accession: PRJNA562540 [72]) processed by PhytoPipe show: A an HTML report showing results from different methods; B a MultiQC report for samples read quality; C a read mapping graph for a virus; D a Krona pie graph showing the taxonomic composition

Quality control

PhytoPipe can process sequence reads in the form of single- or paired-end FASTQ files and performs raw sequence reads cleaning in four steps: (1) removing host ribosomal RNAs (rRNA) with bbduk against the SILVA eukaryote ribosomal 18S and 28S RNA database [51], which is summarized by SortMeRNA [52]; (2) removing PCR duplicates with clumpify; (3) removing spike-in or positive controls (the default is pre-determined as PhiX) with BBSplit; and (4) removing and trimming low-quality reads, bases, and adapter sequences with Trimmomatic [53] (Fig. 1A). The tools used in steps 1, 2, and 3 are implemented in the BBTools suite [54]. The raw and clean-read qualities for a single sample are visualized by FastQC [55] and for batch samples by MultiQC [56]. PhytoPipe reports read numbers at each cleaning step, so the user can choose them to evaluate the wet lab work, such as rRNA depletion efficiency.

Read classification

PhytoPipe uses Kraken2 [47] to query reads against the NCBI nt database for the nucleotide-level classification. PhytoPipe also relies on Kaiju [46] to assign reads to taxa using the NCBI taxonomy and a microbial non-redundant database (nr + euk) of bacterial, viral, fungal, and other microbial eukaryotic proteins (Fig. 1B). These k-mer-based approaches classify sequences based on the presence and frequency of specific k-mers in the database [46, 57, 58]. They can discover low titer viruses or phytoplasma, which could be missed by assembly-based methods [29]. Besides the text report generated by the tools, the sequence profile and the metagenomic classification are also interactively visualized by multi-layered Krona pie charts for a given sample [59].

Assembly-based annotation

Prior to read assembly, host reads are usually subtracted for pathogen detection. Instead of mapping reads to the genome to remove host reads, PhytoPipe extracts possible pathogen-derived reads, including classified pathogen-derived (bacteria, fungi, oomycetes, viruses, and viroids) and unclassified reads from the Kraken2 classification using the modified script “extract_kraken_reads.py” in KrakenTools [60] (Fig. 1C). Then these reads are assembled with either SPAdes [61] or Trinity [62] de novo assembler. Trinity is used as default due to its robustness and its ability to better perform when dealing with low titer viruses. Assemblies are then evaluated with QUAST [63] followed by contig (length ≥ 200 nucleotides) annotation at the nucleotide level using blastn [64] against NCBI nt database and at the protein level using Diamond blastx [65] against NCBI nr database. PhytoPipe allows users to obtain the pathogen information in the blast results that are combined with pathogen taxonomy along with HTS read count assigned by Kraken2 and Kaiju. Blast searches against NCBI databases can be time-consuming (several days) depending on the user’s computing environment and the volume of the HTS data. Hence, the user who is just interested in the virus discovery can either use their own database or other alternative virus databases such as NCBI viral reference genomes and Reference Viral Databases (RVDB) protein version [66]. The user’s databases and the analysis parameters can be easily set up in the config file. Finally, the users could identify possible novel viruses based on Diamond blastx results and the ICTV criteria field [67].

Reference-based mapping

To further confirm virus discovery derived from HTS read classification and assembly-based annotation, viral reference genomes are collated before reference-based mapping by PhytoPipe (Fig. 1D). The clean reads are mapped to reference genomes by BWA-MEM [68]. The mapped read number and coverage are calculated by SAMtools [69] and a coverage graph is drawn using matplotlib [70] in Python. A consensus sequence is generated with BCFtools (including mpileup and consensus commands) [71] and is additionally annotated using blastn against the local NCBI nt database to filter non-pathogen sequences.

Results

The PhytoPipe output for each sample includes FastQC/MultiQC reports for HTS read quality assessment, Krona taxonomy pie charts for both Kraken2/Kaiju read classification and blastn/blastx results for contigs, and QUAST report for assembly evaluation. In addition, output also includes blastn/Diamond blastx search result tables, mapping statistics and coverage graphs for viruses and viroids, together with a summary report in HTML format (report.html). The final text report (report.txt) for viruses and viroids includes results from all samples, including the raw/clean-read length and count, read mapping information (reference names and related taxonomy, mapped read count, normalized read count (reads per kilobase of transcript per million mapped reads (RPKM)), percentage of mapped reads, percentage of viral genome covered, and mean coverage), NCBI blast results (blast E-value, blast identity, blast description), and the nucleotide sequence (contig or consensus sequence). A comprehensive sequence quality report includes a read quality table of raw read count, raw bases (Mbases) count, percentage of bases >  = Q30, mean of raw read quality score, percentage of rRNA, read count after removing duplicates, spike-in/control read count, read count after trimming, and the number of possible pathogen-derived reads used for assembly.

To show the PhytoPipe detection of microbes in the real plant RNA-seq data, two apple sample datasets from the study by Wright et al. [72] (host: Malus domestica, NCBI BioProject accession: PRJNA562540) were analyzed. Table 1 shows the microbe detection results. One fungus (Aureobasidium pullulans EXF-150) was found in SRX6762507, which could have been derived from the environment (e.g., water or soil). Fungal viruses (also known as mycoviruses) were also found in this sample (not listed). Three species of bacteria (Actinoplanes friuliensis DSM 7358, Bradyrhizobium sp. 170, and Steroidobacter denitrificans) were found in SRX6762511, which could be soilborne. The validated multiple apple viruses and viroids were also found by PhytoPipe in both samples: apple chlorotic leaf spot virus (ACLSV), apple hammerhead viroid (AHVd), apple mosaic virus (ApMV), apple rubbery wood virus 2 (ARWV2), apple stem grooving virus (ASGV), apple stem pitting virus (ASPV). PhytoPipe also found additional three ones: hop latent virus, hop latent viroid, and hop stunt viroid. Besides this output for all microbes, PhytoPipe has a specific report for viruses and viroids (report.txt), the Table 1 missed validated virus, apple green crinkle associated virus (AGCaV), is in the viral report. Figure 2 shows examples of the PhytoPipe output from this analysis. The HTML report contains summary results from different tools (Fig. 2A), the MultiQC report shows read quality (Fig. 2B), the read mapping graph shows genome coverage for virus and viroid genomes (in this case hop latent viroid: NC_003611) (Fig. 2C), and a Krona pie chart shows the taxonomic composition of the sample (Fig. 2D). The Krona pie chart also offers an interactive view of different pathogens present in the sample. Details of these result files are provided on GitHub (https://github.com/healthyPlant/PhytoPipe/tree/main/doc/test_report.zip).

Table 1 Microbes and pathogens detected by PhytoPipe in samples SRX6762507 and SRX6762511

To compare PhytoPipe with other plant virus detection pipelines, nine datasets corresponding to nine virus detection challenges from the Plant Health Bioinformatics Network (PHBN) VIROMOCK (https://gitlab.com/ilvo/VIROMOCK-challenge) [73] were analyzed. Dataset_2 was excluded since it was designed for mutation detection. Table 2 shows the results of the virus and viroid detection using four different pipelines. PhytoPipe could detect all expected viruses and viroids in the datasets and solved all the pre-determined challenges listed for these datasets. Pipelines Kodoja [29], Pathoscope [41], and Virtool [37] detected most of the known viruses at the species level but missed one to five viruses. Both Kodaja and Pathoscope failed to detect novel viruses. Virtool, on the other hand, could detect novel viruses and solve several challenges. True positive rate (TPR), false negative rate (FNR), and false discovery rate (FDR) are calculated by true positives (TP) (detected expected viruses), false positives (FP) (detected unexpected viruses), and false negatives (FN) (missed expected viruses). TPR = TP/(TP + FN), FNR = FN/(TP + FN) and FDR = FP/(FP + TP). The TPRs for the four pipelines are 100% (PhytoPipe) > 91% (Pathoscope) > 74% (Virtool) > 52% (Kodoja); The FNRs are 0% (PhytoPipe) < 9% (Pathoscope) < 26% (Virtool) < 48% (Kodoja); The FDRs are 39% (PhytoPipe) > 25% (Kodoja) > 22% (Pathoscope) > 19% (Virtool). PhytoPipe has the highest TPR and lowest FNR. It has the highest FDR since some of the identified viruses are not in the expected virus list. For example, citrus blight-associated pararetrovirus, citrus endogenous pararetrovirus, and cherry virus A in the dataset_1 (citrus sample); grapevine fleck virus, grapevine leafroll-associated virus 3, grapevine Kizil Sapak virus, and grapevine leafroll-associated virus 7 in the dataset_3 (grapevine sample); pistacia cryptic virus in the dataset_9 (pistachio sample). To determine whether they are true or false positives, confirmatory experiments such as the use of a PCR-based detection method might be necessary.

Table 2 Comparison of virus and viroid detection with different pipelines

To demonstrate the capabilities of PhytoPipe to detect bacteria, fungi, and oomycetes, twenty-two public RNA-seq sequence datasets (twelve bacteria, eight fungi, and two oomycetes) from the pathogen infected plants in the study by Haegeman et al. in 2023 [15] were analyzed using PhytoPipe. The results in Table 3 show that PhytoPipe detected confirmed pathogens in 18 out of 22 (81.8%) samples. PhytoPipe was not able to detect bacteria from four samples due to the low abundance of the pathogen-derived reads (reads per million reads(rpm) < 10 as reported by Haegeman et al.).

Table 3 PhytoPipe analyses results for RNA-seq samples with confirmed pathogen infection

To further assess the ability of the pipeline to detect plant pathogens, twenty-four simulated RNA-seq datasets (12 with and 12 without pathogens) were generated using ART [74]. Twelve datasets comprising 10 or 20M host reads from 12 crop genomes (apple, cassava, citrus, grapevine, maize, peanut, potato, rice, rose, soybean, sweet potato, wheat) were generated using ART and the subsample function of seqtk (https://github.com/lh3/seqtk). Another twelve spike-in datasets were composed of a host, two to three fungi/bacteria/oomycetes, and six to eight viruses/viroids for each crop with different quantities. The virus/viroid reads in the spike-in samples ranged from 30 to 35,250 with a coverage from 2 to 300X, and fungi/bacteria/oomycetes reads ranged from 877 to 2,185,250 with a coverage from 1 to 10X (Additional file 1: Table S1). The spiked pathogens didn’t show up in the results of 11 host datasets as expected, except citrus endogenous pararetrovirus in the citrus host. In addition, four unexpected viruses were found in the other three host datasets: rice tungro bacilliform virus and citrus exocortis viroid in rice, caulimovirus sp. in sweetpotato, and begomovirus-associated DNA-III in cassava. These results show that the negative control sample has an important role in the analysis. PhytoPipe detected all 79 spiked viruses/viroids that were expected with the high level of correlation between the simulated and observed reads (the majority were mapped reads; classified reads were for viruses without contigs) (R2 equals to 0.86) (Fig. 3A). Viruses missed by the assembly-based method were detected by the Kraken2 classification method. In case of citrus endogenous pararetrovirus, the number of observed reads was double the number of spiked ones. All 13 bacteria/oomycetes pathogens and 13 out of 15 fungi were detected by PhytoPipe at the species level despite the low correlation (R2 equals to 0.35) between the simulated and observed classified reads (Fig. 3B). The spiked reads classified by Kraken2 varied from less than 1% to 95% of the original ones. Eight pathogens (six fungi and two bacteria) had less than 10% reads classified at the species level and one of the spiked fungi Diplocarpon rosae had just three reads classified to this species. These results showed that the coverage and the database are also the keys for RNA-seq to detect pathogens. Although viruses and viroids can be detected with less than 1000 reads due to their smaller genomes, fungi, bacteria, and oomycetes require high titer because of the large genome size. In summary, PhypoPipe can detect spiked microbes in the simulated data with a high level of accuracy. For the virus detection, TPR, FNR and FDR of the pipeline are 100%, 0%, and 2.4%, respectively. Two unexpected viruses, rice tungro bacilliform virus and citrus exocortis viroid were detected in both rice and spike-in rice datasets, are counted as false positives. For the detection of bacteria, fungi, and oomycetes, if only the same species are treated as true positives, TPR, FNR and FDR of the pipeline are 97%, 14%, and 40%, respectively. FDR is high because the simulated genome coverage (1 to 10X) is comparatively low and many reads are classified into the same family or genus, not into the same species.

Fig. 3
figure 3

Comparison between spike-in and observed pathogen reads from simulated RNA-seq data. A 79 viruses/viroids from 12 crops. Observed reads were obtained by either mapping reads to a viral reference if viral contigs are annotated or are from Kraken2 classification. The high correlation between the spike-in and observed viral reads shows high detection ability of PhytoPipe. B 28 bacteria/fungi/oomycetes in 12 crops. Observed reads were obtained from Kraken2 classification

Discussion

To assess whether plant material is pathogen-free is critical for regulatory and biosecurity purposes. In contrast to the use of PCR and ELISA, which target a specific pathogen, RNA-seq can detect all potential pathogens in a plant sample. However, the results could be impacted by several variables, including the type of tissue used, the pathogen titer in the sample, the type of nucleic acid under analysis, the sequencing method employed, the type of analytical tools, and the reference databases used by each analytical tool. Virus detection pipelines can also vary greatly in their ability to detect known and novel viruses. For example, an assembly-based analysis could generate false-negative results when low titer viruses are present in the sample. This potentially high-risk scenario could be due in part to the low incidence of reads that does not allow the generation of sizeable contigs. Contrary to this, read classification-based methods can detect those viruses, but they could be below the threshold. On the other hand, to speed up the detection process, many virus detection pipelines only use viral databases. This could potentially result in some host sequences being annotated as part of a virus genome by blast if the threshold is not strict, or some host reads mapping to viruses for the reference-based mapping method if the tool parameters are less stringent. To address these limitations, we built PhytoPipe, which can integrate read classification, assembly-based annotation, and reference-based mapping methods for the detection of known plant pathogens as well as novel viruses. The possible known viruses and viroids are identified by the overlap of the results from read classification at the nucleotide level by Kraken2 and contig blastn against the viral nucleotide database. The selected candidates are further filtered by the viral reference genome coverage from the reference-based mapping and their consensus sequence annotations from blastn against NCBI nt database. Furthermore, PhytoPipe can identify possible new viruses by the overlap of the results between read classification at the protein level by Kaiju and contig Diamond blastx against the viral protein database. Moreover, PhytoPipe visualizes the sequence profile as Krona pie charts which the users can use to determine the presence of any pathogens (Fig. 2D).

When a method is used for the analysis of HTS data, a threshold is either set by the user or by the developer. There is a trade-off between the true positive rate and the false positive rate for different thresholds. The more stringent the threshold is, the higher could be the number of false negatives, and vice versa. For example, if > 100 reads is used as a read number threshold for the Kraken2 read classification in the sample SRX6762507 (Table 1), three apple viroids (hop stunt viroid, hop latent viroid, and apple hammerhead viroid) could result in false negatives. Furthermore if > 60% viral genome coverage, which is defined as the percentage of a viral genome/segment covered by reads, is used to filter viruses in the PhytoPipe report of SRX6762507, three false positive viruses (apricot latent virus, turnip vein-clearing virus, and ribgrass mosaic virus) can be classified as non-detectable. Therefore, the use of the most appropriate thresholds to reduce false positives and false negatives is critical for an accurate diagnostic. PhytoPipe combines three methods to minimize the detection of false outcomes. If a virus is found by both the classification and assembly-based annotation methods, and its genome coverage (from the read mapping method) is above 15%, PhytoPipe reports this virus as a positive. This viral genome coverage threshold could be low and cause more false positives. Moreover, PhytoPipe has a user-defined pathogen file, named monitorPathogen.txt, which can be modified based on the user's pathogens of interest, such as the nationwide pest priorities. In case these monitored pathogens are missed in the report to cause false negatives, they can be fished out from the Kraken2 result, even with a low read number support (e.g., < 100 read). However, these could have higher probabilities of being false positives. In this circumstance, the user’s knowledge about the pathogen is key to decide whether there is a need for validation or not. Since the pathogen titer is often variable and depends on various factors, such as biotic, abiotic, and methodology challenges to the sample, it is difficult to establish a general threshold that can be applied for all pathogens. The PhytoPipe user can then easily filter the output for all the potential organisms present in the sample using the report files. For example, when looking into the virome of a sample, mapped reads, viral genome coverage, and blast E-value in the report (report.txt) can be used for filtering. For bacteria and fungi, classified read number, contig number, and longest contig size in the summary files (sample.blastnt.summary.txt and sample.blastnr.summary.txt) can be used. Users can determine the present of a positive diagnostic based on their expertise and by the inclusion of negative controls and taxonomy information from PhytoPipe. For further regulatory action, a wet-lab validation should be required for pathogens identified by HTS to minimize the risk of reporting false positives. Moreover, the plant pathogens that are detected despite a low number of reads, plant pathologists may also need to use alternative methods such as PCR, for the confirmatory diagnostics.

A phytosanitary inspection of plant material not only discovers the presence of known pathogens in the sample but also ensures a thorough examination to conclude that the material is free of detectable pathogens. However, it is difficult to determine whether a plant material is clean without a negative control. An HTS report for a plant sample could have many organisms, such as a plant, insects, plant fungi or environmental fungi (e.g., from soil, water, or air), plant bacteria or environmental bacteria, fungal viruses (or mycoviruses), plant viruses, insect viruses, etc. The negative control datasets can efficiently help to filter out unrelated organisms. On the other hand, inaccurate read classifications (with a low read number) or inaccurate contig annotations (with a high blast E-value) are also in the report since similar sequences in the databases are used for classification or annotation, and NCBI nt and nr databases are not curated and frequently updated. With negative control datasets, reasonable thresholds can be set up and wrong annotations can be removed. On the contrary, the absence of negative control datasets could result in a higher number of false positives or even a wrong report for a sample. Therefore, the negative control datasets are needed for reducing errors in the analysis to generate a reliable result.

Ribosomal RNA removal is a key step during the library preparation process when using total RNA as the initial sample source for diagnostics. rRNA could take up to 30–50% of the reads in a sequenced library with inefficient rRNA removal whereas efficient removal of rRNA can lead to samples with less than 5% of the reads. The PhytoPipe ribosomal RNA removal step uses SILVA Eukaryote ribosomal RNA (18S and 28S) database to evaluate the library prep and remove possible host rRNAs. To evaluate whether this step impacts the microbe detection, two analyses of 12 bacteria RNA-seq data from the study by Haegeman et al. [15] were done using Kraken2. One analysis was for raw reads using rRNA database SILVA 138_1 SSU while the other one was for rRNA removed reads using NCBI nt database (PhytoPipe method). Our results showed that the rRNA removal step had no impact on bacterial pathogen detection (Additional file 2: Table S2). Surprisingly, more reads were classified using the PhytoPipe method as compared to the one without rRNA removal.

The available pipelines such as Kodoja [29], VirFind [23], VSD toolkit [24], VirusDetect [25], and Virtool [26] are limited to viral pathogen detection. To the best of our knowledge, PhytoPipe is the first pipeline that has the potential to detect possible microbial pathogens in a plant using RNA-seq data. Besides the wider scope of pathogen detection, PhytoPipe offers many unique features that other pipelines lack (Table 4). First, a k-mer-based read classification method in PhytoPipe can detect the viruses missed by assembly- and mapping-based methods. Second, PhytoPipe can remove and report the percentage of host ribosomal RNA reads in the sample that facilitates the quality assessment of the wet lab work. Third, PhytoPipe subtracts host reads using the k-mer method (based on the Kraken2 result) for all plants instead of host genome mapping for only plants having a complete host genome available. Fourth, PhytoPipe summarizes results from the comprehensive analysis as an HTML report (report.html) that helps users to determine the presence of possible pathogens in the sample. Lastly, the Snakemake workflow management system allows seamless integration and scaling of the pipelines to server, cluster, and cloud environments.

Table 4 Unique features of PhytoPipe in comparison with other pipelines

Despite our best effort to design a comprehensive phytopathogen detection method, PhytoPipe has a few limitations. First, the pipeline is mainly developed for the RNA sequencing data generated using the Illumina sequencing platform. However, it can be extended to other platforms after adjusting for a different platform-related set of tools. Second, the pipeline uses several different databases, which takes time, effort, and significant data storage space for initial construction and maintenance. Third, the pipeline can be used to identify the low titer viruses, which have a low read number supporting the output analysis. Such viruses could be false positives in the report, but this is a limitation found in all the pipelines investigated in this study. Fourth, the pipeline reports possible bacteria and fungi from classification and assembly-based annotation (see Table 1). The end-users of the results need to perform further validation of the results and use their knowledge of plant pathogens to make a regulatory decision. Lastly, only simulated datasets were used to evaluate PhytoPipe detection of all possible microbes in plant samples. This evaluation is still limited without large real datasets.

Conclusions

PhytoPipe is a reliable and robust bioinformatic framework for detecting plant pathogens (bacteria (including phytoplasma), fungi, oomycetes, viroids, and viruses (including novel ones)) using RNA-seq data. Pathogens are identified with HTS read classification and assembly-based annotation methods and further validated with reference-based mapping for viruses. PhytoPipe combines different tools and databases to verify the findings from various angles. Although PhytoPipe is uniquely designed for plant pathogen discovery, it can also be used for the detection of other organisms. A summary HTML file includes metagenomic information from HTS read classification, contig blast annotation, and reference-based mapping for downstream analysis and visualization. An organized running folder keeps detailed information for the user to explore the run information and results. PhytoPipe is implemented using Snakemake, which can take advantage of multicore CPUs in a local, cluster, or cloud environment. The PhytoPipe docker image can be used on a Linux, Mac, or Windows system.

The source code for PhytoPipe is distributed under a BSD-3 license and is freely available at https://github.com/healthyPlant/PhytoPipe. Software documentation available at https://github.com/healthyPlant/PhytoPipe/wiki describes the pipeline's installation, usage, and testing using the published RNA-seq data from NCBI SRA.

Availability of data and materials

The PhytoPipe homepage and code can be found at https://github.com/healthyPlant/PhytoPipe. The documentation and test procedures are both available on GitHub at https://github.com/healthyPlant/PhytoPipe/wiki. PhytoPipe docker image is available on Docker hub at https://hub.docker.com/r/healthyplant/phytopipe. Availability and requirements—Project name: PhytoPipe. Project home page: https://github.com/healthyPlant/PhytoPipe. Operating system(s): Linux, Mac, Windows (Docker image only). Programming language: Python3, Bash and Snakemake. Other requirements: Python 3.6 or greater, Snakemake 5.13 or greater, and NCBI Blast v2.10 or greater. License: BSD-3. Any restrictions to use by non-academics: license needed.

References

  1. MacDiarmid R, Rodoni B, Melcher U, Ochoa-Corona F, Roossinck M. Biosecurity implications of new technology and discovery in plant virus research. PLoS Pathog. 2013;9(8):66.

    Article  Google Scholar 

  2. Martin RR, Constable F, Tzanetakis IE. Quarantine regulations and the impact of modern detection methods. Annu Rev Phytopathol. 2016;54:189–205.

    Article  CAS  PubMed  Google Scholar 

  3. Halewood M, Jamora N, Noriega IL, Anglin NL, Wenzl P, Payne T, Ndjiondjop M-N, Guarino L, Kumar PL, Yazbek M, et al. Germplasm acquisition and distribution by CGIAR genebanks. Plants. 2020;9(10):1296.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Stephen Smith TEN. Mary Challender: Germplasm exchange is critical to conservation of biodiversity and global food security. Agron J. 2021;113(4):11.

    Google Scholar 

  5. Martin RR, James D, Levesque CA. Impacts of molecular diagnostic technologies on plant disease management. Annu Rev Phytopathol. 2000;38:207.

    Article  CAS  PubMed  Google Scholar 

  6. Mumford R, Boonham N, Tomlinson J, Barker I. Advances in molecular phytodiagnostics—new solutions for old problems. Eur J Plant Pathol. 2006;116(1):1–19.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Costa LC, Hu XJ, Malapi-Wight M, O’Connell M, Hendrickson LM, Turner RS, McFarland C, Foster J, Hurtado-Gonzales OP. Genomic characterization of silvergrass cryptic virus 1, a novel partitivirus infecting Miscanthus sinensis. Arch Virol. 2022;167(1):261–5.

    Article  CAS  PubMed  Google Scholar 

  8. Gauthier MEA, Lelwala RV, Elliott CE, Windell C, Fiorito S, Dinsdale A, Whattam M, Pattemore J, Barrero RA. Side-by-side comparison of post-entry quarantine and high throughput sequencing methods for virus and viroid diagnosis. Biology. 2022;11(2):66.

    Article  Google Scholar 

  9. Kumar LM, Foster JA, McFarland C, Malapi-Wight M. First report of Barley virus G in switchgrass (Panicum virgatum). Plant Dis. 2018;102(2):466–466.

    Article  Google Scholar 

  10. Malapi-Wight M, Adhikari B, Zhou J, Hendrickson L, Maroon-Lango CJ, McFarland C, Foster JA, Hurtado-Gonzales OP. HTS-based diagnostics of sugarcane viruses: seasonal variation and its implications for accurate detection. Viruses. 2021;13(8):66.

    Article  Google Scholar 

  11. Maree HJ, Fox A, Al Rwahnih M, Boonham N, Candresse T. Application of HTS for routine plant virus diagnostics: state of the art and challenges. Front Plant Sci. 2018;9:66.

    Article  Google Scholar 

  12. Massart S, Candresse T, Gil J, Lacomme C, Predajna L, Ravnikar M, Reynard JS, Rumbou A, Saldarelli P, Skoric D, et al. A framework for the evaluation of biosecurity, commercial, regulatory, and scientific impacts of plant viruses and viroids identified by NGS technologies. Front Microbiol. 2017;8:66.

    Article  Google Scholar 

  13. Villamor DEV, Ho T, Al Rwahnih M, Martin RR, Tzanetakis IE. High throughput sequencing for plant virus detection and discovery. Phytopathology. 2019;109(5):716–25.

    Article  CAS  PubMed  Google Scholar 

  14. Espindola AS, Cardwell K, Martin FN, Hoyt PR, Marek SM, Schneider W, Garzon CD. A step towards validation of high-throughput sequencing for the identification of plant pathogenic oomycetes. Phytopathology. 2022;112(9):1859–66.

    Article  CAS  PubMed  Google Scholar 

  15. Haegeman A, Foucart Y, De Jonghe K, Goedefroit T, Al Rwahnih M, Boonham N, Candresse T, Gaafar YZA, Hurtado-Gonzales OP, Kogej Zwitter Z, et al. Looking beyond virus detection in RNA sequencing data: lessons learned from a community-based effort to detect cellular plant pathogens and pests. Plants. 2023;12(11):66.

    Article  Google Scholar 

  16. Malapi-Wight M, Salgado-Salazar C, Demers JE, Clement DL, Rane KK, Crouch JA. Sarcococca blight: use of whole-genome sequencing for fungal plant disease diagnosis. Plant Dis. 2016;100(6):1093–100.

    Article  CAS  PubMed  Google Scholar 

  17. Nizamani MM, Zhang Q, Muhae-Ud-Din G, Wang Y. High-throughput sequencing in plant disease management: a comprehensive review of benefits, challenges, and future perspectives. Phytopathol Res. 2023;5(1):44.

    Article  Google Scholar 

  18. Massart S, Olmos A, Jijakli H, Candresse T. Current impact and future directions of high throughput sequencing in plant virus diagnostics. Virus Res. 2014;188:90–6.

    Article  CAS  PubMed  Google Scholar 

  19. FAO: Preparing to use high-throughput sequencing (HTS) technologies as a diagnostic tool for phytosanitary purposes. Commission on Phytosanitary Measures Recommendation; 2019. p. 8.

  20. EPPO. PM 7/151 (1) Considerations for the use of high throughput sequencing in plant health diagnostics. OEPP/EPPO Bull. 2022;52(3):619–42.

  21. Poretsky R, Rodriguez RL, Luo C, Tsementzi D, Konstantinidis KT. Strengths and limitations of 16S rRNA gene amplicon sequencing in revealing temporal microbial community dynamics. PLoS ONE. 2014;9(4): e93827.

    Article  PubMed  PubMed Central  Google Scholar 

  22. Costa LC, Atha B 3rd, Hu X, Lamour K, Yang Y, O’Connell M, McFarland C, Foster JA, Hurtado-Gonzales OP. High-throughput detection of a large set of viruses and viroids of pome and stone fruit trees by multiplex PCR-based amplicon sequencing. Front Plant Sci. 2022;13:1072768.

    Article  PubMed  PubMed Central  Google Scholar 

  23. Morgan XC, Huttenhower C. Chapter 12: human microbiome analysis. PLoS Comput Biol. 2012;8(12):e1002808.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Adams IP, Fox A, Boonham N, Massart S, De Jonghe K. The impact of high throughput sequencing on plant health diagnostics. Eur J Plant Pathol. 2018;152(4):909–19.

    Article  CAS  Google Scholar 

  25. Piombo E, Abdelfattah A, Droby S, Wisniewski M, Spadaro D, Schena L. Metagenomics approaches for the detection and surveillance of emerging and recurrent plant pathogens. Microorganisms. 2021;9(1):66.

    Article  Google Scholar 

  26. Roossinck MJ. Plant virus metagenomics: biodiversity and ecology. Annu Rev Genet. 2012;46:359–69.

    Article  CAS  PubMed  Google Scholar 

  27. Kutnjak D, Tamisier L, Adams I, Boonham N, Candresse T, Chiumenti M, De Jonghe K, Kreuze JF, Lefebvre M, Silva G, et al. A primer on the analysis of high-throughput sequencing data for detection of plant viruses. Microorganisms. 2021;9(4):66.

    Article  Google Scholar 

  28. Massart S, Chiumenti M, Jonghe K, Glover R, Haegeman A, Koloniuk I, Kominek P, Kreuze J, Kutnjak D, Lotos L, et al. Virus detection by high-throughput sequencing of small RNAs: large-scale performance testing of sequence analysis strategies. Phytopathology. 2019;109(3):488–97.

    Article  CAS  PubMed  Google Scholar 

  29. Wu QF, Luo YJ, Lu R, Lau N, Lai EC, Li WX, Ding SW. Virus discovery by deep sequencing and assembly of virus-derived small silencing RNAs. Proc Natl Acad Sci USA. 2010;107(4):1606–11.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Kim NY, Lee HJ, Kim HS, Lee SH, Moon JS, Jeong RD. Identification of plant viruses infecting pear using RNA sequencing. Plant Pathol J. 2021;37(3):258–67.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Kimbrel JA, Di YM, Cumbie JS, Chang JH. RNA-Seq for plant pathogenic bacteria. Genes. 2011;2(4):689–705.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Xu GR, Strong MJ, Lacey MR, Baribault C, Flemington EK, Taylor CM. RNA CoMPASS: a dual approach for pathogen and host transcriptome analysis of RNA-Seq datasets. PLoS ONE. 2014;9(2):66.

    Article  Google Scholar 

  33. Lebas B, Adams I, Al Rwahnih M, Baeyen S, Bilodeau Guillaume J, Blouin AG, Boonham N, Candresse T, Chandelier A, De Jonghe K, et al. Facilitating the adoption of high-throughput sequencing technologies as a plant pest diagnostic test in laboratories: a step-by-step description. EPPO Bull. 2022;52(2):394–418.

    Article  Google Scholar 

  34. Hodge BA, Paul PA, Stewart LR. Occurrence and high-throughput sequencing of viruses in Ohio wheat. Plant Dis. 2020;104(6):1789–800.

    Article  CAS  PubMed  Google Scholar 

  35. Al Rwahnih M, Daubert S, Golino D, Islas C, Rowhani A. Comparison of next-generation sequencing versus biological indexing for the optimal detection of viral pathogens in grapevine. Phytopathology. 2015;105(6):758–63.

    Article  PubMed  Google Scholar 

  36. Bester R, Cook G, Breytenbach JHJ, Steyn C, De Bruyn R, Maree HJ. Towards the validation of high-throughput sequencing (HTS) for routine plant virus diagnostics: measurement of variation linked to HTS detection of citrus viruses and viroids. Virol J. 2021;18(1):61.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  37. Rott M, Xiang Y, Boyes I, Belton M, Saeed H, Kesanakurti P, Hayes S, Lawrence T, Birch C, Bhagwat B, et al. Application of next generation sequencing for diagnostic testing of tree fruit viruses and viroids. Plant Dis. 2017;101(8):1489–99.

    Article  CAS  PubMed  Google Scholar 

  38. Karavina C, Ibaba JD, Gubba A. High-throughput sequencing of virus-infected Cucurbita pepo samples revealed the presence of Zucchini shoestring virus in Zimbabwe. BMC Res Notes. 2020;13(1):53.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  39. Bharti R, Grimm DG. Current challenges and best-practice protocols for microbiome analysis. Brief Bioinform. 2021;22(1):178–93.

    Article  CAS  PubMed  Google Scholar 

  40. Sczyrba A, Hofmann P, Belmann P, Koslicki D, Janssen S, Droge J, Gregor I, Majda S, Fiedler J, Dahms E, et al. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nat Methods. 2017;14(11):1063.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Hong CJ, Manimaran S, Shen Y, Perez-Rogers JF, Byrd AL, Castro-Nallar E, Crandall KA, Johnson WE. PathoScope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples. Microbiome. 2014;2:66.

    Article  Google Scholar 

  42. Norouzi-Beirami MH, Marashi SA, Banaei-Moghddam AM, Kavousi K. CAMAMED: a pipeline for composition-aware mapping-based analysis of metagenomic data. NAR Genomics Bioinform. 2021;3(1):66.

    Article  Google Scholar 

  43. Ho T, Tzanetakis IE. VirFind: an online bioinformatics tool for plant virus detection and discovery. Phytopathology. 2014;104(11):52–52.

    Google Scholar 

  44. Barrero RA, Napier KR, Cunnington J, Liefting L, Keenan S, Frampton RA, Szabo T, Bulman S, Hunter A, Ward L, et al. An internet-based bioinformatics toolkit for plant biosecurity diagnosis and surveillance of viruses and viroids. BMC Bioinform. 2017;18:66.

    Article  Google Scholar 

  45. Zheng Y, Gao S, Padmanabhan C, Li RG, Galvez M, Gutierrez D, Fuentes S, Lin KS, Kreuze J, Fei ZJ. VirusDetect: an automated pipeline for efficient virus discovery using deep sequencing of small RNAs. Virology. 2017;500:130–8.

    Article  CAS  PubMed  Google Scholar 

  46. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016;7:66.

    Article  Google Scholar 

  47. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20(1):66.

    Article  Google Scholar 

  48. Baizan-Edge A, Cock P, MacFarlane S, McGavin W, Torrance L, Jones S. Kodoja: a workflow for virus detection in plants using k-mer analysis of RNA-sequencing data. J Gen Virol. 2019;100(3):533–42.

    Article  CAS  PubMed  Google Scholar 

  49. Koster J, Rahmann S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics. 2012;28(19):2520–2.

    Article  PubMed  Google Scholar 

  50. Merkel D. Docker: lightweight linux containers for consistent development and deployment. Linux J. 2014;2014(239):2.

    Google Scholar 

  51. Yilmaz P, Parfrey LW, Yarza P, Gerken J, Pruesse E, Quast C, Schweer T, Peplies J, Ludwig W, Glockner FO. The SILVA and “All-species Living Tree Project (LTP)” taxonomic frameworks. Nucleic Acids Res. 2014;42(D1):D643–8.

    Article  CAS  PubMed  Google Scholar 

  52. Kopylova E, Noe L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012;28(24):3211–7.

    Article  CAS  PubMed  Google Scholar 

  53. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–20.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  54. BBTools suite. https://sourceforge.net/projects/bbmap/.

  55. FastQC: FastQC: a quality control tool for high throughput sequence data; 2015. http://www.Rbioinformaticsbabrahamacuk/projects/fastqc/.

  56. Ewels P, Magnusson M, Lundin S, Kaller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  57. Garcia BJ, Simha R, Garvin M, Furches A, Jones P, Gazolla JGFM, Hyatt PD, Schadt CW, Pelletier D, Jacobson D. A k-mer based approach for classifying viruses without taxonomy identifies viral associations in human autism and plant microbiomes. Comput Struct Biotec. 2021;19:5911–9.

    Article  CAS  Google Scholar 

  58. Ren J, Ahlgren NA, Lu YY, Fuhrman JA, Sun FZ. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome. 2017;5:66.

    Article  Google Scholar 

  59. Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinform. 2011;12:66.

    Article  Google Scholar 

  60. KrakenTools: Kraken Tools; 2021. https://githubcom/jenniferlu717/KrakenTools.

  61. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  62. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng QD, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29(7):644-U130.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  63. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  64. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST plus: architecture and applications. BMC Bioinform. 2009;10:66.

    Article  Google Scholar 

  65. Buchfink B, Reuter K, Drost HG. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18(4):366.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  66. Bigot T, Temmam S, Perot P, Eliot M. RVDB-prot, a reference viral protein database and its HMM profiles. F1000Research. 2020;8(530):66.

    Google Scholar 

  67. Lefkowitz EJ, Dempsey DM, Hendrickson RC, Orton RJ, Siddell SG, Smith DB. Virus taxonomy: the database of the International Committee on Taxonomy of Viruses (ICTV). Nucleic Acids Res. 2018;46(D1):D708–17.

    Article  CAS  PubMed  Google Scholar 

  68. Li H. Aligning sequence reads, clone sequences and assebly contigs using BWA-MEM. arXiv 2013, 1303(3997v2).

  69. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Proc GPD. The sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.

    Article  PubMed  PubMed Central  Google Scholar 

  70. Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9(3):90–5.

    Article  Google Scholar 

  71. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10(2):66.

    Article  Google Scholar 

  72. Wright AA, Cross AR, Harper SJ. A bushel of viruses: identification of seventeen novel putative viruses by RNA-seq in six apple trees. PLoS ONE. 2020;15(1):66.

    Article  Google Scholar 

  73. Tamisier LH, Annelies A, Foucart Y, Fouillien N, Al Rwahnih M, Buzkan N, Candresse T, Chiumenti M, De Jonghe K, Lefebvre M, Margaria P, Reynard JS, Stevens K, Kutnjak D, Massart S. Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection. Peer Community J. 2021;1:66.

    Article  Google Scholar 

  74. Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28(4):593–4.

    Article  PubMed  Google Scholar 

Download references

Acknowledgements

We thank Katie Ko for helping us to finish Virtool analysis and edit GitHub pages. We also thank Annelies Haegeman for supplying the microbial RNA-seq data. This material was made possible, in part, by a Cooperative Agreement from the United States Department of Agriculture’s Animal and Plant Health Inspection Service (APHIS). It may not necessarily express APHIS’ views.

Funding

This research was supported by USDA-APHIS. The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Authors and Affiliations

Authors

Contributions

XH, OH and BA conceived and designed the pipeline. XH implemented the software. XH, OH and BA wrote the manuscript. RF, MM, CM and JF provided guidance and oversight, helped to refine the design and suggested improvements to the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xiaojun Hu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. Table S1

: Overview of simulated RNA-seq data using plant genomes and spiked-in pathogens.

Additional file 2. Table S2

: Comparison of bacterial pathogen detection performed by Kraken2 implemented in PhytoPipe using raw reads against SILVA rRNA database and rRNA removed reads against NCBI nt database.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Hu, X., Hurtado-Gonzales, O.P., Adhikari, B.N. et al. PhytoPipe: a phytosanitary pipeline for plant pathogen detection and diagnosis using RNA-seq data. BMC Bioinformatics 24, 470 (2023). https://doi.org/10.1186/s12859-023-05589-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-023-05589-2

Keywords