Table 2 Computational frameworks designed to detect microbiota from human sequences by subtractive, filtration, or mixed methods

From: Tissue-associated microbial detection in cancer using human sequencing data

| Framework | Approach | Dependencies | Input / output | Advantages / disadvantages | Cancer validation | Refs. |
| --- | --- | --- | --- | --- | --- | --- |
| PathSeq | Alignment and de novo assembly | BLAST, BLASTN, BLASTX, MAQ, MegaBLAST, RepeatMasker, Velvet | Input: RNA-seq or DNA-seq. Output: Pathogen presence/absence | Scalable cloud computing; Feasible for known and novel pathogen identification; Two-pass subtraction with increased filtering costs | Cervical cancer (cell line and simulated data); TCGA ovarian | [63, 68] |
| SRSA | Alignment and de novo assembly | Velvet, MegaBLAST, BLAST, BWA, TopHat | Input: RNA-seq. Output: Species-level taxonomy characterization (prevalence) | Incorporates sample pre-processing, quality filtering, sequence mapping, and assembly; Not freely available; No known updates; Original work validation was limited to cell line | HIV-1 cell line | [60] |
| CaPSID | Mixed-method: simultaneous alignment, filtration, and de novo assembly | BioPython, Bowtie2, Trinity | Input: RNA-seq or DNA-seq. Output: Top-hit pathogen genome identification ranked by maximum gene coverage | Web-based, open-source, and scalable application; Modular analyses; Single-pass filtering, which may fail to subtract host reads | Ovarian cancer; TCGA stomach | [67] |
| SURPI | Dual scanning mode: known-pathogen identification or de novo assembly | SNAP, RAPSearch, BWA, BLASTN, Bowtie2, DUST in PRINSEQ | Input: Paired-end metagenomic. Output: Species-level taxonomic classification and coverage map | Scalable to cloud or standalone servers; Capacity to incorporate reference database; Dual-mode: quantitative and semi-quantitative pathogen identification | Prostate cancer (cell line, tissue biopsies); Colorectal cancer (tissue biopsies) | [71] |
| PathoScope 2.0 | Penalized probabilistic identification; Modular filtration, alignment, and assignment | SAMtools, BLASTX, Bowtie2, thetaPrior | Input: Metagenomic or genomic (RNA-seq or DNA-seq). Output: Strain-level pathogen relative abundance | Modular detailed result reporting with; Designed for low-abundance strain-level identification; MySQL server required; No connection to the population structure of relevant species | TCGA stomach | [69, 70] |
| VirusScan | Identification of known viruses and integration sites | BWA, BLAST, MegaBLAST, Pindel, RepeatMasker, PHYLIP | Input: RNA-seq. Output: Viral read abundance and integration sites | Designed for viral identification; Abundance and integration-site analyses | TCGA cancer cohorts | [72] |
| MetaShot | Two-step similarity filtering and taxonomic assessment | Bowtie2, TANGO, STAR, Bash | Input: RNA-seq or DNA-seq. Output: Assigned read report and Krona plot with relative abundance | Extracts unassigned reads; Allows for functional annotations; Slower than other applications | None | [73] |
| ConStrains | Marker-based (SNP patterns); Strain-level prediction | MetaPhlAn, PhyloPhlAn, Bowtie2, SAMtools, Metropolis-Hastings Monte Carlo | Input: Metagenomics (RNA-seq). Output: Strain-level prediction and relative abundance | Single reference strain collection; Facilitates functional analyses when combined with reference genome-based gene coverage metadata | None | [74] |
| RINS | Intersection-based identification and removal | Bowtie, BLAST, BLAT, Trinity | Input: Mate-paired RNA-seq unmapped reads. Output: Pathogen contigs | Requires prior knowledge of reference; Detection limited to user-defined parameters | Prostate cancer (cell line) | [66] |
| GRAMMy | Mixture-model Bayesian, Expectation–Maximization, and maximum likelihood estimation | BLAST, BLAT, MAQ, Bowtie, PerM, BLASY | Input: Metagenomics reads. Output: Genomic relative abundance as numerical vectors | User flexibility; Probabilistic handling of ambiguous hits; Computational efficiency | None | [76] |
Comparison of computational workflows designed to derive microbial content from human sequences by subtractive and filtering methods, broadly categorized as reference-based, reference-free, and mixed-method approaches. For each workflow, the table summarizes the data required to run the pipeline, the output produced, and the main advantages and disadvantages. Most have been validated with large cancer datasets, including TCGA sequencing data. ConStrains is reference-free, whereas all other approaches are reference-based or mixed-method.
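
The two ideas that recur across the table, host-read subtraction and probabilistic handling of ambiguous hits, can be illustrated with a toy sketch. This is not the implementation of any framework listed above: the function names, the `"human"` label, and the input layout (a mapping from read ID to the set of reference genomes the read aligned to) are all assumptions made for illustration, and the expectation-maximization step is a generic textbook version with a uniform starting point and no genome-length correction.

```python
from collections import defaultdict


def subtract_host(read_hits, host_label="human"):
    """Keep only reads that hit at least one genome and never hit the host.

    read_hits is a hypothetical layout: {read_id: set of genome names}.
    """
    return {rid: hits for rid, hits in read_hits.items()
            if hits and host_label not in hits}


def em_relative_abundance(read_hits, n_iter=50):
    """Estimate genome proportions when reads may hit several genomes."""
    genomes = sorted({g for hits in read_hits.values() for g in hits})
    if not genomes:
        return {}
    # Start from a uniform guess over all observed genomes.
    theta = {g: 1.0 / len(genomes) for g in genomes}
    for _ in range(n_iter):
        # E-step: split each read across its candidate genomes,
        # weighted by the current abundance estimate.
        counts = defaultdict(float)
        for hits in read_hits.values():
            total = sum(theta[g] for g in hits)
            for g in hits:
                counts[g] += theta[g] / total
        # M-step: renormalize the fractional counts into proportions.
        n_reads = sum(counts.values())
        theta = {g: counts[g] / n_reads for g in genomes}
    return theta


# Toy data (invented for illustration only).
toy_hits = {
    "r1": {"human"},
    "r2": {"human", "A"},      # ambiguous with host, so it is subtracted
    "r3": {"A"}, "r4": {"A"}, "r5": {"A"},
    "r6": {"B"},
    "r7": {"A", "B"}, "r8": {"A", "B"},
    "r9": set(),               # unaligned read, dropped here
}
microbial = subtract_host(toy_hits)
abundance = em_relative_abundance(microbial)
print({g: round(p, 3) for g, p in abundance.items()})
```

In this sketch a read shared between the host and a microbe is discarded outright, which is the conservative choice; a real pipeline would typically make that decision from alignment scores rather than set membership. Reads shared between two microbial genomes are not discarded but are apportioned according to the evidence from uniquely assigned reads.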