Table 2 Computational frameworks designed to detect microbiota from human sequences by subtractive, filtration, or mixed methods

From: Tissue-associated microbial detection in cancer using human sequencing data

Framework

Approach

Dependencies

Input/output

Advantages/disadvantages

Cancer validation

Refs.

PathSeq

Alignment and de novo assembly

BLAST

BLASTN

BLASTX

MAQ

MegaBLAST

RepeatMasker

Velvet

Input: RNA-seq or DNA-seq

Output: Pathogen presence/absence

Scalable cloud computing

Feasible for known and novel pathogen identification

Two-pass subtraction with increased filtering costs

Cervical cancer (cell line and simulated data)

TCGA ovarian

[63, 68]

SRSA

Alignment and de novo assembly

Velvet

MegaBLAST

BLAST

BWA

TopHat

Input: RNA-seq

Output: Species-level taxonomy characterization (prevalence)

Incorporates sample pre-processing, quality filtering, sequence mapping, and assembly

Not freely available

No known updates

Original work validation was limited to cell line

HIV-1 cell line

[60]

CaPSID

Mixed-method: simultaneous alignment, filtration, and de novo assembly

BioPython

Bowtie2

Trinity

Input: RNA-seq or DNA-seq

Output: Top-hit pathogen genome identification ranked by maximum gene coverage

Web-based, open-source and scalable application;

Modular analyses;

Single pass filtering, which may fail to subtract host reads

Ovarian cancer

TCGA stomach

[67]

SURPI

Dual scanning mode: known-pathogen identification or de novo assembly

SNAP

RAPSearch

BWA

BLASTN

Bowtie2

DUST in PRINSEQ

Input: Paired-end metagenomic

Output: Species-level taxonomic classification and coverage map

Scalable to cloud or standalone servers

Capacity to incorporate reference database

Dual-mode: quantitative and semi-quantitative pathogen identification

Prostate cancer (cell line, tissue biopsies)

Colorectal cancer (tissue biopsies)

[71]

PathoScope 2.0

Penalized probabilistic identification; Modular filtration, alignment and assignment

SAMtools

BLASTX

Bowtie2

thetaPrior

Input: Metagenomic or genomic (RNA-seq or DNA-seq)

Output: Strain level pathogen relative abundance

Modular detailed result reporting with

Designed for low abundance strain-level identification

MySQL server required; no connection to the population structure of relevant species

TCGA stomach

[69, 70]

VirusScan

Identification of known viruses and integration sites

BWA

BLAST

MegaBLAST

Pindel

RepeatMasker

PHYLIP

Input: RNA-seq

Output: Viral read abundance and integration sites

Designed for viral identification;

Abundance and integration sites analyses

TCGA cancer cohorts

[72]

MetaShot

Two-step similarity filtering and taxonomic assessment

Bowtie2

TANGO

STAR

Bash

Input: RNA-Seq or DNA-Seq

Output: Assigned read report and Krona plot with relative abundance

Extracts unassigned reads;

Allows for functional annotations;

Slower than other applications

None

[73]

ConStrains

Marker-based (SNP patterns)

Strain-level prediction

MetaPhlAn

PhyloPhlAn

Bowtie2

SAMtools

Metropolis-Hastings Monte Carlo

Input: Metagenomics (RNA-seq)

Output: Strain-level prediction and relative abundance

Single reference strain collection;

Facilitates functional analyses when combined with reference genome-based gene coverage metadata

None

[74]

RINS

Intersection based identification and removal

Bowtie

BLAST

BLAT

Trinity

Input: Mate-paired RNA-seq unmapped reads

Output: Pathogen contigs

Requires prior knowledge of reference;

Detection limited to user-defined parameters

Prostate cancer (cell line)

[66]

GRAMMy

Mixture-model Bayesian approach with Expectation–Maximization and maximum likelihood estimation

BLAST

BLAT

MAQ

Bowtie

PerM

BLASY

Input: Metagenomics reads

Output: Genomic relative abundance as numerical vectors

User flexibility

Probabilistic handling of ambiguous hits

Computational efficiency

None

[76]

  1. Comparison of computational workflows designed to derive microbial content from human sequences by subtractive and filtering methods, broadly categorized as reference-based, reference-free, and mixed-methods approaches. The data required to run each pipeline, its output, and its advantages and disadvantages are summarized. Most have been validated with large cancer datasets, including TCGA sequencing data. ConStrains is reference-free, while all other approaches are reference-based or mixed-methods.
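
The subtractive idea shared by several frameworks in the table (remove host-like reads first, then assign what remains to microbial references) can be illustrated with a toy sketch. This is not the implementation of any listed tool: the frameworks above rely on aligners such as Bowtie2, BWA, or BLAST, whereas this sketch swaps in exact k-mer matching to stay self-contained. The function names, the `k` and `max_host_fraction` parameters, and the sequences are all hypothetical.

```python
# Toy subtractive filtering: exact k-mer matching stands in for a real aligner.
# All names, parameters, and sequences here are illustrative assumptions.

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def subtract_host(reads, host_reference, k=5, max_host_fraction=0.5):
    """Keep reads whose fraction of host-matching k-mers is below the cutoff."""
    host_kmers = kmers(host_reference, k)
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            # Reads shorter than k cannot be evaluated and are dropped.
            continue
        host_fraction = len(read_kmers & host_kmers) / len(read_kmers)
        if host_fraction < max_host_fraction:
            kept.append(read)
    return kept


def assign_reads(reads, microbial_references, k=5):
    """Count reads per reference, choosing the reference sharing the most k-mers."""
    index = {name: kmers(seq, k) for name, seq in microbial_references.items()}
    counts = {}
    for read in reads:
        read_kmers = kmers(read, k)
        best_name, best_shared = None, 0
        for name, ref_kmers in index.items():
            shared = len(read_kmers & ref_kmers)
            if shared > best_shared:
                best_name, best_shared = name, shared
        label = best_name if best_name is not None else "unassigned"
        counts[label] = counts.get(label, 0) + 1
    return counts


# Made-up sequences: the host uses only A/C and the microbe only G/T,
# so they cannot share any k-mer.
host = "AAAAACCCCC" * 3
references = {"microbe_x": "GGGGGTTTTTGGTTGGTTGT"}
reads = ["AAAAACCCCCAAAA", "GGGGGTTTTTGG", "GTGTGTGTGTGT"]

kept = subtract_host(reads, host)            # step 1: subtract host-like reads
counts = assign_reads(kept, references)      # step 2: assign the remainder
print(kept)
print(counts)
```

In this toy run the first read is discarded as host-derived, the second is assigned to the hypothetical `microbe_x` reference, and the third is retained but left unassigned. Retained but unassigned reads are the kind that frameworks with a de novo assembly step would pass on for assembly.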