Method outline
The general framework of cDNA-detector consists of three steps: 1. Preparation of a gene model file, for example all exons for a given species and genome assembly, 2. detection of potential cDNA in a BAM-format alignment file, 3. optional clean-up of identified cDNA sequencing reads from the alignment (Fig. 1B; Additional file 1: Figures S1 and S2A).
Generation of gene model file
An annotation file containing genomic coordinates for exon positions for a given species and assembly is required to run cDNA-detector. A processing script to generate such gene models from GTF format annotation files is part of the cDNA-detector distribution. Pre-generated models for the human (hg19 and hg38) and mouse exome (mm10 and mm39) are provided with the tool. Users may consider adding entries for specific transcript variants (such as truncated genes), custom amplicons, or non-coding RNAs that are suspected or known contaminants.
Detection of candidate cDNA in DNA sequencing experiments
cDNA source reads at exon boundaries
Soft-clipped reads as identified by a read’s CIGAR string (S) in the BAM file are considered potential contaminant reads. In addition, cDNA-detector looks for reads with mismatches just past the exon boundary inside the intron (by MD tag). Reads whose overhang into the intron sequence is an exact substring of the consensus sequence of other soft-clipped reads in this location are also counted as possible cDNA reads. Consensus sequences are generated by aligning the intron overhang of all clipped reads, including a base with ≥ 80% frequency in the consensus.
Statistical identification of candidate cDNAs
To evaluate whether the number of soft-clipped boundary-spanning reads for a given exon exceeds the number expected by chance, we build a background expectation model by using all soft-clipped reads overlapping any exon region boundary as defined in the gene model. We then test for significant enrichment of soft-clipped reads, as calculated above, for each single exon boundary genomic position. We define the total number of reads overlapping a specific exon boundary, including genomic reads from the target experiment properly mapped to the adjacent intron, as \({n}_{t}\). The total number of soft-clipped and overhang reads is defined as \({n}_{c}\). The total number of soft-clipped reads in all exon regions is \({N}_{c}\), and the total number of all reads in all exon regions is defined as \({N}_{t}\). We use a beta-binomial model to calculate the probability of observing at least \({n}_{c}\) reads at a given exon boundary by chance (Eq. 1):
$$P\left({X\ge n}_{c}\right)=1-\sum\limits_{x=0}^{{n}_{c}-1}\left(\genfrac{}{}{0pt}{}{{n}_{t}}{x}\right)\frac{B(\alpha +x, {n}_{t} -x + \beta )}{B( \alpha , \beta )}$$
(1)
For \(n_c >0 \) \(X\) in {\({n}_{c}, {n}_{c}+1, \dots , {n}_{t}\)}, where α = \({N}_{c}\), β = \({N}_{t}-\) \({N}_{c}\) + 1 and B(α, β) is the Beta function.
An exon with contaminant cDNA will have soft-clipped reads at both of its boundaries. We therefore combine the P-values for both exon boundaries for all exons using the harmonic mean P-value [25] (to account for dependence between the two P-values). Finally, all combined exon-wise P-values are corrected with the Benjamini–Hochberg procedure [26]. Exons with corrected P-value (Q-value) ≤ 0.05 are considered candidate cDNA-exons. To further increase confidence that a flagged exon originates from cDNA, the overhang consensus sequence of soft-clipped reads at both exon boundaries will be compared to the sequence of the neighboring exon(s). If at least one substring match is found, exons are kept for further analysis. Finally, transcripts with ≥ 30% of exons passing the above criteria will be reported. Because the background model is built on the total number of clipped reads at exon boundaries, it is recommended to remove adapter sequences before alignment to increase sensitivity especially for low-concentration contaminants (Additional file 1: Figure S7).
cDNA source inference
cDNA-detector attempts to identify the source of a candidate cDNA to distinguish endogenous (retrocopy) and exogenous (vector) origins. Source inference is performed by querying the overhang sequence of the 5’ and 3’ ends of a candidate cDNA against the NCBI UniVec database (https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/) and Repbase database of human and mouse known repeat sequences [12]. The highest-scoring match with E-value ≤ 10 is taken as probable origin. Accordingly, a “vector” is reported if the top match stems from UniVec and “retrocopy” is inferred if this hit comes from RepBase. In addition, we require a distance of > 5 bp for a repeat match from the clipped position for a “retrocopy” assignment, as these genes often carry adjacent UTR sequence [27]. In case where no suitable match is found, we use the distance from the exon boundary to the clip as estimate: if the distance is ≤ 5 bp, a cloned vector sequence is more likely (“vector-likely”); if the distance is greater than 10 bp, we infer a possible retrogene (“retrocopy-likely”); otherwise the source is labeled “unknown”. Because source inference strongly depends on the length of the overhang sequence available for BLAST query, it is thus most reliable for experiments with long read lengths.
Removal of contaminant reads from alignment
After the candidate cDNA contamination detection step, a new alignment file can be generated with candidate contaminant reads removed. This step is straightforward for soft-clipped reads at flagged exon boundaries, as these are very likely contaminants. However, exogenous DNA will also be present as fully aligned reads inside exon regions without overlapping boundaries. In some cases, such as a first exon, these reads can overlap with actual ChIP-seq or ATAC-seq signal derived from the proximal promoter. To address this issue, cDNA-detector assigns reads in candidate cDNA exon regions into three classes: (1) clipped reads and their mates clearly obtained from cDNA contamination (\({R}_{c}\)); (2) reads derived from the genome aligned across an exon boundary without soft-clips or mismatches, neighboring exon homology, or a proper mate mapped outside the exon (\({R}_{g}\)); and (3) reads which cannot be unambiguously assigned to either class (\({R}_{a}\)) due to full genomic alignment inside an exon region.
For each candidate contaminant exon, cDNA-detector will first calculate the ratio of \({R}_{c}\) to all unambiguous reads (\({R}_{c}+{R}_{g}\)), and then remove the same fraction of ambiguous reads fully mapped inside the exon regions (\({R}_{a}\)):
$$R_{na} = nint\left( { \frac{{R_{c} }}{{R_{g} + R_{c} }}*R_{a} } \right)$$
where \({R}_{na}\) is the number of ambiguous reads to remove, nint is the nearest integer function. \({R}_{na}\) reads are then randomly selected and removed from this exon. After removal of contaminant reads in all flagged exons, a separate “clean” BAM file is written to a specified output location.
Performance
cDNA simulation experiments for sequencing strategy and read length
To evaluate detection performance, we generated simulated contaminating cDNAs by randomly selecting 100 cDNAs sequences from the CCDS database [28] for each of 10 simulation experiments. Selected cDNAs were “cloned” in silico into the pLX307 vector sequence (http://www.addgene.org/117734/), by replacing the luciferase gene with the corresponding tested cDNA. We then randomly simulated artificial paired-end or single-end reads with read lengths 30 bp, 35 bp, 50 bp, 100 bp, 120 bp, 150 bp, and fragment size 350 bp for paired-end experiments, from the vector with insert, to an average target coverage of 100 × . Simulated reads were aligned to the hg38 human genome assembly with bwa mem [15]. Aligned reads were then added to a T47D ATAC-seq experiment with ERBB2 reads from an intentionally introduced construct removed with cDNA-detector. For experiments with < 100 × coverage, we randomly downsampled to the desired target coverage from this alignment. cDNA-detector was run with default settings to discover simulated cDNA contamination.
Resource consumption
In order to test the time and memory usage and multi-threaded performance of cDNA-detector, we randomly extracted reads to the desired read number from exome SRR1819826 [3]. Time and memory consumption were assessed for cDNA-detector with default settings.
Comparison with related tools
For comparison of cDNA-detector with Vecuum (1.0.1) [4], SeqtrimNext (2.0.68) [13], DeconSeq (0.4.3) [14] and VecScreen (BLAST 2.9.0 +) (https://www.ncbi.nlm.nih.gov/tools/vecscreen/), we used the simulation strategy outlined above. For SeqtrimNext, DeconSeq, VecScreen, original FASTQ or FASTA files were used as input. Alignments (BAM) were input into Vecuum and cDNA-detector. Due to different functionality of these tools, we measured the ability to identify mapped contaminant reads with the following metrics: Recall was defined as (called true simulated mapped reads)/(true simulated mapped reads); precision was defined as (called true simulated mapped reads)/(called mapped reads); F1 was defined as (2 × precision × recall)/(precision + recall). Performance to detect cDNA inserts (only Vecuum and cDNA-detector) was evaluated with recall defined as (called true cDNAs)/(true spiked-in cDNAs), precision as (called true cDNAs)/(called cDNAs) and F1 as (2 × precision × recall)/(precision + recall). Due to possible identical alignments for reads derived from homologous genes, a homolog was also counted as true positive if identified as cDNA candidate.
We further compared cDNA-detector to Vecuum using the exome data used in the respective publication (SRP055482; [3]). FASTQ files were aligned to the hg19 genome assembly with bwa mem with the default settings [15]. cDNA-detector was run with –num_initial_potential_cdna 5000 (allowing a larger number of initial cDNA candidates for evaluation) to detect cDNA contamination. Candidate cDNAs were manually reviewed for presence of soft-clipped reads and other evidence in the Integrative Genome Viewer (IGV; [29]).
We also applied Vecuum to the samples from the TCGA, NCBI SRA (human samples only) and ENCODE in which cDNA-detector identified candidate cDNAs. Vecuum run with default parameters (minimal match length l = 20 and mapping quality threshold Q = 30) detected few cDNAs in the comparison data. Individual hits could be recovered with Vecuum with more relaxed parameters (-l15 -Q1 for TCGA ATAC-seq; -l 10 -Q1 for NCBI SRA and ENCODE data), although at the cost of lost specificity.
Assessment of downstream effect of cDNA contamination and application of cDNA-detector
For assessment of peak calling after cDNA decontamination, we used the simulated cDNA strategy described above. MACS2 (2.2.7.1) [30] was applied to call peaks in the original, contaminated, and decontaminated (by either cDNA-detector or Vecuum) samples. Differential peaks were counted with BEDTools (v2.30.0) [31].
For assessment of variants of cDNA contamination, we generated simulated cDNAs with one random point mutation (limited to chromosome 10) and added them to the human HG002 whole-genome sequencing sample from the Genome In A Bottle project [32]. Then we applied cDNA-detector to identify and remove cDNAs. Bcftools (1.9) [33] was used to call variants in the original HG002, contaminated, and decontaminated samples.
Effect of adapter sequences on cDNA-detector
We aligned reads from the T47D ATAC-seq sample before adapter removal to the hg38 human genome, then simulated cDNAs as described above, and compared performance metrics to the same sample after adapter removal.
Application to public datasets
We applied cDNA-detector snapshot 26 to three large public data sets using the Terra.bio platform and on local compute: 402 ATAC-seq experiments from primary tumors from TCGA [5], 14,485 ChIP-seq, ChIA-PET, ATAC-seq and DNase-seq tracks from ENCODE [34] with read length > 30 bp, 324 exomes from the Cancer Cell Line Encyclopedia [24], and 317 human and 36 mouse studies from the NCBI’s SRA repository (see references in Additional files 5 and 6: Tables S4 and S5). Adapters were removed from TCGA ATAC-seq libraries before running cDNA-detector, and results were inspected manually. Only high-confidence cDNA candidates are reported. SRA studies were selected based on the availability of BAM-format files and restricted to ChIP-seq, ATAC-seq and DNase-seq experiments with sequence read length > 30 bp. We then randomly selected one or two (if the study contained more than one experiment) samples from those projects. SRA files were downloaded, then converted to BAM format with sam-dump (https://ncbi.github.io/sra-tools/sam-dump.html) and samtools [35] with default parameters. Analysis with cDNA-detector was performed as described above. Potential cDNAs were manually reviewed in IGV.