BBCAnalyzer: a visual approach to facilitate variant calling

Background Deriving valid variant calling results from raw next-generation sequencing data is a particularly challenging task, especially with respect to clinical diagnostics and personalized medicine. However, when using classic variant calling software, the user usually obtains nothing more than a list of variants that pass the corresponding caller’s internal filters. Any expected mutations (e.g. hotspot mutations), that have not been called by the software, need to be investigated manually. Results BBCAnalyzer (Bases By CIGAR Analyzer) provides a novel visual approach to facilitate this step of time-consuming, manual inspection of common mutation sites. BBCAnalyzer is able to visualize base counts at predefined positions or regions in any sequence alignment data that are available as BAM files. Thereby, the tool provides a straightforward solution for evaluating any list of expected mutations like hotspot mutations, or even whole regions of interest. In addition to an ordinary textual report, BBCAnalyzer reports highly customizable plots. Information on the counted number of bases, the reference bases, known mutations or polymorphisms, called mutations and base qualities is summarized in a single plot. By uniting this information in a graphical way, the user may easily decide on a variant being present or not – completely independent of any internal filters or frequency thresholds. Conclusions BBCAnalyzer provides a unique, novel approach to facilitate variant calling where classical tools frequently fail to call. The R package is freely available at http://bioconductor.org. The local web application is available at Additional file 2. A documentation of the R package (Additional file 1) as well as the web application (Additional file 2) with detailed descriptions, examples of all input- and output elements, exemplary code as well as exemplary data are included. A video demonstrates the exemplary usage of the local web application (Additional file 3). Additional file 3: Supplement_3. Video demonstrating the exemplary usage of the web application “BBCAnalyzer”. (MP4 11571 kb) Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1549-4) contains supplementary material, which is available to authorized users.


Introduction
The analysis of alignment data is an essential step in the process of analyzing sequencing data. Mutations like single nucleotide base changes, deletions and insertions may be identified in this step.
Usually, the available programs -like GATK [1] or SAMtools [2] -take a bam file, containing the alignment data in a compressed form, as input and return a vcf file containing the called variants and a selection of parameters characterizing them. The overall depth as well as the allele frequencies are returned as two of these parameters.
However, the programs do usually not report the unfiltered number of reads covering a certain mutated base, but they apply a set of partially complex filtration steps. Thresholds -internally or user-defined -are usually used to reject additional variants with a minor ratio. Furthermore, to our knowledge it is not possible to perform a comparable analysis of positions where no variant is called.
The integrative genomics viewer (IGV) [3] provides a possibility for investigating the different number of bases, deletions and insertions at any position in the genome. Yet, it is time consuming to load different samples into the program and to look at the different positions of interest. Moreover, there is no way of automatically summing up and visualizing the base counts in IGV.
With regards to the comparison of different sequencing techniques, it appears useful to have a tool that is able to visualize the background at a selection of locations where e.g. one technique calls a variant but another technique does not. Furthermore, it seems helpful to have a possibility for quickly analyzing positions of expected mutations, which have not been called.
This package provides a possibility for visualizing the number of counted bases, deletions and insertions at any given position in any genome in comparison to the reference bases. Additionally, markers for the relative base frequencies, the mean qualities of the detected bases, known mutations or polymorphisms (e.g. based on dbSNP [4]) and called variants in the data may equally be included into the plots.

Loading the package
The package can be downloaded and installed with > biocLite("BBCAnalyzer") After installation, the package can be loaded into R by typing > library(BBCAnalyzer) into the R console. BBCAnalyzer requires the R-packages VariantAnnotation, Rsamtools and BSgenome.Hsapiens.UCSC.hg19. All of them are loaded automatically when using BBCAnalyzer.

Analyzing the bases and visualizing the results
BBCAnalyzer performs an analysis of the bases, deletions and inserts at defined positions in sequence alignment data and visualizes the results. The process consists of different steps. These steps are highly dependent on each others output, which is why it is necessary to run the function analyzeBases, performing the whole process of analyzing the data and visualizing the results, at first. If any changes concerning the plotting of the results are desired, the function analyzeBases-PlotOnly may be used, taking amongst others the return value of analyzeBases as an obligatory input.

Preparation
For the correct functioning of BBCAnalyzer in the first place, various input files are necessary: The names of the samples to be analyzed have to be provided by a file. There has to be one sample name per line without the ".bam"-suffix.

V1 1 Example_IonTorrent 2 Example_454
The bam-and the corresponding bai files of the samples to be analyzed have to be provided in a folder. The names of the files have to match the names provided by the sample names file.
The target regions have to be provided by a file. The file may either contain regions (chromosome, tab, first base of a region, tab, last base of a region) or positions (chromosome, tab, position). A mixture of both is not supported. Yet, a region may cover only one base, i.e. the first and last base of a region may be identical.

Examplary object targetRegions:
> target_file <-system.file("extdata", "targetRegions.txt", package = "BBCAnalyzer") > targetRegions <-read. Additionally to the necessary input, optional input may be defined: If vcf files shall be considered, the corresponding files of the samples to be analyzed have to be provided in a folder. There has to be one file per sample. The names of the files have to match the names provided by the sample names file.

Analyzing the bases
The analysis of the bases, which actually only serves as a preparation for the final plots, is performed in different steps:

1) Determine target
If the target regions file contains regions to be analyzed, the different positions covered by the regions are determined.
If the file already contains single positions, the program directly proceeds with the next step.
It is not necessary that the regions or positions are ordered.
If a known insert is supposed to be analyzed, the position of the base succeeding the insert has to be given.

2) Analyze reads
The reads at every targeted position get analyzed. By the help of the CIGAR string the bases, deletions and inserts are determined. The output is saved as [Sample].bases.txt.
For every base -also the inserted ones -the base quality is determined ("-1" in case of a deletion). The output is saved as [Sample].quality.txt.
Reads with a mapping quality below a user defined mapping quality threshold get excluded from the analysis. Instead of a base, "MQ" is noted in the base file. Instead of a quality value, "-2" is noted in the quality file.
The function copes with uncovered positions ("NotCovered" in the base-and the quality file) and insertions >1bp (repeated analysis of the position).

3) Analyze frequency
The number of detected bases, deletions and inserts at every position is summed up. Additionally, the mean quality of the detected bases -including the inserts and for the inserted bases only -is calculated.
Bases with a quality below a user defined base quality threshold are excluded and counted separately.
The output is saved as [Sample].frequency.txt.
If the analysis shall consider vcf files as well, the alternate alleles and the genotypes -as far as they are available for the positions analyzed -are written out as well. Furthermore, for every called variant it is noted whether it is an insert.
The function copes with up to two different alternate alleles per position.

4) Report variants
The ratios of the detected bases, deletions and insertions (additionally) at every position are determined. According to the determined ratios, up to six different calls get reported. If the user defines a frequency threshold, minor variants with ratios below this threshold do not get reported. The output is saved as [Sample].calling.txt.
The function copes with insertions >1bp even if the position at which an insert is detected is not covered by all samples being analyzed.
If the analysis shall consider vcf files as well, the call -taking the reference allele, the alternate allele(s) and the genotype into account -is written out as well. For every called variant it is noted whether it is an insert and whether the genotype is heterozygous.
A list of lists is returned containing all the information that is generated in this analysis step.  A  G  T  T  T  A  A  2  T  A  G  C  T  T  A  A  3  T  A  G  C  T  T  A  A  4 T

Visualizing the results
The function -which may be called separately as well using the function analyzeBasesPlotOnly -provides different possibilities for visualizing the results: The absolute or the relative number of detected bases, deletions and inserts may be plotted Apart from these differences, all plots share some general characteristics: The function always produces barplots.
The bars are colored according to the base at the corresponding position (adenine: green; cytosine: blue; guanine: yellow; thymine: red; deletion: black; insertion: lilac edge).
The color may additionally contain information on the mean quality of the base if bounds for the lower-and upper mean quality are set. In this case, the color of the bars is darker if the mean quality of the base is closer compared to the upper quality bound. The color of the bar is lighter if the mean quality of the base is closer compared to the lower quality bound.
The reference bases are plotted on the negative y-axis below each position.
A tabix file containing known variants or mutations may be provided. In this case, more than one reference base is plotted at the corresponding position.
Every targeted position is labelled according to the chromosome and the position.
For each position to be analyzed, lines may be drawn representing relative frequencies (defined by user).
The function copes with different inserted bases at one position (stacked bars) and inserts>1bp, even if these are not covered by all samples.
If the analysis shall consider vcf files as well, the expected number of detected bases, deletions and inserts -according to the vcf file -is added to the plot using dashed lines.

Runtime
The runtime of BBCAnalyzer depends on the number of samples and the number of positions that get analyzed. Furthermore, the number of reads at the positions being analyzed influences the runtime as well.
The following test case analyzes one position for one sample. 344 reads cover the position being analyzed. A vcf file and a file containing known polymorphisms is not considered (vcf file: increases the runtime in this example by 1.625 seconds; file containing known polymorphisms: increases runtime in this example by 2.273 seconds; both: increases runtime in this example by 2.690 seconds).

Example IonTorrent
Test files Example IonTorrent small.bam and Example IonTorrent small.bai.