While many tools are currently available for primary analysis of sequencing data, there is a shortage of solutions for tertiary analysis, that is, the process of extracting insights from the data produced by the upstream analysis steps. Although one can argue that the wide range of high-level analyses does not allow the development of a general-purpose tertiary analysis tool, a major component of tertiary analysis, namely the identification of common groups of variations that affect certain phenotypes in a given population, has yet to be addressed properly.
There are integrated tools designed to provide a cohesive platform for the analysis of next-generation sequencing data. These packages include various tools for primary, secondary and tertiary analysis. Here we compare our tool against two of the most widely used packages, the Genome Analysis Toolkit (GATK) and Genome MuSiC, focusing mainly on their tertiary analysis functions. The most prominent advantage of MuteProc over these tools is its integration of variation and annotation databases, which makes the management of multiple large-scale projects convenient and efficient. This is extremely challenging to achieve with the existing tools, since they rely on processing large data files. The GATK package consists of various groups of analytical utilities that mostly deal with primary analysis and Quality Control (QC) steps. In particular, we found only one utility within the GATK that processes cancer-specific variations, i.e. SomaticIndelDetector, and this utility can only predict somatic indels in one target sample at a time. Other variation analysis utilities, such as VariantAnnotator, Variant Discovery, and Evaluation and Manipulation, either provide primary analysis of individual variants or are limited to the analysis of a single sample rather than a cohort of samples, which is the prominent feature of MuteProc.
The MuSiC package, on the other hand, enables collective analysis of mutations across a group of samples, and in this sense it is a more appropriate benchmark for MuteProc. The MuSiC package consists of a collection of downstream analysis tools designed to (1) apply statistical methods to identify significantly mutated genes, (2) highlight significantly altered pathways, (3) investigate the proximity of amino acid mutations in the same gene, (4) search for gene-based or site-based correlations to mutations and relationships between mutations themselves, (5) correlate mutations to clinical features, and (6) cross-reference findings with relevant databases such as Pfam, COSMIC, and OMIM. Aside from the pathway analysis and clinical correlation utilities, which we aim to include in later versions, MuteProc provides all the analytical power of MuSiC with three major advantages:
While the input variations to the MuSiC package are validated or predicted somatic mutations, MuteProc predicts the somatic mutations from the raw mutations generated by variant callers. This is in itself a very challenging task, as the mutation sets detected by current variant callers contain a significant amount of noise. MuteProc predicts somatic variations by filtering tumor mutations against the mutations in matched normal samples, the mutations in other normal samples in the database, and datasets of known polymorphisms such as dbSNP. The mutations that remain after this stringent filtering stage are then validated by high-throughput analysis of the mapped reads in the tumor and matching normal samples. Additionally, the mutation frequencies in cancer and normal samples are calculated, and each mutation is classified as synonymous, non-synonymous or non-coding.
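The filtering stage described above can be illustrated with a minimal sketch. It is not the MuteProc implementation: the `Variant` type, the function name `candidate_somatic` and the use of in-memory sets in place of the variation database are all assumptions made for illustration.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical variant key; not the MuteProc database schema."""
    chrom: str
    pos: int   # 1-based position
    ref: str
    alt: str


def candidate_somatic(tumor: Set[Variant],
                      matched_normal: Set[Variant],
                      other_normals: Set[Variant],
                      known_polymorphisms: Set[Variant]) -> Set[Variant]:
    """Keep tumor variants that appear in none of the background sets."""
    background = matched_normal | other_normals | known_polymorphisms
    return tumor - background


# Toy data: one variant is shared with the matched normal and one is a
# known polymorphism, so only the chr1 variant survives the filter.
tumor = {
    Variant("chr1", 100, "A", "T"),
    Variant("chr2", 200, "G", "C"),
    Variant("chr3", 300, "C", "CA"),
}
matched_normal = {Variant("chr2", 200, "G", "C")}
other_normals: Set[Variant] = set()
known = {Variant("chr3", 300, "C", "CA")}

result = candidate_somatic(tumor, matched_normal, other_normals, known)
```

In this sketch a variant is removed only when chromosome, position, reference and alternate alleles all match a background entry. Whether the actual filter matches on exact alleles or on position alone is not stated here, so that choice is an assumption.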
MuteProc allows mutation analysis over a wide range of annotated genomic regions, such as microRNA targets, promoters, enhancers, transcription factor binding sites, regulatory loci and more. In fact, any annotation set can be incorporated into the analysis simply by importing it into the annotation database.
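As an illustration of how an imported annotation set can be used to label mutations, the following sketch looks up the annotated regions that contain a given position. The function names, the tuple layout and the region labels are hypothetical, and a simple per-chromosome scan stands in for the annotation database.

```python
from collections import defaultdict


def build_index(regions):
    """Group (chrom, start, end, label) records by chromosome, sorted by start."""
    index = defaultdict(list)
    for chrom, start, end, label in regions:
        index[chrom].append((start, end, label))
    for intervals in index.values():
        intervals.sort()
    return index


def annotate(index, chrom, pos):
    """Return labels of regions whose closed interval [start, end] contains pos."""
    return [label for start, end, label in index.get(chrom, [])
            if start <= pos <= end]


# Toy annotation set with made-up coordinates and labels.
regions = [
    ("chr1", 90, 110, "promoter:GENE_A"),
    ("chr1", 105, 150, "miRNA_target:site_1"),
    ("chr2", 10, 20, "enhancer:e1"),
]
index = build_index(regions)

labels_107 = annotate(index, "chr1", 107)  # inside both chr1 regions
labels_120 = annotate(index, "chr1", 120)  # inside the second region only
labels_none = annotate(index, "chr3", 5)   # chromosome without annotations
```

A linear scan is adequate for a toy example. For genome-scale annotation sets an interval tree or an indexed database query would be the more usual choice.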
MuteProc provides an efficient QC utility for the identified somatic mutations. The QC processes the mapped reads at each somatic variation location in the tumor and matched normal BAM files to determine whether the variation is likely to be somatic, germline or the result of an artifact. Note that germline mutations are not excluded from the analysis; instead, they are reported separately, since causative predisposing mutations are of interest in many studies. The QC results are written to HTML files that show the alignment profiles of the variations in the tumor and matched normal samples side by side for easier comparison. These results are incorporated in the final HTML report, with hyperlinks provided for easy access.
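The three-way QC decision can be sketched as a rule on read counts at a variant position. This is only an illustration under stated assumptions: the function name, the thresholds (`min_depth`, `min_tumor_vaf`, `max_normal_vaf`) and the decision rule itself are hypothetical and are not taken from MuteProc. The sketch also takes the read counts as given, whereas in practice they would be extracted from the tumor and matched normal BAM files.

```python
def classify_variant(tumor_alt, tumor_depth, normal_alt, normal_depth,
                     min_depth=10, min_tumor_vaf=0.10, max_normal_vaf=0.02):
    """Label a variant 'somatic', 'germline' or 'artifact' from read counts.

    tumor_alt / normal_alt: reads supporting the alternate allele.
    tumor_depth / normal_depth: total reads covering the position.
    All thresholds are illustrative assumptions.
    """
    # Too little coverage in either sample: no reliable call is possible.
    if tumor_depth < min_depth or normal_depth < min_depth:
        return "artifact"

    tumor_vaf = tumor_alt / tumor_depth
    normal_vaf = normal_alt / normal_depth

    # Weak support in the tumor suggests a sequencing or mapping artifact.
    if tumor_vaf < min_tumor_vaf:
        return "artifact"
    # Clear support in the matched normal suggests a germline variant.
    if normal_vaf > max_normal_vaf:
        return "germline"
    return "somatic"


call_somatic = classify_variant(12, 40, 0, 35)    # tumor only
call_germline = classify_variant(20, 40, 18, 36)  # present in both samples
call_artifact = classify_variant(1, 50, 0, 45)    # one supporting read
```

Treating low-coverage positions as artifacts is a simplification; a separate "insufficient evidence" label would be an equally reasonable design.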
We believe that our mutation analysis package provides some advantages over the existing tools in managing large-scale projects involving thousands of samples across multiple cohorts.