NeoPredPipe: high-throughput neoantigen prediction and recognition potential pipeline

Background Next generation sequencing has yielded an unparalleled means of quickly determining the molecular make-up of patient tumors. In conjunction with emerging, effective immunotherapeutics for a number of cancers, this rapid data generation necessitates a paired high-throughput means of predicting and assessing neoantigens from tumor variants that may stimulate immune response. Results Here we offer NeoPredPipe (Neoantigen Prediction Pipeline) as a contiguous means of predicting putative neoantigens and their corresponding recognition potentials for both single and multi-region tumor samples. NeoPredPipe is able to quickly provide summary information for researchers, and clinicians alike, on predicted neoantigen burdens while providing high-level insights into tumor heterogeneity given somatic mutation calls and, optionally, patient HLA haplotypes. Given an example dataset we show how NeoPredPipe is able to rapidly provide insights into neoantigen heterogeneity, burden, and immune stimulation potential. Conclusions Through the integration of widely adopted tools for neoantigen discovery NeoPredPipe offers a contiguous means of processing single and multi-region sequence data. NeoPredPipe is user-friendly and adaptable for high-throughput performance. NeoPredPipe is freely available at https://github.com/MathOnco/NeoPredPipe.


Background
Cancer cells are fraught with genomic variants in all regions of the genome with high degrees of heterogeneity in a spatially complex tumor. This intra-tumor heterogeneity (ITH) realizes a fitness landscape upon which natural selection can act (reviewed by [1]). Neoantigens, epitopes derived from proteins translated from non-synonymous variants, are able to make their way to the cell surface in the hopes of stimulating an immune response after a number of cellular processing steps have occurred, primarily proteosomal cleavage and binding with major histocompatibility complexes (MHC) I or II.
*Correspondence: ryan.schenck@univ.ox.ac.uk 1 Integrated Mathematical Oncology, Moffitt Cancer Center, 33612 Tampa, FL, USA 2 Wellcome Centre for Human Genetics, University of Oxford, OX3 7BN Oxford, UK Full list of author information is available at the end of the article This binding depends upon the patient specific human leukocyte antigen (HLA) alleles. From here, the bound neoantigen with its MHC-Class I complex makes its way to the cell surface where it may bind with cytotoxic Tcell receptors thereby eliciting infiltration of cytotoxic T-cells capable of detecting and eliminating cells carrying the neoantigen in the absence of immune evading tactics. The immune response is strongly influenced by the total number of neoantigens within a tumor, especially in hyper-mutated cancers ( [2]), as well as the ITH of antigenic mutations ( [3]). Recent advances in sequencing techniques allow for multi-region sequencing approaches whereby adjacent regions of the same tumor or tissue are able to provide greater insights into variant clonality (i.e. truly clonal, subclonal, or shared). There is increasing evidence that the neoantigen landscape of tumours can be highly heterogeneous, containing regions of subclonal immune escape and significantly different neoantigen load that can influence a patient's response to immunotherapy [4][5][6].
A number of tools are available that provide mutated peptide annotation, binding affinity prediction, wild-type and mutant peptide comparison, and neoantigen ranking based on these measures [7][8][9][10]. Their input varies from raw sequencing files (e.g. fastq) [7,8,10] to highly annotated vcf files [9]; some provide HLA-typing as part of their pipeline [7,10], but require further dependencies for HLAtyping software. Most rely on a version of netMHC or netMHCpan for binding prediction, but [9] offers a choice of additional software. For an in-depth comparison of available pipelines for neoantigen calling, we refer the reader to the recent review of Lancaster et al. [11].
Despite the increasing number and diversity of neoantigen-prediction tools, none of them possess the capability of providing predicitions on multi-region sequence data and assessing ITH of the antigenic landscape of tumours. Here, we present NeoPredPipe, a pipeline connecting commonly used bioinformatic software via custom python scripts to allow for the processing of single and multi-region variant call format (VCF) files, variant annotations, neoantigen predictions, cross-referencing with known epitopes, and performing in silico TCR recognition potential predictions in a single, clear, and proficient workflow (Fig. 1).

Implementation
The first stage in neoantigen identification from a VCF file is the proper annotation of variants to identify nonsynonymous variants. To this end, NeoPredPipe employs the widely used and efficient genomics tool, ANNOVAR ( [12]). Specifically, ANNOVAR processes samples in a way that prioritizes exonic variants, this step provides a useful means for quickly partitioning variant calls for downstream applications. The user is able to specify the genome build that they would like to use, provided it is compatible with ANNOVAR. Finally, using the cod-ing_change function of ANNOVAR and custom code, the mutated amino acid sequence is predicted from annotated nonsynonymous variant calls, and the peptide sequence surrounding the newly introduced amino acid is extracted for epitope prediction. From this step, mutations that give rise to a single amino acid change, and mutations that mutate a larger peptide segment (e.g. indels and stoplosses) are handled separately and reported in separate files to help further assessment.
Once the VCF files have been annotated and partitioned with ANNOVAR, the program determines if HLA haplotypes have been provided by the user containing the HLA-A, -B, and -C haplotypes. NeoPredPipe does not include HLA allele identification as this step in the pipeline is highly dependent upon the source of the data (WES, WGS, targeted gene panels, transcriptome data, or conducted via experimental methods), but the pipeline's github page provides detailed advice on haplotyping from WES/WGS data using the popular tool POLYSOLVER [13], and the output of POLYSOLVER is automatically processed in NeoPredPipe. In cases where no HLA haplotype information is available the most common alleles of each haplotype are assessed; while in cases where the HLA haplotypes are homozygous only that HLA haplotype is used for prediction. HLA haplotypes are cross-referenced with available HLA haplotypes prior to executing netMHCpan ( [14]) for the primary neoantigen predictions. As with the primary tool, the user is able to specify the epitope lengths to conduct predictions for (typically epitopes of 8-, 9-, or 10-mers). The output from this process yields a single file containing either filtered or unfiltered (dependent on user options) neoantigen predictions with information on the sample possessing the neoantigen and, in the case of multi-region variant calling, a presence/absence indicator for each of the sequenced regions. These predicted neoantigens are then, optionally, cross-referenced with normal peptides utilizing PeptideMatch ( [15]), whereby the candidate epitopes are assessed for novelty against a reference proteome that can be supplied by the user as a fasta file (e.g. from Ensembl or UniProt). When available, users may also provide expression data as a tsv file specific to each sample (or a single reference file) to quickly assess expression levels of the gene carrying a predicted neoantigen. This information is included in the final output table.
The steps outlined above deliver candidate information for neoantigens from provided variant calls that may be presented to cytotoxic T-cells, however, this does not inform the likelihood of a neoantigen eliciting an immune response (i.e. being recognised by a TCR). In order to predict the recognition potential we employ the algorithms and process utilized by [16]. The recognition potential is defined as the product of A and R, where A is the amplitude of the ratio of the relative probabilities of binding for the wild-type and mutant epitopes to the MHCclass I molecules; and R is a measure of similarity to pathogenic peptides, meant to represent the probability that the neoantigen in question is recognised by a TCR clone already present in the tissue/blood. To define A it is necessary to perform neoantigen predictions for the wildtype and mutant epitope: this is not performed by default by NeoPredPipe, but is supplied as an option to employ as a contiguous pipeline. To define R, NeoPredPipe utilizes the multistate thermodynamic model employed by [16], which requires alignment scores for each epitope to a curated Immune Epitope Database list of known epitopes (can be refined and updated by the user, but is provided).
In order to incorporate the ability to assess ITH in regards to both effective mutations (non-synonymous variants and indels) and neoantigen burdens, NeoPred-Pipe is capable of handling multi-region VCF files; further these files can be multi-region in only a select number of samples and differ in the number of regions. Similarly, NeoPredPipe can process multi-region expression data for samples where information on regions are compiled into separate columns. Thus NeoPredPipe is able to efficiently handle various, potentially multi-region experimental designs for neoantigen prediction and assessments providing a summary table and an optional web-based visualization tool for downstream statistical and in-depth analysis.

Results
The output of the pipeline depends largely on the options set by the user, but at the very least, NeoPredPipe provides two tables of putative neoantigens and their predicted binding affinities, one for single nucleotide/amino acid, and one for indel(-type) variants. With additional options selected it is possible to include, within a single output, whether an epitope matches a reference proteome, its expression on the RNA level and the neoantigen's recognition potential. In additon, for rapid assessment, NeoPredPipe yields summary statistics on the neoantigen burden for each sample, a rapidly executed webbased visualization, as well as information to assess ITH by reporting neoantigen burdens for clonal, subclonal, and shared variants for multi-region samples. A detailed description of NeoPredPipe's output tables and each field in these can be found at https://github.com/MathOnco/ NeoPredPipe.

Use Case
While a small, two sample, multi-region example dataset is provided with the source code for users, we demonstrate the usefulness of NeoPredPipe by applying it to a previously published dataset examining the evolutionary landscape of colorectal tumors [17]. We select two exemplary patient samples (Adenoma 3 and Carcinoma 7 in the original paper) from the dataset, and apply our pipeline using default parameters to evaluate neoantigens in each sample. Figure 2 illustrates the information included in the standard output of NeoPredPipe and potential analysis that can be performed if NeoPredPipe is combined with the output of other standard bioinformatic methods. Figure 2a provides a summary of the complex interactions between different regions of Adenoma 3, and highlights both Region 4, which harbours the highest amount of subclonal (only present in a single region) neoantigens, and the overall clonality of the sample, with 72 neoantigens detected in all regions. For quick analysis, NeoPredPipe directly outputs a summary of the clonality of neoantigens, also divided into categories of strong and weak binders (peptides with a netMHCpan percentile rank ≤ 0.5 and ≤ 2, respectively, as recommended in [14]). Figure 2b visualizes this summary on two bar-charts for Adenoma 3 and Carcinoma 7. We find that whilst the number of shared neoantigens (present in more than one, but not all regions) is highly similar between the two samples, Carcinoma 7 harbours both more clonal (present in all regions) and subclonal neoantigens; and in total 26% of the neoantigens are clonal, compared to 16% of Adenoma 3. Figure 2c shows the recognition potential value for all neoantigens in the two samples. NeoPredPipe identified 10 peptides in Adenoma 3 and 9 in Carcinoma 7 with a recognition potential value above 1. In Fig. 2d, we provide an example of integrating Neo-PredPipe outputs with downstream multi-region variant analysis. By inferring phylogenetic trees of each tumor, constructed using all exonic mutations with a variant allele frequency above 0.05 (see [17] for full methods), we find that neoantigen distributions across regions can reflect the phylogenetic distance of regions and clonal structure of samples. 31% and 23.5% of total exonic mutations are clonal in Carcinoma 7 and Adenoma 3, similarly to the clonality of neoantigens shown in Panel B. This approach also highlights regions with neoantigen loads different from their closest neighbors, such as Region61 and Region62 of Carcinoma 7. Therefore the analysis can inform future experimental and bioinformatic investigations of samples allowing for new evolutionary and mechanistic insights into tumor development, evolution, and progression.

Conclusions
We present NeoPredPipe, an efficient, high-throughput, and user-friendly pipeline for neoantigen prediction and interrogation for single and multi-region tumor VCF files. By tying together commonly utilized bioinformatics toolsets and integrating recent advances in neoantigen assessment, NeoPredPipe yields concise information typically required by researchers and clinicians. Through user options, based on the individuals own computational limitations, the pipeline is scalable for a high performance computing (HPC) cluster environment and customizable for individual research questions. Furthermore, unlike existing methods [7][8][9][10], NeoPredPipe can process a directory containing numerous samples in a single command; therefore provides a user-friendly way for not computer-proficient users to analyse the output of large studies or compare against reference datasets. All source code and an extensive read me for each component of NeoPredPipe with all pipeline options are available at https://github.com/MathOnco/NeoPredPipe.  Number of neoantigens in the two samples that are clonal (present in all regions, shown in blue), shared (present in at least two regions, in yellow) or subclonal (present in a single region, red). Separate counts of weak and strong MHC-binding neoantigens (WB and SB, respectively) are also shown. c Distribution of recognition potential values of neoantigens present in Adenoma 3 (green) and Carcinoma 7 (red). The boxplots represent the median and upper and lower 25 percentile. Only neoantigens with recognition potential higher than zero are shown. d Phylogenetic tree reconstructed from all exonic mutations for Adenoma 3 (left) and Carcinoma 7 (right). Pie-charts and the bar-charts represent the number of weak (orange) and strong (red) binder neoantigens assigned to each branch. The size of each circle is proportional to the percentage of total neoantigens on that branch