Semi-supervised learning for somatic variant calling and peptide identification in personalized cancer immunotherapy

Background Personalized cancer vaccines are emerging as one of the most promising approaches to immunotherapy of advanced cancers. However, only a small proportion of the neoepitopes generated by somatic DNA mutations in cancer cells lead to tumor rejection. Since it is impractical to experimentally assess all candidate neoepitopes prior to vaccination, developing accurate methods for predicting tumor-rejection mediating neoepitopes (TRMNs) is critical for enabling routine clinical use of cancer vaccines. Results In this paper we introduce Positive-unlabeled Learning using AuTOml (PLATO), a general semi-supervised approach to improving accuracy of model-based classifiers. PLATO generates a set of high confidence positive calls by applying a stringent filter to model-based predictions, then rescores remaining candidates by using positive-unlabeled learning. To achieve robust performance on clinical samples with large patient-to-patient variation, PLATO further integrates AutoML hyper-parameter tuning, classification threshold selection based on spies, and support for bootstrapping. Conclusions Experimental results on real datasets demonstrate that PLATO has improved performance compared to model-based approaches for two key steps in TRMN prediction, namely somatic variant calling from exome sequencing data and peptide identification from MS/MS data.

gives a summary of the options used by this command. Note that we used a slightly modified version of msgf2pin, the source code of which can be found here: https://github.com/mrForce/msgf2pin-PTM-Mass-Delta. We added the "-z" option so that post-translational modifications are annotated with mass deltas rather than UNIMOD accession codes. We did this because, at one point, we were working with a version of Percolator that was not compatible with UNIMOD accessions. Note that, although msgf2pin is part of the Percolator package, we used it as a standalone utility.

Option: Description
[…]: Include additional features that Percolator will use; see [1] for more information
-m 3: Sets the fragmentation method; in this case it is 3, since the data was generated with HCD fragmentation
-e 0: Use non-specific cleavage when creating peptides to search
-inst 3: Sets the instrument type; in this case it is 3, since the data was generated with a Q-Exactive instrument
-mod MOD FILE: A file containing the post-translational modifications to include in the search. We used cysteine carbamidomethylation as a fixed modification, since iodoacetamide was used in the experiments of [2].

Percolator command
Percolator commands were of the form:

    crux percolator --output-dir OUTPUT_DIR PIN_INPUT

We are currently using Percolator version 3.02.0 in Crux version 3.20-d57cff.

msgf2pin option -z: Displays PTMs as mass deltas, rather than UNIMOD accessions.

Figure S1: Interface for the Galaxy MS/MS search tool.

Galaxy Search tool
We created a publicly available Galaxy tool that allows users to run MS-GF+ and Percolator through a user-friendly web interface. The tool can be accessed at https://neo.engr.uconn.edu/?tool_id=msgfplus_runner; the tool version used in this study was 20.06. Figure S1 shows a screenshot of the interface.
The Galaxy search tool supports two search types. The first is called "Unfiltered Search", where the selected MGF file is searched against the selected proteome (concatenated with any user-provided FASTA files). The second type is called "Filtered Search". Briefly, the base proteome (and any uploaded FASTA files) is broken into peptides with lengths between 8 and 13 amino acids. The user specifies a set of MHC-I/HLA-I alleles, and the peptides are scored using NetMHC. For each allele-length combination, the top-scoring k percent of peptides are used in the search, where k is a user-specified parameter. In this study, we only used the Unfiltered Search.
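The peptide selection step of the Filtered Search can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names are hypothetical, `score_fn` is a stand-in for NetMHC that is assumed to return larger values for stronger predicted binders, and rounding the number of retained peptides up is also an assumption.

```python
from collections import defaultdict
from math import ceil


def enumerate_peptides(proteins, min_len=8, max_len=13):
    """Return the set of all substrings of length min_len..max_len."""
    peptides = set()
    for seq in proteins:
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                peptides.add(seq[start:start + length])
    return peptides


def filtered_search_peptides(proteins, alleles, score_fn, k_percent):
    """Keep the top k percent of peptides for each allele-length combination.

    score_fn(peptide, allele) is a hypothetical stand-in for NetMHC;
    larger values are assumed to mean stronger predicted binding.
    """
    by_length = defaultdict(list)
    for pep in enumerate_peptides(proteins):
        by_length[len(pep)].append(pep)

    selected = set()
    for allele in alleles:
        for peps in by_length.values():
            # Best score first; the peptide string breaks ties deterministically.
            ranked = sorted(peps, key=lambda p: (-score_fn(p, allele), p))
            n_keep = ceil(len(ranked) * k_percent / 100)  # assumption: round up
            selected.update(ranked[:n_keep])
    return selected
```

The union over alleles and lengths is what would be written to the search database; a peptide selected for several alleles appears only once.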
The user can give the search a name, and select which proteome to search. For this study, we used a UniProt Human proteome consisting of one protein per gene, which was downloaded from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640_9606.fasta.gz in April 2019. Currently, this is the only proteome offered for searching, though more proteomes will be added in the future.
The tool has four outputs. The first is the PIN file generated by msgf2pin, except with the second row removed. This second row contains information that Percolator needs, but is otherwise not useful to us. The second output contains the target and decoy PSMs scored by Percolator. The third is a log file, which is useful for debugging (and also shows exactly how MS-GF+, msgf2pin and Percolator were run). The fourth is an archive file, which contains, among other things, the MZID output from MS-GF+, the PIN file passed to Percolator, and the output files of the Percolator run. It also contains the FASTA file that MS-GF+ searched.
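Removing the second row of a PIN file, as done for the first output, takes only a few lines. The sketch below is illustrative rather than the tool's code; the function name is hypothetical, and the input is assumed to be the full PIN file read as text.

```python
def drop_second_row(pin_text):
    """Return the PIN text without its second line (the row after the header)."""
    lines = pin_text.splitlines(keepends=True)
    # Keep the header (first line) and everything from the third line on.
    return "".join(lines[:1] + lines[2:])
```

The result is a plain tab-separated table with a single header row, which is easier to load into downstream tools.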
Published Galaxy histories including runs for the 20 MS/MS melanoma datasets analyzed in this paper (grouped by patient) are available at:
• https://neo.engr.uconn.edu/u/jordan/h/bassani-mel3-public

FDR Filtering
To fairly assess the number of discoveries each tool makes at a given FDR cutoff, we wrote a Galaxy tool to control FDR at both the PSM and peptide level. The tool takes as input a tab-separated value file. The user specifies which columns contain the peptide, score, and label (target or decoy), the score direction (whether a larger score is better or worse), and an FDR cutoff. The tool produces one output for PSM-level FDR filtering and another for peptide-level FDR filtering. For the peptide level, it first keeps only the best-scoring PSM for each peptide and discards the poorer-scoring PSMs for that peptide. From then on, the procedure is the same for PSM-level and peptide-level FDR filtering:

    groups ← Group PSMs by score
    sortedGroups ← Sort groups by score, from best to worst
    numDecoys ← 0
    numTargets ← 0
    endIndex ← −1
    i ← 0
    for score, group ← sortedGroups do
        if psm is target then
            numTargets ← numTargets + 1
        ...
        i ← i + 1
    end for

The target PSMs in the groups up to endIndex are then controlled at FDR level α. The grouping is necessary because there will frequently be PSMs with the same score, and they must be accepted or rejected together as a group.

For Percolator, we used the "percolator score" column as the score. For MS-GF+, we used the lnEValue column in the msgf2pin output, which is simply the negative natural logarithm of a PSM's E-value. Note that MS-GF+ provides Q-values, which can also be used for FDR control; however, those Q-values are computed from the spectral E-value, whereas lnEValue is based on the database-level E-value. The reason for this discrepancy is that we forked Percolator version 3.04 to create the custom version of msgf2pin (see the "msgf2pin settings" subsection above), and that version was not able to output both lnEValue and lnSpecEValue.
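The grouped filtering procedure described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the Galaxy tool's code: the function names are hypothetical, the FDR after each group is assumed to be estimated as the ratio of decoys to targets among that group and all better-scoring groups, and endIndex is assumed to be the last group at which this estimate does not exceed α.

```python
from itertools import groupby


def best_psm_per_peptide(psms, higher_is_better=True):
    """Keep only the best-scoring PSM for each peptide (peptide-level input)."""
    best = {}
    for psm in psms:
        pep, score, _ = psm
        if pep not in best:
            best[pep] = psm
        else:
            old = best[pep][1]
            if (score > old) if higher_is_better else (score < old):
                best[pep] = psm
    return list(best.values())


def fdr_filter(psms, alpha, higher_is_better=True):
    """Return the target PSMs accepted at estimated FDR <= alpha.

    psms: iterable of (peptide, score, is_target) tuples.
    Assumption: FDR after a group = #decoys / #targets, counted over
    that group and every better-scoring group.
    """
    ordered = sorted(psms, key=lambda p: p[1], reverse=higher_is_better)
    # PSMs with equal scores form one group and are accepted or rejected together.
    groups = [list(g) for _, g in groupby(ordered, key=lambda p: p[1])]

    num_targets = num_decoys = 0
    end_index = -1
    for i, group in enumerate(groups):
        for _, _, is_target in group:
            if is_target:
                num_targets += 1
            else:
                num_decoys += 1
        if num_targets > 0 and num_decoys / num_targets <= alpha:
            end_index = i  # assumption: keep the last group meeting the cutoff

    return [p for g in groups[:end_index + 1] for p in g if p[2]]
```

For peptide-level filtering, `best_psm_per_peptide` would be applied first and its result passed to `fdr_filter`; for PSM-level filtering, `fdr_filter` is applied to all PSMs directly.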
As for MS/MS searches, we created a Galaxy tool that allows users to run the FDR filter through a user-friendly web interface. The FDR filter tool (version 20.06) can be accessed at https://neo.engr.uconn.edu/tool_runner?tool_id=FDR_custom_filter; Figure S2 displays a screenshot of its user interface.

Figure S2: Interface for the Galaxy FDR filter tool.

PLATO feature descriptions
For SNV calling, PLATO used 52 features generated by the CCCP pipeline (described in Table S3) along with the following 58 additional features extracted using SomaticSeq [3] […] Table S4, which were extracted from the MS-GF+ output.

Feature: Description
[…]: The number of additional neutrons in the peptide compared to the monoisotopic mass
MeanErrorTop7: Mean mass error of the 7 most intense peaks
sqMeanErrorTop7: Square root of MeanErrorTop7
StdevErrorTop7: Standard deviation of the mass errors of the 7 most intense peaks
Charge1, Charge2, Charge3: Spectrum charge
DeNovoScore: Score of the best-scoring peptide for the spectrum. This is among all possible peptides, not just those in the database
RawScore: The PSM score assigned by MS-GF+
Energy: Difference between RawScore and DeNovoScore
ScoreRatio: Ratio of RawScore to the maximum possible score (i.e., DeNovoScore)
lnEValue: Negative one times the natural logarithm of the database-level E-value [1]. See Kim and Pevzner [4] for a detailed description of how the E-value is calculated by MS-GF+
lnExplainedIonCurrentRatio: Logarithm of the total intensity of identified fragment ions divided by the total intensity of all ions
lnNTermIonCurrentRatio: Logarithm of the total intensity of identified N-terminal fragment ions divided by the total intensity of all ions
lnCTermIonCurrentRatio: Logarithm of the total intensity of identified C-terminal fragment ions divided by the total intensity of all ions
lnMS2IonCurrent: Logarithm of the sum of intensities of all fragment ions
PepLen: Peptide length
P1 and P6: The amino acids before and after the peptide in its protein
P2 and P3: The first two amino acids of the peptide
P4 and P5: The last two amino acids of the peptide

[4] Kim, S., Pevzner, P.: MS-GF+ makes progress towards a universal database search tool for proteomics. Nature Communications 5, 5277 (2014). doi:10.1038/ncomms6277