TCRs sequence data
CD8+ TCR SC and BS data were collected from: (1) the Selin and Luzuriaga Labs at UMASS; (2) VDJdb [14]; and (3) IEDB [15].
Data acquired from the Selin and Luzuriaga labs contained TCRs isolated from HLA A:02:01-restricted, naïve and peptide-specific CD8+ T cells binding to YVL (EBV-BRLF1109: HLA-A:02:01 restricted, peptide: YVLDHLIVV), GLC (EBV-BRLF1300:HLA-A:02:01 restricted, peptide: GLCTLVAML), and GIL (IAV-M158: HLA-A:02:01 restricted, peptide: GILGFVFTL). SC data were obtained from ex vivo single-cell sequencing of CD8 T cells from peripheral blood mononuclear cells (PBMCs) of four adult donors. Further information on these data can be found in [11]. BS data were obtained from ex vivo bulk sequencing of CD8 T cells from PBMCs of three adult donors. Further information on these data can be found in [16].
Human data from VDJdb was downloaded in January 2018, where paired TCR information is denoted by matching index values and unpaired chains have index values of 0. Complete SC data (confidence value \(\ge \) 1) from the Immune peptide Database (IEDB) was added to our dataset and used as the default for all analysis. To test the sensitivity of the results to data composition, we also built datasets that had IEDB data with confidence value \(\ge 0\), \(\ge 2\), and \(\ge 3\), respectively. In total, our default SC dataset comprised 1447 TCRs, BS \(\alpha \) 21,207 chains, and BS \(\beta \) 25,927 chains (for complete dataset counts see Additional file 1: Table 4). Data is available for download from the Github repository for the project.
CDR information
Our method for predicting the specificity of TCRs requires TCR gene family and complete CDR3 sequence. To obtain this, we retrived all human germline CDR loop information from the International ImMunoGeneTics Information System Gene database (IMGT/GENE-DB) [17]. CDR1 and CDR2 loops can be retrieved directly from the database. However, CDR2.5 needs to be extracted from the IMGT alignment sequence, and is defined by the residues in columns 81-86 of the gapped alignment (F+ORF+in-frame P amino acid sequences with IMGT gaps), as discussed in Dash et al. [2]. After translating the data to protein sequence, we produced non-redundant datasets by removing duplicate TCRs (sequence identity \(< 100\%\)).
Shuffled CDRs
Since we did not have access to true negative data (i.e., data showing which TCRs do not bind to particular epitopes), we randomly shuffled CDR regions within each \(\alpha \) and \(\beta \) chain, to test whether or not a loss of precision would be observed. CDR regions for each TCR were shuffled prior to assigning the receptors to the train and test sets.
Baseline method
We implemented TCRdist [2] as a baseline method for classifying TCRs according to their peptide specificity. TCRdist is based on a nearest-neighbor approach, where the distance between TCRs is obtained from protein sequence alignment scores between TCRs. Using a BLOSUM62 matrix, a protein alignment is performed between any two TCRs using CDR loops 1, 2, 2.5, and 3. Subsequently, CDR 1, 2, and 2.5 are given a weight of 1, where CDR3 is given a weight of 3. Finally, the weighted sum of the CDR loop alignment scores is used as a proximity measure, and TCRs are assigned the peptide specificity of their nearest neighbor [2].
SwarmTCR
The main idea behind SwarmTCR is that the “importance” of the \(\alpha \) and \(\beta \) chains as well as the CDR regions within these chains varies depending upon the peptide that is being recognized, as described in the literature [8]. In order to reflect this, SwarmTCR learns optimal weights for each of the eight CDR loops in a peptide-specific fashion. SwarmTCR explores the eight-dimensional (with SC data) or four-dimensional (with BS data) space of CDR weights with Particle Swarm Optimization (PSO), an established optimization technique inspired by the natural flocking behavior of birds that has been shown to achieve good performance in a wide range of optimization contexts [18].
The weights are used in a nearest-neighbor framework as done in TCRdist. We framed this as an optimization problem, where the objective is to identify a set of weights that maximize classification performance as measured by Average Precision (AP) (eq. 1). AP was selected as the objective function to address the issue of unbalanced datasets (Additional file 1: Table 4), as suggested by [19]. We used Particle Swarm Optimization (PSO) for carrying out the optimization of the weights and maximize AP on a training set.
$$\begin{aligned} {\text {AP}} = \sum _{k=1}^n P(k) \Delta r(k) \end{aligned}$$
(1)
AP is determined by the sum over every position of the precision-recall curve where k is the rank of the retrieved TCRs, n is the number of TCRs, P(k) is the precision at cut-off k, and \(\Delta r(k)\) is the change in recall from \(k - 1\) to k [20].
In PSO particles are initially placed in a multidimensional space at random, with each particle representing a possible solution to the optimization problem. At each iteration, particles move with a velocity vector that is a function of both the local best of the particle and the global best. The velocity (\(\mathbf {v}\)) and position (\(\mathbf {p}\)) of a particle i are updated at each time step t according to Eq. 2 and 3:
$$\begin{aligned} \mathbf {v}_i^{t+1}&= \omega * \mathbf {v}_i^t + c1*r1*(\mathbf {pbest}_i - \mathbf {p}_i^t) + c2*r2*(\mathbf {pbest}_g -\mathbf {p}_i^t) \end{aligned}$$
(2)
$$\begin{aligned} \mathbf {p}_i^{t+1}&= \mathbf {p}_i^t + \mathbf {v}_i^{t+1} \end{aligned}$$
(3)
where \(\omega \) is an inertia factor set to 0.5, c1 and c2 are scaling factors set to 0.5, r1 and r2 are two random numbers between 0 and 1, \(\mathbf {pbest}_i\) is the position of particle i that has resulted in the best value for the objective function so far, while \(\mathbf {pbest}_g\) is the global best (i.e., the position corresponding to the best value so far across all particles).
The optimization is set to terminate if the swarm moves \(\le 10^{-8}\) from its best position or if the change in the swarm’s best objective value is \(\le 10^{-8}\). The swarm size is set to 25, with a maximum number of 20 iterations.
The SwarmTCR model
We define as “training set” the TCRs used to obtain an optimal set of weights maximizing average precision, and test set as the TCRs where the performance of the optimal weights is evaluated. Within both sets, we have a reference subset containing labeled TCRs and a sample subset that the nearest neighbor approach compares against the reference subset to infer peptide labels for the TCRs.
Training and test sets for SC and BS data were constructed differently due to data availability, with BS data being much more abundant than SC data. For BS, the training and test sets were filled using a 50/50 split, and for both training and test sets half of the TCRs specific for a particular peptide were placed into the reference subset and the remainder into the sample subset (Fig. 6A).
For SC, 30% of all TCRs specific for a peptide were placed into the reference subset for both training and testing sets. The same reference subset was used in training and testing due to limited amounts of SC data (see Additional file 1: Table 4). In order to create the sample subsets for training and testing, the remaining 70% (TCR specific) undergoes another 70/30 split (Fig. 6B). We note that the sample reference sets are distinct for training and test. We also ensured that the different proportions of TCR peptide specificities were equally represented in training and test sets.
Once data was randomly allocated into the training and test sets as described above, we performed the PSO procedure on the training set. Each solution (optimal set of weights maximizing average precision) was then applied to the test set. Cross-validation was performed using repeated random sub-sampling for 50 iterations on both SC and BS datasets.
Crystal structure contacts and SwarmTCR output
We searched the Protein Data Bank (PDB) for TCR/pMHC crystal structure complexes containing one of the peptides in our TCR repertoires, to compare α and β chain usage and CDR loop usage with SwarmTCR weights.
We found nine complexes with the GILGFVFTL peptide (PDB IDs: 1OGA, 2VLJ, 2VLK, 2VLR, 5EUO, 5E6I, 5ISZ, 5JHD, 5TEZ), three complexes with the NVLPMVATV peptide (PDB IDs: 3GSN, 5D2L, 5D2N), and one complex with the GLCTLVAML peptide (PDB ID: 3O4L). Using the distance threshold discussed in a previous publication [8], we extracted CDR region residues within 4.5Å to the target (peptide, pMHC).
We then compared contact residue counts from the crystal structures to the SwarmTCR weights for each repertoire and the default TCRdist weight set.