SwarmTCR: a computational approach to predict the specificity of T cell receptors

Background With more T cell receptor sequence data becoming available, the need for bioinformatics approaches to predict T cell receptor specificity is even more pressing. Here we present SwarmTCR, a method that uses labeled sequence data to predict the specificity of T cell receptors using a nearest-neighbor approach. SwarmTCR works by optimizing the weights of the individual CDR regions to maximize classification performance. Results We compared the performance of SwarmTCR against another nearest-neighbor method and showed that SwarmTCR performs well both with bulk sequencing data and with single cell data. In addition, we show that the weights returned by SwarmTCR are biologically interpretable. Conclusions Computationally predicting the specificity of T cell receptors can be a powerful tool to shed light on the immune response against infectious diseases and cancers, autoimmunity, cancer immunotherapy, and immunopathology. SwarmTCR is distributed freely under the terms of the GPL-3 license. The source code and all sequencing data are available at GitHub (https://github.com/thecodingdoc/SwarmTCR). Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04335-w.

and is very large, with numbers in human that are thought to exceed 10 20 possible distinct receptors [4].
The collection of TCRs possessed by an individual is known as the T cell repertoire, which is shaped over time by the history of infections in combination with stochastic factors, and is in turn responsible for determining the outcome of an immune response.One of the ultimate goals of T cell repertoire analysis is predicting the specificity of the T cells of an individual using sequence information alone [2].This would entail determining the identity of the peptide(s) that each TCR is capable of recognizing by computationally analyzing the TCR sequences from CD8+ cells circulating in the peripheral blood of an individual.More high-quality TCR sequencing data and peptide binding information specificity will be needed to achieve this objective.On the sequencing front, although the availability of TCR sequencing data is still fairly limited, the field has seen steady progress and technological advancements [5].
Two main technologies are currently available for sequencing TCRs: (1) single cell (SC) sequencing and (2) bulk sequencing (BS).SC TCR sequencing technology allows to reconstruct the complete sequence of a TCR with paired α and β chain sequence information, but its cost is still limiting the amount of available data.In contrast, BS technology is more affordable and has yielded substantially larger amounts of data, but reconstructing the correct α and β chain pairs within a TCR is not possible with this technology.
Being able to map the specificity of human repertoires can equip us with powerful new tools for studying autoimmunity, cancer immunotherapy, and immunopathology [6].However, for these methods to be broadly applicable it is critical to sample T cell repertoires deeply and in multiple individuals, as well as to account for the diversity of binding topologies to pMHCs with computational approaches.Here we introduce SwarmTCR, a computational method to predict the specificity of TCRs for class I MHC/peptide complexes that compares favorably to the nearest-neighbor based approach TCRdist [2] on both SC and BS data.
TCRdist uses a nearest-neighbor approach with a pairwise sequence alignment score between TCRs as a proximity measure.The two chains are weighted equally, and the CDR3 region is weighted three times more than the other CDR regions.While this is a reasonable choice considering the importance of CDR3 for peptide binding, it does not take into consideration the fact that the two chains and the regions within them might have different levels of involvement in binding to the pMHC, depending upon the peptide being presented and the MHC type.In a recent study, we curated a non-redundant set of TCR/pMHC crystal structures and explored binding topologies of TCR/pMHC complexes and the number of contact residues ( ≤ 4.5 [7]) made by α and β chains with pMHC [8].Our results indicated a wide range of TCR binding angles and a variable use of the α (7-25 contacts) and β (6-22 contacts) chains in making contacts with the pMHC.We also computed the number of alpha and beta contacts to the pMHC, determining a ratio of contacts ( α/β ratio) for each structure.In some complexes, the α chain had a much larger number of interactions with the pMHC than the β chain (corr.= 0.77, p < 1.6x10 −14 ), whereas in other complexes the β made more interactions than the α chain (corr.= −0.73,p < 2.2x10 −12 ).In other complexes, we saw an almost equal number of αβ interactions with the pMHC ( ∼ 15 contacts per chain).Taken together, these results suggest a wide range of binding recognition modes, which should be reflected in a computational method to predict TCR binding specificities.
As a first step to leverage these findings, we developed SwarmTCR, a method to predict TCR specificity that automatically learns the optimal set of weights to assign to each CDR region based on classification accuracy in a cross-validation setting (Fig. 1).In addition to CDR1, CDR2, and CDR3, the method also incorporates the CDR2.5 region (a loop between CDR2 and CDR3 that can interact with the pMHC, as discussed in [2]), for a total of four weights per chain.By directly optimizing the weights for CDR regions in an peptide-specific fashion, our method automatically accounts for the diversity in pMHC recognition that is documented in crystal structures (see Methods).
We applied our method to SC and BS data and compared its performance against that of TCRdist.In addition to performing in most cases better than TCRdist, the weights returned by SwarmTCR in SC sequencing data can potentially inform the user about the contribution of the two chains in recognizing the pMHC complex.

Classification performance of SwarmTCR
The rationale for developing SwarmTCR is that receptor α and β chains can be involved in peptide recognition to a variable extent.Figure 2 shows two crystal structures of TCR/ pMHC complexes to visually illustrate the fact that the α and β chains can be involved in pMHC binding to a very different extent, depending on the peptide that is being recognized.In the example shown in Fig. 2, out of the total number of residues making contact with the pMHC, one TCR (PDB ID: 4G8G [9]) has 16 (59%) α chain residues and 11 (41%) β chain residues in contact with the pMHC, whereas the other [10] has 9 (39%) α chain residues and 14 (61%) β chain residues in contact with the pMHC.This is consist- ent with results in the literature [8].
Based on this observation, SwarmTCR optimizes the weights used to compute the CDR alignment scores underpinning the nearest-neighbor classification approach.In contrast, previous attempts at predicting TCR specificity (TCRdist method) used a static weighting scheme with equal α and β chain contributions and fixed CDR loop Fig. 1 Overall approach.Here the model for single-cell data is illustrated.Weights for each CDR loop are determined by the optimization step and subsequently tested on the testing set to assess peptide prediction performance weights [2].The SwarmTCR method makes no assumptions about chain or CDR loop importance, but learns the weights in an peptide-specific fashion.
Mean and standard deviation of the optimized weights for several peptides are shown in Fig. 3 (numerical values in Additional file 1: Table 1), together with Fig. 2 Contact residues in crystal structures.The complexes shown here (PDB ID: 4G8G, peptide: KRWIILGLNK and PDB ID: 2VLR, peptide: GILGFVFTL) express the need for SwarmTCR.The 4G8G complex illustrates an α driven interaction and 2VLR conversely, a β driven interaction.All protein chains (including the CDR loops) are color-coded to supplement the tables beneath each structure.The tables show the number of contact residues in each CDR loop and a target structure Fig. 3 CDR weights and performance.These boxplots summarize the results of SwarmTCR and compare them against TCRdist.A, C SC and BS SwarmTCR results describe weights (y-axis) selected for each CDR loop (x-axis) for each repertoire tested.B, D SC and BS performance comparison of SwarmTCR and TCRdist compare average precision scores (y-axis) for each repertoire tested (x-axis).P-values for performance comparison are defined by two-sample independent t-test classification performance for SwarmTCR and TCRdist [2], separately for SC and BS data.To test the robustness of the results, we repeated the same analysis using TCRs from IEDB at different confidence thresholds (0, 2, 3), obtaining similar results (see Methods and Additional file 1: Table 2).In addition, in the absence of true negative data (i.e., data showing which TCRs do not bind to particular epitopes), we randomly shuffled CDR regions within each chain (alpha and beta), for all TCR sequences with our existing single-cell data.As expected, for nearly all repertoires we observed a notable loss in precision, with IAV-M1 showing a less pronounced loss in precision (see Methods and Additional file 1: Table 2).
Additional area under receiver operating characteristic curve (AUROC) analysis can be found in Additional file 1: Table 3, SC true positive rate (TPR)/false positive rate (FPR) boxplots in Additional file 1: Figs. 1 and 2, and BS TPR/FPR boxplots in Additional file 1: Figs. 3 and 4.

Single cell sequencing
SC data provides paired αβ chain information, i.e., the complete TCR sequence is avail- able.The SwarmTCR optimization procedure for SC data involves the use of eight separate weights, since we have paired α and β chain sequences.The results of our SC analysis show relatively high weight being placed on non-CDR3 loops, although the CDR3 region has high weight for several peptides (Fig. 3A, and Additional file 1: Fig. 5).Interestingly, in the case of the EBV YVL peptide and the Yellow Fever LLW (peptide: LLWNGPMAV) peptide the SwarmTCR optimization procedure assigns more weight to the α chain, sug- gesting that the α and β chains might have a more or less prominent role in TCR peptide recognition depending on the peptide, which is consistent with the example shown in Fig. 2 and the previous literature [8].
By looking at the results in Fig. 3B and in Fig. 4B (Additional file 1: Table 1), we can see that the largest difference between the classification performance of TCRDist [2] and SwarmTCR is for the EBV YVL and GLC peptides.The optimized weights for these peptides differ substantially from the fixed TCRdist weights.Based on the optimized weights, YVL appears to favor the α chain as noted above, with only CDR2β being weighted more than its α counterpart.This is consistent with results in the literature [11].
PR curves for all SC peptides are shown in Additional file 1: Fig. 6, AUROC results in Additional file 1: Table 3, SC TPR/FPR box plots in Additional file 1: Figs. 1 and 2. The distributions of alignment scores between test and reference TCRs for positive (i.e., binding) and negative (i.e., "non binding") TCRs is shown in Additional file 1: Fig. 7.For most epitopes (with the notable exception of NLV, which has poor performance), we can observe a clear separation of scores between positives and negatives.

SwarmTCR weights correlate with structural contacts
We further explored the potential of SwarmTCR to infer TCR chain usage in binding to the pMHC by extracting contact residue counts from TCR/pMHC crystal structure CDR regions (see Methods).Figure 5A shows that the weights generated by SwarmTCR correlate in a statistically significant manner (PCC = 0.812, p < 0.05 ) with actual chain usage for TCR/peptide contacts, compared to TCRdist (PCC = 0.484, p < 0.331).
Though number of contacts increase when MHC contacts are included (Fig. 5B), SwarmTCR weights maintain stronger correlation (PCC = 0.827, p < 0.042 ) compared to TCRdist (PCC = 0.645, p < 0.166 ).We performed the same analysis on CDR regions (Additional file 1: Figs. 8, 9, and 10), obtaining lower correlation values.However, SwarmTCR does appear to capture germline loops with high contact counts.Contact counts for all PDB structures can be seen in Additional file 1: Figs.11 and 12.

Bulk sequencing
As mentioned in the Introduction, in contrast to SC sequencing bulk sequencing yields the sequence of only the α or the β chain of TCRs but not both.The SwarmTCR optimi- zation procedure was carried out in the same manner as for SC, except that the weights to optimize are four instead of eight, since we have unpaired α or β chain sequences, con- taining CDR1, CDR2, CDR2.5, and CDR3 regions for a total of four weights per chain.Compared to SC data, the results on BS data show more weight being placed on CDR3 loops, indicating its importance in predicting the specificity of TCR data when using only one chain.While [2] assigns to the CDR3 loop three times the weight of the CDR1, CDR2, and CDR2.5 regions, SwarmTCR assigns to CDR3 between 4 and 64 times the weight of the other regions (using the average weight as a measure), as it can be seen in Fig. 3C and Additional file 1: Fig. 13.These differences between the SwarmTCR weights and the original weights by Dash et al. [2] have a substantial impact on the classification performance for the GLC and YVL peptides using the β chains (Fig. 3D).
Fig. 4 Precision-recall curves for SwarmTCR and TCRDist.These precision-recall curves show the performance of SwarmTCR and TCRdist on the data used for 50 cross-validation iterations.TCRdist mean curves are in blue and SwarmTCR mean curves are in red, while the shaded regions cover one standard deviation Figure 4C and D shows precision-recall (PR) curves for a representative peptide (EBV YVL), obtained by averaging 50 curves, with the shaded region representing one standard deviation above and below the mean.SwarmTCR outperforms the original weights used in [2] for both chains, with a more substantial improvement for the β chain (AUCPR 0.85 with the optimized weights vs. 0.74 with the origi- nal TCRdist weights).PR curves for all BS peptides are shown in Additional file 1: Fig. 14, AUROC results in Additional file 1: Table 3, SC TPR/FPR box plots in Additional file 1: Figs. 3 and 4.
Fig. 5 Comparison of SwarmTCR weights and contact residues in crystal structures.These plots compare α β chain usage for SwarmTCR and TCRdist, using known crystal structures as a baseline.The x-axis details the normalized weight for each repertoire's chain and the y-axis is ordered in ascending order according to crystal structure average number of contacts.Plot A includes the number of CDR loop contacts to the peptide and plot B includes number of CDR loop contacts to the pMHC.Pearson and Spearman statistics are located to the right of the legend

Discussion
We introduced SwarmTCR, a computational approach for predicting TCR specificity that maximizes classification performance within a nearest neighbor framework by identifying optimal CDR weights.Compared to the results obtained with fixed TCRdist weights, overall SwarmTCR performs better, with some peptides showing more substantial improvement than others (Fig. 4).We note that in a worst case scenario, with enough data SwarmTCR can always fall back on the weights used by TCRdist if those yield maximum performance during the PSO step.
When comparing CDR weights in SC and BS data, we noticed stark differences between the results obtained with the two data types.In particular, we found that SwarmTCR assigns much more weight to the CDR3 region in BS data, whereas the results on SC data show relatively higher weights for the germline CDR loops.Due to the small size of the SC dataset, the diversity of TCR gene families is likely considerably lower than that found in the BS dataset.Therefore, the lower gene family diversity in the SC dataset compared to BS could partly explain the higher predictive power of gene family (germline loops) in SC data.Another reason for this difference in the weights between the two data types is the presence of paired chains information in SC, where combinations of TCR genes for α and β chains would likely be selected for by the optimi- zation approach.More SC data is needed to further elucidate the issue.Consistent with the substantial differences in size between the two datasets, SC results show higher variance in both performance and weight selection than BS results.
The performance, generarizability, and robustness of computational approaches depend on the quality of the data used for training.An important caveat to consider when using publicly available databases like IEDB and VDJdb is that they might contain data obtained in a specific experimental context and not further validated.For example, confounding factors like bystander activation of CD8+ cells (i.e., activation of T cells that is independent of the TCR [12]) can potentially lead to incorrect assignment of TCR specificity.
An important question to consider is whether the optimized weights can also be interpreted to reflect chain and CDR usage.In other words, if a chain or a CDR region receives a high weight during the optimization step, does that mean that it also makes a large number of contacts with the pMHC?Our results suggest that the optimized weights can point to possible TCR chain and CDR loop usage, as shown in Figs.3A, 5, and Additional file 1: Table 1 for the GIL TCR/pMHC crystal structure (Fig. 2, PDB ID: 2VLR) with respect to β chain dominance and CDR2β loop usage.
A recent study [11] corroborates these findings, explaining CDR1β and CDR2β 's role in pMHC recognition, as well as CDR3β 's conserved arginine fitting into a pocket between the peptide and the MHC α2-helix.Additionally, this study explains CDR3β 's sequence conservation and notable variability in CDR3α .This likely explains the weighting results of SwarmTCR (GIL, Fig. 3), despite the number of contacts in the 2VLR TCR/ pMHC structure.We also note that SwarmTCR's weight results for YVL and LLW repertoires align with findings from this study, indicating the importance of the α chain in pMHC recognition [11].
However, one has to exercise caution when interpreting the weights in a structural sense.As discussed above, the weights are the result of an process designed to maximize classification performance, and factors other than structural importance can play a role in determining the optimal weights.If we consider the crystal structures and literature mentioned, this is shown by our BS weighting results and differences present in Additional file 1: Figs. 8, 9, and 10.Nonetheless, given large amounts of TCR sequence data, peptide-specific optimal weights can provide helpful information in elucidating TCR/pMHC interactions.
Sequence-based approaches to infer TCR specificity are appealing due to their computational efficiency and the availability of sequence data [2,6,13].However, structural data continue to provide information that expands and sometimes challenges our current understanding of TCR/pMHC interactions.For example, one study found a strong negative correlation between mean CDR3 α , β charge and peptide charge [2].Another study [3] showed how cross-reactive peptides share similar pMHC features (structural motifs and electrostatic potential) despite having different peptide sequences.These findings point to the importance of factoring in structural information for further improving prediction methods.However, more work needs to be done both at the experimental level (generation of more crystal structures) and the computational level (reliable and scalable modeling of TCRs and pMHC complexes).

Conclusions
Being able to reliably predict TCR specificity will push the boundaries of many disciplines including vaccine design, immunotherapy, cancer research, and disease detection/ prevention in new directions.Here we have introduced SwarmTCR, a nearest-neighbor approach that optimizes CDR weights by maximizing classification performance.SwarmTCR was benchmarked on both SC and BS data, and compared against a stateof-the-art methodology, TCRdist.The results showed that SwarmTCR improves the performance of the nearest-neighbor classification approach and that the CDR weights generated in the training phase tend to correlate with the number of contacts made by the CDR regions in crystal structures.
Human data from VDJdb was downloaded in January 2018, where paired TCR information is denoted by matching index values and unpaired chains have index values of 0. Complete SC data (confidence value ≥ 1) from the Immune peptide Database (IEDB) was added to our dataset and used as the default for all analysis.To test the sensitivity of the results to data composition, we also built datasets that had IEDB data with confidence value ≥ 0 , ≥ 2 , and ≥ 3 , respectively.In total, our default SC dataset comprised 1447 TCRs, BS α 21,207 chains, and BS β 25,927 chains (for complete dataset counts see Additional file 1: Table 4).Data is available for download from the Github repository for the project.

CDR information
Our method for predicting the specificity of TCRs requires TCR gene family and complete CDR3 sequence.To obtain this, we retrived all human germline CDR loop information from the International ImMunoGeneTics Information System Gene database (IMGT/GENE-DB) [17].CDR1 and CDR2 loops can be retrieved directly from the database.However, CDR2.5 needs to be extracted from the IMGT alignment sequence, and is defined by the residues in columns 81-86 of the gapped alignment (F+ORF+in-frame P amino acid sequences with IMGT gaps), as discussed in Dash et al. [2].After translating the data to protein sequence, we produced non-redundant datasets by removing duplicate TCRs (sequence identity < 100%).

Shuffled CDRs
Since we did not have access to true negative data (i.e., data showing which TCRs do not bind to particular epitopes), we randomly shuffled CDR regions within each α and β chain, to test whether or not a loss of precision would be observed.CDR regions for each TCR were shuffled prior to assigning the receptors to the train and test sets.

Baseline method
We implemented TCRdist [2] as a baseline method for classifying TCRs according to their peptide specificity.TCRdist is based on a nearest-neighbor approach, where the distance between TCRs is obtained from protein sequence alignment scores between TCRs.Using a BLOSUM62 matrix, a protein alignment is performed between any two TCRs using CDR loops 1, 2, 2.5, and 3. Subsequently, CDR 1, 2, and 2.5 are given a weight of 1, where CDR3 is given a weight of 3. Finally, the weighted sum of the CDR loop alignment scores is used as a proximity measure, and TCRs are assigned the peptide specificity of their nearest neighbor [2].

SwarmTCR
The main idea behind SwarmTCR is that the "importance" of the α and β chains as well as the CDR regions within these chains varies depending upon the peptide that is being recognized, as described in the literature [8].In order to reflect this, SwarmTCR learns optimal weights for each of the eight CDR loops in a peptide-specific fashion.SwarmTCR explores the eight-dimensional (with SC data) or four-dimensional (with BS data) space of CDR weights with Particle Swarm Optimization (PSO), an established optimization technique inspired by the natural flocking behavior of birds that has been shown to achieve good performance in a wide range of optimization contexts [18].
The weights are used in a nearest-neighbor framework as done in TCRdist.We framed this as an optimization problem, where the objective is to identify a set of weights that maximize classification performance as measured by Average Precision (AP) (eq.1).AP was selected as the objective function to address the issue of unbalanced datasets (Additional file 1: Table 4), as suggested by [19].We used Particle Swarm Optimization (PSO) for carrying out the optimization of the weights and maximize AP on a training set.
AP is determined by the sum over every position of the precision-recall curve where k is the rank of the retrieved TCRs, n is the number of TCRs, P(k) is the precision at cut-off k, and �r(k) is the change in recall from k − 1 to k [20].
In PSO particles are initially placed in a multidimensional space at random, with each particle representing a possible solution to the optimization problem.At each iteration, particles move with a velocity vector that is a function of both the local best of the particle and the global best.The velocity ( v ) and position ( p ) of a particle i are updated at each time step t according to Eq. 2 and 3: where ω is an inertia factor set to 0.5, c1 and c2 are scaling factors set to 0.5, r1 and r2 are two random numbers between 0 and 1, pbest i is the position of particle i that has resulted in the best value for the objective function so far, while pbest g is the global best (i.e., the position corresponding to the best value so far across all particles).
The optimization is set to terminate if the swarm moves ≤ 10 −8 from its best position or if the change in the swarm's best objective value is ≤ 10 −8 .The swarm size is set to 25, with a maximum number of 20 iterations.

The SwarmTCR model
We define as "training set" the TCRs used to obtain an optimal set of weights maximizing average precision, and test set as the TCRs where the performance of the optimal weights is evaluated.Within both sets, we have a reference subset containing labeled TCRs and a sample subset that the nearest neighbor approach compares against the reference subset to infer peptide labels for the TCRs.
Training and test sets for SC and BS data were constructed differently due to data availability, with BS data being much more abundant than SC data.For BS, the training and test sets were filled using a 50/50 split, and for both training and test sets half of the TCRs specific for a particular peptide were placed into the reference subset and the remainder into the sample subset (Fig. 6A).
For SC, 30% of all TCRs specific for a peptide were placed into the reference subset for both training and testing sets.The same reference subset was used in training and testing due to limited amounts of SC data (see Additional file 1: Table 4).In order to create the sample subsets for training and testing, the remaining 70% (TCR specific) undergoes another 70/30 split (Fig. 6B).We note that the sample reference sets are distinct for (