Balancing sensitivity and specificity in distinguishing TCR groups by CDR sequence similarity

Background Repertoire sequencing is enabling deep explorations into the cellular immune response, including the characterization of commonalities and differences among T cell receptor (TCR) repertoires from different individuals, pathologies, and antigen specificities. In seeking to understand the generality of patterns observed in different groups of TCRs, it is necessary to balance how well each pattern represents the diversity among TCRs from one group (sensitivity) vs. how many TCRs from other groups it also represents (specificity). The variable complementarity determining regions (CDRs), particularly the third CDRs (CDR3s) interact with major histocompatibility complex (MHC)-presented epitopes from putative antigens, and thus encode the determinants of recognition. Results We here systematically characterize the predictive power that can be obtained from CDR3 sequences, using representative, readily interpretable methods for evaluating CDR sequence similarity and then clustering and classifying sequences based on similarity. An initial analysis of CDR3s of known structure, clustered by structural similarity, helps calibrate the limits of sequence diversity among CDRs that might have a common mode of interaction with presented epitopes. Subsequent analyses demonstrate that this same range of sequence similarity strikes a favorable specificity/sensitivity balance in distinguishing twins from non-twins based on overall CDR3 repertoires, classifying CDR3 repertoires by antigen specificity, and distinguishing general pathologies. Conclusion We conclude that within a fairly broad range of sequence similarity, matching CDR3 sequences are likely to share specificities. Electronic supplementary material The online version of this article (10.1186/s12859-019-2864-8) contains supplementary material, which is available to authorized users.

In the MHC:peptide: TCR recognition process (Fig. 1) the TCR represents the main source of variability and training in distinguishing of self vs. non-self peptides [28]. MHC is genetically encoded and even the effective diversity across global populations due to allelic variation is limited, since there is degeneracy in the MHC binding groove pockets that hold the peptide side-chains [29,30]. In contrast, TCRs are much more diverse, with hypervariable complementarity determining regions (CDRs), particularly the third CDRs (CDR3s), which are derived from variable-diversity-joining (VDJ) recombination. Overall TCR theoretical diversity is estimated to be perhaps 10 15 [31] and practical diversity in any individual roughly 10 6 [32][33][34]. The CDR3s, which contribute the bulk of the diversity, are thereby able to specifically recognize a wide array of MHC-presented antigen peptides, while the other CDRs, which are largely genetically encoded, are primarily responsible for recognizing the MHC itself (see again Fig. 1) [28,35].
An individual's set of TCRs, or TCR repertoire, is shaped by thymic training against self along with a lifetime of exposure to different antigens. It is presumably much smaller than that of all possible antigens, and there is substantial degeneracy, with different TCRs able to recognize the same antigenic region and the same TCR able to accommodate different antigenic regions [36][37][38][39]. Thus numerous interesting and important questions center on the relationship between TCR repertoire and recognition propensities, including the impacts of genetics vs. training and exposure, the ability of CDRs to accommodate antigenic diversity, and commonalities across pathology-or antigen-specific populations. The advent of large-scale repertoire sequencing [40,41], initially for CDRs alone [32,[42][43][44], and more recently even for paired α/β chains [45,46], provides opportunities to gain insights into patterns of TCR diversity and recognition. An early study tackled the importance of genetics by studying pairs of monozygous twins, and, among other analyses, found that twin pairs had more identical CDR3 sequences than non-twins [47]. More recent publications have shifted the focus from identical sequences to similar sequences. Paired α/β TCRs, resulting from single-cell sequencing and antigen-specific selection, could be classified very well according to their epitope specificities, and the sequences elucidated patterns conferring those specificities [48]. Similarly, epitope-specific repertoires from a variety of viral infection contexts pooled across many subjects revealed distinctive motifs, and furthermore the CDRs from a set of M. tuberculosis subjects clustered into groups with strong MHC associations enabling design of specific MHC-peptide-TCR interactions [49]. An extensive analysis of available TCR sequence data from a wide range of subjects revealed pathogen-specific MHC-TCR associations along with structural insights into MHC and TCR covariation [50].
TCR repertoire sequencing thus provides the opportunity to generalize from a set of samples to patterns than are predictive of relationships among subjects, pathologies, antigens, MHC restrictions, and so forth [48,[50][51][52]. Here, we systematically investigate, over a diverse group of repertoire datasets, the extent to which sequence enables prediction of such relationships. As always in statistical / machine learning approaches, one must thread the needle between under-generalization, missing out on predictions that would be true (i.e., lacking sensitivity), and over-generalization, making predictions that end up not being true (i.e., lacking specificity). Thus we characterize the trade offs between specificity and sensitivity in these various studies. We show that in general it is possible to obtain a good specificity-sensitivity balance and make a large fraction of high-confidence predictions of TCR function from sequence.

Results
We study a diverse group of CDR datasets in order to evaluate in general how predictive TCR sequence is of relationships among groups (subjects, epitopes, pathologies, etc.). Since single cell methods are only now becoming available, many repertoire analysis efforts use standard sequencing approaches to characterize CDR3, which, as discussed above, is the main source of variability and antigen-specific recognition. So as to provide a consistent and interpretable basis for drawing conclusions, as well as to evaluate the information content provided by CDR3 alone, we use only CDR3 for all datasets, and separately analyze CDR3α and CDR3β.
The approach we take is representative of the key principles underlying the many possible sequence-based An MHC (brown) presents a peptide (red spheres) to a TCR (light blue: α chain; light green: β chain) (PDB id 2bnr [67], rendered via PyMOL [68]). The TCR interacts with the MHC and peptide through its CDR loops, with the first two CDRs of each TCR chain generally recognizing mainly the MHC and the third the peptide; here the CDR3α and CDR3β are "grabbing" a methionine and tryptophan in the middle of the peptide (inspired by Chen et al. [67]) prediction methods and parameterizations, with advantages of being readily interpretable and able to function well even with limited data. In short, we evaluate CDR3 sequence similarity (technically dissimilarity, as lower is better), with a score which we term CDRdist, with 0 indicating an exact match and 1 no identifiable similarity. We assess the predictive power of CDR3 sequences by using this score to perform 1-nearest-neighbor classification (Fig. 2), predicting the group (e.g., associated antigen or disease) to which one CDR belongs based on the known group of the most similar CDR. By checking whether the predicted group is correct or not, we can evaluate both sensitivity (fraction of CDRs from a group that are correctly predicted to be in that group) and specificity (fraction of CDRs predicted to be in a group that actually are in that group). In order to evaluate only the impacts of similarity, without being confounded by duplicates which can render classification trivial, we do not consider exact matches (score of 0). In order to gain deeper insights into how the degree of similarity impacts classification performance, we slide a threshold from > 0 to 1, making a prediction only if the nearest neighbor is sufficiently similar, and assessing how relaxing the required score manifests in specificity vs. sensitivity trade-offs. The score also provides the basis for clustering, elucidating sequence similarity-based groupings for CDRs and revealing sequence patterns conferring the observed performance trade-offs.
In the following sections, we apply this general framework to characterize several datasets: CDR groups defined by structure, evaluating sequence variation within and between clusters; repertoires from twins, studying the relative similarity between an individual and their twin vs. others; a number of different human and murine repertoires, assessing CDR distinctiveness as it relates to antigen specificity as well as underlying pathology.

Extent of sequence similarity within/between structural clusters
The Structural T-Cell Receptor Database (STCRDab) [53] structurally clusters CDR3s into canonical classes, separately for α and β, and separately by length(s). A set of CDR3 sequences and associated structural classes were downloaded and investigated for relative intra-vs. inter-class sequence similarity. After removing duplicates, there were 142 unique sequences across four α groups and six β groups (Table 1). While it is common for structural clusters to be characterized by their individual sequence profiles, we also sought to understand the extent to which the sequence pattern from one cluster could be generalized before encroaching on another cluster.
The distance from each unique CDR to the most similar (but distinct) CDR within its structural class tends to be smaller than the distance to the most similar (but distinct) CDR from another class ( Fig. 3 (a, b)). When the distance is less than about 0.2 or 0.3, the closest CDR tends to be within the same structural class (below 0.2: 97% in the same structural class; below 0.3: 96%), while when it is above about 0.6 or 0.7, the CDRs tends to be within different classes (above 0.6: 69% in different Fig. 2 Classifying CDRs by sequence similarity. This illustration plots in a schematic low-dimensional space the locations (dots) of CDRs from three different classes (colors). 1-nearest-neighbor classification predicts the class of one CDR from that of the most similar one; we here refine that to require the nearest neighbor to be close enough, within a specific distance threshold. Contour rings show sequence distances of 0.2, 0.3, and 0.4 from three query CDRs ("A", "B", and "C") from those classes. At a threshold of 0.2, only "B" has a close-enough nearest neighbor, "b1", which is of the same class, so 1-nearest-neighbor classification is correct. At this threshold, "A" and "C" are not predicted. When the threshold is relaxed to 0.3, "A" now has a close-enough nearest neighbor, "a1", of the same class, so it is also correctly predicted. However, "C" has "b2", of a different class, as its close-enough nearest neighbor, so it is incorrectly predicted. In this manner, we study trends trading off correct, incorrect, and unidentified, as the threshold is varied structural classes; above 0.7: 71%). This observation supports the use of nearest-neighbor classification, predicting structural class from sequence matches, which we elaborate to study specificity-sensitivity trade-offs by subjecting it to a distance threshold; i.e., only make a prediction if the nearest neighbor is closer than a given threshold. When the threshold is less than about 0.2, many CDRs are unclassified but those that are tend to be correct; between 0.2 and 0.4, the number of unclassified CDRs drops substantially while still retaining high specificity; and above that range the specificity drops in order to obtain further sensitivity ( Fig. 3 (c, d)).
In order to more directly explore the relationship between sequence similarity and structural similarity, structures were downloaded from STCRDab and the CDR3β loops extracted for those in which electron density was present. When the same sequence was present in multiple structures, a representative was chosen as that with minimal sum of main-chain root mean squared deviation (RMSD) to the others. For each such CDR, the most similar sequence in its STCRDab CDR3β structural class and the most similar sequence from another CDR3β class were compared, in terms of both CDRdist and RMSD. This comparison ( Fig. 4) thus elaborates the implications of Fig. 3, characterizing when a closer sequence implies a closer structure. Some examples are illustrated, limited to cases with good sequence score, such that a classification decision would be made under the thresholding approach above.
This analysis helps calibrate the general level of confidence one can have that two CDRs of a given degree of Fig . 3 Sequence similarity in STCRDab structural clusters. a,b Minimum sequence distances within (blue) vs. between (green) structural classes for (a) CDR3α sequences and (b) CDR3β sequences, plotted as a density estimate. c, d Accuracy of nearest-neighbor classification using sequence similarity to predict structural cluster for (c) CDR3α sequences and (d) CDR3β sequences. As the threshold required to make a classification is increased (x-axis), the number of sequences (y-axis) that are unclassified (green line) decreases, trading off how many are correctly (blue line) vs. incorrectly (magenta) classified local sequence similarity are likely to adopt similar structures, under a readily interpretable classification approach.

CDR similarity across repertoires from twins
An early landmark study in TCR repertoire analysis characterized three pairs of monozygous twins ("A", "B", and "D"), evaluating general characteristics of the repertoires (e.g., diversity) as well as the extent of identity across subjects [47]. The analysis published with that study revealed that the number of identical CDR3 sequences between two individuals was significantly increased if they were twins. We sought to relax the identification of identical sequences across individuals to allow for different degrees of similarity, in particular to test whether twins had more similar sequences than non-twins. As throughout this paper, we explicitly did not consider exact matches, in this case thereby evaluating the "residual" information beyond the previously studied identity. In order to focus the analysis on the strongest signal, we considered only the 1000 most abundant CDR3 sequences from each repertoire, with read counts ranging from over 100,000 down to around 100. Zvyagin et al. [47] observed that shared clonotypes were significantly higher among the most abundant CDR3β sequences for any pair of individuals, and this was even more evident for twins, so we focused on these more significant sequences. For each unique CDR from each subject, the closest non-identical CDR in each other subject was identified. These matches served as the basis for evaluating nearest-neighbor classification, assessing whether the closest CDR was from a twin (correct) or a non-twin (incorrect). Over the range of required distance thresholds, the number of correct classifications outpaces that of incorrect ones ( Fig. 5 and Additional file 1: Table S1), though not as strongly as for the structural clusters. For CDR3α, at a threshold of 0.2 about 46% of each twin pair's CDRs are correctly classified and about 20% incorrectly classified, with the remaining 34% unidentified. Raising the threshold to 0.3 yields roughly 57% correct, but at the cost of roughly 28% incorrect, leaving only 15% unidentified. Further increases in the threshold continue this trend, with 0.4 resulting in 63% correct but 31% incorrect, and 6% unidentified. Results for CDR3β follow the same sensitivity-specificity trend over this range, but at a lower accuracy: at a threshold of 0.2, there are about 26% correct vs. 14% incorrect; at 0.3 the Fig. 4 Sequence similarity differences vs. structural similarity differences. Each point represents a CDR3β, and its coordinate represents the difference in sequence similarity, CDRdist, (x-axis) and structural similarity, RMSD, (y-axis) between the most similar sequence in its STCRDab structural class and the most similar sequence in any other STCRDab CDR3β structural class. A negative number indicates that the CDR is more similar to the neighbor in its structural class than to the other neighbor; i.e., left indicates more similarity in sequence and bottom more similarity in structure to an in-class CDR than to an out-of-class one. Color represents the distance between the CDR and the in-class neighbor. Structural superpositions highlight example relationships between a target CDR (blue), its in-class neighbor (green), and its other neighbor (orange). a Sequence-based classification is correct, but actually does not yield the best structural match. b-d Classification is correct, and yields (b) similar, (c) better, and (d) much better structural matches. e-f Classification according to STCRDab classes is incorrect but is actually consistent with the relative structural similarity: the other structure is (e) more similar or (f) about the same, though in (e) neither structure is particularly similar. g Classification is incorrect, and the closest sequence in the same structural class is more similar to that in the other class trade-off is 45% correct vs. 29% incorrect, and at 0.4, 55% correct vs. 37% incorrect. Overall, since identical sequences were explicitly excluded from the classification, this result demonstrates that twin repertoires share not only more identical sequences, but also more similar sequences.
To quantify the significance of the difference in twin vs. non-twin nearest neighbors, the distribution of the closest twin distance was compared to that of the closest non-twin match with a Mann-Whitney U test. CDR3β matches between twins in pair A are closer to each other than to their closest non-twin matches (p-value 7 × 10 − 4 ) as are those in twin pair D (1 × 10 − 9 ); however, the distinction does not hold for twin pair C (0.24). For CDR3α, however, only twins A are significantly different from others (0.017), with C marginally above a 5% cutoff (0.07) and D insignificant (0.198). We conclude that two of the three twin pairs have more similar CDR3β sequence, but only one pair has more similar CDR3α sequences. In contrast, the CDR3α classification performance is better than that for CDR3β, even though the overall distributions are more similar, suggesting that focusing specifically on close-enough pairs reveals additional valuable information distinctive of twin pairs.
Analyzing patterns of CDR sequence similarity within and between repertoires can yield biological insights into the basis for specificity (or lack thereof ) and suggest directions for further investigation. To explore such patterns for the twin pairs, CDRs in each individual's repertoire were clustered for illustration according to CDRdist at a maximum distance of 0.3, which as shown above yields a good specificity-sensitivity trade-off. Clusters from the different individuals were then compared according to an aggregate sequence score, clusterdist, computed as the average cluster member distances. Figure 6 illustrates some of the patterns of cluster specificities, with motifs revealing the determinants of specificity (or not). In these examples, a C-terminal DSNYQLIW appears to be common to some CDRs in all of the individuals, while AGNNRKLIW is more specific to individual A1, and VV**GREYGNKLVF distinguishes the twin pair A1/A2 from the other individuals.

Classification of epitope specificity by CDR similarity
As discussed in the introduction, a pair of seminal studies published in 2017 studied epitope-specific TCR repertoires across many individuals. This section characterizes sensitivity-specificity trade-offs in predicting within these datasets which epitope each CDR3 recognizes based on the epitopes for similar CDR3s.
Dash et al. [48] used peptide-MHC (pMHC) tetramer selection and single-cell amplification to collect 4635 paired α/β TCR sequences from 10 epitope-specific repertoires. The 1211 unique CDR3α and 1244 unique CDR3β mouse sequences came from 78 mice and were associated with epitopes labeled NP, PA, F2, and PB1 (presented during influenza infection) and M38, m139, The plots track the performance of nearest-neighbor classification, using sequence similarity to predict whether a CDR is from a twin or an unrelated individual. As the threshold required to make a classification is increased (x-axis), the fraction of sequences (y-axis) that are unclassified (green line) decreases, trading off the fraction that are correctly (blue line) vs. incorrectly (magenta) classified and M45 (presented during murine cytomegalovirus infection). The 276 unique CDR3β and 294 unique CDR3α human sequences came from 32 humans and were associated with epitopes labeled M1 from influenza virus, pp65 from human cytomegalovirus, and BMLF1 from Epstein-Barr virus. Among other analyses, Dash et al. performed nearest-neighbor classification based on a custom sequence similarity score using entire paired TCR sequences, and were able to assign 78% (mouse) and 81% (human) of the TCRs to their correct epitope group. We again sought to characterize how specificity and sensitivity vary, and as throughout the paper directly focused on unpaired CDR3 only and the single most similar sequence to a given one. Figure 7 illustrates performance over the range of allowed distance thresholds and Additional file 1: Table  S2 gives details. At 0.2, 53% of the mouse CDR3βs are classified correctly, as are 43% of the human ones, with 16% mouse and 4% human incorrect. Relative to the CDRs that are actually identified (69% mouse and 47% human), these fractions are 77% mouse correct and 91% human correct, comparable to the previous results (78 and 81%, respectively). Thus even using just the CDR3β alone, if the single closest neighbor is close enough then it is highly predictive of the epitope group. Relaxing the threshold to 0.3 yields more correct classifications, 59% mouse and 49% human, at the cost of more incorrect, 29% mouse and 9% human. Consequently the accuracy among those identified is somewhat lower but still comparable, 67% mouse and 87% human. Further relaxing the threshold continues to yield further increases for both correct and incorrect classifications, e.g., at 0.4, 62% of the mouse sequences are correctly classified vs. 36% incorrectly, and 58% correct vs. 16% incorrect for human, which translates to 63% mouse and 79% human correct among those identified. Thus the suitable balance between accuracy and number of predictions again appears to fall in the 0.2 to 0.4 range as we observed first in the structural clusters as evidence of similar conformation. Similar trends hold for analysis of the CDR3α sequences, but CDR3β sequences had greater predictive power, particularly for murine sequences. Other studies, e.g., [47,49], have likewise found CDR3β sequences to be more informative than CDR3α sequences.
In order to gain some insights into the factors conferring epitope specificity, each of the repertoires was clustered, and as with the twins dataset, the clusters were evaluated for their relative specificity. Within the murine and human groups of repertoires, the cluster similarity score clusterdist was computed for each pair of clusters. Each cluster's specificity to its repertoire was characterized by the smallest clusterdist to a cluster from a different repertoire, since a relatively small clusterdist indicates that the sequence pattern defining a cluster common to epitopes in other repertoires while a relatively large score indicates that no cluster in another repertoire has similar epitopes. Figure 8 illustrates some examples of relatively specific and relatively non-specific clusters at a threshold of 0.3, shown above to yield a good specificity-sensitivity balance. For example, TCS*GTGG*NYAEQFF is common to both PB1 and M38 from mice, with a distance of only 0.12 between the two clusters. On the other hand, ASGLVP*G*VYEQYF is distinct to the human M1 repertoire. Its closest motif is ASS**TGTG*YGYTF in p65 at a distance of 0.79, which is relatively large, indicating that ASGLVP*G*VYEQYF is unique to M1. (twin-distinct) clusters from unrelated individuals are closer than that from the twin; (twin-specific) the twin's cluster is closer than those from other individuals; (A1-specific) the cluster is far from all clusters from all other individuals As shown in Fig. 7, the specificity-sensitivity balance shifts as the threshold varies, so in order to understand the sequence patterns driving that shift, we evaluated the evolution of the underlying clusters (illustrative examples in Fig. 9). As the threshold increases, clusters tend to become larger, containing more unique sequences, and are therefore more diverse and less specific. In particular, for the NP cluster, the nearest cluster in a different repertoire at 0.2 is at a clusterdist of 0.43, but that falls to 0.36 for the 0.3 NP cluster; similarly, for PB1-F2 the nearest other-repertoire cluster at 0.2 is 0.49 away, down to 0.38 at 0.4.
Turning to the other recent large TCR repertoire study, Glanville et al. [49] collected 2068 unique sequences using the pMHC tetramers to isolate antigen-specific T cells spanning eight tetramer antigen-MHC (more specifically Fig. 7 Epitope specificity in Dash et al. repertoires. The plots track the performance of nearest-neighbor classification, using sequence similarity to predict a CDR's epitope from a set of seven different murine epitopes or from a set of three different human ones. As the threshold required to make a classification is increased (x-axis), the fraction of sequences (y-axis) that are unclassified (green line) decreases, trading off the fraction that are correctly (blue line) vs. incorrectly (magenta) classified . Their analysis of this data showed that the antigen-specific repertoires tended to share more similar sequences, and revealed some 2-, 3-, and 4-mer motifs enriched in different repertoires. We sought to build on this analysis by once again systematically evaluating the extent of generalization and predictiveness supported by sequence patterns.
Nearest-neighbor classification was performed for each pair of specificities at thresholds of 0.2, 0.3, and 0.4, predicting the epitope+HLA of one CDR based on that of the most similar one (Fig. 10). NP44 and pp65 associated with HLA-B7 generally do very well, and BMLF1 also does relatively well. NP177 and pp65 associated with HLA-A2 are more easily confused with other epitopes. On average, over all the pairwise comparisons, a threshold of 0.2 yields about 19% correctly classified vs. 2% incorrectly classified, with 74% unidentified. Raising the threshold to 0.3 yields 38% correct vs. 7% incorrect with 56% unidentified sequences, and 0.4 continues the trend to 55% vs. 15% with 30% unidentified. The standard deviations on these correct percentages are around 11-13%, as some pairs are clearly quite better than others. Overall, these tests confirmed that the 0.2 to 0.4 range balances accuracy and a sufficient number of predictions for the epitopes in this dataset.

Classification of pathology by CDR similarity
McPAS-TCR catalogues TCR sequences from T cells associated with various pathological conditions in humans and mice [54]. This repository allowed us to move up from epitope specificity to pathology specificity, evaluating how well CDR3β similarity supports classification of the general pathology from which it was derived. The McPAS-TCR human TCR sets with at least 50 unique CDR sequences were downloaded and split into two groups: "small", with fewer than 400, and "large" with more than 400, yielding a relatively equal number of repertoires per group with relatively balanced number of sequences per repertoire ( Table 2). All pairs of pathologies within the same size group were then subjected to nearest-neighbor classification, predicting the pathology of one CDR based on that of the most similar one.
The pairwise classification performance (numbers of correct, incorrect, and unidentified) was calculated for thresholds of 0.2, 0.3, and 0.4, in the common range demonstrating favorable specificity-sensitivity performance across all studies (Fig. 11). On average over all pairwise classifications in the small repertoire group, the percentage of correct classifications increases from 19% at 0.2 to 34% at 0.3 and 49% at 0.4, traded off against 2, 6, and 14% incorrect, respectively, as the fraction of identified sequences goes from 24% up to 40% and finally 63%. Likewise, for the pairwise classifications in the large repertoire group, the fraction of correct classifications averages 24% at 0.2, 48% at 0.3, and 64% at 0.4, with corresponding incorrect classification averages of 4, 12, and 20% and identified averages of 34, 56%, and finally 84%. Over all these comparisons, the standard deviations for correct fractions ranges from 8 to 10%, with some pairs clearly much better than others. In general, larger repertoires are able to classify more sequences than small ones, and do so at a higher accuracy, presumably due to simply having a higher probability of containing a sufficiently close sequence. Some infectious diseases such as influenza, yellow fever, HIV, and hepatitis C all do particularly well in the classification task, even against each other. With these datasets, cancers and autoimmune diseases are confused, and diabetes performs poorly. Further studies are required to ascertain whether differences in classification performance across specific pathologies and general pathology types reflect inherent immunological differences or reveal artifacts in experimental procedures that can perhaps be mitigated with systematic integration methodologies.

Discussion
This study centers on the use of a representative approach to assessing TCR similarity, with local alignments between size-matched CDR3s for each chain separately, and a relaxed substitution scoring matrix combined with a relatively large gap penalty. This set of choices strikes a balance between looking for the local "hot spots" that mediate binding, and accounting for the overall structural context in which those hot spots are situated. While the detailed outcomes would surely be different if the score moved in one direction toward global alignment or in the other direction toward alignment-free motifs, the same general specificity-sensitivity trends would likely hold. In contrast, integrating information across all six CDRs (and even framework regions) [48][49][50], rather than considering only CDR3α or CDR3β independently, would likely yield higher overall performance. However, we felt it worthwhile to explore how much information was encoded just in the CDR3 regions, and found them to be strikingly informative. The 1-nearest-neighbor classifier employed here for specificity-sensitivity assessments is one of the simplest approaches possible, but makes it straightforward to understand and analyze the basis for predictions. A model-based approach, e.g., a linear classifier or even a nonlinear model [51,52,55,56], could give better predictive performance, but would also confound some of the analyses due to the differences in sizes and diversity in different groups. At the same time, a statistical learning approach could provide insights into the importance to different groups of particular CDRs and particular residue positions, directly reveal amino acid motifs conferring specificity, and so forth. In order to focus on the information provided by CDR similarity alone, the analyses presented here did not allow for identity in the 1-nearest-neighbor classification and did not consider abundance. This made the classification task somewhat harder, but also provided a uniform basis across the different studies. In a specific practical application, leveraging identity and abundance would be advantageous; statistical learning approaches could offer a natural means to incorporate this information.
Our analysis of structural clusters provided some intriguing insights into sequence-structure relationships, while the rest of the paper explored a range of sequence-function relationships. The structural analysis was somewhat limited due to limited available structures, but a more complete loop modeling approach [57][58][59] might provide additional power. The functional analysis could benefit from incorporation of MHC restriction information [50], in order to reveal associations among the presented antigens, the presenting MHCs, and the CDRs. And ultimately, combining sequence, structure, and function in an integrated model could provide much deeper insights into the basis for specific recognition.
We individually analyzed repertoires for pairs of twins [47] and for particular antigen specificities [48,49], along with aggregated collections of pathology-related repertoires [54], but we did not seek to combine information across these different studies. An integrative analysis could provide insights into common modalities of recognition, as has been shown for MHC restrictions [50], but which could also span broader functional associations across antigens from different pathogens as well as from different "self"s. An integrative analysis could thus seek to account for private and public aspects of recognition, gain insights into genetics vs. exposure, and support modeling of the development of immunity.

Conclusion
This paper has systematically explored the utility of using CDR3 sequence similarity to predict structural class and functional group (namely antigen specificity or pathology association). Based on a representative measure of similarity and an interpretable classification method, the information content in CDR3 alone was shown to support highly specific predictions at a Fig. 11 Pathology specificity of McPAS-TCR repertoires. For each pair of pathologies in a set of size-matched repertroies (7 small, 6 large), 1-nearestneighbor classification was performed to predict whether a CDR was correlated with one pathology or the other. Each sub-plot summarizes the performance of such as classification test for one such pair of pathologies (row vs. column). The bars indicate the percentage of correct (blue), incorrect (magenta), and unidentified (green) predictions between a pair of pathologies (row and column) at thresholds of 0.2, 0.3, and 0.4 sufficiently stringent similarity threshold, and to maintain good specificity even while increasing sensitivity by substantially relaxing the threshold. Furthermore, patterns supporting predictions within and between groups were shown to provide insights into the structural and functional bases for recognition. We conclude that, if suitably controlled as demonstrated here, predictive frameworks can productively leverage sequence patterns in characterizing and predicting TCR sequence-structure-function relationships.

Data processing
When data was processed for each repertoire, the first "C" from each sequence was removed as it is uninformative for scoring, uniformly inflating the scores. Duplicate sequences were combined within each repertoire.

Sequence similarity score
Given a pair of CDRs to evaluate for similarity, local alignment was performed using the Smith-Waterman (SW) algorithm [60], implemented in the Python package swalign [61]. SW was applied with the BLOSUM45 substitution matrix [62] to allow for biochemical diversity, and a gap penalty of − 10 to focus on matching largely gap-less substrings. Manual inspection of some CDR alignments suggested that these parameters accomplished the intended goals. So as to generate a score that is universally comparable across different CDR sets, the alignment score was normalized by dividing by the self-scores of the two sequences. To provide a dissimilarity measure suitable for clustering, the normalized score was then subtracted from 1. Since SW scores are non-negative and self-scores are maximal, the final score is between 0 (identical) and 1 (no discernable similarity). Formally, for two sequences A and B, the distance is:

Nearest neighbor classification
Given a set of CDRs that are labeled as belonging to one of two or more different groups, a nearest neighbor classifier predicts the label of another CDR based on "nearby" labeled CDRs, in terms of CDRdist. A 1-nearest-neighbor classifier was used for the results here, making the assignment on the single closest (but not identical) CDR rather than taking a vote among several. Furthermore, the allowed distance was thresholded, such that if no neighbor was sufficiently close, then no prediction would be made.

Clustering
CDRs were clustered using hierarchical agglomerative clustering via the linkage function in the scipy.cluster.hierarchy package of scipy [63], with the CDRdist as a comparison function. The resulting dendogram was cut at the specified CDRdist threshold (i.e., 0.2, 0.3, or 0.4 in the example results) in order to define clusters.

Cluster similarity score
In order to characterize how specific or non-specific clusters were to the repertoires they came from, the distance between a pair of clusters was computed as the average pairwise distance between their members: Then the relative specificity of a cluster to its repertoire was characterized in terms of its distance to clusters from other repertoires, with a small score indicating non-specificity (i.e., similarity to a cluster from another repertoire).

Acknowledgements
We thank Jennifer Franks and Yoonjoo Choi for valuable comments on this work.

Funding
Salaries supporting the development and analysis described here were funded in part by NIH grant 2R01GM098977.