Fig. 2 | BMC Bioinformatics

Fig. 2

From: Clustering based approach for population level identification of condition-associated T-cell receptor β-chain CDR3 sequences

CDR3 sub-repertoire matching in samples of two unrelated individuals. a Hierarchical clustering of CDR3 cluster centroids from samples CD005 (black) and CD006 (green) from our CD PBMC dataset identified 32 sub-repertoires, of which 30 (94%) had cluster representatives from both samples. Branch colors indicate sub-repertoires. Only 2 of the 32 (6%) sub-repertoires (shown as black dots) are homogeneous, containing cluster centroids from only one sample. b V-, J-, VJ- and VDJ-gene usage frequency was compared between clusters coming from the two samples. The percentage of sub-repertoires with significantly different gene usage (p value below 0.05, chi-square test of independence) is shown. c The number of different possible 4-mers that start at each position is estimated using Shannon's entropy for 42-nt-long CDR3s. The highest entropy is observed at positions where the CDR3s contain the N1 and N2 regions. Similar results were obtained in all samples. 4-mers that are not completely within the N1 or N2 region but either end or start in these regions are counted towards them. d The top 20 4-mers with the highest variance in frequency across the 5000 subsampled CDR3s within a single sample (CD005) are shown. e The frequency with which the top 20 most variable 4-mers are found in each CDR3 region (V, N1, D, N2, J) is shown. f The classification importance of k-mers and genes in distinguishing 4-mer-based clusters within a single sample (CD005) is shown. g The frequency with which the top 20 most discriminative 4-mers (ordered left to right) are found in each CDR3 region (V, N1, D, N2, J) of the CD005 repertoire is shown.
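
As a rough illustration of the position-wise measure described for panel c, the sketch below computes the Shannon entropy of the 4-mers that start at each position in a set of equal-length sequences. The caption does not specify the exact procedure, so this is an assumption about how such a profile could be computed, not the authors' code. The function name `kmer_entropy_by_position` and the toy sequences are hypothetical.

```python
from collections import Counter
from math import log2


def kmer_entropy_by_position(seqs, k=4):
    """Shannon entropy (in bits) of the k-mers starting at each position.

    Assumes every sequence has the same length (e.g. 42 nt CDR3s),
    so that a position means the same offset in every sequence.
    """
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")

    entropies = []
    for pos in range(length - k + 1):
        # Count every k-mer observed at this start position.
        counts = Counter(s[pos:pos + k] for s in seqs)
        total = sum(counts.values())
        # H = sum over k-mers of p * log2(1 / p)
        h = sum((c / total) * log2(total / c) for c in counts.values())
        entropies.append(h)
    return entropies


# Toy sequences, made up for illustration (not data from the paper).
toy = ["TGTGCCAGC", "TGTGCCTGC", "TGTGCAAGC", "TGTGCGAGC"]
profile = kmer_entropy_by_position(toy, k=4)
```

In this toy input the first positions are identical across sequences and therefore have zero entropy, while positions covering the variable middle bases score higher, which mirrors the pattern the caption reports for the N1 and N2 regions.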
