Skip to main content

Capturing the differences between humoral immunity in the normal and tumor environments from repertoire-seq of B-cell receptors using supervised machine learning



The recent success of immunotherapy in treating tumors has attracted increasing interest in research related to the adaptive immune system in the tumor microenvironment. Recent advances in next-generation sequencing technology enabled the sequencing of whole T-cell receptors (TCRs) and B-cell receptors (BCRs)/immunoglobulins (Igs) in the tumor microenvironment. Since BCRs/Igs in tumor tissues have high affinities for tumor-specific antigens, the patterns of their amino acid sequences and other sequence-independent features such as the number of somatic hypermutations (SHMs) may differ between the normal and tumor microenvironments. However, given the high diversity of BCRs/Igs and the rarity of recurrent sequences among individuals, it is far more difficult to capture such differences in BCR/Ig sequences than in TCR sequences. The aim of this study was to explore the possibility of discriminating BCRs/Igs in tumor and in normal tissues, by capturing these differences using supervised machine learning methods applied to RNA sequences of BCRs/Igs.


RNA sequences of BCRs/Igs were obtained from matched normal and tumor specimens from 90 gastric cancer patients. BCR/Ig-features obtained in Rep-Seq were used to classify individual BCR/Ig sequences into normal or tumor classes. Different machine learning models using various features were constructed as well as gradient boosting machine (GBM) classifier combining these models. The results demonstrated that BCR/Ig sequences between normal and tumor microenvironments exhibit their differences. Next, by using a GBM trained to classify individual BCR/Ig sequences, we tried to classify sets of BCR/Ig sequences into normal or tumor classes. As a result, an area under the curve (AUC) value of 0.826 was achieved, suggesting that BCR/Ig repertoires have distinct sequence-level features in normal and tumor tissues.


To the best of our knowledge, this is the first study to show that BCR/Ig sequences derived from tumor and normal tissues have globally distinct patterns, and that these tissues can be effectively differentiated using BCR/Ig repertoires.


Recent insights into cancer immunity have provided new possible treatment strategies against tumors based on immunotherapy. Since tumor cells contain certain proteins known as tumor-specific antigens (TSAs), which have unique sequences due to somatic mutations and are expressed almost exclusively in tumor environment, evaluation of antigen receptors against TSAs expressed in tumor-infiltrating lymphocytes is important for elucidating cancer immunity. There are two main types of immunity conferred by lymphocytes: cellular immunity, which is largely attributed to the action of T-cell receptors (TCRs), and humoral immunity, which is attributed to the action of immunoglobulins secreted by B-cells.

Recently, advances in next-generation sequencing technology have provided the opportunity to sequence TCRs and B-cell receptors (BCRs) or immunoglobulins (Igs) on an unprecedented scale [1, 2]. However, previous studies analyzing the global patterns of antigen receptor sequences have mostly focused on TCRs [37], resulting in the identification of specific features of the amino acid motifs in TCRs. In contrast, characterizing humoral immunity is more complex than characterizing cellular immunity, because BCRs/Igs show higher sequence diversity than TCRs due to somatic hypermutations. Therefore, analyses of BCRs/Igs have mostly focused on only a small number of known antigens or epitopes [8]. Moreover, the conventional approach used for TCR analysis based on analyzing sequence motifs or identical sequences cannot be applied to BCRs/Igs in tumors, because there are very few BCR/Ig sequences shared by different individuals in cancer microenvironments, unlike infection, vaccine administration, and autoimmunity [912].

Nevertheless, given the importance of humoral immunity in cancer, global sequence analysis of BCRs/Igs in tumors is essential to understand tumor immunity [13].

We hypothesized that because of TSAs, BCR/Ig sequences in the tumor environment may exhibit characteristics which differ from those in normal tissue environment.

In this study, we tackled this problem by constructing classifiers of BCRs/Igs obtained from the immune repertoire sequencing (Rep-Seq) data of 89-paired tissue specimens obtained from patients with gastric cancer, one of the most common malignancies worldwide and particularly in Asian countries. These classifiers were based on supervised machine learning techniques that differentiated between individual BCR/Ig sequences in normal and tumor environments. V/J-frame pattern, CDR-lengths, the number of SHMs, and physicochemical properties of amino acid sequences of CDRs were used as the key features of BCR/Ig sequences for the differentiation. This approach allowed us to identify distinct characteristics of BCRs/Igs in tumor tissues.

We also classified normal and tumor tissues based on the set of BCR/Ig sequences considering a hypothetical diagnostic situation. Classification of BCR/Ig sequence sets in the context of autoimmune-diseases was conducted previously [14]. Therefore, we compared the classification performance of our classifier with that used in the previous research. This analysis demonstrated that our classifier outperformed the other method when applied to our dataset.

We expect that this approach will advance the field of cancer research and improve immunotherapy toward better personalized medicine in cancer treatment.


Clinical samples

Ninety frozen gastric cancer specimens surgically resected from patients between 2009 and 2016 at the University of Tokyo Hospital were analyzed in this study after receiving written, informed consent. This study was approved by the institutional review boards of the University of Tokyo and Tokyo Medical and Dental University.

Rep-Seq data of BCRs/Igs

Total RNA was extracted from approximately 10 sequential frozen sections (10 µm thick) of gastric cancer tissues using the RNeasy Mini kit (Qiagen, Hilden, Germany). Quantity and quality assessments of the extracted total RNAs were made using the Agilent Bioanalyzer (Agilent Technologies, Foster City, CA, USA). Multiplex polymerase chain reaction (PCR) primers targeting BCR/Ig genes (iRepertoire, Inc., AL, USA) were used to amplify BCR/Ig repertoires in each of tumor and normal tissue samples according to the manufacturer’s protocol. Each repertoire library was then sequenced on an Illumina MiSeq instrument (Illumina, San Diego, CA, USA) with 2x 300 bp paired-end sequencing according to the manufacturer’s protocol. The results of paired-end nucleotide sequencing were obtained from two FASTQ files, one read from the 5’ end to the 3’ end and the other from the 3’ end to the 5’ end.

Pre-processing of BCR/Ig sequences

An overview of the workflow for the analysis is provided in Fig. 1. The workflow consists of 1) aligning BCR/Ig sequences, 2) defining clonotypes, and 3) extracting dominant BCR/Ig clones.

Fig. 1
figure 1

Pipeline for obtaining the BCRs/Igs data used as the query for the classification machine from normal or tumor tissue

Alignment of BCR/Ig sequences

A heavy chain of BCR/Ig is composed of variable(V), diversity (D), and joining (J) gene segments. To align paired-end nucleotide sequences of BCRs/Igs and estimate the appropriate assignment of V, D, and J gene segments, we used MiXCR, a tool for the universal framework processing of large immunome data from raw sequences for quantitating clonotypes [15]. Since D frame is very short and surrounded by random nucleotides, the accuracy of D frame assignment is generally very low compared to that of V/J frame assignment. Thus, we did not apply D frame assignment to further analysis.

Defining clonotypes

Although the BCR sequence of each B-cell is essentially different, a group of B-cells called a clone has the same ancestral origin, and similar BCR sequences. Clonal lineage of BCRs/Igs was estimated using MiXCR. The following clustering criteria were applied as described by Uduman et al. [16]:

  • The same combination of V/J frames

  • The same length of CDR3

  • A maximum of three nucleotide mismatches in CDR3

BCRs/Igs that satisfy the above criteria were defined as belonging to the same clonal family.

Extraction of dominant BCR/Ig clones

After constructing clonal families, BCR/Ig clones were selected for training the classification machine. Each clone included BCRs/Igs derived from tumor tissues as well as normal tissues. BCR/Ig clones for training were extracted using the following steps: 1) clones with at least 50 reads were extracted to reduce the effect of sequence errors. 2) A clone with 90% or more tumor-derived BCR/Ig content was defined as a tumor-specific clone. The same threshold was applied to define normal clones. 3) Three BCR/Ig sequences were sampled randomly from each extracted clone to eliminate the influence of biased clonal size.

Detailed information about the BCR/Ig data is provided in Additional file 1 (Table S1).

Classification using V/J frame assignment

Classification using V/J frame patterns was conducted using a Bayes classifier. When training the classifier, likelihood of combinations of V/J frames for a given type of tissue (normal or tumor) were calculated as follows

$$\begin{array}{*{20}l} P(V,J|N) & = \frac{1}{n}\sum\limits_{i = 1}^{n}P_{i}(V,J|N) \\ P(V,J|T) & = \frac{1}{n}\sum\limits_{i = 1}^{n}P_{i}(V,J|T) \end{array} $$

where n denotes the number of patients in the training data, and Pi(V,J|N) and Pi(V,J|T) denote the relative frequencies of ith patient for normal and tumor, respectively. Posterior probabilities of normal or tumor given V/J frames in the test data were calculated using Bayes’ theorem.

Since this classification machine contains no hyperparameter, the performance of the machine is simply measured by conducting leave-one-out cross-validation (LOOCV).

Classification using lengths of CDRs

Classification using lengths of CDRs was conducted in a similar way to the one using V/J frame assignment. Likelihood of combinations of lengths of CDR1, CDR2, CDR3 for a given type of tissue (normal or tumor) were calculated as follows

$$\begin{array}{*{20}l} P(L_{1},L_{2},L_{3}|N) & = \frac{1}{n}\sum\limits_{i = 1}^{n}P_{i}(L_{1},L_{2},L_{3}|N) \\ P(L_{1},L_{2},L_{3}|T) & = \frac{1}{n}\sum\limits_{i = 1}^{n}P_{i}(L_{1},L_{2},L_{3}|T) \end{array} $$

where L1, L2 and L3 denote the length of CDR1, CDR2 and CDR3, respectively, n denotes the number of patients in the training data, and Pi(L1,L2,L3|N) and Pi(L1,L2,L3|T) denote the relative frequencies of ith patient for normal and tumor, respectively. Posterior probabilities of normal or tumor given lengths of CDRs in the test data were calculated using Bayes’ theorem.

Classification using the number of SHMs

Classification using SHMs was conducted using a support vector classifier with linear kernel, and input features are the number of SHM in framework region (FR) and that in CDR.

Classification using amino acid feature vectors

Three kinds of machine learning models were used to classify based on the amino acid sequences of BCRs/Igs: convolutional neural networks (CNN), support vector machine (SVM), and random forest (RF). To take the physicochemical properties of BCRs/Igs into account, we created a feature vector based on Kidera factors, which transforms each amino acid into a 10-dimensional vector. It was originally derived from multivariate analysis of 188 physicochemical properties in each of the 20 amino acids [17].

Since CDRs in BCRs/Igs are mainly involved in antigen binding, only CDRs are used in this classification. Additionally, their individual lengths were trimmed or padded so that all the feature length be the same. The target lengths of CDRs were determined by their median lengths: 8, 8, and 17 for CDR1, CDR2, and CDR3 respectively. The following methodology was applied to fix the sequence length:

  • If the original length was the same as the target length, the sequence was not modified.

  • If the original length was larger than the target length, the center of the sequence was trimmed.

  • If the original length was smaller than the target length, a pseudo amino acid with a zero feature vector was inserted into the center of the sequence.

The above process created 33(= 8 + 8 + 17)-by-10 matrix for each BCR/Ig. The resulting dimension of the feature vector was 330.

Convolutional neural network

The CNN classifier includes a stack of several Conv-Leaky ReLU-MaxPool units, which are followed by fully connected hidden layers with Leaky ReLU activation. The last layer is a fully connected layer with the softmax function for normal and tumor classes. All Leaky ReLU activations were used with leak rate of 0.2. The optimization was done by Adam optimizer with minibatch size set to 100.

Hyperparameters were optimized using a random search strategy [18]. The search range of each hyperparameter is described in Table 1. CNN was implemented using the Python API of TensorFlow [19].

Table 1 Range of hyperparameter searched in the CNN classifier

Support vector machine

We used SVM using the radial basis function (RBF) kernel implemented in scikit-learn. There were two hyperparameters in our SVM classifiers; C and γ. C trades off any misclassification of training examples against the simplicity of the decision surface [20], and γ defines the extent of the influence of a single training example. These hyperparameters were tuned using a grid search strategy. The search range of C and γ were [100,101,102,103] and [10−2,10−3,10−4,10−5,], respectively.

Random forest

RF implemented in scikit-learn was used [20]. The maximum depth of a tree was tuned as a hyperparameter of the RF model, and its possible values were \(\sqrt {f}\), log2f, and f, where f is the number of features (=330) of an input BCR/Ig.

Model selection of machine learning

To optimize the hyperparameters of the classification machines with small number of samples, double cross-validation called nested cross validation was conducted [21]. The purposes of inner and outer cross validation are to determine the hyperparameters and to measure the generalization performance of the determined model, respectively. In our analysis, the inner loop was two-fold cross validation and the outer loop was LOOCV. When holding out validation data in each cross-validation, BCRs/Igs were split at the patient level instead of individual sequence level.

Effect of fixing the length of CDRs

Because the fixed CDR length could cause bias in the classification, effect of CDR length on the performance of our classifier was determined.

To check the effect of trimming and padding the CDR sequences, we calculated the classification performances of each length of CDR3. Because CDR3 has much larger diversity in terms of length and amino acid composition than CDR1 and CDR2, we assumed the effect of trimming and padding would be the largest in CDR3.

Classification using gradient boosting ensemble model

Linear gradient boosting machine implemented in XGBoost was applied to create ensemble classifiers using the four types of features. All hyperparameters were set to default values of Python API of XGBoost.

Motif analysis

Sequence motifs in CDR1, CDR2, and CDR3 were constructed by applying WebLogo3 [22] to the following 4 groups.

  1. 1.

    BCRs/Igs correctly predicted as Normal tissue-derived with high confidence (P(N)>0.9)

  2. 2.

    BCRs/Igs correctly predicted as Tumor-derived with high confidence (P(T)>0.9)

  3. 3.

    BCRs/Igs misclassified as Tumor-derived with high confidence (P(T)>0.9)

  4. 4.

    BCRs/Igs misclassified as Normal-derived with high confidence (P(N)>0.9)

Only sequences with average length of each region (8, 8 and 17 for CDR1, CDR2, and CDR3 respectively) were used.

Both ends of the CDR3 sequence (1st – 3rd amino acids from the N-terminus and 1st – 4th amino acids from the C-terminus), which is highly conserved, were removed to highlight the motifs in the other variable region.

Permutation test

To assess whether or not single BCR/Ig-classifier outperforms random classification, permutation analysis was conducted. The sequence-labels of the BCRs/Igs within each patient were shuffled randomly. A hundred-thousand permutations were performed, and average AUC was calculated over patients in each permutation.

Tissue-level classification

Sets of BCR/Ig sequences were classified into normal or tumor classes.

We used all of the Rep-Seq data of BCRs/Igs when evaluating the classifier in order to simulate actual diagnostic situations, although only tumor/normal-specific clonal families were used when constructing classifiers for individual BCRs/Igs.

Tissue classifier was constructed based on a trained GBM for individual BCR/Ig sequences. Since GBM estimates the probability of individual BCR/Ig deriving from tumor tissue, we used the average probability of the GBM-based classifier for individual BCR/Ig in a set as the tissue-level classification score, which was formulated as follows

$$\begin{array}{*{20}l} P_{\text{tissue}}(T|\mathcal{A}){: =}\frac{1}{|\mathcal{A}|}\sum\limits_{A \in \mathcal{A}}^{}P_{\text{seq}}(T|A) \end{array} $$

where A denotes a vector of outputs of four classifiers using different features for an individual BCR/Ig sequence, and \(\mathcal {A}\) denotes the set of all BCRs/Igs from the same tissue. The classification performance was determined by using a ROC curve on 180 samples from the 90 patients. To classify a target sample from a patient, individual BCR/Ig classifier trained with patients except for the target patient was used. In addition to the average score, tissue-level classification performance was determined by using median or mode of probabilities for comparison.

Training Ostmeyer’s model

Ostmeyer’s model was trained against our dataset from scratch, and evaluated using LOOCV over 90 patients. Training parameters were set to default values except for the number of replicas, which controls the number of simultaneous runs of Adam optimization. Because the main memory in our server was not enough to set the value to 100000 as in the original Ostmeyer’s model, we set the value to 1000, which is a limitation of the comparison.

Statistical test

Classification performance (AUC) between classifiers was evaluated using the Wilcoxon signed-rank test for statistical comparisons. These statistical tests were performed using SciPy Python library [23]. p<0.05 was considered significant for all statistical tests. A Bonferroni correction was applied for multiple comparison testing.

To test whether AUCs of different ROC curves differ significantly, the AUCs were compared using R (V.3.4.4) and the pROC package [24].


classification of individual BCRs/Igs

First, we investigated whether it is possible to discriminate individual BCR/Ig sequences in normal tissue from those of the tumor tissue.

Table S1 (Additional file 1) shows the read and clone statistics for each sample we analyzed. We evaluated the extent of CDR3 sequence sharing among patients, and 2.1%±3.5 (mean ±SD) of CDR3 amino acid sequence in normal tissues, and 1.6%±6.6 in tumor tissues were shown to be shared, which is consistent with previous studies [12, 25].

Since normal tissues can contain some clones from tumor tissues and vice versa, such contamination in training data could adversely affect the classification performance. Thus, we constructed training data using only tissue type-specific clones for each patient. Clones for 89 patients (1 patient excluded as no tumor-specific clones were identified) were used for the individual BCR/Ig classification. In order to remove the bias of clone sizes, 3 BCR/Ig sequences were extracted from each clone, and as a result, approximately 600,000 BCR/Ig sequences were used as training data in each leave-one-out model. The performance of the classification was measured by AUC-score calculated in each held-out patient in a LOOCV scheme.

We constructed classifiers using BCR features which could contribute to the difference between normal and tumor microenvironment and compared the classification performance among features. The features selected were amino acid sequences of CDRs, lengths of CDRs, V/J-frames, and the number of SHMs. Workflow of the classification using such features is illustrated in Fig. 2.

Fig. 2
figure 2

Workflow of individual BCR/Ig classification and tissue classification between normal/tumor environment using various sequence-independent features

First, multiple classifiers were constructed using amino acid sequences, and their performance was compared. Three representative supervised learning algorithms were compared: Convolutional Neural Networks (CNN), Support Vector Machine (SVM), and Random Forest (RF). To take the physicochemical properties of BCRs/Igs into account, amino acid sequences were encoded into matrices using Kidera factor and used as features. Hyperparameters of these classifiers were tuned using two-fold cross validation. As shown in Fig. 3a, CNN significantly outperformed the other two methods. Therefore, CNN using amino acid sequences was adopted in the rest of the study.

Fig. 3
figure 3

Letter-value plots are showing distribution of the area under the Receiver Operating Characteristic curve (AUROC) calculated on 89 held-out patient. The figures are illustrating the comparison of (a) different models using amino acid sequences, (b) different trimming/padding strategies, (c) models using various sequence-independent features as well as ensemble model combining them, and (d) CNN against different length of CDR3. Barplots (e) shows the coefficients of linear ensemble model

In the above experiment, CDR3 length was fixed by trimming or padding amino acid sequences to retain both end of the sequences since there is no straightforward way to deal with variable-length input in CNN, SVM, and RF. To verify our approach, classification performance among different trimming/padding strategies was compared. As shown in Fig. 3b, no significant difference was observed between the classification performances of CNN using center-trimmed/padded and ends-trimmed/padded strategies. Figure 3d shows the performance of CNN using each of the various amino acid lengths (8 to 26) of CDR3. Although significant difference was observed (p=0.02, Skillings-Mack test [26]), performances were better than random in all lengths. These results suggest that our trimming/padding strategy does not degrade the performance of CNN.

Our results indicate that BCR/Ig-sequences from normal and tumor environment have some distinct patterns in their amino acid sequences. However, differences in amino acid sequences in normal and tumor environment detected by CNN could be captured by other sequence-independent information, which can be calculated more easily. For example, most of the signals detected by the model may reflect the fact that tumor-derived clones tend to have undergone affinity maturation, while normal tissue-derived clones do not have such tendency. Therefore, classification experiments were conducted using the three sequence-independent features, CDR-length, usage of V/J frames and the number of SHMs, that might also exhibit distinct patterns in normal and cancer tissues.

In order to explore the possibility to improve the classification performance by ensemble, and to evaluate the contribution of each model to the performance, ensemble models were also constructed using multiple features including amino acid sequences by combining all the output probabilities against the sequence from the models described above. Gradient Boosting Machine (GBM) with linear model was used for the ensemble classifier.

Figure 3c shows the performances of different models using different sequence-independent features, as well as the ensemble model. All models performed significantly better than random (p-value < 10-5, permutation test), which means that all sequence-independent features we selected have distinct characteristics in normal and tumor microenvironment. Among the four non-ensemble models using sequence-independent features, those using the physicochemical properties of amino acid sequences and number of SHMs showed comparable performances, outperforming the other two models. Although ensemble models outperformed them, the difference was not remarkable. Coefficients of the GBM linear model shown in Fig. 3e indicated that output probability of CNN model is the most useful feature for the classification. Nonetheless, the performance of ensemble models that combine all models for each sequence-independent feature was similar to that of CNN, and 89 AUCs of CNN were correlated to those of other classifiers (Additional file 2: Figure S1). These results suggest that CNN can deal with the richest information and produce the most reliable probabilities, and that most of the patterns it recognizes might be correlated to other sequence-independent features.

Since input of CNN is the encoded physicochemical properties in amino acid sequences, and the classifiers is nonlinear, there is no straightforward way to extract biologically interpretable information from the model. Instead, sequence motifs in CDRs captured by the classifier were investigated, which is easy-to-understand. Sequence motifs on BCRs/Igs that were correctly classified or misclassified with high confidence in CDR1, CDR2, and CDR3 are shown in Fig. 4. Unfortunately, no prominent motif patterns were found between the BCRs/Igs of normal and tumor tissues. Although there might be some differences between correctly classified motifs and misclassified ones, the biological importance is unclear.

Fig. 4
figure 4

Sequence motifs constructed for all CDRs. Each motif is made using sequences that have average length of each region. We note that the y-axes of CDR3 is different from those of CDR1, CDR2

The distributions of tumor probabilities in normal/tumor samples were investigated over all patients (Additional file 3: Figure S2).

Tissue-level classification

Another main goal of this work was to precisely estimate tissue type, normal or tumor, using BCR/Ig sequences obtained from a tissue sample, under a hypothetical diagnostic scenario in which we can only access to a set of BCR/Ig sequences obtained from either of the tissue.

The tissue classifier was constructed using descriptive statistics values of the tumor probabilities of the already-trained GBM for individual BCRs/Igs. Figure 5a-c show the ROC-curves using average, median, and mode of the probabilities emitted by each single sequence-level classification models, respectively. Among the models based on the same sequence-level classifiers, the models employing average probabilities always outperformed the other two. Therefore, we adopted average probabilities as the descriptive statistics value for the tissue-level classification model in the rest of the study.

Fig. 5
figure 5

Receiver Operating Characteristic (ROC) curves showing tissue-level classification performance. Each figure illustrates the comparison of ROC-curves of (a) models using average score from sequence-independent features, (b) those using median score, (c) those using mode score, (d) ensemble model and clonal entropy, and (e) ensemble model and Ostmeyer’s model

Tissue-level classification was also conducted by adding clonal entropy information to the ensemble model. However, the comparison of such models shown in Fig. 5d suggested that it would not improve the classification performance.

In addition, our tissue-level classifier was compared to the previously introduced machine-learning method for disease classification based on BCR/Ig sequences [14]. Their model employs k-mer snippet on CDR3 sequences as a feature, and tissue-level classifications are carried out by constructing logistic regression model for a single snippet and by taking the maximum to aggregate the probabilities. Although this method was originally developed for classifying autoimmune diseases such as multiple sclerosis, it is applicable to cancer diagnosis. As shown in Fig. 5e, our model significantly outperformed the Ostmeyer’s model in our cancer dataset (p=3×10−6).


In this paper, we hypothesized that BCR/Igs repertoire in tumor microenvironment have distinct features from those in normal microenvironment, and thus they can be distinguished using machine learning techniques from repertoire sequence data. Classifiers using various sequence-independent features were developed and compared the performances to investigate what features contribute to the difference of the repertoire.

In the first experiment, individual BCRs/Igs were classified using amino acid sequences of CDRs, lengths of CDRs, V/J-frames and the number of SHMs. We found that all the features contributed to the classification performance with different degrees. This result indicates that sequence-independent features of normal/tumor tissues have shared propensity across patients. We also found that the number of SHMs in tumor tissue is fewer than that in normal tissue and it is one of the most significant discriminative features. There are some possibilities which explain the observation. It might be because relatively short time has passed since immune system recognize the tumor antigens compared to other antigens in gastric tissues. Another possibility is that there is a mechanism that suppresses affinity maturation in tumor microenvironment. Although some of the mechanism are already known [27, 28], the contribution in tumor immunity is yet to be investigated.

We also found that the CNN model using encoded amino acid sequences in BCRs/Igs contributed the most in the GBM ensemble model. High correlation value between the AUC values of the ensemble classifier without CNN model and that with CNN model suggests that CNN model captures most of the discriminative information captured by the other sequence-independent features. Although the performance gain over the classifier without CNN model was very small, the CNN model might be worthy of further investigation since there is still room for improvement. For example, our current CNN models cannot handle variable CDR lengths. Removing center of CDRs might lose important information for the discrimination. Incorporating more sophisticated machine learning models that can handle the variable length inputs such as Recurrent Neural Network might capture the information not capture by the other sequence-independent features and improve the classification performance.

In the second experiment, the tissue-level classification model was developed and compared the performance with Ostmeyer’s model, a logistic regression model using 6-mer snippets of CDR3 amino acid sequences in multiple-instance learning settings. Our model significantly outperformed Ostmeyer’s model. One of the advantages of our method is that it partially retains information about the position of amino acids in BCR/Ig sequences. Another advantage is that our model deals with global statistics of sequence-independent features, such as the number of SHMs and V/J-frame patterns. On the other hand, Ostmeyer’s model utilizes short sequence motifs in CDR3 sequences, irrespective of its relative position, and classification is performed based on only the most discriminative motif. The significant difference in the classification performance suggests that global sequence-independent features rather than individual amino acid sequences of BCRs/Igs contribute the difference between normal and tumor microenvironment, which could differ from other immune-related diseases such as autoimmune diseases. In autoimmune diseases, small number of common antigens and antibodies to the antigens contribute to the pathogenesis, which could result in the convergent amino acid sequence motifs in BCRs/Igs. On the other hand, in cancer microenvironment, antigens could be highly diverse both within each cancer tissue and across different cancer tissues. Such differences of immune landscapes can also explain why remarkable sequences motifs specific to normal/tumor microenvironments were not found.

The tissue-level classification in the second experiment achieved an AUC value of 0.826, which indicates that our model can be utilized in clinical applications. For example, cancer could be diagnosed even if the tissues do not contain cancer cells, so long as they contain tumor-infiltrated B cells or plasma cells. Indeed, single biopsies of gastric cancer tissues fail to show malignancy in 20% to 30% of cancer cases [29, 30]. Therefore, our approach could help correctly identify some of the false negatives in gastric biopsies and diagnose gastric cancers correctly by examining the BCR/Ig repertoires of B-cells adjacent to cancer cells, although further experimental validation will be required for clinical application of this concept.


Using a machine learning approach, we have shown that the BCR/Ig repertoire in gastric cancer tissues has distinct sequence characteristics compared to those of normal gastric tissues. BCR/Ig-features obtained in Rep-Seq showed their distinctiveness between normal and tumor environment in single sequence resolution and also showed high diagnostic capacity. Although we focused specifically on BCRs/Igs in gastric cancer, the same approach could be applied to evaluating TCRs and BCRs/Igs in other types of cancers or immune-related diseases, such as autoimmune diseases and infections, including those examined by past research [35, 7, 14]. Our analysis will open the door for new areas of tumor immunology and may lead to the development of novel diagnostic tools for cancer using repertoire sequencing.



Area under the curve


B-cell receptor


Complementarity-determining region


Convolutional neural networks


Framework region


Gradient boosting Machine




Leave-one-out cross-validation


Rectified linear units


Repertoire sequencing


Random forest


Receiver operating characteristic


Support vector machine


T-cell receptor


Tumor-specific antigen


  1. Wang C, Liu Y, Cavanagh MM, Le Saux S, Qi Q, Roskin KM, Looney TJ, Lee J-Y, Dixit V, Dekker CL, Swan GE, Goronzy JJ, Boyd SD. B-cell repertoire responses to varicella-zoster vaccination in human identical twins. Proc Natl Acad Sci. 2015; 112(2):500–5. NIHMS150003.

    Article  Google Scholar 

  2. Rechavi E, Lev A, Lee YN, Simon AJ, Yinon Y, Lipitz S, Amariglio N, Weisz B, Notarangelo LD, Somech R. Timely and spatially regulated maturation of B and T cell repertoire during human fetal development,. Sci Transl Med. 2015; 7(276):276–25.

    Article  CAS  Google Scholar 

  3. Glanville J, Huang H, Nau A, Hatton O, Wagar LE, Rubelt F, Ji X, Han A, Krams SM, Pettus C, Haas N, Arlehamn CSL, Sette A, Boyd SD, Scriba TJ, Martinez OM, Davis MM. Identifying specificity groups in the T cell receptor repertoire. Nature. 2017; 547(7661):94–8.

    Article  CAS  Google Scholar 

  4. Dash P, Mcclaren JL, Iii THO, Rothwell W, Todd B, Morris MY, Becksfort J, Reynolds C, Brown SA, Doherty PC, Thomas PG, Wang GC, Dash P, McCullers JA, Doherty PC, Thomas PG, Fiore-Gartland AJ, Hertz T, Wang GC, Sharma S, Souquette A, Crawford JC, Clemens EB, Nguyen THO, Kedzierska K, La Gruta NL, Bradley P, Thomas PG. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. 2017; 121(1):128–4212842. NIHMS150003.

    Article  CAS  Google Scholar 

  5. Thomas N, Best K, Cinelli M, Reich-Zeliger S, Gal H, Shifrut E, Madi A, Friedman N, Shawe-Taylor J, Chain B. Tracking global changes induced in the CD4 T-cell receptor repertoire by immunization with a complex antigen using short stretches of CDR3 protein sequence. Bioinformatics. 2014; 30(22):3181–8.

    Article  CAS  Google Scholar 

  6. Epstein M, Barenco M, Klein N, Hubank M, Callard RE. Revealing individual signatures of human T cell CDR3 sequence repertoires with Kidera Factors. PLoS ONE. 2014; 9(1):1–10.

    Article  Google Scholar 

  7. Emerson RO, DeWitt WS, Vignali M, Gravley J, Hu JK, Osborne EJ, Desmarais C, Klinger M, Carlson CS, Hansen JA, Rieder M, Robins HS. Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nat Genet. 2017; 49(5):659–65.

    Article  CAS  Google Scholar 

  8. Trück J, Ramasamy MN, Galson JD, Rance R, Parkhill J, Lunter G, Pollard AJ, Kelly DF. Identification of Antigen-Specific B Cell Receptor Sequences Using Public Repertoire Analysis. J Immunol. 2015; 194(1):252–61.

    Article  Google Scholar 

  9. Katoh H, Komura D, Konishi H, Suzuki R, Yamamoto A, Kakiuchi M, Sato R, Ushiku T, Yamamoto S, Tatsuno K, Oshima T, Nomura S, Seto Y, Fukayama M, Aburatani H, Ishikawa S. Immunogenetic Profiling for Gastric Cancers Identifies Sulfated Glycosaminoglycans as Major and Functional B Cell Antigens in Human Malignancies. Cell Rep. 2017; 20(5):1073–87.

    Article  CAS  Google Scholar 

  10. Parameswaran P, Liu Y, Roskin KM, Jackson KKL, Dixit VP, Lee JY, Artiles KL, Zompi S, Vargas MJ, Simen BB, Hanczaruk B, McGowan KR, Tariq MA, Pourmand N, Koller D, Balmaseda A, Boyd SD, Harris E, Fire AZ. Convergent antibody signatures in human dengue. Cell Host Microbe. 2013; 13(6):691–700. NIHMS150003.

    Article  CAS  Google Scholar 

  11. Zhang W, Feng Q, Wang C, Zeng X, Du Y, Lin L, Wu J, Fu L, Yang K, Xu X, Xu H, Zhao Y, Li X, Schoenauer UH, Stadlmayr A, Saksena NK, Tilg H, Datz C, Liu X. Characterization of the B Cell Receptor Repertoire in the Intestinal Mucosa and of Tumor-Infiltrating Lymphocytes in Colorectal Adenoma and Carcinoma. J Immunol. 2017:1602039.

    Article  CAS  Google Scholar 

  12. Galson JD, Clutterbuck EA, Trück J, Ramasamy MN, Münz M, Fowler A, Cerundolo V, Pollard AJ, Lunter G, Kelly DF. BCR repertoire sequencing: Different patterns of B-cell activation after two Meningococcal vaccines. Immunol Cell Biol. 2015; 93(10):885–95. arXiv:1408.1149.

    Article  CAS  Google Scholar 

  13. Fridman WH, Pagès F, Sautès-Fridman C, Galon J. The immune contexture in human tumours: impact on clinical outcome. Nat Rev Cancer. 2012; 12(4):298–306. NIHMS150003.

    Article  CAS  Google Scholar 

  14. Ostmeyer J, Christley S, Rounds WH, Toby I, Greenberg BM, Monson NL, Cowell LG. Statistical classifiers for diagnosing disease from immune repertoires: a case study using multiple sclerosis. BMC Bioinformatics. 2017; 18(1):401.

  15. Bolotin DA, Poslavsky S, Mitrophanov I, Shugay M, Mamedov IZ, Putintseva EV, Chudakov DM. MiXCR: software for comprehensive adaptive immunity profiling. Nat Methods. 2015; 12(5):380–1.

    Article  CAS  Google Scholar 

  16. Uduman M, Shlomchik MJ, Vigneault F, Church GM, Kleinstein SH. Integrating B cell lineage information into statistical tests for detecting selection in Ig sequences,. J Immunol. 2014; 192(3):867–74.

    Article  Google Scholar 

  17. Kidera A, Konishi Y, Oka M, Ooi T, Scheraga Ha. Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J Protein Chem. 1985; 4:23–55.

    Article  CAS  Google Scholar 

  18. Bergstra J, Bengio Y. Random Search for Hyper-Parameter Optimization. J Mach Learn Res. 2012; 13:281–305.

    Google Scholar 

  19. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mané D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viégas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from 2015. Accessed 1 Dec 2017.

  20. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011; 12:2825–30.

    Google Scholar 

  21. Fran R. Double Cross Validation for Model Based Classification. 2006. Accessed 1 Dec 2017.

  22. Crooks G, Hon G, Chandonia J, Brenner S. WebLogo: a sequence logo generator. Genome Res. 2004; 14:1188–90.

  23. Jones E, Oliphant T, Peterson P. SciPy: Open source scientific tools for Python. 2001. Accessed 1 Dec 2017.

  24. Turck N, Vutskits L, Sanchez-Pena P, Robin X, Hainard A, Gex-Fabry M, Fouda C, Bassem H, Mueller M, Lisacek F, Puybasset L, Sanchez J-C. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011; 8:12–77.

    Article  Google Scholar 

  25. Soto C, Bombardi RG, Branchizio A, Kose N, Matta P, Sevy AM, Sinkovits RS, Gilchuk P, Finn JA, Crowe JE. High frequency of shared clonotypes in human B cell receptor repertoires. Nature. 2019; 566(7744):398–402.

  26. Chatfleld M, Mander A. The skillings-mack test (friedman test when there are missing data). Stata J. 2009; 9(2):299–305. TheStataJournal.

  27. Brief I. of Germinal Centers. 2018; 24(13):3367–73.

  28. Godoy-Lozano EE, Téllez-Sosa J, Sánchez-González G, Sámano-Sánchez H, Aguilar-Salgado A, Salinas-Rodríguez A, Cortina-Ceballos B, Vivanco-Cid H, Hernández-Flores K, Pfaff JM, Kahle KM, Doranz BJ, Gómez-Barreto RE, Valdovinos-Torres H, López-Martínez I, Rodriguez MH, Martínez-Barnetche J. Lower IgG somatic hypermutation rates during acute dengue virus infection is compatible with a germinal center-independent B cell response. Genome Med. 2016; 8(1):1–19.

  29. Graham DY, Schwartz JT, Cain GD, Gyorkey F. Prospective evaluation of biopsy number in the diagnosis of esophageal and gastric carcinoma. Gastroenterology. 1982;82(2):228–31.

    Article  CAS  Google Scholar 

  30. Choi Y, Choi HS, Jeon WK, Kim BI, Park DI, Cho YK, Kim HJ, Park JH, Sohn CI. Optimal number of endoscopic biopsies in diagnosis of advanced gastric and colorectal cancer. J Korean Med Sci. 2012; 27(1):36–9.

    Article  Google Scholar 

Download references


The super-computing resource was provided by Human Genome Center, the Institute of Medical Science, the University of Tokyo.


This study was supported by the AMED Basic Science and Platform Technology Program for Innovative Biological Medicine (JP18am0301011) to S.I., the AMED Project for Cancer Research and Therapeutic Evolution (P-CREATE) (JP17cm0106514 and JP18cm010653) to S.I., and the AMED Research on Development of New Drugs (JP18ak0101096) to H.Katoh. The funding bodies did not play any role in the design of the study or in collection, analysis, and interpretation of data, or in writing the manuscript.

Availability of data and materials

BCR/Ig sequence data from the 90 patients are available at the Japanese Genotype-Phenotype Archive under accession number JGAS00000000103.

Author information

Authors and Affiliations



HKK, DK, HTK, and SPI conceived of and designed the study. HTK, AY, HMK, and SA were involved in processing tissue samples and acquisition of the repertoire sequencing data. YS and MF contributed to the clinical tissue sample collection. HKK implemented the pipeline for the computer experiments. SPI, SII, and RY supervised the research and provided meaningful suggestions and discussion. HKK, DK, and SPI wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Shumpei Ishikawa.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the institutional review boards of the University of Tokyo and Tokyo Medical and Dental University. Ninety frozen specimens of gastric cancer surgically resected between 2009 and 2016 at the University of Tokyo Hospital were analyzed in this study with written, informed consent from the patients.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1

Table S1. A table showing the number of reads, inferred reads, unique reads, clones, and extracted clones for each sample. (XLS 33 kb)

Additional file 2

Figure S1. Scatter plots showing correlation between 89 AU-ROCs of CNN to those of other classifiers (ensemble classifier, ensemble classifier without CNN, linear classifier using # of SHMs, Bayes classifier using usage of V/J frames, and Bayes classifier using CDR-lengths. (PNG 188 kb)

Additional file 3

Figure S2. Clonal entropy and distribution of tumor probabilities in whole normal/tumor samples over 90 patients. (PNG 889 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Konishi, H., Komura, D., Katoh, H. et al. Capturing the differences between humoral immunity in the normal and tumor environments from repertoire-seq of B-cell receptors using supervised machine learning. BMC Bioinformatics 20, 267 (2019).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: