Multiple instance neural networks based on sparse attention for cancer detection using T-cell receptor sequences

Early detection of cancers has been much explored due to its paramount importance in biomedical fields. Among different types of data used to answer this biological question, studies based on T cell receptors (TCRs) are under recent spotlight due to the growing appreciation of the roles of the host immunity system in tumor biology. However, the one-to-many correspondence between a patient and multiple TCR sequences hinders researchers from simply adopting classical statistical/machine learning methods. There were recent attempts to model this type of data in the context of multiple instance learning (MIL). Despite the novel application of MIL to cancer detection using TCR sequences and the demonstrated adequate performance in several tumor types, there is still room for improvement, especially for certain cancer types. Furthermore, explainable neural network models are not fully investigated for this application. In this article, we propose multiple instance neural networks based on sparse attention (MINN-SA) to enhance the performance in cancer detection and explainability. The sparse attention structure drops out uninformative instances in each bag, achieving both interpretability and better predictive performance in combination with the skip connection. Our experiments show that MINN-SA yields the highest area under the ROC curve scores on average measured across 10 different types of cancers, compared to existing MIL approaches. Moreover, we observe from the estimated attentions that MINN-SA can identify the TCRs that are specific for tumor antigens in the same T cell repertoire.

The problem to distinguish normal and cancerous tissues/patients has attracted much attention due to its great significance in biomedical fields for cancer prognosis. To address this biological problem, past works have used medical images [25,26], gene expressions [27][28][29][30], and single nucleotide polymorphisms (SNPs) [31][32][33], etc. As the basis for forming predictive models, the recent appreciation of the roles of the host immunity system in tumor biology has motivated researchers to study T cell receptors (TCRs) [7,24,[34][35][36]. [7] predicted the tumor status of patients using TCR sequences. As multiple TCR sequences (instances) are observed together in different T cells in the same patient (tumor or normal), the observations naturally fall into the category of MIL. The authors conducted a benchmarking study to compare MIC algorithms, but they did not treat deep neural networks in depth, thus leaving the performance of neural network models under-explored. Though [37]'s models are included in comparison, they are not complex enough to learn successfully the underlying structure of the TCR data, thus showing unsatisfactory performance. [24] investigated the multiple instance learning task that distinguishes T-cell repertoires between tumor and healthy tissues, but they assumed the standard MIL assumption which does not consider relative importance of instances. [34] proposed a deep learning model called DeepCat that utilizes tumorspecific or non-cancer TCRs, but DeepCat ignores the bag-structure as well as possibly different contributions of TCRs. Another deep learning model, named DeepLION, is proposed by [38], but DeepLION cannot completely remove unimportant instances in explaining the bag labels.
Towards bridging this gap, we introduce a novel neural network model, titled MINN-SA (Multiple Instance Neural Network based on Sparse Attentions), for cancer detection based on TCR sequences. The salient part of the proposal is the sparse attention structure that flexibly drops out uninformative instances, thus rendering model interpretation more achievable. Recent works from [39][40][41][42] also employed an attention structure in their neural networks. However, their attention scores are dense, meaning none of them is exactly 0, so that irrelevant information for bag classification could be involved in extracted features. Moreover, the sparsity pattern considered in [42] is based on a simple heuristic that keeps top-N instances with the largest scores, which lacks of optimality and stability in results. In contrast, equipped with the sparsemax function by [43], MINN-SA adaptively discovers the pattern of sparsity in attention scores. This flexibility is also beneficial to predictive performance, which can be further enhanced by adding the skip connection by [44]. With this state-of-the-art architecture, we achieve the highest overall AUC scores both in balanced and imbalanced datasets in the cancer detection problem. Our main contributions are summarized as follows: • We propose a sparse attention-based neural network that drops out uninformative instance per bag, achieving model interpretability. • MINN-SA outperforms comparative methods in cancer detection based on TCR sequences, achieving the highest AUC scores both in balanced and imbalanced datasets.
The remainder of this paper is organized as follows. in "Methods" Section gives the details of our method. in "Result" Section presents the experimental setup and results of TCR dataset. Finally, we end in "Discussion" Section with a summary and discussion.

Multiple instance learning for cancer detection using TCR sequences
In MIL, an observational unit is a bag (a sample). In bag i ( i = 1, . . . , n ), each of multiple instances is characterized by a vector x ij of p features in R p , j = 1, . . . , m i , and a single label y i is tagged on it. The goal of MIL is to estimate a function f that predicts a bag-level label from a set of instances. Note that this function takes a set of instances as an input, so it should be adaptive to different number of instances for each bag. In our application, tissue samples are collected either from normal or cancer patients and a set of TCR sequences are identified in each sample by using next generation sequencing technologies [7,45]. The main task is to determine whether a tissue is cancerous or not based on its TCR sequences. Here, we treat the tissue type as a bag-level label and the set of TCR sequences as multiple instances, all of which are contained together in a patient, or a bag. Under this context, we focus on a binary MIC task and thus restrict a bag label in {0, 1} ; for example, 0 (a negative bag) is non-cancer and 1 (a positive bag) is cancer. The binary classification could be easily extended to the multiclass setup simply by changing a score function to the softmax function and a loss function to the multi-class cross-entropy. To associate a series of unlabeled instances to a bag label, we adopt the primary instance assumption. In other words, it is assumed that a portion of instances, or primary instances, can explain the label while the remaining instances, or non-primary instances, are irrelevant to it. In our contexts, those selected TCRs represent specialized T cells that the human immune system develops against the tumor cells. More specifically, the TCRs recognize the tumor-associated antigens [46,47] or tumor neoantigens [48][49][50] presented on the surface of the tumor cells, which are markers of the tumor cells and distinguish them from normal epithelial cells.
We utilize the sparse attention in the multiple instance neural networks to detect such meaningful TCRs. The proposed layer selectively reflects instances' information to an extracted feature vector for final classification. The sparsity enhances the classifier's performance and the explainability of classification results. Moreover, the proposed method is computationally efficient. The details of the proposed method are presented in the following subsections.

Numeric embeddings of TCR sequences
We describe the process of numeric embedding of TCR sequences carried out in [7]. TCR sequence is a text string comprising a series of amino acids, which is actually a text string. According to [51], each amino acid can be converted to five Atchley (latent) factors that sufficiently represent the attributes of the amino acid. This conversion of a set of TCR sequences returns a Atchley matrix, which is inserted into the TCR encoding algorithm [36,45]. The key part of the algorithm named TESSA is a stacked auto-encoder that takes Atchley matrices and returns a set of vectors of fixed length (30 dimensions) determined by the number of neurons in the bottleneck layer. The encoded numeric representation facilitates the usage of TCR sequence data. Most MIL algorithms are only compatible with the numeric type of data, especially for those methods calculating a distance between instances (or bags). We refer to [7] for more information about the data and processing details. In particular, we do not claim any original contribution to the data.
We mention that there are other sequence processing methods such as [52]. The autoencoder TESSA offers satisfactory performance in our datasets. However, practitioners can always experiment on other methods for potentially better performance.

Neural networks based on sparse attention
We propose a neural network based on sparse attention to solve multiple instance classification problems. The overall structure of our model is illustrated in Fig. 1a, and we give details of each component in our neural network below.
The input size is fixed by m * × p for each of n bags where m * is the hyperparameter that decides the number of instances to be included in modeling. If a bag has fewer instances than m * , then an empty part is padded by zeros. Some of bags have larger instance size than m * . In the case, we determine the first m * instances as our bag instances. Also, we keep using this masking information across the whole process. For ease of handling, one can easily set m * by the largest bag size in data. In "Results" Section, we conduct a sensitivity analysis for choice of the size m * .
Each layer consists of m * × p neurons, but they are not fully connected to the activation functions in the previous layer as in the usual fully-connected layer. Instead, we build the full connections between neurons within each instance, but not across different instances (see Fig. 1b). The weights for a fully connected layer are shared across the instances to handle the variable number of instances in each bag. Hence, the two consecutive layers are connected by a p × p weight matrix (or (p + 1) × (p + 1) if a bias term is included). The network deals with the non-linearity in data with the rectified linear unit (ReLU) [53]. Note that we used fully connected layers rather than convolution layers because the input features are not locally correlated between adjacent features.
We employed the dropout [54] and skip connection [44] to enhance the predictive performance. The dropout layer is attached after each locally fully-connected layer appears. It randomly forces the output variables to be zero with probability 0.3 while training the network. It is well known that this layer prevents the complex neural network from overfitting on training data. The residual learning framework eases the training of networks with stacked layers. The skip connections in the framework explicitly let the layers fit a residual mapping instead of hoping stacked layers to directly fit a desired underlying mapping denoted by H(x) . We let the stacked nonlinear layers fit another mapping of F(x) := H(x) − x . The bypassing path for a gradient mitigates the vanishing gradient issue in neural networks. The empirical studies show that it is easier to optimize the residual mapping than to optimize the original. Thus, we employ the skip connection to enhance the performance of the locally fully-connected neural network with multilayers. The effect of residual connection is demonstrated in "Ablation study" Section.
The attention layer combines the attention weights {α j } m * j=1 returned by the sparsemax function [43] with the feature matrix Z ∈ R m * ×p obtained from the last network layer, ending up with a weighted feature vector z = m * j=1 α j z j where z j is the j-th row of Z and z ∈ R 1×p (see Fig. 1c). The commonly used softmax function does not pursue exact zeros in the output, but the sparsemax function permits zeros in it. Let The sparsemax is a function mapping vectors in R K to probability distributions in K −1 : In our context, the sparsemax function takes attention scores and encourages some of them to be zero if they do not exceed some thresholding value. As shown in [43], the threshold is adaptively determined from the scores, not manually by users like hard/ soft-thresholding functions. The manual adjustment of the attention threshold uses an additional hyperparameter to determine redundant attention scores forced to be zero. If some of the attention scores are less than the threshold, we manually set the attention score equal to zero. A grid search algorithm usually optimizes the hyperparameter and depending on the grid, the optimal value can vary. However, adaptive thresholding with the sparsemax function finds theoretically optimal value without heuristic search, leading to better generalization performance of the prediction model by training with the optimal hyperparameter. Moreover, adaptive thresholding is computationally more efficient than manual thresholding because it requires no validation procedure to optimize the threshold value. For each bag, two scores, which correspond to classes of tumor and normal tissues, are converted to probabilities, which is easier to interpret. The transition is done by the sigmoid function. The input of classification layer is batch-normalized feature vector z * ∈ R 1×p and the output is a scalar. The aggregated feature vector z may have different means and variances across components, which could hamper the network from learning data stably. Thus, we apply the batch normalization [55] to overcome this difficulty.
We should mention the difference between the proposed attention structure and the others from the previous works. Derived from [39][40][41][42], the attention scores are strictly larger than zero. Thus, it is challenging to discard unimportant or non-primary instances that have little to do with bag classification. In contrast, MINN-SA allows the attention scores to be sparse or have many zeros, meaning that strictly positive weights are only given to the primary instances responsible to bag classification. This is an attractive instance selection because it transparently shows which instances are chosen in model training and thus facilitates one to interpret classification results only depending on the selected instances. Moreover, our selection is more advanced than the elementary top-N rule [42] and free from tuning parameters often required in thresholding operators.

Computation
We implement the method with a deep learning framework, PyTorch 1.8.1 [56] with CUDA 11.1 [57]. The computation system consists of Intel i9-10900 CPU, 32GB RAM, and RTX 3090 GPU. It takes 0.01 seconds for each epoch in the training stage. We stop the training after 100 epochs and determine the best performing model during the training procedure. Therefore, the whole training process approximately takes 1 second in our computation system setting. Referring to the computational time results from [7], the proposed method is more efficient than other multiple instances learning methods in computation.

Results
In this section, we showcase our novel model in distinguishing tumor and normal tissue samples from different types of cancers. Different existing methods are compared against our model in terms of predictive accuracy. Moreover, we provide the instance selection result derived from the estimated attention weights. Lastly, we conduct an ablation study to examine contributions of individual components to our model.
The real datasets we analyze are from The Cancer Genome Atlas (TCGA) database, in whole generated by the TCGA Research Network 1 . Normal and healthy tissues are collected for 10 types of cancers listed in Table 1. As mentioned in "Numeric embeddings of TCR sequences" Section, these samples are processed through the next generation sequencing, TCR reconstruction techniques, and TCR encoding algorithms so that the genomic data from the donors are converted in numeric vectors.

Setting
To figure out how they behave in different scenarios, we test the comparative models under two scenarios: (1) the balanced case and (2) the imbalanced case. The former has an equal number of positive (tumor) and negative (normal) bags, while the latter sets about 10% of bags to be positive. The balanced data is commonly used and often preferred in machine learning literature. The other one is to capture characteristics of large population cancer screening where few patients have tumors [7,58,59]. To create a dataset for each cancer type, we subsample normal and tumor tissues to keep the aimed proportion of positive (tumor) bags. The sample size of each dataset is tabulated in Table 1. For model training and validation, we conduct 10-fold cross validation (CV) to split training and testing datasets. On the testing dataset, the Area Under the Curve (AUC) of each method is calculated based on the Receiver Operating Characteristic (ROC) curve. We follow the same experimental design in the preceding work [7] for fair comparison.

Benchmarking on cancer detection
To benchmark the proposed model, 18 MIC methods are considered, which are listed in Table 2. We refer to Section 3 of [7] for a detailed exposition about these methods.  [72], meaning they have a lot of T cell infiltrations. It makes sense these cancer types are among the ones [72] for which our model performs the best, which investigates TCRs of T cells for classification. Generally, when the class distribution is balanced, each model shows more stable performance [72]. The gap between MINN-SA and the second best method is considerably big in BRCA and KIRC datasets for the imbalanced case. Figure 3 shows the boxplots of all methods in an ascending order of median AUC. MINN-SA tops in both balanced imbalanced cases, with medians 74.40 and 81.80, respectively, followed by EMD-SVM with 72.20 and 75.40. To demonstrate the superiority of the proposed method over comparative methods, we conducted Wilcoxon signed-rank test on the best and second-best methods with rank statistics. Our proposed method achieves the best performance in terms of average rank over cancers. The average ranks of the proposed method are 2.3 and 2.2 in balanced and imbalanced cases, respectively. The second best method is EMD-SVM, whose average ranks are 2.7 and 5.2. In the balanced case, the p-value is 0.0372, and the p-value for the imbalanced case is 0.0043. Our proposed method outperforms other approaches in both cases with statistical significance. For detailed AUC values averaged over cancer types, See Table 3.
Moreover, the proposed MINN-SA enjoys interpretable results from selected instances, which EMD-SVM does not afford. Also, MINN-SA is not very sensitive to class imbalance. Contrary to it, most MIL methods have degraded performance for the imbalanced case, calling for further modifications; for example, data generation or modifying the loss function of the classification model [73][74][75].

Attention of instances
In Fig. 4, attention weights of instances are displayed in heatmap. The heatmap is given in a n × m * matrix form where the weights are colored in blue-white spectrum and the masking area (no instances) in gray. The visual inspection demonstrates that the attention weights estimated by MINN-SA are sparser than those by the softmaxbased method. For the case based on the softmax function, all instances have strictly positive weights, which is depicted by smooth patterns in the heatmap. Consequently, the dense weights makes the aggregated feature vector from the attention layer depend on redundant information for classification. On the other hand, MINN-SA forces the attention weights of insignificant instances to be exactly zero, which makes decisions of MINN-SA independent of them. In our data, the selected instances are likely the TCRs that are specific for tumor antigens, such as tumor neoantigens or tumor associated antigens, presented on the surface of the tumor cells.
To prove this claim, another validation experiment is conducted. We extract all virus-related TCRs from http://friedmanlab.weizmann.ac.il/McPAS-TCR/. We embed the 31315 TCRs in 30-dimensional space via the TCR encoder TESSA mentioned in "Numeric embeddings of TCR sequences" Section. Then, we measure Euclidean distances between all pairs of TCRs from the new database and TCGA database. By averaging the distances for each TCR from tumor/normal samples of TCGA, we can compare how close each TCR is to the virus-related ones. We focus on the three cancer types (ESCA, OV, STAD) where the classifiers considered show best performance. We show boxplots to compare distances. Recall that MINN-SA estimated weights of instances, so we know which are primary/non-primary.
In Fig. 5 above, tumor bags (Label=1) have larger distances in primary instances compared with non-primary instances (all p-values close to 0), which implies the primary instances responsible for bag classification are not virus-specific, but tumorspecific. This is opposite in normal samples (Label=0), where the primary instances are specific to non-tumor immunologic events, such as virus infection but of course    Fig. 2 The AUC of all models in comparison across different cancer types. Each point is the average AUC values over 10-fold CV other events as well. As a result, this additional analysis proves well the performance of instance selection using MINN-SA, which is directly related to its interpretability. Figure 6 shows the extracted features before the last classification layer. The heatmap is given in a n × p matrix form colored in a blue-yellow spectrum. It can be seen that the extracted features using the sparsemax function (left) are more activated than using the softmax function (right). Hence, the difference between the features in each observation of the sparsemax case is more distinct than the softmax case. The results demonstrate that the extracted features using the sparsemax function are more informative to characterize the characteristics of each sample. We believe that the sparsity in the proposed attention structure distinguishes the instances responsible for the bag classification so that the aggregated features can accurately discriminate bags in the classification layer.

Ablation study
Firstly, we perform an ablation study to measure contributions of the two components to our neural network: (1) the skip connection and (2) the sparsemax function. Thus, we set a baseline model, denoted by "FC" (short for "fully-connected"), by removing the two components from MINN-SA, and we add each component one after another to "FC". This leads to the four comparative models shown in Table 4. Note that "FC" is the model used in [22], and "Proposed" is MINN-SA. Figure 7 shows AUC values of the four models for different cancer types. "Proposed" outperforms the others in most cases; otherwise it makes a close second. The superiority of "Proposed" is less distinct in the imbalanced case, but it always takes the first  Table 4). These results lend strong support to the proposed method against the existing multiple instance neural network model by [22], a fully-connected neural network based on the softmax function without the skip connection. According to the known phenomenon of immunodominance in the field of immunology [76,77], not all instances own necessary information and thus such instances would be better removed for classification by the sparse attention structure. The skip connection also proves valuable for this specific application. Interestingly, the performance is the best when both skip connection and sparsemax function are utilized together. This phenomenon aligns with the previous study [78] that shows combining regularization tricks can achieve the best classification performance. In the following ablation study, we assess the sensitivity to the maximum number of instances per bag. We have tried different maximum numbers by m * = 30, 60, . . . , 150 and their AUC values are reported in Fig. 8.
The results show that there is no significant difference in classification performance according to the hyperparmeter m * . This implies that the absolute amount of data increases as m * increases, but useful information learned for classification could be limited. However, as shown in Fig. 9, bags with more than 30 instances belong to the

Discussion
This paper shows that MINN-SA has improved performance on most cancer types compared to the existing methods in predicting tumor vs. normal tissue samples using TCRs.
The results imply that a deep neural network is suitable for multiple instance learning. In contrast, traditional statistical approaches may not work that well, and this is because the deep learning approaches can capture very complicated bag-instance structures. The flexible structures of deep neural networks reflect the bag-instance information efficiently through purely data-driven approaches. For statistical approaches, such baginstance relationships have to be assumed and sophisticatedly specified. We believe the hand-made specifications are vulnerable to data variations such as cancer types. The prediction performance can be varied depending on data distributions which are different for each cancer type. The proposed method describes some cancer data distributions well, but in some other cases, it shows worse performance than other methods. It is closely related to the "No Free Lunch" theorem in machine learning, [79]; that is, no algorithm can always perform better than others for any data distribution.
The proposed method achieves better performance in imbalanced cases. We believe that the sparse attention leads to the improvement. When the data are imbalanced, it is harder to determine the appropriate decision boundary for classification, and the boundary between classes could be vague in empirical cases. The sparse attention removes noisy instances from blurring the decision boundary. We conjecture that such feature improvement effects are more significant in the imbalanced cases.
Setting m * as a hyperparameter is closely related to computational efficiency. Bag size [1,5) [ 5,10) [ 10,30) [ 30,100) [100,Inf] Fig. 9 The distribution of bag sizes in the balanced dataset. Most of bags have less than 5 instances, expressing heavy-tailed features resulting in inefficiency in both memory and computation. Thus, our model is designed to allow the adjustment of m* according to an actual situation. The performance comparison results on the different m * values can be seen in Fig. 8 of the revised manuscript. Overall, the performance difference does not seem to be significant, and the performance variation pattern is unclear. Therefore, it is recommended to optimize performance by choosing an appropriate m * . Note that users can set it to the maximum instance number if they do not wish to tune m * .
Potentially, the proposed model can be applied to various MIC problems where instances are naturally arranged in sets and weakly annotated. The applications include biology and chemistry, computer vision, document classification, web mining, reinforcement learning, speech recognition, and time series classification [8]. Specifically to biology and bioinformatics, bags can consist of complex chemical or biological entities; e.g. protein-protein interactions [80][81][82] where a bag can be defined between two objects (proteins or protein complexes) and instances defined by pairwise combinations of sub-units in each object. Thus, a seemingly non-MIL problem can be reformulated into a MIL problem. We conjecture that MINN SA might be adapted to other prediction problems in bioinformatics, such as (lncRNA-miRNA association prediction [83]; drug-disease association prediction [84,85]). Certainly, there is ample space for future research.
The instance selection procedure by the sparse attention could be considered a regularization technique commonly used in statistical learning [86]. The LASSO [87] is a representative method to select essential features with sparsity. Regression coefficients of LASSO have non-zero or exact zero values by a shrinkage constraint. The features with non-zero coefficients affect the response variations, but the features with zero coefficients have no influence. The sparsity removes redundant feature information in calculating responses. The proposed method and the LASSO share a similar concept of selecting important information from data. The difference is that LASSO selects important features, but the proposed method selects important instances in a bag. The relationship between the sparse and basic attention for MIL is demonstrated by the relationship between LASSO and Ridge regression [88]. Ridge regression is also a shrinkage method, but the coefficients are not forced to be exact zeros. Thus, it is hard to interpret the Ridge regression results regarding feature selection. In the MIL problem, previous methods based on attention are inappropriate for explaining classification results because all the attention values are non-zero.