A multitask transfer learning framework for novel virus-human protein interactions

Understanding the interaction patterns between a particular virus and human proteins plays a crucial role in unveiling the underlying mechanism of viral infection. This could further help in developing treatments of viral diseases. The main issues in tackling it as a machine learning problem is the scarcity of training data as well input information of the viral proteins. We overcome these limitations by exploiting powerful statistical protein representations derived from a corpus of around 24 Million protein sequences in a multi task framework. Our experiments on 7 varied benchmark datasets support the superiority of our approach.


INTRODUCTION
Viral infections most have been increasingly burdening the healthcare systems. Biologically the viral infection involves many protein-protein interactions (PPIs) between the virus and its host. These interactions range from the initial biding of viral coat proteins to the host membrane receptor to the hijacking of the host transcription machinery by viral proteins. In this work we develop a deep learning based computational model for predicting interactions between a novel virus (a completely new one) and human proteins.
One of the key challenges in tackling the current learning task with novel unseen viruses is the limited training data. Often, some known interactions of related viruses are used to train supervised models. These data is usually collected by wet lab experiments and are usually too little to ensure generalizability of trained models. In effect, the trained models might overfit the training data and would give inaccurate predictions for the novel virus.
Moreover, viral proteins are substantially different from human or bacterial proteins. They are structurally dynamic so that they cannot be easily detected by common sequence-structure comparison (Requião et al., 2020). Virus protein sequences of different species share only little in common (Eid et al., 2016). Therefore, models trained for other human PPI (Li & Ilie, 2020;Sun et al., 2017;Li, 2020;Chen et al., 2019;Sarkar & Saha, 2019) or for other pathogen-human PPI (Sudhakar et al., 2020;Mei & Zhang, 2020;Dick et al., 2020;Li et al., 2014;Guven-Maiorov et al., 2019;Basit et al., 2018)(for which more data might be available) cannot be directly used for predictions for novel viral-human protein interactions.
While for human proteins, features related to their function, semantic annotation, domain, structure, pathway, etc. can be extracted from public databases, such information is not readily available for viral proteins. The only reliable source of viral protein information is its amino acid sequence. Learning effective representations of the viral proteins is thus an important step towards building the prediction model. Heuristics such as K-mer composition usually used for protein representations are bound to fail as it is known that viral proteins with completely different sequences might show similar interaction patterns.
Other existing works also employed additional features to represent viral proteins such as protein functional information (or GO annotation) (Wang, 2020), proteins domain-domain associations information as in (Barman et al., 2014), protein structure information as in (Lasso et al., 2019;Guven-Maiorov et al., 2019), and the disease phenotype of clinical symptoms as in (Wang, 2020). A major limitation of these approaches is that they cannot generalize to novel viruses where such information is not available or lack experimentally supported evidence. In this work we tackle the above limitations by exploiting powerful statistical protein representations derived from a corpus of around 24 Million protein sequences in a multitask framework. Noting the fact that virus tends to mimic humans towards building interactions with its proteins, we use the prediction of human PPI as a side task to further regularize our model and improve generalization. Our large scale experiments on a number of datasets showcase the superiority of our approach.

OUR APPROACH
The schematic diagram of our proposed model is presented in Figure 1. We use human and virus raw protein sequences as input. As side or domain information we use human protein-protein interaction network of around 20,000 proteins and over 22M interactions from STRING (Szklarczyk et al., 2015) database.
We note that the protein sequence determines the protein's structural conformation (fold), which further determines its function and its interaction pattern with other proteins. However, the underlying mechanism of the sequence-to-structure matching process is very complex and cannot be easily specified by hand crafted rules. Therefore, rather than using handcrafted features extracted from amino acid sequences we employ the pre-trained UNIREP model (Alley et al., 2019) to generate latent representations or protein embeddings. The protein representations extracted from UNIREP model are empirically shown to preserve fundamental properties of the proteins and are hypothesized to be statistically more powerful and generalizable than hand crafted sequence features.
We further fine-tune these representations by training 2 simple neural networks (single layer MLP with ReLu activation) using an additional objective of predicting human PPI in addition to the main task. We use Logistic Regression networks to predict likelihood of having interaction between virushuman proteins or human-human proteins. The two networks' parameters are not shared among tasks allowing them to extract more task-specific representation.
The rationale behind using human PPI task is that viruses have been shown to mimic and compete with human proteins in their binding and interaction patterns with other human proteins (Mei & Zhang, 2020). Therefore, we believe that the patterns learned from the human interactome (or human PPI network) should be a rich source of knowledge to guide our virus-human PPI task and further helps to regularize our model.
Let Θ, Φ denote the set of learnable parameters corresponding to representation tuning components, i.e., the Multilayer Perceptrons (MLP) corresponding to the virus and human proteins, respectively. Let W 1 , W 2 denote the two learnable weight matrices (parameters) for the logistic regression modules for the virus-human and human-human PPI prediction tasks. We use V H, and HH to denote the training set of virus-human, human-human PPI, correspondingly. We use binary cross entropy loss for virus-human PPI predictions as given below where variables z vh is the corresponding binary target variable and y vh is the predicted probability of virus-human PPI or the output of the Logistic regression (LR) module. The input to the LR module is the element wise product of fine-tuned representations (output of the MLP) of virus and human protein.
For human PPI, the target variables (z hh ) are the normalized confidence scores which can be interpreted as the probability of observing an interaction. We use binary cross entropy loss as below where y hh is the element wise product of fine-tuned representations (output of the second MLP ) of human and human protein.
We use a linear combination of the two loss functions to train our model, i.e., L = L 1 + α · L 2 , where α is the human PPI weight factor. We set it to 10 −3 in our experiments.

EXPERIMENTAL EVALUATION
We compare our method with following six baseline methods and two simper variants of our model.
(1) GENERALIZED (Zhou et al., 2018): It is a generalized SVM model trained on hand crafted features extracted from protein sequence for the novel virus-human PPI task.
(2) HYBRID (Deng et al., 2020): It is a complex deep model with convolutional and LSTM layers for extracting latent representation of virus and human proteins from their input sequence features and is trained using L1 regularized Logistic regression.
(3) DOC2VEC (Yang et al., 2020): It employs the doc2vec (Le & Mikolov, 2014) approach to generate protein embeddings from the corpus of protein sequences. A random forest model is then trained for the PPI prediction.
(4) MOTIFTRANSFORMER (Lanchantin et al., 2020): It first generates protein embeddings using supervised protein structure and function prediction tasks. Those embeddings were later passed as input to a an order-independent classifier to do the PPI prediction task. (7) 2 simpler variants of MTT: Towards ablation study we evaluate two simpler variants: (i) SIN-GLETASK TRANSFER (STT), which is trained on a single objective of predicting pathogen-human PPI and (ii) NAIVE BASELINE, which is a Logistic regression model using concatenated human and pathogen protein UNIREP representations as input.

BENCHMARK DATASETS AND RESULTS
We evaluate our approach on 7 benchmark datasets. As several of our competitors do not release their code, we use the reported performance scores (using the same evaluation metrics) in the original papers giving them full advantage. Besides, as many of the methods use hand crafted features which might not be available for other benchmark datasets not evaluated in their original papers. Detailed data statistics can be found in the Appendix A.  Table 2: Comparison on datasets with rich feature information. SN, SP, ACC refer to Sensitivity, Specificity, and Accuracy, respectively. "-" denotes that the corresponding score is not reported in the original paper.
Additional results on novel bacteria-human PPI prediciton. We further demonstrate our model effectiveness on the novel bacteria-human PPI prediction task. We compare our method with Denovo on the three datasets for three human bacteria: BACILLUS ANTHRACIS (B1), YERSINIA PESTIS (B2), and FRANCISELLA TULARENSIS (B3), obtained from (Eid et al., 2016). The results are shown in Table 3. MTT clearly outperforms the baseline method (Denovo  Table 3: Comparison for the novel bacteria-human PPI prediction task. SN, SP, ACC refer to Sensitivity, Specificity, and Accuracy, respectively.

DISCUSSION AND FUTURE WORK
Our methods shows superior performance on a wide range of tested datasets. Note that this is despite the fact that each of our baselines have been proposed to exploit certain specific kind of information which was in the first place used to construct the dataset. MTT also outperforms its simpler variants developed with single task objective. Note that our naive baseline which directly trains a logistic regression classifier with pretrained embeddings already outperforms several methods. This points to the superiority of these representations as compared to hand-crafted features. As future work We will enhance our multi task approach by incorporating more domain information as well as exploiting more sophisticated multi task model architectures.  We use the benchmark datasets for human H1N1 and human Ebola viruses as released by (Zhou et al., 2018). The dataset is prepared as follows and is suitable for testing predictions for novel viruses.
The known PPIs between virus and human were retrieved from four databases: APID, IntAct, Metha, and UniProt. In total, there are 11,491 known PPIs between 246 viruses and human. The training data for the human-H1N1 dataset includes PPIs between human and all viruses except H1N1. Similarly, the training data for the human-Ebola dataset includes PPIs between human and all viruses except Ebola. The testing data for the human-H1N1 dataset contains PPIs between human and 11 H1N1 viral proteins. Likewise, the testing data for the human-Ebola dataset contains PPIs between human and 3 Ebola proteins.
The statistics for those datasets are presented in Table 4 The data was first collected from HPIDB (Ammari et al., 2016). B1 belongs to a bacterial phylum different from that of B2 and B3, while B2 and B3 share the same class but differ in their taxonomic order. B1 has 3057 PPIs, B2 has 4020, and B3 has 1346 known PPIs. A sequence-dissimilaritybased negative sampling method was employed to generate negative samples. For each bacteria protein, ten negative samples were generated randomly. Each of the bacteria was then set aside for testing, while the interactions from the other two bacteria were used for training. In the end, we have three datasets. For simplicity, we use the name of the bacteria in the testing set as the name of the dataset. The statistics for those three datasets are presented in table 5.