Prediction of virus-host protein-protein interactions mediated by short linear motifs

Background Short linear motifs in host proteins can be mimicked by viruses to create protein-protein interactions that disable or control metabolic pathways. Because viral instances of host motif regular expressions can occur by chance, filtering methods are needed to identify functional linear motifs. We conduct a systematic comparison of linear motif filtering methods to develop a computational approach for predicting motif-mediated protein-protein interactions between human proteins and the human immunodeficiency virus 1 (HIV-1).

Results We implemented three filtering methods to obtain linear motif sets: 1) motifs conserved in viral proteins (C), 2) motifs located in disordered regions (D), and 3) motifs that are rare in a set of randomized viral sequences (R). We formed the unions and intersections of the sets C, D, and R and compared the resulting sets by the number of experimentally validated protein-protein interactions correctly inferred with them. The comparison uses HIV-1 sequences and interactions from the National Institute of Allergy and Infectious Diseases (NIAID). The number of correctly inferred interactions allows the interactions to be ranked by the sets used to deduce them: D∪R and C. The sets are ordered by descending probability of capturing functional interactions. For HIV-1, the sets C∪R, D∪R, and C∪D∪R infer all known interactions between HIV-1 and human proteins mediated by linear motifs. We found that the majority of conserved linear motifs in the virus are located in disordered regions.

Conclusion We have developed a method for predicting protein-protein interactions mediated by linear motifs between HIV-1 and human proteins. The method uses only protein sequences as input. The software can be extended to any other eukaryotic virus and host in order to find and rank candidate interactions. In future work we will use it to explore possible viral attack mechanisms based on linear motif mimicry.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1570-7) contains supplementary material, which is available to authorized users.


S1 HIV-1 Sequences
The number of sequences obtained from the NIAID database is given in Table S1 [1]. The number of randomized sequences generated to obtain the rare motif set (R) is 1000 times the values presented in Table S1. The randomized sequences are available on request.

S2 Disordered regions
The disordered regions were computed using IUPred [2] with the window addition explained in the methods section [3]. The results are in text files named with the pattern regions-proteinName, where proteinName is one of the protein names in Table S1. The files have the following layout:

S3 Short Linear Motifs
The computed SLiM sets were written to CSV files with names like ca motifs D.csv, with the HIV-1 protein name as prefix and the set name as suffix. The protein name is one of the names in Table S1; the suffixes used are listed in Table S2. Unions such as C ∪ D are represented in filenames by suffixes such as uCD, and intersections such as D ∩ R by suffixes such as iDR.
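The union and intersection naming described above can be illustrated with a minimal Python sketch. The function name combine_sets and the motif identifiers are hypothetical and chosen only for illustration; the sketch assumes each base set (C, D, R) is available as a set of motif identifiers and is not the authors' implementation.

```python
from itertools import combinations


def combine_sets(base):
    """Return every union and intersection of two or more base sets.

    Keys follow the suffix convention described in the text:
    'u' + set names for unions (e.g. 'uCD'),
    'i' + set names for intersections (e.g. 'iDR').
    """
    combined = {}
    names = sorted(base)
    for size in range(2, len(names) + 1):
        for group in combinations(names, size):
            label = "".join(group)
            members = [base[name] for name in group]
            combined["u" + label] = set.union(*members)
            combined["i" + label] = set.intersection(*members)
    return combined


# Toy motif identifiers, invented for illustration only.
base_sets = {
    "C": {"motif_1", "motif_2"},
    "D": {"motif_2", "motif_3"},
    "R": {"motif_2", "motif_4"},
}
combined_sets = combine_sets(base_sets)
```

With three base sets this yields eight combined sets: uCD, uCR, uDR, uCDR and the corresponding intersections.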

S4.1 Human-HIV1 interactions in the NIAID database
The number of predicted interactions per set and HIV-1 protein is presented in Tables S4 and S5. Table S4 presents the interactions validated in the NIAID database, and Table S5 presents the candidate interactions, that is, those not present in the NIAID database [8]. These interactions are provided in [Additional file 1] as CSV files named with the pattern hivProtein-interactions-suffix.csv, where hivProtein is one of the names in Table S1 and suffix is one of the names in Table S2. Each file lists the human proteins by their UniProt ID.

S4.2 Human-HIV1 interactions in the LMPID database
In Table S6 we report the SLiM-mediated interactions between HIV-1 and humans that were extracted from the LMPID database [10].

S5 Sensitivity and specificity
Although there is no gold-standard dataset for VHPPIs, we use the NIAID database to estimate the sensitivity of the SLiM-based predictions. We iterate through all possible interactions between human and HIV-1 proteins to compute the true positives, true negatives, false positives and false negatives. Tables S7 and S8 report the sensitivity and specificity. The values are broken down by HIV-1 protein and by the SLiM set used to infer the interactions.
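The counting procedure can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that any protein pair absent from the validated set is treated as a negative, and the function name and the protein identifiers (v1, h1, ...) are hypothetical.

```python
from itertools import product


def sensitivity_specificity(predicted, validated, viral_proteins, human_proteins):
    """Compute sensitivity and specificity over all viral-human protein pairs.

    predicted and validated are sets of (viral, human) tuples.
    Assumption: pairs missing from `validated` count as true non-interactions.
    """
    tp = fp = tn = fn = 0
    for pair in product(viral_proteins, human_proteins):
        is_predicted = pair in predicted
        is_validated = pair in validated
        if is_predicted and is_validated:
            tp += 1
        elif is_predicted:
            fp += 1
        elif is_validated:
            fn += 1
        else:
            tn += 1
    # Guard against empty denominators.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data, invented for illustration only.
viral = ["v1"]
human = ["h1", "h2", "h3", "h4"]
predicted_pairs = {("v1", "h1"), ("v1", "h2")}
validated_pairs = {("v1", "h1"), ("v1", "h3")}
sens, spec = sensitivity_specificity(predicted_pairs, validated_pairs, viral, human)
```

Because no gold-standard negative set exists, the specificity obtained this way is only as reliable as the assumption that unrecorded pairs do not interact.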

S6.1 RefSeqs to Uniprot Ids
To map the NIAID human-HIV-1 interactions given as RefSeq identifiers to the Uniprot Ids used in the ELM database, we use the files on the Uniprot FTP site, which contain the correspondence between Uniprot Ids and the identifiers of other databases such as RefSeq.
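A mapping of this kind can be sketched in Python. The text does not name the exact file used, so the sketch assumes a three-column, tab-separated layout (UniProt accession, identifier type, external identifier), which is how UniProt's idmapping.dat files are organized. The function name and the sample accessions are placeholders for illustration.

```python
import io
from collections import defaultdict


def refseq_to_uniprot(handle):
    """Build a RefSeq -> set of UniProt accessions mapping.

    Assumes tab-separated lines: accession, identifier type, external identifier.
    Lines with a different number of columns are skipped.
    """
    mapping = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        accession, id_type, external_id = fields
        if id_type == "RefSeq":
            mapping[external_id].add(accession)
    return dict(mapping)


# Placeholder records, invented for illustration only.
sample = io.StringIO(
    "P00001\tRefSeq\tNP_000001.1\n"
    "P00001\tGeneID\t1234\n"
    "P00002\tRefSeq\tNP_000002.1\n"
)
refseq_map = refseq_to_uniprot(sample)
```

A set is used as the value because one RefSeq identifier can correspond to more than one UniProt accession.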

S6.2 Domains to Proteins
To map the domains given in the ELM database to the proteins containing them, we use the Pfam mapping from the Pfam FTP site: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam28.0/proteomes/9606.tsv.gz.
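The domain-to-protein lookup can be sketched as below. The column positions are an assumption about the proteome TSV layout (protein accession in the first column, Pfam accession in the sixth, comment lines starting with #) and should be checked against the file header before use. The function name and the sample rows are hypothetical.

```python
import io
from collections import defaultdict


def domain_to_proteins(handle, seq_col=0, domain_col=5):
    """Build a Pfam accession -> set of protein accessions mapping.

    seq_col and domain_col are assumed column positions; adjust them
    if the file header indicates a different layout.
    """
    mapping = defaultdict(set)
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(seq_col, domain_col):
            continue  # skip rows that are too short to parse
        mapping[fields[domain_col]].add(fields[seq_col])
    return dict(mapping)


# Placeholder rows, invented for illustration only.
sample_rows = io.StringIO(
    "#<seq id>\t<alignment start>\t<alignment end>\t<envelope start>\t<envelope end>\t<hmm acc>\n"
    "PROT_A\t10\t50\t8\t52\tPF00001\n"
    "PROT_B\t5\t40\t3\t42\tPF00001\n"
    "PROT_B\t60\t90\t58\t92\tPF00002\n"
)
pfam_map = domain_to_proteins(sample_rows)
```

For the compressed file itself, the handle could be opened with gzip.open(path, "rt") from the Python standard library and passed to the same function.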