Efficient link prediction in the protein–protein interaction network using topological information in a generative adversarial network machine learning model

Background
The investigation of possible interactions between two proteins in intracellular signaling is an expensive and laborious wet-lab procedure; therefore, several in silico approaches have been implemented to narrow down the candidates for future experimental validation. Reformulating the problem in terms of network theory, the set of proteins can be represented as the nodes of a network and the interactions between them as the edges. The resulting protein–protein interaction (PPI) network enables the use of link prediction techniques to discover new probable connections. Here we therefore aimed to offer a novel approach to the link prediction task in PPI networks, utilizing a generative machine learning model.

Results
We created a tool that consists of two modules, the data processing framework and the machine learning model. For data processing, we used a modified breadth-first search algorithm to traverse the network and extract induced subgraphs, which served as image-like input data for our model. For machine learning, we used an image-to-image translation-inspired conditional generative adversarial network (cGAN) model with a Wasserstein distance-based loss and gradient penalty, which takes the combined representation from the data processing module as input and trains the generator to predict the probable unknown edges in the provided induced subgraphs. Our link prediction tool was evaluated on the protein–protein interaction networks of five different species from the STRING database by calculating the area under the receiver operating characteristic curve, the area under the precision-recall curve and the normalized discounted cumulative gain (AUROC, AUPRC and NDCG, respectively). Test runs yielded average results of AUROC = 0.915, AUPRC = 0.176 and NDCG = 0.763 across all investigated species.

Conclusion
We developed software for link prediction in PPI networks utilizing machine learning. The evaluation of our software serves as the first demonstration that a cGAN model, conditioned on raw topological features of the PPI network, is an applicable solution for the PPI prediction problem without requiring often unavailable molecular node attributes. The corresponding scripts are available at https://github.com/semmelweis-pharmacology/ppi_pred.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12859-022-04598-x.


Using embedding features for alternative implementations
As alternatives to the original model ("Adjacency only"), we investigated two further models: one that utilizes embedding features only ("Embedding only"), and one that concatenates the adjacency matrices with the corresponding embedding vectors ("Combined"). In both cases, the embedding is executed in the preprocessing module (Supplementary Fig. 1). We chose the popular node2vec algorithm, a natural language processing-inspired embedding method that reduces the representation space of nodes by learning low-dimensional features based on their neighborhoods in the graph [1]. We used the out-of-the-box, high-performance C++ implementation of the node2vec algorithm that is included in the libraries of the Stanford Network Analysis Platform (SNAP) [2]. Parametrization of the learning followed the values presented in Supplementary Table 1. The number of embedding dimensions in the output representation was set to match the number of nodes in the induced subgraphs, so that the embeddings form square matrices of the same size as the adjacency matrices and can be concatenated with them later on in the main machine learning part.

Figure 1: Preprocessing of the input network for the combined model. Schematic summary of the preprocessing module in the combined model, which takes in the provided protein-protein interaction (PPI) network and produces, in the form of adjacency lists, the downscaled networks that retain 90% of the edges of the original network (N100), giving N90, and 90% of those (thus 81%), giving N81. It also generates the induced subgraphs and the node embeddings for each. These representation files are created for the original network as well but are not required in the machine learning part, resulting in the listed 7 files to be fed into the conditional generative adversarial network (cGAN) model down the line.
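The shape constraint behind the "Combined" model can be illustrated with a minimal Python sketch. It assumes that "concatenation" means stacking the adjacency matrix and the embedding matrix as two channels of one image-like input; the text does not specify the axis, so this is only one plausible reading. The function name `stack_as_channels` and the toy values are hypothetical and are not taken from the published scripts.

```python
def stack_as_channels(adjacency, embedding):
    """Stack two n x n matrices (lists of lists) into a 2 x n x n input.

    Assumption: the adjacency matrix and the node embeddings are combined
    channel-wise. This only works because the embedding dimension equals
    the number of nodes n in the induced subgraph.
    """
    n = len(adjacency)
    for name, matrix in (("adjacency", adjacency), ("embedding", embedding)):
        if len(matrix) != n or any(len(row) != n for row in matrix):
            raise ValueError(f"{name} must be a {n} x {n} matrix")
    return [[list(row) for row in adjacency],
            [list(row) for row in embedding]]


# Toy induced subgraph with n = 4 nodes (a path A-B-C-D).
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
# Hypothetical 4-dimensional embedding, one row per node.
embedding = [[0.1 * (i + j) for j in range(4)] for i in range(4)]

combined = stack_as_channels(adjacency, embedding)
shape = (len(combined), len(combined[0]), len(combined[0][0]))
print(shape)  # (2, 4, 4)
```

With an embedding dimension different from the subgraph size, the function raises a `ValueError`, which is the mismatch that the choice of embedding dimensions avoids.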

Parameter                        Value
Number of dimensions             32
Length of the walk per source    80
Number of walks per source       10
Context size for optimization
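As a rough illustration of how these parameters could be passed to the SNAP node2vec executable, the following Python sketch assembles a command line. The flag names (`-i:`, `-o:`, `-d:`, `-l:`, `-r:`) follow the usage shown in the SNAP node2vec example and should be checked against the installed version. The binary path and the file names are hypothetical. The context size is left out because its value is not given in the table above, so the tool's own default would apply.

```python
def node2vec_command(binary, edge_list, output,
                     dimensions=32, walk_length=80, walks_per_source=10):
    """Build the argument list for a SNAP node2vec run.

    Defaults mirror the parameter table; the flag spelling is an assumption
    based on the SNAP node2vec example usage.
    """
    return [
        str(binary),
        f"-i:{edge_list}",         # input edge list
        f"-o:{output}",            # output embedding file
        f"-d:{dimensions}",        # number of dimensions
        f"-l:{walk_length}",       # length of the walk per source
        f"-r:{walks_per_source}",  # number of walks per source
    ]


# Hypothetical paths; the real file names in the published scripts may differ.
cmd = node2vec_command("./node2vec", "N90.edgelist", "N90.emb")
print(" ".join(cmd))
# ./node2vec -i:N90.edgelist -o:N90.emb -d:32 -l:80 -r:10

# To actually run the embedding, something like the following could be used:
# import subprocess
# subprocess.run(cmd, check=True)
```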

Gene ontology analysis of the predicted protein-protein interactions
Using the results summarized in Supplementary Table 2, we also analyzed the biological functions of the proteins in the interactions predicted by cGAN, node2vec and struc2vec. For each species, the node2vec and struc2vec predictions were filtered by confidence score with a cutoff value of >0.5, while from our cGAN results we selected the top 20% highest confidence scores produced by one prediction from N80 to N100, as that resembles the intended use of the model in practice. We generated the analysis inputs from the filtered positive edges, each represented by its associated proteins (nodes), as the edges cannot be interpreted as a unit in this approach. We used protein information files obtained from the STRING database to map preferred gene names to STRING protein identifiers. Overrepresentation analysis-based Gene Ontology (GO) enrichment analysis [5,6] with biological process ontologies was performed on the filtered predictions of node2vec, struc2vec and cGAN using the clusterProfiler R library [7], version 4.3.1, with the N100 network as the background for the analysis.
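The two filtering rules and the reduction of edges to nodes described above can be sketched in Python as follows. The edge tuples, protein names and function names are hypothetical. Rounding the top fraction up and breaking ties by sort order are assumptions, since the text does not state how these cases were handled.

```python
import math


def filter_by_cutoff(scored_edges, cutoff=0.5):
    """Keep edges whose confidence score is strictly above the cutoff."""
    return [(u, v, s) for u, v, s in scored_edges if s > cutoff]


def top_fraction(scored_edges, fraction=0.2):
    """Keep the highest-scoring fraction of edges (count rounded up)."""
    ranked = sorted(scored_edges, key=lambda edge: edge[2], reverse=True)
    keep = math.ceil(len(ranked) * fraction)
    return ranked[:keep]


def nodes_of(edges):
    """Collect the proteins (nodes) touched by the selected edges."""
    return {node for u, v, _ in edges for node in (u, v)}


# Toy predictions: (protein, protein, confidence score).
predictions = [
    ("A", "B", 0.91),
    ("A", "C", 0.40),
    ("B", "D", 0.75),
    ("C", "D", 0.55),
    ("D", "E", 0.10),
]

cutoff_edges = filter_by_cutoff(predictions)  # node2vec / struc2vec style
top_edges = top_fraction(predictions)         # cGAN style, top 20%

print(sorted(nodes_of(cutoff_edges)))  # ['A', 'B', 'C', 'D']
print(sorted(nodes_of(top_edges)))     # ['A', 'B']
```

The resulting node sets are what an enrichment tool would receive as the gene list, with the nodes of N100 as the background.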
In the overrepresentation analysis, FDR correction was used to adjust p-values for multiple testing, and the significance level was set to 0.05. We separately compared the GO term sets enriched in node2vec and struc2vec with the GO enrichment results of our method by calculating the semantic similarities between the GO term sets using the GOSemSim R library [8], maintained by the Bioconductor project [9]. In the case of Saccharomyces cerevisiae, the org.Sc.sgd.db Bioconductor annotation data package [10] had to be included due to difficulties with its OrgDb object. In the case of Sus scrofa, proportionally fewer genes were included in the enrichment sets than for the other species, which could be caused by poorer mapping between GO terms and gene identifiers. In a similar fashion, we also performed GO enrichment analysis with biological process, molecular function and cellular component ontologies on our results to map certain features of the newly predicted edges.
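For readers unfamiliar with overrepresentation analysis, the following Python sketch shows the usual formulation: a one-sided hypergeometric test per GO term, followed by an FDR adjustment. It is an illustration under stated assumptions and not the clusterProfiler implementation. The text says only "FDR correction", so the Benjamini-Hochberg procedure used here is an assumption. The term names and counts are invented toy values.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy setting: 20 background genes, 4 selected genes.
background_size, selected_size = 20, 4
# term -> (selected genes with the term, background genes with the term)
term_counts = {"term_a": (4, 5), "term_b": (2, 5), "term_c": (1, 10)}

terms = list(term_counts)
p_values = [
    hypergeom_upper_tail(k, background_size, K, selected_size)
    for k, K in term_counts.values()
]
adjusted = benjamini_hochberg(p_values)
significant = [t for t, q in zip(terms, adjusted) if q < 0.05]
print(significant)  # ['term_a']
```

Here the background corresponds to the nodes of the N100 network and the selected genes to the proteins of the filtered predicted edges.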
Following the intended use of the proposed cGAN model, for each species we pooled together the top 10% highest confidence interactions from all 10 folds of predictions that were generated during the link prediction from N90 to N100. We separated the true positive and false positive edges to generate the analysis inputs, which contain only the nodes connected by the filtered edges, as the edges cannot be interpreted as a unit in this approach. Using the corresponding N100 networks as the background for the analysis, we separately compared the GO term sets enriched in the true positive and false positive sets for each species and each ontology type (Supplementary Tables 3, 4) by calculating the semantic similarities between the GO term sets using the GOSemSim R library [8].
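The pooling and splitting step can be sketched in Python as below. This sketch assumes that the top fraction is taken within each fold before pooling, and that a predicted edge counts as a true positive when it is present in N100. Both are readings of the description above and are not confirmed details. All names and data are hypothetical, and the toy example uses a fraction of 0.5 only to keep the data small.

```python
import math


def canonical(u, v):
    """Order the endpoints so that an undirected edge has one representation."""
    return (u, v) if u <= v else (v, u)


def top_fraction(scored_edges, fraction):
    """Keep the highest-scoring fraction of edges (count rounded up)."""
    ranked = sorted(scored_edges, key=lambda edge: edge[2], reverse=True)
    return ranked[:math.ceil(len(ranked) * fraction)]


def pool_and_split(folds, reference_edges, fraction=0.1):
    """Pool the top predictions of every fold and split them by correctness."""
    reference = {canonical(u, v) for u, v in reference_edges}
    pooled = set()
    for fold in folds:
        for u, v, _ in top_fraction(fold, fraction):
            pooled.add(canonical(u, v))
    return pooled & reference, pooled - reference


def nodes_of(edges):
    """Collect the proteins (nodes) connected by the given edges."""
    return {node for edge in edges for node in edge}


# Two toy folds of scored predictions and a toy N100 edge list.
fold_1 = [("A", "B", 0.9), ("C", "D", 0.8), ("A", "E", 0.3), ("B", "E", 0.2)]
fold_2 = [("B", "A", 0.95), ("D", "E", 0.7), ("A", "C", 0.4), ("C", "E", 0.1)]
n100_edges = [("A", "B"), ("D", "E"), ("A", "C")]

true_pos, false_pos = pool_and_split([fold_1, fold_2], n100_edges, fraction=0.5)
print(sorted(true_pos))             # [('A', 'B'), ('D', 'E')]
print(sorted(false_pos))            # [('C', 'D')]
print(sorted(nodes_of(true_pos)))   # ['A', 'B', 'D', 'E']
print(sorted(nodes_of(false_pos)))  # ['C', 'D']
```

The two node sets would then be passed separately to the enrichment analysis.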