ANAT 3.0: a framework for elucidating functional protein subnetworks using graph-theoretic and machine learning approaches

Background ANAT is a Cytoscape plugin for the inference of functional protein–protein interaction networks in yeast and human. It is a flexible graphical tool for scientists to explore and elucidate the protein–protein interaction pathways of a process under study. Results Here we present ANAT3.0, which comes with updated PPI network databases of 544,455 (human) and 155,504 (yeast) interactions, and a new machine-learning layer for refined network elucidation. Together they improve network reconstruction to more than twofold increase in the quality of reconstructing known signaling pathways from KEGG. Conclusions ANAT3.0 includes improved network reconstruction algorithms and more comprehensive protein–protein interaction networks than previous versions. ANAT is available for download on the Cytoscape Appstore and at https://www.cs.tau.ac.il/~bnet/ANAT/.

ANAT's main engine is an algorithm to reconstruct subnetworks that link protein nodes of interests (anchors) with target proteins (terminals). It relies on an assembled database of PPI networks, each of which can be viewed as a weighted graph G = (E, V), where E are the edges, representing PPIs, each of which is assigned with a reliability score, and V are the nodes, representing proteins. It optimizes the reconstruction of a connected subgraph H(V H ,E H ) connecting anchors with terminals by simultaneously optimizing the global weight of the resulting network (sum of weights of the edges in E H ) and the local length of the shortest path connecting each terminal to one of the anchors. The algorithm is governed by the 0 < α < 0.5 parameter, which balances the local and global criteria. A value closer to 0 emphasizes the local criterion while a value closer to 0.5 emphasizes the global criterion, with α = 0.25, balancing both [2,5].
Here we present ANAT3.0, an updated version of ANAT that refines the anchored subnetwork reconstruction engine using machine learning techniques. In addition, ANAT3.0 comes with updated PPI networks that integrate KEGG signaling pathways [6] and IntAct interactions [7].

Implementation
ANAT is implemented as a Java-based plugin for Cytoscape 3.8 and above. The ANAT plugin can be downloaded at http:// www. cs. tau. ac. il/ ~bnet/ ANAT/, together with installation instructions, documentation, the user manual and sample inputs and outputs. It is also available on the Cytoscape App Store.

Materials and methods
For ANAT 3.0, IntAct [7] database interactions were merged with the BioGRID [8] interactions. Since the experimental detection methods (which are used as a feature to assign scores to the final base network in ANAT) for interactions that appeared in both databases were not always concordant, the lowest common ancestral experimental detection method from the PSI-MI ontology was used to represent each such interaction. A total of 109,261 new IntAct interactions (67,864 for human and 41,397 for yeast) were added to the ANAT base PPI network. In addition, we integrated KEGG human and yeast signaling pathways into ANAT base network, by translating the pathways from the KEGG format described in [6] to the ANAT format, by using KEGG proteins as nodes (disregarding small signaling metabolites like PIP3 and decomposing KEGG protein groups into fully connected subnetworks) and protein-protein interactions as edges. All KEGG edges were given a uniform confidence score of 0.6, following consistent rules from [2] of fixed confidence scores for non-experimental data, and the results were merged together with the background network of ANAT, resulting in 2301 new H. sapiens interactions and 1003 new S. cerevisiae interactions.
The main addition to ANAT's anchored reconstruction algorithm is a machine-learning layer that evaluates candidate proteins predicted by the algorithm and scores them according to their likelihood to appear in the true pathway being sought. To this end, ANAT3.0 exploits known signaling pathways from KEGG. All the nodes (representing KEGG pathway proteins) for each of the 37 (H. sapiens) and 20 (S. cerevisiae) signaling pathways were concatenated together in a training set and, subsequently, 5 features were obtained for each node as follows: anchors and terminals were extracted individually for each individual pathway by taking KEGG pathway nodes with no inward or no outward edges, respectively. These were used to run ANAT2.0 three times, with three different values of α = {0, 0.25, 0.5}, for every pathway. The output of a single ANAT run contains a list of nodes, each one of them coming with a confidence value, calculated as the percentage of different solutions containing the given node. The first three features of ANAT3.0's machine learning layer are the confidence values for the given protein, for the three alpha values.
The next set of features represents the proximity of the node to the anchors and terminals, respectively, evaluated using a network propagation calculation [9]. Two network propagations were initialized from anchors and terminals of all pathways, by setting all nodes in the ANAT network to an initial value of 0, and starting nodes with value 1/ (number of starting nodes). This yielded two network propagation coefficients for every node in every pathway, which were used as two additional features. If one node is present in more than one KEGG pathway, it will be present twice in the training set, as two different feature vectors. Labels are then assigned to each sample by comparing it to the original KEGG pathways. Finally, eventual KEGG pathway nodes not output by ANAT2.0 are added as an all 0 feature vector with label = 1.
All features were normalized to have mean 0 and variance 1. All nodes vectors were concatenated together to form the input matrix, which was then fed to a random forest classifier using python3's sklearn module with default parameters.
The margin parameter in ANAT2.0 controls the percentage of deviation from the optimum solution to include (a margin of 1.2 will include solutions of 20% deviation from the optimum) [2]. The higher the margin, the bigger the final output network. The aforementioned pipeline was run 6 times, to tune the margin parameter, with 6 different margin values m = {1, 1.2, 1.4, 1.6, 1.8, 2}, and AUC from ROCs and Precision Recall Curves were calculated to select the best model.
The resulting machine learning framework is then applied to a set of features generated for any input of anchors and terminals, and normalized to a standard distribution. The final output of ANAT3.0 is a minimum spanning tree connecting the resulting nodes, each of which is assigned with a confidence score. The machine learning framework can be applied to any given input by generating the five features, normalizing them and applying the trained classifier. The final output of ANAT3.0 is a minimum spanning tree connecting the resulting nodes, each of which is assigned with a confidence score. This yields a refined network structure where nodes are assigned with confidence values using the trained classifier.

Usage
From the ANAT sidebar on Cytoscape, select "new subnetwork" → "anchored network", and select the organism of interest (H. sapiens or S. cerevisiae), and click "next". In the next screen, input the names of anchors and terminals, then click on "advanced", select "ML-based", and click "finish" (Fig. 2).

Availability and requirements
Project name: ANAT 3.0. Project home page: http:// www. cs. tau. ac. il/ ~bnet/ ANAT/. Operating system(s): Operates on all major operating systems. Programming language: Java, Python3. Other requirements: Cytoscape 3.8 and above. License: Free for academic use. Any restrictions to use by non-academics: License needed.