For ANAT 3.0, interactions from the IntAct [7] database were merged with the BioGRID [8] interactions. The experimental detection method of an interaction is used as a feature when scoring the final base network in ANAT, but for interactions that appeared in both databases the reported methods were not always concordant. Each such interaction was therefore represented by the lowest common ancestor of its experimental detection methods in the PSI-MI ontology. A total of 109,261 new IntAct interactions (67,864 for human and 41,397 for yeast) were added to the ANAT base PPI network. In addition, we integrated KEGG human and yeast signaling pathways into the ANAT base network by translating the pathways from the KEGG format described in [6] to the ANAT format, using KEGG proteins as nodes and protein–protein interactions as edges; small signaling metabolites such as PIP3 were disregarded, and KEGG protein groups were decomposed into fully connected subnetworks. All KEGG edges were given a uniform confidence score of 0.6, following the rules in [2] for assigning fixed confidence scores to non-experimental data, and the results were merged with the background network of ANAT, yielding 2,301 new H. sapiens interactions and 1,003 new S. cerevisiae interactions.
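The reconciliation of discordant detection methods can be illustrated with a minimal sketch. It assumes that every term has a single parent, and it uses a small hand-written hierarchy whose term names and links are illustrative only (they are not parsed from the PSI-MI ontology file); `PARENT`, `ancestors` and `lowest_common_ancestor` are hypothetical names introduced here.

```python
# Toy child -> parent map loosely modelled on detection-method terms.
# The links below are illustrative assumptions, not read from PSI-MI itself.
PARENT = {
    "experimental interaction detection": None,
    "biochemical": "experimental interaction detection",
    "affinity chromatography": "biochemical",
    "coimmunoprecipitation": "affinity chromatography",
    "pull down": "affinity chromatography",
    "protein complementation assay": "experimental interaction detection",
    "two hybrid": "protein complementation assay",
}


def ancestors(term):
    """Return the path from `term` up to the root, starting with the term itself."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def lowest_common_ancestor(a, b):
    """Return the deepest term that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for term in ancestors(b):  # walks upward, so the first hit is the lowest one
        if term in seen:
            return term
    raise ValueError("terms share no ancestor")


# Two databases report different methods for the same interaction.
merged_method = lowest_common_ancestor("coimmunoprecipitation", "pull down")
```

In this toy hierarchy the two closely related methods collapse to their shared parent, whereas two unrelated methods would collapse to the root term. A real ontology may allow several parents per term, in which case the upward walk would need to operate on a directed acyclic graph.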
The main addition to ANAT's anchored reconstruction algorithm is a machine-learning layer that evaluates the candidate proteins predicted by the algorithm and scores them according to their likelihood of appearing in the true pathway being sought. To this end, ANAT 3.0 exploits known signaling pathways from KEGG. The nodes (representing KEGG pathway proteins) of the 37 H. sapiens and 20 S. cerevisiae signaling pathways were concatenated into a training set, and 5 features were obtained for each node as follows. Anchors and terminals were extracted separately for each pathway by taking the KEGG pathway nodes with no inward edges and no outward edges, respectively. These were used to run ANAT 2.0 three times for every pathway, once for each value of α ∈ {0, 0.25, 0.5}. The output of a single ANAT run is a list of nodes, each with a confidence value calculated as the percentage of different solutions that contain the node. The first three features of ANAT 3.0's machine-learning layer are the confidence values of the given protein for the three α values.
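The extraction of anchors and terminals and the per-node confidence value can be sketched as follows. The pathway, the node names and the list of solutions are hypothetical toy data, and the function names are introduced here for illustration; the confidence is returned as a fraction between 0 and 1 rather than as a percentage.

```python
def anchors_and_terminals(edges):
    """Split pathway nodes by degree: no inward edges -> anchor, no outward edges -> terminal."""
    nodes = {node for edge in edges for node in edge}
    has_incoming = {dst for _, dst in edges}
    has_outgoing = {src for src, _ in edges}
    return nodes - has_incoming, nodes - has_outgoing


def node_confidence(solutions):
    """Fraction of solutions (each given as a set of nodes) that contain each node."""
    counts = {}
    for solution in solutions:
        for node in solution:
            counts[node] = counts.get(node, 0) + 1
    return {node: count / len(solutions) for node, count in counts.items()}


# Hypothetical directed toy pathway: R -> A -> TF and R -> B -> TF.
pathway = [("R", "A"), ("A", "TF"), ("R", "B"), ("B", "TF")]
anchors, terminals = anchors_and_terminals(pathway)

# Hypothetical alternative solutions returned by one reconstruction run.
solutions = [{"R", "A", "TF"}, {"R", "B", "TF"}, {"R", "A", "TF"}, {"R", "A", "B", "TF"}]
confidence = node_confidence(solutions)
```

Repeating the second step for each of the three α values would give the three confidence features of a node.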
The next two features represent the proximity of the node to the anchors and to the terminals, respectively, evaluated using a network propagation calculation [9]. For every pathway, two network propagations were run, one initialized from the anchors and one from the terminals, by setting the starting nodes to 1/(number of starting nodes) and all other nodes in the ANAT network to 0. This yielded two network propagation coefficients for every node in every pathway, which were used as two additional features. A node that is present in more than one KEGG pathway appears in the training set once per pathway, each time as a different feature vector. Labels are then assigned to each sample by comparing it to the original KEGG pathways. Finally, any KEGG pathway nodes not output by ANAT 2.0 are added as all-zero feature vectors with label = 1.
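The initialization described above can be made concrete with a small sketch. Only the starting values (1/(number of starting nodes) for the starting nodes, 0 elsewhere) are taken from the text; the update rule, the degree normalization, the smoothing parameter `alpha` and the iteration count are assumptions chosen for illustration and are not necessarily those of [9]. The graph is a hypothetical four-node chain.

```python
def propagate(adjacency, seeds, alpha=0.5, iterations=50):
    """Iterate F <- alpha * W F + (1 - alpha) * Y on an undirected graph.

    W is the adjacency matrix normalised by sqrt(deg(u) * deg(v)). The update
    rule, alpha and the iteration count are assumptions made for illustration.
    """
    seeds = set(seeds)
    # Starting nodes get 1 / (number of starting nodes); all other nodes get 0.
    prior = {node: 0.0 for node in adjacency}
    for seed in seeds:
        prior[seed] = 1.0 / len(seeds)
    degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
    score = dict(prior)
    for _ in range(iterations):
        score = {
            node: alpha * sum(
                score[other] / (degree[node] * degree[other]) ** 0.5
                for other in adjacency[node]
            ) + (1 - alpha) * prior[node]
            for node in adjacency
        }
    return score


# Hypothetical undirected chain a - b - c - d with a single starting node.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
from_anchor = propagate(graph, seeds=["a"])
```

Under these assumptions the score decays with distance from the starting node, which is the sense in which it measures proximity. Running the function once from the anchors and once from the terminals would give the two propagation features of a node.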
All features were normalized to have mean 0 and variance 1. All node vectors were concatenated to form the input matrix, which was then fed to a random forest classifier from Python 3's scikit-learn (sklearn) module with default parameters.
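A dependency-free sketch of the normalization step is given here. The feature rows are hypothetical (three confidence values followed by two propagation coefficients per node), and the use of the population variance and the handling of constant columns are assumptions. The classifier itself is omitted; with scikit-learn, the rescaled matrix and the labels would be passed to `RandomForestClassifier`.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Rescale every feature column to mean 0 and variance 1 (population variance)."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    # A constant column has zero spread; dividing by 1 leaves it at 0 everywhere.
    stds = [pstdev(column) or 1.0 for column in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical feature rows: three confidence values, then two propagation scores.
features = [
    [1.00, 0.75, 0.50, 0.20, 0.01],
    [0.50, 0.50, 0.25, 0.05, 0.10],
    [0.00, 0.25, 0.00, 0.02, 0.30],
    [0.00, 0.00, 0.00, 0.00, 0.00],  # all-zero vector for a node missed by the run
]
input_matrix = standardize_columns(features)
```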
The margin parameter in ANAT 2.0 controls how far a solution may deviate from the optimum and still be included (a margin of 1.2 includes solutions that deviate by up to 20% from the optimum) [2]. The higher the margin, the larger the final output network. To tune the margin parameter, the aforementioned pipeline was run 6 times, once for each margin value m ∈ {1, 1.2, 1.4, 1.6, 1.8, 2}, and the areas under the ROC and precision-recall curves (AUC) were calculated to select the best model.
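The effect of the margin can be shown with a short sketch. It assumes a minimization objective in which each candidate solution has a positive cost and a solution is kept when its cost is at most m times the optimal cost; the solution names and costs are hypothetical.

```python
def solutions_within_margin(costs, margin):
    """Keep solutions whose cost is at most `margin` times the optimal (minimum) cost."""
    best = min(costs.values())
    return {name for name, cost in costs.items() if cost <= margin * best}


# Hypothetical costs of four candidate solutions; s1 is the optimum.
costs = {"s1": 10.0, "s2": 11.5, "s3": 13.0, "s4": 19.0}

# Number of retained solutions for each of the six margin values that were tested.
margins = [1, 1.2, 1.4, 1.6, 1.8, 2]
retained = {m: len(solutions_within_margin(costs, m)) for m in margins}
```

With a margin of 1 only the optimum is retained, and the retained set can only grow as the margin increases, which is why larger margins lead to larger output networks.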
The resulting machine-learning framework is then applied to the features generated for any input set of anchors and terminals, after normalizing them to zero mean and unit variance. The final output of ANAT 3.0 is a minimum spanning tree connecting the resulting nodes, each of which is assigned a confidence score.