We first considered a number of node proximity measures as the basic element of link prediction. In these experiments, a random-walk based proximity measure was found to perform best. This result contrasts strongly with the results of a recent study by Potamias et al. that compared similar proximity measures on probabilistic graphs. In that study, a method based on expected shortest path distance performed best, followed by probability of best path, network reliability, and finally random walk, with clear margins. The most striking difference is that in their link prediction experiments, random walk performed only slightly better than a random guess. We hypothesize that the main reason is the difference in the edge weighting schemes, which may be more suitable for some methods than for others. One difference in weighting is that the graphs of Potamias et al. contain more edges with probabilities close to 1.0, whereas in our scheme the types of edges and degrees of nodes have a larger effect on edge weights, and the resulting distribution of edge weights is more uniform.
Our empirical results show that the Biomine approach has strong statistical prediction power (see, e.g., Figure 5). However, the prediction accuracy is likely not sufficient for predicting arbitrary links within Biomine, as there are relatively few true positives among a huge number of potential links. Consider the current statistics of Biomine: the number of node pairs is of the order of 10^11, while the current number of edges is of the order of 10^7. For the sake of example, assume that the number of true positive links is 10 times larger than the number of current edges, i.e., 10^8. The fraction of true positives among all potential links would be 10^8/10^11, i.e., one positive instance for every 1,000 negative ones.
Now, assume a true positive rate of 0.1 and a false positive rate of 0.0001, similar to the experimental results in Figure 5. A true positive link is then 1,000 times more likely to be classified as positive than a true negative one. Incidentally, this ratio is identical to the ratio of negative to positive instances assumed above. In other words, the predicted positives would be expected to contain roughly equal numbers of true and false positives. If one produced predictions for the whole of Biomine, these parameters would yield about 10^7 true positives and about 10^7 false positives, clearly too many for any practical use.
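The arithmetic behind this argument can be checked with a short calculation. The following is a minimal sketch using only the order-of-magnitude figures assumed in the text; the function name and its parameters are illustrative and are not part of Biomine.

```python
def expected_prediction_counts(n_pairs, n_true_links, tpr, fpr):
    """Return expected (true positives, false positives, precision).

    All inputs are the illustrative order-of-magnitude assumptions from
    the text, not measured properties of Biomine.
    """
    n_negatives = n_pairs - n_true_links
    true_pos = tpr * n_true_links    # true links predicted as links
    false_pos = fpr * n_negatives    # non-links predicted as links
    precision = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, precision


tp, fp, precision = expected_prediction_counts(
    n_pairs=1e11,       # node pairs in the graph
    n_true_links=1e8,   # assumed number of true links
    tpr=0.1,            # assumed true positive rate
    fpr=1e-4,           # assumed false positive rate
)
print(f"true positives ~ {tp:.3g}, false positives ~ {fp:.3g}, "
      f"precision ~ {precision:.2f}")
```

Under these assumptions, about half of all predicted links would be false, which is why the choice of an enriched candidate set matters far more than small changes in the classifier.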
In practical applications, such as analysis of protein interaction measurements or disease gene ranking, the set of potential links to predict is limited to a predefined set of candidate links that is already enriched with positive instances. This also means that although the proximity measure itself does not take into account the type of links to be predicted, the set of candidates is already chosen in such a way that the edge type is implicitly defined.
We next discuss our two test settings, protein interaction prediction and gene prioritization, and their results.
The protein interaction prediction experiments were carried out with respect to new links introduced to the source databases between 2007 and 2010, with the good results described above. An interesting question is whether, and to what extent, the methods are biased toward making predictions in active areas of research. Existing information in the source databases reflects past and current research topics and hypotheses, and these may well correlate with future research and discoveries. A topic for future study is to investigate whether certain types of links are easier to predict than others.
Use of Biomine in disease gene ranking enables identifying, from among a number of putative candidate genes, the ones that appear most plausible based on the data contained in the source databases. This approach is expected to work best in cases where several functionally related genes contribute to the disease, and knowledge about functions of the genes is already present in the source databases. Obviously, less studied genes with little or no functional annotations cannot be identified in this way.
We considered two versions of the gene ranking problem: one where genes are ranked based on their proximities to an already known reference set of disease-related genes (supervised setting), and another where ranking is based on the mutual proximities of the putative genes (unsupervised setting). Both formulations have already been considered in previous work [16, 18]. The methods of producing data and computing proximities are different, however. Kohler et al. only use a single type of edge (protein associations) while Franke et al. collapse information from several data sources into a single type of edge. Neither of these approaches considers edge weights. In contrast, we retain the original edge types and construct a heterogeneous, weighted network. An additional difference is that the approach of Franke et al. is directly aimed at linkage studies where genes within continuous susceptibility intervals are examined, whereas we consider cases where the genes may be spread over the whole genome.
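The difference between the two settings can be illustrated with a small sketch. It assumes that some symmetric pairwise proximity function is available and, purely for illustration, aggregates proximities by their mean. This mean aggregation is a simplification of our own choosing and is not the scoring used in the experiments. All names and the toy proximity values below are hypothetical.

```python
from statistics import mean


def rank_supervised(candidates, reference, proximity):
    """Rank candidates by mean proximity to the known reference genes."""
    score = {g: mean(proximity(g, r) for r in reference) for g in candidates}
    return sorted(candidates, key=score.get, reverse=True)


def rank_unsupervised(candidates, proximity):
    """Rank candidates by mean proximity to the other candidates.

    Requires at least two candidates, since each gene is compared
    with the remaining ones.
    """
    score = {
        g: mean(proximity(g, h) for h in candidates if h != g)
        for g in candidates
    }
    return sorted(candidates, key=score.get, reverse=True)


# Hypothetical toy proximities between candidate genes A, B, C
# and reference genes R1, R2 (values invented for illustration).
TOY = {
    frozenset(("A", "R1")): 0.8, frozenset(("A", "R2")): 0.6,
    frozenset(("B", "R1")): 0.1, frozenset(("B", "R2")): 0.3,
    frozenset(("C", "R1")): 0.5, frozenset(("C", "R2")): 0.5,
    frozenset(("A", "B")): 0.2, frozenset(("A", "C")): 0.3,
    frozenset(("B", "C")): 0.6,
}


def toy_proximity(a, b):
    """Look up a symmetric proximity value from the toy table."""
    return TOY[frozenset((a, b))]


candidates = ["A", "B", "C"]
reference = ["R1", "R2"]
supervised_order = rank_supervised(candidates, reference, toy_proximity)
unsupervised_order = rank_unsupervised(candidates, toy_proximity)
print(supervised_order, unsupervised_order)
```

In this toy example the two settings produce different orderings, because a gene that is close to the reference set need not be close to the other candidates.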
Powerful methods for disease gene prediction have been proposed by Hwang and Kuang and by Vanunu et al. They assume the availability of three specific types of links (similarities between diseases, links from diseases to proteins, and protein interactions), which are used in specific ways. In contrast, we do not make such assumptions about edge types: the link prediction methods used in this study are based purely on node proximities. While the approaches of Hwang and Kuang and of Vanunu et al. can take better advantage of specific information, the methods used in this paper are more flexible and can also utilize unanticipated types of links.
Unlike the methods considered here, Wang et al. and Chasman do not use graph data, but instead perform joint testing of disease association within predefined sets of related genes (pathways and functional categories). In contrast, the methods applied in this paper are not limited to detecting association only within such predefined sets.
To sum up, the following combination of factors distinguishes our work from previous work on utilizing protein networks for disease gene prioritization:
use of weighted edges, with weights based on combining information from the type of edges, node degrees and weights in the original databases,
use of a heterogeneous graph, and
the novel single-cluster clustering formulation.
In this paper, the evaluation of Biomine has been carried out quantitatively, using numerical measures of prediction accuracy. Such measures are directly motivated by the prediction tasks considered in this study. An important, complementary application of Biomine is visualizing relationships between entities of the biological graph (cf. Figure 1), enabling the basis of predictions to be shown to the user for subjective analysis and verification. Finding connections previously unknown to the user may help understand biological mechanisms and produce new biological hypotheses.
Consider, for instance, the top ranking genes from a gene mapping study, or a gene that by some other evidence might be related to the phenotype under study. A subgraph that connects them can be used to show the concrete chains of annotations that link the genes to the phenotype. Such use of Biomine loosely resembles the use of a search engine: enter a number of query entities, and Biomine will search for chains and networks of entities that summarize the known relationships between the query entities. This search functionality is available on the Biomine web site, http://biomine.cs.helsinki.fi.