Hubs of knowledge: using the functional link structure in Biozon to mine for biologically significant entities

Shafer, Paul; Isganitis, Timothy; Yona, Golan

doi:10.1186/1471-2105-7-71

Methodology article
Open access
Published: 15 February 2006

BMC Bioinformatics volume 7, Article number: 71 (2006)

Abstract

Background

Existing biological databases support a variety of queries such as keyword or definition search. However, they do not provide any measure of relevance for the instances reported, and result sets are usually sorted arbitrarily.

Results

We describe a system that builds upon the complex infrastructure of the Biozon database and applies methods similar to those of Google to rank documents that match queries. We explore different prominence models and study the spectral properties of the corresponding data graphs. We evaluate the information content of principal and non-principal eigenspaces, and test various scoring functions which combine contributions from multiple eigenspaces. We also test the effect of similarity data and other variations which are unique to the biological knowledge domain on the quality of the results. Query result sets are assessed using a probabilistic approach that measures the significance of coherence between directly connected nodes in the data graph. This model allows us, for the first time, to compare different prominence models quantitatively and effectively and to observe unique trends.

Conclusion

Our tests show that the ranked query results outperform unsorted results with respect to our significance measure and the top ranked entities are typically linked to many other biological entities. Our study resulted in a working ranking system of biological entities that was integrated into Biozon at http://biozon.org.

Background

There is a vast amount of heterogeneous biological data today that is warehoused in multiple databases. Among these are databases of protein sequences [1, 2], protein structures [3], DNA sequences [4], protein-protein interactions [5, 6], cellular pathways [7, 8], and many others. These databases are typically highly focused and are usually limited to one data type. However, biological entities are strongly related and mutually dependent, and to properly analyze the function of an entity one needs to know its extended biological context and its relation to other entities. For example, to define the functional role of a specific gene most accurately, it is also necessary to consider the interactions it is involved in and the set of biochemical pathways it participates in. This mutual dependency is especially important when, for example, querying this wealth of data for disease-related genes or for interactions that mediate signal transduction in a specific biological system. The integration of information from multiple resources can corroborate a certain aspect of a biological entity, validate a hypothesis, or sometimes give initial clues to the function of a completely uncharacterized object.

Existing methods for querying biological data available on the web are mostly limited to the one data type warehoused in the database being queried. There are a few servers that allow one to query multiple databases at once, such as the NCBI Entrez server [9] and the EMBL server [10]. These servers also maintain links among the different entities they store. Other servers such as Moby [11] and Biomediator [12] provide an interface to query multiple resources. However, none of these servers integrates the results or analyzes the relations between the objects at query time. Moreover, the query results are ordered arbitrarily or by features irrelevant to the query (e.g. in alphabetical order). This is clearly not ideal, as one might need to scan through hundreds or thousands of matches before encountering the instance that is the most relevant, the most studied, or the most interconnected. Furthermore, there are many instances in biological databases that are partially annotated or completely uncharacterized. Even if biologically relevant to the search term, these objects will be overlooked by traditional search methods. However, the relations between these objects and other, better annotated objects may help identify their functions. This may in turn imply that these objects are indeed relevant to the search term.

There has been a substantial amount of work directed towards developing methods and prominence models for effectively querying and ranking documents on the World Wide Web. The underlying idea behind many prominence models is that the link structure of the Internet can be used to identify the web pages most relevant to the user's query. A well-known example is Google's PageRank [13]. Another model, proposed in [14], identifies "hubs" and "authorities," where authorities are web pages that are linked to by prominent hubs and hubs are pages that link to prominent authorities. Both models assign prominence values to documents using the eigenvectors of the data graph's adjacency matrix (or a close variant). Many studies have tested variants of these models; for a review the reader is referred to [15, 16]. A recent study has generalized link-based ranking techniques developed for the Web to arbitrary relational databases [17] by analyzing relations induced by a set of queries. This work ranks data instances through a spectral analysis of what they call a database graph, defined in terms of a database and a (finite) set of queries (query language). In this graph a vertex corresponds to a tuple of data and two tuples are related by an edge if there is a query in the query language that outputs one when using the other to specify the query's parameters. However, as the authors indicate, the rankings produced by this model are sensitive to the choice of query language.

Today's most advanced systems belong to companies such as Google, Yahoo and Microsoft. Unfortunately, these companies do not disclose the algorithmic details of their search and ranking engines, which augment the link analysis with information retrieval (IR) techniques. Despite the significant progress in ranking web documents, no equivalent systems have been developed in the biological knowledge domain. The only exception is PubMed [18], which uses IR techniques to identify documents that are related to a given document, based on the similarity of their word frequencies.

This paper addresses the problem of how to query heterogeneous biological data effectively. We sort query results based on their biological significance by exploiting the relationships between biological entities. Unlike a web query, when querying heterogeneous biological data, a user might be interested in all entities (of one or more data types) that match the query and in the relationships among these entities. We refer to these sets of interconnected instances that share a common theme as hubs of knowledge. Detecting these hubs is the main focus of this paper.

We propose a model which resembles methods that are used to search and rank web documents. Our method builds on the extensive schema of the Biozon database, the heterogeneous data stored in this database, and its link structure that connects different biological entities. This link structure is used both to propagate information between documents and to rank matches based on their broader biological context. We explore several different approaches and prominence models, study the properties of these systems, and develop means to evaluate their effectiveness and compare their performance using an objective probabilistic model. We study the effect of non-principal eigenspaces on the ranking and test several different functions for combining their contributions. We also test variants that consider the special properties of the biological data network. Specifically, in the Biozon setting, each link carries a specific meaning with it, and certain relations are deemed more significant than others. Therefore, the models that are used to rank web documents have to be modified to query biological data effectively. We test variants that account for similarity relationships between entities and for their statistical significance, and we test other variants that consider also the semantic significance of other relations.

This paper is organized as follows. We start with a brief description of the Biozon database, the prominence models we tested and the different search strategies. Next we present several example queries and their results followed by qualitative conclusions ('Results'). We proceed to a rigorous performance evaluation of the proposed prominence models and of the use of non-principal eigenspaces ('Discussion'). A few variants are described next and are followed by conclusions.

Results

The Biozon database and data graph

Biozon is a system that consolidates multiple biological databases consisting of a variety of heterogeneous data types (such as DNA sequences, proteins, interactions and cellular pathways) into a single extensive schema that is logically represented as a large data graph Σ. Each node represents some datum, and an edge between two nodes represents a relationship between them. Formally, Σ = (D, R) where D = {d₁...d_n} is the set of all nodes (documents) in the graph and R = {r₁...r_m} ⊂ D × D is the set of all edges (relations) in the graph. Much of the data in Biozon is gleaned from publicly available databases such as SwissProt, PDB, GenBank, BIND, KEGG, and more. In addition, Biozon stores novel computed data, such as similarity relationships and functional predictions. The data is warehoused locally so the fundamental biological objects represented are non-redundant even though data within and between their originating sources overlap. The Biozon resource is available online at http://biozon.org.

In this study we consider the following subset of the Biozon data types: nucleic acid sequences, protein sequences, protein structures, enzyme families, interactions, and pathways. We also consider all relations among these types, including the relations 'member of', 'manifests as', 'encodes', 'similar', and 'contains' (for more information on the Biozon data and relation types, see subsection 'The Biozon database and data graph' of the 'Methods' section). This subgraph, at the instance level, is the subject of our analysis of prominence models.

Prominence models

All the prominence models we consider are based on the idea that a node is prominent if it is connected to other prominent nodes. Given a query, our analysis starts by defining the graph (or subgraph) of relevant documents and its adjacency matrix A. From the adjacency matrix we derive a connectivity matrix B, using some function. The spectral properties of the connectivity matrix are then analyzed by computing its eigenvectors. Each eigenvector is considered to be a possible assignment of prominence values to documents, where node u is assigned a prominence value equal to the u^th component of the eigenvector. The highest scoring nodes in the principal vector(s) are returned as potential significant documents that match the query.
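As a minimal illustration of this pipeline, the sketch below computes a principal prominence vector by power iteration for a small undirected graph. It assumes the simplest connectivity matrix (B = A). The function name, the toy star graph, and the identity shift used to stabilise the iteration are illustrative assumptions and are not taken from the Biozon implementation.

```python
import math

def principal_prominence(adjacency, iterations=200, tol=1e-12):
    """Approximate the principal eigenvector of a symmetric 0/1 adjacency matrix.

    Power iteration is applied to A + I rather than A. The shift leaves the
    eigenvectors unchanged but avoids oscillation on bipartite graphs, where
    A has eigenvalues of equal magnitude and opposite sign.
    """
    n = len(adjacency)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # One multiplication by (A + I).
        w = [v[i] + sum(adjacency[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        converged = max(abs(a - b) for a, b in zip(v, w)) < tol
        v = w
        if converged:
            break
    return v

# Toy graph (hypothetical): node 0 is linked to nodes 1, 2 and 3.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
prominence = principal_prominence(star)
# Node u receives the u-th component; rank nodes by descending prominence.
ranking = sorted(range(len(star)), key=lambda u: prominence[u], reverse=True)
```

In this toy case the central node receives the largest component, which matches the intuition that a node is prominent if it is connected to other prominent nodes.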

The spectral methods we consider in this paper are based on those used to rank documents on the World Wide Web. There are, however, several important differences between the structure of the Internet and that of the Biozon database. First, the hyperlink structure of the Internet defines a clearly directed graph using a single type of relation (the "links to" relation). In contrast, we generally view the Biozon data graph as undirected, unless we impart a specific, biologically motivated direction associated with the various types of relations (see subsection 'Forced directed graphs' in the 'Methods' section). Second, the nodes of the Internet are generally viewed as a homogeneous set, while Biozon's objects represent heterogeneous biological entities.

We test four main methods for assigning prominence values to nodes: Eigenvector Centrality, Hubs & Authorities, PageRank and Katz's Status. The methods differ in the way they characterize the connectivity among the elements of the subgraph. First we make the distinction between sparse and dense models. Eigenvector Centrality, Hubs & Authorities and Katz's Status are sparse models in the sense that they analyze connectivity matrices that are derived directly from the adjacency matrices of sparse graphs and hence are of the same order of sparsity. PageRank on the other hand augments the adjacency matrix with a complete matrix of prior probabilities, and hence practically analyzes a new, completely connected graph. Among sparse models, the differences are subtle, but for undirected graphs they are essentially reduced to the paths that are considered by the model. For example, Eigenvector Centrality takes the connectivity matrix to be simply the adjacency matrix, while Hubs & Authorities defines the connectivity matrix based only on paths of length two between objects. Katz's Status combines information from all paths of length one, two and three. The four models are described in more detail in the 'Methods' section.
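The differences between the models can be made concrete with a rough sketch of how each connectivity matrix might be derived from the adjacency matrix A. The formulas below follow the verbal description above only. The attenuation factor `alpha`, the damping factor `damping`, the uniform prior, and all function names are assumptions made for illustration; the exact definitions are those given in the 'Methods' section.

```python
def matmul(x, y):
    """Product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(x):
    return [list(row) for row in zip(*x)]

def eigenvector_centrality_matrix(a):
    # Connectivity matrix is simply the adjacency matrix.
    return [row[:] for row in a]

def hubs_authorities_matrix(a):
    # A * A^T counts paths of length two between objects.
    return matmul(a, transpose(a))

def katz_matrix(a, alpha=0.5):
    # Attenuated sum over paths of length one, two and three
    # (alpha is an assumed attenuation factor).
    a2 = matmul(a, a)
    a3 = matmul(a2, a)
    n = len(a)
    return [[alpha * a[i][j] + alpha**2 * a2[i][j] + alpha**3 * a3[i][j]
             for j in range(n)] for i in range(n)]

def pagerank_matrix(a, damping=0.85):
    # Column-normalised adjacency mixed with a uniform prior, which makes
    # every entry positive (a dense, completely connected graph).
    n = len(a)
    col_sums = [sum(a[i][j] for i in range(n)) for j in range(n)]
    return [[damping * (a[i][j] / col_sums[j] if col_sums[j] else 1.0 / n)
             + (1.0 - damping) / n
             for j in range(n)] for i in range(n)]

# Toy undirected path graph 0 - 1 - 2 (hypothetical data).
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
hubs = hubs_authorities_matrix(path)
katz = katz_matrix(path)
dense = pagerank_matrix(path)
```

The first three matrices keep zeros wherever no short path exists, whereas every entry of the PageRank matrix is positive, which is the sparse versus dense distinction drawn above.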

In the 'Discussion' section we explore several variations of these methods that test the effect of non-principal eigenspaces on the ranking, different weighting functions, and inclusion of similarity data in the adjacency matrix.

Computing prominence vectors

In the prominence models we considered, prominence values are associated with the eigenvectors of the corresponding connectivity matrices. As such, any method for finding eigenvectors will suffice. A special case is multidimensional eigenspaces. We take the prominence of an object i in an eigenspace of dimension k > 1 to be the norm of the projection of that object onto the eigenspace:

$\hat{e} (i) = \sqrt{\sum_{j = 1}^{k} e_{j} {(i)}^{2}}$

where {e₁, e₂, ..., e_k} is an orthonormal basis of the eigenspace. For a more detailed discussion of computing prominence vectors see 'Methods'.
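The formula above can be evaluated directly once an orthonormal basis is available. The short sketch below does so and also checks, on a hypothetical two-dimensional eigenspace, that the result does not depend on which orthonormal basis is chosen. The function name and the example vectors are illustrative only.

```python
import math

def eigenspace_prominence(basis):
    """Prominence of every object in a k-dimensional eigenspace.

    `basis` is a list of k orthonormal eigenvectors, each a list of n floats.
    Object i receives sqrt(sum_j e_j(i)^2).
    """
    n = len(basis[0])
    return [math.sqrt(sum(e[i] ** 2 for e in basis)) for i in range(n)]

s = 1 / math.sqrt(2)
# One orthonormal basis of a hypothetical 2-dimensional eigenspace in R^3 ...
e1, e2 = [1.0, 0.0, 0.0], [0.0, s, s]
# ... and a rotated orthonormal basis of the same eigenspace.
f1 = [s * (a + b) for a, b in zip(e1, e2)]
f2 = [s * (a - b) for a, b in zip(e1, e2)]

scores = eigenspace_prominence([e1, e2])
rotated_scores = eigenspace_prominence([f1, f2])
```

Both bases give the same prominence values, which is why a single score per object is well defined even though the individual basis vectors of a multidimensional eigenspace are not unique.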

Search strategies: global graph vs. focused subgraphs

Our ultimate goal is to provide a sensible ranking of results from queries to the Biozon database. We consider three general strategies for generating and then ordering these result sets: The focused subgraph method (referred to also as the 'local method') first creates a subgraph of Biozon consisting of nodes that satisfy the query and their immediate neighbors as described in Fig. 1. Prominence values are then assigned to the nodes of the subgraph using one of the above models, and nodes are ranked by descending prominence. The global method, on the other hand, first uses one of the models to assign prominence values to every node in the Biozon graph. Then, for a given search, nodes satisfying the query are extracted and ordered by descending prominence. The main advantage of the global method is its speed since the prominence values can be pre-computed. However, it is less sensitive than the local method which can report entities that do not match the query term but may still be relevant to the query (based on their relations with entities that do match the query). The extended global method is a variation of the global method which utilizes information in neighboring entities to bring forward uncharacterized entities, thus combining the advantages of both methods. For a detailed description of these procedures see section 'Search strategies' in 'Methods'.
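The contrast between the local and global strategies can be sketched on a toy graph. In the sketch below the graph, the definitions, the precomputed prominence values, and the substring-based `matches` test are all hypothetical. It also assumes that the focused subgraph keeps every edge among the selected nodes, which is a simplification of the procedure in Fig. 1.

```python
def matches(definition, term):
    # Assumed matching rule: case-insensitive substring search.
    return term.lower() in definition.lower()

def focused_subgraph(graph, definitions, term):
    """Local method: nodes matching the query plus their immediate neighbors."""
    seeds = {v for v in graph if matches(definitions.get(v, ""), term)}
    nodes = set(seeds)
    for v in seeds:
        nodes |= graph[v]
    # Keep only edges whose endpoints are both in the subgraph.
    return {v: graph[v] & nodes for v in nodes}

def global_ranking(definitions, prominence, term):
    """Global method: filter precomputed prominence values by the query."""
    hits = [v for v in prominence if matches(definitions.get(v, ""), term)]
    return sorted(hits, key=lambda v: prominence[v], reverse=True)

# Hypothetical toy data.
graph = {
    "prot_a": {"inter_1", "dna_1"},
    "dna_1": {"prot_a", "orf_x"},
    "orf_x": {"dna_1"},
    "inter_1": {"prot_a", "prot_c"},
    "prot_c": {"inter_1"},
}
definitions = {
    "prot_a": "ubiquitin-protein ligase",
    "dna_1": "ubiquitin ligase gene",
    "orf_x": "hypothetical protein",
    "inter_1": "protein-protein interaction",
    "prot_c": "kinase",
}
prominence = {"prot_a": 0.6, "inter_1": 0.4, "dna_1": 0.3, "orf_x": 0.2, "prot_c": 0.1}

local_nodes = focused_subgraph(graph, definitions, "ubiquitin")
global_hits = global_ranking(definitions, prominence, "ubiquitin")
```

Here the uncharacterized node `orf_x` enters the focused subgraph through its link to a matching DNA sequence, whereas the global method only returns the two nodes whose definitions contain the query term.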

The spectral properties of the connectivity matrices

Examples

To demonstrate the effect of the different prominence models on the ranking of biological documents we ran several test queries. For each query we generated an undirected focused subgraph of the Biozon database as described in Fig. 1. From that subgraph we constructed four different connectivity matrices, one for each prominence model, and computed the eigenvectors corresponding to the twenty largest-magnitude eigenvalues for each matrix. After translating each multidimensional eigenspace to a single prominence vector (as described in section 'Computing prominence vectors'), we examined the 50 highest scoring documents in each of these prominence vectors.

In general, we observe that sparse models (Eigenvector Centrality, Hubs & Authorities, and hybrid Katz's Status) perform similarly in the sense that they commonly emphasize the same biological entities in each of their respective eigenspaces. If Eigenvector Centrality ranks some object in a particular eigenspace highly, then Hubs & Authorities and hybrid Katz's Status also rank that object highly in the corresponding or in a nearby eigenspace. (Corresponding Eigenvector Centrality eigenspaces and Hubs & Authorities eigenspaces emphasize the same biological entities. Hybrid Katz's Status eigenspaces may be off by one or so.) PageRank, on the other hand, tends to consolidate the most significant information from across the other methods' eigenspaces into its principal eigenspace.

For example, we searched for proteins with the definition ubiquitin. These proteins are involved in protein degradation and are attached to other proteins by an enzyme, thereby marking them for degradation. Sparse models rank the protein family ubiquitin-protein ligase, ubiquitin-activating enzyme as the most prominent entity in the principal eigenspace (this family is associated with Biozon docid 5977355; to view an entry with docid x, follow the URL http://www.biozon.org/Biozon/Profile/x). The family is followed by its members, such as Ubiquitin-like protein SUMO-1 conjugating enzyme (docid: 365280), Ubiquitin-protein ligase RSP5 (docid: 1026628), and Ubiquitin-activating enzyme (docid: 1054300). The most prominent entity in the second eigenspace is the protein family ubiquitin thiolesterase (docid: 5975715). It is followed by its members Ubiquitin carboxyl terminal hydrolase 15 (docid: 1066565), Ubiquitin carboxyl terminal hydrolase 6 (docid: 911606), and Ubiquitin carboxyl terminal hydrolase 14 (docid: 1025595). We observe that the most prominent proteins in these two eigenspaces are generally encoded by several DNA sequences and are involved in several interactions. Although these DNA sequences and interactions are not among the most prominent elements of these eigenspaces, they do contribute to the high ranking of the proteins. The third eigenspace emphasizes the protein Skp1p (docid: 446186) and its interactions. This protein is involved in ubiquitin-mediated proteolysis and is part of a larger complex that underlies many of the interactions associated with it. The fourth eigenspace emphasizes protein orf yol133w (docid: 259185) and its interactions. Note that the main definition of this protein does not contain the query term; however, interaction data suggest ubiquitin-related activity. The fifth through seventh eigenspaces behave similarly to the third and fourth eigenspaces. Each focuses on a protein and its interactions. The fifth eigenspace emphasizes UV excision repair protein RAD23 (docid: 808559) and 26S protease regulatory subunit 7 homolog (docid: 882149). Interestingly, the sixth eigenspace emphasizes these same two proteins; however, while the fifth eigenspace favors the former, the sixth eigenspace favors the latter.

The most prominent entities in PageRank's principal eigenspace are the two protein families mentioned above and their member proteins, as well as the proteins emphasized in the non-principal eigenspaces of the other methods. In general, PageRank tends to favor entities with a high degree. For example, proteins Ubiquitin carboxyl terminal hydrolase 15 (docid: 1066565) and Ubiquitin-activating enzyme (docid: 1054300) are among the 50 most highly ranked entities by PageRank and are members of the two families mentioned above. Both of these proteins have many neighbors (including multiple protein-protein interactions). These proteins are thus highly ranked on their own merit, and not just because they are members of a highly ranked protein family.

We also observe this phenomenon when using the query term cancer. For cancer, the most prominent entities in the principal eigenspaces of the sparse models are the tumor suppressor protein P53 (docid: 802537) and its interactions. The second eigenspace emphasizes protein breast cancer type 1 susceptibility protein (docid: 1079763) and its interactions. The third eigenspace emphasizes DNA repair protein RAD51 (docid: 811200) and its interactions. The fourth eigenspace emphasizes protein CRK-associated substrate (docid: 1036799), which is a breast cancer anti-estrogen resistance protein, and its encoding DNA sequences and interactions. PageRank's principal eigenspace emphasizes all four of these proteins. Similarly, for the query term autoimmune, we observe the following. The principal eigenspace of the sparse models emphasizes protein Autoimmune regulator (docid: 947655), protein Autoimmune regulator (docid: 943600) and Ribonuclease P protein subunit RPP1 (docid: 640274) along with the nucleic acid sequences that encode these proteins. The second eigenspace emphasizes Tumor necrosis factor ligand (docid: 621283), Tumor necrosis factor precursor (docid: 533673), and the DNA sequences that encode either of them. PageRank's principal eigenspace emphasizes all of the above proteins (except 533673) along with several DNA sequences that encode these and similar proteins.

We compare these results with those generated by similar queries to the NCBI or SwissProt servers. For example, given the query ubiquitin, both servers only return records that contain the query term. The SwissProt set is ordered alphabetically based on the protein ID, while the NCBI server orders the records based on the date the record was created. Both orderings are quite arbitrary and uninformative. On the other hand, the examples described above demonstrate that the prominence models we tested are indeed effective in identifying interesting instances that are highly connected to other biological entities, thereby providing a broader biological context for functional analysis of these instances. Our observations of independent queries suggest that sparse models most readily identify Hubs of Knowledge as different eigenspaces tend to emphasize nearly disjoint sets of intrarelated biological entities. Consequently, no single eigenspace completely characterizes the set of most important entities in the focused subgraph. On the other hand, PageRank's principal eigenspace does appear to summarize this set by singling out the highest scoring entities in each of the other methods' top eigenspaces. Thus PageRank most readily provides a ranking for query results. PageRank's non-principal eigenspaces appear less coherent than its principal eigenspace and are usually associated with significantly smaller eigenvalues. Our qualitative conclusions are supported by more extensive tests and quantitative results that are reported in the 'Discussion' section.

Distributions of eigenvalues and prominence values

We study the distributions of the eigenvalues of the different prominence models' connectivity matrices. This analysis can help determine how many different eigenvectors should be used when ranking instance sets (see 'Discussion'). In general, we observe that the eigenvalues produced by sparse models all display similar decay patterns across different queries. PageRank, on the other hand, usually produces one relatively large eigenvalue while the rest of the eigenvalues all have very similar smaller values. We observe these patterns for queries in local and global mode. Our results are reported in the 'Methods' section. We also studied the distribution of prominence values in the principal eigenvectors produced by the various prominence models. We observe that for a given prominence model, the properties of the distribution of prominence values within an eigenvector are fairly consistent both across different query terms and across the different eigenvectors of a particular matrix. In general, only a small fraction of prominence values are actually relevant in each eigenspace. However, a priori it is unclear which prominence values should be considered significant, since every eigenvector spans the whole query graph and all documents are assigned prominence values in each eigenspace. To address this problem, we model the distribution of low-scoring documents and use that distribution to estimate the significance of outliers. For details see 'Methods'.

Eigenspaces and connected components

We examine the correlation between eigenspaces and connected components of the data graph. Previous studies [14, 15] observed that the same documents might appear in multiple eigenspaces. Our experiments are in agreement with these observations. For example, the 1977 nodes in the focused subgraph for the search term cancer form 656 connected components (CC), the largest of which contains 264 nodes. The rest of the components are much smaller and the majority of the nodes (1374) are in CCs of size 5 or less. The sets of top scoring nodes in the first, second, fifth, eighth and tenth eigenspaces (computed with the Eigenvector Centrality model) are drawn completely from the largest CC. Some of these objects appear in more than one eigenspace. On the other hand, most of the other connected components in this graph are mapped to specific eigenspaces. For example, the second CC corresponds exactly to the group of top scoring nodes in the third eigenspace. Similarly, the third CC constitutes the top scoring nodes in the fourth eigenspace. Interestingly, even when overlap is observed (as is the case for the top CC), most of the top scoring nodes in each eigenspace are unique.

We observe similar behavior with the query term ubiquitin. The graph contains 6219 nodes, of which 1811 are in the largest CC. The second largest CC contains only 41 nodes, and 2364 nodes are connected only to one other object. However, in this case the top CC is so large that it completely dominates all top ten eigenspaces. We also observe overlap between these eigenspaces. For example, as was pointed out in section 'Examples', the fifth and sixth eigenspaces rank the same two proteins (UV excision repair protein RAD23 and 26S protease regulatory subunit 7 homolog) at the top, but in reverse order. Interestingly, these two proteins are related by an interaction (docid: 60875934) which is ranked 3rd in the fifth eigenspace but ranked 48th in the sixth eigenspace. Each of these proteins is associated with many interactions, and the immediate structures of the graphs centered at each of these two proteins are somewhat similar with respect to interactions. However, the graph structures are different with respect to nucleic acid sequences as UV excision repair protein RAD23 is linked to a DNA sequence that encodes several other proteins, forming a slightly different community from 26S protease regulatory subunit 7 homolog, whose related DNA sequences do not encode other proteins.

Both examples demonstrate that eigenspaces can overlap. However, each eigenspace is centered on a different subset of objects and the high scoring documents in each one form a different subgraph with a different structure. We postulate that the top-scoring documents in overlapping eigenspaces are indicative of highly non-planar graphs. The prominence models we considered are based on eigenvector analysis and are essentially linear projection methods. These methods fail to faithfully embed complex non-planar subgraphs within a single hyperplane, and therefore we observe multiple non-isomorphic projections of these graphs (for a discussion on embedding algorithms see [19]). We conjecture that such graphs are split between eigenspaces roughly along minimal cuts (i.e. partitions with minimal number of cross edges).
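Component statistics of the kind quoted above (number of components, size of the largest one, number of nodes in small components) can be obtained with a standard breadth-first traversal. The sketch below is one possible way to compute them on a hypothetical toy graph and is not the code used in the study. It is also worth recalling that the adjacency matrix of a disconnected undirected graph is block diagonal, so its eigenvectors can be chosen to be supported on a single component, which is consistent with small components mapping to specific eigenspaces.

```python
from collections import deque

def connected_components(graph):
    """Return the connected components of an undirected graph, largest first.

    `graph` maps each node to the set of its neighbors.
    """
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in graph[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical toy graph with three components.
toy = {1: {2, 3}, 2: {1}, 3: {1}, 4: {5}, 5: {4}, 6: set()}
components = connected_components(toy)
sizes = [len(c) for c in components]
# Number of nodes living in components of size 2 or less.
nodes_in_small = sum(len(c) for c in components if len(c) <= 2)
```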

Discussion

Evaluating the quality of the results

While each of the prominence models is designed to generate result sets that are sorted based on relevance, it is hard to quantitatively evaluate the quality of these results on a large scale. Clearly, one can provide a few examples demonstrating the properties of a specific system; however, it is difficult to infer general conclusions from a few carefully chosen examples.

An independent measure of quality is especially important when comparing different weighting schemes and evaluating the impact of non-principal eigenspaces (as described in section 'Weighting functions'). Ideally, such a measure would reflect the utility of the results returned by the query. However, utility is a largely subjective matter, especially when considering that Biozon is intended for a wide variety of users with many different interests. Thus, our attempt to define an absolute quality measure on the search results can only be viewed as an attempt to approximate a "consensus" point of view based on the objects themselves. To that end we chose to use the textual information associated with the biological entities, and we validate and assess the performance of a given method by measuring the coherence among the descriptions of instances in the result set and their linked objects. Our intuition is that "good" results are those that not only contain the query term but are also connected to many other objects that contain the query term. Specifically, for each instance v in the result set we examine its set of direct neighbors (denoted as the subgraph G_v) and count how many of them match the query term (referred to as consistent neighbors). Denote by n_v the number of direct neighbors of v that have definitions (this last condition is required since many objects do not have explicit definitions) and by m_v the number of consistent neighbors of v. Given the result sets, one can compute the average number of consistent neighbors per query (as a typical performance measure), or the total number of consistent neighbors among all queries tested (as an overall performance measure). However, a few instances that are connected to many consistent neighbors can dominate and bias these performance measures. Normalized measures, such as the relative measure m_v/n_v, are not optimal either. Many of the instances are connected to only one other, consistent object. With a relative measure of 1 for these instances, the overall or typical performance will be biased as well by diminishing the contributions of strongly connected instances in the result set. Instead, we propose a probabilistic measure, and for each instance we estimate the significance of observing m_v consistent neighbors out of n_v neighbors. Formally, given a query term Q we estimate the probability that a random object will match the term by p = N_Q/N where N_Q is the total number of objects that match the query term, and N is the total number of objects in the Biozon database. (It is possible to obtain better estimates of the parameter p. However, the differences are marginal and we chose this estimate for its computational simplicity.) The binomial distribution with parameter p can then be used to estimate the probability of observing exactly m consistent neighbors out of n neighbors, employing the simplifying assumption of independence:

$\Pr (m) = \binom{n}{m} p^{m} {(1 - p)}^{n - m} .$

The significance (pvalue) of the subgraph G_v is estimated by the total probability of observing by chance graphs that are at least as consistent as the observed graph

$pvalue (G_{v}) = \Pr (m^{\prime} \geq m) = \sum_{m^{\prime} = m}^{n} \Pr (m^{\prime}) .$

This measure accounts for both the consistency and the subgraph size. Given the complete result set R, the total pvalue is approximated by

$pvalue (R) = \prod_{v \in R} pvalue (G_{v})$

or with the monotonic log-transformation

$Q (R) = - \sum_{v \in R} \log (pvalue (G_{v}))$
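The significance computation above can be sketched in a few lines of Python. This is a minimal illustration rather than the Biozon implementation: the function names and the example counts (200 matching objects out of 1,000,000) are hypothetical, and the neighbor counts (m_v, n_v) are assumed to have been extracted from the data graph already.

```python
import math

def binom_pmf(m, n, p):
    """Probability of exactly m consistent neighbors out of n (binomial model)."""
    return math.comb(n, m) * p**m * (1.0 - p)**(n - m)

def subgraph_pvalue(m, n, p):
    """Probability of observing at least m consistent neighbors out of n by chance."""
    return sum(binom_pmf(k, n, p) for k in range(m, n + 1))

def result_set_quality(neighbor_counts, p):
    """Q(R) = -sum over v of log pvalue(G_v); neighbor_counts holds (m_v, n_v) pairs."""
    return -sum(math.log(subgraph_pvalue(m, n, p)) for m, n in neighbor_counts)

# Hypothetical numbers: 200 of 1,000,000 objects match the query term.
p = 200 / 1_000_000
single = subgraph_pvalue(1, 1, p)    # one neighbor, and it is consistent
strong = subgraph_pvalue(5, 20, p)   # five consistent neighbors out of twenty
quality = result_set_quality([(1, 1), (5, 20)], p)
```

In this toy setting the instance with a single consistent neighbor has a relative measure of 1, yet its pvalue equals p, whereas the instance with five consistent neighbors out of twenty receives a far smaller pvalue. For very large or extremely coherent subgraphs the tail sum can underflow to zero, so a production version would likely work in log space.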

This measure, however, is useful only when comparing different sets of results because it does not account for the ordering within the set of results. Our second performance measure evaluates different ranking methods by accounting also for the ordering of objects within the result set. Our method is a variation over the popular ROC measure [20]. However, unlike the typical setting for this measure (which requires labeled data) we have quantitative data with a significance value assigned to each sample. We assume that better models will report the more significant instances first. Therefore, the cumulative area under the curve corresponding to the sorted list of instances (from most significant to least significant) can serve as an overall performance measure. Formally, we are given a set of N instances, sorted by ranking function f. Denote by R(i) the set of i highest ranked documents. The quality of f is estimated by the functional UROC (unsupervised ROC) which we define as

$\mathrm{UROC}_{N} (R) = \sum_{i = 1}^{N} \sum_{j = 1}^{i} - \log (pvalue (G_{j})) = \sum_{i = 1}^{N} Q (R (i))$

Here G_j denotes the subgraph of the j-th ranked instance. This functional attains its maximal value when the instances are sorted from most significant to least significant (i.e., in increasing order of pvalue) and its minimal value when they are sorted in the reverse order.

Therefore, a ranking function that correlates well with ordering by significance values will get a high UROC score. Here we limit N to be 50, and compute UROC₅₀, assuming that it is not very likely that a user will scan more than the top 50 results. We use this measure to compare different models and variations of prominence models as described next.
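As a rough illustration of how the UROC functional rewards placing significant instances first, the sketch below accumulates -log(pvalue) along a ranked list. The function name `uroc`, the instance identifiers and the score values are made up for the example; only the cumulative-sum structure follows the definition above.

```python
from itertools import accumulate

def uroc(scores, ranking, limit=50):
    """Sum of cumulative -log(pvalue) over the top `limit` ranked instances.

    scores  : dict mapping instance id -> -log(pvalue(G_v))
    ranking : instance ids ordered by the ranking function, best first
    """
    top = [scores[v] for v in ranking[:limit]]
    # accumulate() yields Q(R(1)), Q(R(2)), ...; their sum is the UROC value
    return sum(accumulate(top))

# Hypothetical -log(pvalue) values for three instances.
scores = {"a": 12.0, "b": 3.0, "c": 0.5}
best = uroc(scores, ["a", "b", "c"])    # most significant first
mixed = uroc(scores, ["b", "a", "c"])
worst = uroc(scores, ["c", "b", "a"])   # least significant first
```

With these values the three orderings give 42.5, 33.5 and 19.5 respectively, so the ordering that lists the most significant instance first scores highest.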

Note that the UROC functional induces a certain ordering on instances that we perceive as important, and the best ranking procedure would be one that is perfectly correlated with this functional. One might argue that instead of using the prominence models, entities should be ranked directly based on these probability-based measures. However, computing these values in real time for an arbitrary query is time consuming and therefore we resort to precomputed indices.

Assessing the contribution of non-principal eigenspaces

All the models we tested return multiple potential prominence vectors in the form of eigenvectors or vector projections of multidimensional eigenspaces. Each such prominence vector represents a different way of distributing prominence values among the graph nodes. Hence, a given connectivity matrix can produce several potential values for each object.

As the examples in section 'Examples' demonstrated, it is hard to assign a universal meaning to the different eigenspaces. Moreover, not all eigenspaces are equally informative, and entities highly relevant to the query can be distributed across multiple vectors. Furthermore, a document might be included in different groups of inter-related documents and therefore might appear in multiple eigenspaces. Previous studies [14, 15] observed that non-principal components may or may not contain useful information. However, no methodological solutions were proposed to handle non-principal eigenspaces.

Here we attempt to quantitatively assess the importance of non-principal eigenspaces and to propose methods for integrating information from multiple prominence vectors. Each method is assessed in terms of its effectiveness in ranking documents in focused subgraphs, using the measures proposed in the previous section. Our goal is to determine if the information stored in the first eigenspace is enough to rank the instances effectively.

Weighting functions

We explore several methods for weighting and combining prominence vectors in order to consolidate the information available in the different prominence vectors into a single meaningful ranking system. Given a set of eigenvalues $e_{1}, \ldots, e_{n}$ and associated prominence vectors $\mathbf{e}_{1}, \ldots, \mathbf{e}_{n}$ (where the i^th component of each vector, $\mathbf{e}_{k}(i)$, corresponds to the i^th document), we define the following weighting functions:

Principal Eigenspace – A document's final score is its component in the prominence vector corresponding to the principal eigenspace. That is, the i^th document's final score is equal to the i^th component of $\mathbf{e}_{1}$. (This is the approach taken by previous studies on prominence models when ranking web documents.)
Max – A document's final score is the maximum prominence value it receives in any of the eigenspaces. $\mathrm{Max}(i) = \max_{k = 1 \ldots n} \mathbf{e}_{k}(i)$
Weighted Max – A document's final score is its maximum prominence value weighted by the corresponding eigenvalue. $\mathrm{WeightedMax}(i) = \max_{k = 1 \ldots n} (e_{k} \cdot \mathbf{e}_{k}(i))$
Weighted Sum – A document's final score is the sum of the prominence values in each vector weighted by the corresponding eigenvalues. $\mathrm{WeightedSum}(i) = \sum_{k = 1}^{n} (e_{k} \cdot \mathbf{e}_{k}(i))$
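
The four weighting functions can be written compactly as below. This is a sketch under stated assumptions, not the code used in the study: the eigenpairs are assumed to be sorted by decreasing eigenvalue (so the first vector is the principal one), prominence values are assumed to be non-negative, and the function names and toy numbers are invented for illustration.

```python
def principal(eigvals, vectors, i):
    """Score of document i from the principal prominence vector only."""
    return vectors[0][i]

def max_score(eigvals, vectors, i):
    """Largest prominence value document i receives in any eigenspace."""
    return max(vec[i] for vec in vectors)

def weighted_max(eigvals, vectors, i):
    """Largest prominence value after weighting each vector by its eigenvalue."""
    return max(lam * vec[i] for lam, vec in zip(eigvals, vectors))

def weighted_sum(eigvals, vectors, i):
    """Eigenvalue-weighted sum of prominence values across all eigenspaces."""
    return sum(lam * vec[i] for lam, vec in zip(eigvals, vectors))

# Toy data: two eigenpairs over three documents (values are made up).
eigvals = [4.0, 2.0]
vectors = [
    [0.5, 0.25, 0.75],    # principal prominence vector
    [0.25, 0.75, 0.125],  # second prominence vector
]

doc = 1  # a document that is prominent only in the second eigenspace
scores = {
    "principal": principal(eigvals, vectors, doc),
    "max": max_score(eigvals, vectors, doc),
    "weighted_max": weighted_max(eigvals, vectors, doc),
    "weighted_sum": weighted_sum(eigvals, vectors, doc),
}
```

For the chosen document the Principal Eigenspace score is 0.25, while Max, Weighted Max and Weighted Sum give 0.75, 1.5 and 2.5, which shows how the multi-eigenspace functions can promote a document that the principal vector alone ranks low.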

Comparison of weighting functions

We tested a total of 44 queries with each function. The queries consist of combinations of different target data types, prominence models and query terms. Query terms were chosen purely based on biological interest, prior to performance evaluation. Because of the small size of the focused subgraphs (Table 5), we could reliably generate a large number of eigenvectors. The documents in each result set were filtered for the proper data type (i.e. the query data type) and ranked using the four weighting functions over the prominence vectors derived from the top 20 eigenvalues/eigenspaces (for all queries we considered, the information contained in eigenspaces beyond the top twenty seems to be marginal). The ordered set of documents was then evaluated using the UROC measure described above, and the results are summarized in Table 1a. While all four functions do well, it appears that the Principal Eigenspace measure produces the best results on average, both in terms of average rank and average ratio.

Table 1 Part (a) Comparing weighting functions for combining the contributions of multiple eigenspaces.


Analyzing the results by prominence method reveals an interesting trend that supports our qualitative results reported in section 'Examples'. Table 1b shows the results for the four different weighting functions when considering only the PageRank method. Here, ranking documents by the Principal Eigenspace dominates performance in all aspects. This indicates that in general the non-principal eigenvectors produced using the PageRank method are much less informative than the principal eigenvector. On the other hand, when considering only the Hubs & Authorities model (Table 1c) the Principal Eigenspace method does significantly worse than the others, which indicates that useful information can be found in the non-principal eigenvectors of Hubs & Authorities. The differences are even more significant for Eigenvector Centrality (results not shown).

Comparison of prominence models

Practical considerations motivated us to look further into global methods (see section 'Search strategies' in 'Results'), since computing the focused subgraphs is a relatively slow process. Global methods provide us with precomputed scores that can be used to rank documents returned by arbitrary queries on the fly. However, global methods are bounded by other computational issues, and generating more than just the principal eigenspace can be very expensive. Encouraged by the relative success of the PageRank method with its principal eigenspace, we ran a second set of experiments to re-evaluate which prominence method is the most effective in global mode when only the principal eigenspace is used. Detailed results are given in Table 2, and example UROC graphs are shown in Fig. 2. The results are consistent and suggest that PageRank is the most successful method under this setup.

Table 2 Comparison of prominence models on the global Biozon graph.


The effect of normalization

We conjecture that the observed differences in performance between the prominence models are an artifact of matrix normalization. The PageRank model uses a connectivity matrix in which every row is normalized to sum to 1. This means a node splits its rank evenly among its neighbors. The other three models do not normalize their connectivity matrices. In these models, a node confers its whole rank upon each of its neighbors. For example, consider a graph containing (among other things) a large protein family with n related proteins. Suppose each protein is only connected to the family. With a normalized matrix, each of the proteins confers its whole rank to the family, but the family only confers 1/n of its rank back to each protein. Consequently, although many proteins contribute to the family to give it a high rank, the family splits this rank among its proteins and so the individual proteins' ranks remain low. Thus PageRank emphasizes the family without emphasizing its members. On the other hand, with an unnormalized matrix, the protein family confers its whole authority upon each of its members. Thus if the protein family is highly ranked, its members will also be highly ranked. Hence an unnormalized matrix method will tend to co-emphasize protein families and their members.
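
The protein-family example can be reproduced on a toy star graph. The sketch below is illustrative only and makes several assumptions that are not taken from the paper: the normalized variant uses a damping factor of 0.85 with a uniform prior, and the unnormalized variant adds an identity shift so that the iteration converges on this bipartite graph (the shift leaves the principal eigenvector unchanged). All function names are hypothetical.

```python
def star_graph(n):
    """Adjacency lists: node 0 is the family, nodes 1..n are its proteins."""
    adj = {0: list(range(1, n + 1))}
    for protein in range(1, n + 1):
        adj[protein] = [0]
    return adj

def propagate(adj, rank, normalize):
    """One step in which every node passes rank to its neighbors.

    normalize=True  -> rank is split evenly among neighbors
    normalize=False -> each neighbor receives the sender's whole rank
    """
    new = {v: 0.0 for v in adj}
    for v, nbrs in adj.items():
        share = rank[v] / len(nbrs) if normalize else rank[v]
        for u in nbrs:
            new[u] += share
    return new

def stationary(adj, normalize, damping=0.85, iters=200):
    """Iterate the propagation step, rescaling so that ranks sum to 1."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        step = propagate(adj, rank, normalize)
        if normalize:
            # assumed uniform prior; keeps the iteration from oscillating
            new = {v: (1 - damping) / n + damping * step[v] for v in adj}
        else:
            # assumed identity shift; same principal eigenvector as before
            new = {v: rank[v] + step[v] for v in adj}
        total = sum(new.values())
        rank = {v: x / total for v, x in new.items()}
    return rank

adj = star_graph(10)
normalized = stationary(adj, normalize=True)
unnormalized = stationary(adj, normalize=False)
ratio_norm = normalized[0] / normalized[1]        # family vs. one protein
ratio_unnorm = unnormalized[0] / unnormalized[1]
```

For a family of 10 proteins the unnormalized ratio approaches the square root of 10 (about 3.16), while the normalized ratio is considerably larger (roughly 8.8 under the assumed damping), in line with the argument that normalization emphasizes the family far more than its members.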

We tested the effect of normalization on the performance of the prominence models. The results are summarized in Table 3. As the table shows, normalization can have a drastic impact on the results, and performance almost always improves with normalized connectivity matrices. However, note that PageRank still performs better.

Table 3 The effect of normalization on the ranking.


Local vs. global methods

We compare the performance of the PageRank model using focused subgraph, global and extended global modes (section 'Search strategies'). We observe that the global mode sometimes outperforms focused subgraph mode with respect to the UROC₅₀ measure (see Table 4). This is possibly due to "edge effects": objects that are excluded from the focused graph are considered in global mode and can increase the prominence values of their neighbors which, in turn, increase the prominence values of the objects relevant to the query term. However, the global mode is outperformed when it produces fewer than 50 results. The extended global mode improves over the global mode and seems to produce results that are comparable to the best mode for each query. These results suggest that the extended global mode is a good compromise between sensitivity and speed. Interestingly, the top result returned for the stromelysin query in extended global mode is, in fact, a protein whose definition does not contain the search term. This protein, interstitial collagenase precursor (docid 884427), is similar to several other stromelysin proteins and is encoded by a stromelysin DNA gene. For comparison, we also provide the baseline performance. This was evaluated by picking 50 random documents from the subgraph and computing UROC based on the random sampling order (the procedure was repeated 50 times and the results were averaged).
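
The random baseline described above might be computed along the following lines. This is a hedged sketch: the helper names, the fixed seed and the synthetic score list are assumptions made for the example, and the input is taken to be a precomputed list of -log(pvalue) values for the documents in the subgraph.

```python
import random
from itertools import accumulate

def uroc(values):
    """UROC of a list of -log(pvalue) values taken in the given order."""
    return sum(accumulate(values))

def random_baseline(neg_log_pvalues, sample_size=50, repeats=50, seed=0):
    """Average UROC over repeated random samples, in random sampling order."""
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    k = min(sample_size, len(neg_log_pvalues))
    total = 0.0
    for _ in range(repeats):
        total += uroc(rng.sample(neg_log_pvalues, k))
    return total / repeats

# Synthetic scores standing in for a subgraph of 200 documents.
values = [float(i % 7) for i in range(200)]
baseline = random_baseline(values)
ceiling = uroc(sorted(values, reverse=True)[:50])  # best achievable ordering
```

Any ranking method worth using should land between `baseline` and `ceiling`; the closer to the ceiling, the better it tracks the significance ordering.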

Table 4 Performance evaluation: focused subgraph vs. global vs. extended global.


Table 5 Makeup of the focused subgraphs for selected queries.


Variations

We tested several other variations of prominence models that exploit the structure of the Biozon graph and the different data and relation types. Our first experiment used directed graphs where edges are directed towards a specific data type (forced directed graph). The main advantage of this method is that it produces a clear and sensible distinction between hubs and authorities, unlike undirected graphs. We also attempted to incorporate the similarity relation into our graphs to help characterize new and unannotated objects. We found that in some cases similarity data caused an immediate and significant improvement in result set quality. Unfortunately, these improvements were not observed consistently across multiple queries, indicating that the problems of redundancy and localization (as discussed in subsection 'Variations on prominence models' of the 'Methods' section) need to be considered. Finally, to account for possible discrepancies in significance among biological relations, we weighted the entries of the adjacency matrices based on the type of the relation. While in some cases edge weighting clearly improves the results, in others it does not, and further study is necessary to converge to a stable and consistent weighting scheme. Our experiments are reported in more detail in the 'Methods' section.

Conclusion

In this paper we present a system that ranks biological entities returned as results from querying heterogeneous biological data. We view important or interesting instances in the result sets as those that are linked to many other important entities. Since these instances are associated with a wealth of biological knowledge (through their relations to other biological entities), they can serve as a useful entry point for researchers who would like to study similar systems. To identify these instances we analyze the intricate link structure of Biozon by applying spectral methods. We test several popular prominence models, variations of these models, different query modes, and different scoring functions. To evaluate the quality and effectiveness of these models we propose an objective probabilistic measure, UROC, that accounts for both the structure of the Biozon graph and the textual information contained therein. This measure quantifies the thematic unity within instance subgraphs, directed at detecting what we call "hubs of knowledge".

We examine several issues with prominence models that have not been quantitatively addressed so far. We evaluate the utility of information contained in non-principal eigenspaces as well as different ways to incorporate this information into our prominence models. Our tests indicated that with a certain family of prominence models (including Eigenvector Centrality, Hubs & Authorities and Katz's Status) the information is distributed across multiple eigenspaces and that the Max weighting function is the most effective approach for ranking documents with these models. On the other hand, non-principal eigenspaces seem to contain little information when using the PageRank model. For PageRank, the most effective ranking is produced by relying solely on prominence values in the principal eigenspace. The differences between the PageRank model and the other three models are attributed to two factors. First, PageRank includes a prior matrix that essentially connects all documents. Therefore, PageRank analyzes a completely connected graph while all other models analyze sparse graphs. Second, the rows of PageRank's connectivity matrix are normalized. Indeed, the performance of the other models improved when their connectivity matrices were normalized in a similar manner. Furthermore, we observe that the sparse models are most effective for producing hubs of knowledge while PageRank is the most effective model for ranking query results.

We also compare different search strategies. While the local method can bring forward documents that do not match the query term (the function of these documents is inferred from their broader graph context, through relations to other, better characterized objects), it is computationally prohibitive in real time. In comparison, global methods rely on precomputed prominence values, and no additional processing is required at query time. We also introduced the extended global method, which combines the benefits of both the focused subgraph method and the global method, thus allowing fast propagation of information to uncharacterized or unstudied objects.

Practical considerations promoted the use of global methods for a real-time ranking system. However, due to the size of the global connectivity matrix, this choice precludes us from computing many eigenspaces. Since PageRank's principal eigenspace tends to incorporate information from the other methods' non-principal eigenspaces, we conclude that the PageRank model with the principal eigenspace scoring function is the most feasible system for ranking Biozon query results. Indeed, our experiments have shown that PageRank produces the best results given these restrictions. Moreover, PageRank's performance under these conditions is comparable to the best performance observed from any combination of model and scoring function used in local or global mode. The PageRank model has other interesting aspects. For example, the inclusion of the matrix E allows one to explore various types of prior knowledge.

One advantage that Biozon has over the World Wide Web is that for the most part Biozon data is highly reliable. While web search engines such as Google have to deal with dynamic data of questionable quality, almost all relations in the Biozon database are well established relations that are derived from high quality observations and measurements. Therefore, our models do not have to address directly issues such as noise and "dead links". There are some exceptions. For example, high-throughput interaction data that is generated using the yeast two-hybrid system tends to contain many false positives. However, these relations constitute a small fraction of the Biozon data graph, and our tests indicate that their presence does not change the ranking significantly.

The heterogeneous data sets reveal other interesting properties and challenges. For example, upon testing the effect of similarity data on performance we observed that similarity relations can be useful, but they can also lead to mediocre performance because of problems of redundancy and localization. Thus, similarity data should be considered with the proper normalization. It should be noted that much of the similarity information already exists in another form. For instance, proteins are often classified into families based on similarity data, and this classification forms paths (of length two) between similar proteins in the Biozon data graph. Like our tests with similarity data, our tests with edge weighting open another direction that requires further analysis. Here, future work includes exploring more robust weighting and normalization schemes. Another type of information that is currently ignored is the context (such as the species).

Variations that will focus on organism-specific subnetworks will be another topic of future research, as well as the integration of information retrieval techniques and other variations of prominence models. Of special interest are non-linear projection methods that can handle large, non-planar graphs.

The application of all prominence models we tested involves non-trivial issues, and further improvements to a real-time database query system require additional study of the underlying data graph. In particular, it is desirable to precompute as much data as possible to minimize the amount of processing required for each individual query. Among the issues that one has to consider are the overlap between different eigenspaces and the meaning of different eigenspaces. One possibility is to utilize the causal structure and the connectivity within the graph, as analysis can be carried out separately on each connected component. To this end, we have isolated the connected components of the Biozon data graph and are studying their properties.

Finally, we should note that although the examples we used in the paper are single term queries, Biozon has the capability of ranking any query the user may execute. This includes multiple term queries and queries involving relations among multiple data types. The current ranking system, based on the PageRank prominence model, was integrated into the Biozon database and is available online at http://biozon.org.