GeneRank: Using search engine technology for the analysis of microarray experiments
© Morrison et al. 2005
Received: 19 April 2005
Accepted: 21 September 2005
Published: 21 September 2005
Skip to main content
© Morrison et al. 2005
Received: 19 April 2005
Accepted: 21 September 2005
Published: 21 September 2005
Interpretation of simple microarray experiments is usually based on the fold-change of gene expression between a reference and a "treated" sample where the treatment can be of many types from drug exposure to genetic variation. Interpretation of the results usually combines lists of differentially expressed genes with previous knowledge about their biological function. Here we evaluate a method – based on the PageRank algorithm employed by the popular search engine Google – that tries to automate some of this procedure to generate prioritized gene lists by exploiting biological background information.
GeneRank is an intuitive modification of PageRank that maintains many of its mathematical properties. It combines gene expression information with a network structure derived from gene annotations (gene ontologies) or expression profile correlations. Using both simulated and real data we find that the algorithm offers an improved ranking of genes compared to pure expression change rankings.
Our modification of the PageRank algorithm provides an alternative method of evaluating microarray experimental results which combines prior knowledge about the underlying network. GeneRank offers an improvement compared to assessing the importance of a gene based on its experimentally observed fold-change alone and may be used as a basis for further analytical developments.
Since its launch in 1998, the Google search engine has all but monopolized page searches on the world-wide web . The basis of this astonishing success is the PageRank algorithm developed by Google founders Larry Page and Sergey Brin , which allows efficient and stable prioritization of search results. Here we show how the basic idea of PageRank can be transferred quite intuitively to the analysis of gene expression datasets in molecular biology. We modify PageRank appropriately to produce a new algorithm, GeneRank, and explore its limits and potential for the analysis of both synthetic and real-world data.
Just as the original PageRank is stable against the artificial inflation of a web page's rank by web designers, we hope that GeneRank may obtain a more robust ranking of genes in (typically very noisy ) microarray experiments. While PageRank uses hyperlinks between web pages to achieve this end, we combine the expression measurements with external information, such as functional annotations, protein interaction data or previous experimental results.
Data sharing techniques have been successfully implemented previously using, for example, GO annotations [4, 5] or protein-protein interactions [6–8] to define the network connectivity. Justification for this is given by the observation that connected genes are more likely to be co-expressed . Additional advantages include the possibility for GO terms to provide insight into the process of co-expression  and for functional classification to be made for previously unannotated genes [7, 9]. These examples serve to demonstrate the value and feasibility of combining data from different sources. Not only is the analysis of a microarray experiment likely to be more robust when prior information is included, but networks rather than single genes can be identified as important, and further biological inferences can be drawn about the data with the aid of this added information.
Our aim in this work is to use connectivity data to produce a prioritization of the genes in a microarray experiment that is less susceptible to variation caused by experimental noise than one based on expression levels alone. This is achieved using GeneRank, a customised version of the PageRank algorithm.
The original PageRank algorithm also has a random walk interpretation where the ranks correspond to the invariant measure of a teleporting random walk on the web. This is equivalent to saying that the rank of a web page is proportional to the time spent at the web page whilst surfing the web. This idea can also be intuitively extended to ranking genes, where the rank of a gene is proportional to the amount of time a biologist should spend looking at a gene whilst analysing the experimental results.
As with the original algorithm, we require a network or graph to allow us to calculate a rank for each entity in the network. With the original algorithm, nodes represent web pages and a link exists between two nodes if one page contains a hyperlink to the other. This results in a directed graph. In our case, we define an undirected graph where a node represents a gene and the edges can be defined by some other "previous knowledge". For our purpose, we use either Gene Ontology annotations  or expression profile correlation coefficients. In addition to the network structure, we require a vector of expression changes, experimentally observed in a microarray experiment, as input for the algorithm. Other measures of differential expression could also be used, e.g. p-values, which are suitably transformed to ensure genes which are highly changed have a large input value. We shall only consider the use of expression changes.
For each gene, the expression level vector contains the value for its expression change in the experiment under consideration. The algorithm, GeneRank, also uses a free parameter, d, in the range [0..1] that controls the weighting of expression change to connectivity used in the calculations. If d = 0, the ranking returned is based solely on the absolute value of the expression fold-change for that gene. For d = 1 we return the ranking based on connectivity. By setting d in the range [0..1] we interpolate between these two extremes. One advantage of our approach is that we apply the algorithm to the entire network, i.e. we do not require a pre-defined threshold of important genes. We simply require an experimentally determined expression change and some connectivity information for each gene, and based on this GeneRank will provide a re-ordering of genes in terms of their apparent importance. While "importance" may seem a rather vague concept in a biological context, we will show how it can be assessed objectively in the Testing the Algorithm section below. For full details of the algorithm and the random walk interpretation, the reader is referred to the Methods section.
In addition to the gene expression data, GeneRank uses a network or graph as input. We use the absolute value (the algorithm requires positive expression change values) of the gene expression data as a weight for each node in the graph and define the network connectivity by some other criterion. We use either Gene Ontology annotations or correlation coefficients, but there are many other possibilities, e.g., metabolic networks, transcription factor networks, or protein-protein interactions. We also used synthetic networks with controlled topological features for evaluation purposes. The three types of network were constructed as follows.
Synthetic Network 1
Synthetic Network 2
Synthetic Network 3
A yeast stress data set consisting of 156 microarray experiments under a wide range of stress conditions was used to construct these networks. This data set is discussed in . We randomly removed 15 experiments from the data set and each was used as the expression change vector. The correlation coefficient  was calculated for each gene pair using the reported expression changes for the remaining 141 experiments. Edges were defined in the network for pairs of genes with correlation r > 0.5. This data was taken from the Stanford Microarray Database . Values for and C are given in Table 1.
To allow control over the network structure, synthetic networks were defined with 1000 genes. The genes were split into two sets, A and B, where the genes in set A ("changed genes") were allocated an expression change drawn from a N(2,1) distribution and the expression of the set B genes ("unchanged genes") were drawn from a N(0,1) distribution. Unless otherwise stated, |A| = 100 and |B| = 900 (the sizes of sets A and B). Edges were randomly assigned between genes with probabilities p A , p AB and p B , where these are the probabilities of two set A genes being connected, a set A gene being connected to a set B gene and two set B genes being connected, respectively. The clustering coefficient and the mean degree of the nodes depend on the values of p A , p AB and p B . Representative examples are listed in Table 1.
Synthetic network 1 is the standard case with equal expected degree across both sets and |A| = 100. Synthetic network 2 is the case where E deg A = 1.5E deg B. Also the relative connectivity ( ) in both cases is 1 since this is where optimal results are observed (see relative expected degree section below). Synthetic network 3 is the case where |A| = 500 and again we have an equal expected degree across boths sets. Here the relative connectivity is 0.2618, again where optimal results are observed for this network (see Relative set size section below). The expected degree of a node in A is E deg (A) = (|A| - 1)p A + |B|p AB and for set B is E deg (B) = |A|p AB + (|B| - 1)p B .
Since we are trying to improve the ranking of genes produced in microarray experiment, we need to quantify the quality of the ranking produced by the algorithm. In the case of synthetic data, we know that all genes in set A should be ranked above the genes in set B, which gives us a basis for comparing the ranking from experimentally observed fold-changes with the re-ordered ranking produced by GeneRank. To quantify GeneRank results, we used the Area Under the Receiver Operating Characteristic Curve (AUC) [17, 18]. This value describes how well the ranked list discriminates between genes in set A and set B. It will have a value of 0.5 if sets A and B are randomly mixed and a value of 1 if they are perfectly separated, i.e. if all the genes of set A ("true changed genes") appear at the very top of the list. We emphasise that with d = 0 the algorithm is equivalent to ranking on pure fold-change.
The construction of synthetic networks allows us to obtain full control over the network structure. We experimented with various network parameters.
A number of observations can be made from the experimental results where E deg (A) > E deg (B):
The maximum AUC achieved by the algorithm increases as the difference between E deg (A) and E deg (B) increases. For the case where α = 1.5, we come particularly close to the maximum value of 1, (0.98).
As the difference between E deg (A) and E deg (B) increases, we see less distortion of results even for unsuitably high values of d. This is consistent with the fact that increasing d gives connectivity a greater influence in the ranking.
The maximum AUC achieved in each case occurs at larger values of d as the difference between the expected degrees increases. Again this is consistent with a high value of d corresponding to a greater weighting of connectivity on the algorithm.
The improvement by the algorithm over expression change ranking is greater when the difference between the expected degree of both sets is greater.
To summarise these findings, a higher expected degree of set A genes compared with set B results in the algorithm producing a higher AUC, and hence more accurate results.
These results on synthetic networks suggest that for certain network structures GeneRank can achieve a significant improvement over ranking based on pure differential expression. The relative expected degree of sets A and B has a considerable effect on performance. In cases where the algorithm performs well (E deg (A) > E deg (B)), the optimal results occur when 0.75 ≤ d ≤ 0.85. It is curious, but probably pure coincidence, that the value d = 0.85 is reportedly used by Google . At such a high value of d, the algorithm is giving significant weight to connectivity information, as is appropriate considering the high level of experimental noise in the simulated expression data.
However, in our tests the quality of the results generally decreases for values of d beyond ≈ 0.85, showing that we do require some expression change information to make the best possible interpretation. It appears that although optimal results arise when there is some expression considered in the ranking, a major contributory factor to the success of the algorithm is the high relative degree of the genes that are differentially expressed. This structure is certainly not present in all biological networks. In the next section we explore whether suitable real networks can be identified.
As described earlier, we construct the GO networks by defining an edge between two genes if they share an annotation allocated by the Gene Ontology Consortium. This allows us to construct three networks, one for each section defined by the Gene Ontology: Biological Process, Cellular Component and Molecular Function.
To quantify the results we calculated the quality index B as
where alt_PR is the GeneRanked position after the expression of the gene has been artificially altered, orig_exp_rank is the original expression-based position in the list, and alt_exp_rank is the expression-based position after the differential expression has been set to 0.
In the case where we are altering a gene in the top 200, a 'boosting' effect is observed and the ranked position after the fold-change has been moved towards the original ranked position. We can observe groups of genes which are boosted to the same level (shown by 'lines' of blue asterisks). It is likely that these genes are a completely connected subgraph, which results in all genes being given the same ranking. Altering genes which were originally ranked in the top 200, we achieve B = 0.7728. This effect is not observed for a random set of genes (B = 0.4165).
Again we calculate a value of our quality index B as described above. As a result of the high degree of the top 200 genes, we see that a high level of 'boosting' is achieved, as demonstrated by the high values of B. In other words, we are able to significantly raise the position of the altered gene in comparison with the ranked position that would have been observed had the standard expression change ranking been used.
The purpose of this work was to explore the possibility of adapting the PageRank algorithm, used by Google in assessing the importance of web pages, for the task of prioritizing the 'importance' of genes in a microarray experiment. Our new algorithm, GeneRank, allows connectivity and expression data to be combined to produce a more robust and informed summary of an experiment, compared to the standard procedure of basing the importance of a gene on its measured expression change. Although we use expression change values as expression data, this is not restricted, and some other means of capturing the expression information may also be used. GeneRank can be justified theoretically and has been tested on synthetic data, experimental data and a combination of both. The algorithm has a single parameter, d, that controls the relative weighting of expression and connectivity information. A value d = 0 ranks genes based on pure expression information and a value d = 1 ranks on pure connectivity degree. The optimal value of d is data-dependent, but based on our results we suggest d = 0.5 for general use. With d = 0.5 we observed no deterioration and generally an improvement over ranking based on pure expression change in the case where we combined real connectivity information with synthetic expression changes. GeneRank is simple to implement, gives a principled approach to combining different data types, and is a novel instance of applying search engine technology to this important task. We note that GeneRank results are not designed to replace the actual expression measurements, but should be used alongside the results with additional biological knowledge, to draw attention to unusual structures within the data. For example, a gene which is not viewed as important from the microarray results alone but is highly ranked in the GeneRank results, should be given further biological consideration.
While the improvement of gene rankings upon application of GeneRank is already significant in the examples presented, it may become even more so once comprehensive high-quality biological network information becomes available. Of particular interest in that respect will be transcriptional regulatory networks, such as are now being generated by technologies like ChIP-chip (see [19–21] for early examples using yeast as a model organism). As discussed above, the information encoded in such regulatory networks will be intuitively amenable to GeneRank analysis. It will also re-introduce an element of directedness into the network, moving it even closer to the original PageRank application.
We summarize the basic PageRank algorithm which was developed by Larry Page and Sergey Brin at Stanford University  and forms the basis of the successful search engine Google. Further details may also be found in [1, 22].
PageRank assigns a measure of relevance or importance to each web page, allowing Google to return high-quality pages in response to a user query. The algorithm is designed to be robust to methods of deception, where web page designers attempt to artificially boost the PageRank of their page by altering the local link structure. Robustness follows from the recursive nature of the algorithm, where a page is highly ranked if it is linked to by other highly ranked pages. A link from page i to page j is regarded as a "vote of confidence" for page j from page i. The algorithm views the web as a directed graph G(V, E), where the N nodes V are the web pages and the edges E represent the links between pages. This information can be stored in an adjacency matrix, W ∈ R N × N , where w ij = 1 if there is a link from page i to page j and w ij = 0 otherwise. We define deg i := to be the degree (more precisely, the out-degree) of the ith page. Suppose we have assigned an initial ranking r  ∈ N . The PageRank algorithm proceeds iteratively, updating the ranking for the jth page from to according to the formula
(I - dW T D -1)r = (1 - d)e, (2)
where I is the identity matrix, W T is the transpose of W, D = diag(deg i ) and e ∈ N has all e i = 1. Applying PageRank is equivalent to applying the Jacobi iteration  to (2), and convergence to a unique solution r is guaranteed under the condition
ρ(dW T D -1) < 1, (3)
where ρ(·) denotes the spectral radius. The convergence condition (3) holds for any 0 <d < 1.
teleports: with probability 1 - d moves to a new page, chosen uniformly over all web pages, or,
surfs: with probability d moves to a page that is linked to from page i; in this case each page j such that w ij = 1 is equally likely to be chosen as the destination.
The PageRank vector r, when normalised so that its components sum to one, corresponds to the invariant measure for this process. In other words, r j is the long-time proportion of visits made to page j. A further interpretation based on mean hitting times rather than invariant measures is given in . The biological implication of the random walk interpretation is discussed in the description of the algorithm in the Results and Discussion section.
The PageRank idea translates intuitively to the analogous situation of gene expression analysis. Instead of producing a ranked list of web pages, we produce a ranked list of genes. PageRank views hyperlinks as votes of confidence, so we similarly allow functional connections to boost rank. Just as PageRank counts votes from a highly-ranked page as more influential than votes from a lowly-ranked page, we will allow connections to genes with high differential expression to carry greater significance than connections to genes with low differential expression. Figure 1 gives a graphical view of the concept.
PageRank gives each web page a rank of (1 - d) "for free". We will adapt this to give each gene a rank of (1 - d)ex i , where ex i is the absolute value of the expression change for gene i. Letting denote the ranking of gene j after the nth iteration, we take initial ranking r  = ex/||ex||1, where ||·||1 denotes the vector 1-norm. Then we let
Here W ∈ N × N is the connectivity matrix for the gene network, so w ij = w ji = 1 if genes i and j are connected and w ij = w ji = 0 otherwise.
We remark that this iteration may also be motivated from the viewpoint of personalised PageRanking [1, 2], where teleporting jumps in the random walk process are biased towards a user's preferred locations – here, we are biasing according to expression level.
The iteration (4) corresponds to Jacobi on the system
(I - dW T D -1)r = (1 - d)ex, (5)
and, because the iteration matrix has not been altered, the condition that convergence is guaranteed for all 0 <d < 1 continues to hold. Since W is symmetric as the network is undirected, we could replace W T by W. This is unlike the original algorithm, where a directed network is used.
In summary, the GeneRank algorithm is finding the customised ranking vector r defined by the linear system (5). A Matlab implementation of the algorithm is available in the additional file geneRank.m The random walk interpretation carries through to this more general setting. If the teleporting step is re-defined so that the destination gene is not chosen uniformly over the whole set, but rather is chosen with probability proportional to absolute expression level, then r in (5), suitably scaled, is the invariant measure. Overall, we have a true generalization of PageRank in the sense that (a) the algorithm has both "vote of confidence" and "random walk" interpretations and (b) for the case where all ex i = 1 we recover the original PageRank algorithm.
It is trivial to check that with the choice d = 0 the system (5) has solution r = ex. In this case the genes are ranked purely on expression level. We will now study the other extreme, where d = 1, and show that this case may be regarded as ranking purely on connectivity.
For d = 1, the iteration (4) becomes
and the system (5) for the corresponding fixed point becomes
(I - W T D -1)r = 0. (7)
First, we show that the sum of the rankings is preserved by the iteration. From (6),
Now, we observe that ρ(W T D -1) ≤ ||W T D -1||1 = 1, and hence all eigenvalues of W T D -1 are less than or equal to 1 in modulus. Because W is symmetric, we have W T D -1 deg = W T e = deg, showing that there is at least one eigenvector, deg, corresponding to eigenvalue 1. Suppose now that λ = 1 is a simple eigenvalue of W T D -1 and that r* with ||r*||1 = 1 is another solution of (7). Then
So r* - deg/||deg||1 is an eigenvector of W T D -1 corresponding to eigenvalue 1. It follows that r* - deg/||deg||1 must be a multiple of deg and hence r* = ± deg/||deg||1 . We may summarize our findings in the following result.
Result If the eigenvalue λ = 1 of W T D -1 is simple, then r = deg/||deg||1 is the unique solution of (7) that satisfies the required constraints ||r||1 = 1 and r i ≥ 0.
Overall, we conclude that the extremal parameter values d = 0 and d = 1 represent ranking by pure expression level and pure degree, respectively, and hence by changing the value of d we may interpolate between these two extremes.
Receiver Operating Characteristic
Area Under ROC Curve
JLM was supported by a Synergy scholarship, a jointly supervised studentship between the universities of Strathclyde and Glasgow.
RB was supported by a BBSRC grant (17/G17989) to Anna Amtmann and a personal research fellowship from the Caledonian Research Foundation..
DJH was supported by a Fellowship from the Royal Society of Edinburgh/Scottish Executive Education and Lifelong Learning Department and by the EPSRC grant GR/562383/01.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.