Data from a lung cancer study
For illustration we used expression data from a cohort of 123 lung cancer patients who underwent complete surgical resection at the Institut Mutualiste Montsouris (Paris, France) between 30 January 2002 and 26 June 2006. Participants were recruited in compliance with the Helsinki Declaration, and the study was approved by the ethics committee of the Institut Gustave Roussy (Paris). All patients gave written informed consent. From each patient, snap-frozen tumor and corresponding normal tissues were collected. Gene expression was assayed using a dual-color human array from Agilent containing 41,000 gene probes; a dye-swap was performed for each sample and the log-ratios were combined by averaging. Scanned microarray images were analyzed using Agilent Feature Extraction software version 10.5.1.1. Loess normalization was then performed using the normalizeWithinArrays function from the R package limma [16].
For a subset of 16 tumor samples we also obtained protein expression profiles. From each sample, 160 μg of protein was taken and precipitated using four volumes of ice-cold acetone. The precipitated samples were dissolved in iTRAQ dissolution buffer and digested according to the manufacturer's instructions (Applied Biosystems). Three proteomics platforms were used: Applied Biosystems' MALDI TOF-TOF, Agilent's Accurate-Mass Q-TOF and a hybrid LTQ-Orbitrap from Thermo Fisher Scientific. Across these platforms, a total of 4,406 proteins were detected in at least one sample.
Altered gene sets
AGSs are usually defined as genes that are differentially expressed (DE) between clinical groups. In our current application we defined the AGS as the DE genes in the within-person comparison of tumor vs normal tissue, so each sample (patient) had an associated AGS. The top-ranking genes are interpreted not as statistically significant DE genes, but as potentially deregulated genes. To investigate the functional heterogeneity of individual tumors, we ranked genes by how each individual differed from the rest of the group. To this end, we transformed the mRNA expression data into difference values that convey how uncommon the differential expression of gene g in patient p is compared to the group of patients:
$$d_{gp} = \log(T/N)_{gp} - \frac{1}{P}\sum_{q=1}^{P} \log(T/N)_{gq}, \qquad (1)$$
where log(T/N) is the log intensity-ratio of tumor vs normal expression. For the AGS with a fixed size, we considered the top 100 (k = 100) genes per individual.
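As an illustrative sketch of this construction (in Python rather than the R used in the study; the data are simulated, and ranking genes by the absolute difference value is our assumption, not spelled out in the text):

```python
import numpy as np

def fixed_size_ags(logratios, k=100):
    """Fixed-size altered gene sets (AGSs), one per patient.

    logratios: (genes x patients) array of log(T/N) values.
    Each gene is centered by its mean across patients to obtain the
    difference value, and the k genes with the largest absolute
    difference are kept for each patient (ranking by |difference|
    is an assumption).
    """
    d = logratios - logratios.mean(axis=1, keepdims=True)
    order = np.argsort(-np.abs(d), axis=0)  # per-patient descending rank
    return [order[:k, p] for p in range(logratios.shape[1])]

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8))   # 1,000 hypothetical genes, 8 patients
ags = fixed_size_ags(x, k=100)   # 8 AGSs of 100 gene indices each
```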
The AGS from the protein expression data was defined similarly: for each sample we kept the proteins whose expression exceeded the corresponding average across the 16 samples. The number of proteins per AGS had a median of 424 (range 257 to 654).
NEA statistics and randomization
Let A(k) be an AGS of size k, and define $n_{AF}(k)$, a measure of network connectivity between A(k) and a known FGS F, as the number of links between members of A(k) and F. In practice we might be interested in the network connectivity for a fixed number k of top-ranking genes. Since the connectivity depends on the constituent genes, we correct $n_{AF}(k)$ by its expected value:

$$d_{AF}(k) = n_{AF}(k) - \mu_{AF}(k),$$

where $\mu_{AF}(k)$ is the expected number of links between A(k) and F. The computation of $\mu_{AF}(k)$ is performed using the network randomization described below. To avoid dependence on k, we define the largest (signed) deviation between $n_{AF}(k)$ and its expected value:

$$d_{AF} = d_{AF}(k^{*}), \qquad k^{*} = \arg\max_{k}\left|n_{AF}(k) - \mu_{AF}(k)\right|.$$
The immediate statistical question is whether an observed $d_{AF}(k)$ or $d_{AF}$ is significantly higher or lower than expected under the null hypothesis of no biological relation. This assessment is not trivial, as we need to respect the network topology. Crucially, it is not adequate to compare the AGS with a random list of genes. The scale-free property of most biological networks means that the degree distribution is highly skewed, with a few network hubs but a large number of sparsely connected nodes. (The degree of a node is the number of neighbors connected to that node.) Randomly replacing a network hub by a sparsely connected gene does not make sense: a hub gene may have a few links to genes in an FGS simply by chance, whereas for a non-hub gene the same number of links would indicate an important biological pattern.
In some simple cases, analytical calculations of the null distribution are possible, for example using the hypergeometric distribution [17]. However, for general networks, e.g. when gene groups may share members, theoretical analyses are not feasible. In such a situation, for example, Li et al. [18] chose to limit their analysis to cases where the gene groups did not overlap, which is a serious limitation.
A general approach to generating the null distribution of the network statistic is based on a network randomization that preserves the degree distribution. This approach was applied, for example, in [19–22]. The basic algorithm rewires the whole network by systematic swapping of network links so that the degree of each node is preserved while its network neighbors are replaced. The algorithm is as follows:
• Step 0: Randomly select a pair of edges (A-B and C-D).
• Step 1: ‘Rewire’ the two edges in such a way that A becomes connected to D, while B connects to C. In case one or both of these new links already exist in the network, this step is aborted and a new pair of edges is selected.
Repeated application of the above steps leads to a randomized version of the original network. However, for a large network it is not clear at what point a sufficient level of rewiring has been achieved, and the algorithm is also inefficient, as it rejects many proposed swaps that clash with links rewired earlier.
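A minimal sketch of this rewiring scheme (Python, on a hypothetical edge list; not the implementation used in the paper):

```python
import random

def rewire(edges, n_swaps, seed=None):
    """Degree-preserving randomization by repeated edge swaps.

    edges: list of undirected (a, b) pairs with a != b.
    Each step picks two edges A-B and C-D at random and replaces
    them with A-D and C-B; if a proposed link already exists, would
    be a self-loop, or the two proposals coincide, the step is
    aborted and a new pair is drawn, as described above.
    """
    rnd = random.Random(seed)
    edges = [tuple(sorted(e)) for e in edges]
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rnd.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        e1, e2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if a == d or c == b or e1 == e2 or e1 in present or e2 in present:
            continue  # abort this swap and pick a new pair
        present -= {edges[i], edges[j]}
        present |= {e1, e2}
        edges[i], edges[j] = e1, e2
    return edges
```

Each accepted swap leaves every node's degree unchanged, which is the point of the scheme; only the neighbor identities move.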
The key to a fast and transparent algorithm is to first represent the network in terms of a binary adjacency table, with each node represented by a row and a column, and the cell entry ‘1’ represents connected nodes, and ‘0’ otherwise. The table is symmetric, and its margin is equal to the degree distribution of the nodes. Thus our goal is to generate a randomized table with fixed margins, similar to the hypergeometric sampling model associated with the Fisher’s exact test, but with the additional constraints of symmetry and binary entries. Adapting the hypergeometric sampling model to these constraints, the network randomization is achieved by the following sequential algorithm:
• Step 0: Start with a list of nodes 1,…,N, ordered by degree $\delta_1 \geq \cdots \geq \delta_N > 0$.
• Step 1: Assign the $\delta_1$ links of the largest node to the other nodes randomly, with probability weighted by $\delta_2,\ldots,\delta_N$.
• Step 2: Remove node 1, and reduce by 1 the degree of each node connected to node 1.
• Step 3: Construct a new list of remaining nodes with positive degrees, then reorder and renumber them 1,…,N, and go to Step 1. Stop if there is no remaining node with positive degree.
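The sequential steps above can be sketched as follows (Python; a bare illustration on a toy degree sequence, with no backtracking if the sequence cannot be completed):

```python
import numpy as np

def sample_network(degrees, seed=None):
    """One draw from the degree-weighted sequential randomization.

    degrees: dict node -> target degree. Repeatedly take the node
    with the largest remaining degree (Step 1), connect it to
    distinct other nodes chosen with probability proportional to
    their remaining degrees, then remove it and reduce the
    neighbours' degrees (Steps 2-3). Returns the generated edges.
    """
    rng = np.random.default_rng(seed)
    deg = dict(degrees)
    edges = []
    while True:
        alive = sorted((n for n in deg if deg[n] > 0),
                       key=lambda n: -deg[n])
        if not alive:
            return edges
        v, rest = alive[0], alive[1:]
        if deg[v] > len(rest):
            # no backtracking in this sketch
            raise ValueError("degree sequence cannot be completed")
        w = np.array([deg[n] for n in rest], dtype=float)
        for i in rng.choice(len(rest), size=deg[v], replace=False,
                            p=w / w.sum()):
            edges.append((v, rest[i]))
            deg[rest[i]] -= 1
        deg[v] = 0

# a star-shaped degree sequence always completes:
edges = sample_network({"a": 3, "b": 1, "c": 1, "d": 1}, seed=1)
```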
At first appearance the use of the weights $(\delta_2,\ldots,\delta_N)$ makes it more difficult to understand the distributional properties of the generated networks. Here we explain why the weights lead to an unbiased random sample of networks with a given degree distribution. First consider a naive algorithm for generating networks uniformly from all networks with the given degree distribution:
1. Construct a random network: for i = 1,…,N, take $\delta_i$ nodes randomly with equal weights as neighbors of the i-th node. The choices of neighbors are made independently across different nodes.
2. Test the network: compute the resulting degree distribution and accept the network if it has the required degree distribution; otherwise reject it.
Since a great majority of generated networks will not satisfy the given degree distribution, and will thus be rejected, this algorithm is highly inefficient. However, the accepted networks are clearly uniformly distributed over all networks with the given degree distribution. Now, at Step 1 the probability that the second-largest node has a link with the largest node is

$$\frac{\delta_2}{\delta_2 + \cdots + \delta_N}.$$

Likewise, the probability that the node with degree $\delta_i$ has a link with the largest node is $\delta_i/(\delta_2 + \cdots + \delta_N)$. This is exactly the same as the weights in our algorithm. Thus, our weighting scheme simply offers a short-cut to produce truly unbiased random networks. A similar proof was used in [23]. Our proposed algorithm is computationally very efficient, since each node needs to be visited at most once. The average number of nodes that need to be visited depends on the degree distribution. In our application, with ∼16,000 nodes and ∼1.5 M links (see the subsection on the gene network below), this average was ∼8,200 nodes. On a 3 GHz PC with 4 GB RAM, using native R commands, the algorithm takes ∼60 seconds.
Denote by $n^{*}_{AF}(k)$ the number of links between A(k) and F based on the randomized network. On replicating the randomization many times (in practice we used 100 replications), we obtain a collection of $n^{*}_{AF}(k)$'s, which behave as a random sample from the null distribution. We can then estimate $\mu_{AF}(k)$ by the average of $n^{*}_{AF}(k)$ over the randomizations, compute $d^{*}_{AF}(k)$ and $d^{*}_{AF}$ based on the randomized network, and compute the right-sided p-value as the proportion of $d^{*}_{AF}$'s $\geq d_{AF}$ (similarly the left-sided p-value). The one-sided p-value is the minimum of these two, and the two-sided p-value is twice the one-sided p-value. We checked that the p-values computed for a typical FGS (with median size), across all AGSs and all the randomized networks, followed a uniform distribution, indicating that the randomization worked as expected; see Figure 1.
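The p-value computation from the randomization replicates can be sketched as (Python; the function name is ours):

```python
import numpy as np

def nea_pvalues(d_obs, d_null):
    """Empirical NEA p-values from randomized-network replicates.

    d_obs: observed deviation d_AF; d_null: values of the same
    statistic computed on the randomized networks. Returns the
    right-, left-, one- and two-sided p-values as defined in the
    text (the two-sided value is capped at 1).
    """
    d_null = np.asarray(d_null, dtype=float)
    right = float(np.mean(d_null >= d_obs))
    left = float(np.mean(d_null <= d_obs))
    one = min(right, left)
    return right, left, one, min(1.0, 2.0 * one)
```

A common refinement, not used here, adds one to both numerator and denominator of the empirical proportions so that a p-value of exactly zero cannot occur.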
For easy comparison with other methods we computed the maximum NEA (MNEA) score as

$$z_{AF} = \frac{d_{AF} - \bar{d}^{\,*}_{AF}}{s_{AF}},$$

where $\bar{d}^{\,*}_{AF}$ and $s_{AF}$ are the mean and standard deviation of $d^{*}_{AF}$. Similarly, the fixed NEA (FNEA) score is computed as

$$z_{AF}(k) = \frac{n_{AF}(k) - \bar{n}^{\,*}_{AF}(k)}{s_{AF}(k)},$$

where $\bar{n}^{\,*}_{AF}(k)$ and $s_{AF}(k)$ are the mean and standard deviation of $n^{*}_{AF}(k)$. We use the term NEA when referring to both FNEA and MNEA. These scores can be used as a measure of activation of the FGS, thus providing a biological characterization of the AGS. Assuming a normal distribution, one can convert the scores to p-values, which are convenient for combining the results of network analyses with information from other findings, such as genome-wide association studies, small-scale experiments, mutation analysis, etc.
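Both scores standardize an observed statistic by the mean and standard deviation of its null replicates; a sketch (Python; using the sample standard deviation is our assumption, as the text does not specify it):

```python
import math
import numpy as np

def nea_z(stat_obs, stat_null):
    """Standardized NEA score: (observed - null mean) / null SD.

    For MNEA the statistic is d_AF; for FNEA it is n_AF(k), with
    the null replicates taken from the randomized networks.
    Sample SD (ddof=1) is an assumption.
    """
    stat_null = np.asarray(stat_null, dtype=float)
    return (stat_obs - stat_null.mean()) / stat_null.std(ddof=1)

def z_to_pvalue(z):
    """Two-sided p-value for a NEA score under normality."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```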
To assess a large number of AGS-FGS pairs, we transform the p-values into false discovery rates (FDR) [24]. Let $p_1,\ldots,p_m$ be the ordered p-values from m AGS-FGS pairs. The standard estimate of FDR as a function of the p-values is given by

$$\mathrm{FDR}(p_k) = \frac{\hat{\pi}_0\, m\, p_k}{k},$$

where $\hat{\pi}_0$ is the estimated proportion of null results. Monotonicity is then imposed by taking the cumulative minimum over all p-values $\geq p_k$.
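This FDR computation can be sketched as (Python; with the conservative default $\hat{\pi}_0 = 1$, the usual Benjamini-Hochberg choice):

```python
import numpy as np

def fdr_estimate(pvals, pi0=1.0):
    """FDR estimate pi0 * m * p_(k) / k with enforced monotonicity.

    pvals: p-values from the m AGS-FGS pairs, in any order. Raw
    estimates are computed on the sorted p-values, the cumulative
    minimum is taken from the largest p-value downwards, and the
    results (capped at 1) are returned in the original order.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    raw = pi0 * m * p[order] / np.arange(1, m + 1)
    mono = np.minimum.accumulate(raw[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(mono, 1.0)
    return out
```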
Comparative procedures: GEA and GSEA
For comparisons with NEA, the GEA score was computed from an AGS of size k as follows. First, a standard 2×2 table is constructed for each AGS-FGS pair according to whether or not a gene belongs to the AGS and to the FGS. In small samples Fisher's exact p-value is usually computed using the hypergeometric distribution, which, as explained above, is equivalent to randomizing the genes while fixing the marginal totals. Liu et al. [12] used this assessment for their statistic, which made the method more like a GEA than a true NEA, as it ignored the network topology in its inference. For convenience and easy comparison with NEA scores, we compute a GEA score as

$$z = \frac{\log(ad/bc)}{\sqrt{1/a + 1/b + 1/c + 1/d}},$$

where (a, b, c, d) are the table frequencies. This can be seen as the z-statistic associated with the log odds-ratio. If any cell of the table is zero, it is replaced with 0.5 to prevent z from being undefined.
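A sketch of this score (Python):

```python
import math

def gea_score(a, b, c, d):
    """GEA score: z-statistic of the log odds-ratio of a 2x2 table.

    (a, b, c, d) are the table frequencies; zero cells are replaced
    by 0.5, as described in the text, so the score stays defined.
    """
    a, b, c, d = (x if x > 0 else 0.5 for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or / se
```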
Similar to MNEA, gene-set enrichment analysis (GSEA) avoids the dependence on k by taking the maximum deviation from zero encountered in a running-sum statistic [1]. The GSEA enrichment score is defined as

$$ES = S_{i^{*}}, \quad i^{*} = \arg\max_i |S_i|, \quad S_i = \sum_{j=1}^{i}\left\{\frac{I(g_j \in F)}{N_F} - \frac{I(g_j \notin F)}{N_R - N_F}\right\},$$

where $g_j$ is the j-th ranked gene in the AGS, $N_R$ is the number of genes in the ranked list R, and $N_F$ is the number of those genes belonging to the FGS F. As with GEA, however, GSEA does not use any network topology in its inference. To assess the significance of an observed enrichment score, a permutation test procedure was originally developed for group comparisons. In our application the AGS is defined for each person, so the permutation is performed within each person.
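A sketch of the running-sum statistic in its classic unweighted (Kolmogorov-Smirnov) form (Python; the weighted form of [1] additionally scales the up-steps by the gene-level statistic):

```python
def gsea_es(ranked_genes, fgs):
    """Unweighted running-sum enrichment score.

    ranked_genes: the genes in rank order; fgs: the gene set. The
    sum steps up by 1/N_F at each FGS gene and down by
    1/(N_R - N_F) otherwise; the ES is the running-sum value of
    largest magnitude (maximum deviation from zero, sign retained).
    """
    fgs = set(fgs)
    n_r = len(ranked_genes)
    n_f = sum(1 for g in ranked_genes if g in fgs)
    s = es = 0.0
    for g in ranked_genes:
        s += 1.0 / n_f if g in fgs else -1.0 / (n_r - n_f)
        if abs(s) > abs(es):
            es = s
    return es
```

When the FGS genes cluster at the top of the ranking the score approaches +1; when they cluster at the bottom it is negative.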
Gene network
We compiled a comprehensive network by merging four networks: (1) the data-integration network of functional coupling [21], re-generated with updated datasets on protein-protein interactions (IntAct), gene expression (GEO) and sub-cellular localization (GO); this network contains 1,515,318 links, with each gene identified by its ENSEMBL gene ID; (2) curated pathways (KEGG KGML files, as of June 2010), consisting of 25,765 links; (3) known protein complexes (KEGG and CORUM, as of June 2010), consisting of 54,046 links; (4) coherent annotations (coherence was estimated by custom software as the simultaneous presence of two genes in a series of overlapping GO terms, weighted for compactness of the latter), consisting of 20,412 links.
The final network contained 1,445,027 functional links between 16,299 distinct HUPO genes (some links were lost as we translated the ENSEMBL IDs to HUPO gene symbols). Note that no evidence from the lung cancer datasets was involved in the network construction. The network is available from http://www.meb.ki.se/~yudpaw.
Functional gene sets
To characterize AGSs by involvement of known biological processes, we compiled a list of genes which are of importance in cancer:
• All 235 pathways in the KEGG database (as of 21 Apr. 2010);
• 16 GO terms that could be related to hallmarks of cancer [25];
• 7 cancer-related pathways and otherwise defined FGS from publications reporting on large-scale cancer genome projects;
• EMT, epithelial-mesenchymal transition (a pathway behind metastatic development, manual curation by S. Souchelnytskyi);
• Acidic switch (a pathway behind tumor-specific pH-shift, i.e. tumor’s ability to grow in hypoxic environment, manual curation by A. de Milito).
In total the list included 5,698 distinct HUPO gene symbols assigned to 264 FGSs (overlaps are allowed). The list of pathways with their meanings and references is given in Section B of the Additional file 1 Report. The complete list of genes for each FGS is available as an RData file at http://www.meb.ki.se/~yudpaw.