An effective method for network module extraction from microarray data
- Priyakshi Mahanta1,
- Hasin A Ahmed1,
- Dhruba K Bhattacharyya1 (corresponding author) and
- Jugal K Kalita2
https://doi.org/10.1186/1471-2105-13-S13-S4
© Mahanta et al; licensee BioMed Central Ltd. 2012
- Published: 24 August 2012
Abstract
Background
The development of high-throughput microarray technologies has provided various opportunities to systematically characterize diverse types of computational biological networks. Co-expression networks have become popular in the analysis of microarray data, for example for detecting functional gene modules.
Results
This paper presents a method to build a co-expression network (CEN) and to detect network modules from the built network. We use an effective gene expression similarity measure called NMRS (Normalized mean residue similarity) to construct the CEN. We have tested our method on five publicly available benchmark microarray datasets. The network modules extracted by our algorithm have been biologically validated in terms of Q value and p value.
Conclusions
Our results show that the technique is capable of detecting biologically significant network modules from the co-expression network. Biologists can use this technique to find groups of genes with similar functionality based on their expression information.
Keywords
- Span Tree
- Network Module
- Connected Region
- Soft Thresholding
- Hard Thresholding
Introduction
The development of high-throughput microarray technologies has provided a range of opportunities to systematically characterize diverse types of biological networks. Biological networks can be broadly classified as protein interaction networks [1–3], metabolic networks [4–6] and gene co-expression networks [7]. These networks provide an effective way to summarize gene and protein correlations. In this paper, we focus on gene co-expression networks. A gene co-expression network is an undirected graph in which nodes represent genes and two nodes are connected by an edge if the corresponding gene pair is significantly co-expressed. Gene co-expression networks capture the association between individual genes in terms of their expression similarity and give a network-level view of the similarity among a set of genes. Two genes are connected by an undirected edge if their activities have a significant association, as computed from gene expression measurements using a measure such as Pearson correlation, Spearman correlation or mutual information. Compared to gene regulatory networks, a gene co-expression network is built upon gene neighborhood relations, which give interesting geometric interpretations of the network. One of the most important applications of gene co-expression networks is to identify functional gene modules [8], or network modules, which are represented by the strongly connected regions of the co-expression network.
Problem formulation
Due to the non-transitive nature of connections among genes, the genes in a gene expression data set form a very complicated connectivity network with respect to a particular similarity measure. Such a connectivity network is often referred to as a co-expression network. A major use of this co-expression network is the extraction of network modules, which represent the strongly connected regions of the network. These modules may contain highly co-expressed genes that are functionally similar.
The co-expression network is modeled as a graph G = {V, E} in which:
- 1. Each vertex v∈V represents a gene.
- 2. Each edge e∈E represents a connection between a pair of vertices v1, v2, where v1, v2 ∈V.
- 3. There is an edge between two vertices v1, v2 ∈V if the similarity of the genes corresponding to the vertices is more than a user-defined threshold.
Our contribution
We claim the following contributions in this paper.
- We introduce an effective gene similarity measure, NMRS.
- We propose an approach to construct a co-expression network using NMRS.
- We develop a spanning tree based method to extract the potential network modules.
Background
In the literature, a number of techniques have been proposed for gene co-expression network construction. When inferring co-expression networks from gene expression data, the algorithms take a gene expression dataset as primary input and then, using a correlation-based proximity measure, construct the corresponding co-expression networks. Frequently used correlation-based measures are the Pearson correlation coefficient, the Spearman correlation coefficient and mutual information. Approaches such as [9, 10] used the Pearson correlation coefficient to extract the association among genes in a co-expression network. The Spearman correlation coefficient is used as a gene expression similarity measure to construct co-expression networks in [10]. Butte and Kohane [11] and Steuer et al. [12] report the use of mutual information to find similarly expressed gene pairs in such networks. While some studies attempted to apply algorithms directly to the adjacency matrices of networks to partition network nodes into groups [13, 14], other studies rely on special-purpose algorithms for identifying subnetworks with certain properties [15].
Generally, in a co-expression network, the connections between genes are obtained from the absolute values of a co-expression measure. Several researchers have suggested thresholding the value of the co-expression measure to construct gene co-expression networks. There are two ways to pick a threshold. Hard thresholding picks a single number based on the notion of statistical significance, so that gene co-expression is encoded as binary information (connected=1, unconnected=0). Soft thresholding weighs each connection by a number between 0 and 1. The drawbacks of hard thresholding include loss of information regarding the magnitude of gene connections and sensitivity to the choice of the threshold. Generally, hard thresholding results in unweighted networks while soft thresholding results in weighted networks.
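The difference between the two thresholding schemes can be illustrated with a minimal sketch. The function names, the toy similarity matrix and the parameter names `tau` and `beta` are assumptions made for illustration; the power form used for soft thresholding is one common choice from the weighted-network framework of [16], not something prescribed by this paper.

```python
def hard_threshold(sim, tau):
    """Binary adjacency: 1 if |similarity| >= tau, else 0 (no self-loops)."""
    n = len(sim)
    return [[1 if i != j and abs(sim[i][j]) >= tau else 0 for j in range(n)]
            for i in range(n)]


def soft_threshold(sim, beta):
    """Weighted adjacency in [0, 1] using the power function |s| ** beta.

    The power form is an assumed choice; any monotone map to [0, 1] would do.
    """
    n = len(sim)
    return [[abs(sim[i][j]) ** beta if i != j else 0.0 for j in range(n)]
            for i in range(n)]


# Hypothetical pairwise similarities among three genes.
sim = [[1.0, 0.9, 0.5],
       [0.9, 1.0, 0.2],
       [0.5, 0.2, 1.0]]

hard = hard_threshold(sim, tau=0.8)   # keeps only the 0.9 connection
soft = soft_threshold(sim, beta=2)    # keeps every connection, down-weighted
```

In the hard version the pair with similarity 0.5 is simply dropped, whereas the soft version retains it with weight 0.25, which is the information loss referred to above.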
Methodology
To construct the gene co-expression network, we use the general framework proposed by Zhang and Horvath [16]. A new, effective gene similarity measure called NMRS is used to construct the distance matrix. We use a hard thresholding based signum function to construct the adjacency matrix from the distance matrix. A spanning tree based approach is used to detect network modules in the co-expression network. Extracted network modules are projected as functional categories of genes, and these modules are validated using p value and Q value. Our approach is explained next.
Define a gene expression measurement
To determine whether two genes have similar expression patterns, an appropriate similarity measure must be chosen [17]. To measure the level of concordance between gene expression profiles, we develop a gene co-expression measure called NMRS. The NMRS of gene d1 = (a1, a2, …, an) with respect to gene d2 = (b1, b2, …, bn) is defined by
NMRS as a metric
NMRS satisfies all the properties of a metric. We establish the non-negativity, symmetry and triangle inequality properties of our measure in Additional file 1.
Significance of NMRS
Comparison of proximity measures
Proximity measure | Mode | Normalization required | Detects shifting pattern | Detects scaling pattern |
---|---|---|---|---|
Euclidean | Mutual | Yes | Yes | No |
Pearson | Mutual | No | Yes | Yes |
Spearman | Mutual | No | No | No |
MSR | Aggregate | No | Yes | Yes |
NMRS | Mutual | No | Yes | Yes |
Example patterns used for evaluation of proximity measures. Figure 1 presents the values of some example patterns that are used to demonstrate the superiority of NMRS over other proximity measures, viz. Euclidean distance, Pearson correlation coefficient and Spearman correlation coefficient.
Gene pattern
a | 4 | 7 | 6 | 3 | 6 | 5 | 8 | 7 | 3 |
---|---|---|---|---|---|---|---|---|---|
b1 | 10 | 13 | 12 | 9 | 12 | 11 | 14 | 13 | 9 |
b2 | 10.4286 | 12.5714 | 11.8571 | 9.7143 | 11.8571 | 11.1429 | 13.2857 | 12.5714 | 9.7143 |
b3 | 10.8571 | 12.1429 | 11.7143 | 10.4286 | 11.7143 | 11.2857 | 12.5714 | 12.1429 | 10.4286 |
b4 | 11.2857 | 11.7143 | 11.5714 | 11.1429 | 11.5714 | 11.4286 | 11.8571 | 11.7143 | 11.1429 |
b5 | 11.7143 | 11.2857 | 11.4286 | 11.8571 | 11.4286 | 11.5714 | 11.1429 | 11.2857 | 11.8571 |
b6 | 12.1429 | 10.8571 | 11.2857 | 12.5714 | 11.2857 | 11.7143 | 10.4286 | 10.8571 | 12.5714 |
b7 | 12.5714 | 10.4286 | 11.1429 | 13.2857 | 11.1429 | 11.8571 | 9.7143 | 10.4286 | 13.2857 |
b8 | 13 | 10 | 11 | 14 | 11 | 12 | 9 | 10 | 14 |
NMRS and Pearson correlation coefficient among the considered example patterns. Figure 2 presents the NMRS and Pearson correlation coefficient of patterns b1-b8 with respect to pattern a.
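The comparison on the example patterns can be reproduced in part with a short sketch. The Pearson computation is standard. The defining equation of NMRS is not reproduced in this text, so the `nmrs` function below is only an assumed mean-residue form (one minus the summed absolute residue difference, normalized by twice the larger total absolute residue); it should be read as a plausible stand-in and not as the authors' definition. Only patterns a, b1, b4 and b8 from the table are used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)


def nmrs(x, y):
    """ASSUMED form of a normalized mean residue similarity (not from the text).

    Each value is centred on its profile mean (the residue); the similarity is
    1 - sum|res_x - res_y| / (2 * max(sum|res_x|, sum|res_y|)).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum(abs((xi - mx) - (yi - my)) for xi, yi in zip(x, y))
    den = 2 * max(sum(abs(xi - mx) for xi in x), sum(abs(yi - my) for yi in y))
    return 1 - num / den


a = [4, 7, 6, 3, 6, 5, 8, 7, 3]
b1 = [10, 13, 12, 9, 12, 11, 14, 13, 9]            # a shifted by a constant
b4 = [11.2857, 11.7143, 11.5714, 11.1429, 11.5714,
      11.4286, 11.8571, 11.7143, 11.1429]          # a shifted and shrunk
b8 = [13, 10, 11, 14, 11, 12, 9, 10, 14]           # a mirrored

for name, b in (("b1", b1), ("b4", b4), ("b8", b8)):
    print(name, round(pearson(a, b), 4), round(nmrs(a, b), 4))
```

Pearson correlation cannot distinguish b1 from b4, since both are positive linear transforms of a, while a residue-based measure of the assumed form grades them by how closely the centred profiles coincide.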
Compute an adjacency matrix
Detect network modules
where $l_{ij} = \sum_{u} a_{iu} a_{uj}$ is the number of neighbors shared by nodes i and j, and $k_i = \sum_{u} a_{iu}$ is the node connectivity.
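The quantities l_ij and k_i above are the ingredients of the topological overlap measure. A minimal sketch follows, assuming the usual form for an unweighted network as given in the framework of [16]: TOM(i, j) = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), with TOM(i, i) = 1. The function name and the toy adjacency matrix are illustrative assumptions.

```python
def topological_overlap(adj):
    """Topological overlap matrix of a binary, symmetric adjacency matrix.

    Assumed form: (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), diagonal set to 1.
    """
    n = len(adj)
    k = [sum(row) for row in adj]                      # node connectivity k_i
    tom = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                tom[i][j] = 1.0
                continue
            # l_ij: number of neighbours shared by i and j
            l_ij = sum(adj[i][u] * adj[u][j] for u in range(n))
            tom[i][j] = (l_ij + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
    return tom


# Toy network: triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
tom = topological_overlap(adj)
```

Vertices 0 and 3 are not adjacent but share neighbor 2, so they still receive a non-zero overlap; this is what lets the measure reflect neighborhood structure and not only direct edges.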
Extract useful information
Extraction of useful biological information is one of the main uses of gene co-expression networks. From the constructed network, one can explore important information such as the functionality and pathways of genes and the essential genes that are susceptible to diseases.
Proposed algorithm: Module Miner
Module Miner takes an NMRS threshold, δ, as input, constructs the gene co-expression network from microarray gene expression data and finally extracts network modules from the network. Our approach uses the similarity measure NMRS to form a co-expression network through a signum function. The co-expression network is then explored to mine the potential network modules using a spanning tree based method and a connectivity measure called the Topological Overlap Matrix (TOM).
Symbolic representation
SYMBOL | MEANING |
---|---|
D | The gene expression matrix |
d_i | i-th gene in D |
δ | Signum threshold |
G | Co-expression network |
V | Set of vertices in G |
E | Set of edges in G |
Dist | Distance matrix |
Dist(d_i, d_j) | NMRS distance between genes d_i, d_j ∈D |
Adj | Adjacency matrix |
Adj(v_i, v_j) | 1 if v_i and v_j are connected by an edge, 0 otherwise |
G_con | Set of connected regions |
| i-th connected region |
| Set of vertices in the i-th connected region |
| Set of edges in the i-th connected region |
| Adjacency matrix of the i-th connected region |
| i-th network module |
D_net | Set of network modules obtained from G |
TOM(v_i, v_j) | Topological overlap matrix value between vertices v_i and v_j |
TOM(V1) | Average TOM of the set of vertices V1 |
| TOM for the i-th connected region |
| Maximum spanning tree obtained from the i-th connected region |
| Set of edges in the maximum spanning tree of the i-th connected region |
Definition 1 A CEN can be defined by an undirected graph G={V,E}, where each v∈V corresponds to a gene and each edge e∈E corresponds to a pair of genes d_i, d_j ∈D such that Dist(d_i, d_j)≥δ.
Definition 2 Connected regions in a CEN are parts of the network where each pair of vertices is connected by a path. The i-th connected region extracted from G can be defined as a graph whose vertex set is a subset of V and whose edge set is a subset of E, such that for any vertex in the region there is at least one other vertex in the region to which it is connected by an edge of the region.
Definition 3 The maximum spanning tree obtained from the i-th connected region, treated as a weighted graph, is the spanning tree of that region for which the sum of the TOM values associated with its edges is maximum compared to all other spanning trees of the region.
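Definitions 2 and 3 can be sketched in code as follows. This is an illustration under stated assumptions: connected regions are found by breadth-first search, and the maximum spanning tree is built with Prim's procedure [20] adapted to pick the heaviest crossing edge. The function names, the toy adjacency matrix and the edge weights (standing in for TOM values) are hypothetical.

```python
from collections import deque


def connected_regions(adj):
    """Return the vertex sets of the connected regions of a binary adjacency matrix."""
    n = len(adj)
    seen = [False] * n
    regions = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue, region = deque([start]), []
        while queue:
            v = queue.popleft()
            region.append(v)
            for w in range(n):
                if adj[v][w] and not seen[w]:
                    seen[w] = True
                    queue.append(w)
        regions.append(sorted(region))
    return regions


def maximum_spanning_tree(vertices, adj, weight):
    """Prim-style maximum spanning tree of one connected region.

    Repeatedly adds the heaviest edge that joins the tree to a new vertex.
    Returns a list of (u, v, weight) edges.
    """
    in_tree = {vertices[0]}
    edges = []
    while len(in_tree) < len(vertices):
        best = None
        for u in in_tree:
            for v in vertices:
                if v not in in_tree and adj[u][v]:
                    if best is None or weight[u][v] > best[2]:
                        best = (u, v, weight[u][v])
        edges.append(best)
        in_tree.add(best[1])
    return edges


# Toy network: triangle 0-1-2 and a separate edge 3-4.
adj = [[0, 1, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0]]
# Hypothetical edge weights standing in for TOM values.
w = [[0.0, 0.9, 0.7, 0.0, 0.0],
     [0.9, 0.0, 0.5, 0.0, 0.0],
     [0.7, 0.5, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.3],
     [0.0, 0.0, 0.0, 0.3, 0.0]]

regions = connected_regions(adj)
tree = maximum_spanning_tree(regions[0], adj, w)
total = sum(edge[2] for edge in tree)
```

For the triangle, the tree keeps the two heaviest edges (weights 0.9 and 0.7) and discards the 0.5 edge. Note that this sketch also reports isolated vertices as single-vertex regions, which a real pipeline would probably filter out.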
Definition 4 Network modules are highly connected regions of the co-expression network. The i-th network module derived from the j-th connected region is defined as a set of vertices if
Algorithm: Module Miner
Algorithm complexity
The complexity of different steps of our method is presented in this section.
- The preparation of the distance matrix involves a complexity of O(n×(n-1)/2), where n is the number of genes.
- Finding connected regions from the co-expression network requires a complexity of O(n).
- Computation of the TOM matrix involves a complexity of O(n_c×(d_c×(d_c-1)/2)), where n_c is the total number of connected regions and d_c is the average number of genes in the connected regions.
- Finding a maximum spanning tree consumes a complexity of
Experimental results
Datasets used for evaluating ModuleMiner
Serial. No | Dataset | No. of Genes/ No. of Conditions | Source |
---|---|---|---|
1 | Yeast Sporulation | 474/17 | |
2 | Yeast Diauxic Shift | 689/72 | Sample gene in expander |
3 | Subset of Yeast Cell Cycle | 384/17 | |
4 | Arabidopsis Thaliana | 138/8 | http://homes.esat.kuleuven.be/~sistawww/bioi/thijs/Work/Clustering.html |
5 | Rat CNS | 112/9 | |
Validation
The performance of Module Miner on the five publicly available benchmark microarray datasets is measured in terms of p value and Q value.
p value
where f and g denote the total number of genes within a category and within the genome respectively.
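A minimal sketch of such an enrichment p value is given below, assuming it is the cumulative hypergeometric probability commonly used for functional category enrichment (as in [21]): the probability of observing at least k genes from a category of size f in a module of n genes drawn from a genome of g genes. The symbols n and k and the function name are assumptions introduced for this illustration; only f and g are named in the text.

```python
from fractions import Fraction
from math import comb


def enrichment_p_value(k, n, f, g):
    """Upper-tail hypergeometric probability P(X >= k).

    k: genes in the module that belong to the category (assumed symbol)
    n: number of genes in the module (assumed symbol)
    f: total number of genes within the category
    g: total number of genes within the genome
    """
    total = comb(g, n)
    # Sum the upper tail directly to avoid cancellation in 1 - (lower tail).
    upper = sum(comb(f, i) * comb(g - f, n - i)
                for i in range(k, min(n, f) + 1))
    return float(Fraction(upper, total))


# Toy example: genome of 10 genes, category of 4, module of 3 with 2 hits.
p = enrichment_p_value(k=2, n=3, f=4, g=10)
print(p)
```

Exact integer arithmetic is used before the final conversion to a float, which keeps very small tail probabilities from being lost to rounding.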
P-value of one of the network modules of Dataset 2
P-value | GO number | GO category |
---|---|---|
2.32E-28 | GO:0000788 | nuclear nucleosome |
5.12E-27 | GO:0000786 | nucleosome |
7.27E-23 | GO:0006334 | nucleosome assembly |
2.06E-20 | GO:0032993 | protein-DNA complex |
8.61E-19 | GO:0034728 | nucleosome organization |
1.14E-18 | GO:0065004 | protein-DNA complex assembly |
1.12E-17 | GO:0006333 | chromatin assembly or disassembly |
4.12E-16 | GO:0005694 | chromosome |
2.49E-14 | GO:0044454 | nuclear chromosome part |
1.70E-13 | GO:0031298 | replication fork protection complex |
9.47E-14 | GO:0006325 | chromatin organization |
6.78E-13 | GO:0044427 | chromosomal part |
2.32E-12 | GO:0034622 | cellular macromolecular complex assembly |
p-value of one of the network modules of Dataset 3
P-value | GO number | GO category |
---|---|---|
3.93E-25 | GO:0006281 | DNA repair |
1.03E-26 | GO:0006259 | DNA metabolic process |
1.23E-23 | GO:0006974 | response to DNA damage stimulus |
7.69E-27 | GO:0006260 | DNA replication |
6.94E-19 | GO:0007049 | cell cycle |
5.55E-16 | GO:0005634 | nucleus |
8.53E-18 | GO:0044454 | nuclear chromosome part |
1.51E-17 | GO:0022402 | cell cycle process |
3.53E-17 | GO:0000079 | regulation of cyclin-dependent protein kinase activity |
5.72E-15 | GO:0045859 | regulation of protein kinase activity |
5.16E-16 | GO:0005657 | replication fork |
Q value
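As a rough illustration of how Q values relate to p values, the sketch below applies the Benjamini-Hochberg step-up adjustment [23] to a list of p values. It is an assumption that the Q values tabulated in this section were obtained this way; the function name and the input values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values (q values), in the input order.

    For the p value of rank r among m tests, q = min over ranks >= r of p * m / rank.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q[idx] = running_min
    return q


# Hypothetical p values for four GO categories.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(q)
```

Each q value is at least as large as the corresponding p value, which reflects the correction for testing many GO categories at once.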
Q-value of one of the network modules of Dataset 3
GO annotation | Q value |
---|---|
DNA replication | 1.93E-21 |
DNA repair | 1.93E-21 |
response to DNA damage stimulus | 2.17E-20 |
DNA-dependent DNA replication | 3.07E-19 |
replication fork | 6.27E-19 |
nuclear chromosome | 1.23E-17 |
mitotic sister chromatid cohesion | 5.51E-17 |
nuclear replication fork | 9.37E-17 |
nuclear chromosome part | 2.00E-16 |
sister chromatid cohesion | 5.13E-15 |
Q-value of one of the network modules of Dataset 1
GO annotation | Q value |
---|---|
cytosolic ribosome | 1.43E-52 |
cytosolic part | 3.26E-48 |
structural constituent of ribosome | 2.11E-44 |
ribosomal subunit | 1.16E-42 |
cytosolic large ribosomal subunit | 2.65E-36 |
large ribosomal subunit | 1.47E-27 |
preribosome | 2.96E-23 |
cytosolic small ribosomal subunit | 3.71E-17 |
90S preribosome | 8.48E-16 |
Q-value of one of the network modules of Dataset 1
GO annotation | Q value |
---|---|
sporulation resulting in formation of a cellular spore | 1.53E-34 |
sporulation | 1.53E-34 |
anatomical structure formation involved in morphogenesis | 1.53E-34 |
spore wall assembly | 3.43E-33 |
ascospore wall assembly | 3.43E-33 |
ascospore formation | 3.43E-33 |
sexual sporulation | 3.43E-33 |
spore wall biogenesis | 3.43E-33 |
ascospore wall biogenesis | 3.43E-33 |
sexual sporulation resulting in formation of a cellular spore | 3.43E-33 |
cell development | 3.43E-33 |
cell wall assembly | 8.88E-33 |
reproductive process in single-celled organism | 2.59E-32 |
cell differentiation | 8.40E-32 |
fungal-type cell wall biogenesis | 6.93E-30 |
reproductive developmental process | 1.40E-29 |
reproductive process | 1.86E-25 |
reproductive cellular process | 1.86E-25 |
reproduction of a single-celled organism | 9.90E-25 |
cell wall biogenesis | 1.25E-24 |
sexual reproduction | 4.83E-24 |
anatomical structure development | 5.45E-24 |
anatomical structure morphogenesis | 5.45E-24 |
M phase | 2.10E-23 |
meiotic cell cycle | 1.62E-21 |
meiosis | 2.74E-21 |
M phase of meiotic cell cycle | 2.74E-21 |
Q-value of one of the network modules of Dataset 4
GO annotation | Q value |
---|---|
synaptic transmission | 1.29E-13 |
glutamate receptor activity | 3.77E-11 |
synapse | 6.68E-08 |
regulation of synaptic transmission | 3.06E-07 |
regulation of transmission of nerve impulse | 4.00E-07 |
regulation of neurological system process | 7.07E-07 |
regulation of system process | 5.38E-05 |
synapse part | 8.11E-04 |
cell projection part | 9.46E-04 |
Q-value of one of the network modules of Dataset 5
GO annotation | Q value |
---|---|
regulation of synaptic transmission | 6.438756E-7 |
regulation of transmission of nerve impulse | 9.297736E-7 |
regulation of neurological system process | 1.533111E-6 |
intermediate filament cytoskeleton organization | 2.056912E-6 |
intermediate filament-based process | 5.218967E-6 |
neurofilament cytoskeleton | 1.109702E-5 |
intermediate filament organization | 1.454524E-5 |
synapse part | 2.543099E-5 |
growth factor binding | 2.571707E-5 |
intermediate filament | 2.938762E-5 |
positive regulation of neurogenesis | 9.6019E-5 |
The weightage of co-expression by Module Miner
Datasets | Network Modules | Percentage |
---|---|---|
Dataset1 | C1 | 99.57% |
Dataset1 | C2 | 88.89% |
Dataset2 | C1 | 59.23% |
Dataset2 | C2 | 77.27% |
Dataset3 | C1 | 92.13% |
Dataset3 | C2 | 88.89% |
Dataset3 | C3 | 92.33% |
Dataset3 | C4 | 67.65% |
Dataset4 | C1 | 81.85% |
Dataset5 | C1 | 76.62% |
Visualization of co-expressed network. Figure 3 presents the co-expressed network generated by GeneMANIA for Dataset1.
Visualization of co-expressed network. Figure 4 presents the co-expressed networks generated by GeneMANIA for Dataset2 and Dataset3.
Visualization of co-expressed network. Figure 5 presents the co-expressed networks generated by GeneMANIA for Dataset4 and Dataset5.
Conclusion and future work
In this paper, an effective gene expression similarity measure, NMRS, is introduced and used to construct the co-expression network through a signum function based hard thresholding scheme. Network modules are then extracted from the network using the maximum spanning tree and the topological overlap matrix. As future work, a soft thresholding method could be used to construct the adjacency matrix in order to reduce information loss, and the Generalized Topological Overlap Measure [25] could be used instead of the Topological Overlap Measure to obtain more accurate results. There is also scope to design supervised models to derive gene regulatory networks from the co-expression network.
Declarations
Acknowledgment
This paper is an outcome of a research project supported by (1) DST, Govt. of India in collaboration with ISI, Kolkata and (2) National Science Foundation, USA under grants CNS-095876 and CNS-085173.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 13, 2012: Selected articles from The 8th Annual Biotechnology and Bioinformatics Symposium (BIOT-2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S13/S1
Authors’ Affiliations
References
- Wagner A: How the global structure of protein interaction networks evolves. Proc Biol Sci 2003, 270: 457–466. 10.1098/rspb.2002.2269
- Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences of the United States of America 2001, 98: 4569–4574. 10.1073/pnas.061034498
- Jeong H, Mason SP, Barabási AL, Oltvai ZN: Lethality and centrality in protein networks. Nature 2001, 411: 41–42. 10.1038/35075138
- Wagner A, Fell DA: The small world inside large metabolic networks. Proceedings. Biological sciences / The Royal Society 2001, 268(1478):1803–1810. 10.1098/rspb.2001.1711
- Ma H, Zeng AP: Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics 2003, 19: 270–277. 10.1093/bioinformatics/19.2.270
- Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL: The large-scale organization of metabolic networks. Nature 2000, 407: 651–654. 10.1038/35036627
- van Noort V, Snel B, Huynen M: The yeast coexpression network has a small-world, scale-free architecture and can be explained by a simple model. EMBO Reports 2004, 5(3):280–284. 10.1038/sj.embor.7400090
- Ruan J, Dean A, Zhang W: A general co-expression network-based approach to gene expression analysis: comparison and applications. BMC Systems Biology 2010, 4.
- Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS: Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proceedings of the National Academy of Sciences of the United States of America 2000, 97: 12182–12186. 10.1073/pnas.220392197
- D’Haeseleer P, Liang S, Somogyi R: Genetic Network Inference: From Co-Expression Clustering To Reverse Engineering. 2000.
- Butte AJ, Kohane IS: Mutual Information Relevance Networks: Functional Genomic Clustering Using Pairwise Entropy Measurements. Pacific Symposium on Biocomputing 2000, 5: 415–426.
- Steuer R, Kurths J, Daub CO, Weise J, Selbig J: The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics 2002, 18: S231-S240. 10.1093/bioinformatics/18.suppl_2.S231
- Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of human genes across many microarray data sets. Genome Res 2004, 14: 1085–1094. 10.1101/gr.1910904
- Zhu D, Hero AO, Cheng H, Khanna R: Network constrained clustering for gene microarray data. Bioinformatics 2005, 21: 4014–4021. 10.1093/bioinformatics/bti655
- Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science 2003, 302: 249–255. 10.1126/science.1087447
- Zhang B, Horvath S: A general framework for weighted gene co-expression network analysis. Statistical applications in genetics and molecular biology 2005, 4.
- Yona G, Dirks W, Rahman S, Lin DM: Effective similarity measures for expression profiles. Bioinformatics 2006, 22(13):1616–1622. 10.1093/bioinformatics/btl127
- Jiang D, Tang C, Zhang A: Cluster Analysis for Gene Expression Data: A Survey. IEEE Transactions on Knowledge and Data Engineering 2004, 16: 1370–1386. 10.1109/TKDE.2004.68
- Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabási AL: Hierarchical organization of modularity in metabolic networks. Science (New York, N.Y.) 2002, 297(5586):1551–1555. 10.1126/science.1073374
- Prim RC: Shortest connection networks and some generalizations. Bell System Technical Journal 1957, 36: 1389–1401.
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nature Genetics 1999.
- Berriz GF, King OD, Bryant B, Sander C, Roth FP: Characterizing gene sets with FuncAssociate. Bioinformatics (Oxford, England) 2003, 19: 2502–2504. 10.1093/bioinformatics/btg363
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 1995, 57: 289–300.
- Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, Franz M, Grouios C, Kazi F, Lopes CT, Maitland A, Mostafavi S, Montojo J, Shao Q, Wright G, Bader GD, Morris Q: The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research 2010, 38: W214-W220. 10.1093/nar/gkq537
- Yip AM, Horvath S: Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics 2007, 8.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.