Volume 10 Supplement 11
Threshold selection in gene co-expression networks using spectral graph theory techniques
© Perkins and Langston; licensee BioMed Central Ltd. 2009
Published: 8 October 2009
Gene co-expression networks are often constructed by computing some measure of similarity between expression levels of gene transcripts and subsequently applying a high-pass filter to remove all but the most likely biologically-significant relationships. The selection of this expression threshold necessarily has a significant effect on any conclusions derived from the resulting network. Many approaches have been taken to choose an appropriate threshold, among them computing levels of statistical significance, accepting only the top one percent of relationships, and selecting an arbitrary expression cutoff.
We apply spectral graph theory methods to develop a systematic method for threshold selection. Eigenvalues and eigenvectors are computed for a transformation of the adjacency matrix of the network constructed at various threshold values. From these, we use a basic spectral clustering method to examine the set of gene-gene relationships and select a threshold dependent upon the community structure of the data. This approach is applied to two well-studied microarray data sets from Homo sapiens and Saccharomyces cerevisiae.
This method presents a systematic, data-based alternative to using more artificial cutoff values and results in a more conservative approach to threshold selection than some other popular techniques such as retaining only statistically-significant relationships or setting a cutoff to include a percentage of the highest correlations.
The construction of gene co-expression networks is often a necessary step in a bioinformatic analysis of microarray gene expression data. Studies have shown that genes showing a similar pattern of expression, those sharing edges in a co-expression network, tend to have similar function . This principle, often referred to as "guilt-by-association" is the idea that motivates many microarray studies. With new high-throughput sequencing technologies currently being used for digital gene expression applications, gene co-expression networks promise to continue to find wide utility in genome-wide association studies and other computational analyses.
These networks are constructed by computing some similarity value between gene transcripts based upon their expression values over a set of samples. Nodes in the network represent transcripts while edges are weighted by these similarity values. A threshold is often applied to the resulting networks to retain only the most biologically significant relationships. This threshold application step is a major juncture in which errors can be introduced in the form of both false negatives and false positives. By setting this threshold too high, important relationships can be lost. Likewise, we must be sure to remove connections that do not represent "real" relationships. This task is difficult since the range of thresholds representing real biological relationships that also avoid over-filtering can be narrow.
Some of the many methods that have been applied to the threshold selection problem in various types of networks are using an arbitrary threshold , retaining only the top x percent of the strongest relationships , permutation testing , and filtering based upon control spot correlations  or the statistical significance of the relationships [5–7]. The method presented here makes use of initial spectral graph theory-based clusterings to help identify an appropriate threshold. Combinatorial methods such as those described in  will be used to analyze the final gene co-expression network, and such methods often require significant computational resources. We can justify the expense of this initial clustering by the computational resources saved by picking a suitable threshold in advance, especially one that removes most non-biologically-relevant relationships, which will significantly decrease computational requirements. We know that spectral graph theory methods can give us important information on the structure of a graph, such as the number of connected components, information about random walks in the graph, and a bound on the graph diameter . Various spectral methods have also been employed to identify clusters of related vertices [9–12]. It is these spectral clustering methods that we believe can contribute toward selecting a biologically-relevant threshold in co-expression networks. A more detailed initial analysis is presented in .
Results and discussion
Spectral properties and algebraic connectivity
We introduce a method for threshold selection based upon the spectrum of the graph at varying thresholds. That is, the eigenvalues and eigenvectors of a transform of the graph's adjacency matrix. We applied this method to yeast cell cycle data  and human expression values collected over many different tissue types . It has been shown that the number of connected components of a network can be identified using the spectrum of the network . Ding et al. observed that "nearly-disconnected" portions can also be identified by examination of the eigenvector associated with the smallest nonzero eigenvalue of the network, often called the Fiedler vector . The ability to find the nearly-disconnected pieces allows us to identify those nodes sharing a well-connected, or dense, cluster.
Many spectral clustering methods exist, with possibly the simplest being a spectral bipartitioning of the network such as that described in . In that case, the eigenvector associated with λ1, which we will refer to as v1, is sorted and nodes are partitioned into two groups based upon the magnitude of their associated eigenvector value. In , the authors showed that sorting the eigenvector associated with λ1 in ascending order often produces a step function-like plot. They also showed that the steps in such a plot delineate transitions from one nearly-disconnected component to another. Since each eigenvector value is associated with a node in the network, individual nodes can be assigned to a cluster based upon the steps in the eigenvector values. This method allows a finer partitioning than the spectral bipartitioning methods and precludes the need for recursive application of the partitioning method. This particular spectral clustering method is particularly amenable to the threshold selection problem due to its ability to identify clusters of various sizes and because it is not necessary to specify the number of partitions desired.
We can see in the previous figure that the number of clusters identified by the spectral method subsequently decreases as we proceed past the selected threshold. This is likely due to a decreasing network size overall, as well as individual clusters falling below the minimum cluster size. Similarly, the algebraic connectivity shown in Figure 2 shows an associated increase at the upper end of the threshold range due to the very small size of the largest component at these thresholds. For example, at the t = 0.98 threshold in yeast data (not shown), the network consists of only two nodes, with a single edge connecting them for a 100% edge density.
Paracliques  were computed for co-expression networks generated at the selected thresholds for both the human and yeast data sets. The Paraclique algorithm, based upon solving the -complete clique problem , is often more appropriate for microarray data than using the basic clique method. Due to the noise inherent in such data, a small number of edge weights can drop just below the network threshold. Paraclique corrects such a situation by allowing vertices to be added to the paraclique if they are adjacent to at least g of the original clique members. For most of our analyses (except the comparison with known clusters of co-expressed yeast genes described below), we set g = 1. The Paraclique algorithm performs this adjustment while still retaining the benefits of clique such as being an unsupervised method, identifying only the densest subgraphs, and possessing a natural resistance to false positives.
In the yeast co-expression network at the t = 0.78 threshold chosen by the spectral method, 93 paracliques were found with the largest containing 21 gene transcripts. At the t = 0.55 threshold identified by choosing the top one percent of correlations, we found 636 paracliques with a maximum size of 93. The human network produced many more and larger paracliques, with 497 paracliques and the largest one containing 78 transcripts at the more conservative threshold of 0.83. The human network constructed over all tissues and replicates at the lower threshold of 0.65 contained 2, 843, 536 edges, and the Paraclique run extended for almost 2.9 hours on an Intel Pentium 4 EM64T 3.4 GHz processor. This graph contained 1283 paracliques, with the largest having 324 members.
Comparison with other results
Threshold values. Threshold values computed by various methods for yeast and human co-expression networks.
p < 0.05
p < 0.01
Adj. p < 0.05
Adj. p < 0.01
Vertex and edge counts. The numbers of vertices and edges in graphs constructed at thresholds identified by various methods.
Vertex and edge counts
Adj. p < 0.05
Adj. p < 0.01
50, 202, 163
44, 057, 599
2, 843, 536
To correct for multiple tests, we apply the method described in . For example, the α = 0.05 significance level was divided by the number of transcripts on the array. The normal quantile function was used, followed by an inverse Fisher's z' transformation to determine the associated correlation value. This adjustment for multiple comparisons increases the standard p-value slightly, though the significance level threshold is still very low. Such large sample sizes (n = 82, yeast; n = 158, human) tend to translate into low correlation values required for significance, even with adjustment.
Other spectral techniques have also previously found utility in addressing in the network threshold problem. Nearest neighbor eigenvalue spacing was used in  to employ random matrix theory methods for threshold selection. Here, the authors analyzed the eigenvalues of the network by examining the distribution of spacings between successive eigenvalues and determined the point at which this spacing distribution transitioned from Poisson to Gaussian Orthogonal Ensemble (GOE).  also studied the yeast dataset described in  and found that the transition began at t = 0.62 and was complete by t = 0.77. For this yeast data set, the identification of the t = 0.77 point corresponds approximately to our result of t = 0.78.
Comparison with known co-expressed yeast genes
Cluster and paraclique overlap
Cluster (from )
Number of genes in cluster (abbreviated-from heat map figures)
Number of paracliques with cluster overlap
Total paraclique overlap
Due to the nature of the data set analyzed, genes existing in dense regions of the human co-expression network will be those that show the same pattern of expression over many tissue types, though not necessarily over- or under-expressed in a single tissue type. Similarity in many samples is likely required to drive correlations to a significant level. Similarly, genes identified to be in paracliques in the yeast data set are those that vary together throughout the cell cycle. We used the GO Slim Mapper at the Saccharomyces Genome Database (SGD)  and Ingenuity Pathways Analysis (Ingenuity Systems, http://www.ingenuity.com) to analyze some resulting paracliques in yeast and human, respectively.
In the yeast networks, we examined the biological process gene ontology category for the three largest paracliques and identified categories for which more than three genes appeared. At the t = 0.78 threshold, these paracliques were of size 21, 17, and 15. For the largest paraclique, nine of the 21 genes had unknown molecular function; 7, hydrolase activity category; 6, helicase activity; 3, RNA binding. The second paraclique showed categories of DNA binding, enzyme regulator activity, and hydrolase activity. All genes in the third appeared in the structural molecule activity category, and five in RNA binding. The three largest paracliques at the lower threshold of t = 0.55 identified by the top one percent of correlations method were of size 93, 53, and 37, respectively. Many more of these genes were found to have unknown molecular function (40, 13, and 17). The first also contained genes related to hydrolase activity, RNA binding, helicase activity, transferase activity, and nucleotidyltransferase activity. Those with more than three members in the second paraclique were transferase activity, DNA binding, hydrolase, enzyme regulator activity, protein binding, and protein kinase activity. Protein binding, hydrolase, and RNA binding were identified in the third paraclique.
For human paraclique results, we also examined the three largest paracliques at the thresholds identified by the spectral method and the "top 1%" method. At the t = 0.83 threshold, the first paraclique matched five networks containing more than three of the paraclique members. These included networks related to cellular organization, gene expression, genetic disorder, drug metabolism, and cell signaling, for example. The second paraclique matched only three networks, all related to protein synthesis. Similarly, the third network aligned with only two networks with a match of more than one gene. Both of these were related to reproductive systems development and disease, respectively, among other functions. The t = 0.65 threshold produced a maximum paraclique size of 324 which matched 14 networks with more than three genes in common, with the most enriched being related to post-transcriptional modification. The second largest paraclique matched 13 networks ranging from cellular assembly and organization, genetic disorder, to inflammatory disease, and many others. Similarly, the third paraclique matched nine networks, mostly having some relation to cancer, though some were annotated with cellular development, post-translational modification, and developmental/genetic disorder, for example.
For yeast results, while paracliques computed at the higher threshold of t = 0.78 are understandably smaller, fewer genes are unidentified based upon their biological process. In one case, all of the genes in a paraclique fell into the same category. In paracliques on networks constructed at both the high and low thresholds, genes belonged to a wide variety of biological processes, and largely the same categories appeared within the three largest paracliques in both groups. IPA results on the three largest human paracliques shows that lower thresholds result in a larger number of networks matching the paraclique transcripts. These networks seem to be annotated with a larger range of functions compared to the relatively few networks identified at the higher thresholds. In this sense, it is possible that the higher threshold values produce paracliques that are more specific to a particular network or function, allowing us to examine the results at a finer granularity. Of course, analyzing only the three largest paracliques does not give enough information to draw definitive conclusions, and it is likely that some of the actual biological networks involved or genes belonging to these networks will have been lost by using a more stringent threshold.
Threshold effect on co-expression networks
With respect to an individual paraclique, an increase in correlation threshold can have have at least two effects, and possibly both of these: a decrease in the number of genes contained within a paraclique, or the splitting of a paraclique into two or more disjoint paracliques. These new disjoint pieces may contain additional genes that were not present in the original paraclique due to the smaller number of genes with which a new gene would have to share a connection. Both of these cases are possibly desirable when a large paraclique encompasses genes participating in a variety of biological functions. If that large paraclique is split into multiple disjoint pieces of highly connected genes, or genes connected to the paraclique at a lower correlation level are excluded, only the core set of genes putatively involved in a more focused set of biological functions or pathways remain.
We decided to identify occurrences of each of these cases in the human data set due to the availability of the rich annotation information for human results available within IPA. Using the combinatorial methods described above can become intractable at the very low threshold values corresponding to large numbers of vertices and edges identified by the statistical significance methods. Therefore, we performed a pairwise comparison between paracliques computed at the two highest threshold values selected by all of the methods studied. The degree of overlap between each paracliques in the graph constructed by choosing the highest one percent of correlations (0.65) and each of those identified at the higher spectral threshold (0.83) was found. This allowed us to determine which paracliques at the higher threshold possibly correspond to those in the graph at the lower threshold. Note that due to the nature of the Paraclique algorithm, there is not necessarily a one-to-one correspondence between every paraclique in the first set with one or more paracliques in the second set.
IPA networks from two paracliques
Hematological System Development and Function, Humoral Immune Response, Tissue Morphology
Cellular Movement, Embryonic Development, Hair and Skin Development and Function
Hematological System Development and Function, Humoral Immune Response, Tissue Morphology
Cellular Movement, Embryonic Development, Hair and Skin Development and Function
We have presented a systematic threshold selection method that makes use of spectral graph theory techniques. We have shown that in the selected data sets this method results in a more conservative approach to threshold selection than both the test of statistical significance at p < 0.01 and including only the highest-weighted 1% of edges, in terms of the number of relationships retained for further analysis. We believe that the primary strength of the spectral graph theory-based method presented here is that it is a systematic method for threshold selection. Both the statistical significance method and the percentage cutoff method can be adjusted to produce graphs that prove to be tractable in a combinatorial analysis and contain fewer false positives, but the need for an arbitrary cutoff value is still present in these methods. The spectral approach attempts to move beyond the need for employing these arbitrary thresholds and computes a cutoff value based upon the underlying community structure of the data rather than merely sample size or the relative distribution of correlation values. We have also shown that for the yeast cell cycle data studied, this method produces results in agreement with a previous study making use of methods from random matrix theory. Functional comparisons between networks constructed at the threshold selected by the spectral method and the method of choosing the top one percent of correlations show that the networks built at the lower threshold are often time consuming to analyze and in the yeast data set, many of the paraclique members fall into the unknown biological process category while other genes span several other GO categories. At the higher threshold, fewer of these genes fail to be categorized based upon the gene ontology. For human data, fewer networks were identified as being enriched in the paracliques, making interpretation of the results easier.
Future work may include adapting more advanced spectral clustering methods such as the k-way partitioning methods described in [9, 11, 12] for use in threshold selection. We also plan to investigate the use of the metric of modularity , which serves as a quantitative measure of the proportion of intra-cluster edges, as a guide for determining an optimal threshold. Both of these features can be incorporated into a future graphical user interface-based software package that can be applied to general microarray data sets to perform a spectral analysis for determining an appropriate threshold.
Microarray data sets
We studied the publicly-available Homo sapiens and Saccharomyces cerevisiae microarray data sets described in  and , respectively. The former contains expression values from a panel of seventy-nine different tissue types in human measured on Affymetrix HG_U133A gene expression microarrays at the Genomics Institute of the Novartis Research Foundation (GNF). Data was downloaded from the NCBI Gene Expression Omnibus website http://www.ncbi.nlm.nih.gov/geo/ as raw CEL files and subsequently preprocessed and normalized using the R statistical software package version 2.6.1  and the justRMA() function of the affy version 1.12.2  Bioconductor  package. The latter contains expression from baker's yeast samples collected over a time period to measure changes during the cell cycle and was downloaded from the author's webpage in tab delimited format.
Network construction and representation
where deg(i) denotes the degree of vertex i. The benefit of the Laplacian matrix is that both adjacency and degree information is readily available.
Eigenvalue and eigenvector computation
where n is the number of nodes in the component being analyzed.
The linear algebra software package MATLAB version R2008b (The Mathworks, Inc., http://www.mathworks.com) was used to compute approximations to selected eigenvalues and eigenvectors of the filtered correlation network. Using the sparse matrix operations native to MATLAB and the eigs() function, the two smallest eigenvalues and their associated eigenvectors were computed. The eigenvector associated with the second-smallest eigenvalue λ1 was extracted and sorted in increasing order.
The detection of "gaps" in the ordered set of eigenvector values was performed using a sliding window technique. The sliding window compares two eigenvector values windowsize positions apart, where windowsize was chosen to be five for this study. When these two values are significantly different, then the beginning of a new cluster is indicated. In this case, we define a significant difference to be greater than m + , where m is the median of all differences in positions windowsize apart and s is the standard deviation of this set of values. To prevent the many small partitions that often occur at extremely high thresholds from overwhelming the results, identified partitions less than some minimum size, in this case 10 nodes, were discarded.
The graph theoretical algorithm Paraclique, developed by Michael A. Langston's team at the University of Tennessee and described in , was employed to extract dense sets of genes from resulting co-expression networks. Paraclique begins by finding a clique, or completely connected subgraph, of maximum size in the network. The maximum clique is augmented with genes connected to all but g of the clique members, with g = 1 in this case. This dense subgraph is removed from the network and the process repeats until no new paracliques larger than some minimum size can be found. For comparisons with known yeast co-expression networks, we set g = 3, which was found to incorporate more of the known co-expressed genes without significantly increasing the number of other genes present. We required the base maximum clique size to consist of at least five members for the comparison with previous yeast co-expression studies and three for all other human and yeast results.
Functional comparisons were performed using the Saccharomyces Genome Database GO Slim Viewer http://www.yeastgenome.org and Ingenuity Pathways Analysis software (Ingenuity Systems, http://www.ingenuity.com) for yeast and human networks, respectively.
Thanks to Bhavesh Borate for helpful discussions on threshold selection in gene co-expression networks. ADP was supported by a new faculty startup package from the Department of Computer Science and Engineering, Bagley College of Engineering, and the Office of Research and Economic Development at Mississippi State University.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 11, 2009: Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S11.
- Wolfe CJ, Kohane IS, Butte AJ: Systematic survey revals general applicability of "guilt-by-association" within gene coexpression networks. BMC Bioinformatics 2005., 6(79):
- Freeman TC, Goldovsky L, Brosch M, van Dongen S, Mazière P, Grocock RJ, Freilich S, Thornton J, Enright AJ: Construction, visualization, and clustering of transcription networks from microarray expression data. PLoS Computational Biology 2007, 3(10):e206. 10.1371/journal.pcbi.0030206PubMed CentralView ArticleGoogle Scholar
- Ala U, Piro RM, Grassi E, Damasco C, Silengo L, Oti M, Provero P, Cunto FD: Prediction of human disease genes by human-mouse conserved coexpression analysis. PLoS Computational Biology 2008, 4(3):e1000043. 10.1371/journal.pcbi.1000043PubMed CentralView ArticlePubMedGoogle Scholar
- Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS: Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proceedings of the National Academy of Sciences of the United States of America 2000, 97(22):12182–12186. 10.1073/pnas.220392197PubMed CentralView ArticlePubMedGoogle Scholar
- Voy BH, Scharff JA, Perkins AD, Saxton AM, Borate B, Chesler EJ, Branstetter LK, Langston MA: Extracting gene networks for low-dose radiation using graph theoretical algorithms. PLoS Computational Biology 2006, 2(7):e89. 10.1371/journal.pcbi.0020089PubMed CentralView ArticlePubMedGoogle Scholar
- Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of human genes across many microarray data sets. Genome Res 2004, 14: 1085–1094. 10.1101/gr.1910904PubMed CentralView ArticlePubMedGoogle Scholar
- Moriyama M, Hoshida Y, Otsuka M, Nishimura S, Kato N, Goto T, Taniguchi H, Shiratori Y, Seki N, Omata M: Relevance network between chemosensitivity and transcriptome in human hepatoma cells. Molecular Cancer Therapeutics 2003, 2: 199–205.PubMedGoogle Scholar
- Chung FRK: Spectral Graph Theory. Regional Conference Series in Mathematics, Providence: American Mathematical Society 1994., 92:Google Scholar
- Alpert CJ, Kahng AB, Yao SZ: Spectral partitioning with multiple eigenvectors. Discrete Applied Mathematics 1999, 90(1–3):3–26. 10.1016/S0166-218X(98)00083-3View ArticleGoogle Scholar
- Ding CHQ, He X, Zha H: A spectral method to separate disconnected and nearly-disconnected web graph components. Proceedings of the Seventh ACM International Conference on Knowledge Discovery and Data Mining: 26–29 August 2001; San Francisco 2001.Google Scholar
- Ng AY, Jordan MI, Weiss Y: On spectral clustering: analysis and an algorithm. Advances in Neural and Information Processing Systems: 3–8 December 2001; Vancouver 2001.Google Scholar
- Ruan J, Zhang W: Identifying network communities with a high resolution. Physical Review E 2008., 77(016104):
- Perkins AD: Addressing challenges in a graph-based analysis of high-throughput biological data. PhD thesis. University of Tennessee, Department of Electrical Engineering and Computer Science; 2008.Google Scholar
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccaromyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 1998, 9(12):3273–3297.PubMed CentralView ArticlePubMedGoogle Scholar
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB: A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences of the United States of America 2004, 101(16):6062–6067. 10.1073/pnas.0400782101PubMed CentralView ArticlePubMedGoogle Scholar
- Shi J, Malik J: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000, 22(8):888–905. 10.1109/34.868688View ArticleGoogle Scholar
- Chesler EJ, Langston MA: Combinatorial genetic regulatory network analysis tools for high throughput transcriptomic data. RECOMB Satellite Workshop on Systems Biology and Regulatory Genomics: 2–4 December 2005; San Diego 2005.Google Scholar
- Garey MR, Johnson DS: Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W. H. Freeman; 1979.Google Scholar
- Luo F, Yang Y, Zhong J, Gao H, Khan L, Thompson DK, Zhou J: Constructing gene co-expression networks and predicting functions of unknown genes by random matrix theory. BMC Bioinformatics 2007, 8: 299. 10.1186/1471-2105-8-299PubMed CentralView ArticlePubMedGoogle Scholar
- Borate B: Comparative Analysis of Thresholding Algorithms for Microarray-derived Gene Correlation Matrices. In Master's thesis. The University of Tennessee; 2008.Google Scholar
- Lai LC, Kosorukoff AL, Burke PV, Kwast KE: Metabolic-state-dependent remodeling of the transcriptome in response to anoxia and subsequent reoxygenation in Saccharomyces cerevisiae. Eukaryotic Cell 2006, 5(9):1468–1489. 10.1128/EC.00107-06PubMed CentralView ArticlePubMedGoogle Scholar
- SGD Project: Saccharomyces Genome Database.[http://www.yeastgenome.org]
- Newman MEJ, Girvan M: Finding and evaluating comunity struture in networks. Physical Review E 2004., 69(026113):
- R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2008.Google Scholar
- Gautier L, Cope L, Bolstad BM, Irizarry RA: affy – analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20(3):307–315. 10.1093/bioinformatics/btg405View ArticlePubMedGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 2004, 5: R80. 10.1186/gb-2004-5-10-r80PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.