Gene differential coexpression analysis based on biweight correlation and maximum clique

Differential coexpression analysis usually requires the definition of 'distance' or 'similarity' between measured datasets. Until now, the most common choice is Pearson correlation coefficient. However, Pearson correlation coefficient is sensitive to outliers. Biweight midcorrelation is considered to be a good alternative to Pearson correlation since it is more robust to outliers. In this paper, we introduce to use Biweight Midcorrelation to measure 'similarity' between gene expression profiles, and provide a new approach for gene differential coexpression analysis. Firstly, we calculate the biweight midcorrelation coefficients between all gene pairs. Then, we filter out non-informative correlation pairs using the 'half-thresholding' strategy and calculate the differential coexpression value of gene, The experimental results on simulated data show that the new approach performed better than three previously published differential coexpression analysis (DCEA) methods. Moreover, we use the maximum clique analysis to gene subset included genes identified by our approach and previously reported T2D-related genes, many additional discoveries can be found through our method.

Background DNA Microarray has been widely used as measurement tools in gene expression data analysis [1][2][3][4]. Gene expression profiling data from DNA microarray can detect the expression levels of thousands of genes simultaneously. Which provide an effective way for mining diseaserelated genes nalysis of gene expression data can be divided into three levels: firstly, analysis the expression level of individual genes, and to determine its function based on gene expression level changes under different experimental conditions. For example, the tumor type specific genes are identified according to the significance of difference in gene expression using the statistical hypothesis testing analysis method. Secondly, study gene interaction and co-regulation through the combination of genes and grouping. Finally, attempt to deduce the potential gene regulatory networks mechanism and explain the observed gene expression data.
Among the microarray data analysis methods, gene differential expression analysis is one of the most widely used types of analysis for disease research. Gene differential expression analysis method selects differentially expressed genes according to expression change value of a single gene. Gene expression value change between normal samples and disease samples can be used to present the possibility of the relation between gene and disease. However, the traditional pathogenicity genes selection methods based on gene expression data treats each gene individually and the interaction between them is not considered. Actually, genes and their protein products do not perform their functions in isolation [5,6], but in cooperation. Functional changes such as alteration in tumor cell growth process, energy metabolism and immune activity are accompanied with gene coexpression changes. Differentially expressed genes selection methods often focus only on the size of the single gene and disease relation, ignoring a plurality of pathogenic genes of the complex disease as a gene module with disease related, as well as within the module gene [7]. Differential coexpression analysis, as a more comprehensive technique to the differential expression analysis, was raised to research gene regulatory networks and biological pathways of phenotypic changes through measure gene correlation changes between disease and normal conditions. Differential coexpression genes are defined as genes whose correlated expression pattern differs between classes [8]. The gene coexpression changes between different conditions indicate gene regulatory pathways and networks associated with disease. In gene differential coexpression analysis, a pair of gene expression datasets under disease and normal conditions are transformed to a pair of coexpression matrix in which links represent transcriptionally correlated gene pairs, and then the differential coexpression score is calculated for each gene [9].
Until now, methods for differential coexpression analysis of gene expression data have been extensively researched, and multiple algorithms have been developed and tested [10][11][12][13]. Carter [10] mined the molecular characteristics of the cell state through gene coexpression topology method. Stuart et al. [14] and Bergmann et al. [15] separately constructed the gene coexpression network that connected genes whose expression profiles were similar across different organisms. They showed that functionally related genes are frequently coexpressed across organisms constituting conserved transcription modules [5]. Graeber [16] and Choi [5] both studied cancer from the perspective of differential coexpression. They found some genes were not be detected from the perspective of gene differential expression analysis. Butte [17] found gene coexpression modules based on a new gene expression similarity measure method, i.e., mutual information. Varadan [18] searched for disease-related gene differential coexpression modules from all gene subsets by entropy minimization and Boolean reduction methods (EMBP). Bansal [19], Della Gtta [20] and Lorenz [21] used linear regression method to excavate relation of gene transcription and regulation separately. In those gene differential coexpression analysis methods, the most common choice of similarity measurement is Pearson correlation coefficients [5,11,12,22]. However, Pearson correlation is sensitive to outliers. Biweight midcorrelation (bicor) is considered to be a good alternative to Pearson correlation since it is more robust to outliers [23].
Graph theoretical concepts are useful for description and analysis of interaction and relationships in biological systems. The maximum clique problem (MCP) is a classical combinatorial optimization problem in graph theory. In 1957, Harary and Ross first proposed the deterministic algorithm to solve the maximum clique problem [24]. Since then many researchers have presented a variety of algorithms to solve this problem. The maximum clique problem is widely used in different areas, such as signal transmission, computer vision, and biological research etc. In this paper, we will use the concept of maximum clique to further investigate the identified gene set to gain insight into coexpression relationship between genes. For the sake of convenience, we use the terms graph and network interchangeably, the former stressing the mathematical concept, the latter the application. A graph consists of a set of nodes and a set of edges that connect the nodes. For a graph G = (V, E), the graph G is specified by the set of nodes V which also means genes in gene coexpression network, and the set of edges E which also represents gene coexpression relationships in gene network. A maximum clique means a clique which is a subset of the nodes in V that every pair of nodes in the subset is joined by an edge and is not a proper subset of any other clique.
However, the requirement of complete connectivity for gene maximum clique is restrictive. For real biological data network, its data size is large and has very complex relationships between the network nodes. When dealing with imperfect systems or with experimental data, we may need to consider more general notions of cohesive subgroups [25]. Our description here follows that of [26], they consider different notions of cohesive subgroups that include n-clique, k-plexes. In the analysis of gene expression data, genes closely linked functional module is not the strict sense of maximum clique due to the lack of certain section. We use density to measure approximation degree of functional module with maximum clique, making it more biological significance. In our study, we used the maximum clique concept to mine disease-related differential coexpression gene cluster.
In this paper, we propose a new approach for gene differential coexpression analysis based on Biweight Midcorrelation and half-threshoding strategy. Biweight midcorrelation is used to measure coefficients between the all gene pairs in normal condition and disease condition separately. The two gene correlation datasets are encoded into a pair correlations matrix over all gene pairs. We then filter out non-informative correlation pairs using the half-thresholding strategy and calculate the differential coexpression value of gene. We apply the new approach to a simulate dataset and a pair of type 2 diabetes(T2D) in rats datasets. Moreover, the maximum clique analysis are used to analyse the gene subset identified by our new approach and previously reported T2D-related genes in the dataset. In the light of this observation, we are confident that our method has a high potential for generating relevant hypotheses in biological and clinical research.

Biweight midcorrelation
Differential coexpression analysis usually requires the definition of 'distance' or 'similarity' between measured datasets, and the most common choice is Pearson correlation coefficient. However, Pearson correlation coefficient is sensitive to outliers. Biweight midcorrelation is considered to be a good alternative to Pearson correlation since it is more robust to outliers [23].
In Figure 1, for each sample Z, we measure the expression levels of p genes, let X ij denotes the expression level of the jth gene in the Z i sample, where j = 1,... p. The x th column vector of matrix represents the expression profile of the gene X. In order to define the biweight midcorrelation(bicor) [23] of two numeric vectors x = (x 1 , ..., x m ) and y = (y 1 , ..., y m ), we first defines v i, v i with i = 1,...,m: Where med(x) is the median of x, mad(x) is the median absolute deviation of x, mad(x) is the median of new numeric vector which each number is absolute difference between original vector value and med(x), this lead us to the definition of mad(x) and weight w i for x i, which are, Where the indicator I(1 − |u i |) takes 1 if 1 − |u i | > 0 and 0 otherwise. Thus, the weight w (x) i is close to 1 if x i is close tomed(x), approaches 0 when x i differs from by nearly 9mad(x), and is 0 if x i differs from med(x) by more than 9mad(x). An analogous weight w (x) i can be defined for y i. Given the weights, we can define biweight midcorrelation of x and y as: It should be noted that the equations of biweight midcorrelation does not invole an explicit identification of outliers, and all elements whose weight w i = 0 can be considered outliers. The user can also set up the maximum allowed proportion of outliers using the argument "maxPOutliers", the "max POutliers" is interpreted as the maximum proportion of low and high outliers separately. The value of bicor ranges from -1 to 1. Where -1 represents the maximum negative correlation and 1 represents the maximum positive correlation. Zero represents irrelevant.
'Half-thresholding' strategy in constructing gene coexpression networks Gene expression data has the characteristic of small samples and large number of genes, and contains noise and unrelated genes. Therefore need to use the appropriate strategy to extract disease-related genes. There are currently two accepted strategies, namely hard-thresholding and softthresholding, for inferring gene coexpression network from original gene coexpression values. Those strategies can remove noise and irrelevant genes effectively. However, hard-thresholding ignores continuous nature of the coexpression information and encodes gene connections in a binary fashion, dichotomizing the continuous correlation values to be coexpression and non-coexpression. It is sensitive to the choice of the threshold and may be result in the loss of co-expression information.
The soft-thresholding keeps all possible coexpression relationships and uses the power b (i.e. soft-threshold) to emphasize the original high coexpression values and reduce the original low coexpression values simultaneously. Although soft-threshold overcomes the disadvantages of the hard-threshold, it keeps noisy variations and unrelated gene information in its calculation. These interference information lower the accuracy of gene differential coexpression analysis, especially when soft-threshold strategy uses a low value as the powerb. During our gene differential coexpression analysis, pair of gene expression datasets under disease and normal conditions are transformed to a pair of coexpression matrix. We calculate bicor coefficients over all gene pairs in each dataset. We use m ij to denote bicor coefficient between genei and gene j under normal condition, and n ij to denote bicor coefficient under disease condition. The 'half-thresholding' strategy [17] keep coexpression value in both coexpression matrix if at least one of the two coexpression values exceeds the threshold. For example, we keep m 12 and n 12 if they both exceed threshold value 0.4. In this way, we ignore 'non-informative relationship' whose correlation values in both networks are below the threshold and filer the gene pair, but thoroughly examine the possibly meaningful coexpression changes of values remaining in the two coexpression matrix.
The 'biweight midcorrelation and half-thresholding' method (BMHT) In our method, for each dataset, we calculate the biweight midcorrelation coefficients between the expression profiles of all gene pairs in normal condition and disease condition separately. The biweight midcorrelation coefficients matrix represents the original correlation structure in each condition. After calculated biweight midcorrelation coefficients of all gene pairs, the two datasets are encoded into a pair correlations matrix over all gene pairs. We then filter out non-informative correlation pairs using the half-thresholding strategy. This results in a new coexpression networks of gene set.
In gene expression data, for genei, the biweight midcorrelation coefficients between it and its N neighbors in the filtered set can be calculated from two vectors, i.e., X=(x i1 ,x i2 ,...,x in ) and Y=(y i1 ,y i2 ,...,y in ) for the two conditions. We calculate the differential coexpression value of gene iusing the following equation.
This calculates the average coexpression change between a gene and its informative coexpression genes. Then we can use the dc values to rank genes. Naturally, the question arises, i.e., whether our findings are artifacts of the high dimensionality and low sample of the data? To assess this question, we apply permutation test to evaluate the statistical significance of gene differential coexpression value. Under the null hypothesis, we assume that all genes are mutually independent in both conditions. During the permutation test, we firstly randomly permute the disease and normal conditions of the samples M times, then calculate new Biweight Midcorrelation coefficents using 'half-thresholding' strategy based on the new values, finally calculate the dc statistics. For gene set c, the permutation p-value is: Here I(*) is an indicator function. If the absolute value of the dc of the permuted experimental matrices is larger than that of the original dc, I = 1. Otherwise, I = 0. The T m 1 and T m 2 denote samples derived from the m-th permuted dataset. An estimated FDR is obtained by converting the p-values to q-values using Benjamini-Hochberg method [29]. The p-value for each gene can then be calculated. In our study, we considered M = 1000. As this method is based on the Biweight Midcorrelation and Half-thresholding, it is denoted as BMHT in this paper.

The maximum clique analysis
More and more researchers realized that gene module is high related with disease, but not individual gene. In gene expression network, gene is only related with other genes. Based on the characteristic of no self-loop, the graph of gene coexpression network is a simple undirected graph, and the diagonal elements of gene coexpression matrix are all 0. The gene coexpression matrix is a square and symmetric matrix whose rows and columns correspond to the genes and whose element A ij denotes the coexpression relationship between genes. The graph of maximum clique network is a complete graph that every pair of nodes is joined by edge, and the adjacency matrix elements of the complete graph are all 1 except the diagonal elements. For a simple undirected graph G containing N nodes, its adjacency matrix A = (a ij ) N×N contains only 1 and 0. It is a square and symmetric matrix obviously. a ij = 1 represents that gene i and j is coexpressed, a ij = 0 means that gene i and j is not connected.
We set two thresholds T 1 for adjacency matrix A 1 in normal condition and T 2 for adjacency matric A 2 in disease condition. A 1 (i,j) set to 1 if the value of A 1 (i,j) greater than or equal to T 1 , otherwise, A 1 (i,j) set to 0. A 2 (i,j) set to 1 if value of A 2 (i,j) less than or equal to T 2 , otherwise, A 2 (i,j) set to 0.We integrated A 1 and A 2 into a matrix A after we had intersection the corresponding elements of A 1 and A 2 . A (i,j) = 1 means coexpression value of gene i and gene j in A 1 greater than or equal to T 1 , and coexpression value of gene i and j in A 2 less than or equal to T 2 . Equation 6 summarized the process. We excavated cliques which have biological significance from A adjacency matrix to further investigate gene regulatory networks.

Experiment result on simulate datasets
In this experiment, we analyzed a pair of simulated datasets used in a published study [27], which were generated based on two yeast signaling networks using SynTReN [28]. The simulate datasets consists of 20 genes, 50 samples in normal and disease conditions. MBP1_SWI6, PHO2, CLB5, TRP4, CLB6, FLO1, FLO10 were identified as differential coexpression genes. We evaluated BMHT method in terms of its capability to discover the differential coexpression genes from the simulated datasets, and compared it with methods, i.e., 'Log Ratio of Connection'(LRC), 'Average Specific Connection'(ASC), and 'Weighted Gene Coexpression Network Analysis'(WGCNA). We adopted the signed version of WGCNA and set the parameter β = 12 [22]. The results are listed in Table 1. From Table 1 it can be seen that, the BMHT method can detected all seven differential coexpression genes and ranked them at top, while the other three methods cannot detect them accurate. Bold shown genes refers to the seven differential coexpression genes in the simulate datasets. We arranged the gene in accordance with the BMHT value.

Analyzing a Type 2 Diabetes (T2D) in rats
In this section, we apply the BMHT method to a pair of type 2 diabetes(T2D) in rats datasets (dataset pair T), which has been published in study [22]. Dataset pair T from dataset GSE3068 of Gene Expression Omnibus (GEO) database, which had been preprocessed by Hui Yu et.al [22]. Dataset pair T includes 4765 genes in 10 disease samples and 10 normal samples. After applied BMHT method to dataset pair T, we obtained 334 differential coexpression genes of 4765 genes, p-values cut-off 0.05, FDR<0.6% (see Additional file 1). Based on the good performance of p-values of most genes, we selected 7% as the differential coexpression genes. The false discovery rate (FDR) is estimated from the p-value of biweight midcorrelation using Benjamini-Hochberg method [29]. In the differential coexpression genes, Rap-gef4 [30] and Notch2 [31] are reported T2D-related genes. We listed all 20 differential coexpression genes with T2D relevance in table 2. Some reported relevance in table 2 are obtained from Kyoto Encyclopedia of Genes and Genomes (KEGG) database, it is a bioinformatics resource for linking genomes to life and the environment. Although the rest genes are not be previously reported to be related with T2D, they should also deserve more attention. It is helpful for researchers to excavate gene modules and disease genes, establish a disease-related gene clusters, and further explore the pathogenesis of the disease and the biological function of the related-gene.

The maximum clique analysis of real gene expression data
In applications, the node and edge sets of the graphs we need to consider are that we interested. In section 3.2, we selected 334 differential coexpression genes (DCGs) based on BMHT method. The type 2 diabetes data contains some previously reported TD2-related genes. DCGs and T2D-related genes in GSE3068 dataset form a total of 595 gene subset K. Gene subset K 1 represents the gene expression value in normal condition and gene subset K 2 represents the gene expression value in disease condition. We got two 595 × 595 symmetric bicor coefficient matrix K 1 and K 2 after we had computed bicor values of every pair gene of gene subset with halfthresholding strategy. K 1 (i,j) means bicor value of i gene and j gene in normal sample and K 2 (i,j) represents bicor value of i gene and j gene in disease sample. In this study, we searched gene modules which have high bicor value in normal samples and low bicor value in disease samples for exploring the impact of disease on the gene coexpression. We set two thresholds T 1 = 0.76 for K 1 and T 2 = 0.2 for K 2 . K 1 (i,j) set to 1 if absolute value of K 1 (i,j) greater than or equal to T 1 , otherwise, K 1 (i,j) set to 0. K 2 (i,j) set to 1 if absolute value of K 2 (i,j) less than or equal to T 2 , otherwise, K 2 (i,j) set to 0. We integrated K 1 and K 2 into a matrix K. The K(i,j) set to 1if the values of K 1 (i,j) and K 2 (i,j) both equal to 1, otherwise, K (i,j) is 0. K is a square and symmetric adjacency matrix with only two different class elements, i.e., 0 and 1.
We analyzed the matrix as the adjacency matrix of graph. Each gene corresponds to one node of graph. K (i,j) = 1 also means node i and j node are connected by edge in correspond graph. We excavate cliques which have biological significance from the K adjacency matrix. Bold shown genes refers to the seven differential coexpression genes in the simulate datasets. We arranged the gene in accordance with the BMHT value.
and calculate the sum of each row or column of the K matrix. Which represents the number of edges that a gene connected to other genes. In order to improve the efficiency of search, we delete those isolated points whose numbers are 0 and set the minimum number of clique genes as 4. We mined 7 cliques all include 4 genes. The 7 cliques are combined into a gene module which includes 8 genes and 19 edges. The complete graph edge number of the gene function module is c 2 8 , and the density is 0.68. We listed the genes of each clique in table 3. The result is shown in Figure 2. Bold shown genes refers to the four DCG selected in the GSE3068 dataset based on BMHT method. The other genes refer to DCG. Figure 2 is the gene module. Rpl9, Polr2f, Pxmp3 and Ctsd are DCGs. Smarca4, Sirt2 and Prkaca are T2D-related genes. Tsc2 is DCG and T2Drelated gene.
However, it is not easy to determine the optimal threshold for each specific study. Too large T 1 or too small T 2 will lead to small number edges and low density of the adjacency matrix K, which corresponding to a graph, and fail to find clique which meet the requirements. On the contrary, too small T 1 or too large T 2 will lead to overlapping cliques. These two cases have no sense for the analysis of biological process. So further investigation on optimizing thresholds procedure is necessary. In fact, the threshold can be determined based on the proportion of isolated points or density of the graph. The density is defined as the ratio of number of edges to the maximum number of edges. The maximum number of edges is the edge number of complete graph.

Conclusion
In this paper, we proposed a new approach for differential coexpression analysis, which combine Biweight Midcorrelation and half-thresholding strategy and also applied maximum clique analysis to the specific gene set to further investigate gene regulatory networks. Biweight Midcorrelation is more robust for outliers and half-thresholding is an effective preprocess step of the proposed method.  Bold shown genes refer to the four DCG selected genes in the GSE3068 dataset based on BMHT method. The other genes are DCG.

Figure 2
Black spots refer to DCGs from GSE3068 dataset based on BMHT method. Gray spots refer to T2D-related genes. White spot refer both DCG and T2D-related gene.
Experimental results on simulate datasets show that our method had better performance than three previsouly proposed methods. We also applied the proposed BMHT method to real dataset designed for T2D study, and 334 differential coexpression genes were selected, which may be a useful resource for T2D study and explore the biological function of the related-gene. In the future, we will focus on how to introduce new measure to scale the similarity of gene pairs.

Additional material
Additional file 1: 334 differental coexpression genes identified by our approach file format: .doc.

Competing interests
The authors declare that they have no competing interests.