Global similarity measures such as Euclidean distance or the Pearson correlation coefficient may not always capture true gene-gene relationships [34]. In addition, most existing techniques place little emphasis on pattern matching based on local similarity, even though genes have been observed to share local rather than global functional similarity in their expression profiles [8]. Most existing techniques are also computationally expensive. In this section, we develop an approach based on local expression pattern similarity to construct co-expression networks with signed edges that represent regulatory relationships among genes. In general, comparing pair-wise gene profiles requires multiple passes over the database, which is often expensive, especially for datasets with large numbers of genes. In this work, we perform pair-wise comparison using a one-pass approach and compute supports in a single scan of the dataset. Pairs of genes showing similarity above a user-defined threshold θ are used to construct the adjacency matrix, which is used in turn to construct and visualize the network. A preliminary version of this work can be found in [35].
To capture the patterns in an expression profile, we consider the line between two consecutive expression values, termed an edge. Thus, for an expression dataset with M conditions or time points, there are (M − 1) edges. To represent an edge we use two measures: its degree of fluctuation and its regulation pattern. The degree of fluctuation of an edge is the angular deviation of the edge on the 180-degree normal plane. The regulation pattern records whether the edge is up- or down-regulated. The method is discussed in detail below.
Capturing expression patterns
We now discuss the preprocessing steps involved in capturing the degree of fluctuation and regulation pattern information for each expression profile. We compare two gene expression profiles simultaneously in terms of the degree of fluctuation [36] and the pattern of regulation between two adjacent conditions (edges) [26]. To capture both the regulation pattern and the degree of fluctuation of each gene, we read each row of the original data, with M expression values or conditions, and convert it into a row of (M − 1) columns, each of which contains the degree of fluctuation and the regulation pattern of an edge between two adjacent conditions. We represent regulation information as 1 and -1 to denote up-regulation and down-regulation, respectively. The regulation value of the kth edge of a gene Gi, Gi(rk), based on two consecutive conditions (say, Ok−1 and Ok), is calculated as:
(4)
To calculate the degree of fluctuation for the kth edge of Gi, Gi(ak), we compute the arctangent between the two adjacent expression levels (Ok−1, Ok) corresponding to the kth edge. We use the two-argument arctangent function arctan2. Unlike the single-argument arctangent, it uses the signs of both inputs to return the angle in the appropriate quadrant. Since arctan2 returns a value in the range −π to π, we convert the angle to the 180-degree plane as follows:
(5)
This transformation is illustrated in Figure 7 using a gene expression dataset with a single gene, G = {343, 314, 409}, with three expression values. After the values are transformed into angular deviation and regulation pattern, the profile becomes G = {138, −1; 52, 1}.
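The worked example can be reproduced with a short Python sketch. It is not the authors' implementation: it assumes that the angle is arctan2(Ok, Ok−1) expressed in degrees, that a falling edge is reflected to 180 minus that angle, that angles are rounded to whole degrees, and that a flat edge counts as up-regulated. These choices were made only because they match the example above for positive expression levels; Equations (4) and (5) remain the authoritative definitions, and the function name `encode_profile` is hypothetical.

```python
import math


def encode_profile(values):
    """Convert M expression values into (M - 1) edges of (angle, regulation)."""
    edges = []
    for prev, curr in zip(values, values[1:]):
        # Regulation: -1 for a falling edge, +1 otherwise
        # (treating a flat edge as +1 is an assumption).
        regulation = -1 if curr < prev else 1
        # Two-argument arctangent of the two adjacent levels, in degrees (assumed form).
        angle = math.degrees(math.atan2(curr, prev))
        # Reflect falling edges so that every angle lies on the 0-180 degree plane.
        if regulation == -1:
            angle = 180.0 - angle
        edges.append((round(angle), regulation))
    return edges


print(encode_profile([343, 314, 409]))  # [(138, -1), (52, 1)]
```

Under these assumptions the sketch returns one (angle, regulation) pair per edge, so a profile with M values yields M − 1 pairs.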
To formulate the pattern-similarity-based co-expression network construction problem, we define the following terms based on the angular deviation and regulation pattern of a gene expression profile.
Terminologies used
Let G = {G1, G2, · · · , GN} be the set of N genes and T = {T1, T2, · · · , TM} be the set of M conditions or time points of a microarray dataset. The gene expression dataset D is represented as an N × M matrix DN×M, where each entry di,j corresponds to the logarithm of the relative abundance of mRNA of a gene. The following definitions and lemmas provide the theoretical basis for the proposed GeCON algorithm.
Definition 1 (Pattern Similarity). Given the degrees of fluctuation A = {a1, a2, · · · , aM−1} and the regulation patterns R = {r1, r2, · · · , rM−1} of a gene, derived from its expression profile, the kth expression patterns Gik and Gjk of two genes Gi and Gj are similar if the difference between the degrees of fluctuation of the two genes' kth edges, Gi(ak) and Gj(ak), is less than a given threshold τ.
In calculating the similarity between two genes on a particular edge, we consider two patterns: positive similarity, Pos_sim, when the regulation patterns of the two genes are the same (the case of up-regulation), and negative similarity, Neg_sim, when the patterns are inverted (the case of down-regulation). The two similarities are defined as follows:
(6)
(7)
where Gi(rk) and Gj(rk) are the regulation values of the kth edges of genes Gi and Gj, respectively. In the case of Neg_sim, we subtract 180 from the sum of the degree of fluctuation values of Gi and Gj to keep the difference in the range of 0 to 180.
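The two edge-level tests described above can be sketched in Python as follows. The sketch assumes that each test returns 1 for a match and 0 otherwise, that the comparison with τ is strict ("less than", as in Definition 1), and that an edge is stored as an (angle, regulation) pair; the function names are hypothetical and Equations (6) and (7) remain the reference.

```python
def pos_sim(edge_i, edge_j, tau):
    """1 if both edges share the same regulation and their angles differ by less than tau."""
    (a_i, r_i), (a_j, r_j) = edge_i, edge_j
    return int(r_i == r_j and abs(a_i - a_j) < tau)


def neg_sim(edge_i, edge_j, tau):
    """1 if the regulations are opposite and |a_i + a_j - 180| is less than tau."""
    (a_i, r_i), (a_j, r_j) = edge_i, edge_j
    return int(r_i != r_j and abs(a_i + a_j - 180) < tau)


print(pos_sim((52, 1), (55, 1), tau=5))    # 1, since |52 - 55| = 3
print(neg_sim((138, -1), (45, 1), tau=5))  # 1, since |138 + 45 - 180| = 3
print(neg_sim((138, -1), (52, 1), tau=5))  # 0, since |138 + 52 - 180| = 10
```

Because the regulation check is part of both tests, at most one of the two functions can return 1 for a given pair of edges.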
Definition 2 (Support). Support is the ratio of the number of edges on which genes Gi and Gj are similar to the total number of edges, i.e., (M − 1). We consider both positive and negative supports, which measure the proportion of edges on which the two genes have similar or inverted pattern tendencies, respectively. The formulas are given below.
(8)
(9)
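As an illustration of Definition 2, the following Python sketch counts matching edges and divides by the number of edges. It assumes the edge-level tests described for Pos_sim and Neg_sim (same regulation with an angle difference below τ, or opposite regulation with |ai + aj − 180| below τ) and profiles stored as lists of (angle, regulation) pairs; the function name `supports` and the sample values are hypothetical.

```python
def supports(edges_i, edges_j, tau):
    """Return (pos_support, neg_support) for two encoded profiles of equal length."""
    pos = neg = 0
    for (a_i, r_i), (a_j, r_j) in zip(edges_i, edges_j):
        if r_i == r_j and abs(a_i - a_j) < tau:
            pos += 1  # similar pattern on this edge
        elif r_i != r_j and abs(a_i + a_j - 180) < tau:
            neg += 1  # inverted pattern on this edge
    n_edges = len(edges_i)  # M - 1
    return pos / n_edges, neg / n_edges


g1 = [(138, -1), (52, 1), (60, 1), (120, -1)]
g2 = [(140, -1), (50, 1), (118, -1), (61, 1)]
print(supports(g1, g2, tau=5))  # (0.5, 0.5)
```

In this made-up pair, the first two edges match positively and the last two match as inverted patterns, so each support is 2/4.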
Definition 3 (Strongly Connected). Two genes Gi and Gj are said to be StronglyConnected (or to have an inter-relationship) if Pos_support(Gi, Gj) + Neg_support(Gi, Gj) > θ, where θ is a user-defined threshold specifying the minimum fraction of edges on which the two expression profiles must match.
Definition 4 (Co-expression Network). A co-expression network is a graph T = {G', E} containing a subset of genes that are strongly connected. If two genes (Gi, Gj) ∈ G' are connected by an arc Eij ∈ E, then Gi and Gj are strongly connected to each other. Here, E = {(Eij, Sk), · · · , (Emn, Sk)} is a set of pairs, where Eij represents an arc connecting Gi and Gj, and Sk represents the sign of the arc Eij. A value of Sk = +1 indicates up or positive regulation, and -1 indicates down or negative regulation. To calculate the value of Sk for the arc Eij, we use Pos_support and Neg_support. This is defined as:
(10)
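Definitions 3 and 4 can be combined into a small decision rule, sketched here in Python. The sketch assumes that the sign of an arc is taken from whichever support is larger and that a tie is resolved as +1; neither detail is confirmed by the text, so Equation (10) remains the reference. The function name `arc_sign` and the use of 0 for "no arc" are illustrative choices.

```python
def arc_sign(pos_support, neg_support, theta):
    """Return +1 or -1 for a strongly connected pair, and 0 when there is no arc."""
    if pos_support + neg_support <= theta:
        return 0  # the pair is not strongly connected (Definition 3)
    # Sign from the dominant support; resolving a tie as +1 is an assumption.
    return 1 if pos_support >= neg_support else -1


print(arc_sign(0.7, 0.1, theta=0.6))  # 1
print(arc_sign(0.1, 0.7, theta=0.6))  # -1
print(arc_sign(0.2, 0.2, theta=0.6))  # 0
```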
Lemma 1. For any two genes Gi and Gj, if Gi ∈ T, a gene co-expression network, and Gi is strongly connected to Gj, then Gj ∈ T.
Proof. The lemma can be proved by contradiction. Assume that Gi and Gj are two strongly connected genes with Gi ∈ T but Gj ∉ T. As per Definition 4, T is a subset of strongly connected genes, and since Gi and Gj are strongly connected, Gj ∈ T, which is a contradiction; hence the proof. □
Similarly, the following lemma follows trivially from Definitions 1 through 4 and Lemma 1.
Lemma 2. Let Gi and Gj be two genes, and T1 and T2 be two gene co-expression networks. If Gi ∈ T1 and Gj ∈ T2, then Gi and Gj are not connected.
Lemma 3. Genes belonging to the same gene co-expression network are co-regulated or similar.
Proof. This lemma can also be proved by contradiction. Assume that two genes Gi, Gj ∈ T are not co-expressed. Since Gi and Gj are in the same network, they are strongly connected (as per Definitions 3 and 4). However, any two strongly connected genes are similar or co-expressed (as per Definitions 1 through 3), which contradicts the assumption; hence the proof. □
Similarly, the proof of the following lemma (the reverse case of Lemma 3) is trivial.
Lemma 4. Genes belonging to different gene networks are not co-expressed.
Construction of co-expression network
This section discusses how pair-wise support between genes is counted in only one pass over the database to construct the co-expression network of connected genes. We use a correlogram matrix approach [37] to compute the similarity between two target genes based on their degrees of fluctuation and regulation patterns. The similarity values are then used to calculate the support values needed to construct the co-expression network. We first invert the preprocessed database obtained using the above technique, placing edges as rows and genes as columns. We then read each row from the database and check, using (6) and (7), whether two genes (say, Gi and Gj) satisfy the similarity criterion in terms of degree of fluctuation and regulation information. If the two genes are similar, the content of the correlogram matrix cell with index (i, j) is incremented. This step is repeated for all pairs of genes in the row, and the process continues until all rows have been processed.
The support count of each gene pair can be read directly from the correlogram matrix. Using these support counts, we identify all strongly connected gene pairs, i.e., those that satisfy the threshold θ. Based on all strongly connected pairs, the adjacency matrix is computed as:
(11)
where 0 indicates the lack of any relation between the genes. A gene co-expression network connecting various genes is constructed using the adjacency matrix.
Our approach is advantageous because (i) it requires only a single scan over the database; (ii) it is therefore faster than multi-pass pair-wise comparison; (iii) it does not use any standard proximity measure; (iv) since it is pattern based, it is insensitive to normalization, as normalized data retain the pattern or tendency of the original data; and (v) it does not require any discretization step in which continuous values are mapped into pre-specified intervals or classes. The preprocessing steps discussed above only produce an internal representation of an expression profile in terms of angular deviation and regulation pattern. At first glance, the regulation pattern calculation may look like a discretization step. However, the regulation values, +1 and -1, are simply a symbolic representation of the upward or downward inclination of an edge between two consecutive expression values; they help only in choosing the appropriate pattern matching formula and in calculating Pos_support and Neg_support. No information is lost during the conversion.
GeCON: the algorithm
The steps in GeCON are given in Algorithm 1. Step 1 of the algorithm is dedicated to the first phase of the approach, i.e., preprocessing the dataset D into D'. Step 2 deals with the construction of the correlogram matrix. In step 3, all connected genes are extracted and the adjacency matrix is constructed. Finally, the algorithm returns the adjacency matrix A.
input : D (Expression Dataset), τ (Similarity threshold), θ (Support threshold)
output: A (Adjacency matrix)
1 Preprocess original database D to D' wrt. τ;
2 Generate correlogram matrix from D';
3 foreach gene pair (Gi, Gj) ∈ D' do
4 Compute all connected gene pairs by using support count from the correlogram matrix wrt. θ;
5 Construct adjacency matrix A using all connected genes with regulation information;
6 end
7 Return A;
Algorithm 1: The GeCON Algorithm
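The three phases of Algorithm 1 can be put together in a compact Python sketch that starts from raw expression values. It is a hedged illustration, not the authors' code. The encoding in step 1 (arctan2 of the two adjacent levels in degrees, reflected for falling edges, left unrounded) is an assumption chosen to agree with the worked example for positive values; the correlogram is stored as one [positive, negative] counter per unordered gene pair; and ties in the sign are resolved as +1. The names `preprocess` and `gecon_sketch` and the sample data are hypothetical.

```python
import math
from itertools import combinations


def preprocess(dataset):
    """Step 1: encode each profile as (angle, regulation) edges (assumed encoding)."""
    encoded = []
    for values in dataset:
        edges = []
        for prev, curr in zip(values, values[1:]):
            reg = -1 if curr < prev else 1
            ang = math.degrees(math.atan2(curr, prev))
            edges.append((180.0 - ang if reg == -1 else ang, reg))
        encoded.append(edges)
    return encoded


def gecon_sketch(dataset, tau, theta):
    encoded = preprocess(dataset)
    n, n_edges = len(encoded), len(encoded[0])
    pairs = list(combinations(range(n), 2))
    # Step 2: correlogram with one [pos, neg] counter per unordered gene pair.
    counts = {pair: [0, 0] for pair in pairs}
    for row in zip(*encoded):  # one pass over the edge-major (inverted) data
        for i, j in pairs:
            (a_i, r_i), (a_j, r_j) = row[i], row[j]
            if r_i == r_j and abs(a_i - a_j) < tau:
                counts[(i, j)][0] += 1
            elif r_i != r_j and abs(a_i + a_j - 180.0) < tau:
                counts[(i, j)][1] += 1
    # Step 3: signed adjacency matrix from the support counts.
    adj = [[0] * n for _ in range(n)]
    for (i, j), (pos, neg) in counts.items():
        if (pos + neg) / n_edges > theta:
            adj[i][j] = adj[j][i] = 1 if pos >= neg else -1
    return adj


data = [
    [343, 314, 409],  # profile from the worked example
    [686, 628, 818],  # same profile scaled by 2, so the angles are identical
    [314, 343, 314],  # rises where the first profile falls, and vice versa
]
for row in gecon_sketch(data, tau=15, theta=0.6):
    print(row)
# [0, 1, -1]
# [1, 0, -1]
# [-1, -1, 0]
```

The dictionary holds N × (N − 1)/2 counters, one per gene pair, and the data are scanned once, edge by edge.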
Complexity analysis
GeCON uses a correlogram matrix to store the support for pairs of genes. Thus, for N genes, GeCON requires a fixed amount of memory of size N × (N − 1)/2. GeCON needs time for preprocessing and for network construction using the correlogram matrix. For a dataset with N genes and C conditions, the preprocessing step requires O(N × C) time, and transposing the preprocessed data requires O(C × N) time. To construct the network, GeCON traverses the correlogram matrix; thus, the time required for network construction is O(N × (N − 1)/2). The total computational cost of GeCON is: