In search of functional association from time-series microarray data based on the change trend and level of gene expression
© He and Zeng; licensee BioMed Central Ltd. 2006
Received: 24 June 2005
Accepted: 15 February 2006
Published: 15 February 2006
The increasing availability of time-series expression data opens up new possibilities to study functional linkages of genes. Present methods used to infer functional linkages between genes from expression data are mainly based on a point-to-point comparison. Change trends between consecutive time points in time-series data have been so far not well explored.
In this work we present a new method based on extracting main features of the change trend and level of gene expression between consecutive time points. The method, termed as trend correlation (TC), includes two major steps: 1, calculating a maximal local alignment of change trend score by dynamic programming and a change trend correlation coefficient between the maximal matched change levels of each gene pair; 2, inferring relationships of gene pairs based on two statistical extraction procedures. The new method considers time shifts and inverted relationships in a similar way as the local clustering (LC) method but the latter is merely based on a point-to-point comparison. The TC method is demonstrated with data from yeast cell cycle and compared with the LC method and the widely used Pearson correlation coefficient (PCC) based clustering method. The biological significance of the gene pairs is examined with several large-scale yeast databases. Although the TC method predicts an overall lower number of gene pairs than the other two methods at a same p-value threshold, the additional number of gene pairs inferred by the TC method is considerable: e.g. 20.5% compared with the LC method and 49.6% with the PCC method for a p-value threshold of 2.7E-3. Moreover, the percentage of the inferred gene pairs consistent with databases by our method is generally higher than the LC method and similar to the PCC method. A significant number of the gene pairs only inferred by the TC method are process-identity or function-similarity pairs or have well-documented biological interactions, including 443 known protein interactions and some known cell cycle related regulatory interactions. It should be emphasized that the overlapping of gene pairs detected by the three methods is normally not very high, indicating a necessity of combining the different methods in search of functional association of genes from time-series data. For a p-value threshold of 1E-5 the percentage of process-identity and function-similarity gene pairs among the shared part of the three methods reaches 60.2% and 55.6% respectively, building a good basis for further experimental and functional study. Furthermore, the combined use of methods is important to infer more complete regulatory circuits and network as exemplified in this study.
The TC method can significantly augment the current major methods to infer functional linkages and biological network and is well suitable for exploring temporal relationships of gene expression in time-series data.
Gene expression profiling has gained a tremendous importance in functional genomic research. The presently often used approach is to compare gene expression in discrete time points resulting for example of different genotypes or cell lines, morbid and healthy (control) objects or under different physiological conditions. This type of static gene expression profiling can already give useful information on the patterns of significantly differentiated expression of genes. However, in order to achieve a more complete picture of significantly differentiated gene expression, especially in order to capture and understand the dynamics of the altered gene expression, it is desirable to measure time-series gene expression .
Large efforts have been made in recent years to develop bioinformatics methods to study gene expression patterns [2–5] and/or to infer functional linkages and even regulatory network from microarray data [6–11]. For inference of functional associations among genes two major classes of methods are presently in use. One class of the methods is based on graphical modeling, which include Bayesian network  and Gaussian graphical model . Recently, dynamic Bayesian networks have been proposed to model temporal gene expression and represent a promising direction. However, most of the current work in this area is limited to the analysis of a relatively small set of genes due to computational complexity [9–11]. Another class of methods infers functional association from large-scale gene expression data by defining a statistic threshold for the association. The main measure used for defining association is the Pearson correlation coefficient (PCC) [6, 13, 14]. PCC is widely used for detecting co-expressed genes from both static and time-series expression data. However, several important issues are not specifically addressed in PCC based methods when applied to time-series expression data. A major issue is that the PCC clustering method treats its input as a vector of independent samples and hence doesn't take into account the temporal relationship between consecutive time points. In addition, time-shifted and/or inverted expression of certain gene pairs are not considered. Time-shifted and inverted relationships are important features of gene expression regulation . For example, a gene may activate or inhibit another gene or even several related genes downstream in a regulatory pathway, resulting in time-delayed positive or negative response in the transcription of the downstream gene(s). To consider these phenomena, Qian et al.  proposed a local clustering (LC) method. As demonstrated with the expression profiling data of yeast cell cycle this method can identify new, biologically relevant interactions that could not be found by the conventional PCC clustering method. However, the method of Qian et al.  is principally still based on a point-to-point comparison or local clustering of expression levels of genes although it explicitly considers time-shifted and inverted gene expression profile. Kwon et al.  proposed an 'event-based' edge detection method to consider the change trend of gene expression between consecutive time points. By simplifying a profile of time series into a sequence of decrease or increase events this method is more robust to noises. However, it does not fully make use of the information contained in the gene expression levels in the original data. Filkov et al.  proposed a similar method called 'edge detection'. Recently Balasubramaniyan et al.  proposed a method to use Spearman rank correlation based on the rank of expressional values. The rank of expressional values is more insensitive to noses or outliers but this method is still based on the point-to-point comparison per se.
To more comprehensively consider the temporal relationships of gene expression in time-series microarray data we propose here a new method that is based on extracting the main features of the change trend and the change level of gene expression between consecutive time points. We not only consider the qualitative information (i.e. the change trend) but also the quantitative information (the change level) in the original data. We seek to make the methods more noise-tolerant and at the same time to keep more useful information in the expression values. This new method, termed here as trend correlation (TC), is demonstrated with the microarray data from cell cycle of yeast . The biological significance of functional associations of inferred gene pairs is examined with several large-scale yeast databases. We also extensively compare our method with the LC method and the PCC based clustering method of Eisen et al. . It is shown that a significant number of functionally associated gene pairs, which have well-documented biological interactions and relationships but cannot be significantly detected by the LC and PCC methods, can be inferred by the new method with high statistic significance. The biological significance of the functional association pairs inferred by our method is generally higher than that of the LC method and similar to that of the PCC clustering method which detects however only simultaneous co-expression gene pairs. Furthermore, it is shown that the overlapping of gene pairs detected by the three methods is normally not very high, indicating a necessity of combining the different methods in search of functional association of genes from time-series microarray data.
Principle and scheme of the proposed method
Generating a random dataset by shuffling the normalized expression levels at different time points among each gene expression profile in the original dataset;
Calculating a maximal local alignment of change trend (sc) between each gene pair in the random dataset as illustrated in Fig. 1 for a simple case;
Calculating a correlation coefficient (cc) between the maximal alignment for each gene pair in the random dataset (Fig. 1);
Tabulating the frequency of sc (i.e. f(sc)) as function of sc as shown in Fig. S1A [see Additional file 1]; followed by tabulating the distribution of cc for gene pairs which have the same sc;
Calculating the conventional p-values for the two scores sc and cc (P sc (s >= sc), P cc (c >= cc)) through integration of the frequency distributions (Fig. S1B);
Calculating sc and cc between each gene pair in the original dataset;
Extraction of functional linkages using Procedure I proposed: extract gene pairs with significantly high sc values with a certain preset p-value. The correlation coefficient cc is regarded as a second index when the gene pairs have the same score sc;
Extraction of functional linkages using Procedure II: extract gene pairs with statistically significantly high value of combined scores of sc and cc.
The key steps mentioned above are described in the section of Methods in more detail. The main programs TC_linkage_infer created in this work can be freely downloaded from our website http://www.gbf.de/SystemsBiology.
Dataset of yeast cell-cycle
We tested our algorithm with the time series microarray data (17 time points) generated by Cho et al.  for yeast cell cycle with a whole genome yeast oligonucleotide chip which included over 6000 ORFs. After removing all the negative expression levels in the scaled measurements and all the dubious and genes now deleted in the SGD database , 5680 genes were included in our calculation. We examined all the possible pairs among them. The values of the two scores sc and cc and the type of possible relationship (simultaneous, time-shift or inverted) were calculated and assigned for each gene pair.
Functional associations inferred by different methods
As can be clearly seen in Figs. 2A and 2B, the number of inferred gene pairs depends much on the p-value in all the three methods. In general, the TC method infers a significantly lower number of gene pairs compared to the other two methods. 39.8% and 44.3% of the pairs inferred by the TC method are also found by the PCC and LC methods respectively. The common part of the three methods is somewhat lower and accounts only 12.5% for the LC method, 32.5% for the PCC method and 39.5% for the TC method. The number of gene pairs merely inferred by the TC method is considerable compared with those merely from the LC method (20.5%) and the PCC clustering method (49.6%). By decreasing the p-value threshold to 1E-5, the number of inferred functional pairs decreases remarkably, especially for the TC and LC methods (Fig. 2B). However, the shared part of gene pairs predicted by the TC method increases up to 65.4% and 77.7% compared to the LC and PCC methods respectively, indicating an increased reliability of the prediction (see below for biological significance). The number of additional gene pairs merely inferred by the TC method amounts 1442 and is still significant (Fig. 2B).
Biological significance of the gene pairs inferred
To assess the biological significance of the inferred functional associations, the gene pairs are compared to known biological processes and protein functions, known protein interactions and regulatory interactions in yeast respectively. The results from the three methods are also compared to each other. The results show that the TC method can significantly enhance the LC and PCC methods to infer functional linkages and biological network, and is well suited to explore temporal relationships of gene expression in time-series data as detailed below.
Biological process and protein function-similarity gene pairs
In order to generally assess the biological significance of the gene pairs inferred, we first use two databases of biological processes and protein functions classification (supplementary Table S1) [see Additional file 1]. The S. cerevisiae Genome Database (SGD) mainly utilizes the Gene Ontology (GO) annotations . We use 32 main biological processes (i.e. conjugation). In this work, if two genes in the pair inferred are involved in the same biological process, we consider the gene pair as a process-identity one. 19.9% of the 108489 gene pairs inferred with the TC method at a p-value threshold of 2.7E-3 (Fig. 2A) are found to be process-identity pairs (Fig. 2C). The detailed distribution of the process-identity pairs in each biological process is listed in Table S3. A similar ratio (20.4%) of process homology gene pairs is found for the 174768 gene pairs inferred by the PCC method. Only 14.6% of the 342594 gene pairs inferred by the LC method with the same p-value cutoff are process-identity pairs. If the results inferred by procedure I and procedure II of the TC method are separately considered, 22.9% of the 58376 gene pairs detected by procedure I are process-identity pairs. The percentage (17.8%) of process-identity pairs among the 81271 gene pairs detected by procedure II is slightly lower than that by procedure I but still somewhat higher than that (14.6%) of the LC method. The additional number (Fig. 2C) of process-identity pairs inferred by the TC method is considerable compared to those resulted only from the LC method (17.8%) and the PCC method (32%) respectively.
Among the 6490 gene pairs (Fig. 2B) predicted by the TC method with a p-value threshold of 1E-5 there are 3138 (48.3%) process-identity pairs (Fig. 2D). Separately, only 32.6% of the gene pairs (1400) detected by procedure I are process-identity pairs, in contrast to as high as 50.8% for the gene pairs by procedure II. Considering the results from p-value ≤ 2.7E-3 mentioned above no general conclusion can be drawn with regard to the question which extraction procedure is more relevant. The percentage of the process-identity pairs among the genes pairs resulted from the LC method and the PCC method is 49% and 42% respectively. Thus, the lower p-value threshold can significantly increase the portion of gene pairs involved in the same biological processes in all the three methods. At this low p-value the additional number (584, Fig. 2D) of process-identity pairs merely inferred by the TC method amounts to 12.9% of those only resulted from the LC method. Compared with those resulted by the PCC method this number declines to 286 (3.3%), suggesting that the gene pairs with a higher ranked functional association inferred by the TC method is more similar to those resulted by the PCC method.
With the percentage of process-identity pairs in the range of 14.6–20.4 at p-value ≤ 2.7E-3 the gene pairs inferred by the three methods seem to have a relatively low biological significance. If the common part of the gene pairs inferred by all the three methods is considered (Fig. 2A), the percentage of process-identity pairs (Fig. 2C) increases to 34.8%, resulting in fairly good biological significance for an in silico method of biological function inference. The biological significance can be significantly increased by lowering the p-value. At p-value ≤ 1E-5 the percentage of process-identity pairs ranges from 42 to 49% for the three methods. If the common part of the gene pairs inferred by the three methods at this p-value is considered (Fig. 2B), the percentage of process-identity pairs (Fig. 2D) increases to 60.2%, resulting in a satisfactorily high biological significance.
The second source used in this work for assessing biological relevance of the gene pairs is the Munich Information Center for Protein Sequences (MIPS, ) functional catalogue database. For protein functional classification, the MIPS database contains up to 6 different levels within the hierarchy (i.e. metabolism in the first level). We use here the second level of MIPS (i.e. respiration) as Qian et al.  did. Altogether 158 function classes are used. It should be mentioned that some functional classes merely belong to protein cellular functions of plants and animals. Here if two genes in the pair inferred have the same protein cellular function, we term the pair as a function-similarity pair. The results of comparison among the three methods (Fig. S2) are similar to those of process-identity pairs, and detailed distribution of the function-similarity pairs in each protein cellular function by the trend correlation method is provided in Table S4 [see Additional file 1]. As in the case of function-similarity pairs, if the common part of the inferred gene pairs (Figs. S2A and S2B) from the three methods is considered the percentage of function-similarity pairs (Figs. S2A and S2B) increases to 31.7% and 55.6% at the two p-values respectively, resulting in a good basis for further experimental and functional studies of the gene pairs inferred.
Comparison of inferred gene pairs with known protein interactions
To further assess the biological significance of the gene pairs inferred and especially for comparing the three methods we examine here the known protein-protein interactions (including protein complexes) in current databases of yeast among the gene pairs inferred from the yeast cell cycle data by the different methods. Four databases and published high quality datasets (Table S2) are chosen for this purpose.
The protein-protein interactions collection of Yu et al  integrates datasets from the databases of MIPS , the Database of Interacting Proteins (DIP, ), the Biomolecular Interaction Network Database (BIND, ) and the experimental datasets of yeast two-hybrid [25, 26] and high-throughput mass spectrometry measurements [27, 28]. Many of the interactions are manually curated beyond the experimentally derived protein-protein interactions in the three databases mentioned above .
Results by the trend correlation (TC) method compared to those resulted from the local clustering method (LC) and PCC based clustering method in four protein interactions datasets (p-value threshold 2.7E-3 if not otherwise mentioned).
Collection Dataset with p-value ≤ 1.3E-2
Compared to the LC method
Additional by TC
Compared to the PCC method
Additional by TC
Comparison with known regulatory interactions
To examine the biological significance of functional associations inferred by the different methods it is also interesting to know if the gene pairs inferred cover some of the known regulatory interactions, especially those involved in the regulation of cell cycle. For this purpose, we use two regulatory datasets (Table S2) as a comparison basis. We first use the dataset including regulatory interactions which are confirmed with a p-value threshold of 1e-3 by genome wide location analysis (GWLA) . With a p-value threshold of 2.7E-3, only a relative low number of the regulatory interactions is detected by the LC method (127), the PCC method (47) and the TC (24) (Fig. S3A). Regulatory interactions detected by the TC method are significantly less than those by the other two methods. Nevertheless, the additional number of regulatory interactions detected by the TC method is considerable compared with those only resulted from the LC method (about 12%) and the PCC method (about 37.8%).
Similar results are obtained when the regulatory interaction collection dataset of Luscombe et al.  is used (Fig. S3B). The additional number of interactions predicted by the TC method is 21 (11%) and 22 (about 32%) compared with those resulted only from the LC method and the PCC with a p-value threshold of 2.7E-3. With a p-value threshold of 1.3E-2, the results of comparison are similar to those at a p-value cutoff of 2.7E-3. Some of the typical interactions between transcriptional regulators and target genes which are detected by the TC method but cannot be significantly detected by the LC and/or the PCC method are summarized in Table S5.
A detailed investigation of the results from a p-value threshold of 2.7E-3 shows that the transcriptional factors and/or the targeted genes of 29 [see Additional file 3] among 34 known regulatory interactions inferred by the TC method are involved in cell cycle regulation according to Luscombe et al. . Among the 29 interactions only six interactions have both the activated regulator and the differentially expressed targeted gene according to the gene set of cell cycle regulation given in Luscombe et al. . The fact that 23 of the known interactions have either an inactivated regulator or a non-differentially expressed targeted gene in the cell cycle condition as detected by the TC method with a high p-value cutoff seems to indicate controversies in the designation of activated regulators and differentially expressed genes in the cell cycle condition . For example, the gene FKH2 is annotated in Hollenhorst et al.  as a transcriptional factor of the forkhead family that regulates the cell cycle according to SGD. Furthermore, many interactions known to involve FKH2 are detected by at least one of the three methods. However, FKH2 is not regarded as a regulator in the cell cycle condition by Luscombe et al. .
The inference of regulatory circuit and network
From a practical point of view there are two key general issues in the analysis of gene expression data. First, we ought to infer functional relationships with high statistic significance and as completely as possible. Second, the functional relationships inferred should have a high biological significance. We proposed a new method in this work and evaluated it mainly regarding these two aspects with the microarray data of yeast cell cycle . The new method is also compared with the local clustering method and the Pearson correlation coefficient based clustering method. The number of functional gene pairs inferred depends very much on the p-value threshold in all the three methods (Fig. 2). With the two p-value cutoffs (2.7E-3 and 1E-5) applied the TC method detects general significantly lower number of gene pairs. Nevertheless, a considerable number of gene pairs is only detected by the TC method, ranging from 20.5% of the number merely detected by the LC method to as high as 49.6% of that merely detected by the PCC method at p-value ≤ 2.7E-3. The ratio of additional gene pairs inferred by our method remains similar (22%) at p-value ≤ 1E-5 compared with the LC method but decreases to a relatively lower value (6.7%) compared with the PCC method. This can be explained by the fact that the statistically higher ranked correlations have mostly simultaneous relationships that can be well detected by all the three methods. Since p-value ≤ 2.7E-3 represents a relatively high statistic significance it can be concluded that the new method detects a significantly high portion of additional functional relationships compared to the other two methods. Since the shared part of gene pairs of the three methods at p-value ≤ 2.7E-3 is less than 50% (12.5–39.5% for the individual method pair) it is also obvious that these methods should be combined to have a more complete exploitation of functional associations of genes buried in time-series expression data.
Concerning the biological significance 19.9% and 48.3% of the gene pairs (Figs. 2A and 2B) inferred by our method (at p-value ≤ 2.7E-3 and 1E-5 respectively) are process-identity ones (Figs. 2C and 2D) by comparing with the known biological processes in the S. cerevisiae Genome Database (SGD). This is compared with a process-identity ratio in the range of 14.6–49% for the LC method and 20.4–42 % for the PCC method. Similar results are obtained when the MIPS protein functional catalogue database is used to assess the biological significance (Fig. S2). These results suggest that the gene pairs detected by the three methods achieve fairly good and comparable biological significance. As in the case of the total gene pairs the additional gene pairs with biological significance inferred merely by the trend correlation method is considerable compared to the other two methods (Figs. 2 and S2).
We also examined the gene pairs inferred by the three methods with regard to known protein interactions (Figs. 2E and 2F and Table 1) and known regulatory interactions (Fig. S3) in yeast. In general, the percentage of protein interaction pairs in the overall gene pairs is low and comparable for all the methods (2.5% for our method (3.1% for procedure II), 1.5% for the LC method and 2.7% for the PCC method). This can be understood in view of the fact that only cell cycle data under very specific conditions are used here and the cell cycle represents only a small portion of the cellular activities. Furthermore, only protein interaction pairs with significantly changed expression levels of all the involved partner proteins can be theoretically detected. This applies also to the known regulatory interactions. It is shown that most of the detected regulatory interactions are indeed involved in the cell cycle regulation. Nevertheless, the number of protein and regulatory interactions additionally inferred merely by the trend correlation method is significant in all the cases. It is obvious that more molecular interactions can be obtained if more microarray datasets under different conditions are considered .
Given the large number of functionally associated gene pairs inferred by the different methods an important question arises as to how we can find gene pairs which have really a high biological significance and thus would be best candidates for further experimental and functional studies. To this end, the shared part of gene pairs inferred by all the three methods is of particular interest, especially at low p-values. It is found for example that the percentage of process and function-similarity gene pairs of the shared pairs at p-value ≤ 1E-5 can be as high as 60.2% and 55.6%, building a very good basis for experimental study. The common part of the TC and LC methods would be also of particular interest for finding time-delayed and/or inverted functional relationships which have received so far less attention. It should be mentioned that all the functional associated pairs predicted have a high probability to be true and can thus serve as hypotheses for further study in view of the high statistic significance criteria applied. We would also like to emphasize that the combined use of the different methods is not only useful for finding more and highly possible potential candidates of functional association but also very important to infer more complete regulatory circuits and network as exemplified in this study (Figs. 4, 5 and 4).
The major difference between our method and the other current methods is that the trend correlation method is based on the main change trend and comprehensively considers correlation coefficient between the main change trend of two genes, whereas the other methods are mainly based on the correlation of point-to-point expression levels of two genes. Hence the trend correlation method can reveal additional gene pairs with same function or in the same biological process but yet not significantly co-expressed. It therefore also can infer additional protein-protein and regulatory network as demonstrated. As mentioned above, a combined use of different methods is presently necessary for the analysis of time-series microarray data. As clearly demonstrated in this work the proposed new method can significantly augment the currently major methods and is well suitable for exploring temporal relationships of gene expression in time-series data.
Maximal local alignment of expression change trend
The calculation of the maximal local alignment of expression change trend is similar to the algorithm of local sequence alignment  and the local clustering method of Qian et al.  for gene expression level. Different from the local clustering method that merely compares the expression levels at each time point, the change trends between time points are used in our method. Considering an expression profiling dataset of n time points the expression ratios at the n time points are first normalized in the "z-score" fashion, resulting in an average expression ratio of zero and a standard deviation of 1 for each gene. The normalized expression level at time point i for gene X is denoted as X i . The change level between time points i and i+1 for gene X is denoted as . The change trend for is denoted as · can have only one of the three possible values 1, -1 and 0, corresponding to increase, decrease and no change of gene expression between time points i and i+1 respectively. A matrix of possible alignment between the change trends for gene X and gene Y can then be formed. In our algorithm two matrices P and N are calculated in the following way:
if = , Pi, j= Pi-1, j-1+1; else Pi, j= Pi-1, j-1
if * <0, Ni, j= Ni-1, j-1+1; else Ni, j= Ni-1, j-1
The initial conditions are P0, j= 0 and Pi, 0= 0, and the same initial conditions are also applied to the matrix of N. The purpose of calculating P and N is to find a local segment that has a maximal aggregated score, namely a maximal match of change trend between two expression patterns. This can be accomplished by using standard dynamic programming  as in the local clustering method and results in an alignment of lc matched change trend, where lc ≤ n-1 (altogether n-1 changes among n time points).
Finally, an overall maximal value sc is found by comparing the maximums for matrices P and N. The maximal value is the matched change trend score sc for the two expression patterns. A maximal value from the matrix P means a positively correlated expression pattern of the two genes, whereas a maximal value from the matrix N indicates that these two patterns have an inverted relationship. If the maximum is off-diagonal in its corresponding matrix, then the two expression patterns have a time-shifted relationship. When the number of matched change trends between two genes is relatively small, it is possible that several repeated maximal values exist with different number of shifted-time-points. In this case, we choose the maximal matched change trends with the shortest time shift in the algorithm. Normally these gene pairs are not regarded as inferred linkages because of the small matched change trends based on the p-value threshold.
Correlation coefficient between the maximal matched change trend of two genes
After obtaining the maximal matched change trend, we need to calculate the correlation coefficient of the maximal matched change trend of the gene pair (Fig. 1). It is calculated according to the following algorithm. Assuming that the final matched change trend for gene X and gene Y are and respectively in the maximal alignment, where xf and yf refer to the last time points of genes X and Y in the maximal alignment and we assign xf ≤ yf. We then give a new index for the time points of match trends, where the matched change levels for genes X and Y are now denoted as Xmc and Ymc in the following algorithm:
Do while 2 ≤ i ≤ xf and yf - xf + 2 ≤ j ≤ yf
If sc from the matrix P and = then
If sc from the matrix N and * < 0 then
In the following equation k corresponds to p or n in the above loop.
cc(x, y)is the correlation coefficient between the maximal matched change trends of genes X and Y. and are the mean of the maximal matched change levels of genes X and Y respectively σ Xmc and σ Ymc are the standard deviation of the maximal matched change levels.
In order to estimate the p-value for a given score a dataset of random expression patterns was generated by shuffling the normalized expression levels of the original data at different time points. To assure the reliability of p-value we generated the random expression patterns three times, resulting in calculations for about 4.9e7 gene pairs. Using the algorithm described above we calculated the maximal matched change trend scores sc and the correlation coefficient cc between the maximal matched change trend for each gene pair in the random expression dataset. We first tabulated the frequency (f(sc)) of sc and then the distribution of cc for pairs which have the same sc value. Through integration we could calculate the conventional p-value for the two scores sc (Fig. S1B) and cc (Fig. S1C) (P sc (s >= sc), P cc (c >= cc)). They are defined as the probability of obtaining a score (s or c) larger than or equivalent to sc or cc from the random patterns. The higher the matched change trend score is, the more likely the gene pair is correlated (positively or negatively). The higher the correlated coefficient, the more likely the gene pairs are correlated for a given value of sc.
Extraction procedure I
As shown in Fig. S1B the probability to achieve a high sc score in the random dataset is very low. For example, the p-value for gene pairs with sc ≥ 15 is as low as 1.7e-3 and with sc ≥ 16 as low as 1.1e-4. Hence, if two genes in the original expression data have a sc score ≥ 15 it is very likely that these two genes have a functional association that would be expected with a low probability by chance. For a given significant p-value threshold such as 2.7E-3, because the p-value of gene pairs with sc ≥ 15 is only 1.7e-3 that is lower than the threshold, all the gene pairs with sc ≥ 15 can be regarded as having functional relationships. But P sc for gene pairs with sc ≥ 14 is about 1.27e-2 that is higher than the threshold 2.7E-3, only parts of the gene pairs with sc = 14 should be extracted. Among the gene pairs with sc = 14 the pairs with a higher cc value (Fig. S1C) should have a higher possibility to have a functional association. Considering these facts and the distribution frequency of sc we propose to define the following overall p-value to better reflect the significance statistics for inferring functional associations among genes:
P = P sc (sc + 1) + (P sc (sc) - P sc (sc + 1)) * P cc or P = P sc (sc + 1) + f(sc) * P cc
where if sc is the highest possible score (here 16 in the yeast cell cycle dataset with 17 time points), then P sc (sc + 1) = 0. This overall p-value combines the p-values for both sc and cc and the distribution frequency of sc. If the sc and cc values of a gene pair in original expression dataset result in an overall p-value less than a certain threshold, this gene pair is considered as functionally associated. This extraction procedure is called here Procedure I.
Extraction procedure II
In applying extraction Procedure I we noticed that there are many gene pairs the change trend score (sc) of which is not very high, but the correlation coefficient between the maximal change levels of which is significantly high. These gene pairs will not be extracted according to Procedure I because of the relatively high P sc values at low sc values (Fig. S1B). However, they may in fact also have a high possibility of functional association since the main change levels are correlated well. The difference of some of the change trends from the main trend between the two genes may be caused by the involvement of multi-regulators in some time regions or by measurement errors and expression noises. For these reasons, we propose a second procedure (Procedure II) to extract gene pairs with possible functional relationships. In Procedure II, if the correlation coefficient cc is not considered, the likelihood of functional association of all the pairs increases with increasing sc if sc is higher than or equal to a characteristic score scm. This characteristic score scm is defined as the sc that has the highest frequency in the random dataset (Fig. S1A, here scm 10). If sc is smaller than scm, the frequency of pairs with sc reduces along with the decrease of sc, as does also the possibility for a functional association of the gene pair. Because the possibility for functional association should have a reverse relationship with the frequency of the corresponding sc, one should not consider gene pairs with sc smaller than scm. We therefore propose to extract gene pairs through the formulas:
P-value of cc cutoff for each sc:
scm, sc with the highest frequency.
scmax, the maximal possible sc score in the time series dataset
m, the number between scm and scmax
p, the given significant p-value threshold
f, the p-value threshold for each sc (between scm and scmax)
f(sc i ), the frequency of gene pairs with sci
If the p-value of cc from a gene pair with sci in the original expression dataset as determined in Fig. S1C is less than the corresponding P cc (i), this gene pair is considered as functionally associated in the extraction procedure II.
We are grateful to Dr. Hong-Wu Ma for useful discussions and to MarcioRosa da Silvafor help with network visualization. We acknowledge the support by the Deutsche Forschungsgemeinschaft through SFB 578 and by the Federal Ministry for Education and Research (BMBF) through the bioinformatics project "Intergenomics".
- Bar-Joseph Z: Analyzing time series gene expression data. Bioinformatics 2004, 20: 2493–2503. 10.1093/bioinformatics/bth283View ArticlePubMedGoogle Scholar
- D'haeseleer P, Wen X, Fuhrman S, Somogyi R: Linear modeling of mRNA expression levels during CNS development and injury. Pac Symp Biocomput 1999, 41–52.Google Scholar
- Bar-Joseph Z, Gerber G, Jaakkola T, Gifford D, Simon I: Comparing the continuous representation of time series expression profiles to identify differentially expressed genes. Proc Natl Acad Sci USA 2003, 100: 10146–10151. 10.1073/pnas.1732547100PubMed CentralView ArticlePubMedGoogle Scholar
- Schliep A, Schonhuth A, Steinhoff C: Using hidden Markov models to analyze gene expression time course data. Bioinformatics 2003, 19: 1264–1272. 10.1093/bioinformatics/btg1036View ArticleGoogle Scholar
- Guthke R, Möller U, Hoffmann M, Thies F, Töpfer S: Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection. Bioinformatics 2005, 21: 1626–1634. 10.1093/bioinformatics/bti226View ArticlePubMedGoogle Scholar
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863PubMed CentralView ArticlePubMedGoogle Scholar
- Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian network to analyze expression data. J Comput Biol 2000, 7: 601–620. 10.1089/106652700750050961View ArticlePubMedGoogle Scholar
- Qian J, Filhart MD, Lin J, Yu HY, Gerstein M: Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new. biologically relevant interactions. J Mol Biol 2001, 314: 1053–1066. 10.1006/jmbi.2000.5219View ArticlePubMedGoogle Scholar
- Ong I, Glasner J, Page D: Modelling regulatory pathways in E.Coli from time series expression profiles. Bioinformatics 2002, 18: S241–248.View ArticlePubMedGoogle Scholar
- Perrin BE, Ralavivola L, Mazurie A, Bottani S, Mallet J, D'Alche-Buc F: Gene network inference using dynamic bayesian networks. Bioinformatics 2003, 19: II138-II148. 10.1093/bioinformatics/btg1071View ArticlePubMedGoogle Scholar
- Zou M, Conzen SD: A new dynamic Bayesian network(DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 2005, 21: 71–79. 10.1093/bioinformatics/bth463View ArticlePubMedGoogle Scholar
- Magwene PM, Kim J: Estimating genomic coexpression networks using first-order conditional independence. Genome Biol 2004, 5: R100. 10.1186/gb-2004-5-12-r100PubMed CentralView ArticlePubMedGoogle Scholar
- Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science 2003, 302: 249–255. 10.1126/science.1087447View ArticlePubMedGoogle Scholar
- Lee I, Data VS, Adai AT, Marcotte EM: A probabilistic functional network of yeast genes. Science 2004, 306: 1555–1558. 10.1126/science.1099511View ArticlePubMedGoogle Scholar
- Kwon AT, Hoos HH, Ng R: Inference of transcriptional regulation relationships from gene expression data. Bioinformatics 2003, 19: 905–912. 10.1093/bioinformatics/btg106View ArticlePubMedGoogle Scholar
- Filkov V, Skiena S, Zhi JZ: Analysis techniques for microarray time-series data. J Comput Biol 2002, 9: 317–330. 10.1089/10665270252935485View ArticlePubMedGoogle Scholar
- Balasubramaniyan R, Hüllermeiser E, Weskamp N, Kämper J: Clustering of gene expression data using a local shape-based similarity measure. Bioinformatics 2005, 21: 1069–1077. 10.1093/bioinformatics/bti095View ArticlePubMedGoogle Scholar
- Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, Davis RW: A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 1998, 2: 65–73. 10.1016/S1097-2765(00)80114-8View ArticlePubMedGoogle Scholar
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: 195–197. 10.1016/0022-2836(81)90087-5View ArticlePubMedGoogle Scholar
- Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G, Sethuraman A, Weng S, Botstein D, Cherry JM: Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res 2002, 30: 69–72. 10.1093/nar/30.1.69PubMed CentralView ArticlePubMedGoogle Scholar
- Yu HY, Zhu XW, Greenbaum D, Karro J, Gerstein M: TopNet: a tool for comparing biological sub-networks, correlating protein properties with topological statistics. Nucleic Acids Res 2004, 32: 328–337. 10.1093/nar/gkh164PubMed CentralView ArticlePubMedGoogle Scholar
- Mewes HW, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2002, 30: 31–34. 10.1093/nar/30.1.31PubMed CentralView ArticlePubMedGoogle Scholar
- Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 2002, 30: 303–305. 10.1093/nar/30.1.303PubMed CentralView ArticlePubMedGoogle Scholar
- Bader GD, Betel D, Hogue CW: BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res 2003, 31: 248–250. 10.1093/nar/gkg056PubMed CentralView ArticlePubMedGoogle Scholar
- Ito T, Tashiro K, Muta S, Ozawa R, Chiba T, Nishizawa M, Yamamoto K, Kuhara S, Sakaki Y: Toward a protein-protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc Natl Acad Sci USA 2000, 97: 1143–1147. 10.1073/pnas.97.3.1143PubMed CentralView ArticlePubMedGoogle Scholar
- Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae . Nature 2000, 403: 623–627. 10.1038/35001009View ArticlePubMedGoogle Scholar
- Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang LY, Wolting C, Donaldson L, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sørensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CWV, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415: 180–183. 10.1038/415180aView ArticlePubMedGoogle Scholar
- Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Höfert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier M, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415: 141–147. 10.1038/415141aView ArticlePubMedGoogle Scholar
- Yu HY, Luscombe NM, Qian J, Gerstein M: Genomic analysis of gene expression relationship relationships in transcriptional regulatory networks. Trends in Genetics 2003, 19: 422–427. 10.1016/S0168-9525(03)00175-6View ArticlePubMedGoogle Scholar
- Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar JZ, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, Young RA: Transcriptional Regulatory Networks in Saccharomyces cerevisiae . Science 2002, 298: 799–804. 10.1126/science.1075090View ArticlePubMedGoogle Scholar
- Luscombe NM, Babu MM, Yu HY, Snyder M, Teichmann SA, Gerstein M: Genomic analysis of regulatory network dynamics reveals large topological changes. Nature 2004, 431: 308–312. 10.1038/nature02782View ArticlePubMedGoogle Scholar
- Hollenhorst PC, Pietz G, Fox CA: Mechanisms controlling differential promoter-occupancy by the yeast forkhead proteins Fkh1p and Fkh2p: implications for regulating the cell cycle and differentiation. Genes Dev 2001, 15: 2445–2456. 10.1101/gad.906201PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.