An efficient method for mining cross-timepoint gene regulation sequential patterns from time course gene expression datasets

Cheng, Chun-Pei; Liu, Yu-Cheng; Tsai, Yi-Lin; Tseng, Vincent S

doi:10.1186/1471-2105-14-S12-S3

Volume 14 Supplement 12

Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Bioinformatics

Research
Open access
Published: 24 September 2013

BMC Bioinformatics volume 14, Article number: S3 (2013)

Abstract

Background

Observing gene expression changes that imply gene regulation through repeated time course experiments has become increasingly important. However, no effective method is available for handling this kind of data. For instance, in a clinical or biological progression such as an inflammatory response or cancer formation, a large number of differentially expressed genes can be identified at different time points through a large-scale microarray approach. For each repeated experiment with different samples, converting the microarray datasets into transactional databases that hold the significant singleton genes at each time point allows sequential patterns implying gene regulation to be identified. Traditional sequential pattern mining methods have been applied successfully to many problems, such as mining customer purchasing sequences from a transactional database. To our knowledge, however, these methods are not suitable for such biological datasets because every transaction in the converted database may contain too many items/genes.

Results

In this paper, we propose a new algorithm called CTGR-Span (Cross-Timepoint Gene Regulation Sequential pattern) to efficiently mine CTGR-SPs (Cross-Timepoint Gene Regulation Sequential Patterns), even on larger datasets where traditional algorithms are infeasible. CTGR-Span includes several biologically designed parameters based on the characteristics of gene regulation. We perform an optimal parameter tuning process using a GO enrichment analysis to yield more biologically meaningful CTGR-SPs. The proposed method was evaluated on two publicly available human time course microarray datasets and outperformed the traditional methods in execution efficiency. When checked against the previous literature, the resulting patterns also correlated strongly with the experimental backgrounds of the datasets used in this study.

Conclusions

We propose an efficient algorithm, CTGR-Span, for mining biologically meaningful CTGR-SPs. We postulate that biologists can benefit from the new algorithm, since patterns implying gene regulation could provide further insight into the mechanisms of novel gene regulation during a biological or clinical progression. The Java source code, program tutorial and other related materials used in this program are available at http://websystem.csie.ncku.edu.tw/CTGR-Span.rar.

Background

Over the past decade, time course studies have become increasingly important, since most clinical/biological events, such as infection-related chronic/acute inflammatory responses [1–3], drug treatment-related experiments [4], cell cycle arrest [5] or other important issues [6], unfold over a period of time in which aberrant alterations in gene expression lead to different outcomes. Therefore, by consecutively monitoring the expression of large numbers of genes and discovering their regulation during clinical/biological manifestations, the hidden layer of biological mechanisms could be unveiled. However, to our knowledge, there is no effective method that can handle this issue, although the high-throughput microarray is a powerful tool that has been widely used to detect differentially expressed genes among a group of patients in a time course experiment [3, 4]. The authors of those studies focused only on identifying genes whose expression varied with time, leaving open whether these genes are associated with each other. Their results therefore did not capture this valuable information.

Sequential pattern mining is one of the most important topics in data mining, especially for database systems. A sequential pattern is a set of frequent singleton items (here, differentially expressed genes) that is followed by another set of items in time-stamp ordered transactions. Therefore, if potential gene regulation occurs over a period of time, it can be identified by mining such sequential patterns from a database converted from the dataset. Several parental algorithms using different computational designs, such as AprioriAll [7], SPADE [8] and PrefixSpan [9], have been proposed and applied to different databases to discover sequential patterns. The Apriori-like (level-wise) GSP [10] and the pattern-growth-based Prefix-growth [11] and DELISP [12] extend these designs by incorporating constraints, such as the size of the gap between the singleton items in a sequence, or a time interval within which items are treated as belonging to the same transaction even if they originate from different transactions. In addition, every subpattern derived from a parental sequential pattern also satisfies the user-set constraint values. This property is called downward closure [7–12]. As a consequence, all possible subpatterns of each sequential pattern, particularly of the longer ones, need to be generated during decomposition, which is time-consuming and space-exhausting. When a shorter sequential pattern and a longer one that contains it occur the same number of times across all transactions in the database, only the longer one is a closed sequential pattern, and the shorter one is eliminated from the final results. For this purpose, newer algorithms were designed, such as CTSP [13], which incorporates constraints, and CloSpan [14], which does not.

In addition to these traditional algorithms, a growing number of extended methods have been applied to other problems. For example, WSpan [15] can be used to determine weighted sequential patterns from a transactional database, and MAGIIC [16] was designed to discover structure motifs from protein sequences. However, to the best of our knowledge, none of the aforementioned methods is suitable for the widely used microarray data, because a large-scale DNA microarray platform normally consists of tens of thousands of probes/genes, e.g., over 45,000 probes/genes in rice arrays and over 20,000 probes/genes in human arrays. The set of differentially expressed genes (significant singleton gene items) on a single array can be considered a single transaction. Each transaction (the gene items at one time point) may therefore contain too many significant singleton gene items after the numeric datasets are converted into the discrete format of transactional databases [17]. This is called the long transaction issue, and to date no method can handle it efficiently. In fact, many items occur frequently at most time points. They resemble housekeeping genes, which are largely insensitive to extracellular stimuli and instead play critical roles as maintenance genes in basic cellular functions [18]. Mining sequential patterns that contain too many such items makes the resulting gene regulations harder to interpret. The performance of the preceding sequential pattern mining methods is also limited by these simultaneously occurring items.
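
The notions used in the two preceding paragraphs (a pattern occurring in a sequence, its occurrence count, downward closure and closed patterns) can be illustrated with a minimal Python sketch. This is not the authors' Java implementation and not any of the cited algorithms; the toy database, the item labels such as `G1+` (up-regulated G1) and the helper names `contains`, `support_count` and `closed_only` are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

Transaction = FrozenSet[str]                 # significant gene items at one time point
GeneSequence = Tuple[Transaction, ...]       # one patient's time-ordered transactions
Pattern = Tuple[FrozenSet[str], ...]         # ordered itemsets to look for


def contains(sequence: GeneSequence, pattern: Pattern) -> bool:
    """True if every itemset of `pattern` is a subset of some transaction,
    with the matched transactions appearing in the same order (greedy match)."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next itemset must match a strictly later transaction
    return True


def support_count(database: List[GeneSequence], pattern: Pattern) -> int:
    """Number of sequences (patients) in which the pattern occurs."""
    return sum(contains(seq, pattern) for seq in database)


def closed_only(counts: Dict[Pattern, int]) -> Dict[Pattern, int]:
    """Drop a pattern when a longer pattern containing it has the same count."""
    return {
        p: c
        for p, c in counts.items()
        if not any(q != p and cq == c and contains(q, p) for q, cq in counts.items())
    }


def fs(*items: str) -> FrozenSet[str]:
    return frozenset(items)


# Hypothetical database: 3 patients, each a sequence of transactions.
database: List[GeneSequence] = [
    (fs("G1+", "G2-"), fs("G3+"), fs("G2-", "G4+")),
    (fs("G1+"), fs("G3+", "G4+")),
    (fs("G2-"), fs("G1+"), fs("G3+")),
]

candidates: List[Pattern] = [
    (fs("G1+"),),
    (fs("G1+"), fs("G3+")),
    (fs("G1+"), fs("G4+")),
    (fs("G3+"), fs("G1+")),
]

counts = {p: support_count(database, p) for p in candidates}
frequent = {p: c for p, c in counts.items() if c >= 2}  # assumed minimum count of 2
closed = closed_only(frequent)
```

In this toy database the single-item pattern (G1+) and the longer pattern (G1+)(G3+) both occur in all 3 sequences, so only the longer one survives the closed-pattern filter. Because each pattern is tested against every transaction, the cost of such a naive check grows quickly when transactions hold thousands of gene items, which is the long transaction issue described above.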

In this paper, we propose a new algorithm called CTGR-Span (Cross-Timepoint Gene Regulation Sequential pattern) with several biologically designed parameters to solve the issue described above by mining CTGR-SPs (Cross-Timepoint Gene Regulation Sequential Patterns). CTGR-Span ensures that all of the resulting patterns imply gene regulations that take place across different time points during the course of biological observation. The method is an extended and improved version of our previous paper [19], presented at the 2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). The most important changes are as follows. First, we designed a new optimal parameter tuning procedure for the proposed algorithm to determine suitable conditions for pattern mining. An advantage of this procedure is that the standard deviation of the time intervals in a time course dataset no longer needs to be computed. Second, based on this design, we compared our method with two representative sequential pattern mining algorithms, GSP and PrefixSpan, in terms of execution efficiency and effectiveness. The resulting patterns were validated using a manual literature survey and an automatic Gene Ontology enrichment analysis [20]. Finally, further explanation of the proposed algorithm has been added, including i) complete examples that make both the proposed algorithm and the new parameter tuning procedure easier to understand, and ii) additional experimental results on the two publicly available human disease-related time course microarray datasets [3, 4].

The rest of this paper is organized as follows. The proposed method and materials for analysis are described in Methods. In Results and Discussion, we give the experimental results of the proposed method on two time course gene expression datasets. Concluding remarks are given in Conclusions.

Methods

In this section, we describe how to efficiently discover CTGR-SPs (Cross-Timepoint Gene Regulation Sequential Patterns) from a time course microarray dataset in 3 main parts: i) the experimental background of the 2 input microarray datasets, ii) the conversion of a numeric dataset into a transactional database, and iii) the kernel of CTGR-Span (Cross-Timepoint Gene Regulation Sequential pattern) and its required biologically designed parameters.

Input microarray datasets

We tested the method presented in this paper on the same input datasets as in our previous work [19]. In brief, 2 time course gene expression microarray datasets (GSE6377 [3] and GSE11342 [4]) were downloaded from the GEO database. In GSE6377, McDunn et al. attempted to detect 8,793 transcriptional changes in the leukocytes of 11 ventilator-associated pneumonia patients across 10 time points. In GSE11342, Taylor et al. monitored 22,283 gene expression changes in the peripheral blood monocytes of 20 hepatitis C virus-infected patients across the first 10 weeks after treatment with Peg-interferon alfa-2b plus ribavirin.

Converting microarray datasets into transactional databases

Sequential patterns can be mined directly from a transactional database if the data are discrete. The probe/gene expression values in a microarray therefore need to be discretized into singleton items within every transaction. Tables 1 to 3 give an example. Table 1 shows the probe/gene expression values of 3 genes, G₁ to G₃, over 4 time points, TP₁ to TP₄, with a fixed interval of 1 day. The experiment is performed in 3 patients. The first time point is regarded as the baseline for deriving the significant items at each later time point: all values are divided by the value at the first time point, and the resulting ratios are presented as the fold change matrix in Table 2. Genes whose absolute fold change exceeds a fold-change threshold are defined as significant. With a threshold of 1.5, only the significant genes are preserved as new items, as shown in Table 3. For patient 1, for instance, up-regulated G₁, down-regulated G₂ and down-regulated G₃ occur at the second time point and are therefore placed within the same parentheses (transaction). In this example, the set of 3 time-ordered transactions for each patient is called a sequence.
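
The conversion described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' Java code: the expression values for the single patient are invented (they are not the values of Table 1), ratios below 1 are assumed to be reported as negative reciprocals so that the absolute fold change can be compared with the threshold, and the names `signed_fold_change` and `to_sequence` as well as the `+`/`-` item suffixes are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def signed_fold_change(value: float, baseline: float) -> float:
    """Ratio to baseline; ratios below 1 become negative reciprocals
    (assumption: e.g. a ratio of 0.5 is reported as a fold change of -2).
    Expression values are assumed to be strictly positive."""
    ratio = value / baseline
    return ratio if ratio >= 1.0 else -1.0 / ratio


def to_sequence(
    expression: Dict[str, List[float]], threshold: float = 1.5
) -> Tuple[FrozenSet[str], ...]:
    """Turn one patient's gene-by-time-point values into a sequence of
    transactions. The first time point is the baseline and yields no transaction."""
    n_timepoints = len(next(iter(expression.values())))
    sequence = []
    for t in range(1, n_timepoints):
        items = set()
        for gene, values in expression.items():
            fc = signed_fold_change(values[t], values[0])
            if abs(fc) > threshold:
                # "+" marks up-regulation, "-" marks down-regulation
                items.add(gene + ("+" if fc > 0 else "-"))
        sequence.append(frozenset(items))  # may be empty if nothing is significant
    return tuple(sequence)


# Invented values for one patient: 3 genes over 4 time points (TP1 is the baseline).
patient1 = {
    "G1": [100.0, 200.0, 120.0, 400.0],
    "G2": [300.0, 100.0, 310.0, 600.0],
    "G3": [80.0, 40.0, 20.0, 80.0],
}

sequence1 = to_sequence(patient1, threshold=1.5)
for timepoint, transaction in enumerate(sequence1, start=2):
    print(f"TP{timepoint}: ({', '.join(sorted(transaction))})")
```

With these invented values the second time point yields a transaction holding up-regulated G1, down-regulated G2 and down-regulated G3, mirroring the patient 1 example in the text, and the patient ends up with a sequence of 3 time-ordered transactions.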

Table 1 Example of time course microarray dataset
