 Research Article
 Open access
 Published:
Partitioning of functional gene expression data using principal points
BMC Bioinformatics volumeÂ 18, ArticleÂ number:Â 450 (2017)
Abstract
Background
DNA microarrays offer motivation and hope for the simultaneous study of variations in multiple genes. Gene expression is a temporal process that allows variations in expression levels with a characterized gene function over a period of time. Temporal gene expression curves can be treated as functional data since they are considered as independent realizations of a stochastic process. This process requires appropriate models to identify patterns of gene functions. The partitioning of the functional data can find homogeneous subgroups of entities for the massive genes within the inherent biological networks. Therefor it can be a useful technique for the analysis of timecourse gene expression data. We propose a new selfconsistent partitioning method of functional coefficients for individual expression profiles based on the orthonormal basis system.
Results
A principal points based functional partitioning method is proposed for timecourse gene expression data. The method explores the relationship between genes using Legendre coefficients as principal points to extract the features of gene functions. Our proposed method provides high connectivity in connectedness after clustering for simulated data and finds a significant subsets of genes with the increased connectivity. Our approach has comparative advantages that fewer coefficients are used from the functional data and selfconsistency of principal points for partitioning. As real data applications, we are able to find partitioned genes through the gene expressions found in budding yeast data and Escherichia coli data.
Conclusions
The proposed method benefitted from the use of principal points, dimension reduction, and choice of orthogonal basis system as well as provides appropriately connected genes in the resulting subsets. We illustrate our method by applying with each set of cellcycleregulated timecourse yeast genes and E. coli genes. The proposed method is able to identify highly connected genes and to explore the complex dynamics of biological systems in functional genomics.
Background
Discovering which genes are functioning and how they express their changes at each time is a necessary and challenging problem in understanding cell functioning [10]. The large number of genes in biological networks makes it complicated to analyze to understand their dynamics. The mathematical and statistical modelling of these dynamics, based on the gene expression data, has become an intensive and creative research area in bioinformatics.
Statistical models can find genes with similar expression profiles whose functions might be related through statistics or biology. Our approach has the assumption that specific curve form exists for each geneâ€™s trajectory and for each partition of these gene curves.
The observations of gene expressions are curves measured according to time on each gene. We can then call the observed lines of genes functional data because an observed intensity is recorded at each time point on a line segment. Functional data analysis is possibly considered a suitable method to model these gene curves [53].
Clustering algorithms are utilized to find homogeneous subgroups of gene data with both supervised or unsupervised [1]. For functional data, clustering algorithms based on the functional structure are also useful to find representative curves in each partition.
To obtain more knowledge about biological pathways and functions, classifying genes into characterized functional groups is a first step. Many methods of analysis, such as hierarchical clustering [34], Kmeans clustering [48, 52], correlation analysis [22, 24] and support vector machines (SVM) [6] classification, can be used to classify temporal gene profiles. Modelbased clustering with finite mixture [29] was done based on probabilistic models [4, 13, 20, 28, 54]. Recently timecourse gene expression data is often clustered in the relation between successive time points [7, 51, 55]. Yeast gene network is investigated for possible functional relations [31]. Fourier transformation is also incorporated in clustering and compared with Gaussian process regression (GPR) [21]. We use the word partitioning instead of clustering since we use a principal points partitioning technique. After partitioning, the subsets are often but not always normally disjoint.
In this paper, we use Legendre orthogonal polynomial system and principal points to obtain functional partitions. Analysis can be accomplished through extracting representative coefficients via data dimension reduction and finding principal points. Connectedness and silhouette values are computed for partition validity measure. An efficient way to deal with such gene data is to incorporate the functional data structure and to use a partitioning technique.
As a smooth stochastic functional process, the observed gene expression profiles have the covariance function which can be expressed with smooth orthogonal eigenfunctions based on functional principal components. The random part of KarhunenLoeve representation of the observed sample paths serves as a statistical approximation of the random process.
Abraham et al. [1] proposed a partitioning procedure of functional data by Bsplines. Kurata and Tang [23] investigated the properties of 2principal points with the data from spherically symmetric distributions. Tarpey et al. [44] compared a growth mixture modeling and optimal partitioning with the principal points for longitudinal clinical trial data. Their simulation results indicated that the optimal partitioning worked better than the mixture model in a squared error, even if there is a covariate. Tarpey et al. [41] used the selfconsistent partitioning with the functional data.
The kprincipal points are defined as a set of kpoints that minimizes the sum of expected squared distances from every point to the nearest point of the set. These kprincipal points are mathematically equivalent to centers of gravity obtained by Kmeans clustering. Tarpey [42, 43] also extended and applied the principal points idea for functional data analysis (FDA).
In this paper, we handle the relation between clustering functional data and partitioning functional principal points. We propose to use selfconsistent partitioning techniques for gene grouping based on curvature profiles as FDA. Some advantages in the use of FDA techniques for partitioning are:

(i)
Tarpey [41] showed that partitioning random functions can be replaced by partitioning the coefficients of the orthonormal basis functions in finite Euclidean space if its approximation can be done based on a finite number of orthonormal basis functions. The orthonormal polynomials are estimated and partitioned ([39, 42,43,44]). Tarpey [41] proved that principal points of a Gaussian random function can be found in a finite dimensional subspace spanned by eigenfunctions of the covariance kernel associated with the distribution.

(ii)
For functional data, clustering algorithms are useful to find representative curves under the different modes of variation. Representative curves from a data set that can be found using principal points from a large collection of functional data curves [11, 37].

(iii)
Principal points are special cases of selfconsistent points. A set of kpoints are selfconsistent for a distribution if each of the points is the conditional mean of the distribution over its respective Voronoi region. Kmeans algorithm converges to a set of k selfconsistent points of the empirical distribution if a set of kpoints are selfconsistent.
Partitioning based on interactions of genes is studied for the structure of genetic networks. In addition, statistical test and association rule approach represents another new strategy. Recently a statistical biclustering technique was proposed with applying on microarray data (gene expression as well as methylation) [25,26,27]. Consensus clustering is proposed via checking intermethod of clustering [40]. Recursive partition is also worked with classification trees to improve the precision of classification [56, 57]. To find the combinatorial marker [2, 3] integrated multiple data sources are surveyed in a comparative study. For yeast data a functional network partitioning was done [8].
Numerous research results on clustering microarray data which are mostly grouping common expression patterns. There are a few cases for partitioning genes with timecourse regarded as functional data. In this research, we propose a new method for selfconsistent partitioning of genes with functional gene expression data. The proposed method consists of two main steps. The first step is to represent each gene profile by functional polynomial representation. The second is to find principal points and appropriate partitions. We applied our method to simulated data and analyzed yeast gene microarray data and Escherichia coli data that resulted in partitioning with interpretable genes.
Methods
Model
Consider the gene expression data curve Y _{ i }(t) as a stochastic process at time t. Let f _{ i }(t) denote the expected expression at time t for the ith subject. The model with the functional data representation is
with
where each \( {\overset{\sim }{\xi}}_j(t) \) corresponds to the normalized Î¾ _{ j }(t). For example, Legendre polynomials, as an orthonormal polynomial system, are expressed using Rodriguesâ€™ formula as
The first few Legendre polynomials are
and Îµ _{ i }(t) is an error function with mean 0, independent of each other term in the model. For each gene Î² _{ i0},â€‰Î² _{ i1},â€‰Î² _{ i2},â€‰Î² _{ i3},â€‰Î² _{ i4} are regression coefficients based on Legendre polynomials. In the microarray experiment Y _{ i }(t) is the log gene expression of gene i at time t.
The curves given by the orthogonal polynomials are characterized by five coefficients, four of which are used to classify subjects. First, the coefficient Î² _{1} in (1) gives the overall trend in the outcome profile, then the derivative f _{ i } ^{â€²}(t) gives the rate of change in the expected outcome at time t. Parameter Î² _{2} is the coefficient of the quadratic polynomial providing a measure of concavity of the outcome curve. Parameter Î² _{3} as the coefficient of the cubic polynomial is a measure of curvilinearity and Î² _{4} as the coefficient of the quartic polynomial gives a measure of concavity of the outcome curve. The estimated polynomial coefficients have information about the underlying functional patterns and enable the automatic estimation of pattern functions.
Partitioning functional gene curves
Selfconsistent partitions
Principal points and selfconsistent points can be used for partitioning a homogeneous distribution. Principal points can be defined as a subset means for theoretical distributions.
For a set Wâ€‰=â€‰{y _{1},â€‰y _{2},â€‰â‹¯,â€‰y _{ k }} the k distinct nonrandom functions in a function space L^{2}, define
as a domain of attraction D _{ j } of y _{ j } that consists of all yâ€‰âˆˆâ€‰R ^{p}. The sets of D _{ j } are often referred to the Voronoi neighborhoods of y _{ j }. The domains of attraction induce a partition as D _{ j } via the preimages B _{ j } such as âˆªB _{ j }â€‰=â€‰R ^{p} where the boundaries have a probability of zero.
The set of optimal kpoints is expressed in terms of mean squared error (MSE). A set of k points Î¾ _{1},â€‰Î¾ _{2},â€‰â‹¯,â€‰Î¾ _{ k } are principal points [8] for a random vector Xâ€‰âˆˆâ€‰R ^{p} if
for every set of k points y _{1},â€‰y _{2},â€‰â‹¯,â€‰y _{ k }. The optimal onepoint representation of a distribution is the mean, which is corresponding to k = 1 principal point. For k > 1 principal points are a generalization of the mean from one to several points optimally representing the distribution. A nonparametric estimate for the principal points is obtained via Kmeans algorithm. Thus the kpoints are mathematically equivalent to centers of gravity by Kmeans clustering.
The concept of principal points can be extended to functional data clustering. Tarpey [41,42,43] proved that principal points of a Gaussian random function can be found in a finite dimensional subspace spanned by eigenfunctions of a covariance kernel associated with the distribution.
We derive functional principal points of orthonormal polynomial random functions based on the transformation.
A set {Î¾ _{1},â€‰Î¾ _{2},â€‰â‹¯,â€‰Î¾ _{ k }} is selfconsistent for a random vector X if
A set of kpoints is selfconsistent if each of the points is a conditional mean in the respective domain of attraction. Principal points are selfconsistent [8], but the converse is not necessarily true. Tarpey and Kinateder [46, 47] proved that self consistent points of elliptical distributions exist only in a principal component subspace. Tarpey [41] proved the principal subspace theorem as follows. Suppose X is pvariate elliptical with E(X)â€‰=â€‰0 and Cov(X)â€‰=â€‰Î£, then v, the subspace spanned by a selfconsistent set of points is spanned by an eigenvector set of Î£. Principal points find the optimal partitions of theoretical distributions. It would be interesting to study principal points of theoretical distributions such as finite mixtures, for which cluster analysis is meant to work.
Tarpey [41] showed that principal points form symmetric patterns for the multivariate normal and other symmetric multivariate distributions. For symmetric, multivariate distributions several different sets of selfconsistent points may exist and the optimal symmetric pattern of selfconsistent points depends on the underlying covariance structure.
Cluster analysis is related to finding homogeneous subgroups in a mixture of distributions, it would be appropriate to give optimal cluster means to the principal points inspired by [24]. Cluster analysis methods are considered as purely dataoriented without a statistical model in the background in order to pragmatically find optimal partitions of observed data. It would be intriguing to further study principal points of theoretical distributions that reflect group structure, such as finite mixtures, due to their ability to find optimal partitions of theoretical distributions. Principal points may be used to define the best kpoint approximations to continuous distributions.
Estimators of the principal points [11] can be obtained as cluster means form the Kmeans algorithm. Tarpey and Kinateder [46] examined the Kmeans algorithm for functional data and provided results on principal points for random functions. They proved that principal points of a Gaussian random function can be found in a finite dimensional subspace spanned by the eigenfunctions of covariance kernel associated with distributions that can be extended to nonGaussian random functions.
The selfconsistent curves inspired by Hastie and Stuetzle [15] can be generalized to provide a unified framework for principal components, principal curves and principal points. A principal component analysis is proposed to identify important modes of variation among curves [17] with principal component scores demonstrating the form and extending variations.
Clustering algorithms are often used to find homogenous subgroups of entities depicted in a set of data. For functional data, clustering algorithms are also useful to find representative curves that correspond to different models of variation. Early work on the problem of identifying representative curves from a data set can be found based on the principal points [12, 17]. The concept of principal points to functional principal point was extended; subsequently, functional principal points of polynomial random functions were derived using orthonormal basis transformation [36].
Suppose {f _{1},â€‰f _{2},â€‰â‹¯,â€‰f _{ n }} is a random sample of polynomial functions of the form (1) where the coefficient vector Î²â€‰=â€‰(Î² _{0},â€‰Î² _{1},â€‰Î² _{2},â€‰Î² _{3},â€‰Î² _{4})^{â€²} follows 5variate normal distribution. The L _{4} version of the Kmeans algorithm can be run on the functions f _{ i },â€‰â€‚iâ€‰=â€‰1,â€‰â‹¯,â€‰n to estimate principal points. The center of Kmeans clustering for the estimated coefficient vectors is based on the orthonormal transformation that constitutes the functional principal point; therefore, we consider Kmeans clustering for the Legendre polynomial coefficient vectors and for the Fourier coefficient vectors after Fourier transformation.
The Kmeans algorithm [47] provides that the Gaussianbased estimates coincide theoretically and the subspace containing a set of principal points must be spanned by the eigenfunctions of the covariance matrix. Clustering functional data using an L _{2} metric on function space can be done by clustering regression coefficients linearly transformed based on the orthogonal system [45]. Clustering after transformation and nonparametric smoothing is suggested [36] without assuming independence between curves.
Estimated coefficient vectors can be used to obtain the principal points for partitioning. The subspace can be spanned by eigenfunctions of the covariance kernel C(s,â€‰t) for Î² because the estimated coefficient vector can be a Gaussian random function. Eigenvalues and eigenvectors are then obtained from the covariance matrix of the estimated coefficients.
Finding the number of partitions
One difficult problem in clustering analysis is to identify the appropriate number of groups for the dataset. As a nonparametric way [39] for choosing the number of clusters is based on distortion that measures the average between each observation and its closed cluster center. The minimum achievable distortion associated with fitting K centers to the data is
where Î“ is the covariance matrix. If Î“ is the identity matrix, distortion is a mean squared error.
The sample Legendre coefficients and the sample Fourier coefficients approximately follow the multivariate normal distribution; therefore, Gaussian mixture modelbased clustering can be considered in addition to the number of partitions that can be chosen as a maximizer of the Bayesian Information Criterion (BIC).
Choice of Legendre coefficients
xTo determine the value of J, the number of polynomials, we can consider several J values and BIC, assuming that each partition covariance has the same elliptical volume and shape. We surmise that a true optimal J value for all the genes may not exist because the known optimal J values are various for each gene function. Our experiments consider the feasible numbers of partitions and J values for their optimality with the corresponding dataset.
Partition validation
The determination of the number of subsets (clusters) is an intriguing problem in unsupervised classification. To assess the resulting cluster quality various cluster validity indices are used. We consider silhouette measure proposed by [32] and connectivity in [14].
The silhouette width for the ith sample in the jth cluster is defined as:
where a(i) is the average distance between the ith sample and all other samples included in the jth cluster, b(i) is the minimum average distance between the ith sample and all the samples clustered in kth cluster for kâ€‰â‰ â€‰j. A point is regarded as well clustered if s(i) is large. The silhouette width is an internal cluster validity index used when true class labels are unknown. With a partitioning solution C, the silhouette width judges the quality and determines the proper number of partitions within a dataset. The overall average silhouette value can be an effective validity index for any partition. Choosing the optimal number of clusters/partitions is proposed as the value maximizing the average s(i) over the data set [19].
Connectivity was suggested in [14] as a clustering or partitioning validity measure such as
where Câ€‰=â€‰{ C _{1},â€‰â‹¯,â€‰C _{ N }} are clusters, and p is the number of variables contributing to the connectivity measure. Define nn _{ i }(j) is the jth nearest neighbor of observation i, and let \( {x}_{i,{nn}_i(j)}\kern0.5em \)be zero if i and nn _{ i }(j) are in the same cluster and 1/j otherwise.
The connectivity assesses how well a given partitioning agrees with the concept of connectedness. This evaluates to what degree a partitioning observes local densities and groups genes (data items) together within their nearest neighbor in the data space based on violation counts of nearest neighbor relationships. The connectivity has a value between zero and âˆž that should be minimized for the best results. Dunnâ€™s index [9] is another type of connectedness measure between clusters.
Stability measures can be computed after partitioning. Average Distance (AD) computes the average distance between genes placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed. AD has a value between zero and âˆž; therefore, smaller values are preferred.
Figure of Merit (FOM) measures the average intracluster variance of the genes in the deleted column, where clustering is based on remaining (undeleted) samples. FOM estimates the mean error using predictions based on cluster averages. The final FOM score is averaged over all the removed columns with a value between zero and âˆž. FOM with smaller values means better performance.
Results and discussion
Worked example
We consider flexible functional patterns of data since real gene expression functions are various with noise. Nonlinear curves are generated according to the regression model
for iâ€‰=â€‰1,â€‰2,â€‰â‹¯,â€‰6,â€‰â€‚uâ€‰=â€‰1,â€‰2,â€‰â‹¯,â€‰m, and t _{ u }â€‰=â€‰u/m. The underlying regression functions for f are:
The simulated data consist of 1000 curves with 6 different underlying functions. The data set has 500 curves of f _{1} and 100 curves of each of f _{2},â€‰â‹¯,â€‰f _{6} to reflect certain aspect of gene expression data. Noise is imitated by adding random values from a normal distribution. Two noise levels are considered as low noise Ïƒâ€‰=â€‰0.5 and high noise Ïƒâ€‰=â€‰1.5. The number of time points is set to m = 20.
The advantages of the proposed method are evaluated by simulations. The number of subsets are known as K = 6. Table 1 shows connectivity and silhouette values after partitioning, which are better for 6 subsets with J = 3, 4, 5 coefficients in Gaussianbased principal points partitioning. The mean silhouette values and connectivity vary little according to J values. The number of subsets can be determined with modified GAP statistics [49]. The simulation results illustrate that the principal points via Legendre polynomial coefficients have favorable statistical properties in connectedness and can be used in timecourse gene data. Figure 1 provides the flowchart of our proposed partitioning procedure.
Evaluation for a clustering method can be done on theoretical grounds by internal or external validation, or both [14, 31]. Likewise, silhouette width and connectivity measure is considered for tightness in regards to genes in partitions. The evaluation of partitioning algorithms for gene data cannot be conducted by similar measures, but only by internal or external validation measures. The connectivity of genes in each partition can be regarded as an association of genes.
Application to partitioning with yeast cell cycle microarray expression data
The yeast cellcycle data set [38] includes more than 6000 yeast genes at 18 time points measured every 7 min that start at 0 min and end at 119 min. Temporal gene expression data (Î±factor synchronized) for the yeast cell cycle data is used for our real data analysis. A total of 4489 genes remain after removing genes with the missing values. The timecourse yeast microarray data are functional data obtained according to 18 time points for each gene [38]. Yeast is a free living, eukaryotic and single cell and highly complex organism that plays an important role for biology research.
First, the Legendre coefficients and Fourier coefficients are estimated. Then each set of estimated coefficients is applied to Kmeans clustering and Gaussianbased principal point estimation with the estimated covariance matrix.
Figure 2 shows that the GAP statistic for original data is maximized at k = 5. We considered from k = 4 since previous research typically provides at least 4 subsets, even with different criterion. BIC is maximized at k = 5 for modelbased clustering with the Legendre polynomial coefficients under VEV (volume:variable, shape:equal, and mean:variable) condition. Therefore, we decide the number of subsets as k = 5.
The number of Legendre polynomials J is considered from J = 2 to J = 7 and the average silhouette value is maximized at J = 5. The average silhouette values for J = 4 and J = 5 is 0.511 and 0.520 which are very close. However the mean within sum of squares (MSW) with J = 4 is 7376 and MSW with J = 5 is 144,650. MSW with J = 4 is less than MSW with J = 5. Consequently, the genes within each subset are closer to its center for J = 4. Therefore, we decide to use J = 4 Legendre polynomials and one constant term with the resulting coefficients used for partitioning. Table 2 shows that J = 4 Fourier coefficients are suggested for partitioning. We consider the same number of Fourier coefficients and those of Legendre polynomials for the comparison of yeast data.
Then Kmeans clustering is done with the timecourse original data (y), with 4 Legendre polynomial coefficients (LPC) and one constant term, and with 4 Fourier coefficients (FC) and one constant mean term respectively. Kmeans clustering with Legendre polynomials result in five subsets with 120, 128, 914, 1241, and 2086 genes respectively. The 2086 genes in Subset 5 seem to be nondifferential. Table 3 shows the partitioning results with the validation measures such as silhouette and connectivity. LPC has the best silhouette and the lowest (best) connectivity values. Figure 3 shows means, 2.5% and 97.5% percentiles of gene scores which provides a 95% empirical confidence interval for each subset. The graph in the bottom righthand corner of Fig. 3 shows the estimated mean change patterns of the five subsets. Figure 4 and Fig. 5 provide the LPC partitioning information including underlying functions and Legendre polynomial coefficients. In Fig. 4, the expression patterns of Subset 1 and 2 are similar to those of Subset 3 and 4, respectively, with less fluctuations. This means their relevance to cell cycle could be similar to each other (Subset 1 and 3, Subset 2 and 4), but they are possibly involved in different biological activities during the cell cycle. Subset 3 and Subset 4 seem to have initial different parts and their coefficients are reverse in sign in Fig. 5. Our proposed algorithm was able to identify any subtle differences in terms of biological processes. In Table 4, most of the GO terms in Subset 1 are mainly related to DNA replication during the S (synthesis) phase of cell cycle, while the terms in Subset 3 represent different biological processes such as protein mannosylation, which is an essential process for cell wall maintenance. GO terms related to cell division, including cell wall synthesis, were in Subset 2, which is mainly activated during the M (mitosis) phase of the cell cycle. Genes in Subset 4 showed similar expression profiles with Subset 2, but their biological processes are mostly related to a protein synthesis that was not represented in Subset 2. Therefore, the genes in Subset 3 and 4 are possibly involved in the crucial biological processes required during the S or M phase of the cell cycle. The constant expression pattern and overrepresented GO terms in the subsets suggested that these genes could be related to biological processes such as protein transport, which is constantly activated throughout the cell cycle.
Nonparametric estimators of principal points are given by the subset center means (Fig. 5). Figure 6 shows the relation between linear and quadratic Legendre polynomial coefficients. Figure 7 shows the hierarchical structure of Legendre coefficients as the heatmap. Legendre coefficients 2 and 4 as well as coefficients 1 and 3 seem to be clustered first. Subset stability measures such as average distance (AD) and Figure of Merit (FOM) are computed. AD is 20.6059 and FOM is 8.15, which are minimized with 5 subsets instead of 4 subsets; consequently, 5 partitions are more stable than 4 partitions in regards to AD and FOM.
OverRepresented Analysis (ORA) was performed with the genes in each subset in order to explain the explain biological relevance of the partitioned data. ORA searches for Gene Ontology (GO) terms of a given set of genes by evaluating the statistical significance of overrepresented functional and molecular mechanisms [5, 6]. GO is divided into three separate ontologies (Cellular Component, Molecular Function, and Biological Process) and our ORA analysis focuses on the Biological Process of a group of genes. In each subset, we selected the top 10 overrepresented GO terms in the smallest order of pvalues and compared them in terms of biological significance to overrepresented GO terms with the Partitioning Around Medoids (PAM) clustering method (Fig. 8) that can be seen in detail in the legend of the figure. Many of the annotated GO terms, such as DNA replication in Subset 1 and conjugation in Subset 2, are adequate to explain the cell cycle data used. However, in Subset 3, our partitioning technique found four GO terms, GO:0007010, GO:0035268, GO:0035269, and GO:0044710 were not significantly overrepresented in the PAM result. In Subset 3, the annotations of the found terms, especially Protein Olinked mannosylation recently reported that the lack of this biological function crucially affects cell morphology such as cell wall defects and cellcell separation in S. pombe [50]. Therefore GO: 0035268, GO: 0035269, and GO:0044710 are closely related to each other and reasonably explain the cell cycle process. In addition, GO:0035268 and GO:0035269 can be found as child terms by following connections from GO:0044710 in a GO tree. The results indicate that our partitioning approach can find functionally related genes which are not identified by the commonly used PAM clustering method.
With similar approach, we annotated the genes in each subset in terms of biological pathways. KEGG is a wellknown pathway whose biological functions are manually curated [18]. DAVID website provides KEGG information along with various annotation tools that include ORA [16]. Table 4 summarizes the overrepresented KEGG pathways that are statistically significant with pvalue <0.1. We drew our attention on Subset1 where the highly significant pathway terms are involved in DNA replication and repair processes during the cell cycle. Sugar metabolisms are easily detected because sugars are the basic building blocks of DNA. From these annotation results, the genes in Subset 1 are closely interrelated in the role of DNA replication. However, 53 of 96 genes in this subset are not included for the annotation; therefore, these 53 genes could be good candidates for further study with a hypothesis that they are dynamically involved in the DNA replication and repair process. Recently FDA ([33, 35]) provides new tools wellsuited for discrimination and classification [30, 42].
Application to partitioning with Escherichia coli microarray expression data
We applied our method to microarray data tracking Escherichia coli (E. coli) transcriptional responses to recovering from the stationary phase. This experimental dataset consists of log ratio intensity values for E. coli genes measured in cDNA microarray hybridizations. The final data set includes more than 3607 genes at 11 time points; however, 3452 genes remain after removing genes with missing values. Timecourse E. coli microarray data are regarded as functional data obtained according to 11 time points for each gene. This dataset is part of a study that tracks transcriptional responses to over 30 chemical and physiological perturbations [34].
The current study took advantage of the available information about the physiology of E. coli bacteria. Functional and regulatory classifications for E. coli genes are considered to evaluate transcriptional activity within and across groups of related genes. Figure 9 provides the silhouette profiles of the partitioning with the overall average silhouette value at 0.51. Figure 10 shows the expression patterns of the four subsets that were determined by the proposed algorithm. Each subset has 1349, 251, 1444, and 408 genes from Subset 1 to Subset 4. The connectivity measure is 62.68 and Dunnâ€™s index is 0.000759 for the resulting partitions. Each subset has its own distinctive expression pattern depicted by the smoothed expression mean (red line). Similar to the yeast cell cycle data results, Subset 1 and Subset 3 have identical expression profiles to Subset 4 and Subset 2, respectively, but with less fluctuations. The genes in Subset 4 and Subset 1 are actively involved in recovery processes such as protein synthesis, carbon energy metabolism, cell division, and nutrient uptake; however, the genes in Subset 2 and Subset 3 were possibly involved in the processes that stabilize the cells after their growth [34]. We performed gene enrichment analysis using the DAVID website to evaluate the partitioned genes in terms of their encoded protein keywords. Table 5 shows the enriched keywords with pvalues less than 0.01. As expected, genes in Subset 4 are mainly involved in cell growth; however, the genes in Subset 1 are also related to cell growth similar to genes in Subset 4 that have distinct cellular processes such as molecular bindings. However, the keywords in Subset 2 and Subset 3 are mainly related to enzymatic processes after cell growth. For example, acetylation affects protein stability; in addition, purine/pyrimidine biosynthesis, ligase, transferase, are all important enzymatic processes for cell stabilization. Oxidoreductase and NADP are also responsible for the electron transfer. The proposed technique proved that it provides decisive and biologically meaningful subsets of genes in timecourse experiments despite the limited biological annotations.
Conclusions
The dynamic nature of biological systems makes the investigation of temporal gene expression data important for exploration of gene expression regulation since they provide valuable functional information about temporal underlying patterns. Partitioning these genes is therefore an interesting problem in order to find gene functions in each partition.
In this paper, we present a functional partitioning procedure using principal points for temporal gene expression data after Legendre polynomial transformation. The optimal partitioning results produce a set of gene curve profiles that identify distinct types of gene expressions. Temporal gene expression data can be viewed as functional data since they are continuous and discretized samples of smooth random gene expression trajectories according to time. Partitioning differentiates cellcycle regulated genes and other noncellcycle regulated genes for yeast. Also partitioning differentiates distinct cellular processes for E. coli.
The proposed method identified each partition for its cellular process properties, which shows that transformation via orthogonal polynomials could work for selfconsistent partitioning. Our contributions include proposing principal points for microarray partitioning and the idea of some functional coefficients as transformation giving information about functional data. The future development of our method considers other transformations of functional data and functional time dependency that expects improvements in partitioning evaluation.
The yeast cell cycle data used is an early version of a two channel microarray that was hybridized with cDNA from two samples to be compared (e.g. normal versus cancer cells). The E. coli dataset in this work is also generated using the custom made two channel microarray technique with two different fluorescence dyes. However, RNASeq uses a nextgeneration sequencing (NGS) technique to measure the quantity of RNA in a sample of interest. The expression intensity is quantified by counting the number of reads mapped to each gene; therefore, care should be taken as the changes of total RNA amount between conditions possibly lead misrepresentation of the changes of individual transcript. In conclusion our method can be applied if the RNASeq data is appropriately processed. Further study is expected to utilize the proposed method in the analysis of more complex model organisms such as rats.
Abbreviations
 AD:

average distance
 E. coli :

Escherichia coli
 FDA:

functional data analysis
 FOM:

Figure of Merit
 GO:

Gene Ontology
 GPR:

Gaussian process regression
 NGS:

nextgeneration sequencing
 ORA:

OverRepresented Analysis
 PAM:

Partitioning Around Medoids
References
Abraham C, Cornillon PA, MatznerLober E, et al. Unsupervised curve clustering using Bsplines. Scandinavian J Statistics. 2003;30:581â€“95.
Bandyopadhyay S, et al. A Survey and comparative study of statistical tests for identifying differential expression from microarray data. IEEE/ACM Trans Comput Biol Bioinform. 2014;11:95â€“115.
Bandyopadhyay S, et al. Integrating multiple data sources for combinatorial marker discovery: a study in tumorigenesis. IEEE/ACM Trans Comput Biol Bioinform. 2016;99:1â€“1.
Banfield JD, Raftery AE. Modelbased Gaussian and nonGaussian clustering. Biometrics. 1993;49:803â€“21.
BeiÃŸbarth T, Speed TP, Beissbarth T. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics. 2004;20(9):1464â€“5.
Brown MPS, Grundy WN, Lin D. Knowledgebased analysis of microarray gene expression data by using support vector machines. PNAS retrieves. 2000;97:267.
Chiu TY, Hsu TC, Yen CC, et al. Interpolation based consensus clustering for gene expression time series. BMC Bioinformatics. 2015;16:117.
Conant GC, Wolfe KH. Functional partitioning of yeast coExpression networks after genome duplication. PLoS Biol. 2006;4(4):e109.
Dunn J. Well separated clusters and optimal fuzzy partitions. J Cybernetics. 1974;4:95â€“104.
Fujita A, Sato FR, GarayMalpartida HM, et al. Timevarying modeling of gene expression regulatory networks using the wavelet dynamic vector autoregressive method. Bioinformatics. 2007;23:1623â€“30.
Flury BD. Principal points. Biometrika. 1990;77:33â€“41.
Flury BD, Tarpey T. Representing a large collection of curves: A case for principal points. Amer Stat. 1993;47:304â€“6.
Fraley C, Raftery AE. Modelbased clustering, discriminant analysis, and density estimation. J Amer Statist Assoc. 2002;97:611â€“31.
Handl J, Knowles J, Kell DB. Computational cluster validation in postgenomic data analysis. Bioinformatics. 2005;2:3201â€“12.
Hastie T, Stuetzle W. Principal curves. J Amer Statist Assoc. 1989;84:502â€“16.
Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44â€“57. https://doi.org/10.1038/nprot.2008.211.
Jones MC, Rice JA. Displaying the important features of large collections of similar curves. Amer Statist. 1992;46(2):140â€“5.
Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27.
Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley; 1990.
Kim J, Kim H. Clustering of change patterns using Fourier coefficients. Bioinformatics. 2008;24(2):184â€“91.
Kim J, Ogden T, Kim H. A method to identify differential expression profiles of timecourse gene data with Fourier transformation. BMC Bioinformatics. 2013;14:310.
Kruglyak S, Tang H. A new estimator of significance of correlation in time series data. J Comp Biology. 2001;8(5):463â€“70.
Kurata S, Tang H. On principal points for location mixtures of spherically symmetric distributions. J Stat Plan and Inf. 2008;138:3405â€“18.
Leng X, Muller HG. Classification using functional data analysis for temporal gene expression data. Bioinformatics. 2006;22:68â€“76.
Mallik S, Mukhopadhyay A, Maulik U. Integrated Statistical and Rule Mining Techniques for DNA Methylation and Gene Expression Data Analysis. Journal of Artificial Intelligence and Soft Computing Research. 2013;3(2)
Mallik S, Bhadra T, Maulik U. RANWAR: Rankbased weighted association rule mining from gene expression and methylation Data. IEEE Transactions on Nanobioscience. 2015;14:990â€“9.
Mallik S, Bhadra T, Maulik U. Identifying Epigenetic Biomarkers using Maximal Relevance and Minimal Redundancy Based Feature Selection for MultiOmics Data. IEEE Transactions on Nanobioscience. 2017;3(2):101â€“15.
Mclaclan GJ, Bean RW, Peel D. A mixture modelbased approach to the clustering of microarray expression data. Bioinformatics. 2002;18:413â€“22.
Mclaclan GJ, Peel D. Finite Mixture Model. New York: John Wiley; 2002.
Pearson K. Principal components analysis. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. 1901;559(6.2)
Priness I, Maimon O, BenGal I. Evaluation of geneexpression clustering via mutual information distance measure. BMC Bioinformatics. 2007;8:111.
Rousseeuw PJ. Silhouettes: graphical aid to the interpretation and validation of cluster analysis. J of Computation and Applied Math. 1987;20:53â€“65.
Ramsey JO, Silverman BW. Functional data analysis. New York: Springer; 2005.
Sangurdekar DP, Srienc F, Khodursky AB. A classification based framework for quantitative description of largescale microarray data. Genome Biol. 2006;7:R32.
Sen TZ, Kloczkowski A, Jernigan RL. Functional clustering of yeast proteins from the proteinprotein interaction network. BMC Bioinformatics. 2006;7:355.
Serban N, Wasserman L. CATS: Clustering after transformation and smoothing. J Amer Statist Assoc. 2005;471:990â€“9.
Shimizu N, Mizuta M. Functional clustering and functional principal points. KnowledgeBased Intelligent Information and Engineering Systems. Lecture Notes in Computer Science. 2007;4693:501â€“8.
Spellman PT, Sherlock G, Zhang MQ, et al. Comprehensive identification of cell cycleregulated genes of the yeast Saccaromyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998;9:3273â€“97.
Sugar CA, James GM. Finding the number of clusters in a dataset: an informationtheoretic approach. J Amer Statist Assoc. 2003;463:750â€“63.
Swift S, Tucker A, et al. Consensus clustering and functional interpretation of geneexpression data. Genome Biol. 2004; https://doi.org/10.1186/gb2004511r94.
Tarpey T. Selfconsistent patterns for symmetric multivariate distributions. J Classification. 1998;15:57â€“79.
Tarpey T. Selfconsistency algorithms. J Comp and Graph Statist. 1999;8:899â€“905.
Tarpey T. Linear Transformations and the kmeans clustering algorithm. Amer Stat. 2007;61:34â€“40.
Tarpey T, Li L, Flury BD. Principal points and selfconsistent points of elliptical distributions. Ann Stat. 1995;23:103â€“12.
Tarpey T, Petkova E, Ogden RT. Profiling placebo responders by selfconsistent partitioning of functional data. J Amer Statist Assoc. 2003;98(464):850â€“8.
Tarpey T, Kinateder KKJ. Clustering functional data. J Classification. 2003;20:93â€“114.
Tarpey T, Petkova E, Lu Y, Govindarajulu U. Optimal partitioning for linear mixed effects models: applications to identifying placebo responders. J Amer Statist Assoc. 2010;105:968â€“77.
Tavazoie S, Hughes JD, Campbell MJ, et al. Systematic determination of genetic network architecture. Nat Genet. 1999;22:281â€“5.
Tibshirani R, Wlther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Statist Soc B. 2001;63:411â€“23.
Willer T, Brandl M. Protein Omannosylation is crucial for cell wall integrity, septation and viability in fission yeast. Mol Microbiol. 2005;57(1):156â€“70. doi:10.1111/j.13652958.2005.04692.x.
Wang K, Ng SK, McLachlan G. Clustering timecourse gene expression profiles using normal mixture models with autoregressive random effects. BMC Bioinformatics. 2012;13:300.
FX W, Zhang WJ, Kusalik A. A genetic Kmeans clustering algorithm applied to gene expression data. Advances in Artif Intell. 2003;2671:520â€“6.
Yao F, Muller HG, Wang JL. Functional data analysis for sparse longitudinal data. J Amer Statist Assoc. 2005;100:577â€“90.
Yeung KY, Fraley C, Murua A, et al. Modelbased clustering and data transformation for gene expression data. Bioinformatics. 2001;17:977â€“98.
Yuan Y, Li CT, Wilson R. Partial mixture model for tight clustering of gene expression timecourse. Bioinformatics. 2008;9:287.
Zhang H, CY Y, Singer B, Xiong M. Recursive partitioning for tumor classification with gene expression microarray data. Proc Natl Acad Sci U S A. 2001;98(12):6730â€“5. doi:10.1073/pnas.111153698.
Zhang Y, Liu Z, et al. Expression partitioning of homeologs and tandem duplications contribute to salt tolerance in wheat (Triticum aestivum L.). Sci Rep. 2016; doi:10.1038/srep21476.
Acknowledgements
Not applicable.
Funding
J. Kimâ€™s research was partially supported by Basic Science Research Program through the National Research Foundation of Korea (NRF2014R1A1A3050229). J. Kimâ€™s research was also supported in part by Basic Science Research Program through the National Research Foundation of Korea (NRF2016R1D1A1B03933527).
Availability of data and materials
The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.
Author information
Authors and Affiliations
Contributions
JK initiated this research, developed the basic idea, designed the simulation work and applied with real data. HK accomplished the gene ontology and provided the biological interpretation of our application to yeast data and E. coli data. He also compared our results with other partitioning study. Both authors read, discussed and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisherâ€™s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Kim, J., Kim, H. Partitioning of functional gene expression data using principal points. BMC Bioinformatics 18, 450 (2017). https://doi.org/10.1186/s1285901718600
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901718600