Alignment of time course gene expression data and the classification of developmentally driven genes with hidden Markov models
 Sean Robinson^{1}Email author,
 Garique Glonek^{1},
 Inge Koch^{1},
 Mark Thomas^{2} and
 Christopher Davies^{2}
https://doi.org/10.1186/s1285901506349
© Robinson et al.; licensee BioMed Central. 2015
Received: 7 December 2014
Accepted: 1 June 2015
Published: 18 June 2015
Abstract
Background
We consider data from a time course microarray experiment that was conducted on grapevines over the development cycle of the grape berries at two different vineyards in South Australia. Although the underlying biological process of berry development is the same at both vineyards, there are differences in the timing of the development due to local conditions. We aim to align the data from the two vineyards to enable an integrated analysis of the gene expression and use the alignment of the expression profiles to classify likely developmental function.
Results
We present a novel alignment method based on hidden Markov models (HMMs) and use the method to align the motivating grapevine data. We show that our alignment method is robust against subsets of profiles that are not suitable for alignment, investigate alignment diagnostics under the model and demonstrate the classification of developmentally driven genes.
Conclusions
The classification of developmentally driven genes both validates that the alignment we obtain is meaningful and also gives new evidence that can be used to identify the role of genes with unknown function. Using our alignment methodology, we find at least 1279 grapevine probe sets with no current annotated function that are likely to be controlled in a developmental manner.
Keywords
Background
Alignment of time course gene expression data is an important problem since, ‘biological processes have the property that multiple instances of a single process may unfold at different and possibly nonuniform rates in different organisms, strains, individuals, or conditions’ [1]. Such different rates may affect the timing of gene expression, which will be manifest in the observed expression profiles.
The rate of development of the grape berries was different at the Willunga and Clare vineyards. Differences between the vineyards such as soil conditions, viticultural management and climate are likely causes of the different rates of berry development [2]. During the experiment, the length of the development cycle was 19 weeks at Willunga and 17 weeks at Clare. Since the experiment called for weekly measurements, the expression profiles from Willunga have length 19 while the expression profiles from Clare have length 17 (Fig. 1). Hence we require an alignment between the different length profiles.
The basic underlying pattern of berry growth and ripening was the same at both the Willunga and Clare vineyards, which suggests a common underlying framework of gene expression control. Hence in spite of the different conditions, if a pair of expression profiles exhibit the same basic shape at both vineyards and are suitable for alignment, this is strong evidence that the corresponding gene is likely to be developmentally controlled. On the other hand, pairs of profiles with different shapes are not suitable for alignment and the corresponding gene is unlikely to be driven by the development process but by other factors.
A recent survey of grapevine genes [3] indicated that the annotation of 44 % of genes is ‘poorly informative’ (including 29 % having no Blast hit and 9 % with function unknown). Actual functional data is available for only a small subset of those genes with an assigned function and most often function is defined on the basis of sequence similarity with genes from other species. Additionally, the assignment of a biochemical function does not define whether a gene has a mainly developmental role or is merely responding to external cues.
Hence considering whether a pair of profiles is well aligned will give important additional evidence that can be used to identify genes as either likely to be developmentally driven or not.
The time sparsity and variability of the grapevine data is typical of longer term time course gene expression experiments. Interpolation of the expression values between observed time points is not readily justified as significant nonlinear variations in expression could conceivably occur between adjacent time points. Rather than the expression levels week by week, the biological relevance is in the general expression behaviour over the entire development cycle, which is where both the available data and current biological understanding lie.
Nonmodel based alignment methods such as discrete time warping (DTW) have been used for alignment of time course gene expression data [1]. However, for the grapevine data, DTW invariably produces pathological results. For example, >3 time points mapped to a single time point from Willunga to Clare immediately followed by the same from Clare to Willunga has no reasonable interpretation when each time step is a week and especially when the alignment differs for different pairs of profiles. Simply considering the lag between profiles would also not be a suitable model for the timing differences between vineyards and would violate the experimental setup.
In order to work with the typical sparsity of the grapevine data, as well as to provide a principled way to obtain a common alignment across both vineyards, we turn to hidden Markov model (HMM) based alignment methods.
Leftright HMMs
Lin et al. [4] aligned gene expression profiles using an HMM by constraining the Markov chain component to be a ‘leftright’ model. In a leftright HMM a state can never be revisited once it has been left and transitions away from a state may only occur to a single other state. Hence an alignment is achieved between the expression profiles by considering the different times the state transitions occur in the corresponding Viterbi paths.
A leftright HMM can be altered to allow for less restrictive transitions between states while keeping the same alignment idea, for example allowing the ‘leapfrogging’ of states. Schliep et al. [5] considered such an alignment, however their main focus was a modelbased ‘soft’ clustering method for expression profiles using mixtures of HMMs.
We aim to capture the basic pattern of each pair of profiles, which may be different from any other pair (Fig. 1). Hence approaches that constrain the Markov state transitions to the extent that all realised state sequences must share the same basic shape are not suitable in this case.
Pair HMMs
Pair HMMs are the standard model for the alignment of genomic sequence data [6]. However, Pair HMMs require discrete emission random variables to model the genomic sequences of interest. In addition, the conditional information of a previous emission observation is not the actual observed value but whether the observation was a pair or single nucleotide symbol. Since we aim to interpret the underlying Markov structure as capturing distinct quantitative levels of the expression profiles, we require more than the binary pair/single nucleotide symbol dynamics of the Markov chain component of a Pair HMM.
Extensions of Pair HMMs

Retain the binary dynamics of the Markov chain component of the model and consider continuous emission random variables; or

Incorporate additional information into the model so that the Markov structure encodes more than just binary dynamics.
Note that these possible extensions do not explicitly take alignment into account, although the motivation in considering such extensions is that the established alignment method of Pair HMMs could be carried over.
Binary Markov dynamics with continuous emissions
Yuan and Kendziorski [7], and Yoneya and Mamitsuka [8] both proposed extensions of a Pair HMM that retain the binary dynamics of the Markov chain component of the model. Both modelled time course gene expression data and hence considered continuous emission random variables. Yuan and Kendziorski [7] did not aim to obtain an alignment between expression profiles, and it is not clear how their model could be adapted for this purpose. Although the model of Yoneya and Mamitsuka [8] could be used as the basis of an alignment, their model requires strict assumptions about the shape of the expression profiles, assuming average expression levels except for at least one spike feature. Most genes in the grapevine data do not display expression profiles with such patterns (Fig. 1) so this approach is not suitable.
Additional information incorporated into the model
Listgarten et al. [9] proposed a ‘Continuous Profile Model’ (CPM), which they consider to be a ‘continuous analogue’ to a Profile HMM. Also widely used for the alignment of genomic sequence data, Profile HMMs are closely related to Pair HMMs [6]. Under a CPM, each time series is modelled as an emission sequence and the corresponding realisation of the state sequence is a mapping to an additional input sequence or ‘latent trace’. The latent trace has a higher number of time points than the observed time series (approximately double), which allows the mapping to ‘slow down’ and ‘speed up’ relative to ‘latent time’ and hence constitute an alignment.
The CPM was developed for mass spectrometry and speech waveform time series that were sampled frequently enough in time that interpolating smoothly between time points was a reasonable approach. The assumption of smoothness in time necessary for the ‘continuous’ CPM alignment is not reasonable for the grapevine data. Therefore, it would not be appropriate to apply the CPM alignment method to the grapevine data.
Our approach
We will model the expression profiles as multiple emission sequences of an HMM so that each pair corresponds to a common underlying state sequence. The emission sequences are aligned under the model in that aligned emission random variables are conditioned by the same state random variable. We will assume that the underlying Markov state sequence represents a common expression profile at both vineyards and that the Markov states represent distinct quantitative levels of gene expression.
Like the CPM, our alignment HMM is conceptually similar to a Pair HMM. However, in contrast to Pair HMMs, the alignment in our model is not determined by the underlying Markov chain but through ‘gap position’ parameters, which we incorporate into the model as additional information. Rather than the latent trace and continuous time warping of Listgarten et al. [9], this coarse approach to alignment is necessitated by the sparsity of our data.
We use our alignment HMM to achieve an alignment of the grapevine data and quantify how well each pair of profiles is aligned. We show that our method of training the model is computationally efficient and also robust against subsets of profiles that do not align. We then consider diagnostics under the model and demonstrate that genes can be classified as either likely to be developmentally driven or not by how well they align.
Methods
Grapevine data
In addition to being from spatially distinct vineyards, the time course microarray experiment was run in the 2004 grape growing season at Willunga and in the 2005 grape growing season at Clare. Gene expression levels were measured weekly at both vineyards using Affymetrix grapevine GeneChips (Santa Clara, CA, USA, Part #520054). We discard the expression profiles not differentially expressed in time at the 0.001 % significance level using LIMMA [10], as well as those without at least a 2fold change in expression level. We also discard all profiles corresponding to the Vitis vinifera Array (non vinifera / non 3 prime) Mask. We average the replicate expression observations at each time point and then linearly scale each profile individually so that all observed expression levels lie in the interval [0,1] (Additional file 1: Figure S1). We refer to the resultant 8644 pairs of profiles as the ‘grapevine data’.
Alignment model
For a single pair of expression profiles, there is usually insufficient information to identify optimal gap positions. However, since the grapevine data have been scaled so that all observed expression levels lie in the interval [0,1], the Markov state space and conditional emission distributions can be considered common for all genes. This allows us to estimate a single set of gap positions by pooling the data from all pairs of profiles.
The state random variables S _{1:19} that form the Markov chain component of the alignment HMM are discrete valued and take values in a common state space Ω _{ S }={1,2,…,N}. For convenience we use p(x) to symbolise both a probability density function and a probability mass function, in addition to using the event ‘ X=x’ as an argument.
for i,j=1,2,…,N.
for j=1,2,…,N.
There are well established methods for efficient calculation of the likelihood, finding the Viterbi paths and estimating the model parameters for a standard HMM [11]. These methods are readily adapted to our alignment model defined by (1) if the gaps g _{1} and g _{2} are given. Note that our alignment HMM is a special case of a hidden semiMarkov model [12].
Alignment model fitting method
We fit the alignment HMM to the grapevine data by maximising the loglikelihood ℓ(λ,g _{1},g _{2}) with respect to the HMM parameters λ and the gap positions g _{1} and g _{2}. A profile likelihood approach could be implemented by applying the BaumWelch algorithm [11] to obtain an estimate \(\hat \lambda ^{*}(g_{1},g_{2})\) for the HMM parameters for each pair (g _{1},g _{2}) and then maximising the profile likelihood \(\ell (\hat \lambda ^{*}(g_{1},g_{2}),g_{1},g_{2})\) with respect to g _{1} and g _{2}.
We propose a twostep approach with a much lower computational requirement and greater robustness to nonaligned expression profiles. In the first step, an estimate \(\hat \lambda \) for the HMM parameter is obtained, independent of the pairing and of the gap positions. In the second step, the loglikelihood \(\ell (\hat \lambda,g_{1},g_{2})\) is evaluated for each pair (g _{1},g _{2}) and the maximum likelihood estimates are selected from the enumeration. The estimate \(\hat \lambda \) is obtained from modelling each individual expression profile at both Willunga and Clare by a standard HMM [11] in which the same parameters λ apply to both vineyards. Such a model is implied by (1) when dropping the constraint that each pair of emission sequences correspond to a common state sequence.
The computational advantage of this approach is that it requires only a single maximisation of the HMM likelihood rather than one for each pair of gap positions. More importantly, it is also robust against the influence of expression profiles not suitable for alignment. The notion of a common alignment is plausible for developmental genes but not for those driven by environmental factors such as temperature. Since the nondevelopmental genes are not known in advance, they cannot be removed and their presence may produce significant bias in the estimate \(\hat \lambda ^{*}(\hat g_{1},\hat g_{2})\). A minor issue is that the standard HMM model from which \(\hat \lambda \) is obtained is inconsistent with the alignment HMM (1) because of the gaps in the Clare sequence. However, it is reasonable to assume that any bias arising from this inconsistency is minor compared to that arising from nonaligned expression profiles in the full likelihood estimate \(\hat \lambda ^{*}(\hat g_{1},\hat g_{2})\).
 1.
The gene expression profiles are filtered so that only those with significant differential expression and at least 2fold change in expression over the time course are retained.
 2.
Each expression profile is linearly rescaled to lie in the interval [0,1].
 3.
A standard HMM is fitted to the data to obtain the estimated HMM parameters \(\hat \lambda \).
 4.
The gap positions are estimated by maximising the alignment HMM loglikelihood \(\ell (\hat \lambda,g_{1},g_{2})\) with respect to g _{1} and g _{2}.
 5.
A single representation of the aligned expression profiles can be obtained either by averaging the aligned expression profiles or by finding the Viterbi path.
We implemented our methodology in MATLAB by adapting the code provided in the HMM Toolbox [13].
Results and discussion
We consider the robustness of the estimates of the gap positions. In a simulation experiment, even with up to 80 % of the data not suitable for alignment, the true gaps can clearly still be found through the loglikelihood (Additional file 2: Figure S2). For subsets of simulated profiles with different true gap positions, the maximum peak in the loglikelihood heatmap becomes less concentrated and spreads out (Additional file 2: Figure S2). For the grapevine data, the loglikelihood is sharply peaked (Fig. 3) and the estimated gaps additionally conform with other physiological features measured on the berries during the experiment. For example, both total soluble solids (sugar content) and berry weight were also measured weekly at Willunga and Clare and the same gap positions appear to work well for this additional data (Additional file 3: Figure S3).
We also consider fitting the alignment model with different choices of the number of states N. The estimated emission densities and heatmaps for N=3 and N=7 are given in Additional file 4: Figure S4. We can see that the same maximum likelihood gaps are found in both cases. It appears that N=3 states is not enough over the range of the data while N=7 is too many as two of the emission densities coincide.
As previously outlined, how well a pair of expression profiles align across vineyards is evidence for whether the corresponding gene is likely to be developmentally driven. To illustrate the potential for identification of biological function from alignment, a set of 198 genes were considered as test data (Additional file 6). Although this test data were also used to train the alignment model, we never made use of the labels in the model fitting. Our classifier arises out of the diagnostics of the alignment HMM as we assume there is a correspondence between ‘well aligned’ and ‘developmental’.
Grimplet et al. [3] surveyed the current gene function annotation for grapevines. Assigning a developmental role to genes based on the putative function of the proteins they encode, as determined by sequence similarity to other genes of known function and without reference to their expression patterns, is an uncertain practice. For example, socalled ‘heat shock’ proteins with similar protein sequences may be either developmentally controlled or may be induced by changes in temperature, or both. Additionally, differences in the promoter sequences of genes encoding similar proteins may determine whether a gene is involved in a developmentally controlled process or not.
By comparing the expression of genes under different growth conditions, as has been done in this paper, we are able to gain evidence regarding the reproducibility of gene expression patterns indicative of a role in development as opposed to a response to external signals. This information can be used as additional evidence in the further investigation of gene function. For example, using the annotation of Grimplet et al. [3], of the 8644 genes represented in the grapevine data, 1968 have no description of possible function (‘no function’, ‘no hit’, ‘unknown’ or ‘unknown function’) and of these we find 1279 probe sets with H(k)≤ 10. That is, 1279 genes with no current annotated function are well aligned between the Willunga and Clare vineyards and therefore we now have additional information that these genes are likely to be controlled in a developmental manner Additional file 7.
The proposed alignment method could be extended and refined in a number of ways. In particular, potential improvements may be obtained through more detailed modelling of the emission distributions in the HMM. In the present paper, we have applied Gaussian emission distributions to the expression profiles averaged over replicates within vineyards. This approach could be refined by considering the replicates as multivariate observations instead of averaging and also by considering alternatives to the Gaussian emission assumption. Autoregressive emissions as well as higherorder Markov components of HMMs have been investigated and found to improve performance in the identification of overexpressed genes [14]. The incorporation of this structure into our framework may more realistically model the expression profiles with potential improvements in performance. The implementation and evaluation of these improvements are the subject of future research.
Conclusion
We have presented a novel alignment method based on an HMM and demonstrated the alignment on the grapevine data. This is a model suitable for sparse time course data where interpolation is not appropriate. The estimated model parameters have simple interpretations and the estimated gap positions are well determined for the grapevine data. We have demonstrated that the estimates of the HMM parameters as well as the gap positions are robust against subsets of profiles that are not suitable for alignment. For pairs of expression profiles that are well aligned, the Viterbi paths or average profile representations can be used as the input to downstream analysis of the data. This allows for an integrated analysis of multiple site time course gene expression data such as the Willunga and Clare grapevine data.
We have demonstrated the use of the Hamming distance and the loglikelihood as a measure of quality for the alignment of a pair of expression profiles. Pairs of profiles that are well aligned will have high loglikelihood and a small Hamming distance while the poorly aligned pairs will have low loglikelihood and a large Hamming distance. We have also shown, for a set of genes with known function, that classification of genes according to the Hamming distance has reasonable predictive power for the classification of developmentally driven genes. This both validates that the alignment we obtain is meaningful and also suggests the potential for helping to identify the role of genes with unknown function.
Availability of supporting data
The MATLAB code and grapevine data to obtain all of the output described in this paper are provided as Additional files. The raw gene expression data is stored at NCBI in the GEO database as GSE7677 (Willunga) and GSE8445 (Clare) Additional file 8.
Declarations
Acknowledgements
The authors would like to thank Pat Corena for valuable contributions to the project. This project was financially supported by the Australian Grape and Wine Authority, the CRC for Viticulture and CSIRO Agriculture. The authors would also like to thank the editor and two anonymous reviewers for valuable suggestions.
Authors’ Affiliations
References
 Aach J, Church GM. Aligning gene expression time series with time warping algorithms. Bioinformatics. 2001; 17(6):495–508.View ArticlePubMedGoogle Scholar
 Pearce I, Coombe BG. Grapevine phenology In: Dry PR, Coombe BG, editors. Viticulture. Volume 1  Resources. Adelaide: Winetitles: 2004.Google Scholar
 Grimplet J, Van Hemert J, CarbonellBejerano P, DíazRiquelme J, Dickerson J, Fennell A, et al. Comparative analysis of grapevine wholegenome gene predictions, functional annotation, categorization and integration of the predicted gene sequences. BMC Res Notes. 2012; 5(1):213.View ArticlePubMedPubMed CentralGoogle Scholar
 Lin T, Kaminski N, BarJoseph Z. Alignment and classification of time series gene expression in clinical studies. Bioinformatics. 2008; 24(13):147–55.View ArticleGoogle Scholar
 Schliep A, Costa IG, Steinhoff C, Schönhuth A. Analyzing gene expression timecourses. IEEE/ACM Trans Comput Biol Bioinform. 2005; 2(3):179–93.View ArticlePubMedGoogle Scholar
 Durbin R, Eddy S, Krogh A, Mitchison G. Biological sequence analysis. Cambridge: Cambridge University Press; 1998.View ArticleGoogle Scholar
 Yuan M, Kendziorski C. Hidden Markov models for microarray time course data in multiple conditions. J Am Stat Assoc. 2006; 101(476):1323–32.View ArticleGoogle Scholar
 Yoneya T, Mamitsuka H. A hidden Markov modelbased approach for identifying timing differences in gene expression under different experimental factors. Bioinformatics. 2007; 23(7):842–9.View ArticlePubMedGoogle Scholar
 Listgarten J, Neal RM, Roweis ST, Emili A. Multiple alignment of continuous time series. Adv Neural Inf Process Syst. 2004; 17:817–24.Google Scholar
 Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Molec Biol. 2004; 3(1):1–25.Google Scholar
 Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989; 77(2):257–86.View ArticleGoogle Scholar
 Yu SZ. Hidden semiMarkov models. Artif Intell. 2010; 174:215–43.View ArticleGoogle Scholar
 Murphy K. The HMM Toolbox. http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html. Accessed 7 December 2014.
 Seifert M, AbouElArdat K, Friedrich B, Klink B, Deutsch A. Autoregressive higherorder hidden Markov models: exploiting local chromosomal dependencies in the analysis of tumor expression profiles. PLOS One. 2014; 9(6):100295.View ArticleGoogle Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.