
Research

Open

 Published:
Predicting microRNA targets in timeseries microarray experiments via functional data analysis
BMC Bioinformaticsvolume 10, Article number: S32 (2009)
Abstract
Background
MicroRNA (miRNA) target prediction is an important component in understanding gene regulation. One approach is computational: searching nucleotide sequences for miRNA complementary base pairing. An alternative approach explored in this paper is the use of gene expression profiles from timeseries microarray experiments to aid in miRNA target prediction. This requires distinguishing genuine targets from genes that are secondarily downregulated as part of the same regulatory module. We use a functional data analytic (FDA) approach, FDA being a subfield of statistics that extends standard multivariate techniques to datasets with predictor and/or response variables that are functional.
Results
In a miR124 transfection experiment spanning 120 hours, for genes with measurably downregulated mRNA, exploratory functional data analysis showed differences in expression profiles over time between directly and indirectly downregulated genes, such as response latency and biphasic response for direct miRNA targets. For prediction, an FDA approach was shown to effectively classify direct miR124 targets from timeseries microarray data (accuracy 88%; AUC 0.96), providing better performance than multivariate approaches.
Conclusion
Exploratory FDA analysis can reveal interesting aspects of dynamic microarray miRNA studies. Predictive FDA models can be applied where computational miRNA target predictors fail or are unreliable, e.g. when there is a lack of evolutionary conservation, and can provide posterior probabilities to provide additional confirmatory evidence to validate candidate miRNA targets computationally predicted using sequence information. This approach would be applicable to the investigation of other miRNAs and suggests that dynamic microarray studies at a higher time resolution could reveal further details on miRNA regulation.
Background
MicroRNAs (miRNAs) are a class of small noncoding RNAs that are found in both plants and animals. They are known to play important roles in gene regulatory networks by posttranscriptionally regulating the expression of other genes. These miRNAs target other transcripts by forming nearperfect or imperfect base pairings. Such formations silence genes either by mRNA cleavage or translational repression [1].
Computational sequencebased methods have been developed to predict miRNA direct targets. As animal mRNAs have imperfect complementary base pairing to their targets and as they are short in length, most such approaches first search for a perfect 7mer seed in the 5' end of miRNAs that match to their targets. Searching for such small motifs can lead to high false positive rates and so additional tests such as conservation filters are typically applied [2]. However, although many miRNA targets are conserved in related species, restricting the analysis only to recognizably conserved targets would exclude miRNAs that have recently diverged or changed rapidly. Also, only a small number of such predicted targets have been experimentally validated.
Although animal miRNAs were originally believed to primarily translationally suppress gene expression, they have also been found to lead to mRNA destabilization or degradation. MiRNA transfection microarray experiments capable of detecting such effects at the mRNA level of targets have shown that a large number of transcripts are downregulated by overexpression of miRNAs [3]. Such highthroughput expression data provide a promising way to assist in miRNA target prediction.
The mRNA expression changes after miRNA transfection could be a result of miRNAs directly targeting these messages (direct targets in the sequel), or the mRNA of genes could also be indirectly downregulated if they are a part of a miRNAmediated regulatory module controlled by the direct miRNA targets (indirect targets in the sequel) [3]. In this case, these indirectly regulated genes' response is causally related to the directly regulated genes in the corresponding module. As such, indirect targets would be expected to respond with delayed expression changes over time relative to the direct targets.
Timeseries microarray experiments that repeatedly measure gene expression simultaneously for multiple genes over a time period can capture temporal patterns and dependency of gene expression changes. A recent study by Wang and Wang [4] applied sequencebased target prediction to identify targets and then used a timeseries miRNA transfection microarray analysis to experimentally validate the predictions. Another study used a gene set enrichmentlike score to measure expression changes at different times [5]. However, these methods treated expression changes over time points as separate observations and therefore did not fully utilize the information on time dependency of gene expression changes. In this paper we aim to explore the differences between the expression profiles of genes that are direct miRNA targets and indirectly targeted genes, including classification of genes based on their profile. We use functional data analysis (FDA) to utilize the higher level structure due to the data being a set of finite samples from a continuous and presumably smooth curve.
Functional data analysis [6–9] is a subfield of statistics applicable when the predictor or response variable can be treated as a set of samples from a function, rather than simply as a feature vector. It is, in essence, the extension of standard multivariate analysis of finite dimensional vector spaces to infinite dimensional function spaces (e.g. reproducing kernel Hilbert spaces (RKHS)) using a functional analytic approach. FDA entails interpolating and smoothing the set of function samples using some appropriate set of basis functions (e.g. splines), kernel smoothing, or regularization. Multivariate procedures such as principal component, regression and classification analyses typically involve an inner product (or some form of metric) defined on the vector space: for the reconstructed continuous function, the conventional multivariate dot product, $\u3008x,y\u3009={\displaystyle \sum _{i=1}^{n}{x}_{i}{y}_{i}}$, can be replaced by the corresponding functional inner product, $\u3008x,y\u3009={\displaystyle {\int}_{0}^{T}x(t)y(t)dt}$, for time interval T, in the corresponding multivariate analysis method. For example, standard PCA can be extended to functional PCA (FPCA) [6, 10] as follows. Conventional multivariate PCA finds the axes of maximum variance in the data by solving the eigenequation Vξ = λξ for variance – covariance matrix V = n^{1} X'X, where X is the (mean centred) n × p data matrix, with n samples and p features; and λ and ξ are the corresponding eigenvalue and eigenvector. In functional PCA, the equivalent eigenequation is generalized to Vξ = λξ where ξ is now an eigenfunction and V is the covariance operator defined by: Vx(s) = ∫v(s, t)x(t)dt = ⟨(v(s, ·), x⟩; where the variancecovariance function (the functional extension of the variancecovariance matrix) $v(s,t)={n}^{1}{\displaystyle \sum _{i=1}^{n}{x}_{i}(s){x}_{i}(t)}$. So, in summary, FPCA can be expressed as the eigenanalysis of the covariance operator V, defined by using the covariance function v as the kernel of an integral transform, and this formulation of the eigenequation in terms of inner products, ⟨v(s, ·), ξ⟩ = λξ, can be applied to either multivariate or functional data, with the respective definition of inner product, where s ∈ T for function domain T for the functional case, and index set T = 1,..., p in the multivariate case. In most cases, this expression can be computed quickly using only matrix expressions utilizing the sampled data points and a matrix of inner products between basis functions, avoiding estimation of the integral [10]. Similarly, most other standard multivariate data analyses such as canonical correlation, cluster, regression and classification analysis can be extended to functional versions.
In this paper, we use FPCA for initial exploratory analysis and a nonparametric functional data analysis (NPFDA) approach [11] for prediction. The advantage of this functional data analytic approach is that the prior knowledge that the data is generated by a continuous smooth dynamics allows for: accurate estimates of derivatives which can then be used in the analysis; effective noise reduction through curve smoothing; applicability to data with irregular time sampling schedules. Such functional data analytic approaches have been previously applied to the analysis of temporal microarray expression data e.g. to cluster timeseries gene expression data [12] and to classify yeast cellcycle gene expression profiles and Dictyostelium celltype [13]. In this paper, we analyze miRNA124 transfection timeseries microarray expression data using FDA and show that such expression change differences can be modeled as different time activity curves to effectively predict miRNA direct targets and distinguish them from indirect targets.
Methods
We analyzed the miR124 transfection timeseries microarray data previously published in [4], which measures expression levels over 120 hours at 7 time points. We initially identified a set of genes downregulated after miRNA transfection. To identify, using sequence information alone, a set of true (with high probability) targets and nontargets to train the functional model, we predicted miRNA targets using publicly available target predictors and used the high confidence predictions to determine directtarget and indirecttarget sets as training sets for classification. FPCA was used for exploratory data analysis and NPFDA was used to distinguish the direct targets from the indirect targets. An overview of the analysis pipeline in this study is outlined in figure 1 and details are described in the following sections.
miRNA transfection dataset
We retrieved the miRNA124 transfection timeseries microarray expression dataset from the GEO database [14] (accession no. GDS2657). The dataset contains mRNA expression profiles generated by overexpressing miRNA124 duplexes into human HepG2 cells in comparison to negative controls at times 4, 8, 16, 24, 32, 72 and 120 hours, using Affymetrix microarrays (U133Plus2; RMAnormalized). miRNA124 is known to be highly expressed in several tissues including brain and kidney, but is not highly expressed in the cell line used [3, 4]. Starting with all genes provided on the U133Plus2 microarray, we first excluded those genes showing low overall expression levels in the control samples as these genes would be unlikely to be able to clearly show downregulation after miRNA transfection. Only probes where the expression levels over at least half of all time points of the control samples were greater than the median expression level of control samples were retained for further analysis. Expression values were averaged for those genes with corresponding multiple probes.
We next identified genes showing substantial evidence for downregulation of mRNAs. Genes with over 1.4fold underexpression (corresponding to a log_{2} expression fold change (FC) of less than 0.5) for at least one time point were considered to be downregulated genes. We subsequently restricted our analysis to these downregulated genes in this study (from this downregulated set, we also excluded a further 19 very noisy genes which had some time points showing greater than 1.3fold overexpression).
miRNA target prediction
We obtained two precomputed target gene sets for miR124 that were predicted by TargetScanS [15] and PITA [16]. The TargetScanS precomputed target set for miR124 contains 289 downregulated genes with 7 and 8mer sites conserved across several mammalian species and 384 genes with nonconserved sites. The PITA precomputed target set for miR124 contains 198 downregulated genes with 7 and 8mer conserved sites and 236 genes with nonconserved sites (a conservation score of higher and lower than 0.9 was considered as the conserved and nonconserved targets, respectively [16]). To construct a control set of genes that were not predicted to be miRNA124 targets, we further excluded genes showing even weak sites. To do so, we first extracted annotated 3'UTR sequences of genes from Ensemble [17] using Biomart [18]. miRNA target predictions were then performed on the 3'UTRs using PITA [19] with a relaxed parameter setting: a seed of length 6 to 8 bps and allowing G::U pairing in 7 and 8mer seeds.
Construction of datasets
We constructed two datasets to be used as training data for functional classification analysis.

(1)
Directtarget training set: we used the existing computational miRNA target predictors described above to construct a relatively high confidence directtarget training set. As current computational target predictors can vary in their effectiveness [20], to increase confidence we required both predictors to agree. We obtained 157 downregulated genes that were predicted by both predictors with 7 and 8mer seed sites and that were found to be evolutionarily conserved in several other species by both predictors.

(2)
Indirecttarget training set: we selected the downregulated genes that had annotated 3'UTRs but in which no target sites could be found by either predictor and even no weak sites could be found by using PITA prediction with relaxed parameters. From these genes, a set of size identical to the directtarget training set was randomly sampled to form the indirecttarget training set.
We also constructed two independent datasets that had not been used for training that we used to further evaluate our model.

(1)
Nonconserved directtarget test set: we combined the downregulated genes that were predicted from either of the predictors, yet that had no evidence of conservation. This set consisted of 424 genes and presumably would be enriched for direct targets.

(2)
Indirecttarget test set: we constructed another independent dataset which met the same criteria as the indirecttarget training set but excluding genes that had been used for training the model. This set consisted of 372 genes.
Functional data analysis
We used FPCA to perform exploratory data analysis. To fit a smooth function to the discrete sampled data, we used a set of 9 Bspline basis functions of order 4 (for cubic smoothing splines). Knots were located at the data points. Additional regularized smoothing (λ = 0.01) was applied (second derivative roughness penalty).
The NPFDA approach was used for classification. The first derivative of the expression profile was effectively used as the predictor variable, with a boolean (twoclass) response variable. The NPFDA used Bspline basis functions (with 7 knots) to produce smooth first derivative estimates. Performance evaluation was by 10 × stratified 10fold crossvalidation (CV). Major parameters were determined via nested CV separately for each fold; other parameter settings were set to their defaults or as appropriate for the data size. To evaluate the discriminability of the classes, the Receiver Operating Characteristics (ROC) curves were calculated as well as the associated area under the ROC curve (AUC).
R code implementing the analysis is available from the authors.
Results and discussion
Exploratory data analysis with functional principal components analysis (FPCA)
In the miR124 dataset, Wang and Wang [4] previously noted that genes directly targeted by miR124 tended to have substantial downregulation at time points earlier than the indirectly downregulated genes. This phenomena is as would be expected as the indirect targets would be causally downstream to the directly regulated genes. Figures 2(a) and 3(a) show the mean, unsmoothed, gene expression profiles of the directtarget training set and the indirecttarget training set, respectively (See Methods for construction of the directtarget training and the indirecttarget training set). The measurement time points are plotted equally spaced on the time axis to emphasize the early stages of the expression profile. Comparing the mean curves of the two datasets, the direct targets show an immediate decrease in mRNA levels reaching a minimum at approximately 72 hours. By contrast, the indirect targets show a delayed response and do not, on average, commence decreasing until approximately 32 hours, reaching a minimum at about 72 hours. Figures 2(b) and 3(b) show the smoothed functional profiles that were reconstructed using Bspline basis functions (see Methods). As discussed in the background section, it can be seen that smoothed functional curves allow us to reduce noise and have a better estimate of curve derivatives.
FPCA analysis reveals the modes of variation across the samples. Figure 4 shows the results of FPCA applied to the indirecttarget training set and the major modes of variance of the signal compared with the mean (the times shown are linear along the xaxis in these figures). In figure 4(a) the first principal component (PC) function (or harmonic) is added and subtracted from the mean curve, and similarly for the second PC function in figure 4(b). Figure 4(a) shows that there is a large variance in the overall expression fold change along the curves. The curves are normalized in the following classification step to minimize this variance (see below). Figure 4(b) shows that the downregulation peak varies somewhat in the exact time of its minimum across genes. The extent to which this variance is biologically meaningful and represents causal relationships, as opposed to being explained by the relatively low resolution sampling interval used, will require higher resolution studies to investigate. For example, only approximately three time points were available in this dataset to capture the details of the major indirect target downregulation peak, which is only just sufficient. Ideally, for an FDA analysis of such noisy data a sampling rate 2 or more times higher than this would be needed to capture fine details well [8].
Figure 5 shows the FPCA results for the directtarget training set. Again, a large variance in overall scale in the first channel (figure 5(a)) is seen. Figure 5(b) shows a substantial variance in the time of the major peak at around 72 hours as in the indirect target. However, it also shows evidence of a smaller secondary mode of variance at approximately 24 hours; this would suggest that there is a mixture of components to the curve, with the main direct miRNA repression occurring in the first phase, and a later component matching the response of the indirect targets. This biphasic response is also demonstrated in figure 6 which shows phase plane plots of curve slope against curve value for two representative genes of the direct and indirect target sets. Abrupt changes in such plots (cusps) highlight sudden changes in curve shape. The direct target plot, figure 6(a), shows a cusp at approximately 32 hours, compared with the smooth plot of the indirect target, figure 6(b). Existing kinetic models of miRNA gene regulation have not modeled this effect [21].
Classification using nonparametric FDA (NPFDA)
As noted above, the gene expression curves show a large variance in overall fold change in the first principal component of the functional PCA. A key preprocessing step in FDA is registration of the curves to remove unimportant amplitude or other variations. Prior to classification, therefore, each curve was standardized to have a mean (log_{2}) fold change of 1. Thus, average fold change differences between the direct and indirect target sets were normalized away, and the shape of the response curves alone was effectively used to distinguish the direct targets from the indirect targets. Using curve shape information only would be expected to provide a more robust predictor than also relying on the absolute fold change as a feature, as it varies substantially between genes. An NPFDA discrimination model was trained on the direct and indirect target training sets (see Methods). The performance was compared with linear discriminant analysis (LDA) as an example of a standard multivariate analysis.
The results are shown in Table 1. The NPFDA gave very good prediction accuracy of 88% and AUC of 0.96. (AUC is a measure of the discriminative power of the classes using the given features and classifier, and varies from 0.5 for nondistinguishable classes to 1.0 for perfectly distinguishable classes. The AUC can be interpreted as the probability that two random samples from the two classes will be ranked correctly, and is invariant to changes in class proportions (unlike accuracy)). The increase in accuracy over LDA was statistically significant (p < 0.01; McNemar test). Figure 7 shows the ROC curves for NPFDA and standard LDA: NPFDA shows a substantial improvement over multivariate LDA. For comparison, the discriminability of each time point treated separately as a univariate feature was examined: the expression level across genes at each time point was used directly for ranking to compute the corresponding AUC. The best discriminating time point was at 24 hours with an AUC of 0.91.
To further validate the trained model, we also used it to classify data that was independent of the training set: the nonconserved directtarget test set and the indirecttarget test set (See Method). Although the true status of these data are not known and so explicit accuracy and AUC results cannot be computed, we would expect the majority of the nonconserved test set to be true direct targets. However without conservation information we would expect a proportion of false positives i.e. indirect targets in actuality. For the indirecttarget test set, we would expect predominantly true indirect targets. We generated histograms of the posterior probabilities from the NPFDA in these datasets (figure 8). Comparing the histograms of the nonconserved direct target test set (red) and the indirecttarget test set (black), we see this expected result: the red set has overall high posterior probability (for being a direct target) with a large mode approaching 1. Based on this evidence, the majority of the predicted genes without conservation information are indeed true direct targets.
Figure 9(a) shows the time series plots for the top positive predictions of the nonconserved direct target test set (red set in the figure 8) with a posterior probability greater than 0.8 (size 262). These true positives show a profile similar to the direct targets in the training set, which gives high confidence that they are indeed true positives. Taking those high confidence predictions with posterior probability > 0.9 gives a set of 203 new nonconserved miRNA target predictions (cf. 76 conserved targets predicted in [4]).
Figure 9(b) shows the bottom predictions of the nonconserved direct target test set with a posterior probability of less than 0.2 (size 44). These curves appear to match the indirect target profile seen in the training set indicating that they are likely false positives of the computational predictors. Figure 9(c) shows the time series plots for the top predictions for the indirecttarget test set (black set in figure 8) with a posterior probability of less than 0.2 (size 262). These curves match the expected indirect target expression change profile. It also appears that there are at least two subgroups amongst these indirect targets, one with a much longer response time.
Conclusion
In this study, we presented an FDA analysis of the differences in expression profiles between direct and indirect miRNA targets in a miR124 miRNA transfection experiment. An exploratory FDA analysis showed differences in response latency, with direct targets showing an immediate downregulation whereas indirect targets typically showed an approximately 32 hour delay. Also, direct miRNA target curves show evidence of a biphasic, twocomponent response with an initial early decrease presumably due to direct effects of the miRNA on mRNA, and a later component matching the response of the indirect targets, possibly due to secondary effects.
These time profile differences can be utilized for classification, and in the prediction analysis we showed that FDA can provide very good discrimination, substantially better than a standard multivariate analysis, as FDA utilizes the prior knowledge that the biological process of regulation generates a smoothly varying time profile. Such a predictive model would be especially useful in cases where computational approaches are less reliable e.g. lack of evolutionary conservation. Further, this approach can be used to provide additional confirmatory evidence (posterior probabilities) for computationally predicted miRNA targets and so improve computational miRNA target prediction. This approach would be applicable to the investigation of other miRNAs and these results suggest that dynamic microarray studies at a higher time resolution could reveal further details on miRNA regulation.
References
 1.
Engels BM, Hutvagner G: Principles and effects of microRNAmediated posttranscriptional gene regulation. Oncogene 2006, 25: 6163–6169. 10.1038/sj.onc.1209909
 2.
Yoon S, De Micheli G: Computational identification of microRNAs and their targets. Birth Defects Res C Embryo Today 2006, 78: 118–128. 10.1002/bdrc.20067
 3.
Lim LP, Lau NC, GarrettEngele P, Grimson A, Schelter JM, Castle J, Bartel DP, Linsley PS, Johnson JM: Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 2005, 433: 769–773. 10.1038/nature03315
 4.
Wang X, Wang X: Systematic identification of microRNA functions by combining target prediction and expression profiling. Nucleic Acids Res 2006, 34: 1646–1652. 10.1093/nar/gkl068
 5.
Cheng C, Li LM: Inferring microRNA activities by combining gene expression with microRNA target prediction. PLoS ONE 2008, 3: e1989. 10.1371/journal.pone.0001989
 6.
Ramsay JO, Dalzell CJ: Some tools for functional data analysis. J R Stat Soc Ser B 1991, 53: 539–572.
 7.
Ramsay JO, Silverman BW: Functional data analysis. first edition. SpringerVerlag, New York: Springer Series in Statistics; 1997.
 8.
Ramsay JO, Silverman BW: Functional data analysis. second edition. SpringerVerlag, New York: Springer Series in Statistics; 2005.
 9.
Hastie T, Tibshirani R, Friedman J: The elements of statistical learning. SpringerVerlag, New York: Springer Series in Statistics; 2001.
 10.
Besse P, Ramsay JO: Principal components analysis of sampled functions. Psychometrika 1986, 51: 285–311. 10.1007/BF02293986
 11.
Ferraty F, Vieu P: Nonparametric functional data analysis. SpringerVerlag, New York: Springer Series in Statistics; 2006.
 12.
Song JJ, Lee HJ, Morris JS, Kang S: Clustering of timecourse gene expression data using functional data analysis. Comput Biol Chem 2007, 31: 265–274. 10.1016/j.compbiolchem.2007.05.006
 13.
Müller HG, Leng X: Classification using functional data analysis for temporal gene expression data. Bioinformatics 2005, 22: 68–76. 10.1093/bioinformatics/bti742
 14.
 15.
TargetScanS[http://www.targetscan.org/]
 16.
 17.
Ensembl[http://www.ensembl.org/]
 18.
BioMart[http://www.biomart.org/]
 19.
Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E: The role of site accessibility in microRNA target recognition. Nat Genet 2007, 39: 1278–1284. 10.1038/ng2135
 20.
Baek D, Villén J, Shin C, Camargo FD, Gygi SP, Bartel DP: The impact of microRNAs on protein output. Nature 2008, 455: 64–71. 10.1038/nature07242
 21.
Khanin R, Vinciotti V: Computational modeling of posttranscriptional gene regulation by microRNAs. J Comput Biol 2008, 15: 305–316. 10.1089/cmb.2007.0184
Acknowledgements
Thanks to Anders Krogh for helpful discussions. This work was supported by the Danish National Science Foundation and Novo Nordisk Foundation.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/10?issue=S1
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
BJP and JW designed the study and analyzed the data. Both authors drafted, read and approved the final manuscript. BJP and JW contributed equally to this work.
Rights and permissions
About this article
Published
DOI
Keywords
 Linear Discriminant Analysis
 miRNA Target
 Functional Data Analysis
 miRNA Target Prediction
 Functional Principal Component Analysis