Principal component analysis for predicting transcription-factor binding motifs from array-derived data

Background The responses to interleukin 1 (IL-1) in human chondrocytes constitute a complex regulatory mechanism, where multiple transcription factors interact combinatorially to transcription-factor binding motifs (TFBMs). In order to select a critical set of TFBMs from genomic DNA information and an array-derived data, an efficient algorithm to solve a combinatorial optimization problem is required. Although computational approaches based on evolutionary algorithms are commonly employed, an analytical algorithm would be useful to predict TFBMs at nearly no computational cost and evaluate varying modelling conditions. Singular value decomposition (SVD) is a powerful method to derive primary components of a given matrix. Applying SVD to a promoter matrix defined from regulatory DNA sequences, we derived a novel method to predict the critical set of TFBMs. Results The promoter matrix was defined to establish a quantitative relationship between the IL-1-driven mRNA alteration and genomic DNA sequences of the IL-1 responsive genes. The matrix was decomposed with SVD, and the effects of 8 potential TFBMs (5'-CAGGC-3', 5'-CGCCC-3', 5'-CCGCC-3', 5'-ATGGG-3', 5'-GGGAA-3', 5'-CGTCC-3', 5'-AAAGG-3', and 5'-ACCCA-3') were predicted from a pool of 512 random DNA sequences. The prediction included matches to the core binding motifs of biologically known TFBMs such as AP2, SP1, EGR1, KROX, GC-BOX, ABI4, ETF, E2F, SRF, STAT, IK-1, PPARγ, STAF, ROAZ, and NFκB, and their significance was evaluated numerically using Monte Carlo simulation and genetic algorithm. Conclusion The described SVD-based prediction is an analytical method to provide a set of potential TFBMs involved in transcriptional regulation. The results would be useful to evaluate analytically a contribution of individual DNA sequences.


Background
The use of microarrays has led to a significant number of exciting discoveries establishing important links between mRNA expression patterns and cellular states [1,2]. Math-ematical and computational models have been developed to understand and characterize the molecular mechanisms underlying expression patterns [3,4]. However, it remains difficult to discover and validate novel transcrip-tion-factor binding motifs (TFBMs) in the human genome. The popular approach to identify TFBMs utilizes sequence comparisons among co-expressed genes [5] or across multi-species [6]. Although any consensus motif can be searched among the co-regulated genes in hierar-chical clusters [7,8], this approach is not aimed to build a global model with multiple binding motifs. TFBM can be inspected through phylogenetic footprinting [6,9,10], but identifying orthologous genes and their associated regulatory regions are not always possible. Model-based approaches, initially developed using yeast genome [3], encounter difficulty in evaluating the astronomical number of TFBM selections in the combinatorial problem [11,12]. Although multiple binding motifs were selected in the yeast dataset using a recursive formula, prediction of TFBMs would be affected depending on the order of selected motifs [3]. Some models lack statistical standards for determining the number of TFBMs having combinatorial roles that are critical in expression patterns. Thus, a predictive model that provides a comprehensive set of TFBMs still needs to be developed.
The specific aim of the current study was to devise a model for predicting known and de novo transcription factor binding motifs from array-derived mRNA expression levels by developing a unique principal component analysis. We employed the responses of human chondrocytes to interleukin-1 (IL-1) as a model system [13]. IL-1 is a proinflammatory cytokine, and it stimulates not only inflammatory responses but also tissue degeneration [5]. More than 100 microarray analyses have been conducted to analyze IL-1-driven responses in various cell types, including chondrocytes [14,15], and significant efforts have been made to understand transcriptional mechanisms of IL-1 response [16][17][18]. However, few of the previous studies have validated the global roles of multiple critical TFBMs in downregulation or upregulation of a cluster of genes.
In this principal component analysis, we introduced the Akaike information criterion (AIC) test, singular value decomposition (SVD), and a genetic algorithm (GA) to predict and evaluate TFBMs from a pool of random DNA sequences (Fig. 1). The predictive model was formulated using state vectors, which represented a contribution of potential TFBMs to the IL-1 responses. The promoter matrix was defined to build the quantitative relationship between the mRNA expression vector and the state vectors, and a unique SVD procedure was applied to the promoter matrix. Although one previous study defined the mRNA expression level as a state variable, dynamical correlations among the mRNA levels do not directly represent biological processes [19]. Here, a state variable was defined as an activation level of each TFBM, and SVD was used to link the primary components in the expression vector to the influential TFBM candidates in the state vector through the eigen gene vectors and the eigen TFBM vectors. The analytical prediction of TFBMs with SVD was evaluated numerically using Monte Carlo simulation and GA.
Flowchart of the model-based analysis of transcription factor binding motifs (TFBMs) Figure 1 Flowchart of the model-based analysis of transcription factor binding motifs (TFBMs Selection of 45 IL-1-responsive genes and AIC analysis  Numbe r o f TFBMs AIC C

Results
Prediction and validation of novel and known TFBMs were conducted using logarithmic ratios of the IL-1-driven mRNA alterations in human chondrocytes (Fig. 1). First, AIC was used to determine a statistically meaningful number of TFBMs in the model. Second, the contribution of each of the 512 TFBM candidates to the IL-1 responses was evaluated by decomposing the promoter matrix with SVD. Third, the SVD-based priority of TFBMs was evaluated numerically by GA and Monte-Carlo simulation. Fourth, a linkage was established among the predicted and known TFBMs.

Messenger RNA ratios and AIC analysis
Using data obtained in primary cultures of human articular chondrocytes, 45 IL-1-responsive genes were selected and the ratios of mRNA levels from IL-1-treated cells against mRNA levels in untreated cells were calculated from the list of IL-1-responsive genes in primary chondrocytes published by Vincenti and Brinckerhoff [13]. As shown in Fig. 2A, the relative mRNA levels are represented in a greyscale, and the logarithmic ratios are illustrated in a green to red color code. The mRNA ratios for 33 genes were positive (upregulation; indicated by green), while the ratios for 12 genes were negative (downregulation; indicated by red). Using Eq. (1) and the SVD procedure, these logarithmic mRNA ratios were modelled against 1 to 32 TFBMs that were chosen from random DNA sequences of 5 bp in length (Fig. 2B). As expected, the model error decreased monotonically as the number of TFBMs increased from 1 to 32. In order to estimate the proper number of TFBMs in the model, AIC was calculated using Eq. (2) (Fig. 2C). The minimum AIC was obtained with 8 TFBMs, which were used as models for further analysis.

SVD analysis
Using the SVD procedure, the promoter matrix H, built from the 300-bp upstream flanking sequences, was factorized into three matrices in Eq. (4). Using the eigen gene vectors in U (Fig. 3A) and the eigen values in Λ (Fig. 3B), the observed mRNA ratios were decomposed linearly with definition of the weighing factors, k i (Fig. 3C), in Eq. (5). Out of 45 eigen values, the primary and the secondary eigen values were 133.4 and 64.6. Shannon entropy was calculated as 0.65 [6], and the eigen values suggested a relatively even spread distribution among the 45 eigen gene vectors. Note that that Shannon entropy takes values between 0 and 1, and a smaller value suggests that expression data are dominated by influential eigen values. Using the weighing factors for each of the eigen TFBM vectors, the most influential 8 TFBMs, whose contribution to the expression levels of IL-1-responsive genes was predicted to be larger than the others, were selected. First, the eigen TFBM vectors (Fig. 4A) were derived as a complement of the eigen gene vectors. Then, each TFBM candidate in the eigen TFBM vectors was weighted by the same weighting factors defined in Eq.

GA analysis, Monte-Carlo simulation, and leave-one-out test
In order to evaluate the selection of 8 TFBMs based on the above principal component analysis, the numerical search for TFBM candidates was conducted with the GA analysis. Starting with 200 digital chromosomes in Eq. (6), including the chromosome for the SVD solution, the population of chromosomes was evolved for 10 4 generations. During evolution, the model error was reduced through artificial chromosome recombinations and mutations (Fig. 5A). The sum square error for the mRNA ratios was 15.94 (SVD solution) and 7.55 (GA solution). These values were smaller than the Monte-Carlo results of 58.97 ± 8.61 (N = 10,000) using a random selection of TFBMs (Fig. 5B). The GA solution reduced the error of the SVD solution by 52.6% by retaining five SVD-driven TFBMs and introducing three new TFBMs, 5'-CGTCC-3', 5'-AAAGG-3', and 5'-ACCCA-3' (Fig. 5C).
In order to further examine the SVD-based model, we conducted a leave-one-out test. In this test, (N -1) genes were used to build a model and one gene was used to validate the model through any difference between the observed and the predicted expression levels. The process was repeated N times (N = 45) by removing one gene at a time.
The model error for a complete set of leave-one-out tests was 33. To evaluate significance of the leave-one-out model error, Monte-Carlo analyses were conducted using two datasets. In the first dataset the elements in the promoter matrix was reshuffled, and in the second dataset the order of mRNA expression levels was randomized. The model error was 108 ± 31 (mean ± s.d.) and 93 ± 23 for the first and the second datasets, respectively (Fig. 6).

Linkage to known TFBMs
The 8 TFBM candidates obtained from the GA analysis were graphically linked to the known TFBMs (Fig. 7). The GA-based TFBMs are represented by 8 boxes in the first column, and each box is linked to the biologically known TFBMs such as AP2, SP1, EGR1, etc. For instance, 5'-CGCCC-3', one of the TFBMs predicted by GA, is part of consensus sequences of SP1, EGR1, KROX, GC-BOX, and ABI4.

Discussion
In this report, we have presented a predictive model and its validation using the transcriptional responses to IL-1 in human chondrocytes as a model system. From a pool of 512 random DNA sequences of 5 bp in length as potential TFBM candidates, the SVD analysis and the GA simulation both identified 8 TFBMs. Five out of 8 TFBM candidates were identical in both analyses, and several of the known TFBMs, including AP2, EGR1, GC-BOX, SP1, NFκB, and LEF1, coincided with the predicted TFBMs.
Prior to application to the mammalian gene expression in the current study, the described approach was examined to build a model for a Ras/cAMP signal transduction pathway in yeast. This pathway is well characterized in yeast, and a cAMP responsive element (CRE; , which is conserved in eukaryotes, is known to be involved. The SVD-based approach with 5-bp sequences predicted a part of CRE (5'-AATGC-3') together with two yeast-specific binding motifs such as 5'-AGGGG-3' (binding motif for MSN2/MSN4; stress responsive element) and 5'-ACCGG-3' (binding motif for LEU3). Since both MSN2 and LEU3 are differentially expressed in response to Ras activation [5], the results allowed us to apply this principal component approach to the current study on the human IL-1 responses (see additional file).
In the prediction phase of TFBM analysis, we demonstrated that the SVD analysis prioritized the contribution of individual TFBM candidates, and the GA algorithm was employed to evaluate independently the SVD solution. SVD is computationally inexpensive, and the results are reproducible since no random parameters are involved. It is straightforward to incorporate the effects of degenerate binding sequences by modifying a linear combination of the eigen TFBM vectors and adding contributions from redundant sequences in the final SVD procedure. More specifically, to any TFBM candidate there are 15 degenerate motifs with one base-pair mismatch and the contribution of these degenerate sequences can be included in the model with an appropriate weighting factor. The standard computational complexity of SVD procedure is estimated The use of mathematical and computational procedures such as AIC, SVD, and GA have been used previously to analyze the behaviour of complex biological systems [22,23]. In prediction of TFBMs from the microarray data, however, the described usage here is unique in a novel state-variable representation. Since many genes are regulated by multiple TFBMs, a statistical standard such as AIC may be used appropriately to validate the number of TFBMs that are meaningful in array-derived data. The previous use of SVD has been limited to clustering expression patterns in the eigen gene space [22,24]. The unique feature of the described predictive model is to link the eigen gene space to the eigen TFBM space by applying SVD to the promoter matrix defined from TFBMs. Evolutionary algorithm such as GA has been used to estimate the values of parameters [25,26]. We employed GA to select the set of TFBMs from an artificial chromosome that is composed of on/off switches for 512 random DNA sequences.

SVD-based selection of TFBMs
The predictive model in this study generated many testable hypotheses on known TFBMs, as well as novel TFBM candidates, and led us to the analysis of transcription factors. Five out of the 8 TFBM candidates were linked to known transcription factors. Among them, AP2 is known to be involved in stress responses [27] and LEF1 is known to be involved in a wnt signalling pathway [28]. However, neither AP2 nor LEF1 is reported to be responsive to IL-1. EGR1 increases expression of inflammatory cytokines and is involved in IL-1-induced downregulation of the type II collagen promoter in chondrocytes [29], and the GC-box is a widely distributed promoter component. The binding site of SP1 is recognized by SP3, which may oppose positive effects of SP1 [30]. NFκB is a pivotal transcription factor that is both induced at the mRNA level, as shown here, and activated by proinflammatory cytokines [31][32][33]. However, the relatively long degenerate consensus sequence of its binding site 5'-GGG(A/G)(C/A/T)T(T/ C)(T/C)CC-3'requires a further linkage analysis to the predicted TFBM of 5'-GGGAA-3'. In a separate study, the promoter competition assay was conducted to evaluate the role of the SVD-selected TFBMs using three IL-1-responsive genes, LIF, NFκB2, and IRF1 [34]. In the assay, the stimulatory effects of 5'-CAGGC-3' and 5'-CGCCC-3', as well as the inhibitory effects of 5'-CCGCC-3', 5'-CACCG-3', and 5'-GCGCC-3', were consistent to the SVD prediction. In order to further validate the stimulatory role of 5'-CAGGC-3', a gel shift assay was conducted. As predicted, incubation with the nuclear extracts isolated from the IL-1-treated cells retarded the mobility of the DNA fragment containing 5'-CAGGC-3' (see additional file).
The described state-variable formulation of the predictive model can be extended to include redundancy in TFBM consensus sequences, temporal mRNA profiles, and interactions of TFBMs with transcription factors and cofactors.

Short motifs such as 5 bp TFBMs in this study may present
Leave-one-out test less specificity, but the described SVD procedure can increase specificity easier than any combinatorial search algorithm such as GA. The model can be extended to predict a dynamical state of TFBMs associated with the regulation of the temporal mRNA expression profiles [23].
Interactions among TFBMs through transcription factors and cofactors can be implemented through the nonlinear version [35].

Conclusion
Identification of TFBMs in the human genome is critically important in the post Human Genome Project era [36].
Although experimental evaluation is mandatory to gain biological insights from the model-based predictive results, an analytical model at nearly no computational cost would be useful to provide initial conditions for numerical optimization or predict a set of potential targets for experimental verification. Although the prediction is dependent on definition of regulatory regions, the described model-based analysis allowed us to gain a new biochemical insight on the IL-1 responses by integrating the SVD procedure and Akaike information criterion.
In conclusion, the current study on gene responses to IL-1 demonstrates that application of the primary component analysis would predict and validate the novel and known TFBMs from the microarray data using genomic DNA information.

Determination of mRNA ratios
The mRNA expression data for the IL-1-responsive genes in primary cultures of human articular chondrocytes were obtained from the lists published by Vincenti and Brinckerhoff [13]. The logarithmic ratios of mRNA levels in IL-1β (10 ng/ml)-treated chondrocytes to control mRNA levels were determined for 45 IL-1-responsive genes, whose transcription initiation site was identifiable in the Gen-Bank sequences or by the PEG program [37,38]. The logarithmic ratio makes it easy to characterize both upregulation and downregulation to the control level, and it has been widely used to model array-derived expression data in yeast and human [3,39]. The described SVD-based approach is effective for modelling both upregulated and downregulated genes, and the positive and negative values represent the upregulated and downregulated genes, respectively ( Fig. 2A).

Definition of promoter matrix
Prior to mathematical formulation, a promoter matrix H nxM was defined, where n was the number of the IL-1responsive genes and M was the total number of TFBM candidates. The element h ij in H nxM represented the number of appearance of the j-th TFBM candidate on the 5'-end flanking region, 300 bp in length in the current study, of the i-th IL-1-responsive gene. The length of 300 bp was determined to minimize the least-square model error from the upstream regions of 100 bp to 5000 bp with a 100-bp interval. In this study, 512 TFBM candidates (M = 512), 5-bp DNA sequences including 5'-AAAAA-3', 5'-AAAAC-3', etc., were initially screened without considering polarity of DNA strands, and the critical TFBMs were selected by the SVD-based procedures described below.
Since the length of motifs varies from 5 to 30 bp in TRANSFAC database, the 5-bp sequences were chosen as a potential core binding motif and their linkage to known motifs with redundancy was considered using TRANSFAC database.

Formulation
Using the promoter matrix H nxm , the mRNA level of each IL-1-responsive gene was modelled [40]: where z was the mRNA expression vector representing the logarithmic mRNA ratios for the 45 IL-1-responsive genes, and x was the state vector representing the role of TFBM candidates in achieving the observed values in z. Note that we used M as the total number of TFBM candidates (M = 512 in this study), m as the number of TFBMs in the SVDbased model, and as the estimate of m based on Akaike information criterion below.

Akaike information criterion
In order to avoid underfitting or overfitting the mRNA ratios with TFBM candidates, AIC was defined and used as an indicator of statistical measure [41]: where L( , m) was the likelihood function, and was the estimate of x. The value of was determined using the singular value decomposition procedure described below.

Singular value decomposition (SVD)
SVD is a matrix decomposition technique which can be applied to any rectangular matrix. It decomposes a matrix into two orthogonal matrices and one eigenvalue matrix. Two orthogonal matrices represent the column and the row spaces in the original matrix, and the eigenvalue matrix relates these two spaces. In order to evaluate the contribution of 512 potential TFBMs to the IL-1 responses, the promoter matrix H nxM was factorized using SVD: where U nxn (u 1 , u 2 , ..., u n ) was defined as the eigen gene πσ σm m Determination of k i was achieved by projecting the vector z to λ i u i direction. Therefore, taking an inner product between z and λ i u i gave k i .
Since u i and v i are the associated bases in the gene space and the TFBM space respectively, the factor k i (i = 1, 2,..., n) for describing the expression in gene space can be used to model the contribution of individual TFBMs in the TFBM space. For instance, let us consider one extreme case where z was parallel to u 1 . Then, a contribution of TFBMs to z would be proportional to v 1 and not affected by the other v i (i ≠ 1) since the eigenvalue matrix Λ nxM does not have any non-diagonal components. Therefore, the elements in v 1 would be used to indicate potential importance of M TFBM candidates. In a general case, the SVD procedure allowed us to evaluate n pairs of u i and v i through λ i and k i without conducting any combinatorial search. In order to model the gene space using the observed mRNA expression of z, the orthogonal vectors (u 1 , u 2 ,..., u n ) are linearly combined using k i (i = 1, 2,..., n). In order to model the TFBM space, a linear combination of the orthogonal vectors (v 1 , v 1 ,..., v n ) is made.
Based on the above rationale, we evaluated the linear combination of the eigen TFBM vectors in a form of . This vector a plays the similar role of zin Eq.
(5). M elements in a indicates the role and the contribution of M TFBM candidates. The positive/negative value suggests a stimulatory/inhibitory role, and a larger absolute value implies a stronger contribution. Therefore, the procedure to select TFBMs is to choose a set of top TFBMs whose value in a is larger than other (512 -) TFBMs. To include redundancy in TFBM consensus sequences, a weighted linear combination of elements in a can be used. In summary, the principal component analysis allows us to identify the principal expression components using the eigen gene vectors and to predict the principal TFBM using the eigen TFBM vectors. With the weighting factors defined from the observed value of z, the vector a indicates the predicted contribution of individual TFBM candidates to the observed expression pattern.

Genetic Algorithm (GA) and Monte Carlo simulations
In order to evaluate the SVD-based prediction of TFBMs, the numerical simulations with GA were conducted using the procedure described previously [42]. In a chromosome-like bit map, 512 TFBM candidates were embedded: C = [c 1 , c 2 ,..., c 512 ] (6) where each chromosomal element took "1" and "0" for inclusion and non-inclusion in the model, respectively.
Note that for any chromosome, and the promoter matrix was constructed based on the value of each chromosomal element c i . Two hundred chromosomes represented the population, and one chromosome in the first generation corresponded to the SVD selection. In each generation, 100 chromosomes with smaller errors were recombined, and the other 100 chromosomes with larger errors were mutated. The model error was defined as , and the state variable, x, was estimated using a least-square scheme: Note that n = 45 and m = = 8 in GA. Monte Carlo simulation was also performed to evaluate numerically the SVD-and GA-based selection of TFBMs [42]. A set of TFBMs was randomly chosen from 512 TFBM candidates, and the error distribution associated with the randomly selected TFBMs was compared to the error in the modelbased prediction. The simulation was conducted 10,000 times.

Linkage map among TFBMs
The 8 TFBM candidates, derived from the GA analysis, were linked to the biologically known TFBMs. We evaluated the 5-bp core consensus sequences identical to the known TFBMs using TRANSFAC database [43]. Since the motifs in the database ranges up to 30 bp, it is possible that a 5-bp TFBM candidate corresponds to multiple motifs in the database. Namely, the state vector could represent the combined role of binding motifs when the predicted motifs are shared among transcription factors.