- Methodology article
- Open Access
Exploring the transcription factor activity in high-throughput gene expression data using RLQ analysis
BMC Bioinformatics volume 14, Article number: 178 (2013)
Interpretation of gene expression microarray data in the light of external information on both columns and rows (experimental variables and gene annotations) facilitates the extraction of pertinent information hidden in these complex data. Biologists classically interpret genes of interest after retrieving functional information from a subset of genes of interest. Transcription factors play an important role in orchestrating the regulation of gene expression. Their activity can be deduced by examining the presence of putative transcription factors binding sites in the gene promoter regions.
In this paper we present the multivariate statistical method RLQ which aims to analyze microarray data where additional information is available on both genes and samples. As an illustrative example, we applied RLQ methodology to analyze transcription factor activity associated with the time-course effect of steroids on the growth of primary human lung fibroblasts. RLQ could successfully predict transcription factor activity, and could integrate various other sources of external information in the main frame of the analysis. The approach was validated by means of alternative statistical methods and biological validation.
RLQ provides an efficient way of extracting and visualizing structures present in a gene expression dataset by directly modeling the link between experimental variables and gene annotations.
Gene expression microarray technology enables simultaneous monitoring of the expression level of thousands of genes. The biological interpretation of gene expression microarray findings remains challenging since it generally requires the explicit link to supplementary knowledge related to the function of genes and their inter-connections through functional networks .
Information on samples, usually related to the design of experiments (e.g. disease classes, treatment, time-course effect, replicates, etc.), is commonly integrated and modeled in the main analysis in order to identify genes which are specifically dysregulated under certain pre-defined conditions. In a second step, a selection of genes of interest is classically interpreted in the light of external information including functional annotations derived from various knowledge databases such as Gene Ontology  or KEGG (Kyoto Encyclopedia of Genes and Genomes) molecular pathways . These two steps (gene extraction followed by interpretation) are generally distinct and come sequentially. We hypothesize that treating these two analytical steps in one single integrated manner can facilitate the interpretation of gene expression microarray data.
Transcription factors are regulatory elements which bind to specific DNA sequences generally located in the promoter region of genes. They orchestrate the regulation of genes by enhancing or inhibiting their transcription. Putative transcription factor binding sites (TFBS) can be predicted by searching for short specific motifs in the region upstream of the gene transcription starting site. Identifying TFBS in a list of genes of interest can help the interpretation of gene expression data in the light of the transcription factor activity. The prediction of TFBS in the promoter region of a list of genes was until recently a tedious task, involving the extraction of gene promoter sequences, followed by pattern recognition using motif databases such as TRANSFAC  or JASPAR . However, recent bioinformatic developments allow the automation of most of these complex processes. Several open-source applications were developed, including statistical packages (e.g. the R package MotIV ), as well as various web tools. As an example, Zambelli and collaborators  recently proposed a new application — pscan — which facilitates the discovery of TFBSs, which are over- (or under-) represented in a list of genes.
Despite the emergence of novel bioinformatics solutions, methodological improvements are required in order to integrate TFBS information in the analytical work-flow of gene expression data and simplify results visualization and interpretation.
Correspondence analysis (CA) is, together with principal component analysis (PCA), a popular ordination method for the exploratory analysis of gene expression microarray data. Applications of CA in the field of ’omics was first described in the early 2000s [8-10]. Since then, several refinements of CA were described, exploiting some particular features of the method in order to investigate patterns of variation present in microarray data. Besides the table of direct interest (gene expression data), external information regarding both observations and genes is generally available. This information can be integrated to CA as shown by Busold and colleagues . The authors proposed to use CA for the exploration of microarray data in the light of gene ontology annotations. This supplementary information is superimposed with the original CA results. CA eventually provides graphical solutions that allow to visualize in a single plot, genes, observations, experimental conditions and gene annotations [11, 12]. On the other hand, a supervised counterpart of CA (a.k.a constrained correspondence analysis) was applied to the analysis of microarray data. Constrained CA has the advantage over unsupervised-CA of taking the external information explicitly into account. Between-group correspondence analysis (BGA)  is an example where an explanatory variable is used to constrain CA. BGA applies when observations are grouped into categories (e.g. disease classes) defined by one single nominal variable. BGA tries to best discriminate the per-group centroids by finding axes that maximize the ratio of between- over within-group variance. More complex designs of experiments can be modeled using the generalized correspondence analysis with respect to instrumental variables (CAIV) . Qualitative as well as quantitative variables can be modeled, positively or negatively (effect removal), within the framework of CA. Recently, Jeffery and colleagues  combined BGA with an additional table including the occurrence of transcription factor binding sites (BG-COI) using co-inertia analysis [16, 17].
In this manuscript, we introduce RLQ (R-mode; Q-mode; L-link between R and Q), to provide a broader generalization of the analysis of a central table of interest for which external information on both rows and columns is available. RLQ is a three-table ordination method, initially developed in ecological science [18, 19]. Variations around the same L-structure principle exist in various fields of science such as food science (L-PLSR) , psychometry , consumer preference analysis . In RLQ, the joint structure of three tables is analyzed, the central table being treated by CA.
RLQ analysis is suitable for answering questions such as:
How can we interpret gene expression data in the light of external gene annotations?
How strong is the link between experimental variables and gene annotations?
Can we find patterns of variation in the gene expression data set, which can be both explained by the sample and gene descriptors?
Throughout this work, we describe the general framework of RLQ and show its applicability for the interpretation of gene expression data in light of external gene annotations, with a main focus on the presence/absence of putative TFBS.
In the first section, the mathematical background of RLQ is described. As an illustration, RLQ is applied to a biological example where the aim is to explore the effect of the steroid mometasone furoate (MF) in the proliferation of primary human lung fibroblasts. The regulatory role of transcription factors in this example is of specific interest. It will be compared to the existing knowledge and challenged using alternative approaches. Additional biological validation will be provided. In the final section, the relevance of the method, as well as its strengths and limitations are discussed.
Theory of RLQ analysis
RLQ was used to explore the inter-connections between three matrices linked together in an L-shape manner (Figure 1).
Let us define the following three tables:
L the (n×m) table of gene expression (n genes, m samples)
R the (m×p) table including the experimental design information (m samples, p variables)
Q the (n×q) table including external information about genes (n genes, q descriptors)
The inter-relationship between the three tables is analyzed by performing singular value decomposition (SVD) of the following: . The matrices (R0,L0,Q0) derive from the original (R, L, Q) matrices after the following transformations.
The central table L is treated by correspondence analysis (CA). In the scheme of CA, the table is converted into a χ2 distance matrix L0 defined as follows:
with P=L/N the data matrix divided by its grand total, r the n-dim vector of row sums of P, c the m-dim vector of column sums of P, D r =diag(r) and D c =diag(c) the diagonal matrices of the row sums and the column sums respectively.
Let Z be the cross-product . The singular value decomposition of Z can be written as follows:
with Λ the k×k (k= rank(Z)) diagonal matrix of singular values associated with Z with , U an (m×k) matrix whose columns are the left singular vectors of Z and V an n×k matrix whose columns are the right singular vectors of Z. The rows of U and V are orthogonal with respect to D r and D c respectively:
The principal components and row coordinates are respectively given by and . The principal axes and column coordinates are respectively given by and .
Through this particular procedure, where the central table L is analyzed by correspondence analysis, RLQ analysis maximizes the covariance between linear combinations of columns of R and Q.
In order to test the link between the experimental design and the gene annotations, the fourth-corner statistic was used [19, 23]. This permutation procedure tests the null-hypothesis of absence of link between tables Q and R mediated byL. The permutations are performed within each row of table L.
The transformation applied to R depends on the nature of the data (quantitative, qualitative, fuzzy coded, etc.). In the particular case where R only includes quantitative variables, R0 is obtained by normalizing R column-wise (centered by the weighted mean, and divided by the standard deviation). Row weights are set by the column weights of the previous CA procedure (c).
Similarly, the transformation applied to Q depends on the nature of the data. When Q only includes quantitative variables, Q0 is obtained by normalizing Q column-wise (centered by the weighted mean, and divided by the standard deviation). Row weights correspond to the row weights of the previous CA procedure (r).
Transcription factor binding sites and functional annotations
TFBSs were extracted using the stand-alone version of the pscan software . The original Affymetrix IDs needed first to be converted into RefSeq IDs. The following options were used: the TFBS database was TRANSFAC; the mapping was based on the promoter region specified as 450 bases upstream and 50 bases downstream the gene transcription starting site. Pscan outcome resulted in z-scores matrix linking each gene with each TFBS. High scores reflect a higher chance of presence of a given TFBS in the promoter region as compared to the genome-wide mean, whereas low scores reflect a lower chance of the presence of a given TFBS in a given gene promoter region as compared to the genome-wide mean. In turn, this z-scores matrix was used to build the occurrence matrix (table Q) in the RLQ analysis. The occurrence matrix reflects the presence/absence (0/1) of TFBS in a list of genes. We considered that z-scores superior to 2 reflects the actual presence of a given TFBS in the promoter region of the gene. The transformation of the quantitative z-score matrix into a qualitative occurrence matrix was decided in order to focus on TFBSs with the highest confidence. This resulted in a more sensitive analysis with a better readability of the results. The table of TFBS occurrence included 282 TFBS entries. Notice that some TFBS from the TRANSFAC database have several accession number corresponding to different motifs specific to the same TFBS. This explains why some TFBS are sometimes displayed more than once.
Molecular KEGG pathways and Gene Ontology annotations were directly retrieved using the hgu133a.db meta-data package. A table of KEGG term occurrence (table Q) was built based on the presence/absence of KEGG annotations for each of the investigated genes. Following this procedure, a total of 87 KEGG terms were specifically integrated. Similarly the table of GO terms occurrence (restricted to biological process domain) included 694 entries.
Comparison with current standards
The results obtained by RLQ analysis were compared to TFBS enrichment analysis. Over-represented TFBS motifs were extracted using pscan. The TRANSFAC motif database was used with the same setting as the one used to generate the TFBS occurrence matrix in RLQ analysis. The TFBS enrichments are given by z-score test p-values. In addition, RLQ was compared with 3 competing ordination methods: CA, BG-COI and L-PLSR. The three ordination methods used in this comparison rely on three different schemes. Correspondence analysis is a 1-table ordination method, where supplementary information about rows and columns can be projected a posteriori. Between-group coinertia analysis is a 2-table ordination method where the main table (gene expression data) is constrained observation-wise by a single categorical variable defining groups among observations. L-PLSR is a 3-table ordination method where all three tables (gene expression data and its row/column external information) are treated symmetrically. All these methods were applied to explore the transcription factor activity associated with MF (including all 22’283 probe sets). Their relative performance was assessed and compared to RLQ.
Mometasone furoate dataset
In this dataset, we investigated the time-course effect of the glucocorticoid mometasone furoate (MF) on primary human lung fibroblasts. Two cell lines of fibroblasts were established from pulmonary biopsies obtained from patients undergoing lobectomie or pneumonectomy for peripheral lung cancer as described elsewhere . Cells were cultured in RPMI 1640, 5% FCS, 20 mM HEPES or DMEM, 10% FCS, MEM vitamins. All cell culture media and additives were purchased from Facola/Seromed (Basel, Switzerland). Treatment and experiments were performed between passages 2 and 5. Before the preparation of nuclear and cytosolic extracts, cells were subcultured in Petri dishes and kept for 24 to 48 h in serum-rich medium until they reached 60-80% confluency. Prior to treatment, cells were serum deprived for 24 h with 0.1% FCS. Low serum medium was exchanged every 12 h. MF was diluted in 100% ethanol and added to the medium with a final concentration of 10−8M. Cell lines were harvested and gene profiled at 8 time points (baseline, 20’, 40’, 1 h, 1.5 h, 2 h, 3 h and 6 h). The gene expression level was monitored according to the manufacturer recommendations using the Affymetrix Genechip Human Genome U133A platform which measured 22’283 probe set intensity levels (raw files have been deposited in NCBI’s Gene Expression Omnibus (GEO) and are accessible through GEO series accession number GSE30242).
Statistical considerations and implementations
All calculations were done using the R statistical software including the package ade4, as well as packages from Bioconductor  including affy, hgu133a.db, annotate, seqinr, GO.db. Gene expression data were normalized using the robust multichip average (RMA) method .
The RLQ procedure is implemented in the package ade4 with the function rlq. It requires the use of three separate analyses of tables R, L and Q which are combined using the function rlq. As previously mentioned, the central table L must be treated by CA, whereas the analysis of the two other tables (R and Q) depends on the nature of the variables (principal component analysis for quantitative variables, multiple correspondence for qualitative variables, Hillsmith analysis for a mixed table of quantitative and qualitative variables, etc.).
The ade4 package also includes Monte-Carlo permutation tests specifically implemented for rlq, as well as the function fourthcorner which is the implementation of the fourth-corner statistic measuring and testing the link between the three tables [19, 23].
A wrap-up package (R package rlqomics) that helps to automate these analytical steps in the frame of genomics analysis was developed and is available upon request.
Time course effect of mometasone furoate on the proliferation of lung fibroblasts
To provide a concrete application of RLQ, we analyzed the time course effect of the glucocorticoid MF on the proliferation of primary human lung fibroblasts in the light of transcription factor activity. Two cell lines treated by MF were monitored at 8 time points.
The genes which were mostly dysregulated during the time-course of MF treatment are described in Table 1. These genes included among others various key genes associated with the general mechanisms of the known action of glucocorticoids, such as the inhibition of the transcription of proinflammatory genes via specific transcription factors . Accordingly, our results showed that MF down-regulated genes involved in the initiation and maintenance of inflammation, e.g. chemokines (CCL2, CCL7, CCRL1, CXCL2, CXCL3), interleukines (IL6, IL8, IL11, IL33), early response genes (EGR1, IER2, IER3) and transcription factors such as PI3K or FOXO1.
RLQ analysis and transcription factors activity
The presence of putative TFBSs was assessed using pscan. In this example, RLQ analysis was based on the 100 genes mostly dysregulated during the time course effect of MF (Table 1). The transcription factors activity described below is measured by the presence of TFBSs that presumably belong (according to pscan predictions) to the promoter region of the most dysregulated genes. The putative transcription factors activity is summarized by an RLQ biplot (Figure 2). The biplot representation depicts the activity of transcription factors varying over time, as measured on the first 2 axes of RLQ. Biplots in RLQ should be interpreted similarly to biplots of other ordination methods. Both variables (TFBS) and observations (time points) are displayed. The TFBSs that mostly explain the variability extracted by the RLQ axes are the ones with the highest absolute scores (TFBSs located at the extremity of these axes). The further away the TFBSs are from the origin of the axes, the stronger they are associated with the time points lying towards the same direction. The cosine of the angle formed by two TFBSs indicates the correlation between these two TFBSs. Two TFBSs located in the opposite direction in the biplot (relatively to the origin of the axes) are inversely correlated.
The first 2 axes of the RLQ analysis summarize the vast majority of the total inertia (83% on the first axis and 14% on the second). RLQ analysis shows a gradient in the time course effect of MF associated with the first RLQ axis. Early time points correlate with low scores on the first axis whereas late time points correlate with high scores on the first axis. The second axis separates the intermediate to the extreme time points. When considering the 2 axes together, the different time points are distributed according to a U-shape. This effect, known as horseshoe effect, outlines the transcriptional changes over time as the effect of the action of MF. Although this effect results of a distortion of the ordination diagram, it facilitates the biplot-based interpretation of the activated TF along a unidirectional time gradient. Considering that the activity of MF is effective within 1-2 hours, the effect of MF can be subdivided into early and late effects. In Figure 2, the TFBSs associated with genes that are early up-regulated (time 1.5-2 h) are located in the lower right quadrant, whereas TFBSs associated with genes that are late up-regulated (time 3-6 h) are located in the upper right quadrant of the biplot. At the opposite directions, one can identify TFBSs which are associated with genes that are early down-regulated (upper left quadrant), as well as TFBSs associated with genes that are late down-regulated (lower left quadrant).
The forkhead-related activator 2, 3, and 4 (Freac 2, Freac 3, Freac 4), as well as cMyb are TFBSs which are prominently present in genes that are early up-regulated due to the effect of MF. The transcription factor c-Myb is known to play an important role in the regulation of cellular proliferation and differentiation. CREB, CRE-BP1, NF-1 (nuclear factor 1) are transcription factors which are associated with late up-regulated genes by MF. In the opposite direction (left quadrants), TFBSs which were present in down-regulated genes included AP1-2, transcription factors involved in cellular differentiation, proliferation and apoptosis. In Figure 2, NF- κB and cRel were both found in the opposite direction (upper-left corner) from the early time points (times 1.5-2 h), suggesting an early inhibition of this transcription factor by MF. The combined role of NF- κB and AP-1 transcription factors in the action of steroids is well documented in the literature [29, 30]. As confirmed by the RLQ analysis, the repression of these transcription factors corresponds to the mechanism underlying the anti-inflammatory efficacy of corticosteroids [31, 32]. Our data suggest that the octamer transcription factor 1 (OCT-1) is involved in the early down-regulation of genes by MF. Prior work indicated that OCT-1 cooperates synergistically with the glucocorticoid receptor (GR) in restricting transcriptional cooperativity to promoters containing DNA binding sites for both factors .
A permutation test based on the total inertia computed by RLQ analysis showed a significant link between R and Q through L (p<0.001). Another measure of the link between the three tables was further computed using the fourth-corner statistics. This procedure provides a synthetic representation of the TFBS activity over time (Additional file 1: Figure S1).
Biological validation of the transcription factor activity
The activity of transcription factors OCT-1 and CREB was further investigated using protein analysis by Western blotting. Based on our RLQ findings, OCT-1 and CREB have a strong activity associated with MF treatment. According to predictions summarized in the RLQ biplot (Figure 2), the high level of OCT-1 activity at baseline decreased over time, whereas CREB activity showed an increase within 2 hours. Figure 3 shows the protein expression time course (0-2 hours) of OCT-1 and CREB in the nuclear compartment of primary lung fibroblasts treated by MF. OCT-1 is present in the nucleus at early time points and is rapidly decreasing after steroid treatment. In addition, by comparing the protein expression level in the nucleus and the cytosol (Additional file 1: Figure S2), one can see that OCT-1 is decreasing from both compartments. This decrease of OCT-1 in the cytosol could be explained by either an increase of protein degradation or a reduction of de novo synthesis. In contrast, CREB is not present in the nucleus (only in the cytosol) at early time points, and after 45 min of MF treatment, it is translocated into the nucleus. Here MF acts predominantly by translocating the CREB protein, since the total of nuclear plus cytosolic CREB levels remains constant (indicating no significant netto cytosolic protein degradation) (Additional file 1: Figure S2). It is notable that the expression of OCT-1 increased after 60 minutes. However, protein analysis suggested that this occurs because of a decreasing level of cytosolic protein. Overall, MF induces a reduction of OCT-1 protein levels, and it induces a translocation of activated CREB within the first 2 hours, which is in agreement with RLQ predictions.
RLQ analysis and other publicly available gene annotations
The same data set was analyzed in the light of other functional feature databases including KEGG biochemical pathways and Gene Ontology (GO). Figure 4 summarizes the RLQ analyses made at these 2 additional levels.The left panel displays the KEGG analysis of the MF dataset. The interpretation of the biplot follows the same process as for the TFBS analysis. Down-regulated genes (left quadrants) involve cytokine-cytokine receptor interactions pathways (KEGG 04060), whereas genes that are early up-regulated (lower right quadrant) are involved in pathways such as Type II diabetes mellitus (KEGG 04930) and insulin signaling pathway (KEGG 04910).
The right panel displays the GO analysis of the MF dataset. Genes down-regulated due to the effect of MF are associated with response to drug (GO:0042493) or response to organic cyclic compound (GO:0014070) (upper-left quadrant). After one hour, MF affects genes involved in inflammatory response (GO:0006954) and signal transduction (GO:0007165). At later time points (upper-right quadrant), genes that belong to functional categories such as insulin receptor signaling pathways (GO:0008286) and phosphotidylinositol-mediated signaling (GO:0048015) are over-represented.
Comparison of RLQ analysis with alternative approaches
The prediction of TFBS motifs in our MF dataset was analyzed using enrichment analysis (EA). The pscan software tool was used to identify over-represented TFBS motifs in the previously described list of 100 most dysregulated genes (Table 1). The 20 most significantly over-represented TFBS (p<0.1) were compared to the 20 most influential TFBS identified by RLQ and highlighted in Figure 2. Table 2 provides a summary of the TFBS motifs identified by either of the 2 methods. Common findings revealed by the 2 methods include the prominent role played by the NF- κB/Rel transcription factor family. Among the transcription factors specifically extracted by RLQ, the forkhead family of transcription factors (FREAC2, FREAC3, FREAC4), regulating the expression of genes involved in cell proliferation, is activated in intermediate time points. Other TFBS specifically highlighted by RLQ are the transcription factors CREB, cMyb and CRE-P1 which are known to interact in the mechanisms of action of glucocorticoids. It is worth noting that the 2 transcription factors validated by Western blotting could not be identified using enrichment analysis.
The advantage of using RLQ analysis over EA is that the interpretation based on a single biplot is direct. The biplot representation shows associations between the time course effect of MF and the activation of transcription factors. Using EA, one has to identify the most enriched TFBS, then determine the genes that carry these TFBS (as well as the nature of the dysregulation), and finally interpret the role of transcription factors in the experiment. Discrepancies between the two approaches are partly due to the univariate nature of EA.When using RLQ analysis, functional annotation terms are treated in a multivariate fashion. This ensures that the presence of possible interactions and co-variations is accounted for in the analysis.
Other ordination methods
The following 3 ordination methods were compared to RLQ:
Correspondence analysis with projection of supplementary information on both rows and columns
Between-group coinertia analysis (BG-COI)
Correspondence analysis (Figure 5, upper left panel), due to its unsupervised nature, does not allow to depict appropriately the time course effect of MF. The time gradient is only supported by the second axis and the expected early inhibition of key transcription factors such as NF- κB combined with AP-1 and c-Rel is not immediately accessible in contrast to the results obtained by RLQ (Figure 5, lower right panel). When comparing the 50 most contributing TFBS identified by CA and RLQ, an overlap of 26 TFBS (52%) was found.
Between-group coinertia analysis (BG-COI) only differs from RLQ in the way tables R and L are linked together (in a symmetrical way for RLQ, and in an asymmetrical way for BG-COI). The results of BG-COI shown in Figure 5 (upper right panel) are mostly congruent with the ones obtained by RLQ. The biological findings derived by both methods are comparable. Among the 50 most contributing TFBSs identified by BG-COI and RLQ, an overlap of 40 TFBSs (80%) was found. Mathematically, RLQ provides a more general solution to the 3-table problem with a stronger theoretical background.
L-PLSR differs from RLQ in the sense that the central table is treated with a double centered PCA, whereas in RLQ the main table is treated by CA. In practice, due to the double-centering performed in L-PLSR, both methods provided almost identical results apart from a scaling factor (Figure 5, lower left panel). All of the 50 most contributing TFBS identified by L-PLSR match the ones identified by RLQ. However, to our knowledge, no implementation of L-PLSR adapted to genomics data is currently available.
Overall, the comparison of RLQ with other ordination methods stress the benefit of actively integrating external annotations on rows and columns using supervised modeling. Important data structures proved to be more challenging to interpret using unsupervised approaches. Both BG-COI and L-PLSR provide solutions which are in agreement with RLQ analysis. RLQ includes key features for life scientists, such as an available implementation of the method itself, together with a series of graphical tools, as well as various permutation procedures (Monte-Carlo permutation test, fourth-corner statistic), and several additional advanced procedures (e.g. between/within-class RLQ).
To date, several approaches have been proposed to interpret gene expression microarray data using external information. The classical approach identifies a list of genes of interest, then interprets these genes in a second step using tools of functional annotations. CA is a powerful method to describe sources of variation present in a microarray data set. Using a biplot representation, it is easy to simultaneously visualize the ordination of samples and to identify genes that are responsible for this ordination. It may be useful to include gene annotations directly in the frame of correspondence analysis. External information can be inserted by simple projection into the dataset in an unsupervised fashion as demonstrated by various authors [11, 12]. Although this procedure brings insights which are helpful for the interpretation of the data, in most existing studies this additional information is only indirectly involved in the modeling of the data. In our example, we showed that this unsupervised approach was only partially effective, the interpretation of the data in the light of transcription factor activity remaining challenging.
Supervised counterparts of correspondence analysis were proposed in the literature [13, 14]. Jeffery and collaborators  also described a method (BG-COI) integrating the information of TFBSs by combining in a 2-table scheme the gene expression data set (primarily analyzed by supervised non-symmetric correspondence analysis) with a TFBS occurrence table (primarily analyzed by principal component analysis). The main difference with the RLQ procedure is that in BG-COI, the first two matrices (R and L) are combined using a non-symmetrical procedure (between-class correspondence analysis), whereas the third table Q (TFBS occurrence) is integrated using a symmetrical procedure (co-inertia analysis). The initial asymmetric procedure necessitates a regression step implying dimensionality constraints in the number of experimental variables which can be integrated in the analysis . In simple cases, e.g. when only one single categorical variable is used, this does not constitute a limitation. Accordingly, when applying BG-COI to the MF dataset, the results were comparable to the one obtained by RLQ with only minor discrepancies. However, in some other cases, for example when using a larger set of experimental variables, asymmetric procedures will show limitations.
RLQ generalizes the concept of symmetric analysis, an extension of the co-inertia principle for 3-tables, where each of the three tables plays independently a symmetric role. As highlighted by Dray and collaborators , the simultaneous analysis of this 3-table scheme is a more coherent solution. The advantage of RLQ over other methods such as between-class co-inertia analysis is that it provides a more consistent analysis framework with a stronger mathematical background. Other asymmetric approaches have been proposed in psychometrics such as the double constrained correspondence analysis where the correspondence analysis of the central table L is constrained by both information on observations (table R) and variables (table Q). This double constrained procedure introduces 2 dimensionality constraints. Each regression step may lead to numerical instability that can generate poor predictive power. The solution proposed in RLQ analysis is more satisfying because it has no dimensionality restrictions, providing greater numerical stability particularly when the number of variables taken into account is large . In RLQ, the central table is treated by correspondence analysis, whereas the 2 additional matrices can be analyzed using different schemes depending on the nature of the descriptive variables. The chemometrics L-PLSR procedure described by Martens and collaborators  follows the same idea with the only noticeable difference that the central table is treated by principal component analysis (after initial double-centering transformation), and not correspondence analysis. As previously shown, L-PLSR and RLQ are practically comparable and show very minor differences. Considering the different options associated with the analysis of the 2 marginal tables, and the choice in the scheme of the central table, this further extends the theoretical framework of RLQ.
In the scope of the current work, we described three applications of RLQ for the interpretation of high-throughput gene expression data, using TFBS, GO and KEGG pathways annotations as input of the Q matrix. Other variations around the same principle include the integration of literature co-occurrence information (e.g. by extracting information from pubmed), the presence/absence of microRNA targets , as well as more direct information such the chromosomal location, tissue expression patterns, etc.
It is worth noting that the transcription factors activity as assessed by RLQ analysis depends on several assumptions made for the identification of the putative TFBSs. This includes the choice of the TFBS motif database, the length of the promoter region, the threshold chosen to define the likelihood of the actual presence of TFBSs, etc. The current method takes into account the fact that several transcription factors can bind to a promoter and interact. However, more complex phenomena influencing the action of transcription factors, such as folding of the DNA promoter region , are not modeled in the current approach.
RLQ analysis is essentially exploratory. However, several testing procedures were specifically proposed in the framework of RLQ [19, 23]. Using permutational models, the link between experimental variables and gene annotations can be tested. These inferential techniques provide an immediate overview of the nature (positive/negative) and significance of the relationship between experimental variables and gene annotations.
RLQ analysis is a new approach to extract and visualize structures in a microarray dataset by combining external information on both columns (experimental variables) and rows (gene annotations). Biplot representations provide a unique all-integrated picture of the results, which allows us to directly relate experimental variables to gene annotations.
This approach was successfully used to describe the transcription factor activity associated with the action of the glucocorticoid MF furoate on the growth of human fibroblasts. In an integrated manner, RLQ analysis unveiled distinct mechanisms of action of glucocorticoids, in agreement with prior existing knowledge from the literature. The nuclear expression levels of OCT-1 and CREB confirmed the transcription factor activity predicted by using RQL analyses, and provide a direct molecular biological validation of the method.
The set of R functions proposed in the frame of this work further facilitates the use of RLQ analysis with regards to transcriptomics data in the light of GO, KEGG and TFBS information, as exemplified in this study.
Further work is needed to explore the performance of RLQ in specific cases, such as datasets comprising a larger set of experimental variables, or variables of heterogeneous nature (e.g. mixture of quantitative and qualitative variables). Computationally, the current implementation of RLQ allows to analyze efficiently standard gene expression datasets. However, further optimizations might be needed in order to deal with even larger highly multivariate datasets such as the ones generated by whole genome exon arrays.
Bellazzi R, Zupan B: Towards knowledge-based gene expression data mining. J Biomed Inform. 2007, 40: 787-802. 10.1016/j.jbi.2007.06.005.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010, 38: D355—D360-
Wingender E, Dietze P, Karas H, Knuppel R: TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res. 1996, 24: 238-241. 10.1093/nar/24.1.238.
Bryne JC, Valen E, Tang MH, Marstrand T, Winther O, da Piedade I, Krogh A, Lenhard B, Sandelin A: JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res. 2008, 36: D102—D106-
Mercier E, Droit A, Li L, Robertson G, Zhang X, Gottardo R: An integrated pipeline for the genome-wide analysis of transcription factor binding sites from ChIP-Seq. PLoS One. 2011, 6 (2): e16432-10.1371/journal.pone.0016432.
Zambelli F, Pesole G, Pavesi G: Pscan: finding over-represented transcription factor binding site motifs in sequences from co-regulated or co-expressed genes. Nucleic Acids Res. 2009, 37: W247—W252-
Kishino H, Waddell PJ: Correspondence analysis of genes and tissue types and finding genetic links from microarray data. Genome Inform Ser Workshop Genome Inform. 2000, 11: 83-95.
Fellenberg K, Hauser NC, Brors B, Neutzner A, Hoheisel JD, Vingron M: Correspondence analysis applied to microarray data. Proc Natl Acad Sci USA. 2001, 98: 10781-10786. 10.1073/pnas.181597298.
Yano K, Imai K, Shimizu A, Hanashita T: A new method for gene discovery in large-scale microarray data. Nucleic Acids Res. 2006, 34: 1532-1539. 10.1093/nar/gkl058.
Busold CH, Winter S, Hauser N, Bauer A, Dippon J, Hoheisel JD, Fellenberg K: Integration of GO annotations in Correspondence Analysis: facilitating the interpretation of microarray data. Bioinformatics. 2005, 21: 2424-2429. 10.1093/bioinformatics/bti367.
Fellenberg K, Busold CH, Witt O, Bauer A, Beckmann B, Hauser NC, Frohme M, Winter S, Dippon J, Hoheisel JD: Systematic interpretation of microarray data using experiment annotations. BMC Genomics. 2006, 7: 319-10.1186/1471-2164-7-319.
Culhane AC, Perrière G, Considine EC, Cotter TG, Higgins DG: Between-group analysis of microarray data. Bioinformatics. 2002, 18: 1600-1608. 10.1093/bioinformatics/18.12.1600.
Baty F, Facompré M, Wiegand J, Schwager J, Brutsche MH: Analysis with respect to instrumental variables for the exploration of microarray data structures. BMC Bioinformatics. 2006, 7: 422-10.1186/1471-2105-7-422.
Jeffery IB, Madden SF, McGettigan PA, Perrière G, Culhane AC, Higgins DG: Integrating transcription factor binding site information with gene expression datasets. Bioinformatics. 2007, 23: 298-305. 10.1093/bioinformatics/btl597.
Dray S, Chessel D, Thioulouse J: Co-inertia analysis and the linking of ecological tables. Ecology. 2003, 84: 3078-3089. 10.1890/03-0178.
Culhane AC, Perrière G, Higgins DG: Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics. 2003, 4: 59-10.1186/1471-2105-4-59.
Doledec S, Chessel D, Ter Braak CJF, Champely S: Matching species trait to environmental variables: a new three-table ordination method. Environ Ecol Stat. 1996, 3: 143-166. 10.1007/BF02427859.
Dray S, Legendre P: Testing the species traits-environment relationships: the fourth-corner problem revisited. Ecology. 2008, 84: 3078-3089.
Martens H, Anderssen E, Flatberg A, Halvor Gidskehaug L, Høy M, Westad F, Thybo A, Martens M: Regression of a data matrix on descriptors of both its rows and of its columns via latent variables: L-PLSR. Comp Stat Data An. 2005, 48: 103-123. 10.1016/j.csda.2003.10.004.
Takane Y, Shibayama T: Principal component analysis with external information on both subjects and variables. Psychometrika. 1991, 1: 97-120.
Esposito Vinzi V, Guinot C, Squillacciotti S: Two- step PLS regression and L-structured data: an application in cosmetic industry. Stat Meth Appl. 2007, 16: 263-278. 10.1007/s10260-006-0028-2.
Ter Braak CJ, Cormont A, Dray S: Improved testing of species traits-environment relationships in the fourth-corner problem. Ecology. 2012, 93 (7): 1525-1526. 10.1890/12-0126.1.
Rüdiger JJ, Roth M, Bihl MP, Cornelius BC, Johnson M, Ziesche R, Block LH: Interaction of C/EBPalpha and the glucocorticoid receptor in vivo and in nontransformed human cells. FASEB J. 2002, 16 (2): 177-184. 10.1096/fj.01-0226com.
Dray S, Dufour AB, Chessel D: The ade4 package - II: Two-table and K-table methods. R News. 2007, 7 (2): 47-52.
Gentleman RC, Carey VJ, Bates DM, et al: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80-10.1186/gb-2004-5-10-r80. [http://genomebiology.com/2004/5/10/R80]
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4 (2): 249-264. 10.1093/biostatistics/4.2.249.
Newton R: Molecular mechanisms of glucocorticoid action: what is important?. Thorax. 2000, 55 (7): 603-613. 10.1136/thorax.55.7.603.
Hart L, Lim S, Adcock I, Barnes PJ, Chung KF: Effects of inhaled corticosteroid therapy on expression and DNA-binding activity of nuclear factor kappaB in asthma. Am J Respir Crit Care Med. 2000, 161: 224-231. 10.1164/ajrccm.161.1.9809019.
De Bosscher K, Vanden Berghe W, Haegeman G: The interplay between the glucocorticoid receptor and nuclear factor-kappaB or activator protein-1: molecular mechanisms for gene repression. Endocr. Rev. 2003, 24 (4): 488-522. 10.1210/er.2002-0006.
Barnes PJ, Adcock I: Anti-inflammatory actions of steroids: molecular mechanisms. Trends Pharmacol Sci. 1993, 14 (12): 436-441. 10.1016/0165-6147(93)90184-L.
Austin RJ, Maschera B, Walker A, Fairbairn L, Meldrum E, Farrow SN, Uings IJ: Mometasone furoate is a less specific glucocorticoid than fluticasone propionate. Eur Respir J. 2002, 20 (6): 1386-1392. 10.1183/09031936.02.02472001.
Prefontaine GG, Lemieux ME, Giffin W, Schild-Poulter C, Pope L, LaCasse E, Walker P, Hache RJ: Recruitment of octamer transcription factors to DNA by glucocorticoid receptor. Mol Cell Biol. 1998, 18 (6): 3416-3430.
Geng CD, Vedeckis WV: Steroid-responsive sequences in the human glucocorticoid receptor gene 1A promoter. Mol Endocrinol. 2004, 18 (4): 912-924. 10.1210/me.2003-0157.
Thioulouse J, Simier M, Chessel D: Simultaneous analysis of a sequence of paired ecological tables. Ecology. 2004, 85: 272-283. 10.1890/02-0605.
Böckenholt U, Böckenholt I: Canonical analysis on contingency tables with linear constraints. Psychometrika. 1990, 55: 633-639. 10.1007/BF02294612.
Madden SF, Carpenter SB, Jeffery IB, Bjorkbacka H, Fitzgerald KA, O’Neill LA, Higgins DG: Detecting microRNA activity from gene expression data. BMC Bioinformatics. 2010, 11: 257-10.1186/1471-2105-11-257.
Horn PJ, Peterson CL: Molecular biology. Chromatin higher order folding-wrapping up transcription. Science. 2002, 297 (5588): 1824-1827. 10.1126/science.1074200.
The authors are indebted to Drs. A. Franzini, S. Haile and Ms. K. Kozu for their support and most appreciated editing of the manuscript.
This work was supported by institutional research funding and also received support from the Swiss National Science Foundation (project number: 310030-133109).
The authors declare that they have no competing interests.
FB wrote the manuscript, carried out the analysis and implemented the methodology. LK reviewed the manuscript and provided expertise on the clinical aspects. MB designed the experiments, and supervised all research experiments. JR performed the cell culture experiment and RNA extraction for gene expression analysis. PB, NM performed the Western analysis and validated the biological relevance of the findings. All authors read and approved the final manuscript.
Electronic supplementary material
About this article
Cite this article
Baty, F., Rüdiger, J., Miglino, N. et al. Exploring the transcription factor activity in high-throughput gene expression data using RLQ analysis. BMC Bioinformatics 14, 178 (2013). https://doi.org/10.1186/1471-2105-14-178
- Gene Ontology
- Correspondence Analysis
- Transcription Factor Activity
- Transcription Factor Binding Site
- Ordination Method