Sorting out inherent features of head-to-head gene pairs by evolutionary conservation
© Li et al. 2010
Published: 14 December 2010
A ‘head-to-head’ (h2h) gene pair is defined as a genomic locus in which two adjacent genes are divergently transcribed from opposite strands of DNA. In our previous work, this gene organization was found to be ancient and conserved, which subjects functionally related genes to transcriptional co-regulation. However, some of the biological features of h2h pairs still need further clarification.
In this work, we sorted human h2h pairs into four sequentially inclusive sets of increasing conservation, and examined whether previously asserted features were conserved or sharpened in the more conserved sets, in order to identify the inherent features of the h2h gene organization. We examined TSS distance, expression correlation within h2h pairs and among h2h genes, transcription factor association, and functional similarity of h2h genes. Our conservation-based analyses found that the bi-directional promoters of h2h gene pairs are most likely shorter than 100 bp; that h2h gene pairs generally show significant positive, but not negative, expression correlation; that remarkably high positive expression correlations exist among h2h genes, in addition to those between h2h pairs observed in our previous study; and that h2h paired genes tend to share transcription factors. In addition, expression correlation of h2h pairs is positively related to TF sharing and functional coordination, but not related to TSS distance.
Our findings resolve uncertainties about the TSS distance, expression correlation and functional coordination of h2h genes, and provide insights for studying the molecular mechanisms and functional consequences of transcriptional regulation based on this special gene organization.
A ‘head-to-head’ (h2h) gene pair is defined as a genomic locus in which two adjacent genes are divergently transcribed from opposite strands of DNA, and the region between the two transcription start sites (TSSs), commonly shorter than 1000 bp, is termed the ‘bi-directional promoter’ [1, 2]. H2h gene pairs have been found to be a unique gene arrangement in vertebrates, particularly in the human genome [2, 3]. Recent studies have characterized the sequence features of bi-directional promoters [4, 5], explored the co-regulation pattern among h2h gene pairs, and investigated their functional relevance, for example to tumorigenesis [6, 7]. Taken together, these findings seem to echo a preliminary conclusion we made in 2006: “the head-to-head gene organization is ancient and conserved, which subjects functionally related genes to correlated transcriptional regulation and thus provides an exquisite mechanism of transcriptional regulation based on gene organization.” However, some doubt and uncertainty about specific features of h2h gene pairs remain to be resolved by closer investigation. For instance, we observed in our previous study that pairs with TSSs separated by 1 to 400 bp formed the peak columns in the TSS distance distribution, and we anticipated a compression of these columns into a narrower, sharper region. Although we did witness a significant increase of rat h2h pairs in the 1 to 400 bp TSS distance group during a three-year update, we still could not determine the optimal length of a bi-directional promoter. As another example, we observed positive, negative, and alternative expression correlation between h2h paired genes, but negative correlation was not confirmed by peer studies [2, 4], and a novel view emerged that significant expression correlation may exist among h2h genes (not necessarily within pairs).
Other aspects of h2h gene pairs, such as their transcriptional regulation and function coordination, are still ambiguous to some extent.
In the present study, we re-examined previously asserted features of h2h gene pairs, aiming to remove these uncertainties and identify the inherent features of this gene arrangement. Based on the commonly accepted principle that evolutionarily conserved features are associated with biological significance, we reasoned that the more conserved head-to-head gene pairs, being of greater biological importance, are more likely to represent the inherent features of h2h gene pairs. Therefore, we sorted human h2h pairs into four sets of increasing conservation in vertebrates, and identified the inherent features of vertebrate h2h gene pairs by comparing the four sets on a series of points. We comprehensively analyzed h2h pair features including TSS distance, the nature of expression correlation, transcription factor association, and functional coordination, and judged each feature according to its evolutionary conservation. This study provides useful clues for studying the mechanism of transcriptional regulation in the h2h gene organization.
According to DBH2H (http://lifecenter.sgst.cn/h2h/), we determined human, chicken, and fugu h2h gene pairs and the TSS distance of each pair. Expression correlation data were downloaded from two sources: DBH2H (http://lifecenter.sgst.cn/h2h/) and COXPRESdb (http://coxpresdb.jp/).
Transcription factor association of h2h gene pairs was enabled by the Integrated Transcription Factor Platform (ITFP, http://itfp.biosino.org/itfp/), which maintains both experimentally verified TFs and in-silico predicted TFs.
Annotation of Gene Ontology (http://www.geneontology.org) terms of h2h genes was aided by Bioconductor packages org.Hs.eg.db 2.3.6 and GO.db 2.3.5.
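The h2h pairs in this study were taken from DBH2H, but the underlying definition (two adjacent genes transcribed divergently from opposite strands, with TSSs commonly less than 1000 bp apart) can be illustrated with a minimal sketch. The `Gene` record, the function name `find_h2h_pairs`, and the 1000 bp default are assumptions made for illustration only; this is not the DBH2H pipeline, and it deliberately ignores overlapping pairs.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    # Hypothetical minimal gene record used only for this sketch.
    name: str
    chrom: str
    strand: str  # "+" or "-"
    tss: int     # transcription start site coordinate


def find_h2h_pairs(genes, max_tss_distance=1000):
    """Return (minus_gene, plus_gene, tss_distance) for adjacent divergent pairs.

    Genes are ordered by TSS on each chromosome; a neighbouring pair counts
    as head-to-head when the left gene is on the minus strand and the right
    gene is on the plus strand, so the two are transcribed away from each other.
    """
    by_chrom = {}
    for gene in genes:
        by_chrom.setdefault(gene.chrom, []).append(gene)

    pairs = []
    for chrom_genes in by_chrom.values():
        ordered = sorted(chrom_genes, key=lambda g: g.tss)
        for left, right in zip(ordered, ordered[1:]):
            if left.strand == "-" and right.strand == "+":
                distance = right.tss - left.tss
                # Non-overlapping pairs only: distance must be positive.
                if 0 < distance <= max_tss_distance:
                    pairs.append((left.name, right.name, distance))
    return pairs
```

A real analysis would also need to handle overlapping pairs and alternative TSSs, which this sketch omits.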
From DBH2H, we obtained Pearson and Spearman expression correlation data for human h2h gene pairs on 43 public datasets; from COXPRESdb, we obtained the Pearson expression correlation value, as well as a relative correlation index, MR (Mutual Rank), for every possible pair among 19777 human genes. COXPRESdb data were calculated from gene expression profiles across 3749 human samples.
Specifically, MR is defined as the geometric mean of the reciprocal relative expression correlation ranks with respect to the two genes of a pair: MR(A,B)=sqrt(Rank(A->B)×Rank(B->A)), where A and B stand for the two genes. Additionally, we calculated another relative expression correlation index, RR (Relative Rank), defined as RR(A,B)=min(Rank(A->B),Rank(B->A)). Wherever a single expression correlation value was used to summarize an h2h pair set, we averaged all COXPRESdb values of the set. A total of 1447000 (1447*1000) random gene pairs and 5252 same-strand adjacent pairs involving 2835 h2h genes were used as controls. Their expression correlation values were also taken from the COXPRESdb data.
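The two rank-based indices can be sketched as follows. The nested-dictionary layout of `corr` and the function names are assumptions for illustration; here Rank(A->B) is taken to be the position of B when all other genes are ordered by decreasing correlation with A, and ties are broken by input order rather than in any principled way.

```python
import math


def rank_of(corr, a, b):
    """Rank of gene b among all other genes, ordered by decreasing
    correlation with gene a (1 = most strongly correlated)."""
    others = [g for g in corr[a] if g != a]
    ordered = sorted(others, key=lambda g: corr[a][g], reverse=True)
    return ordered.index(b) + 1


def mutual_rank(corr, a, b):
    # Geometric mean of the two reciprocal ranks.
    return math.sqrt(rank_of(corr, a, b) * rank_of(corr, b, a))


def relative_rank(corr, a, b):
    # The better (smaller) of the two reciprocal ranks.
    return min(rank_of(corr, a, b), rank_of(corr, b, a))


# Toy symmetric correlation table with hypothetical genes A-D.
corr = {
    "A": {"B": 0.90, "C": 0.50, "D": 0.10},
    "B": {"A": 0.90, "C": 0.95, "D": 0.92},
    "C": {"A": 0.50, "B": 0.95, "D": 0.30},
    "D": {"A": 0.10, "B": 0.92, "C": 0.30},
}
```

In this toy table B is the top partner of A, but A is only the third partner of B, so MR penalizes the asymmetry while RR does not.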
With DBH2H expression correlation data, we identified for each h2h pair the significant correlations, i.e. those with p-values lower than 0.05. As the significant correlations could be positive or negative, we obtained three counts: SP, SN, and SP+SN. Dividing each of the three counts by the number of investigated datasets, we obtained the SPR (Significant Positive Ratio), SNR (Significant Negative Ratio), and SR (Significant Ratio), representing the proportion of datasets with significant positive correlation, significant negative correlation, and significant correlation of either sign for an h2h pair, respectively. Note that SPR+SNR=SR. When different sets of h2h pairs were compared in terms of expression correlation level, we reported the average SPR, SNR, or SR of each set.
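As a minimal sketch of these ratios, assume each h2h pair comes with one (correlation, p-value) tuple per dataset; the input layout and the function name are illustrative assumptions, not the format used by DBH2H.

```python
def significance_ratios(results, alpha=0.05):
    """Compute SPR, SNR and SR for one gene pair.

    results: list of (correlation, p_value) tuples, one per dataset.
    """
    n = len(results)
    if n == 0:
        raise ValueError("at least one dataset is required")
    # Count significant correlations separately by sign.
    sp = sum(1 for r, p in results if p < alpha and r > 0)
    sn = sum(1 for r, p in results if p < alpha and r < 0)
    return {"SPR": sp / n, "SNR": sn / n, "SR": (sp + sn) / n}


def mean_ratio(per_pair_ratios, key):
    """Average one ratio (e.g. "SPR") over all pairs of a set."""
    return sum(r[key] for r in per_pair_ratios) / len(per_pair_ratios)
```

For example, a pair measured on four datasets with two significant positive and one significant negative correlation has SPR = 0.5, SNR = 0.25 and SR = 0.75.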
The calculations of functional similarity were performed using the GOSim package, version 188.8.131.52 (http://cran.r-project.org/web/packages/GOSim/index.html) in the R environment (http://www.r-project.org/). We also calculated the functional similarity of random pair sets of the same size as the set of annotated h2h gene pairs, repeating the sampling 100 times.
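The random-pair baseline can be sketched in a few lines. Note that the similarity measure below is a plain Jaccard overlap of GO term sets, swapped in purely as a stand-in: it is not the GOSim measure used in the study, and the function names, the fixed seed, and the annotation layout are all assumptions.

```python
import random


def jaccard(terms_a, terms_b):
    """Stand-in similarity: overlap of two GO term sets (NOT the GOSim measure)."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def mean_pair_similarity(pairs, annotations, similarity=jaccard):
    """Average similarity over a set of gene pairs."""
    values = [similarity(annotations[a], annotations[b]) for a, b in pairs]
    return sum(values) / len(values)


def random_baseline(annotations, n_pairs, iterations=100, seed=0,
                    similarity=jaccard):
    """Mean similarity of `iterations` random pair sets of size `n_pairs`."""
    rng = random.Random(seed)
    genes = sorted(annotations)
    means = []
    for _ in range(iterations):
        # Each random pair consists of two distinct annotated genes.
        pairs = [tuple(rng.sample(genes, 2)) for _ in range(n_pairs)]
        means.append(mean_pair_similarity(pairs, annotations, similarity))
    return means
```

The observed mean for the annotated h2h pairs would then be compared against the distribution of the 100 baseline means.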
We studied head-to-head gene organization in vertebrates by selecting the Fugu rubripes, Gallus gallus, Mus musculus, and Homo sapiens genomes to represent the vertebrate phylogeny. Fugu has the shortest known genome (~365 Mb) of any vertebrate species, around one eighth of the size of the human genome, and therefore roughly represents the start-point of the vertebrate phylogeny. The chicken has a genome of 1.2 Gb, approximately 40% of the size of the human genome, and is the premier non-mammalian vertebrate model organism. Mouse and human are two of the most well-studied mammals, and, in contrast to fugu, they approximately represent the end-point of the vertebrate phylogeny. Based on data downloaded from DBH2H, 1447 human h2h gene pairs were sorted into four sequentially inclusive sets: set H, including all 1447 human pairs; set HM, including 191 pairs conserved between human and mouse; set HMC, including 77 pairs conserved across human, mouse and chicken; and set HMCF, including the 14 pairs conserved across human, mouse, chicken and fugu. The four sets of human h2h pairs with gradually increasing conservation levels were compared in terms of genomic TSS distance, expression correlation, transcription factor association, and functional similarity. In each analysis, we first compared the feature in the largest set H with that in a randomly sampled gene pair set or in a set of ‘adjacent’ gene pairs composed of h2h genes and their adjacent genes. If a statistically significant difference between set H and the random set (or the adjacent set) was observed, we then compared the feature among the four h2h pair sets, and relied on two-group t-tests or Wilcoxon rank-sum tests to decide whether there was a statistically significant difference between the conservation levels.
If a feature was validated in both stages of statistical tests, we declared it an inherent feature of the h2h gene organization; if a feature was not validated by either stage, or if it showed a contrary trend in the conservation-based test, we tentatively negated it. If a feature differed significantly between set H and the random set (or the adjacent set), but did not display significant differences in a consistent direction between the conservation levels, we postponed the related declaration to future studies, where expanded data will hopefully lead to an unambiguous conclusion.
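The two-stage decision rule described above can be written down compactly. The function names, the three trend labels, and the verdict strings are illustrative assumptions; the statistical tests themselves are taken as already performed and summarized in the two arguments.

```python
def is_sequentially_inclusive(sets):
    """True when each set is contained in the previous one (H, HM, HMC, HMCF)."""
    return all(smaller <= larger for larger, smaller in zip(sets, sets[1:]))


def classify_feature(differs_from_control, conservation_trend):
    """Apply the two-stage rule to one feature.

    differs_from_control: stage 1 outcome, True when set H differs
        significantly from the random (or adjacent) pair set.
    conservation_trend: stage 2 outcome, one of
        "consistent" (significant, in a consistent direction),
        "none"       (no significant consistent difference),
        "contrary"   (trend opposite to the asserted feature).
    """
    if conservation_trend not in ("consistent", "none", "contrary"):
        raise ValueError("unknown conservation trend")
    # Stage 2 is only meaningful when stage 1 has passed.
    if not differs_from_control or conservation_trend == "contrary":
        return "negated"
    if conservation_trend == "consistent":
        return "inherent"
    return "postponed"
```

Under this reading, a feature that passes stage 1 but shows no consistent trend across conservation levels is neither accepted nor rejected.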
Table: Percentages of h2h pairs within particular TSS distance intervals. Columns: (0, 100) bp; (100, 200) bp; (0, 400) bp.
Considering also that the core promoter, or the minimal portion of the promoter required to properly initiate transcription, is confined to the 100 bp region upstream of a TSS, we have increased confidence that h2h pairs with TSSs separated by 1-100 bp most likely have a functional bi-directional promoter, which has biological relevance to the co-regulation of the two genes. As we witnessed a compression of TSS distances of rat h2h pairs between two batches of analyses [3, 9], we anticipated an impending replacement of the then peak column (100, 200] by (0, 100] in future data updates.
We also related TSS distance to the expression correlation of h2h paired genes, but found no significant relationship between them, whether in set H or in the more conserved sets HM, HMC and HMCF. Even when we studied overlapping and non-overlapping h2h pairs separately, we still did not detect any correlation between TSS distance and expression correlation. Hence, we maintain our postulation that a bi-directional promoter tends to coordinately regulate the transcription of h2h paired genes in a manner unrelated to TSS distance.
Expression correlation within h2h pairs in DBH2H
Furthermore, we examined whether negative correlation is an inherent feature of h2h gene pairs. We first noticed that, in COXPRESdb, set H had a smaller fraction of gene pairs with negative expression correlation than the random pair set and the adjacent pair set (chi-squared test, p<0.01), and the fractions in sets HM, HMC and HMCF were even smaller (0.02 in HM, 0 in both HMC and HMCF). Additionally, the average correlation values of each h2h pair were examined separately for positive and negative correlation according to DBH2H. Interestingly, we observed a steady increase in positive correlation across the four h2h sets, but no similar trend in negative correlation. Moreover, we discerned a remarkable preponderance of positive over negative correlation, as the Significant Ratios (SRs) were mostly contributed by Significant Positive Ratios (SPRs) (Table 2). The average Significant Negative Ratio (SNR) of h2h pairs, at any conservation level, was lower than 10%, and it even decreased slightly from set H to set HMCF (Table 2). A more typical decreasing trend was found in the average proportion of datasets showing negative correlation (data not shown). This indicated that negative correlation is quite likely not an inherent feature of the h2h gene arrangement, in accordance with a previous claim that there was no evidence for negative expression correlation of a significant number of gene pairs.
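A comparison of two fractions like the one above can be sketched with a Pearson chi-squared test on a 2x2 table (pair set by sign of correlation). The function name is an assumption, no continuity correction is applied, and any counts passed in are the caller's own; none of the study's actual counts are reproduced here. With one degree of freedom the p-value follows directly from the complementary error function.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (1 degree of freedom, no continuity
    correction) for the contingency table [[a, b], [c, d]].

    For example: a, b = negative / non-negative pairs in one set,
                 c, d = negative / non-negative pairs in the control set.
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

For small expected counts, such as the zero fractions reported for the most conserved sets, an exact test would be more appropriate than this approximation.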
In summary, our conservation-based analyses validated the significant positive coexpression tendency within and between h2h gene pairs, but negated the universal existence of negative expression correlation of h2h pairs. The intra-pair expression correlation level seems higher than the inter-pair one. A further study on the roles of h2h genes in coexpression networks is still going on.
Despite the consensus that h2h gene pairs are often co-transcribed, the transcriptional regulation mechanisms of h2h gene pairs remain unclear. Lin et al. addressed this issue by distinguishing over-represented and under-represented transcription factor binding sites (TFBSs) in bi-directional promoters. We wanted to complement their work by focusing on the transcription factors (TFs) that potentially regulate h2h genes.
We tried associating TFs with human h2h genes (within set H) based on the ITFP database, which combines experimental and computational evidence, and on the experiment-based TRANSFAC database. Through ITFP, we determined 207 ‘TF-associated h2h gene pairs’ in which both paired genes were associated with TFs; this number was far larger than that obtained through TRANSFAC. By adopting ITFP, therefore, we achieved a reasonable trade-off between data size and credibility.
Table: TF-association of h2h pairs. Column: Proportion of TF-sharing pairs in annotated pairs.
Table: Seven h2h gene pairs in which one gene regulates the other. Columns: Pearson Correlation Coefficient (PCC); Mutual Rank (MR). Pairs:
CSTF1 -> AURKA
DTX3L -> PARP9
WDSOF1 -> SLC25A32
MCM4 -> PRKDC
RECQL -> GOLT1B
NUFIP1 -> KIAA1704
POLR3K -> C16orf33
According to our results, h2h paired genes tend to share TFs, and the TF sharing degree is positively correlated with expression correlation. Sharing regulators seems to be a universal characteristic of h2h gene pairs which partially explains the significant positive expression correlation between h2h paired genes.
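The proportion of TF-sharing pairs among annotated pairs can be sketched as a set intersection. The mapping `gene_to_tfs` and the function name are hypothetical; a pair counts as annotated only when both genes have at least one associated TF, and as TF-sharing when the two TF sets overlap.

```python
def tf_sharing_proportion(pairs, gene_to_tfs):
    """Fraction of annotated gene pairs whose two genes share at least one TF.

    pairs: iterable of (gene_a, gene_b).
    gene_to_tfs: dict mapping a gene name to the set of its associated TFs.
    Returns (n_sharing, n_annotated, proportion).
    """
    # Keep only pairs in which both genes have at least one associated TF.
    annotated = [
        (a, b) for a, b in pairs
        if gene_to_tfs.get(a) and gene_to_tfs.get(b)
    ]
    if not annotated:
        raise ValueError("no pair has TF annotation for both genes")
    sharing = sum(1 for a, b in annotated if gene_to_tfs[a] & gene_to_tfs[b])
    return sharing, len(annotated), sharing / len(annotated)
```

The same function applied to random or adjacent pair sets would give the control proportions against which the h2h value is compared.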
Table: Functional similarities of h2h gene pairs. Recoverable entries: Adjacent gene pairs; 0.29 (1603) a; 0.34 b (1859); 0.51 c (508).
Average functional similarity of h2h gene pairs delimited by expression correlation thresholds
Taking the above two points together, there seems to be a functional similarity between h2h organized genes and a correlation between the functional coordination and the expression correlation. In all, through sharing bi-directional promoters, h2h gene pairs tend to be coexpressed and their products tend to perform similar functions. As we previously proposed, similar to operons in bacteria, h2h gene arrangement is an economic and ingenious strategy in eukaryotes to achieve coordination between functionally related genes.
In this work, using recently accumulated genomic and expression data, we systematically re-examined the diverse features of head-to-head gene pairs previously proposed and verified the features inherent in the h2h gene arrangement based on evolutionary conservation. On the whole, most discoveries or hypotheses made in the previous work were confirmed: the functional bi-directional promoters of h2h gene pairs are most likely shorter than 100 bp; h2h paired genes show significantly high positive expression correlation; h2h paired genes are involved in related functions, and the functional similarity is positively correlated with gene pair expression correlation. However, negative expression correlation is probably not an inherent feature of h2h gene pairs. As an additional discovery, we found that the expression correlation among all h2h genes (not necessarily forming h2h pairs) is higher than the background level, indicating that h2h genes in aggregate may be subject to a shared regulatory program. We further demonstrated that h2h paired genes statistically tend to share common transcription factors, which in part explains the unusually high expression correlation among h2h genes.
Our present findings resolved the uncertainties on TSS distance, expression correlation nature, and functional coordination of h2h gene pairs, which may benefit future studies on the transcriptional regulation mechanism and the biological significance of h2h gene pairs.
We would like to thank Associate Prof. Chun Li from Vanderbilt University, U.S.A., for his constructive instructions. This work was supported by grants from Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences (2008KIP207), the National “973” Basic Research Program (2006CB0D1203, 2006CB0D1205), the National Natural Science Foundation of China (30770497, 31000380), and the National Key Technologies R&D Program (2007AA02Z331).
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 11, 2010: Proceedings of the 21st International Conference on Genome Informatics (GIW2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S11.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.