Analysis tools for the interplay between genome layout and regulation
© Bouyioukos et al. 2016
Published: 6 June 2016
Genome layout and gene regulation appear to be interdependent. Understanding this interdependence is key to exploring the dynamic nature of chromosome conformation and to engineering functional genomes. Evidence for non-random genome layout, defined as the relative positioning of either co-functional or co-regulated genes, stems from two main approaches. Firstly, the analysis of contiguous genome segments across species has highlighted the conservation of gene arrangement (synteny) along chromosomal regions. Secondly, the study of long-range interactions along a chromosome has emphasised regularities in the positioning of microbial genes that are co-regulated, co-expressed or evolutionarily correlated. While one-dimensional pattern analysis is a mature field, it is often powerless on biological datasets, which tend to be incomplete and partly incorrect. Moreover, there is a lack of comprehensive, user-friendly tools to systematically analyse, visualise, integrate and exploit regularities along genomes.
Here we present the Genome REgulatory and Architecture Tools SCAN (GREAT:SCAN) software for the systematic study of the interplay between genome layout and gene expression regulation. GREAT:SCAN is a collection of related and interconnected applications currently able to perform systematic analyses of genome regularities as well as to improve transcription factor binding sites (TFBS) and gene regulatory network predictions based on gene positional information.
We demonstrate the capabilities of these tools in two ways. First, we study the regular patterns of genome layout in the major regulons of the bacterium Escherichia coli. Second, we show how the tools improve TFBS prediction in microbes. Finally, using visualisations based on multivariate techniques, we highlight the interplay between position and sequence information for effective transcription regulation.
Advances in genomics, transcriptomics and genome structural biology have yielded significant insights into the interdependence between genome expression, genome layout and the three-dimensional (3D) chromosome conformation. Evidence for non-random genome layout, defined as the relative positioning of co-regulated or co-functional genes, stems from two main insights. First, the analysis of contiguous genome segments across species has highlighted synteny, that is, the conservation of gene order along chromosome regions. Second, studies of long-range regularities within chromosomes in eubacteria, archaea and yeast have emphasised the periodic positioning of genes that are co-regulated, co-expressed, or evolutionarily correlated [3–8], respectively. These studies have all proposed a non-random, periodic arrangement of genomic features (such as genes, operons and gene expression) as a common feature of compact genomes of all phyla of life. This periodic arrangement of genomic features confers certain 3D conformational advantages, which provide a potential mechanism for genome regulatory efficiency that has been favoured by evolution in genomes under selective pressure to remain small. Furthermore, in organisms with more complex genomes, the formation of loops, inter-chromosomal associations and transcription factories affects (and is affected by) the expression of genes [9–11], suggesting that active transcription might be a shaping force of genomes. A set of tools able to investigate genomic positional regularities in the context of genome expression regulation could, in combination with the wide availability of multi-omics data, provide bioscience researchers with novel and informative insights regarding genome organisation, regulation and function.
We developed GREAT:SCAN (Genome REgulatory Architecture Tools:SCAN), a collection of on-line software tools designed to perform systematic detection of regular patterns along genomes, integrate and interconnect results between available methods and provide informative visualisations. GREAT:SCAN extends two algorithms previously developed by our team for the detection of periodically arranged genes and the prediction of transcription factor binding sites (TFBS). It provides a web user interface which streamlines the usage of these algorithms, performs a fully automated analysis of regularities among genomic features, extends the analytical capabilities of the previous software with novel functionalities and reports results in human-readable (plots and graphs) as well as machine-readable (tables) formats. GREAT:SCAN is available in two versions: a) as an online application integrated in the computational framework of the GREAT portal on the servers of the abSYNTH platform (absynth.issb.genopole.fr/Bioinformatics/tools/GREAT); b) as a downloadable stand-alone command-line Docker image of each individual tool, to facilitate incorporation into pipelines.
Here, we introduce this new collection of tools called GREAT:SCAN, we describe their novel features and we demonstrate their use and analytical capabilities by a) calculating regularities on the regulons of the seven major transcription factors (TFs) in Escherichia coli; and b) predicting new target genes in the corresponding regulons by using data from two different sources: local TFBS sequence and global gene position along the genome.
Genome organisation influences fundamental biological processes such as transcription and replication and, reciprocally, through evolutionary pressure, those fundamental biological processes shape genome organisation [14, 15]. In prokaryotes, transcription and genome organisation are tightly coupled, with all major TFs playing a dual role as chromosome structural proteins and as transcriptional regulators. Furthermore, transcriptional activity, and therefore expression regulation, is spatially organised both in bacterial nucleoids and in eukaryotic nuclei [17, 18], showing regular spatial patterns. Ascertaining the interplay between genome organisation and transcription regulation will provide key insights into whole genome expression, nucleus/nucleoid organisation and genome architecture. Understanding and exploiting this interplay is an essential step towards rational automated whole-genome design and engineering.
The collection currently includes two tools: GREAT:SCAN:PATTERNS, a package for the systematic analysis of regular patterns on genomes, and GREAT:SCAN:PRECISION, a multi-view machine learning tool to predict novel TFBSs.
GREAT:SCAN:PATTERNS performs a complete analysis of periodic patterns along genomes. The analysis comprises three steps: 1) The systematic detection and visualisation of all possible periods from the genome positions of features of interest (such as co-regulated genes); 2) The clustering and visualisation of genomic features which are “in-phase” in the phase coordinates; 3) The mapping of any sub-region of the genome where a periodic pattern can be detected.
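The first two steps can be illustrated with a minimal sketch. It is not the scoring function of the published periodicity algorithm: as a stand-in, it uses the mean resultant length of the phases (a standard circular statistic) to rank candidate periods, and all function names are hypothetical.

```python
import math

def phases(positions, period):
    """Phase coordinate in [0, 1) of each feature for a candidate period."""
    return [(p % period) / period for p in positions]

def resultant_length(positions, period):
    """Mean resultant length of the phases: 1.0 means all features are in phase.

    Stand-in score for illustration only (assumption, not the published method).
    """
    angles = [2 * math.pi * ph for ph in phases(positions, period)]
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    return math.hypot(c, s)

def scan_periods(positions, candidate_periods):
    """Rank candidate periods from most to least phase-concentrated."""
    scored = [(resultant_length(positions, t), t) for t in candidate_periods]
    return sorted(scored, reverse=True)

# Toy data: ten features placed exactly every 1000 bp, offset by 100 bp.
toy_positions = [100 + 1000 * k for k in range(10)]
ranking = scan_periods(toy_positions, [700, 1000, 1300])
best_score, best_period = ranking[0]
```

On real data the positions are sparse and noisy, so a significance estimate (such as the p-values reported by the tool) is needed before a top-ranked period is interpreted.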
The third step introduces a novel capability of the periodicity detection algorithm: a variable size sliding window approach. The algorithm performs a similar fine-tuned search for regular patterns as described above, but within a specific genomic region delimited by a sliding window. It starts with a 10-kbp size window which runs along the whole genome and looks for periodicities of the features of interest. The window is then enlarged incrementally until it covers 95 % of the length of the whole genome. By reporting the boundaries of the regions where periodicities are detected, this approach is able to map the observed periods on their respective genomic regions.
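The window enumeration described above can be sketched as follows. Only the 10-kbp starting size and the 95 % upper bound come from the text; the growth increment, the stride between successive windows and the linear (non-circular) treatment of the chromosome are assumptions made for illustration, and the function names are hypothetical.

```python
def sliding_windows(genome_length, start_size=10_000, growth=10_000,
                    max_fraction=0.95, stride=None):
    """Yield (start, end) windows of increasing size along a linear genome.

    start_size and max_fraction follow the text; growth and stride are
    illustrative assumptions. Wrap-around on circular chromosomes is ignored.
    """
    max_size = int(max_fraction * genome_length)
    size = start_size
    while size <= max_size:
        step = stride or size // 2  # assumed: half-window overlap by default
        for start in range(0, genome_length - size + 1, step):
            yield start, start + size
        size += growth

def features_in_window(positions, start, end):
    """Positions of the features of interest that fall inside one window."""
    return [p for p in positions if start <= p < end]

# Toy run on a 100-kbp genome with a coarse 20-kbp growth increment.
windows = list(sliding_windows(100_000, growth=20_000))
window_sizes = sorted({end - start for start, end in windows})
```

Each window's feature subset would then be passed to the periodicity search, and the boundaries of the windows in which a period is detected give the mapped regions.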
GREAT:SCAN:PRECISION (“PRECISION” stands for “PREdiction of CIS-regulatory elements improved by gene positiON”) is a novel implementation in the R language of PRECISION, a multi-view learning algorithm for TFBS prediction which incorporates two views: a) the DNA sequence motif readout calculated by a TFBS position weight matrix (local sequence classifier) and b) the individual gene contribution to the overall genome periodic pattern, calculated as the positional score by GREAT:SCAN:PATTERNS (global position classifier). This ensemble classifier, which is a weighted combination of a set of base classifiers trained on different views, is implemented using a modified version of the AdaBoost algorithm. The underlying rationale is to combine TFBS sequence motif information with gene positioning information to obtain an accurate and robust TFBS prediction model. Computational approaches for TFBS prediction have so far relied, in one way or another, on local sequence information only. With PRECISION, we show that, for bacteria, the relative positioning of genes along the chromosome carries significant information for TFBS prediction. The design and implementation of the GREAT:SCAN:PRECISION boosting algorithm are open to incorporating any suitable algorithm as an additional “view”, as long as it provides a scoring function for each genomic feature of interest.
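To make the idea of a weighted combination of view-specific base classifiers concrete, the sketch below runs standard discrete AdaBoost over threshold classifiers ("stumps") drawn from two score vectors. It is not the modified AdaBoost used by PRECISION, and the view names, toy scores and function names are all assumptions chosen for illustration.

```python
import math

def best_stump(scores, labels, weights):
    """Return (weighted_error, threshold, sign) of the best stump on one view."""
    best = (float("inf"), None, 1)
    for thr in sorted(set(scores)):
        for sign in (1, -1):
            preds = [sign if s >= thr else -sign for s in scores]
            err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
            if err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost_two_views(views, labels, rounds=3):
    """Discrete AdaBoost; each round picks the best stump across all views."""
    n = len(labels)
    weights = [1.0 / n] * n
    ensemble = []  # entries: (alpha, view_name, threshold, sign)
    for _ in range(rounds):
        candidates = [(best_stump(s, labels, weights), name)
                      for name, s in views.items()]
        (err, thr, sign), name = min(candidates, key=lambda c: c[0][0])
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, name, thr, sign))
        preds = [sign if s >= thr else -sign for s in views[name]]
        # Up-weight misclassified samples, then renormalise.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, labels, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, sample):
    """Sign of the alpha-weighted vote; sample maps view name to score."""
    vote = sum(alpha * (sign if sample[name] >= thr else -sign)
               for alpha, name, thr, sign in ensemble)
    return 1 if vote >= 0 else -1

# Toy data (hypothetical): no single threshold on either view separates the classes.
views = {"sequence": [0.9, 0.8, 0.3, 0.7, 0.2, 0.1],
         "position": [0.2, 0.9, 0.8, 0.1, 0.3, 0.4]}
labels = [1, 1, 1, -1, -1, -1]
ensemble = adaboost_two_views(views, labels, rounds=3)
samples = [{name: s[i] for name, s in views.items()} for i in range(len(labels))]
predictions = [predict(ensemble, x) for x in samples]
```

In this toy case the best single threshold on either view misclassifies one sample, whereas the three-round ensemble draws on both views. Any additional view only needs to supply one score per genomic feature to be added to the `views` dictionary.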
GREAT:SCAN tools focus on detecting periodicities in the compact genomes of single-cell organisms (as periodicities have so far been searched for only in this kind of organism) and they operate on the information of one chromosome at a time. However, periodicities might appear as prominent genome organisation features at different organisational scales in more complex genomes. We envisage the application of GREAT:SCAN tools to the study of intra-chromosomal interactions and arrangements, such as the complex regulatory regions of higher eukaryotes (plants or mammals).
In this work, we demonstrate the analytical capabilities of GREAT:SCAN:PATTERNS by conducting a complete analysis of the seven major E. coli regulons, report regions of periodic arrangement that are associated with large-scale genomic structures such as the organisation into macro-domains, and discuss preliminary results on the use of GREAT:SCAN:PRECISION to formulate and test biological hypotheses.
The features we analyse here include the transcriptionally co-regulated genes (and operons) of the seven TFs of E. coli with the highest number of targets. For the periodicity analysis, all the regulatory network interactions of E. coli were retrieved from RegulonDB (version 8.6). The target genes and operons of the seven major TFs of E. coli (namely CRP, Lrp, H-NS, Fis, Fnr, ArcA and IHF) were selected. Each predicted interaction from RegulonDB was automatically filtered, by an in-house script, to keep only those which have been identified by at least two “strong” validation experiments or at least three “weak” ones (see figure 4 of  for the classification of each prediction method in RegulonDB as “strong” or “weak”). The start codon coordinate of each gene was taken as the gene’s start site. This information was retrieved from the E. coli EcoCyc “SmartTables” resource. For the novel TFBS prediction, the regulatory sequence of each gene was retrieved from RSAT and the genomic coordinates from the UCSC microbial genome browser.
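The evidence filter can be written in a few lines. The sketch below is not the in-house script: the record layout (a TF, a target and a list of evidence labels) and the example interactions are hypothetical, and "strong" and "weak" evidence are counted separately, which is one literal reading of the rule.

```python
def keep_interaction(evidence_levels, min_strong=2, min_weak=3):
    """Keep an interaction backed by >= 2 strong or >= 3 weak evidence items."""
    strong = sum(1 for e in evidence_levels if e == "strong")
    weak = sum(1 for e in evidence_levels if e == "weak")
    return strong >= min_strong or weak >= min_weak

# Hypothetical records: (TF, target gene, evidence labels per experiment).
interactions = [
    ("CRP", "geneA", ["strong", "strong"]),
    ("Fis", "geneB", ["weak", "weak", "weak"]),
    ("IHF", "geneC", ["strong", "weak", "weak"]),
    ("Lrp", "geneD", ["weak"]),
]
filtered = [(tf, gene) for tf, gene, ev in interactions if keep_interaction(ev)]
```

Under this reading, an interaction with one "strong" and two "weak" items is discarded; a variant in which strong evidence also counts towards the weak threshold would keep it.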
Results and discussion
Periodic patterns among E. coli co-regulated genes
[Table: Top scoring periods for each of the seven major regulons of E. coli, together with the respective p-values (first two columns). Column heading: Most significant period.]
Interplay between sequence and position with PRECISION
This section builds upon our previous work applying PRECISION to the prediction of E. coli TFBS. Those results indicated both the importance of genome position for the prediction of the TFBS of several E. coli TFs and the inter-dependence of position and sequence information for effective boosting-based learning of TFBS predictions for some other E. coli TFs. Indeed, even when both views are only weakly informative, their optimised combination may be effective (extended discussion in the Fig. 2 and legend at ref. ). Using two different readouts, the boosting approach developed in PRECISION was able to take advantage of the balance as well as the inter-dependence of these data in order to improve TFBS prediction in E. coli. This multi-view classifier is strong because a) its components (a set of consensus sequences and periods) each fit well to a particular region of the landscape and b) it contains classifiers that are trained to focus on different views of the data. These qualities of the PRECISION boosting algorithm make it suitable for incorporating a diverse set of classifiers with input data from multi-omics studies.
We present a unified computational framework with tools for systematically analysing regular patterns in genomes and for studying their interplay with the regulation of gene expression. We describe the first two tools of GREAT:SCAN: a periodicity analysis tool named PATTERNS and a TFBS prediction tool named PRECISION. We also demonstrate and discuss an example application of the GREAT:SCAN tools to the major E. coli regulons, revealing a complex but coherent genome periodic pattern. Some features of this pattern had been reported in numerous previous studies using cruder methods and less complete data [3, 6–8]. Using PRECISION, we demonstrate that insights from the mechanics of a multi-view learning algorithm, able to improve TFBS predictions, can be exploited to formalise and test further biological hypotheses. Moreover, we applied canonical correlation analysis (CCA) to explore and quantify the interplay of sequence specificity with genome position for the effective binding of TFs. Using this method, we uncover for some regulons in E. coli negative correlations between these two quantities, indicating a potential interplay between sequence quality and the 3D location of the site. Overall, GREAT:SCAN analyses provide novel views on long-range genome organisation in bacteria, explore its association with genome expression and provide methods to evaluate meaningful biological hypotheses.
Availability and requirements
The software is available to the community as free online tools (Additional file 1), which can be found on the abSYNTH platform at the institute of Systems and Synthetic Biology (iSSB). The software runs as a web application that is free for any non-commercial use (e.g. academic, teaching). No installation is required, as all computations are performed by the abSYNTH servers (access at: absynth.issb.genopole.fr/Bioinformatics/tools/GREAT). Once the computations have finished, every user can download a compressed file with all the plots and tables the program has generated. All input data and results are kept for one week and can be downloaded by the user via the job-specific URL that the portal provides (Additional file 2).
We thank François Bucchini for his help with the web application, Ivan Junier for sharing his preliminary observations on the coincidence of macrodomain and periodic region boundaries, Genopole and the abSYNTH platform for hosting the applications and all the members of MEGA team at iSSB for being avid beta-testers of the tools. This study was supported by the EU FP7 project ST-FLOW.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
1. Cook PR. A model for all genomes: the role of transcription factories. J Mol Biol. 2010; 395(1):1–10. doi:http://dx.doi.org/10.1016/j.jmb.2009.10.031.
2. Huynen MA, Snel B. Gene and context: integrative approaches to genome analysis. Adv Protein Chem. 2000; 54:345–79. doi:http://dx.doi.org/10.1016/S0065-3233(00)54010-8.
3. Képès F. Periodic transcriptional organization of the E.coli genome. J Mol Biol. 2004; 340(5):957–64. doi:http://dx.doi.org/10.1016/j.jmb.2004.05.039.
4. Képès F. Periodic epi-organization of the yeast genome revealed by the distribution of promoter sites. J Mol Biol. 2003; 329(5):859–65. doi:http://dx.doi.org/10.1016/S0022-2836(03)00535-7.
5. Bouyioukos C, Elati M, Képès F. Hydrocarbon and Lipid Microbiology Protocols Springer Protocols Handbooks In: McGenity TJ, Timmis KN, Nogales Fernández B, editors. Heidelberg: Humana Press: 2015. p. 1–16, doi:http://dx.doi.org/10.1007/8623_2015_92. http://link.springer.com/protocol/10.1007%8623_2015_92.
6. Jeong KS, Ahn J, Khodursky AB. Spatial patterns of transcriptional activity in the chromosome of Escherichia coli. Genome Biol. 2004; 5(11):86. doi:http://dx.doi.org/10.1186/gb-2004-5-11-r86.
7. Junier I, Hérisson J, Képès F. Genomic organization of evolutionarily correlated genes in bacteria: limits and strategies. J Mol Biol. 2012; 419(5):369–86. doi:http://dx.doi.org/10.1016/j.jmb.2012.03.009.
8. Wright MA, Kharchenko P, Church GM, Segré D. Chromosomal periodicity of evolutionarily conserved gene pairs. Proc Natl Acad Sci U S A. 2007; 104(25):10559–10564. doi:http://dx.doi.org/10.1073/pnas.0610776104.
9. Dekker J. Gene regulation in the third dimension. Science. 2008; 319(5871):1793–1794. doi:http://dx.doi.org/10.1126/science.1152850.
10. Spilianakis CG, Lalioti MD, Town T, Lee GR, Flavell RA. Interchromosomal associations between alternatively expressed loci. Nature. 2005; 435(7042):637–45. doi:http://dx.doi.org/10.1038/nature03574.
11. Papantonis A, Cook PR. Transcription factories: genome organization and gene regulation. Chem Rev. 2013; 113(11):8683–705. doi:http://dx.doi.org/10.1021/cr300513p.
12. Junier I, Hérisson J, Képès F. Periodic pattern detection in sparse boolean sequences. Algorithm Mol Biol. 2010; 5:31. doi:http://dx.doi.org/10.1186/1748-7188-5-31.
13. Elati M, Fekih R, Nicolle R, Junier I, Herisson J, Kepes F. Boosting binding sites prediction using gene positions. Lect Notes Comput Sci. 2011:92–103. doi:http://dx.doi.org/10.1007/978-3-642-23038-7_9.
14. Képès F, Vaillant C. Transcription-based solenoidal model of chromosomes. ComPlexUs. 2003; 1(4):171–80. doi:http://dx.doi.org/10.1159/000082184.
15. Dorman CJ. Genome architecture and global gene regulation in bacteria: making progress towards a unified model? Nat Rev Microbiol. 2013; 11(5):349–55. doi:http://dx.doi.org/10.1038/nrmicro3007.
16. Dillon SC, Dorman CJ. Bacterial nucleoid-associated proteins, nucleoid structure and gene expression. Nat Rev Microbiol. 2010; 8(3):185–95. doi:http://dx.doi.org/10.1038/nrmicro2261.
17. Weng X, Xiao J. Spatial organization of transcription in bacterial cells. Trends Genet. 2014. doi:http://dx.doi.org/10.1016/j.tig.2014.04.008.
18. Sutherland H, Bickmore WA. Transcription factories: gene expression in unions? Nat Rev Genet. 2009; 10(7):457–66. doi:http://dx.doi.org/10.1038/nrg2592.
19. Képès F, Jester BC, Lepage T, Rafiei N, Rosu B, Junier I. The layout of a bacterial genome. FEBS Lett. 2012; 586(15):2043–048. doi:http://dx.doi.org/10.1016/j.febslet.2012.03.051.
20. Ester M, Kriegel H-P, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). Palo Alto: AAAI Press: 1996. p. 226–31.
21. R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2015. http://www.R-project.org.
22. Schapire RE, Singer Y. Improved boosting algorithms using confidence-rated predictions. Mach Learn. 1999; 37(3):297–336. doi:http://dx.doi.org/10.1023/a:1007614523901.
23. Valens M, Penaud S, Rossignol M, Cornet F, Boccard F. Macrodomain organization of the Escherichia coli chromosome. EMBO J. 2004; 23(21):4330–341. doi:http://dx.doi.org/10.1038/sj.emboj.7600434.
24. Salgado H, Peralta-Gil M, Gama-Castro S, Santos-Zavaleta A, Muñiz-Rascado L, García-Sotelo JS, Weiss V, Solano-Lira H, Martínez-Flores I, Medina-Rivera A, Salgado-Osorio G, Alquicira-Hernández S, Alquicira-Hernández K, López-Fuentes A, Porrón-Sotelo L, Huerta AM, Bonavides-Martínez C, Balderas-Martínez YI, Pannier L, Olvera M, Labastida A, Jiménez-Jacinto V, Vega-Alvarado L, Del Moral-Chávez V, Hernández-Alvarez A, Morett E, Collado-Vides J. RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Res. 2013; 41(Database issue):203–13. doi:http://dx.doi.org/10.1093/nar/gks1201.
25. Karp PD, Weaver D, Paley S, Fulcher C, Kubo A, Kothari A, Krummenacker M, Subhraveti P, Weerasinghe D, Gama-Castro S, Huerta AM, Muñiz-Rascado L, Bonavides-Martinez C, Weiss V, Peralta-Gil M, Santos-Zavaleta A, Schröder I, Mackie A, Gunsalus R, Collado-Vides J, Keseler IM, Paulsen I. The EcoCyc database. EcoSal Plus. 2014; 2014. doi:http://dx.doi.org/10.1128/ecosalplus.ESP-0009-2013.
26. Thomas-Chollier M, Defrance M, Medina-Rivera A, Sand O, Herrmann C, Thieffry D, van Helden J. RSAT 2011: Regulatory sequence analysis tools. Nucleic Acids Res. 2011; 39(Web Server issue):86–91. doi:http://dx.doi.org/10.1093/nar/gkr377.
27. Riley M, Abe T, Arnaud MB, Berlyn MKB, Blattner FR, Chaudhuri RR, Glasner JD, Horiuchi T, Keseler IM, Kosuge T, Mori H, Perna NT, Plunkett 3rd G, Rudd KE, Serres MH, Thomas GH, Thomson NR, Wishart D, Wanner BL. Escherichia coli K-12: a cooperatively developed annotation snapshot–2005. Nucleic Acids Res. 2006; 34(1):1–9. doi:http://dx.doi.org/10.1093/nar/gkj405.
28. Junier I, Martin O, Képès F. Spatial and topological organization of DNA chains induced by gene co-localization. PLoS Comput Biol. 2010; 6(2):1000678. doi:http://dx.doi.org/10.1371/journal.pcbi.1000678.
29. Cook PR. Predicting three-dimensional genome structure from transcriptional activity. Nat Genet. 2002; 32(3):347–52. doi:http://dx.doi.org/10.1038/ng1102-347.
30. Elati M, Nicolle R, Junier I, Fernández D, Fekih R, Font J, Képès F. PreCisIon: PREdiction of CIS-regulatory elements improved by gene’s positION. Nucleic Acids Res. 2013; 41(3):1406–1415. doi:http://dx.doi.org/10.1093/nar/gks1286.
31. Hotelling H. Relations between two sets of variates. Biometrika. 1936; 28(3-4):321–77. doi:http://dx.doi.org/10.1093/biomet/28.3-4.321.
32. Lê Cao K-A, González I, Déjean S. integrOmics: an R package to unravel relationships between two omics datasets. Bioinformatics. 2009; 25(21):2855–856. doi:http://dx.doi.org/10.1093/bioinformatics/btp515.