CAGECAT: The CompArative GEne Cluster Analysis Toolbox for rapid search and visualisation of homologous gene clusters

van den Belt, Matthias; Gilchrist, Cameron; Booth, Thomas J.; Chooi, Yit-Heng; Medema, Marnix H.; Alanjary, Mohammad

doi:10.1186/s12859-023-05311-2

Software
Open access
Published: 03 May 2023

Matthias van den Belt¹, Cameron Gilchrist²,³, Thomas J. Booth², Yit-Heng Chooi², Marnix H. Medema¹ & Mohammad Alanjary¹

BMC Bioinformatics volume 24, Article number: 181 (2023)

Abstract

Background

Co-localized sets of genes that encode specialized functions are common across microbial genomes and occur in genomes of larger eukaryotes as well. Important examples include Biosynthetic Gene Clusters (BGCs) that produce specialized metabolites with medicinal, agricultural, and industrial value (e.g. antimicrobials). Comparative analysis of BGCs can aid in the discovery of novel metabolites by highlighting distribution and identifying variants in public genomes. Unfortunately, gene-cluster-level homology detection remains inaccessible, time-consuming and difficult to interpret.

Results

The comparative gene cluster analysis toolbox (CAGECAT) is a rapid and user-friendly platform to mitigate difficulties in comparative analysis of whole gene clusters. The software provides homology searches and downstream analyses without the need for command-line or programming expertise. By leveraging remote BLAST databases, which always provide up-to-date results, CAGECAT can yield relevant matches that aid in the comparison, taxonomic distribution, or evolution of an unknown query. The service is extensible and interoperable and implements the cblaster and clinker pipelines to perform homology search, filtering, gene neighbourhood estimation, and dynamic visualisation of resulting variant BGCs. With the visualisation module, publication-quality figures can be customized directly from a web-browser, which greatly accelerates their interpretation via informative overlays to identify conserved genes in a BGC query.

Conclusion

Overall, CAGECAT is extensible software that can be accessed through a standard web browser for whole-region homology searches and comparisons against continually updated genomes from NCBI. The public web server and installable Docker image are open source and freely available without registration at: https://cagecat.bioinformatics.nl.

Background

Genes working cooperatively in a metabolic pathway are often physically co-localized in prokaryotic and fungal genomes. These gene clusters are commonly observed in specialized metabolism involved in ecological adaptations, such as nutrient utilization and production of virulence factors. In particular, Biosynthetic Gene Clusters (BGCs) that code for specialized metabolites have gained significant interest due to their major role in modern society as a source of pharmaceutical drugs (e.g. antibiotics) and crop protection chemicals [1, 2]. These loci not only contain genes responsible for biosynthesis but often include auxiliary regions coding for regulatory and transporter proteins [2, 3]. Using signature genes and machine-learning-based methods, several computational frameworks have been developed to effectively detect hypothetical BGCs from genomic data, such as ClusterFinder, PRISM, DeepBGC, and antiSMASH [4,5,6,7]. With these mature pipelines and the increase in publicly available genomes, a vast number of BGCs, both experimentally verified and hypothetical, have been catalogued in several databases. These include MIBiG, antiSMASH-DB, BiG-FAM, ARTS-DB, and IMG–ABC [8,9,10,11,12]. Unfortunately, much of this data remains unannotated. For instance, as little as 0.3% of the ~400,000 BGCs in IMG–ABC v5 are experimentally validated. Comparative genomic analysis can shed light on the functions of BGCs and their underlying genes. However, accessible online tools that allow scientists to perform custom comparative genomic analyses are lacking.

Gene cluster analysis methods for homology grouping, search, and visualisation are essential to effectively leverage the available public resources. While tools such as BiG-SCAPE, BiG-SLiCE, MultiGeneBlast and cblaster aid in gene cluster analysis, these demand local computational resources or require command-line experience [13,14,15,16]. Given this technological barrier, there is a need for a user-friendly and accessible platform for performing these analyses. Additionally, downstream methods for interpreting these results are often required. Visualisation and comparative genomic tools such as clinker and CORASON are capable of highlighting synteny or evolutionary relationships between BGCs; however, these also require expertise to operate and are not easily connected to homology search results [13, 17]. To remedy this problem and provide an accessible, “BLAST-like” web server for gene clusters, we present CAGECAT (the CompArative GEne Cluster Analysis Toolbox).

The CAGECAT web server enables researchers to execute a full gene cluster analysis pipeline using customizable BLAST searches on up-to-date genomic databases. The service provides seamless connections between the search and visualisation modules, enabling execution, inspection, and fine-tuning of relevant search results. While some multi-gene search portals exist, such as ClusterScout and antiSMASH-DB, these only provide for model-based searching (e.g. Pfam) on predefined genome datasets, which often lag behind rapidly growing public genomic databases [9, 18]. In addition to providing more up-to-date results, leveraging BLAST homology allows for refined control compared with model searches (e.g. identity and coverage), which can lead to more specific matches that aid in annotation, taxonomic distribution, or gene cluster evolution. Furthermore, with the interconnection of modules a user can accelerate result curation and downstream analysis, e.g. using gene neighbourhood estimation output to adjust intergenic distance thresholds to obtain more relevant matches. To our knowledge, we present the first free and publicly available web server for accelerated curation of homologous gene clusters with integrated downstream interpretation. By broadening accessibility of gene cluster analysis methods we hope this will lead to accelerated analysis and annotation of BGCs and contribute to the general knowledge of their subsequent products.

Implementation and available tools

The aim of CAGECAT is to provide a platform that seamlessly connects gene cluster analysis tools in an accessible web server for search and interpretation of results. To provide this service, CAGECAT implements a queue system that allows parallel job submissions, supported by the Python ‘rq’ library and the Flask web framework (see Additional file 1). The search module leverages the cblaster pipeline, which utilises remote BLAST searches via NCBI’s servers as well as accelerated local Hidden Markov Model (HMM) based searches. Besides rapid similarity searches of entire BGC regions, cblaster provides several functions for gene neighbourhood estimation (GNE), sequence extraction, and visualisation (see Gilchrist et al. for a detailed description of methods) [16]. The clinker pipeline is currently used for the visualisation module, which provides automated cluster alignment and homology annotations. CAGECAT has been designed to provide rapid interoperability between these functions, so that homologous clusters of interest can be selected for use in subsequent analysis. A graphical summary of tool interoperability is given in Fig. 1.

Databases for Hidden Markov Model (HMM) searches

Searches for homologous gene clusters based on HMM profiles using cblaster require cblaster-generated HMM databases. Genus-specific Pfam databases were generated as detailed in the supplemental methods, resulting in 70 genera with 10 or more genomes for fungi and 43 genera with 50 or more genomes for prokaryotes. A custom script to fetch representative and reference genomes of prokaryotes and fungi was written using NCBI’s Entrez Programming Utilities (E-utilities) [19]. To keep CAGECAT freely accessible and to limit server storage, researchers who wish to use custom HMM databases must use the command-line version of cblaster or a local installation of CAGECAT.
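As a rough illustration of how genomes for one genus might be listed through the E-utilities, the sketch below only builds an ESearch request URL against the Assembly database. It is not the authors' script: the function name, the example genus, and the plain organism search term are assumptions, and restricting results to reference or representative genomes would need an additional filter term that is deliberately left out here.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_assembly_search_url(genus, retmax=100, api_key=None):
    """Return an ESearch URL that lists Assembly records for one genus.

    Hypothetical helper for illustration; no request is sent.
    """
    params = {
        "db": "assembly",               # search the Assembly database
        "term": f'"{genus}"[Organism]',  # simple organism query (assumption)
        "retmode": "json",
        "retmax": retmax,               # maximum number of IDs returned
    }
    if api_key:
        params["api_key"] = api_key     # optional key for higher rate limits
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_assembly_search_url("Streptomyces", retmax=50)
print(url)
```

The returned identifiers would still have to be resolved to downloadable genome files in a second step, which this sketch does not cover.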

Job management

CAGECAT manages job submissions through a queue submission system, which processes jobs in a parallelizable first-in-first-out manner. Remote BLASTp queries are submitted to the NCBI API which leverages a scalable infrastructure allowing for multiple simultaneous searches (~ 10 requests/sec with an API key). By default, up to 15 jobs can be run in parallel to ensure stability and throughput. Upon job execution, the job command is constructed with the user-defined values of the input parameters and the appropriate pipelines are executed via Python. All output files are then stored and saved using a uniquely generated job ID. See supplemental methods for further technical details.
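The queueing behaviour described above can be sketched with the Python standard library alone. This is a simplified stand-in, not CAGECAT's code: the server itself relies on the ‘rq’ library, whereas this sketch uses a thread pool, and the tool names, the flag scheme in `build_command`, and the `run_job` placeholder are all hypothetical.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_JOBS = 15  # default number of parallel jobs stated in the text


def build_command(tool, params):
    """Turn user-defined parameters into an argument list.

    The '--name value' flag scheme is hypothetical, not a real CLI.
    """
    command = [tool]
    for name, value in sorted(params.items()):
        command += [f"--{name}", str(value)]
    return command


def run_job(job_id, command):
    """Placeholder for pipeline execution; a real worker would run the command."""
    return {"job_id": job_id, "command": command, "status": "finished"}


def submit_jobs(requests):
    """Start jobs in submission order, running at most MAX_PARALLEL_JOBS at once."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_JOBS) as pool:
        futures = []
        for tool, params in requests:
            job_id = uuid.uuid4().hex  # unique ID used to store the job's output
            futures.append(pool.submit(run_job, job_id, build_command(tool, params)))
        # Collect results in the order the jobs were submitted.
        return [future.result() for future in futures]


results = submit_jobs([
    ("example-search", {"min_identity": 30, "min_coverage": 50}),
    ("example-visualise", {"identity": 0.3}),
])
for result in results:
    print(result["job_id"], result["command"])
```

The pool hands queued jobs to free workers in the order they were submitted, which mirrors the parallelizable first-in-first-out processing described in the text.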

Results and user interface

Input and output

Two entry points for queries are currently implemented in CAGECAT for either gene cluster search via cblaster (search module) or visualisation via clinker (visualisation module). Input and output for other implemented modules are shown in Table 1.

Table 1 Current entry points of CAGECAT and their inputs and outputs

The search module allows local files in either GenBank or FASTA format (protein sequences) to be uploaded and processed by the cblaster pipeline. Additionally, NCBI accession numbers can be used to submit a search query on the NCBI database, which can be combined with local searches using HMM profiles in predefined databases on CAGECAT. The input page (Additional file 1: Figure S1) also contains optional parameters for selection of remote databases, search behaviour, and clustering of results. For the visualisation module, users can upload several GenBank files or directly use outputs from the search module.
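To make the role of the search and clustering parameters more concrete, the following sketch filters hits by identity and coverage and then groups neighbouring hits on the same scaffold when the distance between them stays below a maximum gap. It is a simplified illustration of the general idea, not the cblaster implementation; the `Hit` record, the function names, and all threshold values are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str       # name of the query protein that produced the hit
    scaffold: str    # contig or scaffold carrying the hit
    start: int       # start coordinate of the hit gene
    end: int         # end coordinate of the hit gene
    identity: float  # percent identity
    coverage: float  # percent query coverage


def filter_hits(hits, min_identity=30.0, min_coverage=50.0):
    """Keep hits that meet both thresholds (illustrative values)."""
    return [h for h in hits if h.identity >= min_identity and h.coverage >= min_coverage]


def cluster_hits(hits, max_gap=20000, min_unique_queries=2):
    """Group hits per scaffold when consecutive hits lie within max_gap bases."""
    by_scaffold = {}
    for hit in hits:
        by_scaffold.setdefault(hit.scaffold, []).append(hit)

    clusters = []
    for scaffold_hits in by_scaffold.values():
        scaffold_hits.sort(key=lambda h: h.start)
        current = [scaffold_hits[0]]
        current_end = scaffold_hits[0].end
        for hit in scaffold_hits[1:]:
            if hit.start - current_end <= max_gap:
                current.append(hit)
            else:
                clusters.append(current)
                current = [hit]
            current_end = max(current_end, hit.end) if current[-1] is hit and len(current) > 1 else hit.end
        clusters.append(current)

    # Discard groups that match too few distinct query proteins.
    return [c for c in clusters if len({h.query for h in c}) >= min_unique_queries]


hits = [
    Hit("geneA", "s1", 1000, 2000, 80.0, 95.0),
    Hit("geneB", "s1", 2500, 3500, 65.0, 90.0),
    Hit("geneC", "s1", 90000, 91000, 70.0, 85.0),  # too far from geneB
    Hit("geneA", "s2", 500, 1500, 25.0, 99.0),     # fails the identity threshold
    Hit("geneB", "s2", 2000, 3000, 60.0, 40.0),    # fails the coverage threshold
]
clusters = cluster_hits(filter_hits(hits))
for cluster in clusters:
    print([h.query for h in cluster])
```

Raising `max_gap` merges more distant hits into one cluster, while raising `min_unique_queries` keeps only regions that match more of the query, which is the kind of trade-off the optional parameters expose.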

After completion of remote NCBI searches, users are presented with a cluster heatmap, which displays the presence or absence of each query protein sequence across the genomic hits (Fig. 2A). As in the original cblaster, the results are sorted and colored based on BLAST similarity and number of matching proteins to the query cluster for rapid identification and comparison of homologous gene clusters across genomes. For the visualisation module, clinker will generate interactive gene cluster comparison figures with links drawn between similar genes on neighbouring clusters and shaded based on sequence identity (Fig. 2B). Further details of these modules can be found at https://cagecat.bioinformatics.nl/tools/explanation and several example case studies for the cblaster output can be found in Gilchrist et al. [16].
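The data behind such a heatmap can be thought of as a matrix with one row per hit cluster and one column per query protein. The sketch below builds that matrix and orders the rows; the exact ordering rule (number of matched query proteins first, then mean identity) is an assumption made for illustration and may differ from what cblaster does, and the function and variable names are hypothetical.

```python
def presence_matrix(query_names, clusters):
    """Build heatmap rows from best-hit identities.

    clusters maps a cluster label to {query name: percent identity of best hit}.
    A value of 0.0 marks a query protein that is absent from the cluster.
    """
    rows = []
    for label, best_hits in clusters.items():
        identities = [best_hits.get(q, 0.0) for q in query_names]
        n_present = sum(1 for value in identities if value > 0)
        mean_identity = sum(identities) / len(query_names)
        rows.append((label, identities, n_present, mean_identity))
    # Most complete clusters first, ties broken by higher mean identity.
    rows.sort(key=lambda row: (row[2], row[3]), reverse=True)
    return rows


queries = ["geneA", "geneB", "geneC"]
clusters = {
    "genome1": {"geneA": 90.0, "geneB": 75.0},
    "genome2": {"geneA": 60.0, "geneB": 55.0, "geneC": 70.0},
    "genome3": {"geneB": 99.0},
}
rows = presence_matrix(queries, clusters)
for label, identities, n_present, _ in rows:
    print(label, identities, n_present)
```

Each row can then be drawn as one line of the heatmap, with cell colour scaled by identity and empty cells marking absent query proteins.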

Features and interoperability

Users can download job results to their local computer within 30 days, and output HTML files are displayed in-browser, allowing for interactive inspection of results. The search module output allows for manual gene cluster selection to further curate results, which can be directly exported as GenBank sequences. To accelerate analysis, CAGECAT provides interoperation between results and the available modules. Selections of output from the search module can be directly used as input for downstream analysis (e.g. to selectively visualise some results) or to recompute a search using different parameters (Fig. 3). Notably, when genomic regions from the search module are used for analysis in the visualisation module, the visualisation includes all genes present within each genomic region, including those that were not specified in the search query.

Runtime and scalability

Remote search times are largely dependent on NCBI services which cannot be definitively benchmarked due to dependency on service traffic. However, processing of 346 queries over the 5-month user testing period showed an average search completion time under 8 min. Other functions such as clinker visualisation, recompute, gene cluster neighbourhood estimation, and cluster extraction all showed negligible processing time under 30 s (Additional file 1: Table S1).

Conclusions and future directions

With CAGECAT, we aim to lower the technical barrier to execute gene cluster analysis. Downstream analyses can be rapidly performed using the results of a previously executed job, which accelerates curation and comparative visualisation. This service enables a quick search of whole gene cluster sequences against NCBI non-redundant or RefSeq databases that can be confined to a selected genus. Currently, two entry points exist to start analysing on CAGECAT: (I) finding homologous gene clusters using a query cluster and the cblaster search module, and (II) a visualisation of gene clusters using a set of query clusters and the clinker module. CAGECAT does not impact or interfere with the analysis capabilities of the implemented tools and acts as a bridge to allow for rapid retrieval of homologous gene clusters from continually updated public databases. We foresee CAGECAT being used by a wide audience to easily uncover homologous BGCs and provide publication-quality visualisations without the need for computational resources or programming expertise. The service is also built to be extensible so that additional downstream analyses can be connected in future versions. Suggestions and comments sent via the contact page will be carefully considered during development. Furthermore, CAGECAT is also useful for comparative analysis and discovery of gene clusters beyond those that encode the production of specialized metabolites, such as xenobiotic degradation pathways [20]. Because the remote database is not restricted to any particular taxa, this service can also be used for general homology searches beyond those detailed in this manuscript on a variety of genomes (e.g. human, mouse). Inter-taxa results are also possible with lower homology thresholds set in the advanced options.

With this web server, we aim to accelerate comparative analysis of gene clusters and provide an easy-to-use interface to help uncover clues for further study of BGCs encoding useful specialized metabolites as well as a starting point for investigating gene cluster evolution.

Availability

Project name: Comparative Gene Cluster Analysis Toolbox (CAGECAT).

Project home page: https://cagecat.bioinformatics.nl

Operating system(s): Linux / Platform independent via Docker.

Programming language: Python.

Other requirements: Python 3.8, Docker.

License: MIT.

Source code: https://github.com/malanjary-wur/CAGECAT

Availability of data and materials

All data and materials are freely available via the updated git repository: https://github.com/malanjary-wur/CAGECAT as well as the release version used in this manuscript: https://github.com/malanjary-wur/CAGECAT/releases.

Abbreviations

API: Application programming interface
BGC: Biosynthetic Gene Cluster
CAGECAT: Comparative Gene Cluster Analysis Toolbox
CORASON: Core Analysis of Syntenic Orthologs to prioritize Natural Product BGCs
HMM: Hidden Markov Model
IMG–ABC: Integrated Microbial Genomes–Atlas of Biosynthetic Gene Clusters
MIBiG: Minimum information about a biosynthetic gene cluster
NCBI: National Center for Biotechnology Information

References

1. Laich F, Fierro F, Cardoza RE, Martin JF. Organization of the gene cluster for biosynthesis of penicillin in Penicillium nalgiovense and antibiotic production in cured dry sausages. Appl Environ Microbiol. 1999;65:1236–40.
2. Medema MH, Fischbach MA. Computational approaches to natural product discovery. Nat Chem Biol. 2015;11:639–48.
3. Crits-Christoph A, Bhattacharya N, Olm MR, Song YS, Banfield JF. Transporter genes in biosynthetic gene clusters predict metabolite characteristics and siderophore activity. Genome Res. 2020. https://doi.org/10.1101/gr.268169.120.
4. Cimermancic P, Medema MH, Claesen J, Kurita K, Wieland Brown LC, Mavrommatis K, Pati A, Godfrey PA, Koehrsen M, Clardy J, et al. Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters. Cell. 2014;158:412–21.
5. Skinnider MA, Merwin NJ, Johnston CW, Magarvey NA. PRISM 3: expanded prediction of natural product chemical structures from microbial genomes. Nucleic Acids Res. 2017;45:W49–54.
6. Hannigan GD, Prihoda D, Palicka A, Soukup J, Klempir O, Rampula L, Durcak J, Wurst M, Kotowski J, Chang D, et al. A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Res. 2019;47:e110.
7. Blin K, Shaw S, Steinke K, Villebro R, Ziemert N, Lee SY, Medema MH, Weber T. antiSMASH 5.0: updates to the secondary metabolite genome mining pipeline. Nucleic Acids Res. 2019;47:W81–7.
8. Kautsar SA, Blin K, Shaw S, Navarro-Muñoz JC, Terlouw BR, van der Hooft JJJ, van Santen JA, Tracanna V, Suarez Duran HG, Pascal Andreu V, et al. MIBiG 2.0: a repository for biosynthetic gene clusters of known function. Nucleic Acids Res. 2020;48:D454–8.
9. Blin K, Shaw S, Kautsar SA, Medema MH, Weber T. The antiSMASH database version 3: increased taxonomic coverage and new query features for modular enzymes. Nucleic Acids Res. 2021;49:D639–43.
10. Kautsar SA, Blin K, Shaw S, Weber T, Medema MH. BiG-FAM: the biosynthetic gene cluster families database. Nucleic Acids Res. 2021;49:D490–7.
11. Mungan MD, Blin K, Ziemert N. ARTS-DB: a database for antibiotic resistant targets. Nucleic Acids Res. 2022;50:D736–40.
12. Palaniappan K, Chen I-MA, Chu K, Ratner A, Seshadri R, Kyrpides NC, Ivanova NN, Mouncey NJ. IMG-ABC vol 5.0: an update to the IMG/Atlas of Biosynthetic Gene Clusters Knowledgebase. Nucleic Acids Res. 2020;48:D422–30.
13. Navarro-Muñoz JC, Selem-Mojica N, Mullowney MW, Kautsar SA, Tryon JH, Parkinson EI, De Los Santos ELC, Yeong M, Cruz-Morales P, Abubucker S, et al. A computational framework to explore large-scale biosynthetic diversity. Nat Chem Biol. 2020;16:60–8.
14. Kautsar SA, van der Hooft JJJ, de Ridder D, Medema MH. BiG-SLiCE: a highly scalable tool maps the diversity of 12 million biosynthetic gene clusters. Gigascience. 2021;10:giaa154.
15. Medema MH, Takano E, Breitling R. Detecting sequence homology at the gene cluster level with MultiGeneBlast. Mol Biol Evol. 2013;30:1218–23.
16. Gilchrist CLM, Booth TJ, van Wersch B, van Grieken L, Medema MH, Chooi Y-H. cblaster: a remote search tool for rapid identification and visualization of homologous gene clusters. Bioinf Adv. 2021;1.
17. Gilchrist CLM, Chooi Y-H. Clinker & clustermap.js: automatic generation of gene cluster comparison figures. Bioinformatics. 2021. https://doi.org/10.1093/bioinformatics/btab007.
18. Hadjithomas M, Chen I-MA, Chu K, Huang J, Ratner A, Palaniappan K, Andersen E, Markowitz V, Kyrpides NC, Ivanova NN. IMG-ABC: new features for bacterial secondary metabolism analysis and targeted biosynthetic gene cluster discovery in thousands of microbial genomes. Nucleic Acids Res. 2017;45:D560–5.
19. Entrez Programming Utilities Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010.
20. Wisecaver JH, Rokas A. Fungal metabolic gene clusters—caravans traveling across genomes and environments. Front Microbiol. 2015;6.

Acknowledgements

We thank all researchers involved in beta testing from the Bioinformatics Group, Wageningen University, and the School of Molecular Sciences, The University of Western Australia.

Funding

M.A. is supported by the NWO Talent programme Veni science domain (VI.Veni.202.130). C.L.M.G. is supported by the Australian Government Research Training Project (RTP) Ph.D. scholarship, the National Research Foundation of Korea (NRF) [2021R1C1C1012065, 2019R1A6A1A10073437], the Samsung DS research fund program and the Creative-Pioneering Researchers Program through Seoul National University. Y-H.C. is supported by an Australian Research Council Future Fellowship (FT160100233). M.H.M. is supported by an ERC Starting Grant (948770-DECIPHER).

Author information

Authors and Affiliations

Bioinformatics Group, Wageningen University and Research, 6708PB, Wageningen, The Netherlands
Matthias van den Belt, Marnix H. Medema & Mohammad Alanjary
School of Molecular Sciences, The University of Western Australia, Crawley, WA, 6009, Australia
Cameron Gilchrist, Thomas J. Booth & Yit-Heng Chooi
School of Biological Sciences, Seoul National University, Seoul, South Korea
Cameron Gilchrist

Authors

Matthias van den Belt
Cameron Gilchrist
Thomas J. Booth
Yit-Heng Chooi
Marnix H. Medema
Mohammad Alanjary

Contributions

M.B. developed and maintained the web and core Python architecture for CAGECAT. C.L.M.G. provided cblaster/clinker integration support and product testing. Y-H.C. and T.J.B. contributed to testing and manuscript preparation. M.H.M. and M.A. supervised and coordinated project development. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mohammad Alanjary.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

MHM is a co-founder of Design Pharmaceuticals and a member of the scientific advisory board of Hexagon Bio. All other authors have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Supplemental methods and data with further details on server specifications and implementation.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

van den Belt, M., Gilchrist, C., Booth, T.J. et al. CAGECAT: The CompArative GEne Cluster Analysis Toolbox for rapid search and visualisation of homologous gene clusters. BMC Bioinformatics 24, 181 (2023). https://doi.org/10.1186/s12859-023-05311-2

Received: 10 February 2023
Accepted: 27 April 2023
Published: 03 May 2023
DOI: https://doi.org/10.1186/s12859-023-05311-2