CFVisual: an interactive desktop platform for drawing gene structure and protein architecture

Abstract

Background

When researchers perform gene family analysis, they often analyze the structural characteristics of genes, such as the distribution of introns and exons. Structural analysis of the amino acid sequence, for example of motif and domain features, is also essential. Researchers often integrate these analyses into one image to extract more information, but tools that perform this integration are lacking.

Results

Here, we developed a tool, CFVisual, for drawing gene structure and protein architecture. CFVisual can draw the phylogenetic tree, gene structure, and protein architecture in one picture, and offers rich interactive capabilities that meet the working needs of researchers. It also supports arbitrary stitching of the above analysis images, which makes it a useful aid in gene family analysis. The CFVisual package was implemented in Python and is freely available from https://github.com/ChenHuilong1223/CFVisual/.

Conclusion

CFVisual has already been used by researchers and cited in published articles. It will continue to support researchers in the study of gene structure and protein architecture.

Background

As more and more plant and animal genomes are sequenced, a large number of genome annotation files are produced, generally in formats such as GFF3 and GTF. Researchers often need to extract the gene structure of a gene set (such as a gene family) from these annotation files and display the exon–intron structures graphically. This helps researchers to understand the composition and position of exons and introns, and advances the understanding of alternative splicing. In conjunction with phylogenetic analysis, it also helps to understand gene evolution. At present, one of the better drawing tools is GSDS [1]. Unfortunately, it does not fully satisfy researchers' requirements for graphics: the phylogenetic tree cannot be classified and colored, specific numerical information is not provided, and the website is often inaccessible.

Motifs and domains are the functional units and characteristic structures of amino acid sequences, and are often identified by tools such as MEME and Pfam/NCBI-CDD/SMART [2,3,4,5]. Displaying these motifs and domains along a line helps researchers understand the structure of the protein sequence, and comparing them across protein sequences helps to find conserved regions and sites that differ. Combined with a phylogenetic tree, this also helps to study the evolution of motifs and domains. When conducting gene family analysis, researchers often need to splice the gene structure map and the motif and/or domain distribution map into a single figure to obtain more information. To do so, they currently have to stitch the images together in Adobe Illustrator, Adobe Photoshop, or other image editing software, which in our experience is time-consuming and tedious. It is therefore important to develop a suitable tool that removes this manual step.

Methods

We wrote the core logic of the software in Python, built the graphical user interface (GUI) with the PySide2 library, and visualized the data with the matplotlib library using our own drawing logic. Finally, we used PyInstaller to package the CFVisual platform.

To demonstrate the advantages of CFVisual, we downloaded the latest rice genome data from the rice database (http://rice.uga.edu/) [6], including the whole-genome protein sequences and the GFF3 annotation file. We then used HMMER (threshold set to 1e-10) with the hidden Markov model of the pectinesterase domain (PF01095.19) to identify candidate rice PME protein sequences [7]. Finally, all candidate protein sequences were checked against the Pfam (https://pfam.xfam.org/), NCBI-CDD (https://www.ncbi.nlm.nih.gov/cdd), and SMART (http://smart.embl-heidelberg.de/) databases, and only protein sequences that contain the pectinesterase domain were considered members of the PME gene family.
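The threshold filtering step can be sketched as follows. This is a minimal illustration, not the authors' script. It assumes that the 1e-10 threshold refers to the full-sequence E-value and that the hits were saved in HMMER's tabular (--tblout) format, where the target name is the first whitespace-separated column and the full-sequence E-value is the fifth. The function name filter_hits and the sample hit lines are hypothetical.

```python
def filter_hits(tblout_lines, max_evalue=1e-10):
    """Return target names whose full-sequence E-value is <= max_evalue.

    Assumes the HMMER --tblout layout: column 1 is the target name and
    column 5 is the full-sequence E-value. Comment lines start with '#'.
    """
    kept = []
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split()
        target, evalue = fields[0], float(fields[4])
        if evalue <= max_evalue and target not in kept:
            kept.append(target)
    return kept


# Hypothetical hit lines (made-up identifiers and scores), truncated after
# the first seven columns for readability.
example = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "protein_A.1  -  Pectinesterase  PF01095.19  3.2e-85  285.1  0.0",
    "protein_B.1  -  Pectinesterase  PF01095.19  4.1e-06  20.3  0.1",
]
print(filter_hits(example))
```

Only candidates that pass this filter would then go on to the Pfam, NCBI-CDD, and SMART checks described above.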

After that, we wrote a Python script (https://github.com/ChenHuilong1223/CFVisual/) to extract the amino acid sequences and GFF3 annotation information of rice PMEs. The amino acid sequences were analyzed with MEGA X [8], MEME (https://meme-suite.org/meme/), Pfam, NCBI-CDD, and SMART to generate the result files. Finally, these results were visualized using CFVisual.

Results

Function overview, usage, and illustrative examples

Functionally, CFVisual can be divided into three parts: gene structure, protein architecture, and classification and coloring of the phylogenetic tree.

Gene structure

Users can provide GFF3, GTF, or BED files and then use CFVisual to draw the picture. In the interface shown in Fig. 1b, users can set the style of each feature, such as color, shape, and thickness. Clicking the “Statistics” button makes CFVisual automatically count the gene length, the numbers of introns, UTRs, and CDSs, and other quantitative information (Fig. 1c). Users can also add other information, such as domains and signal peptides (Fig. 1a). Overlaying these as rectangular boxes on the gene structure helps researchers judge intuitively which CDS fragments encode a domain and whether introns are present within it.
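The kind of per-transcript counting described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CFVisual's own code: it assumes GFF3 input in which exon, CDS, five_prime_UTR, and three_prime_UTR features point to their mRNA through the Parent attribute, takes the gene length as the span of the mRNA feature, and derives the intron count as the number of exons minus one. The function names and the demo_gene records are hypothetical.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def transcript_stats(gff3_lines):
    """Count length, introns, CDSs and UTRs for every mRNA in GFF3 lines."""
    raw = defaultdict(lambda: {"length": 0, "exon": 0, "cds": 0, "utr": 0})
    kinds = {"exon": "exon", "CDS": "cds",
             "five_prime_UTR": "utr", "three_prime_UTR": "utr"}
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if ftype == "mRNA" and "ID" in attrs:
            raw[attrs["ID"]]["length"] = end - start + 1
        elif ftype in kinds:
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    raw[parent][kinds[ftype]] += 1
    # Introns are inferred from exons; 0 if no exon lines were given.
    return {
        tid: {"length": s["length"], "introns": max(s["exon"] - 1, 0),
              "cds": s["cds"], "utrs": s["utr"]}
        for tid, s in raw.items()
    }


def gff3_line(ftype, start, end, attrs):
    """Build one tab-separated GFF3 line for the hypothetical example."""
    return "\t".join(["chr1", "demo", ftype, str(start), str(end),
                      ".", "+", ".", attrs])


EXAMPLE_GFF3 = [
    "##gff-version 3",
    gff3_line("gene", 1000, 3000, "ID=demo_gene"),
    gff3_line("mRNA", 1000, 3000, "ID=demo_gene.1;Parent=demo_gene"),
    gff3_line("five_prime_UTR", 1000, 1099, "Parent=demo_gene.1"),
    gff3_line("exon", 1000, 1500, "Parent=demo_gene.1"),
    gff3_line("CDS", 1100, 1500, "Parent=demo_gene.1"),
    gff3_line("exon", 2000, 3000, "Parent=demo_gene.1"),
    gff3_line("CDS", 2000, 2800, "Parent=demo_gene.1"),
    gff3_line("three_prime_UTR", 2801, 3000, "Parent=demo_gene.1"),
]
print(transcript_stats(EXAMPLE_GFF3))
```

Annotation files that omit exon features would need the introns to be inferred from the gaps between CDS and UTR intervals instead.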

Fig. 1

Drawing functions and core interface of CFVisual. a Classic stitched diagram in structural analysis (tree + motif + gene structure + domain diagram). b User interaction window. Each tab corresponds to the control interface of one graphical part. c Basic statistics on the structural elements of genes. d The promoter subgraph

For the promoter map (Fig. 1d), users provide the position results from PlantCARE [9] or other tools that predict the positions of cis-acting elements. CFVisual reads all cis-acting elements at once, and users can selectively display them as needed.

Protein architecture

The input for the motif diagram (Fig. 1a) is the result file produced by the MEME tool. Compared with conventional motif visualization tools, CFVisual has the following advantages. First, it fully reproduces the MEME results, including the negative correlation between the height of the rectangular box representing a motif and its p value: the lower the box, the higher the p value and the lower the credibility of the predicted motif. Second, the “Scanned Sites” results can be displayed as transparent rectangular boxes. Finally, users can selectively display the motifs they want to study.
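One way to obtain a box height that falls as the p value rises is to scale it with the negative logarithm of the p value. The sketch below is an assumption for illustration only: the text does not state the exact scaling CFVisual uses, and the cap at 1e-10, the function name motif_box_height, and its parameters are all hypothetical choices.

```python
import math


def motif_box_height(p_value, max_height=1.0, p_floor=1e-10):
    """Map a motif-site p value to a box height in [0, max_height].

    The height is proportional to -log10(p_value). P values below p_floor
    are clipped so that they all receive max_height (assumed cap).
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    clipped = max(p_value, p_floor)
    return max_height * math.log10(clipped) / math.log10(p_floor)


# Smaller p values (more credible sites) give taller boxes.
for p in (1e-3, 1e-6, 1e-12):
    print(p, round(motif_box_height(p), 3))
```

With this mapping, a site with a p value of 1e-6 is drawn twice as tall as one with a p value of 1e-3, and every site at or below the cap is drawn at full height.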

The input for the domain map is the result file of NCBI-CDD, Pfam, or SMART. Here too, users can selectively display the domains they want to study. Another advantage of CFVisual is that domains can be superimposed on the motif diagram as rectangular boxes (Fig. 1a), so that researchers can intuitively judge how motifs and domains are positioned relative to each other.

Classification and coloring of phylogenetic tree

While studying gene structure and protein architecture, researchers often add a phylogenetic tree to study the evolution of these structures. CFVisual supports this need well. Users only need to provide a tree file in Newick format, which CFVisual reads and draws (Fig. 1a). Researchers can then use the “Tree Edit Tab” to classify and color the phylogenetic tree, and finally export high-resolution bitmaps and/or editable vector graphics of publication quality.
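The idea of classifying and coloring tree leaves can be illustrated with a small Python sketch. It is not CFVisual's implementation: it assumes a Newick string with unquoted labels, takes the group assignment and the color palette from the user, and the names leaf_names, color_leaves, and the geneA to geneD labels are hypothetical.

```python
import re


def leaf_names(newick):
    """Extract leaf labels from a Newick string (unquoted labels only).

    A leaf label directly follows "(" or ","; internal node labels and
    support values follow ")" and are therefore not captured.
    """
    labels = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [label.strip() for label in labels if label.strip()]


def color_leaves(newick, groups, palette, default="black"):
    """Map each leaf to the color of its group, or to a default color."""
    return {
        leaf: palette.get(groups.get(leaf), default)
        for leaf in leaf_names(newick)
    }


# Hypothetical tree with support values on the internal nodes.
tree = "((geneA:0.1,geneB:0.2)90:0.3,(geneC:0.1,geneD:0.4)85:0.2);"
groups = {"geneA": "Group 1", "geneB": "Group 1", "geneC": "Group 2"}
palette = {"Group 1": "blue", "Group 2": "orange"}
print(color_leaves(tree, groups, palette))
```

Leaves without a group assignment, such as geneD here, keep the default color, so a partially classified tree can still be drawn.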

Illustrative examples

To illustrate the above advantages of CFVisual, we take the gene structure, motif, and domain drawings of the rice PME gene family as an example.

The gene structure of rice PMEs is shown in Fig. 2 and the number of structural elements is shown in Table 1. The average length of rice PME genes is 2802.62 bp; the longest is 8802 bp (LOC_Os01g21034.1) and the shortest is 557 bp (LOC_Os04g43370.1). The average numbers of introns, CDSs, and UTRs are 1.79, 2.76, and 1.69, respectively. The maximum values are 5 introns (LOC_Os10g26680.1 and LOC_Os02g46310.1), 6 CDSs (LOC_Os10g26680.1 and LOC_Os02g46310.1), and 3 UTRs (LOC_Os11g43830.1). The minimum values are 0 introns (LOC_Os11g07090.1, LOC_Os03g18860.1, LOC_Os04g38560.1, LOC_Os04g35770.1, and LOC_Os09g39760.1), 1 CDS (LOC_Os11g07090.1, LOC_Os03g18860.1, LOC_Os04g38560.1, LOC_Os04g35770.1, and LOC_Os09g39760.1), and 0 UTRs (LOC_Os11g07090.1, LOC_Os09g37360.1, LOC_Os11g36240.1, LOC_Os04g43370.1, LOC_Os01g19440.1, and LOC_Os02g46310.1).

Fig. 2

Phylogenetic tree, gene structure, and domain diagram of rice PMEs

Table 1 The basic statistical details on structural elements of rice PME genes

According to the number of introns, eukaryotic genes can be divided into three categories: intronless (no introns), intron-poor (three or fewer introns per gene), and intron-rich (more than three introns per gene) [10]. Combined with the phylogenetic relationship, we found that the genes in Group 1 are either intronless (4, 15.38%) or intron-poor (22, 84.62%). Therefore, Group 1 is an intron-poor clade. Group 2 contains all three types of genes: intron-rich genes are the most common (9, 56.25%), followed by intron-poor genes (6, 37.50%), and intronless genes are the least common (1, 6.25%). Therefore, Group 2 is an intron-rich clade.
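The classification rule and the percentages above can be reproduced with a short Python sketch. The thresholds come from the definition cited in the text; the function names are hypothetical, and the per-gene intron counts in the example are made-up values chosen only so that the category sizes match those reported for the two groups.

```python
from collections import Counter

CATEGORIES = ("intronless", "intron-poor", "intron-rich")


def classify_by_introns(n_introns):
    """Classify a gene by intron count: 0, 1 to 3, or more than 3."""
    if n_introns < 0:
        raise ValueError("intron count cannot be negative")
    if n_introns == 0:
        return "intronless"
    if n_introns <= 3:
        return "intron-poor"
    return "intron-rich"


def category_shares(intron_counts):
    """Return {category: (gene count, percentage)} for a list of genes."""
    if not intron_counts:
        raise ValueError("at least one gene is required")
    counts = Counter(classify_by_introns(n) for n in intron_counts)
    total = len(intron_counts)
    return {
        cat: (counts[cat], round(100 * counts[cat] / total, 2))
        for cat in CATEGORIES
    }


# Illustrative intron counts (not the real per-gene values).
group1 = [0] * 4 + [1] * 22          # 4 intronless, 22 intron-poor
group2 = [0] + [2] * 6 + [5] * 9     # 1 intronless, 6 poor, 9 rich
print(category_shares(group1))
print(category_shares(group2))
```

Run on these counts, the shares agree with the reported 15.38% and 84.62% for Group 1 and 6.25%, 37.50%, and 56.25% for Group 2.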

Combined with the location of the domains, we found that introns are almost always present in the region encoding the pectinesterase domain, whereas introns are absent in the region encoding the PMEI domain. Intriguingly, for the region encoding the pectinesterase domain, the genes of Group 2 contain more introns, while the genes of Group 1 contain fewer introns.

In conclusion, CFVisual displayed the structure of rice PME genes well and provided useful quantitative information, which improved our understanding of the structure and evolution of rice PME genes.

The structural motifs and domains along a line representing the amino acid sequence are shown in Fig. 3. We found that motif 10 exists only in the PMEI domain and is a sequence signature of that domain. Motif 7, motif 4, motif 5, motif 1, motif 11, motif 3, motif 2, motif 9, motif 6, and motif 12 are contained in the pectinesterase domain. We also found some cases of motif repetition and loss. For example, motif 7 in the pectinesterase domain is repeated after motif 4, and the motifs of PMEs in Group 1 are relatively intact, whereas those of PMEs in Group 2 are mostly missing. Interestingly, motif 8 and motif 10 are present only in PMEs of Group 1 and cannot be found in PMEs of Group 2. Overall, rice PME protein sequences are generally conserved but show some obvious differences. From a phylogenetic point of view, the distribution of motifs and domains is clearly group-specific. This helps us to better understand the sequence characteristics and evolution of rice PMEs.

Fig. 3

Phylogenetic tree, motif, and domain diagram of rice PMEs

Discussion

CFVisual can draw phylogenetic tree, gene structure, promoter cis-acting element, motif, and domain diagrams, and stitch them together in any combination. The generated pictures can be used directly in a paper, sparing researchers manual retouching. CFVisual has been used by some researchers and cited by some articles [11,12,13]. We expect it to become a preferred choice for researchers who draw gene structure and protein architecture.

Availability of data and materials

All data generated or analyzed during this study are included in this published article and the Additional files. We used only public data and did not generate any sequence data ourselves.

References

  1. Hu B, Jin J, Guo A-Y, Zhang H, Luo J, Gao G. GSDS 2.0: an upgraded gene feature visualization server. Bioinformatics. 2015;31(8):1296–7.

  2. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009;37(suppl_2):W202–8.

  3. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J. Pfam: the protein families database. Nucleic Acids Res. 2014;42(D1):D222–30.

  4. Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR. CDD: a conserved domain database for the functional annotation of proteins. Nucleic Acids Res. 2010;39(suppl_1):D225–9.

  5. SMART: recent updates, new developments and status in 2020. https://academic.oup.com/nar/article/49/D1/D458/5940513?login=false.

  6. Kawahara Y, de la Bastide M, Hamilton JP, Kanamori H, McCombie WR, Ouyang S, Schwartz DC, Tanaka T, Wu J, Zhou S. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice. 2013;6(1):1–10.

  7. Mistry J, Finn RD, Eddy SR, Bateman A, Punta M. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res. 2013;41(12):e121–e121.

  8. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: molecular evolutionary genetics analysis across computing platforms. Mol Biol Evol. 2018;35(6):1547–9.

  9. Lescot M, Déhais P, Thijs G, Marchal K, Moreau Y, Van de Peer Y, Rouzé P, Rombauts S. PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences. Nucleic Acids Res. 2002;30(1):325–7.

  10. Liu H, Lyu HM, Zhu K, Van de Peer Y, Cheng ZM. The emergence and evolution of intron-poor and intronless genes in intron-rich plant gene families. Plant J. 2021;105(4):1072–82.

  11. Chen H, Wang X, Ge W. Comparative genomics of three-domain multi-copper oxidase gene family in foxtail millet (Setaria italica L). Comput Mol Biol. 2021;11(4):1–13.

  12. Chen H, Ge W. Identification, molecular characteristics, and evolution of GRF gene family in foxtail millet (Setaria italica L.). Front Genet. 2021;12:727674–727674.

  13. Chen H, Ji K, Li Y, Gao Y, Liu F, Cui Y, Liu Y, Ge W, Wang Z. Triplication is the main evolutionary driving force of NLP transcription factor family in Chinese cabbage and related species. Int J Biol Macromol. 2022;201:492–506.

Acknowledgements

We thank the users of CFVisual for their comments.

Funding

The work was supported by the Hebei Provincial College Student Innovation and Entrepreneurship Training Program (X2021006).

Author information

Contributions

HC and WG conceived the study and led the research. HC implemented and coordinated the analyses. HC, XS, QS, and SF performed the analysis. HC wrote the paper. All authors contributed to revising the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Weina Ge.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Chen, H., Song, X., Shang, Q. et al. CFVisual: an interactive desktop platform for drawing gene structure and protein architecture. BMC Bioinformatics 23, 178 (2022). https://doi.org/10.1186/s12859-022-04707-w
