CFVisual: an interactive desktop platform for drawing gene structure and protein architecture
BMC Bioinformatics volume 23, Article number: 178 (2022)
When researchers perform gene family analysis, they often analyze structural characteristics of the genes, such as the distribution of introns and exons. Analysis of characteristic structures in the amino acid sequences, such as motifs and domains, is equally essential. Researchers often integrate these analyses into one image to extract more information, but tools for this integration are lacking.
Here, we developed a tool (CFVisual) for drawing gene structure and protein architecture. CFVisual can draw the phylogenetic tree, gene structure, and protein architecture in one picture, and has rich interactive capabilities, which can meet the work needs of researchers. Furthermore, it also supports arbitrary stitching of the above analysis images. It has become a useful helper in gene family analysis. The CFVisual package was implemented in Python and is freely available from https://github.com/ChenHuilong1223/CFVisual/.
CFVisual has been used by some researchers and cited by some articles. In the future, CFVisual will continue to serve as a good helper for researchers in the study of gene structure and protein architecture.
As more and more plant and animal genomes are sequenced, a large number of genome annotation files have been produced, generally in formats such as GFF3 and GTF. Researchers often need to extract the gene structure of particular gene sets (such as gene families) from these annotation files and display the exon–intron structures graphically. This helps researchers understand the composition and position of exons and introns and advances the understanding of alternative splicing. Moreover, in conjunction with phylogenetic analysis, it also helps to understand gene evolution. At present, GSDS [1] is among the better drawing tools. Unfortunately, it does not fully satisfy researchers' requirements for graphics: the phylogenetic tree cannot be classified and colored, specific numerical information is not provided, and the website is often inaccessible.
Motifs and domains are the functional units and characteristic structures of amino acid sequences, and are often identified by tools such as MEME and Pfam/NCBI-CDD/SMART [2,3,4,5]. Displaying these motifs and domains along a line helps researchers understand the structure of a protein sequence, and comparing several protein sequences in this way helps to find conserved regions and sites that differ. Moreover, combined with the phylogenetic tree, it helps to study the evolution of motifs and domains. When conducting gene family analysis, researchers often need to combine the gene structure map with the motif and/or domain distribution map into one figure in order to obtain more information. To do so, they currently have to stitch the images together in Adobe Illustrator, Adobe Photoshop, or other image editing software, which is time-consuming and tedious. It is therefore important to develop a tool that removes this manual step.
We wrote the software logic in Python, implemented the graphical user interface with the PySide2 library, and visualized the data with the matplotlib library using our own drawing logic. Finally, we used PyInstaller to package the CFVisual platform.
To better demonstrate the advantages of CFVisual, we downloaded the latest rice genome data from the rice database (http://rice.uga.edu/) [6], including the whole-genome protein sequences and the GFF3 annotation file. We then used HMMER (threshold set to 1e-10) with the pectinesterase domain hidden Markov model (PF01095.19) to identify candidate rice PME protein sequences [7]. Finally, all candidate protein sequences were checked against the Pfam (https://pfam.xfam.org/), NCBI-CDD (https://www.ncbi.nlm.nih.gov/cdd), and SMART (http://smart.embl-heidelberg.de/) databases, and only protein sequences containing the pectinesterase domain were considered members of the PME gene family.
After that, we wrote a Python script (https://github.com/ChenHuilong1223/CFVisual/) to extract the amino acid sequences and GFF3 annotation information of the rice PMEs. The amino acid sequences were analyzed with MEGA X [8], MEME (https://meme-suite.org/meme/), Pfam, NCBI-CDD, and SMART to generate the result files. Finally, these results were visualized using CFVisual.
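The candidate-selection step can be sketched as a filter over HMMER's tabular output. This is a minimal sketch, not the authors' script: it assumes the 1e-10 threshold refers to the full-sequence E-value and that the search results were written with hmmsearch's `--tblout` option, where the first column is the target name and the fifth is the full-sequence E-value. The protein identifiers and scores in the example are made up for illustration.

```python
def filter_tblout(lines, max_evalue=1e-10):
    """Return target names whose full-sequence E-value is <= max_evalue.

    Assumes hmmsearch --tblout layout: whitespace-separated columns,
    target name in column 1 and full-sequence E-value in column 5.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        target, evalue = fields[0], float(fields[4])
        if evalue <= max_evalue and target not in hits:
            hits.append(target)
    return hits


# Hypothetical rows (identifiers and scores are illustrative only).
example_rows = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "protA - Pectinesterase PF01095.19 1.2e-85 285.3 0.1",
    "protB - Pectinesterase PF01095.19 3.0e-04 20.1 0.0",
]
candidates = filter_tblout(example_rows)
```

In this example only `protA` passes the threshold. The surviving candidates would still need the domain confirmation against Pfam, NCBI-CDD, and SMART described above.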
Function overview, usage, and illustrative examples
Functionally, CFVisual has three parts: the gene structure level, the protein architecture level, and the classification and coloring of the phylogenetic tree.
Users can provide GFF3, GTF, or BED files and then use CFVisual to draw the picture. In the interface shown in Fig. 1b, users can set the style of each feature, such as color, shape, and thickness. Clicking the “Statistics” button makes CFVisual automatically count the gene length, the numbers of introns, UTRs, and CDSs, and other quantitative information (Fig. 1c). Users can also add other information, including domains and signal peptides (Fig. 1a). Drawing these as combined rectangular boxes helps researchers judge intuitively which CDS fragments encode a domain and where introns are present.
For the promoter map (Fig. 1d), users provide the location results from PlantCARE [9] or other tools for predicting the positions of cis-acting elements. CFVisual reads all cis-acting elements at once, and users can display them selectively as needed.
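The kind of counting performed by the “Statistics” button can be illustrated with a short GFF3 reader. This is a sketch under stated assumptions rather than CFVisual's own code: it assumes transcripts are `mRNA` features, that child features carry a `Parent` attribute, that UTRs are typed `five_prime_UTR` or `three_prime_UTR`, and that the intron count equals the number of exons minus one. The function names and the example records are hypothetical.

```python
from collections import defaultdict

UTR_TYPES = {"five_prime_UTR", "three_prime_UTR"}


def parse_attributes(column9):
    """Turn a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    return dict(item.split("=", 1) for item in column9.strip().split(";") if "=" in item)


def transcript_stats(gff3_lines):
    """Count length, exons, CDSs, UTRs, and introns per transcript ID."""
    stats = defaultdict(lambda: {"length": 0, "exon": 0, "CDS": 0, "UTR": 0})
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if ftype == "mRNA" and "ID" in attrs:
            stats[attrs["ID"]]["length"] = end - start + 1  # coordinates are 1-based, inclusive
        elif "Parent" in attrs and (ftype in ("exon", "CDS") or ftype in UTR_TYPES):
            key = "UTR" if ftype in UTR_TYPES else ftype
            for parent in attrs["Parent"].split(","):
                stats[parent][key] += 1
    for record in stats.values():
        # Assumption: introns are the gaps between consecutive exons.
        record["intron"] = max(record["exon"] - 1, 0)
    return dict(stats)


def row(ftype, start, end, attributes):
    """Build one tab-separated GFF3 line for the illustrative example."""
    return "\t".join(["chr1", "example", ftype, str(start), str(end), ".", "+", ".", attributes])


example_gff3 = [
    "##gff-version 3",
    row("mRNA", 1000, 2999, "ID=t1;Parent=g1"),
    row("exon", 1000, 1499, "Parent=t1"),
    row("exon", 1800, 2999, "Parent=t1"),
    row("five_prime_UTR", 1000, 1099, "Parent=t1"),
    row("CDS", 1100, 1499, "Parent=t1"),
    row("CDS", 1800, 2899, "Parent=t1"),
    row("three_prime_UTR", 2900, 2999, "Parent=t1"),
]
summary = transcript_stats(example_gff3)
```

For the hypothetical transcript `t1`, the summary holds a length of 2000 bp, two exons, two CDSs, two UTRs, and one intron. Annotation files that omit exon features would need the intron count derived differently, for example from merged CDS and UTR intervals.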
The input for drawing the motif diagram (Fig. 1a) is the result file predicted by MEME. Compared with some conventional motif visualization tools, CFVisual has the following advantages. First, the software fully reproduces the MEME results, including the negative correlation between the height of the rectangular box representing a motif and its p value: the lower the box, the higher the p value and the lower the credibility of the predicted motif. Second, the “Scanned Sites” results can be displayed as transparent rectangular boxes. Finally, users can selectively display the motifs they want to study.
The input for the domain map is the result file of NCBI-CDD, Pfam, or SMART. Here too, users can selectively display the domains they want to study. Another advantage of CFVisual is that domains can be superimposed on the motif diagram as rectangular boxes (Fig. 1a), so that researchers can judge intuitively how motifs and domains are positioned relative to each other.
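One way to obtain a box height that falls as the p value rises is a capped negative-logarithm scale. The sketch below is an assumption about how such a mapping could look, not CFVisual's actual formula: the function name, the `max_height` parameter, and the cap at a p value of 1e-10 are illustrative choices.

```python
import math


def motif_block_height(p_value, max_height=1.0, floor_p=1e-10):
    """Map a motif-site p value to a box height.

    Smaller p values give taller boxes; p values at or below floor_p
    all receive max_height, and p = 1 gives a height of zero.
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in the interval (0, 1]")
    clipped = max(p_value, floor_p)
    return max_height * math.log10(clipped) / math.log10(floor_p)


tall = motif_block_height(1e-10)   # full height
medium = motif_block_height(1e-5)  # half height on this scale
capped = motif_block_height(1e-20) # capped at full height
```

On this scale a site with p = 1e-5 is drawn at half the height of a site with p = 1e-10, which keeps weakly supported sites visibly lower without letting extremely small p values dominate the figure.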
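The positional relationship that the overlay makes visible can also be checked numerically. The following sketch labels each motif by its position relative to one domain interval on the protein sequence; the function name, the motif names, and all coordinates are hypothetical and serve only to illustrate the idea.

```python
def place_motifs(motifs, domain_start, domain_end):
    """Label each motif as 'inside', 'partial', or 'outside' a domain.

    motifs is a list of (name, start, end) tuples in amino acid
    coordinates; the domain interval is inclusive on both ends.
    """
    placement = {}
    for name, start, end in motifs:
        if start >= domain_start and end <= domain_end:
            placement[name] = "inside"
        elif end < domain_start or start > domain_end:
            placement[name] = "outside"
        else:
            placement[name] = "partial"
    return placement


# Hypothetical motif positions and a hypothetical domain spanning residues 30-100.
example_motifs = [("motif_a", 40, 60), ("motif_b", 95, 120), ("motif_c", 300, 320)]
placement = place_motifs(example_motifs, domain_start=30, domain_end=100)
```

Here `motif_a` lies inside the domain, `motif_b` straddles its boundary, and `motif_c` lies outside it.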
Classification and coloring of phylogenetic tree
When studying gene structure and protein architecture, researchers often combine the results with a phylogenetic tree to study the evolution of these structures. CFVisual supports this need well. Users only need to provide a tree file in Newick format, which CFVisual recognizes and draws directly (Fig. 1a). After that, researchers can use the “Tree Edit Tab” to classify and color the phylogenetic tree, and finally produce high-resolution bitmaps and/or editable vector graphics of publication quality.
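To make the Newick input and the classify-and-color step concrete, the sketch below extracts leaf names from a Newick string and assigns each leaf a color by group. It is a simplified illustration, not CFVisual's parser: it assumes unquoted labels without comments, and the tree, the group assignment, and the palette are hypothetical.

```python
def newick_leaves(newick):
    """Return leaf labels in left-to-right order.

    Handles branch lengths and internal node labels or support values,
    but assumes labels are unquoted and the tree has no comments.
    """
    leaves = []
    token = ""
    previous = ""      # last structural character seen
    in_length = False  # True while reading a branch length after ':'
    for ch in newick.strip():
        if ch in "(),;":
            # A label that directly follows ')' belongs to an internal node.
            if token and previous != ")":
                leaves.append(token)
            token = ""
            in_length = False
            previous = ch
        elif ch == ":":
            in_length = True
        elif not in_length and not ch.isspace():
            token += ch
    return leaves


# Hypothetical tree, grouping, and palette.
example_tree = "((A:0.1,B:0.2)90:0.3,(C:0.4,D:0.5):0.6);"
group_of = {"A": "Group 1", "B": "Group 1", "C": "Group 2", "D": "Group 2"}
palette = {"Group 1": "tab:blue", "Group 2": "tab:orange"}

leaves = newick_leaves(example_tree)
leaf_colors = {leaf: palette[group_of[leaf]] for leaf in leaves}
```

The resulting mapping from leaf name to color is the information a drawing routine needs to color tip labels or clade branches by group.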
To better reflect the above advantages of CFVisual, we take the gene structure, motif, and domain drawing results of the PME gene family of rice as an example.
The gene structure of rice PME is shown in Fig. 2 and the number of structural elements is shown in Table 1. The average length of the rice PME genes is 2802.62 bp; the longest is 8802 bp (LOC_Os01g21034.1) and the shortest is 557 bp (LOC_Os04g43370.1). The average numbers of introns, CDSs, and UTRs are 1.79, 2.76, and 1.69, respectively. The maximum values are 5 introns (LOC_Os10g26680.1 and LOC_Os02g46310.1), 6 CDSs (LOC_Os10g26680.1 and LOC_Os02g46310.1), and 3 UTRs (LOC_Os11g43830.1). The minimum values are 0 introns (LOC_Os11g07090.1, LOC_Os03g18860.1, LOC_Os04g38560.1, LOC_Os04g35770.1, and LOC_Os09g39760.1), 1 CDS (LOC_Os11g07090.1, LOC_Os03g18860.1, LOC_Os04g38560.1, LOC_Os04g35770.1, and LOC_Os09g39760.1), and 0 UTRs (LOC_Os11g07090.1, LOC_Os09g37360.1, LOC_Os11g36240.1, LOC_Os04g43370.1, LOC_Os01g19440.1, and LOC_Os02g46310.1).
According to the number of introns, eukaryotic genes can be divided into three categories: intronless (no introns), intron-poor (three or fewer introns per gene), and intron-rich (more than three introns per gene) [10]. Combined with the phylogenetic relationship, we found that Group 1 contains only intronless (4, 15.38%) and intron-poor (22, 84.62%) genes. Therefore, Group 1 is an intron-poor clade. Group 2 contains all three types of genes: intron-rich genes are the most common (9, 56.25%), followed by intron-poor genes (6, 37.50%), and intronless genes are the least common (1, 6.25%). Therefore, Group 2 is an intron-rich clade.
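The three-way classification and the percentages reported for each group follow directly from the intron counts, as the sketch below shows. The function names are illustrative. Because the per-gene intron counts are not listed here, the example uses placeholder counts that only reproduce the category sizes reported for Group 1 (4 intronless and 22 intron-poor genes); the value 1 stands in for any count from one to three.

```python
from collections import Counter


def intron_category(n_introns):
    """Classify a gene as intronless, intron-poor, or intron-rich."""
    if n_introns < 0:
        raise ValueError("intron count cannot be negative")
    if n_introns == 0:
        return "intronless"
    if n_introns <= 3:
        return "intron-poor"
    return "intron-rich"


def category_shares(intron_counts):
    """Return {category: (gene count, percentage of the group)}."""
    counts = Counter(intron_category(n) for n in intron_counts)
    total = sum(counts.values())
    return {category: (n, round(100 * n / total, 2)) for category, n in counts.items()}


# Placeholder counts matching the reported Group 1 category sizes.
group1_counts = [0] * 4 + [1] * 22
group1_shares = category_shares(group1_counts)
```

With these inputs the shares are 4 genes (15.38%) intronless and 22 genes (84.62%) intron-poor, matching the Group 1 figures in the text.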
Combined with the location of the domains, we found that introns are almost always present in the region encoding the pectinesterase domain, whereas introns are absent in the region encoding the PMEI domain. Intriguingly, for the region encoding the pectinesterase domain, the genes of Group 2 contain more introns, while the genes of Group 1 contain fewer introns.
In conclusion, CFVisual displayed the structure of the rice PME genes well and provided useful quantitative information, which advanced our understanding of rice PME gene structure and its evolution.
The motifs and domains along a line representing the amino acid sequence are shown in Fig. 3. We found that motif 10 exists only in the PMEI domain and is a sequence signature of that domain. Motif 7, motif 4, motif 5, motif 1, motif 11, motif 3, motif 2, motif 9, motif 6, and motif 12 are contained in the pectinesterase domain. Moreover, we also found some cases of motif repetition and loss. For example, motif 7, located in the pectinesterase domain, is repeated after motif 4; the motifs are relatively intact in the PMEs of Group 1 but mostly missing in the PMEs of Group 2. Interestingly, motif 8 and motif 10 are present only in the PMEs of Group 1 and cannot be found in the PMEs of Group 2. Overall, rice PME protein sequences are generally conserved but show some clear differences. From a phylogenetic point of view, the distribution of motifs and domains is clearly group-specific. This helps us to better understand the sequence characteristics and evolution of rice PME.
CFVisual can draw phylogenetic tree, gene structure, promoter cis-acting element, motif, and domain diagrams, and stitch them together in any combination. The generated figures can be used directly in publications, sparing researchers manual retouching. CFVisual has been used by some researchers and cited by some articles [11,12,13]. We expect it to become a preferred choice for researchers drawing gene structure and protein architecture.
Availability of data and materials
All data generated or analyzed during this study are included in this published article and the Additional files. We used only public data and did not produce any sequence data ourselves.
1. Hu B, Jin J, Guo A-Y, Zhang H, Luo J, Gao G. GSDS 2.0: an upgraded gene feature visualization server. Bioinformatics. 2015;31(8):1296–7.
2. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009;37(suppl_2):W202–8.
3. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J. Pfam: the protein families database. Nucleic Acids Res. 2014;42(D1):D222–30.
4. Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR. CDD: a conserved domain database for the functional annotation of proteins. Nucleic Acids Res. 2010;39(suppl_1):D225–9.
5. Letunic I, Khedkar S, Bork P. SMART: recent updates, new developments and status in 2020. Nucleic Acids Res. 2021;49(D1):D458–60. https://academic.oup.com/nar/article/49/D1/D458/5940513.
6. Kawahara Y, de la Bastide M, Hamilton JP, Kanamori H, McCombie WR, Ouyang S, Schwartz DC, Tanaka T, Wu J, Zhou S. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice. 2013;6(1):1–10.
7. Mistry J, Finn RD, Eddy SR, Bateman A, Punta M. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res. 2013;41(12):e121.
8. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: molecular evolutionary genetics analysis across computing platforms. Mol Biol Evol. 2018;35(6):1547–9.
9. Lescot M, Déhais P, Thijs G, Marchal K, Moreau Y, Van de Peer Y, Rouzé P, Rombauts S. PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences. Nucleic Acids Res. 2002;30(1):325–7.
10. Liu H, Lyu HM, Zhu K, Van de Peer Y, Cheng ZM. The emergence and evolution of intron-poor and intronless genes in intron-rich plant gene families. Plant J. 2021;105(4):1072–82.
11. Chen H, Wang X, Ge W. Comparative genomics of three-domain multi-copper oxidase gene family in foxtail millet (Setaria italica L.). Comput Mol Biol. 2021;11(4):1–13.
12. Chen H, Ge W. Identification, molecular characteristics, and evolution of GRF gene family in foxtail millet (Setaria italica L.). Front Genet. 2021;12:727674.
13. Chen H, Ji K, Li Y, Gao Y, Liu F, Cui Y, Liu Y, Ge W, Wang Z. Triplication is the main evolutionary driving force of NLP transcription factor family in Chinese cabbage and related species. Int J Biol Macromol. 2022;201:492–506.
We thank the users of CFVisual for their comments.
The work was supported by the Hebei Provincial College Student Innovation and Entrepreneurship Training Program (X2021006).
The authors declare that they have no competing interests.
Chen, H., Song, X., Shang, Q. et al. CFVisual: an interactive desktop platform for drawing gene structure and protein architecture. BMC Bioinformatics 23, 178 (2022). https://doi.org/10.1186/s12859-022-04707-w