ProClaT, a new bioinformatics tool for in silico protein reclassification: case study of DraB, a protein coded from the draTGB operon in Azospirillum brasilense
- Elisa Terumi Rubel†1, 2,
- Roberto Tadeu Raittz†1, 2,
- Nilson Antonio da Rocha Coimbra1, 2,
- Michelly Alves Coutinho Gehlen1, 2 and
- Fábio de Oliveira Pedrosa3, 4
© The Author(s). 2016
Published: 15 December 2016
Azospirillum brasilense is a plant-growth-promoting, nitrogen-fixing bacterium that is used as a bio-fertilizer in agriculture. Since nitrogen fixation has a high energy demand, the reduction of N2 to NH4+ by nitrogenase occurs only under limiting conditions of NH4+ and O2. Moreover, the synthesis and activity of nitrogenase are highly regulated to prevent energy waste. In A. brasilense, nitrogenase activity is regulated by the products of draG and draT. The product of the draB gene, located downstream in the draTGB operon, may be involved in the regulation of nitrogenase activity by an as yet unknown mechanism.
A deep in silico analysis of the product of draB was undertaken to suggest its possible function and its involvement with DraT and DraG in the regulation of nitrogenase activity in A. brasilense. In this work, we present a new artificial intelligence strategy for protein classification, named ProClaT. The features used by the pattern recognition model were derived from the primary structure of the DraB homologous proteins, calculated by a ProClaT internal algorithm. ProClaT was applied to this case study and the results revealed that the A. brasilense draB gene codes for a protein highly similar to the nitrogenase-associated NifO protein of Azotobacter vinelandii.
This tool allowed the reclassification of DraB/NifO homologous proteins, hypothetical, conserved hypothetical and those annotated as putative arsenate reductase, ArsC, as NifO-like. An analysis of the co-occurrence of draB, draT, draG and other nif genes was performed. It suggests the involvement of draB (nifO) in nitrogen fixation, although without defining a specific function.
Azospirillum brasilense is a diazotrophic organism used as a commercial inoculant, since it promotes plant growth. As a nitrogen-fixing bacterium, A. brasilense has a specific metabolic pathway for the conversion of gaseous dinitrogen into ammonia. N2 is fixed under limiting conditions of NH4+ and O2 through the activity of nitrogenase. Post-translational control of nitrogenase occurs via the DraG-DraT system, in which the DraT enzyme (dinitrogenase reductase ADP-ribosyltransferase) shuts down nitrogenase by inactivating NifH (dinitrogenase reductase) in response to the presence of ammonium ions in the environment, while the DraG enzyme (dinitrogenase reductase activating-glycohydrolase) restores the activity of NifH after the ammonium ions are consumed [3, 4]. The DraT and DraG enzymes are encoded by the draTG genes of the draTGB operon in A. brasilense. The draB gene was annotated as coding for a putative arsenate reductase [GenBank: CCC97498]. However, this function for the draB gene product of Azospirillum brasilense has not been confirmed to date. There is evidence that a homologous protein in Rhodospirillum rubrum may regulate the activity of DraG. The draB gene is homologous to nifO of A. vinelandii and arsC of E. coli. The A. vinelandii nitrogenase-associated NifO protein, encoded in the nifBfdxNnifOQ operon, has a role in regulating the activity of nitrate reductase, and NifO− mutants cannot fix nitrogen in the presence of low concentrations of nitrate [8, 9].
Since the DraB protein has no known homolog in the Gene Ontology database, we developed a strategy named ProClaT (Protein Classifier Tool) to test the hypothesis that the draB gene codes for a NifO-like protein. ProClaT was used for the reclassification of DraB/NifO homologous proteins, hypothetical, conserved hypothetical and those annotated as putative arsenate reductase, ArsC, as NifO-like.
A supervised pattern recognition approach was developed with a neural network as the classifier. In addition, the relationship and co-occurrence of draB with other genes related to nitrogen fixation (the minimum nif gene set, nifHDKENB) and with the draT and draG genes involved in the control of nitrogenase activity were determined by Pearson correlation analysis.
ProClaT is a new machine learning approach to classify proteins based on protein sequence features and conserved domains. ProClaT was used to classify draB gene products and to discover NifO-like proteins.
ProClaT was applied to 2,773 complete bacterial genomes obtained from the NCBI database  via FTP, containing 5,182 GenBank data downloaded in July 2014. The download file size was 78.1 GB.
ProClaT pattern recognition sequence-based features
Amino acid composition
Consensus region alignment scores
Protein physico-chemical properties
The protein physicochemical features used to develop ProClaT were the isoelectric point (pI), charge, nominal mass, aromaticity, instability, hydropathy, entropy and energy.
Isoelectric point: The estimated pI for an amino acid sequence was calculated with Matlab and the Bioinformatics Toolbox™, using the pK values described on http://www.mathworks.com/help/bioinfo/ref/isoelectric.html.
Charge: The estimated charge of a protein in a given pH was calculated by the same Matlab function of the Bioinformatics Toolbox™ as for the pI described above. The default value was taken as the typical intracellular pH of 7.2.
Nominal mass: The expected protein nominal mass was also calculated by a Matlab function of the Bioinformatics Toolbox™, which analyzes a peptide sequence (http://www.mathworks.com/help/bioinfo/ref/isotopicdist.html).
Aromaticity: The aromaticity value of a protein was calculated according to Lobry, and is the relative frequency of Phe + Trp + Tyr in the sequence.
Instability: The protein instability index was calculated according to Guruprasad et al. In this procedure, a value above 40 means that the protein is unstable or has a short half-life.
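The aromaticity feature can be illustrated with a minimal sketch in plain Python. This is not the ProClaT code (the features table lists the Biopython ProtParam call for this feature); the function name `aromaticity` and the toy sequence are assumptions made only for illustration.

```python
def aromaticity(sequence: str) -> float:
    """Relative frequency of Phe (F), Trp (W) and Tyr (Y) in a protein sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Count the three aromatic residues and divide by the sequence length.
    aromatic = sum(seq.count(residue) for residue in "FWY")
    return aromatic / len(seq)


# Toy sequence (hypothetical): 3 aromatic residues out of 10.
print(aromaticity("MFWYAAAAAA"))
```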
Hydropathy or GRAVY (Grand Average of Hydropathy) Index: The protein GRAVY index was calculated according to the Kyte and Doolittle methodology. This index reflects the solubility of a protein, where a positive GRAVY value corresponds to a hydrophobic protein and a negative GRAVY value corresponds to a hydrophilic protein. The GRAVY value of a peptide/protein is calculated by adding the hydropathy values of all its amino acids and dividing the sum by the total number of residues in the sequence.
Entropy and Energy: In this context, the descriptors Energy and Entropy represent, respectively, the degree of uniformity and disorder of each protein sequence. 3 × 3 co-occurrence matrices were generated from the amino acids of the sequence: for each entry, the sequence was read from right to left and stored in a 3 × 3 amino acid arrangement. Based on this list, the pairwise combinations were analyzed one by one and, in case of co-occurrence, the count was updated and recorded. This calculation was based on the Haralick "matrix of co-occurrence" methodology, developed for the description of image textures based on second-order statistics.
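The GRAVY calculation described above (sum of per-residue hydropathy values divided by sequence length) can be sketched as follows. The hydropathy values are those of the Kyte and Doolittle scale; the function and variable names are assumptions for illustration and are not taken from the ProClaT source.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value over all residues."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # A KeyError is raised for non-standard residue codes (e.g. X or B).
    return sum(KYTE_DOOLITTLE[residue] for residue in seq) / len(seq)


# Toy peptides (hypothetical): the first is hydrophobic, the second hydrophilic.
print(gravy("AILV") > 0, gravy("RKDE") < 0)
```

A positive result indicates a predominantly hydrophobic sequence and a negative result a predominantly hydrophilic one, as stated in the text.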
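The text does not give enough detail to reproduce the exact 3 × 3 arrangement, so the following is only a sketch of Haralick-style energy and entropy under stated assumptions: residues are mapped to three hypothetical classes (the grouping in `GROUPS` is an illustrative choice, not the authors'), adjacent residue pairs are counted in a 3 × 3 co-occurrence matrix, and the sequence is read left to right. Energy is taken as the sum of squared pair probabilities and entropy as the Shannon entropy in bits.

```python
import math

# Hypothetical three-class grouping, chosen only for illustration.
GROUPS = {
    **dict.fromkeys("AVLIMFWPGC", 0),  # nonpolar
    **dict.fromkeys("STYNQH", 1),      # polar, uncharged
    **dict.fromkeys("DEKR", 2),        # charged
}


def cooccurrence_matrix(sequence: str) -> list:
    """Count adjacent residue-class pairs in a 3 x 3 matrix."""
    classes = [GROUPS[residue] for residue in sequence.upper()]
    matrix = [[0] * 3 for _ in range(3)]
    for first, second in zip(classes, classes[1:]):
        matrix[first][second] += 1
    return matrix


def energy_and_entropy(sequence: str) -> tuple:
    """Return (energy, entropy) of the normalized co-occurrence matrix."""
    matrix = cooccurrence_matrix(sequence)
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("sequence must contain at least two residues")
    # Keep only non-zero cells so that log2 is always defined.
    probs = [count / total for row in matrix for count in row if count]
    energy = sum(p * p for p in probs)               # uniformity
    entropy = -sum(p * math.log2(p) for p in probs)  # disorder
    return energy, entropy


print(energy_and_entropy("ASDA"))
```

A sequence whose pairs all fall in one cell gives the maximum energy (1) and zero entropy; pairs spread evenly over many cells lower the energy and raise the entropy.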
Features of the ProClaT pattern recognition model
Number of features
Function (Matlab or Python)
AA functional propertya
Scores alignment with consensus region
Self align with consensus region
Global alignment score with consensus region
Local alignment score with consensus region
Protein physico-chemical properties
isoelectric (sequence) (first returned value)
isoelectric (sequence) (second returned value)
ProtParam.ProteinAnalysis (seq).aromaticity() (python)
ProtParam.ProteinAnalysis (seq).instability_index() (python)
ProtParam.ProteinAnalysis (seq).gravy() (python)
function developed in python
function developed in python
ProClaT was parameterized to classify the NifHDK, NifENB, DraT and DraG proteins. Instead of acting as a single TRUE/FALSE classifier, it returns 1 for NifH, 2 for NifD, 3 for NifK, 4 for NifE, 5 for NifN and 6 for NifB. For DraT and DraG, it returns 1 and 2, respectively.
Functions to get the conserved domain, features extraction and create the classifier.
Functions to perform PSI-Blast and features extraction.
Generate the protein conserved domains.
Test of the classifiers algorithms.
Correctly classified proteins by Weka algorithms
Correctly classified instances without cross-validation
Correctly classified instances with cross-validation
-L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a
-S 1 -M 2.0 -N 5 -C 1.0
-G 5 -I 5
-C 0.25 -M 2
-P 100 -S 1 -I 0 -W weka.classifiers.trees.DecisionStump
For the nifO neighborhood analysis, we used ProClaT to identify the genes neighboring nifO within a window of five genes upstream and five genes downstream.
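The window-based neighborhood search can be sketched as below. This assumes each genome is available as an ordered list of gene names and ignores chromosome circularity and strand orientation; the function names and the toy gene orders are hypothetical and only illustrate the idea.

```python
from collections import Counter


def neighbors_in_window(genes, target, window=5):
    """Return (upstream, downstream) gene lists for every occurrence of target."""
    hits = []
    for index, gene in enumerate(genes):
        if gene == target:
            upstream = genes[max(0, index - window):index]
            downstream = genes[index + 1:index + 1 + window]
            hits.append((upstream, downstream))
    return hits


def count_neighbors(genomes, target, window=5):
    """Absolute number of occurrences of each gene in the target neighborhood."""
    counts = Counter()
    for genes in genomes:
        for upstream, downstream in neighbors_in_window(genes, target, window):
            counts.update(upstream)
            counts.update(downstream)
    return counts


# Toy gene orders (hypothetical data, not taken from the analyzed genomes).
example_genomes = [
    ["nifB", "fdxN", "nifO", "nifQ"],
    ["draT", "draG", "nifO"],
]
print(count_neighbors(example_genomes, "nifO"))
```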
Results and discussion
The product of the PST1305 gene of Pseudomonas stutzeri A1501, classified as NifO-like with ProClaT, was suggested to participate in biological nitrogen fixation, probably in electron transport or in an oxygen protection mechanism for nitrogenase. The authors considered this gene product to be required for optimal nitrogenase activity of Pseudomonas stutzeri A1501.
Moreover, the A. vinelandii NifO protein was also classified as NifO-like, as expected. Laboratory tests suggest that this protein has a role in ammonium repression of the nitrite-nitrate (nasAB) assimilatory operon of Azotobacter vinelandii.
Considering that the nifO gene is involved in molybdenum (Mo) metabolism in A. vinelandii, and that nitrogenase and nitrate reductase contain Mo cofactors, NifO may be involved in regulating the distribution of Mo towards the synthesis of nitrogenase FeMoco or the synthesis of the nitrate reductase cofactor.
ProClaT was also applied to the classification of NifHDK, NifENB, DraT and DraG in order to confirm its general applicability.
Additional file 2 lists all bacterial species containing at least five essential nif genes, and indicates the presence of the nifHDK, nifENB, nifO, draT and draG genes according to ProClaT. Of the 80 bacterial species (119 strains) that have the six essential nif genes, 42 (61 strains), or about 50 %, also carry nifO, including Acidithiobacillus ferrivorans, Bradyrhizobium japonicum, Burkholderia xenovorans, Magnetospirillum magneticum, Pseudomonas stutzeri and Rhodospirillum rubrum. However, 41 bacterial species (58 strains) have no nifO-like genes, including Herbaspirillum seropedicae, Klebsiella oxytoca, Enterobacter sp. and Burkholderia phenoliruptrix.
Interestingly, Azospirillum brasilense, Azospirillum lipoferum and Azotobacter vinelandii have two genes coding for NifO-like proteins, according to ProClaT. It is worth noting that no genes coding for NifO-like proteins were found in plasmids.
The co-occurrence correlation of nifO and other nif genes is higher than that observed with the draT and draG genes.
The Pearson Correlation Coefficient of nifO co-occurrence with all the six nif genes is 0.6350, and with the presence of both draT and draG genes is 0.4544.
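The co-occurrence correlation can be illustrated with a short sketch that computes the Pearson correlation coefficient between two presence/absence vectors, one entry per genome (1 = present, 0 = absent). The vectors below are hypothetical toy data and do not reproduce the coefficients reported above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical presence/absence per genome: nifO versus the full nif gene set.
has_nifO = [1, 1, 0, 0, 1, 0]
has_all_nif = [1, 1, 0, 1, 1, 0]
print(round(pearson(has_nifO, has_all_nif), 4))
```

For binary vectors this coefficient is equivalent to the phi coefficient, so values near 1 mean the two genes tend to be present or absent in the same genomes.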
Genes present in the nifO neighborhood
Absolute number of occurrences of the genes in the nifO neighborhood
ProClaT comparison and validation
Sensitivity and specificity of protein prediction methods
Calculated sensitivity (%)
Calculated specificity (%)
1. Cut-off score (>30 % local identity and > 50 % positive)
2. Conserved domain
3. Conserved domain with cutoff score
A PSI-Blast search was performed against the NCBI NR protein library, using the consensus region of NifO as the input query. It returned 3,000 hits of similar proteins, of which 296 are NifO-like after curation. All these proteins were submitted to the above methods. ProClaT showed the best sensitivity.
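As a reminder of how the two measures compared here are defined, the following sketch computes sensitivity (true positives over all curated positives) and specificity (true negatives over all curated negatives) from predicted and curated labels. The function names and the toy labels are assumptions for illustration; they are not the counts obtained in this study.

```python
def confusion_counts(predicted, actual):
    """Return (tp, fp, tn, fn) for two equal-length lists of booleans."""
    tp = fp = tn = fn = 0
    for pred, truth in zip(predicted, actual):
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of curated positives that the method recovers."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of curated negatives that the method rejects."""
    return tn / (tn + fp)


# Hypothetical toy labels: True means "NifO-like".
curated = [True, True, True, False, False]
predicted = [True, True, False, False, True]
tp, fp, tn, fn = confusion_counts(predicted, curated)
print(sensitivity(tp, fn), specificity(tn, fp))
```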
NifHDKENB proteins identification by ProClaT
Number of curated proteins
Although its accuracy is high, the specificity of ProClaT can be improved. The observed average error rate (3.17 %), although low, was probably due to the small number of curated NifHDKENB proteins available in biological databases to train the ProClaT neural network.
DraB classification with published protein prediction tools
Since the A. brasilense DraB protein has no homolog in the GO database, as revealed by a BLAST search performed with the AmiGO web tool, the functional classification services based on GO terms were not specific. The ConFunc tool predicted the following terms for the DraB protein: 1) GO:0008794 (ontology: molecular function, description: arsenate reductase glutaredoxin activity) with a probability of 0.667 and 2) GO:0006351 (ontology: biological process, description: transcription, DNA-templated) with a probability of 0.306. With the Blast2GO tool, the terms suggested for the DraB protein were: 1) GO:0055114 (ontology: biological process, description: oxidation-reduction process) and 2) GO:0016491 (ontology: molecular function, description: oxidoreductase activity). Other bioinformatics tools suggest that DraB may belong to the arsenate reductase-like family (InterPro and PANTHER), to the thioredoxin-like fold family (InterPro, Pfam and PROSITE) or to the family annotated, but not proven, as nitrogenase-associated protein (InterPro). Protein prediction methods based on tertiary structure are not recommended in this case, since protein structure databases contain no experimentally determined tertiary structure models of DraB/NifO homologs.
A new efficient tool for protein classification, ProClaT, is described and tested. In this in silico study, ProClaT revealed that the draB gene of Azospirillum brasilense codes for a NifO-like protein. There is evidence that A. vinelandii NifO is possibly involved in regulating the distribution of Mo towards the synthesis of nitrogenase FeMoco or the synthesis of the nitrate reductase cofactor.
All the genes coding for NifO-like proteins found with ProClaT belong to bacteria having at least three of the six essential nif genes, nifHDK and nifENB. With the correlation analysis of the co-occurrence of these genes in complete bacterial genomes, we observed that the nifO/draB gene has a higher correlation coefficient with the essential nif genes than with draT and draG, whose products are involved in controlling nitrogenase activity in response to ammonium levels.
Analysis of the neighborhood revealed that nifO may have both nif and/or draT and draG genes as neighbors, but no clear pattern was identified.
Of the 80 bacterial species analyzed containing the six essential nif genes, 42 also contain the nifO gene. However, 41 diazotrophic bacterial species have no nifO-like genes, which suggests that nifO is not essential for the nitrogen fixation by nitrogenase.
ProClaT found nine genes annotated as arsenate reductase, six as hypotheticals and six with variable names in complete bacterial genomes. This suggests that these gene products should be reclassified as NifO-like.
ProClaT was developed to reclassify the DraB protein in relation to the NifO-like proteins and to investigate its biological function.
ProClaT was tested with curated Nif proteins and showed average hit rate of 96.83 % in classifying known Nif proteins, confirming that it can be useful in the (re)classification of other proteins. Thus, ProClaT has a much wider application as revealed by its validation with the defined essential nitrogen fixation proteins.
We thank R.A. Vialle, C.E. Brim and V. Weiss for technical assistance, and A.C. Bonatto, L.F. Huergo and J. Marchaukoski for reviewing and kindly correcting the paper. We thank the Graduate Program in Bioinformatics of the Federal University of Paraná and the National Science and Technology Institutes of Biological Nitrogen Fixation (INCT).
This article has been published as part of BMC Bioinformatics Volume 17 Supplement 18, 2016. Proceedings of X-meeting 2015: 11th International Conference of the AB3C + Brazilian Symposium on Bioinformatics: bioinformatics. The full contents of the supplement are available online https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-18.
This work and its publication were supported by the National Council for Scientific and Technological Development (CNPq), the Coordination for the Improvement of Higher Education Personnel (CAPES) and the National Institute for Science and Technology of Biological Nitrogen Fixation (INCT-FBN/CNPQ/MCTIC). The publication costs will be covered with resources from CAPES to the Graduate Program in Bioinformatics of the Federal University of Paraná.
Availability of data and materials
Project name: ProClaT.
Project home page: https://sourceforge.net/projects/proclat/
Operating system(s): Platform independent.
Programming language: Matlab (R2012b) and Python 3.4.
Other requirements: MathWorks Bioinformatics Toolbox™ and Biopython.
License: GNU GPL v3.
The datasets supporting the results of this article are available in the repository, https://sourceforge.net/projects/proclat/.
FOP and RTR proposed the concept, validated the results and revised the manuscript. ETR and RTR developed the methodology, carried out the implementation and obtained the results, under the supervision of FOP. NARC and MACG provided technical assistance and developed some functions. All authors contributed to and approved the manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Hungria M, Campo RJ, Souza EM, Pedrosa FO. Inoculation with selected strains of Azospirillum brasilense and A. lipoferum improves yields of maize and wheat in Brazil. Plant Soil. 2010;331:413–25.
- Postgate JF. The fundamentals of nitrogen fixation. Cambridge: Cambridge Univ. Press; 1982.
- Zumft WG, Castillo F. Regulatory properties of the nitrogenase from Rhodopseudomonas palustris. Arch Microbiol. 1978;117:53–60.
- Huergo LF, Pedrosa FO, Muller-Santos M, Chubatsu LS, Monteiro RA, Merrick M, Souza EM. PII signal transduction proteins: pivotal players in post-translational control of nitrogenase activity. Microbiology. 2012;158:176–90.
- Zhang Y, Burris RH, Roberts GP. Cloning, sequencing, mutagenesis, and functional characterization of draT and draG genes from Azospirillum brasilense. J Bacteriol. 1992;174(10):3364–9.
- Liang J, Nielsen GM, Lies DP, Burris RH, Roberts GP, Ludden PW. Mutations in the draT and draG Genes of Rhodospirillum rubrum result in loss of regulation of nitrogenase by reversible ADP-Ribosylation. J Bacteriol. 1991;173:6903–9.
- Zhang Y, Pohlmann EL, Halbleib CM, Ludden PW, Roberts GP. Effect of PII and Its Homolog GlnK on Reversible ADP-Ribosylation of Dinitrogenase Reductase by Heterologous Expression of the Rhodospirillum rubrum dinitrogenase reductase ADP-ribosyl transferase-dinitrogenase reductase-activating glycohydrolase regulatory system in Klebsiella pneumonia. J Bacteriol. 2001;183:1610–20.
- Quiñones FR, Bosh R, Imperial J. Expression of the nifBfdxNnifOQ Region of Azotobacter vinelandii and Its Role in Nitrogenase Activity. J Bacteriol. 1993;175:2926–35.
- Gutierrez JC, Santero E, Tortolero M. Ammonium repression of the nitrite-nitrate (nasAB) assimilatory operon of Azotobacter vinelandii is enhanced in mutants expressing the nifO gene at high levels. Mol Gen Genet. 1997;255:172–9.
- Dos Santos PC, Fang Z, Mason SW, Setubal JC, Dixon R. Distribution of nitrogen fixation and nitrogenase-like sequences amongst microbial genomes. BMC Genomics. 2012;13:162.
- NCBI GenBank FTP. ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/ (2015). Accessed 19 Apr 2015.
- Lobry JR, Gautier C. Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes. Nucleic Acids Res. 1994;22:3174–80.
- Guruprasad K, Reddy BV, Pandit MW. Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Eng. 1990;4:155–61.
- Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982;157:105–32.
- Haralick RM. Statistical and structural approaches to texture. Proc IEEE. 1979;67:786–804.
- Jonassen I, Collins JF, Higgins DG. Finding flexible patterns in unaligned protein sequences. Protein Sci. 1995;4:1587–95.
- Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJL. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3.
- Jain AK, Duin RPW, Mao J. Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell. 2000;22(1):4–37.
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten I. The WEKA data mining software: an update. ACM SIGKDD Explorations News. 2009;11:10–8.
- Wu X, Kumar V, Quinlan JR, Ghosh J, Motoda QYH, Mclachlan GJ, Ng A, Liu B, Yu PS, Zhou Z, Steinbach M, Hand DJ, Steinberg D. Top 10 algorithms in data mining. Knowl Inf Syst. 2008;14:1–37.
- Fan H, Yan Y, Li Y, Ping S, Zhang W, Chen M, Lin M, Lu W. Analysis of a new nitrogen fixation gene in Pseudomonas stutzeri A1501. Acta Microbiol Sin. 2009;49:580–4.
- Carbon S, Ireland A, Mungall CJ, Shu S, Marshall B, Lewis S, AmiGO Hub, Web Presence Working Group. AmiGO: online access to ontology and annotation data. Bioinformatics. 2009;25(2):288–9.
- Wass MN, Sternberg JE. ConFunc - functional annotation in the twilight zone. Bioinformatics. 2008;24:798–806.
- Conesa A, Gotz S. Blast2GO: a comprehensive suite for functional analysis in plant genomics. Int J Plant Genomics. 2008. doi:10.1155/2008/619832.
- The InterPro Consortium. InterPro: An integrated documentation resource for protein families, domains and functional sites. Brief Bioinform. 2002;3:225–35.
- Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 2003;13:2129–41.
- Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hethweington K, Holm L, Mistry J, Sonnhammer ELL, Tate J, Punta M. The Pfam protein families database. Nucleic Acids Res. 2014;42:D222–30.
- Sigrist CJA, Cerutti L, De Castro E, Langendijk-Genevaux PS, Bulliard V, Bairoch A, Hulo N. PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res. 2010;38:161–6.
- Adler J, Parmryd I. Quantifying colocalization by correlation: the pearson correlation coefficient is superior to the mander’s overlap coefficient. Wiley InterScience. 2010. doi:10.1002/cyto.a.20896.