In silico analysis of methyltransferase domains involved in biosynthesis of secondary metabolites

Background Secondary metabolites biosynthesized by polyketide synthase (PKS) and nonribosomal peptide synthetase (NRPS) family of enzymes constitute several classes of therapeutically important natural products like erythromycin, rapamycin, cyclosporine etc. In view of their relevance for natural product based drug discovery, identification of novel secondary metabolite natural products by genome mining has been an area of active research. A number of different tailoring enzymes catalyze a variety of chemical modifications to the polyketide or nonribosomal peptide backbone of these secondary metabolites to enhance their structural diversity. Therefore, development of powerful bioinformatics methods for identification of these tailoring enzymes and assignment of their substrate specificity is crucial for deciphering novel secondary metabolites by genome mining. Results In this work, we have carried out a comprehensive bioinformatics analysis of methyltransferase (MT) domains present in multi functional type I PKS and NRPS proteins encoded by PKS/NRPS gene clusters having known secondary metabolite products. Based on the results of this analysis, we have developed a novel knowledge based computational approach for detecting MT domains present in PKS and NRPS megasynthases, delineating their correct boundaries and classifying them as N-MT, C-MT and O-MT using profile HMMs. Analysis of proteins in nr database of NCBI using these class specific profiles has revealed several interesting examples, namely, C-MT domains in NRPS modules, N-MT domains with significant homology to C-MT proteins, and presence of NRPS/PKS MTs in association with other catalytic domains. Our analysis of the chemical structures of the secondary metabolites and their site of methylation suggested that a possible evolutionary basis for the presence of a novel class of N-MT domains with significant homology to C-MT proteins could be the close resemblance of the chemical structures of the acceptor substrates, as in the case of pyochelin and yersiniabactin. These two classes of MTs recognize similar acceptor substrates, but transfer methyl groups to N and C positions on these substrates. Conclusion We have developed a novel knowledge based computational approach for identifying MT domains present in type I PKS and NRPS multifunctional enzymes and predicting their site of methylation. Analysis of nr database using this approach has revealed presence of several novel MT domains. Our analysis has also given interesting insight into the evolutionary basis of the novel substrate specificities of these MT proteins.

Methyltransferase (MT) domains present in NRPS and PKS clusters constitute a major class of tailoring domains/ enzymes involved in biosynthesis of secondary metabolites. They catalyze the transfer of methyl group from Sadenosylmethionine (SAM or AdoMet) to the carbon, nitrogen or oxygen atoms at various positions on the backbones of polyketides, nonribosomal peptides and fatty acids and therefore have been classified as C-MT, N-MT and O-MT respectively depending upon their site of methylation. These enzymatic domains in general have a bidomain structure, where the first subdomain contains the binding site for methyl group donor, while the second subdomain harbors the binding site for acceptor substrate [21,22]. The presence of MT domains in multifunctional NRPS and PKS proteins is generally inferred from chemical structure of the secondary metabolite products. There are only few in vitro studies on enzymatic characterization of NRPS/PKS MT domains [23][24][25][26][27]. A recent study on MT domains from type II PKS biosynthetic pathways has revealed interesting correlation between regioselectivity of methylation and MT sequence [24]. However, no such analysis has been carried out for MT domains present in type I PKS or NRPS proteins. In contrast to type II PKS MTs which are stand alone proteins, MT domains in type I PKS and NRPS are present along with other catalytic domains on a single polypeptide chain. Therefore, it has been diffi-cult to decipher the correct length and domain boundaries for MT domains in type I PKS or NRPS proteins. Various studies have suggested that the size of N-MT domain is typically 450 amino acids, while C-MT and O-MT are generally 300 amino acids long. A set of 3 conserved sequence motifs has been identified in most MTs [28][29][30]. Mutational studies of N-MTs of peptide synthetases have shown that these 3 motifs are essential for the catalysis [31]. The knowledge of these MT sequence motifs and the expected spacing between them is often used for discerning presence of MT domains in multifunctional NRPS and PKS proteins. However, because of the high degree of sequence divergence, delineating the correct boundary of these proteins is quite often a difficult task. In our earlier study, we attempted to identify MT domains in various NRPS/PKS gene clusters based on pairwise alignment with MT domain from actinomycin cluster [32]. However, this domain identification protocol failed to detect 23 out of 32 MT domains. The 23 unidentified MT domains included the three groups of MTs (C-, O-and N-MTs), for which proper templates were not available. The general purpose domain identification tools like CDD-search can identify MT domains in NRPS and PKS proteins, but can not predict the domain boundaries accurately and they also fail to classify them as C-MT, N-MT and O-MT. Such classification is crucial for prediction of chemical structures of secondary metabolites. The knowledge of substrate specificity and domain boundaries of MT domains is also important for rational design of novel secondary metabolites by introduction of heterologous MT domains.
In this manuscript, we have carried out a systematic analysis of the sequence/structural features of MT domains present in various experimentally characterized NRPS and type I PKS clusters having known metabolic products. Since crystal structures are available for many stand alone small molecule methyltransferases from several microbial organisms, we have carried out threading analysis for the experimentally characterized MT domains from NRPS and PKS biosynthetic pathways. The threading analysis has helped in elucidating the putative three dimensional structure adopted by MT domains and based on the alignment of MT containing sequences on the structural fold of MT domain it has been possible to delineate the correct boundaries for NRPS/PKS MT domains. Our threading analysis has also given novel insight into the structural features of linker sequences flanking the MT domains in NRPS and PKS proteins. Using the curated sequences of these MTs, we have carried out detailed phylogenetic analysis to investigate whether these catalytic domains cluster as per their specificity for site of methylation i.e. C-MT, N-MT and O-MT. Based on this analysis, we have identified suitable template sequences of C-MT, N-MT and O-MT domains from representative clusters, which can be used to identify MT domains in uncharacterized NRPS/PKS proteins. We have also developed Hidden Markov Model (HMM) profiles which can identify MT domains in a query sequence and classify them as N-MT, C-MT and O-MT. Using these HMM profiles, we have analyzed nonredundant protein sequence database of NCBI to identify other multifunctional enzymes containing C-MT, N-MT and O-MT domains.

Results
In this study, we have carried out a number of different bioinformatics analyses on MT domains present in type I PKS and NRPS proteins, to correlate the sequence of these MT domains to their substrate specificity i.e. the site of methylation. Figure 1 gives a schematic overview of the various different analyses carried out and the type of results obtained from them, while the results are discussed in detail in the following sections.
The chemical structures of the secondary metabolites produced by various PKS, NRPS and hybrid NRPS/PKS clusters cataloged in NRPS-PKS web resource were analyzed carefully to identify methyl substitutions on polyketide or nonribosomal peptide backbones. The presence of methyl substitutions on nitrogen and oxygen atoms indicated presence of N-MT or O-MT domains in the proteins encoded by these gene clusters. However, in absence of MT domains methyl substitutions on carbons in a ketide group can also result from selection of methylmalonate extender groups by the AT domains of PKS proteins.
A schematic overview of different bioinformatics analyses carried out in the current work on MT domains present in type I PKS and NRPS proteins Therefore, for correctly inferring presence of C-MT domains in a PKS protein, the substrate specificity of the corresponding AT domain was also checked. Table 1 lists various ORFs harboring the MT domains, their GenBank accession number and the type or substrate specificity of MT domain as deduced from the chemical structure of the metabolite. As can be seen, the data set consisted of 20 C-MT, 19 O-MT and 22 N-MT domains from 27 different NRPS/PKS clusters. Figure 2 shows the chemical structures of 5 representative secondary metabolites highlighting the methyl groups added by C-MT, N-MT and O-MT domains, while chemical structures of the remaining 22 secondary metabolites are shown in Additional file 1. Each of the ORFs listed in Table 1 were analyzed by NRPS-PKS search tool as well as CDD server of NCBI. NRPS-PKS search tool, which used a single MT domain from actinomycin cluster as template, could identify only 26 MT domains out of a total of 61. Even though the latest version of CDD server could identify 55 out of these 61 MT domains, the lengths of the MT domains detected by both these programs were notably shorter than the typical length of these domains. These programs also failed to distinguish between C-MT, N-MT and O-MT domains. Additional file 2 shows the length of each MT containing sequence stretch and other catalytic domains flanking this region. As can be seen, all N-MT domains are present in NRPS clusters only as C-A-MT-T modules and typically a 400 amino acid sequence stretch containing this domain is inserted in the adenylation domain between the conserved motifs A8 and A9 of adenylation domains and hence alignment of these N-MT containing A domains with regular A domains produces a split alignment ( domain in a hybrid NRPS/PKS system. In these cases MT-AMT stretch was inserted between a PKS and a NRPS module. It appears that MT domains in type I PKS proteins are present in AT-KR, DH-ER or DH-KR linker regions which are typically more than 200 amino acids long. Hence, length of the flanking linker region could be the reason for the larger length of these MT containing sequence stretches. Therefore, we decided to carry out various structure based sequences analysis for representative MT containing stretches of each category to delineate the exact domain boundaries for MT domains.

Threading analysis of MT domains
Since domain boundaries can be identified correctly by aligning the sequence of multi domain proteins with the 3D structures of the corresponding single domain proteins, we attempted to identify other proteins in PDB which are structurally similar to these MT domains of PKS/NRPS enzymes. However, the lack of crystal structures for any MT domains from PKS/NRPS biosynthetic pathway and high degree of sequence divergence in this enzyme family prompted us to use threading or fold recognition approach. These tools can potentially reveal structural similarity in absence of high degree of similarity in sequences. GenTHREADER and PHYRE fold prediction servers were used for threading analysis. As discussed in the methods section, the MT sequence stretch identified by CDD or NRPS-PKS along with their flanking linkers was threaded on various structural folds in PDB. In cases where chemical structure of metabolite indicated presence of MT domains but no MT domain was detected by these programs, all linker stretches having unusual length were analyzed by both these fold recognition servers. Table 2 shows the results of threading analysis for representative members of these MT containing sequences. The fold prediction hits with highest level of statistical significance corresponding to p-value lower than 0.0001 were labeled as CERTAIN by the GenTHREADER server, while hits having p-value between 0.0001 and 0.001 are labeled as HIGH. We considered only those matches which are labeled as CERTAIN or HIGH by GenTHREADER or have precision of more than 95% in case of fold prediction by PHYRE. As can be seen, all the 18 sequences matched with structure of MT proteins in PDB. This suggests that, the MT domains present in NRPS/PKS proteins would adopt a fold similar to the small molecule methyltransferase in other organisms. The N-MT domains present in our data set aligned not only with N-MT structures, but also with C-MT and O-MT structures. Similar was the case for C-MT and O-MT domains in our data set. Analysis of their alignment scores did not show any preference for structural matches from the same functional category. Therefore, the structures which aligned consistently with all the sequences by both servers and had maximum sequence identity and alignment length with the query sequences were chosen as structural templates for MT domains of NRPS/PKS proteins. The structure of histamine N-MT (PDB code 1VLM; 207 residues) and a hypothetical protein from M. tuberculosis (PDB code 1VL5; 230 residues) showed alignment with most C-MT, O-MT and N-MT by both the fold prediction servers. However, taking into consideration the alignment length and the length of the query sequence, for further analysis 1VLM was selected as template for C-MT and O-MT domains, while 1VL5 was selected as structural template for N-MT domains. These results suggest that, MT domains present in type I PKS and NRPS proteins are likely to be around 200 amino acids long.
It may be noted that the small molecule MT structures present in PDB show significant variation in their length and in case of some of these sequences alignments were found with MT structures having significant difference in length. For example, 3 of the C-MT domains and 1 N-MT domain in our data set aligned with aclacinomycin-10hydroxylase (PDB code 1QZZ; 340 residues) [33] and histamine N-MT (PDB code 1VLM; 207 residues). The huge length difference between these two MT structures made the choice of correct structural template a difficult task. 1QZZ aligns with the beginning of the MT containing   Tubulysin  tubF  CAF05651  PF08242  1  --tubB  CAF05647  PF08242  --1  tubC  CAF05648  PF08242  --1  3  Yersiniabactin  HMWP-1  AAC69588  PF08242  2  --2   Total  20  19  22  61 List of ORFs containing methyltransferase domains in various experimentally characterized NRPS/PKS gene clusters. Table also lists the number and type of MT domains in each of the ORFs and results from PFAM analysis using CDD search.    Figure 5, the N-MT sequences have the conserved motifs I, II/Y, IV and V in N-MTs. Similarly motifs I, motif I-post, II and III are present in the C-MT and O-MT sequences (Figures 6 &7). Thus all the motifs [28,30,31] identified in earlier analysis of small molecule methyltransferase sequences were found to be conserved in the multiple sequence alignments of C-, O-and N-MT domains identified by our threading analysis. The average percent identity among the C-, O-and N-MT domains was found to be 31, 28 and 17% respectively.
The threading analysis also showed statistically significant matches with proteins other than methyltransferases. Such matches were specifically seen for MT containing sequences which were longer in length due to the presence of large flanking linker sequences. As can be seen from

Figure 3 Alignment of the sequence stretch containing A and N-MT domains from a C-A-MT-T NRPS module with sequence of A domain from a C-A-T module. A split alignment is obtained, because N-MT domain is integrated between A-8 and A-9 signature motifs of A domain.
sequences with the 18 template sequences indicated that C-MT templates were able to identify all other C-MT sequences. However, there were a few O-and N-MTs, which were also recognized by these C-MT templates. Specifically, 2 MT sequences in onnamide gene cluster (onnB and onnI) showed very high similarity to C-MT while they are functionally annotated as O-MTs by the authors who reported experimental characterization of this gene cluster [36]. These 2 O-MTs also have motif I as ExGxG which is characteristic of C-MTs. An N-MT from pyochelin synthetase (pchF) also exhibited considerable similarity to several C-MTs and the two onnamide O-MTs. In view of this apparent anomaly in sequence similarity of these proteins, their functional assignment needs to be examined carefully. In the ORF apdB of anabaenopeptilide gene cluster, there are two MT-domains wherein the first one is entirely different from all the other MT sequences and does not show similarity with any other MT sequence. A dendrogram (Figure 9) of all the 60 MT sequences also illustrates the same pattern of results as obtained by these pairwise alignments. The two onnamide O-MTs in onnB and onnI genes and the N-MTs in pyochelin and anabaenopeptilide show clustering with C-methyltransferases. A stand alone O-MT in stigmatellin (stiK) is sequentially different from other O-methyltransferases, which was observed from the pairwise alignment results and is evident from the dendrogram. Thus it can be concluded from the above analysis that these 18 sequences can be used as templates to identify MT-domains in any given query sequence by pairwise alignment. These new MT templates were included in the current version of NRPS-PKS program for correct identification of various types of MT domains.

Profile HMMs of C-, O-and N-methyltransferases
An alternative approach to detect MT domains in a new sequence is to query that sequence against a database of HMM profiles. Individual profiles were built for C-MT, O-MT and N-MTs using the curated sequences. These profiles were then used to make a HMM profile database for MT domains. The set of 61 C-MT, O-MT and N-MT sequences from experimentally characterized NRPS/PKS clusters were queried against this HMM database and the location of the MT-domain and their class was predicted in these sequences. Table 3 lists the score and E-value for alignment of 18 representative MT sequences with the HMM profiles of N-MT, C-MT and O-MT domains. As can be seen from Table 3 Methyltransferases  Len.  2hg4  1qzz  1vl5  1vlm  1wzn  2gpy  2aot  1g6q  1xxl  1im8  2fr0   917  374  260  219  252  233  292  328  239  244  486   melit01_OM_001  609  C  -C {100}  C  -C  ----C  anaba01_OM_001 263 Column 1 gives the unique name assigned to each MT containing sequence stretch, while the second column indicates their length. Column 3-13 lists the PDB IDs for the structures which show alignment with these sequences in fold recognition analysis using GenTHREADER and PHYRE servers. Results from GenTHREADER corresponding to confidence level CERTAIN and HIGH are labeled as C and H respectively. Similarly, PHYRE results corresponding to precision level 100% and 95% are labeled as 100 and 95 respectively in curly braces. (-) indicates the absence of that particular fold.
MT profile alone. This finding is consistent with results from pairwise sequence alignment discussed in the previous section.
In order to test the predictive ability of our MT HMM profiles further, the recent version of the nr database of NCBI was also searched using these profiles to identify putative Threading alignment C-MT containing sequence (ORF blmVIII) stretch from bleomycin gene cluster  (Table 3). Similarly, a C-MT domain from yersiniabactin synthase was found to be the closest homolog of C-MTs found in NRPS modules of hybrid NRPS/PKS proteins and the sequence similarity was also very high. This MT domain in yersiniabactin synthase catalyzes C-methylation ( Figure 2), but is present as C-CMT-PCP module similar to the domain organization found in C-MT containing hybrid NRPS/PKS proteins found by our profile search. This suggests that our profile search has genuinely identified yersiniabactin type novel C-MT domains embedded in NRPS modules. In case of 20 other NRPS proteins which showed matches with C-MT profiles, the C-MT domains from yersiniabactin, leinamycin and nodularin were found to be closest match. Table 4 shows the domain organization predicted for each of them by HMM profiles, and the percentage identity and similarity with closest matching MT domain in our 18 representative templates set. Even though the closest homolog approach also detects C-MT domains in these proteins in agreement with results from HMM approach, in view of the relatively low sequence identity with C- domains in NRPS/PKS family have evolved to acquire specificities for different substrates. However, a close examination of the chemical structures of pyochelin [37] and yersiniabactin [38] provides a rational for presence of N-MT domains with homology to C-MT sequences. As can be seen from Figure 2, in both these biosynthetic clusters MT domains transfer methyl groups to a five membered rings having very similar chemical structures. They only A '-'sign indicates that alignments resulted in scores with E-value higher than 1.0E-6. Several C-MTs align with O-MT profiles, while only N-MT of pyochelin synthase shows alignment with C-MT profile as well.  differ by the site of methylation. In view of the similarity in the structure of the acceptor substrate, pyochelin type N-MT domains show homology to C-MT proteins. It may be interesting to note that, in a recent study involving MTs from type II PKS pathways, Hertweck and coworkers [24] have observed that, these type II MTs cluster according to the site of methylation not necessarily as C-MTs, O-MTs and N-MTs. Based on this observation, they have proposed that for polyketide alkylation, regioselectivity is more dominant than the type of nucleophile (C-, O-or N-) that is being alkylated. Thus the C-MT domains found in NRPS modules by our profile search could indeed be due to such correlation between sequence of type I MTs and their regioselectivity. These observations have interesting implications of prediction of chemical structures of metabolites by genome mining.
The analysis of MT containing proteins in nr database also revealed the co-occurrence of NRPS/PKS type MT domains with several other catalytic domains apart from core PKS and NRPS domains (Figure 10d). However, in most cases these proteins had O-MT domains and there were relatively few C-MT and N-MT domains. Table 5 shows domain organization for representative proteins from each category and also lists the number of such proteins identified by our HMM profile search. As can be seen, NRPS biosynthetic pathways [39,40] [41].
The MT sequences with correct domain boundaries were also used to build profile HMMs for O-MT, C-MT and N-MT domains. Using these HMM profiles searches were carried out in the nr database of NCBI for identifying various other proteins containing C-MT, O-MT and N-MT domains. It is interesting to note that, apart from core PKS and NRPS domains, these secondary metabolite biosynthetic MT domains are also associated with other important catalytic domains like glycosyltransferase, oxidases, hydroxylases, phosphodiesterases and reductases. These proteins could be interesting targets for experimental characterization. Our analysis also surprisingly revealed the presence of a large number C-MT domains in NRPS modules adjacent to condensation (C) and adenylation (A) domains. These predicted C-MT domains can be classified into two groups. One group of C-MT domains showed high homology to N-MT domain of pyochelin synthase, which is different from typical N-MT domains present in NRPS modules. Unlike typical N-MT domains of NRPS proteins, this shows homology to C-MT domains but catalyzes transfer of methyl group to nitrogen. This group could indeed be pyochelin type N-MT domains, but are mis-classified as C-MT due to their homology with C-MT proteins and under representation of pyochelin type N-MT in our training data set. The closest homolog of the other group of C-MT domains present in NRPS modules is the C-MT domain of yersiniabactin synthase, which is present adjacent to a condensation domain of NRPS, but is indeed a C-MT. However, in view of the low homology between the C-MT from yersiniabactin and this second group of predicted C-MTs, it is difficult to predict whether they are yersiniabactin or pyochelin type MTs. A close examination of the chemical structures of the yersiniabac-tin and pyochelin provides an evolutionary basis for the presence of pyochelin type N-MT domains having homology with C-MT proteins. As can be seen from Figure 2, an identical five membered ring is the acceptor moiety for these two classes of MTs, while pyochelin N-MT methylates at the N position of the ring, the yersiniabactin C-MT methylates the adjacent C position. It may be interesting to note that such correlation between regioselectivity of the site of methylation and MT sequence have also been reported earlier by Hertweck and coworkers [24] for MTs in type II PKS biosynthetic pathways. Experimental characterization of proteins identified by our analysis would help in building more specific profiles for these novel MT domains.

Conclusion
We have carried out a comprehensive bioinformatics analysis of methyltransferase (MT) domains present in PKS/ NRPS clusters having known secondary metabolite products. Based on the site of methylation of these known secondary metabolites, the MT domains have been grouped as N-MT, C-MT and O-MT proteins and sequence/structural features have been analyzed in detail for each group. Based on the results of this analysis, we have developed a novel knowledge based computational approach for detecting MT domains present in PKS and NRPS megasynthases, delineating their correct boundaries and classifying them as N-MT, C-MT and O-MT using profile HMMs. Analysis of proteins in nr database of NCBI using these class specific profiles has revealed several interesting examples of MT domains with novel substrate specificities. Our analysis has also given interesting insight into the evolutionary basis of the novel substrate specificities of these MT proteins. These results have interesting implications for identification of novel secondary metabolites by genome mining and also rational design of novel natural products by biosynthetic engineering.
In order to choose a minimal set of templates representing the entire range of sequence diversity, every sequence was compared with every other sequence using BLAST program [65]. Based on these alignments, sequences with similarity above 50% and alignment length greater than 90% were grouped together. A representative sequence was selected from each group and finally 18 sequences were selected as templates for identification of MT domains by pairwise alignment. These 18 templates included five C-MT, five N-MT and eight O-MT sequences.
The sequence to structure alignments obtained from Gen-THREADER were sorted according to their threading score and the top-ranking hits were selected as possible structural templates. The boundaries of the predicted domains were obtained from the aligned region between the query and the crystal structure. The sequences were then classified into three groups as N-MT, C-MT and O-MTs by correlating with the structure of the metabolic product. The curated sequences of the three classes were taken and MSA was performed using ClustalW2 [66] and ESPript [67] software. MSA was used to study the similarity of the three classes of MT sequences and also find the pattern of conservation of the motifs among these sequences. A phylogenetic tree was obtained from multiple sequence alignment using iTOL [68]. Bootstrapping was performed 1000 times to obtain support values for each node. All significant bootstrapping values (more than 500) are shown in figure 7.