Functional Diversification of Tripeptidylpeptidase and Endopeptidase Sedolisins in Fungi

Sedolisins are acid proteases that are related with the basic subtilisins. They have been identified in all three superkingdoms but are not ubiquitous, although fungi that secrete acids as part of their lifestyle can have up to six paralogs. Both tripeptidylpeptidase (TPP) and endopeptidase activity have been identified and it has been suggested that these correspond to separate subfamilies. We studied eukaryotic sedolisins by computational analysis. A maximum likelihood tree shows three major clades of which two contain only fungal sequences. One fungal clade contains all known TPPs whereas the other contains the endosedolisins. We identified four cluster specific inserts (CSIs) in endosedolisins, of which CSIs 1, 3 and 4 appear as solvent exposed according to structure modeling. Part of CSI2 is exposed but a short stretch forms a novel and partially buried α-helix that induces a conformational change near the binding pocket. We also identified a total of 13 specificity determining positions (SDPs) that are divided over three SDP sub-networks. The major SDP network contains eight SDPs and modeling of virtual mutants suggests a key role for the W307A or F307A substitution. This is accompanied by a group of four SDPs that physically interact at the interface of the catalytic domain and the enzyme’s prosegment. Modeling of virtual mutants suggests these SDPs are indeed required to compensate the conformational change induced by CSI2 and the A307. The additional small SDP networks appear to be linked to the major network all together explaining the hypothesizes functional diversification of fungal sedolisins.


Abstract (250 max)
Sedolisins are acid proteases that are related with the basic subtilisins. They have been identified in all three superkingdoms but are not ubiquitous, although fungi that secrete acids as part of their lifestyle can have up to six paralogs. Both tripeptidylpeptidase (TPP) and endopeptidase activity have been identified and it has been suggested that these correspond to separate subfamilies.
We studied eukaryotic sedolisins by computational analysis. A maximum likelihood tree shows three major clades of which two contain only fungal sequences. One fungal clade contains all known TPPs whereas the other contains the endosedolisins. We identified four cluster specific inserts (CSIs) in endosedolisins, of which CSIs 1, 3 and 4 appear as solvent exposed according to structure modeling. Part of CSI2 is exposed but a short stretch forms a novel and partially buried αhelix that induces a conformational change near the binding pocket. We also identified a total of 13 specificity determining positions (SDPs) that are divided over three SDP sub-networks. The major SDP network contains eight SDPs and modeling of virtual mutants suggests a key role for the W307A or F307A substitution. This is accompanied by a group of four SDPs that physically interact at the interface of the catalytic domain and the enzyme's prosegment. Modeling of virtual mutants suggests these SDPs are indeed required to compensate the conformational change induced by CSI2 and the A307. The additional small SDP networks appear to be linked to the major network all together explaining the hypothesizes functional diversification of fungal sedolisins.

Proteases are ubiquitous enzymes that can be classified in many ways.
Proteases or peptidases degrade proteins by hydrolysis of peptide bonds. They are involved in various biological processes as cell death [1], nutrition [2] and infections [3]. Proteases are classified, primarily based on the catalytic mechanism, into several classes among which aspartic, cystein, metallo and serine proteases. These main classes are further hierarchically organized into clans and families based on homology [4]. Proteases can also be classified based on other characteristics. A major difference can be made between endo-and exopeptidases, where the latter include di-and tripeptidyl peptidases. Among exopeptidases one can discriminate carboxy-and aminopeptidases.
1.2 Sedolisins are acid proteases related with the basic subtilisins.
Perhaps the most common proteases are serine proteases in which a serine serves as the nucleophylic amino acid in the catalytic site. The catalytic site is most often formed by a triade which can differ among the different superfamilies. Currently 12 clans or superfamilies with 29 families have been assigned by MEROPS [5]. One of the most important clans is the SB clan that contains the ubiquitous subtilisins, which include the kexins (S8) [6], and the rather rare sedolisins (S53, for review see [7]). Where the basic subtilisins have a triade that consists of, besides the Serine, a Histidine and an Aspartate, the acid sedolisins have a Glutamate replacing the the Histidine [8]. Despite the different triade and consequently a large difference in optimal activity pH, there is ample evidence the two subfamilies form a superfamily. In 2001 the first structure of a sedolisin, endosedolisin PSCP from Pseudomonas sp. 101, complexed with inhibitor iodotyrostatin, was resolved (PDB code 1GA4) [9], shortly followed by a structure from kumamolysin from Bacillus sp. MN-32 (PDB codes 1T1E for precursor and 1GT9 for mature peptidase). Although sequence similarity between sedolisins and subtilisins is low, structural alignments clearly indicate they are homologous [10] since they have similar folds. Catalytic residues as well as the oxyanion Aspartate appear as homologous. Sedolisins also have a Calcium binding site albeit at a different position than subtilisins [10]. The best studied sedolisin is Human lysosomal CLN2 since mutant forms are involved in the fatal classical late-infantile neuronal ceroid lipofuscinosis or Batten disease. The structure (PDB codes 3EDY for precursor and 3EE6 for mature peptidase) of this tripeptidyl aminopeptidase has been determined and a number of publications describe the effect of many mutations found [11] [12,13]. Of particular interest is W542 which has been shown to be required for activity.
The W542L mutant was shown to be retained in the ER which suggests misfolding [14]. In addition, W290L and W307L showed largely reduced activities.
1.3 Sedolisins have a large prosegment that appears to have various functions.
The processing of subtilisins and sedolisins is also similar. Both have a large and similar prosegment that appears to be able to form an independent domain that seems to be involved in correct folding of the core or the catalytic domain [15] [16]. The prosegment and catalytic domain are separated by a short propeptide or linker that is removed during isozyme activation at low pH.
In general it has been shown that prosegments act in assisting in refolding as well as targeting (for review see [17]). For Human TPP is has been shown that prosegment and catalytic domain have multiple molecular interactions including salt bridges and hydrogen bonds, covering 15% of the solvent accessible surface of the catalytic domain [18]. It has been shown that the prosegment of Human TPP also functions as inhibitor. Secretome analysis of for instance B. cinerea has shown certain paralogs consist of the core part of the enzyme only [19].
secreting fungi as B. cinerea, S. sclerotiorum but also the Aspergilli show relatively few subtilisins and many sedolisins, as compared to non-acid secreting fungi such as Giberella zeae [23]. This also suggests functional diversification has occurred. Here we study the functional redundancy and diversification of fungal sedolisins by computational analysis. We reconstructed a phylogenetic tree that, together with the underlying multiple sequence alignment (MSA), was used for the identification of cluster specific inserts (CSIs) and specificity determining positions (SDPs), which form the sequence characteristics that can explain functional diversifications. Modeling of wild type and mutant sequences was performed in order to show how functional diversification into endosedolisins and TPP has likely occurred, demonstrating important roles for part of CSI2 and the position homologous to Human TPPs W307.

Sedolisin sequence identification.
The MEROPS [5] holozyme sequences from the MEROPS database were used to build a first HMMER [24] profile that was used to search a total of 343 complete proteomes as well as the PDB and SwissProt databases. The obtained sequences were scrutinized using MEROPS batch blast [25] which includes a scrutiny for the presence of catalytic site and oxyanion residues.
Confirmed sedolisin sequences were aligned and the resulting MSA was used for a second HMMER screening. This process was iterated until convergence.

MSA and Phylogeny
MSAs were constructed using MAFFT's [26] iterative refinement method and represented using Endscript [27] [28]. The final alignment was manually corrected using as criteria that secondary structure elements (taking as reference 3EE6) should be represented by each sequence combined with entropy minimization. The MSA was trimmed using BMGE [29] (-g 0.3 and h 0.8) which resulted in a trimmed MSA that maintains, among others, over 90% of columns corresponding to secondary structure elements. Maximum likelihood phylogeny was constructed using 5 Of 23 85 90 95 100 105 PHYML3.1 [30] using the WAG model of amino acid substitution using a discrete gamma model of 4 categories and a shape parameter of 1.4, as determined by prior ProtTest [31] analysis. Statistical support was determined by PHYML's aLRT as well as boostrapping (n = 1024) using the resources available at https://www.e-biothon.fr. The major graphical representations were made using Dendroscope [32]. The presentations showing the CSIs were made using iTOL [33].

Identification of Specifity Determining Positions (SDPs).
We identified CDPs using SDPfox [34]. MISTIC [35] was used to determine levels of mutual information (MI) between positions or columns of the MSA. SDPs are then defined as CDPs with at least two direct connections with other CDPs, using MISTIC's default z-score cut-off of 6.5. CDPs that only connect to a single other CDPs were considered putative SDPs. Cytoscape [36] was used to identify and draw networks of directly connected SDPs. Sequence Logo's were made using Weblogo [37].

Structure analysis
3D structures of sedolisins were obtained from the Protein Data Bank [38]. 3EE6 [18], corresponding to mature Human TPP was used as a standard. Models were made using I-Tasser [39] using either default settings or using 3EE6 as the reference model. Binding pockets were predicted using Coach [40]. Visualization was performed using VMD which included structural alignment using the STAMP [41] extension. The pocket predictions for 3EE6 and the SED_A and SED_B models were performed with the software Fpocket [42] using the default parameters.

Datamining, Multiple Sequence Alignment and Phylogeny
In order to perform structure-function prediction of eukaryote sedolisins we set out to obtain a  Fig. 1. In general, eukaryotic sedolisins are largely conserved, including the prosegment part. Interestingly, of the three disulfide bridges identified in the resolved structure 3EE6, only the second appears to be conserved among eukaryotes. A trimmed MSA, lacking low quality subalignments, was used to reconstruct a maximum likelihood tree.  . The apparent random taxonomic distribution of the fungal sequences over the two well separated clades indicates the bifurcation is caused by a functional diversification. In correspondence with Reichard and coworkers [22], one of the fungal clades contains SED_A from A. fumigatus and Aorsin [43] from A. oryzae that have both been characterized as endopeptidases .
The other fungal clade contains SED_B, SED_C and SED_D from A. fumigatus, all characterized as TPPs [22]. Although not supported by a lot of biochemical evidence the above mentioned data suggest sequence diversification has resulted into endo and TPP sedolisins, hence for the remainder of the manuscript we will refer to the sequences and clades as Hypo-endosedolisin and Hypo-TPP.
Both endo and TPP activity has been shown for non-fungal sedolisins, hence, as such we have no indication regarding the state of the ancestral enzyme. In addition, at least Human TPP has been shown to have endo activity at low pH [44]. CSI3 is present in all Hypo-endosedolisins as well as in certain Hypo-TPPs and some non-fungal sedolisins. CSI1 is present in all sedolisins but is longer in the Hypo-endosedolisins. CSIs 1 and 3

Fig. 2. Phylogenetic Clustering and Cluster Specific Inserts of Eukaryotic sedolisins
show no clear conservation.
Because structure determines function we wondered if the CSIs would interfere in the core structure of the protein or if they might appear as solvent exposed loops. We created a model of SED_A and structurally aligned it with 3EE6 (Fig. 3). Overall, the model is very similar to the 3EE6 structure and the non-homologous CSIs 1, 3 and 4 are indeed predicted at the surface of the mature protein. CSI2 is largely exposed but forms a partially buried helix as is discussed below.
This confirms that the CSIs not necessarily affect the basic functional fold of subtilisin-like proteases. Although in the absence of homologous template the prediction of the loop structures formed by the CSIs is not very reliable, the model does allow some general predictions. CSI1 is located next or into the prosegment and likely does not form part of the mature enzyme but rather forms part of the propeptide or linker region (See Fig. 1). CSI2 and CSI4 are located on opposite ends of the predicted binding pocket (Not shown). CSI3 appears near the calcium binding site.

SDP identification
We used SDPfox [34] to identify CDPs, positions that contribute significantly to the underlying clustering. We identified 25 CDPs between the Hypo-endo and the Hypo-TPP clusters. Then we performed an analysis of mutual information between positions using MISTIC [35]. Mutual information expresses levels of covariation and high levels suggest co-evolution. CDPs might result form genetic drift but CDPs that show high levels of interaction are more likely specificity determining positions (SDPs). Table 1   Next we considered how evolution might have occurred. As stated before, we have no insight in the exact functional characteristics of the ancestral protein. This might have been an endo-sedolisin, a TPP or a protease capable of both reactions. On the one hand we must assume the diversification process in one clade has been largely independent from that in another clade. On the other hand we can also envisage that the diversification processes, although strictly independent, might affect the same positions. As a result, a priori one does not know which dataset should be used to determine MI levels. We first performed a global MI analysis comparing the networks obtained when using only the Hypo-endo, the Hypo-TPP and all fungal sequences combined, respectively. Fig. 5 shows that when separate clades are used as dataset, similar networks with partitions in two major clusters are identified. When all fungal sequences are used the MI network partitions into three major clusters, showing little similarity to the partitions of the separate subfamily MI analyses. Hence, it is more likely that the diversification has been the result of independent processes and that as such MI analysis for the identification of SDNs involved in the fungal sedolisin diversification should be determined using clade specific datasets. We identified eight directly connected CDPs using the MI analysis of the Hypo-endo sequences (Fig. 6.). The high level of connection suggests these form an SDN of co-evolving SDPs that would correspond with the subfamily´s functional diversification towards and endopeptidase. In addition we identified a pair of connected CDPs that, since we cannot establish connectivity levels, we initially considered as putative SDPs. The Hypo-TPP MI analysis identified a three node SDN that seems unrelated to the Hypo-endosedolisin SDPs, confirming that the diversification processes in both subfamilies are independent and that in this case, as such, for the determination of SDNs it is necessary to use the alignment of each of the studied subfamilies.

Shown are the MI networks obtained for the hypo-endo cluster; the hypo-TPP cluster; and the combined hypo-endo / hypo-TPP cluster containing all fungal sequences. Red nodes represent the same positions in the three MI networks demonstrating high similarity between the Hyp-endo and
Hypo-TPP MI networks only.

SDPs, i.e. CDPs with high mutual information determined on the Hypo-Endo and Hypo-TPP subclade datasets form one highly connected Specificity Determining Network each. Sequence logos show residue conservation at the involved positions in the Hypo-Endo, the Hypo-TPP and
non-fungal clades.

Structure-Function Prediction
Probably not all SDPs of the identified SDN will have the same impact on enzyme function. The most highly connected SDPs that between clades have residues with a large physicochemical difference, are most likely to explain the functional diversification. Both SDP346 and SDP89, located in the core and the prosegment respectively, connect to five other SDPs (Fig. 6). Core SDP346 shows a substitution of the amphipathic, positively charged Lysine in Hypo-endo (K407 in SED_A) to most often the polar glutamine in Hypo-TPP (Gln377). In Human TPP 3EE6, K346 is located on the interface between core and prosegment interacting with E343 (Fig. 7A). Although SED_A has the same residues, K407 (homologous to K346 from 3EE6, see Table 1) physically interacts with SDP89 (F92). This, according to the model, seems to be caused by a slight dislocation of helix 9 in SED_A, on its turn forced by the additional helix 9b, which results from CSI2. The corresponding Q377 of SED_B was modeled internally and is likely stabilized in the opposite direction by the polar residue Arg 381 in SED_B (Fig 7C) whereas F92 is replaced by a Leucine.
The hydrophobic interaction between the large F92 and the aliphatic chain of K407 suggests SDP346 is involved in the interaction between prosegment and core. The ε-amino group of K407, no longer forming a salt bridge with E404 is envisaged to be protonated upon secretion into the acid environment, thereby destabilizing the interaction between core and prosegment regions. Given the suggested role of the prosegment in sedolisin folding and stabilization [15] this could explain at least part of the diversification.
13 Of 23 285 290 295 300 Fig. 7. Interaction of SDP346, SDP343 and SDP89 at the interface between core and prosegment. SDP349 also occurs in the vicinity of SDP346 and connects directly to both SDP346 and SDP89.

Cartoons showing interaction in 3EE6 (A), SED_A (B) and SED_B (C
Alanine, common in Hypo-TPP is slightly less hydrophobic than the predominant Leucine of Hypoendosedolisins. SDP 340 is also in close range of SDP346 and connects directly to SDP89. It shows predominantly a polar Glutamine in Hypo-Endo and a hydrophobic Valine in Hypo-TPP clade. Interestingly, SDP340 is found next to to position 341 that corresponds with one of two cysteines that are absent in non-fungal sedolisins and strictly conserved in fungal sedolisins (See Figure 1).
Since strictly conserved cysteine pairs often correspond with disulfide bridges we checked their orientation in the models of SED_A and SED_B. Although in both the SED_A (S Fig. 1D) and the SED_B model they are modeled at positions that seem to favor a disulfide bridge, this is not modeled. Nevertheless, the virtual C402A / C452A double mutant of SED_A also reverts to the structure lacking helix 9b (Fig. 4C). All together, this suggests that the SDN is related to the predicted structural changes discussed in the previous section.
SDP307, part of helix 9, is a position that, according to the MI analysis, interacts with SDPs 346, 349 and 89. Although in the structure of Human TPP 3EE6 W307 is located at over 9 Å from the local SDP network described above, in the model of SED_A, its counterpart Alanine is found at since the principal helices H9 and 10 are modeled nearly identically (Fig 4D and E).
The other SDPs show a lower level of connectivity and might involve secondary compensations. SDP 73 connects to key SDP346 as well as to SDP343. SDP 73 is mostly S in the Hypo-Endo and H/N in the Hypo-TPP clade. Structural analysis reveals that this SDP is located in the hinge region preceding the helix that contains SDP89 and following the helix that contains the pair of putative SDPs 55 and 58 that we identified (S Fig. 1). The mentioned pair 55-58 physically interacts with SDP89 and SDP346 suggesting that they might be involved in the same diversification process. SDP73 not only might affect the interaction of SDP89 and SDP 346 but the interaction between prosegment and core in general. Finally, SDP 260 only connects to residue 340 in the SDN and is distant to the local SDP network. The fact that it concerns a proline located at a hinge position in the core, which is important in structural conformation, in Hypo-Endo sedolisins suggests this SDP might be involved in a conformational change. The Hypo-TPP network (SDP301-SDP302-SDP304) shows less pronounced physicochemical differences in residue occupation.
Interestingly, it is situated within contact range of the SDP307.

Discussion
We studied eukaryotic sedolisins by computational analysis of protein sequences. The protein sequences were obtained from a large set of complete proteomes among which many of fungal origin since it has been suggested that acidification is related to expansion of the sedolisin familiy in 15 Of 23   335   340   345   350 fungi. Therefore, the main attention of this study is directed at the evolutionary history of fungal sedolisins.
Sequences were aligned by MAFFT's iterative refinement method and the resulting MSA was manually improved in order to correct poorly aligned hallmark residues as well as residues in secondary structure elements. Correction was likely required due to large taxonomic distances and the presence of CSIs which tend to disturb the alignment process. Prior tree reconstruction, the MSA was subjected to trimming, guided by maintaining particularly β-sheets, in order to remove these CSIs and unreliably aligned subsequences. The clustering pattern of the CSIs largely corresponds with the phylogenetic clustering (See figure 2B), which confirms the topology of the tree is basically correct, as is also indicated by bootstrap support (See figure 2A and B).
Interestingly, the tree contains two fungal clades and a clade with other eukaryotic sedolisin sequences. Since the fungal sequences are clearly separated from the non-fungal eukaryotic sequences it appears the evolutionary rate of fungal sedolisins has been higher than that of other sedolisins. A possible explanation for the accelerated evolutionary rate in fungi is given by the fact that fungal sedolisins are predicted to be secreted, whereas Human TPP is lysosomal. Tree topology combined with the apparent random distribution of fungal paralogs over the two fungal clades clearly suggests functional diversification. Although additional subclades can be distinguished and certain fungi have up to six paralogs, we focused on comparing the two major fungal clades based on the observation, initially made by Reichard and coworkers [22], that characterized TPPs are found in a different clade than endo-sedolisins (Fig 2). We studied the role of CSIs and SDPs in the the reliability of the model and confirm that the CSI regions are unreliably modeled whereas in general confidence of the models is good (e.g. see for model of SED_A in Fig 4A and B). The fact that the loops are likely incorrectly modeled does however not imply that their approximate location is incorrect and we envisage their solvent exposed location does not severely affect the folding of the core. It is difficult to envisage or explain how CSIs 1, 3 and 4 are related with the hypothesized diversification towards endosedolisin and TPP but the fact that CSI2 seems to instigate a conformational change and that removing CSI2 from SED_A yields a model that resembles the fold of 3EE6 (Fig. 4) is intriguing. Furthermore, CSI2 and CSI4 are conserved among all and most Hypo-endosedolisins respectively ( Figure 2B) and are located on opposite sides of the predicted pocket cavity of Hypo-endosedolisins. As such they might affect the exact conformation of the pocket. They also may have some importance in substrate binding or contain retention signals. CSI3 and CSI1 show no clear conservation and are found near to the calcium binding site and in the progsegment respectively and there are no clues regarding their functions.
A number of approaches and softwares to predict SDPs exist, although it might be noted that most often CDPs are identified. CDPs are most likely the result of positive selection but neutral evolution or even phylogenetic reconstruction artifacts can also result in CDPs. Diverge [45] should theoretically contemplate this by using likelihood models. A recent version of evolutionary trace [46] includes mutual information in order to substantiate its predictions. We combined SDPfox [34], to identify CDPs, with MISTIC [35], to confirm CDPs as SDPs. In a recent paper we showed that combination identified a number of known target positions in the evolution of truncated hemoglobins [47], which shows the applicability of the method. only 13 SDPs that, as will be described below, appear to be related to a single functional diversification, just as we expected based on the tree topology.
The advantage in using SDNs is not only reflected in that it allows for a substantiation of SDP prediction, it also relies in the fact that it shows co-evolving partners allowing a more elaborated explanation and improved predictions and understanding of the complete process of functional diversification. As stated before, based on the tree topology we envisaged a single functional diversification towards, based on biochemical evidence, endosedolisin and TPP activity.
Although there are many uncertainties since it seems many factors, including SDPs but also at least related with the interaction with the chaperone-like prosegment, as shown by the apparently important interaction between the aliphatic sidechain of K346 and F89, further strengthened by the combination of SDP73 and the pSDP pair of 55 and 58, as demonstrated in S. Fig. 1. Interestingly the putative SDP pair we identified also points to this interaction and more easily explains the role SDP73 might play.
Another consideration that can be made is how evolution has most likely occurred. Although we did not perform a prediction of the ancestral fungal sequence, a number of observations should be made. We envisage the absence of the disulfide bridge C402-C452 in the SED_A model as a a modeling artifact. The fact that two Cysteines are modeled within 5 Å distance combined with the fact that they are strictly conserved among fungal sedolisins, is not likely a mere coincidence. It can be envisaged that such a disulfide bridge stabilizes the enzyme in the oxidative extracellular environment and that the invoked structural change triggered compensations which we hypothesize have resulted in the diversification towards TPP and endosedolisin activity. This predicted bridge appears to be required for the conformational change near the binding pocket as is shown by the virtual mutant ( Fig 4C). However, the presence of the disulfide bond in SED_B does not affect the conformation, suggesting its presence is not sufficient for the conformational change in SED_A. An obvious requirement is the presence of at least part of CSI2. Interestingly CSI2 is the only CSI that corresponds perfectly with the clustering into Hypo-endo and Hypo-TPP. The ancestral SDP307 was likely a Tryptophan but the other SDPs that show significant physiochemical changes seems to correspond with endosedolisins (S73, E343, K346 in 3EE6 and SED_A).
Although SDPs, mutant analysis and literature confirm each other, future validation of the proposed functional differences requires molecular dynamics and or wetlab experiments. Given the sheer amount of characteristics (i.e. CSIs and SDPs) involved, this will be difficult to achieve. As such, maybe the first approach is to obtain a structure of SED_A or another characterized endosedolisin such as Aorsin.
The remaining question concerns how this change in conformation is related to the difference in activity. TPP activity can be envisaged to require a less spacious binding cleft, as can be seen by comparing Fig. 4D and E, in concordance with the fact that SDP307 is occupied by a large hydrophobic residue in the Hypo-TPP and NF clades and by small hydrophobic in Hypo-Endo.