LacSubPred: predicting subtypes of Laccases, an important lignin metabolism-related enzyme class, using in silico approaches
BMC Bioinformatics volume 15, Article number: S15 (2014)
Laccases (E.C. 1.10.3.2) are multi-copper oxidases that have gained importance in many industries such as biofuels, pulp production, textile dye bleaching, bioremediation, and food production. Their usefulness stems from the ability to act on a diverse range of phenolic compounds such as o-/p-quinols, aminophenols, polyphenols, polyamines, aryl diamines, and aromatic thiols. Despite acting on a wide range of compounds as a family, individual Laccases often exhibit distinctive and varied substrate ranges. This is likely due to the involvement of Laccases in many metabolic roles across diverse taxa. Classification systems for multi-copper oxidases have been developed using multiple sequence alignments; however, these systems seem to largely follow species taxonomy rather than substrate ranges, enzyme properties, or specific function. It has been suggested that the roles and substrates of various Laccases are related to their optimal pH. This is consistent with the observation that fungal Laccases usually prefer acidic conditions, whereas plant and bacterial Laccases prefer basic conditions. Based on these observations, we hypothesize that a descriptor-based unsupervised learning system could generate a homology-independent classification system for better describing the functional properties of Laccases.
In this study, we first utilized an unsupervised learning approach to develop a novel homology-independent Laccase classification system. Of the descriptors considered, physicochemical properties showed the best performance and divided the Laccases into twelve subtypes. Analysis of the clusters using a t-test revealed that the majority of the physicochemical descriptors had statistically significant differences between the classes. Feature selection identified the most important features as negatively charged residues, the peptide isoelectric point, and acidic or amidic residues. Secondly, to allow for classification of new Laccases, a supervised learning system was developed from the clusters. The models showed high performance with an overall accuracy of 99.03%, error of 0.49%, MCC of 0.9367, precision of 94.20%, sensitivity of 94.20%, and specificity of 99.47% in a 5-fold cross-validation test. In an independent test, our models still provided a high accuracy of 97.98%, error rate of 1.02%, MCC of 0.8678, precision of 87.88%, sensitivity of 87.88% and specificity of 98.90%.
This study provides a useful classification system for better understanding Laccases from the perspective of their physicochemical properties. We also developed a publicly available web tool for the characterization of Laccase protein sequences (http://lacsubpred.bioinfo.ucr.edu/). Finally, the programs used in the study are made available for researchers interested in applying the system to other enzyme classes (https://github.com/tweirick/SubClPred).
Laccases (EC 1.10.3.2) are the largest sub-group of multi-copper oxidases, a family which also includes ascorbate oxidases (EC 1.10.3.3), ferroxidases or ceruloplasmins (EC 1.16.3.1) and nitrite reductases (EC 1.7.2.1). Laccases were first discovered in the sap of the Japanese lacquer tree Rhus vernicifera. Since then they have been found in many taxa including plants, fungi, bacteria, and metazoa. Laccases are involved in a diverse range of cellular activities such as lignin degradation, lignin biosynthesis, pigment production, plant pathogenesis, melatonin production, spore coat resistance, morphogenesis and detoxification of copper [1–5]. Laccases are also widely used for industrial purposes. For example, Laccases are used in the paper and pulp, textile, and petrochemical industries for detoxification of industrial effluents. In medicine, Laccases are used for certain medical diagnostics and as catalysts for the manufacture of anti-cancer drugs. They are also used for environmental remediation of herbicides, pesticides, and explosives in soil, and as cleaning agents for certain water purification systems. In commercial products, they are found in cosmetics, denim bleaching, wine and beer stabilization, fruit juice processing, color enhancement of tea and even baking [6, 7]. Laccases are popular in industry for a number of reasons. They are better for the environment and have fewer non-specific reactions than conventional oxidation technologies. Many Laccases are extracellular enzymes, which makes their purification simple. Compared with other oxidative enzymes, they are easier to use, as they catalyze reactions with molecular oxygen and do not need reactive oxygen species catalysis [6, 8]. Currently, fungal Laccases are the most widely studied and commercially used Laccases. However, there is also much interest in bacterial Laccases due to their higher temperature stability and their ability to operate at different pHs than fungal Laccases.
Generally, Laccases are dimeric or tetrameric glycoproteins, with each monomer containing a copper-containing site. These copper sites may be one of three types: Type-1 or blue copper, Type-2 or normal copper, and Type-3 or coupled-dinuclear centers. The copper binding motifs have been shown to be highly conserved across all Laccases, with a trend towards greater similarity in the N- and C-terminal domains, as these are the copper-containing domains. It has been noted that the central binding pockets are larger in bacterial Laccases than in fungal or plant Laccases. The copper binding sites also show significant differences in conserved residues among the Laccases of bacteria, fungi, and plants.
Fungal Laccases comprise the bulk of experimentally studied Laccases. They occur in many fungal species and are thought to play important roles in morphogenesis, fungal-plant interactions, stress defense, pigment production, and lignin degradation. While they are typically studied with respect to biomass degradation, most fungi have been found to produce several isoenzymes that differ in type, enzymatic or physical properties, and expression levels. These can vary even more between species. For example, it has been reported that one of the most efficient lignin degraders, Phanerochaete chrysosporium, produces a Laccase different from those of other efficient lignin-degrading fungi. While most Laccases are extracellular enzymes, many fungal taxa also produce intracellular Laccases. This is especially interesting when compared with enzymes of similar function such as lignin peroxidases, which are strictly extracellular. It is speculated that the cellular localization of Laccases may be connected to their function and substrate ranges. This hypothesis remains unresolved because the majority of studied fungal Laccases come from wood-rotting basidiomycetes. The enzymatic properties of fungal Laccases vary greatly. Temperatures range from 25-80 °C. pH optima range from 2.0-5.0 for 2,2'-azino-bis(3-ethylbenzothiazoline-6-sulphonic acid) (ABTS), 3.0-8.0 for 2,6-dimethoxyphenol (DMP), 3.0-7.0 for guaiacol, and 3.5-7.0 for syringaldazine. Similarly, Km values (µM) span wide ranges: 4-770 for ABTS, 26-14720 for DMP, 4-30000 for guaiacol, and 3-4307 for syringaldazine. The kcat values (s⁻¹) also vary over a broad range: 198-350000 for ABTS, 100-360000 for DMP, 90-10800 for guaiacol, and 16800-28000 for syringaldazine. These properties can further be altered by glycosylation.
Traditionally, plant Laccases were considered to be only extracellular enzymes involved in radical-based lignin polymerization. However, a high degree of divergence among Laccases within a single plant species has been observed; ryegrass, for example, contains 25 different Laccase genes. It has also been reported that some plant Laccases lack N-terminal signal peptides for secretion but have signals targeting other cellular components such as the endoplasmic reticulum or peroxisomes. Another study on poplars showed that Laccase repression had no effect on lignin production. Despite the evidence for novel functions and the many known functions in other taxa, the grouping of plant Laccases still remains elusive.
Bacterial Laccases are known to be widespread in prokaryotes; however, only a few have been experimentally characterized. To date, bacterial Laccases have mostly been found to be involved in lignin degradation, catabolism of phenolic compounds, cell pigmentation, morphogenesis, and copper defense [12–14]. The best studied bacterial Laccase is CotA, an endospore coat protein from Bacillus subtilis which produces a melanin-like pigment. This enzyme has generated a high amount of interest due to its extremely high temperature stability. Bacterial Laccases are also unique due to the lack of cellular compartments in prokaryotes: the reactions catalyzed by Laccases can produce quinones and semiquinones as by-products, which are powerful inhibitors of the electron transport chain.
In metazoans, Laccases exist in mammals as well as invertebrates. The roles of Laccases in mammals do not appear to be well understood; however, insect Laccases are known to be involved in cuticle formation. Cuticle tanning, also known as sclerotization and pigmentation, is the process through which proteins in the exoskeleton are conjugated. This causes the exoskeleton to become insoluble, harder, and darker.
Classification of Laccases: current view
Laccases are currently classified as part of a larger classification scheme for multi-copper oxidases [15, 16]. This scheme is based on multiple sequence alignments and seems to classify by taxonomical association. The current classification system, "The Laccase Engineering Database" (LccED), classifies multi-copper oxidases into eleven classes: basidiomycete Laccases, ascomycete Laccases, insect Laccases, fungal pigment MCOs, fungal ferroxidases, fungal and plant ascorbate oxidases, plant Laccases, bacterial CopA proteins, bacterial bilirubin oxidases, bacterial CueO proteins, and SLAC homologs.
Machine learning-based classification systems
As discussed above, the current classification system for Laccases largely follows species taxonomy rather than substrate ranges, enzyme properties, or specific function, although it has been observed that individual Laccases often exhibit distinctive and varied substrate ranges and have different functions associated with distinguishing pH values among different taxa. We hypothesize that a descriptor-based computational prediction system could be developed to generate a homology-independent classification system for better describing the functional properties of Laccases. In a previous study on feruloyl esterases (EC 3.1.1.73), an unsupervised learning approach was used to create a novel homology-independent classification system for this enzyme class. Various bioinformatics tools were used to validate the identified classes. In the present study, we followed a two-step computational strategy to identify and classify Laccase subtypes by developing a Python command line-based implementation of unsupervised and supervised learning approaches, respectively. Further, we implemented our prediction models as a web-based prediction server to classify novel Laccase sequences into subtypes. The tool could also be useful to biofuel researchers and industry.
Alternate names for Laccases were found via cross-referencing with the KEGG database (http://www.kegg.jp/dbget-bin/www_bget?ec:1.10.3.2). To search for Laccase sequences, we combined these names into a basic query and searched UniprotKB using the search terms listed in Table 1. Sequences with protein or transcript level evidence were selected to ensure high quality data and to avoid potentially mislabeled multi-copper oxidases. Using the "browse by" option in Uniprot's GUI, the query results were checked for possible contaminating sequences. The contaminant sequences were filtered out using NOT conditions (see Table 1). Finally, 329 protein sequences were collected, with an average sequence length above 200 residues. To further validate the quality of the dataset, the protein descriptions were analyzed with the text clustering functionality in Google-Refine version 2.5. Significant variation was found in the protein descriptions, but no cases of contamination were found. As a final check of data quality, the lengths of the sequences were calculated and plotted as a bar graph, shown in Figure 1. Sequences containing non-standard or ambiguous characters were removed from the data set.
Feature representation of Laccase proteins
Extracting informative features from protein sequences is important for the performance of a machine learning method. We used several feature representations: amino acid composition (AAC), Conjoint Triad (CT), Composition-Transition-Distribution (CTD), Dipeptide composition (DIPEP), Geary autocorrelation, Moran autocorrelation, Moreau-Broto autocorrelation, physicochemical properties, and a composite vector of amino acid composition and physicochemical properties.
Amino acid composition (AAC)
Each protein sequence is represented as a 20-dimensional feature vector, with each element corresponding to the fraction (reported as a percentage) of one of the twenty amino acids. For a given protein sequence x, let the function f(x_i) represent the number of occurrences of the i-th of the 20 standard amino acids. Thus, the composition of the amino acids P_x in the given sequence can be represented as,
$$P_x = \{P(x_1), P(x_2), \ldots, P(x_{20})\}$$
where P(x_i) is given as,
$$P(x_i) = \frac{f(x_i)}{\sum_{j=1}^{20} f(x_j)}, \quad i = 1, 2, \ldots, 20$$
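The composition above can be illustrated with a short Python sketch. This is a minimal illustration rather than the code used in the study; the function name `amino_acid_composition` and the alphabetical residue ordering are assumptions made for the example.

```python
from collections import Counter

# The 20 standard amino acids; the ordering is an assumption of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20 fractions P(x_i) = f(x_i) / sum_j f(x_j)."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    vector = amino_acid_composition("MKKLAAG")
    print(dict(zip(AMINO_ACIDS, vector)))
```

Multiplying each element by 100 gives the percentage form.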
Dipeptide composition (DIPEP)
Dipeptide composition is similar to amino acid composition. However, it considers the fractions of the 400 possible dipeptides occurring in a given protein sequence. Thus, the composition of each dipeptide is given as,
$$P(d_i) = \frac{f(d_i)}{\sum_{j=1}^{400} f(d_j)}, \quad i = 1, 2, \ldots, 400$$
where P(d_i) is the ratio of the number of instances f(d_i) of a specific dipeptide d_i to the total number of all dipeptides in the sequence.
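A minimal Python sketch of the dipeptide composition follows. It assumes overlapping dipeptides (a sliding window of width two) and skips any pair containing a non-standard character; both choices, as well as the function name, are assumptions of this example rather than details confirmed by the text.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of standard amino acids, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element vector of dipeptide fractions."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]  # overlapping window of two residues
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[d] / total for d in DIPEPTIDES]
```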
Conjoint triad (CT)
The conjoint triad descriptor considers the sequence order effect in addition to amino acid composition. It is calculated by grouping the 20 standard amino acids into 7 groups based on physical and chemical similarity [(A,G,V), (I,L,F,P), (Y,M,T,S), (H,N,Q,W), (R,K), (D,E), (C)]. Triads are made from all combinations of three of these groups, resulting in a vector length of 343 (7 × 7 × 7). Thus, a protein sequence is represented as,
$$P(t_i) = \frac{f(t_i)}{\sum_{j=1}^{343} f(t_j)}, \quad i = 1, 2, \ldots, 343$$
where f(t_i) is the number of occurrences of a specific triad t_i and the denominator is the number of all triads in the sequence.
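The conjoint triad vector can be sketched in Python as below. The seven groups are those listed in the text; the index layout of the 343 elements and the function name are assumptions of this sketch. Non-standard residues are simply skipped here, which is harmless for the dataset described, since such sequences were removed.

```python
# The seven residue groups given in the text, indexed 0..6.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}


def conjoint_triad(sequence):
    """Return the 343-element vector of triad fractions."""
    classes = [GROUP_OF[aa] for aa in sequence.upper() if aa in GROUP_OF]
    total = len(classes) - 2  # number of overlapping triads
    if total < 1:
        raise ValueError("sequence is too short to contain a triad")
    counts = [0] * 343
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        counts[a * 49 + b * 7 + c] += 1  # base-7 index of the triad
    return [n / total for n in counts]
```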
In the Composition-Transition-Distribution (CTD) representation, three local descriptors, Composition (C), Transition (T) and Distribution (D), are used in combination to construct the feature vector. These descriptors are based on the variation in the occurrence of functional groups of amino acids within the primary sequence of a protein. Thus, before computing this feature, the twenty amino acids are clustered into seven functional groups based on the dipoles and volumes of the side chains. The composition descriptor computes the occurrence of each amino acid group along the sequence. Transition represents the percentage frequency with which an amino acid in one group is followed by an amino acid in another group. The distribution descriptor reflects the dispersion pattern along the entire sequence by measuring the location of the first, 25, 50, 75 and 100% of residues of a given group. Hence, a total of 63 features (7 composition, 21 transition and 35 distribution) are constructed to represent a protein.
Autocorrelation feature vectors
Autocorrelation features describe the level of correlation between residues separated by a given distance within a protein sequence in terms of a specific physicochemical property. They are defined based on the distribution of amino acid properties along the sequence. Eight amino acid properties were used for deriving the autocorrelation descriptors.
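The Moran autocorrelation (MAC) descriptor of a protein is defined as:
$$I(d) = \frac{\frac{1}{N-d}\sum_{j=1}^{N-d}\left(P_j - \bar{P}\right)\left(P_{j+d} - \bar{P}\right)}{\frac{1}{N}\sum_{j=1}^{N}\left(P_j - \bar{P}\right)^2}$$
where N is the length of the protein sequence, d = 1, 2, ..., 30 is the distance between one residue and its neighbors, and P_j and P_{j+d} are the properties of the amino acids at positions j and j+d, respectively. $\bar{P}$ is the average of the considered property P along the sequence.
The Geary autocorrelation (GA) descriptor of a protein is defined as:
$$C(d) = \frac{\frac{1}{2(N-d)}\sum_{j=1}^{N-d}\left(P_j - P_{j+d}\right)^2}{\frac{1}{N-1}\sum_{j=1}^{N}\left(P_j - \bar{P}\right)^2}$$
where $\bar{P}$, N, P_j and P_{j+d} are defined in the same way as above.
The Moreau-Broto autocorrelation (MBA) descriptor of a protein is defined as:
$$AC(d) = \sum_{j=1}^{N-d} P_j\,P_{j+d}$$
where d, N, P_j and P_{j+d} are defined in the same way as above.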
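The three autocorrelation descriptors can be sketched in Python using their standard textbook forms. The functions below take a list of per-residue property values (for example, a hydrophobicity value for each position) and a lag d; the function names are hypothetical, and the exact normalization used in the study may differ from what is assumed here.

```python
def moran(values, d):
    """Moran autocorrelation at lag d for a list of property values."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[j] - mean) * (values[j + d] - mean)
              for j in range(n - d)) / (n - d)
    den = sum((v - mean) ** 2 for v in values) / n
    return num / den


def geary(values, d):
    """Geary autocorrelation at lag d."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[j] - values[j + d]) ** 2
              for j in range(n - d)) / (2 * (n - d))
    den = sum((v - mean) ** 2 for v in values) / (n - 1)
    return num / den


def moreau_broto(values, d):
    """Unnormalized Moreau-Broto autocorrelation at lag d."""
    n = len(values)
    return sum(values[j] * values[j + d] for j in range(n - d))
```

For a full descriptor, each function would be evaluated for every lag d from 1 to 30 and for each of the amino acid properties, and the results concatenated. A sequence whose property values are all identical gives a zero denominator in the Moran and Geary forms and would need separate handling.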
Physicochemical properties of amino acids have been used successfully in numerous prediction tools. In this study, we grouped the amino acids of a protein into classes based on their physicochemical properties. The theoretical pI, molecular weight, and length of the protein are also used in the feature vector. The non-composition-based values are divided by the length or mass of the protein titin in order to provide values between zero and one. Molecular weights were calculated by summing the weights of the amino acids in the sequence. A detailed description of these properties is provided in Table 2.
Split amino acid composition
Split amino acid composition aims to capture information about signal peptides in the N- or C-terminal region. The amino acid compositions of the N-terminal region, the center, and the C-terminal region are computed separately and then concatenated. The N- and C-terminal regions are the first and last 25 amino acids in the sequence. Thus a protein sample is represented as a 60-element vector,
$$P_{split} = \{P^{N}(x_1), \ldots, P^{N}(x_{20}),\; P^{center}(x_1), \ldots, P^{center}(x_{20}),\; P^{C}(x_1), \ldots, P^{C}(x_{20})\}$$
Unsupervised learning organizes data based on the similarity patterns between samples. In this study, clustering was used to group the data into classes whose members share a type of similarity not found in other classes. We followed a methodology similar to that of the previous study on feruloyl esterases. We first used a self-organizing map (SOM) to identify the likely number of groups in the dataset and then used that information in k-means clustering to divide the data into clusters.
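A minimal Python sketch of the split amino acid composition is given below. The 25-residue terminal length follows the text; the function names and the handling of sequences too short to have a center region are assumptions of this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def _composition(segment):
    """Fractions of the 20 standard amino acids in one segment."""
    total = sum(segment.count(aa) for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * 20
    return [segment.count(aa) / total for aa in AMINO_ACIDS]


def split_amino_acid_composition(sequence, terminal_len=25):
    """Concatenate N-terminal, center and C-terminal compositions (60 values)."""
    seq = sequence.upper()
    if len(seq) <= 2 * terminal_len:
        raise ValueError("sequence is too short to have a center region")
    n_term = seq[:terminal_len]
    center = seq[terminal_len:-terminal_len]
    c_term = seq[-terminal_len:]
    return _composition(n_term) + _composition(center) + _composition(c_term)
```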
Self-organizing maps (SOM)
SOMs are a type of artificial neural network used in unsupervised learning to produce low-dimensional discrete representations of the vector space represented by some training data. The discrete elements in SOMs are called nodes or neurons. SOMs have been used widely in bioinformatics and computational biology, mostly for tasks such as finding gene expression patterns and protein classification [22, 23]. The SOM contains m neurons, each holding a d-dimensional prototype vector, where d is the dimension of the input vectors. First, initial values are given to each prototype vector. When training begins, a vector 'x' from the input data is randomly chosen. The distances from 'x' to the prototype vectors are computed, and the neuron closest to 'x', the best matching unit (BMU), is selected. The radius of the neighborhood of the BMU is calculated, and any neurons found within the radius are deemed neighbors. The prototype vectors of these neighbors are adjusted to be more similar to the input vector. This procedure is then repeated for a set number of iterations (N). In this study, SOMs of multiple dimensions were studied, and N was 10,000 for all dimensions. For the SOM implementation, we used the open source machine learning package 'Orange.py', which is freely available at http://orange.biolab.si.
K-means clustering is a class of unsupervised learning algorithms which group an input data set into 'k' parts, or clusters, based on a similarity measure. K-means is one of the oldest and simplest clustering methods, but it remains a useful tool for cluster analysis. It scales well to large data sets and medium numbers of clusters; however, it has the drawback of requiring the expected number of clusters to be specified in advance. The basic k-means algorithm begins by initializing k cluster centers (centroids) and iterates to minimize the average distance between the centroids and their cluster members. Each data point belongs to the cluster whose centroid is closest to it. In this study, the initial centroids were pre-computed using the neurons from the SOM. The open source machine learning library 'Sci-Kit Learn' was used to implement the k-means clustering method.
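The k-means procedure with pre-computed starting centroids can be sketched in plain Python as follows. This is an illustrative re-implementation, not the scikit-learn code used in the study; the function name and the stopping rule (stop when the assignments no longer change) are assumptions of the example.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd-style k-means starting from the given centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

In scikit-learn, a comparable setup would presumably pass the SOM-derived centroids as the `init` array of `KMeans` with `n_init=1`; whether the authors configured it exactly this way is not stated in the text.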
SOM for finding K number and centroid locations for K-means clustering
In this study, an SOM network containing N neurons is first computed, and the Davies-Bouldin index (DBI) of the map is calculated, treating the neurons as clusters. Then, (N) × (N-1) prototype maps are created by combining each neuron with each of the other neurons. The DBI is computed for all prototype maps, and the prototype map with the lowest DBI is selected. If the DBI of this prototype map is lower than that of the current map, the current map is replaced by the prototype map, and the previous steps are repeated until no prototype map with a lower DBI can be found. This reduces the size of the map by one neuron in each iteration. The final number of neurons is used as the k value for k-means clustering, and the cluster centroids are computed from the vectors belonging to each neuron. The quality of the k-means clustering is measured using the difference between the inter-cluster and intra-cluster variance and the Davies-Bouldin index. As the SOM finds clusters in a random fashion, the clustering procedure was run 500 times for each vector type to obtain the optimum number of clusters. The optimum clustering was chosen by selecting, from the runs with the most frequently occurring cluster number, the one with the largest difference between inter-cluster and intra-cluster variance and the smallest DBI.
Davies-Bouldin index (DBI)
The DBI is a metric for evaluating the overall quality of a given set of clusters, originally developed to aid in determining the optimum number of clusters within a dataset. Minimization of the DBI of the clusters within a dataset generally seems to indicate natural partitions of the data. However, it should be noted that this is a heuristic approach and good values do not always indicate the best clustering arrangement. The DBI of a clustering is defined as,
$$DBI = \frac{1}{N_c}\sum_{i=1}^{N_c} D_i$$
where N_c is the number of clusters and D_i is the worst case scenario of all values of R_{i,j},
$$D_i = \max_{j \neq i} R_{i,j}$$
R_{i,j} is a measure of the clustering quality, defined as
$$R_{i,j} = \frac{S_i + S_j}{M_{i,j}}$$
The measure of scatter (S_i) within a given cluster i is defined as
$$S_i = \left(\frac{1}{k}\sum_{j=1}^{k}\lVert X_j - A_i\rVert^{q}\right)^{1/q}$$
where X_j is an n-dimensional feature vector assigned to the cluster C_i, and q was kept as two. M_{i,j} is a measure of separation between two clusters, defined as
$$M_{i,j} = \lVert A_i - A_j\rVert_2$$
where A_i is the centroid of cluster C_i containing samples X_1, X_2, ..., X_k, computed as,
$$A_i = \frac{1}{k}\sum_{j=1}^{k} X_j$$
Intra-cluster variance was calculated using the Euclidean distances between the points in the cluster and the centroid of the cluster.
Inter-cluster variance was calculated using the Euclidean distance between the centroids of the clusters.
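The Davies-Bouldin index can be sketched in plain Python as below, using the standard definition with Euclidean distances and q = 2 as stated in the text. The function names are hypothetical, each cluster is given as a list of feature vectors, and at least two clusters with distinct centroids are assumed.

```python
import math


def centroid(cluster):
    """Mean vector A_i of the samples in one cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def scatter(cluster, center, q=2):
    """Within-cluster scatter S_i (q-th root of the mean q-th power distance)."""
    return (sum(math.dist(x, center) ** q for x in cluster) / len(cluster)) ** (1 / q)


def davies_bouldin(clusters, q=2):
    """Average over clusters of the worst-case ratio R_ij = (S_i + S_j) / M_ij."""
    centers = [centroid(c) for c in clusters]
    scatters = [scatter(c, a, q) for c, a in zip(clusters, centers)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        total += max(
            (scatters[i] + scatters[j]) / math.dist(centers[i], centers[j])
            for j in range(n) if j != i
        )
    return total / n
```

Lower values indicate compact clusters that are well separated from each other, which is why the procedure described above searches for the map with the smallest DBI.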
Co-occurrence matrix analysis
The cluster numbers returned by the clustering approach are arbitrary, which presents a problem when trying to assess the similarity between runs. Thus, to assess how consistently samples are assigned to the same group, a co-occurrence matrix was generated to show the number of times the data samples in one group occurred together with those of other groups. The more often data samples occur together, the more consistent the clusters are across runs.
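One way to build such a matrix, sketched here in Python under the assumption that co-occurrence is counted per pair of samples, is to count in how many runs two samples receive the same cluster label. Because only label equality within a run is compared, the arbitrary numbering of clusters does not matter. The function name and input layout are assumptions of this example.

```python
def cooccurrence_matrix(runs):
    """Count, for every pair of samples, the runs in which they share a cluster.

    `runs` is a list of label lists, one per clustering run, each giving the
    cluster label of every sample in the same sample order.
    """
    n = len(runs[0])
    matrix = [[0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1
    return matrix
```

Pairs with counts close to the number of runs are consistently clustered together, whereas intermediate counts point to samples whose assignment is unstable.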
Support vector machine (SVM)
SVMs are a class of supervised learning algorithms based on the optimization principle from statistical learning theory [28, 29]. Support vector machines have been used widely in computational biology on diverse topics such as subcellular localization [18, 30–32], protein function prediction, secondary structure prediction, and disease forecasting. SVMs solve classification problems by calculating a hyperplane that separates the training data with a maximum margin. For multi-class problems, the classification is transformed into a series of binary classifications. There are numerous strategies for handling a multi-class problem separated into binary classifications; in this study, the one-versus-rest approach was used. The SVM classifiers were developed using the SVM_Light package (https://github.com/daoudclarke/pysvmlight), which is an open source package for SVM implementation. In a preliminary study, the RBF kernel was found to perform best. Therefore, we used the RBF kernel in all our SVM classifiers.
Performance evaluation parameters
To assess the performance of the developed models, we used a five-fold cross-validation test on the training dataset and then tested the models in an independent test. In the five-fold cross-validation procedure, the original sample is randomly partitioned into five subsamples of equal size. Of the five subsamples, a single subsample is retained as the validation data for testing the model, and the remaining four subsamples are used as training data. The cross-validation process is then repeated 5 times (the folds), with each of the 5 subsamples used exactly once as the validation data. The results from the five folds are then averaged to produce a single estimate. The performance is measured by parameters such as overall sensitivity, specificity, precision, Matthews Correlation Coefficient (MCC) and average accuracy. These parameters are defined as follows:
(i) Sensitivity or coverage of positive examples: the percentage of positive samples correctly predicted as positive,
$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100$$
(ii) Specificity or coverage of negative examples: the percentage of negative samples correctly predicted as negative,
$$\text{Specificity} = \frac{TN}{TN + FP} \times 100$$
(iii) Accuracy: the percentage of correctly predicted samples,
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100$$
(iv) Error rate: the total percentage of incorrect predictions, calculated as
$$\text{Error rate (ER)} = \frac{FP + FN}{TP + TN + FP + FN} \times 100 \tag{17}$$
(v) Precision: the percentage of positive predictions that are correct,
$$\text{Precision} = \frac{TP}{TP + FP} \times 100 \tag{18}$$
(vi) Matthews correlation coefficient (MCC): often considered a robust parameter for evaluating class prediction methods. An MCC equal to 1 is regarded as a perfect prediction, while 0 corresponds to a completely random prediction,
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{19}$$
where true positive (TP) is the number of positive samples that are predicted correctly; false negative (FN) is the number of positive samples that are predicted to be negative; false positive (FP) is the number of negative samples that are predicted to be positive; and true negative (TN) is the number of negative samples that are predicted correctly as negative.
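The evaluation parameters above can be computed from the four confusion-matrix counts with a few lines of Python. This is a minimal sketch using the standard definitions; the function name and the choice to return 0 when a denominator is zero are assumptions of the example.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return percentage metrics and MCC for one binary classifier."""

    def pct(num, den):
        # Guard against empty classes; returning 0 here is a convention choice.
        return 100.0 * num / den if den else 0.0

    total = tp + tn + fp + fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": pct(tp, tp + fn),
        "specificity": pct(tn, tn + fp),
        "accuracy": pct(tp + tn, total),
        "error_rate": pct(fp + fn, total),
        "precision": pct(tp, tp + fp),
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }
```

In a one-versus-rest setting, the counts would be collected separately for each class (that class as positive, all others as negative) and the resulting metrics averaged across classes.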
To identify the most relevant features for the classification of Laccase types, a feature scaling approach was applied. Feature scaling was performed with univariate feature selection, using the functions provided by Sci-Kit Learn through the program scale_features.py. Univariate feature selection treats each element of the descriptor vectors as independent of the others and ranks the elements based on how they differ between classes.
Domain map and phylogenetic trees construction
The program doMosaic was used to create domain maps for visualization of the domains in the initial data and the newly generated classes. Interproscan was used to obtain information about the domains in the Laccases. To show the relationship between Laccase samples, a phylogenetic tree was generated from the cleaned dataset using Clustal Omega version 1.0.3. Dendroscope version 3.2.10 was used for the visualization of the tree.
Results and discussion
We studied several SOM architectures to see their effect on the clustering of the Laccases with the various descriptors. The clustering algorithm was run 500 times for each SOM map size. The clustering performance of each descriptor is listed in Table 3. Physicochemical properties showed the best average performance among all the feature vectors, giving 12 as the optimum number of clusters. This is in close agreement with the study on feruloyl esterase classification, where the strongest descriptor was the composite vector combining amino acid composition and physicochemical properties. We also performed a co-occurrence matrix analysis to examine the consistency of cluster instances in each group. The physicochemical property descriptor shows consistency in cluster instances between runs and across different SOM dimensions. The co-occurrence matrix is shown in Figure 2. The 6x6 SOM dimension gave the best run, with a DBI of 0.37, an inter-cluster variance of 0.0088 and an intra-cluster variance of 0.0015. The performance of the physicochemical descriptor for each SOM dimension is listed in Table 4. The proteins classified into each group by the clustering approach are listed in Table 5.
Analysis of the taxa in each class revealed that the majority of the classes were dominated by a single taxon, as reported in Table 5. Several review papers containing large tables of experimentally validated Laccases with various properties were considered to validate the clusters. Unfortunately, it was difficult to draw patterns from these, as the substrates tested varied widely and heterologously expressed Laccases often have drastically different activities due to different amounts of glycosylation [15, 40, 41]. To better understand what drives the distinction between classes, feature scaling was applied to the physicochemical properties of all classes together, as well as to each class against each other class. The major contributing features were the percentage of negatively charged amino acids, the isoelectric point, and the percentage of acidic or amidic groups. Detailed information about the significant features is shown in Figure 3. This is particularly interesting as Laccases as a group operate over a wide range of pHs, while individual enzymes seem to have fairly specific or broad pH and substrate ranges. It has also been reported that different Laccases produced by the fungus Coriolus versicolor were easily distinguishable by their isoelectric points. To examine the differences between classes in terms of physicochemical properties, the best features were calculated for each pair of classes and are shown in Figure 4. This showed that the variation seems to be most strongly influenced by acid/base properties, followed by small or aliphatic residues. The isoelectric point occurred most often within the top three features with 45 cases, followed by basic amino acids with 34 cases, acidic with 32 cases, ionizable amino acids with 23 cases, acidic and amidic with 13 cases, charged residues with 12 cases, h-bonding and small amino acids with 8 cases each, tiny with 6, neutral and hydrophobic with 4, aliphatic with 4, hydrophilic with 2, and molecular weight with 2.
Additionally, we analyzed the descriptor values for physicochemical properties and amino acid composition between classes with a standard t-test. The t-test results for the AAC features between the 12 classes are listed in Additional file 2. They show that Ala, Cys, Asp, Glu, His, Lys, Met, Asn, Arg, Ser, and Thr vary significantly between the classes. This is particularly interesting as the amino acids with the largest numbers of statistically significant differences between classes seem to be involved in important aspects of Laccase function. For example, the top two amino acids are aspartic acid and lysine, with significant differences in 51 of the 66 possible class comparisons. Aspartic acid plays an important role in many Laccase catalytic domains, such as assisting in substrate channels in basidiomycete Laccases, affecting Laccase activity of C-terminal domains when mutated in bacterial Laccases, and assisting in the exit of protons from the N-terminal domains of bacterial Laccases [43–45]. Lysine can also be found widely in catalytic domains; for example, C-terminal lysines have been implicated in the inactivation of heterologously produced Laccases. Aside from function, lysines are also widely used as a cross-linking target to bind Laccases to various materials [47–49]. Glutamic acid had the next most significant differences between classes. Glutamic acid is found in Leu-Glu-Ala motifs, which follow the copper-ligating histidines and are thought to be related to Laccases with higher redox potentials. Asparagine followed closely with 41 significant differences. Many Laccases are known to contain asparagines which serve as sites for N-linked glycosylation. These sites have been shown to be involved in regulation of Laccase activity through catalytic sites such as the Leu-Met-Asn motif, which often replaces the previously mentioned Leu-Glu-Ala motif. N-glycosylation has also been found to provide protection against proteolysis [51, 52]. Other types of glycosylation, such as O-linked glycosylation, are also major factors, so it comes as no surprise that both serine and threonine are high on the list.
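The pairwise comparison described above can be sketched as follows. With 12 classes there are 66 unordered class pairs, and for each pair a two-sample t statistic is computed on one descriptor. This is a minimal sketch assuming the pooled-variance (Student) form of the test; the function names and toy values are hypothetical, and converting the statistic to a p-value (which needs the t distribution, available in a statistics package) is left out.

```python
from itertools import combinations
from math import comb, sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic (pooled variance) and its degrees of freedom.

    Each sample needs at least two values, and the pooled variance must be non-zero.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


def pairwise_t(values_by_class):
    """Compute the t statistic for every unordered pair of classes."""
    return {
        (i, j): pooled_t_statistic(values_by_class[i], values_by_class[j])
        for i, j in combinations(sorted(values_by_class), 2)
    }


# Number of unordered comparisons among 12 classes.
print(comb(12, 2))

# Toy descriptor values for three hypothetical classes.
toy = {1: [1.0, 2.0, 3.0], 2: [2.0, 3.0, 4.0], 3: [5.0, 6.0, 8.0]}
for pair, (t, df) in pairwise_t(toy).items():
    print(pair, round(t, 3), df)
```

Counting, for each amino acid, how many of the 66 pairs fall below a chosen significance threshold gives tallies of the kind quoted in the text (for example, 51 of 66).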
In our other statistical analysis, the t-test results for the important physicochemical properties identified in Figure 3 are listed in Additional file 3. They show that all the physicochemical properties identified as important in discriminating between classes are also statistically significant. Since the generated classes contain many significant differences in physicochemical properties, and the amino acids with high numbers of significant differences are also strongly related to Laccase function, we believe these classes may indeed represent different functional classes of Laccases. To investigate the classes further, a cladogram was constructed from a multiple sequence alignment of the sequences used for clustering. We then mapped our clusters and the classes from LccED to the cladogram (Figures 5a and 5b, respectively) [15, 16]. Despite many of the clusters being dominated by a single taxon, when mapped to the cladogram they are widely dispersed throughout its taxonomic regions. This contrasts sharply with the LccED classes, which largely follow taxonomy. Many of the neighbors in the tree are enzymes from the same or similar organisms; these could indicate Laccases of different function from within an organism.
To allow for the classification of newly discovered Laccases and Laccases with no experimental evidence, a Support Vector Machine-based classification system was developed. To accomplish this, 90% of the Laccase data collected was used for 5-fold cross-validation and the remaining 10% was kept aside for independent testing. As physicochemical descriptors were used to build the classes, physicochemical properties were also used to develop the SVM classifiers. The developed models were further used to classify sequences annotated as Laccases with "homology" or "predicted" level evidence in the UniProtKB database.
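The data partitioning described above (10% held out for independent testing, the remaining 90% used for 5-fold cross-validation) can be sketched as below. This is a minimal illustration under stated assumptions: the shuffle is a plain random one with a fixed seed, not stratified by class, and the function names are hypothetical. The text does not say how the authors drew their split, and the SVM training step itself is not shown.

```python
import random


def split_holdout(items, holdout_fraction=0.10, seed=0):
    """Shuffle `items` and return (cross_validation_set, independent_test_set)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def k_folds(items, k=5):
    """Yield (training, validation) pairs; each item is validated exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation


sequence_ids = list(range(100))  # stand-ins for 100 hypothetical sequences
cv_set, test_set = split_holdout(sequence_ids)
for training, validation in k_folds(cv_set, k=5):
    print(len(training), len(validation))
```

With classes as small as a single sequence, a stratified split would likely be preferable in practice so that every class appears in training.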
The performance of the classifier in 5-fold cross-validation for all classes is reported in Table 6. The results show that the model achieves an overall accuracy of 99.03%, MCC of 0.9367, precision of 94.20%, sensitivity of 94.20% and specificity of 99.47%. The overall specificity is extremely high, indicating a low rate of misclassified sequences. Considering the classes individually, the highest metrics achieved were an MCC of 1.0 and accuracy, specificity, and sensitivity of 100%. The lowest performance was an accuracy of 98.98%, MCC of 0.7252, sensitivity of 80% and specificity of 99.31%.
Performance results on the independent test data are listed in Table 7. The model also provides high performance, with an overall accuracy of 97.98%, error rate of 1.02%, MCC of 0.8678, precision of 87.88%, sensitivity of 87.88% and specificity of 98.90%. It should be noted that the MCC of cluster-3 was zero. However, this class contains only one sequence and performs well in cross-validation, so we believe it is still credible.
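For reference, the per-class metrics quoted above can all be derived from one-vs-rest counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from Tables 6 or 7. Returning 0 when a denominator is zero is a convention assumed here.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Standard one-vs-rest metrics; a zero denominator yields 0.0 by convention."""

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical class with 10 members in a test set of 100 sequences.
print(binary_metrics(tp=8, fp=2, tn=88, fn=2))

# Hypothetical single-member class whose only sequence is missed.
print(binary_metrics(tp=0, fp=0, tn=98, fn=1))
```

The second call shows one way an MCC of zero can arise for a class with a single test sequence: if that sequence is missed, the numerator and denominator of the MCC are both zero, even though accuracy and specificity stay high.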
Confusion matrices were constructed to better understand which classes are more similar to one another. The confusion matrix for the independent test set is shown in Table 8. According to the confusion matrix, a few proteins in classes 1, 2, 8, 10 and 11 are predicted as other classes. Overall, the confusion matrix shows that the developed classifier predicts most samples correctly.
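A confusion matrix of this kind can be tabulated directly from paired true and predicted labels, and the off-diagonal entries then show which classes are mistaken for which. The following is a minimal sketch with hypothetical labels and function names; it does not reproduce Table 8.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count occurrences of each (true class, predicted class) pair."""
    return Counter(zip(true_labels, predicted_labels))


def one_vs_rest(counts, cls):
    """Return (TP, FP, TN, FN) for `cls` against all other classes."""
    tp = counts[(cls, cls)]
    fn = sum(n for (t, p), n in counts.items() if t == cls and p != cls)
    fp = sum(n for (t, p), n in counts.items() if t != cls and p == cls)
    tn = sum(counts.values()) - tp - fn - fp
    return tp, fp, tn, fn


def misclassified(counts):
    """Return only the off-diagonal (true, predicted) entries."""
    return {pair: n for pair, n in counts.items() if pair[0] != pair[1]}


true_labels = [1, 1, 2, 2, 3]        # hypothetical true classes
predicted_labels = [1, 2, 2, 2, 3]   # hypothetical predictions
counts = confusion_counts(true_labels, predicted_labels)
print(misclassified(counts))
print(one_vs_rest(counts, 2))
```

Storing the matrix as a dictionary keyed by label pairs avoids allocating a dense 12 by 12 table when most off-diagonal cells are empty.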
ROC curves are important to consider for prediction systems, as they give a measure of credibility and/or reliability. Each point on the curve is based on a confidence score threshold of a single classifier. The area under the curve (AUC) is computed for each ROC curve. The AUC indicates the probability of a positive sequence having a higher value than a negative sequence when one of each is selected at random. The further the curve shifts toward the left, the more accurate the predictor. We calculated the ROC curves for each class for 5-fold cross-validation and independent set testing separately. The ROC curve for 5-fold cross-validation is shown in Figure 6 and that for the independent set in Figure 7. Each figure contains a line for each class in the prediction system as well as a line showing the average performance of all classes. All classes show excellent performance, with lines very close to the left side of the chart, indicating a high rate of correct predictions from these models. Indeed, the overall area under the curve rounds up to 1.00, showing the reliability of our classifier.
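The probabilistic reading of the AUC given above can be checked directly: over all positive/negative pairs, count how often the positive sequence receives the higher score, with ties counted as one half. The sketch below does this by brute force; the function name and the scores are hypothetical, and this is an illustration of the definition rather than the procedure used to draw Figures 6 and 7.

```python
def auc_by_ranking(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties contribute one half, which matches the area under the ROC curve.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical classifier confidence scores for one class.
positives = [0.9, 0.8, 0.4]
negatives = [0.5, 0.3, 0.1]
print(auc_by_ranking(positives, negatives))
```

A classifier that ranks every positive above every negative gives a value of exactly 1.0, and random scoring gives about 0.5.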
Functional annotation of different classes with domain maps
To investigate the role of domains in the functional variation between different classes, we generated domain maps for the sequences in each class. Eleven different types of domains were found within the dataset. The most frequently occurring domains are PF07732, PF00394, PF07731 and PF02578. The first three are mostly found in plants and fungi, whereas PF02578 is found mostly in sequences of bacterial or mammalian origin. Class 4 contained a couple of polyphenol oxidase domains and tyrosinase domains. The domain maps generated for all the classes are shown in Figure S1 in the supplementary material. The majority of the domain maps were highly similar within and between classes with respect to the domains present. However, there were some differences in the positions of the domains. We believe that these differences in the relative positions of the domains could also account for functional differences.
Classification of Laccase homologs from UniProtKB
The efficiency of our prediction approach was tested by classifying the Laccases in UniProtKB with homology or predicted level evidence. Out of the 1656 sequences retrieved, 1587 were assigned to one of the 12 classes; the results are reported in Table 9. These annotations could be a good resource for the scientific community working in these areas.
Web tool for classification of Laccases
We have developed a web resource for the classification of Laccase subtypes by implementing the machine learning models. It should be useful to researchers characterizing newly found Laccase sequences. The tool can be found at http://lacsubpred.bioinfo.ucr.edu/. We have also provided the code used to develop the clustering and classification approach as an open source package available at https://github.com/tweirick/SubClPred.
In this work, we present a systematic computational approach to identify Laccase subtypes. First, a novel clustering method was developed to group the Laccase subtypes using the experimental data available in UniProtKB. Then a classification method was developed, based on a machine learning approach, to generalize the functions of Laccases in each class. These identified groups can be a useful resource for biologists studying the characterization of Laccases, particularly for researchers in the biofuel area.
LacSubPred, the web resource developed from this study, is freely available at http://lacsubpred.bioinfo.ucr.edu/.
ROC: Receiver Operating Characteristic
MCC: Matthews Correlation Coefficient
SVM: Support Vector Machines
AAC: Amino Acid Composition
Bourbonnais R, Paice MG: Oxidation of non-phenolic substrates: an expanded role for Laccase in lignin biodegradation. FEBS letters. 1990, 267 (1): 99-102. 10.1016/0014-5793(90)80298-W.
Clutterbuck A: Absence of Laccase from yellow-spored mutants of Aspergillus nidulans. Journal of general microbiology. 1972, 70 (3): 423-435. 10.1099/00221287-70-3-423.
Geiger JP, Nicole M, Nandris D, Rio B: Root rot diseases of Hevea brasiliensis. European journal of forest pathology. 1986, 16 (1): 22-37. 10.1111/j.1439-0329.1986.tb01049.x.
O'Malley DM, Whetten R, Bao W, Chen CL, Sederoff RR: The role of Laccase in lignification. The Plant Journal. 1993, 4 (5): 751-757. 10.1046/j.1365-313X.1993.04050751.x.
Sharma P, Goel R, Capalash N: Bacterial Laccases. World Journal of Microbiology and Biotechnology. 2007, 23 (6): 823-832. 10.1007/s11274-006-9305-3.
Rodríguez Couto S, Toca Herrera JL: Industrial and biotechnological applications of Laccases: a review. Biotechnology advances. 2006, 24 (5): 500-513. 10.1016/j.biotechadv.2006.04.003.
Osma JF, Toca-Herrera JL, Rodríguez-Couto S: Uses of Laccases in the food industry. Enzyme research. 2010, 2010:
Baldrian P: Fungal Laccases-occurrence and properties. FEMS microbiology reviews. 2006, 30 (2): 215-242. 10.1111/j.1574-4976.2005.00010.x.
Dwivedi UN, Singh P, Pandey VP, Kumar A: Structure-function relationship among bacterial, fungal and plant Laccases. Journal of Molecular Catalysis B: Enzymatic. 2011, 68 (2): 117-128. 10.1016/j.molcatb.2010.11.002.
Larrondo LF, Salas L, Melo F, Vicuna R, Cullen D: A novel extracellular multicopper oxidase from Phanerochaete chrysosporium with ferroxidase activity. Applied and environmental microbiology. 2003, 69 (10): 6257-6263. 10.1128/AEM.69.10.6257-6263.2003.
Gavnholt B, Larsen K: Molecular biology of plant Laccases in relation to lignin formation. Physiologia plantarum. 2002, 116 (3): 273-280. 10.1034/j.1399-3054.2002.1160301.x.
Giardina P, Faraco V, Pezzella C, Piscitelli A, Vanhulle S, Sannia G: Laccases: a never-ending story. Cellular and Molecular Life Sciences. 2010, 67 (3): 369-385. 10.1007/s00018-009-0169-1.
Alexandre G, Zhulin IB: Laccases are widespread in bacteria. Trends in Biotechnology. 2000, 18 (2): 41-42. 10.1016/S0167-7799(99)01406-7.
Huang XF, Santhanam N, Badri DV, Hunter WJ, Manter DK, Decker SR, Vivanco JM, Reardon KF: Isolation and characterization of lignin-degrading bacteria from rainforest soils. Biotechnology and bioengineering. 2013, 110 (6): 1616-1626. 10.1002/bit.24833.
Hoegger PJ, Kilaru S, James TY, Thacker JR, Kües U: Phylogenetic comparison and classification of Laccase and related multicopper oxidase protein sequences. FEBS Journal. 2006, 273 (10): 2308-2326. 10.1111/j.1742-4658.2006.05247.x.
Sirim D, Wagner F, Wang L, Schmid RD, Pleiss J: The Laccase Engineering Database: a classification and analysis system for Laccases and related multicopper oxidases. Database: the journal of biological databases and curation. 2011, 2011:
Udatha D, Kouskoumvekaki I, Olsson L, Panagiotou G: The interplay of descriptor-based computational analysis with pharmacophore modeling builds the basis for a novel classification scheme for feruloyl esterases. Biotechnology advances. 2011, 29 (1): 94-110. 10.1016/j.biotechadv.2010.09.003.
Kaundal R, Sahu SS, Verma R, Weirick T: Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning. BMC bioinformatics. 2013, 14 (Suppl 14): S7-10.1186/1471-2105-14-S14-S7.
Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H: Predicting protein-protein interactions based only on sequences information. Proceedings of the National Academy of Sciences. 2007, 104 (11): 4337-4341. 10.1073/pnas.0607879104.
You Z-H, Lei Y-K, Zhu L, Xia J, Wang B: Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC bioinformatics. 2013, 14 (Suppl 8): S10-10.1186/1471-2105-14-S8-S10.
Kohonen T: Essentials of the self-organizing map. Neural Networks. 2013, 37: 52-65.
Udatha DBRKG, Kouskoumvekaki I, Olsson L, Panagiotou G: The interplay of descriptor-based computational analysis with pharmacophore modeling builds the basis for a novel classification scheme for feruloyl esterases. Biotechnology Advances. 2011, 29 (1): 94-110. 10.1016/j.biotechadv.2010.09.003.
Wang J, Delabie J, Aasheim H, Smeland E, Myklebost O: Clustering of the SOM easily reveals distinct gene expression patterns: results of a reanalysis of lymphoma study. BMC Bioinformatics. 2002, 3 (1): 36-10.1186/1471-2105-3-36.
Demšar J, Zupan B, Leban G, Curk T: Orange: From experimental machine learning to interactive data mining. 2004, Springer
MacQueen J: Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability:. 1967, California, USA, 14-
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V: Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research. 2011, 12: 2825-2830.
Davies DL, Bouldin DW: A Cluster Separation Measure. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 1979, PAMI-1 (2): 224-227.
Bhasin M, Raghava G: ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic acids research. 2004, 32 (suppl 2): W414-W419.
Park K-J, Kanehisa M: Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 2003, 19 (13): 1656-1663. 10.1093/bioinformatics/btg222.
Garg A, Bhasin M, Raghava GP: Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. Journal of Biological Chemistry. 2005, 280 (15): 14427-14432. 10.1074/jbc.M411789200.
Kaundal R, Raghava GP: RSLpred: an integrative system for predicting subcellular localization of rice proteins combining compositional and evolutionary information. Proteomics. 2009, 9 (9): 2324-2342. 10.1002/pmic.200700597.
Kaundal R, Saini R, Zhao PX: Combining machine learning and homology-based approaches to accurately predict subcellular localization in Arabidopsis. Plant physiology. 2010, 154 (1): 36-54. 10.1104/pp.110.156851.
Cai C, Han L, Ji ZL, Chen X, Chen YZ: SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic acids research. 2003, 31 (13): 3692-3697. 10.1093/nar/gkg600.
Ward JJ, McGuffin LJ, Buxton BF, Jones DT: Secondary structure prediction with support vector machines. Bioinformatics. 2003, 19 (13): 1650-1655. 10.1093/bioinformatics/btg223.
Kaundal R, Kapoor AS, Raghava GPS: Machine learning techniques in disease forecasting: a case study on rice blast prediction. BMC Bioinformatics. 2006, 7: 485-10.1186/1471-2105-7-485.
Joachims T: Svmlight: Support vector machine. SVM-Light Support Vector Machine. 1999, University of Dortmund, 19 (4): [http://svmlight.joachims.org/]
Moore AD, Held A, Terrapon N, Weiner J, Bornberg-Bauer E: DoMosaics: software for domain arrangement visualization and domain-centric analysis of proteins. Bioinformatics. 2014, 30 (2): 282-283. 10.1093/bioinformatics/btt640.
Zdobnov EM, Apweiler R: InterProScan-an integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001, 17 (9): 847-848. 10.1093/bioinformatics/17.9.847.
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J: Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular systems biology. 2011, 7 (1):
Morozova O, Shumakovich G, Gorbacheva M, Shleev S, Yaropolov A: "Blue" Laccases. Biochemistry (Moscow). 2007, 72 (10): 1136-1150. 10.1134/S0006297907100112.
Reiss R, Ihssen J, Richter M, Eichhorn E, Schilling B, Thöny-Meyer L: Laccase versus Laccase-Like Multi-Copper Oxidase: A Comparative Study of Similar Enzymes with Diverse Substrate Spectra. PloS one. 2013, 8 (6): e65633-10.1371/journal.pone.0065633.
Freixo MdR, Karmali A, Frazão C, Arteiro JM: Production of Laccase and xylanase from Coriolus versicolor grown on tomato pomace and their chromatographic behaviour on immobilized metal chelates. Process Biochemistry. 2008, 43 (11): 1265-1274. 10.1016/j.procbio.2008.07.013.
Garzillo AM, Colao MC, Buonocore V, Oliva R, Falcigno L, Saviano M, Santoro AM, Zappala R, Bonomo RP, Bianco C: Structural and kinetic characterization of native Laccases from Pleurotus ostreatus, Rigidoporus lignosus, and Trametes trogii. Journal of protein chemistry. 2001, 20 (3): 191-201. 10.1023/A:1010954812955.
Nasoohi N, Khajeh K, Mohammadian M, Ranjbar B: Enhancement of catalysis and functional expression of a bacterial Laccase by single amino acid replacement. International journal of biological macromolecules. 2013, 60: 56-61.
Silva CS, Damas JM, Chen Z, Brissos V, Martins LO, Soares CM, Lindley PF, Bento I: The role of Asp116 in the reductive cleavage of dioxygen to water in CotA Laccase: assistance during the proton-transfer mechanism. Acta Crystallographica Section D: Biological Crystallography. 2012, 68 (2): 186-193. 10.1107/S0907444911054503.
Bleve G, Lezzi C, Spagnolo S, Tasco G, Tufariello M, Casadio R, Mita G, Rampino P, Grieco F: Role of the C-terminus of Pleurotus eryngii Ery4 Laccase in determining enzyme structure, catalytic properties and stability. Protein Engineering Design and Selection. 2013, 26 (1): 1-13. 10.1093/protein/gzs056.
Yamaguchi H, Miyazaki M, Asanomi Y, Maeda H: Poly-lysine supported cross-linked enzyme aggregates with efficient enzymatic activity and high operational stability. Catalysis Science & Technology. 2011, 1 (7): 1256-1261. 10.1039/c1cy00084e.
Mikolasch A, Hahn V, Manda K, Pump J, Illas N, Gördes D, Lalk M, Salazar MG, Hammer E, Jülich W-D: Laccase-catalyzed cross-linking of amino acids and peptides with dihydroxylated aromatic compounds. Amino acids. 2010, 39 (3): 671-683. 10.1007/s00726-010-0488-4.
Kurniawan RA, Aulanni'am A, Shieh F-K, Chu PP-J: Carbon Nanotube Covalently Attached Laccase Biocathode for Biofuel Cell. The Journal of Pure and Applied Chemistry Research. 2013, 2 (2): 79-88.
Piontek K, Antorini M, Choinowski T: Crystal Structure of a Laccase from the Fungus Trametes versicolor at 1.90-Å Resolution Containing a Full Complement of Coppers. Journal of Biological Chemistry. 2002, 277 (40): 37663-37669. 10.1074/jbc.M204571200.
Yoshitake A, Katayama Y, Nakamura M, Iimura Y, Kawai S, Morohoshi N: N-linked carbohydrate chains protect Laccase III from proteolysis in Coriolus versicolor. Journal of General Microbiology. 1993, 139 (1): 179-185. 10.1099/00221287-139-1-179.
Perry CR, Matcham SE, Wood DA, Thurston CF: The structure of Laccase protein and its synthesis by the commercial mushroom Agaricus bisporus. Journal of general microbiology. 1993, 139 (1): 171-178. 10.1099/00221287-139-1-171.
Lemeshow S, Hosmer D: Applied Logistic Regression (Wiley Series in Probability and Statistics). Wiley-Interscience. 2000
The authors duly acknowledge the funding support to RK for this study from OSU's Provost Office interdisciplinary grant (#12), the iCREST Center for Bioinformatics and Computational Biology (http://icrest.okstate.edu/). Partial support to TW from the National Science Foundation Grant No. EPS-0814361 is duly acknowledged. We also thank the University of California Riverside (UCR) High Performance Computing / Bioinformatics Facility for hosting the web tool developed from this study.
Funding for the publication of this article has come from the 'start-up' funds provided to RK, account A01949-19900-44-CPK1, UCR.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 11, 2014: Proceedings of the 11th Annual MCBIOS Conference. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S11.
The authors declare that they have no competing financial interests.
TW collected the datasets related to Laccases from public repositories, wrote the code for clustering, developed the algorithms and models, performed the calculations, prepared the figures and tables, and wrote the draft manuscript. SSS helped in model development, data analysis and tool building. RM helped in biological analysis and in editing the manuscript. RK conceived the study, participated in its design and coordination, and edited the final manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1: Domain maps for each of the Laccase subtypes cluster generated using doMosaics (http://www.domosaics.net/). (DOCX 2 MB)
Additional file 2: P-values designating the statistical significance of one cluster over the other based on amino acid composition differences; values calculated using the standard t-test. (XLSX 30 KB)
Additional file 3: P-values designating the statistical significance of one cluster over the other based on protein physicochemical property differences; values calculated using the standard t-test. (XLSX 30 KB)
Cite this article
Weirick, T., Sahu, S.S., Mahalingam, R. et al. LacSubPred: predicting subtypes of Laccases, an important lignin metabolism-related enzyme class, using in silico approaches. BMC Bioinformatics 15, S15 (2014). https://doi.org/10.1186/1471-2105-15-S11-S15
- Lignin degradation / synthesis
- Machine Learning
- Unsupervised learning