LacSubPred: predicting subtypes of Laccases, an important lignin metabolism-related enzyme class, using in silico approaches

Weirick, Tyler; Sahu, Sitanshu S; Mahalingam, Ramamurthy; Kaundal, Rakesh

doi:10.1186/1471-2105-15-S11-S15

Volume 15 Supplement 11

Proceedings of the 11th Annual MCBIOS Conference

Proceedings
Open access
Published: 21 October 2014


Tyler Weirick^1,2,
Sitanshu S Sahu^1,2,
Ramamurthy Mahalingam² &
…
Rakesh Kaundal³

BMC Bioinformatics volume 15, Article number: S15 (2014)

Abstract

Background

Laccases (E.C. 1.10.3.2) are multi-copper oxidases that have gained importance in many industries such as biofuels, pulp production, textile dye bleaching, bioremediation, and food production. Their usefulness stems from the ability to act on a diverse range of phenolic compounds such as o-/p-quinols, aminophenols, polyphenols, polyamines, aryl diamines, and aromatic thiols. Despite acting on a wide range of compounds as a family, individual Laccases often exhibit distinctive and varied substrate ranges. This is likely due to the involvement of Laccases in many metabolic roles across diverse taxa. Classification systems for multi-copper oxidases have been developed using multiple sequence alignments; however, these systems seem to largely follow species taxonomy rather than substrate ranges, enzyme properties, or specific function. It has been suggested that the roles and substrates of various Laccases are related to their optimal pH. This is consistent with the observation that fungal Laccases usually prefer acidic conditions, whereas plant and bacterial Laccases prefer basic conditions. Based on these observations, we hypothesize that a descriptor-based unsupervised learning system could generate a homology-independent classification system that better describes the functional properties of Laccases.

Results

In this study, we first used an unsupervised learning approach to develop a novel homology-independent Laccase classification system. Of the descriptors considered, physicochemical properties showed the best performance. Physicochemical properties divided the Laccases into twelve subtypes. Analysis of the clusters using a t-test revealed that the majority of the physicochemical descriptors had statistically significant differences between the classes. Feature selection identified the most important features as negatively charged residues, the peptide isoelectric point, and acidic or amidic residues. Second, to allow for classification of new Laccases, a supervised learning system was developed from the clusters. The models showed high performance with an overall accuracy of 99.03%, error of 0.49%, MCC of 0.9367, precision of 94.20%, sensitivity of 94.20%, and specificity of 99.47% in a 5-fold cross-validation test. In an independent test, our models still provided a high accuracy of 97.98%, error rate of 1.02%, MCC of 0.8678, precision of 87.88%, sensitivity of 87.88% and specificity of 98.90%.

Conclusion

This study provides a useful classification system for better understanding Laccases from the perspective of their physicochemical properties. We also developed a publicly available web tool for the characterization of Laccase protein sequences (http://lacsubpred.bioinfo.ucr.edu/). Finally, the programs used in the study are made available for researchers interested in applying the system to other enzyme classes (https://github.com/tweirick/SubClPred).

Background

Laccases (EC 1.10.3.2) are the largest sub-group of multi-copper oxidases, a family that also includes ascorbate oxidases (EC 1.10.3.3), ferroxidases or ceruloplasmins (EC 1.16.3.1) and nitrite reductases (EC 1.7.2.1). Laccases were first discovered in the sap of the Japanese lacquer tree Rhus vernicifera. Since then they have been found in many taxa including plants, fungi, bacteria, and metazoa. Laccases are involved in a diverse range of cellular activities such as lignin degradation, lignin biosynthesis, pigment production, plant pathogenesis, melanin production, spore coat resistance, morphogenesis and detoxification of copper [1–5]. Laccases are also widely used for industrial purposes. For example, Laccases are used in the paper and pulp, textile, and petrochemical industries for detoxification of industrial effluents [6]. In medicine, Laccases are used for certain medical diagnostics and as catalysts for the manufacture of anti-cancer drugs [6]. They are also used for environmental remediation of herbicides, pesticides and explosives in soil, and as cleaning agents for certain water purification systems. In commercial products, they are found in cosmetics, denim bleaching, wine and beer stabilization, fruit juice processing, color enhancement of tea and even baking [6, 7]. Laccases are popular in industry for a number of reasons. They are better for the environment and have fewer non-specific reactions than conventional oxidation technologies. Many Laccases are extracellular enzymes, which makes their purification simple. Compared with other oxidative enzymes, they are easier to use because they catalyze reactions with molecular oxygen and do not need reactive oxygen species catalysis [6, 8]. Currently, fungal Laccases are the most widely studied and commercially used Laccases. However, there is also much interest in bacterial Laccases because of their higher temperature stability and their ability to operate at different pH values than fungal Laccases.
Generally, Laccases are dimeric or tetrameric glycoproteins in which each monomer contains a copper-containing site. These copper sites may be one of three types: Type-1 or blue copper, Type-2 or normal copper, and Type-3 or coupled-dinuclear centers. The copper binding motifs have been shown to be highly conserved across all Laccases, with a trend towards greater similarity in the N- and C-terminal domains, as these are the copper-containing domains. It has been noted that the central binding pockets are larger in bacterial Laccases than in fungal or plant Laccases. The copper binding sites show significant differences in conserved residues between the Laccases of bacteria, fungi, and plants [9].

Fungal Laccases

Fungal Laccases comprise the bulk of experimentally studied Laccases. They occur in many fungal species and are thought to play important roles in morphogenesis, fungal-plant interactions, stress defense, pigment production, and lignin degradation. Although Laccases are typically studied with respect to biomass degradation, most fungi have been found to produce several isoenzymes that differ in type, enzymatic or physical properties, and expression level. These can vary even more between species [8]. For example, it has been reported that one of the most efficient lignin degraders, Phanerochaete chrysosporium, produces a Laccase different from those of other efficient lignin-degrading fungi [10]. While most Laccases are extracellular enzymes, many fungal taxa also produce intracellular Laccases [8]. This is especially interesting when compared with enzymes of similar function such as lignin peroxidases, which are strictly extracellular. It is speculated that the cellular localization of Laccases may be connected to their function and substrate ranges. This hypothesis remains unresolved because the majority of studied fungal Laccases come from wood-rotting basidiomycetes. The enzymatic properties of fungal Laccases vary greatly. Temperatures range from 25 to 80 °C, and pH optima depend on the substrate: 2.0-5.0 for 2,2'-azino-bis(3-ethylbenzothiazoline-6-sulphonic acid) (ABTS), 3.0-8.0 for 2,6-dimethoxyphenol (DMP), 3.0-7.0 for guaiacol, and 3.5-7.0 for syringaldazine. Similarly, K_m values (µM) span wide ranges: 4-770 for ABTS, 26-14720 for DMP, 4-30000 for guaiacol, and 3-4307 for syringaldazine. The k_cat values (s^-1) also vary broadly: 198-350000 for ABTS, 100-360000 for DMP, 90-10800 for guaiacol, and 16800-28000 for syringaldazine. These properties can be further altered by glycosylation.

Plant Laccases

Traditionally, plant Laccases were considered to be only extracellular enzymes involved in radical-based lignin polymerization. However, a high degree of divergence among Laccases within a single plant species has been observed; ryegrass, for example, contains 25 different Laccase genes. Plant Laccases have also been reported that lack N-terminal signal peptides for secretion but carry signals targeting them to other cellular compartments such as the endoplasmic reticulum or peroxisomes. Another study on poplars showed that Laccase repression had no effect on lignin production. Despite the evidence for novel functions and the many known functions in other taxa, the grouping of plant Laccases remains elusive [11].

Bacterial Laccases

Bacterial Laccases are known to be widespread in prokaryotes; however, only a few have been experimentally characterized. To date, bacterial Laccases have mostly been found to be involved in lignin degradation, catabolism of phenolic compounds, cell pigmentation, morphogenesis, and copper defense [12–14]. The best studied bacterial Laccase is CotA, an endospore coat protein from Bacillus subtilis that produces a melanin-like pigment. This enzyme has generated great interest because of its extremely high temperature stability. Bacterial Laccases are also unique because prokaryotes lack cellular compartments. The reactions catalyzed by Laccases can produce quinones and semiquinones as by-products, which are powerful inhibitors of the electron transport chain [5].

Other Laccases

In metazoans, Laccases exist in mammals as well as invertebrates. The roles of Laccases in mammals do not appear to be well understood; however, insect Laccases are known to be involved in cuticle formation [12]. Cuticle tanning, also known as sclerotization and pigmentation, is the process through which proteins in the exoskeleton are conjugated. This causes the exoskeleton to become insoluble, harder, and darker.

Classification of Laccases: current view

Laccases are currently classified as part of a larger classification scheme for multi-copper oxidases [15, 16]. This scheme is based on multiple sequence alignments and seems to classify by taxonomic association. The current classification system, "The Laccase Engineering Database" (LccED), classifies multi-copper oxidases into eleven classes: basidiomycete Laccases, ascomycete Laccases, insect Laccases, fungal pigment MCOs, fungal ferroxidases, fungal and plant ascorbate oxidases, plant Laccases, bacterial CopA proteins, bacterial bilirubin oxidases, bacterial CueO proteins, and SLAC homologs.

Machine learning-based classification systems

As discussed above, the current classification system for Laccases largely follows species taxonomy rather than substrate ranges, enzyme properties, or specific function, although it has been observed that individual Laccases often exhibit distinctive and varied substrate ranges and have different functions associated with distinguishing pH values among different taxa. We hypothesize that a descriptor-based computational prediction system could be developed to generate a homology-independent classification system that better describes the functional properties of Laccases. In a previous study on feruloyl esterases (EC 3.1.1.73), an unsupervised learning approach was used to create a novel homology-independent classification system for this enzyme class. Various bioinformatics tools were used to validate the identified classes [17]. In the present study, we followed a two-step computational strategy to identify and classify Laccase subtypes by developing a Python command-line implementation of unsupervised and supervised learning approaches, respectively. Further, we implemented our prediction models as a web-based prediction server to classify novel Laccase subtypes. The tool could also be useful to biofuel researchers and industry.

Methods

Dataset generation

Alternate names for Laccases were found via cross referencing with the KEGG database (http://www.kegg.jp/dbget-bin/www_bget?ec:1.10.3.2). To search for Laccase sequences, we combined these names into a basic query. Sequences with protein or transcript level evidence were selected to ensure high quality data and to avoid potentially mislabeled multi-copper oxidases. We then searched UniProtKB for Laccase sequences using the search terms listed in Table 1. Using the "browse by" option in the UniProt GUI, the query was checked for possible contaminating sequences. The contaminant sequences were filtered out using NOT conditions (see Table 1). Finally, 329 protein sequences were collected, with an average sequence length above 200 residues. To further validate the quality of the dataset, the protein descriptions were analyzed with the text clustering functionality in Google-Refine version 2.5. Significant variation was found in the protein descriptions, but no cases of contamination were found. As a final check of data quality, the lengths of the sequences were calculated and plotted as the bar graph shown in Figure 1. Sequences containing non-standard or ambiguous characters were removed from the data set.

Table 1 Search terms used for collecting Laccases-related enzymes from UniProtKB database.


Feature representation of Laccase proteins

Informative features must be extracted from protein sequences to obtain good performance from a machine learning method. We used the following features: amino acid composition (AAC), conjoint triad (CT), composition-transition-distribution (CTD), dipeptide composition (DIPEP), Geary autocorrelation, Moran autocorrelation, Moreau-Broto autocorrelation, physicochemical properties, and a composite vector of amino acid composition and physicochemical properties.

Amino acid composition (AAC)

Each protein sequence is represented as a 20-dimensional feature vector with each element corresponding to the percentage of one of the twenty amino acids [18]. For a given protein sequence x, let the function f(x_i) represent the number of occurrences of the i-th of the 20 standard amino acids. Thus, the composition of the amino acids P(x) in the given sequence can be represented as,

P (x) = [P_{1} (x), P_{2} (x), . . ., P_{20} (x)]

(1)

where P(x_i) is given as,

P (x_{i}) = \frac{f (x_{i})}{\sum_{i = 1}^{20} f (x_{i})}, \quad i = 1, 2, 3, \dots, 20

(2)

Dipeptide composition (DIPEP)

Dipeptide sequence composition is similar to amino acid composition. However, it considers the percentages of dipeptides occurring in a given protein sequence [18]. Thus, the composition of each dipeptide is given as,

P (x_{i}, x_{j}) = \frac{f (x_{i}, x_{j})}{\sum_{i = 1}^{20} \sum_{j = 1}^{20} f (x_{i}, x_{j})}, \quad i, j = 1, 2, 3, \dots, 20

(3)

where $P (x_{i}, x_{j})$ is the ratio of the number of occurrences of a specific dipeptide, $f (x_{i}, x_{j})$, to the total number of dipeptides in the sequence.
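Equations 2 and 3 can be illustrated with a minimal sketch in plain Python. The function names and the toy sequence are hypothetical and are not taken from the published SubClPred programs; the sketch returns fractions, which can be multiplied by 100 to obtain percentages.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Return the 20 fractions P(x_i) of Eq. 2, in AMINO_ACIDS order."""
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    return [c / total for c in counts]


def dipeptide_composition(seq):
    """Return the 400 fractions P(x_i, x_j) of Eq. 3."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[p] / total for p in pairs]


toy = "MKVLAAGDE"  # hypothetical toy sequence, not a real Laccase
aac = amino_acid_composition(toy)
dipep = dipeptide_composition(toy)
print(len(aac), len(dipep))
```

Overlapping dipeptides are counted here, so a sequence of length L contributes L - 1 dipeptides; this is an assumption about the counting convention.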

Conjoint triad (CT)

The conjoint triad descriptor considers the sequence order effect in addition to amino acid composition [19]. It is calculated by grouping the 20 standard amino acids into 7 groups based on physical and chemical similarity [(A,G,V), (I,L,F,P), (Y,M,T,S), (H,N,Q,W), (R,K), (D,E), (C)]. Triads are formed from all combinations of three consecutive group labels, resulting in a vector length of 343 (7 × 7 × 7). Thus, a protein sequence is represented as,

P (x_{i}, x_{j}, x_{k}) = \frac{f (x_{i}, x_{j}, x_{k})}{\sum_{i = 1}^{7} \sum_{j = 1}^{7} \sum_{k = 1}^{7} f (x_{i}, x_{j}, x_{k})}, \quad i, j, k = 1, 2, \dots, 7

(4)

where $f (x_{i}, x_{j}, x_{k})$ is the number of occurrences of a specific triad and $\sum_{i = 1}^{7} \sum_{j = 1}^{7} \sum_{k = 1}^{7} f (x_{i}, x_{j}, x_{k})$ is the number of all triads [19].
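A minimal sketch of the conjoint triad calculation, using the seven groups listed above. The function name is hypothetical, and the ordering of the 343 output elements (lexicographic by group index) is an assumption made for illustration.

```python
from itertools import product

# The seven residue groups listed in the text.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}


def conjoint_triad(seq):
    """Return the 343 triad frequencies of Eq. 4."""
    # A KeyError is raised for non-standard residues.
    classes = [GROUP_OF[aa] for aa in seq]
    counts = dict.fromkeys(product(range(7), repeat=3), 0)
    for i in range(len(classes) - 2):
        counts[tuple(classes[i:i + 3])] += 1
    total = sum(counts.values())
    return [counts[t] / total for t in sorted(counts)]


ct = conjoint_triad("AGVRKC")  # hypothetical toy sequence
print(len(ct))
```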

Composition-transition-distribution (CTD)

In this representation, three local descriptors, Composition (C), Transition (T) and Distribution (D), are used in combination to construct the feature vector. These descriptors are based on the variation in the occurrence of functional groups of amino acids within the primary sequence of the protein [20]. Thus, before computing this feature, the twenty amino acids are clustered into seven functional groups based on the dipoles and volumes of the side chains [19]. The composition descriptor computes the occurrence of each amino acid group along the sequence. Transition represents the percentage frequency with which an amino acid in one group is followed by an amino acid in another group. The distribution feature reflects the dispersion pattern along the entire sequence by measuring the location of the first, 25%, 50%, 75% and 100% of residues of a given group. Hence, a total of 63 features (7 composition, 21 transition and 35 distribution) are constructed to represent a protein.
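The 7 + 21 + 35 layout can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it reuses the seven conjoint triad groups as the functional groups, counts transitions in both directions for each unordered pair of groups, and locates the distribution positions with a ceiling rule. All names are hypothetical.

```python
from itertools import combinations
from math import ceil

# Assumption: the seven groups from the conjoint triad section are reused.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}


def ctd(seq):
    """Return 7 composition + 21 transition + 35 distribution features."""
    cls = [GROUP_OF[aa] for aa in seq]
    n = len(cls)
    # Composition: fraction of residues in each group.
    comp = [cls.count(g) / n for g in range(7)]
    # Transition: fraction of adjacent pairs that switch between groups a and b.
    trans = []
    for a, b in combinations(range(7), 2):
        hits = sum(1 for p, q in zip(cls, cls[1:]) if {p, q} == {a, b})
        trans.append(hits / (n - 1))
    # Distribution: relative position of the first, 25%, 50%, 75%, 100% residue.
    dist = []
    for g in range(7):
        pos = [i + 1 for i, c in enumerate(cls) if c == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not pos:
                dist.append(0.0)  # group absent from the sequence
            else:
                k = max(1, ceil(frac * len(pos)))
                dist.append(pos[k - 1] / n)
    return comp + trans + dist


features = ctd("AGVRKC")  # hypothetical toy sequence
print(len(features))
```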

Autocorrelation feature vectors

Autocorrelation features describe the level of correlation between residues separated by a given distance along a protein sequence in terms of a specific physicochemical property. They are defined based on the distribution of amino acid properties along the sequence. Eight amino acid properties were used for deriving the autocorrelation descriptors.

Moran autocorrelation

The Moran autocorrelation (MAC) descriptor of a protein is defined as:

D_{M A C} (d) = \frac{\frac{1}{N - d} \sum_{j = 1}^{N - d} (P_{j} - \bar{P}) \times (P_{j + d} - \bar{P})}{\frac{1}{N} \sum_{j = 1}^{N} {(P_{j} - \bar{P})}^{2}}

(5)

where N is the length of the protein sequence, d = 1, 2, ..., 30 is the distance between one residue and its neighbors, and P_j and P_{j+d} are the properties of the amino acids at positions j and j+d, respectively. $\bar{P} = \sum_{j = 1}^{N} \frac{P_{j}}{N}$ is the average of the considered property P along the sequence.

Geary autocorrelation

Geary autocorrelation (GA) descriptor of a protein is defined as:

D_{G A} (d) = \frac{\frac{1}{2 (N - d)} \sum_{j = 1}^{N - d} {(P_{j} - P_{j + d})}^{2}}{\frac{1}{N - 1} \sum_{j = 1}^{N} {(P_{j} - \bar{P})}^{2}}

(6)

$\bar{P}$, N, P_j and P_{j+d} are defined in the same way as above.

Moreau-Broto autocorrelation

Moreau-Broto autocorrelation (MBA) descriptor of a protein is defined as:

D_{M B A} (d) = \sum_{j = 1}^{N - d} P_{j} \times P_{j + d}

(7)

N, P_j and P_{j+d} are defined in the same way as above.
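The three descriptors in Equations 5 to 7 can be sketched for a single lag d as below. The sketch assumes the sequence has already been converted into a list of numeric property values (one value per residue); the eight property scales used in the study are not reproduced here, and the function names and toy profile are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def moran(p, d):
    """Moran autocorrelation (Eq. 5) of property profile p at lag d."""
    n, m = len(p), _mean(p)
    num = sum((p[j] - m) * (p[j + d] - m) for j in range(n - d)) / (n - d)
    den = sum((v - m) ** 2 for v in p) / n
    return num / den


def geary(p, d):
    """Geary autocorrelation (Eq. 6) of property profile p at lag d."""
    n, m = len(p), _mean(p)
    num = sum((p[j] - p[j + d]) ** 2 for j in range(n - d)) / (2 * (n - d))
    den = sum((v - m) ** 2 for v in p) / (n - 1)
    return num / den


def moreau_broto(p, d):
    """Moreau-Broto autocorrelation (Eq. 7) of property profile p at lag d."""
    return sum(p[j] * p[j + d] for j in range(len(p) - d))


profile = [1.0, 2.0, 3.0, 4.0]  # hypothetical property values for 4 residues
print(moran(profile, 1), geary(profile, 1), moreau_broto(profile, 1))
```

A full descriptor would repeat these calls for every lag d = 1, ..., 30 and every property scale, which requires sequences longer than 30 residues.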

Physicochemical properties

Physicochemical properties of amino acids have been used successfully in numerous prediction tools [18]. In this study, we grouped the amino acids of a protein into classes based on physicochemical properties. The theoretical pI, molecular weight, and length of the protein were also used in the feature vector. The non-composition-based values were divided by the length or mass of the protein titin, the largest known protein, in order to provide values between zero and one. Molecular weights were calculated by adding the weights of each amino acid in the sequence in a suitable way related to their chemical activity. A detailed description of these properties is provided in Table 2.

Table 2 Physicochemical properties used to represent a protein for Laccase subclass prediction.


Split amino acid composition

Split amino acid composition aims to capture information about signal peptides in the N- or C-terminal region of a protein. The amino acid compositions of the N-terminal region, the center, and the C-terminal region are computed and then concatenated. The N- and C-terminal regions are the first and last 25 amino acids in the sequence. Thus, a protein sample is represented as a 60-element vector,

P (x) = [AAC_{N-terminal}, AAC_{Center}, AAC_{C-terminal}]

(8)
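A minimal sketch of Equation 8, assuming a window of 25 residues at each terminus as stated above. The helper names are hypothetical, and returning zeros for an empty center region (sequences of 50 residues or fewer) is an assumption, since the text does not describe that case.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def _aac(segment):
    """Amino acid fractions of a segment; all zeros if the segment is empty."""
    counts = [segment.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1  # avoid division by zero for empty segments
    return [c / total for c in counts]


def split_aac(seq, terminal=25):
    """Concatenate the AAC of the N-terminal, center and C-terminal regions."""
    n_term = seq[:terminal]
    center = seq[terminal:-terminal]
    c_term = seq[-terminal:]
    return _aac(n_term) + _aac(center) + _aac(c_term)


toy = "A" * 25 + "G" * 10 + "C" * 25  # hypothetical 60-residue sequence
vec = split_aac(toy)
print(len(vec))
```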

Unsupervised classification

Unsupervised learning organizes data based on the similarity patterns within them. In this study, clustering was used to group the data into classes whose members share a similarity not found in other classes. We followed a methodology similar to that outlined in [17]. We first used a self-organizing map (SOM) to identify the likely number of groups in the dataset and then used that information in k-means clustering to divide the data into clusters.

Self-organizing maps (SOM)

SOMs are a type of artificial neural network used in unsupervised learning to produce low-dimensional discrete representations of the vector space represented by some training data [21]. The discrete elements in SOMs are called nodes or neurons. SOMs have been used widely in bioinformatics and computational biology, mostly for tasks such as finding gene expression patterns and protein classification [22, 23]. The SOM contains m neurons, each of which holds a d-dimensional prototype vector, where d is the dimension of the input vectors. First, initial values are given to each prototype vector. When training begins, a vector 'x' from the input data is randomly chosen. The distances from 'x' to the prototype vectors are computed and the neuron closest to 'x', the best matching unit (BMU), is selected. The radius of the neighborhood of the BMU is calculated, and any neurons found within the radius are deemed neighbors. Each neighbor's prototype vector is adjusted to be more similar to the input vector. This procedure is then repeated for a fixed number of iterations (N) [21]. In this study, SOMs of multiple dimensions were studied and N was 10,000 for all dimensions. For the SOM implementation, we used the open source machine learning package 'Orange.py', which is freely available at http://orange.biolab.si [24].
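The training loop described above can be sketched in plain Python. This is not the Orange implementation used in the study: the linear decay of the neighborhood radius and learning rate, the initial values, and all names are assumptions chosen only to make the steps concrete.

```python
import math
import random


def train_som(data, rows, cols, iterations=1000, lr0=0.5, seed=0):
    """Train a rows x cols SOM and return one prototype vector per neuron."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Random initial prototype vectors.
    protos = [[rng.random() for _ in range(dim)] for _ in grid]
    radius0 = max(rows, cols) / 2
    for t in range(iterations):
        x = rng.choice(data)  # randomly chosen input vector
        # Best matching unit: neuron whose prototype is closest to x.
        bmu = min(range(len(grid)),
                  key=lambda k: sum((a - b) ** 2 for a, b in zip(protos[k], x)))
        decay = 1 - t / iterations  # assumed linear decay schedule
        radius, lr = radius0 * decay, lr0 * decay
        for k in range(len(grid)):
            # Neurons within the radius of the BMU on the grid are neighbors.
            if math.dist(grid[k], grid[bmu]) <= radius:
                protos[k] = [w + lr * (xi - w) for w, xi in zip(protos[k], x)]
    return protos


data = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]  # hypothetical inputs
protos = train_som(data, rows=2, cols=2, iterations=500)
print(len(protos))
```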

K-means clustering

K-means clustering is an unsupervised learning algorithm that groups an input data set into 'k' parts or clusters [25] based on a similarity measure. K-means is one of the oldest and simplest clustering methods, yet it remains a useful tool for cluster analysis. It scales well to large data sets and medium numbers of clusters, but has the drawback that the expected number of clusters must be specified. The basic k-means algorithm begins by initializing k cluster centers (centroids) and iterates to minimize the average distance between the centroids and their cluster members. Each data point belongs to the cluster whose centroid is closest to it. The centroids were pre-computed using the neurons from the SOM. In this study, the open source machine learning library 'Sci-Kit Learn' was used to implement the k-means clustering method [26].
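The iteration described above, started from pre-computed centroids rather than random ones, can be sketched as follows. This is a plain-Python illustration with hypothetical names and toy data, not the Sci-Kit Learn implementation used in the study.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm seeded with the given initial centroids."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(len(centroids)),
                          key=lambda k: sq_dist(p, centroids[k]))
                      for p in points]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]  # hypothetical data
seeds = [(1, 1), (9, 9)]  # stand-ins for centroids derived from SOM neurons
labels, centroids = kmeans(points, seeds)
print(labels, centroids)
```

In scikit-learn, comparable seeding is possible by passing an array of initial centers as the `init` argument of `KMeans`; whether the study did exactly this is not stated in the text.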

SOM for finding K number and centroid locations for K-means clustering

In this study, an SOM containing N neurons was first computed, and the Davies-Bouldin index (DBI) of the map was calculated by treating the neurons as clusters. Then, N × (N-1) prototype maps were created by combining each neuron with each of the other neurons. The DBI was computed for all prototype maps, and the prototype map with the lowest DBI was selected. If the DBI of this map was lower than that of the current map, the current map was replaced by this prototype map, and the previous steps were repeated until no prototype map with a lower DBI could be found. Each iteration reduces the size of the map by one; the final number of neurons was used as the k value for k-means clustering, and the cluster centroids were computed from the vectors belonging to each neuron. The quality of the k-means clustering was measured using the difference between the inter-cluster and intra-cluster variance and the DBI. Because SOM training is stochastic, the clustering procedure was run 500 times for each vector type to obtain the optimum number of clusters. The optimum clustering was chosen by taking the most frequently occurring cluster number and, among the runs with that number, selecting the one with the largest difference between inter-cluster and intra-cluster variance and the smallest DBI.
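The merge loop can be sketched under a simplifying assumption: each sample carries the label of its best matching neuron, so combining two neurons amounts to merging two label sets. The sketch evaluates each unordered pair once, uses q = 2 and Euclidean centroid distances for the DBI, and stops at two clusters because the index is undefined for one. All names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def dbi(points, labels):
    """Davies-Bouldin index with q = 2 and Euclidean centroid distances."""
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    cent = {l: [sum(c) / len(m) for c in zip(*m)] for l, m in groups.items()}
    scat = {l: math.sqrt(sum(math.dist(p, cent[l]) ** 2 for p in m) / len(m))
            for l, m in groups.items()}
    worst = [max((scat[i] + scat[j]) / math.dist(cent[i], cent[j])
                 for j in groups if j != i) for i in groups]
    return sum(worst) / len(worst)


def merge_until_dbi_stops_improving(points, labels):
    """Greedily merge the pair of clusters that lowers the DBI the most."""
    labels = list(labels)
    while len(set(labels)) > 2:  # the DBI needs at least two clusters
        best_score, best_labels = dbi(points, labels), None
        for a, b in combinations(sorted(set(labels)), 2):
            trial = [a if l == b else l for l in labels]  # merge b into a
            score = dbi(points, trial)
            if score < best_score:
                best_score, best_labels = score, trial
        if best_labels is None:  # no merge lowers the DBI
            break
        labels = best_labels
    return labels


# Hypothetical data: one tight group split across labels 0 and 1.
pts = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1),
       (10, 0), (10, 0.1), (0, 10), (0.1, 10)]
merged = merge_until_dbi_stops_improving(pts, [0, 0, 1, 1, 2, 2, 3, 3])
print(len(set(merged)))
```

The number of distinct labels that remains plays the role of k, and the mean vector of each remaining label set plays the role of an initial centroid for k-means.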

Davies-Bouldin index (DBI)

The DBI is a metric for evaluating overall quality of a given set of clusters originally developed to aid in determining the optimum number of clusters within a dataset [27]. Minimization of the DBI of the clusters within a dataset seems to generally indicate natural partitions of data sets. However, it should be noted that this is a heuristic approach and good values do not always indicate the best clustering arrangement. DBI of a clustering approach is defined as,

D B \equiv \frac{1}{N} \sum_{i = 1}^{N} D_{i}

(9)

where D_i is the worst-case (maximum) value of R_{i,j} over all other clusters j,

D_{i} \equiv max_{j : i \neq j} R_{i, j}

(10)

R_i,j is a measure of the clustering quality, defined as

R_{i, j} \equiv \frac{S_{i} + S_{j}}{M_{i, j}}

(11)

The measure of scatter (S) within a given cluster i, is defined as

S_{i} = \sqrt[q]{\frac{1}{T_{i}} \sum_{j = 1}^{T_{i}} {|X_{j} - A_{i}|}^{q}}

(12)

where X_j is an n-dimensional feature vector assigned to cluster C_i, T_i is the number of vectors in C_i, and q was set to two. M_{i,j} is a measure of separation between two clusters, defined as

M_{i, j} = {\| A_{i} - A_{j} \|}_{p}

(13)

where A_i is the centroid of cluster C_i containing samples X₁,X₂......X_k and computed as,

A_{i} = \frac{X_{1} + X_{2} + X_{3} + \dots + X_{k}}{k}
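Equations 9 to 13 can be combined into a short function. The sketch uses q = 2 as stated above and assumes p = 2 (Euclidean distance between centroids), since the value of p is not given in the text; the function name and toy data are hypothetical.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index (Eq. 9) with q = 2 and p = 2."""
    clusters = sorted(set(labels))
    members = {c: [p for p, l in zip(points, labels) if l == c]
               for c in clusters}
    # Centroid A_i: mean of the vectors in cluster i.
    centroid = {c: [sum(col) / len(m) for col in zip(*m)]
                for c, m in members.items()}
    # Scatter S_i (Eq. 12): root mean squared distance to the centroid.
    scatter = {c: math.sqrt(sum(math.dist(p, centroid[c]) ** 2 for p in m)
                            / len(m))
               for c, m in members.items()}
    total = 0.0
    for i in clusters:
        # D_i (Eq. 10): largest R_ij (Eq. 11) over all other clusters j.
        total += max((scatter[i] + scatter[j])
                     / math.dist(centroid[i], centroid[j])
                     for j in clusters if j != i)
    return total / len(clusters)


pts = [(0, 0), (0, 2), (10, 0), (10, 2)]  # hypothetical data
score = davies_bouldin(pts, [0, 0, 1, 1])
print(score)
```

For this toy input each cluster has scatter 1 and the centroids are 10 apart, so every R_{i,j} is 2/10 and the index is 0.2. The function requires at least two clusters with distinct centroids.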

Intra-cluster variance

Intra-cluster variance was calculated using the Euclidean distances between the points in the cluster and the centroid of the cluster.

Inter-cluster variance

Inter-cluster variance was calculated using the Euclidean distance between the centroids of the clusters.

Co-occurrence matrix analysis

The cluster numbers returned by the clustering approach are arbitrary, which presents a problem when trying to assess the similarity between runs. Thus, to assess how consistently samples are assigned to a particular group, a co-occurrence matrix was generated to show the number of times a given data sample in one group occurred with the samples of other groups. The higher the number of data samples occurring together, the more consistent the clusters are across runs.
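One way to build such a matrix is to count, for every pair of samples, the number of runs in which the two samples received the same cluster label. This sample-by-sample reading is an interpretation of the description above, and the names and toy labels are hypothetical. Because only label equality within a run is used, the arbitrary numbering of clusters does not matter.

```python
def co_occurrence(runs):
    """Count how often each pair of samples shares a cluster across runs."""
    n = len(runs[0])
    matrix = [[0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1
    return matrix


# Three hypothetical runs over three samples; label values are arbitrary.
runs = [[0, 0, 1], [1, 1, 0], [0, 1, 1]]
matrix = co_occurrence(runs)
print(matrix)
```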

Support vector machine (SVM)

SVMs are a class of supervised learning algorithms based on the optimization principle from statistical learning theory [28, 29]. Support vector machines have been used widely in computational biology for diverse topics such as subcellular localization [18, 30–32], protein function prediction [33], secondary structure prediction [34], and disease forecasting [35]. SVMs solve classification problems by calculating a hyperplane that separates the training data with a maximum margin. For multi-class classification, the problem is transformed into a series of binary classifications. There are numerous strategies for decomposing a multi-class problem into binary classifications; in this study, the one-versus-rest approach was used. The SVM classifiers were developed using the SVM_Light package (https://github.com/daoudclarke/pysvmlight), which is an open source package for SVM implementation [36]. In a preliminary study, the RBF kernel was found to perform best. Therefore, we used the RBF kernel in all our SVM classifiers.

Performance evaluation parameters

To assess the performance of the developed models, we used a five-fold cross validation test on the training dataset and then tested the models in an independent test. In a five-fold cross-validation procedure, the original sample is randomly partitioned into five equal size subsamples. Of the five subsamples, a single subsample is retained as the validation data for testing the model, and the remaining four subsamples are used as training data. The cross-validation process is then repeated 5 times (the folds), with each of the 5 subsamples used exactly once as the validation data. The results from the five-folds are then averaged to produce a single estimation. The performance is measured by the parameters such as overall sensitivity, specificity, precision, Matthews Correlation Coefficient (MCC) and average accuracy. These parameters are defined as follows:
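The partitioning step can be sketched as a generator of training and validation index lists. The function name, the fixed random seed, and the interleaved way of forming the folds are assumptions for illustration; when the number of samples is not divisible by five, fold sizes differ by at most one.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition of the samples
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, validation


splits = list(k_fold_splits(10))
for train, validation in splits:
    print(len(train), len(validation))
```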

(i) Sensitivity or coverage of positive examples: the percentage of positive samples that are correctly predicted,

Sensitivity (S_{n}) = \frac{TP}{TP + FN} \times 100

(14)

(ii) Specificity or coverage of negative examples: the percentage of negative samples that are correctly predicted as negative,

Specificity (S_{p}) = \frac{TN}{TN + FP} \times 100

(15)

(iii) Accuracy: the percentage of correctly predicted samples,

Accuracy (Acc) = \frac{TP + TN}{TP + TN + FP + FN} \times 100

(16)

(iv) Error rate: the percentage of incorrect predictions,

Error rate (ER) = \frac{FP + FN}{TP + TN + FP + FN} \times 100

(17)

(v) Precision: the percentage of positive predictions that are correct,

Precision = \frac{TP}{TP + FP} \times 100

(18)

(vi) Matthews correlation coefficient (MCC): often regarded as one of the most robust measures of a class prediction method. An MCC of 1 indicates perfect prediction, while 0 indicates completely random prediction.

MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP) (TP + FN) (TN + FP) (TN + FN)}}

(19)

where true positive (TP) is the number of positive samples that are predicted correctly; false negative (FN) is the number of positive samples that are predicted to be negative; false positive (FP) is the number of negative samples that are predicted to be positive; and true negative (TN) is the number of negative samples that are predicted correctly as negative.
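The six measures can be computed directly from the four counts. The function name and the example counts below are hypothetical; the sketch does not guard against zero denominators, which occur when a class is never present or never predicted.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return the measures of Eqs. 14-19 for one binary classifier."""
    total = tp + tn + fp + fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "accuracy": 100 * (tp + tn) / total,
        "error_rate": 100 * (fp + fn) / total,
        "precision": 100 * tp / (tp + fp),
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }


metrics = classification_metrics(tp=40, tn=40, fp=10, fn=10)
print(metrics)
```

With these counts, sensitivity, specificity, accuracy and precision are all 80%, the error rate is 20%, and the MCC is (1600 - 100) / 2500 = 0.6. In a one-versus-rest setting, the counts would be collected separately for each class.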

Feature scaling

To identify the features most relevant for classifying Laccase types, a feature scaling approach was applied. Feature scaling was performed by univariate feature selection, using the functions provided by Sci-Kit Learn through the program scale_features.py [26]. Univariate feature selection was implemented by treating each element of the descriptor vectors as independent of the others and ranking the elements based on how their values differ between classes.

Domain map and phylogenetic trees construction

The program doMosaic was used to create domain maps for visualization of the domains in the initial data and in the newly generated classes [37]. InterProScan was used to obtain information about the domains in the Laccases [38]. To show the relationship between Laccase samples, a phylogenetic tree was generated from the cleaned dataset using Clustal Omega version 1.0.3 [39]. Dendroscope version 3.2.10 was used for visualization of the tree.

Results and discussion

We studied several SOM architectures to examine how the Laccases cluster under each descriptor. The clustering algorithm was run 500 times for each SOM map size. The clustering performance of each descriptor is listed in Table 3. Physicochemical properties showed the best average performance among all the feature vectors, with 12 as the optimum number of clusters. This is in close agreement with the study on feruloyl esterase classification, where the strongest descriptor was the composite vector combining amino acid composition and physicochemical properties [17]. We also performed a co-occurrence matrix analysis to examine the consistency of cluster instances in each group. The physicochemical property descriptor shows consistency in cluster instances between runs and across different SOM dimensions. The co-occurrence matrix is shown in Figure 2. The 6 × 6 SOM dimension gave the best run, with a DBI of 0.37, an inter-cluster variance of 0.0088 and an intra-cluster variance of 0.0015. The performance of the physicochemical descriptor at each SOM dimension is listed in Table 4. The proteins classified in each group after clustering are listed in Table 5.

Table 3 Performance of different descriptors in clustering of Laccases using various SOM dimensions


Table 4 The average clustering performance at each SOM dimensions for physicochemical properties.


Table 5 Distribution of Laccases in different identified clusters under each taxa.


Analysis of the taxa in each class revealed that the majority of the classes were dominated by a single taxon, as reported in Table 5. Several review papers containing large tables of experimentally validated Laccases with various properties were considered to validate the clusters. Unfortunately, it was difficult to draw patterns from these, as the substrates tested varied widely and heterologously expressed Laccases often have drastically different activities due to different amounts of glycosylation [15, 40, 41]. To better understand what drives the distinction between classes, feature scaling was applied to the physicochemical properties of all classes together, as well as to each class against each other. The major contributing features were the percentage of negatively charged amino acids, the isoelectric point, and the percentage of acidic or amidic groups. Detailed information about the significant features is shown in Figure 3. This is particularly interesting because Laccases as a group operate over a wide range of pH values, while individual enzymes seem to have fairly specific or broad pH and substrate ranges [41]. Also, it has been reported that different Laccases produced by the fungus Coriolus versicolor were easily distinguishable by their isoelectric points [42]. To examine the differences between classes in terms of physicochemical properties, the best features were calculated for each class against each other and are shown in Figure 4. This showed that the variation seems to be strongly influenced by acid/base properties, followed by small or aliphatic residues. The isoelectric point occurred most often within the top three features with 45 cases, followed by basic amino acids with 34 cases, acidic with 32 cases, ionizable amino acids with 23 cases, acidic and amidic with 13 cases, charged residues with 12 cases, h-bonding and small amino acids with 8 cases each, tiny with 6, neutral and hydrophobic with 4, aliphatic with 4, hydrophilic with 2, and molecular weight with 2.

Additionally, we analyzed the descriptor values for physicochemical properties and amino acid composition between classes with a standard t-test. The t-test results of the AAC features between the 12 classes are listed in Additional file 2. They show that Ala, Cys, Asp, Glu, His, Lys, Met, Asn, Arg, Ser, and Thr vary significantly between the classes. This is particularly interesting because the amino acids with the highest numbers of statistically significant differences between classes seem to be involved in important aspects of Laccases. For example, the top two amino acids are aspartic acid and lysine, with significant differences in 51 of the 66 possible class comparisons. Aspartic acid plays important roles in many Laccase catalytic domains, such as assisting in substrate channels in basidiomycete Laccases, affecting Laccase activity of C-terminal domains when mutated in bacterial Laccases, and assisting in the exit of protons from the N-terminal domains of bacterial Laccases [43–45]. Lysine can also be found widely in catalytic domains; for example, C-terminal lysines have been implicated in the inactivation of heterologously produced Laccases [46]. Aside from function, lysines are also widely used as a cross-linking target to bind Laccases to various materials [47–49]. Glutamic acid had the next most significant differences between classes. It is found in Leu-Glu-Ala motifs, which follow the copper-ligating histidines and are thought to be related to Laccases with higher redox potentials [50]. Asparagine followed closely with 41 significant differences. Many Laccases are known to contain asparagines that serve as sites for N-linked glycosylation [51]. These sites have been shown to be involved in the regulation of Laccase activity through catalytic sites such as the Leu-Met-Asn motif, which often replaces the previously mentioned Leu-Glu-Ala motif [50]. N-glycosylation has also been found to provide protection against proteolysis [51, 52]. Other types of glycosylation, such as O-linked glycosylation, are also major factors, so it comes as no surprise that both serine and threonine are high on the list [52].
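The pairwise comparison described above can be sketched as follows. With 12 classes there are 12 × 11 / 2 = 66 unordered class pairs, which matches the "66 possible class comparisons" in the text. The sketch assumes that "standard t-test" means the pooled-variance two-sample Student's t-test; it computes only the t statistic and degrees of freedom, since converting these to p-values needs a t-distribution routine outside the Python standard library. The input values are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Pooled-variance two-sample t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


def pairwise_t(values_by_class):
    """t statistic and df for every unordered pair of classes."""
    return {
        (i, j): student_t(values_by_class[i], values_by_class[j])
        for i, j in combinations(sorted(values_by_class), 2)
    }


# Hypothetical Asp percentages for three classes (illustrative values only).
demo = {1: [5.0, 6.0, 7.0], 2: [8.0, 9.0, 10.0], 3: [5.5, 6.5, 7.5]}
results = pairwise_t(demo)

# Twelve classes give 66 unordered pairs.
n_pairs_12 = len(list(combinations(range(1, 13), 2)))
```

Counting, for each amino acid, how many of the 66 pairs fall below a chosen significance threshold gives tallies of the kind quoted above (for example, 51 of 66 for aspartic acid and lysine).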

In our other statistical analysis, the t-test results for the important physicochemical properties identified in Figure 3 are listed in Additional file 3. They show that all the physicochemical properties identified as important in discriminating between classes are also statistically significant. Because the generated classes contain many significant differences in physicochemical properties, and the amino acids with high numbers of significant differences are also strongly related to Laccase function, we believe these classes may indeed represent different functional classes of Laccases. To investigate the classes further, a cladogram was constructed from a multiple sequence alignment of the sequences used for clustering. We then mapped our clusters and the classes from LccED to the cladogram (Figures 5a and 5b, respectively) [15, 16]. Although many of the clusters are dominated by a single taxon, when mapped to the cladogram they are widely dispersed throughout its taxonomic regions. This contrasts sharply with the LccED classes, which largely follow taxonomy. Many of the neighbors in the tree are enzymes from the same or similar organisms; these could indicate Laccases of different function within an organism.

Classification framework

To allow for the classification of newly discovered Laccases and Laccases with no experimental evidence, a Support Vector Machine-based classification system was developed. To accomplish this, 90% of the collected Laccase data was used for 5-fold cross-validation and the remaining 10% was kept aside for independent testing. As physicochemical descriptors were used to build the classes, physicochemical properties were also used to develop the SVM classifiers. The developed models were further used to classify sequences annotated as Laccases with "homology" or "predicted" level evidence in the UniProtKB database.
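The data partitioning described above can be sketched with the standard library alone. This is a minimal illustration, not the authors' code: stratifying the 10% hold-out by class, the random seeds, and the function names are all assumptions, and the labels are synthetic.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.10, seed=0):
    """Return (train_idx, test_idx), holding out ~test_fraction of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


def k_folds(indices, k=5, seed=0):
    """Shuffle indices and deal them into k near-equal, disjoint folds."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


# Synthetic labels: 100 sequences spread evenly over 5 classes.
labels = [i % 5 for i in range(100)]
train_idx, test_idx = stratified_split(labels)
folds = k_folds(train_idx)

# In each cross-validation round, one fold is held out for validation
# and the remaining four folds are used to train the classifier.
for held_out in range(len(folds)):
    validation = folds[held_out]
    training = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
```

The independent 10% is never touched during cross-validation, so its results give a separate check on the models selected with the other 90%.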

5-fold cross-validation

The performance of the classifier in 5-fold cross-validation for all classes is reported in Table 6. The results show that the model achieves an overall accuracy of 99.03%, MCC of 0.9367, precision of 94.20%, sensitivity of 94.20% and specificity of 99.47%. The overall specificity is extremely high, indicating a low rate of misclassified sequences. Considering the classes individually, the highest metrics achieved were an MCC of 1.0 and accuracy, specificity, and sensitivity of 100%. The lowest performance was an accuracy of 98.98%, MCC of 0.7252, sensitivity of 80% and specificity of 99.31%.
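For reference, the metrics quoted above can be computed from one-vs-rest counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) using their standard definitions. The sketch below is illustrative: the counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here rather than something stated in the text.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Standard one-vs-rest performance metrics from confusion counts."""
    total = tp + fp + tn + fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # Matthews correlation coefficient; defined as 0 when undefined.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts for a single class (not taken from Table 6).
m = binary_metrics(tp=8, fp=1, tn=90, fn=1)
print(m)
```

Because negatives greatly outnumber positives in a 12-class one-vs-rest setting, accuracy and specificity tend to stay high even when sensitivity drops, which is why MCC is a useful companion metric.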

Table 6 Performance of physicochemical descriptor classifier in a 5-fold cross-validation test.


Independent testing

Performance results on the independent test data are listed in Table 7. The model also performs well here, with an overall accuracy of 97.98%, error rate of 2.02%, MCC of 0.8678, precision of 87.88%, sensitivity of 87.88% and specificity of 98.90%. It should be noted that the MCC of cluster-3 was zero. However, this class contains only one sequence and performs well in cross-validation, so we believe it is still credible.

Table 7 Performance of physicochemical descriptor classifier on independent test data.


Confusion matrix

Confusion matrices were constructed to better understand which classes are most similar to one another. The confusion matrix for the independent test set is shown in Table 8. According to the confusion matrix, a few proteins in classes 1, 2, 8, 10 and 11 are predicted as other classes. The results in the confusion matrix show that the developed classifier predicts most samples correctly.
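A multi-class confusion matrix of the kind discussed above can be built by counting (true class, predicted class) pairs, and per-class one-vs-rest counts can then be read off its rows and columns. The sketch below uses three made-up classes and labels purely for illustration; the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def one_vs_rest_counts(matrix, k):
    """Return (TP, FP, TN, FN) for the class at row/column index k."""
    total = sum(map(sum, matrix))
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                    # true class k, predicted elsewhere
    fp = sum(row[k] for row in matrix) - tp     # other classes predicted as k
    tn = total - tp - fn - fp
    return tp, fp, tn, fn


# Made-up labels for three classes (illustration only).
classes = [1, 2, 3]
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]

cm = confusion_matrix(y_true, y_pred, classes)
counts_class1 = one_vs_rest_counts(cm, 0)
```

Off-diagonal cells show which pairs of classes are confused with each other, which is the information used above to judge how similar classes are.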

Table 8 Confusion matrix for the predicted Laccase subtypes from 5-fold cross-validation testing.


ROC curves

ROC curves are important to consider for prediction systems because they give a measure of credibility and/or reliability. Each point on a curve corresponds to a confidence score threshold of a single classifier. For each ROC curve, the area under the curve (AUC) was computed. The AUC indicates the probability that a positive sequence receives a higher score than a negative sequence when one of each is selected at random [53]. The further the curve shifts toward the upper left, the more accurate the predictor. We calculated the ROC curves for each class for 5-fold cross-validation and independent set testing separately. The ROC curves for 5-fold cross-validation are shown in Figure 6 and those for the independent set in Figure 7. Each figure contains a line for each class in the prediction system as well as a line showing the average performance of all classes. All classes show excellent performance, with lines very close to the left side of the chart, indicating a high rate of correct predictions from these models. Indeed, the overall area under the curve rounds to 1.00, showing the reliability of our classifier.
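The probabilistic reading of the AUC given above can be made concrete by counting, over all positive/negative pairs, how often the positive sequence receives the higher score. The sketch below does this directly; counting ties as one half is the usual convention and is an assumption here, and the scores are invented for illustration.

```python
def auc_by_pairs(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


# Invented classifier scores for one class versus the rest.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]
auc = auc_by_pairs(pos, neg)
```

An AUC of 1.0 means every positive sequence outscores every negative one, while 0.5 corresponds to random ranking.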

Functional annotation of different classes with domain maps

To investigate the role of domains in the functional variation between classes, we generated domain maps for the sequences in each class. Eleven different types of domains were found within the dataset. The most frequently occurring domains are PF07732, PF00394, PF07731 and PF02578. The first three are mostly found in plants and fungi, whereas PF02578 is found mostly in sequences of bacterial or mammalian origin. Class 4 contained a couple of polyphenol oxidase domains and tyrosinase domains. The domain maps generated for all the classes are shown in Figure S1 in the supplementary material. The majority of the domain maps were highly similar within and between classes with respect to the domains present. However, there were some differences in the positions of the domains. We believe that these differences in the relative positions of the domains could also account for functional differences.

Classification of Laccase homologs from UniProtKB

The efficiency of our prediction approach was tested by classifying the Laccases in UniProtKB with homology or predicted level evidence. Out of the 1656 sequences retrieved, 1587 were assigned to one of the 12 classes, as reported in Table 9. These annotations could be a good resource for the scientific community working in these areas.

Table 9 Classification of UniProt sequences with our method for those annotated as Laccases with homology or predicted level evidence in the UniProtKB database.


Web tool for classification of Laccases

We have developed a web resource for the classification of Laccase subtypes by implementing the machine learning models. It should be useful to researchers characterizing newly found Laccase sequences. The tool can be found at http://lacsubpred.bioinfo.ucr.edu/. We have also provided the code used to develop the clustering and classification approach as an open source package available at https://github.com/tweirick/SubClPred.

Conclusion

In this work, we present a systematic computational approach to identify Laccase subtypes. First, a novel clustering method was developed to group the Laccase subtypes using the experimental data available in UniProtKB. Then a classification method based on a machine learning approach was developed to generalize the functions of Laccases in each class. The identified groups can be a useful resource for biologists studying the characterization of Laccases, particularly for researchers in the biofuel area.

Availability

LacSubPred, the web resource developed from this study, is freely available at http://lacsubpred.bioinfo.ucr.edu/.

Abbreviations

ROC: Receiver Operating Characteristic
MCC: Matthews Correlation Coefficient
SOM: Self-Organized Maps
SVM: Support Vector Machines
DBI: Davies-Bouldin Index
AAC: Amino Acid Composition
CT: Conjoint Triad
CTD: Composition-Transition-Distribution
DIPEP: Dipeptide Composition
MA: Moran Autocorrelation
MBA: Moreau-Broto Autocorrelation

References

1. Bourbonnais R, Paice MG: Oxidation of non-phenolic substrates: an expanded role for Laccase in lignin biodegradation. FEBS letters. 1990, 267 (1): 99-102. 10.1016/0014-5793(90)80298-W.
2. Clutterbuck A: Absence of Laccase from yellow-spored mutants of Aspergillus nidulans. Journal of general microbiology. 1972, 70 (3): 423-435. 10.1099/00221287-70-3-423.
3. Geiger JP, Nicole M, Nandris D, Rio B: Root rot diseases of Hevea brasiliensis. European journal of forest pathology. 1986, 16 (1): 22-37. 10.1111/j.1439-0329.1986.tb01049.x.
4. O'Malley DM, Whetten R, Bao W, Chen CL, Sederoff RR: The role of Laccase in lignification. The Plant Journal. 1993, 4 (5): 751-757. 10.1046/j.1365-313X.1993.04050751.x.
5. Sharma P, Goel R, Capalash N: Bacterial Laccases. World Journal of Microbiology and Biotechnology. 2007, 23 (6): 823-832. 10.1007/s11274-006-9305-3.
6. Rodríguez Couto S, Toca Herrera JL: Industrial and biotechnological applications of Laccases: a review. Biotechnology advances. 2006, 24 (5): 500-513. 10.1016/j.biotechadv.2006.04.003.
7. Osma JF, Toca-Herrera JL, Rodríguez-Couto S: Uses of Laccases in the food industry. Enzyme research. 2010, 2010:
8. Baldrian P: Fungal Laccases-occurrence and properties. FEMS microbiology reviews. 2006, 30 (2): 215-242. 10.1111/j.1574-4976.2005.00010.x.
9. Dwivedi UN, Singh P, Pandey VP, Kumar A: Structure-function relationship among bacterial, fungal and plant Laccases. Journal of Molecular Catalysis B: Enzymatic. 2011, 68 (2): 117-128. 10.1016/j.molcatb.2010.11.002.
10. Larrondo LF, Salas L, Melo F, Vicuna R, Cullen D: A novel extracellular multicopper oxidase from Phanerochaete chrysosporium with ferroxidase activity. Applied and environmental microbiology. 2003, 69 (10): 6257-6263. 10.1128/AEM.69.10.6257-6263.2003.
11. Gavnholt B, Larsen K: Molecular biology of plant Laccases in relation to lignin formation. Physiologia plantarum. 2002, 116 (3): 273-280. 10.1034/j.1399-3054.2002.1160301.x.
12. Giardina P, Faraco V, Pezzella C, Piscitelli A, Vanhulle S, Sannia G: Laccases: a never-ending story. Cellular and Molecular Life Sciences. 2010, 67 (3): 369-385. 10.1007/s00018-009-0169-1.
13. Alexandre G, Zhulin IB: Laccases are widespread in bacteria. Trends in Biotechnology. 2000, 18 (2): 41-42. 10.1016/S0167-7799(99)01406-7.
14. Huang XF, Santhanam N, Badri DV, Hunter WJ, Manter DK, Decker SR, Vivanco JM, Reardon KF: Isolation and characterization of lignin-degrading bacteria from rainforest soils. Biotechnology and bioengineering. 2013, 110 (6): 1616-1626. 10.1002/bit.24833.
15. Hoegger PJ, Kilaru S, James TY, Thacker JR, Kües U: Phylogenetic comparison and classification of Laccase and related multicopper oxidase protein sequences. Febs Journal. 2006, 273 (10): 2308-2326. 10.1111/j.1742-4658.2006.05247.x.
16. Sirim D, Wagner F, Wang L, Schmid RD, Pleiss J: The Laccase Engineering Database: a classification and analysis system for Laccases and related multicopper oxidases. Database: the journal of biological databases and curation. 2011, 2011:
17. Udatha D, Kouskoumvekaki I, Olsson L, Panagiotou G: The interplay of descriptor-based computational analysis with pharmacophore modeling builds the basis for a novel classification scheme for feruloyl esterases. Biotechnology advances. 2011, 29 (1): 94-110. 10.1016/j.biotechadv.2010.09.003.
18. Kaundal R, Sahu SS, Verma R, Weirick T: Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning. BMC bioinformatics. 2013, 14 (Suppl 14): S7. 10.1186/1471-2105-14-S14-S7.
19. Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H: Predicting protein-protein interactions based only on sequences information. Proceedings of the National Academy of Sciences. 2007, 104 (11): 4337-4341. 10.1073/pnas.0607879104.
20. You Z-H, Lei Y-K, Zhu L, Xia J, Wang B: Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC bioinformatics. 2013, 14 (Suppl 8): S10. 10.1186/1471-2105-14-S8-S10.
21. Kohonen T: Essentials of the self-organizing map. Neural Networks. 2013, 37: 52-65.
22. Udatha DBRKG, Kouskoumvekaki I, Olsson L, Panagiotou G: The interplay of descriptor-based computational analysis with pharmacophore modeling builds the basis for a novel classification scheme for feruloyl esterases. Biotechnology Advances. 2011, 29 (1): 94-110. 10.1016/j.biotechadv.2010.09.003.
23. Wang J, Delabie J, Aasheim H, Smeland E, Myklebost O: Clustering of the SOM easily reveals distinct gene expression patterns: results of a reanalysis of lymphoma study. BMC Bioinformatics. 2002, 3 (1): 36. 10.1186/1471-2105-3-36.
24. Demšar J, Zupan B, Leban G, Curk T: Orange: From experimental machine learning to interactive data mining. 2004, Springer
25. MacQueen J: Some methods for classification and analysis of multivariate observations. Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. 1967, California, USA, 14-
26. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V: Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research. 2011, 12: 2825-2830.
27. Davies DL, Bouldin DW: A Cluster Separation Measure. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 1979, PAMI-1 (2): 224-227.
28. Bhasin M, Raghava G: ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic acids research. 2004, 32 (suppl 2): W414-W419.
29. Park K-J, Kanehisa M: Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 2003, 19 (13): 1656-1663. 10.1093/bioinformatics/btg222.
30. Garg A, Bhasin M, Raghava GP: Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. Journal of Biological Chemistry. 2005, 280 (15): 14427-14432. 10.1074/jbc.M411789200.
31. Kaundal R, Raghava GP: RSLpred: an integrative system for predicting subcellular localization of rice proteins combining compositional and evolutionary information. Proteomics. 2009, 9 (9): 2324-2342. 10.1002/pmic.200700597.
32. Kaundal R, Saini R, Zhao PX: Combining machine learning and homology-based approaches to accurately predict subcellular localization in Arabidopsis. Plant physiology. 2010, 154 (1): 36-54. 10.1104/pp.110.156851.
33. Cai C, Han L, Ji ZL, Chen X, Chen YZ: SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic acids research. 2003, 31 (13): 3692-3697. 10.1093/nar/gkg600.
34. Ward JJ, McGuffin LJ, Buxton BF, Jones DT: Secondary structure prediction with support vector machines. Bioinformatics. 2003, 19 (13): 1650-1655. 10.1093/bioinformatics/btg223.
35. Kaundal R, Kapoor AS, Raghava GPS: Machine learning techniques in disease forecasting: a case study on rice blast prediction. BMC Bioinformatics. 2006, 7: 485. 10.1186/1471-2105-7-485.
36. Joachims T: Svmlight: Support vector machine. SVM-Light Support Vector Machine. 1999, University of Dortmund, 19 (4): [http://svmlight.joachims.org/]
37. Moore AD, Held A, Terrapon N, Weiner J, Bornberg-Bauer E: DoMosaics: software for domain arrangement visualization and domain-centric analysis of proteins. Bioinformatics. 2014, 30 (2): 282-283. 10.1093/bioinformatics/btt640.
38. Zdobnov EM, Apweiler R: InterProScan-an integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001, 17 (9): 847-848. 10.1093/bioinformatics/17.9.847.
39. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J: Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular systems biology. 2011, 7 (1):
40. Morozova O, Shumakovich G, Gorbacheva M, Shleev S, Yaropolov A: "Blue" Laccases. Biochemistry (Moscow). 2007, 72 (10): 1136-1150. 10.1134/S0006297907100112.
41. Reiss R, Ihssen J, Richter M, Eichhorn E, Schilling B, Thöny-Meyer L: Laccase versus Laccase-Like Multi-Copper Oxidase: A Comparative Study of Similar Enzymes with Diverse Substrate Spectra. PloS one. 2013, 8 (6): e65633. 10.1371/journal.pone.0065633.
42. Freixo MdR, Karmali A, Frazão C, Arteiro JM: Production of Laccase and xylanase from Coriolus versicolor grown on tomato pomace and their chromatographic behaviour on immobilized metal chelates. Process Biochemistry. 2008, 43 (11): 1265-1274. 10.1016/j.procbio.2008.07.013.
43. Garzillo AM, Colao MC, Buonocore V, Oliva R, Falcigno L, Saviano M, Santoro AM, Zappala R, Bonomo RP, Bianco C: Structural and kinetic characterization of native Laccases from Pleurotus ostreatus, Rigidoporus lignosus, and Trametes trogii. Journal of protein chemistry. 2001, 20 (3): 191-201. 10.1023/A:1010954812955.
44. Nasoohi N, Khajeh K, Mohammadian M, Ranjbar B: Enhancement of catalysis and functional expression of a bacterial Laccase by single amino acid replacement. International journal of biological macromolecules. 2013, 60: 56-61.
45. Silva CS, Damas JM, Chen Z, Brissos V, Martins LO, Soares CM, Lindley PF, Bento I: The role of Asp116 in the reductive cleavage of dioxygen to water in CotA Laccase: assistance during the proton-transfer mechanism. Acta Crystallographica Section D: Biological Crystallography. 2012, 68 (2): 186-193. 10.1107/S0907444911054503.
46. Bleve G, Lezzi C, Spagnolo S, Tasco G, Tufariello M, Casadio R, Mita G, Rampino P, Grieco F: Role of the C-terminus of Pleurotus eryngii Ery4 Laccase in determining enzyme structure, catalytic properties and stability. Protein Engineering Design and Selection. 2013, 26 (1): 1-13. 10.1093/protein/gzs056.
47. Yamaguchi H, Miyazaki M, Asanomi Y, Maeda H: Poly-lysine supported cross-linked enzyme aggregates with efficient enzymatic activity and high operational stability. Catalysis Science & Technology. 2011, 1 (7): 1256-1261. 10.1039/c1cy00084e.
48. Mikolasch A, Hahn V, Manda K, Pump J, Illas N, Gördes D, Lalk M, Salazar MG, Hammer E, Jülich W-D: Laccase-catalyzed cross-linking of amino acids and peptides with dihydroxylated aromatic compounds. Amino acids. 2010, 39 (3): 671-683. 10.1007/s00726-010-0488-4.
49. Kurniawan RA, Aulanni'am A, Shieh F-K, Chu PP-J: Carbon Nanotube Covalently Attached Laccase Biocathode for Biofuel Cell. The Journal of Pure and Applied Chemistry Research. 2013, 2 (2): 79-88.
50. Piontek K, Antorini M, Choinowski T: Crystal Structure of a Laccase from the Fungus Trametes versicolor at 1.90-Å Resolution Containing a Full Complement of Coppers. Journal of Biological Chemistry. 2002, 277 (40): 37663-37669. 10.1074/jbc.M204571200.
51. Yoshitake A, Katayama Y, Nakamura M, Iimura Y, Kawai S, Morohoshi N: N-linked carbohydrate chains protect Laccase III from proteolysis in Coriolus versicolor. Journal of General Microbiology. 1993, 139 (1): 179-185. 10.1099/00221287-139-1-179.
52. Perry CR, Matcham SE, Wood DA, Thurston CF: The structure of Laccase protein and its synthesis by the commercial mushroom Agaricus bisporus. Journal of general microbiology. 1993, 139 (1): 171-178. 10.1099/00221287-139-1-171.
53. Lemeshow S, Hosmer D: Applied Logistic Regression (Wiley Series in Probability and Statistics). Wiley-Interscience. 2000


Acknowledgements

The authors duly acknowledge the funding support to RK for this study from OSU's Provost Office interdisciplinary grant (#12), the iCREST Center for Bioinformatics and Computational Biology (http://icrest.okstate.edu/). Partial support to TW from the National Science Foundation Grant No. EPS-0814361 is duly acknowledged. We also thank the University of California Riverside (UCR) High Performance Computing / Bioinformatics Facility for hosting the web tool developed from this study.

Declaration

Funding for the publication of this article has come from the 'start-up' funds provided to RK, account A01949-19900-44-CPK1, UCR.

This article has been published as part of BMC Bioinformatics Volume 15 Supplement 11, 2014: Proceedings of the 11th Annual MCBIOS Conference. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S11.

Author information

Authors and Affiliations

National Institute for Microbial Forensics & Food and Agricultural Biosecurity (NIMFFAB), Oklahoma State University, Stillwater, Oklahoma, 74074, USA
Tyler Weirick & Sitanshu S Sahu
Department of Biochemistry & Molecular Biology, Oklahoma State University, Stillwater, Oklahoma, 74074, USA
Tyler Weirick, Sitanshu S Sahu & Ramamurthy Mahalingam
Bioinformatics Facility, Department of Botany & Plant Sciences, Institute for Integrative Genome Biology (IIGB), University of California, Riverside, California, 92521, USA
Rakesh Kaundal

Authors

Tyler Weirick
Sitanshu S Sahu
Ramamurthy Mahalingam
Rakesh Kaundal

Corresponding author

Correspondence to Rakesh Kaundal.

Additional information

Competing interests

The authors declare that they have no competing financial interests.

Authors' contributions

TW collected the datasets related to Laccases from public repositories, wrote the code for clustering, developed algorithms and models, performed the calculations, prepared the figures and tables, and wrote the draft manuscript. SSS helped in model development, data analysis and tool building. RM helped in biological analysis and in editing the manuscript. RK conceived the study, participated in its design and coordination, and edited the final manuscript. All authors read and approved the final manuscript.

Electronic supplementary material

12859_2014_6662_MOESM1_ESM.docx

Additional file 1: Domain maps for each of the Laccase subtypes cluster generated using doMosaics (http://www.domosaics.net/). (DOCX 2 MB)

12859_2014_6662_MOESM2_ESM.xlsx

Additional file 2: P-values designating the statistical significance of one cluster over the other based on amino acid composition differences; values calculated using the standard t-test. (XLSX 30 KB)

12859_2014_6662_MOESM3_ESM.xlsx

Additional file 3: P-values designating the statistical significance of one cluster over the other based on protein physicochemical property differences; values calculated using the standard t-test. (XLSX 30 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Weirick, T., Sahu, S.S., Mahalingam, R. et al. LacSubPred: predicting subtypes of Laccases, an important lignin metabolism-related enzyme class, using in silico approaches. BMC Bioinformatics 15 (Suppl 11), S15 (2014). https://doi.org/10.1186/1471-2105-15-S11-S15


Published: 21 October 2014
DOI: https://doi.org/10.1186/1471-2105-15-S11-S15
