Identification of biomarkers for genotyping Aspergilli using non-linear methods for clustering and classification
© Kouskoumvekaki et al. 2008
Received: 29 August 2007
Accepted: 28 January 2008
Published: 28 January 2008
In the present investigation, we have used an exhaustive metabolite profiling approach to search for biomarkers in recombinant Aspergillus nidulans (mutants that produce the 6-methyl salicylic acid polyketide molecule) for application in metabolic engineering.
More than 450 metabolites were detected and subsequently used in the analysis. Our approach applies two analytical steps to the metabolic profiling data: an initial non-linear unsupervised analysis with Self-Organizing Maps (SOM) to identify similarities and differences among the metabolic profiles of the studied strains, followed by a second, supervised analysis for training a classifier based on the selected biomarkers. Our analysis identified seven putative biomarkers that were able to cluster the samples according to their genotype. A Support Vector Machine was subsequently employed to construct a predictive model based on the seven biomarkers, capable of correctly classifying 14 out of the 16 samples of the different A. nidulans strains.
Our study demonstrates that it is possible to use metabolite profiling for the classification of filamentous fungi as well as for the identification of metabolic engineering targets, and draws attention to the development of a common database for storage of metabolomics data.
Functional genomics approaches are increasingly being used for the elucidation of complex biological questions with applications that range from human health to microbial strain improvement [1–3]. Functional genomics tools have in common that they aim to map the complete phenotypic response of an organism to the environmental conditions of interest. Metabolomics technology is used to identify and quantify the metabolome, which represents the dynamic set of all small molecules – excluding those resulting from DNA and RNA transcription or translation – present in an organism or a biological sample. Fundamentally, the measured metabolite levels at a defined time under specific culture conditions for a given genotype should reflect a precise and unique signature of the metabolic phenotype. In this sense, the technique is distinct from metabolic profiling, which looks for target compounds identified a priori and their consequent biochemical transformation. Metabolomics has proven to be very rapid and superior to any other post-genomics technology for pattern-recognition analyses of biological samples. One of the major advantages of metabolomics is that there are fewer metabolites than genes or proteins, resulting in significant data reduction and high-throughput analysis. Furthermore, some environmental perturbations or genetic manipulations do not result in significant alterations at transcriptome and/or proteome levels; however, significant detectable changes in metabolite concentrations may be observed. Quantitative assessment of metabolite concentrations enables decoupling from genetic or environmental perturbations that may not affect gene transcription and/or protein translation but may, for example, affect enzyme activity levels and thereby lead to correspondingly more or less metabolite. Metabolomics is therefore considered, in many senses, more discriminatory than transcriptomics and proteomics.
The application of biostatistics and novel data-handling frameworks will have a strong role in the extraction of biologically meaningful information from large metabolomic data sets. Traditionally, data analysis has been conducted using methods that look for linear relationships within the metabolomics data, like principal components analysis (PCA) [7–9]. In recent years, non-linear methods have been successfully applied to the analysis of metabolomics data, including clustering methods, e.g. self-organizing maps (SOM), as well as classification methods, e.g. back-propagation artificial neural networks and decision trees. The results from these analyses look promising and indicate that there are indeed non-linear patterns within the data. Like PCA, SOM is a tool for visualizing data sets and for extracting high-value features using unsupervised approaches, which are helpful to experimentalists for subsequent data interpretation. Clustering, or unsupervised data analysis, relies on similarities in unlabeled data (in this case the metabolite concentrations) and not on a preset class or target value as in classification or supervised data analysis. Given that there is no initial bias based on required model assumptions, as there is in supervised methods, unsupervised methods are far less likely to identify false correlations. If an unsupervised algorithm clusters independent metabolome data with a high or low degree of separation, then the confidence associated with reporting highly correlated or uncorrelated biological data, respectively, is high.
One of the more highly valued features of filamentous fungi is their capacity for producing a great variety of secondary metabolites. Several of these compounds are currently produced commercially, such as various antibiotics, vitamins, and value-added chemicals. For example, Aspergilli serve as microbial cell factories that have been metabolically engineered for the production of organic acids, enzymes and polyketides, such as statins – amongst the highest-value pharmaceutical class of compounds primarily produced by Aspergillus terreus. Included in this genus is Aspergillus nidulans, an important model organism for studies of cell biology and gene regulation. In the present investigation we have exploited a metabolomics approach to search for high-value phenotypic features, which we refer to as biomarkers, in recombinant Aspergillus nidulans. The strains investigated are A. nidulans mutants resulting from metabolic engineering efforts to produce the 6-methyl salicylic acid polyketide molecule. Metabolic engineering seeks to identify, introduce, and enhance those gene products that are important in increasing the productivity of biological processes, and to manipulate their concentrations or activities accordingly. Our approach consists of two analytical steps: an initial non-linear unsupervised analysis (SOM) to cluster the metabolome data collected from well-defined cultivations of the investigated strains, followed by a second, supervised analysis for training a predictor built on selected biomarkers. Identification of biomarkers, where high-value information is concentrated and stored, will subsequently suggest that the bulk of regulatory nodes are centered on these metabolites. Regulation, defined in this context as the metabolic response to a stimulus, is a primary differentiator of organisms. Metabolic engineering aims to identify, isolate, and augment those regulatory points to enhance production of a desired product.
Table 1. a) The seven biomarkers with respect to discrimination of the four A. nidulans strains; b) the three biomarkers with respect to discrimination of the four cultivation conditions (glucose, xylose, glycerol and ethanol as carbon sources)
M4: lactic acid
M157: fumaric acid
M350: 6-methyl salicylic acid (6-MSA)
The other combinations that were evaluated gave either the same set of metabolites as before or larger sets (sets of 36 and 11 metabolites for strain and carbon source discrimination, respectively) that included all the metabolites shown in Table 1. As it is preferable to work with as few biomarkers as possible, the smaller sets were chosen for the further modeling steps.
When all the metabolites are used, discrimination of the samples is not possible either by genotype or by cultivation condition (Fig. 1a). When the seven selected metabolites are employed, clustering based on the different genotypes provides a high degree of correlated discrimination (Fig. 1b). In Figure 1b, the samples of the A. nidulans A4 strain are clustered together, as are the samples of the AR1phkGP74 strain. It is worth noting that in both cases, the strains cultivated on glucose are furthest from their cluster centers and approach each other. Although the glucose-to-glucose inter-cluster distance is longer than the intra-cluster distance, the data suggest a stronger correlation across the two different strains cultivated on glucose compared to the other three carbon sources. The AR16msaGP74 and AR1phk6msaGP74 strains form a distinct cluster, distant from the other two, with very short inter-cluster distances suggesting strong similarity of the two strains.
When discrimination of the samples based on the carbon source (using the three selected metabolites of Table 1b) is attempted (Fig. 1c), the SOM grid seems distorted and the clustering is relatively poor. All strains cultivated on glucose and two strains (A. nidulans A4 and AR1phkGP74) cultivated on xylose form distinct clusters, whereas there is no discrimination in the metabolic signature of cells grown on ethanol or glycerol. This suggests that the genotype is a much stronger distinguishing feature than the carbon source used for cultivation of the different A. nidulans strains when metabolite profiles are considered.
Furthermore, in Figure 2 the component planes are clustered based on similarity in the distribution profiles of the component vectors over the data set, which allows us to draw interesting conclusions regarding the output of the data reduction step described previously. As seen in the figure, there are seven distinct clusters that include the majority of the 464 metabolites, placed on the borders of the matrix. Each cluster contains metabolites that are highly correlated with each other. An interesting observation is that all seven metabolites of Table 1a belong to six clearly distinguished large clusters of highly correlated metabolites, with profiles that show quite high variance.
On the other hand, two of the three metabolites of Table 1b come from the same cluster of low-variance metabolites (top left), while the third one has a totally unique profile and is therefore placed on its own in the central part of the matrix. This explains the inability of these three metabolites to cluster the data based on the different carbon source used in the cultivation (Fig. 1c).
The label map of Figure 3b gives a 2-dimensional visualization of the information from all the component planes and shows the clustering of the samples based on the seven selected metabolites (Table 1a). It should be noted that the information in Figures 3a–b is equivalent to that in Figure 1b, with the distance between neurons visualized by grey-scale in one case (Figures 3a–b) and by lines in the 2D space in the other (Figure 1b).
Looking at the U-matrix and labels map of Figure 3, it is worth noting that in the case of the AR16msaGP74 and AR1phk6msaGP74 strains, the cultivation condition is a stronger discriminative parameter than the type of strain (6M-GLR and 6MP-GLR are placed in the same neuron, while 6M-GLU and 6MP-GLU are at neighboring neurons in a light-gray area of the map).
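To make concrete what the grey-scale of a U-matrix encodes, the following is a minimal sketch that computes, for each neuron, the mean distance between its prototype vector and those of its grid neighbours. It assumes a rectangular grid with four neighbours per neuron, which is a simplification of the hexagonal lattice used in this study; the function name u_matrix and the dictionary layout are hypothetical and are not taken from the SOM Toolbox.

```python
import math


def u_matrix(weights, rows, cols):
    """Mean Euclidean distance from each neuron's prototype to its grid neighbours.

    `weights` maps (row, col) -> prototype vector. A rectangular 4-neighbour
    grid is assumed here (illustrative simplification, not the hexagonal lattice).
    """
    heights = {}
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                (r + dr, c + dc)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
            if not neighbours:  # a 1 x 1 map has no neighbours
                heights[(r, c)] = 0.0
                continue
            total = sum(math.dist(weights[(r, c)], weights[n]) for n in neighbours)
            heights[(r, c)] = total / len(neighbours)
    return heights
```

Low values correspond to light areas of the map (neighbouring neurons with similar prototypes, i.e. the interior of a cluster), while high values mark the borders between clusters.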
The bar-planes of Figure 3c visualize the map prototype vectors (i.e. the coordinates of the map) as bar charts, indicating which metabolic signatures/profiles are responsible for clustering samples in each neuron. According to the bar-planes, the high concentrations of M23 and M238 are responsible for the clustering of the three samples of the AR1phkGP74 strain at the top left corner of the labels map. Similarly, the high concentrations of M19, M23 and M92 are responsible for the clustering of the samples of the A. nidulans A4 strain at the top right corner of the labels map.
The observation of the natural clustering of the samples is a guide towards whether it is feasible to model the genotype or the used carbon source based on alterations in the metabolite profile. From the above analysis it appears that an accurate predictor of the samples' cultivation condition cannot be built based on the given information. The analysis reveals that the different strains do form quite distinct natural clusters, suggesting that the metabolites that characterize each sample may be used as model parameters for the prediction of the genotype.
Comparison of performance of linear and non-linear machine-learning methods
Correctly classified samples
Support Vector Classifier
Multilayer Perceptron (1 hidden layer with 4 neurons)
Multilayer Perceptron (no hidden layers)
Confusion matrix of Support Vector Classifier for the four A. nidulans strains
One of the primary objectives of metabolomics is to contribute to the design and implementation of metabolic engineering strategies in potential industrial hosts. There is often a disconnect between large-scale omics data sets and interpretation of the data in a physiological context that permits rational genetic or biochemical engineering applications. Tables 1a and 1b provide a summary of the seven and three biomarkers detected for discrimination of the four A. nidulans strains and the four carbon substrates, respectively. It is interesting to note that of the seven biomarkers listed in Table 1a, two could be identified based on information in our in-house library: valine (M20) and 6-MSA (M350). It is intuitive, yet nonetheless significant, that 6-MSA was identified as a biomarker metabolite across the four strains, confirming the detectable relationship between intentional genetic manipulations and resulting metabolite profiles. However, the other identified metabolite, valine, also provides some interesting insight into discrimination of the four strains. Valine, a branched, non-polar amino acid, is coupled to the isoleucine and leucine super-family synthesis pathways. The first reaction in valine synthesis is a decarboxylation of pyruvate to form acetolactate, catalyzed by acetolactate synthase (E.C. 2.2.1.6). One of valine's roles is as the primary substrate in the biosynthesis of coenzyme A. In Table 1b, two metabolites are identified as discriminators of the four culture conditions: lactic acid (M4) and fumaric acid (M157). It is interesting to note that both metabolites, similar to valine, utilize pyruvate as their primary substrate. Lactic acid is formed by the NADH-dependent reduction of pyruvate by lactate dehydrogenase (E.C. 1.1.1.27), while fumaric acid is formed by the oxidation of succinate, coupled to the reduction of FAD to FADH2, by succinate dehydrogenase (E.C. 1.3.5.1), as an integral part of the Krebs cycle.
Pyruvate enters the Krebs cycle utilizing acetyl-CoA as an essential co-factor. It is further interesting to note that 6-MSA utilizes acetyl-CoA as an essential co-factor in its biosynthesis. It is expected that the four carbon sources utilized, coupled with the four mutant strains evaluated, would significantly impact pyruvate metabolism, which serves as a key regulatory node for balancing purely fermentative and respiro-fermentative metabolism. However, identification of valine, lactic acid, and fumaric acid as key biomarkers provides highly specific targets for further investigation and development of potential metabolic engineering strategies. For example, increasing 6-MSA production would likely require the flux through valine biosynthetic pathways to increase to boost acetyl-CoA pools, while decreasing the flux from pyruvate to lactate would likely result in increased flux through the Krebs cycle, forming the required intermediates, such as 2-oxoglutarate and glutamate, for valine biosynthesis. Searching for information-rich metabolic nodes derived from a combinatorial survey of different culture conditions and genotypes provides information and non-intuitive targets not decipherable from a simple inspection of known biochemical pathways.
In this study, we investigated metabolomic profiles of different A. nidulans strains, wild-type and mutants, grown on a diverse array of carbon sources. This investigation reports a successful approach for developing a biomarker metabolite set that captures much of the metabolite variation and, consequently, the high-value, discriminatory information present in the different Aspergilli sp. metabolome profiles using SOM and SVM. The principal objective of SOM is to obtain a 2D projection of a multidimensional space. This projection keeps the topology of the multidimensional space, i.e., points which are close to one another in the multidimensional space are neighbors in the two-dimensional space as well. The training of the network is unsupervised, that is, the property of interest, in this case the genotype, is not used during the training process. In the course of training, the objects are randomly presented to the neural network in an iterative manner. For each iteration step the so-called winning neuron for the input object is identified by determining the neuron having the minimum Euclidean distance to the input object, i.e. the concentration profile of metabolites in each sample. To improve the response of the network, the neuron weights are adapted to become more similar to the input pattern. After termination of training, the response of the network is calculated for each object in the data set. The projection of the data set into the 2D space is then performed by mapping each object into the coordinates of the winning neuron. The SOM has already been widely applied in engineering and many other fields and is gaining popularity in the fields of medicine, computer-aided diagnosis and biotechnology [21–23]. In our study, SOM proved to be an invaluable tool to reveal a holistic picture of metabolism and provide insight into the relationships between the concentration levels of a metabolite pool and the genotype.
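The training loop described above can be sketched in a few lines. This is an illustrative toy, not the implementation used in the study: it assumes a rectangular grid, sequential (online) updates, random initialization, and a linearly decaying learning rate and neighbourhood radius, whereas the study used the SOM Toolbox with a hexagonal lattice, linear initialization and batch training. All function names and default values here are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def winning_neuron(weights, sample):
    """Grid coordinate of the neuron whose weight vector is closest to the sample."""
    return min(weights, key=lambda pos: euclidean(weights[pos], sample))


def train_som(samples, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a small rectangular SOM with sequential updates (assumed scheme)."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # One weight vector per grid position, initialised at random in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius shrink as training proceeds.
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        order = samples[:]
        rng.shuffle(order)  # objects are presented in random order
        for sample in order:
            win = winning_neuron(weights, sample)
            for pos, w in weights.items():
                grid_dist = math.hypot(pos[0] - win[0], pos[1] - win[1])
                # Gaussian neighbourhood: the winner moves most, distant neurons least.
                h = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * h * (sample[i] - w[i])
    return weights


def project(weights, samples):
    """Map every sample to the grid coordinate of its winning neuron."""
    return [winning_neuron(weights, s) for s in samples]
```

Calling project(train_som(samples, 5, 4), samples) on scaled metabolite profiles would return one grid coordinate per sample, which is the 2D projection referred to in the text. Class labels are never passed to train_som, which is what makes the procedure unsupervised.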
In Figure 1b, there is a clear cluster of the A. nidulans A4 wild type as well as of the AR1phkGP74 strain; however, when the strains were cultivated on glucose they are displaced furthest from their cluster centers, and closer to one another. This is not surprising since the physiological characterization of the AR1phkGP74 mutant has shown that overexpression of the phosphoketolase gene has significant effects on the specific growth rate on xylose, glycerol and ethanol but no effect on glucose. On the other hand, it is obvious that the insertion of the gene coding for the secondary metabolite 6-MSA (strains AR16msaGP74 and AR1phk6msaGP74) resulted in mutants with very distinct metabolite profiles (Figure 1b). The concentrations of metabolites in the central carbon metabolism are relatively constant, while the concentrations of metabolites that are present in pathways of secondary metabolism demonstrate much larger concentration ranges. The dominant role of secondary pathways for metabolite discrimination between genotypes was further verified by the selection of 6-MSA as a biomarker (Table 1a). The inability of SOM to differentiate the metabolite profiles of the two mutants AR16msaGP74 and AR1phk6msaGP74 grown on glycerol is in agreement with our findings from the physiological characterization, where the production of 6-MSA by cells grown on this carbon source was very low. Metabolic flux analysis of the AR16msaGP74 mutant has shown that the insertion of the 6-MSA gene increased the flux through the phosphoketolase pathway due to increased requirements for the acetyl-CoA precursor molecule. This supports our findings from the metabolite profile study that the two mutants AR16msaGP74 and AR1phk6msaGP74 have a very similar metabolic signature (Fig. 1b).
A very interesting result was that the biomarker selection by the neural network was based not only on the discrimination power but also on the interconnection with other metabolites that show similar variation (Fig. 2). Selection of biomarkers that belong to larger, tightly connected metabolic networks could be invaluable for the identification of regulatory nodes, a core element of metabolic engineering.
SVM is a supervised learning method that performs a nonlinear mapping of input data that are inseparable in a low-dimensional space to a higher-dimensional space, where a maximal separating hyperplane is constructed. The 'support vectors' are the samples lying along the margin, which are used to generate the maximum-margin hyperplane between the two classes. Selecting this particular hyperplane maximizes the SVM's ability to predict the correct classification of previously unseen data. This technique differentiates SVM from other hyperplane-based classifiers and seems to be the key to its success. An excellent and detailed description of how support vector machines work can be found in. SVM in our study was employed to construct a predictive model capable of distinguishing between different A. nidulans strains based on their metabolome profile. We were able to validate significant differences in metabolite levels and to detect metabolic signatures that correctly classify roughly 90% of the samples (14 of 16). However, what still remains a challenge is to "decode" the selected biomarker set, since six of the ten compounds could not be identified using our in-house library (consisting of 78 metabolites), showing how important it is to develop a common database to store metabolomics data.
focus high-level gene annotation, ensuring that those pathways are well defined;
include them in genome-scale models for simulation purposes to determine if, via stoichiometry, the final product formation can be enhanced;
over-express or delete using a factorial design to determine which metabolite within the biomarker set exerts the most metabolic control; and,
introduce non-native pathways from other organisms to further push the limits of production.
Furthermore, our study demonstrates that it is possible to use metabolite profiling for the identification and classification of filamentous fungi.
Four strains were used in the present study: the A. nidulans A4 wild type; A. nidulans AR1phkGP74, in which the gene (XP_662517) encoding phosphoketolase has been overexpressed; and the two mutants AR16msaGP74 and AR1phk6msaGP74 (double mutant), which contain the P22367 gene encoding for the 6-MSA polyketide molecule. The construction of the strains has been described elsewhere.
For all the A. nidulans cultivations a chemically defined medium containing trace metal elements was used. The medium had the following composition: 15 g (NH₄)₂SO₄ l⁻¹, 3 g KH₂PO₄ l⁻¹, 2 g MgSO₄·7H₂O l⁻¹, 2 g NaCl l⁻¹, 0.2 g CaCl₂ l⁻¹ and 1 ml trace element solution l⁻¹. Trace element solution composition (per litre): 14.3 g ZnSO₄·7H₂O, 13.8 g FeSO₄·7H₂O and 2.5 g CuSO₄·5H₂O. Arginine (0.7 g l⁻¹) was added to the cultures of the auxotrophic strains (AR1phkGP74 and AR16msaGP74) by sterile filtration. The carbon sources used were glucose, xylose, glycerol and ethanol (20 g l⁻¹ each). To determine the metabolite profiles, cultivations were performed in well-controlled 1.5 l bioreactors with a working volume of 1.2 l. The bioreactors were equipped with two disc-turbine impellers rotating at 350 r.p.m. The pH was controlled at 5.5 ± 0.1 by addition of 2 M NaOH or HCl, and the temperature was controlled at 30 ± 0.1°C. Air was sparged through a ring-sparger for aeration of the bioreactor at a constant flow rate of 1.0 vvm (volume of gas per volume of liquid per minute).
Cell dry weight was determined using nitrocellulose filters (pore size 0.45 μm, Gelman Sciences). The filters were pre-dried in a microwave oven at 150 W for 15 min and subsequently weighed. A measured volume of cell culture was filtered and the residue was washed with distilled water and dried on the filter for 15 min in a microwave oven at 150 W. The filter was weighed again and the cell mass concentration was calculated.
For the analysis of intracellular metabolites, triplicate samples were collected at the middle of the exponential growth phase. 10 ml of fermentation broth was immediately quenched in 20 ml of cold 72% methanol (-40°C). After quenching, the cells were separated from the quenching solution by centrifugation at 10,000 g for 20 min at -20°C and the intracellular metabolites were extracted as described by Villas-Boas et al. Finally, the samples were lyophilized and stored at -80°C until further analysis. The lyophilized samples were derivatized using methyl chloroformate as described by Villas-Boas et al. Amino and non-amino organic acids were analysed by GC-MS. GC-MS analysis was performed with a Hewlett-Packard HP 6890 gas chromatograph coupled to an HP 5973 quadrupole mass selective detector (EI) operated at 70 eV. The column used for all analyses was a J&W 1701 (Folsom, CA; 30 m × 250 μm, 0.15 μm film thickness). The temperature of the inlet was 180°C, the interface temperature was 230°C, and the quadrupole temperature was 150°C. The profile of identified intracellular amino and non-amino organic acids was expressed in peak areas normalized by the biomass (Additional file 1).
The data from GC-MS analyses were deconvoluted using the AMDIS spectral deconvolution software package. SpectConnect was used to automatically catalog and track otherwise unidentifiable conserved metabolite peaks across sample replicates and different sample condition groups without use of reference spectra. Using SpectConnect, 464 metabolite peaks (referred to from now on as M1–M464) were detected and more than 40 were identified using an in-house library. Clustering and classification tools were used for the identification of specific differences between metabolite profiles and the characterization of specific biological activities. In the analysis each sample corresponds to a different genotype (A. nidulans A4, AR1phkGP74, AR16msaGP74, and AR1phk6msaGP74), each cultivated on one of the previously specified carbon sources (glucose, xylose, glycerol or ethanol).
Due to the large number of available descriptors (concentrations of the different metabolites) compared to the size of the data set (number of mutants cultivated on different carbon sources), data reduction was considered necessary in order to remove irrelevant and/or intercorrelated descriptors and noise. For this purpose, model training was preceded by a descriptor selection stage in order to eliminate all but the most relevant descriptors. Reducing the dimensionality of the data by removing unsuitable descriptors usually improves the performance and speed of learning algorithms and, most importantly, yields a more compact and easily interpretable representation of the relationship between the input and output data.
In this work, data reduction was done using the freely available Java software package WEKA (version 3-4-6). The data reduction was done with the CfsSubsetEval descriptor subset evaluator, in combination with two different search algorithms, BestFirst and GreedyStepwise. These algorithms use greedy hill climbing with and without backtracking, respectively. CfsSubsetEval was chosen due to its ability to estimate the predictive value of all the descriptors individually and, at the same time, to evaluate the degree of redundancy among them. Data reduction was also attempted with three different single-descriptor evaluators, namely ChiSquaredAttributeEval, Symmetrical and InfoGain, combined with the Ranker ranking method.
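The idea behind correlation-based subset evaluation, rewarding descriptors that correlate with the class while penalizing redundancy among them, can be illustrated with the commonly cited merit heuristic, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the subset size, r_cf the mean descriptor-class correlation and r_ff the mean descriptor-descriptor correlation. The sketch below is an assumption-laden toy: it uses absolute Pearson correlation with a numerically encoded class as a stand-in for the correlation measure used inside WEKA, and a plain greedy forward search without backtracking. The function names and data layout are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant vector."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cfs_merit(subset, columns, target):
    """Merit = k*r_cf / sqrt(k + k*(k-1)*r_ff) for a descriptor subset."""
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_forward_selection(columns, target):
    """Add the descriptor that most improves the merit; stop when nothing helps."""
    selected, best = [], 0.0
    remaining = set(columns)
    while remaining:
        scored = [(cfs_merit(selected + [f], columns, target), f)
                  for f in sorted(remaining)]
        merit, feature = max(scored, key=lambda t: t[0])
        if merit <= best:  # no improvement: stop (no backtracking)
            break
        selected.append(feature)
        remaining.remove(feature)
        best = merit
    return selected, best
```

With a perfectly predictive descriptor and an exact copy of it, the merit of the pair equals the merit of either one alone, so the redundant copy is not added. This is the behaviour that makes the subset evaluator prefer small, non-redundant biomarker sets.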
Self-Organizing Maps (SOM) were applied for the clustering of the metabolome data using the Matlab SOM Toolbox. The SOM Toolbox is a function library for the Matlab 5 computing environment, required for implementing the SOM algorithm and its visualization. It is currently in version 2.0 beta and is publicly available at.
The normalization of the input data and the initialization of training were optimized based on the quantization error obtained after training. The logistic transformation (scaling of all values to the range [0, 1]) and linear initialization of training produced the lowest quantization error. For the training of SOM the default parameters were used: hexagonal map lattice with unconnected edges, batch training mode, and inverse function learning rate. A map size of 5 × 4 was chosen automatically by the toolbox based on the dimensions of the input data. The training length was set to 20 epochs (iterations), the point at which the calculated quantization error stabilized.
Two linear and two non-linear classifiers were selected from the WEKA toolbox to be trained for the classification of the data set: Logistic, Multilayer Perceptron (in both its linear and non-linear form) and SMO.
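The two quantities mentioned here can be sketched as follows. The logistic transform is written as a logistic function of the z-score of each variable, which is an assumption about how such a normalization is typically defined and not a statement of the toolbox's exact formula. The quantization error is taken as the mean Euclidean distance between each sample and its closest map prototype. Function names are hypothetical.

```python
import math


def logistic_normalize(column):
    """Squash one variable into (0, 1) via a logistic transform of its z-score.

    Assumed form: 1 / (1 + exp(-(x - mean) / std)); expects at least two values.
    """
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1)) or 1.0
    return [1.0 / (1.0 + math.exp(-(v - mean) / std)) for v in column]


def quantization_error(samples, prototypes):
    """Mean Euclidean distance from each sample to its nearest prototype vector."""
    total = 0.0
    for s in samples:
        total += min(math.dist(s, p) for p in prototypes)
    return total / len(samples)
```

A lower quantization error means the trained prototypes sit closer to the data, which is the criterion used above to choose between normalization and initialization options and to decide when training has converged.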
Optimized parameters for Multilayer Perceptron
100 epochs (iterations)
Optimized parameters for Sequential Minimal Optimization algorithm (SMO)
Radial Basis Function, $\exp(-\gamma \lvert x - y \rvert^{2})$
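As a small illustration of the radial basis function kernel named above, the sketch below evaluates exp(−γ|x−y|²) for pairs of metabolite profiles and assembles the pairwise kernel matrix on which an SVM trainer such as SMO operates. The value of γ is left as a parameter because the optimized value is not recoverable from the text; the function names are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(samples, gamma):
    """Pairwise kernel values for all samples (symmetric, ones on the diagonal)."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]
```

Identical profiles give a kernel value of 1, and the value decays towards 0 as profiles move apart; a larger γ makes the decay steeper and the resulting decision boundary more local.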
Because the number of samples in the input data is limited (four strains at four cultivation conditions, yielding 16 different data objects), leave-one-out (LOO) cross-validation was used for evaluating the predictive power of the model. LOO cross-validation involves the sequential omission of each data object from the training set and using all the remaining ones to train the model. The model is then judged on its ability to correctly classify the omitted object. This is repeated for all the objects in the data set. This method of cross-validation ensures that the maximum amount of data is used for the training of the model, which is particularly important when analyzing a small number of samples, as in our case.
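The leave-one-out procedure can be written generically, independent of the classifier being evaluated. In the sketch below a nearest-centroid classifier is used purely as a simple stand-in so that the example is self-contained; it is not one of the WEKA classifiers used in this study, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def nearest_centroid_fit(X, y):
    """Stand-in classifier: one mean vector (centroid) per class."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def nearest_centroid_predict(centroids, row):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))


def leave_one_out(X, y, fit, predict):
    """Hold out each object once, train on the rest, count correct predictions."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        if predict(model, X[i]) == y[i]:
            correct += 1
    return correct, len(X)
```

For a data set of 16 objects the loop trains 16 models, each on 15 objects, and the reported accuracy is the fraction of held-out objects that were classified correctly.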
The authors would like to thank José Manuel Otero for fruitful discussions and critical comments. G. Panagiotou acknowledges financial support from the Villum Kann Rasmussen Foundation. I. Kouskoumvekaki and S.Ó. Jónsdóttir acknowledge financial support from the Research Council for Technology and Production Sciences and the Program Commission on Nanoscience, Biotechnology and IT (NABIIT).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.