A factorization method for the classification of infrared spectra
© Henneges et al; licensee BioMed Central Ltd. 2010
Received: 15 June 2010
Accepted: 15 November 2010
Published: 15 November 2010
Bioinformatics data analysis often deals with additive mixtures of signals for which only class labels are known. The overall goal is then to estimate class-related signals for data mining purposes. A convenient application is the metabolic monitoring of patients using infrared spectroscopy. Within an infrared spectrum, each single compound contributes quantitatively to the measurement.
In this work, we propose a novel technique for additive signal factorization that allows learning from classified samples. We define a composed loss function for this task and analytically derive a closed-form equation such that training a model reduces to searching for an optimal threshold vector. Our experiments, carried out on synthetic and clinical data, show a sensitivity of up to 0.958 and a specificity of up to 0.841 for a 15-class disease classification problem. Using class and regression information in parallel, our algorithm outperforms linear SVMs on training sets with many classes and few samples.
The presented factorization method provides a simple generative model and therefore represents a first step towards predictive factorization methods.
Bioinformatics data analysis often deals with additive mixtures of signals from unknown interfering sources. In the majority of cases, only class labels are known for each sample, which hampers the estimation of the original source signals. An example of such a situation is the search for metabolic features in blood within different patient groups. In blood, several signal sources add up, as each organ may secrete hormones that carry information about its state into this complex mixture. For instance, adipocytes secrete the hormone leptin to indicate their state; this signal is then recognized in the hypothalamus to regulate appetite. At the same time, insulin is secreted by pancreatic beta cells to regulate blood sugar. Both peptide hormones are present in the blood, while their regulation results in different outcomes. However, both signals are also hidden within a huge and noisy background of further signals present in the blood stream. Consequently, a large number of samples must be taken to clearly identify an unknown signal. Infrared (IR) spectroscopy is a rapid method for detecting signals in biological samples. It relies on sample quantities of only 1 μl, which can be easily obtained, and it is fast: measuring a complete sample, in which each single molecule is detected, requires a total time of 30 s on a Bruker Tensor 37.
Moreover, all disease-related changes are captured integrally in the spectrum, so that the sample can be analyzed objectively by IR spectroscopy without prior knowledge of disease markers. In this way, IR spectroscopy has great potential as a method for early diagnosis and therapy control [2–4]. Analyzing IR spectra, however, is a complex signal processing problem.
Nonetheless, there exist algorithms that are able to separate additive signals into estimated subcomponents. Examples of such methods are Non-negative Matrix Factorization (NMF) and Independent Component Analysis (ICA) [6, 7]. Both compute a generative additive signal model that is fitted to data samples to estimate the basic subsignals each data sample is composed of. However, IR spectra do not completely fulfill the sparseness or smoothness constraints used by ICA or NMF. Moreover, these methods are neither designed for training on data with classification labels nor do they yield predictive models. In this work, we solve the class assignment problem and design a factorization method using a generative additive model that can be trained on data samples having class labels. For each class label, a factor signal is computed that, when exceeding a learned threshold, predicts the specific label. Therefore, our method can be trained on inexpensive IR spectra using class information and extract meaningful components from these signals, which leads to further insight into the data and a predictive model.
This section develops the new predictive matrix factorization algorithm named BrierScoreMF for IR spectra. First, we motivate and define the problem. Then, we introduce factorization and classification loss functions and their matrix formulations. Finally, we derive the BrierScoreMF algorithm.
1.1 Problem formulation
In daily practice, bioinformatics often deals with signals from interfering sources. Each source can have considerable impact on the final interpretation of the signal. For instance, consider endocrine signaling. The endocrine system is composed of glands secreting hormones into the blood stream. Within certain ranges, these signals represent the normal body state. However, increased signals may indicate a disease state, e.g. oncogenesis. Thus, measuring all endocrine signals yields a superposition of healthy and disease signal combinations that have to be separated to diagnose the physical state. Moreover, disease signals may be combinations of coregulated signals not originating from a single signal source. In practice, measured signals are only grouped by disease classes, raising the question of the characteristic shape of the disease signals.
Thus, we are dealing with two simultaneous problems: a signal decomposition problem and a classification problem based on the signal decomposition. A practical approach would try to learn the signals from given data samples.
Matrix factorization methods are convenient algorithms for the signal decomposition task. These methods solve the problem of finding the decomposition X = AS for a given matrix X. In general, this problem is ill-posed. However, imposing constraints restricts the number of feasible solutions, which can then be found by local optimization algorithms. Commonly used restrictions comprise constraints on the statistical independence of signals as well as non-negativity or sparsity of the coefficients in A. To date, no factorization method is known that uses class labels; therefore, our approach includes constraints for classification, which are needed to learn from IR spectra obtained in clinical studies.
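As an illustration of such a constrained factorization, the following sketch implements the classic multiplicative-update NMF of Lee and Seung, which minimizes the Frobenius reconstruction error under non-negativity constraints. The function name and default settings are our own and are not part of the method developed in this paper:

```python
import numpy as np

def nmf(X, k, iters=300, eps=1e-9, seed=0):
    """Minimal multiplicative-update NMF minimizing ||X - A S||_F^2.
    Illustrative sketch; not the factorization proposed in this work."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.random((n, k)) + eps
    S = rng.random((k, d)) + eps
    for _ in range(iters):
        A *= (X @ S.T) / (A @ S @ S.T + eps)   # update keeps A non-negative
        S *= (A.T @ X) / (A.T @ A @ S + eps)   # update keeps S non-negative
    return A, S

# Toy mixture: 5 samples composed of 2 non-negative base signals.
rng = np.random.default_rng(1)
S_true = rng.random((2, 40))
A_true = rng.random((5, 2))
X = A_true @ S_true
A, S = nmf(X, 2)
rel_err = np.linalg.norm(X - A @ S) / np.linalg.norm(X)
```

On such noise-free low-rank data, the multiplicative updates drive the relative reconstruction error close to zero while both factors remain non-negative.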
The dimensions are X ∈ ℝ^{n×d} and Y ∈ ℝ^{n×k}. Thus, each row in X defines a measured signal and relates to a row in Y containing binary class information.
where b is the column threshold vector of the factorization. If the signal fraction exceeds the corresponding threshold, this indicates class membership within our prediction model.
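This prediction rule can be sketched as follows, assuming a coefficient matrix A of per-class signal fractions and a threshold vector b; the function name is illustrative:

```python
import numpy as np

def predict_classes(A, b):
    """Sample i is assigned class j (+1) whenever its signal fraction
    A[i, j] exceeds the threshold b[j], and -1 otherwise, matching the
    {+1, -1} class-matrix convention used in the text."""
    return np.where(A > b, 1, -1)

A = np.array([[0.9, 0.1],
              [0.2, 0.7]])
b = np.array([0.5, 0.5])
Y_hat = predict_classes(A, b)
```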
1.2 Factorization loss functions
In general, factorization algorithms focus on the signal side of the problem. These methods optimize special distance functions between probability distributions, referred to as divergences, to estimate A and S. It can be shown that optimizing A and S in parallel is a non-convex optimization problem. Commonly used divergences include the Frobenius norm as well as the Kullback-Leibler divergence. Other exemplary divergences are the Itakura-Saito divergence and the families of α- and β-divergences.
The squared Frobenius norm can be written as ‖Z‖²_F = tr(ZᵀZ) for some matrix Z. Here, tr denotes the trace of a matrix.
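A minimal numerical sketch of the two most common divergences, with the Frobenius case computed via the trace identity; the generalized Kullback-Leibler form shown is the variant commonly used in NMF and assumes non-negative entries:

```python
import numpy as np

def frobenius_div(X, Xhat):
    """0.5 * ||X - Xhat||_F^2 computed via the trace identity tr(D D^T)."""
    D = X - Xhat
    return 0.5 * np.trace(D @ D.T)

def gkl_div(X, Xhat, eps=1e-12):
    """Generalized Kullback-Leibler divergence for non-negative matrices."""
    return np.sum(X * np.log((X + eps) / (Xhat + eps)) - X + Xhat)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
Xhat = np.array([[1.0, 2.0], [3.0, 5.0]])
d_frob = frobenius_div(X, Xhat)
d_kl = gkl_div(X, X)
```

Both divergences vanish exactly when the reconstruction equals the data, which is the property the factorization exploits.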
We have chosen the Frobenius norm as the divergence for the reconstruction error because it allows the matrix differentials of an expression to be computed easily. This simplifies the search for possible solutions in Section 1.4.
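Concretely, the matrix differential of the Frobenius reconstruction error with respect to A reads

```latex
\frac{\partial}{\partial A}\,\lVert X - AS\rVert_F^2
  = \frac{\partial}{\partial A}\,
    \operatorname{tr}\!\left[(X - AS)(X - AS)^{\top}\right]
  = -2\,(X - AS)\,S^{\top}.
```

Setting this differential to zero yields ASSᵀ = XSᵀ and hence, for S of full row rank, the closed form A = XSᵀ(SSᵀ)⁻¹ = XS⁺ that is exploited in Section 1.4.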
1.3 Classification loss functions
Classification algorithms focus on inferring a predictive model for a target variable from training data. To this end, they optimize classification loss functions that penalize false predictions in order to find the most probable parametrization of a model. Convenient loss functions comprise the Brier score, the SVM (hinge) loss, the logistic loss, and the misclassification loss.
where f(x) is a parametrized model function.
where Y is the class matrix, 1_{n×k} is an n×k matrix of ones, and ◦ denotes the Hadamard product.
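The scalar forms of these loss functions can be sketched as follows for labels y ∈ {-1, +1} and a real-valued model output f(x). The squared (Brier-style) loss shown here is one common margin-based variant and is not necessarily the exact form used above:

```python
import numpy as np

# Margin-based losses for labels y in {-1, +1} and model output f = f(x).
def brier_loss(y, f):
    """Squared-error (Brier-style) loss on the margin y*f."""
    return (1.0 - y * f) ** 2

def svm_loss(y, f):
    """Hinge loss used by support vector machines."""
    return np.maximum(0.0, 1.0 - y * f)

def logistic_loss(y, f):
    """Logistic loss; log1p avoids overflow for large margins."""
    return np.log1p(np.exp(-y * f))

def misclassification_loss(y, f):
    """0/1 loss: 1 whenever the sign of f disagrees with y."""
    return (np.sign(f) != y).astype(float)
```

All four penalize a wrong-signed prediction, but only the first three are smooth enough to be optimized directly.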
1.4 The predictive factorization algorithm
Current factorization methods are not predictive and can only be used for signal inference. In the case of NMF methods, this arises from the gradient descent methods used for optimization. Often, an alternating gradient descent is performed, in which one matrix is kept fixed while the other is optimized. The drawback for a predictive approach based on A is that, for a given NMF signal matrix S, the corresponding A is not uniquely defined.
For any predictive approach, training a model requires that A is treated as a function of S and X. To the best of our knowledge, this is not the case in current factorization approaches.
using (7) and assuming the existence of the square matrix (SSᵀ)⁻¹. Now, A is uniquely defined as a function of X and S, and we have solved the uniqueness problem.
and substitute A = XS⁺. Thus, the complete loss function is easily expressed in matrix terms, where we have omitted the sizes of the 1-matrices for simplicity. Furthermore, we have used that it suffices to optimize a monotonic transformation of F [p. 129, Theorem 9].
for the coefficients of .
for unknown data X*.
where s_i and t_i denote the cross-validated sensitivities and specificities, and r denotes the cross-validated reconstruction error. Using numerically computed gradients in combination with a BFGS local search method to optimize the threshold vector completes the BrierScoreMF.
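The closed form A = XS⁺ underlying the derivation can be verified numerically. The following sketch (illustrative names, noise-free data) checks that the right pseudoinverse uniquely recovers the coefficient matrix when S has full row rank:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.random((3, 20))       # k = 3 base signals sampled at 20 points
A_true = rng.random((5, 3))   # n = 5 samples
X = A_true @ S                # noise-free additive mixture

# Moore-Penrose pseudoinverse: S+ = S^T (S S^T)^{-1} for full row rank S,
# so A = X S+ is a unique function of X and S.
S_pinv = S.T @ np.linalg.inv(S @ S.T)
A = X @ S_pinv

# the explicit right pseudoinverse agrees with NumPy's SVD-based pinv
assert np.allclose(S_pinv, np.linalg.pinv(S))
```

Because X = A_true S exactly, the projection X S⁺ returns A_true itself, illustrating why treating A as a function of X and S removes the ambiguity.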
We conclude this section with an interpretation of equation (9). First, we note that the BrierScoreMF has very few parameters, namely the k threshold coefficients, which minimizes the probability of over-fitting (Occam's razor) but also limits the achievable prediction performance. Next, the computation of S involves both the design matrix X used for training and the class matrix Y. Thus, using the known classes and a linear offset, the training data is projected by a Moore-Penrose pseudoinverse to a transformed matrix S.
Consequently, the training information Y and X are compressed together with the learned variables in S. In this way, our new factorization method is similar to nearest neighbor classifiers, which also store the training data itself while learning a threshold value for classification.
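The threshold search described above can be sketched with a BFGS local search and numerically estimated gradients. The objective below is a hypothetical smooth stand-in for the paper's cross-validated sensitivity/specificity score, chosen only so that the example is self-contained:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.random((50, 3))              # signal fractions per class
Y = np.where(A > 0.5, 1.0, -1.0)     # labels consistent with thresholds of 0.5

def objective(b):
    """Smooth Brier-style surrogate of the thresholded prediction error.
    The actual BrierScoreMF objective scores cross-validated
    sensitivities, specificities, and the reconstruction error instead."""
    F = np.tanh(5.0 * (A - b))       # smooth stand-in for sign(A - b)
    return np.mean((Y - F) ** 2)

b0 = np.full(3, 0.3)                 # initial threshold guess
res = minimize(objective, x0=b0, method="BFGS")  # gradients via finite differences
b_opt = res.x
```

Since only k threshold coefficients are optimized, the search space is tiny compared to the input dimension, which is the point made above about the model's small number of parameters.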
All software used in this article is freely available from the author.
Results and Discussion
This section empirically compares the performance of the BrierScoreMF with linear Support Vector Machines (SVM). To this end, we sample synthetic signal functions together with class and coefficient matrices for training both machine learning models. This setting was specifically designed with regard to the application case of IR spectroscopy. Finally, we train both algorithms on a real-world IR data set comprising various diseases for classification.
We would like to note in advance that this comparison is not entirely fair. SVMs are pure classification algorithms that are statistically highly robust and achieve very high performance. In contrast, the BrierScoreMF is a factorization method designed for both signal decomposition and prediction. Therefore, the problem solved by our algorithm is more constrained than the one solved by the SVM.
In addition, our method has fewer degrees of freedom. To infer a BrierScoreMF model, only k variables are optimized, k being the number of classes. In contrast, even a linear SVM has m variables, m being the number of input dimensions, to specify a predictive model. In our case, m = 3200 and k = 16, rendering the BrierScoreMF the less flexible model. Moreover, our method is a native multi-class algorithm where one model suffices to explain all classes. The employed multi-class linear SVMs, in contrast, are trained in one-versus-one mode, resulting in 16 · 15/2 = 120 models used for prediction. In terms of Occam's razor, our model is the simpler method, with a generative model suitable for prediction.
Thus, we compare both algorithms for baseline purposes and not to demonstrate the superiority of the BrierScoreMF. A comparison with actual factorization methods is planned as future work, because the question of fair performance measures for this task turns out to be far more delicate.
1.5 Experiments on synthetic data sets
IR spectra of chemical compounds and mixtures are smooth functions of the wavenumber. In general, measurements range from 400 cm⁻¹ to 4000 cm⁻¹ for Fourier-transform infrared spectroscopy. However, we have chosen to sample base signals from Sobolev spaces defined on the range [0, 1], as smoothness is more important here than the exact signal domain.
where the base signals are sampled via their coefficients θ_j.
where TP_i denotes the true positives, TN_i the true negatives, FP_i the false positives, and FN_i the false negatives of class c_i. Note that the BrierScoreMF employs an inner cross-validation loop for performance estimation; therefore, the outer cross-validation measures the true generalization error of our model.
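The per-class computation of sensitivity s_i = TP_i/(TP_i + FN_i) and specificity t_i = TN_i/(TN_i + FP_i) can be sketched as follows for {+1, -1} class matrices; the function name is our own:

```python
import numpy as np

def sens_spec(Y_true, Y_pred):
    """Per-class sensitivity TP/(TP + FN) and specificity TN/(TN + FP)
    for {+1, -1} class matrices (rows: samples, columns: classes)."""
    tp = ((Y_true == 1) & (Y_pred == 1)).sum(axis=0)
    fn = ((Y_true == 1) & (Y_pred == -1)).sum(axis=0)
    tn = ((Y_true == -1) & (Y_pred == -1)).sum(axis=0)
    fp = ((Y_true == -1) & (Y_pred == 1)).sum(axis=0)
    return tp / (tp + fn), tn / (tn + fp)

# 4 samples, 2 classes; one positive of each class is missed
Y_true = np.array([[ 1, -1], [ 1, -1], [-1,  1], [-1,  1]])
Y_pred = np.array([[ 1, -1], [-1, -1], [-1,  1], [-1, -1]])
se, sp = sens_spec(Y_true, Y_pred)
```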
The generation of a data set was performed as follows: First, the seed of the random number generator was set. Then, the threshold vector was sampled from a uniform distribution. After that, an n-array y of classes was obtained by sampling classes with replacement from c_1, ..., c_k. This was followed by sampling the order o of the Sobolev space by drawing an integer from the range [1, 100]. Based on this, a matrix T containing o signal coefficients for each of the k signals was drawn from a uniform distribution. Finally, we drew the coefficient matrix A from a uniform distribution.
First, the matrix S containing the d measurements at equally spaced coordinates between 0 and 1 was computed from the coefficient matrix T (d = 3200). Then, the class matrix Y was constructed from the class array y by setting the appropriate entries to +1 and every other entry to -1. Finally, we processed the coefficient matrix to relate to Y as follows: each entry of A was scaled to the range [0, b_i) for negative corresponding entries in Y and transformed into the range [b_i, 1] for positive ones. After that, all entries relating to negative Y-entries were rescaled accordingly. Given A and S, we finally computed X = AS, completing the synthetic data set.
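The sampling pipeline above can be sketched as follows. The truncated sine series is a simple stand-in for drawing smooth Sobolev-space signals, and all names and parameter defaults are illustrative:

```python
import numpy as np

def make_synthetic(n=60, k=4, d=3200, seed=0):
    """Sketch of the synthetic-data scheme: thresholds b, class labels y,
    smooth base signals S, and coefficients A scaled so that A[i, j]
    lies above b[j] exactly when sample i carries class j."""
    rng = np.random.default_rng(seed)
    b = rng.uniform(0.2, 0.8, size=k)          # per-class thresholds
    y = rng.integers(0, k, size=n)             # classes, with replacement
    o = int(rng.integers(1, 20))               # order (number of basis terms)
    T = rng.uniform(-1.0, 1.0, size=(k, o))    # signal coefficients
    x = np.linspace(0.0, 1.0, d)
    basis = np.array([np.sin(np.pi * (j + 1) * x) for j in range(o)])
    S = T @ basis                              # k smooth base signals
    Y = -np.ones((n, k))
    Y[np.arange(n), y] = 1.0                   # class matrix in {+1, -1}
    A = rng.random((n, k))                     # raw coefficients in [0, 1)
    # map entries to [b_j, 1] for positive labels and [0, b_j) otherwise
    A = np.where(Y > 0, b + A * (1.0 - b), A * b)
    return A @ S, Y, A, S, b

X, Y, A, S, b = make_synthetic()
```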
First, we found no significant differences in performance behavior with respect to the input dimension m for either algorithm. Inspection of the class parameter k reveals that the linear SVM is superior to the BrierScoreMF for problems with fewer than five classes. Nonetheless, in these categories the BrierScoreMF achieves sensitivities and specificities around 0.8 with a standard deviation of less than 0.1. For problems with 10 to 25 classes, the BrierScoreMF achieves sensitivities and specificities superior to the linear SVM. However, if the number of training samples is large (n = 150), the linear SVM obtains competitive specificities again. In summary, we find that the prediction performance of the BrierScoreMF decreases more slowly than that of the SVM with increasing number of classes. Finally, we note that, in contrast to the SVM, the standard deviations of the BrierScoreMF do not exceed 0.15 for sensitivity and 0.08 for specificity. In conclusion, we have characterized the prediction performance of the BrierScoreMF on synthetic data and compared it with a state-of-the-art machine learning method. As explained, the BrierScoreMF solves a more constrained problem while additionally generating an interpretable signal factorization, which compensates for the performance loss.
In the next section, we present results of the BrierScoreMF obtained by training on real IR spectra.
1.6 Experiments on a clinical data set
We found that the linear SVM was often superior to the BrierScoreMF. It was highly specific (Figure 6) while being less sensitive (Figure 5) than our method in some cases. As explained above, this outcome was expected, as the linear SVM has more degrees of freedom (m = 3200) than the BrierScoreMF (k = 16). In addition, training one-versus-one classifiers provides additional robustness with respect to noise, as the classification problem is separated into smaller pieces. Our algorithm, in contrast, is a native multi-class method that is additionally constrained to yield an interpretable factorization.
However, our method achieved an estimated reconstruction error of 1.5325 × 10⁻⁴ per matrix entry for this data set. The sensitivity ranges from 0.2809 to 0.9586, while the specificity ranges from 0.5324 to 0.8417. In addition, it infers interpretable and predictive signals that may lead to further insight into characteristic disease signals (Figure 4).
The additional files provide supplementary results for training without the water peaks (Additional file 1) as well as the detailed prediction performance of the BrierScoreMF method on the clinical dataset (Additional file 2).
In this work, we have presented the BrierScoreMF algorithm for the factorization of additive signals. The ultimate goal was to employ IR spectra obtained from blood samples to classify patients based on disease-specific signals. We have established a performance baseline for our method on both synthetic and real-world data. While yielding interpretable base signals, our factorization obtains comparable prediction performance on synthetic data sets comprising more than 10 classes. On real-world data, we measure sensitivities as well as specificities of up to 0.8.
Our factorization method combines the two tasks of prediction and signal inference. Therefore, we are confident that our work constitutes a basis for the further development of similar factorization algorithms. Future research should focus on improving the prediction performance of the BrierScoreMF, as well as on a fair comparison with existing factorization methods. The integration of non-negativity constraints into our algorithm is also of practical interest.
We would like to thank Dipl.-Math. Oliver Lendle (University of Mainz) for mathematical proofreading of this manuscript.
- Stuart BH: Infrared Spectroscopy: Fundamentals and Applications. Wiley; 2004.
- Malins D, Anderson KM, Jaruga P, Ramsey CR, Gilman NK, Green VM, Rostad SW, Emerman TJ, Dizdaroglu M: Oxidative changes in the DNA of stroma and epithelium from the female breast: potential implications for breast cancer. Cell Cycle 2006, 5(15):1629–1632. doi:10.4161/cc.5.15.3098
- Petrich W, Staib A, Otto M, Somorjai RL: Correlation between the state of health of blood donors and the corresponding mid-infrared spectra of the serum. Vibrational Spectroscopy 2002, 28:117–129. doi:10.1016/S0924-2031(01)00151-5
- Staib A, Dolenko B, Fink DJ, Früh J, Nikulin EA, Otto M, Pessin-Minsley MS, Quarder O, Somorjai R, Thienel U, Werner G, Petrich W: Disease pattern recognition testing for rheumatoid arthritis using infrared spectra of human serum. Clinica Chimica Acta 2001, 308(1–2):79–89. doi:10.1016/S0009-8981(01)00475-2
- Cichocki A, Zdunek R, Amari S: Nonnegative Matrix and Tensor Factorizations. Wiley; 2009.
- Hyvärinen A, Oja E: Independent Component Analysis: Algorithms and Applications. Neural Networks 2000, 13(4–5):411–430.
- Chen J, Wang XZ: A New Approach to Near-Infrared Spectral Data Analysis Using Independent Component Analysis. J Chem Inf Comput Sci 2001, 41:992–1001.
- Kopriva I, Jeric I, Cichocki A: Blind decomposition of infrared spectra using flexible component analysis. Chemometrics and Intelligent Laboratory Systems 2009, 97(2):170–178. doi:10.1016/j.chemolab.2009.04.002
- Bhowmick NA, Chytil A, Plieth D, Gorska AE, Dumont N, Shappell S, Washington MK, Neilson EG, Moses HL: TGF-β Signaling in Fibroblasts Modulates the Oncogenic Potential of Adjacent Epithelia. Science 2004, 303(5659):848–850. doi:10.1126/science.1090922
- Brier GW: Verification of forecasts expressed in terms of probability. Monthly Weather Review 1950, 78(1):1–3.
- Vapnik VN: The Nature of Statistical Learning Theory. New York: Springer-Verlag; 1995.
- Friedman J, Hastie T, Tibshirani R: Additive Logistic Regression: a Statistical View of Boosting. Annals of Statistics 1998.
- Magnus JR, Neudecker H: Matrix Differential Calculus with Applications in Statistics and Econometrics. Revised edition. Chichester: John Wiley; 1999.
- Nocedal J, Wright SJ: Numerical Optimization. Springer; 1999.
- Cortes C, Vapnik V: Support-vector networks. Machine Learning 1995, 20:273–297.
- Wasserman L: All of Nonparametric Statistics. Springer; 2005.
- R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2009. ISBN 3-900051-07-0. http://www.R-project.org
- Backhaus J, Müller R, Formanski N, Szlama N, Meerpohl HG, Eidt M, Bugert P: Diagnosis of breast cancer with infrared spectroscopy from serum samples. Vibrational Spectroscopy 2010, 52:173–177. doi:10.1016/j.vibspec.2010.01.013
- Trowbridge JJ, Orkin SH: DNA methylation in adult stem cells: New insights into self-renewal. Epigenetics 2010, 5(3):189–193. doi:10.4161/epi.5.3.11374
- Navab M, Gharavi N, Watson AD: Inflammation and metabolic disorders. Current Opinion in Clinical Nutrition and Metabolic Care 2008, 11(4):459–464. doi:10.1097/MCO.0b013e32830460c2
- Waterman CL, Kian-Kai C, Griffin JL: Metabolomic strategies to study lipotoxicity in cardiovascular disease. Biochimica et Biophysica Acta (BBA) - Molecular and Cell Biology of Lipids 2010, 1801(3):230–234. doi:10.1016/j.bbalip.2009.11.004
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.