A factorization method for the classification of infrared spectra

Henneges, Carsten; Laskov, Pavel; Darmawan, Endang; Backhaus, Jürgen; Kammerer, Bernd; Zell, Andreas

doi:10.1186/1471-2105-11-561

Methodology article
Open access
Published: 15 November 2010

A factorization method for the classification of infrared spectra

Carsten Henneges¹,
Pavel Laskov¹,
Endang Darmawan²,
Jürgen Backhaus²,
Bernd Kammerer³ &
…
Andreas Zell¹

BMC Bioinformatics volume 11, Article number: 561 (2010) Cite this article

6023 Accesses
4 Citations
Metrics details

Abstract

Background

Bioinformatics data analysis often deals with additive mixtures of signals for which only class labels are known. Then, the overall goal is to estimate class related signals for data mining purposes. A convenient application is metabolic monitoring of patients using infrared spectroscopy. Within an infrared spectrum each single compound contributes quantitatively to the measurement.

Results

In this work, we propose a novel factorization technique for additive signal factorization that allows learning from classified samples. We define a composed loss function for this task and analytically derive a closed form equation such that training a model reduces to searching for an optimal threshold vector. Our experiments, carried out on synthetic and clinical data, show a sensitivity of up to 0.958 and specificity of up to 0.841 for a 15-class problem of disease classification. Using class and regression information in parallel, our algorithm outperforms linear SVM for training cases having many classes and few data.

Conclusions

The presented factorization method provides a simple and generative model and, therefore, represents a first step towards predictive factorization methods.

Background

Bioinformatics data analysis often deals with additive mixtures of signals from unknown interfering sources. In the majority of cases, only class labels are known for each sample, which hampers the estimation of the original source signals. An example of such a situation is the search for metabolic features in blood within different patient groups. In blood, several signal sources add up as each single organ may submit hormones contributing its state into this complex mixture. For instance, adipocytes secrete the hormone leptin to indicate their state. This signal is then recognized in the hypothalamus to regulate the appetite. At the same time, insulin is secreted by pancreatic beta cells for the regulation of the blood sugar. Both peptide hormones are present within the blood while their regulation results in different outcomes. However, both signals are also hidden within a huge and noisy background of further signals also present in the blood stream. Consequently, a large number of samples must be taken to clearly identify an unknown signal. Infrared (IR) spectroscopy is a rapid method for detecting signals in biological samples. It relies on quantities of 1 μl size that can be easily obtained and it is fast: measuring a complete sample where each single molecule is detected requires a total time of 30 s on a Bruker Tensor 37.

The principles of IR spectroscopy, see for instance [1], are illustrated in Figure 1. IR spectroscopy can be used for the quantification of known compounds or for structural elucidation of unknown molecules. An IR source emits light towards a sample solution of chemical compounds. IR radiation is absorbed by chemical compounds as motion energy when the absorbed energy fulfills the resonance condition of a tone or related overtones. In this way, IR spectroscopy detects oscillations of bonds. As an additional condition, IR spectroscopy requires that oscillations lead to a periodical change of the molecular dipole moment. Consequently, compounds having no dipole are IR inactive. However, in the case of an IR active compound functional groups can be identified by their characteristic absorption bands, and thus give hints for structural elucidation. Alternatively, compounds can be identified through their characteristic fingerprint region within their IR spectra. This unique characteristic absorption fingerprint completely depends on the molecular constitution, because each path through a compound that is associated with a change in dipole moment absorbs at a characteristic wave length. All such paths of various lengths yield the characteristic absorption spectrum and uniquely identify the compound. Thus, the prediction of the IR spectrum of a compound is a hard task. Finally, we want to note that the IR detector records a mixture signal of all compounds present in the sample. Consequently, each single molecule present in the 1 μ l sample contributes to the signal, whether it is known or not. Then, the vibration spectrum represents a complex " fingerprint" of the biochemical condition of the sample wherein single compounds are not recognized any more.

However, all diseased changes are included in detail integrately such that the sample can be analyzed objectively and without knowing disease markers with the IR-spectroscopy. In this way, IR spectroscopy has a great potential as a method for early diagnosis and therapy control [2–4]. Analyzing IR spectra is however a complex signal processing problem.

Nonetheless, there exist algorithms that are able to separate additive signals into estimated subcomponents. Examples for these methods are Non-negative Matrix Factorization (NMF) [5] or Independent Component Analysis (ICA) [6, 7]. Both compute a generative additive signal model that is fitted to data samples to estimate the basic subsignals each data sample is composed of. However, IR spectra do not completely fulfill sparseness or smoothness constraints used by ICA or NMF completely, see [8]. Moreover, these methods are not designed for training on data with classification labels nor do they yield predictive models. In this work, we solve the class assignment problem and design a factorization method using a generative additive model that can be trained on data samples having class labels. For each class label, a factor signal is computed that, when exceeding a learned threshold, predicts the specific label. Therefore, our method can be trained on cheap IR spectra using class information and extract meaningful components from these signals, which leads to further insight into data and a predictive model.

Methods

This section develops the new predictive matrix factorization algorithm named BrierScoreMF for IR spectra. First, we motivate and define the problem. Then, we introduce factorization and classification loss functions and their matrix formulations. Finally, we derive the BrierScoreMF algorithm.

1.1 Problem formulation

In daily practice, bioinformatics often deals with signals from interfering sources. Each source could have considerable impact on the final interpretation of the signal. For instance, consider endocrine signaling. The endocrine system is composed of glands secreting a hormone into the blood stream. Within certain ranges, these signals represent the normal body state. However, increased signals may indicate a disease state, e.g. oncogenesis [9]. Thus, measuring all endocrine signals yields a superposition of healthy and disease signal combinations that have to be separated to diagnose the physical state. Moreover, disease signals may be combinations of coregulated signals not originating from a single signal source. In practice, measured signals are only grouped by disease classes raising the question for the characteristic shape of the disease signals.

Thus, we are dealing with two simultaneous problems: A signal decomposition problem and a classification problem that is based on the signal decomposition. A practical approach would try to learn the signals from given data samples.

Matrix factorization methods are convenient algorithms for the signal decomposition task [5]. These methods solve the problem of finding the decomposition X = AS for any matrix X. In general, this problem is ill-posed. However, using constraints restricts the number of feasible solutions, which can be found by local optimization algorithms. Commonly used restrictions comprise constraints for the statistical independence of signals [6] as well as non-negativity or sparsity of coefficients in A[5]. Up to now no factorization method is known using class labels, therefore our approach includes constraints for classification that are needed to learn from IR spectra obtained in clinical studies.

We begin with developing our predictive factorization algorithm. Given n pairs $({\vec{x}}_{i}, y_{i})$ of data samples comprising signals ${\vec{x}}_{i} \in ℝ^{d}$ and k classes y_i ∈ C = {c _i , . . .,c _k }, we define the following matrices

Design matrix X = {({\vec{x}}_{i})}^{T} \forall i

(1)

Class matrix Y = (y_{i j}) \forall i where y_{i j} = {\begin{cases} + 1, if x_{i} has class c_{j} \\ - 1, e l s e \end{cases} .

(2)

The dimensions are X ∈ ℝ^{n × d}and Y ∈ ℝ ^{n × k} . Thus, each row in X defines a measured signal and relates to a row in Y containing binary class information.

Searching for a factorization into signals ${\vec{s}}_{j} \in ℝ^{d}$ and coefficients a_ij ∈ ℝ, we want that

{\vec{x}}_{i} = \sum_{j = 0}^{k} a_{i j} {\vec{s}}_{j} \forall i \Leftrightarrow X = A S

(3)

.

where A ∈ ℝ ^{n × k} and S ∈ ℝ ^{k × d} . Equation (3) means that each signal is a linear combination of k different source signals and defines the general factorization problem in matrix formulation with respect to A and S. In practice, noise hampers the inference of the ${\vec{s}}_{j}$ and, consequently, this condition is not fulfilled exactly by any solution. However, we will see that in our special case the problem only reduces to finding a suitable S. For classification, we propose a linear approach using a threshold. Therefore, we want that

y_{i j} = signum (a_{i j} - b_{j}) \forall i, j

(4)

where $\vec{b} \in ℝ^{k}$ is a column threshold vector of the factorization. If the signal fraction exceeds a certain threshold, this will indicate the class membership within our prediction model.

1.2 Factorization loss functions

In general, factorization algorithms focus on the signal side of the problem. These methods optimize special distance functions between probability distributions, referred to as divergences, to estimate A and S. It can be shown that optimizing A and S in parallel is a non-convex optimization problem. Commonly used divergences include the Frobenius norm as well as the Kullback-Leibler divergence. Other exemplary divergences are the Itakura-Saito divergence and the families of α- and β-divergences [5].

However, in this work we will rely on the Frobenius norm between X and AS for divergence. Thus, we define the reconstruction error part of our loss function as

ℱ (X, A S) = {‖ X - A S ‖}_{F}

(5)

where

{‖ Z ‖}_{F} \equiv \sqrt{\sum_{i, j} z_{i, j}^{2}} \Leftrightarrow {‖ Z ‖}_{F}^{2} = tr Z^{T} Z

for some matrix Z. Here, tr denotes the trace of a matrix.

We have chosen the Frobenius norm as divergence for the reconstruction error, because it easily allows to compute the matrix differentials of an expression. This will simplify the search for possible solutions in section 1.4.

1.3 Classification loss functions

Classification algorithms focus on the inference of a predictive model for a target variable from training data. Therefore, they optimize classification loss functions that penalize false predictions to find the most probable parametrization of a model. Convenient loss functions comprise the Brier Score [10], the SVM loss [11], the logistic loss [12], as well as the Misclassification loss function.

We chose the Brier-Score [10] as it also can be expressed in terms of matrix computations. Let y ∈ {-1, +1} be the class label and let E[.] denote the expectation operator. Then the Brier-Score is defined as

E [{(y f (x) - 1)}^{2}]

(6)

where f(x) is a parametrized model function.

Now, consider equation (4) and define the matrix V to contain the signum arguments

V \equiv {(a_{i j} - b_{j})}_{n \times k} = A - 1_{n \times 1} {\vec{b}}^{T}

where $\vec{b} \in ℝ^{k \times 1}$ is the vector of column thresholds. Then, the Brier-Score can be written as a matrix function from $ℝ^{n \times k} \mapsto ℝ$ as

ℬ (Y, V) = κ tr {(Y \circ V - 1_{n \times k})}^{T} (Y \circ V - 1_{n \times k})

where $κ = \frac{1}{n k}$ ,Y is the class matrix, 1_n×k an n×k matrix of ones, and ◦ denotes the Hadamard product.

1.4 The predictive factorization algorithm

Current factorization methods are not predictive and can only be used for signal inference. In the case of NMF methods [5], this arises from the gradient descent methods used for optimization. Often, an alternating gradient descent is performed, where one matrix is kept fixed while the other is optimized. The drawback for a predictive approach based on A is that for a given NMF signal matrix S the corresponding A is not uniquely defined.

For any predictive approach, training a model requires that A is treated as a function of S and X. This, to our best knowledge, is not the case in current factorization approaches.

Here, we solve this problem by using the Moore-Penrose Pseudoinverse (MP) of S during training to compute A. The MP is uniquely defined for any matrix S. Let S⁺ denote the MP of S being defined by the following properties

S S^{+} S = S, S^{+} S S^{+} = S^{+}, {(S S^{+})}^{T} = S S^{+}, {(S^{+} S)}^{T} = S^{+} S

(7)

Using these rules, it is easy to show that

X = A S \Leftrightarrow X S^{+} = A

(8)

using (7) and assuming the existence of the quadratic matrix (SS⁺)^-1. Now, A is clearly defined as a function of X and S and we have solved the problem of the uniqueness.

Used in the following sections, we derive the differential for S⁺. Therefore, we adopt the notation from [13] to compute dS⁺ as

\begin{array}{l} d S & = & d (S S^{+} S) = (d S) (S^{+} S) + S (d S^{+}) S + (S S^{+}) (d S) \\ \Leftrightarrow S (d S^{+}) S & = & d S - (d S) (S^{+} S) - (S S^{+}) (d S) \\ \Leftrightarrow d S^{+} & = & {(S^{+} S)}^{- 1} S^{+} (d S) S^{+} {(S S^{+})}^{- 1} \\ - {(S^{+} S)}^{- 1} S^{+} (d S) (S^{+} S) S^{+} {(S S^{+})}^{- 1} \\ - {(S^{+} S)}^{- 1} (S^{+} S) S^{+} (d S) S^{+} {(S S^{+})}^{- 1} \\ \Leftrightarrow d S^{+} & = & - S^{+} (d S) S^{+} . \end{array}

Together with the two loss functions and the MP differential, all ingredients are available for the BrierScoreMF algorithm. First, we join both loss functions into a combined minimization problem

\begin{array}{l} \min_{S, b} ℒ (S, \vec{b}) & = & ℱ {(X, X S^{+} S)}^{2} + \leftarrow (Y, X S^{+} - 1 {\vec{b}}^{T}) \\ = & tr {(X - X S^{+} S)}^{T} (X - X S^{+} S) \\ + κ tr {(Y \circ (X S^{+} - 1 {\vec{b}}^{T}) - 1)}^{T} (Y \circ (X S^{+} - 1 {\vec{b}}^{T}) - 1) \end{array}

and substitute A = XS⁺. Thus, the complete loss function is easily expressed using matrix terms, where we have omitted the sizes of the 1-matrices for simplicity. Furthermore, we have used that is suffices to optimize a monotonic transformation of F [[13], p. 129 Theorem 9].

To find a minimizer of the L, we compute the differential

\begin{array}{l} d ℒ (S, b) & = & d tr(X - X S^{+} S)^{T} (X - X S^{+} S) \\ + κ d tr(Y \circ (X S^{+} - 1 {\vec{b}}^{T}) - 1)^{T} (Y \circ (X S^{+} - 1 {\vec{b}}^{T}) - 1) \end{array}

For the first summand, we compute

\begin{array}{l} d tr {(X - X S^{+} S)}^{T} (X - X S^{+} S) \\ = 2 tr {(X - X S^{+} S)}^{T} d (X - X S^{+} S) \\ = - 2 tr {(X - X S^{+} S)}^{T} X d (S^{+} S) \\ = - 2 tr {(X - X S^{+} S)}^{T} X ((d S^{+}) S + S^{+} (d S)) \\ = - 2 tr {(X - X S^{+} S)}^{T} X (- S^{+} (d S) S^{+} S + S^{+} (d S)) \\ = 2 tr {(X - X S^{+} S)}^{T} X S^{+} (d S) S^{+} S - 2 tr {(X - X S^{+} S)}^{T} X S^{+} (d S) \end{array}

Using trA^T (B ◦ C) = tr(A^T ◦ B^T ) C [[13]13, p. 45, Theorem 7 (a)], the second summand derives to

\begin{array}{l} d tr {(Y \circ (X S^{+} - 1 {\vec{b}}^{T}) - 1)}^{T} (Y \circ (X S^{+} - 1 {\vec{b}}^{T}) - 1) \\ = 2 tr {(Y \circ (X S^{+} - 1 {\vec{b}}^{T}) - 1)}^{T} d (Y \circ (X S^{+} - 1 {\vec{b}}^{T}) - 1) \\ = 2 tr {(Y \circ (X S^{+} - 1 {\vec{b}}^{T}) - 1)}^{T} (Y \circ (d X S^{+})) \\ - 2 tr {(Y \circ (X S^{+} - 1 {\vec{b}}^{T}) - 1)}^{T} (Y \circ 1 {(d \vec{b})}^{T}) \\ = - 2 tr {((Y \circ (X S^{+} - 1 {\vec{b}}^{T}) - 1) \circ Y)}^{T} X S^{+} (d S) S^{+} \\ - 2 tr ((Y \circ (X S^{+} - 1 {\vec{b}}^{T}) - 1) \circ Y)^{T} 1 {(d \vec{b})}^{T} \end{array}

Now, consider the term

\begin{array}{l} {((Y \circ (X S^{+} - 1 {\vec{b}}^{T}) - 1) \circ Y)}^{T} \\ = {(Y \circ (X S^{+} - 1 {\vec{b}}^{T}) \circ Y - 1 \circ Y)}^{T} \\ = {(X S^{+} - 1 {\vec{b}}^{T} - Y)}^{T}, \end{array}

because Y ◦ Y = 1 and 1 ◦ Y = Y. Setting the differential to zero and using the computation rules for the trace, especially trABC = trCAB, we derive

\begin{array}{l} d ℒ (S, b) & = & 2 tr {(X - X S^{+} S)}^{T} X S^{+} (d S) S^{+} S \\ - 2 tr {(X - X S^{+} S)}^{T} X S^{+} (d S) \\ - 2 κ tr {(X S^{+} - 1 {\vec{b}}^{T} - Y)}^{T} X S^{+} (d S) S^{+} \\ - 2 κ tr {(X S^{+} - 1 {\vec{b}}^{T} - Y)}^{T} 1 {(d \vec{b})}^{T} \\ = & tr [(2 S^{+} S {(X - X S^{+} S)}^{T} - 2 {(X - X S^{+} S)}^{T} \\ - 2 κ S^{+} {(X S^{+} - 1 {\vec{b}}^{T} - Y)}^{T}) X S^{+} (d S)] \\ - 2 κ tr {(X S^{+} - 1 {\vec{b}}^{T} - Y)}^{T} 1 {(d \vec{b})}^{T} \\ = & 0 . \end{array}

As both tr terms relate to distinct differentials, we first obtain that

0 = X S^{+} - 1 {\vec{b}}^{T} - Y \Leftrightarrow X S^{+} = Y + 1 {\vec{b}}^{T} \equiv W

for the coefficients of $d \vec{b}$ .

Assuming XS⁺ ≠ 0 and substitution of W back into $0 = d ℒ (S, \vec{b})$ yields

\begin{array}{l} 0 & = & 2 S^{+} S {(X - W S)}^{T} - 2 {(X - W S)}^{T} - 2 κ S^{+} {(W - 1 b^{T} - Y)}^{T} \\ = & 2 (S^{+} S - E) (X - W S) - 2 κ 0 \\ \Rightarrow X & = & W S \\ \Leftrightarrow S & = & W^{+} X = {(Y + 1 {\vec{b}}^{T})}^{+} X \end{array}

(9)

To this end, we have found a solution for the predictive matrix factorization problem using the Brier Score as classification loss and the Frobenius norm as factorization loss. Moreover, the solution is fully determined by a single k × 1 vector $\vec{b}$ that allows the computation of the factorized signal matrix S as well as the computation of the predictive coefficient matrix

A = X^{*} S^{+} = X^{*} {({(Y + 1 {\vec{b}}^{T})}^{+} X)}^{+}

for unknown data X*.

Now, the final problem of finding the vector $\vec{b}$ remains. In the present approach, we found that optimizing the following target function yields best performance

O (\vec{b}) \equiv \frac{1}{r} \prod_{i} (s_{i} t_{i})

where s_i and t_i denote the cross-validated sensitivities and specificities, and r denotes the cross-validated reconstruction error. Using numerically computed gradients for $\vec{b}$ in combination with a BFGS local search method [14] to optimize $O$ completes the BrierScoreMF.

We conclude this section with an interpretation of equation (9). First, we note that the BrierScoreMF has very few parameters, namely $\vec{b} \in ℝ^{k}$ , which minimizes the probability of over-fitting (Occam's Razor), but also hampers the algorithm in obtaining high prediction performance. Next, the computation of S involves both, the design matrix X used for training and the class matrix Y . Thus, using the known classes and a linear offset $\vec{b}$ the training data is projected by the MP of ${(Y + 1 {\vec{b}}^{T})}^{+}$ to a transformed matrix S.

Consequently, the training information Y and X are compressed together with the learned variables $\vec{b}$ in S. In this way, our new factorization method is similar to nearest neighbor classifiers, which also store the training data itself while learning a threshold value for classification.

All software used in this article is freely available from the author.

Results and Discussion

This section empirically compares the performance of the BrierScoreMF with linear Support Vector Machines (SVM) [15]. Therefore, we sample synthetic signal functions together with class and coefficient matrices for training both machine learning models. This setting was specifically designed with regard to the application case of IR spectroscopy. Finally, we train both algorithms on a real world IR data set comprising various diseases for classification.

We would like to note in advance that this comparison is not totally fair. SVM are pure classification algorithms that are statistically highly robust and achieve very high performance. In contrast, the BrierScoreMF is a factorization method designed for both, signal decomposition and prediction. Therefore, the problem solved by our algorithm is more constrained than the SVM.

In addition, our method has less degrees of freedom. To infer a BrierScoreMF model only k, being the number of classes, variables are optimized. Contrarily, even a linear SVM has m, being the number of input dimensions, variables to specify a predictive model. In our case, m = 3200 and k = 16, thus rendering BrierScoreMF the less flexible model. In addition, our method is a native multi-class algorithm where one model suffices to explain all classes. In contrast, the employed multi-class linear SVM are trained in one-versus-one mode resulting in 16 · 15 = 240 models used for prediction. In terms of Occam's razor, our model is the more simple method with an generative model suitable for prediction.

Thus, we compare both algorithms for baseline reasons and not to demonstrate the superiority of the BrierScoreMF. A comparison to actual factorization methods is planned as future work, because the question for fair performance measures for this task turns out to be far more delicate.

1.5 Experiments on synthetic data sets

IR spectra of chemical compounds and mixtures are smooth functions of the wavelength. In general, the measurement ranges from 400cm^-1 to 4000cm^-1 for Fourier-Transform Infrared Spectroscopy. However, we have chosen to sample base signals from Sobolev Spaces [16] defined on the range 0[1] as smoothness is more important than the signal domain.

Sobolev spaces are function spaces defining smooth functions. In a Sobolev space, smoothing a function means shrinking higher order coefficients towards zero. Therefore, sampling signals from this family of functions yields appropriate spectra that are smooth. We chose the Fourier basis ϕ_i (x)

φ_{1} (x) = 1, φ_{2 j} (x) = \frac{1}{\sqrt{2}} \cos (2 j π x), φ_{2 j + 1} (x) = \frac{1}{\sqrt{2}} \sin (2 j π x), j = 1, 2, \dots

from which signals

f (x) = \sum_{j = 1}^{\infty} θ_{j} φ_{j} (x) .

where sampled by their coefficients θ_j.

In this experiment, each synthetic data set is defined by four parameters: a seed for the random number generator to make the experiment reproducible, the number n of samples generated for the data set, the number k of classes contained in the data set, and the number m of feature dimensions. We used 5-fold cross-validation (CV) to estimate the prediction performance in terms of sensitivity s_i and specificity t_i as well as the reconstruction error r

s_{i} \equiv \frac{T P_{i}}{T P_{i} + F N_{i}}, t_{i} \equiv \frac{T N_{i}}{T N_{i} + F P_{i}}, r \equiv ∥ X - X S^{+} S ∥_{F}

where TP_i denotes the true positives, TN_i the true negatives, FP_i the false positives, and FN_i the false negatives of class c_i . Note that the BrierScoreMF employs an inner cross-validation loop for performance estimation, therefore the outer cross-validation measures the true generalization error of our model.

The generation of a data set was performed as follows: First, the seed of the random number generator was set. Then, the $\vec{b}$ vector was sampled from a uniform distribution. After that an n-array y of classes was obtained by sampling classes with replacement from c₁ , . . ., c_k . This was followed by sampling the order o of the Sobolev space by drawing an integer out of the range [1, 100]. Based on this, a matrix T containing o signal coefficients for each of the k signals was drawn from a uniform distribution. Finishing the sampling round, we finally drew the coefficient matrix A from a uniform distribution.

First, the matrix S containing the m measurements at equally spaced coordinates between 0[1] was computed from the coefficient matrix T (d = 3200). Then, the class matrix Y was constructed from the class array y by setting appropriate entries on +1 and every other entry to -1. Finally, we processed the coefficient matrix to relate to Y as follows: Each entry of A was scaled to the range [0, b_i ) for negative corresponding entries in Y and transformed into the range [b_i , 1] for positive ones. After that all entries relating to negative Y -entries were scaled such that $\sum_{j} a_{i j} = 1 \forall i$ . Given A and S we finally computed X = AS, completing the synthetic data set.

In this way, we obtained 4800 synthetic data sets using 100 seeds for each combination of n ∈ {50, 100, 150}, k ∈ {2, 3, 4, 5, 10, 15, 20, 25} and m ∈ {50,100}. On each data set, first the BrierScoreMF and subsequently the linear multi-class SVM from the R package e1071 was trained [17]. Thus, a direct performance comparison based on 5-fold CV was obtained. The prediction results are shown in Figures 2 and 3.

First, we found that there exist no significant differences in the performance behavior with respect to the input dimensions m for both algorithms. Inspection of the class parameter reveals that the linear SVM is superior to the BrierScoreMF for problems with less than five classes. Nonetheless, in these categories the BrierScoreMF achieves sensitivities and specificities around 0.8 with a standard deviation of less than 0.1. For problems with the number of classes between 10 and 25 the BrierScoreMF achieves superior sensitivities and specificities to the linear SVM. However, if the number of training samples is large (n = 150), the linear SVM obtains competitive specificities again. In summary, we find that the prediction performance of the BrierScoreMF decreases slower than the performance of the SVM with increasing class size. Finally, we note that in contrast to the SVM the standard deviations of the BrierScoreMF for sensitivity do not exceed 0.15 and for specificity 0.08. In conclusion, we have characterized and compared the prediction performance of the BrierScoreMF on synthetic data with a state of the art machine learning method. As explained, the BrierScoreMF solves a more complex system by generating an interpretable signal factorization, which balances the performance loss.

In the next section, we present results of the BrierScoreMF obtained by training on real IR spectra.

1.6 Experiments on a clinical data set

Next, we applied the BrierScoreMF to real world data. Therefore, we have reused the IR spectra of blood serum measured for the study in [18]. Therein, serum samples were collected at the University Hospital Heidelberg, the University Hospital Mannheim, and the St. Vencentius Krankenhaus in Karlsruhe, while the healthy control was obtained from the blood donating center in Mannheim. In total, 15 different diseases were collected and analyzed. However, reference [18] combines the classes MCI and Alzheimer, Colitis Ulcerosa and Morbus Crohn, heart insufficiency and heart infarction, as well as colorectal carcinoma and rectal carcinoma and, therefore, reports 12 diseases. In this work, we predict the more detailed classifications. For each IR spectrum, 1 μl of serum was diluted to 3 μl of distilled water, placed and dried on a 384 well Si-sample carrier plate. Then, the plate was measured on a Bruker Tensor 37 Fourier Transform IR spectrometer (Bruker Optics GmbH Ettlingen, Germany). In total, each sample was measured at least at three different days having randomized positions on the 384 well plate to avoid environmental effects. In this work, subsequent data processing consisted of the removal of all triplicates having a pairwise Pearson correlation of less than 0.95. All remaining triplicates were averaged before Savitzky-Golay smoothing (filter length was set to 15). Finally, we employed 5-fold CV to estimate the prediction performance of both SVM and BrierScoreMF. The results of this evaluation are shown in Figures 4, 5, and 6.

We found that the linear SVM was often superior to the BrierScoreMF. It was highly specific (Figure 6) while being less sensitive (Figure 5) than our method in some cases. As explained above, this outcome was expected as the linear SVM has more degrees of freedom (m = 3200) compared to BrierScoreMF (k = 16). In addition, training one-versus-one classifiers provides additional robustness with respect to noise as the classification problem is separated into smaller pieces. Whereas our algorithm is a native multi-class algorithm that is additionally constrained to yield an interpretable factorization.

However, our method achieved an estimated reconstruction error of 1.5325 × 10^-04 per matrix entry for this data set. The sensitivity ranges from 0.2809 to 0.9586, while specificity ranges from 0.5324 to 0.8417. In addition, it infers interpretable and predictive signals that may lead to further insight into characteristic disease signals, Figure 4.

To demonstrate the ability of the BrierScoreMF to discover interesting signal features in IR spectra, we now focus on four exemplary signal peaks in Figure 4, named (a), (b), (c) and (d). Figure 7 shows only the discussed signals. The peak at (a) belongs to the colorectal carcinoma signal. It is within 1100-1150cm^-1 and, therefore, may belong to the region were normally the DNA/RNA ribose CO stretching vibrations appear [1]. In colorectal carcinoma, this potentially indicates an increase of DNA/RNA damage by post-transcriptionally modified nucleic acids induced by cancer progression [19]. The peak at (b) is at 2700-2950cm^-1 and, thus, is in the region of the CH-group of phospholipids. Signals comprising peaks at (b) include bronchial carcinoma, colorectal carcinoma, pancreas carcinoma, pancreatitis, as well as the prostate carcinoma. Here, the phospholipid groups may relate to inflammatory signals in the blood responding to cancer [20]. The signal peaks marked with (c) relate to amid groups, while (d) indicates an ester of phospholipids. The disease specific signal showing these peaks (c) and (d) belong to heart attack (Figure 8). It is known that lipids form plugs that are a major cause for heart attacks, which could correlate the signals at (d) [21]. Additionally, we measured the Pearson correlation between the heart attack and the heart insufficiency signal resulting in ρ = 0.9981, which equals the maximum positive correlation within the inferred signals. Consequently, our algorithm was able to detect several interesting disease specific signals for further research.

The additional files provide supplementary results for training without the water peaks (Additional file 1)) as well as the detailed prediction performance of the BrierScoreMF method on the clinical dataset (Additional file 2).

Conclusions

In this work, we have presented the BrierScoreMF algorithm for factorization of additive signals. The ultimate goal was to employ IR spectra obtained from blood samples to classify patients based on disease specific signals. We have established a performance baseline for our method on both, synthetic and real world data. Yielding interpretable base signals, our factorization obtains comparable prediction performance on synthetic data sets comprising more than 10 classes. On real world data, we measure sensitivities as well as specificities of up to 0.8.

Our factorization method combines both tasks of prediction and signal inference. Therefore, we are confident that our work constitutes the basis for further development of similar factorization algorithms. Future research should focus on improving the prediction performance of BrierScoreMF, as well as on a correct comparison with actual factorization methods. Also, the integration of non-negativity constraints into our algorithm is of practical interest.

References

Stuart BH: Infrared spectroscopy: fundamentals and applications. Wiley 2004.
Google Scholar
Malins D, Anderson KM, Jaruga P, Ramsey CR, Gilman NK, Green VM, Rostad SW, Emerman TJ, Dizdaroglu M: Oxidative changes in the DNA of stroma and epithelium from the female breast: potential implications for breast cancer. Cell Cycle 2006, 5(15):1629–1632. 10.4161/cc.5.15.3098
Article CAS PubMed Google Scholar
Petrich W, Staiba A, Ottob M, Somorjaic RL: Correlation between the state of health of blood donors and the corresponding mid-infrared spectra of the serum. Vibrational Spectroscopy 2002, 28: 117–129. 10.1016/S0924-2031(01)00151-5
Article CAS Google Scholar
Staiba A, Dolenkob B, Finkc DJ, Frühd J, Nikulinb EA, Ottoe M, Pessin-Minsleyc MS, Quardera O, Somorjaib R, Thienelf U, Wernera G, Petricha W: Disease pattern recognition testing for rheumatoid arthritis using infrared spectra of human serum. Clinica Chimica Acta 2001, 308(1–2):79–89. 10.1016/S0009-8981(01)00475-2
Article Google Scholar
Cichocki A, Zdunek R, Amari S: Nonnegative matrix and tensor factorizations. Wiley 2009.
Google Scholar
Aapo Hyvärinen and Erkki Oja: Independent Component Analysis: Algorithms and Applications. Neural Networks 2000, 13(45):411–430.
Google Scholar
Chen J, Wang XZ: A New Approach to Near-Infrared Spectral Data Analysis Using Independent Component Analysis. J Chem Inf Comput Sci 2001, 41: 992–1001.
Article CAS PubMed Google Scholar
Kopriva I, Jeric I, Cichocki A: Blind decomposition of infrared spectra using flexible component analysis. Chemometrics and Intelligent Laboratory Systems 2009, 97(2):170–178. [http://www.sciencedirect.com/science/article/B6TFP-4W1BV2M-1/2/1064f5a50e8e9ca9fb727f716ebd699c] 10.1016/j.chemolab.2009.04.002
Article CAS Google Scholar
Bhowmick NA, Chytil A, Plieth D, Gorska AE, Dumont N, Shappell S, Washington KayM, Neilson EG, Moses LH: TGF-Signaling in Fibroblasts Modulates the Oncogenic Potential of Adjacent Epithelia. Science 2004, 303(5659):848–850. 10.1126/science.1090922
Article CAS PubMed Google Scholar
Brier GW: Verification of forecasts expressed in terms of probability. Monthly Weather Review 1950.
Google Scholar
Vapnik VN: The nature of statistical learning theory. New York, NY, USA: Springer-Verlag New York, Inc; 1995.
Book Google Scholar
Friedman J, Hastie T, Tibshirani R: Additive Logistic Regression: a Statistical View of Boosting. Annals of Statistics 1998.
Google Scholar
Magnus JR, Neudecker H: Matrix Differential Calculus with Applications in Statistics and Econometrics. revised edition. John Wiley, Chichester; 1999.
Google Scholar
Nocedal J, Wright SJ: Numerical optimization. Springer Verlag 1999.
Google Scholar
Cortes C, Vapnik V: Support-vector networks. Machine Learning 1995, 20: 273–297.
Google Scholar
Wassermann L: All of nonparametric statistics. Springer 2005.
Google Scholar
R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria 2009. [ISBN 3–900051–07–0] [http://www.R-project.org] [ISBN 3-900051-07-0]
Google Scholar
Backhaus J, Müller R, Formanski N, Szlama N, Meerpohl HG, Eidt M, Bugert P: Diagnosis of breast cancer with infrared spectroscopy from serum samples. Vibrational Spectroscopy 2010, 52: 173–177. 10.1016/j.vibspec.2010.01.013
Article CAS Google Scholar
Trowbridge JJ, Orkin SH: DNA methylation in adult stem cells: New insights into self-renewal. Epigenetics 2010, 5(3):189–193. 10.4161/epi.5.3.11374
Article CAS PubMed Google Scholar
Navab M, Gharavi N, Watson AD: Inflammation and metabolic disorders. Current Opinion in Clinical Nutrition and Metabolic Care 2008, 11(4):459–464. 10.1097/MCO.0b013e32830460c2
Article CAS PubMed Google Scholar
Watermana CL, Kian-Kaia C, Griffin JL: Metabolomic strategies to study lipotoxicity in cardiovascular disease. Biochimica et Biophysica Acta (BBA)-Molecular and Cell Biology of Lipids 2010, 1801(3):230–234. 10.1016/j.bbalip.2009.11.004
Article Google Scholar

Download references

Acknowledgements

We would like to thank Dipl.-Math. Oliver Lendle (University of Mainz) for mathematical proofreading of this manuscript.

Author information

Authors and Affiliations

Zentrum für Bioinformatik Tübingen, Eberhard Karls Universität Tübingen, Sand 13, Tübingen, Germany
Carsten Henneges, Pavel Laskov & Andreas Zell
Institut für Instrumentelle Analytik und Bioanalytik, Hochschule Mannheim, Paul-Wittsack-Straße 10, Mannheim, Germany
Endang Darmawan & Jürgen Backhaus
Zentrum für Biosystemanalyse ZBSA, Albert-Ludwigs-Universität Freiburg, Habsburgerstraße 49, Freiburg, Germany
Bernd Kammerer

Authors

Carsten Henneges
View author publications
You can also search for this author in PubMed Google Scholar
Pavel Laskov
View author publications
You can also search for this author in PubMed Google Scholar
Endang Darmawan
View author publications
You can also search for this author in PubMed Google Scholar
Jürgen Backhaus
View author publications
You can also search for this author in PubMed Google Scholar
Bernd Kammerer
View author publications
You can also search for this author in PubMed Google Scholar
Andreas Zell
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Carsten Henneges.

Additional information

Authors' contributions

CH developed the algorithm and designed the experiment. PL also contributed to the experimental design. ED and JB collected and measured the clinical IR spectra. BK and AZ supervised the project. All authors read and drafted the manuscript.

Electronic supplementary material

12859_2010_4144_MOESM1_ESM.DOC

Additional file 1: Comparison of the BrierScoreMF performance excluding Water Absorbtion Peaks. Here, we compare the BrierScoreMF performance on the clinical dataset when training with and without the water absorption peaks located at [2200-2270 1/cm] and [3200-3700 1/cm]. We find that omitting these regions does not significantly alter the prediction performance. This file can be opened with Microsoft Word 2002, Open Office Writer 3.1.1, or similar word processor programs. (DOC 34 KB)

12859_2010_4144_MOESM2_ESM.XLS

Additional file 2: Detailed BierScoreMF performance. Here, we provide the BierScoreMF performance on the clinical dataset in terms of True Positives, True Negatives, False Positives, False Negatives, as well as Sensitivity, Specificity, and Matthews Correlation Coefficient. This file can be opened with Microsoft Excel 2002, Open Office Calc 3.1.1, or similar spreadsheet applications. (XLS 9 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Henneges, C., Laskov, P., Darmawan, E. et al. A factorization method for the classification of infrared spectra. BMC Bioinformatics 11, 561 (2010). https://doi.org/10.1186/1471-2105-11-561

Download citation

Received: 15 June 2010
Accepted: 15 November 2010
Published: 15 November 2010
DOI: https://doi.org/10.1186/1471-2105-11-561

A factorization method for the classification of infrared spectra