- Proceedings
- Open Access

# Time-series alignment by non-negative multiple generalized canonical correlation analysis

- Bernd Fischer
^{1, 2}Email author, - Volker Roth
^{1}and - Joachim M Buhmann
^{1, 2}

**8 (Suppl 10)**:S4

https://doi.org/10.1186/1471-2105-8-S10-S4

© Fischer et al; licensee BioMed Central Ltd. 2007

**Published:**21 December 2007

## Abstract

### Background

Quantitative analysis of differential protein expressions requires to align temporal elution measurements from liquid chromatography coupled to mass spectrometry (LC/MS). We propose *multiple Canonical Correlation Analysis* (mCCA) as a method to align the non-linearly distorted time scales of repeated LC/MS experiments in a robust way.

### Results

Multiple canonical correlation analysis is able to map several time series to a consensus time scale. The alignment function is learned in a supervised fashion. We compare our approach with previously published methods for aligning mass spectrometry data on a large proteomics dataset. The proposed method significantly increases the number of proteins that are identified as being differentially expressed in different biological samples.

### Conclusion

Jointly aligning multiple liquid chromatography/mass spectrometry samples by mCCA substantially increases the detection rate of potential bio-markers which significantly improves the interpretability of LC/MS data.

## Keywords

- False Discovery Rate
- Canonical Correlation Analysis
- Ridge Regression
- Thin Plate Spline
- Differential Protein Expression

## Background

Liquid chromatography coupled to mass spectrometry (LC/MS) has emerged as the technology of choice for the quantitative analysis of proteins over the last decade. Technically, the liquid chromatography process separates solute molecules of a multi component chemical mixture which are measured by LC detectors as a time series of solute mass. A major problem when comparing two biological samples measured with LC/MS is a non-linear deformation of the time scale between two experiments. A LC/MS device generates mass peaks along the time axis. When two mass spectrometry experiments are aligned, it is our goal to generate matching hypotheses for as many peaks as possible between the two runs, while ensuring that most of these hypotheses are correct.

One of the standard methods for aligning mass spectrometry experiments is called *correlation optimized warping* (COW) [1], where piece-wise linear functions are fitted to align pairs of time series. A hidden Markov model [2] was proposed to align the mass spectrometry data as well as acoustic time series. In [3] the model was extended to use more than one *m/z* bin for aligning. Tibshirani [4] proposed hierarchical clustering for aligning. Kirchner et al. [5] resorted to robust point matching as developed in medical image analysis. All these previous methods did not utilize the information of identified peptides that are available in tandem mass spectrometry. In our previous work on LC/MS alignment [6] we addressed this problem by way of a semi-supervised nonlinear ridge regression model that maps one time scale onto the other. While this model has been demonstrated to outperform other approaches, it still suffers from two methodological shortcomings: (i) the regression approach is non-symmetric. By mapping the first experiment onto the second one can yield results different from mapping the second onto the first; (ii) the method is limited to aligning only *pairs* of time series, whereas in many experiments we have access to more than two replica. In this paper we will extend the ideas proposed in [6] by a *symmetric* approach based on *canonical correlation analysis*. mCCA is capable of aligning *multiple* time series and, thus, effectively benefits from an enlarged training set.

### Biological motivation

*relevant*for the discrimination between different biological conditions. In bottom-up proteomics, the proteins are first digested by an enzyme into smaller sized pieces, called peptides. Let $p{a}_{i}^{(1)}$ and $p{a}_{i}^{(2)}$ be the (measured) amount of ions of peptide

*i*in sample 1 and 2. According to [6] the differential protein expression estimate ${\widehat{\delta}}_{p}$ can be estimated as

*pa*

_{ i }is measured by the peak intensities.

For some peaks we have access to the underlying peptide sequence. The machine randomly selects a small number of peaks (typically 3) among the largest peaks of the MS spectrum. Peptide ions within a small mass/charge window are selected and stabilized in an ion trap. The selected peptide ions are further fragmented by a collision with a noble gas. A tandem mass spectrum (MS/MS) is acquired from these fragment ions. These peptide sequences are estimated based on MS/MS data, which contain a dissociation pattern of the peptide ions (see [8, 9] for details about peptide identification). In Figure 1 peaks with known peptide sequence are marked with a circle. In practice, this subset of identified peaks appears like a random selection (since the peptide masses for subsequent MS/MS spectra acquisition are selected randomly) and, consequently, the overlap of jointly identified peptides between replicated experiments is small. Since the measurement process is rather time consuming, the LC/MS machine selects only a small number of peaks for MS/MS scans and further identification.

When an experiment is repeated several times (technical replicates), one often observes that the mass axis is usually conserved very well, but the time axis shows substantial non-linear deformations. Since the mass axis is expected to contain only negligible errors and to keep the notation simple, we will not explicitly mention the mass measurement in the sequel.

- 1.
A large list of peaks at various time points

*without*knowledge about the underlying peptide sequence (typically 2000–3000 peaks). - 2.
A moderate list of peaks

*with*known peptide sequence (typically 100–300 peaks). The overlap of identified peptide sequences between experiments is small (typically 10–40 peaks).

The main idea behind our approach is to increase the number of identified peaks by *aligning* all replicated runs of the experiment. The individual time scales are warped to a *canonical* time scale which allows us to generate matching hypothesis even if the peptide sequences are missing. Focusing on the time measurements, we analyze the correctness of the predicted correspondences in terms of precision-recall statistics.

## Results and discussion

Our model for estimating the time warping function is based on multiple canonical correlation analysis (mCCA). Individual time scales are projected on a canonical scale such that the joint correlation in the projected space is maximized. The projection has to obey the constraint that the warping models the correspondence of monotonic temporal evolutions (i.e. negative time gradients are forbidden). This constraint is satisfied by projecting the time coordinates on a basis of hyperbolic tangent basis functions (generalized CCA) and by including a non-negativity constraint when optimizing the correlation (see the methods section).

As a test set for our aligning method we use 10 different sample pairs from an *Arabidopsis thaliana* cell culture. Each sample pair contains two slightly different biological samples. The different conditions are designed as follows: Given three different samples A, B, and C (in our case consecutive slices of a 1D-gel), the first sample contains a pool of A and B and the second sample contains a pool of B and C. Since the protein abundances on different gel slices are similar to each other, we measure the difference of protein abundance between consecutive gel slices. The two samples (pool A/B and pool B/C) are measured in two single experiments. For every sample 3 technical replicates are available for the analysis. Thus we have 3 LC/MS runs for the pool A/B and additional 3 LC/MS runs for the pool B/C.

The robust mCCA method is used for jointly aligning all 6 experiments available for a pair of samples. The results are then compared to the analysis based on the robust ridge-regression method which has been proposed in [6] as well as thin plate spline fitting [5]. The robust ridge regression technique possibly violates the monotonicity constraint of temporal warping and it has also not been developed for computing multiple alignments. All 6·5/2 pairs of LC/MS experiment are aligned by ridge regression. In addition we compare our method to a pairwise alignment method based on thin-plate splines [5] which is freely available. Instead of implicitly estimating the point correspondences, we fixed the given correspondences. All three methods produce a (possible empty) list of peaks where every peak is either identified by MS/MS or by prediction. Contradictions are resolved by majority vote.

### Validation of peak matching with known peptide sequence

The three methods are compared by 10-fold cross validation using the known labels of the peaks. All peptides that are identified in one of the 6 LC/MS experiments, are partitioned in 10 folds. Ridge regression, thin plate splines and multiple CCA have then been trained on 9 folds and the agreement on peptide sequence is then tested on the remaining fold. To measure the agreement, the number of peptides that are assigned to the same peak, are summed over all test peptides that are identified jointly in one pair of experiments and over all 6·5/2 pairs of experiments. We like to emphasize here, that even if a test peptide is compared in a pair of alignments, this test peptide did not appear in the training set of any other experiment.

- 1.
**no match.**The peak is not assigned to any peak in the second time series. - 2.
**correct.**The peak is assigned to the peak with the same label. - 3.
**wrong.**The peak is assigned to the peak with another label.

*precision*and

*recall*values as follows:

*σ*of hyperbolic tangent functions). There possibly exists a better parameter choice for the thin plate splines, but due to the enormous runtime, we could only select the parameters on a small sized example.

### Validation of differential protein expression values

In practical applications, the most important quality criterion for alignment methods of this kind is the number of proteins that are detected as significantly over-/underexpressed. In order to estimate this number, all six experiments are pair-wise aligned and log-peptide abundance ratios are estimated for all jointly identified peptides. Equation 1 suggests to calculate the mean log-peptide abundance ratio averaged over all peptides which originate from a particular protein. If this average log ratio deviates with t-test significance level *α* from zero then we declare this protein as strongly under- or overexpressed between the two conditions. The t-test with significance level *α* provides us with a list of differently expressed proteins. This test can be applied to two samples measured (i) under different biological conditions or (ii) as technical replicates. If the percentage of proteins detected as differentially abundant in different biological conditions *p*_{diff} substantially exceeds the percentage of proteins detected as differentially abundant in replicates *p*_{rep}, then we can conclude that the difference in these biological conditions significantly influences the proteome. The reader should note that one should compare against biological replicates, to detect significant biomarkers. Unfortunately for our analysis, no biological replicates are available. But the technical replicates are still sufficient to show that the underlying method is able to detect differences in biological samples from different experimental conditions.

A detection rate *p*_{rep} = *α* for a t-test with significance level *α* can be expected due to statistical fluctuations. We observed for our experiments that *p*_{rep} usually varies between 4% and 5% for *α* = 3% which is acceptable. The ratio *p*_{rep}/*p*_{diff} gives us now an estimate of the false discovery rate in the set of proteins detected as significantly different abundant. Changes in the significance level *α* controls the false discovery rate. The number of true-positive proteins are now estimated by the formula #positive·(1 - false discovery rate).

*potential*bio-markers that are further investigated by an additional subsequent analysis, and, therefore, they can accept more false discovery detections.

## Conclusion

In this paper we are concerned with one of the critical steps in the data analysis of quantitative differential proteomics experiments. If an experiment with a liquid chromatography unit is repeated, one typically observes a non-linear deformation of the time scales.

A novel technique for aligning such time scales is proposed where the alignment method is based on generalized canonical correlation analysis with a built-in non-negativity constraint. Two severe problems of previous approaches are solved with the novel technique: (i) non-symmetry of the time prediction function and (ii) a potential violation of the monotonicity constraint which is inherent in temporal alignments. On a large proteomics dataset we demonstrate that jointly aligning multiply replicated experiments increases both precision and recall: the total number of peptide correspondences is increased as well as the quality of these matches is improved by the novel technique. These improvements directly influence estimates of differential protein expression values, because the number of proteins are significantly increased that are detected as differentially abundant in our experiments.

## Methods

### Formal problem description

*K*different time scales. Each time scale

*k*is described by a list of peaks with time coordinates

*k*and

*l*

are provided by data base search. These time points are defined by those peptides that have been identified in both samples. For a peak *p* ∈ *P*_{
k
}we try to find a corresponding peak *q* ∈ *P*_{
l
}(if there exists one), which formally amounts to determine a mapping

*f*_{k,l}: *P*_{
k
}→ *P*_{
l
}∪ {∅} (6)

from the set of peaks *P*_{
k
}to the set of peaks *P*_{
l
}extended by the symbol ∅. This symbol ∅ represents the case that no corresponding peak can be found on the time scale *l*. Note that Eq. (6) defines a mapping between finite sets of time points.

To find a suitable mapping between the peaks, a continuous transformation between the time scales has to be learned. The function

*g*_{k,l}: ℝ → ℝ (7)

*k*into the (continuous) time scale

*l*. Given the time transformation

*g*

_{k,l}we create the mapping

*f*

_{k,l}as

The peak ${t}_{j}^{(k)}$ is mapped to the peak closest to the predicted time on time scale *l* within a window of size *w*. The window size *w* controls the number of accepted correspondences. A smaller window size leads to a lower total number of accepted matches, but to a higher *precision* of the accepted correspondences, i.e. to an increased fraction of correct matches. A continuous time transformation *g*_{k,l}(see the next section for details) is learned and the transformation is evaluated using the mapping function *f*_{k,l}.

### Robust ridge regression

*g*

_{k,l}. Let $({x}_{i},{y}_{i})=\left({t}_{i}^{(k)},{t}_{i}^{(l)}\right)\in {C}_{k,l}$ be a tuple of known time correspondences between time series

*k*and

*l*. The time is explicitly transformed to the polynomial basis $\phi ({x}_{i})={(1,{x}_{i},{x}_{i}^{2},\mathrm{...},{x}_{i}^{d})}^{t}$. Without loss of generality it is assumed that the sample vectors

*φ*(

*x*

_{ i }) have zero mean and unit covariance, otherwise the vectors are normalized accordingly. Robust ridge regression finds the parameter vector

*β*that minimizes

*L*(

*ξ*) is Huber's loss function [10]

- 1.
ridge regression is not symmetric in the sense that

*g*_{l,k}is not an inverse of*g*_{k,l}:*x*≠*g*_{l,k}(*g*_{k,l}(*x*)); - 2.
the time transformation function is not necessarily monotonically increasing. It might, thus, violate the monotonicity constraint that is inherent in temporal alignments.

### Canonical correlation analysis

*φ*(

*x*

_{ i }) have zero mean and unit covariance. The objective is to find parameter vectors

*β*

_{1}and

*β*

_{2}that maximize the correlation between the linear projections

*φ*(

*x*

_{ i })

^{ t }

*β*

_{1}and

*φ*(

*x*

_{ i })

^{ t }

*β*

_{2}

*β*

_{1}||·||

*β*

_{2}||, since the covariance matrices ${\Phi}_{X}{\Phi}_{X}^{t}$ and ${\Phi}_{Y}{\Phi}_{Y}^{t}$ are normalized to unit covariance before. The problem can be solved directly by a transformation to the eigenvalue problem (see [12])

*β*

_{2}is then obtained by ${\beta}_{2}={\Phi}_{Y}{\Phi}_{X}^{t}{\beta}_{1}$. In order to include the time-monotonicity constraint and robustness, however, it is advantageous to use an alternative formalization of the optimization problem, which can be equivalently restated as follows (see [13]):

The function *g*_{
k
}(*x*_{
i
}) = *φ*(*x*_{
i
})^{
t
}*β*_{
k
}denotes the transformation from the time scale *k* to the canonical time scale. Even for this reformulated problem, however, no guarantee is given that the transformation is monotonically increasing. In the case of a non-monotone function it is impossible to find an unambiguous transformation between time scales *k* to *l*, because in general *g*_{
k
}is not invertible.

- 1.
A set of hyperbolic tangent basis functions is used.

- 2.
A non-negativity constraint on the regression parameters is introduced.

The set of vectors *z*_{1},...,*z*_{
d
}can either be chosen as the set of vectors *x*_{1},...,*x*_{
n
}or as a set of *d* time points equally distributed over the range of the respective time scale. The scaling parameter *σ* controls the smoothness of the estimated alignment function.

*β*

_{1}and

*β*

_{2}are non-negative, the function

*g*

_{ k }(

*x*

_{ i }) =

*φ*(

*x*

_{ i })

^{ t }

*β*

_{ k }is monotonically increasing, because it is a linear combination with non-negative coefficients of monotonically increasing (hyperbolic tangent) functions. Instead of hyperbolic tangent functions, any other monotonically increasing and bounded basis functions can be used. The alignment problem can now be defined as non-negative canonical correlation analysis with hyperbolic tangent basis functions. Learning the time warping requires to find the parameter vectors

*β*

_{1}and

*β*

_{2}that minimize the objective function

We solve this problems iteratively by gradient descent. In a first step the objective function is minimized over *β*_{1} while keeping *β*_{2} fixed. Then the length of the vector *β*_{1} is normalized to fulfill the constraint ||*β*_{1}|| = 1. In a second step the objective is minimized over *β*_{2} while keeping *β*_{1} fixed, and *β*_{2} is normalized accordingly. In each step of the iteration we have to solve a non-negative least-squares problem for which we use the Lawson-Hanson algorithm [14].

*robust*version that is less sensitive to outliers. For that purpose the squared loss is replaced by Huber's robust loss function (Eq. 10)

Optimization of the robust CCA functional is similar to its quadratic counterpart (15), but with an additional inner loop that solves the robust non-negative least-squares problems by an iteratively re-weighted non-negative least-squares algorithm.

### Multiple canonical correlation analysis

*β*

_{1}|| = 1 and ||

*β*

_{2}|| = 1 in Equation (13) to one single constraint ||

*β*

_{1}|| + ||

*β*

_{2}|| = 2 [17]. A possible extension of canonical correlation analysis to

*K*time series is then defined by

*β*

_{ k }while keeping the other vectors

*β*

_{ l },

*l*≠

*k*fixed. The pseudo code is given in Figure 5.

## Declarations

### Acknowledgements

The authors like to thank J. Grossmann, S. Baginsky and W. Gruissem for valuable discussions and the competence center of systems physiology and metabolic diseases (ETH Zurich), for support.

This article has been published as part of *BMC Bioinformatics* Volume 8 Supplement 10, 2007: Neural Information Processing Systems (NIPS) workshop on New Problems and Methods in Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S10.

## Authors’ Affiliations

## References

- Nielsen NPV, Carstensen JM, Smedsgaard J: Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping. J of Chromatography A. 1998, 805: 17-35. 10.1016/S0021-9673(98)00021-1.View ArticleGoogle Scholar
- Listgarten J, Neal RM, Roweis ST, Emili A: Multiple alignment of continuous time series. Neural Information Processing Systems. 2005, 17: 817-824.Google Scholar
- Listgarten J, Neal RM, Roweis ST, Wong P, Emili A: Difference detection in LC-MS data for protein biomarker discovery. Bioinformatics. 2007, 23: e198-1204. 10.1093/bioinformatics/btl326.View ArticlePubMedGoogle Scholar
- Tibshirani R, Hastie T, Narasimhan B, Soltys S, Shi G, Koong A, Le QT: Sample classification from protein mass spectrometry, by peak probability contrasts. Bioinformatics. 2004, 20: 3034-3044. 10.1093/bioinformatics/bth357.View ArticlePubMedGoogle Scholar
- Kirchner M, Saussen B, Steen H, Steen JA, Hamprecht FA: amsrpm: robust point matching for retention time alignment of LC/MS data with R. Journal of Statistical Software. 2007, 18:Google Scholar
- Fischer B, Grossmann J, Roth V, Baginsky S, Gruissem W, Buhmann JM: Semi-supervised LC/MS alignment for differential proteomics. Bioinformatics. 2006, 22 (14): e132-e140. 10.1093/bioinformatics/btl219.View ArticlePubMedGoogle Scholar
- Tang H, Arnold RJ, Alves P, Xun Z, Clemmer DE, Novotny MV, Reilly JP, Radivojac P: A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics. 2006, 22 (14): e481-e488. 10.1093/bioinformatics/btl237.View ArticlePubMedGoogle Scholar
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature. 2003, 422: 198-207. 10.1038/nature01511.View ArticlePubMedGoogle Scholar
- Sadygov RG, Cociorva D, JRY : Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book. Nature Methods. 2004, 1 (3): 195-202. 10.1038/nmeth725.View ArticlePubMedGoogle Scholar
- Huber P: Robust statistics. 1981, New York: WileyView ArticleGoogle Scholar
- Hotelling H: Relations between two sets of variates. Biometrica. 1936, 28: 321-377.View ArticleGoogle Scholar
- Hardoon DR, Szedmak S, Shawe-Taylor J: Canonical correlation analysis: an overview with application to learning methods. Neural Computation. 2004, 16: 2639-2664. 10.1162/0899766042321814.View ArticlePubMedGoogle Scholar
- Kuss M, Graepel T: The geometry of kernel canonical correlation analysis. Tech Rep TR-108. 2003, Max Planck Institute for Biological Cybernetics, Tübingen, GermanyGoogle Scholar
- Lawson CL, Hanson RJ: Solving least square problems. 1974, Prentice Hall, chap 23:Google Scholar
- Yamanishi Y, Vert JP, Nakaya A, Kanehisa M: Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis. Bioinformatics. 2003, 9 (Suppl 1): i323-i330. 10.1093/bioinformatics/btg1045.View ArticleGoogle Scholar
- Bach FR, Jordan MI: Kernel independent component analysis. Journal of Machine Learning Research. 2002, 3: 1-48. 10.1162/153244303768966085.Google Scholar
- Via J, Santamaria I, Perez J: Canonical correlation analysis (CCA) algorithms for multiple data sets: application to blind simo equalization. European Signal Processing Conference. 2005Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.