MALDI imaging mass spectrometry: statistical data analysis and current computational challenges
© Alexandrov; licensee BioMed Central Ltd. 2012
Published: 5 November 2012
Matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) imaging mass spectrometry, also called MALDI-imaging, is a label-free bioanalytical technique used for spatially-resolved chemical analysis of a sample. Usually, MALDI-imaging is used to analyze a specially prepared tissue section thaw-mounted onto a glass slide. The MALDI-imaging technique has developed tremendously during the last decade. Currently, it is one of the most promising innovative measurement techniques in biochemistry and a powerful and versatile tool for spatially-resolved chemical analysis of diverse sample types ranging from biological and plant tissues to bio and polymer thin films. In this paper, we outline computational methods for analyzing MALDI-imaging data with the emphasis on multivariate statistical methods, discuss their pros and cons, and give recommendations on their application. The methods of unsupervised data mining as well as supervised classification methods for biomarker discovery are elucidated. We also present a high-throughput computational pipeline for interpretation of MALDI-imaging data using spatial segmentation. Finally, we discuss current challenges associated with the statistical analysis of MALDI-imaging data.
In the last decade, matrix-assisted laser desorption/ionization-time of flight (MALDI-TOF) imaging mass spectrometry (IMS), also called MALDI-imaging, has seen incredible technological advances in its applications to biological systems [2–7]. While innovative ten years ago, applications to human or animal tissues are now fairly routine with established protocols already in place. New types of samples are continuously being analyzed (e.g. bacterial thin films, whole animal body sections, plant tissues, polymer films, and many more) with the main focus on proteomics. Although new IMS techniques are being introduced every year, our recent review shows that MALDI-imaging plays the leading role in the new, rapidly developing field of IMS-based proteomics.
This paper consists of two parts. Firstly, we outline computational methods for MALDI-imaging data analysis with the emphasis on multivariate statistical methods, discuss their pros and cons, and give recommendations on their application. We hope to guide molecular biologists and biochemists through the maze of existing computational and statistical methods. While this paper does not elucidate the basics of existing methodologies, we try to give clear and concise recommendations on when certain methods should be applied. Secondly, we discuss current computational and statistical challenges in analyzing MALDI-imaging data. MALDI-imaging is a relatively new field with only a limited number of laboratories performing data acquisition, although this number is growing rapidly. Presently, this field has a high entry barrier for a computational scientist, since only a few datasets are publicly available. In addition, computational results are normally presented in proteomics or mass spectrometry journals; therefore the computational and statistical challenges are not known in the statistical or bioinformatic communities. We hope that the second part of this paper will attract scientists from these communities to contribute to the fascinating field of computational IMS.
As the field of MALDI-imaging is constantly evolving, novel MALDI-based techniques were recently introduced such as 3D MALDI-imaging, MALDI-FTICR- or MALDI-Orbitrap-imaging; however, this paper focuses primarily on conventional MALDI-imaging using a TOF mass analyzer. We do not consider computational methods developed for secondary ion mass spectrometry (SIMS), another leading IMS technique, mainly because SIMS is not used in proteomic analysis with its mass range limited to below 1.0-1.5 kDa. Other emerging IMS techniques such as desorption electrospray ionization (DESI), laser ablation inductively coupled plasma mass spectrometry (LA-ICP-MS), or nanostructure-initiator mass spectrometry (NIMS) are not considered either. In general, all computational methods discussed in this paper can be applied or are already applied (such as PCA in the context of SIMS, see later in the text) to all mentioned IMS techniques. Although we tried to consider only computational methods available in existing software packages, some methods require in-house implementation.
A state-of-the-art MALDI-imaging dataset comprises a large number of spectra (usually 5,000-50,000) with each raw spectrum representing intensities measured at a large number (usually 10,000-100,000) of small m/z-bins and describing up to hundreds of different molecules. For any given m/z-value, the signal intensity at this m/z-value across all collected spectra can be visualized as a pseudo-colored image where each pixel is colored according to the intensity of its spectrum (sometimes called a heat map), which we call an m/z-image. Clearly, understanding and interpreting such a multitude of spectra or m/z-images requires computational data mining methods. Although a dataset can be mined manually, this is tedious work. Moreover, manual mining normally results in a few - sometimes arbitrarily selected - ions of interest, neglecting the major part of information represented in the IMS dataset.
An ultimate aim of processing, both manual and automated, of a MALDI-imaging dataset is to find m/z-values which correspond to ions of interest. These ions may be specific to a spatial region, e.g. be well co-localized with an anatomical region, or express difference between two spatial regions of one sample or between two different samples, e.g. be discriminative for a tumor region as compared with a control region. MALDI-imaging, as a non-targeted and label-free proteomic technique, delivers information about the wide range of molecules present in a sample and is well suited for discovery studies, e.g. for biomarker discovery. Computational methods are of special importance in discovery studies because manual data examination normally results in only a few - sometimes arbitrarily selected - ions. Such incomplete identification can undermine discovery. Once ions of interest are revealed with MALDI-imaging, they can be identified using MS-based proteomics identification methods; for a short review of identification strategies used in combination with MALDI-imaging, see .
For a broad review of technological principles and protocols used in IMS and, particularly, in MALDI-imaging, see the recent issue of Methods in Molecular Biology devoted to IMS . Moreover, see recent surveys [2, 22, 23] for a mass spectrometric perspective and  for a microbiology perspective.
We have structured this section by grouping computational methods according to the tasks they perform: firstly, pre-processing of spectra, then unsupervised data mining methods which can be used for preliminary data examination, then supervised classification applied e.g. in biomarker discovery. A typical MALDI-imaging study results in a set of ions of interest, which are visualized as m/z-images corresponding to their m/z-values. In the last subsection, we discuss visualization of such images.
A MALDI-imaging dataset represents a set of mass spectra with two spatial coordinates x and y assigned to each spectrum. In the current practice, the pre-processing of MALDI-imaging mass spectra does not differ much from spectra pre-processing in the conventional MALDI-MS of dried droplets and includes (1) normalization, (2) baseline correction, and, optionally, (3) spectra smoothing and (4) spectra recalibration. Standard and well-known MALDI-MS pre-processing methods can be applied to imaging data. For a discussion of mass spectra pre-processing from the MALDI-imaging perspective, see .
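To make the data layout concrete, the following minimal sketch shows how an m/z-image could be assembled from spectra with assigned x and y coordinates. The function name `mz_image`, the list-based data layout, and the nearest-bin lookup are our illustrative assumptions, not part of any particular software package.

```python
def mz_image(spectra, coords, mz_axis, target_mz, width, height):
    """Build an m/z-image: one pixel per spectrum, valued by the intensity
    at the m/z-bin closest to target_mz (hypothetical helper)."""
    # Index of the m/z-bin nearest to the requested m/z-value.
    bin_idx = min(range(len(mz_axis)), key=lambda i: abs(mz_axis[i] - target_mz))
    # Pixels without a measured spectrum stay at zero intensity.
    image = [[0.0] * width for _ in range(height)]
    for (x, y), spectrum in zip(coords, spectra):
        image[y][x] = spectrum[bin_idx]
    return image


# Toy dataset: 4 spectra on a 2x2 grid, 3 m/z-bins each.
mz_axis = [1000.0, 1001.0, 1002.0]
coords = [(0, 0), (1, 0), (0, 1), (1, 1)]
spectra = [[5.0, 0.0, 1.0], [4.0, 2.0, 1.0], [0.0, 9.0, 1.0], [1.0, 8.0, 1.0]]
image = mz_image(spectra, coords, mz_axis, 1001.2, width=2, height=2)
```

Here the bin at m/z 1001.0 is selected, and the resulting image shows high intensity in the lower row only.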
An important part of MALDI-imaging data pre-processing is spectra normalization, i.e. scaling each spectrum by some factor to allow a better comparison of intensities between different spectra. A standard method is the so-called total ion count (TIC) normalization, where the TIC of a spectrum (the sum of all its intensities) is calculated and then all intensities of the spectrum are divided by the TIC value. Although this topic is still debated, a recent extensive study, where TIC and five other normalization methods were considered, demonstrated the need for normalization. TIC is the most popular method and is recommended in general. For more careful analysis, Deininger et al. recommend considering either TIC or median normalization and selecting the proper method by visual examination of exemplary m/z-images after normalization.
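The two normalizations mentioned above can be sketched in a few lines. This is a minimal illustration, assuming each spectrum is a plain list of non-negative intensities; the function names are ours, and we assume that median normalization divides by the median intensity of the spectrum.

```python
from statistics import median


def tic_normalize(spectrum):
    """Divide all intensities by the total ion count (sum of intensities)."""
    tic = sum(spectrum)
    if tic == 0:
        return list(spectrum)  # empty spectrum: nothing to scale
    return [value / tic for value in spectrum]


def median_normalize(spectrum):
    """Divide all intensities by the median intensity (assumed definition)."""
    med = median(spectrum)
    if med == 0:
        return list(spectrum)  # avoid division by zero
    return [value / med for value in spectrum]


spectrum = [1.0, 3.0, 4.0]
tic_scaled = tic_normalize(spectrum)        # intensities now sum to 1
median_scaled = median_normalize(spectrum)  # median intensity is now 1
```

After TIC normalization every spectrum sums to one, so a single very intense peak lowers all other intensities of that spectrum; this is one reason to compare the resulting m/z-images visually before settling on a method.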
Another pre-processing method, which is sometimes considered separately from the traditional pre-processing methods listed above, is peak picking, i.e. selection of m/z-values which correspond to high and relevant peaks. The aim of peak picking is to reduce the number of m/z-values by neglecting those values corresponding to noise signals or to non-specific baseline signals; for more on noise and baseline see , for more on the physical TOF model influencing the peak shape see , for more on statistical modelling of noise and baseline see . Various peak picking methods for MALDI mass spectra are available and are implemented in mass spectrometry software packages. A recent comparison shows that the methods which take into account the shape of a peak, and not just its intensity, perform the best. However, peak picking in MALDI-imaging poses new problems due to the large number of spectra. Several approaches have been proposed. Firstly, peak picking can be applied to the dataset mean spectrum. It is a very fast method and is implemented, e.g., in the ClinProTools software (Bruker Daltonik GmbH, Bremen, Germany). However, this method is not sensitive, since it does not favor high and relevant peaks present only in a small part of a sample. For example, if a peak is present in only 1% of spectra (for an image of 100×100 pixels, this is an area of 10×10 pixels), then its contribution to the mean spectrum is reduced 100-fold, so that it can be outweighed by a low peak present in all spectra (e.g. a matrix peak). Secondly, a consensus approach has been proposed, in which those spectrum-wise picked peaks are selected that are found in at least 1% of spectra. A similar approach, but requiring manual selection of regions of interest (ROIs), was proposed in . In  and , for spectrum-wise peak picking, we applied the Orthogonal Matching Pursuit method which has complexity O(n^2), where n is the length of a spectrum (usually 10,000-100,000).
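The dilution effect of the mean spectrum and the consensus idea can be illustrated with a toy sketch. The local-maximum picker below is a deliberately simple stand-in for a real peak picking method (it is not Orthogonal Matching Pursuit), and the function names, the threshold and the toy data are our assumptions.

```python
from collections import Counter


def pick_peaks(spectrum, threshold):
    """Toy spectrum-wise picker: bins that are local maxima above threshold."""
    return {
        i for i in range(1, len(spectrum) - 1)
        if spectrum[i] > threshold
        and spectrum[i] >= spectrum[i - 1]
        and spectrum[i] >= spectrum[i + 1]
    }


def consensus_peaks(spectra, threshold, min_fraction=0.01):
    """Keep peaks that were picked in at least min_fraction of all spectra."""
    counts = Counter()
    for spectrum in spectra:
        counts.update(pick_peaks(spectrum, threshold))
    min_count = min_fraction * len(spectra)
    return sorted(i for i, c in counts.items() if c >= min_count)


# 100 spectra: a low matrix peak (bin 1) everywhere, a high peak (bin 3)
# in a single spectrum, i.e. in 1% of the dataset.
spectra = [[0.0, 2.0, 0.0, 0.0, 0.0] for _ in range(99)]
spectra.append([0.0, 2.0, 0.0, 50.0, 0.0])

mean_spectrum = [sum(col) / len(spectra) for col in zip(*spectra)]
from_mean = sorted(pick_peaks(mean_spectrum, threshold=1.0))
from_consensus = consensus_peaks(spectra, threshold=1.0)
```

In the mean spectrum the rare peak is diluted from 50 to 0.5 and falls below the threshold, whereas the consensus approach retains it because it is picked in 1% of the spectra.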
In general, one should consider efficient peak picking methods (with complexity no worse than O(n^2)) when working with MALDI-imaging data. When designing and performing spectrum-wise peak picking, one should keep in mind the inherent trade-off between efficiency and sensitivity. Firstly, processing all spectra makes the method potentially more sensitive than processing just a part of the spectra. Secondly, the more peaks are selected per spectrum, the more sensitive the method can be. However, increasing sensitivity in both cases leads to longer processing times.
When constructing a list of dataset-relevant peaks out of the spectrum-wise peak lists, the m/z-values selected in different spectra for the same peak can differ slightly. This effect is caused by instrumental and experimental variation and cannot be completely compensated by instrument calibration using reference markers (e.g. a mixture of peptides with known molecular masses). In order to counterbalance this effect, a peak alignment procedure should be applied. Although peak alignment is a well-known task in mass spectrometry, there are no dedicated studies of peak alignment in MALDI-imaging. Norris et al. briefly discuss peak alignment in the context of MALDI-imaging. We have proposed an original but simple procedure for alignment of peaks with respect to the mean spectrum; another group reported the use of the Matlab (The Mathworks Inc., Natick, MA, USA) routine msalign.
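As an illustration of what peak alignment has to achieve, the sketch below snaps each spectrum-wise picked m/z-value to the nearest reference peak (for instance, a peak of the mean spectrum) if it lies within a tolerance. This is a generic nearest-reference scheme written under our own assumptions (function name, tolerance value, discarding of unmatched peaks); it is neither the procedure cited above nor a reimplementation of msalign.

```python
from bisect import bisect_left


def align_to_reference(picked_mz, reference_mz, tol):
    """Map each picked m/z-value to the nearest reference m/z-value within
    tol; values without a reference peak nearby are discarded.
    reference_mz must be sorted in ascending order."""
    aligned = []
    for mz in picked_mz:
        pos = bisect_left(reference_mz, mz)
        # Only the neighbors to the left and right can be the nearest one.
        candidates = reference_mz[max(pos - 1, 0):pos + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda ref: abs(ref - mz))
        if abs(nearest - mz) <= tol:
            aligned.append(nearest)
    return aligned


reference = [1000.0, 1500.0, 2000.0]    # e.g. peaks of the mean spectrum
picked = [999.8, 1500.3, 1750.0, 2000.1]  # peaks picked in one spectrum
aligned = align_to_reference(picked, reference, tol=0.5)
```

In this toy example the three slightly shifted peaks are mapped onto their reference positions, and the value at m/z 1750.0 is dropped because no reference peak lies within the tolerance.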
Most statistical learning methods can be divided into two groups, so-called unsupervised and supervised methods. Unsupervised methods are used for data mining, can be applied without any prior knowledge, and aim at revealing general data structure. Supervised methods (mainly classification) require specifying at least two groups of spectra which need to be differentiated, e.g. by finding m/z-values differentiating spectra of tumor regions from spectra of control regions. In the context of MALDI-imaging, two unsupervised approaches have obtained recognition: component analysis and spatial segmentation.
Component analysis represents a MALDI-imaging dataset with a few score plots (or score images) and coefficients of contribution of each score image to each original m/z-image. Mathematically speaking, a set of score images is a generating system of all m/z-images, that is, each m/z-image from the dataset can be represented as a sum of score images multiplied by the respective coefficients. In the framework of MALDI-imaging, the best-known component analysis method is Principal Component Analysis (PCA). Other methods have also been studied: probabilistic latent semantic analysis, independent component analysis and non-negative matrix factorization. For a recent comparison of component analysis methods, see .
Hierarchical clustering has the advantage of providing clustering results in the form of a dendrogram which can be analyzed interactively. It is implemented in the flexImaging software (Bruker Daltonik) and was used in e.g. [39, 40]; for a histopathological discussion see a recent review. The main flaw of hierarchical clustering is that it requires a distance matrix of size n×n (n is the number of spectra) to be loaded into memory, which hinders processing of datasets with a large number of spectra. Moreover, it is subject to the pixel-to-pixel variability, leading to noisy segmentation maps, see Figure 3. As for the parameters (distance, linkage), Deininger et al. [38, 40] recommend choosing the Euclidean distance and the Ward linkage.
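The memory demand of the distance matrix can be estimated with simple arithmetic. The sketch below assumes 8-byte floating-point values; the function name and the option of storing only the upper triangle (a common space-saving layout) are our assumptions.

```python
def distance_matrix_bytes(n_spectra, bytes_per_value=8, condensed=False):
    """Memory needed for the pairwise distance matrix of n_spectra spectra.
    condensed=True stores only the n*(n-1)/2 distinct pairs."""
    if condensed:
        return n_spectra * (n_spectra - 1) // 2 * bytes_per_value
    return n_spectra * n_spectra * bytes_per_value


for n in (5_000, 50_000):
    full_gb = distance_matrix_bytes(n) / 1e9
    half_gb = distance_matrix_bytes(n, condensed=True) / 1e9
    print(f"{n} spectra: full {full_gb:.1f} GB, condensed {half_gb:.1f} GB")
```

For 5,000 spectra the full matrix takes 0.2 GB, but for 50,000 spectra it already takes 20 GB (about 10 GB in condensed form), which illustrates why the quadratic memory requirement becomes the limiting factor.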
Clustering methods that suppress pixel-to-pixel variability have been recently proposed [30, 32]. Both methods outperform hierarchical clustering by providing smooth, noiseless, and detailed segmentation maps. Although no publicly available implementations are provided yet, the second method [32] can be implemented relatively easily. For examples of segmentation maps produced with various methods, see Figure 3.
Based on our experience in developing and applying the MALDI-imaging data analysis pipeline, the following recommendations can be made. It is of crucial importance to represent the data in the most understandable and compact way for a biologist or practitioner; otherwise the large amount of information extracted from a MALDI-imaging dataset will not be appreciated. Providing a segmentation map is only a part of the data analysis process. Interpretation of the segmentation map is as important as (or even more important than) the segmentation itself. When finding co-localized m/z-values based on a segmentation map, one should consider all m/z-values, not only those selected by peak picking. Selecting many peaks prior to segmentation is not always needed; often a detailed segmentation does not need many peaks. Selecting many peaks slows down the segmentation and can introduce additional variation; usually 50-200 peaks is a good choice, although this depends on the analyzed mass range and samples. Memory requirements of a processing algorithm can be more important than the computational efficiency because the available memory is limited whereas the number of spectra increases quadratically with increasing spatial resolution. One should consider memory-efficient methods which have O(n) memory requirements (n is the number of spectra) and ideally do not require storing the full dataset in memory. Once a MALDI-imaging pipeline is developed and tested, it should be integrated with other computational tools for mass spectrometry analysis, which requires at least an export of all valuable information into a common format.
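One simple way to search for m/z-values co-localized with a cluster of a segmentation map is to correlate every m/z-image with the binary mask of that cluster. The sketch below uses the Pearson correlation for this purpose; the choice of measure, the threshold of 0.5 and all names are our illustrative assumptions rather than a description of a specific pipeline.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equally long sequences (0.0 if constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def colocalized_bins(spectra, labels, cluster, min_corr=0.5):
    """Return (bin index, correlation) for all m/z-bins whose intensities
    correlate with the mask of the given cluster, best first."""
    mask = [1.0 if label == cluster else 0.0 for label in labels]
    hits = []
    for j in range(len(spectra[0])):  # all m/z-bins, not only picked peaks
        r = pearson([s[j] for s in spectra], mask)
        if r >= min_corr:
            hits.append((j, r))
    return sorted(hits, key=lambda hit: -hit[1])


# Toy data: 4 pixels, 3 m/z-bins, segmentation labels per pixel.
spectra = [[5.0, 0.0, 1.0], [4.0, 2.0, 1.0], [0.0, 9.0, 1.0], [1.0, 8.0, 1.0]]
labels = [0, 0, 1, 1]
hits = colocalized_bins(spectra, labels, cluster=1)
```

Only the second bin is reported for cluster 1: the first bin is anti-correlated with the cluster mask and the third bin is constant over all pixels.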
In this section we consider how supervised classification can be used for biomarker discovery. Classification requires specifying at least two groups of spectra and aims at differentiating these groups. Let us consider the task of cancer biomarker discovery which involves comparison of tumor and control regions of a biopsy tissue. One can also compare several tumor sections versus several control sections, collected from one or several patients. A classification algorithm, the so-called classifier, considers two groups of spectra and undergoes training to be able to discriminate between them. If the training was successful, which can be confirmed by a high classification accuracy (also called the correct rate or the recognition rate) close to 100%, then one can apply the classifier to new spectra to determine their class (tumor or control), as in [44, 45]. However, in biomarker discovery studies one is interested not only in applying the classifier to new spectra, but also in interpreting the differences between the tumor and control groups of spectra which were found by the classifier, namely, in the tumor-discriminative m/z-values. Later on, molecular identities of these tumor-discriminative m/z-values can be established using MS-based proteomics methods.
Currently, classification of MALDI-imaging spectra for the search of biomarkers is an active area of research. Lemaire et al. used the StatView 5.0 software (SAS Institute, Cary, NC) with symbolic discriminant analysis and statistical tests for the search for a new ovary cancer biomarker. Groseclose et al. used the ClinProTools software (Bruker Daltonik) with the support vector machine algorithm to differentiate adenocarcinoma from squamous cell carcinoma. Cazares et al. used ClinProTools with the genetic algorithm and the SAS 9.1 statistical software (SAS Institute) to discriminate prostate cancer. Rauser et al. used the R statistical package (http://www.r-project.org) with the support vector machine and artificial neural network algorithms for classification of HER2 receptor status in breast cancer tissues.
However, in all above cited studies, the classification methods developed for conventional MALDI mass spectrometry were used, which do not take into account specifics of MALDI-imaging data. Classification methods for MALDI-imaging data are still to be developed. Here, we give several recommendations on the most important points to consider when applying classification to MALDI-imaging data.
Firstly, the compared groups are often imbalanced, that is, they have significantly different sizes. Classification of imbalanced data requires special classification and evaluation methods, otherwise the classification can be biased towards the larger group. This issue is well-studied, and advanced methods for its solution were proposed [49–51]. In our experience, the large number of spectra in MALDI-imaging normally allows one to compensate for a moderate imbalance (up to ten-fold) by simple decimation of the larger group. Namely, we consider only every k-th spectrum of the larger group, where k should be adjusted to balance the group sizes. However, for compensating a strong imbalance, advanced methods (e.g. sampling and cost-sensitive learning) are recommended, see [49–51].
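The decimation described above amounts to a single slicing operation. In this minimal sketch the function name and the rule for choosing k (the integer ratio of the group sizes) are our assumptions.

```python
def decimate(larger_group, smaller_size):
    """Keep only every k-th spectrum of the larger group, with k chosen so
    that the decimated group is roughly as large as the smaller group."""
    if smaller_size <= 0:
        raise ValueError("smaller_size must be positive")
    k = max(1, len(larger_group) // smaller_size)
    return larger_group[::k], k


# Toy example: 1000 control spectra (represented by their indices)
# versus 120 tumor spectra.
control = list(range(1000))
balanced_control, k = decimate(control, smaller_size=120)
```

With these sizes k equals 8 and 125 control spectra remain, which is close to the 120 tumor spectra. Taking every k-th spectrum rather than the first ones keeps the selected spectra spread over the whole annotated region.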
Secondly, although classification of conventional dried droplet MS data is evaluated by how close the classification accuracy is to 100%, one should not aim at achieving this theoretically highest possible accuracy in classification of MALDI-imaging spectra for the following reasons. MALDI-imaging spectra show significant heterogeneity for technical reasons (noise, tissue mixture at the available spatial resolution, ion diffusion). Moreover, one cannot expect the annotation of a tumor region to be of perfect quality because of manual mistakes and limited expert time. Additionally, the annotation does not go down to the cellular or subcellular level, where real differentiation between cells takes place. All this leads to classification accuracies lower than 100%. However, if a classifier produces a low accuracy (close to 50% for balanced groups), this indicates some problems and the provided discriminative m/z-values should be considered with caution. In our experience, good accuracy values are above 80%.
Thirdly, the discriminative m/z-values provided by the classification should always be visualized as m/z-images and examined manually to check whether their spatial patterns are relevant (e.g. co-localized with the tumor area). MALDI-imaging provides a unique way of evaluating the relevance of m/z-values by their spatial pattern, which should be done before starting the tedious identification of molecular identities of putative biomarkers.
A computational analysis of a MALDI-imaging dataset, either using unsupervised methods or using supervised classification, delivers a list of m/z-values of interest. In order to associate these m/z-values with their molecular identities, one needs to perform their identification, usually with MS-based proteomics methods. Before starting identification, one usually examines the provided m/z-values, comparing them with the m/z-values known in the field. If the list contains m/z-values related to each other in a known manner, this increases the confidence that they express biologically relevant information. For example, a few m/z-values separated by one unit can correspond to isotopes (in MALDI, ions usually have a charge of +1). Two m/z-values separated by 17 units can correspond to the same compound before and after the loss of ammonia. A difference of 18 units corresponds to the loss of water. A difference of 16 units corresponds to oxidation of methionine (or another amino acid side chain). Finally, m/z-values of interest undergo identification.
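Such a check of the m/z-list for known mass differences is easy to automate. The sketch below uses the nominal differences named above and assumes singly charged ions; the tolerance of 0.3 units, the dictionary of shifts and the function name are our illustrative assumptions.

```python
# Nominal mass differences from the text (singly charged ions assumed).
KNOWN_SHIFTS = {
    1: "isotope",
    16: "oxidation",
    17: "loss of ammonia",
    18: "loss of water",
}


def related_pairs(mz_values, tol=0.3):
    """List pairs of m/z-values whose difference matches a known shift."""
    values = sorted(mz_values)
    pairs = []
    for i, low in enumerate(values):
        for high in values[i + 1:]:
            for shift, label in KNOWN_SHIFTS.items():
                if abs((high - low) - shift) <= tol:
                    pairs.append((low, high, label))
    return pairs


mz_of_interest = [1500.0, 1501.0, 1482.0, 2000.0]
pairs = related_pairs(mz_of_interest)
```

For this list the function reports 1482.0 and 1500.0 as a possible water loss pair and 1500.0 and 1501.0 as possible isotopes, while 2000.0 remains unrelated. A matching difference is only a hint and does not replace identification.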
The second problem of visualization of m/z-images is the strong pixel-to-pixel variation which is inherent to MALDI-imaging technique. In , we analyzed this variation and showed that it has multiplicative nature with respect to the pixels intensity. That is, the higher the intensity in some spatial region, the stronger the noise in this region, which distorts the m/z-image and hampers visual evaluation of prominent features. In order to reduce this variability and suppress the noise, we proposed to apply image denoising to an m/z-image prior to visualization. Figure 7D illustrates application of advanced edge-preserving image denoising from .
In this section, we consider current challenges associated with the statistical analysis of MALDI-imaging data. We hope that this discussion will be of interest to bioinformaticians and statisticians fostering computational research in this area.
The commercially available software for MALDI-imaging delivered by mass spectrometry vendors is aimed at data acquisition and does not provide capabilities for statistical analysis yet. Bruker Daltonik (Bremen, Germany) delivers flexImaging (visualization) and, optionally, ClinProTools (multivariate analysis, PCA, classification) which however can be used for small datasets only. Thermo Scientific (Waltham, MA, USA) provides ImageQuest (visualization). Waters (Manchester, UK) provides HDI Software (visualization) which can be coupled with MassLynx (peak picking) and MarkerLynx (PCA, orthogonal projection least squares), although no publications involving MarkerLynx are known yet. Shimadzu (Nakagyo-ku, Kyoto, Japan) provides Intensity Mapping (visualization, export). In addition to vendor-provided software, Novartis (Basel, Switzerland) provides the BioMap software which can be used for visualization and calculating basic statistics of the full dataset or of regions of interest. AB Sciex (Foster City, CA, USA) provides TissueView which is based on the BioMap software. Currently, in-house developments are necessary and Matlab is probably the most popular development and computing environment in the MALDI-imaging field.
Two general considerations proved to be important in our practice when developing methods for processing MALDI-imaging data. Firstly, a MALDI-imaging dataset is large, which requires computational methods to be runtime and memory efficient. A typical dataset comprises 5,000-50,000 spectra, each having 10,000-100,000 intensity values. Datasets generated using upcoming high spatial resolution and high mass resolution MALDI-imaging techniques (e.g. MALDI-FT-ICR-imaging) or using 3D MALDI-imaging are several-fold larger. At the same time, the first examination of acquired data is usually done on a workstation attached to the mass spectrometer. Processing single datasets on the same workstation is desirable, which imposes additional constraints regarding memory demands and computational costs. Ideally, the processing time should not exceed the acquisition time, which is a few hours for a typical MALDI-imaging dataset. Secondly, MALDI-imaging data suffers from strong pixel-to-pixel variation, which can be significantly suppressed by using methods respecting spatial relations between pixels. As demonstrated by us, performing image denoising prior to clustering [30, 41] or considering each spectrum together with its spatial neighbors leads to smoother and more detailed results. The advantage of respecting spatial relations between spectra was demonstrated for other problems as well.
Statistical modelling of pixel-to-pixel variability could help in developing processing methods. However, this, as well as modelling of other statistical effects in MALDI-imaging data (noise, baseline generation, variability in the shape of a peak), is a scarcely studied field. Although a physical model of the time-of-flight distribution for MALDI-TOF mass spectrometry was proposed as early as 2005, little progress has been seen since then. The problem of statistical modelling for MALDI-imaging data has been addressed only marginally. Successful modelling of this data would provide a way of evaluating computational methods by using simulated data. Additionally, statistical modelling can be used for development of computational methods taking into account the statistical models, e.g. model-based classification methods or statistical image processing, as was illustrated for SIMS data processing.
Quality assurance for MALDI-imaging data is not developed yet. There exist no standard operating procedures for estimating the quality of a full dataset or of single spectra. We have recently proposed a visualization method for a quick quality check, but there is a lot to be done in this area. Automatic quality evaluation of single spectra of a MALDI-imaging dataset is of special importance, since, due to the biochemical complexity of a sample and various poorly studied effects of matrix allocation and MALDI ionization, some spectra show artificial patterns leading to hotspots and distorting computational analysis. Such artificial spectra could be detected and removed by outlier detection methods developed specifically for MALDI-imaging.
When preparing a training set of spectra in a MALDI-imaging biomarker discovery study, the annotation is normally done by a visual examination of a sample and by a manual annotation of regions representing different classes (e.g. tumor and control). However, due to the rough character of this annotation, and due to inherent chemical complexity on the scale resolved by MALDI-imaging, the annotation can be incorrect for a significant portion of spectra. For instance, some pixels in the region annotated as a control one, can contain tumor cells. In statistical learning, this effect is referred to as classification noise or noise in labels . When classifying spectra of a MALDI-imaging dataset, classification methods tolerating classification noise or, in general, methods with high generalizability should be considered.
Combining MALDI-imaging with microscopy images of stained tissue, as used in immunohistochemistry, can improve MALDI-imaging data analysis. This approach is of special importance because the spatial resolution of MALDI-imaging is lower than that of microscopy and the pixel-to-pixel variability is significantly stronger. Implementation of this approach requires special co-registration methods.
The author thanks Michael Becker (Bruker Daltonik GmbH, Bremen, Germany) for providing the rat brain MALDI-imaging dataset, Jeramie Watrous (University of California San Diego, La Jolla, USA) for his comments on the manuscript, and the anonymous reviewers and the editor for their valuable remarks and suggestions.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 16, 2012: Statistical mass spectrometry-based proteomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S16.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.