Open Access

MPAgenomics: an R package for multi-patient analysis of genomic markers

  • Quentin Grimonprez1Email author,
  • Alain Celisse1, 2,
  • Samuel Blanck1,
  • Meyling Cheok3,
  • Martin Figeac4 and
  • Guillemette Marot1, 5
BMC Bioinformatics201415:394

https://doi.org/10.1186/s12859-014-0394-y

Received: 23 July 2014

Accepted: 19 November 2014

Published: 14 December 2014

Abstract

Background

Last generations of Single Nucleotide Polymorphism (SNP) arrays allow to study copy-number variations in addition to genotyping measures.

Results

MPAgenomics, standing for multi-patient analysis (MPA) of genomic markers, is an R-package devoted to: (i) efficient segmentation and (i i) selection of genomic markers from multi-patient copy number and SNP data profiles. It provides wrappers from commonly used packages to streamline their repeated (sometimes difficult) manipulation, offering an easy-to-use pipeline for beginners in R.

The segmentation of successive multiple profiles (finding losses and gains) is performed with an automatic choice of parameters involved in the wrapped packages. Considering multiple profiles in the same time, MPAgenomics wraps efficient penalized regression methods to select relevant markers associated with a given outcome.

Conclusions

MPAgenomics provides an easy tool to analyze data from SNP arrays in R. The R-package MPAgenomics is available on CRAN.

Keywords

SNP arrays Segmentation of genomic data Marker selection Multi-patient analysis R package

Background

Genome-wide Single Nucleotide Polymorphism (SNP) arrays have been widely used over the past few years [1]. First generations were measuring only genetic variations of Single Nucleotide Polymorphisms, which are single base pair mutations at specific loci. Last generations (e.g. SNP5.0, SNP6.0) also include non-polymorphic probes in order to study copy-number variations along the genome in addition to genotyping measures. These arrays are especially used to study the impact of diseases, e.g. cancer, on the human genome.

Analyzing data from genome-wide SNP arrays within R requires several packages, e.g. aroma for normalization of Affymetrix®; SNP arrays [2],[3], changepoint or cghseg for segmentation of copy number profiles [4], cghcall for labelling segments [5], and glmnet for penalized regressions [6]. Each package performs a specific task along the whole analysis but none of them is related to the others. Output formats of given packages are often not compatible with input formats required by the other, making their use tricky for beginners in R. One main contribution of the MPAgenomics R package is to aggregate these commonly used packages, providing wrappers to inter-relate them automatically.

At each step of the analysis a large amount of packages are available to perform normalization, segmentation or marker selection. A careful choice of only a few methods is required to provide an easy-to-use and efficient tool.

In this software article, we describe two different pipelines implemented in the R package MPAgenomics. Both of them perform the whole analysis from raw data to normalization, and then either successive segmented profiles, or a list of genomic markers selected from all available profiles.

Implementation

MPAgenomics is implemented in R [7]. The package is divided in four main parts: data normalization, segmentation, calling and marker selection. Each part depends on different packages. MPAgenomics provides wrappers for some functions of these packages and facilitates the interaction between outputs and inputs of different functions. It remedies some problems with the wrapped packages such as confusing parameter names.

Data normalization

The normalization process in MPAgenomics contains technical biases correction and copy number and allele B fraction estimation. Following [8], allele B fraction refers to the proportion of the total signal coming from allele B. Normalization methods are available for Affymetrix®; arrays (10K, 100K, 500K, GenomeWideSNP 5 & 6, and CytoScanHD). The estimation of the total copy number and allele B fraction is made by CRMAv2 [9] originally implemented in the aroma packages. For studies with matched normal-tumor samples, a better estimation is suggested and implemented for the allele B fraction of the tumoral sample with the TumorBoost method [8].

The use of aroma packages is difficult for neophytes due to the strict folder architecture it requires and the documentation of the project which is mainly dedicated to experts able to criticize each method proposed and understand details of each procedure.

MPAgenomics provides documentation with a detailed example explaining how to quickly analyze data. The tutorial can be accessed in R by running the following commands:

library(MPAgenomics)

vignette(“MPAgenomics”)

More details on each step or wrapper are given to help advanced users to run each function separately.

Several features in the original aroma packages create new folders and files within the architecture. Matching files from different processes associated with a given sample can be tricky for neophytes. MPAgenomics implements a wrapper to build the folder architecture, check filenames automatically, process CRMAv2 and TumorBoost normalization steps. Miscellaneous functions are also provided to ease some actions like signal extraction. Furthermore, different graphs such as the copy number profile can be saved in the working directory for further visualization.

The following steps (segmentation, calling and/or selection of genomic markers) are available in two settings. One is aroma-based and exploits the folder architecture and the files generated along the process. The second does not depend on aroma and allows advanced users to use their own normalized data.

Segmentation

Although the use of manual annotations provides the best segmentation results [10], it appears essential for multi-patient analysis to avoid relying on them since they are time-consuming.

Therefore, following simulation results of [10], MPAgenomics wraps the CGHSEG [11],[12] and PELT (Pruned Exact Linear Time) [13] segmentation methods which appeared to be those with the best overall performance.

PELT and CGHSEG methods fit a Gaussian maximum likelihood model but they differ in the way they choose the number of segments. CGHSEG requires the maximal number of segments as input. In MPAgenomics, the optimal number of segments is chosen according to a penalty C × K × 2.5 + log P K with a profile of length P, K the number of segments and C>0 a parameter to choose [14]. This choice is performed using slope heuristics [15]. The PELT method returns a segmentation with a number of segments automatically chosen by the algorithm according to a penalty K ρ log(P) with ρ>0 a parameter to choose. The choice of the penalty parameter has been raised in [4]. MPAgenomics suggests an automatic sample-specific choice of ρ chromosome by chromosome (see package vignette for details on the method). In MPAgenomics, the two methods, CGHSEG with the slope heuristic and PELT with our calibration method, are proposed. By default, CGHSEG is used because it is quicker than PELT due to the multiple execution required by the ρ calibration method we propose.

The implemented segmentation methods are independently available for both copy number and allele B fraction profiles. In the case of allele B fraction segmentation, only heterozygous SNPs are kept. First, a naive genotype call [8] is performed on each normal sample in order to separate heterozygous SNPs from homozygous SNPs. Naive genotyping method assumes SNPs are bi-allelic and therefore is not recommended for tumor samples. Thus allele B fraction segmentation in MPAgenomics requires matched normal-tumor pairs. Then, following [16], the resulting signal is centered on 0.5 and symmetrized, which makes it similar to the usual copy number.

Calling

From each segmented profile, the CGHcall method [17] is run to label every copy-number segment in terms of loss, normal, and gain.

CGHcall depends on a parameter, named cellularity, corresponding to the contamination of a sample with healthy cells. In MPAgenomics, this parameter can be modified by users, by default its value is 1 meaning that tumor samples are pure.

In the aroma-dependent function, segmentation and calling are performed with the same wrapper. The calling is run for each profile separately. Results are saved in text format in the working folder architecture.

Selection of genomic markers

The goal is to select genomic markers (e.g. SNPs or CNV) associated with a given response from all patient profiles simultaneously. There is no need to perform segmentation and calling before the multi-patient analysis, marker selection is made over all copy-number profiles. However, segmentation can be performed before marker selection if wanted, in order to reduce the noise and the dimensionality of the problem.

Assuming I individuals and P potential markers, then for each individual i, y i denotes the response and x i,p the corresponding normalized value of copy number or allele B fraction signal at genomic position p.

Due to the huge number of markers (P>>I), MPAgenomics uses by default the lasso [18] regularization method to select very few ones. This method offers two advantages: (i) it selects only few variables, easing the interpretability of results, (i i) there exist some algorithms such as the lars [19] to solve quickly the lasso problem and support high-dimensional data.

The lasso regularization method consists in minimizing g λ : β P g λ ( β ) , where
g λ ( β ) = i = 1 I y i ( ) i 2 + λ p = 1 P | β p | ,

with ( ) i = p x i , p β p and λ>0 controlling the number of non-zero coordinates of β. After minimization, non-zero coefficients β p correspond to influential positions to predict the response.

MPAgenomics genomic marker selection drastically improves currently available packages in terms of computation time. With the linear regression model, it efficiently provides the exact solution by using the new R package HDPenReg, which is an optimized implementation of the lars algorithm [19] specially dedicated to a huge number of markers.

Since the theoretical grounding of Lasso when P>>I relies on a theoretical condition (see [20]) that cannot be easily checked in practice, the spike and slab algorithm [21],[22] – a three steps algorithm performing filtering, estimation and variable selection – is also provided in MPAgenomics as an alternative.

Logistic regression is also available for binary responses. In this case, MPAgenomics wraps the glmnet package [6] in the whole process. Unlike HDPenReg it does not provide the exact solution but is computationally very efficient. With glmnet and HDPenReg, the regularization parameter λ is chosen by k-fold cross-validation [23]. The selected variables are the most relevant ones regarding the response.

Discussion

MPAgenomics is mainly dedicated to beginners in SNP array analyses. It solves problems commonly encountered by neophytes such as interaction between different packages or specialized documentation dedicated to experts in the field. In addition, MPAgenomics suggests careful and automatic choices of crucial parameters at each part of the analysis.

To achieve simplicity of usage, MPAgenomics does not propose all options implemented in the wrapped packages, especially for normalization. However, outputs are generated in such a way that interaction between wrapped packages and MPAgenomics is facilitated. For example, the strict directory structure of aroma packages is built by MPAgenomics. Therefore, advanced users may directly use specific options of aroma to enhance their analysis without renormalizing data from scratch.

As specified in the data normalization section, segmentation, calling and marker selection steps can be performed without the use of aroma. This allows users to provide their own normalized data into matrices. This is useful for non-Affymetrix®; SNP arrays, CGH arrays or high-throughput sequencing data. For the latter, count data might need a variance-stabilizing transformation into Gaussian data before using current segmentation, calling and marker selection. For example, the Anscombe transform [24] can be used in addition to appropriate normalization specific to the used technology (target sequencing, whole-genome sequencing).

Currently, copy number and allele B fraction are segmented independently from each other. Research is ongoing to propose joint segmentation methods allowing to detect uniparental disomies, fragments which present a normal copy number but a loss of heterozygosity in the corresponding allele B fraction.

Conclusions

MPAgenomics provides user-friendly wrappers for normalization and multi-patient analysis of high-throughput genomic data. It offers a guideline for beginners in copy-number variation analysis focusing on proven methods for their effectiveness. MPAgenomics also provides automatic choices of crucial parameters for segmentation and selection of markers.

Even though normalization is provided for Affymetrix®; arrays, other steps (segmentation, calling, and marker selection) can be applied to normalized data from other DNA arrays and next-generation sequencing data.

Availability and requirements

Project name: MPAgenomicsProject home page: http://cran.at.r-project.org/package=MPAgenomicsOperating system(s): Platform independent Programming language: ROther requirements: none License: GNU GPL (>=2) Any restrictions to use by non-academics: None

Declarations

Acknowledgements

We thank Serge Iovleff for his help implementing HDPenReg. We also thank Claude Preudhomme and Olivier Nibourel for providing data presented in the vignette of the package, and for their helpful clinical competences to interpret the results. The development of this package was funded by the Inria Technological Development Action (ADT) named MPAGenomics.

Authors’ Affiliations

(1)
MODAL team, Inria Lille-Nord Europe
(2)
Laboratoire Paul Painlevé, Université Lille 1
(3)
Inserm, U837, Team 3, Cancer Research Institute of Lille
(4)
Plate-forme de génomique fonctionnelle et structurale, IFR-114, Université Lille 2
(5)
EA 2694, Université Lille 2

References

  1. LaFramboise T: Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances. Nucleic Acids Res. 2009, 37 (13): 4181-4193. 10.1093/nar/gkp552.View ArticlePubMed CentralPubMedGoogle Scholar
  2. Bengtsson H, Simpson K, Bullard J, Hansen K: aroma.affymetrix: A generic framework in R for analyzing small to very large Affymetrix data sets in bounded memory. Technical Report 745, Department of Statistics, University of California, Berkeley; 2008.Google Scholar
  3. Bengtsson H, Bullard J, Hansen K, Neuvial P, Purdomand E, Robinson M, Simpson K: Aroma project2010. [http://www.aroma-project.org/]
  4. Killick R, Eckley I: Changepoint: An R Package for Changepoint Analysis2013. R package version 1.1, [http://www.lancs.ac.uk/~killick/Pub/KillickEckley2011.pdf]
  5. van de Wiel M, Vosse S: CGHcall: Calling Aberrations for Array CGH Tumor Profiles2012. R package version 2.20.0 [http://www.bioconductor.org/packages/release/bioc/vignettes/CGHcall/inst/doc/CGHcall.pdf]
  6. Friedman JH, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010, 33 (1): 1-22.View ArticlePubMed CentralPubMedGoogle Scholar
  7. R: A Language and Environment for Statistical Computing. 2014, R Foundation for Statistical Computing, Vienna, AustriaGoogle Scholar
  8. Bengtsson H, Neuvial P, Speed TP: Tumorboost: Normalization of allele-specific tumor copy numbers from a single pair of tumor-normal genotyping microarrays. BMC Bioinformatics2010, 11:245.Google Scholar
  9. Bengtsson H, Wirapati P, Speed TP: A single-array preprocessing method for estimating full-resolution raw copys from all Affymetrix genotyping arrays including GenomeWideSNP 5 & 6. Bioinformatics. 2009, 25 (17): 2149-2156. 10.1093/bioinformatics/btp371.View ArticlePubMed CentralPubMedGoogle Scholar
  10. Hocking T, Schleiermacher G, Janoueix-Lerosey I, Boeva V, Cappo J, Delattre O, Bach F, Vert J-P: Learning smoothing models of copy number profiles using breakpoint annotations. BMC Bioinformatics2013, 14(1):164.Google Scholar
  11. Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics2005, 6(1):27.Google Scholar
  12. Rigaill G: Pruned dynamic programming for optimal multiple change-point detection. arXiv e-print, 2010, arXiv/1004.0887.Google Scholar
  13. Killick R, Fearnhead P, Eckley IA: Optimal detection of changepoints with a linear computational cost. J Am Stat Assoc. 2012, 107 (500): 1590-1598. 10.1080/01621459.2012.737745.View ArticleGoogle Scholar
  14. Lebarbier E: Detecting multiple change-points in the mean of gaussian process by model selection. Signal Process. 2005, 85 (4): 717-736. 10.1016/j.sigpro.2004.11.012.View ArticleGoogle Scholar
  15. Birgé L, Massart P: Minimal penalties for gaussian model selection. Probability Theory Related Fields. 2007, 138 (1-2): 33-73. 10.1007/s00440-006-0011-8.View ArticleGoogle Scholar
  16. Staaf J, Lindgren D, Vallon-Christersson J, Isaksson A, Göransson H, Juliusson G, Rosenquist R, Höglund M, Borg Å, Ringnér M: Segmentation-based detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome SNP arrays. Genome Biol2008, 9(9):R136.Google Scholar
  17. van de Wiel MA, Kim KI, Vosse SJ, van Wieringen WN, Wilting SM, Ylstra B: CGHcall: calling aberrations for array CGH tumor profiles. Bioinformatics. 2007, 23 (7): 892-894. 10.1093/bioinformatics/btm030.View ArticlePubMedGoogle Scholar
  18. Tibshirani R: Regression shrinkage and selection via the lasso. J R Stat Soci Series B. 1994, 58: 267-288.Google Scholar
  19. Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Ann Stat. 2004, 32: 407-499. 10.1214/009053604000000067.View ArticleGoogle Scholar
  20. Ravikumar P, Wainwright M. J, Raskutti G, Yu B: High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electron J Stat. 2011, 5: 935-980. 10.1214/11-EJS631.View ArticleGoogle Scholar
  21. Ishwaran H, Rao JS: Spike and slab variable selection: frequentist and bayesian strategies. Ann Stat. 2005, 33 (2): 730-773. 10.1214/009053604000001147.View ArticleGoogle Scholar
  22. Ishwaran H, Rao JS: Generalized ridge regression: geometry and computational solutions when p is larger than n. Technical Report, Department of Public Health Sciences Division of Biostatistics, University of Miami, Miller School of Medecine; 2010.Google Scholar
  23. Arlot S, Celisse A: A survey of cross-validation procedures for model selection. Stat Surv. 2010, 4: 40-79. 10.1214/09-SS054.View ArticleGoogle Scholar
  24. Anscombe FJ: The transformation of poisson, binomial, and negative-binomial data. Biometrika. 1948, 35 (3/4): 246-254. 10.2307/2332343.View ArticleGoogle Scholar

Copyright

© Grimonprez et al.; licensee BioMed Central Ltd. 2014

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.