- Open Access
PAPAyA: a platform for breast cancer biomarker signature discovery, evaluation and assessment
© Janevski et al; licensee BioMed Central Ltd. 2009
Published: 17 September 2009
The decision environment for cancer care is becoming increasingly complex due to the discovery and development of novel genomic tests that offer information regarding therapy response, prognosis and monitoring, in addition to traditional histopathology. There is, therefore, a need for translational clinical tools based on molecular bioinformatics, particularly in current cancer care, that can acquire and analyze data, and interpret and present information from multiple diagnostic modalities to help the clinician make effective decisions.
We present a platform for molecular signature discovery and clinical decision support that relies on genomic and epigenomic measurement modalities as well as clinical parameters such as histopathological results and survival information. Our Physician Accessible Preclinical Analytics Application (PAPAyA) integrates a powerful set of statistical and machine learning tools that leverage the connections among the different modalities. It is easily extendable and reconfigurable to support integration of existing research methods and tools into powerful data analysis and interpretation pipelines. A current configuration of PAPAyA, with examples of its performance on breast cancer molecular profiles, is used to present the platform in action.
PAPAyA enables analysis of data from (pre)clinical studies and formulation of new clinical hypotheses, and it facilitates clinical decision support by abstracting molecular profiles for clinicians.
Advances in molecular bioinformatics research are generating an overwhelming amount of information. Clinicians acknowledge that there is a need to accelerate the translation of knowledge discovery from genome-scale studies to effective treatment and tailored cancer management. Commercially available tools such as GeneSpring or open source tools can process and visualize genomics data for preclinical applications. On the clinical side, a number of clinical decision support tools exist that incorporate clinical guidelines, assist clinicians in diagnostics, or intelligently interpret clinical data to give insight into the underlying trends. However, there is unique clinical value to be added by providing the clinician with an integrated view of the patient's molecular profile and of where the patient stands compared to patients with similar clinical parameters and history. For the latest genomic tests that have entered the clinical guidelines, there is a need for clinician-driven analysis with patient-centric data and informatics-assisted discovery in an easily configurable environment that can be quickly tuned to new clinical questions.
There is a dearth of tools focused on the clinical use scenario that can meaningfully integrate information from multiple molecular modalities such as genomic (copy number variation), transcriptomic (gene expression) and epigenetic (differential methylation) data and that can provide a clinically relevant, comprehensive picture of the molecular state of a sample. Systems exist that acknowledge this issue and integrate various molecular data; however, such work is primarily driven by knowledge discovery rather than by clinical use.
In this paper we introduce the Physician Accessible Preclinical Analytics Application (PAPAyA), a platform for clinical decision making that relies on multiple information modalities: gene expression and differential DNA methylation as well as clinical parameters such as histopathological results and survival information. We have assembled a powerful set of statistical and machine learning tools that leverage the connections among the different modalities and present a clinically meaningful portrait of the individual sample.
Breast cancer and genomic profiling
Breast cancer is a complex disease driven by the accumulation of multiple molecular alterations. Recent molecular advances in high-throughput genomic, transcriptomic and epigenomic technologies have made it possible to focus on the molecular complexity of breast cancer and help guide cancer prognostication and therapy prediction.
Perou et al. demonstrated that breast cancers can be classified into distinct groups based on their gene expression profiles. The Estrogen Receptor positive (ER+) group is characterized by higher expression of a panel of genes that are typically expressed by breast luminal epithelial cells ('luminal' cancer). The Estrogen Receptor negative (ER-) branch covers three subgroups of tumors: 1) those overexpressing ERBB2 (HER2); 2) those expressing genes characteristic of breast basal cells (basal-like cancer); and 3) normal-like samples. The clinical relevance of this stratification is that ER+ tumors are typically associated with good prognosis, whereas basal-like and HER2 tumors have poor prognosis. Further refinement in molecular classification, however, can result in differing clinical significance.
Gene expression profiling has also led to the development of several gene-expression assays, of which Oncotype DX and MammaPrint are gaining acceptance in routine clinical use. Oncotype DX analyzes the expression of 21 genes and calculates a recurrence score that indicates the likelihood of cancer recurrence in patients and their likely benefit from chemotherapy. MammaPrint analyzes the expression of 70 genes and allows patients (<61 years) with early-stage breast cancer to be categorized according to their risk of distant metastasis. High-risk patients may then be managed with more aggressive therapy, while low-risk patients can be spared from toxic chemotherapy.
Recent advances in molecular profiling technologies have led to the application of more than one genomic modality to address similar clinical questions. For example, gene expression profiling can be enhanced with the detection of certain genomic copy number variations, amplifications and deletions using Representational Oligonucleotide Microarray Analysis (ROMA) and correlated with patient survival. We have also co-developed Methylation Oligonucleotide Microarray Analysis (MOMA) in collaboration with Cold Spring Harbor Laboratory to perform genome-wide scans of CpG island methylation in normal and tumor samples.
Such growth in genomic profiling strategies leading to additional in-vitro diagnostic multivariate index assays will result in a plethora of tailored genomic tests that cater to specific clinically prevalent sub-populations. Appropriately integrating the information provided by multiple genomic profiling strategies can help reduce the resulting complexity in decision-making and offer unique patient-specific insights to the clinician. The main objective in designing the PAPAyA platform was to provide a highly flexible translational platform where we can explore integration of patient clinical data and multiple high-throughput molecular measurements. The aims were to design and implement a platform that can a) easily be used to prototype new ideas for clinical studies support, and b) introduce new clinical tools for data analysis contexts relevant to clinical practice based on molecular measurements.
PAPAyA is designed to provide a common interface to multiple data sources and to the tools that use the data, and to facilitate a flow of data analysis and assessment of results. PAPAyA stores and provides access to both clinical and molecular data from clinical studies. The user interface enables access to and analysis of data from multiple samples in a discovery flow, and a clinical interpretation flow on a single sample (a patient). The core data access and presentation functionality is built into the platform, whereas the available data transformation steps are dynamically defined through plug-in tools depending on the data type and use flow.
The design supports easy registration and execution of applications written in R, Matlab, Python, and Perl as well as binary executables. In principle, this support can readily be extended to additional execution platforms. Applications are registered as tools annotated for their use contexts, which enables the definition of numerous analysis pipelines, each capturing a sequence of processing steps. Tools can be developed for a specific modality (e.g. copy number variation, gene expression, or methylation) or for a particular clinical study. Here, a context is a collection of tools that are allowed to be invoked at a certain step in the workflow.
A salient feature of PAPAyA is that not only data from the studies but also application behavior and tool definitions are stored and handled by the comprehensive Database Management System. Given the intentionally loose coupling of each tool with the platform, it is typical to assume that each tool comes with its own data (External Data). However, many of the tools utilize the data stored in PAPAyA's database.
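The registration idea described above can be sketched as follows. This is a minimal illustration only: the class names `Tool` and `ToolRegistry`, the field names, and the example tool and context names are assumptions made for this sketch and are not taken from PAPAyA's actual database schema or C# code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool record; all field names are assumptions."""
    name: str
    platform: str        # e.g. "R", "Matlab", "Python", "Perl", "binary"
    contexts: frozenset  # use contexts in which the tool may be invoked


@dataclass
class ToolRegistry:
    """Hypothetical in-memory registry standing in for the tool database."""
    tools: dict = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        # Registering under an existing name replaces the earlier version.
        self.tools[tool.name] = tool

    def available_in(self, context: str) -> list:
        # A context is the collection of tools allowed at a workflow step.
        return sorted(t.name for t in self.tools.values() if context in t.contexts)


registry = ToolRegistry()
registry.register(
    Tool("hierarchical_clustering", "R", frozenset({"expression", "methylation"}))
)
registry.register(Tool("feature_filtering", "Python", frozenset({"expression"})))
print(registry.available_in("expression"))
```

Keeping the context annotation on the tool record, rather than hard-coding tool lists per screen, is one way to let a new tool appear in the relevant workflow steps without changing the platform itself.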
PAPAyA building blocks
The behavior of PAPAyA is defined by the user through a state diagram where each state can have multiple contexts. A transition from one context to another is defined a priori and is triggered by a user or tool action. PAPAyA allows for fine-tuning of context descriptions with user-defined constraint variables, which are set and unset depending on the user's actions and the selected element type, such as sample ID, measurement modality, or microarray probe. The visual representation of a context consists of two components: a display of the relevant parameters of the selected sample, measurements, measurement feature, signature, etc., and dynamic access to the tools available in the current context.
Each tool is an application or a routine that is defined by its execution platform (e.g. R, Matlab), output type (e.g. graphics or text), and parameters, for which the GUI presents the user with a dialog to fill in. The input parameters can be pre-filled from the tool definition, but also from the current context of execution (e.g. the current patient ID or sample ID). Integrating a tool into PAPAyA is straightforward: most software modules will likely require no modification apart from formatting the output to comply with the visual elements of the user interface. This also allows an improved version of a tool to be added by simply replacing the current tool with the updated version, without reconfiguring PAPAyA.
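The parameter pre-filling described above might look like the following minimal sketch. The function name `resolve_parameters`, the parameter names, and the precedence order (user input, then execution context, then tool defaults) are assumptions for illustration; the text only states that values can come from the tool definition and from the current context of execution.

```python
def resolve_parameters(declared, defaults, context, user_input):
    """Fill each declared parameter from user input, context, or tool defaults.

    The precedence order used here is an assumption, not PAPAyA's documented
    behavior. Parameters that no source provides are returned as missing.
    """
    resolved, missing = {}, []
    for name in declared:
        for source in (user_input, context, defaults):
            if name in source:
                resolved[name] = source[name]
                break
        else:
            # No source supplied a value: a GUI dialog would ask the user.
            missing.append(name)
    return resolved, missing


# Hypothetical tool definition and execution context.
declared = ["sample_id", "modality", "n_clusters"]
defaults = {"n_clusters": 3}
context = {"sample_id": "S-001", "modality": "methylation"}

resolved, missing = resolve_parameters(declared, defaults, context, user_input={})
print(resolved, missing)
```

Returning the unresolved names separately keeps the dialog logic outside the resolver: only the parameters in `missing` need to be requested from the user.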
Elements of a context definition:
- Tool execution; internal transition; ...
- Tool name; initialization; internal actions
- User-defined variables that can be set to define constraints, for example methylation modality active vs. expression modality active, or analysis mode vs. decision support mode
- Constraints to set with this transition
- Constraints to unset with this transition
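A context transition that sets and unsets constraints, mirroring the elements listed above, can be sketched as a small state machine. Everything named here (`Transition`, `step`, the `requires` guard, and the context, action, and constraint names) is hypothetical and chosen for illustration; the actual definitions live in PAPAyA's database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """Hypothetical transition record; field names are assumptions."""
    source: str                        # context the transition starts from
    action: str                        # user or tool action that triggers it
    target: str                        # context reached after the transition
    requires: frozenset = frozenset()  # constraints that must be active (assumed guard)
    sets: frozenset = frozenset()      # constraints to set with this transition
    unsets: frozenset = frozenset()    # constraints to unset with this transition


def step(context, constraints, action, transitions):
    """Return the next (context, constraints); unchanged if nothing applies."""
    for t in transitions:
        if t.source == context and t.action == action and t.requires <= constraints:
            return t.target, (constraints | t.sets) - t.unsets
    return context, constraints


TRANSITIONS = [
    Transition("sample_view", "select_methylation", "measurement_view",
               sets=frozenset({"methylation_active"}),
               unsets=frozenset({"expression_active"})),
    Transition("measurement_view", "open_decision_support", "decision_view",
               requires=frozenset({"methylation_active"}),
               sets=frozenset({"decision_support_mode"}),
               unsets=frozenset({"analysis_mode"})),
]

state, active = step("sample_view",
                     frozenset({"analysis_mode", "expression_active"}),
                     "select_methylation", TRANSITIONS)
print(state, sorted(active))
```

Because the transitions are plain data, they could be stored and edited in a database table rather than compiled into the application, which is consistent with the configurability the text describes.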
PAPAyA is implemented in C#. The GUI-intensive parts of PAPAyA enable navigation through clinical studies, in which the central element is the patient sample. For each sample, all available measurements are navigable and linked to further characterizations available in the context of the sample and the modality. In PAPAyA, all characterization is centered on molecular signatures, as described in more detail in the Results section. Furthermore, PAPAyA provides decision support views of the current patient sample and measurements that transform the results for use in a clinical setting.
The focus of PAPAyA is on discovery of molecular signatures for clinical stratification of patient samples and their utilization in a clinical setting. The data browsing and signature discovery pipeline comprises statistical and machine learning algorithms (supervised and unsupervised) and operates on multiple modalities of high-throughput measurements in a high-performance computing environment. The output of these algorithms consists of molecular signatures addressing specific clinical questions (benign vs. malignant, tumor subtype, relapse-free survival, etc.) that are based on individual modalities or on combinations of modalities (DNA copy number, DNA methylation, and gene expression). The signatures are typically evaluated for performance characteristics (sensitivity and specificity) using statistical approaches. Clinical researchers can benefit from an integrated system that enables them to evaluate and explore these signatures in a user-friendly environment, and to characterize the signatures and the likely scenarios in which they can be integrated into an oncologist's or pathologist's clinical practice.
PAPAyA facilitates these tasks by applying multivariate statistical analysis and data mining algorithms across modalities in an integrated fashion. The clinical trial database is accessible to bioinformatics tools such as feature filtering, hierarchical clustering, multimodal feature correlation, top-down hierarchical sorting, a methyl binding sites tool, and computationally intensive search approaches such as our CHC Genetic Algorithm (GA) coupled with Support Vector Machines. These tools are used for discovery of univariate and multivariate prognostic or predictive signatures and clinically relevant disease subtypes using each of the modalities independently and in combination.
PAPAyA allows patient-centric analysis and informatics-assisted discovery to be performed systematically in a pipeline that is fine-tuned to assist in answering specific clinical questions. The analysis can be tailored to be patient-centric or signature-centric, thereby allowing for discovery or clinical decision support, respectively. By way of introduction to our platform, we present several use scenarios of integrated analysis of multi-modality retrospective breast cancer data [2, 5, 8].
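The performance characteristics mentioned above follow the standard definitions: sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). The sketch below computes both for a binary signature. The function name and the example labels are illustrative assumptions and do not come from the studies analyzed with PAPAyA.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    A rate is NaN when its denominator is zero.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Made-up labels: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

In practice such estimates would be computed on held-out samples or under cross-validation, so that the reported performance is not inflated by evaluating a signature on the data used to discover it.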
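As one example of the tools listed above, multimodal feature correlation can be sketched as a per-feature Pearson correlation between two modalities measured on the same samples. This is an illustration under stated assumptions and not PAPAyA's implementation: the choice of Pearson correlation, the use of a shared gene key to match features, and the gene names and values are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if one is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def correlate_modalities(methylation, expression):
    """Correlate the features present in both modalities across the same samples."""
    shared = sorted(set(methylation) & set(expression))
    return {gene: pearson(methylation[gene], expression[gene]) for gene in shared}


# Made-up values for four samples, listed in the same sample order per modality.
methylation = {"GENE_A": [0.1, 0.4, 0.7, 0.9], "GENE_B": [0.5, 0.5, 0.6, 0.4]}
expression = {"GENE_A": [9.0, 7.0, 4.0, 2.0], "GENE_C": [5.0, 5.5, 6.0, 4.5]}

result = correlate_modalities(methylation, expression)
print(result)
```

A strongly negative coefficient for a gene, as in this made-up `GENE_A` example, would be the pattern expected when promoter methylation is associated with reduced expression.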
Discovery using PAPAyA
Thus the PAPAyA user interface provides the user with the applicable tools based on the selected modality and stage in the discovery process. Another important aspect of this process is that the entire framework driving it is easily extensible with additional tools and data, while maintaining the capability to work across multiple modalities. Therefore, additional data collected for the same samples (e.g. microRNA profiling or sequencing/mutation information) can be easily added to the analysis by adding tools and contexts to the flow to support visualization and analysis of such data.
Clinical decision support
The statistics and visualization we just described can be used by clinicians to gain additional insights and tailor the treatment to the physiological state of the patient. Tools that implement standard breast cancer clinical prognostic indices, such as the Nottingham Prognostic Index and the St. Gallen Consensus, can also be easily incorporated into PAPAyA. Additionally, integration of third-party molecular signatures into the platform is supported by the existing database structure.
We designed and implemented PAPAyA as a platform that can easily be used to integrate existing tools and facilitate prototyping new ideas for clinical studies support. It also provides new clinical tools around multiple molecular modalities, standard clinical parameters and contexts defined by clinical experts. The platform has a flexible architecture and can readily incorporate new modalities. This flexibility nonetheless introduces practical challenges. For example, quality control of new tools and tool updates is essential, as is ensuring appropriate combinations of tools to avoid deriving wrong conclusions (e.g. use protocol definitions).
To leverage the insights into the molecular state of clinical samples, there has to be clinically relevant linkage across modalities. We have started deep integration of DNA methylation and gene expression in PAPAyA; however, we still have to include tools that facilitate integration of the inherent dependencies between the molecular and the standard histopathological modalities. Finally, it would be extremely useful to integrate imaging data and to combine tumor morphology and texture with the molecular signatures for applications such as prognosis and prediction.
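As an illustration of how small such a prognostic-index tool can be, the sketch below computes the Nottingham Prognostic Index using its standard formula, NPI = 0.2 × tumor size (cm) + lymph node stage (1 to 3) + histological grade (1 to 3). The function names are hypothetical, and the group cut-offs (3.4 and 5.4) are commonly cited values assumed here for illustration; a clinical deployment would follow the applicable guideline.

```python
def nottingham_prognostic_index(size_cm, node_stage, grade):
    """NPI = 0.2 * tumor size in cm + lymph node stage (1-3) + grade (1-3)."""
    if size_cm <= 0:
        raise ValueError("tumor size must be positive")
    if node_stage not in (1, 2, 3) or grade not in (1, 2, 3):
        raise ValueError("node_stage and grade must each be 1, 2 or 3")
    return 0.2 * size_cm + node_stage + grade


def npi_group(npi):
    """Map an NPI value to a prognostic group using assumed cut-offs."""
    if npi <= 3.4:
        return "good"
    if npi <= 5.4:
        return "moderate"
    return "poor"


# Made-up example: 3.0 cm tumor, node stage 2, grade 3.
npi = nottingham_prognostic_index(size_cm=3.0, node_stage=2, grade=3)
print(round(npi, 1), npi_group(npi))
```

Registered as a tool, a routine like this could take its inputs from the clinical parameters of the currently selected patient rather than from manual entry.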
We thank Dr. James Hicks and Dr. Robert Lucito for valuable feedback on the application. We also thank Anca Bucur and Jasper van Leeuwen for fruitful discussions.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 9, 2009: Proceedings of the 2009 AMIA Summit on Translational Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S9.
- Dinu V, Zhao H, Miller PL: Integrating domain knowledge with statistical and data mining methods for high-density genomic SNP disease association analysis. J Biomed Inform 2007, 40: 750–760.
- Perou CM, Sørlie T, Eisen MB, Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lønning PE, Børresen-Dale AL, Brown PO, Botstein D: Molecular portraits of human breast tumours. Nature 2000, 406(6797):747–52.
- Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T, Hiller W, Fisher ER, Wickerham DL, Bryant J, Wolmark N: A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. New England Journal of Medicine 2004, 351(27):2817–2826.
- van 't Veer LJ, Dai H, Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415: 530–536.
- Hicks J, Krasnitz A, Lakshmi B, Navin NE, Riggs M, Leibu E, Esposito D, Alexander J, Troge J, Grubor V, Yoon S, Wigler M, Ye K, Børresen-Dale AL, Naume B, Schlicting E, Norton L, Hägerström T, Skoog L, Auer G, Månér S, Lundin P, Zetterberg A: Novel patterns of genome rearrangement and their association with survival in breast cancer. Genome Res 2006, 16(12):1465–79.
- Kamalakaran S, Kendall J, Zhao X, Tang C, Khan S, Kandasamy R, Auletta T, Riggs M, Wang Y, Helland A, Dimitrova N, Borresen-Dale A, Hicks J, Lucito R: Methodologies for the identification and analysis of DNA methylation in cancer. Nucleic Acids Research, in press.
- Schaffer JD, Janevski A, Simpson MR: A genetic algorithm approach for discovering diagnostic patterns in molecular measurement data. Proc. of the 2005 IEEE Symposium on: Computational Intelligence in Bioinformatics and Computational Biology 2005, 1–8.
- Naume B, Zhao X, Synnestvedt M, Borgen E, Giercksky Russnes H, Lingjærde OC, Strømberg M, Wiedswang G, Kvalheim G, Kåresen R, Nesland JM, Børresen-Dale AL, Sørlie T: Presence of bone marrow micrometastasis is associated with different recurrence risk within molecular subtypes of breast cancer. Molecular Oncology 2007, 1(2):160–171.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.