The tissue microarray data exchange specification: implementation by the Cooperative Prostate Cancer Tissue Resource

Background: Tissue Microarrays (TMAs) have emerged as a powerful tool for examining the distribution of marker molecules in hundreds of different tissues displayed on a single slide. TMAs have been used successfully to validate candidate molecules discovered in gene array experiments. Like gene expression studies, TMA experiments are data intensive, requiring substantial information to interpret, replicate or validate. Recently, an open access Tissue Microarray Data Exchange Specification has been released that allows TMA data to be organized in a self-describing XML document annotated with well-defined common data elements. While this specification provides sufficient information for the reproduction of the experiment by outside research groups, its initial description did not contain instructions or examples of actual implementations, and no implementation studies have been published. The purpose of this paper is to demonstrate how the TMA Data Exchange Specification is implemented in a prostate cancer TMA.
Results: The Cooperative Prostate Cancer Tissue Resource (CPCTR) is funded by the National Cancer Institute to provide researchers with samples of prostate cancer annotated with demographic and clinical data. The CPCTR now offers prostate cancer TMAs and has implemented a TMA database conforming to the new open access Tissue Microarray Data Exchange Specification. The bulk of the TMA database consists of clinical and demographic data elements for 299 patient samples. These data elements were extracted from an Excel database using a transformative Perl script. The Perl script and the TMA database are open access documents distributed with this manuscript.
Conclusions: TMA databases conforming to the Tissue Microarray Data Exchange Specification can be merged with other TMA files, expanded through the addition of data elements, or linked to data contained in external biological databases. This article describes an open access implementation of the TMA Data Exchange Specification and provides detailed guidance to researchers who wish to use the Specification.


Background
TMA technology was introduced in 1998 [1]. A TMA fundamentally differs from a conventional glass slide only in the number of tissue samples included [see Figure 1]. Tissue microarrays typically contain between 100 and 1,000 core tissue samples. A single TMA block can be sectioned and distributed to dozens of laboratories, saving years of preparation time and hundreds of thousands of dollars in tissue collection costs, and conserving experimental reagents by measuring a marker's distribution on hundreds of specimens arrayed on a single glass slide [1]. Several studies have demonstrated the value of TMAs in validating the biologic relevance of candidate genes expressed in prostate cancers [2][3][4][5][6].
Because TMAs are designed to answer questions applicable to pathologic lesions with specific sets of attributes (e.g. stage or grade or diagnostic subtype), preparation of a TMA requires access to large archives of paraffin-embedded tissues. Each TMA core tissue must be annotated with clinical, demographic or histopathologic information so that measurements on the TMA core samples can result in clinically useful correlations. To ensure inter-laboratory reproducibility, information describing the preparation of TMA blocks and slides needs to be provided along with the TMA data records.
The Cooperative Prostate Cancer Tissue Resource (CPCTR) is a multi-institutional virtual tissue bank funded by the U.S. National Cancer Institute (NCI) to provide researchers with samples of prostate cancer tissues [7]. The member institutions of the CPCTR are New York University, George Washington University, University of Pittsburgh and Medical College of Wisconsin. The CPCTR began service to the cancer research community on December 6, 2001. The CPCTR has over 5,000 prostate cancer specimens including radical prostatectomy cases (paraffin and fresh-frozen) and paraffinized needle biopsies. The CPCTR represents the largest repository of histologically-characterized and clinically annotated prostate cancer tissue in the USA. All accrued cases undergo pathology review and all clinical data is collected using methodology standardized across the participating institutions. CPCTR resources are available to all researchers, academic and commercial. Further information can be obtained from the CPCTR website [8].
The CPCTR has constructed a prostate cancer TMA implemented in conformance with the new TMA Data Exchange Specification (hereinafter designated "the Specification"). The Specification was developed through a series of open workshops sponsored by the National Cancer Institute and the Association for Pathology Informatics [9]. Tissue data included in the CPCTR TMA database is de-identified and assembled in an open access database to permit data sharing, in compliance with current NIH policy on data sharing [10] and in concert with ongoing NIH initiatives to develop new methods for sharing research data [11,12].

Results and Discussion
The TMA Data Exchange Specification was designed to allow TMA database files to be totally self-describing. The properties of a self-describing database file would include: 1. An informative header that explains the purpose of the file and provides all the information needed to understand the file (i.e., its organization). The CPCTR implementation of the Specification has all eight properties and employs the following enhancements: 3. Supports complex TMAs within a single TMA file. In this case, a single TMA file contains four blocks, with cores from a single tissue sample appearing in multiple locations in more than one block.

Protects patient privacy (by deidentifying all data)
3. Allows data sharing (by permitting free distribution of the XML data document)

Conclusions
Tissue microarrays allow for the high throughput analysis of tissue samples and their association with clinical or outcomes data. Yet these experiments require a large amount of information for the subsequent analysis and evaluation, in particular by interested second parties. The Specification provides an accurate and reproducible method for the transfer of this information as is required for inter-laboratory reproducibility. One of the most important problems with modern data specifications is the daunting technical expertise required for their implementation. The Specification was written to permit maximal flexibility and minimal implementation requirements [9]. This study demonstrates that the Specification can be implemented using a simple Perl script that converts an Excel database into XML-tagged data elements. The resulting large section of core-related XML text can be simply inserted into a conformant document containing header, block and slide information. The resulting TMA database can be validated with a Perl script provided with the Specification document.

Human subjects protections
All institutions participating in the CPCTR have Institutional Review Board (IRB) approval for human subjects research. Each CPCTR institution develops its own local protocols to protect the confidentiality and privacy of human subjects and obtains local IRB approval for all CPCTR activities. The IRB assurance numbers for the cooperating institutions are: New York University, M1177; Medical College of Wisconsin, M1061; University of Pittsburgh Medical Center, M1256; and George Washington University Medical Center, M1125. Tissue data records from the cooperating institutions are submitted to a central data manager (Information Management Services, Inc., contracted by the NCI) as de-identified records. All institutions assign an arbitrary number to each record before submitting the de-identified record to the central database. This ensures that the central database has no links connecting records to patients. In addition, HIPAA's proscribed set of 18 data elements is omitted from core sample records (the so-called safe harbor approach to HIPAA compliance) [18].

Tissue and data collection
The CPCTR maintains a publicly available Manual of Operations that describes its tissue collection procedures and policies [19].
Figure 2. Simplest conforming TMA file. Image displaying the simplest possible XML file conforming to the TMA Data Exchange Specification.
Pathological characterization of specimens involves review of all cases by a CPCTR pathologist using diagnostic criteria explained in the publicly available CPCTR histologic atlas and manual [20].
Protocols for the construction of the TMA block and TMA slide are publicly available at the CPCTR web site and are linked from the TMA Database [16,17].

The TMA Data Exchange Specification
The Specification is an open access document that can be used without restriction [9].
The Specification requires four general sections for each TMA file: 1) Header, containing the specification Dublin Core identifiers, 2) Block, describing the paraffin-embedded array of tissues, 3) Slide, describing the glass slides produced from the Block, and 4) Core, containing all data related to the individual tissue samples contained in the array. The simplest possible structure for a conforming TMA file consists of nothing more than empty tags designating the four required sections [see Figure 2] [9].
Common Data Elements (CDEs) are metadata tags that describe the data elements included in an XML database. To be of value, CDEs must be well-defined, uniquely identified and available for human review or computer access. Eighty CDEs, conforming to the ISO-11179 [21] specification for data elements, constitute the XML tags provided in the Specification [9]. CDE descriptors are publicly available [14]. However, the only CDEs that must appear in any conforming TMA file are the section CDEs (header, block, slide and core), the root CDE (histo) and the tma CDE itself (tma). A set of six simple semantic rules describes the syntax for the data exchange specification [9].
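The requirement that six CDEs appear in every conforming file can be illustrated with a short check. The following is a minimal sketch in Python rather than the Perl used by the CPCTR; the function names are hypothetical, the nesting of the sections in the sample skeleton is an assumption, and the check covers only the presence of the required tags, not the six semantic rules of the Specification.

```python
import xml.etree.ElementTree as ET

# CDE names taken from the text: root (histo), tma, and the four section CDEs.
REQUIRED_CDES = ("histo", "tma", "header", "block", "slide", "core")


def missing_required_cdes(xml_text):
    """Return the required CDE names that do not occur anywhere in the document."""
    root = ET.fromstring(xml_text)  # raises ParseError if the text is not well-formed
    seen = {element.tag for element in root.iter()}
    return [name for name in REQUIRED_CDES if name not in seen]


def has_histo_root(xml_text):
    """Check that the outermost element is the root CDE named in the text."""
    return ET.fromstring(xml_text).tag == "histo"


# Hypothetical skeleton; the nesting of the four sections is an assumption.
skeleton = (
    "<histo><tma>"
    "<header></header><block></block><slide></slide><core></core>"
    "</tma></histo>"
)
print(missing_required_cdes(skeleton))  # []
print(has_histo_root(skeleton))         # True
```

A file that passes this check is not necessarily conformant; the validation script distributed with the Specification remains the reference test.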
The Specification was designed for maximal flexibility. Flexibility in the first version of an XML specification permits the addition of greater structure in later versions built on tested implementations. A similar approach has been used for the ANSI/HL7 Clinical Document Architecture (CDA), wherein the earliest version (Level One) is intentionally sparse [22]. At this time, there is no DTD (Document Type Definition) or Schema included in the Specification. For those wishing to use a DTD, a Specification-compliant DTD has been prepared by David G. Nohle, Ohio State University Department of Pathology and the Mid-Region AIDS & Cancer Specimen Resource (ACSR) [23].

Constructing the TMA Data file
Constructing a TMA Database consists of the following:
1. Filling the four sections (header, block, slide and core)
2. Assembling the four sections into a TMA file with a proper file declaration, root element and TMA CDE
3. Validating that the TMA file conforms to the Specification
The header, block and slide sections of the TMA will vary only slightly from project to project within a laboratory. The CPCTR header, block and slide sections were prepared "by hand" using the section-specific CDEs provided in the Specification.
The header section contains descriptive information about the file and its contents. With the exception of one CDE (filename), the header CDEs are the same CDEs used in the Dublin Core set of XML identifiers used by librarians. Detailed information describing the Dublin Core elements is available [13]. A link to the Dublin Core elements is also included in the CPCTR TMA database. The first few lines of the TMA database are shown [see Figure 3]. The block and slide headers of the TMA database are short and are also completed manually.

TMA XML opening section
The cores are distributed for each block in an array, with cores assigned to specific locations [see Figure 4], and all the cores in an array are assigned to a slide, which is a numbered section derived from a block [see Figure 5]. The core section contains annotated data for each core in the TMA. The central database for all CPCTR tissues is maintained as an Excel database by an NCI-contracted information management service (IMS, Rockville, MD). IMS extracts an Excel sub-file consisting of records pertaining to the tissues selected for the TMA block. CPCTR-specific data elements included in the IMS records are publicly available [15].
A Perl script was written that converts Excel files to XML, enclosing the data in each spreadsheet cell in the XML CDE corresponding to its column heading. This creates the "core" section of the TMA database. A sample of an XML-tagged extracted data record is shown [see ...].
The CPCTR prostate cancer TMA consists of 299 core samples distributed in four blocks, each block having 300 arrayed cores. Each block contains about 150 core samples, each placed in two different locations in the block. The core duplicates are staggered in the array to maximize the chance that a given core will be represented if an area of the slide section is lost in processing. The distribution of one set of core samples in multiple array locations in four blocks yields a complex TMA that cannot be adequately represented by separate descriptions of each block. The Specification permits multi-block TMA files. Within the block CDE are the nested sets of four blocks that compose the complex TMA. Each core CDE is nested within a specific block CDE, and one core may have two associated array locations [see Figure 6].
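The spreadsheet-to-XML conversion described above can be sketched as follows. This is not the CPCTR Perl script: it is a minimal Python illustration that assumes the Excel sub-file has been exported as tab-delimited text, that each column heading is already a valid CDE tag name, and that each record is wrapped in its own core element. The function name and the sample column names are hypothetical placeholders, not CDEs from the Specification.

```python
import csv
import io
from xml.sax.saxutils import escape


def rows_to_core_xml(table_text, delimiter="\t"):
    """Wrap every non-empty data cell in a tag named after its column heading."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    records = []
    for row in reader:
        lines = ["<core>"]
        for heading, value in row.items():
            # Skip empty cells and any cells that have no column heading.
            if heading and value:
                lines.append(f"  <{heading}>{escape(value)}</{heading}>")
        lines.append("</core>")
        records.append("\n".join(lines))
    return "\n".join(records)


# Hypothetical two-record export; the column names are placeholders only.
sample = "case_id\tage_at_diagnosis\tgleason_sum\n1001\t63\t7\n1002\t58\t\n"
print(rows_to_core_xml(sample))
```

Escaping the cell values keeps characters such as "&" and "<" from breaking the well-formedness of the resulting core section.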
The four sections are concatenated as a single XML database file. The CPCTR database file is provided with this manuscript [see Additional file 2].
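As an illustration of the assembly step, the sketch below joins four prepared section strings under a file declaration, the root CDE (histo) and the tma CDE, then parses the result to confirm that it is well-formed. It is a Python sketch under stated assumptions, not the procedure used for the CPCTR file: the function name is hypothetical, and the flat ordering of the sections is a simplification, since in the CPCTR database core elements are nested within block elements.

```python
import xml.etree.ElementTree as ET


def assemble_tma_file(header_xml, block_xml, slide_xml, core_xml):
    """Concatenate the four section strings inside the tma and histo elements."""
    parts = [
        '<?xml version="1.0"?>',  # file declaration
        "<histo>",                # root CDE
        "<tma>",                  # tma CDE
        header_xml,
        block_xml,
        slide_xml,
        core_xml,
        "</tma>",
        "</histo>",
    ]
    document = "\n".join(parts)
    ET.fromstring(document)  # raises ParseError if the result is not well-formed
    return document


# Empty sections stand in for the real header, block, slide and core content.
doc = assemble_tma_file(
    "<header></header>", "<block></block>", "<slide></slide>", "<core></core>"
)
print(doc)
```

Parsing the assembled text before writing it to disk catches unbalanced tags introduced while editing the hand-prepared sections.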

Validating the TMA Data file
Once a TMA database is prepared, it needs to be validated to ensure conformance with the Specification. At this time, all TMA files should be validated using a software implementation written in Perl and distributed as an open access supplemental file with the Specification and with this publication [see Additional file 3]. The validating script requires a Perl installation but should operate equally well on any operating system. The validation software has a simple command-line interface. When the file successfully validates, the Perl script outputs the encountered CDEs from the Specification, a statement that the file is valid, and a one-way hash value specific for the validated file [see Figure 7].
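The three outputs described above (encountered CDEs, a validity statement and a one-way hash) can be mimicked in a few lines. The sketch below is not the validating Perl script distributed with the Specification. It is a Python stand-in whose validity test is reduced to well-formedness plus a histo root, and the choice of SHA-256 is an assumption, since the text states only that a one-way hash is produced. The function name is hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def validation_report(xml_bytes):
    """Return the encountered tag names, a validity flag, and a file digest."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        return {"cdes": [], "valid": False, "digest": None}
    cdes = sorted({element.tag for element in root.iter()})
    # Stand-in validity test: well-formed XML whose root is the histo CDE.
    valid = root.tag == "histo"
    # SHA-256 is an assumption; the text only says "one-way hash".
    digest = hashlib.sha256(xml_bytes).hexdigest()
    return {"cdes": cdes, "valid": valid, "digest": digest}


report = validation_report(
    b"<histo><tma><header/><block/><slide/><core/></tma></histo>"
)
print(report["cdes"])
print(report["valid"])
print(report["digest"])
```

Because the digest is computed over the exact bytes of the file, any later change to the file yields a different value, which lets a recipient confirm that the file received is the one that was validated.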

Availability and requirements
The Perl scripts and files for the production of TMA databases that meet the Specification are available with this publication. The example prostate cancer TMA database is available as a supplementary file with this article [see Additional file 1]. The actual tissue microarray slides are available after an application process. Although the CPCTR is a non-profit, government-sponsored resource, a surcharge is attached for glass slides to help defray a portion of the costs of TMA production. The application process and charges are described at the CPCTR web site [8]. Questions regarding any aspect of the CPCTR can be directed to the CPCTR email query service [ask-cpctr-l@list.nih.gov].
Figure 5. TMA XML slide section information. Image displaying the data elements describing the glass slide sectioning information.
Figure 6. TMA XML core section information. Image displaying the data elements comprising a record for a single tissue core.
Figure 7. Validation script output. Image displaying the interaction between Perl validating script and user.