TCGA2BED: extracting, extending, integrating, and querying The Cancer Genome Atlas

Background: Data extraction and integration methods are becoming essential to effectively access and take advantage of the huge amounts of heterogeneous genomic and clinical data increasingly available. In this work, we focus on The Cancer Genome Atlas (TCGA), a comprehensive archive of tumoral data containing the results of high-throughput experiments, mainly Next Generation Sequencing, for more than 30 cancer types.

Results: We propose TCGA2BED, a software tool to search and retrieve TCGA data and convert them into the structured BED format for their seamless use and integration. Additionally, it supports conversion into the CSV, GTF, JSON, and XML standard formats. Furthermore, TCGA2BED extends TCGA data with information extracted from other genomic databases (i.e., NCBI Entrez Gene, HGNC, UCSC, and miRBase). We also provide and maintain an automatically updated data repository with publicly available Copy Number Variation, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq (V1, V2) experimental data of TCGA converted into the BED format, and their associated clinical and biospecimen meta data in attribute-value text format.

Conclusions: The availability of the valuable TCGA data in BED format reduces the time needed to take advantage of them: it is possible to efficiently and effectively deal with huge amounts of cancer genomic data integratively, and to search, retrieve, and extend them with additional information. The BED format helps investigators perform several knowledge discovery analyses on all tumor types in TCGA, with the final aim of understanding pathological mechanisms and aiding cancer treatments.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1419-5) contains supplementary material, which is available to authorized users.


Introduction
This user guide is intended for all users who want to learn how to use the TCGA2BED tool to download TCGA data and convert them into the BED, CSV, GTF, JSON, and XML formats. Please also refer to the TCGA2BED_readme.txt file (included in the software package) for additional details.

TCGA2BED procedure steps
The following steps are necessary to download public TCGA data and convert them into the BED, CSV, GTF, JSON, and XML formats. These steps are thoroughly explained in the following sections of this tutorial:
- Meta data download;
- Experimental data download;
- Conversion into the BED, CSV, GTF, JSON, and XML formats.

Java installation
The TCGA2BED tool requires a working Java Virtual Machine (VM). If you have not installed one yet, first download and install the free Oracle Java Runtime Environment from http://www.java.com/getjava/. Several versions for the most common operating systems are available (e.g., Windows x86 for Windows 32 bit, Windows x64 for Windows 64 bit, Mac OS X, or Linux). Please choose the right version according to your operating system.

Executing TCGA2BED
Go to the directory where you extracted the TCGA2BED archive and execute TCGA2BED.jar by double-clicking it (for supported operating systems) or by executing the following command from a prompt:
java -jar TCGA2BED.jar
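For users who prefer to start the tool from a script, the command above can be wrapped as in the following minimal Python sketch. The function names build_launch_command and launch_tcga2bed are hypothetical helpers introduced here for illustration; the only thing taken from this guide is the command java -jar TCGA2BED.jar, run from the folder where the archive was extracted.

```python
import shutil
import subprocess
from pathlib import Path


def build_launch_command(jar_path, java_executable="java"):
    """Return the argument list equivalent to `java -jar TCGA2BED.jar`."""
    return [java_executable, "-jar", str(jar_path)]


def launch_tcga2bed(install_dir):
    """Start TCGA2BED from the folder where the archive was extracted.

    Hypothetical helper: it only checks that the jar and a `java`
    executable can be found, then runs the command given in this guide.
    """
    install_dir = Path(install_dir)
    jar_path = install_dir / "TCGA2BED.jar"
    if not jar_path.is_file():
        raise FileNotFoundError(f"TCGA2BED.jar not found in {install_dir}")
    if shutil.which("java") is None:
        raise RuntimeError("No 'java' executable on PATH; install a Java Runtime Environment first")
    # Run from the installation directory, as the guide instructs.
    return subprocess.Popen(build_launch_command(jar_path.name), cwd=install_dir)
```

Calling launch_tcga2bed with the extraction folder should open the same start screen as a double click; the two checks simply turn a missing jar or a missing Java installation into an explicit error message.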

Start screen
The following main screen of the TCGA2BED software will appear.
It is composed of two main parts. On the left-hand side you can find the Downloader, which retrieves TCGA data. On the right-hand side you can find the Converter, which converts the downloaded data into the BED, CSV, GTF, JSON, and XML formats.

Meta data download
The first step is to download the meta data (clinical and biospecimen biotab files) for the cancer type you want to analyze. Please select the tumor tag from the drop-down menu (Disease); a list with the available tumor tags and names is provided at the end of this tutorial. Then press the Download button and choose a folder in which to save the meta data files.
The download will start and you can track the progress from the TCGA2BED console.

Experimental data download
The second step is to download the experimental data for the cancer type you want to analyze. Please select the tumor tag from the drop-down menu (Disease); a list with the available tumor tags and names is provided at the end of this tutorial. Additionally, select the experiment type from the Data Type drop-down menu. The available experiment types are: Copy Number Variations (CNV), DNA Methylation, RNA-Seq, RNA-Seq V2, Somatic Mutations (DNA-Seq), and miRNASeq. Then press the Download button and choose a folder in which to save the experimental data files.
The download will start and you can track the progress from the TCGA2BED console.

Conversion into the supported formats
After the download of the meta data and experimental data, you can start the conversion into the BED, CSV, GTF, JSON, and XML formats (see the "TCGA2BED_format_definition.pdf" format definition file that is available as Supplemental material and at http://bioinf.iasi.cnr.it/tcga2bed for further details).
Please select the Disease (through the tumor tag) and the Data Type from the drop-down menus. Optionally, if you are converting CNV experiments, select the Mage-Tab Source directory, which you can find in the root download folder.
Finally, start the conversion by selecting the desired format (BED, CSV, GTF, JSON, or XML), clicking the Convert button and choosing the output folder. You will find the converted files and tab-delimited attribute-value pair meta data files for each experiment in the selected folder.
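The tab-delimited attribute-value meta data files produced by the conversion can be loaded with a few lines of code. The following Python sketch assumes one attribute-value pair per line, separated by a single tab; the function name read_meta and the sample attribute names are hypothetical and are not taken from the TCGA2BED format definition.

```python
def read_meta(lines):
    """Parse tab-delimited attribute-value lines into a dictionary.

    Assumes one `attribute<TAB>value` pair per line. Lines without a tab
    are skipped; if an attribute is repeated, the last value is kept.
    """
    meta = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        attribute, separator, value = line.partition("\t")
        if not separator:
            continue  # not an attribute-value pair
        meta[attribute] = value
    return meta


# Hypothetical content, for illustration only.
sample = ["example_attribute\texample value\n", "another_attribute\t42\n"]
meta = read_meta(sample)
```

Because an open text file iterates over its lines, the same function can be applied directly to a file handle opened on one of the meta data files in the output folder.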
You can start the whole process again with new tumor or experiment types.

Batch download and conversion into the desired format
Through the Load Configuration File button you can specify an XML file to download and convert, in batch, the meta data and experimental data of several data types and diseases.
Each operation is defined as an XML block denoted by the operation tag followed by an incremental numeric identifier needed to preserve the execution sequence.
Two types of XML tags are required to define an operation:
1. The cmd tag, followed by the name of the current operation, which can be one of the following commands:
   a. downloadmeta to download clinical data about a specific tumor;
   b. downloaddata to download a particular type of experiments about a tumor;
   c. convert to convert experiments from the TCGA format into the BED format.
2. The following attribute tags are required to configure an operation, depending on the previously selected command. Please follow the example files to use them properly.
   a. disease denotes a specific tumor tag (all tags are listed at the end of this document);
   b. metadata contains the full path to the clinical data in biotab format;
   c. additional_metadata contains the full path to a file with user-defined clinical data; if not present, please set it to null;
   d. input_folder is the full path to the folder that contains experiments in TCGA format;
   e. output_folder is the output directory where the converted BED files will be generated;
   f. data_type denotes the type of the experiments contained in the folder specified in the input_folder field, and can be DNAMethylation27, DNAMethylation450, DNASeq, RNASeq, RNASeqV2, miRNASeq, or CNV;
   g. data_subtype is required for DNA Methylation, RNASeq, RNASeqV2, and miRNASeq only. The allowed values for this field are gene, exon, and spljxn for RNASeq; gene, exon, spljxn, and isoform for RNASeqV2; and mirna and isoform for miRNASeq. If you are not converting these experiments, please set it to null;
   h. magetab_folder is required for the conversion of CNV experiments only and contains the full path to the folder with magetab data; if you are not converting CNV, please set it to null;
   i. autoextract is a binary parameter (0 or 1); if set to 1, the data will be automatically decompressed after the download process;
   j. output_format is an optional parameter that can be set to BED, CSV, GTF, JSON, or XML to define the desired output format.
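The tags described above can also be assembled programmatically. The following Python sketch is one possible reading of the description, not the definitive file layout: it assumes that each block is named operation1, operation2, and so on, that cmd and the attribute tags are child elements of that block, and that the root element can be given any name (the name tcga2bed used here is hypothetical). The example files shipped with the tool remain the authoritative reference for the exact structure and for which attributes each command needs. BRCA is used only as an example tumor tag, and the paths are placeholders.

```python
import xml.etree.ElementTree as ET

COMMANDS = {"downloadmeta", "downloaddata", "convert"}
DATA_TYPES = {"DNAMethylation27", "DNAMethylation450", "DNASeq",
              "RNASeq", "RNASeqV2", "miRNASeq", "CNV"}
# Subtype values listed in this guide; none are listed for DNA methylation.
ALLOWED_SUBTYPES = {
    "RNASeq": {"gene", "exon", "spljxn"},
    "RNASeqV2": {"gene", "exon", "spljxn", "isoform"},
    "miRNASeq": {"mirna", "isoform"},
}


def check_operation(cmd, fields):
    """Reject values that contradict the lists given in this guide."""
    if cmd not in COMMANDS:
        raise ValueError(f"unknown command: {cmd}")
    data_type = fields.get("data_type")
    if data_type is not None and data_type not in DATA_TYPES:
        raise ValueError(f"unknown data_type: {data_type}")
    subtype = fields.get("data_subtype")
    allowed = ALLOWED_SUBTYPES.get(data_type)
    if allowed is not None and subtype not in (None, "null") and subtype not in allowed:
        raise ValueError(f"data_subtype {subtype!r} is not allowed for {data_type}")


def build_config(operations, root_tag="tcga2bed"):
    """Return an XML string with one numbered operation block per entry.

    `root_tag` and the nesting of tags are assumptions made for this sketch.
    """
    root = ET.Element(root_tag)
    for index, (cmd, fields) in enumerate(operations, start=1):
        check_operation(cmd, fields)
        block = ET.SubElement(root, f"operation{index}")
        ET.SubElement(block, "cmd").text = cmd
        for name, value in fields.items():
            # The guide asks for the literal value null when a field is unused.
            ET.SubElement(block, name).text = "null" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


config_xml = build_config([
    ("downloadmeta", {"disease": "BRCA", "autoextract": 1}),
    ("convert", {"disease": "BRCA",
                 "metadata": "/path/to/meta",          # placeholder path
                 "additional_metadata": None,
                 "input_folder": "/path/to/rnaseqv2",  # placeholder path
                 "output_folder": "/path/to/output",   # placeholder path
                 "data_type": "RNASeqV2",
                 "data_subtype": "gene",
                 "magetab_folder": None,
                 "output_format": "BED"}),
])
```

Numbering the blocks from the position in the list keeps the execution sequence explicit, and validating data_type and data_subtype before writing the file catches typos earlier than a failed batch run would.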

Data repository
The FTP site ftp://bioinf.iasi.cnr.it contains an up-to-date archive with the experimental and meta data from TCGA converted into the BED format.
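The repository can also be browsed from a script. The following Python sketch uses the standard ftplib module to list the entries of a directory on the FTP site; it assumes that the server accepts anonymous logins and makes no assumption about the folder layout, which is not described in this guide. The function names repository_host and list_repository are hypothetical.

```python
from ftplib import FTP
from urllib.parse import urlparse

REPOSITORY_URL = "ftp://bioinf.iasi.cnr.it"


def repository_host(url=REPOSITORY_URL):
    """Extract the host name from an ftp:// URL."""
    parsed = urlparse(url)
    if parsed.scheme != "ftp" or not parsed.hostname:
        raise ValueError(f"not an FTP URL: {url}")
    return parsed.hostname


def list_repository(url=REPOSITORY_URL, path="", timeout=30):
    """Return the names found in `path` on the repository (root by default).

    Assumes anonymous access; the directory structure must be explored
    interactively because it is not documented here.
    """
    with FTP(repository_host(url), timeout=timeout) as ftp:
        ftp.login()  # anonymous login (assumption)
        return ftp.nlst(path) if path else ftp.nlst()
```

Calling list_repository() requires network access; passing one of the returned names as path should list that folder in turn.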