Gene Fusion Markup Language: a prototype for exchanging gene fusion data

Background An avalanche of next generation sequencing (NGS) studies has generated an unprecedented amount of genomic structural variation data. These studies have also identified many novel gene fusion candidates with more detailed resolution than previously achieved. However, in the excitement and necessity of publishing the observations from this recently developed cutting-edge technology, no community standardization approach has arisen to organize and represent the data with the essential attributes in an interchangeable manner. As transcriptome studies have been widely used for gene fusion discoveries, the current non-standard mode of data representation could potentially impede data accessibility, critical analyses, and further discoveries in the near future. Results Here we propose a prototype, Gene Fusion Markup Language (GFML) as an initiative to provide a standard format for organizing and representing the significant features of gene fusion data. GFML will offer the advantage of representing the data in a machine-readable format to enable data exchange, automated analysis interpretation, and independent verification. As this database-independent exchange initiative evolves it will further facilitate the formation of related databases, repositories, and analysis tools. The GFML prototype is made available at http://code.google.com/p/gfml-prototype/. Conclusion The Gene Fusion Markup Language (GFML) presented here could facilitate the development of a standard format for organizing, integrating and representing the significant features of gene fusion data in an inter-operable and query-able fashion that will enable biologically intuitive access to gene fusion findings and expedite functional characterization. A similar model is envisaged for other NGS data analyses.


Background
Gene fusions are well-recognized molecular events and serve as genetic markers and drug targets for several hematological disorders [1,2]. The discovery of recurrent ETS-family translocations in prostate cancer [3,4], a RAF kinase gene fusion in ETS-negative prostate cancer [5,6], and an ALK kinase fusion in lung cancer [7] further advocates the significance of gene fusion events in the development of epithelial cancers [8,9]. The discovery of gene fusion candidates was infrequent and challenging because of various technological limitations inherent in traditional techniques, such as spectral karyotyping, comparative genomic hybridization (CGH), representational oligonucleotide microarray analysis (ROMA), fluorescent in situ hybridization (FISH), and Sangerbased sequencing [10]. However, the rapid evolution of non-Sanger-based massively parallel sequencing technologies has empowered an unprecedented sequencing speed, enabled an unbiased systematic characterization of large-scale genome-wide analysis, and surprised researchers with numerous gene fusion candidates at an unmatched resolution [11][12][13]. As this development in sequencing technology is still relatively new, no standard approach has been established for reporting or documenting a number of the key features associated with gene fusion discovery leading to a repetitive process of manual curation and interpretation of critical information that can entail scrutinizing the entire manuscript along with the supplementary information.
To enable researchers to remain current with the information flow and make the data more accessible, various non-standard community efforts have been established through other structural variation databases [14][15][16][17]. These were primarily developed as independent entities without much scope for data exchange and integrity. Moreover, the non-standard mode of reporting the rapidly increasing gene fusion discoveries has challenged their manual processes of curation and data entry when updating the database. Finally, a number of existing databases are not suited for documenting gene fusion features and lack many of the key attributes that are required for next-generation sequencing (NGS) analyses. In the near future, as we anticipate a downpour of gene fusion candidates from various high throughput analyses, representing the data in an unstructured, autonomous format poses a significant risk of misplacing such valuable information. Similar to other standardization efforts and procedures such as the Minimum Information About a Microarray Experiment (MIAME) that standardizes and shares microarray data [18], the Proteomic Standard Initiative-Molecular Interaction (PSI-MI) that was established to regulate protein-protein interaction representation [19], and the Systems Biology Markup Language (SBML) that represents biochemical reaction networks [20], there is an urgent need to create a standard for recording and reporting gene fusion discoveries. Here, we propose a prototype standardization initiative we call the Gene Fusion Markup Language (GFML) to propose a standard format for representing and documenting gene fusions. In turn, this improved documentation will enable rapid access and maximize the usage of gene fusion findings to other researchers in the community. Hence, these proposed guidelines would serve as a starting point for stricter standards that will evolve over time in the field as the standards undergo regular and thorough vetting by the community [21].

Current literature issues
Researchers have tried to practice and adopt the best possible methods to present their gene fusion data to the community. However, no standard list of features or data structures has been discussed or adopted to represent this special category of observations in an interoperable and query-able fashion. In most cases, authors tend to describe the gene fusion and its features based on a fusion detection algorithm used for a particular set of studies and to represent the information based on their own experience. The depth of information generated from an NGS analyses pipeline, mostly filtered out in the published literature, leads to significant information loss between the observed and reported findings ( Figure 1A). Here, we summarize some of the key concerns. 1. Representation Format: In published reports data is simplified and filtered to include only the diseasespecific or study-related gene fusions are described in an informal format (e.g., a colorful schematic representation). Other non-recurrent or biologically unrelated gene fusions for the given study are merely listed in the Results or the Discussion of the article as a running text or are at times represented pictorially as a Circos plot, histogram, or table (as an image). Besides explaining the significance of the observations, there is no additional effort to represent the complete analyses outcome in a consistent and sharable manner. 2. Splice coordinates: Apart from the fusion genes considered significant for the respective study, most of the other gene fusion candidates are simply listed in a table along with the number of supporting reads. Unfortunately, the features that help predict their significance are not described. In order to assess the true biological significance of candidate fusion transcripts, the splice coordinates (exon junction) are essential for the interpretation of the reading frame and domain signature of the fusion product. For example, the significance of fusion candidates involving kinase family members or transcription factors can be easily misinterpreted if there is no functional domain retained in the open reading frame of the fusion transcript. 3. Read evidence: To the best of our knowledge, there is no published study that reports actual reads mapped to any candidate gene fusion. Currently, the only way to access the gene fusion-specific reads is to download the large raw sequence files from public sequence repositories (only if available and accessible) and perform the non-trivial task of repeating the complete analyses that can entail everything from mapping the raw reads to a fusion discovery pipeline. As there are many fusion candidates predicted in each study with potential falsepositives, it is vital to retain sequence evidence from the analysis phase so that the observations can be independently verified. 4. Terminology: A controlled vocabulary in presenting the characteristic features of gene fusion events and their evidence is another potential problem that needs to be addressed in this discipline. Almost half a dozen published fusion detection algorithms have been independently developed. These algorithms are associated with a number of discrepancies in naming the attributes and values describing gene fusion events and the supporting sequence evidence [22][23][24][25][26][27][28].

Existing database issues
There are some common evolving standards towards representation of generic structural variations [29], but there is no specialized data exchange standard available to submit published gene fusion discoveries with required NGS features to any public repositories or databases. In order to avoid repetitive manual curation and enhance data accessibility, there have been independent efforts to curate and document these gene fusion discoveries in public databases. The Database of Genomic Variants [17,29], the Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer [16], the Atlas of Genetics and Cytogenetics in Oncology and Hematology [15], and the Catalogue of Somatic Mutation in Cancer [14] are a few of the most popular and closely related databases of this kind. Here, we summarize some of the key concerns with the existing resources. 1. NGS incompatible: NGS features that are specific to gene fusions are ignored and not captured, with the current gene fusion entries forcefully accommodated under generic structural variations. In other words, most of the existing databases are not yet adapted for NGS analyses. For example, information pertaining to the sequencing platform, mapping, the fusion detection algorithm, and the read evidence are omitted, leading to yet another level of information loss when the information from the published literature is captured in databases ( Figure 1A). 2. Non-interoperable: Most of the existing online databases function independently with different scopes of operations and report curated data in custom formats not compatible with one another. In turn, this interrupts the data exchange and integrity ( Figure 1A). 3. Unsynchronized update: Existing databases provides some function to the community. However, considering the volume of information generated out of recent NGS analyses and the current status of non-standard data representation in the literature, the manual mode of operation involved in curating and documenting such massive amount of data may  Requirement for specialized standard data exchange format for gene fusions Various standardization procedures have been adopted by the community periodically, each designed to handle various data types generated by different technologies that is appropriate for the level of information required for the exchange. For example, BioXSD is a common generic data exchange format that bridges the gap between specialized standardization formats [30]; specialized data formats such as MAGE-ML are useful in describing the minimum information specific to microarray experiment [18]; PSI-MI is more applicable for handling information specific to protein-protein interactions [19]; and Molecular Methods (MolMeth) database provides the research community with an up-to-date source of methods and protocols used in molecular biology and medicine [31]. Similarly, several sequencing-based standardization formats have been developed by openaccess and international working bodies such as Genomic Standards Consortium (MIGS\MIMS\MIENS\MIMARKS\ MIxS) and European Nucleotide Archive (ENA\SRA) that describe genomic, meta-genomic, and environmental sequences [32][33][34][35][36]. Previously developed sequencingbased standardization methods primarily focused on basic data such as sample information, experimental setup, machine configuration, sequence traces, reads, quality scores, assembly, mapping and annotation. However, standardization procedures describing secondary/ tertiary analyses and their outcomes are limited, especially in the context of rapidly evolving technological advances in next generation sequencing. The currently available Minimum Information about a high-throughput Nucleotide SeQuencing Experiment (MINSEQE), extends the MIAME specifications to capture the quantitative data (expression) from HTS technology, while dbVAR and GVF are specialized data formats to exchange genomic structural variations [37,38]. Similarly, we envision a customized standardization procedure to accommodate the features of gene fusion data arising from the transcriptomic analyses that incorporates valuable attributes such as description of fusion detection algorithms, 5' and 3' fusion transcript annotations, fusion read evidence, open reading frames, splice junction features, experimental validation status, functional domain architecture, etc.
currently not provided in a common data exchange format. Many of the common attributes of gene fusion schema could potentially be derived from other XML Schemas (SRA\ENA, PSI-MI, MAGE-ML), such as "Study", "Sample", and "Experiment" but the descriptions of such elements are also often varied across existing standardizations and are primarily based on the platform\ technology (sequencing, Co-IP, array-based, etc.,) that are specifically suited for the generation of information of primary interest. For example, the element "Experiment" is fine-tuned to fit sequencing platform details in SRA to exchange sequence data, in PSI-MI it is structured to handle protocols which generate protein-protein interactions, and in MAGE-ML it is designed to capture information from array based technologies. Therefore in the proposed GFML prototype, in order to ensure a wide scope for curation of gene fusion features across diverse platforms including NGS, FISH, aCGH, qPCR, etc., the common elements have not been derived from any of the available individual standards. However, to take advantage of the existing schema, tools to facilitate cross-talk and data sharing between various schemas will be enabled.

Methods
We initiated an investigation to understand the commonality and characteristics of the published features and format in NGS studies specifically involving gene fusion algorithms and related findings [5,6,[22][23][24][25][26][27][28]39,40]. We identified nine major elements and related attributes describing the characteristics of gene fusion events and the sequence evidence. We also investigated the existing structural variations databases [14][15][16][17]39] to understand the documented features, working model, and current status of each database. To design and develop a compatible and robust data model, we also adopted some of the relevant features of existing standardization protocols and markup languages [19]. Based on our interpretation of existing sources, we standardized the identified elements and features and further developed a prototype (Additional files 1 and 2). The graphical version of the GFML prototype was made using Altova XMLSpy version 2011rel3sp1 (http://www.altova.com/). The complete process flow is represented in Figure 1B.

Results
In addition to specifying all required data elements and features, the model recommends a database-independent structure and standardizes the data attributes and its values. The ultimate goal of this model is to provide a standard framework that enables different types of complex, biologically meaningful queries in order to maximize the data usage of the system (Table 1). "Record_Set" is the root element of the GFML that contains one or more records (Figure 2). Each "Record" is a self-contained unit that allows gene fusion entries from one to multiple sources. Nine major elements have been characterized and identified from various transcriptome studies describing gene fusion. A brief introduction for each element is summarized as follows (detailed technical comments are available in the XML documentation): (1) the "Source_List" element allows one or more source objects, describes the source of the study and normally refers to the publication or data provider details; (2) the "Sample_List" allows one or more sample objects and describes the properties of the biological specimens associated with the study; (3) the "Experiment_List" element allows one or more experiments and describes the experimental parameters associated with the study; (4) the "Sequence_Platform_List" element allows one or more sequence platform objects and describes the required sequencing parameters such as the platform name, description and application type; (5) the "Validation_Platform_ List" elements allow one or more validation platform objects and describe the validation methods; (6) the "Map-ping_Algorithm_List" allows one or more mapping algorithm objects and describes the mapping algorithm and its parameters; (7) the "Fusion_Detection_Algorithm_List" allows one or more fusion detection algorithm objects and describes the fusion algorithm and its parameters; (8) the "Sequence_ Repository_List" elements allow one or more sequence repository objects and describe the sequence repository details where the raw sequences have been deposited and made accessible; and (9) the "Gene_Fusion_List" element allows one or more gene fusion objects that describe the key attributes pertaining to gene fusion analyses including chromosomal location, gene annotation, read evidence, ORF status, the splice junction, potential mechanism, and validation status and is linked to other elements in the model, including the experiment, sample, mapping, and the fusion detection algorithm. For accommodating the available legacy data, the initial version of the prototype has been made highly flexible. Only 3 out of 9 elements (i.e., "source," "sample," "gene fusion"), represented as a thick line in Figure 2, are considered mandatory, whereas all others are represented as optional elements. As a proof of concept, we further illustrated the usefulness of this model with a prostate cancer-specific gene fusion candidate TMPRSS2-ERG, rediscovered by next generation sequence analyses [26] (Additional files 3 and 4). The instance further validated using the defined XML Schema Definition (Additional files 1 and 2).

Controlled vocabularies
A central requirement for efficient data exchange is a common data exchange format, however the presence of this format is not a guarantee of data compatibility. Ensuring the standardized use of the data attributes and its values through documentation and controlled vocabularies is also essential. Akin to other standardization protocols, controlled vocabularies related to gene fusion attributes and attribute values are subjected to initial standardization in order to enhance the dynamic  queries. To avoid redundancy the common elements such as sample details, experimental design, and protocol description that have already been discussed and standardized by other efforts were avoided [18,19,32,33,36,40] and emphasis is placed only on the new data elements specific to next generation sequencing and gene fusion analyses. We believe there are much room to improve upon these vocabularies as the system progresses and becomes widely adopted by the community.

Discussion and conclusions
Each successful surveying information system that developed and supported by a community has adopted, during its evolution, certain standard procedures that were warranted for data integrity and interoperability. Although a vast amount of microarray data and proteinprotein interaction data were generated and reported in the public domain in a non-standard manner, over time the impaired status of data access and integrated analyses were exposed. In response to these issues, communities designed and developed appropriate standardization procedures such as MIAME / MAGE-ML [18] and PSI [19]. A number of gene fusion candidates from NGS studies have been recently reported without conforming to any standardized procedures for describing and documenting the associated data. Because of this lack of standardization, there is an enormous risk of inaccessibility of such valuable information that can consequently impede data integrity and downstream analyses. Although the NGS studies are new and still evolving, the rapid generation of information underscores the immediate requirement for a standardization procedure to represent and document the data in a common format for publication in a journal and deposition in a database in an exchangeable fashion. We believe that our proposed prototype standardization tool, Gene Fusion Markup Language (GFML) helps in resolving existing inconsistencies in gene fusion data representation and will facilitate the development of interoperable data model that can be dynamically queried and interpreted across different systems.

Future plans
The model presented here offers the first necessary steps towards standardization of gene fusion features to share and exchange in a common standard format among the research community. The concept of incorporating the secondary and tertiary features derived from high throughput data can be extended to NGS-based de novo sequencing, gene expression, epigenetics, copy number variation, comparative genomics, metagenomics and pathogens. Collaboration with other standardization consortia will be pursued to further develop and extend the existing standards. As part of the community standard initiative, we intend to engage the research community in discussions and look forward to active participation by others to catalyze future development.