VDJML: a file format with tools for capturing the results of inferring immune receptor rearrangements

Toby, Inimary T.; Levin, Mikhail K.; Salinas, Edward A.; Christley, Scott; Bhattacharya, Sanchita; Breden, Felix; Buntzman, Adam; Corrie, Brian; Fonner, John; Gupta, Namita T.; Hershberg, Uri; Marthandan, Nishanth; Rosenfeld, Aaron; Rounds, William; Rubelt, Florian; Scarborough, Walter; Scott, Jamie K.; Uduman, Mohamed; Vander Heiden, Jason A.; Scheuermann, Richard H.; Monson, Nancy; Kleinstein, Steven H.; Cowell, Lindsay G.

doi:10.1186/s12859-016-1214-3

Volume 17 Supplement 13

Proceedings of the 13th Annual MCBIOS conference

Proceedings
Open access
Published: 06 October 2016

VDJML: a file format with tools for capturing the results of inferring immune receptor rearrangements

Inimary T. Toby¹,
Mikhail K. Levin²,
Edward A. Salinas³,
Scott Christley¹,
Sanchita Bhattacharya⁴,
Felix Breden⁵,
Adam Buntzman⁶,
Brian Corrie⁷,
John Fonner⁸,
Namita T. Gupta⁹,
Uri Hershberg¹⁰,
Nishanth Marthandan¹¹,
Aaron Rosenfeld¹²,
William Rounds¹³,
Florian Rubelt¹⁴,
Walter Scarborough⁸,
Jamie K. Scott¹⁵,
Mohamed Uduman¹⁶,
Jason A. Vander Heiden⁹,
Richard H. Scheuermann^17,18,19,
Nancy Monson¹³,
Steven H. Kleinstein^9,16,20 &
…
Lindsay G. Cowell¹

BMC Bioinformatics volume 17, Article number: 333 (2016) Cite this article

2842 Accesses
9 Citations
11 Altmetric
Metrics details

Abstract

Background

The genes that produce antibodies and the immune receptors expressed on lymphocytes are not germline encoded; rather, they are somatically generated in each developing lymphocyte by a process called V(D)J recombination, which assembles specific, independent gene segments into mature composite genes. The full set of composite genes in an individual at a single point in time is referred to as the immune repertoire. V(D)J recombination is the distinguishing feature of adaptive immunity and enables effective immune responses against an essentially infinite array of antigens. Characterization of immune repertoires is critical in both basic research and clinical contexts. Recent technological advances in repertoire profiling via high-throughput sequencing have resulted in an explosion of research activity in the field. This has been accompanied by a proliferation of software tools for analysis of repertoire sequencing data. Despite the widespread use of immune repertoire profiling and analysis software, there is currently no standardized format for output files from V(D)J analysis. Researchers utilize software such as IgBLAST and IMGT/High V-QUEST to perform V(D)J analysis and infer the structure of germline rearrangements. However, each of these software tools produces results in a different file format, and can annotate the same result using different labels. These differences make it challenging for users to perform additional downstream analyses.

Results

To help address this problem, we propose a standardized file format for representing V(D)J analysis results. The proposed format, VDJML, provides a common standardized format for different V(D)J analysis applications to facilitate downstream processing of the results in an application-agnostic manner. The VDJML file format specification is accompanied by a support library, written in C++ and Python, for reading and writing the VDJML file format.

Conclusions

The VDJML suite will allow users to streamline their V(D)J analysis and facilitate the sharing of scientific knowledge within the community. The VDJML suite and documentation are available from https://vdjserver.org/vdjml/. We welcome participation from the community in developing the file format standard, as well as code contributions.

Background

The genes that encode antibodies (Ab) and the immune receptors expressed on B and T lymphocytes are not germline encoded; rather, they are somatically generated in each developing lymphocyte by a process called V(D)J recombination, which assembles specific, independent germline gene segments into mature, composite genes [1]. Seven types of genes are assembled by V(D)J recombination, and, for each one, there are two or three sets of gene segments: Variable (V) and Joining (J) gene segments are present in all seven, and Diversity (D) gene segments are present in three. During V(D)J recombination, one gene segment of each type is selected, essentially at random, and the selected segments are joined to form a rearranged gene [2]. A diverse repertoire of genes is created as a result of the varied combinations of gene segments. In addition to this combinatorial diversity, there is junctional diversity as a result of imprecise joining: the sequence at the junction of two joined gene segments is unique due to enzymatic processes that act on the gene segment ends (e.g., hairpin opening and exonucleolytic removal) and add non-templated nucleotides into the junctions [2]. In B lymphocytes, the rearranged genes are further diversified through gene conversion (e.g., in chickens and rabbits) [3] or somatic hypermutation (e.g., in mice and humans) [4]. As a result of these processes, each individual has millions of unique Ab and immune receptor genes [5, 6].

V(D)J recombination is the distinguishing feature of adaptive immunity and, through the creation of a diverse immune receptor repertoire, enables the mounting of an effective immune response against an essentially infinite array of antigens, such as those derived from pathogens or tumors. It also has the potential to generate autoimmune responses. Thus, the characterization of adaptive immune receptor repertoires is critical in both basic research and clinical contexts, as well as in the development of pharmaceuticals. Recent application of high-throughput sequencing allows description of the immune response in exquisite detail and has resulted in an explosion of research activity in the field [7–10]. This has been accompanied by a proliferation of software tools for analysis of repertoire sequencing data.

Repertoire sequencing typically involves targeted polymerase chain reaction (PCR) or 5′ rapid amplification of cDNA ends (5′ RACE) to amplify rearranged gene sequences followed by sequencing of the PCR product. The initial steps in analysis of the resulting sequencing data are generally the same, regardless of the biological or clinical question being addressed [11]. The first step is preprocessing to prepare reads for analysis. Examples of preprocessing activities include demultiplexing, quality filtering, and error correction. The second common analysis step is inference of rearranged gene sequences via alignment of sequences to reference databases of germline gene segments. The third step is annotation of rearranged gene sequences according to things such as gene segments utilized, location of complementarity determining regions (CDRs), non-templated nucleotides, and, in the case of B lymphocyte-derived sequences, base substitutions, insertions, and deletions resulting from somatic hypermutation. The fourth common step is repertoire characterization, including clone enumeration, determination of repertoire-level distributions (e.g., gene segment usage, CDR3 length), and quantification of diversity and clonality. A large number of software packages have been developed for these analyses, particularly for the second and third steps. At the time of this writing, there are no fewer than 24 packages, each performing at least one of steps 2–4, and there will certainly be many more released in the future [12–35].

Despite the importance and widespread application of immune repertoire profiling via high-throughput sequencing, there are currently no community standards for data recording and exchange. The inputs and outputs for preprocessing software packages are reasonably standardized, as these mainly read and write FASTA/QUAL or FASTQ files. There are, however, differences in the way sequence-level metadata (such as primer matches and UID sequences) are stored, with one mechanism being encoding this information as entity-value pairs directly in the FASTQ format [24]. Software packages for conducting germline alignment all utilize FASTA files, but they write different output formats, and, for all subsequent analysis steps, there are no input/output standards. The most widely used germline alignment packages, IgBLAST and IMGT/High V-QUEST, write distinct mixed content text files that intersperse sequence alignments and tables with data embedded in free text. This creates significant problems for both software developers and users. Software developers must write parsers for multiple different input formats or limit their software to consuming the output of only a single alignment package. Users must reformat data as it moves through an analysis pipeline, which is a time-consuming and error-prone process. This problem is exacerbated by the frequent desire to utilize multiple packages for a single analysis task for comparative purposes. Furthermore, reformatting is not always possible as the formats differ not just in how the data are represented but also in the content.

To address this problem, we have developed VDJML, an XML-based file format for representing the alignments of rearranged gene sequences to germline gene segments and the accompanying set of annotations. VDJML can accommodate rearrangements from both B and T lymphocytes. VDJML is a common, open standard designed to provide both syntactic and semantic interoperability between software packages. The VDJML suite includes a schema for the VDJML format and libVDJML, a C++ library for generation and parsing of VDJML files, with an accompanying package of Python bindings for the library.

While our primary motivation was to provide a common open standard for software development, VDJML has additional benefits. It can serve as the basis for integrative analysis of results from different studies, and it can meet the widely recognized need for data sharing to improve reproducibility and scientific rigor [8, 36].

Implementation

Design priorities

VDJML development was initiated as part of the VDJServer project (https://www.vdjserver.org). It is being developed as an open, community standard. Its data structure was established during regular calls among the authors, some of whom are computational biologists developing repertoire analysis software, and some of whom are experimentalists and users of this software. The team includes developers of VDJServer and its software packages (e.g. VDJPipe) (https://bitbucket.org/account/user/vdjserver/projects/VDJS), developers of iReceptor, and developers of the Immcantation framework (http://immcantation.readthedocs.io) [23, 24]. Documentation for VDJML is available from https://vdjserver.org/vdjml/. Source code for the tools is available from the VDJML Bitbucket repository at https://bitbucket.org/vdjserver/vdjml.

We chose XML, the eXtensible Markup Language [37], because it readily accommodates key features of our data type, including:

The need to add or eliminate data fields over time and to version future extensions of the schema,
The need for standard VDJML data types along with custom, user-defined data types, and
Variable cardinality relationships between rearranged gene sequences and their annotations (e.g., a single sequence may have alternative alignments, each with multiple base substitutions).

Additional reasons for selecting XML include the portability of XML documents, existing extensive support for standard definitions, parsing, generation, and versioning, and its widespread acceptance as a standard [38].

We had the following additional design priorities:

Support for including results from multiple software packages, multiple germline databases, and multiple systems for annotating codon positions in rearranged gene sequences;
Normalization to minimize data duplication;
Read/write streamability in that, during reading or writing VDJML files, only the information from a single record needs to be present in memory.

Schema overview

Currently, version 1.0 of VDJML has been finalized with a relatively narrow scope that will be extended over time. The current scope does, however, include sufficient information to recreate an alignment. Elements defined by the schema (Fig. 1) belong to the namespace http://vdjserver.org/vdjml/xsd/1/ and are prefixed vdj:. The top element of the schema is vdj:vdjml, which contains the required version attribute.

A VDJML file consists of two parts enclosed in the vdj:meta and vdj:read_results elements (Fig. 2). The schema allows user-defined elements and attributes to appear under vdj:meta and vdj:read_results, but these should have namespaces other than vdj.

The vdj:meta element contains general information that may be shared across analysis results (Fig. 2). Its child elements include vdj:generator, vdj:aligner, and vdj:germline_db. The vdj:generator element describes the software that wrote the VDJML file using the required name, version, and time_gmt attributes. The value for the time_gmt attribute is the date and time the file was written in Greenwich Mean Time (GMT). The vdj:aligner element contains information about a software package used to align sequences to a database of germline gene segments, a program that generated all or some of the results in the VDJML document. This element has the required attributes aligner_id and name. The value for aligner_id is a unique identifier that is referenced within child elements of the vdj:read_results element described below. It enables inclusion of results from multiple different aligners for a single sequence in a single VDJML file. vdj:aligner has one child element, vdj:parameters, which can be used to capture information needed to reproduce the run of the alignment software. Figure 2 shows a VDJML file generated on VDJServer using a local installation of IgBLAST. On VDJServer, the parameter element captures the command passed to IgBLAST. The vdj:germline_db element stores information about a germline database used for analysis with the required attributes version, species, name, and gl_db_id. As with aligner_id, the value for gl_db_id is a unique identifier that is utilized with child elements of vdj:read_results to accommodate alignments for a single sequence against multiple germline databases.

Representation of alignments

Alignment results (alignments plus their annotations) are stored inside the vdj:read_results element as a series of vdj:read elements. Each vdj:read element corresponds to one sequence. The required read_id attribute holds a unique identifier for the sequence, which is the corresponding identifier from the FASTA or FASTQ source file used as input to the alignment software package. The primary child element for vdj:read is vdj:alignment, which captures all of the alignment output for that particular read sequence. It has two child elements: vdj:segment_match and vdj:combination.

The foundation of an alignment is the aligned region of a sequence, the germline gene segments to which the region aligns, and the alignment positions. This information is captured in VDJML using the element vdj:segment_match. This element takes the read sequence as its point of reference and specifies the subsequence that aligns well (or matches) to a particular germline gene segment (or set of gene segments if the alignments are identical). A single vdj:alignment element can have an unlimited number of vdj:segment_match child elements. Thus, this element has segment_match_id as a required attribute. Additional required attributes are read_pos0, read_len, and gl_len, which capture the position in the sequence where the alignment to a particular germline gene segment begins, the number of positions in the sequence that align to the germline gene segment, and the number of positions in the germline gene segment that align, respectively. Optional attributes capture the percentage of positions that match over the aligned region (identity) and the alignment score (score) as defined by the software package generating the alignment. Additional optional attributes (e.g., substitutions) capture more details about the alignment, as described below (Table 1).

Table 1 Names and descriptions of optional attributes for segment match and region elements

Full size table

Child elements of vdj:segment_match are vdj:btop, vdj:gl_seg_match, and vdj:aa_substitution. The vdj:btop element captures the BLAST “trace-back operations” string. The vdj:gl_seg_match element specifies the germline gene segment in the alignment. It has the unique identifier gl_seq_match_id as a required attribute, because a single vdj:segment_match element can specify alignment to multiple different germline gene segments when those alignments are identical. The required attributes type, name, and gl_pos0 specify the gene segment type (V, D, or J), the gene segment name according to the germline database used, and the first position in the germline gene segment sequence that aligns to the read sequence. Additional required attributes reference the gl_db_id (germline database identifier) and aligner_id (alignment software package identifier) attributes of the vdj:germline_db and vdj:aligner elements in the vdj:meta section of the file described above.

Figure 3 shows a sample alignment generated by IgBLAST for a sequence derived from the IGH locus. Because the example sequence was derived from the IGH locus, we expect alignment to V, D, and J gene segments. Figure 3 shows that the sequence aligns equally well to the V gene segments IGHV3-23*01 and IGHV3-23D*01. It aligns equally well to the D gene segments IGHD2-21*01 and IGHD2-21*02, with two possible alignments to IGHD2-21*01. Finally, the sequence aligns best to the J gene segment IGHJ4*02. There are additional lower scoring alignments to IGHV3-23D*02, IGHJ4*01, and IGHJ5*02.

This information is captured in VDJML using seven vdj:segment_match elements. Figure 2 shows the corresponding VDJML file with some of the vdj:segment_match elements left out for space. Figure 4 shows the full vdj:alignment element with all vdj:segment_match elements. The first one shows that 295 positions in the read sequence, beginning at position 123, align to 295 positions within the germline gene segments IGHV3-23*01 and IGHV3-23D*01. Alignment of the read sequence to IGHV3-23*01 or IGHV3-23D*01 produces identical results. Thus, they are captured within a single vdj:segment_match element. The resulting alignment has ~93 % sequence identity over the aligned region and 20 nucleotide substitutions. As can be seen in the two vdj:gl_seg_match elements, the two germline gene segments are present in the same germline database (gl_db_id = “1”), they are V gene segments, the alignments were produced using the same alignment software package (aligner_id = “1”), and the alignment begins at the first position of the germline gene segment sequence.

To specify the full alignment for a rearranged gene sequence, we must specify the combination of segment matches inferred to comprise the rearranged sequence. This is captured in the element vdj:combination using the required attribute segments. The value for segments is a list of segment match identifiers pointing to the appropriate vdj:segment_match elements, according to the alignment software package. These are the highest scoring segment matches of each type (V, D, J), arbitrarily using the first one listed when there are multiple segments with equivalent scores. Figure 2 shows the vdj:combination element for the example alignment from Fig. 3. The combination is formed from segment matches with identifiers 1, 3, and 5, corresponding to the germline gene segments IGHV3-23*01 or IGHV3-23D*01, IGHD2-21*01, and IGHJ4*02 (Fig. 4).

Annotation of alignments

After the full alignment has been specified, it can be annotated with regions of interest using the vdj:region child element of vdj:combination. Regions are specified using the required name, aligner_id, read_pos0, and read_len attributes. The value of name is the name of the region as provided by the software package generating the alignment or annotation. The corresponding software package is indicated using the aligner_id attribute to reference the vdj:aligner element in the vdj:meta section of the VDJML document. read_pos0 and read_len are used to specify precisely where in the read sequence the region is located by specifying the starting location and length, respectively. Commonly annotated regions include framework regions (FRs) 1 – 3, CDRs 1 – 2, the junctions between V and D segments and D and J segments for IGH, TCRB, and TCRD chains, and the junctions between V and J segments for IGK, IGL, TCRA, and TCRG chains. Figure 2 shows example annotations of the VD and DJ junctions and of the FRs and CDRs for an IGH rearrangement.

Alignments can be further annotated using a variety of optional attributes (Table 1). These can be included in either the vdj:segment_match or vdj:region elements.

VDJML support library

Since the VDJML format is based on XML, existing XML software libraries could be used for VDJML generation, parsing, and validation. However, to simplify use and adoption, we have developed libVDJML – a library for reading, writing, and validating VDJML data – thus eliminating the need to write code for the specific tags and structure of VDJML. By using libVDJML, programs are automatically insulated from changes and enhancements made to the VDJML specification over time. The library is implemented in the C++ language for speed and performance. We currently provide bindings for the Python language and plan support for R and Java.

The library provides the Vdjml_reader and Vdjml_writer classes for reading and writing the contents of a VDJML file. These classes also support compression, so large VDJML files can be compressed using gzip or bzip2, and they do not need to be uncompressed in order to read or write them. As described in the schema above, VDJML consists of a single metadata element (vdj:meta) and any number of alignment results (vdj:read and vdj:segment_match elements). Upon opening a file, Vdjml_reader reads the metadata element, but for efficiency, does not read the VDJML file completely into memory. Instead, it reads only a single vdj:read element at a time. Writing a VDJML file operates in a similar fashion. A metadata entry is provided upon creation of a Vdjml_writer instance, which is written initially to the output file, and alignment results are incrementally serialized.

Each VDJML element and its associated sub-elements for an alignment result are represented by an underlying set of C++ classes, which are instantiated upon parsing of the alignment result. Attributes for VDJML elements are accessed through instance variable getter methods of their respective class in the normal C++ fashion. For elements with variable cardinality, a map data structure is utilized allowing for quick access to an individual element or iteration across all elements. Likewise, construction of elements involves creating the appropriate C++ class with the desired values and inserting any variable cardinality elements; Vdjml_writer will then transform them into a correctly structured VDJML element. All of the capability of the C++ API is available through the Python bindings.

To facilitate conversion of alignment files into VDJML format, libVDJML currently provides the igblast_parse.py script that translates an IgBLAST result file into a VDJML file (Fig. 5). A parser for IMGT/High V-QUEST alignment files is currently under development.

Results and discussion

We describe VDJML, an XML-based file format designed to represent the results of aligning immune repertoire sequences to germline gene segments. In addition to including sufficient information to reconstruct a full alignment of a sequence to its component germline gene segments, the standard can capture annotations of biological importance, such as CDRs and base substitutions. To our knowledge, this is the first such standard.

We intend for VDJML to be used in at least two ways. First, it is designed to serve as a common format for software developers. If software that generates alignments and their annotations, such as IgBLAST and IMGT/High V-QUEST, generated VDJML files as output, then downstream analysis packages, such as Change-O [23], could be designed to take a single, common file format as input. This would greatly simplify software development by eliminating the need to code against multiple different input formats. Additionally, adoption of VDJML will facilitate the use of multiple different software packages during analysis, eliminating the need for data reformatting. Such standardization is urgently needed. In recent years, there have been significant improvements in high-throughput sequencing of rearranged immune receptor genes, resulting in widespread application of this technology. This has in turn resulted in a tremendous amount of activity developing new software packages to analyze this data type. At this time, we count at least 24 packages, half of which were published in the last two years, and we expect this activity to increase in the coming years.

Our second intended use is as a medium for data sharing. The scientific community at large has recognized the need for data sharing in support of scientific rigor, transparency, and reproducibility. While raw sequence reads, or processed reads used as input to alignment software, can be shared using FASTA/QUAL or FASTQ files, there is benefit to sharing the alignments and annotations used for analysis. It can be complicated and time-consuming to reproduce the full preprocessing, alignment, and annotation steps of an analysis, particularly if one is integrating data from a variety of sources. For some analyses, reproducing all steps may be necessary, but certainly for many it is not. For example, it is frequently of interest to ask whether a particular CDR3 sequence was observed among any donors in a study; this question could be readily addressed with data shared in VDJML format, but would require significant work using FASTA formatted data.

To enhance the transparency of alignments and annotations shared in VDJML format, we have developed elements and attributes for capturing key features of the processes generating a specific VDJML file. In general, the information captured within the vdj:meta section of a VDJML file should enable the file to be recreated if the starting read sequences (and scores if applicable) are available.

Common barriers to the adoption of standards include: (1) a large number of existing, similar standards, (2) not meeting the needs of the target community, and (3) difficulty using the standard. As described above, VDJML is the first file format proposed as a standard for this domain. As such, we don’t anticipate barriers to adoption if the format adequately meets the community’s needs and is accompanied by tools to facilitate use. To ensure that the community’s needs are met, VDJML is being developed by a group that includes software developers and non-developer users. Participation is open to all interested. We have an online forum available through http://forums.vdjserver.org. To facilitate use, we have created libVDJML, which includes classes for reading and writing VDJML elements. libVDJML currently supports conversion of IgBLAST output to VDJML, and in the near future, will support conversion from IMGT/High V-QUEST as well.

VDJML is currently supported by software packages developed by authors of this paper. In addition, it will be supported by the ImmPort database [39] as an immune repertoire sequence data reporting format. Currently, for studies that are included in ImmPort, raw immune repertoire sequence data is shared via the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) and VDJML files are shared via a reference to the files hosted at VDJServer. Our test case for developing this system was Rubelt et al. 2016 [40]. The ImmPort study page can be accessed via the study identifier SDY675. The VDJServer study page and files can be accessed here: http://wiki.vdjserver.org/vdjserver/index.php/Rubelt_et_al._2016.

Future development plans for the VDJML format include providing an enhanced representation of workflows within the vdj:meta element. For example, we will include elements to support additional types of processing steps. We will preserve the alignment element and add elements for describing annotations such as genotype corrections and clonal analysis from software such as TIgGER [41] and Change-O [23] for assigning sequences to clones. Additionally, we plan to add elements for acknowledging the probabilistic nature of inferring rearranged sequences and assigning annotations. Future plans for libVDJML include providing bindings to additional languages and expanding the suite of parsers available so that output from existing alignment software can be readily converted to VDJML.

We invite participation from the larger community, both in the form of suggested revisions and enhancements to the schema, as well as in the form of code contributions.

References

Tonegawa S. Somatic generation of antibody diversity. Nature. 1983;302(5909):575–81.
Article CAS PubMed Google Scholar
Schatz DG, Oettinger MA, Schlissel MS. V(D)J recombination: molecular biology and regulation. Annu Rev Immunol. 1992;10:359–83.
Article CAS PubMed Google Scholar
Reynaud CA, et al. Formation of the chicken B-cell repertoire: ontogenesis, regulation of Ig gene rearrangement, and diversification by gene conversion. Adv Immunol. 1994;57:353–78.
Article CAS PubMed Google Scholar
Berek C, Milstein C. The dynamic nature of the antibody repertoire. Immunol Rev. 1988;105:5–26.
Article CAS PubMed Google Scholar
Lythe G, et al. How many TCR clonotypes does a body maintain? J Theor Biol. 2016;389:214–24.
Article PubMed PubMed Central Google Scholar
Warren RL, et al. Exhaustive T-cell repertoire sequencing of human peripheral blood samples reveals signatures of antigen selection and a directly measured repertoire size of at least 1 million clonotypes. Genome Res. 2011;21(5):790–7.
Article CAS PubMed PubMed Central Google Scholar
Hou XL, et al. Current status and recent advances of next generation sequencing techniques in immunological repertoire. Genes Immun. 2016;17(3):153–64.
Article CAS PubMed Google Scholar
Georgiou G, et al. The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat Biotechnol. 2014;32(2):158–68.
Article CAS PubMed PubMed Central Google Scholar
Kirsch I, Vignali M, Robins H. T-cell receptor profiling in cancer. Mol Oncol. 2015;9(10):2063–70.
Article CAS PubMed Google Scholar
Calis JJ, Rosenberg BR. Characterizing immune repertoires by high throughput sequencing: strategies and applications. Trends Immunol. 2014;35(12):581–90.
Article CAS PubMed PubMed Central Google Scholar
Yaari G, Kleinstein SH. Practical guidelines for B-cell receptor repertoire sequencing analysis. Genome Med. 2015;7:121.
Article PubMed PubMed Central Google Scholar
Ye J, et al. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 2013;41(Web Server issue):W34–40.
Article PubMed PubMed Central Google Scholar
Alamyar E, et al. IMGT((R)) tools for the nucleotide analysis of immunoglobulin (IG) and T cell receptor (TR) V-(D)-J repertoires, polymorphisms, and IG mutations: IMGT/V-QUEST and IMGT/HighV-QUEST for NGS. Methods Mol Biol. 2012;882:569–604.
Article CAS PubMed Google Scholar
Volpe JM, Cowell LG, Kepler TB. SoDA: implementation of a 3D alignment algorithm for inference of antigen receptor recombinations. Bioinformatics. 2006;22(4):438–44.
Article CAS PubMed Google Scholar
Munshaw S, Kepler TB. SoDA2: a Hidden Markov Model approach for identification of immunoglobulin rearrangements. Bioinformatics. 2010;26(7):867–72.
Article CAS PubMed PubMed Central Google Scholar
Gaeta BA, et al. iHMMune-align: hidden Markov model-based alignment and identification of germline genes in rearranged immunoglobulin gene sequences. Bioinformatics. 2007;23(13):1580–7.
Article CAS PubMed Google Scholar
Wang X, et al. Ab-origin: an enhanced tool to identify the sourcing gene segments in germline for rearranged antibodies. BMC Bioinformatics. 2008;9 Suppl 12:S20.
Article PubMed Google Scholar
Ohm-Laursen L, et al. No evidence for the use of DIR, D-D fusions, chromosome 15 open reading frames or VH replacement in the peripheral repertoire was found on application of an improved algorithm, JointML, to 6329 human immunoglobulin H rearrangements. Immunology. 2006;119(2):265–77.
Article CAS PubMed PubMed Central Google Scholar
Souto-Carneiro MM, et al. Characterization of the human Ig heavy chain antigen binding complementarity determining region 3 using a newly developed software algorithm, JOINSOLVER. J Immunol. 2004;172(11):6790–802.
Article CAS PubMed Google Scholar
Zhao S, Lu J. A germline knowledge based computational approach for determining antibody complementarity determining regions. Mol Immunol. 2010;47(4):694–700.
Article CAS PubMed Google Scholar
Chen Z, et al. Clustering-based identification of clonally-related immunoglobulin gene sequence sets. Immunome Res. 2010;6 Suppl 1:S4.
Article PubMed Google Scholar
Barak M, et al. IgTree: creating Immunoglobulin variable region gene lineage trees. J Immunol Methods. 2008;338(1–2):67–74.
Article CAS PubMed Google Scholar
Gupta NT, et al. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics. 2015;31(20):3356–8.
Article PubMed Google Scholar
Vander Heiden JA, et al. pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics. 2014;30(13):1930–2.
Article CAS PubMed PubMed Central Google Scholar
Bolotin DA, et al. MiTCR: software for T-cell receptor sequencing data analysis. Nat Methods. 2013;10(9):813–4.
Article CAS PubMed Google Scholar
Cortina-Ceballos B, et al. Reconstructing and mining the B cell repertoire with ImmunediveRsity. MAbs. 2015;7(3):516–24.
Article CAS PubMed PubMed Central Google Scholar
D’Angelo S, et al. The antibody mining toolbox: an open source tool for the rapid analysis of antibody repertoires. MAbs. 2014;6(1):160–72.
Article PubMed Google Scholar
Imkeller K, et al. sciReptor: analysis of single-cell level immunoglobulin repertoires. BMC Bioinformatics. 2016;17:67.
Article PubMed PubMed Central Google Scholar
Kuchenbecker L, et al. IMSEQ--a fast and error aware approach to immunogenetic sequence analysis. Bioinformatics. 2015;31(18):2963–71.
Article CAS PubMed Google Scholar
Nazarov VI, et al. tcR: an R package for T cell receptor repertoire advanced data analysis. BMC Bioinformatics. 2015;16:175.
Article PubMed PubMed Central Google Scholar
Ralph DK, Matsen FAt. Consistency of VDJ rearrangement and substitution parameters enables accurate B cell receptor sequence annotation. PLoS Comput Biol. 2016;12(1):e1004409.
Article PubMed PubMed Central Google Scholar
Safonova Y, et al. IgRepertoireConstructor: a novel algorithm for antibody repertoire construction and immunoproteogenomics analysis. Bioinformatics. 2015;31(12):i53–61.
Article CAS PubMed PubMed Central Google Scholar
Schaller S, et al. ImmunExplorer (IMEX): a software framework for diversity and clonality analyses of immunoglobulins and T cell receptors on the basis of IMGT/HighV-QUEST preprocessed NGS data. BMC Bioinformatics. 2015;16:252.
Article PubMed PubMed Central Google Scholar
Thomas N, et al. Decombinator: a tool for fast, efficient gene assignment in T-cell receptor sequences using a finite state machine. Bioinformatics. 2013;29(5):542–50.
Article CAS PubMed Google Scholar
Zhang W, et al. IMonitor: a robust pipeline for TCR and BCR repertoire analysis. Genetics. 2015;201(2):459–72.
Article CAS PubMed PubMed Central Google Scholar
Brusic V, et al. Computational resources for high-dimensional immune analysis from the Human Immunology Project Consortium. Nat Biotechnol. 2014;32(2):146–8.
Article CAS PubMed PubMed Central Google Scholar
Bray T. Extensible Markup Language (XML) 1.0; W3C Recommendation 10-February-1998. W3C 1998 4/27/2016]; Available from: https://www.w3.org/TR/1998/REC-xml-19980210.
Achard F, Vaysseix G, Barillot E. XML, bioinformatics and data integration. Bioinformatics. 2001;17(2):115–25.
Article CAS PubMed Google Scholar
Bhattacharya S, et al. ImmPort: disseminating data to the public for the future of immunology. Immunol Res. 2014;58(2–3):234–9.
Article CAS PubMed Google Scholar
Rubelt F, et al. Individual heritable differences result in unique cell lymphocyte receptor repertoires of naive and antigen-experienced cells. Nat Commun. 2016;7:11112.
Article CAS PubMed Google Scholar
Gadala-Maria D, et al. Automated analysis of high-throughput B-cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles. Proc Natl Acad Sci U S A. 2015;112(8):E862–70.
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

Mack Dressler provided significant editing support including making Figs. 1 and 5 and generating the Endnote Reference Library.

Declarations

This article has been published as part of BMC Bioinformatics Volume 17 Supplement 13, 2016: Proceedings of the 13th Annual MCBIOS conference. The full contents of the supplement are available online at http://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-13.

Funding

Development of VDJML was supported by a Burroughs Welcome Fund Career Award and an NIAID-funded R01 (AI097403) to LGC. Florian Rubelt and Sanchita Bhattacharya were supported, in part, by the Bioinformatics Support Contract (BISC) HHSN272201200028C and the National Institute of Allergy and Infectious Diseases grant U19 AI090019. SHK was supported, in part, by National Institute of Allergy and Infectious Diseases grant R01 AI104739. NTG was supported, in part, by a PhRMA foundation pre-doctoral informatics fellowship. JAVH received research support from the National Library of Medicine of the National Institutes of Health (T15 LM07056). Publication charges for this article were paid from AI097403 to LGC.

Availability of data and material

Documentation for VDJML is available from https://vdjserver.org/vdjml. Source code for the tools is available from the VDJML Bitbucket repository at https://bitbucket.org/vdjserver/vdjml. An archive can be downloaded anonymously from https://bitbucket.org/vdjserver/vdjml/get/v0.1.4.tar.gz.

Authors’ contributions

The VDJML schema was developed in group calls by all authors. MKL developed initial versions of the libVDJML code. EAS developed the converter from IgBLAST to VDJML. SC and IT are the current developers. All authors read and approved the final manuscript.

Authors’ information

Contributions from Mikhail K. Levin and Edward A. Salinas were made while they were at UT Southwestern.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Author information

Authors and Affiliations

Department of Clinical Sciences, UT Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX, 75390-9066, USA
Inimary T. Toby, Scott Christley & Lindsay G. Cowell
Bank of America Corporate Center, 100 North Tryon Street, Charlotte, NC, 28202, USA
Mikhail K. Levin
Broad Institute, 75 Ames Street, Cambridge, MA, 02142, USA
Edward A. Salinas
Institute for Computational Health Sciences, University of California San Francisco, Mission Hall, 550 16th Street, 4th Floor, Box 0110, San Francisco, CA, 94158, USA
Sanchita Bhattacharya
Department of Biological Sciences and The IRMACS Centre, Simon Fraser University, 8888 University Drive, Burnaby, V5A 1S6, British Columbia, Canada
Felix Breden
Department of Immunobiology, University of Arizona School of Medicine, 1656 E. Mabel Street, P.O. Box 245221, Tucson, AZ, 85724-5221, USA
Adam Buntzman
New Zealand eScience Infrastructure, University of Auckland, Level 10, 49 Symonds Street, Auckland, New Zealand
Brian Corrie
Texas Advanced Computing Center, Research Office Complex 1.101, J.J. Pickle Research Campus, Building 196, 10100 Burnet Road (R8700), Austin, TX, 78758-4497, USA
John Fonner & Walter Scarborough
Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, 300 George Street, Suite 505, New Haven, CT, 06511, USA
Namita T. Gupta, Jason A. Vander Heiden & Steven H. Kleinstein
School of Biomedical Engineering, Science and Health Systems and Department of Microbiology and Immunology, College of Medicine, Drexel University, 3141 Chestnut Street, Philadelphia, PA, 19104, USA
Uri Hershberg
The IRMACS Centre (ASB 10905), Simon Fraser University, 8888 University Drive, Burnaby, BC, V5A 1S6, Canada
Nishanth Marthandan
School of Biomedical Engineering, Science and Health Systems, Drexel University, 3141 Chestnut Street, Philadelphia, PA, 19104, USA
Aaron Rosenfeld
Department of Neurology and Neurotherapeutics, UT Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX, 75390-9036, USA
William Rounds & Nancy Monson
Stanford University School of Medicine, 279 Campus Drive, Stanford, CA, 94305-5101, USA
Florian Rubelt
Department of Molecular Biology and Biochemistry and Faculty of Health Sciences, Simon Fraser University, Blusson Hall, Room 11300, 8888 University Drive, Burnaby, BC, V5A 1S6, Canada
Jamie K. Scott
Department of Pathology, Yale School of Medicine, 300 George Street, Suite 505, New Haven, CT, 06511, USA
Mohamed Uduman & Steven H. Kleinstein
J. Craig Venter Institute, 4120 Capricorn Lane, La Jolla, CA, 92037, USA
Richard H. Scheuermann
Department of Pathology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
Richard H. Scheuermann
Division of Vaccine Discovery, La Jolla Institute for Allergy and Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA
Richard H. Scheuermann
Department of Immunobiology, Yale School of Medicine, New Haven, CT, USA
Steven H. Kleinstein

Authors

Inimary T. Toby
View author publications
You can also search for this author in PubMed Google Scholar
Mikhail K. Levin
View author publications
You can also search for this author in PubMed Google Scholar
Edward A. Salinas
View author publications
You can also search for this author in PubMed Google Scholar
Scott Christley
View author publications
You can also search for this author in PubMed Google Scholar
Sanchita Bhattacharya
View author publications
You can also search for this author in PubMed Google Scholar
Felix Breden
View author publications
You can also search for this author in PubMed Google Scholar
Adam Buntzman
View author publications
You can also search for this author in PubMed Google Scholar
Brian Corrie
View author publications
You can also search for this author in PubMed Google Scholar
John Fonner
View author publications
You can also search for this author in PubMed Google Scholar
Namita T. Gupta
View author publications
You can also search for this author in PubMed Google Scholar
Uri Hershberg
View author publications
You can also search for this author in PubMed Google Scholar
Nishanth Marthandan
View author publications
You can also search for this author in PubMed Google Scholar
Aaron Rosenfeld
View author publications
You can also search for this author in PubMed Google Scholar
William Rounds
View author publications
You can also search for this author in PubMed Google Scholar
Florian Rubelt
View author publications
You can also search for this author in PubMed Google Scholar
Walter Scarborough
View author publications
You can also search for this author in PubMed Google Scholar
Jamie K. Scott
View author publications
You can also search for this author in PubMed Google Scholar
Mohamed Uduman
View author publications
You can also search for this author in PubMed Google Scholar
Jason A. Vander Heiden
View author publications
You can also search for this author in PubMed Google Scholar
Richard H. Scheuermann
View author publications
You can also search for this author in PubMed Google Scholar
Nancy Monson
View author publications
You can also search for this author in PubMed Google Scholar
Steven H. Kleinstein
View author publications
You can also search for this author in PubMed Google Scholar
Lindsay G. Cowell
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Lindsay G. Cowell.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Toby, I.T., Levin, M.K., Salinas, E.A. et al. VDJML: a file format with tools for capturing the results of inferring immune receptor rearrangements. BMC Bioinformatics 17 (Suppl 13), 333 (2016). https://doi.org/10.1186/s12859-016-1214-3

Download citation

Published: 06 October 2016
DOI: https://doi.org/10.1186/s12859-016-1214-3

Proceedings of the 13th Annual MCBIOS conference

VDJML: a file format with tools for capturing the results of inferring immune receptor rearrangements