IMEX includes algorithms and statistical analyses for determining descriptive statistics about sequence functionality and V-(D)-J rearranged region frequency, calculating clonality of cells, estimating diversity of the cell spectrum, and visually representing various gene/allele combinations. IMEX has been designed for analyzing and summarizing NGS-based IG and TR data derived from IMGT®. IMGT/HighV-QUEST is an NGS high-throughput analysis portal for IG and TR, and so far the only one available online [7, 13]. IMGT/HighV-QUEST uses the same algorithms as IMGT/V-QUEST [14] with integrated IMGT/JunctionAnalysis [15] and provides 11 compressed output files that contain information about variable (V), diversity (D), and joining (J) gene arrangements (V-(D)-J), identification and characterization of new alleles, detailed analysis of the junction (IMGT/JunctionAnalysis results), and additional information on mutations. IMEX uses these processed files as input for statistical analyses. Sample comparisons, clonotype tracking, and variety analysis are also included in IMEX. IMEX is written in C# and is freely available at http://bioinformatics.fh-hagenberg.at/immunexplorer/. In the following paragraphs we give detailed descriptions of the analysis methods implemented in IMEX.
Preprocessing methods for the IMGT/HighV-QUEST submission
The IMGT/HighV-QUEST online portal accepts up to 500,000 sequences for uploading and processing; therefore, preprocessing methods have been developed in IMEX. FASTA files can be split into several files (using a user-defined threshold for the size of these files) to prepare the upload to IMGT/HighV-QUEST [16] at IMGT®, the international ImMunoGeneTics information system® (http://www.imgt.org) [17]. After upload and analysis, the compressed output files can be merged into one compressed data file. This file includes all information that is needed for determining overall statistics of the IG and TR clonotypes, frequencies, diversity and V-(D)-J rearranged region frequencies using IMEX.
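The splitting step can be illustrated with a short Python sketch. It is not the IMEX implementation (IMEX is written in C#); the function name `split_fasta` is hypothetical, and we assume here that the user-defined threshold is a maximum number of sequences per output file.

```python
from typing import Iterable, Iterator, List


def split_fasta(lines: Iterable[str], max_records: int) -> Iterator[List[str]]:
    """Yield chunks of FASTA lines, each holding at most max_records records.

    Assumption: the size threshold counts sequences (records), not bytes.
    """
    if max_records < 1:
        raise ValueError("max_records must be at least 1")
    chunk: List[str] = []
    records_in_chunk = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A header line starts a new record; emit the chunk if it is full.
            if records_in_chunk == max_records:
                yield chunk
                chunk, records_in_chunk = [], 0
            records_in_chunk += 1
        chunk.append(line)
    if chunk:
        yield chunk


# Toy input: three records, the second one spanning two sequence lines.
example = [">r1", "ACGT", ">r2", "GGCC", "TTAA", ">r3", "CATG"]
chunks = list(split_fasta(example, max_records=2))
for i, part in enumerate(chunks, start=1):
    print(f"part {i}: {part}")
```

Records are never cut in the middle, so every chunk remains a valid FASTA file that could be submitted on its own.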
Descriptive statistic analyses
IMEX enables a wide range of statistical analyses of IG and TR data. Lists of V, D, and J gene occurrences containing the total counts and relative frequencies of these genes are calculated, as well as the total counts of the productive, unproductive, and unknown sequences (see Fig. 1). Sequences for which no alignment result was found are reported but excluded from all further calculations in IMEX. Additionally, pie charts can be generated to gain more insight into the productive and unproductive B and T cell arrangements of the human adaptive immune system. All statistical results can be downloaded as text files and used for further calculations.
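As a minimal sketch of such a gene occurrence list, the following Python snippet counts gene names and derives relative frequencies. The function name `gene_frequencies` and the toy gene assignments are illustrative assumptions; `None` stands in for a sequence without alignment result, which is excluded from the totals.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def gene_frequencies(genes: Iterable[Optional[str]]) -> Dict[str, Tuple[int, float]]:
    """Map each gene name to (total count, relative frequency).

    Entries that are None (no alignment result) are left out of the totals.
    """
    counts = Counter(g for g in genes if g is not None)
    total = sum(counts.values())
    return {gene: (n, n / total) for gene, n in counts.most_common()}


# Toy V gene assignments; one read has no alignment result.
v_genes = ["TRBV7-9", "TRBV20-1", "TRBV7-9", None, "TRBV5-1", "TRBV7-9"]
freqs = gene_frequencies(v_genes)
for gene, (count, rel) in freqs.items():
    print(f"{gene}\t{count}\t{rel:.2f}")
```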
Clonality analysis
The clonality of the IG and TR repertoire, based on the V-(D)-J rearranged regions, the CDR3 sequences, and/or the nucleotide sequence of the whole amplicon, provides additional information. Clonal expansion is related to the level of somatic proliferation of single B or T cell clonotypes triggered by various immunological reactions. In IMEX, the user defines how clonality is calculated by choosing the amino acid sequence, the nucleotide sequence, or the V-(D)-J rearranged regions. IMEX enables the calculation of the clonality based on the three complementarity determining regions (CDR), namely CDR1, CDR2, and CDR3. CDR3, the most variable CDR, is located in the junction of the rearranged V-(D)-J regions. The number of clonotypes can also be determined using the nucleotide sequence of the whole read of the V-(D)-J rearranged region. Total numbers and relative frequencies of the clonotypes are given in a tabular view; these lists can be exported and used for further analyses.
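The effect of the chosen clonotype definition can be sketched in Python as follows. The record layout (keys `v`, `j`, `cdr3_aa`), the function name `clonotype_table`, and the toy reads are assumptions made for illustration only; they do not reflect the IMGT/HighV-QUEST file format or the IMEX code.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]


def clonotype_table(records: Iterable[Record],
                    key: Callable[[Record], str]) -> List[Tuple[str, int, float]]:
    """Return (clonotype, total number, relative frequency), most frequent first.

    `key` encodes the clonotype definition chosen by the user.
    """
    counts = Counter(key(r) for r in records)
    total = sum(counts.values())
    return [(clone, n, n / total) for clone, n in counts.most_common()]


# Toy reads with hypothetical field names.
reads = [
    {"v": "TRBV7-9", "j": "TRBJ2-1", "cdr3_aa": "CASSLGQGNEQFF"},
    {"v": "TRBV7-9", "j": "TRBJ2-1", "cdr3_aa": "CASSLGQGNEQFF"},
    {"v": "TRBV7-9", "j": "TRBJ2-1", "cdr3_aa": "CASSPGTGYEQYF"},
    {"v": "TRBV20-1", "j": "TRBJ1-1", "cdr3_aa": "CSARDRGTEAFF"},
]

# Clonality defined over the CDR3 amino acid sequence ...
by_cdr3 = clonotype_table(reads, key=lambda r: r["cdr3_aa"])
# ... or over the V-J rearrangement.
by_vj = clonotype_table(reads, key=lambda r: f'{r["v"]}|{r["j"]}')
print(by_cdr3)
print(by_vj)
```

The same four reads form three clonotypes under the CDR3 definition but only two under the V-J definition, which is why the definition has to be fixed before samples are compared.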
Diversity analysis
The diversity of an antigen receptor repertoire is calculated by analyzing the unique clonotypes of IG and TR in all sequences.
In the literature, several different definitions of the term diversity can be found [18]; IgAT, for example, calculates the clonotypic diversity as clonotypes per productive sequences and the sequence diversity as unique sequences per productive sequences [11]. IMEX calculates sequence diversity using a more elaborate data mining approach [19] based on the most variable region, the CDR3 [7]:
To empirically calculate the diversity in IG or TR data, we randomly choose n out of N CDR3 sequences ($rand(n,N)$) in the sample and determine the number of unique clonotypes ($c_{unique}$) in these n sequences. This $c_{unique}(n)$ is calculated for increasing numbers of n, for example for n={0,1000,2000,3000,…}, and so we get the calculated diversity $div_{calc}(n)$ in n sequences:
$$ div_{calc}(n) = c_{unique}(rand(n,N)) \tag{1} $$
This calculation is repeated five times for each n and the number of unique clonotypes $c_{unique}$ is averaged. Examples are shown in Fig. 2.
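The subsampling procedure can be sketched in Python as shown here. This is an illustration under stated assumptions, not the IMEX code: the function name `empirical_diversity` is hypothetical, sequences are drawn without replacement, and a fixed seed is used only to make the example reproducible.

```python
import random
from typing import List, Sequence, Tuple


def empirical_diversity(cdr3s: Sequence[str], step: int, repeats: int = 5,
                        seed: int = 0) -> List[Tuple[int, float]]:
    """Return points (n, div_calc(n)) for n = 0, step, 2*step, ...

    For each n, n sequences are drawn at random (without replacement,
    an assumption) and the number of unique sequences is averaged
    over `repeats` draws.
    """
    rng = random.Random(seed)
    curve = []
    for n in range(0, len(cdr3s) + 1, step):
        uniques = [len(set(rng.sample(cdr3s, n))) for _ in range(repeats)]
        curve.append((n, sum(uniques) / repeats))
    return curve


# Toy sample: three clonotypes of different abundance, N = 100.
sample = ["CASSLGQGNEQFF"] * 50 + ["CASSPGTGYEQYF"] * 30 + ["CSARDRGTEAFF"] * 20
curve = empirical_diversity(sample, step=25)
for n, div in curve:
    print(n, div)
```

For real data the step would be in the range of the 1000 sequences mentioned above; the small step is chosen only to fit the toy sample.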
We assume that the sample contains a certain number of unique clonotypes, so that the more amino acid sequences we draw from the sample, the closer the number of unique sequences comes to the true number of unique clonotypes. At the same time, the more sequences we draw, the more unique sequences we see that are due to read errors. We therefore assume that the number of unique sequences (seen in n randomly drawn sequences) can be modeled as
$$ div_{mod}(n) = a \cdot (1-e^{-b \cdot n}) + k \cdot n \tag{2} $$
where a is the true number of unique clonotypes, b determines how quickly the number of observed clonotypes approaches a, and k is the fraction of unique sequences caused by read errors.
The parameters a, b, and k of the model proposed here are optimized using evolution strategies [20] so that the model fits the empirically calculated diversity $div_{calc}$. The optimized a corresponds to the total number of unique clonotypes in the multiplex PCR, as shown in Fig. 3.
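To make the fitting step concrete, the following Python sketch optimizes a, b, and k with a very simple evolution strategy. The text does not specify the strategy variant or the error measure used in IMEX, so several choices here are assumptions: a (1+λ) strategy with log-normal mutation, a crude step size adaptation, and the sum of squared errors as objective. All function names are hypothetical, and the curve is synthetic.

```python
import math
import random
from typing import Sequence, Tuple

Params = Tuple[float, float, float]  # (a, b, k)
Curve = Sequence[Tuple[int, float]]  # points (n, div_calc(n))


def div_mod(n: float, params: Params) -> float:
    """Model (2): saturating clonotype count plus a linear read-error term."""
    a, b, k = params
    return a * (1.0 - math.exp(-b * n)) + k * n


def sse(params: Params, curve: Curve) -> float:
    """Sum of squared errors between model and curve (assumed objective)."""
    return sum((div_mod(n, params) - y) ** 2 for n, y in curve)


def fit_es(curve: Curve, start: Params, generations: int = 300,
           offspring: int = 20, sigma: float = 0.3, seed: int = 0) -> Params:
    """(1+lambda) evolution strategy; `start` must contain positive values."""
    rng = random.Random(seed)
    parent, parent_err = start, sse(start, curve)
    for _ in range(generations):
        # Multiplicative log-normal mutation keeps a, b and k positive.
        children = [
            tuple(p * math.exp(sigma * rng.gauss(0.0, 1.0)) for p in parent)
            for _ in range(offspring)
        ]
        errors = [sse(child, curve) for child in children]
        best = min(range(offspring), key=errors.__getitem__)
        if errors[best] < parent_err:
            # Plus-selection: the parent is replaced only by a better child.
            parent, parent_err = children[best], errors[best]
            sigma = min(1.0, sigma * 1.1)  # widen the search after a success
        else:
            sigma *= 0.9  # narrow the search after a failure
    return parent


# Synthetic curve generated from known parameters, for illustration only.
true_params = (1000.0, 0.002, 0.01)
curve = [(n, div_mod(n, true_params)) for n in range(0, 5001, 500)]
start = (500.0, 0.001, 0.001)
fitted = fit_es(curve, start)
print("fitted (a, b, k):", fitted)
print("error before/after:", sse(start, curve), sse(fitted, curve))
```

Because of the plus-selection, the error of the returned parameters can never exceed that of the starting point; how close the result comes to the generating parameters depends on the starting values, the step size, and the number of generations.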
V-(D)-J visualizer
IMEX provides an algorithm for visualizing various V-(D)-J rearranged region combinations. All V-J, V-D, J-D and V-(D)-J gene and/or allele combinations are determined in the data sample. The framework offers several graphical representations of the total gene and allele frequencies: frequency histograms, heat maps, and bubble charts can be created and give a detailed view of the state of the investigated receptor repertoire. Gene and allele frequencies can be sorted by gene names so that results for different samples can be compared easily. A frequency threshold can be used to filter specific genes and alleles.
IMEX also offers the download of all B and T cell genes and alleles listed in the IMGT information system® for the species Homo sapiens. For the visualization of the V-(D)-J rearranged region distributions, we first calculate a list of all possible V-(D)-J combinations; all V-(D)-J combinations of a sample are then determined and mapped onto the full spectrum of all known V-(D)-J rearranged regions. This makes it possible to compare various samples accurately at the gene or allele level.
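Mapping the observed combinations onto the full spectrum can be sketched as follows for the V-J case. The function name `vj_matrix` and the short reference lists are illustrative assumptions; a real reference list would contain all genes or alleles downloaded from IMGT.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, Sequence, Tuple

Pair = Tuple[str, str]


def vj_matrix(pairs: Iterable[Pair], all_v: Sequence[str],
              all_j: Sequence[str]) -> Dict[Pair, int]:
    """Count observed V-J pairs on the grid of all possible combinations.

    Combinations that do not occur in the sample get a count of 0, so two
    samples always share the same grid. Observed pairs that are missing
    from the reference lists are ignored in this sketch.
    """
    observed = Counter(pairs)
    return {combo: observed.get(combo, 0) for combo in product(all_v, all_j)}


# Shortened, illustrative reference lists and observed rearrangements.
all_v = ["TRBV5-1", "TRBV7-9", "TRBV20-1"]
all_j = ["TRBJ1-1", "TRBJ2-1"]
pairs = [("TRBV7-9", "TRBJ2-1"), ("TRBV7-9", "TRBJ2-1"), ("TRBV20-1", "TRBJ1-1")]

matrix = vj_matrix(pairs, all_v, all_j)
for (v, j), count in matrix.items():
    print(v, j, count)
```

Since every sample is expressed on the same grid, the resulting counts can be fed directly into a heat map or bubble chart and compared cell by cell.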
PCR primer matching
IMEX includes a feature for analyzing primer efficiency. Primer sets used for multiplex PCR amplification of rearranged V-(D)-J regions can be imported (see Additional file 1: Primer lists for TRB and IGH). The primer matching algorithm searches for the exact primer sequences in the IMGT aligned sequences and returns the relative frequency of each primer in the imported primer sets. This enables the optimization of the efficiency in multiplex PCR.
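A minimal Python sketch of exact primer matching is given here. It rests on assumptions that the text leaves open: the relative frequency of a primer is taken as the fraction of sequences containing it, only the given strand is searched (no reverse complement), and matching is case-insensitive. The function name, primer names, and sequences are hypothetical.

```python
from typing import Dict, Sequence


def primer_frequencies(primers: Dict[str, str],
                       sequences: Sequence[str]) -> Dict[str, float]:
    """Fraction of sequences that contain each primer as an exact substring."""
    total = len(sequences)
    upper_seqs = [seq.upper() for seq in sequences]
    result = {}
    for name, primer in primers.items():
        target = primer.upper()
        hits = sum(1 for seq in upper_seqs if target in seq)
        result[name] = hits / total if total else 0.0
    return result


# Hypothetical primers and reads, far shorter than real ones.
primers = {"P1": "ACGT", "P2": "GGG"}
seqs = ["TTACGTAA", "acgtcc", "GGGACGT", "TTTT"]
matches = primer_frequencies(primers, seqs)
print(matches)
```

Primers with a markedly lower frequency than the rest of the set are candidates for redesign or for a changed concentration in the multiplex reaction.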
Comparison analysis
The comparison of two or more samples with respect to the clonality of the IG and TR repertoire is an essential analysis feature in IMEX:
- Pairwise CDR3 Clone Comparer: IMEX is capable of generating a list of unique CDR3 clonotypes of each data sample and searching the top $c_{unique}$ clonotypes from one sample in the other sample. Each clonotype is assigned a randomly chosen color and matched clonotypes are shown in the same color.
- Multiple CDR3 Clone Comparer: The multiple comparison algorithm generates the top $c_{unique}$ clonotypes in each given data sample and searches for all clonotypes collected in this way in every data sample. IMEX also contains a visualization and a tabular view to compare the overlap of multiple data samples according to CDR3.
- Multiple V-(D)-J Clone Comparer: As clonality can be defined not only over the CDRs but also over the V-(D)-J rearranged regions, IMEX also offers a Multiple V-(D)-J Clone Comparer. The functionality is implemented in analogy to the Multiple CDR3 Clone Comparer.
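The pairwise comparison described in the list above can be sketched in Python as follows. The function name `compare_top_clonotypes` and the single-letter clonotype labels are illustrative assumptions, and the color assignment used for display is omitted.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def compare_top_clonotypes(sample_a: Sequence[str], sample_b: Sequence[str],
                           top: int) -> List[Tuple[str, int, int]]:
    """Look up the `top` most frequent clonotypes of sample A in sample B.

    Returns (clonotype, count in A, count in B); a count of 0 in B means
    that the clonotype was not found there.
    """
    counts_a = Counter(sample_a)
    counts_b = Counter(sample_b)
    return [(clone, n_a, counts_b.get(clone, 0))
            for clone, n_a in counts_a.most_common(top)]


# Toy samples; the letters stand in for CDR3 sequences.
sample_a = ["X", "X", "X", "Y", "Y", "Z"]
sample_b = ["Y", "Y", "W", "X"]
shared = compare_top_clonotypes(sample_a, sample_b, top=2)
print(shared)
```

A comparison of more than two samples could follow the same pattern by collecting the top clonotypes of every sample into one set and looking each of them up in all samples.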
Approval of ethics committee and consent
Informed written consent was obtained from all participating individuals according to the Declaration of Helsinki. Ethical approval for the sample collection used here was obtained from the Ethical Committee of Upper Austria (no. E-9-12, Jan 21st, 2013).