TRIg: a robust alignment pipeline for non-regular T-cell receptor and immunoglobulin sequences

Background: T cells and B cells are essential to adaptive immunity, expressing T cell receptors and immunoglobulins, respectively, to recognize antigens. To recognize a wide variety of antigens, a highly diverse repertoire of receptors is generated via complex recombination of the receptor genes. Frequencies of the recombination events have been shown to predict immune diseases and to provide insights into the development of immunity. The field has been further boosted by high-throughput sequencing, and several computational tools have been released to analyze the recombined sequences. However, all current tools assume regular recombination of the receptor genes, which is not always valid in data prepared using a RACE approach. Compared to the traditional multiplex PCR approach, RACE is free of primer bias and can therefore provide accurate estimation of recombination frequencies. To handle the non-regular recombination events, a new computational program is needed.

Results: We propose TRIg to handle non-regular T cell receptor and immunoglobulin sequences. Unlike all current programs, TRIg aligns reads to the whole receptor gene instead of only to the coding regions. This brings new computational challenges, e.g., ambiguous alignments due to multiple hits to repetitive regions. To reduce ambiguity, TRIg applies a heuristic strategy and incorporates gene annotation to identify authentic alignments. On our own and public RACE datasets, TRIg correctly identified non-regularly recombined sequences, which could not be achieved by current programs. TRIg also works well for regularly recombined sequences.

Conclusions: TRIg takes into account non-regular recombination of T cell receptor and immunoglobulin genes and is therefore suitable for analyzing RACE data. Such analysis will provide accurate estimation of recombination events, which will benefit various immune studies directly. In addition, TRIg is suitable for studying aberrant recombination in immune diseases.
TRIg is freely available at https://github.com/TLlab/trig. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1304-2) contains supplementary material, which is available to authorized users.


Public RACE data of human TRβ gene
To address the concern of whether the presence of non-regular TRβ sequences is unique to our approach, we analyzed a public RACE dataset of the human TRβ gene generated by a 454 sequencer (NCBI SRA accession SRR941034). For this dataset, IgBLAST was the most sensitive, leaving only 0.5% of the reads unannotated (Table S2). In contrast, TRIg did not annotate 8.5% of the reads. For those reads, the V alignments by IgBLAST were either short (<20 bp in 71.0% of cases) or of low identity (<0.8 in 97.1% of alignments ≥30 bp). This again suggests that IgBLAST is over-sensitive for non-regular VDJ sequences. As in our data, the overall consistency of annotations (Table S3) showed a similar pattern, except that IgBLAST was more consistent with TRIg (71.2% of the annotations were identical) and there were relatively more non-identical annotations in the non-VJ categories. The better consistency between IgBLAST and TRIg was reasonable because the percentage of reads without a V segment in this dataset (28.0%) was smaller than that in our data (64.5%) according to TRIg. Many statements for our data still held for this dataset. For example, TRIg gave a better alignment than IMGT did for a majority of the non-identical annotations (Figure S2). The similar pattern of results suggests that the behavior of these tools generalizes across RACE data of the human TRβ gene.
However, there were still distinctions in the results. For example, IgBLAST gave a longer alignment of lower identity for 38.5% of the reads with non-identical annotations in this dataset (Figure S2), much higher than the <1% in our data. For most (97.7%) of those reads, TRIg identified only a segment in the constant C region while IgBLAST reported V and/or J alignments. The V and J alignments by IgBLAST were either short (<20 bp in 79.4% of the cases) or of low identity (<0.8 in 97.1% of alignments ≥30 bp), and were therefore less convincing. Similarly, IgBLAST reported an extra J alignment in 88.3% of the extra annotations, and 96.8% of the J alignments were short (<20 bp). These again suggest the over-sensitivity of IgBLAST. In contrast, most of the constant C segments identified by TRIg were from the same C locus, and they were likely primer sequences used in the RACE approach. Thus, TRIg's annotations for those reads were more convincing.
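The two criteria used above to judge an alignment as unconvincing (shorter than 20 bp, or identity below 0.8 for alignments of at least 30 bp) can be sketched as a small filter. This is only an illustration of the thresholds quoted in the text; the `SegmentAlignment` record and the function name are hypothetical and are not part of TRIg or IgBLAST.

```python
from dataclasses import dataclass

@dataclass
class SegmentAlignment:
    """Hypothetical record for one reported segment alignment."""
    segment: str     # "V", "D", "J" or "C"
    length: int      # aligned length in bp
    identity: float  # fraction of matching bases, between 0 and 1

# Thresholds taken from the text above.
SHORT_BP = 20              # alignments shorter than this count as "short"
MIN_LEN_FOR_IDENTITY = 30  # identity is only judged for alignments >= 30 bp
MIN_IDENTITY = 0.8         # identity below this counts as "low"

def is_unconvincing(aln: SegmentAlignment) -> bool:
    """Return True if the alignment is short or of low identity."""
    if aln.length < SHORT_BP:
        return True
    if aln.length >= MIN_LEN_FOR_IDENTITY and aln.identity < MIN_IDENTITY:
        return True
    # Alignments of 20-29 bp fall under neither criterion in the text,
    # so this sketch does not flag them.
    return False

alignments = [
    SegmentAlignment("V", 15, 1.00),  # short
    SegmentAlignment("V", 45, 0.75),  # long enough, but low identity
    SegmentAlignment("J", 45, 0.95),  # convincing
]
flags = [is_unconvincing(a) for a in alignments]
```

Applying such a filter to the V and J alignments of a tool's output gives the kind of percentages reported above, under the assumption that length and identity are available for each alignment.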
Another distinction was that, between IgBLAST and TRIg, relatively more non-identical annotations appeared in the non-VJ category. For 15704 reads in the non-VJ category, TRIg found only a short C segment, which again was likely primer sequence. This suggests that the remaining segments were from non-TRβ genes. To confirm this, the 15704 reads were aligned to the human genome (h38) using BLAT (v35) and the best alignments were selected. The best alignments of 15528 reads (98.9%) fell outside the TRβ gene locus. Along this line, we aligned all the reads to the human genome and found that 20.2% contained a segment that could be aligned to a non-TRβ locus. In contrast, only 0.2% of reads in our data could be aligned to a non-TRβ locus. This explained why TRIg failed to annotate 8.5% of the reads and why some reads were not fully aligned.
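The check described above (pick the best genome alignment per read, then ask whether it falls outside the TRβ locus) can be sketched as follows. This is a minimal illustration, not the authors' script: the `Hit` record, the use of a single score to rank alignments, and the locus coordinates in the example are all assumptions, and the coordinates are placeholders rather than the real TRβ locus boundaries.

```python
from collections import namedtuple

# Hypothetical record for one genome alignment of a read.
# Coordinates are treated as 0-based, half-open intervals.
Hit = namedtuple("Hit", "read chrom start end score")

def best_hits(hits):
    """Keep the highest-scoring alignment for each read."""
    best = {}
    for h in hits:
        if h.read not in best or h.score > best[h.read].score:
            best[h.read] = h
    return best

def outside_locus(hit, locus):
    """True if the hit does not overlap the locus (chrom, start, end)."""
    chrom, start, end = locus
    return hit.chrom != chrom or hit.end <= start or hit.start >= end

def fraction_outside(hits, locus):
    """Fraction of reads whose best alignment lies outside the locus."""
    best = best_hits(hits)
    if not best:
        return 0.0
    n_outside = sum(outside_locus(h, locus) for h in best.values())
    return n_outside / len(best)

# Placeholder locus; NOT the real TRB coordinates.
example_locus = ("chr7", 1000, 2000)
example_hits = [
    Hit("r1", "chr7", 1100, 1200, 95),  # best hit of r1, inside the locus
    Hit("r1", "chr2", 500, 600, 40),
    Hit("r2", "chr2", 500, 650, 120),   # best hit of r2, outside the locus
    Hit("r2", "chr7", 1500, 1520, 20),
]
frac = fraction_outside(example_hits, example_locus)
```

With real data, the hits would be parsed from the aligner's output and the locus replaced with the annotated TRβ interval for the genome build in use.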
In the public RACE data of the human TRβ gene, 42.7% of reads were non-regular, among which the most abundant class (52.8%) consisted of sequences containing only a V segment. For those reads, it is possible that the sequencing started from the V segments but was not long enough to reach the J segments.
The second most abundant class (26.9%) of non-regular reads consisted of sequences containing only a short C segment. This echoed our discovery that a good portion of this public dataset contained sequences from a non-TRβ locus, with a C segment (a putative RT-PCR artifact) concatenated to the non-TRβ segment. The next two most abundant classes were non-regular reads with a C segment connecting only to a J segment (9.0%) or only to an intergenic segment (6.2%).
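The class breakdown above can be illustrated by grouping reads according to the set of segment types they contain. The sketch below is only an assumption about how such a tally might be done: the class labels, the input representation (a list of segment types per read), and the rule that a read with both a V and a J segment counts as regular are hypothetical simplifications, not TRIg's actual definitions.

```python
from collections import Counter

# Non-regular classes named in the text, keyed by segment composition.
NAMED_CLASSES = {
    frozenset({"V"}): "V only",
    frozenset({"C"}): "C only",
    frozenset({"J", "C"}): "J-C",
    frozenset({"intergenic", "C"}): "intergenic-C",
}

def classify(segments):
    """Assign a read to a class from its list of segment types."""
    kinds = frozenset(segments)
    # Simplifying assumption: V and J together mark a regular read.
    if {"V", "J"} <= kinds:
        return "regular"
    return NAMED_CLASSES.get(kinds, "other non-regular")

def nonregular_fractions(reads):
    """Fraction of each class among the non-regular reads."""
    counts = Counter(classify(s) for s in reads)
    counts.pop("regular", None)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

example_reads = [
    ["V", "D", "J", "C"],  # regular
    ["V"],
    ["V"],
    ["C"],
    ["J", "C"],
]
fractions = nonregular_fractions(example_reads)
```

On the toy input, half of the four non-regular reads fall in the "V only" class, which mirrors how the percentages in the text are expressed relative to the non-regular reads rather than to all reads.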
Interestingly, most of the intergenic segments were from two TRβ loci, one upstream of TRBD1 and the other upstream of TRBC1, suggesting non-regular splicing events.

Figure S1. Flow of 5' RACE experiment. Please see Table S1 for the primer and MID sequences.

Figure 2b of the main text when TRIg is compared to IgBLAST and IMGT.