TRIP - T cell receptor/immunoglobulin profiler

Background Antigen receptors are characterized by an extreme diversity of specificities, which poses major computational and analytical challenges, particularly in the era of high-throughput immunoprofiling by next generation sequencing (NGS). The T cell Receptor/Immunoglobulin Profiler (TRIP) tool offers the opportunity for an in-depth analysis based on the processing of the output files of the IMGT/HighV-Quest tool, a standard in NGS immunoprofiling, through a number of interoperable modules. These provide detailed information about antigen receptor gene rearrangements, including variable (V), diversity (D) and joining (J) gene usage, CDR3 amino acid and nucleotide composition and clonality of both T cell receptors (TR) and B cell receptor immunoglobulins (BcR IG), and characteristics of the somatic hypermutation within the BcR IG genes. TRIP is a web application implemented in R shiny. Results Two sets of experiments have been performed in order to evaluate the efficiency and performance of the TRIP tool. The first used a number of synthetic datasets, ranging from 250k to 1M sequences, and established the linear response time of the tool (about 6 h for 1M sequences processed through the entire BcR IG data pipeline). The reproducibility of the tool was tested comparing the results produced by the main TRIP workflow with the results from a previous pipeline used on the Galaxy platform. As expected, no significant differences were noted between the two tools; although the preselection process seems to be stricter within the TRIP pipeline, about 0.1% more rearrangements were filtered out, with no impact on the final results. Conclusions TRIP is a software framework that provides analytical services on antigen receptor gene sequence data. It is accurate and contains functions for data wrangling, cleaning, analysis and visualization, enabling the user to build a pipeline tailored to their needs. TRIP is publicly available at https://bio.tools/TRIP_-_T-cell_Receptor_Immunoglobulin_Profiler.


Background
Antigen receptors, namely the B cell receptor immunoglobulin (BcR IG) and the T cell receptor (TR) expressed by the B and T cells, respectively, are characterized by extreme diversity of specificities. This property enables the human immune system to recognize a broad spectrum of exo-and auto-antigens, thus orchestrating a wide range of immune responses fundamental to health (e.g. protection against microbial pathogens or cancerous cells) and disease (e.g. autoimmunity, allergy, lymphoid cancer). Within the last decade, the use of next generation sequencing (NGS) enabled a much deeper and thorough study of both BcR IG and TR gene repertoires through the generation of an unprecedented amount of sequence data offering a profound impact on our understanding of various clinical and research settings.
On one hand, immune profiling using NGS methodologies is gaining popularity in the context of cancer and autoimmunity through monitoring minimal residual disease (MRD) and characterizing the complexity of the immune repertoires, respectively. On the other hand, IG/TR NGS can also provide valuable information regarding normal processes and mechanisms, such as B and T cell development, inflammation and the aging of the immune system. In contrast to the vast majority of (human) genes, the analysis of BcR IG/TR rearrangement sequences cannot be based on simple comparison with a reference genome. This is due to the fact that: (1) BcR IG/TR receptor variable domains are created through the combination of 2 (V and J) or 3 (V, D, and J) types of genes, and, (2) random nucleotides are deleted and/or inserted at the junctions between these genes, thus resulting in a high level of sequence diversity.
NGS methodology has been applied in the field of immunogenetics in three main areas: (i) clonality assessment, (ii) detection of minimal residual disease (MRD) and (iii) repertoire analysis of BcR IG and TR gene sequences [1][2][3].
The implementation of NGS technologies, hence the capacity to generate massive amounts of data, underlies the requirement for powerful computational and analytical tools while also raising challenges in data sharing, archiving, and aggregation. Therefore, a series of different pipelines have been developed to facilitate both data annotation and meta-analysis [4][5][6][7].
In this context, the international ImMunoGeneTics (IMGT) [8] information system is the most widely used repository and curation site for BcR IG and TR gene sequence information. IMGT is also able to computate clonotypes defined as unique combinations of variable (V), diversity (D), joining (J) genes and complementarity-determining region 3 (CDR3) amino acid (AA) in-frame junction [9]. Each "IMGT clonotype (AA)" is characterized by a selected unique representative sequence. The "IMGT/StatClonotype" tool can evaluate and visualize statistical significance of pairwise comparisons of IMGT clonotype (AA) diversity or expression per gene of a given BcR IG or TR dataset, which should not exceed 1,000,000 sequences. This tool is incorporated in a downloadable R package with user-friendly interface [10,11].
Another software for comprehensive adaptive immune profiling is MiXCR [12]. MiXCR handles both paired-and single-end reads, depending on the utilized sequencing chemistry, assesses their quality and applies a heuristic multilayer clustering for error correction. It also aligns the reads with a built-in library of reference germline (D)J and constant (C) gene sequences for humans and mice based on the corresponding loci from GenBank. MiXCR computates clonotypes by assembling identical and homologous reads and is also able to rescue low-quality reads by mapping them to previously assembled high-quality clonotypes. The software is able to analyze full-and partial-length data.
The Vidjil platform [13] is an open-source application for the analysis of highthroughput sequencing reads of BcR IG and TR gene rearrangements. The algorithm performs the processing and annotation of one or several samples and displays the results in an interactive user-friendly interface. The data can be stored and analyzed by several complementary software, including an annotation step through IMGT. As input, the user needs to provide a raw sequence file. The algorithm identifies clonotypes by default based on a so-called "window" of 50 bp nucleotides within the V(D)J junction and proceeds with the alignment of the sequences. The window size can be changed by the user. The algorithm then assesses the clonality of the samples, displays the most frequent clonotypes of each sample in lists and visualizes the results. Vidjil has also been designed to quantify MRD detection during patient follow-up.
Another software toolbox for high throughput immune receptor profiling is available on the Galaxy platform [14], namely the Immune Receptor Profiler (IRProfiler). The selection of Galaxy as the hosting platform of IRProfiler ensures the usability and modularity of IRProfiler and provides powerful means for its distribution. IRProfiler uses the "Summary" file from IMGT/HighV-Quest tool as input and can use 5 alternative clonotype definitions, performs data filtering, clonotype computation and gene usage for the V and J genes of both BcR IG and TR. Moreover, IRProfiler is able to detect shared and exclusive clonotypes among different repertoires and datasets and compare V and J gene repertoires among different samples. The pipeline of IRProfiler has been employed by a number of publications for TR repertoire analysis in different pathological conditions [14].
In this paper, we present the T-cell Receptor / Immunoglobulin Profiler (TRIP) tool, a software framework that provides analytical services on BcR IG and TR gene sequence data. TRIP offers the opportunity for an in-depth analysis based on the processing of the output files of the IMGT/HighV-Quest tool, and was developed to address a variety of scientific issues in NGS IG/TR data analysis ranging from data curation and filtering to the characterization of complex features and processes. Datasets from many different patients can be processed together and the results can be displayed either together as a merged output or separately. In detail, TRIP performs clonotype computation, the definition of which is specified according to user needs. Highly similar clonotypes, i.e. BcR IG or TR clonotypes using the same V gene and differing in a few amino acids within the CDR3 of a particular length, can also be computed, while shared clonotypes between different datasets (either samples or individuals) can be identified. The V, D and J gene repertoires can be extracted and the relations between different characteristics of the gene rearrangements can be evaluated, e.g. V gene usage and CDR3 length. Molecular features of the receptors can be assessed in a comprehensive way, such as CDR3 predicted isoelectric point value and length distribution and (only for BcR IG genes) somatic hypermutation (SHM) status and characteristics. Specifically regarding the latter, TRIP has the option to align rearranged IGHV genes against the germline gene and allele (from IMGT) and assess the frequency and characteristics of the observed mutations based on their physicochemical properties [15]. The results can be displayed in a web interface and all outputs can be downloaded in a text (.txt) format.
A key advantage of TRIP is that it offers the opportunity for comprehensive data analysis for both antigen receptors through a unified, reproducible and user-friendly interface. In addition, it supports a wider range of user experience; from a fully interactive graphical interface, to a fully customizable and versatile commandline tool that can be incorporated in further workflows.
TRIP has already been used in studies of hematologic malignancies assessing the immune repertoire in different contexts like chronic lymphocytic leukemia (CLL) and multiple myeloma (MM) [16][17][18], supporting theories for antigen involvement into disease pathophysiology. Moreover, in recent studies of the team TRIP was implemented for the characterization of the intraclonal temporal dynamics leading to clonal drift in CLL, as well as the subclonal 'architecture' essentially arising from intraclonal diversification of the BcR IG genes in the context of ongoing SHM, alluding to interactions with disease-specific antigens [19,20].
Additionally, the implementation of TRIP for NGS TR data analysis in the context of HIV supported an antigen-driven, HIV-specific immune process in the development of non-neoplastic reactive lymphadenopathy, through the detection of clonotypes with established reactivity against certain HIV protein epitopes [20]. Similar studies focusing on cell lines enriched for T cells reactive against viral infections (e.g. by EBV and/or CMV and/or BKV), a major cause of morbidity and mortality after allogeneic hematopoietic stem cell transplantation, provided insights into the TR repertoire of ex vivo-or endogenously-generated virus-specific T cells [21].

Implementation
In order to ensure consistency in the description of each functionality offered by the tool, the following list of definitions will be used: AA Junction: The AA junction refers to the unique V-(D)-J gene rearrangement leading to the AA sequence between the conserved anchors (e.g. C104 and W118 for the heavy chain of the BcR IG).
Clonotype: From a biological perspective, is a unique nucleotide sequence that arises during the gene rearrangement process for BcR IG or TR. There are several definitions that can be used according to the needs of each study, i.e. a unique combination of V-(D)-J gene and allele and/or the CDR3 at amino acid or nucleotide level.
Highly similar clonotypes: It refers to clonotypes with the same CDR3 length that differ in the AA composition in few particular positions.
Shared clonotypes: The identification of common clonotypes between different datasets and their relative frequencies in every dataset.
Convergent evolution: This term is used to describe the phenomenon of different nucleotide sequences degenerately encoding the same CDR3 amino acid sequence [22][23][24][25].
Moreover, it should be noted that the tool functions have been explicitly designed to work at both the nucleotide and the amino acid level, depending solely on the requested task.
A typical NGS data analysis workflow is depicted on Fig. 1. BcR IG and TR NGS pairedend protocols produce anti-parallel reads, Read 1 (R1) and Read 2 (R2) providing with both forward and reverse reads for each amplified BcR IG/TR gene sequence. The first step of the analysis is to filter R1 and R2 raw sequences according to a set of quality rules. Subsequently, the paired-end reads are combined based on an overlapping region and the stitching process leads to the generation of the full-length sequences, which undergo Fig. 1 Overall workflow of the NGS data analysis; the TRIP tool is used for the downstream analysis of the NGS data, after the stitching process and the sequence annotation offered by IMGT/High-VQuest quality filtering again. The output of the algorithm consists of 10 files including the FASTA files with stitched, full-length sequences, statistics and reports about the process. The FASTA file of the stitching algorithm serves as input for the IMGT/HighV-QUEST Tool, which provides the most up-to-date reference dataset for human BcR IG and TR genes and offers characterization of clonotypes for clonal diversity. As output, the tool produces an archive file, which comprises 12 different files. This output is used as input to the TRIP tool for meta-data analysis.
TRIP can be used in four ways: a) as a standalone graphical tool running locally on your computer, b) as a web application hosted on a server, e.g. the Amazon cloud server, c) as a docker container, and d) as a script-based tool. The graphical environment of TRIP is implemented in R Shiny, an R package that can be used to build interactive web applications straight from R, allowing users to directly interact with the data, the analysis and the results. So, in the first case, R, R Studio, the R Shiny package and all the other R packages that TRIP uses are installed. In the second case, a Shiny App can be defined as a web page (User Interface, UI) connected to a computer running a live R session (Server). The users are able to access the web page through any device (no dependencies needed) and select personalized parameters via the UI. The selected parameters are passed to Server, where calculations are performed, and the UI's display is updated according to them. In the third case, the docker container of TRIP can run in cloud infrastructures as well as in any machine running the Docker daemon, without further installations. The docker image of TRIP is publicly available through DockerHub, whereas more information about Docker can be found in [26]. Finally, in the last case, TRIP can run as an R script-based tool, where all parameters can be selected through the command line. More information about this tool are available in the project's github repository.
To create an R Shiny application, two predefined R scripts need to be located into the same directory. The first one implements the User Interface (ui.R) by controlling the layout of the page using html commands and other nested R functions, and handling the input parameters inserted by the users. The second script implements the Server (server.R) and contains essential commands and instructions on how to build the application and process the data. Apart from these scripts, TRIP also includes a helper script (helpers.R), which contains all the created functions needed for the whole analysis.
The main R package that an R Shiny application needs is the shiny package. To customize the UI, additional CSS and Javascript libraries can be used within the application, such as shinyjs and shinyBS. In order to handle big data, data structures that use indexing and keys are used to save data and make the look-up process quicker, as well as libraries that contain functions with an option of parallelize (i.e dplyr, data.table, tidyr). The results of the analysis are presented using plotting packages, such as plotly and plot3D.
The user interface of TRIP is organized in 12 major tabs (Fig. 2), including the Home tab, the Preselection and Selection tabs, the Pipeline tab, one tab for each one of the steps of the pipeline with the corresponding results, and the Visualization tab. The default pipeline of TRIP depends on the antigen receptor type (BcR IG or TR) and includes the following procedures: clonotype computation, highly similar clonotype computation, shared clonotype computation, repertoire extraction, repertoire comparison, multiple variable comparison, sequence alignment, SHMs and amino acid position-based frequency estimation. Some examples of the produced visualizations are available in Fig. 3. The user is able to form a personalized pipeline by selecting the processes that fit their needs and taking into account the dependencies between the processes. The whole workflow and the pipeline processes that are available, including the aforementioned dependencies between them, are presented on Figs. 4 and 5, accordingly. At the end of the process, the whole session, i.e. the user-selected parameters and the function outputs, can be bookmarked, in order for the user to be able to restore it any time.

Load data and initialization
The first step of the process is to upload the input data, which comprises the output files of the IMGT/HighV-Quest tool [27]. By design, TRIP currently supports as input the output files produced by the IMGT/HighV-Quest tool. This is due to the fact that IMGT outputs are used as a global standard format for the field of immunogenetics. However, IMGT provides an option for downloading the output files in AIRR format (Adaptive Immune Receptor Repertoire Standards -https://docs.airr-community.org/) which provide a more Fig. 2 The Home page of TRIP tool: Users can import data by selecting the directory in which the data is stored. Previous sessions can also be loaded ("Restore Previous Sessions" button). There are 2 options regarding the receptor analyzed (BcR IG and TR) as well as 2 options based on the type of available data (high-or low-throughput). In the next step, filters are applied to the chosen data for curation (Preselection) and filtering (Selection) concrete schema for describing, analyzing, storing and sharing immunogenetic data. Following this, and taking into account that similar tools in the field support both IMGT and AIRR formats (e.g.Vidjil -http://www.vidjil.org/), future extension of TRIP includes supporting outputs coming from both IMGT or AIRR tool.
Data from a given sample are organized into a folder containing 10 individual files in text (.txt) format. Users are able to choose to upload only some of these files depending on the type of the downstream analysis. IMGT/HighV-Quest has a submission threshold of 500,000 sequences. If a single sample has more sequences, the data is split into batches of 500,000 sequences through the stitching algorithm. Hence, multiple folders for these given samples will be generated. These folders should be named with the same identifier but a different extension in the form of "-0", "-1" etc. in order to be handled as one sample by TRIP.
Immediately after the upload, input data is automatically checked for the presence of data columns with a different/unknown title. In case such columns are identified, users The workflow of the TRIP tool. Overall schema of the TRIP tool usage. Initially, the user selects the desired patients' datasets (i.e. IMGT/HighV-Quest files), performs data cleaning (preselection) and filtering (selection), and executes the preferred pipeline are asked to replace the names of these columns with the appropriate ones, ensuring that TRIP operation will not be affected by IMGT changes in updated versions. Data columns that will not be used in the downstream analysis as well as duplicated columns are removed at the very beginning of the process to reduce the overall complexity. After this, one table for each sample and a merged table (All Data table) that combines all samples' data are created.
There are 2 global parameters at the Home tab regarding the antigen receptor type (BcR IG and TR) as well as the type of the data to be analyzed, i.e. high-throughput (NGS) or low-throughput (Sanger sequencing) data. Concerning the latter, the main difference Overview of TRIP tool's pipeline: The two main functionalities (Clonotype and Highly Similar clonotype computation -represented with a blue) as well as the dependencies between the rest of the pipeline's functions are illustrated above. Clonotype computation is the only step that is required for the pipeline to run, while every other procedure is optional. Green colored functions can be used for both T or B cells, while red colored function can be run only for B cells comes to the preprocessing steps, i.e. the Preselection and Selection steps. In the case of high-throughput data, all filters are applied consequentially (i.e. if a sequence fails in more than one selection criteria, only the first unsatisfied criterion will be reported), whereas for low-throughput data all criteria are applied at the same time and only those rows that fail in all criteria are filtered out from the analysis.

Data preparation
After uploading data and initializing the global parameters, data Preselection (curation) and Selection (filtering) are applied, according to the user's preferences.

Preselection
The Preselection process comprises the following criteria: 1 Select type of sequences that will be taken into account: The user has the choice of including only productive sequences (without pseudogenes and/or stop codons and/or frameshifts), only unproductive sequences, or all sequences in the next steps of the analysis. 2 Only take into account CDR3 with valid start/end landmarks: Start/end CDR3 landmarks (anchors) can be customized by the user based on the type of data i.e. BcR IG (heavy/light chain) or TR. More than one valid landmark can be requested. Sequences with landmarks other than the chosen ones are excluded from the analysis. 3 Only take into account CDR3 with no Special Characters (X,*) : Only sequences without ambiguities (i.e. characters other than those of the 20 amino acids) are included in the analysis.
The results of the Preselection process are presented in the Preselection tab. The output consists of 4 different tables: (i) a summary table with the numbers of both the included and excluded sequences for each different criterion, (ii) the raw data, (iii) the data that meet the Preselection criteria and are provided as input to the selection process and (iv) the excluded raw data, including information about the unsatisfied criterion.

Selection
The sequences that passed through the Preselection process are used as input for the data Selection (filtering) process, which comprises 6 different filters: 1 V-REGION identity % : Sequences with identity percent to germline that do not fall in the range set by the user are excluded from the analysis. 2 Select Specific V, J, D Gene: The user can select for rearrangement sequences of one or more particular V, D, J genes or gene alleles, respectively. 3 Select CDR3 length range: Only sequences with CDR3 length within the range set are included in the analysis. 4 Only select CDR3 containing specific amino-acid sequence: Sequences with the specific CDR3 amino acid motif provided by the user are included in the analysis.
The results of the Selection (filtering) process are presented in the Selection tab, using the same four tables and layout as those provided at the Preselection tab.

Pipeline
After the preselection and selection processes, the main pipeline that the user can select consists of the clonotype computation, highly similar clonotype computation, shared clonotype computation, repertoire extraction, repertoire comparison, multiple variable comparison, sequence alignment (only for BcR IG), SHM (only for BcR IG) and amino acid position-based frequency estimation.

Clonotype computation
The first and necessary step of the pipeline for both BcR IG and TR is the clonotype computation. The clonotype definition is specified by the user. There are 10 different options for the definition: clonotypes can be defined using a gene (V gene/gene allele, J gene/gene allele) in combination with the CDR3 at nucleotide or amino acid level (AA CDR3 or Nt CDR3) or using only the AA CDR3 or Nt CDR3.
To form clonotypes, the raw data are grouped by the selected pair gene-CDR3 or the selected CDR3 attribute. The relative frequency with which each one of the clonotypes appears into the raw data is computed based on the total raw reads of each sample and the results are then sorted in descending order regarding the frequencies, so that the most frequent-important clonotypes appear at the top of the table. After computing this grouped table (Fig. 6a) and assuming that M clonotypes were uncovered, M sub-tables (C i ) are created containing the raw data that correspond to each specific clonotype (Fig. 6c). In the case that AA CDR3 has been selected in the clonotype definition, the number of different CDR3 sequences at nucleotide level that form each clonotype, named as convergent evolution, is computed utilizing the information of the subtables.

Highly similar (HS) clonotypes computation
After clonotype computation, the user is able to merge clonotypes that are highly similar (Fig. 6b). On TRIP the threshold used to determine what is considered similar is set by the user and concerns the amino acids themselves and not their physicochemical properties. The user can set the number of mismatches allowed for each CDR3 AA length found in the dataset and a clonotype frequency threshold (CFTh), that defines the minimum frequency value that a clonotype can have in order to be considered as major. The process applied for each identical CDR3 length L contained in the dataset is described in the following steps.
1 Create subset-L which contains clonotypes that are characterized by CDR3 sequences of AA length L. 2 Find the clonotype of subset-L with the highest frequency. If the frequency is above CFTh, this clonotype is considered as the major clonotype of subset-L. In this case, go to step 3, otherwise, no clonotypes of length-L can be merged, so they remain as they are and the process terminates for length L. 3 Find those clonotypes of subset-L that have less mismatches with the major one than those allowed by the user for the specific length and assign them to the major group. 4 Create a new subset-L with the remaining clonotypes i.e. those that have not been assigned to another group and repeat steps 2 and 3.

Fig. 6
Examples of output tables provided by TRIP. a Clonotypes of all datasets. Each clonotype is presented according to the definition chosen by the user followed by its absolute count of reads, its relative frequency and the convergent evolution, i.e. the different nucleotide sequences encoding the amino acid sequence of each clonotype. b Highly Similar Clonotypes. Clonotypes of the same CDR3 length and differences in the amino acid composition in few positions are merged. Each clonotype is presented followed by the newly-calculated absolute count of reads, its relative frequency and a list of the cluster ids merged. c Elements of the clonotype. Each clonotype presented in the first column of Fig. 1a is also a link that provides a table with all relevant immunogenetic data for that particular clonotype. d Shared Clonotypes. When multiple datasets are analyzed simultaneously, some clonotypes may co-occur in more than one datasets. The clonotypes that were present in >= 2 samples are presented in this table followed by the number of reads assigned to each clonotype in every dataset. The last column of the table is about the number of datasets/samples that shared each clonotype. e Grouped Alignment. This output table is provided after the alignment and the grouping of the identical sequences at amino acid level based on the reference sequences and each row represents a unique amino acid sequence. It includes the number of reads that are identical, the IGHV gene and allele used as reference, the cluster id (clonotype) the sequence corresponds to (the reference sequence is characterized with "-") and the positions of the BcR IG molecule. The amino acids that correspond to the germline are replaced with "-", while the differences remain. The whole process can be performed with or without taking into account the rearranged V-gene. Except for the highly similar clonotypes table which has a similar structure as the clonotype table, one table for each CDR3 length containing information regarding the highly similar clonotype grouping is also provided.

Repertoire extraction
The repertoire extraction process follows the clonotype computation process. Six different repertoires can be extracted, i.e. V, J or D gene and/or gene allele. To compute repertoires based on the same gene/gene allele that is used for the clonotype definition, the process is quite simple. The only thing that needs to be done is to group the clonotype table by the gene attribute, compute the relative frequency with which each gene appears in the clonotype table based on the total number of clonotypes identified and sort the results in descending frequency order. In any other case, assumming that the user has selected gene G, the repertoires are extracted using the following steps: 1 For each sub-table C i find the most frequent gene/gene allele of G, G i , as follows: i GroupC i by column G. ii Find the frequency of occurrence for each value of G. iii Sort by frequency in descending order. iv Select the first element of column G of the grouped table.
2 GatherallG i values and create a data frame GDF with one column. 3 Group table GDF by G, compute the corresponding frequencies and sort the table to get the repertoire table.
This step of the pipeline can be implemented either on the initial clonotypes or the highly similar clonotypes, in the case that the latter has been computed.

Multiple value comparison
Another grouping process which is based on the clonotype table is the Multiple Value comparison process. Here, the user is able to select two attributes (A 1 , A 2 ) and find the frequency that each unique combination of the attributes occurs in the clonotype table. Many different combinations can be selected by the user, including Gene, CDR3-IMGT AA length, the isoelectrical point (pI) etc. To compute the Multiple Value comparison table, the process that is followed is similar to the one followed to extract repertoires using a different gene/gene allele than the one used for clonotype definition, but this time the number of attributes is two: 1 For each sub-table C i find the most frequent value of A 1 , A 1i : i) group C i by A 1 , ii) find the frequency of occurrence for each value of A 1 , iii) sort by frequency in descending order, and iv) select the first element of column A 1 of the grouped

Shared clonotypes computation
After clonotype computation, a table of the shared clonotypes among the different samples that are loaded can be computed. This table is computed by extracting those clonotypes that tend to co-occur in more than one datasets/samples of the analysis (Fig. 6d). This step of the pipeline can be implemented on both the initial clonotypes and the highly similar clonotypes, in the case that the latter have been computed.

Repertoire comparison
After repertoire extraction, a comparative repertoire analysis can be performed, when more than one samples are analyzed, using the same approach as the one used for shared clonotypes computation. This step of the pipeline can be selected for both the initial repertoires and the highly similar repertoires, in the case that the latter have been computed.

Detect CDR3 sequences with 1 length difference
By selecting this process, groups of similar CDR3 sequences are created. In this case, two CDR3 sequences are called similar when their lengths differ by 1 AA and if one copies the AA of the specific position P to the position P+1 of the sorter sequence, the same sequence occurs. Assuming that the CDR3 length of the longest sequence is L and the position where the difference can appear is P, the process that is applied to each unique gene of the dataset is summarized in the following steps 1 Filter in row data which contain CDR3 sequences with length=L or L -1.
2 For the CDR3 with length=L -1: i Make a right sift of the amino acids that belong to the positions (P +1):(L -1). ii Copy the amino acid of the position P to the position P +1.
3 Group by CDR3, count and summarize. 4 Return the groups with more than one elements

Alignment
An alignment table can be created for the selected V region at both Nt and AA level. This can be done by applying the following steps to each unique gene of the dataset: 1 Find the germline of the gene from the corresponding file. 2 Compare the selected region with the germline and replace with "-" the positions of the sequences that correspond to the germline.
A grouped alignment table is created as well by further grouping together the exact same sequences of each particular cluster id (Fig. 6e).

Somatic hypermutations
After the alignment process, SHMs can be extracted based on the grouped alignment table. The output table includes: (i) the mutation type, (ii) the position of the change, (ii) the region where the change occurs (based on IMGT unique numbering), (iii) the number of sequences carrying each change and (iv) the frequency of the change for every gene or allele based on the grouped alignment table regardless the clonotype (Fig. 6f). There is the possibility to analyze only a number of clonotypes (top N clonotypes) or even some clonotypes separately.

Logo creation
At this step, a logo can be created for the V region or the CDR3. To create a logo, the corresponding length of the sequences must be provided by the user. Moreover, a frequency table needs be computed first, by counting the appearance of the 20 different amino acids at each position of the sequence. There is also the option to include only those sequences at the plot that correspond to the top N clonotypes.

Visualization
In the Visualization tab different types of charts (scatter plots, bar plots, 3D plots, surfaces, histograms etc.) are available for the visualization of the analysis results. Clonotypes are presented as bars and the user can select the frequency above which the clonotypes will be presented. Convergent evolution can also be visualized with more than one chart type options. The computed repertoires are presented as pie-charts and the user can again select the minimum frequency of the gene/allele that will be presented. All computed tables can be downloaded in text format, whereas the plots and the graphics can be downloaded in png format.

Results and discussion
Two sets of experiments have been performed in order to evaluate the efficiency and the performance of the TRIP tool. Therefore, this subsection is divided into two parts: (a) the performance experiments on different sizes of input data regarding memory usage and computational time discussed in Performance testing subsection, and (b) the application of the TRIP pipeline in a group of patients presented in Experimental results on a group of patients subsection.

Performance testing
TRIP was developed as an R Shiny application. Even though R is considered as a slower language, as it makes certain sacrifices for convenience at the expense of optimal speed, a lot of methods can be applied in the coding style to produce efficient R code. What is more, R gives the opportunity to identify bottlenecks and code it directly in C++ via the Rcpp package. A lot of packages, such as tidyverse, take advantage of this, which is why in some cases they may be faster than some base implementations. Based on this, optimization of the system's time was achieved by vectorizing all data structures in the source code and replacing, when possible, for loops with apply functions. In addition, data processing was performed with the use of the tidyverse package's functionalities. In this way TRIP allows the user to repetitively perform an analysis on multiple chunks of data coming from multiple patients. However, it should be stated that part of the future work on the tool is to support parallel processing of the input data.
The first set of experiments focused on evaluating the performance of the TRIP tool, in terms of memory usage and computational time. During this set of experiments, we used four different artificial datasets that were created after merging a number of BcR IG gene sequence datasets with similar characteristics, in order to create realistic synthetic datasets. Each one of the four synthetic datasets had 250,000 rows. The datasets are available to the readers (see Availability of data and materials section). The whole pipeline for antigen receptor type BcR IG was applied to all four datasets at the same time, using the initial datasets and subsets of them of different sizes, summing up to 250,000, 500,000, 750,000 and 1,000,000 rows. The experiments were executed on a server that had the following specifications: Ubuntu 18.04.3 LTS (kernel 4.15.0.58-generic), 2 x Intel Xeon X5650 @ 2.67 GHz and 118 GB RAM. Figure 7 shows the memory usage and the time that has passed from the beginning of the process until the time when each module of the pipeline was completed. The total typical pipeline for BcR IG datasets of 1 million rows of data took approximately 6 h to run. The figure makes clear that the computational time and memory usage are increased linearly according to the dataset size.
Analyzing the time passed from one module to the next one, one can observe that the V gene repertoires and the V gene Highly similar (HS) repertoires processed are quicker than the corresponding J and D repertoires. The same thing for the aforementioned genes happens when they are involved in the Multiple value comparison process. The reason is Fig. 7 The computational time and the memory used (y-axis) of the graphical TRIP tool's modules during a single pipeline run, for four different subsets of data. Each dataset's size represents the total number of records for a particular sample (equivalent to the number of rows of the input tables), each dot on the time graph shows the elapsed time needed from the beginning of the pipeline until the corresponding procedure is terminated, and each dot on the memory graph represents the memory usage by the running routine. The functions which are used for the pipeline procedure are presented on the x-axis of both graphs that these processes are calculated based on the extracted clonotypes. In this pipeline, we computed the clonotypes using the V gene and the CDR3, so further computations are needed to find the major J and D genes of each group of clonotypes. The module of the pipeline with the highest memory and time demand is the Alignment process.

Experimental results on a group of patients
The second set of experiments was focused on evaluating the efficiency of the TRIP tool. We assessed the reproducibility of the results produced by the main TRIP workflow with the results retrieved through the Galaxy IRProfiler [14]. As a control we used a previously published dataset [2] which consists of the TR gene rearrangements from three human samples. We checked the preselection and selection process, as well as the clonotype computation and the repertoire extraction modules. No significant differences were noted, whereas the clonotype and V gene repertoire were faithfully and reproducibly represented by TRIP. In more detail, although the preselection process seems to be stricter following the TRIP pipeline, almost 0.1% more rearrangements were filtered out, without having a significant impact on the final results. The clonotypes ranking and their relative frequencies remain comparable. (Tables 1, 2 and 3).

Comparison to the galaxy IRProfiler tool
We used IRProfiler -a software toolbox for high throughput immune receptor profiling [14], which is available through the Galaxy platform, in order to compare its functionalities with the TRIP tool. The main differences are listed in Table 4.

Table 1
Comparison of the results of the preselection and selection processes using the Galaxy and the TRIP tools  Conclusions NGS holds the potential to offer new knowledge of both biological and clinical relevance for improved understanding of: (i) many normal processes and mechanisms, such as B and T cell development, inflammation and the aging of the immune system, and (ii) pathological conditions, such as cancer and autoimmunity. However, the interpretation of the results and the extraction of meaningful conclusion requires extensive expertise in bioinformatics, which is often limited in clinical as Table 3 Comparison of the results of the repertoire extraction process using the Galaxy and the TRIP  tools   Vgene  T3396  GALAXY   T3396  TRIP   T3397  GALAXY   T3397  TRIP   T3404  GALAXY   T3404  TRIP   TRBV10 well as science laboratories. To this end, capitalizing on our long-standing experience into the field of immunogenetics and bioinformatics analysis in the field, we designed a user-friendly, straightforward bioinformatics pipeline in order to assist not only the inexperienced users but also the experienced ones by facilitating the analytical part. TRIP is a novel software framework that provides analytics services on antigen receptor gene sequence data, offering the opportunity for an in-depth analysis based on the processing of the output files of the IMGT/HighV-Quest tool. It provides detailed information about V, D and J gene usage, CDR3 AA and Nt composition, and clonality, also offering analysis of the SHMs present in the V-region of the BcR IGs. It is accurate, open source, easy-to-use, user friendly, and enables the user to build a personalized pipeline. Finally, it processes many different datasets at the same time.