RACS: rapid analysis of ChIP-Seq data for contig based genomes

Background Chromatin immunoprecipitation coupled to next generation sequencing (ChIP-Seq) is a widely-used molecular method to investigate the function of chromatin-related proteins by identifying their associated DNA sequences on a genomic scale. ChIP-Seq generates large quantities of data that is difficult to process and analyze, particularly for organisms with a contig-based sequenced genomes that typically have minimal annotation on their associated set of genes other than their associated coordinates primarily predicted by gene finding programs. Poorly annotated genome sequence makes comprehensive analysis of ChIP-Seq data difficult and as such standardized analysis pipelines are lacking. Results We present a one-stop computational pipeline, “Rapid Analysis of ChIP-Seq data” (RACS), that utilizes traditional High-Performance Computing (HPC) techniques in association with open source tools for processing and analyzing raw ChIP-Seq data. RACS is an open source computational pipeline available from any of the following repositories https://bitbucket.org/mjponce/RACS or https://gitrepos.scinet.utoronto.ca/public/?a=summary&p=RACS. RACS is particularly useful for ChIP-Seq in organisms with contig-based genomes that have poor gene annotation to aid protein function discovery.To test the performance and efficiency of RACS, we analyzed ChIP-Seq data previously published in a model organism Tetrahymena thermophila which has a contig-based genome. We assessed the generality of RACS by analyzing a previously published data set generated using the model organism Oxytricha trifallax, whose genome sequence is also contig-based with poor annotation. Conclusions The RACS computational pipeline presented in this report is an efficient and reliable tool to analyze genome-wide raw ChIP-Seq data generated in model organisms with poorly annotated contig-based genome sequence. Because RACS segregates the found read accumulations between genic and intergenic regions, it is particularly efficient for rapid downstream analyses of proteins involved in gene expression.


Background
In the last few years the traditional High Performance Computing (HPC) centres, such as SciNet at the University of Toronto [1], have been witnessing the emergence of increasing amounts of work-flows from non-typical disciplines in the field of computational science [2].Among those, disciplines related to bioinformatics appear to be the most prominent in terms of demanding resources and tackling complex problems related to Next Generation Sequencing (NGS).The human genome project [3], the human microbiome project [4], RNA-Seq to analyze gene expression [5,6] and ChIP-Seq to assess global DNA-binding sites [5,6] are all examples of projects where HPC can be applied.
The advantage of NGS for researchers is high-throughput sequencing analysis allowing millions of DNA molecules to be read at the same time [7,8,9].The output of NGS is substantial and it can be overwhelming for analyses [10,11].The difficulties during the analyses can be increased if the genome to be studied is more specific.We address this problem by combining and utilizing open-source tools such as BWA [12], SAMtools [13], Linux shell and R scripts [14] and techniques commonly employed in the HPC fields.
Chromatin immunoprecipitation coupled to NGS (ChIP-Seq, Figure 1) can be utilized to infer function(s) of a protein of interest based on its chromatin-occupancy profile.For instance, if the ChIP signal accumulates closer to a 5' region of a gene, we could infer that protein's function may be related to transcription initiation.On the other hand, accumulation of ChIP peaks at the 3' ends would suggest a role in transcription termination.However, this has to be tested further since gene expression can also be coordinated by elements that are not in close proximity to the specific gene [15].Utilization of mock/control samples (untagged) is important to account for nonspecific DNA interactions with the chromatography resin.
In multi-cellular eukaryotes, the mechanisms of chromatin remodeling involving the precise delineation of how transcription starts, progresses and ends are poorly understood.Chromatin remodeling research aids in our understanding of how these mechanisms affect cellular function in the progression of disease.ChIP-Seq is generally used as an initial step to examine the possible role(s) of a protein in chromatin re-modeling [9,16].
Many algorithms and visualization tools are already set for other model organisms such as human or yeast; however, none of these applications cater to contig based genomes with only minimal annotations of genes, as is the case in T.thermophila.These applications also do not directly address whether the accumulation of the protein of interest is in a genic or intergenic region.Using available tools, such as MACS2 [17], the user receives the peaks coordinates and has to group afterwards to assess the local enrichment within genic and/or intergenic regions.Our pipeline, Figure 2, tackles this problem by offering an one step solution that utilizes the mentioned open sourced tools, the available contigs and gene annotations files of a genome, providing the user with two tables with all the found read accumulations.The first table contains the reads accumulation in the genic region, regions corresponding to the annotated genes.The second table is for the accumulation of reads in the intergenic regions.The intergenic regions are newly generated each time to account for modifications or improvements in the gene annotations.The obtained results are normalized to the number of clusters that passed Illumina's "Chastity filter" also called PF clusters.These numbers represent the reads obtained per sample.The normalized values are further filtered by using the data obtained from the mock samples.
We present in this paper our computational pipeline to Rapidly Analyse ChIP-Seq data for contig based genomes (RACS).RACS, which can be run either in a typical workstation or taking full advantage of HPC resources, such as, multicore architectures and use of RAMdisk, to improve the analysis times making it more efficient, (see details on Sec.2.4).This pipeline was developed to answer whether the protein of interest localized to a gene or an intergenic region.RACS was designed in a user friendly manner to accommodate researchers with basic knowledge in Linux shell and R [14].RACS provides accessible downstream analyses of ChIP-Seq data obtained from Illumina instruments.RACS follows a unique approach to tackle this problem, and is widely applicable and useful enough to analyse ChIP-Seq related data from a variety of different organisms generated by NGS.

RACS Pipeline Implementation
The RACS pipeline is an open source set of shell and R scripts, which are organized in three main categories: • the core pipeline tools, which allow the user to compute reads differentiating between genic and intergenic regions automatically • auxiliary post-processing scripts [1] for normalization using the "Pass Filter" values • and utilities to validate results by visualizing the reads accumulation and run comparisons with other software tools, such as IGV and MACS respectively.In particular, the visual representation tools are still under development, and will be available in the repository as soon as they are ready.The RACS pipeline will run in any standard workstation with a Linux-type operating system.The pipeline requires the fastq files; the assembly or genomic data, such as, T thermophila June2014.assembly.fasta;the coordinates for the genic regions, such as, T thermophila June2014.gff3(both files can be found at http://ciliate.org/index.php/home/downloadsciliate.org[18]); and the following open source tools: • Burrows-Wheeler Alignment (BWA) version 0.7.13 [12] • Sequence Alignment/Map (SAMtools) version 1.3.1 [13] • the R statistical language [14] Our pipeline is open source, and the scripts are available to download and accessibles from any of the following repositories [2] : https://gitrepos.scinet.utoronto.ca/public/?a=summary&p=RACS or https://bitbucket.org/mjponce/racsThe repository includes the core or main scripts placed in the "core" directory.The comparison and auxilary tools are placed in a "tools" directory.And we have also included examples of submission scripts in the "hpc" directory, with PBS [19,20] and SLURM [21,22] examples of submission scripts, so that users can take advantage of HPC resources.Additionally, we have also included a "datasets" directory containing examples of data.Details about the pipeline implementation and how to use it are included in the 'README' file available within the RACS repository. [1]Alternatively, we have also included an auxiliary spreadsheet that could be used instead of the script to perform the post-processing and normalization manually, as well as, to check against negative controls. [2]Both repositories are synchronized, so that the latest version of RACS is available and can be obtained from both of them.
A generic top-down overview of the pipeline implementation for the data analysis, is shown in Figure 3.
Our core scripts do not require any additional R packages; however, the comparison tools, depending on what format the data to compare with is given, might use some spreadsheet reader package.For instance, we have included one named xlsx which allows to read propietary formats.The results of the genic and intergenic regions are generated in two CSV (Comma Separated Value) files.These are standard text ASCII files, which can be read with any typical spreadsheet software or R.
We implemented RACS using two data sets of ChIP-Seq data obtained from the organims T. thermophila.The first data set used was using the protein Med31-FZZ which is part of the putative Mediator complex.The Med31-FZZ data was submitted to a scientific journal and it is under review.The second data set used is from one of our recent studies [23].Here, the protein used is Ibd1-FZZ which it was found to be part of multiple chromatin remodeling complexes.

Determination of the Genic Regions
To count the amount of reads in each genic region or open reading frame (ORF) the core pipeline script was implemented using Linux shell commands combined with the usage of BWA and SAMtools.The input files are the genome of reference (T thermophila June2014.assembly.fasta),the gene annotation file (T thermophila June2014.gff3)and the INPUT and IP files obtained from NGS.After the INPUT and IP sequences are aligned with the genome and sorted, the script uses a loop to count the reads in each ORF and deposits the obtained data in a file named "FINAL.table.INPUTfile-IPfile"; where INPUTfile and IPfile are the INPUT and IP files respectively.Figure 3 depicts a flowchart representing the required steps to obtain the final table containing the the number of reads found in each of the ORF.Details of the processing stages are shown in Figure 2, in relation to Tetrahymena scaffold database and the breakdown of each these steps.
The pipeline is implemented to specifically target the data from the T.Thermophila organism within the .gff3file, if the user would like to consider a different organism, some changes should be done in the table.shscript located in the core subdirectory.In particular, the variables filter1 and filter2 filter1=gene filter2="Name=TTHERM_" should be adjusted correspondingly to the organism of interest.

Determination of the Intergenic Regions
The intergenic regions were not available neither determined by the standard packages.For this reason, we developed an R script to determine these regions.In this pipeline these sequences are calculated during each run to account for further genome actualizations.The input for this script are the files generated by the genic regions pipeline discussed in the previous section (ie."FINAL.table.INPUTfile-IPfile", the INPUT and IP .bamfiles -which are generated as intermediate files of the ORF pipeline-) plus the gene annotation file (eg.T thermophila June2014.gff3).First, the script determines the intergenic regions by calculating the beginning and end of each annotated gene within each available scaffold and subtracts these values.The algorithm only reports intergenic regions that are equal or grater to zero.In the earlier version of the pipeline this constraint was not included, and in some cases it could result in the pipeline reporting regions with negative sizes.We noticed that 92 of these cases were presented in our previous study [23]; however, we should emphasize that there were not reads present in these regions thus did not affect these results.Second, the script uses the newly generated intergenic regions to count the number of reads in each of them.Finally, the data is deposited in the intergenic table for each of the intergenic regions Figure 3 .

Normalization of Reads accumulation and Enrichment Calculation
To account for differences in the amount of PF clusters (reads) presented among samples, the obtained INPUT and IP values were normalized by dividing them by the corresponding value of the Flowcell summary obtained from the Illumina run or from the Total Sequences of the fastQC file.These calculations can be done by the script "normalizedORF.sh"located in the core directory of RACS.Alterantively it can also be calculated employing the following two spreadsheets: for the genic regions (PostProcessing Genic.xlsx ), and for the intergenic regions (PostProcessing Intergenic.xlsx ); that can be found in the "datasets" subdirectory within the RACS repository.
These spreadsheets contain the reads found in the untagged (or mock purification/negative control) samples in the Untagged tab.The user can also add the Flowcell summary details in the Add FCS for (SAMPLE ID) tab.The user can manually introduce the read values for the samples being analysed in the Add (SAMPLE ID) ChiP Seq tab.In this tab the user can divide the number found by RACS by the corresponding PF cluster number found in the previous tab.This data can be deposited in the "Normalized INPUT or IP (FCS)" columns.After the required reads normalization the accumulation can be obtained as the number of IP reads divided by the number of INPUT reads (IP/INPUT).This can be deposited in the "Enrichment (N IP(FCS)/N INPUT(FCS)" column of the same tabs.The obtained values are filtered (Filter 1 ) by the user by subtracting the corresponding number found in the Untagged tab and deposit the values in the Enrichment (Minus AVERAGE untagged column.If there are more than two samples the values can be averaged and values that are less than 1.5 can be filtered (Filter 2 ) and deposited in the "Enrichment Average Sample" tab.For the ORF table, in this tab there is a column containing the Expression profile obtained from the RNA Seq tab.We recommend to copy the filtered cells to the Results tab.The distribution of the protein of interest can be calculated in this tab.For the Intergenic table there is a ORF vs IGR (Intergenic) tab where the number of regions and reads can be calculated.The number of regions is represented by the number of ORF and Intergenic regions that passed the 2 filters.The number of reads found in the ORF and Intergenic regions can be calculated by adding all the available values from the "Normalized IP (FCS)" columns and deposit them in the ORF vs IGR tab of the Intergenic table.

Utilities: Validation and Quality Checks
To account for biological variability in the wet lab, we perform ChIP-Seq using 2 samples for each tagged strain and average their Enrichment.To validate the findings, it is important to determine the genic and intergenic regions of the untagged (negative control) INPUT and IP samples.After this determination, we subtracted the obtained average enrichment from untagged to the obtained tagged average of the samples.Then we filtered for values that had an enrichment greater than or equal to 1.5 in the final enrichment column.These are the enriched regions and represent where the tagged protein interacts.

Visualization of Reads accumulation
The browser IGV [24] can be used to visually inspect and validate the obtained reads based on their ranked enrichment.The files needed are illustrated in Figure 3 and the 'README' file included in the RACS' repository.MACS2, a main-stream application to call peaks, can also be used as specified in [17].MACS2 uses the same intermediate files (.bam) obtained from the RACS pipeline, hence it can be a good reference to be considered for comparison purposes.

RACS Performance
RACS can be run in any normal Linux workstation, however it can also take advantages of cluster-type environments.In particular, several stages of RACS can be run using mutlicore architectures with several threads in parallel.In addition to that, RAMdisk can be used to speed up file I/O operations, we have added an optional argument where the user can specify to use RAMdisk if wished.When we originally developed our pipeline we tested it in our previous HPC cluster, GPC [1] consisting of 2.53 GHz Intel Xeon E5540, with 16GB RAM per node (2GB per core).By comparing the performance of RACS with a typical workstation we noticed a speed-up factor among 8 to 12×.We have also run our pipeline in our newest cluster, Niagara [25], of 1500 Lenovo SD350 servers each with 40 Intel "Skylake" cores at 2.4 GHz.Each node of the cluster has 188 GiB / 202 GB RAM per node, for which we have obtained a speed up of 5 to 10×.In other words, the whole processing of ORF and IGR for a typical INPUT/IP sample, took between 1 and 2 hours.In addition to that, in our new system is possible to bundle 40 (80 using multithreading) processes together.
Moreover, this first release of RACS utilizes the basic SAMtools and BAM codes, however it has been reported that improvements in processing SAM files could be achieved using SAMBAMBA [26].One of the many advantages of dealing with an open source, modular pipeline like this, is that it allows interested users to explore this possibility as well, just by modifying the tool to process SAM files and selecting the one desired.

Case study
In this section, we will describe how RACS was used to re-analyse and generate the data presented in [23].The model organism used in [23] is the protist Alveolate T. thermophila which is the most experimentally amenable member of this taxonomical group.T. thermophila can be used in some cases to understand the basic biology of the parasitic and disease-causing members of the Alveolates.Members of Plasmodium species that causes malaria [27,28,29] and other related species that affect ecosystems [30] and aquiculture [31] can be examined by analogy through our selected model organism.In addition, T. thermophila has genes that present homology to human genes [32,33] and characteristics that makes it an excellent candidate to study chromatin mainly because of the segregation of transcriptionally active and silent chromatin into two distinct nuclei, macronucleus (MAC) and micronucleus (MIC) respectively [34].In our recent study we found a protein, Ibd1, that interacts with many chromatin remodeling complexes [23].The T.thermophila's genome [18] is contig based and contains almost 27 thousand annotated genes or genic regions [35].To further the understanding of Ibd1, and to contribute to current understanding of how chromatin remodeling works, we analysed its localization within the genome by ChIP-Seq (Figure 1) [36,37,38,39].This allowed us to understand what genes are being regulated by Ibd1.

Pre-Processing of the fastq files and quality assessment
The ChIP samples were processed as described in [23] to make the library preparation using the TruSeq ChIP-Seq kit (Illumina).For the untagged (this study) and Ibd1 [23] strains, libraries were sequenced using the v4 chemistry in a HiSeq2500 instrument (Illumina) set for High Output mode.The obtained read lengths were of 66 base pairs, 6 base pairs corresponded to the adapters for demultiplexing.These files were demultiplexed in fastq format and the adapters were trimmed using the bcl2fastq2 Conversion Software v2.20.0.The obtained fastq files for the INPUTS and IP samples were assessed by fastQC version 0.11.5 [40].
Each dataset obtained from the ChIP-Seq experiments has a sequence of whole cell DNA (INPUT) and DNA sequenced from an immunoprecipitated (IP) sample.The Ibd1 NGS data generated in [23] can be found at the following Gene Expression Omnibus (GEO) link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103318GSE103318 [23].In addition, untagged T.thermophila fastq files were generated and they can be found at the following GEO link: https:// www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE125576GSE125576 (please use the following token: shqnagaefhilrsz) We recommend to assess the quality of the data obtained from the NGS.This step is important to have general information regarding the run.This fastQC report also helps to understand the alerts present in each sample, these alerts do not necessarily mean that the NGS run failed [40].In other words, this step is to verify if the fastq data has any alerts that have to be addressed before proceeding to the processing.For example in our case our data for the Per base sequence quality fell into the very good quality reads (green) area of the y axis allowing us to avoid quality trimming.On the other hand, we obtained a flag for Overrepresented sequences, in particular the one that called our attention was the sequence containing only the nucleotide N. Since our reads are 35-58 base paires long, the allowed maximun mismatch to the genome will be up to 3 base paires according to the BWA algorithm [12] hence it will not consider these sequences for the alignment.

Visualization and list of reads
After, the determination of enriched regions we can further analyse them using a visualization tool, such as IGV.The region of interest can be copied from either the ORF or intergenic table.This localization corresponds to where the protein of interest is localizing with respect to annotated genes (see Figure 4) or an intergenic region (see Figure 5).
It is important to note that for Ibd1's ChIP-Seq [23] data we also used MACS2 a main-stream application to call peaks [17].The visualization option for MACS2 and RACS are similar in that both provide an specific file that can be used for this purpose.In the case of RACS, our pipeline uses .bamand .faifiles which are generated within the ORF part of the pipeline (see Figure 3).Such .bamfiles can be opened in IGV, although the .fai(index) file will not, however both files should be present in the same directory.The required files for IGV visualization are depicted in Figure 3.In addition, the .bamfiles generated by RACS can be used as input for MACS2.When compared the MACS2 visualization file to the RACS .bamfiles using IGV (Figure 4), we observed that the RACS files provide a visual of the portion enriched.Here we observed that the IP samples are clearly enriched regions showing peaks when compared to the INPUT samples.This can be determined by noting the numbers shown in each of the IGV tracks, which represent its corresponding reads accumulation.
The output lists provided by RACS are segregated into two .csvfiles.The first file contain the ORF and the second the intergenic regions.Both lists contain the all reads obtained from the INPUT and IP samples.This obtained data should be filtered with the data obtained from Mock samples.We found that the output of MACS2 provides a list of peaks.MACS2 does not classify the peaks based on the localization to a gene or an intergenic region as RACS does.However, this can be addressed using BEDTools [41] after MACS2 analysis.The datasets generated by MACS2 and RACS can be found at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103318GSE103318.

Mock samples facilitate the analysis
In [23] we found that the majority of genes regulated by Ibd1 were highly expressed housekeeping genes.For that analysis untagged samples were not employed, instead a cut-off based on accumulation of reads was implemented.However, since the cutoff was arbitrary there was some degree of uncertainty in regards of its astringency.To overcome this limitation and further facilitate the analysis, for this study we performed ChIP-Seq using untagged control T.thermophila cells.The addition of two mock ChIP-Seq replicas for this study (untagged T.thermophila) enhanced RACS' ability to discriminate non specific DNA binding.In addition, the use of mock ChIP-Seq samples eliminated the uncertainty associated with using the arbitrary cut-off.This newly generated analysis will be used as a model for this case study.Between both the analysis presented in [23], and this new analysis (RACS), there are not major statistical differences regarding Ibd1 localization to genes that are highly, moderate, or low to no-expressed.The statistical analysis presented in Figure 6 shows that the hypothesis generated in [23] regarding an Ibd1 function related to transcription of highly expressed genes stands.The calculation of this result can be found in the Result tab of the ORF table (RACS/datasets/IBD1 ORF.xlsx).

RACS aids in the determination of the protein of interest function
By segregating the protein's localization between genic and intergenic, the determination of its function is rapid.After analizing the Ibd1's data with RACS, tables with the total number of reads found in each of the 26,996 ORF and 27,780 intergenic regions were generated [23].
From the ORF and intergenic tables we observed that Ibd1 localizes to more individual intergenic regions than genic regions (Figure 7-A).However, the majority of reads accumulation are in the genic regions (Figure 7-B), suggesting that Ibd1 localization is within the ORF and its function maybe related to transcription regulation.
The function of Ibd1 was further inferred based on its gene-specific localization.From the ORF table (RACS/datasets/IBD1 ORF.xlsx) we observed that Ibd1 mostly localizes to genes that are highly expressed and related to housekeeping function; such as, cellular function, translation, gene expression, biogenesis, cytoplasmic translation among others (see Figure 8).The functions are based on GO annotations for biological process [42].The calculation of this result can be found in the Gene Ontology tab of the ORF table (RACS/datasets/IBD1 ORF.xlsx).

Outliers
We found a great accumulation of the following three ORFs: TTHERM 02141639, TTHERM 02641280, TTHERM 02653301; in the tagged and untagged ChIP-Seq samples.Since we found this DNA accumulated in the untagged and tagged samples we concluded that this event may be due to non-specific binding to the agarose matrix.For Ibd1, the first 2 were filter out after the subtraction step described in the "Utilities: Validation and Quality Checks" section (Sec.2.3), and the third region seems to be controled by Ibd1.

Performance
By implementing this pipeline as described here, we obtained roughly a factor of 4× faster in comparison to a serial and non-I/O optimized (ie.not using RAMdisk), in a equivalent hardware to the node used in the cluster.This is something we have also observed by using similar techniques (eg.RAMdisk) in other type of bio-informatics pipelines where the hierarchy of the computational scales is dominated by the I/O parts of the code.Moreover, we processed a second set of data, that was roughly 3 times larger than the original data -which would not fit in memory (> 64GB)-, utilizing a more modern node (i7 core) with a solid-state device (SSD), we were able to furher reduce the processing time approximately by another factor of ∼ 4.This type of trend is typical in cases where performance is dominated by computations and I/O operations (eg.reading and writting files), for which the combination of faster processing plus faster access to the data is essential for improving the overall performance.Nevertheless, we should emphasize that even when RAMdisk or a SSD can be a solution that could in principle be thrown to similar type of problems, ie.intensively I/O demanding ones, the best approach would always be to try to mitigate and reduce as much as possible the I/O operations, as these usually represent the slowest part in any computational implementation.

Discussion
In this paper, we have presented a pipeline implementation utilizing open source tools.This pipeline aids in the rapid analysis of ChIP-Seq data grouping the reads accumulation from gene or intergenic regions throughout a contig based genome.The objective is to infer protein function based on its chromatin occupancy.This pipeline has been applied to Tetrahymena thermophila's ChIP-Seq data, but its application can be extended to ChIP-Seq datasets generated in any other organisms.
We called peaks using MACS2, the output of this application was efficient; however, did not elucidate whether the protein of interest localizes to a genic or an intergenic region.Different to MACS2 that calls all peaks regardless of their position in a genic or intergenic region, RACS was designed with that very same goal in mind, hence the diferentiation in the modules that are offered in our pipeline.We should also notice that other tools such as BEDTools, can be used to perform this task in combination with MACS2.
Without the need of any additional "external" software, RACS calls reads and segregates them on a genic or intergenic regions table.Thus our pipeline is more suitable to answer biological questions regarding function.We believe that MACS2 is complementary to RACS, in that both are peaks and reads accumulation predictors.As shown in Figure 4, by obtaining a top hit from RACS this regions can be compared or assessed by MACS2 at the same time to obtained a more robust result.
It is important noting that some regions presented in the processed table have very few reads after substracting the values obtained from untagged samples .For instance, a sample that has 2 reads in the INPUT and 10 reads in the IP will return an enrichment of 5 and it may pass the filter of 1.5X enrichment.Even when there are many computational tools available for processing ChIP-Seq data, RACS is particularly suitable for contig genomes with limited annotations.Recently several ChIP-Seq studies [43,44] have emerged for T.Thermophila.However, there is a lack of standardized computational methods for this model organism, hence it becomes difficult to reliably reach at the same conclusions when replicating the findings.Our tool is the first effort in T.Thermophila to provide a community resource for genome-wide ChIP-Seq studies, therefore it will contribute in standardizing the analyses.Other tools, such as, MACS2 and metagene using BEDTools analysis [41] complement RACS.
Last but not least, we must also note that our tool is of a broad utility and can be easily used for any other model system for ChIP-Seq analyses.

Conclusions
RACS is an excellent tool for genomes that are contig-based and/or have poor annotations, it permits the segregation of reads accumulation after ChIP-Seq processing.RACS is complementary to other tools, such as MACS2, as it can help to discriminate complex regions improving the overall analysis.
RACS offers an alternative tool with a different approach focused on a simple, modular and open approach.RACS offers a versatile, agile and modular pipeline that cover many of the steps needed in the process of analysing ChIP-Seq data.
The pipeline uses HPC tools, such as RAMdisk or batch processing via scheduling in cluster type environments, so that the data analysis can be done for large datasets.
The scripts are reusable and generic enough that can be simply modified and utilized in other pipelines as well.
The modular approach we followed when developing RACS, also allows for future developments as this pipeline could be easily ported as a backend of a web interface, or a gateway portal, serving a larger group of researchers from different disciplines.Figure 6 Comparison of Ibd1 localization presented in [23] (without untagged controls) and analyzed by RACS using untagged controls (current study).There is a correlation of 0.896 and non-statistical differences between the two data sets.The data presented in [23] uses an arbitrary cut-off.The data presented in this paper does not use the arbitrary cut-off and instead uses as cut-off the values obtained by the analyses of the untagged samples.

Figure 1 Figure 2
Figure1Diagram summarizing the ChIP-Seq technique used to prepare the samples and generate the data from the "wet-lab": 1) Native state of chromatin.2) Specific antibodies recognize the tagged proteins.3) Isolation of tagged protein plus its interacting chromatin.4) After DNA purification and library preparation NGS is performed.5) The output data from NGS is aligned to Tetrahymena thermophila's genome assembly.

Figure 3
Figure3Core RACS pipeline overview.This flowchart represents the logic steps implemented in the core pipeline.Boxes represent files and file types as indicated in the text.Files with thick boxes represent the Input Files for Intergenic calculations.Files in green are to be uploaded to IGV.File in blue is needed for IGV but it does not have to be uploaded to IGV.This file has to be kept in the same folder directory than the sorted bam.

Figure 4
Figure 4 Visualization using IGV to compare MACS2 to our pipeline.Determination of an intergenic region obtained by MACS2 showing the broad and gapped peaks (first track) and by the pipeline presented in this paper (second track).Note that our pipeline shows graphical peaks behaviour.The third track shows T.thermophila's genes.

Figure 5
Figure5Results comparing the determination of the Genic Regions between the pipeline presented in this paper vs MACS2.Region obtained by MACS2 (first track) and RACS pipeline (second track).MACS2 found two weak peaks that can be interpreted as background by our pipeline.

Figure 7
Figure 7 Ibd1 localization using RACS. A. Ibd1 localizes to more Intergenic regions than ORF.B. The majority of reads are found in ORF.These results take into consideration the updated information provided by the mock samples.

Figure 8
Figure 8 Gene Ontology (GO) analysis of genes controlled by Ibd1.