NEAT: a framework for building fully automated NGS pipelines and analyses

Background: The analysis of next generation sequencing (NGS) data has become a standard task for many laboratories in the life sciences. Though several tools exist to support users in the manipulation of such datasets on various levels, few are built on the basis of vertical integration. Here, we present the NExt generation Analysis Toolbox (NEAT), which allows non-expert users, including wet-lab scientists, to comprehensively build, run and analyze NGS data through double-clickable executables without the need for any programming experience.

Results: In comparison to many publicly available tools including Galaxy, NEAT provides three main advantages: (1) Through the development of double-clickable executables, NEAT is efficient (completes in less than 24 hours), easy to implement and intuitive; (2) Storage space, maximum number of job submissions, wall time and cluster-specific parameters can be customized, as NEAT is run on the institution's cluster; (3) NEAT allows users to visualize and summarize NGS data rapidly and efficiently using various built-in exploratory data analysis tools, including metagenomic and differentially expressed gene analysis. To simplify the control of the workflow, NEAT projects are built around a unique and centralized file containing sample names, replicates, conditions, antibodies, alignment, filtering and peak calling parameters as well as cluster-specific paths and settings. Moreover, the small-sized files produced by NEAT allow users to easily manipulate, consolidate and share datasets from different users and institutions.

Conclusions: NEAT provides biologists and bioinformaticians with a robust, efficient and comprehensive tool for the analysis of massive NGS datasets. Frameworks such as NEAT not only allow novice users to overcome the increasing number of technical hurdles due to the complexity of manipulating large datasets, but also provide more advanced users with tools that ensure high reproducibility standards in the NGS era.
NEAT is publicly available at https://github.com/pschorderet/NEAT. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-0902-3) contains supplementary material, which is available to authorized users.


Introduction
The NExt generation Analysis Toolbox (NEAT) is a Perl/R package that supports users during the analysis of next generation sequencing (NGS) data.
NEAT provides an easy, rapid and reproducible exploratory data analysis (EDA) tool that allows users to assess their data in less than 24 hours (based on a 200 million read HiSeq run). NEAT was developed in four main sections. The first two sections (creating and running a pipeline) help users with jobs that are computationally demanding (mapping, filtering, etc.). The last two sections consist of analyses that can be run locally (on a desktop computer). All four sections are standalone applications.
This quick guide is intended for users who want to run a quick analysis and are comfortable using the default parameters and algorithms. Please refer to the complete guide for more details on in-depth analysis.
Fig. 1 NEAT architecture. NGS data can be analyzed using NEAT in less than a day. Users follow a logical 4-step process: creating a new project, running the pipeline on a remote server or in the cloud, transferring the data to a local computer and proceeding to the analysis.

Introduction to NEAT
A central feature of NEAT is its ability to perform repetitive tasks on complex sample setups while managing batch submissions and cluster queuing. NEAT can easily be implemented in most institutions with limited to no programming experience. The workflow has been designed to run efficiently on a computer cluster using a distributed resource manager such as TORQUE (qsub commands) or LSF (bsub commands). NEAT has been developed by and for wet-lab scientists as well as bioinformaticians to ensure user-friendliness, management of complicated experimental setups and reproducibility in the big data era.
To start using NEAT, please follow the tutorial. It will walk you through the analysis of a small test dataset using your own computer cluster. It will also ensure that NEAT and its dependencies are correctly installed before you submit large, memory-intensive analyses.
All fastq files from the test data have been subset to approximately 15,000 reads. This data comes from an unpublished 50 bp single-end (SE) sequencing experiment, although NEAT can deal with paired-end (PE) sequencing as well. For more information on the test data provided in this tutorial, please read below.
Although this quick guide is intended for scientists with no programming experience, the pipeline run by NEAT will be launched on a remote server. Users therefore require SSH access with a username and a password. Please refer to your system administrator to obtain such credentials. This tutorial assumes the NEAT/ directory was saved to the user's desktop on a local computer (~/Desktop) and that it will be saved to the home directory (/home/) on the remote server.

Install NEAT
Make sure the folder is named NEAT and not NEAT-master. Run the install package and follow the instructions.
Enter your password and allow for the folders to transfer. This step installs all required R packages on your personal computer. Please ensure there are no errors before proceeding to the analysis.

NEAT part 1 and 2
3.1 Before running NEAT for ChIPseq
Please make sure all R packages required for NEAT are installed on your computer (refer to the R manual) prior to running NEAT. Refer to the Version information and required packages section below for more information. In addition, all four NEAT standalone applications run through Automator, a macOS application installed by default on most modern Apple computers. Please make sure it is installed on your computer.
Finally, ensure your Terminal software is closed before launching NEAT.

Running NEAT part 1
3.2.1 Creating a new ChIPpip project
The first step to run NEAT on ChIPseq data is to create a new ChIPpip project.
Double-click the 1_NewProject.app found in the NEAT directory /Desktop/NEAT/ChIPmE/ and follow the prompts.
We advise saving projects in a directory other than the NEAT directory itself so that you can update NEAT by dragging and dropping the entire NEAT folder without compromising older projects.

Filling in the Targets.txt file
Creating a new project (named by default MY_NEW_CHIP_PROJECT) should create the central part of NEAT: the Targets.txt file. The Targets.txt file can be found in /MY_NEW_CHIP_PROJECT/DataStructure/ and is the backbone of NEAT. It contains all the information specific to your experiment and your computer cluster, including the names of files, the paths to the reference genomes, the steps to execute, the name of your samples, their relationships, etc.
IMPORTANT: Sample names, including inputs and fastq files, cannot start with the letter 'n' (lowercase or capital), as Perl reads a backslash followed by 'n' (\n) as a newline character.
They also cannot contain '_R2', other than to name the corresponding paired-end samples (see the Sample naming entry below).
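The two naming rules above can be checked before a project is submitted. The following is a minimal sketch in Python, not part of NEAT itself; the function name check_sample_name and its is_reverse_mate argument are hypothetical and only illustrate the rules as stated in this guide.

```python
def check_sample_name(name, is_reverse_mate=False):
    """Return a list of problems found in a sample name (empty if none).

    Hypothetical helper: it only encodes the two rules stated in the guide.
    """
    problems = []
    # Rule 1: names must not start with 'n' or 'N'.
    if name[:1].lower() == "n":
        problems.append("starts with 'n' or 'N'")
    # Rule 2: '_R2' is reserved for the reverse-read file of a paired-end sample.
    if is_reverse_mate:
        if not name.endswith("_R2") or name.count("_R2") != 1:
            problems.append("reverse-read name must end with a single '_R2'")
    elif "_R2" in name:
        problems.append("contains '_R2' but is not a reverse-read file")
    return problems
```

For example, check_sample_name("noDox_1") reports the leading 'n', whereas check_sample_name("PSa29-5_noDox") returns an empty list.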
The Targets.txt file is the most important piece of NEAT and users are expected to invest the time and effort to ensure all paths and parameters exist and are correctly set. However, once set, most of these parameters will not change on a specific computer cluster (users from the same institute will use the same paths). We therefore suggest that more advanced users modify the original Targets.txt template file (see below).
All parameters of the Targets file should be self-explanatory. Below is a brief summary:
My_personal_email : Email address at which users would like to be notified when the cluster has finished. This will only work if your computer cluster has activated the emailing feature (please check with your system administrator). To ensure servers are not overwhelmed by email services, ChIPpip is configured to notify users only if the pipeline has terminated properly (with no error).
: Full path to your project folder (without the project name) [automatically generated].
Remote_path_to_NEAT : Full path to your NEAT folder. Note that in our example, we have created our project within the ChIPpip folder itself, but users can freely decide to create a dedicated folder for all of their ChIPpip projects.
Remote_path_to_orifastq_gz : Full path to where your compressed .fastq.gz files are. Usually, your sequencing core facility will let you know where they store these files. Note that all .fastq.gz files can be kept in a single location; they do not need to be copied to your folder.
Remote_path_to_chrLens_dat : Path to a .dat file containing chromosome information for your reference genome. Refer to your computer core facility.
Remote_path_to_RefGen_fasta : Path to the folder containing your reference genome files. When using BWA, this folder will contain the indexes (*.fa.* files) as well as the fasta file (.fa, .fai). When using BOWTIE, this folder will contain the indexes (*.ebwt files) as well as the fasta file (.fa, .fai). Refer to your computer core facility.
Remote_path_to_chrLens_dat_ChIP_rx : See above.
Remote_path_to_RefGen_fasta_ChIP_rx : See above.
Aligner_algo_short : "BWA" or "BOWTIE" for standard alignment. If other algorithms are used, modify the AdvancedSettings.txt file accordingly.
Paired_end_seq_run : "0" for single-end sequencing. "1" for paired-end sequencing.
PeakCaller_R_script : NEAT comes with two preinstalled algorithms: SPP (PeakCaller_SPP.R) and MACS (PeakCaller_MACS.R). For advanced users, any peak calling algorithm can be used. Wrap your algorithm in a .R file and enter the name of the file in the PeakCaller_R_script parameter.
Steps_to_execute_pipe : Users can choose from the following tasks:
• GRanges
If you do not want to run all of these, simply delete them from the Targets.txt file or rename them. Once a step has run, ChIPpip will change its value, for example from 'unzip' to 'unzip_DONE'. A certain hierarchy has to be followed, e.g. attempting to filter reads without having previously mapped them (in the same run or in a previous run) will not work. Note that 'qc' requires Thomas Girke's systemPipeR package; map requires bwa; the default peak calling requires the R package SPP; filter requires samtools; GRanges requires various well-established R packages. Refer below for exact requirements.
Sample naming : For single-end (SE) runs, fill in the 'SAMPLES INFO' section. Spaces, dollar signs and other well-known special characters should not be used. Underscores should be used to delimit words. For paired-end sequencing runs, in addition to filling in the first section, fill in the 'PE CORRESPONDING SECTION', which corresponds to the .fastq files containing the reads from the reverse strand. The names of these files need to be identical, with the exception of adding '_R2' at the end of the name. For example, if the first file name is 'PSa29-5_noDox', the corresponding reverse strand file will be named 'PSa29-5_noDox_R2'. The order in which the files appear also needs to be identical in both sections. Please refer to the example Targets.txt file for more information.
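The '_DONE' convention and the step hierarchy described for Steps_to_execute_pipe can be illustrated with a short Python sketch. This is not NEAT code: the function names are hypothetical, the steps are passed as a plain list rather than read from Targets.txt, and the only dependency encoded is the one this guide states explicitly (filter requires map).

```python
DONE_SUFFIX = "_DONE"


def pending_steps(steps):
    """Return the steps that have not yet been marked as done."""
    return [step for step in steps if not step.endswith(DONE_SUFFIX)]


def unmet_requirements(steps, requires=None):
    """Return (step, required_step) pairs whose requirement is not satisfied.

    A requirement counts as satisfied if the required step appears earlier
    in the list, either pending (same run) or marked '_DONE' (previous run).
    """
    if requires is None:
        # Assumption: only the dependency named in the guide is encoded here.
        requires = {"filter": "map"}
    seen = set()
    unmet = []
    for step in steps:
        done = step.endswith(DONE_SUFFIX)
        base = step[: -len(DONE_SUFFIX)] if done else step
        needed = requires.get(base)
        if not done and needed is not None and needed not in seen:
            unmet.append((base, needed))
        seen.add(base)
    return unmet
```

With the list ['unzip_DONE', 'map', 'filter'], the pending steps are 'map' and 'filter' and no requirement is unmet. Note that this sketch would flag 'filter' if the 'map' entry had been deleted from the list after a previous run, even though the mapped files may still exist.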
Please modify the Targets.txt file according to your needs. To modify the Targets.txt file, we suggest users get accustomed to using plain text editors such as TextWrangler, as this avoids introducing unwanted spaces and special characters. In addition, it is worth making sure that the parameters in the AdvancedSettings.txt file are correct, especially the unzip command and extension as well as the alignment command (BWA vs Bowtie).
The paths to the reference genomes should be obtained from your computer core facility (system administrator), as they are the ones who keep these up to date.
Note here that the reference genome files should have an '.fa' extension (e.g. mm9.fa). Please make sure that your core has named these files accordingly, as any other extension will cause the pipeline to stop prematurely.
In addition, #Remote_path_to_RefGen.fasta refers to the path to the file <reference_genome>.fa and not to the folder in which '.fa' files are located. This differs from RNApip.
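The two constraints on the reference genome path (an '.fa' extension, and a path to the file rather than to its folder) can be verified on the cluster before launching a run. The following Python sketch is an illustration only; check_reference_fasta is a hypothetical helper and not part of NEAT.

```python
from pathlib import Path


def check_reference_fasta(path):
    """Return a list of problems with a reference genome path (empty if none)."""
    p = Path(path)
    problems = []
    # The guide requires an '.fa' extension, e.g. mm9.fa.
    if p.suffix != ".fa":
        problems.append(f"expected a '.fa' extension, found '{p.suffix}'")
    # The path must point to the file itself, not to its containing folder.
    if p.is_dir():
        problems.append("path points to a folder, not to the .fa file")
    elif not p.is_file():
        problems.append("file does not exist")
    return problems
```

The check has to be run on the machine where the path is valid, i.e. on the remote server.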
To avoid repeating these steps at each new NEAT project creation, we suggest you modify the original Targets.txt template file, as noted above.

Running NEAT part 2
Once the Targets.txt file is correctly set up, users can run the 2_Run_NEAT.app.
This script will execute the tasks specified in the Targets.txt file and will display a summary of the user's parameters. NEAT automatically manages the creation and batch submission of jobs, dependencies, ordering of files, queuing, etc. The jobs can be followed in a terminal, after ssh-ing into your remote cluster, by typing qstat if the cluster uses TORQUE (or bjobs if it uses LSF). To ssh into a remote cluster, open a terminal window and run the ssh command with your username and server address. Once the pipeline has finished, it will notify users of its status by email (if applicable).
The mock data provided as a test example should take no more than one hour to run, usually a lot less.
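Which monitoring command applies depends on the resource manager installed on the cluster. As a rough sketch, the Python snippet below guesses the scheduler by looking for its submission command on the PATH; the function name detect_scheduler and the SCHEDULERS table are hypothetical, and the heuristic is an assumption (other PBS-like schedulers also provide a qsub command).

```python
import shutil

# Submission and status commands for the two resource managers named in this guide.
SCHEDULERS = {
    "TORQUE": {"submit": "qsub", "status": "qstat"},
    "LSF": {"submit": "bsub", "status": "bjobs"},
}


def detect_scheduler(which=shutil.which):
    """Return 'TORQUE', 'LSF' or None, based on which submit command is found.

    The lookup function is injectable so the logic can be tested off-cluster.
    """
    for name, commands in SCHEDULERS.items():
        if which(commands["submit"]) is not None:
            return name
    return None
```

Once a scheduler name is known, SCHEDULERS[name]["status"] gives the command to type in the terminal to follow the jobs.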

NEAT part 3 and 4
4.1 Running NEAT part 3

Step 1: Download a ChIPpip project
To transfer a NEAT project from a remote server to a local computer, double click the ChIPmE 3_Transfer.app.
Users will be prompted to locate the NEAT directory and the location they want to save their ChIPpip project. In this example, the NEAT directory is on the desktop.
The 3_Transfer.app will use all the information found in the Targets file to download the ChIPpip project from the remote server to your local computer. Please be attentive as users will need to enter their ssh password several times.
Downloading an entire project should not take more than a few minutes.

Running NEAT part 4
4.2.1 Step 2: Run a ChIPmE analysis
Once the project has been downloaded, users can run the ChIPmE analysis itself.
To this end, double-click the NEAT 4_Analyze.app.
Users will be prompted to locate the NEAT folder (saved on the desktop in this example). Users will also need to choose a mart object (see below). Finally, users will be asked to locate the ChIPpip folder (the one just downloaded). In this example, the ChIPpip project was downloaded to the desktop. Running the analysis using the test data should not take more than a couple of minutes.

Mart objects
Mart objects are .bed files of regions of interest to align data to. Examples of mart objects are transcriptional start sites (TSS), all transcripts, enhancers, etc. Some mart objects are provided as part of the NEAT package in the MartObject folder. For this example, the data will be aligned over all TSS. The provided mart object (mm9_TSS_10kb.bed) comprises 10 kb around all TSS of the mouse genome.
Please note that care should be taken to match the mart objects with the reference genome initially used. In this example, the data was mapped to the mouse mm9 genome, hence the mm9_TSS_10kb.bed. Several parameters can be set before running the analysis, including binNumber, strand, runmeank, Venn and normInp.
Default values are provided as a reference, but we suggest users experiment to find the best values for their own needs.

Logs
Each time ChIPmE is run, a log file is created and named using the date and time.
This file is saved in the ~/MY_NEW_CHIP_PROJECT/logs/ directory. We strongly encourage users to look at these files first, as any error that might have occurred will be saved there. Usually, if no error is reported in the terminal, ChIPmE has terminated correctly. Also, please note that if there are unrecognized chromosome names, such as random chromosomes, warning messages will appear. In the test data, there are 50 or more warnings. These can usually be disregarded.
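Going through the log files by hand can be tedious when many runs have accumulated. The Python sketch below lists every line mentioning a keyword across all files of a log directory. It is an illustration only: error_lines is a hypothetical helper, and the assumption that errors are recognizable by the word "error" is not taken from the NEAT documentation.

```python
from pathlib import Path


def error_lines(log_dir, keyword="error"):
    """Return (file name, line number, line) for each line containing keyword.

    The match is case-insensitive; files are visited in name order, which
    follows run order if the log names start with a sortable date and time.
    """
    hits = []
    for log_file in sorted(Path(log_dir).iterdir()):
        if not log_file.is_file():
            continue
        with log_file.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if keyword.lower() in line.lower():
                    hits.append((log_file.name, number, line.rstrip("\n")))
    return hits
```

Calling the function with keyword="warning" gives a quick overview of the warnings that the guide says can usually be disregarded.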

Count tables
ChIPmE will generate count tables that should be self-explanatory. In brief, each row corresponds to a line of the mart object and each column corresponds to a bin. Please note that bins will be of the same length for centered mart objects (for example TSSs) and will have names corresponding to base pairs, but will be of varied length for non-centered objects (for example transcripts), in which case the columns will arbitrarily be named V1, V2, …, VN. Users need not worry about this when interpreting graphs, as it is accounted for by normalizing the values per bin by the bin length.
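The normalization by bin length can be made concrete with a small example. The Python sketch below splits one region into a fixed number of bins, counts read positions per bin and divides each count by the length of its bin. It illustrates the principle only: the function names are hypothetical and the way bin boundaries are rounded is an assumption, not necessarily what ChIPmE does.

```python
from bisect import bisect_right


def bin_edges(start, end, n_bins):
    """Split the half-open region [start, end) into n_bins contiguous bins.

    Returns n_bins + 1 integer edges; bin lengths may differ by one base.
    """
    length = end - start
    return [start + (i * length) // n_bins for i in range(n_bins + 1)]


def length_normalized_counts(positions, start, end, n_bins):
    """Count positions per bin and divide each count by the bin length."""
    if end - start < n_bins:
        raise ValueError("region is shorter than the number of bins")
    edges = bin_edges(start, end, n_bins)
    counts = [0] * n_bins
    for pos in positions:
        if start <= pos < end:
            # bisect_right finds the first edge greater than pos.
            counts[bisect_right(edges, pos) - 1] += 1
    return [count / (edges[i + 1] - edges[i]) for i, count in enumerate(counts)]
```

Because every transcript is cut into the same number of bins regardless of its length, dividing by the bin length is what makes a long and a short transcript comparable in the same column.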

Custom mart objects
Custom mart objects can easily be created with your favorite genes/regions. The files are simple .bed files that can be created manually or automatically using various online tools, or downloaded directly from genome browsers such as UCSC or Ensembl. ChIPmE has been developed to ensure consistency between and within labs. In addition, the relatively small size of these bed files makes it easy to email/share them. We therefore suggest keeping an up-to-date, centralized folder containing all recurrent mart object files.
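As an illustration of how a custom TSS-centered mart object could be generated, the Python sketch below turns a list of TSS coordinates into tab-separated BED lines (chromosome, start, end, name, score, strand). The function name, the gene names used in the usage note and the flank size are hypothetical, and which BED columns ChIPmE actually reads is an assumption to check against the mart objects shipped with NEAT.

```python
def tss_to_bed_lines(tss_sites, flank):
    """Build BED lines for windows of +/- flank bases around each TSS.

    tss_sites: iterable of (chrom, position, name, strand), 0-based positions.
    """
    lines = []
    for chrom, pos, name, strand in tss_sites:
        start = max(0, pos - flank)  # BED starts are 0-based and cannot be negative
        end = pos + flank            # BED ends are exclusive
        lines.append("\t".join([chrom, str(start), str(end), name, "0", strand]))
    return lines
```

For instance, tss_to_bed_lines([("chr2", 20000, "geneB", "-")], 5000) yields a single 10 kb window from 15000 to 25000 on chr2; the lines can then be written to a .bed file with a plain text editor or a few lines of code.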

Bam files and GRanges
Bam files are generated during the initial phase (step 2) but are not automatically downloaded to the local computer. Rather, the GRanges files, which are often two orders of magnitude smaller (tens of MB versus several GB), are downloaded. If required, the bam files remain available in the 'bam' folder on the remote server.

Consolidating projects
Consolidating projects is very easy. Users who intend to do so simply need to change the Targets.txt file and copy-paste the GRanges objects (or bam files) from one folder to the other. Other files and folders can be left as is.