NEAT: a framework for building fully automated NGS pipelines and analyses
© Schorderet. 2016
Received: 5 August 2015
Accepted: 21 January 2016
Published: 1 February 2016
The analysis of next generation sequencing (NGS) data has become a standard task for many laboratories in the life sciences. Though several tools exist to support users in manipulating such datasets at various levels, few are built on the basis of vertical integration. Here, we present the NExt generation Analysis Toolbox (NEAT), which allows non-expert users, including wet-lab scientists, to comprehensively build, run and analyze NGS data through double-clickable executables without the need for any programming experience.
In comparison to many publicly available tools including Galaxy, NEAT provides three main advantages: (1) Through the development of double-clickable executables, NEAT is efficient (complete analyses finish in under 24 hours), easy to implement and intuitive; (2) Storage space, maximum number of job submissions, wall time and cluster-specific parameters can be customized as NEAT is run on the institution’s cluster; (3) NEAT allows users to visualize and summarize NGS data rapidly and efficiently using various built-in exploratory data analysis tools, including metagenomic and differentially expressed gene analyses.
To simplify the control of the workflow, NEAT projects are built around a unique and centralized file containing sample names, replicates, conditions, antibodies, alignment-, filtering- and peak calling parameters as well as cluster-specific paths and settings. Moreover, the small-sized files produced by NEAT allow users to easily manipulate, consolidate and share datasets from different users and institutions.
NEAT provides biologists and bioinformaticians with a robust, efficient and comprehensive tool for the analysis of massive NGS datasets. Frameworks such as NEAT not only allow novice users to overcome the increasing number of technical hurdles posed by the complexity of manipulating large datasets, but also provide more advanced users with tools that ensure high reproducibility standards in the NGS era. NEAT is publicly available at https://github.com/pschorderet/NEAT.
Massively parallel / next generation sequencing (NGS) has become a central tool for many projects in the life sciences, including fields such as molecular biology, evolutionary biology, metagenomics and oncology. These novel technologies have brought tremendous depth to our understanding of epigenetics and are becoming widely used in many experimental setups. Recent improvements in sequencing technologies have made it commonplace to obtain 20 to 40 gigabases of data from a single experiment, while the cost per megabase has dropped by half nearly every six months since 2008 [2, 3]. The explosion of NGS data in the life sciences has led to the petabase barrier being surpassed. In addition to the massive amount of data generated in the genomics era, the empirical observation that NGS analysis constitutes one of the major bottlenecks in modern genomics projects has brought new challenges, including the urgent need to create efficient and reproducible analysis pipelines accessible to both biologists and bioinformaticians.
Biologists have embraced NGS technologies with great enthusiasm, mainly because of the opportunities and promises they provide. However, although NGS allows rapid assessment of genome-wide changes, paradoxically, the computational power and complexity required for its analysis has significantly hindered the overall turnaround time for wet-lab scientists, many of whom rely on overwhelmed bioinformatics core facilities. A common effort has thus been established to support the post-genomic era, including the development of important interfaces such as genome browsers (UCSC [5–7], Ensembl), annotation databases (ENCODE, modENCODE) and tools to manipulate big data files (BEDTools, SAMtools). Moreover, many scientists have contributed to the development of Galaxy, an open source, web-based platform that provides various tools for NGS data analysis [13, 14]. Finally, the R community is providing increased support to the field of bioinformatics by developing a plethora of open source packages as part of the Bioconductor consortium.
The development of publicly available tools has undoubtedly facilitated the analysis of NGS data. However, several loopholes remain. For example, irrespective of how user-friendly these tools might be, they are often daunting for scientists who have little to no programming experience. These scientists often face the dilemma of choosing between investing the effort to learn the computational skills necessary to analyze their own data or waiting for it to be analyzed by computational cores. Empirically, the majority of these decisions converge on the latter. In the meantime, scientists still heavily rely on the ability to visualize their data to steer their projects. We thus feel that the community would strongly benefit from easy-to-use tools that do not require programming skills. The reason such applications have never been implemented likely stems from the disparity of individual projects and the need to apply specific parameters to each of them on a case-by-case basis. Nevertheless, there is a strong demand for tools to rapidly assess whether the technical aspects of an experiment succeeded (antibody specificity, conditions, sequencing depth, etc.), even though the tradeoff of using default parameters may well introduce some bias and imperfections into the analysis.
Another loophole in NGS analysis is seen with more advanced users. Indeed, many computational biologists, who strongly depend on automation for the majority of their work, continue to manually manipulate files (renaming, filing, copying, etc.). This apparent dichotomy can be explained by the lack of tools that support vertical integration of NGS analyses while managing their interdependencies. For example, the vast majority of tools that support singular repetitive tasks that can be run in parallel (mapping, filtering, etc.) rarely provide an easy solution for integrating these tasks into a complex multi-dimensional workflow. As such, few software packages allow users to efficiently run custom-made pipelines on the same server on which the data is stored long term. For example, Galaxy, the most widely used open-source platform for data analysis, has a powerful and intuitive web-based front-end interface. Nevertheless, users are required to upload files and are often limited by various regulations, including maximum job submissions, wall time and storage space. Other tools such as HTSstation require scientists to continuously follow job statuses and manually manipulate files and keys between different steps. These iterative and error-prone processes, which, de facto, cannot be referred to as pipelines, are cumbersome and time-consuming.
To address some of the above-discussed issues, we present NEAT, a framework developed to help manage ChIPseq and RNAseq pipelines in a robust, reproducible and user-friendly manner. NEAT offers several automated modules (unzip, rename, QC, chiprx, map, filter, peakcalling, creation of wig files, etc) that can be run through double-clickable icons from any desktop or laptop, an interface that not only facilitates the analysis of NGS data, but that makes it accessible to non-expert users. Furthermore, NEAT includes downstream applications that allow users to effortlessly explore NGS data using a graphical user interface (GUI) display. In summary, we believe that NEAT will help biologists as well as established bioinformaticians create, manage and analyze complex NGS pipelines, as well as assess NGS data within 24 h of the sequencing run completion through a simple GUI.
We have created an NGS framework under the UNIX operating system called NEAT that can easily be run either through the command line or through a graphical user interface (GUI). NEAT is a modular, reliable and user-friendly framework that allows users to build both ChIPseq and RNAseq pipelines using plain words (‘map’ will map, and so on). NEAT is completely automated and supports users in the analysis of NGS data by managing all jobs and their dependencies from a single, centralized file. NEAT is designed to be run by scientists with no programming experience and, as such, pipelines can be built and managed using double-clickable executables on a simple laptop. On the other hand, its modular architecture allows advanced users to easily customize NEAT to their own needs. In addition, NEAT can be implemented in the vast majority of institutions (compatible with LSF and PBS) regardless of rules and regulations, as all cluster parameters, including queuing priority, node allocation, number of CPUs and wall time, can be parameterized from a single file.
The four steps of the NEAT framework are described below. In addition, step-by-step tutorials can be found in the supplemental material (Additional files 1 and 2). The tutorials allow users to follow through an entire NGS analysis using a provided test data set. The test datasets, which are either H3K4me3 ChIPseq data or RNAseq data from mouse embryonic stem cells, have been truncated such that the entire analysis should take less than two hours. Running the test data will also ensure NEAT and its dependencies (packages, scripts, etc.) are properly installed before submitting large, memory-intensive analyses.
Step 1: Creating a NEAT project
The first step of the NEAT framework is to build a new project. This can be done through the ‘New Project’ application (Fig. 1 and Additional files 1 and 2), which will prompt users to enter some details, including the directory the project will be created in and the name of the project. Once executed, the user will be asked to fill in the single most important file in NEAT: the Targets.txt file.
The Targets.txt file is the most important piece of NEAT, and users are expected to invest the time and effort to ensure all paths and parameters exist and are correctly set. It is worth noting that once set, most of these parameters will not change on a specific computer cluster (users from the same institute will use the same parameters). We therefore suggest that more advanced users modify the original Targets.txt template file (Additional files 1 and 2), which is used as a template each time a new project is created. This will significantly ease the process of building new projects and will minimize errors due to nonexistent files or wrong paths. For downstream analysis of NEAT projects (see step 4), several widely used database names can be found in the Species_specificities.txt file for reference (Additional files 1 and 2).
Finally, and most importantly, in addition to describing the data processed by the pipeline, the Targets.txt file contains the building blocks of the pipeline itself. These blocks are specified under the ‘Steps_to_execute_pipe’ line and can be written in plain English words, e.g. ‘unzip’, ‘map’, ‘filter’, etc. The different default building blocks are described below. As NEAT uses exact word matching, users who do not want to run a given block are free to delete or rename it (for example as ‘chiprx_NO’).
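To make the idea concrete, a minimal Targets.txt could look as follows. The field names (‘Steps_to_execute_pipe’, ‘Remote_path_to_orifastq_gz’, ‘OriFileName’, ‘FileName’) are taken from the text; all values, paths and the exact column layout are illustrative, and the template shipped with NEAT may differ:

```
# Cluster- and project-specific settings (illustrative values)
Remote_path_to_orifastq_gz   /data/seqcore/run_2015_08/
Steps_to_execute_pipe        unzip QC map filter peakcalling cleanfiles granges

# Samples (illustrative)
# FileName        OriFileName                 Condition   Replicate
K4me3_ESC_rep1    Lane1_ACGTAC_R1.fastq.gz    K4me3       1
Input_ESC_rep1    Lane1_TGCATG_R1.fastq.gz    Input       1
```

A block that should be skipped would simply be renamed (e.g. ‘chiprx_NO’) or deleted from the ‘Steps_to_execute_pipe’ line.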
The ‘unzip’ module will unzip, rename and store fastq files in a newly created folder within the project folder. Although this strategy may seem cumbersome in terms of storage space, it allows systematic storage of backups without manipulating the original compressed files, which helps organize and keep track of sequencing runs.
Sequencing cores use different compression formats. For this reason, users can specify the file extension and the unzip command in the AdvancedSettings.txt file (Additional files 1 and 2). This module will unzip the compressed files found in the directory specified by the ‘Remote_path_to_orifastq_gz’ parameter, whose names are listed in the ‘OriFileName’ column of the Targets.txt file, and will rename them according to the ‘FileName’ column. All files will be stored in the newly created ‘fastq’ folder (Additional file 3).
The ‘QC’ module uses the R systemPipeR package (Girke T. (2014) systemPipeR: NGS workflow and report generation environment. URL https://github.com/tgirke/systemPipeR) to provide a variety of quality-control outputs, including per-cycle quality box plots, base proportion, relative k-mer diversity, length and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution. The ‘QC’ building block, together with the ‘GRanges’ module (see below), are the rare exceptions that require the installation of external R packages. Additional information on package installation can be found in the tutorials (Additional files 1, 2 and 4).
ChIP-Rx is a cutting-edge normalization method for ChIPseq that performs genome-wide quantitative comparisons using a defined quantity of an exogenous epigenome, e.g. a spike-in control. The detailed ChIP-Rx algorithm has been implemented as previously published. For the sake of consistency, the same mapping and filtering parameters are used for the alignment of both the standard and the spike-in epigenome. If no spike-in controls are used, all ChIP-Rx parameters can be dashed (‘-’).
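The intuition behind spike-in normalization can be sketched in a few lines: each sample is scaled by the reciprocal of the number of reads mapping to the exogenous genome, so that samples receiving the same spike-in quantity become directly comparable. This is a simplified illustration of the reads-per-million-spike-in idea, not the exact code of NEAT’s ‘chiprx’ module; function names are hypothetical:

```python
def chiprx_scaling_factor(exogenous_reads: int) -> float:
    """Reciprocal of the exogenous (spike-in) read count, in millions.

    Simplified sketch of the ChIP-Rx idea; the exact formulation used
    by NEAT's 'chiprx' module may differ.
    """
    if exogenous_reads <= 0:
        raise ValueError("need at least one exogenous read")
    return 1e6 / exogenous_reads


def normalize(coverage, exogenous_reads):
    """Scale a raw coverage vector by the spike-in factor."""
    alpha = chiprx_scaling_factor(exogenous_reads)
    return [c * alpha for c in coverage]


# Two samples with identical raw signal but different spike-in recovery
# are brought onto a common scale:
sample_a = normalize([10, 20, 30], exogenous_reads=2_000_000)  # alpha = 0.5
sample_b = normalize([10, 20, 30], exogenous_reads=1_000_000)  # alpha = 1.0
```

In this toy example, sample_a is halved relative to sample_b because twice as many of its reads map to the spike-in genome.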
The ‘map’ module maps reads using either bwa or bowtie. For RNAseq projects, the splice-aware, bowtie-based TopHat algorithm is preferred. The standard parameters for either algorithm can be modified in the AdvancedSettings.txt file, including maximum number of gaps, gap extension, maximum edit distance, number of threads, mismatch and gap penalties, etc. Additional mapping algorithms can easily be implemented by advanced users (Additional files 1 and 2).
The ‘filter’ module allows the user to specify filtering parameters (AdvancedSettings.txt) including how to manage duplicate reads, minimum and maximum size of fragments, etc. This module uses the samtools [12, 21] view, sort, rmdup and index functions.
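The logic of this filtering step (though not the actual samtools invocations it wraps) can be illustrated with a toy sketch: duplicates with identical coordinates are collapsed, mimicking what ‘samtools rmdup’ does at the alignment level, and fragments outside a size window are dropped. The function name and size bounds are hypothetical:

```python
def filter_fragments(fragments, min_size=100, max_size=600):
    """Drop duplicate fragments and those outside a size window.

    `fragments` is a list of (chrom, start, end, strand) tuples; a
    duplicate is a fragment with identical coordinates and strand.
    Toy illustration of the filtering logic only.
    """
    seen = set()
    kept = []
    for chrom, start, end, strand in fragments:
        size = end - start
        if not (min_size <= size <= max_size):
            continue  # fragment outside the allowed size window
        key = (chrom, start, end, strand)
        if key in seen:
            continue  # duplicate: keep only the first occurrence
        seen.add(key)
        kept.append(key)
    return kept
```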
The ‘peakcalling’ module specifies the algorithm used to call peaks. NEAT has two well-established peak calling methods built in by default: MACS (PeakCaller_MACS.R) and SPP (PeakCaller_SPP.R). Given that NEAT is open source and very versatile, it is easy for advanced users to implement their own peak calling algorithm as R code (Additional files 1 and 2).
Given that different mapping algorithms produce distinct outputs, the ‘cleanfiles’ module helps reorganize and store the different .bam and .bai files before proceeding to downstream analysis. This allows advanced users to implement their own mapping algorithms while still taking advantage of NEAT’s EDA modules.
The ‘GRanges’ module creates significantly smaller GRanges objects (compared to bam files), which are necessary for downstream analysis including identification of differentially regulated genes (RNAseq) and metagenomic analyses (ChIPseq). This eases and increases the efficiency of file transfer, file sharing and consolidation of projects. In addition, the ‘GRanges’ module creates small size wiggle files (.wig files). Wiggle files can be loaded and visualized in various genome browsers including IGV [24, 25]. The compression of the file is driven in part by the binning of the data across the genome. The bin size, which is in base pair units, can be customized in the AdvancedSettings.txt file.
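The size reduction described above comes largely from binning: instead of one record per read, the genome is summarized as one value per fixed-size bin. A minimal sketch of binned coverage rendered in fixedStep wiggle format (bin size and chromosome name are illustrative; NEAT’s actual wig writer works from GRanges objects):

```python
from collections import Counter

def binned_coverage(read_starts, bin_size=50):
    """Count reads per fixed-size bin from a list of read start positions."""
    counts = Counter(pos // bin_size for pos in read_starts)
    n_bins = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(n_bins)]

def to_fixedstep_wig(chrom, values, bin_size=50):
    """Render one chromosome's binned counts as fixedStep wiggle lines."""
    header = f"fixedStep chrom={chrom} start=1 step={bin_size} span={bin_size}"
    return "\n".join([header] + [str(v) for v in values])

wig = to_fixedstep_wig("chr1", binned_coverage([3, 7, 55], bin_size=50))
```

A larger bin size yields smaller files at the cost of resolution, which is exactly the tradeoff exposed by the bin-size parameter in AdvancedSettings.txt.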
Step 2: Running NEAT
After building a pipeline using the easy one-word method in the ‘Steps_to_execute_pipe’ line of the Targets.txt file, non-expert users can run the workflow using the AppleScript double-clickable executable (Fig. 1 and Additional files 1 and 2). More advanced users can run it through the command line (Fig. 1 and Additional files 1 and 2). The executable will prompt users to identify which project they want to run before opening a terminal and asking them (twice) to enter their ssh password. This allows NEAT to access and run the pipeline on the computationally efficient remote cluster. Once the password is entered, NEAT automatically manages job submission, queuing and dependencies. A detailed explanation of how to follow the pipeline and step-by-step debugging support can be found in the tutorials (Additional files 1, 2 and 5). Moreover, users can decide to set up automatic emailing for when the pipeline has completed. As a point of reference, running an exhaustive pipeline (unzip + QC + chiprx + map + filter + peakcalling + cleanfiles + granges) on data comprising 200–400 million reads should not take more than 10 to 15 h. The project architecture of a completed NEAT project on the remote server, including the timing and location of files and folders, can be found in Additional file 3.
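On LSF and PBS clusters, dependency management ultimately reduces to decorating each submission with the IDs of the jobs it must wait for. A simplified sketch of how such commands can be composed (NEAT’s actual scripts are written in Perl and handle many more options; the function name here is hypothetical, while the ‘bsub -w’ and ‘qsub -W depend=afterok’ flags are standard LSF and PBS syntax):

```python
def submission_command(scheduler, script, wait_for=None):
    """Build a cluster submission command with optional job dependencies.

    `wait_for` is a list of job IDs that must finish successfully first.
    LSF expresses this as `bsub -w 'done(id)'`; PBS as
    `qsub -W depend=afterok:id`.
    """
    wait_for = wait_for or []
    if scheduler == "LSF":
        parts = ["bsub"]
        if wait_for:
            cond = " && ".join(f"done({j})" for j in wait_for)
            parts.append(f"-w '{cond}'")
        parts.append(f"< {script}")
        return " ".join(parts)
    if scheduler == "PBS":
        parts = ["qsub"]
        if wait_for:
            parts.append("-W depend=afterok:" + ":".join(wait_for))
        parts.append(script)
        return " ".join(parts)
    raise ValueError(f"unsupported scheduler: {scheduler}")
```

Chaining, say, ‘filter’ after ‘map’ then amounts to submitting the filter script with the map job’s ID in `wait_for`, which is what NEAT automates across the whole pipeline.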
Step 3: Download a NEAT project from a remote server to a local computer
The core component of NEAT (step 2), the pipeline per se, is computationally demanding and is thus preferentially run on a remote cluster. However, upon completion of the pipeline, users may prefer to view and analyze their data locally, e.g. on a desktop or a laptop. As mentioned above, NEAT can be used to create GRanges and wiggle files, whose main advantage is their relatively small size compared to bam files (wig ~ 4–6 Mb; GRanges ~ 40–60 Mb; bam ~ 4–6 Gb). In addition, these files can easily and rapidly be shared by email or in batch using standard flash drives.
To download a NEAT project from a remote server to a local computer, users can run the ‘Transfer.app’ AppleScript double-clickable executable (Fig. 1 and Additional files 1 and 2), which will automatically open a terminal window and start the process. Users will be prompted to locate the NEAT directory and the NEAT project. The ‘Transfer.app’ will use the information found in the corresponding Targets.txt file to download the NEAT project from the remote server to the local computer. Users should be attentive, as they will be asked to enter the corresponding ssh password several times. Downloading an entire project should not take more than a few minutes.
Step 4: Exploratory data analysis using NEAT
Empirically, data visualization is an important milestone for wet-lab scientists. This step is often critical for deciding the direction of further experiments, and computational biologists often underestimate its importance. In an effort to improve the turnaround time of NGS datasets, NEAT supports users in the creation of wig files (see step 2) that can be visualized using various genome browsers, including IGV [24, 25].
It is worth noting that the metagenomic analysis in the ChIPseq module is easily customizable. This tool allows users to visualize chromatin immunoprecipitation enrichments of various samples over specific features (contained in the MartObject folder; Additional files 1 and 2). For example, using the test dataset, users can explore enrichment of an epigenetic mark (K4me3) around all transcriptional start sites (TSSs) of the mouse genome. However, such analyses are not constrained to any particular region, nor to regions of similar length. By creating a simple bed file, users can assess enrichments over their preferred regions of interest. For example, users can visualize enrichments over all transcripts and/or enhancers. In such cases, lengths are normalized across all regions. Any bed-formatted file can be used for the metagenomic module.
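Conceptually, such a metagenomic profile is simply an average of signal over windows centered on each feature (e.g. each TSS). A minimal sketch of that averaging step (illustrative only; NEAT’s module works from GRanges objects and additionally handles strand orientation and length normalization):

```python
def metagene_profile(coverage, centers, flank=2):
    """Average per-base coverage over windows of +/- `flank` positions
    around each center (e.g. TSSs), skipping windows that fall off
    either end of the sequence."""
    width = 2 * flank + 1
    profile = [0.0] * width
    n = 0
    for c in centers:
        if c - flank < 0 or c + flank >= len(coverage):
            continue  # window would run off the chromosome
        window = coverage[c - flank : c + flank + 1]
        profile = [p + w for p, w in zip(profile, window)]
        n += 1
    return [p / n for p in profile] if n else profile
```

In a real ChIPseq setting `centers` would come from a bed file of TSSs or other regions of interest, and the resulting profile would be plotted as enrichment versus distance from the feature.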
Customizing NEAT modules
NEAT was developed as a user-friendly, intuitive and versatile tool. As such, care has been taken to allow users to customize the pipeline to their own needs. This includes easily customizable mapping algorithms, mapping and filtering parameters, peak calling algorithms and metagenomic features (TSSs, transcripts, personal regions of interest, etc.). In addition, more advanced users can efficiently develop novel modules, as the code architecture has been written in a robust, logical, highly redundant and well-annotated manner. To add a new module, advanced users can simply duplicate an existing module and integrate their custom task into the script, usually consisting of a single line of code. The NEAT framework fully automates recurrent tasks such as batch job submissions, job dependencies, job queuing, error management, filing, etc., which greatly facilitates the creation of custom modules. Full support and step-by-step explanations for adding customized modules can be found in the tutorials (Additional files 1, 2, 5, 6 and 7).
As this work presents a pipeline, tangible results take the form of pipeline outputs (Fig. 3); supporting arguments are presented in-line.
Technological revolutions often drive and precede biological revolutions. The omics field has not been immune to this general rule. Such paradigm shifts are often followed by a period of great adaptation. For massively parallel sequencing, developing curricula to educate scientists with the proper skill sets will require some time. Meanwhile, the life science community is in desperate need of tools to support scientists who were trained prior to the sequencing of the human genome. Although NEAT is not intended to replace thorough bioinformatics analysis per se, we believe that it provides helpful tools to accompany scientists in the analysis of NGS data and allow them to rapidly apply standard exploratory data analysis methods to assess the quality of their experiments within 24 h of sequencing run completion. Specifically, we strongly believe that providing wet-lab scientists with simple tools for rapid data visualization, which is a significant bottleneck for many users, will greatly benefit the community and will allow one to better plan and foresee biological experiments without the need to wait for thorough bioinformatics analysis.
NEAT was developed for a wide audience including scientists with no a priori programming knowledge. To this end, although NEAT should be self explanatory (double-clickable application based), it comes with step-by-step tutorials as well as two test datasets that will enable novice users to follow through and reproduce entire ChIPseq and RNAseq workflows. In addition, given the wide diversity of interests in the life sciences, NEAT has been developed to be versatile, easily customizable and applicable to a wide variety of different genomes. Finally, the modular structure of NEAT allows advanced users and computational core facilities to easily add and modify tasks, customize settings and comply with internal rules and regulations with minimal footprint to their existing server architecture. Taken together, we believe NEAT will be of general interest and has the potential to be widely adopted for its versatility and ease of use.
NEAT is open-source software released under the MIT license. NEAT, including tutorials and test data, is publicly available on GitHub (https://github.com/pschorderet/NEAT).
Availability and requirements
Project home page: https://github.com/pschorderet/NEAT
Operating system: Mac OS X
Programming language: Perl, R, Applescript
License: NEAT is an open-source software under an MIT license
PS is supported by an Advanced Swiss National Foundation Fellowship (P300P3_158516) and is supported in part through the National Institute of General Medical Sciences (US) grant R37 GM48405-21 awarded to Robert E Kingston. I would like to acknowledge the Kingston Lab members for stimulating discussions, feedback and testing; particularly Alan Rodrigues, Sharon Marr, Aaron Plys, Ozlem Yildirim, Ruslan Sadreyev and Bob Kingston for critical reading during the preparation of the manuscript.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Hodkinson BP, Grice EA. Next-generation sequencing: a review of technologies and tools for wound microbiome research. Adv Wound Care (New Rochelle). 2015;4:50–8.
- Sboner A, Mu XJ, Greenbaum D, Auerbach RK, Gerstein MB. The real cost of sequencing: higher than you think! Genome Biol. 2011;12:125.
- Stein LD. The case for cloud computing in genome informatics. Genome Biol. 2010;11:207.
- Kodama Y, Shumway M, Leinonen R, International Nucleotide Sequence Database Collaboration. The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res. 2012;40:D54–6.
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The human genome browser at UCSC. Genome Res. 2002;12:996–1006.
- Kuhn RM, Haussler D, Kent WJ. The UCSC genome browser and associated tools. Brief Bioinform. 2013;14:144–61.
- Raney BJ, Cline MS, Rosenbloom KR, Dreszer TR, Learned K, Barber GP, et al. ENCODE whole-genome data in the UCSC genome browser (2011 update). Nucleic Acids Res. 2011;39:D871–5.
- Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, et al. Ensembl 2013. Nucleic Acids Res. 2013;41:D48–55.
- de Souza N. The ENCODE project. Nat Methods. 2012;9:1046.
- Celniker SE, Dillon LA, Gerstein MB, Gunsalus KC, Henikoff S, Karpen GH, et al. Unlocking the secrets of the genome. Nature. 2009;459:927–30.
- Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9.
- Goecks J, Nekrutenko A, Taylor J, Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010;11:R86.
- Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, et al. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 2005;15:1451–5.
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80.
- David FP, Delafontaine J, Carat S, Ross FJ, Lefebvre G, Jarosz Y, et al. HTSstation: a web application and open-access libraries for high-throughput sequencing data analysis. PLoS ONE. 2014;9:e85879.
- Orlando DA, Chen MW, Brown VE, Solanki S, Choi YJ, Olson ER, et al. Quantitative ChIP-Seq normalization reveals global modulation of the epigenome. Cell Rep. 2014;9:1163–70.
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.
- Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
- Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105–11.
- Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–93.
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9:R137.
- Kharchenko PV, Tolstorukov MY, Park PJ. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol. 2008;26:1351–9.
- Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14:178–92.
- Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nat Biotechnol. 2011;29:24–6.
- Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics. 2009;10:48.
- Eden E, Lipson D, Yogev S, Yakhini Z. Discovering motifs in ranked lists of DNA sequences. PLoS Comput Biol. 2007;3:e39.