TSSAR: TSS annotation regime for dRNA-seq data
© Amman et al.; licensee BioMed Central Ltd. 2014
Received: 19 June 2013
Accepted: 24 March 2014
Published: 27 March 2014
Differential RNA sequencing (dRNA-seq) is a high-throughput screening technique designed to examine the architecture of bacterial operons in general and the precise position of transcription start sites (TSS) in particular. Hitherto, dRNA-seq data were analyzed by visualizing the sequencing reads mapped to the reference genome and manually annotating reliable positions. This is very labor intensive and, due to the subjectivity, biased.
Here, we present TSSAR, a tool for automated de novo TSS annotation from dRNA-seq data that respects the statistics of dRNA-seq libraries. TSSAR uses the premise that the number of sequencing reads starting at a certain genomic position within a transcriptional active region follows a Poisson distribution with a parameter that depends on the local strength of expression. The differences of two dRNA-seq library counts thus follow a Skellam distribution. This provides a statistical basis to identify significantly enriched primary transcripts.
We assessed the performance by analyzing a publicly available dRNA-seq data set using TSSAR and two simple approaches that utilize user-defined score cutoffs. We evaluated the power of reproducing the manual TSS annotation. Furthermore, the same data set was used to reproduce 74 experimentally validated TSS in H. pylori from reliable techniques such as RACE or primer extension. Both analyses showed that TSSAR outperforms the static cutoff-dependent approaches.
Having an automated and efficient tool for analyzing dRNA-seq data facilitates the use of the dRNA-seq technique and promotes its application to more sophisticated analysis. For instance, monitoring the plasticity and dynamics of the transcriptomal architecture triggered by different stimuli and growth conditions becomes possible.
The main asset of a novel tool for dRNA-seq analysis that reaches out to a broad user community is usability. As such, we provide TSSAR both as intuitive RESTful Web service (http://rna.tbi.univie.ac.at/TSSAR) together with a set of post-processing and analysis tools, as well as a stand-alone version for use in high-throughput dRNA-seq data analysis pipelines.
KeywordsDifferential RNA sequencing dRNA-seq TSS Transcription start site annotation Transcriptome RESTful Web service Next generation sequencing
Deep sequencing approaches were successfully applied to examine the architecture of primary bacterial transcriptomes and uncovered an unexpectedly complex architecture [1–5]. Although plain transcriptome sequencing can in principle be sufficient to determine transcription start sites (TSS) as local accumulations of read starts, this approach requires extensive sequencing depth [6, 7]. Alternative TSS located within well-expressed genes or operons remain undetectable since moderate changes in coverage do not offer a sufficiently distinctive signal. On the other hand, TSS are not the only loci at which read starts accumulate in RNA-seq data. Alternative sources of such signals are specific processing sites, secondary structures that influence RNA degradation patterns, or chemical modifications [8–10].
The differential RNA sequencing method dRNA-seq  is designed to overcome these difficulties. It makes use of the 5’-monophosphate dependent terminator RNA exonuclease (TEX) that specifically degrades processed RNA, which exhibits a monophosphate at its 5’ end. Transcription initiation, in contrast, produces a 5’-triphosphate that protects the unprocessed 5’ end from degradation by TEX. Treating RNA isolates with TEX prior to reverse transcription to cDNA, leads to a sequencing library ([+]-library or treated library) that is enriched in primary transcription starts, compared to an untreated total RNA library ([–]-library or untreated library). Similar to other library preparation steps that enrich or deplete certain transcript types, e.g. TAP treatment  and rRNA depletion , the TEX dependent degradation of processed RNA fragments is not perfect. The [+]-library, therefore, still contains a mixture of primary and processed transcripts, albeit with a distribution of read starts that is shifted significantly towards TSS positions . In the data used in this contribution a median enrichment at TSS positions of 3.5 is observable. The discrimination of TSS from other accumulations of read starts is thus non-trivial and cannot be performed unambiguously from a TEX treated library alone. On the other hand comparison of [+]- and [–]-libraries offers a potentially highly informative source of information: while read starts will be relatively enriched, we can expect the alternative types of read start accumulations to be depleted in the [+]-library.
Since the signal at hand is quantitative rather than an all-or-none qualitative difference, it is imperative to employ a statistical model to assess when an observed enrichment is indeed significant. This depends strongly on the expression level. To distinguish between real TSS signals and accidental read start accumulation resulting from imperfect TEX degradation or high local expression, the aid of a background model, e.g. the [–]-library, is needed.
Hitherto, the analysis of the dRNA-seq data consists of mapping sequencing reads for each library onto the reference genome, visualizing the read coverage in a genome browser, often with displayed gene and transcription unit annotation, promoter predictions and other available prior knowledge. With this background the genome is manually inspected for positions with a more pronounced peak in the [+]- compared to the [–]-library. The interpretation of dRNA-seq signals in such a way is not only very time consuming, tedious, and error-prone, but also highly subjective and weakly reproducible. Additional annotation information from third-party sources can be very helpful but bear the risk to introduce biases, resulting in re-annotation of already "known" features, and neglecting signals that are less obviously associated with current annotation data. It is, therefore, preferable to separate dRNA-seq data analysis from subsequent data integration with additional available information.
To overcome these shortcomings we developed TSSAR (TSS Annotation Regime), a tool for automated de novo TSS annotation from dRNA-seq data. Incorporation of information like gene annotation or promoter predictions is deferred to post-processing steps.
Here λ[+] and λ[–] are the parameters describing the average read start rate in the [+]- and the [–]-library, respectively. I|k| is the modified Bessel function of the first kind and integer order |k| .
A major practical issue is the estimation of the parameters λ[±] for the two libraries. We assume that read start counts per position within transcriptional active regions follow a Poisson distribution, with the expected value λ depending on the transcription rate, or to be more precise, on the RNA abundance, which depends on the transcription rate and the RNA stability. Within untranscribed regions the read start count per position is ideally zero. In reality, this idealization is often not met due to "leaky" promoters, read-through from adjacent genes, spurious transcription starts at random positions, sample contamination, and sequencing errors leading to mis-mapping of individual reads. Nevertheless, as a first approximation it is reasonable to neglect these sources of error since the effect can be regarded to be small compared to the coverage dispersion within transcribed regions.
where e-λ is the probability for a position within the Poisson distributed part to have zero reads starting there (sampling zero). These positions are part of transcriptional active regions. We use a zero-inflated Poisson regression to estimate ϕ and thus determine how many positions without read starts are structural and sampling zeros, respectively. Only the latter and positions that have at least one read start are used to estimate λ of the [+]- and [–]-library, respectively. The estimation of λ thus effectively considers the transcriptionally active regions only. In practise, the removal of structural zeros leads to larger estimates for λ and thereby avoids the incorrect prediction of TSS from small counts in regions with low numbers of observed read starts.
TSSAR has been implemented in Perl and R and is available in two variants: A stand-alone version incorporates the core statistic routines and is best suited to be used in custom high-throughput dRNA-seq analysis. The Web service (available at http://rna.tbi.univie.ac.at/TSSAR/) comprises additional components for pre- and post-processing, thus providing a Web-based, cross-platform compatible pipeline for dRNA-seq analysis. An overview of the pipeline workflow can be found in Additional file 1: Figure S1.
While the TSSAR Web service provides convenient usability for routine dRNA-seq analysis tasks, there is also a demand for integrating third-party bioinformatics tools into custom analysis pipelines. To address this issue, we provide a TSSAR stand-alone version. In this version, the implementation is restricted to processing of SAM files, analysis based on the statistical calculations, and output of annotated TSS in BED format. The stand-alone version is available for download from the TSSAR Web site.
We chose a sliding window approach with a dynamic assessment of each position in the context of its local surrounding in order to account for different transcription rates across the genome. As a matter of fact, the choice of the window size parameter has an effect on the results (see Additional file 1: Figure S2). There, two conflicting interests have to be balanced. On the one hand, the region should be large enough to provide enough information for a reliable distribution parameter estimation. On the other hand, the region should be small enough to provide an as homogeneous surrounding as possible. If the sliding window covers more than one actively transcribed gene, with different RNA abundances, the variance will be estimated over all transcribed entities. This might blur small signals, e.g., for low abundant sRNA genes. As a compromise, the default window size is 1,000 nt, approximately matching the average length of prokaryotic genes. It can be easily adjusted by the user.
For each window the parameters describing the Poisson distribution are estimated in the following manner: First, the sample values are winsorized , i.e., the highest read start count is substituted with the second highest count. The same procedure is done for the lowest value. This increases the robustness of the method against outliers, which may be caused by mis-mapping and/or abundant RNA fragments e.g. arising from rRNA loci.
Third, a regression procedure is applied to each window in the [+]- and in the [–]-library separately. For each library the probability ϕ is transformed into an expected number of excess structural zeros. Since the same genomic region is under consideration in both libraries, a similar proportion of untranscribed and transcribed regions can be expected. To increase robustness, the average between the number of structural zeros in both libraries is calculated and the estimated number of zeros are removed from each library. To determine λ for each library, describing the Poisson distribution of the sample, the arithmetic mean of the remaining counts is calculated.
Thereby, p i and are the raw and normalized values of the [+]-library at position i, respectively. The same applies to the [–]-library, i.e., m i and . and are the native sums of all read start counts in the total [+]- and [–]-library, respectively. The effect of this step is to scale the read counts of the larger library relative to the smaller one, hence avoiding artificial distending of the sample variance. The estimated parameters λ[+] and λ[–] are therefore normalized accordingly.
For each sequence position i in the current window, the difference of the normalized counts between [+]- and [–]-library is calculated. Unexpectedly large positive values of for position i indicate TSS, while exceptional negative values may indicate processing sites. The probability of observing is evaluated w.r.t. the Skellam distribution with the estimated normalized Poisson parameters.
The window slides along the genome with a step size equal to 1/10 th of the window size, hence each position is evaluated in 10 slightly different contexts. The geometric mean of all ten p-values is calculated in order to obtain the final position-wise p-value. Finally, each position that falls below a user-specified average p-value cutoff and whose total read start count in the [+]-library exceeds a user specified noise cutoff is reported as a significant TSS. The noise cutoff serves as an additional safeguard to restrict the results to plausible annotations. This is needed because the Skellam distribution works only with the differences of the expectation values of the underlying Poisson distributions. A very high expectation value in the [–]-library in combination with a small expectation value in the [+]-library leads to a negative expectation value of the resulting Skellam distribution. This, in turn could lead to annotated positions which are not supported by reads in the [+]-library, as significantly enriched. To prevent this unwanted behavior a user defined number of read starts must be observed in the [+]-library.
The goal of the TSSAR method is to provide user-friendly tools for rapid annotation of significant TSS based on dRNA-seq data. We therefore implemented a stand-alone version and a Web service. The first is intended to be used in high-throughput analysis pipelines whereas the latter represents an easy to use and platform independent user interface. For a Web service it is important to avoid the transfer and storage of gigabyte-sized mapping files. We therefore provide a Java client that extracts the necessary information and asks the user for only two parameters, namely genome size and window size. The data is pre-processed locally on the user’s computer. The essential information, i.e., the number of sequencing reads starting at each position, is automatically uploaded and analyzed on the TSSAR Web server (see Additional file 1: Figure S1). All relevant cutoffs like p-value and noise threshold are subsequently selectable for precomputed values.
To evaluate the performance of TSSAR in analyzing dRNA-seq data, we resort to the published data set for Helicobacter pylori. We used the publicly available raw sequencing data from the Sequence Read Archive  (study accession number SRP001481), restricting ourselves to the dRNA-seq data from mid-logarithmic growth phase and acid stress growth condition. The reads were pooled and mapped to the reference genome (NCBI accession ID NC_000915) using segemehl version 0.1.4  with default parameters.
Based on this data, which were normalized in the same way as indicated in equations 3 and 4, we predicted putative TSS with three different approaches. The first two represent a naïve benchmark. First, we calculated the difference () for each position i of the [+]- and [–]-library read start counts. We applied different cutoff thresholds between 1 and 300, thereby denoting every position with a difference higher than the cutoff to be a putative TSS. The resulting list of potential TSS was compared to the manual annotation from  using BEDTools Intersect, allowing ±2 nt inaccuracy to call a manual and an automated annotated TSS the same. The second approach is quotient based. Analogous to the difference based approach, the quotient is calculated for each position i (+1 is used as pseudo-count to avoid division by zero). Again we use different cutoff values between 1.1 and 20. These two approaches have their static nature in common. The same threshold is applied for the whole genome. A similar approach was already applied by . Albeit, there it was used to identify differentially induced TSS between different strains and growth conditions and additional information about promoter sequences was used to gain specificity.
Finally, we applied the dynamic TSSAR model, which analyzes the transcriptome locally and thus is able to model the different dynamics within the transcriptome. Here, we used a window size of 1,000 nt (approximately the mean gene length in H. pylori) and a noise cutoff of 3 reads per position. We filtered with different p-value threshold from 1·10-15 to 9·10-1.
In its default settings TSSAR merges consecutive TSS. Since the tested naïve approaches do not share this behavior, we tested the influence of TSS clustering on the prediction performance separately (see Additional file 1: Figure S5). Although, clustering contributes to the precision of the prediction, the effect is too small to cause the improved performance of TSSAR.
During manuscript preparation a new method for dRNA-seq data analysis, called TSSpredator, became available. It combines a peak calling approach for the treated library and a ratio based approach between the treated and untreated library . Since the parameters for this method were trained on the very same data set we used for the evaluation, the performance of methods cannot be impartially compared on this data. We can report, however, that TSSAR’s statistical method performs equally well even on this set (Additional file 1: Figure S6). To demonstrate that TSSAR has more generally applicable default parameters, we compared the two methods on a new, publicly available dRNA-seq data set from Stenotrophomonas maltophilia. Based on the mapped reads (treated and untreated dRNA-seq data from the WT strain, SRS352126 and SRS352125, mapped with segemehl read aligner  against the reference genome NC_010943, considering only uniquely mapped reads), TSSAR and TSSpredator were applied each in the default settings.
The PWMs of these motifs were then mapped back to the putative promoter regions with FIMO. We counted a motif found if the p-value of the hit was below 0.001, Figure 5B. For TSS annotated by TSSAR 27.1% are associated with one of the three motifs, of which the majority corresponds to the Pribnow box like motif, while only 18.7% of the TSSpredator results are associated with a motif. The manualy curated list of TSS shows a motif for 29.5% of the entries.
The reproduction of the manual annotation on the S. maltophilia data shows an F-measure of 0.46 and 0.39 for TSSAR and TSSpredator, respectively. Which is mainly due to the fact that the latter predicts about 1,100 additional TSS whereas TSSAR only predicts about 500 not manually annotated TSS, Figure 5C.
These results indicate a higher accuracy of the TSSAR analysis. TSSAR reproduces manual annotation more precise and TSS annotated by TSSAR exhibit a higher enrichment in putative promoter motifs, indicating its greater specificity compared to TSSpredator. However, the latter results have to be interpreted with caution since several recent studies, e.g. [37, 38], showed that the textbook knowledge of the homogeneous architecture of promoter region motifs does not capture the full complexity of biological reality.
A major advantage of an automated TSS annotation, based on a sound statistical analysis, neglecting a priori knowledge of the whereabouts of promoters and other already established annotation, lies in the avoidance of any bias towards certain genomic positions. This ensures an unbiased analysis as well as a comparable and reproducible TSS annotation procedure.
Although our approach checks whether the basic assumption that the read starts of a sequencing library are Poisson distributed holds, a manual inspection of the produced data is still recommended. The automated TSS prediction is only as good as the underlying dRNA-seq libraries. We therefore emphasize that a thoughtful investigation of the input sequencing reads, especially for PCR duplicates, is advised. Manual inspection is necessary for those genomic regions that are not annotated by TSSAR due to non-convergence in the estimation of the expression parameters. For TSSAR’s output, we recommend at least a basic sanity check, since very complex regions, such as tRNA and rRNA loci, might be misconstrued. In spite of these precautions, the work load to check hundreds or a few thousands of predicted TSS positions is significantly reduced compared to screening millions of genomic positions in the first place.
Reliable and automated TSS annotation is a prerequisite for many applications. So far, most genome-wide TSS annotations focused on a static picture of the transcriptomal architecture [2, 39] (there are also notable exceptions, e.g. [30, 40]). One reason is that data analysis was more laborious than data generation. Relieving the experimenter from this time-consuming burden might liberate the resources to investigate more of the dynamics and alteration of the transcriptome, due to external stimuli or evolutionary differences. During manuscript preparation the latter was demonstrated by conducting a comparative transcriptomics approach . There, TSS annotation was also conducted in an automated manner. First, putative TSS are selected by considering the "flank height", and the differences of mapped read starts of position i-1 to i are calculated. These sites are then evaluated similarly to our Quotient approach based on the ratio between the TEX treated and untreated library. The problem of selecting an educated cutoff, which is immanent to all methods but especially troublesome for classifiers which directly depend on variable conditions such as sequencing depth, was neatly circumvented by using a comparative approach. Transcriptomes of different Campylobacter jejuni isolates were used to dynamically adjust thresholds if signals in different strains could be observed. In the more typical application scenarios, where such comparative information is not available, a robust p-value estimate that takes the dynamic range of transcription activity along the whole genome into account for the classification seems to be preferable.
Currently, TSSAR is based on the assumption that a [+]- and [–]-library is analyzed and only positions with a significant enrichment in the [+]-library are reported as potential TSS. At least two other application scenarios of the statistical framework are possible. One is to detect RNA processing sites and the other to analyze differentially induced transcription starts. In principle the latter could be achieved by comparing two TEX treated libraries resulting from dRNA-seq runs of different growth conditions. In that case, a large positive and negative is of importance as it indicates (growth phase dependent) induction of a TSS in the one or the other library. RNA processing sites are in principle detectable using the "standard" dRNA-seq approach. Positions where a significant enrichment in the [–]-minus over the [+]-library is observable are of interest. Extremely small values of point to these positions. Tackling both issues, processing sites and induced transcription initiation, is however currently hampered by the lack of experimentally verified training sets. Furthermore, although tailored for analyzing dRNA-seq data, in principle, the TSSAR method should be applicable to other RNA-seq protocols, e.g., , which aim to enrich read starts at certain positions in the sequencing library. Currently, the run-time of TSSAR, see Additional file 1: Figure S4, prevents its application for one of the above mentioned purposes to complete eukaryotic genomes. An improvement of this aspect will be a task for the future development and refinement of the program.
The modularity of the TSSAR framework makes it possible to extend the current approach e.g., by improving the statistical model. Alternative approaches based on a different (non-Poisson) distribution or the Pitman sampling method  can be implemented in the TSSAR core module, without the necessity to change the Java client or the Web service front end. The RESTful architecture of the TSSAR Web service provides additional extensibility, rendering implementation of new functionality such as promoter or operon characterization straightforward.
Here, we presented an automated analysis of dRNA-seq data which aims to detect significantly enriched TSS positions. The background distributions of sequencing read starts are modeled locally by a zero inflated Poisson distribution. Positions with a larger difference between the TEX treated and the untreated library than expected, considering the background, are annotated as significant transcription start sites. We could show that our method reproduces manually analyzed dRNA-seq data better than two simple approaches that use a global cutoff to discriminate between true and false signals. Furthermore, the choice of a p-value cutoff is more intuitive and less arbitrary.
TSSAR is available both as a stand alone tool and as a Web service at http://rna.tbi.univie.ac.at/TSSAR/. The latter provides additional post-processing functionality like TSS classification or merging of consecutive TSS. The TSSAR Web service offers user-friendly and intuitive online access to the TSSAR framework whereas the stand-alone version is intended for integration into third-party annotation pipelines.
Availability and requirements
Project name: TSSAR
Project home page:http://rna.tbi.univie.ac.at/TSSAR
Operating system: Platform independent
Programming language: Java, Perl and R
Other requirements: Client needs Java 1.6 or higher and the standalone version is based on Perl 5, R 2.15
License: Java client under Apache License, Statistics module under GPL2.
Any restrictions to use by non-academics: For non-profit use only.
We thank David Hot for initiating our interest in dRNA-seq and sharing data, Thomas Yee for valuable advice and Peter Kerpedjiev for proof reading.
This work was partly funded by the Austrian Science Fund (FWF) projects "In silico annotation of noncoding RNAs and their targets" (AF 0430511), "RNA regulation of the transcriptome" (F43), the German ministry of science (0316165C as part of the e:Bio initiative), and the Deutsche Forschungsgemeinschaft (STA 850/10–2 within the SPP 1258 "Sensory and regulatory RNAs in Prokaryotes").
- Croucher NJ, Thomson NR: Studying bacterial transcriptomes using RNA-seq. Curr Opin Microbiol. 2010, 13 (5): 619-624. 10.1016/j.mib.2010.09.009.View ArticlePubMed CentralPubMedGoogle Scholar
- Cho BK, Zengler K, Qiu Y, Park YS, Knight EM, Barrett CL, Gao Y, Palsson BØ: The transcription unit architecture of the Escherichia coli genome. Nat Biotechnol. 2009, 27 (11): 1043-1049. 10.1038/nbt.1582.View ArticlePubMedGoogle Scholar
- Wurtzel O, Sapra R, Chen F, Zhu Y, Simmons B, Sorek R: A single-base resolution map of an archaeal transcriptome. Genome Res. 2010, 20: 133-141. 10.1101/gr.100396.109.View ArticlePubMed CentralPubMedGoogle Scholar
- Sharma C, Hoffmann S, Darfeuille F, Reignier J, Findeiß S, Sittka A, Chabas S, Reiche K, Hackermüller J, Reinhardt R, Stadler PF, Vogel J: The primary transcriptome of the major human pathogen Helicobacter pylori. Nature. 2010, 464 (7286): 250-255. 10.1038/nature08756.View ArticlePubMedGoogle Scholar
- Schmidtke C, Findeiß S, Sharma C, Kuhfuß J, Hoffmann S, Vogel J, Stadler P, Bonas U: Genome-wide transcriptome analysis of the plant pathogen Xanthomonas identifies sRNAs with putative virulence functions. Nucleic Acids Res. 2012, 40 (5): 2020-2031. 10.1093/nar/gkr904.View ArticlePubMed CentralPubMedGoogle Scholar
- Tauber S, von Haeseler A: Exploring the sampling universe of RNA-seq. Stat Appl Genet Mol Biol. 2013, 12 (2): 175-188.PubMedGoogle Scholar
- Passalacqua KD, Varadarajan A, Ondov BD, Okou DT, Zwick ME, Bergman NH: Structure and complexity of a bacterial transcriptome. J Bacteriol. 2009, 191 (10): 3203-3211. 10.1128/JB.00122-09.View ArticlePubMed CentralPubMedGoogle Scholar
- Ishitani R, Yokoyama S, Nureki O: Structure, dynamics, and function of RNA modification enzymes. Curr Opin Struct Biol. 2008, 18 (3): 330-339. 10.1016/j.sbi.2008.05.003.View ArticlePubMedGoogle Scholar
- Knoop V: When you can’t trust the DNA: RNA editing changes transcript sequences. Cell Mol Life Sci. 2011, 68 (4): 567-586. 10.1007/s00018-010-0538-9.View ArticlePubMedGoogle Scholar
- Findeiß S, Langenberger D, Stadler PF, Hoffmann S: Traces of post-transcriptional RNA modifications in deep sequencing data. Biol Chem. 2011, 392 (4): 305-313.View ArticlePubMedGoogle Scholar
- Wurtzel O, Sesto N, Mellin JR, Karunker I, Edelheit S, Bécavin C, Archambaud C, Cossart P, Sorek R: Comparative transcriptomics of pathogenic and non-pathogenic Listeria species. Mol Syst Biol. 2012, 8 (583): 1-14.Google Scholar
- Giannoukos G, Ciulla DM, Huang K, Haas BJ, Izard J, Levin JZ, Livny J, Earl AM, Gevers D, Ward DV, Nusbaum C, Birren BW, Gnirke A: Efficient and robust RNA-seq process for cultured bacteria and complex community transcriptomes. Genome Bilogogy. 2012, 13 (r23): 1-13.Google Scholar
- Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigó R, Sammeth M: Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res. 2012, 40 (20): 10073-10083. 10.1093/nar/gks666.View ArticlePubMed CentralPubMedGoogle Scholar
- Skellam J: The frequency distribution of the difference between two Poisson variates belonging to different populations. J R Stat Soci Series A (Gen). 1946, 109 (Pt 3): 296-View ArticleGoogle Scholar
- Abramowitz M, Stegun IA: Handbook of mathematical functions with formulas, graphs, and mathematical tables. Natl Bur Stand: Applied Mathematics Series 55. 1972, 55: 355-435. [Modified Besselfunctions of the first kind with integer coeficient appear frequently in many fields of physics and as the result of integrals in mathematical statistics. They rapidly grow as a function of their argument.]Google Scholar
- Lambert D: Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992, 34: 1-14. 10.2307/1269547.View ArticleGoogle Scholar
- Yee TW: The VGAM package for categorical data analysis. J Stat Softw. 2010, 32 (10): 1-34.View ArticleGoogle Scholar
- The Perl Dancer Project. [http://www.perldancer.org],
- Fielding RT: REST: Architectural styles and the design of network-based software architectures. Doctoral dissertation, University of California, Irvine; 2000Google Scholar
- Picard tools version 1.85. [http://picard.sourceforge.net],
- XZ for Java version 1.2. [http://tukaani.org/xz/java.html],
- Appache HttpComponents version 4.2.3. [http://hc.apache.org/index.html],
- Ramachandran V, Shearer N, Jacob J, Sharma C, Thompson A: The architecture and ppGpp-dependent expression of the primary transcriptome of Salmonella Typhimurium during invasion gene expression. BMC Genomics. 2012, 13: 25-10.1186/1471-2164-13-25.View ArticlePubMed CentralPubMedGoogle Scholar
- Quinlan AR, Hall IM: BEDTools-User-Manual. [http://bedtools.googlecode.com/files/BEDTools-User-Manual.pdf],
- Searls DT: An estimator for a population mean which reduces the effect of large true observations. J Am Stat Assoc. 1966, 61 (316): 1200-1204. 10.1080/01621459.1966.10482204.View ArticleGoogle Scholar
- Yee TW, Yee MT, Suggests M: Package ‘VGAM’. 2012, CRANGoogle Scholar
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S, Feolo M, Fingerman IM, Geer LY, Helmberg W, Kapustin Y, Krasnov S, Landsman D, Lipman DJ, Lu Z, Madden TL, Madej T, Maglott DR, Marchler-Bauer A, Miller V, Karsch-Mizrachi I, Ostell J, Panchenko A, Phan L, Pruitt KD, Schuler GD, et al: Database resources of the national center for biotechnology information. Nucleic Acids Res. 2012, 40 (D1): D13-D25. 10.1093/nar/gkr1184.View ArticlePubMed CentralPubMedGoogle Scholar
- Hoffmann S, Otto C, Kurtz S, Sharma CM, Khaitovich P, Vogel J, Stadler PF, Hackermüller J: Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comput Biol. 2009, 5 (9): e1000502-10.1371/journal.pcbi.1000502.View ArticlePubMed CentralPubMedGoogle Scholar
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26 (6): 841-842. 10.1093/bioinformatics/btq033.View ArticlePubMed CentralPubMedGoogle Scholar
- Mitschke J, Vioque A, Haas F, Hess WR, Muro-Pastor AM: Dynamics of transcriptional start site selection during nitrogen stress-induced cell differentiation in Anabaena sp. PCC7120. Proc Nat Acad Sci. 2011, 108 (50): 20130-20135. 10.1073/pnas.1112724108.View ArticlePubMed CentralPubMedGoogle Scholar
- Sokolova M, Japkowicz N, Szpakowicz S: Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. AI 2006: Advances in Artificial Intelligence. 2006, Heidelberg: Springer Berlin, 1015-1021.View ArticleGoogle Scholar
- Dugar G, Herbig A, Förstner KU, Heidrich N, Reinhardt R, Nieselt K, Sharma CM: High-resolution transcriptome maps reveal strain-specific regulatory features of multiple campylobacter jejuni isolates. PLoS Genet. 2013, 9 (5): e1003495-10.1371/journal.pgen.1003495.View ArticlePubMed CentralPubMedGoogle Scholar
- Roscetto E, Angrisano T, Costa V, Casalino M, Förstner KU, Sharma CM, Di Nocera PP, De Gregorio E: Functional characterization of the RNA chaperone Hfq in the opportunistic human pathogen Stenotrophomonas maltophilia. J Bacteriol. 2012, 194 (21): 5864-5874. 10.1128/JB.00746-12.View ArticlePubMed CentralPubMedGoogle Scholar
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in bipolymers. Proc Second Int Conf Intell Syst Mol Biol. 1994, 2: 28-36.Google Scholar
- Pribnow D: Nucleotide sequence of an RNA polymerase binding site at an early T7 promoter. Proc Nat Acad Sci. 1975, 72 (3): 784-788. 10.1073/pnas.72.3.784.View ArticlePubMed CentralPubMedGoogle Scholar
- Grant CE, Bailey TL, Noble WS: FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011, 27 (7): 1017-1018. 10.1093/bioinformatics/btr064.View ArticlePubMed CentralPubMedGoogle Scholar
- Mitschke J, Vioque A, Haas F, Hess WR, Muro-Pastor AM: Dynamics of transcriptional start site selection during nitrogen stress-induced cell differentiation in Anabaena sp. PCC7120. Proc Nat Acad Sci. 2011, 108 (50): 20130-20135. 10.1073/pnas.1112724108.View ArticlePubMed CentralPubMedGoogle Scholar
- Schlüter JP, Reinkensmeier J, Barnett MJ, Lang C, Krol E, Giegerich R, Long SR, Becker A: Global mapping of transcription start sites and promoter motifs in the symbiotic α-proteobacterium Sinorhizobium meliloti 1021. BMC Genomics. 2013, 14: 156-10.1186/1471-2164-14-156.View ArticlePubMed CentralPubMedGoogle Scholar
- Mitschke J, Georg J, Scholz I, Sharma CM, Dienst D, Bantscheff J, Voß B, Steglich C, Wilde A, Vogel J, Hess WR: An experimentally anchored map of transcriptional start sites in the model cyanobacterium Synechocystis sp. PCC6803. Proc Nat Acad Sci. 2011, 108 (5): 2124-2129. 10.1073/pnas.1015154108.View ArticlePubMed CentralPubMedGoogle Scholar
- Nicolas P, Mäder U, Dervyn E, Rochat T, Leduc A, Pigeonneau N, Bidnenko E, Marchadier E, Hoebeke M, Aymerich S, Becher D, Bisicchia P, Botella E, Delumeau O, Doherty G, Denham EL, Fogg MJ, Fromion V, Goelzer A, Hansen A, Härtig E, Harwood CR, Homuth G, Jarmer H, Jules M, Klipp E, Le Chat L, Lecointe F, Lewis P, Liebermeister W, et al: Condition-dependent transcriptome reveals high-level regulatory architecture in Bacillus subtilis. Science. 2012, 335 (6072): 1103-1106. 10.1126/science.1206848.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.