Wavelet-based detection of transcriptional activity on a novel Staphylococcus aureus tiling microarray
- Víctor Segura^{1}Email author,
- Alejandro Toledo-Arana^{2},
- Maite Uzqueda^{1},
- Iñigo Lasa^{2} and
- Arrate Muñoz-Barrutia^{3}
https://doi.org/10.1186/1471-2105-13-222
© Segura et al.; licensee BioMed Central Ltd. 2012
Received: 27 June 2012
Accepted: 16 August 2012
Published: 5 September 2012
Abstract
Background
High-density oligonucleotide microarray is an appropriate technology for genomic analysis, and is particulary useful in the generation of transcriptional maps, ChIP-on-chip studies and re-sequencing of the genome.Transcriptome analysis of tiling microarray data facilitates the discovery of novel transcripts and the assessment of differential expression in diverse experimental conditions. Although new technologies such as next-generation sequencing have appeared, microarrays might still be useful for the study of small genomes or for the analysis of genomic regions with custom microarrays due to their lower price and good accuracy in expression quantification.
Results
Here, we propose a novel wavelet-based method, named ZCL (zero-crossing lines), for the combined denoising and segmentation of tiling signals. The denoising is performed with the classical SUREshrink method and the detection of transcriptionally active regions is based on the computation of the Continuous Wavelet Transform (CWT). In particular, the detection of the transitions is implemented as the thresholding of the zero-crossing lines. The algorithm described has been applied to the public Saccharomyces cerevisiae dataset and it has been compared with two well-known algorithms: pseudo-median sliding window (PMSW) and the structural change model (SCM). As a proof-of-principle, we applied the ZCL algorithm to the analysis of the custom tiling microarray hybridization results of a S. aureus mutant deficient in the sigma B transcription factor. The challenge was to identify those transcripts whose expression decreases in the absence of sigma B.
Conclusions
The proposed method archives the best performance in terms of positive predictive value (PPV) while its sensitivity is similar to the other algorithms used for the comparison. The computation time needed to process the transcriptional signals is low as compared with model-based methods and in the same range to those based on the use of filters. Automatic parameter selection has been incorporated and moreover, it can be easily adapted to a parallel implementation. We can conclude that the proposed method is well suited for the analysis of tiling signals, in which transcriptional activity is often hidden in the noise. Finally, the quantification and differential expression analysis of S. aureus dataset have demonstrated the valuable utility of this novel device to the biological analysis of the S. aureus transcriptome.
Keywords
Background
The complete deciphering of the information contained in the genome would be helpful to improve our understanding of the biological processes occurring in living organisms. High-density oligonucleotide-based whole-genome microarray is an extensively used technology to detect the expression of all RNA species including protein coding RNAs and non-coding RNAs. It is particularly suitable for the analysis of whole small-sized genomes such as those corresponding to bacteria. For these organisms high resolution can be achieved with the microarrays currently provided by the manufactures.
Applications of tiling array technology include the generation of transcriptional maps and annotations of genomes, the identification of transcription factor binding sites, the analysis of alternative splicing events, the analysis of methylation states, the discovery of genotyping and polymorphism, and the re-sequentation of genomes [1].
The emerging high-throughput next generation DNA sequencing (NGS) technologies [2] have revolutionized transcriptomics by allowing RNA analysis through cDNA sequencing on a massive scale (RNA-seq). Several limitations inherent to microarray technologies are overcome by NGS technologies, in particular, it is not necessary to design appropriate probes and the experimental reproducibility is guaranteed. However, the microarray design presented in [3] allowed a comprehensive examination of gene expression and genome-wide identification of alternative splicing as well as detection of coding and noncoding transcripts. This microarray (Affymetrix GG-H array) was compared with RNA-seq in [3]. The reproducibility in the estimation of gene and exon abundance was high and even more sensitive than RNA-seq at the exon level. This microarray design contains as targets near 50000 highly transcribed fragments of unknown functions from Affymetrix tiling microarray data [4]. The NGS experiments highlighted that 49% of these fragments had uniquely mapped reads, revealing a high degree of concordance between both technologies.
The analysis of a tiling microarray experiment starts with a two-step process that generates a discrete signal. First, the DNA or RNA samples are hybridized in the custom designed tiling array. Second, for each probe, the raw intensities are converted to a score [5]. The result is a discrete intensity signal with a value per probe.
Transcriptome analysis refers to the detection of segments where the noisy tiling signal is constant. The start and end points of these segments correspond to transcript start and end sites. Several approaches have been deployed in the segmentation of tiling signals: pseudo-median or Hodges-Lehmann estimator [6, 7], local non parametric smoothing [8, 9], hidden Markov models [10–13], circular binary segmentation [14] and structural change model [15, 16].
Wavelet analysis using the Discrete Wavelet Transform (DWT) [17] has demonstrated excellent performance in the analysis of ChIP-chip experiments using tiling array technology [18, 19]. In this paper, we propose a wavelet transform based method for the identification of TARs in tiling signals (ZCL). We have chosen the SUREShrink algorithm for denoising and a method based on the computation of the Continuous Wavelet Transform (CWT) for detection of transcription start and end sites. In particular, the sharp transitions of the tiling signal are identified as the zero-crossing lines of a multiresolution decomposition using as the mother wavelet the second derivative of a Gaussian [20]. We applied the proposed analysis to the public Saccharomyces Cerevisiae dataset to validate our analytical approach. The novel identification algorithm was compared with two well-known methods: pseudo-median sliding window (PMSW) and structural change model (SCM). The absence of a biologically validated ground truth to evaluate the resulting segmentations prevent the use of specificity and sensitivity as performance metrics. Consequently, the evaluation has been made in terms of positive predictive value (PPV), sensitivity and computation time. We also evaluated the segmentation quality resulting from the combination of the TARs detected by several of the methods under study.
We also used this algorithm for the identification of the subset of transcripts whose expression decreases in a S. aureus strain deficient in the sigma B transcription factor. SigB has been shown to be involved in the stress response to different stimuli, the regulation of sarA, sarH1, and agr that control a wide array of virulence factors, biofilm formation, the ability of S. aureus to bind to various host-cell matrix proteins such as fibrinogen and fibronectin, and in the development of resistance to the antibiotics methicillin and teicoplanin [21–26].
We applied the segmentation methods to this high quality dataset and we have demonstrated its usefulness for the analysis of the tiling array derived transcriptome map. The results demonstrate that ZCL not only allows a rapid identification of the transcripst based on the segmentation procedure but also a more accurate estimation of the expression level of each transcript.
Results and discussion
All the steps needed to obtain a trancriptional map from the raw data (read the CEL files, normalize, denoise and segment the tiling signal) have been implemented using the statistical language R/Bioconductor[27]. The CRAN packages Rwave, wavethresh and wmtsa have been used for wavelet analysis. All the R functions described are available as Supplementary Material (Additional file 1, Additional file 2 and Additional file 3). The R code to perform the example analyses and the generation of the figures included in the paper can also be found as Supplementary Material. The results show that wavelets compare well with the rest of methods in terms of segmentation accuracy and time consumed in the analysis.
Experimental datasets
Saccharomyces cerevisiae dataset
The dataset is described in [16]. An oligonucleotide array for S. cerevisiae was developed. It contains 6.5 million probes and interrogates both strands of the full genomic sequence with 25-mer probes tiled at an average of 8 nucleotide intervals on each strand and 4 nucleotide tile offset between strands. The first-strand cDNA was synthesized using random primers from poly(A) and total RNA. A set of genomic DNA was also hybridized for normalization purposes [15]. Their analysis of the transcription map identified the transcript boundaries, its structure and the intensity level of coding and non-coding transcripts [16]. All data (CEL files, bmap files for both strands and annotation file) was deposited in ArrayExpress database with accession number E-TABM-14.
Staphylococcus aureus dataset
The Staphylococcus aureus custom tiling microarray (NA-Staph-b520729F) was designed in collaboration with Affymetrix (Santa Clara, CA, USA). Specifically, the microarray (format 49-7875 with 11 μ m features) contains a total of 522,406 probes, divided into two parts. The first part corresponds to the tiling array containing a total of 384,932 probes (25-mer), which are further divided into eight sets. The set used in our analysis covers both strands of the S. aureus NCTC 8325 genome (2,821,347 bp covered by 363,127 probes). Each 25-mer probe was tiled each 14-nt across the whole genome, resulting in 11-nt overlaps and a 7-nt tile offset between strands. The microarray design has been deposited in the ArrayExpress Archive at EMBL-EBI (http://www.ebi.ac.uk/microarray-as/ae/), ArrayExpress accesion no. A-AFFY-165.
Before cDNA synthesis, RNA integrity from each sample was confirmed on Agilent RNA Nano LabChips (Agilent Technologies). 10 μ g of RNAs extracted from bacterial strains grown until exponential phase (OD600nm = 0.8) were reverse transcribed using SuperScript II reverse transcriptase (Invitrogen Life Technologies). They were processed following the protocol of the Affymetrix GeneChip Expression Analysis Technical Manual (P/N 702232 Rev. 2) in the presence of 6 ng/ml Actinomycin D to avoid spurious second-strand cDNA synthesis during the reverse transcription reaction [28]. Sense RNA corresponding to B. subtilis poly-A lys, phe, thr, trp, dap genes were spiked into sample RNA as a control for the labeling and hybridization steps. cDNA was digested by DNase I (PIERCE) in 10X DNAse I buffer (USB-Affymetrix) and the size of digestion products was analyzed in the Agilent Bioanalyser 2100 using RNA Nano LabChips to ensure that the fragmentation resulted in a majority of products in the range of 50 to 200 base-pairs. The fragmented cDNA were then biotinylated using terminal deoxynucleotidyl transferase (Promega) and the GeneChip DNA labeling reagent (Affymetrix) following the manufacturer’s recommendations. Biotinylated cDNA (5 μ g per array) were hybridized for 16 hours according to the Affymetrix protocol in a total volume of 200 μ l per hybridization chamber. Following incubation, the arrays were washed and stained in the Fluidics station 450 (Affymetrix) using the protocol FS450_0005. The arrays were then scanned using the GeneChip scanner 3000 (Affymetrix). The intensity signals of each probe cell were computed by the GeneChip operating software (GCOS) and stored in cell intensity files (.CEL extension) before preprocessing and analysis. All microarray data described in this study have been deposited in the ArrayExpress Archive at EMBL-EBI (http://www.ebi.ac.uk/microarray-as/ae/), ArrayExpress accesion no. E-MEXP-2778.
Probe annotation and normalization
The annotation of the PM probe sequences was obtained with the alignment to the genome sequence of S. cerevisiae strain S288c (SGD of August 7, 2005) as provided in the package davidTiling of Bioconductor. Available data correspond to 3 replicates of poly(A), 2 replicates of total RNA and 3 replicates of genomic DNA. The CEL files were read and the normalized signals (poly(A) and total RNA) were obtained using Equation 4. The analysis steps (denoising, segmentation and detection of TARs) were performed on the poly(A) signal as it showed an improved hybridization quality [16]. Once the signal is constructed from CEL and annotation files we used tilingArray package functions to obtain equally-spaced samples. Other resampling methods can be applied without loss of generality.
The annotation files for S. aureus microarray are provided in the ArrayExpress database (A-AFFY-165). The microarrays of the experiment correspond to three replicates of genomic DNA, three replicates of RNA of the 15981 wild-type strain, and three replicates of the sigB deletion. All the preprocessing steps were performed as previosly described for S. cerevisiae dataset.
Denoising
Segmentation
The TAR start and end positions were defined as the transition locations for which the difference between the mean intensity of neighboring segments is greater than 10% of the dynamic range of the tiling signal. Moreover, the inspection of the intensity histogram of chromosome 1 forward strand was used to set the minimum normalized transcription level value to −2. The same parameters were adopted to process the other strands of the organism. The R function segmentZCL (provided as Supplementary Material) implements the whole segmentation procedure.
Segmentations comparison using S. cerevisiae dataset
The results from the ZCL segmentation were compared to those obtained with PSMW [6] and SCM [15]. The robust PMSW method is based on the calculation of a pseudo-median within a sliding window. The local expression level is computed with the Hodges-Lehmann estimator [31] on the RNA normalized signal. To be able to do this, the Tilescope pipeline [7] was implemented. Once the candidate transcript regions were determined, the TARs were assembled by the combination of a normalization intensity threshold and a max-gap and min-run criteria. The former is defined as the maximum distance below which two adjacent transcribed probes are included in the same TAR. The later as the minimum length of a feature to be classified as a transcribed region.
Huber’s method is based on the structural change model (SCM). The SCM model [15, 16] is used in econometrics for the modeling of sharp transitions in financial time series. It has been applied to the segmentation of comparative genomics hybridization (CGH) data [32]. The signal is modeled as a piecewise constant function of chromosomal coordinates described using the segment boundaries, the maximum number of segments and the mean signal value for each segment. The method is applied independently to each chromosome and, if the signal is strand-specific, to each of its two strands. A dynamic programming algorithm part of the tilingArray package of Bioconductor computes a globally optimal set of parameters for segmentations of increasing number of segments.
Due to the lack of a biologically validated ground truth to evaluate the outputs, we compared the methods in terms of two metrics, sensitivity and positive predictive value (PPV) at probe-level. We define sensitivity as the number of probes in the detected TARs that overlap with annotated regions (true positives, TP) divided by the total number of probes in the annotated regions (sum of true positives and false negatives, TP + FN): Sensitivity=TP/(TP + FN). The PPV is defined as the number of probes in the detected TARs that overlap with annotated regions (TP) divided by the total number of probes in the detected TARs (sum of true and false positives, TP + FP): PPV=TP/(TP + FP). A sensitivity of 100% is not expected since in any given tissue or cell line at any given experimental condition, not all known genes will be expressed. Also, a PPV of 100% is not expected since an accurate and complete gene annotation is not available [5, 33].
The PMSW and SCM methods were applied to the S. cerevisiae using the parameters previously reported in the literature [6, 15]. In particular, the PMSW method used a bandwidth size (BW) of 3, a normalized intensity threshold equal to −2, a separation between the probes in a TAR (maxgap) of 10 and a minimum acceptable TAR size (minrun) of 90. The maximum number of segments for the SCM method was fixed to 1500. In the case of ZCL we selected the following parameters: 100 wavelet scales, minimum TAR size of 10 and minimum value of normalized intensity transcription equal to −2. The zero-crossing line length threshold was computed based on the histogram of line lengths.
Evaluation metrics for S. cerevisiae dataset
Evaluation metrics (S. cerevisiae) | ||||
---|---|---|---|---|
Method | TARs | PPV | Sensitivity | Time (min) |
PMSW | 22114 | 0.7416 | 0.4700 | 2.88 |
SCM | 11246 | 0.7847 | 0.3904 | 79.09 |
ZCL | 18209 | 0.8486 | 0.3760 | 13.02 |
ZCLSure | 22513 | 0.8547 | 0.3686 | 10.70 |
In-depth analysis of chromosome 1 gives interesting insights into concerning the relationship between methods. In the forward strand, the number of probes annotated as genes is 12796 representing 19.35% of the total number of probes. 65.16% of probes are correctly classified by the three algorithms (12.21% of gene probes and 52.95% of non-gene probes). From the annotated probes, 63.12% are detected by all methods, while only 11.85% of the probes are not detected by either of them. This means that 88.15% of the annotated probes are detected by at least one of the methods. The reverse strand contains 11866 annotated probes (17.90% of probes located in this strand), from which 19.74% are considered part of a TAR by all the methods and 28.35% are true negative probes. In this strand, 51.41% of the annotated probes are included in a TAR by any method while only 14.35% are never detected. In other words, 85.65% of the probes in the strand are detected by at least one algorithm. In light of this outcome, we considered it worthwhile to evaluate if the combination of results computed with the different methods would improve the performance of the segmentation.
Combination of TAR probes candidates
Evaluation of segmentation combinations for both strand of chromosome 1
Integrative transcriptional analysis | ||||
---|---|---|---|---|
Method | PPV Forward | Sensitivity Forward | PPV Reverse | Sensitivity Reverse |
PMSW | 0.6511 | 0.5873 | 0.5811 | 0.2073 |
SCM | 0.7188 | 0.4390 | 0.6968 | 0.2146 |
ZCL | 0.8675 | 0.3821 | 0.8220 | 0.2208 |
PMSW ⋂ SCM ⋂ ZCL | 0.6312 | 0.5984 | 0.5441 | 0.2030 |
PMSW ⋂ ZCL | 0.6448 | 0.5921 | 0.5744 | 0.2076 |
PMSW ⋂ SCM | 0.6370 | 0.5957 | 0.5504 | 0.2043 |
SCM ⋂ ZCL | 0.7053 | 0.4409 | 0.6626 | 0.2116 |
Majority voting | 0.7247 | 0.4396 | 0.6993 | 0.2164 |
Computational performance
S. cerevisiae analyses were executed in an Intel(R) Xeon(R) processor server (64 bits, 4 cores, 2 GHz) with 32 Gb installed memory running Red Hat Enterprise Linux AS release 4 and R 2.13.0. Computing times needed to process each chromosome strand with the described methods are shown in Figure 5. For the same signal length, longer computation time is required for SCM, while comparable times are needed for PMSW and ZCL. The mean time to segment the whole transcriptome is 2.88 mins for PMSW, 13.02 mins for ZCL and 79.09 mins for SCM.
Differential expression analysis of S. aureus sigmaB mutant
Evaluation metrics for S. aureus dataset
Evaluation metrics (S. aureus) | |||||
---|---|---|---|---|---|
Tiling Signal | Metric | PMSW | SCM | ZCL | ZCLSure |
WT Forward | PPV | 0.6298 | 0.6498 | 0.6248 | 0.6407 |
WT Forward | Sens | 0.8657 | 0.8766 | 0.8715 | 0.8719 |
WT Reverse | PPV | 0.6867 | 0.6993 | 0.7050 | 0.6989 |
WT Reverse | Sens | 0.8506 | 0.8560 | 0.8388 | 0.8535 |
Δ sigB Forward | PPV | 0.6238 | 0.6388 | 0.6227 | 0.6308 |
Δ sigB Forward | Sens | 0.9054 | 0.9035 | 0.9027 | 0.9036 |
Δ sigB Reverse | PPV | 0.6664 | 0.6815 | 0.6765 | 0.6748 |
Δ sigB Reverse | Sens | 0.8697 | 0.8684 | 0.8667 | 0.8515 |
The most frequent transcriptional analysis is the detection of genes that have changed their expression in the conditions under study (differential expression analysis). As sigma B affects the expression of more than one hundred genes, we decided to test whether it is possible to use the intensity of all the probes included in each detected TAR with the ZCL segmentation procedure to calculate the expression level of the transcript in a particular environmental condition. In order to carry out this analysis using tiling microarrays we need to compress the intensity of all the probes included in each detected TAR into one value. Standard methods for microarray normalization can be applied, for example RMA (Robust Multichip Average) algorithm in the case of Affymetrix microarrays [34]. This processing can be performed using the packages affxparser, affy and limma of Bioconductor for CDF (chip definition file) generation, normalization and differential expression analysis.
Conclusions
Transcriptomics is a powerful technology for the study of gene structures and RNA-based regulation in any organism. Genome-wide transcriptome analysis of prokaryotes can be carried out with any of these two techniques: RNA-seq and genomic tiling arrays [35]. High-resolution tiling arrays have been used, among others, to study the transcriptomes of Caulobacter crescentus[36], Escherichia coli and [37], Listeria monocytogenes[38].
In this paper, we propose a combined WT-based method for the denoising and segmentation of tiling signals. For illustrative and evaluative purposes, we applied the proposed analysis to the public S. cerevisiae. Our denoising results show an increase in the SNR of the filtered signal with respect to Huber’s method [15]. We believe it is advisable to properly denoise the tiling signal before segmentation as the number of false positives induced by signal variability is thus reduced. Even when constructing a manual segmentation, it seems an advisable choice to mark the transitions on the denoised signal, as its improved quality could help the expert to better discriminate between low expression transcripts and noise.
Our segmentation algorithm (ZCL) calculates all the possible break points based on the zero-crossing lines of the second derivative of the Gaussian wavelet. The results show that our method achieves the best compromise between accuracy (evaluated in terms of PPV and sensitivity) and computation time. The R code provided can be used to apply our algorithm as well as to combine the resulting segmentation with other methods as PMSW and SCM.
We also designed a new tiling microarray for the analysis of S. aureus genome, publicly available in the ArrayExpress database (accession number A-AFFY-165). This platform has been used for the comparison of the gene expression pattern of the S. aureus 15981 wild type and its isogenic sigB mutant. We selected this mutant because it is one of the most study staphylococcal regulatory factors and consequenctly it was a useful gold standard to compare the accuracy of our algorithms. The relevance of the segmentation results comes from the fact that a correct analysis of the tiling signals could improve the matching between the probes and the corresponding transcriptional units. In particular, it could help to more precisely localize the start and end transcription sites or even, include units that are not annotated in the current genome definition.
Once the TARs are properly detected, differentially expressed transcripts can be identified by well-known methods (such as Linear Models for Microarray Data (LIMMA) [39]) with a previous probe summarization algorithm to generate the transcript annotation (using, for example, Robust Microarray Analysis (RMA) [34]). In practice, this means that differential gene expression analysis could benefit from an enhanced analysis of tiling signals such as the one proposed here. To confirm the accuracy of the proposed method, we introduced a simple measure based on the median of TAR probe intensity. Using this approach, known up-regulated (nuc, aur, hla) and down-regulated (asp23, lysP) genes in sigB mutant were verified.
In conclusion, we present a novel method for denoising and segmentation of tiling microarray signals based on wavelet multiresolution analysis that outperforms previous methods in terms of SNR, positive predictive value and computation time. The R code that implements the method is given as supplementary material and can be easily adapted to a parallel computing schema. Also, we have introduced the possibility of combining the results of ZCL with those obtained with other two well-known approaches (PMSW and SCM) for the segmentation of tiling signals.
Methods
WT-based analysis
where $a\in {\mathbb{R}}^{+}-\left\{0\right\}$ is the scale, $b\in \mathbb{R}$ is the translation, ψ(x) is the mother wavelet, ψ^{∗}((b−x)/a) is the complex conjugated, scaled and translated wavelet and CWT is the 2D matrix of wavelet coefficients. The continuous input signal s(x) interpolates the discrete input samples s k,k=1,…,n where n is the length of the signal.
The CWT can be interpreted as the correlation of the input signal with a position reversed version of ψ rescaled by a factor a. For an 1D input signal, the result is a 2D description of the signal with respect to the position b and scale a and shifted by b. The scale a is inversely proportional to the central frequency of the dilated wavelet ψ_{ a }=ψ(x/a), which is typically a bandpass function; b represents the position location at which we analyze the signal. The larger the scale a, the wider the analyzing function ψ_{ a }, and hence the smaller the corresponding analyzed frequency. The output value is maximized when the frequency of the signal matches that of the corresponding dilated wavelet. The CWT computation for arbitrary scales can be easily adapted to a parallel implementation with a linear computational complexity [41].
Mallat’s fast wavelet algorithm [42] uses the multiresolution properties of the wavelet to compute the CWT at dyadic scales a=2^{ i } and time shifts b=2^{ i }k, $k\in ,\mathbb{Z}$, resulting in what is known as DWT. For additional information about the wavelet transform and its properties the reader is referred to [17].
Normalization of tiling microarray data
where n is the number of RNA samples and m is the number of DNA samples.
WT-based denoising
where n is the length of s, σ is the noise level and MAD is the estimated median absolute deviation in the first scale. An important issue is the selection of a suitable wavelet function. As the signal can be roughly approximated to a zero-order polynomial, a boxcar-like function such as the Haar wavelet gives a reasonable level of correlation (i.e., a good pattern matching) with the target signal.
For a large dimension n the law of large numbers will ensure that T_{SURE} will be almost the optimal threshold [30].
WT-based segmentation
An important issue in signal processing is to define an appropriate representation able to compress most of the signal information into few representative features. Sharp variations in amplitude (i.e., transitions and peaks) are among the most meaningful features of a signal. For that reason, many segmentation algorithms rely on their detection. Previous studies have detected the peaks in mass spectrometry data using either the ridge lines [43] or the zero-crossing lines [44] in a multi-scale decomposition of the signal. Zero-crossing lines seems a more consistent description as they belong to connected curves, are more robust to noise and easier to detect that ridge lines [44].
Hence, the wavelet transform of s(x) is proportional to the second derivative of s(x)smoothed by θ_{ a }(x). The zero-crossings of CWT(a b) correspond to the inflection points of s∗θ_{ a }. The identification of transcript start and end sites is achieved by computation of the redundant CWT over a wide scale range followed by zero-crossing line detection and length thresholding. The chosen mother wavelet is the second derivative of a Gaussian. The redundancy of the CWT yields enhanced information on the position-scale localization of the features of interest (in this case, the transitions) [46].
Identification of transcriptional active regions
The candidates start and end sites detected as described in the previous section, are filtered to remove incorrect assignments. The purpose of this procedure is to filter those transitions that do not correspond to variations in signal intensity. For the generation of TARs we considered the signal transitions in which variation in intensity is at least 10% of the dynamic range of the analyzed signal. We also eliminate from the list of detected TAR all the start and end points that are not correctly paired off. We use the sign of the zero-crossing lines to separate start and end points and we match each start site with its corresponding end site. Finally, we define the minimum normalized intensity threshold required for the segments to be considered as transcriptional active regions. This value is calculated as the median of the signal intensity distribution, but this threshold can also be user-defined. In order to improve the definition of TARs, we cluster together consecutive segments for which the mean normalized intensity value is over the threshold.
Declarations
Acknowledgements
We thank Prof. Fernando J. Corrales and Lourdes Ortiz (Genomics Core Facility) for technical support and all the useful comments about the manuscript. This work was supported by the spanish Torres-Quevedo fellowship [PTQ-08-03-07769] to VS. ATA and AMB were supported by Spanish Ministry of Science and Innovation ‘Ramón y Cajal’ contracts. This work was supported by the Spanish Ministry of Science and Innovation Grants BIO2008-05284-C02-01, BFU2011-23222, ERA-NET Pathogenomics PIM2010EPA-00606 and the agreement between ‘Fundación para la Investigación médica aplicada’ (FIMA) and the ’UTE project CIMA’.
Authors’ Affiliations
References
- Mockler TC, Chan S, Sundaresan A, Chen H, Jacobsen SE, Ecker JR: Applications of DNA tiling arrays for whole-genome analysis. Genomics 2005, 85: 1–15. 10.1016/j.ygeno.2004.10.005View ArticlePubMedGoogle Scholar
- Ozsolak F, Milos PM: RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 2011, 12(2):87–98. 10.1038/nrg2934PubMed CentralView ArticlePubMedGoogle Scholar
- Xu W, Seok J, Mindrinos MN, Schweitzer AC, Jiang H, Wilhelmy J, Clark TA, Kapur K, Xing Y, Faham M, Storey JD, Moldawer LL, Maier RV, Tompkins RG, Wong WH, Davis RW, Xiao W, the Inflammation and Host Response to Injury Large-Scale Collaborative Research Program: Human transcriptome array for high-throughput clinical studies. Proc Natl Acad Sci U S A 2011, 108(9):3707–3712. 10.1073/pnas.1019753108PubMed CentralView ArticlePubMedGoogle Scholar
- Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, Bel lI, Cheung E, Drenkow J, Dumais E, Patel S, Helt G, Ganesh M, Ghosh S, Piccolboni A, Sementchenko V, Tammana H, Gingeras TR: RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 2007, 316(5830):1484–1488. 10.1126/science.1138341View ArticlePubMedGoogle Scholar
- Emanuelsson O, Nagalakshmi U, Zheng D, Rozowsky JS, Urban AE, Du J, Lian Z, Stolc V, Weissman S, Snyder M, Gerstein MB: Assessing the performance of different high-density tiling microarray strategies for mapping transcribed regions of the human genome. Genome Res 2007, 17(6):886–897. 10.1101/gr.5014606PubMed CentralView ArticlePubMedGoogle Scholar
- Kampa D, Cheng J, Kapranov P, Yamanaka M, Brubaker S, Cawley S, Drenkow J, Piccolboni A, Bekiranov S, Helt G, Tammana H, Gingeras TR: Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res 2004, 14(3):331–342. 10.1101/gr.2094104PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang ZD, Rozowsky J, Lam HY, Du J, Snyder M, Gerstein M: Tilescope: online analysis pipeline for high-density tiling microarray data. Genome Biol 2007., 8(5):Google Scholar
- Royce TE, Carriero NJ, Gerstein MB: An efficient pseudomedian filter for tiling microrrays. BMC Bioinformatics 2007, 8: 186–186. 10.1186/1471-2105-8-186PubMed CentralView ArticlePubMedGoogle Scholar
- Wang LY, Abyzov A, Korbel JO, Snyder M, Gerstein M: MSB: a mean-shift-based approach for the analysis of structural variation in the genome. Genome Res 2009, 19: 106–117.PubMed CentralView ArticlePubMedGoogle Scholar
- Ji H, Wong WH: TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics 2005, 21(18):3629–3636. 10.1093/bioinformatics/bti593View ArticlePubMedGoogle Scholar
- Du J, Rozowsky JS, Korbel JO, Zhang ZD, Royce TE, Schultz MH, Snyder M, Gerstein M: A supervised hidden markov model framework for efficiently segmenting tiling array data in transcriptional and chIP-chip experiments: systematically incorporating validated biological knowledge. Bioinformatics 2006, 22(24):3016–3024. 10.1093/bioinformatics/btl515View ArticlePubMedGoogle Scholar
- Toyoda T, Shinozaki K: Tiling array-driven elucidation of transcriptional structures based on maximum-likelihood and Markov models. Plant J 2005, 43(4):611–621. 10.1111/j.1365-313X.2005.02470.xView ArticlePubMedGoogle Scholar
- Nicolas P, Leduc A, Robin S, Rasmussen S, Jarmer H, Bessières P: Transcriptional landscape estimation from tiling array data using a model of signal shift and drift. Bioinformatics 2009, 25(18):2341–2347. 10.1093/bioinformatics/btp395PubMed CentralView ArticlePubMedGoogle Scholar
- Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004, 5(4):557–572. 10.1093/biostatistics/kxh008View ArticlePubMedGoogle Scholar
- Huber W, Toedling J, Steinmetz LM: Transcript mapping with high-density oligonucleotide tiling arrays. Bioinformatics 2006, 22(16):1963–1970. 10.1093/bioinformatics/btl289View ArticlePubMedGoogle Scholar
- David L, Huber W, Granovskaia M, Toedling J, Palm CJ, Bofkin L, Jones T, Davis RW, Steinmetz LM: A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci U S A 2006, 103(14):5320–5325. 10.1073/pnas.0601091103PubMed CentralView ArticlePubMedGoogle Scholar
- Mallat S: A Wavelet Tour of Signal Processing. San Diego: Academic Press; 1998.Google Scholar
- Karpikov A, Rozowsky J, Gerstein M: Tiling array data analysis: a multiscale approach using wavelets. BMC Bioinformatics 2011, 12: 57–57. 10.1186/1471-2105-12-57PubMed CentralView ArticlePubMedGoogle Scholar
- Buck MJ, Lieb JD: ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 2004, 83(3):349–360. 10.1016/j.ygeno.2003.11.004View ArticlePubMedGoogle Scholar
- Mallat S: Zero-crossings of a wavelet transform. IEEE Trans Info Th 1991, 37(4):1019–1033. 10.1109/18.86995View ArticleGoogle Scholar
- Bischoff M, Dunman P, Kormanec J, Macapagal D, Murphy E, Mounts W, Berger-Bächi B, Projan S: Microarray-based analysis of the Staphylococcus aureus sigmaB regulon. J Bacteriol 2004, 186(13):4085–4099. 10.1128/JB.186.13.4085-4099.2004PubMed CentralView ArticlePubMedGoogle Scholar
- Gertz S, Engelmann S, Schmid R, Ohlsen K, Hacker J, Hecker M: Regulation of sigmaB-dependent transcription of sigB and asp23 in two different Staphylococcus aureus strains. Mol Gen Genet 1999, 261(3):558–566. 10.1007/s004380051001View ArticlePubMedGoogle Scholar
- Kullik I, Giachino P, Fuchs T: Deletion of the alternative sigma factor sigmaB in Staphylococcus aureus reveals its function as a global regulator of virulence genes. J Bacteriol 1998, 180(18):4814–4820.PubMed CentralPubMedGoogle Scholar
- Ziebandt AK, Weber H, Rudolph J, Schmid R, Höper D, Engelmann S, Hecker M: Extracellular proteins of Staphylococcus aureus and the role of SarA and sigma B. Proteomics 2001, 1(4):480–493. 10.1002/1615-9861(200104)1:4<480::AID-PROT480>3.0.CO;2-OView ArticlePubMedGoogle Scholar
- Karlsson A, Arvidson S: Variation in extracellular protease production among clinical isolates of Staphylococcus aureus due to different levels of expression of the protease repressor sarA. Infect Immun 2002, 70(8):4239–4246. 10.1128/IAI.70.8.4239-4246.2002PubMed CentralView ArticlePubMedGoogle Scholar
- Giachino P, Engelmann S, Bischoff M: Sigma(B) activity depends on RsbU in Staphylococcus aureus. J Bacteriol 2001, 183(6):1843–1852. 10.1128/JB.183.6.1843-1852.2001PubMed CentralView ArticlePubMedGoogle Scholar
- Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S: Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Statistics for Biology and Health). Secaucus, NJ, USA: Springer-Verlag New York, Inc.; 2005.View ArticleGoogle Scholar
- Perocchi F, Xu Z, Clauder-Münster S, Steinmetz LM: Antisense artifacts in transcriptome microarray experiments are resolved by actinomycin D. Nucleic Acids Res 2007, 35(19):e128. 10.1093/nar/gkm683PubMed CentralView ArticlePubMedGoogle Scholar
- Donoho D, Johnstone I, Johnstone IM: Ideal Spatial Adaptation by Wavelet Shrinkage. Biometrika 1993, 81: 425–455.View ArticleGoogle Scholar
- Donoho D, Johnstone IM: Adapting to Unknown Smoothness via Wavelet Shrinkage. J Am Stat Assoc 1995, 90(432):1200–1224. 10.1080/01621459.1995.10476626View ArticleGoogle Scholar
- Hollander M, Wolfe DA: Nonparametric Statistical Methods, 2nd Edition. Wiley-Interscience; 1999.Google Scholar
- Picard F, Robin S, Lavielle M, Vaisse C, Daudin JJ: A statistical approach for array CGH data analysis. BMC Bioinformatics 2005, 6: 27–27. 10.1186/1471-2105-6-27PubMed CentralView ArticlePubMedGoogle Scholar
- Kim Y, Bekiranov S, Lee JK, Park T: Double error shrinkage method for identifying protein binding sites observed by tiling arrays with limited replication. Bioinformatics 2009, 25(19):2486–2491. 10.1093/bioinformatics/btp471PubMed CentralView ArticlePubMedGoogle Scholar
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31(4):e15. 10.1093/nar/gng015PubMed CentralView ArticlePubMedGoogle Scholar
- Sorek R, Cossart P: Prokaryotic transcriptomics: a new view on regulation, physiology and pathogenicity. Nat Rev Genet 2010, 11: 9–16.View ArticlePubMedGoogle Scholar
- Landt SG, Abeliuk E, McGrath PT, Lesley JA, McAdams HH, Shapiro L: Small non-coding RNAs in Caulobacter crescentus. Mol Microbiol 2008, 68(3):600–614. 10.1111/j.1365-2958.2008.06172.xView ArticlePubMedGoogle Scholar
- Cho BK, Zengler K, Qiu Y, Park YS, Knight EM, Barrett CL, Gao Y, Palsson BØ: The transcription unit architecture of the Escherichia coli genome. Nat Biotechnol 2009, 27(11):1043–1049. 10.1038/nbt.1582View ArticlePubMedGoogle Scholar
- Toledo-Arana A, Dussurget O, Nikitas G, Sesto N, Guet-Revillet H, Balestrino D, Loh E, Gripenland J, Tiensuu T, Vaitkevicius K, Barthelemy M, Vergassola M, Nahori MA, Soubigou G, Régnault B, Coppée JY, Lecuit M, Johansson J, Cossart P: The Listeria transcriptional landscape from saprophytism to virulence. Nature 2009, 459(7249):950–956. 10.1038/nature08080View ArticlePubMedGoogle Scholar
- Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3: Article3.PubMedGoogle Scholar
- Daubechies I: Ten Lectures on Wavelets (CBMS-NSF Regional Conference Series in Applied Mathematics). 1 edition. SIAM: Society for Industrial and Applied Mathematics; 1992.Google Scholar
- Munoz Barrutia A, Ertle R, Unser M:Continuous Wavelet Transform with Arbitrary Scales and $\mathcal{O}\mathbf{\left(}\mathbf{N}\mathbf{\right)}$ Complexity. Signal Processing 2002, 82(5):749–757. 10.1016/S0165-1684(02)00140-8View ArticleGoogle Scholar
- Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. Pattern Anal and Machine Intelligence, IEEE Trans on 1989, 11(7):674–693. 10.1109/34.192463View ArticleGoogle Scholar
- Du P, Kibbe WA, Lin SM: Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics 2006, 22(17):2059–2065. 10.1093/bioinformatics/btl355View ArticlePubMedGoogle Scholar
- Nguyen N, Huang H, Oraintara S, Vo A: Mass spectrometry data processing using zero-crossing lines in multi-scale of Gaussian derivative wavelet. Bioinformatics 2010, 26(18):659–665. 10.1093/bioinformatics/btq397View ArticleGoogle Scholar
- Marr D, Hildreth E: Theory of Edge Detection. Proc R Soc London Ser B 1980, 207: 187–217. 10.1098/rspb.1980.0020View ArticleGoogle Scholar
- Rioul O, Vetterli M: Wavelets and signal processing. IEEE Signal Process Mag 1991, 8(4):14–38.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.