NanoStriDE: normalization and differential expression analysis of NanoString nCounter data
© Brumbaugh et al; licensee BioMed Central Ltd. 2011
Received: 15 July 2011
Accepted: 16 December 2011
Published: 16 December 2011
The nCounter analysis system (NanoString Technologies, Seattle, WA) is a technology that enables the digital quantification of multiplexed target RNA molecules using color-coded molecular barcodes and single-molecule imaging. This system gives discrete counts of RNA transcripts and is capable of providing a high level of precision and sensitivity at less than one transcript copy per cell.
We have designed a web application compatible with any modern web browser that accepts the raw count data produced by the NanoString nCounter analysis system, normalizes it according to guidelines provided by NanoString Technologies, performs differential expression analysis on the normalized data, and provides a heatmap of the results from the differential expression analysis.
NanoStriDE allows biologists to take raw data produced by a NanoString nCounter analysis system and easily interpret differential expression analysis of this data represented through a heatmap. NanoStriDE is freely accessible to use on the NanoStriDE website and is available to use under the GPL v2 license.
In recent years, RNA expression studies have relied on two major technologies: microarrays and high-throughput sequencing. Although both of these methods have proven their utility in biological assays [1, 2], each has its limitations. Microarray analyses offer low cost, transcriptome-wide assays, but are hindered by a low dynamic range of detection [3, 4]. RNA-Seq experiments offer greater sensitivity and digital measurements of transcript abundance, but require complex sequence analysis and have a relatively high cost [5, 6]. The NanoString nCounter is a novel RNA-based technology whose costs and abilities place it firmly between microarrays and RNA-Seq .
The nCounter allows digital quantification of multiplexed target molecules through the use of color-coded molecular barcodes and a single-molecule imaging system . By providing discrete counts of RNA transcripts, the nCounter overcomes the saturation limitations of microarrays while avoiding the complex sequence analysis necessitated by RNA-Seq. The platform can quantify up to 800 different RNA targets simultaneously in up to 12 samples per run, making the system ideal for studying differential expression and microRNA expression assays. Hands-on time for the assay is under one hour and the full process from input RNA to data output can be completed in one to two days.
Differential expression analysis is performed differently for microarrays and RNA-Seq because of distinctions in the underlying technologies. Microarrays measure expression levels using optical detection of fluorescent intensities. Two-sided t-tests are normally used with microarray data as log normalized fluorescent intensity levels are well modelled by the Gaussian distribution. RNA-Seq uses sequence coverage as a measurement of expression, producing discrete rather than continuous data. The use of a discrete distribution like the Poisson, or more appropriately the negative binomial [9, 10], is appropriate in this case. The NanoString nCounter is most similar to RNA-Seq in that it processes discrete counts of measurement similar to RNA-Seq; as such, it is more appropriate to utilize differential expression analysis tools developed for RNA-Seq for data generated by the NanoString nCounter. For historical reasons and to allow comparisons, our server allows for both t-tests as well as a negative binomial-based test for the discovery of differentially expressed genes, a feature that allows us to offer an array of options in the web-based data analysis package described below.
Although the technology overcomes major limitations of microarrays and high throughput sequencing, the data produced by the NanoString nCounter is not in a directly usable format and must be compiled, normalized, and analyzed using a variety of statistical methods. Guidelines to achieve this are provided by the company; however, the process is complex and generally requires the use of Microsoft Excel running on a Windows system. Moreover, performing statistical analysis of the data to explore relevant trends and produce differential expression maps requires familiarity with R and associated packages like Bioconductor and DESeq. The use of this software and an understanding of which analyses must be performed and in what order is not part of most biologists' skill sets and so acts as a needless impediment to their effective use of nCounter data.
Our NanoString web application, NanoString Differential Expression (NanoStriDE), provides a configurable, extremely simple-to-use interface as a front end to a fully automated analysis pipeline of R libraries and Perl scripts. nCounter data is uploaded directly to the website. Processed data with a differential expression heatmap is made available for download within minutes.
Controls and normalization
where c is count data for a gene in a given sample, m is the mean of the sum of the positive controls across all samples, and s is the sum of all of the positive controls for that given sample. The positive correction is applied to the data only if the t-test or one-way ANOVA was chosen. DESeq and one-way ANOVA (negative binomial) using DESeq ANODEV uses DESeq's built in normalization methods.
where c is all count data for a given gene, m is the mean of the count data for that gene across all samples, and p is the p-value cutoff used. If the p-value is at or above the provided cutoff (default 0.05), the counts for the gene are not significant and are set to 0 for all samples. If the p-value is below the provided cutoff, the mean for the count data for the gene are subtracted from the counts for the gene and count values that fall below zero are set to zero. As with the positive correction, the negative correction is applied to the data only if the t-test or one-way ANOVA was chosen. DESeq and one-way ANOVA (negative binomial) using DESeq ANODEV uses DESeq's built in normalization methods.
The normalization employed after the positive and negative correction steps depend on the type of statistical analysis selected. If DESeq or one-way ANOVA (negative binomial) was chosen for the differential expression, then the normalization process used within DESeq is applied to the data and the sample content normalization is not applied to the data as application of the sample content normalization and the DESeq normalization would be redundant. Refer to DESeq for the details of the normalization process applied . If the t-test or either one-way ANOVA is chosen for the differential expression, then one of three sample content normalization choices is applied:
Option 1: Normalize to the mRNA housekeeping genes. This normalization is applied by using Formula 1, where c is the count data, m is the mean of the sum of the housekeeping genes across all samples, and s is the sum of the housekeeping genes for a given sample.
Option 2: Normalize to the entire miRNA sample. The normalization is employed by using Formula 1, where c is the count data, m is the mean of the sums of the counts for each sample, and s is the sum of all of the counts for each sample.
Option 3: Normalize to the highest miRNA in an assayed sample. Unlike Option 2, which uses all of the genes, this option normalizes the data to only the highest miRNA counts (default top 75) for a given sample (default sample 1) for the normalization factor.
After all of the relevant correction and normalization steps have been applied, statistical tests are applied to determine differentially expressed genes. If the data is assumed to be distributed normally, a t-test (for two conditions) or one-way ANOVA (for three or more conditions) can be performed on the sample content-normalized data to determine statistical significance. If a negative binomial distribution is assumed, DESeq (for two conditions, using the DESeq R library) or a one-way ANODEV (for three or more conditions, using the DESeq R library) can be used on the DESeq normalized to determine statistical significance. When estimating the dispersion factors in DESeq used for normalization, different parameters are implemented. If only single replicates are used for each condition, the blind method is used to estimate dispersions with a fit-only sharing mode. If any condition has only two biological replicates, the pooled method is used with a fit-only sharing mode. If neither of the previous two conditions is met, a pooled method with a maximum sharing mode is used to estimate the dispersions.
Heatmap generation and downloadable output
NanoStriDE is configured to be easily accessible to biologists who employ nCounter to perform sophisticated experiments, so that they may complete their analyses without specialized training in biostatistics. A user's guide outlining how to use the website as well as details on all options and warnings is available on the NanoStriDE website. The analyzed data from a completed job can be downloaded in a number of stages and formats from NanoStriDE. Corrected raw, normalized data, results of differential expression analysis for all probes, user settings, and customized readme file are available for download in text format.
The NanoStriDE web application was developed to assist biologists with performing and interpreting differential expression analaysis from NanoString nCounter system data. The high resolution heatmaps produced by NanoStriDE are suitable for use as figures in publications. If the user wishes to independently perform differential analysis on the data, the normalized data is additionally provided in the output of the completed job. NanoStriDE was designed for ease of use allowing biologists with limited computational experience to perform sophisticated differential expression analysis.
Availability and requirements
Project Name: NanoStriDE
Project Homepage: http://nanostride.soe.ucsc.edu
Operating System: Platform independent
Programming Language: Perl, PHP, R
Other Requirements: Modern HTML 5 compliant web browser (e.g. IE 9, Firefox 8, Chrome 15, Opera 11)
License: NanoStriDE is freely available under the GPL v2 license and can be found on the NanoStriDE web page on the license page.
We would like to acknowledge Stephen Jackson and from NanoString technologies for providing assistance with NanoString corrections and normalizations. This work was supported in part by grants from the National Aeronautics and Space Administration Cooperative Agreements (NNX08B47A, NNX10AQ16A and NNX11AD38G), and National Institutes of Health (P01-35HG000205).
- Heller MJ: DNA microarray technology: devices, systems, and applications. Annu Rev Biomed Eng 2002, 4: 129–153. 10.1146/annurev.bioeng.4.020702.153438View ArticlePubMedGoogle Scholar
- Wilhelm BT, Landry J-R: RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing. Methods 2009, 48: 249–257. 10.1016/j.ymeth.2009.03.016View ArticlePubMedGoogle Scholar
- Fu X, Fu N, Guo S, Yan Z, Xu Y, Hu H, Menzel C, Chen W, Li Y, Zeng R, Khaitovich P: Estimating accuracy of RNA-Seq and microarrays with proteomics. BMC Genomics 2009, 10: 161. 10.1186/1471-2164-10-161PubMed CentralView ArticlePubMedGoogle Scholar
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008, 18: 1509–1517. 10.1101/gr.079558.108PubMed CentralView ArticlePubMedGoogle Scholar
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009, 10: 57–63. 10.1038/nrg2484PubMed CentralView ArticlePubMedGoogle Scholar
- Liu S, Lin L, Jiang P, Wang D, Xing Y: A comparison of RNA-Seq and high-density exon array for detecting differential gene expression between closely related species. Nucleic Acids Res 2011, 39: 578–588. 10.1093/nar/gkq817PubMed CentralView ArticlePubMedGoogle Scholar
- Geiss GK, Bumgarner RE, Birditt B, Dahl T, Dowidar N, Dunaway DL, Fell HP, Ferree S, George RD, Grogan T, James JJ, Maysuria M, Mitton JD, Oliveri P, Osborn JL, Peng T, Ratcliffe AL, Webster PJ, Davidson EH, Hood L, Dimitrov K: Direct multiplexed measurement of gene expression with color-coded probe pairs. Nat Biotechnol 2008, 26: 317–325. 10.1038/nbt1385View ArticlePubMedGoogle Scholar
- Zak DE, Aderem A: A systems view of host defense. Nat Biotechnol 2009, 27: 999–1001. 10.1038/nbt1109-999PubMed CentralView ArticlePubMedGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26: 139–140. 10.1093/bioinformatics/btp616PubMed CentralView ArticlePubMedGoogle Scholar
- Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol 2010, 11: R106. 10.1186/gb-2010-11-10-r106PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.