Many efforts in microarray data analysis are focused on providing tools and methods for the qualitative analysis of microarray data. HDBStat! (High-Dimensional Biology-Statistics) is a software package designed for analysis of high dimensional biology data such as microarray data. It was initially developed for the analysis of microarray gene expression data, but it can also be used for some applications in proteomics and other aspects of genomics. HDBStat! provides statisticians and biologists a flexible and easy-to-use interface to analyze complex microarray data using a variety of methods for data preprocessing, quality control analysis and hypothesis testing.

Results

Results generated from data preprocessing methods, quality control analysis and hypothesis testing methods are output in the form of Excel CSV tables, graphs and an Html report summarizing data analysis.

One of the most critical tasks in the field of biology is identifying how and which genes interact with each other under different conditions. Until a few years ago, researchers were only able to accomplish this task for a limited number of genes because the traditional methods in molecular biology allowed them to assess only one gene at a time. The advent of microarray technology has provided investigators the opportunity to simultaneously assess the expression levels of thousands of genes. Microarrays also generate a large amount of data in short period of time. Extracting statistically valid and biologically relevant information from such massive data sets is a major challenge. HDBStat! is a user-friendly and platform-independent software designed for the statistical analysis of microarray data using well-validated methods for quality control of experiments and the identification of differentially expressed genes.

Implementation

Data analysis in HDBStat! is divided into four steps – data import, data processing, quality control and hypotheses testing (Figure 1).

Data import

Data is imported into HDBStat! using two files, a gene expression data file (Figure 2) and chip level information file (Figure 3), both of which must be Microsoft Excel '97 or more recent format (.xls), or Comma Separated Values (.csv) files [see Additional file 1 and Additional file 2]. The gene expression data file contains the output from the chip image processing software, such as MAS 5.0, Bioconductor, or GenePix. The chip level file contains experimental variables such as treatment, time, experiment, and if appropriate, pairing variables for the chips. Upon import some descriptive statistics are automatically generated about the raw data such as Pearson's correlations between chips, mean, standard deviation, minimum and maximum values of gene expression levels for each chip and displayed in graphical and tabular formats.

Data preprocessing

Optionally, a normalization and/or transformation method(s) can be applied prior to the primary statistical analyses. Normalization is a procedure intended to remove variability among chips that is unrelated to treatment conditions of interest. HDBStat! offers Chip Mean normalization, which divides each observation by the chip mean, and Quantile-Quantile normalization, which ranks each observation on the chip based on expression value and then converts to the value of a deviation that would be expected from the standard normal distribution based on the observation rank. Quantile-quantile normalization results in data from each chip with a mean of zero and standard deviation of 1.0. Transformation is a process of applying a mathematical function to every observation in a data set in order to better satisfy assumptions of certain statistical models used for analysis. HDBStat! offers three different scales of logarithmic transformation, base-2, base-e, or base-10. Combinations of normalizations and transformations may be selected.

Quality control

HDBStat! provides a unique quality control procedure based upon Deleted Residuals (DR). Deleted residuals have traditionally been used in the statistical analysis of data when the number of observations in a group are small or may be influenced by outliers, as in the case in microarrays. In HDBStat!, the deleted residuals for each gene on each chip is calculated by taking the observed value of a gene on a chip subtracting the mean for the gene across all other chips in that group divided by the standard deviation of the mean for the gene across all the other chips in that group. The Probability Density Function (PDF) for the deleted residuals for a gene will follow a Student's t-distribution with n-2 degrees of freedom where n is the number of chips in the treatment group. If we assume that the genes across a chip are independent identically distributed (IID) the distribution of the deleted residuals should approximate a Student's t-distribution with n-2 degrees of freedom. The difference of the observed data from the expected t-distribution is graphically illustrated (Figure 4) and the significance of the difference is tested using a Kolmogorov-Smirnov test. If a chip is significantly different from the t-distribution it may be an indication that the particular chip is an outlier compared with the other chips in the group. Further, the user has the opportunity to remove chip(s) from the analysis and re-analyze the data.

Hypotheses testing

Currently, HDBStat! performs a series of pair wise comparison tests. Based on the information provided by user in chip level information file, a combination of all possible hypotheses is displayed in the user interface. User must select at least one hypothesis in order to perform two group comparisons.

HDBStat! includes parametric and non-parametric methods for estimating the significance of changes in gene expression between groups. Student's t-test, for which the user can choose an equal-variance t-test, which uses a pooled variance across treatments, or Welch's t-test, which assumes unequal variances between the two treatment groups [11]. Another method based on Chebyshev's inequality, Chebby Checker is extremely robust against departures from normality and equality of variance between treatment groups, but it also has very low power [2]. The Chebby Checker is useful for identifying genes that are almost certainly differentially expressed without considering any statistical assumptions. In addition a bootstrap resampling method [6, 8] is implemented. One can either conduct an exact bootstrap (all possible permutations) or a random (used specified number of permutations) bootstrap. The bootstrap procedures implement both pivots and smoothes in order to calculate the significance more accurately. As exact bootstrap is more accurate than random bootstrap, it is preferred for computationally feasible cases, but once the n per groups exceeds 6 it is difficult to implement.

Because of the large number of simultaneous hypotheses tested in high dimensional biology experiments, an adjustment for multiple testing is appropriate in order to avoid falsely calling too many genes significant. In HDBStat! several multiplicity control methods are available to be applied to any hypothesis testing method. The available multiplicity control adjustments are Bonferroni [4], Sidak [10], two False Discovery Rate (FDR) estimation methods [3, 5], and a method based on a mixture modeling of observed p-values [1], referred to as the "Mix-o-matic" (Figure 5) in the HDBStat! software. The Bonferroni and Sidak methods provide experiment-wise (or Family-wise) type I control. The FDR methods are designed to control the proportion of false positives among all genes declared differentially expressed. The mixture modeling method allows for the Bayesian estimation of the probability that each gene is a false positive or negative and this approach is also conveniently for projecting power estimates for future studies [9].

For the planning of future studies HDBStat! implements the method of Gadbury et al to extrapolate power from pilot data [9]. HDBStat! allows for the calculation of the expected discovery rate (EDR), posterior true positive, and posterior true negative rates for large and smaller samples sizes than were entered as pilot data. (Figure 6)

If an investigator is interested in empirically comparing the size of the observed differences in gene expression, an Empirical Bayes method is provided to provide shrinkage estimators of the true differences in gene expression [7]. In addition group means and fold changes in expression are calculated and output to results directory specified by the user.

Programming details

HDBStat! is implemented using the Java programming language using various licensed and open source libraries such as Visual Numerics JMSL, Jakarta POI, Velocity, and JFreeChart. Extensive software testing is performed using JUnit library.

Results

At the completion of calculating the deleted residuals and analyzing a hypothesis, results are output to a date/time-stamped directory into the user specified directory. Chip level statistics, preprocessed data, deleted residuals, standard outliers, various pair wise comparison tests (Table 1), mix-o-matic and power analysis results are output in the form of Excel CSV files. Graphs generated from chip level statistics, deleted residuals, mix-o-matic and power analysis results are output in .png format image files. HDBStat! also generates a HTML file that provides a summary of the analysis including the hypotheses tested, chips in each group. This mechanism of outputting results provides the user an opportunity to view quality control results and modify hypotheses, preprocessing methods, and/or chip selections before proceeding to the next step.

Table 1

Hypothesis testing results

Probe

Mean (group1, Raw data)

Mean (group2, Raw data)

Fold change

t (Equal variance t-test)

p-value (Equal variance t-test)

PTP, unrestricted (Equal variance t-test)

AFFX-MurIL2_at

105.16

200.34

1.905097

-0.81193

0.440318

0.226179

AFFX-MurIL10_at

176.52

253.58

1.436551

-0.49777

0.632041

0.194216

AFFX-MurIL4_at

146.56

197.86

1.350027

-0.05477

0.957668

0.147303

AFFX-MurFAS_at

670.06

487.42

0.727427

2.264698

0.05333

0.32061

AFFX-BioB-5_at

4074.98

5165.36

1.267579

-1.10225

0.302405

0.251946

AFFX-BioB-M_at

10258.4

12376.04

1.20643

-0.85246

0.418745

0.230018

AFFX-BioB-3_at

5600.76

6784.9

1.211425

-0.82656

0.432446

0.227573

AFFX-BioC-5_at

15688.4

17406.88

1.109538

-0.23558

0.819676

0.16602

AFFX-BioC-3_at

11812.54

13288.74

1.124969

-0.2571

0.803593

0.168321

AFFX-BioDn-3_at

65025.2

68451.72

1.052695

0.467328

0.65273

0.190965

AFFX-CreX-5_at

124944.9

152682.8

1.222001

-0.39488

0.70325

0.183174

AFFX-CreX-3_at

166352

192790.6

1.158932

-0.26323

0.799024

0.168979

AFFX-BioB-5_st

611.6

677.8

1.108241

0.280987

0.785851

0.170886

AFFX-BioB-M_st

556.26

779.84

1.401934

-1.1822

0.27107

0.25832

AFFX-BioB-3_st

195.62

359.32

1.836827

-1.51174

0.169049

0.281499

AFFX-BioC-5_st

114.36

183.8

1.607205

-2.17372

0.061461

0.316596

AFFX-BioC-3_st

57.36

94.86

1.653766

-0.62822

0.547374

0.2079

AFFX-BioDn-5_st

1981.88

2327.5

1.17439

-0.99146

0.350498

0.242595

AFFX-BioDn-3_st

1487.74

2265.06

1.522484

-2.01951

0.078118

0.309434

AFFX-CreX-5_st

950.56

1221.04

1.284548

-1.42444

0.192137

0.275814

AFFX-CreX-3_st

2610.3

3429.5

1.313834

-1.21894

0.257585

0.261147

AFFX-DapX-5_at

204.42

267.12

1.306721

0.187522

0.85592

0.16092

AFFX-DapX-M_at

299.3

347.62

1.161443

0.119799

0.907597

0.153866

AFFX-DapX-3_at

192.64

95.82

0.497404

1.5611

0.157123

0.284581

AFFX-LysX-5_at

132.46

115.9

0.874981

0.445594

0.667702

0.188634

AFFX-LysX-M_at

174.16

204.14

1.172141

-0.09641

0.92557

0.151477

AFFX-LysX-3_at

47.8

167.94

3.513389

-1.2216

0.256632

0.261348

AFFX-PheX-5_at

33.04

55.5

1.679782

-2.04285

0.07534

0.310549

AFFX-PheX-M_at

40.64

29.66

0.729823

1.223746

0.255865

0.261511

AFFX-PheX-3_at

497.66

217.12

0.436282

2.826877

0.022257

0.342932

AFFX-ThrX-5_at

92.06

184.74

2.006735

-0.26462

0.797994

0.169127

AFFX-ThrX-M_at

225.32

272.46

1.209214

-0.33299

0.747704

0.17649

AFFX-ThrX-3_at

114.54

95.08

0.830103

-0.17193

0.86776

0.15928

AFFX-TrpnX-5_at

73

96.68

1.324384

-0.82529

0.433126

0.227452

AFFX-TrpnX-M_at

30.12

37.74

1.252988

-0.31889

0.757976

0.174969

Discussion

The goal of HDBStat! is to help researchers analyze microarray data to extract valid inferences, estimates and interpretations via a flexible and user-friendly graphical interface. It allows the user to skip preprocessing and quality control methods by simply not selecting those methods. After previewing the preliminary results of raw data, preprocessed data or deleted residuals, user has flexibility to drop a chip by simply un-checking checkbox in the user interface. This feature allows the user to design any number of possible comparisons while the analysis is in progress.

To assist novice users with using HDBStat!, video clips demonstrating how to analyze paired and unpaired data, examples of how to set up input files for paired and unpaired data analyses, screen shots, and FAQ are available on our website. A detailed description of methods as well as additional explanations of the output files in this software is also available in a PDF format on our website.

Additional statistical methods and features are added on an ongoing basis. Support for data import from a text file and results output to a text file will be available for large data sets. In the current version, only single channel or common reference design microarray data can be analyzed using two group comparisons. In the near future, we will add the capability to analyze two channel data and support for ANOVA, and GLM.

There are many software programs available to analyze microarray data, each offering various features and functions. In Table 2, we have compared the features and functions of HDBStat! to SAM, BRB Array Tools and TM4.

Table 2

Comparison of HDBStat! with other software packages. All these software packages are still in active development and new functions will undoubtedly be added over time.

HDBStat!

SAM

BRB-Array Tools 3.2.2

TM4

Two color data handling

Common reference and balanced block designs

No

Yes

Yes

Database

No

No

No

Yes

Ratio Statistics

Yes

No

Yes

Yes

Normalization

Yes

Yes

Yes

Yes

Max number of arrays

No limit

255

249

No limit

Discriminate Analysis

No

No

Yes

Yes

ANOVA

Yes

No

Yes

Yes

Bootstrapping

Yes

Yes

Yes

Yes

Non-normal and heteroskedastic data handling

Yes

Yes

Via normalization

Via normalization

Non-parametric statistics

Yes

No

Yes

No

Cluster analysis

No

No

Yes

Yes

FDR (number)

8

1

2

1

Family Wise Error rate corrections

2

0

1

1

Quality Control

Yes

No

No

Yes

Power Analysis

Yes

No

No

No

Automatic report generation

Yes

No

Yes

No

Gene Class testing

No

No

Yes

No

Automatic Annotation

No

Link out

Yes

No

Platform

Single program implemented in Java & available via Java Web Start technology

Microsoft Excel Add-in

Microsoft Excel Add-in

4 separate programs, 3 of which implemented in java & 1 in C++

Availability and requirements

System requirements for an end-user are the Java Runtime Environment (JRE 1.4.2 or higher), at least 256 MB RAM and 25 MB hard disk space. Using Java Web Start technology, HDBStat! can be easily downloaded from our website at http://www.soph.uab.edu/ssg_content.asp?id=1164.

Declarations

Acknowledgements

This work is supported by grants from the UAB HSF GEF, NSF 0217651 and NIH U54CA100949.

Authors’ Affiliations

(1)

Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham

(2)

Department of Mathematics and Statistics, University of Missouri-Rolla

(3)

Department of Agronomy, Iowa State University

(4)

Pennington Biomedical Research Center

References

Allison DB, Gadbury GL, Moonseong H, Fernandez JR, Lee C, Prolla TA, Weindruch R: A mixture model approach for the analysis of microarray gene expression data.Comp Statist & Data Anal 2002,39(1):1–20.View Article

Beasley TM, Page GP, Brand JPL, Gadbury GL, Mountz JD, Allison DB: Chebyshev's inequality for non-parametric testing with smallNand a in microarray research.J R Statist Soc C 2004, 53:95–108.View Article

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.J R Statist Soc B 1995, 57:289–300.

Bland JM, Altman DG: Multiple significance tests: the Bonferroni method.BMJ 1995,310(6973):170.PubMed

Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency.Ann Statist 2001,29(4):1165–1188.View Article

Davison AC, Hinkley DV: Bootstrap methods and their application. Cambridge University Press, United Kingdom 1997.

Edwards JW, Page GP, Gadbury G, Heo M, Kayo T, Weindruch R, Allison DB: Empirical Bayes estimation of gene-specific effects in micro-array research.Funct Integr Genomics 2005,5(1):32–9.View ArticlePubMed

Effron B, Tibshirani RJ: An Introduction to the Bootstrap. Chapmann and Hall New York 1993.

Gadbury GL, Page GP, Edwards JW, Kayo T, Prolla TA, Weindruch R, Permana PA, Mountz J, Allison DB: Power and Sample Size Estimation in High Dimensional Biology.Stat Meth Med Res 2004, 13:325–338.View Article

Sidak Z: Rectangular confidence regions for the means of the multivariate normal distributions.J Am Stat Assoc 1967, 62:626–633.View Article

Welch BL: The significance of the difference between two means when the population variances are unequal.Biometrika 1938, 29:350–362.

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.