# HDBStat!: A platform-independent software suite for statistical analysis of high dimensional biology data

Prinal Trivedi^{1}, Jode W Edwards^{1, 3}, Jelai Wang^{1}, Gary L Gadbury^{1, 2}, Vinodh Srinivasasainagendra^{1}, Stanislav O Zakharkin^{1}, Kyoungmi Kim^{1}, Tapan Mehta^{1}, Jacob PL Brand^{1, 4}, Amit Patki^{1}, Grier P Page^{1} and David B Allison^{1}

**6**:86

**DOI:** 10.1186/1471-2105-6-86

© Trivedi et al; licensee BioMed Central Ltd. 2005

**Received:** 29 November 2004

**Accepted:** 06 April 2005

**Published:** 06 April 2005

## Abstract

### Background

Many efforts in microarray data analysis are focused on providing tools and methods for the qualitative analysis of microarray data. HDBStat! (High-Dimensional Biology-Statistics) is a software package designed for the analysis of high dimensional biology data such as microarray data. It was initially developed for the analysis of microarray gene expression data, but it can also be used for some applications in proteomics and other aspects of genomics. HDBStat! provides statisticians and biologists with a flexible and easy-to-use interface for analyzing complex microarray data using a variety of methods for data preprocessing, quality control analysis and hypothesis testing.

### Results

Results generated by the data preprocessing, quality control analysis and hypothesis testing methods are output in the form of Excel CSV tables, graphs and an HTML report summarizing the data analysis.

### Conclusion

HDBStat! is platform-independent software that is freely available to academic institutions and non-profit organizations. It can be downloaded from our website at http://www.soph.uab.edu/ssg_content.asp?id=1164.

## Background

One of the most critical tasks in the field of biology is identifying which genes interact with each other, and how, under different conditions. Until a few years ago, researchers were able to accomplish this task for only a limited number of genes because the traditional methods in molecular biology allowed them to assess only one gene at a time. The advent of microarray technology has given investigators the opportunity to assess the expression levels of thousands of genes simultaneously. Microarrays also generate a large amount of data in a short period of time. Extracting statistically valid and biologically relevant information from such massive data sets is a major challenge. HDBStat! is user-friendly, platform-independent software designed for the statistical analysis of microarray data using well-validated methods for quality control of experiments and the identification of differentially expressed genes.

## Implementation

### Data import

### Data preprocessing

Optionally, normalization and/or transformation methods can be applied prior to the primary statistical analyses. Normalization is a procedure intended to remove variability among chips that is unrelated to the treatment conditions of interest. HDBStat! offers Chip Mean normalization, which divides each observation by the chip mean, and Quantile-Quantile normalization, which ranks each observation on the chip by expression value and then converts it to the deviate that would be expected from the standard normal distribution given the observation's rank. Quantile-Quantile normalization yields data for each chip with a mean of zero and a standard deviation of 1.0. Transformation is the process of applying a mathematical function to every observation in a data set in order to better satisfy the assumptions of the statistical models used for analysis. HDBStat! offers three logarithmic transformations: base-2, base-e, or base-10. Combinations of normalizations and transformations may be selected.
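The preprocessing steps described above can be illustrated with a minimal Python sketch. This is not HDBStat! code (HDBStat! is written in Java); the function names are hypothetical, the plotting position `(rank - 0.5) / n` used to map ranks to normal deviates is an assumption, and ties are simply ranked in order of appearance.

```python
import math
from statistics import NormalDist, mean

# Hypothetical lookup of the three logarithm bases named in the text.
LOG_FUNCTIONS = {2: math.log2, "e": math.log, 10: math.log10}


def chip_mean_normalize(chip):
    """Divide every observation on one chip by that chip's mean."""
    chip_mean = mean(chip)
    return [value / chip_mean for value in chip]


def quantile_normal_scores(chip):
    """Replace each observation by the standard normal deviate for its rank.

    Assumption: rank r (1..n) is mapped to the probability (r - 0.5) / n.
    """
    n = len(chip)
    order = sorted(range(n), key=lambda i: chip[i])  # indices, smallest first
    std_normal = NormalDist()  # mean 0, standard deviation 1
    scores = [0.0] * n
    for rank, index in enumerate(order, start=1):
        scores[index] = std_normal.inv_cdf((rank - 0.5) / n)
    return scores


def log_transform(chip, base=2):
    """Apply a base-2, base-e ("e") or base-10 logarithm to every observation."""
    log = LOG_FUNCTIONS[base]
    return [log(value) for value in chip]


if __name__ == "__main__":
    chip = [105.16, 176.52, 146.56, 670.06]  # illustrative values only
    print(chip_mean_normalize(chip))
    print(quantile_normal_scores(chip))
    print(log_transform(chip, base=2))
```

With this plotting position the scores are symmetric around zero, so their mean is zero; their standard deviation only approaches 1.0 as the number of probes on the chip grows.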

### Quality control

### Hypotheses testing

Currently, HDBStat! performs a series of pairwise comparison tests. Based on the information provided by the user in the chip-level information file, all possible pairwise hypotheses are displayed in the user interface. The user must select at least one hypothesis in order to perform two-group comparisons.
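As an illustration of how a list of pairwise hypotheses could be derived from group labels, the following Python sketch enumerates every unordered pair of groups. The dictionary mapping chip names to group labels is a hypothetical stand-in for the chip-level information file, whose actual format is not described here.

```python
from itertools import combinations


def pairwise_hypotheses(chip_groups):
    """Return every unordered pair of distinct group labels.

    chip_groups: dict mapping a chip name to its group label
    (a hypothetical representation of the chip-level information).
    """
    groups = sorted(set(chip_groups.values()))
    return list(combinations(groups, 2))


if __name__ == "__main__":
    chips = {
        "chip1": "control",
        "chip2": "control",
        "chip3": "treated",
        "chip4": "aged",
    }
    for group_a, group_b in pairwise_hypotheses(chips):
        print(f"{group_a} vs {group_b}")
```

With k groups this produces k(k-1)/2 candidate comparisons, from which at least one would be selected.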

HDBStat! includes parametric and non-parametric methods for estimating the significance of changes in gene expression between groups. For Student's t-test, the user can choose either an equal-variance t-test, which uses a variance pooled across treatments, or Welch's t-test, which allows unequal variances between the two treatment groups [11]. Another method, the Chebby Checker, is based on Chebyshev's inequality; it is extremely robust against departures from normality and equality of variance between treatment groups, but it also has very low power [2]. The Chebby Checker is useful for identifying genes that are almost certainly differentially expressed without relying on distributional assumptions. In addition, a bootstrap resampling method [6, 8] is implemented. One can conduct either an exact bootstrap (all possible permutations) or a random bootstrap (a user-specified number of permutations). The bootstrap procedures implement both pivoting and smoothing in order to calculate significance more accurately. Because the exact bootstrap is more accurate than the random bootstrap, it is preferred when computationally feasible, but it becomes difficult to run once the number of chips per group exceeds 6.
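The difference between the two t-test variants can be sketched in Python as follows. This is a textbook formulation, not the HDBStat! implementation; the function names are hypothetical, the sign convention (group 1 minus group 2) is an assumption, and p-values are omitted because the Python standard library has no t distribution.

```python
import math
from statistics import mean, variance


def equal_variance_t(group1, group2):
    """Student's t statistic using a variance pooled across both groups."""
    n1, n2 = len(group1), len(group2)
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled * (1 / n1 + 1 / n2))
    t = (mean(group1) - mean(group2)) / standard_error
    return t, n1 + n2 - 2  # statistic and degrees of freedom


def welch_t(group1, group2):
    """Welch's t statistic, which does not pool the two group variances."""
    n1, n2 = len(group1), len(group2)
    a = variance(group1) / n1
    b = variance(group2) / n2
    t = (mean(group1) - mean(group2)) / math.sqrt(a + b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


if __name__ == "__main__":
    g1 = [1.0, 2.0, 3.0, 4.0, 5.0]    # illustrative values only
    g2 = [2.0, 4.0, 6.0, 8.0, 10.0]
    print(equal_variance_t(g1, g2))
    print(welch_t(g1, g2))
```

When both groups have the same size, as in this usage example, the two statistics coincide and only the degrees of freedom differ; with unbalanced groups and unequal variances the statistics themselves diverge.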

If an investigator is interested in empirically comparing the size of the observed differences in gene expression, an Empirical Bayes method provides shrinkage estimators of the true differences in gene expression [7]. In addition, group means and fold changes in expression are calculated and written to the results directory specified by the user.
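A small Python sketch of the group mean and fold change calculation is given here. It assumes that fold change is the group 2 mean divided by the group 1 mean, which is consistent with the first row of the hypothesis testing results table (200.34 / 105.16 ≈ 1.905); the function names are hypothetical.

```python
from statistics import mean


def fold_change(mean_group1, mean_group2):
    """Fold change, assumed here to be the group 2 mean over the group 1 mean."""
    return mean_group2 / mean_group1


def group_summary(group1, group2):
    """Return (mean of group 1, mean of group 2, fold change) for one probe."""
    m1, m2 = mean(group1), mean(group2)
    return m1, m2, fold_change(m1, m2)


if __name__ == "__main__":
    # Group means taken from the AFFX-MurIL2_at row of the results table.
    print(round(fold_change(105.16, 200.34), 6))
    # Illustrative raw values only.
    print(group_summary([100.0, 110.0], [200.0, 220.0]))
```

Under this convention a fold change above 1 indicates higher mean expression in group 2, and a value below 1 indicates higher mean expression in group 1.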

### Programming details

HDBStat! is implemented in the Java programming language and uses various licensed and open source libraries such as Visual Numerics JMSL, Jakarta POI, Velocity, and JFreeChart. Extensive software testing is performed using the JUnit library.

## Results

Hypothesis testing results

Probe | Mean (group1, Raw data) | Mean (group2, Raw data) | Fold change | t (Equal variance t-test) | p-value (Equal variance t-test) | PTP, unrestricted (Equal variance t-test) |
---|---|---|---|---|---|---|
AFFX-MurIL2_at | 105.16 | 200.34 | 1.905097 | -0.81193 | 0.440318 | 0.226179 |
AFFX-MurIL10_at | 176.52 | 253.58 | 1.436551 | -0.49777 | 0.632041 | 0.194216 |
AFFX-MurIL4_at | 146.56 | 197.86 | 1.350027 | -0.05477 | 0.957668 | 0.147303 |
AFFX-MurFAS_at | 670.06 | 487.42 | 0.727427 | 2.264698 | 0.05333 | 0.32061 |
AFFX-BioB-5_at | 4074.98 | 5165.36 | 1.267579 | -1.10225 | 0.302405 | 0.251946 |
AFFX-BioB-M_at | 10258.4 | 12376.04 | 1.20643 | -0.85246 | 0.418745 | 0.230018 |
AFFX-BioB-3_at | 5600.76 | 6784.9 | 1.211425 | -0.82656 | 0.432446 | 0.227573 |
AFFX-BioC-5_at | 15688.4 | 17406.88 | 1.109538 | -0.23558 | 0.819676 | 0.16602 |
AFFX-BioC-3_at | 11812.54 | 13288.74 | 1.124969 | -0.2571 | 0.803593 | 0.168321 |
AFFX-BioDn-3_at | 65025.2 | 68451.72 | 1.052695 | 0.467328 | 0.65273 | 0.190965 |
AFFX-CreX-5_at | 124944.9 | 152682.8 | 1.222001 | -0.39488 | 0.70325 | 0.183174 |
AFFX-CreX-3_at | 166352 | 192790.6 | 1.158932 | -0.26323 | 0.799024 | 0.168979 |
AFFX-BioB-5_st | 611.6 | 677.8 | 1.108241 | 0.280987 | 0.785851 | 0.170886 |
AFFX-BioB-M_st | 556.26 | 779.84 | 1.401934 | -1.1822 | 0.27107 | 0.25832 |
AFFX-BioB-3_st | 195.62 | 359.32 | 1.836827 | -1.51174 | 0.169049 | 0.281499 |
AFFX-BioC-5_st | 114.36 | 183.8 | 1.607205 | -2.17372 | 0.061461 | 0.316596 |
AFFX-BioC-3_st | 57.36 | 94.86 | 1.653766 | -0.62822 | 0.547374 | 0.2079 |
AFFX-BioDn-5_st | 1981.88 | 2327.5 | 1.17439 | -0.99146 | 0.350498 | 0.242595 |
AFFX-BioDn-3_st | 1487.74 | 2265.06 | 1.522484 | -2.01951 | 0.078118 | 0.309434 |
AFFX-CreX-5_st | 950.56 | 1221.04 | 1.284548 | -1.42444 | 0.192137 | 0.275814 |
AFFX-CreX-3_st | 2610.3 | 3429.5 | 1.313834 | -1.21894 | 0.257585 | 0.261147 |
AFFX-DapX-5_at | 204.42 | 267.12 | 1.306721 | 0.187522 | 0.85592 | 0.16092 |
AFFX-DapX-M_at | 299.3 | 347.62 | 1.161443 | 0.119799 | 0.907597 | 0.153866 |
AFFX-DapX-3_at | 192.64 | 95.82 | 0.497404 | 1.5611 | 0.157123 | 0.284581 |
AFFX-LysX-5_at | 132.46 | 115.9 | 0.874981 | 0.445594 | 0.667702 | 0.188634 |
AFFX-LysX-M_at | 174.16 | 204.14 | 1.172141 | -0.09641 | 0.92557 | 0.151477 |
AFFX-LysX-3_at | 47.8 | 167.94 | 3.513389 | -1.2216 | 0.256632 | 0.261348 |
AFFX-PheX-5_at | 33.04 | 55.5 | 1.679782 | -2.04285 | 0.07534 | 0.310549 |
AFFX-PheX-M_at | 40.64 | 29.66 | 0.729823 | 1.223746 | 0.255865 | 0.261511 |
AFFX-PheX-3_at | 497.66 | 217.12 | 0.436282 | 2.826877 | 0.022257 | 0.342932 |
AFFX-ThrX-5_at | 92.06 | 184.74 | 2.006735 | -0.26462 | 0.797994 | 0.169127 |
AFFX-ThrX-M_at | 225.32 | 272.46 | 1.209214 | -0.33299 | 0.747704 | 0.17649 |
AFFX-ThrX-3_at | 114.54 | 95.08 | 0.830103 | -0.17193 | 0.86776 | 0.15928 |
AFFX-TrpnX-5_at | 73 | 96.68 | 1.324384 | -0.82529 | 0.433126 | 0.227452 |
AFFX-TrpnX-M_at | 30.12 | 37.74 | 1.252988 | -0.31889 | 0.757976 | 0.174969 |

## Discussion

The goal of HDBStat! is to help researchers analyze microarray data to extract valid inferences, estimates and interpretations via a flexible and user-friendly graphical interface. It allows the user to skip preprocessing and quality control methods by simply not selecting those methods. After previewing the preliminary results for raw data, preprocessed data or deleted residuals, the user has the flexibility to drop a chip by simply un-checking a checkbox in the user interface. This feature allows the user to design any number of possible comparisons while the analysis is in progress.

To assist novice users with HDBStat!, video clips demonstrating how to analyze paired and unpaired data, examples of how to set up input files for paired and unpaired data analyses, screen shots, and an FAQ are available on our website. A detailed description of the methods, as well as additional explanations of the output files produced by this software, is also available in PDF format on our website.

Additional statistical methods and features are added on an ongoing basis. Support for data import from a text file and results output to a text file will be available for large data sets. In the current version, only single-channel or common reference design microarray data can be analyzed using two-group comparisons. In the near future, we will add the capability to analyze two-channel data and support for ANOVA and GLM.

Comparison of HDBStat! with other software packages. All these software packages are still in active development and new functions will undoubtedly be added over time.

Feature | HDBStat! | SAM | BRB-Array Tools 3.2.2 | TM4 |
---|---|---|---|---|
Two color data handling | Common reference and balanced block designs | No | Yes | Yes |
Database | No | No | No | Yes |
Ratio Statistics | Yes | No | Yes | Yes |
Normalization | Yes | Yes | Yes | Yes |
Max number of arrays | No limit | 255 | 249 | No limit |
Discriminant Analysis | No | No | Yes | Yes |
ANOVA | Yes | No | Yes | Yes |
Bootstrapping | Yes | Yes | Yes | Yes |
Non-normal and heteroskedastic data handling | Yes | Yes | Via normalization | Via normalization |
Non-parametric statistics | Yes | No | Yes | No |
Cluster analysis | No | No | Yes | Yes |
FDR (number) | 8 | 1 | 2 | 1 |
Family Wise Error rate corrections | 2 | 0 | 1 | 1 |
Quality Control | Yes | No | No | Yes |
Power Analysis | Yes | No | No | No |
Automatic report generation | Yes | No | Yes | No |
Gene Class testing | No | No | Yes | No |
Automatic Annotation | No | Link out | Yes | No |
Platform | Single program implemented in Java & available via Java Web Start technology | Microsoft Excel Add-in | Microsoft Excel Add-in | 4 separate programs, 3 of which implemented in Java & 1 in C++ |

## Availability and requirements

System requirements for an end-user are the Java Runtime Environment (JRE 1.4.2 or higher), at least 256 MB RAM and 25 MB hard disk space. Using Java Web Start technology, HDBStat! can be easily downloaded from our website at http://www.soph.uab.edu/ssg_content.asp?id=1164.

## Declarations

### Acknowledgements

This work is supported by grants from the UAB HSF GEF, NSF 0217651 and NIH U54CA100949.

## Authors’ Affiliations

## References

1. Allison DB, Gadbury GL, Moonseong H, Fernandez JR, Lee C, Prolla TA, Weindruch R: A mixture model approach for the analysis of microarray gene expression data. *Comp Statist & Data Anal* 2002, 39(1):1–20. 10.1016/S0167-9473(01)00046-9
2. Beasley TM, Page GP, Brand JPL, Gadbury GL, Mountz JD, Allison DB: Chebyshev's inequality for non-parametric testing with small *N* and α in microarray research. *J R Statist Soc C* 2004, 53: 95–108. 10.1111/j.1467-9876.2004.00428.x
3. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. *J R Statist Soc B* 1995, 57: 289–300.
4. Bland JM, Altman DG: Multiple significance tests: the Bonferroni method. *BMJ* 1995, 310(6973):170.
5. Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. *Ann Statist* 2001, 29(4):1165–1188. 10.1214/aos/1013699998
6. Davison AC, Hinkley DV: *Bootstrap methods and their application.* Cambridge University Press, United Kingdom; 1997.
7. Edwards JW, Page GP, Gadbury G, Heo M, Kayo T, Weindruch R, Allison DB: Empirical Bayes estimation of gene-specific effects in micro-array research. *Funct Integr Genomics* 2005, 5(1):32–9. 10.1007/s10142-004-0123-0
8. Efron B, Tibshirani RJ: *An Introduction to the Bootstrap.* Chapman and Hall, New York; 1993.
9. Gadbury GL, Page GP, Edwards JW, Kayo T, Prolla TA, Weindruch R, Permana PA, Mountz J, Allison DB: Power and Sample Size Estimation in High Dimensional Biology. *Stat Meth Med Res* 2004, 13: 325–338.
10. Sidak Z: Rectangular confidence regions for the means of the multivariate normal distributions. *J Am Stat Assoc* 1967, 62: 626–633.
11. Welch BL: The significance of the difference between two means when the population variances are unequal. *Biometrika* 1938, 29: 350–362.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.