designGG: an R-package and web tool for the optimal design of genetical genomics experiments
BMC Bioinformatics volume 10, Article number: 188 (2009)
High-dimensional biomolecular profiling of genetically different individuals in one or more environmental conditions is an increasingly popular strategy for exploring the functioning of complex biological systems. The optimal design of such genetical genomics experiments in a cost-efficient and effective way is not trivial.
This paper presents designGG, an R package for designing optimal genetical genomics experiments. A web implementation for designGG is available at http://gbic.biol.rug.nl/designGG. All software, including source code and documentation, is freely available.
DesignGG allows users to intelligently select and allocate individuals to experimental units and conditions such as drug treatment. The user can maximize the power and resolution of detecting genetic, environmental and interaction effects in a genome-wide or local mode by giving more weight to genome regions of special interest, such as previously detected phenotypic quantitative trait loci. This will help to achieve high power and more accurate estimates of the effects of interesting factors, and thus yield a more reliable biological interpretation of data. DesignGG is applicable to linkage analysis of experimental crosses, e.g. recombinant inbred lines, as well as to association analysis of natural populations.
Genetical genomics [1] has become a popular strategy for studying complex biological systems using a combination of classical genetics, biomolecular profiling and bioinformatics [2–5]. By measuring molecular variation, using transcriptomics, proteomics, metabolomics and related emerging technologies, in genetically different individuals, genetical genomics has the potential to identify the functional consequences of natural and induced genetic variation. Recently, genetical genomics has been generalized to achieve a comprehensive understanding of the dynamics of molecular networks by combining environmental and genetic perturbation [6, 7]. This type of large scale "omics" study leads to a better understanding of why individuals of the same species respond differently to drugs, pathogens, and other environmental factors.
However, most molecular profiling experiments are very costly, and as a consequence most genetical genomics studies are performed on the verge of statistical feasibility. Therefore, experimental design needs careful consideration to achieve maximum power from limited resources, such as microarrays and experimental animals [8, 9]. Even in standard scenarios, however, this requires sophisticated application of statistical concepts to intelligently select genetically different individuals from a population and allocate them to different conditions and experimental units. This topic has long motivated classical statistical research [10]. More recently, the concepts developed there have been adapted to the high dimensional data sets of post-genomics research [8, 11–13], and useful simplified design strategies have been suggested [11, 14]. However, transferring these statistical ideas to the even more complex context of genetical genomics [9, 15, 16] still requires considerable expertise in statistics.
Here we present an online web tool that makes these selections and allocations easy for biologists with little or no statistical training. The program finds the experimental design that produces the most accurate estimates of the most relevant biological parameters, given the number of experimental factors to be varied, the genotype information on the population, the profiling technology used, and the constraints on the number of individuals that can be profiled. Advanced users can download the underlying methods as an R package to adapt the program for a more tailored design. Without loss of generality, we illustrate the method using microarrays, although it applies equally well to other profiling technologies, such as mass spectrometry. Also, we only discuss molecular technologies that profile samples individually (e.g., single color microarrays) or in pairs (e.g., dual color microarrays), but an extension of the R scripts to more advanced multiplex technologies would be straightforward [17].
The objective of designGG is to find an optimal allocation of genetically different samples to different conditions and experimental units (arrays) that favors precise estimates of the parameters of interest, such as main genetic effects and interaction effects between genotype and drug treatment. A simple case with one environmental factor can be expressed as $y = \mu + G \times E + \varepsilon$, where $y$ is the measurement vector, $\varepsilon$ is the error term, and $G \times E$ denotes the main effects and the interaction effect of genotype and environment. In matrix notation, a model with one or more genotype factors (quantitative trait loci; QTL) and one or more environmental factors can be written as $Y = X\beta + E$, where $X$ is the design matrix of samples by parameters, $\beta$ is the vector of genotype and environmental effects, and $E$ here denotes the error term. The least squares estimate of $\beta$ is $b = (X^{T}X)^{-1}X^{T}Y$, with $\mathrm{var}(b) = \sigma^{2}(X^{T}X)^{-1}$. The optimal experimental design is defined as the one that minimizes the double sum of the variances of $b$, summed first over all parameters and then over all genotypic markers. We use an optimization algorithm (simulated annealing [18]) to search the experimental design space of all possible allocations to produce an optimal design matrix $X$. During the optimization, the algorithm uses the available marker information from the individuals to optimize the allocation of individuals to microarrays and conditions.
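To make the optimality criterion concrete, the following base R sketch scores a candidate design by summing the diagonal of the inverse of X'X over all parameters and all markers, which equals the double sum of the variances of b up to the constant σ². The function name designScore, the -1/+1 coding of genotype and environment, and the restriction to a single environmental factor are assumptions made for illustration; this is not the code used inside designGG.

```r
# Illustrative only: designScore() is a hypothetical helper, not a designGG function.
# genoCodes: samples x markers matrix, genotypes coded -1 / +1
# envCodes : vector with one entry per sample, environments coded -1 / +1
designScore <- function(genoCodes, envCodes) {
  total <- 0
  for (m in seq_len(ncol(genoCodes))) {
    q <- genoCodes[, m]
    # Design matrix for y = mu + Q + E + QxE at marker m
    X <- cbind(mu = 1, Q = q, E = envCodes, QxE = q * envCodes)
    # var(b) is proportional to (X'X)^-1; sum its diagonal over all parameters.
    # solve() fails if X does not have full column rank.
    total <- total + sum(diag(solve(crossprod(X))))
  }
  total
}

q <- matrix(c(-1, -1, -1, -1, 1, 1, 1, 1), ncol = 1)   # one marker, 8 samples
envBalanced   <- c(-1, -1, 1, 1, -1, -1, 1, 1)          # genotype and environment orthogonal
envUnbalanced <- c(-1, -1, -1, 1, -1, 1, 1, 1)          # partially confounded with genotype

scoreBalanced   <- designScore(q, envBalanced)
scoreUnbalanced <- designScore(q, envUnbalanced)
```

In this toy example the balanced allocation scores 0.5 and the partially confounded allocation scores about 0.67, so the balanced design would be preferred under this criterion.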
In the optimization, the experimenter can give more weight to parameters of higher interest, which will then be estimated with higher accuracy. In particular, prior knowledge about the expected effect sizes of interesting factors can be incorporated as weight parameters for the algorithm, with each weight inversely proportional to the expected effect size of the corresponding factor. In addition, it is possible to specify the genome regions that are of major interest in a particular experiment by specifying a region parameter. For example, if the relevant phenotype is known to map to certain genome regions, parameters for the markers in these regions can be given full weight in the optimization algorithm, whereas parameters for other markers can be given lesser or even zero weight. Thus, mapping resolution can improve and the power for finding QTLs in focal regions can be increased.
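One plausible way to picture how weights and a region restriction enter the criterion is sketched below in base R. The helper weightedScore and the layout of paramVar (parameters in rows, markers in columns) are hypothetical, and the exact way designGG combines weights internally is not specified here, so this should be read as an assumption rather than a description of the package.

```r
# Illustrative only: weightedScore() is a hypothetical helper, not a designGG function.
# paramVar: parameters x markers matrix of estimate variances
# weight  : one weight per parameter (row of paramVar)
# region  : indices of the markers that should contribute to the score
weightedScore <- function(paramVar, weight, region = seq_len(ncol(paramVar))) {
  stopifnot(length(weight) == nrow(paramVar))
  # weight is recycled down each column, so every parameter keeps its own weight
  sum(weight * paramVar[, region, drop = FALSE])
}

paramVar <- matrix(c(0.1, 0.2, 0.3,    # marker 1: variances of 3 parameters
                     0.4, 0.5, 0.6),   # marker 2
                   nrow = 3)

scoreAll   <- weightedScore(paramVar, weight = c(1, 1, 1))
scoreFocus <- weightedScore(paramVar, weight = c(0.5, 0.5, 1), region = 1)
```

With equal weights and all markers the score is the plain double sum; restricting region to the first marker and down-weighting the first two parameters makes the third parameter at that marker dominate the score.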
DesignGG is a package written entirely in the R language [19]. Every function of the designGG library is available as a stand-alone R tool, and detailed help is available in the standard format of R documentation.
Choose the platform. Select the single- or dual-channel option for one-color or two-color gene expression microarrays (the dual-channel option is also used for any other technology profiling pairs of samples).
Upload a tab separated value (TXT) file containing the genotype data matrix (individuals × markers). Each cell contains a genotype label (e.g. A or B for the parental alleles, H for heterozygous loci; NA for missing data).
Set parameters. Specify the number of environmental factors, their number of levels, and the possible values of these levels. Specify either the total number of slides (assays) or the number of samples allocated within each condition.
Use advanced options if only one or a few genome regions or particular factors are of major interest. It is possible to optimize the experimental design by focusing on certain regions (e.g. the first 20 markers on chromosome I). Prior knowledge about expected effect sizes of interesting factors can also be incorporated as weight parameters for the algorithm.
Start the optimization algorithm by clicking on the button Optimize Experimental Design (Figure 1).
Get results. After the optimization is finished, the optimal experimental design will be displayed online (in table format), and will be available as text files for download.
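The genotype file described in the upload step above can be prepared and checked in base R before it is uploaded. The sketch below writes a small tab-separated matrix and reads it back; the individual and marker names, the individuals × markers orientation (taken from the description of the upload step) and the temporary file are illustrative assumptions, and the authoritative layout is the one given in Table 1.

```r
# Toy genotype matrix: 4 individuals x 3 markers (labels are illustrative)
geno <- matrix(c("A", "B", "H",
                 "B", "B", "A",
                 "A", NA,  "B",
                 "H", "A", "A"),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("ind", 1:4), paste0("marker", 1:3)))

# Write as tab-separated text; missing genotypes are written as NA
path <- tempfile(fileext = ".txt")
write.table(geno, file = path, sep = "\t", quote = FALSE, col.names = NA)

# Read the file back and keep the labels as character strings
genoIn <- as.matrix(read.table(path, header = TRUE, sep = "\t", row.names = 1,
                               na.strings = "NA", colClasses = "character"))

# Flag any label other than A, B, H or NA
allowed   <- c("A", "B", "H")
badLabels <- setdiff(unique(as.vector(genoIn)), c(allowed, NA))
```

An empty badLabels vector indicates that every cell holds one of the expected genotype labels or a missing value.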
Here we illustrate how to apply the designGG R package using an example: suppose we are studying the effect of genetic factors (Q), temperature (F1), drug treatment (F2) and their interactions on gene expression using two-color microarrays. There are 100 microarray slides available for this experiment, and we plan to study two different levels for each environmental factor: 16°C and 24°C for F1 (temperature), and 5 μM and 10 μM for F2 (drug treatment). The R package can then be used from the command line as follows:
Prepare the input file specifying the genotype of each individual at each marker position. The file should be formatted as tab separated values (TXT), as illustrated in Table 1.
Load the designGG package by starting the R application and typing the command:
Specify the input arguments (Steps 3–5 correspond to steps 2–4 of using the web tool. The order of the following commands in steps 3–5 does not matter).
Choose the platform of the experiment. In this example, we use two-color microarrays, thus:
> bTwoColorArray <- TRUE # TRUE for paired (two-color) arrays; FALSE otherwise
Load the marker data and specify the following required arguments (number of environmental factors, number of levels per factor, the values of each level, and the number of available slides):
> data(genotype) # example data set provided with the designGG package
# The command below can be used to read TXT data
# genotype <- read.table("genotype.txt")
> nEnvFactors <- 2
> nLevels <- c(2, 2)
> Level <- list(c(16, 24), c(5, 10))
> nSlides <- 100; nTuple <- NULL
An alternative to specifying nSlides is to specify nTuple, the number of strains to be allocated onto each condition. For example,
> nTuple <- 25; nSlides <- NULL
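As a quick consistency check on these arguments, the set of experimental conditions implied by nLevels and Level can be enumerated in base R. The column names temperature and drug are labels chosen here for readability and are not designGG arguments; the sketch only inspects the inputs and does not call the package.

```r
nEnvFactors <- 2
nLevels     <- c(2, 2)
Level       <- list(c(16, 24), c(5, 10))

# The three arguments should agree with each other
stopifnot(length(Level) == nEnvFactors,
          all(lengths(Level) == nLevels))

# One row per combination of temperature and drug concentration
conditions  <- expand.grid(temperature = Level[[1]], drug = Level[[2]])
nConditions <- nrow(conditions)
```

Here nConditions is 4, so nTuple <- 25 corresponds to 25 strains in each of the four temperature and drug combinations.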
In addition to the required arguments specified in step 4, there are some optional ones for a tailored experimental design. For example, we might be especially interested in the genome region between the 1st and 20th markers, where a phenotypic QTL was located in a previous study. We can then specify that the optimization algorithm should take only the genotypes at markers 1 to 20 into account:
> region <- seq(1, 20, by = 1)
Additionally, if we want the estimates of all interaction effects to be twice as accurate as the estimates of the main effects (genotype, temperature and drug treatment), we specify weights for the estimates:
> weight <- c(0.5,0.5,0.5,1,1,1,1)
Here the elements of the weight vector are ordered as follows: first the main effects, starting with the genotype and followed by the two environmental factors in the order used for nLevels and Level; then the two-way interactions, in the same order; and finally the three-way interaction between all three factors.
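The ordering of the seven weights can be made explicit by attaching names to the vector. The sketch below derives the labels from an R model formula, whose term order (main effects, then two-way interactions, then the three-way interaction) matches the description above; the names Q, F1 and F2 follow the example, but the assumption that designGG orders the two-way interactions exactly as R's terms() does should be checked against the package documentation.

```r
# Effect labels in the order main effects, two-way, three-way interactions
effects <- attr(terms(~ Q * F1 * F2), "term.labels")

# Attach the labels to the weights used in the example
weight <- setNames(c(0.5, 0.5, 0.5, 1, 1, 1, 1), effects)
print(weight)
```

Printing the named vector makes it easy to verify that the weight of 1 falls on the interaction terms and the weight of 0.5 on the three main effects.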
The following commands specify the directory where the resulting optimal design tables are to be stored and the name of the output files (design tables):
> directory <- "C:/myproject/design" # use forward slashes; a single backslash starts an escape sequence in R strings
> fileName <- "myDesign"
A detailed explanation of the above arguments can also be found in Table 2.
Run designGG to obtain your optimal design:
> myOutput <- designGG(genotype, nSlides, nTuple, nEnvFactors, nLevels, Level, region = region, weight = weight, nIterations = 10)
It should be noted that the number of iterations of the simulated annealing method (nIterations) is set to 10 here for testing purposes. The default value (nIterations = 3000) is recommended, but it results in a longer computing time.
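To illustrate why the number of iterations matters, the following base R sketch runs a toy simulated annealing search over allocations of 16 individuals to two environments at a single marker. It is a simplified stand-in written for this illustration, not the algorithm implemented in designGG: the functions score and anneal, the swap move, the geometric cooling schedule and its parameters are all assumptions.

```r
set.seed(1)

# Sum of estimate variances (up to sigma^2) for y = mu + Q + E + QxE
score <- function(q, env) {
  XtX <- crossprod(cbind(1, q, env, q * env))
  if (rcond(XtX) < 1e-10) return(Inf)   # singular design: effects not estimable
  sum(diag(solve(XtX)))
}

# Toy annealer: propose swapping the environments of two individuals
anneal <- function(q, env, nIterations = 200, startTemp = 1, cooling = 0.95) {
  current <- score(q, env)
  best <- current
  bestEnv <- env
  temp <- startTemp
  for (i in seq_len(nIterations)) {
    swap <- sample(length(env), 2)
    cand <- env
    cand[swap] <- env[rev(swap)]
    s <- score(q, cand)
    # Always accept improvements; accept worse designs with a probability
    # that shrinks as the temperature drops
    accept <- s < current ||
      (is.finite(s) && runif(1) < exp((current - s) / temp))
    if (accept) {
      env <- cand
      current <- s
      if (s < best) {
        best <- s
        bestEnv <- env
      }
    }
    temp <- max(temp * cooling, 1e-10)
  }
  list(env = bestEnv, score = best)
}

q <- rep(c(-1, 1), each = 8)        # genotype at one marker
startEnv <- q                       # environment fully confounded with genotype ...
startEnv[c(1, 9)] <- startEnv[c(9, 1)]   # ... except for one swapped pair
startScore <- score(q, startEnv)
result <- anneal(q, startEnv)
```

Because the best design seen so far is stored, the returned score can never be worse than the starting score; more iterations simply give the search more chances to approach the balanced optimum of 0.25 for this toy problem.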
Output can be found in the directory or retrieved with:
> optimalArrayDesign <- myOutput$arrayDesign
> optimalCondDesign <- myOutput$conditionDesign
In addition, users can check the curve of optimization score recorded as the algorithm iterates using:
> plotAllScores (myOutput$plot.obj)
Details of default settings such as method (SA: simulated annealing) or nSearch (equals 2) can be found in the designGG manual or the online help. Example genotype data and output tables are also provided along with the package. The R package can be found in Additional file 1, and the most up-to-date version of the software can be downloaded at http://gbic.biol.rug.nl/designGG.
Two tables summarize the optimal design: The table pair design is only used for two-channel experiments and describes how samples are paired together in one assay e.g., a two-color microarray chip (Table 3). The table environment design lists how samples are assigned to environments/experimental factors (Table 4).
DesignGG, the freely available R package and web tool presented in this work, is a novel tool for researchers interested in systems genetics. With the careful experimental design provided by designGG, limited resources, such as arrays and samples, are maximally exploited, and more accurate estimates of the parameters of interest can be achieved.
Jansen RC, Nap JP: Genetical genomics: the added value from segregation. Trends Genet 2001, 17(7):388–391. 10.1016/S0168-9525(01)02310-1
Bystrykh L, Weersing E, Dontje B, Sutton S, Pletcher MT, Wiltshire T, Su AI, Vellenga E, Wang J, Manly KF, et al.: Uncovering regulatory pathways that affect hematopoietic stem cell function using 'genetical genomics'. Nat Genet 2005, 37(3):225–232. 10.1038/ng1497
Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, Guhathakurta D, Sieberts SK, Monks S, Reitman M, Zhang C, et al.: An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet 2005, 37(7):710–717. 10.1038/ng1589
Chen Y, Zhu J, Lum PY, Yang X, Pinto S, MacNeil DJ, Zhang C, Lamb J, Edwards S, Sieberts SK, et al.: Variations in DNA elucidate molecular networks that cause disease. Nature 2008, 452(7186):429–435. 10.1038/nature06757
Brem RB, Kruglyak L: The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc Natl Acad Sci USA 2005, 102(5):1572–1577. 10.1073/pnas.0408709102
Li Y, Breitling R, Jansen RC: Generalizing genetical genomics: getting added value from environmental perturbation. Trends Genet 2008, 24(10):518–524. 10.1016/j.tig.2008.08.001
Li Y, Alvarez OA, Gutteling EW, Tijsterman M, Fu J, Riksen JA, Hazendonk E, Prins P, Plasterk RH, Jansen RC, et al.: Mapping determinants of gene expression plasticity by genetical genomics in C. elegans. PLoS Genet 2006, 2(12):e222. 10.1371/journal.pgen.0020222
Churchill GA: Fundamentals of experimental design for cDNA microarrays. Nat Genet 2002, 32(Suppl):490–495. 10.1038/ng1031
Fu J, Jansen RC: Optimal design and analysis of genetic studies on gene expression. Genetics 2006, 172(3):1993–1999. 10.1534/genetics.105.047001
Fisher RA: The design of experiments. 4th edition. Edinburgh: Oliver and Boyd; 1947.
Kerr MK, Churchill GA: Experimental design for gene expression microarrays. Biostatistics 2001, 2(2):183–201. 10.1093/biostatistics/2.2.183
Yang YH, Speed T: Design issues for cDNA microarray experiments. Nat Rev Genet 2002, 3(8):579–588.
Fournier MV, Carvalho PC, Magee DD, Carvalho MGC, Appasani K: Experimental Design for Gene Expression Analysis. In Bioarrays From Basics to Diagnostics. Humana Press; 2007:29.
Wit E, Nobile A, Khanin R: Near-optimal designs for dual-channel microarray studies. Applied Statistics 2005, 54(5):817–830.
Lam AC, Fu J, Jansen RC, Haley CS, de Koning DJ: Optimal design of genetic studies of gene expression with two-color microarrays in outbred crosses. Genetics 2008, 180(3):1691–1698. 10.1534/genetics.108.090308
Rosa GJ, de Leon N, Rosa AJ: Review of microarray experimental design strategies for genetical genomics studies. Physiol Genomics 2006, 28(1):15–23. 10.1152/physiolgenomics.00106.2006
Woo Y, Krueger W, Kaur A, Churchill G: Experimental design for three-color and four-color gene expression microarrays. Bioinformatics 2005, 21(Suppl 1):i459–467. 10.1093/bioinformatics/bti1031
Wit E, Nobile A, Khanin R: Simulated annealing for near-optimal dual-channel microarray designs. Appl Statistics 2005, 54:817–830.
The R Project for Statistical Computing [http://www.r-project.org/]
Swertz MA, De Brock EO, Van Hijum SA, De Jong A, Buist G, Baerends RJ, Kok J, Kuipers OP, Jansen RC: Molecular Genetics Information System (MOLGENIS): alternatives in developing local experimental genomics databases. Bioinformatics 2004, 20(13):2075–2083. 10.1093/bioinformatics/bth206
Swertz MA, Jansen RC: Beyond standardization: dynamic software infrastructures for systems biology. Nat Rev Genet 2007, 8(3):235–243. 10.1038/nrg2048
This work was supported by the Netherlands Organization for Scientific Research, NWO-86504001. We thank Danny Arends for help in implementing the web tool.
YL developed designGG. RCJ and RB directed the project. MAS, GV and JF helped to implement the web tool. All authors wrote the manuscript, and read and approved the final version.
Electronic supplementary material
Additional file 1: DesignGG aims at finding an optimal design of genetical genomics experiments that maximizes the power and resolution of detecting genetic, environmental and interaction effects. This will help to achieve high power and more accurate estimates of the effects of interesting factors, and thus yield a more reliable biological interpretation of data. (ZIP 128 KB)
Cite this article
Li, Y., Swertz, M.A., Vera, G. et al. designGG: an R-package and web tool for the optimal design of genetical genomics experiments. BMC Bioinformatics 10, 188 (2009). https://doi.org/10.1186/1471-2105-10-188
- Quantitative Trait Locus
- Complex Biological System
- Genetical Genomic
- Main Genetic Effect
- Expect Effect Size