MicroarrayDesigner: an online search tool and repository for near-optimal microarray experimental designs
 Ahmet Sacan^{1, 2},
 Nilgun Ferhatosmanoglu^{3} and
 Hakan Ferhatosmanoglu^{2}
DOI: 10.1186/1471-2105-10-304
© Sacan et al; licensee BioMed Central Ltd. 2009
Received: 12 February 2008
Accepted: 22 September 2009
Published: 22 September 2009
Abstract
Background
Dual-channel microarray experiments are commonly employed for inference of differential gene expression across varying organisms and experimental conditions. The design of dual-channel microarray experiments that can help minimize the errors in the resulting inferences has recently received increasing attention. However, a general and scalable search tool and a corresponding database of optimal designs were still missing.
Description
An efficient and scalable search method for finding near-optimal dual-channel microarray designs, based on a greedy hill-climbing optimization strategy, has been developed. It is empirically shown that this method can successfully and efficiently find near-optimal designs. Additionally, an improved interwoven loop design construction algorithm has been developed to provide an easily computable general class of near-optimal designs. Finally, in order to make the best results readily available to biologists, a continuously evolving catalog of near-optimal designs is provided.
Conclusion
A new search algorithm and database for near-optimal microarray designs have been developed. The search tool and the database are accessible via the World Wide Web at http://db.cse.ohio-state.edu/MicroarrayDesigner. Source code and binary distributions are available for academic use upon request.
Background
Microarray experiments are commonly used to detect differential expression of genes across a number of conditions of interest. In a typical two-color microarray experiment, cDNA varieties (also denoted as treatments or samples) from two experimental conditions are labeled with two different fluorophores (e.g., Cy3 green and Cy5 red fluorescent dyes), and hybridized onto the same slide of complementary probes. The relative intensities of the two fluorophores are then used to quantify the differential expression levels of the genes from the two treatments.
The data generated by microarray experiments are highly multidimensional and contain a considerable amount of noise due to variability associated with slide preparation and measurement. Therefore, careful planning is required in order to obtain statistically significant and biologically valid conclusions [1]. Theoretical experimental design studies aim to identify, in advance, the expected accuracy of the results that can be obtained from the microarray experiments (see [2] for a recent survey). Several evaluation criteria have been proposed in order to quantify the optimality of a given experimental design, with A-, L-, and D-optimality being the most widely used criteria [3, 4].
For small and simple experiments, it is possible to identify the optimal design through exhaustive enumeration of all possible designs. However, for more complex experiments, a naive search becomes infeasible, because the number of possible designs grows at least exponentially with the number of varieties or slides. For example, for 10, 11, and 12 vertices, there are about 11 million, 1 billion, and 164 billion non-isomorphic connected graphs, respectively. Therefore, the search for near-optimal designs ultimately relies on either following some general guidelines for constructing such designs, or heuristically sampling the search space of all graphs.
In this study, we have developed an effective hill-climbing strategy to search for near-optimal designs, and we harvest the results of the search efforts into a database of near-optimal designs. We have also developed an improved construction algorithm for the interwoven loop designs, which were previously found to be near-optimal [4], but for which no efficient construction method was available.
Construction and content
Given a design matrix X, an optimality criterion tries to summarize the precision of the parameter estimates in a single score. Differences in defining this precision have given rise to multiple forms of optimality criteria, with A-, L-, and D-optimality being the most common. In MicroarrayDesigner, we define these optimality criteria such that an optimal design would be one that minimizes the corresponding criterion:

A-optimality is defined as the average variance of the parameter estimates:

L-optimality is defined as the variance of the parameter estimates with respect to all parameter contrasts C:

D-optimality is defined in terms of the determinant of the design matrix.
Note that the definitions above are only trivially different from their conventional definitions in the literature [4, 5]. In particular, we have scaled the original definition of D-optimality using its logarithm, for numerical convenience. This modification preserves the ordering of optimality scores of different designs.
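The verbal definitions above can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the MATLAB implementation used by MicroarrayDesigner: it assumes the simple model in which each slide measures the difference between two variety effects with equal, independent noise, fixes the last variety as a baseline so that the information matrix is invertible, and uses hypothetical helper names (`reduced_information`, `invert_with_det`, `criteria`). Its scaling may differ from the authors' definitions, so the scores are not expected to match values reported elsewhere in this paper.

```python
import math
from fractions import Fraction
from itertools import combinations


def reduced_information(n, slides):
    """X'X for the assumed model y = mu_i - mu_j + noise, with variety n-1 as baseline."""
    k = n - 1
    info = [[Fraction(0)] * k for _ in range(k)]
    for i, j in slides:
        if i < k:
            info[i][i] += 1
        if j < k:
            info[j][j] += 1
        if i < k and j < k:
            info[i][j] -= 1
            info[j][i] -= 1
    return info


def invert_with_det(matrix):
    """Exact Gauss-Jordan inverse and determinant; (None, 0) if singular."""
    k = len(matrix)
    aug = [list(row) + [Fraction(int(r == c)) for c in range(k)]
           for r, row in enumerate(matrix)]
    det = Fraction(1)
    for col in range(k):
        pivot = aug[col][col]
        if pivot == 0:  # matrix is positive semidefinite, so a zero pivot means singular
            return None, Fraction(0)
        det *= pivot
        aug[col] = [x / pivot for x in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[k:] for row in aug], det


def criteria(n, slides):
    """Return (A, L, D) scores, lower is better, or None for a disconnected design."""
    cov, det = invert_with_det(reduced_information(n, slides))
    if cov is None:
        return None
    k = n - 1

    def contrast_variance(i, j):
        # Variance of the estimate of mu_i - mu_j, in units of the noise variance.
        if j == k:
            return cov[i][i]
        return cov[i][i] + cov[j][j] - 2 * cov[i][j]

    a_score = sum(cov[i][i] for i in range(k)) / k  # assumed form of A
    l_score = sum(contrast_variance(i, j) for i, j in combinations(range(n), 2))
    d_score = -math.log(float(det))  # log-scaled, as described in the text
    return a_score, l_score, d_score


# Three varieties compared pairwise on three slides (a triangle).
print(criteria(3, [(0, 1), (1, 2), (2, 0)]))
```

Exact fractions are used here only to keep the example free of rounding issues; a floating-point linear algebra library would be the natural choice for large designs.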
Hill Climbing optimization
For a given set of experimental constraints, we would like to find the experimental design that is optimal, i.e., that minimizes the given optimality criterion. The experimental constraints are the number of varieties being analyzed (n) and the number of slides (m) available for the experiment. Our heuristic search method is based on a hill-climbing approach that seeks to improve a given initial experimental design at each step. This is achieved by repeatedly adding and removing edges (slides) until no further improvement in the optimality criterion can be achieved. The basic algorithm is outlined in Appendix 1.
The algorithm takes an initial design graph G = ⟨V, E⟩, where V and E are the lists of vertices and edges, respectively. At each iteration, a random number r of edges is added one by one, and r edges are then removed one by one. The AddBestEdge function tests all edges (for large graphs, random sampling of edges is performed) and identifies the candidates that improve the optimality criterion of the design the most. These candidates are further filtered with the objective of minimizing the variance in the degree of the vertices, and the distances between pairs of vertices. An edge is randomly selected from the final set of candidates and added to the graph. Similarly, the RemoveWorstEdge function first identifies candidate edges that can be removed with the least degradation to optimality, and a randomly selected candidate edge is removed from the graph. The search algorithm stops when a predefined maxiter number of iterations is reached, or when no improvement is obtained for maxidle iterations.
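The secondary filtering of candidate edges can be illustrated with a short sketch. The code below covers only the degree-variance part of the filter (the distance-based part is omitted), and the function names are hypothetical rather than taken from the MicroarrayDesigner source. Because adding one edge always raises the total degree by two, ranking candidates by the sum of squared degrees is equivalent to ranking them by degree variance.

```python
def degrees(n, slides):
    """Number of slides on which each of the n varieties appears."""
    deg = [0] * n
    for i, j in slides:
        deg[i] += 1
        deg[j] += 1
    return deg


def filter_by_degree_balance(n, slides, candidates):
    """Keep the candidate edges whose addition leaves the vertex degrees most even."""
    def spread_after(edge):
        # Sum of squared degrees; same ordering as the variance for a fixed degree total.
        return sum(d * d for d in degrees(n, slides + [edge]))

    best = min(spread_after(edge) for edge in candidates)
    return [edge for edge in candidates if spread_after(edge) == best]


# A path 0-1-2-3: closing it into a loop gives every variety two slides.
path = [(0, 1), (1, 2), (2, 3)]
all_pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
print(filter_by_degree_balance(4, path, all_pairs))
```

In a full search this filter would be applied only to the candidates that already tie on the optimality criterion, after which one survivor is picked at random.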
Interwoven Loop designs
Because of the difficulty of analytically or numerically finding the optimal designs, there have been efforts to identify certain recipes for construction of experimental designs. The "reference" and "loop" designs are two such basic types of designs and are the most widely used experimental layouts. In the reference design, each variety is compared to a common reference variety [6], whereas in the loop design, the varieties are compared to one another in a circular or multiple-pairwise fashion [3]. The loop design has been shown to be generally more efficient than the reference design [7, 8].
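The two layouts are easy to write down explicitly. The sketch below builds both for the same number of slides and counts how often each variety is measured; the numbering of varieties (with 0 as the reference sample) and the function names are assumptions made for illustration. One common intuition for the efficiency difference is visible in the counts: the reference design spends half of its measurements on the reference sample.

```python
from collections import Counter


def reference_design(n):
    """Varieties 1..n each share one slide with the reference sample 0."""
    return [(0, v) for v in range(1, n + 1)]


def loop_design(n):
    """Varieties 0..n-1 arranged in a circle, each compared to its successor."""
    return [(v, (v + 1) % n) for v in range(n)]


def measurements_per_variety(slides):
    """How many times each variety is hybridized across all slides."""
    counts = Counter()
    for a, b in slides:
        counts[a] += 1
        counts[b] += 1
    return dict(counts)


# Five varieties of interest on five slides in both layouts.
print(measurements_per_variety(reference_design(5)))
print(measurements_per_variety(loop_design(5)))
```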
Wit et al. [4] introduced Interwoven Loop design layouts as ordinary loop designs where each variety is also compared to the varieties that are j_{2}, j_{3}, ..., j_{n1} 'jumps' further along the circle. Interwoven Loop designs were not only shown to be more efficient than the alternative reference and loop designs, but they were also shown to be near-optimal, i.e., achieving an optimality that is close to that of the theoretically best possible design. However, to the best of our knowledge, no efficient algorithm was previously available for the construction of such designs. The size of the class of such designs explored by the smida software package [4, 9] is exponential in the number of slides, and the search becomes infeasible for large experiments.
As part of MicroarrayDesigner, we have developed an efficient construction algorithm based on the observation that the jumps in the optimal interwoven loop designs are organized in such a way that the pairwise distances between the nodes are minimized. This heuristic construction (Heuristic Loop) has made it feasible to generate interwoven loop designs for very large experiments. For example, for 10 varieties and 100 slides, it takes the smida program over 10 minutes to find the optimal interwoven loop design, whereas Heuristic Loop finds the same design in under 0.01 seconds. For 10 varieties and 200 slides, smida is unable to execute due to excessive memory allocation (with an estimated runtime of more than 110 years), whereas Heuristic Loop runs in only 0.12 seconds.
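The interwoven loop layout and the distance-based observation can be sketched as follows. This is not the Heuristic Loop algorithm itself, whose details are not given here: the code simply enumerates jump sets for small cases and keeps the one with the smallest total shortest-path distance between varieties. The function names, the convention that jump 1 is always included, and the restriction to jumps below n/2 (to avoid duplicated pairs) are all assumptions made for illustration.

```python
from collections import deque
from itertools import combinations


def interwoven_loop(n, jumps):
    """One slide (i, i + j mod n) for every variety i and every jump j."""
    return [(i, (i + j) % n) for j in jumps for i in range(n)]


def total_distance(n, slides):
    """Sum of shortest-path lengths over all pairs of varieties (inf if disconnected)."""
    neighbours = {v: set() for v in range(n)}
    for a, b in slides:
        neighbours[a].add(b)
        neighbours[b].add(a)
    total = 0
    for source in range(n):
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for w in neighbours[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) < n:
            return float("inf")
        total += sum(dist.values())
    return total // 2  # each unordered pair was counted twice


def best_jumps(n, loops):
    """Exhaustively pick `loops` jumps (jump 1 included) minimizing total distance."""
    options = list(combinations(range(2, (n - 1) // 2 + 1), loops - 1))
    if not options:
        raise ValueError("not enough distinct jumps for this many loops")
    scored = [(total_distance(n, interwoven_loop(n, (1,) + extra)), (1,) + extra)
              for extra in options]
    score, jumps = min(scored)
    return jumps, score


# 10 varieties on 20 slides: an ordinary loop plus one interwoven loop.
print(best_jumps(10, 2))
```

The exhaustive enumeration here is exponential in the number of loops, which is exactly the cost a constructive heuristic is meant to avoid; the sketch is only meant to show what is being minimized.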
The database and web interface
The nondeterministic search methods generate different results each time they are executed, and it would be a waste of computational time and resources not to store the best designs found. The efficient and scalable methods developed in this study have allowed us to compile a database of near-optimal designs for variety and slide numbers of up to 100, which we believe to be a good limit for practical experiments. In order to continually improve the database, a background process keeps searching for better designs and updates the stored designs accordingly. Daily snapshots of the database are made available through the web interface. This database can serve as a practical reference for biologists, and as a benchmark dataset for research in microarray design.
Utility
L-optimality of designs found by different methods.
Varieties  Slides  smidaSA  Heuristic Loop  Hill Climbing
5          10      4.22     4.10            4.10
10         30      14.39    N/A             14.22
20         50      86.27    N/A             83.53
20         80      49.47    47.51           47.75
30         100     146.54   N/A             141.36
50         100     861.42   893.63          825.81
For each test case, the Hill Climbing search method found either the best or close to the best design. We attribute the performance of Hill Climbing to the fact that, unlike the random changes employed in the Simulated Annealing method, the changes at each iteration of our algorithm guide the search toward a better design. Notably, a reimplementation of smidaSA with the Hill Climbing search method incorporated as one of the possible steps gave slightly better results than the original Simulated Annealing method. The results of the Hill Climbing-enhanced Simulated Annealing, and of the other test cases, are available on the website as part of the database.
Discussion and conclusion
We have developed an efficient heuristic method for finding near-optimal microarray experimental designs. The proposed method employs a directed hill-climbing algorithm that guides the search toward optimal designs. We have also developed a constructive algorithm for the class of interwoven loop designs, making construction of these designs feasible for large experiments.
The improved search algorithms have allowed us to generate and maintain a continually evolving database of near-optimal microarray experimental designs. This compilation can serve as a reference and benchmark for experiment designers and design optimality researchers. An interactive web interface is provided to query the set of designs for various optimality measures or to upload user-contributed designs. Daily snapshots of the database are also provided for download.
While the early microarray design studies focused on fixed effects models, there have been recent efforts to address the hierarchical or factorial nature of the experimental designs using mixed effects models [2, 10]. We remark that the design optimization procedure introduced in this study cannot be applied directly to general factorial designs. Nevertheless, in the current version of MicroarrayDesigner, we have implemented limited support for hierarchical designs with only two levels of factors. Following Ankenman et al. [11], we have modeled biological replication using nested random factors. Support for a more comprehensive mixed effects model and analysis of the data generated from various experimental designs are outside the scope of the current study and are left for future work.
Availability and requirements
The search tool and the database are accessible via the World Wide Web at http://db.cse.ohio-state.edu/MicroarrayDesigner. The source code and binary distributions for the search algorithm and the web service are available from the authors for academic use. The computation of the optimality criteria and the search algorithms are implemented in MATLAB. The database of experimental designs is stored as plain text files to simplify distributed processing and to allow direct packaging of the database for download.
Appendix 1 - The Hill Climbing optimization algorithm
The algorithm optimizes an input design graph G by repeatedly adding and removing a random number of edges, choosing each edge so as to improve the optimality criterion. AddBestEdge iterates over each pair of nodes in the graph and adds the edge that improves optimality the most. Likewise, RemoveWorstEdge iterates over the existing edges and removes the one whose removal degrades optimality the least. The procedure is repeated for a specified maxiter iterations or until no improvement is achieved over maxidle iterations.
Input: initial design graph G = ⟨V, E⟩
Output: optimized design G
for i ← 1 to maxiter do
    G' ← G;
    r ← rand × |V|;
    for j ← 1 to r do
        AddBestEdge(G');
    for j ← 1 to r do
        RemoveWorstEdge(G');
    if optimality did not improve past maxidle iterations then
        break;
    G ← G';
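A minimal runnable sketch of this procedure is given below in Python. It is not the authors' MATLAB implementation and makes several labeled assumptions: it scores designs with a D-type criterion (the negative log-determinant of the information matrix with the last variety fixed as a baseline), it evaluates every candidate edge instead of sampling, it omits the degree and distance filters, and it keeps a trial design only when the score strictly improves. All function and parameter names are hypothetical.

```python
import math
import random
from fractions import Fraction


def d_score(n, slides):
    """-log det of the baseline-reduced information matrix; inf if disconnected."""
    k = n - 1
    m = [[Fraction(0)] * k for _ in range(k)]
    for i, j in slides:
        if i < k:
            m[i][i] += 1
        if j < k:
            m[j][j] += 1
        if i < k and j < k:
            m[i][j] -= 1
            m[j][i] -= 1
    det = Fraction(1)
    for col in range(k):  # exact Gaussian elimination
        pivot = m[col][col]
        if pivot == 0:  # positive semidefinite matrix: zero pivot means singular
            return math.inf
        det *= pivot
        for r in range(col + 1, k):
            f = m[r][col] / pivot
            m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return -math.log(float(det))


def best_choices(scored):
    """Items whose score ties with the minimum (lower scores are better)."""
    best = min(score for score, _ in scored)
    return [item for score, item in scored if score <= best + 1e-12]


def add_best_edge(n, slides, rng):
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    scored = [(d_score(n, slides + [p]), p) for p in pairs]
    slides.append(rng.choice(best_choices(scored)))


def remove_worst_edge(n, slides, rng):
    scored = [(d_score(n, slides[:k] + slides[k + 1:]), k)
              for k in range(len(slides))]
    slides.pop(rng.choice(best_choices(scored)))


def hill_climb(n, slides, max_iter=50, max_idle=10, seed=0):
    rng = random.Random(seed)
    best, best_score, idle = list(slides), d_score(n, slides), 0
    for _ in range(max_iter):
        trial = list(best)
        r = rng.randint(1, n)  # assumed reading of "r <- rand x |V|"
        for _ in range(r):
            add_best_edge(n, trial, rng)
        for _ in range(r):
            remove_worst_edge(n, trial, rng)
        score = d_score(n, trial)
        if score < best_score - 1e-12:
            best, best_score, idle = trial, score, 0
        else:
            idle += 1
            if idle >= max_idle:
                break
    return best, best_score


# Start from a reference-like layout: variety 0 on every slide.
start = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 1)]
print(hill_climb(5, start, seed=1))
```

Because each iteration adds r slides and then removes r slides, the number of slides in the returned design always equals that of the starting design.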
Declarations
Acknowledgements
This research is partially supported by the US National Science Foundation (NSF) Grants IIS-0546713 and DBI-0750891, and by the Turkish Scientific and Research Council (TÜBİTAK) Grant 107E173.
References
 1. Quackenbush J: Computational analysis of microarray data. Nature Reviews Genetics 2001, 2(6):418–27. 10.1038/35076576
 2. Rosa GJM, de Leon N, Rosa AJM: Review of microarray experimental design strategies for genetical genomics studies. Physiological Genomics 2006, 28: 15–23. 10.1152/physiolgenomics.00106.2006
 3. Kerr MK, Churchill GA: Experimental design for gene expression microarrays. Biostatistics 2001, 2(2):183–201. 10.1093/biostatistics/2.2.183
 4. Wit E, Nobile A, Khanin R: Near-optimal designs for dual channel microarray studies. Journal of the Royal Statistical Society: Series C (Applied Statistics) 2005, 54(5):817–830. 10.1111/j.1467-9876.2005.00519.x
 5. Bailey RA: Designs for two-colour microarray experiments. Journal of the Royal Statistical Society: Series C (Applied Statistics) 2007, 56(4):365–394. 10.1111/j.1467-9876.2007.00582.x
 6. Eisen M, Brown P: DNA arrays for analysis of gene expression. Methods in Enzymology 1999, 303: 179–205.
 7. Kerr M: Design considerations for efficient and effective microarray studies. Biometrics 2003, 59(4):822–828. 10.1111/j.0006-341X.2003.00096.x
 8. Vinciotti V, Khanin R, D'alimonte D, Liu X, Cattini N, Hotchkiss G, Bucca G, de Jesus O, Rasaiyaah J, Smith C, Kellam P, Wit E: An experimental evaluation of a loop versus a reference design for two-channel microarrays. Bioinformatics 2005, 21(4):492–501. 10.1093/bioinformatics/bti022
 9. Wit E, McClure J: R-library: SMIDA version 0.1.2006. [http://www.math.rug.nl/~ernst/book/smida.html]
 10. Tempelman RJ: Assessing statistical precision, power, and robustness of alternative experimental designs for two color microarray platforms based on mixed effects models. Veterinary Immunology and Immunopathology 2005, 105(3–4):175–186. 10.1016/j.vetimm.2005.02.002
 11. Ankenman BE, Aviles AI, Pinheiro JC: Optimal designs for mixed-effects models with two random nested factors. Statistica Sinica 2003, 13: 385–401.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.