Two new ArrayTrack libraries for personalized biomedical research
BMC Bioinformatics volume 11, Article number: S6 (2010)
Recent advances in high-throughput genotyping technology are paving the way for research in personalized medicine and nutrition. However, most of the genetic markers identified from association studies account for a small contribution to the total risk/benefit of the studied phenotypic trait. Testing whether the candidate genes identified by association studies are causal is critically important to the development of personalized medicine and nutrition. An efficient data mining strategy and a set of sophisticated tools are necessary to help better understand and utilize the findings from genetic association studies.
SNP (single nucleotide polymorphism) and QTL (quantitative trait locus) libraries were constructed and incorporated into ArrayTrack, with user-friendly interfaces and powerful search features. Data from several public repositories were collected in the SNP and QTL libraries and connected to other domain libraries (genes, proteins, metabolites, and pathways) in ArrayTrack. Linking the data sets within ArrayTrack allows searching of SNP and QTL data as well as their relationships to other biological molecules. The SNP library includes approximately 15 million human SNPs and their annotations, while the QTL library contains publicly available QTLs identified in mouse, rat, and human. The QTL library was developed for finding the overlap between the map position of a candidate or metabolic gene and QTLs from these species. Two use cases were included to demonstrate the utility of these tools. The SNP and QTL libraries are freely available to the public through ArrayTrack at http://www.fda.gov/ArrayTrack.
These libraries developed in ArrayTrack contain comprehensive information on SNPs and QTLs and are further cross-linked to other libraries. Connecting domain specific knowledge is a cornerstone of systems biology strategies and allows for a better understanding of the genetic and biological context of the findings from genetic association studies.
Genetic variations are a major factor in inter-individual differences in disease susceptibility and response to environmental exposures such as nutrients and drugs. Recent advances in microarray-based genotyping techniques have enabled researchers to rapidly scan for known single nucleotide polymorphisms (SNPs), one of the most common genetic variations, across complete genomes. Genome-wide association studies (GWAS) have identified putative variations that contribute to common, complex diseases such as asthma, cancer, diabetes, heart disease, and mental illnesses. SNPs that have been associated with complex diseases may eventually be used to develop better strategies to detect, treat, and prevent these diseases. A web-based catalog of GWAS publications has been created and is periodically updated at the National Human Genome Research Institute [1]. Such technology is contributing to the development of personalized medicine, in which the current one-size-fits-all approach to medical care will give way to more customized treatment strategies.
However, it is uncommon for GWAS to incorporate diet or environmental exposures which are known to influence disease susceptibility ([2, 3] and http://www.nugo.org/nutrialerts/39848). In addition, many GWAS have been done in European populations and their applicability to other populations and individuals has not been adequately studied ([4–6] and http://www.nugo.org/nutrialerts/40314 and http://www.nugo.org/nutrialerts/38373). GWAS results must therefore be further tested to determine whether the statistical associations found offer real-world potential to predict complex phenotypes or are useful in developing testable hypotheses about the development, progression, or treatment of a disease.
A novel strategy has been proposed to analyze gene-nutrient interactions, aiming to discover genes that contribute to individual risk factors [7–9]. This data mining strategy is based on analyzing candidate genes involved in nutrient metabolism or regulation and mapping those genes to quantitative trait loci (QTL) contributing to a particular trait or condition. A QTL is a region of DNA that is associated with a particular phenotypic trait. A common use of QTL data is to identify candidate genes underlying a trait within one or more QTL. This approach utilizes the available genomic, physiological, and environmental data to select candidate genes for further analyses.
A limitation of this type of strategy is that many databases are knowledge or domain specific – that is, they limit data to one discipline such as proteomics, genomics, or metabolomics. To address this limitation, we propose a solution through ArrayTrack. ArrayTrack is a publicly accessible microarray data management and analysis system developed by the FDA’s National Center for Toxicological Research [10, 11]. It has been extended to manage and analyze preprocessed proteomics and metabolomics experiment data. To facilitate data interpretation, ArrayTrack has integrated a rich collection of biological information for genes, proteins and pathways, which are drawn from public repositories and organized as individual yet cross-linked libraries. Thus it provides a one-stop solution for omics data analysis and interpretation in the context of gene-function relationship.
One of the focuses in GWAS is to relate SNPs to genes and pathways to understand the underlying mechanisms of the studied disease. The SNP-gene-pathway relationship should be dynamically interrogated in an interactive/integrated environment. ArrayTrack has provided a gene-pathway exploratory platform. By integrating the SNP library that contains annotation summary information of SNPs and their mapped relationship to genes, ArrayTrack now enables dynamic analysis of the SNP-gene-pathway relationship and thus offers support to SNP studies. The identification of the SNP-gene-QTL relationship is the basis to test whether the gene/SNP is associated with the etiology of a disease in animal models or human studies. The integration of SNP and QTL libraries into ArrayTrack enables dynamic mining of such complex biological interactions and thus expands the utility of ArrayTrack.
Construction and content
A major goal of the SNP and QTL libraries is to collect dispersed data in one place, allowing researchers to easily access and compare data across multiple knowledge bases. Data have been downloaded from public repositories and reorganized as library components of ArrayTrack. The data in the SNP and QTL libraries can directly link back to their sources, as well as ArrayTrack’s own existing collection of libraries.
Data for the SNP library, with annotation summary information, were downloaded into ArrayTrack (an Oracle Enterprise Edition 10g database) from the UCSC Genome Bioinformatics Site and the NCBI dbSNP. This supports a direct external link to the UCSC Genome Browser for each SNP, with the link constructed from the SNP’s chromosomal position. The UCSC Genome Bioinformatics Site reports different positions than the NCBI dbSNP database for a small subset of SNPs. The annotation summary information is organized as one database table (Table 1). The SNP library includes approximately 15 million human SNPs and their annotations.
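A position-based link of this kind can be sketched as follows. This is a minimal illustration, not ArrayTrack's implementation: the article does not give the link format, so the base URL, the `db` and `position` parameters, the default assembly name, and the flanking window are all assumptions, and the example coordinates are hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL and parameter names for the UCSC Genome Browser;
# the article does not state the link format that ArrayTrack uses.
UCSC_TRACKS_URL = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def ucsc_link(chrom: str, position: int, flank: int = 500,
              assembly: str = "hg19") -> str:
    """Build a browser link centred on a SNP's chromosomal position.

    `flank` (bp shown on each side) and `assembly` are illustrative
    defaults, not values taken from the article.
    """
    start = max(1, position - flank)  # browser coordinates start at 1
    end = position + flank
    query = urlencode(
        {"db": assembly, "position": f"{chrom}:{start}-{end}"},
        safe=":",  # keep the chrom:start-end form readable
    )
    return f"{UCSC_TRACKS_URL}?{query}"


if __name__ == "__main__":
    # Hypothetical SNP location, used only to show the call.
    print(ucsc_link("chr7", 1000, flank=100))
```

Because the link depends only on chromosome and position, it can be generated on demand for any row of the SNP table rather than stored.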
For additional annotations, external links are provided for each SNP to the websites of dbSNP, UCSC Genome Browser, Ensembl, and the International HapMap Project [15–17]. These websites provide information about SNP allele frequency distributions among different populations, linkage with nearby genetic variants, functional annotations, and pathways involving the related genes [18, 19]. Major online SNP databases and resources are listed at http://www.nugo.org/nutrialerts/40615. The SNP library also maps SNPs to genes in ArrayTrack’s Gene library based on the relationships downloaded from dbSNP.
For the QTL library, data for mouse, rat, and human QTLs were collected from species-specific databases. QTL data for mouse were taken from the Mouse Genome Database (MGD) at Jackson Laboratory. Only those QTLs with a valid mapping position and an official validation status were imported. QTL data for rat and human were extracted from the Rat Genome Database (RGD) developed by the Medical College of Wisconsin. The processing of QTL data taken from RGD was much more complex because we disagreed with the QTL position assignment method adopted by RGD. A QTL in RGD is positioned on a genome assembly by using the flanking and peak markers provided by the publication detailing the QTL. When only one flanking or peak marker is available, the QTL position is assigned using QTL size estimates made from the global distribution of QTL sizes, which are 26 Mbp (million base pairs) for human and 45 Mbp for rat. Many researchers would prefer to estimate these sizes themselves rather than rely on the default parameters. We excluded those QTLs that are identified by only one flanking marker because the confidence in their positions is quite low. For those QTLs identified by the peak marker, the position of the peak marker was assigned to the associated QTL, without any estimate of the QTL’s size. Marker positions were pulled from RGD except for those markers that are actually genes, in which case gene positions from NCBI were used. Finally, QTL data from all three species were stored together in one database table (Table 2). Web links to the original data sources have been provided for detailed information about each QTL. Additionally, chromosome positions for all genes in human, rat, and mouse were downloaded from the NCBI ftp site and organized into a separate table to enable the cross-table query of genes and QTLs based on their map positions.
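The position rules described above for RGD-derived QTLs can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' import code: the `QtlMarkers` record and its field names are hypothetical, and the handling of a QTL that has one flanking marker plus a peak marker (the peak is used) is an assumption, since the text does not cover that case.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class QtlMarkers:
    """Marker positions (bp) reported for one QTL; None means unavailable.

    Hypothetical record layout, used only for this sketch.
    """
    symbol: str
    chrom: str
    flank_start: Optional[int] = None
    flank_end: Optional[int] = None
    peak: Optional[int] = None


def assign_position(q: QtlMarkers) -> Optional[Tuple[int, int]]:
    """Return (start, end) for a QTL, or None if it should be excluded."""
    if q.flank_start is not None and q.flank_end is not None:
        # Both flanking markers known: the QTL spans the two markers.
        lo, hi = sorted((q.flank_start, q.flank_end))
        return lo, hi
    if q.peak is not None:
        # Peak marker available: use its position, with no size estimate.
        return q.peak, q.peak
    # Only one flanking marker (or no marker at all): excluded.
    return None


if __name__ == "__main__":
    # Hypothetical records and coordinates.
    records = [
        QtlMarkers("Qtl_demo1", "1", flank_start=2_000_000, flank_end=1_000_000),
        QtlMarkers("Qtl_demo2", "2", peak=5_000_000),
        QtlMarkers("Qtl_demo3", "3", flank_start=7_000_000),
    ]
    for r in records:
        print(r.symbol, assign_position(r))
```

Representing a peak-only QTL as a zero-length interval keeps it searchable by position without implying a size, which matches the choice described in the text.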
Utility and discussion
Many databases are cumbersome and difficult to browse or search. For example, only one SNP at a time may be queried and viewed in dbSNP, even though it is a well-designed and cross-linked SNP database. Besides collecting dispersed data in one place to facilitate data mining across multiple knowledge domains, ArrayTrack also aims to make data more accessible. The SNP library (Figure 1) and QTL library (Figure 2) use a clean interface that offers a spreadsheet-like view of search results. Searches are quick and offer comprehensive functionality that includes extended mapping ranges, exact or partial matches, and combinations of query filters on all data fields. The addition of the SNP and QTL libraries to ArrayTrack opens up several new research opportunities. Following are two case studies of data mining strategies.
Gene – nutrient interaction
This strategy was proposed to analyze gene-nutrient interactions, aiming to discover genes that contribute to risk factors that include environmental exposures. Manually finding the QTLs whose map positions are near those of each gene in a given list is labor intensive and time consuming. The QTL library automates this data mining process by searching and collecting data and providing a convenient list-based search interface. The strategy comprises two steps:
1. Search the metabolic and regulatory pathways of a chosen nutrient to generate a list of genes regulated by or involved in the metabolism of that nutrient. Examples used to develop this approach included thiamine, folic acid, riboflavin, glucose, fructose, vitamin A, vitamin D, and vitamin E. The pathway for each gene or metabolite is searched individually. This step may be accomplished through GeneGo or other similar pathway search tools.
2. Using the QTL library, map each gene to QTLs contributing to a phenotype. In this case, the metabolic genes were “mapped” to QTLs for obesity, T2DM, body weight, or other related phenotypes or to QTLs that contribute to those diseases (for example, insulin or glucose level QTLs). The chromosomal position of each gene is found with the specified species mapping information and then used to construct a chromosomal search region for QTLs with a user-specified range of extension.
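The position-based search in the second step can be sketched as an interval-overlap test. This is a minimal illustration, not ArrayTrack's query: it assumes that all records come from a single species and assembly, that a QTL counts as a hit when it overlaps the gene span extended by the search range on both sides, and the `Region` type, function name, and example coordinates are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Region(NamedTuple):
    """A named interval on a chromosome (hypothetical layout)."""
    name: str
    chrom: str
    start: int  # bp
    end: int    # bp


def qtls_near_genes(genes: List[Region], qtls: List[Region],
                    extension_bp: int) -> Dict[str, List[str]]:
    """For each gene, list QTLs overlapping the gene span +/- extension_bp."""
    hits: Dict[str, List[str]] = {}
    for g in genes:
        lo = g.start - extension_bp
        hi = g.end + extension_bp
        hits[g.name] = [
            q.name for q in qtls
            # Same chromosome, and the two intervals intersect.
            if q.chrom == g.chrom and q.start <= hi and q.end >= lo
        ]
    return hits


if __name__ == "__main__":
    # Hypothetical gene and QTL coordinates.
    genes = [Region("GeneA", "1", 10_000_000, 10_050_000)]
    qtls = [
        Region("Qtl1", "1", 14_000_000, 20_000_000),
        Region("Qtl2", "1", 16_000_000, 18_000_000),
        Region("Qtl3", "2", 10_000_000, 11_000_000),
    ]
    print(qtls_near_genes(genes, qtls, extension_bp=5_000_000))
```

In a database setting the same condition would normally be expressed as a join between the gene-position and QTL tables, so that a whole gene list is searched in one query.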
An example of this strategy is shown in Table 3. The dietary carbohydrate fructose is implicated in the pathogenesis of obesity, insulin resistance, and cardiovascular diseases [22]. To identify the genes of the fructose metabolic pathway that potentially underlie these disease processes, a pathway analysis program, GeneGo, was used. A total of 34 genes were obtained for fructose metabolism (rodent version) from the GeneGo “organism specific pathway map” under carbohydrate metabolism. These genes were uploaded to ArrayTrack’s QTL library and searched for associations with QTLs, with mouse selected as the species. The search range was set at 5 Mbp. The 34 genes involved in fructose metabolism were associated with a total of 108 QTLs. These results were filtered to retain only the QTLs related to obesity, type 2 diabetes, and cardiovascular diseases. Using this approach, 11 genes of the fructose metabolic pathway were found to be associated with 19 mouse QTLs, as shown in Table 3.
Connecting GWAS results with QTLs
Both GWAS and QTL analyses associate a certain trait with genetic map positions. GWAS typically use unrelated populations of cases and controls. QTL mapping studies are usually performed on inbred strains of animals or nuclear families (e.g., trios design) of humans. Combining GWAS results from human association studies with QTLs, which are usually from laboratory animals, increases the reliability of identifying candidate genes for further fine mapping studies, e.g., through next generation sequencing. Primates and rodents have shared synteny, the co-localization of genes within a chromosomal region. These shared chromosomal regions can be re-ordered within and among chromosomes between species, but their map positions have been well characterized (see NCBI MapView - http://preview.ncbi.nlm.nih.gov/mapview/). This strategy comprises three steps:
1. Obtain a list of trait-associated SNPs from published GWAS results for a chosen condition such as obesity, T2DM, or hypertension. This can be quickly accomplished by querying the GWAS Catalog.
2. Using ArrayTrack’s SNP library, map each SNP to genes based on chromosomal positions. The result of this step is a list of genes.
3. For each gene in the list, query ArrayTrack’s QTL library to find whether there are any nearby QTLs that may contribute to the studied condition.
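The second and third steps can be chained as in the sketch below. It is a hedged illustration rather than ArrayTrack's logic: the tuple layouts, the function names, the rule that a SNP maps to a gene when it lies inside the gene span, and the keyword match on trait names are all assumptions, and every identifier and coordinate in the example is hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Snp = Tuple[str, str, int]            # (snp_id, chrom, position)
Gene = Tuple[str, str, int, int]      # (symbol, chrom, start, end)
Qtl = Tuple[str, str, int, int, str]  # (symbol, chrom, start, end, trait)


def snps_to_genes(snps: Sequence[Snp], genes: Sequence[Gene]) -> Set[str]:
    """Step 2: collect genes whose span contains at least one listed SNP."""
    return {
        g[0]
        for s in snps
        for g in genes
        if g[1] == s[1] and g[2] <= s[2] <= g[3]
    }


def candidate_pairs(snps: Sequence[Snp], genes: Sequence[Gene],
                    qtls: Sequence[Qtl], extension_bp: int,
                    trait_keywords: Sequence[str]) -> List[Tuple[str, str]]:
    """Step 3: pair each SNP-mapped gene with nearby, trait-relevant QTLs."""
    mapped = snps_to_genes(snps, genes)
    keywords = [k.lower() for k in trait_keywords]
    pairs = []
    for symbol, chrom, start, end in genes:
        if symbol not in mapped:
            continue
        for q_symbol, q_chrom, q_start, q_end, trait in qtls:
            # QTL overlaps the gene span extended by the search range.
            near = (q_chrom == chrom
                    and q_start <= end + extension_bp
                    and q_end >= start - extension_bp)
            relevant = any(k in trait.lower() for k in keywords)
            if near and relevant:
                pairs.append((symbol, q_symbol))
    return sorted(pairs)


if __name__ == "__main__":
    # Hypothetical data, small coordinates for readability.
    snps = [("rs_demo1", "1", 150), ("rs_demo2", "2", 999)]
    genes = [("GENE_A", "1", 100, 200), ("GENE_B", "2", 100, 200)]
    qtls = [("QTL_BP", "1", 1000, 2000, "Blood pressure"),
            ("QTL_BW", "1", 150, 300, "Body weight")]
    print(candidate_pairs(snps, genes, qtls, 1000,
                          ["blood pressure", "hypertension"]))
```

Filtering by trait keyword after the positional search mirrors the order used in the case studies, where all nearby QTLs are collected first and then narrowed to the phenotypes of interest.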
As an example, a list of SNPs associated with hypertension-related phenotypes, such as elevated systolic or diastolic blood pressure (or both), hypertension, and stroke, was obtained from the GWAS Catalog. These SNPs were then mapped to human genes through the SNP library. Finally, we searched the QTL library for human QTLs that, by mapping position, are close to any gene in the list. An extended search range of 2 Mbp was chosen and the results were filtered to keep those QTLs relevant to hypertension-related traits or phenotypes. The final results are shown in Table 4. The genes identified by this strategy are candidates for further fine mapping in linkage or association studies and may be used to design animal studies to test their role in the mechanisms of hypertension.
Besides meeting the need for SNP interpretation and exploration, the integration of the SNP library with ArrayTrack’s library collection enables users to quickly explore and compare the associated biological pathways for SNPs of interest. Along with ArrayTrack’s library collection, the SNP and QTL libraries will be maintained and periodically updated as new data become available. As the development of these libraries progresses, queries based on gene names will be added to the SNP library and queries based on QTL symbols will be implemented for the QTL library.
The massive amount of data generated in biomedical research studies is often considered and organized as separate knowledge domains. We are developing strategies and tools such as the SNP and QTL libraries for data mining that will allow for more targeted research studies for developing the path to personalized nutrition, medicine, and healthcare.
Availability and requirements
The SNP and QTL libraries are freely available to the public through ArrayTrack at http://www.fda.gov/ArrayTrack.
A Catalog of Published Genome-Wide Association Studies[http://www.genome.gov/GWAStudies]
Kaput J: Nutrigenomics research for personalized nutrition and medicine. Curr Opin Biotechnol 2008, 19(2):110–120. 10.1016/j.copbio.2008.02.005
Kaput J, Rodriguez RL: Nutritional genomics: the next frontier in the postgenomic era. Physiol Genomics 2004, 16(2):166–177.
Myles S, Davison D, Barrett J, Stoneking M, Timpson N: Worldwide population differentiation at disease-associated SNPs. BMC Med Genomics 2008, 1(1):22. 10.1186/1755-8794-1-22
Myles S, Tang K, Somel M, Green RE, Kelso J, Stoneking M: Identification and analysis of genomic regions with large between-population differentiation in humans. Ann Hum Genet 2008, 72(Pt 1):99–110.
Adeyemo A, Rotimi C: Genetic Variants Associated with Complex Human Diseases Show Wide Variation across Multiple Populations. Public Health Genomics 2010, 13(2):72–79. 10.1159/000218711
Kaput J, Swartz D, Paisley E, Mangian H, Daniel WL, Visek WJ: Diet-Disease Interactions at the Molecular Level: An Experimental Paradigm. J Nutr 1994, 124(8_Suppl):1296S-1305.
Park EI, Paisley EA, Mangian HJ, Swartz DA, Wu M, O'Morchoe PJ, Behr SR, Visek WJ, Kaput J: Lipid Level and Type Alter Stearoyl CoA Desaturase mRNA Abundance Differently in Mice with Distinct Susceptibilities to Diet-Influenced Diseases. J Nutr 1997, 127(4):566–573.
Wise C, Kaput J: A Strategy for Analyzing Gene - Nutrient Interactions in Type 2 Diabetes. J Diabetes Sci Technol 2009, 3(4):710–721.
Tong W, Cao X, Harris S, Sun H, Fang H, Fuscoe J, Harris A, Hong H, Xie Q, Perkins R, et al.: ArrayTrack--supporting toxicogenomic research at the U.S. Food and Drug Administration National Center for Toxicological Research. Environ Health Perspect 2003, 111(15):1819–1826. 10.1289/ehp.6497
Fang H, Harris SC, Su Z, Chen M, Qian F, Shi L, Perkins R, Tong W: ArrayTrack: An FDA and Public Genomic Tool. Methods Mol Biol 2009, 563: 379–398.
Rhead B, Karolchik D, Kuhn RM, Hinrichs AS, Zweig AS, Fujita PA, Diekhans M, Smith KE, Rosenbloom KR, Raney BJ, et al.: The UCSC genome browser database: update 2010. Nucleic Acids Res 2010, 38(Database issue):D613-D619. 10.1093/nar/gkp939
dbSNP: the NCBI Database of Genetic Variation[http://www.ncbi.nlm.nih.gov/SNP]
The Ensembl Project[http://www.ensembl.org/Homo_sapiens/index.html]
The International HapMap Consortium: The International HapMap Project. Nature 2003, 426(6968):789–796. 10.1038/nature02168
The International HapMap Consortium: A haplotype map of the human genome. Nature 2005, 437(7063):1299–1320. 10.1038/nature04226
Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, et al.: A second generation human haplotype map of over 3.1 million SNPs. Nature 2007, 449(7164):851–861. 10.1038/nature06258
Illig T, Gieger C, Zhai G, Romisch-Margl W, Wang-Sattler R, Prehn C, Altmaier E, Kastenmuller G, Kato BS, Mewes HW, et al.: A genome-wide perspective of genetic variation in human metabolism. Nat Genet 2010, 42(2):137–141. 10.1038/ng.507
Peng G, Luo L, Siu H, Zhu Y, Hu P, Hong S, Zhao J, Zhou X, Reveille JD, Jin L, et al.: Gene and pathway-based second-wave analysis of genome-wide association studies. Eur J Hum Genet 2010, 18(1):111–117. 10.1038/ejhg.2009.115
Mouse Genome Database (MGD) at the Mouse Genome Informatics website, The Jackson Laboratory, Bar Harbor, Maine[http://www.informatics.jax.org]
Twigger SN, Shimoyama M, Bromberg S, Kwitek AE, Jacob HJ, RGD Team: The Rat Genome Database, update 2007--Easing the path from disease to data and back again. Nucleic Acids Res 2007, 35(Database issue):D658-D662. 10.1093/nar/gkl988
Tappy L, Le K-A: Metabolic Effects of Fructose and the Worldwide Increase in Obesity. Physiol Rev 2010, 90(1):23–46. 10.1152/physrev.00019.2009
The views presented in this article do not necessarily reflect those of the Food and Drug Administration. We would like to thank the ArrayTrack development team for providing invaluable support and the system platform on which the libraries described in this manuscript were built.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 6, 2010: Proceedings of the Seventh Annual MCBIOS Conference. Bioinformatics: Systems, Biology, Informatics and Computation. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S6.
The authors declare that they have no competing interests.
CW and JK conceived the integration of the QTL library and related data mining strategies. HF and WT conceived the integration of the SNP library and its applications. VV, CW, JX, and JK developed the case studies. BN, HH, and HF suggested functions to be implemented with the libraries and helped with testing. JX developed the databases and software. JX created the first draft manuscript. All authors helped draft the manuscript and approved the final version.
Cite this article
Xu, J., Wise, C., Varma, V. et al. Two new ArrayTrack libraries for personalized biomedical research. BMC Bioinformatics 11 (Suppl 6), S6 (2010). https://doi.org/10.1186/1471-2105-11-S6-S6