SNPLims: a data management system for genome wide association studies
© Orro et al.; licensee BioMed Central Ltd. 2008
Published: 26 March 2008
Recent progress in genotyping technologies allows the generation of high-density genetic maps using hundreds of thousands of genetic markers for each DNA sample. The availability of this large amount of genotypic data facilitates the whole-genome search for the genetic basis of diseases.
We need a suitable information management system to efficiently manage the data flow produced by whole genome genotyping and to make it available for further analyses.
We have developed an information system mainly devoted to the storage and management of SNP genotype data produced by the Illumina platform, loading the raw outputs of genotyping into a relational database.
The relational database can be accessed in order to import any existing data and export user-defined formats compatible with many different genetic analysis programs.
After family-based or case-control association analyses have been computed, the results can be imported into SNPLims. One of the main features is that the user can rapidly identify and annotate statistically relevant polymorphisms from the large volume of data analyzed. Results can be easily visualized either graphically or by creating ASCII comma-separated output files, which can be used as input to further analyses.
The proposed infrastructure makes it possible to manage a relatively large number of genotypes for each sample and an arbitrary number of samples and phenotypes. Moreover, it enables users to control the quality of the data, to perform the most common screening analyses and to identify genes that become “candidates” for the disease under consideration.
Genome-wide search for genes underlying common diseases is enormously facilitated by the use of high-throughput genotyping. Nowadays, a huge number of molecular markers is available for the human genome, and laboratories equipped with recent genotyping technologies can use them to quickly generate hundreds of thousands of genotypes for each DNA sample under study.
In particular, Single Nucleotide Polymorphisms (SNPs) are one of the most common forms of human genetic variation that can be used to discover the sequence variants affecting common diseases by examining them for statistically significant association with measurable phenotypes.
In a typical molecular biology laboratory genotype data are usually managed with the help of specialized software (LIMS - Laboratory Information Management Systems) that implements several useful functions, for example: sample tracking for all steps of the experiments, clustering of fluorescent values, visualization and manual correction of genotypes with ambiguous assignment, generation of genotype reports.
Some genotype management systems have been implemented in recent years with different features and supporting different genotyping technologies (GenoDB, PacLIMS, SNPP, TIMS and others). Even though they are useful tools, none of these systems seems easy to customize or to integrate into pre-existing infrastructures. Since the software provided together with our microarray platform (Illumina) is suitable for managing raw genotype data, we started to develop a system mainly devoted to the management of post-genotyping activities, with particular emphasis on supporting the most common analyses performed in association studies.
In particular, the integration in a single database of genotype, phenotype and demographic data coming from different laboratories facilitates the generation of reports both for visualization and as data input for further analysis.
The main features of the system are: automatic import of genotype data from the Illumina microarray platform; definition and assignment of phenotypes to the subjects, including both qualitative and quantitative traits; control of the quality of the data in order to select markers with high genotyping score; statistical descriptive analysis that provides information about basic features and quality of data; analysis of the genetic population structure to identify stratification; single point analysis of association between genotype and quantitative or qualitative traits; multi locus analysis to combine genotypes of adjacent markers and find associations between haplotypes and phenotypes.
The system has been implemented as a client/server application and deployed on a Debian Linux server in which the main storage element is a PostgreSQL database accessed through a web application written with the Zope Web Application Framework. Users can access the data in two ways: through a command line client within the Linux server and through a web interface. The first method is useful when other command line applications or scripts need to be integrated in pipelines for automatic computation; the second approach is more user oriented and is used especially for visualization and data management.
Access policy is managed with a mixed approach based on system user accounts and Zope object permissions. Objects stored in the database are grouped in logical sessions that represent data acquisitions or computation results so that multiple studies can be managed in logical projects and shared between users. For example a genotyping session can represent the acquisition into the database of a group of DNA genotypes related to the same study project.
System architecture and data model
Similarly, it is possible to define simple phenotype attributes related to individuals and to store them in the database. Phenotypes can be related both to the disease status of subjects (case/control studies) and to a numeric quantitative trait. A phenotype is defined through a unique name, a data type and the data structure (table structure) in which it will be stored. The most common data types (numerical, categorical and strings) supported by the database management system are also supported by the infrastructure. Each phenotype value is stored together with the phenotype ID, the individual ID and the session ID, which represents a logical group of values (usually referring to the same population). In this way it is possible to define multiple phenotypes and associate them with individuals.
Demographic attributes are related to the parental relationship between the subjects and to the race of the subjects. They are managed like the phenotype attributes, but it is not possible to define an acquisition session in this case because they are strictly related to the subject and not estimated.
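The phenotype storage described above can be illustrated with a minimal sketch. It is not the SNPLims schema: the table names, column names and example values are hypothetical, all values are kept in a single table for brevity, and SQLite is used here only as a self-contained stand-in for PostgreSQL.

```python
import sqlite3

# Hypothetical schema: names and layout are assumptions made for illustration.
SCHEMA = """
CREATE TABLE phenotype (
    phenotype_id INTEGER PRIMARY KEY,
    name         TEXT UNIQUE NOT NULL,
    data_type    TEXT NOT NULL
                 CHECK (data_type IN ('numerical', 'categorical', 'string'))
);
CREATE TABLE phenotype_value (
    phenotype_id  INTEGER NOT NULL REFERENCES phenotype(phenotype_id),
    individual_id TEXT NOT NULL,
    session_id    INTEGER NOT NULL,
    value         TEXT,
    PRIMARY KEY (phenotype_id, individual_id, session_id)
);
"""


def build_demo_db():
    """Create an in-memory database with two phenotypes and three values."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO phenotype VALUES (1, 'affection_status', 'categorical')")
    conn.execute("INSERT INTO phenotype VALUES (2, 'age_at_onset', 'numerical')")
    conn.executemany(
        "INSERT INTO phenotype_value VALUES (?, ?, ?, ?)",
        [
            (1, "IND001", 1, "case"),
            (2, "IND001", 1, "24"),
            (1, "IND002", 1, "control"),
        ],
    )
    conn.commit()
    return conn


def phenotypes_for(conn, individual_id, session_id):
    """Return {phenotype name: value} for one individual within one session."""
    rows = conn.execute(
        "SELECT p.name, v.value FROM phenotype_value v "
        "JOIN phenotype p ON p.phenotype_id = v.phenotype_id "
        "WHERE v.individual_id = ? AND v.session_id = ? ORDER BY p.name",
        (individual_id, session_id),
    ).fetchall()
    return dict(rows)
```

Keying each value on the phenotype ID, individual ID and session ID is what lets the same individual carry several phenotypes, and the same phenotype be recorded in several acquisition sessions.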
List of supported tools
plink: Whole Genome Association Analysis Toolset
EIGENSTRAT: Software for detecting and correcting for population stratification in genome-wide association studies
structure: Software package for using multi-locus genotype data to investigate population structure
fbat: Software for implementing family-based association tests
WGAViewer: Software tool for genomic annotation of whole genome association studies
Haploview: Tool for analysis and visualization of LD and haplotype maps
phase: Software for haplotype reconstruction, and recombination rate estimation from population data
PedSplit: Pedigree Management for stratified analysis
Similarly to the genotype and phenotype acquisition, all analysis results can be grouped in sessions that represent a logical unit of analysis (for example the analysis of a group of DNA samples or of a particular cytogenetic region of interest).
Input reports are used to produce input files for analysis tools. They are specific to the particular program; the most common is the ped format, which integrates pedigree data, genotypes and phenotypes in a single file.
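As an illustration of the ped format mentioned above, the following sketch builds one line of a LINKAGE-style PED file as read by PLINK: six leading columns (family ID, individual ID, father ID, mother ID, sex, phenotype) followed by two alleles per marker, with "0 0" for a missing genotype. The function name and the example identifiers are hypothetical and do not come from SNPLims.

```python
def ped_line(family_id, individual_id, father_id, mother_id, sex, phenotype, genotypes):
    """Build one PED line for a single individual.

    sex: 1 = male, 2 = female; phenotype: 1 = unaffected, 2 = affected.
    genotypes: one (allele1, allele2) tuple per marker, or None if not called.
    """
    alleles = []
    for genotype in genotypes:
        if genotype is None:
            # Missing genotypes are written as a pair of zeros.
            alleles.extend(["0", "0"])
        else:
            allele1, allele2 = genotype
            alleles.extend([allele1, allele2])
    fields = [family_id, individual_id, father_id, mother_id, str(sex), str(phenotype)]
    return " ".join(fields + alleles)


# An affected male founder (parents coded as 0) genotyped at three markers.
line = ped_line("FAM1", "IND001", "0", "0", 1, 2, [("A", "G"), None, ("C", "C")])
```

The marker order in each line has to match the order of the accompanying marker information file, which is why the two files are normally generated together.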
CSV reports are useful for importing data into spreadsheet or statistics software (such as Excel or StataSE) or as a general-purpose input format for R or Matlab.
Graphical reports are mainly graphical plots of values along a chromosome region (for example the p value of Hardy-Weinberg test or the association test).
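The Hardy-Weinberg p value plotted in such reports can be obtained, for a single biallelic marker, from the three genotype counts. The sketch below uses a simple chi-square goodness-of-fit test with one degree of freedom; this is an assumption made for illustration, and the tools wrapped by the system may use a different test (for example an exact test).

```python
import math


def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg equilibrium for one biallelic marker.

    Takes the observed genotype counts and returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped samples")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        # Monomorphic marker: the test is not informative.
        return 0.0, 1.0
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

A marker with counts (25, 50, 25) matches the expected proportions exactly and gives a chi-square of 0, whereas counts (50, 0, 50) show a complete lack of heterozygotes and give a chi-square of 100.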
List of supported export formats
Quality control and summary statistics
List of ‘GenCall scores’ of selected samples
Marker information file (Haploview)
Marker information file (PLINK)
Linkage Pedigree format (Haploview/PLINK)
Linkage Pedigree format (StataSE)
Family based association
Input for implementing family-based association tests (fbat)
Input files of genotypes and phenotypes (EIGENSTRAT)
Input for reconstructing haplotype (phase)
Input for genomic annotation (WGAViewer)
Report for Pooling Statistics (R, StataSE)
Web Interface and Client
The web interface has been implemented with the Zope Framework, in particular using the Plone content management product. In this way some functionality, such as the management of users, permissions and document workflows, is inherited directly from the underlying framework.
The software is installed on an Intel(R) Xeon(TM) 2.40 GHz CPU (1 GB RAM) running the Debian operating system (kernel 2.6). In the current installation, creating a report that integrates analysis results with SNP annotation takes negligible time compared with creating a PED input file, which takes about 10 min for 100 samples and 300k SNPs. The case/control association analysis performed on the same dataset with plink takes about 2 min.
In this section we describe the context in which the proposed system has been developed and tested. Genotype data produced with the HumanHap300 (317k SNPs) for 95 case subjects and 91 controls have been used in a genome-wide association study to find regions or genes related to schizophrenia.
The system has been used both for managing data and for supporting statistical analysis. In particular, descriptive statistics have been used to summarize and describe the main statistical properties of the data, whereas inferential statistics, concerning the inference of new insights about the genetic association, have been used for the screening. The analysis pipeline includes the quality control and the summary statistics of raw data as descriptive statistics, and the analysis of population stratification and the association test between genotype and phenotype as inferential statistics. Reports of computed statistical parameters are integrated with the SNP annotation of the HumanHap300 in order to compare regions of high significance with the biological properties of those regions.
Descriptive statistics are used to describe the basic features of the data and to perform the quality control of raw data produced by the genotyping platform.
The system supports the evaluation of the call rate parameter that counts the number of called SNPs per sample and the GenCall score calculated by the BeadStudio software that indicates the quality of the SNP clustering. They are useful measures to evaluate the global quality of the genotyping.
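The call rate described above can be sketched as the fraction of SNPs with a called genotype in each sample. In this minimal example an uncalled genotype is represented by `None` and the 0.95 threshold is only a hypothetical default; neither is taken from SNPLims or BeadStudio.

```python
def call_rate(genotypes):
    """Fraction of SNPs with a called genotype for one sample.

    genotypes: one entry per SNP, with None marking a SNP that was not called.
    """
    if not genotypes:
        raise ValueError("sample has no genotypes")
    called = sum(1 for genotype in genotypes if genotype is not None)
    return called / len(genotypes)


def low_call_rate_samples(samples, threshold=0.95):
    """Return the sorted IDs of samples whose call rate is below the threshold.

    samples: dict mapping a sample ID to its list of genotypes.
    """
    return sorted(
        sample_id
        for sample_id, genotypes in samples.items()
        if call_rate(genotypes) < threshold
    )


samples = {
    "IND001": ["AA", "AG", "GG", "AG"],
    "IND002": ["AA", None, None, "AG"],
}
flagged = low_call_rate_samples(samples)
```

Samples flagged in this way would typically be reviewed or excluded before the association analysis.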
In order to identify a set of markers with a high degree of statistical significance for the disease, the following association tests have been performed: the basic association test for a disease trait based on comparing allele frequencies between cases and controls, the Cochran-Armitage trend test, tests under different genetic models (dominant, recessive and general), tests for stratified samples and a test for a quantitative phenotype.
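The basic allelic test listed first can be sketched as a chi-square test on the 2x2 table of allele counts in cases and controls. The code below is an illustrative reimplementation under that assumption, not the code used by plink or by SNPLims, and the function name is hypothetical.

```python
import math


def allelic_association(case_counts, control_counts):
    """Allelic chi-square test (1 degree of freedom) for one biallelic SNP.

    Each argument is a tuple of genotype counts (n_AA, n_AB, n_BB).
    Returns (chi2, p_value).
    """
    def allele_counts(genotype_counts):
        n_aa, n_ab, n_bb = genotype_counts
        # Each homozygote carries two copies of an allele, each heterozygote one.
        return 2 * n_aa + n_ab, 2 * n_bb + n_ab

    a, b = allele_counts(case_counts)     # alleles A and B in cases
    c, d = allele_counts(control_counts)  # alleles A and B in controls
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # An empty group or a monomorphic marker: the test is not defined.
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

With identical genotype counts in cases and controls the statistic is 0; with 30 AA cases against 30 BB controls the allele counts are completely separated and the statistic equals the total number of alleles, 120.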
Association and annotation
Discussion and conclusions
In this paper, a system for the management of genotype and phenotype data has been proposed. The main focus of the infrastructure is the support of genome-wide association studies by wrapping the most common tools used in this field.
Availability and requirements
Project name: SNPLims
Project homepage: http://www.itb.cnr.it/snplims
Operating system(s): tested for Debian.
Programming language: Python 2.4, Zope 2.9 and Plone 2.5
Database management system: PostgreSQL 8.1
This work has been supported by the Italian FIRB-MIUR project “LITBIO - Italian Laboratory for Bioinformatics Technologies”, by the European Specific Support Action “BioinfoGRID - Bioinformatics Grid Application for life science” and “EGEE - Enabling Grids for E-sciencE” project, and by the CNR-Bioinformatics and ITALBIONET projects.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 2, 2008: Italian Society of Bioinformatics (BITS): Annual Meeting 2007. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S2
- Li JL, Deng H, Lai DB, Xu F, Chen J, Gao G, Recker RR, Deng HW: Toward high-throughput genotyping: dynamic and automatic software for manipulating large-scale genotype data using fluorescently labeled dinucleotide markers. Genome Res 2001, 11: 1304–1314. 10.1101/gr.159701
- Donofrio N, Rajagopalan R, Brown D, Diener S, Windham D, Nolin S, Floyd A, Mitchell T, Galadima N, Tucker S, Orbach MJ, Patel G, Farman M, Pampanwar V, Soderlund C, Lee YH, Dean RA: PACLIMS: A component LIM system for high throughput functional genomic analysis. BMC Bioinformatics 2005, 6: 94. 10.1186/1471-2105-6-94
- Zhao LJ, Li MX, Guo YF, Xu FH, Li JL, Deng HW: SNPP: automating large-scale SNP genotype data management. Bioinformatics 2005, 21: 266–268. 10.1093/bioinformatics/bth486
- Monnier S, Cox DG, Albion T, Canzian F: T.I.M.S: Taqman Information Management System, tools to organize data flow in a genotyping laboratory. BMC Bioinformatics 2005, 6: 246. 10.1186/1471-2105-6-246
- Hampe J, Wollstein A, Lu T, Frevel HJ, Will M, Manaster C, Schreiber S: An integrated system for high throughput TaqMan™ based SNP genotyping. Bioinformatics 2001, 17: 654–655. 10.1093/bioinformatics/17.7.654
- Wang L, Liu S, Niu T, Xu X: SNPHunter: a bioinformatic software for single nucleotide polymorphism data acquisition and management. BMC Bioinformatics 2005, 6: 60. 10.1186/1471-2105-6-60
- plink – “Whole genome association analysis toolset”[http://pngu.mgh.harvard.edu/~purcell/plink]
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 2006, 38: 904–909. 10.1038/ng1847
- Pritchard JK, Stephens M, Rosenberg NA, Donnelly P: Association mapping in structured populations. American Journal of Human Genetics 2000, 67: 170–181. 10.1086/302959
- Horvath S, Xu X, Laird N: The family based association test method: strategies for studying general genotype-phenotype associations. Euro J Hum Gen 2001, 9: 301–306. 10.1038/sj.ejhg.5200625
- Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 2005, 21: 263–265. 10.1093/bioinformatics/bth457
- Stephens M, Donnelly P: A comparison of Bayesian methods for haplotype reconstruction from population genotype data. American Journal of Human Genetics 2003, 73: 1162–1169. 10.1086/379378
- Lanktree MB, VanderBeek L, Macciardi FM, Kennedy JL: PedSplit: pedigree management for stratified analysis. Bioinformatics 2004, 20: 2315–2316. 10.1093/bioinformatics/bth224
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.