iHAP – integrated haplotype analysis pipeline for characterizing the haplotype structure of genes

Background: The advent of genotype data from large-scale efforts that catalog the genetic variants of different populations has given rise to new avenues for multifactorial disease association studies. Recent work shows that genotype data from the International HapMap Project have a high degree of transferability to the wider population. This implies that the design of genotyping studies on local populations may be facilitated through inferences drawn from information contained in HapMap populations.

Results: To facilitate analysis of HapMap data for characterizing the haplotype structure of genes or any chromosomal regions, we have developed an integrated web-based resource, iHAP. In addition to incorporating genotype and haplotype data from the International HapMap Project and gene information from the UCSC Genome Browser Database, iHAP also provides capabilities for inferring haplotype blocks and selecting tag SNPs that are representative of haplotype patterns. User-configurable options include block partitioning algorithms, block definitions, tag SNP definitions, as well as SNPs to be "force included" as tags. Based on the parameters defined at the input stage, iHAP performs on-the-fly analysis and displays the result graphically as a webpage. To facilitate analysis, intermediate and final result files can be downloaded.

Conclusion: The iHAP resource, available at http://ihap.bii.a-star.edu.sg, provides a convenient yet flexible approach for the user community to analyze HapMap data and identify candidate targets for genotyping studies.


Background
The identification of single nucleotide polymorphisms (SNPs) that contribute to complex diseases has made them the preferred choice for diagnostic and therapeutic studies. For instance, the methylenetetrahydrofolate reductase (MTHFR) C677T polymorphism (dbSNP: rs1801133) has been reported to be associated with gastric cancer in the Chinese population [1]. To uncover novel markers that may be associated with a disease, genotyping studies are conducted to determine the genetic variations between diseased and healthy subjects, allowing for further functional characterization that could lead to therapeutic applications. Genotyping only the specific disease-related SNPs may not provide sufficient coverage, yet genotyping all available SNPs in a large sample of individuals is costly. Genotyping only a subset (known as tag SNPs [2]), which may include the disease-associated SNPs, can therefore reduce the cost and effort of association studies with minimal compromise to their power.
In the absence of comprehensive genotype data from local populations, genotyping studies can be designed using data from the International HapMap Project [3]. Recent studies show that, despite differences in the fine details of linkage disequilibrium (LD) patterns between populations [4], tag SNPs selected from one HapMap population can be used to characterize other populations reasonably well [5][6][7]. These findings indicate that HapMap data are currently the most suitable freely available dataset for tag SNP selection and association studies.

iHAP provides users with fully customizable options for on-the-fly analysis, as opposed to the pre-processed results provided by TAMAL. iHAP also highlights SNPs found in coding regions so that potentially significant SNPs may be "force included" as tag SNPs at the user's discretion, a feature unavailable in GVS and TAMAL. Because it is integrated with our local repositories of genotype and gene data from HapMap and the UCSC Genome Browser Database [22] respectively, iHAP relieves users of having to locate and download genotype data, as is the case with Haploview. Furthermore, iHAP generates result pages that graphically depict the haplotype structures, including blocks, haplotype patterns and tag SNPs, alongside the exons and introns of genes found within the chromosomal region. Alternative sets of inferred tag SNPs are also presented with their respective scores. The key differences between iHAP and other similar tools are highlighted in Table 1.

Implementation
The iHAP resource was written in the PHP 5.1.4 scripting language with the GD library of image functions. Using a backend MySQL 4.1.14 relational database, this resource is currently deployed on a Solaris environment with Apache HTTP Server 2.0.58 running on a Sun Fire V240 Server. An overall schematic architecture of iHAP is shown in Figure 1.

Choice of backend haplotype analysis tool
Apart from HapBlock, other tools including HaploBlock and HaploBlockFinder were also considered and evaluated for suitability as iHAP's backend haplotype analysis tool. HapBlock was eventually preferred over these alternatives because it offers a wider selection of haplotype block definitions and tag SNP selection algorithms. HapBlock can also "force" specific SNPs to be selected as tags, which is helpful if one wants to incorporate prior information into the analysis.
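To make the idea of "forcing" tags concrete, the following Python sketch picks the smallest set of SNPs that distinguishes every haplotype pattern in a block, while always retaining any forced SNPs. It is an illustrative brute-force search and is not HapBlock's algorithm; the function name, the toy haplotypes and the SNP labels (snp1 to snp4) are all hypothetical.

```python
from itertools import combinations


def select_tag_snps(haplotypes, snp_names, forced=()):
    """Return the smallest SNP subset, containing all forced SNPs,
    that still tells every distinct haplotype apart (brute force)."""
    forced_idx = [snp_names.index(name) for name in forced]
    optional = [i for i in range(len(snp_names)) if i not in forced_idx]
    distinct = len(set(haplotypes))
    # Try adding 0, 1, 2, ... optional SNPs on top of the forced ones.
    for extra in range(len(optional) + 1):
        for combo in combinations(optional, extra):
            chosen = sorted(forced_idx + list(combo))
            patterns = {tuple(h[i] for i in chosen) for h in haplotypes}
            if len(patterns) == distinct:
                return [snp_names[i] for i in chosen]
    # Not reached: using every SNP always separates distinct haplotypes.


# Toy block: four haplotypes over four SNPs (hypothetical data).
toy_names = ["snp1", "snp2", "snp3", "snp4"]
toy_haplotypes = ["AACC", "AATC", "GATT", "GCTT"]
```

With this toy block, `select_tag_snps(toy_haplotypes, toy_names)` returns `["snp1", "snp2", "snp3"]`, whereas forcing `snp4` yields `["snp2", "snp3", "snp4"]`: the forced SNP carries the same information as `snp1` and therefore displaces it.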

Local data repositories
Essential to the execution of the iHAP resource are two data repositories: genotype and haplotype data from the International HapMap Project, and a local mirror of the UCSC Genome Browser Database. The genome assembly used is based on NCBI build 35. The latter resource is used for determining the chromosomal locations of genes, including the positions of their respective introns and exons.

Figure 1. Overall schematic of iHAP. The iHAP resource may be conceptualized as having three components. The first involves batch-based data preparation, while the second and the third are for real-time analyses. Users submit jobs to iHAP via the web-based interface and each job is then processed in the background. Upon completion, results are returned to the users via the web-based interface.

Job execution and management
Based on the settings supplied by users, iHAP generates the necessary input files in the format required by HapBlock and triggers its execution as a background job.
Depending on the nature of individual haplotype analysis jobs, the execution time could vary from seconds to hours. In addition, the storage requirements for each job also vary according to the availability of genotype data for the selected chromosomal region. Therefore, it is necessary to optimize job scheduling to present users with a logical and coherent interface without compromising server performance.
To address this issue, a job manager module was devised. This module not only initiates each HapBlock execution as a background job, but also monitors the execution process through periodic polls. With this module, the progress of each job can be tracked so users may be updated with the current status of their jobs via the web-based interface. An email alert mechanism is also in place to inform users upon completion of their analysis jobs. To keep storage requirements in check, a script that automatically cleans up redundant files belonging to old jobs is also executed periodically.
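The launch-and-poll pattern described above can be sketched as follows. This is a minimal Python illustration and not iHAP's PHP job manager; the function names are hypothetical, and a short-lived Python subprocess stands in for the HapBlock executable, whose command line is not given here.

```python
import subprocess
import sys
import time


def start_job(command):
    """Launch the analysis command as a background process and return its handle."""
    return subprocess.Popen(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )


def job_status(process):
    """Translate the process state into a status label for display."""
    return_code = process.poll()  # None while the process is still running
    if return_code is None:
        return "running"
    return "completed" if return_code == 0 else "failed"


def wait_for_job(process, poll_interval=0.1, timeout=30.0):
    """Poll periodically until the job finishes or the timeout elapses."""
    deadline = time.monotonic() + timeout
    status = job_status(process)
    while status == "running" and time.monotonic() < deadline:
        time.sleep(poll_interval)
        status = job_status(process)
    return status


# Stand-in for a HapBlock invocation (assumption): a brief Python subprocess.
demo_command = [sys.executable, "-c", "import time; time.sleep(0.2)"]
```

In a web setting, `job_status` would typically be called once per page refresh rather than in a blocking loop; `wait_for_job` is included only to show the periodic poll in one place.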

Result display
As individual jobs are completed, the job manager module extracts information pertaining to haplotype blocks and tag SNPs to an intermediate format. Alternative sets of results are collated along with their respective scores while exon and intron information of genes found within the chromosomal region of interest is obtained from the local mirror of the UCSC Genome Browser Database. Such information is then combined along with additional details such as SNP names and locations in the dynamically generated image that illustrates the haplotype structure graphically. Intermediate files relating to individual jobs are finally archived in ZIP files which can be downloaded conveniently.
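The final archiving step can be illustrated with a short Python sketch using the standard `zipfile` module. The function name, directory layout and file names are assumptions made for illustration; the actual iHAP implementation is written in PHP.

```python
import zipfile
from pathlib import Path


def archive_job_files(job_dir, archive_path):
    """Store every regular file under job_dir in a ZIP archive.

    Returns the archive member names in sorted order.
    """
    job_dir = Path(job_dir)
    archive_path = Path(archive_path)
    stored = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(job_dir.rglob("*")):
            # Skip directories and the archive itself if it lives inside job_dir.
            if path.is_file() and path.resolve() != archive_path.resolve():
                name = path.relative_to(job_dir).as_posix()
                archive.write(path, arcname=name)
                stored.append(name)
    return stored
```

Storing paths relative to the job directory keeps server-side directory names out of the archive that users download.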

Results and discussion
Based on the submitted gene name, the iHAP resource determines the chromosomal region of interest using the UCSC Genome Browser Database. The setup of the analysis job is then defined according to parameters such as HapMap population, allele frequency threshold, block definitions, tag SNP definitions, permutation test settings, as well as SNPs to be "force included" as tags. Snippets of help for each parameter are ergonomically positioned to facilitate the configuration of each job.
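As a rough illustration of how a chromosomal region and an allele frequency threshold narrow down the SNPs entering an analysis, consider the Python sketch below. The `Snp` record, the function name and all values are hypothetical; they do not reflect iHAP's database schema or real HapMap data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    name: str
    position: int              # chromosomal coordinate (toy value)
    minor_allele_freq: float   # between 0.0 and 0.5


def select_snps(snps, region_start, region_end, maf_threshold):
    """Keep SNPs inside [region_start, region_end] whose minor allele
    frequency meets the threshold, ordered by position."""
    kept = [
        snp for snp in snps
        if region_start <= snp.position <= region_end
        and snp.minor_allele_freq >= maf_threshold
    ]
    return sorted(kept, key=lambda snp: snp.position)


# Toy SNP records (hypothetical names, positions and frequencies).
toy_snps = [
    Snp("snpA", 150, 0.30),
    Snp("snpB", 90, 0.40),   # outside the region used below
    Snp("snpC", 180, 0.02),  # below the frequency threshold used below
    Snp("snpD", 120, 0.10),
]
```

Calling `select_snps(toy_snps, 100, 200, 0.05)` keeps `snpD` and `snpA`, in that order.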
The necessary files required by HapBlock are then generated by iHAP. These include the parameter, genotype or haplotype data, SNP names, SNP position lookup, and "forced tag SNP" files. iHAP then invokes the execution of HapBlock as a background job and monitors its progress through periodic polls so as to keep users updated on their job progression. Upon completion of the analysis, results are converted to dynamically generated images for display as a webpage.
The results page first provides a summary of the settings used for the analysis. A graphical representation of the genomic region is displayed with the locations of genes and their respective intronic and exonic regions illustrated as grey boxes, and inferred blocks as yellow rectangles. SNP locations are marked as blue vertical lines with those selected as tags augmented with red triangles. The next section depicts the structure of each block, including the dbSNP identifiers and the haplotype patterns along with their respective frequencies. The scores of the displayed and alternative tag SNP sets for each block are tabulated according to various criteria in the following section.
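Drawing genes, blocks and SNP markers on a common axis requires mapping chromosomal coordinates to image coordinates. A minimal Python sketch of such a linear mapping is given below; the function name, image width and margin are assumptions, and iHAP's own rendering uses PHP with the GD library.

```python
def to_pixel_x(position, region_start, region_end, image_width, margin=10):
    """Linearly map a genomic position to an x pixel within the drawable width."""
    if region_end <= region_start:
        raise ValueError("region_end must be greater than region_start")
    drawable = image_width - 2 * margin
    fraction = (position - region_start) / (region_end - region_start)
    return margin + round(fraction * drawable)
```

With a 520-pixel image and the default margin, the region boundaries map to x = 10 and x = 510, and the midpoint maps to x = 260. A block rectangle would use the mapped start and end of the block, and a SNP marker a single mapped position.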

Conclusion
The iHAP application provides a one-stop resource for inferring haplotype blocks and selecting tag SNPs from HapMap data. Apart from providing a wider selection of algorithms and integrating genotype data with gene information, iHAP also offers greater flexibility by allowing users to "force include" specific SNPs as tags. Additionally, iHAP displays the results graphically for intuitive interpretation and includes the alternative sets of tag SNPs obtained. In essence, iHAP is a practical tool that can be used to analyze HapMap data for the selection of candidate targets in genotyping studies.

Availability and requirements
Project name: iHAP (integrated haplotype analysis pipeline)
Project home page: http://ihap.bii.a-star.edu.sg
Operating system: Solaris (or any other OS that supports Apache, MySQL and PHP)
Programming language: PHP (with GD library)
Other requirements: MySQL, Apache HTTP Server
License: none
Figure 2. Typical workflow of iHAP. The iHAP resource was used to analyze the MTHFR gene with the gastric cancer-related SNP (rs1801133) "force included" as a tag.