BISMA - Fast and accurate bisulfite sequencing data analysis of individual clones from unique and repetitive sequences
© Rohde et al. 2010
Received: 28 October 2009
Accepted: 6 May 2010
Published: 6 May 2010
Skip to main content
© Rohde et al. 2010
Received: 28 October 2009
Accepted: 6 May 2010
Published: 6 May 2010
Bisulfite sequencing is a popular method to analyze DNA methylation patterns at high resolution. A region of interest is targeted by PCR and about 20-50 subcloned DNA molecules are usually analyzed, to determine the methylation status at single CpG sites and molecule resolution.
The BISMA (Bisulfite Sequencing DNA Methylation Analysis) software for analysis of primary bisulfite sequencing data implements sequencing data extraction and enhanced data processing, quality controls, analysis and presentation of the methylation state. It uses an improved strategy for detection of clonal molecules and accurate CpG site detection and it supports for the first time analysis of repetitive sequences.
BISMA works highly automated but still provides the user full control over all steps of the analysis. The BISMA software is freely available as an online tool for academic purposes for the analysis of bisulfite sequencing data from both unique and repetitive sequences http://biochem.jacobs-university.de/BDPC/BISMA/.
Epigenetic modification of histones and DNA adds heritable information to the genome [1, 2]. In mammals, DNA is methylated at the C5 position of cytosine residues mainly in CpG dinucleotides in a tissue specific pattern [3, 4]. DNA methylation is an essential process and abnormal methylation is associated with human diseases such as cancer [5, 6]. DNA methylation is intensively studied as illustrated by the finding that a PubMed search for 'DNA methylation' retrieved more that 26,000 entries. Bisulfite genomic sequencing is the standard technique for the analysis of DNA methylation at high resolution. In this approach, the genomic DNA is treated with sodium bisulfite, which converts all unmethylated cytosines to uracil, whereas the methylated cytosines remain unconverted. The region of interest is amplified by PCR with primers specific for converted DNA and the PCR product is sequenced [7, 8]. Detecting a cytosine in the sequence indicates that the respective position was methylated in the original DNA whereas a thymine indicates that the respective cytosine was unmethylated. When combined with subcloning and sequencing of individual clones, the DNA methylation pattern can be determined at single molecule and nucleotide resolution for continuous tracks of up to 500 base pairs (bps) [9, 10].
The analysis of the primary bisulfite sequencing data, which should comprise about 20-50 subcloned DNA molecules for statistical analysis (Additional file 1: Suppl. Text S1), requires the following tasks: 1) the experimental sequences need to be aligned to the in silico converted genomic reference. 2) The sequence identity and the conversion rate of each experimental sequence need to be measured, and sequences which do not comply with the quality criteria must be removed. 3) Clonal sequences, which were amplified from the same template molecule in the PCR, need to be detected and removed. 4) The CpG sites need to be identified in the reference sequence and the aligned experimental sequences. 5) The methylation state of the CpG sites in the experimental sequences needs to be determined and the data summarized and presented.
Convenience of use: BiQ Analyzer needs installation and has slow performance.
Sequence alignment: The BiQ Analyzer sometimes fails to construct the sequence alignment between the experimental and reference sequences without manual user intervention.
Filtering for clonal sequences: QUMA has not implemented a filtering of clonal sequences. The BiQ Analyzer for certain datasets erroneously suggests removing too many sequences as clonal. In addition, the filtering routine of the BiQ Analyzer software erroneously assumes two molecules as non-clonal, which differ by the existence of an unresolved nucleotide annotation (N-site) only.
Identification of CpG sites and annotation of methylation states: The BiQ Analyzer often fails detecting CpG sites which are located downstream of a T-stretch. QUMA does not check for the presence of the corresponding CpG guanine in the experimental sequences and annotates methylation states at sites with alignment errors, sequencing errors and mutations, like TA, TT, TN, or CN.
The genetic diversity among repetitive sequences requires a different strategy for analysis of the bisulfite sequencing data which is not provided by any software so far.
Therefore, we developed a new software called Bisulfite Sequencing DNA Methylation Analysis (BISMA). BISMA provides a highly automated, simple and fast workflow for the analysis of bisulfite sequencing experiments. It can be used for the analysis of subcloned PCR products from unique and repetitive genomic regions. All uploaded data will be automatically processed, filtered for sequence identity, conversion rate and clonal sequences and analyzed. BISMA has implemented an improved strategy for detection of clonal sequences, which preserves identical methylation patterns and ignores N-sites during pattern analysis. Furthermore, BISMA has implemented a new algorithm for detection of CpG sites in the bisulfite sequencing data.
We embedded the BISMA software in the existing DNA methylation data analysis platform BDPC, which allows for further downstream data processing, compilation, web presentation and statistical analysis of results  as well as clustering and graphical data presentation . The BISMA software is a freely available online tool for academic purposes http://biochem.jacobs-university.de/BDPC/BISMA/.
Biological samples were obtained and treated as described before . Published primary sequencing data of the amplicons 51_new5, 237, 264, 327_III and 335 were used for comparison of the programs . Treatment of cells with 5-azacytidine was performed as described before . Bisulfite treatment, PCR-product purification, subcloning and sequencing was done as described .
Briefly, about 200-300 ng genomic DNA were digested with an appropriate restriction enzyme at 37°C overnight to facilitate the subsequent denaturation process, then the DNA was converted with sodium bisulfite in a thermocycler for 15 min at 99°C, 30 min at 50°C, 5 min at 99°C, 1.5 h at 50°C, 5 min at 99°C, 1.5 h at 50°C. The converted DNA was used as a template for PCR. The PCR products were purified by ChargeSwitch PCR Clean-Up Kit (Invitrogen) and subcloned using the StrataClone kit (Stratagene). A high number of clones for each amplicon were sequenced.
The Xist promoter was amplified using the following primers: (FP: GGT AGG GGA ATT AAA AAT GTT TTTT; RP: TAA CCA CTC CTC TTC TAA TCT CTCC) from tailtip DNA of a female mouse. The Alu regions were targeted using the published primer set  without the 5'-overhangs (FP: TTT TTA TTA AAA ATA TAA AAA TTA GT; RP: CCA AAC TAA AAT ACA ATAA). The analysis was performed using the matching region of the Alu-Sx subfamily consensus sequence (GeneBank: U14574). We downloaded, installed and used the BiQ Analyzer v2.00 software on a Core Duo L2400 computer with a 1.66 gigahertz processor and 2 gigabyte installed random-access memory on a Microsoft Windows XP professional operation system. We installed JAVA Runtime environment version 6 update 13 (Build 1.6.0_13-b03). We used the remote-ClustalW option such that all multiple sequence alignments were performed on a server at the Max Planck Institute for Informatics. We used 95% and 90% as lower threshold for the conversion rate and the sequence identity in all three programs for comparison.
BISMA is a PHP coding language based web application which manages the sequencing data information using a MySQL database for temporary storage. Currently, BISMA is running on an openSUSE 10.2 Linux web server. If sequences in ABI file format are uploaded to BISMA the DNA sequence is automatically extracted from the ABI files as they were edited by the user. For this task BISMA uses the Perl module Bio::Trace::ABIF . All pair wise and multiple sequence alignment steps in the BISMA software are performed using the ClustalW software [19, 20]. After the alignment, BISMA compares each uploaded sequence with the reference or consensus sequence to calculate the sequence identity, the conversion rate and the occurrence of gaps. Sequences which do not pass the user defined threshold will be excluded from further analysis. BISMA uses the following default quality filtering thresholds for the analysis of unique genomic sequences: sequence identity 90%; conversion rate 95%; gaps 20%. For the analysis of repetitive sequences BISMA uses the following default thresholds: sequence identity 70%; conversion rate: sequences with 3 and more unconverted cytosines in 100 bps are excluded; gaps: 20%.
The BISMA software is designed for upload of primary bisulfite sequencing datasets derived after subcloning of PCR products from unique and repetitive sequences. BISMA has implemented automatically sequence processing, alignment, filtering for sequence quality, filtering of clonal sequences, data analysis and presentation.
However, we observed that a ClustalW multiple sequence alignment sometimes will provide a better alignment than the combination of pair-wise alignments. The massive occurrence of thymine after bisulfite conversion and PCR leads to the presence of long poly-T stretches, which often cause PCR and sequencing artifacts in form of thymine insertions or deletions. Under these circumstances the combined pair wise alignments often fail at the last CpG position. Therefore, with default settings BISMA automatically will apply a ClustalW multiple sequence alignment if a CpG site is close to the end of the PCR product and associated with a T-stretch. Beside this, the user can manually choose between pair wise and ClustalW multiple sequence alignment.
BISMA automatically determines for each sequence the degree of identity to the reference sequence, the conversion rate, the appearance of insertions and deletions and the number of unresolved nucleotides (N-sites) at reference sequence cytosine positions, which are in general an indication of low sequence quality (Figure 1B). These parameters are then used for quality control filtering and sequences which do not pass user defined thresholds will be excluded. Recently, non-CpG methylation has been reported for mammalian cell lines as well which seems to be correlated with the degree of differentiation [21, 22]. Although BISMA has a CpG centered approach of data analysis, it can be used for initial visualization of sequences with non-CpG methylation. If this is intended, the threshold of the conversion rate needs to be lowered manually, because otherwise clones with high non-CpG methylation will be excluded because of low conversion. As BISMA does not analyze non-CpG methylation, it might be considered to use the plant methylation analysis softwares Kismeth  or CyMATE  for this task. However, discrimination of non-CpG methylation from incomplete conversion in mammalian cell types requires careful controls and ideally additional experimental evidence.
This strategy allows correct and reliable evaluation of the methylation state and resolves cases of deletions or insertions of thymine by PCR artifacts. As introduced above, such artifacts frequently happen during amplification of T-stretches in the bisulfite converted DNA. BISMA displays the multiple sequence alignment including the annotation of the methylation pattern, unconverted cytosines and possible clonal molecules for visual inspection (Figure 1C).
The BISMA output (Figure 1E) consists of the following elements: 1) an annotated alignment, 2) a graphical display of the methylation pattern in context of the CpG site distribution in the reference sequence, 3) a graphical display of the condensed methylation pattern, 4) a graphical display of the average methylation at each CpG site, 5) statistics about the overall methylation percentage and the percentage of CpG sites analyzed and 6) the overall methylation percentage of each clone. BISMA will automatically sort the sequences in the alignment and all output files and graphical presentations by their methylation level, if selected by the user. BISMA has implemented a threshold for the average methylation report of individual CpG sites, such that only statistics are calculated and reported if at least 5 experimental sequences could be analyzed at the respective site. This avoids over-interpretation and propagation of weak data points. All statistics and the alignment are combined in one HTML file for data storage, presentation and further manual or automatic downstream analysis. For example, the output files of many analyses can be combined and submitted to the BDPC compilation software on the same server.
The methylation state of repetitive genomic sequences is frequently investigated using bisulfite genomic sequencing as a measurement of global methylation difference or to determine the repeat specific methylation (Additional file 1: Suppl. Text S2). Different approaches have to be used for the investigation of the methylation state of unique and repetitive genomic sequences, because the genomic reference for a repeat sequence is not defined as these elements are present in many copies with similar but not identical sequences. For example, a single primer pair designed to determine the methylation of Alu sequences will amplify PCR-products from about 15,000 template regions in bisulfite converted human DNA . As a consequence, the genomic origin of each amplified molecule is unknown, such that it cannot be directly compared with a defined reference sequence. Therefore, the standard approach for the analysis of unique sequences, where each CpG position is compared against its genomic reference sequence and the methylation state of each CpG site can be determined, is not an option in the analysis of sequencing data from repetitive genomic sequences. Instead, with repeats a consensus sequence has to be used for sequence alignment and analysis. However, this does not allow distinguishing an unmethylated CpG site from one that is mutated to TG. Furthermore, the experimental sequences may contain additional CpG sites which are not present in the consensus.
To illustrate the functionality of the new BISMA software for analysis of repetitive sequences, we amplified a short region of the Alu repetitive element from bisulfite converted DNA from various samples and used the matching part from the Alu-Sx subfamily consensus sequence  for data analysis. As shown in Figure 4, we observed high methylation in the Alu sequences from peripheral blood leukocytes, cultured fibroblast cells, the immortalized cell line HEK293 and the human hepatocellular liver carcinoma cell line HepG2, which is in agreement with literature data on the methylation state of the Alu sequences [17, 25, 26]. In spite of this, several molecules were found without methylation or low methylation especially in HepG2 and even more pronounced after treatment of HEK293 cells with 5-azacytidine. Such observation is consistent with reported loss of global methylation during cancer progression , which for example was documented especially for the Line1 repetitive element in Hep-G2 .
As an alternative analysis strategy, we implemented the estimation of the overall methylation as introduced by Yang et al. 2004 . This method focuses on the consensus CpG positions. It estimates the rate of CG to TG mutations in the sequenced strand by determining the number of CG to CA mutations, which corresponds to CG to TG exchanges in the opposite strand. Assuming that the mutation rate is similar in both DNA strands, the number of unmethylated CpG sites can be calculated by correcting the number of TG sites by the fraction of mutated sites. Then, the overall DNA methylation percentage is calculated from the number of unmethylated and methylated CpG positions. Using our test dataset from leukocytes, we obtained an overall methylation of Alu repeats of 93.6%, which is higher than reported previously . Such difference might be in the range of inter-individual variation or result from the absence of the 5'-overhangs in the primers used here.
We used datasets from published results  for the human genes S100B, NCAM2, COL6A2, ZNF295 and H2BFS to compare the performance of the new BISMA software for unique sequences with the BiQ Analyzer and QUMA showing examples of improved CpG site detection and methylation state annotation (Additional file 1: Suppl. Text S3-S5) and filtering for clonal molecules (Additional file 1: Suppl. Text S6-S8).
All three programs support uploading of sequencing files in text format. However, BISMA is the only software which supports direct uploading of ABI sequencing data. All three tools identify and remove improper sequencing results based on sequence identity and conversion rate. In addition, QUMA and BISMA remove sequences with insertions or deletions which do not pass the user defined threshold. Only BISMA directly filters clones for N-sites at cytosine positions as well. All three programs provide an annotated alignment of all sequence of the final dataset.
The accurate determination of the methylation state of each aligned CpG position is one of the most important tasks during analysis of bisulfite sequencing data. The new BISMA software has implemented an improved algorithm for CpG site detection, which detected all positions accurately. The BiQ Analyzer fails if the methylation site under investigation is not directly aligned with the respective cytosine position, which frequently happens in aligned bisulfite sequencing results at T-stretches (Additional file 1: Suppl. Text S3). The QUMA software erroneously reports a methylation result of mutated [TA] and unknown [TN or CN] positions (Additional file 1: Suppl. Text S3-5).
The BiQ Analyzer and BISMA programs offer filtering of clonal molecules, while the QUMA software does not check for clonal sequences and it includes obvious examples of clonal PCR for methylation analysis (Additional file 1: Suppl. Text S6). However, BISMA has implemented an improved algorithm that keeps clones with identical cytosine patterns, which do not have conversion artifacts. More stringent filtering as implemented in the BiQ Analyzer inappropriately removes valid methylation patterns (Suppl. Text S7). BISMA ignores N-sites during filtering of clonal sequences, which is not implemented in the BiQ Analyzer (Additional file 1: Suppl. Text S8). Therefore, the filtering routine of the BiQ Analyzer leads to preferential analysis of sequences with conversion artifacts and bad sequencing data quality. This is critical when analyzing highly methylated or unmethylated regions.
To illustrate the functionality of the improved filtering for clonal sequences, we isolated DNA from mouse tail tip, converted it with sodium bisulfite and amplified a part of the Xist promoter from a female animal (Figure 3C). In females, the Xist promoter is methylated on the active X-chromosome, while unmethylated on the other chromosome which is subject to X-chromosome-inactivation. Therefore, the expected methylation pattern consists of half methylated and half unmethylated clones [28–30]. As shown in Figure 3C, the strict filtering of identical cytosine patterns led to the removal of many clones and artificially created a heterogeneous methylation pattern, which might mislead the interpretation. In contrast, the data analyzed with the modified settings nicely reflect the true methylation pattern.
The BISMA software supports the analysis of subcloned PCR products from bisulfite sequencing experiments. It has implemented an improved strategy for detection of clonal sequences and a new algorithm for detection of CpG sites. BISMA is the first software which supports analysis of bisulfite sequencing from repetitive genomic regions. The highly automated and simple workflow of the BISMA software minimizes the required user interactions to successfully analyze the dataset within a short time. BISMA provides full user control and allows adjusting the thresholds of many parameters and selection of the final dataset on the basis of an annotated sequence alignment. The BISMA software can be freely used as an online tool for academic purposes at http://biochem.jacobs-university.de/BDPC/BISMA/.
Project name: BISMA
Project home page: http://biochem.jacobs-university.de/BDPC/BISMA/
Operating system(s): Platform independent online software
License: Source code available upon request for academic use
Any restrictions to use by non-academics: Licence needed
Bisulfite sequencing Data Presentation and Compilation
Bisulfite Sequencing DNA Methylation Analysis software
University of California Santa Cruz genome browser
unresolved nucleotide sites in sequencing reads
Technical assistance of Ms. Sandra Becker and the web server capacities of the Jacobs University including its maintenance by Dr. Achim Gelessus are gratefully acknowledged. We acknowledge Dr. Tomasz P. Jurkowski for discussions about implementation of sequencing data extraction and alignment tools in web applications. This work was supported by the NGFN2 program of the German Minister of Research and Education (BMBF).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.