TelomereHunter: telomere content estimation and characterization from whole genome sequencing data

Summary: Telomere shortening plays an important role in cellular aging and tumor suppression. The availability of large next-generation sequencing cohorts of matched tumor and control samples enables a computational high-throughput analysis of changes in telomere content and composition in cancer. Here we describe a novel software tool specifically tailored for the processing of large data collections. Availability and Implementation: TelomereHunter is implemented as a python package. It is freely available online at: http://www.dkfz.de/en/applied-bioinformatics/telomerehunter/telomerehunter.html.


Introduction
Telomeres are nucleoprotein complexes at the ends of eukaryotic chromosomes. In humans, telomeric DNA consists mainly of non-coding t-type (TTAGGG) repeats. However, c-(TCAGGG), g-(TGAGGG) and j-type (TTGGGG) variant repeats as well as other variations of the hexameric sequence exist (Coleman, et al., 1999;Lee, et al., 2014;Varley, et al., 2002). Telomeres shorten with each cell division and once a critical telomere length is reached, a DNA damage response is triggered, resulting in cellular senescence or apoptosis.
To circumvent the limited number of possible cell divisions, tumors employ activation of telomerase (Kim, et al., 1994) or "alternative lengthening of telomeres" (ALT) (Bryan, et al., 1997) as telomere maintenance mechanisms. Telomerase is an enzyme that adds t-type repeats to the chromosome ends.
In contrast, ALT is based on recombination of telomeric regions and results in several characteristics, including telomeres of heterogeneous length and sequence composition.
Telomere maintenance mechanisms are crucial for tumorigenesis, making them valuable drug targets for cancer therapy (Shay, 2016). However, to precisely identify and interfere with these mechanisms in various tumor types, more insight into the different telomere structures is needed. In the last decades, several experimental methods have been established to assess telomere length and ALT status, e.g.
With the advance of massively parallel sequencing, an alternative method for measuring telomere content has emerged. Several studies have already shown that the number of short reads containing telomeric repeats can be used to estimate telomere content in whole genome sequencing (WGS) data, and that the results are comparable to those of established experimental methods Nersisyan and Arakelyan, 2015;Parker, et al., 2012).Here, we present TelomereHunter, a new computational tool for determining telomere content that is specifically designed for matched tumor and control pairs. In contrast to existing tools, TelomereHunter takes alignment information into account and reports the abundance of variant repeats in telomeric sequences.

Implementation
TelomereHunter is written as a python package and takes WGS BAM files of single samples or matched tumor and control pairs as input. Several parameters can be set by the user with the default settings and workflow being described in the following.
In the first step of TelomereHunter, telomere reads containing at least n non-consecutive repeats (t-, c-, g-or j-type) are extracted (Fig. 1A). n is calculated for each read depending on the read length with the following formula: n = floor(read length · 0.06). The criterion of searching for six non-consecutive repeats in 100 bp long reads has been proposed previously (Lee, et al., 2014) and was also found suitable for the data presented in the present study.
In the second step, the reads extracted are categorized depending on the alignment coordinates. If reads are properly paired, the mapping position of the mate is considered for the sorting. In short, reads mapping to intrachromosomal regions, i.e. all chromosome bands except the first or last band, are defined as intrachromosomal reads. The subtelomeric fraction comprises telomeric reads mapped to the first or last band of a chromosome. Telomeric reads from paired-end data are classified as junctionspanning if one mate maps to a first or last chromosome band and the other mate is unmapped. All remaining unmapped reads are categorized as intratelomeric.
The telomere content is calculated as the fraction of intratelomeric reads per million reads. To account for GC biases in sequencing data, TelomereHunter determines a GC-corrected telomere content: Instead of normalizing by the total number of reads in the sample, the intratelomeric reads are divided by the number of reads with a GC content between 48-52%, which is similar to that of the canonical ttype repeat and has been suggested for the normalization of telomeric reads .
The output of TelomereHunter includes several diagrams visualizing the results (see Supplementary   Fig. 1 for examples).

Validation
For validation, TelomereHunter was compared to established experimental methods for telomere content measurement (see Supplementary Methods). The telomere content of nine pediatric brain tumor samples (six medulloblastoma and three glioblastoma samples) was determined computationally and was measured by telomere qPCR and TRF analysis. To demonstrate that TelomereHunter correctly determines the telomere content of both ALT-positive and ALT-negative samples, we included samples with different ALT status in the validation samples (as determined by TRF and C-circle assay, Supplementary Fig. 2).
The experimentally determined telomere content estimation was highly correlated with the TelomereHunter results for the individual tumor and control samples (r = 0.90 for qPCR and r = 0.65 for TRF). The correlation was further improved by GC correction of the computationally determined telomere content (r = 0.94 and 0.72; Supplementary Fig. 3). All methods consistently predicted telomere content gain or loss in the tumor sample compared to the control (Fig. 1B). The only exception was MB175, which can be explained by different amounts of DNA in the experimental setup (see Supplementary Methods).

Benchmark
Performance comparison of TelomereHunter with Motif Counter and TelSeq (Supplementary Fig. 4 and Table 1) revealed that all computational results correlated better with qPCR than with the TRF analysis.
TelomereHunter outperformed the other tools in the correlation to qPCR. The correlation of tumor/control log2 ratios to the TRF analysis was similar for all tools and the direct correlation to the TRF analysis was best using TelSeq.

Discussion and outlook
TelomereHunter reliably determines telomere content from WGS data. In contrast to existing tools, it takes mapping information into account and is able to search for a combination of the most common telomere repeat types. Moreover, TelomereHunter visualizes the results and, by default, gives a summary of telomere composition. We anticipate that the combination of telomere content determination and telomere repeat variant analysis from WGS data provided by TelomereHunter will prove to be valuable for identifying and characterizing telomere maintenance mechanisms in primary tumor samples.

Acknowledgements
We thank the DKFZ Genomics and Proteomics Core Facility for provision of sequencing services. The authors would also like to thank Elke Pfaff (DKFZ) for her help in DNA sample selection.

Funding
The Conflict of Interest: none declared.

Whole genome sequencing
The WGS datasets analyzed in this study were obtained from the PedBrain ICGC project. Matching tumor and control samples were collected according to ICGC guidelines. The DNA libraries were prepared using Illumina paired-end sample preparation protocols and sequencing was performed on Genome Analyzer IIx and Illumina HiSeq 2000 instruments as previously described (Jones, et al., 2012;Sturm, et al., 2012). Reads were aligned to the GRCh37 reference from 1000 Genomes project using bwa mem version 0.7.8 with the option -T 0.

Computational telomere content estimation using Motif Counter and TelSeq
In addition to TelomereHunter analysis, telomere content was determined using Motif Counter (http://sourceforge.net/projects/motifcounter/)  with the parameters -s -u -q 0 and TelSeq  with default settings.
Cycling conditions (for both telomere and 36B4 products) were 10 min at 95°C, followed by 40 cycles of 95°C for 15 s and 60°C for 1 min. A standard curve was used to determine relative quantities of telomere repeats (T) to those of the single copy gene (S, 36B4 gene, also known as RPLP0). The T/S ratio was calculated for each sample (tumor and control) separately. The log2 ratio of telomere content was determined by dividing the T/S ratio of the tumor sample by the T/S ratio of the control sample.
The calculated log2 ratio represents the increase or decrease in telomere content in tumor versus control samples.

C-circle assay
The C-circle assay was performed according the protocol of Henson et al. (Henson, et al., 2009).
Briefly, 30 ng DNA was combined with 10 μl 2X Φ29 Buffer, 7.5 U Φ29 DNA polymerase (both NEB), 0.2 mg/ml BSA, 0.1% (v/v) Tween 20, 1 mM each dATP, dGTP and dTTP and incubated at 30°C for 8 h followed by 20 min at 65°C. Reactions without addition of polymerase (-pol) were included as controls. After addition of 40 μl 2X SSC, the amplified DNA was dot-blotted onto a 2X-SSC-soaked Roti-Nylon plus membrane (Carl Roth). The membrane was baked for 20 min at 120°C and hybridized and developed using the TeloTAGGG Telomere Length Assay Kit (Roche).
Chemiluminescent signals were detected using a ChemiDoc MP imaging system (Bio-Rad).

Terminal restriction fragment analysis
For TRF analysis, 4.5 μg genomic DNA of tumor and blood (control)  where only the total amount of DNA loaded is measured. Thus, the TRF analysis is more prone to errors that arise from differences in the amount of DNA between samples. Telomere content tumor/control log2 ratios were estimated for nine sample pairs from pediatric brain tumor patients using TRF, qPCR and the computational tools TelomereHunter, Motif Counter and TelSeq.