Filtering de novo indels in parent-offspring trios

Background: Identification of de novo indels from whole genome or exome sequencing data of parent-offspring trios is a challenging task in human disease studies and clinical practice. Existing computational approaches usually yield high false positive rates. Results: In this study, we developed a gradient boosting approach for filtering de novo indels obtained by any computational approach. When applied to real genome sequencing data, our approach significantly reduced the false positive rate of de novo indels without a significant compromise in sensitivity. Conclusions: The software DNMFilter_Indel is written in a combination of Java and R and is freely available at https://github.com/yongzhuang/DNMFilter_Indel.

indels are usually mixed with a large number of false ones. Therefore, effective de novo indel filtering methods are urgently needed.
Here, we present DNMFilter_Indel, a de novo indel filtering method that extends our previous work DNMFilter [12]. First, we integrate local de novo assembly to refine the alignment. Second, we extend the classification model with two new sequence features strongly related to de novo indels. Third, to expand the positive set, we simulate synthetic de novo indels, which overcomes the limited number of cross-validated de novo indels. Finally, we evaluate DNMFilter_Indel's performance using real sequencing data of a whole genome trio from the 1000 Genomes Project.

Implementation
The DNMFilter_Indel pipeline comprises two main modules, (a) Training and (b) Prediction, as shown in Fig. 1.
In the Training module, DNMFilter_Indel first takes the trios' alignment files as input and employs commonly used de novo indel detection methods, such as DeNovoGear [8], PhaseByTransmission [9] and TrioDeNovo [10], to detect de novo indels. Second, it detects inherited indels using state-of-the-art indel detection methods (e.g. GATK HaplotypeCaller [5]). Third, it uses the synthetic and cross-validated de novo indels as positive examples, and randomly sampled false de novo indels together with inherited indels as negative examples. Finally, it performs local de novo assembly to refine the alignment for every positive and negative example, and then extracts sequence features from the refined alignment data to construct the training set.
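The labelling of positive and negative examples described above can be illustrated with a minimal Python sketch. It is not the DNMFilter_Indel implementation (which is written in Java and R): the function name `build_training_set`, the `(chromosome, position)` site encoding, the `n_false` and `seed` parameters, and the removal of any site that would receive both labels are all assumptions made for illustration.

```python
import random


def build_training_set(validated, synthetic, putative, inherited,
                       n_false, seed=0):
    """Return a list of (site, label) pairs; label 1 = positive, 0 = negative.

    All arguments are iterables of hashable site keys, e.g. (chrom, pos).
    This layout is hypothetical and only mirrors the description in the text.
    """
    validated_set = set(validated)
    # Positives: cross-validated plus synthetic de novo indels.
    positives = validated_set | set(synthetic)
    # False de novo indels: putative calls that were not cross-validated.
    false_pool = sorted(set(putative) - validated_set)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled_false = rng.sample(false_pool, min(n_false, len(false_pool)))
    # Negatives: sampled false calls plus inherited indels. Dropping sites
    # that are already positive is an assumption to avoid conflicting labels.
    negatives = (set(sampled_false) | set(inherited)) - positives
    return ([(site, 1) for site in sorted(positives)]
            + [(site, 0) for site in sorted(negatives)])
```

Sequence features would then be extracted for each labelled site from the refined alignment data; that step depends on the aligner and assembler output and is not sketched here.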
In the Prediction module, DNMFilter_Indel trains the same gradient boosting classification model as DNMFilter [12] and makes predictions for all putative de novo indels obtained by any computational method. DNMFilter_Indel finally produces a score from 0 to 1 for each de novo indel, representing the probability that it is a real de novo indel.

Sequence feature selection
Indel detection is particularly prone to alignment errors, so some commonly used indel detection methods perform local de novo assembly to refine the alignment around candidate indels and then detect indels from the realigned pileups. To correct alignment errors, DNMFilter_Indel uses the same strategy: it performs local de novo assembly using ABRA2 [13] and extracts all sequence features for every de novo indel during both training and prediction.
A large number of indels arise in homopolymer and short tandem repeat (STR) regions of the human genome, yet indel detection is also more error-prone in these regions. Hence, in addition to the sequence features used in DNMFilter, DNMFilter_Indel adds two sequence features to the classification model. One is homopolymer, which refers to a repetitive sequence element with a unit of 1bp (the minimum repeat tract is set to 4); the other is short tandem repeat, which refers to repetitive sequence elements with a unit of 2bp to 6bp (the minimum repeat tract is set to 3).
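The two repeat features can be sketched in Python as follows. This is an illustrative reading of the definitions above, not the code used by DNMFilter_Indel. It assumes that "minimum repeat tract" means the minimum number of consecutive copies of the repeat unit, that the feature is a yes/no flag for a 0-based position in an uppercase reference string, and that an STR unit made of a single repeated base is counted as a homopolymer only. The function names are hypothetical.

```python
def repeat_copies(seq, pos, unit_len):
    """Count copies in the unit_len-periodic tract that covers position pos.

    Returns (copies, unit), where unit is the first unit of the tract.
    """
    n = len(seq)
    if n < unit_len or not 0 <= pos < n:
        return 0, ""
    start = min(pos, n - unit_len)  # keep one whole unit inside the string
    left = right = start
    # Extend right while each base equals the base one period further on.
    while right + unit_len < n and seq[right] == seq[right + unit_len]:
        right += 1
    # Extend left under the same periodicity condition.
    while left > 0 and seq[left - 1] == seq[left - 1 + unit_len]:
        left -= 1
    length = right + unit_len - left
    return length // unit_len, seq[left:left + unit_len]


def in_homopolymer(seq, pos, min_copies=4):
    """True if pos lies in a run of at least min_copies identical bases."""
    copies, _ = repeat_copies(seq, pos, 1)
    return copies >= min_copies


def in_str(seq, pos, min_copies=3):
    """True if pos lies in a tract of >= min_copies copies of a 2-6bp unit."""
    for unit_len in range(2, 7):
        copies, unit = repeat_copies(seq, pos, unit_len)
        # Skip units such as "AA", which are homopolymers in disguise.
        if copies >= min_copies and len(set(unit)) > 1:
            return True
    return False
```

For example, position 5 of `GGTAAAAACGT` lies in a run of five A bases and is flagged as homopolymer but not STR, whereas position 4 of `TTGCACACATG` lies in three copies of `CA` and is flagged as STR but not homopolymer.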

Training set construction
Because the de novo indel mutation rate is extremely low, it is hard to gather enough cross-validated true de novo indels to serve as positive examples. We therefore simulate synthetic de novo indels to supplement the true ones. The simulation works as follows. If one parent's genotype is reference, the other parent's genotype is a heterozygous indel, and the offspring's genotype is reference, then the alignment information of the parent carrying the heterozygous indel and that of the offspring are exchanged. The exchanged indel sites can be regarded as synthetic de novo indel sites. The false de novo indels are produced by the following process: (a) several commonly used de novo indel detection methods are run to obtain putative de novo indels; (b) the cross-validated de novo indels are excluded; (c) the false de novo indels are randomly sampled from the set remaining after the previous step. In addition, inherited indels are included as negative examples.
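The exchange rule for synthetic de novo indels can be written as a small Python sketch. It only illustrates the rule stated above. The genotype labels `"ref"` and `"het"`, the `SampleSite` and `TrioSite` types, and the string standing in for per-site alignment data are hypothetical; the real tool operates on alignment files.

```python
from typing import NamedTuple, Optional


class SampleSite(NamedTuple):
    genotype: str   # "ref" or "het" (hypothetical encoding)
    alignment: str  # placeholder for the reads covering the site


class TrioSite(NamedTuple):
    father: SampleSite
    mother: SampleSite
    child: SampleSite


def synthesize_de_novo(site: TrioSite) -> Optional[TrioSite]:
    """Swap carrier parent and offspring if the site is eligible, else None."""
    if site.child.genotype != "ref":
        return None
    parents = (site.father.genotype, site.mother.genotype)
    if parents == ("het", "ref"):
        # Father carries the indel: exchange father and child data.
        return TrioSite(father=site.child, mother=site.mother,
                        child=site.father)
    if parents == ("ref", "het"):
        # Mother carries the indel: exchange mother and child data.
        return TrioSite(father=site.father, mother=site.child,
                        child=site.mother)
    return None  # both or neither parent heterozygous: not eligible
```

After the exchange, the sample in the offspring role carries a heterozygous indel while both samples in the parent roles are reference, which is the signature of a de novo event.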

Results
The widely used CEU trio from the 1000 Genomes Project was adopted to demonstrate the performance of DNMFilter_Indel. The whole genome alignment files were obtained from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120117_ceu_trio_b37_decoy/. All reads were mapped to the human reference genome (GRCh37). There are 56 de novo indels in the CEU trio that were previously cross-validated [8].
The training set was constructed with chromosomes 1 to 6 of the trio. Three state-of-the-art de novo indel detection methods, DeNovoGear, PhaseByTransmission and TrioDeNovo, were adopted to detect de novo indels in the remaining chromosomes 7 to 22, and DNMFilter_Indel was then employed to filter out false de novo indels obtained by each of these detection methods separately.
DeNovoGear, PhaseByTransmission and TrioDeNovo were all run with default settings, and DNMFilter_Indel's score cutoff was set to 0.4. DNMFilter_Indel was applied to both the raw alignment data and the refined alignment data based on local de novo assembly.
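The evaluation design (chromosomes 1 to 6 for training, chromosomes 7 to 22 for testing, score cutoff 0.4) can be summarized in a short Python sketch. The function names, the `((chrom, pos), score)` layout, the handling of a `chr` prefix, and the choice to keep candidates whose score is greater than or equal to the cutoff are assumptions; the text does not say whether the comparison is strict, and other contigs such as the sex chromosomes are simply left unassigned here.

```python
TRAIN_CHROMS = {str(c) for c in range(1, 7)}    # chromosomes 1-6
EVAL_CHROMS = {str(c) for c in range(7, 23)}    # chromosomes 7-22


def partition(chrom):
    """Return "train", "eval", or None for a chromosome name."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name in TRAIN_CHROMS:
        return "train"
    if name in EVAL_CHROMS:
        return "eval"
    return None  # e.g. X, Y, MT: not used in this sketch


def filter_candidates(scored, cutoff=0.4):
    """Keep evaluation-set candidates whose score reaches the cutoff.

    scored is an iterable of ((chrom, pos), score) pairs, where score is
    the 0-to-1 output of the classifier.
    """
    return [(site, score) for site, score in scored
            if partition(site[0]) == "eval" and score >= cutoff]
```

Splitting by chromosome keeps the sites used to train the classifier disjoint from the sites on which its filtering performance is measured.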
For the training set, principal component analysis (PCA) was performed to project all sequence features of de novo indels onto the first three components (Fig. 2), and the result suggested that the sequence features used in this study were able to distinguish between true and false de novo indels. Feature importance was measured using the method provided in the R package "gbm" to determine the contribution of each sequence feature (Fig. 3). The two features that we introduced, homopolymer and STR, ranked 21st and 27th respectively, indicating that they were useful for the classification.
The overall performance of DNMFilter_Indel coupled with the de novo detection methods is summarized in Table 1. The results showed that DNMFilter_Indel filtered out a substantial number of false de novo indels with almost no loss in sensitivity. For every de novo indel detection method coupled with DNMFilter_Indel, only one true de novo indel was filtered out by mistake on the raw alignment data, and none was filtered out by mistake on the refined alignment data based on local de novo assembly. The large number of de novo indels remaining in the final results may be because many true de novo indels were not cross-validated in the previous study. In conclusion, refining the alignment by local de novo assembly was effective in improving filtering performance, and a positive set consisting of both validated and synthetic de novo indels was effective for filtering de novo indels.

Conclusions
We proposed a novel method, DNMFilter_Indel, extended from our previous work DNMFilter, which can effectively filter de novo indels from trio-based sequencing data. When applied to real sequencing data, DNMFilter_Indel substantially filtered out false de novo indels while hardly sacrificing sensitivity. Together with the tool, the training set constructed with the CEU trio used in this study is released. Researchers can directly use this training set or construct a new training set with the module provided in DNMFilter_Indel, and then use DNMFilter_Indel to separate true de novo indels from a massive number of false ones.

Availability and requirements
Project name: DNMFilter_Indel. Project home page: https://github.com/yongzhuang/DNMFilter_Indel. Operating system: Linux dependent. Programming language: Java and R. License: MIT. Any restrictions to use by non-academics: licence needed.