An integrative approach to predicting the functional effects of small indels in non-coding regions of the human genome

Background Small insertions and deletions (indels) have a significant influence in human disease and, in terms of frequency, they are second only to single nucleotide variants as pathogenic mutations. As the majority of mutations associated with complex traits are located outside the exome, it is crucial to investigate the potential pathogenic impact of indels in non-coding regions of the human genome. Results We present FATHMM-indel, an integrative approach to predict the functional effect, pathogenic or neutral, of indels in non-coding regions of the human genome. Our method exploits various genomic annotations in addition to sequence data. When validated on benchmark data, FATHMM-indel significantly outperforms CADD and GAVIN, state of the art models in assessing the pathogenic impact of non-coding variants. FATHMM-indel is available via a web server at indels.biocompute.org.uk. Conclusions FATHMM-indel can accurately predict the functional impact and prioritise small indels throughout the whole non-coding genome. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1862-y) contains supplementary material, which is available to authorized users.


FATHMM-indel's Features
Each indel in our data sets was annotated using DNA conservation data obtained from multiple sequence alignments (MSAs) of vertebrate genomes to the human genome [1]. The details on how covariates were computed are reported below. The features used are: • P r . This covariate represents FATHMM's emission probability for the reference nucleotide [2].
• P m − P r and |P m − P r |. These scores measure the potential impact of a mutation: the greater the difference, the greater the impact.
• MSA depth (the number of species used in MSAs). This feature is positively correlated with the confidence of the resulting conservation scores.
• FATHMM score. The log-odds ratio between P r and P m .
• PhyloP score [3]. PhyloP scores measure evolutionary conservation at individual alignment sites, ignoring neighbouring nucleotides in the calculation. It can be used to measure acceleration (evolution faster than expected under neutral drift) as well as conservation (evolution slower than expected).
• PhastCons score [4]. PhastCons scores account for regions surrounding each alignment position in estimating the probability that each nucleotide belongs to a conserved element.
• Percent identity. This value quantifies the similarity between wildtype and mutant sequences.
With the exception of percent identity, all covariates can be computed using data coming from MSAs between human and 45 vertebrate genomes (8 "46 way" features), or using MSA data between human and 99 vertebrate genomes (8 "100 way" features). In this work we considered both "46 way" and "100 way" covariates, culminating in a total of 17 (8 + 8 + 1) features (the study was performed using the human genome assembly GRCh37).

Computation of Conservation Scores
To illustrate how single-nucleotide features are integrated into an indel score, let us consider a threenucleotide reference sequence AGC. A two-nucleotide insertion TT starting at the second position will yield the mutant sequence ATTGC. The mutation's overall conservation score includes the score for the first nucleotide, A, along with the scores from the inserted nucleotides TT. We compute the effect of a transition from the three-nucleotide wildtype sequence AGC to the three-nucleotide mutant sequence ATT by averaging the conversion scores for three transitions: A → A, G → T, and C → T. The approach for deletions is similar: we simply average the scores for mutations along the length of a deletion. These mean scores provide a simple way to assess conservation across an indel. In coding regions, even a single-nucleotide insertion/deletion can impact many positions downstream. In non-coding regions, indels may disrupt binding by altering recognition motifs or by changing the relative positions of motifs. More sophisticated scoring procedures may thus be warranted, but we found that this simple procedure appears to yield highly informative features for our model.
When methods provide p values (e.g. PhastCons), the indel score is simply the average of the p values. However, it does not always make sense to take a simple average for scores from other methods.
For example, FATHMM [2] scores f SNV are (logarithmic) ratios based on probabilities for reference (P r ) and mutant (P m ) nucleotides: To compute a score over a range, we compute the average reference probability (P r ) and the average mutant probability (P m ) across the range and then compute a ratio from these average probabilities: PhyloP also requires special handling. Briefly, these scores may be split into two categories based on positive or negative values. Positive values indicate whether a position is unusually highly conserved, even if it falls within a conserved region. Negative values indicate whether a position is unusually variable, even within a variable region. Either a positive or negative PhyloP score S phy may be converted into a probability P e that the value is extreme using: P e = 10 −|S phy | By using |S phy | we guarantee that P e ∈ [0, 1]. To compute a score that reflects conservation characteristics across a region, we convert scores to probabilities for positive and negative scores, then subtract the probabilities associated with negative scores from those associated with positive scores to obtain a total. We then divide this by the number of positions to get an average for the region. These averages fall into the range [−1, +1] and thus correspond to sequence changes that range from least (−1) to most (+1) disruptive.

Percent Identity Scores
Although we eliminated repeat sequences from our data, some indels are likely to be more disruptive than others -by changing local GC content, for example. As a simple way to encapsulate the disruption an indel may impart to a genomic region, we included a percent identity feature to compare the wildtype and mutant sequences. To compute this feature we consider the region around an indel's starting position s and compare the reference sequence R with the mutated sequence M that results from the indel. For an indel of length k, R is the reference genomic sequence from s to s + k. For an insertion, the mutant sequence M is simply the annotated insert sequence. For a deletion of length k, we construct M from the reference sequence starting at position s + k and ending at s + 2k. In each case we compute the percent identity f PID between R and M as follows: where I is the indicator function.