# OD-seq: outlier detection in multiple sequence alignments

- Peter Jehl
^{1}, - Fabian Sievers
^{1}Email author and - Desmond G. Higgins
^{1}

**Received: **15 May 2015

**Accepted: **13 August 2015

**Published: **25 August 2015

## Abstract

### Background

Multiple sequence alignments (MSA) are widely used in sequence analysis for a variety of tasks. Outlier sequences can make downstream analyses unreliable or make the alignments less accurate while they are being constructed. This paper describes a simple method for automatically detecting outliers and accompanying software called OD-seq. It is based on finding sequences whose average distance to the rest of the sequences in a dataset, is anomalous.

### Results

The software can take a MSA, distance matrix or set of unaligned sequences as input. Outlier sequences are found by examining the average distance of each sequence to the rest. Anomalous average distances are then found using the interquartile range of the distribution of average distances or by bootstrapping them. The complexity of any analysis of a distance matrix is normally at least *O*(*N*
^{2}) for *N* sequences. This is prohibitive for large *N* but is reduced here by using the mBed algorithm from Clustal Omega. This reduces the complexity to *O*(*N* log(*N*)) which makes even very large alignments easy to analyse on a single core. We tested the ability of OD-seq to detect outliers using artificial test cases of sequences from Pfam families, seeded with sequences from other Pfam families. Using a MSA as input, OD-seq is able to detect outliers with very high sensitivity and specificity.

### Conclusion

OD-seq is a practical and simple method to detect outliers in MSAs. It can also detect outliers in sets of unaligned sequences, but with reduced accuracy. For medium sized alignments, of a few thousand sequences, it can detect outliers in a few seconds. Software available as http://www.bioinf.ucd.ie/download/od-seq.tar.gz.

## Keywords

## Background

Multiple sequence alignments are essential for many sequence analysis tasks [1–4]. They are widely used for phylogenetic analysis, discovery of conserved regions, protein function studies and structure prediction. One problem that can arise in these analyses is the presence of outlier sequences. Outliers can disrupt an alignment at the construction stage and lead to highly sub-optimal alignments with a knock-on effect on downstream analyses. With small datasets, these can often be seen in the final alignment, if a viewer such as Jalview [5] is used. With very large alignments of say 10 s of thousands of sequences, it can be hard to view the complete alignment and it can be difficult to see outliers.

Outliers can come from a variety of sources. The simplest example is a sequence that is not homologous with the rest of the dataset and which has been included by accident. These are the easiest to detect as they will not have any of the conserved regions which the rest of the sequences might share. The pattern of gaps and conserved blocks will be completely different between the outlier and the rest. A second source of outliers will be sequences which have been partly mistranslated due to a sequencing error or incorrect automated translation from a genome sequence. These will have some conserved blocks followed by sections that are non-homologous with the rest. The longer the mistranslated region, the easier it will be to detect but very short outlier sections may be hard to find. The third kind of outlier will be where a sequence is homologous to the rest but where the similarity is extremely low and it is very hard to align. This case is a matter of choice and of degree. Some data sets need to include all possible homologues, for completeness. Others need to restrict dataset membership to sequences which are alignable over their full length. This case is different to the first two because very distant sequences may be hard to distinguish from real outliers which are simply not homologous.

Outlier detection is a well studied field in computer science and has been applied to many different fields such as credit card, insurance and tax fraud detection, intrusion detection for cyber security, fault detection in safety critical systems, military surveillance for enemy activities and many other areas [6]. Previous attempts at detecting sequence outliers have mainly concentrated on finding sections of sequence or alignment which are highly divergent. In Clustal X [7] there are menu items to activate a variety of outlier detection schemes which are based on detecting runs of amino acids with suspiciously low similarity scores to the rest. These can then be highlighted in various ways for visual detection. The GBLOCKS package [8] finds sections of alignment that are rich in gaps and which have low overall conservation. These blocks can be automatically removed for automated creation of phylogenies in pipelines. Penn et al. [9] used robustness to guide tree changes as a measure of alignment confidence. The DivA package [10] uses training sets seeded with manually inserted problem segments to recognise outlier positions or segments.

In this paper, we introduce the OD-Seq package which attempts to identify outlier sequences in a multiple alignment. It uses a simple gap based metric which counts the number of positions between two aligned sequences which have a gap in one and not the other. These distances are used to make a distance matrix which is then analysed to find sequences with unusually high average distances from the rest of the sequences. We can also use OD-Seq to find outliers in sets of unaligned sequences using pairwise BLAST [11] scores but this is less sensitive. We demonstrate the use of OD-Seq by taking sets of sequences from Pfam [12] and seeding these with outlier sequences from other Pfam families. In principle our method should work for nucleic acid sequences as well, but hasn’t been tested.

## Methods

### Algorithm

#### Overview

For very large numbers of sequences the distance matrix becomes very time and memory consuming to create. mBed [14] allows us to calculate approximate mean distances for the sequences, without calculating a full distance matrix. For analysing the vector of distances, we predict outliers in two ways: bootstrapping to estimate confidence intervals or by analysing interquartile ranges. Both of these methods are used to find sequences with unexpectedly large average distances.

#### Alignment gap metric

#### Aligned sequences

*X*

_{ N,L }, with

*N*number of sequences and

*L*is the length of the alignment. Three different metrics are available for selection: linear, affine and cumulative. Before the distance is computed the sequences are changed to a binary representation with 0 for an amino acid character and 1 for a gap. The linear metric is given by the formula

*S*, length of alignment

*L*and the alignment as a matrix

*X*in binary form. It does not distinguish between fewer, longer gaps or more, shorter gaps. The affine version uses the formula

*C*

_{ j }being a vector holding the distance value for each position of the comparison. The pairwise scores of each sequence M are then added up to an overall distance vector

*D*

_{ i }.

#### Unaligned sequences

The similarity computation for unaligned sequences is performed by BLASTP with default parameters. Here the bit score for every pairwise computation is added up to an overall score for each sequence.

#### mBed

Computing a full distance matrix for *N* sequences has a time and memory complexity of *O*(*N*
^{2}). This becomes prohibitive for alignments of many sequences. The mBed algorithm, also used in Clustal Omega, reduces computing times. Here the *N*×*N* distance matrix is reduced to *N* log(*N*) by randomly selecting *M*= log(*N*) seed sequences and calculating a reduced *M*×*N* distance matrix. When tested on multiple sequence alignment benchmarks, it was used to make alignments without a significant change in accuracy.

#### Bootstrapping

*N*sequences randomly (with replacement) and computing the mean and standard deviation of each pseudo replicate. The mean of these values over all the replicates generate the estimates for the calculation of the outlier score. A sequence

*i*is considered an outlier if

with *σ* being the estimated standard deviation, \(\overline {s}\) the estimated mean score and *D*
_{
i
} the score of sequence *i*.

#### Interquartile range analysis

*Q*1) and 3rd quartiles (

*Q*3) determine the interquartile range

*r*=

*Q*3−

*Q*1. For sequences with a distance measure greater than

*Q*1 and smaller than

*Q*3 the outlier score is 0. A sequence

*i*is considered an outlier if

For datasets with a lot of identical sequences the interquartile range might be 0 which can cause a division by 0. Small values for the interquartile range can lead to inflated scores for the outliers. For these cases bootstrap analysis should be chosen as the estimators are more robust.

### Datasets

*%*of the average alignment length.

Pfam verification datasets overview

Different clan | Same clan | No outlier | |
---|---|---|---|

Number of families | 1329 | 889 | 1329 |

Number of sequences | 5051 | 5051 | 5000 |

Average alignment length | 906.7 | 817.4 | 664.3 |

% identity | 27.5 | 25.1 | 29.0 |

Computing times were determined using two variations of the ABC transporter family in Pfam (PF00005). For measuring the time with various alignment lengths, 5000 sequences were chosen and truncated to various lengths. The dependency on the number of sequences in the alignment was determined by keeping the length constant at 200 positions and varying the number of sequences.

## Results

### Example analysis

### Run times

*O*(

*L*) in sequence length. Times for different subsets of the ABC transporters of different lengths are plotted in (Fig. 4 a) and all variations of the algorithm do show approximately linear behaviour. Time complexity wrt number of sequences is expected to be

*O*(

*N*

^{2}) for full distance matrix mode or

*O*(

*N*log(

*N*)) for mBed mode. Run times for different subsets of the ABC transporters are plotted in (Fig. 4 b). Data shown is the execution time for the whole analysis including sequence read, distance computation, normalisation and output. The outlier detection for the whole ABC transporter family from Pfam with 363,409 sequences with an alignment length of 2177 takes 39 minutes on one CPU core.

### Large scale outlier tests

A threshold range for outliers was identified by plotting the true positive (TPR) and false positive (FPR) rate against the threshold (Fig. 5). In these plots one can observe that if a cut-off of 2 is selected a FPR below 10 *%* can be expected while the TPR is still far higher and depending on the dataset true outliers can be identified confidently. The plots show that the FPR stays constant even if no outliers are present. According to these figures a threshold between 2 and 10 is recommended, depending on the stringency the user wants to choose. For the same clan datasets a steep decline in TPR is observed, this is due to the fact that many of the introduced outliers are so similar to the selected family that they might not even be considered outliers due to their similarity. For the unaligned method the area of good performance is narrower as high outlier scores are very rare.

### Pfam outliers and example

Detected outliers in Pfam families

Pfam families | Percentage of outliers found by bootstrap cutoff | |||||||||
---|---|---|---|---|---|---|---|---|---|---|

Sequences | Median | Type | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |

≤150 | 68 | Linear | 12.52 | 5.51 | 2.22 | 0.92 | 0.40 | 0.19 | 0.09 | 0.05 |

150−500 | 278 | 12.70 | 5.34 | 2.11 | 0.88 | 0.39 | 0.19 | 0.09 | 0.05 | |

≥500 | 1652 | 11.74 | 5.21 | 2.25 | 1.00 | 0.46 | 0.22 | 0.11 | 0.06 | |

≤150 | 68 | Affine | 13.10 | 5.39 | 2.02 | 0.76 | 0.33 | 0.15 | 0.07 | 0.03 |

150−500 | 278 | 13.25 | 5.33 | 2.00 | 0.77 | 0.32 | 0.14 | 0.07 | 0.03 | |

≥500 | 1652 | 11.98 | 5.18 | 2.21 | 0.96 | 0.43 | 0.19 | 0.09 | 0.05 | |

≤150 | 68 | Cumulative | 10.37 | 5.10 | 2.54 | 1.30 | 0.69 | 0.40 | 0.22 | 0.12 |

150−500 | 278 | 9.01 | 4.25 | 2.15 | 1.20 | 0.72 | 0.45 | 0.29 | 0.20 | |

≥500 | 1652 | 7.50 | 3.53 | 1.87 | 1.09 | 0.68 | 0.45 | 0.31 | 0.22 | |

≤150 | 68 | Unaligned | 14.3 | 8.19 | 1.01 | 0.055 | 0.046 | 0.045 | 0.045 | 0.045 |

150−500 | 278 | 16.35 | 12.65 | 0.093 | 0.012 | 0.0027 | 0.0009 | 0.0005 | 0.0003 | |

≥500 | 1652 | 15.8 | 1.65 | 0.16 | 0.018 | 0.0032 | 0.0008 | 0.0002 | 0.0001 |

## Discussion

Viewing very large alignments of many thousands of sequences is challenging, even if you use the best viewers available such as Jalview or Seaview [16]. One serious issue is to detect and deal with highly aberrant sequences. OD-seq provides a simple yet robust method to rank the most aberrant sequences or to select all sequences above some cut-off that is tuned to optimise a desired false positive or false negative rate. When faced with clear outliers, such as the case with Pfam families that have been seeded with sequences from different Pfam clans, OD-seq finds the outliers with extremely high sensitivity and specificity. These outliers were selected to be similar in length to the family they were added to so as to reduce length effects. When sequences were added from the same Pfam clan, the performance reduces but here, the difference between an outlier and a homologous yet very dissimilar sequence, is not so clear-cut. OD-seq can also deal with unaligned sequences, but the performance is, again, not as good as the full multiple alignment case. The main aim when designing OD-seq was to catch outliers in the full alignment.

OD-seq is fast enough to process alignments of moderate to large size in seconds. The run times to do any calculations on a full distance matrix will normally be proportional to the square of the number of sequences. For alignments of many tens of thousands of sequences this could take days to process. The *O*(*N* log(*N*)) mBed method for distance matrix calculation, reduces the time and memory to perfectly practical levels with alignments of over 300,000 sequences of length over 2000 positions being processed in under 40 minutes on a single core. The combination of speed and accuracy, then makes OD-seq a potentially useful tool for pipelines that require many large alignments to be made or for checking very large alignments for outliers. OD-seq will not be able to deal with outliers in very small data sets or with sequences that are homologous to the rest of a dataset over most of their length but which have a short region of mismatch. Such sequences or mismatched regions can be found satisfactorily by existing tools.

## Conclusion

OD-seq is a practical and simple method to detect outliers in MSAs. It can also detect outliers in sets of unaligned sequences, but with reduced accuracy. For medium sized alignments, of a few thousand sequences, it can detect outliers in a few seconds.

## Declarations

### Acknowledgements

Thanks to Norma Coffey-Bargary for suggesting the interquartile range analysis. Funding was provided by Science Foundation Ireland to DGH through PI grant 11/PI/1034.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Authors’ Affiliations

## References

- Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011; 7:539.View ArticlePubMedPubMed CentralGoogle Scholar
- Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for multiple sequence alignments. JMB. 2000; 302:205–217.View ArticleGoogle Scholar
- Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Bio Evol. 2003; 30:772–80.View ArticleGoogle Scholar
- Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004; 32(5):1792–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ. Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009; 25(9):1189–91. doi:10.1093/bioinformatics/btp033.View ArticlePubMedPubMed CentralGoogle Scholar
- Chandola V, Banerjee A, Kumar V. Anomaly detection: A survey. ACM Comput Surv. 2009; 41(3):58. Article 15, doi:10.1145/1541880.1541882.View ArticleGoogle Scholar
- Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 1997; 25:4876–82.View ArticlePubMedPubMed CentralGoogle Scholar
- Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000; 17:540–52.View ArticlePubMedGoogle Scholar
- Penn O, Privman E, Landan G, Graur D, Pupko T. An alignment confidence score capturing robustness to guide tree uncertainty. Mol Biol Evol. 2010; 27(8):1759–67.View ArticlePubMedPubMed CentralGoogle Scholar
- Zepeda Mendoza ML, Nygaard S, da Fonseca RR. DivA: detection of non-homologous and very divergent regions in protein sequence alignments. BMC Res Notes. 2014; 7:806.View ArticlePubMedPubMed CentralGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990; 215:403–10.View ArticlePubMedGoogle Scholar
- Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. The Pfam protein families database. Nucleic Acids Res. 2014; 42:D222–30.View ArticlePubMedGoogle Scholar
- Löytynoja A, Goldman N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 2008; 320(5883):1632–5.View ArticlePubMedGoogle Scholar
- Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG. Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol Biol. 2010; 5:21.View ArticlePubMedPubMed CentralGoogle Scholar
- Felsenstein J. Phylip - phylogeny inference package (version 3.2). Cladistics. 1989; 5:164–6.Google Scholar
- Gouy M, Guindon S, Gascuel O. SeaView Version 4: A multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol Biol Evol. 2010; 27(2):221–4.View ArticlePubMedGoogle Scholar