Reduction, alignment and visualisation of large diverse sequence families

Background Current volumes of sequence data can lead to large numbers of hits identified on a search, typically in the range of 10s to 100s of thousands. It is often quite difficult to tell from these raw results whether the search has been a success or has picked-up sequences with little or no relationship to the query. The best approach to this problem is to cluster and align the resulting families, however, existing methods concentrate on fast clustering and either do not align the sequences or only perform a limited alignment. Results A method (MULSEL) is presented that combines fast peptide-based pre-sorting with a following cascade of mini-alignments, each of which are generated with a robust profile/profile method. From these mini-alignments, a representative sequence is selected, based on a variety of intrinsic and user-specified criteria that are combined to produce the sequence collection for the next cycle of alignment. For moderate sized sequence collections (10s of thousands) the method executes on a laptop computer within seconds or minutes. Conclusions MULSEL bridges a gap between fast clustering methods and slower multiple sequence alignment methods and provides a seamless transition from one to the other. Furthermore, it presents the resulting reduced family in a graphical manner that makes it clear if family members have been misaligned or if there are sequences present that appear inconsistent. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1059-9) contains supplementary material, which is available to authorized users.

: Amino acid substitution process for a fragment of the T-rich family. The composition is at the 1/3 T level at which point additional 'T' characters are introduced to maintain this level (eg: line 13 to 14). DTTTTTTYNDETICNDKTYFDKTYITQKTKTATLTTTKCTETTTFDTFEITLFDEDITYTHKTGVTTT TGLTTATNETTICNTKWYSK YITQNYTTATLDTTKTTTNCTTVTHQIITTDVESITTHIAKTTTS  FTTETTTTVQFMTTTTGYAQTHS  MTRTQQKTTRYGMTTTRTKE  TRTTFHSRTTT  HTTTTNTTVAQMITTMMTTDTMS  TTQRLTTTTTTMQRKTKTDTYYKCDSIITTDTFYSVTTK  FTTTLQTTFITTERTRTRLCTTTNTDLTVPFWTTTNTPTTSYQTTTFRPHTITITFMSTDTMTS  FTTTLTTQTTTMHRTCRSTREMTNTTDLWFFWTTTTTYCPSYTTTPRTPHT TITFKPTHTTTT  TTNHATTTNTTT TKTRNGNN NTTTMVGRRNRTAT  QWIIT  QLTHTTQ TSTATV  TWTHAITPNTTGCTKTRNGNNTNTTKFVDTKTRTYT  TQQTTTTRQLTITTQNTSTTTM  AFITTIIFITTCRKKTNGTTNDTAVQVTTTTTIYYTKQQTLTFTRLLTINTTMTTD  TATI TTTTTQ  FTETT  TTTSTQAYKTTKTGATYTDQTRTTTTRYTTVHTRTTHDGEKTAHIAKTMTTTT  THPRTELDTTDMTHMDTTTTDTNSEWTATVGATTGTTADTGHTDITT DAFTTTPNTLTPVEKDCIALTT  THTTAMLTTTMTTLTMTTTNEVFAL  GATITTMATTTETDTYTCDQFTTTPTTLTTVFEKHTLTTN  TGTTHTTTQITTYMDTGT  IEKIIQ  TGITNTTFTKETTETTTTT  THTTRQTTFQFTELTTLT  TNDKHTRKATTDDIQRTKFLTVTGMTTTTCEKPTTFTDTTT  FTGTQHTTTTTTKSTDTIMDKTPPTFWFTNFAWTKQT  TTTTTTATTTLLFQTTTT  TTLQPTFTTYKNITTTTDYRWWTTTWHNWYWTYTMWQFTGYENYNTTTTTTETDDTHWTTTNT  TLTEETTWNTNTKKNETTYTRGNTTKTPNACMYRVTSGTHCMCHIIETTELTHTMTYTTATTFTKTT  ATTTTETTE TNTKTTETTTTRCGNNKTPT MARRVARTHIMTFHIIETKLAH MTTKTATTFTTKT  TPTWATTTTHTHTETDMDTTTFTTNLTATGAGT  THETCTPIYGTFTTHTNTYPRVLCQITTT  TLPTWATTTTHVTTAMELGTTTTTFTTTATGAGT TTHQCTPIYTK TATNHTKTWLCTITTA MULSEL reads the current file of sequences and implements a full peptide-based resort of the sequence order followed by a number of stages of hierarchic multiple sequences alignment using profile/profile matching. In this example file, the first line specifies that up to 92,680 sequences can be read from the file final.seq in the length range 100-150 residues using a peptide size of 4. (The remaining numbers are not used). The next two lines specify two cycles of peptide sorting using blocks of sequences specified by the first two numbers ( -1 -1 tells the program to use default divisions). The following three numbers are: the minimum score used in clustering, the number of top matches held per sequence and the bonus given to adjacent sequences pairs. The following five lines control the mini-alignment stages. The first three numbers specify a choice of matrix (0 0 0 = identity), the next four are the gap-penalty, the sequence range outside which pairs are not considered (span), the maximum allowed indel size (window) and the score cutoff (as ten times the sequence identity). The following three numbers are not relevant to the current implementation. This file will create mini-alignments by considering pairs scoring over 780 in five stages down to 700. A previous cycle may have run from 880 to 800 and a subsequent one from 680 to 600 and so on, down to the lowest final cutoff required.

9
The program MULSEL (which is short for multiple sequence selection) was written 10 in the C language and run on a Dell laptop with a Intel Core i7-3740QM CPU

11
(2.70GHz processor) and 8GB of memory. The program requires an input file of 12 sequences and an optional file of keword/penalty pairs. In addition, a series of 13 parameter files need to be provided for each cycle of mini-alignment generation.
14 These could all be the same but it is better to have a higher similarity cutoff in 15 the initial stages. An example file is shown in Figure 8 and a complete set is 16 included in the program tarball.

31
In an early application of the MULTAL program to the clustering of all known 32 sequences (without selection) [1], computer memory limitation was an important 33 concern. Luckily, at that time, the number of known sequences was just under 34 32,767 and so could be indexed by a short int (in the C computer language).

35
Sequence databanks have grown since then but so too have computer resources advantage that the presorts can be performed in parallel.

44
Although it was stated above that a binary tree encoding does not restrict 45 the length of the peptide, there is however a limit imposed by the size of variable 46 used to store the encoded peptide. Currently this is an integer which can encode a peptide of 7 residues (20 7 ≤ 2147483647) or an oligonucleotide of 15 bases. In 48 the unlikely need to go beyond these limits, a "long long int" 1 could be used 49 (with minimal changes to the code) or a different string-based structure, such as 50 a radix tree. 51 1 In the C language, a long int has the same size as the default int variable and using their unsigned form makes little difference

A.2.3 Keyword-penalty input
to its complement: "<".  will have a score of just 1, or two triple repeat proteins will score just 3 (whatever 105 the peptide length). However, for typical protein (and nucleic acid) sequences, 106 the overhead required for storing the frequencies was considered unjustified given 107 that such low complexity sequences rarely occur. Knowing the peptide composition for each protein makes it easy to add a low-119 complexity filter. However, this was not implemented at the peptide pre-sorting 120 stage but the number of unique peptides identified per protein at that stage was 121 passed to the selection penalty assignment where it was used to give a slight bias 122 towards more complex sequences being selected.