Example of similarity blocks found by MS4 in the non-coding LTR sequences. Part of the alignment from 29 out of the 43 non-coding LTR sequences centered on the NFκ B binding site. The complete alignment of the 43 sequences is shown in Additional Files 8 and 9. The alignment is focused on the transcription factor NFκ B binding site (GGGACTTTCC[A|G]) and its flanking regions. The names of sequences are indicated with their accession number in Los Alamos HIV sequence databank. The sequence are regrouped according to their phylogeny. The position of the first letter of the displayed region is given on the left. The letters are rewritten by applying the MS4 method to the whole non coding LTR sequences. As seen in Additional File 8, the complete MS4 identifier is constructed as follows: e.g. C24_8 (class C24 for a N value of 8). Identical recoded letters that are in the same columns are displayed in the same colour. The MS4 identifier has been simplified as follows: we have just indicated the letter and the value of N. Therefore it can be that two different MS4 classes that lie on the same column, with the same letter and the same N value are only distinguished by their colour (e.g. A18 and also T18 HIV-1-M/G, that are red or green). The two or more repeated segments of the same sequence are put one under the other. Therefore the sequences are often written on several lines to highlight similarities between sequences and inside sequences. Most often the similarity blocks are aligned and the great majority of identical indexed letters are on only one column. Some colored letters are unique because only 29 sequences (out of 43) are displayed on this figure.