
Table 2 Segregation success and run-times for large sequence collections are tabulated as in Table 1

From: Reduction, alignment and visualisation of large diverse sequence families

a 30,000 sequences
Alignment stages (3 to X%)   Time (sec.)   Sequences selected   Remaining subfamilies   cd-hit clusters
90 (8)                       126.2         10892                1085                    1152
80 (4)                       32.2          3352                 250                     588
70 (2)                       10.6          929                  67                      377
60 (1)                       4.5           302                  22                      253
50 (1)                       1.8           203                  22                      173
b 90,000 sequences
Alignment stages (3 to X%)   Time (sec.)   Sequences selected   Remaining subfamilies   cd-hit clusters
90 (8)                       1695.8        33734                3234                    3562
80 (4)                       323.4         10740                794                     1861
70 (2)                       70.3          2899                 182                     1154
60 (1)                       19.8          947                  49                      771
50 (1)                       6.1           634                  22                      525
  1. The final pass by MULTAL to align the remaining sequences correctly generated the 20 distinct families for each sequence collection. The reason MULSEL left two unclustered singleton sequence “subfamilies” is explained in the text. The results of the fast clustering program cd-hit are included for comparison, but these values are not directly comparable because cd-hit is a single-pass method whereas MULSEL is progressive. The values quoted are the smallest number of clusters reported over all parameter combinations that produced a result, with the cutoff (-c) from 0.4 to 0.9 and word lengths (-n) from 5 down to 2.
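
The cd-hit parameter sweep described in the footnote could be sketched roughly as follows. This is only an illustration, not the authors' script: the 0.1 step between cutoffs, the helper names (parameter_grid, build_command, count_clusters, smallest_cluster_count) and the output file naming are all assumptions, and the footnote does not say how the sweep maps onto individual table rows. Only the cd-hit options -i, -o, -c and -n and the ">Cluster" header lines of the .clstr output are taken from cd-hit itself.

```python
import subprocess
from pathlib import Path


def parameter_grid(cutoffs=(0.9, 0.8, 0.7, 0.6, 0.5, 0.4),
                   word_lengths=(5, 4, 3, 2)):
    """All (cutoff, word length) pairs; the 0.1 cutoff step is an assumption."""
    return [(c, n) for c in cutoffs for n in word_lengths]


def build_command(fasta, out_prefix, cutoff, word_length):
    """Build a cd-hit command line using its -i, -o, -c and -n options."""
    return ["cd-hit", "-i", str(fasta), "-o", str(out_prefix),
            "-c", str(cutoff), "-n", str(word_length)]


def count_clusters(clstr_text):
    """Count clusters in .clstr output: each cluster starts with '>Cluster'."""
    return sum(1 for line in clstr_text.splitlines()
               if line.startswith(">Cluster"))


def smallest_cluster_count(fasta, workdir):
    """Run every combination and keep the smallest cluster count.

    Combinations that cd-hit rejects (non-zero exit status, no .clstr file)
    are skipped, mirroring "combinations of parameters that produced a result".
    Returns None if no combination succeeded.
    """
    counts = []
    for cutoff, n in parameter_grid():
        prefix = Path(workdir) / f"c{cutoff}_n{n}"  # hypothetical naming
        result = subprocess.run(build_command(fasta, prefix, cutoff, n),
                                capture_output=True)
        clstr = Path(f"{prefix}.clstr")
        if result.returncode == 0 and clstr.exists():
            counts.append(count_clusters(clstr.read_text()))
    return min(counts) if counts else None
```

Calling smallest_cluster_count("selected.fasta", "cdhit_runs") would need cd-hit on the PATH and an existing, empty cdhit_runs directory; both names are placeholders. Skipping failed runs rather than stopping matters here because cd-hit accepts only certain word lengths for a given cutoff, so part of the grid is expected to produce no result.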