(a) 30,000 sequences

| Alignment stages (3 to X%) | Time (sec.) | Sequences selected | Remaining subfamilies | cd-hit clusters |
|---|---|---|---|---|
| 90 (8) | 126.2 | 10892 | 1085 | 1152 |
| 80 (4) | 32.2 | 3352 | 250 | 588 |
| 70 (2) | 10.6 | 929 | 67 | 377 |
| 60 (1) | 4.5 | 302 | 22 | 253 |
| 50 (1) | 1.8 | 203 | 22 | 173 |

(b) 90,000 sequences

| Alignment stages (3 to X%) | Time (sec.) | Sequences selected | Remaining subfamilies | cd-hit clusters |
|---|---|---|---|---|
| 90 (8) | 1695.8 | 33734 | 3234 | 3562 |
| 80 (4) | 323.4 | 10740 | 794 | 1861 |
| 70 (2) | 70.3 | 2899 | 182 | 1154 |
| 60 (1) | 19.8 | 947 | 49 | 771 |
| 50 (1) | 6.1 | 634 | 22 | 525 |

- The final pass by MULTAL to align the remaining sequences correctly generated the 20 distinct families for each sequence collection. The reason why MULSEL left two unclustered singleton sequence “subfamilies” is explained in the text. The results of the fast clustering program cd-hit are included for comparison, but these values are not directly comparable because cd-hit is a single-pass method whereas MULSEL is progressive. The values quoted are the smallest number of clusters reported over all combinations of parameters that produced a result, with the cutoff (-c) from 0.4 to 0.9 and word lengths (-n) from 5 down to 2.
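
The cd-hit parameter sweep described in the note could be sketched roughly as follows. This is a minimal illustration, not the procedure actually used: the 0.1 step between cutoffs, the helper names (`parameter_grid`, `cdhit_command`, `count_clusters`, `smallest_cluster_count`) and the output file naming are all assumptions. Only the `-i`, `-o`, `-c` and `-n` options and the `>Cluster` header lines of the `.clstr` output are taken from cd-hit itself.

```python
import itertools
import subprocess
from pathlib import Path


def parameter_grid(cutoffs=(0.9, 0.8, 0.7, 0.6, 0.5, 0.4),
                   word_lengths=(5, 4, 3, 2)):
    """All (cutoff, word length) pairs; the 0.1 cutoff step is an assumption."""
    return list(itertools.product(cutoffs, word_lengths))


def cdhit_command(fasta, out, cutoff, word_length):
    """Build a cd-hit command line for one parameter combination."""
    return ["cd-hit", "-i", str(fasta), "-o", str(out),
            "-c", str(cutoff), "-n", str(word_length)]


def count_clusters(clstr_text):
    """Count clusters in the text of a cd-hit .clstr file."""
    return sum(1 for line in clstr_text.splitlines()
               if line.startswith(">Cluster"))


def smallest_cluster_count(fasta, workdir):
    """Run every combination and keep the smallest cluster count.

    Combinations that cd-hit rejects (non-zero exit status or no
    .clstr file) are skipped, mirroring "combinations of parameters
    that produced a result". Returns None if no run succeeded.
    """
    best = None
    for cutoff, word_length in parameter_grid():
        out = Path(workdir) / f"c{cutoff}_n{word_length}"  # hypothetical naming
        result = subprocess.run(
            cdhit_command(fasta, out, cutoff, word_length),
            capture_output=True,
        )
        clstr = Path(str(out) + ".clstr")
        if result.returncode != 0 or not clstr.exists():
            continue
        count = count_clusters(clstr.read_text())
        if best is None or count < best:
            best = count
    return best
```

Calling `smallest_cluster_count("selected.fasta", "runs")` (hypothetical file and directory names, with `runs` already created and cd-hit installed) would return the minimum cluster count over the combinations that ran successfully.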