Table 2 Segregation success and run-times for large sequence collections are tabulated as in Table 1

From: Reduction, alignment and visualisation of large diverse sequence families

(a) 30,000 sequences

| Alignment stages (3 to X%) | Time (sec.) | Sequences selected | Remaining subfamilies | cd-hit clusters |
|---|---|---|---|---|
| 90 (8) | 126.2 | 10892 | 1085 | 1152 |
| 80 (4) | 32.2 | 3352 | 250 | 588 |
| 70 (2) | 10.6 | 929 | 67 | 377 |
| 60 (1) | 4.5 | 302 | 22 | 253 |
| 50 (1) | 1.8 | 203 | 22 | 173 |

(b) 90,000 sequences

| Alignment stages (3 to X%) | Time (sec.) | Sequences selected | Remaining subfamilies | cd-hit clusters |
|---|---|---|---|---|
| 90 (8) | 1695.8 | 33734 | 3234 | 3562 |
| 80 (4) | 323.4 | 10740 | 794 | 1861 |
| 70 (2) | 70.3 | 2899 | 182 | 1154 |
| 60 (1) | 19.8 | 947 | 49 | 771 |
| 50 (1) | 6.1 | 634 | 22 | 525 |

  1. The final pass by MULTAL to align the remaining sequences correctly generated the 20 distinct families for each sequence collection. The reason MULSEL left two unclustered singleton sequence “subfamilies” is explained in the text. The results of the fast clustering program cd-hit are included for comparison, but these values are not directly comparable because cd-hit is a single-pass method whereas MULSEL is progressive. The cd-hit values quoted are the smallest number of clusters reported over all parameter combinations that produced a result, with the cutoff (-c) ranging from 0.4 to 0.9 and the word length (-n) from 5 down to 2.
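
The cd-hit parameter sweep described in the footnote could be sketched roughly as follows. This is only an illustration, not the authors' script: the 0.1 step between cutoffs, the file names, and the helper functions `cdhit_command`, `count_clusters` and `sweep` are all assumptions introduced here. Only the `-c` and `-n` ranges come from the footnote, and `-i`/`-o` are cd-hit's standard input and output options.

```python
import itertools
import subprocess
from pathlib import Path

# Ranges from the table footnote; the 0.1 step between cutoffs is an assumption.
CUTOFFS = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
WORD_LENGTHS = [5, 4, 3, 2]


def cdhit_command(fasta, out_prefix, cutoff, word_length):
    """Build a cd-hit command line: -i input, -o output, -c cutoff, -n word length."""
    return ["cd-hit", "-i", str(fasta), "-o", str(out_prefix),
            "-c", f"{cutoff:.1f}", "-n", str(word_length)]


def count_clusters(clstr_text):
    """Count clusters in the text of a .clstr file (one '>Cluster N' header per cluster)."""
    return sum(1 for line in clstr_text.splitlines() if line.startswith(">Cluster"))


def sweep(fasta, workdir):
    """Try every (cutoff, word length) pair and keep counts only for runs that produce a result."""
    counts = {}
    for cutoff, n in itertools.product(CUTOFFS, WORD_LENGTHS):
        prefix = Path(workdir) / f"c{cutoff:.1f}_n{n}"  # hypothetical output naming
        result = subprocess.run(cdhit_command(fasta, prefix, cutoff, n),
                                capture_output=True)
        clstr = Path(f"{prefix}.clstr")  # cd-hit writes clusters to <output>.clstr
        if result.returncode == 0 and clstr.exists():
            counts[(cutoff, n)] = count_clusters(clstr.read_text())
    return counts
```

Given the dictionary returned by `sweep`, the smallest cluster count is `min(counts.values())`. Combinations that cd-hit rejects (for example, a word length too long for a low cutoff) simply do not appear in the result, which matches the footnote's "combinations of parameters that produced a result". How the sweep maps onto the individual table rows is not stated in the source, so that grouping is left to the reader.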