Table 3 Correlations between dataset properties and nCC scores

From: Assessment of composite motif discovery methods

 

| Dataset property | TRANSFAC PWMs: average nCC | TRANSFAC PWMs: highest nCC | Custom PWMs: average nCC | Custom PWMs: highest nCC |
|---|---|---|---|---|
| Number of sequences | -0.23 | -0.16 | -0.23 | -0.05 |
| Length of shortest sequence | 0.30 | 0.18 | 0.30 | 0.13 |
| Average sequence length | 0.40 | 0.33 | 0.42 | 0.43 |
| Total sequence set length | -0.19 | -0.12 | -0.18 | -0.02 |
| Number of module instances | -0.38 | -0.32 | -0.40 | -0.19 |
| Size of smallest module | **0.61** | **0.69** | **0.67** | **0.73** |
| Size of largest module | 0.26 | 0.34 | 0.19 | 0.35 |
| Average module size | **0.60** | **0.68** | 0.59 | **0.70** |
| Module size standard deviation | 0.23 | 0.29 | 0.13 | 0.29 |
| IC-content (lowest) | 0.46 | 0.45 | **0.73** | 0.47 |
| IC-content (total) | **0.75** | **0.73** | **0.78** | 0.54 |
| Module/background-ratio | 0.53 | 0.61 | 0.51 | **0.63** |

We conducted a simple correlation analysis to examine which properties of the TRANSCompel sequence sets and PWMs correlated best with the highest and average nCC scores obtained by the methods on these sets. "IC-content (lowest)" is the information content (IC) of whichever of the two PWMs involved in each sequence set has the lower IC. The information content of a PWM is inversely related to the amount of variability in the binding patterns from which the PWM is constructed [38]. PWMs with higher information content are more specific and match only sites that closely resemble the consensus motif. "IC-content (total)" is the sum of the ICs of the two motifs (for TRANSFAC PWMs, we used the PWM with the highest IC in each equivalence set to represent the motif). The three highest values are highlighted in each column. The properties that appear to correlate best with the methods' performance are the smallest and average module sizes (in base pairs) and the total IC-content, which would imply that module discovery is harder for datasets containing short and degenerate modules.
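The two quantities described in this note can be illustrated with a minimal Python sketch. It rests on assumptions that the note does not confirm: the information content is computed with the common uniform-background formula (2 bits minus the Shannon entropy of each PWM column, summed over columns), and the "simple correlation analysis" is taken to be a Pearson correlation. The function names and the toy PWMs are hypothetical and are not taken from the paper.

```python
import math


def pwm_information_content(pwm):
    """Total information content of a PWM in bits.

    Assumption: uniform background over A, C, G, T, so each column
    contributes 2 bits minus its Shannon entropy. `pwm` is a list of
    columns, each a list of four base probabilities summing to 1.
    """
    total = 0.0
    for column in pwm:
        entropy = -sum(p * math.log2(p) for p in column if p > 0)
        total += 2.0 - entropy
    return total


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists.

    Assumption: the note does not name the coefficient that was used.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy PWMs (not real TRANSFAC or custom matrices).
specific_pwm = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
degenerate_pwm = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]

# "IC-content (lowest)" and "IC-content (total)" for one sequence set
# whose module consists of these two motifs.
ics = [pwm_information_content(specific_pwm),
       pwm_information_content(degenerate_pwm)]
ic_lowest = min(ics)
ic_total = sum(ics)
```

Given one property value per sequence set (for example `ic_total`) and the matching list of nCC scores, `pearson(property_values, ncc_scores)` would yield one cell of the table, under the assumptions stated above.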