
Table 1 Performed experiments and experimental input.

From: Fast index based algorithms and software for matching position specific scoring matrices

 

| | Exp. 1 | Exp. 2 | Exp. 3 | Exp. 4 | Exp. 5 | Exp. 6 |
|---|---|---|---|---|---|---|
| # searched sequences | 59,021 | 30,964 | 19,111 | 1 (H.s. Chr. 6) | 19,111 | 19,111 |
| total length | 20.2 MB | 37.2 MB | 4.3 MB | 162.9 MB | 4.3 MB | 4.3 MB |
| sequence source | see [13] | DBTSS 5.1 | RCSB PDB | Sanger V1.4 | RCSB PDB | RCSB PDB |
| sequence type/PSSM type | protein | DNA | protein | DNA | protein | protein |
| # PSSMs | 4,034 | 220 | 11,411 | 576 | 28,337 | 10,931 |
| PSSM source | see [13] | MatInspector | PRINTS 38 | TRANSFAC Prof. 6.2 | BLOCKS 14.1 | PRINTS 38 |
| avg. length of PSSMs | 29.74 | 14.21 | 17.32 | 13.33 | 26.3 | 17.37 |
| index construction (sec) | 41 | 146 | 10.2 | 586 | 10.2 | 10.2 |
| mdc (sec) | 1960 | - | 1486 | - | 11871 | 1486 |
| MatInspector | | x | | | | |
| FingerPrintScan | | | x | | | |
| Blimps | | | | | x | |
| DN00 | x | | | | | |
| LAsearch | x | x | x | x | x | |
| ESAsearch | x | x | x | x | x | x |
| ESAsearch (reduced $\mathcal{A}$) | | | | | | x |

Overview of the sequences and PSSMs used in the experiments. For the experiments that use p-value or E-value cutoffs, we precomputed the cumulative score distributions and stored them on file; mdc is the time needed for this task. In Experiment 1 we measured the running time of the Java program from [13], referred to as DN00. We ran DN00 with a maximum of 2 GB of memory assigned to the Java virtual machine. DN00 constructs the suffix tree in main memory and then performs the searches. For a fair comparison, we therefore measured both the total running time and the time for matching the PSSMs (without suffix tree construction). For Experiment 2, we implemented the matrix similarity scoring scheme (MSS) of MatInspector and matched the PSSMs against both strands of the DNA sequences with different MSS cutoff values. The MSS of a PSSM $M$ of length $m$ and a sequence $w \in \mathcal{A}^m$ is defined as

$$\mathrm{MSS} = \frac{sc(w, M) - sc_{\min}(M)}{sc_{\max}(M) - sc_{\min}(M)}$$

and hence, given an MSS cutoff value, the threshold $th$ is determined as $th = \mathrm{MSS} \cdot (sc_{\max}(M) - sc_{\min}(M)) + sc_{\min}(M)$. Instead of using the reverse strand, we use the reverse complement $\overline{M}$ of the PSSM $M$, defined by $\overline{M}(i, a) = M(m - 1 - i, \overline{a})$ for all $i \in [0, m - 1]$ and $a \in \mathcal{A}$, where $\overline{a}$ is the Watson-Crick complement of nucleotide $a$. This allows the same enhanced suffix array to be used for both strands. In Experiment 5 we used a Perl-based wrapper for the Blimps program, shipped with the BLIMPS distribution, to do bulk sequence searches. The overhead of the Perl interpreter call was found to be negligible. For Experiment 6 we used the reduced alphabets given in Figure 8.
The last seven rows show which programs were used in which experiment.
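
The MSS threshold conversion and the reverse-complement PSSM described in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a PSSM is stored as a list of per-position dictionaries mapping nucleotides to scores, that sc(w, M) is the sum of the per-position scores, and that sc_min(M) and sc_max(M) are the sums of the column minima and maxima. All function names are hypothetical.

```python
# Illustrative sketch only: the data layout and function names are assumptions.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def score(pssm, w):
    """sc(w, M): sum of the scores M(i, w[i]) over all positions i."""
    return sum(column[ch] for column, ch in zip(pssm, w))


def score_bounds(pssm):
    """Return (sc_min(M), sc_max(M)), assumed to be the sums of the
    per-column minimum and maximum scores."""
    sc_min = sum(min(column.values()) for column in pssm)
    sc_max = sum(max(column.values()) for column in pssm)
    return sc_min, sc_max


def mss(pssm, w):
    """Matrix similarity score of w, normalized to the range [0, 1]."""
    sc_min, sc_max = score_bounds(pssm)
    return (score(pssm, w) - sc_min) / (sc_max - sc_min)


def threshold_from_mss(pssm, mss_cutoff):
    """th = MSS * (sc_max(M) - sc_min(M)) + sc_min(M)."""
    sc_min, sc_max = score_bounds(pssm)
    return mss_cutoff * (sc_max - sc_min) + sc_min


def reverse_complement_pssm(pssm):
    """Build M-bar with M-bar(i, a) = M(m - 1 - i, complement(a))."""
    m = len(pssm)
    return [
        {a: pssm[m - 1 - i][COMPLEMENT[a]] for a in COMPLEMENT}
        for i in range(m)
    ]


def reverse_complement(w):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return "".join(COMPLEMENT[ch] for ch in reversed(w))
```

Under these assumptions, scoring a word with the reverse-complement PSSM gives the same value as scoring its reverse complement with the original PSSM, which is why a single index over the forward strand suffices for both strands.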