Table 1 Performed experiments and experimental input.

From: Fast index based algorithms and software for matching position specific scoring matrices

| | Exp. 1 | Exp. 2 | Exp. 3 | Exp. 4 | Exp. 5 | Exp. 6 |
|---|---|---|---|---|---|---|
| # searched sequences | 59,021 | 30,964 | 19,111 | 1 (H.s. Chr. 6) | 19,111 | 19,111 |
| total length | 20.2 MB | 37.2 MB | 4.3 MB | 162.9 MB | 4.3 MB | 4.3 MB |
| sequence source | see [13] | DBTSS 5.1 | RCSB PDB | Sanger V1.4 | RCSB PDB | RCSB PDB |
| sequence type/PSSM type | protein | DNA | protein | DNA | protein | protein |
| # PSSMs | 4,034 | 220 | 11,411 | 576 | 28,337 | 10,931 |
| PSSM source | see [13] | MatInspector | PRINTS 38 | TRANSFAC Prof. 6.2 | BLOCKS 14.1 | PRINTS 38 |
| avg. length of PSSMs | 29.74 | 14.21 | 17.32 | 13.33 | 26.3 | 17.37 |
| index construction (sec) | 41 | 146 | 10.2 | 586 | 10.2 | 10.2 |
| mdc (sec) | 1960 | - | 1486 | - | 11871 | 1486 |
| MatInspector | | x | | | | |
| FingerPrintScan | | | x | | | |
| Blimps | | | | | x | |
| DN00 | x | | | | | |
| LAsearch | x | x | x | x | x | |
| ESAsearch | x | x | x | x | x | x |
| ESAsearch (reduced $\mathcal{A}$) | | | | | | x |
  1. Overview of the sequences and PSSMs used in the performed experiments. For the experiments that use p-value or E-value cutoffs, we precomputed the cumulative score distributions and stored them on file; mdc is the time needed for this task. In Experiment 1 we measured the running time of the Java program from [13], referred to as DN00. We ran DN00 with a maximum of 2 GB of memory assigned to the Java virtual machine. DN00 constructs the suffix tree in main memory and then performs the searches. For a fair comparison, we therefore measured both the total running time and the time for matching the PSSMs (without suffix tree construction). For Experiment 2, we implemented the matrix similarity scoring scheme (MSS) of MatInspector and matched the PSSMs against both strands of the DNA sequences with different MSS cutoff values. The MSS of a PSSM $M$ of length $m$ and a sequence $w \in \mathcal{A}^m$ is defined as $\mathrm{MSS} = \frac{sc(w, M) - sc_{\min}(M)}{sc_{\max}(M) - sc_{\min}(M)}$, and hence, given an MSS cutoff value, the threshold $th$ is determined as $th = \mathrm{MSS} \cdot (sc_{\max}(M) - sc_{\min}(M)) + sc_{\min}(M)$. Instead of searching the reverse strand, we use the reverse complement $\overline{M}$ of the PSSM $M$, defined by $\overline{M}(i, a) = M(m - 1 - i, \overline{a})$ for all $i \in [0, m - 1]$ and $a \in \mathcal{A}$, where $\overline{a}$ is the Watson-Crick complement of nucleotide $a$. This allows the same enhanced suffix array to be used for both strands. In Experiment 5 we used a Perl-based wrapper for the Blimps program, shipped with the BLIMPS distribution, to do bulk sequence searches. The overhead of the Perl interpreter call was found to be negligible. For Experiment 6 we used the reduced alphabets given in Figure 8.
The last seven rows show which programs were used in which experiment.
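
The MSS-to-threshold conversion and the reverse-complement PSSM described in the table note can be illustrated with a small sketch. This is not the authors' implementation: the list-of-dictionaries PSSM representation, all function names, and the toy matrix are assumptions made for illustration, and sc(w, M) is assumed to be the sum of the per-position scores of w under M.

```python
from typing import Dict, List

# Hypothetical representation: one dict per PSSM position,
# mapping each nucleotide to its score at that position.
PSSM = List[Dict[str, float]]

# Watson-Crick complement of each nucleotide.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def score(w: str, M: PSSM) -> float:
    """sc(w, M), assumed here to be the sum of per-position scores."""
    if len(w) != len(M):
        raise ValueError("sequence and PSSM must have the same length")
    return sum(column[ch] for ch, column in zip(w, M))


def sc_min(M: PSSM) -> float:
    """Smallest score any sequence of length m can reach under M."""
    return sum(min(column.values()) for column in M)


def sc_max(M: PSSM) -> float:
    """Largest score any sequence of length m can reach under M."""
    return sum(max(column.values()) for column in M)


def mss(w: str, M: PSSM) -> float:
    """Matrix similarity score: sc(w, M) rescaled to the range [0, 1]."""
    lo, hi = sc_min(M), sc_max(M)
    return (score(w, M) - lo) / (hi - lo)


def threshold_from_mss(cutoff: float, M: PSSM) -> float:
    """Raw score threshold th that corresponds to an MSS cutoff."""
    lo, hi = sc_min(M), sc_max(M)
    return cutoff * (hi - lo) + lo


def reverse_complement_pssm(M: PSSM) -> PSSM:
    """Build M-bar with M-bar(i, a) = M(m - 1 - i, complement(a))."""
    m = len(M)
    return [
        {a: M[m - 1 - i][COMPLEMENT[a]] for a in COMPLEMENT}
        for i in range(m)
    ]


def reverse_complement(w: str) -> str:
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[ch] for ch in reversed(w))


# Toy 3-position PSSM (made-up scores, for illustration only).
TOY_PSSM: PSSM = [
    {"A": 2, "C": -1, "G": 0, "T": -2},
    {"A": -1, "C": 3, "G": -2, "T": 0},
    {"A": 0, "C": -2, "G": 1, "T": -1},
]

if __name__ == "__main__":
    th = threshold_from_mss(0.75, TOY_PSSM)
    m_bar = reverse_complement_pssm(TOY_PSSM)
    print("threshold for MSS cutoff 0.75:", th)
    print("score of CGT under M-bar:", score("CGT", m_bar))
```

In this sketch, scoring a forward-strand substring w with the reverse-complement PSSM gives the same value as scoring the reverse complement of w with the original PSSM, which is why a single index over the forward strand is enough to find matches on both strands.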