Fast index based algorithms and software for matching position specific scoring matrices

Beckstette, Michael; Homann, Robert; Giegerich, Robert; Kurtz, Stefan

doi:10.1186/1471-2105-7-389

BMC Bioinformatics

Table 1 Performed experiments and experimental input.

	Exp. 1	Exp. 2	Exp. 3	Exp. 4	Exp. 5	Exp. 6
# searched sequences	59,021	30,964	19,111	1 (H.s. Chr. 6)	19,111	19,111
total length	20.2 MB	37.2 MB	4.3 MB	162.9 MB	4.3 MB	4.3 MB
sequence source	see [13]	DBTSS 5.1	RCSB PDB	Sanger V1.4	RCSB PDB	RCSB PDB
sequence type/PSSM type	protein	DNA	protein	DNA	protein	protein
# PSSMs	4,034	220	11,411	576	28,337	10,931
PSSM source	see [13]	MatInspector	PRINTS 38	TRANSFAC Prof. 6.2	BLOCKS 14.1	PRINTS 38
avg. length of PSSMs	29.74	14.21	17.32	13.33	26.3	17.37
index construction (sec)	41	146	10.2	586	10.2	10.2
mdc (sec)	1960	-	1486	-	11871	1486
MatInspector		x
FingerPrintScan			x
Blimps					x
DN00	x
LAsearch	x	x	x	x	x
ESAsearch	x	x	x	x	x	x
ESAsearch (reduced $A$)						x

Overview of the sequences and PSSMs used in the experiments. For the experiments that use p-value or E-value cutoffs, we precomputed the cumulative score distributions and stored them on file; mdc is the time needed for this task. In Experiment 1 we measured the running time of the Java program from [13], referred to as DN00. We ran DN00 with a maximum of 2 GB of memory assigned to the Java virtual machine. DN00 constructs the suffix tree in main memory and then performs the searches. For a fair comparison, we therefore measured both the total running time and the time for matching the PSSMs alone (without suffix tree construction). For Experiment 2, we implemented the matrix similarity scoring scheme (MSS) of MatInspector and matched the PSSMs against both strands of the DNA sequences with different MSS cutoff values. The MSS of a PSSM $M$ of length $m$ and a sequence $w \in A^m$ is defined as $MSS = \frac{sc(w, M) - sc_{\min}(M)}{sc_{\max}(M) - sc_{\min}(M)}$. Hence, given an MSS cutoff value, the threshold $th$ is determined as $th = MSS \cdot (sc_{\max}(M) - sc_{\min}(M)) + sc_{\min}(M)$. Instead of searching the reverse strand, we use the reverse complement $\bar{M}$ of the PSSM $M$, defined by $\bar{M}(i, a) = M(m - 1 - i, \bar{a})$ for all $i \in [0, m - 1]$ and $a \in A$, where $\bar{a}$ is the Watson-Crick complement of nucleotide $a$. This allows the same enhanced suffix array to be used for both strands. In Experiment 5 we used a Perl-based wrapper for the Blimps program shipped with the BLIMPS distribution to do bulk sequence searches. The overhead of the Perl interpreter call was found to be negligible. For Experiment 6 we used the reduced alphabets given in Figure 8. The last seven rows show which programs were used in which experiment.
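
The MSS-to-threshold conversion and the reverse-complement PSSM described in the caption can be illustrated with a minimal Python sketch. It is not the ESAsearch or MatInspector implementation. The representation of a PSSM as a list of per-position dictionaries, the function names, and the matrix `toy_pssm` are assumptions made for illustration only, and the sketch assumes sc_max(M) > sc_min(M).

```python
# Illustrative sketch only; names and the toy matrix are hypothetical.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def score(w, M):
    """sc(w, M): sum of M(i, w[i]) over all positions i."""
    return sum(M[i][a] for i, a in enumerate(w))

def sc_min(M):
    """Smallest score any sequence of length m can reach."""
    return sum(min(row.values()) for row in M)

def sc_max(M):
    """Largest score any sequence of length m can reach."""
    return sum(max(row.values()) for row in M)

def mss(w, M):
    """Matrix similarity score, rescaling sc(w, M) to [0, 1]."""
    lo, hi = sc_min(M), sc_max(M)
    return (score(w, M) - lo) / (hi - lo)

def threshold_from_mss(cutoff, M):
    """th = MSS * (sc_max(M) - sc_min(M)) + sc_min(M)."""
    lo, hi = sc_min(M), sc_max(M)
    return cutoff * (hi - lo) + lo

def reverse_complement_pssm(M):
    """M_bar(i, a) = M(m - 1 - i, complement(a))."""
    m = len(M)
    return [{a: M[m - 1 - i][COMPLEMENT[a]] for a in COMPLEMENT}
            for i in range(m)]

def reverse_complement(w):
    """Reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[a] for a in reversed(w))

# Hypothetical 3-column PSSM with integer scores (not taken from the paper).
toy_pssm = [
    {"A": 2, "C": -1, "G": 0, "T": -3},
    {"A": -2, "C": 3, "G": 1, "T": 0},
    {"A": 0, "C": -1, "G": 4, "T": -2},
]

th = threshold_from_mss(0.75, toy_pssm)
# Matching M_bar on the forward strand scores the reverse strand under M.
same = (score("TGA", reverse_complement_pssm(toy_pssm))
        == score(reverse_complement("TGA"), toy_pssm))
```

Because scoring a forward-strand substring with $\bar{M}$ gives the same value as scoring its reverse complement with $M$, a single index over the forward strand suffices for both strands, which is the point made in the caption.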

ISSN: 1471-2105
