In the following subsections, we study the parameterized tractability of CSWO and show the problem is sensitive to different parameterizations.
CSWO: Tractability Results
We first consider the case where Σ is a parameter. In computational biology applications the sequences of interest are typically DNA or protein sequences, so the number of distinct symbols is a small constant (4 for DNA and 20 for proteins). Restricting Σ alone does not make CSWO tractable, since the problem is NP-hard even when the alphabet is binary. However, if Σ and ℓ are both parameters then CSWO is fixed-parameter tractable: we can enumerate and check all $|\Sigma|^{\ell}$ possible center strings. As a result, the problem is fixed parameter tractable with the combined parameters Σ, ℓ, d and n*. We will prove in a later section that Σ must be a parameter in order to obtain this tractability.
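The enumeration argument can be illustrated with a minimal sketch. The function names `hamming` and `brute_force_cswo` are hypothetical and chosen for this illustration only; the code simply checks every candidate center over the alphabet and is practical only when both the alphabet and ℓ are small.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which equal-length sequences a and b differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def brute_force_cswo(strings, alphabet, d, n_star):
    """Try every candidate center over `alphabet` (|alphabet|**length of them).

    Returns the first candidate within distance d of at least n_star input
    strings, or None if no such candidate exists.
    """
    length = len(strings[0])
    for candidate in product(alphabet, repeat=length):
        close = sum(1 for s in strings if hamming(candidate, s) <= d)
        if close >= n_star:
            return "".join(candidate)
    return None
```

For example, `brute_force_cswo(["AAC", "AAG", "TTT"], "ACGT", 1, 2)` examines at most 4^3 = 64 candidates.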
Next we show that CSWO is fixed parameter tractable if d and k are parameters. The fixed parameter algorithm that we present is similar to the algorithm of Gramm et al. [13], who proved that CLOSEST STRING is fixed parameter tractable with respect to the parameter d. At each recursive step, the algorithm of Gramm et al. [13] selects a string s that has Hamming distance at least d + 1 from the current candidate center string x, if one exists; otherwise x is returned, since it is a center string. Among any d + 1 positions where x and s disagree, there is at least one position at which s is equal to the final solution. The algorithm therefore tries each of the d + 1 positions in turn: it changes x to s at that position, reduces Δd by one, and calls itself recursively. Here Δd is the current degeneracy parameter at a particular recursive call, that is, the number of position changes still allowed, and x is the current candidate center string. Since the recursion stops after at most d steps, the size of the search tree is bounded by $O((d + 1)^d)$.
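The bounded search tree described above can be sketched as follows. This is an illustrative reading of the description in this section, not the implementation of Gramm et al. [13]; the names `closest_string` and `search` are hypothetical, and the sketch omits any pruning the original algorithm may use.

```python
def hamming(a, b):
    """Number of positions at which equal-length strings a and b differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def closest_string(strings, d):
    """Return a string within Hamming distance d of every input string, or None."""

    def search(x, budget):
        # budget plays the role of the remaining degeneracy parameter.
        if budget < 0:
            return None
        far = next((s for s in strings if hamming(x, s) > d), None)
        if far is None:
            return x  # x is already a center string
        differing = [p for p in range(len(x)) if x[p] != far[p]]
        # Any d + 1 differing positions contain one where `far` agrees
        # with a solution, so it suffices to branch on d + 1 of them.
        for p in differing[:d + 1]:
            moved = x[:p] + far[p] + x[p + 1:]
            found = search(moved, budget - 1)
            if found is not None:
                return found
        return None

    # Start from the first input string with a budget of d changes.
    return search(strings[0], d)
```

Each node of the recursion has at most d + 1 children and the depth is at most d, which matches the search tree bound stated above.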
CSWO Algorithm
Input: A CSWO instance with a set S of n strings of length ℓ, parameters Δd, d and k, and a candidate string x.
Output: A string s* if there exists a subset S* of S of at least n* strings where each string in S* has distance at most d from s*, and “not found” otherwise.
1. If Δd < 0 or k < 0 then return “not found”.
2. Choose i ∈ {1,…, n} such that $d(x, s_i) > d$. If no such i exists return x.
3. $s_{ret}$ = CSWO Algorithm ($S \setminus \{s_i\}$, Δd, k – 1, x).
4. If $s_{ret}$ = “not found” then:
   (a) $P = \{p \mid x[p] \neq s_i[p]\}$;
   (b) Choose any P′ ⊆ P with |P′| = d + 1.
   (c) For each position p ∈ P′:
       - Let x′ be the string obtained from x by setting position p equal to $s_i[p]$.
       - $s_{ret}$ = CSWO Algorithm (S, Δd – 1, k, x′).
       - If $s_{ret}$ ≠ “not found”, then return $s_{ret}$.
5. Return “not found”.
Our algorithm begins with $s_1$ as the candidate center string. If $s_1$ is a center string with respect to S then we are done; otherwise there exists a string $s_i$ that has distance at least d + 1 from $s_1$. We “guess” whether $s_i$ belongs to the set of outliers. If it is an outlier then we remove it from S and recurse on the smaller set with k – 1. If it is not an outlier then we use $s_i$ to move the candidate string x closer to $s_i$, which can be done by applying the methodology of Gramm et al. [13]. The term “guess” is a shorthand in this brief description: the algorithm does not actually guess but tries both possibilities, as can be seen in the CSWO Algorithm. This increases the size of the search tree.
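The two-way branching (outlier or not) can be sketched in code as below. The names `cswo_search` and `cswo` are hypothetical. One detail is an assumption of this sketch and is not taken from the text: the wrapper `cswo` tries each of the first k + 1 input strings as the starting candidate, as a safeguard for the case where the first string is itself an outlier.

```python
def hamming(a, b):
    """Number of positions at which equal-length strings a and b differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def cswo_search(active, x, budget, outliers_left, d):
    """Recursive search following steps 1-5 of the CSWO Algorithm."""
    if budget < 0 or outliers_left < 0:                      # step 1
        return None
    far = next((s for s in active if hamming(x, s) > d), None)  # step 2
    if far is None:
        return x
    # Step 3: try the subcase where `far` is an outlier and drop it.
    rest = list(active)
    rest.remove(far)
    found = cswo_search(rest, x, budget, outliers_left - 1, d)
    if found is not None:
        return found
    # Step 4: otherwise move x toward `far` at one of d + 1 differing positions.
    differing = [p for p in range(len(x)) if x[p] != far[p]]
    for p in differing[:d + 1]:
        moved = x[:p] + far[p] + x[p + 1:]
        found = cswo_search(active, moved, budget - 1, outliers_left, d)
        if found is not None:
            return found
    return None                                               # step 5


def cswo(strings, d, k):
    """Return a center within distance d of all but at most k strings, or None.

    Assumption of this sketch (not in the source): besides the first string,
    the next k strings are also tried as starting candidates.
    """
    for start in strings[:k + 1]:
        found = cswo_search(list(strings), start, d, k, d)
        if found is not None:
            return found
    return None
```

For instance, `cswo(["TTTT", "AAAA", "AAAA"], 0, 1)` finds a center only from the second starting candidate, which is why the wrapper tries more than one start.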
Proposition 1 The CSWO Algorithm solves the CSWO problem in time $O(n\ell + nd \cdot d^d \cdot 2^{k+d})$.
Proof. Running time. Each recursive call of the algorithm reduces either k or Δd by 1. Thus, there are at most k + d guesses of whether a particular string belongs to the set of outliers. The search tree size is therefore increased by a multiplicative factor of at most $2^{k+d}$, and is bounded above by $O(2^{k+d} \cdot (d + 1)^d)$. The analysis of Gramm et al. [13] demonstrated that each recursive step takes time O(nd) and the preprocessing takes time O(nℓ); therefore, we obtain an overall running time of $O(n\ell + nd \cdot d^d \cdot 2^{k+d})$.
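The passage from the search tree bound with base d + 1 to the stated running time with base d can be made explicit. The following worked bound uses only the quantities above and the standard inequality $(1 + 1/d)^d \le e$; the name $T$ for the total running time is introduced here for illustration.

```latex
\begin{align*}
(d+1)^d &= d^d \left(1 + \tfrac{1}{d}\right)^{d} \le e \cdot d^d = O(d^d), \\
T(n, \ell, d, k) &= O(n\ell) + O(nd) \cdot O\!\left(2^{k+d} (d+1)^d\right)
                  = O\!\left(n\ell + nd \cdot d^d \cdot 2^{k+d}\right).
\end{align*}
```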
Correctness. We show the correctness of the first recursive step; the correctness of the algorithm then follows by applying the same argument inductively. Clearly, if S does not contain a subset S* of n* strings for which a center string s* exists, then “not found” will be returned; we therefore assume otherwise.
If $s_1$ is a center string for S then the algorithm immediately halts, so we assume there exists a string $s_i$ in S that does not have $s_1$ as a center string. The CSWO Algorithm creates two subcases: one where $s_i$ is in the set of outliers, and another where $s_i$ is not. If $s_i$ is in the set of outliers, then the first case will successfully remove $s_i$ from the set and recurse on $S \setminus \{s_i\}$. Otherwise, if $s_i$ is not in the set of outliers, then eventually the second case will be reached. We call a position p correct if $s_1[p] \neq s^*[p] = s_i[p]$. It follows from Gramm et al. [13] that one of the d + 1 chosen positions p will be a correct one. Thus, we have shown that one of the subcases will lead to a smaller subcase containing the solution for S.
The previous result demonstrates fixed parameter tractability with respect to d and k. We note that a similar modification of the $O(n|\Sigma|^{O(d)})$ algorithm of Ma and Sun [16] also gives a fixed parameter algorithm with respect to the parameters Σ, d and k. In the modified algorithm, for any string s with distance greater than d to the current candidate center string x, we again try both subcases: s is an outlier, and s is not an outlier. In the former case, we remove s from the set of input strings S and recurse on S and k – 1; in the latter case, we use the same technique as in the algorithm of Ma and Sun [16] to reduce the distance between x and the final solution. This modification, which accounts for the outliers, results in an extra multiplicative factor of $O(2^{k + \log d})$ in the running time of the original algorithm. Although this algorithm improves upon the running time of the previous result, it requires that Σ also be a parameter. Further, we note that some of the recent improvements [20–22] to the algorithm of Ma and Sun can be modified in a similar manner to obtain fixed parameter algorithms for CSWO with respect to parameters Σ, d and k.
Proposition 2 CSWO is fixed parameter tractable for parameters Σ and n.
Proof. Gramm et al. [13] gave a linear fixed parameter tractable algorithm for CLOSEST STRING with respect to the number of strings and Σ; we refer to this algorithm as ILP-procedure(S), where S is the set of input strings. Our algorithm enumerates all size-n* subsets of S and calls ILP-procedure on each subset. Since S has $\binom{n}{n^*} \le 2^n$ subsets of size n*, the number of calls depends only on the parameter n.
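The subset enumeration in this proof can be sketched as follows. The ILP-procedure of Gramm et al. [13] is not reproduced here; `exhaustive_center` is a hypothetical stand-in that solves CLOSEST STRING by exhaustive search, and any other CLOSEST STRING solver with the same call shape could be passed as `solve`. The names and the explicit `alphabet` and `d` arguments are assumptions of this sketch.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which equal-length sequences a and b differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def exhaustive_center(subset, alphabet, d):
    """Hypothetical stand-in for ILP-procedure: exhaustive CLOSEST STRING search."""
    length = len(subset[0])
    for candidate in product(alphabet, repeat=length):
        if all(hamming(candidate, s) <= d for s in subset):
            return "".join(candidate)
    return None


def cswo_by_subsets(strings, alphabet, d, n_star, solve=exhaustive_center):
    """Call `solve` on every size-n_star subset; return (center, subset) or None."""
    for subset in combinations(strings, n_star):
        center = solve(subset, alphabet, d)
        if center is not None:
            return center, subset
    return None
```

The outer loop runs over at most 2^n subsets, so the overall cost is that number times the cost of one call to the CLOSEST STRING solver.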
CSWO: Intractability Results
We derive the W[1]-hardness result through a series of intermediate steps, giving a reduction from CLIQUE to CSWO which shows that CSWO is W[1]-hard for the combination of ℓ, d, and n* when the alphabet is unbounded.
Reduction from CLIQUE
As previously described, we let the CLIQUE instance be given by an undirected graph G = (V, E) with a set $V = \{v_1, v_2, \ldots, v_n\}$ of n vertices, a set E of m edges, and a positive integer t denoting the size of the desired clique. We describe how to generate a set S of $\binom{t}{2}|E|$ strings such that G has a clique of size t if and only if there is a subset of S of size $\binom{t}{2}$, denoted S*, for which there exists a string x such that $d(s_i, x) \le d$ for all $s_i \in S^*$. We let ℓ = t and d = t – 2. We assume that t > 2 since t ≤ 2 produces trivial cases.
We begin by describing the alphabet. We allow |Σ| to be unbounded and we let Σ be equal to the union of the following sets of symbols:
1. $\{v_i \mid i = 1, \ldots, |V|\}$. Hence, there exists one symbol representing each vertex in G.
2. $\{c_{i,j,m} \mid i = 1, \ldots, t;\ j = 1, \ldots, t;\ m = 1, \ldots, |E|\}$. There exists a unique symbol for each of the strings produced for our reduction.
Hence, the alphabet consists of the |V| vertex symbols together with the symbols $c_{i,j,m}$.
Next, we generate a set of $\binom{t}{2}|E|$ strings $S = \{s_{1,2,1}, \ldots, s_{1,2,|E|}, s_{1,3,1}, \ldots, s_{1,3,|E|}, \ldots, s_{t-1,t,|E|}\}$. Every string has length t and encodes one edge of the input graph. There are $\binom{t}{2}$ strings corresponding to each edge, which encode the edge in different positions. For string $s_{i,j,m}$ we encode edge $e_m = (v_r, v_s)$, where 1 ≤ r < s ≤ |V|, by letting position i equal $v_r$, position j equal $v_s$, and the remaining positions equal $c_{i,j,m}$. Hence, a string is given by
$s_{i,j,m} := [c_{i,j,m}]^{i-1}\, v_r\, [c_{i,j,m}]^{j-i-1}\, v_s\, [c_{i,j,m}]^{t-j}$.
To clarify our reduction, we give an example. Let G = (V, E) be an undirected graph with $V = \{v_1, v_2, v_3, v_4\}$ and edges $E = \{(v_1, v_2), (v_1, v_3), (v_1, v_4), (v_2, v_3)\}$, and let our CLIQUE instance have G and t = 3. Figure 1 illustrates the reduction. Using G, we exhibit the above construction of $\binom{3}{2} \cdot 4 = 12$ strings, which we denote as S. We claim that there exists a clique of size 3 if and only if there exists a string s of length ℓ = t = 3 and a subset S* of S of size 3 where $d(s, s_i) \le d$ for all $s_i \in S^*$. In this example the center string s is equal to $v_1 v_2 v_3$ and $S^* = \{v_1 v_2 c_{1,2,1},\ v_1 c_{1,3,2} v_3,\ c_{2,3,4} v_2 v_3\}$; each string in S* has Hamming distance at most 1 from s.
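The construction and the example can be reproduced with a short sketch. The function name `build_cswo_instance`, the representation of strings as tuples of symbol names, and the spelling of the filler symbols as `"c1,2,1"` are assumptions made for this illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which equal-length sequences a and b differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def build_cswo_instance(edges, t):
    """Build the strings s_{i,j,m} for all 1 <= i < j <= t and all edges e_m.

    `edges` is a list of (r, s) vertex-index pairs with r < s; edge e_m is
    edges[m - 1]. Returns a dict mapping (i, j, m) to a tuple of t symbols.
    """
    strings = {}
    for i, j in combinations(range(1, t + 1), 2):
        for m, (r, s) in enumerate(edges, start=1):
            row = [f"c{i},{j},{m}"] * t   # filler symbol unique to this string
            row[i - 1] = f"v{r}"          # position i encodes the first endpoint
            row[j - 1] = f"v{s}"          # position j encodes the second endpoint
            strings[(i, j, m)] = tuple(row)
    return strings


# The example graph from the text, with t = 3.
example_edges = [(1, 2), (1, 3), (1, 4), (2, 3)]
example_strings = build_cswo_instance(example_edges, 3)
example_center = ("v1", "v2", "v3")
example_close = sorted(
    key for key, row in example_strings.items()
    if hamming(example_center, row) <= 1
)
```

Here `example_close` lists the index triples of the strings within distance d = 1 of the center, which are the three strings named in the example.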
Correctness of the Reduction
The following two lemmas establish the correctness of the reduction.
Lemma 1 For a graph with a t-clique, the construction described above produces a CSWO instance with a set S* and a string s of length ℓ such that $d(s_i, s) \le d$ for every $s_i \in S^*$.
Proof. Let the input graph have a clique of size t. Let $v_{\alpha_1}, v_{\alpha_2}, \ldots, v_{\alpha_t}$ be the vertices in the clique C of size t and, without loss of generality, assume $\alpha_1 < \alpha_2 < \ldots < \alpha_t$. We claim that there exists a subset of $\binom{t}{2}$ strings that have distance at most t – 2 from the string $s = v_{\alpha_1} v_{\alpha_2} \ldots v_{\alpha_t}$. Consider the first edge $(v_{\alpha_1}, v_{\alpha_2})$ of the clique. It follows that the string $s_{1,2,r} = v_{\alpha_1} v_{\alpha_2} [c_{1,2,r}]^{t-2}$, where edge r has endpoints $v_{\alpha_1}$ and $v_{\alpha_2}$, is contained in the set of strings $\{s_{1,2,1}, s_{1,2,2}, \ldots, s_{1,2,|E|}\}$. Clearly, $d(s_{1,2,r}, s) = t - 2$. In the same way, for each edge $(v_{\alpha_i}, v_{\alpha_j})$ in C with i < j there is a string in S with $v_{\alpha_i}$ at position i and $v_{\alpha_j}$ at position j, which therefore has distance at most t – 2 from s, and our lemma follows from this construction.
For the reverse direction, we need to prove that the existence of a subset S* of S of size $\binom{t}{2}$ and a string s where $d(s, s_i) \le t - 2$ for all $s_i \in S^*$ implies the existence of a clique in G with t vertices.
Lemma 2 The t symbols of the center string correspond to the t vertices of a clique in the input graph.
Proof. Let S* be the subset of S of size $\binom{t}{2}$ such that s has distance at most t – 2 from each string in S*. Since ℓ = t, $n^* = \binom{t}{2}$, d = t – 2, and each symbol $c_{i,j,m}$ (i = 1, …, t, j = 1, …, t and m = 1, …, |E|) occurs in only a single string, it follows from the pigeonhole principle that the center string s only contains symbols from $\{v_i \mid i = 1, \ldots, |V|\}$. Without loss of generality assume s is equal to $v_{\alpha_1} v_{\alpha_2} \ldots v_{\alpha_t}$ for $\alpha_1, \alpha_2, \ldots, \alpha_t \in \{1, \ldots, |V|\}$. Consider any pair $\alpha_i, \alpha_j$ for 1 ≤ i < j ≤ t and consider the set of strings $S_{i,j} = \{s_{i,j,1}, s_{i,j,2}, \ldots, s_{i,j,|E|}\}$. Recall that $S_{i,j}$ contains, for each edge $e_m = (v_r, v_s)$ in E, a string that has $v_r$ at the i th position, $v_s$ at the j th position, and $c_{i,j,m}$ at all remaining positions. Therefore, we can only find a string in $S_{i,j}$ that has distance at most t – 2 from s if it has $v_{\alpha_i}$ at the i th position and $v_{\alpha_j}$ at the j th position; and such a string exists if and only if there is an edge in G connecting $v_{\alpha_i}$ to $v_{\alpha_j}$. Hence, the center string s implies that there exists an edge between any pair of vertices of G in the set $\{v_{\alpha_1}, v_{\alpha_2}, \ldots, v_{\alpha_t}\}$, and by definition these vertices form a clique.
Our main theorem follows directly from Lemma 1 and Lemma 2. We note that the hardness for the combination of all three parameters also implies the hardness for each subset of the three.
Theorem 1 CSWO with unbounded alphabet is W[1]-hard with respect to the parameters ℓ, d, and n*.
Since there exists a trivial reduction from the CLOSEST STRING problem to CSWO (i.e. simply set k = 0 in CSWO), there cannot exist a fixed parameter tractable algorithm for CSWO with k as a parameter, unless P = NP; such an algorithm would contradict the NP-hardness of CLOSEST STRING.
Fact 1 CSWO is W[1]-hard with respect to the parameter k and when |Σ| ≤ 2, unless P = NP.