Closest string with outliers

Boucher, Christina; Ma, Bin

doi:10.1186/1471-2105-12-S1-S55

Volume 12 Supplement 1

Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011)

Research
Open access
Published: 15 February 2011

Christina Boucher¹ &
Bin Ma¹

BMC Bioinformatics volume 12, Article number: S55 (2011)

Abstract

Background

Given n strings s₁, …, s_n each of length ℓ and a nonnegative integer d, the CLOSEST STRING problem asks to find a center string s such that none of the input strings has Hamming distance greater than d from s. Finding a common pattern in many – but not necessarily all – input strings is an important task that plays a role in many applications in bioinformatics.

Results

Although the closest string model is robust to the oversampling of strings in the input, it is severely affected by the existence of outliers. We propose a refined model, the CLOSEST STRING WITH OUTLIERS (CSWO) problem, to overcome this limitation. This new model asks for a center string s that is within Hamming distance d to at least n – k of the n input strings, where k is a parameter describing the maximum number of outliers. A CSWO solution not only provides the center string as a representative for the set of strings but also reveals the outliers of the set.

We provide fixed parameter algorithms for CSWO when d and k are parameters, for both bounded and unbounded alphabets. We also show that when the alphabet is unbounded the problem is W[1]-hard with respect to n – k, ℓ, and d.

Conclusions

Our refined model abstractly models finding common patterns in several but not all input strings. We initiate the study of the computability of this model and show that it is sensitive to different parameterizations. We conclude by suggesting several open problems which warrant further investigation.

Background

Finding similar regions in multiple DNA, RNA, or protein sequences plays an important role in many applications, including universal PCR primer design [1–4], genetic probe design [2], antisense drug design [2, 5], finding transcription factor binding sites in genomic data [6], determining an unbiased consensus of a protein family [7], and motif-recognition [2, 8, 9]. The CLOSEST STRING problem formalizes these tasks and can be defined as follows: given a set of n strings S of length ℓ over the alphabet Σ and parameter d, the aim is to determine if there exists a string s that has Hamming distance at most d from each string in S. The optimization version of this problem tries to minimize the parameter d. We refer to s as the center string and let d(x, y) be the Hamming distance between strings x and y.
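The definitions above can be illustrated with a short sketch. The function names `hamming` and `is_center` and the example strings are illustrative assumptions, not taken from the paper.

```python
def hamming(x: str, y: str) -> int:
    """Hamming distance d(x, y) between two strings of equal length."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))


def is_center(s: str, strings: list[str], d: int) -> bool:
    """True if s has Hamming distance at most d from every input string."""
    return all(hamming(s, t) <= d for t in strings)


# Hypothetical input: three length-4 DNA strings.
example = ["ACGT", "ACGA", "TCGT"]
print(is_center("ACGT", example, 1))
print(is_center("ACGT", example, 0))
```

This only verifies a given candidate; finding a center string is the hard part of the problem.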

The CLOSEST STRING problem was first introduced and studied in the context of bioinformatics by Lanctot et al. [2]. Frances and Litman [10] showed the problem to be NP-complete even in the special case when the alphabet is binary, implying that it has no polynomial-time algorithm unless P = NP. Since its introduction, efficient polynomial-time approximation algorithms and exact exponential-time algorithms for the CLOSEST STRING problem have been studied thoroughly [2, 11–16].

The CLOSEST STRING problem requires that the Hamming distance constraint be satisfied for each of the input strings and is therefore robust to the oversampling of the input strings. For this reason it is frequently used to model many of the aforementioned applications. However, this property also causes a severe problem: if the input includes a string that is significantly different from the other input strings, which we refer to as an “outlier”, then no center string may exist for the complete set of input strings unless d is increased dramatically to account for this string. This is a significant limitation for applications such as the design of universal primers, where a small d is crucial for the effectiveness of the primers. In this and many other applications, it would be preferable to determine a “good” center string (i.e. one that is reasonably close to each of the strings) for a large portion of the input strings rather than to find a center string for the complete set that is far from many or all of the strings. Hence, we aim to model the task of finding a center string that is within distance d of most, but not necessarily all, of the input strings, where d is reasonably small. Another compelling consequence of this modification of the model is that, in situations where a more satisfying solution can be found by regarding a few strings as outliers, the initial decision to include them can be reexamined.

We formally model this problem as follows:

CLOSEST STRING WITH OUTLIERS (CSWO)

INPUT: A set of n length-ℓ strings S = {s₁,…, s_n} over a finite alphabet Σ and nonnegative integers k and d.

QUESTION: Find a center string s and a subset S* ⊆ S such that |S*| = n – k and d(s, t) ≤ d for all t ∈ S*.

For the rest of the paper we denote n – k by n*, and let s_i[p] be the symbol at position p of string s_i.
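As a minimal sketch of the problem statement, the following checks whether a candidate string is a valid CSWO center and reports the outliers it implies. The helper names and the example instance are assumptions made for illustration.

```python
def hamming(x: str, y: str) -> int:
    """Hamming distance between two strings of equal length."""
    return sum(a != b for a, b in zip(x, y))


def split_by_distance(s: str, strings: list[str], d: int):
    """Partition the input into strings within distance d of s and the rest."""
    close = [t for t in strings if hamming(s, t) <= d]
    far = [t for t in strings if hamming(s, t) > d]
    return close, far


def is_cswo_center(s: str, strings: list[str], d: int, k: int) -> bool:
    """True if s is within distance d of at least n* = n - k input strings."""
    close, _ = split_by_distance(s, strings, d)
    return len(close) >= len(strings) - k


# Hypothetical instance: three similar strings and one outlier.
S = ["AAAA", "AAAT", "AATA", "GGGG"]
print(is_cswo_center("AAAA", S, d=1, k=1))
print(split_by_distance("AAAA", S, d=1)[1])  # the strings treated as outliers
```

When more than n* strings are within distance d, any n* of them can serve as S*.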

There exists a simple reduction from the CLOSEST STRING problem to CSWO that demonstrates that CSWO is NP-complete even in the special case where the alphabet is binary and k = 0, implying that it cannot be solved exactly by a polynomial-time algorithm unless P = NP. One approach to investigating the computational intractability of CSWO is to consider its parameterized complexity, which aims to classify computationally hard problems according to their inherent difficulty with respect to multiple parameters of the input. If a problem is solvable by an algorithm that is polynomial in the input size and exponential only in parameters that are typically small, then it can still be considered tractable in some practical sense.

For unbounded alphabet size, we show that CSWO is W[1]-hard for every combination of the parameters ℓ, d, and n* and thus, is fixed parameter intractable when parameterized by any subset of these parameters, unless FPT = W[1]. We also show that when the alphabet is unbounded, there exists a fixed parameter tractable algorithm for CSWO with respect to the parameters d and k. In the case of constant size alphabet, CSWO is fixed parameter tractable for the parameter n but intractable for the parameter k. The complexity of the problem remains open when parameterized by d and the alphabet is of constant size, and when parameterized by n* and k.

Previous Results

It is worth noting that analogous parameterized complexity studies have been performed for the CLOSEST STRING problem and the CLOSEST SUBSTRING problem. Gramm et al. [13] demonstrated that the CLOSEST STRING problem is FPT when parameterized by the number of strings. This FPT result is based on an integer linear programming formulation with a constant number of variables (assuming n is fixed), and the application of the result of Lenstra [19], which proves that integer linear programming is polynomial-time solvable when the number of variables remains fixed. They further demonstrated that the problem is FPT when d is a parameter by giving an O(nℓ + nd(d + 1)^d) time algorithm [13]. Ma and Sun gave an O(n|Σ|^{O(d)}) algorithm, which is a polynomial-time algorithm when d = O(log n) and Σ has constant size [16]. Chen et al. [20], Wang and Zhu [21], and Zhao and Zhang [22] improved upon the fixed parameter tractable result of Ma and Sun [16].

The CLOSEST SUBSTRING problem seems to be inherently more intractable than the CLOSEST STRING problem. Given n strings s₁, s₂,…, s_n over alphabet Σ and integers d and ℓ, the CLOSEST SUBSTRING problem aims to determine whether there is a string s of length ℓ such that, for all i = 1,…, n, d(s, s_i′) ≤ d, where s_i′ is a length-ℓ substring of s_i. Fellows et al. [11] showed that CLOSEST SUBSTRING is W[1]-hard with respect to the number of input strings n even for a binary alphabet. When Σ is unbounded the problem is W[1]-hard with respect to the parameters ℓ, d and n [11]. Most recently, Marx [23] proved that the problem is W[1]-hard with combined parameters n and d even if the alphabet is binary, which resolved an open problem stated in [11, 12, 24].

Methods

We give insight into the computational tractability of CSWO through studying the parameterized complexity of the problem. Parameterized complexity aims to classify problems according to their inherent difficulty with respect to multiple parameters of the input.

Parameterized Complexity

A problem φ is said to be fixed parameter tractable with respect to parameter k if there exists an algorithm that solves φ in f(k) · n^{O(1)} time, where n is the input size and f is a function of k that is independent of n [17]. For example, given a graph G = (V, E) with vertex set V, edge set E, and positive integer k, the VERTEX COVER problem asks whether there is a subset of vertices V_c ⊆ V with k or fewer vertices such that each edge in E has at least one of its endpoints in V_c. The VERTEX COVER problem is NP-complete [18] but is fixed parameter tractable since there exist algorithms with running time O(kn + 1.3^k) [17]. The corresponding complexity class is FPT.
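To make the notion of fixed parameter tractability concrete, the following is a minimal sketch of the textbook bounded search tree for VERTEX COVER. It is the simple branching scheme with a search tree of size at most 2^k, not the O(kn + 1.3^k) algorithm cited above; the function name and edge-list representation are illustrative assumptions.

```python
def vertex_cover(edges, k):
    """Return a vertex cover with at most k vertices as a set, or None.

    Branches on an arbitrary uncovered edge (u, v): any cover must
    contain u or v, so the recursion depth is at most k and the search
    tree has at most 2^k leaves.
    """
    if not edges:
        return set()          # nothing left to cover
    if k == 0:
        return None           # edges remain but the budget is used up
    u, v = edges[0]
    for w in (u, v):
        remaining = [e for e in edges if w not in e]
        cover = vertex_cover(remaining, k - 1)
        if cover is not None:
            return cover | {w}
    return None


# Hypothetical example: the path 1 - 2 - 3 - 4.
path = [(1, 2), (2, 3), (3, 4)]
print(vertex_cover(path, 2))
print(vertex_cover(path, 1))
```

The exponential part of the running time depends only on k, while the work per node is polynomial in the number of edges.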

Not all NP-complete problems are known to be in FPT. For example, consider the NP-complete CLIQUE problem: given an undirected graph G = (V, E) and a positive integer t, the aim is to determine whether there is a subset of vertices C ⊆ V of size at least t where each pair of vertices in C is connected by an edge. The best known algorithms for solving CLIQUE run in time n^{O(t)}; hence, there is no known algorithm for CLIQUE whose running time does not have t in the exponent of n [17].

In order to characterize those problems that do not seem to admit a fixed parameter efficient algorithm, Downey and Fellows [17] defined a fixed parameter reduction. We will restrict interest to the W[1] class and hence, the following definition will only apply to W[1]-hardness. W[1]-hardness gives convincing evidence that a parameterized problem with parameter k is unlikely to have an algorithm that has running time of the form f(k) · n^{O(1)}. Let L, L′ ⊆ Σ* × ℕ be two parameterized languages; then L reduces to L′ if there are functions k → k′ and k → k″ from ℕ to ℕ and a function (x, k) → x′ from Σ* × ℕ to Σ* such that:

1. (x, k) → x′ is computable in time k″|x|^c, for some constant c, and
2. (x, k) ∈ L if and only if (x′, k′) ∈ L′.

Results and Discussion

In the following subsections, we study the parameterized tractability of CSWO and show the problem is sensitive to different parameterizations.

CSWO: Tractability Results

We first consider the case when Σ is a parameter. In computational biology applications the biological sequences of interest are typically DNA or protein sequences, hence the number of different symbols is a small constant (i.e. 4 or 20 in the case of DNA or protein sequences, respectively). Restricting Σ alone does not make CSWO tractable since it is NP-hard even when the alphabet is binary. However, if Σ and ℓ are both parameters then it is fixed parameter tractable; we can enumerate and check all the |Σ|^ℓ possible center strings. As a result the problem is fixed parameter tractable with the combined parameters Σ, ℓ, d and n*. We will prove in a later section that it is imperative that Σ be a parameter in order to obtain this tractability.
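The enumeration argument can be sketched directly. The function below tries all |Σ|^ℓ candidate center strings and returns the first one that is within distance d of at least n – k input strings; the function name and the example instance are illustrative assumptions.

```python
from itertools import product


def brute_force_cswo(strings, alphabet, d, k):
    """Enumerate all |alphabet|^length candidates; return a center or None."""
    n, length = len(strings), len(strings[0])
    for cand in product(alphabet, repeat=length):
        # Count the input strings within Hamming distance d of the candidate.
        close = sum(
            1 for t in strings
            if sum(a != b for a, b in zip(cand, t)) <= d
        )
        if close >= n - k:
            return "".join(cand)
    return None


# Hypothetical instance over the DNA alphabet.
S = ["AAAA", "AAAT", "AATA", "GGGG"]
print(brute_force_cswo(S, "ACGT", d=1, k=1))
print(brute_force_cswo(S, "ACGT", d=1, k=0))
```

Each candidate is checked in O(nℓ) time, so the total running time is O(|Σ|^ℓ · nℓ), which is exponential only in the parameters.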

Next we show that CSWO is fixed parameter tractable if d and k are parameters. The fixed parameter algorithm that we present is similar to the algorithm presented by Gramm et al. [13], where it is proved that CLOSEST STRING is fixed parameter tractable with respect to the parameter d. In the algorithm by Gramm et al. [13], at each recursive step a string s is selected that has Hamming distance at least d + 1 from the current candidate center string x if one exists; otherwise x is returned since it is a center string. Then for any d + 1 positions where x and s disagree, there is at least one position at which s is equal to the final solution. The algorithm tries each of the d + 1 positions in turn: it changes x to s at that position, reduces Δd by one, and calls itself recursively. Here, Δd is the current degeneracy parameter at a particular recursive iteration and x is the current candidate center string. Since the recursion stops after at most d steps, the size of the search tree is bounded by O((d + 1)^d).

CSWO Algorithm

Input: A CSWO instance with a set S of n strings of length ℓ, parameters Δd, d and k, and a candidate string x.

Output: A string s* if there exists a set S* ⊆ S of at least n* strings where each string in S* has distance at most d from s*, and “Not found” otherwise.

1. If Δd < 0 or k < 0 then return “Not found”.
2. Choose i ∈ {1,…, n} such that d(x, s_i) > d. If no such i exists return x.
3. s_ret = CSWO Algorithm (S \ {s_i}, Δd, k – 1, x). If s_ret ≠ “Not found” then return s_ret.
4. Otherwise:
   (a) P = {p | x[p] ≠ s_i[p]};
   (b) Choose any P′ ⊆ P with |P′| = d + 1.
   (c) For each position p ∈ P′:
       Let x′ be the string obtained from x by setting position p equal to s_i[p].
       s_ret = CSWO Algorithm (S, Δd – 1, k, x′).
       If s_ret ≠ “Not found” then return s_ret.
5. Return “Not found”.

Our algorithm begins with s₁ as the candidate center string. If s₁ is a center string with respect to S then we are done; otherwise there exists a string s_i that has distance at least d + 1 from s₁. We “guess” whether s_i belongs in the set of outliers. If it is an outlier then we remove it from S and recurse on the smaller set with k – 1. If it is not an outlier then we use s_i to move the candidate string x closer to s_i, which can be done by applying the methodology of Gramm et al. [13]. We use the term “guess” as a euphemism in this brief description of our algorithm; in fact we try both possibilities, as can be seen in the CSWO Algorithm. This increases the size of the search tree.
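The recursion described above can be sketched in Python as follows. This is an illustrative reading of the CSWO Algorithm, not the authors' implementation. One addition is an assumption that does not appear in the text: the wrapper `cswo` restarts the search from each of the first k + 1 input strings, since at least one of them is not an outlier, whereas the text starts from s₁ only.

```python
def hamming(x, y):
    """Hamming distance between two strings of equal length."""
    return sum(a != b for a, b in zip(x, y))


def cswo_search(strings, x, delta_d, d, k):
    """Bounded search tree; returns a center string or None ("Not found")."""
    if delta_d < 0 or k < 0:                                   # step 1
        return None
    far = next((s for s in strings if hamming(x, s) > d), None)
    if far is None:                                            # step 2
        return x
    # Step 3: guess that `far` is an outlier and drop one copy of it.
    rest = list(strings)
    rest.remove(far)
    found = cswo_search(rest, x, delta_d, d, k - 1)
    if found is not None:
        return found
    # Step 4: guess that `far` is not an outlier; move x towards it at
    # one of d + 1 positions where the two strings disagree.
    positions = [p for p in range(len(x)) if x[p] != far[p]][: d + 1]
    for p in positions:
        moved = x[:p] + far[p] + x[p + 1:]
        found = cswo_search(strings, moved, delta_d - 1, d, k)
        if found is not None:
            return found
    return None                                                # step 5


def cswo(strings, d, k):
    """Assumed wrapper: try each of the first k + 1 strings as the start."""
    for start in strings[: k + 1]:
        found = cswo_search(strings, start, d, d, k)
        if found is not None:
            return found
    return None


# Hypothetical instance in which the first string is the outlier.
S = ["GGGG", "AAAA", "AAAT", "AATA"]
print(cswo(S, d=1, k=1))
```

Any string returned is within distance d of all strings that were not removed, and at most k strings are ever removed, so the result is always a valid center for at least n – k input strings.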

Proposition 1 The CSWO Algorithm solves the CSWO problem in time O(nℓ + nd · d^d · 2^{k+d}).

Proof. Running time. Each recursion of the algorithm reduces either k or Δd by 1. Thus, there are at most k + d guesses of whether a particular string belongs in the set of outliers. Hence, the search tree size is increased by a multiplicative factor of at most 2^{k+d} and is bounded above by O(2^{k+d} · (d + 1)^d). The analysis of Gramm et al. [13] demonstrated that each recursive step takes time O(nd) and the preprocessing takes time O(nℓ); therefore, we obtain an overall running time of O(nℓ + nd · d^d · 2^{k+d}).
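The step from the (d + 1)^d search tree bound to the d^d term in Proposition 1 can be made explicit. The following short worked bound assumes d ≥ 1 and uses only the standard inequality (1 + 1/d)^d ≤ e; it is a sketch added for clarity rather than part of the original proof.

```latex
\begin{align*}
(d+1)^d &= d^d \left(1 + \tfrac{1}{d}\right)^{d} \le e \cdot d^d, \\
\text{search tree size} &\le 2^{k+d}\,(d+1)^d \le e \cdot 2^{k+d}\, d^d
  = O\!\left(2^{k+d}\, d^d\right), \\
\text{total time} &= O(n\ell) + O(nd) \cdot O\!\left(2^{k+d}\, d^d\right)
  = O\!\left(n\ell + nd \cdot d^d \cdot 2^{k+d}\right).
\end{align*}
```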

Correctness. We show the correctness of the algorithm by showing the correctness of the first recursive step; the correctness of the algorithm then follows by inductively applying the same argument. Clearly, if S does not contain a subset S* of n* strings such that there exists a center string s* for S*, then “Not found” will be returned; we therefore assume otherwise.

If s₁ is a center string for S then the algorithm immediately halts, so we assume there exists a string s_i in S that has distance greater than d from s₁. The CSWO Algorithm creates two subcases: one where s_i is in the set of outliers, and another where s_i is not. If s_i is in the set of outliers then the first case will successfully remove s_i from the set and recurse on S \ {s_i}. Otherwise, if s_i is not in the set of outliers then eventually the second case will be reached. We refer to a position p as correct if s₁[p] ≠ s*[p] = s_i[p]. It follows from Gramm et al. [13] that one of the d + 1 chosen positions p will be a correct one. Thus, we have shown that one of the subcases will lead to a smaller subcase containing the solution for S.

The previous result demonstrates the fixed parameter tractability with respect to d and k. We note that a similar modification of the O(n|Σ|^{O(d)}) algorithm of Ma and Sun [16] also gives a fixed parameter algorithm with respect to the parameters Σ, d and k. In the modified algorithm, for any string s with distance greater than d to the current candidate center string x, we again try the subcases where s is an outlier and where it is not. In the former case, we remove s from the set of input strings S and recurse on S and k – 1; in the latter case, we use the same technique as in the algorithm of Ma and Sun [16] to reduce the distance between x and the final solution. This modification that accounts for the outliers results in an extra multiplicative factor of O(2^{k + log d}) in the running time of the original algorithm. Although this algorithm improves upon the running time of the previous result, it requires that Σ is also a parameter. Further, we note that some of the recent improvements [20–22] to the algorithm of Ma and Sun can be modified in a similar manner to obtain fixed parameter algorithms for CSWO with respect to parameters Σ, d and k.

Proposition 2 CSWO is fixed parameter tractable for parameters Σ and n.

Proof. Gramm et al. [13] gave a linear fixed parameter tractable algorithm for CLOSEST STRING with respect to the number of strings and Σ, which we refer to as ILP-procedure(S), where S is the set of input strings. Our algorithm enumerates all size-n* subsets of S and calls ILP-procedure on each subset. Since S has at most 2^n subsets, the number of calls depends only on n, and the claim follows.
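The subset enumeration in this proof can be sketched as follows. The ILP-based procedure of Gramm et al. is not reproduced here: `closest_string_stub` is a hypothetical brute-force stand-in used only so that the example runs, and all names are assumptions.

```python
from itertools import combinations, product


def hamming(x, y):
    """Hamming distance between two sequences of equal length."""
    return sum(a != b for a, b in zip(x, y))


def closest_string_stub(subset, alphabet, length, d):
    """Stand-in for ILP-procedure(S): exhaustive search for a center."""
    for cand in product(alphabet, repeat=length):
        if all(hamming(cand, t) <= d for t in subset):
            return "".join(cand)
    return None


def cswo_by_subsets(strings, alphabet, d, k):
    """Try every subset of size n* = n - k; return (center, subset) or None."""
    n_star = len(strings) - k
    length = len(strings[0])
    for subset in combinations(strings, n_star):
        center = closest_string_stub(subset, alphabet, length, d)
        if center is not None:
            return center, list(subset)
    return None


# Hypothetical instance with one outlier.
S = ["AAAA", "AAAT", "AATA", "GGGG"]
print(cswo_by_subsets(S, "ACGT", d=1, k=1))
```

Replacing the stand-in with a CLOSEST STRING solver that is fixed parameter tractable in the number of strings gives the behaviour described in the proof.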

CSWO: Intractability Results

We derive the W[1]-hardness result through a series of intermediate steps that build a reduction from CLIQUE to CSWO, showing that CSWO is W[1]-hard for the combination of ℓ, d, and n* when the alphabet is unbounded.

Reduction from CLIQUE

As previously described, we let the CLIQUE instance be given by an undirected graph G = (V, E) with a set V = {v₁, v₂,…, v_n} of n vertices, a set E of m edges, and a positive integer t denoting the size of the desired clique. We describe how to generate a set S of strings such that G has a clique of size t if and only if there is a subset of S of size t(t – 1)/2, denoted as S*, where there exists a string x such that d(s_i, x) ≤ d for all s_i ∈ S*. We let ℓ = t and d = t – 2. We assume that t > 2 since t ≤ 2 produces trivial cases.

We begin by describing the alphabet. We assume |Σ| can be unbounded and we let Σ be equal to the union of the following sets of symbols:

1. {v_i | i = 1,…, |V|}. Hence, there exists one symbol representing each vertex in G.
2. {c_{i,j,m} | i = 1,…, t; j = 1,…, t; m = 1,…, |E|}. There exists a unique symbol for each string produced by our reduction.

Hence, we have a total of |V| + t²|E| symbols.

Next, we generate a set of strings S = {s_{1,2,1},…, s_{1,2,|E|}, s_{1,3,1},…, s_{1,3,|E|},…, s_{t–1,t,|E|}}. Every string has length t and will encode one edge of the input graph. There are several strings corresponding to each edge, one for each pair of positions i < j; they encode the same edge in different positions. For string s_{i,j,m} we encode edge e_m = (v_r, v_s), where 1 ≤ r < s ≤ |V|, by letting position i equal v_r, position j equal v_s, and the remaining positions equal c_{i,j,m}. Hence, a string is given by

s_{i,j,m} := [c_{i,j,m}]^{i–1} v_r [c_{i,j,m}]^{j–i–1} v_s [c_{i,j,m}]^{t–j}.

To clarify our reduction, we give an example. Let G = (V, E) be an undirected graph with V = {v₁, v₂, v₃, v₄} and edges E = {(v₁, v₂), (v₁, v₃), (v₁, v₄), (v₂, v₃)}, and let our CLIQUE instance have G and t = 3. Figure 1 illustrates the reduction. Using G, we exhibit the above construction of strings, which we denote as S. We claim that there exists a clique of size 3 if and only if there exists a string s of length ℓ = t = 3 and a subset S* of S of size 3 where d(s, s_i) ≤ d for all s_i ∈ S*. In this example the center string s is equal to v₁v₂v₃ and S* = {v₁v₂c_{1,2,1}, v₁c_{1,3,2}v₃, c_{2,3,4}v₂v₃}; each string in S* has Hamming distance at most 1 = t – 2 from s.
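The construction and the example can be reproduced with a short sketch. Since the symbols v_r and c_{i,j,m} are not single characters, each string is represented as a tuple of symbols; this representation and all function names are illustrative assumptions.

```python
from itertools import combinations


def build_strings(edges, t):
    """Map (i, j, m) to the string s_{i,j,m} encoding edge e_m at positions i < j."""
    strings = {}
    for i, j in combinations(range(1, t + 1), 2):
        for m, (r, s) in enumerate(edges, start=1):   # e_m = (v_r, v_s), r < s
            row = [("c", i, j, m)] * t                # filler symbol c_{i,j,m}
            row[i - 1] = ("v", r)
            row[j - 1] = ("v", s)
            strings[(i, j, m)] = tuple(row)
    return strings


def hamming(x, y):
    """Hamming distance between two tuples of symbols."""
    return sum(a != b for a, b in zip(x, y))


# The example graph from the text, with t = 3 and d = t - 2 = 1.
edges = [(1, 2), (1, 3), (1, 4), (2, 3)]
t = 3
S = build_strings(edges, t)
center = (("v", 1), ("v", 2), ("v", 3))
close = {key for key, s in S.items() if hamming(center, s) <= t - 2}
print(sorted(close))
```

For this graph the strings within distance 1 of v₁v₂v₃ are exactly s_{1,2,1}, s_{1,3,2} and s_{2,3,4}, one for each edge of the clique {v₁, v₂, v₃}.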

Correctness of the Reduction

The following two lemmas establish the correctness of the reduction.

Lemma 1 For a graph with a t-clique, the construction described above produces a CSWO instance with a set S* and a string s of length t such that d(s_i, s) ≤ d for every s_i ∈ S*.

Proof. Let the input graph have a clique of size t. Let v_{α₁}, v_{α₂},…, v_{α_t} be the vertices in the clique C of size t and, without loss of generality, assume α₁ < α₂ < … < α_t. We claim that there exists a subset of strings in S that have distance at most t – 2 from the string s = v_{α₁}v_{α₂}…v_{α_t}. Consider the first edge (v_{α₁}, v_{α₂}) of the clique; it follows that the string s_{1,2,r} = v_{α₁}v_{α₂}[c_{1,2,r}]^{t–2}, where edge e_r has endpoints v_{α₁} and v_{α₂}, is contained in the set of strings {s_{1,2,1}, s_{1,2,2},…, s_{1,2,|E|}}. Clearly, d(s_{1,2,r}, s) = t – 2. In the same way, for each edge in C we have a string in S that has distance at most t – 2 from s, and our lemma follows from this construction.

For the reverse direction, we need to prove that the existence of a subset S* of S and a string s where d(s, s_i) ≤ t – 2 for all s_i ∈ S* implies the existence of a clique in G with t vertices.

Lemma 2 The t symbols of the center string correspond to the t vertices of a clique in the input graph.

Proof. Let S* be the subset of S of size n* such that s has distance at most t – 2 from each string in S*. Since ℓ = t, n* = t(t – 1)/2, d = t – 2 and each symbol c_{i,j,m} (i = 1,…, t, j = 1,…, t and m = 1,…, |E|) occurs in only a single string, it follows from the pigeonhole principle that the center string s only contains symbols from {v_i | i = 1,…, |V|}. Without loss of generality assume s is equal to v_{α₁}v_{α₂}…v_{α_t} for α₁, α₂,…, α_t ∈ {1,…, |V|}. Consider any pair α_i, α_j for 1 ≤ i < j ≤ t and consider the set of strings S_{i,j} = {s_{i,j,1}, s_{i,j,2},…, s_{i,j,|E|}}. Recall that S_{i,j} contains a string corresponding to each edge e = (v_r, v_s) in E, which has v_r at the ith position, v_s at the jth position and c_{i,j,m} at all remaining positions. Therefore, we can only find a string in S_{i,j} that has distance at most t – 2 from s if v_{α_i} is at the ith position and v_{α_j} is at the jth position; such a string exists if and only if there is an edge in G connecting v_{α_i} to v_{α_j}. Hence, the center string s implies there exists an edge between every pair of vertices in the set {v_{α₁}, v_{α₂},…, v_{α_t}} and by definition the vertices form a clique.

Our main theorem follows directly from Lemma 1 and Lemma 2. We note that the hardness for the combination of all three parameters also implies the hardness for each subset of the three.

Theorem 1 CSWO with unbounded alphabet is W[1]-hard with respect to the parameters ℓ, d, and n*.

Since there exists a trivial reduction from the CLOSEST STRING problem to CSWO (i.e. simply set k = 0 in CSWO), there cannot exist a fixed parameter tractable algorithm for CSWO with k as a parameter, unless P = NP; such an algorithm would contradict the NP-hardness of CLOSEST STRING.

Fact 1 CSWO is W[1]-hard with respect to the parameter k and when |Σ| ≤ 2, unless P = NP.

Conclusions

We introduced the CSWO problem and proved that, with unbounded alphabet size and parameterized by ℓ, d and n*, it is W[1]-hard. We also gave fixed parameter algorithms for the problem when parameterized by d and k with unbounded alphabet size. In the case of a fixed alphabet size, we showed CSWO is fixed parameter tractable when parameterized by n = n* + k. Table 1 summarizes these tractability and intractability results.

Currently, the fixed parameter tractability of the CSWO problem when parameterized by d, n* and Σ, and by n* and k, remains open (see Table 1). In addition, the existence of efficient, non-trivial approximation algorithms for this problem warrants further investigation.

Table 1 Parameterized tractability of CSWO

References

1. Dopazo J, Rodríguez A, Sáiz J, Sobrino F: Design of primers for PCR amplification of highly variable genomes. Computer Applications in the Biosciences 1993, 9: 123–125.
2. Lanctot J, Li M, Ma B, Wang S, Zhang L: Distinguishing string selection problems. Information and Computation 2003, 41–55. 10.1016/S0890-5401(03)00057-9
3. Lucas K, Busch M, Össinger S, Thompson J: An improved microcomputer program for finding gene- and gene family-specific oligonucleotides suitable as primers for polymerase chain reactions or as probes. Computer Applications in the Biosciences 1991, 7: 525–529.
4. Proutski V, Holme E: Primer master: A new program for the design and analysis of PCR primers. Computer Applications in the Biosciences 1996, 12: 253–255.
5. Deng X, Li G, Li Z, Ma B, Wang L: Genetic design of drugs without side-effects. SIAM Journal on Computing 2003, 32(4):1073–1090. 10.1137/S0097539701397825
6. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al.: Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 2005, 23: 137–144. 10.1038/nbt1053
7. Ben-Dor A, Lancia G, Perone J, Ravi R: Banishing bias from consensus strings. Proc. of 8th CPM 1997, 247–261.
8. Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 2001, 17: S207–S214.
9. Pevzner P, Sze S: Combinatorial approaches to finding subtle signals in DNA strings. Proc. of 8th ISMB 2000, 269–278.
10. Frances M, Litman A: On covering problems of codes. Theoretical Computer Science 1997, 30(2):113–119.
11. Fellows M, Gramm J, Niedermeier R: On the Parameterized Intractability of Closest Substring and Related Problems. Proc. of 19th STACS 2002, 262–273.
12. Fellows M, Gramm J, Niedermeier R: On the Parameterized Intractability of Motif Search Problems. Combinatorica 2006, 26: 141–167. 10.1007/s00493-006-0011-4
13. Gramm J, Niedermeier R, Rossmanith P: Fixed-parameter algorithms for closest string and related problems. Algorithmica 2003, 37: 25–42. 10.1007/s00453-003-1028-3
14. Li M, Ma B, Wang L: Finding similar regions in many strings. Journal of Computer and System Sciences 2002, 65: 73–96. 10.1006/jcss.2002.1823
15. Ma B: A polynomial time approximation scheme for the closest substring problem. Proc. of 11th CPM 2000, 99–107.
16. Ma B, Sun X: More efficient algorithms for closest string and substring problems. Proc. of 12th ACM RECOMB 2008, 396–409.
17. Downey R, Fellows M: Parameterized Complexity. Springer; 1999.
18. Garey M, Johnson D: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman; 1979.
19. Lenstra W: Integer programming with a fixed number of variables. Mathematics of Operations Research 1983, 8: 538–548. 10.1287/moor.8.4.538
20. Chen ZZ, Ma B, Wang L: A Three-String Approach to the Closest String problem. Proc. of 16th COCOON (to appear) 2010.
21. Wang L, Zhu B: Efficient algorithms for the closest string and distinguishing string selection problems. Proc. of 3rd FAW 2009, 261–270.
22. Zhao R, Zhang N: A more efficient closest string algorithm. Proc. of 2nd BICoB (to appear) 2010.
23. Marx D: Closest Substring Problems with Small Distances. SIAM Journal on Computing 2008, 38: 1382–1410. 10.1137/060673898
24. Gramm J, Guo J, Niedermeier R: On Exact and Approximation Algorithms for Distinguishing Substring Selection. Proc. of 14th FCT 2003, 261–272.

Acknowledgement

CB is supported by NSERC Grant OGP0046506, NSERC Grant OGP0048487, Canada Research Chair program, MITACS, and Premier’s Discovery Award. BM is supported by NSERC (RGPIN 238748-2006), China 863 National High-tech R&D Program (2008AA02Z313), and a startup grant at University of Waterloo. We are also grateful to the referees for their many helpful comments.

This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.

Author information

Authors and Affiliations

David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
Christina Boucher & Bin Ma

Authors

Christina Boucher
Bin Ma

Corresponding authors

Correspondence to Christina Boucher or Bin Ma.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

Concept, FPT analysis: BM. W[1]-hardness analysis: CB. Manuscript preparation: BM and CB.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Boucher, C., Ma, B. Closest string with outliers . BMC Bioinformatics 12 (Suppl 1), S55 (2011). https://doi.org/10.1186/1471-2105-12-S1-S55
