#### 1 Algorithm

The input of our algorithm is a list of cDNA or DNA sequences. Each sequence is denoted as a string of length *m*, *s*_{i} = *s*_{i}[1]*s*_{i}[2]...*s*_{i}[*m*], over a fixed finite alphabet, i.e. *s*_{i}[*j*] ∈ Σ = {*A, G, C, T*}, 1≤*i*≤*n*, 1≤*j*≤*m*. All sequences together form a set of strings *S* = {*s*_{i} | 1≤*i*≤*n*}. The output is a degenerate string of length *k*, which represents the degenerate primers.

In the first step, we align all the input strings and find the conserved regions in them. This is similar to the closest substring problem [13]. Let *s*_{i}[*j, k*] be the substring of *s*_{i} = *s*_{i}[1]*s*_{i}[2]...*s*_{i}[*m*] that starts at position *j* and has length *k*, i.e. the sequence of symbols *s*_{i}[*j*]*s*_{i}[*j* + 1]...*s*_{i}[*j* + *k* - 1]. We need to find the set of substrings *S*[*j, k*] = {*s*_{i}[*j, k*] | 1≤*i*≤*n*} that is the most conserved, by minimizing the following objective function.

Here *h*(*a, b*) = |{*t* | *a*[*t*] ≠ *b*[*t*], 1≤*t*≤*k*}| is the Hamming distance between strings *a* and *b*, and *s*^{k} denotes the center string of *S*[*j, k*]. Each letter of the center string is the letter that appears most often at the same position of *S*[*j, k*]. Let *p*_{i} = *j* denote, for each *s*_{i}[*j, k*], the position of the first letter of the substring. The above statement can be formulated as the following optimization problem.

Where *P* = {*p*_{i} | 1≤*i*≤*n*}.
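
The quantities defined above can be illustrated with a short Python sketch. It assumes that the objective *D* is the sum of the Hamming distances between the center string and each selected substring. The function names, the 0-based positions and the tie-breaking rule for the majority letter are illustrative assumptions, not part of the C++ implementation.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def center_string(substrings):
    """Column-wise majority string of equal-length substrings.

    Ties go to the letter seen first in the column (an assumption).
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*substrings))


def objective(seqs, positions, k):
    """Assumed form of D: summed Hamming distance to the center string.

    positions holds one 0-based start p_i per sequence.
    """
    subs = [s[p:p + k] for s, p in zip(seqs, positions)]
    center = center_string(subs)
    return sum(hamming(center, sub) for sub in subs)
```

A smaller value of `objective` means that the chosen windows agree at more positions, so a perfectly conserved region gives 0.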

The problem is NP-complete, so we use a polynomial-time approximation algorithm. The pseudocode is as follows.

1. Taking 3 strings, *s*_{1}, *s*_{2}, *s*_{3}, randomly from *S* = {*s*_{i} | 1≤*i*≤*n*};
2. Sampling 3 substrings *s*_{1}[*j, k*], *s*_{2}[*j, k*], *s*_{3}[*j, k*] from *s*_{1}, *s*_{2}, *s*_{3} respectively;
3. Finding, for every string in *S* = {*s*_{i} | 1≤*i*≤*n*}, the substring that is closest to the center string of the 3 sampled substrings.

Step 1 will be repeated for
times. Step 2 will be repeated for (*m*-*k*+1)^{3} times. So we get
groups of substrings. Using formulas (1) and (2), the group of substrings with the minimum *D* is taken as the most conserved. Step 3 will be repeated for *n* × *k* times. So the whole algorithm will be repeated
times.
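
The three steps can be sketched in Python as follows. This is a minimal illustration and not the C++ implementation: the number of random triples (`trials`), the random seed, the 0-based positions and the assumption that *D* is a sum of Hamming distances are all choices made for the example, and at least three input sequences are required.

```python
import itertools
import random
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def majority(subs):
    """Column-wise majority (center) string of equal-length substrings."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*subs))


def closest_start(seq, target):
    """0-based start of the window of seq nearest to target."""
    k = len(target)
    return min(range(len(seq) - k + 1),
               key=lambda j: hamming(seq[j:j + k], target))


def find_conserved(seqs, k, trials=5, seed=0):
    """Return (positions, D) for the most conserved group found."""
    rng = random.Random(seed)
    best_d, best_pos = None, None
    for _ in range(trials):
        trio = rng.sample(seqs, 3)                      # step 1
        ranges = [range(len(s) - k + 1) for s in trio]
        for starts in itertools.product(*ranges):       # step 2
            center = majority([s[j:j + k] for s, j in zip(trio, starts)])
            pos = [closest_start(s, center) for s in seqs]   # step 3
            subs = [s[p:p + k] for s, p in zip(seqs, pos)]
            d = sum(hamming(majority(subs), sub) for sub in subs)
            if best_d is None or d < best_d:
                best_d, best_pos = d, pos
    return best_pos, best_d
```

For four toy sequences that share the 5-mer `ACGTG`, `find_conserved(["TTACGTGG", "ACGTGCCC", "GGGACGTG", "CACGTGAA"], 5)` locates that 5-mer in every sequence with *D* = 0.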

Now we have a position set *P* = {*p*_{i} | 1≤*i*≤*n*}. Each element is the starting position of the conserved region in the corresponding string. In the next step, a degenerate primer is designed in these conserved regions. A PCR primer sequence is called degenerate if some of its positions have several possible bases. The degeneracy of the primer is the number of unique sequence combinations it contains [14]. We overlay all substrings, *s*_{i}[*p*_{i}, *k*], as an *n* by *k* matrix. Let *Q* be the set of positions where the *s*_{i}[*j, k*] agree, and *R* = {1,2,...,*k*} - *Q* be the set of positions where they disagree. We only need to work at the positions θ in *R*. First, a distribution matrix is constructed, which records the number of appearances, or count, of each character at each position.

*M*(σ,θ) = |{*i* | *s*_{i}[θ] = σ, 1≤*i*≤*n*}|, σ∈Σ, 1≤θ≤|*R*| (3)
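
The construction of *R* and of the count matrix of equation (3) can be sketched in Python. The function name, the dictionary layout of `M` and the 0-based column indices are assumptions made for the illustration.

```python
from collections import Counter


def distribution_matrix(subs):
    """Return (R, M) for equal-length conserved substrings.

    R lists the 0-based columns where the substrings disagree.
    M[theta][sigma] is the number of substrings with base sigma in column theta.
    """
    k = len(subs[0])
    R = [t for t in range(k) if len({s[t] for s in subs}) > 1]
    M = {t: Counter(s[t] for s in subs) for t in R}
    return R, M
```

Columns in which all substrings agree belong to *Q* and are skipped, so only the variable columns carry counts.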

The leading value of column θ, denoted *L*(θ), is defined as the largest value in that column: *L*(θ) = max{*M*(σ,θ) | σ∈Σ}. The leading character of column θ is a character *y*(θ) whose count is the leading value: *M*(*y*(θ),θ) = *L*(θ). A column-wise majority string *w* is the string of |*R*| leading characters, one for each column, which is used as the initial non-degenerate string. We then degenerate the string *w* so that it matches a maximum number of strings in the set *S*_{R} = {*s*_{i}[θ] | 1≤*i*≤*n*, 1≤θ≤|*R*|} using minimum degeneracy. The elements of the matrix *M*(σ,θ) other than the leading values are sorted from largest to smallest. We select the λ largest elements and merge their characters with the leading characters of the corresponding columns, which yields a degenerated string *w**. Let *M*(σ_{1}, θ_{1}) ≥ *M*(σ_{2}, θ_{2}) ≥ ... ≥ *M*(σ_{λ}, θ_{λ}) denote the λ largest selected elements, and let θ* = {θ_{1}, θ_{2},..., θ_{λ}}, 1≤θ_{1}, θ_{2},..., θ_{λ}≤|*R*|, be the columns that contain selected elements. Let ρ_{1} be the columns that have only one selected element, ρ_{2} the columns that have two selected elements, and ρ_{3} the columns that have three selected elements, so that θ* = ρ_{1}⋃ρ_{2}⋃ρ_{3}. Since a column with one, two or three selected elements admits two, three or four bases respectively, the degeneracy of the string *w** is *g* = 2^{|ρ_{1}|} × 3^{|ρ_{2}|} × 4^{|ρ_{3}|}.

In practice, we do not need to cover all input strings. There is a trade-off between degeneracy and coverage (the number of matched input sequences), and the parameter λ adjusts it. By combining the characters at the positions in *Q* with the characters at the positions in *R*, the final primer of length *k* is obtained.

There are two parameters in this algorithm, *k* and λ. *k* is the length of the primer, which is usually about 20. The value of λ is determined by the degeneracy and depends on the database.

The algorithm is implemented in Microsoft Visual C++ and run on a Pentium IV 2.4 GHz PC with 1 GB of DDR RAM under Windows XP. A typical execution of the algorithm on 8000 sequences of length 1000 takes approximately 1 minute.
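
The degeneration step and the role of λ can be illustrated with the following Python sketch. It is not the C++ implementation. Writing the result with the standard IUPAC ambiguity codes, the name `degenerate_primer` and the handling of ties (column order) are assumptions of the example. The sketch scans all *k* columns, which is equivalent to working on *R* only, because the columns in *Q* have no non-leading counts.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the sorted set of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def degenerate_primer(subs, lam):
    """Return (primer, degeneracy) after adding the lam largest minor counts."""
    k = len(subs[0])
    counts = [Counter(s[t] for s in subs) for t in range(k)]
    # Start from the leading (majority) character of every column.
    allowed = [{c.most_common(1)[0][0]} for c in counts]
    # Non-leading entries of the count matrix, largest count first.
    rest = sorted(
        ((n, t, base) for t, c in enumerate(counts)
         for base, n in c.items() if base not in allowed[t]),
        key=lambda entry: -entry[0],
    )
    for _, t, base in rest[:lam]:
        allowed[t].add(base)
    primer = "".join(IUPAC["".join(sorted(a))] for a in allowed)
    degeneracy = 1
    for a in allowed:
        degeneracy *= len(a)  # 2, 3 or 4 bases per degenerated column
    return primer, degeneracy
```

With the six toy substrings `ACGT`, `ACGA`, `ATGA`, `ATGA`, `ACGA`, `GCGA`, λ = 0 returns the majority string `ACGA` with degeneracy 1, and λ = 1 returns `AYGA` with degeneracy 2. Each further increment of λ covers more input strings at the cost of a higher degeneracy.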