### Independent codon model

We first consider a method to identify overrepresented motifs in a coding sequence conditional on the amino acid sequence, under the assumption that each codon in the sequence is independent. Specifically, we calculate the overrepresentation or underrepresentation of a motif in a set of protein-coding sequences of total length *N.*

We are given an amino acid sequence *A* = *A*
_{1}, *A*
_{2},..., *A*
_{
n
}and a motif *μ*. There are *O*(*e*
^{
N
}) different coding nucleotide sequences that translate into the same amino acid sequence *A*. Denote *C*
_{
a
}as the set of all such nucleotide sequences. Let *X* denote the random variable that gives the number of occurrences of the motif *μ* for any sequence {*α*} *= α*
_{1}, *α*
_{2}, ..., α_{N} ∈ *C*
_{
A
}.

The ~ *e*
^{
N
}sequences in *C*
_{
a
}will not contribute equally to the expectation. This is because even in the absence of selection on motifs, amino acids have preferences for codon usage. The null model for codon usage can be set as the codon usage in a reference set, which we typically choose to be the set of all coding sequences genome-wide. This provides a background probabilistic model to weight the ~ *e*
^{
N
}coding sequences.

A direct enumeration of all ~

*e*
^{
N
}sequences is prohibitive. Therefore we have devised a dynamic programming approach to exactly calculate the distribution of

*X*. The distribution for

*X*, which we refer to as

*D*(

*X*), is stored as an array of values, and can be calculated by

Here

*δ*(

*X*({

*α*}) is a delta function centered at the value

*X*({

*α*}). The probability of the sequence {

*α*} is given by

where the individual *p*(*α*
_{
i
}) values are determined from the reference codon usage table for the corresponding amino acid. Since the weightings are conditional on the amino acid sequence, the *p*(*α*
_{
i
}) values for the codons in a synonymous group sum to one.

The distribution can be calculated by an inductive approach. One calculates the *D*(*X*
_{k+1}) distribution for the motif occurrences in the subsequence defined by the first *k +* 1 codons using the *D*(*X*
_{
k
}) distribution defined by the motif occurrences in the first *k* codons. By iterating through this procedure, one can efficiently calculate *D*(*X*
_{
N
}), which is the desired distribution *D*(*X*) for the full *N* codon sequence.

To perform the dynamic programming calculation, at a given iteration *k* one will need to keep track of each distribution function of the type *D*(*X*
_{
k
}) conditioned on the possible codon strings in the last Δ - 1 codons {*α*
_{k-Δ+2 }... *α*
_{
k
}}. Δ is the maximum number of codons that a given instance of the motif can overlap, i.e. for motif length *l*, Δ = [(*l* - 1)/3] + 1, where [*x*] indicates the greatest integer less than or equal to *x*. We denote these distributions as *D*(*X*
_{
k
}
*, α*
_{k-Δ+2 }...*α*
_{
k
}).

We will need these distributions for all possible values of the codons {*α*
_{k-Δ+2 }... *α*
_{
k
}}. Note that since the maximum number of copies of a motif scales with *N*, each corresponding distribution requires *O*(*N*) memory, and the total memory requirement is *O*(*e*
^{Δ-1}
*N*). These distribution functions are used to calculate the number of copies of the motif that would be added by appending the *k +* 1^{
st
}codon to the first *k* codons.

The induction step requires a convolution calculation using all of the

*D*(

*X*
_{
k
}, {

*α*
_{k-Δ+2 }...

*α*
_{
k
}}) functions. In this step, one counts the number of copies of the motif in each possible set of Δ codons consistent with the amino acids in positions

*k* - Δ + 2 through

*k +* 1, which allows one to calculate the next set of distribution functions. This counting step accounts for motifs in all reading frames, in contrast to recent shuffling-based algorithms [

17]. Formally, we have the induction relation

where

*θ* is a function on the Δ codons from

*k* - Δ

*+* 2 to

*k* + 1 defined as

*θ*(

*α*
_{k-Δ+2 }
*... α*
_{k+1}) ≡ the number of copies of the motif that end in the last codon of {

*α*
_{k-Δ+2 }...

*α*
_{k+1}}. The sum is over all values of

*α*
_{k-Δ+2 }consistent with the amino acid

*A*
_{k-Δ+2}. When the end of the sequence is reached, the final value of

*D*(

*X*
_{
N
}) is calculated from the weighted sum of the

*D*(

*X*
_{
N
}
*, α*) values that cover the last Δ - 1 codons of the amino acid sequence, i.e.

The probabilities in equation 5 can be calculated directly as

Note that all of these calculations can be done in either the 5' to 3' or 3' to 5' direction. In practice, we use the 3' to 5' direction, as this is necessitated by the way in which the Dinucleotide-corrected Codon Model (described below) is implemented.

#### Optimization for sparse motifs

For most amino acid sequences, the possible locations of the motif consistent with the genetic code are sparsely distributed. That is, depending on the motif, there can be large portions of the amino acid sequence where no motif is possible for any consistent choice of codons. Inductively calculating the motif occurrence distribution *D* in these regions is clearly wasteful. To take advantage of sparseness, we split our induction up into independent regions that can contain motifs. As described above we can calculate *D*(*X*
_{
k
}), the motif occurrence distribution for the subsequence *A*
_{1}
*, A*
_{2}, ..., *A*
_{
k
}, where *k* < *N*. However, at each step we also check whether there exists any codon choice {*α*} that would allow a motif instance to start in *α*
_{1}
*α*
_{2} ... *α*
_{
k
}and end in *α*
_{k+1}
*α*
_{k+2 }... α_{
N
}. Such a motif must occur in the subsequence *α*
_{k-Δ}
*α*
_{k-Δ+1 }... α_{
k
}
*..*. α_{k + Δ-1}
*α*
_{k+Δ}. The calculation of whether such a motif exists is then constant time for a fixed motif length.

If a motif instance is possible, we continue the induction. However, if not, then the distribution on the traversed sequence is independent of the distribution for the rest of the sequence. We therefore store the current motif occurrence distribution, denoted as *D*
_{
c
}
*= D*(*X*
_{
k
}) and scan forward in the sequence until we find the next codon *j* > *k* such that a motif can occur. Considering this as the beginning of a new truncated sequence,
, we calculate a new distribution *D*(*X*
_{
k'
}). This process is repeated until the end of the amino acid sequence. We can calculate the complete distribution by convolving all such distributions. The advantage of this approach is that any regions for which a motif instance is impossible (positions *k +* 1 through *j* - 1 in the description above) do not require convolution calculations. For longer motifs, the set of possible motif locations is increasingly sparse, which reduces the calculation time. This time reduction tends to offset the larger number of terms to be calculated in equation 4 for longer motifs.

#### Convolution calculation

A convolution of two distributions can be calculated by considering the values in each distribution as coefficients of two generating functions and then multiplying the two generating functions. Term-by-term multiplication of the two generating functions will take time *O*(*mn*) when the polynomials are of degree *m* and *n.* An alternative approach is a Fast Fourier Transform (FFT), which takes advantage of the fact that convolution in x-space is equivalent to multiplication in Fourier space. In the FFT approach, each polynomial is converted to Fourier space, the transformed polynomials are multiplied, and the product is converted back to x-space [34]. The algorithm takes *O*(*m* log *m* + *n* log *n*) time.

We tested both the direct and FFT approaches. For the motif lengths we investigated (4 - 7 bp), the FFT approach is not noticeably faster than the direct polynomial multiplication. This is because the convolution calculations involve a large number of multiplications in which *m* ≫ *n* and for which log *m* may be comparable to *n.* The FFT approach also occasionally yielded slightly negative values for polynomial coefficients due to limits on computer precision in the transform step. Therefore in the final program we used the direct multiplication approach.

### Dinucleotide-corrected codon model

Because we were concerned that the Independent Codon Model (ICM) did not sufficiently account for neutral dinucleotide biases, we implemented a dinucleotide-corrected codon model (DCM) which includes dinucleotide biases in the null model. The DCM uses a Markov model to generate the sequence, starting from the 3'-most codon and working backward to the 5' end. This choice of direction simplifies the calculation, since for most amino acids specification of the amino acid fixes the 5'-most nucleotide of a codon. Note that although the program is run in this direction, the results sections describe motifs in the standard 5' to 3' direction.

To specify the Markov model, we use the conditional codon usage table as observed in the reference sequence. The probability of choosing a codon is conditioned upon the amino acid of the current codon as well as the first nucleotide of the adjacent 3' codon. Formally, let

*F*(

*α, b*) be the number of instances in the reference sequence for which one observes the codon

*α* followed by base

*b*, such that the amino acid coincident with

*α* is

*A.* Let

*F*(

*A, b*) be the number of instances for which one observes amino acid

*A* followed by base

*b* in the reference sequence. Then define

In the DCM, sequences are then generated from the 3' to the 5' end with probabilistic weighting

Here we have written *b*(*α*
_{
i
}) to refer to the 1st base of codon *α*
_{
i
}, treating *b* as a function on a codon.

By iterating through equation 8 we can calculate the probability of the complete sequence given the amino acids. One source of ambiguity is how to treat the 3'-most codon. Our rule is to use the first nucleotide 3' to the sequence as the starting point of the probability assignment. This is a minor assumption since for most amino acids the first base of the codon is forced. If the sequence is a whole gene then we require *b* for the 3'-most codon to be "T", which is the first letter of all three stop codons. To avoid arbitrariness in the choice of stop codon, we by convention do not look for motifs overlapping the stop codon. Also, when applying the optimization for sparse motifs, we do not need to use the *p*(*α*) values of codons in the regions where a motif is not possible. Therefore, we use a shortcut that frees us from having to apply equation 8 in such regions. This shortcut is to traceback the *p*(*α*) from the region where a motif is possible to an amino acid where there is only one possible first base of the codon. This typically requires consideration of only a few codons in the 3' direction since for the large majority of amino acids (17/20), specification of an amino acid also specifies the first codon base.

With this approach in mind, calculating the motif occurrence distribution is analogous to the ICM case. One can apply equation 4 but with the substitution of the conditional probability

for the probability factor. When the 5' end of the sequence is reached, the final value of

*D*
_{
μ
}(

*N, X*) can be calculated using 5, though the calculation of

*p*({

*α*
_{N-Δ+2 }...

*α*
_{
N
}}) for equation 6 should again use conditional probabilities. For the DCM, this means replacing equation 6 with

#### Preservation of codon and dinucleotide usage by DCM

The purpose of using the DCM null instead of the ICM was to preserve the dinucleotide frequency and codon usage found in the reference sequence. Here we provide an argument for why the DCM model can closely preserve these quantities. Due to the structure of the genetic code, specification of an amino acid usually fixes the first nucleotide of the underlying codon. Only for the amino acids Ser, Arg and Leu is there degeneracy in the first nucleotide. Here we show that under the simplifying assumption that specifying the amino acid fixes the first nucleotide of the codon for every amino acid, then the codon and dinucleotide usage generated by the Markov process equals the codon and dinucleotide usage in the reference sequence. Because this assumption is approximately true for the real genetic code, codon and dinucleotide usage will be well-preserved by the DCM model.

Suppose in our amino acid sequence that amino acids

*A* and

*B* occur at locations

*i +* 1 and

*i* respectively. Since by assumption specification of

*B* also specifies the first base

*b* of the underlying codon, the probability of selecting codon

*α* at position

*i +* 1 is given by

where the counts *F*(*α, b*) and *F*(*A, b*) are from the reference sequence as defined above.

Denote the number of occurrences of

*α* in the original sequence as

*F*(

*α*), and denote the expected number of occurrences of a codon α in a sequence generated by our Markov process as

*E*(

*α*). From our definitions it is clear that

*F*(

*α*) = Σ

_{
b
}
*F*(

*αb*). Meanwhile for the Markov process we have

In the second step we have made use of the fact that each amino acid *B* specifies a first base *b* in the codon. Thus we see that the expected number of instances of any codon *α* is equal to the number of copies in the original sequence.

To see that the Markov process preserves dinucleotide counts, we again assume the idealized case in which specification of an amino acid also specifies the first base of the underlying codon. Denote

*ψ*
_{
xy
}(

*αb*) as the number of times a dinucleotide

*xy* appears within the 4 bases consisting of

*α* and the nucleotide

*b* corresponding to the first base of the codon underlying amino acid

*B.* Thus

*ψ* will have a value of 0,1, 2, or 3. Then the total number of copies of the dinucleotide

*xy* in the original sequence is

The expected number of copies of

*xy* generated by the Markov process

*E*[

*xy*] can be broken down into contributions

*E*[

*xy*|

*Ab*] corresponding to cases where an amino acid

*A* occurs 5' to a base b, i.e.

Plugging in

*P*(α|

*Ab*) as before, we get