The datasets
The experimental dataset (also called the 8K dataset) used here has been described previously [6], and contains 8160 sequences from the genome of Arabidopsis thaliana. Briefly, all available expressed sequence tags (ESTs) were downloaded from GenBank, and those containing terminal poly(A) sequences [8 to 15 nucleotide (nt) with at least 80% adenine content] were recognized and trimmed. The terminal nt of each trimmed polyadenylated transcript was classified as the last nt before a poly(A) site. A total of 8160 such poly(A) sites were identified and confirmed through the comparison of genomic and EST sequences (The oligo(A) should not be found in the genomic sequence because these were added post-transcriptionally during the polyadenylation process). Using the poly(A) sites as a reference, the corresponding 400 nt genomic sequences were extracted in such a way that each sequence contained 301 nt upstream and 99 nt downstream of the poly(A) site. Thus, the poly(A) site in each sequence was between the 301st nt and the 302nd nt (from left to right; the poly(A) site was also the cleavage site. The cleavage reaction occurs between two nucleotides linked by a phosphodiester bond). The cleavage site is defined as the "0" position (note there is no nucleotide assigned to this position). Hence, the nucleotide sequences on the left (upstream) have a negative designation, and on the right have a positive (often omitted) designation. In general, for the purpose of easier description, the nucleotide on the left of the cleavage site (position 301 in the dataset) is normally referred to as the a poly(A) site.
This dataset was used to extract the features of the poly(A) signals and poly(A) sites. Other test sequences shown in the results were either downloaded from GenBank or from published papers as cited. For the Sp calculation, control datasets of the Arabidopsis coding (which do not include 5' and 3' UTRs and introns), 5'-UTR, and intron sequences were downloaded from The Arabidopsis Information Resources website (TAIR; [18]; Arabidopsis Genome Initiative Release 5, 2004). These sequences were trimmed into 400 nt in length each for better comparison with the 8K dataset. The coding sequence datasets were extracted from downloaded sequences in the range of 300–700 to avoid the inclusion of UTRs. The random sequences for the Sp calculation were generated based on the second order trinucleotide distribution [5] in the 8K dataset. For the genomic sequence scan, the Arabidopsis chromosome 4 genomic sequence was used (ATH1_chr4.1con.01222004; from TAIR). It is worth mentioning that the sequences used in this work are in DNA form, so nucleotides in these sequences are ATCG instead of AUCG as in RNA. This does not impact the analysis.
Modeling routine
The topological structure is one of the most important factors in designing a GHMM model. The regular expression of topological structures in GHMM models is based on all connection structures, in which every state can go to any other state. This kind of topological structure does not take advantage of the positions of the signal elements in the 3' UTR (Figure 6A). Therefore, we employed a GHMM model that recognized the signals from left to right, and only allowed the recognition of signals from the current state to the next state in one direction, as indicated in Figure 6B. Based on the analysis of the current model of plant poly(A) signals [6], we classified the sequences into five regions (Figure 6A). The poly(A) signals are distributed in these regions with some spacing between the two signal elements. Based on this, we added a background state between the two signal states. To simplify the model, we assumed that the length of every signal was fixed but the length of background was variable. It was also possible that two signals were next to each other and thus the length of the background may be 0. The final model was designed in such a way that all calculations began on the first state and ended at the last state (Figure 6B). The order of the algorithm is shown in Figure 6C.
Parameter setting
In this model, some basic parameters were set as follows: the number of states was 11 (Figure 6B, from Bg1 through Bg6); the array of signals in every state is set to be {A, T, C, G}. States in odd numbers were the background state with variable length, and states in even numbers were signal states with fixed length. Because the model begins with the first state and ends at the last state, the initial state distribution is set in π = {1,0,...,0}. Because every state (i) can be only transferred to the i+1 state, only the value of ai,i+1 is 1 in the state transition probability matrix, and other elements of the distribution matrix are 0. Other parameters such as distribution of nucleotides and length of state will be described below.
Length of the signal elements
For modeling purposes, we needed to assign each state a few parameters including the signal nucleotide composition, signal pattern length, etc. To simplify the model, we assumed that the size or nucleotide length of each signal (FUE, NUE, CE, respectively) was fixed. As a first step, we had to designate reasonable signal lengths for each of them. To this end, we designed the following method to extract this data from the SignalSleuth experiments as described in Loke et al [6]. The observed total count of 3 nt patterns was used as a basis to calculate the "expected count" of pattern sizes of 4 nt, and the observed total count of 4 nt patterns was used to calculate the "expected count" of 5 nt, and so on. For example, the expected count of 4 nt patterns should decrease by 25% of the actual total count of 3 nt patterns because of an increase in length by one of the four nucleotides. The difference between the predicted count of patterns (random chance) and the actual observed count is useful in measuring pattern length uniqueness. The measure of the deviation from the randomness of the patterns offers a clue as to the potential length of the signals, because the real signals should have greater deviation from randomness than that of non-signals. As indicated on the histogram (see Additional file 3), the greatest difference observed for FUE is at 8 nt, NUE at 6 nt, CE-L at 6 nt, and CE-R at 7 nt, respectively. Importantly, using this bioinformatics approach, the signal lengths of the NUE and FUE match the lengths of NUE and FUE defined by classical genetic analysis, in which it was found that the NUE signal length is 6 nt, and the FUE is about 8 nt [2, 27]. Note that the signal length is different from the range of signal region where the signal can be found. The latter is for modeling purposes, and is larger because it describes a collective area where the signals are found in different genes.
Output probability of signal state
After determining the length of the signals, we needed to study the output probability B of the nucleotides (A, T, C, and G) in every signal state. To this end, we analyzed the distribution of nucleotides in each region (FUE, NUE and CEs) with the formula below, using the frequency data that was generated by SignalSleuth as described by Loke et al [6] using the 8K dataset.
where is the statistical weight of sequence i, and the more repeats this sequence has, the higher the weight is; C
i
is the frequency at which the ith sequence occurs in this signal area; D
ε
represents the probability of the nucleotide ε in the signal element, i.e. the distribution of nucleotides, which is ε ∈ {A, T, C, G}; ε
i
is the frequency of ε in the ith sequence, where 1 ≤ i ≤ N, and N is the number of signal patterns considered.
Taking signals in the FUE as an example, the representative nucleotide output probability B was calculated based on the top 50 patterns [see Additional file 4]. First, we calculated the weight of every pattern by count, and then calculated the repeat times ε
i
of every nucleotide in these patterns. The nucleotide output probability B for FUE hence are 0.0485, 0.7740, 0.0479 and 0.1290 for A, T, C, and G, respectively.
Using the same method, the nucleotide output probability B for CE-L and CE-R are: CE-L: 0.09987, 0.74970, 0.06186, 0.08860; CE-R: 0.08520, 0.78700, 0.07050, 0.05680 for A, T, C, and G, respectively.
The NUE signals are slightly better conserved than other signals, and the transition from one nt to the next may be constrained. To present these interactions of hexamer signals, we used a subset of first order inhomogeneous Markov model to describe the feature information. A frequency transport matrix was used to analyze the 50 most predominant NUE signals [6]. The equation is shown below:
where PN(T/A) is the probability of a transition from state "A" to "T"; S(AT) is the sum of times of this transport; Σ
A
= S(AA) + S(AT) + S(AC) + S(AG). The same rule was used for the others. Thus, we obtained parameters of the NUE sub-model as: probability distribution of the first nucleotide, PN0 = [0.6276, 0.3563, 0.0001, 0.0161]. The distributions of the second to sixth nucleotides are listed below, respectively,
Therefore, for a certain hexamer sequence S = s1s2s3s4s5s6 we can calculate the probability of the NUE signal at the S position:
P[S | NUE] = PN0(s1)*PN1(s2|s1)*PN2(s3|s2)*PN3(s4|s3)*PN4(s5|s4)*PN5(s6|s5).
The same kind of first order inhomogeneous Markov model was established for the poly(A) site signal (YA in the model, Fig. 6A). For this, we randomly selected 1000 sequences from the 8K dataset and obtained the poly(A) site parameters listed blow. Initiated state CSPN0 = [0.0730,0.4630, 0.3250, 0.1390],
2nd state
Based on the same method, we calculated a dimer sequence S = s1 s2 and obtained the probability at S position: P[S|CS] = CSPN0(s1)* CSPN1(s2|s1).
The background parameters
Apart from the signal regions, there is not much information on the background states. Therefore, the parameters for background states were relatively random. Based on this condition, we first analyzed several basic factors in the background and modified them accordingly. The nucleotide output probabilities of background states were calculated by counting the nucleotide distribution of the whole sequence. We tested the distribution of nucleotides in the region of -160 to +100 nt.
The most important factor in the background state is the length between two signal states. Taking the background states near the poly(A) site as an example, Bg4 and Bg5 are located upstream and downstream of the poly(A) site, and both the CE-L and CE-R signal states could be about 10 nt distance from the poly(A) site. Therefore, the length of the background state can be set to a range with 0 nt to 10 nt. The maximum length D was set to 10. However, because the length of the background could change slightly, we set all the lengths to a uniform distribution, which can be calculated by P
i
(d) = 1/(1+D
i
), where P
i
(d) is the probability of the length of the background ith to be d, and D
i
is the possible maximum length of the ith background. All background lengths were set by this method. The initial possible maximum length of Bg1, Bg2, Bg3, Bg4, Bg5 and Bg6 were set to 100, 100, 20, 10, 10 and 15, respectively.
To identify the background region, we needed to consider the near signal region of both sides. For example, Bg3 lies between NUE and CE-L signal states, the range of NUE state is 10~30 nt and CE-L signal region is from 1 to 10 nt. Therefore, the range of Bg3 could be set in the center of these two regions which is 6~20 nt. The distributions of the background region and nucleotide output probability B are listed in [Additional file 5].
Formula for the output of scores
We applied a sliding 180 nt-wide window to calculate the output of scores for the sequences. For every nucleotide, our program computed a score in all windows that contained this nucleotide. The window slid along the entire sequence, combining values of forward-backward variables using the following equation for the output of the score at nucleotide t:
where w is all of the windows that include nucleotide t; PSw,tis the forward-backward algorithm's probability that nucleotide t is a poly(A) site in window w; the two constants, 120 and 2, are used to adjust the scores to be in a manageable range.
Calculation of sensitivity and specificity
The formulas for Sp and Sn calculations are given in the Results. The methods for the compiling false positive and false negative numbers are shown here. We employed a user defined value called threshold in these calculations. At a given threshold value (t), the score at an nt must be at least t in order for that nt to be a predicted poly(A) site. The False Positive sites (FP) were calculated as following: for a sequence of interest, let n represent the total number of nucleotides; let p represent the number of true poly(A) sites with a score equal or larger than a given t; let m represent the number of all sites with a score equal or larger than t. Then, FP = m-p. As one can see, using the 8K dataset sequences to calculate FP requires that all poly(A) sites have to be identified. Due to the fact that the identification of true poly(A) sites in a given 3'UTR is incomplete in plants (many alternative poly(A) sites may not be represented in the EST collection, or the dataset is not sufficiently large enough), sequences that are known to not possess poly(A) sites were used to tally FP. These sequences include protein-coding sequences, 5'-UTRs, and introns as indicated in "The Datasets" under Methods. Random sequences generated by preserving the trinucleotide distribution were also used. In these control datasets, FP = m. For True Positive sites (TP), at a given t, in the sequences with known poly(A) sites, TP = p. In the control sequences, TP = n-m. False negative (FN) is the number of actual sites that cannot be identified or predicted correctly. To calculate FN, let f represent the number of true poly(A) sites with a score smaller than a given t. Hence, at a given t, in the sequences with known poly(A) sites, FN = f.