Motif information content
Information content of motif X is
where i is the position, L is the length of the motif, and n is the each nucleotide A, C, G, and T. For the information content calculation based on N data set sequences, we added pseudocounts, using the background probability of each base frequency. We used the upstream 30 bp and downstream 20 bp from TIS sites for the calculation.
Experimentally validated data set for translation initiation sites
We used the EcoGene database [37] and Link data set [21] as reliable data sets of translation initiation sites in E. coli. The EcoGene database contains 862 proteins that were confirmed by N-terminal protein sequence identification. We removed from the data set a selenoprotein, release factor 2 (which is known to be synthesized by a + 1 frameshift), as well as two genes starting with ATT instead of canonical start codons (ATG, GTG, and TTG),.
The Link data set contains 195 genes; four of these are not consistent with the EcoGene data set. To construct a fully reliable data set, we removed these four genes (hdeB, leuB, lolA, and ydcG). For B. subtilis, we used a data set of 1248 'non-y' (i.e., experimentally characterized) genes [38] and checked them using the new GenBank annotation (NC_000964.2). Two genes had been removed in the new GenBank annotation, and three codons previously identified as start codons were changed to ATC, ATT, and CTG. We removed those data, leaving 1243 genes in the data set. We also included the more reliable 58 sequences confirmed by comparison with homologous sequences of Bacillus halodurans [38].
Constructing data set with sequence homology
When we apply Hon-yaku to a newly sequenced bacterial genome such as H. arsenicoxydans, we need to construct a reliable data set with strong sequence homology to experimentally validated genes. Using the currently available two data sets, the EcoGene data set and the B. subtilis non-y data set, we defined presumably correct start sites for genomes where experimental data on actual start sites is missing by using the set of related persistent genes ([16], this works for Proteobacteria and Firmicutes) aligning them individually with counterparts in model organisms (E. coli and B. subtilis), and choosing manually the start site.
We substantiated the procedure by comparison with diverse E. coli and B. subtilis data sets as follows:
-
1.
Pick up orthologous genes from the EcoGene data set or B. subtilis non-y data set.
We defined orthologous genes when two proteins display reciprocal best hit with at least 40% similarity in amino acid sequence and 20% or less difference in protein length [39]. We obtained 165 orthologous genes that belong to both the EcoGene data set and the B. subtilis non-y data set.
-
2.
Remove genes that are not aligned in TIS vicinity or that have two or more candidate TISs within 5 bp. With the 165 orthologous genes, we confirmed that 89% of the TIS position differences are less than 5 bp. We removed genes whose TISs is not located within 5 bp upstream or downstream from the experimentally validated TIS, and that have no other candidate TIS within these 5 bp vicinity. From these rules, we obtained a data set of 126 genes with 100% accuracy out of the 165 orthologous genes.
We applied this procedure to P. aeruginosa, B. pseudomallei, and H. arsenicoxydans to construct the training data sets.
Modeling to predict translation initiation sites
To construct a suitable score function, we applied Bayesian statistics to combine the following five elements:
-
1.
The motif sequence around the ribosomal binding site (RBS), identifying the RBS region using a weight matrix constructed from the reference data set
-
2.
The empirically determined distance between the RBS sequence and the start codon
-
3.
The base composition of the start codon
-
4.
The base composition of the beginning of the protein coding sequence with a position specific scoring matrix
-
5.
The empirically determined length of the protein
Additionally we took into account overlapping ORFs using the empirically determined intergenic distance distributions. This methodology requires only the positions of stop codons and evaluates all TIS candidates that are located between the stop codon to the nearest upstream stop codon. We used the annotation by running GeneMark [2] on the genome of H. arsenicoxydans and by using GenBank entries for the other organisms.
Motif search around the RBS
One of the most important elements for TIS prediction is the RBS, containing the RBS sequence AGGAG in E. coli [8] and AAGGAGGU in B. subtilis [40].
Different tools adopted different methods to model the RBS. Hannenhalli et al. used the RBS binding energy to find the RBS motif [41]. The program RBSfinder considers the number of hydrogen bonds to detect motifs complementary to the 3' end of the 16S rRNA [24]. GS-Finder uses the "Z-curve" method [42], which considers differences of the cumulative occurrence numbers for three kinds of base combinations [23]. GS-Finder considers the A, C, G, T contexts in a window. Recently, because of the remarkable progress in motif extraction tools and to avoid having to calculate the binding energy between an organism-dependent 16S rRNA and the mRNA, position specific weight matrices (0th order Markov Model) have been applied for describing the RBS sequence motif (ex. MED-Start [22]). In this paper, we also used a zeroth-order Markov model, while, in addition, we explored higher-order Markov models. To describe the motif sequences by a 1st-order Markov model, we denote the transition probability of the double bases "mn" as a
mn
= P (x
i
= n|xx-1= m). The probability that the motif sequence S
M
is generated by this model is then:
where i is the position and L is the length of the motif.
The log-likelihood ratio that the sequence S
M
is created by the model is
where is the weight matrix of 1st-order Markov chain for a nucleotide n at position i to be followed by the nucleotide m. We prepared one log-odds scoring matrix M
SD
to describe the conserved region around the ribosomal binding site, and another matrix M
DS
to describe the downstream adenine-rich region following the start codon. Those motifs are defined by multiple alignments. In this section, we described the 1st order Markov model. When comparing the 0th, 1st, and 2nd order Markov model in E. coli, B. subtilis, and Herminiimonas arsenicoxydans, we found that a 1st-order Markov model yields more accurate results in both E. coli and B. subtilis, whereas a zeroth-order model was most accurate for Herminiimonas arsenicoxydans (Table 3).
The empirically determined distance from a RBS sequence to a start codon
To describe the gap length between a RBS sequence and a start codon, we estimated the probability density distributions fdist (D
i
) of the distance D
i
from the RBS sequence to the translation initiation site, measured in base pairs, using a kernel density estimation based on Gaussian kernels (Figure 2) [43]. The two Gram-negative bacteria, E. coli and Herminiimonas arsenicoxydans, have similar distributions of the length between the RBS sequence and the TIS, while the Gram-positive bacterium B. subtilis has a longer average distance between the RBS sequence and the start codon. This agrees with the results of previous reports [38, 22].
Base composition of start codons
Table 1 shows the frequency of each start codon for the three bacteria. We also calculated the frequency of ATG, GTG, and TTG codons upstream and downstream of the true TIS to create a negative TIS data set (Eq. 9).
Distribution of protein length ratio
62.6% of the EcoGene data set genes start with the first possible translation initiation codon as the real CDS. We also used the distribution of the ratio of the protein length to the length of the longest ORF. The smallest ratio is 0.697 in the EcoGene data set, most genes show a ratio of over 0.95 (Figure 3).
Combining features around TIS
The Bayesian posterior probability that a gene starts from the translation initiation site TIS can be calculated as
where the prior probability Pprior (TIS) is calculated as the frequency of start codon. P (S, D
protein
|TIS) is the conditional probability that the sequence S is generated around a true translation initiation site, resulting in a protein coding region of length D
protein
. The sequence S around the TIS consists of the ribosomal binding site S
SD
, the start codon S
STC
, the sequence S
DS
content downstream of the TIS, and the remaining sequence S\S
SD
S
TIS
S
DS
. We can then decompose P (S, D
protein
|TIS) into six parts:
P (S, D
protein
|TIS)
= P (S
SD
|TIS)·fdist (DSD 2STC)·P (S
STC
|TIS)
·P (S
DS
|TIS)·fdist (D
protein
)·P (S\S
SD
S
TIS
S
DS
|background), (5)
fdist (DSD 2STC) is the probability that S
RSB
is generated at a distance DSD 2STCfrom the transcription start site, and fdist (D
protein
) is the distribution of the protein length.
Dividing by the background probability yields
where M
SD
and M
DS
are the value of the PSSM score for the RBS sequence and downstream region around the translation initiation site and P (STC|TSS) is the base composition of start codon, as determined from the E. coli known data set.
We define the score functions
score(TIS) ≡ ln Pprior (TIS) + M
SD
+ ln fdist (DSD 2STC)
+ ln P (STC|TSS) + M
DS
+ ln fdist (D
protein
). (7)
For the calculation of P (TIS|S, D
protein
), we can consider either an assimilation method(Eq: 8) or a discrimination method(Eq: 9). The assimilation method makes the assumption that the base frequency around an ATG, GTG, TTG codon that is not a start codon is the same as the whole genome background model.
where nonTIS represents an ATG, GTG, or TTG codon that does not function as a start codon.
In the discrimination method, we need to make negative data sets which explicitly model nonTIS features. In this case, we made two models, which represent the upstream (intergenic) region nonTIS
up
, and the downstream (in coding region) nonTIS
down
to distinguish between protein coding features and non-coding features.
In Hon-yaku, we calculate score (TIS) and the Bayesian posterior probability that a gene starts from the TIS for all translation initiation sites in the ORF.
Other contributing elements
To increase the prediction accuracy, we additionally considered the operon structure, and alternative candidate start codons that are either adjacent or separated by one codon.
If the two genes are arranged in a head-to-head configuration and the intergenic distance is under 100 bp, we added an empirically determined intergenic distance distribution ln (f
dist
(D
headtohead
)) to the score function (Eq. 7). If the two genes have the same direction and the intergenic distance is under 50 bp, we added an empirically determined intergenic distance distribution ln (f
dist
(Dtailtohead_under 50bp)) to the score function. Thus, we aimed to reduce mispredictions leading to genes with long overlapping sequence regions. This function also improves the prediction of genes with the start codon close to the previous stop codon, as often occurs in operons.
Another reason for incorrect predictions is that some genes have two start codon candidates close to each other. Especially when two candidates are contiguous, the distance function between the start codon and the RBS sequence f
dist
(DSD 2STC) gives ambiguous results. In this case, our algorithm chooses the TIS based on the distribution of the start codon location for MM and MXM amino acid sequences. We constructed the species-specific distribution in E. coli and B. subtilis and applied the E. coli distribution to other bacteria that have a small number of data set genes.
Except for this two neighboring start codon case, which had to be fixed as described above, we established the value of all other parameters using the training data set.
Cross validation
In this paper, we calculated accuracies of Hon-yaku with a leave-one-out cross validation analysis. To avoid showing only the overoptimistic performance rates of the leave-one-out measure, we also calculated the performance of our method with other cross validations. We trained our model with 90% or 80% of the true data set, while the randomly chosen remaining 10% or 20% are retained for subsequent use in evaluating our model. The procedure was repeated one thousand times.