This section details how ReRanker prioritizes candidate gene models on a target species by comparison with orthologs from a reference species. The subsections below address the generation of candidate gene models, ortholog identification between the two species, the construction of similarity features between gene models, the form of the scoring function for candidate gene models, and the learning of the reranker's scoring parameters.
Generating candidate gene models
Gene loci on the target species are first defined, and candidate gene models for each locus are then generated by Evigan. The term gene locus refers to a genomic region containing only a single gene. Gene loci on the target species are identified from an initial prediction set produced by Evigan by integrating multiple lines of evidence (Augustus, Genscan, Genie, Geneid and CONTRAST were used in the experiments). The genomic region defined by each gene in the initial prediction set is extended in both directions along the genomic sequence until the neighboring predicted genes are reached; each such extended region is a gene locus. This procedure typically produces thousands to tens of thousands of gene loci on the target species, depending on the size of the genome and the Evigan initial prediction set.
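As an illustration of the locus construction step, the sketch below derives locus boundaries from an initial set of predicted genes on one chromosome; the interval representation, 1-based coordinates and function name are illustrative choices, not part of the Evigan/ReRanker code.

```python
def build_gene_loci(predicted_genes, chrom_length):
    """Define one locus per initially predicted gene by extending its region in
    both directions until the neighboring predicted genes are reached.

    predicted_genes: list of (start, end) intervals (1-based, inclusive) for the
                     genes of the initial Evigan prediction set on one chromosome.
    """
    genes = sorted(predicted_genes)
    loci = []
    for i, (start, end) in enumerate(genes):
        left = genes[i - 1][1] + 1 if i > 0 else 1                           # stop at end of previous gene
        right = genes[i + 1][0] - 1 if i + 1 < len(genes) else chrom_length  # stop at start of next gene
        loci.append((left, right))
    return loci
```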
For each proposed gene locus, Evigan is used to generate the K best candidate gene models for the gene, together with a posterior probability for each, by integrating the evidence overlapping the region. K is a parameter passed to the K-best decoder in Evigan as the maximum number of alternative paths to generate; if the aggregated evidence at a locus supports fewer than K candidate gene models, all possible models are generated. The K-best decoder [39] in Evigan uses a variation of the Viterbi decoding algorithm [40, 41] to search for high-probability paths, with O(K N log N) computational complexity, where N is the size of the standard Viterbi trellis, which is quite efficient. In the original Viterbi decoding implementation of Evigan, an optimal path may contain multiple genes, whereas the K-best decoder returns only single-gene paths. Note that the best candidate in the K-best list for a locus may or may not be exactly the same as the initially predicted gene used to identify the locus; in practice, however, discrepancies are rarely observed.
Ortholog identification
Ortholog pairs between the target species and a reference species are identified by BLASTP [42] reciprocal best hits between the best candidate models (translated into protein products) on the target species and the proteins of the reference species. Specifically, if a gene's best candidate model on the target species and a protein from the reference species are reciprocal best hits under BLASTP (default parameters, e-value cutoff 1e-5), they are considered an ortholog pair. This is a rather simplified approach to identifying orthologs, but in practice it produces reasonably good results. More comprehensive approaches would search all candidate models of a gene against the reference proteins, or examine multiple species and the phylogenetic relationships among them [36, 43].
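The reciprocal-best-hit test can be sketched as follows, assuming BLASTP has already been run in both directions with tabular output (-outfmt 6) and the e-value cutoff applied; the file handling and function names are illustrative.

```python
def best_hits(blast_tabular_path):
    """Best hit (lowest e-value) per query from BLASTP tabular output (-outfmt 6)."""
    best = {}
    with open(blast_tabular_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            query, subject, evalue = fields[0], fields[1], float(fields[10])
            if query not in best or evalue < best[query][1]:
                best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(target_vs_ref_path, ref_vs_target_path):
    """Ortholog pairs: a target candidate and a reference protein that are each other's best hit."""
    target_to_ref = best_hits(target_vs_ref_path)
    ref_to_target = best_hits(ref_vs_target_path)
    return [(t, r) for t, r in target_to_ref.items() if ref_to_target.get(r) == t]
```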
Reranking features
A variety of features are extracted from candidate gene models, including the posterior probability assigned by Evigan and various similarity features derived from comparison with orthologous proteins/gene models. Note that this set could readily be expanded to include additional informative similarity features. In the current implementation, six features are extracted for each candidate gene model, as described below. Let t and r denote a candidate gene model (or its translated protein) on the target species and a protein/gene on the reference species, respectively.
Posterior probability
Let p(t) denote t's Evigan posterior probability given the evidence. The probability feature f1(t) is the logarithm of p(t):
f1(t) = log p(t)
Length similarity
Let l(t) and l(r) denote the coding sequence lengths of t and r. The length similarity feature f2(t) is given by
f2(t) = (|l(t) - l(r)| + 1) / (l(r) + 1)
The absolute difference in coding length between the two genes, |l(t) - l(r)|, is normalized by the coding length l(r) of the reference gene. (Normalizing by the coding length of the target gene model is not a good idea, because it could bias the feature towards target candidate gene models that are very long or very short.) The +1 term in the numerator and denominator smoothes the counts.
Splice count similarity
As with coding length, we also compare the number of splice sites in the target candidate and the reference gene. Let s(t) and s(r) denote the number of splice sites of t and r. The splice site feature f3(t) is given by
f3(t) = (|s(t) - s(r)| + 1) / (s(r) + 1)
Again, the +1 term in the numerator and denominator smoothes the counts, and also prevents division by zero.
Sequence similarity
The sequence similarity feature between t and r is computed from the alignment score given by DiAlign [44], a multiple sequence alignment program. When two sequences are aligned, DiAlign first searches for multiple gapless local alignments, referred to as segments, and then constructs a global alignment between the two sequences by searching for the best set of consistent segments. For each segment, DiAlign also provides an alignment score, which is essentially the negative logarithm of the probability that two random sequences could be aligned as well as these two. Suppose the coding sequences of t and r are aligned by DiAlign (translated alignment), and let A(t, r) denote the sum of the alignment scores of the segments constituting the global alignment; A(t, r) is roughly linear in the lengths of t and r. The sequence similarity feature f4(t) is obtained by normalizing the alignment score by the length of r:
f4(t) = A(t, r) / l(r)
Shared splice sites
The segments produced by DiAlign can be used to extract another useful similarity feature: shared splice sites. Figure 3 shows the alignment between the coding sequences of t and r output by DiAlign, where blue boxes represent gapless local alignments and wavy lines represent unaligned regions. Splice sites of t and r are mapped onto the segments, as shown by the arrows in the figure. If a splice site of t and a splice site of r map to the same relative position within a segment, as exemplified by the first and third pairs of splice sites in the figure, they are identified as a shared splice site. Let C(t, r) denote the number of shared splice sites identified by this approach; the shared splice site feature f5(t) is derived from C(t, r).
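A sketch of the shared splice-site count C(t, r), assuming the DiAlign segments have been parsed into (t_start, r_start, length) triples in coding-sequence coordinates; this representation and the function name are assumptions made for illustration.

```python
def count_shared_splice_sites(segments, t_splice_sites, r_splice_sites):
    """Count splice-site pairs of t and r that map to the same relative
    position within a gapless aligned segment.

    segments: list of (t_start, r_start, length) gapless local alignments.
    t_splice_sites, r_splice_sites: splice-site positions in the coding
    sequences of t and r, respectively.
    """
    r_sites = set(r_splice_sites)
    shared = 0
    for t_start, r_start, length in segments:
        for site in t_splice_sites:
            offset = site - t_start
            if 0 <= offset < length and (r_start + offset) in r_sites:
                shared += 1
    return shared
```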
Signal peptides
A signal peptide feature, f6(t), represents the co-occurrence of predicted signal peptides on t and r. The presence or absence of a signal peptide on t and r is predicted by SignalP 3.0 [45]. Let S(t) and S(r) denote the presence or absence of signal peptides on t and r. The feature f6(t) is set to 1 if signal peptides are predicted on both t and r, and to 0 otherwise.
If the reference gene contains a signal peptide, target candidate gene models with signal peptides are preferred; if the reference gene does not contain a signal peptide, no preference is imposed on the target candidate gene models. The one-sided nature of this feature is motivated by the relatively low abundance of signal peptides and the observation that signal peptide detection algorithms tend to favor sensitivity over specificity. Thus, if the reference gene lacks a signal peptide while a target candidate model has one, the candidate is not penalized.
Scoring function
The features just described are used to compute a score S(t) for each candidate gene model t. The features of t are arranged into a feature vector f(t), and the score is defined by the inner product S(t) = f(t)·w, where w is a weight vector learned from training data. Given K candidate gene models t_1, ..., t_K, the index of the highest-scoring model is given by the decision rule k* = arg max_{k = 1,...,K} f(t_k)·w.
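A minimal sketch of this scoring and selection step, assuming the six feature values have already been computed for each candidate; the array shapes and names are illustrative.

```python
import numpy as np

def select_best_candidate(feature_vectors, w):
    """Return the index of the highest-scoring candidate model.

    feature_vectors: (K, 6) array with one row f(t_k) per candidate model
    w              : (6,) learned weight vector
    """
    scores = feature_vectors @ w      # S(t_k) = f(t_k) . w
    return int(np.argmax(scores))     # decision rule: k* = arg max_k S(t_k)
```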
Weight estimation
The parameter weight vector w in the scoring function is estimated from a training set D to optimize reranking accuracy using the MIRA online large-margin learning algorithm [46].
The training set D = {e_1, ..., e_N} is a set of training examples, where each example e ∈ D contains the set of candidate models for a training gene. More specifically, each e ∈ D has the form e = {(t_k, q_k) | k = 1, ..., K}, where t_k is a candidate model and q_k is the quality of t_k relative to the reference annotation. In our experiments, q_k is the exon-level F-score (harmonic mean of sensitivity and specificity) for t_k relative to the reference annotation genes at t_k's locus.
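Concretely, the exon-level F-score used as the quality q_k is the harmonic mean of sensitivity (Sn) and specificity (Sp):

```latex
q_k = \frac{2 \, Sn \, Sp}{Sn + Sp}
```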
The MIRA learning algorithm [46] learns w by looping over the training examples and updating w at each example so that the lowest-error candidate model is selected for that example by the decision rule given above. The weight vector w is initially the zero vector. The pseudocode "Outline of MIRA update" shows a single cycle of updating the weight vector. At each round, the algorithm fetches an example e from the training set, reranks its candidate models and selects the best predicted candidate t_k* using the current weight vector. The true best candidate is the one with the maximum quality assessment. The algorithm then updates the weight vector by solving an optimization problem with two goals: keep the new weight vector as close to the current weight vector as possible, and score the true best candidate higher than the predicted candidate by at least the difference between their quality scores. C is a weight factor balancing the two goals; it is set to 5 in the experiments. The algorithm loops over the examples in the training set until the weight vector no longer changes significantly.
Outline of MIRA update
Given an example e = {(t_k, q_k) | k = 1, ..., K} and a current weight vector w_n, the updated weight vector w_{n+1} ← MIRA-update(e, w_n) is computed as follows:
-
Use the current weight vector w_n to rank the candidate models and select the index of the best predicted candidate: k* = arg max_{k = 1,...,K} f(t_k)·w_n
-
Let the index of the true best candidate be the one with the maximum quality: arg max_{k = 1,...,K} q_k
-
Find the solution w, ξ of the optimization problem described above: minimize the change from w_n while requiring the true best candidate to score higher than the predicted candidate by at least their quality difference, allowing a slack ξ ≥ 0 penalized by the factor C (a concrete closed-form version of this step is sketched below)
-
Set w_{n+1} = w.
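The following is a minimal sketch of one common closed-form solution to a single-constraint update of this kind (in the style of a passive-aggressive/MIRA update with a clipped step size); it illustrates the two goals above but is not necessarily the exact solver used in ReRanker, and the variable names are illustrative.

```python
import numpy as np

def mira_update(features, qualities, w, C=5.0):
    """One MIRA-style update on a single example e = {(t_k, q_k)}.

    features : (K, d) array of feature vectors f(t_k)
    qualities: (K,) array of quality scores q_k (exon-level F-scores)
    w        : (d,) current weight vector w_n
    C        : factor balancing closeness to w_n against the margin goal
    """
    k_star = int(np.argmax(features @ w))        # best predicted candidate under w_n
    k_hat = int(np.argmax(qualities))            # true best candidate (maximum quality)
    if k_hat == k_star:
        return w                                 # prediction already correct
    delta_f = features[k_hat] - features[k_star]
    required_margin = qualities[k_hat] - qualities[k_star]
    loss = required_margin - w @ delta_f         # violation of the margin constraint
    denom = float(delta_f @ delta_f)
    if loss <= 0 or denom == 0.0:
        return w                                 # constraint already satisfied, or no direction to move
    tau = min(C, loss / denom)                   # clipped step size
    return w + tau * delta_f                     # w_{n+1}
```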
It is common practice to take the average of the weight vectors produced at each round as the final output, because the averaged weight vector often performs better than any individual weight vector [46]. The pseudocode titled "MIRA algorithm wrapper" shows an algorithm wrapper that calls the MIRA update as a subroutine at each round and outputs the final weight vector.
MIRA algorithm wrapper
Given a training set D, the algorithm wrapper computes a weight vector w ← MIRA-wrapper(D) as follows:
-
Initialize the weight vector w_0 ← 0
-
Perform the following N times:
-
Get an example e from the training set D
-
Update the weight vector w_{n+1} ← MIRA-update(e, w_n)
-
Output the average weight vector (a code sketch of this wrapper follows)
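A corresponding sketch of the wrapper, performing a single pass over the training set and averaging the updated weight vectors; the update_fn argument stands for an update routine such as the mira_update sketch above, and the six-feature dimensionality follows the current implementation.

```python
import numpy as np

def mira_wrapper(examples, update_fn, dim=6, C=5.0):
    """Single pass over the training set D; returns the averaged weight vector.

    examples : iterable of (features, qualities) pairs, one per training gene
    update_fn: a MIRA-style update function, e.g. the mira_update sketch above
    """
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    n_examples = 0
    for features, qualities in examples:
        w = update_fn(features, qualities, w, C)   # w_{n+1} <- MIRA-update(e, w_n)
        w_sum += w
        n_examples += 1
    return w_sum / max(n_examples, 1)              # average of the updated weight vectors
```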
Evaluation
For each locus on the target species, Evigan's prediction is the top gene model from the original candidate list generated by Evigan, while ReRanker's prediction is the candidate model with the highest reranking score as described above. The performance of each prediction set is assessed by sensitivity and specificity at the exon, transcript and gene levels using the Eval program [37] (only coding regions were evaluated). Sensitivity is defined as the fraction of annotated exons (or genes) predicted correctly; specificity is the fraction of predicted exons (or genes) that correspond exactly to an exon (or gene) in the curated annotation set. F-score is the harmonic mean of sensitivity and specificity. An exon is considered correct if its boundaries and reading frame are both correct, and a gene is counted correct if all of its exons are precisely predicted. For genes with multiple transcripts, sensitivity and specificity were determined at the exon, transcript and gene levels: a transcript is considered correct if all of its exons are accurately predicted, and a gene is counted correct if at least one of its transcripts is predicted correctly.
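For illustration, exon-level sensitivity, specificity and F-score can be computed as below when exons are represented as exact tuples of boundaries, strand and reading frame; the actual results in this work were produced with the Eval program [37], so this sketch is only a simplified stand-in.

```python
def exon_level_scores(predicted_exons, annotated_exons):
    """Exon-level sensitivity, specificity and F-score.

    Exons are tuples such as (chrom, start, end, strand, frame); an exon counts
    as correct only if it matches an annotated exon exactly.
    """
    predicted = set(predicted_exons)
    annotated = set(annotated_exons)
    correct = len(predicted & annotated)
    sn = correct / len(annotated) if annotated else 0.0   # sensitivity
    sp = correct / len(predicted) if predicted else 0.0   # specificity
    f = 2 * sn * sp / (sn + sp) if (sn + sp) else 0.0     # F-score (harmonic mean)
    return sn, sp, f
```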