This section details how ReRanker prioritizes candidate gene models on a target species by comparison with orthologs from a reference species. Subsections address the generation of candidate gene models, ortholog identification between the two species, the construction of similarity features between gene models, the format of scoring function of candidate gene models and learning of the reranker's scoring parameters.
Generating candidate gene models
Gene loci on the target species were first defined and candidate gene models for each locus are generated by Evigan. The term gene locus refers to a genomic region containing only a single gene. Gene loci on the target species were first identified by an initial prediction gene set produced by Evigan integrating multiple lines of evidence (Augustus, Genscan, Genie, Geneid, CONTRAST were used in the experiment). The genomic region defined by each gene in the initial prediction set is extended in both directions on the genomic sequence until the neighboring predicted genes are reached. Each such extended region is a gene locus. This procedure often produces thousands or tens of thousands of gene loci on the target species, depending on the size of the genome and the Evigan initial prediction set.
For each proposed gene locus, Evigan was used to generate the K best candidate gene models for the gene with the posterior probability for each, by integrating the evidence overlapping with the region. K is a parameter passed to the K best decoder in Evigan as the maximum number of alternative paths to be generated. If the aggregated evidence at this locus supports less than K candidate gene models, all possible models will be generated. The Kbest decoder [39] in Evigan uses a variation of the Viterbi decoding algorithm [40, 41] to search for high probability paths, with O(K N log N) computational complexity where N is the size of the standard Viterbi trellis, which is quite efficient. In the original Viterbi decoding implementation of Evigan, an optimal path may contain multiple genes, whereas in the implementation of the Kbest decoder only singlegene paths are returned. Note that the best candidate in the Kbest list for a locus may or may not be exactly the same as the initially predicted gene used to identify the locus. In practice, however, discrepancy is rarely observed.
Ortholog identification
Ortholog pairs between the target species and a reference species are identified by BLASTP [42] reciprocal best hits between the best candidate models (translated into protein products) on the target species and the proteins on the reference species. Specifically, if a gene's best candidate model on the target species and a protein from the reference species are reciprocal best hits by running BLASTP(default parameters, evalue cutoff set as 1e5), they are considered as an ortholog pair. This is a rather simplified approach for identifying orthologs but in practice it produces reasonably good results. More comprehensive approaches would be searching all candidate models of a gene against the reference proteins or examining multiple species and phylogentic relationships between the species [36, 43].
Reranking features
A variety of features were extracted from candidate gene models, including the posterier probabilities defined by Evigan and various similarity features determined by comparison with orthologous proteins/gene models. Note that these features could readily be expanded to include additional informative similarity features. In the current implementation, six features on a candidate gene model were extracted, as described below. Let t and r denote a candidate gene model (or its translated protein) on the target species and a protein/gene on the reference species, respectively.
Posterior probability
Let p(t) denote t's Evigan posterior probability given the evidence. The probability feature f_{1}(t) is the logarithm of p(t):
f_{1}(t) = log p(t)
Length similarity
Let l(t) and l(r) denote the coding sequence length of t and r. The length similarity feature f_{2}(t) is given by
{f}_{2}(t)=\mathrm{log}\left(\frac{l(t)l(r)+1}{l(r)+1}\right)
The absolute difference in the coding length of the two genes l(t)  l(r) is normalized by the coding length l(r) of the reference gene. (Normalizing by the coding length of the target gene model is not a good idea, because it may bias towards target candidate gene models that are very long or short.) The +1 term in the numerator and denominator smoothes the counts.
Splice count similarity
As with coding length, we also compare the number of splice sites in source and target. Let s(t) and s(t) denote the number of splice sites of t and r. The splice site feature f_{3}(t) is given by
{f}_{3}(t)=\mathrm{log}\left(\frac{s(t)s(r)+1}{s(r)+1}\right)
Again, the +1 term in the numerator and denominator smoothes the counts, and also prevents division by zero.
Sequence similarity
The sequence similarity feature between t and r is computed from the alignment score given by DiAlign [44], a multiple sequence alignment program. When two sequences are aligned, DiAlign first searches for multiple gapless local alignments, referred to as segments, and then constructs a global alignment between the two sequences by searching for the best set of consistent segments. In addition to producing gapless local alignments, DiAlign also provides for each segment an alignment score, which is basically the negative logarithm of the probability that two random sequences can be aligned as well as these two sequences. Suppose the coding sequences of t and r are aligned by DiAlign (translated alignment) and let A(t, r) denote the sum of the alignment scores for the segments constituting the global alignment and A(t, r) is roughly linear to the length of t and r. The sequence similarity feature f_{4}(t) is given by normalizing the alignment score by the length of r, or
{f}_{4}(t)=\frac{A(t,r)}{l(r)}
Shared splice sites
The segments produced by DiAlign can be used to extract another useful similarity feature: shared splice sites. Figure 3 shows the alignment betweeen the coding sequences of t and r output by DiAlign, where blue boxes represent gapless local alignments and wavy lines represent unaligned regions. Splice sites of t and r are mapped to the segments, as shown by the arrows in the figure. If a splice site of t and a splice site of r are mapped to the same relative position within a segment, as exemplified by the first and third pairs of splice sites in the figure, they are identified as a shared splice site. Let C(t, r) denote the number of shared splice sites identified by the above approach. The shared splice feature f_{5}(t) is given by
{f}_{5}(t)=\mathrm{log}\left(\frac{C(t,r)+1}{s(r)+1}\right)
Signal peptides
A signal peptide feature, f_{6}(t), represents the cooccurence of predicted signal peptide on t and r. The presence or absence of signal peptides on t and r is predicted by signalP3.0 [45]. Let S(t) and S(r) denote the presence or absence of signal peptides on t and r. Then the feature f_{6}(t) is given by
{f}_{6}(t)=\{\begin{array}{ll}1\hfill & \text{if}S(t)=S(r)=1\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array}
If the reference gene contains a signal peptide, target candidate gene models with signal peptides are preferred; If the reference gene does not contain signal peptide, no preference is imposed on target candidate gene models. The onesided nature of the feature is motivated by the relatively low abundance of signal peptides and the observation that signal peptide detection algorithms tend to focus on sensitivity rather than specificity. If the reference gene does not have a signal peptide while a target candidate model does, the candidate will not be penalized.
Scoring function
The features just described are used to compute a score S(t) for each candidate gene model t. The features of t are arranged into a feature vector f(t), and the score is defined by the inner product S(t) = f(t)·w, where w is a weight vector that will be learned from training data. Given K candidate gene models t_{1}, ..., t_{
K
}, the index of the highest scoring model is given by the decision rule
{k}^{\ast}=\mathrm{arg}\underset{k=\mathrm{1...}K}{\mathrm{max}}S({t}_{k})
Weight estimation
The parameter weight vector w in the scoring function is estimated from a training set D to optimize reranking accuracy using the MIRA online largemargin learning algorithm [46].
The training set D = {e_{1}, ..., e_{
N
}} is a set of training examples, where each example e ∈ D contains the set of candidate models for a training gene. More specifically, each e ∈ D has the form e = {(t_{
k
}, q_{
k
})k = 1, ..., K} where t_{
k
}is a candidate model and q_{
k
}is the quality of t_{
k
}relative to the reference annotation. In our experiments, q_{
k
}is the exonlevel Fscore (harmonic mean of sensitivity and specificity) for t_{
k
}relative to the reference annotation genes at t_{
k
}'s locus.
The MIRA learning algorithm [46] learns w by looping over the training examples and updating w at each example so that lowesterror candidate model is selected for the example by the decision rule given above. The weight vector w is initially the zero vector. The pseudocode "Outline of MIRA update" shows a single cycle of updating the weight vector. At each round, the algorithm fetches an example e from the training set, reranks its candidate models and selects the best predicted candidate t_{k*}using the current weight vector. The true best candidate is denote by {t}_{\widehat{k}}, given by the maximum quality assessment. The algorithm updates the weight vector by solving an optimization problem. The goals of the optimization problem are twofold: keep the new weight vector as close to the current weight vector as possible; and score the true best candidate higher than the predicted candidate by their quality difference {q}_{\widehat{k}}{q}_{{k}^{\ast}}. C is a weight factor balancing the two goals, which is set to 5 in the experiments. The algorithm will loop over the examples in the training set until the weight vector does not change significantly.
Outline of MIRA update
Given an example e = {(t_{
k
}, q_{
k
})k = 1, ..., K} and a current weight vector w_{
n
}, the updated weight vector w_{n+1}← MIRAupdate(e, w_{
n
}) is computed as follows:

Use the current weight vector w_{
n
}to rank the candidate models and select the index for best predicted candidate by k* = arg max_{k = 1...K}f(t_{
k
})·w_{
n
}

Let \widehat{k} be the index of the true best candidate \widehat{k} = arg max_{k = 1...K}q_{
k
}

Find the solution w, ξ for the following optimization problem:
\begin{array}{l}\underset{w,\xi}{\mathrm{min}}\left\rightw{w}_{n}{}^{2}+C\xi \hfill \\ \text{subjectto}w\cdot f({t}_{\widehat{k}})\ge w\cdot f({t}_{{k}^{\ast}})+({q}_{\widehat{k}}{q}_{{k}^{\ast}})\xi ,\xi \ge 0\hfill \end{array}

Set w_{n+1}= w.
It is common practice to consider the average of the updated weight vector at each round as the final output weight vector, because the average weight vector often gives better performance than individual weight vectors [46]. The pseudocode titled "MIRA algorithm wrapper" shows an algorithm wrapper that calls the MIRA update as a subroutine at each round and outputs a final weight vector.
MIRA algorithm wrapper
Given a training set D, the algorithm wrapper computes a weight vector w ← MIRAwrapper(D) as follows:

Initialize the weight vector w_{0} ← 0

Perform the following N times:

Get an example e from the training set D

Update the weight vector w_{n+1}← MIRAupdate(e, w_{
n
})

Output the average weight vector w\leftarrow \frac{{\displaystyle {\sum}_{n=1}^{N}{w}_{n}}}{N}
Evaluation
For each locus on the target species, Evigan's prediction is always the top gene model from the original candidate list generated by Evigan; ReRanker's prediction is the candidate model with the highest reranking score as described above. Performance of prediction sets is assessed by sensitivity and specificity on exon, transcript and gene level using the Eval program [37] (only coding parts were evaluated). Sensitivity is defined as the fraction of annotated exons (or genes) predicted correctly. Specificity is the fraction of the predicted exons (or genes) that correspond precisely to any exon (or gene) in the curated annotation set. Fscore is the harmonic mean of sensitivity and specificity. An exon is considered correct if its boundaries and reading frame are both correct. A gene is counted correct if all of its exons are precisely predicted. For genes with multiple transcripts, sensitivity and specificity were determined at the exon, transcript and gene levels. A transcript is considered correct if all its exons are accurately predicted. A gene is counted correct if one of its transcripts is predicted correctly.