Detecting small plant peptides using SPADA (Small Peptide Alignment Discovery Application)
© Zhou et al.; licensee BioMed Central Ltd. 2013
Received: 2 August 2013
Accepted: 15 November 2013
Published: 20 November 2013
Skip to main content
© Zhou et al.; licensee BioMed Central Ltd. 2013
Received: 2 August 2013
Accepted: 15 November 2013
Published: 20 November 2013
Small peptides encoded as one- or two-exon genes in plants have recently been shown to affect multiple aspects of plant development, reproduction and defense responses. However, popular similarity search tools and gene prediction techniques generally fail to identify most members belonging to this class of genes. This is largely due to the high sequence divergence among family members and the limited availability of experimentally verified small peptides to use as training sets for homology search and ab initio prediction. Consequently, there is an urgent need for both experimental and computational studies in order to further advance the accurate prediction of small peptides.
We present here a homology-based gene prediction program to accurately predict small peptides at the genome level. Given a high-quality profile alignment, SPADA identifies and annotates nearly all family members in tested genomes with better performance than all general-purpose gene prediction programs surveyed. We find numerous mis-annotations in the current Arabidopsis thaliana and Medicago truncatula genome databases using SPADA, most of which have RNA-Seq expression support. We also show that SPADA works well on other classes of small secreted peptides in plants (e.g., self-incompatibility protein homologues) as well as non-secreted peptides outside the plant kingdom (e.g., the alpha-amanitin toxin gene family in the mushroom, Amanita bisporigera).
SPADA is a free software tool that accurately identifies and predicts the gene structure for short peptides with one or two exons. SPADA is able to incorporate information from profile alignments into the model prediction process and makes use of it to score different candidate models. SPADA achieves high sensitivity and specificity in predicting small plant peptides such as the cysteine-rich peptide families. A systematic application of SPADA to other classes of small peptides by research communities will greatly improve the genome annotation of different protein families in public genome databases.
A major challenge in translating new genome sequences into useful community resources is the accurate annotation of genes and other functionally-relevant features . While there have been clear improvements in gene prediction algorithms , accurate prediction of small one and two-exon genes remains stubbornly problematic . False-positive signals arising from the poor specificity of promoter motifs and other commonly-used signals employed by general purpose gene-finding algorithms are widespread [4–7]. To address the flood of false-positive signals for small genes, many annotators filter out small-gene predictions lacking direct experimental expression evidence, resulting in a major problem of false negatives [3, 4].
We propose here an alternative and complementary strategy for genome-wide annotation – a strategy that has as its strength predicting the small one- and two-exon genes that all-purpose gene-finding algorithms often fail to predict accurately. Our approach focuses on finding all related paralogous genes within a target gene family and then using signals from the corresponding multiple sequence alignment to aid in refining the model predictions. We have implemented this approach in an open-source and freely available application called SPADA (Small Peptide Alignment Discovery Application). SPADA can be used directly with a user’s own protein family alignments or with a comprehensive set of protein family alignments from public sources such as Pfam , InterPro  or PROSITE , enabling the exhaustive discovery of essentially all members of the input families within a given genome sequence. Because these public resources continue to expand and include new and novel protein families, SPADA’s ability to comprehensively identify arbitrarily large families of small peptides in genomes will steadily grow.
Here we describe the conceptual basis of SPADA and go onto test its performance with selected families of notoriously difficult genes to annotate properly - specifically, plant Cysteine-Rich Peptides (CRPs) in two model plant species (Arabidopsis thaliana and Medicago truncatula), the S-Protein homologue (SPH) family in A. thaliana, and the alpha-amanitin toxin gene family in the mushroom Amanita bisporigera. In the case of CRPs, we examine the accuracy and recall compared to published composite test/training sets for these species based on previous semi-manual curation and subsequent experimental expression validation [11–13]. We also compare SPADA’s performance against a range of commonly-used generic gene-prediction algorithms [14–17], providing evidence of SPADA’s advantage in identifying these challenging classes of small peptides.
SPADA is a computational pipeline that, when provided with a multiple sequence alignment for a gene/protein family of interest, identifies all members of this family in a target genome. Technically, SPADA’s pipeline is a general homology-based gene finding program with specifically enhanced power to detect and annotate small peptides with one or two exons. Unlike general-purpose gene prediction programs such as Fgenesh , SPADA works on an entire gene family at one time - with the goal of finding all family members in the genome. Unlike other homology-based gene predictors such as Genewise  and Exonerate  that map a single protein sequence to the target genome, SPADA performs a similarity search using a profile alignment and identifies all homologs of the family. In addition, SPADA provides automated access to both similarity search tools (e.g., BLAST  and HMMER ) and ab initio gene predictors (e.g., Augustus), significantly improving the annotation efficiency of multi-member gene families. As shown in Figure 1, SPADA consists of four consecutive components:
Model Evaluation & Selection
In SPADA, HMMER  is first used to identify hits in the target genome sequence (translated in six reading frames) as well as in the target proteome (if available) using a reasonably generous E-value (10). These hits are then tiled with regard to their genomic coordinates and merged into overlapping clusters. Finally, one best hit in each cluster is picked to generate a list of candidate hits.
The pipeline then allows the user to run one or more processes to predict gene structures for these potential genes. By default, SPADA runs Augustus  using hit locations as clues for "CDS regions" (coding sequence). In parallel, SPADA runs a custom pipeline optimized for predicting the exon boundaries of genes containing one or two exons using GeneWise , SplicePredictor  and custom Perl scripts.
In the next step, all gene structure predictions are combined to make a raw calling set, with each hit having one or more gene structure predictions. SPADA uses multiple statistics to assess the confidence of each candidate gene model, including an alignment score (mean pairwise score with known members in the original family-specific multiple sequence alignment), an HMM alignment score (sum of posterior probability scores in the Hmmsearch output file), the presence/absence of proper start/stop codons, as well as the SignalP D-score  in the case of secreted peptides.
Finally, the best candidate gene model is picked for each hit and the resulting set is filtered using empirical cutoffs (hmmsearch E-value of 0.001) to remove false positives.
The original motivation for developing this pipeline was accurately identifying and predicting Cysteine-Rich Peptides (CRPs) in plant genomes. For this purpose, SPADA comes with a complete set of manually-curated protein sequence alignments for plant Cysteine-Rich Peptide (CRP) families . In 2007, Silverstein et al. built multiple sequence alignments for most plant CRP families through iteratively scanning EST sequences from different plant species in TIGR’s Gene Indices . These alignments were re-aligned here using ClustalO  and trimmed using trimAl  to remove spurious sequences and poorly aligned positions. Finally, a profile Hidden Markov Model (HMM) was built for each CRP family using "hmmbuild" in the HMMER package .
As a general homology-based gene finding program, SPADA has been designed to work with any set of protein families. Users can start with a list of amino acid sequence alignments of their own interest, run the script "build_profile.pl" to generate custom HMM profiles, and initiate the pipeline using the new HMM(s). With this in mind, we have tested SPADA’s performance on an additional protein family as a proof of concept (see Results section) and assessed its applicability to secreted protein families other than CRPs.
In SPADA, genome FASTA sequences are translated in all six reading frames to amino acid sequences and then Open Reading Frames (ORFs) are extracted by breaking up these long amino acid sequences using stop codons. Here an ORF is defined as a segment of amino acid sequence with at least 15 residues and uninterrupted by stop codons. Extracting ORFs from the original translated genomic sequence reduces the target database size for the subsequent motif mining step and improves sensitivity. Using ORFs also ensures that no protein-coding exon spans stop codons in the middle of a sequence and that each exon will have a reasonable length. In theory, all protein-coding exons should locate within these ORFs, which will be discovered in the next motif-mining step. However, the exact exon boundaries are still unclear at this point of the search procedure.
If a gene annotation file in General Feature Format version 3.0 (GFF3) is available, SPADA can also read and process it, extracting the amino acid sequences of existing annotations and passing them onto the next motif mining step. In doing so, exon boundaries can be better refined, further improving the accuracy in the model prediction step.
Hard-masking of genome sequences is recommended (replacing repetitive sequences with 'N’s) before running the pipeline. Some plant species have very large genomes with highly repetitive content (e.g., Maize ). By hard-masking the genome sequence, the target database size in the motif-mining step is effectively reduced, significantly improving the search sensitivity of the entire pipeline. However, if many family members locate in repeat-rich genomic regions (such as the fungi effector families ), the unmasked genome version should be used.
In SPADA, profile HMMs are used to search against translated genomic sequences (and known protein sequences, if available) using "hmmsearch", a component of the HMMER package (v3.0) . This program finds significant hits against a protein sequence database using one or more profiles as inputs. The output of the scan is a list of genomic intervals with significant sequence similarity to query profiles and amino acid sequences translated from these intervals. For single-exon genes, a contiguous stretch of amino acid sequence in the target databases will be discovered, roughly corresponding to the exon in the original genomic sequence. For genes containing two or more exons, partial amino acid sequence hits corresponding to different exons will be separated by introns (if they share a reading frame) or distributed in different target sequences (if in different reading frames). SPADA collects all these full and partial hits in translated protein sequences, recovers their original genomic coordinates, filters out low-significance hits (E-value lower than 0.1), selects the most significant hit for each genomic interval since multiple input profiles may hit the same region, and merges nearby partial hits. During this merging step, SPADA requires that each neighboring partial hit should hit a different segment (either upstream or downstream) in the input profile HMM. The merged genomic intervals (called "extended hits") roughly correspond to the multiple exons in the underlying gene model - although the exact intron-exon boundaries and start/stop codon locations are yet to be refined at this stage of the procedure.
In parallel, SPADA searches against existing protein sequences (generated using the GFF3 annotation file), yielding a separate list of hits to the input profiles. These hits are also treated as partial hits, i.e., mapped to their original genomic coordinates and then used to build "extended hits". This "hmmsearch against proteome" step is considered complementary to the abovementioned "hmmsearch against translated genome" step, since it improves prediction sensitivity by capturing otherwise non-significant partial hits in the translated genome search.
At this point, SPADA has generated a list of "extended hits" approximately corresponding to actual exon boundaries. For each extended hit the surrounding genomic sequence is extracted. By default, 2500 bp upstream from the hits are extracted, since the first exon (containing the signal peptide) is usually separated from the second exon (with the mature peptide) by an intron up to 1500 bp, as determined by manual curation and understanding of plant genomes. At the other end, 1500 bp downstream from the hit boundaries are extracted, since the correct stop codon can typically be found within 1000 bp downstream of the HMM hit . SPADA next runs one or more components (selected by the user) in parallel to determine gene structure in this region. A total of five prediction components are currently supported by the pipeline: Augustus , GeneWise , GlimmerHMM , GeneMark  and GeneID . By default SPADA only runs two of these components (Augustus and GeneWise) since performance evaluation on a group of common plant peptides suggests that running all five of them does not offer a significant extra gain compare to running just two of them (see Results & Discussion).
The first component, which we denote "Augustus_evidence", runs Augustus  in its "evidence mode". The genomic sequence is used as input along with a "hint file" providing the program instructions for which part(s) of the input sequence are known to be part(s) of the coding sequence. In other words, location information of extended hits is incorporated in the prediction process. Augustus will then try to complete the gene model by looking for start/stop codons and canonical donor-acceptor splice sites around the hits while preserving the open reading frame. The improvement in prediction accuracy and specificity by running Augustus in the "evidence mode" (as compared to the "Augustus de novo mode") is significant and will be discussed in Results & Discussion.
In parallel, SPADA runs a custom pipeline specifically designed to identify and predict genes with one or two exon(s) and with a leading signal peptide. This component, which we denote "Genewise+SplicePredictor", first runs GeneWise  to align the extended hit sequence (translated to amino acid sequence) to genomic sequence and identifies compatible splice sites that preserve the hit ORFs. If GeneWise fails due to non-canonical splice sites, SPADA then runs SplicePredictor  to find all possible donor/acceptor splice sites and extracts compatible ones, extending the ORFs to the nearest start codon and stop codon. In practice, this "Genewise+SplicePredictor" approach works well as a complement to the "Augustus_evidence" approach (see Results and Discussion).
At this stage in the pipeline, SPADA reports all compatible gene models predicted by the two components. These candidate models are then passed on to the next step for evaluation in order to generate a best calling set.
For each extended hit, SPADA then evaluates the underlying candidate models using a number of measures and picks the most "confident" model for output. These evaluation statistics include the presence of start/stop codons at the beginning/end of the model, the presence of inframe stop codons, the SignalP score  in the case of a secreted gene family, and in particular, the Multiple Sequence Alignment (MSA) score and the "Hmmsearch Probability" (HmmProb) score, as described below.
In theory, the correct gene model should encode an amino acid sequence that aligns to the original family-specific protein alignments better than any other candidate models. To calculate the MSA score, SPADA aligns the amino acid sequence of the candidate model to the profile alignment using ClustalO "profile-to-profile" mode . SPADA then scores all pairwise alignments using BLOSUM80 scoring matrix  and calculates a mean alignment score. The BLOSUM80 matrix is used instead of BLOSUM62 because the sequences that are being aligned tend to be fairly similar to each other, and a matrix with more conserved target frequencies such as BLOSUM80 should be more reasonable. Ideally, the candidate model with the highest MSA score should be the most probable model.
Nevertheless, the MSA score is not sufficient to pick the best model, since candidate models are sometimes too close to each other in sequence and the MSA scores may not vary appreciably among model alternatives. Therefore SPADA also calculates an "Hmmsearch Probability" Score for each candidate model. In theory, if hmmsearch is run using the original family HMM against all candidate models, the most significant hit in the output should then be the best model. In practice, the probability score in the hmmsearch output serves as a better predictor than the E-value itself, especially when a model contains more than one hit domain. The MSA score and the HmmProb score are used to evaluate each candidate model. SPADA then picks the best candidate model that meets the following criteria: (1) it has a SignalP D-score of no less than 0.4 (determined according to the software manual, this option could be turned off to allow prediction of non-secreted gene families); (2) it has proper start/stop codons and no premature stop codon; and (3) it has the highest (MSA score + HmmProb score).
SPADA uses a relatively relaxed E-value cutoff in the motif mining step (e.g., 10 for running hmmsearch) in order to increase specificity. This also results in numerous false positive hits. These hits will generally not have valid candidate gene models built for them in the model prediction step, and thus would not make it into the ultimate output. However, SPADA does employ a final filtering step based on hmmsearch E-value to refine gene models that are retained. We evaluated the performance of the pipeline under different final E-value cutoffs (see Results and Discussion) and set the default cutoff to 0.001, which may be adjusted by the user to achieve customized search purposes. For all gene models passing the filter, SPADA outputs the sequences in FASTA format and gene coordinate information in GFF format. SPADA also generates for each gene family a multiple sequence alignment including all predicted models and the family-specific consensus sequence. If a gene annotation file has been passed to the pipeline, SPADA will also report the comparison results of predicted models with existing annotation (e.g., the number of models with exactly the same exon boundaries, models with partial overlap, models in different reading frames, etc.).
We have compiled a test set of plant cysteine-rich peptides (CRPs) in two genomes: the model dicotyledon, Arabidopsis thaliana, and the model legume, Medicago truncatula. In previous work, Silverstein et al. have exhaustively searched and curated all 516 CRP families (CRP0000 - CRP6250) in Arabidopsis . A large number of CRPs have also been identified and curated in an early release of the Medicago genome sequence . Recently, as a collaborative effort with J. Craig Venter Institute (JCVI), we expanded this list of CRPs in M. truncatula. We manually inspected and curated 136 CRP families (CRP0000 - CRP1530, focusing specifically on the Defensin-Like proteins or DEFLs) in M. truncatula. Finally, we now have a complete list of CRP members for Arabidopsis and Medicago (742 for Arabidopsis and 725 for Medicago, Additional file 1: File S1).
We collected evidence from multiple sources to validate the expression of the models in the compiled test set. On the one hand, extensive RNA-Seq data were downloaded from NCBI Sequence Read Archive  for both Arabidopsis and Medicago; on the other hand, we downloaded the AtMtDEFL microarray dateset [12, 13] to find additional support for expression of these gene families. The AtMtDEFL array include probe sets for 317 Arabidopsis DEFLs, 15 Arabidopsis DEFL-related Genes (MEGs), and 684 Medicago DEFLs, plus additional marker genes. In total, 583 (78.6%) out of the 742 CRPs in the Arabidopsis test set and 657 (90.6%) out of the 725 Medicago CRPs receive support from either RNA-Seq (FPKM >1) or microarray data (Additional file 2: Table S3 and Additional file 3: Table S4). These carefully curated, high-quality CRP calls were then taken as our test set in evaluating the performance of different model prediction components in SPADA under different hmmsearch E-value cutoffs. All experimental procedures complied with the guidelines of the Institutional Biosafety Committee (IBC) at University of Minnesota (IBC code: 1301-30313H).
We tested a number of popular gene prediction programs as SPADA model prediction components. In addition to the previously mentioned components, we also tested GeneID (v1.4.4) , GlimmerHMM (v3.0.1) , GeneMark (v3.9d) , and Augustus (v2.6.1, de novo mode). The "Augustus_evidence" differs from the "Augustus_de_novo" component simply by the inclusion of a "hint file" (with hit location information) fed to the program. We evaluated the pipeline performance running these components (individually or in combination) based on our curated test dataset (see Results and Discussion), and decided to use the "Augustus_evidence" and "Genewise+SplicePredictor" approaches as default components in the SPADA model prediction step.
Prediction performance was measured at two different levels: coding nucleotide sequence and exonic structure. At each level, we measured the sensitivity and specificity for each component. We first define the true positives (TP, number of coding nucleotides that are correctly predicted as coding), true negatives (TN, number of noncoding nucleotides that are correctly predicted as noncoding), false negatives (FN, number of coding nucleotides predicted as noncoding) and false negatives (FN, number of noncoding sequences predicted coding). At the nucleotide level, Sensitivity (Sn) is then defined as the proportion of coding nucleotides that have been correctly predicted as coding , while Specificity (Sp) is the proportion of predicted coding nucleotides that are actually coding . At the exon level, Sn is the proportion of actual exons in the input sequence that are correctly predicted, while Sp is the proportion of all predicted exons that are correctly predicted . Other measures such as Correlation Coefficient (CC) and Average Conditional Probability (ACP) were not evaluated since they require the calculation of TN nucleotides/exons, which are noncoding regions that are predicted as noncoding. Unlike a general gene-finding program that tries to predict all coding genes in a given sequence, SPADA focuses only on coding genes that are significantly similar to a given profile, while ignoring all other genes. Consequently "TN" statistics is not straightforward to evaluate in this context.
Performance evaluation was done in both A. thaliana and M. truncatula. The extracted genomic sequences were used as input sequences. We evaluated the pipeline performance using each of the "GeneID", "Augustus_de_novo", "GlimmerHMM", "GeneMark", "GeneWise+SplicePredictor", "Augustus_evidence" component (individually), as well as "SPADA" (combination of "GeneWise+SplicePredictor" and "Augustus_evidence") and "All" (combination of all 6 individual components). All programs were installed and run locally on a GNU/Linux workstation. The appropriate parameter files, model files and training directories, if available, were used to run these programs in each species, otherwise the default parameter files (which are for Arabidopsis) were used. The output of these runs were parsed to derive a unique prediction for each test sequence.
We mapped the RNA-Seq short reads (downloaded form NCBI SRA) to the reference using TopHat and summarized the results using Cufflinks . Cufflinks is able to estimate the expression value at the level of transcripts. We used a cutoff of FPKM (Fragments Per Kilobase of exon model per Million mapped fragments) > 1 to determine if a model (either in the test set or in the SPADA prediction set) is expressed.
For the AtMtDEFL array, PMA (Present, Marginal and Absent) calls and normalized expression values of each probe set were obtained from the supplemental tables of two recent papers [12, 13]. We mapped the probe sequences to the transcript models in the test set as well as SPADA prediction. In many cases the annotated gene boundaries are not complete and lack portions of the 3’-UTR that is prioritized in Affymetrix designs, and the probes designed in these regions would not be mapped. As a result, we require at least six probes in a probe set matching the target gene (with 23 or more identical nucleotides for each 25-mer oligo probe). Finally, the PMA calls of a probe set should be 'Present’ in at least one tissue/treatment condition to indicate expression support for the transcript model it is mapped to.
In order to visualize some of the novel SPADA predictions as compared to the original genome annotation, as well as the underlying RNA-Seq read mapping support, we loaded the data (genome sequence file, annotation GFF file, SPADA prediction GFF file, RNA-Seq mapping BAM file) into IGV (Integrative Genomics Viewer ), adjusted the width of each track, and made screenshots.
Using our manually-curated high-quality CRP test set from Arabidopsis and Medicago, we first evaluated the performance of SPADA under different search E-value thresholds. Generally speaking, with a loose E-value threshold (e.g., 0.1), SPADA is able to predict almost all true models (i.e., achieving high sensitivity) while making many false predictions (i.e., specificity is low) (Additional file 4: Figure S1). By setting the search threshold to a more stringent value, SPADA avoids making most of the false predictions, but also loses a small number of true models. In an effort to optimize search sensitivity (the ability to detect all true gene models) and specificity (prevent detection of spurious false models; refer to the "Method / Performance evaluation / Evaluation procedure" section for a formal definition of Sensitivity and Specificity), we set the default search E-value threshold to 0.001. Users can also change the default E-value threshold to build custom searches (e.g., a very sensitive search using E-value cutoff of 1 to identify all potential hits).
These results are expected as SPADA does not work as a general gene finding program but instead focuses on particular classes of genes with known profiles. Small genes are typically difficult to predict and often missed by genome annotation pipeline due to the intrinsic properties of many automatic gene finding algorithms . In our test with GeneID, Augustus and GlimmerHMM against the Medicago genome, the Arabidopsis training matrix was used since a Medicago specific one is not yet available. This explains to a large extent the extremely low sensitivity performance for these three programs in Medicago. Search specificities were generally quite high and did not vary much among different programs or genomes tested, indicating the relatively stringent search E-value (0.001) in effect allows few false positives.
Cysteine-Rich Peptides (CRPs) predicted in A. thaliana and M. truncatula
Nodule Cysteine-Rich peptide
Ripening related protein
Rapid Alkalinization Factor
Root cap/late embryogenesis
Antimicrobial peptide MBP-1
Bowman Birk inhibitor
Pollen Ole e I
ECA1 gametogenesis related
Lipid transfer protein
Proteinase inhibitor II
Kunitz type inhibitor
The default E-value threshold of 0.001 is a compromise between sensitivity and specificity that generally works well for both organisms. For the purpose of identifying all potential small coding genes, a search with high sensitivity should be performed since it allows the user to see all potential hits and then determine for him/herself the boundary between false predictions and true predictions based on search scores. The users can then set the cutoff threshold empirically and select genes for experimental verification on their own. Thus, we also report here two CRP prediction sets by running SPADA using E-value threshold of 1 (Additional file 13: Table S7 and Additional file 14: Table S8). In practice, users are encouraged to change the default E-value threshold to build custom searches.
Novel CRP models identified by SPADA determined by manual inspection
Number of unannotated predictions a
Number of novel models b
We then performed manual inspection on these unannotated CRP models and tried to determine whether the calls are truly bad predictions (e.g., pseudogenes with pre-mature stop codons) or valid members of the family missed by current genome annotation (criteria being that the predicted model fits well in the family-specific alignment and has either RNA-Seq or Affymetrix expression support). The number of "novel" CRPs discovered in this fashion, is given in Table 2 (Additional file 15: File S3). SPADA was able to identify 77 novel CRPs in Medicago that were missed by current genome annotation pipeline. The actual number of new CRPs in Medicago will be even higher since we only evaluated a subset of all CRP groups (CRP0000-CRP1530). This result is not unexpected given that the Medicago genome was released only recently and resources and efforts put into the genome annotation pipeline have been limited. On the other hand, only 3 novel CRPs were found in Arabidopsis, suggesting a relatively higher quality of gene calls in this extensively studied model organism.
Through examination of the novel CRPs, we also noted that while some of the novel hits have a very significant hmmsearch E-value (e.g., 10-12), most have moderate E-values (e.g., 10-4-10-7), suggesting that their sequence similarity to the input HMM profile is limited. While the input profile alignments were manually built and may not be exhaustive in capturing all groups of CRPs, we speculate that some of these novel CRPs might form new clades that define novel profile alignments, separate from the original alignment. Consequently, a new round of genome scans using these novel profiles has the potential for capturing even more members that have been missed in the previous search.
In order to assess whether SPADA could be useful in searches for families of small non-secreted peptides outside the plant kingdom, we also examined the fungal cyclic peptides of Amanita mushrooms . This family includes the amatoxins and phallotoxins, such as α-amanitin and phalloidin, respectively, which are synthesized as proproteins of 34-35 amino acids. We began by creating a multiple sequence alignment via ClustalO of reported proproteins  and executed SPADA as usual with the signal peptide filter turned off, using Arabidopsis as the training model for Augustus in searching the low-coverage genome contigs of Amanita bisporigera. As a negative control, we scanned the genome of Amanita thiersii, which is non-toxic and not known to produce this class of toxins.
Unlike general-purpose gene predicting programs, SPADA works as a family-based gene finder. The major difference between SPADA and general gene predicting programs is that it incorporates prior information from the family profile in the prediction process. SPADA takes advantage of generic gene prediction programs, but goes a step further by suggesting where to look for family members. Through scanning the target genome using pre-built family-specific alignments, SPADA identifies and builds "extented hits" that serve as the backbone of the underlying exonic structure. This location information greatly improves prediction accuracy, as shown by the different performances of "Augustus_de_novo" and "Augustus_evidence" components in Figure 2. Among the six individual predicting components, the four that do not require additional information and make de novo predictions all yield low sensitivities. The other two approaches, "Augustus_evidence" and "GeneWise+SplicePredictor", make use of the location information and are able to predict most of the true positives.
Family-based gene prediction was first introduced in the AUGUSTUS package as the AUGUSTUS-PPX (Protein Profile eXtension) module . Although AUGUSTUS-PPX was shown to be more sensitive and accurate in predicting long, multi-exon gene family members than the standard AUGUSTUS algorithm, its approach is not suitable for small, divergent peptide families such as CRPs, SPHs or Amanita toxin-like peptides examined here. Rather than using the entire protein family alignment profile as input, AUGUSTUS-PPX makes use of conserved, ungapped blocks from the alignment to make a profile. This enables the algorithm to identify core match regions in the genome sequence which together act as a scaffold in the gene prediction. A modification of the standard AUGUSTUS gene-centric HMM is then used to fill in the pieces between scaffold elements with splice elements and other signals, ultimately emitting the most probable full gene structure. While this approach works well for families of typical genes with large numbers of conserved elements, it completely breaks down when applied to small, divergent peptide families like the CRPs, as these families tend to contain no conserved, ungapped regions of appreciable size to seed the initial scaffold. Indeed, when we applied AUGUSTUS-PPX to the CRPs we observed no improvement over "Augustus_de_novo".
The default SPADA pipeline (running the "Augustus_evidence" and "GeneWise+SplicePredictor" components) achieves even higher sensitivity than the two individual components. This owes to the model evaluation & selection step. For each HMM hit, SPADA collects all candidate gene models built by its model predicting components, and in the model evaluation step, picks a best candidate model based on multiple evaluation statistics. True family members will probably get a high-scoring gene model, while most false positive hits will have no qualifying or only low-scoring gene models built. High-scoring gene models that passed the filter are more likely to be true models since they are the ones that best fit the family-specific alignment.
SPADA identifies paralogous gene family members throughout the genome. Many of these predictions currently lack expression evidence and some of the gene predictions have premature stop codons suggesting they may be psuedogenes. Nonetheless, it is important to identify all gene family members, regardless of their expression and pseudogene status, especially in evolutionarily dynamic gene families,. The semi-automated approach that inspired SPADA’s development identified hundreds of defensin-like genes in Arabidopsis which, at the time, had no expression evidence . Later, these genes turned out to be highly specifically expressed in reproductive tissues not previously examined with earlier genome-wide expression approaches . Moreover, one must also be careful not to discard pseudogene predictions that are highly similar to other family members. A gene that appears as a pseudogene in the reference sequenced accession of a species may indeed be fully intact in other accessions, as observed among the defensin-like pollen-tube attractant, AtLUREs . In their study, Takeuchi and Higashiyama observed half a dozen AtLUREs with disabling mutations in non-reference accessions, as well as putative functional and intact forms of AtLUREs 1.5 and 1.6, which are pseudogenes in the reference Col-0 genotype .
Previous work has sought to exhaustively identify small secreted peptides (SSP) in Arabidopsis , rice  and Populus deltoides. These earlier studies only scanned short ORFs (25-250 amino acids) in translated genome sequence, though Pan et al. did include multiple-exon gene predictions from ab initio gene predition programs such as Fgenesh and Augustus. However, with a primary focus on detecting all small secreted peptides, these studies did not utilize protein family information in the model building process since secreted peptides are so diverse. The Arabidopsis Unannotated Secreted Peptide Database (AUSPD) only contains one-exon ORF predictions, and thus mis-annonates most (if not all) two-exon secreted peptides (Additional file 18: Figure S5 for example). The OrysPSSP database (comparative Platform for Small Secreted Proteins from rice from rice and other plants) does contain multi-exon models predicted by Fgenesh (0.72%) and Augustus (1.16%) in addition to single-exon ORFs . However, since no prior information is incorporated into predictions by these ab initio gene predicting programs, multi-exon models in OrysPSSP are frequently in conflict with the true rice CRPs (Additional file 19: Figure S6 for example). As a result, while most single-exon peptides in Arabidopsis and rice are captured in AUSPD and OrysPSSP respectively, a large portion of the two-exon and multi-exon genes (such as CRP0000-CRP1530) are clearly under-represented in these two databases. SPADA, on the other hand, used additional gene structure information obtained in the motif mining step and was able to correctly predict most of the CRP models (Additional file 20: Table S10).
SPADA is not designed to identify all genes in a genome. However, its applicability to new annotation projects steadily will increase due to the marked growth of protein sequence family signatures and alignments. InterPro release 43.0 contains 16,652 protein family signatures. In the last 3 years, the number of families characterized by InterPro has increased by 24%, compared with a 38% increase in the 3 years prior to that . (Release 29.0 from October 2010 had 13,382 family entries; Release 16.1 from October 2007 had 9,729 entries.) It should be noted that SPADA is unlikely to perform well with genes that have large numbers of exons due to the combinatoric explosion of potential splice donor and acceptor pair combinations to evaluate. For longer multi-exon gene families, AUGUSTUS-PPX should be used. Still, SPADA has been shown here to be extremely effective in predicting families of one- and two- exon genes often missed or excluded by standard gene prediction algorithms [3, 4]. Hence, it is anticipated that gene annotation pipelines would be improved by routinely running SPADA to pick up small genes in addition to the standard generic gene prediction algorithms (e.g., Augustus) for larger genes.
As sequencing costs have come down, there has been a commensurate expansion in the sequencing of multiple plant genomes within each species. Moreover, Genome Wide Association Studies (GWAS) are now routinely carried out in these populations. Gene annotation cannot be simply transferred across members of a species due to the myriad of SNPs and indels that alter gene structures. Gan et al. estimated that gene structural changes occurred in more than 30% of genes among the 18 Arabidopsis accessions they resequenced and assembled . Further, GWAS studies have repeatedly implicated unannotated intergenic regions as having the most significant association with important agronomic traits [44, 45]. While it is likely that many of these GWAS peaks identify non-coding RNAs or regions in strong linkage disequlibrium with causative variants, we suspect that many of these sites may actually mark members of as yet unannotated families of small genes. Indeed, in our own GWA studies , many peaks turned out to coincide with NCR or other CRP family members that prior to our intensive family-based annotation studies had been un-annotated in Medicago.
In striking contrast to Arabidopsis, the Medicago genome harbors a huge number (583 versus 3) of Nodule Cysteine-Rich peptides (NCRs, CRP1130-CRP1530) – Defensin-Like proteins with nodule-specific expression (Table 1). These NCRs are unique to Medicago (specifically, legumes in the Inverted Repeat-Lacking Clade)  and have recently been shown to play vital roles in the communication between Medicago and symbiotic rhizobia [48, 49].
Surprisingly, three CRPs were found in the Arabidopsis genome falling into the nodule-specific sub-families (CRP1130-CRP1530, or NCRs). Previously, NCRs were thought to be unique to Medicago and other IRLC legumes, playing a vital role in the legume-rhizobia symbiotic interaction . Looking closely at the sequence alignments (Additional file 21: Figure S7), these "Arabidopsis NCRs" have all the conserved cysteine residues in the expected configuration, while also exhibiting substantial divergence from Medicago NCRs - and forming a separate Arabidopsis-specific clade. Furthermore, only one Arabidopsis NCR is predicted in each sub-class. It is possible, therefore, that these "Arabidopsis NCRs" are descendants from the most recent common ancestral genes that later evolved into Medicago NCRs. After the Arabidopsis-Medicago divergence, these ancient NCRs could have become increasingly divergent in the legume (Medicago) clade, eventually gaining new functions in nodule development and symbiosis, possibly through neo-functionalization, conferring a selective advantage and thus increasing rapidly in copy number through gene duplication.
Because the model prediction step in the pipeline is not optimized for multi-exon gene models nor the extremely large introns present in animal genomes, we do not yet recommend SPADA to identify small peptides in animals (especially mammals). Also, SPADA is not expected to work well with bacterial genomes due to the absence of introns in their gene models. However, we speculate this pipeline will work well with organisms such as yeast, oomycete and fungi, since they have similar gene structures to plants [51, 52]. In fact, it was recently found that oomycetes and fungi genomes encode large number of secreted effectors as a result of the evolutionary "arms-race" between pathogen and host [53, 54]. Potentially, SPADA will be useful in effector discovery in these pathogen genomes given that a growing number of informative family alignments are becoming available.
Although SPADA was initially designed to target secreted peptide families in plants, it can be used on non-secreted peptide families, especially in fungal systems. In the Results section, we tested SPADA using draft genome contigs of the mushroom A. bisporigera in search of additional members of a class of potent liver toxin pro-peptides characterized in earlier work . Roughly 20 pro-peptides belonging to this family had been cloned and sequenced, with only about a dozen present among the draft genome sequence contigs. SPADA identified 5 new family members with convincing alignments and significant E-values (2.4 × 10-17 < E < 5.8 × 10-8). Three of these were in contigs long enough that extensive homology in the 3’-UTR region characteristic of the family could be observed. The contigs of the remaining two hits ended shortly after the coding sequence preventing 3’-UTR homology characteristics from being confirmed. When the same input HMM constructed from the known 20 pro-peptides  was used to scan against a related mushroom Amanita thiersii that is known not to produce toxic peptides, SPADA did not identify any candidate genes with E < 0.01.
SPADA is a homology-based gene prediction program to accurately identify and predict the gene structure for short peptides with one to a few exons. SPADA works well on small plant peptides such as the cysteine-rich peptide families. SPADA gives much more accurate and precise gene calls than traditional ab initio gene finding programs in tested genomes. Running SPADA on less well-annotated plant genomes (e.g., Medicago) reveals numerous mis-annotated and unannotated CRPs in the current genome annotation. Predictions made by SPADA constitute the most complete set of plant cysteine-rich peptides, and in this regard, will provide an invaluable resource for the research of small, secreted peptides in plants. The systematic application of SPADA to other classes of small peptides by communities will greatly improve the genome annotation of different protein families in public genome databases.
Project name: Small Peptide Alignment Discovery Application
Project home page: https://github.com/orionzhou/SPADA
Operating system(s): Linux
Programming language: Perl
Other requirements: SPADA is a perl-based pipeline that internally runs HMMer, Augustus, Genewise, SplicePredictor and a number of custom scripts. Instructions for installation and running are available at https://github.com/orionzhou/SPADA/wiki.
License: Apache License, Version 2.0
Any restrictions to use by non-academics: None
Open reading frame
Nodule cysteine-rich peptide
Present, marginal and absent
Genome wide association study
Hidden Markov model
General feature format
Multiple sequence alignment
Inverted repeat-lacking clade.
We thank Yong Bao and Diana Trujillo for testing the program in soybean and other legume genomes. We thank Roman Briskine for the useful suggestions on software release and maintenance. Finally, our special thanks go to the Minnesota Supercomputing Institute for the hardware and software resources. Work in NDY’a lab was supported by NSF project PGRP 0820005 and 1237993. Work in JDW’s lab was supported by Grant 1R01-GM088274 from the National Institutes of Health General Medical Sciences.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.