TranSeqAnnotator: large-scale analysis of transcriptomic data

Abstract

Background

The transcriptome of an organism can be studied by analysing expressed sequence tag (EST) datasets, a rapid and cost-effective approach supported by several new and updated bioinformatics tools for assembly and annotation. Such analyses complement genome and proteome analyses in building a comprehensive understanding of an organism. With the advent of large-scale sequencing projects and the generation of sequence data at the protein and cDNA levels, an automated analysis pipeline is necessary to store, organize and annotate ESTs.

Results

TranSeqAnnotator is a workflow for large-scale analysis of transcriptomic data that integrates bioinformatics tools selected for data management and analysis. The pipeline automatically cleans, clusters and assembles the input, generates consensus sequences, conceptually translates these into possible protein products and assigns putative function based on various DNA and protein similarity searches. Excretory/secretory (ES) proteins inferred from ESTs/short reads are also identified. TranSeqAnnotator accepts raw ESTs with their quality values, protein sequences and short reads in FASTA format, and analyses them with user-selected programs. After pre-processing and assembly, the dataset is annotated at the nucleotide, protein and ES protein levels.

Conclusion

TranSeqAnnotator was developed on a Linux cluster to perform exhaustive and reliable analysis and to provide detailed annotation. It outputs gene ontologies and protein functional identifications in terms of mapping to protein domains and metabolic pathways. The pipeline has been applied to annotate large EST datasets, identifying several novel and known genes, some with experimental validation as therapeutic candidates, that could serve as potential targets for parasite intervention. TranSeqAnnotator is freely available for the scientific community at http://estexplorer.biolinfo.org/TranSeqAnnotator/.

Background

Expressed sequence tags (ESTs), derived from complementary DNA (cDNA) libraries, provide a low-cost transcriptomic alternative to whole genome sequencing. They are short, unedited, randomly selected single-pass sequence reads of approximately 200-800 base pairs (bp), each representing a small part of the nucleotide sequence of a transcribed, protein-coding or non-coding, messenger RNA (mRNA). They play a vital role in gene identification and in the verification of gene predictions, as they represent the expressed regions of a genome. The analysis of EST data can facilitate gene discovery, help in gene structure identification, complement genome annotation, establish the viability of alternative transcripts, direct single nucleotide polymorphism (SNP) characterization and facilitate proteomic exploration [1–3]. ESTs were the primary source for human gene discovery in the early 1990s [4]. Besides ESTs, "next-generation" sequencing (NGS) now generates millions of reads of 35-250 bp, which further help in the study of transcriptome data, mainly for neglected organisms, and in understanding the different isoforms of an organism at different stages of development. Experimental proteomic studies have shown that proteins in excretory/secretory products (ESP) can be identified with the help of a transcriptome assembly [5]. With the advent of the short-read strategy of NGS, bioinformatics analysis faces many challenges in data storage and management and in developing informatics tools, with a focus on sequence quality scoring, alignment, assembly and data processing [6, 7]. A comprehensive analysis pipeline is required to store, organize and annotate ESTs, with computational tools for pre-processing, clustering, assembly into contiguous segments (known as contigs) and annotation to yield biological information.
The web resources available for large-scale EST datasets have been reviewed for each step, including clustering, assembly, consensus generation and DNA, protein and ES annotation [8]. The number of analysis steps and tools complicates computational strategies for organizing and analysing transcriptomic datasets [9], a problem compounded by the limited ability of some tools to handle high-throughput EST data. An evaluation revealed that all available platforms terminated prior to downstream functional annotation, including gene ontologies (GOs), motif/pattern analysis and pathway mapping. Hence, a comprehensive large-scale transcriptomic analysis pipeline [9] was required to keep pace with the enormous amounts of sequence data currently being generated. This highlights an urgent need for advanced, high-throughput computational analyses of EST and genomic sequence datasets using automated platforms. EST data have been applied to the study of functional biomolecules [9, 10], but predicting ES proteins from ESTs has been uncommon. Excretory/secretory (ES) products are the molecules excreted or secreted by a cell or an organism that can circulate throughout the body of an organism (e.g., in the extracellular space) or are localized to or released from the cell surface, making them readily accessible to drugs and/or the immune system. ES products cover 8-20% of the proteome of an organism [11] and include molecules of varied functionality, including chemokines, digestive enzymes, cytokines, hormones, toxins, antibodies, morphogens, extracellular proteinases and antimicrobial peptides. They are known to be involved in vital biological processes, including cell adhesion, cell migration, cell-cell communication, differentiation, proliferation, morphogenesis and immune responses [12]. Biochemical and immunological studies of parasitic helminths have focused on ES proteins.
Worms secrete biologically active mediators which can transform or customize their niche within the host [13–15] to regulate or to elude immune attack or stimulate a particular host response.

Some platforms terminate at the assembly level, providing contigs and singletons [16] (referred to as rESTs), while other platforms exclusively run nucleotide-based programs with limited annotation at the protein level [17–20]. Based on the benchmarking results, a robust transcriptome analysis pipeline (TranSeqAnnotator) was constructed with contig generation from ESTs and short reads, updated pathway analysis, identification of non-classically secreted proteins and extensive annotation, with an option for users to select specific analysis phases (detailed below). Proteins secreted by classical and non-classical pathways are identified by a combination of computational approaches to predict ESPs. The pipeline accepts ESTs, quality values, protein sequences and short reads as input and provides as output assembled rESTs and their annotations, including gene ontologies, secretory proteins and mapping to protein domains, motifs, metabolic pathways and interaction databases. TranSeqAnnotator (TSA) is available as a web service and can be downloaded for local installation.

Implementation

The TranSeqAnnotator workflow has three phases (Figure 1). Phase I covers pre-processing of (a) ESTs or (b) short reads in FASTA format, followed by assembly, conceptual translation and BLAST searches against the non-redundant (NR) dataset. Phase II identifies putative ES proteins, both classically and non-classically secreted, and eliminates transmembrane proteins. Phase III performs the combined annotation of the protein sequences and ES proteins with a carefully selected suite of bioinformatics tools, chosen on the basis of a large-scale transcriptome analysis [21]. TranSeqAnnotator currently implements the genetic codes for 15 organisms, covering the most studied organisms, including human, rat, pig, dog, chicken, rice, wheat, thale cress (Arabidopsis thaliana), zebrafish, yeast and a free-living roundworm (Caenorhabditis elegans).

Figure 1. Schematic diagram of TranSeqAnnotator workflow.

Phase I accepts ESTs and short reads as input for pre-processing and assembly, together with quality values in the case of ESTs (Figure 1).

The sequence cleaning step uses seqclean [22] for ESTs alone and seqtrim [23] for ESTs with quality sequences, followed by optional repeat masking with RepeatMasker [24]. Phase I (b) accepts short reads, which are pre-processed with seqclean. The masked sequences are then passed on for clustering and assembly with iAssembler (http://bioinfo.bti.cornell.edu/tool/iAssembler/), which incorporates the MIRA [25] and CAP3 assemblers, for ESTs and short reads. For conceptual translation into proteins, the program ESTScan [26] applies the genetic code from the nearest organism to the contig and singleton sequences generated by CAP3 or iAssembler.
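As a rough illustration of what conceptual translation involves, the sketch below translates a nucleotide sequence in all six reading frames and keeps the longest stretch without a stop codon. It is a deliberately simplified stand-in and not how ESTScan [26] works: ESTScan is model-based and organism-specific, whereas the function names and the use of the standard genetic code here are assumptions made purely for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate codon by codon; codons with ambiguous bases become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(STANDARD_CODE.get(codon, "X") for codon in codons)


def six_frame_translations(seq):
    """Translate three forward frames, then three reverse-strand frames."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return [translate(strand[frame:]) for strand in strands for frame in range(3)]


def longest_orf_peptide(seq):
    """Longest stop-free peptide over all six frames (hypothetical helper).

    No start codon is required, because ESTs are often partial transcripts.
    """
    best = ""
    for protein in six_frame_translations(seq):
        for segment in protein.split("*"):
            if len(segment) > len(best):
                best = segment
    return best
```

A naive longest-frame rule like this can easily pick the wrong strand or frame for short or error-prone reads, which is one reason a trained model such as ESTScan is used in the pipeline.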

In Phase II, the protein sequences generated in Phase I are screened for putative ES proteins using SecretomeP [28] and TMHMM [27] (Figure 1). First, each sequence is checked for a signal sequence with SignalP, based on its neural network and hidden Markov model probability scores (SignalP-NN and SignalP-HMM), while SecretomeP looks for non-classically secreted proteins; both use default parameters that can be modified by experienced users. Subsequently, all proteins with signal sequences are passed on to TMHMM, a hidden Markov model-based transmembrane helix prediction program, to filter out transmembrane proteins. The ES proteins, the subset lacking transmembrane helices, are further annotated. Phase III, the annotation level for protein sequences or ES proteins, comprises a suite of computational tools: InterProScan [29] for domain analysis and Gene Ontology, and KOBAS (KEGG Orthology-Based Annotation System) [30, 31] for pathway mapping. In addition, protein BLAST is used to search databases derived from Wormpep [32], to locate nematode homologues and homologous proteins in C. elegans archived in WormBase, as well as interaction databases such as IntAct [33], BioGrid [34] and DIP [35], which provide molecular interaction data and experimentally verified protein-protein interactions.
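The Phase II decision logic can be sketched as a small filter over per-protein predictions. This is an illustration under stated assumptions rather than the pipeline's actual code: the `Prediction` record, its field names and the function name are hypothetical, the predictions are assumed to have been parsed already from SignalP, SecretomeP and TMHMM output, and the default cut-off simply reuses the SecretomeP NN-score threshold of 0.9 reported for the A. lumbricoides analysis (whether the comparison is inclusive is also an assumption).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Hypothetical per-protein summary of the three predictors."""
    protein_id: str
    has_signal_peptide: bool      # SignalP call (classical secretion)
    secretomep_nn_score: float    # SecretomeP neural-network score
    tm_helices: int               # number of helices predicted by TMHMM


def classify_es_proteins(predictions, nn_threshold=0.9):
    """Split candidates into classical, non-classical and TM-rejected IDs."""
    result = {"classical": [], "non_classical": [], "rejected_transmembrane": []}
    for p in predictions:
        if p.has_signal_peptide:
            route = "classical"
        elif p.secretomep_nn_score >= nn_threshold:
            route = "non_classical"
        else:
            continue  # not predicted to be secreted by either route
        if p.tm_helices > 0:
            # Assumption: any predicted helix removes the candidate.
            result["rejected_transmembrane"].append(p.protein_id)
        else:
            result[route].append(p.protein_id)
    return result
```

Proteins left in the `classical` and `non_classical` lists correspond to the putative ES proteins that would be passed on to Phase III.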

TSA accepts a dataset submitted by the user, and optional programs can be selected as required (Figure 2). The progress of the analysis is monitored on the status page, which is updated after each selected process is completed, and the output of each program is available along with a summarized output. Some of these tools are provided in the ESTExplorer [36] and EST2Secretome [37] pipelines, but TranSeqAnnotator adds the analysis of large-scale EST datasets and short read sequences with updated bioinformatics tools, as part of the benchmarking with the large-scale analysis of the Teladorsagia circumcincta dataset (unpublished work). In addition, SecretomeP identified important proteins that the previous pipelines, relying on SignalP, failed to identify. The identification of both classically and non-classically secreted proteins with SecretomeP is the highlight of this analysis pipeline, as in our earlier analysis of Fasciola hepatica [38].

Figure 2. TranSeqAnnotator data submission page.

Software/hardware environment

TranSeqAnnotator is developed in Perl v5.10.0, which links the different bioinformatics programs, with MySQL as the backend for data management and analysis. The front end is developed in PHP and the processes are run based on CPU availability. Each input sequence submitted by the user is tagged with a request ID to trace the process. The pipeline runs on a 16-node Linux cluster (2.4 GHz Intel(R) Xeon(R) CPU, 16 processors, 32 GB RAM) running the Ubuntu Server operating system. The final results are provided as output files for viewing and downloading, and remain available for a week.
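The request tagging and one-week retention described above could be modelled along the following lines. This is only a sketch: the actual service is written in Perl with a MySQL backend, and the record layout, function names and the use of a random UUID as the request ID are all assumptions introduced here for illustration.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Results are described as being available for a week.
RETENTION = timedelta(days=7)


def new_request(submitted_at=None):
    """Create a hypothetical tracking record for one submission."""
    submitted_at = submitted_at or datetime.now(timezone.utc)
    return {
        "request_id": uuid.uuid4().hex,           # opaque ID used to trace the job
        "submitted_at": submitted_at,
        "expires_at": submitted_at + RETENTION,   # when results may be purged
    }


def is_expired(request, now=None):
    """True once the retention window for the request has passed."""
    now = now or datetime.now(timezone.utc)
    return now >= request["expires_at"]
```

Keeping the expiry time on the record itself lets a periodic clean-up job decide what to delete without recomputing the retention policy.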

Results and discussion

Application of TranSeqAnnotator

Ascaris lumbricoides, a soil-transmitted helminth (geohelminth), is the largest of the common intestinal nematode parasites of humans and causes the disease ascariasis [39]. It infects an estimated 1.2 billion people worldwide, but infection is usually asymptomatic [40]. A total of 1822 A. lumbricoides EST sequences from dbEST [41] were analysed using TranSeqAnnotator. The dataset is from an adult male whole-body A. lumbricoides cDNA clone. Phase I was carried out, comprising pre-processing (SeqClean and RepeatMasker) followed by clustering and assembly with CAP3, and yielded 236 contigs and 658 singletons. These rESTs were mapped to the non-redundant (NR) dataset using BLAST for nucleotide-level annotation. Using a translational matrix, ESTScan conceptually translated these high-quality rESTs, which were then transferred to Phase II of TSA for the prediction of ES proteins by sequentially running the SecretomeP (with a threshold value for the NN-score of 0.9) and TMHMM programs. The cluster dataset, translated peptide sequences and ES proteins were annotated with biochemical pathways using KOBAS, and with domain/family motifs and Gene Ontology terms using InterProScan. The query sequences were compared using BLASTP against Wormpep [32] and against the IntAct database (version 1.7.0) to extract all interaction partners. The 894 rESTs were conceptually translated to yield 510 peptide sequences. GO terms were identified for these putative protein sequences using InterProScan, with 108 peptide sequences assigned a biological process (BP), 156 associated with a molecular function (MF) and 83 assigned to a cellular component (CC) (Additional File 1). The analysis revealed that translation (GO:0006412) and oxidation-reduction process (GO:0055114) were the most highly represented GO categories among biological processes.
The most frequent GO terms in molecular function were structural constituent of ribosome (GO:0003735), oxidoreductase activity (GO:0016491) and ATP binding (GO:0005524), whereas in cellular component the most highly represented GO terms were ribosome (GO:0005840) and extracellular space (GO:0005615).
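The counts reported above can be cross-checked with a few lines of arithmetic. The snippet below uses only the figures given in the text; the variable names and the `share` helper are introduced here for illustration. Note that a peptide may carry GO terms in more than one namespace, so the three GO shares are not expected to add up to any particular total.

```python
# Figures taken from the A. lumbricoides analysis described in the text.
contigs, singletons = 236, 658
rests = contigs + singletons          # assembled rESTs passed to annotation
peptides = 510                        # peptides from conceptual translation
go_assigned = {
    "biological process": 108,
    "molecular function": 156,
    "cellular component": 83,
}


def share(part, whole):
    """Format part/whole as a percentage with one decimal place."""
    return f"{part / whole:.1%}"


print(f"rESTs: {rests}")
print(f"rESTs yielding a peptide: {share(peptides, rests)}")
for namespace, count in go_assigned.items():
    print(f"peptides with a {namespace} term: {share(count, peptides)}")
```

The sum of contigs and singletons reproduces the 894 rESTs quoted in the text, and a little over half of them yield a translated peptide.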

A total of 239 peptide sequences were mapped to 113 KEGG pathways using KOBAS. The main KEGG pathways mapped included the ribosomal protein assembly pathway (n = 34) and cytoskeleton proteins (n = 19). Other well-represented pathways include tight junction (n = 14), regulation of actin cytoskeleton (n = 12), focal adhesion (n = 12), valine, leucine and isoleucine degradation (n = 8) and propanoate metabolism (n = 7). Peptides were also mapped to several other pathways, including glycolysis/gluconeogenesis, the MAPK signaling pathway and ubiquitin-mediated proteolysis (Additional File 2).

Domain mapping by InterProScan provides details of the family, fold and functional domains present in the putative peptides. The most represented domain was the collagen triple helix repeat, with 14 protein entries, followed by the C-type lectin fold and the transthyretin-like family, with nine protein entries each. Other highly represented domains are the actin-like and C-type lectin domains (Additional File 3).

A total of 32 secreted proteins were predicted by SecretomeP. Of these, 6 are classically secreted peptides with N-terminal signal sequences, while 26 are non-classically secreted, supporting the use of SecretomeP rather than SignalP alone, which can only predict classically secreted proteins. Of these 32, six proteins with transmembrane helices predicted by TMHMM were eliminated, resulting in 26 excreted/secreted proteins inferred from the present dataset of 894 rESTs. We identified cecropins (including cecropin-P1, cecropin-P2 and cecropin-P3), cathepsin L from Ascaris suum and cathepsin L-like protease from Strongylus vulgaris, chymotrypsin/elastase isoinhibitor 1 from Ascaris suum, C-type lectin protein 160 from Ascaris suum and C-type lectin domain-containing protein 160 from Ascaris suum. Gelsolin from Ascaris suum and GelSoliN-Like family member (gsnl-1) from Caenorhabditis elegans were also identified (Additional File 4). Cecropins represent a large family of antibacterial and toxic peptides that are known to execute host defence functions, mainly against micro-organisms [42, 43], and are found in insects [44]. Ascaris cecropins (P1-P4) were identified as antimicrobial peptides that were positively inducible by bacterial injection. Chemically synthesized Ascaris cecropins were bactericidal against a wide range of microbes, i.e. Gram-positive (Staphylococcus aureus, Bacillus subtilis and Micrococcus luteus) and Gram-negative (Pseudomonas aeruginosa, Salmonella typhimurium, Serratia marcescens and Escherichia coli) bacteria, and were weakly but detectably active against yeasts (Saccharomyces cerevisiae and Candida albicans) [45]. C-type lectins (CTLs) represent a large family of proteins that bind carbohydrate moieties in a Ca2+-dependent manner; they act as pathogen recognition molecules or antibacterial proteins in immune responses that protect the worm itself against microbial infection [46–49].
They also play a vital role in immune homeostasis through endogenous 'self' ligand recognition [50], and they themselves have bactericidal activity [51]. Studies have shown that A. suum C-type lectin-1 (As-CTL-1) shows high similarity to Toxocara canis C-type lectins (Tc-CTLs), and that they are exposed to attack by host immune responses. Hence, to avoid protective immune responses in infected animals during tissue migration, A. suum larvae might interfere with host inflammation processes through As-CTL-1 [52]. The gelsolin family belongs to a group of actin-binding proteins known to be involved in cell structure, motility, apoptosis, amyloidosis and cancer. Gelsolin-like protein-1 (GSNL-1) from C. elegans is a new member of the gelsolin family of actin regulatory proteins, which provides new insight into the functional diversity and evolution of gelsolin-related proteins [53, 54]. We were able to assign GO terms to 26 putative ES proteins, with proteolysis (GO:0006508) the most common GO category in biological process, cysteine-type peptidase activity (GO:0008234) in molecular function and extracellular region (GO:0005576) in cellular component. The sequences mapped to KEGG pathways using KOBAS represented protein processing in endoplasmic reticulum, phagosome, lysosome, antigen processing and presentation, and rheumatoid arthritis. The TranSeqAnnotator methodology was benchmarked using the large-scale dataset of Teladorsagia circumcincta (unpublished work) and applied to the annotation of A. lumbricoides.

Future directions

TranSeqAnnotator currently supports nucleotide, short read, protein and ES level annotation. Our aim is to extend the pipeline by updating the repeat-masking step with repeatless libraries to annotate newly sequenced organisms, and also to carry out annotations for other types of data such as RNA-seq and microarray datasets.

References

  1. Rudd S: Expressed sequence tags: alternative or complement to whole genome sequences?. Trends Plant Sci. 2003, 8 (7): 321-329. 10.1016/S1360-1385(03)00131-6.

  2. Dong Q, Kroiss L, Oakley FD, Wang BB, Brendel V: Comparative EST analyses in plant systems. Methods Enzymol. 2005, 395: 400-418.

  3. Jongeneel CV: Searching the expressed sequence tag (EST) databases: panning for genes. Brief Bioinform. 2000, 1 (1): 76-92. 10.1093/bib/1.1.76.

  4. Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF: Complementary DNA sequencing: expressed sequence tags and human genome project. Science. 1991, 252 (5013): 1651-1656. 10.1126/science.2047873.

  5. Moreno Y, Gros PP, Tam M, Segura M, Valanparambil R, Geary TG, Stevenson MM: Proteomic analysis of excretory-secretory products of Heligmosomoides polygyrus assessed with next-generation sequencing transcriptomic information. PLoS neglected tropical diseases. 2011, 5 (10): e1370-10.1371/journal.pntd.0001370.

  6. Wold B, Myers RM: Sequence census methods for functional genomics. Nat Methods. 2008, 5 (1): 19-21. 10.1038/nmeth1157.

  7. Yang MQ, Athey BD, Arabnia HR, Sung AH, Liu Q, Yang JY, Mao J, Deng Y: High-throughput next-generation sequencing technologies foster new cutting-edge computing techniques in bioinformatics. BMC genomics. 2009, 10 (Suppl 1): I1-10.1186/1471-2164-10-S1-I1.

  8. Ranganathan S, Menon R, Gasser RB: Advanced in silico analysis of expressed sequence tag (EST) data for parasitic nematodes of major socio-economic importance--fundamental insights toward biotechnological outcomes. Biotechnol Adv. 2009, 27 (4): 439-448. 10.1016/j.biotechadv.2009.03.005.

  9. Nagaraj SH, Gasser RB, Ranganathan S: A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform. 2007, 8 (1): 6-21.

  10. Adams MD, Kerlavage AR, Fields C, Venter JC: 3,400 new expressed sequence tags identify diversity of transcripts in human brain. Nat Genet. 1993, 4 (3): 256-267. 10.1038/ng0793-256.

  11. Greenbaum D, Luscombe NM, Jansen R, Qian J, Gerstein M: Interrelating different types of genomic data, from proteome to secretome: 'oming in on function. Genome Res. 2001, 11 (9): 1463-1468. 10.1101/gr.207401.

  12. Maizels RM, Yazdanbakhsh M: Immune regulation by helminth parasites: cellular and molecular mechanisms. Nat Rev Immunol. 2003, 3 (9): 733-744. 10.1038/nri1183.

  13. Lightowlers MW, Rickard MD: Excretory-secretory products of helminth parasites: effects on host immune responses. Parasitology. 1988, 96 (Suppl): S123-166.

  14. Hawdon JM, Jones BF, Hoffman DR, Hotez PJ: Cloning and characterization of Ancylostoma-secreted protein. A novel protein associated with the transition to parasitism by infective hookworm larvae. J Biol Chem. 1996, 271 (12): 6672-6678. 10.1074/jbc.271.12.6672.

  15. Maizels RM, Gomez-Escobar N, Gregory WF, Murray J, Zang X: Immune evasion genes from filarial nematodes. Int J Parasitol. 2001, 31 (9): 889-898. 10.1016/S0020-7519(01)00213-2.

  16. Masoudi-Nejad A, Tonomura K, Kawashima S, Moriya Y, Suzuki M, Itoh M, Kanehisa M, Endo T, Goto S: EGassembler: online bioinformatics service for large-scale processing, clustering and assembling ESTs and genomic DNA fragments. Nucleic Acids Res. 2006, 34 (Web Server): W459-462.

  17. D'Agostino N, Aversano M, Chiusano ML: ParPEST: a pipeline for EST data analysis based on parallel computing. BMC Bioinformatics. 2005, 6 (Suppl 4): S9-10.1186/1471-2105-6-S4-S9.

  18. Latorre M, Silva H, Saba J, Guziolowski C, Vizoso P, Martinez V, Maldonado J, Morales A, Caroca R, Cambiazo V: JUICE: a data management system that facilitates the analysis of large volumes of information in an EST project workflow. BMC Bioinformatics. 2006, 7: 513-10.1186/1471-2105-7-513.

  19. Paquola AC, Nishyiama MY, Reis EM, da Silva AM, Verjovski-Almeida S: ESTWeb: bioinformatics services for EST sequencing projects. Bioinformatics. 2003, 19 (12): 1587-1588. 10.1093/bioinformatics/btg196.

  20. Hotz-Wagenblatt A, Hankeln T, Ernst P, Glatting KH, Schmidt ER, Suhai S: ESTAnnotator: A tool for high throughput EST annotation. Nucleic Acids Res. 2003, 31 (13): 3716-3719. 10.1093/nar/gkg566.

  21. Menon R, Gasser RB, Miterva M, Ranganathan S: An analysis of the transcriptome of Teladorsagia circumcincta: its biological and biotechnological implications. BMC Genomics. 2012,

  22. Chen YA, Lin CC, Wang CD, Wu HB, Hwang PI: An optimized procedure greatly improves EST vector contamination removal. BMC Genomics. 2007, 8: 416-10.1186/1471-2164-8-416.

  23. Falgueras J, Lara AJ, Fernandez-Pozo N, Canton FR, Perez-Trabado G, Claros MG: SeqTrim: a high-throughput pipeline for pre-processing any type of sequence read. BMC Bioinformatics. 2010, 11: 38-10.1186/1471-2105-11-38.

  24. RepeatMasker. [http://www.repeatmasker.org]

  25. Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Muller WE, Wetter T, Suhai S: Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 2004, 14 (6): 1147-1159. 10.1101/gr.1917404.

  26. Iseli C, Jongeneel CV, Bucher P: ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc Int Conf Intell Syst Mol Biol. 1999, 138-148.

  27. Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001, 305 (3): 567-580. 10.1006/jmbi.2000.4315.

  28. Bendtsen JD, Jensen LJ, Blom N, Von Heijne G, Brunak S: Feature-based prediction of non-classical and leaderless protein secretion. Protein Eng Des Sel. 2004, 17 (4): 349-356. 10.1093/protein/gzh037.

  29. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L: InterPro: the integrative protein signature database. Nucleic Acids Res. 2009, 37 (Database): D211-215.

  30. Xie C, Mao X, Huang J, Ding Y, Wu J, Dong S, Kong L, Gao G, Li CY, Wei L: KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases. Nucleic acids research. 2011, 39 (Web Server): W316-322. 10.1093/nar/gkr483.

  31. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006, 34 (Database): D354-357.

  32. Bieri T, Blasiar D, Ozersky P, Antoshechkin I, Bastiani C, Canaran P, Chan J, Chen N, Chen WJ, Davis P: WormBase: new content and better access. Nucleic Acids Res. 2007, 35 (Database): D506-510.

  33. Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Ghanbarian AT, Kerrien S, Khadake J: The IntAct molecular interaction database in 2010. Nucleic Acids Res. 2010, 38 (Database): D525-531.

  34. Breitkreutz BJ, Stark C, Reguly T, Boucher L, Breitkreutz A, Livstone M, Oughtred R, Lackner DH, Bahler J, Wood V: The BioGRID Interaction Database: 2008 update. Nucleic Acids Res. 2008, 36 (Database): D637-640.

  35. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004, 32 (Database): D449-451.

  36. Nagaraj SH, Deshpande N, Gasser RB, Ranganathan S: ESTExplorer: an expressed sequence tag (EST) assembly and annotation platform. Nucleic Acids Res. 2007, 35 (Web Server): W143-147. 10.1093/nar/gkm378.

  37. Nagaraj SH, Gasser RB, Ranganathan S: Needles in the EST haystack: large-scale identification and analysis of excretory-secretory (ES) proteins in parasitic nematodes using expressed sequence tags (ESTs). PLoS Negl Trop Dis. 2008, 2 (9): e301-10.1371/journal.pntd.0000301.

  38. Robinson MW, Menon R, Donnelly SM, Dalton JP, Ranganathan S: An integrated transcriptomics and proteomics analysis of the secretome of the helminth pathogen Fasciola hepatica: proteins associated with invasion and infection of the mammalian host. Mol Cell Proteomics. 2009, 8 (8): 1891-1907. 10.1074/mcp.M900045-MCP200.

  39. Dold C, Holland CV: Ascaris and ascariasis. Microbes Infect. 2011, 13 (7): 632-637. 10.1016/j.micinf.2010.09.012.

  40. Holland CV: Predisposition to ascariasis: patterns, mechanisms and implications. Parasitology. 2009, 136 (12): 1537-1547. 10.1017/S0031182009005952.

  41. Boguski MS, Lowe TM, Tolstoshev CM: dbEST--database for "expressed sequence tags". Nat Genet. 1993, 4 (4): 332-333. 10.1038/ng0893-332.

  42. Tamang DG, Saier MH: The cecropin superfamily of toxic peptides. J Mol Microbiol Biotechnol. 2006, 11 (1-2): 94-103. 10.1159/000092821.

  43. Bulet P, Stocklin R: Insect antimicrobial peptides: structures, properties and gene regulation. Protein Pept Lett. 2005, 12 (1): 3-11. 10.2174/0929866053406011.

  44. Steiner H, Hultmark D, Engstrom A, Bennich H, Boman HG: Sequence and specificity of two antibacterial proteins involved in insect immunity. Nature 292: 246-248. 1981. J Immunol. 2009, 182 (11): 6635-6637.

  45. Pillai A, Ueno S, Zhang H, Lee JM, Kato Y: Cecropin P1 and novel nematode cecropins: a bacteria-inducible antimicrobial peptide family in the nematode Ascaris suum. Biochem J. 2005, 390 (Pt 1): 207-214.

  46. O'Rourke D, Baban D, Demidova M, Mott R, Hodgkin J: Genomic clusters, putative pathogen recognition molecules, and antimicrobial genes are induced by infection of C. elegans with M. nematophilum. Genome Res. 2006, 16 (8): 1005-1016. 10.1101/gr.50823006.

  47. Schulenburg H, Hoeppner MP, Weiner J, Bornberg-Bauer E: Specificity of the innate immune system and diversity of C-type lectin domain (CTLD) proteins in the nematode Caenorhabditis elegans. Immunobiology. 2008, 213 (3-4): 237-250. 10.1016/j.imbio.2007.12.004.

  48. Drickamer K: Two distinct classes of carbohydrate-recognition domains in animal lectins. J Biol Chem. 1988, 263 (20): 9557-9560.

  49. Drickamer K: Ca(2+)-dependent sugar recognition by animal lectins. Biochem Soc Trans. 1996, 24 (1): 146-150.

  50. Garcia-Vallejo JJ, van Kooyk Y: Endogenous ligands for C-type lectin receptors: the true regulators of immune homeostasis. Immunol Rev. 2009, 230 (1): 22-37. 10.1111/j.1600-065X.2009.00786.x.

  51. Cash HL, Whitham CV, Behrendt CL, Hooper LV: Symbiotic bacteria direct expression of an intestinal bactericidal lectin. Science. 2006, 313 (5790): 1126-1130. 10.1126/science.1127119.

  52. Yoshida A, Nagayasu E, Horii Y, Maruyama H: A novel C-type lectin identified by EST analysis in tissue migratory larvae of Ascaris suum. Parasitol Res. 2012

  53. Liu Z, Klaavuniemi T, Ono S: Distinct roles of four gelsolin-like domains of Caenorhabditis elegans gelsolin-like protein-1 in actin filament severing, barbed end capping, and phosphoinositide binding. Biochemistry. 2010, 49 (20): 4349-4360. 10.1021/bi100215b.

  54. Klaavuniemi T, Yamashiro S, Ono S: Caenorhabditis elegans gelsolin-like protein 1 is a novel actin filament-severing protein with four gelsolin-like repeats. J Biol Chem. 2008, 283 (38): 26071-26080. 10.1074/jbc.M803618200.

Acknowledgements

We are grateful to Macquarie University for the award of postgraduate research scholarships. Funding to pay the Open Access publication charges for this article was provided by Macquarie University.

This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.

Author information

Corresponding author

Correspondence to Shoba Ranganathan.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RM carried out the analysis, computational studies and drafted the manuscript. RM, GG, SR and RBG participated in the design of the study and interpretation of data. SR and RBG conceived the project and finalized the manuscript. All authors have read and approved the final manuscript.

Electronic supplementary material

12859_2012_5488_MOESM1_ESM.xlsx

Additional file 1: GO annotation for putative peptides. Gene Ontology annotations from Interproscan reported. (XLSX 29 KB)

12859_2012_5488_MOESM2_ESM.xlsx

Additional file 2: KEGG Pathway analysis of proteins (E-value threshold of 1E-05). Database matches reported. (XLSX 11 KB)

Additional file 3: Domain description for the protein sequences. Interproscan domains reported. (XLSX 27 KB)

Additional file 4: Top BLAST hits for secreted proteins. Non-redundant database matches reported. (XLSX 13 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Menon, R., Garg, G., Gasser, R.B. et al. TranSeqAnnotator: large-scale analysis of transcriptomic data. BMC Bioinformatics 13 (Suppl 17), S24 (2012). https://doi.org/10.1186/1471-2105-13-S17-S24

Keywords