EST2uni: an open, parallel tool for automated EST analysis and database creation, with a data mining web interface and microarray expression data integration
© Forment et al; licensee BioMed Central Ltd. 2008
Received: 09 July 2007
Accepted: 07 January 2008
Published: 07 January 2008
Expressed sequence tag (EST) collections are composed of a high number of single-pass, redundant, partial sequences, which need to be processed, clustered, and annotated to remove low-quality and vector regions, eliminate redundancy and sequencing errors, and provide biologically relevant information. In order to provide a suitable way of performing the different steps in the analysis of the ESTs, flexible computation pipelines adapted to the local needs of specific EST projects have to be developed. Furthermore, EST collections must be stored in highly structured relational databases available to researchers through user-friendly interfaces which allow efficient and complex data mining, thus offering maximum capabilities for their full exploitation.
We have created EST2uni, an integrated, highly-configurable EST analysis pipeline and data mining software package that automates the pre-processing, clustering, annotation, database creation, and data mining of EST collections. The pipeline uses standard EST analysis tools and the software has a modular design to facilitate the addition of new analytical methods and their configuration. Currently implemented analyses include functional and structural annotation, SNP and microsatellite discovery, integration of previously known genetic marker data and gene expression results, and assistance in cDNA microarray design. It can be run in parallel in a PC cluster in order to reduce the time necessary for the analysis. It also creates a web site linked to the database, showing collection statistics, with complex query capabilities and tools for data mining and retrieval.
The software package presented here provides an efficient and complete bioinformatics tool for the management of EST collections which is very easy to adapt to the local needs of different EST projects. The code is freely available under the GPL license and can be obtained at http://bioinf.comav.upv.es/est2uni. This site also provides detailed instructions for installation and configuration of the software package. The code is under active development to incorporate new analyses, methods, and algorithms as they are released by the bioinformatics community.
Recent advances in high-throughput sequencing technology have provided a mechanism to gain genomic insight into species without a complete genome sequence by generating expressed sequence tag (EST) collections. ESTs are single-pass, partial sequences obtained from randomly selected complementary DNA (cDNA) clones and need to be processed and annotated to provide a biologically relevant data set. They include low-quality and vector regions that must be identified and removed to obtain high-quality, clean sequences suitable for further analysis. In addition, due to the random selection of the sequenced cDNA clones, a clustering step is needed to obtain a non-redundant set of unique consensus sequences, or unigenes. Finally, functional and structural annotation of the unigenes is required in order to add relevant biological information to the sequences. All these data must be conveniently stored and organized in a structured database, and interfaces must be set up for end-users to efficiently retrieve and mine them.
Due to the high number of sequences in most EST datasets, different computer-based methods are required to process, annotate, record, display, and retrieve the data. These methods are applied sequentially from the input raw sequence data to the final searchable, fully annotated EST database, and knowledge of computing science is needed to arrange them in an efficient and reliable analysis pipeline. The usual approach to this problem is to build an in-house set of scripts that semi-automates the analysis. This solution requires highly skilled bioinformatics staff capable of programming and using the scripts, and is inefficient because a different pipeline must be prepared for each project, and because the resulting semi-automated process is difficult to maintain and lacks reproducibility. It would therefore be more convenient to use a well-tested, freely available automatic tool. In our opinion, this application should ideally have the following features: 1) to be fully automated in a pipeline covering all the steps from the input chromatogram files to a clean, annotated web-searchable EST database, 2) to be highly modular and adaptable, 3) to be able to run in parallel in a personal computer (PC) cluster, thus benefiting from the multiprocessing capabilities of these systems, 4) to use freely available third-party programs, in order to ease the incorporation of the improvements made by other programmers, 5) to include a highly configurable and extensible user-friendly interface to perform data mining by combining any search criteria, fitting the final user's needs, and 6) to be based on an open source license to allow continuous development by a community of users and programmers, as well as its customization for the particular needs of different projects.
Some applications, including PipeOnline, ESTAP, PartiGene, ESTIMA, EST-PAGE, ParPEST, GARSA, and openSputnik, have been proposed, and they fulfill a certain number of the desired characteristics. However, as far as we know, none of these packages meets all the requirements indicated above, especially code availability, which enables customization, and the automatic creation of a user-friendly web site for complex queries that is ready to be deployed in a production environment.
In an attempt to fulfill the need for analysis software that meets all the mentioned requirements, we have developed a software package named EST2uni (EST analysis software TO create an annotated UNIgene database). This pipeline has been tested in three genomics projects in which we are involved: the Citrus Functional Genomics Project, the ChillPeach Project, and the Spanish Melon Genomics Project. EST2uni uses a set of chromatogram files as input to produce a structured and annotated EST database, as well as a web site to perform complex queries and data mining. The pipeline is configured by editing a single well-documented text file. After this initial configuration, the pipeline is completely automated and can be run in parallel in a PC cluster using the load distribution tool Condor. Its modular structure provides an easy way to adapt the analysis to the special requirements of individual projects. Furthermore, the code is designed to easily integrate well-tested, widely-used, freely available third-party tools, either as locally installed programs (e.g., Primer3) or as web services (e.g., the GEPAS and Babelomics suites). The software package is freely distributed under a GPL license, and can be easily installed in a standard Linux system running the Apache HTTP Server, the Perl scripting language, the MySQL database management system, and the PHP language.
The EST2uni package consists of: 1) a set of Perl modules that perform both the EST analysis and the database creation, 2) a set of PHP scripts to generate the browseable database-interacting web pages, and 3) a set of PHP modules with the functions called by the PHP scripts. The EST analyses are performed by the main Perl script, which manages the execution of several third-party, freely available, standard tools commonly used for EST analysis, as well as a number of home-made analyses. The running parameters used by the pipeline and the external tools are stored in a well-documented, plain text configuration file which can be modified with a text editor. This configuration file contains all the information required to run the pipeline, such as file paths, analyses to be performed, and paths and parameters for the programs installed to run the analyses. The software package is distributed with a single minimal-analysis example configuration file where the default parameters used by EST2uni and the external tools can be inspected and modified if desired. The running parameters considered as defaults by the developers of the external tools have been selected as a starting point in this example configuration file. Each annotation analysis is controlled by a different Perl module, and the results from different analyses are stored in independent tables in the database to facilitate the future addition of new methods and analyses. As a consequence, all analyses can be run independently at different times, and the annotation modules can be re-run at any time with updated external databases. Parallelization is also managed by an independent Perl module which splits the analysis into small tasks, manages their execution in different nodes, and joins the results, using the load distribution tool Condor, as explained above. It should, however, be easy to modify the code in order to use a different load distribution tool.
The package is distributed with a complete working web site built by EST2uni using PHP. The visual design is controlled by using CSS, and special attention has been paid to the modularity of the PHP code, so that it is very easy to add new functionalities to the web pages and to customize their appearance. This adaptation should be easily performed by any project administrator because the PHP code appearing on the browseable web pages is minimal (e.g., the queries page has just 14 PHP lines). MySQL authentication can be used to maintain the privacy of the data of the different projects.
Pre-processing is the first step performed, and includes base calling, vector and low-quality region removal, masking of repeats and low-complexity regions, and contaminant sequence detection and removal. The result is the generation of a clean, high-quality EST sequence set. Both chromatogram and FASTA sequence files, with or without quality scores, are accepted as the entry point to the analysis. Base calling and quality score assignment from chromatograms are performed with phred. Low-quality and cloning vector regions are removed from the sequences with Lucy, and repetitive elements and low-complexity regions are masked with RepeatMasker and seqclean, respectively. For repeat masking, a taxon-specific repeats database can be used. Unexpected vector sequences, probably coming from contamination by manipulation at the laboratory, are also removed with seqclean, using NCBI's UniVec database.
Clean, high-quality EST sequences obtained in the pre-processing step are assembled into contigs and singletons in order to eliminate sequence redundancy, yielding a non-redundant unigene set. Either CAP3 or TGICL can be used for the assembly step, and consensus sequences for the contigs are determined using the quality scores of the individual ESTs to resolve sequence discrepancies. A set of unigene clusters (or "superunigenes"), grouping different unigenes with extensive sequence overlap, can also be obtained. These superunigenes could represent gene families, alternative splicing or polymorphism. Poly(A/T) tails and open reading frames (ORFs), predicted using ESTScan, are used to reverse the sequences when necessary.
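As an illustration of the orientation step, the following is a minimal sketch that reverse-complements a sequence when only a leading poly(T) run suggests it was read from the antisense strand. It is not EST2uni's code: the function names and the minimum tail length of 10 bases are assumptions made for this example, and the ORF evidence provided by ESTScan is left out.

```python
# Hypothetical sketch of tail-based orientation; names and threshold are assumptions.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orient_by_tail(seq: str, min_tail: int = 10) -> str:
    """Return seq in putative sense orientation, judged only by terminal tails.

    A run of at least `min_tail` T at the start, with no matching poly(A)
    run at the end, is taken as a reverse-complemented poly(A) tail.
    """
    upper = seq.upper()
    starts_with_poly_t = upper.startswith("T" * min_tail)
    ends_with_poly_a = upper.endswith("A" * min_tail)
    if starts_with_poly_t and not ends_with_poly_a:
        return reverse_complement(seq)
    return seq  # already sense, or no tail evidence either way


if __name__ == "__main__":
    print(orient_by_tail("T" * 12 + "GGCAT"))
```

Sequences without tail evidence are returned unchanged, which is where a second source of evidence such as a predicted ORF would be consulted.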
To assist in the design of a cDNA microarray derived from the EST collection, one cDNA can be selected as the best representative for each unigene or superunigene. The criteria to choose this cDNA are based on a number of user-defined restrictions (extensive overlap with the unigene consensus sequence, EST sequence length above a given threshold, GC content, etc.) and/or preferences (e.g., the most 5' or 3' clone, or the longest or shortest one). When no clone in a unigene satisfies all the above-mentioned restrictions, these can be progressively relaxed until a representative is selected. Since clones from different unigenes in a superunigene are very similar, they are expected to have the same mRNA targets under standard hybridization conditions when used as probes in cDNA microarrays, so superunigene representatives can also be selected to be printed in the microarray, which reduces spot cross-hybridization.
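The selection logic described above can be sketched as follows. This is an illustrative example, not the EST2uni implementation: the `Clone` fields, the example thresholds, and the relaxation strategy (dropping the least important restriction at each round) are assumptions, since the text only states that restrictions are progressively relaxed.

```python
# Hypothetical sketch of representative clone selection with relaxation.
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Clone:
    name: str
    length: int      # EST sequence length in bases
    overlap: float   # fraction of the unigene consensus covered by the EST
    gc: float        # GC content as a fraction


Restriction = Callable[[Clone], bool]


def pick_representative(
    clones: Sequence[Clone],
    restrictions: Sequence[Restriction],
    preference: Callable[[Clone], float],
) -> Optional[Clone]:
    """Pick the preferred clone that passes the restrictions.

    `restrictions` is ordered from most to least important. If no clone
    passes, the last restriction is dropped and the search is repeated.
    """
    active: List[Restriction] = list(restrictions)
    while True:
        candidates = [c for c in clones if all(r(c) for r in active)]
        if candidates:
            return max(candidates, key=preference)
        if not active:
            return None  # no clones at all
        active.pop()     # relax: drop the least important restriction


if __name__ == "__main__":
    clones = [
        Clone("a", 300, 0.90, 0.45),
        Clone("b", 800, 0.50, 0.45),
        Clone("c", 600, 0.95, 0.70),
    ]
    restrictions = [
        lambda c: c.overlap >= 0.8,
        lambda c: c.length >= 500,
        lambda c: 0.3 <= c.gc <= 0.6,
    ]
    best = pick_representative(clones, restrictions, preference=lambda c: c.length)
    print(best.name if best else None)
```

In this toy data set no clone passes all three restrictions, so the GC restriction is dropped first and the longest remaining candidate is returned.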
Unigenes can be annotated using different modules, following user specifications in the configuration file. SSR microsatellites can be detected with sputnik. Putative single nucleotide polymorphisms (SNPs) can be found by using a locally developed algorithm. Since ESTs have frequent sequencing errors, only positions with a quality score above a user-specified threshold value are considered, and sequence discrepancies between ESTs in the same contig are marked as putative true SNPs only if the polymorphism is confirmed by more than one EST in the contig. When a file including a set of primer pairs is provided, in-silico PCR experiments can be performed with ipcress to integrate information about existing PCR-based molecular markers and associate these markers with specific unigenes. RFLP information can also be integrated with the unigenes via the cDNA clones. An RFLP is a clone proven to be useful to identify differential hybridization band patterns when used as a probe against DNA samples coming from different sources and digested with a specific restriction enzyme. A file with RFLP names and the corresponding cDNA clones and restriction enzymes can be used to annotate unigenes coming from these RFLP clones. Finally, when cDNA libraries are constructed using oligo-dT as a primer for the reverse transcriptase reaction, unigenes can be aligned with protein databases from species with a complete genome sequence to predict whether there are full-length clones for each unigene.
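The SNP filtering rule can be illustrated with a short sketch. The locally developed algorithm itself is not described in detail here, so the example below is only one plausible reading: the input layout (one list of base and quality pairs per aligned contig column), the default threshold of 20, and the requirement that each allele be supported by at least two reads are all assumptions.

```python
# Hypothetical sketch of quality-aware SNP filtering in an aligned contig.
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Column = Sequence[Tuple[str, int]]  # (base, quality score) for each EST at one position


def putative_snps(
    columns: Sequence[Column],
    min_quality: int = 20,
    min_reads_per_allele: int = 2,
) -> Dict[int, Dict[str, int]]:
    """Return {position: {allele: supporting reads}} for putative SNPs.

    Bases at or below `min_quality` are ignored, and a position is reported
    only if at least two alleles are each confirmed by enough reads.
    """
    snps: Dict[int, Dict[str, int]] = {}
    for position, column in enumerate(columns):
        counts = Counter(
            base.upper()
            for base, quality in column
            if quality > min_quality and base.upper() in "ACGT"
        )
        confirmed = {b: n for b, n in counts.items() if n >= min_reads_per_allele}
        if len(confirmed) >= 2:
            snps[position] = confirmed
    return snps


if __name__ == "__main__":
    contig: List[Column] = [
        [("A", 30), ("A", 35), ("A", 40), ("A", 30)],  # no variation
        [("C", 30), ("C", 30), ("T", 30), ("T", 25)],  # confirmed discrepancy
        [("G", 30), ("G", 30), ("G", 30), ("A", 30)],  # seen in a single EST
        [("C", 30), ("C", 30), ("T", 10), ("T", 12)],  # low-quality discrepancy
    ]
    print(putative_snps(contig))
```

Only the second column is reported in this example: the third has a variant seen in a single read, and the fourth has a variant supported only by low-quality base calls.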
For functional annotation, different similarity search tools can be used. Unigene comparisons against a set of user-defined nucleotide and/or protein databases are carried out by BLAST using the search parameters indicated in the configuration file. The descriptions of the BLAST hits obtained with the different BLAST runs are then parsed to yield a descriptive annotation for each unigene. Descriptions containing any word from a user-defined list, including words such as "unknown" or "hypothetical", are considered uninformative and unsuitable for annotation of the unigenes. The annotation is in the form "Similar to" or "Highly similar to", depending on the E value of the alignment with the corresponding BLAST hit. Gene Ontology (GO) classification can also be obtained from BLAST against a set of user-defined GO-annotated databases. A HMMER search against a Pfam database can be used to locally search for protein domains in each unigene. A bi-directional BLAST comparison can also be performed with a number of user-defined species-specific sequence databases in order to obtain a set of putative orthologs. In this analysis, two sequences are considered orthologs when each one is the first hit in a BLAST search with the other. Finally, since EST collections may contain contaminant sequences coming from species other than the intended one (e.g., fungal sequences in plant EST libraries made from infected plants), a BLAST analysis can be done using taxon-specific databases to predict the source organism and flag the putative contaminant sequences.
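Two of the rules above lend themselves to a short illustration: building the descriptive annotation from the BLAST hit descriptions, and the reciprocal best hit criterion for putative orthologs. The sketch below assumes the BLAST results have already been parsed into plain Python structures; the function names, the substring matching of uninformative words, and the E value cut-off of 1e-40 separating "Highly similar to" from "Similar to" are assumptions for this example, not values taken from EST2uni.

```python
# Hypothetical sketch of description-based annotation and reciprocal best hits.
from typing import Dict, List, Optional, Sequence, Tuple


def describe(
    hits: Sequence[Tuple[str, float]],
    uninformative: Sequence[str] = ("unknown", "hypothetical"),
    strong_evalue: float = 1e-40,
) -> Optional[str]:
    """Build an annotation from (description, E value) hits sorted best first.

    Hits whose description contains an uninformative word are skipped.
    """
    for description, evalue in hits:
        lowered = description.lower()
        if any(word in lowered for word in uninformative):
            continue
        prefix = "Highly similar to" if evalue <= strong_evalue else "Similar to"
        return f"{prefix} {description}"
    return None  # no informative hit


def reciprocal_best_hits(
    a_vs_b: Dict[str, List[str]],
    b_vs_a: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where each sequence is the first hit of the other.

    Each dictionary maps a query name to its hit names, best hit first.
    """
    pairs = []
    for a, hits in a_vs_b.items():
        if not hits:
            continue
        b = hits[0]
        back = b_vs_a.get(b, [])
        if back and back[0] == a:
            pairs.append((a, b))
    return sorted(pairs)


if __name__ == "__main__":
    print(describe([("hypothetical protein", 1e-80), ("sucrose synthase", 1e-60)]))
    print(reciprocal_best_hits(
        {"u1": ["at1", "at2"], "u2": ["at2"], "u3": []},
        {"at1": ["u1"], "at2": ["u1", "u2"]},
    ))
```

In the second call only u1 and at1 are paired: u2 has at2 as its best hit, but the best hit of at2 is u1, so the relationship is not reciprocal.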
Microarray expression integration
EST2uni is ready to incorporate microarray expression data for each unigene. Normalized expression data coming from different arrays and experiments can be easily added to the database and linked to the corresponding unigenes. The database can also store details about the biological samples, tissues, and conditions used for each experiment, with referential integrity enforced to reflect the standard workflow of microarray experiments. There are several ways of retrieving these expression data from the web site, depending on the number of unigenes involved. A graphical representation of the expression data in all the microarray experiments is created for each unigene and is accessible from the individual unigene web pages. Bulk data retrieval is also possible, so that expression data for all unigenes, or for any set of unigenes obtained in a query, can be easily obtained. Furthermore, these data can be sent directly to the GEPAS analysis suite or downloaded as a text file ready to be used with the MeV software.
Execution times in standard EST analysis pipelines are usually very long because of the high number of sequences to be processed and the high computational cost of some of the analyses performed. EST2uni can reduce these execution times by parallelization, because it can be run either in sequential mode or in a parallel environment using a load distribution tool. When run in parallel mode, EST2uni divides the analysis into several tasks that are assigned to different processing nodes and manages the execution of these tasks with the aim of keeping the different nodes busy, thus taking maximum advantage of the processing resources available in the system. These tasks are created by splitting the set of sequences to analyze into several chunks and running independent subanalyses in parallel. EST2uni sends these jobs to the queue of the load distribution tool and checks that they run as expected. The pipeline is distributed with the code necessary to use Condor as the load distribution tool. However, the high modularity of the code makes it easy to adapt EST2uni to use other load distribution tools, like openMosix or openPBS.
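The split, run, and join pattern described above can be sketched in a few lines. This example does not use Condor or any EST2uni code: a local thread pool from the Python standard library stands in for the cluster nodes, and the GC-content function is only a placeholder sub-analysis. All names and the chunk size are assumptions.

```python
# Hypothetical sketch of chunked parallel analysis; a thread pool replaces the cluster.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple

Record = Tuple[str, str]  # (sequence name, sequence)


def split_into_chunks(items: Sequence[Record], chunk_size: int) -> List[Sequence[Record]]:
    """Split the input sequences into consecutive chunks of at most chunk_size."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def gc_content(chunk: Sequence[Record]) -> Dict[str, float]:
    """Placeholder sub-analysis: GC fraction of each non-empty sequence."""
    result = {}
    for name, seq in chunk:
        if seq:
            upper = seq.upper()
            result[name] = (upper.count("G") + upper.count("C")) / len(upper)
    return result


def run_parallel(
    sequences: Sequence[Record],
    analysis: Callable[[Sequence[Record]], Dict[str, float]],
    chunk_size: int = 2,
    workers: int = 4,
) -> Dict[str, float]:
    """Run `analysis` on each chunk concurrently and join the partial results."""
    merged: Dict[str, float] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(analysis, split_into_chunks(sequences, chunk_size)):
            merged.update(partial)
    return merged


if __name__ == "__main__":
    ests = [("e1", "GGCC"), ("e2", "ATAT"), ("e3", "GCAT"), ("e4", "AAAA"), ("e5", "GGGA")]
    print(run_parallel(ests, gc_content))
```

Because each chunk is analyzed independently, the joined result is the same whatever the chunk size or number of workers, which is the property that makes this kind of task easy to distribute.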
EST2uni creates and populates a structured MySQL relational database where the results of the different analyses are stored automatically by the pipeline. This database is a key element of the pipeline, since the different analysis modules get their input data from the database and write their output to it. The database contains information about all the data obtained, from cDNA libraries, isolated clones, and raw sequences to analysis results, as well as additional information about genetic maps and markers or expression data provided by users, or the representative clone selected for each unigene. It can also include information about the researchers and research centers involved, as well as journal publications or public accession numbers of sequences derived from the project. Access to the database can be password-protected in order to keep the data private.
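The database-centered design, in which each module reads its input from the database and writes to its own table, can be illustrated with a toy example. The sketch uses an in-memory SQLite database from the Python standard library instead of MySQL, and the table names, columns, and the trivial length analysis are invented for this illustration; they do not reflect the actual EST2uni schema.

```python
# Hypothetical sketch of a module that reads from and writes to the database.
import sqlite3


def create_schema(db: sqlite3.Connection) -> None:
    """Create a shared unigene table and one table owned by a single module."""
    db.executescript(
        """
        CREATE TABLE unigene (
            name TEXT PRIMARY KEY,
            sequence TEXT NOT NULL
        );
        CREATE TABLE length_annotation (
            unigene TEXT PRIMARY KEY REFERENCES unigene(name),
            length INTEGER NOT NULL
        );
        """
    )


def run_length_module(db: sqlite3.Connection) -> None:
    """Toy analysis module: read unigenes, then rewrite only its own table."""
    rows = db.execute("SELECT name, sequence FROM unigene").fetchall()
    db.execute("DELETE FROM length_annotation")  # allows the module to be re-run
    db.executemany(
        "INSERT INTO length_annotation (unigene, length) VALUES (?, ?)",
        [(name, len(sequence)) for name, sequence in rows],
    )
    db.commit()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    create_schema(db)
    db.executemany(
        "INSERT INTO unigene VALUES (?, ?)",
        [("U1", "ATGGCC"), ("U2", "ATGGCCTTAA")],
    )
    run_length_module(db)
    run_length_module(db)  # re-running replaces, rather than duplicates, results
    print(db.execute("SELECT * FROM length_annotation ORDER BY unigene").fetchall())
```

Keeping each module's output in its own table is what lets an analysis be repeated later, for example with an updated external database, without touching the results of the other modules.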
The web site also provides some statistics of the full EST collection, like contigs/singletons distribution, unigene length distribution, number of ESTs per unigene distribution, or number of unigenes annotated with the different functional and structural criteria. It also provides statistics for each single library, like number of ESTs, singletons, contigs, unigenes, novelty, and redundancy. Tables for global and library-specific Gene Ontology annotations are also provided where unigenes annotated in each functional category can be directly retrieved, and libraries can be compared for the Gene Ontology annotation of their unigenes.
Comparison with other tools
Table: Comparison with similar tools. Features of different EST analysis and database tools. Features compared include: based on non-commercial tools, expression data integration, complex data mining, genetic markers integration, and functional motifs identification.
We have developed an EST analysis tool capable of converting, in a fully automatic and time-effective way, a set of trace files or plain sequences into a highly structured and annotated EST database with a user-oriented web interface for efficient data mining. The EST analysis pipeline includes standard pre-processing, clustering, and annotation programs, and can incorporate gene expression data into the database. The software is highly modular, which facilitates the incorporation of new methods and analyses, meeting the needs of different EST projects. Running options are also easily adapted to local needs by simply modifying a single, extensively documented configuration text file which provides the parameters to be used by the different analyses. Once configured, the pipeline runs without user assistance, from the input files to the final annotated EST database. The EST analysis can be run either on a single standard computer or in a PC cluster, thus taking advantage of the multiprocessing capabilities of these systems, which reduces the time required to complete the EST analysis. The web site deployed is a powerful data mining tool with a complex, yet easy to use, query interface that also provides functionalities for bulk data retrieval and download. It also eases the use of several tools, like primer design and BLAST searches against the database. Access to the data can be restricted by passwords to keep the data private. The development team of EST2uni is continuously improving the end-user interface, the quality of the analyses, and the integration with other tools. We have set up a public subversion server and a mailing list to allow collective development of the code, which everyone is invited to join.
To sum up, we conceive this bioinformatics tool as an open and evolving project, and the whole bioinformatics community is invited to participate, using and improving the tool that we have created for our specific EST projects (Citrus Functional Genomics Project, ChillPeach Project, and Spanish Melon Genomics Project).
Availability and requirements
EST2uni can be freely downloaded from the Internet. Detailed installation instructions are provided on the download web page and are also included in the software package. The application is free in the sense that it has been released under the GPL license, and its development is open and collaborative. Any researcher is free to use it, to modify it, and to deploy their own web site with the results.
No hardware or memory restrictions are imposed by this software package other than those that apply to the external programs used in the pipeline. In general, it runs without problems on any standard mid-range x86-based machine using the Linux operating system. We have successfully configured and tested this software on computers running the following GNU/Linux distributions: Fedora Core 5, Ubuntu 6.06, SuSE Linux 8.2, and Debian Sarge. It should run without problems in any Unix installation as long as all the required software is installed. The following software, which comes with any GNU/Linux distribution, is required to run the EST2uni pipeline: Perl, Apache, MySQL, and PHP. EST2uni also requires certain additional Perl modules, specifically Bioperl and DBI. In addition, the following external tools and resources are used to perform the different analyses: phred, Lucy, RepeatMasker, Repbase, SeqClean, NCBI's UniVec, TGICL, ESTScan, sputnik, NCBI BLAST, ipcress, and the Perl module go-perl. For parallel processing, the load balancing tool Condor must be installed. Detailed instructions for installing each of these external programs are also provided with the package and at the EST2uni download web site.
Partially funded by "Conselleria de Agricultura, Pesca y Alimentacion de la Comunidad Valenciana" and Spanish "Ministerio de Ciencia y Tecnologia" (research grants GEN2001-4885-C05 and GEN2003-20237-C06). We acknowledge Dr. Joaquin Dopazo and the Spanish National Institute of Bioinformatics, a platform of Genoma España, for providing help to integrate the Babelomics and GEPAS tools in EST2uni. We also acknowledge Dr. C. Douglas Grubb for his proofreading of this manuscript.
- Adams MD, Soares MB, Kerlavage AR, Fields C, Venter JC: Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nature Genetics 1993, 4: 373–380. 10.1038/ng0893-373
- Ayoubi P, Jin X, Leite S, Liu X, Martajaja J, Abduraham A, Wan Q, Yan W, Misawa E, Prade RA: PipeOnline 2.0: automated EST processing and functional data sorting. Nucleic Acids Research 2002, 30(21):4761–4769. 10.1093/nar/gkf585
- Mao C, Cushman JC, May GD, Weller JW: ESTAP – an automated system for the analysis of EST data. Bioinformatics 2003, 19(13):1720–1722. 10.1093/bioinformatics/btg205
- Parkinson J, Anthony A, Wasmuth J, Schmid R, Hedley A, Blaxter M: PartiGene – Constructing partial genomes. Bioinformatics 2004, 20: 1398–1404. 10.1093/bioinformatics/bth101
- Kumar CG, LeDuc R, Gong G, Roinishivili L, Lewin HA, Liu L: ESTIMA, a tool for EST management in a multi-project environment. BMC Bioinformatics 2004, 5: 176. 10.1186/1471-2105-5-176
- Matukumalli LK, Grefenstette JJ, Sonstegard TS, Tassell CPV: EST-PAGE – managing and analyzing EST data. Bioinformatics 2004, 20(2):286–288. 10.1093/bioinformatics/btg411
- D'Agostino N, Aversano M, Chiusano ML: ParPEST: a pipeline for EST data analysis based on parallel computing. BMC Bioinformatics 2005, 6: S9. 10.1186/1471-2105-6-S4-S9
- Davila AMR, Lorenzini DM, Mendes PN, Satake TS, Sousa GR, Campos LM, Mazzoni CJ, Wagner G, Pires PF, Grisard EC, Cavalcanti MCR, Campos MLM: GARSA: genomic analysis resources for sequence annotation. Bioinformatics 2005, 21(23):4302–4303. 10.1093/bioinformatics/bti705
- Rudd S: openSputnik – a database to ESTablish comparative plant genomics using unsaturated sequence collections. Nucleic Acids Research 2005, 33: D622-D627. 10.1093/nar/gki040
- Citrus Functional Genomics project[http://bioinfo.ibmcp.upv.es/genomics/cfgpDB]
- ChillPeach project[http://bioinfo.ibmcp.upv.es/genomics/ChillPeachDB]
- Spanish Melon Genomics Project[http://www.melogen.upv.es]
- Condor Project[http://www.cs.wisc.edu/condor]
- Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. Methods in Molecular Biology 2000, 132: 365–386.
- Montaner D, Tarraga J, Huerta-Cepas J, Burguet J, Vaquerizas JM, Conde L, Minguez P, Vera J, Mukherjee S, Valls J, Pujana MAG, Alloza E, Herrero J, Al-Shahrour F, Dopazo J: Next station in microarray data analysis: GEPAS. Nucleic Acids Research 2006, 34: W486-W491. 10.1093/nar/gkl197
- Al-Shahrour F, Minguez P, Vaquerizas JM, Conde L, Dopazo J: BABELOMICS: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments. Nucleic Acids Research 2005, 33: W460-W464. 10.1093/nar/gki456
- EST2uni homepage[http://www.melogen.upv.es/genomica/web_estpipe]
- Ewing B, Hillier L, Wendl MC, Green P: Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment. Genome Research 1998, 8: 175–185.
- Chou HH, Holmes MH: DNA sequence quality trimming and vector removal. Bioinformatics 2001, 17: 1093–1104. 10.1093/bioinformatics/17.12.1093
- RepeatMasker homepage[http://www.repeatmasker.org]
- Software from The Gene Index project[http://compbio.dfci.harvard.edu/tgi/software]
- UniVec database at National Center for Biotechnology Information[http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html]
- Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Research 1999, 9(9):868–877. 10.1101/gr.9.9.868
- Iseli C, Jongeneel CV, Bucher P: ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology: 06–10 August 1999 Edited by: Lengauer T, Schneider R, Bork P, Brutlag DL, Glasgow JI, Mewes HW, Zimmer R. 1999, 138–158.
- Evertsz EM, Au-Young J, Ruvolo MV, Lim AC, Reynolds MA: Hybridization cross-reactivity within homologous gene families on glass cDNA microarrays. BioTechniques 2001, 31: 1182–1192.
- Sputnik – DNA microsatellite repeat search utility[http://espressosoftware.com/pages/sputnik.jsp]
- Slater GSC, Birney E: Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 2005, 6: 31. 10.1186/1471-2105-6-31
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215(3):403–410.
- Consortium TGO: Gene Ontology: tool for the unification of biology. Nature Genetics 2000, 25: 25–29. 10.1038/75556
- HMMER homepage[http://hmmer.janelia.org]
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Research 2004, 32: D138-D141. 10.1093/nar/gkh121
- MeV homepage[http://www.tm4.org/mev.html]
- openMosix project[http://openmosix.sourceforge.net]
- openPBS homepage[http://www-unix.mcs.anl.gov/openpbs]
- mpiBLAST homepage[http://www.mpiblast.org]
- Parallel Virtual Machine homepage[http://www.csm.ornl.gov/pvm]
- Biomax Informatics AG[http://www.biomax.com]
- EST2uni subversion server address[svn://phobos.agr.upv.es/estpipe]
- EST2uni mailing list[https://listas.upv.es/mailman/listinfo/est2uni]
- Spanish National Institute for Bioinformatics homepage[http://www.inab.org]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.