Recent advances in high-throughput sequencing technology have made it possible to gain genomic insight into species that lack a complete genome sequence by generating collections of expressed sequence tags (ESTs). ESTs are single-pass, partial sequences obtained from randomly selected complementary DNA (cDNA) clones, and they must be processed and annotated to yield a biologically relevant data set. Raw ESTs include low-quality and vector regions that must be identified and removed to obtain high-quality, clean sequences suitable for further analysis. In addition, because the sequenced cDNA clones are selected at random, a clustering step is needed to obtain a non-redundant set of unique consensus sequences, or unigenes. Finally, functional and structural annotation of the unigenes is required to add relevant biological information to the sequences. All these data must be stored and organized in a structured database, and interfaces must be set up so that end users can efficiently retrieve and mine them.
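As a rough illustration of the cleaning and redundancy-removal steps described above, the following minimal Python sketch trims low-quality bases from both ends of each read, discards reads that become too short, and collapses exact duplicates. It is not the method used by any particular EST pipeline: the function names, the quality threshold of 20, the minimum length, and the use of exact-duplicate collapsing as a crude stand-in for true clustering and consensus assembly are all assumptions made for this example. Vector removal is omitted.

```python
from typing import List, Sequence, Tuple

# A read is (identifier, base sequence, per-base quality scores).
# This representation is hypothetical and chosen only for the example.
Read = Tuple[str, str, Sequence[int]]


def trim_low_quality(seq: str, quals: Sequence[int], min_q: int = 20) -> str:
    """Remove leading and trailing bases whose quality is below min_q."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end]


def clean_reads(reads: List[Read], min_q: int = 20, min_len: int = 5) -> List[str]:
    """Trim each read, drop short ones, and keep one copy of each sequence.

    Exact-duplicate removal is only a placeholder for clustering: real
    unigene construction assembles overlapping reads into consensus sequences.
    """
    unique: List[str] = []
    seen = set()
    for _read_id, seq, quals in reads:
        trimmed = trim_low_quality(seq, quals, min_q)
        if len(trimmed) < min_len or trimmed in seen:
            continue
        seen.add(trimmed)
        unique.append(trimmed)
    return unique


example_reads: List[Read] = [
    ("r1", "NNACGTACGTNN", [5, 5] + [30] * 8 + [5, 5]),
    ("r2", "ACGTACGT", [30] * 8),  # identical to r1 after trimming
    ("r3", "ACG", [30] * 3),       # too short, discarded
]
cleaned = clean_reads(example_reads)
```

In this toy input, the first two reads reduce to the same trimmed sequence and the third is dropped for length, so a single sequence remains. In practice each of these stages is delegated to dedicated, well-tested programs.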
Because most EST datasets contain a large number of sequences, different computer-based methods are required to process, annotate, record, display, and retrieve the data. These methods are applied sequentially, from the raw input sequence data to the final searchable, fully annotated EST database, and knowledge of computer science is needed to arrange them into an efficient and reliable analysis pipeline. The usual approach to this problem is to build an in-house set of scripts that semi-automates the analysis. This solution requires highly skilled bioinformatics staff capable of programming and using the scripts. It is also inefficient, because a different pipeline must be prepared for each project and because the resulting semi-automated process is difficult to maintain and lacks reproducibility. It would therefore be more convenient to use a well-tested, freely available automatic tool. In our opinion, such an application should ideally have the following features: 1) be fully automated, as a pipeline covering all the steps from the input chromatogram files to a clean, annotated, web-searchable EST database; 2) be highly modular and adaptable; 3) be able to run in parallel on a personal computer (PC) cluster, thus benefiting from the multiprocessing capabilities of these systems; 4) use freely available third-party programs, in order to ease the incorporation of improvements made by other programmers; 5) include a highly configurable, extensible, and user-friendly interface for data mining that can combine any search criteria to fit the needs of the end user; and 6) be released under an open source license, to allow continuous development by a community of users and programmers as well as customization for the particular needs of different projects.
Several applications have been proposed, including PipeOnLine, ESTAP, PartiGene, ESTIMA, EST-PAGE, ParPEST, GARSA, and openSputnik, and each fulfills some of the desired characteristics. However, as far as we know, none of these packages meets all the requirements indicated above, especially code availability, which enables customization, and the automatic creation of a user-friendly web site for complex queries that is ready to be deployed in a production environment.
To fulfill the need for analysis software that meets all of these requirements, we have developed a software package named EST2uni (EST analysis software TO create an annotated UNIgene database). This pipeline has been tested in three genomics projects in which we are involved: the Citrus Functional Genomics Project, the ChillPeach Project, and the Spanish Melon Genomics Project. EST2uni takes a set of chromatogram files as input and produces a structured and annotated EST database, as well as a web site for complex queries and data mining. The pipeline is configured by editing a single, well-documented text file. After this initial configuration, the pipeline is completely automated and can be run in parallel on a PC cluster using the load distribution tool CONDOR. Its modular structure provides an easy way to adapt the analysis to the special requirements of individual projects. Furthermore, the code is designed to easily integrate well-tested, widely used, freely available third-party tools, either as locally installed programs (e.g., Primer3) or as web services (e.g., the GEPAS and Babelomics suites). The software package is freely distributed under a GPL license, and can be easily installed on a standard Linux system running the Apache HTTP Server, the Perl scripting language, the MySQL database management system, and the PHP language.