Making sense of EST sequences by CLOBBing them
© Parkinson et al; licensee BioMed Central Ltd. 2002
Received: 20 August 2002
Accepted: 25 October 2002
Published: 25 October 2002
Expressed sequence tags (ESTs) are single pass reads from randomly selected cDNA clones. They provide a highly cost-effective method to access and identify expressed genes. However, they are often prone to sequencing errors and typically define incomplete transcripts. To increase the amount of information obtainable from ESTs and reduce sequencing errors, it is necessary to cluster ESTs into groups sharing significant sequence similarity.
As part of our ongoing EST programs investigating 'orphan' genomes, we have developed a clustering algorithm, CLOBB (Cl uster o n the b asis of B LAST similarity) to identify and cluster ESTs. CLOBB may be used incrementally, preserving original cluster designations. It tracks cluster-specific events such as merging, identifies 'superclusters' of related clusters and avoids the expansion of chimeric clusters. Based on the Perl scripting language, CLOBB is highly portable relying only on a local installation of NCBI's freely available BLAST executable and can be usefully applied to > 95 % of the current EST datasets. Analysis of the Danio rerio EST dataset demonstrates that CLOBB compares favourably with two less portable systems, UniGene and TIGR Gene Indices.
CLOBB provides a highly portable EST clustering solution and is freely downloaded from: http://www.nematodes.org/CLOBB
Expressed sequence tags (EST) are single pass sequence reads from randomly selected cDNA clones that sample the diversity of genes expressed by an organism . ESTs are a valuable adjunct to whole genome sequencing, as they facilitate gene identification. For organisms where whole genome sequencing is a distant goal, EST analysis is a highly cost-effective gene discovery method. The utility of ESTs is illustrated by the phylogenetic diversity of organisms represented in dbEST, the NCBI's EST database .
Random sampling of clones means that redundancy can be expected in EST datasets, even those derived from normalised or subtracted cDNA libraries. Unlike whole genome sequencing, where multiple sequencing of each segment is the norm, ESTs are single pass reads of unverified quality that may contain base-calling and other errors. Additionally an EST may often only provide information on a partial segment of an entire cDNA. Finally, analysis of EST datasets can be overwhelming due to the sheer number of sequences involved.
To address issues of redundancy, quality and data handling, EST clustering can be employed. This involves the grouping of ESTs on the basis of sequence similarity into clusters representing putative genes. These groups can then be used to derive consensus sequences that have a higher overall sequence quality and increase the length of transcript that can be assigned. To date a number of different clustering methods have been developed in which ESTs are grouped into a set of "gene indices". These range from simple scripts which run and parse the output of sequence database searches e.g. SEALS , INCA  and Zymogenetics' REX , through more specialised programs such as JESAM  and Glaxo's "Dynamic" assembler , to programs which rely on non-alignment based algorithms, such as d2_cluster . In addition to these standalone solutions, there are also a number of dedicated database systems such as UniGene  and the TIGR Gene Indices [10–12], which create and maintain gene indices derived from entire organismal sets of ESTs.
Our interest in EST clustering arises as part of our involvement in EST projects on 'orphan' genomes. One such project involves a program of gene discovery in parasitic nematodes with the remit of generating ~20,000 ESTs for each of 10 different species of parasitic nematode . To maximise the information derived from these ESTs, for each species of nematode we study a gene index based on the ESTs must be generated. As each dataset may be generated over an extended time period by several different laboratories and we wish to release the information to the public domain as it arises, we required a clustering algorithm that (1) could be run incrementally and (2) which would allow existing clusters to be tracked through subsequent builds. Further, due to the nature of cDNA library construction, the clustering algorithm had to be robust enough to deal with chimeras (clones which arise from the ligation of two unrelated transcripts). In addition, a piece of software was required which was fully accessible (i.e not a pre-built binary) and where parameters could be appropriately set to deal with the nematode datasets. At the beginning of the project, none of the available programs examined were either publicly available in a portable format or fulfilled the aforementioned criteria.
Here we describe a program (CLOBB – Cl uster o n the b asis of B LAST) based on the use of BLAST similarity scores [14, 15] that achieves these goals. The program is freely available, and is written in the perl scripting language (and is therefore fully customisable). The program depends upon the availability of a locally installed version of BLAST (freely obtainable from http://www.ncbi.nlm.nih.gov).
Results and Discussion
In order to provide a benchmark for the performance of the program, the latest Danio rerio (zebrafish) UniGene clustered dataset containing 60,357 sequences (build #21 – 24/08/01) was downloaded and reclustered with CLOBB using the default settings for the tuneable variables outlined in Methods. Vector screening of the sequences downloaded from UniGene revealed 28 sequences with vector sequence. Due to this low level of contamination, it was felt to be more important to maintain consistency with the UniGene build and therefore the original, unmasked sequences were used for clustering. Benchmarking suggests that the time of execution of the algorithm scales exponentially with the number of sequenced involved (results not shown). For the Danio rerio dataset, it took a pentium III 750 MHz processor 129 hours and 9 minutes to cluster 60,357 sequences. For our nematode EST datasets, (~20,000 ESTs) the time taken to cluster is clearly acceptable. Of the 387 species represented in dbEST , 370 (> 95%) have fewer than 100,000 ESTs and may therefore be usefully processed by CLOBB. It should be noted, however, that in its present form, CLOBB is unsuitable for much larger datasets (such as the 135,000 C. elegans ESTs or 3.8 million human ESTs). Since much of the computing time is spent in reformatting the database for searching with BLAST, future development of the algorithm will investigate reducing the computational overhead associated with this part of the script (potentially by building one initial database containing all sequences, both novel and those previously allocated to a cluster, and performing each search against it).
To compare the performance of our clustering algorithm, we have also downloaded the TIGR gene indices set for D. rerio (ZGI Release 7.0 – 07/08/01). In the UniGene process, clusters are initially formed by pair-wise comparison of mRNAs and genomic DNA fragments. ESTs are then added to these clusters providing that such an addition does not lead to the joining of two distinct clusters created in the preceding stage. The TIGR gene indices are built in two stages . Firstly WU-BLAST  performs a series of pair-wise sequence comparisons to group all those sequences sharing > = 95% sequence similarity over 40 bp with unmatched overhangs of less than 20 bp. These groups are then subjected to a further round of clustering using the program CAP3 [17, 18] to generate a tentative consensus sequence. It should be noted that the TIGR gene index set is based on 95,910 sequences. This discrepancy arises from both the date of clustering (the TIGR and UniGene datasets were obtained at different times) and from the difference in screening methods used by the two protocols. Whilst both methods filter sequences for vector contaminants and polyA tails, rejecting those sequences of <100 bp in length, UniGene also filters out mitochondrial and ribosomal sequences in addition to repetitive elements. The TIGR dataset was therefore filtered for sequences found in the UniGene dataset. In total the TIGR dataset contained 56,888 of the UniGene sequences. The missing 3,469 sequences appear to be derived from EST submissions occurring after 1st August 2001. In the following comparisons between the three cluster datasets, only the 56,888 sequences common to all three were used.
Cluster size distribution for the three compared D. rerio cluster datasets
Size of Cluster (number of sequences)
UniGene Build 21 24/08/01
TIGR ZGI V.7 07/08/01
Total Clusters (from 58,888 sequences)
Distribution of cluster events
Detailed analysis of UniGene cluster ug.2984
Total sequences in clusters containing at least one sequence derived from ug.2984
Clusters with > 100 seqs (sizes)
4 (1075, 146, 145, 136)
5 (425, 219, 214, 203, 143)
Clusters with only one sequence (singletons)
To determine how the order in which sequences are added to the CLOBB database may affect clustering, the 1,900 sequences from the CLOBB clusters derived from ug.2984 were reclustered separately using the CLOBB algorithm. This led to the construction of 74 clusters, with only 31 having one EST member (2 of which were not originally found in the ug.2984 dataset). There are now five clusters with greater than 100 sequences, with the largest containing only 425 sequences. This shows, as expected, that clusters formed by the CLOBB algorithm may vary according to the order in which sequences are added. This behaviour is explained by the unidirectional nature of cluster growth. Given that two sequences may share significant similarity to a cluster but not to each other (due to differences in an overlap extending beyond the cluster), further growth of that cluster will depend upon which of the two sequences it encounters first (see Figure 1).
This is a problem common to most clustering algorithms. For example CAP3 was also found to create alternate sets of clusters for ug.2984 by simply altering the order of the sequences in the fasta file it was given. In general such problems tend to be restricted to only a few large clusters and are usually dealt with on a manual basis. To aid this process, we have included a 'supercluster' feature to CLOBB that automatically identifies where such problems may occur. For the ug.2984 cluster used in these analyses, 28 of the 33 clusters, generated in the original CLOBB analysis, containing more than one sequence were tagged as belonging to at least one of four 'superclusters'. Of the remaining five clusters, sequences from one cluster did not show any type II match (see Methods) with any other CLOBB cluster, whilst for the other four clusters, sequences did not have significant (e < 10-5) BLAST similarity with sequences from any other CLOBB cluster containing a sequence from the original ug.2984 dataset. Detailed examination of these clusters revealed that their constitutive sequences may have been mis-assigned to ug.2984 by the UniGene algorithm as a result of imperfect clone-read labelling.
Post cluster consensus assembly using CAP3 of CLOBB clusters derived from ug.2984
Total number of clusters used in assembly (sequences)
Number of contigs built from > 1 sequence (A)
Number of singlets produced in assembly (B)
Total predicted tentative consensuses (A) + (B) + number of singletons
CLOBB pooled clusters
CLOBB-α individual clusters
CLOBB-β individual clusters
TIGR predictions for the same 1900 sequences
In terms of maintaining continuity between builds, the UniGene clusters are renumbered after comparison with the previous build. However, it should be noted that sequences assigned to a cluster may change with subsequent builds and that cluster identifiers may disappear (typically when two clusters merge). For the TIGR clusters, identifiers which change through mergers or splits are kept in their database. In CLOBB merges and splits are recorded and cluster membership is tracked for each EST.
Summary of features of the three cluster methods examined
Underlying Clustering Method
WU-BLAST & CAP3
Dependent on stage of clustering
>= 95% identity over
> 40 bp
>= 95% identity over 30 bp
< 20 bp
< 10% of sequence length
Those with > 10% of sequence length are allowed if they contain > 10% unassigned bases
Clusters are always contiguous?
Dealing with potential chimeric clusters
Initial clustering performed with gene sequences – merging of these initial distinct clusters rejected
CAP3 does not include identified chimeric sequences
Definition of type III matches and 'superclusters' prevents chimeric sequences from merging unsuitable clusters.
Continuity (addition of new sequences)
New builds are compared with previous builds
Incremental within algorithm
Availability of previous builds
Notes showing retirement of clusters
'superclusters' and merge events can be tagged
Portability and adapatibility
Ease of retention of manual curation
CLOBB is freely available with POD documentation from our website http://www.nematodes.org/CLOBB.
Calculations were performed on an Intel pentium III 750 Mhz processor running Red Hat Linux 6.2 with perl version 5.005. The script was written in the perl scripting language and uses NCBI BLAST version 2.1.2.
CLOBB is an iterative clustering method – that is the cluster database increases by the addition of new sequences. The program follows the schematic outlined in Figure 2. Firstly all ESTs to be clustered are placed in a common directory in FASTA format. When the program is run, a 'master' list is made of all the files in this directory. If a previous cluster build has been performed, it is read to determine the number of the last identified cluster. The first EST is then compared using BLASTN to the current cluster database. The BLAST output is parsed for high-scoring segment pairs (HSPs). For all HSPs with an identity of >= 95% and length > 30 bp, the subject sequence is recorded as a type I match. The next stage of the process goes through the list of type I matches to identify any problems associated with the match. This is achieved by parsing the beginning and end positions of the query and subject sequences from the BLAST output. If these positions overlap beyond the HSP by more than 30 bases (i.e. the HSP does not extend through the full overlap of the sequences), a further check is performed to ensure that this is not due to the presence of poor quality sequence (determined by the number of bases assigned 'N' in the overlap regions). Type I matches which do not have high quality overlaps of more than 30 bases beyond the HSP are designated as type II matches. Other type I matches which possess high quality overlaps of greater than 30 bases which are not part of a HSP are designated as type III matches.
The next stage of cluster assignment then involves checking through the lists of type II and type III matches to ensure that no conflicts arise. Given a cluster in which some members are type II matches, if there are other members of the same cluster which have been designated as type III matches, then this indicates that the query sequence matches some but not all members of a cluster and is therefore assigned a new cluster number. The inclusion of this feature in the algorithm prevents the rapid expansion of chimeric clusters and can result in a splitting of related sequences into many different (related) clusters. However, when such events occur, the program catalogues the clusters involved, identifying them as 'similar to' the type III match for subsequent post-process analysis (typically performed by manual curation).
Another complication occurs when two or more type II matches arise from different clusters. Firstly the BLAST output is re-analysed to determine whether the HSPs of the matches occur in overlapping regions. If they do not, the query effectively links the clusters and they are merged into the cluster with the lowest index – a separate note is recorded to indicate that such a merge operation has occurred. If they do overlap, this may indicate that they are either alternatively spliced variants of one gene or closely related members of a gene family, and the query sequence is assigned the cluster number of the type II match with which it had the highest BLAST score, providing that said cluster did not contain a type III match, and an annotation added to indicate that these clusters may be members of a 'supercluster'. Once the query sequence has been assigned to a cluster, it is added to the growing cluster database which is then reformatted to allow the next search.
From the above schema it is apparent that the script contains a number of tuneable variables such as percentage identity in overlap, minimum length of HSP and maximum allowable non-HSP overlap. CLOBB is designed to use EST sequences downloaded from dbEST. Access to quality data from sequence chromatograms would make it possible to use more accurate measures than simply the number of N's in overlapping regions to determine regions of poor quality. Due to the portability and readability of the perl scripting languages, such features would be easy to introduce.
We would like to thank James Callahan (Smith College, MA, USA) for assistance with early versions of the scripting, and the Blaxter Nematode Genomics lab for helpful discussions. This work was funded by the Wellcome Trust and the UK Medical Research Council.
- Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropolous MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al.: Complementary DNA sequencing: Expressed sequence tags and human genome project. Science 1991, 252: 1651–1656.View ArticlePubMed
- Boguski MS, Lowe TM, Tolstoshev CM: dbEST–database for "expressed sequence tags". Nat Genet 1993, 4: 332–333.View ArticlePubMed
- Walker DR, Koonin EV: SEALS: A System for Easy Analysis of Lots of Sequences. Int. Sys. Mol. Biol. 1997, 5: 333–339.
- Graul RC, Sadée W: Evolutionary relationships among proteins probed by an iterative neighborhood cluster analysis (INCA). Alignment of bacteriorhodopsins with the yeast sequence YRO2. Pharm Res 1997, 14: 1533–1541. 10.1023/A:1012166015402View ArticlePubMed
- Yee DP, Conklin D: Automated clustering and assembly of large EST collections. Proc Int Conf Intell Syst Mol Biol 1998, 6: 203–211.PubMed
- Parsons JD, Rodriguez-Tomé P: JESAM: CORBA software components to create and publish EST alignments and clusters. Bioinformatics 2000, 16: 313–325. 10.1093/bioinformatics/16.4.313View ArticlePubMed
- Gill RW, Hodgman TC, Littler CB, Oxer MD, Montgomery DS, Taylor S, Sanseau P: A new dynamic tool to perform assembly of expressed sequence tags (ESTs). Comput Appl Biosci 1997, 13: 453–457.PubMed
- Burke J, Davison D, Hide W: d2_cluster: A validated method for clustering EST and full-length cDNA sequences. Genome Res 1999, 9: 1135–1142. 10.1101/gr.9.11.1135PubMed CentralView ArticlePubMed
- Boguski MS, Schuler GD: ESTablishing a human transcript map. Nat Genet 1995, 10: 369–371.View ArticlePubMed
- Adams MD, Kerlavage AR, Fleischmann RD, Fuldner RA, Bult CJ, Lee NH, Kirkness EF, Weinstock KG, Gocayne JD, White O, et al.: Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature 1995, 377: Supplement: 3–174.PubMed
- Sutton GG, White O, Adams MD, Kerlavage AR: TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Gen Sci Technol 1995, 1: 9–19.View Article
- White O, Kervalage AR: TDB: new databases for biological discovery. Meths Enzymol 1996, 266: 27–40.View Article
- Parkinson J, Whitton C, Guiliano D, Daub J, Blaxter M: 200 000 nematode expressed sequence tags on the net. Trends Parasitol 2001, 17: 394–396. 10.1016/S1471-4922(01)01954-7View ArticlePubMed
- Altschul SF, Gish W, Miller W, Myers MW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–10. 10.1006/jmbi.1990.9999View ArticlePubMed
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMed
- Liang F, Holt I, Pertea G, Karamycheva S, Salzberg SL, Quackenbush J: An optimized protocol for analysis of EST sequences. Nucleic Acids Res 2000, 28: 3657–3665. 10.1093/nar/28.18.3657PubMed CentralView ArticlePubMed
- Huang X: An improved sequence assembly program. Genomics 1996, 33: 21–31. 10.1006/geno.1996.0155View ArticlePubMed
- Huang X, Madan A: CAP3: A DNA Sequence Assembly Program. Genome Res 1999, 9: 868–877. 10.1101/gr.9.9.868PubMed CentralView ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.