Source databases
The TSE uses five data providers: ITIS, Index Fungorum, IPNI, uBio, and NCBI.
ITIS
The Integrated Taxonomic Information System (ITIS) [5] was established in the mid-1990s by a consortium of United States federal agencies tasked with providing a database of taxonomic information for North American taxa. In addition to the original site in the United States [5], there is a French language version hosted by the Canadian Biodiversity Information Facility [18], and a Spanish language version hosted in Mexico [19]. The Canadian site can serve data in XML format, and users can search for a name, or retrieve details about an individual record, using a simple URL API. A Document Type Definition (DTD) file for the XML format is available from the ITIS web site.
ITIS provides a classification of taxonomic names (i.e., a parent-child hierarchy), and where more than one name exists for a taxon, ITIS specifies which name it regards as correct (termed the "valid" name if the taxon is an animal, and "accepted" if it is a plant). Every name in the database, regardless of taxonomic status or position in the hierarchy, is assigned a unique identifier (its "taxonomic serial number"). The database schema is fully documented, and the entire database is available for downloading by FTP as a SQL schema with the data in delimited text files. As a consequence, ITIS is frequently used as the de facto source of taxonomic data in biodiversity informatics projects.
IPNI
The International Plant Names Index (IPNI) [20] combines data from three sources: Index Kewensis (Royal Botanic Gardens, Kew), the Gray Card Index (Harvard University Herbaria), and the Australian Plant Names Index (Australian National Herbarium), and contains some 1.6 million records. It provides names and associated basic bibliographical details for vascular plants. The IPNI web site provides web forms for querying the database, and data can be returned in HTML, "%" delimited text, or XML. However, the XML is a serialisation of IPNI database objects, rather than a format designed to be handled by end users. There are plans to support emerging standards, such as the Taxonomic Concept Transfer Schema [21]. IPNI aims to be a catalogue of all names that have been applied to vascular plants. Where more than one name for a taxon exists, IPNI does not specify which name should be used; it does not indicate an "accepted name" for a taxon. That is, it is a nomenclatural database rather than a taxonomic database. However, if two names are nomenclatural synonyms, the HTML output specifies the nature of synonymy, such as "basionym" (one name is the original name for the taxon), "nomenclatural synonym" (one or other of the names is the basionym, or the names share a basionym), or "replaced synonym" (one name has been created to replace another). IPNI provides a minimal classification, in that genera are assigned to families, but no higher-level classification is given.
Index Fungorum
Index Fungorum [22] is a database of over 370,000 names of fungi, primarily at species level. The database can be searched through a web interface or through a SOAP web service (http://www.indexfungorum.org/ixfwebservice/fungus.asmx), which returns an XML document. If more than one name exists for a fungus, Index Fungorum designates one name as the "current name." It also reports the basionym (first recorded name) for that taxon. Index Fungorum does support a detailed hierarchical classification in the form of a lineage, but higher-level taxa are not assigned records in the database (unlike, for example, ITIS). In fungal taxonomy, names are often assigned to the asexual state (anamorph) of a fungus for which the sexual state (teleomorph) is unknown. Names for anamorphs are flagged as such in the database.
uBio
The Universal Biological Indexer and Organizer (uBio) [23] is a product of the science library community, and is motivated by the information retrieval problem posed by the lack of long-term stability of many taxonomic names [2]. Presently it is the single largest electronic catalogue of scientific names (1,396,868 as of 13 November 2004). In addition to a web interface, uBio provides a SOAP web service (http://www.ubio.org/service/), which returns a nested array data structure.
NCBI
The NCBI Taxonomy database [6] is a curated database of the names of all organisms for which sequences have been submitted to GenBank [24]. Each taxon, regardless of taxonomic level, is assigned a unique identifier (the "taxid"), and the NCBI taxonomy provides a single classification for all taxa in its database. If a taxon has more than one scientific name, each name has the same taxid, but only one is indicated as the "scientific name" [25]. The other names are flagged as synonyms, common names, etc. The NCBI taxonomy is not intended to be an authoritative source of taxonomic information, but is a rapidly growing database that contains many taxa that are not found in other databases. Although every sequence in NCBI is assigned to an organism, in many cases the exact identity of that organism may be unknown. Sequences obtained from environmental sampling are typically unidentified, and the number of such sequences is likely to increase with the advent of large scale environmental genomics [26]. The NCBI taxonomy database can be queried via the Entrez Utilities [27] using either a URL or a SOAP interface. The entire database is also available for download by FTP.
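As an illustration of the URL interface, the following Python sketch builds a name query against the taxonomy database. It assumes the present-day address of the esearch endpoint of the Entrez Utilities, which is not necessarily the address the TSE used; the function name is hypothetical. The sketch only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed present-day address of the esearch endpoint of the Entrez Utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_taxonomy_search_url(name: str) -> str:
    """Return an esearch URL that looks up `name` in the taxonomy database."""
    params = {"db": "taxonomy", "term": name}
    # urlencode escapes the query string (spaces become "+").
    return ESEARCH_URL + "?" + urlencode(params)


url = build_taxonomy_search_url("Homo sapiens")
print(url)
```

Fetching this URL returns an XML document listing the matching taxids, which a client would then parse.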
Architecture
The basic architecture of the TSE is summarised in Fig. 1. For each database a wrapper (implemented as a class in the PHP scripting language) is responsible for communicating with the database, using either HTTP GET requests (via the Net HTTP Client [28] library) or SOAP (via the NuSOAP library [29]). The wrapper takes the query string supplied by the user, and constructs a suitable query for the corresponding database, such as a URL or a SOAP call. The wrapper is also responsible for handling the response. If a database returns an XML document, this is transformed using an XSLT style sheet into the XML format used by TSE. Other formats, such as text or SOAP data structures, are converted into XML by the wrapper.
Each wrapper is derived from the same base class, which provides some generic routines for creating XML documents and for caching results (see next section). The wrapper class supports three methods, IsAlive, NameSearch, and GetDataForID, which must be overridden in descendant classes. The IsAlive method queries whether the data source is available. The NameSearch method queries a data source for a given string. If one or more names are found, NameSearch returns basic information about each name, including the identifier used by the data source. This identifier is used by the GetDataForID method to query the data source for more details about the name.
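The conversion of a non-XML response into XML can be sketched as follows in Python (the TSE itself is written in PHP). The sketch assumes a "%" delimited response whose first line is a header row; the element names, field names, and sample record are hypothetical and do not reproduce the TSE's actual XML format.

```python
import xml.etree.ElementTree as ET


def delimited_to_xml(text: str, delimiter: str = "%") -> str:
    """Convert delimited text with a header row into a small XML document."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split(delimiter)
    root = ET.Element("names")  # hypothetical root element
    for line in lines[1:]:
        record = ET.SubElement(root, "name")
        # Pair each column with its header and emit one child element per field.
        for field, value in zip(header, line.split(delimiter)):
            ET.SubElement(record, field).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample response; the identifier is made up for illustration.
sample = "id%genus%species\n1234-1%Quercus%alba\n"
xml_text = delimited_to_xml(sample)
print(xml_text)
```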
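The class structure described above can be sketched in Python as an abstract base class with one subclass per source. This is an illustration only: the TSE wrappers are PHP classes, and the in-memory subclass, the record layout, and the "rank" field below are hypothetical stand-ins for a real remote source.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class SourceWrapper(ABC):
    """Base class: one subclass per source database."""

    @abstractmethod
    def is_alive(self) -> bool:
        """Report whether the data source is available."""

    @abstractmethod
    def name_search(self, query: str) -> List[Dict[str, str]]:
        """Return basic information, including the source identifier, per match."""

    @abstractmethod
    def get_data_for_id(self, identifier: str) -> Optional[Dict[str, str]]:
        """Return the detailed record for an identifier, or None if unknown."""


class InMemoryWrapper(SourceWrapper):
    """Hypothetical stand-in that reads from a dict rather than a remote source."""

    def __init__(self, records: Dict[str, Dict[str, str]]) -> None:
        self.records = records

    def is_alive(self) -> bool:
        return True  # a local dict is always reachable

    def name_search(self, query: str) -> List[Dict[str, str]]:
        # Exact matches only, mirroring the exact-match search of the TSE.
        return [{"id": identifier, "name": record["name"]}
                for identifier, record in self.records.items()
                if record["name"] == query]

    def get_data_for_id(self, identifier: str) -> Optional[Dict[str, str]]:
        return self.records.get(identifier)


wrapper = InMemoryWrapper({"180092": {"name": "Homo sapiens", "rank": "species"}})
hits = wrapper.name_search("Homo sapiens")
details = wrapper.get_data_for_id(hits[0]["id"])
```

The two-step pattern (search first, then fetch details by identifier) lets the search engine treat every source uniformly, whatever protocol the subclass uses internally.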
Caching results
In order to improve the responsiveness of the search engine, the results of queries to each source database are cached for 24 hours. The results of the query are stored in the format returned by the database (i.e., XML or delimited text), except for uBio where the SOAP response is serialised to disk.
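A time-limited cache of this kind can be sketched as follows in Python. The class name, the use of a hashed query string as the file name, and the reliance on file modification time are assumptions made for illustration; the text does not describe how the TSE stores or expires its cached files.

```python
import hashlib
import os
import tempfile
import time

CACHE_LIFETIME = 24 * 60 * 60  # 24 hours, in seconds


class FileCache:
    """Store query results as files and treat them as stale after a fixed lifetime."""

    def __init__(self, directory, lifetime=CACHE_LIFETIME):
        self.directory = directory
        self.lifetime = lifetime

    def _path(self, key):
        # Hash the query so that any string maps to a safe file name.
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest)

    def get(self, key, now=None):
        """Return the cached text, or None if it is missing or too old."""
        path = self._path(key)
        if not os.path.exists(path):
            return None
        now = time.time() if now is None else now
        if now - os.path.getmtime(path) > self.lifetime:
            return None
        with open(path, encoding="utf-8") as handle:
            return handle.read()

    def put(self, key, text):
        """Store the raw response text under the given query key."""
        with open(self._path(key), "w", encoding="utf-8") as handle:
            handle.write(text)


cache = FileCache(tempfile.mkdtemp())
cache.put("itis:Homo sapiens", "<result/>")
fresh = cache.get("itis:Homo sapiens")
stale = cache.get("itis:Homo sapiens", now=time.time() + 25 * 60 * 60)
```

A wrapper would consult the cache before contacting the source database and call put only after a successful remote query.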
Approximate string matching
The Taxonomic Search Engine seeks exact matches to the user-supplied query. In order to accommodate spelling mistakes, the web interface to the search engine supports approximate string matching using two techniques. The first employs agrep [30] to search for a match amongst a flat file list of names obtained from the ITIS and NCBI databases. Names showing no more than two character differences from the query string are returned as suggested alternative spellings. To supplement agrep, the TSE calls Google's spelling suggestion web service [31] and adds the result of that query (if any) to the list of suggested spellings.
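The two-difference threshold can be illustrated with a short Python sketch. It substitutes a plain dynamic-programming edit distance (Levenshtein distance) for agrep, which uses a different and faster algorithm, so it shows the matching criterion rather than the TSE's implementation. The function names and the short name list are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def suggest_spellings(query, names, max_differences=2):
    """Return the names that differ from the query by at most max_differences."""
    return [name for name in names
            if edit_distance(query, name) <= max_differences]


# Hypothetical stand-in for the flat file of ITIS and NCBI names.
names = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
suggestions = suggest_spellings("Homo sapeins", names)
```

Comparing the query against every name in this way is slow for a list of several hundred thousand names, which is one reason a dedicated tool such as agrep is the better choice in practice.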
Interface
The TSE has a simple web interface (Fig. 2). The user types in a query, and has the option to specify whether TSE should look for alternative spellings. Clicking on the "Go" button starts the search. The XML summary of the search is transformed into HTML using an XSLT transformation. The user can click on a name to get more information, including a link to the original database source for the name, and a LSID for the name.
Web service
The TSE has a SOAP web service that is described by a Web Services Description Language (WSDL) file available at http://darwin.zoology.gla.ac.uk/~rpage/portal/TSE.php?wsdl. The service provides two operations: NameSearch, which queries the source databases for a user-supplied name, and SpellingSuggestion, which suggests alternative spellings for a name. Hence users can write web service clients that use the TSE as part of their own applications. The TSE web site provides source code for two simple clients written in Perl.
Life Science Identifiers
A LSID is a Uniform Resource Name (URN) comprising five parts: the Network Identifier ("lsid"), the root DNS name of the issuing authority, a namespace, an object identifier, and optionally a revision id to indicate the version [11]. TSE generates LSIDs by concatenating the name of the source web server with the suffix "lsid.zoology.gla.ac.uk" to generate the authority. The namespace is the name given to the identifier in the source database, and the object identifier is the identifier used by the source database. For example, the record for Homo sapiens in the ITIS database would have the LSID:
urn:lsid:itis.usda.gov.lsid.zoology.gla.ac.uk:tsn:180092
where "tsn" is the "taxonomic serial number" used by ITIS as a unique identifier for each taxonomic name, and "180092" is the tsn for Homo sapiens.
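The construction of such an identifier can be sketched in Python as follows. The function names are hypothetical, and joining the server name to the suffix with a dot is inferred from the example above rather than stated explicitly.

```python
AUTHORITY_SUFFIX = "lsid.zoology.gla.ac.uk"


def make_lsid(source_host, namespace, object_id, revision=None):
    """Build an LSID whose authority is the source host plus the TSE suffix."""
    authority = source_host + "." + AUTHORITY_SUFFIX
    parts = ["urn", "lsid", authority, namespace, object_id]
    if revision is not None:
        parts.append(revision)  # the revision id is optional
    return ":".join(parts)


def parse_lsid(lsid):
    """Split an LSID into its named components."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError("not a valid LSID: " + lsid)
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object_id": parts[4],
        "revision": parts[5] if len(parts) == 6 else None,
    }


lsid = make_lsid("itis.usda.gov", "tsn", "180092")
fields = parse_lsid(lsid)
```

Because the authority ends in a domain controlled by the TSE, an LSID resolver that looks up the authority is directed to the TSE rather than to the source database.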
The TSE uses the Perl library distributed by IBM's Life Science Identifier project [11] to create a LSID authority for each of the source databases. Hence, any software that can resolve LSIDs (such as LaunchPad [11] or the BioPathways Consortium Web Resolver [32]) can view the metadata associated with an LSID generated by TSE. For ITIS this metadata is constructed by querying a local copy of the ITIS database, but for the remaining databases the LSID metadata is generated using the same combination of HTTP GET and SOAP calls that the TSE uses to query the source databases (although these calls are implemented in Perl).
Performance evaluation
The 2004 edition of the Species 2000 CD-ROM [14] was used as a source of names with which to query the TSE. This database comprises 583,469 names provided by 18 taxonomic databases, two of which (ITIS and Index Fungorum) are also source databases for TSE. In addition, uBio currently includes names from the 2003 edition of the Species 2000 CD-ROM in its database. Hence, most names in the Species 2000 list are likely to be found by TSE.
To create a test dataset, 1000 names were selected at random from the Species 2000 dataset. Each name was sent to the TSE web service by a Perl script which recorded the time taken for each source database to respond to the query, and whether that source database contained the name. The time recorded runs from the moment the query was made until the moment the response was returned; post-processing by the TSE is not included in the measurement. For this experiment, the cache feature was turned off so that for each query the TSE went to the external source database, rather than using a local copy of the query result.
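The sampling and timing procedure can be sketched in Python as follows (the script actually used was written in Perl and is not reproduced here). The sketch times a single query function, whereas the experiment recorded a time for each source database; the function name, the fixed random seed, and the stub query are assumptions made so that the example runs without network access.

```python
import random
import time


def time_queries(names, query, sample_size=1000, seed=0):
    """Time `query` on a random sample of names.

    Returns a list of (name, found, elapsed_seconds) tuples.
    """
    rng = random.Random(seed)  # fixed seed so that the sample is repeatable
    sample = rng.sample(names, min(sample_size, len(names)))
    results = []
    for name in sample:
        start = time.perf_counter()
        found = query(name)  # only the query itself is timed
        elapsed = time.perf_counter() - start
        results.append((name, found, elapsed))
    return results


# Stub standing in for a call to one source database.
known_names = {"Homo sapiens", "Quercus alba"}
all_names = ["Homo sapiens", "Quercus alba", "Pan troglodytes"]
results = time_queries(all_names, lambda name: name in known_names)
```

Replacing the stub with a function that calls a web service would give one response time and one hit-or-miss flag per name, which is the information the evaluation requires.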