We next discuss the data indexed by OReFiL, its search function and interface, and its advantages and limitations.
Indexed Data
The indexed data consisted of 7365 URLs and 8429 MEDLINE entries. Figure 1 shows the distribution of MeSH terms at the second level of the hierarchy. The distribution indicates that the retrievable contents are well diversified and, accordingly, that a search with a MeSH term effectively narrows down the resources.
Search Function
OReFiL has language-model-based and Boolean-based search functions. Because OReFiL uses Indri in the Lemur Toolkit, users can use Indri's rich and flexible query language to find query-relevant online resources. The simplest query consists of a single word, such as pathway. To find resources whose corresponding MEDLINE entry has a specific MeSH term, the user adds one of the following query modifiers to the term:
- .mesh (the specified MeSH term can be a major topic descriptor or not in the MEDLINE entry),
- .noexp (the specified MeSH term is a major topic descriptor in the MEDLINE entry), or
- .majr (in addition to the condition of .noexp, query expansion is on).
The search behavior of each query modifier corresponds to the way the indexer treats MeSH terms. The system currently does not use MeSH Subheadings. An example of using a query modifier for MeSH is Proteins.mesh, with which OReFiL searches for resources whose corresponding MEDLINE entries have the MeSH term "Proteins," regardless of whether the term is a major topic descriptor. An example of using query expansion is Proteins.majr, with which OReFiL returns entries having the MeSH term "Proteins" or any of its descendants, such as "Caspase 9."
Users can also search by an author's name using the modifier .auth, as in smith.auth. The other modifiers are .atitle, .tiab, and .web. These restrict the search to resources for which the specified term appears in the corresponding MEDLINE titles, MEDLINE titles and abstracts, and fetched web pages, respectively.
To query with multiple words as a phrase, the #1 operator is used. For example, with the query #1(metabolic pathway), OReFiL searches for entries containing the exact expression "metabolic pathway." For a Boolean "AND" search, the user needs the #band operator, as in #band(pathway apoptosis), which retrieves entries in which both "pathway" and "apoptosis" appear. Boolean-based and language-model-based searches are mutually exclusive, so the language-model-based ranking we briefly described above does not work when users conduct a Boolean-based search. To apply a Boolean condition and still obtain a ranked result, users can use the #filreq operator. For example, #filreq(Caspases.noexp apoptosis) searches for resources relevant to apoptosis whose MEDLINE entries have the MeSH term "Caspases," and the ranking is determined by the language model. The operators can be combined, as in #filreq(#band(pathway apoptosis) #1(Oxidative Stress).majr).
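The modifiers and operators described above compose mechanically, which the following minimal Python sketch illustrates. It only assembles query strings and does not contact OReFiL or Indri. The helper names (term, phrase, band, filreq) are hypothetical and introduced here for illustration, and the placement of a modifier after the closing parenthesis of #1(...) simply follows the combined example given in the text.

```python
def term(word, modifier=None):
    """Return a single query term, optionally with a modifier such as 'mesh'."""
    return f"{word}.{modifier}" if modifier else word


def phrase(words, modifier=None):
    """Return an exact-phrase query built with the #1 operator."""
    body = f"#1({' '.join(words)})"
    return f"{body}.{modifier}" if modifier else body


def band(*parts):
    """Return a Boolean AND query over the given parts."""
    return f"#band({' '.join(parts)})"


def filreq(required, ranked):
    """Return a #filreq query: 'required' filters, 'ranked' drives the ranking."""
    return f"#filreq({required} {ranked})"


# Hypothetical usage: rebuild two of the example queries from the text.
simple = filreq(term("Caspases", "noexp"), term("apoptosis"))
combined = filreq(band("pathway", "apoptosis"),
                  phrase(["Oxidative", "Stress"], "majr"))
print(simple)
print(combined)
```

Building queries from small functions like these keeps parentheses balanced and makes it harder to confuse the digit in #1 with a lowercase letter.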
Interface
Figure 2 shows a screenshot of OReFiL output. The homepage and the search result pages share the same layout (only the area above the first horizontal line is displayed when you first access the site). In the MeSH terms box (encircled by a dotted line), the MeSH terms annotated to the MEDLINE abstracts in the hit list (descriptors of major topics and their conceptual ancestors in the MeSH hierarchy) are displayed in alphabetical order. The font size of each term reflects its frequency in the hit list: the more often a term is annotated to the MEDLINE abstracts in the hit list, the larger it appears. OReFiL counts not only the annotated MeSH terms but also their conceptual ancestors. MeSH terms also function as filters to narrow down a search result. When the user clicks one of them, the browser automatically composes the required query and fills in the query box, so users do not need to type a complicated query to narrow down the results. Users can easily remove a MeSH term from the query box by clicking the term. After checking the contents of the query box, the user clicks the "SUBMIT" button to obtain the results.
The hit list appears below the MeSH terms box. Each hit entry is shown in a grey box containing the following information:
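The counting and sizing behind the MeSH terms box can be sketched as follows. This is a minimal illustration under several assumptions that the text does not state: the hierarchy is a toy single-parent map (not the real MeSH tree, in which a term can have several parents), each term is counted at most once per hit, and font sizes scale linearly between hypothetical minimum and maximum point sizes.

```python
from collections import Counter

# Toy child -> parent map; NOT the real MeSH tree, and single-parent by assumption.
TOY_PARENT = {"Caspase 9": "Caspases", "Caspases": "Proteins"}


def with_ancestors(term, parent):
    """Yield a term followed by each of its ancestors in the toy hierarchy."""
    seen = set()
    while term is not None and term not in seen:
        seen.add(term)
        yield term
        term = parent.get(term)


def count_terms(hits, parent):
    """Count each annotated term and its ancestors, once per hit (assumption)."""
    counts = Counter()
    for annotated in hits:
        expanded = set()
        for term in annotated:
            expanded.update(with_ancestors(term, parent))
        counts.update(expanded)
    return counts


def font_sizes(counts, min_pt=10, max_pt=24):
    """Map counts to font sizes linearly (assumption), in alphabetical order."""
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1  # avoid division by zero when all counts are equal
    return {
        term: round(min_pt + (count - low) * (max_pt - min_pt) / span)
        for term, count in sorted(counts.items())
    }


# Hypothetical hit list: each inner list holds the MeSH terms of one hit.
hits = [["Caspase 9"], ["Caspases"], ["Proteins"]]
sizes = font_sizes(count_terms(hits, TOY_PARENT))
print(sizes)
```

Because ancestors are counted too, a broad term such as "Proteins" accumulates the hits of all its descendants in the toy map and therefore receives the largest font.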
- Title of a web page introduced in a peer-reviewed paper, with a link to the page and its URL,
- MeSH terms (descriptors of major topics),
- Links to popular web page search systems to search for web pages having a link to the hit web page,
- Links to major full-text paper journal search systems to search for papers citing the hit web page,
- Paper titles with a link to the corresponding PubMed entries.
We added the third and fourth items because they are useful for evaluating the reputation of the web page and for acquiring related knowledge. A hit entry has links to Google and AltaVista [20] to search for web pages that link to the web page of the entry. Both search engines provide the link: option, which retrieves web pages having a link to a given web page. For example, the search result for the query link:pubmed.gov is a list of web pages that link to pubmed.gov. A hit entry also has links to (1) BioMed Central, (2) Scirus, (3) HighWire Press, and (4) Google Scholar to search for full-text papers in which the URL appears. Each link includes an appropriate query, so users obtain a search result with a single click.
Our crawler could not fetch some web pages. We currently classify these cases into three categories: (1) page not found, (2) network or server problem, and (3) other issues. In the first case, the server returns an HTTP status of 404; in the second case, the first digit of the status is 5 (5xx). We distinguish between the first two because a web page is more likely to have been removed in the first case than in the second. Because a 5xx code indicates a problem in the network or the server, we also distinguish the second case from the other issues. Checking the "Hide page-not-found entries" box hides the entries in the first category from the hit list (the box is checked by default).
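The three-way classification above can be expressed as a small function. The sketch below is illustrative only: the function names, the use of None for a fetch that produced no HTTP response, and the dictionary layout of a hit entry are assumptions, not details of the OReFiL crawler.

```python
NOT_FOUND = "page not found"
SERVER_PROBLEM = "network or server problem"
OTHER = "other issues"


def classify_fetch_failure(status):
    """Classify a failed fetch by HTTP status (None = no response, assumption)."""
    if status == 404:
        return NOT_FOUND
    if status is not None and 500 <= status <= 599:
        return SERVER_PROBLEM
    return OTHER


def visible_entries(entries, hide_not_found=True):
    """Drop page-not-found entries when the checkbox is ticked (the default)."""
    if not hide_not_found:
        return list(entries)
    return [e for e in entries if e.get("failure") != NOT_FOUND]


# Hypothetical hit entries; 'failure' is None for pages that were fetched.
entries = [
    {"url": "http://example.org/a", "failure": None},
    {"url": "http://example.org/b", "failure": classify_fetch_failure(404)},
    {"url": "http://example.org/c", "failure": classify_fetch_failure(503)},
]
print([e["url"] for e in visible_entries(entries)])
```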
Advantages
First, general and very popular search systems such as Google, Yahoo, and AltaVista can be used to find online resources in the life science domain. However, it is difficult to use these systems to find specific types of web pages. OReFiL searches only for web pages whose URLs have appeared in peer-reviewed journals in the domain.
Second, although PubMed only searches for peer-reviewed journal papers in the domain, its results are too broad for users who want to find online resources. In addition, PubMed does not search for related web pages or URLs not appearing in the paper abstracts. OReFiL currently covers MEDLINE entries and BMC full-text papers, and we are preparing to include other open-access journal papers, including the contents of NAR, Genome Research, and others at PubMed Central.
Third, BioMed Central maintains a catalog of more than 1100 database sites and provides subject-area-based browsing and search functions (BioMed Central Databases). However, the catalog focuses only on databases and does not provide links to peer-reviewed paper information.
Another domain-specific search system is maintained by the Health Sciences Library System at the University of Pittsburgh [5, 21]. It provides a basic Boolean-based search function and a clustering-based function for browsing and narrowing down results. The clustering function is provided by commercial software, the Vivisimo clustering engine [22], which automatically clusters documents based on their contents. Each online resource entry is manually curated, which assures good-quality results. However, this system takes more time and effort to update than OReFiL does. Other sites providing curated collections of resources share this problem. In addition, OReFiL's MeSH-term-based search can be used to cluster resources.
To address the issues above, we developed an online resource search system for the life sciences that is up to date, covers a large number of resources in the domain, provides links to peer-reviewed paper information, and has a flexible search function built on an open-source toolkit. Because our crawler periodically accesses each URL, the existence of each web page can be checked.
Limitations and points for improvement
Currently, OReFiL does not consider the context in which a URL appears; the authors of a paper may be either developers or users of a resource that they mention. Making URLs from both contexts searchable is not in itself a problem, since it allows users to find resources whose developers have not introduced them in an indexed peer-reviewed paper. The credibility of such a resource can be gauged by the number of PubMed entries listed in its hit entry or by the impact factor of the journal in which its URL appears. Nevertheless, context information would be beneficial to users, and we are planning to develop a method of identifying whether a paper introduces or uses a resource.
We also plan to develop an option for changing the ranking method. The current system evaluates significance based only on the language model, so the ranking is influenced by the distribution and frequency of terms in the target documents. It would also be desirable for OReFiL to be able to rank a query-relevant hit list by reputation or publication date.
The ability to display the part of a hit document (MEDLINE entry or fetched web page) in which a query term appears would also be useful to users because it is sometimes difficult to determine the relevance of the hit. We are developing a method to do this.
Our crawler currently cannot follow some "moved" pages. In some cases, a site has changed its URL several times, and each of the URLs appears in MEDLINE abstracts. Because OReFiL identifies a resource by its URL, such a site appears multiple times in the generated hit list. We need to fix this "URL synonym" issue.
Expanding OReFiL's coverage is a difficult task because many life science journals still do not allow open access to their full-text papers. Nevertheless, because both the number of URLs in MEDLINE abstracts (see Fig. 3 and [1]) and the number of journals allowing open access [23] are steadily increasing, OReFiL will become more useful. In addition, accepting user-submitted URLs and their corresponding PubMed IDs will improve OReFiL's coverage.
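One conceivable way to handle URL synonyms, offered here only as an illustrative sketch and not as the method OReFiL uses or plans to use, is to resolve each indexed URL to the final target of its redirect chain and group URLs that share a target. The redirect map, the example.org URLs, and the function names below are all hypothetical; in practice the map would have to be collected by a crawler that follows "moved" responses.

```python
def canonical(url, redirects, max_hops=10):
    """Follow a redirect map to its end, guarding against loops and long chains."""
    seen = {url}
    for _ in range(max_hops):
        target = redirects.get(url)
        if target is None or target in seen:
            break
        seen.add(target)
        url = target
    return url


def group_synonyms(urls, redirects):
    """Group indexed URLs by the canonical URL they finally resolve to."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical(url, redirects), []).append(url)
    return groups


# Hypothetical data: a site that moved twice, plus an unrelated site.
redirects = {
    "http://old.example.org/db": "http://mid.example.org/db",
    "http://mid.example.org/db": "http://new.example.org/db",
}
indexed = [
    "http://old.example.org/db",
    "http://mid.example.org/db",
    "http://new.example.org/db",
    "http://other.example.org/tool",
]
print(group_synonyms(indexed, redirects))
```

With such a grouping, the three historical URLs of the moved site would collapse into a single hit entry, while the unrelated site would remain separate.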