In this paper, we address the retrieval of articles that are referenced from entries of the PDB (Protein Data Bank), because we use the structural information associated with each article. This restriction can be relaxed when the structural aspect is of minor importance. For each article, the structural and functional information on the objects (e.g. proteins, genes), concepts, and themes it describes is obtained, via the corresponding PDB entry, from the following biodatabases:

A database of protein structure information (PDB)

A database of protein structural classification information (SCOP)

A database of gene ontology (GO)

A database of protein sequence information (SwissProt/UniProt)

A database of biomedical articles (MEDLINE/PubMed)
The related articles can be retrieved based on the relevance between input articles and the target articles, which is evaluated using the information and conceptual hierarchies obtained from these databases.
Calculation of the relevance based on the concept hierarchy
Evaluation of relevance using the concept hierarchy
Before defining the relevance between articles, we define d(H, t_{1}, t_{2}), the relevance between two concepts t_{1} and t_{2} on a concept hierarchy H, as
d(H, t_{1}, t_{2}) = 2 d(H) − {P(t_{1}, c_{p}(t_{1}, t_{2})) + P(t_{2}, c_{p}(t_{1}, t_{2}))} (1)
where d(H) represents the depth of the concept hierarchy and c_{p}(t_{1}, t_{2}) represents the lowest-level ancestor common to t_{1} and t_{2}. P(t, c_{p}(t_{1}, t_{2})) is the path length between t ∈ {t_{1}, t_{2}} and c_{p}(t_{1}, t_{2}), defined as follows.
P(t, c_{p}(t_{1}, t_{2})) = min{|E| | E ∈ P_{min}(t, c_{p}(t_{1}, t_{2}))} (2)
where P_{min}(t, c_{p}(t_{1}, t_{2})) is the set of the shortest paths from t to c_{p}(t_{1}, t_{2}). For simplicity, the length of each edge is fixed at 1.0 here; equation (2) is redefined later by assigning a weight to each edge so that the user’s intention is reflected in the relevance evaluation.
Figure 1 shows an example of relevance evaluation on a concept hierarchy. The relevance between the two concepts “M/G1 Transition” and “cell cycle arrest” is 5 (= 4 × 2 − 3), because the depth of the concept hierarchy is 4 and the path lengths to the lowest common ancestor “cell cycle control” sum to 3 (= 2 + 1). The relevance between articles on the concept hierarchies of GO (Gene Ontology) and SCOP, based on equation (1), is discussed in the following sections.
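The worked example can be reproduced with a minimal Python sketch. It assumes that the hierarchy is a tree stored as a child-to-parent map with unit edge lengths, and it computes the relevance as twice the hierarchy depth minus the two path lengths to the lowest common ancestor, as in the example above. The map `PARENT`, the nodes "upper concept" and "intermediate concept", and all function names are hypothetical; the toy tree only mimics the depth and path lengths quoted for Figure 1 and is not the actual GO structure.

```python
# Hypothetical toy hierarchy (child -> parent); not the real GO structure.
PARENT = {
    "root": None,
    "upper concept": "root",
    "cell cycle control": "upper concept",
    "cell cycle arrest": "cell cycle control",
    "intermediate concept": "cell cycle control",
    "M/G1 Transition": "intermediate concept",
}


def chain_to_root(parent, t):
    """Return [t, parent(t), ..., root]."""
    chain = [t]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain


def lowest_common_ancestor(parent, t1, t2):
    """First node on t2's chain that also lies on t1's chain."""
    on_t1_chain = set(chain_to_root(parent, t1))
    for node in chain_to_root(parent, t2):
        if node in on_t1_chain:
            return node
    raise ValueError("concepts do not share an ancestor")


def hierarchy_depth(parent):
    """Number of edges on the longest node-to-root chain."""
    return max(len(chain_to_root(parent, t)) - 1 for t in parent)


def concept_relevance(parent, t1, t2):
    """Twice the depth minus the two unit-length paths to the common ancestor."""
    c = lowest_common_ancestor(parent, t1, t2)
    p1 = chain_to_root(parent, t1).index(c)  # edges from t1 up to c
    p2 = chain_to_root(parent, t2).index(c)  # edges from t2 up to c
    return 2 * hierarchy_depth(parent) - (p1 + p2)


print(concept_relevance(PARENT, "M/G1 Transition", "cell cycle arrest"))  # 5
```

In a tree every concept has a single path to each ancestor, so the minimum over shortest paths is trivial; a hierarchy in which a concept has several parents would need a proper shortest-path search instead.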
Calculating the relevance between concepts from the functional viewpoint
The relevance between concepts in GO is evaluated based on the idea that “a gene product related to a lower concept must also be related to the higher concept”. First, the functional concepts included in the article are identified by looking up functional information, such as keywords or protein names, in the GO hierarchy [6, 7].
Two types of relation, namely “is_a” and “part_of”, are used together in a single concept hierarchy in GO. Because the relation “part_of” is ambiguous, only the relation “is_a” is considered when calculating relevance on the concept hierarchy. Figure 2 shows an example of a hierarchy extracted by tracing the “is_a” relations in GO upward from a term corresponding to a concept in the article to the top of the hierarchy.
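Tracing only the “is_a” relations upward can be sketched as follows. The sketch assumes the hierarchy is available as a map from each term to a list of (relation, parent term) pairs; the map `EDGES`, its term names, and the function name are hypothetical placeholders rather than real GO identifiers.

```python
from collections import deque

# Hypothetical terms; each entry lists (relation, parent term) pairs.
EDGES = {
    "leaf term": [("is_a", "middle term"), ("part_of", "other term")],
    "middle term": [("is_a", "top term")],
    "other term": [("is_a", "top term")],
    "top term": [],
}


def is_a_ancestors(edges, term):
    """Collect every term reachable from `term` by following only is_a edges."""
    found = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for relation, parent in edges.get(current, []):
            # part_of edges are skipped, as in the text above
            if relation == "is_a" and parent not in found:
                found.add(parent)
                queue.append(parent)
    return found


print(sorted(is_a_ancestors(EDGES, "leaf term")))  # ['middle term', 'top term']
```

Here "other term" is not collected, because it is reachable from "leaf term" only through a “part_of” edge.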
If more than one functional concept (functional term) is included in an input article, the retrieval results may change depending on which concept is important to the user. Therefore, we introduce a user-given weight assigned to each concept. We define d_{GO}(H, D_{1}, D_{2}), the relevance between articles D_{1} (input) and D_{2} (retrieval target) on the functional concept hierarchy in GO, as follows.
where ω_{t_{1}} is the user-given weight assigned to the concept (functional term) t_{1} in the article D_{1}, and GO(D) is the set of the functional concepts (functional terms) in the article D.
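How user-given weights could enter an article-level score is illustrated by the sketch below. It assumes, purely for illustration, that each input term is matched with its most relevant target term and that the weighted scores are then summed; the defining equation may aggregate the concept-level values differently. The function name, the default weight of 1.0, and the toy relevance table are all hypothetical.

```python
def article_relevance(concept_relevance, terms_d1, terms_d2, weights):
    """Weighted sum over input terms of the best concept-level relevance.

    ASSUMPTION: best-match-then-sum aggregation; not taken from the source.
    concept_relevance is any callable (t1, t2) -> number.
    """
    if not terms_d2:
        return 0.0
    total = 0.0
    for t1 in terms_d1:
        best = max(concept_relevance(t1, t2) for t2 in terms_d2)
        total += weights.get(t1, 1.0) * best  # unweighted terms count once
    return total


# Made-up concept-level relevance values for two input and two target terms.
TABLE = {("a", "x"): 5, ("a", "y"): 3, ("b", "x"): 1, ("b", "y"): 2}
score = article_relevance(lambda t1, t2: TABLE[(t1, t2)],
                          ["a", "b"], ["x", "y"], {"a": 2.0})
print(score)  # 12.0
```

Doubling the weight of term "a" doubles its contribution (2.0 × 5), while term "b" contributes its best match once (1.0 × 2).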
Calculating the relevance from the viewpoint of the protein structure
In the SCOP database, protein structures are classified hierarchically at six levels (i.e. class, fold, superfamily, family, protein, and species). d_{SCOP}(H, D_{1}, D_{2}), the relevance between two articles D_{1} (input) and D_{2} (retrieval target) based on the protein structural concept hierarchy H, is defined as
where SCOP(D_{1}) is the set of the structural concepts, namely the classification results of the protein described in the article D_{1}.
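For a fixed-depth classification such as SCOP, the concept-level relevance reduces to counting shared levels. The sketch below assumes that each structural concept is a full six-level classification, that a root node sits above the class level (so the hierarchy depth is 6), and that relevance is twice the depth minus the two path lengths to the lowest common ancestor. The labels in the example are made up and are not real SCOP identifiers.

```python
SCOP_LEVELS = ("class", "fold", "superfamily", "family", "protein", "species")


def scop_concept_relevance(c1, c2):
    """Relevance of two six-level classifications given as tuples of labels."""
    depth = len(SCOP_LEVELS)
    if len(c1) != depth or len(c2) != depth:
        raise ValueError("expected one label per SCOP level")
    shared = 0
    for a, b in zip(c1, c2):
        if a != b:
            break  # the lowest common ancestor sits at the last shared level
        shared += 1
    # each concept lies (depth - shared) edges below the common ancestor
    return 2 * depth - 2 * (depth - shared)


# Hypothetical labels: same family, different protein.
p = ("c1", "f1", "sf1", "fam1", "protA", "sp1")
q = ("c1", "f1", "sf1", "fam1", "protB", "sp1")
print(scop_concept_relevance(p, q))  # 8
```

Under these assumptions the value simplifies to twice the number of shared levels, so it ranges from 0 (different classes) to 12 (identical classification).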
Calculating the relevance from the viewpoint of the medical term
A hierarchy of medical terms can be obtained from PubMed. The MeSH (Medical Subject Headings) terms assigned to an article are extracted by referring to PubMed. MeSH is the National Library of Medicine’s comprehensive controlled vocabulary thesaurus, in which the descriptors are arranged in a hierarchy. Because about ten MeSH terms are assigned to each article, the relevance can be calculated in the same manner as for GO. In other words, the path lengths on the MeSH hierarchy between the terms of each candidate article and the terms of the input article are used to evaluate the relevance. We define the relevance between articles from the viewpoint of medical terms as the sum of the relevance values calculated for each MeSH term.
d_{mh}(H, D_{1}, D_{2}), the relevance between two articles D_{1} and D_{2} based on the MeSH term hierarchy H, is defined as
where ω_{t_{1}} is the weight of the concept, which is added if the term t_{1} is a Major Topic in the article D_{1}.
Calculation of the relevance for the related article by the update of the edge length
Outline
In information retrieval systems, multiple keywords (namely, ANDsearch) are often used to specify the user’s requirement. In the proposed method, the user’s intention in retrieving related articles is specified by using more than one article as input (an initial article and additional articles). In other words, the length of the path between concepts, which is first calculated using the initial article, is updated based on the similarity between the initial article and the additional articles. Note that we use the term ‘relevance’ for the relation between a concept in the query articles and a concept in the retrieval targets, whereas we use the term ‘similarity’ for the relation between concepts within the query articles (the initial article and the additional articles).
Calculating the similarity between concepts in query articles
The evaluation of similarity between concepts on a concept hierarchy has been well studied. One of the most basic methods evaluates the length of the path on the graph that represents the concept hierarchy with nodes and edges, in a manner similar to our relevance measure described above. However, it is difficult to weight the path systematically from the path length alone. In addition, such a method cannot take into account the depth of the common ancestor of the two concepts; that is, the similarity is the same whether the common ancestor is located near the root or far from it. Another method evaluates the similarity based on entropy, focusing on the common ancestor of the concepts [8, 9]. This method takes the location of the common ancestor into account, but it does not suffice for weighting the path between the concepts, which is required for updating the relevance in our method. Therefore, in this study, the method proposed in [10], which measures the similarity between concepts considering both the path between them and their common node, is applied to weighting the path between the concepts.
Giving the weight to the edge in the concept hierarchy
The user’s intention, on which the retrieval of related articles is based, is specified by the initial article and the additional articles. Considering the paths between the concepts described in the initial article and those in the additional articles, an article that includes many concepts similar to the concepts these articles have in common should be more related to the input articles. Therefore, the weight of the edges connecting the concepts in the initial and the additional articles is given by the similarity measure described in [10].
We define the weight ω(e) of the edge e, called edge weight, using the concept hierarchy as
where t_{1} is a concept in T_{A}, the set of concepts in the initial article A, and t_{2} is a concept in T_{L}, the set of concepts in the additional article L. S(t_{1}, t_{2}) is the similarity between the two concepts t_{1} and t_{2} calculated by the method described in [10], and E(t_{1}, t_{2}) is the set of edges connecting the concepts t_{1} and t_{2}.
Figure 3 shows an example of calculated edge weight that is assigned to each edge in the paths between the concepts. The nodes in red show the concepts in the initial article, and the nodes in blue show the concepts in the additional article.
Updating the edge weight with more than one additional article
An ANDsearch or ORsearch that uses more than one additional article updates the edge weights using the concepts in each additional article.
Let WE(A, L) be the set of edges to which an edge weight is assigned by the input article A and the additional article L, as follows.
If additional articles L_{1},…,L_{m} are given instead of a single additional article L, then in the case of the ANDsearch only the edges common to all of them are assigned an edge weight. If different edge weights are assigned to the same edge, the smallest one is selected.
In the case of the ORsearch, on the other hand, every edge that receives an edge weight from at least one additional article is assigned an edge weight. As with the ANDsearch, the smallest edge weight is used when different edge weights are assigned to the same edge.
In Figure 4, the concept “I” is a concept in the initial article and the concepts “G” and “H” are concepts in the additional articles. For the ANDsearch, the edge weight is assigned to each common edge (marked in the figure), that is, each edge in the intersection of the set of edges connecting “I” and “G” and the set of edges connecting “I” and “H”.
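The combination of edge weights from several additional articles can be sketched as follows. The sketch assumes that each additional article has already produced a map from edges to edge weights, that the ANDsearch keeps the intersection of the edge sets and the ORsearch their union, and that the smallest weight wins on a shared edge. The edge layout, node names other than “I”, “G”, and “H”, and all weight values are made up for illustration and are not read from Figure 4.

```python
def combine_edge_weights(weight_maps, mode):
    """Merge per-article edge weights for an 'and' or 'or' search."""
    if not weight_maps:
        return {}
    edge_sets = [set(m) for m in weight_maps]
    if mode == "and":
        edges = set.intersection(*edge_sets)  # edges common to all articles
    elif mode == "or":
        edges = set.union(*edge_sets)         # edges from any article
    else:
        raise ValueError("mode must be 'and' or 'or'")
    # when several articles weight the same edge, keep the smallest weight
    return {e: min(m[e] for m in weight_maps if e in m) for e in edges}


# Hypothetical edges written as (child, parent) pairs with made-up weights.
from_g = {("I", "X"): 0.4, ("G", "X"): 0.4}
from_h = {("I", "X"): 0.7, ("X", "R"): 0.7, ("H", "R"): 0.7}

print(combine_edge_weights([from_g, from_h], "and"))  # {('I', 'X'): 0.4}
print(len(combine_edge_weights([from_g, from_h], "or")))  # 4
```

Only the edge shared by both paths survives the ANDsearch, whereas the ORsearch keeps all four edges and still assigns 0.4 to the shared one.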
Evaluating the relevance between two concepts based on updated edge weight
We have defined the basic relevance by equation (1) in the previous section. When edge weights are added to the path between the concepts based on the additional articles, the definitions of P(t_{1}, c_{p}(t_{1}, t_{2})) and P(t_{2}, c_{p}(t_{1}, t_{2})) in equation (1) have to be modified. That is, equation (2) is redefined as
P(t, c_{p}(t_{1}, t_{2})) = min{Σ_{e ∈ E} ω(e) | E ∈ P_{min}(t, c_{p}(t_{1}, t_{2}))} (2’)
where ω(e) is the edge weight from equation (6) and P_{min}(t, c_{p}(t_{1}, t_{2})) is the set of the shortest paths connecting t and c_{p}(t_{1}, t_{2}).
Equation (2’) is applied to every calculation of the relevance between concepts. In other words, the relevance in equation (1) is modified through equation (2’), and consequently the edge weights reflecting the user’s intention are also applied in equations (3), (4), and (5).
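A weighted path length of this kind can be sketched for a tree-shaped hierarchy as follows. The sketch assumes that the path length is the sum of the edge weights along the path and that an edge without an assigned weight keeps the unit length 1.0; both points, as well as the node names and the weight value, are assumptions made for illustration.

```python
def weighted_path_length(parent, t, ancestor, edge_weight, default=1.0):
    """Sum the edge weights on the path from t up to one of its ancestors.

    ASSUMPTION: edges missing from edge_weight keep the unit length `default`.
    """
    total = 0.0
    node = t
    while node != ancestor:
        up = parent.get(node)
        if up is None:
            raise ValueError("ancestor is not on the path from t to the root")
        total += edge_weight.get((node, up), default)
        node = up
    return total


# Hypothetical chain root <- c <- x <- t with one weighted edge.
PARENT_CHAIN = {"root": None, "c": "root", "x": "c", "t": "x"}
WEIGHTS = {("t", "x"): 0.5}

print(weighted_path_length(PARENT_CHAIN, "t", "c", WEIGHTS))  # 1.5
print(weighted_path_length(PARENT_CHAIN, "t", "c", {}))       # 2.0
```

Because the relevance subtracts the path lengths from twice the hierarchy depth, an edge weight below 1.0 shortens the path and raises the relevance of concepts reached through that edge.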