Outline
Figure 1 outlines the prediction of interaction sites. In the figure, ‘T’ (in red) is the target protein and ‘A’ – ‘J’ are its neighboring proteins extracted from the PPI network, where neighboring proteins are defined as proteins within a distance of two from the target protein in the PPI network. Since an interaction site often forms a concave structure, only pockets, rather than the whole molecular surface of the protein, are treated as candidate interaction sites. In other words, interaction sites are predicted by extracting a pocket whose shape and physical properties are commonly observed among ‘A’ – ‘J’ and ‘T’. In practice, however, the neighboring proteins ‘A’ – ‘J’ do not all have similar functions. For this reason, groups in which a similar pocket is commonly observed, called neighboring protein clusters, are extracted from ‘A’ – ‘J’. An important issue in our method is how to extract clusters that share a discriminative pocket, similar in shape and physical properties. If structurally similar groups are simply extracted from the neighboring proteins, clusters with similar structural features are obtained, but they do not necessarily share a “discriminative” pocket, because pockets that are observed universally in many proteins tend to have high similarity. To cope with this problem, we introduce the restriction that each cluster must contain at least one protein with known interaction sites. Next, a score is given to each pocket of the target protein that appears in common in all of the extracted neighboring protein clusters, and the top-ranked pockets are output as interaction sites.
Meanwhile, if the target protein ‘T’ has no neighboring protein with a known interaction site, no neighboring protein cluster can be constructed. To handle this difficulty, the prediction process is repeated, treating the predicted interaction sites as known interaction sites. In addition, because repeating the prediction process increases the number of neighboring proteins with predicted (and hence treated as known) interaction sites, reorganizing the clusters using them is expected to improve the prediction accuracy.
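The definition of neighboring proteins as those within a distance of two in the PPI network can be illustrated with a breadth-first search. The following is a minimal sketch only: the adjacency-list representation, the function name `neighbors_within`, and the toy network are assumptions made for illustration and are not part of the method description.

```python
from collections import deque

def neighbors_within(ppi, target, max_dist=2):
    """Return the proteins reachable from `target` in at most `max_dist` edges."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue  # do not expand beyond the distance limit
        for v in ppi.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[target]  # the target is not its own neighbor
    return set(dist)

# Hypothetical toy network given as undirected adjacency lists.
ppi = {
    "T": ["A", "B"],
    "A": ["T", "C"],
    "B": ["T"],
    "C": ["A", "D"],
    "D": ["C"],
}
print(sorted(neighbors_within(ppi, "T")))
```

In this toy network, ‘D’ lies at distance three from ‘T’ and is therefore excluded.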
Molecular surface data and pocket
In the proposed method, molecular surface data available from the eF-site database [24] are used. The molecular surface is represented by a number of polygons, and every vertex of the polygons carries structural information (location, maximum curvature, and minimum curvature), property values (electrostatic potential and hydrophobicity), and connectivity to other vertices. Interaction sites are widely known to have concave structures on the surface, for reasons of binding stability, specificity, and reaction promotion. Much research has been conducted on searching for and extracting pockets from the protein surface as candidate interaction sites [25, 26].
In fact, the molecular surface of some proteins has more than 20,000 vertices, so handling the whole molecular surface when comparing protein structures is impractical. Focusing only on pockets extracted from the molecular surface is therefore advantageous. In our method, the LIGSITE algorithm [27] is used to extract pockets. About 30 pockets are extracted for each protein.
Representation of pockets by histograms
Proteins are known to change their conformation when interacting, so comparing pockets by rigidly superimposing the vertices that make up each pocket is inappropriate. Many methods for comparing surface patches have been proposed so far [28, 29]. In order to compare the molecular surfaces of pockets mainly in terms of physical properties and only roughly in terms of geometry, we represent a molecular surface by histograms of the structural and physical properties of the surface. Histogram comparison is used in areas such as image processing, and it allows pockets to be compared approximately rather than exactly. Since a pocket consists of a set of polygon vertices, it can be expressed with four histograms computed from four properties of each vertex, namely maximum curvature κ_{max}, minimum curvature κ_{min}, electrostatic potential C, and hydrophobicity H. Each histogram is defined by three parameters: the bin width d (the range of a rank), the maximum value max, and the minimum value min, as shown below. The values of max, min, and d were determined experimentally.
Mean curvature: M = (κ_{max} + κ_{min})/2, with max = 3.0, min = –3.0, d = 0.01
Gaussian curvature: G = κ_{max} · κ_{min}, with max = 3.0, min = –3.0, d = 0.01
Electrostatic potential: C, with max = 0.6, min = –0.6, d = 0.01
Hydrophobicity: H, with max = 5.0, min = –5.0, d = 0.1
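The histogram representation can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the vertex layout `(k_max, k_min, c, h)`, the names `BINS`, `histogram`, and `pocket_histograms`, and the clamping of out-of-range values into the first or last bin are all hypothetical choices. It also assumes that the third and fourth parameter sets listed above belong to C and H, in the order in which the properties are introduced.

```python
import math

# (min, max, d) for each histogram; the C/H assignment is an assumption.
BINS = {
    "M": (-3.0, 3.0, 0.01),   # mean curvature
    "G": (-3.0, 3.0, 0.01),   # Gaussian curvature
    "C": (-0.6, 0.6, 0.01),   # electrostatic potential
    "H": (-5.0, 5.0, 0.1),    # hydrophobicity
}

def histogram(values, vmin, vmax, d):
    """Count values into bins of width d over [vmin, vmax].

    Values outside the range are clamped into the first or last bin
    (an assumption; the text does not say how they are handled).
    """
    n_bins = round((vmax - vmin) / d)
    counts = [0] * n_bins
    for v in values:
        k = math.floor((v - vmin) / d)
        counts[min(max(k, 0), n_bins - 1)] += 1
    return counts

def pocket_histograms(vertices):
    """Build the four histograms of a pocket from (k_max, k_min, c, h) vertices."""
    props = {
        "M": [(k_max + k_min) / 2 for k_max, k_min, _, _ in vertices],
        "G": [k_max * k_min for k_max, k_min, _, _ in vertices],
        "C": [c for _, _, c, _ in vertices],
        "H": [h for _, _, _, h in vertices],
    }
    return {name: histogram(props[name], *BINS[name]) for name in BINS}

# Hypothetical pocket with three vertices.
pocket = pocket_histograms([(1.0, 0.5, 0.12, 2.05), (0.2, -0.4, -0.3, -1.0), (4.0, 2.0, 0.9, 7.0)])
```

Each histogram has (max − min)/d ranks, that is 600, 600, 120, and 100 ranks for M, G, C, and H respectively, and every vertex contributes exactly one count to each histogram.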
Similarity among pockets
A pocket is expressed using four histograms of structural and physical properties. We define similarity among pockets by comparing the four histograms.
Let p_{1}, …, p_{N} be N pockets, where each pocket p_{i} (1 ≤ i ≤ N) is expressed with the histogram of mean curvature M_{i}, the histogram of Gaussian curvature G_{i}, the histogram of electrostatic potential C_{i}, and the histogram of hydrophobicity H_{i}. We simply define S_{pkt}(p_{1}, …, p_{N}), the similarity among the pockets p_{1}, …, p_{N}, by
S_{pkt}(p_{1}, …, p_{N}) = J(M_{1}, …, M_{N}) × J(G_{1}, …, G_{N}) × J(C_{1}, …, C_{N}) × J(H_{1}, …, H_{N}) (1)
where J(A_{1}, …, A_{N}) represents the similarity among the histograms A_{1}, …, A_{N}, which is defined by equation (2). Here a_{ik} (1 ≤ i ≤ N) represents the frequency of the kth rank of the ith histogram, and n represents the maximum value of the rank. Equation (2) applies the idea of the Jaccard coefficient to the comparison of histograms. That is to say, the similarity among pockets S_{pkt} is defined as the product of the similarities of the four histograms expressing each pocket.
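To make the product structure of equation (1) concrete, the sketch below computes a pocket similarity from per-histogram similarities. The exact form of J is not reproduced in the text here, so the sketch assumes one common Jaccard-style form for histograms, namely the sum over ranks of the minimum frequency divided by the sum over ranks of the maximum frequency. That formula, the function names, and the dictionary layout of a pocket are assumptions for illustration only.

```python
def jaccard(histograms):
    """Assumed Jaccard-style similarity: sum_k min_i a_ik / sum_k max_i a_ik."""
    num = sum(min(column) for column in zip(*histograms))
    den = sum(max(column) for column in zip(*histograms))
    return num / den if den else 0.0

def pocket_similarity(pockets):
    """S_pkt as in equation (1): product of J over the four histograms."""
    s = 1.0
    for name in ("M", "G", "C", "H"):
        s *= jaccard([p[name] for p in pockets])
    return s

# Hypothetical pockets with three ranks per histogram.
p = {"M": [1, 2, 0], "G": [1, 1, 1], "C": [0, 3, 0], "H": [2, 0, 1]}
q = {"M": [1, 1, 1], "G": [1, 1, 1], "C": [0, 3, 0], "H": [2, 0, 1]}
print(pocket_similarity([p, q]))  # only the M histograms differ
```

Because the four factors are multiplied, a low similarity in any single property lowers the overall pocket similarity, and identical pockets score 1 under the assumed form of J.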
Extraction of neighboring protein clusters
In our method, a neighboring protein cluster is defined as a subset of the neighboring proteins whose members share pockets that are similar in shape and physical properties and are specific to the cluster. We introduce a similarity measure that shows how similar the pockets on the proteins in a subset are. If the proteins in a subset have similar interaction sites, they are likely to share common pockets, and the similarity of the pockets of the proteins in the subset should then be high. Therefore, the pockets of the proteins in the subset are exhaustively compared using the similarity among pockets given by equation (1), and the highest similarity is taken as the subset similarity. However, this highest similarity may actually be due to nonspecific pockets that appear universally in many proteins. To handle this, a strong restriction is introduced: every subset must contain one or more proteins with a known interaction site. The algorithm for extracting neighboring protein clusters is as follows.

1.
Let P be the set of neighboring proteins, and S (⊆ P) be the set of proteins in P whose interaction sites are known. Let PS(P, n) be the set of all subsets of P with cardinality n (1 < n ≤ k), and let ps be an element of PS(P, n), namely ps ∈ PS(P, n). Enumerate every ps satisfying the following constraint.
ps ∩ S ≠ ∅ (3)

2.
Let P_{x_{1}}, …, P_{x_{n}} be the proteins in ps, namely ps = {P_{x_{1}}, …, P_{x_{n}}}. Calculate S_{set}(ps), the similarity among {P_{x_{1}}, …, P_{x_{n}}}, by the following definition
S_{set}(ps) = max{S_{pkt}(p^{1}, …, p^{n}) | p^{i} ∈ pkt(P_{x_{i}}), 1 ≤ i ≤ n} (4)
where pkt(P_{x_{i}}) denotes the set of pockets in protein P_{x_{i}}. S_{set}(ps) is the similarity of the most similar combination of pockets, found by exhaustively comparing all the pockets of each protein.

3.
Extract the elements ranked in the top Z of PS(P, n) as clusters C as follows.
C = {x ∈ PS(P, n) | Rank(x, PS(P, n)) < Z} (5)
where the function Rank(x, PS(P, n)) gives the ranking of x in PS(P, n) in terms of the similarity, which is formally defined as the number of elements whose similarity is strictly higher than that of x:
Rank(x, PS(P, n)) = |{y ∈ PS(P, n) | x ≠ y, S_{set}(x) < S_{set}(y)}| (6)
Figure 2 illustrates an example of extracting the neighboring protein clusters for k = 4 and Z = 1. In this example, the set of neighboring proteins P consists of six proteins, two of which have a known interaction site. With k = 4, the possible subsets ps of the neighboring protein set have size 2, 3, or 4. First, every ps that satisfies constraint (3) is enumerated. Next, the enumerated ps are ranked according to their similarity values. With Z = 1, the ps with the highest similarity value are extracted as the clusters.
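The enumeration and ranking steps can be sketched as below for the setting of the example (six neighboring proteins, two with known interaction sites, k = 4, Z = 1). This is a sketch under assumptions: the ranking is applied separately for each subset size n, following the use of PS(P, n) in equations (5) and (6); the names `subsets_of_size`, `top_clusters`, and `toy_score` are hypothetical; and `toy_score` is an arbitrary stand-in for S_{set}, which in the method would be computed from pocket histograms.

```python
from itertools import combinations

def subsets_of_size(proteins, known, n):
    """Subsets of size n that contain at least one known protein (constraint 3)."""
    known = set(known)
    return [frozenset(ps) for ps in combinations(sorted(proteins), n)
            if known & set(ps)]

def top_clusters(proteins, known, n, z, s_set):
    """Subsets x with Rank(x) < z, where Rank counts strictly better subsets."""
    candidates = subsets_of_size(proteins, known, n)
    scores = {ps: s_set(ps) for ps in candidates}
    return [x for x in candidates
            if sum(scores[x] < scores[y] for y in candidates if y != x) < z]

def toy_score(ps):
    """Hypothetical stand-in for S_set; NOT a real similarity."""
    return sum(ord(name) for name in ps)

proteins = "ABCDEF"          # six neighboring proteins
known = {"A", "B"}           # two with known interaction sites
for n in range(2, 5):        # k = 4 gives sizes 2, 3, and 4
    print(n, len(subsets_of_size(proteins, known, n)),
          [sorted(c) for c in top_clusters(proteins, known, n, 1, toy_score)])
```

Of the 15, 20, and 15 subsets of size 2, 3, and 4, constraint (3) leaves 9, 16, and 14 candidates respectively, so the restriction also reduces the number of exhaustive pocket comparisons that are needed.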
Scoring of pockets
If a pocket in the target protein is similar to the pockets that appear in common in a highly ranked neighboring protein cluster, it may be a candidate interaction site. To evaluate each candidate, we introduce a voting-based scoring scheme. In this scheme, sets of pockets consisting of one pocket from the target protein and one pocket from each protein in the neighboring protein cluster are evaluated in terms of similarity, and the target pocket in the set with the highest similarity value receives a vote. Formally, for a target protein P_{T} and the proteins P_{1}, …, P_{n} in the neighboring protein cluster, the pockets to be voted for are enumerated as follows.
p ∈ pkt(P_{T}) s.t.
S_{pkt}(p^{1}, …, p^{n}, p) = S_{set}({P_{1}, …, P_{n}, P_{T}}),
∀i, 1 ≤ i ≤ n, p^{i} ∈ pkt(P_{i}) (7)
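The voting step can be sketched as an exhaustive search over one pocket per cluster protein combined with each target pocket. The sketch makes several assumptions: each pocket is reduced to a single short histogram, `toy_similarity` is a hypothetical Jaccard-style stand-in for S_{pkt}, and votes from several clusters are simply summed per target pocket. The function names are illustrative.

```python
from collections import Counter
from itertools import product

def toy_similarity(pockets):
    """Hypothetical stand-in for S_pkt using one histogram per pocket."""
    num = sum(min(column) for column in zip(*pockets))
    den = sum(max(column) for column in zip(*pockets))
    return num / den if den else 0.0

def voted_pockets(cluster_pockets, target_pockets, s_pkt=toy_similarity):
    """Return (best similarity, indices of target pockets that attain it)."""
    best, winners = -1.0, set()
    for combo in product(*cluster_pockets):      # one pocket per cluster protein
        for t, p in enumerate(target_pockets):   # plus one target pocket
            s = s_pkt(list(combo) + [p])
            if s > best:
                best, winners = s, {t}
            elif s == best:
                winners.add(t)
    return best, winners

def tally_votes(clusters, target_pockets, s_pkt=toy_similarity):
    """Sum the votes each target pocket receives over all clusters."""
    votes = Counter()
    for cluster_pockets in clusters:
        votes.update(voted_pockets(cluster_pockets, target_pockets, s_pkt)[1])
    return votes

# Hypothetical data: a cluster of two proteins with two pockets each,
# and a target protein with two pockets.
cluster = [[[2, 0], [0, 2]], [[2, 0], [1, 1]]]
target = [[0, 2], [2, 0]]
print(voted_pockets(cluster, target))
```

Here the second target pocket matches a pocket in both cluster proteins exactly, so it attains the maximum similarity and receives the vote.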
Complement and feedback by repetition
If the target protein has no neighboring protein with a known interaction site, no cluster can be constructed because no ps satisfies constraint (3). On the other hand, if the predicted pockets of a protein are regarded as interaction sites, more proteins with known interaction sites become available in the PPI network. That is, clusters can be constructed using the newly known (namely, predicted) interaction sites, which enables repetitive prediction as shown in Figure 3. This repetitive process plays a complementary role for proteins that have no known interaction site among their neighboring proteins. Figure 3 shows that few proteins have known interaction sites in a single prediction cycle, but that repetitive prediction increases the number of proteins with known interaction sites.
Even for a target protein whose neighboring proteins contain a known interaction site, the actual interaction site of the target protein may not be similar to any interaction site of the neighboring proteins. The repetitive prediction provides feedback for this problem. As the number of proteins whose neighboring proteins contain known interaction sites increases, the clusters of neighboring proteins can be reconstructed. This feedback may select pockets as interaction sites that differ from those of the previous prediction cycle, and may yield a better result.
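The complementary role of repetition can be sketched as a loop that, in each cycle, predicts only for proteins that currently have at least one neighbor with a known or previously predicted interaction site, and then treats the new predictions as known in the next cycle. This is a simplified sketch under assumptions: `predict` is a hypothetical callback standing in for the whole cluster extraction and scoring procedure, `neighbors` is assumed to map each protein to its set of neighboring proteins, and the feedback step of re-predicting already labeled proteins is omitted.

```python
def repeat_prediction(neighbors, known, predict, max_rounds=10):
    """Propagate predictions through the network cycle by cycle.

    neighbors: dict mapping each protein to the set of its neighboring proteins
    known:     proteins whose interaction sites are known at the start
    predict:   hypothetical callable(target, usable_neighbors) -> bool
    """
    labeled = set(known)
    for _ in range(max_rounds):
        newly = set()
        for target, nbrs in neighbors.items():
            if target in labeled:
                continue
            usable = nbrs & labeled   # neighbors with known/predicted sites
            if usable and predict(target, usable):
                newly.add(target)
        if not newly:                 # nothing changed, stop repeating
            break
        labeled |= newly              # predictions count as known next cycle
    return labeled

# Hypothetical chain K - T1 - T2 - T3 in which only K is known initially.
chain = {"K": {"T1"}, "T1": {"K", "T2"}, "T2": {"T1", "T3"}, "T3": {"T2"}}
always = lambda target, usable: True
print(sorted(repeat_prediction(chain, {"K"}, always, max_rounds=1)))
print(sorted(repeat_prediction(chain, {"K"}, always)))
```

In the toy chain, a single cycle reaches only T1, whereas repeated cycles eventually reach T2 and T3, which mirrors the increase in proteins with known interaction sites described for Figure 3.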