Privacy-preserving search for chemical compound databases
 Kana Shimizu^{1},
 Koji Nuida^{2, 3},
 Hiromi Arai^{4},
 Shigeo Mitsunari^{5},
 Nuttapong Attrapadung^{3},
 Michiaki Hamada^{6},
 Koji Tsuda^{7},
 Takatsugu Hirokawa^{8},
 Jun Sakuma^{9},
 Goichiro Hanaoka^{3} and
 Kiyoshi Asai^{7}
https://doi.org/10.1186/1471-2105-16-S18-S6
© Shimizu et al.; 2015
Published: 9 December 2015
Abstract
Background
Searching for similar compounds in a database is the most important process for in silico drug screening. Since a query compound is an important starting point for a new drug, a query holder who is afraid of the query being monitored by the database server usually downloads all the records in the database and uses them in a closed network. However, a serious dilemma arises when the database holder also wants to output no information except for the search results, and this dilemma prevents the use of many important data resources.
Results
In order to overcome this dilemma, we developed a novel cryptographic protocol that enables database searching while protecting both the query holder's privacy and the database holder's privacy. Generally, the application of cryptographic techniques to practical problems is difficult because versatile techniques are computationally expensive, while computationally inexpensive techniques can perform only trivial computation tasks. In this study, our protocol is built only from an additive-homomorphic cryptosystem, which allows only addition to be performed on encrypted values but is computationally efficient compared with versatile techniques such as general purpose multi-party computation. In an experiment searching ChEMBL, which consists of more than 1,200,000 compounds, the proposed method was 36,900 times faster in CPU time and 12,000 times as efficient in communication size compared with general purpose multi-party computation.
Conclusion
We proposed a novel privacy-preserving protocol for searching chemical compound databases. The proposed method, which scales easily to large-scale databases, may help to accelerate drug discovery research by making full use of unused but valuable data that includes sensitive information.
Introduction
In recent years, the increasing cost of drug development and decreasing number of new chemical entities have become growing concerns [1]. One of the most popular approaches for overcoming these problems is searching for similar compounds in databases [2]. In order to improve the efficiency of this task, it is important to utilize as many data resources as possible. However, the following dilemma prevents the use of many existing data resources. Unpublished experimental results have been accumulated at many research sites, and such data has scientific value [3]. Since data holders are usually afraid of sensitive information leaking from the data resources, they do not want to release the full data, but they might allow authorized users to search the data as long as the users obtain only search results from which they cannot infer sensitive information. Likewise, private databases of industrial research might be made available if the sensitive information were sufficiently protected. On the other hand, query compounds are also sensitive information for the users, and thus the users usually avoid sending queries and want to download all of the data in order to conduct search tasks on their local computers. In short, we cannot utilize important data resources because both the data holder and the data user insist on their privacy. Therefore, an emerging issue is to develop novel technology that enables privacy-preserving similarity searches. We show several use cases in the next section.
Let us start by clarifying privacy problems in database searches. In a database search, two types of privacy are of concern: "user privacy" (also known as input privacy) and "database privacy" (also known as output privacy). The first means protecting the user's query from being leaked to others, including the database holder. The second means protecting the database contents from being leaked to others, including the database user, except for the search results held by the user. Here we first consider the case of using no privacy-preserving techniques; namely, the user sends a plain query to the server and the server sends the search result. In this case, the user's query is fully obtained by the server. On the database side, the server's data is not directly leaked to the user. However, there is a potential risk that the user may infer the database contents from the search results. To protect user privacy, a scheme called single-database private information retrieval (PIR) has been proposed [4]. The simplest method for achieving PIR is that the user downloads all the contents of the database and searches on his/her local computer. Since this naive approach needs a huge communication size, several cryptographic techniques have been developed, in which the query is safely encrypted/randomized on the user's computer and the database conducts the search without seeing the query. Although PIR is useful for searching public databases, it does not suit the purpose of searching private databases because of the lack of database privacy. Likewise, similarity evaluation protocols keep user privacy [5–7], but they do not sufficiently protect database privacy because the server directly outputs similarity scores that become important hints for inferring database contents.
Generally speaking, it is very difficult to keep both user privacy and database privacy, because the database side must prevent various attacks without seeing the user's query. Among them, the following two attacks are major concerns.

Regression attack

Illegal query attack
Searching with an illegal query often causes unexpected server behaviour. In such a case, the server might return unexpected results that include important server information. To prevent this, the server should ensure the correctness of the user's query.
In the field of cryptography, there have been studies of versatile techniques such as general purpose multi-party computation (GPMPC) [8] and fully homomorphic encryption (FHE) [9], which enable the design of systems that maintain both user privacy and database privacy. However, these techniques require huge computational costs as well as intensive communications between the parties (see the recent performance evaluation of FHE [10]), so they are scarcely used in practical applications. In order to avoid using such techniques, a similarity search protocol using a trusted third party [11] and a privacy-preserving SQL database using a trusted proxy server [12] have been proposed, but those methods assure privacy only when the third party does not collude with the user or the server, which is not convenient for many real problems. As far as we know, no practical method has been proposed despite the great importance of privacy-preserving similarity searching. To fill this gap, we propose a novel privacy-preserving similarity search method that can strongly protect database privacy as well as user privacy while keeping a significantly low computational cost and small communication size.
The rest of this paper is organized as follows. In the next section, we summarize our achievements in this study. This is followed by the Cryptographic background section and the Method section, where we define the problem and introduce details of the proposed protocol. In the Security analyses section, both the user privacy and database privacy of the proposed protocol are discussed in detail. In the Performance evaluation section, the central processing unit (CPU) time and communication size of the proposed protocol are evaluated for two datasets extracted from ChEMBL. Finally, we present our conclusions for this study in the Conclusion section.
Our Achievements
Here we focus on similarity search with the Tversky index of fingerprints, which is the most popular approach for chemical compound searches [13] and is used for various search problems in bioinformatics. To provide a concrete application, we address the problem of counting the number of similar compounds in a database, which solves various problems appearing in chemical compound searches. The following model describes the proposed method.
Model 1 The user is a private chemical compound holder, and the server is a private database holder. The user learns nothing but the number of similar compounds in the server's database, and the server learns nothing about the user's query compound.
Here we introduce only a small fraction of the many scientific or industrial problems solved by Model 1.
1 Secure pre-purchase inspection service for chemical compounds.
When a client considers the purchase of a commercial database such as a focused library [14], he/she wants to check whether the database includes a sufficient number of similar compounds, without sending his/her private query, but the server does not allow downloading of the database.
2 Secure patent compound filtering.
When a client finds a new compound, he/she usually wants to know whether it infringes on competitors' patents by searching the database of patent-protected compounds maintained by third parties. The same problem occurs when the client wants to check whether or not the compound is undesirable.
3 Secure negative results check.
It is a common perception that current scientific publication is strongly biased against negative results [3], although a recent study showed statistically that negative results brought meaningful benefit [15]. Since researchers are reluctant to provide negative results, which often include sensitive information, a privacy-preserving system for sharing those results would greatly contribute to reducing redundant efforts for similar research topics. For example, it would be useful to have a system that allows a user to check whether the query is similar to failed compounds that have previously been examined in other laboratories.
In this study, we propose a novel protocol called the secure similar compounds counter (SSCC) which achieves Model 1. The first main achievement of this study is that SSCC is remarkably tolerant against regression attacks compared with existing protocols which directly output the similarity score. Moreover, we propose an efficient method for protecting the database from illegal query attacks. These points are discussed in the Security analyses section.
The second main achievement is that SSCC is significantly efficient both in computational cost and communication size. We carefully designed the protocol such that it uses only an additive-homomorphic cryptosystem, which is computationally efficient, and does not rely on any time-consuming cryptographic methods such as GPMPC or FHE. Hence the performance of the protocol is sufficiently high for a large-scale database such as ChEMBL [16], as is shown in the Performance evaluation section.
Cryptographic background
Additively homomorphic encryption scheme
In this paper, we use an additive-homomorphic cryptosystem to design our protocol. The key feature of the additive-homomorphic cryptosystem is that it enables additive operations to be performed on encrypted values. Therefore, intuitively, any standard computation algorithm can be converted into a privacy-preserving computation algorithm if the operations used in the standard algorithm can be replaced by additions.
More formally, we use a public-key encryption scheme (KeyGen, Enc, Dec), which is semantically secure; that is, an encryption result (ciphertext) leaks no information about the original message (plaintext) [17]. Here, KeyGen is a key generation algorithm for selecting a pair (pk, sk) of a public key pk and a secret key sk; Enc(m) denotes a ciphertext obtained by encrypting message m under the given pk; and Dec(c) denotes the decryption result of ciphertext c under the given sk. We also require the following additive-homomorphic properties:

Given two ciphertexts Enc(m_{1}) and Enc(m_{2}) of messages m_{1} and m_{2}, Enc(m_{1} + m_{2}) can be computed without knowing m_{1}, m_{2} and the secret key (denoted by Enc(m_{1}) ⊕ Enc(m_{2})).

Given a ciphertext Enc(m) of a message m and an integer e, Enc(e · m) can be computed without knowing m and the secret key (denoted by e ⊗ Enc(m)).
For example, we can use either the Paillier cryptosystem [18] or the "lifted" version of the ElGamal cryptosystem [19] as such an encryption scheme; now the second operation ⊗ can be achieved by repeating the first operation ⊕. We notice that the range of plaintexts for those cryptosystems can be naturally set as an integer interval [−N_{1}, N_{2}] for some sufficiently large N_{1}, N_{2} >0; therefore, the plaintexts are divided into positive ones, negative ones, and zero.
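The two homomorphic operations can be illustrated with a minimal Python sketch of a toy Paillier cryptosystem. This is only an illustration under stated assumptions: the function names (`keygen`, `enc`, `dec`, `add`, `scalar_mul`) are hypothetical, the generator is fixed to n + 1, and the primes are far too small for any real use. Negative plaintexts are represented modulo n and mapped back to a signed interval on decryption.

```python
import math
import random

# Toy primes (the Mersenne primes 2^31-1 and 2^61-1); insecure sizes, illustration only.
P, Q = 2**31 - 1, 2**61 - 1

def keygen():
    """Return (pk, sk) for a toy Paillier instance with generator g = n + 1."""
    n = P * Q
    lam = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P-1, Q-1)
    mu = pow(lam, -1, n)                                # inverse of lam modulo n
    return n, (lam, mu)

def enc(pk, m):
    """Encrypt integer m; negative values are stored modulo n."""
    n, n2 = pk, pk * pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (1 + (m % n) * n) * pow(r, n, n2) % n2

def dec(pk, sk, c):
    """Decrypt c and map the result to the signed interval around zero."""
    n = pk
    lam, mu = sk
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m

def add(pk, c1, c2):
    """Enc(m1) (+) Enc(m2): ciphertext multiplication adds the plaintexts."""
    return c1 * c2 % (pk * pk)

def scalar_mul(pk, e, c):
    """e (x) Enc(m): ciphertext exponentiation multiplies the plaintext by e."""
    return pow(c, e % pk, pk * pk)

pk, sk = keygen()
print(dec(pk, sk, add(pk, enc(pk, 5), enc(pk, 7))))   # sum of 5 and 7
print(dec(pk, sk, scalar_mul(pk, -3, enc(pk, 4))))    # -3 times 4
```

Here ciphertext multiplication plays the role of ⊕ and ciphertext exponentiation the role of ⊗; the lifted ElGamal cryptosystem used later in the paper offers the same two operations.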
Non-interactive zero-knowledge proof
Below, we discuss the following situation: A user (a prover) wants to convince a server (a verifier) that a ciphertext c generated by the user corresponds to a message m in {0, 1}, but does not want to reveal any information about which of 0 and 1 is m. This can be achieved by using a cryptographic tool called a non-interactive zero-knowledge (NIZK) proof. In the present case, it enables the user to generate a "proof" associated with c, so that:

If m is indeed in {0, 1}, then the server can verify this fact by testing the proof (without knowing m itself).

If m ∉ {0, 1}, then the user cannot generate a proof that passes the server's test.

The server cannot obtain any information about m from the proof, except for the fact that m ∈ {0, 1}.
(See [20] for a general formulation.) Besides the existing general-purpose NIZK proofs, Sakai et al. [21] proposed an efficient scheme specific to the "lifted" ElGamal cryptosystem, which we use below. (See Section 1 of Additional File 1 for a brief description of the NIZK proofs [21].)
Method
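A fingerprint is a bit vector, which we identify with the set of positions of its 1 bits. For two fingerprints p and q and parameters α, β ≥ 0, the Tversky index is defined as
$\mathsf{\text{TI}}_{\alpha,\beta}(p, q) = \frac{|p \cap q|}{|p \cap q| + \alpha |p \setminus q| + \beta |q \setminus p|}$.
The Tversky index is useful since it includes several important similarity measurements such as the Jaccard Index (JI, which is exactly TI_{1,1} and also known as the Tanimoto Index) and the Dice index (which is exactly TI_{1/2,1/2}) [22]. First, we introduce the basic idea and two efficient techniques for improving database privacy. Then, we describe our full proposed protocol.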
Basic idea
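Since |p \ q| = |p| − |p ∩ q| and |q \ p| = |q| − |p ∩ q|, the condition TI_{α,β}(p, q) ≥ θ for a threshold θ is equivalent to
$\lambda_{1} |p \cap q| - \lambda_{2} |p| - \lambda_{3} |q| \ge 0$,
where λ_{1} = c(θ^{−1} − 1 + α + β), λ_{2} = cα, λ_{3} = cβ and c is any positive value. We assume that the parameters and the threshold for the Tversky index are rational numbers denoted by α = μ_{ a }/γ, β = μ_{ b }/γ and θ = θ_{ n }/θ_{ d }, where μ_{ a }, μ_{ b }, γ, θ_{ n } and θ_{ d } are non-negative integers. By using c = γθ_{ n }g^{−1} under this assumption, λ_{1}, λ_{2} and λ_{3} become non-negative integers, where g is the greatest common divisor of γ(θ_{ d } − θ_{ n }) + θ_{ n }(μ_{ a } + μ_{ b }), θ_{ n }μ_{ a } and θ_{ n }μ_{ b }.
Motivated by this observation, we define the following modified score, called the threshold Tversky index:
$\overline{\mathsf{\text{TI}}}_{\alpha,\beta,\theta}(p, q) = \lambda_{1} |p \cap q| - \lambda_{2} |p| - \lambda_{3} |q|$,
with λ_{1} = g^{−1}(γ(θ_{ d } − θ_{ n }) + θ_{ n }(μ_{ a } + μ_{ b })), λ_{2} = g^{−1}θ_{ n }μ_{ a } and λ_{3} = g^{−1}θ_{ n }μ_{ b }, where g is the greatest common divisor of γ(θ_{ d } − θ_{ n }) + θ_{ n }(μ_{ a } + μ_{ b }), θ_{ n }μ_{ a } and θ_{ n }μ_{ b }.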
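As a plaintext illustration of the threshold Tversky index, the following Python sketch derives the integer coefficients λ_{1}, λ_{2}, λ_{3} from rational α, β and θ and checks the sign of the score. The function names are hypothetical, fingerprints are assumed to be lists of 0/1 integers, and no encryption is involved.

```python
from fractions import Fraction
from math import gcd

def tversky_lambdas(alpha, beta, theta):
    """Integer coefficients (lambda1, lambda2, lambda3) for rational alpha, beta, theta."""
    alpha, beta, theta = Fraction(alpha), Fraction(beta), Fraction(theta)
    da, db = alpha.denominator, beta.denominator
    gamma = da * db // gcd(da, db)                 # common denominator of alpha and beta
    mu_a, mu_b = int(alpha * gamma), int(beta * gamma)
    th_n, th_d = theta.numerator, theta.denominator
    l1 = gamma * (th_d - th_n) + th_n * (mu_a + mu_b)
    l2, l3 = th_n * mu_a, th_n * mu_b
    g = gcd(gcd(l1, l2), l3)
    return l1 // g, l2 // g, l3 // g

def threshold_ti(p, q, lambdas):
    """Threshold Tversky index: lambda1*|p & q| - lambda2*|p| - lambda3*|q|."""
    l1, l2, l3 = lambdas
    inter = sum(a & b for a, b in zip(p, q))
    return l1 * inter - l2 * sum(p) - l3 * sum(q)

def tversky(p, q, alpha, beta):
    """Exact Tversky index; raises ZeroDivisionError if the denominator is zero."""
    inter = sum(a & b for a, b in zip(p, q))
    denom = inter + Fraction(alpha) * (sum(p) - inter) + Fraction(beta) * (sum(q) - inter)
    return Fraction(inter) / denom

lambdas = tversky_lambdas(1, 1, "4/5")             # Jaccard index with threshold 0.8
p = [1, 1, 1, 1, 0]
q = [1, 1, 1, 1, 1]
print(lambdas, tversky(p, q, 1, 1), threshold_ti(p, q, lambdas))
```

For the Jaccard index with θ = 0.8 the coefficients are (9, 4, 4), and the example pair sits exactly on the threshold: its Jaccard index is 4/5 and its threshold Tversky index is 0.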
By the above argument, we have TI_{ α,β }(p, q) ≥ θ if and only if $\overline{\mathsf{\text{TI}}}_{\alpha,\beta,\theta}(p, q) \ge 0$. Therefore, the user can know whether or not his/her target compound q is similar (i.e., TI_{α,β}(p, q) ≥ θ) to the fingerprint p in the database by obtaining only the value $\overline{\mathsf{\text{TI}}}_{\alpha,\beta,\theta}(p, q)$.
In the protocol, the bits of the user's target fingerprint q and the value |p| held by the server are both encrypted using the user's public key. Since $\overline{\mathsf{\text{TI}}}_{\alpha,\beta,\theta}(p, q)$ can be computed by the addition of these values and multiplication by integers, the protocol can calculate (without the secret key) a ciphertext of $\overline{\mathsf{\text{TI}}}_{\alpha,\beta,\theta}(p, q)$, which is then decrypted by the user. For simplicity, we will abuse the notation and write TI(p, q) and $\overline{\mathsf{\text{TI}}}(p, q)$ without subscripts α, β, θ when the context is clear.
We emphasize that our protocol does not use time-consuming cryptographic methods such as GPMPC and FHE, and data transfer occurs only twice during an execution of the protocol. Hence, our protocol is efficient enough to scale to large databases.
Database security enhancement techniques against regression attack
As discussed in the Introduction section, the server needs to minimize returned information in order to minimize the success ratio of the regression attack. That is, the ideal situation for the server is that the user learns only the similarity/non-similarity property of fingerprints p and q, without knowing any other information about the secret fingerprint p. This means that only the sign of $\overline{\mathsf{\text{TI}}}(p, q)$ should be known by the user. However, in our basic protocol, the value of $\overline{\mathsf{\text{TI}}}(p, q)$ is fully obtained by the user; database privacy is not protected from regression attacks. (See the Security analyses section for details.) In order to send only the sign of $\overline{\mathsf{\text{TI}}}(p, q)$, we first considered using a bitwise decomposition protocol [23] for extracting and sending only the sign bit of $\overline{\mathsf{\text{TI}}}(p, q)$. Although this approach is ideal in terms of security, the protocol requires more than 30 rounds of communications, which is much more efficient than using GPMPC or FHE, but rather time-consuming for large-scale databases. Therefore, here we propose the novel technique of using dummy replies, which requires only one round of communication while sufficiently minimizing information leakage of p. In the proposed technique, besides its original reply $t = \mathsf{\text{Enc}}(\overline{\mathsf{\text{TI}}}(p, q))$, the server also chooses random integers φ_{1}, ..., φ_{ n } from a suitable interval and encrypts those values under the user's public key pk.
Then the server sends the user a collection of ciphertexts t, Enc(φ_{1}), ..., Enc(φ_{ n }) that are shuffled to conceal the true ciphertext t, as well as the number s_{d} of dummy values φ_{ k } with φ_{ k } ≥ 0. The user decrypts the received n + 1 ciphertexts, counts the number s_{c} of non-negative values among the decryption results, and compares s_{c} to s_{d}. Now we have $\overline{\mathsf{\text{TI}}}(p, q) \ge 0$ if and only if s_{c} − s_{d} = 1; therefore, the user can still learn the sign of $\overline{\mathsf{\text{TI}}}(p, q)$, while the actual value of $\overline{\mathsf{\text{TI}}}(p, q)$ is concealed by the dummies. We have confirmed that the information leakage of p approaches zero as the number of dummies becomes large; see the Security analyses for padding dummies section for more detailed discussion. (We have also developed another security enhancement technique using sign-preserving randomization of $\overline{\mathsf{\text{TI}}}(p, q)$; see Section 2 of Additional File 1 for details.)
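The counting logic of the dummy technique can be sketched in plain Python with encryption left out, so the integers below stand for the values the user would see after decryption. The function names, the uniform sampling of dummies and the interval bounds `lo` and `hi` are assumptions made for illustration.

```python
import random

def server_reply(true_score, n_dummies, lo, hi, rng):
    """Mix the true score with random dummies and report the non-negative dummy count."""
    dummies = [rng.randint(lo, hi) for _ in range(n_dummies)]
    s_d = sum(1 for d in dummies if d >= 0)      # number of non-negative dummies
    reply = dummies + [true_score]
    rng.shuffle(reply)                           # hide the position of the true value
    return reply, s_d

def user_decide(reply, s_d):
    """Recover only the sign of the true score: True iff the score is non-negative."""
    s_c = sum(1 for v in reply if v >= 0)        # non-negative values among all replies
    return s_c - s_d == 1

rng = random.Random(0)
reply, s_d = server_reply(true_score=-32, n_dummies=20, lo=-664, hi=166, rng=rng)
print(user_decide(reply, s_d))                   # the true score is negative
```

The user sees n + 1 candidate values but can attribute none of them to the true score with certainty; only the difference s_{c} − s_{d} is determined.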
Database security enhancement technique against illegal query attack
Illegal query attacks can be prevented if the server can detect whether or not the user's query is valid. To keep user privacy, the server must conduct this task without obtaining more information than the validity/invalidity of the query. In fact, this functionality can be implemented by using the NIZK proof by Sakai et al. [21] mentioned in the Non-interactive zero-knowledge proof section. The improved protocol requires the user to send the server a proof associated with the encrypted fingerprint bits q_{i}, from which the server can check whether q is indeed a valid fingerprint (without obtaining any other information about q); the server aborts the protocol if q is invalid. Here we use the "lifted" ElGamal cryptosystem as our basic encryption scheme to apply Sakai's scheme. (We note that if we require the user to send Enc(−|q|) used by the server's computation, then another NIZK proof is necessary to guarantee the validity of the additional ciphertext, which decreases the communication efficiency of our protocol. Hence our protocol requires the server to calculate Enc(−|q|) by itself.) The formal definition of the valid query is given in the Database privacy in malicious model section.
Secure similar compounds counter
For the general case in which the database consists of more than one fingerprint p, we propose the protocol shown in Algorithm 1 to count the number of fingerprints p similar to the target fingerprint q. In the protocol, the server simply calculates the encryption of the threshold Tversky indices for all database entries and, as discussed above, replies with a shuffled collection of these true ciphertexts and dummy ciphertexts, as well as the number s_{d} of non-negative dummy values. Then the value s_{c} − s_{d} finally obtained by the user is equal to the number of similar fingerprints p in the database.
Algorithm 1 The secure similar compounds counter (SSCC)

Public input: Length of fingerprints ℓ and parameters for the Tversky index θ = θ_{ n }/θ_{ d }, α = μ_{ a }/γ, β = μ_{ b }/γ

Private input of a user: Target fingerprint q

Private input of a server: Set of fingerprints P = {p^{(1)}, ..., p^{(M)}}
1 (Key setup of cryptosystem) The user generates a key pair (pk, sk) by the key generation algorithm KeyGen for the additive-homomorphic cryptosystem and sends public key pk to the server (the user and the server share public key pk and only the user knows secret key sk).
2 (Initialization) The user encrypts his/her fingerprint q as a vector of ciphertexts: $\overrightarrow{\mathsf{\text{Enc}}}(q) := (\mathsf{\text{Enc}}(q_{1}), \dots, \mathsf{\text{Enc}}(q_{\ell}))$. He/she also generates v as a vector of proofs. Each proof v_{ i } is associated with Enc(q_{ i }).
3 (Query of entry) The user sends the vector of ciphertexts $\overrightarrow{\mathsf{\text{Enc}}}(q)$ and the vector of proofs v to the server as a query.
4 (Query validity verification) The server verifies the validity of $\overrightarrow{\mathsf{\text{Enc}}}(q)$ by testing the vector of proofs v. If v does not pass the server's test, the user cannot move on to the next step.
5 (Calculation of threshold Tversky index)
(a) The server calculates the greatest common divisor of γ(θ_{ d } − θ_{ n }) + θ_{ n }(μ_{ a } + μ_{ b }), θ_{ n }μ_{ a } and θ_{ n }μ_{ b } as g, and calculates λ_{1} = γθ_{ n }g^{−1} (θ^{−1} − 1 + α + β), λ_{2} = γθ_{ n }g^{−1}α, and λ_{3} = γθ_{ n }g^{−1}β.
(b) The server calculates $\mathsf{\text{Enc}}(-|q|) = \mathsf{\text{Enc}}(-\sum_{i=1}^{\ell} q_{i})$ from $\overrightarrow{\mathsf{\text{Enc}}}(q)$: $\mathsf{\text{Enc}}(-|q|) = -1 \otimes \bigoplus_{i=1}^{\ell} \mathsf{\text{Enc}}(q_{i})$.
(c) for j = 1 to M do
 i. The server calculates $|p^{(j)}| = \sum_{i=1}^{\ell} p_{i}^{(j)}$ and encrypts its negation to obtain a ciphertext $\mathsf{\text{Enc}}(-|p^{(j)}|)$.
 ii. The server calculates a ciphertext t_{ j } of threshold Tversky index $\overline{\mathsf{\text{TI}}}(p^{(j)}, q)$.
  c ← Enc(0)
  for k = 1 to ℓ do
   if $p_{k}^{(j)} = 1$
    c ← c ⊕ Enc(q_{ k }) ▷ Computing $\mathsf{\text{Enc}}(|p^{(j)} \cap q|)$
   end if
  end for
  t_{ j } ← λ_{1} ⊗ c ⊕ λ_{2} ⊗ Enc(−|p^{(j)}|) ⊕ λ_{3} ⊗ Enc(−|q|)
end for
6 (Padding of dummies)
(a) The server generates a set of dummy values {φ_{1}, ..., φ_{ n }} and counts the number s_{d} of non-negative dummies φ_{ i } ≥ 0.
(b) The server encrypts φ_{ i } to obtain a ciphertext Enc(φ_{ i }) for i = 1, ..., n.
(c) The server shuffles the contents of the set T = {t_{1}, ..., t_{ M }, Enc(φ_{1}), ..., Enc(φ_{ n })}.
7 (Return of matching results) The server sends T and s_{d} to the user.
8 (Decryption and counting) The user decrypts the contents of T and counts the number s_{c} of non-negative values.
9 (Evaluation) The user obtains s_{c} − s_{d} as the number of similar fingerprints in the database.
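The flow of Algorithm 1 can be sketched end to end in Python. This is a hedged illustration, not the authors' implementation: it substitutes a toy Paillier cryptosystem with insecure key sizes for the lifted ElGamal cryptosystem, omits the NIZK proofs of step 4, runs both parties in one process with a shared random generator, and draws dummies uniformly from an assumed interval. All function names are hypothetical.

```python
import math
import random

# Toy Paillier parameters (Mersenne primes, insecure sizes); the secret key is (LAM, MU).
P, Q = 2**31 - 1, 2**61 - 1
N, N2 = P * Q, (P * Q) ** 2
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)

def enc(m, rng):
    r = rng.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = rng.randrange(1, N)
    return (1 + (m % N) * N) * pow(r, N, N2) % N2

def dec(c):
    m = (pow(c, LAM, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m            # map back to signed integers

def add(c1, c2):                                 # homomorphic addition
    return c1 * c2 % N2

def smul(e, c):                                  # homomorphic multiplication by integer e
    return pow(c, e % N, N2)

def server_reply(enc_q, db, lambdas, n_dummies, rng):
    l1, l2, l3 = lambdas
    ell = len(enc_q)
    enc_sum_q = enc_q[0]
    for c in enc_q[1:]:
        enc_sum_q = add(enc_sum_q, c)
    enc_neg_q = smul(-1, enc_sum_q)              # step 5(b): Enc(-|q|)
    replies = []
    for p in db:                                 # step 5(c)
        c = enc(0, rng)
        for k in range(ell):
            if p[k] == 1:
                c = add(c, enc_q[k])             # accumulates Enc(|p & q|)
        enc_neg_p = enc(-sum(p), rng)            # Enc(-|p|)
        replies.append(add(add(smul(l1, c), smul(l2, enc_neg_p)), smul(l3, enc_neg_q)))
    # Step 6: dummies drawn uniformly from the range of possible scores (an assumption).
    lo, hi = -max(l2, l3) * ell, (l1 - l2 - l3) * ell
    dummies = [rng.randint(lo, hi) for _ in range(n_dummies)]
    s_d = sum(1 for d in dummies if d >= 0)
    replies += [enc(d, rng) for d in dummies]
    rng.shuffle(replies)
    return replies, s_d                          # step 7

def sscc_count(q, db, lambdas, n_dummies=10, seed=0):
    rng = random.Random(seed)
    enc_q = [enc(bit, rng) for bit in q]         # steps 2-3, proofs omitted
    replies, s_d = server_reply(enc_q, db, lambdas, n_dummies, rng)
    s_c = sum(1 for c in replies if dec(c) >= 0) # step 8
    return s_c - s_d                             # step 9

q = [1, 1, 1, 1, 0, 0, 0, 0]
db = [[1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1]]
print(sscc_count(q, db, (9, 4, 4)))              # Jaccard index, threshold 0.8
```

In this example the first two database entries have Jaccard index 1 and 4/5 with the query and the third has 0, so two entries meet the threshold 0.8.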
Parameter settings of the protocol
Decrypting an encryption of too large a value incurs a huge computational cost if the lifted-ElGamal cryptosystem is used. Therefore, in order to keep the consistency and efficiency of the protocol, the range of $\overline{\mathsf{\text{TI}}}(p, q)$ should not be too large; i.e., the integer parameters λ_{1}, λ_{2} and λ_{3} in the threshold Tversky index should not be too large. In fact, this will not cause a problem in practice. For example, the parameters become λ_{1} = 9, λ_{2} = λ_{3} = 4 for computing $\overline{\mathsf{\text{TI}}}_{1, 1, 0.8}$, which is a typical setting of a chemical compound search. In this case, the minimum and maximum values of $\overline{\mathsf{\text{TI}}}(p, q)$ are −664 and 166 for 166 MACCS keys, which is a sufficiently small range. (See Section 3 of Additional File 1 for details.)
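The stated range can be checked with a short Python sketch that enumerates the feasible set sizes. It assumes the score has the form λ_{1}|p ∩ q| − λ_{2}|p| − λ_{3}|q| for fingerprints of length ℓ; the function name is hypothetical. Because the score is linear in |p ∩ q|, only the smallest and largest feasible intersection sizes need to be tried for each pair (|p|, |q|).

```python
def threshold_ti_range(ell, l1, l2, l3):
    """Minimum and maximum of l1*i - l2*a - l3*b over feasible (i, a, b).

    a = |p|, b = |q|, i = |p & q| with max(0, a + b - ell) <= i <= min(a, b).
    """
    lo = hi = 0                                   # p and q both empty give score 0
    for a in range(ell + 1):
        for b in range(ell + 1):
            for i in (max(0, a + b - ell), min(a, b)):
                score = l1 * i - l2 * a - l3 * b
                lo, hi = min(lo, score), max(hi, score)
    return lo, hi

print(threshold_ti_range(166, 9, 4, 4))           # 166 MACCS keys, Jaccard, threshold 0.8
```

The minimum is attained when p and q are disjoint and together cover all 166 positions, and the maximum when both fingerprints are all ones.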
Security analyses
In this section, we evaluate the security of SSCC by several approaches.
In the area of cryptology, the following two standard security models for two-party computation have been considered:

Semi-honest model: Both parties follow the protocol, but an adversarial one attempts to infer additional information about the other party's secret input from the legally obtained information.

Malicious model: An adversarial party cheats even in the protocol (e.g., by inputting maliciously chosen invalid values) in order to illegally obtain additional information about the secret.
We analyze user privacy and database privacy in both the semi-honest and malicious models. For database privacy, we first compare attack success ratios between our method, which aims to output a binary sign, and the previous methods, which aim to output a similarity score, and show that outputting a binary sign improves database privacy. We also evaluate the security strength of our method against a regression attack by comparing attack success ratios between the case of using dummies and the ideal case that uses a versatile technique (such as GPMPC and FHE) to output a binary sign, and show that the security strength for the case of using dummies is almost the same as the ideal case under realistic settings.
User privacy
The semantic security of the encryption scheme used in the protocol (see the Additively homomorphic encryption scheme section) implies immediately that the server cannot infer any information about the user's target fingerprint q during the protocol. This holds in both the semi-honest and malicious models.
Thresholding largely improves database privacy
We mentioned in the Introduction section that minimizing information returned from the server reduces the success ratio of the regression attack. Therefore, SSCC aims for the "ideal" case in which the user learns only the sign of $\overline{\mathsf{\text{TI}}}(p, q)$ during the protocol. The previous methods that compute the Jaccard Index aim for the "plain" case, in which the user fully learns the value TI(p, q). Here we evaluate the efficiency of the thresholding by comparing success probabilities of the regression attack for those two cases. We consider the general case in which the user is allowed to send more than one query and those queries are searched by the Jaccard Index. We also suppose that the database consists of a single fingerprint p in order to clarify the effect of thresholding.
Security analyses for padding dummies
We showed that the output privacy in the "ideal" case is significantly improved from the "plain" case. Here we experimentally evaluate how close the actual situation of our proposed protocol is to the "ideal" case.
Before going into detailed analyses, let us discuss how to generate dummies. It is ideal for server privacy to generate a dummy according to the same distribution from which $\overline{\mathsf{\text{TI}}}(p, q)$ is generated. However, this is not realistic because $\overline{\mathsf{\text{TI}}}(p, q)$ is determined by both p and q, the latter of which is the user's private information. Therefore, in our analyses, we assume that a dummy is generated from the uniform distribution over the possible values of $\overline{\mathsf{\text{TI}}}(p, q)$. For example, if the possible values of $\overline{\mathsf{\text{TI}}}(p, q)$ are {1, 2, 3, 4, 5}, each dummy is randomly selected from them. The purpose of padding dummies is to mitigate the risk of leaking $\overline{\mathsf{\text{TI}}}(p, q)$. In order to clarify the effect of the use of dummy values, we concentrate on the basic case: the database contains a single p, and there exist k possible values of $\overline{\mathsf{\text{TI}}}(p, q)$. The i-th value of the k possible values arises as the true $\overline{\mathsf{\text{TI}}}(p, q)$ with probability w_{ i }. Namely, the true $\overline{\mathsf{\text{TI}}}(p, q)$ is generated from the multinomial distribution with k different probabilities w = w_{1}, ..., w_{k}, while dummies are generated from the multinomial distribution with equal probability 1/k. To conduct stringent analyses, we assume that the user knows w, and that he/she also knows that dummies are uniformly distributed over the k possible values of $\overline{\mathsf{\text{TI}}}(p, q)$.
The security provided by our protocol can be formalized in the following manner. First we recall that, in our protocol, the server computes encryption of $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$ and encryption of dummy values φ_{1}, ..., φ_{ n }, and then sends the user the n+1 encrypted values as well as the number of positive dummy values in φ_{1}, ..., φ_{ n }. For the purpose of formalizing the security, we introduce a "fictional" server that performs the following: It first receives the encrypted values $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$, φ_{1}, ..., φ_{ n } from the real server. Secondly, it gets the sign of $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$. (We note that a real server cannot do it since it requires unrealistic computational power that breaks the security of the encryption scheme, so this is just fictional for the sake of mathematical definition.) Thirdly, it generates another dummy value $\overline{\mathsf{\text{TI}}}\prime $ randomly, and independently of the values of $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$, φ_{1}, ..., φ_{ n } (except for the sign of $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$), in the following manner:

If $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$ is positive, then $\overline{\mathsf{\text{TI}}}\prime $ is chosen randomly from positive values.

If $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$ is negative, then $\overline{\mathsf{\text{TI}}}\prime $ is chosen randomly from negative values.
Finally, the fictional server sends the user an encryption of $\overline{\mathsf{\text{TI}}}\prime $ (instead of $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$) as well as the encrypted φ_{1}, ..., φ_{ n } and the number of positive values in φ_{1}, ..., φ_{ n }. We note that, when the user receives a reply from the fictional server, the user learns the sign of $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$, which is the same as that of $\overline{\mathsf{\text{TI}}}\prime $, but learns no other information on $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$ since $\overline{\mathsf{\text{TI}}}\prime $ is otherwise independent of $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$. In this setting, the following property can be proven:
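The behavior of the fictional server can be illustrated on plaintext values as follows. This is a sketch only: the function name `fictional_reply` is hypothetical, encryption is omitted, and grouping zero with the positive values is an assumption of this example rather than something stated in the text.

```python
import random

def fictional_reply(true_value, dummies, possible_values, rng):
    """Replace the true value by a random value that has the same sign.

    Assumption of this sketch: zero is grouped with the positive values.
    Returns the substituted value followed by the unchanged dummies, together
    with the number of positive dummies.
    """
    same_sign = [v for v in possible_values if (v >= 0) == (true_value >= 0)]
    substitute = rng.choice(same_sign)
    positives = sum(1 for d in dummies if d >= 0)
    return [substitute] + list(dummies), positives

rng = random.Random(1)
possible_values = list(range(-5, 6))
reply_neg, positives = fictional_reply(-3, [2, -1, 4], possible_values, rng)
reply_pos, _ = fictional_reply(3, [2, -1, 4], possible_values, rng)
```

Only the sign of the first argument influences the output, which mirrors the statement that the fictional reply reveals nothing about the true value beyond its sign.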
Theorem 1 Suppose that the user cannot distinguish, within computational time TIME, the sets of decrypted values of ciphertexts involved in outputs of the real server and of the fictional server. Then any information computable within computational time TIME from the decryption results for output of the real server is equivalent to information computable within computational time TIME′ from the sign of $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$ only, where TIME′ is a value which is close to TIME.
Proof Let $\mathcal{A}$ be an algorithm, with running time TIME, which outputs some information on the decrypted values for an output of the real server. We construct an algorithm $\mathcal{A}\prime $ which computes, from the sign of $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$ only, information equivalent to the information computed by $\mathcal{A}$. The construction is as follows: from the sign of $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$, $\mathcal{A}\prime $ generates dummy values by mimicking the behavior of the fictional server, and then $\mathcal{A}\prime $ inputs these dummy values to a copy of $\mathcal{A}$, say $\mathcal{A}*$, and returns the output of $\mathcal{A}*$. Now if the output of $\mathcal{A}\prime $ is not equivalent to the output of $\mathcal{A}$, then the definition of $\mathcal{A}\prime $ implies that the probability distributions of the outputs of $\mathcal{A}$ with inputs given by the decrypted values for outputs of the real server and of the fictional server are significantly different (since $\mathcal{A}*$ used in $\mathcal{A}\prime $ is a copy of $\mathcal{A}$). This would enable the user to distinguish the two possibilities of his/her received values by observing the output of $\mathcal{A}$, which contradicts the assumption of the theorem. Therefore, the output of $\mathcal{A}\prime $ is equivalent to the output of $\mathcal{A}$ as claimed. Moreover, the computational overhead of $\mathcal{A}\prime $ compared to $\mathcal{A}$ is just the process of generating dummy values by mimicking the behavior of the fictional server; it is not large (i.e., TIME′ is close to TIME as claimed) since the server-side computation of our proposed protocol is efficient. Hence, the theorem holds.
Roughly speaking, if the assumption of the theorem holds for a larger TIME, then the actual situation of our proposed protocol becomes closer to the "ideal" case, provided we focus on information available from efficient computation. As a first step toward evaluating how plausible the assumption is (i.e., how large the value TIME in the assumption can be), we performed computer experiments to show that some natural attempts to distinguish the actual and the fictional cases do not succeed, as explained below.
In this experiment, we evaluate the security of our protocol by comparing the probabilities that the user correctly guesses the value $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$ in two cases: the case in which the user makes a guess based only on the prior knowledge w, and the case in which the user makes a guess based on the observation of the search result under the condition that he/she knows w.
In the former case, the user's best guess is the most probable value, i.e., the value whose index i_{0} maximizes w_{ i }, and the success probability of the guess is ${w}_{{i}_{0}}$.
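The two guessing strategies can be written out explicitly. The following is a sketch under the assumptions of this section (the user sees the n + 1 decrypted values, and the n dummies are uniform over the k possible values). The symbols c_{ i } and δ_{ ij } are introduced here for illustration, and the observation-based rule shown is one natural choice, not necessarily the one used in the experiments. Let c_{ i } be the number of times the i-th value appears among the n + 1 decrypted values, and let δ_{ ij } be 1 if i = j and 0 otherwise.

$$
i_0 = \arg\max_{i} w_i, \qquad \Pr[\text{the guess } i_0 \text{ is correct}] = w_{i_0}
$$

$$
\Pr[\text{true value is } i \mid c_1, \ldots, c_k] \;\propto\; w_i \cdot \frac{n!}{\prod_{j=1}^{k} (c_j - \delta_{ij})!} \, k^{-n} \;\propto\; w_i \, c_i
$$

The last step uses $(c_i - 1)! = c_i! / c_i$, so all factors that do not depend on i cancel. Under these assumptions, a user who observes the reply would guess the index that maximizes $w_i c_i$. As n grows, $c_i / n$ approaches $1/k$ for every i, so this guess approaches the prior-only guess $i_0$.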
We estimated the success probabilities of the user's guess for both cases by simulation experiments. Here we assumed a typical case in which TI_{1,1,0.8} and 166 MACCS keys are used. In this case, k = 831, and we performed the experiments for n = 831 × 10^{0}, 831 × 10^{1}, ..., 831 × 10^{4} on three different distributions of $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$, which were obtained by the following schemes:
1 We randomly selected one fingerprint q from ChEMBL, calculated $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$ for all the entries in ChEMBL, and used the observed distribution as w. In our experiment, the 177159th fingerprint was selected as q (referred to as w^{ChEMBL177159}).
2 The same scheme as 1) was used with the 265935th fingerprint as q (referred to as w^{ChEMBL265935}).
3 We randomly selected a value from 1, ..., k m times, counted the frequency of i as h_{ i }, and set w_{ i } = h_{ i }/m (referred to as w^{random}). We used k × 5 as m.
All the distributions used here are shown in Section 6 of Additional File 1.
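A small-scale version of such a simulation can be sketched as follows. The names `random_prior` and `simulate` are hypothetical, the parameters are far smaller than k = 831, and the observation-based guess (the index maximizing w_i times the observed count of value i) is an assumption of this sketch: it is the most probable value when the dummies are uniform, but the text does not state which rule was used in the experiments.

```python
import random

def random_prior(k, m, rng):
    """Scheme 3: draw m values uniformly from k values and use the frequencies as w."""
    counts = [0] * k
    for _ in range(m):
        counts[rng.randrange(k)] += 1
    return [h / m for h in counts]

def simulate(w, n, trials, rng):
    """Estimate the success ratios of a prior-only guess and an observation-based guess."""
    k = len(w)
    indices = range(k)
    prior_guess = max(indices, key=lambda i: w[i])
    hits_prior = hits_obs = 0
    for _ in range(trials):
        true = rng.choices(indices, weights=w)[0]
        counts = [0] * k
        counts[true] += 1                      # the true value
        for _ in range(n):                     # n uniform dummies
            counts[rng.randrange(k)] += 1
        # Assumed attack: pick the index with the largest w_i * count_i.
        obs_guess = max(indices, key=lambda i: w[i] * counts[i])
        hits_prior += (prior_guess == true)
        hits_obs += (obs_guess == true)
    return hits_prior / trials, hits_obs / trials

rng = random.Random(0)
w_random = random_prior(k=20, m=100, rng=rng)
prior_only, with_observation = simulate(w_random, n=200, trials=1000, rng=rng)
```

With no dummies at all (n = 0) the observation-based guess is always correct, while for large n its success ratio is expected to fall toward that of the prior-only guess.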
The experimental success ratios of the user's guess based on the server's return and the prior distribution of the true value, for each number of dummies n.

| | n = 831 | n = 831 × 10^{1} | n = 831 × 10^{2} | n = 831 × 10^{3} | n = 831 × 10^{4} | Ideal value |
|---|---|---|---|---|---|---|
| w^{ChEMBL−177159} | 0.03552 | 0.01738 | 0.01101 | 0.01009 | 0.00977 | 0.00981 |
| w^{ChEMBL−265935} | 0.02991 | 0.01337 | 0.00903 | 0.00798 | 0.00784 | 0.00807 |
| w^{rand} | 0.00914 | 0.0041 | 0.00309 | 0.00279 | 0.00305 | 0.00289 |
Security analyses for padding dummies for the case when the user is allowed to send more than one query
One might suspect that the attacker can detect the true $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$ by sending the same query twice and finding the value that appears in both results. However, this attack does not easily succeed if n is sufficiently larger than k (i.e., ideally, all possible values of $\overline{\mathsf{\text{TI}}}$ are covered by a sufficient number of dummies), and we consider that k is not too large in practice, as discussed in the "Parameters settings of the protocol" section.
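The effect of n relative to k on this repeated-query attack can be illustrated with a small simulation. It is a sketch on plaintext values with hypothetical names (`reply_values`, `surviving_candidates`) and toy parameters; values are represented simply by their indices 0, ..., k − 1.

```python
import random

def reply_values(true_value, k, n, rng):
    """Distinct values seen in one reply: the true value plus n uniform dummies."""
    return {true_value} | {rng.randrange(k) for _ in range(n)}

def surviving_candidates(true_value, k, n, rng):
    """Values that appear in both replies when the same query is sent twice."""
    first = reply_values(true_value, k, n, rng)
    second = reply_values(true_value, k, n, rng)
    return first & second

rng = random.Random(0)
k = 50
few_dummies = surviving_candidates(7, k, n=5, rng=rng)     # n much smaller than k
many_dummies = surviving_candidates(7, k, n=500, rng=rng)  # n much larger than k
```

When n is much smaller than k, the intersection contains only a handful of candidates, one of which is the true value. When n is much larger than k, almost every possible value appears in both replies, so the intersection barely narrows down the candidates.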
Database privacy in malicious model
For our protocol, the difference between the malicious and semi-honest models is that in the malicious model the user may use an invalid input q whose components q_{i} are not necessarily in {0, 1}. If the user chooses q in such a way that some component q_{i} is extremely large and the remaining ℓ − 1 components are all zero, then $\overline{\mathsf{\text{TI}}}\left(p,\phantom{\rule{2.36043pt}{0ex}}q\right)$ will also be an extreme value (distinguishable from the dummy values) and will depend dominantly on the bit p_{ i }; therefore, the user can almost surely guess the secret bit p_{i}. Since our protocol detects whether or not each q_{i} is a bit value without invading user privacy, it can safely reject illegal queries and prevent illegal query attacks, including the above case.
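The attack described above can be made concrete with a plaintext sketch. It assumes that the score is a linear function of the query, here 9·Σ p_i q_i − 4·Σ p_i − 4·Σ q_i, whose sign for binary vectors agrees with the test "Jaccard index ≥ 0.8"; the exact form and scaling used by the protocol are not shown in this section, so these coefficients and the function names are hypothetical. The check `is_valid_query` is only a plaintext stand-in for the protocol's verification that each q_i is a bit, which is performed without revealing q.

```python
def score(p, q, c_and=9, c_p=4, c_q=4):
    """Assumed linear score; for bit vectors its sign matches Jaccard >= 0.8."""
    inner = sum(pi * qi for pi, qi in zip(p, q))
    return c_and * inner - c_p * sum(p) - c_q * sum(q)

def is_valid_query(q):
    """Plaintext stand-in for the bit check: every component must be 0 or 1."""
    return all(qi in (0, 1) for qi in q)

big = 10**6
malicious_q = [big, 0, 0, 0]     # one huge component, all others zero
p_bit_set = [1, 0, 1, 1]         # secret fingerprint with p_1 = 1
p_bit_clear = [0, 1, 1, 1]       # secret fingerprint with p_1 = 0

leak_when_set = score(p_bit_set, malicious_q)      # large positive value
leak_when_clear = score(p_bit_clear, malicious_q)  # large negative value
```

The two scores differ in sign and are far outside the range reachable with a binary query, so the first bit of the fingerprint would be revealed if such a query were accepted. Rejecting any query for which the bit check fails removes this leak.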
Performance evaluation
In this section, we evaluate the performance of the proposed method on two datasets created from ChEMBL.
Implementation
We implemented the proposed protocol using the C++ library of elliptic curve ElGamal encryption [24], in which the NIZK proposed in the previous study [21] is also implemented.
For the implementation, we used parameters called secp192k1, as recommended by SECG (The Standards for Efficient Cryptography Group). These parameters are considered to be more secure than 1024-bit RSA encryption, which is the most commonly used public-key cryptosystem. The implementation of
Owing to the limitation of the range of plaintext, the implementation here does not include sign-preserving randomization. For the purpose of comparison, we also implemented a GPMPC protocol by using Fairplay [25]. In order to reduce the circuit size of the GPMPC, we implemented a simple task that computes the sign of the Tversky index between a query and a fingerprint in the database, and repeated the task for all the fingerprints in the database. Thus the CPU time and data transfer size of the implementation are linear in the size of the database.
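To illustrate why an additive-homomorphic scheme of this kind supports addition on ciphertexts and why the plaintext range is limited, the following is a toy "lifted" ElGamal over a small multiplicative group. It is not the cited C++ library and does not use an elliptic curve: the parameters P, Q, G and all function names are hypothetical, and the group is far too small to be secure. The assumption that the cited library uses the lifted variant is based only on its repository name.

```python
import random

P = 1019   # toy safe prime, P = 2 * Q + 1 (insecure, for illustration only)
Q = 509    # prime order of the subgroup generated by G
G = 4      # generator of the order-Q subgroup of the integers modulo P

def keygen(rng):
    """Return a secret key x and the public key h = G^x mod P."""
    x = rng.randrange(1, Q)
    return x, pow(G, x, P)

def encrypt(h, m, rng):
    """Encrypt the integer m 'in the exponent': (G^r, G^m * h^r)."""
    r = rng.randrange(1, Q)
    return pow(G, r, P), (pow(G, m % Q, P) * pow(h, r, P)) % P

def add(c, d):
    """Multiplying ciphertexts componentwise adds the underlying plaintexts."""
    return (c[0] * d[0]) % P, (c[1] * d[1]) % P

def decrypt(x, c, bound):
    """Recover m with |m| <= bound by exhaustive search over G^m."""
    g_m = (c[1] * pow(c[0], Q - x, P)) % P   # c1^(Q - x) equals c1^(-x) in this subgroup
    for m in range(-bound, bound + 1):
        if pow(G, m % Q, P) == g_m:
            return m
    raise ValueError("plaintext outside the searched range")

rng = random.Random(0)
secret, public = keygen(rng)
total = add(encrypt(public, 7, rng), encrypt(public, -12, rng))
recovered = decrypt(secret, total, bound=50)
```

Decryption only yields G^m, so the plaintext must be found by searching a small range; this is the kind of restriction on the plaintext range referred to above.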
Experimental setup
The Jaccard index with the threshold θ = 0.8 was used for both protocols. For SSCC, we used 10,000 dummies. The two implementations were tested on two datasets: one, referred to as ChEMBL 1000, consisted of the first 1000 fingerprints stored in ChEMBL, and the other, referred to as ChEMBL Full, consisted of 1,292,344 fingerprints in the latest version of ChEMBL. All the programs were run on a single core of an Intel Xeon 2.9 GHz on the same machine equipped with 64 GB memory. To avoid environmental effects, we repeated the same experiment five times and calculated average values.
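For reference, the quantity that both protocols compute in this setup, the number of fingerprints whose Jaccard index with the query is at least θ = 0.8, can be written on plaintext data as follows. This is an unprotected sketch with hypothetical function names; the integer score 5·|p ∩ q| − 4·|p ∪ q| is one way to test the threshold without division and is an assumption about scaling, not necessarily the exact form used in the protocol.

```python
from fractions import Fraction

def threshold_score(p, q, theta=Fraction(4, 5)):
    """Integer score that is >= 0 exactly when |p AND q| / |p OR q| >= theta."""
    inter = sum(a & b for a, b in zip(p, q))
    union = sum(a | b for a, b in zip(p, q))
    return theta.denominator * inter - theta.numerator * union

def count_similar(database, q, theta=Fraction(4, 5)):
    """Number of fingerprints in the database that meet the threshold."""
    return sum(1 for p in database if threshold_score(p, q, theta) >= 0)

query = [1, 1, 1, 1, 1]
database = [
    [1, 1, 1, 1, 1],   # Jaccard index 1.0
    [1, 1, 1, 1, 0],   # Jaccard index 0.8
    [1, 1, 1, 0, 0],   # Jaccard index 0.6
    [0, 0, 0, 0, 0],   # Jaccard index 0.0
]
hits = count_similar(database, query)
```

For 166-bit fingerprints this score lies between −664 and 166, an interval of 831 integers, which appears consistent with the value k = 831 used in the security analysis.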
Results
CPU time and communication size of secure similar compounds counter (SSCC) and those of general-purpose multi-party computation (GPMPC).

| | ChEMBL_1000 | ChEMBL_Full |
|---|---|---|
| CPU time (s) | | |
| SSCC (server) | 0.69 | 167.19 |
| SSCC (client) | 1.53 | 172.37 |
| GPMPC (server) | 4,075.15 | − |
| GPMPC (client) | 4,366.18 | − |
| Communication size (MB) | | |
| SSCC (server → client) | 2.24 | 265.33 |
| SSCC (client → server) | 0.03 | 0.03 |
| GPMPC (server → client) | 42.50 | − |
| GPMPC (client → server) | 2,128.00 | − |
The experiment on ChEMBL Full using GPMPC did not finish within 24 hours. Since both CPU time and communication size are exactly linear in the size of the database for the GPMPC protocol, the results of ChEMBL Full for GPMPC are estimated, from the results of ChEMBL 1000, to be more than 1600 hours for both sides and about 3 Tbyte of data transfer from client to server.
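The linear extrapolation can be checked with simple arithmetic. The sketch below scales the ChEMBL 1000 measurements from the table by the ratio of the database sizes; the function name `scale_to_full` is hypothetical, and the result is a rough proportional estimate that need not coincide exactly with the figures quoted above.

```python
CHEMBL_1000 = 1_000        # fingerprints in the small dataset
CHEMBL_FULL = 1_292_344    # fingerprints in the full dataset

def scale_to_full(measured_on_1000):
    """Scale a measurement proportionally to the size of the full database."""
    return measured_on_1000 * CHEMBL_FULL / CHEMBL_1000

server_hours = scale_to_full(4075.15) / 3600   # GPMPC server CPU time, seconds to hours
client_hours = scale_to_full(4366.18) / 3600   # GPMPC client CPU time, seconds to hours
upload_tb = scale_to_full(2128.00) / 1e6       # client to server traffic, MB to TB
```

This proportional scaling gives roughly 1,460 and 1,570 CPU hours for the server and the client, and about 2.75 TB of client-to-server traffic, which is of the same order as the estimates in the text.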
By using simple data parallelization, the computational speed will improve linearly with the number of CPUs. Since all the programs were run on the same machine, there was almost no latency for the communication between the two parties in these experiments. Therefore, GPMPC, whose communication size is huge, is expected to require far more time when it runs on an actual network that is not always in good condition. The other important point is that SSCC requires only two data transfers, which enables data transfer after offline calculation. On the other hand, GPMPC must stay online during the search because of its high communication frequency. We also note that it took less than 100 MB to compile SSCC, while GPMPC required more than 16 GB. Considering these observations, SSCC is efficient for practical use. It is known that several techniques improve the performance of GPMPC; the previous work by Pinkas et al. [26] reported that Free XOR [27] and Garbled Row Reduction [26], which are commonly used in state-of-the-art GPMPC methods [28–31], reduced running time and communication size by factors of 1.8 and 6.3, respectively, when a circuit computing an AES encryption was evaluated. Though these techniques are not implemented in Fairplay, we consider that GPMPC is still far less practical for the large-scale chemical compound search problem than our method, which improved running time and communication size by factors of 36,900 and 12,000.
Conclusion
In this study, we proposed a novel privacy-preserving protocol for searching chemical compound databases. To our knowledge, this is the first practical study for privacy-preserving (for both user and database sides) similarity searching in the fields of bioinformatics and chemoinformatics. Moreover, the proposed method could be applied to a wide range of life science problems such as searching for similar single-nucleotide polymorphism (SNP) patterns in a personal genome database. While the protocol proposed here focuses on searching for a number of similar compounds, we are examining further improvements of the protocol such as the client being able to download similar compounds; we expect this ongoing study to further contribute to the drug screening process. In recent years, open innovation has been attracting attention as a promising approach for speeding up the process of new drug discovery [32]. For example, research on neglected tropical diseases including malaria has been promoted by the recent attempt to share chemical compound libraries in the research community. In spite of high expectations, such an approach is still limited to economically less important problems on account of privacy problems [33]. Therefore, privacy-preserving data mining technology is expected to be the breakthrough promoting open innovation and we believe that our study will play an important role.
Declarations
Acknowledgements
KS thanks Yusuke Sakai and Takahiro Matsuda for fruitful discussions.
Declarations
This work was supported by the Japan-Finland Cooperative Scientific Research Program of JST/AMED (to KS) and JSPS KAKENHI Grant Number 25540131 (to KS and MH). JS and KT are supported by JST CREST. KT is supported by JST ERATO, RIKEN Post-K, NIMS MI2I, JSPS KAKENHI Nanostructure and JSPS KAKENHI Grant Number 15H05711. JS is supported by JSPS KAKENHI Grant Number 24680015.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 18, 2015: Joint 26th Genome Informatics Workshop and 14th International Conference on Bioinformatics: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S18.
References
1. Subbaraman N: Flawed arithmetic on drug development costs. Nature Biotechnology. 2011, 29 (5): 381-381.
2. Miller MA: Chemical database techniques in drug discovery. Nature Reviews Drug Discovery. 2002, 1 (3): 220-7.
3. Schooler J: Unpublished results hide the decline effect. Nature. 2011, 470: 437.
4. Ostrovsky R, Skeith WE: A survey of single-database private information retrieval: techniques and applications. Proceedings of the 10th International Conference on Practice and Theory in Public-key Cryptography PKC'07. 2007, 393-411.
5. Goethals B, Laur S, Lipmaa H, Mielikäinen T: On private scalar product computation for privacy-preserving data mining. Proceedings of the 7th Annual International Conference on Information Security and Cryptology ICISC 2004. 2004, 104-120.
6. Blundo C, Cristofaro ED, Gasti P: EsPRESSo: Efficient Privacy-Preserving Evaluation of Sample Set Similarity. Proceedings of Data Privacy Management and Autonomous Spontaneous Security: 7th International Workshop, DPM 2009 and 5th International Workshop, SETOP 2012 DMP/SETOP 2012. 2012, 89-103.
7. Murugesan M, Jiang W, Clifton C, Si L, Vaidya J: Efficient privacy-preserving similar document detection. The VLDB Journal. 2010, 19 (4): 457-475.
8. Yao ACC: How to generate and exchange secrets. Proceedings of the 27th Annual Symposium on Foundations of Computer Science SFCS '86. 1986, 162-167.
9. Gentry C: Fully homomorphic encryption using ideal lattices. Proceedings of the 41st Annual ACM Symposium on Theory of Computing STOC '09. 2009, 169-178.
10. Togan M, Plesca C: Comparison-based computations over fully homomorphic encrypted data. Communications (COMM), 2014 10th International Conference. 2014, 1-6. doi:10.1109/ICComm.2014.6866760
11. Laur S, Lipmaa H: On private similarity search protocols. Proceedings of the 9th Nordic Workshop on Secure IT Systems NordSec. 2004, 73-77.
12. Popa RA, Redfield CMS, Zeldovich N, Balakrishnan H: Proceedings of the 23rd ACM Symposium on Operating Systems Principles SOSP 11. 85-100.
13. Martin YC, Kofron JL, Traphagen LM: Do structurally similar molecules have similar biological activity?. Journal of Medicinal Chemistry. 2002, 45 (19): 4350-4358.
14. Miller JL: Recent developments in focused library design: targeting gene-families. Current Topics in Medicinal Chemistry. 2006, 6 (1): 19-29.
15. Curty R, Tang J: Someone's loss might be your gain: A case of negative results publications in science. Proceedings of the American Society for Information Science and Technology ASISTS. 2012, 49:
16. Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP: ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research. 2012, 40 (Database): 1100-1107.
17. Goldwasser S, Micali S: Probabilistic encryption. J Comput Syst Sci. 1984, 28 (2): 270-299.
18. Paillier P: Public-key cryptosystems based on composite degree residuosity classes. Proceedings of the 17th International Conference on Theory and Application of Cryptographic Techniques EUROCRYPT'99. 1999, 223-238.
19. ElGamal T: A public key cryptosystem and a signature scheme based on discrete logarithms. IEEE Transactions on Information Theory. 1985, 31 (4): 469-472.
20. Goldreich O: Foundations of Cryptography: Volume 1. 2001, Cambridge University Press.
21. Sakai Y, Emura K, Hanaoka G, Kawai Y, Omote K: Methods for restricting message space in public-key encryption. IEICE Transactions. 2013, 96-A (6): 1156-1168.
22. Tversky A: Features of similarity. Psychological Review. 1977, 84 (4): 327-352.
23. Damgård I, Fitzi M, Kiltz E, Nielsen JB, Toft T: Unconditionally secure constant-rounds multi-party computation for equality, comparison, bits and exponentiation. Proceedings of the 3rd Theory of Cryptography Conference TCC 2006. 2006, 285-304.
24. C++ Library implementing elliptic curve ElGamal crypto system [19]. 2015, URL accessed April 13, 2015, [https://github.com/aistcrypt/LiftedElGamal]
25. Ben-David A, Nisan N, Pinkas B: Fairplaymp: A system for secure multi-party computation. Proceedings of ACM Conference on Computer and Communications Security CCS 2008. 2008, 17-21.
26. Pinkas B, Schneider T, Smart NP, Williams SC: Secure two-party computation is practical. Proceedings of the 15th International Conference on the Theory and Application of Cryptology and Information Security ASIACRYPT 2009. 2009, 250-267.
27. Kolesnikov V, Schneider T: Improved garbled circuit: Free XOR gates and applications. Proceedings of the 35th International Colloquium on Automata, Languages and Programming ICALP 2008. 2008, 486-498.
28. Henecka W, Kögl S, Sadeghi A, Schneider T, Wehrenberg I: TASTY: tool for automating secure two-party computations. Proceedings of the 17th ACM Conference on Computer and Communications Security CCS 2010. 2010, 451-462.
29. Huang Y, Evans D, Katz J, Malka L: Faster secure two-party computation using garbled circuits. Proceedings of the 20th USENIX Security Symposium USENIX 2011. 2011.
30. Huang Y, Shen CH, Evans D, Katz J, Shelat A: Efficient secure computation with garbled circuits. Proceedings of the 7th International Conference on Information Systems Security ICISS. 2011, 28-48.
31. Kreuter B, Shelat A, Shen C: Billion-gate secure computation with malicious adversaries. Proceedings of the 21st USENIX Security Symposium USENIX Security 2012. 2012, 285-300.
32. Williams AJ, Harland L, Groth P, Pettifer S, Chichester C, Willighagen EL, Evelo CT, Blomberg N, Ecker G, Goble C, Mons B: Open PHACTS: semantic interoperability for drug discovery. Drug Discovery Today. 2012, 17 (21-22): 1188-1198.
33. Hunter J, Stephens S: Is open innovation the way forward for big pharma?. Nature Reviews Drug Discovery. 2010, 9 (2): 87-88.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.