A Question-Entailment Approach to Question Answering

One of the challenges in large-scale information retrieval (IR) is to develop fine-grained and domain-specific methods to answer natural language questions. Despite the availability of numerous sources and datasets for answer retrieval, Question Answering (QA) remains a challenging problem due to the difficulty of the question understanding and answer extraction tasks. One of the promising tracks investigated in QA is to map new questions to formerly answered questions that are "similar". In this paper, we propose a novel QA approach based on Recognizing Question Entailment (RQE) and we describe the QA system and resources that we built and evaluated on real medical questions. First, we compare machine learning and deep learning methods for RQE using different kinds of datasets, including textual inference, question similarity and entailment in both the open and clinical domains. Second, we combine IR models with the best RQE method to select entailed questions and rank the retrieved answers. To study the end-to-end QA approach, we built the MedQuAD collection of 47,457 question-answer pairs from trusted medical sources, which we introduce and share in the scope of this paper. Following the evaluation process used in TREC 2017 LiveQA, we find that our approach exceeds the best results of the medical task with a 29.8% increase over the best official score. The evaluation results also support the relevance of question entailment for QA and highlight the effectiveness of combining IR and RQE for future QA efforts. Our findings also show that relying on a restricted set of reliable answer sources can bring a substantial improvement in medical QA.


Introduction
With the availability of rich data on users' locations, profiles and search history, personalization has become the leading trend in large-scale information retrieval. However, efficiency through personalization is not yet the most suitable model when tackling domain-specific searches. This is due to several factors, such as the lexical and semantic challenges of domain-specific data that often include advanced argumentation and complex contextual information, the higher sparseness of relevant information sources, and the more pronounced lack of similarities between users' searches.
A recent study on expert search strategies among healthcare information professionals [44] showed that, for a given search task, they spend an average of 60 minutes per collection or database, 3 minutes to examine the relevance of each document, and 4 hours of total search time. When written in steps, their search strategy spans over 15 lines and can reach up to 105 lines.
With the abundance of information sources in the medical domain, consumers are more and more faced with a similar challenge, one that needs dedicated solutions that can adapt to the heterogeneity and specifics of health-related information.

arXiv:1901.08079v1 [cs.CL] 23 Jan 2019
Dedicated Question Answering (QA) systems are one of the viable solutions to this problem as they are designed to understand natural language questions without relying on external information on the users.
In the context of QA, the goal of Recognizing Question Entailment (RQE) is to retrieve answers to a premise question (PQ) by retrieving inferred or entailed questions, called hypothesis questions (HQ), that already have associated answers. We therefore define the entailment relation between two questions as: a question A entails a question B if every answer to B is also a correct answer to A [4].
RQE is particularly relevant due to the increasing number of similar questions posted online [32] and its ability to address the challenging issues of question understanding and answer extraction in a different way. In addition to being used to find relevant answers, collections of similar questions can also be used to train models that recognize inference relations and similarity between questions.
In this paper, we study question entailment in the medical domain and the effectiveness of the end-to-end RQE-based QA approach by evaluating the relevance of the retrieved answers. Although entailment was attempted in QA before [22,38,10], as far as we know, we are the first to introduce and evaluate a full medical question answering approach based on question entailment for free-text questions. Our contributions are:
1. A study of machine learning and deep learning approaches to RQE using different kinds of datasets, including textual inference, question similarity and entailment in both the open and clinical domains.
2. A collection of 47,457 medical question-answer pairs with additional annotations, constructed from trusted sources such as NIH websites. We make this resource publicly available 1.
3. A new QA approach based on question entailment. Our approach uses IR models to retrieve question candidates and the RQE model to identify entailed questions and return their answers.
4. An evaluation of the RQE-based QA system on TREC 2017 LiveQA medical questions [7]. Results showed that our approach exceeds the best official score on the medical task using only the collection of 47K QA pairs as the answer source.
The next section is dedicated to related work on question answering, question similarity and entailment. In Section 3, we present two machine learning (ML) and deep learning (DL) methods for RQE and compare their performance using open-domain and clinical datasets. Section 4 describes the new collection of medical question-answer pairs. In Section 5, we describe our RQE-based approach for QA. Section 6 presents our evaluation of the retrieved answers and the results obtained on TREC 2017 LiveQA medical questions.

Background
In this section we define the RQE task and describe related work at the intersection of question answering, question similarity and textual inference.

Task Definition
The definition of Recognizing Question Entailment (RQE) can have a significant impact on QA results. In related work, the meaning associated with Natural Language Inference (NLI) varies among different tasks and events. For instance, Recognizing Textual Entailment (RTE) was addressed by the PASCAL challenge [13], where the entailment relation has been assessed manually by human judges who selected relevant sentences "entailing" a set of hypotheses from a list of documents returned by different Information Retrieval (IR) methods. In another definition, the Stanford Natural Language Inference corpus (SNLI) [8] used three classification labels for the relations between two sentences: entailment, neutral and contradiction. For the entailment label, the annotators who built the corpus were presented with an image and asked to write a caption "that is a definitely true description of the photo". For the neutral label, they were asked to provide a caption "that might be a true description of the photo". For the contradiction label, they were asked for a caption that "is definitely a false description of the photo".
More recently, the multiNLI corpus [51] was shared in the scope of the RepEval 2017 shared task 2 [37]. To build the corpus, annotators were presented with a premise text and asked to write three sentences. One novel sentence, which is "necessarily true or appropriate in the same situations as the premise," for the entailment label, a sentence, which is "necessarily false or inappropriate whenever the premise is true," for the contradiction label, and a last sentence, "where neither condition applies," for the neutral label.
Whereas these NLI definitions might be suitable for the broad topic of text understanding, their relation to practical information retrieval or question answering systems is not straightforward.
In contrast, RQE has to be tailored to the question answering task. For instance, if the premise question is "looking for cold medications for a 30 yo woman", an RQE approach should be able to consider the more general (less restricted) question "looking for cold medications" as relevant, since its answers are relevant for the initial question, whereas "looking for medications for a 30 yo woman" is a useless contextualization. The entailment relation we are seeking in the QA context should include relevant and meaningful relaxations of contextual and semantic constraints (cf. Section 3.1).

Related Work on Question Answering
Classical QA systems face two main challenges related to question analysis and answer extraction. Several QA approaches were proposed in the literature for the open domain [14,23] and the medical domain [3,6,52]. A variety of methods were developed for question analysis, focus (topic) recognition and question type identification [29,5,34,33]. Similarly, many different approaches tackled document or passage retrieval and answer selection and (re)ranking [50,46,47].
An alternative approach consists in finding similar questions or FAQs that are already answered [25,49]. One of the earliest question answering systems based on finding similar questions and re-using the existing answers was FAQ FINDER [9]. Another system that complements the existing Q&A services of NetWellness 3 is SimQ [32], which allows retrieval of similar web-based consumer health questions. SimQ uses syntactic and semantic features to compute similarity between questions, and UMLS [31] as a standardized semantic knowledge source. The system achieves 72.2% precision, 78.0% recall and 75.0% F-score on NetWellness questions. However, the method was evaluated only on one question similarity dataset, and the retrieved answers were not evaluated.
The aim of the medical task at TREC 2017 LiveQA was to develop techniques for answering complex questions such as consumer health questions, as well as to identify relevant answer sources that can comply with the sensitivity of medical information retrieval.
The CMU-OAQA system [48] achieved the best performance of 0.637 average score on the medical task by using an attentional encoder-decoder model for paraphrase identification and answer ranking. The Quora question-similarity dataset was used for training. The PRNA system [15] achieved the second best performance in the medical task with 0.49 average score using Wikipedia as the first answer source and Yahoo and Google searches as secondary answer sources. Each medical question was decomposed into several subquestions. To extract the answer from the selected text passage, a bi-directional attention model trained on the SQUAD dataset was used.
Deep neural network models have been pushing the limits of performance achieved in QA related tasks using large training datasets. The results obtained by CMU-OAQA and PRNA showed that large open-domain datasets were beneficial for the medical domain. However, the best system (CMU-OAQA) relying on the same training data obtained a score of 1.139 on the LiveQA open-domain task.
While this gap in performance can be explained in part by the discrepancies between the medical test questions and the open-domain questions, it also highlights the need for larger medical datasets to support deep learning approaches in dealing with the linguistic complexity of consumer health questions and the challenge of finding correct and complete answers.
Another technique was used by the ECNU-ICA team [2], based on learning question similarity via two long short-term memory (LSTM) networks applied to obtain the semantic representations of the questions. To construct a collection of similar question pairs, they searched community question answering sites such as Yahoo! and Answers.com. In contrast, the ECNU-ICA system achieved the best performance of 1.895 in the open-domain task but an average score of only 0.402 in the medical task. As the ECNU-ICA approach also relied on a neural network for question matching, this result shows that training attention-based encoder-decoder networks on the Quora dataset generalized better to the medical domain than training LSTMs on similar questions from Yahoo! and Answers.com.
The CMU-LiveMedQA team [52] designed a specific system for the medical task. Using only the provided training datasets and the assumption that each question contains only one focus, the CMU-LiveMedQA system obtained an average score of 0.353. They used a convolutional neural network (CNN) model to classify a question into a restricted set of 10 question types and crawled "relevant" online web pages to find the answers. However, the results were lower than those achieved by the systems relying on finding similar answered questions. These results support the relevance of similar question matching for the end-to-end QA task as a new way of approaching QA instead of the classical QA approaches based on Question Analysis and Answer Retrieval.

Related Work on Question Similarity and Entailment
Several efforts focused on recognizing similar questions. Jeon et al. [24] showed that a retrieval model based on translation probabilities learned from a question and answer archive can recognize semantically similar questions. Duan et al. [18] proposed a dedicated language modeling approach for question search, using question topic (user's interest) and question focus (certain aspect of the topic).
Lately, these efforts were supported by a task on Question-Question similarity introduced in the community QA challenge at SemEval (task 3B) [35]. Given a new question, the task focused on reranking all similar questions retrieved by a search engine, assuming that the answers to the similar questions will be correct answers for the new question. Different machine learning and deep learning approaches were tested in the scope of SemEval 2016 [35] and 2017 [36] task 3B. The best performing system in 2017 achieved a MAP of 47.22% using supervised Logistic Regression that combined different unsupervised similarity measures such as Cosine and Soft-Cosine [11]. The second best system achieved 46.93% MAP with a learning-to-rank method using Logistic Regression and a rich set of features including lexical and semantic features as well as embeddings generated by different neural networks (siamese, Bi-LSTM, GRU and CNNs) [21]. In the scope of this challenge, a dataset was collected from Qatar Living forum for training. We refer to this dataset as SemEval-cQA 4 .
In another effort, an answer-based definition of RQE was proposed and tested [4]. The authors introduced a dataset of clinical questions and used a feature-based method that provided an Accuracy of 75% on consumer health questions. We will call this dataset Clinical-QE 5 . Dos Santos et al. [17] proposed a new approach to retrieve semantically equivalent questions combining a bag-of-words representation with a distributed vector representation created by a CNN and user data collected from two Stack Exchange communities. Lei et al. [30] proposed a recurrent and convolutional model (gated convolution) to map questions to their semantic representations. The models were pre-trained within an encoder-decoder framework.

RQE Approaches and Experiments
The choice of two methods for our empirical study is motivated by the best performance achieved by Logistic Regression in question-question similarity at SemEval 2017 (best system [11] and second best system [21]), and the high performance achieved by neural networks on larger datasets such as SNLI [8,28,12,20]. We first define the RQE task, then present the two approaches, and evaluate their performance on five different datasets.

Definition
In the context of QA, the goal of RQE is to retrieve answers to a new question by retrieving entailed questions with associated answers. We therefore define question entailment as:
• a question A entails a question B if every answer to B is also a complete or partial answer to A.
We present below two examples of consumer health questions Ai and entailed questions Bi:
Example 1 (each answer to the entailed question B1 is a complete answer to A1):
• A1: What is the latest news on tennitis, or ringing in the ear, I am 75 years old and have had ringing in the ear since my mid 5os. Thank you.
• B1: What is the latest research on Tinnitus?
Example 2 (each answer to the entailed question B2 is a partial answer to A2):
• A2: My mother has been diagnosed with Alzheimer's, my father is not of the greatest health either and is the main caregiver for my mother. My question is where do we start with attempting to help our parents w/ the care giving and what sort of financial options are there out there for people on fixed incomes.
• B2: What resources are available for Alzheimer's caregivers?
The inclusion of partial answers in the definition of question entailment also allows efficient relaxation of the contextual constraints of the original question A to retrieve relevant answers from entailed, but less restricted, questions.

Deep Learning Model
To recognize entailment between two questions PQ (premise) and HQ (hypothesis), we adapted the neural network proposed by Bowman et al. [8]. Our DL model, presented in Figure 1, consists of three 600d ReLU layers, with a bottom layer taking the concatenated sentence representations as input and a top layer feeding a softmax classifier. The sentence embedding model sums the recurrent neural network (RNN) embeddings of its words. The word embeddings are first initialized with pretrained GloVe vectors. This adaptation provided the best performance in previous experiments with RQE data. GloVe 6 is an unsupervised learning algorithm to generate vector representations for words [40]. Training is performed on aggregated word co-occurrence statistics from a large corpus, and the resulting representations show interesting linear substructures of the word vector space. We use the pretrained Common Crawl version with 840B tokens and 300d vectors, which are not updated during training.
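The forward pass of such a model can be sketched in plain Python. This is only an illustration under several assumptions that are not taken from the paper: a plain tanh RNN stands in for the sentence encoder, the sentence dimension `sent_dim` and the random initialization are arbitrary, and the weights are untrained. All function and parameter names are hypothetical.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Random weight matrix of shape rows x cols (arbitrary small uniform init)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, x) for x in vector]

def softmax(vector):
    peak = max(vector)
    exps = [math.exp(x - peak) for x in vector]
    total = sum(exps)
    return [e / total for e in exps]

def encode_question(word_vectors, w_in, w_rec):
    """Sum the hidden states of a tanh RNN run over the word embeddings."""
    hidden = [0.0] * len(w_rec)
    summed = [0.0] * len(w_rec)
    for vec in word_vectors:
        pre = [a + b for a, b in zip(matvec(w_in, vec), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        summed = [s + h for s, h in zip(summed, hidden)]
    return summed

def init_params(embed_dim=300, sent_dim=100, hidden_dim=600, n_classes=2, seed=0):
    """Untrained parameters: one shared encoder, three ReLU layers, one output layer."""
    rng = random.Random(seed)
    return {
        "w_in": make_matrix(sent_dim, embed_dim, rng),
        "w_rec": make_matrix(sent_dim, sent_dim, rng),
        "relu_layers": [
            make_matrix(hidden_dim, 2 * sent_dim, rng),
            make_matrix(hidden_dim, hidden_dim, rng),
            make_matrix(hidden_dim, hidden_dim, rng),
        ],
        "w_out": make_matrix(n_classes, hidden_dim, rng),
    }

def classify_pair(pq_vectors, hq_vectors, params):
    """Return class probabilities (entailment vs. no entailment) for a (PQ, HQ) pair."""
    pq = encode_question(pq_vectors, params["w_in"], params["w_rec"])
    hq = encode_question(hq_vectors, params["w_in"], params["w_rec"])
    layer = pq + hq  # concatenated sentence representations feed the bottom layer
    for weights in params["relu_layers"]:
        layer = relu(matvec(weights, layer))
    return softmax(matvec(params["w_out"], layer))
```

In practice the word vectors passed to `classify_pair` would be frozen pretrained GloVe embeddings, and the weights would be learned with a standard deep learning framework rather than drawn at random.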

Logistic Regression Classifier
In this feature-based approach, we use Logistic Regression to classify question pairs into entailment or no-entailment. Logistic Regression achieved good results on this specific task and outperformed other statistical learning algorithms such as SVM and Naive Bayes. In a preprocessing step, we remove stop words and perform word stemming using the Porter algorithm [41] for all (PQ, HQ) pairs.
We use a list of nine features, selected after several experiments on RTE datasets [13]. We compute five similarity measures between the pre-processed questions and use their values as features. We use Word Overlap, the Dice coefficient based on the number of common bigrams, Cosine, Levenshtein, and the Jaccard similarities. Our feature list also includes the maximum and average values obtained with these measures and the question length ratio (length(PQ)/length(HQ)). We compute a morphosyntactic feature indicating the number of common nouns and verbs between PQ and HQ. TreeTagger [45] was used for POS tagging.
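As an illustration, eight of the nine features (all but the POS-based one, which needs a tagger) could be computed as in the sketch below. The exact formulas used in the paper are not given, so each definition here is an assumption: Word Overlap is normalized by the smaller vocabulary, Cosine uses raw term counts, and Levenshtein is turned into a character-level similarity in [0, 1]. Inputs are assumed to be already stop-word filtered and stemmed token lists.

```python
import math
from collections import Counter

def word_overlap(a, b):
    """Shared word types divided by the smaller vocabulary size (assumed definition)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / min(len(sa), len(sb)) if sa and sb else 0.0

def dice_bigrams(a, b):
    """Dice coefficient over the sets of word bigrams."""
    ba, bb = set(zip(a, a[1:])), set(zip(b, b[1:]))
    return 2 * len(ba & bb) / (len(ba) + len(bb)) if ba or bb else 0.0

def cosine(a, b):
    """Cosine similarity between raw term-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard(a, b):
    """Jaccard similarity between the sets of word types."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def levenshtein_similarity(s, t):
    """1 - (character edit distance / longer string length)."""
    if not s and not t:
        return 1.0
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return 1.0 - prev[-1] / max(len(s), len(t))

def similarity_features(pq_tokens, hq_tokens):
    """Five similarities, their maximum and average, and the length ratio."""
    sims = [
        word_overlap(pq_tokens, hq_tokens),
        dice_bigrams(pq_tokens, hq_tokens),
        cosine(pq_tokens, hq_tokens),
        levenshtein_similarity(" ".join(pq_tokens), " ".join(hq_tokens)),
        jaccard(pq_tokens, hq_tokens),
    ]
    length_ratio = len(pq_tokens) / len(hq_tokens) if hq_tokens else 0.0
    return sims + [max(sims), sum(sims) / len(sims), length_ratio]
```

The resulting vector, extended with the morphosyntactic feature, would then be passed to a Logistic Regression classifier.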
For RQE, we add an additional feature specific to the question type. We use a dictionary lookup to map triggers to the question type (e.g. Treatment, Prognosis, Inheritance). Triggers are identified for each question type based on a manual annotation of a set of medical questions (cf. Section 4.2). This feature has three possible values: 2 (Perfect match between PQ type(s) and HQ type(s)), 1 (Overlap between PQ type(s) and HQ type(s)) and 0 (No common types).
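A minimal sketch of this question-type feature follows. The trigger dictionary is hypothetical and far smaller than a manually built one, and simple substring matching is assumed for the lookup.

```python
# Hypothetical trigger lists; the real ones come from manual annotation.
TYPE_TRIGGERS = {
    "Treatment": ["treatment", "treat", "cure", "therapy"],
    "Prognosis": ["prognosis", "outlook", "life expectancy"],
    "Inheritance": ["inherited", "hereditary", "run in families"],
}

def question_types(question):
    """Return the set of question types whose triggers occur in the question."""
    text = question.lower()
    return {
        qtype
        for qtype, triggers in TYPE_TRIGGERS.items()
        if any(trigger in text for trigger in triggers)
    }

def type_match_feature(pq, hq):
    """2 = perfect match of type sets, 1 = partial overlap, 0 = no common types."""
    pq_types, hq_types = question_types(pq), question_types(hq)
    if pq_types and pq_types == hq_types:
        return 2
    if pq_types & hq_types:
        return 1
    return 0
```

For example, a premise asking about both heredity and treatment would receive the value 1 against a hypothesis question that only asks about treatment.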

Training Datasets
We evaluate the RQE methods (i.e. deep learning model and logistic regression classifier) using two datasets of sentence pairs (SNLI and multiNLI), and three datasets of question pairs (Quora, Clinical-QE, and SemEval-cQA).
The Stanford Natural Language Inference corpus (SNLI) [8] contains 569,037 sentence pairs written by humans based on image captioning. The training set of the MultiNLI corpus [51] consists of 393,000 pairs of sentences from five genres of written and spoken English (e.g. Travel, Government). Two other "matched" and "mismatched" sets are also available for development (20,000 pairs). Both SNLI and multiNLI consider three types of relationships between sentences: entailment, neutral and contradiction. We converted the contradiction and neutral labels to the same non-entailment class.
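The label conversion described above amounts to a two-way mapping; a small sketch (the function name and label strings are illustrative):

```python
def to_binary_label(nli_label):
    """Map a three-way NLI label to the two RQE classes."""
    if nli_label not in {"entailment", "neutral", "contradiction"}:
        raise ValueError(f"unexpected NLI label: {nli_label!r}")
    # Neutral and contradiction are merged into a single non-entailment class.
    return "entailment" if nli_label == "entailment" else "non-entailment"
```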
The question similarity dataset of SemEval 2016 Task 3B (SemEval-cQA) [35] contains 3,869 question pairs and aims to re-rank a list of related questions according to their similarity to the original question. The same dataset was used for SemEval 2017 Task 3 [36].

RQE Test Dataset
To construct our test dataset, we used a publicly shared set of Consumer Health Questions (CHQs) received by the U.S. National Library of Medicine (NLM), and annotated with named entities, question types, and focus [26,27]. The CHQ dataset consists of 1,721 consumer information requests manually annotated with subquestions, each identified by a question type and a focus.
First, we selected automatically harvested FAQs, from U.S. National Institutes of Health (NIH) websites, that share both the same focus and the same question type with the CHQs. As FAQs are most often very short, we first assume that the CHQ entails the FAQ. Two sets of pairs were constructed: (i) positive pairs of CHQs and FAQs sharing at least one common question type and the question focus, and (ii) negative pairs corresponding to a focus mismatch or type mismatch. For each category of negative examples, we randomly selected the same number of pairs for a balanced dataset. Then, we manually validated the constructed pairs and corrected the positive and negative labels when needed. The final RQE dataset contains 850 CHQ-FAQ pairs with 405 positive and 445 negative pairs. Table 1 presents examples from the five training datasets (SNLI, MultiNLI, SemEval-cQA, Clinical-QE and Quora) and the new test dataset of medical CHQ-FAQ pairs.

Results of RQE Approaches
In the first experiment, we evaluated the DL and ML methods on SNLI, MultiNLI, Quora, and Clinical-QE. For the datasets that did not have development and test sets, we randomly selected two sets, each amounting to 10% of the data, for test and development, and used the remaining 80% for training. For MultiNLI, we used the dev1-matched set for validation and the dev2-mismatched set for testing. In the second experiment, we used these datasets for training only and compared their performance on our test set of 850 CHQ-FAQ pairs. Table 3 presents the results of this experiment. Logistic Regression trained on the Clinical-QE data outperformed DL models trained on all datasets, with 73.18% Accuracy.
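The random 80/10/10 split used for datasets without predefined development and test sets could look like the sketch below. The seed and the rounding of the held-out sizes are assumptions.

```python
import random

def split_pairs(pairs, seed=0):
    """Randomly split pairs into 80% train, 10% development and 10% test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    held_out = len(shuffled) // 10  # size of each held-out set
    test = shuffled[:held_out]
    dev = shuffled[held_out:2 * held_out]
    train = shuffled[2 * held_out:]
    return train, dev, test
```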
Table 1 (excerpt). Example pair from the RQE test dataset (850 CHQ-FAQ pairs):
PQ: IHSS heart condition and WPW heart condition. Is there any way you could send me information on both these heart conditions? My son has to get tested for them eventually and I would just like information to understand the conditions of both of them more.
HQ: What is Wolff-Parkinson-White syndrome?

To further validate the performance of the LR method, we evaluated it on question similarity detection. A typical approach to this task is to use an IR method to find similar question candidates, then a more sophisticated method to select and re-rank the similar questions. We followed a similar approach for this evaluation by combining the LR method with the IR baseline provided in the context of SemEval-cQA. The hybrid method combines the score provided by the Logistic Regression model and the reciprocal rank from the IR baseline using a weight-based combination with a weight w.
The weight w was set empirically through several tests on the cQA-2016 development set (w = 8.9). Table 4 presents the results on the cQA-2016 and cQA-2017 test datasets. The hybrid method (LR+IR) provided the best results on both datasets. On the 2016 test data, the LR+IR method outperformed the best system in all measures, with 80.57% Accuracy and 77.47% MAP (official system ranking measure in SemEval-cQA). On the cQA-2017 test data, the LR+IR method obtained 44.66% MAP and outperformed the cQA-2017 best system in Accuracy with 67.27%.
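One possible reading of this weight-based combination is a weighted sum of the Logistic Regression score and the IR reciprocal rank. The sketch below assumes that form (score = LR score + w × 1/rank); the exact formula is not reproduced in the text above, so this is an assumption, as are the function and argument names.

```python
def hybrid_rank(lr_scores, ir_ranking, w=8.9):
    """Re-rank IR candidates with an assumed weighted sum of LR score and reciprocal rank.

    lr_scores: dict mapping question id -> Logistic Regression score.
    ir_ranking: list of question ids ordered by the IR baseline, best first.
    """
    scored = []
    for rank, qid in enumerate(ir_ranking, start=1):
        # Assumed combination: LR score plus w times the IR reciprocal rank.
        scored.append((qid, lr_scores.get(qid, 0.0) + w / rank))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With a large weight such as w = 8.9 the IR order dominates and the LR score mostly reorders neighbouring candidates; with a small weight the LR score drives the ranking.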

Discussion of RQE Results
When trained and tested on the same corpus, the DL model with GloVe embeddings gave the best results on three datasets (SNLI, MultiNLI and Quora). Logistic Regression gave the best Accuracy on the Clinical-QE dataset with 98.60%. When tested on our test set (850 medical CHQ-FAQ pairs), Logistic Regression trained on Clinical-QE gave the best performance with 73.18% Accuracy.
The SNLI and MultiNLI models did not perform well when tested on medical RQE data. We performed additional evaluations using the RTE-1, RTE-2 and RTE-3 open-domain datasets provided by the PASCAL challenge and the results were similar. We also tested the SemEval-cQA-2016 model and observed a similar drop in performance on RQE data. This could be explained by the different types of data leading to wrong internal conceptualizations of medical terms and questions in the deep neural layers. This performance drop could also be caused by the complexity of the test consumer health questions, which are often composed of several subquestions, contain contextual information, and may contain misspellings and ungrammatical sentences, which makes them more difficult to process [42]. Another aspect is the semantics of the task as discussed in Section 2.1. The definition of textual entailment in the open domain may not quite apply to question entailment due to the strict semantics. Also, the general textual entailment definitions refer only to the premise and hypothesis, while the definition of RQE for question answering relies on the relationship between the sets of answers of the compared questions.

Building a Medical QA Collection from Trusted Resources
An RQE-based QA system requires a collection of question-answer pairs to map new user questions to the existing questions with an RQE approach, rank the retrieved questions, and present their answers to the user.

Method
To construct trusted medical question-answer pairs, we crawled websites from the National Institutes of Health 7 (cf. Section 4.3). Each web page describes a specific topic (e.g. name of a disease or a drug), and often includes synonyms of the main topic that we extracted during the crawl.
We constructed hand-crafted patterns for each website to automatically generate the question-answer pairs based on the document structure and the section titles. We also annotated each question with the associated focus (topic of the web page) as well as the question type identified with the designed patterns (cf. Section 4.2).
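The pattern-based generation step can be pictured with the following sketch. The section titles, question templates and type names in `SECTION_PATTERNS` are hypothetical examples, not the patterns actually written for each website.

```python
# Hypothetical mapping: section title -> (question type, question template).
SECTION_PATTERNS = {
    "treatment": ("Treatment", "What are the treatments for {focus}?"),
    "symptoms": ("Symptoms", "What are the symptoms of {focus}?"),
    "overview": ("Information", "What is {focus}?"),
}

def generate_qa_pairs(focus, sections):
    """Turn the titled sections of one article into annotated QA pairs.

    focus: topic of the web page (e.g. a disease or drug name).
    sections: dict mapping section title -> section text.
    """
    pairs = []
    for title, text in sections.items():
        pattern = SECTION_PATTERNS.get(title.strip().lower())
        if pattern is None or not text.strip():
            continue  # no pattern for this section, or nothing to answer with
        qtype, template = pattern
        pairs.append({
            "question": template.format(focus=focus),
            "answer": text.strip(),
            "focus": focus,
            "type": qtype,
        })
    return pairs
```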
To provide additional information about the questions that could be used for diverse IR and NLP tasks, we automatically annotated the questions with the focus, its UMLS Concept Unique Identifier (CUI) and Semantic Type. We combined two methods to recognize named entities from the titles of the crawled articles and their associated UMLS CUIs: (i) exact string matching to the UMLS Metathesaurus 8 , and (ii) MetaMap Lite 9 [16]. We then used the UMLS Semantic Network to retrieve the associated semantic types and groups.

Question Types
The question types were derived after the manual evaluation of 1,721 consumer health questions.
• What is the action of DRUG and how does it work?
• Who should get DRUG and why is it prescribed?
• What to do in case of a severe reaction to DRUG?
3. Question Type for other medical entities (e.g. Procedure, Exam, Treatment): Information.
• What is Coronary Artery Bypass Surgery?
• What are Liver Function Tests?

Medical Resources
We used 12 trusted websites to construct a collection of question-answer pairs. For each website, we extracted the free text of each article as well as the synonyms of the article focus (topic). These resources and their brief descriptions are provided below:
1. National Cancer Institute (NCI) 10: We extracted free text from 116 articles on various cancer types (729 QA pairs). We manually restructured the content of the articles to generate complete answers (e.g. a full answer about the treatment of all stages of a specific type of cancer). Figure 2 presents examples of QA pairs generated from an NCI article.
2. Genetic and Rare Diseases Information Center (GARD) 11: This resource contains information about various aspects of genetic/rare diseases. We extracted all disease question/answer pairs from 4,278 topics (5,394 QA pairs).

The Proposed Entailment-based QA System
Our goal is to generate a ranked list of answers for a given Premise Question PQ by ranking the recognized Hypothesis Questions HQs. Based on the RQE experiments above (Section 3.5), we selected Logistic Regression trained on the Clinical-QE dataset to recognize entailed questions and rank them with their classification scores.

RQE-based QA Approach
Classifying the full QA collection for each test question is not feasible for real-time applications. Therefore, we first filter the questions with an IR method to retrieve candidate questions, then classify them as entailed (or not) by the user/test question. Based on the positive results of the combination method tested on SemEval-cQA data (Section 3.5), we adopted a combination method to merge the results obtained by the search engine and the RQE scores. The answers are then combined from both methods and ranked using an aggregate score. Figure 4 presents the overall architecture of the proposed QA system. We describe each module in more detail next.

Finding Similar Question Candidates
For each premise question PQ, we use the Terrier search engine 22 to retrieve N relevant question candidates {HQ_j, j ∈ [1, N]} and then apply the RQE classifier to predict the labels for the pairs (PQ, HQ_j).
We indexed the questions of our QA collection without the associated answers. In order to improve the indexing and the performance of question retrieval, we also indexed the synonyms of the question focus and the triggers of the question type with each question. This choice allowed us to avoid the shortcomings of query expansion, including incorrect or irrelevant synonyms and the increased execution time. The synonyms of the question focus (topic) were extracted automatically from the QA collection. The triggers of each question type were defined manually in the question types taxonomy. Below are two examples of indexed questions from our QA collection, with the automatically added focus synonyms and question type triggers:
1. What are the treatments for Torticollis?
The IR task consists of retrieving hypothesis questions HQ_j relevant to the submitted question PQ. As the fusion of IR results has shown good performance in different tracks in TREC, we merge the results of the TF-IDF weighting function and the In-expB2 DFR model [39].
Let QL_V = {HQ^V_1, HQ^V_2, ..., HQ^V_N} be the set of N questions retrieved by the first IR model V and QL_W = {HQ^W_1, HQ^W_2, ..., HQ^W_N} be the set of N questions retrieved by the second IR model W. We merge both sets by summing the scores of each retrieved question HQ_j in both QL_V and QL_W lists, then we re-rank the hypothesis questions HQ_j.
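The score-summing fusion of the two result lists can be sketched as follows. The sketch assumes raw retrieval scores are summed as-is and that a question returned by only one model keeps its single score; both points are assumptions, and the names are illustrative.

```python
def fuse_rankings(results_v, results_w):
    """Merge two IR result lists by summing per-question scores, then re-rank.

    results_v, results_w: dicts mapping question id -> retrieval score
    for the two IR models (here called V and W).
    """
    fused = {}
    for results in (results_v, results_w):
        for qid, score in results.items():
            fused[qid] = fused.get(qid, 0.0) + score
    # Highest summed score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```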

Combining IR and RQE Methods
The IR models and the RQE Logistic Regression model bring different perspectives to the search for relevant candidate questions. In particular, question entailment allows understanding the relations between the important terms, whereas the traditional IR methods identify the important terms, but will not notice if the relations are opposite. Moreover, some of the question types that the RQE classifier learns will not be deemed important terms by traditional IR and the most relevant questions will not be ranked at the top of the list. Therefore, in our approach, when a question is submitted to the system, candidate questions are fetched using the IR models, then the RQE classifier is applied to filter out the non-entailed questions and re-rank the remaining candidates.
Specifically, we denote by CL the list of question candidates {HQ_j, 1 ≤ j ≤ N} returned by the IR system. The premise question PQ is then used to construct N question pairs {(PQ, HQ_j), 1 ≤ j ≤ N}. The RQE classifier is then applied to filter out the question pairs that are not entailed and re-rank the remaining pairs. More precisely, let EL_PQ = {HQ_1, HQ_2, ..., HQ_k, ...} in CL be the list of selected candidate questions that have a positive entailment relation with a given premise question PQ. We rank EL_PQ by computing a hybrid score Hscore_k for each candidate question HQ_k, taking into account the score of the IR system score_k(IR) and the score of the RQE system score_k(RQE).
For each system $S \in \{IR, RQE\}$, we normalize the associated score by dividing it by the maximum score among the $N$ candidate questions retrieved by $S$ for $PQ$:
$$\mathrm{Norm}_k(S) = \frac{score_k(S)}{\max_{1 \le j \le N} score_j(S)}$$
In our experiments, we fixed the value of $N$ to 100. This threshold value was selected as a safe value for this task for the following reasons:
• Our collection of 47,457 question-answer pairs was collected from only 12 NIH institutes and is unlikely to contain more than 100 occurrences of the same focus-type pair.
• Each question was indexed with additional annotations for the question focus, its synonyms and the question type synonyms.
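The filtering and re-ranking step can be illustrated with the short sketch below. It rests on several assumptions that are not stated in this section: the hybrid score is taken to be the sum of the two max-normalized scores, the RQE score is treated as an entailment probability, and a candidate is considered entailed when that probability reaches a hypothetical threshold of 0.5. All names and values are illustrative.

```python
def normalize(scores):
    """Divide each score by the maximum score among the candidates."""
    top = max(scores.values())
    if top <= 0:
        return {qid: 0.0 for qid in scores}
    return {qid: s / top for qid, s in scores.items()}


def hybrid_rerank(ir_scores, rqe_scores, threshold=0.5):
    """Keep entailed candidates and rank them by a hybrid score.

    ir_scores:  question id -> IR score for the premise question.
    rqe_scores: question id -> RQE score (assumed to be a probability).
    threshold:  assumed decision boundary for a positive entailment.
    The hybrid score is assumed to be Norm(IR) + Norm(RQE).
    """
    ir_norm = normalize(ir_scores)
    rqe_norm = normalize(rqe_scores)
    hscore = {
        qid: ir_norm[qid] + rqe_norm[qid]
        for qid in ir_scores
        if rqe_scores[qid] >= threshold  # drop non-entailed candidates
    }
    return sorted(hscore.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical candidate list for one premise question.
ir = {"HQ1": 10.0, "HQ2": 8.0, "HQ3": 5.0}
rqe = {"HQ1": 0.4, "HQ2": 0.9, "HQ3": 0.6}
ranked = hybrid_rerank(ir, rqe)
```

In this toy example the top IR candidate (HQ1) is discarded because it is not entailed, which is the behavior the hybrid approach is meant to provide.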

Evaluating RQE for Medical Question Answering
The objective of this evaluation is to study the effectiveness of RQE for Medical Question Answering by comparing the answers retrieved by the hybrid entailment-based approach, the IR method and the other QA systems participating in the medical task at the TREC 2017 LiveQA challenge (LiveQA-Med).

Evaluation Method
We developed an interface to perform the manual evaluation of the retrieved answers. Figure 5 presents the evaluation interface showing, for each test question, the top-10 answers of the evaluated QA method and the reference answer(s) used by LiveQA assessors to help judge the answers retrieved by the participating systems.
We used the test questions of the medical task at TREC-2017 LiveQA [7] (available at https://github.com/abachaa/LiveQA_MedicalTask_TREC2017). These questions are randomly selected from the consumer health questions that the NLM receives daily from all over the world. The test questions cover different medical entities and have a wide list of question types such as Comparison, Diagnosis, Ingredient, Side effects and Tapering.
For a relevant comparison, we used the same judgment scores as the LiveQA Track:
• Correct and Complete Answer (4)
• Correct but Incomplete (3)
• Incorrect but Related (2)
• Incorrect (1)
We evaluated the answers returned by the IR-based method and the hybrid QA method (IR+RQE) according to the same reference answers used in LiveQA-Med. The answers were anonymized (the method names were blinded) and presented to 3 assessors: a medical doctor (Assessor A), a medical librarian (B) and a researcher in medical informatics (C). None of the assessors participated in the development of the QA methods. Assessors B and C evaluated 1,000 answers retrieved by each of the methods (IR and IR+RQE). Assessor A evaluated 2,000 answers from both methods.
Table 5 presents the inter-annotator agreement (IAA) through the F1 score computed by considering one of the assessors as reference. In the first evaluation, we computed the True Positives (TP) and False Positives (FP) over all ratings, and from them the Precision and F1 score. As there are no negative labels (only true or false positives for each category), Recall is 100%. We also computed a partial IAA by grouping the "Correct and Complete Answer" and "Correct but Incomplete" ratings (as Correct), and the "Incorrect but Related" and "Incorrect" ratings (as Incorrect). The average agreement on distinguishing the Correct and Incorrect answers is a 94.33% F1 score. Therefore, we used the evaluations performed by assessor A for both methods. The official results of the TREC LiveQA track relied on one assessor per question as well.
Table 5: Inter-Annotator Agreement (IAA) over all ratings in the manual evaluation of the retrieved answers. Partial IAA over two ratings "Correct" and "Incorrect".
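One possible reading of this agreement computation is sketched below. It assumes that a rating counts as a true positive when it matches the reference assessor's rating and as a false positive otherwise, with Recall fixed at 100% as stated above. The function name `iaa_f1` and the example ratings are hypothetical and are not taken from the evaluation data.

```python
def iaa_f1(reference, other, collapse=False):
    """F1 agreement of one assessor against a reference assessor.

    reference, other: lists of 1-4 ratings for the same answers.
    collapse: if True, group ratings 3-4 as Correct and 1-2 as Incorrect
              (the partial IAA).
    """
    if len(reference) != len(other):
        raise ValueError("both assessors must rate the same answers")

    def label(rating):
        return rating >= 3 if collapse else rating

    tp = sum(label(r) == label(o) for r, o in zip(reference, other))
    fp = len(reference) - tp
    precision = tp / (tp + fp)
    recall = 1.0  # no negative labels, so Recall is 100%
    return 2 * precision * recall / (precision + recall)


# Hypothetical ratings for eight answers (illustrative only).
ref_ratings = [4, 3, 2, 1, 4, 2, 3, 1]
other_ratings = [4, 4, 2, 1, 3, 2, 3, 2]
full_iaa = iaa_f1(ref_ratings, other_ratings)
partial_iaa = iaa_f1(ref_ratings, other_ratings, collapse=True)
```

With Recall fixed at 1, the F1 score reduces to 2P / (1 + P), so it is always at least as high as the raw fraction of matching ratings.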

Evaluation of the first retrieved answer
We computed the measures used by TREC LiveQA challenges [1,7] to evaluate the first retrieved answer for each test question:
• avgScore(0-3): the average score over all questions, after converting the 1-4 grades to 0-3 scores. This is the main score used to rank LiveQA runs.
• succ@i+: the number of questions with score i or above (i∈{2..4}) divided by the total number of questions.
• prec@i+: the number of questions with score i or above (i∈{2..4}) divided by the number of questions answered by the system.
Table 6 presents the average scores, success and precision results. The hybrid IR+RQE QA system achieved better results than the IR-based system, with an average score of 0.827. It also achieved a higher score than the best results achieved in the medical challenge at LiveQA'17. We did not evaluate the RQE system alone, as applying RQE to the full collection for each user question is not feasible for a real-time system because of the extended execution time.
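The three measures can be computed as in the following sketch. The function name, the use of `None` for unanswered questions and the example grades are assumptions; unanswered questions are assumed to contribute a score of 0 to avgScore.

```python
def first_answer_metrics(grades, levels=(2, 3, 4)):
    """Compute avgScore(0-3), succ@i+ and prec@i+ for first answers.

    grades: one entry per test question, either the 1-4 grade of the
            first retrieved answer or None if the question was not
            answered (assumed to count as 0 in avgScore).
    """
    total = len(grades)
    answered = [g for g in grades if g is not None]
    # Convert 1-4 grades to 0-3 scores and average over all questions.
    avg_score = sum(g - 1 for g in answered) / total
    succ = {i: sum(g >= i for g in answered) / total for i in levels}
    prec = {
        i: (sum(g >= i for g in answered) / len(answered)) if answered else 0.0
        for i in levels
    }
    return avg_score, succ, prec


# Hypothetical grades for five questions, one of them unanswered.
avg_score, succ, prec = first_answer_metrics([4, 3, 2, 1, None])
```

The only difference between succ@i+ and prec@i+ is the denominator, so the two coincide for a system that answers every test question.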

Evaluation of the top ten answers
In this evaluation, we used Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) which are commonly used in QA to evaluate the top-10 answers for each question. We consider answers rated as "Correct and Complete Answer" or "Correct but Incomplete" as correct answers, as the test questions contain multiple subquestions while each answer in our QA collection can cover only one subquestion.
MAP is the mean of the Average Precision (AvgP) scores over all questions.
$$MAP = \frac{1}{Q} \sum_{i=1}^{Q} AvgP_i$$
• $Q$ is the number of questions. $AvgP_i$ is the AvgP of the $i$-th question.
$$AvgP = \frac{1}{K} \sum_{n=1}^{K} \frac{n}{rank_n}$$
• $K$ is the number of correct answers. $rank_n$ is the rank of the $n$-th correct answer.
MRR is the average of the reciprocal ranks for each question. The reciprocal rank of a question is the multiplicative inverse of the rank of the first correct answer.
$$MRR = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{rank_i}$$
• $Q$ is the number of questions. $rank_i$ is the rank of the first correct answer for the $i$-th question.
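As an illustration of these two measures, the sketch below computes MAP and MRR from the ranks of the correct answers in each top-10 list. The function names and the example ranks are hypothetical, and a question with no correct answer in its list is assumed to contribute 0 to both measures.

```python
def average_precision(correct_ranks):
    """AvgP for one question from the 1-based ranks of its correct answers."""
    if not correct_ranks:
        return 0.0  # assumption: no correct answer retrieved counts as 0
    ranks = sorted(correct_ranks)
    # The n-th correct answer found at rank r contributes precision n / r.
    return sum(n / r for n, r in enumerate(ranks, start=1)) / len(ranks)


def mean_average_precision(ranks_per_question):
    """MAP: mean of the AvgP scores over all questions."""
    return sum(average_precision(r) for r in ranks_per_question) / len(
        ranks_per_question
    )


def mean_reciprocal_rank(ranks_per_question):
    """MRR: mean of 1 / (rank of the first correct answer)."""
    return sum(
        (1.0 / min(r)) if r else 0.0 for r in ranks_per_question
    ) / len(ranks_per_question)


# Hypothetical ranks of correct answers for three questions.
example_ranks = [[1, 3], [2], []]
map_score = mean_average_precision(example_ranks)
mrr_score = mean_reciprocal_rank(example_ranks)
```

For the first example question, the correct answers at ranks 1 and 3 give AvgP = (1/1 + 2/3) / 2, following the AvgP definition above.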

Discussion of entailment-based QA for the medical domain
In our evaluation, we followed the LiveQA guidelines with the highest possible rigor. In particular, we consulted with NIST assessors who provided us with the paraphrases of the test questions that they used to judge the answers. Our IAA on the answer ratings was also high compared to related tasks, with an 88.5% F1 agreement on the exact four categories and a 94.3% agreement when reducing the categories to two: "Correct" and "Incorrect" answers. Our results show that RQE improves the overall performance and exceeds the best results in the medical LiveQA'17 challenge by 29.8%. This performance improvement is particularly interesting as: (a) Our answer source has only 47K question-answer pairs, whereas LiveQA participating systems relied on much larger collections, including the World Wide Web. (b) Our system answered one subquestion at most, whereas many LiveQA test questions had several subquestions. The latter observation, (b), makes the hybrid IR+RQE approach even more promising, as it leaves considerable room for improving answer completeness.
The former observation, (a), provides another interesting insight: restricting the answer source to only reliable collections can actually improve the QA performance without losing coverage (i.e., our QA approach provided at least one answer to each test question and obtained the best relevance score).
In another observation, the assessors reported that many of the returned answers had a correct question type but a wrong focus, which indicates that including a focus recognition module to filter such wrong answers can further improve the QA performance in terms of precision. Another aspect that was reported is the repetition of the same (or similar) answer from different websites, which could be addressed by improving answer selection with inter-answer comparisons and removal of near-duplicates. Also, half of the LiveQA test questions are about drugs, whereas only two of our 12 sub-collections are specialized in drugs. Accordingly, the assessors noticed that the performance of the QA systems was better on questions about diseases than on questions about drugs, which suggests a need for extending our medical QA collection with more information about drugs and associated question types.
We also looked closely at the private websites used by the LiveQA-Med annotators to provide some of the reference answers for the test questions. For instance, the ConsumerLab website was useful to answer a question about the ingredients of a Drug (COENZYME Q10). Similarly, the eHealthMe website was used to answer a test question asking about interactions between two drugs (Phentermine and Dicyclomine) when no information was found in DailyMed. eHealthMe provides healthcare big data analysis and private research and studies including self-reported adverse drug effects by patients.
However, the question remains as to the extent to which such big data and other private websites could be used to automatically answer medical questions if information is otherwise unavailable. Unlike medical professionals, patients do not necessarily have the knowledge and tools to validate such information. An alternative approach could be to put limitations on medical QA systems in terms of the questions that can be answered (e.g. "What is my diagnosis for such symptoms") and build classifiers to detect such questions and warn the users about the dangers of looking for their answers online.
More generally, medical QA systems should follow some strict guidelines regarding the goal and background knowledge and resources of each system in order to protect the consumers from misleading or harmful information. Such guidelines could be based (i) on the source of the information such as health and medical information websites sponsored by the U.S. government, not-for-profit health or medical organizations, and medical university centers, or (ii) on conventions such as the code of conduct of the HON Foundation (HONcode) that addresses the reliability and usefulness of medical information on the Internet.
Our experiments show that limiting the number of answer sources with such guidelines is not only feasible, but it could also enhance the performance of the QA system from an information retrieval perspective.

Conclusion
In this paper, we carried out an empirical study of machine learning and deep learning methods for Recognizing Question Entailment in the medical domain using several datasets. We developed an RQE-based QA system to answer new medical questions using existing question-answer pairs. We built and shared a collection of 47K medical question-answer pairs 24. Our QA approach outperformed the best results on TREC-2017 LiveQA medical test questions. The proposed approach can be applied and adapted to open-domain as well as specific-domain QA. Deep learning models achieved interesting results on open-domain and clinical datasets, but obtained a lower performance on consumer health questions. We will continue investigating other network architectures, including transfer learning, as well as the creation of a large collection of consumer health questions for training to improve the performance of DL models. Future work also includes exploring the integration of a Question Focus Recognition module to enhance candidate question retrieval, and expanding our question-answer collection.

Acknowledgements
We thank Halil Kilicoglu (NLM/NIH) for his help with the crawling and the manual evaluation and Sonya E. Shooshan (NLM/NIH) for her help with the judgment of the retrieved answers. We also thank Ellen Voorhees (NIST) for her valuable support with the TREC LiveQA evaluation.

Annex A: Case Study
We consider the case of question number 36. This question was answered by 5 participating runs (vs. 8 runs for other questions), and all submitted answers were wrong (scores of 1 or 2). However, our IR-based QA system retrieved one excellent answer (score 4) and our hybrid IR+RQE system provided 3 excellent answers.
• 36 1 Congenital Diaphragmatic Hernia (CDH) is a congenital abnormality that is often life-threatening because it can limit the growth of the lungs. The site of herniation is most commonly found on the left side in the hole originally designed for the entry of alimentary canal. Herniation is sometimes found near sternum. Contents * Defects It involves three major defects. * A failure of the diaphragm to completely close during development. * Herniation of the abdominal contents into the chest * Pulmonary Hypoplasia Early diagnosis and presentation This condition can often be diagnosed before birth and fetal intervention can sometimes help, depending on the severity of the condition. [1] Infants born with diaphragmatic hernia experience respiratory failure due to both pulmonary hypertension and pulmonary hypoplasia. The first condition is a restriction of blood flow through the lungs thought to be caused by defects in the lung. Pulmonary hypoplasia or decreased lung volume is directly related

4. 36 ECNU-EastChinaNormalUniversity 1
• 36 1 The act of facial motion starts in the brain and travels through the facial nerves to the muscles in the face. These muscles then contract in response to a stimulus. Inside the skull, the facial nerve is a single nerve. Congenital facial palsy is a condition present at birth. Moebius syndrome is a congenital condition. In most cases the exact cause of congenital palsy is uncertain.

36 ECNU_ICA-EastChinaNormalUniversity 2
• 36 2 CHERUBS is a non-profit organization created to help families affected by Congenital Diaphragmatic Hernia. If you visit the site you will learn all about CDH, read stories of patients (children and adults), see photos and more. Membership and information is free. http://www.cdhsupport.org My name is Dawn and I run CHERUBS. My son was born with CDH and I started CHERUBS 16 years ago. Now, we have over 3600 families in our group so I know there will be someone who truly understands what you're going through. If you'd like to talk or have questions, you can reach me at dawn.williamson@cdhsupport.org

B) Our IR-based QA System:
• Answer S1-1 (from MedlinePlus 25): What causes Diaphragmatic hernia ? (Also called: Hernia -diaphragmatic; Congenital hernia of the diaphragm (CDH)) ANSWER: A diaphragmatic hernia is a rare defect. It occurs while the baby is developing in the womb. Because the diaphragm is not fully developed, organs, such as the stomach, small intestine, spleen, part of the liver, and the kidney, may take up part of the chest cavity. CDH most often involves only one side of the diaphragm. It is more common on the left side. Often, the lung tissue and blood vessels in the area do not develop normally either. It is not clear if the diaphragmatic hernia causes the underdeveloped lung tissue and blood vessels, or the other way around. 40% of babies with this condition have other problems as well. Having a parent or sibling with the condition increases the risk.

C) Our IR+RQE QA System:
1. Answer S2-1 (from GHR 26 ): What are the genetic changes related to congenital diaphragmatic hernia ? (Also called: congenital diaphragmatic defect) ANSWER: Congenital diaphragmatic hernia has many different causes. In 10 to 15 percent of affected individuals, the condition appears as a feature of a disorder that affects many body systems, called a syndrome. Donnai-Barrow syndrome, Fryns syndrome, and Pallister-Killian mosaic syndrome are among several syndromes in which congenital diaphragmatic hernia may occur. Some of these syndromes are caused by changes in single genes, and others are caused by chromosomal abnormalities that affect several genes. About 25 percent of individuals with congenital diaphragmatic hernia that is not associated with a known syndrome also have abnormalities of one or more major body systems. Affected body systems can include the heart, brain, skeleton, intestines, genitals, kidneys, or eyes. In these individuals, the multiple abnormalities likely result from a common underlying disruption in development that affects more than one area of the body, but the specific mechanism responsible for this disruption is not clear. Approximately 50 to 60 percent of congenital diaphragmatic hernia cases are isolated, which means that affected individuals have no other major malformations. More than 80 percent of individuals with congenital diaphragmatic hernia have no known genetic syndrome or chromosomal abnormality. In these cases, the cause of the condition is unknown. Researchers are studying changes in several genes involved in the development of the diaphragm as possible causes of congenital diaphragmatic hernia. Some of these genes are transcription factors, which provide instructions for making proteins that help control the activity of particular genes (gene expression). Others provide instructions for making proteins involved in cell structure or the movement (migration) of cells in the embryo. 
Environmental factors that influence development before birth may also increase the risk of congenital diaphragmatic hernia, but these environmental factors have not been identified.
2. Answer S2-2 (from GHR): Is congenital diaphragmatic hernia inherited ? (Also called: congenital diaphragmatic defect): Isolated congenital diaphragmatic hernia is rarely inherited. In almost all cases, there is only one affected individual in a family. When congenital diaphragmatic hernia occurs as a feature of a genetic syndrome or chromosomal abnormality, it may cluster in families according to the inheritance pattern for that condition.