[2009.13081] What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
MedQA is a large-scale multilingual medical question-answering dataset drawn from professional medical board exams. It can be used to evaluate AI models' medical reasoning capabilities and their readiness for deployment in healthcare contexts, making it relevant for assessing AI safety in high-stakes domains.
Paper Details
Metadata
Abstract
Open domain question answering (OpenQA) tasks have been recently attracting more and more attention from the natural language processing (NLP) community. In this work, we present the first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, collected from the professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. We implement both rule-based and popular neural methods by sequentially combining a document retriever and a machine comprehension model. Through experiments, we find that even the current best method can only achieve 36.7%, 42.0%, and 70.1% of test accuracy on the English, traditional Chinese, and simplified Chinese questions, respectively. We expect MedQA to present great challenges to existing OpenQA systems and hope that it can serve as a platform to promote much stronger OpenQA models from the NLP community in the future.
Summary
MedQA is the first free-form multiple-choice open domain question answering dataset for medical problems, sourced from professional medical board exams across three languages: English, simplified Chinese, and traditional Chinese, containing 12,723, 34,251, and 14,123 questions respectively. The authors implement both rule-based and neural methods combining document retrieval and machine comprehension, finding that current best approaches achieve only 36.7%, 42.0%, and 70.1% test accuracy on English, traditional Chinese, and simplified Chinese questions respectively, demonstrating significant challenges for existing OpenQA systems.
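The baselines described above chain a document retriever with a machine comprehension model. As a rough illustration of that retrieve-then-read pattern (this is not the authors' implementation; the toy corpus, the overlap-based "reader", and the `answer` helper are invented for illustration), a minimal sketch:

```python
# Minimal retrieve-then-read sketch for multiple-choice OpenQA.
# Retriever: TF-IDF cosine similarity over a small passage corpus.
# Reader: picks the option with the most token overlap with the evidence.
# All names and the toy data below are illustrative, not from the paper.
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of documents."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency per term
    n = len(docs)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] * math.log((n + 1) / (df[t] + 1)) for t in tf})
    return vecs


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def answer(question, options, corpus, k=1):
    """Retrieve the top-k passages for the question, then score each option."""
    vecs = tfidf_vectors(corpus + [question])
    qvec = vecs[-1]
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine(qvec, vecs[i]), reverse=True)
    evidence = " ".join(corpus[i] for i in ranked[:k])
    ev_tokens = set(tokenize(evidence))
    return max(options, key=lambda o: len(set(tokenize(o)) & ev_tokens))


corpus = [
    "fever productive cough and consolidation on chest x ray suggest pneumonia",
    "crushing chest pain radiating to the left arm suggests myocardial infarction",
]
question = "a patient presents with fever productive cough and consolidation on chest x ray"
print(answer(question, ["pneumonia", "myocardial infarction"], corpus))
# → pneumonia
```

The paper's neural baselines replace both stages with learned components (a trained retriever and a BERT-style comprehension model), but the overall retrieve-then-read composition is the same.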
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Capability Threshold Model | Analysis | 72.0 |
Cached Content Preview
# What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
Di Jin,¹ Eileen Pan,¹ Nassim Oufattole,¹ Wei-Hung Weng,¹ Hanyi Fang,² Peter Szolovits¹
###### Abstract
Open domain question answering (OpenQA) tasks have been recently attracting more and more attention from the natural language processing (NLP) community. In this work, we present the first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, collected from the professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. We implement both rule-based and popular neural methods by sequentially combining a document retriever and a machine comprehension model. Through experiments, we find that even the current best method can only achieve 36.7%, 42.0%, and 70.1% of test accuracy on the English, traditional Chinese, and simplified Chinese questions, respectively. We expect MedQA to present great challenges to existing OpenQA systems and hope that it can serve as a platform to promote much stronger OpenQA models from the NLP community in the future.¹ (Footnote 1: Data and baselines source code are available at: https://github.com/jind11/MedQA)
## 1 Introduction
Question answering (QA) is a fundamental task in Natural Language Processing (NLP), which requires models to answer a particular question. When given the context text associated with the question, language pre-training based models such as BERT (Devlin et al. [2019](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib15 "")), RoBERTa (Liu et al. [2019](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib26 "")), and ALBERT (Lan et al. [2019](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib24 "")) have achieved nearly saturated performance on most of the popular datasets (Rajpurkar et al. [2016](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib33 ""); Rajpurkar, Jia, and Liang [2018](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib32 ""); Lai et al. [2017](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib23 ""); Yang et al. [2018](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib40 ""); Gao et al. [2020](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib18 "")). However, real-world scenarios for QA are usually much more complex and one may not have a body of text already labeled as containing the
answer to the question. In this scenario, models are required to find and extract information relevant to the question from large-scale text sources such as a search engine (Dunn et al. [2017](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib17 "")) or Wikipedia (Chen et al. [2017](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib8 "")). This type of task is generally called open-domain question answering
(OpenQA), which has recently attracted lots of attention from the NLP community (Clark and Gardner [2018](h
... (truncated, 73 KB total)