[2009.13081] What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
MedQA is a large-scale multilingual medical question-answering dataset drawn from professional medical board exams. It can be used to evaluate AI models' medical reasoning capabilities and their readiness for deployment in healthcare contexts, making it relevant for assessing AI safety in high-stakes domains.
Paper Details
Metadata
Abstract
Open domain question answering (OpenQA) tasks have been recently attracting more and more attention from the natural language processing (NLP) community. In this work, we present the first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, collected from the professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. We implement both rule-based and popular neural methods by sequentially combining a document retriever and a machine comprehension model. Through experiments, we find that even the current best method can only achieve 36.7%, 42.0%, and 70.1% of test accuracy on the English, traditional Chinese, and simplified Chinese questions, respectively. We expect MedQA to present great challenges to existing OpenQA systems and hope that it can serve as a platform to promote much stronger OpenQA models from the NLP community in the future.
Summary
MedQA is the first free-form multiple-choice open domain question answering dataset for medical problems, sourced from professional medical board exams across three languages: English, simplified Chinese, and traditional Chinese, containing 12,723, 34,251, and 14,123 questions respectively. The authors implement both rule-based and neural methods combining document retrieval and machine comprehension, finding that current best approaches achieve only 36.7%, 42.0%, and 70.1% test accuracy on English, traditional Chinese, and simplified Chinese questions respectively, demonstrating significant challenges for existing OpenQA systems.
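The baselines described above chain a document retriever with a machine comprehension model. As a rough illustration of that retrieve-then-read pattern (this is not the authors' implementation; the toy corpus, the overlap-based "reader", and the `answer` helper are invented for illustration), a minimal sketch:

```python
# Minimal retrieve-then-read sketch for multiple-choice OpenQA.
# Retriever: TF-IDF cosine similarity over a small passage corpus.
# Reader: picks the option with the most token overlap with the evidence.
# All names and the toy data below are illustrative, not from the paper.
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of documents."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency per term
    n = len(docs)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] * math.log((n + 1) / (df[t] + 1)) for t in tf})
    return vecs


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def answer(question, options, corpus, k=1):
    """Retrieve the top-k passages for the question, then score each option."""
    vecs = tfidf_vectors(corpus + [question])
    qvec = vecs[-1]
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine(qvec, vecs[i]), reverse=True)
    evidence = " ".join(corpus[i] for i in ranked[:k])
    ev_tokens = set(tokenize(evidence))
    return max(options, key=lambda o: len(set(tokenize(o)) & ev_tokens))


corpus = [
    "fever productive cough and consolidation on chest x ray suggest pneumonia",
    "crushing chest pain radiating to the left arm suggests myocardial infarction",
]
question = "a patient presents with fever productive cough and consolidation on chest x ray"
print(answer(question, ["pneumonia", "myocardial infarction"], corpus))
# → pneumonia
```

The paper's neural baselines replace both stages with learned components (a trained retriever and a BERT-style comprehension model), but the overall retrieve-then-read composition is the same.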
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI Capability Threshold Model | Analysis | 72.0 |
Cached Content Preview
# What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams
Di Jin,¹ Eileen Pan,¹ Nassim Oufattole,¹ Wei-Hung Weng,¹ Hanyi Fang,² Peter Szolovits¹
###### Abstract
Open domain question answering (OpenQA) tasks have been recently attracting more and more attention from the natural language processing (NLP) community. In this work, we present the first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, collected from the professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. We implement both rule-based and popular neural methods by sequentially combining a document retriever and a machine comprehension model. Through experiments, we find that even the current best method can only achieve 36.7%, 42.0%, and 70.1% of test accuracy on the English, traditional Chinese, and simplified Chinese questions, respectively. We expect MedQA to present great challenges to existing OpenQA systems and hope that it can serve as a platform to promote much stronger OpenQA models from the NLP community in the future.¹ (Footnote 1: Data and baselines source code are available at: https://github.com/jind11/MedQA)
## 1 Introduction
Question answering (QA) is a fundamental task in Natural Language Processing (NLP), which requires models to answer a particular question. When given the context text associated with the question, language pre-training based models such as BERT (Devlin et al. [2019](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib15 "")), RoBERTa (Liu et al. [2019](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib26 "")), and ALBERT (Lan et al. [2019](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib24 "")) have achieved nearly saturated performance on most of the popular datasets (Rajpurkar et al. [2016](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib33 ""); Rajpurkar, Jia, and Liang [2018](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib32 ""); Lai et al. [2017](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib23 ""); Yang et al. [2018](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib40 ""); Gao et al. [2020](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib18 "")). However, real-world scenarios for QA are usually much more complex and one may not have a body of text already labeled as containing the
answer to the question. In this scenario, models are required to find and extract information relevant to the question from large-scale text sources such as a search engine (Dunn et al. [2017](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib17 "")) or Wikipedia (Chen et al. [2017](https://ar5iv.labs.arxiv.org/html/2009.13081#bib.bib8 "")). This type of task is generally called open-domain question answering
(OpenQA), which has recently attracted lots of attention from the NLP community (Clark and Gardner [2018](h
... (truncated, 73 KB total)