Longterm Wiki

Pan et al. (2022)

paper

Authors

Pan Lu·Liang Qiu·Kai-Wei Chang·Ying Nian Wu·Song-Chun Zhu·Tanmay Rajpurohit·Peter Clark·Ashwin Kalyan

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

This paper addresses mathematical reasoning in language models across heterogeneous data formats, contributing to understanding of LLM capabilities and limitations relevant to AI safety evaluation and alignment research.

Paper Details

Citations
0 (51 influential)
Year
2022

Metadata

arXiv preprint · primary source

Abstract

Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if the models can handle more complex problems that involve math reasoning over heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. Each question in TabMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions: free-text and multi-choice, and each problem is annotated with gold solutions to reveal the multi-step reasoning process. We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. The unstable issue is more severe when handling complex problems like TabMWP. To mitigate this, we further propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data and then constructs the corresponding prompt for the test example. Experimental results show that our method outperforms the best baseline by 5.31% on the accuracy metric and reduces the prediction variance significantly compared to random selection, which verifies its effectiveness in selecting in-context examples.

Summary

This paper introduces TabMWP, a dataset of 38,431 math word problems that require reasoning over both textual and tabular data, filling a gap in evaluating language models on heterogeneous information. The authors show that few-shot GPT-3 is unstable on these problems because its performance is highly sensitive to which in-context examples appear in the prompt. To address this, they propose PromptPG, which uses policy gradient to learn to select effective in-context examples from a small training set, improving accuracy by 5.31% over the best baseline and significantly reducing prediction variance.
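The core mechanism, training an example-selection policy with policy gradient, can be sketched with a toy REINFORCE loop. Everything below is illustrative rather than the paper's implementation: the fixed embeddings, the linear scoring policy, and the reward function standing in for "the LLM answered correctly with this example in the prompt" are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each candidate in-context example and the test problem
# are represented by fixed embedding vectors (e.g. from a frozen encoder).
n_candidates, dim = 20, 8
candidate_emb = rng.normal(size=(n_candidates, dim))
problem_emb = rng.normal(size=dim)

# Toy stand-in reward: 1 if the selected candidate's embedding aligns with
# the problem, 0 otherwise (in PromptPG the reward comes from whether the
# downstream model answers correctly).
good = (candidate_emb @ problem_emb) > 0

def reward_fn(a):
    return 1.0 if good[a] else 0.0

W = np.zeros((dim, dim))  # parameters of a linear scoring policy

def policy_probs():
    # score_i = c_i . (W x), then softmax over candidates
    scores = candidate_emb @ (W @ problem_emb)
    scores -= scores.max()  # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def reinforce_step(lr=0.5):
    """Sample a candidate, observe reward, return the REINFORCE update
    lr * r * d log pi(a)/dW."""
    p = policy_probs()
    a = rng.choice(n_candidates, p=p)
    r = reward_fn(a)
    # For this linear model: d log pi(a)/dW = outer(c_a - E_p[c], x)
    grad_W = np.outer(candidate_emb[a] - p @ candidate_emb, problem_emb)
    return lr * r * grad_W

for _ in range(500):
    W += reinforce_step()

p = policy_probs()
mass_on_good = p[good].sum()
print(f"probability mass on rewarded candidates: {mass_on_good:.2f}")
```

After training, the policy concentrates its probability mass on the candidates that earn reward, which mirrors how PromptPG learns to prefer in-context examples that lead to correct answers instead of picking them at random.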

Cited by 1 page

Page | Type | Quality
Goal Misgeneralization Probability Model | Analysis | 61.0

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
# Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning

Pan Lu¹˒³, Liang Qiu¹, Kai-Wei Chang¹, Ying Nian Wu¹, Song-Chun Zhu¹,

Tanmay Rajpurohit², Peter Clark³, Ashwin Kalyan³

¹University of California, Los Angeles · ²Georgia Institute of Technology · ³Allen Institute for AI

###### Abstract

Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if the models can handle more complex problems that involve math reasoning over heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. Each question in TabMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions: free-text and multi-choice, and each problem is annotated with gold solutions to reveal the multi-step reasoning process. We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. The unstable issue is more severe when handling complex problems like TabMWP. To mitigate this, we further propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data and then constructs the corresponding prompt for the test example. Experimental results show that our method outperforms the best baseline by 5.31% on the accuracy metric and reduces the prediction variance significantly compared to random selection, which verifies its effectiveness in selecting in-context examples.¹

¹ The data and code are available at https://promptpg.github.io.

Work was partially done while Pan Lu was an intern at Allen Institute for AI (AI2).

## 1 Introduction

Developing machines equipped with mathematical reasoning capabilities is one of the long-standing goals of artificial intelligence. Solving math word problems (MWPs) is a well-defined task for diagnosing the ability of intelligent systems to perform numerical reasoning and problem-solving as humans do. A surge of datasets has been proposed to facilitate research in this domain (Upadhyay & Chang, 2017; Amini et al., 2019; Miao et al., 2020; Cobbe et al., 2021

... (truncated, 98 KB total)