Research by Valmeekam et al. (2023)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Empirical study evaluating large language models' planning and reasoning capabilities, examining both their effectiveness in autonomous planning tasks and their potential as heuristic sources for external planners. Relevant to understanding LLM reliability and safety constraints.
Paper Details
Metadata
Abstract
Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) the effectiveness of LLMs in generating plans autonomously in commonsense planning tasks and (2) the potential of LLMs in LLM-Modulo settings where they act as a source of heuristic guidance for external planners and verifiers. We conduct a systematic study by generating a suite of instances on domains similar to the ones employed in the International Planning Competition and evaluate LLMs in two distinct modes: autonomous and heuristic. Our findings reveal that LLMs' ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of ~12% across the domains. However, the results in the LLM-Modulo setting show more promise. In the LLM-Modulo setting, we demonstrate that LLM-generated plans can improve the search process for underlying sound planners and additionally show that external verifiers can help provide feedback on the generated plans and back-prompt the LLM for better plan generation.
Summary
Valmeekam et al. (2023) investigate the planning capabilities of large language models (LLMs) by evaluating their performance on commonsense planning tasks in two settings: autonomous plan generation and LLM-Modulo (where LLMs provide heuristic guidance to external planners). The study finds that LLMs' autonomous planning abilities are severely limited, with the best model (GPT-4) achieving only a ~12% average success rate across domains. The LLM-Modulo approach shows more promise: LLM-generated plans can improve the search process of sound external planners, and external verifiers can provide feedback that back-prompts the LLM toward better plan generation.
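The verifier-driven back-prompting loop in the LLM-Modulo setting can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `llm_generate` and `verify` are stub placeholders standing in for a real LLM call and a sound plan verifier (such as VAL), and the Blocksworld-style actions are invented for the example.

```python
# Hypothetical sketch of the LLM-Modulo back-prompting loop: an external
# (sound) verifier checks each candidate plan, and its error feedback is
# folded into the prompt for the next generation attempt.

def verify(plan):
    """Stub verifier: the plan is valid iff it achieves the (hard-coded) goal."""
    if plan and plan[-1] == "stack(A, B)":
        return True, ""
    return False, "goal not satisfied: block A is not on block B"

def llm_generate(prompt, attempt):
    """Stub LLM: returns a flawed plan first, a corrected one after feedback."""
    if attempt == 0:
        return ["pickup(A)"]                    # incomplete plan
    return ["pickup(A)", "stack(A, B)"]         # plan revised after feedback

def llm_modulo_plan(task, max_rounds=3):
    """Generate-verify-backprompt loop; returns (plan, rounds) or (None, rounds)."""
    prompt = task
    for attempt in range(max_rounds):
        plan = llm_generate(prompt, attempt)
        ok, feedback = verify(plan)
        if ok:
            return plan, attempt + 1
        # Back-prompt: append the verifier's feedback to the next prompt.
        prompt = f"{task}\nPrevious plan failed: {feedback}"
    return None, max_rounds

plan, rounds = llm_modulo_plan("Put block A on block B.")
```

The key design point is that soundness comes from the verifier, not the LLM: the loop terminates with a plan only when the external checker accepts it.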
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Reasoning and Planning | Capability | 65.0 |