Research by Valmeekam et al. (2023)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Empirical study evaluating large language models' planning and reasoning capabilities, examining both their effectiveness in autonomous planning tasks and their potential as heuristic sources for external planners. Relevant to understanding LLM reliability and safety constraints.
Paper Details
Metadata
Abstract
Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) the effectiveness of LLMs in generating plans autonomously in commonsense planning tasks and (2) the potential of LLMs in LLM-Modulo settings where they act as a source of heuristic guidance for external planners and verifiers. We conduct a systematic study by generating a suite of instances on domains similar to the ones employed in the International Planning Competition and evaluate LLMs in two distinct modes: autonomous and heuristic. Our findings reveal that LLMs' ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of ~12% across the domains. However, the results in the LLM-Modulo setting show more promise. In the LLM-Modulo setting, we demonstrate that LLM-generated plans can improve the search process for underlying sound planners and additionally show that external verifiers can help provide feedback on the generated plans and back-prompt the LLM for better plan generation.
Summary
Valmeekam et al. (2023) investigate the planning capabilities of large language models (LLMs) by evaluating their performance on commonsense planning tasks in two settings: autonomous plan generation and LLM-Modulo (where LLMs provide heuristic guidance to external planners). The study finds that LLMs' autonomous planning abilities are severely limited, with the best model (GPT-4) achieving only a ~12% average success rate across domains. The LLM-Modulo approach shows more promise: LLM-generated plans can improve the search process of sound external planners, and external verifiers can provide feedback that back-prompts the LLM toward better plan generation.
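The verifier-driven back-prompting loop in the LLM-Modulo setting can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `llm_generate` and `verify` are stub placeholders standing in for a real LLM call and a sound plan verifier (such as VAL), and the Blocksworld-style actions are invented for the example.

```python
# Hypothetical sketch of the LLM-Modulo back-prompting loop: an external
# (sound) verifier checks each candidate plan, and its error feedback is
# folded into the prompt for the next generation attempt.

def verify(plan):
    """Stub verifier: the plan is valid iff it achieves the (hard-coded) goal."""
    if plan and plan[-1] == "stack(A, B)":
        return True, ""
    return False, "goal not satisfied: block A is not on block B"

def llm_generate(prompt, attempt):
    """Stub LLM: returns a flawed plan first, a corrected one after feedback."""
    if attempt == 0:
        return ["pickup(A)"]                    # incomplete plan
    return ["pickup(A)", "stack(A, B)"]         # plan revised after feedback

def llm_modulo_plan(task, max_rounds=3):
    """Generate-verify-backprompt loop; returns (plan, rounds) or (None, rounds)."""
    prompt = task
    for attempt in range(max_rounds):
        plan = llm_generate(prompt, attempt)
        ok, feedback = verify(plan)
        if ok:
            return plan, attempt + 1
        # Back-prompt: append the verifier's feedback to the next prompt.
        prompt = f"{task}\nPrevious plan failed: {feedback}"
    return None, max_rounds

plan, rounds = llm_modulo_plan("Put block A on block B.")
```

The key design point is that soundness comes from the verifier, not the LLM: the loop terminates with a plan only when the external checker accepts it.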
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Reasoning and Planning | Capability | 65.0 |