Helpful to a Fault (2025).
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Note: the title 'Helpful to a Fault' appears mismatched with the MIR-Bench content; the resource may conflate two separate papers. Relevance to AI safety is indirect, primarily as a capabilities evaluation tool.
Paper Details
Metadata
Abstract
The ability to recognize patterns from examples and apply them to new ones is a primal ability for general intelligence, and is widely studied by psychology and AI researchers. Many benchmarks have been proposed to measure such ability for Large Language Models (LLMs); however, they focus on the few-shot (usually <10) setting and lack evaluation for aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs has brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations often focus on classification, and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from both worlds, we propose MIR-Bench, the first many-shot in-context reasoning benchmark for pattern recognition, which asks LLMs to predict outputs via input-output examples from underlying functions with diverse data formats. Based on MIR-Bench, we study many novel problems for many-shot in-context reasoning and acquire insightful findings on scaling effects, robustness, inductive vs. transductive reasoning, Retrieval-Augmented Generation (RAG), coding for inductive reasoning, cross-domain generalizability, etc.
Summary
MIR-Bench is a benchmark designed to evaluate LLMs' inductive reasoning capabilities in long-context, many-shot settings, testing their ability to induce rules from hundreds to thousands of input-output examples. It addresses gaps in existing benchmarks that focus on few-shot or simple classification tasks, and investigates robustness to erroneous examples and Chain-of-Thought effects.
Key Points
- Introduces MIR-Bench, a benchmark testing LLMs on inductive reasoning with hundreds to thousands of in-context examples across diverse data formats (a minimal prompt-construction sketch follows this list).
- Addresses limitations of prior benchmarks that focus on few-shot settings or simple classification, pushing toward many-shot complex rule induction.
- Studies robustness of LLMs to erroneous examples in long-context scenarios, revealing failure modes under noisy conditions.
- Investigates the effects of Chain-of-Thought prompting on inductive reasoning performance in many-shot settings.
- Provides insights into how LLMs integrate complex information across very long contexts, relevant to understanding model capabilities and limitations.
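To make the evaluation setup concrete, here is a minimal sketch of how a many-shot inductive-reasoning query of this kind could be assembled: the model sees many (input, output) pairs produced by a hidden function and must predict the output for a held-out input. The hidden function, prompt wording, shot count, and noise-injection scheme below are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch of a many-shot in-context inductive-reasoning query in the
# spirit of MIR-Bench. Everything here (rule, prompt format, noise scheme) is an
# illustrative assumption rather than the benchmark's actual construction.
import random

def hidden_rule(x: int) -> int:
    # Example underlying function the model must induce from the shots.
    return x * 2 + 1

def build_prompt(n_shots: int = 500, noise_rate: float = 0.0, seed: int = 0) -> tuple[str, int]:
    """Return a many-shot prompt and the ground-truth answer for the final query input."""
    rng = random.Random(seed)
    lines = ["Infer the rule from the examples and answer for the final input.\n"]
    for _ in range(n_shots):
        x = rng.randint(0, 10_000)
        y = hidden_rule(x)
        if rng.random() < noise_rate:  # optionally corrupt some shots to probe robustness
            y += rng.randint(1, 9)
        lines.append(f"Input: {x} -> Output: {y}")
    query = rng.randint(0, 10_000)
    lines.append(f"Input: {query} -> Output:")
    return "\n".join(lines), hidden_rule(query)

prompt, answer = build_prompt(n_shots=1000, noise_rate=0.05)
# `prompt` would be sent to an LLM; exact-matching its completion against `answer`
# gives a simple score for this single query.
```

The actual benchmark draws its underlying functions and data formats from a far more diverse pool; this sketch only illustrates the many-shot query structure and how erroneous shots could be injected.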
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Capability | 60.0 |
Cached Content Preview
[2502.09933] MIR-Bench: Benchmarking LLM’s Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning
[1] ByteDance Seed
[2] University of Illinois Urbana-Champaign
* Work done at ByteDance
† Corresponding authors
MIR-Bench: Benchmarking LLM’s Long-Context
Intelligence via Many-Shot In-Context Inductive Reasoning
Kai Yan
Zhan Ling
Kang Liu
Yifan Yang
Ting-Han Fan
Lingfeng Shen
Zhengyin Du
Jiecao Chen
kaiyan3@illinois.edu
jiecao.chen@bytedance.com
(March 5, 2025)
Abstract
Inductive Reasoning (IR), the ability to summarize rules from examples and apply them to new ones, has long been viewed as a primal ability for general intelligence and is widely studied by cognitive science and AI researchers. Many benchmarks have been proposed to measure such ability for Large Language Models (LLMs); however, they focus on the few-shot (usually <10) setting and lack evaluation for aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs has brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations are mostly focused on classification (a very limited aspect of IR), and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from both worlds, we propose MIR-Bench, the first many-shot in-context inductive reasoning benchmark, which asks LLMs to induce outputs via input-output examples from underlying functions with diverse data formats. Based on MIR-Bench, we study many novel problems for inductive reasoning and many-shot ICL, including robustness against erroneous shots and the effect of Chain-of-Thought (CoT), and acquire insightful findings.
Correspondence: Kai Yan at kaiyan3@illinois.edu, Jiecao Chen at jiecao.chen@bytedance.com
Project Page: https://github.com/KaiYan289/MIR-Bench
Dataset: https://huggingface.co/datasets/kaiyan289/MIR-Bench
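For readers who want to inspect the benchmark directly, below is a minimal sketch of loading the released dataset from the Hugging Face Hub. The repository id is taken from the dataset link above; the split and column names are not assumed here and are discovered at runtime instead of hard-coded.

```python
# Minimal sketch (assumed usage): load MIR-Bench from the Hugging Face Hub and
# inspect its structure. The repo id comes from the dataset link above.
from datasets import load_dataset

ds = load_dataset("kaiyan289/MIR-Bench")

print(ds)                          # available splits and their sizes
split_name, split = next(iter(ds.items()))
print(split_name, split.features)  # column names and types, read from the data itself
print(split[0])                    # one raw example record
```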
1 Introduction
The tremendous success of Large Language Models (LLMs) in recent years [51, 24, 25] has finally brought us to the extent where human-level Artificial General Intelligence (AGI) becomes seemingly within reach [25]. With such success, researchers have shifted their focus from syntax- and word-level traditional Natural Language Processing (NLP) tasks such as named entity recognition [47, 37], sentiment classification [63, 68] and translation [42, 65] onto those which measure abilities once considered unique to humans. Inductive Reasoning (IR) [20], which is the ability to summarize general high-level rules from existing examples, thus comes into attention [10]; instead of math or coding wh
... (truncated, 98 KB total)