Helpful to a Fault (2025).
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
Note: the title 'Helpful to a Fault' appears mismatched with the MIR-Bench content; the resource may conflate two separate papers. Relevance to AI safety is indirect, primarily as a capabilities evaluation tool.
Paper Details
Metadata
Abstract
The ability to recognize patterns from examples and apply them to new ones is a primal ability for general intelligence, and is widely studied by psychology and AI researchers. Many benchmarks have been proposed to measure such ability for Large Language Models (LLMs); however, they focus on the few-shot (usually <10) setting and lack evaluation for aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs has brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations often focus on classification, and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from both worlds, we propose MIR-Bench, the first many-shot in-context reasoning benchmark for pattern recognition, which asks LLMs to predict outputs via input-output examples from underlying functions with diverse data formats. Based on MIR-Bench, we study many novel problems for many-shot in-context reasoning and acquire insightful findings on scaling effects, robustness, inductive vs. transductive reasoning, Retrieval-Augmented Generation (RAG), coding for inductive reasoning, cross-domain generalizability, etc.
Summary
MIR-Bench is a benchmark designed to evaluate LLMs' inductive reasoning capabilities in long-context, many-shot settings, testing their ability to induce rules from hundreds to thousands of input-output examples. It addresses gaps in existing benchmarks that focus on few-shot or simple classification tasks, and investigates robustness to erroneous examples and Chain-of-Thought effects.
Key Points
- Introduces MIR-Bench, a benchmark testing LLMs on inductive reasoning with hundreds to thousands of in-context examples across diverse data formats (a minimal prompt-construction sketch follows this list).
- Addresses limitations of prior benchmarks that focus on few-shot settings or simple classification, pushing toward many-shot complex rule induction.
- Studies robustness of LLMs to erroneous examples in long-context scenarios, revealing failure modes under noisy conditions.
- Investigates the effects of Chain-of-Thought prompting on inductive reasoning performance in many-shot settings.
- Provides insights into how LLMs integrate complex information across very long contexts, relevant to understanding model capabilities and limitations.
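To make the evaluation setup concrete, here is a minimal sketch of how a many-shot inductive-reasoning query of this kind could be assembled: the model sees many (input, output) pairs produced by a hidden function and must predict the output for a held-out input. The hidden function, prompt wording, shot count, and noise-injection scheme below are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch of a many-shot in-context inductive-reasoning query in the
# spirit of MIR-Bench. Everything here (rule, prompt format, noise scheme) is an
# illustrative assumption rather than the benchmark's actual construction.
import random

def hidden_rule(x: int) -> int:
    # Example underlying function the model must induce from the shots.
    return x * 2 + 1

def build_prompt(n_shots: int = 500, noise_rate: float = 0.0, seed: int = 0) -> tuple[str, int]:
    """Return a many-shot prompt and the ground-truth answer for the final query input."""
    rng = random.Random(seed)
    lines = ["Infer the rule from the examples and answer for the final input.\n"]
    for _ in range(n_shots):
        x = rng.randint(0, 10_000)
        y = hidden_rule(x)
        if rng.random() < noise_rate:  # optionally corrupt some shots to probe robustness
            y += rng.randint(1, 9)
        lines.append(f"Input: {x} -> Output: {y}")
    query = rng.randint(0, 10_000)
    lines.append(f"Input: {query} -> Output:")
    return "\n".join(lines), hidden_rule(query)

prompt, answer = build_prompt(n_shots=1000, noise_rate=0.05)
# `prompt` would be sent to an LLM; exact-matching its completion against `answer`
# gives a simple score for this single query.
```

The actual benchmark draws its underlying functions and data formats from a far more diverse pool; this sketch only illustrates the many-shot query structure and how erroneous shots could be injected.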
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Capability | 60.0 |
Cached Content Preview
[2502.09933] MIR-Bench: Benchmarking LLM’s Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning
[1] ByteDance Seed
[2] University of Illinois Urbana-Champaign
* Work done at ByteDance
† Corresponding authors
MIR-Bench: Benchmarking LLM’s Long-Context
Intelligence via Many-Shot In-Context Inductive Reasoning
Kai Yan
Zhan Ling
Kang Liu
Yifan Yang
Ting-Han Fan
Lingfeng Shen
Zhengyin Du
Jiecao Chen
kaiyan3@illinois.edu
jiecao.chen@bytedance.com
(March 5, 2025)
Abstract
Inductive Reasoning (IR), the ability to summarize rules from examples and apply them to new ones, has long been viewed as a primal ability for general intelligence and is widely studied by cognitive science and AI researchers. Many benchmarks have been proposed to measure such ability for Large Language Models (LLMs); however, they focus on the few-shot (usually <10) setting and lack evaluation for aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs has brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations are mostly focused on classification (a very limited aspect of IR), and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from both worlds, we propose MIR-Bench, the first many-shot in-context inductive reasoning benchmark, which asks LLMs to induce outputs via input-output examples from underlying functions with diverse data formats. Based on MIR-Bench, we study many novel problems for inductive reasoning and many-shot ICL, including robustness against erroneous shots and the effect of Chain-of-Thought (CoT), and acquire insightful findings.
Correspondence: Kai Yan at kaiyan3@illinois.edu, Jiecao Chen at jiecao.chen@bytedance.com
Project Page: https://github.com/KaiYan289/MIR-Bench
Dataset: https://huggingface.co/datasets/kaiyan289/MIR-Bench
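For readers who want to inspect the benchmark directly, below is a minimal sketch of loading the released dataset from the Hugging Face Hub. The repository id is taken from the dataset link above; the split and column names are not assumed here and are discovered at runtime instead of hard-coded.

```python
# Minimal sketch (assumed usage): load MIR-Bench from the Hugging Face Hub and
# inspect its structure. The repo id comes from the dataset link above.
from datasets import load_dataset

ds = load_dataset("kaiyan289/MIR-Bench")

print(ds)                          # available splits and their sizes
split_name, split = next(iter(ds.items()))
print(split_name, split.features)  # column names and types, read from the data itself
print(split[0])                    # one raw example record
```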
1 Introduction
The tremendous success of Large Language Models (LLMs) in recent years [51, 24, 25] has finally brought us to the extent where human-level Artificial General Intelligence (AGI) becomes seemingly within reach [25]. With such success, researchers have shifted their focus from syntax- and word-level traditional Natural Language Processing (NLP) tasks such as named entity recognition [47, 37], sentiment classification [63, 68] and translation [42, 65] onto those which measure abilities once considered unique to humans. Inductive Reasoning (IR) [20], which is the ability to summarize general high-level rules from existing examples, thus comes into attention [10]; instead of math or coding wh
... (truncated, 98 KB total)