Longterm Wiki · Updated 2026-01-29

Heavy Scaffolding / Agentic Systems
Comprehensive analysis of multi-agent AI systems, with extensive benchmarking data showing rapid capability growth (77.2% on SWE-bench Verified; 5.5x improvement 2023-2025) alongside persistent reliability challenges (45-60% error-propagation rates; a 2:1 human advantage on 32-hour tasks). Estimates a 25-40% probability that this paradigm is dominant at transformative AI, with 67% of the Fortune 500 deploying agents but only 6% fully trusting them for core processes.

Related
Organizations: OpenAI · Anthropic
Concepts: Dense Transformers · Capabilities · Light Scaffolding

Quick Assessment

| Dimension | Assessment | Evidence |
|---|---|---|
| Current Capability | Moderate-High | Claude Sonnet 4.5 achieves 77.2% on SWE-bench Verified; WebArena agents improved from 14% to 60% success rate (2023-2025) |
| Reliability | Low-Moderate | Multi-agent systems show 50%+ failure rates on complex tasks; error propagation remains the key bottleneck |
| Safety Profile | Mixed | Scaffold code is auditable, but autonomy amplifies the scope of potential harms across physical, financial, and digital dimensions |
| Research Maturity | Medium | ReAct (ICLR 2023) established foundations; 1,600+ annotated failure traces now available via MAST-Data |
| Deployment Status | Production | Claude Code, Devin, OpenAI Assistants in commercial use; enterprise adoption accelerating |
| Scalability | Uncertain | Performance gains plateau at longer time horizons; at 32-hour tasks humans outperform AI 2:1 |
| Dominance Probability | 25-40% | Strong growth trends, but reliability constraints may limit the ceiling |
| Source | Link |
|---|---|
| Official Website | lilianweng.github.io |
| Wikipedia | en.wikipedia.org |
| LessWrong | lesswrong.com |
| arXiv | arxiv.org |

Overview

Heavy scaffolding refers to AI systems in which significant capability and behavior emerge from the orchestration code rather than just the underlying model. These systems combine foundation models with tools, persistent memory, multi-agent coordination, and autonomous operation loops.

Examples include Claude Code (Anthropic's coding agent), Devin (Cognition's software engineer), AutoGPT, and various research agent frameworks. The key distinguishing feature is that the scaffold itself is a major determinant of system behavior, not just a thin wrapper around model calls.
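To make the distinction concrete, here is a minimal sketch of a scaffold's control loop. The `call_model` and `run_tool` callables are hypothetical stand-ins for a model API and a tool dispatcher, not any real framework's interface:

```python
# Minimal agent loop: the scaffold (this code) decides when to call the
# model, which tools to expose, and when to stop. Behavior emerges from
# this orchestration, not from the model weights alone.

def run_agent(task, call_model, run_tool, max_steps=10):
    """Drive a black-box model through an observe-act loop.

    call_model(history) -> {"action": str, "args": dict} or {"final": str}
    run_tool(action, args) -> observation string
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_model(history)        # black-box model call
        if "final" in decision:               # model declares completion
            return decision["final"]
        # Scaffold-level control point: every tool call passes through
        # here, so logging, sandboxing, or approval checks can be added.
        observation = run_tool(decision["action"], decision["args"])
        history.append({"role": "tool", "content": observation})
    return None  # step budget exhausted: a scaffold-enforced safety limit
```

Everything outside the `call_model` line is scaffold: step budgets, tool routing, and history management are engineered code, which is why they can be audited and tested independently of the model.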

This paradigm has an estimated 25-40% probability of being dominant at transformative AI, with strong growth trends as scaffolding becomes easier to build and demonstrates clear capability gains. The 2025 International AI Safety Report notes that "increasingly capable AI agents will likely present new, significant challenges for risk management."

Agentic Architecture Patterns

The following diagram illustrates the common architectural patterns found in modern agentic systems, showing how different components interact across the planning, execution, and feedback loops:

```mermaid
flowchart TD
  subgraph input["User Input"]
      task["Task/Goal"]
  end

  subgraph planning["Planning Layer"]
      decompose["Task Decomposition"]
      select["Strategy Selection"]
      schedule["Action Scheduling"]
  end

  subgraph execution["Execution Layer"]
      toolcall["Tool Invocation"]
      observe["Observation Processing"]
      reflect["Self-Reflection"]
  end

  subgraph memory["Memory Systems"]
      short["Short-term Context"]
      long["Long-term Storage"]
      episodic["Episodic Memory"]
  end

  subgraph external["External Environment"]
      apis["APIs/Web"]
      files["File Systems"]
      code["Code Execution"]
      browser["Browser"]
  end

  task --> decompose
  decompose --> select
  select --> schedule
  schedule --> toolcall
  toolcall --> apis
  toolcall --> files
  toolcall --> code
  toolcall --> browser
  apis --> observe
  files --> observe
  code --> observe
  browser --> observe
  observe --> reflect
  reflect --> short
  short --> long
  long --> episodic
  episodic --> decompose
  reflect --> schedule

  style planning fill:#e8f4f8
  style execution fill:#f0f8e8
  style memory fill:#f8f0e8
  style external fill:#f8e8e8
```
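The memory flow in the diagram above (a bounded short-term context spilling into long-term storage, with completed episodes feeding back into planning) can be sketched as follows. `AgentMemory` and its `summarize` callable are illustrative names for this sketch, not a real framework API:

```python
from collections import deque

class AgentMemory:
    """Sketch of the short-term / long-term / episodic memory split."""

    def __init__(self, summarize, short_capacity=8):
        self.short = deque(maxlen=short_capacity)  # short-term context
        self.long = []                             # long-term storage
        self.episodes = []                         # episodic memory
        self.summarize = summarize                 # e.g. a model call

    def observe(self, item):
        if len(self.short) == self.short.maxlen:
            # The oldest context entry is about to be evicted:
            # compress it into long-term storage first.
            self.long.append(self.summarize(self.short[0]))
        self.short.append(item)

    def end_episode(self):
        # A finished episode is archived whole; episodic memory can
        # then inform future task decomposition, as in the diagram.
        self.episodes.append(list(self.short))
        self.short.clear()
```

The design choice worth noting is the bounded `deque`: context windows are finite, so a scaffold must decide what gets summarized away, and that policy lives in auditable code.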

Conceptual Architecture

```mermaid
flowchart TB
  subgraph scaffold["Scaffold Layer (Readable Code)"]
      orchestrator["Orchestrator"]
      memory["Persistent Memory"]
      tools["Tool Registry"]
      planner["Planning Module"]
      critic["Self-Critique"]
  end

  subgraph models["Model Layer (Black Box)"]
      llm1["LLM: Reasoning"]
      llm2["LLM: Coding"]
      llm3["LLM: Critique"]
  end

  subgraph external["External World"]
      files["File System"]
      web["Web/APIs"]
      exec["Code Execution"]
  end

  orchestrator --> planner
  planner --> llm1
  llm1 --> critic
  critic --> llm3
  orchestrator --> tools
  tools --> llm2
  llm2 --> exec
  tools --> web
  memory --> orchestrator
  exec --> files
```

Key Properties

| Property | Rating | Assessment |
|---|---|---|
| White-box Access | MEDIUM-HIGH | Scaffold code is fully readable and auditable; model calls remain black boxes |
| Trainability | LOW | Models trained separately; scaffold is engineered code, not learned |
| Predictability | LOW | Multi-step plans can diverge unpredictably; emergent behavior from agent loops |
| Modularity | HIGH | Explicit component architecture with clear boundaries |
| Formal Verifiability | PARTIAL | Scaffold logic can be formally verified; model outputs cannot |

Safety Implications

Advantages

| Advantage | Explanation |
|---|---|
| Auditable orchestration | Every decision point in the scaffold can be logged, reviewed, and understood |
| Insertable safety checks | Human approval, sandboxing, or constraint checking can be added in code |
| Modular failure isolation | When something breaks, the failing component can be identified |
| Testable control flow | Unit tests can cover scaffold behavior, even if model outputs vary |
| Interpretable planning | Multi-step plans are often explicitly represented and inspectable |
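To illustrate the "testable control flow" advantage: the scaffold's termination logic can be unit-tested against a deterministic model stub. The `Agent` class below is a hypothetical minimal scaffold written for this example, not a real library:

```python
# Unit-testing scaffold control flow with a stubbed model. Real model
# outputs vary, but the scaffold's behavior (step counting, termination,
# budget enforcement) is deterministic and therefore testable.

class Agent:
    def __init__(self, model, max_steps=5):
        self.model, self.max_steps = model, max_steps

    def run(self, task):
        steps = 0
        while steps < self.max_steps:
            reply = self.model(task)
            steps += 1
            if reply == "DONE":
                return steps
        return None  # step budget exhausted

# Deterministic stub: reports completion on the third call.
calls = {"n": 0}
def stub_model(_task):
    calls["n"] += 1
    return "DONE" if calls["n"] == 3 else "CONTINUE"

assert Agent(stub_model).run("demo") == 3               # terminates in 3 steps
assert Agent(lambda t: "CONTINUE").run("demo") is None  # budget enforced
```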

Risks

| Risk | Severity | Explanation | Mitigation Status |
|---|---|---|---|
| Emergent multi-step behavior | HIGH | Behavior emerges from component interactions over many steps; 14 failure modes identified in the MAST taxonomy | Active research; 14% improvement achievable with interventions |
| Autonomous operation | HIGH | Less human oversight during extended runs; agents now run 30+ hours autonomously | Human-in-loop checkpoints being standardized |
| Tool use enables real harm | HIGH | File system, web access, and code execution carry real-world consequences | Sandboxing and permission systems widely deployed |
| Deceptive scaffolding | MEDIUM | A scaffold could be designed (or evolve) to hide intentions | Audit trails and logging are standard practice |
| Scaling unpredictability | MEDIUM | More agents and longer loops make outcomes harder to predict | Gradient Institute developing risk analysis techniques |
| Security vulnerabilities | HIGH | Survey identifies 4 knowledge gaps: unpredictable inputs, complex execution, variable environments, untrusted entities | Emerging field; defenses lagging threats |
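A sketch of how sandboxing and permission checks can be inserted at the scaffold layer, as the mitigation column suggests. The tool names and the `approve` hook are illustrative assumptions for this sketch, not drawn from any real agent framework:

```python
# Scaffold-level permission gate: every tool call is checked against an
# allowlist, and privileged tools additionally require approval (e.g. a
# human-in-the-loop prompt) before touching the real world.

TOOLS = {"read_file", "search_web"}        # side-effect-free tools
PRIVILEGED = {"write_file", "run_shell"}   # tools that need approval

def gated_call(tool, args, execute, approve):
    """Run `execute(tool, args)` only if policy allows it.

    approve(tool, args) -> bool, e.g. a human confirmation dialog.
    """
    if tool in TOOLS:
        return execute(tool, args)
    if tool in PRIVILEGED and approve(tool, args):
        return execute(tool, args)
    raise PermissionError(f"tool {tool!r} blocked by policy")
```

Because this gate is ordinary code, its policy is auditable and its behavior testable, which is exactly the safety property the Advantages table attributes to heavy scaffolding.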

Current Examples

| System | Developer | Key Features | Benchmark Performance | Status |
|---|---|---|---|---|
| Claude Code | Anthropic | Coding agent with file access, terminal, multi-file editing | 77.2% SWE-bench Verified | Production |
| Devin | Cognition | Full software-engineer agent with browser, terminal | First to reach 13.86% SWE-bench (Mar 2024); valued at $10.2B | Production |
| CUGA | IBM Research | Enterprise-ready hierarchical planner-executor | 61.7% WebArena (SOTA) | Production |
| AutoGPT | Open source | General autonomous agent with plugins | 181K+ GitHub stars | Research/Hobby |
| MetaGPT | Open source | Multi-agent framework with SOPs | 83%+ on HumanEval | Framework |
| Voyager | NVIDIA | Minecraft agent with skill library | First LLM-powered embodied agent | Research |
| OpenAI Assistants | OpenAI | API for building custom agents with tools | Code Interpreter, retrieval | Production |
| LangChain Agents | LangChain | Framework for building agent pipelines | 140M+ monthly downloads | Framework |

Benchmark Performance Data

Empirical benchmarks provide quantitative evidence of agentic system capabilities and limitations. The table below summarizes performance across major evaluation suites:

| Benchmark | Task Type | Best Agent Performance | Human Baseline | Key Finding |
|---|---|---|---|---|
| SWE-bench Verified | Software engineering | 77.2% (Claude Sonnet 4.5); 80.9% (Opus 4.5) | ≈90% (estimated) | 5.5x improvement from 13.86% (Devin, Mar 2024) to 77.2% (Sep 2025) |
| SWE-bench Pro | Complex software tasks | 23.3% (GPT-5/Claude Opus 4.1) | Not measured | Significant drop from Verified; highlights reliability gap |
| WebArena | Web navigation | 61.7% (IBM CUGA, Feb 2025) | 78.24% | 4.3x improvement from 14.41% baseline (2023); Zhou et al. 2023 |
| WebChoreArena | Tedious web tasks | 37.8% (Gemini 2.5 Pro) | Not measured | Memory and calculation tasks remain challenging |
| ALFWorld | Embodied tasks | 48.5% (GPT-4 AutoGPT) | ≈95% | Surpassed imitation-learning baselines; Liu et al. 2023 |
| HotPotQA | Multi-hop QA | 27.4% (ReAct) | ≈60% | ReAct trails CoT slightly but gains interpretability; Yao et al. 2022 |
| RE-Bench | Complex tasks (2hr) | 4x human score | Baseline | At 32 hours, humans outperform AI 2:1; time-horizon dependent |
| AppWorld | API orchestration | 48.2% (IBM CUGA) | Not measured | 87.5% on Level 1 tasks; complex multi-API coordination |

The trajectory of agentic systems shows rapid improvement but persistent reliability gaps:

| Metric | 2023 | 2024 | 2025 | Trend |
|---|---|---|---|---|
| SWE-bench (best agent) | 13.86% (Devin) | 49% (Claude 3.5 Sonnet) | 77.2% (Claude Sonnet 4.5) | ≈5.6x over 2 years |
| WebArena success rate | 14.41% | ≈45% | 61.7% | ≈4.3x over 2 years |
| Multi-agent task completion | 35-40% | 45-55% | 55-65% | Steady improvement |
| Error propagation rate | High (unmeasured) | ≈60% cascade failures | ≈45% with mitigations | Improving with research |
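The trend figures above imply the following improvement factors, which can be checked directly from the 2023 and 2025 entries:

```python
# Cross-checking the improvement factors implied by the trend table.
swe_2023, swe_2025 = 13.86, 77.2      # SWE-bench best-agent scores (%)
web_2023, web_2025 = 14.41, 61.7      # WebArena success rates (%)

swe_factor = swe_2025 / swe_2023      # ≈ 5.6x
web_factor = web_2025 / web_2023      # ≈ 4.3x
```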

Research Landscape

Key Papers

| Paper | Year | Venue | Contribution | Key Metrics |
|---|---|---|---|---|
| ReAct: Synergizing Reasoning and Acting | 2022 | ICLR 2023 | Foundational reasoning+action framework | +34% absolute on ALFWorld; 94% fact accuracy |
| Toolformer | 2023 | NeurIPS | Self-supervised tool-use learning | Models learn APIs from 25K demonstrations |
| Voyager | 2023 | NeurIPS | First LLM-powered embodied agent | 3.3x more unique items discovered vs baselines |
| Generative Agents | 2023 | UIST | Believable simulacra with memory | 25 agents; 2-week simulated time |
| AgentVerse | 2024 | ICLR 2024 | Multi-agent collaboration framework | Meta-programming; dynamic role adjustment |
| SWE-bench | 2023 | ICLR 2024 | Real GitHub issue-resolution benchmark | 2,294 tasks from 12 popular repositories |
| MAST-Data | 2025 | arXiv | Multi-agent failure taxonomy | 1,600+ traces; 14 modes; κ=0.88 agreement |
| Agentic AI Security | 2025 | arXiv | Security threat taxonomy | 4 knowledge gaps; comprehensive defense survey |

Multi-Agent Failure Taxonomy

Research from the MAST-Data study identifies 14 unique failure modes clustered into three categories:

| Category | Failure Modes | Frequency | Mitigation |
|---|---|---|---|
| System Design Issues | Improper task decomposition, inadequate tool selection, memory overflow | 35-40% of failures | Better planning modules, explicit verification |
| Inter-Agent Misalignment | Conflicting objectives, communication breakdowns, role confusion | 25-30% of failures | Standardized protocols, centralized coordination |
| Task Verification | Incomplete outputs, quality-control failures, premature termination | 30-35% of failures | Human-in-loop checkpoints, automated testing |

The study found strong inter-annotator agreement (κ = 0.88), validating the taxonomy, and that interventions yielded a +14% improvement for ChatDev, though success rates "remain insufficiently [high] for real-world deployment."

Key Labs and Organizations

| Organization | Focus Area | Key Achievements | Notable Systems |
|---|---|---|---|
| Anthropic | Frontier agents + safety | 77.2% SWE-bench; 30+ hour sustained operation | Claude Code, Computer Use |
| Cognition | Autonomous software engineering | First 13.86% SWE-bench (Mar 2024); $10.2B valuation | Devin |
| OpenAI | Agent APIs + reasoning | Code Interpreter, function-calling ecosystem | Assistants API, o1/o3 reasoning |
| IBM Research | Enterprise-ready agents | 61.7% WebArena SOTA (Feb 2025); open source | CUGA |
| LangChain | Agent frameworks | 140M+ monthly PyPI downloads | LangGraph, LangSmith |
| MetaGPT | Multi-agent SOPs | 47K+ GitHub stars; standardized workflows | MetaGPT framework |
| NVIDIA | Embodied agents | First LLM-powered embodied agent | Voyager |

Trend Analysis

Heavy scaffolding is experiencing rapid growth due to several factors:

  1. Scaffolding is getting cheaper - Frameworks like LangChain, LlamaIndex, MetaGPT reduce development time by 60-80%
  2. Clear capability gains - Agents demonstrably outperform single-turn interactions; SWE-bench improved 5.5x in two years
  3. Tool use is mature - Function calling, code execution are well-understood; 90%+ of production agents use tool calling
  4. Enterprise demand - McKinsey reports agentic AI adds "additional dimension to the risk landscape" as systems move from enabling interactions to driving transactions

Enterprise Adoption Landscape

| Metric | 2024 | 2025 | Change/Detail | Source |
|---|---|---|---|---|
| Fortune 500 production deployments | 19% | 67% | +248% YoY | Axis Intelligence |
| Organizations using Microsoft Copilot Studio | N/A | 230,000+ | Includes 90% of the Fortune 500 | Kong Inc. Report |
| Fortune 100 using AutoGen framework | N/A | 40%+ | For internal agentic systems | Microsoft Research |
| Full trust in AI agents for core processes | N/A | 6% | 43% trust agents for limited tasks only | HBR Survey 2025 |
| Enterprise software with agentic AI (Gartner projection) | <1% | 33% by 2028 | 33x growth projected | Gartner |

Trust Gap Analysis: While 90% of enterprises report actively adopting AI agents, only 6% express full trust for core business processes. 43% trust agents only for limited/routine operational tasks, and 39% restrict them to supervised use cases. This trust gap represents both a current limitation and an opportunity for safety-focused development.

Trajectory Projection

| Period | Expected Development | Confidence |
|---|---|---|
| 2024-2025 | Specialized vertical agents (coding, research, customer service) | High (already occurring) |
| 2025-2027 | General-purpose agents with longer autonomy; 70%+ benchmark performance | Medium-High |
| 2027-2030 | Multi-agent ecosystems, agent-to-agent collaboration | Medium |
| 2030+ | Potential dominant paradigm if reliability exceeds 90% | Low-Medium |

Growth Indicators

| Metric | Value | Source |
|---|---|---|
| GitHub stars (AutoGPT) | 181,000+ | GitHub Repository |
| Agent framework downloads/month | 140M+ (LangChain) | PyPI Stats |
| Enterprise agent deployments | 67% of Fortune 500 in production | Axis Intelligence 2025 |
| AI startup funding (2025) | $202B total, 50% of all VC | Crunchbase 2025 |
| Agent-related papers (2024) | 500+ on arXiv | Awesome-Agent-Papers |
| Agentic AI market projection | $89.6B by 2026 | DigitalDefynd 2025 |

Cost and Economics

Understanding the economics of agentic systems is critical for both deployment decisions and safety considerations.

API and Compute Costs

| Model/System | Input Cost | Output Cost | Context Window | Typical Task Cost |
|---|---|---|---|---|
| Claude Sonnet 4.5 | $3/M tokens | $15/M tokens | 200K tokens | $0.50-5.00 per SWE-bench task |
| GPT-4o | $2.50/M tokens | $10/M tokens | 128K tokens | $0.30-3.00 per task |
| Claude Opus 4.5 | $15/M tokens | $75/M tokens | 200K tokens | $2.00-20.00 per complex task |
| Open-source (Llama 3.1 70B) | ≈$0.50/M tokens | ≈$0.75/M tokens | 128K tokens | $0.10-1.00 per task |
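A rough per-task cost estimate follows directly from the pricing table. The token counts in the example are illustrative assumptions, not measured values:

```python
# Per-task cost from per-token pricing: cost = tokens/1M * $/M, summed
# over input and output. Prices are taken from the table above.

PRICING = {  # model -> (input $/M tokens, output $/M tokens)
    "claude-sonnet-4.5": (3.00, 15.00),
    "gpt-4o": (2.50, 10.00),
    "claude-opus-4.5": (15.00, 75.00),
}

def task_cost(model, input_tokens, output_tokens):
    inp, out = PRICING[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# Example: an agentic coding task consuming ~150K input tokens (context,
# file contents, tool observations) and ~20K output tokens:
cost = task_cost("claude-sonnet-4.5", 150_000, 20_000)
# 0.45 (input) + 0.30 (output) = $0.75, within the $0.50-5.00 range above
```

Note that agentic tasks are input-heavy: repeated tool observations re-enter the context each step, which is why multi-hour agent runs land at the high end of the per-task ranges.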

Cost-Benefit Analysis

| Metric | Value | Source |
|---|---|---|
| Average agent task cost (coding) | $0.50-5.00 | API pricing estimates |
| Human developer hourly rate | $75-200/hour | Industry averages |
| Break-even threshold | Agent 3-4x slower than human | Cost parity analysis |
| Enterprise ROI on agent deployment | 2-5x within first year | McKinsey 2025 |
| Venture funding in AI agents (2025) | $202B total AI; agents dominate | Crunchbase |

Comparison with Other Paradigms

| Aspect | Heavy Scaffolding | Minimal Scaffolding | Provable Systems |
|---|---|---|---|
| Interpretability | Scaffold: HIGH; Model: LOW | LOW | HIGH by design |
| Capability ceiling | HIGH (tool use) | LIMITED | UNKNOWN |
| Development speed | FAST | FAST | SLOW |
| Safety guarantees | PARTIAL (scaffold only) | NONE | STRONG |
| Current maturity | MEDIUM | HIGH | LOW |

Key Uncertainties

| Uncertainty | Current Evidence | Implications |
|---|---|---|
| Reliability at scale | RE-Bench shows humans outperform AI 2:1 at 32-hour tasks; error propagation causes 45-60% of failures | May limit agent autonomy to shorter task horizons (under 8 hours) |
| Emergent deception | ACM survey identifies "emergent behaviors" including "destructive behaviors leading to undesired outcomes" | Multi-agent coordination introduces unpredictability absent in single-agent systems |
| Human oversight integration | Nature study proposes triadic framework: human regulation, agent alignment, environmental feedback | Current systems lack standardized oversight mechanisms |
| Scaffold complexity | Agent Workflow Memory achieved 51% success boost; architectural choices matter as much as model capability | Scaffold engineering may become a specialized discipline |
| Error propagation | Chain-of-Thought acts as an "error amplifier" where minor mistakes cascade through subsequent actions | Early detection and correction are critical; memory and reflection reduce risk |
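The error-propagation row suggests a scaffold-level mitigation pattern: verify each step's output before committing it, so a single bad step cannot cascade through subsequent actions. A minimal sketch, assuming caller-supplied `step_fn` and `verify_fn` callables:

```python
# Verify-then-commit wrapper: each agent step is checked before its
# result is allowed to propagate. Retries get the attempt number so the
# step can vary its approach; persistent failure halts the run instead
# of silently compounding the error.

def guarded_step(step_fn, verify_fn, max_attempts=3):
    """Run step_fn until verify_fn accepts its result, or halt."""
    last = None
    for attempt in range(1, max_attempts + 1):
        last = step_fn(attempt)       # attempt count lets the step adapt
        if verify_fn(last):
            return last               # only verified results propagate
    raise RuntimeError(
        f"step failed verification after {max_attempts} attempts"
    )
```

In practice `verify_fn` might be a unit test, a schema check, or a critique model; the point is that the gate lives in scaffold code, where it can be audited.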

Implications for Safety Research

Research That Transfers Well

  • Control and containment - Sandboxing, permission systems, action constraints
  • Interpretability of plans - Understanding multi-step reasoning
  • Human-in-the-loop design - Approval workflows, uncertainty communication
  • Testing and red-teaming - Adversarial evaluation of agent systems

Research That May Not Transfer

  • Mechanistic interpretability - Scaffold behavior isn't in weights
  • Training-time interventions - Scaffold isn't trained
  • Representation analysis - Scaffold doesn't have representations

Sources and Further Reading


References

1. International AI Safety Report 2025 · internationalaisafetyreport.org

A landmark international scientific assessment co-authored by 96 experts from 30 countries, providing a comprehensive overview of general-purpose AI capabilities, risks, and risk management approaches. It aims to establish shared scientific understanding across nations as a foundation for global AI governance. The report covers topics including capability evaluation, misuse risks, systemic risks, and mitigation strategies.

2. SWE-bench (benchmark and leaderboard)

SWE-bench is a benchmark and leaderboard platform for evaluating AI models on real-world software engineering tasks, particularly resolving GitHub issues in open-source Python repositories. It offers multiple dataset variants (Lite, Verified, Multimodal) and standardized metrics to compare coding agents. It has become a widely used standard for assessing the practical software engineering capabilities of LLM-based agents.

3. WebArena (benchmark)

WebArena is a benchmark environment for evaluating autonomous web-browsing AI agents on realistic, long-horizon tasks across functional websites (e-commerce, forums, code repos, etc.). It tests agents' ability to complete complex multi-step goals requiring planning, navigation, and tool use in a self-hosted web ecosystem. The benchmark helps measure progress and identify limitations in agentic AI systems operating in realistic digital environments.

4. ReAct: Synergizing Reasoning and Acting (paper)

ReAct introduces a prompting paradigm that interleaves reasoning traces with task-specific actions in LLMs, enabling them to use external tools (e.g., Wikipedia API) while reasoning. This approach reduces hallucination and error propagation compared to chain-of-thought alone, and outperforms imitation/reinforcement learning baselines on interactive decision-making benchmarks by large margins.

5. METR (organization)

METR is an organization conducting research and evaluations to assess the capabilities and risks of frontier AI systems, focusing on autonomous task completion, AI self-improvement risks, and evaluation integrity. They have developed the 'Time Horizon' metric measuring how long AI agents can autonomously complete software tasks, showing exponential growth over recent years. They work with major AI labs including OpenAI, Anthropic, and Amazon to evaluate catastrophic risk potential.

6. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? · arXiv · Carlos E. Jimenez et al. · 2023

SWE-bench is a new evaluation framework for assessing language models' ability to resolve real-world software engineering problems. It consists of 2,294 GitHub issues from 12 popular Python repositories, requiring models to edit codebases to fix issues. The benchmark demands complex reasoning including multi-file coordination, long-context processing, and execution-environment interaction. Current state-of-the-art models perform poorly on this task, with Claude 2 achieving only a 1.96% success rate, indicating significant room for improvement in developing more practical and autonomous AI systems.

7. Anthropic (organization)

Anthropic is an AI safety company focused on building reliable, interpretable, and steerable AI systems. The company conducts frontier AI research and develops Claude, its family of AI assistants, with a stated mission of responsible development and maintenance of advanced AI for long-term human benefit.

8. OpenAI (organization)

OpenAI is a leading AI research and deployment company focused on building advanced AI systems, including GPT and o-series models, with a stated mission of ensuring artificial general intelligence (AGI) benefits all of humanity. The homepage serves as a gateway to their research, products, and policy work spanning capabilities and safety.

9. McKinsey guide on deploying agentic AI

A McKinsey practitioner-oriented guide for technology leaders on safely deploying agentic AI systems in enterprise contexts. The resource likely covers risk frameworks, security considerations, and governance practices for AI agents that can take autonomous actions. Content is inaccessible due to access restrictions.

10. Microsoft Research (organization)

Microsoft Research is Microsoft's primary research division, conducting fundamental and applied research across computer science, AI, and related disciplines. It publishes work on AI safety, fairness, interpretability, and responsible AI alongside broader computer science topics. The lab is a major industry contributor to AI alignment and safety-adjacent research.

11. Crunchbase: AI startup funding 2025

Crunchbase data shows AI captured nearly 50% of all global startup funding in 2025, totaling $202.3 billion invested. Foundation model companies like OpenAI and Anthropic dominated the largest individual funding rounds, reflecting continued concentration of capital in frontier AI development.

Related Wiki Pages

Approaches: Capability Elicitation
Organizations: LessWrong
Concepts: Agentic AI · Minimal Scaffolding · Long-Horizon Autonomous Tasks · Autonomous Coding · Large Language Models