Updated 2026-03-16

Large Language Models

Capability

Comprehensive analysis of LLM capabilities, tracing rapid progress from GPT-2 (1.5B parameters, 2019) to GPT-5 and Gemini 2.5 (2025), with training costs growing 2.4× annually and projected to exceed $1B by 2027. Documents the emergence of the inference-time scaling paradigm; mechanistic interpretability advances including Gemma Scope 2; multilingual alignment research; factuality benchmarking via the FACTS suite; and key technical breakthroughs in vision-language learning (CLIP), text-to-image generation (DALL-E), and embeddings as a distinct capability class. Identifies key safety concerns including 8–45% hallucination rates, persuasion capabilities, and growing autonomous-agent capabilities. Also covers MoE architecture optimization, continual learning challenges, knowledge editing, bias and fairness evaluation across gender and political dimensions, layer-wise mechanistic interpretability distinguishing recall from reasoning, and multimodal continual learning.

First Major: GPT-2 (2019)
Key Labs: OpenAI, Anthropic, Google
Related Capabilities: Reasoning and Planning, Agentic AI
Related Organizations: OpenAI
8.5k words · 52 backlinks

Quick Assessment

| Dimension | Assessment | Evidence |
|---|---|---|
| Capability Level | Near-human to superhuman on structured tasks | o3 achieves 87.5% on ARC-AGI (human baseline ≈85%) and 87.7% on GPQA Diamond |
| Progress Rate | 2–3× capability improvement per year | Stanford AI Index 2025: benchmark scores rose 18–67 percentage points in one year |
| Training Cost Trend | 2.4× annual growth | Epoch AI: frontier model training costs projected to exceed $1B by 2027 |
| Inference Cost Trend | 280× reduction since 2022 | GPT-4-equivalent dropped from $10 to $1.07 per million tokens |
| Hallucination Rates | 8–45% depending on task | Vectara leaderboard: best models reportedly at ≈8%; HalluLens: up to 45% on factual queries |
| Safety Maturity | Moderate | Constitutional AI and RLHF established; Responsible Scaling Policies implemented by major labs |
| Open–Closed Gap | Narrowing | Gap reportedly shrunk from 8.04% to 1.70% on Chatbot Arena (Jan 2024 → Feb 2025) |

Key Links

| Source | Link |
|---|---|
| Official Website | learn.microsoft.com |
| Wikipedia | en.wikipedia.org |
| arXiv | arxiv.org |

Overview

Large Language Models (LLMs) are transformer-based neural networks trained on vast text corpora using next-token prediction. Despite their deceptively simple training objective, LLMs exhibit sophisticated emergent capabilities including reasoning, coding, scientific analysis, and complex task execution. These models have transformed abstract AI safety discussions into concrete, immediate concerns while providing the clearest path toward artificial general intelligence.

The core insight underlying LLMs is that training a model to predict the next word in a sequence—a task achievable without labeled data—produces internal representations useful for a wide range of downstream tasks. This approach was explored in early unsupervised pretraining work from OpenAI in 2018. OpenAI's GPT-2 (2019) then demonstrated coherent multi-paragraph generation at scale, showing that larger models trained on more data produced qualitatively stronger outputs. An earlier indicator that such models form interpretable internal structure was the discovery of an "unsupervised sentiment neuron" in 2017, which emerged without any sentiment-specific supervision.

A foundational development alongside raw scale was learning to align model outputs to human preferences. The GPT-2 paper itself acknowledged concerns about potential misuse of fluent generation, leading OpenAI to initially withhold the full 1.5B-parameter model and conduct a staged release to study harms before broader deployment. Subsequent work on instruction-following alignment—culminating in InstructGPT—demonstrated that fine-tuning with human feedback could substantially improve model usefulness and reduce harmful outputs without proportional increases in model size. The complementary technique of RLHF applied to summarization tasks, and later to instruction-following, established the training paradigm that underlies most current aligned frontier models.

Current frontier models—including GPT-4o, Claude Opus 4.5, Gemini 2.5 Pro, and Llama—demonstrate near-human or superhuman performance across diverse cognitive domains. Training runs for leading frontier systems reportedly consume hundreds of millions of dollars, and model parameter counts have reached into the hundreds of billions to trillions. These substantial computational investments have shifted AI safety from theoretical to practical urgency. The late 2024–2025 period marked a paradigm shift toward inference-time compute scaling, with reasoning models such as OpenAI's o1 and o3 achieving higher performance on reasoning benchmarks by allocating more compute at inference rather than only at training time.

A parallel development is the rapid growth of the open-weight ecosystem. Meta's Llama family has grown substantially since its initial release, with Meta reporting over a billion downloads and more than ten times the developer activity compared to 2023. Google's Gemma models—including Gemma 3 and the mobile-first Gemma 3n variants—have provided the safety research community with accessible architectures for Mechanistic Interpretability work. This open/closed model convergence has implications for both capability diffusion and the tractability of safety interventions.

Capability Architecture

The diagram below maps the flow from training through inference to observed capabilities. Capabilities are grouped by which inference regime produces them; the distinction between "standard" and "search-augmented" inference is a key axis along which safety-relevant behaviors (extended planning, autonomous task execution) emerge.

flowchart TD
  subgraph TRAINING["Training Phase"]
      DATA[Text Corpora] --> PRETRAIN[Pretraining]
      PRETRAIN --> BASE[Base Model]
      BASE --> RLHF[RLHF/Constitutional AI]
      RLHF --> ALIGNED[Aligned Model]
  end

  subgraph INFERENCE["Inference Phase"]
      ALIGNED --> STANDARD[Standard Inference]
      ALIGNED --> COT[Chain-of-Thought]
      COT --> REASONING[Reasoning Models]
      REASONING --> SEARCH[Inference-Time Search]
  end

  subgraph CAPABILITIES["Emergent Capabilities"]
      STANDARD --> BASIC[Text Generation<br/>Translation<br/>Summarization]
      COT --> INTER[Complex Reasoning<br/>Code Generation<br/>Tool Use]
      SEARCH --> ADV[PhD-Level Analysis<br/>Mathematical Proof<br/>Autonomous Agents]
  end

  style TRAINING fill:#e6f3ff
  style INFERENCE fill:#fff3e6
  style CAPABILITIES fill:#e6ffe6

Note: Capabilities in the "Emergent Capabilities" subgraph are descriptive, not evaluative. Safety implications of each capability class are discussed in the Concerning Capabilities Assessment and Safety-Relevant Positive Capabilities sections below.

Risk Assessment

| Risk Category | Severity | Likelihood | Timeline | Trend |
|---|---|---|---|---|
| Deceptive Capabilities | High | Moderate | 1–3 years | Increasing |
| Persuasion & Manipulation | High | High | Current | Accelerating |
| Autonomous Cyber Operations | Moderate–High | Moderate | 2–4 years | Increasing |
| Scientific Research Acceleration | Mixed | High | Current | Accelerating |
| Economic Disruption | High | High | 2–5 years | Accelerating |

Capability Progression Timeline

| Model | Release | Parameters | Key Breakthrough | Performance Milestone |
|---|---|---|---|---|
| GPT-2 | Feb 2019 | 1.5B | Coherent text generation | Initially withheld for safety concerns; full 1.5B release Nov 2019 |
| GPT-3 | Jun 2020 | 175B | Few-shot learning emergence | Creative writing, basic coding |
| Codex | Aug 2021 | Undisclosed | Code generation from natural language | Powered early GitHub Copilot; evaluated on HumanEval benchmark |
| InstructGPT | Jan 2022 | 1.3B–175B | Instruction-following via RLHF | Preferred by labelers over 175B GPT-3 despite smaller size |
| GPT-4 | Mar 2023 | Undisclosed | Multimodal reasoning | Reportedly ≈90th percentile SAT, bar exam passing |
| GPT-4o | May 2024 | Undisclosed | Multimodal speed/cost | Real-time audio-visual; described as 2× faster than GPT-4 Turbo |
| Claude 3.5 Sonnet | Jun 2024 | Undisclosed | Advanced tool use | 86.5% MMLU, leading SWE-bench at release |
| o1 | Sep 2024 | Undisclosed | Chain-of-thought reasoning | 77.3% GPQA Diamond, 74% MATH |
| o3 | Dec 2024 | Undisclosed | Inference-time search | 87.7% GPQA Diamond, 91.6% AIME 2024 |
| Gemini 2.0 Flash | Feb 2025 | Undisclosed | Agentic multimodal capabilities | Native image generation, real-time audio; positioned as foundation for agentic era |
| Gemini 2.5 Pro | Mar 2025 | Undisclosed | Long-context reasoning | 1M-token context, leading coding benchmarks at release |
| Llama 4 | Apr 2025 | Undisclosed | Natively multimodal open-weight | Mixture-of-experts architecture |
| GPT-5 | May 2025 | Undisclosed | Unified reasoning + tool use | Highest reported scores at release on GPQA Diamond and SWE-bench |
| Claude Opus 4.5 | Reportedly Nov 2025 | Undisclosed | Extended reasoning | Reportedly 80.9% SWE-bench Verified |
| GPT-5.2 | Reportedly late 2025 | Undisclosed | Deep thinking modes | Reportedly 93.2% GPQA Diamond, 90.5% ARC |

Primary sources: OpenAI model announcements, Anthropic model cards, Google DeepMind blog.

Benchmark Performance Comparison (2024–2025)

| Benchmark | Measures | GPT-4o (2024) | o1 (2024) | o3 (2024) | Human Expert |
|---|---|---|---|---|---|
| GPQA Diamond | PhD-level science | ≈50% | 77.3% | 87.7% | ≈89.8% |
| AIME 2024 | Competition math | 13.4% | 74% | 91.6% | Top 500 US |
| MMLU | General knowledge | 84.2% | 90.8% | ≈92% | 89.8% |
| SWE-bench Verified | Real GitHub issues | 33.2% | 48.9% | 71.7% | N/A |
| ARC-AGI | Novel reasoning | ≈5% | 13.3% | 87.5% | ≈85% |
| Codeforces | Competitive coding | ≈11% | 89% (94th percentile) | 99.8th percentile | N/A |

Sources: OpenAI o1 system card, ARC Prize o3 analysis, Anthropic Claude 3.5 model card.

The o3 results represent a qualitative shift: o3 achieved nearly human-level performance on ARC-AGI (87.5%) versus a ~85% human baseline, a benchmark specifically designed to test general reasoning rather than pattern matching. On FrontierMath, o3 reportedly solved 25.2% of problems compared to o1's 2%—a roughly 12x improvement that suggests reasoning capabilities may be scaling faster than expected. However, on the harder ARC-AGI-2 benchmark, o3 scores only ~3% compared to ~60% for average humans, revealing significant limitations in truly novel reasoning tasks.

Scaling Laws and Predictable Progress

Core Scaling Relationships

Research by Kaplan et al. (2020), later refined by Hoffmann et al. (2022), demonstrates robust mathematical relationships governing LLM performance:

| Factor | Scaling Law | Implication |
|---|---|---|
| Model Size | Loss ∝ N^(−0.076) | 10× parameters → loss divided by ≈1.19 |
| Training Data | Loss ∝ D^(−0.095) | 10× data → loss divided by ≈1.24 |
| Compute | Loss ∝ C^(−0.050) | 10× compute → loss divided by ≈1.12 |
| Optimal Ratio | N and D scaled roughly in proportion | Chinchilla: ≈20 training tokens per parameter for compute-optimal training |

Sources: Chinchilla paper; Scaling Laws for Neural Language Models
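Kaplan et al. state these as power laws in the training loss. A minimal sketch of the ratios a 10× resource increase implies, using the exponents above (the `loss_ratio` helper is illustrative, not a published tool):

```python
def loss_ratio(scale_factor: float, exponent: float) -> float:
    """Factor by which loss shrinks when one resource grows by scale_factor,
    assuming loss is proportional to resource**(-exponent)."""
    return scale_factor ** exponent

# Kaplan et al. exponents for parameters (N), data (D), compute (C).
for name, exp in [("parameters (N)", 0.076),
                  ("data (D)", 0.095),
                  ("compute (C)", 0.050)]:
    print(f"10x {name}: loss divided by ~{loss_ratio(10, exp):.2f}")
```

The small exponents are the point: an order of magnitude more of any single resource buys only a 10–25% loss reduction, which is why frontier training budgets grow so quickly.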

According to Epoch AI research, approximately two-thirds of LLM performance improvements over the last decade are attributable to increases in model scale, with training techniques contributing roughly 0.4 orders of magnitude per year in compute efficiency. The cost of training frontier models has reportedly grown 2.4× per year since 2016, with the largest training runs projected to exceed $1B by 2027, according to Epoch AI cost analyses.

A related phenomenon is the emergence of new scaling regimes beyond training compute. Research published in 2025 finds that emergent capability thresholds are sensitive to random factors including data ordering and initialization seeds, suggesting that the apparent sharpness of emergence in aggregate curves may partly reflect averaging over many random runs with different thresholds rather than a clean phase transition.

The Shift to Inference-Time Scaling (2024–2025)

The o1 and o3 model families introduced a new paradigm: inference-time compute scaling. Rather than only scaling training compute, these models allocate additional computation at inference time through extended reasoning chains and search procedures.

| Scaling Type | Mechanism | Trade-off | Example |
|---|---|---|---|
| Pre-training scaling | More parameters, data, training compute | High upfront cost, fast inference | GPT-4, Claude 3.5 |
| Inference-time scaling | Longer reasoning chains, search | Lower training cost, expensive inference | o1, o3 |
| Combined scaling | Both approaches | Maximum capability, maximum cost | GPT-5, Claude Opus 4.5 |

This shift has implications for AI safety: inference-time scaling allows models to "think longer" on hard problems, potentially achieving strong performance on specific tasks while maintaining manageable training costs. According to some sources, o1 is approximately 6x more expensive and 30x slower than GPT-4o per query. The Stanford AI Index 2025 cites the RE-Bench evaluation finding that in short time-horizon settings (2-hour budget), top AI systems score 4x higher than human experts, but as the time budget increases to 32 hours, human performance surpasses AI by roughly 2 to 1.

Research on inference efficiency has produced complementary approaches. Work on efficient reasoning with balanced thinking explores how models can allocate inference-time compute more selectively—spending more tokens on genuinely hard sub-problems and fewer on straightforward ones—rather than applying uniform extended chain-of-thought across all queries. Similarly, speculative decoding with online learning (research on "evolving drafts") allows draft models to adapt to target model distributions during deployment, reducing latency while preserving output quality. Cross-family speculative prefill techniques use small draft models from different model families to compress long-context inputs without fine-tuning, enabling long-context inference at reduced cost.
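One of the simplest inference-time scaling strategies is self-consistency: sample several independent reasoning chains and majority-vote on the final answer, so that accuracy rises with the compute spent per query. A toy sketch, where `sample_answer` is a hypothetical stand-in for a model call (here a noisy solver, not a real API):

```python
# Toy sketch of self-consistency, an inference-time scaling strategy:
# spend more compute by sampling k answers, then majority-vote.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical stand-in for one model call: right 60% of the time."""
    return "42" if random.random() < 0.6 else str(random.randint(0, 9))

def self_consistency(question: str, k: int = 25) -> str:
    """Larger k = more inference compute = more reliable majority answer."""
    votes = Counter(sample_answer(question) for _ in range(k))
    return votes.most_common(1)[0][0]

random.seed(0)
print(self_consistency("What is 6 x 7?", k=25))
```

Because correct samples concentrate on one answer while errors scatter, the voted answer is far more reliable than any single sample — the same intuition behind the log-linear gains reported for longer reasoning budgets.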

Emergent Capability Thresholds

| Capability | Emergence Scale | Evidence | Safety Relevance |
|---|---|---|---|
| Few-shot learning | ≈100B parameters | GPT-3 breakthrough | Tool use foundation |
| Chain-of-thought | ≈10B parameters | PaLM, GPT-3 variants | Complex reasoning |
| Code generation | ≈1B parameters | Codex, GitHub Copilot | Cyber capabilities |
| Instruction following | ≈10B parameters | InstructGPT | Human-AI interaction paradigm |
| PhD-level reasoning | o1+ scale | GPQA Diamond performance | Expert-level autonomy |
| Strategic planning | o3 scale | ARC-AGI performance | Deception potential |

Research documented in a 2025 emergent abilities survey finds that emergent abilities depend on multiple interacting factors: scaling up parameters or depth lowers the threshold for emergence but is neither necessary nor sufficient alone—data quality, diversity, training objectives, and architecture modifications also matter significantly. Emergence aligns more closely with pre-training loss landmarks than with sheer parameter count; smaller models can match larger ones if training loss is sufficiently reduced.

According to the Stanford AI Index 2025, benchmark performance improved substantially in a single year: scores rose by 18.8, 48.9, and 67.3 percentage points on MMMU, GPQA, and SWE-bench respectively. The gap between leading US and Chinese models also narrowed—from 17.5 to 0.3 percentage points on MMLU—over the same period.

Safety implication: As AI systems gain autonomous reasoning capabilities, they also develop behaviors relevant to safety evaluation, including goal persistence, strategic planning, and the capacity for deceptive alignment. OpenAI's o3-mini reportedly became the first AI model to receive a "Medium risk" classification for Model Autonomy under internal capability evaluation frameworks.

Depth Dynamics and Linear Scaling Structure

Recent mechanistic work finds that as language models scale, low-order linear dynamics emerge in the depth dimension: the sequence of layer-wise representations can be approximated by low-dimensional linear trajectories, with most representational change concentrated in early and late layers. This "low-order linear depth dynamics" phenomenon suggests that scaling does not uniformly distribute representational work across depth, but instead concentrates it at certain layers—an observation with implications for understanding where capabilities reside and how to intervene on model behavior.
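A toy illustration of what "low-order linear depth dynamics" means: if each layer adds (nearly) the same update direction to the residual stream, consecutive layer-wise deltas have cosine similarity close to 1 and the depth trajectory is essentially a line. The layer states below are synthetic, not taken from a real model:

```python
# Synthetic illustration of low-order linear depth dynamics.
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

base = [1.0, 0.0, 0.0]
step = [0.0, 0.5, 0.1]          # shared per-layer update direction
states = [[b + l * s for b, s in zip(base, step)] for l in range(6)]

# Per-layer deltas and the similarity between consecutive deltas.
deltas = [[a - b for a, b in zip(states[l + 1], states[l])]
          for l in range(len(states) - 1)]
sims = [cosine(deltas[i], deltas[i + 1]) for i in range(len(deltas) - 1)]
print(min(sims))  # ~1.0: successive layer updates point the same way
```

In a real model the fit is approximate rather than exact, with the largest deviations concentrated in early and late layers, as the cited work describes.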

Related analysis of expert selection patterns in Mixture-of-Experts (MoE) models finds that which experts are activated for a given input reveals nearly as much about the input's semantics as the text itself, suggesting that routing decisions encode substantial semantic information. This has implications for both interpretability (expert routing as a window into model representations) and efficiency (routing-based inference optimization).

Key Technical Breakthroughs (2023–2025)

The period from 2023 through 2025 saw several foundational capability developments that extended LLMs beyond text-only tasks and introduced new paradigms for alignment and generalization.

CLIP and Vision-Language Learning

Contrastive Language-Image Pre-training (CLIP), introduced by OpenAI, established a foundational approach to jointly learning text and image representations by training on large numbers of (image, text) pairs using a contrastive objective. Rather than training on a fixed set of labeled categories, CLIP learns to match images with their natural language descriptions, producing representations that transfer to a wide range of visual tasks without task-specific fine-tuning.

CLIP's significance for the LLM ecosystem is twofold. First, the learned image embeddings provide a bridge between vision and language modalities, enabling downstream multimodal systems—including early versions of GPT-4V and Gemini's multimodal capabilities—to ground language representations in visual content. Second, CLIP demonstrated that web-scale supervision from natural language (rather than curated categorical labels) could produce vision models competitive with or superior to fully supervised alternatives on many benchmarks, reinforcing the "large-scale weakly supervised pretraining" paradigm that underlies modern LLMs.

From a safety perspective, CLIP-style vision-language models introduce new dual-use concerns: the same representations that enable useful image understanding also enable automated content classification, surveillance at scale, and generation of targeted visual content.
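The contrastive objective itself is compact: score every (image, text) pair in a batch, then push matching pairs together with a symmetric cross-entropy. A toy sketch on a two-pair batch, with hand-made 3-d embeddings standing in for encoder outputs (the 0.07 temperature follows the CLIP paper; everything else is illustrative):

```python
# Toy version of CLIP's symmetric contrastive loss on a 2-pair batch.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def xent(logits, target):
    """Cross-entropy of softmax(logits) against the index `target`."""
    m = max(logits)
    z = sum(math.exp(l - m) for l in logits)
    return -(logits[target] - m - math.log(z))

img = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # "image" embeddings
txt = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]   # matching "text" embeddings
tau = 0.07                                  # temperature, as in CLIP

# Similarity logits: rows index images, columns index texts.
logits = [[dot(i, t) / tau for t in txt] for i in img]

# Symmetric objective: image->text plus text->image, averaged.
i2t = sum(xent(row, k) for k, row in enumerate(logits)) / len(img)
t2i = sum(xent([logits[r][k] for r in range(len(img))], k)
          for k in range(len(txt))) / len(txt)
loss = (i2t + t2i) / 2
print(f"contrastive loss: {loss:.6f}")  # near zero: pairs already aligned
```

Training drives this loss down over web-scale batches, which is what makes matched images and captions land near each other in the shared embedding space.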

DALL-E and Text-to-Image Generation

OpenAI's DALL-E (2021) and subsequent DALL-E 2 (2022) and DALL-E 3 (2023) demonstrated that autoregressive and diffusion-based models trained jointly on text and images could generate high-quality images from natural language descriptions. DALL-E 2 used CLIP embeddings as an intermediate representation, conditioning image generation on CLIP image embeddings produced from text inputs.

Key developments in the DALL-E line:

  • DALL-E (2021): Demonstrated autoregressive image generation from text using a discrete variational autoencoder for image tokenization
  • DALL-E 2 (2022): Introduced diffusion-based generation conditioned on CLIP embeddings, substantially improving image quality and text-image alignment
  • DALL-E 3 (2023): Integrated with ChatGPT for conversational image generation and focused on accurate text rendering within images, a persistent weakness of earlier systems

The DALL-E API became available in public beta, enabling developers to integrate text-to-image generation into applications. Hierarchical text-conditional image generation using CLIP latents—the technical approach underlying DALL-E 2—established a general framework for conditioning generative models on language representations that has influenced subsequent multimodal architectures including those used in Gemini 2.0's native image generation.

Safety considerations introduced by text-to-image generation include: generation of non-consensual synthetic imagery, disinformation through fabricated photorealistic images, and copyright concerns around training on artist works. OpenAI's DALL-E deployments include content policies and classifiers to filter prohibited content categories, though the effectiveness of such filters against adversarial prompting has been an ongoing area of evaluation.

Image GPT and Generative Modeling Precursors

Image GPT (2020) applied the GPT architecture directly to sequences of pixel values, demonstrating that autoregressive models trained on images could generate coherent visual content and that the learned representations were useful for downstream image classification tasks. While image GPT predates the DALL-E line, it established that the transformer architecture and next-token prediction objective were not inherently language-specific—a conceptual precursor to the unified multimodal models that followed.

Earlier generative modeling research at OpenAI, including work on Generative Models and sparse transformers for long-sequence modeling, provided architectural foundations for handling the high-dimensional sequences required for image and audio generation. Generative Modeling with Sparse Transformers demonstrated that attention patterns could be made sparser without degrading generation quality, enabling longer context windows—a technique relevant to both image generation and the long-context LLMs of 2024–2025.

Weak-to-Strong Generalization and Superalignment

A central challenge for aligning future AI systems is that human supervisors may lack the ability to evaluate superhuman-level outputs directly. If a Superintelligence produces outputs that humans cannot assess, standard RLHF-style alignment breaks down: the reward model trained on human preferences cannot reliably distinguish between genuinely good and superficially good outputs.

OpenAI's Weak-to-Strong Generalization research (2023–2024) investigated this problem empirically by studying whether weak model supervisors can elicit good behavior from strong models. The experimental setup used smaller, weaker LLMs as "supervisors" for larger, stronger models, testing whether the strong model generalizes beyond what the weak supervisor can directly evaluate. Key findings:

  • Strong models trained with weak supervision consistently outperform the weak supervisor on held-out evaluations, demonstrating that weak-to-strong generalization is achievable in current systems
  • The degree of generalization varies substantially by task, with some domains showing near-ceiling strong model performance even under weak supervision and others showing significant performance degradation
  • Scalable oversight techniques—including bootstrapping progressively stronger supervisors and using the strong model's own representations to improve reward modeling—substantially narrow the gap between weak-supervised and fully-supervised strong model performance
  • The research identifies "sandbagging" (the strong model deliberately performing at the weak supervisor's level) as a potential failure mode that must be distinguished from genuine capability limitations

This line of research is directly relevant to the superalignment problem—the challenge of aligning systems that are substantially more capable than their human overseers—and contributed to OpenAI's stated goal of solving the core technical problems of superalignment within four years of the 2023 announcement. The findings suggest that current alignment techniques may generalize further toward superintelligence than previously assumed, though the empirical results are necessarily drawn from LLM-scale experiments rather than from genuinely superhuman systems.

The broader theoretical framing connects to Scalable Oversight approaches: if weak supervisors can reliably elicit and evaluate strong model capabilities in some domains, the problem may be decomposable into finding the right oversight scaffolding rather than requiring human evaluators who are themselves superintelligent.

Reasoning Models: o1 and o3

OpenAI's introduction of the o1 model family in September 2024 marked the operational deployment of inference-time compute scaling at scale. The Learning to Reason with LLMs post described the training approach: o1 models are trained to produce extended chain-of-thought reasoning before answering, with the reasoning process itself optimized via reinforcement learning. The model learns to explore multiple reasoning paths, backtrack when encountering contradictions, and produce final answers that reflect the outcome of this extended deliberation.

From Introducing OpenAI o1, key technical features include:

  • Reasoning tokens are generated but not shown to users by default, functioning as an internal "scratchpad"
  • Performance improves log-linearly with the number of reasoning tokens allocated, up to a task-dependent ceiling
  • The model shows substantially larger gains on tasks requiring multi-step reasoning, mathematical proof, and code debugging compared to tasks well-suited to pattern matching

The o3-mini release extended reasoning capabilities to a smaller, lower-cost model. OpenAI o3-mini documented that o3-mini on high-compute settings achieved performance comparable to o1 on mathematical benchmarks at substantially lower inference cost, suggesting that inference-time compute efficiency is an active optimization target alongside raw capability.

Domain-specific evaluations documented in OpenAI's research program demonstrated o1's practical utility across technical fields: Answering Quantum Physics Questions with OpenAI o1 found that o1 could engage substantively with graduate-level quantum mechanics problems; Decoding Genetics with OpenAI o1 documented performance on genomics interpretation tasks; Economics and Reasoning with OpenAI o1 showed competence on economic modeling and analysis; and Coding with OpenAI o1 documented performance on competitive programming benchmarks.

Formal Mathematics

Solving (Some) Formal Math Olympiad Problems documented early results on applying LLMs to formal mathematical reasoning in proof assistants, connecting natural language mathematical intuition to machine-verifiable proofs. This research line is significant for AI safety because Interpretability provides a pathway to checking AI-generated reasoning with mathematical rigor—a potential component of scalable oversight for technical domains.

Text Embeddings as a Capability Class

Text embeddings—dense vector representations of text that encode semantic meaning—constitute a distinct capability class from generative LLMs, with applications including semantic search, retrieval-augmented generation (RAG), classification, and clustering.

Foundations and Development

OpenAI's early embedding work demonstrated that transformer representations learned during pretraining could be applied to downstream tasks without task-specific fine-tuning. Improving Language Understanding with Unsupervised Learning (GPT-1, 2018) showed that a single pretrained model could produce representations useful across diverse NLP tasks, establishing the transfer learning paradigm underlying modern embedding models.

Introducing Text and Code Embeddings (January 2022) announced the initial OpenAI embedding API, providing access to embeddings derived from GPT-3 family models for both natural language and code. The API enabled developers to compute semantic similarity between texts, cluster documents, and build retrieval systems without training task-specific models.

The technical foundations were elaborated in Text and Code Embeddings by Contrastive Pre-training (Neelakantan et al., 2022), which introduced the Contrastive Pre-trained Transformer (CPT) training approach. CPT trains embedding models using a contrastive objective on (text, related text) pairs, encouraging the model to assign similar vectors to semantically related texts and dissimilar vectors to unrelated ones. Key findings from the paper:

  • Contrastive pre-training on large-scale unlabeled data substantially outperformed fine-tuning pretrained language models on embedding tasks
  • Code and natural language embeddings trained jointly produce representations that support cross-modal retrieval (e.g., natural language queries returning relevant code snippets)
  • The resulting embedding models achieved state-of-the-art performance on semantic textual similarity, text search, and code search benchmarks at the time of publication

New and Improved Embedding Model (December 2022) introduced the text-embedding-ada-002 model, which replaced six earlier embedding models with a single unified model at lower cost and higher performance. The ada-002 model became the de facto standard for OpenAI-based RAG systems for approximately two years.

New Embedding Models and API Updates (January 2024) introduced the text-embedding-3-small and text-embedding-3-large models, offering improvements in multilingual performance and a novel feature: dimensions truncation, which allows users to trade off embedding size against performance by specifying a desired vector dimension smaller than the model's native output. This enables deployment in storage-constrained settings without training separate smaller models.
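A plausible client-side sketch of what dimensions truncation amounts to: keep the leading components of the native vector, then re-normalize to unit length so cosine similarities remain meaningful. This illustrates the idea only; it is not the API's internal implementation, and the values are toy:

```python
# Sketch of dimensions truncation: keep the first d components of a
# native embedding, then re-normalize to unit length.
import math

def truncate_embedding(vec: list[float], d: int) -> list[float]:
    head = vec[:d]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]           # pretend 4-d native embedding
short = truncate_embedding(full, 2)    # request a 2-d version
print(short)
```

The practical payoff is that one model serves many storage budgets: a vector store can hold the short form while similarity rankings stay approximately consistent with the full-dimension embedding.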

Embedding Model Evolution

| Model | Release | Key Feature | Performance Characteristics |
|---|---|---|---|
| text-similarity-* family | 2021 | Task-specific text similarity | Multiple separate models per task type |
| text-embedding-ada-002 | Dec 2022 | Unified model replacing 6 predecessors | Lower cost, higher quality; 1536 dimensions |
| text-embedding-3-small | Jan 2024 | Dimensions truncation, multilingual | Lower cost; configurable dimensions |
| text-embedding-3-large | Jan 2024 | Highest quality, dimensions truncation | Highest MTEB scores; 3072 dimensions natively |

Sources: New and Improved Embedding Model; New Embedding Models and API Updates

Embeddings in RAG Systems

The development of embedding APIs intersected with the growth of retrieval-augmented generation (RAG) architectures, in which an LLM's parametric knowledge is supplemented by retrieved documents identified through embedding similarity search. RAG systems using OpenAI embeddings became a standard deployment pattern for enterprise applications requiring up-to-date or domain-specific knowledge.

The interaction between embedding quality and RAG system factuality is direct: higher-quality embeddings retrieve more relevant documents, reducing the probability that the LLM generates unsupported claims to fill knowledge gaps. However, as the FACTS Grounding benchmark documents, even RAG systems with high-quality retrieval can produce outputs that misrepresent or hallucinate details from retrieved documents—suggesting that embedding quality is necessary but not sufficient for factual accuracy.

Research on in-context knowledge updates finds that retrieval-augmented models exhibit systematic retrieval bias under multiple knowledge updates: when a model's context contains several updates to the same entity or fact, retrieval attention tends to favor earlier or more prominent context positions over later ones, causing the model to sometimes apply outdated retrieved information even when more recent information is present in context. This positional bias is distinct from simple hallucination and requires targeted evaluation methodology to detect.
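The retrieval step at the core of such RAG systems reduces to a nearest-neighbor search over document embeddings; a minimal sketch with synthetic unit vectors:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query.
    Assumes all vectors are L2-normalized, so dot product = cosine similarity."""
    scores = doc_vecs @ query_vec
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 16))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[3] + 0.05 * rng.normal(size=16)   # query close to document 3
query /= np.linalg.norm(query)
top = retrieve(query, docs)
```

Production systems replace the brute-force dot product with approximate nearest-neighbor indexes, but the interface is the same: embedding quality determines whether the right documents reach the model's context at all.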

Safety Implications of Embeddings

Embedding capabilities introduce several safety-relevant considerations:

  • Semantic search for harmful content: High-quality embeddings enable more effective retrieval of harmful information from large corpora, potentially reducing the search cost for users seeking dangerous content
  • Behavioral profiling: Embedding representations of user queries can enable behavioral profiling that raises privacy concerns in deployment contexts
  • Membership inference: Research has shown that embedding models can leak information about their training data through membership inference attacks, with implications for privacy-sensitive fine-tuning scenarios
  • Robustness: Embedding-based retrieval systems can be manipulated through adversarial documents designed to surface in similarity searches for benign queries, a form of indirect prompt injection relevant to RAG security

Foundational Safety Research and Alignment Techniques

GPT-2: Staged Release as Safety Practice

The GPT-2 release in 2019 marked the first occasion where an AI lab deliberately staged a major model release over several months specifically to study potential misuse. OpenAI released progressively larger versions of GPT-2 (117M, 345M, 762M, and finally 1.5B parameters) across February–November 2019, monitoring for evidence of misuse between each stage. The six-month review found no strong evidence of misuse attributable to the staged approach, and OpenAI acknowledged the difficulty of empirically evaluating whether staged release achieved its safety goals. This episode established the practice of conducting deliberate capability-safety reviews before broad model deployment, a practice later formalized in responsible scaling policies.

A central concern motivating the staged release was that models capable of coherent, long-form text generation at scale could be used for disinformation, spam, and other misuse at lower cost than previous methods. OpenAI's subsequent analysis noted that the question of whether to release a model cannot be separated from analysis of what specific harms it enables and at what marginal uplift over existing tools.

Instruction Following and InstructGPT

A critical development between GPT-3 and GPT-4 was the discovery that models fine-tuned to follow instructions using human feedback were substantially preferred by users over much larger models trained only on next-token prediction. OpenAI's Aligning Language Models to Follow Instructions (2022)—the InstructGPT paper—demonstrated that a 1.3B-parameter RLHF-tuned model was preferred over the 175B GPT-3 in head-to-head evaluations by human labelers across a range of tasks. This result was significant for safety because it showed that alignment training and capability were not in fundamental tension: smaller, better-aligned models could outperform larger unaligned ones.

InstructGPT also documented a key limitation of RLHF: the fine-tuned models showed measurable performance regression on standard NLP benchmarks (the "alignment tax"), which OpenAI partially mitigated by mixing pretraining data into the RLHF training. The paper identified sycophancy—models outputting what users want to hear rather than accurate information—as a persistent failure mode stemming from human labeler preferences.

Learning to Summarize and RLHF Foundations

Two key papers established the empirical foundations for applying RLHF to language tasks at scale: Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019), which applied reward modeling and PPO to stylistic continuation and summarization, and Learning to Summarize from Human Feedback (Stiennon et al., 2020), which scaled the approach to abstractive summarization and found that RLHF-tuned models produced summaries preferred to those of much larger supervised baselines.

The summarization work also introduced the practice of training reward models separately from the policy model and using them as a proxy for human judgment, a design choice that has shaped subsequent RLHF applications. An early application to long-form content was Summarizing Books with Human Feedback (Wu et al., 2021), which used recursive decomposition to enable human labelers to evaluate summaries of book-length texts—a technique relevant to scalable oversight of tasks too long for direct human evaluation.

Improving Language Model Behavior Through Curated Data

Improving Language Model Behavior by Training on a Curated Dataset examined whether targeted fine-tuning on small, carefully curated datasets could address specific behavioral problems in deployed LLMs. The research found that training on a relatively small number of high-quality demonstrations of desired behaviors—particularly around sensitive topics, contested claims, and edge cases—could produce measurable improvements in alignment without the computational expense of full RLHF cycles.

This approach complements RLHF by providing a pathway for targeted behavioral correction: rather than retraining the full reward model and running PPO for every observed failure mode, curated dataset fine-tuning enables rapid iteration on specific identified problems. The technique has been adopted in various forms across subsequent model releases, including as a component of GPT-4's safety training.

TruthfulQA and Measuring Falsehood

A persistent failure mode of large language models is that they generate false answers that nonetheless reflect common human misconceptions—a pattern distinct from simple hallucination. TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin et al., 2022) documented this systematically: the benchmark comprises 817 questions spanning 38 categories where the correct answer is counterintuitive or contradicts popular belief.

Key findings from TruthfulQA:

  • Larger models are not more truthful on this benchmark; in some categories, larger models perform worse because they more fluently reproduce common misconceptions
  • The best models at time of evaluation answered truthfully on roughly 58% of questions, versus a human baseline of 94%
  • Model-generated answers that were false were often confidently stated, reducing the signal value of uncertainty expressions

This work established an important distinction between calibration (knowing what you don't know) and truthfulness (correctly answering questions), and demonstrated that scaling alone does not reliably improve truthfulness. TruthfulQA has been used as a standard evaluation in subsequent model releases, including the GPT-4 technical report.

SimpleQA: Factuality on Unambiguous Questions

Introducing SimpleQA introduced a complementary factuality benchmark focused on short-answer questions with a single unambiguous correct answer, in contrast to TruthfulQA's focus on misconception-prone questions. SimpleQA is designed to be resistant to the confounds that affect longer-form factuality evaluations: answers are short enough that partial credit and interpretation artifacts are minimized, and questions are selected to have ground-truth answers that can be verified against authoritative sources.

The benchmark documents that frontier models in 2024 achieve approximately 38–43% accuracy on SimpleQA in standard settings, with calibration analyses showing that models are systematically overconfident in incorrect answers. Notably, larger and more capable models show higher accuracy but do not consistently show better calibration—the ratio of expressed confidence to actual accuracy does not reliably improve with model capability.

WebGPT: Factuality Through Web Browsing

WebGPT: Browser-Assisted Question-Answering with Human Feedback (Nakano et al., 2021) addressed factuality from a different angle: instead of improving parametric knowledge, the system learned to browse the web and cite sources when answering long-form questions. WebGPT was fine-tuned using human feedback to reward answers that were both well-supported by retrieved sources and preferred by evaluators.

Key contributions:

  • Demonstrated that retrieval-augmented generation with citation could substantially reduce unsupported factual claims
  • Found that human evaluators preferred WebGPT answers to GPT-3 answers on 56% of ELI5 questions, and that the model's answers were more factual than those produced without web access
  • Identified a persistent failure mode: the model sometimes selected misleading or unrepresentative sources to support a desired conclusion, foreshadowing later concerns about "faithful but misleading" retrieval-augmented generation

WebGPT represents an early instance of tool-augmented LLM deployment and contributed to the design of subsequent browsing-enabled deployments including ChatGPT's web browsing mode and SearchGPT, a dedicated AI search prototype announced in 2024.

OpenAI Codex and Code Generation

Evaluating Large Language Models Trained on Code (Chen et al., 2021)—the Codex paper—introduced the HumanEval benchmark and reported that Codex solved 28.8% of programming problems in a zero-shot setting, rising to 70.2% with repeated sampling. Codex was fine-tuned from GPT-3 on publicly available code from GitHub and became the foundation of GitHub Copilot.

From a safety perspective, Codex introduced a new dual-use concern: models capable of generating functional code from natural language instructions could assist with both beneficial software development and potentially harmful applications including vulnerability discovery and exploit development. The Codex paper acknowledged these concerns and noted that the responsible disclosure of such capabilities required consideration of both the benefits to legitimate users and the marginal uplift provided to malicious actors.

The HumanEval benchmark—which tests whether generated code passes unit tests on 164 programming problems—has become a standard evaluation metric, though subsequent work has noted that benchmark performance can overstate practical coding ability on real-world repositories with complex dependencies.
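The repeated-sampling numbers above are reported using the Codex paper's unbiased pass@k estimator: from n generated samples of which c pass the tests, it computes the probability that at least one of k randomly drawn samples is correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from the Codex paper.

    n: total samples generated per problem
    c: number of those samples that pass the unit tests
    k: samples drawn (without replacement) when scoring
    """
    if n - c < k:
        return 1.0   # too few failures for a draw of k to miss every success
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Computing `1 - (fraction of all-failure draws)` rather than naively averaging per-sample success avoids the bias that arises when k is close to n.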

Deliberative Alignment

Deliberative Alignment: Reasoning Enables Safer Language Models (2024) introduced a safety training approach in which models are explicitly trained to reason through safety considerations before generating responses to potentially sensitive queries. Rather than relying solely on learned behavioral patterns to refuse or comply with requests, deliberative alignment trains the model to identify relevant safety principles, reason about their application to the specific context, and generate responses consistent with that reasoning.

Key claims from the paper:

  • Models trained with deliberative alignment show reduced rates of both over-refusal (refusing benign requests) and under-refusal (complying with genuinely harmful ones)
  • The reasoning process is observable and partially auditable, providing some transparency into why a given refusal or compliance decision was made
  • The approach transfers more reliably to novel or edge-case inputs than pure behavioral fine-tuning, because the model reasons about principles rather than matching patterns

Deliberative alignment is described as a component of the safety training for OpenAI's o-series reasoning models and has been cited as a mechanism for improving the precision of safety behaviors: distinguishing between superficially similar benign and harmful requests.

RLHF and Alignment Training Foundations

A key technique underlying modern aligned LLMs is Reinforcement Learning from Human Feedback (RLHF), which trains a reward model on human preference comparisons and uses it to fine-tune language model outputs via RL. The foundational application of this approach to language models was demonstrated in Fine-Tuning GPT-2 from Human Preferences (Ziegler et al., 2019), which showed that human feedback could steer model behavior on stylistic tasks. This was later scaled to instruction-following via InstructGPT and then to Claude's Constitutional AI framework.

| RLHF Component | Function | Safety Relevance |
|---|---|---|
| Reward model | Converts human preferences into a differentiable signal | Central to RLHF alignment |
| PPO fine-tuning | Updates language model to maximize reward | Can introduce reward hacking |
| Constitutional AI | Replaces human labelers with model self-critique against principles | Scales alignment oversight; see Constitutional AI |
| Multilingual consistency | Enforces safety behaviors across languages | Align Once, Benefit Multilingually (2025) |
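The reward-model component is typically trained on pairwise comparisons with a Bradley-Terry style loss; a minimal sketch of that loss (a generic formulation with scalar rewards, not any lab's exact recipe):

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss used to train RLHF reward models:
    -log sigmoid(r_chosen - r_rejected), averaged over comparisons.
    Real reward models produce these scalars by scoring full responses."""
    margin = np.asarray(reward_chosen) - np.asarray(reward_rejected)
    # log1p(exp(-m)) is a numerically stable form of -log(sigmoid(m))
    return float(np.mean(np.log1p(np.exp(-margin))))
```

The loss goes to zero as the model scores chosen responses increasingly above rejected ones, converting a ranking signal into a differentiable training target.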

A significant limitation is that RLHF-trained models can exhibit sycophancy—systematically agreeing with user beliefs rather than providing accurate responses—because human raters often prefer confident, agreeable answers. Recent work on multi-objective alignment for psychotherapy contexts and policy-constrained alignment frameworks like PolicyPad (2025) explore how to balance multiple competing alignment objectives.

Multilingual alignment gap: A consistent finding across alignment research is that safety training applied primarily in English does not reliably transfer to other languages. Align Once, Benefit Multilingually (2025) proposes methods to enforce multilingual consistency in safety alignment by training on cross-lingual consistency losses. Related work on Helpful to a Fault (2025) measures illicit assistance rates across 40+ languages in multi-turn interactions, finding that multilingual agents provide more potentially harmful assistance than English-only evaluations would suggest, particularly in low-resource languages where safety fine-tuning data is sparse.

Alignment from User Interactions

A complementary approach to RLHF and constitutional AI involves learning alignment signals directly from user interaction data rather than from curated preference annotations. Work on aligning language models from user interactions explores how implicit feedback signals—such as conversation continuations, follow-up queries, and editing patterns—can be used to infer user preferences and refine model behavior without requiring explicit preference labels. This approach is particularly relevant for deployment-time adaptation, where models can be updated based on real interaction patterns rather than synthetic preference datasets. It raises corresponding safety questions: interactions from adversarial or manipulative users could introduce undesirable behavioral shifts if interaction-based alignment is applied without filtering.

Group Relative Policy Optimization and Recommendation Alignment

A variant of policy optimization applied to recommendation systems uses Group Relative Policy Optimization (GRPO) to align language model-generated recommendations with consistency requirements across user groups. Work on information-consistent LLM recommendations via GRPO addresses the finding that standard RLHF-trained models produce recommendations that are inconsistent across similar users—recommending different items to demographically similar users in ways that reflect training data biases rather than principled personalization. GRPO frames consistency as a group-level constraint within the RL objective, penalizing policies that produce high variance across equivalent user contexts.
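The group-level normalization that gives GRPO its name can be sketched as follows; this shows only the advantage step (the full objective typically adds PPO-style clipping and KL terms), with the epsilon value as an illustrative stabilizer:

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled response's reward
    against the mean and std of its own group of samples, so policy
    updates depend on relative rather than absolute reward."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

Because every group is centered on its own mean, high-variance groups and low-variance groups contribute comparably scaled gradients, which is what supports the consistency constraint described above.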

Reinforcement Learning for Code Generation

Work on EvolveCoder applies reinforcement learning to code generation by using adversarial test case generation to provide richer feedback signals than standard pass/fail unit tests. The approach generates adversarial test cases that probe edge cases and boundary conditions, using these as verifiers during RL training. This addresses a known limitation of HumanEval-style evaluation: models can achieve high pass rates on standard test suites by exploiting common patterns without generalizing to edge cases. Adversarial verification during training yields more robust code generation with better generalization to held-out test conditions.

Similarly, work on multiple-choice question design for reinforcement learning with verifiable rewards (RLVR) finds that the design of distractors—incorrect answer choices—substantially affects the quality of reasoning elicited by RL training. Carefully designed distractors that target specific reasoning errors produce stronger RL training signals than randomly selected wrong answers, enabling more efficient RLVR with smaller datasets.

Test-Time RL Alignment and Benchmark Artifacts

Research on test-time RL alignment finds that applying RL fine-tuning specifically to improve performance on benchmark tasks exposes artifacts related to task familiarity in the base model. When models are fine-tuned at test time to improve on specific benchmarks, the gains are larger for tasks whose format and domain closely match the model's pretraining distribution, and smaller for genuinely novel reasoning tasks. This suggests that benchmark-measured improvements from RL alignment may partially reflect exploitation of familiar patterns rather than generalizable capability improvements—a concern for evaluating whether RL-aligned models are safer in novel deployment contexts.

Mechanistic Interpretability

Layer-Wise Recall vs. Reasoning

Research on disentangling recall and reasoning in transformer models through layer-wise attention and activation analysis provides evidence that these two cognitive functions are partially localized to different layers and attention head types. Specifically:

  • Factual recall—retrieving stored associations between entities and attributes—tends to be mediated by mid-layer attention heads that implement a lookup-like operation over key-value associations in the residual stream
  • Relational reasoning—combining multiple pieces of retrieved information through multi-step inference—engages later layers and broader attention patterns that aggregate across positions
  • The two functions can be partially dissociated: interventions (such as activation patching) that disrupt mid-layer recall heads impair factual retrieval without proportionally impairing multi-step reasoning, and vice versa

This layer-wise dissociation has implications for interpretability methodology: analyses that focus only on attention patterns at a single layer may misattribute recall to reasoning or vice versa, and interventions targeting specific safety-relevant behaviors should account for which layers implement those behaviors.

Activation Steering and Cross-Layer Consistency

Work on global evolutionary steering refines activation steering—the technique of adding learned directions to model activations to modify behavior—by enforcing cross-layer consistency in the steering vectors applied. Standard activation steering applies a single direction vector at one layer; cross-layer consistency methods extend this by finding steering vectors that produce coherent behavioral changes across multiple layers simultaneously, using evolutionary search to optimize for consistency.

The key finding is that inconsistent cross-layer steering can produce fragile or incoherent behavioral changes that do not generalize to held-out prompts, while consistent cross-layer steering produces more robust and transferable behavioral modifications. This has implications for safety applications of activation steering, such as suppressing unsafe behaviors: interventions that are inconsistent across layers may appear effective on evaluation sets while failing on inputs that activate the behavior through different layer pathways.
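The additive intervention underlying all of these methods can be sketched directly; the dict-based interface here is illustrative, and the cross-layer search itself (which the work above performs with evolutionary optimization) is not shown:

```python
import numpy as np

def apply_steering(hidden_states, steering_vecs, alpha=1.0):
    """Add a steering direction to the residual stream at each chosen layer.

    hidden_states: dict {layer_index: (seq, dim) activations}
    steering_vecs: dict {layer_index: (dim,) direction}; layers without an
                   entry pass through unchanged.
    """
    steered = {}
    for layer, acts in hidden_states.items():
        vec = steering_vecs.get(layer)
        steered[layer] = acts + alpha * vec if vec is not None else acts
    return steered
```

Single-layer steering corresponds to a `steering_vecs` dict with one entry; cross-layer consistency methods populate several entries and score the resulting behavior jointly.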

Language Models as Injective Functions

A theoretical result establishes that language models with certain architectural properties are injective—that is, distinct input sequences map to distinct output distributions. This injectivity implies a form of invertibility: in principle, given a model's output distribution, one could recover the input sequence that produced it. The implications for safety include:

  • Membership inference attacks on language models may be more powerful than previously recognized, because injectivity implies that model outputs carry information about inputs
  • Watermarking and attribution schemes for LLM-generated text may be more tractable than assumed, because injectivity supports recovery of the generating input from output characteristics
  • Privacy guarantees for models trained on sensitive data require more careful analysis in light of potential invertibility

The result applies under specific conditions on model architecture and vocabulary size; the practical extent to which frontier models satisfy these conditions is an open empirical question.

Gemma Scope 2 and Sparse Autoencoders

Google DeepMind's Gemma Scope 2 extends the Gemma Scope sparse autoencoder suite—a set of interpretability tools built on the Gemma open-weight model family—with improved training methodology and coverage of additional model components including attention layers and the embedding space. Gemma Scope 2 provides researchers with pre-trained sparse autoencoders that decompose model activations into interpretable features, enabling analysis of which features are active during specific model behaviors.

The sparse autoencoder approach builds on the superposition hypothesis: transformer hidden layers represent more features than dimensions by using superimposed representations that can be disentangled with sparsity-inducing decomposition methods. Gemma Scope 2's contribution includes improved feature quality and broader architectural coverage compared to the original Gemma Scope release, making it a more complete toolkit for mechanistic interpretability research on an accessible open-weight model family.
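The decomposition can be sketched as a minimal sparse autoencoder; dimensions, initialization, and the L1 coefficient here are illustrative, and Gemma Scope's actual training recipe differs in detail:

```python
import numpy as np

class SparseAutoencoder:
    """Minimal SAE: encode residual-stream activations into an overcomplete
    feature basis with a ReLU, decode back, and penalize feature density."""

    def __init__(self, d_model, d_features, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(size=(d_model, d_features)) / np.sqrt(d_model)
        self.W_dec = rng.normal(size=(d_features, d_model)) / np.sqrt(d_features)
        self.b_enc = np.zeros(d_features)

    def encode(self, x):
        # ReLU keeps features non-negative; most are zero after training
        return np.maximum(x @ self.W_enc + self.b_enc, 0.0)

    def loss(self, x, l1_coeff=1e-3):
        f = self.encode(x)
        recon = f @ self.W_dec
        # reconstruction error plus L1 sparsity penalty on feature activations
        return np.mean((x - recon) ** 2) + l1_coeff * np.abs(f).mean()
```

Training minimizes `loss` over a large activation dataset; the sparsity penalty is what forces superimposed features apart into individually interpretable directions.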

Architecture Advances: Mixture-of-Experts

MoE Fundamentals and the Llama 4 Design

Mixture-of-Experts (MoE) architectures replace dense feed-forward layers with a collection of "expert" sub-networks, activating only a subset of experts for each token during inference. This allows total parameter counts to grow substantially while keeping per-token compute roughly constant. Llama 4's adoption of MoE for its natively multimodal open-weight models brought this architecture into widespread open deployment.

Key properties of MoE relevant to safety and capability research:

| Property | Description | Implication |
|---|---|---|
| Sparse activation | Only K-of-N experts activated per token | Reduces inference FLOPs relative to parameter count |
| Expert specialization | Experts tend to specialize by domain or token type | May concentrate dangerous knowledge in specific experts |
| Routing interpretability | Expert selection patterns carry semantic information | Expert routing as interpretability signal |
| Load balancing | Training requires auxiliary losses to avoid expert collapse | Auxiliary losses can interact with alignment training |
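The sparse-activation property can be sketched as a per-token top-k routing step; this is a generic formulation (real routers add load-balancing losses and capacity limits):

```python
import numpy as np

def top_k_route(router_logits, k=2):
    """Sparse MoE gating for one token: select the top-k experts and
    renormalize their router probabilities; unselected experts are never run."""
    top = np.argsort(router_logits)[::-1][:k]   # indices of selected experts
    weights = np.exp(router_logits[top])        # softmax over the chosen logits
    return top, weights / weights.sum()

logits = np.array([0.1, 2.0, -1.0, 1.5])       # one token, four experts
experts, gates = top_k_route(logits, k=2)
```

The token's output is the gate-weighted sum of the selected experts' outputs, which is why per-token compute scales with k rather than with the total expert count.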

Reducing MoE Redundancy

Work on LightMoE addresses a known inefficiency in trained MoE models: many experts learn highly similar representations, contributing redundant computation without adding representational capacity. LightMoE identifies redundant expert pairs through representational similarity analysis and replaces them with single consolidated experts, reducing model size and inference cost while preserving performance. This post-training compression technique is complementary to architecture-level MoE design and is applicable to already-trained models without retraining from scratch.

The expert redundancy finding also reinforces the observation that expert selection in MoE models reveals nearly as much semantic information as the token text itself: experts that encode similar semantic content tend to be routed similarly, making routing patterns a relatively low-noise signal about input semantics.
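Redundancy detection of this kind can be sketched as a pairwise similarity scan over expert outputs on a probe batch; the criterion and threshold here are illustrative assumptions, not LightMoE's exact method:

```python
import numpy as np

def redundant_expert_pairs(expert_outputs, threshold=0.95):
    """Flag expert pairs whose outputs on a shared probe batch are nearly
    identical (high mean cosine similarity): candidates for consolidation.

    expert_outputs: dict {expert_name: (batch, dim) output array}
    """
    names = list(expert_outputs)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = expert_outputs[names[i]], expert_outputs[names[j]]
            cos = np.sum(a * b, axis=1) / (
                np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
            if cos.mean() > threshold:
                pairs.append((names[i], names[j]))
    return pairs
```

Flagged pairs can then be merged into a single expert, with the router's probability mass for both redirected to the consolidated one.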

Expert Pyramid Tuning

Expert Pyramid Tuning proposes a parameter-efficient fine-tuning approach for MoE models that allocates trainable parameters in proportion to each expert's importance for the target task, forming a pyramid-shaped parameter budget (more parameters for high-importance experts, fewer for low-importance ones). This contrasts with uniform LoRA-style approaches that apply the same parameter budget to all model components. Task-specific importance is estimated from activation statistics on task data before fine-tuning, allowing the budget allocation to be informed by actual expert utilization patterns.

Continual Learning in LLMs

The Catastrophic Forgetting Problem

Deploying LLMs in production requires updating model knowledge over time—incorporating new facts, correcting errors, and adapting to new domains—without degrading performance on previously learned capabilities. This challenge, broadly termed continual learning, intersects with several safety-relevant concerns:

  • Knowledge staleness: Models with outdated parametric knowledge may generate plausible but incorrect information about recent events
  • Selective forgetting: Fine-tuning for safety may inadvertently degrade performance on benign capabilities, or vice versa
  • Cumulative drift: Iterative fine-tuning over many update cycles can cause slow drift in model behavior that is difficult to detect incrementally

Research surveying continual learning methods for large language models identifies three main approaches: (1) regularization-based methods that add penalties to prevent large weight changes from previous training; (2) replay-based methods that mix previous training data into fine-tuning batches; and (3) architecture-based methods that add new parameters for new knowledge while freezing existing ones. Each approach involves different trade-offs between plasticity (ability to learn new information) and stability (retention of existing knowledge).
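The regularization-based family (approach 1) can be illustrated with an EWC-style penalty; Elastic Weight Consolidation is one common instance of the family, and the diagonal-Fisher weighting here is the standard simplification:

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam=1.0):
    """EWC-style regularizer: penalize movement of each weight away from
    its pre-update value, scaled by that weight's estimated importance
    (diagonal Fisher information) for previously learned tasks."""
    diff = params - old_params
    return 0.5 * lam * float(np.sum(fisher * diff ** 2))
```

Weights deemed important for old capabilities (high Fisher value) are held nearly fixed, while unimportant weights remain free to absorb new knowledge, which is exactly the plasticity-stability trade-off described above.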

Multimodal Continual Learning

Multimodal continual learning in the context of multimodal large language models (MLLMs) faces additional challenges beyond those in text-only settings. When models are updated across multiple deployment scenarios—each with different distributions of image-text pairs, task types, and domains—the standard catastrophic forgetting problem is compounded by cross-modal interference: updates that improve vision-language alignment in one scenario can degrade it in others.

Work on multimodal continual learning from multi-scenario perspectives proposes scenario-aware replay strategies that maintain a compact memory of representative examples from previous scenarios, using importance sampling to prioritize examples that are most at risk of being forgotten under the current update. The approach tracks both text and vision encoder stability, applying targeted regularization to the encoder components most affected by each update.

Knowledge Editing

Motivation and Approaches

Knowledge editing addresses the problem of updating specific factual knowledge in an LLM's parameters without full retraining. Unlike continual learning, which focuses on updating across broad domains, knowledge editing targets precise, localized changes: replacing outdated facts (e.g., updating who holds a particular office), correcting errors, or removing sensitive information.

Current knowledge editing approaches fall into several categories:

| Approach | Mechanism | Strengths | Limitations |
|---|---|---|---|
| Fine-tuning based | Gradient descent on target facts | Simple; compatible with any architecture | Risk of forgetting; slow for many edits |
| Locate-and-edit (ROME, MEMIT) | Identifies and directly modifies factual storage weights | Fast; targeted | May disrupt related knowledge |
| Meta-learning based | Learns to edit via bi-level optimization | Generalizes across edit types | Higher training cost |
| In-context editing | Provides updated facts in context without weight changes | No weight modification required | Limited to context window; not persistent |
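The locate-and-edit mechanism can be illustrated with a simplified rank-one update in the spirit of ROME; the real method additionally whitens by a key-covariance estimate, which is omitted here for brevity:

```python
import numpy as np

def rank_one_edit(W, key, new_value):
    """Minimal rank-one change to weight matrix W so that the chosen key
    vector now maps to the new value vector, leaving directions orthogonal
    to the key untouched."""
    residual = new_value - W @ key                       # what the output is missing
    return W + np.outer(residual, key) / np.dot(key, key)
```

After the update, `W @ key` equals `new_value` exactly, while inputs orthogonal to `key` are unaffected: the formal sense in which such edits are "targeted", and also why they can still disrupt facts whose keys overlap with the edited one.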

MetaKE: Meta-Learning Aligned Knowledge Editing

MetaKE proposes a meta-learning framework for knowledge editing that uses bi-level optimization to align the knowledge editing process with the model's internal knowledge representations. The outer optimization learns an editing strategy that generalizes across fact types; the inner optimization applies the strategy to specific edits. The bi-level structure allows MetaKE to account for interactions between edited facts and related knowledge that single-level editing methods miss.

A key finding from MetaKE is that naive knowledge editing often introduces inconsistencies—editing one fact without updating related facts that should co-vary with it—and that meta-learning can partially address this by training the editing strategy on datasets that include such co-variation patterns. This consistency problem is directly relevant to safety: editing a model to remove one form of dangerous knowledge while leaving related knowledge intact may create an inconsistent knowledge state that partially restores the dangerous capability through inference from remaining knowledge.

Structural Knowledge Unlearning

GONE (structural knowledge unlearning via neighborhood-expanded distribution shaping) addresses the problem of removing specific knowledge from a trained model more completely than standard fine-tuning-based unlearning. GONE identifies the "knowledge neighborhood"—the set of related facts and reasoning patterns that co-activate with the target knowledge—and shapes the model's output distribution to be inconsistent with the entire neighborhood, not just the directly targeted fact.

The motivation is that standard unlearning methods that only target the specific fact to be forgotten leave the model's knowledge neighborhood intact, allowing the model to partially reconstruct the forgotten fact through inference from related knowledge. Neighborhood-expanded distribution shaping is designed to make such reconstruction more difficult, though the completeness of unlearning remains difficult to verify empirically.

LLM unlearning with LLM beliefs takes a complementary approach: it uses the model's own internal uncertainty estimates (its "beliefs" about a fact) to guide unlearning, focusing gradient updates on facts the model is confident about. This reduces the collateral damage to related knowledge that arises when unlearning updates are applied uniformly regardless of the model's confidence.

Bias and Fairness Evaluation

Gender Bias

Research on whether LLMs have a gender (entropy) bias examines whether LLMs systematically assign lower entropy—more confident, less diverse—probability distributions to text describing one gender versus another. The hypothesis is that if training data contains more stereotyped or formulaic text about one gender, the model will learn to predict such text with higher confidence, manifesting as lower entropy in gender-relevant contexts.

Findings across multiple models suggest that gender entropy bias is present and measurable, though its magnitude varies by task and language domain. The bias is more pronounced in occupational and agentic contexts (describing what a person does or decides) than in descriptive contexts (physical or demographic description). This has implications for applications where LLMs generate text about people—biographical writing, automated job descriptions, performance evaluations—where lower-entropy (more stereotyped) outputs for one gender can reinforce occupational segregation.
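The measurement itself is straightforward: compare the Shannon entropy of a model's next-token distribution for the same prompt with only the subject's gender swapped. The distributions below are invented for illustration; in a real study they would come from a model's softmax outputs over occupation-relevant tokens.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical next-token distributions for "He works as a ..." vs.
# "She works as a ..." (values invented for illustration).
dist_he = {"engineer": 0.6, "doctor": 0.25, "nurse": 0.1, "teacher": 0.05}
dist_she = {"nurse": 0.45, "teacher": 0.3, "doctor": 0.15, "engineer": 0.1}

# A negative gap means the model is more confident (more stereotyped)
# when the subject is male in this occupational context.
gap = entropy(dist_he) - entropy(dist_she)
```

Aggregating this gap over many matched prompt pairs, tasks, and domains is what lets studies report where the bias is most pronounced (occupational and agentic contexts) versus where it is weaker (descriptive contexts).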

Political Bias

Research on whether LLMs can infer political alignment from online conversations finds that frontier LLMs can assign political orientation labels to text samples from online political discussions with above-chance accuracy, even when the text contains no explicit political markers. This capability has dual implications: it can be used for automated political profiling (a privacy and manipulation risk), and it provides evidence that LLMs have internalized associations between language patterns and political identity that may affect output generation.

Work on sectarian preference evaluation (SectEval) extends political bias analysis to sectarian dimensions—religious, regional, and cultural group affiliations—finding that LLMs exhibit latent preferences for certain sectarian positions when generating text on contested topics within religious or cultural communities. These preferences often reflect the distribution of English-language internet text, which overrepresents particular religious traditions and cultural perspectives.

Research on political evasion detection examines whether LLMs can identify when responses to politically sensitive questions are deliberately evasive—avoiding commitment to a position while appearing to engage with the question. This capability is relevant both for evaluating political neutrality in LLM deployments and for detecting evasive behavior in automated political communication.

References

OpenAI is a leading AI research and deployment company focused on building advanced AI systems, including GPT and o-series models, with a stated mission of ensuring artificial general intelligence (AGI) benefits all of humanity. The homepage serves as a gateway to their research, products, and policy work spanning capabilities and safety.

★★★★☆

MemoryArena introduces a benchmark designed to evaluate the long-term memory capabilities of large language models across diverse conversational and task settings. It assesses how well models retain, retrieve, and utilize information over extended interactions, addressing a key limitation in current LLM evaluation frameworks. The benchmark provides structured tasks and metrics to compare memory performance across different architectures and approaches.

★★★☆☆

Partnership on AI (PAI) is a nonprofit coalition of AI researchers, civil society organizations, academics, and companies working to develop best practices, conduct research, and shape policy around responsible AI development. It brings together diverse stakeholders to address challenges including safety, fairness, transparency, and the societal impacts of AI systems. PAI serves as a coordination hub for cross-sector dialogue on AI governance.

★★★☆☆

Google DeepMind introduces FACTS Grounding, an online leaderboard and benchmark evaluating LLMs' ability to generate responses fully grounded in provided context documents up to 32k tokens. The two-phase automated evaluation first checks whether responses fulfill user requests, then assesses factual grounding, using aggregated judge models to reduce evaluation bias. The benchmark includes both public and private leaderboard splits to enable external participation while preserving integrity.

★★★☆☆

OpenAI researchers argue that LLM hallucinations persist because standard training and evaluation procedures reward confident guessing over honest uncertainty, using multiple-choice test analogies to illustrate the misaligned incentives. They propose that evaluation methods should penalize errors more than abstentions, and that models should be trained to express calibrated uncertainty. The accompanying paper formalizes these ideas and demonstrates improvements on benchmarks like SimpleQA.

★★★★☆
6. Emergent Abilities · arXiv · Jason Wei et al. · 2022 · Paper

This paper introduces the concept of 'emergent abilities' in large language models—capabilities that appear in larger models but are absent in smaller ones, making them unpredictable through simple extrapolation of smaller model performance. Unlike the generally predictable improvements from scaling, emergent abilities represent a discontinuous phenomenon where new capabilities suddenly manifest at certain model scales. The authors argue that this emergence suggests further scaling could unlock additional unforeseen capabilities in language models.

★★★☆☆

HalluLens introduces a comprehensive benchmark for evaluating LLM hallucinations, distinguishing between extrinsic hallucinations (content deviating from training data) and intrinsic hallucinations, built on a clear taxonomy. It introduces three new extrinsic hallucination tasks with dynamic test set generation to prevent data leakage and improve robustness. The benchmark addresses the fragmented state of hallucination research by providing a unified framework and publicly releasing the codebase.

Radford et al. trained a multiplicative LSTM on 82 million Amazon reviews to predict the next character, discovering that the model learned, without supervision, a single 'sentiment neuron' highly predictive of sentiment. This representation achieves state-of-the-art accuracy on the Stanford Sentiment Treebank (91.8%) and can match fully supervised systems with 30-100x fewer labeled examples, suggesting large neural networks spontaneously develop interpretable internal representations.

★★★★☆
10. Hoffmann et al. (2022) · arXiv · Jordan Hoffmann et al. · 2022 · Paper

Hoffmann et al. (2022) investigates the optimal allocation of compute budgets between model size and training data for transformer language models. Through extensive experiments training over 400 models ranging from 70M to 16B parameters, the authors find that current large language models are significantly undertrained due to emphasis on model scaling without proportional increases in training data. They propose that compute-optimal training requires equal scaling of model size and training tokens—doubling model size should be accompanied by doubling training data. The authors validate this finding with Chinchilla (70B parameters), which matches Gopher's compute budget but uses 4× more data, achieving superior performance across downstream tasks and reaching 67.5% on MMLU, a 7% improvement over Gopher.

★★★☆☆

The NIST AI RMF is a voluntary, consensus-driven framework released in January 2023 to help organizations identify, assess, and manage risks associated with AI systems while promoting trustworthiness across design, development, deployment, and evaluation. It provides structured guidance organized around core functions and is accompanied by a Playbook, Roadmap, and a Generative AI Profile (2024) addressing risks specific to generative AI systems.

★★★★★

OpenAI introduces GPT-2, a 1.5 billion parameter transformer language model trained on 40GB of internet text, capable of generating coherent multi-paragraph text and performing zero-shot transfer on tasks like translation and summarization. Notably, OpenAI withheld the full model from public release due to concerns about misuse, making this a landmark case in AI deployment ethics and responsible disclosure.

★★★★☆
13. SecCodeBench-V2 (2025) · arXiv · Tiantian Li et al. · 2025 · Paper

This paper demonstrates reversible, nanosecond-scale structural phase transitions between two layered forms of indium selenide (In2Se3) for integrated photonic memory applications. Using high-resolution pair distribution function analysis, the authors reveal that the switching mechanism involves inter-layer shear glide and isosymmetric phase transitions with low reconfigurational entropy, enabling efficient crystalline-to-crystalline transitions rather than the slower amorphous-crystalline transitions typical in optical memristive devices. The work characterizes the broadband refractive index contrast, optical transparency, and loss characteristics of the material in thin films, demonstrating practical feasibility for scalable photonic memory devices.

★★★☆☆

This resource appears to be a 404 error page from OpenAI, meaning the original content about shipping smarter agents is no longer accessible at this URL. The page itself contains only a haiku generated by GPT and does not provide substantive information about the intended topic.

★★★★☆

OpenAI's official system card for GPT-5 documenting the model's safety evaluations, capabilities assessments, and risk mitigations prior to deployment. It covers red-teaming results, alignment measures, and identified limitations across domains including cybersecurity, CBRN risks, and persuasion. The card represents OpenAI's formal safety disclosure process for a frontier model release.

★★★★☆
16. Kaplan et al. (2020) · arXiv · Jared Kaplan et al. · 2020 · Paper

Kaplan et al. (2020) empirically characterize scaling laws for language model performance, demonstrating that cross-entropy loss follows power-law relationships with model size, dataset size, and compute budget across seven orders of magnitude. The study reveals that architectural details like width and depth have minimal impact, while overfitting and training speed follow predictable patterns. Crucially, the findings show that larger models are significantly more sample-efficient, implying that optimal compute-efficient training involves training very large models on modest datasets and stopping before convergence.

★★★☆☆

OpenAI introduces the o1 model series, which uses reinforcement learning to train large language models to reason through complex problems via extended chain-of-thought before responding. The models demonstrate significantly improved performance on challenging benchmarks in mathematics, coding, and scientific reasoning. This represents a major capability advance with implications for both AI applications and AI safety evaluation.

★★★★☆

This paper introduces GPT-1, demonstrating that generative pre-training of a language model on large unlabeled text corpora followed by discriminative fine-tuning on specific tasks yields strong performance across diverse NLP benchmarks. It established the foundational paradigm of unsupervised pre-training plus supervised fine-tuning that underpins modern large language models. The work showed that transformer-based models can learn general-purpose language representations transferable to downstream tasks with minimal task-specific architecture changes.

★★★★☆

This paper proposes a content-based framework for determining when AI systems should refuse cybersecurity-related requests, addressing the challenge of balancing security research utility against potential misuse. The framework provides principled criteria for refusal decisions based on the nature and specificity of requested content rather than topic alone.

★★★☆☆
20. Helpful to a Fault (2025) · arXiv · Kai Yan et al. · 2025 · Paper

MIR-Bench is a benchmark designed to evaluate LLMs' inductive reasoning capabilities in long-context, many-shot settings, testing their ability to induce rules from hundreds to thousands of input-output examples. It addresses gaps in existing benchmarks that focus on few-shot or simple classification tasks, and investigates robustness to erroneous examples and Chain-of-Thought effects.

★★★☆☆

The DeepMind research homepage serves as a portal to Google DeepMind's published research across AI capabilities, safety, and applications. It aggregates papers, blog posts, and project overviews from one of the world's leading AI research labs. The page reflects DeepMind's broad research agenda spanning reinforcement learning, foundation models, and AI safety.

★★★★☆

Google DeepMind's technical report on Gemma 3, a family of open-weight language models designed for broad accessibility and deployment. The report covers model architecture, training methodology, capabilities benchmarks, and safety evaluations. Gemma 3 represents Google's contribution to open-model development with integrated safety measures.

23. AREG Benchmark (2025) · arXiv · Krzysztof Pachucki, Vojtěch Patkóš & Vladimir A. Yerokhin · 2025 · Paper

This paper derives precise formulas for combined nuclear-recoil and finite-nuclear-size effects in hydrogenic systems at orders (Zα)⁵ and (Zα)⁶ without expanding in nuclear charge radius, applicable to both electronic and muonic atoms. A key finding is that the widely-used Breit approximation produces an unphysical linear dependence on nuclear charge radius at order (Zα)⁵, which vanishes in full QED treatment, leaving only the correct quadratic contribution. These results are particularly important for high-precision determinations of nuclear charge radii from muonic atom spectroscopy and isotope shift measurements.

★★★☆☆

Epoch AI analyzes the financial costs of training state-of-the-art AI models, estimating training runs for leading frontier models and projecting how these costs are evolving. The analysis examines compute expenditures, hardware costs, and trends suggesting training costs for top models may reach billions of dollars. This provides crucial empirical grounding for policy and governance discussions around AI development economics.

★★★★☆

Anthropic is an AI safety company focused on building reliable, interpretable, and steerable AI systems. The company conducts frontier AI research and develops Claude, its family of AI assistants, with a stated mission of responsible development and maintenance of advanced AI for long-term human benefit.

★★★★☆

A public leaderboard that benchmarks large language models on their tendency to hallucinate or introduce factual inconsistencies when summarizing documents. It provides a standardized evaluation framework comparing models on 'groundedness' — how faithfully they summarize source material without fabricating information. The leaderboard is regularly updated as new models are released.

★★★☆☆
27. OpenAI: Model Behavior · OpenAI · Rakshith Purushothaman · 2025

This is OpenAI's research overview page describing their work toward artificial general intelligence (AGI). The page outlines OpenAI's mission to ensure AGI benefits all of humanity and highlights their major research focus areas: the GPT series (versatile language models for text, images, and reasoning), the o series (advanced reasoning systems using chain-of-thought processes for complex STEM problems), visual models (CLIP, DALL-E, Sora for image and video generation), and audio models (speech recognition and music generation). The page serves as a hub linking to detailed research announcements and technical blogs across these domains.

★★★★☆

AIMultiple research examines AI hallucination rates, notably citing an approximately 17% citation hallucination rate, meaning roughly 1 in 6 AI responses contains fabricated or inaccurate references. The resource provides analysis of hallucination causes, prevalence across AI systems, and potential mitigation approaches for enterprise and research use.

Anthropic's research page aggregates their work across AI alignment, mechanistic interpretability, and societal impact assessment, all oriented toward understanding and mitigating risks from increasingly capable AI systems. It serves as a central hub for their published findings and ongoing safety-focused investigations.

★★★★☆

OpenAI's Superalignment team introduces a research paradigm for tackling superintelligence alignment by studying whether weak models can supervise stronger ones. They demonstrate that a GPT-2-level supervisor can elicit near GPT-3.5-level performance from GPT-4, showing that strong pretrained models can generalize beyond their weak supervisor's limitations. This provides an empirically tractable analogy for the core challenge of humans supervising superhuman AI.

★★★★☆

This paper presents a method for summarizing entire fiction novels by combining reinforcement learning from human feedback (RLHF) with recursive task decomposition, enabling human supervisors to provide feedback on complex tasks without needing to evaluate the full output themselves. The approach fine-tunes GPT-3 via behavioral cloning and reward modeling, achieving state-of-the-art results on BookSum and competitive results on NarrativeQA.

★★★☆☆
32. Lin et al. (2021) · arXiv · Stephanie Lin, Jacob Hilton & Owain Evans · 2021 · Paper

Lin et al. (2021) introduce TruthfulQA, a benchmark of 817 questions across 38 categories designed to measure whether language models generate truthful answers. The benchmark specifically includes questions where humans commonly hold false beliefs, requiring models to avoid reproducing misconceptions from training data. Testing GPT-3, GPT-Neo/J, GPT-2, and T5-based models revealed that the best model achieved only 58% truthfulness compared to 94% human performance. Notably, larger models performed worse on truthfulness despite excelling at other NLP tasks, suggesting that scaling alone is insufficient and that alternative training objectives beyond text imitation are needed to improve model truthfulness.

★★★☆☆

SimpleQA is an OpenAI benchmark designed to evaluate the factual accuracy of large language models on short, unambiguous questions with single correct answers. It aims to measure 'calibrated uncertainty' and honest factual recall, providing a clean signal for whether models know what they claim to know. The benchmark is intended to track improvements in model honesty and factuality over time.

★★★★☆
34. OpenAI WebGPT behavior · arXiv · Reiichiro Nakano et al. · 2021 · Paper

OpenAI fine-tuned GPT-3 to answer long-form questions by enabling it to search and browse the web in a text-based environment. The model was trained using imitation learning from human demonstrations on the ELI5 dataset, then optimized using reinforcement learning from human feedback. The approach requires models to collect references while browsing to support their answers and improve factual accuracy evaluation. The resulting model outperformed both human demonstrators (56% preference) and top Reddit answers (69% preference), demonstrating the effectiveness of combining behavior cloning with reward model-based optimization.

★★★☆☆
35. Evaluating Large Language Models Trained on Code · arXiv · Mark Chen et al. · 2021 · Paper

This paper introduces Codex, a GPT-based model fine-tuned on publicly available code from GitHub, and evaluates it on HumanEval, a new benchmark for measuring functional correctness in code generation. Codex powers GitHub Copilot and demonstrates significant capability in generating Python solutions to programming problems. The paper also discusses safety considerations including misuse potential and economic impacts on software developers.

★★★☆☆

Related Wiki Pages

Top Related Pages

Concepts

Large Language Models · Superintelligence

Risks

Emergent Capabilities

Other

RLHF · Philip Tetlock · Scalable Oversight · Robin Hanson · GPT-4 · GPT-4o

Approaches

AI Alignment · AI Governance Coordination Technologies

Analysis

AI Uplift Assessment Model · Deceptive Alignment Decomposition Model · Wikipedia Views · SquiggleAI

Policy

EU AI Act · Bletchley Declaration

Key Debates

AI Alignment Research Agendas