Training Methods (Overview)
Training methods for alignment focus on shaping model behavior during the learning process.
Core Approaches:
- RLHF: Reinforcement Learning from Human Feedback - the foundation of modern alignment training
- Constitutional AI: Self-critique and revision guided by an explicit set of written principles
- Preference Optimization: Direct preference learning on comparison data without a separate reward model (DPO, IPO); a minimal DPO loss sketch follows this list
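To make the preference-optimization entry concrete, here is a minimal sketch of the DPO loss in PyTorch. It assumes summed per-response token log-probabilities under the policy and a frozen reference model have already been computed; the function name and the `beta` default are illustrative, not taken from any particular library.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of (chosen, rejected) response pairs.

    Each argument is a 1-D tensor of summed token log-probabilities for the
    corresponding responses under the policy or the frozen reference model.
    """
    # How far the policy has moved relative to the reference on each response.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # The loss favours a larger margin for the chosen response over the rejected one.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

In practice the log-probabilities typically come from a forward pass over prompt plus response tokens, with the prompt tokens excluded from the sum.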
Specialized Techniques:
- Process Supervision: Rewarding reasoning steps, not just outcomes
- Reward Modeling: Learning a scalar model of human preferences from pairwise comparisons (a loss sketch follows this list)
- Refusal Training: Teaching models to decline harmful requests
- Adversarial Training: Robustness through adversarial examples
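As a concrete reference for the reward-modeling entry, the sketch below shows the standard pairwise (Bradley-Terry style) loss. It assumes the reward model has already produced scalar scores for the chosen and rejected responses; the function name is a placeholder.

```python
import torch.nn.functional as F

def pairwise_reward_loss(rewards_chosen, rewards_rejected):
    """Bradley-Terry style loss: chosen responses should score above rejected ones."""
    # Negative log-probability that "chosen" beats "rejected" under a logistic preference model.
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()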
Advanced Methods:
- Weak-to-Strong Generalization: Whether weak supervisors can elicit the capabilities of stronger models (see the sketch after this list)
- Capability Unlearning: Removing dangerous knowledge
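As a rough illustration of the weak-to-strong question, the sketch below shows the basic experimental loop: fine-tune a strong model on a weak supervisor's labels, then compare both against held-out ground truth. All names here (`weak_model`, `finetune`, `evaluate`) are hypothetical placeholders, not a specific framework's API.

```python
def weak_to_strong_run(weak_model, strong_model, inputs, labels, finetune, evaluate):
    """One round of the weak-to-strong setup.

    `weak_model` and `strong_model` map an input to a label;
    `finetune(model, inputs, targets)` returns a trained model and
    `evaluate(model, inputs, labels)` returns an accuracy. All are placeholders.
    """
    # The strong model never sees ground truth, only the weak supervisor's labels.
    weak_labels = [weak_model(x) for x in inputs]
    student = finetune(strong_model, inputs, weak_labels)
    # The question: does the trained student exceed its supervisor's accuracy?
    return evaluate(student, inputs, labels), evaluate(weak_model, inputs, labels)
```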