Training Methods (Overview)
Training methods for alignment focus on shaping model behavior during the learning process.
Core Approaches:
- RLHF: Reinforcement Learning from Human Feedback - the foundation of modern alignment training
- Constitutional AI: Self-critique and revision guided by an explicit set of written principles
- Preference Optimization: Direct preference learning on comparison data without a separate reward model (DPO, IPO); a minimal DPO loss sketch follows this list
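To make the preference-optimization entry concrete, here is a minimal sketch of the DPO loss in PyTorch. It assumes summed per-response token log-probabilities under the policy and a frozen reference model have already been computed; the function name and the `beta` default are illustrative, not taken from any particular library.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of (chosen, rejected) response pairs.

    Each argument is a 1-D tensor of summed token log-probabilities for the
    corresponding responses under the policy or the frozen reference model.
    """
    # How far the policy has moved relative to the reference on each response.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # The loss favours a larger margin for the chosen response over the rejected one.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

In practice the log-probabilities typically come from a forward pass over prompt plus response tokens, with the prompt tokens excluded from the sum.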
Specialized Techniques:
- Process Supervision: Rewarding reasoning steps, not just outcomes
- Reward Modeling: Learning a scalar model of human preferences from pairwise comparisons (a loss sketch follows this list)
- Refusal Training: Teaching models to decline harmful requests
- Adversarial Training: Robustness through adversarial examples
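As a concrete reference for the reward-modeling entry, the sketch below shows the standard pairwise (Bradley-Terry style) loss. It assumes the reward model has already produced scalar scores for the chosen and rejected responses; the function name is a placeholder.

```python
import torch.nn.functional as F

def pairwise_reward_loss(rewards_chosen, rewards_rejected):
    """Bradley-Terry style loss: chosen responses should score above rejected ones."""
    # Negative log-probability that "chosen" beats "rejected" under a logistic preference model.
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()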
Advanced Methods:
- Weak-to-Strong Generalization: Whether weak supervisors can elicit the capabilities of stronger models (see the sketch after this list)
- Capability Unlearning: Removing dangerous knowledge
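As a rough illustration of the weak-to-strong question, the sketch below shows the basic experimental loop: fine-tune a strong model on a weak supervisor's labels, then compare both against held-out ground truth. All names here (`weak_model`, `finetune`, `evaluate`) are hypothetical placeholders, not a specific framework's API.

```python
def weak_to_strong_run(weak_model, strong_model, inputs, labels, finetune, evaluate):
    """One round of the weak-to-strong setup.

    `weak_model` and `strong_model` map an input to a label;
    `finetune(model, inputs, targets)` returns a trained model and
    `evaluate(model, inputs, labels)` returns an accuracy. All are placeholders.
    """
    # The strong model never sees ground truth, only the weak supervisor's labels.
    weak_labels = [weak_model(x) for x in inputs]
    student = finetune(strong_model, inputs, weak_labels)
    # The question: does the trained student exceed its supervisor's accuracy?
    return evaluate(student, inputs, labels), evaluate(weak_model, inputs, labels)
```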