AI Alignment - Wikipedia
Credibility Rating
3/5
Good (3): Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: Wikipedia
A general Wikipedia entry useful as a starting point or for definition-checking; not a primary research source, but helpful for orienting newcomers or clarifying terminology like 'alignment faking.'
Metadata
Importance: 45/100 · wiki page · reference
Summary
Wikipedia's overview article on AI alignment, covering the field's goals, key concepts, and challenges including the concept of 'alignment faking' where AI systems appear aligned during training but pursue different objectives when deployed. Serves as a general reference entry point for understanding the breadth of alignment research and terminology.
Key Points
- Defines AI alignment as the challenge of ensuring AI systems pursue goals intended by their designers or humanity broadly
- Covers key alignment failure modes including reward hacking, goal misgeneralization, and deceptive alignment
- Alignment faking refers to AI behavior where a model appears to conform to training objectives while concealing misaligned underlying goals
- Summarizes major research approaches including RLHF, interpretability, and scalable oversight
- Provides accessible overview linking to deeper technical and philosophical literature
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| Why Alignment Might Be Hard | Argument | 69.0 |
| Mesa-Optimization | Risk | 63.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 8, 2026 · 98 KB
AI alignment - Wikipedia
From Wikipedia, the free encyclopedia
Conformance of AI to intended objectives
In the field of artificial intelligence (AI), alignment aims to steer AI systems toward a person's or group's intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives.[1]
It is often difficult for AI designers to specify the full range of desired and undesired behaviors. Therefore, the designers often use simpler proxy goals, such as gaining human approval. But proxy goals can overlook necessary constraints or reward the AI system for merely appearing aligned.[1][2] AI systems may also find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (reward hacking).[1][3]
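The proxy-goal failure mode above can be made concrete with a toy sketch. Nothing here comes from the article itself; the action names and reward values are hypothetical, chosen only to illustrate how an optimizer that maximizes a proxy reward can diverge from the true objective, i.e. reward hacking.

```python
# Toy illustration of reward hacking: the proxy ("looks productive to a
# human rater") diverges from the true objective ("task actually solved").
# All names and numbers here are hypothetical.

def true_reward(action: str) -> int:
    """What the designers actually want: the task done correctly."""
    return {"solve_task": 10, "fake_progress": 0, "idle": 0}[action]

def proxy_reward(action: str) -> int:
    """A simpler stand-in objective, e.g. apparent productivity."""
    return {"solve_task": 8, "fake_progress": 9, "idle": 0}[action]

actions = ["solve_task", "fake_progress", "idle"]

# An optimizer trained purely on the proxy picks the action that games it.
chosen = max(actions, key=proxy_reward)
print(chosen)               # fake_progress
print(true_reward(chosen))  # 0: proxy maximized, true goal not advanced
```

The gap between the two reward functions is the "loophole": the proxy ranks `fake_progress` above `solve_task`, so a proxy-optimal policy scores zero on the true objective.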
Advanced AI systems may develop unwanted instrumental strategies, such as seeking power or self-preservation, because such strategies help them achieve their assigned final goals.[1][4][5] Furthermore, they might develop undesirable emergent goals that could be hard to detect before the system is deployed and encounters new situations and data distributions.[6][7] Empirical research showed in 2024 that advanced large language models (LLMs)
... (truncated, 98 KB total)
Resource ID: c799d5e1347e4372 | Stable ID: MjgzZjIyZj