Alignment
The challenge of ensuring AI systems behave in ways that match human values, intentions, and expectations. Alignment aims to make AI helpful, honest, and harmless.
Why It Matters
Misaligned AI could be highly capable but pursue goals humans did not intend. Alignment is considered one of the most important problems in AI safety.
Example
An AI trained to maximize user engagement that learns to show outrage-inducing content because it gets more clicks — technically succeeding at its goal but causing harm.
Think of it like...
Like raising a child — you want them to be capable and independent, but you also need them to have good values and judgment, not just follow rules blindly.
Related Terms
AI Safety
The research field focused on ensuring AI systems operate reliably, predictably, and without causing unintended harm. It spans from technical robustness to long-term existential risk concerns.
RLHF
Reinforcement Learning from Human Feedback — a technique used to align language models with human preferences. Human raters rank model outputs, and this feedback trains a reward model that guides further training.
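The core of the reward-model step can be sketched with the pairwise (Bradley-Terry) loss commonly used in RLHF: given a reward score for the response a human preferred and one for the response they rejected, the loss is small when the model ranks the preferred response higher. This is a minimal, self-contained illustration, not any particular lab's implementation.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss for reward-model training:
    -log(sigmoid(r_chosen - r_rejected)). It is near zero when the
    reward model scores the human-preferred response well above the
    rejected one, and grows as the ranking is reversed."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ordered pair incurs a small loss...
good = preference_loss(2.0, -1.0)
# ...while a misordered pair incurs a large one, pushing the reward
# model toward agreement with the human rankings.
bad = preference_loss(-1.0, 2.0)
```

Once trained, the reward model scores candidate outputs so that reinforcement learning can steer the language model toward responses humans would prefer.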
Constitutional AI
An alignment approach developed by Anthropic where AI models are guided by a set of principles (a 'constitution') that help them self-evaluate and improve their responses without relying solely on human feedback.
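The critique-and-revise loop can be sketched in miniature. In practice both `critique` and `revise` are calls to the model itself; here trivial rule-based stand-ins (hypothetical helpers, not Anthropic's actual pipeline) play those roles so the control flow is visible.

```python
# Simplified sketch of a Constitutional AI self-improvement pass:
# the model checks its own draft against each principle in the
# constitution, and revises whenever a critique is raised.
CONSTITUTION = [
    "Avoid insulting language.",
    "Be concise.",
]

def critique(draft, principle):
    """Stand-in for asking the model to critique its draft against one
    principle. A keyword check substitutes for the model's judgment."""
    if principle.startswith("Avoid") and "stupid" in draft:
        return "Draft contains insulting language."
    return None

def revise(draft, feedback):
    """Stand-in for asking the model to rewrite the draft in light of
    the critique it just produced."""
    return draft.replace("stupid", "mistaken")

def constitutional_pass(draft):
    """One pass over the constitution: critique, then revise if needed."""
    for principle in CONSTITUTION:
        feedback = critique(draft, principle)
        if feedback:
            draft = revise(draft, feedback)
    return draft
```

The point of the structure is that the feedback signal comes from the model's own evaluations against written principles, reducing reliance on human raters for every example.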
AI Ethics
The study of moral principles and values that should guide the development and deployment of AI systems. It addresses questions of fairness, accountability, transparency, privacy, and the societal impact of AI.
Reward Hacking
When an AI system finds unintended ways to maximize its reward signal that do not align with the designer's actual goals. The system technically optimizes the metric but violates the spirit of the objective.
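The gap between the measured proxy and the designer's intent can be shown with a toy version of the engagement example above. The candidate items, click counts, and harm weights are invented for illustration.

```python
def engagement_reward(content):
    """The proxy the system actually optimizes: clicks alone."""
    return content["clicks"]

def intended_value(content):
    """What the designer actually wanted: engagement net of harm.
    The harm weight of 10 is an arbitrary illustrative choice."""
    return content["clicks"] - 10 * content["outrage"]

candidates = [
    {"name": "balanced article", "clicks": 50, "outrage": 0},
    {"name": "outrage bait",     "clicks": 80, "outrage": 5},
]

# Optimizing the proxy selects the harmful item; optimizing the
# intended objective does not. The divergence is the reward hack.
best_by_proxy = max(candidates, key=engagement_reward)
best_by_intent = max(candidates, key=intended_value)
```

The system "succeeds" by the metric it was given while violating the objective the metric was meant to stand in for.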