AI Alignment Tax
The performance cost of making AI models safer and more aligned with human values. Safety training sometimes reduces raw capability on certain tasks.
Why It Matters
The alignment tax puts a number on the tension between safety and capability. Reducing this tax is a key research goal: the smaller the tax, the weaker the incentive to deploy models without safety training.
Example
A model that scores slightly lower on coding benchmarks after RLHF safety training, because it now refuses to generate malicious code that the benchmark counts as correct.
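The tax in this example can be expressed as the capability gap between the base model and the safety-trained model on the same benchmark. A minimal sketch, using hypothetical pass rates rather than real model results:

```python
def alignment_tax(base_score: float, aligned_score: float) -> float:
    """Capability lost to safety training, as a fraction of the base score."""
    return (base_score - aligned_score) / base_score

# Hypothetical benchmark pass rates (not measurements of any real model)
tax = alignment_tax(base_score=0.82, aligned_score=0.78)
print(f"{tax:.1%}")  # → 4.9% relative capability drop
```

A caveat the example itself raises: if the benchmark counts refusals of malicious requests as failures, part of this measured "tax" is the benchmark mis-scoring desired behavior, not a genuine capability loss.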
Think of it like...
Like the fuel economy cost of adding safety features to a car — airbags and seatbelts add weight, slightly reducing performance, but the safety is worth it.
Related Terms
Alignment
The challenge of ensuring AI systems behave in ways that match human values, intentions, and expectations. Alignment aims to make AI helpful, honest, and harmless.
RLHF
Reinforcement Learning from Human Feedback — a technique used to align language models with human preferences. Human raters rank model outputs, and this feedback trains a reward model that guides further training.
AI Safety
The research field focused on ensuring AI systems operate reliably, predictably, and without causing unintended harm. It spans from technical robustness to long-term existential risk concerns.
Evaluation
The systematic process of measuring an AI model's performance, safety, and reliability using various metrics, benchmarks, and testing methodologies.
Responsible AI
An approach to developing and deploying AI that prioritizes ethical considerations, fairness, transparency, accountability, and societal benefit throughout the entire AI lifecycle.
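The reward-modeling step described under RLHF above can be sketched with a pairwise preference loss. This is a minimal illustration of the Bradley-Terry form commonly used for training on ranked outputs; the scalar rewards here are hypothetical stand-ins for a learned reward model's scores:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Small when the reward model scores the human-preferred output higher,
    large when it prefers the rejected output.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))  # numerically stable -log(sigmoid)

# Hypothetical reward scores, not from any real model.
# A correct ranking (chosen > rejected) yields a lower loss than the reverse.
print(preference_loss(2.0, 0.5))  # small: ranking agrees with human raters
print(preference_loss(0.5, 2.0))  # large: ranking disagrees
```

Minimizing this loss over many human-ranked pairs is what turns raw preference data into a reward signal that guides further training.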