Reward Model
A model trained on human preference data to predict how good a response is. In RLHF, the reward model scores candidate outputs to steer the language model toward responses humans prefer.
Why It Matters
The reward model is the bridge between human values and machine optimization. Its quality directly determines how well the LLM aligns with human preferences.
Example
A reward model scores response A at 0.85 and response B at 0.42 for the same question, indicating that A is much more aligned with human preferences.
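The scores above can be turned into a preference probability with the Bradley-Terry model commonly used alongside reward models: the chance that humans prefer A over B is a sigmoid of the score difference. A minimal sketch, with the scores taken from the example (all values illustrative):

```python
import math

# Hypothetical reward-model scores for two responses to the same question,
# matching the example above.
score_a = 0.85
score_b = 0.42

def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry: P(chosen preferred) = sigmoid(score difference)."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))

p_a_over_b = preference_probability(score_a, score_b)  # > 0.5, so A is favored
```

A larger score gap pushes the probability toward 1; equal scores give exactly 0.5.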
Think of it like...
Like a judge at a talent show who assigns scores based on audience reactions — they learn what the audience likes and can then rate new performances accordingly.
Related Terms
RLHF
Reinforcement Learning from Human Feedback — a technique used to align language models with human preferences. Human raters rank model outputs, and this feedback trains a reward model that guides further training.
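The "rankings train a reward model" step is typically a pairwise loss: for each (chosen, rejected) pair from human raters, the model is penalized by -log(sigmoid(r_chosen - r_rejected)). A minimal sketch with illustrative scores (not any particular library's API):

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected))."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss is small when the model scores the human-chosen response higher,
# and large when it scores the rejected response higher.
loss_agrees = pairwise_reward_loss(2.0, -1.0)     # model agrees with raters
loss_disagrees = pairwise_reward_loss(-1.0, 2.0)  # model disagrees
```

Minimizing this loss over many ranked pairs teaches the reward model to reproduce human preference orderings.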
DPO
Direct Preference Optimization — a simpler alternative to RLHF that directly optimizes a language model from human preference data without needing a separate reward model. It is typically more stable and easier to implement than full RLHF.
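DPO's trick is to fold the reward model into the loss itself: for each preference pair, it compares the policy's log-probabilities of the chosen and rejected responses against a frozen reference model. A minimal sketch of the per-pair objective (all input values illustrative):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * implicit reward margin)."""
    # Implicit reward of each response is its log-prob shift vs. the reference.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(sigmoid(beta * margin))

# Loss falls as the policy raises the chosen response's likelihood
# relative to the reference, compared with the rejected one.
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-12.0)
```

No reward model is trained or queried; the reference model and the temperature-like `beta` stand in for it.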
Alignment
The challenge of ensuring AI systems behave in ways that match human values, intentions, and expectations. Alignment aims to make AI helpful, honest, and harmless.
Reinforcement Learning
A type of machine learning where an agent learns to make decisions by taking actions in an environment and receiving rewards or penalties. The agent aims to maximize cumulative reward over time through trial and error.
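The trial-and-error loop can be shown on a toy problem: a made-up one-dimensional chain where the agent moves left or right and earns +1 for reaching the right end, learned with tabular Q-learning (environment and hyperparameters are illustrative, not from the source):

```python
import random

N_STATES = 5        # states 0..4; reaching state 4 ends the episode with reward 1
ACTIONS = (-1, +1)  # move left or right

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

random.seed(0)
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
for _ in range(200):                      # episodes of trial and error
    s, done = 0, False
    while not done:
        # Mostly act greedily, but explore 20% of the time.
        if random.random() < 0.2:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: q[(s, act)])
        nxt, r, done = step(s, a)
        best_next = max(q[(nxt, b)] for b in ACTIONS)
        # Q-learning update: nudge toward reward plus discounted future value.
        q[(s, a)] += 0.5 * (r + 0.9 * best_next - q[(s, a)])
        s = nxt
```

After training, the greedy action in every non-terminal state should be "move right", i.e. the agent has learned to maximize cumulative reward from rewards alone.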