Machine Learning

Reward Model

A model trained on human preference data to predict how good a response is. In RLHF (reinforcement learning from human feedback), the reward model scores candidate outputs to guide the language model toward responses humans prefer.

Why It Matters

The reward model is the bridge between human values and machine optimization. Its quality directly determines how well the LLM aligns with human preferences.

Example

A reward model scores response A at 0.85 and response B at 0.42 for the same question, indicating that A is much more aligned with human preferences.
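A common way to train such a scorer (a toy sketch, not any specific library's API) is with a Bradley-Terry pairwise loss: given two responses where humans preferred one, the loss is small when the model scores the preferred response higher and large when it ranks them the wrong way. The function name and scores below are illustrative only.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Small when the preferred (chosen) response already scores higher,
    large when the ranking is reversed.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Using the scores from the example above (0.85 preferred over 0.42):
correct_ranking = preference_loss(0.85, 0.42)   # small loss
reversed_ranking = preference_loss(0.42, 0.85)  # larger loss
print(correct_ranking < reversed_ranking)
```

Minimizing this loss over many human-labeled comparisons is what teaches the reward model to assign higher scores to responses people prefer.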

Think of it like...

Like a judge at a talent show who assigns scores based on audience reactions — they learn what the audience likes and can then rate new performances accordingly.

Related Terms