Reward model

Reward Model

Definition

A reward model is a neural network trained to score language model outputs according to human preferences, serving as a proxy for human feedback in reinforcement learning from human feedback. Human annotators rank model outputs, and the reward model learns to predict these rankings.

During RLHF, the language model is fine-tuned using PPO to maximize reward model scores while staying close to the original policy.

Related Terms

RLHF

Reinforcement Learning from Human Feedback

PPO

Proximal Policy Optimization

DPO

Data Protection Officer

Alignment

AI Alignment

← Back to Glossary

Ship secure code faster

Crash Override integrates security into the developer workflow. No context switching, no waiting on reviews.

Talk to a Human See the Product