Skip to content
ai

Reward model

Reward Model

Definition

A reward model is a neural network trained to score language model outputs according to human preferences, serving as a proxy for human feedback in reinforcement learning from human feedback. Human annotators rank model outputs, and the reward model learns to predict these rankings.

During RLHF, the language model is fine-tuned using PPO to maximize reward model scores while staying close to the original policy.


Ship secure code faster

Crash Override integrates security into the developer workflow. No context switching, no waiting on reviews.