RLHF

Reinforcement Learning from Human Feedback

Definition

RLHF is a training methodology that uses human preference judgments to align language model behavior with human values. Human annotators compare model outputs and rank them, training a reward model on these preferences.

The language model is then fine-tuned with PPO to maximize the reward model's score. RLHF was key to creating ChatGPT-style helpful, harmless, and honest assistants.

Related Terms

DPO

Data Protection Officer

PPO

Proximal Policy Optimization

Ship secure code faster

Crash Override integrates security into the developer workflow. No context switching, no waiting on reviews.

Talk to a Human See the Product