PPO

Proximal Policy Optimization

Definition

PPO is a reinforcement learning algorithm used in RLHF to fine-tune language models against reward signals from a reward model. It limits how far each update can move the policy from the previous one by clipping the probability ratio between the new and old policies in its objective, which helps keep training stable.
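
The clipping idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training loop: the function name `clipped_surrogate`, its parameter names, and the default `clip_eps=0.2` are assumptions chosen for the example, and the inputs are taken to be per-sample log-probabilities and advantage estimates that have already been computed elsewhere.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (a value to maximize).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current and previous policy. advantages: per-sample advantage estimates.
    All names here are illustrative, not taken from any particular library.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to the interval [1 - clip_eps, 1 + clip_eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

With `clip_eps=0.2`, a sample whose ratio has grown to 2 and whose advantage is positive contributes only 1.2 times its advantage, so the objective offers no extra gain for moving further from the previous policy on that sample. When the advantage is negative the unclipped term is kept, so the penalty for a harmful move is not softened.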

PPO-based RLHF powered InstructGPT and ChatGPT, though its complexity has led to the adoption of simpler alternatives like DPO.

Related Terms

RLHF

Reinforcement Learning from Human Feedback

DPO

Direct Preference Optimization
