
PPO

Proximal Policy Optimization

Definition

PPO is a reinforcement learning algorithm used in RLHF to fine-tune language models against scores produced by a reward model. Its clipped objective limits how far each update can move the policy away from the previous policy, which keeps training stable.
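The clipping idea can be illustrated with a minimal sketch in plain Python. It assumes per-sample log-probabilities under the new and previous policies and precomputed advantage estimates are already available. The function name `ppo_clip_objective` and the default `clip_eps=0.2` are illustrative choices, not something this entry specifies, and a real RLHF setup would typically add further terms (for example a value loss) and run on tensors rather than lists.

```python
import math

def ppo_clip_objective(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    All three inputs are equal-length lists of floats. clip_eps is an
    assumed hyperparameter; 0.2 is a commonly used value.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio between the new and the previous policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to the interval [1 - clip_eps, 1 + clip_eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any gain from moving the ratio
        # outside the clipping interval in the direction the advantage favors.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

With a positive advantage, raising the ratio above `1 + clip_eps` no longer increases the objective, so the update has no incentive to push the policy further from the previous one on that sample.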

PPO-based RLHF was used to train InstructGPT and ChatGPT, though its complexity has led to the adoption of simpler alternatives such as Direct Preference Optimization (DPO).

