In applying reinforcement learning to large language models (LLMs), reward hacking seems to be the rule. This is unlike pretraining or supervised finetuning, where downstream performance improves ~predictably with compute. I ran into a great example of this phenomenon in basically the first LLM GRPO experiment I tried: writing 20-character TLDRs for Reddit posts.

I used Hugging Face’s (HF) GRPOTrainer on Qwen2.5-1.5B-Instruct, distributing training across four H100s with Accelerate. I lifted my code essentially straight from HF’s GRPOTrainer quickstart, besides adding a KL divergence penalty (the beta coefficient) and some logging.

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Define the reward function, which rewards completions that are close to 20 characters
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(
    output_dir="./Qwen2.5-1.5B-GRPO",
    log_completions=True,  # log sampled completions during training
    beta=0.001,  # coefficient on the KL divergence penalty
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

The reward function was -abs(20 - len(completion)), which for anything longer than 20 characters is effectively 20 - len(completion). But the way GRPO works is that Qwen samples 8 candidate completions per prompt, and each one is trained on the z-score of its reward within that group (this is called the “advantage”). Under pressure to shorten its completions, Qwen instead collapsed into stuffing its 256-token-per-completion limit with random numbers. Every completion hit the same maximum length, so every reward in a group was identical, every z-score became 0, and the model stopped improving.
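To make the zero-advantage failure concrete, here’s a rough sketch of the group normalization (the exact TRL computation differs in details like the epsilon and optional reward scaling, and the reward values below are made up):

import statistics

def group_advantages(rewards, eps=1e-4):
    # Z-score each reward against the other completions sampled for the
    # same prompt; eps keeps the division finite when all rewards match.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Healthy group: varied lengths, varied rewards, nonzero advantages.
print(group_advantages([-3, -10, -25, -70, -8, -41, -15, -230]))

# After the collapse: all 8 completions hit the 256-token cap, every
# reward is identical, and every advantage is ~0 -- no learning signal.
print(group_advantages([-400] * 8))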

Here are the model’s completions after the collapse:

Here’s a view of some metrics tracked throughout training:

Note that at less than 0.01 epochs through the dataset, train/reward and train/reward_std both go flat (the model starts outputting 256 tokens of nonsense for every prompt). However, in becoming repetitive, the model has strayed far from its initial state (i.e., the KL divergence, tracked by train/kl, has become high). Since the GRPO advantage is now consistently zero, the only way left to improve the objective is to shrink the KL penalty term; the optimizer quickly does so, and train/kl hovers around 0.05 for the rest of the run.
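To see why train/kl gets pushed down once the advantage flatlines, it helps to look at the shape of the per-token GRPO objective. This is a simplified sketch of the clipped-surrogate-plus-KL form, not TRL’s exact implementation:

import torch

def grpo_token_loss(logp_new, logp_old, logp_ref, advantage, beta=0.001, eps=0.2):
    # Simplified per-token GRPO objective, sign-flipped into a loss.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    surrogate = torch.minimum(ratio * advantage, clipped * advantage)
    # k3-style estimator of KL(policy || reference)
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    return -(surrogate - beta * kl)

With advantage = 0, the surrogate term vanishes and the only remaining gradient comes from the KL penalty, so the optimizer just pulls the policy back toward the reference model.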

Of course, there are a bunch of ways to avoid this collapse. GRPO a bigger model. Write a good system prompt. Don’t use a high max_completion_length (like GRPOTrainer’s default of 256 tokens), so that the model has an easier time landing on 20 coherent characters. Use a higher KL divergence penalty. Have a strong LLM judge add a penalty term for incoherent output. But it goes to show that good reward design requires nontrivial debugging.
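The config-level fixes are one-liners. Here’s an illustrative (untuned) sketch of two of them; max_completion_length and beta are real GRPOConfig parameters, but the values are just guesses:

training_args = GRPOConfig(
    output_dir="./Qwen2.5-1.5B-GRPO",
    log_completions=True,
    max_completion_length=64,  # well below the 256-token default
    beta=0.04,                 # much stronger KL penalty than the 0.001 above
)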