A brief example of reward hacking in GRPO
In applying reinforcement learning to large language models (LLMs), reward hacking seems to be the rule. This is unlike pretraining or supervised finetuning, where downstream performance improves ~predictably with compute. I came across a great example of this phenomenon in basically the first LLM GRPO experiment I tried: writing 20-character TLDRs for Reddit posts.
I used Hugging Face's (HF) GRPOTrainer on Qwen2.5-1.5B-Instruct, distributing training across four H100s with Accelerate. I lifted my code essentially straight from HF's GRPOTrainer quickstart, besides adding a KL divergence penalty (the beta coefficient) and some logging.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Define the reward function, which rewards completions that are close to 20 characters
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(
    output_dir="./Qwen2.5-1.5B-GRPO",
    log_completions=True,
    beta=0.001,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
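For completeness: assuming the script above is saved as, say, train_grpo.py (a filename I'm making up), the four-GPU run can be launched with Accelerate along the lines of

accelerate launch --num_processes 4 train_grpo.py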
The reward function was (effectively) 20 - len(Qwen's completion). But the way GRPO works is that Qwen outputs 8 candidate completions per prompt, and each completion's reward is normalized to a z-score within that group (this is called the "advantage"). In a bid to make its completions shorter, Qwen fell into stuffing its 256-token-per-completion limit with random numbers. The length of each completion became 256, so every completion in a group earned the same reward, every advantage became 0, and the model stopped improving.
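To see why identical completions kill the learning signal, here's a minimal sketch of the group-relative advantage computation (not TRL's internals; groups of 4 for brevity, and the epsilon is just my guard against dividing by zero):

from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    # GRPO-style group-relative advantage: each completion's reward,
    # z-scored against the other completions in its group.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Healthy group: completions of different lengths get different rewards,
# so some advantages are positive and some negative.
print(group_advantages([-5, -30, -80, -200]))

# Collapsed group: every completion hits the length cap, every reward is
# identical, and every advantage is (effectively) zero.
print(group_advantages([-236, -236, -236, -236]))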
Here are the model’s completions after the collapse:
Here’s a view of some metrics tracked throughout training:
Note that less than 0.01 epochs into the dataset, train/reward and train/reward_std both go flat (the model starts outputting 256-character nonsense for every prompt). However, in becoming repetitive, the model has strayed far from its initial state (i.e., the KL divergence, tracked by train/kl, has become high). Since the GRPO advantage is now consistently zero, the objective can only be maximized by driving the KL divergence penalty down; the optimizer quickly does this, and train/kl hovers around 0.05 for the rest of the run.
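As a back-of-the-envelope illustration of why the optimizer then turns its attention to the KL term (this is a cartoon of the real clipped, per-token GRPO loss, not TRL's implementation):

def grpo_loss_sketch(advantage, policy_ratio, kl, beta=0.001):
    # Heavily simplified, unclipped per-completion GRPO loss:
    # maximize the advantage-weighted probability ratio, and pay
    # beta * KL for drifting from the reference model.
    return -(advantage * policy_ratio) + beta * kl

# With every advantage stuck at zero, the first term vanishes, so the only
# way left to lower the loss is to shrink the KL divergence.
print(grpo_loss_sketch(advantage=0.0, policy_ratio=1.0, kl=40.0))   # 0.04
print(grpo_loss_sketch(advantage=0.0, policy_ratio=1.0, kl=0.05))   # 5e-05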
Of course, there are a bunch of ways to avoid this collapse. GRPO a bigger model. Write a good system prompt. Don't have a high max_completion_length (like GRPOTrainer's default of 256 tokens), so that the model has an easier time getting to 20 coherent characters. Have a higher KL divergence penalty. Have a strong LLM judge provide a penalty term for incoherent output. But this goes to show that good reward design requires nontrivial debugging.
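For what it's worth, a couple of these mitigations are one-line changes to the GRPOConfig above (the parameter names are TRL's, but the specific values here are illustrative guesses, not tuned):

from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="./Qwen2.5-1.5B-GRPO",
    log_completions=True,
    beta=0.01,                  # stronger KL penalty toward the reference model
    max_completion_length=64,   # don't hand the model 256 tokens to fill with junk
)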