Description
The RewardManagerWorker inside the agent loop initializes its reward manager with num_examine=0, which is the training configuration:
```python
# verl_tool/agent_loop/agent_loop.py, line 331
self.reward_manager = load_reward_manager(
    config, tokenizer, num_examine=0, **config.reward_model.get("reward_kwargs", {})
)
```
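For context, `num_examine` in verl-style reward managers controls how many decoded samples get printed for manual inspection (typically 0 for training, 1 for validation). A toy illustration of the train/val split, not the actual `verl_tool` API:

```python
# Toy illustration (names are simplified, not the real classes): two reward
# managers that differ only in num_examine, mirroring the train/val split.
class ToyRewardManager:
    def __init__(self, num_examine):
        # num_examine: how many samples to print for manual inspection
        self.num_examine = num_examine

    def __call__(self, samples):
        for s in samples[: self.num_examine]:
            print("examining:", s)
        return [1.0 if s == "correct" else -1.0 for s in samples]

train_rm = ToyRewardManager(num_examine=0)  # the agent loop's manager
val_rm = ToyRewardManager(num_examine=1)    # what val_reward_fn uses
```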
During rollout, the agent loop computes a reward score per trajectory and attaches it to the batch as rm_scores:
```python
# verl_tool/agent_loop/agent_loop.py
scores = [input.reward_score for input in inputs]
if all(score is not None for score in scores):
    ...
    batch["rm_scores"] = rm_scores
```
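A minimal sketch of that rollout-side logic (simplified types, plain lists instead of tensors): only when every trajectory carries a `reward_score` does the loop attach `rm_scores` to the batch.

```python
# Simplified sketch of the rollout-side attach logic; dicts stand in for the
# real batch and trajectory objects.
def attach_rm_scores(batch, inputs):
    scores = [inp["reward_score"] for inp in inputs]
    if all(score is not None for score in scores):
        # Once this key exists, downstream reward managers short-circuit.
        batch["rm_scores"] = scores
    return batch

batch = attach_rm_scores({}, [{"reward_score": 1.0}, {"reward_score": -1.0}])
# batch now contains "rm_scores": [1.0, -1.0]
```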
The problem is that every reward manager short-circuits when it sees rm_scores already present:
```python
# e.g. verl_tool/workers/reward_manager/mt_torl.py
if "rm_scores" in data.batch.keys():
    ...
    return {"reward_tensor": data.batch["rm_scores"], ...}
```
This means that when `val_reward_fn` (which is loaded with `num_examine=1`) is called during validation, it never runs its own scoring path; it simply returns the pre-computed training scores. As a result, the validation rewards in the wandb logs look low, because they reflect the training path's binary +1/-1 scores for correct/wrong generations.
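The behavior can be reproduced with a toy model of the short-circuit (simplified, not the real `verl_tool` classes; the graded score is made up for illustration): the validation manager returns the cached training scores instead of recomputing.

```python
# Toy reproduction of the described bug: the validation manager's own scoring
# path is dead code whenever the rollout already attached rm_scores.
class ToyMTToRLManager:
    def __init__(self, num_examine):
        self.num_examine = num_examine

    def __call__(self, data_batch):
        # Short-circuit: reuse rollout-time scores if they are present.
        if "rm_scores" in data_batch:
            return {"reward_tensor": data_batch["rm_scores"]}
        # Validation-style scoring path that never runs in practice.
        return {"reward_tensor": [0.83]}  # hypothetical graded score

val_rm = ToyMTToRLManager(num_examine=1)
batch = {"rm_scores": [-1.0]}          # binary score attached by the agent loop
print(val_rm(batch)["reward_tensor"])  # returns [-1.0], not a graded score
```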