Description
The RewardManagerWorker inside the agent loop initializes its reward manager with num_examine=0, which is the training configuration:
```python
# verl_tool/agent_loop/agent_loop.py, line 331
self.reward_manager = load_reward_manager(
    config, tokenizer, num_examine=0, **config.reward_model.get("reward_kwargs", {})
)
```
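For context, `num_examine` in verl-style reward managers controls how many decoded samples get printed for manual inspection (typically 0 for training, 1 for validation). A toy illustration of the train/val split, not the actual `verl_tool` API:

```python
# Toy illustration (names are simplified, not the real classes): two reward
# managers that differ only in num_examine, mirroring the train/val split.
class ToyRewardManager:
    def __init__(self, num_examine):
        # num_examine: how many samples to print for manual inspection
        self.num_examine = num_examine

    def __call__(self, samples):
        for s in samples[: self.num_examine]:
            print("examining:", s)
        return [1.0 if s == "correct" else -1.0 for s in samples]

train_rm = ToyRewardManager(num_examine=0)  # the agent loop's manager
val_rm = ToyRewardManager(num_examine=1)    # what val_reward_fn uses
```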
During rollout, the agent loop computes a reward score per trajectory and attaches it to the batch as rm_scores:
```python
# verl_tool/agent_loop/agent_loop.py
scores = [input.reward_score for input in inputs]
if all(score is not None for score in scores):
    ...
    batch["rm_scores"] = rm_scores
```
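A minimal sketch of that rollout-side logic (simplified types, plain lists instead of tensors): only when every trajectory carries a `reward_score` does the loop attach `rm_scores` to the batch.

```python
# Simplified sketch of the rollout-side attach logic; dicts stand in for the
# real batch and trajectory objects.
def attach_rm_scores(batch, inputs):
    scores = [inp["reward_score"] for inp in inputs]
    if all(score is not None for score in scores):
        # Once this key exists, downstream reward managers short-circuit.
        batch["rm_scores"] = scores
    return batch

batch = attach_rm_scores({}, [{"reward_score": 1.0}, {"reward_score": -1.0}])
# batch now contains "rm_scores": [1.0, -1.0]
```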
The problem is that every reward manager short-circuits when it sees rm_scores already present:
```python
# e.g. verl_tool/workers/reward_manager/mt_torl.py
if "rm_scores" in data.batch.keys():
    ...
    return {"reward_tensor": data.batch["rm_scores"], ...}
```
This means that when `val_reward_fn` (which is loaded with `num_examine=1`) is called during validation, it never runs its own scoring path; it simply returns the pre-computed training scores. As a result, the validation rewards in the wandb logs look low, because they reflect the training path's binary +1/-1 scores for correct/wrong generations.
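The behavior can be reproduced with a toy model of the short-circuit (simplified, not the real `verl_tool` classes; the graded score is made up for illustration): the validation manager returns the cached training scores instead of recomputing.

```python
# Toy reproduction of the described bug: the validation manager's own scoring
# path is dead code whenever the rollout already attached rm_scores.
class ToyMTToRLManager:
    def __init__(self, num_examine):
        self.num_examine = num_examine

    def __call__(self, data_batch):
        # Short-circuit: reuse rollout-time scores if they are present.
        if "rm_scores" in data_batch:
            return {"reward_tensor": data_batch["rm_scores"]}
        # Validation-style scoring path that never runs in practice.
        return {"reward_tensor": [0.83]}  # hypothetical graded score

val_rm = ToyMTToRLManager(num_examine=1)
batch = {"rm_scores": [-1.0]}          # binary score attached by the agent loop
print(val_rm(batch)["reward_tensor"])  # returns [-1.0], not a graded score
```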