GRPO Reward Weight Scheduler #36490

@leonardtang

Feature request

It would be great to support dynamic weights for aggregating rewards, i.e., weights that change depending on how far training has progressed.
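To make the request concrete, here is a minimal sketch of what scheduled reward weights could look like. Every name in it (`WeightSchedule`, `constant`, `linear_decay`, `aggregate_rewards`) is hypothetical and not part of any existing trainer API. It also assumes that rewards are combined as a weighted sum and that progress is measured as `step / max_steps`.

```python
from typing import Callable, List, Sequence

# Hypothetical type: maps training progress in [0, 1] to a reward weight.
WeightSchedule = Callable[[float], float]


def constant(value: float) -> WeightSchedule:
    """Weight that stays fixed for the whole run."""
    return lambda progress: value


def linear_decay(start: float, end: float) -> WeightSchedule:
    """Weight that moves linearly from `start` to `end` over training."""
    def schedule(progress: float) -> float:
        p = min(max(progress, 0.0), 1.0)  # clamp progress to [0, 1]
        return start + (end - start) * p
    return schedule


def aggregate_rewards(
    rewards: Sequence[Sequence[float]],   # rewards[i][j]: rollout i, reward function j
    schedules: Sequence[WeightSchedule],  # one schedule per reward function
    step: int,
    max_steps: int,
) -> List[float]:
    """Weighted sum of per-function rewards, using the weights for the current step."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    progress = step / max_steps
    weights = [schedule(progress) for schedule in schedules]
    return [sum(w * r for w, r in zip(weights, row)) for row in rewards]


# Two reward functions: a "true" reward with fixed weight and an
# auxiliary reward whose weight decays to zero by the end of training.
schedules = [constant(1.0), linear_decay(1.0, 0.0)]
rewards = [[0.0, 1.0], [1.0, 0.0]]
early = aggregate_rewards(rewards, schedules, step=0, max_steps=100)
late = aggregate_rewards(rewards, schedules, step=100, max_steps=100)
```

Because the schedule is applied only at aggregation time, each reward function's raw output is left untouched and can still be logged as-is.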

Motivation

There are often rewards that are useful for local updates but whose magnitudes don't make sense globally.

For example, one potential reward ranks the set of rollouts and assigns the top-ranked rollout a maximum reward of 1 and the last-ranked rollout a minimum reward of 0. This is useful locally when true rewards are sparse, but becomes distracting late in training, since the top-ranked rollout always receives a reward of 1.
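A rough sketch of such a rank-based reward is below. The function name `rank_rewards` is hypothetical, and two details are assumptions rather than part of the proposal: intermediate ranks are spaced linearly between 1 and 0, and ties are broken by rollout order.

```python
from typing import List, Sequence


def rank_rewards(scores: Sequence[float]) -> List[float]:
    """Map raw scores to rank-based rewards: best rollout gets 1, worst gets 0."""
    n = len(scores)
    if n == 0:
        return []
    if n == 1:
        return [1.0]  # assumption: a lone rollout counts as top-ranked
    # Indices sorted from highest to lowest score (stable, so ties keep rollout order).
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    rewards = [0.0] * n
    for rank, idx in enumerate(order):
        rewards[idx] = 1.0 - rank / (n - 1)  # linear spacing between 1 and 0
    return rewards


weak_group = rank_rewards([0.02, 0.09, 0.05])
strong_group = rank_rewards([0.2, 0.9, 0.5])
```

Both groups receive identical rewards even though their raw scores differ by a factor of ten. This illustrates why the signal helps when true rewards are sparse but carries no information about absolute quality later in training.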

It is possible to schedule the reward function itself, but that seems less clean. The logs for that reward function would also be misleading.

Your contribution

It should be easy for me to submit a PR for this, but I thought it was worth flagging here first for feedback.
