I've created a custom fork based on the faster GRPO trainer PR with some improvements that allow large-scale training on a single node. To summarize, I've done the following:
(1) The policy model, reference model, and vLLM engines now live on the same node.
(2) All GPUs can be used to generate rollouts, and vLLM tensor_parallel_size can be set to values > 1.
(3) The policy model and optimizer states are offloaded to CPU before rollout generation and reloaded to GPU afterwards. I've tested the offloading strategy with both DeepSpeed ZeRO-2 and ZeRO-3 (see the sketch after this list).
(4) Training with num_iterations > 1 is supported.
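For context on point (3), here's a minimal sketch of the offload-before-rollout pattern. The helper names and the plain `model.to(...)` calls are my own illustration, assuming a standard PyTorch optimizer on a single device; the real ZeRO-2/ZeRO-3 paths go through DeepSpeed-specific hooks rather than this simplified version.

```python
import gc
import torch

def move_optimizer_state(optimizer: torch.optim.Optimizer, device: torch.device) -> None:
    """Move all optimizer state tensors (e.g. Adam moments) to `device`."""
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device, non_blocking=True)

def generate_rollouts_with_offload(model, optimizer, vllm_engine, prompts):
    # 1) Free GPU memory held by the policy weights and optimizer states.
    model.to("cpu")
    move_optimizer_state(optimizer, torch.device("cpu"))
    gc.collect()
    torch.cuda.empty_cache()

    # 2) Generate rollouts with vLLM, which now has (almost) the whole GPU.
    outputs = vllm_engine.generate(prompts)

    # 3) Reload the policy and optimizer states before the next update step.
    model.to("cuda")
    move_optimizer_state(optimizer, torch.device("cuda"))
    return outputs
```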
I've been able to do full fine-tuning with Qwen 7B and LoRA fine-tuning with Qwen 14B on a single 8xH100 node.
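As a rough illustration of the LoRA setup, the sketch below uses PEFT with an assumed Qwen 14B checkpoint and assumed ranks/target modules; these are illustrative defaults, not the exact values from my runs.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed checkpoint for illustration only.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    torch_dtype="bfloat16",
)

# Hypothetical LoRA hyperparameters; tune for your own memory budget.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```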
If you're interested, I'm willing to open a PR and share more detailed training logs + evaluation on AIME 24-25.