I've created a custom fork based on the faster GRPO trainer PR with some improvements that allow large-scale training on a single node. To summarize, I've done the following:
(1) The policy model, reference model, and vLLM engines now live on the same node.
(2) All GPUs can be used to generate rollouts, and vLLM tensor_parallel_size can be set to values > 1.
(3) The policy model and optimizer states are offloaded to CPU before rollout generation and reloaded to GPU afterwards. I've tested the offloading strategy with both DeepSpeed ZeRO-2 and ZeRO-3 (see the sketch after this list).
(4) Training with num_iterations > 1 is supported.
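For context on point (3), here's a minimal sketch of the offload-before-rollout pattern. The helper names and the plain `model.to(...)` calls are my own illustration, assuming a standard PyTorch optimizer on a single device; the real ZeRO-2/ZeRO-3 paths go through DeepSpeed-specific hooks rather than this simplified version.

```python
import gc
import torch

def move_optimizer_state(optimizer: torch.optim.Optimizer, device: torch.device) -> None:
    """Move all optimizer state tensors (e.g. Adam moments) to `device`."""
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device, non_blocking=True)

def generate_rollouts_with_offload(model, optimizer, vllm_engine, prompts):
    # 1) Free GPU memory held by the policy weights and optimizer states.
    model.to("cpu")
    move_optimizer_state(optimizer, torch.device("cpu"))
    gc.collect()
    torch.cuda.empty_cache()

    # 2) Generate rollouts with vLLM, which now has (almost) the whole GPU.
    outputs = vllm_engine.generate(prompts)

    # 3) Reload the policy and optimizer states before the next update step.
    model.to("cuda")
    move_optimizer_state(optimizer, torch.device("cuda"))
    return outputs
```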
I've been able to do full fine-tuning with Qwen 7B and LoRA fine-tuning with Qwen 14B on a single 8xH100 node.
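As a rough illustration of the LoRA setup, the sketch below uses PEFT with an assumed Qwen 14B checkpoint and assumed ranks/target modules; these are illustrative defaults, not the exact values from my runs.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed checkpoint for illustration only.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    torch_dtype="bfloat16",
)

# Hypothetical LoRA hyperparameters; tune for your own memory budget.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```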
If you're interested, I'm willing to open a PR and share more detailed training logs + evaluation on AIME 24-25.