Multi-GPU vLLM inference with tensor parallelism, colocating policy model + ref model + vLLM engine on the same node #514

Description

@nhannguyen2709

Hello @lewtun @edbeeching,

I've created a custom fork based on the faster GRPO trainer PR with some improvements that allow large-scale training on a single node. To summarize, I've done the following:

(1) The policy model, reference model, and vLLM engines now live on the same node.
(2) All GPUs can be used to generate rollouts, and vLLM's tensor_parallel_size can be set to values > 1.
(3) The policy model and optimizer states are offloaded to CPU before rollout generation and reloaded to GPU afterwards (see the sketch after this list). I've tested the offloading strategies with both DeepSpeed ZeRO-2 and ZeRO-3.
(4) Training with num_iterations > 1 is supported.
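
To make (2) and (3) concrete, here is a minimal sketch of the offload/generate/reload pattern in plain PyTorch + vLLM. It's illustrative only: the helper names, model name, and memory/parallelism values are placeholders, syncing updated policy weights into the vLLM engine is omitted, and the actual fork goes through DeepSpeed's ZeRO-2/ZeRO-3 offloading hooks rather than raw `.to()` moves (partitioned parameters can't simply be moved like this).

```python
import torch
from vllm import LLM, SamplingParams

def offload_to_cpu(model, optimizer):
    """Free GPU memory before rollouts by moving the training state to CPU."""
    model.to("cpu")
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to("cpu")
    torch.cuda.empty_cache()

def reload_to_gpu(model, optimizer, device="cuda"):
    """Restore the training state to GPU once rollouts are done."""
    model.to(device)
    for state in optimizer.state.values():
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device)

# The engine stays resident with a small HBM budget and is sharded over
# two of the node's GPUs via tensor parallelism (values are illustrative).
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct",
          tensor_parallel_size=2,
          gpu_memory_utilization=0.3)

def generate_rollouts(prompts, policy_model, optimizer):
    offload_to_cpu(policy_model, optimizer)   # free HBM for generation
    outputs = llm.generate(
        prompts, SamplingParams(temperature=1.0, max_tokens=1024))
    reload_to_gpu(policy_model, optimizer)    # resume training
    return outputs
```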

I've been able to do full fine-tuning with Qwen 7B and LoRA fine-tuning with Qwen 14B on a single 8xH100 node.
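
For the LoRA runs, the adapter wiring follows the standard peft setup; a minimal sketch (the rank, alpha, and target modules below are illustrative, not the exact values from my runs):

```python
from peft import LoraConfig

# Illustrative adapter hyperparameters; not the exact values I trained with.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# The config is then passed to the trainer, e.g. via GRPOTrainer's
# peft_config argument as in upstream TRL.
```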

If you're interested, I'm willing to open a PR and share more detailed training logs + evaluations on AIME 24-25.
