🚀 The feature, motivation and pitch
I need a way to specify exactly which GPU vLLM should use when multiple GPUs are available. Currently, it automatically occupies all available GPUs (https://docs.vllm.ai/en/latest/serving/distributed_serving.html).
For example, something like this: vllm.LLM(model_path, device="cuda:N")
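Spelled out a bit more, this is the kind of usage I have in mind. To be clear, the device="cuda:N" argument is the requested feature, not something vLLM supports today, and the model name is only a placeholder:

```python
from vllm import LLM, SamplingParams

# Requested behaviour (not current vLLM behaviour): pin each engine to one
# specific GPU by index, so several engines can coexist on one machine
# without vLLM grabbing every visible GPU.
model_path = "meta-llama/Llama-2-7b-hf"  # placeholder model

llm_gpu0 = LLM(model=model_path, device="cuda:0")
llm_gpu1 = LLM(model=model_path, device="cuda:1")

outputs = llm_gpu1.generate(["Hello!"], SamplingParams(max_tokens=16))
```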
#691 asks exactly the same question, but that discussion ends with the suggestion to use Ray. I'm asking for a simpler solution that doesn't require extra engineering effort.
Alternatives
My use case doesn't allow me to use CUDA_VISIBLE_DEVICES to pick the GPU. I train a model on multiple GPUs in a DDP-like fashion, where each vLLM instance generates data for the model on its own device, then gradients are synchronized, and so on. So I cannot set CUDA_VISIBLE_DEVICES to a single device, as that would turn the multi-GPU training into single-GPU training.
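Roughly, the training pattern looks like the sketch below. This assumes the proposed device="cuda:N" argument existed; build_model(), train_step(), prompts, num_steps, and the model name are placeholders for my own training code:

```python
import os

import torch
import torch.distributed as dist
from vllm import LLM, SamplingParams

# Sketch only: build_model(), train_step(), prompts and num_steps stand in
# for my actual training code, and device=f"cuda:{local_rank}" is the
# requested vLLM feature, not an existing option.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

ddp_model = torch.nn.parallel.DistributedDataParallel(
    build_model().to(local_rank), device_ids=[local_rank]
)

# Each rank should run its own vLLM engine on its own GPU. I can't restrict
# CUDA_VISIBLE_DEVICES to a single device here, since the same job is doing
# multi-GPU training.
generator = LLM(model="my-base-model", device=f"cuda:{local_rank}")

for step in range(num_steps):
    # Generate training data on this rank's GPU only.
    samples = generator.generate(prompts, SamplingParams(max_tokens=128))
    # Backward pass + optimizer step; DDP synchronizes gradients across ranks.
    train_step(ddp_model, samples)
```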
Also, I cannot avoid the problem by running a vLLM server on a separate GPU, because I need to swap model weights (LoRAs) on the fly, and that is not currently supported (#3446).
Additional context
So I either need a way to specify which GPU to use, or to have PR #3446 completed so that I can run a server instead.