🚀 The feature, motivation and pitch
I need a way to specify exactly which GPU vLLM should use when multiple GPUs are available. Currently, it automatically occupies all available GPUs (https://docs.vllm.ai/en/latest/serving/distributed_serving.html).
For example, something like this: vllm.LLM(model_path, device="cuda:N")
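Spelled out a bit more, this is the kind of usage I have in mind. To be clear, the device="cuda:N" argument is the requested feature, not something vLLM supports today, and the model name is only a placeholder:

```python
from vllm import LLM, SamplingParams

# Requested behaviour (not current vLLM behaviour): pin each engine to one
# specific GPU by index, so several engines can coexist on one machine
# without vLLM grabbing every visible GPU.
model_path = "meta-llama/Llama-2-7b-hf"  # placeholder model

llm_gpu0 = LLM(model=model_path, device="cuda:0")
llm_gpu1 = LLM(model=model_path, device="cuda:1")

outputs = llm_gpu1.generate(["Hello!"], SamplingParams(max_tokens=16))
```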
#691 asks exactly the same question, but that discussion ends with the suggestion to use Ray. I'm asking for a simpler solution that doesn't require extra engineering effort.
Alternatives
My use case doesn't allow me to use CUDA_VISIBLE_DEVICES to pick the GPU. I train a model on multiple GPUs in a DDP-like fashion, where each vLLM instance generates data for the model on its own device, then gradients are synchronized, and so on. So I cannot set CUDA_VISIBLE_DEVICES to a single device, as that would turn the multi-GPU training into single-GPU training.
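Roughly, the training pattern looks like the sketch below. This assumes the proposed device="cuda:N" argument existed; build_model(), train_step(), prompts, num_steps, and the model name are placeholders for my own training code:

```python
import os

import torch
import torch.distributed as dist
from vllm import LLM, SamplingParams

# Sketch only: build_model(), train_step(), prompts and num_steps stand in
# for my actual training code, and device=f"cuda:{local_rank}" is the
# requested vLLM feature, not an existing option.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

ddp_model = torch.nn.parallel.DistributedDataParallel(
    build_model().to(local_rank), device_ids=[local_rank]
)

# Each rank should run its own vLLM engine on its own GPU. I can't restrict
# CUDA_VISIBLE_DEVICES to a single device here, since the same job is doing
# multi-GPU training.
generator = LLM(model="my-base-model", device=f"cuda:{local_rank}")

for step in range(num_steps):
    # Generate training data on this rank's GPU only.
    samples = generator.generate(prompts, SamplingParams(max_tokens=128))
    # Backward pass + optimizer step; DDP synchronizes gradients across ranks.
    train_step(ddp_model, samples)
```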
Also, I cannot avoid the problem by running a vLLM server on a separate GPU, because I need to swap model weights (LoRAs) on the fly, and that is not currently supported (#3446).
Additional context
So I either need a way to specify which GPU to use, or to have PR #3446 completed so that I can run a server instead.