Conversation

@wuxibin89
Contributor

After #2221, when tensor_parallel_size > 1, the driver process's CUDA_VISIBLE_DEVICES is set manually after the RayWorkerVllm actors have been brought up. But when the LLM engine is initialized inside a Ray actor with num_gpus=0, Ray sets the actor's CUDA_VISIBLE_DEVICES to '', so torch.cuda.is_available() returns False and every subsequent update to CUDA_VISIBLE_DEVICES is ignored. The root cause is that PyTorch initializes the device count only once:
https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDAFunctions.cpp#L96-L113

We should be very careful not to call any torch.cuda.* function before CUDA_VISIBLE_DEVICES is set (a standalone demonstration of the caching follows the reproduction below).

import ray
from vllm import LLM

# num_gpus defaults to 0, so Ray sets CUDA_VISIBLE_DEVICES="" inside the actor.
@ray.remote
class LLMDeployment:
    def __init__(self, *args, **kwargs) -> None:
        # The engine touches torch.cuda.* while CUDA_VISIBLE_DEVICES is still "",
        # so PyTorch caches a device count of 0 and later updates are ignored.
        self.llm = LLM(*args, **kwargs)


actor = LLMDeployment.remote("facebook/opt-13b", tensor_parallel_size=4)
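
The caching can also be demonstrated without Ray or vLLM. A minimal sketch, assuming a fresh Python process on a machine with at least one GPU:

import os

# Simulate what Ray does for an actor with num_gpus=0.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

# This initializes PyTorch's device count once, based on the empty
# CUDA_VISIBLE_DEVICES above, so it reports no usable devices.
print(torch.cuda.is_available())  # False

# Restoring the variable afterwards has no effect: the count is cached
# (see the CUDAFunctions.cpp link above).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(torch.cuda.is_available())  # still False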

@njhill
Member

njhill commented Mar 5, 2024

@wuxibin89 I think this was actually introduced by #2569. I included a similar fix here.

I think we can keep auto but just not call torch.cuda.is_available() there.
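
For illustration, one way to detect CUDA support without triggering that one-time device-count initialization is to inspect PyTorch's build metadata instead. A sketch under the assumption that a build-time check suffices at that point (the helper name is hypothetical, not vLLM's API):

import torch

def cuda_build_available() -> bool:
    # torch.version.cuda is set when the wheel is built with CUDA support;
    # reading it does not initialize the CUDA runtime, unlike
    # torch.cuda.is_available(), which caches the device count based on
    # the CUDA_VISIBLE_DEVICES value at call time.
    return torch.version.cuda is not None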

@zhuohan123
Member

> @wuxibin89 I think this was actually introduced by #2569. I included a similar fix here.
>
> I think we can keep auto but just not call torch.cuda.is_available() there.

@njhill can you separate the fix into another small PR, or @wuxibin89 can you modify this PR accordingly?

@wuxibin89
Contributor Author

I think @njhill's fix is better :)

wuxibin89 closed this Mar 7, 2024