Your current environment
Environment information is not available at the moment.
🐛 Describe the bug
In vllm/executor/ray_gpu_executor.py, line 142, when a node has more than 10 GPUs (for example, an NVIDIA HGX A100 with 16 GPUs), sorted(gpu_ids) returns 0,1,10,11,12,13,14,15,2,3,4,5,6,7,8,9 instead of 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15, because the IDs are sorted lexicographically rather than numerically. This causes an NCCL error: the GPU order in the Ray executor (lexicographic) is inconsistent with the GPU order in NCCL (numerical).
The fix is to sort the IDs numerically:
node_gpus[node_id] = sorted(gpu_ids, key=lambda x: int(x))
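
To make the difference concrete, here is a minimal standalone sketch. It assumes the GPU IDs reach the sort as strings, which the lexicographic result suggests but which I have not confirmed against the vLLM source; the `gpu_ids` list below is made up for illustration and is not taken from vLLM or Ray.

```python
# Hypothetical GPU IDs for a 16-GPU node, represented as strings (an assumption).
gpu_ids = [str(i) for i in range(16)]

# Default sort compares strings character by character, so "10" sorts before "2".
lexicographic = sorted(gpu_ids)

# Sorting with key=int compares the numeric values instead.
numeric = sorted(gpu_ids, key=int)

print("lexicographic:", ",".join(lexicographic))
print("numeric:      ", ",".join(numeric))
```

Here `key=int` behaves the same as `key=lambda x: int(x)` in the proposed fix. Both versions raise a ValueError if any ID is not a plain integer string.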