
[Bug]: deepseekv3.2 tool_calls failure #26897

@CallmeZhangChenchen

Description

Your current environment

The output of python collect_env.py
Collecting environment information...                                                                                                                   
==============================                                                                                                                          
        System Info                                                                                                                                     
==============================                                                                                                                          
OS                           : Ubuntu 24.04.2 LTS (x86_64)                                                                                              
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0                                                                                    
Clang version                : Could not collect                                                                                                        
CMake version                : version 3.31.6                               
Libc version                 : glibc-2.39                                                                                                               
                                                                                                                                                        
==============================                                              
       PyTorch Info                                                         
==============================    
PyTorch version              : 2.8.0+cu128                                  
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-5.15.0-151-generic-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.41 
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration :
GPU 0: NVIDIA H200
GPU 1: NVIDIA H200
GPU 2: NVIDIA H200
GPU 3: NVIDIA H200
GPU 4: NVIDIA H200
GPU 5: NVIDIA H200
GPU 6: NVIDIA H200
GPU 7: NVIDIA H200

Nvidia driver version        : 550.127.08
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.10.1
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True


==============================
Versions of relevant libraries
==============================                                              
[pip3] mypy_extensions==1.1.0                                               
[pip3] numpy==1.26.4                                                        
[pip3] nvidia-cublas-cu12==12.8.4.1                                                                                                                     
[pip3] nvidia-cuda-cupti-cu12==12.8.90                                                                                                                  
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93                                                                                                                  
[pip3] nvidia-cuda-runtime-cu12==12.8.90                                                                                                                
[pip3] nvidia-cudnn-cu12==9.10.2.21                                                                                                                     
[pip3] nvidia-cudnn-frontend==1.11.0                                                                                                                    
[pip3] nvidia-cufft-cu12==11.3.3.83                                                                                                                     
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90       
[pip3] nvidia-cusparse-cu12==12.5.8.93                         
[pip3] nvidia-cusparselt-cu12==0.7.1                                        
[pip3] nvidia-dali-cuda120==1.49.0                                          
[pip3] nvidia-ml-py==12.570.86                                              
[pip3] nvidia-modelopt==0.27.1                                              
[pip3] nvidia-modelopt-core==0.27.1                                         
[pip3] nvidia-nccl-cu12==2.27.3                                             
[pip3] nvidia-nvcomp-cu12==4.2.0.14                                         
[pip3] nvidia-nvimgcodec-cu12==0.5.0.13              
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvjpeg-cu12==12.4.0.16
[pip3] nvidia-nvjpeg2k-cu12==0.8.1.40
[pip3] nvidia-nvtiff-cu12==0.5.0.67
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] nvidia-resiliency-ext==0.3.0
[pip3] onnx==1.17.0
[pip3] optree==0.15.0
[pip3] pynvml==12.0.0
[pip3] pytorch-triton==3.3.0+git96316ce52.nvinternal
[pip3] pyzmq==26.4.0
[pip3] torch==2.8.0
[pip3] torch_tensorrt==2.8.0a0
[pip3] torchaudio==2.8.0
[pip3] torchprofile==0.0.4
[pip3] torchvision==0.23.0
[pip3] transformers==4.57.1                                                 
[pip3] triton==3.4.0                                                        
[conda] Could not collect                                                   
                                                                                                                                                        
==============================                                                                                                                          
         vLLM Info                                                                                                                                      
==============================                                                                                                                          
ROCM Version                 : Could not collect                                                                                                        
vLLM Version                 : 0.11.0                                                                                                                   
vLLM Build Flags:                                                                                                                                       
  CUDA Archs: 7.5 8.0 8.6 9.0 10.0 12.0+PTX; ROCm: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    NODE    NODE    48-95,144-191   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    NODE    NODE    48-95,144-191   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    NODE    NODE    48-95,144-191   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    PIX     48-95,144-191   1               N/A
NIC0    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS
NIC1    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS
NIC2    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     SYS     SYS
NIC3    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    NODE    NODE
NIC5    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    NODE    NODE
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    NODE    NODE
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      PXB     NODE
NIC8    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    PXB      X      NODE
NIC9    SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE     X 

Legend:

🐛 Describe the bug

pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
pip install https://wheels.vllm.ai/dsv32/deep_gemm-2.1.0%2B594953a-cp312-cp312-linux_x86_64.whl

server:
vllm serve deepseek-ai/DeepSeek-V3.2-Exp --served-model-name DeepSeek-V3.2-Exp --trust-remote-code --data-parallel-size 8 --enable-expert-parallel --enable-auto-tool-choice --tool-call-parser deepseek_v31 --chat-template ./deepseek_v31.jinja

Following the test script from #23454:

Test Script (Non-Streaming):
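The exact script lives in #23454; the sketch below shows the same kind of non-streaming tool-call request against the server started above. The tool definition, prompt, and local base URL are illustrative assumptions, not the original script.

# Minimal sketch of a non-streaming tool-call test, assuming the vLLM server
# above is reachable at http://localhost:8000/v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Hypothetical tool definition, used only to exercise tool_calls.
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="DeepSeek-V3.2-Exp",  # matches --served-model-name above
    messages=[{"role": "user", "content": "What is the weather in Beijing?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)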

Result

root@pytorch-job-20251015152714476386g2zcyrl9qvtf:/# python3 test_tool_calls.py 
Traceback (most recent call last):
  File "//test_tool_calls.py", line 65, in <module>
    response = client.chat.completions.create(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_utils/_utils.py", line 286, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/resources/chat/completions/completions.py", line 1156, in create
    return self._post(
           ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1259, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/openai/_base_client.py", line 1047, in request
    raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Error code: 500 - {'error': {'message': 'EngineCore encountered an issue. See stack trace (above) for the root cause.', 'type': 'Internal Server Error', 'param': None, 'code': 500}}
....
(APIServer pid=538) WARNING 10-15 16:18:23 [protocol.py:93] The following fields were present in the request but ignored: {'strict'}                    
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710] EngineCore encountered a fatal error.                                                       
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710] Traceback (most recent call last):                                                          
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]     engine_core.run_busy_loop()
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1056, in run_busy_loop
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]     self.execute_dummy_batch()
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in execute_dummy_batch
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]     self.model_executor.execute_dummy_batch()
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 109, in execute_dummy_batch
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]     self.collective_rpc("execute_dummy_batch")
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]     return func(*args, **kwargs)
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 491, in execute_dummy_batch
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]     self.model_runner._dummy_run(1, uniform_decode=True)
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]     return func(*args, **kwargs)
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3180, in _dummy_run
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]     return hidden_states, hidden_states[logit_indices]
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]                           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710] For debugging consider passing CUDA_LAUNCH_BLOCKING=1                                       
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                         
(EngineCore_DP5 pid=826) ERROR 10-15 16:18:24 [core.py:710]                                                                                             
(EngineCore_DP5 pid=826) Process EngineCore_DP5:                                                                                                        
(EngineCore_DP5 pid=826) Traceback (most recent call last):
(EngineCore_DP5 pid=826)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP5 pid=826)     self.run()                                                                                                                 
(EngineCore_DP5 pid=826)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP5 pid=826)     self._target(*self._args, **self._kwargs)
(EngineCore_DP5 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP5 pid=826)     raise e
(EngineCore_DP5 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_DP5 pid=826)     engine_core.run_busy_loop()                                                                                                
(EngineCore_DP5 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1056, in run_busy_loop                           
(EngineCore_DP5 pid=826)     self.execute_dummy_batch()
(EngineCore_DP5 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in execute_dummy_batch
(EngineCore_DP5 pid=826)     self.model_executor.execute_dummy_batch()
(EngineCore_DP5 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 109, in execute_dummy_batch
(EngineCore_DP5 pid=826)     self.collective_rpc("execute_dummy_batch")                                                                                 
(EngineCore_DP5 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP5 pid=826)     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP5 pid=826)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP5 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP5 pid=826)     return func(*args, **kwargs)                                                                                               
(EngineCore_DP5 pid=826)            ^^^^^^^^^^^^^^^^^^^^^                                                                                               
(EngineCore_DP5 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 491, in execute_dummy_batch
(EngineCore_DP5 pid=826)     self.model_runner._dummy_run(1, uniform_decode=True)                                   
(EngineCore_DP5 pid=826)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP5 pid=826)     return func(*args, **kwargs)
(EngineCore_DP5 pid=826)            ^^^^^^^^^^^^^^^^^^^^^                                                                                               
(EngineCore_DP5 pid=826)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3180, in _dummy_run
(EngineCore_DP5 pid=826)     return hidden_states, hidden_states[logit_indices]
(EngineCore_DP5 pid=826)                           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^
(EngineCore_DP5 pid=826) torch.AcceleratorError: CUDA error: an illegal memory access was encountered             
(EngineCore_DP5 pid=826) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP5 pid=826) For debugging consider passing CUDA_LAUNCH_BLOCKING=1                                                          
(EngineCore_DP5 pid=826) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP5 pid=826)
(APIServer pid=538) ERROR 10-15 16:18:24 [async_llm.py:480] AsyncLLM output_handler failed.  
....

