Description
Your current environment
The output of python collect_env.py
Collecting environment information...
==============================
System Info
==============================
OS : Ubuntu 24.04.3 LTS (x86_64)
GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version : Could not collect
CMake version : Could not collect
Libc version : glibc-2.39
==============================
PyTorch Info
==============================
PyTorch version : 2.8.0+cu128
Is debug build : False
CUDA used to build PyTorch : 12.8
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-6.14.0-1012-x86_64-with-glibc2.39
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration :
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
Nvidia driver version : 580.65.06
cuDNN version : Could not collect
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7V13 64-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 1
Stepping: 1
BogoMIPS: 4890.88
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves user_shstk clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 1.5 MiB (48 instances)
L1i cache: 1.5 MiB (48 instances)
L2 cache: 24 MiB (48 instances)
L3 cache: 192 MiB (6 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
Vulnerability Gather data sampling: Not affected
Vulnerability Ghostwrite: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.8.0
[pip3] torchaudio==2.8.0
[pip3] torchvision==0.23.0
[pip3] transformers==4.57.0
[pip3] triton==3.4.0
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.11.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
GPU0 GPU1 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NODE 0-23 0 N/A
GPU1 NV12 X SYS 24-47 1 N/A
NIC0 NODE SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
==============================
Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
Opening this issue at the request of @bbrowning in #22308.
I am using gpt-oss-120b with tool calling. I have no problems with vLLM v0.10.2, but with the new release (v0.11.0) roughly one in two queries hangs indefinitely without a response. Looking at the vLLM logs, the model keeps generating tokens, but no response is ever returned to the client. The problem occurs in both streaming and non-streaming modes. I'm running vLLM in Docker on Ubuntu 24.04 with 2x A100 80GB.
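While a request is hanging, the periodic log lines below keep showing "Running: 1 reqs", so the request is still inside the engine. One quick way to watch this from outside the logs is to poll the Prometheus gauges at /metrics; a minimal sketch, assuming the standard vllm:num_requests_running / vllm:num_requests_waiting metric names (the -k flag and Authorization header match the TLS/API-key setup below and may not be strictly required for /metrics):

# Poll the running/waiting request gauges every 5 seconds while a request hangs.
while true; do
  curl -sk -H "Authorization: Bearer XXXX" https://eeeeehhvolevi.ai:8000/metrics \
    | grep -E "vllm:num_requests_(running|waiting)"
  sleep 5
done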
Below are the docker-compose file I'm using, the vLLM logs (note that I sent two identical requests: the first returned a response, while the second triggered an endless, hanging generation), and a sample request to reproduce the issue.
My docker compose:
services:
  vllm:
    container_name: vllm
    image: vllm/vllm-openai:v0.11.0 # v0.10.2 was working
    command: "--model openai/gpt-oss-120b --tool-call-parser openai --enable-auto-tool-choice --max-model-len 131072 --gpu-memory-utilization 0.85 --tensor-parallel-size 2 --api-key XXXX --ssl-keyfile /certs/privkey.pem --ssl-certfile /certs/fullchain.pem"
    environment:
      TZ: "Europe/Rome"
      HUGGING_FACE_HUB_TOKEN: "YYYY"
      CUDA_VISIBLE_DEVICES: "0,1"
    volumes:
      - /datadisk/vllm/data:/root/.cache/huggingface
      - /datadisk/vllm/cache:/root/.cache/vllm
      - /datadisk/workspace/certificate/certbot-etc/archive/eeeeehhvolevi.ai:/certs
    ports:
      - 8000:8000
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    runtime: nvidia
    ipc: host
    healthcheck:
      test: [ "CMD", "curl", "-H", "Authorization: Bearer XXXX", "-f", "http://localhost:8000/v1/models" ]
      interval: 30s
      timeout: 5s
      retries: 20
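For reference, the service above should be roughly equivalent to the following plain docker run invocation (TLS options and the healthcheck omitted for brevity; this variant has not been re-verified outside Compose):

# Rough docker run equivalent of the Compose service above.
docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=YYYY -e CUDA_VISIBLE_DEVICES=0,1 \
  -v /datadisk/vllm/data:/root/.cache/huggingface \
  -v /datadisk/vllm/cache:/root/.cache/vllm \
  vllm/vllm-openai:v0.11.0 \
  --model openai/gpt-oss-120b --tool-call-parser openai --enable-auto-tool-choice \
  --max-model-len 131072 --gpu-memory-utilization 0.85 --tensor-parallel-size 2 \
  --api-key XXXX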
vLLM logs:
[+] Running 2/2
✔ Network vllm_default Created 0.0s
✔ Container vllm Created 0.2s
Attaching to vllm
vllm | INFO 10-09 10:57:57 [__init__.py:216] Automatically detected platform cuda.
vllm | (APIServer pid=1) INFO 10-09 10:58:07 [api_server.py:1839] vLLM API server version 0.11.0
vllm | (APIServer pid=1) INFO 10-09 10:58:07 [utils.py:233] non-default args: {'api_key': ['XXXX'], 'ssl_keyfile': '/certs/privkey.pem', 'ssl_certfile': '/certs/fullchain.pem', 'enable_auto_tool_choice': True, 'tool_call_parser': 'openai', 'model': 'openai/gpt-oss-120b', 'max_model_len': 131072, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.85}
vllm | (APIServer pid=1) `torch_dtype` is deprecated! Use `dtype` instead!
vllm | (APIServer pid=1) INFO 10-09 10:58:08 [model.py:547] Resolved architecture: GptOssForCausalLM
Parse safetensors files: 100%|██████████| 15/15 [00:01<00:00, 11.73it/s]
vllm | (APIServer pid=1) INFO 10-09 10:58:10 [model.py:1510] Using max model len 131072
vllm | (APIServer pid=1) INFO 10-09 10:58:16 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
vllm | (APIServer pid=1) INFO 10-09 10:58:16 [config.py:271] Overriding max cuda graph capture size to 992 for performance.
vllm | INFO 10-09 10:58:21 [__init__.py:216] Automatically detected platform cuda.
vllm | (EngineCore_DP0 pid=132) INFO 10-09 10:58:24 [core.py:644] Waiting for init message from front-end.
vllm | (EngineCore_DP0 pid=132) INFO 10-09 10:58:24 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='openai/gpt-oss-120b', speculative_config=None, tokenizer='openai/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=openai/gpt-oss-120b, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":992,"local_cache_dir":null}
vllm | (EngineCore_DP0 pid=132) WARNING 10-09 10:58:24 [multiproc_executor.py:720] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
vllm | (EngineCore_DP0 pid=132) INFO 10-09 10:58:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_e8fa51e9'), local_subscribe_addr='ipc:///tmp/9ad4aa28-a1ca-4984-bd55-a99ce028680f', remote_subscribe_addr=None, remote_addr_ipv6=False)
vllm | INFO 10-09 10:58:27 [__init__.py:216] Automatically detected platform cuda.
vllm | INFO 10-09 10:58:27 [__init__.py:216] Automatically detected platform cuda.
vllm | W1009 10:58:31.851000 234 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
vllm | W1009 10:58:31.851000 234 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
vllm | W1009 10:58:31.851000 235 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
vllm | W1009 10:58:31.851000 235 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
vllm | INFO 10-09 10:58:34 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_88d1e323'), local_subscribe_addr='ipc:///tmp/256e2747-0d04-42d9-ad03-fe9de6da5d3f', remote_subscribe_addr=None, remote_addr_ipv6=False)
vllm | INFO 10-09 10:58:34 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_a43b6f2e'), local_subscribe_addr='ipc:///tmp/43656d19-ff49-4f78-b36c-41eb16cdad0e', remote_subscribe_addr=None, remote_addr_ipv6=False)
vllm | [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
vllm | [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
vllm | [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
vllm | [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
vllm | INFO 10-09 10:58:35 [__init__.py:1384] Found nccl from library libnccl.so.2
vllm | INFO 10-09 10:58:35 [__init__.py:1384] Found nccl from library libnccl.so.2
vllm | INFO 10-09 10:58:35 [pynccl.py:103] vLLM is using nccl==2.27.3
vllm | INFO 10-09 10:58:35 [pynccl.py:103] vLLM is using nccl==2.27.3
vllm | WARNING 10-09 10:58:35 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.0 not supported, communicator is not available.
vllm | WARNING 10-09 10:58:35 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.0 not supported, communicator is not available.
vllm | INFO 10-09 10:58:35 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
vllm | INFO 10-09 10:58:35 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
vllm | INFO 10-09 10:58:36 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_e97f286f'), local_subscribe_addr='ipc:///tmp/07a381f9-f574-419b-8844-079c5e876274', remote_subscribe_addr=None, remote_addr_ipv6=False)
vllm | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm | [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
vllm | INFO 10-09 10:58:36 [__init__.py:1384] Found nccl from library libnccl.so.2
vllm | INFO 10-09 10:58:36 [pynccl.py:103] vLLM is using nccl==2.27.3
vllm | [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
vllm | INFO 10-09 10:58:36 [__init__.py:1384] Found nccl from library libnccl.so.2
vllm | INFO 10-09 10:58:36 [pynccl.py:103] vLLM is using nccl==2.27.3
vllm | INFO 10-09 10:58:36 [parallel_state.py:1208] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
vllm | INFO 10-09 10:58:36 [parallel_state.py:1208] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
vllm | INFO 10-09 10:58:36 [topk_topp_sampler.py:55] Using FlashInfer for top-p & top-k sampling.
vllm | INFO 10-09 10:58:36 [topk_topp_sampler.py:55] Using FlashInfer for top-p & top-k sampling.
vllm | (Worker_TP1 pid=235) INFO 10-09 10:58:36 [gpu_model_runner.py:2602] Starting to load model openai/gpt-oss-120b...
vllm | (Worker_TP0 pid=234) INFO 10-09 10:58:36 [gpu_model_runner.py:2602] Starting to load model openai/gpt-oss-120b...
vllm | (Worker_TP0 pid=234) INFO 10-09 10:58:36 [gpu_model_runner.py:2634] Loading model from scratch...
vllm | (Worker_TP1 pid=235) INFO 10-09 10:58:36 [gpu_model_runner.py:2634] Loading model from scratch...
vllm | (Worker_TP1 pid=235) INFO 10-09 10:58:37 [cuda.py:361] Using Triton backend on V1 engine.
vllm | (Worker_TP0 pid=234) INFO 10-09 10:58:37 [cuda.py:361] Using Triton backend on V1 engine.
vllm | (Worker_TP0 pid=234) INFO 10-09 10:58:37 [mxfp4.py:98] Using Marlin backend
vllm | (Worker_TP1 pid=235) INFO 10-09 10:58:37 [mxfp4.py:98] Using Marlin backend
vllm | (Worker_TP0 pid=234) INFO 10-09 10:58:37 [weight_utils.py:392] Using model weights format ['*.safetensors']
vllm | (Worker_TP1 pid=235) INFO 10-09 10:58:37 [weight_utils.py:392] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/15 [00:00<00:06, 2.10it/s]
Loading safetensors checkpoint shards: 13% Completed | 2/15 [00:01<00:06, 1.86it/s]
Loading safetensors checkpoint shards: 20% Completed | 3/15 [00:01<00:06, 1.86it/s]
Loading safetensors checkpoint shards: 27% Completed | 4/15 [00:02<00:05, 1.97it/s]
Loading safetensors checkpoint shards: 33% Completed | 5/15 [00:02<00:05, 1.93it/s]
Loading safetensors checkpoint shards: 40% Completed | 6/15 [00:03<00:04, 1.85it/s]
Loading safetensors checkpoint shards: 47% Completed | 7/15 [00:03<00:04, 1.86it/s]
Loading safetensors checkpoint shards: 53% Completed | 8/15 [00:04<00:03, 1.81it/s]
Loading safetensors checkpoint shards: 60% Completed | 9/15 [00:04<00:03, 1.79it/s]
Loading safetensors checkpoint shards: 67% Completed | 10/15 [00:05<00:02, 1.81it/s]
Loading safetensors checkpoint shards: 73% Completed | 11/15 [00:05<00:02, 1.83it/s]
Loading safetensors checkpoint shards: 80% Completed | 12/15 [00:06<00:01, 1.79it/s]
Loading safetensors checkpoint shards: 87% Completed | 13/15 [00:07<00:01, 1.77it/s]
Loading safetensors checkpoint shards: 93% Completed | 14/15 [00:07<00:00, 1.75it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:08<00:00, 1.79it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:08<00:00, 1.83it/s]
vllm | (Worker_TP0 pid=234)
vllm | (Worker_TP0 pid=234) INFO 10-09 10:58:46 [default_loader.py:267] Loading weights took 8.31 seconds
vllm | (Worker_TP0 pid=234) WARNING 10-09 10:58:46 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
vllm | (Worker_TP1 pid=235) INFO 10-09 10:58:46 [default_loader.py:267] Loading weights took 8.23 seconds
vllm | (Worker_TP1 pid=235) WARNING 10-09 10:58:46 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
vllm | (Worker_TP0 pid=234) INFO 10-09 10:58:49 [gpu_model_runner.py:2653] Model loading took 34.3767 GiB and 12.308948 seconds
vllm | (Worker_TP1 pid=235) INFO 10-09 10:58:49 [gpu_model_runner.py:2653] Model loading took 34.3767 GiB and 12.480512 seconds
vllm | (Worker_TP1 pid=235) INFO 10-09 10:59:00 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/aaa22f67ec/rank_1_0/backbone for vLLM's torch.compile
vllm | (Worker_TP0 pid=234) INFO 10-09 10:59:00 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/aaa22f67ec/rank_0_0/backbone for vLLM's torch.compile
vllm | (Worker_TP1 pid=235) INFO 10-09 10:59:00 [backends.py:559] Dynamo bytecode transform time: 10.84 s
vllm | (Worker_TP0 pid=234) INFO 10-09 10:59:00 [backends.py:559] Dynamo bytecode transform time: 10.85 s
vllm | (Worker_TP1 pid=235) INFO 10-09 10:59:03 [backends.py:164] Directly load the compiled graph(s) for dynamic shape from the cache, took 2.469 s
vllm | (Worker_TP0 pid=234) INFO 10-09 10:59:03 [backends.py:164] Directly load the compiled graph(s) for dynamic shape from the cache, took 2.452 s
vllm | (Worker_TP1 pid=235) INFO 10-09 10:59:04 [marlin_utils.py:353] You are running Marlin kernel with bf16 on GPUs before SM90. You can consider change to fp16 to achieve better performance if possible.
vllm | (Worker_TP0 pid=234) INFO 10-09 10:59:04 [marlin_utils.py:353] You are running Marlin kernel with bf16 on GPUs before SM90. You can consider change to fp16 to achieve better performance if possible.
vllm | (Worker_TP0 pid=234) INFO 10-09 10:59:05 [monitor.py:34] torch.compile takes 10.85 s in total
vllm | (Worker_TP1 pid=235) INFO 10-09 10:59:05 [monitor.py:34] torch.compile takes 10.84 s in total
vllm | (Worker_TP0 pid=234) INFO 10-09 10:59:07 [gpu_worker.py:298] Available KV cache memory: 31.78 GiB
vllm | (Worker_TP1 pid=235) INFO 10-09 10:59:07 [gpu_worker.py:298] Available KV cache memory: 31.78 GiB
vllm | (EngineCore_DP0 pid=132) INFO 10-09 10:59:07 [kv_cache_utils.py:1087] GPU KV cache size: 925,696 tokens
vllm | (EngineCore_DP0 pid=132) INFO 10-09 10:59:07 [kv_cache_utils.py:1091] Maximum concurrency for 131,072 tokens per request: 13.89x
vllm | (EngineCore_DP0 pid=132) INFO 10-09 10:59:07 [kv_cache_utils.py:1087] GPU KV cache size: 925,696 tokens
vllm | (EngineCore_DP0 pid=132) INFO 10-09 10:59:07 [kv_cache_utils.py:1091] Maximum concurrency for 131,072 tokens per request: 13.89x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 81/81 [00:07<00:00, 11.55it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:03<00:00, 9.96it/s]
vllm | (Worker_TP0 pid=234) INFO 10-09 10:59:18 [custom_all_reduce.py:203] Registering 8468 cuda graph addresses
vllm | (Worker_TP1 pid=235) INFO 10-09 10:59:22 [custom_all_reduce.py:203] Registering 8468 cuda graph addresses
vllm | (Worker_TP0 pid=234) INFO 10-09 10:59:22 [gpu_model_runner.py:3480] Graph capturing finished in 15 secs, took 1.20 GiB
vllm | (Worker_TP1 pid=235) INFO 10-09 10:59:22 [gpu_model_runner.py:3480] Graph capturing finished in 15 secs, took 1.20 GiB
vllm | (EngineCore_DP0 pid=132) INFO 10-09 10:59:22 [core.py:210] init engine (profile, create kv cache, warmup model) took 32.99 seconds
vllm | (APIServer pid=1) INFO 10-09 10:59:24 [loggers.py:147] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 115712
vllm | (APIServer pid=1) INFO 10-09 10:59:24 [api_server.py:1634] Supported_tasks: ['generate']
vllm | (APIServer pid=1) WARNING 10-09 10:59:24 [serving_responses.py:154] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [serving_responses.py:166] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [serving_chat.py:99] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [api_server.py:1912] Starting vLLM API server 0 on https://0.0.0.0:8000
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:34] Available routes are:
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /openapi.json, Methods: GET, HEAD
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /docs, Methods: GET, HEAD
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /docs/oauth2-redirect, Methods: GET, HEAD
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /redoc, Methods: GET, HEAD
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /health, Methods: GET
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /load, Methods: GET
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /ping, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /ping, Methods: GET
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /tokenize, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /detokenize, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /v1/models, Methods: GET
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /version, Methods: GET
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /v1/responses, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /v1/responses/{response_id}, Methods: GET
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /v1/responses/{response_id}/cancel, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /v1/chat/completions, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /v1/completions, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /v1/embeddings, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /pooling, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /classify, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /score, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /v1/score, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /v1/audio/transcriptions, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /v1/audio/translations, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /rerank, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /v1/rerank, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /v2/rerank, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /scale_elastic_ep, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /is_scaling_elastic_ep, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /invocations, Methods: POST
vllm | (APIServer pid=1) INFO 10-09 10:59:26 [launcher.py:42] Route: /metrics, Methods: GET
vllm | (APIServer pid=1) INFO: Started server process [1]
vllm | (APIServer pid=1) INFO: Waiting for application startup.
vllm | (APIServer pid=1) INFO: Application startup complete.
vllm | (APIServer pid=1) INFO: 81.125.120.211:58982 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm | (APIServer pid=1) INFO 10-09 11:01:07 [loggers.py:127] Engine 000: Avg prompt throughput: 105.6 tokens/s, Avg generation throughput: 43.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm | (APIServer pid=1) INFO 10-09 11:01:17 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm | (APIServer pid=1) INFO: 163.162.186.79:55372 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm | (APIServer pid=1) INFO 10-09 11:01:47 [loggers.py:127] Engine 000: Avg prompt throughput: 646.7 tokens/s, Avg generation throughput: 80.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:01:57 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 144.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:02:07 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 138.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:02:17 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 133.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:02:27 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 129.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:02:37 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 125.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:02:47 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 121.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:02:57 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 117.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:03:07 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 114.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:03:17 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 111.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:03:27 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 110.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:03:37 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 107.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:03:47 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 105.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:03:57 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 102.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.9%, Prefix cache hit rate: 14.5%
vllm | (APIServer pid=1) INFO 10-09 11:04:07 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 100.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 14.5%
Sample request to reproduce the issue:
curl https://eeeeehhvolevi.ai:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer XXXX" \
-d '{
"model": "openai/gpt-oss-120b",
"messages": [
{
"role": "developer",
"content": "# Role and Objective\nNetwork trace analyzer agent specialized in examining `.pcap` files.\n\n# Instructions\n## Planning\n- Begin with a concise checklist outlining planned analytical steps based on the user'\''s request before proceeding.\n\n## Function Calling\n- Use provided functions to gather data from the `.pcap` file as needed. The same function accepts different parameters providing different results, you can call the same function multiple times with different parameters if needed.\n- When using functions avoid parameters that may lead to excessive data retrieval. Instead, use filters and limits to focus on relevant data.\n- Rely on the function `execute_code` for math calculations.\n- If a function call fails (e.g., due to syntax errors), validate the input, correct the expression if possible, and retry the function. If correction isn'\''t feasible, report the issue and suggest alternative approaches.\n- Do not rely exclusively on a sequence of function calls, integrate critical reflection and thoughtful articulation of reasoning at each key decision point.\n\n## Precision and Clarity\n- Base all conclusions strictly on concrete data retrieved using available functions. Do not speculate, if information is incomplete or unobtainable, explicitly state uncertainty rather than offer assumptions.\n\n## Focus on what matters\n- Prioritize analysis of network flows and protocols directly relevant to the user'\''s query. Avoid extraneous details that do not contribute to addressing the specific request.\n- `.pcap` files may contain extraneous traffic being file extracted from real production environment. Do not take into account unrelated data to ensure analysis is focused strictly on relevant network flows."
},
{
"role": "user",
"content": "Analizza i flussi di rete in formato pcap e verifica la corrispondenza con il sequence diagram fornito. Il sequence diagram può contenere delle note che aiutano a comprendere meglio il flusso.\n\nblablabla\n\nbliblibli\n\n\n# Output\n- Report, in italiano, in formato markdown con 3 sezioni: Risposta breve, Dettagli, Punti di attenzione\n- Risposta breve deve indicare se il flusso di rete catturato corrisponde o meno al sequence diagram e una sintesi delle eventuali discrepanze\n- Dettagli deve fornire riferimenti specifici che supportano la risposta breve ed eventuali discrepanze riscontrate\n- Punti di attenzione deve elencare eventuali anomalie o punti di attenzione riscontrati durante l'\''analisi\n\nUsa tabelle solo quando hai poco testo da mostrare, altrimenti usa elenchi puntati o numerati."
}
],
"tools": [
{
"type": "function",
"function": {
"name": "extract_pcap_summary",
"description": "Extracts a general summary of the pcap file: total number of packets, main protocols, time intervals.",
"parameters": {
"type": "object",
"properties": {
"pcap_file_path": {
"type": "string",
"description": "Path to the .pcap to analyze."
}
},
"required": ["pcap_file_path"]
}
}
},
{
"type": "function",
"function": {
"name": "filter_protocol_messages",
"description": "Filters and returns messages of a specific protocol from the pcap file. Supports any protocol recognized by tshark (e.g., sip, diameter, gtp, http2, radius, sctp). Allows you to specify fields to extract and additional filters. Supports decoding DLT_USER 15 packets with ppcap payload.",
"parameters": {
"type": "object",
"properties": {
"pcap_file_path": {
"type": "string",
"description": "Path to the .pcap to analyze."
},
"protocol": {
"type": "string",
"description": "Name of the protocol to filter (e.g., `sip`, `diameter`, `gtp`, `http2`). It must match the name used by tshark."
},
"display_filters": {
"type": "string",
"description": "Optional. Additional tshark display filter (e.g., `sip.Method == \"INVITE\"` or `diameter.cmd.code == 272`). If omitted, returns all protocol messages."
},
"fields": {
"type": "array",
"items": {
"type": "string"
},
"description": "List of tshark fields to extract (e.g., [`frame.time`, `ip.src`, `ip.dst`, `sip.Method`, `diameter.Session-Id`]). If omitted, use a default set for the protocol. Suggestion: try default field at first run, then refine with specific fields if needed."
},
"decode_dlt_user_15": {
"type": "boolean",
"description": "Optional. Set to true to decode packets encoded as DLT_USER 15 with ppcap payload. Default is false."
}
},
"required": ["pcap_file_path", "protocol"]
}
}
},
{
"type": "function",
"function": {
"name": "find_protocol_fields",
"description": "Lists all available tshark fields belonging exactly to a specific protocol. Use it if you fail to find a field you expect to exist.",
"parameters": {
"type": "object",
"properties": {
"protocol": {
"type": "string",
"description": "Name of the protocol (e.g., '\''sip'\'', '\''http'\'', '\''dns'\'') to search for."
}
},
"required": ["protocol"]
}
}
},
{
"type": "function",
"function": {
"name": "execute_code",
"description": "Execute Python code in a secure sandbox. Use this to run code snippets, perform calculations, or manipulate data. Return the result of the execution.",
"parameters": {
"type": "object",
"properties": {
"code": {
"type": "string",
"description": "Code to execute, must be valid Python code."
}
},
"required": ["code"]
}
}
}
],
"temperature": 1.0,
"top_p": 1.0,
"reasoning_effort": "medium",
"stream": false
}'
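Since roughly one request in two hangs, the quickest way to reproduce is to send the same payload several times in a row with a client-side timeout and count how many attempts never come back. A minimal sketch, assuming the JSON body from the curl above has been saved to a local file named request.json (a name chosen here for illustration) and using an arbitrary 120-second cutoff:

# Fire the same chat/completions request 10 times; attempts that exceed the
# client-side timeout correspond to the hanging generations described above.
# request.json is assumed to contain the JSON payload shown in the curl command.
for i in $(seq 1 10); do
  if curl -sk --max-time 120 https://eeeeehhvolevi.ai:8000/v1/chat/completions \
       -H "Content-Type: application/json" \
       -H "Authorization: Bearer XXXX" \
       -d @request.json -o /dev/null; then
    echo "attempt $i: got a response"
  else
    rc=$?   # curl exits with 28 when --max-time is hit
    echo "attempt $i: no response (curl exit code $rc)"
  fi
done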