[Bug]: Structured output with xgrammar fails under `vllm serve` with Llama-3.1-8B: OSError: (...)/.cache/torch_extensions/py312_cu124/xgrammar/xgrammar.so: cannot open shared object file: No such file or directory #13563

### Your current environment

<details>
<summary>The output of `python collect_env.py`</summary>

```text
INFO 02-19 18:59:41 __init__.py:190] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Rocky Linux 8.10 (Green Obsidian) (x86_64)
GCC version: (GCC) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.28

Python version: 3.12.3 (main, Dec 16 2024, 18:23:37) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-4.18.0-553.16.1.el8_10.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB

Nvidia driver version: 560.35.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        8
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7452 32-Core Processor
Stepping:            0
CPU MHz:             2350.000
CPU max MHz:         2350.0000
CPU min MHz:         1500.0000
BogoMIPS:            4700.21
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-7,64-71
NUMA node1 CPU(s):   8-15,72-79
NUMA node2 CPU(s):   16-23,80-87
NUMA node3 CPU(s):   24-31,88-95
NUMA node4 CPU(s):   32-39,96-103
NUMA node5 CPU(s):   40-47,104-111
NUMA node6 CPU(s):   48-55,112-119
NUMA node7 CPU(s):   56-63,120-127
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.49.0
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.7.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	GPU1	GPU2	GPU3	NIC0	NIC1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV4	NV4	NV4	SYS	SYS	24-31,88-95	3		N/A
GPU1	NV4	 X 	NV4	NV4	PIX	SYS	8-15,72-79	1		N/A
GPU2	NV4	NV4	 X 	NV4	SYS	PIX	56-63,120-127	7		N/A
GPU3	NV4	NV4	NV4	 X 	SYS	SYS	40-47,104-111	5		N/A
NIC0	SYS	PIX	SYS	SYS	 X 	SYS				
NIC1	SYS	SYS	PIX	SYS	SYS	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

LD_LIBRARY_PATH=/mnt/tier2/users/*my_id*/user_venv/lib/python3.12/site-packages/cv2/../../lib64:/apps/USE/easybuild/release/2024.1/software/Python/3.12.3-GCCcore-13.3.0/lib:/apps/USE/easybuild/release/2024.1/software/OpenSSL/3/lib:/apps/USE/easybuild/release/2024.1/software/libffi/3.4.5-GCCcore-13.3.0/lib64:/apps/USE/easybuild/release/2024.1/software/XZ/5.4.5-GCCcore-13.3.0/lib:/apps/USE/easybuild/release/2024.1/software/SQLite/3.45.3-GCCcore-13.3.0/lib:/apps/USE/easybuild/release/2024.1/software/Tcl/8.6.14-GCCcore-13.3.0/lib:/apps/USE/easybuild/release/2024.1/software/libreadline/8.2-GCCcore-13.3.0/lib:/apps/USE/easybuild/release/2024.1/software/ncurses/6.5-GCCcore-13.3.0/lib:/apps/USE/easybuild/release/2024.1/software/bzip2/1.0.8-GCCcore-13.3.0/lib:/apps/USE/easybuild/release/2024.1/software/binutils/2.42-GCCcore-13.3.0/lib:/apps/USE/easybuild/release/2024.1/software/zlib/1.3.1-GCCcore-13.3.0/lib:/apps/USE/easybuild/release/2024.1/software/GCCcore/13.3.0/lib64
CUDA_VISIBLE_DEVICES=0,1,2,3
CUDA_VISIBLE_DEVICES=0,1,2,3
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
```

</details>


### 🐛 Describe the bug

Hi, I get an error when trying to generate structured output using xgrammar.
The script below works fine with outlines as the backend; the problem only seems to occur with xgrammar (the default).

Running `vllm serve meta-llama/Llama-3.1-8B-Instruct` starts the server without any problems.

When I run the Python script below, the server logs show this error:

```OSError: OSError: *path-to-my-work-dir*/.cache/torch_extensions/py312_cu124/xgrammar/xgrammar.so: cannot open shared object file: No such file or directory```

```python
from openai import OpenAI
from pydantic import BaseModel


# Schema that the model's reply must conform to.
class Answer(BaseModel):
    reasoning: str
    answer: str


# JSON schema passed to the server for guided decoding.
json_schema = Answer.model_json_schema()

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)


completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?",
        }
    ],
    extra_body={
        "guided_json": json_schema,
    },
)
print(completion.choices[0].message.content)
```
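Since the error says that `xgrammar.so` is missing from the torch extensions cache, a quick way to narrow this down is to look at what is actually in that directory. The following is only a minimal diagnostic sketch, not part of vLLM or xgrammar: `extensions_root` and `inspect_extension` are hypothetical helper names, and the default location `~/.cache/torch_extensions` (overridable with `TORCH_EXTENSIONS_DIR`) is an assumption about the setup; pass the real cache path from the error message if it differs.

```python
import os
from pathlib import Path


def extensions_root() -> Path:
    """Guess the torch extensions cache root.

    Assumption: TORCH_EXTENSIONS_DIR if set, else ~/.cache/torch_extensions.
    """
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    return Path(override) if override else Path.home() / ".cache" / "torch_extensions"


def inspect_extension(root: Path, name: str = "xgrammar") -> dict:
    """Report, for every <root>/<build_tag>/<name> directory, whether <name>.so exists."""
    report = {}
    if not root.is_dir():
        return report
    for build_dir in sorted(root.glob(f"*/{name}")):
        if not build_dir.is_dir():
            continue
        report[str(build_dir)] = {
            "so_exists": (build_dir / f"{name}.so").is_file(),
            "files": sorted(p.name for p in build_dir.iterdir()),
        }
    return report


if __name__ == "__main__":
    for path, info in inspect_extension(extensions_root()).items():
        print(path, info)
```

If a build directory exists but contains no `xgrammar.so`, that would suggest the extension build did not finish on this machine, rather than a problem with the request itself.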

I ran it in a fresh virtual environment with only vllm installed.
The script is almost a copy of the one in the structured output tutorial.
Adding
```
"guided_decoding_backend": "outlines",
```

to the `extra_body` argument makes it run fine.
I have tried reinstalling vllm from scratch, but the result is the same. Both specifying `"guided_decoding_backend": "xgrammar"` and letting it fall back to the default produce the error above.
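To make the comparison between backends easy to repeat, the three cases (default, xgrammar, outlines) can be run in a loop. This is a rough sketch only: `build_extra_body` and `try_backends` are hypothetical helpers, and `send` stands for any callable that forwards the payload as `extra_body` to `client.chat.completions.create(...)` as in the script above.

```python
from typing import Any, Callable, Optional


def build_extra_body(schema: dict, backend: Optional[str] = None) -> dict:
    """Build the extra_body payload; leave the backend key out to use the server default."""
    body: dict[str, Any] = {"guided_json": schema}
    if backend is not None:
        body["guided_decoding_backend"] = backend
    return body


def try_backends(
    send: Callable[[dict], Any],
    schema: dict,
    backends: tuple = (None, "xgrammar", "outlines"),
) -> dict:
    """Call send(extra_body) once per backend and record "ok" or the error text."""
    results = {}
    for backend in backends:
        label = backend or "default"
        try:
            send(build_extra_body(schema, backend))
            results[label] = "ok"
        except Exception as exc:  # broad on purpose: this only reports failures
            results[label] = f"failed: {exc}"
    return results
```

With the client from the script above, `send` could be a small function that calls `client.chat.completions.create(model=..., messages=..., extra_body=extra_body)`; the returned dictionary then shows at a glance which backends fail.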

Hope that this is not a duplicate and that you might be able to help me!



### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
