-
-
Notifications
You must be signed in to change notification settings - Fork 10.6k
Closed
Labels
bugSomething isn't workingSomething isn't working
Description
Your current environment
The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.5.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.35
Python version: 3.10.16 (main, Dec 11 2024, 16:18:56) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.19.90-vhulk2211.3.0.h1804.eulerosv2r10.aarch64-aarch64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: HiSilicon
Model name: Kunpeng-920
Model: 0
Thread(s) per core: 1
Core(s) per cluster: 48
Socket(s): -
Cluster(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 12 MiB (192 instances)
L1i cache: 12 MiB (192 instances)
L2 cache: 96 MiB (192 instances)
L3 cache: 192 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
NUMA node4 CPU(s): 96-119
NUMA node5 CPU(s): 120-143
NUMA node6 CPU(s): 144-167
NUMA node7 CPU(s): 168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] mypy==1.15.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1
[pip3] torch-npu==2.5.1.dev20250320
[pip3] torchvision==0.20.1
[pip3] transformers==4.51.3
[conda] numpy 1.26.4 pypi_0 pypi
[conda] pyzmq 26.2.1 pypi_0 pypi
[conda] torch 2.5.1 pypi_0 pypi
[conda] torch-npu 2.5.1.dev20250320 pypi_0 pypi
[conda] torchvision 0.20.1 pypi_0 pypi
[conda] transformers 4.51.3 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5.dev20+gb590adfdc
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
VLLM_WORKER_MULTIPROC_METHOD=spawn
VLLM_USE_MODELSCOPE=True
LD_LIBRARY_PATH=/home/sss/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/home/sss/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/home/sss/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/home/sss/Ascend/ascend-toolkit/latest/tools/aml/lib64:/home/sss/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/home/sss/Ascend/ascend-toolkit/latest/lib64:/home/sss/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/home/sss/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/home/sss/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/home/sss/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/home/sss/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/home/sss/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/home/sss/Ascend/ascend-toolkit/latest/tools/aml/lib64:/home/sss/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/home/sss/Ascend/ascend-toolkit/latest/lib64:/home/sss/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/home/sss/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/home/sss/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:
VLLM_USE_V1=1
TORCH_DEVICE_BACKEND_AUTOLOAD=1
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
When I tested the structured output example in vllm with xgrammar, I get this error:
nanobind: leaked 1 instances!
- leaked instance 0xfffc8c22b108 of type "xgrammar.xgrammar_bindings.GrammarCompiler"
nanobind: leaked 1 types!
- leaked type "xgrammar.xgrammar_bindings.GrammarCompiler"
nanobind: leaked 9 functions!
- leaked function ""
- leaked function "compile_json_schema"
- leaked function "compile_builtin_json_grammar"
- leaked function "get_cache_size_bytes"
- leaked function "clear_cache"
- leaked function "__init__"
- leaked function "compile_structural_tag"
- leaked function "compile_regex"
- leaked function "compile_grammar"
nanobind: this is likely caused by a reference counting issue in the binding code.
How can I fix this warning log?
The test script is shown below:
# SPDX-License-Identifier: Apache-2.0
"""
This file demonstrates the example usage of guided decoding
to generate structured outputs using vLLM. It shows how to apply
different guided decoding techniques such as Choice, Regex, JSON schema,
and Grammar to produce structured and formatted results
based on specific prompts.
"""
from enum import Enum
from pydantic import BaseModel
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams
MAX_TOKENS = 100
# Guided decoding by Choice (list of possible options)
guided_decoding_params_choice = GuidedDecodingParams(
choice=["Positive", "Negative"])
sampling_params_choice = SamplingParams(
guided_decoding=guided_decoding_params_choice, max_tokens=MAX_TOKENS)
prompt_choice = "Classify this sentiment: vLLM is wonderful!"
# Guided decoding by Regex
guided_decoding_params_regex = GuidedDecodingParams(regex=r"\w+@\w+\.com\n")
sampling_params_regex = SamplingParams(
guided_decoding=guided_decoding_params_regex, stop=["\n"], max_tokens=MAX_TOKENS)
prompt_regex = (
"Generate an email address for Alan Turing, who works in Enigma."
"End in .com and new line. Example result:"
"[email protected]\n")
# Guided decoding by JSON using Pydantic schema
class CarType(str, Enum):
sedan = "sedan"
suv = "SUV"
truck = "Truck"
coupe = "Coupe"
class CarDescription(BaseModel):
brand: str
model: str
car_type: CarType
json_schema = CarDescription.model_json_schema()
guided_decoding_params_json = GuidedDecodingParams(json=json_schema)
sampling_params_json = SamplingParams(
guided_decoding=guided_decoding_params_json, max_tokens=MAX_TOKENS)
prompt_json = ("Generate a JSON with the brand, model and car_type of"
"the most iconic car from the 90's")
# Guided decoding by Grammar
simplified_sql_grammar = """
root ::= select_statement
select_statement ::= "SELECT " column " from " table " where " condition
column ::= "col_1 " | "col_2 "
table ::= "table_1 " | "table_2 "
condition ::= column "= " number
number ::= "1 " | "2 "
"""
guided_decoding_params_grammar = GuidedDecodingParams(
grammar=simplified_sql_grammar)
sampling_params_grammar = SamplingParams(
guided_decoding=guided_decoding_params_grammar, max_tokens=MAX_TOKENS)
prompt_grammar = ("Generate an SQL query to show the 'username' and 'email'"
"from the 'users' table.")
def format_output(title: str, output: str):
print(f"{'-' * 50}\n{title}: {output}\n{'-' * 50}")
def generate_output(prompt: str, sampling_params: SamplingParams, llm: LLM):
outputs = llm.generate(prompts=prompt, sampling_params=sampling_params)
return outputs[0].outputs[0].text
def main():
llm = LLM(model="/home/sss/cache/modelscope/models/Qwen/Qwen2.5-7B-Instruct", max_model_len=100)
choice_output = generate_output(prompt_choice, sampling_params_choice, llm)
format_output("Guided decoding by Choice", choice_output)
regex_output = generate_output(prompt_regex, sampling_params_regex, llm)
format_output("Guided decoding by Regex", regex_output)
json_output = generate_output(prompt_json, sampling_params_json, llm)
format_output("Guided decoding by JSON", json_output)
grammar_output = generate_output(prompt_grammar, sampling_params_grammar,
llm)
format_output("Guided decoding by Grammar", grammar_output)
if __name__ == "__main__":
main()
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working