Your current environment
Model Input Dumps
No response
🐛 Describe the bug
It seems that xgrammar does not support some structured output. The test code is:
```python
import json
import logging

from pydantic import BaseModel, conint

logger = logging.getLogger(__name__)


class Animals(BaseModel):
    location: str
    activity: str
    animals_seen: conint(ge=1, le=5)  # type: ignore  # Constrained integer type
    animals: list[str]


user_input = "I saw a puppy, a cat and a raccoon during my bike ride in the park"
messages = [
    {
        "role": "system",
        "content": "You are a helpful which converts user input to JSON object. Respond in JSON format.",
    },
    {
        "role": "user",
        "content": f"convert to JSON according to provided schema: '{user_input}'",
    },
]
logger.info(f"Sending Chat API request to {model_name}")
completion = client.client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.1,
    max_tokens=250,
    extra_body=dict(
        guided_json=json.dumps(Animals.model_json_schema()),
        guided_decoding_backend="lm-format-enforcer",
    ),
)
assert completion is not None
logger.warning(f"{completion=}")
# Check that the output JSON has keys matching the schema; asserting on values
# is too brittle (e.g. "park" vs "the park").
assert set(json.loads(completion.choices[0].message.content).keys()) == set(
    Animals.model_fields.keys()
)
```
If I set `guided_decoding_backend` to `outlines` or `lm-format-enforcer`, the test passes. However, if I set it to `xgrammar`, the test fails. The completion is:
```
completion=ChatCompletion(id='chatcmpl-425b0ae7-02eb-467e-89bf-83080494182c', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n "location": "park",\n "activity": "bike ride",\n "animals_seen": \n \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[]), stop_reason=None)], created=1736835123, model='dsp.llama-3.1-8b-instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=250, prompt_tokens=80, total_tokens=330, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
```
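Note the failure mode in the dump above: after `"animals_seen":` the model emits only whitespace until `max_tokens` is hit (`finish_reason='length'`), so the content is not parseable JSON. A stdlib-only sketch of the check (the tab run is abbreviated here):

```python
import json

# Truncated completion content as returned with finish_reason='length';
# the long tab run after "animals_seen": is shortened for readability.
content = (
    '{\n "location": "park",\n "activity": "bike ride",\n "animals_seen": \n'
    + "\t" * 40
)

try:
    json.loads(content)
    parsed = True
except json.JSONDecodeError as exc:
    # The integer value was never emitted, so parsing fails mid-object.
    parsed = False
    print(f"invalid JSON: {exc.msg}")

assert not parsed
```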
I tested with llama3.1-8b-instruct, quantized to gpt-w8a8 with llm-compressor. The deployment arguments are:
```yaml
- '--tensor-parallel-size=1'
- '--max-num-batched-tokens=4096'
- '--enable-chunked-prefill'
- '--gpu-memory-utilization=0.96'
- '--enable-auto-tool-choice'
- '--tool-call-parser=llama3_json'
- '--chat-template=/mnt/models/tool_chat_template_llama3.1_json.jinja'
```
Thanks a lot!
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.