
cragents

Constrain Reasoning Agents to limit reasoning output.

Why

And I'm thinking While I'm thinking... (Crackerman, Stone Temple Pilots, 1992)

Reasoning models use a lot of tokens for their reasoning output. This is resource intensive and does not necessarily improve accuracy - have you ever seen a reasoning model talk itself out of the right answer? So it may be desirable to limit the tokens used. Doing so can:

  • Improve response speed
  • Decrease GPU memory requirements
  • Provide more space in the context for stuff that matters
  • Improve accuracy on user queries that do not require extended analysis

How

cragents provides a utility to constrain pydantic-ai agents when vLLM is used to serve the agent's model. It limits the number of paragraphs, and the number of sentences per paragraph, in the reasoning output. Both limits are configurable.

import cragents
import os
from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel, OpenAIChatModelSettings
from pydantic_ai.providers.openai import OpenAIProvider


# define an agent as you normally would
class Output(BaseModel):
    answer: str
    reason: str


model = OpenAIChatModel(
    model_name=os.environ["VLLM_MODEL_NAME"],
    provider=OpenAIProvider(
        api_key=os.environ["VLLM_API_KEY"],
        base_url=os.environ["VLLM_BASE_URL"],
    ),
    settings=OpenAIChatModelSettings(
        max_tokens=1000,
    ),
)

agent = Agent(
    model=model,
    output_type=[Output],
)

# constrain reasoning as appropriate
await cragents.constrain_reasoning(
    agent,
    reasoning_paragraph_limit=1,
    reasoning_sentence_limit=1,
)

# call the agent as you normally would
run = await agent.run("Hi")

Inspecting the ThinkingParts shows that the reasoning output is constrained.

from pydantic_ai.messages import ThinkingPart

for message in run.all_messages():
    for part in message.parts:
        if isinstance(part, ThinkingPart):
            print(part)
The printed ThinkingPart shows a single one-sentence paragraph:

ThinkingPart(content='\nOkay, the user said "Hi".\n', id='content', provider_name='openai')

For the above example, vLLM was run on a single RTX 4090:

uv run vllm serve "Qwen/Qwen3-VL-8B-Thinking-FP8" \
    --gpu-memory-utilization 0.92 \
    --api-key $VLLM_API_KEY \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --max-model-len 40000 \
    --guided-decoding-backend guidance
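By default this serves an OpenAI-compatible API on port 8000, so for a local server VLLM_BASE_URL in the example above would be http://localhost:8000/v1, with VLLM_MODEL_NAME set to Qwen/Qwen3-VL-8B-Thinking-FP8.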

Limitations

  • Only models that use the <think></think> tokens to denote reasoning will work (see the sketch after this list)
  • Only models that use the <tool_call></tool_call> tokens to denote tool calls will work
  • vLLM must be started without a reasoning parser (pydantic-ai will still extract reasoning content correctly)
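
These limitations reflect how the constraint is enforced: the serve command above enables vLLM's guided decoding backend, so the shape of the reasoning block can be imposed at decode time. For intuition only, here is a minimal sketch of that idea, assuming vLLM's guided_regex extra body parameter and the same environment variables as above. It is not cragents' actual grammar; the regex simply caps the <think></think> block at one single-sentence paragraph.

import os

from openai import AsyncOpenAI

# Point the stock OpenAI client at the vLLM server.
client = AsyncOpenAI(
    api_key=os.environ["VLLM_API_KEY"],
    base_url=os.environ["VLLM_BASE_URL"],
)

response = await client.chat.completions.create(
    model=os.environ["VLLM_MODEL_NAME"],
    messages=[{"role": "user", "content": "Hi"}],
    extra_body={
        # Hypothetical simplified pattern: one sentence (no terminator
        # until the end), the closing tag, then unconstrained output.
        "guided_regex": r"<think>\n[^.!?\n]+[.!?]\n</think>[\s\S]*",
    },
)
print(response.choices[0].message.content)

A real implementation also has to leave room for <tool_call></tool_call> blocks and the structured output, which this sketch ignores.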
