

@shen-shanshan
Collaborator

@shen-shanshan shen-shanshan commented Jul 1, 2025

What this PR does / why we need it?

This PR (adapted from vllm-project/vllm#20232) updates the CachedRequestData definition to use a single instance shared across all requests in a batch, instead of creating a new instance per request.
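The following minimal Python sketch shows the layout change concretely. It contrasts a per-request record with a batch-level record that stores one list per field, where index i of every list refers to the same request. The class names and fields (PerRequestCachedData, BatchedCachedData, new_token_ids, num_computed_tokens) are hypothetical simplifications for illustration, not the actual CachedRequestData definition in vLLM or vllm-ascend.

```python
from dataclasses import dataclass, field


# Hypothetical "before" layout: one small object per request.
@dataclass
class PerRequestCachedData:
    req_id: str
    new_token_ids: list[int]
    num_computed_tokens: int


# Hypothetical "after" layout: one object per batch, one list per field.
# Index i of every list belongs to the same request.
@dataclass
class BatchedCachedData:
    req_ids: list[str] = field(default_factory=list)
    new_token_ids: list[list[int]] = field(default_factory=list)
    num_computed_tokens: list[int] = field(default_factory=list)

    def append(self, req_id: str, new_token_ids: list[int],
               num_computed_tokens: int) -> None:
        # Append one request's data to every per-field list in lockstep.
        self.req_ids.append(req_id)
        self.new_token_ids.append(new_token_ids)
        self.num_computed_tokens.append(num_computed_tokens)

    def __len__(self) -> int:
        return len(self.req_ids)


def to_batched(items: list[PerRequestCachedData]) -> BatchedCachedData:
    # Collapse N per-request objects into a single batch-level object.
    batch = BatchedCachedData()
    for item in items:
        batch.append(item.req_id, item.new_token_ids, item.num_computed_tokens)
    return batch


per_request = [PerRequestCachedData(f"req-{i}", [i], i * 10) for i in range(3)]
batch = to_batched(per_request)
print(len(batch), batch.req_ids)  # 3 ['req-0', 'req-1', 'req-2']
```

Because the batch object is created once per scheduling step, there is no per-request allocation left to cache, which is what makes the reuse logic unnecessary.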

This change brings two advantages:

  • Code simplification: Previously, to avoid the cost of instantiating CachedRequestData for every request, we cached and reused its instances, which added complexity and in some cases even caused a memory leak. With a single shared instance, this caching logic can be removed entirely, simplifying the codebase and eliminating the possibility of a leak.
  • Faster serialization: Sharing a single instance across the batch speeds up the serialization of SchedulerOutput. Although the data size remains unchanged, serializing one big object is faster than serializing many (up to 1024) small objects.
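The serialization claim can be checked with a rough micro-benchmark. The sketch below uses the standard-library pickle module as a stand-in serializer (an assumption; the serializer actually used for SchedulerOutput may differ) together with hypothetical dataclasses, and compares pickling 1024 small per-request objects against pickling one batched object that holds the same fields. Timings depend on the machine and Python version, so no expected numbers are given.

```python
import pickle
import timeit
from dataclasses import dataclass


@dataclass
class PerRequest:  # hypothetical per-request record
    req_id: str
    num_computed_tokens: int


@dataclass
class Batched:  # hypothetical batch-level record: one list per field
    req_ids: list[str]
    num_computed_tokens: list[int]


def build(n: int) -> tuple[list[PerRequest], Batched]:
    # Build the same data in both layouts so only the structure differs.
    per_request = [PerRequest(f"req-{i}", i) for i in range(n)]
    batched = Batched([r.req_id for r in per_request],
                      [r.num_computed_tokens for r in per_request])
    return per_request, batched


def bench(n: int = 1024, repeat: int = 100) -> tuple[float, float]:
    # Return total seconds spent pickling each layout `repeat` times.
    per_request, batched = build(n)
    t_many = timeit.timeit(lambda: pickle.dumps(per_request), number=repeat)
    t_one = timeit.timeit(lambda: pickle.dumps(batched), number=repeat)
    return t_many, t_one


if __name__ == "__main__":
    many, one = bench()
    print(f"1024 small objects: {many:.4f}s, one batched object: {one:.4f}s")
```

Each small object carries its own per-instance serialization overhead (object header and field names for every instance), whereas the batched layout pays that overhead once and then serializes flat lists.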

Does this PR introduce any user-facing change?

No.

How was this patch tested?

python examples/offline_inference_npu_v1.py

logs:

Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 399.18it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.04s/it, est. speed input: 5.30 toks/s, output: 96.42 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Dr. David M. Kline, and I am a board-certified orthopedic surgeon. I am a member of the American Academy of Orthopedic Surgeons, the American Association of Hip and Knee Surgeons, and the American Association of Arthroscopy and Sports Medicine. I am also a member of the American College of Surgeons.\nI am a native of the San Francisco Bay Area and received my undergraduate degree from the University of California, Berkeley. I received my medical degree from the'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States. The president directs the executive branch of the federal government and is the commander-in-chief of the United States Armed Forces. The president is further empowered to appoint federal judges, including members of the Supreme Court, subject to Senate approval. The president is also responsible for the enforcement of federal law and may grant federal pardons and reprieves. The president is further empowered to make treaties, subject to Senate ratification, and to receive foreign ambassadors'
Prompt: 'The capital of France is', Generated text: " Paris. Which of the following statements is true?\nA. Paris is the capital of France.\nB. Paris is not the capital of France.\nC. Paris is the capital of Germany.\nD. Paris is the capital of Italy.\nTo determine which statement is true, let's analyze each option step by step:\n\nA. Paris is the capital of France.\n- This statement is true. Paris is indeed the capital of France.\n\nB. Paris is not the capital of France.\n- This statement is"
Prompt: 'The future of AI is', Generated text: ' here. It’s not just a buzzword or a concept anymore. It’s a reality that’s transforming the way we live, work, and interact with technology. From self-driving cars to virtual assistants, AI is becoming an integral part of our daily lives. But what exactly is AI, and how is it changing the world? In this article, we’ll explore the basics of AI, its applications, and its impact on society.\nWhat is AI?\nArtificial Intelligence (AI) is a branch'

@codecov

codecov bot commented Jul 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 34.14%. Comparing base (c30ddb8) to head (c7516d1).
⚠️ Report is 602 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1539      +/-   ##
==========================================
+ Coverage   27.39%   34.14%   +6.75%     
==========================================
  Files          56       63       +7     
  Lines        6191     7315    +1124     
==========================================
+ Hits         1696     2498     +802     
- Misses       4495     4817     +322     
Flag        Coverage Δ
unittests   34.14% <ø> (+6.75%) ⬆️

Flags with carried forward coverage won't be shown.


@MengqingCao
Collaborator

@shen-shanshan
Collaborator Author

Using ganyi's PR #1546.
