Conversation

@lengrongfu
Contributor

@lengrongfu lengrongfu commented Mar 29, 2025

FIX #15719

Test result:
[image: test-result screenshot]

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Mar 29, 2025
@mergify

mergify bot commented Mar 29, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot removed the needs-rebase label Mar 29, 2025
@lengrongfu lengrongfu changed the title [FIX]: vllm v1 version metric num_gpu_blocks is None [V1][FIX]: vllm v1 version metric num_gpu_blocks is None Mar 29, 2025
@lengrongfu lengrongfu changed the title [V1][FIX]: vllm v1 version metric num_gpu_blocks is None [V1][Bugfix]: vllm v1 version metric num_gpu_blocks is None Mar 29, 2025
@lengrongfu lengrongfu force-pushed the fix/v1-metric branch 2 times, most recently from 43214b6 to 0f51eb5, on March 29, 2025 at 16:01
@lengrongfu
Contributor Author

@WoosukKwon can you take a look? thanks ~

@markmc
Member

markmc commented Mar 31, 2025

Great catch! Yes indeed, num_gpu_blocks isn't currently available in the frontend, so we need some way of getting it from the engine

We currently return gpu_cache_usage in SchedulerStats:

@dataclass
class SchedulerStats:
    gpu_cache_usage: float = 0.0

it's tempting to include num_gpu_blocks there, even though it never changes. The overhead is small and the integration is easy!
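
That first option is just a one-field addition, something like (a sketch of the shape, not the final change):

@dataclass
class SchedulerStats:
    gpu_cache_usage: float = 0.0
    num_gpu_blocks: int = 0  # static after startup, but cheap to carry along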

Also, looking at the way BlockPool.get_usage() works, we could easily do:

from dataclasses import dataclass, field

@dataclass
class GPUCacheStats:
    num_blocks: float = 0.0
    num_free_blocks: float = 0.0

@dataclass
class SchedulerStats:
    gpu_cache_stats: GPUCacheStats = field(default_factory=GPUCacheStats)

It's easy to imagine we will want to add more to this over time
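
For that second shape, the scheduler side might be wired up roughly as follows (make_stats and the block-pool accessors here are illustrative guesses, not the actual vLLM API):

def make_stats(block_pool) -> SchedulerStats:
    # Uses the GPUCacheStats/SchedulerStats dataclasses sketched above.
    # Hypothetical accessors: the real BlockPool exposes get_usage(),
    # and the exact attribute/method names may differ.
    return SchedulerStats(gpu_cache_stats=GPUCacheStats(
        num_blocks=block_pool.num_gpu_blocks,
        num_free_blocks=block_pool.get_num_free_blocks(),
    ))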

@lengrongfu
Contributor Author

> Great catch! Yes indeed, num_gpu_blocks isn't currently available in the frontend, so we need some way of getting it from the engine [...] It's easy to imagine we will want to add more to this over time

@WoosukKwon What do you think about this suggestion?

@mergify

mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@lengrongfu
Contributor Author

@markmc PTAL

Member

@markmc markmc left a comment


To be clear, I don't love including this unchanging information in SchedulerStats - but it is low overhead

@njhill are there any implications from #15977 we should consider? This is basically information about the engine that is only available to the frontend once the engine has finished initializing

Member

@njhill njhill left a comment


I haven't reviewed closely, just added a few comments about things I noticed.

> @njhill are there any implications from #15977 we should consider? This is basically information about the engine that is only available to the frontend once the engine has finished initializing

@markmc it would probably be better to return this in the "ready" message that the engine sends to the front-end here once startup is complete, if possible.

@markmc
Member

markmc commented Apr 8, 2025

> I haven't reviewed closely, just added a few comments about things I noticed.
>
> > @njhill are there any implications from #15977 we should consider? This is basically information about the engine that is only available to the frontend once the engine has finished initializing
>
> @markmc it would probably be better to return this in the "ready" message that the engine sends to the front-end here once startup is complete, if possible.

Yeah, thanks. Just need to expand that I guess ...

These are the two sides of that:

vllm/v1/engine/core.py (lines 488 to 489 at f6b32ef):

# Send ready message to front-end once input socket is connected.
socket.send(b'READY')

# Front-end side (the receiving end of the ready handshake):
if msg != b'READY':
    raise RuntimeError(f"Engine {eng_id} failed: {msg.decode()}")

Either extend this binary protocol with an int:

import struct

# Engine side: append the block count as a 4-byte big-endian unsigned int.
message = b'READY' + struct.pack('!I', vllm_config.cache_config.num_gpu_blocks)

...

# Front-end side: split the 5-byte header from the 4-byte payload.
header = data[:5]
num_gpu_blocks = struct.unpack('!I', data[5:9])[0]
if header != b'READY':
    ...
vllm_config.cache_config.num_gpu_blocks = num_gpu_blocks

or do it with JSON:

import json

# Engine side: send a structured ready message.
message_dict = {
    'type': 'READY',
    'num_gpu_blocks': vllm_config.cache_config.num_gpu_blocks,
}
message = json.dumps(message_dict).encode('utf-8')

...

# Front-end side: parse it and pick out the block count.
data = socket.recv(1024)
message_dict = json.loads(data.decode('utf-8'))

if message_dict['type'] == 'READY':
    vllm_config.cache_config.num_gpu_blocks = message_dict['num_gpu_blocks']

@njhill
Member

njhill commented Apr 8, 2025

Thanks @markmc, yes I was thinking of something like the JSON approach. I think json dumps/loads can be used without the string encode/decode, though. Also, when there's more than one engine, we may want to record the Prometheus metric with a label for the engine index.
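
A sketch of that simplification, assuming a pyzmq socket (its send_json/recv_json helpers do the serialization, and json.loads also accepts raw bytes directly, so only the dumps side ever needs an explicit encode):

import json

# Engine side: pyzmq serializes the dict itself.
socket.send_json({
    'type': 'READY',
    'num_gpu_blocks': vllm_config.cache_config.num_gpu_blocks,
})

# Front-end side: recv_json parses straight back into a dict.
message_dict = socket.recv_json()

# Or, with a plain recv(): json.loads accepts bytes, so no
# explicit .decode('utf-8') is needed.
message_dict = json.loads(socket.recv())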

@DarkLight1337
Member

DarkLight1337 commented Apr 27, 2025

Can you update this PR with main? Then I can stamp it

@lengrongfu
Contributor Author

OK, I will update it later.

@mergify

mergify bot commented Apr 27, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) April 28, 2025 02:36
@DarkLight1337
Member

Looks like some tests are failing consistently, PTAL

auto-merge was automatically disabled April 28, 2025 09:50

Head branch was pushed to by a user without write access

@lengrongfu lengrongfu force-pushed the fix/v1-metric branch 3 times, most recently from 218a66e to d67b894, on April 29, 2025 at 14:53
@lengrongfu
Contributor Author

lengrongfu commented Apr 30, 2025

Looks like some tests are failing consistently, PTAL

Hi, the CI pipeline tests are now passing. @DarkLight1337

@lengrongfu
Contributor Author

Please take another look @DarkLight1337

@DarkLight1337 DarkLight1337 merged commit d803786 into vllm-project:main Apr 30, 2025
43 checks passed
@njhill
Member

njhill commented May 1, 2025

I think with these changes, in the data parallel case (when there are multiple engines), the metrics will still only reflect the num_gpu_blocks from one of the engines. Typically the values would be similar across engines, but we may want to consider whether they should be published individually with labels or aggregated in some other way.
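
For illustration, publishing per-engine values with a label might look like this (the metric and label names are hypothetical; only the prometheus_client API itself is real):

from prometheus_client import Gauge

# One gauge, one time series per engine via the label.
gauge_num_gpu_blocks = Gauge(
    'vllm:num_gpu_blocks',  # hypothetical metric name
    'Number of GPU KV-cache blocks available to each engine.',
    labelnames=['engine'])

# engine_block_counts is a hypothetical {engine_index: num_gpu_blocks}
# mapping collected from the per-engine ready messages.
for engine_index, num_blocks in engine_block_counts.items():
    gauge_num_gpu_blocks.labels(engine=str(engine_index)).set(num_blocks)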

radeksm pushed a commit to radeksm/vllm that referenced this pull request May 2, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025

Labels

frontend, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: num_gpu_blocks metric is None in V1

4 participants