[Performance] Optimize encoder cache memory consumption by storing encoder outputs only #26924

imkero · 2025-10-15T16:39:18Z

Purpose

This PR modifies the model runners (which operate the encoder cache) and the multi-modal PlaceholderRange (which provides the metadata) so that the encoder cache counts and stores the original encoder outputs instead of the interleaved ones.

In detail, this PR includes the following modifications:

Add post-init behaviour to PlaceholderRange:
- cache num_embeds
- strip the leading and trailing False entries from is_embed to enable more precise scheduling
Modify the encoder cache behaviour:
- cache the original encoder_output (instead of the interleaved one, as is done currently)
- slice the cached encoder_output by counting the True entries in the PlaceholderRange.is_embed mask
Modify Request.get_num_encoder_tokens to return the actual number of encoder tokens (excluding non-embedding tokens) for encoder cache scheduling, corresponding to the modified encoder caching behaviour.
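The changes listed above can be illustrated with a minimal, standalone sketch. It is not the code in this PR: `PlaceholderRangeSketch` and `slice_encoder_output` are hypothetical names, plain Python lists stand in for tensors, and the adjustment of `offset` and `length` when trimming is an assumption about how the trimmed mask stays aligned with the prompt.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class PlaceholderRangeSketch:
    """Hypothetical stand-in for PlaceholderRange (not the vLLM class)."""

    offset: int
    length: int
    is_embed: Optional[List[bool]] = None
    num_embeds: int = field(init=False)

    def __post_init__(self) -> None:
        if self.is_embed is None:
            # No mask: every position in the range is an embedding.
            self.num_embeds = self.length
            return
        if len(self.is_embed) != self.length:
            raise ValueError("is_embed must have one entry per position")
        if True not in self.is_embed:
            self.num_embeds = 0
            return
        # Strip leading and trailing False entries (assumed to shift the range).
        first = self.is_embed.index(True)
        last = len(self.is_embed) - 1 - self.is_embed[::-1].index(True)
        self.is_embed = self.is_embed[first:last + 1]
        self.offset += first
        self.length = last - first + 1
        # Cache the number of real embedding positions.
        self.num_embeds = sum(self.is_embed)


def slice_encoder_output(
    encoder_output: Sequence,
    placeholder: PlaceholderRangeSketch,
    start: int,
    end: int,
) -> Sequence:
    """Return the cached rows for placeholder positions [start, end).

    encoder_output holds only real encoder outputs (num_embeds rows), so the
    row indices are found by counting True entries in the mask.
    """
    if placeholder.is_embed is None:
        return encoder_output[start:end]
    first_row = sum(placeholder.is_embed[:start])
    num_rows = sum(placeholder.is_embed[start:end])
    return encoder_output[first_row:first_row + num_rows]


# Usage: six placeholder positions, only three of which carry embeddings.
pr = PlaceholderRangeSketch(
    offset=10,
    length=6,
    is_embed=[False, True, True, False, True, False],
)
cached = ["e0", "e1", "e2"]  # compact cache: one row per True entry
chunk_a = slice_encoder_output(cached, pr, 0, 2)
chunk_b = slice_encoder_output(cached, pr, 2, 4)
```

With this layout the cache holds `num_embeds` rows per item rather than `length` rows, which is the memory saving the PR title refers to.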

Still in draft

Modify the profiling behaviour?
Modify the other model runners besides gpu_model_runner
Improve the performance of counting True entries in is_embed

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Kero Liang <[email protected]>

imkero · 2025-10-15T16:51:29Z

I'm wondering whether we can precompute an is_embed_cumsum ahead of time, so that we do not need to count the True entries in is_embed at every step.
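A rough sketch of that idea, assuming a prefix-sum list with a leading zero. The helper names `build_is_embed_cumsum` and `embed_slice_bounds` are hypothetical, and plain Python lists stand in for tensors.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def build_is_embed_cumsum(is_embed: Sequence[bool]) -> List[int]:
    """Prefix sums with a leading 0: cumsum[i] == number of True in is_embed[:i]."""
    return [0, *accumulate(int(flag) for flag in is_embed)]


def embed_slice_bounds(cumsum: Sequence[int], start: int, end: int) -> Tuple[int, int]:
    """Row bounds in the compact encoder output for mask positions [start, end)."""
    return cumsum[start], cumsum[end]


# Built once per placeholder, then reused at every scheduling step.
is_embed = [True, True, False, True]
is_embed_cumsum = build_is_embed_cumsum(is_embed)
row_start, row_end = embed_slice_bounds(is_embed_cumsum, 2, 4)
```

Each lookup then costs two index operations instead of a scan over the mask, at the price of storing one extra integer per placeholder position.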

mergify bot added the multi-modality (related to multi-modality, #4194) and v1 labels on Oct 15, 2025

feat: contiguous encoder cache

1df6a4f

Signed-off-by: Kero Liang <[email protected]>

imkero force-pushed the feat/contiguous-encoder-cache branch from 0aad5c3 to 1df6a4f on October 15, 2025 16:40
