
Conversation

@imkero imkero commented Oct 15, 2025

Purpose

Fix #25903

This PR modifies the model runners (which operate the encoder cache) and the multimodal PlaceholderRange (which provides the metadata) so that the original encoder output is counted and cached, rather than the interleaved encoder output.

In detail, this PR includes the following modifications:

  1. add post-init behaviour in PlaceholderRange:
    • cache the num_embeds
    • remove the leading & trailing False entries in is_embed to enable more precise scheduling
  2. modify the encoder cache behaviour
    • cache the original encoder_output (instead of the interleaved one, as is done currently)
    • slice the cached encoder_output by counting True entries in the PlaceholderRange.is_embed mask
  3. modify Request.get_num_encoder_tokens to return the actual number of encoder tokens (excluding non-embedding tokens) for encoder cache scheduling, corresponding to the modified encoder caching behaviour
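The steps above can be sketched roughly as follows. This is an illustrative mock-up, not the PR's code: it uses plain Python lists in place of the tensor masks vLLM actually uses, the field layout loosely follows `vllm.multimodal.inputs.PlaceholderRange`, and the helper `gather_encoder_rows` is a hypothetical name for the slicing step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PlaceholderRange:
    """Illustrative stand-in for vLLM's PlaceholderRange (list-based mask)."""
    offset: int                             # first placeholder token in the prompt
    length: int                             # number of placeholder tokens
    is_embed: Optional[List[bool]] = None   # None means every token is an embedding

    def __post_init__(self) -> None:
        if self.is_embed is None:
            self.num_embeds = self.length
            return
        # Cache the embedding-token count so later steps need not recount it.
        self.num_embeds = sum(self.is_embed)
        if self.num_embeds == 0:
            return
        # Trim leading/trailing False entries so the range covers only the
        # tightest window containing embedding tokens.
        first = next(i for i, e in enumerate(self.is_embed) if e)
        last = len(self.is_embed) - next(
            i for i, e in enumerate(reversed(self.is_embed)) if e)
        self.offset += first
        self.length = last - first
        self.is_embed = self.is_embed[first:last]


def gather_encoder_rows(encoder_output: List, pr: PlaceholderRange,
                        start_tok: int, end_tok: int) -> List:
    """Slice the original (non-interleaved) cached encoder output for
    placeholder tokens [start_tok, end_tok), relative to pr.offset,
    by counting True entries in the is_embed mask."""
    if pr.is_embed is None:
        return encoder_output[start_tok:end_tok]
    start_row = sum(pr.is_embed[:start_tok])
    end_row = sum(pr.is_embed[:end_tok])
    return encoder_output[start_row:end_row]
```

For example, a mask `[False, True, True, False, True, False]` at `offset=10` trims to `offset=11`, `length=4`, `num_embeds=3`, and only the three embedding rows are ever stored in the cache.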

Still in draft:

  • modify the profiling behaviour?
  • modify other model runners besides gpu_model_runner
  • improve the performance of counting True entries in is_embed

Test Plan

Test Result



@mergify mergify bot added multi-modality Related to multi-modality (#4194) v1 labels Oct 15, 2025
@imkero imkero force-pushed the feat/contiguous-encoder-cache branch from 0aad5c3 to 1df6a4f Compare October 15, 2025 16:40

imkero commented Oct 15, 2025

I'm wondering whether we can produce an is_embed_cumsum ahead of time so that we do not need to count True entries in is_embed at every step.
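A sketch of that idea: precompute an exclusive prefix sum of the mask once per placeholder, so each step maps a token sub-range to encoder-output rows with two lookups instead of re-counting True entries. The function names here are hypothetical, not from the PR.

```python
from itertools import accumulate
from typing import List, Tuple


def build_is_embed_cumsum(is_embed: List[bool]) -> List[int]:
    # cumsum[i] == number of True entries in is_embed[:i]
    return [0] + list(accumulate(int(e) for e in is_embed))


def embed_row_span(cumsum: List[int], start_tok: int,
                   end_tok: int) -> Tuple[int, int]:
    # Encoder-output rows covered by placeholder tokens [start_tok, end_tok)
    return cumsum[start_tok], cumsum[end_tok]
```

This turns the per-step cost from O(range length) into O(1) at the price of one extra integer array per placeholder, built once when the mask is created.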



Development

Successfully merging this pull request may close these issues.

[MM]: Optimize encoder cache memory consumption by storing encoder outputs only
