
Conversation

@bigPYJ1151 (Member) commented Jun 12, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

  • Refine the CPU backend's default torch.compile config: enable inductor by default, and disable inductor in most CI tests to save time.
  • Allow setting default_max_num_seqs per UsageContext, and set CPU-specific defaults for default_max_num_seqs and default_max_num_batched_tokens (see the sketch below).
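
As a rough illustration of the per-UsageContext defaults idea: the sketch below is simplified and hypothetical (the helper name, class layout, and numbers are placeholders, not the actual vLLM code); it only shows how generic per-context defaults can be overridden by a platform such as the CPU backend.

```python
from enum import Enum


class UsageContext(Enum):
    LLM_CLASS = "LLM_CLASS"
    OPENAI_API_SERVER = "OPENAI_API_SERVER"


# Hypothetical generic defaults, keyed by usage context.
DEFAULT_MAX_NUM_SEQS = {
    UsageContext.LLM_CLASS: 256,
    UsageContext.OPENAI_API_SERVER: 1024,
}


def resolve_max_num_seqs(usage_context, platform_overrides=None):
    """Return the default for this context, letting a platform (e.g. CPU) override it."""
    if platform_overrides and usage_context in platform_overrides:
        return platform_overrides[usage_context]
    return DEFAULT_MAX_NUM_SEQS[usage_context]


# A CPU backend could supply smaller, CPU-friendly overrides:
cpu_overrides = {
    UsageContext.LLM_CLASS: 128,
    UsageContext.OPENAI_API_SERVER: 128,
}
print(resolve_max_num_seqs(UsageContext.OPENAI_API_SERVER, cpu_overrides))  # -> 128
```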

Test Plan

CI pipeline

Test Result

(Optional) Documentation Update

Signed-off-by: jiang1.li <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment


Summary of Changes

Hello @bigPYJ1151, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the default configuration for the CPU backend. The primary goal is to enable torch.compile with the inductor backend by default for better performance, while simultaneously optimizing CI run times by disabling inductor for most tests. Additionally, it introduces more granular control over default sequence and batched token limits based on the usage context and sets specific defaults for the CPU platform.

Highlights

  • CPU Backend Defaults: Enabled torch.compile with the inductor backend by default for the CPU platform in V1 mode, moving away from the previously enforced eager mode.
  • CI Performance Optimization: Introduced an environment variable (VLLM_CPU_CI_ENV) to disable the inductor backend specifically within most CPU CI test runs to reduce compilation time and speed up tests. The benchmark test explicitly re-enables it.
  • Per-Context Sequence/Token Limits: Refactored the default max_num_seqs setting to be a dictionary allowing different default values based on the UsageContext (e.g., LLM_CLASS, OPENAI_API_SERVER). Added specific, lower default values for max_num_seqs and max_num_batched_tokens tailored for the CPU backend.
  • Compilation Settings Refinement: Adjusted the global compilation settings context manager used during model warm-up to pass the VLLM config and make the freezing setting conditional on max_autotune.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@bigPYJ1151 changed the title from "[CPU] Refine CPU default config" to "[CPU] Refine default config for the CPU backend" on Jun 12, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request refines the CPU backend's default PyTorch Compile (Inductor) configuration and introduces more granular control over max_num_seqs based on usage context. The changes aim to enable Inductor by default for CPU for better performance, while using eager mode in most CI tests to save time.

Key changes include enabling Inductor by default for CPU, allowing context-specific defaults for sequence and batch token limits, and updating Inductor settings in the cpu_model_runner.

A critical concern is the modification to torch._inductor.config.freezing handling, which might default to False when True could be required by CPU Inductor backends. Additionally, the removal of force_disable_caches and changes to custom_ops handling warrant verification to prevent regressions or unexpected behavior. The CI script changes also warrant review.

Comment on lines +75 to +86
```diff
 def _set_global_compilation_settings(config: VllmConfig):
     import torch._inductor.config

-    # Note: The CPPGEMM backend requires freezing parameters.
-    freezing_value = torch._inductor.config.freezing
-    torch._inductor.config.freezing = True
-    # Note: workaround for "ValueError: fast mode: can't pickle cyclic objects
-    # including object type dict"
-    force_disable_caches = torch._inductor.config.force_disable_caches
-    torch._inductor.config.force_disable_caches = True
-    yield
-    torch._inductor.config.freezing = freezing_value
-    torch._inductor.config.force_disable_caches = force_disable_caches
+    inductor_config = config.compilation_config.inductor_compile_config
+    try:
+        # Note: The MKLDNN and CPPGEMM backend requires freezing parameters.
+        freezing_value = torch._inductor.config.freezing
+        if inductor_config.get("max_autotune", False):
+            torch._inductor.config.freezing = True
+        yield
+    finally:
+        torch._inductor.config.freezing = freezing_value
```

critical: The logic for setting torch._inductor.config.freezing has changed: it is now conditional on inductor_config.get("max_autotune", False). Verify whether freezing=False is safe and intended for the default CPU Inductor path, or whether freezing=True should be set for CPU by default, independently of max_autotune, if the underlying Inductor backends require it.
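
If freezing does turn out to be required unconditionally on CPU, one option would be to restore the previous always-on behaviour while keeping the try/finally cleanup. This is a minimal sketch of that idea, not a change proposed in this PR; the function name is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def _always_freeze_for_cpu():
    # Hypothetical variant: keep inductor freezing enabled for every CPU
    # compile (e.g. if MKLDNN/CPPGEMM weight prepacking requires it), and
    # restore the previous value afterwards.
    import torch._inductor.config
    previous = torch._inductor.config.freezing
    torch._inductor.config.freezing = True
    try:
        yield
    finally:
        torch._inductor.config.freezing = previous
```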

Comment on lines +178 to +181
```python
if os.environ.get("VLLM_CPU_CI_ENV", "0") != "0":
    backend = "eager"
else:
    backend = "inductor"
```

medium: The line compilation_config.custom_ops += ["none"] was removed. Clarify the expected behavior of custom ops with this change, for both the "eager" and "inductor" backends under CompilationLevel.DYNAMO_ONCE, and ensure custom ops are still handled correctly.
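
For context, custom_ops in vLLM's CompilationConfig is a list of tokens such as "all", "none", "+op", or "-op", so appending "none" disables custom-op implementations unless a later "+op" entry re-enables a specific one. The snippet below is illustrative only (the op name and list values are placeholders, not from this diff):

```python
# Illustrative only: stand-in for compilation_config.custom_ops.
custom_ops: list[str] = []
custom_ops += ["none"]        # the removed line: disable all custom ops by default
custom_ops += ["+rms_norm"]   # hypothetical: opt a single op back in
print(custom_ops)             # ['none', '+rms_norm']
```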

```diff
 python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --dtype half &
 timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
-python3 benchmarks/benchmark_serving.py \
+VLLM_CPU_CI_ENV=0 python3 benchmarks/benchmark_serving.py \
```

medium: Setting VLLM_CPU_CI_ENV=0 here seems to override the environment variable set in the docker run command. Is this intentional?

Suggested change:

```diff
-VLLM_CPU_CI_ENV=0 python3 benchmarks/benchmark_serving.py \
+python3 benchmarks/benchmark_serving.py \
     --backend vllm \
```
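
Per the PR summary above, the override is intentional: the container-level variable keeps most CPU tests on the eager backend, while the benchmark command re-enables inductor. A rough sketch of that interplay is shown below (the image name, test path, and flags are placeholders, not the actual CI script):

```bash
# Container-level setting: most tests run with the eager backend.
docker run -e VLLM_CPU_CI_ENV=1 vllm-cpu-test-image bash -c '
  pytest tests/some_cpu_tests.py                 # VLLM_CPU_CI_ENV=1 -> eager backend

  # Per-command override: re-enable inductor just for the serving benchmark.
  VLLM_CPU_CI_ENV=0 python3 benchmarks/benchmark_serving.py --backend vllm
'
```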

@bigPYJ1151 (Member, Author) commented:

Hi @Isotr0py @DarkLight1337, would you please help review this PR? It makes some minor changes to the default CPU backend configs, and the cpu-test has passed. Thanks :)

@Isotr0py (Member) left a comment:

Overall LGTM!

@Isotr0py added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Jun 12, 2025
@DarkLight1337 merged commit 6458721 into vllm-project:main on Jun 13, 2025
81 checks passed
Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


Labels

ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), v1
