
Conversation

@eicherseiji (Contributor) commented Jul 14, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Problem: the current implementation forces users to choose between the default loggers and custom loggers, but not both. It is possible to recreate the default logger list manually and append a custom logger, but that workaround has to be kept in sync by hand whenever the default loggers change (see the sketch below the reference snippet).

Based on discussion in #14661, I see this was deprioritized to start, but thought it worth raising again as a minimally invasive ergonomics change. Would love to get thoughts on this. Thanks!

Implements feature request #17702.

Ref:

# vllm/vllm/v1/metrics/loggers.py
def setup_default_loggers(
    vllm_config: VllmConfig,
    log_stats: bool,
    engine_num: int,
    custom_stat_loggers: Optional[list[StatLoggerFactory]] = None,
) -> list[list[StatLoggerBase]]:
    """Setup logging and prometheus metrics."""
    if not log_stats:
        return []

    factories: list[StatLoggerFactory]
    if custom_stat_loggers is not None:
        factories = custom_stat_loggers
    else:
        factories = [PrometheusStatLogger]
        if logger.isEnabledFor(logging.INFO):
            factories.append(LoggingStatLogger)
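    # ... (remainder elided: the function builds one logger list per
    # engine from `factories` and returns it)

To illustrate the problem, here is a minimal sketch of the workaround described above, assuming the names from the snippet; MyStatLogger is a hypothetical custom logger invented for this example. To keep the defaults today, the caller has to rebuild the default factory list by hand:

import logging

from vllm.v1.metrics.loggers import (LoggingStatLogger, PrometheusStatLogger,
                                     StatLoggerBase)

logger = logging.getLogger("vllm")


class MyStatLogger(StatLoggerBase):
    """Hypothetical custom logger, for illustration only."""

    def record(self, scheduler_stats, iteration_stats):
        pass  # e.g. forward stats to an external sink

    def log(self):
        pass


# Manual recreation of the defaults: this silently drifts out of date
# if the defaults in setup_default_loggers() ever change.
stat_loggers = [PrometheusStatLogger]
if logger.isEnabledFor(logging.INFO):
    stat_loggers.append(LoggingStatLogger)
stat_loggers.append(MyStatLogger)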

Test Plan

pytest -vs v1/metrics/test_engine_logger_apis.py

Test Result

(base) ray@ip-10-0-251-217:~/default/work/vllm/tests$ pytest -vs v1/metrics/test_engine_logger_apis.py
INFO 07-15 17:16:59 [__init__.py:253] Automatically detected platform cuda.
=================================================================================================== test session starts ===================================================================================================
platform linux -- Python 3.11.11, pytest-8.4.1, pluggy-1.5.0 -- /home/ray/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /home/ray/default/work/vllm
configfile: pyproject.toml
plugins: asyncio-1.0.0, anyio-3.7.1
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item                                                                                                                                                                                                          

v1/metrics/test_engine_logger_apis.py::test_async_llm_add_logger INFO 07-15 17:17:07 [config.py:3485] Downcasting torch.float32 to torch.float16.
INFO 07-15 17:17:07 [config.py:1561] Using max model len 1024
INFO 07-15 17:17:08 [config.py:2380] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 07-15 17:17:08 [cuda.py:103] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 07-15 17:17:08 [core.py:526] Waiting for init message from front-end.
INFO 07-15 17:17:08 [core.py:69] Initializing a V1 LLM engine (v0.1.dev7731+g480beba) with config: model='distilbert/distilgpt2', speculative_config=None, tokenizer='distilbert/distilgpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=distilbert/distilgpt2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
INFO 07-15 17:17:11 [parallel_state.py:1090] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 07-15 17:17:11 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 07-15 17:17:11 [gpu_model_runner.py:1742] Starting to load model distilbert/distilgpt2...
INFO 07-15 17:17:11 [gpu_model_runner.py:1747] Loading model from scratch...
INFO 07-15 17:17:11 [cuda.py:290] Using Flash Attention backend on V1 engine.
INFO 07-15 17:17:11 [weight_utils.py:296] Using model weights format ['*.safetensors']
INFO 07-15 17:17:11 [weight_utils.py:349] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 12.30it/s]

INFO 07-15 17:17:11 [default_loader.py:272] Loading weights took 0.09 seconds
INFO 07-15 17:17:12 [gpu_model_runner.py:1773] Model loading took 0.1547 GiB and 0.619774 seconds
INFO 07-15 17:17:13 [gpu_worker.py:244] Available KV cache memory: 19.34 GiB
INFO 07-15 17:17:13 [kv_cache_utils.py:728] GPU KV cache size: 1,126,912 tokens
INFO 07-15 17:17:13 [kv_cache_utils.py:732] Maximum concurrency for 1,024 tokens per request: 1100.50x
INFO 07-15 17:17:13 [core.py:172] init engine (profile, create kv cache, warmup model) took 1.24 seconds
PASSED

=================================================================================================== 1 passed in 12.93s ====================================================================================================

(Optional) Documentation Update

@github-actions (bot)

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small and essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @eicherseiji, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors how statistical loggers are integrated into the AsyncLLM engine by introducing a dedicated asynchronous API for adding them. This change provides more flexibility in managing loggers and includes built-in validation to prevent common misconfigurations.

Highlights

  • New API for Logger Management: Introduced a new asynchronous method, add_logger, to the AsyncLLM class. This method allows statistical loggers to be added to the engine dynamically after initialization (a usage sketch follows this list).
  • Refactored Logger Initialization: The way statistical loggers are added to AsyncLLM has been changed. Instead of passing stat_loggers directly to the AsyncLLM.from_engine_args constructor, loggers are now added via the new add_logger method post-instantiation.
  • Logger Validation and Duplication Prevention: The add_logger method includes checks to ensure that stat logging is enabled before adding a logger. It also prevents the addition of multiple loggers of the same type, raising a KeyError if a duplicate is detected.
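
For concreteness, a hypothetical usage sketch of the proposed add_logger method as summarized above. The exact signature is assumed here, and this design was later dropped in favor of making custom stat loggers additive (see the discussion below); MyStatLogger is an invented stand-in.

import asyncio

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.v1.engine.async_llm import AsyncLLM
from vllm.v1.metrics.loggers import StatLoggerBase


class MyStatLogger(StatLoggerBase):  # hypothetical custom logger
    def record(self, scheduler_stats, iteration_stats):
        pass

    def log(self):
        pass


async def main():
    engine = AsyncLLM.from_engine_args(
        AsyncEngineArgs(model="distilbert/distilgpt2"))
    # Proposed API (signature assumed): the defaults stay installed, and
    # adding a second logger of the same type would raise KeyError.
    await engine.add_logger(MyStatLogger)


asyncio.run(main())
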
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added the v1 label Jul 14, 2025

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new add_logger API to AsyncLLM, allowing for dynamic addition of stat loggers after engine initialization. The change is well-contained and includes a corresponding update to the tests to use the new API.

I've found a critical issue in the implementation of the duplicate logger check within the new add_logger method, which I've detailed in a specific comment. Addressing this will ensure the new feature works as intended.

@eicherseiji eicherseiji reopened this Jul 15, 2025

@robertgshaw2-redhat (Collaborator)

Why do we want this? I think we should be careful about adding a new API like this before 1.0

@eicherseiji (Contributor, Author)

> Why do we want this? I think we should be careful about adding a new API like this before 1.0

Hi @robertgshaw2-redhat! Thanks for taking a look. I updated the PR body with a more detailed problem statement. Basically my perspective is that it should be possible to add additional loggers without implicitly removing default ones.

Totally understand if this is not a safe change to make just yet, though. Are there docs/discussion somewhere I can reference for more background on our unstable/alpha API designs pre-1.0? Or maybe just extra context on why this is too early?

@eicherseiji eicherseiji marked this pull request as ready for review July 16, 2025 21:42

@mergify (bot) commented Jul 21, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @eicherseiji.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 21, 2025

@ruisearch42 (Collaborator)

@eicherseiji this needs to be revisited considering #21257?

@njhill (Member) commented Jul 21, 2025

> it should be possible to add additional loggers without implicitly removing default ones.

Yes, I think the current interface isn't ideal in this regard: either you default to the built-in ones or you provide a completely separate list.

IMO we should address this via the constructor args, though, instead of a new method. Perhaps even a dedicated flag to toggle whether the default stats loggers should be included.
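
As a hypothetical sketch of that suggestion (the flag name is invented here, and this is not what was ultimately merged):

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.v1.engine.async_llm import AsyncLLM

# MyStatLogger: the hypothetical custom logger from the sketch in the
# PR description. include_default_stat_loggers is an invented name for
# the dedicated toggle being floated here.
engine = AsyncLLM.from_engine_args(
    AsyncEngineArgs(model="distilbert/distilgpt2"),
    stat_loggers=[MyStatLogger],
    include_default_stat_loggers=True,
)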

Signed-off-by: Seiji Eicher <[email protected]>
Signed-off-by: Seiji Eicher <[email protected]>

@eicherseiji (Contributor, Author)

Thanks all for the feedback so far.

Indeed @ruisearch42, #21257 changed things slightly:

  • PrometheusStatLogger is no longer disabled when a list of custom stat loggers is passed (unless a subclass of it is found in the custom list)
  • As a result, only LoggingStatLogger is still disabled when a list of custom stat loggers is provided (see the sketch after this list)
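
A hedged sketch of that post-#21257 selection behavior (an assumed shape, not a verbatim excerpt from that PR):

from vllm.v1.metrics.loggers import PrometheusStatLogger


def resolve_factories(custom_stat_loggers):
    # Start from the user-supplied factories; LoggingStatLogger is
    # intentionally not re-added when custom loggers are present.
    factories = list(custom_stat_loggers)
    # Keep Prometheus unless a custom factory already is (or subclasses)
    # PrometheusStatLogger.
    if not any(isinstance(f, type) and issubclass(f, PrometheusStatLogger)
               for f in factories):
        factories.append(PrometheusStatLogger)
    return factories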

@njhill I think the suggestion makes sense, the drawback is that we pollute the AsyncLLM constructor somewhat by adding another parameter (that's only conditionally applicable when custom stat loggers are used). To avoid adding to AsyncLLM.__init__, what if a new field was added to VllmConfig, e.g. LoggingConfig?

Otherwise, I see add_logger as a less invasive change that implies retaining the default loggers (while still allowing users to remove the defaults by passing stat_loggers directly). Due to #21257, we need to check whether the logger should be shared in a DP deployment. I pushed a commit along these lines to invite further feedback.
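
For concreteness, a hypothetical shape for that LoggingConfig idea (never merged; all names here are illustrative):

from dataclasses import dataclass, field


# Hypothetical config object: VllmConfig would carry logger settings so
# AsyncLLM.__init__ does not grow another parameter.
@dataclass
class LoggingConfig:
    custom_stat_loggers: list = field(default_factory=list)
    include_default_loggers: bool = True

# e.g. vllm_config.logging_config = LoggingConfig(
#     custom_stat_loggers=[MyStatLogger])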

@mergify mergify bot removed the needs-rebase label Jul 22, 2025

@ruisearch42 (Collaborator) left a comment

Thanks. Please make sure @robertgshaw2-redhat @njhill agree with the design change.

@ptovam (Contributor) commented Jul 28, 2025

Hi @eicherseiji @ruisearch42 @njhill @robertgshaw2-redhat,

I’ve opened this PR to extend custom_stat_loggers support to the CLI, enabling dynamic loading of user-defined loggers. Following @simon-mo’s suggestion, I’m currently updating it to align with the vLLM plugin system.

(Initial plugin-based commit: #22456)

I’m wondering how this should integrate with the current implementation and the changes here. Since I’m planning to introduce a dedicated plugin group that dynamically loads StatLoggerBase classes, do you think we should maintain both approaches (custom_stat_loggers and add_logger) alongside the plugin system? Or would it make sense to unify or deprecate some of the existing mechanisms?

NOTE: This PR fixes the limitation of having to choose between default or custom loggers. That’s a great improvement, and the plugin system will need to align with this change as well. Since add_logger() won’t map well to the plugin system, options like a LoggingConfig field on VllmConfig or a dedicated flag to toggle default loggers might be a better fit there.

Appreciate any thoughts, and am happy to adjust based on the direction you think is best.

@eicherseiji (Contributor, Author)

Thanks @ptovam for helping coordinate!

@njhill seems like the constructor/non-method implementation you proposed may be more compatible with the plugin system than the add_logger method I have here. My leaning is toward a new field on VllmConfig, but I'm open to different directions. Lmk thoughts if you get a chance. cc: @robertgshaw2-redhat

@ptovam (Contributor) commented Jul 28, 2025

Thanks! I like the idea of a new config field, as it could also make it easier to pass parameters (port, path, etc.) to custom loggers in the future, similar to how model_loader_extra_config works.

@njhill (Member) commented Aug 8, 2025

Thanks @eicherseiji @ptovam and sorry for taking so long to get back to this.

For an immediate change, WDYT about the following:

  1. Change the existing custom_stats_loggers arg to be additive: the default stats loggers are used regardless of whether additional custom ones are added.
  2. Change disable_log_stats to disable only the built-in/default stats loggers, so that it can be used in conjunction with custom_stats_loggers.

I think that will be minimally disruptive to existing users and address your original concern without requiring API additions.

Actually, there is another open PR, #22227, which addresses the incompatibility of the logging stats logger with DP and in its current state also addresses (1).

Then as a follow-on we can decide how to extend this to enable plugging in custom loggers in the server case, i.e. #22456.
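
A minimal sketch of the additive behavior this proposal (and, ultimately, this PR) settles on, assuming the v1 AsyncLLM API and reusing the hypothetical MyStatLogger from the earlier sketch:

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.v1.engine.async_llm import AsyncLLM

# Custom factories now extend the defaults rather than replacing them:
# the Prometheus/Logging defaults stay active alongside MyStatLogger.
engine = AsyncLLM.from_engine_args(
    AsyncEngineArgs(model="distilbert/distilgpt2"),
    stat_loggers=[MyStatLogger],
)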

@njhill (Member) left a comment

Thanks @eicherseiji, sorry for taking a while to check this again; I'll try to turn around quicker...

Comment on lines 651 to 654

    self.dp_shared_loggers = []
    if enable_default_loggers:
        self.dp_shared_loggers.append(
            PrometheusStatLogger(vllm_config, engine_idxs))

@njhill (Member)

Any reason not to keep this as a single prometheus_logger field?

@njhill (Member)

I think the above self.dp_shared_loggers code can also be removed now? It's not used right?

@eicherseiji (Contributor, Author)

Yup, thanks.

eicherseiji and others added 2 commits September 2, 2025 13:22
Co-authored-by: Nick Hill <[email protected]>
Signed-off-by: Seiji Eicher <[email protected]>

@eicherseiji (Contributor, Author) left a comment

Thanks @njhill for the comments; my apologies for the delayed reply (OOO + urgent work last week). Incorporated the feedback, lmk what you think.

    if enable_default_loggers and logger.isEnabledFor(logging.INFO):
        factories.append(LoggingStatLogger)

    # For Prometheus, need to share the metrics between EngineCores.

@eicherseiji (Contributor, Author)

Point taken. I think it makes sense to keep the change minimal/narrow for now and defer a dp_shared_loggers concept until it's clearly needed; keeping the single prometheus_logger field.


@njhill (Member) left a comment

Thanks @eicherseiji, no worries about the delay. I have one remaining comment (see the thread above).


@eicherseiji (Contributor, Author) left a comment

Updated tests.


@njhill (Member) left a comment

Thanks @eicherseiji

@njhill njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 3, 2025

@njhill (Member) commented Sep 3, 2025

@eicherseiji looks like there is a test failure, presumably test needs updating: https://buildkite.com/vllm/ci/builds/29369#0199110d-18c2-49ab-9820-6c9995a41de8

@eicherseiji (Contributor, Author)

Thanks @njhill! Fixed.

@njhill njhill enabled auto-merge (squash) September 4, 2025 20:08
@njhill njhill changed the title AsyncLLM custom stat logger APIs should extend default logger list [Misc] Have AsyncLLM custom_stat_loggers extend default logger list Sep 4, 2025
@simon-mo simon-mo disabled auto-merge September 4, 2025 21:25
@simon-mo simon-mo merged commit 60b755c into vllm-project:main Sep 4, 2025
38 of 40 checks passed
eicherseiji added a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1


6 participants