### Your current environment

vLLM Version: 0.7.0

### Model Input Dumps

_No response_

### 🐛 Describe the bug
Issue: V1 engine ignores custom logits processors and does not implement min-p sampling
Problem
- Custom logits processors: In the new V1 engine, specifying a `logits_processor` in `SamplingParams` for `LLM.generate()` has no effect. The code in `gpu_model_runner.py` never passes any sampling metadata into `self.model.compute_logits(...)`, so the logits processor is silently ignored.
- Min-p: Similarly, `min_p` (a sampling parameter supported in V0, akin to `top_k` and `top_p`) is not applied at all in V1. The `sampler.py` for the new engine appears to skip it entirely, so it never factors into the final token selection.
If these features are not yet supported in V1, consider at least raising a warning or error to avoid silent failures.
### Possible Fix for Logits Processor Issue

- Create a new data class to hold relevant metadata for `self.model.compute_logits(...)`.
  - Could simply hold request IDs and request states (`CachedRequestState`).
- Collate metadata inside `GPUModelRunner.execute_model(...)`.
- Patch `LogitsProcessor.forward()` inside `logits_processor.py` to handle the new V1 metadata class alongside the old V0 `SamplingMetadata` class.
- Define `LogitsProcessor._apply_logits_processor_v1(...)` (or something similar) to properly handle the preprocessed `hidden_states` tensor in the V1 model runner, as opposed to reusing the V0 version.
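The metadata container and dispatch described above could look roughly like the following pure-Python sketch. `V1LogitsMetadata`, `apply_logits_processors_v1`, and the per-request `logits_processors` attribute are illustrative names for this proposal, not vLLM's actual API; the real code would operate on torch tensors.

```python
from dataclasses import dataclass

# Hypothetical metadata class for self.model.compute_logits(...) in V1,
# collated inside GPUModelRunner.execute_model(...).
@dataclass
class V1LogitsMetadata:
    req_ids: list    # request ids, in batch order
    req_states: dict  # req_id -> CachedRequestState-like object

def apply_logits_processors_v1(logits_rows, metadata, output_token_ids):
    """Apply each request's custom logits processors to its logits row.

    logits_rows[i] is the logits row for metadata.req_ids[i];
    output_token_ids[i] is that request's generated token ids so far.
    """
    for i, req_id in enumerate(metadata.req_ids):
        state = metadata.req_states[req_id]
        for proc in getattr(state, "logits_processors", None) or []:
            # V0-style processor signature: (output_token_ids, logits) -> logits
            logits_rows[i] = proc(output_token_ids[i], logits_rows[i])
    return logits_rows
```

This keeps the V0 processor call signature intact while routing per-request state through the new V1 batch layout.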
### Possible Fix for Min-p Issue

- Add a `min_p` attribute to `InputBatch` in `gpu_input_batch.py`.
- Add a `min_p` field to the `SamplingMetadata` data class in `metadata.py`.
- Modify the forward function of `Sampler` in `sampler.py` to apply min-p filtering.
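For reference, min-p filtering keeps only tokens whose probability is at least `min_p` times the probability of the most likely token. A minimal pure-Python sketch of the step `Sampler.forward()` would need (the actual change would be vectorized over a batch of torch tensors):

```python
def apply_min_p(probs, min_p):
    """Zero out tokens below min_p * max(probs), then renormalize.

    probs: one row of token probabilities (list of floats summing to 1).
    min_p: threshold scale in [0, 1]; 0 disables filtering, matching
    the V0 behavior where min_p=0.0 is the default.
    """
    if min_p <= 0.0:
        return probs
    threshold = min_p * max(probs)  # scaled by the top token's probability
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)               # > 0, since the argmax always survives
    return [p / total for p in kept]
```

Because the threshold scales with the top probability, min-p prunes aggressively when the model is confident and permissively when the distribution is flat, which is what distinguishes it from a fixed `top_p` cutoff.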
### Before submitting a new issue...

- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.