Fix a bug in tying OPT embeddings #1

WoosukKwon · 2023-02-25T00:27:11Z

This PR fixes a bug in supporting OPT-350m/OPT-6.7b/OPT-13b and OPT-IML models.

The bug happened because our model code didn't include some methods that were required to tie the input and output embeddings.
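The PR text does not name the missing methods, so the following is only a minimal sketch of what tying input and output embeddings means. The classes `Embedding`, `LMHead`, and `ToyCausalLM` are hypothetical stand-ins written in plain Python; the accessor names mirror the `get_input_embeddings` / `get_output_embeddings` / `tie_weights` convention from Hugging Face Transformers, which is an assumption about what this model code needed.

```python
class Embedding:
    """Toy input embedding table: one weight row per token id."""

    def __init__(self, vocab_size, hidden_size):
        self.weight = [[0.0] * hidden_size for _ in range(vocab_size)]


class LMHead:
    """Toy output projection with the same shape as the embedding table."""

    def __init__(self, vocab_size, hidden_size):
        self.weight = [[1.0] * hidden_size for _ in range(vocab_size)]


class ToyCausalLM:
    """Hypothetical model exposing the accessors that weight tying relies on."""

    def __init__(self, vocab_size=4, hidden_size=2):
        self.embed_tokens = Embedding(vocab_size, hidden_size)
        self.lm_head = LMHead(vocab_size, hidden_size)

    def get_input_embeddings(self):
        return self.embed_tokens

    def get_output_embeddings(self):
        return self.lm_head

    def tie_weights(self):
        # Point the output projection at the very same weight object,
        # so both layers read and update one shared table.
        self.get_output_embeddings().weight = self.get_input_embeddings().weight


model = ToyCausalLM()
model.tie_weights()
model.embed_tokens.weight[0][0] = 0.5  # a change on the input side...
print(model.lm_head.weight[0][0])      # ...is visible on the output side
```

If either accessor is missing, a generic tying routine has nothing to call, which is consistent with the failure mode the description reports.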

BA-78554: Jurassic 2.5
* worked on jurassic2.5 configuration file, updated jurassic2_5 modeling file to support alternating experts/attn layers
* finished working the forward pass of jurassic3.py
* jurassic_3 modeling file works, uses dummy weights initialized by "dummy" flag. Tokenizer raises issues, for now copying the mixtral tokenizer
* changed default tokenizer vocab values, loading of custom .pt weight files works.
* removed notebook
* merging master to jurassic-2.5 to reset head
* Merge branch 'master' into jurassic-2.5
* align to master

Approved-by: Tomer Asida
Approved-by: Mor Zusman

Bug #1 (CRITICAL): Add missing begin() and stage() methods to KVWriteRouter
- Flash attention backend calls router.begin() and router.stage()
- KVWriteRouter only had write() and commit() methods
- Added begin() to store slot_mapping and initialize shadow buffer
- Added stage() to extract per-timestep slot and stage KV pairs
- Without these, no tokens were being staged → 0% acceptance rate

Bug #2 (MODERATE): Fix bonus token counting in accepted_lens
- valid_sampled_token_ids includes [accepted_draft_tokens..., bonus_token]
- Previous: len([bonus]) = 1, incorrectly counted as 1 accepted draft token
- Fixed: Use max(0, len(seq) - 1) to exclude bonus token from count
- Now correctly reports 0 accepted when only bonus token is present

Files modified:
- vllm/v1/kv_cache/write_router.py: Added begin() and stage() methods
- vllm/v1/worker/gpu_model_runner.py: Fixed accepted_lens calculation
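The bonus-token fix in this commit message comes down to one expression, `max(0, len(seq) - 1)`. A minimal sketch of that counting rule follows; the helper name `accepted_draft_lens` and the sample token ids are made up for illustration and are not taken from `gpu_model_runner.py`.

```python
def accepted_draft_lens(valid_sampled_token_ids):
    """Count accepted draft tokens per sequence.

    Each inner list is assumed to hold the accepted draft tokens followed by
    one bonus token, so the bonus token is excluded from the count. The
    max(0, ...) guard keeps an empty list from producing -1.
    """
    return [max(0, len(seq) - 1) for seq in valid_sampled_token_ids]


# Three hypothetical sequences: three drafts plus bonus, bonus only, nothing sampled.
sampled = [[11, 12, 13, 99], [99], []]
print(accepted_draft_lens(sampled))  # [3, 0, 0]
```

With the earlier `len(seq)` counting, the bonus-only sequence would have been reported as one accepted draft token.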

Bug #1: EAGLE tree proposal returned zeros for draft_logprobs
- Root cause: When using topk for tree branching, code set draft_logp_list=None, then created zeros tensor as fallback (lines 850-851)
- Fix: Compute actual log-probs from logits using log_softmax + gather
- Applied at 2 locations: root level (lines 698-704) and tree levels (lines 839-846)

Bug #2: Added diagnostic logging in rejection sampler
- Log draft_p (nonzero) min/med/max to detect zeros
- Log p_target min/med/max to detect degenerate softmax
- Helps identify if target logits are masked/filtered before sampling

Expected results after fix:
- draft_logp: -3.2/-1.6/-0.0 (real log-probs, all ≤ 0) instead of 0/0/0
- p_target: 1e-6/1e-3/0.7 (realistic distribution) instead of 1/1/1
- Acceptance rate: 30-70% instead of 0%

Files changed:
- vllm/v1/spec_decode/eagle.py: Fix draft_logp computation
- vllm/v1/sample/rejection_sampler.py: Add sanity logging
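The "log_softmax + gather" fix named in this commit message can be illustrated without tensors. The sketch below uses plain Python lists in place of the tensor operations the commit refers to; `log_softmax`, `draft_logprobs`, and the sample logits are hypothetical and do not reproduce the code in `eagle.py`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def draft_logprobs(logits_rows, token_ids):
    """Pick the log-probability of each chosen token (the 'gather' step)."""
    return [log_softmax(row)[tok] for row, tok in zip(logits_rows, token_ids)]


# Two hypothetical draft positions over a three-token vocabulary.
logits_rows = [[2.0, 0.0, -1.0], [0.0, 0.0, 0.0]]
token_ids = [0, 2]
logps = draft_logprobs(logits_rows, token_ids)
print(logps)
```

Every value produced this way is at most 0, which matches the commit's expectation that real log-probs are "all ≤ 0". An all-zero fallback instead claims probability 1 for every draft token.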

CRITICAL FIX: tau_d was reading draft_temperature (0.05) instead of target temperature from sampling_metadata (1.0).

This caused:
- tau_q = 0.05 + 0.3 = 0.35 (before)
- Logit gap = 10/0.35 = 28.6 → exp(-28.6) ≈ 0 (underflow!)
- q collapses to 0.98-1.0

After fix:
- tau_d = 1.0 (from sampling_metadata.temperature)
- tau_q = 1.0 + 0.3 = 1.3
- Logit gap = 10/1.3 = 7.7 → exp(-7.7) = 0.00045 (survives!)
- q should be in [0.5, 0.8] range

Changes:
- propose(): Store sampling_metadata as self._current_sampling_metadata
- _sample_draft_tokens(): Read tau_d from sampling_metadata, not opt_config
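The arithmetic in this commit message can be checked directly. The sketch below recomputes both cases under the stated numbers (a raw logit gap of 10 and a fixed 0.3 added to tau_d); the function name `tail_weight` and the parameter name `offset` are hypothetical labels, not identifiers from the commit.

```python
import math


def tail_weight(logit_gap, tau_d, offset=0.3):
    """Relative weight exp(-gap / tau_q) of a token `logit_gap` below the top.

    tau_q = tau_d + offset follows the formula quoted in the commit message;
    the name `offset` is an illustrative label for the 0.3 term.
    """
    tau_q = tau_d + offset
    return math.exp(-logit_gap / tau_q)


before = tail_weight(10.0, tau_d=0.05)  # tau_d wrongly read from draft_temperature
after = tail_weight(10.0, tau_d=1.0)    # tau_d read from the target temperature

print(f"before: {before:.3e}")
print(f"after:  {after:.3e}")
```

The "before" weight is on the order of 1e-13, so the runner-up token effectively vanishes from q. The "after" weight is about 4.5e-4, in line with the 0.00045 quoted above.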

Enhanced documentation for plugin patches:

1. Patch vllm-project#1 (Usage Tracking Helper):
   - Clarified as OPTIONAL (has fallback in harmony streaming patch)
   - Changed from "REQUIRED" to "OPTIONAL"
   - Explained fallback mechanism in patched_stream_method.py
   - Marked as upstreamable (minor utility addition)
2. Patch vllm-project#3 (Harmony Token-by-Token Streaming):
   - Added detailed speculative decoding context
   - Explained Eagle draft model generates 5-10 tokens per step
   - Documented specific failures with batch processing:
     * Tool calling broken
     * Multi-channel content lost
     * Token truncation during channel transitions
   - Added before/after code examples
   - Linked to PR vllm-project#26291 (Eagle3 Multi-Channel Streaming Fix)
   - Documented upstream status and removal plan

Key insight: This patch exists because Eagle speculative decoding returns multiple tokens per step, and upstream's batch processing can't handle per-token channel switching.

Signed-off-by: Pradyun Ramadorai <[email protected]>

Fix OPT errors

44735b4

WoosukKwon merged commit cbf8779 into main Feb 25, 2023

WoosukKwon deleted the fix-opt branch February 25, 2023 00:29

murongweibo mentioned this pull request Jul 11, 2023

NCCL Error 5: invalid usage #427

Closed

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

CZT0 referenced this pull request in semedia-tech/vllm Sep 11, 2023

#1 Test deploying vllm

cc4f1ce

orangetin referenced this pull request in togethercomputer/vllm-ttgi Sep 14, 2023

Merge pull request #1 from winglian/longchat-args

b9012fb

add rope scaling as a cli arg so openai server can load rope scaled models

xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 18, 2023

Add function invoke call for underlying models (vllm-project#1)

9895bbd

bigPYJ1151 added a commit to bigPYJ1151/vllm that referenced this pull request Oct 30, 2023

Merge pull request vllm-project#1 from bigPYJ1151/fix_ans

b5e7066

Fix key cache block shape.

l1cacheDell pushed a commit to CaspianFang/vllm that referenced this pull request Nov 15, 2023

blora LlaMa support vllm-project#1

424df61

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang referenced this pull request in hongxiayang/vllm Feb 13, 2024

Fix a bug in tying OPT embeddings (#1)

2cb721d

kvikk mentioned this pull request Feb 15, 2024

ERROR: Could not build wheels for vllm, which is required to install pyproject.toml-based projects #2735

Closed

ilya-lavrenov referenced this pull request in ilya-lavrenov/vllm Feb 19, 2024

Merge pull request #1 from ilya-lavrenov/cpu-works

e3d65e0

Deterministic OpenVINO inference

daniel-geon-park added a commit to gmlwns2000/vllm-timber that referenced this pull request Apr 15, 2024

Merge pull request vllm-project#1 from DeepAuto-AI/geon-dev

d9d746e

merge code

afeldman-nm mentioned this pull request Apr 30, 2024

Adding support for encoder-decoder models, like T5 or BART #187

Closed

dlopes78 mentioned this pull request May 8, 2024

[Bug]: VLLM + tritonserver #4695

Closed

fmmoret mentioned this pull request May 8, 2024

[Bug]: Chunked prefill returning gibberish in some cases. #4697

Closed

Bellk17 added a commit to Bellk17/vllm that referenced this pull request May 10, 2024

Merge pull request vllm-project#1 from Bellk17/main

b36d574

Triton compilation fix

yuhuixu1993 mentioned this pull request Jun 2, 2024

[Bug]: loading squeezellm model #5190

Closed

afeldman-nm mentioned this pull request Jun 3, 2024

[Bug]: VLLM_ATTENTION_BACKEND set to ROCM_FLASH only in GHA environment, overriding automatic backend selection; this breaks other kernel unit tests. #5208

Closed

ykim362 referenced this pull request in ykim362/vllm Jun 17, 2024

Wenxh/fp8 on a100 v5 (#1)

aca4a33

Group Gemm Version

xiejibing mentioned this pull request Jun 24, 2024

[Bug]: vLLM 0.4.2 8xH100 init failed #5785

Closed

llmpros mentioned this pull request Jun 27, 2024

[Frontend]: Support base64 embedding #5935

Merged

Juelianqvq mentioned this pull request Jul 3, 2024

[Bug]: Flashinfer stuck with CUDA Graph #6086

Closed

oliver-li mentioned this pull request Jul 5, 2024

[Bug]: NCCL hangs and causes timeout #5484

Closed

This was referenced Jul 5, 2024

Support W4A8 quantization for vllm #5218

Merged

[Bug]: call for stack trace for "Watchdog caught collective operation timeout" #6042

Closed

zerosurplus mentioned this pull request Jun 16, 2025

[Bug]: torch.distributed.DistNetworkError: The client socket has timed out after 600000ms while trying to connect to (172.17.0.9, 46229). #19670

Open

1 task

xiaocode337317439 mentioned this pull request Jun 27, 2025

[Bug]:RuntimeError: CUDA error: an illegal memory access was encountered #20170

Open

1 task

Chris113113 mentioned this pull request Jul 10, 2025

[Bug]: [V1][gpu_model_runner.py] CUDA memory error #19415

Open

1 task

shrijayan mentioned this pull request Jul 12, 2025

vLLM hangs after 10 minutes without any error message #1492

Closed

aarondou mentioned this pull request Jul 16, 2025

[RFC]: Neuron Support for V1 Engine #21082

Closed

1 task

tyxiong23 mentioned this pull request Jul 30, 2025

[Bug]: GLM-4.1V-Thinking ValueError #21811

Closed

1 task

xiaomofang mentioned this pull request Jul 31, 2025

[Bug]: There is an issue with speculative inference in Eagle mode, where the context length of vLLM inference is constrained by the draft model. #21986

Open

1 task

devops724 mentioned this pull request Aug 3, 2025

[Bug]: vLLM engine crashes then restarts and loads the model on sleep if a chat request is made #15483

Open

1 task

fernandaspets mentioned this pull request Aug 8, 2025

[Bug]: --tensor-parallel-size 2 seems broken for Blackwell 6000 pro since version 10 #22479

Open

crischeng mentioned this pull request Aug 12, 2025

[Bug]: CUDA error during nsys profile : unspecified launch failure #22746

Closed

1 task

bbartels pushed a commit to bbartels/vllm that referenced this pull request Aug 14, 2025

Merge pull request vllm-project#1 from RichardoMrMu/feat-trace-v1-aft…

a7414f7

…ermerge feat:trace v1

JeffreyWong20 mentioned this pull request Aug 19, 2025

[Bug]: [TPU] profiling_tpu/profiling.py example crashed when runs on vllm_tpu docker #23194

Closed

1 task

ruisearch42 mentioned this pull request Aug 22, 2025

[Bug]: VLLM_ALL2ALL_BACKEND=naive hangs/crashes on multi nodes when serving DeepSeekV3 #23448

Open

1 task

Tar-ive mentioned this pull request Aug 24, 2025

feat: Add TPU v6e architecture-adaptive attention backend #23507

Open

16 tasks

shaamil101-etched mentioned this pull request Aug 25, 2025

[Bug]: vLLM server timeout due to multiprocessing communication error #23582

Open

1 task

ZJY0516 mentioned this pull request Aug 31, 2025

[Bug]: CUDA error when serving MiniCPM-V model #23954

Closed

wyn1015 mentioned this pull request Sep 19, 2025

[Bug]: assortment of warnings / errors coming out of vllm basic python inference script #18634

Open

1 task

LinWang-avivia mentioned this pull request Sep 24, 2025

[Bug]: Sequence Parallelism and Async TP disabled by default #25277

Open

4 tasks

zhanghb55 mentioned this pull request Sep 25, 2025

[Bug]: Pipeline parallel (pp>1) crashes with CUDA illegal memory access #25650

Open

1 task

This was referenced Oct 7, 2025

[Performance]: Use int over list[int] as output_tokens to reduce GC overhead #26369

Open

[Core] Bookkeeping optimization: Batchify updates 1D numpy arrays (e.g. num_tokens, num_tokens_no_spec) #25801

Open

tina0852 mentioned this pull request Oct 11, 2025

[Bug]: Since version 0.9.2 comes with nccl built-in, using PCIE causes sys errors. How to disable nccl in vllm for versions after 0.9.2? #26607

Open

1 task

Michel-debug mentioned this pull request Oct 23, 2025

[Bug]: qwen3-vl-2b after ms-swift fine-tuning lance errors #27405

Open

1 task

Moondon69 mentioned this pull request Oct 23, 2025

[Bug]: vLLM crashes with SIGABRT on Intel Arc B-series (Battlemage) GPUs during model inspection #27408

Open

1 task

Flink-ddd mentioned this pull request Oct 23, 2025

Fix(llm): Abort orphaned requests when llm.chat() batch fails Fixes #26081 #27420

Draft

whwangovo mentioned this pull request Oct 23, 2025

[Bug]: vLLM (TP=8) on 235B model triggers "CUDA error: unspecified launch failure" and persistent "ERR!" state in nvidia-smi #27430

Open

1 task
