
Commit 2e88bcc

Merge remote-tracking branch 'cboss-vllm/main' into cboss/zigzag-vllm-rebase
2 parents: 73e13df + 04ad0dc

62 files changed: +609 additions, -9711 deletions (large commit; only a subset of the file diffs is shown below)


.buildkite/scripts/hardware_ci/run-cpu-test.sh

Lines changed: 0 additions & 1 deletion
@@ -66,7 +66,6 @@ function cpu_tests() {
 
   pytest -x -v -s tests/models/language/pooling -m cpu_model
   pytest -x -v -s tests/models/multimodal/generation \
-    --ignore=tests/models/multimodal/generation/test_mllama.py \
     --ignore=tests/models/multimodal/generation/test_pixtral.py \
     -m cpu_model"

.buildkite/test-pipeline.yaml

Lines changed: 0 additions & 9 deletions
@@ -549,15 +549,6 @@ steps:
   commands: # LMEval+Transcription WER check
     - pytest -s entrypoints/openai/correctness/
 
-- label: Encoder Decoder tests # 12min
-  timeout_in_minutes: 20
-  mirror_hardwares: [amdexperimental]
-  source_file_dependencies:
-    - vllm/
-    - tests/encoder_decoder
-  commands:
-    - pytest -v -s encoder_decoder
-
 - label: OpenAI-Compatible Tool Use # 23 min
   timeout_in_minutes: 35
   mirror_hardwares: [amdexperimental]

benchmarks/kernels/benchmark_moe.py

Lines changed: 1 addition & 1 deletion
@@ -560,7 +560,7 @@ def save_configs(
     filename = os.path.join(save_dir, filename)
     print(f"Writing best config to {filename}...")
     with open(filename, "w") as f:
-        json.dump(configs, f, indent=4)
+        json.dump({"triton_version": triton.__version__, **configs}, f, indent=4)
         f.write("\n")
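
With this change, each tuned MoE config file now leads with the Triton version it was generated under. Below is a minimal sketch of how a consumer of such a file might use that field; the file name is a made-up placeholder and the mismatch handling is illustrative, not part of this commit.

import json

import triton

# Hypothetical file name: real files written by save_configs() encode the MoE
# shape, dtype, and device name, which are not shown in this commit.
path = "E=8,N=14336,device_name=EXAMPLE_GPU.json"

with open(path) as f:
    data = json.load(f)

# New leading key added by this commit: the Triton version used for tuning.
tuned_with = data.pop("triton_version", None)
if tuned_with != triton.__version__:
    print(f"Config tuned with Triton {tuned_with}, but Triton "
          f"{triton.__version__} is installed; results may differ.")

# The remaining keys are the per-batch-size kernel configs, unchanged.
for batch_size, cfg in data.items():
    print(batch_size, cfg)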

docker/Dockerfile.nightly_torch

Lines changed: 2 additions & 2 deletions
@@ -246,15 +246,15 @@ RUN pip install setuptools==75.6.0 packaging==23.2 ninja==1.11.1.3 build==1.2.2.
 
 
 # build flashinfer for torch nightly from source around 10 mins
-# release version: v0.2.2.post1
+# release version: v0.3.1
 # todo(elainewy): cache flashinfer build result for faster build
 ENV CCACHE_DIR=/root/.cache/ccache
 RUN --mount=type=cache,target=/root/.cache/ccache \
     --mount=type=cache,target=/root/.cache/uv \
     echo "git clone flashinfer..." \
     && git clone --recursive https://github.com/flashinfer-ai/flashinfer.git \
     && cd flashinfer \
-    && git checkout v0.2.2.post1 \
+    && git checkout v0.3.1 \
     && git submodule update --init --recursive \
     && echo "finish git clone flashinfer..." \
     && rm -rf build \

docs/contributing/model/multimodal.md

Lines changed: 0 additions & 1 deletion
@@ -840,7 +840,6 @@ Some HF processors directly insert feature tokens without replacing anything in
 Examples:
 
 - BLIP-2 (insert at start of prompt): <gh-file:vllm/model_executor/models/blip2.py>
-- Florence2 (insert at start of prompt): <gh-file:vllm/model_executor/models/florence2.py>
 - Molmo (insert after `<|endoftext|>` token): <gh-file:vllm/model_executor/models/molmo.py>
 
 ### Handling prompt updates unrelated to multi-modal data
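
For orientation, the surviving examples above (BLIP-2, Molmo) express this pattern by returning a prompt insertion rather than a replacement from the model's prompt-update hook. A rough, unverified sketch follows; `PromptInsertion` and `PromptIndexTargets` are assumed to live in `vllm.multimodal.processing`, and the token id and count are hypothetical placeholders.

# Rough sketch only: the import path, class names, and accepted argument types
# are assumptions based on the examples listed above, not verified here.
from vllm.multimodal.processing import PromptIndexTargets, PromptInsertion

IMAGE_TOKEN_ID = 32000     # hypothetical placeholder token id
NUM_IMAGE_TOKENS = 32      # hypothetical number of feature tokens per image

def image_prompt_insertion() -> PromptInsertion:
    # Insert the image feature tokens at the start of the prompt (BLIP-2 style)
    # instead of replacing an existing placeholder token in the prompt.
    return PromptInsertion(
        modality="image",
        target=PromptIndexTargets.start(),
        insertion=[IMAGE_TOKEN_ID] * NUM_IMAGE_TOKENS,
    )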

docs/models/supported_models.md

Lines changed: 0 additions & 8 deletions
@@ -331,8 +331,6 @@ th {
 | `BailingMoeV2ForCausalLM` | Ling | `inclusionAI/Ling-mini-2.0`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ | ✅︎ |
 | `BloomForCausalLM` | BLOOM, BLOOMZ, BLOOMChat | `bigscience/bloom`, `bigscience/bloomz`, etc. | | ✅︎ | ✅︎ |
-| `BartForConditionalGeneration` | BART | `facebook/bart-base`, `facebook/bart-large-cnn`, etc. | | | |
-| `MBartForConditionalGeneration` | mBART | `facebook/mbart-large-en-ro`, `facebook/mbart-large-50`, etc. | | | |
 | `ChatGLMModel`, `ChatGLMForConditionalGeneration` | ChatGLM | `zai-org/chatglm2-6b`, `zai-org/chatglm3-6b`, `ShieldLM-6B-chatglm3`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `CohereForCausalLM`, `Cohere2ForCausalLM` | Command-R, Command-A | `CohereLabs/c4ai-command-r-v01`, `CohereLabs/c4ai-command-r7b-12-2024`, `CohereLabs/c4ai-command-a-03-2025`, `CohereLabs/command-a-reasoning-08-2025`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `DbrxForCausalLM` | DBRX | `databricks/dbrx-base`, `databricks/dbrx-instruct`, etc. | | ✅︎ | ✅︎ |
@@ -426,9 +424,6 @@ Some models are supported only via the [Transformers backend](#transformers). Th
 !!! note
     Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
 
-!!! note
-    Some mBART models' config files do not have an `architecture` defined. Therefore, you need to use `--hf-overrides '{"architectures": ["MBartForConditionalGeneration"]}'` to explicitly specify the use of the `MBartForConditionalGeneration` architecture.
-
 ### Pooling Models
 
 See [this page](./pooling_models.md) for more information on how to use pooling models.
@@ -625,9 +620,7 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
 | `ChameleonForConditionalGeneration` | Chameleon | T + I | `facebook/chameleon-7b`, etc. | | ✅︎ | ✅︎ |
 | `Cohere2VisionForConditionalGeneration` | Command A Vision | T + I<sup>+</sup> | `CohereLabs/command-a-vision-07-2025`, etc. | | ✅︎ | ✅︎ |
 | `DeepseekVLV2ForCausalLM`<sup>^</sup> | DeepSeek-VL2 | T + I<sup>+</sup> | `deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2`, etc. | | ✅︎ | ✅︎ |
-| `DonutForConditionalGeneration`<sup>^</sup> | Donut | T + I | `ByteDance/Dolphin`, `naver-clova-ix/donut-base-finetuned-docvqa`, etc. | | | |
 | `Ernie4_5_VLMoeForConditionalGeneration` | Ernie4.5-VL | T + I<sup>+</sup>/ V<sup>+</sup> | `baidu/ERNIE-4.5-VL-28B-A3B-PT`, `baidu/ERNIE-4.5-VL-424B-A47B-PT` | | ✅︎ | ✅︎ |
-| `Florence2ForConditionalGeneration` | Florence-2 | T + I | `microsoft/Florence-2-base`, `microsoft/Florence-2-large`, etc. | | | |
 | `FuyuForCausalLM` | Fuyu | T + I | `adept/fuyu-8b`, etc. | | ✅︎ | ✅︎ |
 | `Gemma3ForConditionalGeneration` | Gemma 3 | T + I<sup>+</sup> | `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc. | ✅︎ | ✅︎ | ⚠️ |
 | `Gemma3nForConditionalGeneration` | Gemma 3n | T + I + A | `google/gemma-3n-E2B-it`, `google/gemma-3n-E4B-it`, etc. | | | ✅︎ |
@@ -654,7 +647,6 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
 | `MiniCPMV` | MiniCPM-V | T + I<sup>E+</sup> + V<sup>E+</sup> | `openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, `openbmb/MiniCPM-V-4`, `openbmb/MiniCPM-V-4_5`, etc. | ✅︎ | | ✅︎ |
 | `MiniMaxVL01ForConditionalGeneration` | MiniMax-VL | T + I<sup>E+</sup> | `MiniMaxAI/MiniMax-VL-01`, etc. | | ✅︎ | ✅︎ |
 | `Mistral3ForConditionalGeneration` | Mistral3 (HF Transformers) | T + I<sup>+</sup> | `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `MllamaForConditionalGeneration` | Llama 3.2 | T + I<sup>+</sup> | `meta-llama/Llama-3.2-90B-Vision-Instruct`, `meta-llama/Llama-3.2-11B-Vision`, etc. | | | |
 | `MolmoForCausalLM` | Molmo | T + I<sup>+</sup> | `allenai/Molmo-7B-D-0924`, `allenai/Molmo-7B-O-0924`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `NVLM_D_Model` | NVLM-D 1.0 | T + I<sup>+</sup> | `nvidia/NVLM-D-72B`, etc. | | ✅︎ | ✅︎ |
 | `Ovis` | Ovis2, Ovis1.6 | T + I<sup>+</sup> | `AIDC-AI/Ovis2-1B`, `AIDC-AI/Ovis1.6-Llama3.2-3B`, etc. | | ✅︎ | ✅︎ |

docs/usage/v1_guide.md

Lines changed: 1 addition & 1 deletion
@@ -120,7 +120,7 @@ Please note that prefix caching is not yet supported for any of the above models
 
 Whisper is supported. Other models requiring cross-attention between separate
 encoder and decoder (e.g., `BartForConditionalGeneration`,
-`MllamaForConditionalGeneration`) are not yet supported.
+`MllamaForConditionalGeneration`) are not supported.
 
 ### Features
 