
Conversation

@liuzijing2014 (Collaborator) commented Jun 12, 2025

Purpose

Allow vLLM to run the text-only Llama4 model, i.e. Llama4ForCausalLM.

Test Plan

Run vLLM with a text-only Llama4 Maverick checkpoint (vendor-internal).
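
For reference, a minimal sketch of how such a checkpoint could be exercised offline once the architecture is registered; the model path is a placeholder (the vendor-internal checkpoint is not public), and the parallelism setting simply mirrors the 8-rank run in the logs below.

# Hedged sketch: offline generation with a text-only Llama4 checkpoint.
# The model path and tensor_parallel_size are placeholders, not the
# internal Maverick checkpoint referenced in the test plan.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/llama4-text-only-checkpoint",  # placeholder path
    tensor_parallel_size=8,                        # matches the 8-rank run below
)
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)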

Test Result

Model successfully recognized and loaded:

Loading safetensors checkpoint shards:   0% Completed | 0/84 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  10% Completed | 8/84 [00:00<00:00, 77.14it/s]
Loading safetensors checkpoint shards:  19% Completed | 16/84 [00:00<00:01, 48.57it/s]
Loading safetensors checkpoint shards:  29% Completed | 24/84 [00:00<00:01, 53.53it/s]
Loading safetensors checkpoint shards:  36% Completed | 30/84 [00:00<00:00, 55.29it/s]
Loading safetensors checkpoint shards:  43% Completed | 36/84 [00:00<00:00, 49.49it/s]
Loading safetensors checkpoint shards:  50% Completed | 42/84 [00:00<00:01, 39.83it/s]
Loading safetensors checkpoint shards:  67% Completed | 56/84 [00:01<00:00, 56.72it/s]
Loading safetensors checkpoint shards:  77% Completed | 65/84 [00:01<00:00, 57.88it/s]
Loading safetensors checkpoint shards:  88% Completed | 74/84 [00:01<00:00, 64.70it/s]
Loading safetensors checkpoint shards:  96% Completed | 81/84 [00:01<00:00, 56.54it/s]
Loading safetensors checkpoint shards: 100% Completed | 84/84 [00:01<00:00, 56.55it/s]
(VllmWorker rank=0 pid=604590) 
(VllmWorker rank=7 pid=604598) INFO 06-12 14:37:50 [default_loader.py:272] Loading weights took 45.72 seconds
(VllmWorker rank=0 pid=604590) INFO 06-12 14:37:50 [default_loader.py:272] Loading weights took 45.64 seconds
(VllmWorker rank=2 pid=604593) INFO 06-12 14:37:50 [default_loader.py:272] Loading weights took 45.65 seconds
(VllmWorker rank=6 pid=604597) INFO 06-12 14:37:50 [default_loader.py:272] Loading weights took 45.62 seconds
(VllmWorker rank=4 pid=604595) INFO 06-12 14:37:50 [default_loader.py:272] Loading weights took 45.60 seconds
(VllmWorker rank=5 pid=604596) INFO 06-12 14:37:50 [default_loader.py:272] Loading weights took 45.70 seconds
(VllmWorker rank=1 pid=604591) INFO 06-12 14:37:50 [default_loader.py:272] Loading weights took 45.71 seconds
(VllmWorker rank=3 pid=604594) INFO 06-12 14:37:50 [default_loader.py:272] Loading weights took 45.71 seconds
(VllmWorker rank=3 pid=604594) INFO 06-12 14:37:51 [gpu_model_runner.py:1615] Model loading took 48.8683 GiB and 46.056898 seconds
(VllmWorker rank=0 pid=604590) INFO 06-12 14:37:51 [gpu_model_runner.py:1615] Model loading took 48.8683 GiB and 45.984370 seconds
(VllmWorker rank=4 pid=604595) INFO 06-12 14:37:51 [gpu_model_runner.py:1615] Model loading took 48.8683 GiB and 45.931201 seconds
(VllmWorker rank=5 pid=604596) INFO 06-12 14:37:51 [gpu_model_runner.py:1615] Model loading took 48.8683 GiB and 46.051216 seconds
(VllmWorker rank=1 pid=604591) INFO 06-12 14:37:51 [gpu_model_runner.py:1615] Model loading took 48.8683 GiB and 46.056561 seconds
(VllmWorker rank=6 pid=604597) INFO 06-12 14:37:51 [gpu_model_runner.py:1615] Model loading took 48.8683 GiB and 45.960544 seconds
(VllmWorker rank=2 pid=604593) INFO 06-12 14:37:51 [gpu_model_runner.py:1615] Model loading took 48.8683 GiB and 46.010001 seconds
(VllmWorker rank=7 pid=604598) INFO 06-12 14:37:51 [gpu_model_runner.py:1615] Model loading took 48.8683 GiB and 46.070654 seconds
Evaluation results on task gsm8k.8_shot.1_gen: em: 0.957500 | f1: 0.957500 | em_maj1@1: 0.957500 | f1_maj1@1: 0.957500

Signed-off-by: Zijing Liu <[email protected]>

@github-actions commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@yeqcharlotte (Collaborator) commented

We deleted this from the registry because it used to break a bunch of CI that @ywang96 had to hack around to let it through. In particular, could you check the workarounds in #16113 and run pre-commit locally to make sure they are good?

@liuzijing2014 (Collaborator, Author) commented


pre-commit runs fine locally:

yapf.....................................................................Passed
ruff.....................................................................Passed
ruff-format..........................................(no files to check)Skipped
typos....................................................................Passed
isort....................................................................Passed
clang-format.........................................(no files to check)Skipped
PyMarkdown...........................................(no files to check)Skipped
Lint GitHub Actions workflow files...................(no files to check)Skipped
pip-compile..........................................(no files to check)Skipped
Run mypy for local Python installation...................................Passed
Lint shell scripts...................................(no files to check)Skipped
Lint PNG exports from excalidraw.....................(no files to check)Skipped
Check SPDX headers.......................................................Passed
Check for spaces in all filenames........................................Passed
Update Dockerfile dependency graph.......................................Passed
Enforce import regex as re...............................................Passed
Forbid direct 'import triton'............................................Passed
Prevent new pickle/cloudpickle imports...................................Passed
Suggestion...............................................................Passed
- hook id: suggestion
- duration: 0s

To bypass pre-commit hooks, add --no-verify to git commit.

Sign-off Commit..........................................................Passed

I will wait and see if there are any CI failure signals.

@houseroad requested a review from @ywang96 on June 12, 2025.
@houseroad added the ready (ONLY add when PR is ready to merge/full CI is needed) label on June 12, 2025.
@houseroad (Collaborator) left a comment:

Put it on hold, and just check the CI.

"LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
# For decapoda-research/llama-*
"LLaMAForCausalLM": ("llama", "LlamaForCausalLM"),
"Llama4ForCausalLM": ("llama4", "Llama4ForCausalLM"),
@ywang96 (Member) commented:
Currently, our basic-models-test assumes that every tested architecture has a corresponding Hugging Face model repository to test with.

_EXAMPLE_MODELS = {
**_TEXT_GENERATION_EXAMPLE_MODELS,
**_EMBEDDING_EXAMPLE_MODELS,
**_CROSS_ENCODER_EXAMPLE_MODELS,
**_MULTIMODAL_EXAMPLE_MODELS,
**_SPECULATIVE_DECODING_EXAMPLE_MODELS,
**_TRANSFORMERS_MODELS,
}

Do you think it's possible to add a dummy model repo on HF with the architecture Llama4ForCausalLM? Alternatively, you will need to modify test_registry.py for CI to pass.
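
As an illustration of the dummy-repo route, here is a hedged sketch of the kind of tiny config.json such a repo could carry; every value is a placeholder chosen to keep the dummy model small, and "llama4_text" as the model_type is an assumption rather than a verified name.

# Hedged sketch: write a minimal config.json exposing the Llama4ForCausalLM
# architecture for a dummy HF repo. Sizes are tiny placeholders; the
# model_type value is an assumption, not verified against transformers.
import json

dummy_config = {
    "architectures": ["Llama4ForCausalLM"],
    "model_type": "llama4_text",  # assumption
    "hidden_size": 16,
    "intermediate_size": 32,
    "num_attention_heads": 2,
    "num_key_value_heads": 1,
    "num_hidden_layers": 2,
    "vocab_size": 128,
}

with open("config.json", "w") as f:
    json.dump(dummy_config, f, indent=2)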

"LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
# For decapoda-research/llama-*
"LLaMAForCausalLM": ("llama", "LlamaForCausalLM"),
"Llama4ForCausalLM": ("llama4", "Llama4ForCausalLM"),
@ywang96 (Member) commented Jun 13, 2025:

On a related note, I think the proper way to support text-only usage of models released as "natively multimodal", like Llama4 or Mistral Small 3.1, is to add a --language-model-only mode.
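
To make the idea concrete, a hedged sketch of what such a mode might do conceptually: keep only the text sub-config of a multimodal config and remap the architecture to the text-only class. The key names follow a typical HF Llama4-style config layout and are assumptions, not vLLM internals or an actual flag implementation.

# Hedged sketch of a hypothetical --language-model-only transformation:
# drop the vision sub-config and point the architecture at Llama4ForCausalLM.
# Key names are assumptions for illustration only.
def to_language_model_only(hf_config: dict) -> dict:
    """Return a text-only view of a multimodal Llama4-style config dict."""
    text_config = dict(hf_config.get("text_config", hf_config))
    text_config["architectures"] = ["Llama4ForCausalLM"]
    return text_config

if __name__ == "__main__":
    multimodal = {
        "architectures": ["Llama4ForConditionalGeneration"],
        "text_config": {"model_type": "llama4_text", "hidden_size": 5120},
        "vision_config": {"model_type": "llama4_vision_model"},
    }
    print(to_language_model_only(multimodal))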

A Collaborator commented:

Maybe we should just go with the --language-model-only solution? @liuzijing2014, thoughts?

@liuzijing2014 (Collaborator, Author) commented:

I see. I will try out this idea for Llama4.

@ywang96 (Member) commented:
@liuzijing2014 Happy to collaborate on this! This was one of the items that I'm planning to work on too :)

Labels: ready (ONLY add when PR is ready to merge/full CI is needed), v1