[Bug]: LLama4 Not working on PP

### Your current environment

<details>

<summary>The output of <code>python collect_env.py</code></summary>

```text
See the ENVIRONMENT section below for the full `python collect_env.py` output.
```

</details>


### 🐛 Describe the bug

I am unable to launch Llama-4-Maverick-17B-128E-Instruct-FP8 in a distributed fashion using Ray.

As the log below shows, vLLM connects to the Ray cluster successfully, but the value of `architectures` appears to be `None` on the worker nodes.

Looking through the stack trace, I can see that `architectures` is `None` when `verify_with_parallel_config` runs, even though both `config.json` and the `--hf-overrides` flag specify `{"architectures": ["Llama4ForConditionalGeneration"]}`.

I can confirm this happens only with Llama 4; I was able to distribute Llama 3.3 across 16 A100 GPUs successfully.
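To double-check what the checkpoint itself declares, something like the following could be run against the model directory. It is a minimal sketch, not part of vLLM: `summarize_architectures` is a hypothetical helper name, and it only reads the `config.json` file on disk, so it says nothing about the config object the Ray workers actually receive. It prints `architectures` at the top level and under the nested `text_config`, because the trace below reaches the failure through `with_hf_config(config.text_config)`.

```python
import json
import sys
from pathlib import Path


def summarize_architectures(config_path):
    """Return `architectures` at the top level and under `text_config`.

    Hypothetical diagnostic helper; a missing key is reported as None.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    # `text_config` may be absent for text-only checkpoints.
    text_config = config.get("text_config") or {}
    return {
        "top_level": config.get("architectures"),
        "text_config": text_config.get("architectures"),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_archs.py /path/to/model/config.json
    print(summarize_architectures(sys.argv[1]))
```

If the nested entry comes back as `None` while the top-level entry is set, that would be consistent with the override applying only to the top-level config, though this check alone does not prove it.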

```
VLLM_DISABLE_COMPILE_CACHE=1 python -m vllm.entrypoints.openai.api_server --model /home/jovyan/llama4-llm-vol/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --served-model-name Llama-4-Maverick-17B-128E-Instruct-FP8 --enforce-eager --max-model-len 2000 --tensor-parallel 8 --pipeline-parallel-size 2 --gpu-memory-utilization 0.95 --host 0.0.0.0 --distributed-executor-backend ray --port 8000 --quantization compressed-tensors --hf-overrides '{"architectures": ["Llama4ForConditionalGeneration"]}'
INFO 04-10 03:30:20 [__init__.py:239] Automatically detected platform cuda.
INFO 04-10 03:30:22 [api_server.py:1034] vLLM API server version 0.8.3rc2.dev80+gcb84e45a
INFO 04-10 03:30:22 [api_server.py:1035] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/home/jovyan/llama4-llm-vol/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=2000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend='ray', pipeline_parallel_size=2, tensor_parallel_size=8, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization='compressed-tensors', rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides={'architectures': 
['Llama4ForConditionalGeneration']}, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Llama-4-Maverick-17B-128E-Instruct-FP8'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 04-10 03:30:22 [config.py:352] Overriding HF config with {'architectures': ['Llama4ForConditionalGeneration']}
INFO 04-10 03:30:30 [config.py:604] This model supports multiple tasks: {'score', 'reward', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 04-10 03:30:31 [config.py:1797] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-10 03:30:37 [__init__.py:239] Automatically detected platform cuda.
INFO 04-10 03:30:40 [core.py:61] Initializing a V1 LLM engine (v0.8.3rc2.dev80+gcb84e45a) with config: model='/home/jovyan/llama4-llm-vol/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', speculative_config=None, tokenizer='/home/jovyan/llama4-llm-vol/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2000, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=2, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Llama-4-Maverick-17B-128E-Instruct-FP8, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
2025-04-10 03:30:40,246 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: ray-head.rapid-experimentation-team.svc.cluster.local:6379...
2025-04-10 03:30:40,464 INFO worker.py:1841 -- Connected to Ray cluster.
INFO 04-10 03:30:42 [ray_utils.py:335] No current placement group found. Creating a new placement group.
INFO 04-10 03:30:42 [ray_distributed_executor.py:176] use_ray_spmd_worker: True
(pid=21440) INFO 04-10 03:30:46 [__init__.py:239] Automatically detected platform cuda.
INFO 04-10 03:30:50 [ray_distributed_executor.py:352] non_carry_over_env_vars from config: set()
INFO 04-10 03:30:50 [ray_distributed_executor.py:354] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_USE_RAY_SPMD_WORKER', 'VLLM_USE_RAY_COMPILED_DAG', 'VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_USE_V1', 'VLLM_DISABLE_COMPILE_CACHE']
INFO 04-10 03:30:50 [ray_distributed_executor.py:357] If certain env vars should NOT be copied to workers, add them to /app/.config/vllm/ray_non_carry_over_env_vars.json file
(RayWorkerWrapper pid=21428) WARNING 04-10 03:30:50 [utils.py:2429] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f5b88740580>
(RayWorkerWrapper pid=21428) INFO 04-10 03:30:56 [utils.py:990] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=21428) INFO 04-10 03:30:56 [pynccl.py:69] vLLM is using nccl==2.21.5
(pid=16887, ip=198.18.77.250) INFO 04-10 03:30:46 [__init__.py:239] Automatically detected platform cuda. [repeated 15x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=16881, ip=198.18.77.250) WARNING 04-10 03:30:50 [utils.py:2429] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f44fc50a170> [repeated 15x across cluster]
(RayWorkerWrapper pid=21428) INFO 04-10 03:30:59 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /app/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(RayWorkerWrapper pid=16907, ip=198.18.77.250) INFO 04-10 03:30:59 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_b575475e'), local_subscribe_addr='ipc:///tmp/b8d239f0-862f-48b9-9ee8-325b401f4b0e', remote_subscribe_addr=None, remote_addr_ipv6=False)
(RayWorkerWrapper pid=21430) INFO 04-10 03:30:59 [parallel_state.py:957] rank 1 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 1
(RayWorkerWrapper pid=21430) INFO 04-10 03:30:59 [cuda.py:221] Using Flash Attention backend on V1 engine.
(RayWorkerWrapper pid=16907, ip=198.18.77.250) INFO 04-10 03:31:04 [gpu_model_runner.py:1277] Starting to load model /home/jovyan/llama4-llm-vol/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8...
(RayWorkerWrapper pid=16881, ip=198.18.77.250) INFO 04-10 03:30:59 [utils.py:990] Found nccl from library libnccl.so.2 [repeated 31x across cluster]
(RayWorkerWrapper pid=16881, ip=198.18.77.250) INFO 04-10 03:30:59 [pynccl.py:69] vLLM is using nccl==2.21.5 [repeated 31x across cluster]
(RayWorkerWrapper pid=16881, ip=198.18.77.250) INFO 04-10 03:30:59 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /app/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json [repeated 15x across cluster]
(RayWorkerWrapper pid=21428) INFO 04-10 03:30:59 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_215d0926'), local_subscribe_addr='ipc:///tmp/f4217ffd-9b42-480f-acf0-7b78e2d42805', remote_subscribe_addr=None, remote_addr_ipv6=False)
(RayWorkerWrapper pid=21442) INFO 04-10 03:30:59 [parallel_state.py:957] rank 5 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 5 [repeated 15x across cluster]
(RayWorkerWrapper pid=21442) INFO 04-10 03:30:59 [cuda.py:221] Using Flash Attention backend on V1 engine. [repeated 15x across cluster]
(RayWorkerWrapper pid=16907, ip=198.18.77.250) WARNING 04-10 03:31:05 [registry.py:432] No model architectures are specified
ERROR 04-10 03:31:05 [core.py:386] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 377, in run_engine_core
ERROR 04-10 03:31:05 [core.py:386]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 319, in __init__
ERROR 04-10 03:31:05 [core.py:386]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 67, in __init__
ERROR 04-10 03:31:05 [core.py:386]     self.model_executor = executor_class(vllm_config)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 286, in __init__
ERROR 04-10 03:31:05 [core.py:386]     super().__init__(*args, **kwargs)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-10 03:31:05 [core.py:386]     self._init_executor()
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 114, in _init_executor
ERROR 04-10 03:31:05 [core.py:386]     self._init_workers_ray(placement_group)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 396, in _init_workers_ray
ERROR 04-10 03:31:05 [core.py:386]     self._run_workers("load_model",
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 521, in _run_workers
ERROR 04-10 03:31:05 [core.py:386]     ray_worker_outputs = ray.get(ray_worker_outputs)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
ERROR 04-10 03:31:05 [core.py:386]     return fn(*args, **kwargs)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
ERROR 04-10 03:31:05 [core.py:386]     return func(*args, **kwargs)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2771, in get
ERROR 04-10 03:31:05 [core.py:386]     values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 919, in get_objects
ERROR 04-10 03:31:05 [core.py:386]     raise value.as_instanceof_cause()
ERROR 04-10 03:31:05 [core.py:386] ray.exceptions.RayTaskError(TypeError): ray::RayWorkerWrapper.execute_method() (pid=16911, ip=198.18.77.250, actor_id=c3542595396344fb7bc7e80106000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7f98e1307a30>)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 621, in execute_method
ERROR 04-10 03:31:05 [core.py:386]     raise e
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 612, in execute_method
ERROR 04-10 03:31:05 [core.py:386]     return run_method(self, method, args, kwargs)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/utils.py", line 2363, in run_method
ERROR 04-10 03:31:05 [core.py:386]     return func(*args, **kwargs)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 136, in load_model
ERROR 04-10 03:31:05 [core.py:386]     self.model_runner.load_model()
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1280, in load_model
ERROR 04-10 03:31:05 [core.py:386]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 04-10 03:31:05 [core.py:386]     return loader.load_model(vllm_config=vllm_config)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 452, in load_model
ERROR 04-10 03:31:05 [core.py:386]     model = _initialize_model(vllm_config=vllm_config)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
ERROR 04-10 03:31:05 [core.py:386]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/mllama4.py", line 692, in __init__
ERROR 04-10 03:31:05 [core.py:386]     vllm_config=vllm_config.with_hf_config(config.text_config),
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 3568, in with_hf_config
ERROR 04-10 03:31:05 [core.py:386]     return replace(self, model_config=model_config)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/dataclasses.py", line 1453, in replace
ERROR 04-10 03:31:05 [core.py:386]     return obj.__class__(**changes)
ERROR 04-10 03:31:05 [core.py:386]   File "<string>", line 19, in __init__
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 3577, in __post_init__
ERROR 04-10 03:31:05 [core.py:386]     self.model_config.verify_with_parallel_config(self.parallel_config)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 797, in verify_with_parallel_config
ERROR 04-10 03:31:05 [core.py:386]     if not self.registry.is_pp_supported_model(self.architectures):
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 501, in is_pp_supported_model
ERROR 04-10 03:31:05 [core.py:386]     model_cls, _ = self.inspect_model_cls(architectures)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 447, in inspect_model_cls
ERROR 04-10 03:31:05 [core.py:386]     architectures = self._normalize_archs(architectures)
ERROR 04-10 03:31:05 [core.py:386]   File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/registry.py", line 436, in _normalize_archs
ERROR 04-10 03:31:05 [core.py:386]     filter(lambda model: model in self.models, architectures))
ERROR 04-10 03:31:05 [core.py:386] TypeError: 'NoneType' object is not iterable
ERROR 04-10 03:31:05 [core.py:386] 
INFO 04-10 03:31:05 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
CRITICAL 04-10 03:31:05 [core_client.py:359] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
```
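The last frames of the trace can be illustrated in isolation. The sketch below is not vLLM's code: `KNOWN_MODELS` and both function names are hypothetical stand-ins. It only shows that calling `filter(...)` over `None`, as in the `_normalize_archs` frame, raises the same `TypeError`, and what a defensive guard might look like if a missing `architectures` value were treated as an empty list.

```python
# Hypothetical stand-in for the registry's table of known model names.
KNOWN_MODELS = {"Llama4ForConditionalGeneration", "LlamaForCausalLM"}


def normalize_archs_unguarded(architectures):
    """Same shape as the failing frame: filter() over a possibly-None value."""
    return list(filter(lambda model: model in KNOWN_MODELS, architectures))


def normalize_archs_guarded(architectures):
    """Illustrative guard: treat a missing value as an empty list."""
    if architectures is None:
        architectures = []
    elif isinstance(architectures, str):
        architectures = [architectures]
    return [model for model in architectures if model in KNOWN_MODELS]


def reproduce():
    """Return the error message raised when architectures is None."""
    try:
        normalize_archs_unguarded(None)
    except TypeError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(reproduce())
    print(normalize_archs_guarded(None))
    print(normalize_archs_guarded(["Llama4ForConditionalGeneration"]))
```

A guard like this would only turn the crash into an empty result; it would not explain why `architectures` is `None` after `with_hf_config(config.text_config)` in the first place.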

### ENVIRONMENT

PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.16 | packaged by conda-forge | (main, Dec  5 2024, 14:16:10) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.10.234-225.895.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB

Nvidia driver version: 550.144.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               96
On-line CPU(s) list:                  0-95
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   2
Core(s) per socket:                   24
Socket(s):                            2
Stepping:                             7
BogoMIPS:                             5999.98
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            1.5 MiB (48 instances)
L1i cache:                            1.5 MiB (48 instances)
L2 cache:                             48 MiB (48 instances)
L3 cache:                             71.5 MiB (2 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-23,48-71
NUMA node1 CPU(s):                    24-47,72-95
Vulnerability Gather data sampling:   Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:          KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Vulnerable
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pytorch-lightning==2.5.0.post0
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchmetrics==1.6.3
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.1
[pip3] triton==3.2.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] pytorch-lightning         2.5.0.post0              pypi_0    pypi
[conda] pyzmq                     26.4.0                   pypi_0    pypi
[conda] torch                     2.6.0                    pypi_0    pypi
[conda] torchaudio                2.6.0                    pypi_0    pypi
[conda] torchmetrics              1.6.3                    pypi_0    pypi
[conda] torchvision               0.21.0                   pypi_0    pypi
[conda] transformers              4.51.1                   pypi_0    pypi
[conda] triton                    3.2.0                    pypi_0    pypi


### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

[Bug]: LLama4 Not working on PP #16385
