Your current environment
nvidia-smi output
Tue Apr 29 12:50:57 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02 Driver Version: 535.230.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 70C P8 23W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

The k8s `yml`
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma-deployment
  namespace: default
  labels:
    app: gemma
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma
  template:
    metadata:
      labels:
        app: gemma
    spec:
      containers:
      - command: ["vllm", "serve"]
        args:
        - google/gemma-3-1b-it
        - --host
        - "0.0.0.0"
        - --port
        - "8000"
        - --gpu_memory_utilization
        - "0.9"
        - --task
        - generate
        image: vllm/vllm-openai:v0.8.5
        env:
        - name: NCCL_DEBUG
          value: "TRACE"
        - name: HF_TOKEN
          value: "redacted"
        - name: HUGGING_FACE_HUB_TOKEN
          value: "redacted"
        - name: VLLM_LOGGING_LEVEL
          value: "DEBUG"
        - name: CUDA_LAUNCH_BLOCKING
          value: "1"
        - name: VLLM_TRACE_FUNCTION
          value: "1"
        imagePullPolicy: IfNotPresent
        name: vllm-openai
        ports:
        - containerPort: 8000
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          failureThreshold: 3
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          failureThreshold: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        startupProbe:
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          failureThreshold: 30
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
---
apiVersion: v1
kind: Service
metadata:
  name: gemma-service
  namespace: default
spec:
  ports:
  - name: serve
    port: 8000
    protocol: TCP
    targetPort: 8000
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: gemma
  type: ClusterIP
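For reference, this is the kind of request the Service is meant to serve once the pod is healthy. It is a minimal sketch assuming a local `kubectl port-forward svc/gemma-service 8000:8000`; in practice it never gets this far because the pod crashes during engine startup.

# Minimal OpenAI-compatible client call against the port-forwarded Service.
# Assumes `kubectl port-forward svc/gemma-service 8000:8000` is running in another shell.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="google/gemma-3-1b-it",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)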
The output of `vLLM` pod logs

kubectl logs -f gemma-deployment-68b9f48455-zqqgr
DEBUG 04-29 05:17:25 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-29 05:17:25 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-29 05:17:25 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-29 05:17:25 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-29 05:17:25 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-29 05:17:25 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-29 05:17:25 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-29 05:17:25 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-29 05:17:25 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-29 05:17:25 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-29 05:17:25 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-29 05:17:25 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-29 05:17:25 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-29 05:17:25 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-29 05:17:25 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-29 05:17:25 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-29 05:17:25 [__init__.py:239] Automatically detected platform cuda.
DEBUG 04-29 05:17:31 [utils.py:135] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
DEBUG 04-29 05:17:31 [__init__.py:28] No plugins for group vllm.general_plugins found.
INFO 04-29 05:17:32 [api_server.py:1043] vLLM API server version 0.8.5
INFO 04-29 05:17:32 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='google/gemma-3-1b-it', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='google/gemma-3-1b-it', task='generate', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=None, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, 
dispatch_function=<function ServeSubcommand.cmd at 0x7ca7d7092de0>)
DEBUG 04-29 05:17:41 [arg_utils.py:1616] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
DEBUG 04-29 05:17:41 [arg_utils.py:1623] Setting max_num_seqs to 256 for OPENAI_API_SERVER usage context.
INFO 04-29 05:17:41 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
DEBUG 04-29 05:17:49 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-29 05:17:49 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-29 05:17:49 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-29 05:17:49 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-29 05:17:50 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-29 05:17:50 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-29 05:17:50 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-29 05:17:50 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-29 05:17:50 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-29 05:17:50 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-29 05:17:50 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-29 05:17:50 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-29 05:17:50 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-29 05:17:50 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-29 05:17:50 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-29 05:17:50 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-29 05:17:50 [__init__.py:239] Automatically detected platform cuda.
INFO 04-29 05:17:53 [core.py:58] Initializing a V1 LLM engine (v0.8.5) with config: model='google/gemma-3-1b-it', speculative_config=None, tokenizer='google/gemma-3-1b-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=google/gemma-3-1b-it, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-29 05:17:53 [logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 04-29 05:17:53 [logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-d3481/VLLM_TRACE_FUNCTION_for_process_118_thread_132867992028288_at_2025-04-29_05:17:53.542171.log
DEBUG 04-29 05:17:53 [__init__.py:28] No plugins for group vllm.general_plugins found.
DEBUG 04-29 05:17:54 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 04-29 05:17:55 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
WARNING 04-29 05:18:00 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x78d654f8e300>
DEBUG 04-29 05:18:00 [config.py:4110] enabled custom ops: Counter()
DEBUG 04-29 05:18:00 [config.py:4112] disabled custom ops: Counter()
DEBUG 04-29 05:18:00 [parallel_state.py:867] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.200.0.48:38683 backend=nccl
INFO 04-29 05:18:00 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-29 05:18:00 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 04-29 05:18:00 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
DEBUG 04-29 05:18:01 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.gemma3.Gemma3Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 04-29 05:18:01 [config.py:4110] enabled custom ops: Counter()
DEBUG 04-29 05:18:01 [config.py:4112] disabled custom ops: Counter()
INFO 04-29 05:18:01 [gpu_model_runner.py:1329] Starting to load model google/gemma-3-1b-it...
DEBUG 04-29 05:18:04 [config.py:4110] enabled custom ops: Counter()
DEBUG 04-29 05:18:04 [config.py:4112] disabled custom ops: Counter({'gemma_rms_norm': 157, 'gelu_and_mul': 26, 'rotary_embedding': 2})
INFO 04-29 05:18:05 [weight_utils.py:265] Using model weights format ['*.safetensors']
DEBUG 04-29 05:18:05 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
INFO 04-29 05:18:11 [weight_utils.py:281] Time spent downloading weights for google/gemma-3-1b-it: 5.597264 seconds
INFO 04-29 05:18:11 [weight_utils.py:315] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.30s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.31s/it]
INFO 04-29 05:18:12 [loader.py:458] Loading weights took 1.43 seconds
INFO 04-29 05:18:13 [gpu_model_runner.py:1347] Model loading took 1.9147 GiB and 12.207135 seconds
DEBUG 04-29 05:18:13 [decorators.py:203] Start compiling function <code object forward at 0x1a7d7e40, file "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 382>
DEBUG 04-29 05:18:15 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:18:25 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:18:35 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:18:45 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:18:55 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:19:05 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:19:15 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:19:25 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:19:35 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:19:44 [core.py:392] EngineCore exiting.
[rank0]:[W429 05:19:45.151525875 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "/usr/local/bin/vllm", line 10, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 53, in main
args.dispatch_function(args)
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
File "uvloop/loop.pyx", line 476, in uvloop.loop.Loop._on_idle
File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
File "uvloop/cbhandles.pyx", line 61, in uvloop.loop.Handle._run
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 118, in __init__
self.engine_core = core_client_class(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 642, in __init__
super().__init__(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 398, in __init__
self._wait_for_engine_startup()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 423, in _wait_for_engine_startup
events = poller.poll(STARTUP_POLL_PERIOD_MS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/zmq/sugar/poll.py", line 106, in poll
return zmq_poll(self.sockets, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "_zmq.py", line 1609, in zmq.backend.cython._zmq.zmq_poll
File "_zmq.py", line 169, in zmq.backend.cython._zmq._check_rc
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1074, in signal_handler
raise KeyboardInterrupt("terminated")
KeyboardInterrupt: terminated

🐛 Describe the bug
Hi,
When deploying google/gemma-3-1b-it, the vLLM pod always crashes with the same error while waiting for the core engine process to start; it never comes up successfully. The GPU (NVIDIA L4) has 24 GB of memory, so running this model on a single GPU should not be a problem.
The full pod logs and a minimal Kubernetes setup to reproduce the issue are included above.
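To help narrow this down, below is the minimal in-process check I can run inside the same vllm/vllm-openai:v0.8.5 image directly on the L4 node, to see whether the engine starts at all outside Kubernetes. This is a sketch only; the model and gpu_memory_utilization match the deployment above, nothing else is changed.

# Standalone check, run inside the same container image on the L4 node.
# Uses the same model and gpu_memory_utilization as the Kubernetes deployment.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-1b-it", gpu_memory_utilization=0.9)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)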