
[Bug]: vLLM serve google/gemma-3-1b-it with version 0.8.5 interrupted SIGTERM #17386

@igor-susic

Your current environment

nvidia-smi output
Tue Apr 29 12:50:57 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   70C    P8              23W /  72W |      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
The k8s `yml` manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma-deployment
  namespace: default
  labels:
    app: gemma
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma
  template:
    metadata:
      labels:
        app: gemma
    spec:
      containers:
        - command: ["vllm", "serve"]
          args:
            - google/gemma-3-1b-it
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --gpu_memory_utilization
            - "0.9"
            - --task
            - generate
          image: vllm/vllm-openai:v0.8.5
          env:
            - name: NCCL_DEBUG
              value: "TRACE"
            - name: HF_TOKEN
              value: "redacted"
            - name: HUGGING_FACE_HUB_TOKEN
              value: "redacted"
            - name: VLLM_LOGGING_LEVEL
              value: "DEBUG"
            - name: CUDA_LAUNCH_BLOCKING
              value: "1"
            - name: VLLM_TRACE_FUNCTION
              value: "1"
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            failureThreshold: 3
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            failureThreshold: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          startupProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            failureThreshold: 30
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1

---

apiVersion: v1
kind: Service
metadata:
  name: gemma-service
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    app: gemma
  type: ClusterIP
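
For reference, with the probe settings above the kubelet allows at most failureThreshold × periodSeconds = 30 × 5 s = 150 s for the startup probe to succeed before the container is killed with SIGTERM; that window is in the same range as the ~139 s between the first log line (05:17:25) and the EngineCore exit (05:19:44) shown below. A minimal sketch of a more forgiving startup probe, assuming the same /health endpoint (the threshold and period values here are illustrative, not from the original manifest):

          startupProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            # illustrative: allow up to 60 * 10 s = 600 s for weight download and compilation
            failureThreshold: 60
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5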
The output of the `vLLM` pod logs:
kubectl logs -f gemma-deployment-68b9f48455-zqqgr
DEBUG 04-29 05:17:25 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-29 05:17:25 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-29 05:17:25 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-29 05:17:25 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-29 05:17:25 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-29 05:17:25 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-29 05:17:25 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-29 05:17:25 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-29 05:17:25 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-29 05:17:25 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-29 05:17:25 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-29 05:17:25 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-29 05:17:25 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-29 05:17:25 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-29 05:17:25 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-29 05:17:25 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-29 05:17:25 [__init__.py:239] Automatically detected platform cuda.
DEBUG 04-29 05:17:31 [utils.py:135] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
DEBUG 04-29 05:17:31 [__init__.py:28] No plugins for group vllm.general_plugins found.
INFO 04-29 05:17:32 [api_server.py:1043] vLLM API server version 0.8.5
INFO 04-29 05:17:32 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='google/gemma-3-1b-it', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='google/gemma-3-1b-it', task='generate', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=None, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, 
dispatch_function=<function ServeSubcommand.cmd at 0x7ca7d7092de0>)
DEBUG 04-29 05:17:41 [arg_utils.py:1616] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
DEBUG 04-29 05:17:41 [arg_utils.py:1623] Setting max_num_seqs to 256 for OPENAI_API_SERVER usage context.
INFO 04-29 05:17:41 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
DEBUG 04-29 05:17:49 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-29 05:17:49 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-29 05:17:49 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-29 05:17:49 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-29 05:17:50 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-29 05:17:50 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-29 05:17:50 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-29 05:17:50 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-29 05:17:50 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-29 05:17:50 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-29 05:17:50 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-29 05:17:50 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-29 05:17:50 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-29 05:17:50 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-29 05:17:50 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-29 05:17:50 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-29 05:17:50 [__init__.py:239] Automatically detected platform cuda.
INFO 04-29 05:17:53 [core.py:58] Initializing a V1 LLM engine (v0.8.5) with config: model='google/gemma-3-1b-it', speculative_config=None, tokenizer='google/gemma-3-1b-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=google/gemma-3-1b-it, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-29 05:17:53 [logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 04-29 05:17:53 [logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-d3481/VLLM_TRACE_FUNCTION_for_process_118_thread_132867992028288_at_2025-04-29_05:17:53.542171.log
DEBUG 04-29 05:17:53 [__init__.py:28] No plugins for group vllm.general_plugins found.
DEBUG 04-29 05:17:54 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 04-29 05:17:55 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
WARNING 04-29 05:18:00 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x78d654f8e300>
DEBUG 04-29 05:18:00 [config.py:4110] enabled custom ops: Counter()
DEBUG 04-29 05:18:00 [config.py:4112] disabled custom ops: Counter()
DEBUG 04-29 05:18:00 [parallel_state.py:867] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.200.0.48:38683 backend=nccl
INFO 04-29 05:18:00 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-29 05:18:00 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 04-29 05:18:00 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
DEBUG 04-29 05:18:01 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.gemma3.Gemma3Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 04-29 05:18:01 [config.py:4110] enabled custom ops: Counter()
DEBUG 04-29 05:18:01 [config.py:4112] disabled custom ops: Counter()
INFO 04-29 05:18:01 [gpu_model_runner.py:1329] Starting to load model google/gemma-3-1b-it...
DEBUG 04-29 05:18:04 [config.py:4110] enabled custom ops: Counter()
DEBUG 04-29 05:18:04 [config.py:4112] disabled custom ops: Counter({'gemma_rms_norm': 157, 'gelu_and_mul': 26, 'rotary_embedding': 2})
INFO 04-29 05:18:05 [weight_utils.py:265] Using model weights format ['*.safetensors']
DEBUG 04-29 05:18:05 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
INFO 04-29 05:18:11 [weight_utils.py:281] Time spent downloading weights for google/gemma-3-1b-it: 5.597264 seconds
INFO 04-29 05:18:11 [weight_utils.py:315] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.30s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.31s/it]

INFO 04-29 05:18:12 [loader.py:458] Loading weights took 1.43 seconds
INFO 04-29 05:18:13 [gpu_model_runner.py:1347] Model loading took 1.9147 GiB and 12.207135 seconds
DEBUG 04-29 05:18:13 [decorators.py:203] Start compiling function <code object forward at 0x1a7d7e40, file "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3.py", line 382>
DEBUG 04-29 05:18:15 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:18:25 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:18:35 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:18:45 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:18:55 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:19:05 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:19:15 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:19:25 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:19:35 [core_client.py:425] Waiting for 1 core engine proc(s) to start: {0}
DEBUG 04-29 05:19:44 [core.py:392] EngineCore exiting.
[rank0]:[W429 05:19:45.151525875 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 53, in main
    args.dispatch_function(args)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
  File "uvloop/loop.pyx", line 476, in uvloop.loop.Loop._on_idle
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 61, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 118, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 642, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 398, in __init__
    self._wait_for_engine_startup()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 423, in _wait_for_engine_startup
    events = poller.poll(STARTUP_POLL_PERIOD_MS)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/zmq/sugar/poll.py", line 106, in poll
    return zmq_poll(self.sockets, timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "_zmq.py", line 1609, in zmq.backend.cython._zmq.zmq_poll
  File "_zmq.py", line 169, in zmq.backend.cython._zmq._check_rc
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1074, in signal_handler
    raise KeyboardInterrupt("terminated")
KeyboardInterrupt: terminated

🐛 Describe the bug

Hi,

When deploying google/gemma-3-1b-it, the vLLM pod always crashes with the same error while waiting for the core engine process to start; it never comes up successfully. The GPU has 24 GB of memory, so running this model on a single GPU should not be an issue.

The detailed pod output and the simple K8s setup needed to reproduce the problem are included above.
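
For triage, it may help to confirm whether the SIGTERM is coming from the kubelet (for example, a failed startup probe) rather than from vLLM itself. A minimal check, using the pod name from the logs above (it will differ for each deployment):

# Show container state, last termination reason, and any probe-failure events for the pod
kubectl describe pod gemma-deployment-68b9f48455-zqqgr -n default

# List warning events (e.g. "Unhealthy: Startup probe failed") in the namespace
kubectl get events -n default --field-selector type=Warning --sort-by=.lastTimestamp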
