
[Bug]: Compile inductor / CUDA Graph build before the memory profiling #19480

@houseroad

Description


Your current environment

Running Llama4 Maverick on H100x8

🐛 Describe the bug

Inductor compilation and the CUDA graph build should run before memory profiling; otherwise it's easy to hit OOM. Both can consume a significant amount of memory on their own. In particular, inductor may run profiling of its own to search for the best kernel configs.

```shell
export LLAMA_DIR=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
export PORT=8081

VLLM_LOGGING_LEVEL=DEBUG VLLM_DISABLE_COMPILE_CACHE=1 SAFETENSORS_FAST_GPU=1 \
  vllm serve $LLAMA_DIR --disable-log-requests -tp 8 --host :: --port $PORT \
  --served-model-name default --no-enable-prefix-caching --max-model-len 4096 \
  --gpu-memory-utilization 0.8 2>&1 | tee marverik_fp8_no_compile.log
```

With --gpu-memory-utilization set to 0.9 or 0.95, the issue reproduces easily on H100x8 machines; 0.8 may be okay.
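A toy accounting model shows why the ordering matters and why lowering the utilization knob papers over it. All numbers below (80 GB per GPU, 10 GB of compile/graph overhead) are hypothetical for illustration, and this is a sketch of the failure mode, not vLLM's actual profiler logic:

```python
def peak_usage_gb(total_gb: float, util: float, compile_overhead_gb: float,
                  compile_before_profiling: bool) -> float:
    """Rough peak GPU memory, assuming the profiler sizes the KV cache so
    that everything it can *see* fits within util * total_gb."""
    if compile_before_profiling:
        # Compile / CUDA graph memory is already resident during profiling,
        # so the KV cache is shrunk to make room for it: peak stays capped.
        return total_gb * util
    # Compile overhead arrives *after* profiling: it lands on top of a KV
    # cache that was sized without knowing about it.
    return total_gb * util + compile_overhead_gb

TOTAL, OVERHEAD = 80.0, 10.0  # hypothetical H100 capacity and overhead

assert peak_usage_gb(TOTAL, 0.9, OVERHEAD, compile_before_profiling=False) > TOTAL   # OOM
assert peak_usage_gb(TOTAL, 0.8, OVERHEAD, compile_before_profiling=False) <= TOTAL  # squeaks by
assert peak_usage_gb(TOTAL, 0.9, OVERHEAD, compile_before_profiling=True) <= TOTAL   # proposed order
```

Under these assumptions, 0.8 only works because it leaves enough slack for the unaccounted overhead; compiling before profiling removes the dependence on that slack.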

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata


Labels

bug (Something isn't working), llama (Related to Llama models), torch.compile

Status

Done
