Your current environment
The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.31
Python version: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-4.18.0-425.19.2.el8_7.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB
Nvidia driver version: 525.105.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7742 64-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 2250.000
CPU max MHz: 2250.0000
CPU min MHz: 1500.0000
BogoMIPS: 4491.45
Virtualization: AMD-V
L1d cache: 4 MiB
L1i cache: 4 MiB
L2 cache: 64 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-255
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Versions of relevant libraries:
[pip3] flashinfer==0.1.2+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31,144-159 1
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5
NIC0 PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS SYS SYS
NIC2 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS
NIC3 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS
NIC7 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS
NIC8 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX
NIC9 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
🐛 Describe the bug
I am running the nightly pip wheel, pulled at 9:00 PM EST on 8/20/24, inside the 0.5.4 Docker image. I also tested the same commands with a plain 0.5.4 install.
python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Meta-Llama-3.1-70B-Instruct-quantized.w8a16 -tp 8 --gpu-memory-utilization 0.995 --dtype auto --distributed-executor-backend mp --max-model-len 40000 --kv-cache-dtype fp8
This is the model I am using:
https://huggingface.co/neuralmagic/Meta-Llama-3-70B-Instruct-quantized.w8a16
It also fails with:
https://huggingface.co/neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: at 110:17:
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] cur_kv_head * stride_k_cache_h +
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] (offs_d[:, None] // x) * stride_k_cache_d +
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] ((start_n + offs_n[None, :]) % block_size) *
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] stride_k_cache_bl +
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] (offs_d[:, None] % x) * stride_k_cache_x)
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] # [N,D]
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] off_v = (
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] bn[:, None] * stride_v_cache_bs +
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] cur_kv_head * stride_v_cache_h +
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] offs_d[None, :] * stride_v_cache_d +
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] k_load = tl.load(K_cache + off_k,
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] ^, Traceback (most recent call last):
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 35, in wrapper
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] return fn(*args, **kwargs)
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1597, in load
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] return semantic.load(pointer, mask, other, boundary_check, padding_option, cache_modifier, eviction_policy,
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1037, in load
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] return _load_legacy(ptr, mask, other, boundary_check, padding, cache, eviction, is_volatile, builder)
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 1005, in _load_legacy
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] other = cast(other, elt_ty, builder)
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 759, in cast
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] assert builder.options.allow_fp8e4nv, "fp8e4nv data type is not supported on CUDA arch < 89"
(VllmWorkerProcess pid=4055) ERROR 08-21 01:06:31 multiproc_worker_utils.py:226] AssertionError: fp8e4nv data type is not supported on CUDA arch < 89
However, all of the following commands work:
python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Meta-Llama-3.1-70B-Instruct-quantized.w8a16 -tp 8 --gpu-memory-utilization 0.995 --dtype auto --distributed-executor-backend mp --max-model-len 40000
python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Llama-3-8B-Instruct-GPTQ-4-Bit -tp 8 --gpu-memory-utilization 0.995 --dtype auto --distributed-executor-backend mp --max-model-len 4000 --kv-cache-dtype fp8
python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Llama-3-8B-Instruct -tp 8 --gpu-memory-utilization 0.995 --dtype auto --distributed-executor-backend mp --max-model-len 4000 --kv-cache-dtype fp8 --quantization fp8
python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Llama-3-8B-Instruct -tp 8 --gpu-memory-utilization 0.995 --dtype auto --distributed-executor-backend mp --max-model-len 4000 --kv-cache-dtype fp8
It appears to be related to Neural Magic's quants combined with the fp8 KV cache: the fp8 cache dtype seems to be set to a format that is not possible on Ampere. However, forcibly setting the dtype to fp8_e5m2 or fp8_e4m3 does not work either, likely due to the precomputed scales in the quants. Let me know if I should post this to Neural Magic instead, but I figure they're heavily involved here anyway and there might be a vLLM-side solution as well (I noticed you managed to get the Neural Magic quants running on Ampere in 0.5.3, thank you!).
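For reference, here is a minimal sketch (my own check, not vLLM's internal selection logic) of the arch constraint behind the assertion in the log: Triton's fp8e4nv (e4m3) type requires compute capability >= 8.9, while the A100s here report 8.0, so only the e5m2 fp8 format is available on this machine. The dtype strings match the `--kv-cache-dtype` values mentioned above, but this is only an illustration of the constraint.

```python
# Sketch: check whether the current GPU can use the e4m3 fp8 format that
# Triton's fp8e4nv maps to. Triton asserts compute capability >= 8.9
# (Ada/Hopper); A100 reports (8, 0), which trips the assertion in the log.
import torch

major, minor = torch.cuda.get_device_capability()
arch = major * 10 + minor
print(f"compute capability: {major}.{minor} (sm_{arch})")

if arch >= 89:
    kv_cache_dtype = "fp8_e4m3"   # available on Ada (sm_89) and Hopper (sm_90)
else:
    kv_cache_dtype = "fp8_e5m2"   # the only Triton fp8 format on Ampere (sm_80)
print(f"fp8 KV cache format usable on this GPU: {kv_cache_dtype}")
```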