Closed
Labels
bug: Something isn't working
Description
Your current environment
The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Red Hat Enterprise Linux 9.4 (Plow) (x86_64)
GCC version: (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.34
Python version: 3.12.1 (main, Aug 23 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (64-bit runtime)
Python platform: Linux-4.18.0-372.46.1.el8_6.x86_64-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Vendor ID: GenuineIntel
Model name: Intel Xeon Processor (Icelake)
CPU family: 6
Model: 134
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
Stepping: 0
BogoMIPS: 5600.04
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 2.5 MiB (80 instances)
L1i cache: 2.5 MiB (80 instances)
L2 cache: 160 MiB (40 instances)
L3 cache: 32 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-39
NUMA node1 CPU(s): 40-79
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu124torch2.4
[pip3] mypy-extensions==1.0.0
[pip3] numpy==2.1.2
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] sentence-transformers==3.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev150+gd5fbb8706.d20241010
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     40-79           1               N/A
NIC0    SYS      X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
Model Input Dumps
No response
🐛 Describe the bug
When the nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test model serves requests with temperature=0, the output changes depending on how the scheduler batches the requests. This seems to be why the lm-eval tests get different scores when the size of the KV cache is changed.
Slack thread for more context: https://vllm-dev.slack.com/archives/C07R5PAL2L9/p1729409919734939
Here's a small repro script that uses --max-num-seqs to force different batch sizes:
test_batch_weirdness.py
from vllm import LLM
import gc
import torch
import os
import json
from vllm.sampling_params import SamplingParams
from difflib import unified_diff
# Load up request data
CWD = os.path.dirname(os.path.abspath(__file__))
with open(f"{CWD}/request_data_small.json", "r") as f:
data = json.load(f)
prompt_token_ids = [d['prompt_token_ids'] for d in data]
# Run once with no limit on batch size
llm = LLM("nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test")
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=SamplingParams(temperature=0, max_tokens=100))
batched_output_list = [i.outputs[0].text for i in outputs]
# Check that we get the same answer if we run these twice
sanity_check_outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=SamplingParams(temperature=0, max_tokens=100))
assert batched_output_list == [i.outputs[0].text for i in sanity_check_outputs]
# Run again with a batch size of 1
del llm
gc.collect()
torch.cuda.empty_cache()
llm = LLM("nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test", max_num_seqs=1)
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=SamplingParams(temperature=0, max_tokens=100))
serial_output_list = [i.outputs[0].text for i in outputs]
# Show the diff between the lists
for i in range(len(batched_output_list)):
    if batched_output_list[i] != serial_output_list[i]:
        print(f"\n\nDiff in output {i}: \n")
        diff = unified_diff(batched_output_list[i].splitlines(), serial_output_list[i].splitlines(), lineterm='')
        print('\n'.join(list(diff)))
And the input data for that: request_data_small.json
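If it helps with triage, here is a rough sketch of extending the script above to also record per-token logprobs, so the position where greedy decoding first flips can be inspected. This assumes the SamplingParams(logprobs=N) / CompletionOutput.logprobs behaviour of recent vLLM releases; run_with_logprobs and report_first_divergence are just illustrative helpers, not part of the repro. Call run_with_logprobs once with the default LLM and once with max_num_seqs=1, then pass both result lists to report_first_divergence.
from vllm.sampling_params import SamplingParams

# Assumption: per-token logprobs are exposed as in recent vLLM releases, i.e.
# SamplingParams(logprobs=N) and CompletionOutput.logprobs as a list of
# {token_id: Logprob} dicts; field names may differ slightly across versions.
params = SamplingParams(temperature=0, max_tokens=100, logprobs=5)

def run_with_logprobs(llm, prompt_token_ids):
    outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=params)
    # token_ids: the greedily chosen tokens; logprobs: top-5 candidates per position
    return [(list(o.outputs[0].token_ids), o.outputs[0].logprobs) for o in outputs]

def report_first_divergence(batched_runs, serial_runs):
    # Find the first token position where the two runs disagree, per request.
    for idx, ((toks_a, lps_a), (toks_b, lps_b)) in enumerate(zip(batched_runs, serial_runs)):
        for pos, (ta, tb) in enumerate(zip(toks_a, toks_b)):
            if ta == tb:
                continue
            print(f"request {idx}: first diverging token at position {pos}")
            print(f"  batched top candidates: {lps_a[pos]}")
            print(f"  serial top candidates:  {lps_b[pos]}")
            break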
On my A100 machine, the repro script above produces a diff like so:
Diff in output 1:
---
+++
@@ -1,6 +1,6 @@
- There are 240 - 80 = <<240-80=160>>160 Chinese people.
-There are 60 boys on the Chinese team, so there are 160 - 60 = <<160-60=100>>100 girls on the Chinese team.
+ There are 240 - 80 = <<240-80=160>>160 Chinese.
+There are 60 boys, so there are 160 - 60 = <<160-60=100>>100 girls.
#### 100
-Question: A bakery sells 250 loaves of bread per day. They sell 1/5 of their loaves to a local restaurant. How many loaves of bread does the bakery sell to the restaurant?
-Answer
+Question: A bakery sells 240 loaves of bread per day. They sell 1/3 of their loaves to a local restaurant. How many loaves of bread does the bakery sell to the restaurant?
+Answer: 1/3 of 240 is
Diff in output 2:
---
+++
@@ -1,6 +1,6 @@
Charlie has 3 times as many Facebook friends as Dorothy, so Dorothy has 12/3 = 4 Facebook friends.
-James has 4 times as many Facebook friends as Dorothy, so James has 4 x 4 = 16 Facebook friends.
+James has 4 times as many friends on Facebook as Dorothy, so James has 4 * 4 = 16 Facebook friends.
#### 16
-Question: David's car gets 25 miles per gallon. He drives 300 miles. How many gallons of gas will he need?
-Answer: David's car gets 25 miles per gallon, so it will need 300
+Question: David's car is 5 years old. He has been driving it for 3 years. How many years old is his car in terms of its mileage?
+Answer: The car is 5 years old in
Diff in output 3:
---
+++
@@ -1,6 +1,4 @@
- On Thursday, the mechanic earned 6 * $60 + 4 * $40 = <<6*60+4*40=360+160=520>>520 dollars.
-On Friday, the mechanic earned 12 * $40 = <<12*40=480>>480 dollars.
-The mechanic earned $480 - $520 = <<480-520=-40>>-$40 more on the day with higher revenue.
+ On Thursday, the mechanic earned 6 * $60 = $360 for truck tires and 4 * $40 = $160 for car tires. So, the total revenue on Thursday was $360 + $160 = $520.
+On Friday, the mechanic earned 12 * $40 = $480 for car tires. So, the total revenue on Friday was $480.
+The mechanic earned $480 - $520 = -$40 more on the day with higher revenue.
#### -$40
-
-Question: A bakery sells a total of 250 loaves
Diff in output 5:
---
+++
@@ -1,3 +1,3 @@
- Steve will take 3 miles / 440 feet per minute = <<3/440=0.06875>>0.06875 hours to get home.
-Tim will take 2 miles / 264 feet per minute = <<2/264=0.00758>>0.00758 hours to get home.
-Steve will be waiting 0.06875 - 0.00758 = <<0.06875-0.00758=0.06117>>0.06117
+ Steve will take 3 miles / 440 feet per minute = <<3/440=0.0682>>0.0682 hours to get home.
+Tim will take 2 miles / 264 feet per minute = <<2/264=0.0076>>0.0076 hours to get home.
+The difference in time is 0.0682 - 0.0076 = <<0.0682-0.0076=0.0606>>0.060
Diff in output 6:
---
+++
@@ -1,3 +1,3 @@
The tree will cost $90 to plant, so he will not earn any money for the first year.
-In the second year, he will earn $1.5 * 7 = $<<1.5*7=10.5>>10.5 from the lemons, but he will also spend $3 to water and feed the tree, so he will earn $10.5 - $3 = $<<10.5-3=7.5>>7.5.
-In
+In the second year, he will earn $1.5 * 7 = $<<1.5*7=10.5>>10.5 from the lemons, but it will cost $3 to water and feed the tree, so he will earn $10.5 - $3 = $<<10.5-3=7.5>>7.5.
+In the
Diff in output 7:
---
+++
@@ -3,4 +3,4 @@
In total, Tommy makes $129 + $92 = $<<129+92=221>>221
#### 221
-Question: A bookshelf has
+Question: A bakery sells a
Diff in output 8:
---
+++
@@ -1,8 +1,7 @@
- 30% of 1000 is 0.3 * 1000 = 300 students who went out through exit A.
+ 30% of 1000 is 0.3 * 1000 = 300 students.
The remaining students are 1000 - 300 = 700.
-3/5 of the remaining students went out through exit B, which is 0.6 * 700 = 420 students.
-The remaining students are 700 - 420 = 280.
-The number of students who went out through exit C is 280.
+3/5 of the remaining students went out through exit B, which is 0.6 * 700 = 420.
+The number of students who went out through exit C is 700 - 420 = 280.
#### 280
-Question
+Question: A snail is at the bottom of a 20-foot well
Diff in output 9:
---
+++
@@ -1,6 +1,5 @@
10 acres produce 10 x 5 = <<10*5=50>>50 tons of grapes.
-50 tons of grapes produce 50 x 2 = <<50*2=100>>100 barrels of wine.
+50 tons of grapes make 50 x 2 = <<50*2=100>>100 barrels of wine.
#### 100
-Question: A bakery sells 250 loaves of bread per day. They sell 1/4 of their loaves to a local restaurant. How many loaves of bread does the bakery sell to the restaurant?
-Answer: The bakery sells
+Question: A bakery sells a total of 250 loaves of bread per day. They sell a combination of whole wheat and white bread. If they sell 30 more loaves of whole wheat than white bread, and the total number of loaves of
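My working assumption (not a confirmed root cause) is that the logits for a sequence are not bitwise identical when it is decoded alone versus inside a larger batch, because GEMM / quantized-GEMM kernels can choose different tiling and reduction orders per problem shape. When the top two candidates are nearly tied, a last-bit difference is enough to flip the greedy choice, and the outputs diverge from that point on. Below is a toy illustration of batch-dependent numerics with a plain fp16 matmul; it is nothing vLLM-specific, and the observed difference will depend on the GPU and library versions.
import torch

# Compare row 0 of a batched matmul against the same row computed alone.
# The kernel cuBLAS selects can depend on the batch dimension, so the two
# results are not guaranteed to be bitwise equal.
torch.manual_seed(0)
x = torch.randn(64, 4096, dtype=torch.float16, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

batched = x @ w        # all 64 rows at once
single = x[:1] @ w     # the first row on its own

# Frequently non-zero on real hardware; at temperature=0 a difference this
# small is already enough to flip a near-tied argmax over the logits.
print((batched[:1] - single).abs().max().item())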
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.