Closed
Labels
bug: Something isn't working
Description
Your current environment
The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Red Hat Enterprise Linux 9.4 (Plow) (x86_64)
GCC version: (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.34
Python version: 3.12.1 (main, Aug 23 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (64-bit runtime)
Python platform: Linux-4.18.0-372.46.1.el8_6.x86_64-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Vendor ID: GenuineIntel
Model name: Intel Xeon Processor (Icelake)
CPU family: 6
Model: 134
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
Stepping: 0
BogoMIPS: 5600.04
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 2.5 MiB (80 instances)
L1i cache: 2.5 MiB (80 instances)
L2 cache: 160 MiB (40 instances)
L3 cache: 32 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-39
NUMA node1 CPU(s): 40-79
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu124torch2.4
[pip3] mypy-extensions==1.0.0
[pip3] numpy==2.1.2
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] sentence-transformers==3.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev150+gd5fbb8706.d20241010
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     40-79           1               N/A
NIC0    SYS      X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
Model Input Dumps
No response
🐛 Describe the bug
When the nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test model serves requests with temperature=0, the output changes depending on how the scheduler batches the requests. This seems to be why the lm-eval tests get different scores when the size of the KV cache is changed.
Slack thread for more context: https://vllm-dev.slack.com/archives/C07R5PAL2L9/p1729409919734939
Here's a small repro script that uses --max-num-seqs to force different batch sizes:
test_batch_weirdness.py
from vllm import LLM
import gc
import torch
import os
import json
from vllm.sampling_params import SamplingParams
from difflib import unified_diff
# Load up request data
CWD = os.path.dirname(os.path.abspath(__file__))
with open(f"{CWD}/request_data_small.json", "r") as f:
data = json.load(f)
prompt_token_ids = [d['prompt_token_ids'] for d in data]
# Run once with no limit on batch size
llm = LLM("nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test")
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=SamplingParams(temperature=0, max_tokens=100))
batched_output_list = [i.outputs[0].text for i in outputs]
# Check that we get the same answer if we run these twice
sanity_check_outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=SamplingParams(temperature=0, max_tokens=100))
assert batched_output_list == [i.outputs[0].text for i in sanity_check_outputs]
# Run again with a batch size of 1
del llm
gc.collect()
torch.cuda.empty_cache()
llm = LLM("nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test", max_num_seqs=1)
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=SamplingParams(temperature=0, max_tokens=100))
serial_output_list = [i.outputs[0].text for i in outputs]
# Show the diff between the lists
for i in range(len(batched_output_list)):
    if batched_output_list[i] != serial_output_list[i]:
        print(f"\n\nDiff in output {i}: \n")
        diff = unified_diff(batched_output_list[i].splitlines(), serial_output_list[i].splitlines(), lineterm='')
        print('\n'.join(list(diff)))
And the input data for that: request_data_small.json
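If it helps with triage, here is a rough sketch of extending the script above to also record per-token logprobs, so the position where greedy decoding first flips can be inspected. This assumes the SamplingParams(logprobs=N) / CompletionOutput.logprobs behaviour of recent vLLM releases; run_with_logprobs and report_first_divergence are just illustrative helpers, not part of the repro. Call run_with_logprobs once with the default LLM and once with max_num_seqs=1, then pass both result lists to report_first_divergence.
from vllm.sampling_params import SamplingParams

# Assumption: per-token logprobs are exposed as in recent vLLM releases, i.e.
# SamplingParams(logprobs=N) and CompletionOutput.logprobs as a list of
# {token_id: Logprob} dicts; field names may differ slightly across versions.
params = SamplingParams(temperature=0, max_tokens=100, logprobs=5)

def run_with_logprobs(llm, prompt_token_ids):
    outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=params)
    # token_ids: the greedily chosen tokens; logprobs: top-5 candidates per position
    return [(list(o.outputs[0].token_ids), o.outputs[0].logprobs) for o in outputs]

def report_first_divergence(batched_runs, serial_runs):
    # Find the first token position where the two runs disagree, per request.
    for idx, ((toks_a, lps_a), (toks_b, lps_b)) in enumerate(zip(batched_runs, serial_runs)):
        for pos, (ta, tb) in enumerate(zip(toks_a, toks_b)):
            if ta == tb:
                continue
            print(f"request {idx}: first diverging token at position {pos}")
            print(f"  batched top candidates: {lps_a[pos]}")
            print(f"  serial top candidates:  {lps_b[pos]}")
            break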
On my A100 machine, the repro script above produces a diff like so:
Diff in output 1:
---
+++
@@ -1,6 +1,6 @@
- There are 240 - 80 = <<240-80=160>>160 Chinese people.
-There are 60 boys on the Chinese team, so there are 160 - 60 = <<160-60=100>>100 girls on the Chinese team.
+ There are 240 - 80 = <<240-80=160>>160 Chinese.
+There are 60 boys, so there are 160 - 60 = <<160-60=100>>100 girls.
#### 100
-Question: A bakery sells 250 loaves of bread per day. They sell 1/5 of their loaves to a local restaurant. How many loaves of bread does the bakery sell to the restaurant?
-Answer
+Question: A bakery sells 240 loaves of bread per day. They sell 1/3 of their loaves to a local restaurant. How many loaves of bread does the bakery sell to the restaurant?
+Answer: 1/3 of 240 is
Diff in output 2:
---
+++
@@ -1,6 +1,6 @@
Charlie has 3 times as many Facebook friends as Dorothy, so Dorothy has 12/3 = 4 Facebook friends.
-James has 4 times as many Facebook friends as Dorothy, so James has 4 x 4 = 16 Facebook friends.
+James has 4 times as many friends on Facebook as Dorothy, so James has 4 * 4 = 16 Facebook friends.
#### 16
-Question: David's car gets 25 miles per gallon. He drives 300 miles. How many gallons of gas will he need?
-Answer: David's car gets 25 miles per gallon, so it will need 300
+Question: David's car is 5 years old. He has been driving it for 3 years. How many years old is his car in terms of its mileage?
+Answer: The car is 5 years old in
Diff in output 3:
---
+++
@@ -1,6 +1,4 @@
- On Thursday, the mechanic earned 6 * $60 + 4 * $40 = <<6*60+4*40=360+160=520>>520 dollars.
-On Friday, the mechanic earned 12 * $40 = <<12*40=480>>480 dollars.
-The mechanic earned $480 - $520 = <<480-520=-40>>-$40 more on the day with higher revenue.
+ On Thursday, the mechanic earned 6 * $60 = $360 for truck tires and 4 * $40 = $160 for car tires. So, the total revenue on Thursday was $360 + $160 = $520.
+On Friday, the mechanic earned 12 * $40 = $480 for car tires. So, the total revenue on Friday was $480.
+The mechanic earned $480 - $520 = -$40 more on the day with higher revenue.
#### -$40
-
-Question: A bakery sells a total of 250 loaves
Diff in output 5:
---
+++
@@ -1,3 +1,3 @@
- Steve will take 3 miles / 440 feet per minute = <<3/440=0.06875>>0.06875 hours to get home.
-Tim will take 2 miles / 264 feet per minute = <<2/264=0.00758>>0.00758 hours to get home.
-Steve will be waiting 0.06875 - 0.00758 = <<0.06875-0.00758=0.06117>>0.06117
+ Steve will take 3 miles / 440 feet per minute = <<3/440=0.0682>>0.0682 hours to get home.
+Tim will take 2 miles / 264 feet per minute = <<2/264=0.0076>>0.0076 hours to get home.
+The difference in time is 0.0682 - 0.0076 = <<0.0682-0.0076=0.0606>>0.060
Diff in output 6:
---
+++
@@ -1,3 +1,3 @@
The tree will cost $90 to plant, so he will not earn any money for the first year.
-In the second year, he will earn $1.5 * 7 = $<<1.5*7=10.5>>10.5 from the lemons, but he will also spend $3 to water and feed the tree, so he will earn $10.5 - $3 = $<<10.5-3=7.5>>7.5.
-In
+In the second year, he will earn $1.5 * 7 = $<<1.5*7=10.5>>10.5 from the lemons, but it will cost $3 to water and feed the tree, so he will earn $10.5 - $3 = $<<10.5-3=7.5>>7.5.
+In the
Diff in output 7:
---
+++
@@ -3,4 +3,4 @@
In total, Tommy makes $129 + $92 = $<<129+92=221>>221
#### 221
-Question: A bookshelf has
+Question: A bakery sells a
Diff in output 8:
---
+++
@@ -1,8 +1,7 @@
- 30% of 1000 is 0.3 * 1000 = 300 students who went out through exit A.
+ 30% of 1000 is 0.3 * 1000 = 300 students.
The remaining students are 1000 - 300 = 700.
-3/5 of the remaining students went out through exit B, which is 0.6 * 700 = 420 students.
-The remaining students are 700 - 420 = 280.
-The number of students who went out through exit C is 280.
+3/5 of the remaining students went out through exit B, which is 0.6 * 700 = 420.
+The number of students who went out through exit C is 700 - 420 = 280.
#### 280
-Question
+Question: A snail is at the bottom of a 20-foot well
Diff in output 9:
---
+++
@@ -1,6 +1,5 @@
10 acres produce 10 x 5 = <<10*5=50>>50 tons of grapes.
-50 tons of grapes produce 50 x 2 = <<50*2=100>>100 barrels of wine.
+50 tons of grapes make 50 x 2 = <<50*2=100>>100 barrels of wine.
#### 100
-Question: A bakery sells 250 loaves of bread per day. They sell 1/4 of their loaves to a local restaurant. How many loaves of bread does the bakery sell to the restaurant?
-Answer: The bakery sells
+Question: A bakery sells a total of 250 loaves of bread per day. They sell a combination of whole wheat and white bread. If they sell 30 more loaves of whole wheat than white bread, and the total number of loaves of
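My working assumption (not a confirmed root cause) is that the logits for a sequence are not bitwise identical when it is decoded alone versus inside a larger batch, because GEMM / quantized-GEMM kernels can choose different tiling and reduction orders per problem shape. When the top two candidates are nearly tied, a last-bit difference is enough to flip the greedy choice, and the outputs diverge from that point on. Below is a toy illustration of batch-dependent numerics with a plain fp16 matmul; it is nothing vLLM-specific, and the observed difference will depend on the GPU and library versions.
import torch

# Compare row 0 of a batched matmul against the same row computed alone.
# The kernel cuBLAS selects can depend on the batch dimension, so the two
# results are not guaranteed to be bitwise equal.
torch.manual_seed(0)
x = torch.randn(64, 4096, dtype=torch.float16, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

batched = x @ w        # all 64 rows at once
single = x[:1] @ w     # the first row on its own

# Frequently non-zero on real hardware; at temperature=0 a difference this
# small is already enough to flip a near-tied argmax over the logits.
print((batched[:1] - single).abs().max().item())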
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.