From ce6c1e9cfed335f906863e872abb06312e73d238 Mon Sep 17 00:00:00 2001 From: Mcirino1 <57415822+Mcirino1@users.noreply.github.com> Date: Fri, 11 Jul 2025 15:23:07 -0700 Subject: [PATCH 1/5] Minor changes to command line examples --- docs/dev-docker/README.md | 19 ++++--------------- 1 file changed, 4 insertions(+), 15 deletions(-) diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md index 89d5dc31478a..71e934f4c65b 100644 --- a/docs/dev-docker/README.md +++ b/docs/dev-docker/README.md @@ -33,7 +33,7 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main ## Performance Results The data in the following tables is a reference point to help users validate observed performance. It should not be considered as the peak performance that can be delivered by AMD Instinct™ MI300X accelerator with vLLM. See the MLPerf section in this document for information about MLPerf 4.1 inference results. The performance numbers above were collected using the steps below. -*Note Benchmarks were run with benchmark scripts from [v0.6.5](https://github.com/vllm-project/vllm/tree/v0.6.5/benchmarks)* +*Note Benchmarks were run with benchmark scripts from [v0.8.5](https://github.com/vllm-project/vllm/tree/v0.8.5/benchmarks)* ### Throughput Measurements @@ -242,15 +242,10 @@ python3 /app/vllm/benchmarks/benchmark_latency.py \ --output-len $OUT \ --tensor-parallel-size $TP \ --num-iters-warmup 3 \ - --num-iters 5 \ - --output-json output.json - --disable-custom-all-reduce - --compilation-config '{"full_cuda_graph": true,"custom_ops":["+rms_norm","+silu_and_mul"],"pass_config":{"enable_noop":true,"enable_fusion":true}}’ + --num-iters 5 ``` -For FP16 models, remove `--quantization fp8 --kv-cache-dtype fp8`. For all other models, remove `--disable-custom-all-reduce`. - When measuring models with long context lengths, performance may improve by setting `--max-model-len` to a smaller value. It is important, however, to ensure that the `--max-model-len` is at least as large as the IN + OUT token counts. To estimate Time To First Token (TTFT) with the benchmark_latency.py script, set the OUT to 1 token. It is also recommended to use `--enforce-eager` to get a more accurate measurement of the time that it actually takes to generate the first token. (For a more comprehensive measurement of TTFT, use the Online Serving Benchmark.) @@ -283,7 +278,7 @@ python3 /app/vllm/benchmarks/benchmark_throughput.py \ --gpu-memory-utilization 0.9 \ --trust-remote-code \ --num-scheduler-steps 10 \ - --enable-chunked-prefill False \ + --no-enable-chunked-prefill \ --model $MODEL \ --max-model-len 8192 \ --max-num-batched-tokens 131072 \ @@ -292,15 +287,9 @@ python3 /app/vllm/benchmarks/benchmark_throughput.py \ --output-len $OUT \ --tensor-parallel-size $TP \ --num-prompts $PROMPTS \ - --max-num-seqs $MAX_NUM_SEQS \ - --output-json output.json - --disable-custom-all-reduce - --disable-detokenize - --compilation-config '{"full_cuda_graph": true,"custom_ops":["+silu_and_mul"],"pass_config":{"enable_noop":true,"enable_fusion":true}}’ + --max-num-seqs $MAX_NUM_SEQS ``` -For FP16 models, remove `--quantization fp8 --kv-cache-dtype fp8`. For all other models, remove `--disable-custom-all-reduce`. - When measuring models with long context lengths, performance may improve by setting `--max-model-len` to a smaller value (8192 in this example). It is important, however, to ensure that the `--max-model-len` is at least as large as the IN + OUT token counts. 
It is important to tune vLLM’s --max-num-seqs value to an appropriate value depending on the model and input/output lengths. Larger values will allow vLLM to leverage more of the GPU memory for KV Cache and process more prompts concurrently. But if the value is too large, the KV cache will reach its capacity and vLLM will have to cancel and re-process some prompts. Suggested values for various models and configurations are listed below.

From d9cddc019c2610d0a7e821d62ca00217ee64f8d6 Mon Sep 17 00:00:00 2001
From: Mcirino1 <57415822+Mcirino1@users.noreply.github.com>
Date: Mon, 14 Jul 2025 14:39:37 -0700
Subject: [PATCH 2/5] README changes and added throughput results

Still waiting on latency
---
 docs/dev-docker/README.md | 45 ++++++++++++++++++++++++---------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index 71e934f4c65b..ec5d99d68415 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -21,14 +21,15 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main

## What is New

-- V1 on by default (use VLLM_USE_V1=0 to override)
-- Fixed detokenizers issue
-- Fixed AITER MoE issues
-- vLLM v0.9.1
+- No need to specify the --compilation-config parameter; the options it previously passed are now turned on by default
+- Fixed the Llama 3.1 405B custom all-reduce (CAR) issue (no longer need --disable-custom-all-reduce)
+- Fixed the +rms_norm custom kernel issue
+- Added quick reduce (set VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=FP to enable; supported modes are FP, INT8, INT6, and INT4)
+- Mitigated a GPU crash triggered by the commandr model with a workaround until the driver issue is fixed

## Known Issues and Workarounds

-- --disable-custom-all-reduce required for Llama-3.1 405B
+- AITER does not support FP8 KV cache

## Performance Results

@@ -41,14 +42,14 @@ The table below shows performance data where a local inference client is fed req

| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
|-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 10948.9 |
-| | | | 128 | 4096 | 1500 | 1500 | 9489.2 |
-| | | | 500 | 2000 | 2000 | 2000 | 9576.7 |
-| | | | 2048 | 2048 | 1500 | 1500 | 6450.8 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 3329.1 |
-| | | | 128 | 4096 | 1500 | 1500 | 2733.0 |
-| | | | 500 | 2000 | 2000 | 2000 | 2765.0 |
-| | | | 2048 | 2048 | 500 | 500 | 2170.1 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 12638.9 |
+| | | | 128 | 4096 | 1500 | 1500 | 10756.8 |
+| | | | 500 | 2000 | 2000 | 2000 | 10691.7 |
+| | | | 2048 | 2048 | 1500 | 1500 | 7354.9 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 3912.8 |
+| | | | 128 | 4096 | 1500 | 1500 | 3084.7 |
+| | | | 500 | 2000 | 2000 | 2000 | 2935.9 |
+| | | | 2048 | 2048 | 500 | 500 | 2191.5 |

*TP stands for Tensor Parallelism.*

@@ -95,7 +96,7 @@ The table below shows latency measurement, which typically involves assessing th

*TP stands for Tensor Parallelism.*

-Supermicro AS-8125GS-TNMR2 with 2x AMD EPYC 9575F Processors, 2.25 TiB RAM, 8x AMD Instinct MI300X (192GiB, 750W) GPUs, Ubuntu 22.04, and amdgpu driver 6.8.5
+Supermicro AS-8125GS-TNMR2 with 2x AMD EPYC 9554 Processors, 2.25 TiB RAM, 8x AMD Instinct MI300X (192GiB, 750W) GPUs, Ubuntu 22.04, and amdgpu 
driver 6.8.5 ## Reproducing Benchmarked Results @@ -225,6 +226,7 @@ You can run latency tests for FP8 models with: ```bash export VLLM_USE_TRITON_FLASH_ATTN=0 +export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV BS=1 IN=128 @@ -264,6 +266,7 @@ You can run throughput tests for FP8 models with: ```bash export VLLM_USE_TRITON_FLASH_ATTN=0 +export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV IN=128 OUT=2048 @@ -275,10 +278,9 @@ python3 /app/vllm/benchmarks/benchmark_throughput.py \ --distributed-executor-backend mp \ --kv-cache-dtype fp8 \ --dtype float16 \ + --disable-detokenize \ --gpu-memory-utilization 0.9 \ --trust-remote-code \ - --num-scheduler-steps 10 \ - --no-enable-chunked-prefill \ --model $MODEL \ --max-model-len 8192 \ --max-num-batched-tokens 131072 \ @@ -451,6 +453,8 @@ Some use cases include: ```bash export VLLM_ROCM_USE_AITER=1 +export VLLM_ROCM_USE_AITER_MHA=0 +export VLLM_ROCM_USE_AITER_RMSNORM=0 python3 /app/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -tp 8 --batch-size 256 --input-len 128 --output-len 2048 ``` @@ -487,7 +491,7 @@ To reproduce the release docker: ```bash git clone https://github.com/ROCm/vllm.git cd vllm - git checkout b335519f20495128a47d86f2c01dd467e2fe602b + git checkout b432b7a285aa0dcb9677380936ffa74931bb6d6f docker build -f docker/Dockerfile.rocm -t --build-arg USE_CYTHON=1 . ``` @@ -504,6 +508,13 @@ Use AITER release candidate branch instead: ## Changelog +20250715_aiter: +- No need to specify the --compilation-config parameter, these options were turned on by default +- Fixed llama3.1 405b CAR issue (no longer need --disable-custom-all-reduce) +- Fixed +rms_norm custom kernel issue +- Added quick reduce (set VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=FP to enable. 
Supported modes are FP, INT8, INT6, INT4) +- Mitigated the commandr model causing GPU crash through a workaround until the driver issue is fixed + 20250620_aiter: - V1 on by default (use VLLM_USE_V1=0 to override) - Fixed detokenizers issue From 2ecfd9595d67d2d38fc367affebaca393be4cb3b Mon Sep 17 00:00:00 2001 From: Mcirino1 <57415822+Mcirino1@users.noreply.github.com> Date: Tue, 15 Jul 2025 08:54:44 -0700 Subject: [PATCH 3/5] Added latency results --- docs/dev-docker/README.md | 64 +++++++++++++++++++-------------------- 1 file changed, 32 insertions(+), 32 deletions(-) diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md index ec5d99d68415..e21f0ca05d79 100644 --- a/docs/dev-docker/README.md +++ b/docs/dev-docker/README.md @@ -61,38 +61,38 @@ The table below shows latency measurement, which typically involves assessing th | Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) | |-------|-----------|----------|------------|--------|---------|-------------------| -| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.175 | -| | | | 2 | 128 | 2048 | 17.603 | -| | | | 4 | 128 | 2048 | 18.128 | -| | | | 8 | 128 | 2048 | 19.549 | -| | | | 16 | 128 | 2048 | 21.518 | -| | | | 32 | 128 | 2048 | 24.103 | -| | | | 64 | 128 | 2048 | 31.443 | -| | | | 128 | 128 | 2048 | 42.932 | -| | | | 1 | 2048 | 2048 | 17.112 | -| | | | 2 | 2048 | 2048 | 17.857 | -| | | | 4 | 2048 | 2048 | 18.711 | -| | | | 8 | 2048 | 2048 | 19.770 | -| | | | 16 | 2048 | 2048 | 21.865 | -| | | | 32 | 2048 | 2048 | 25.302 | -| | | | 64 | 2048 | 2048 | 33.435 | -| | | | 128 | 2048 | 2048 | 48.935 | -| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 52.201 | -| | | | 2 | 128 | 2048 | 52.689 | -| | | | 4 | 128 | 2048 | 53.543 | -| | | | 8 | 128 | 2048 | 56.713 | -| | | | 16 | 128 | 2048 | 62.190 | -| | | | 32 | 128 | 2048 | 68.914 | -| | | | 64 | 128 | 2048 | 85.783 | -| | | | 128 | 128 | 2048 | 116.485 | -| | | | 1 | 2048 | 2048 | 52.309 | -| | | | 2 | 2048 | 2048 | 52.551 | -| | | | 4 | 2048 | 2048 | 53.685 | -| | | | 8 | 2048 | 2048 | 56.905 | -| | | | 16 | 2048 | 2048 | 63.313 | -| | | | 32 | 2048 | 2048 | 71.007 | -| | | | 64 | 2048 | 2048 | 88.708 | -| | | | 128 | 2048 | 2048 | 124.583 | +| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.236 | +| | | | 2 | 128 | 2048 | 18.057 | +| | | | 4 | 128 | 2048 | 18.450 | +| | | | 8 | 128 | 2048 | 19.677 | +| | | | 16 | 128 | 2048 | 22.072 | +| | | | 32 | 128 | 2048 | 24.932 | +| | | | 64 | 128 | 2048 | 33.287 | +| | | | 128 | 128 | 2048 | 46.484 | +| | | | 1 | 2048 | 2048 | 17.500 | +| | | | 2 | 2048 | 2048 | 18.055 | +| | | | 4 | 2048 | 2048 | 18.858 | +| | | | 8 | 2048 | 2048 | 20.161 | +| | | | 16 | 2048 | 2048 | 22.347 | +| | | | 32 | 2048 | 2048 | 25.966 | +| | | | 64 | 2048 | 2048 | 35.324 | +| | | | 128 | 2048 | 2048 | 52.394 | +| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 48.453 | +| | | | 2 | 128 | 2048 | 49.268 | +| | | | 4 | 128 | 2048 | 51.136 | +| | | | 8 | 128 | 2048 | 54.226 | +| | | | 16 | 128 | 2048 | 57.274 | +| | | | 32 | 128 | 2048 | 68.901 | +| | | | 64 | 128 | 2048 | 88.631 | +| | | | 128 | 128 | 2048 | 117.027 | +| | | | 1 | 2048 | 2048 | 48.362 | +| | | | 2 | 2048 | 2048 | 49.121 | +| | | | 4 | 2048 | 2048 | 52.347 | +| | | | 8 | 2048 | 2048 | 54.471 | +| | | | 16 | 2048 | 2048 | 57.841 | +| | | | 32 | 2048 | 2048 | 70.538 | +| | | | 64 | 2048 | 2048 | 91.452 | +| | | | 128 | 2048 | 2048 | 
125.471 | *TP stands for Tensor Parallelism.* From 885244214a5815a3f3c3070c9b139b08fc68bfe6 Mon Sep 17 00:00:00 2001 From: Mcirino1 <57415822+Mcirino1@users.noreply.github.com> Date: Tue, 15 Jul 2025 09:00:04 -0700 Subject: [PATCH 4/5] Update README.md --- docs/dev-docker/README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md index e21f0ca05d79..290da726c9f3 100644 --- a/docs/dev-docker/README.md +++ b/docs/dev-docker/README.md @@ -291,6 +291,7 @@ python3 /app/vllm/benchmarks/benchmark_throughput.py \ --num-prompts $PROMPTS \ --max-num-seqs $MAX_NUM_SEQS ``` +For FP16 models, remove `--kv-cache-dtype fp8`. When measuring models with long context lengths, performance may improve by setting `--max-model-len` to a smaller value (8192 in this example). It is important, however, to ensure that the `--max-model-len` is at least as large as the IN + OUT token counts. @@ -324,6 +325,7 @@ vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \ --gpu-memory-utilization 0.99 \ --num_scheduler-steps 10 ``` +For FP16 models, remove `--kv-cache-dtype fp8`. Change port (for example --port 8005) if port=8000 is currently being used by other processes. From 270179a7cc4c7f1a9f07f20dd53162108132b154 Mon Sep 17 00:00:00 2001 From: Mcirino1 <57415822+Mcirino1@users.noreply.github.com> Date: Tue, 15 Jul 2025 09:42:15 -0700 Subject: [PATCH 5/5] Update README.md --- docs/dev-docker/README.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md index 290da726c9f3..410da5000b11 100644 --- a/docs/dev-docker/README.md +++ b/docs/dev-docker/README.md @@ -325,9 +325,7 @@ vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \ --gpu-memory-utilization 0.99 \ --num_scheduler-steps 10 ``` -For FP16 models, remove `--kv-cache-dtype fp8`. - -Change port (for example --port 8005) if port=8000 is currently being used by other processes. +For FP16 models, remove `--kv-cache-dtype fp8`. Change port (for example --port 8005) if port=8000 is currently being used by other processes. Run client in a separate terminal. Use port_id from previous step else port-id=8000.
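
A note on the `--max-num-seqs` tuning discussed in the patched README text: the sketch below shows one way to sweep candidate values with the same throughput benchmark used in the examples above. It is a minimal sketch, assuming the `rocm/vllm-dev` container layout from the patches (benchmark scripts under `/app/vllm/benchmarks/`) and reusing only flags and environment variables that already appear in the examples; the candidate value list and the `results/` log path are illustrative and not part of the patch set.

```bash
#!/bin/bash
# Minimal sketch: sweep --max-num-seqs for the FP8 throughput benchmark shown above.
# Assumes the rocm/vllm-dev container with benchmark scripts under /app/vllm/benchmarks/.
# The candidate values below are illustrative; pick them per the suggested-values table.
export VLLM_USE_TRITON_FLASH_ATTN=0
export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1

MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV
IN=128
OUT=2048
TP=8
PROMPTS=1500

mkdir -p results
for MAX_NUM_SEQS in 500 1000 1500; do
    python3 /app/vllm/benchmarks/benchmark_throughput.py \
        --distributed-executor-backend mp \
        --kv-cache-dtype fp8 \
        --dtype float16 \
        --disable-detokenize \
        --gpu-memory-utilization 0.9 \
        --trust-remote-code \
        --model $MODEL \
        --max-model-len 8192 \
        --max-num-batched-tokens 131072 \
        --input-len $IN \
        --output-len $OUT \
        --tensor-parallel-size $TP \
        --num-prompts $PROMPTS \
        --max-num-seqs $MAX_NUM_SEQS \
        2>&1 | tee results/throughput_max_num_seqs_${MAX_NUM_SEQS}.log
done
```

Comparing the reported tokens/s across runs, and checking the logs for prompts that get cancelled and re-processed, gives a quick indication of when the KV cache is reaching capacity, per the guidance in the `--max-num-seqs` paragraph.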