docs/dev-docker/README.md
Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main`
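
For reference, a minimal launch sketch is shown below. The device mappings are the usual ROCm ones; the volume mount path is a placeholder that you should adapt to your environment:

```bash
docker pull rocm/vllm-dev:main

# Standard ROCm container launch; /path/to/models is a placeholder.
docker run -it --rm \
    --network=host \
    --ipc=host \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v /path/to/models:/data/models \
    rocm/vllm-dev:main
```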

## What is New

- vLLM v0.9.1
- No need to specify the --compilation-config parameter; these options are now on by default
- Fixed the Llama 3.1 405B CAR issue (no longer need --disable-custom-all-reduce)
- Fixed the +rms_norm custom kernel issue
- Added quick reduce (set VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=FP to enable; supported modes are FP, INT8, INT6, INT4; see the example after this list)
- Mitigated the commandr model GPU crash through a workaround until the driver issue is fixed
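
A hedged sketch of enabling quick reduce for the FP8 serving configuration used later in this guide (the model and tensor-parallel size are illustrative):

```bash
# Quick reduce quantization modes: FP, INT8, INT6, INT4.
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=FP

vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 8
```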

## Known Issues and Workarounds

- AITER does not support fp8 kv cache

## Performance Results

The data in the following tables is a reference point to help users validate observed performance. It should not be considered the peak performance achievable by the AMD Instinct™ MI300X accelerator with vLLM. See the MLPerf section in this document for information about MLPerf 4.1 inference results. The performance numbers below were collected using the steps described in this document.
*Note: Benchmarks were run with benchmark scripts from [v0.8.5](https://github.com/vllm-project/vllm/tree/v0.8.5/benchmarks)*

### Throughput Measurements

The table below shows throughput measurements collected in a client-server scenario under maximum load, where a local inference client is fed requests at an infinite rate.

| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
|-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 12638.9 |
| | | | 128 | 4096 | 1500 | 1500 | 10756.8 |
| | | | 500 | 2000 | 2000 | 2000 | 10691.7 |
| | | | 2048 | 2048 | 1500 | 1500 | 7354.9 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 3912.8 |
| | | | 128 | 4096 | 1500 | 1500 | 3084.7 |
| | | | 500 | 2000 | 2000 | 2000 | 2935.9 |
| | | | 2048 | 2048 | 500 | 500 | 2191.5 |

*TP stands for Tensor Parallelism.*

The table below shows latency measurements, which typically involve assessing the time from when a request is submitted until the complete response is received.

| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
|-------|-----------|----------|------------|--------|---------|-------------------|
| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.236 |
| | | | 2 | 128 | 2048 | 18.057 |
| | | | 4 | 128 | 2048 | 18.450 |
| | | | 8 | 128 | 2048 | 19.677 |
| | | | 16 | 128 | 2048 | 22.072 |
| | | | 32 | 128 | 2048 | 24.932 |
| | | | 64 | 128 | 2048 | 33.287 |
| | | | 128 | 128 | 2048 | 46.484 |
| | | | 1 | 2048 | 2048 | 17.500 |
| | | | 2 | 2048 | 2048 | 18.055 |
| | | | 4 | 2048 | 2048 | 18.858 |
| | | | 8 | 2048 | 2048 | 20.161 |
| | | | 16 | 2048 | 2048 | 22.347 |
| | | | 32 | 2048 | 2048 | 25.966 |
| | | | 64 | 2048 | 2048 | 35.324 |
| | | | 128 | 2048 | 2048 | 52.394 |
| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 48.453 |
| | | | 2 | 128 | 2048 | 49.268 |
| | | | 4 | 128 | 2048 | 51.136 |
| | | | 8 | 128 | 2048 | 54.226 |
| | | | 16 | 128 | 2048 | 57.274 |
| | | | 32 | 128 | 2048 | 68.901 |
| | | | 64 | 128 | 2048 | 88.631 |
| | | | 128 | 128 | 2048 | 117.027 |
| | | | 1 | 2048 | 2048 | 48.362 |
| | | | 2 | 2048 | 2048 | 49.121 |
| | | | 4 | 2048 | 2048 | 52.347 |
| | | | 8 | 2048 | 2048 | 54.471 |
| | | | 16 | 2048 | 2048 | 57.841 |
| | | | 32 | 2048 | 2048 | 70.538 |
| | | | 64 | 2048 | 2048 | 91.452 |
| | | | 128 | 2048 | 2048 | 125.471 |

*TP stands for Tensor Parallelism.*

Supermicro AS-8125GS-TNMR2 with 2x AMD EPYC 9554 Processors, 2.25 TiB RAM, 8x AMD Instinct MI300X (192GiB, 750W) GPUs, Ubuntu 22.04, and amdgpu driver 6.8.5

## Reproducing Benchmarked Results

You can run latency tests for FP8 models with:

```bash
export VLLM_USE_TRITON_FLASH_ATTN=0
export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV
BS=1
IN=128
# OUT and TP follow the 405B configuration used in the tables above.
OUT=2048
TP=8

python3 /app/vllm/benchmarks/benchmark_latency.py \
    --model $MODEL \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --dtype float16 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --batch-size $BS \
    --input-len $IN \
    --output-len $OUT \
    --tensor-parallel-size $TP \
    --num-iters-warmup 3 \
    --num-iters 5

```

For FP16 models, remove `--quantization fp8 --kv-cache-dtype fp8`.

When measuring models with long context lengths, performance may improve by setting `--max-model-len` to a smaller value. It is important, however, to ensure that the `--max-model-len` is at least as large as the IN + OUT token counts.

To estimate Time To First Token (TTFT) with the benchmark_latency.py script, set the OUT to 1 token. It is also recommended to use `--enforce-eager` to get a more accurate measurement of the time that it actually takes to generate the first token. (For a more comprehensive measurement of TTFT, use the Online Serving Benchmark.)
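
For example, a TTFT-style run might look like the sketch below; the model and sizes mirror the latency example above and are only illustrative:

```bash
# Approximate TTFT by generating a single output token with eager execution.
export VLLM_USE_TRITON_FLASH_ATTN=0
python3 /app/vllm/benchmarks/benchmark_latency.py \
    --model amd/Llama-3.1-405B-Instruct-FP8-KV \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 8 \
    --batch-size 1 \
    --input-len 2048 \
    --output-len 1 \
    --enforce-eager \
    --num-iters-warmup 3 \
    --num-iters 5
```
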
You can run throughput tests for FP8 models with:

```bash
export VLLM_USE_TRITON_FLASH_ATTN=0
export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV
IN=128
OUT=2048
TP=8
# Prompt count and max sequences follow the 405B 128-in/2048-out row of the throughput table above.
PROMPTS=1500
MAX_NUM_SEQS=1500

python3 /app/vllm/benchmarks/benchmark_throughput.py \
    --distributed-executor-backend mp \
    --kv-cache-dtype fp8 \
    --dtype float16 \
    --disable-detokenize \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --model $MODEL \
    --max-model-len 8192 \
    --max-num-batched-tokens 131072 \
    --input-len $IN \
    --output-len $OUT \
    --tensor-parallel-size $TP \
    --num-prompts $PROMPTS \
    --max-num-seqs $MAX_NUM_SEQS
```

For FP16 models, remove `--kv-cache-dtype fp8`.

When measuring models with long context lengths, performance may improve by setting `--max-model-len` to a smaller value (8192 in this example). It is important, however, to ensure that the `--max-model-len` is at least as large as the IN + OUT token counts.

For the online serving benchmark, start the vLLM server:

```bash
vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.99 \
    --num-scheduler-steps 10
```

For FP16 models, remove `--kv-cache-dtype fp8`. Change the port (for example, `--port 8005`) if port 8000 is already in use by another process.

Run the client in a separate terminal. Use the port from the previous step; otherwise, the default is port 8000.
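
As a sketch (assuming vLLM's `benchmark_serving.py` client with a random dataset; adjust the parameters to the scenario you want to measure):

```bash
# Hedged example: 2048-in/2048-out random requests against the server above.
python3 /app/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    --dataset-name random \
    --random-input-len 2048 \
    --random-output-len 2048 \
    --num-prompts 500 \
    --request-rate inf \
    --port 8000
```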

Some AITER use cases include:

```bash
# Enable AITER kernels, keeping MHA and RMSNorm on their default (non-AITER) implementations.
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_AITER_RMSNORM=0
python3 /app/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -tp 8 --batch-size 256 --input-len 128 --output-len 2048
```

To reproduce the release docker:
```bash
git clone https://github.com/ROCm/vllm.git
cd vllm
git checkout b432b7a285aa0dcb9677380936ffa74931bb6d6f
docker build -f docker/Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
```

Use AITER release candidate branch instead:

## Changelog

20250715_aiter:
- No need to specify the --compilation-config parameter; these options are now on by default
- Fixed the Llama 3.1 405B CAR issue (no longer need --disable-custom-all-reduce)
- Fixed the +rms_norm custom kernel issue
- Added quick reduce (set VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=FP to enable; supported modes are FP, INT8, INT6, INT4)
- Mitigated the commandr model GPU crash through a workaround until the driver issue is fixed

20250620_aiter:
- V1 on by default (use VLLM_USE_V1=0 to override)
- Fixed detokenizers issue