`.buildkite/test-pipeline.yaml`: 2 changes (1 addition, 1 deletion)

```diff
@@ -420,7 +420,7 @@ steps:
   - pytest -v -s kernels/mamba

 - label: Tensorizer Test # 11min
-  mirror_hardwares: [amdexperimental, amdproduction]
+  mirror_hardwares: [amdexperimental]
   soft_fail: true
   source_file_dependencies:
   - vllm/model_executor/model_loader
```
`docs/dev-docker/README.md`: 4 changes (3 additions, 1 deletion)

````diff
@@ -291,7 +291,8 @@ python3 /app/vllm/benchmarks/benchmark_throughput.py \
   --num-prompts $PROMPTS \
   --max-num-seqs $MAX_NUM_SEQS
 ```
-For FP16 models, remove `--kv-cache-dtype fp8`.
+
+For FP16 models, remove `--kv-cache-dtype fp8`.

 When measuring models with long context lengths, performance may improve by setting `--max-model-len` to a smaller value (8192 in this example). It is important, however, to ensure that the `--max-model-len` is at least as large as the IN + OUT token counts.

````
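To make the `--max-model-len` guidance above concrete, here is a minimal sketch of an FP16 throughput run (illustrative only, not part of this diff): `--kv-cache-dtype fp8` is omitted, and the `--input-len`/`--output-len` values and the `$MODEL` placeholder are assumptions.

```bash
# Sketch of an FP16 run: --kv-cache-dtype fp8 removed, and --max-model-len
# (8192) chosen to cover the IN + OUT token budget (4096 + 512 = 4608 here).
# $MODEL, $PROMPTS, and $MAX_NUM_SEQS are assumed placeholders.
python3 /app/vllm/benchmarks/benchmark_throughput.py \
    --model $MODEL \
    --input-len 4096 \
    --output-len 512 \
    --max-model-len 8192 \
    --num-prompts $PROMPTS \
    --max-num-seqs $MAX_NUM_SEQS
```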
````diff
@@ -325,6 +326,7 @@ vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \
   --gpu-memory-utilization 0.99 \
   --num_scheduler-steps 10
 ```
+
 For FP16 models, remove `--kv-cache-dtype fp8`. Change port (for example --port 8005) if port=8000 is currently being used by other processes.

 Run client in a separate terminal. Use port_id from previous step else port-id=8000.
````
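One way to exercise the server started above (a hedged sketch, not part of this diff; the README's own client step may use a different tool) is a request to vLLM's OpenAI-compatible completions endpoint. The prompt and `max_tokens` values here are arbitrary:

```bash
# Assumes the server is listening on the default port 8000; if it was started
# with a different --port (e.g. 8005), use that here instead.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "amd/Llama-3.1-70B-Instruct-FP8-KV",
        "prompt": "San Francisco is a",
        "max_tokens": 64
    }'
```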