examples/models/core/llama4/README.md (49 changes: 47 additions & 2 deletions)
@@ -8,6 +8,7 @@ This document shows how to run Llama4-Maverick on B200 with PyTorch workflow and
- [Performance Benchmarks](#performance-benchmarks)
  - [B200 Max-throughput](#b200-max-throughput)
  - [B200 Min-latency](#b200-min-latency)
  - [B200 Balanced](#b200-balanced)
- [Advanced Configuration](#advanced-configuration)
  - [Configuration tuning](#configuration-tuning)
- [Troubleshooting](#troubleshooting)
@@ -94,9 +95,7 @@ Explanation:

#### 2. Launch trtllm-serve OpenAI-compatible API server
TensorRT-LLM supports the NVIDIA TensorRT Model Optimizer quantized FP8 checkpoint.
Currently, parallel weight loading conflicts with min_latency; for now, disable parallel weight loading to enable min_latency.
``` bash
TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
trtllm-serve nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
--max_batch_size 8 \
--tp_size 8 \
@@ -121,6 +120,52 @@ python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --max-concurrency 1
```

### B200 Balanced


#### 1. Prepare TensorRT-LLM extra configs
```bash
cat >./extra-llm-api-config.yml <<EOF
stream_interval: 2
cuda_graph_config:
  max_batch_size: 1024
  enable_padding: true
EOF
```
Explanation:
- `stream_interval`: The iteration interval at which responses are created in streaming mode.
- `cuda_graph_config`: CUDA Graph config.
  - `max_batch_size`: Max CUDA graph batch size to capture.
  - `enable_padding`: Whether to enable CUDA graph padding.
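
If you edit `extra-llm-api-config.yml` by hand, a quick check that it still parses into the nested structure above can save a failed server launch. Below is a minimal sketch, assuming PyYAML is available in the environment (TensorRT-LLM itself reads this file as YAML, so it normally is):
```bash
# Sanity check: print the parsed config to confirm the YAML is well formed
# and that max_batch_size/enable_padding are nested under cuda_graph_config.
python -c "import yaml, pprint; pprint.pprint(yaml.safe_load(open('./extra-llm-api-config.yml')))"
```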


#### 2. Launch trtllm-serve OpenAI-compatible API server
TensorRT-LLM supports the NVIDIA TensorRT Model Optimizer quantized FP8 checkpoint.
```bash
trtllm-serve nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --tp_size 8 \
    --ep_size 2 \
    --num_postprocess_workers 2 \
    --trust_remote_code \
    --extra_llm_api_options ./extra-llm-api-config.yml
```
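
Once the server is up, a quick way to confirm it is serving requests is to hit the OpenAI-compatible chat completions endpoint. The sketch below assumes the default `localhost:8000` address; adjust the URL if you started the server with `--host`/`--port`:
```bash
# Smoke test against the OpenAI-compatible endpoint (assumes the default localhost:8000).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```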


#### 3. Run performance benchmark
TensorRT-LLM provides a benchmark tool to benchmark trtllm-serve.
In a new terminal, run `benchmark_serving`:
```bash
python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --dataset-name random \
    --ignore-eos \
    --num-prompts 1000 \
    --random-input-len 1024 \
    --random-output-len 2048 \
    --random-ids \
    --max-concurrency 64
```
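
To see where the balanced configuration sits between the max-throughput and min-latency setups, it can help to sweep a few concurrency levels and compare the reported throughput and latency. A minimal sketch follows; the concurrency values are illustrative, not tuned recommendations:
```bash
# Run the benchmark at several illustrative concurrency levels and keep each run's log.
for conc in 8 16 32 64; do
  python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --dataset-name random \
    --ignore-eos \
    --num-prompts 1000 \
    --random-input-len 1024 \
    --random-output-len 2048 \
    --random-ids \
    --max-concurrency "$conc" | tee "benchmark_conc_${conc}.log"
done
```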

## Advanced Configuration

### Configuration tuning