Description
System Info
Device: H20
Driver: 535.161.07
cuda-toolkit: 12.2.0
python env:
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
tensorrt 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.11.0.dev2024052800
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
1. convert_config/llama_13b/float16/2-gpu/config.json
{ "architecture": "LlamaForCausalLM", "dtype": "float16", "logits_dtype": "float32", "vocab_size": 32000, "max_position_embeddings": 4096, "hidden_size": 5120, "num_hidden_layers": 40, "num_attention_heads": 40, "num_key_value_heads": 40, "head_size": 128, "hidden_act": "silu", "intermediate_size": 13824, "norm_epsilon": 1e-05, "position_embedding_type": "rope_gpt_neox", "use_parallel_embedding": false, "embedding_sharding_dim": 0, "share_embedding_table": false, "mapping": { "world_size": 2, "tp_size": 2, "pp_size": 1 }, "quantization": { "quant_algo": null, "kv_cache_quant_algo": null, "group_size": 128, "smoothquant_val": null, "has_zero_point": false, "pre_quant_scale": false, "exclude_modules": [ "lm_head" ] }, "kv_dtype": "float16", "rotary_scaling": null, "moe_normalization_mode": null, "rotary_base": 10000.0, "moe_num_experts": 0, "moe_top_k": 0, "moe_tp_mode": 2, "attn_bias": false, "disable_weight_only_quant_plugin": false, "mlp_bias": false }
2. run_build_llama13b.sh
#!/bin/bash
### 1. generate config json file
model=llama_13b
tp=2
dtype=fp16
model_config=./convert_config/$model/float16/${tp}-gpu/config.json
output_dir=./engines/$model/trt_engines/fp16/${tp}-gpu

### 2. generate engine with batches and input lens
max_output_len=200
declare -a input_lengths=(1024)
for ((i=0; i<${#input_lengths[@]}; i++)); do
    max_input_len=${input_lengths[$i]}
    case $max_input_len in
        1024) test_batch_sizes=(1 2 4 8 16 32 64 128 256) ;;
        *) echo "Invalid input length"; exit 1 ;;
    esac
    for b in "${test_batch_sizes[@]}"; do
        max_batch_size=$b
        echo "Running trtllm-build with max_input_len=$max_input_len, max_output_len=$max_output_len, max_batch_size=$max_batch_size"
        trtllm-build \
            --model_config $model_config \
            --max_input_len $max_input_len \
            --max_output_len $max_output_len \
            --max_batch_size $max_batch_size \
            --gemm_plugin auto \
            --output_dir $output_dir
        if [ $? -ne 0 ]; then
            echo "trtllm-build failed for max_input_len=$max_input_len, max_batch_size=$max_batch_size"
        fi

        ### 3. begin test cases
        in=$max_input_len
        out=200
        work_dir=$(pwd)
        engine_dir=./engines/$model/trt_engines/fp16/${tp}-gpu/
        echo "Running gptSessionBenchmark with input_len=$in, output_len=$out, batch_size=$b"
        mpirun --allow-run-as-root -n ${tp} python3 benchmarks/python/benchmark.py \
            --model ${model} \
            --mode plugin \
            --batch_size "${b}" \
            --input_output_len "${in},${out}" \
            --warm_up 1 \
            --num_runs 4 \
            --engine_dir $engine_dir \
            --csv
        if [ $? -ne 0 ]; then
            echo "gptSessionBenchmark failed for input_len=$in, output_len=$out, batch_size=$b"
        fi

        ### 4. delete engine dir
        rm -rf $engine_dir
    done
done
echo "All runs completed successfully."
Expected behavior
The benchmark should print the performance data normally.
actual behavior
[06/03/2024-16:26:43] [TRT-LLM] [I] Engine serialized. Total time: 00:00:05
[06/03/2024-16:26:43] [TRT-LLM] [I] Total time of building all engines: 00:01:18
Running gptSessionBenchmark with input_len=1024, output_len=200, batch_size=1
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Allocated 117.50 MiB for execution context memory.
model_name,world_size,num_heads,num_kv_heads,num_layers,hidden_size,vocab_size,precision,batch_size,gpu_weights_percent,input_length,output_length,gpu_peak_mem(gb),build_time(s),tokens_per_sec,percentile95(ms),percentile99(ms),latency(ms),compute_cap,quantization,generation_time(ms),total_generated_tokens,generation_tokens_per_second
/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
[dlp2-29-140:159817] *** Process received signal ***
[dlp2-29-140:159817] Signal: Floating point exception (8)
[dlp2-29-140:159817] Signal code: Integer divide-by-zero (1)
[dlp2-29-140:159817] Failing at address: 0x2b228fc4ec59
[dlp2-29-140:159817] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b220e562630]
[dlp2-29-140:159817] [ 1] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0xa0bc59)[0x2b228fc4ec59]
[dlp2-29-140:159817] [ 2] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x814383)[0x2b228fa57383]
[dlp2-29-140:159817] [ 3] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x6ace72)[0x2b228f8efe72]
[dlp2-29-140:159817] [ 4] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7aa087)[0x2b228f9ed087]
[dlp2-29-140:159817] [ 5] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7ab055)[0x2b228f9ee055]
[dlp2-29-140:159817] [ 6] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7ab774)[0x2b228f9ee774]
[dlp2-29-140:159817] [ 7] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(cublasLtMatmul+0x1525)[0x2b228f9f2375]
[dlp2-29-140:159817] [ 8] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm6common15CublasMMWrapper4GemmE17cublasOperation_tS2_iiiPKviS4_iPviffRK20cublasLtMatmulAlgo_tbb+0xfd)[0x2b23b55e7a8d]
[dlp2-29-140:159817] [ 9] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm6common15CublasMMWrapper4GemmE17cublasOperation_tS2_iiiPKviS4_iPviRKSt8optionalI31cublasLtMatmulHeuristicResult_tE+0x60)[0x2b23b55e7f70]
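The frames above point into the libcublasLt.so.12 bundled with the nvidia-cublas-cu12 wheel (12.1.3.1 in the environment listed above). The following diagnostic sketch, with the site-packages path copied from the backtrace, can help confirm which cuBLAS build is actually being picked up:
# The cuBLAS wheel that appears in the backtrace, and its version.
pip show nvidia-cublas-cu12
# Which libcublasLt.so.12 the TensorRT-LLM runtime library resolves at link time
# (runtime resolution inside the python process may differ; the backtrace above is authoritative).
ldd /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so | grep -i cublas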
additional notes
The same command and script run normally on an A100, but report a divide-by-zero error on the H20. Could this be because the NVIDIA CUDA version is too low?
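To narrow this down, the GPU, driver, and CUDA/cuBLAS versions can be compared between the working A100 node and the failing H20 node. A diagnostic sketch (run the same commands on both machines; exact output fields depend on the driver version):
# Report GPU model, compute capability, and driver version on each node.
nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
# Report the CUDA toolkit and the cuBLAS/TensorRT packages installed in the python env.
nvcc --version | grep "release"
pip list 2>/dev/null | grep -E "nvidia-cublas-cu12|tensorrt"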