
H20: Using random weights to infer llama2-13B results in a divide-by-zero error #1717

@zxs789

Description

System Info

Device: H20
Driver: 535.161.07
CUDA toolkit: 12.2.0

Python env:
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
tensorrt 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.11.0.dev2024052800

Who can help?

@ncomly-nvidia @kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

1. convert_config/llama_13b/float16/2-gpu/config.json
{ "architecture": "LlamaForCausalLM", "dtype": "float16", "logits_dtype": "float32", "vocab_size": 32000, "max_position_embeddings": 4096, "hidden_size": 5120, "num_hidden_layers": 40, "num_attention_heads": 40, "num_key_value_heads": 40, "head_size": 128, "hidden_act": "silu", "intermediate_size": 13824, "norm_epsilon": 1e-05, "position_embedding_type": "rope_gpt_neox", "use_parallel_embedding": false, "embedding_sharding_dim": 0, "share_embedding_table": false, "mapping": { "world_size": 2, "tp_size": 2, "pp_size": 1 }, "quantization": { "quant_algo": null, "kv_cache_quant_algo": null, "group_size": 128, "smoothquant_val": null, "has_zero_point": false, "pre_quant_scale": false, "exclude_modules": [ "lm_head" ] }, "kv_dtype": "float16", "rotary_scaling": null, "moe_normalization_mode": null, "rotary_base": 10000.0, "moe_num_experts": 0, "moe_top_k": 0, "moe_tp_mode": 2, "attn_bias": false, "disable_weight_only_quant_plugin": false, "mlp_bias": false }

2. run_build_llama13b.sh
#!/bin/bash

### 1. generate config json file

model=llama_13b
tp=2
dtype=fp16

model_config=./convert_config/$model/float16/${tp}-gpu/config.json
output_dir=./engines/$model/trt_engines/fp16/${tp}-gpu

### 2. generate engine with batches and input lens
max_output_len=200
declare -a input_lengths=(1024)

for ((i=0; i<${#input_lengths[@]}; i++)); do
  max_input_len=${input_lengths[$i]}

  case $max_input_len in
    1024) test_batch_sizes=(1 2 4 8 16 32 64 128 256) ;;
    *) echo "Invalid input length"; exit 1 ;;
  esac
  
  for b in "${test_batch_sizes[@]}"; do
    max_batch_size=$b
    echo "Running trtllm-build with max_input_len=$max_input_len, max_output_len=$max_output_len, max_batch_size=$max_batch_size"

    trtllm-build \
      --model_config $model_config \
      --gemm_plugin auto \
      --output_dir $output_dir

    if [ $? -ne 0 ]; then
      echo "trtllm-build failed for max_input_len=$max_input_len, max_batch_size=$max_batch_size"
    fi

    ### 3. begin test cases
    in=$max_input_len
    out=200

    work_dir=`pwd`
    engine_dir=./engines/$model/trt_engines/fp16/${tp}-gpu/

    echo "Running gptSessionBenchmark with input_len=$in, output_len=$out, batch_size=$b"
    mpirun --allow-run-as-root -n ${tp} python3 benchmarks/python/benchmark.py \
                                           --model ${model} \
                                           --mode plugin \
                                           --batch_size  "${b}" \
                                           --input_output_len "${in},${out}" \
                                           --warm_up 1 \
                                           --num_runs 4 \
                                           --engine_dir $engine_dir \
                                           --csv

    if [ $? -ne 0 ]; then
      echo "gptSessionBenchmark failed for input_len=$in, output_len=$out, batch_size=$b"
    fi
    ### 4. delete engine dir
    rm -rf $engine_dir
  done  
done
echo "All runs completed successfully."

Expected behavior

The benchmark is expected to run to completion and print the performance data normally.

Actual behavior

[06/03/2024-16:26:43] [TRT-LLM] [I] Engine serialized. Total time: 00:00:05
[06/03/2024-16:26:43] [TRT-LLM] [I] Total time of building all engines: 00:01:18
Running gptSessionBenchmark with input_len=1024, output_len=200, batch_size=1
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Allocated 117.50 MiB for execution context memory.
model_name,world_size,num_heads,num_kv_heads,num_layers,hidden_size,vocab_size,precision,batch_size,gpu_weights_percent,input_length,output_length,gpu_peak_mem(gb),build_time(s),tokens_per_sec,percentile95(ms),percentile99(ms),latency(ms),compute_cap,quantization,generation_time(ms),total_generated_tokens,generation_tokens_per_second
/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
[dlp2-29-140:159817] *** Process received signal ***
[dlp2-29-140:159817] Signal: Floating point exception (8)
[dlp2-29-140:159817] Signal code: Integer divide-by-zero (1)
[dlp2-29-140:159817] Failing at address: 0x2b228fc4ec59
[dlp2-29-140:159817] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b220e562630]
[dlp2-29-140:159817] [ 1] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0xa0bc59)[0x2b228fc4ec59]
[dlp2-29-140:159817] [ 2] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x814383)[0x2b228fa57383]
[dlp2-29-140:159817] [ 3] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x6ace72)[0x2b228f8efe72]
[dlp2-29-140:159817] [ 4] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7aa087)[0x2b228f9ed087]
[dlp2-29-140:159817] [ 5] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7ab055)[0x2b228f9ee055]
[dlp2-29-140:159817] [ 6] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7ab774)[0x2b228f9ee774]
[dlp2-29-140:159817] [ 7] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(cublasLtMatmul+0x1525)[0x2b228f9f2375]
[dlp2-29-140:159817] [ 8] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm6common15CublasMMWrapper4GemmE17cublasOperation_tS2_iiiPKviS4_iPviffRK20cublasLtMatmulAlgo_tbb+0xfd)[0x2b23b55e7a8d]
[dlp2-29-140:159817] [ 9] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm6common15CublasMMWrapper4GemmE17cublasOperation_tS2_iiiPKviS4_iPviRKSt8optionalI31cublasLtMatmulHeuristicResult_tE+0x60)[0x2b23b55e7f70]
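The backtrace points into the cuBLASLt shipped by the pip wheels (nvidia-cublas-cu12 12.1.3.1, installed next to torch), called from libtensorrt_llm.so. A quick sketch to confirm which libcublasLt.so.12 copies exist in this environment and which one libtensorrt_llm.so resolves (paths taken from the backtrace above):

# Sketch: list cuBLASLt copies in the env and check what libtensorrt_llm.so links against.
find /root/anaconda3/envs/trt_llm -name 'libcublasLt.so.12*' 2>/dev/null
pip show nvidia-cublas-cu12 | grep -E '^(Name|Version|Location)'
ldd /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so | grep -i cublas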

Additional notes

The same command and script run normally on an A100, but fail with this divide-by-zero error on the H20. Could this be because the NVIDIA CUDA version is too low?
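One way to check that directly is to compare what the two hosts report; the H20 is a Hopper-class GPU (compute capability 9.0), so the driver, CUDA toolkit, and the cuBLAS wheel all need sm_90 support. A minimal check sketch:

# Sketch: compare driver / toolkit / torch-visible CUDA between the A100 and H20 machines.
nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
nvcc --version | grep release
python3 -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())"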

Labels

Investigating, bug (Something isn't working), triaged (Issue has been triaged by maintainers)
