Description
System Info
Device: H20
Driver: 535.161.07
cuda-toolkit: 12.2.0
python env:
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
tensorrt 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.11.0.dev2024052800
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
1. convert_config/llama_13b/float16/2-gpu/config.json
{ "architecture": "LlamaForCausalLM", "dtype": "float16", "logits_dtype": "float32", "vocab_size": 32000, "max_position_embeddings": 4096, "hidden_size": 5120, "num_hidden_layers": 40, "num_attention_heads": 40, "num_key_value_heads": 40, "head_size": 128, "hidden_act": "silu", "intermediate_size": 13824, "norm_epsilon": 1e-05, "position_embedding_type": "rope_gpt_neox", "use_parallel_embedding": false, "embedding_sharding_dim": 0, "share_embedding_table": false, "mapping": { "world_size": 2, "tp_size": 2, "pp_size": 1 }, "quantization": { "quant_algo": null, "kv_cache_quant_algo": null, "group_size": 128, "smoothquant_val": null, "has_zero_point": false, "pre_quant_scale": false, "exclude_modules": [ "lm_head" ] }, "kv_dtype": "float16", "rotary_scaling": null, "moe_normalization_mode": null, "rotary_base": 10000.0, "moe_num_experts": 0, "moe_top_k": 0, "moe_tp_mode": 2, "attn_bias": false, "disable_weight_only_quant_plugin": false, "mlp_bias": false }
2. run_build_llama13b.sh
#!/bin/bash
### 1. generate config json file
model=llama_13b
tp=2
dtype=fp16
model_config=./convert_config/$model/float16/${tp}-gpu/config.json
output_dir=./engines/$model/trt_engines/fp16/${tp}-gpu

### 2. generate engine with batches and input lens
max_output_len=200
declare -a input_lengths=(1024)
for ((i=0; i<${#input_lengths[@]}; i++)); do
    max_input_len=${input_lengths[$i]}
    case $max_input_len in
        1024) test_batch_sizes=(1 2 4 8 16 32 64 128 256) ;;
        *) echo "Invalid input length"; exit 1 ;;
    esac
    for b in "${test_batch_sizes[@]}"; do
        max_batch_size=$b
        echo "Running trtllm-build with max_input_len=$max_input_len, max_output_len=$max_output_len, max_batch_size=$max_batch_size"
        trtllm-build \
            --model_config $model_config \
            --max_input_len $max_input_len \
            --max_output_len $max_output_len \
            --max_batch_size $max_batch_size \
            --gemm_plugin auto \
            --output_dir $output_dir
        if [ $? -ne 0 ]; then
            echo "trtllm-build failed for max_input_len=$max_input_len, max_batch_size=$max_batch_size"
        fi

        ### 3. begin test cases
        in=$max_input_len
        out=200
        work_dir=$(pwd)
        engine_dir=./engines/$model/trt_engines/fp16/${tp}-gpu/
        echo "Running gptSessionBenchmark with input_len=$in, output_len=$out, batch_size=$b"
        mpirun --allow-run-as-root -n ${tp} python3 benchmarks/python/benchmark.py \
            --model ${model} \
            --mode plugin \
            --batch_size "${b}" \
            --input_output_len "${in},${out}" \
            --warm_up 1 \
            --num_runs 4 \
            --engine_dir $engine_dir \
            --csv
        if [ $? -ne 0 ]; then
            echo "gptSessionBenchmark failed for input_len=$in, output_len=$out, batch_size=$b"
        fi

        ### 4. delete engine dir
        rm -rf $engine_dir
    done
done
echo "All runs completed successfully."
Expected behavior
The benchmark should print the performance data normally.
actual behavior
[06/03/2024-16:26:43] [TRT-LLM] [I] Engine serialized. Total time: 00:00:05
[06/03/2024-16:26:43] [TRT-LLM] [I] Total time of building all engines: 00:01:18
Running gptSessionBenchmark with input_len=1024, output_len=200, batch_size=1
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
Allocated 117.50 MiB for execution context memory.
model_name,world_size,num_heads,num_kv_heads,num_layers,hidden_size,vocab_size,precision,batch_size,gpu_weights_percent,input_length,output_length,gpu_peak_mem(gb),build_time(s),tokens_per_sec,percentile95(ms),percentile99(ms),latency(ms),compute_cap,quantization,generation_time(ms),total_generated_tokens,generation_tokens_per_second
/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
[dlp2-29-140:159817] *** Process received signal ***
[dlp2-29-140:159817] Signal: Floating point exception (8)
[dlp2-29-140:159817] Signal code: Integer divide-by-zero (1)
[dlp2-29-140:159817] Failing at address: 0x2b228fc4ec59
[dlp2-29-140:159817] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b220e562630]
[dlp2-29-140:159817] [ 1] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0xa0bc59)[0x2b228fc4ec59]
[dlp2-29-140:159817] [ 2] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x814383)[0x2b228fa57383]
[dlp2-29-140:159817] [ 3] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x6ace72)[0x2b228f8efe72]
[dlp2-29-140:159817] [ 4] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7aa087)[0x2b228f9ed087]
[dlp2-29-140:159817] [ 5] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7ab055)[0x2b228f9ee055]
[dlp2-29-140:159817] [ 6] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(+0x7ab774)[0x2b228f9ee774]
[dlp2-29-140:159817] [ 7] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/torch/lib/../../nvidia/cublas/lib/libcublasLt.so.12(cublasLtMatmul+0x1525)[0x2b228f9f2375]
[dlp2-29-140:159817] [ 8] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm6common15CublasMMWrapper4GemmE17cublasOperation_tS2_iiiPKviS4_iPviffRK20cublasLtMatmulAlgo_tbb+0xfd)[0x2b23b55e7a8d]
[dlp2-29-140:159817] [ 9] /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm6common15CublasMMWrapper4GemmE17cublasOperation_tS2_iiiPKviS4_iPviRKSt8optionalI31cublasLtMatmulHeuristicResult_tE+0x60)[0x2b23b55e7f70]
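The frames above point into the libcublasLt.so.12 bundled with the nvidia-cublas-cu12 wheel (12.1.3.1 in the environment listed above). The following diagnostic sketch, with the site-packages path copied from the backtrace, can help confirm which cuBLAS build is actually being picked up:
# The cuBLAS wheel that appears in the backtrace, and its version.
pip show nvidia-cublas-cu12
# Which libcublasLt.so.12 the TensorRT-LLM runtime library resolves at link time
# (runtime resolution inside the python process may differ; the backtrace above is authoritative).
ldd /root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so | grep -i cublas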
additional notes
The same command and script run normally on an A100, but report a divide-by-zero error on the H20. Could this be because the NVIDIA CUDA version is too low?
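To narrow this down, the GPU, driver, and CUDA/cuBLAS versions can be compared between the working A100 node and the failing H20 node. A diagnostic sketch (run the same commands on both machines; exact output fields depend on the driver version):
# Report GPU model, compute capability, and driver version on each node.
nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
# Report the CUDA toolkit and the cuBLAS/TensorRT packages installed in the python env.
nvcc --version | grep "release"
pip list 2>/dev/null | grep -E "nvidia-cublas-cu12|tensorrt"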