System Info
I am experimenting with TensorRT-LLM and flan-t5 models. My goal is simple: build engines with different configurations and tensor-parallel layouts, then compare performance. I have a DGX system and an AWS P4de instance (A100s) to work on. I did a full stack upgrade on both to see whether it fixes the problem, with no luck.
- TensorRT-LLM version: 0.9.0.dev2024030500 (also tried the stable version, installed without --pre, which gives you 0.8.x)
- From nvidia-smi: NVIDIA-SMI 550.54.14, Driver Version 550.54.14 (also tried a 545.x driver), CUDA Version 12.4
- Host OS: Ubuntu 20.04.6
- Base Image: nvidia/cuda:12.1.0-devel-ubuntu22.04 (also nvidia/cuda:12.3.0-devel-ubuntu22.04)
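For completeness, this is roughly how I confirm the framework and GPU setup from inside the container (a minimal sketch for reference only; it is not part of the repro, and the comments reflect my setup):

```python
# Quick environment check inside the container (sketch, not part of the repro).
import tensorrt_llm
import torch

print("TensorRT-LLM:", tensorrt_llm.__version__)  # e.g. 0.9.0.dev2024030500
print("Torch CUDA build:", torch.version.cuda)    # CUDA version the torch wheel targets
print("GPU:", torch.cuda.get_device_name(0))      # A100 on both DGX and P4de
```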
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Follow the README for encoder-decoder models here (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#download-weights-from-huggingface-transformers), focusing on flan-t5-small (or use large); the weight-download step is sketched below.
Go for example #3 (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#build-tensorrt-engines).
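For reference, the weight-download step from the first link boils down to something like this (a sketch; the output directory is a placeholder I chose, not a path mandated by the README):

```python
# Sketch of "download weights from HuggingFace Transformers" for flan-t5.
# "tmp/hf_models/flan-t5-small" is an arbitrary placeholder output path.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "google/flan-t5-small"  # or "google/flan-t5-large"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

model.save_pretrained("tmp/hf_models/flan-t5-small")
tokenizer.save_pretrained("tmp/hf_models/flan-t5-small")
```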
Expected behavior
The build command exits successfully, with engine artifacts exported in the target directory.
actual behavior
I have tried on a DGX system and an AWS P4de instance, with different TP arrangements, small/large flan-t5 models, and adding/removing plugin flags; regardless of the configuration, the engine build process errors out when building the decoder (the encoder engine can be seen under the trt_engine directory).
One way or another, all failure modes appear to be at layer DecoderModel/decoder_layers/0/cross_attention, with error log:
[03/12/2024-16:38:05] [TRT] [E] 4: (Unnamed Layer* 95) [Output]: IIfConditionalOutputLayer inputs must have the same shape. Shapes are [-1,576] and [-1,1152].
[03/12/2024-16:38:05] [TRT] [E] 4: [graphShapeAnalyzer.cpp::needTypeAndDimensions::2221] Error Code 4: Internal Error (DecoderModel/decoder_layers/0/cross_attention/PLUGIN_V2_GPTAttention_0: output shape can not be computed)
[03/12/2024-16:38:05] [TRT] [E] 4: [graphShapeAnalyzer.cpp::needTypeAndDimensions::2221] Error Code 4: Internal Error (DecoderModel/decoder_layers/0/cross_attention/dense/PLUGIN_V2_AllReduce_0: output shape can not be computed)
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 605, in <module>
    run_build(component='decoder')
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 596, in run_build
    build(0, args)
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 540, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 469, in build_rank_engine
    tllm_model(*inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1097, in forward
    hidden_states = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 477, in forward
    hidden_states = residual + attention_output
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 322, in __add__
    return add(self, b)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 2362, in elementwise_binary
    left, right = broadcast_helper(left, right)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 2307, in broadcast_helper
    if left.rank() == right.rank():
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 465, in rank
    return len(self.trt_tensor.shape)
ValueError: __len__() should return >= 0
additional notes
Without tensor parallelism (tp=1), following the README works out fine for small/large T5s.
I wonder if anyone has had success with flan-t5 models with tensor parallelism?