Flan-T5 models with Tensor Parallelism #1286

Description

@hademircii

System Info

I am experimenting with TRT-LLM and flan-t5 models. My goal is simple: build engines with different configurations and tensor parallelism, then review performance. I have a DGX system and an AWS P4de instance (A100s) to work on. I did a full-stack upgrade on each to see whether it would fix the problem, with no luck.

  • TensorRT-LLM version: 0.9.0.dev2024030500 (also tried the stable release; installing without --pre gives a 0.8.x)
    [from nvidia-smi]
  • NVIDIA-SMI 550.54.14
  • Driver Version: 550.54.14 (also a 545.x)
  • CUDA Version: 12.4
  • Host OS: Ubuntu 20.04.6
  • Base Image: nvidia/cuda:12.1.0-devel-ubuntu22.04 (also nvidia/cuda:12.3.0-devel-ubuntu22.04)

Who can help?

@byshiue @Ncom

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Follow the README for encoder-decoder models (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#download-weights-from-huggingface-transformers), focusing on flan-t5-small (or use large).
Then run example #3 (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#build-tensorrt-engines).

Expected behavior

The build command exits successfully, with engine artifacts exported to the target directory.

Actual behavior

I have tried a DGX system and an AWS P4de instance, with different TP arrangements, small/large flan-t5 models, and with plugin flags added and removed; regardless of the configuration, the engine build errors out while building the decoder (the encoder engine does appear under the trt_engine directory).
One way or another, all failure modes appear to be at layer DecoderModel/decoder_layers/0/cross_attention,
with the error log below (a quick sanity check on the two reported shapes follows the traceback):

[03/12/2024-16:38:05] [TRT] [E] 4: (Unnamed Layer* 95) [Output]: IIfConditionalOutputLayer inputs must have the same shape. Shapes are [-1,576] and [-1,1152].
[03/12/2024-16:38:05] [TRT] [E] 4: [graphShapeAnalyzer.cpp::needTypeAndDimensions::2221] Error Code 4: Internal Error (DecoderModel/decoder_layers/0/cross_attention/PLUGIN_V2_GPTAttention_0: output shape can not be computed)
[03/12/2024-16:38:05] [TRT] [E] 4: [graphShapeAnalyzer.cpp::needTypeAndDimensions::2221] Error Code 4: Internal Error (DecoderModel/decoder_layers/0/cross_attention/dense/PLUGIN_V2_AllReduce_0: output shape can not be computed)
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 605, in <module>
    run_build(component='decoder')
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 596, in run_build
    build(0, args)
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 540, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 469, in build_rank_engine
    tllm_model(*inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1097, in forward
    hidden_states = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 477, in forward
    hidden_states = residual + attention_output
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 322, in __add__
    return add(self, b)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 2362, in elementwise_binary
    left, right = broadcast_helper(left, right)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 2307, in broadcast_helper
    if left.rank() == right.rank():
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 465, in rank
    return len(self.trt_tensor.shape)
ValueError: __len__() should return >= 0
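
As a quick sanity check on the shapes in the IIfConditionalOutputLayer error, note that the two branches differ by exactly a factor of two. The sketch below is a hypothesis rather than a confirmed diagnosis: assuming this particular build used tp=2, the mismatch is consistent with one branch of the conditional producing a per-rank (sharded) tensor while the other keeps the full, unsharded hidden dimension.

# Minimal sketch of the shape relation in the error above.
# Assumption: this build used a tensor-parallel degree of 2; the 576/1152
# values are copied verbatim from the TRT error log.
tp_size = 2
branch_a = 576    # one IIfConditional output shape: [-1, 576]
branch_b = 1152   # the other branch:                [-1, 1152]

# The branches differ by exactly the TP factor, consistent with one branch
# being sharded per rank and the other not.
assert branch_b == branch_a * tp_size
print(f"branch_b / branch_a = {branch_b // branch_a} (== tp_size)")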

Additional notes

Without tensor parallelism (tp=1), following the README works fine for both the small and large T5 models.
Has anyone had success with flan-t5 models and tensor parallelism?
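
In case it helps triage, below is a small check of the flan-t5 attention geometry (a sketch assuming only the transformers package; the model names are the standard Hugging Face checkpoints). Two common sources of TP shape bugs are a head count that is not divisible by the TP size (flan-t5-small has 6 heads, so tp=4 cannot shard it evenly) and num_heads * d_kv differing from d_model, which holds for flan-t5-small (6 * 64 = 384 vs d_model = 512). Whether either is the actual root cause here is unclear, since the failure occurs for flan-t5-large as well.

# Sketch: inspect flan-t5 attention dimensions relevant to tensor parallelism.
# Requires: pip install transformers
from transformers import AutoConfig

for name in ("google/flan-t5-small", "google/flan-t5-large"):
    cfg = AutoConfig.from_pretrained(name)
    inner = cfg.num_heads * cfg.d_kv  # attention inner dimension
    print(f"{name}: d_model={cfg.d_model}, num_heads={cfg.num_heads}, "
          f"d_kv={cfg.d_kv}, inner={inner}")
    for tp in (2, 4):
        print(f"  tp={tp}: heads shard evenly: {cfg.num_heads % tp == 0}")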

Labels

  • Model customization: adding support for new model architectures or variants
  • Scale-out: multi-GPU and distributed inference scaling issues; tensor/pipeline/data parallelism
  • bug: something isn't working
  • triaged: issue has been triaged by maintainers
