System Info
I am experimenting with TensorRT-LLM and flan-t5 models. My goal is simple: build engines with different configurations and tensor-parallel layouts, then compare performance. I have a DGX system and an AWS P4de instance (A100s) to work on. I did a full stack upgrade on both to see whether it fixes the problem, with no luck.
- TensorRT-LLM version: 0.9.0.dev2024030500 (also tried the stable version, installed without --pre, which gives you 0.8.x)
- From nvidia-smi: NVIDIA-SMI 550.54.14, Driver Version 550.54.14 (also tried a 545.x driver), CUDA Version 12.4
- Host OS: Ubuntu 20.04.6
- Base Image: nvidia/cuda:12.1.0-devel-ubuntu22.04 (also nvidia/cuda:12.3.0-devel-ubuntu22.04)
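For completeness, this is roughly how I confirm the framework and GPU setup from inside the container (a minimal sketch for reference only; it is not part of the repro, and the comments reflect my setup):

```python
# Quick environment check inside the container (sketch, not part of the repro).
import tensorrt_llm
import torch

print("TensorRT-LLM:", tensorrt_llm.__version__)  # e.g. 0.9.0.dev2024030500
print("Torch CUDA build:", torch.version.cuda)    # CUDA version the torch wheel targets
print("GPU:", torch.cuda.get_device_name(0))      # A100 on both DGX and P4de
```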
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Follow the README for encoder-decoder models here (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#download-weights-from-huggingface-transformers), focusing on flan-t5-small (or use large); the weight-download step is sketched below.
Go for example #3 (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec#build-tensorrt-engines).
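For reference, the weight-download step from the first link boils down to something like this (a sketch; the output directory is a placeholder I chose, not a path mandated by the README):

```python
# Sketch of "download weights from HuggingFace Transformers" for flan-t5.
# "tmp/hf_models/flan-t5-small" is an arbitrary placeholder output path.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "google/flan-t5-small"  # or "google/flan-t5-large"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

model.save_pretrained("tmp/hf_models/flan-t5-small")
tokenizer.save_pretrained("tmp/hf_models/flan-t5-small")
```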
Expected behavior
The build command exits successfully, with engine artifacts exported in the target directory.
actual behavior
I have tried on a DGX system and an AWS P4de instance, with different TP arrangements, small/large flan-t5 models, and adding/removing plugin flags; regardless of the configuration, the engine build process errors out when building the decoder (the encoder engine can be seen under the trt_engine directory).
One way or another, all failure modes appear to be at layer DecoderModel/decoder_layers/0/cross_attention, with error log:
[03/12/2024-16:38:05] [TRT] [E] 4: (Unnamed Layer* 95) [Output]: IIfConditionalOutputLayer inputs must have the same shape. Shapes are [-1,576] and [-1,1152].
[03/12/2024-16:38:05] [TRT] [E] 4: [graphShapeAnalyzer.cpp::needTypeAndDimensions::2221] Error Code 4: Internal Error (DecoderModel/decoder_layers/0/cross_attention/PLUGIN_V2_GPTAttention_0: output shape can not be computed)
[03/12/2024-16:38:05] [TRT] [E] 4: [graphShapeAnalyzer.cpp::needTypeAndDimensions::2221] Error Code 4: Internal Error (DecoderModel/decoder_layers/0/cross_attention/dense/PLUGIN_V2_AllReduce_0: output shape can not be computed)
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 605, in <module>
    run_build(component='decoder')
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 596, in run_build
    build(0, args)
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 540, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/TensorRT-LLM/examples/enc_dec/build.py", line 469, in build_rank_engine
    tllm_model(*inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1097, in forward
    hidden_states = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 477, in forward
    hidden_states = residual + attention_output
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 322, in __add__
    return add(self, b)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 2362, in elementwise_binary
    left, right = broadcast_helper(left, right)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 2307, in broadcast_helper
    if left.rank() == right.rank():
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 465, in rank
    return len(self.trt_tensor.shape)
ValueError: __len__() should return >= 0
additional notes
Without tensor parallelism (tp=1), following the README works out fine for small/large T5s.
I wonder if anyone has had success with flan-t5 models with tensor parallelism?