Context
For models that require fallback to Torch because of limited converter coverage, custom operators, or other needs, each TRTEngine object is compiled, initialized, inserted into the Torch nn.Module, and made runtime-ready at compile time. This consumes an unnecessary amount of GPU memory during compilation.
Proposal
Use the GPU as a build space for TRTEngine objects, but do not deserialize or initialize the engines until the first forward pass, similar to what is done here:
TensorRT/py/torch_tensorrt/fx/trt_module.py, lines 25 to 27 (at ad74a73):

```python
def _initialize(self):
    self.initialized = True
    self.context = self.engine.create_execution_context()
```
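As a rough illustration of the compile-time half, the engine can be built on the GPU but retained only as serialized bytes on the host. This sketch uses the standard TensorRT Python builder APIs (build_serialized_network is available in TensorRT 8+); the function name is hypothetical:

```python
import tensorrt as trt

def build_engine_to_host(builder: trt.Builder,
                         network: trt.INetworkDefinition,
                         config: trt.IBuilderConfig) -> bytes:
    # The GPU serves as build space here, but no ICudaEngine or execution
    # context is created: the result is host-resident serialized bytes that
    # can be deserialized lazily at the first forward pass.
    plan = builder.build_serialized_network(network, config)
    return bytes(plan)
```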
API Details
The TRTModule objects will take a parameter, construct_live=True, which can be set to False to defer engine initialization until the first forward pass, thereby avoiding unnecessary GPU memory usage during compilation. With construct_live=False, the engine is still built at compile time, but the serialized object is moved to host memory until runtime, at which point it is deserialized and initialized. check_initialized() is called on every forward pass, but has a measurable cost only on the first pass of inference, at which point the engines are moved from host to device memory for use.
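To make the proposed behavior concrete, here is a minimal sketch of how construct_live and check_initialized might fit together. The class shape and everything beyond the names given in this proposal are assumptions for illustration, not the actual torch_tensorrt implementation:

```python
import tensorrt as trt
import torch

class TRTModule(torch.nn.Module):
    def __init__(self, serialized_engine: bytes, construct_live: bool = True):
        super().__init__()
        self.serialized_engine = serialized_engine  # held in host (CPU) memory
        self.initialized = False
        self.engine = None
        self.context = None
        if construct_live:
            # Current behavior: the engine occupies GPU memory from compile time.
            self._initialize()

    def _initialize(self):
        runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
        # Deserialization moves the engine from host to device memory.
        self.engine = runtime.deserialize_cuda_engine(self.serialized_engine)
        self.context = self.engine.create_execution_context()
        self.initialized = True

    def check_initialized(self):
        # Runs on every forward pass; only the first call (when
        # construct_live=False) does measurable work.
        if not self.initialized:
            self._initialize()

    def forward(self, *inputs):
        self.check_initialized()
        # ... enqueue inference through self.context ...
```

Under this scheme, a module created with construct_live=False holds only the serialized bytes until its first call, so GPU memory during compilation is limited to the build space needed for one engine at a time.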