Conversation

@lsy643 (Contributor) commented Dec 24, 2020:

When deploying an SSD model in a CPU context and offloading the backbone to TensorRT with BYOC, if TVM_TENSORRT_CACHE_DIR is set, there is a segmentation fault when loading the TensorRT engine from cache. It is caused by the device_buffers of TensorRTEngineAndContext not being initialized correctly.

This issue is fixed by adding device_buffers initialization in the BuildEngine of TensorRTRuntime.
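To make the failure mode concrete, here is a minimal, self-contained sketch of the two code paths. Every name here is a simplified stand-in for the real TVM/TensorRT classes (std::vector substitutes for GPU allocations, a std::map for the engine cache); nothing below is the actual API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stand-in for TensorRTEngineAndContext: the real struct additionally holds
// the TRT engine, execution context, and I/O names; what matters here is that
// it owns one device buffer per engine binding.
struct EngineAndContext {
  int num_bindings = 0;
  std::vector<std::vector<float>> device_buffers;  // mock GPU buffers
};

std::map<std::string, EngineAndContext> engine_cache;

// Mimics BuildEngine(): builds the engine AND allocates device buffers.
void BuildEngine(const std::string& key, int num_bindings) {
  EngineAndContext ec;
  ec.num_bindings = num_bindings;
  ec.device_buffers.resize(num_bindings, std::vector<float>(16));
  engine_cache[key] = ec;
}

// Mimics GetCachedEnginesFromDisk() BEFORE the fix: the serialized engine is
// restored, but device_buffers is left empty -- the bug this PR addresses.
void LoadFromCacheBuggy(const std::string& key, int num_bindings) {
  EngineAndContext ec;
  ec.num_bindings = num_bindings;
  engine_cache[key] = ec;  // device_buffers never allocated
}

// The fix: allocate device buffers on the cached path as well.
void LoadFromCacheFixed(const std::string& key, int num_bindings) {
  LoadFromCacheBuggy(key, num_bindings);
  auto& ec = engine_cache[key];
  ec.device_buffers.resize(ec.num_bindings, std::vector<float>(16));
}

// Mimics the data-copy step in Run(), which indexes device_buffers per
// binding; with the buggy cache path this access is where the real code
// segfaults.
bool CanRun(const std::string& key) {
  const auto& ec = engine_cache.at(key);
  return static_cast<int>(ec.device_buffers.size()) == ec.num_bindings;
}
```

The sketch only illustrates why the cached-load path must perform the same per-binding allocation as the build path.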

Please review @masahi @comaniac

@comaniac (Contributor): Also cc @trevor-m, who implemented the TRT integration.

@trevor-m (Contributor) left a comment:

Hi @lsy643, thank you for the PR! I hadn't considered that situation of caching + device buffers before.

I think in general we need some slight refactoring here. We should be able to allocate the buffers in GetCachedEnginesFromDisk() and we shouldn't need to modify the BuildEngine() code.

void Convert(TensorRTOpConverterParams* params) const {
  auto input = params->inputs.at(0).tensor;
  ICHECK_EQ(std::stoi(params->node.GetAttr<std::vector<std::string>>("reverse")[0]), false);
  // ICHECK_EQ(std::stoi(params->node.GetAttr<std::vector<std::string>>("reverse")[0]), false);
@trevor-m (Contributor): Please rebase so you don't need to comment out this line.

@lsy643 (Contributor, Author): OK, I have rebased the repo.

LoadGlobalAttributes();
if (GetCachedEnginesFromDisk()) return;
SetupConstants(consts);
if (GetCachedEnginesFromDisk()) return;
@trevor-m (Contributor): Since GetCachedEnginesFromDisk is now at the end of the function, we don't need the if and return.

@lsy643 (Contributor, Author): The if and return have been removed.

  return {engine, context, network_input_names_, network_output_names_, device_buffers};
}

void TensorRTBuilder::CreateDeviceBuffers(TensorRTEngineAndContext* engine_and_context) {
@trevor-m (Contributor): The code in this function is a duplicate of the code in BuildEngine(). Can you call this new function from BuildEngine to avoid the duplication?

@lsy643 (Contributor, Author): CreateDeviceBuffers is now called from BuildEngine.

TensorRTEngineAndContext& engine_and_context =
    trt_engine_cache_.at(std::make_pair(symbol_name_, batch_size_));
size_t binding_num = engine_and_context.engine->getNbBindings();
if (engine_and_context.device_buffers.size() == binding_num) {
@trevor-m (Contributor): This could be !engine_and_context.device_buffers.empty() instead; that may communicate the purpose of this check better.

@lsy643 (Contributor, Author): Changed to use empty() instead.

trt_engine_cache_.at(std::make_pair(symbol_name_, batch_size_));
if (engine_and_context.device_buffers.size() == 0) {
  builder.CreateDeviceBuffers(&engine_and_context);
  return;
@trevor-m (Contributor): We are building the TRT network in the TensorRTBuilder but exiting before BuildEngine is called. This means the resources used by the builder won't ever be freed (TensorRTBuilder::CleanUp() needs to be called).

@trevor-m (Contributor): We also shouldn't have to rebuild the whole network just to allocate the buffers.

@lsy643 (Contributor, Author) commented Feb 1, 2021:

If CleanUp is called there is a segmentation fault, so I don't call CleanUp.

In the BuildEngine function in tensorrt_runtime.cc, the whole network is not actually rebuilt, since the function returns before builder.BuildEngine gets called.

@trevor-m (Contributor): I think the best solution is to move CreateDeviceBuffers out of TensorRTBuilder and into the runtime module. That way we can call it without the unnecessary allocations and creations done by TensorRTBuilder.
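A rough sketch of that suggestion, again with mock stand-ins rather than the real TVM/TensorRT API (MockEngine, Runtime, and all names below are hypothetical): allocation lives in the runtime and depends only on per-binding sizes, so the cached-engine path never constructs a TensorRTBuilder.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mock engine exposing only what buffer allocation needs: a size per binding.
// In the real code these would come from the deserialized TRT engine; this is
// a simplified stand-in.
struct MockEngine {
  std::vector<size_t> binding_sizes;
};

struct EngineAndContext {
  MockEngine engine;
  std::vector<std::vector<float>> device_buffers;  // mock GPU buffers
};

class Runtime {
 public:
  // Buffer allocation lives in the runtime module, not in the builder, so the
  // cached-engine path avoids the builder's network construction entirely.
  static void AllocateDeviceBuffers(EngineAndContext* ec) {
    ec->device_buffers.clear();
    for (size_t n : ec->engine.binding_sizes) {
      ec->device_buffers.emplace_back(n);  // real code would allocate on GPU
    }
  }

  // Lazy check on the hot path, phrased with empty() as suggested in review;
  // a second call is a no-op.
  static void EnsureBuffers(EngineAndContext* ec) {
    if (ec->device_buffers.empty()) AllocateDeviceBuffers(ec);
  }
};
```

This keeps one allocation routine shared by the build path and the cache-load path, which is the de-duplication the review asks for.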

@comaniac (Contributor): @lsy643 please address the comments and rebase to resolve the conflicts.

@comaniac (Contributor) commented Feb 5, 2021

@tqchen (Member) commented Mar 10, 2021: @trevor-m @comaniac @zhiics please follow up with @lsy643 :)

@comaniac (Contributor): @lsy643 please address the last comment from @trevor-m and let's merge this PR.

@comaniac (Contributor): Gentle ping @lsy643

@jroesch added the needs-triage label (PRs or issues that need to be investigated by maintainers to find the right assignees) on Aug 27, 2021.
@jroesch (Member) commented Jan 19, 2022:

This PR appears to be out of date; please feel free to reopen it if this is not the case.

As part of the new year we are attempting to triage the project's open pull requests to ensure that code which is ready for review and/or merging receives adequate attention.

Thanks again for your contribution, and feel free to reach out to discuss these changes.

@jroesch closed this on Jan 19, 2022.
Labels: needs-triage, status: need update, status: review in progress
