Fix Segmentation Fault For Tensorrt BYOC when TVM_TENSORRT_CACHE_DIR is Set #7162
Conversation
Also cc @trevor-m who implemented TRT integration.
trevor-m left a comment:
Hi @lsy643 thank you for the PR! I hadn't considered that situation of caching + device buffers before.
I think in general we need some slight refactoring here. We should be able to allocate the buffers in GetCachedEnginesFromDisk() and we shouldn't need to modify the BuildEngine() code.
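trevor-m's suggestion, allocating the buffers directly when a cached engine is loaded, could look roughly like the sketch below. Note that `EngineAndContext`, `AllocateDeviceBuffers`, the cache map, and the fixed 64-byte allocations are hypothetical simplified stand-ins; the real `GetCachedEnginesFromDisk()` deserializes an `ICudaEngine` and would size each buffer from the engine's binding dimensions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for TVM's TensorRTEngineAndContext, which really
// holds an ICudaEngine*, IExecutionContext*, and per-binding device buffers.
struct EngineAndContext {
  std::vector<std::string> input_names;
  std::vector<std::string> output_names;
  std::vector<void*> device_buffers;  // one slot per engine binding
};

// Allocate one buffer per binding. In the real runtime this would be a
// cudaMalloc/NDArray allocation sized from the binding dimensions; here a
// fixed-size heap allocation stands in (and is intentionally never freed,
// as this is only a sketch).
void AllocateDeviceBuffers(EngineAndContext* ec, size_t binding_num) {
  ec->device_buffers.resize(binding_num, nullptr);
  for (size_t i = 0; i < binding_num; ++i) {
    ec->device_buffers[i] = new uint8_t[64];
  }
}

// Sketch of the suggested refactor: after deserializing a cached engine,
// allocate its device buffers immediately, so BuildEngine() needs no changes.
bool GetCachedEnginesFromDisk(std::map<std::string, EngineAndContext>* cache) {
  // Pretend we deserialized one engine with 1 input and 2 output bindings.
  EngineAndContext ec;
  ec.input_names = {"data"};
  ec.output_names = {"out0", "out1"};
  AllocateDeviceBuffers(&ec, ec.input_names.size() + ec.output_names.size());
  (*cache)["ssd_backbone"] = ec;
  return true;
}
```

With this shape, a cache hit returns engines whose `device_buffers` are already populated, so the later execution path never dereferences an empty buffer vector.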
```cpp
void Convert(TensorRTOpConverterParams* params) const {
  auto input = params->inputs.at(0).tensor;
  ICHECK_EQ(std::stoi(params->node.GetAttr<std::vector<std::string>>("reverse")[0]), false);
  //ICHECK_EQ(std::stoi(params->node.GetAttr<std::vector<std::string>>("reverse")[0]), false);
```
trevor-m: Please rebase so you don't need to comment out this line.
lsy643: OK, I have rebased the branch.
```cpp
LoadGlobalAttributes();
if (GetCachedEnginesFromDisk()) return;
SetupConstants(consts);
if (GetCachedEnginesFromDisk()) return;
```
trevor-m: Since GetCachedEnginesFromDisk is now at the end of the function, we don't need the if and the return.
lsy643: The if and return have been removed.
```cpp
  return {engine, context, network_input_names_, network_output_names_, device_buffers};
}

void TensorRTBuilder::CreateDeviceBuffers(TensorRTEngineAndContext* engine_and_context) {
```
trevor-m: The code in this function is a duplicate of the code in BuildEngine(). Can you call this new function from BuildEngine to avoid the duplication?
lsy643: CreateDeviceBuffers is now called from BuildEngine.
```cpp
TensorRTEngineAndContext& engine_and_context =
    trt_engine_cache_.at(std::make_pair(symbol_name_, batch_size_));
size_t binding_num = engine_and_context.engine->getNbBindings();
if (engine_and_context.device_buffers.size() == binding_num) {
```
trevor-m: This could be !engine_and_context.device_buffers.empty() instead; it communicates the purpose of this check better.
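The point of the suggestion can be shown with a small self-contained sketch (here `EngineAndContext` is a hypothetical stand-in, not the real TVM struct): because buffer allocation is all-or-nothing, the vector is either empty or holds exactly one entry per binding, so `!empty()` and `size() == binding_num` agree, but `!empty()` states the question being asked, "are the buffers already allocated?", directly.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in: device_buffers is either fully allocated
// (one entry per engine binding) or untouched.
struct EngineAndContext {
  std::vector<void*> device_buffers;
};

// Returns true when a cached engine still needs its buffers allocated.
// Checking empty() expresses the intent without needing the binding count.
bool NeedsDeviceBuffers(const EngineAndContext& ec) {
  return ec.device_buffers.empty();
}
```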
lsy643: Changed to use empty() instead.
```cpp
    trt_engine_cache_.at(std::make_pair(symbol_name_, batch_size_));
if (engine_and_context.device_buffers.size() == 0) {
  builder.CreateDeviceBuffers(&engine_and_context);
  return;
```
trevor-m: We are building the TRT network in the TensorRTBuilder, but exiting before BuildEngine is called. This means the resources used by the builder will never be freed; TensorRTBuilder::CleanUp() needs to be called.
trevor-m: We also shouldn't have to rebuild the whole network just to allocate the buffers.
lsy643: If CleanUp is added, there is a segmentation fault, so I don't call CleanUp.
In the BuildEngine function in tensorrt_runtime.cc, the whole network is not actually rebuilt, since the function returns before builder.BuildEngine is called.
trevor-m: I think the best solution is to move CreateDeviceBuffers out of TensorRTBuilder and into the runtime module. That way we can call it without the unnecessary allocations and creations done by TensorRTBuilder.
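A rough sketch of that proposal, with `FakeEngine` and the fixed 64-byte allocations as hypothetical stand-ins for the real TensorRT engine and CUDA device buffers: a runtime-side helper only needs the deserialized engine's binding count, so no TensorRTBuilder is constructed, no network is rebuilt, and there is no CleanUp() bookkeeping to get wrong.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for nvinfer1::ICudaEngine; the real code calls
// engine->getNbBindings() and queries each binding's dimensions.
struct FakeEngine {
  int num_bindings = 0;
  int getNbBindings() const { return num_bindings; }
};

// Hypothetical stand-in for TVM's TensorRTEngineAndContext.
struct EngineAndContext {
  FakeEngine engine;
  std::vector<void*> device_buffers;
};

// Runtime-side buffer allocation: idempotent, driven only by the engine's
// binding count, independent of TensorRTBuilder.
void CreateDeviceBuffers(EngineAndContext* ec) {
  if (!ec->device_buffers.empty()) return;  // already allocated
  int n = ec->engine.getNbBindings();
  ec->device_buffers.resize(n, nullptr);
  for (int i = 0; i < n; ++i) {
    // Real code: cudaMalloc/NDArray sized from the binding dimensions.
    ec->device_buffers[i] = new uint8_t[64];
  }
}
```

Because the helper returns early when buffers already exist, it is safe to call both from the build path and from the cache-load path.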
@lsy643 please address the comments and rebase to resolve the conflicts.
@trevor-m please review and approve explicitly (https://tvm.apache.org/docs/contribute/code_review.html#approve-and-request-changes-explicitly).
Gentle ping @lsy643
This PR appears to be out of date; please feel free to reopen it if this is not the case. As part of the new year we are attempting to triage the project's open pull requests. Thanks again for your contribution, and feel free to reach out to discuss these changes.
When deploying an SSD model in a CPU context and offloading the backbone to TensorRT with BYOC, there is a segmentation fault if TVM_TENSORRT_CACHE_DIR is set and the TensorRT engine is loaded from the cache. The crash is caused by the device_buffers of TensorRTEngineAndContext not being initialized correctly. This issue is fixed by adding device_buffers initialization in the BuildEngine of TensorRTRuntime.

Please review @masahi @comaniac