@mbs-octoml
Contributor

(See https://discuss.tvm.apache.org/t/byoc-supporting-cutlass-byoc-with-collage/12796/6 for context; this work is in turn part of Collage (https://github.com/apache/tvm-rfcs/blob/main/rfcs/0062-collage.md).)

Currently CUTLASS has four entry points:

  • The usual 'partition_for_cutlass' partitioning function, using the standard pattern table and pass machinery (see cutlass/build.py).
  • A 'tune_cutlass_kernels' function which augments CUTLASS partition functions with the results of building and running test kernels (see cutlass/build.py).
  • A 'relay.ext.cutlass' external codegen function which inspects the tuning results and generates a CSourceModule for each partition (see cutlass/codegen.cc).
  • A 'build_cutlass_kernels_vm' function which runs 'export_library' with all the nvcc compiler options needed to build all the CSourceModules (see cutlass/build.py).

For Collage we'd like CUTLASS to have only two entry points: 'partition_for_cutlass', and 'relay.ext.cutlass' or equivalent. This makes the CUTLASS external codegen integration composable with other integrations, which in turn helps Collage avoid having to understand any external codegen APIs other than the global pattern table and the custom compilation function/pass.
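To make that contract concrete, here's a toy sketch in plain Python (entirely hypothetical names, not TVM's actual API) of the only two things Collage then needs from each backend: a partitioning entry point and a compilation entry point, discoverable from one registry:

```python
# Toy sketch of the two-entry-point contract (hypothetical names, not
# TVM's real API): Collage only needs each backend's partitioner and
# its compilation hook, looked up from a single registry.

BACKENDS = {}

def register_backend(name, partition_fn, compile_fn):
    """Register a backend by its two entry points."""
    BACKENDS[name] = (partition_fn, compile_fn)

def partition_for_toy_cutlass(ops):
    # Stand-in for partition_for_cutlass: pick out supported ops.
    supported = {"dense", "conv2d"}
    return [op for op in ops if op in supported]

def compile_toy_cutlass(partitions):
    # Stand-in for the 'relay.ext.cutlass' / custom pass step: one
    # compilation call sees every partition at once.
    return {p: f"kernel_for_{p}" for p in partitions}

register_backend("cutlass", partition_for_toy_cutlass, compile_toy_cutlass)

# Collage can now drive any registered backend uniformly:
partition, compile_all = BACKENDS["cutlass"]
parts = partition(["dense", "softmax", "conv2d"])
print(compile_all(parts))  # {'dense': 'kernel_for_dense', 'conv2d': 'kernel_for_conv2d'}
```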

Collage also tends to end up requiring multiple partitions for the same backend since it is more aggressive at mixing-and-matching smaller sub-graphs between backends. Thus we'd also like to make sure all tuning, generated code and compilation overhead is shared between all such CUTLASS partitions.

So, in this PR:

  • We add all the CUTLASS-specific tuning and compilation options as new Target attributes for the 'external codegen' "cutlass" TargetKind (cutlass/target.cc). The user now has one place to provide those settings, and we've already done the legwork to plumb the target instance.
  • We replace 'relay.ext.cutlass' with a 'RelayToTIR' custom pass hook 'CompileForCutlass' (see cutlass/codegen.cc). Since this pass sees all the CUTLASS partitions in the IRModule at once, we can share tuning results between them and be sure to generate a single CSourceModule. The pass can also invoke the compiler to yield a StaticModule, which we've also already done the legwork to support. In this way all CUTLASS-specific steps are handled at once.
  • For convenience we supply 'finalize_modules' and 'finalize_modules_vm' which invoke nvcc for final linking (using export_library as usual). However, there's now nothing CUTLASS specific in those helpers other than their overriding of the 'compiler' to be nvcc.
  • test_cutlass.py is updated to use the new API.

Though this is a breaking change for existing users of the CUTLASS integration, the change is pretty minor, as shown in test_cutlass.py.
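For existing users the migration looks roughly like this (a pseudocode sketch based on the description above; the names come from this PR but exact signatures may differ, so see test_cutlass.py for the real usage):

```
# Before: four entry points, ending in a CUTLASS-specific build step
mod = partition_for_cutlass(mod)
mod = tune_cutlass_kernels(mod, sm=80)        # tuning options as arguments
lib = relay.build(mod, target="cuda")
lib = build_cutlass_kernels_vm(lib, ...)      # CUTLASS-specific linking

# After: tuning/compilation options live on the "cutlass" Target, and
# the final nvcc link is a generic helper
mod = partition_for_cutlass(mod)
cutlass = tvm.target.Target("cutlass")        # sm, tmp_dir, etc. as attributes
lib = relay.build(mod, target=["cuda", cutlass])
lib = finalize_modules(lib)
```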

@mbs-octoml
Contributor Author

CAUTION: Includes #11619, I'll rebase when possible.

@mbs-octoml mbs-octoml force-pushed the mbs-collage-cutlass branch 3 times, most recently from ed6d86b to 5707fca Compare June 10, 2022 14:09
@mbs-octoml
Contributor Author

Rebased and ready for review.

@tmoreau89
Copy link
Contributor

@apeskov that PR should be of interest to you!

@masahi
Member

masahi commented Jun 10, 2022

so we can now share tuning results between them all and can be sure to generate a single CSourceModule

If I understand this correctly, this is a huge improvement! Previously, when compiling a model like bert-large, which has the same workload repeating many times, tuning results were cached but the generated code was not shared. This resulted in very slow compiles and a huge binary blob (> 1G).
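A toy illustration of why this matters (hypothetical names, nothing CUTLASS-specific): if generated code is cached by a structural key of the workload, N identical partitions cost one generated kernel instead of N:

```python
import hashlib

# Toy model (not the real CUTLASS codegen) of module-level code sharing:
# identical workloads across partitions map to a single generated kernel.

def workload_key(op, shape, dtype):
    """Structural hash standing in for a kernel signature."""
    return hashlib.sha256(f"{op}:{shape}:{dtype}".encode()).hexdigest()[:8]

class CodegenCache:
    def __init__(self):
        self.kernels = {}        # key -> generated source
        self.compilations = 0    # how many kernels we actually generated

    def get_or_generate(self, op, shape, dtype):
        key = workload_key(op, shape, dtype)
        if key not in self.kernels:
            self.compilations += 1
            self.kernels[key] = f"__global__ void {op}_{key}(/* ... */) {{}}"
        return self.kernels[key]

cache = CodegenCache()
# bert-large style: the same dense workload repeated across many partitions
for _ in range(24):
    cache.get_or_generate("dense", (1024, 1024), "float16")
print(cache.compilations)  # 1 compilation instead of 24
```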

I also like the more polished API.

cc @Laurawly @comaniac

Contributor

@comaniac left a comment

Thanks for the change that makes CUTLASS more practical in TVM :)

@mbs-octoml
Contributor Author

mbs-octoml commented Jun 10, 2022

Thanks Masa.

which has the same workload repeating many times

Right, at first I thought I'd done something horribly wrong (for me it was gpt2) but it turns out external codegen has never had structural sharing. So I'm going to play this same game with some of the other codegens, including trt.

@mbs-octoml mbs-octoml force-pushed the mbs-collage-cutlass branch from 5707fca to 9c16e05 Compare June 10, 2022 21:54
@masahi
Member

masahi commented Jun 10, 2022

(Can be done in a follow-up) @mbs-octoml Can you remove this line that you just added back? I don't know why I added it before. Maybe it was there to be consistent with lib = tvm.runtime.load_module(lib_path); I remember that one is needed. But if you had things working without reloading from vmcode_path, we should remove it.

code = bytearray(open(vmcode_path, "rb").read())

@mbs-octoml
Contributor Author

Removed the unnecessary load.

@masahi masahi merged commit dfc8e95 into apache:main Jun 11, 2022
@mbs-octoml mbs-octoml deleted the mbs-collage-cutlass branch June 11, 2022 14:34
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Jun 17, 2022
This does for the TensorRT integration what apache#11631 did for the CUTLASS integration.

- All compilation options are captured within the attributes of a Target of
  kind "tensorrt" (instead of the "relay.ext.tensorrt.options" attribute in
  PassContext). This means all BYOC configuration options needed by Collage can
  be captured uniformly by a list-of-Targets. It also means RPC boundaries (as used
  internally at OctoML) only need to worry about maintaining the fidelity of the
  Target instance(s) rather than reaching into the PassContext.

- Compilation is switched to being IRModule-at-a-time (using the RelayToTIR
  target-specific hook mechanism) instead of function-at-a-time (relying on the
  TECompiler). Though not strictly necessary I wanted to check the path is now
  clear to deprecate the latter mechanism in favor of the former, with an eye
  to removing a big source of complexity in the LowerTE pass.

- And, as it happens the switch did reveal the fragility of how we extract constant
  bindings during construction of the JSON for each offloaded function. Currently
  the JSON visitor assigns each Constant a unique name to represent the constant,
  but the underlying NDArray is ignored. Then, via a callback from within the
  TECompiler the function is visited again (hopefully in the same order) to actually
  capture the NDArrays, replicating the name generation. Replaced that with simply
  binding the constant names to their NDArrays directly in the serializer, and
  attaching that to the IRModule as a new "const_name_to_ndarray" attribute. In
  this way any pass is free to hoist constants out of functions, and the final
  metadata module construction will be sure to capture them for loading at runtime.
  (Obviously IRModule deserves a first class 'global constant' bindings map.)
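The serializer change described above can be sketched generically (hypothetical names; plain Python objects stand in for the IRModule attribute and NDArrays): bind each constant's name to its data at the moment it is first visited, so no second, order-sensitive traversal is needed:

```python
# Sketch (hypothetical names) of binding constant names to their data
# directly in the serializer, instead of re-visiting the function later
# and hoping the traversal order regenerates the same names.

class ToySerializer:
    def __init__(self):
        # Stands in for the new "const_name_to_ndarray" IRModule attribute.
        self.const_name_to_ndarray = {}
        self._counter = 0

    def visit_constant(self, data):
        name = f"const_{self._counter}"
        self._counter += 1
        # Record name -> data immediately: one pass, no fragile replay.
        self.const_name_to_ndarray[name] = data
        return name

s = ToySerializer()
names = [s.visit_constant(d) for d in ([1.0, 2.0], [3.0, 4.0])]
print(names)                    # ['const_0', 'const_1']
print(s.const_name_to_ndarray)  # {'const_0': [1.0, 2.0], 'const_1': [3.0, 4.0]}
```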
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Jun 17, 2022
I tried to do to the TensorRT integration what apache#11631 did to the CUTLASS integration, viz:
 - Make sure all compilation options are passed in Target instances. This helps Collage.
 - Use a custom pass invoked via RelayToTIRTargetHooks instead of the relay.ext.$toolchain mechanism.
   This helps us decouple external codegen from lowering.

This PR collects the prep for that change:
 - TensorRT uses the JSONSerializer visitor to encode each partition function. Previously, when the
   visitor encountered a Constant it simply generated and recorded a name for the constant. Then,
   completely separately, and via a callback in TECompiler, the function is visited again in the
   same order and with the same name generation convention by a ConstantUpdater to actually collect the
   bindings, which are then encoded into a ConstLoaderModule to be made available at runtime.

   However if all TensorRT compilation is to be done by a stand-alone pass there's no TECompiler callback
   hackery available. So I've added a "const_name_to_ndarray" attribute to the IRModule of type
   Map<String, runtime::NDArray> so that named constants can be accumulated throughout compilation by
   any pass which needs to do so. Then the Graph, AOT and VM executors are all updated to merge those
   constants into the final runtime artifact.

   (Compare with "Constants", the equivalent attribute for extracting TIR AllocateConsts.)

 - The TensorRT tests use the create_executor interface but it wasn't quite ready for the
   new more general form of passing list-of-targets.

 - I want TensorRT compilation to work out of the box without the need for any special targets if
   all the default options should apply. Go back and make the CUTLASS integration I did follow the
   same convention.

 - TensorRT actually needs to 'undo' partitionings in some situations. Add an InlineCompilerFunctions
   pass to make that robust. In particular, it must undo both the 'partitioning' (ie separating out
   the "Compiler" function) and any 'compositing' (ie separating out small sub-graphs as
   "Composite" functions).
jinhongyii added a commit to jinhongyii/tvm that referenced this pull request Jun 20, 2022
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Jun 27, 2022
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Jun 28, 2022
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Jun 29, 2022
I tried to do to the TensorRT integration what apache#11631 did to the CUTLASS integration, viz:
 - Make sure all compilation options are passed in Target instances. This helps Collage.
 - Use a custom pass invoked via RelayToTIRTargetHooks instead of the relay.ext.$toolchain mechanism.
   This helps us decouple external codegen from lowering.

This PR collects the prep for that change:
 - TensorRT uses the JSONSerializer visitor to encode each partition function. Previously, when the
   visitor encountered a Constant it simply generated and recorded a name for the constant. Then,
   completely separately, and via a callback in TECompiler, the function is visited again in the
   same order and with the same name generation convention by a ConstantUpdater to actually collect the
   bindings, which are then encoded into a ConstLoaderModule to be made available at runtime.

   However if all TensorRT compilation is to be done by a stand-alone pass there's no TECompiler callback
   hackery available. So I've added a "const_name_to_ndarray" attribute to the IRModule of type
   Map<String, runtime::NDArray> so that named constants can be accumulated throughout compilation by
   any pass which needs to do so. Then the Graph, AOT and VM executors are all updated to merge those
   constants into the final runtime artifact.

   (Compare with "Constants", the equivalent attribute for extracting TIR AllocateConsts.)

 - The TensorRT tests use the create_executor interface but it wasn't quite ready for the
   new more general form of passing list-of-targets.

 - I want TensorRT compilation to work out of the box without the need for any special targets if
   all the default options should apply. Go back and make the CUTLASS integration I did follow the
   same convention.

 - To test this I also switched the 'demo' "ccompiler" external codegen target to IRModule-at-a-time
   style. This means we can test most of external codegen machinery in one place without depending on
   any target which may not be enabled in CI (eg TensorRT):
     - Target instances are plumbed correctly so compile-time options are available.
     - External modules are conveyed to the final export library.
     - Constant bindings are conveyed to the metadata module.
areusch pushed a commit that referenced this pull request Jun 30, 2022
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Jun 30, 2022
…elayToTIR hook

This does for the TensorRT integration what apache#11631 did for the CUTLASS integration.

- All compilation options are captured within the attributes of a Target of
  kind "tensorrt" (instead of the "relay.ext.tensorrt.options" attribute in
  PassContext). This means all BYOC configuration options needed by Collage can
  be captured uniformly by a list-of-Targets. It also means RPC boundaries (as used
  internally at OctoML) only need to worry about maintaining the fidelity of the
  Target instance(s) rather than reaching into the PassContext.

- Compilation is switched from function-at-a-time (relying on the TECompiler) to
  IRModule-at-a-time (using the RelayToTIR target-specific hook mechanism). Though
  not strictly necessary for Collage I want to check the path is now clear to
  deprecate the support for BYOC in TECompiler.

- Get all the TensorRT tests going again, except for a few I've disabled with
  x-link to a new issue apache#11765. CAUTION: The TensorRT runtime is not supported in
  CI so many of these tests are cosmetic.

- While trying to track down a 'free(): invalid pointer' error in test_tensorrt_int8_exp.py
  made the TensorRT allocs/frees more robust, but it turns out it's also broken in main.
  No harm leaving these changes in though.
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Jul 1, 2022
masahi pushed a commit that referenced this pull request Jul 1, 2022
…elayToTIR hook (#11979)

* [BYOC] Switch TensorRT BYOC integration to IRModule-at-a-time using RelayToTIR hook

* - Lints

* - Woops, fix test

* - lints

* - Use default tensorrt target if none given in targets list

* - fix free error

* - accidentally introduced 'transforms' namespace
- can't use default Target("tensorrt") arg

* - D'oh! Include ended up #if protected

* - restore mark for test_dynamic_offload
- handle missing runtime in versioning
- turn test_maskrcnn_resnet50 back on now that we have the
  import-torch-first workaround.

* - wibble
blackkker pushed a commit to blackkker/tvm that referenced this pull request Jul 7, 2022
…e#11770)

blackkker pushed a commit to blackkker/tvm that referenced this pull request Jul 7, 2022
…elayToTIR hook (apache#11979)

masahi pushed a commit to masahi/tvm that referenced this pull request Jul 15, 2022
…e#11770)

I tried to do to the TensorRT integration what apache#11631 did to the CUTLASS integration, viz:
 - Make sure all compilation options are passed in Target instances. This helps Collage.
 - Use a custom pass invoked via RelayToTIRTargetHooks instead of the relay.ext.$toolchain mechanism.
   This helps use decouple external codegen from lowering.

This PR collects the prep for that change:
 - TensorRT uses the JSONSerializer visitor to encode each partition function. Previously, when the
   visitor encountered a Constant it simply generated and recorded a name for the constant. Then,
   completely separately, and via a callback in TECompiler, the function is visited again in the
   same order and with the same name generation convention by a ConstantUpdater to actually collect the
   bindings, which are then encoded into a ConstLoaderModule to be made available at runtime.

   However if all TensorRT compilation is to be done by a stand-alone pass there's no TECompiler callback
   hackery available. So I've added a "const_name_to_ndarray" attribute to the IRModule of type
   Map<String, runtime::NDArray> so that named constants can be accumulated throughout compilation by
   any pass which needs to do so. Then the Graph, AOT and VM executors are all updated to merge those
   constants into the final runtime artifact

   (Compare with "Constants", the equivalent attribute for extracting TIR AllocateConsts.)

 - The TensorRT tests use the create_executor interface but it wasn't quite ready for the
   new more general form of passing list-of-targets.

 - I want TensorRT compilation to work out of the box without the need for any special targets if
   all the default options should apply. Go back and make the CUTLASS integration I did follow the
   same convention.

 - To test this I also switched the 'demo' "ccompiler" external codegen target to IRModule-at-a-time
   style. This means we can test most of the external codegen machinery in one place without depending
   on any target which may not be enabled in CI (e.g. TensorRT):
     - Target instances are plumbed correctly so compile-time options are available.
     - External modules are conveyed to the final export library.
     - Constant bindings are conveyed to the metadata module.
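The fragility motivating "const_name_to_ndarray" above — two independent traversals that must regenerate identical constant names in the same order — can be illustrated with a minimal pure-Python sketch (all names here are hypothetical stand-ins, not TVM's actual classes):

```python
# Minimal sketch of the two-pass naming hazard described above.
# Pass 1 (serializer): records a generated name per constant it meets.
# Pass 2 (constant collector): must regenerate identical names in the
# same visitation order, or the runtime lookup silently breaks.

def make_namer(prefix):
    counter = {"n": 0}
    def next_name():
        name = f"{prefix}_const_{counter['n']}"
        counter["n"] += 1
        return name
    return next_name

constants = [1.0, 2.0, 3.0]                      # stand-ins for relay Constants
namer_a = make_namer("tensorrt_0")
serialized = {namer_a(): c for c in constants}   # pass 1: encode names into JSON

namer_b = make_namer("tensorrt_0")
bindings = {namer_b(): c for c in constants}     # pass 2: must match exactly

assert serialized.keys() == bindings.keys()
```

Accumulating the bindings into a single module-level attribute during the one and only traversal removes the need for the two passes to stay in lock-step.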
masahi pushed a commit to masahi/tvm that referenced this pull request Jul 15, 2022
…elayToTIR hook (apache#11979)

* [BYOC] Switch TensorRT BYOC integration to IRModule-at-a-time using RelayToTIR hook

This does for the TensorRT integration what apache#11631 did for the CUTLASS integration.

- All compilation options are captured within the attributes of a Target of
  kind "tensorrt" (instead of the "relay.ext.tensorrt.options" attribute in
  PassContext). This means all BYOC configuration options needed by Collage can
  be captured uniformly by a list-of-Targets. It also means RPC boundaries (as used
  internally at OctoML) only need to worry about maintaining the fidelity of the
  Target instance(s) rather than reaching into the PassContext.
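  The uniformity argument — options travel inside the target list rather than in ambient
  PassContext state an RPC layer must replicate — can be sketched generically (pure Python,
  hypothetical names and option keys, not TVM's actual Target API):

  ```python
  from dataclasses import dataclass, field

  @dataclass(frozen=True)
  class Target:
      """Toy stand-in: a target kind plus its backend-specific options."""
      kind: str
      attrs: dict = field(default_factory=dict)

  # Each backend's options ride along in the list-of-targets itself,
  # so serializing the list preserves every compilation option.
  targets = [
      Target("cuda"),
      Target("tensorrt", {"use_implicit_batch": False, "use_fp16": True}),
  ]
  trt = next(t for t in targets if t.kind == "tensorrt")
  assert trt.attrs["use_fp16"] is True
  ```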

- Compilation is switched from function-at-a-time (relying on the TECompiler) to
  IRModule-at-a-time (using the RelayToTIR target-specific hook mechanism). Though
  not strictly necessary for Collage, I want to check that the path is now clear to
  deprecate the support for BYOC in TECompiler.

- Get all the TensorRT tests going again, except for a few I've disabled with an
  x-link to a new issue apache#11765. CAUTION: The TensorRT runtime is not supported in
  CI, so many of these tests are cosmetic.

- While trying to track down a 'free(): invalid pointer' error in test_tensorrt_int8_exp.py,
  I made the TensorRT allocs/frees more robust, but it turns out it's also broken in main.
  No harm leaving these changes in, though.
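The shift from function-at-a-time callbacks to a module-level hook can be sketched generically (pure Python with hypothetical names — not TVM's actual RelayToTIR registration API):

```python
# Old style: the core compiler calls back into the backend once per
# partition function. New style: a target kind registers one hook that
# sees the whole module, so cross-partition state (shared tuning results,
# a single generated source module) lives naturally in one place.

HOOKS = {}

def register_relay_to_tir(kind):
    def deco(fn):
        HOOKS[kind] = fn
        return fn
    return deco

@register_relay_to_tir("tensorrt")
def compile_for_tensorrt(module):
    # Sees every "tensorrt" partition at once; returns the updated module.
    compiled = {name: f"trt:{name}" for name in module["functions"]}
    return {**module, "external": compiled}

module = {"functions": ["tensorrt_main_0", "tensorrt_main_1"]}
module = HOOKS["tensorrt"](module)
assert len(module["external"]) == 2
```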

* - Lints

* - Woops, fix test

* - lints

* - Use default tensorrt target if none given in targets list

* - fix free error

* - accidentally introduced 'transforms' namespace
- can't use default Target("tensorrt") arg

* - D'oh! Include ended up #if protected

* - restore mark for test_dynamic_offload
- handle missing runtime in versioning
- turn test_maskrcnn_resnet50 back on now that we have the
  import-torch-first workaround.

* - wibble
mikeseven pushed a commit to mikeseven/tvm that referenced this pull request Sep 27, 2023
mikeseven pushed a commit to mikeseven/tvm that referenced this pull request Sep 27, 2023