[BYOC] Make CUTLASS BYOC integration 'Collage friendly' #11631
Conversation
CAUTION: Includes #11619, I'll rebase when possible.
Rebased and ready for review.
(See https://discuss.tvm.apache.org/t/byoc-supporting-cutlass-byoc-with-collage/12796/6 for context, which in turn is part of Collage (https://github.com/apache/tvm-rfcs/blob/main/rfcs/0062-collage.md).)

Currently CUTLASS has four entry points:
- The usual `partition_for_cutlass` partitioning function, using the standard pattern table and pass machinery (see cutlass/build.py).
- A `tune_cutlass_kernels` function which augments CUTLASS partition functions with the results of building and running test kernels (see cutlass/build.py).
- A `relay.ext.cutlass` external codegen function which inspects the tuning results and generates a CSourceModule for each partition (see cutlass/codegen.cc).
- A `build_cutlass_kernels_vm` function which runs `export_library` with all the nvcc compiler options needed to build all the CSourceModules (see cutlass/build.py).

For Collage we'd like CUTLASS to have only two entry points: `partition_for_cutlass`, and `relay.ext.cutlass` or equivalent. This makes the CUTLASS external codegen integration composable with other integrations, which in turn helps Collage avoid having to understand any external codegen APIs other than the global pattern table and the custom compilation function/pass.

Collage also tends to require multiple partitions for the same backend, since it is more aggressive at mixing-and-matching smaller sub-graphs between backends. Thus we'd also like to make sure all tuning, generated code and compilation overhead is shared between all such CUTLASS partitions.

So, in this PR:
- We add all the CUTLASS-specific tuning and compilation options as new Target attributes for the 'external codegen' "cutlass" TargetKind (cutlass/target.cc). The user now has one place to provide those settings, and we've already done the legwork to plumb the target instance.
- We replace `relay.ext.cutlass` with a `RelayToTIR` custom pass hook, `CompileForCutlass` (see cutlass/codegen.cc). This pass can see all the CUTLASS partitions in the IRModule, so we can now share tuning results between them all and be sure to generate a single CSourceModule. The pass can also invoke the compiler to yield a StaticModule, which we've also already done the legwork to support. In this way all CUTLASS-specific steps are handled at once.
- For convenience we supply `finalize_modules` and `finalize_modules_vm`, which invoke nvcc for final linking (using `export_library` as usual). However, there's now nothing CUTLASS specific in those helpers other than their overriding of the 'compiler' to be nvcc.
- test_cutlass.py is updated to use the new API.

Though this is a breaking change for existing users of the CUTLASS integration, the change is pretty minor, as shown in test_cutlass.py.
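The payoff of the module-at-a-time `RelayToTIR` approach can be sketched with a toy model. This is a hypothetical illustration, not the actual TVM implementation: `tune_kernel`, `compile_for_toolchain` and the partition/signature representation are all invented names. The point is that a pass which sees every partition at once can memoize tuning by kernel signature and emit a single combined source module.

```python
# Hypothetical sketch of a module-level "CompileForCutlass"-style pass.
# Because it sees all partitions at once, tuning is shared and one source
# module is produced. All names here are invented for illustration.

def tune_kernel(signature, cache, tuning_runs):
    """Tune one kernel, memoizing by signature so repeated partitions reuse results."""
    if signature not in cache:
        tuning_runs.append(signature)  # stand-in for an expensive build-and-run
        cache[signature] = f"best_kernel_for_{signature}"
    return cache[signature]

def compile_for_toolchain(partitions):
    """Module-at-a-time: tune each distinct kernel once, emit one source module."""
    cache, tuning_runs, sources = {}, [], []
    for name, signature in partitions:
        kernel = tune_kernel(signature, cache, tuning_runs)
        sources.append(f"// {name} -> {kernel}")
    return "\n".join(sources), tuning_runs

# Three partitions, two sharing a signature: only two tuning runs happen.
partitions = [("part_0", "dense_128x128"),
              ("part_1", "conv2d_3x3"),
              ("part_2", "dense_128x128")]
source, runs = compile_for_toolchain(partitions)
```

Under the old function-at-a-time scheme each partition would be tuned and compiled in isolation, so `part_0` and `part_2` would pay the tuning cost twice and produce two source modules.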
@apeskov that PR should be of interest to you!
If I understand this correctly, this is a huge improvement! Previously, when compiling a model like [...]. I also like the more polished API.
comaniac left a comment:
Thanks for the change that makes CUTLASS more practical in TVM :)
Thanks Masa. Right, at first I thought I'd done something horribly wrong (for me it was gpt2), but it turns out external codegen has never had structural sharing. So I'm going to play this same game with some of the other codegens, including TensorRT.
(Can be done in a follow-up) @mbs-octoml Can you remove this line that you just added back? I don't know why I added it before. Maybe it was there to be consistent with `tvm/python/tvm/contrib/cutlass/build.py` (line 635 at 9c16e05).
Removed the unnecessary load.
This does for the TensorRT integration what apache#11631 did for the CUTLASS integration.
- All compilation options are captured within the attributes of a Target of kind "tensorrt" (instead of the "relay.ext.tensorrt.options" attribute in PassContext). This means all BYOC configuration options needed by Collage can be captured uniformly by a list-of-Targets. It also means RPC boundaries (as used internally at OctoML) only need to worry about maintaining the fidelity of the Target instance(s) rather than reaching into the PassContext.
- Compilation is switched to being IRModule-at-a-time (using the RelayToTIR target-specific hook mechanism) instead of function-at-a-time (relying on the TECompiler). Though not strictly necessary, I wanted to check the path is now clear to deprecate the latter mechanism in favor of the former, with an eye to removing a big source of complexity in the LowerTE pass.
- As it happens, the switch revealed the fragility of how we extract constant bindings during construction of the JSON for each offloaded function. Currently the JSON visitor assigns each Constant a unique name to represent the constant, but the underlying NDArray is ignored. Then, via a callback from within the TECompiler, the function is visited again (hopefully in the same order) to actually capture the NDArrays, replicating the name generation. Replaced that with simply binding the constant names to their NDArrays directly in the serializer, and attaching that to the IRModule as a new "const_name_to_ndarray" attribute. In this way any pass is free to hoist constants out of functions, and the final metadata module construction will be sure to capture them for loading at runtime. (Obviously IRModule deserves a first-class 'global constant' bindings map.)
I tried to do to the TensorRT integration what apache#11631 did to the CUTLASS integration, viz:
- Make sure all compilation options are passed in Target instances. This helps Collage.
- Use a custom pass invoked via RelayToTIRTargetHooks instead of the relay.ext.$toolchain mechanism. This helps us decouple external codegen from lowering.

This PR collects the prep for that change:
- TensorRT uses the JSONSerializer visitor to encode each partition function. Previously, when the visitor encountered a Constant it simply generated and recorded a name for the constant. Then, completely separately, and via a callback in TECompiler, the function is visited again in the same order and with the same name-generation convention by a ConstantUpdater to actually collect the bindings, which are then encoded into a ConstLoaderModule to be made available at runtime. However, if all TensorRT compilation is to be done by a stand-alone pass there's no TECompiler callback hackery available. So I've added a "const_name_to_ndarray" attribute to the IRModule of type Map<String, runtime::NDArray> so that named constants can be accumulated throughout compilation by any pass which needs to do so. Then the Graph, AOT and VM executors are all updated to merge those constants into the final runtime artifact. (Compare with "Constants", the equivalent attribute for extracting TIR AllocateConsts.)
- The TensorRT tests use the create_executor interface, but it wasn't quite ready for the new, more general form of passing a list-of-targets.
- I want TensorRT compilation to work out of the box without the need for any special targets if all the default options should apply. Go back and make the CUTLASS integration I did follow the same convention.
- TensorRT actually needs to 'undo' partitionings in some situations. Add an InlineCompilerFunctions pass to make that robust. In particular, it must undo both the 'partitioning' (i.e. separating out the "Compiler" function) and any 'compositing' (i.e. separating out small sub-graphs as "Composite" functions).
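The constant-binding fragility described above is easy to see in miniature. This is a hypothetical sketch, not the actual JSONSerializer code: the `serialize` function and `tensorrt_const_N` naming convention are invented. The fix is to record each constant's value in the same pass that names it, instead of relying on a second visitor to regenerate identical names in identical order.

```python
# Hypothetical sketch of the "const_name_to_ndarray" fix: name each constant
# AND capture its value in one pass, so no second visitor needs to replicate
# the name-generation order. Names here are invented for illustration.

def serialize(consts):
    """Single pass over the partition's constants: emit JSON nodes and bindings together."""
    json_nodes = []
    const_name_to_ndarray = {}
    for i, value in enumerate(consts):
        name = f"tensorrt_const_{i}"
        json_nodes.append({"op": "const", "name": name})
        const_name_to_ndarray[name] = value  # binding captured directly, no replay needed
    return json_nodes, const_name_to_ndarray

# Two constants (lists stand in for runtime::NDArray values).
nodes, bindings = serialize([[1.0, 2.0], [3.0]])
```

In the old scheme, the ConstantUpdater revisit would silently associate names with the wrong NDArrays if any pass reordered the function between the two traversals; binding at serialization time removes that ordering dependency entirely.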
I tried to do to the TensorRT integration what apache#11631 did to the CUTLASS integration, viz:
- Make sure all compilation options are passed in Target instances. This helps Collage.
- Use a custom pass invoked via RelayToTIRTargetHooks instead of the relay.ext.$toolchain mechanism. This helps us decouple external codegen from lowering.

This PR collects the prep for that change:
- TensorRT uses the JSONSerializer visitor to encode each partition function. Previously, when the visitor encountered a Constant it simply generated and recorded a name for the constant. Then, completely separately, and via a callback in TECompiler, the function is visited again in the same order and with the same name-generation convention by a ConstantUpdater to actually collect the bindings, which are then encoded into a ConstLoaderModule to be made available at runtime. However, if all TensorRT compilation is to be done by a stand-alone pass there's no TECompiler callback hackery available. So I've added a "const_name_to_ndarray" attribute to the IRModule of type Map<String, runtime::NDArray> so that named constants can be accumulated throughout compilation by any pass which needs to do so. Then the Graph, AOT and VM executors are all updated to merge those constants into the final runtime artifact. (Compare with "Constants", the equivalent attribute for extracting TIR AllocateConsts.)
- The TensorRT tests use the create_executor interface, but it wasn't quite ready for the new, more general form of passing a list-of-targets.
- I want TensorRT compilation to work out of the box without the need for any special targets if all the default options should apply. Go back and make the CUTLASS integration I did follow the same convention.
- To test this I also switched the 'demo' "ccompiler" external codegen target to IRModule-at-a-time style. This means we can test most of the external codegen machinery in one place without depending on any target which may not be enabled in CI (e.g. TensorRT):
  - Target instances are plumbed correctly so compile-time options are available.
  - External modules are conveyed to the final export library.
  - Constant bindings are conveyed to the metadata module.
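The "accumulate then merge" behaviour of the new module attribute can be sketched with a toy model. This is a hypothetical illustration: the `ToyIRModule` class and `build_metadata` helper are invented, though the attribute key matches the `"const_name_to_ndarray"` name the PR describes. Any pass binds constants into the shared map; the final build step merges them into the loadable artifact.

```python
# Hypothetical sketch: passes accumulate named constants into one module
# attribute; the final build merges them for runtime loading. ToyIRModule
# and build_metadata are invented names for illustration.

class ToyIRModule:
    def __init__(self):
        self.attrs = {"const_name_to_ndarray": {}}

    def bind_constant(self, name, ndarray):
        """Any pass may call this; duplicate names would clobber bindings, so reject them."""
        bindings = self.attrs["const_name_to_ndarray"]
        assert name not in bindings, f"duplicate constant name {name}"
        bindings[name] = ndarray

def build_metadata(mod):
    """Final step: every hoisted constant ends up in the runtime metadata."""
    return dict(mod.attrs["const_name_to_ndarray"])

mod = ToyIRModule()
mod.bind_constant("tensorrt_0_const_0", [1, 2, 3])  # e.g. from a TensorRT-style pass
mod.bind_constant("ccompiler_0_const_0", [4])       # e.g. from the "ccompiler" demo target
metadata = build_metadata(mod)
```

Because the map lives on the module rather than in TECompiler callback state, the Graph, AOT and VM executor paths can all read the same attribute when assembling the final artifact.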
[BYOC] Switch TensorRT BYOC integration to IRModule-at-a-time using RelayToTIR hook

This does for the TensorRT integration what apache#11631 did for the CUTLASS integration.
- All compilation options are captured within the attributes of a Target of kind "tensorrt" (instead of the "relay.ext.tensorrt.options" attribute in PassContext). This means all BYOC configuration options needed by Collage can be captured uniformly by a list-of-Targets. It also means RPC boundaries (as used internally at OctoML) only need to worry about maintaining the fidelity of the Target instance(s) rather than reaching into the PassContext.
- Compilation is switched from function-at-a-time (relying on the TECompiler) to IRModule-at-a-time (using the RelayToTIR target-specific hook mechanism). Though not strictly necessary for Collage, I want to check the path is now clear to deprecate the support for BYOC in TECompiler.
- Get all the TensorRT tests going again, except for a few I've disabled with a cross-link to a new issue apache#11765. CAUTION: The TensorRT runtime is not supported in CI so many of these tests are cosmetic.
- While trying to track down a 'free(): invalid pointer' error in test_tensorrt_int8_exp.py I made the TensorRT allocs/frees more robust, but it turns out it's also broken in main. No harm leaving these changes in though.
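The "use a default tensorrt target if none was given" behaviour from the follow-up commits can be sketched as below. This is a hypothetical illustration: targets are modeled as plain dicts and `target_for_kind` is an invented helper, whereas the real code works on `tvm.target.Target` instances.

```python
# Hypothetical sketch of the default-target fallback: if the user's target
# list contains no entry of the toolchain's kind, synthesize a default one so
# compilation works out of the box. All names are invented for illustration.

def target_for_kind(targets, kind):
    """Return the first target of `kind`, or a default one if none was given."""
    for target in targets:
        if target["kind"] == kind:
            return target
    # No explicit target: every toolchain option takes its default value.
    return {"kind": kind}

# The user only specified a CUDA target; the TensorRT pass still gets a target.
targets = [{"kind": "cuda", "arch": "sm_80"}]
trt_target = target_for_kind(targets, "tensorrt")
```

This is why the PR can keep "no special targets needed" as the out-of-the-box experience while still letting Collage pass an explicit, fully configured target when it wants one.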
[BYOC] Switch TensorRT BYOC integration to IRModule-at-a-time using RelayToTIR hook (#11979)

This does for the TensorRT integration what #11631 did for the CUTLASS integration.
- All compilation options are captured within the attributes of a Target of kind "tensorrt" (instead of the "relay.ext.tensorrt.options" attribute in PassContext). This means all BYOC configuration options needed by Collage can be captured uniformly by a list-of-Targets. It also means RPC boundaries (as used internally at OctoML) only need to worry about maintaining the fidelity of the Target instance(s) rather than reaching into the PassContext.
- Compilation is switched from function-at-a-time (relying on the TECompiler) to IRModule-at-a-time (using the RelayToTIR target-specific hook mechanism). Though not strictly necessary for Collage, I want to check the path is now clear to deprecate the support for BYOC in TECompiler.
- Get all the TensorRT tests going again, except for a few I've disabled with a cross-link to a new issue #11765. CAUTION: The TensorRT runtime is not supported in CI so many of these tests are cosmetic.
- While trying to track down a 'free(): invalid pointer' error in test_tensorrt_int8_exp.py I made the TensorRT allocs/frees more robust, but it turns out it's also broken in main. No harm leaving these changes in though.

Follow-up commits:
* Lints
* Woops, fix test
* lints
* Use default tensorrt target if none given in targets list
* fix free error
* accidentally introduced 'transforms' namespace; can't use default Target("tensorrt") arg
* D'oh! Include ended up #if protected
* restore mark for test_dynamic_offload; handle missing runtime in versioning; turn test_maskrcnn_resnet50 back on now that we have the import-torch-first workaround
* wibble