@mbs-octoml mbs-octoml commented Nov 30, 2021

This PR:

  1. Makes PlanDevices consider lowered calls when solving device domain constraints.
  2. Connects the storage scopes on PrimFunc parameters (encoded in the storage_scope
    field of the PointerType type annotation on each Buffer's data Var) to the
    memory_scope fields of the SEScopes which PlanDevices unifies over.
  3. Allows new device_copies to be inserted on the arguments and results of lowered
    calls so as to account for any memory scope mismatches which are now apparent.
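
The unification in item 2 can be pictured with a toy model. The sketch below is not TVM code: `Scope`, its two fields, and `join` are hypothetical stand-ins for an SEScope and the planner's unifier, with `None` meaning "not yet constrained" and the scope strings chosen only as example values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scope:
    """Hypothetical stand-in for an SEScope; None means 'unconstrained'."""
    device: Optional[str] = None
    memory_scope: Optional[str] = None


def _join_field(a, b):
    # Returns (ok, value): an unconstrained field takes the other side's value.
    if a is None:
        return True, b
    if b is None or a == b:
        return True, a
    return False, None


def join(lhs: Scope, rhs: Scope) -> Optional[Scope]:
    """Unify two scopes field by field; return None on a conflict."""
    ok_dev, device = _join_field(lhs.device, rhs.device)
    ok_mem, memory_scope = _join_field(lhs.memory_scope, rhs.memory_scope)
    if not (ok_dev and ok_mem):
        return None
    return Scope(device, memory_scope)


# A Relay-side scope that only knows the device...
relay_side = Scope(device="cuda")
# ...unifies with a PrimFunc parameter whose storage scope is already fixed.
param_side = Scope(device="cuda", memory_scope="global.texture")
unified = join(relay_side, param_side)
# Two different, already-fixed memory scopes cannot be unified.
conflict = join(Scope("cuda", "global"), param_side)
```

In this toy reading, a `None` result from `join` marks the kind of mismatch that item 3 resolves with a device_copy.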

[device_planner.cc has main changes, rest is secondary.]

In the short term we'd like to use this machinery to flow memory scope choices made
during lowering back out into the overall Relay program. In the longer term we'd
also like to be able to use memory scopes to influence the lowering of
yet-to-be-lowered functions (or lowered functions which have yet to be scheduled,
a distinction now possible with TensorIR).

  • Memory scope constraints can flow both out of and in to PrimFuncs
    introduced by LowerTE. In TIR memory scopes are represented by
    'storage scopes' on the PointerType type annotations on TIR Buffer data
    variables.

    • It is straightforward to extract memory scopes from PrimFuncs by
      looking at the PrimFunc's buffer_map. We do this in 'phase 1' of
      PlanDevices, which collects all the device constraints implied by
    • However, pushing memory constraints in to PrimFuncs is more challenging
      due to buffer aliasing. This aspect is still experimental.
  • Allow device_copies to be inserted for both arguments and
    results of PrimFunc calls, on the assumption PlanDevices has
    already established a consistent device assignment prior to
    lowering and any new mismatch is required to match up memory scopes.
    We use the new 'free' on_device annotations to implement this.
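
As a rough illustration of the two bullets above, the following sketch reads per-parameter memory scopes from a toy function description and lists the arguments that would need a device_copy. It assumes a simplified model: `ToyPrimFunc`, `param_scopes`, and `plan_copies` are hypothetical names, and a plain dict replaces the real buffer_map lookup.

```python
from typing import Dict, List, Tuple

# Hypothetical model: parameter name -> storage scope of its buffer's data
# pointer. An empty string stands for "no scope recorded".
ToyPrimFunc = Dict[str, str]


def param_scopes(func: ToyPrimFunc, params: List[str]) -> List[str]:
    """Read the memory scope of each parameter, in call order."""
    return [func.get(p, "") for p in params]


def plan_copies(arg_scopes: List[str], wanted: List[str]) -> List[Tuple[int, str, str]]:
    """Return (arg index, from scope, to scope) for each mismatched argument."""
    copies = []
    for i, (have, want) in enumerate(zip(arg_scopes, wanted)):
        # An unrecorded scope on either side is treated as compatible.
        if have and want and have != want:
            copies.append((i, have, want))
    return copies


# Example scopes only; the lowered function wants its first input elsewhere.
conv = {"data": "global.texture", "weight": "global", "out": "global"}
wanted = param_scopes(conv, ["data", "weight"])
copies = plan_copies(["global", "global"], wanted)
```

Only the first argument is reported here, since the second already agrees with what the callee expects.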

Coming along for the ride:

  • To make unit tests of mixed Relay/TIR functions possible, we needed
    to be able to supply a checked_type to GlobalVar, since that's currently
    the only way to give a Relay type to PrimFuncs.

  • Use GenSym to get unique var names in ANF & partial eval so it is easier
    to diff debug output between passes and connect program fragments
    back into the overall program. Relying on pretty-printing to
    automagically unique-ify var names is certainly cute, but until we
    have better span support it is very hard to work with.

  • Realized both dead_code.cc and fold_constant.cc would
    happily move values into a different lexical virtual
    device context since device_planner.cc was being
    'clever' and eliding on_devices for let-bound values
    when there's no change. Fixed so that every let-bound
    value has an on_device. Will be much better after
    apache/tvm-rfcs#45 ([RELAY][AST] Add virtual device as a first class field to Relay expressions) is implemented.

  • Make build -Werror clean for clang-12 (mostly move fixups).

  • Address post-submit comments from [Relay] PlanDevices supports 'free' on_device annotations #9693.
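
For the GenSym item above, the idea of handing out never-repeating variable names can be sketched as follows. This is not the PR's implementation (which lives in the C++ passes); the class name, the `fresh` method, and the naming scheme are assumptions made for illustration.

```python
import itertools
import threading


class GenSym:
    """Hands out names of the form '<prefix><n>' that never repeat."""

    def __init__(self):
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def fresh(self, prefix: str = "x") -> str:
        # The lock keeps the shared counter consistent across threads.
        with self._lock:
            return f"{prefix}{next(self._counter)}"


gensym = GenSym()
names = [gensym.fresh("x"), gensym.fresh("x"), gensym.fresh("t")]
```

Because every name carries a distinct counter value, the same binding keeps the same name in the debug output of successive passes, which is what makes diffing practical.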

@mbs-octoml mbs-octoml force-pushed the mbs-post-lower-dev-plan branch 7 times, most recently from 715692f to c1842cc Compare December 7, 2021 23:24
@mbs-octoml mbs-octoml force-pushed the mbs-post-lower-dev-plan branch from c1842cc to d65804c Compare December 7, 2021 23:52
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Dec 9, 2021
…imFuncs

This is in support of apache#9613 which allows memory scopes to flow
out of already-lowered PrimFuncs into the rest of the Relay
program. This means scope choices made during lowering can
be accounted for in the rest of the program, with device_copies
inserted as required.

Somewhat more speculatively we also allow memory scopes to flow
in to PrimFuncs. This is in preparation for when we can split
lowering into two phases: i) lower "primitive" fused Relay
functions to TensorIR in a schedulable form roughly isomorphic
to TE, and ii) actual scheduling down to traditional TIR. Once
that split is made it will be possible to flow memory scopes
out of one PrimFunc and into another so as to avoid device_copies
that are needed only because the memory scopes were chosen
independently.

I also suspect we'll want to put our focus on layouts rather
than memory scopes, but this at least sets up some of the
machinery.
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Dec 9, 2021
This is in support of apache#9613, which allows PlanDevices to be run
after lowering so as to flow memory constraints in and
out of PrimFuncs. That requires a way to insert device_copies
when the memory scopes chosen during separate lowering of fused
primitive functions clash, but otherwise avoid device_copies when
scopes can be chosen so as to avoid them.

We support that by generalizing the "on_device" annotation to
allow the device constraint to be independently controlled for
its 'body' and 'result'.

# Standard user annotation: body is constrained to S
on_device(body, S)

# Used by PlanDevices to 'fix' expression to S
# (was is_fixed=True)
on_device(body, S, constrain_result=True)

# Used by PlanDevices to indicate a device_copy can be
# inserted if necessary.
on_device(body, S, constrain_body=False)

# Supported, but currently has no use.
on_device(body, S, constrain_result=True, constrain_body=False)
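
The four forms above can be tabulated with a small helper. This sketch does not call TVM: `Constraint` and `on_device_constraint` are hypothetical, and the defaults (`constrain_body=True`, `constrain_result=False`) are inferred from the examples above rather than taken from the implementation.

```python
from typing import NamedTuple, Optional


class Constraint(NamedTuple):
    body: Optional[str]    # scope the body is pinned to, or None if free
    result: Optional[str]  # scope the result is pinned to, or None if free


def on_device_constraint(scope: str,
                         constrain_result: bool = False,
                         constrain_body: bool = True) -> Constraint:
    """Report which side(s) of the annotation are pinned to `scope`."""
    return Constraint(scope if constrain_body else None,
                      scope if constrain_result else None)


standard = on_device_constraint("S")
fixed = on_device_constraint("S", constrain_result=True)
free = on_device_constraint("S", constrain_body=False)
result_only = on_device_constraint("S", constrain_result=True, constrain_body=False)
```

Under these assumed defaults, only the 'fixed' form pins both sides to S.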

A few extra odds 'n ends collected along the way:
 - Some CallLowered cleanup which I found useful.
 - The usual extra debugging output needed as I debugged.
   In return I removed some particularly verbose logging I'd
   added while tracking down unexpected object copies.
 - Clean up warnings from clang-12 as I touch files.
mbrookhart pushed a commit that referenced this pull request Dec 10, 2021
…imFuncs (#9689)

* [TIR] Allow memory (aka storage) scopes to be retrieved/applied to PrimFuncs

* [checkpoint] Junru's comments.
jroesch pushed a commit that referenced this pull request Dec 10, 2021
* [Relay] PlanDevices supports 'free' on_device annotations

* [checkpoint] unused var
@mbs-octoml mbs-octoml force-pushed the mbs-post-lower-dev-plan branch from d65804c to 219762a Compare December 13, 2021 19:02
@mbs-octoml mbs-octoml changed the title from "[DRAFT][Relay] PlanDevices pass can run after LowerTE pass" to "[Relay] Re-run PlanDevices after LowerTE to flow new memory scope constraints." Dec 13, 2021
@mbs-octoml mbs-octoml marked this pull request as ready for review December 13, 2021 19:03
@mbrookhart mbrookhart merged commit fb99383 into apache:main Dec 14, 2021
@mbrookhart

Thanks @mbs-octoml @jroesch

@mbs-octoml mbs-octoml deleted the mbs-post-lower-dev-plan branch December 14, 2021 18:21
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
…imFuncs (apache#9689)

* [TIR] Allow memory (aka storage) scopes to be retrieved/applied to PrimFuncs

* [checkpoint] Junru's comments.
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
* [Relay] PlanDevices supports 'free' on_device annotations

* [checkpoint] unused var
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
…straints. (apache#9613)

* [Relay] Re-run PlanDevices after LowerTE to flow new memory scope constraints.

* [checkpoint] thread safe GenSym
yangulei pushed a commit to yangulei/tvm that referenced this pull request Jan 11, 2022
…imFuncs (apache#9689)

* [TIR] Allow memory (aka storage) scopes to be retrieved/applied to PrimFuncs

* [checkpoint] Junru's comments.
yangulei pushed a commit to yangulei/tvm that referenced this pull request Jan 11, 2022
* [Relay] PlanDevices supports 'free' on_device annotations

* [checkpoint] unused var
yangulei pushed a commit to yangulei/tvm that referenced this pull request Jan 11, 2022
…straints. (apache#9613)

* [Relay] Re-run PlanDevices after LowerTE to flow new memory scope constraints.

* [checkpoint] thread safe GenSym
yangulei pushed a commit to yangulei/tvm that referenced this pull request Jan 12, 2022
…imFuncs (apache#9689)

* [TIR] Allow memory (aka storage) scopes to be retrieved/applied to PrimFuncs

* [checkpoint] Junru's comments.
yangulei pushed a commit to yangulei/tvm that referenced this pull request Jan 12, 2022
* [Relay] PlanDevices supports 'free' on_device annotations

* [checkpoint] unused var
yangulei pushed a commit to yangulei/tvm that referenced this pull request Jan 12, 2022
…straints. (apache#9613)

* [Relay] Re-run PlanDevices after LowerTE to flow new memory scope constraints.

* [checkpoint] thread safe GenSym
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
…imFuncs (apache#9689)

* [TIR] Allow memory (aka storage) scopes to be retrieved/applied to PrimFuncs

* [checkpoint] Junru's comments.
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
* [Relay] PlanDevices supports 'free' on_device annotations

This is in support of apache#9613, which allows PlanDevices to be run
after lowering so as to flow memory constraints in and
out of PrimFuncs. That requires a way to insert device_copies
when the memory scopes chosen during separate lowering of fused
primitive functions clashes, but otherwise avoid device_copies when
scopes can be chosen so as to avoid them.

We support that by generalizing the "on_device" annotation to
allow the device constraint to be independently controlled for
its 'body' and 'result'.

# Standard user annotation: body is constrained to S
on_device(body, S)

# Used by PlanDevices to 'fix' expression to S
# (was is_fixed=True)
on_device(body, S, constrain_result=True)

# Used by PlanDevices to indicate a device_copy can be
# inserted if necessary.
on_device(body, S, constrain_body=False)

# Supported, but currently has no use.
on_device(body, S, constrain_result=True, constrain_body=False)

A few extra odd's 'n ends collected along the way:
 - Some CallLowered cleanup which I found useful.
 - The usual extra debugging output needed as I debugged.
   In return I removed some particularly verbose logging I'd
   added while tracking down unexpected object copies.
 - Clean up warnings from clang-12 as I touch files.

* [checkpoint] unused var
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
…straints. (apache#9613)

* [Relay] Re-run PlanDevices after LowerTE to flow new memory scope constraints.

This PR:
 1) Makes PlanDevices consider lowered calls when solving device domain constraints.
 2) Connects the storage scopes on PrimFunc parameters (encoded in their Buffer data
    Var type annotation PointerTypes storage_scope fields) to the memory_scope
    fields of the SEScopes which PlanDevices unifies over.
 3) Allows new device_copies to be inserted on the arguments and results of lowered
    calls so as to account for any memory scope mismatches which are now apparent.
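
As a rough illustration of what unifying over a memory_scope field means, the sketch below models a scope as a (device, memory_scope) pair in which `None` stands for "not yet constrained". The `Scope` class and `unify` function are hypothetical simplifications for this note; the real SEScope carries more fields and the real solver lives in device_planner.cc.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scope:
    """Toy (device, memory_scope) pair; None means 'not yet constrained'."""
    device: Optional[str] = None
    memory_scope: Optional[str] = None


def _unify_field(a: Optional[str], b: Optional[str]) -> Optional[str]:
    # An unconstrained side simply adopts the other side's value.
    if a is None:
        return b
    if b is None or a == b:
        return a
    raise ValueError(f"cannot unify {a!r} with {b!r}")


def unify(lhs: Scope, rhs: Scope) -> Scope:
    """Combine two partially constrained scopes, failing on a real conflict."""
    return Scope(
        _unify_field(lhs.device, rhs.device),
        _unify_field(lhs.memory_scope, rhs.memory_scope),
    )


# A device known from the Relay side meets a storage scope learned from a PrimFunc.
merged = unify(Scope(device="gpu"), Scope(memory_scope="texture"))
```

A conflict such as `unify(Scope("gpu", "global"), Scope("gpu", "texture"))` raises here; in the flow described above, that is the situation a device_copy is meant to resolve.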

[device_planner.cc has main changes, rest is secondary.]

In the short term we'd like to use this machinery to flow memory scope choices made
during lowering back out into the overall Relay program. In the longer term we'd
also like to be able to use memory scopes to influence the lowering of
yet-to-be-lowered functions (or lowered functions which have yet to be scheduled,
a distinction now possible with TensorIR).

 - Memory scope constraints can flow both out of and in to PrimFuncs
   introduced by LowerTE. In TIR memory scopes are represented by
   'storage scopes' on the PointerType type annotations on TIR Buffer data
   variables.
    - It is straightforward to extract memory scopes from PrimFuncs by
      looking at the PrimFunc's buffer_map. We do this in 'phase 1' of
      PlanDevices, which collects all the device constraints implied by
    - However, pushing memory constraints in to PrimFuncs is more challenging
      due to buffer aliasing. This aspect is still experimental.

 - Allow device_copies to be inserted for both arguments and
   results of PrimFunc calls, on the assumption PlanDevices has
   already established a consistent device assignment prior to
   lowering, so any new mismatch can only be between memory scopes.
   We use the new 'free' on_device annotations to implement this.
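
The two bullets above can be sketched together in plain Python. This is a toy model under stated assumptions: `Buffer`, `PrimFunc`, `param_scopes` and `copies_needed` are hypothetical stand-ins, the `storage_scope` string stands in for the storage scope on the PointerType annotation of the buffer's data variable, and results are treated as trailing parameters.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Buffer:
    """Toy buffer: storage_scope stands in for the data var's PointerType scope."""
    name: str
    storage_scope: str


@dataclass
class PrimFunc:
    """Toy PrimFunc: params in call order, each mapped to its buffer."""
    params: List[str]
    buffer_map: Dict[str, Buffer]


def param_scopes(func: PrimFunc) -> List[str]:
    # Read each parameter's memory scope off the buffer_map.
    return [func.buffer_map[p].storage_scope for p in func.params]


def copies_needed(func: PrimFunc, caller_scopes: List[str]) -> List[int]:
    # Positions whose caller-side scope differs from what the PrimFunc expects;
    # each one is a candidate for an inserted device_copy.
    wanted = param_scopes(func)
    return [i for i, (w, h) in enumerate(zip(wanted, caller_scopes)) if w != h]


func = PrimFunc(
    params=["a", "b", "out"],
    buffer_map={
        "a": Buffer("a", "global"),
        "b": Buffer("b", "texture"),
        "out": Buffer("out", "global"),
    },
)
mismatches = copies_needed(func, ["global", "global", "global"])
```

Only the position whose scope disagrees is reported, which mirrors the assumption that device assignment was already consistent before lowering.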

Coming along for the ride:

 - To make unit tests of mixed Relay/TIR functions possible we needed
   to be able to supply a checked_type to GlobalVar, since that's currently
   the only way to give a Relay type to PrimFuncs.

 - Use GenSym to get unique var names in ANF & partial eval so it is easier
   to diff debug output between passes and connect program fragments
   back into the overall program. Relying on pretty-printing to
   automagically unique-ify var names is certainly cute, but until we
   have better span support it is very hard to work with.

 - Realized both dead_code.cc and fold_constant.cc would
   happily move values into a different lexical virtual
   device context since device_planner.cc was being
   'clever' and eliding on_devices for let-bound values
   when there's no change. Fixed so that every let-bound
   value has an on_device. Will be much better after
   apache/tvm-rfcs#45 is implemented.

 - Make build -Werror clean for clang-12 (mostly move fixups).

 - Address post-submit comments from apache#9693.
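
For readers unfamiliar with the GenSym idea mentioned in the list above, here is a minimal sketch of a thread-safe unique-name generator in plain Python. It is an illustration only: the `GenSym` class, its `fresh` method and the `hint_N` naming scheme are assumptions made for this example and do not describe the C++ helper in the PR.

```python
import itertools
import threading


class GenSym:
    """Hand out names that are unique per generator, even across threads."""

    def __init__(self) -> None:
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def fresh(self, hint: str = "x") -> str:
        # The lock makes 'read the counter and advance it' a single step,
        # so two threads can never receive the same number.
        with self._lock:
            n = next(self._counter)
        return f"{hint}_{n}"


gen = GenSym()
first = gen.fresh("x")
second = gen.fresh("x")
```

Because the names are fixed when the variable is created rather than at print time, the same variable keeps the same name in the debug output of every pass.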

* [checkpoint] thread safe GenSym
qsqqsqqsq-intellif pushed a commit to qsqqsqqsq-intellif/tvm that referenced this pull request Apr 29, 2022
…imFuncs (apache#9689)

* [TIR] Allow memory (aka storage) scopes to be retrieved/applied to PrimFuncs

qsqqsqqsq-intellif pushed a commit to qsqqsqqsq-intellif/tvm that referenced this pull request Apr 29, 2022
* [Relay] PlanDevices supports 'free' on_device annotations

qsqqsqqsq-intellif pushed a commit to qsqqsqqsq-intellif/tvm that referenced this pull request Apr 29, 2022
…straints. (apache#9613)

* [Relay] Re-run PlanDevices after LowerTE to flow new memory scope constraints.
