-
Notifications
You must be signed in to change notification settings - Fork 15.2k
[AMDGPU] Update code sequence for CU-mode Release Fences in GFX10+ #161638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: users/pierre-vh/remove-extra-wait-after-rmw
Are you sure you want to change the base?
[AMDGPU] Update code sequence for CU-mode Release Fences in GFX10+ #161638
Conversation
They were previously optimized to not emit any waitcnt, which is technically correct because there is no reordering of operations at workgroup scope in CU mode for GFX10+. This breaks transitivity however, for example if we have the following sequence of events in one thread: - some stores - store atomic release syncscope("workgroup") - barrier then another thread follows with - barrier - load atomic acquire - store atomic release syncscope("agent") It does not work because, while the other thread sees the stores, it cannot release them at the wider scope. Our release fences aren't strong enough to "wait" on stores from other waves. We also cannot strengthen our release fences any further to allow for releasing other wave's stores because only GFX12 can do that with `global_wb`. GFX10-11 do not have the writeback instruction. It'd also add yet another level of complexity to code sequences, with both acquire/release having CU-mode only alternatives. Lastly, acq/rel are always used together. The price for synchronization has to be paid either at the acq, or the rel. Strengthening the releases would just make the memory model more complex but wouldn't help performance. So the choice here is to streamline the code sequences by making CU and WGP mode emit identical code for release (or stronger) atomic ordering. This also removes the `vm_vsrc(0)` wait before barriers. Now that the release fence in CU mode is strong enough, it is no longer needed.
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
You can test this locally with the following command:git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 'HEAD~1' HEAD llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-local-cluster.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll The following files introduce new uses of undef:
Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields In tests, avoid using For example, this is considered a bad practice: define void @fn() {
...
br i1 undef, ...
} Please use the following instead: define void @fn(i1 %cond) {
...
br i1 %cond, ...
} Please refer to the Undefined Behavior Manual for more information. |
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu Author: Pierre van Houtryve (Pierre-vh) ChangesThey were previously optimized to not emit any waitcnt, which is technically correct because there is no reordering of operations at workgroup scope in CU mode for GFX10+. This breaks transitivity however, for example if we have the following sequence of events in one thread:
then another thread follows with
It does not work because, while the other thread sees the stores, it cannot release them at the wider scope. Our release fences aren't strong enough to "wait" on stores from other waves. We also cannot strengthen our release fences any further to allow for releasing other wave's stores because only GFX12 can do that with So the choice here is to streamline the code sequences by making CU and WGP mode emit identical code for release (or stronger) atomic ordering. This also removes the Supersedes #160501 Patch is 428.16 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/161638.diff 16 Files Affected:
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 74b7604fda56d..cba86b3d5447e 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -13229,9 +13229,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
- generic vmcnt(0) & vscnt(0)
- - If CU wavefront execution
- mode, omit vmcnt(0) and
- vscnt(0).
- If OpenCL, omit
lgkmcnt(0).
- Could be split into
@@ -13277,8 +13274,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
2. buffer/global/flat_store
store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
- - If CU wavefront execution
- mode, omit.
- If OpenCL, omit.
- Could be split into
separate s_waitcnt
@@ -13366,9 +13361,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
- generic vmcnt(0) & vscnt(0)
- - If CU wavefront execution
- mode, omit vmcnt(0) and
- vscnt(0).
- If OpenCL, omit lgkmcnt(0).
- Could be split into
separate s_waitcnt
@@ -13413,8 +13405,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
2. buffer/global/flat_atomic
atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
- - If CU wavefront execution
- mode, omit.
- If OpenCL, omit.
- Could be split into
separate s_waitcnt
@@ -13498,9 +13488,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
- - If CU wavefront execution
- mode, omit vmcnt(0) and
- vscnt(0).
- If OpenCL and
address space is
not generic, omit
@@ -13627,9 +13614,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
- - If CU wavefront execution
- mode, omit vmcnt(0) and
- vscnt(0).
- If OpenCL, omit
lgkmcnt(0).
- Must happen after
@@ -13681,8 +13665,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
2. buffer/global_atomic
3. s_waitcnt vm/vscnt(0)
- - If CU wavefront execution
- mode, omit.
- Use vmcnt(0) if atomic with
return and vscnt(0) if
atomic with no-return.
@@ -13707,8 +13689,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
- - If CU wavefront execution
- mode, omit.
- If OpenCL, omit.
- Could be split into
separate s_waitcnt
@@ -13768,9 +13748,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
- - If CU wavefront execution
- mode, omit vmcnt(0) and
- vscnt(0).
- If OpenCL, omit lgkmcnt(0).
- Could be split into
separate s_waitcnt
@@ -13816,9 +13793,9 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
3. s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
- - If CU wavefront execution
- mode, omit vmcnt(0) and
- vscnt(0).
+ - If atomic with return, omit
+ vscnt(0), if atomic with
+ no-return, omit vmcnt(0).
- If OpenCL, omit lgkmcnt(0).
- Must happen before
the following
@@ -13991,9 +13968,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
vmcnt(0) & vscnt(0)
- - If CU wavefront execution
- mode, omit vmcnt(0) and
- vscnt(0).
- If OpenCL and
address space is
not generic, omit
@@ -14223,9 +14197,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) &
- generic vmcnt(0) & vscnt(0)
- - If CU wavefront execution
- mode, omit vmcnt(0) and
- vscnt(0).
- Could be split into
separate s_waitcnt
vmcnt(0), s_waitcnt
@@ -14334,8 +14305,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
1. s_waitcnt vmcnt(0) & vscnt(0)
- - If CU wavefront execution
- mode, omit.
- Could be split into
separate s_waitcnt
vmcnt(0) and s_waitcnt
@@ -15337,8 +15306,6 @@ the instruction in the code sequence that references the table.
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
- | **CU wavefront execution mode:**
- | ``s_wait_dscnt 0x0``
- If OpenCL, omit ``s_wait_dscnt 0x0``.
- The waits can be
@@ -15384,8 +15351,6 @@ the instruction in the code sequence that references the table.
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
- | **CU wavefront execution mode:**
- | ``s_wait_dscnt 0x0``
- If OpenCL, omit.
- The waits can be
@@ -15479,8 +15444,6 @@ the instruction in the code sequence that references the table.
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
- | **CU wavefront execution mode:**
- | ``s_wait_dscnt 0x0``
- If OpenCL, omit ``s_wait_dscnt 0x0``.
- If OpenCL and CU wavefront
@@ -15530,8 +15493,6 @@ the instruction in the code sequence that references the table.
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
- | **CU wavefront execution mode:**
- | ``s_wait_dscnt 0x0``
- If OpenCL, omit all.
- The waits can be
@@ -15623,8 +15584,6 @@ the instruction in the code sequence that references the table.
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
- | **CU wavefront execution mode:**
- | ``s_wait_dscnt 0x0``
- If OpenCL, omit ``s_wait_dscnt 0x0``.
- If OpenCL and
@@ -15754,8 +15713,6 @@ the instruction in the code sequence that references the table.
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
- | **CU wavefront execution mode:**
- | ``s_wait_dscnt 0x0``
- If OpenCL, omit ``s_wait_dscnt 0x0``.
- Must happen after
@@ -15812,8 +15769,6 @@ the instruction in the code sequence that references the table.
| **Atomic without return:**
| ``s_wait_storecnt 0x0``
- - If CU wavefront execution
- mode, omit.
- Must happen before
the following
``global_inv``.
@@ -15838,8 +15793,6 @@ the instruction in the code sequence that references the table.
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
- | **CU wavefront execution mode:**
- | ``s_wait_dscnt 0x0``
- If OpenCL, omit.
- The waits can be
@@ -15901,8 +15854,6 @@ the instruction in the code sequence that references the table.
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
- | **CU wavefront execution mode:**
- | ``s_wait_dscnt 0x0``
- If OpenCL, omit ``s_wait_loadcnt 0x0``.
- The waits can be
@@ -16154,8 +16105,6 @@ the instruction in the code sequence that references the table.
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
- | **CU wavefront execution mode:**
- | ``s_wait_dscnt 0x0``
- If OpenCL and
address space is
@@ -16384,8 +16333,6 @@ the instruction in the code sequence that references the table.
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
- | **CU wavefront execution mode:**
- | ``s_wait_dscnt 0x0``
- If OpenCL, omit
``s_wait_dscnt 0x0``
@@ -16492,8 +16439,6 @@ the instruction in the code sequence that references the table.
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
- | **CU wavefront execution mode:**
- | ``s_wait_dscnt 0x0``
- If OpenCL, omit all.
- The waits can be
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index 362ef140a28f8..29c326893539b 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -390,12 +390,6 @@ class SICacheControl {
bool IsCrossAddrSpaceOrdering,
Position Pos) const = 0;
- /// Inserts any necessary instructions before the barrier start instruction
- /// \p MI in order to support pairing of barriers and fences.
- virtual bool insertBarrierStart(MachineBasicBlock::iterator &MI) const {
- return false;
- };
-
/// Virtual destructor to allow derivations to be deleted.
virtual ~SICacheControl() = default;
};
@@ -576,12 +570,8 @@ class SIGfx10CacheControl : public SIGfx7CacheControl {
bool IsCrossAddrSpaceOrdering, Position Pos,
AtomicOrdering Order, bool AtomicsOnly) const override;
- bool insertAcquire(MachineBasicBlock::iterator &MI,
- SIAtomicScope Scope,
- SIAtomicAddrSpace AddrSpace,
- Position Pos) const override;
-
- bool insertBarrierStart(MachineBasicBlock::iterator &MI) const override;
+ bool insertAcquire(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
+ SIAtomicAddrSpace AddrSpace, Position Pos) const override;
};
class SIGfx11CacheControl : public SIGfx10CacheControl {
@@ -2046,8 +2036,11 @@ bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
// the WGP. Therefore need to wait for operations to complete to ensure
// they are visible to waves in the other CU as the L0 is per CU.
// Otherwise in CU mode and all waves of a work-group are on the same CU
- // which shares the same L0.
- if (!ST.isCuModeEnabled()) {
+ // which shares the same L0. Note that we still need to wait when
+ // performing a release in this mode to respect the transitivity of
+ // happens-before, e.g. other waves of the workgroup must be able to
+ // release the memory from another wave at a wider scope.
+ if (!ST.isCuModeEnabled() || isReleaseOrStronger(Order)) {
if ((Op & SIMemOp::LOAD) != SIMemOp::NONE)
VMCnt |= true;
if ((Op & SIMemOp::STORE) != SIMemOp::NONE)
@@ -2202,22 +2195,6 @@ bool SIGfx10CacheControl::insertAcquire(MachineBasicBlock::i...
[truncated]
|
They were previously optimized to not emit any waitcnt, which is technically correct because there is no reordering of operations at workgroup scope in CU mode for GFX10+.
This breaks transitivity however, for example if we have the following sequence of events in one thread:
then another thread follows with
It does not work because, while the other thread sees the stores, it cannot release them at the wider scope. Our release fences aren't strong enough to "wait" on stores from other waves.
We also cannot strengthen our release fences any further to allow for releasing other wave's stores because only GFX12 can do that with
global_wb
. GFX10-11 do not have the writeback instruction.It'd also add yet another level of complexity to code sequences, with both acquire/release having CU-mode only alternatives.
Lastly, acq/rel are always used together. The price for synchronization has to be paid either at the acq, or the rel. Strengthening the releases would just make the memory model more complex but wouldn't help performance.
So the choice here is to streamline the code sequences by making CU and WGP mode emit identical code for release (or stronger) atomic ordering.
This also removes the
vm_vsrc(0)
wait before barriers. Now that the release fence in CU mode is strong enough, it is no longer needed.Supersedes #160501
Solves SC1-6454