[AMDGPU] Fix code sequence for barrier start in GFX10+ CU Mode #160501
Conversation
The previous implementation only waited on `vm_vsrc(0)`, which works to make sure all threads in the workgroup see the stores done before the barrier. However, despite seeing the stores, the threads were unable to release them at a wider scope. This caused failures in Vulkan CTS tests. To correctly fulfill the memory model semantics, which require happens-before to be transitive, we must wait for the stores to actually complete before the barrier, so that another thread can release them. Note that we still don't need to do anything for WGP mode because release fences are strong enough in that mode. This only applies to CU mode because CU release fences do not emit any code. Solves SC1-6454
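To illustrate the transitivity requirement the description refers to, here is a minimal C++11 analogue (a hedged sketch, not the GPU mechanism: the names are invented, and `std::atomic` release/acquire pairs stand in for the workgroup- and agent-scope fences around the barrier):

```cpp
// Hypothetical sketch of transitive happens-before with three threads.
// Thread A does a plain store and releases it at "workgroup" scope; thread B
// observes it and re-releases at a "wider" scope; thread C must then see A's
// plain store. The bug fixed here is the GPU analogue of A's release not
// actually completing the store before B re-releases it.
#include <atomic>
#include <thread>

static int data = 0;                   // plain (non-atomic) payload store
static std::atomic<int> flag_wg{0};    // stands in for workgroup-scope sync
static std::atomic<int> flag_agent{0}; // stands in for agent-scope sync

int run_scenario() {
  std::thread a([] {
    data = 42;                                   // plain store before the "barrier"
    flag_wg.store(1, std::memory_order_release); // workgroup-scope release
  });
  std::thread b([] {
    while (flag_wg.load(std::memory_order_acquire) != 1) {} // pairs with A
    flag_agent.store(1, std::memory_order_release);         // wider-scope release
  });
  int observed = -1;
  std::thread c([&] {
    while (flag_agent.load(std::memory_order_acquire) != 1) {} // pairs with B
    observed = data; // must read 42: happens-before is transitive
  });
  a.join();
  b.join();
  c.join();
  return observed;
}
```

The middle thread is the crux: B never touches `data`, yet its release must still push A's store out to C, which is exactly what a `vm_vsrc`-only wait failed to guarantee.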
@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Full diff: https://github.com/llvm/llvm-project/pull/160501.diff

3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index c85d2bb9fe9ae..ba6c29a855ebf 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -637,6 +637,8 @@ class SIGfx12CacheControl : public SIGfx11CacheControl {
SIAtomicAddrSpace AddrSpace, bool IsCrossAddrSpaceOrdering,
Position Pos) const override;
+ bool insertBarrierStart(MachineBasicBlock::iterator &MI) const override;
+
bool enableLoadCacheBypass(const MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,
SIAtomicAddrSpace AddrSpace) const override {
@@ -2174,17 +2176,19 @@ bool SIGfx10CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
bool SIGfx10CacheControl::insertBarrierStart(
MachineBasicBlock::iterator &MI) const {
- // We need to wait on vm_vsrc so barriers can pair with fences in GFX10+ CU
- // mode. This is because a CU mode release fence does not emit any wait, which
- // is fine when only dealing with vmem, but isn't sufficient in the presence
- // of barriers which do not go through vmem.
- // GFX12.5 does not require this additional wait.
- if (!ST.isCuModeEnabled() || ST.hasGFX1250Insts())
+ if (!ST.isCuModeEnabled())
return false;
+ // GFX10/11 CU MODE Workgroup fences do not emit anything.
+ // In the presence of barriers, we want to make sure previous memory
+ // operations are actually visible and can be released at a wider scope by
+ // another thread upon exiting the barrier. To make this possible, we must
+ // wait on previous stores.
+
BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
- TII->get(AMDGPU::S_WAITCNT_DEPCTR))
- .addImm(AMDGPU::DepCtr::encodeFieldVmVsrc(0));
+ TII->get(AMDGPU::S_WAITCNT_VSCNT_soft))
+ .addReg(AMDGPU::SGPR_NULL, RegState::Undef)
+ .addImm(0);
return true;
}
@@ -2570,6 +2574,23 @@ bool SIGfx12CacheControl::insertRelease(MachineBasicBlock::iterator &MI,
return Changed;
}
+bool SIGfx12CacheControl::insertBarrierStart(
+ MachineBasicBlock::iterator &MI) const {
+ if (!ST.isCuModeEnabled() || ST.hasGFX1250Insts())
+ return false;
+
+ // GFX12 CU MODE Workgroup fences do not emit anything (except in GFX12.5).
+ // In the presence of barriers, we want to make sure previous memory
+ // operations are actually visible and can be released at a wider scope by
+ // another thread upon exiting the barrier. To make this possible, we must
+ // wait on previous stores.
+
+ BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
+ TII->get(AMDGPU::S_WAIT_STORECNT_soft))
+ .addImm(0);
+ return true;
+}
+
bool SIGfx12CacheControl::enableVolatileAndOrNonTemporal(
MachineBasicBlock::iterator &MI, SIAtomicAddrSpace AddrSpace, SIMemOp Op,
bool IsVolatile, bool IsNonTemporal, bool IsLastUse = false) const {
diff --git a/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
index b91963f08681c..d23509b5aa812 100644
--- a/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
+++ b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
@@ -150,7 +150,6 @@ define amdgpu_kernel void @barrier_release(<4 x i32> inreg %rsrc,
; GFX10CU-NEXT: buffer_load_dword v0, s[8:11], 0 offen lds
; GFX10CU-NEXT: v_mov_b32_e32 v0, s13
; GFX10CU-NEXT: s_waitcnt vmcnt(0)
-; GFX10CU-NEXT: s_waitcnt_depctr 0xffe3
; GFX10CU-NEXT: s_barrier
; GFX10CU-NEXT: ds_read_b32 v0, v0
; GFX10CU-NEXT: s_waitcnt lgkmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll
index 516c3946f63dc..b5c3cce160a22 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll
@@ -15,7 +15,7 @@ define amdgpu_kernel void @test_s_barrier() {
;
; GFX10-CU-LABEL: test_s_barrier:
; GFX10-CU: ; %bb.0: ; %entry
-; GFX10-CU-NEXT: s_waitcnt_depctr 0xffe3
+; GFX10-CU-NEXT: s_waitcnt_vscnt null, 0x0
; GFX10-CU-NEXT: s_barrier
; GFX10-CU-NEXT: s_endpgm
;
@@ -26,7 +26,7 @@ define amdgpu_kernel void @test_s_barrier() {
;
; GFX11-CU-LABEL: test_s_barrier:
; GFX11-CU: ; %bb.0: ; %entry
-; GFX11-CU-NEXT: s_waitcnt_depctr 0xffe3
+; GFX11-CU-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-CU-NEXT: s_barrier
; GFX11-CU-NEXT: s_endpgm
;
@@ -38,7 +38,7 @@ define amdgpu_kernel void @test_s_barrier() {
;
; GFX12-CU-LABEL: test_s_barrier:
; GFX12-CU: ; %bb.0: ; %entry
-; GFX12-CU-NEXT: s_wait_alu 0xffe3
+; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: s_barrier_signal -1
; GFX12-CU-NEXT: s_barrier_wait -1
; GFX12-CU-NEXT: s_endpgm
@@ -64,7 +64,7 @@ define amdgpu_kernel void @test_s_barrier_workgroup_fence() {
; GFX10-CU-LABEL: test_s_barrier_workgroup_fence:
; GFX10-CU: ; %bb.0: ; %entry
; GFX10-CU-NEXT: s_waitcnt lgkmcnt(0)
-; GFX10-CU-NEXT: s_waitcnt_depctr 0xffe3
+; GFX10-CU-NEXT: s_waitcnt_vscnt null, 0x0
; GFX10-CU-NEXT: s_barrier
; GFX10-CU-NEXT: s_endpgm
;
@@ -78,7 +78,7 @@ define amdgpu_kernel void @test_s_barrier_workgroup_fence() {
; GFX11-CU-LABEL: test_s_barrier_workgroup_fence:
; GFX11-CU: ; %bb.0: ; %entry
; GFX11-CU-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-CU-NEXT: s_waitcnt_depctr 0xffe3
+; GFX11-CU-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-CU-NEXT: s_barrier
; GFX11-CU-NEXT: s_endpgm
;
@@ -94,8 +94,7 @@ define amdgpu_kernel void @test_s_barrier_workgroup_fence() {
;
; GFX12-CU-LABEL: test_s_barrier_workgroup_fence:
; GFX12-CU: ; %bb.0: ; %entry
-; GFX12-CU-NEXT: s_wait_dscnt 0x0
-; GFX12-CU-NEXT: s_wait_alu 0xffe3
+; GFX12-CU-NEXT: s_wait_storecnt_dscnt 0x0
; GFX12-CU-NEXT: s_barrier_signal -1
; GFX12-CU-NEXT: s_barrier_wait -1
; GFX12-CU-NEXT: s_endpgm
@@ -125,7 +124,6 @@ define amdgpu_kernel void @test_s_barrier_agent_fence() {
; GFX10-CU: ; %bb.0: ; %entry
; GFX10-CU-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX10-CU-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX10-CU-NEXT: s_waitcnt_depctr 0xffe3
; GFX10-CU-NEXT: s_barrier
; GFX10-CU-NEXT: s_endpgm
;
@@ -140,7 +138,6 @@ define amdgpu_kernel void @test_s_barrier_agent_fence() {
; GFX11-CU: ; %bb.0: ; %entry
; GFX11-CU-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX11-CU-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX11-CU-NEXT: s_waitcnt_depctr 0xffe3
; GFX11-CU-NEXT: s_barrier
; GFX11-CU-NEXT: s_endpgm
;
@@ -160,7 +157,6 @@ define amdgpu_kernel void @test_s_barrier_agent_fence() {
; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX12-CU-NEXT: s_wait_alu 0xffe3
; GFX12-CU-NEXT: s_barrier_signal -1
; GFX12-CU-NEXT: s_barrier_wait -1
; GFX12-CU-NEXT: s_endpgm
LGTM - as far as an immediate fix for the issue goes.
Beyond the scope of this change: it would be nice if we could define where `vm_vsrc(0)` would be sufficient, and be able to apply that as an optimization. My suspicion is that it is sufficient in the majority of graphics scenarios.
Does this need a change in the memory model docs?
I don't think we should ever unconditionally insert waits purely based on a barrier. This pessimizes LDS-only barriers too much.
Quite frankly, I'd rather admit that trying to reduce CU-mode release fences to only VM_VSRC waits was a mistake, and go back to using the same waits there as for WGP-mode.
I'm not so sure about that, unfortunately. The example you showed offline showed a problem when wave A has a workgroup-scope release fence and then wave B has an agent-scope release fence that should push out the data from A as well. Since we can't do long-distance static analysis to understand what happens in other waves, we can only reduce a workgroup-scope release fence to a `vm_vsrc(0)` wait.
We never emitted a
Which LDS-only barriers? The check only includes

Do you think we should instead pessimize all workgroup release fences in CU mode so they have a wait on storecnt?
We don't have the documentation for that upstream yet; I'll fix the downstream one.
Right, and an
Is it a pessimization? I don't think so. Isn't the example @perlfu gave offline evidence that if a release fence intends to fence global memory, then a storecnt wait is pretty much unavoidable? Edit for context: The example was:
... where Y/M of thread B is X/N of thread A. The image load in thread B should see the store in A (assuming the atomic load in B is after the atomic store in A in the memory location order of X == Y). For that to be guaranteed, the workgroup scope release fence has to imply a storecnt wait. But it shouldn't really matter that the synchronization mechanism in question is a barrier: the same would be true if the barrier was replaced e.g. by LDS atomics.
I agree. It's a bug fix, not a pessimization. On the other hand, the programmer may know that a certain part of the program only cares about synchronization within the workgroup. For such a program, opting out of transitivity is an optimization, which needs a way to be expressed in LLVM IR.
As I understand it, it's fine to not wait because the release only occurs when the other thread observes the atomic store done as part of the release sequence. This is why we need to spin (loop) on an acquire if we don't have a barrier, for example, because we know the release didn't occur until we load the right value. The problem here is very barrier specific because we're introducing a model where we synchronize without the classic release/acquire sequences that rely on an atomic store. Instead we're adding a barrier + fence pairing, and we synchronize when leaving the barrier. We remove the requirement to spin on the acquire when a barrier is present.
After thinking about this a bit more, I think you're both right. There's a bug. I think in the absence of waits, we could have the following situation:
Then another thread in the workgroup could see the store to B without having the certainty that the store to A is done as well. It'd require some unfortunate timing to see that happen without barriers (hence why we never observed it until now), but as proved here it's possible to see it when using barriers. We can fix it by using the same waits for WGP/CU mode. |
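The two-store situation above can be sketched as a C++11 analogue (a hypothetical illustration with invented names; with the release/acquire pairing in place, i.e. the analogue of the wait this patch adds, observing the second store guarantees the first is visible):

```cpp
// Hypothetical sketch: a plain store to A followed by a release store to B.
// Without ordering, an observer could see the new B with a stale A; with
// release/acquire, seeing B == 1 guarantees seeing A == 1.
#include <atomic>
#include <thread>

static int A = 0;             // earlier store that must complete first
static std::atomic<int> B{0}; // later store that another thread observes

int observe_a_after_b() {
  std::thread writer([] {
    A = 1;                                 // store to A
    B.store(1, std::memory_order_release); // store to B, ordered after A
  });
  int seen_a = -1;
  std::thread reader([&] {
    while (B.load(std::memory_order_acquire) != 1) {} // wait until B is visible
    seen_a = A; // the release/acquire pairing guarantees this reads 1
  });
  writer.join();
  reader.join();
  return seen_a;
}
```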
Being safe sounds good to me. I honestly haven't thought about various combinations of a store that precedes an atomic store-release operation. In particular what is the hardware memory model implied by the programming guide for different counters. |