Conversation

Pierre-vh
Contributor

The previous implementation only waited on `vm_vsrc(0)`, which is enough to
ensure that all threads in the workgroup see the stores done before the barrier.
However, despite seeing the stores, the threads were unable to release them
at a wider scope. This caused failures in Vulkan CTS tests.

To correctly fulfill the memory model semantics, which require happens-before
to be transitive, we must wait for the stores to actually complete before the
barrier, so that another thread can release them.

Note that we still don't need to do anything for WGP mode because release fences
are strong enough in that mode. This only applies to CU mode because CU release
fences do not emit any code.

Solves SC1-6454
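
As an editorial illustration (not taken from the patch or its tests; the kernel and pointer names are hypothetical), the affected pattern looks roughly like this in LLVM IR:

; In CU mode on GFX10-12, the workgroup release fence emits no wait, so the
; wait on outstanding stores has to happen at the barrier for another thread
; to be able to release the stored data at a wider scope afterwards.
define amdgpu_kernel void @release_then_barrier(ptr addrspace(1) %out) {
entry:
  store i32 1, ptr addrspace(1) %out
  fence syncscope("workgroup") release
  call void @llvm.amdgcn.s.barrier()
  fence syncscope("workgroup") acquire
  ret void
}

declare void @llvm.amdgcn.s.barrier()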


Pierre-vh marked this pull request as ready for review September 24, 2025 11:10
@llvmbot
Member

llvmbot commented Sep 24, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)


Full diff: https://github.com/llvm/llvm-project/pull/160501.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+29-8)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll (-1)
  • (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll (+6-10)
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index c85d2bb9fe9ae..ba6c29a855ebf 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -637,6 +637,8 @@ class SIGfx12CacheControl : public SIGfx11CacheControl {
                      SIAtomicAddrSpace AddrSpace, bool IsCrossAddrSpaceOrdering,
                      Position Pos) const override;
 
+  bool insertBarrierStart(MachineBasicBlock::iterator &MI) const override;
+
   bool enableLoadCacheBypass(const MachineBasicBlock::iterator &MI,
                              SIAtomicScope Scope,
                              SIAtomicAddrSpace AddrSpace) const override {
@@ -2174,17 +2176,19 @@ bool SIGfx10CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
 
 bool SIGfx10CacheControl::insertBarrierStart(
     MachineBasicBlock::iterator &MI) const {
-  // We need to wait on vm_vsrc so barriers can pair with fences in GFX10+ CU
-  // mode. This is because a CU mode release fence does not emit any wait, which
-  // is fine when only dealing with vmem, but isn't sufficient in the presence
-  // of barriers which do not go through vmem.
-  // GFX12.5 does not require this additional wait.
-  if (!ST.isCuModeEnabled() || ST.hasGFX1250Insts())
+  if (!ST.isCuModeEnabled())
     return false;
 
+  // GFX10/11 CU MODE Workgroup fences do not emit anything.
+  // In the presence of barriers, we want to make sure previous memory
+  // operations are actually visible and can be released at a wider scope by
+  // another thread upon exiting the barrier. To make this possible, we must
+  // wait on previous stores.
+
   BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
-          TII->get(AMDGPU::S_WAITCNT_DEPCTR))
-      .addImm(AMDGPU::DepCtr::encodeFieldVmVsrc(0));
+          TII->get(AMDGPU::S_WAITCNT_VSCNT_soft))
+      .addReg(AMDGPU::SGPR_NULL, RegState::Undef)
+      .addImm(0);
   return true;
 }
 
@@ -2570,6 +2574,23 @@ bool SIGfx12CacheControl::insertRelease(MachineBasicBlock::iterator &MI,
   return Changed;
 }
 
+bool SIGfx12CacheControl::insertBarrierStart(
+    MachineBasicBlock::iterator &MI) const {
+  if (!ST.isCuModeEnabled() || ST.hasGFX1250Insts())
+    return false;
+
+  // GFX12 CU MODE Workgroup fences do not emit anything (except in GFX12.5).
+  // In the presence of barriers, we want to make sure previous memory
+  // operations are actually visible and can be released at a wider scope by
+  // another thread upon exiting the barrier. To make this possible, we must
+  // wait on previous stores.
+
+  BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
+          TII->get(AMDGPU::S_WAIT_STORECNT_soft))
+      .addImm(0);
+  return true;
+}
+
 bool SIGfx12CacheControl::enableVolatileAndOrNonTemporal(
     MachineBasicBlock::iterator &MI, SIAtomicAddrSpace AddrSpace, SIMemOp Op,
     bool IsVolatile, bool IsNonTemporal, bool IsLastUse = false) const {
diff --git a/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
index b91963f08681c..d23509b5aa812 100644
--- a/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
+++ b/llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
@@ -150,7 +150,6 @@ define amdgpu_kernel void @barrier_release(<4 x i32> inreg %rsrc,
 ; GFX10CU-NEXT:    buffer_load_dword v0, s[8:11], 0 offen lds
 ; GFX10CU-NEXT:    v_mov_b32_e32 v0, s13
 ; GFX10CU-NEXT:    s_waitcnt vmcnt(0)
-; GFX10CU-NEXT:    s_waitcnt_depctr 0xffe3
 ; GFX10CU-NEXT:    s_barrier
 ; GFX10CU-NEXT:    ds_read_b32 v0, v0
 ; GFX10CU-NEXT:    s_waitcnt lgkmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll
index 516c3946f63dc..b5c3cce160a22 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll
@@ -15,7 +15,7 @@ define amdgpu_kernel void @test_s_barrier() {
 ;
 ; GFX10-CU-LABEL: test_s_barrier:
 ; GFX10-CU:       ; %bb.0: ; %entry
-; GFX10-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX10-CU-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-CU-NEXT:    s_barrier
 ; GFX10-CU-NEXT:    s_endpgm
 ;
@@ -26,7 +26,7 @@ define amdgpu_kernel void @test_s_barrier() {
 ;
 ; GFX11-CU-LABEL: test_s_barrier:
 ; GFX11-CU:       ; %bb.0: ; %entry
-; GFX11-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX11-CU-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-CU-NEXT:    s_barrier
 ; GFX11-CU-NEXT:    s_endpgm
 ;
@@ -38,7 +38,7 @@ define amdgpu_kernel void @test_s_barrier() {
 ;
 ; GFX12-CU-LABEL: test_s_barrier:
 ; GFX12-CU:       ; %bb.0: ; %entry
-; GFX12-CU-NEXT:    s_wait_alu 0xffe3
+; GFX12-CU-NEXT:    s_wait_storecnt 0x0
 ; GFX12-CU-NEXT:    s_barrier_signal -1
 ; GFX12-CU-NEXT:    s_barrier_wait -1
 ; GFX12-CU-NEXT:    s_endpgm
@@ -64,7 +64,7 @@ define amdgpu_kernel void @test_s_barrier_workgroup_fence() {
 ; GFX10-CU-LABEL: test_s_barrier_workgroup_fence:
 ; GFX10-CU:       ; %bb.0: ; %entry
 ; GFX10-CU-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX10-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX10-CU-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX10-CU-NEXT:    s_barrier
 ; GFX10-CU-NEXT:    s_endpgm
 ;
@@ -78,7 +78,7 @@ define amdgpu_kernel void @test_s_barrier_workgroup_fence() {
 ; GFX11-CU-LABEL: test_s_barrier_workgroup_fence:
 ; GFX11-CU:       ; %bb.0: ; %entry
 ; GFX11-CU-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-CU-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX11-CU-NEXT:    s_waitcnt_vscnt null, 0x0
 ; GFX11-CU-NEXT:    s_barrier
 ; GFX11-CU-NEXT:    s_endpgm
 ;
@@ -94,8 +94,7 @@ define amdgpu_kernel void @test_s_barrier_workgroup_fence() {
 ;
 ; GFX12-CU-LABEL: test_s_barrier_workgroup_fence:
 ; GFX12-CU:       ; %bb.0: ; %entry
-; GFX12-CU-NEXT:    s_wait_dscnt 0x0
-; GFX12-CU-NEXT:    s_wait_alu 0xffe3
+; GFX12-CU-NEXT:    s_wait_storecnt_dscnt 0x0
 ; GFX12-CU-NEXT:    s_barrier_signal -1
 ; GFX12-CU-NEXT:    s_barrier_wait -1
 ; GFX12-CU-NEXT:    s_endpgm
@@ -125,7 +124,6 @@ define amdgpu_kernel void @test_s_barrier_agent_fence() {
 ; GFX10-CU:       ; %bb.0: ; %entry
 ; GFX10-CU-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX10-CU-NEXT:    s_waitcnt_vscnt null, 0x0
-; GFX10-CU-NEXT:    s_waitcnt_depctr 0xffe3
 ; GFX10-CU-NEXT:    s_barrier
 ; GFX10-CU-NEXT:    s_endpgm
 ;
@@ -140,7 +138,6 @@ define amdgpu_kernel void @test_s_barrier_agent_fence() {
 ; GFX11-CU:       ; %bb.0: ; %entry
 ; GFX11-CU-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
 ; GFX11-CU-NEXT:    s_waitcnt_vscnt null, 0x0
-; GFX11-CU-NEXT:    s_waitcnt_depctr 0xffe3
 ; GFX11-CU-NEXT:    s_barrier
 ; GFX11-CU-NEXT:    s_endpgm
 ;
@@ -160,7 +157,6 @@ define amdgpu_kernel void @test_s_barrier_agent_fence() {
 ; GFX12-CU-NEXT:    s_wait_samplecnt 0x0
 ; GFX12-CU-NEXT:    s_wait_storecnt 0x0
 ; GFX12-CU-NEXT:    s_wait_loadcnt_dscnt 0x0
-; GFX12-CU-NEXT:    s_wait_alu 0xffe3
 ; GFX12-CU-NEXT:    s_barrier_signal -1
 ; GFX12-CU-NEXT:    s_barrier_wait -1
 ; GFX12-CU-NEXT:    s_endpgm

Contributor

perlfu left a comment


LGTM - as far as an immediate fix for the issue goes.

Beyond the scope of this change... It would be nice if we could define where vm_vsrc(0) would be sufficient, and be able to apply that as an optimization. My suspicion is that it is sufficient in the majority of graphics scenarios.

@jayfoad
Copy link
Contributor

jayfoad commented Sep 24, 2025

Does this need a change in the memory model docs?

Collaborator

nhaehnle left a comment


I don't think we should ever unconditionally insert waits purely based on a barrier. This pessimizes LDS-only barriers too much.

Quite frankly, I'd rather admit that trying to reduce CU-mode release fences to only VM_VSRC waits was a mistake, and go back to using the same waits there as for WGP-mode.

@nhaehnle
Collaborator

Beyond the scope of this change... It would be nice if we could define where vm_vsrc(0) would be sufficient, and be able to apply that as an optimization. My suspicion is that it is sufficient in the majority of graphics scenarios.

I'm not so sure about that, unfortunately. The example you showed offline demonstrated a problem where wave A has a workgroup-scope release fence and wave B then has an agent-scope release fence that should push out the data from A as well.

Since we can't do long-distance static analysis to understand what happens in other waves, we can only reduce a workgroup-scope release fence to vm_vsrc(0) if we change the code sequence for agent-scope release fences in a way that establishes this guarantee, and I'm not convinced that the hardware we have guarantees that. We can follow up offline.

@Pierre-vh
Contributor Author

Quite frankly, I'd rather admit that trying to reduce CU-mode release fences to only VM_VSRC waits was a mistake, and go back to using the same waits there as for WGP-mode.

We never emitted a vm_vsrc(0) wait for workgroup release fences in CU mode.

This pessimizes LDS-only barriers too much.

Which LDS-only barriers? The check only includes S_BARRIER.

Do you think we should instead pessimize all workgroup release fences in CU mode so they have a wait on storecnt?

@Pierre-vh
Contributor Author

Does this need a change in the memory model docs?

We don't have the documentation for that upstream yet; I'll fix the downstream one.

Pierre-vh requested a review from nhaehnle September 25, 2025 09:21
@nhaehnle
Collaborator

nhaehnle commented Oct 1, 2025

This pessimizes LDS-only barriers too much.

Which LDS-only barriers? The check only includes S_BARRIER.

Right, and an S_BARRIER by itself is not (or shouldn't be) evidence of any desired memory ordering. We need a fence for that, and if the fence is limited to LDS, then there should not be any loadcnt or storecnt waits.

Do you think we should instead pessimize all workgroup release fences in CU mode so they have a wait on storecnt?

Is it a pessimization? I don't think so. Isn't the example @perlfu gave offline evidence that if a release fence intends to fence global memory, then a storecnt wait is pretty much unavoidable?

Edit for context: The example was:

image store N
fence release workgroup scope
workgroup barrier
fence acquire workgroup scope
(conditional) fence release device (agent) scope
(conditional) atomic store to X
atomic load from Y
fence acquire device (agent) scope
image load M

... where Y/M of thread B is X/N of thread A. The image load in thread B should see the store in A (assuming the atomic load in B is after the atomic store in A in the memory location order of X == Y).

For that to be guaranteed, the workgroup scope release fence has to imply a storecnt wait.

But it shouldn't really matter that the synchronization mechanism in question is a barrier: the same would be true if the barrier was replaced e.g. by LDS atomics.
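
A rough, editorial rendering of that example in LLVM IR might look like the sketch below. The kernel and pointer names are hypothetical, the conditional parts are elided, and in practice the agent-scope release may be executed by a different thread of the same workgroup, which is exactly where transitivity matters. Assume %y/%m in @thread_b alias %x/%n in @thread_a:

define amdgpu_kernel void @thread_a(ptr addrspace(1) %n, ptr addrspace(1) %x) {
  store i32 42, ptr addrspace(1) %n                            ; image store N (approximated as a global store)
  fence syncscope("workgroup") release
  call void @llvm.amdgcn.s.barrier()                           ; workgroup barrier
  fence syncscope("workgroup") acquire
  fence syncscope("agent") release                             ; device (agent) scope release
  store atomic i32 1, ptr addrspace(1) %x monotonic, align 4   ; atomic store to X
  ret void
}

define amdgpu_kernel void @thread_b(ptr addrspace(1) %y, ptr addrspace(1) %m, ptr addrspace(1) %out) {
  %v = load atomic i32, ptr addrspace(1) %y monotonic, align 4 ; atomic load from Y
  fence syncscope("agent") acquire                             ; device (agent) scope acquire
  %w = load i32, ptr addrspace(1) %m                           ; image load M: must see the store of N
  store i32 %w, ptr addrspace(1) %out
  ret void
}

declare void @llvm.amdgcn.s.barrier()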

@ssahasra
Collaborator

ssahasra commented Oct 1, 2025

Do you think we should instead pessimize all workgroup release fences in CU mode so they have a wait on storecnt?

Is it a pessimization? I don't think so. Isn't the example @perlfu gave offline evidence that if a release fence intends to fence global memory, then a storecnt wait is pretty much unavoidable?

I agree. It's a bug fix, not a pessimization. On the other hand, the programmer may know that a certain part of the program only cares about synchronization within the workgroup. For such a program, opting out of transitivity is an optimization, which needs a way to be expressed in LLVM IR.

@Pierre-vh
Contributor Author

Do you think we should instead pessimize all workgroup release fences in CU mode so they have a wait on storecnt?

Is it a pessimization? I don't think so. Isn't the example @perlfu gave offline evidence that if a release fence intends to fence global memory, then a storecnt wait is pretty much unavoidable?

I agree. It's a bug fix, not a pessimization. On the other hand, the programmer may know that a certain part of the program only cares about synchronization within the workgroup. For such a program, opting out of transitivity is an optimization, which needs a way to be expressed in LLVM IR.

As I understand it, it's fine not to wait because the release only takes effect when the other thread observes the atomic store done as part of the release sequence. This is why we need to spin (loop) on an acquire if we don't have a barrier, for example: we know the release hasn't happened until we load the right value.
So if we take barriers out of the picture, it is fine not to wait because when the store is seen, all previous stores are seen as well (for CU mode at workgroup scope).

The problem here is very barrier-specific because we're introducing a model where we synchronize without the classic release/acquire sequences that rely on an atomic store. Instead we're adding a barrier + fence pairing, and we synchronize when leaving the barrier. We remove the requirement to spin on the acquire when a barrier is present.
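
For reference, a minimal sketch of the consumer side of that classic sequence, with hypothetical names (this is the spinning acquire being contrasted with the barrier + fence pairing):

define amdgpu_kernel void @consumer(ptr addrspace(1) %flag, ptr addrspace(1) %data, ptr addrspace(1) %out) {
entry:
  br label %spin
spin:                                               ; loop until the producer's release store is visible
  %f = load atomic i32, ptr addrspace(1) %flag syncscope("workgroup") acquire, align 4
  %ready = icmp eq i32 %f, 1
  br i1 %ready, label %done, label %spin
done:
  %v = load i32, ptr addrspace(1) %data             ; guaranteed to see everything published before the release
  store i32 %v, ptr addrspace(1) %out
  ret void
}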

@Pierre-vh
Contributor Author

After thinking about this a bit more, I think you're both right. There's a bug. I think in the absence of waits, we could have the following situation:

Thread 0:
  store atomic A relaxed
  store atomic B release syncscope("workgroup")

Then another thread in the workgroup could see the store to B without any certainty that the store to A is done as well; the store to A could still be held in another memory channel, for example.

It'd require some unfortunate timing to see that happen without barriers (which is why we never observed it until now), but as shown here it can happen when barriers are used.

We can fix it by using the same waits for WGP/CU mode.
Does everyone agree with that?

@ssahasra
Collaborator

ssahasra commented Oct 1, 2025

Being safe sounds good to me. I honestly haven't thought through the various combinations of a store preceding an atomic store-release operation, in particular what hardware memory model the programming guide implies for the different counters.
