Conversation

@nikic (Contributor) commented Aug 26, 2025

GEPs are often in the form `gep [N x %T], ptr %p, i64 0, i64 %idx`. Canonicalize these to `gep %T, ptr %p, i64 %idx`.

This enables transforms that only support one GEP index to work and improves CSE.

Various transforms were recently hardened to make sure they still work without the leading index. #155404 is the last problematic transform I'm aware of.
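
For illustration, a minimal sketch of why this helps CSE (hypothetical IR, not taken from the patch's tests): both GEPs below compute the same address, but before this change they are syntactically different instructions.

  ; before: the same address written two ways does not CSE
  %a = getelementptr inbounds [16 x i32], ptr %p, i64 0, i64 %idx
  %b = getelementptr inbounds i32, ptr %p, i64 %idx

  ; after stripping the leading zero index, %a becomes identical to %b
  %a = getelementptr inbounds i32, ptr %p, i64 %idx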

@nikic nikic requested review from davemgreen and dtcxzyw August 26, 2025 13:37
@llvmbot llvmbot added the clang, backend:AMDGPU, llvm:instcombine, and llvm:transforms labels Aug 26, 2025
@llvmbot (Member) commented Aug 26, 2025

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-clang

Author: Nikita Popov (nikic)

Changes

GEPs are often in the form `gep [N x %T], ptr %p, i64 0, i64 %idx`. Canonicalize these to `gep %T, ptr %p, i64 %idx`.

This enables transforms that only support one GEP index to work and improves CSE.

Various transforms were recently hardened to make sure they still work without the leading index. #155404 is the last problematic transform I'm aware of.


Patch is 213.30 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/155415.diff

56 Files Affected:

  • (modified) clang/test/CodeGen/attr-counted-by.c (+94-106)
  • (modified) clang/test/CodeGen/union-tbaa1.c (+3-3)
  • (modified) llvm/lib/Transforms/InstCombine/InstructionCombining.cpp (+13)
  • (modified) llvm/test/CodeGen/AMDGPU/vector-alloca-addrspacecast.ll (+1-1)
  • (modified) llvm/test/Transforms/InstCombine/2006-12-15-Range-Test.ll (+2-2)
  • (modified) llvm/test/Transforms/InstCombine/AMDGPU/memcpy-from-constant.ll (+10-10)
  • (modified) llvm/test/Transforms/InstCombine/canonicalize-gep-constglob.ll (+4-4)
  • (modified) llvm/test/Transforms/InstCombine/cast.ll (+1-1)
  • (modified) llvm/test/Transforms/InstCombine/cast_phi.ll (+4-4)
  • (modified) llvm/test/Transforms/InstCombine/gep-addrspace.ll (+3-3)
  • (modified) llvm/test/Transforms/InstCombine/gep-alias.ll (+1-1)
  • (modified) llvm/test/Transforms/InstCombine/gep-vector.ll (+16-16)
  • (modified) llvm/test/Transforms/InstCombine/getelementptr.ll (+5-5)
  • (modified) llvm/test/Transforms/InstCombine/load-bitcast-select.ll (+2-2)
  • (modified) llvm/test/Transforms/InstCombine/load-cmp.ll (+2-2)
  • (modified) llvm/test/Transforms/InstCombine/loadstore-alignment.ll (+2-2)
  • (modified) llvm/test/Transforms/InstCombine/mem-gep-zidx.ll (+1-1)
  • (modified) llvm/test/Transforms/InstCombine/memchr-11.ll (+1-1)
  • (modified) llvm/test/Transforms/InstCombine/memcmp-8.ll (+1-1)
  • (modified) llvm/test/Transforms/InstCombine/memcpy-addrspace.ll (+6-6)
  • (modified) llvm/test/Transforms/InstCombine/memcpy-from-global.ll (+11-11)
  • (modified) llvm/test/Transforms/InstCombine/opaque-ptr.ll (+2-2)
  • (modified) llvm/test/Transforms/InstCombine/pr58901.ll (+2-2)
  • (modified) llvm/test/Transforms/InstCombine/ptrtoint-nullgep.ll (+1-1)
  • (modified) llvm/test/Transforms/InstCombine/remove-loop-phi-multiply-by-zero.ll (+6-6)
  • (modified) llvm/test/Transforms/InstCombine/strcmp-3.ll (+3-3)
  • (modified) llvm/test/Transforms/InstCombine/strlen-1.ll (+4-4)
  • (modified) llvm/test/Transforms/InstCombine/strlen-4.ll (+8-8)
  • (modified) llvm/test/Transforms/InstCombine/strlen-7.ll (+4-4)
  • (modified) llvm/test/Transforms/InstCombine/strlen-8.ll (+5-5)
  • (modified) llvm/test/Transforms/InstCombine/strlen-9.ll (+2-2)
  • (modified) llvm/test/Transforms/InstCombine/strnlen-3.ll (+9-9)
  • (modified) llvm/test/Transforms/InstCombine/strnlen-4.ll (+3-3)
  • (modified) llvm/test/Transforms/InstCombine/strnlen-5.ll (+2-2)
  • (modified) llvm/test/Transforms/InstCombine/sub-gep.ll (+4-4)
  • (modified) llvm/test/Transforms/InstCombine/vec_demanded_elts-inseltpoison.ll (+1-1)
  • (modified) llvm/test/Transforms/InstCombine/vec_demanded_elts.ll (+1-1)
  • (modified) llvm/test/Transforms/InstCombine/wcslen-1.ll (+4-4)
  • (modified) llvm/test/Transforms/InstCombine/wcslen-3.ll (+2-2)
  • (modified) llvm/test/Transforms/InstCombine/wcslen-5.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll (+10-10)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve2-histcnt.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/small-size.ll (+22-22)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/x86_fp80-vector-store.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/interleaved-accesses.ll (+5-5)
  • (modified) llvm/test/Transforms/LoopVectorize/multiple-address-spaces.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/non-const-n.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/reduction-inloop.ll (+24-24)
  • (modified) llvm/test/Transforms/PhaseOrdering/AArch64/matrix-extract-insert.ll (+26-26)
  • (modified) llvm/test/Transforms/PhaseOrdering/X86/excessive-unrolling.ll (+6-6)
  • (modified) llvm/test/Transforms/PhaseOrdering/X86/merge-functions2.ll (+1-1)
  • (modified) llvm/test/Transforms/PhaseOrdering/X86/merge-functions3.ll (+1-1)
  • (modified) llvm/test/Transforms/PhaseOrdering/X86/simplifycfg-late.ll (+1-1)
  • (modified) llvm/test/Transforms/PhaseOrdering/scev-custom-dl.ll (+1-1)
  • (modified) llvm/test/Transforms/PhaseOrdering/single-iteration-loop-sroa.ll (+2-2)
  • (modified) llvm/test/Transforms/SimplifyCFG/Hexagon/switch-to-lookup-table.ll (+1-1)
diff --git a/clang/test/CodeGen/attr-counted-by.c b/clang/test/CodeGen/attr-counted-by.c
index 59e1b134850a9..cb23efdb8f263 100644
--- a/clang/test/CodeGen/attr-counted-by.c
+++ b/clang/test/CodeGen/attr-counted-by.c
@@ -72,7 +72,7 @@ struct anon_struct {
 // SANITIZE-WITH-ATTR-NEXT:    unreachable, !nosanitize [[META2]]
 // SANITIZE-WITH-ATTR:       cont3:
 // SANITIZE-WITH-ATTR-NEXT:    [[ARRAY:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 12
-// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM]]
+// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i32, ptr [[ARRAY]], i64 [[IDXPROM]]
 // SANITIZE-WITH-ATTR-NEXT:    store i32 [[VAL]], ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA4:![0-9]+]]
 // SANITIZE-WITH-ATTR-NEXT:    ret void
 //
@@ -81,7 +81,7 @@ struct anon_struct {
 // NO-SANITIZE-WITH-ATTR-NEXT:  entry:
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAY:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 12
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[IDXPROM:%.*]] = sext i32 [[INDEX]] to i64
-// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM]]
+// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 [[IDXPROM]]
 // NO-SANITIZE-WITH-ATTR-NEXT:    store i32 [[VAL]], ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA2:![0-9]+]]
 // NO-SANITIZE-WITH-ATTR-NEXT:    ret void
 //
@@ -90,7 +90,7 @@ struct anon_struct {
 // SANITIZE-WITHOUT-ATTR-NEXT:  entry:
 // SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAY:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 12
 // SANITIZE-WITHOUT-ATTR-NEXT:    [[IDXPROM:%.*]] = sext i32 [[INDEX]] to i64
-// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM]]
+// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 [[IDXPROM]]
 // SANITIZE-WITHOUT-ATTR-NEXT:    store i32 [[VAL]], ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA2:![0-9]+]]
 // SANITIZE-WITHOUT-ATTR-NEXT:    ret void
 //
@@ -99,7 +99,7 @@ struct anon_struct {
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:  entry:
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAY:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 12
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[IDXPROM:%.*]] = sext i32 [[INDEX]] to i64
-// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM]]
+// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 [[IDXPROM]]
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    store i32 [[VAL]], ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA2:![0-9]+]]
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    ret void
 //
@@ -120,7 +120,7 @@ void test1(struct annotated *p, int index, int val) {
 // SANITIZE-WITH-ATTR-NEXT:    unreachable, !nosanitize [[META2]]
 // SANITIZE-WITH-ATTR:       cont6:
 // SANITIZE-WITH-ATTR-NEXT:    [[ARRAY:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 12
-// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw [0 x i32], ptr [[ARRAY]], i64 0, i64 [[INDEX]]
+// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i32, ptr [[ARRAY]], i64 [[INDEX]]
 // SANITIZE-WITH-ATTR-NEXT:    [[TMP2:%.*]] = tail call i32 @llvm.smax.i32(i32 [[COUNTED_BY_LOAD]], i32 0)
 // SANITIZE-WITH-ATTR-NEXT:    [[CONV:%.*]] = shl i32 [[TMP2]], 2
 // SANITIZE-WITH-ATTR-NEXT:    store i32 [[CONV]], ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA4]]
@@ -134,7 +134,7 @@ void test1(struct annotated *p, int index, int val) {
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[COUNTED_BY_LOAD:%.*]] = load i32, ptr [[COUNTED_BY_GEP]], align 4
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[TMP0:%.*]] = tail call i32 @llvm.smax.i32(i32 [[COUNTED_BY_LOAD]], i32 0)
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[CONV:%.*]] = shl i32 [[TMP0]], 2
-// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw [0 x i32], ptr [[ARRAY]], i64 0, i64 [[INDEX]]
+// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i32, ptr [[ARRAY]], i64 [[INDEX]]
 // NO-SANITIZE-WITH-ATTR-NEXT:    store i32 [[CONV]], ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA2]]
 // NO-SANITIZE-WITH-ATTR-NEXT:    ret void
 //
@@ -142,7 +142,7 @@ void test1(struct annotated *p, int index, int val) {
 // SANITIZE-WITHOUT-ATTR-SAME: ptr noundef [[P:%.*]], i64 noundef [[INDEX:%.*]]) local_unnamed_addr #[[ATTR0]] {
 // SANITIZE-WITHOUT-ATTR-NEXT:  entry:
 // SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAY:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 12
-// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw [0 x i32], ptr [[ARRAY]], i64 0, i64 [[INDEX]]
+// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i32, ptr [[ARRAY]], i64 [[INDEX]]
 // SANITIZE-WITHOUT-ATTR-NEXT:    store i32 -1, ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA2]]
 // SANITIZE-WITHOUT-ATTR-NEXT:    ret void
 //
@@ -150,7 +150,7 @@ void test1(struct annotated *p, int index, int val) {
 // NO-SANITIZE-WITHOUT-ATTR-SAME: ptr noundef [[P:%.*]], i64 noundef [[INDEX:%.*]]) local_unnamed_addr #[[ATTR0]] {
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:  entry:
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAY:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 12
-// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw [0 x i32], ptr [[ARRAY]], i64 0, i64 [[INDEX]]
+// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i32, ptr [[ARRAY]], i64 [[INDEX]]
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    store i32 -1, ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA2]]
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    ret void
 //
@@ -243,7 +243,7 @@ size_t test2_bdos_cast(struct annotated *p) {
 // SANITIZE-WITH-ATTR-NEXT:    unreachable, !nosanitize [[META2]]
 // SANITIZE-WITH-ATTR:       cont3:
 // SANITIZE-WITH-ATTR-NEXT:    [[ARRAY:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 12
-// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw [0 x i32], ptr [[ARRAY]], i64 0, i64 [[INDEX]]
+// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i32, ptr [[ARRAY]], i64 [[INDEX]]
 // SANITIZE-WITH-ATTR-NEXT:    store i32 -1, ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA4]]
 // SANITIZE-WITH-ATTR-NEXT:    ret void
 //
@@ -251,7 +251,7 @@ size_t test2_bdos_cast(struct annotated *p) {
 // NO-SANITIZE-WITH-ATTR-SAME: ptr noundef [[P:%.*]], i64 noundef [[INDEX:%.*]]) local_unnamed_addr #[[ATTR0]] {
 // NO-SANITIZE-WITH-ATTR-NEXT:  entry:
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAY:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 12
-// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw [0 x i32], ptr [[ARRAY]], i64 0, i64 [[INDEX]]
+// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i32, ptr [[ARRAY]], i64 [[INDEX]]
 // NO-SANITIZE-WITH-ATTR-NEXT:    store i32 -1, ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA2]]
 // NO-SANITIZE-WITH-ATTR-NEXT:    ret void
 //
@@ -259,7 +259,7 @@ size_t test2_bdos_cast(struct annotated *p) {
 // SANITIZE-WITHOUT-ATTR-SAME: ptr noundef [[P:%.*]], i64 noundef [[INDEX:%.*]]) local_unnamed_addr #[[ATTR0]] {
 // SANITIZE-WITHOUT-ATTR-NEXT:  entry:
 // SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAY:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 12
-// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw [0 x i32], ptr [[ARRAY]], i64 0, i64 [[INDEX]]
+// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i32, ptr [[ARRAY]], i64 [[INDEX]]
 // SANITIZE-WITHOUT-ATTR-NEXT:    store i32 -1, ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA2]]
 // SANITIZE-WITHOUT-ATTR-NEXT:    ret void
 //
@@ -267,7 +267,7 @@ size_t test2_bdos_cast(struct annotated *p) {
 // NO-SANITIZE-WITHOUT-ATTR-SAME: ptr noundef [[P:%.*]], i64 noundef [[INDEX:%.*]]) local_unnamed_addr #[[ATTR0]] {
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:  entry:
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAY:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 12
-// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw [0 x i32], ptr [[ARRAY]], i64 0, i64 [[INDEX]]
+// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds nuw i32, ptr [[ARRAY]], i64 [[INDEX]]
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    store i32 -1, ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA2]]
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    ret void
 //
@@ -350,7 +350,7 @@ size_t test3_bdos_cast(struct annotated *p) {
 // SANITIZE-WITH-ATTR-NEXT:    [[RESULT:%.*]] = add i32 [[FLEXIBLE_ARRAY_MEMBER_SIZE]], 244
 // SANITIZE-WITH-ATTR-NEXT:    [[TMP4:%.*]] = and i32 [[RESULT]], 252
 // SANITIZE-WITH-ATTR-NEXT:    [[CONV2:%.*]] = select i1 [[TMP3]], i32 [[TMP4]], i32 0
-// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX10:%.*]] = getelementptr inbounds nuw [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM]]
+// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX10:%.*]] = getelementptr inbounds nuw i32, ptr [[ARRAY]], i64 [[IDXPROM]]
 // SANITIZE-WITH-ATTR-NEXT:    store i32 [[CONV2]], ptr [[ARRAYIDX10]], align 4, !tbaa [[TBAA4]]
 // SANITIZE-WITH-ATTR-NEXT:    [[DOTNOT81:%.*]] = icmp eq i32 [[DOTCOUNTED_BY_LOAD]], 3
 // SANITIZE-WITH-ATTR-NEXT:    br i1 [[DOTNOT81]], label [[HANDLER_OUT_OF_BOUNDS18:%.*]], label [[CONT19:%.*]], !prof [[PROF8:![0-9]+]], !nosanitize [[META2]]
@@ -370,7 +370,7 @@ size_t test3_bdos_cast(struct annotated *p) {
 // SANITIZE-WITH-ATTR-NEXT:    [[RESULT25:%.*]] = add i32 [[FLEXIBLE_ARRAY_MEMBER_SIZE]], 240
 // SANITIZE-WITH-ATTR-NEXT:    [[TMP7:%.*]] = and i32 [[RESULT25]], 252
 // SANITIZE-WITH-ATTR-NEXT:    [[CONV27:%.*]] = select i1 [[TMP6]], i32 [[TMP7]], i32 0
-// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX36:%.*]] = getelementptr inbounds nuw [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM31]]
+// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX36:%.*]] = getelementptr inbounds nuw i32, ptr [[ARRAY]], i64 [[IDXPROM31]]
 // SANITIZE-WITH-ATTR-NEXT:    store i32 [[CONV27]], ptr [[ARRAYIDX36]], align 4, !tbaa [[TBAA4]]
 // SANITIZE-WITH-ATTR-NEXT:    [[IDXPROM42:%.*]] = sext i32 [[FAM_IDX]] to i64
 // SANITIZE-WITH-ATTR-NEXT:    [[DOTCOUNTED_BY_LOAD44:%.*]] = load i32, ptr [[DOTCOUNTED_BY_GEP]], align 4
@@ -389,7 +389,7 @@ size_t test3_bdos_cast(struct annotated *p) {
 // SANITIZE-WITH-ATTR-NEXT:    tail call void @__ubsan_handle_out_of_bounds_abort(ptr nonnull @[[GLOB10:[0-9]+]], i64 [[IDXPROM60]]) #[[ATTR8]], !nosanitize [[META2]]
 // SANITIZE-WITH-ATTR-NEXT:    unreachable, !nosanitize [[META2]]
 // SANITIZE-WITH-ATTR:       cont67:
-// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX65:%.*]] = getelementptr inbounds nuw [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM60]]
+// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX65:%.*]] = getelementptr inbounds nuw i32, ptr [[ARRAY]], i64 [[IDXPROM60]]
 // SANITIZE-WITH-ATTR-NEXT:    [[COUNT50:%.*]] = sext i32 [[DOTCOUNTED_BY_LOAD44]] to i64
 // SANITIZE-WITH-ATTR-NEXT:    [[TMP10:%.*]] = sub nsw i64 [[COUNT50]], [[IDXPROM42]]
 // SANITIZE-WITH-ATTR-NEXT:    [[TMP11:%.*]] = tail call i64 @llvm.smax.i64(i64 [[TMP10]], i64 0)
@@ -411,7 +411,7 @@ size_t test3_bdos_cast(struct annotated *p) {
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[TMP1:%.*]] = and i32 [[RESULT]], 252
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[CONV1:%.*]] = select i1 [[TMP0]], i32 [[TMP1]], i32 0
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[IDXPROM:%.*]] = sext i32 [[INDEX]] to i64
-// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX3:%.*]] = getelementptr inbounds [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM]]
+// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 [[IDXPROM]]
 // NO-SANITIZE-WITH-ATTR-NEXT:    store i32 [[CONV1]], ptr [[ARRAYIDX3]], align 4, !tbaa [[TBAA2]]
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[COUNTED_BY_LOAD7:%.*]] = load i32, ptr [[COUNTED_BY_GEP]], align 4
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[FLEXIBLE_ARRAY_MEMBER_SIZE9:%.*]] = shl i32 [[COUNTED_BY_LOAD7]], 2
@@ -419,9 +419,7 @@ size_t test3_bdos_cast(struct annotated *p) {
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[TMP2:%.*]] = icmp sgt i32 [[COUNTED_BY_LOAD7]], 3
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[TMP3:%.*]] = and i32 [[RESULT10]], 252
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[CONV12:%.*]] = select i1 [[TMP2]], i32 [[TMP3]], i32 0
-// NO-SANITIZE-WITH-ATTR-NEXT:    [[ADD:%.*]] = add nsw i32 [[INDEX]], 1
-// NO-SANITIZE-WITH-ATTR-NEXT:    [[IDXPROM14:%.*]] = sext i32 [[ADD]] to i64
-// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX15:%.*]] = getelementptr inbounds [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM14]]
+// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX15:%.*]] = getelementptr i8, ptr [[ARRAYIDX3]], i64 4
 // NO-SANITIZE-WITH-ATTR-NEXT:    store i32 [[CONV12]], ptr [[ARRAYIDX15]], align 4, !tbaa [[TBAA2]]
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[IDXPROM17:%.*]] = sext i32 [[FAM_IDX]] to i64
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[COUNTED_BY_LOAD20:%.*]] = load i32, ptr [[COUNTED_BY_GEP]], align 4
@@ -434,9 +432,7 @@ size_t test3_bdos_cast(struct annotated *p) {
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[TMP8:%.*]] = shl i32 [[DOTTR]], 2
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[TMP9:%.*]] = and i32 [[TMP8]], 252
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[CONV26:%.*]] = select i1 [[TMP7]], i32 [[TMP9]], i32 0
-// NO-SANITIZE-WITH-ATTR-NEXT:    [[ADD28:%.*]] = add nsw i32 [[INDEX]], 2
-// NO-SANITIZE-WITH-ATTR-NEXT:    [[IDXPROM29:%.*]] = sext i32 [[ADD28]] to i64
-// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX30:%.*]] = getelementptr inbounds [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM29]]
+// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX30:%.*]] = getelementptr i8, ptr [[ARRAYIDX3]], i64 8
 // NO-SANITIZE-WITH-ATTR-NEXT:    store i32 [[CONV26]], ptr [[ARRAYIDX30]], align 4, !tbaa [[TBAA2]]
 // NO-SANITIZE-WITH-ATTR-NEXT:    ret void
 //
@@ -445,15 +441,11 @@ size_t test3_bdos_cast(struct annotated *p) {
 // SANITIZE-WITHOUT-ATTR-NEXT:  entry:
 // SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAY:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 12
 // SANITIZE-WITHOUT-ATTR-NEXT:    [[IDXPROM:%.*]] = sext i32 [[INDEX]] to i64
-// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX5:%.*]] = getelementptr inbounds [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM]]
+// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX5:%.*]] = getelementptr i32, ptr [[ARRAY]], i64 [[IDXPROM]]
 // SANITIZE-WITHOUT-ATTR-NEXT:    store i32 255, ptr [[ARRAYIDX5]], align 4, !tbaa [[TBAA2]]
-// SANITIZE-WITHOUT-ATTR-NEXT:    [[ADD:%.*]] = add nsw i32 [[INDEX]], 1
-// SANITIZE-WITHOUT-ATTR-NEXT:    [[IDXPROM17:%.*]] = sext i32 [[ADD]] to i64
-// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX18:%.*]] = getelementptr inbounds [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM17]]
+// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX18:%.*]] = getelementptr i8, ptr [[ARRAYIDX5]], i64 4
 // SANITIZE-WITHOUT-ATTR-NEXT:    store i32 255, ptr [[ARRAYIDX18]], align 4, !tbaa [[TBAA2]]
-// SANITIZE-WITHOUT-ATTR-NEXT:    [[ADD31:%.*]] = add nsw i32 [[INDEX]], 2
-// SANITIZE-WITHOUT-ATTR-NEXT:    [[IDXPROM32:%.*]] = sext i32 [[ADD31]] to i64
-// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX33:%.*]] = getelementptr inbounds [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM32]]
+// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX33:%.*]] = getelementptr i8, ptr [[ARRAYIDX5]], i64 8
 // SANITIZE-WITHOUT-ATTR-NEXT:    store i32 255, ptr [[ARRAYIDX33]], align 4, !tbaa [[TBAA2]]
 // SANITIZE-WITHOUT-ATTR-NEXT:    ret void
 //
@@ -462,15 +454,11 @@ size_t test3_bdos_cast(struct annotated *p) {
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:  entry:
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAY:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 12
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[IDXPROM:%.*]] = sext i32 [[INDEX]] to i64
-// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX3:%.*]] = getelementptr inbounds [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM]]
+// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, ptr [[ARRAY]], i64 [[IDXPROM]]
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    store i32 255, ptr [[ARRAYIDX3]], align 4, !tbaa [[TBAA2]]
-// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ADD:%.*]] = add nsw i32 [[INDEX]], 1
-// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[IDXPROM9:%.*]] = sext i32 [[ADD]] to i64
-// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX10:%.*]] = getelementptr inbounds [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM9]]
+// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX10:%.*]] = getelementptr i8, ptr [[ARRAYIDX3]], i64 4
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    store i32 255, ptr [[ARRAYIDX10]], align 4, !tbaa [[TBAA2]]
-// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ADD17:%.*]] = add nsw i32 [[INDEX]], 2
-// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[IDXPROM18:%.*]] = sext i32 [[ADD17]] to i64
-// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX19:%.*]] = getelementptr inbounds [0 x i32], ptr [[ARRAY]], i64 0, i64 [[IDXPROM18]]
+// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX19:%.*]] = getelementptr i8, ptr [[ARRAYIDX3]], i64 8
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    store i32 255, ptr [[ARRAYIDX19]], align 4, !tbaa [[TBAA2]]
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    ret void
 //
@@ -632,7 +620,7 @@ size_t test4_bdos_cast2(struct annotated *p, int index) {
 // SANITIZE-WITH-ATTR-NEXT:    unreachable, !nosanitize [[META2]]
 // SANITIZE-WITH-ATTR:       cont3:
 // SANITIZE-WITH-ATTR-NEXT:    [[TMP1:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 16
-// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [0 x i32], ptr [[TMP1]], i64 0, i64 [[IDXPROM]]
+// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i64 [[IDXPROM]]
 // SANITIZE-WITH-ATTR-NEXT:    store i32 -1, ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA4]]
 // SANITIZE-WITH-ATTR-NEXT:    ret void
 //
@@ -641,7 +629,7 @@ size_t test4_bdos_cast2(struct annotated *p, int index) {
 // NO-SANITIZE-WITH-ATTR-NEXT:  entry:
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[TMP0:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 16
 // NO-SANITIZE-WITH-ATTR-NEXT:    [[IDXPROM:%.*]] = sext i32 [[INDEX]] to i64
-// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [0 x i32], ptr [[TMP0]], i64 0, i64 [[IDXPROM]]
+// NO-SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i64 [[IDXPROM]]
 // NO-SANITIZE-WITH-ATTR-NEXT:    store i32 -1, ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA2]]
 // NO-SANITIZE-WITH-ATTR-NEXT:    ret void
 //
@@ -650,7 +638,7 @@ size_t test4_bdos_cast2(struct annotated *p, int index) {
 // SANITIZE-WITHOUT-ATTR-NEXT:  entry:
 // SANITIZE-WITHOUT-ATTR-NEXT:    [[TMP0:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 16
 // SANITIZE-WITHOUT-ATTR-NEXT:    [[IDXPROM:%.*]] = sext i32 [[INDEX]] to i64
-// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [0 x i32], ptr [[TMP0]], i64 0, i64 [[IDXPROM]]
+// SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i64 [[IDXPROM]]
 // SANITIZE-WITHOUT-ATTR-NEXT:    store i32 -1, ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA2]]
 // SANITIZE-WITHOUT-ATTR-NEXT:    ret void
 //
@@ -659,7 +647,7 @@ size_t test4_bdos_cast2(struct annotated *p, int index) {
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:  entry:
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[TMP0:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 16
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[IDXPROM:%.*]] = sext i32 [[INDEX]] to i64
-// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [0 x i32], ptr [[TMP0]], i64 0, i64 [[IDXPROM]]
+// NO-SANITIZE-WITHOUT-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[TMP0]], i64 [[IDXPROM]]
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    store i32 -1, ptr [[ARRAYIDX]], align 4, !tbaa [[TBAA2]]
 // NO-SANITIZE-WITHOUT-ATTR-NEXT:    ret void
 //
@@ -704,7 +692,7 @@ size_t test5_bdos(struct anon_struct *p) {
 // SANITIZE-WITH-ATTR-NEXT:    unreachable, !nosanitize [[META2]]
 // SANITIZE-WITH-ATTR:       cont6:
 // SANITIZE-WITH-ATTR-NEXT:    [[TMP1:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 16
-// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds [0 x i32], ptr [[TMP1]], i64 0, i64 [[IDXPROM]]
+// SANITIZE-WITH-ATTR-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i64 [[IDXPROM]]
 // SANITIZE-WITH-ATTR-NEXT:    [[FLEXIBLE_ARRAY_MEMBER_SIZE:%.*]] = shl nuw i64 [[COUNTED_BY_LOAD]], 2
 // SANITIZE-WITH-ATTR-...
[truncated]

@davemgreen (Collaborator)

Oh - I was about to put up fd286bd. That was only going to try with array types, not with vectors too, and I believe the only other semi-useful part was some DA tests I had tried adding without array indices (which looked fine for all the ones I tested).

The differences I have seen have mostly been extra hoisting causing extra unrolling. I would be interested to see if @dtcxzyw finds anything more.

@nikic (Contributor, Author) commented Aug 26, 2025

@zyw-bot csmith-fuzz

@nikic (Contributor, Author) commented Aug 26, 2025

Based on llvm-opt-benchmark, the main impact this has is enabling the `gep + add -> gep + gep` fold in many more cases (there are a few other transforms that only support single-index GEPs, but they're less significant). This mostly has positive effects (many `gep + add`s become a constant-offset GEP whose base gets CSEd). The main issue I saw is that we now sometimes fail to eliminate null checks of the form `gep inbounds (p, x-1) == null`, because these are now converted to `gep (gep p, x), -1`, losing the flags.
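
A minimal sketch of that regression, with hypothetical values (the exact flag placement depends on how the split is emitted):

  ; before: one inbounds GEP carries the whole x-1 offset, and the
  ; null check can be folded away based on it
  %off = add nsw i64 %x, -1
  %q   = getelementptr inbounds i32, ptr %p, i64 %off
  %c   = icmp eq ptr %q, null

  ; after the gep + add -> gep + gep split: the constant part is peeled
  ; into a separate GEP that no longer carries inbounds, so the null
  ; check is no longer eliminated
  %q0  = getelementptr inbounds i32, ptr %p, i64 %x
  %q   = getelementptr i32, ptr %q0, i64 -1
  %c   = icmp eq ptr %q, null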

@nikic (Contributor, Author) commented Aug 27, 2025

The main issue I saw is that we now sometimes fail to eliminate null checks of the form gep inbounds (p, x-1) == null, which are converted to gep (gep p, x), -1 now, losing flags.

I checked whether we can mitigate this by folding null checks before doing this transform, but at least in the one case I looked at, the null check only gets inlined after this fold has already happened.

@nikic (Contributor, Author) commented Aug 28, 2025

The main issue I saw is that we now sometimes fail to eliminate null checks of the form gep inbounds (p, x-1) == null, which are converted to gep (gep p, x), -1 now, losing flags.

I checked whether we can mitigate this by folding null checks before doing this transform, but at least in the one case I looked at, the null check only gets inlined after this fold has already happened.

I also checked whether not doing this transform for constant negative offsets might make sense, but it turns out that it's pretty widely beneficial even in that case.

So I don't think there's anything we can really do about those regressions.

@dtcxzyw (Member) left a comment

LG

@nikic nikic merged commit 055bfc0 into llvm:main Sep 1, 2025
14 checks passed
@nikic nikic deleted the instcombine-gep-leading-zeros branch September 1, 2025 07:58
@jplehr (Contributor) commented Sep 2, 2025

It appears that this broke our libc on GPU buildbot: https://lab.llvm.org/buildbot/#/builders/10/builds/12560
I bisected the breakage locally down to this change and reverting fixes it.
@jhuber6 for awareness.

A failing configuration can be reproduced with `cmake -GNinja -C offload/cmake/caches/AMDGPULibcBot.cmake llvm`, though the failure shows up at runtime, AFAICT.

zmodem added a commit to zmodem/rust that referenced this pull request Sep 2, 2025
This updates tests/codegen-llvm/issues/issue-118306.rs to pass also
after llvm/llvm-project#155415
@nikic (Contributor, Author) commented Sep 2, 2025

@jplehr I don't have an AMDGPU environment to debug this. Is it possible to reduce this further?

@jplehr (Contributor) commented Sep 2, 2025

Yeah, I'm sure it is. Is an IR dump what's needed here, either the original IR or the IR before some specific pass?

@nikic (Contributor, Author) commented Sep 2, 2025

Not sure whether it's going to be sufficient, but the original IR dump would be a good starting point at least.

GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request Sep 2, 2025
Adjust issue-118306.rs test after LLVM change

This updates tests/codegen-llvm/issues/issue-118306.rs to pass also after llvm/llvm-project#155415
rust-timer added a commit to rust-lang/rust that referenced this pull request Sep 2, 2025
Rollup merge of #146116 - zmodem:issue_118306_fix, r=nikic

Adjust issue-118306.rs test after LLVM change

This updates tests/codegen-llvm/issues/issue-118306.rs to pass also after llvm/llvm-project#155415
@jplehr (Contributor) commented Sep 2, 2025

I finally got some IR and maybe some other info.
With -opt-bisect-limit it appears that the issue is coming from GVN on the function _ZN47LlvmLibcCharacterConverterUTF32To8Test_FourByte3RunEv. After that, the particular test fails at runtime.

I attached the IR after the LCSSA pass, which runs before GVN. Let me know how helpful that is, or what I can do to help further with this issue.
after-lcssa.txt

@jhuber6 (Contributor) commented Sep 2, 2025

I finally got some IR and maybe some other info. With -opt-bisect-limit it appears that the issue is coming from GVN on the function _ZN47LlvmLibcCharacterConverterUTF32To8Test_FourByte3RunEv. After that the particular tests fails at runtime.

I attached the IR after the pass of LCSSA which runs before GVN. Let me know how helpful that is or what I can do to help better with this issue. after-lcssa.txt

I don't see anything immediately problematic looking at that function, https://godbolt.org/z/34fq6Erz8. But I'm not sure overall, cc @arsenm.

@nikic (Contributor, Author) commented Sep 3, 2025

GVN doesn't seem to do anything particularly interesting there. I tried running this through llc to see how the codegen changes, but that gives:

LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.if

Does this need extra llc flags beyond the triple?

@arsenm (Contributor) commented Sep 3, 2025

GVN doesn't seem to do anything particularly interesting there. I tried running this through llc to see how the codegen changes, but that gives:

LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.if

Does this need extra llc flags beyond the triple?

This IR is not suitable for running through codegen again. We cannot deal with IR that has already had control flow intrinsics inserted. Reproducers should be IR at the start of codegen, not the end of the codegen IR pipeline.

@nikic (Contributor, Author) commented Sep 3, 2025

Oh, I see. I thought this was a GVN run in the middle-end, but it's one in the backend.

As a side note, I noticed this comment:

// EarlyCSE is not always strong enough to clean up what LSR produces. For
// example, GVN can combine
//
// %0 = add %a, %b
// %1 = add %b, %a
//
// and
//
// %0 = shl nsw %a, 2
// %1 = shl %a, 2
//
// but EarlyCSE can do neither of them.
if (isPassEnabled(EnableScalarIRPasses))
addEarlyCSEOrGVNPass();

This looks very outdated. EarlyCSE supports both commutative instructions and poison flag intersection. You may be able to save some compile-time by using EarlyCSE instead of GVN there...
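
For example (hypothetical IR, just restating the cases from the comment above): EarlyCSE hashes commutative operands in a canonical order and intersects poison-generating flags when it CSEs an instruction, so both pairs below collapse to a single instruction:

  %0 = add i32 %a, %b
  %1 = add i32 %b, %a      ; commuted operands, CSEd with %0

  %2 = shl nsw i32 %a, 2
  %3 = shl i32 %a, 2       ; CSEd with %2, with nsw dropped by flag intersection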

@jplehr (Contributor) commented Sep 3, 2025

If there is value in it, I can see about extracting the IR earlier in the pipeline and attaching it here.

@arsenm (Contributor) commented Sep 3, 2025

If there is value in it, I can see to extract the IR earlier in the pipeline and attach that here.

Yes, we need the IR at the start of codegen.

@jplehr (Contributor) commented Sep 3, 2025

Attached is the IR from just before codegen. Apologies if this is again not what is needed; I always find the clang-linker-wrapper flow a bit confusing.
I hope this helps, and thanks for your time.
libc-gpu-precodegen.txt

@jplehr (Contributor) commented Sep 4, 2025

I reverted this patch locally, which restores the tests to passing, and then retrieved the IR from the same point in the pipeline -- attached below.
There are two differences, and I am not sure how relevant they are. I'll quote one here; another similar one is found much further towards the end of the file.

Good case, line 189

 %arrayidx.i.i.i.i = getelementptr inbounds [64 x %"struct.__llvm_libc_22_0_0_git::AtExitUnit"], ptr addrspace(1) @_ZN22__llvm_libc_22_0_0_git16atexit_callbacksE, i64 0, i64 %1

Bad case, line 189

  %arrayidx.i.i.i.i = getelementptr inbounds %"struct.__llvm_libc_22_0_0_git::AtExitUnit", ptr addrspace(1) @_ZN22__llvm_libc_22_0_0_git16atexit_callbacksE, i64 %1

I do not know whether this is significant, though. It mostly caught my attention as a difference, and because the integer literal that went away (64) is the number of lanes on our GPU.
Maybe that info is a piece of this puzzle.

IR:
libc-gpu-precodegen-good.txt

@nikic (Contributor, Author) commented Sep 4, 2025

I looked at the diff between those two, and all the changes look correct to me. The only real difference is related to __cxa_thread_finalize, which is probably just a difference in baselines (it was added yesterday).

@nikic (Contributor, Author) commented Sep 4, 2025

There are more diffs in the llc output, but I find it hard to interpret those without more knowledge about AMDGPU.

@jayfoad (Contributor) commented Sep 4, 2025

I think the relevant part of the diff is probably:

--- libc-gpu-precodegen-good.txt        2025-09-04 14:13:56.751818563 +0100
+++ libc-gpu-precodegen.txt     2025-09-04 14:13:49.402694527 +0100
@@ -719 +720 @@
-  %call.i572 = tail call fastcc noundef zeroext i1 @_ZN22__llvm_libc_22_0_0_git7testing8internal4testIiEEbPNS1_10RunContextENS0_8TestCondET_S6_PKcS8_NS1_8LocationE(ptr noundef %2, i32 noundef 240, i32 noundef 240, ptr noundef addrspacecast (ptr addrspace(4) @.str.10 to ptr), ptr noundef addrspacecast (ptr addrspace(4) @.str.24 to ptr), i32 133) #28
+  %call.i572 = tail call fastcc noundef zeroext i1 @_ZN22__llvm_libc_22_0_0_git7testing8internal4testIiEEbPNS1_10RunContextENS0_8TestCondET_S6_PKcS8_NS1_8LocationE(ptr noundef %2, i32 noundef poison, i32 noundef 240, ptr noundef addrspacecast (ptr addrspace(4) @.str.10 to ptr), ptr noundef addrspacecast (ptr addrspace(4) @.str.24 to ptr), i32 133) #27

Note that in the good case the first three arguments to the call are "%2, 240, 240" while in the bad case they are "%2, poison, 240".

@nikic (Contributor, Author) commented Sep 4, 2025

Ah yes, good point. That means the issue was already introduced at an earlier point in the middle-end pipeline, not during codegen.

@jplehr (Contributor) commented Sep 5, 2025

I checked out main at 69d0c3e yesterday.
Then I reverted this commit and verified that the tests still pass; I produced "good" intermediate files with that version.
I then reverted the revert and verified that the tests fail; I produced "bad" intermediate files with that version.

I have attached all IR files (except precodegen), but for more insight into the exact points in the pipeline where they are produced, @jhuber6 will need to help me. They were produced with -Wl,-save-temps on the final step of the ninja --verbose invocation that builds that particular test, if that helps in figuring things out.
There are a few differences in the preopt IR files between good and bad. I cannot judge their significance though. Similar for the internalize files.
If there is something else that I can/should provide to help with this, please let me know. I will need guidance then.

01-preopt-bad.txt
01-preopt-good.txt
02-internalize-bad.txt
02-internalize-good.txt
03-opt-bad.txt
03-opt-good.txt

@jayfoad (Contributor) commented Sep 5, 2025

This diff looks suspicious to me in function _ZN22__llvm_libc_22_0_0_git8internal18CharacterConverter8pop_utf8Ev:

--- 01-preopt-good.txt  2025-09-05 10:39:57.226224451 +0100
+++ 01-preopt-bad.txt   2025-09-05 10:39:56.149207707 +0100
@@ -11112,10 +11113,10 @@
 if.then4:                                         ; preds = %if.end
-  %sub7 = add nsw i64 %conv, -1
-  %4 = extractelement <4 x i8> <i8 0, i8 -64, i8 -32, i8 -16>, i64 %sub7
-  %5 = load i32, ptr %0, align 4, !tbaa !58
+  %4 = add nuw nsw i64 %conv, 4294967295
+  %5 = extractelement <4 x i8> <i8 0, i8 -64, i8 -32, i8 -16>, i64 %4
+  %6 = load i32, ptr %0, align 4, !tbaa !58
   %sh_prom = trunc nuw nsw i64 %mul to i32
-  %shr = lshr i32 %5, %sh_prom
+  %shr = lshr i32 %6, %sh_prom
   %shr.tr = trunc i32 %shr to i8
-  %or.narrow = or i8 %4, %shr.tr
-  %6 = zext i8 %or.narrow to i32
+  %or.narrow = or i8 %5, %shr.tr
+  %7 = zext i8 %or.narrow to i32
   br label %if.end15

I am guessing that the add nsw i64 %conv, -1 vs add nuw nsw i64 %conv, 4294967295 are the result of transforming some getelementptr instructions, and it seems suspicious that the 64-bit offset is small and negative in the good case but large and positive in the bad case.

@jplehr (Contributor) commented Sep 5, 2025

Given this is in the preopt IR, the change would occur in an even earlier stage, right? I'll see if I can get some IR out of whatever stage that may be.

@nikic (Contributor, Author) commented Sep 8, 2025

Given this is in the preopt IR, the change would occur in an even earlier stage, right? I'll see if I can get some IR out of whatever stage that may be.

Yes, you're currently dumping the pre-LTO IR, but we'd want the full optimized one produced by the compiler. (Should be possible to get it with -S -emit-llvm -Xclang -disable-llvm-optzns.)


Does anyone know which transform is capable of converting a load from a constant global into a dynamic extract from a vector constant? I tried and failed to reproduce that behavior on simpler test cases.

@ritter-x2a (Member)

I think I might have found the culprit: AMDGPUPromoteAllocaToVectorPass
Attached is the IR before and after that pass, with interesting points marked with `; HERE:`.
A %arrayidx = getelementptr i8, ptr addrspace(5) %14, i64 -1 (AS 5 is the private/scratch address space, so pointers have 32 bits) seems to be translated into %15 = add i64 4294967295, %conv6, which adds a zero-extended 32-bit -1 in 64 bits. That probably should be sign-extended instead.

around-AMDGPUPromoteAllocaToVectorPass.ll.txt
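
To spell out the suspected issue (a sketch based on the attached IR; names are illustrative): with 32-bit pointers in address space 5, a byte offset of -1 ends up represented as a 32-bit value inside the pass; when it is widened to the 64-bit index arithmetic it must be sign-extended, otherwise -1 becomes 4294967295.

  ; input: constant offset -1 on a 32-bit private (AS 5) pointer
  %arrayidx = getelementptr i8, ptr addrspace(5) %base, i64 -1

  ; expected index computation after promotion (offset sign-extended)
  %idx.good = add i64 %conv, -1

  ; what the pass currently emits (32-bit -1 zero-extended to 64 bits)
  %idx.bad  = add i64 %conv, 4294967295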

@nikic (Contributor, Author) commented Sep 9, 2025

@ritter-x2a Nice find! I guess this code is the culprit:

static Value *GEPToVectorIndex(GetElementPtrInst *GEP, AllocaInst *Alloca,
It seems to treat ConstOffset as an unsigned number, while it is actually signed.

@ritter-x2a (Member)

@nikic yes, using SExt instead of ZExt and sdivrem instead of udivrem in GEPToVectorIndex fixes the problematic test for me, but it causes memory access faults in some other tests. I'll have to think a bit more about which bitwidths we need at the individual operations, I guess.

@ritter-x2a (Member)

PR to fix the AMDGPUPromoteAlloca issue: #157682
I'm still observing crashes in different tests, libc.test.src.math.smoke.ufromfpf_test and similar variants, but I see those with and without my fix, and with and without the patch from this PR reverted, so they are probably caused by a separate problem.

@nathanchance (Member)

For what it's worth, I bisected the introduction of an instance of -Wframe-larger-than when building the Linux kernel with CONFIG_KASAN to this change.

# bad: [0fb2f524f8bc7a945fb1795e70ebe80adc08e237] [LLD] Recognize 'mipspe' as a PE target name (#157305)
# good: [1ffc38ca4920e6fdf0e640d09f50f397d6831a09] [TableGen][DecoderEmitter] Remove unused variables (NFC) (#153262)
git bisect start '0fb2f524f8bc7a945fb1795e70ebe80adc08e237' '1ffc38ca4920e6fdf0e640d09f50f397d6831a09'
# good: [3d722f5ed9497cc6118c119dece7db6b694e093a] [gn build] Port 2ab4c2880db6
git bisect good 3d722f5ed9497cc6118c119dece7db6b694e093a
# bad: [c945022f2fd8321559d84e2272005487c5ced924] [MLIR][NVVM] Support packed registers in `inline_ptx` (#154904)
git bisect bad c945022f2fd8321559d84e2272005487c5ced924
# good: [5dfc4f79655d6e36088aaac5c702b5034731a754] [docs][OpenMP] Add docs section for OpenMP 6.1 implementation status (#155651)
git bisect good 5dfc4f79655d6e36088aaac5c702b5034731a754
# good: [257975fada8bb40f729976a9caa06d2bbb4e9f12] [mlir] Fix node numbering order in SliceMatchers example (#155684)
git bisect good 257975fada8bb40f729976a9caa06d2bbb4e9f12
# bad: [d7d870326202a3dce9cec7a12aa097eb48c559d2] [Dexter] Implement DexStepFunction and DexContinue (#152721)
git bisect bad d7d870326202a3dce9cec7a12aa097eb48c559d2
# good: [34fbf1f0b79750f69d391cf9f778d8bca2ca5f05] [memprof] Make the AllocSites and CallSites sections optional in YAML (#156226)
git bisect good 34fbf1f0b79750f69d391cf9f778d8bca2ca5f05
# good: [09350bd1c5ebba6cae96afcb70ef8ac097f7e1de] [NFC][PowerPC] adding the arguments for register names and VSR to VR (#155991)
git bisect good 09350bd1c5ebba6cae96afcb70ef8ac097f7e1de
# good: [a987022f33a27610732544b0c5f4475ce818c982] [MemoryBuiltins] Add getBaseObjectSize() (NFCI) (#155911)
git bisect good a987022f33a27610732544b0c5f4475ce818c982
# bad: [9ad8e12c573f248b3f0ca4cca01fee1e5ed22c7b] [AMDGPU] Expand scratch atomics to flat atomics if GAS is enabled (#154710)
git bisect bad 9ad8e12c573f248b3f0ca4cca01fee1e5ed22c7b
# good: [e82abde49454d0bd2c2d37b6209e1d21fbcbaa56] [SLP][NFC] Refactor `if`s into `&&` (#156278)
git bisect good e82abde49454d0bd2c2d37b6209e1d21fbcbaa56
# bad: [c128b8c46f2a3b750c9abcba1e303f92d6531e5f] [clang-reorder-fields] Fix unused private field warning (NFC)
git bisect bad c128b8c46f2a3b750c9abcba1e303f92d6531e5f
# bad: [055bfc027141bbfafd51fb43f5ab81ba3b480649] [InstCombine] Strip leading zero indices from GEP (#155415)
git bisect bad 055bfc027141bbfafd51fb43f5ab81ba3b480649
# first bad commit: [055bfc027141bbfafd51fb43f5ab81ba3b480649] [InstCombine] Strip leading zero indices from GEP (#155415)
$ make -skj"$(nproc)" ARCH=arm64 LLVM=1 clean defconfig

$ scripts/config -e COMPILE_TEST -e DRM_OMAP -e KASAN

$ make -skj"$(nproc)" ARCH=arm64 LLVM=1 olddefconfig drivers/gpu/drm/omapdrm/dss/dispc.o
...
drivers/gpu/drm/omapdrm/dss/dispc.c:4720:27: warning: stack frame size (2256) exceeds limit (2048) in 'dispc_runtime_suspend' [-Wframe-larger-than]
 4720 | static __maybe_unused int dispc_runtime_suspend(struct device *dev)
      |                           ^
...

cvise spits out

cvise reproducer
enum omap_plane_id {
  OMAP_DSS_GFX,
  OMAP_DSS_VIDEO1,
  OMAP_DSS_VIDEO2,
  OMAP_DSS_VIDEO3,
  OMAP_DSS_WB
};
struct dispc_device *dispc;
struct dispc_device {
  int ctx[];
};
enum omap_plane_id DISPC_PIC_SIZE_OFFSET_plane, DISPC_FIR_COEF_HV2_OFFSET_plane,
    DISPC_FIR_COEF_V_OFFSET_plane;
void *base;
int dispc_save_context___trans_tmp_21, dispc_save_context___trans_tmp_17,
    dispc_save_context___trans_tmp_40, dispc_save_context___trans_tmp_38,
    dispc_save_context___trans_tmp_37, dispc_save_context___trans_tmp_37,
    dispc_save_context___trans_tmp_35;
int __raw_readl(void *addr) {
  int val;
  asm volatile(".if "
               "1"
               " == 1\n"
               ".endif\n"
               : "=r"(val)
               : "r"(addr));
  return val;
}
short DISPC_OVL_BASE(enum omap_plane_id plane) {
  switch (plane) {
  case OMAP_DSS_GFX:
    return 0;
  case OMAP_DSS_VIDEO1:
    return 188;
  case OMAP_DSS_VIDEO2:
    return 332;
  case OMAP_DSS_VIDEO3:
    return 768;
  case OMAP_DSS_WB:
    return 1280;
  default:
    __builtin_unreachable();
  }
}
short DISPC_BA0_UV_OFFSET(enum omap_plane_id plane) {
  switch (plane) {
  case OMAP_DSS_GFX:
  case OMAP_DSS_VIDEO1:
    return 44;
  case OMAP_DSS_VIDEO2:
    return 2;
  case OMAP_DSS_VIDEO3:
    return 0;
  case OMAP_DSS_WB:
    return 18;
  }
}
short DISPC_PIX_INC_OFFSET(enum omap_plane_id plane) {
  switch (plane) {
  case OMAP_DSS_GFX:
  case OMAP_DSS_VIDEO1:
  case OMAP_DSS_VIDEO2:
    return 20;
  case OMAP_DSS_VIDEO3:
  case OMAP_DSS_WB:
    return 8;
  }
}
short DISPC_FIR2_OFFSET(enum omap_plane_id plane) {
  switch (plane) {
  case OMAP_DSS_GFX:
  case OMAP_DSS_VIDEO1:
    return 408;
  case OMAP_DSS_VIDEO2:
    return 2;
  case OMAP_DSS_VIDEO3:
    return 4;
  case OMAP_DSS_WB:
    return 90;
  }
}
short DISPC_ACCU2_0_OFFSET(enum omap_plane_id plane) {
  switch (plane) {
  case OMAP_DSS_GFX:
  case OMAP_DSS_VIDEO1:
    return 1412;
  case OMAP_DSS_VIDEO2:
    return 60;
  case OMAP_DSS_VIDEO3:
    return 8;
  case OMAP_DSS_WB:
    return 4;
  }
}
short DISPC_FIR_COEF_H_OFFSET(enum omap_plane_id plane, short i) {
  switch (plane) {
  case OMAP_DSS_VIDEO1:
  case OMAP_DSS_VIDEO2:
    return 52 + i * 8;
  case OMAP_DSS_WB:
    return 16 + i * 8;
  default:
    __builtin_unreachable();
  }
}
short DISPC_FIR_COEF_H2_OFFSET(short i) {
  enum omap_plane_id plane = 0;
  switch (plane) {
  case OMAP_DSS_GFX:
  case OMAP_DSS_VIDEO1:
    return 1420 + i * 8;
  case OMAP_DSS_VIDEO2:
  case OMAP_DSS_VIDEO3:
  case OMAP_DSS_WB:
    return 8;
  }
}
short DISPC_FIR_COEF_HV_OFFSET(enum omap_plane_id plane, short i) {
  switch (plane) {
  case OMAP_DSS_VIDEO1:
  case OMAP_DSS_VIDEO2:
    return 56 + i * 8;
  case OMAP_DSS_WB:
    return 14 + i * 8;
  default:
    __builtin_unreachable();
  }
}
short DISPC_FIR_COEF_HV2_OFFSET(short i) {
  switch (DISPC_FIR_COEF_HV2_OFFSET_plane)
  case OMAP_DSS_GFX:
  case OMAP_DSS_VIDEO1:
  case OMAP_DSS_VIDEO2:
  case OMAP_DSS_VIDEO3:
  case OMAP_DSS_WB:
    return 676 + i;
}
short DISPC_FIR_COEF_V_OFFSET(short i) {
  switch (DISPC_FIR_COEF_V_OFFSET_plane) {
  case OMAP_DSS_VIDEO1:
    return 292 + i * 4;
  default:
    __builtin_unreachable();
  }
}
short DISPC_FIR_COEF_V2_OFFSET(short i) {
  enum omap_plane_id plane = 0;
  switch (plane) {
  case OMAP_DSS_GFX:
  case OMAP_DSS_VIDEO1:
    return 1484 + i;
  case OMAP_DSS_VIDEO2:
  case OMAP_DSS_VIDEO3:
  case OMAP_DSS_WB:
    return 4;
  }
}
int dispc_read_reg(short idx) { return __raw_readl(base + idx); }
int dispc_get_num_ovls();
static void dispc_save_context(struct dispc_device *dispc) {
  short __trans_tmp_36, __trans_tmp_7, __trans_tmp_6, __trans_tmp_4,
      __trans_tmp_3, __trans_tmp_2, __trans_tmp_1;
  int __trans_tmp_19, __trans_tmp_13, i, j;
  i = 0;
  for (; dispc_get_num_ovls(); i++) {
    dispc->ctx[(DISPC_OVL_BASE(i) + DISPC_PIX_INC_OFFSET(i)) / sizeof(int)] = 0;
    if (i == OMAP_DSS_GFX)
      continue;
    {
      short idx = 0;
      void *addr = base;
      int val;
      asm volatile(".if "
                   "1"
                   " == 1\n"
                   ".endif\n"
                   : "=r"(val)
                   : "r"(addr));
      dispc_save_context___trans_tmp_35 = val;
      __trans_tmp_13 = dispc_save_context___trans_tmp_35 + idx;
      switch (DISPC_PIC_SIZE_OFFSET_plane)
      case OMAP_DSS_GFX:
      case OMAP_DSS_VIDEO1:
      case OMAP_DSS_VIDEO2:
      case OMAP_DSS_VIDEO3:
      case OMAP_DSS_WB:
        __trans_tmp_36 = 94;
    }
    dispc->ctx[DISPC_OVL_BASE(i) + __trans_tmp_36] = __trans_tmp_13;
    dispc->ctx[DISPC_OVL_BASE(i) / sizeof(int)] =
        dispc->ctx[(DISPC_OVL_BASE(i) + 4) / sizeof(int)];
    j = 0;
    for (; j < 8; j++) {
      __trans_tmp_1 = DISPC_OVL_BASE(i);
      short idx = __trans_tmp_1;
      __raw_readl(base + idx);
      dispc->ctx[(DISPC_OVL_BASE(i) + DISPC_FIR_COEF_H_OFFSET(i, j)) /
                 sizeof(int)] = dispc_save_context___trans_tmp_37;
    }
    j = 0;
    for (; j < 8; j++) {
      __trans_tmp_2 = DISPC_OVL_BASE(i);
      short idx = __trans_tmp_2;
      __raw_readl(base + idx);
      dispc->ctx[(DISPC_OVL_BASE(i) + DISPC_FIR_COEF_HV_OFFSET(i, j)) /
                 sizeof(int)] = dispc_save_context___trans_tmp_38;
    }
    j = 0;
    for (; j < 5; j++) {
      void *addr = base;
      int val;
      asm volatile(".if "
                   "1"
                   " == 1\n"
                   ".endif\n"
                   : "=r"(val)
                   : "r"(addr));
      dispc->ctx[DISPC_OVL_BASE(i) + 116 + j * 4 / sizeof(int)] =
          dispc_save_context___trans_tmp_17;
    }
    j = 0;
    for (; j < 8; j++) {
      __trans_tmp_3 = DISPC_OVL_BASE(i);
      short idx = __trans_tmp_3;
      __raw_readl(base + idx);
      dispc->ctx[(DISPC_OVL_BASE(i) + DISPC_FIR_COEF_V_OFFSET(j)) /
                 sizeof(int)] = dispc_save_context___trans_tmp_40;
    }
    dispc->ctx[500 + DISPC_BA0_UV_OFFSET(i)] =
        dispc->ctx[DISPC_OVL_BASE(i) + 788 / sizeof(int)] =
            dispc->ctx[DISPC_FIR2_OFFSET(i)];
    short idx = 0;
    __trans_tmp_19 = idx;
    dispc->ctx[500 + DISPC_ACCU2_0_OFFSET(i)] = __trans_tmp_19;
    dispc->ctx[8] = j = 0;
    for (; j < 8; j++) {
      __trans_tmp_4 = DISPC_OVL_BASE(i);
      {
        idx = __trans_tmp_4;
        dispc_save_context___trans_tmp_37 = __raw_readl(base + idx);
      }
      dispc->ctx[DISPC_OVL_BASE(i) + DISPC_FIR_COEF_H2_OFFSET(j)] =
          dispc_save_context___trans_tmp_37;
    }
    j = 0;
    for (; j < 8; j++)
      dispc->ctx[DISPC_OVL_BASE(i) + DISPC_FIR_COEF_HV2_OFFSET(j)] =
          dispc_save_context___trans_tmp_21;
    j = 0;
    for (; j < 8; j++) {
      __trans_tmp_6 = DISPC_OVL_BASE(i);
      __trans_tmp_7 = 0;
      dispc->ctx[DISPC_OVL_BASE(i) + DISPC_FIR_COEF_V2_OFFSET(j)] =
          dispc_read_reg(__trans_tmp_6 + __trans_tmp_7);
    }
  }
}
void dispc_runtime_suspend() { dispc_save_context(dispc); }

which shows the stack usage tripling with -fsanitize=kernel-address:

GCC 15.2.0:

$ aarch64-linux-gcc -O2 -Wframe-larger-than=1 -fsanitize=kernel-address -c -o /dev/null dispc.i
dispc.i: In function 'DISPC_BA0_UV_OFFSET':
dispc.i:57:1: warning: the frame size of 16 bytes is larger than 1 bytes [-Wframe-larger-than=]
   57 | }
      | ^
dispc.i: In function 'DISPC_FIR2_OFFSET':
dispc.i:81:1: warning: the frame size of 16 bytes is larger than 1 bytes [-Wframe-larger-than=]
   81 | }
      | ^
dispc.i: In function 'DISPC_ACCU2_0_OFFSET':
dispc.i:94:1: warning: the frame size of 16 bytes is larger than 1 bytes [-Wframe-larger-than=]
   94 | }
      | ^
dispc.i: In function 'dispc_runtime_suspend':
dispc.i:258:59: warning: the frame size of 64 bytes is larger than 1 bytes [-Wframe-larger-than=]
  258 | void dispc_runtime_suspend() { dispc_save_context(dispc); }
      |                                                           ^

$ x86_64-linux-gcc -O2 -Wframe-larger-than=1 -fsanitize=kernel-address -c -o /dev/null dispc.i
dispc.i: In function 'DISPC_BA0_UV_OFFSET':
dispc.i:57:1: warning: the frame size of 16 bytes is larger than 1 bytes [-Wframe-larger-than=]
   57 | }
      | ^
dispc.i: In function 'DISPC_FIR2_OFFSET':
dispc.i:81:1: warning: the frame size of 16 bytes is larger than 1 bytes [-Wframe-larger-than=]
   81 | }
      | ^
dispc.i: In function 'DISPC_ACCU2_0_OFFSET':
dispc.i:94:1: warning: the frame size of 16 bytes is larger than 1 bytes [-Wframe-larger-than=]
   94 | }
      | ^
dispc.i: In function 'dispc_runtime_suspend':
dispc.i:258:59: warning: the frame size of 48 bytes is larger than 1 bytes [-Wframe-larger-than=]
  258 | void dispc_runtime_suspend() { dispc_save_context(dispc); }
      |                                                           ^

clang @ e82abde:

$ clang --target=aarch64-linux-gnu -O2 -Wframe-larger-than=1 -c -o /dev/null dispc.i
dispc.i:258:6: warning: stack frame size (128) exceeds limit (1) in 'dispc_runtime_suspend' [-Wframe-larger-than]
  258 | void dispc_runtime_suspend() { dispc_save_context(dispc); }
      |      ^

$ clang --target=aarch64-linux-gnu -O2 -Wframe-larger-than=1 -fsanitize=kernel-address -c -o /dev/null dispc.i
dispc.i:29:7: warning: stack frame size (32) exceeds limit (1) in 'DISPC_OVL_BASE' [-Wframe-larger-than]
   29 | short DISPC_OVL_BASE(enum omap_plane_id plane) {
      |       ^
dispc.i:258:6: warning: stack frame size (704) exceeds limit (1) in 'dispc_runtime_suspend' [-Wframe-larger-than]
  258 | void dispc_runtime_suspend() { dispc_save_context(dispc); }
      |      ^

$ clang --target=x86_64-linux-gnu -O2 -Wframe-larger-than=1 -c -o /dev/null dispc.i
dispc.i:258:6: warning: stack frame size (88) exceeds limit (1) in 'dispc_runtime_suspend' [-Wframe-larger-than]
  258 | void dispc_runtime_suspend() { dispc_save_context(dispc); }
      |      ^

$ clang --target=x86_64-linux-gnu -O2 -Wframe-larger-than=1 -fsanitize=kernel-address -c -o /dev/null dispc.i
dispc.i:29:7: warning: stack frame size (8) exceeds limit (1) in 'DISPC_OVL_BASE' [-Wframe-larger-than]
   29 | short DISPC_OVL_BASE(enum omap_plane_id plane) {
      |       ^
dispc.i:258:6: warning: stack frame size (488) exceeds limit (1) in 'dispc_runtime_suspend' [-Wframe-larger-than]
  258 | void dispc_runtime_suspend() { dispc_save_context(dispc); }
      |      ^

clang @ 055bfc0:

$ clang --target=aarch64-linux-gnu -O2 -Wframe-larger-than=1 -c -o /dev/null dispc.i
dispc.i:258:6: warning: stack frame size (96) exceeds limit (1) in 'dispc_runtime_suspend' [-Wframe-larger-than]
  258 | void dispc_runtime_suspend() { dispc_save_context(dispc); }
      |      ^

$ clang --target=aarch64-linux-gnu -O2 -Wframe-larger-than=1 -fsanitize=kernel-address -c -o /dev/null dispc.i
dispc.i:29:7: warning: stack frame size (32) exceeds limit (1) in 'DISPC_OVL_BASE' [-Wframe-larger-than]
   29 | short DISPC_OVL_BASE(enum omap_plane_id plane) {
      |       ^
dispc.i:258:6: warning: stack frame size (2064) exceeds limit (1) in 'dispc_runtime_suspend' [-Wframe-larger-than]
  258 | void dispc_runtime_suspend() { dispc_save_context(dispc); }
      |      ^

$ clang --target=x86_64-linux-gnu -O2 -Wframe-larger-than=1 -c -o /dev/null dispc.i
dispc.i:258:6: warning: stack frame size (104) exceeds limit (1) in 'dispc_runtime_suspend' [-Wframe-larger-than]
  258 | void dispc_runtime_suspend() { dispc_save_context(dispc); }
      |      ^

$ clang --target=x86_64-linux-gnu -O2 -Wframe-larger-than=1 -fsanitize=kernel-address -c -o /dev/null dispc.i
dispc.i:29:7: warning: stack frame size (8) exceeds limit (1) in 'DISPC_OVL_BASE' [-Wframe-larger-than]
   29 | short DISPC_OVL_BASE(enum omap_plane_id plane) {
      |       ^
dispc.i:258:6: warning: stack frame size (1464) exceeds limit (1) in 'dispc_runtime_suspend' [-Wframe-larger-than]
  258 | void dispc_runtime_suspend() { dispc_save_context(dispc); }
      |      ^

This is the same code that triggered #143908 but that reproducer does not change with or without this patch.

@nikic (Contributor, Author) commented Sep 11, 2025

This is the same code that triggered #143908 but that reproducer does not change with or without this patch.

The root cause here is the same. This change just ends up unrolling additional cases and triggers the same underlying problem with address calculations being unprofitably hoisted out of a loop and spilled. I left a comment on that issue.

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 20, 2025
With latest llvm22, I got the following verification failure:

  ...
  ; int big_alloc2(void *ctx) @ verifier_arena_large.c:207
  0: (b4) w6 = 1                        ; R6_w=1
  ...
  ; if (err) @ verifier_arena_large.c:233
  53: (56) if w6 != 0x0 goto pc+62      ; R6=0
  54: (b7) r7 = -4                      ; R7_w=-4
  55: (18) r8 = 0x7f4000000000          ; R8_w=scalar()
  57: (bf) r9 = addr_space_cast(r8, 0, 1)       ; R8_w=scalar() R9_w=arena
  58: (b4) w6 = 5                       ; R6_w=5
  ; pg = page[i]; @ verifier_arena_large.c:238
  59: (bf) r1 = r7                      ; R1_w=-4 R7_w=-4
  60: (07) r1 += 4                      ; R1_w=0
  61: (79) r2 = *(u64 *)(r9 +0)         ; R2_w=scalar() R9_w=arena
  ; if (*pg != i) @ verifier_arena_large.c:239
  62: (bf) r3 = addr_space_cast(r2, 0, 1)       ; R2_w=scalar() R3_w=arena
  63: (71) r3 = *(u8 *)(r3 +0)          ; R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff))
  64: (5d) if r1 != r3 goto pc+51       ; R1_w=0 R3_w=0
  ; bpf_arena_free_pages(&arena, (void __arena *)pg, 2); @ verifier_arena_large.c:241
  65: (18) r1 = 0xff11000114548000      ; R1_w=map_ptr(map=arena,ks=0,vs=0)
  67: (b4) w3 = 2                       ; R3_w=2
  68: (85) call bpf_arena_free_pages#72675      ;
  69: (b7) r1 = 0                       ; R1_w=0
  ; page[i + 1] = NULL; @ verifier_arena_large.c:243
  70: (7b) *(u64 *)(r8 +8) = r1
  R8 invalid mem access 'scalar'
  processed 61 insns (limit 1000000) max_states_per_insn 0 total_states 6 peak_states 6 mark_read 2
  =============
  torvalds#489/5   verifier_arena_large/big_alloc2:FAIL

The main reason is that 'r8' in insn '70' is not an arena pointer.
Further debugging on the llvm side shows that llvm commit ([1]) caused
the failure. For the original code:
  page[i] = NULL;
  page[i + 1] = NULL;
llvm transformed it into something like the following at the source level:
  __builtin_memset(&page[i], 0, 16)
Such a transformation prevents the llvm BPFCheckAndAdjustIR pass from
generating proper addr_space_cast insns ([2]).

Adding support in the llvm BPFCheckAndAdjustIR pass should work, but
it is not clear whether such a pattern exists in real applications.
At the same time, simply adding a memory barrier between the two 'page'
assignments fixes the issue.

  [1] llvm/llvm-project#155415
  [2] llvm/llvm-project#84410

Cc: Eduard Zingerman <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
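
For reference, a minimal C sketch of the workaround described above, assuming the kernel's usual compiler barrier (an empty asm with a memory clobber); the function name is illustrative and not the actual selftest code:

```c
/* A compiler barrier between the two stores keeps the compiler from
 * merging them into a single memset-style write, so each store is
 * emitted individually against the arena pointer. */
#define barrier() __asm__ __volatile__("" ::: "memory")

static void clear_page_pair(void **page, int i)
{
    page[i] = NULL;
    barrier();          /* prevents coalescing with the next store */
    page[i + 1] = NULL;
}
```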
kernel-patches-daemon-bpf bot pushed a commit to kernel-patches/bpf that referenced this pull request Sep 20, 2025
kernel-patches-daemon-bpf-rc bot pushed a commit to kernel-patches/bpf-rc that referenced this pull request Sep 20, 2025
rust-cloud-vms bot pushed a commit to makai410/rustc_public that referenced this pull request Oct 12, 2025
Adjust issue-118306.rs test after LLVM change

This updates tests/codegen-llvm/issues/issue-118306.rs so that it also passes after llvm/llvm-project#155415.


10 participants