Skip to content

Conversation

@asudarsa
Copy link
Owner

@asudarsa asudarsa commented Oct 7, 2024

This PR is one of the many PRs in the SYCL upstreaming effort focusing on device code linking during the SYCL offload compilation process. RFC: https://discourse.llvm.org/t/rfc-offloading-design-for-sycl-offload-kind-and-spir-targets/74088

In this PR, we introduce a new tool that will be used to perform device code linking for SYCL offload kind. It accepts SYCL device objects in LLVM IR bitcode format and will generate a fully linked device object that can then be wrapped and linked into the host object.

A primary use case for this tool is to perform device code linking for objects with SYCL offload kind inside the clang-linker-wrapper. It can also be invoked via clang driver as follows:

clang --target=spirv64 --sycl-link input.bc

Device code linking for SYCL offloading kind has a number of known quirks that makes it difficult to use in a unified offloading setting. Two of the primary issues are:

  1. Several finalization steps are required to be run on the fully-linked LLVM IR bitcode to gaurantee conformance to SYCL standards. This step is unique to SYCL offloading compilation flow.
  2. SPIR-V LLVM Translator tool is an extenal tool and hence SPIR-V IR code generation cannot be done as part of LTO. This limitation will be lifted once SPIR-V backend is available as a viable LLVM backend.
    Hence, we introduce this new tool to provide a clean wrapper to perform SYCL device linking.

List of items supported in this PR:

  1. Addition of a new linking tool to link device for SYCL offloading. Providing a way to call this from clang driver. This supports a simple JIT compilation flow.
  2. Currently, this new tool supports linking of multiple sources in LLVM IR device code format, linking the SYCL device library files if available and finally running llvm-spirv tool to generate SPIR-V IR device file.

List of upcoming PRs:

  1. Post-link routine that can support device code splitting and also support running of several SYCL-specific finalization passes.
  2. Integration of post-link routine into clang-sycl-link-wrapper
  3. Invoking clang driver from inside clang-linker-wrapper for linking SYCL device codes.
  4. AOT compilation support for SPIR-V targets
  5. AOT compilation support for non SPIR-V targets

Thanks

@asudarsa asudarsa changed the title [Clang][SYCL] Introduce clang-sycl-link-wrapper to link SYCL offloadi… [Clang][SYCL] Introduce clang-sycl-link-wrapper to link SYCL offloading device code Oct 7, 2024
Copy link

@sarnex sarnex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, thanks! just some comments below.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Purpose of this wrapper is to provide an interface to link SYCL device bitcode
The purpose of this wrapper is to provide an interface to link SYCL device bitcode

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Device code linking for SYCL offloading kind has a number of known quirks that
Device code linking for SYCL offloading has a number of known quirks that

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
IR bitcode to gaurantee conformance to SYCL standards. This step is unique to
IR bitcode to guarantee conformance to SYCL standards. This step is unique to

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SYCL offloading compilation flow.
the SYCL offloading compilation flow.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. SPIR-V LLVM Translator tool is an extenal tool and hence SPIR-V IR code
2. SPIR-V LLVM Translator tool is an external tool and hence SPIR-V IR code

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i forget, are there plans to have the driver own this list?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's reasonable to hard-code the list here as this is a SYCL specific tool.

Thanks

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well with the old offload model the driver also owns a copy, so once we have the new model enabled and remove support for the old model, if this is the single list then that's fine, i just want to avoid two copies of the same thing

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is nullopt return valid? should it be an error?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user does not specify a valid device library location, then we simply do not link any device lib files. It need not be an error.

Copy link

@sarnex sarnex Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user here is the driver/clang-linker-wrapper right?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the llvm::sys::path namespace has a filename function to extract the filename from the path, can we use that here?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to note in a comment or something that we will need to add processing of the linked module (sycl-post-link) later?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add a PR comment about upcoming PRs

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there's an error, i think we won't remove the temp files even if --save-temps is not passed, right? reportError is noreturn, so we probably need something that will always run

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reportError is being used in other tools as well.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reportError is fine but checking for save_temps after a reportError won't work I think. If the same thing happens in other tools then it's fine for this PR

@bader
Copy link

bader commented Oct 7, 2024

A primary use case for this tool is to perform device code linking for objects with SYCL offload kind inside the clang-linker-wrapper. It can also be invoked via clang driver as follows:

clang --target=spirv64 --sycl-link input.bc

Device code linking for SYCL offloading kind has a number of known quirks that makes it difficult to use in a unified offloading setting. Two of the primary issues are:

  1. Several finalization steps are required to be run on the fully-linked LLVM IR bitcode to gaurantee conformance to SYCL standards. This step is unique to SYCL offloading compilation flow.

I think it might be easier to communicate SYCL runtime requirements by adding SYCL operating system type or environment type to the target triple.

clang -target=spirv64-intel-sycl input.bc # OS type
clang -target=spirv64-intel--sycl input.bc # Environment type

E.g. CUDA uses CUDA OS type to select the right interface with the driver. See https://llvm.org/docs/NVPTXUsage.html#triples

One concern is that Codeplay is already using cuda OS type when we compiler SYCL for nvptx target. @Naghasan, what do you think about the idea of having SYCL specific OS type in the target triple?

Comment on lines 1 to 3
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
====================
Clang SYCL link Wrapper
====================
=======================
Clang SYCL link Wrapper
=======================

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to my understanding, the purpose of the tool is to provide standard linker interface and be able to link any object file formats 'compile' step is able to produce for SYCL offload mode. So, I would expect this tool to handle linking of fat objects with device code in other IR like SPIR-V as well as native binary format.

Copy link
Owner Author

@asudarsa asudarsa Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bader

Thanks for the feedback. I agree with you. This tool is expected to handle linking of fat objects and SPIR-V objects. However, in its current state, it can support only bitcode files. We can extend this tool easily to support linking of other objects. I was planning to update the document as the tool evolves. I will add more details to the doc.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Let's clearly state the final goal for the tool, current status of the implementation, and plans for the future work.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
be passed down to downstrea tools like 'llvm-link', 'llvm-spirv', etc.
be passed down to downstrea tools like 'clang', 'llvm-spirv', etc.

We should use clang instead of llvm-link to link llvm bitcode files.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add a TODO. I do have a tracker to replace llvm-link with clang. '-only-needed' linking support seems to be missing in clang.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think llvm-link is not a part of clang distribution, so SYCL linking won't work. We need to resolve that ASAP.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about AOT targets?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning to add it as the implementation evolves in near future. But I will add it now and provide a PR comment about upcoming changes.

Thanks

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, specify that --sycl-link is the clang driver option, not clang-sycl-link-wrapper option.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy-paste from another source? I expect this comment reference spir-v instead of nvptx and amdgpu.

Copy link
Owner Author

@asudarsa asudarsa Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Sorry. I thought I had cleaned it up. We do not really need this comment. Will remove it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should either change sycl to spirv in the option name of vice versa in the description. Otherwise, it looks a bit messy.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Special option to pass in llvm-spirv options
// Options to pass to llvm-spirv tool

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Ssave intermediary results.
/// Save intermediary results.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SmallString<128> SPIRVDumpDir;
static SmallString<128> SPIRVDumpDir;

@Naghasan
Copy link

Naghasan commented Oct 9, 2024

A primary use case for this tool is to perform device code linking for objects with SYCL offload kind inside the clang-linker-wrapper. It can also be invoked via clang driver as follows:
clang --target=spirv64 --sycl-link input.bc
Device code linking for SYCL offloading kind has a number of known quirks that makes it difficult to use in a unified offloading setting. Two of the primary issues are:

  1. Several finalization steps are required to be run on the fully-linked LLVM IR bitcode to gaurantee conformance to SYCL standards. This step is unique to SYCL offloading compilation flow.

I think it might be easier to communicate SYCL runtime requirements by adding SYCL operating system type or environment type to the target triple.

clang -target=spirv64-intel-sycl input.bc # OS type
clang -target=spirv64-intel--sycl input.bc # Environment type

E.g. CUDA uses CUDA OS type to select the right interface with the driver. See https://llvm.org/docs/NVPTXUsage.html#triples

One concern is that Codeplay is already using cuda OS type when we compiler SYCL for nvptx target. @Naghasan, what do you think about the idea of having SYCL specific OS type in the target triple?

I think it would be best to avoid that TBH. I briefly looked at the patch, and I don't understand (yet) why we would need extra information. I'll make a review.

If we have to, environment would be best IMO so as to not compete with existing TC logic that we can reuse. But the point of the device compiler is to lower to a specific environment (OpenCL, Level 0, HIP, CUDA), so I don't see why the middle or backend should need to know that to function properly and so I would personally prefer an extra flag.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I have a better understanding now than when I wrote #1 (comment)

You shouldn't need that flag... if the driver was there. If it was, the condition should be if (JA.isDeviceOffloading(Action::OFK_SYCL)) rather than this global modifier. IMHO, the driver change shouldn't be part of this patch:

  • there is a better way to do it but can't implement this because it lacks the required sycl offloading driver logic.
  • looking at the test, you don't need this to test clang-sycl-linker-wrapper

Copy link
Owner Author

@asudarsa asudarsa Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @Naghasan
Here, the clang driver is expected to be called from inside clang-linker-wrapper to link device images and we pass --sycl-link (and no -fsycl) to invoke the 'link' stage.
Adding @mdtoguchi for more details here.

Thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of JA.isDeviceOffloading() is useful when we know we are offloading and determining what kind we are offloading to. From what I understand, the clang call from within the wrapper is not to be considered as an offloading enabling invocation, but rather a target specific call to produce a specific type of binary.

@asudarsa asudarsa requested review from Naghasan, bader and sarnex October 10, 2024 14:20
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The wrapper will also support the Ahead-Of-Time compilation flow. AOT
The wrapper will also support the Ahead-Of-Time (AOT) compilation flow. AOT

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
generation is deferred until application runtime time.
generation is deferred until application runtime.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
makes it difficult to use in a unified offloading setting. Two of the primary
make it difficult to use in a unified offloading setting. Two of the primary

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'makes' goes well with 'Device code linking'. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It reads interesting, as the association can also be with 'quirks'. It is the quirks that make it difficult - it's not the device code linking that makes it difficult.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need 'sycl' here? The tool is a SYCL specific tool - seems redundant.

Copy link

@sarnex sarnex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no major flags from me assuming my comments are addressed in a seperate pr

@asudarsa asudarsa marked this pull request as ready for review October 14, 2024 15:09
…ng device code

This PR is one of the many PRs in the SYCL upstreaming effort focusing on
device code linking during the SYCL offload compilation process.
RFC: https://discourse.llvm.org/t/rfc-offloading-design-for-sycl-offload-kind-and-spir-targets/74088

In this PR, we introduce a new tool that will be used to perform device code
linking for SYCL offload kind. It accepts SYCL device objects in LLVM IR bitcode
format and will generate a fully linked device object that can then be wrapped
and linked into the host object.

A primary use case for this tool is to perform device code linking for objects
with SYCL offload kind inside the clang-linker-wrapper.
It can also be invoked via clang driver as follows:

`clang --target=spirv64 --sycl-link input.bc`

Device code linking for SYCL offloading kind has a number of known quirks that
makes it difficult to use in a unified offloading setting.
Two of the primary issues are:
1. Several finalization steps are required to be run on the fully-linked LLVM
IR bitcode to gaurantee conformance to SYCL standards. This step is unique to
SYCL offloading compilation flow.
2. SPIR-V LLVM Translator tool is an extenal tool and hence SPIR-V IR code
generation cannot be done as part of LTO. This limitation will be lifted once
SPIR-V backend is available as a viable LLVM backend.

Hence, we introduce this new tool to provide a clean wrapper to perform SYCL
device linking.

Thanks

Signed-off-by: Arvind Sudarsanam <[email protected]>
@asudarsa asudarsa force-pushed the add_clang-sycl-link-wrapper-part-1 branch from 856c205 to eff4a03 Compare October 14, 2024 19:00
Signed-off-by: Arvind Sudarsanam <[email protected]>
Signed-off-by: Arvind Sudarsanam <[email protected]>
Signed-off-by: Arvind Sudarsanam <[email protected]>
…of changes missed out from earlier commit.

Signed-off-by: Arvind Sudarsanam <[email protected]>
Signed-off-by: Arvind Sudarsanam <[email protected]>
Signed-off-by: Arvind Sudarsanam <[email protected]>
Signed-off-by: Arvind Sudarsanam <[email protected]>
… when only one input is present

Signed-off-by: Arvind Sudarsanam <[email protected]>
asudarsa pushed a commit that referenced this pull request Mar 11, 2025
When compiling VLS SVE, the compiler often replaces VL-based offsets
with immediate-based ones. This leads to a mismatch in the allowed
addressing modes due to SVE loads/stores generally expecting immediate
offsets relative to VL. For example, given:
```c

svfloat64_t foo(const double *x) {
  svbool_t pg = svptrue_b64();
  return svld1_f64(pg, x+svcntd());
}
```

When compiled with `-msve-vector-bits=128`, we currently generate:
```gas
foo:
        ptrue   p0.d
        mov     x8, #2
        ld1d    { z0.d }, p0/z, [x0, x8, lsl #3]
        ret
```

Instead, we could be generating:
```gas
foo:
        ldr     z0, [x0, #1, mul vl]
        ret
```

Likewise for other types, stores, and other VLS lengths.

This patch achieves the above by extending `SelectAddrModeIndexedSVE`
to let constants through when `vscale` is known.
asudarsa pushed a commit that referenced this pull request Mar 11, 2025
`TestReportData.py` is failing on the macOS CI with:
```
Traceback (most recent call last):
  File "/Users/ec2-user/jenkins/workspace/llvm.org/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 1784, in test_method
    return attrvalue(self)
  File "/Users/ec2-user/jenkins/workspace/llvm.org/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 148, in wrapper
    return func(*args, **kwargs)
  File "/Users/ec2-user/jenkins/workspace/llvm.org/lldb-cmake/llvm-project/lldb/test/API/functionalities/asan/TestReportData.py", line 28, in test_libsanitizers_asan
    self.asan_tests(libsanitizers=True)
  File "/Users/ec2-user/jenkins/workspace/llvm.org/lldb-cmake/llvm-project/lldb/test/API/functionalities/asan/TestReportData.py", line 60, in asan_tests
    self.expect(
  File "/Users/ec2-user/jenkins/workspace/llvm.org/lldb-cmake/llvm-project/lldb/packages/Python/lldbsuite/test/lldbtest.py", line 2490, in expect
    self.fail(log_msg)
AssertionError: Ran command:
"thread list"

Got output:
Process 3474 stopped
* thread #1: tid = 0x38b5e9, 0x00007ff80f563b52 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT

Expecting sub string: "stopped" (was found)
Expecting sub string: "stop reason = Use of deallocated memory" (was not found)
Process should be stopped due to ASan report
```

There isn't much to go off of in the log, so adding more to help us debug this.
asudarsa pushed a commit that referenced this pull request Mar 11, 2025
These are macOS tests only and are currently failing on the x86_64 CI
and on arm64 on recent versions of macOS/Xcode.

The tests are failing because we're stopping in:
```
Process 17458 stopped
* thread #1: tid = 0xbda69a, 0x00000002735bd000
  libsystem_malloc.dylib`purgeable_print_self.cold.1, stop reason = EXC_BREAKPOINT (code=1, subcode=0x2735bd000)
```
instead of the libsanitizers library. This seems to be related to
`-fsanitize-trivial-abi` support

Skip these for now until we figure out the root cause.
asudarsa pushed a commit that referenced this pull request Apr 6, 2025
…d A520 (llvm#132246)

Inefficient SVE codegen occurs on at least two in-order cores,
those being Cortex-A510 and Cortex-A520. For example a simple vector
add

```
void foo(float a, float b, float dst, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Vectorizes the inner loop into the following interleaved sequence
of instructions.

```
        add     x12, x1, x10
        ld1b    { z0.b }, p0/z, [x1, x10]
        add     x13, x2, x10
        ld1b    { z1.b }, p0/z, [x2, x10]
        ldr     z2, [x12, #1, mul vl]
        ldr     z3, [x13, #1, mul vl]
        dech    x11
        add     x12, x0, x10
        fadd    z0.s, z1.s, z0.s
        fadd    z1.s, z3.s, z2.s
        st1b    { z0.b }, p0, [x0, x10]
        addvl   x10, x10, #2
        str     z1, [x12, #1, mul vl]
```

By adjusting the target features to prefer fixed over scalable if the
cost is equal we get the following vectorized loop.

```
         ldp q0, q3, [x11, #-16]
         subs    x13, x13, llvm#8
         ldp q1, q2, [x10, #-16]
         add x10, x10, llvm#32
         add x11, x11, llvm#32
         fadd    v0.4s, v1.4s, v0.4s
         fadd    v1.4s, v2.4s, v3.4s
         stp q0, q1, [x12, #-16]
         add x12, x12, llvm#32
```

Which is more efficient.
asudarsa pushed a commit that referenced this pull request Apr 10, 2025
… A510/A520 (llvm#134606)

Recommit. This work was done by llvm#132246 but failed buildbots due to the
test introduced needing updates

Inefficient SVE codegen occurs on at least two in-order cores, those
being Cortex-A510 and Cortex-A520. For example a simple vector add

```
void foo(float a, float b, float dst, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Vectorizes the inner loop into the following interleaved sequence of
instructions.

```
        add     x12, x1, x10
        ld1b    { z0.b }, p0/z, [x1, x10]
        add     x13, x2, x10
        ld1b    { z1.b }, p0/z, [x2, x10]
        ldr     z2, [x12, #1, mul vl]
        ldr     z3, [x13, #1, mul vl]
        dech    x11
        add     x12, x0, x10
        fadd    z0.s, z1.s, z0.s
        fadd    z1.s, z3.s, z2.s
        st1b    { z0.b }, p0, [x0, x10]
        addvl   x10, x10, #2
        str     z1, [x12, #1, mul vl]
```

By adjusting the target features to prefer fixed over scalable if the
cost is equal we get the following vectorized loop.

```
         ldp q0, q3, [x11, #-16]
         subs    x13, x13, llvm#8
         ldp q1, q2, [x10, #-16]
         add x10, x10, llvm#32
         add x11, x11, llvm#32
         fadd    v0.4s, v1.4s, v0.4s
         fadd    v1.4s, v2.4s, v3.4s
         stp q0, q1, [x12, #-16]
         add x12, x12, llvm#32
```

Which is more efficient.
asudarsa pushed a commit that referenced this pull request Apr 10, 2025
…s=128. (llvm#134068)

When compiling with -msve-vector-bits=128 or vscale_range(1, 1) and when
the offsets allow it, we can pair SVE LDR/STR instructions into Neon
LDP/STP.

For example, given:
```cpp
#include <arm_sve.h>

void foo(double const *ldp, double *stp) {
  svbool_t pg = svptrue_b64();
  svfloat64_t ld1 = svld1_f64(pg, ldp);
  svfloat64_t ld2 = svld1_f64(pg, ldp+svcntd());
  svst1_f64(pg, stp, ld1);
  svst1_f64(pg, stp+svcntd(), ld2);
}
```

When compiled with `-msve-vector-bits=128`, we currently generate:
```gas
foo:
        ldr     z0, [x0]
        ldr     z1, [x0, #1, mul vl]
        str     z0, [x1]
        str     z1, [x1, #1, mul vl]
        ret
```

With this patch, we instead generate:
```gas
foo:
        ldp     q0, q1, [x0]
        stp     q0, q1, [x1]
        ret
```

This is an alternative, more targetted approach to llvm#127500.
asudarsa pushed a commit that referenced this pull request Apr 10, 2025
…ctor-bits=128." (llvm#134997)

Reverts llvm#134068

Caused a stage 2 build failure:
https://lab.llvm.org/buildbot/#/builders/41/builds/6016

```
FAILED: lib/Support/CMakeFiles/LLVMSupport.dir/Caching.cpp.o 
/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/stage1.install/bin/clang++ -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/stage2/lib/Support -I/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/llvm/llvm/lib/Support -I/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/stage2/include -I/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/llvm/llvm/include -mcpu=neoverse-512tvb -mllvm -scalable-vectorization=preferred -mllvm -treat-scalable-fixed-error-as-warning=false -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Werror=global-constructors -O3 -DNDEBUG -std=c++17 -UNDEBUG  -fno-exceptions -funwind-tables -fno-rtti -MD -MT lib/Support/CMakeFiles/LLVMSupport.dir/Caching.cpp.o -MF lib/Support/CMakeFiles/LLVMSupport.dir/Caching.cpp.o.d -o lib/Support/CMakeFiles/LLVMSupport.dir/Caching.cpp.o -c /home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/llvm/llvm/lib/Support/Caching.cpp
Opcode has unknown scale!
UNREACHABLE executed at ../llvm/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp:4530!
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: /home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/stage1.install/bin/clang++ -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/stage2/lib/Support -I/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/llvm/llvm/lib/Support -I/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/stage2/include -I/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/llvm/llvm/include -mcpu=neoverse-512tvb -mllvm -scalable-vectorization=preferred -mllvm -treat-scalable-fixed-error-as-warning=false -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Werror=global-constructors -O3 -DNDEBUG -std=c++17 -UNDEBUG -fno-exceptions -funwind-tables -fno-rtti -MD -MT lib/Support/CMakeFiles/LLVMSupport.dir/Caching.cpp.o -MF lib/Support/CMakeFiles/LLVMSupport.dir/Caching.cpp.o.d -o lib/Support/CMakeFiles/LLVMSupport.dir/Caching.cpp.o -c /home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/llvm/llvm/lib/Support/Caching.cpp
1.	<eof> parser at end of file
2.	Code generation
3.	Running pass 'Function Pass Manager' on module '/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/llvm/llvm/lib/Support/Caching.cpp'.
4.	Running pass 'AArch64 load / store optimization pass' on function '@"_ZNSt17_Function_handlerIFN4llvm8ExpectedISt8functionIFNS1_ISt10unique_ptrINS0_16CachedFileStreamESt14default_deleteIS4_EEEEjRKNS0_5TwineEEEEEjNS0_9StringRefESB_EZNS0_10localCacheESB_SB_SB_S2_IFvjSB_S3_INS0_12MemoryBufferES5_ISH_EEEEE3$_0E9_M_invokeERKSt9_Any_dataOjOSF_SB_"'
 #0 0x0000b6eae9b67bf0 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/stage1.install/bin/clang+++0x81c7bf0)
 #1 0x0000b6eae9b65aec llvm::sys::RunSignalHandlers() (/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/stage1.install/bin/clang+++0x81c5aec)
 #2 0x0000b6eae9acd5f4 CrashRecoverySignalHandler(int) CrashRecoveryContext.cpp:0:0
 #3 0x0000f16c1aff28f8 (linux-vdso.so.1+0x8f8)
 #4 0x0000f16c1aacf1f0 __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #5 0x0000f16c1aa8a67c gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #6 0x0000f16c1aa77130 abort ./stdlib/abort.c:81:7
 llvm#7 0x0000b6eae9ad6628 (/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/stage1.install/bin/clang+++0x8136628)
 llvm#8 0x0000b6eae72e95a8 (/home/tcwg-buildbot/worker/clang-aarch64-sve-vla-2stage/stage1.install/bin/clang+++0x59495a8)
 llvm#9 0x0000b6eae74ca9a8 (anonymous namespace)::AArch64LoadStoreOpt::findMatchingInsn(llvm::MachineInstrBundleIterator<llvm::MachineInstr, false>, (anonymous namespace)::LdStPairFlags&, unsigned int, bool) AArch64LoadStoreOptimizer.cpp:0:0
llvm#10 0x0000b6eae74c85a8 (anonymous namespace)::AArch64LoadStoreOpt::tryToPairLdStInst(llvm::MachineInstrBundleIterator<llvm::MachineInstr, false>&) AArch64LoadStoreOptimizer.cpp:0:0
llvm#11 0x0000b6eae74c624c (anonymous namespace)::AArch64LoadStoreOpt::optimizeBlock(llvm::MachineBasicBlock&, bool) AArch64LoadStoreOptimizer.cpp:0:0
llvm#12 0x0000b6eae74c429c (anonymous namespace)::AArch64LoadStoreOpt::runOnMachineFunction(llvm::MachineFunction&) AArch64LoadStoreOptimizer.cpp:0:0
```
asudarsa pushed a commit that referenced this pull request Apr 14, 2025
…vailable (llvm#135343)

When a frame is inlined, LLDB will display its name in backtraces as
follows:
```
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.3
  * frame #0: 0x0000000100000398 a.out`func() [inlined] baz(x=10) at inline.cpp:1:42
    frame #1: 0x0000000100000398 a.out`func() [inlined] bar() at inline.cpp:2:37
    frame #2: 0x0000000100000398 a.out`func() at inline.cpp:4:15
    frame #3: 0x00000001000003c0 a.out`main at inline.cpp:7:5
    frame #4: 0x000000026eb29ab8 dyld`start + 6812
```
The longer the names get the more confusing this gets because the first
function name that appears is the parent frame. My assumption (which may
need some more surveying) is that for the majority of cases we only care
about the actual frame name (not the parent). So this patch removes all
the special logic that prints the parent frame.

Another quirk of the current format is that the inlined frame name does
not abide by the `${function.name-XXX}` format variables. We always just
print the raw demangled name. With this patch, we would format the
inlined frame name according to the `frame-format` setting (see the
test-cases).

If we really want to have the `parentFrame [inlined] inlinedFrame`
format, we could expose it through a new `frame-format` variable (e..g.,
`${function.inlined-at-name}` and let the user decide where to place
things.
asudarsa pushed a commit that referenced this pull request Apr 17, 2025
Currently, given:
```cpp
uint64_t incb(uint64_t x) {
  return x+svcntb();
}
```
LLVM generates:
```gas
incb:
        addvl   x0, x0, #1
        ret
```
Which is equivalent to:
```gas
incb:
        incb    x0
        ret
```

However, on microarchitectures like the Neoverse V2 and Neoverse V3,
the second form (with INCB) can have significantly better latency and
throughput (according to their SWOG). On the Neoverse V2, for example,
ADDVL has a latency and throughput of 2, whereas some forms of INCB
have a latency of 1 and a throughput of 4. The same applies to DECB.
This patch adds patterns to prefer the cheaper INCB/DECB forms over
ADDVL where applicable.
asudarsa pushed a commit that referenced this pull request Jun 6, 2025
Fixes llvm#123300

What is seen 
```
clang-repl> int x = 42;
clang-repl> auto capture = [&]() { return x * 2; };
In file included from <<< inputs >>>:1:
input_line_4:1:17: error: non-local lambda expression cannot have a capture-default
    1 | auto capture = [&]() { return x * 2; };
      |                 ^
zsh: segmentation fault  clang-repl --Xcc="-v"

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x0000000107b4f8b8 libclang-cpp.19.1.dylib`clang::IncrementalParser::CleanUpPTU(clang::PartialTranslationUnit&) + 988
    frame #1: 0x0000000107b4f1b4 libclang-cpp.19.1.dylib`clang::IncrementalParser::ParseOrWrapTopLevelDecl() + 416
    frame #2: 0x0000000107b4fb94 libclang-cpp.19.1.dylib`clang::IncrementalParser::Parse(llvm::StringRef) + 612
    frame #3: 0x0000000107b52fec libclang-cpp.19.1.dylib`clang::Interpreter::ParseAndExecute(llvm::StringRef, clang::Value*) + 180
    frame #4: 0x0000000100003498 clang-repl`main + 3560
    frame #5: 0x000000018d39a0e0 dyld`start + 2360
```

Though the error is justified, we shouldn't be interested in exiting
through a segfault in such cases.

The issue is that empty named decls weren't being taken care of
resulting into this assert


https://github.com/llvm/llvm-project/blob/c1a229252617ed58f943bf3f4698bd8204ee0f04/clang/include/clang/AST/DeclarationName.h#L503

Can also be seen when the example is attempted through xeus-cpp-lite.


![image](https://github.com/user-attachments/assets/9b0e6ead-138e-4b06-9ad9-fcb9f8d5bf6e)
asudarsa pushed a commit that referenced this pull request Jun 6, 2025
# Symptom

We have seen SIGSEGV like this:
```
* thread #1, name = 'lldb-server', stop reason = SIGSEGV
    frame #0: 0x00007f39e529c993 libc.so.6`__pthread_kill_internal(signo=11, threadid=<unavailable>) at pthread_kill.c:46:37
    ...
  * frame #5: 0x000056027c94fe48 lldb-server`lldb_private::process_linux::GetPtraceScope() + 72
    frame #6: 0x000056027c92f94f lldb-server`lldb_private::process_linux::NativeProcessLinux::Attach(int) + 1087
    ...
```
See [full stack trace](https://pastebin.com/X0d6QhYj).

This happens on Linux where LLDB doesn't have access to
`/proc/sys/kernel/yama/ptrace_scope`.

A similar error (an unchecked `Error`) can be reproduced by running the
newly added unit test without the fix. See the "Test" section below.


# Root cause

`GetPtraceScope()`
([code](https://github.com/llvm/llvm-project/blob/328f40f408c218f25695ea42c844e43bef38660b/lldb/source/Plugins/Process/Linux/Procfs.cpp#L77))
has the following `if` statement:
```
llvm::Expected<int> lldb_private::process_linux::GetPtraceScope() {
  ErrorOr<std::unique_ptr<MemoryBuffer>> ptrace_scope_file =
      getProcFile("sys/kernel/yama/ptrace_scope");
  if (!*ptrace_scope_file)
    return errorCodeToError(ptrace_scope_file.getError());
  ...
}
```

The intention of the `if` statement is to check whether the
`ptrace_scope_file` is an `Error` or not, and return the error if it is.
However, the `operator*` of `ErrorOr` returns the value that is stored
(which is a `std::unique_ptr<MemoryBuffer>`), so what the `if` condition
actually do is to check if the unique pointer is non-null.

Note that the method `ErrorOr::getStorage()` ([called
by](https://github.com/llvm/llvm-project/blob/328f40f408c218f25695ea42c844e43bef38660b/llvm/include/llvm/Support/ErrorOr.h#L162-L164)
`ErrorOr::operator *`) **does** assert on whether or not `HasError` has
been set (see
[ErrorOr.h](https://github.com/llvm/llvm-project/blob/328f40f408c218f25695ea42c844e43bef38660b/llvm/include/llvm/Support/ErrorOr.h#L235-L243)).
However, it seems this wasn't executed, probably because the LLDB was a
release build.

# Fix

The fix is simply remove the `*` in the said `if` statement.
asudarsa pushed a commit that referenced this pull request Jun 6, 2025
…142952)

This was removed in llvm#135343 in
favour of making it a format variable, which we do here. This follows
the precedent of the `[opt]` and `[artificial]` markers.

Before:
```
 thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.2
 * frame #0: 0x000000010000037c a.out`inlined1() at inline.cpp:4:3
   frame #1: 0x000000010000037c a.out`regular() at inline.cpp:6:17
   frame #2: 0x00000001000003b8 a.out`inlined2() at inline.cpp:7:43
   frame #3: 0x00000001000003b4 a.out`main at inline.cpp:10:3
   frame #4: 0x0000000186345be4 dyld`start + 7040
```

After (note the `[inlined]` markers):
```
thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.2
* frame #0: 0x000000010000037c a.out`inlined1() at inline.cpp:4:3 [inlined]
  frame #1: 0x000000010000037c a.out`regular() at inline.cpp:6:17
  frame #2: 0x00000001000003b8 a.out`inlined2() at inline.cpp:7:43 [inlined]
  frame #3: 0x00000001000003b4 a.out`main at inline.cpp:10:3
  frame #4: 0x0000000186345be4 dyld`start + 7040
```

rdar://152642178
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants