-
Notifications
You must be signed in to change notification settings - Fork 5.2k
vxsort: Add Arm64 Neon implementation and tests #110692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Work in progress. Currently the testing for vxsort exists in I still need to add bitonic search and packing support for Neon. The searching for small lists currently uses a copy of insertsort instead of bitonic search. So that I can check performance, for AVX2 I've disabled packing and switch to insertsort too. Performance testing is very basic, but running On an AVX2 X64 (Gold 5120T), the vxsort is slightly faster. On the same AVX2 X64 (Gold 5120T), switching the vxsort code to use bitonic search and packing: Given the above, I'm fairly confident that implementing the rest for Neon will give some improvements. However, It will never show the same boost as AVX2 given the vector length sizes. On Neon, 128bit vectors means we are only sorting two 64bit values at once. I noticed that for more than 255 the program segfaults on both X64 and Arm64. This looks like a limitation of vxsort. Might be worth adding some asserts in the GC to check the size of the list? |
|
thanks for your interest in this! @damageboy has many tests in his repo - https://github.com/damageboy/vxsort-cpp
is 255 number of elements in the array? that'd be quite surprising because we don't even start invoking vxsort till we have at least 8k for avx2 and 128k for avx512. |
Looks like that was a bug in my side. With that fixed, for 8000 elements: AVX2 X64 (Gold 5120T): Cobalt 100: |
|
Thanks @a74nh . Can you confirm which of the above numbers are with vs. without your change on Cobalt 100? |
ignore...so seems today we use |
Yes. I'm hoping to get more by porting both bitonic search and packing for Arm64. In the above figures, I've disabled both on those on X86. When I re-enable them again, X86 goes from ~576ms to ~162ms. So there's definitely some more performance to find. |
Change-Id: I19e0fc293b67e28d1dd5491efd9b4e9b86c5c4d7
|
I've added an implementation of Bitonic. The plan was to do the bitonic sort using NEON. Unfortunately instructions like For some of the bitonic functions, they can be done in NEON with a few extra instructions (eg An alternative is to simply to avoid NEON and use GPR registers throughout. This can be done by simply writing the code in C++ instead of intrinsics, allowing the compiler to optimise. As a result, I've implemented the bitonic using scalar code. It's highly doubtful that a mix of NEON+scalar would give better performance. As a bonus it is architecture independent code. Note that for 8/16/32 values, NEON would be the preferred option. Also, SVE would give better performance on 256bit machines (currently only neoverse V1), but it's doubtful on 128bit machines, although it would shorten the code size. I don't plan on implementing with SVE in this PR. Trying this code on cobalt 100 shows quite an additional speedup (previously vxsort was running at ~150ms) In the new year, I'll look at cleaning this up, sorting out the tests etc. I'll also looking at the missing "packing" code in vxsort, see if there's anything else to gain. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds Arm64 Neon implementation of vxsort to complement the existing x86/AMD64 AVX implementations. The PR extends the vectorized sorting algorithm to support Arm64 processors with NEON instruction set capabilities and includes a standalone testing framework for development and verification.
- Adds Neon implementation of machine traits and sorting algorithm for Arm64 architecture
- Creates standalone testing framework for vxsort development and benchmarking
- Generates scalar bitonic sort implementations as fallback for small arrays
Reviewed Changes
Copilot reviewed 34 out of 34 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| src/coreclr/tools/aot/ILCompiler/reproNative/reproNative.vcxproj | Adds VxsortDisabled library reference for Arm64 AOT builds |
| src/coreclr/nativeaot/Runtime/Full/CMakeLists.txt | Extends VxSort library build to include Arm64 architecture |
| src/coreclr/nativeaot/Runtime/CMakeLists.txt | Adds Neon-specific source files for Arm64 VxSort implementation |
| src/coreclr/nativeaot/BuildIntegration/Microsoft.NETCore.Native.*.targets | Updates build conditions to include arm64 for VxSort library linking |
| src/coreclr/gc/vxsort/vxsort.h | Adds conditional compilation for AMD64 vs other architectures, includes introsort fallback |
| src/coreclr/gc/vxsort/standalone/* | New testing framework with demo applications and performance benchmarks |
| src/coreclr/gc/vxsort/smallsort/codegen/* | Adds scalar bitonic sort code generation for fallback sorting |
| src/coreclr/gc/vxsort/smallsort/bitonic_sort.scalar..generated. | Generated scalar sorting implementations for uint32_t and uint64_t |
| src/coreclr/gc/vxsort/machine_traits.neon.* | Neon-specific machine traits implementation for Arm64 |
| src/coreclr/gc/vxsort/do_vxsort_neon.cpp | Entry point function for Neon-based vxsort |
| src/coreclr/gc/gc.cpp | Updates GC to support Neon vxsort on Arm64 and moves introsort to separate header |
| Various CMakeLists.txt files | Build system updates to include Arm64 vxsort support |
Comments suppressed due to low confidence (1)
src/coreclr/tools/aot/ILCompiler/reproNative/reproNative.vcxproj:1
- The XML formatting contains inconsistent spacing and line breaks. There are spaces before "bin\coreclr" in some paths but not others, and the line break placement is inconsistent. This could lead to build issues or make the configuration harder to maintain.
<Project DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
|
The new version of the PR now supports packing - uint64s are compressed to uint32 before sorting. This gives an additional boost. I could go further - the smallsort for uint64 has to use scalar code, but the uint32 could be rewritten to use NEON. Although it's quite a lot of additional auto-generated functions to support it. There are some minor build errors I still need to fix. And I still need to measure the size of AOT binaries. |
|
Running simple_bench with 500000 on an Nvidia Grace: vxsort: Time= 5956 us Older version of the PR (without packing): Also, we should decide whether to keep the new test files or remove them. |
|
Added support for NEON based bitonic sort for uint32_t. Still uses scalar for uint64_t as that is the best way to do it (given there is no vminq_u64 / vmaxq_u64). Performance is a little bit better again. For 500000 items on a Grace: HEAD: 18977 us I'm not planning on making any big changes after this now. |
How much extra binary size are we paying for this improvement? |
For jitted, the release version of For AOT, in A simple helloworld AOT binary is currently 1.4M, so I'd expect it to go up to 2.1M with vxsort - so 33% (!) I'm having problems using vxsort with AOT in practice though. As of #118633 vxsort is off by default in AOT. So, I should just be able to run: What I'm not sure how to do is to build with my CoreCLR libraries. Curiously, when I look at the downloaded .dotnet/, I don't see a Maybe I can test this via the src/tests in some way? |
There is a way to use locally built packages with the SDK. Last time I tried I was able to build either JIT or AOT apps with that approach. |
Thanks! In the end, I build the AOT tests. Looking at I don't see any vxsort functions in the binary (checked by opening in gdb and breaking all all functions called I added Now I can see the vxsort functions in the binary (using gdb). So that's only a 77K increase. Going from 1.26MB to 1.35MB, Or 5.76% of the new binary size. |
|
Re-ran Orchard on an Nvidia Grace using Egor's script. HEAD: PR: The older version of the PR ranged from 12715.27 to 14681.96, so there is a definite improvement from that. |
Ok, that looks reasonable to me.
vxsort is not enabled for native AOT by default. It is only enabled for |
|
I looked through the implementation and the reported perf/size diffs. This looks at least as good as x64 implementation. Since we are at the very beginning of net11, I think the best course of action from here is to merge this and see what happens. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Add an implementation of vxsort for Neon.
Add a testing framework to be able to build vxsort by itself, run some basic sanity tests, run some basic performance tests. This will be useful should further improvements be made, or other targets (SVE?) added.