Skip to content

Conversation

@a74nh
Copy link
Contributor

@a74nh a74nh commented Dec 13, 2024

Add an implementation of vxsort for Neon.

Add a testing framework to be able to build vxsort by itself, run some basic sanity tests, run some basic performance tests. This will be useful should further improvements be made, or other targets (SVE?) added.

@ghost ghost added the area-GC-coreclr label Dec 13, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Dec 13, 2024
@a74nh
Copy link
Contributor Author

a74nh commented Dec 13, 2024

Work in progress.

Currently the testing for vxsort exists in src/coreclr/gc/vxsort/standalone. This needs refactoring and moving into src/tests somewhere.

I still need to add bitonic search and packing support for Neon. The searching for small lists currently uses a copy of insertsort instead of bitonic search. So that I can check performance, for AVX2 I've disabled packing and switch to insertsort too.

Performance testing is very basic, but running ./simple_bench/Project_demo 250 on Cobalt 100 I see roughly the same for both vxsort and insertsort:

vxsort: Time= 3 us
vxsort: Time= 5 us
vxsort: Time= 4 us
insertsort: Time= 5 us
insertsort: Time= 3 us
insertsort: Time= 7 us

On an AVX2 X64 (Gold 5120T), the vxsort is slightly faster.

vxsort: Time= 6 us
vxsort: Time= 6 us
vxsort: Time= 6 us
insertsort: Time= 8 us
insertsort: Time= 6 us
insertsort: Time= 5 us

On the same AVX2 X64 (Gold 5120T), switching the vxsort code to use bitonic search and packing:

vxsort: Time= 3 us
vxsort: Time= 5 us
vxsort: Time= 4 us

Given the above, I'm fairly confident that implementing the rest for Neon will give some improvements. However, It will never show the same boost as AVX2 given the vector length sizes. On Neon, 128bit vectors means we are only sorting two 64bit values at once.

I noticed that for more than 255 the program segfaults on both X64 and Arm64. This looks like a limitation of vxsort. Might be worth adding some asserts in the GC to check the size of the list?

@a74nh
Copy link
Contributor Author

a74nh commented Dec 13, 2024

@Maoni0
Copy link
Member

Maoni0 commented Dec 14, 2024

thanks for your interest in this!

@damageboy has many tests in his repo - https://github.com/damageboy/vxsort-cpp

I noticed that for more than 255 the program segfaults on both X64 and Arm64.

is 255 number of elements in the array? that'd be quite surprising because we don't even start invoking vxsort till we have at least 8k for avx2 and 128k for avx512.

@a74nh
Copy link
Contributor Author

a74nh commented Dec 16, 2024

is 255 number of elements in the array? that'd be quite surprising because we don't even start invoking vxsort till we have at least 8k for avx2 and 128k for avx512.

Looks like that was a bug in my side. With that fixed, for 8000 elements:

AVX2 X64 (Gold 5120T):

vxsort: Time= 593 us
vxsort: Time= 576 us
vxsort: Time= 566 us
insertsort: Time= 1169 us
insertsort: Time= 1177 us
insertsort: Time= 1168 us

Cobalt 100:

vxsort: Time= 157 us
vxsort: Time= 153 us
vxsort: Time= 156 us
insertsort: Time= 233 us
insertsort: Time= 215 us
insertsort: Time= 220 us

@kunalspathak
Copy link
Contributor

Thanks @a74nh . Can you confirm which of the above numbers are with vs. without your change on Cobalt 100?

@kunalspathak
Copy link
Contributor

Thanks @a74nh . Can you confirm which of the above numbers are with vs. without your change on Cobalt 100?

ignore...so seems today we use insertsort on arm64, so with your numbers, seems like 30% improvement.

@a74nh
Copy link
Contributor Author

a74nh commented Dec 17, 2024

Thanks @a74nh . Can you confirm which of the above numbers are with vs. without your change on Cobalt 100?

ignore...so seems today we use insertsort on arm64, so with your numbers, seems like 30% improvement.

Yes. I'm hoping to get more by porting both bitonic search and packing for Arm64. In the above figures, I've disabled both on those on X86. When I re-enable them again, X86 goes from ~576ms to ~162ms. So there's definitely some more performance to find.

@a74nh
Copy link
Contributor Author

a74nh commented Dec 20, 2024

I've added an implementation of Bitonic.

The plan was to do the bitonic sort using NEON. Unfortunately instructions like rev, min, max etc do not have variants that work on 64bit elements - they only have 8/16/32 variants. (A broken version showing what it would look like if those instructions existed is here).

For some of the bitonic functions, they can be done in NEON with a few extra instructions (eg cmgt+bit instead of max). For other functions the most optimal way is to move the values into GPR registers and use scalar code. That's very messy and looses perf in all the moves.

An alternative is to simply to avoid NEON and use GPR registers throughout. This can be done by simply writing the code in C++ instead of intrinsics, allowing the compiler to optimise.

As a result, I've implemented the bitonic using scalar code. It's highly doubtful that a mix of NEON+scalar would give better performance. As a bonus it is architecture independent code.

Note that for 8/16/32 values, NEON would be the preferred option. Also, SVE would give better performance on 256bit machines (currently only neoverse V1), but it's doubtful on 128bit machines, although it would shorten the code size. I don't plan on implementing with SVE in this PR.

Trying this code on cobalt 100 shows quite an additional speedup (previously vxsort was running at ~150ms)

❯ ./simple_bench/Project_demo 8000
vxsort: Time= 113 us
vxsort: Time= 123 us
vxsort: Time= 117 us
insertsort: Time= 220 us
insertsort: Time= 221 us
insertsort: Time= 238 us

In the new year, I'll look at cleaning this up, sorting out the tests etc. I'll also looking at the missing "packing" code in vxsort, see if there's anything else to gain.

@JulieLeeMSFT JulieLeeMSFT added this to the 11.0.0 milestone Sep 15, 2025
@Copilot Copilot AI review requested due to automatic review settings September 16, 2025 09:24
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds Arm64 Neon implementation of vxsort to complement the existing x86/AMD64 AVX implementations. The PR extends the vectorized sorting algorithm to support Arm64 processors with NEON instruction set capabilities and includes a standalone testing framework for development and verification.

  • Adds Neon implementation of machine traits and sorting algorithm for Arm64 architecture
  • Creates standalone testing framework for vxsort development and benchmarking
  • Generates scalar bitonic sort implementations as fallback for small arrays

Reviewed Changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/coreclr/tools/aot/ILCompiler/reproNative/reproNative.vcxproj Adds VxsortDisabled library reference for Arm64 AOT builds
src/coreclr/nativeaot/Runtime/Full/CMakeLists.txt Extends VxSort library build to include Arm64 architecture
src/coreclr/nativeaot/Runtime/CMakeLists.txt Adds Neon-specific source files for Arm64 VxSort implementation
src/coreclr/nativeaot/BuildIntegration/Microsoft.NETCore.Native.*.targets Updates build conditions to include arm64 for VxSort library linking
src/coreclr/gc/vxsort/vxsort.h Adds conditional compilation for AMD64 vs other architectures, includes introsort fallback
src/coreclr/gc/vxsort/standalone/* New testing framework with demo applications and performance benchmarks
src/coreclr/gc/vxsort/smallsort/codegen/* Adds scalar bitonic sort code generation for fallback sorting
src/coreclr/gc/vxsort/smallsort/bitonic_sort.scalar..generated. Generated scalar sorting implementations for uint32_t and uint64_t
src/coreclr/gc/vxsort/machine_traits.neon.* Neon-specific machine traits implementation for Arm64
src/coreclr/gc/vxsort/do_vxsort_neon.cpp Entry point function for Neon-based vxsort
src/coreclr/gc/gc.cpp Updates GC to support Neon vxsort on Arm64 and moves introsort to separate header
Various CMakeLists.txt files Build system updates to include Arm64 vxsort support
Comments suppressed due to low confidence (1)

src/coreclr/tools/aot/ILCompiler/reproNative/reproNative.vcxproj:1

  • The XML formatting contains inconsistent spacing and line breaks. There are spaces before "bin\coreclr" in some paths but not others, and the line break placement is inconsistent. This could lead to build issues or make the configuration harder to maintain.
<Project DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">

@a74nh
Copy link
Contributor Author

a74nh commented Sep 16, 2025

The new version of the PR now supports packing - uint64s are compressed to uint32 before sorting. This gives an additional boost.

I could go further - the smallsort for uint64 has to use scalar code, but the uint32 could be rewritten to use NEON. Although it's quite a lot of additional auto-generated functions to support it.

There are some minor build errors I still need to fix.

And I still need to measure the size of AOT binaries.

@a74nh
Copy link
Contributor Author

a74nh commented Sep 16, 2025

Running simple_bench with 500000 on an Nvidia Grace:

vxsort: Time= 5956 us
insertsort: Time= 18972 us

Older version of the PR (without packing):
vxsort: Time= 8502 us

Also, we should decide whether to keep the new test files or remove them.

@a74nh
Copy link
Contributor Author

a74nh commented Sep 18, 2025

Added support for NEON based bitonic sort for uint32_t. Still uses scalar for uint64_t as that is the best way to do it (given there is no vminq_u64 / vmaxq_u64).

Performance is a little bit better again. For 500000 items on a Grace:

HEAD: 18977 us
Pre-Packing version: 8510 us
Pre-NEON bitonic version: 6198 us
Current version: 5834 us

I'm not planning on making any big changes after this now.

@jkotas
Copy link
Member

jkotas commented Sep 18, 2025

Pre-NEON bitonic version: 6198 us
Current version: 5834 us

How much extra binary size are we paying for this improvement?

@a74nh
Copy link
Contributor Author

a74nh commented Sep 19, 2025

Pre-NEON bitonic version: 6198 us
Current version: 5834 us

How much extra binary size are we paying for this improvement?

For jitted, the release version of libclrgc.so has gone from 775K to 854K. So, 79K increase.

For AOT, in artifacts/bin/coreclr/linux.arm64.Release/aotsdk I see:

-rw-r--r-- 1 alahay01 alahay01  23K Sep 18 16:43 libRuntime.VxsortDisabled.a
-rw-r--r-- 1 alahay01 alahay01 775K Sep 18 16:43 libRuntime.VxsortEnabled.a

A simple helloworld AOT binary is currently 1.4M, so I'd expect it to go up to 2.1M with vxsort - so 33% (!)

I'm having problems using vxsort with AOT in practice though.

As of #118633 vxsort is off by default in AOT. So, I should just be able to run:
dotnet publish -c Release -p:PublishAot=true helloworld.csproj
dotnet publish -c Release -p:PublishAot=true -p:IlcEnableVxsort=true helloworld.csproj
and compare the size difference in the built binary.

What I'm not sure how to do is to build with my CoreCLR libraries.
I've tried dotnet.sh, but that ends with everything coming from .dotnet/ during the build.

Curiously, when I look at the downloaded .dotnet/, I don't see a aotsdk directory in it, so I'm not sure what happens to these files in the release.

Maybe I can test this via the src/tests in some way?

@VSadov
Copy link
Member

VSadov commented Sep 19, 2025

What I'm not sure how to do is to build with my CoreCLR libraries.
I've tried dotnet.sh, but that ends with everything coming from .dotnet/ during the build.

There is a way to use locally built packages with the SDK.
https://github.com/dotnet/runtime/blob/main/docs/workflow/testing/using-dev-shipping-packages.md

Last time I tried I was able to build either JIT or AOT apps with that approach.
(anything that says 10 in the instructions should probably mean 11 now)

@a74nh
Copy link
Contributor Author

a74nh commented Sep 19, 2025

There is a way to use locally built packages with the SDK.

Thanks! In the end, I build the AOT tests.

Looking at src/tests/nativeaot/CustomMain. When built normally with this PR:
-rwxrwxr-x 1 alahay01 alahay01 1269496 Sep 19 15:17 CustomMain

I don't see any vxsort functions in the binary (checked by opening in gdb and breaking all all functions called sort).

I added <IlcEnableVxsort>true</IlcEnableVxsort> to CustomMain.csproj and rebuilt:
-rwxrwxr-x 1 alahay01 alahay01 1347200 Sep 19 15:18 CustomMain

Now I can see the vxsort functions in the binary (using gdb).

So that's only a 77K increase. Going from 1.26MB to 1.35MB, Or 5.76% of the new binary size.

@a74nh
Copy link
Contributor Author

a74nh commented Sep 19, 2025

Re-ran Orchard on an Nvidia Grace using Egor's script.

HEAD:
11904.68
11222.52
12123.36
13348.58
13018.35
13016.68
12909.51

PR:
15112.48
16343.40
13833.41
16139.92
13190.81
14864.46
13158.77
14823.63
14757.19

The older version of the PR ranged from 12715.27 to 14681.96, so there is a definite improvement from that.

@jkotas
Copy link
Member

jkotas commented Sep 19, 2025

So that's only a 77K increase

Ok, that looks reasonable to me.

Thanks! In the end, I build the AOT tests.
I don't see any vxsort functions in the binary (checked by opening in gdb and breaking all all functions called sort).

vxsort is not enabled for native AOT by default. It is only enabled for <OptimizationPreference>speed</OptimizationPreference> (or via the undocumented IlcEnableVxsort switch).

@VSadov
Copy link
Member

VSadov commented Sep 20, 2025

I looked through the implementation and the reported perf/size diffs. This looks at least as good as x64 implementation.
The self-consistency checks in chk/dbg should catch invalid sort results.

Since we are at the very beginning of net11, I think the best course of action from here is to merge this and see what happens.

Copy link
Member

@VSadov VSadov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!!!

@a74nh a74nh removed the request for review from Maoni0 September 20, 2025 14:04
Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@VSadov VSadov merged commit 2ccc38b into dotnet:main Sep 23, 2025
99 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arch-arm64 area-GC-coreclr community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants