vxsort: Add Arm64 Neon implementation and tests #110692

a74nh · 2024-12-13T13:48:42Z

Add an implementation of vxsort for Neon.

Add a testing framework to be able to build vxsort by itself, run some basic sanity tests, and run some basic performance tests. This will be useful should further improvements be made, or other targets (SVE?) added.

a74nh · 2024-12-13T14:07:16Z

Work in progress.

Currently the testing for vxsort exists in src/coreclr/gc/vxsort/standalone. This needs refactoring and moving into src/tests somewhere.

I still need to add bitonic sort and packing support for Neon. Sorting of small lists currently uses a copy of insertsort instead of bitonic sort. So that I can compare performance, I've also disabled packing and switched to insertsort for AVX2.

Performance testing is very basic, but running ./simple_bench/Project_demo 250 on Cobalt 100 I see roughly the same for both vxsort and insertsort:

vxsort: Time= 3 us
vxsort: Time= 5 us
vxsort: Time= 4 us
insertsort: Time= 5 us
insertsort: Time= 3 us
insertsort: Time= 7 us

On an AVX2 X64 (Gold 5120T), the vxsort is slightly faster.

vxsort: Time= 6 us
vxsort: Time= 6 us
vxsort: Time= 6 us
insertsort: Time= 8 us
insertsort: Time= 6 us
insertsort: Time= 5 us

On the same AVX2 X64 (Gold 5120T), switching the vxsort code to use bitonic sort and packing:

vxsort: Time= 3 us
vxsort: Time= 5 us
vxsort: Time= 4 us

Given the above, I'm fairly confident that implementing the rest for Neon will give some improvements. However, it will never show the same boost as AVX2 given the vector lengths. On Neon, 128-bit vectors mean we are only sorting two 64-bit values at once.

I noticed that for more than 255 the program segfaults on both X64 and Arm64. This looks like a limitation of vxsort. Might be worth adding some asserts in the GC to check the size of the list?
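For reference, the insertsort baseline and the timing loop behind figures like the ones above can be sketched roughly as follows. This is a minimal sketch and not the actual simple_bench source: `insertion_sort`, `make_input` and `time_sort_us` are hypothetical names, and any function with the same `(pointer, count)` shape can be passed in as the sort under test.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Plain insertion sort, used here as the scalar baseline.
inline void insertion_sort(uint64_t* data, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        uint64_t key = data[i];
        std::size_t j = i;
        // Shift larger elements one slot to the right to make room for key.
        while (j > 0 && data[j - 1] > key) {
            data[j] = data[j - 1];
            --j;
        }
        data[j] = key;
    }
}

// Hypothetical helper: build a reproducible pseudo-random input.
inline std::vector<uint64_t> make_input(std::size_t n, uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::vector<uint64_t> v(n);
    for (auto& x : v) x = rng();
    return v;
}

// Time a single call of sort_fn on a private copy of the input and
// return the elapsed time in microseconds.
template <typename SortFn>
long long time_sort_us(const std::vector<uint64_t>& input, SortFn sort_fn) {
    std::vector<uint64_t> work = input;
    auto start = std::chrono::steady_clock::now();
    sort_fn(work.data(), work.size());
    auto stop = std::chrono::steady_clock::now();
    assert(std::is_sorted(work.begin(), work.end()));  // basic sanity check
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling `time_sort_us(make_input(250, 1), insertion_sort)` a few times gives one line of output per run. With inputs this small the times are only a few microseconds, so run-to-run noise is of the same order as the differences being measured.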

a74nh · 2024-12-13T15:33:24Z

@kunalspathak @JulieLeeMSFT

Maoni0 · 2024-12-14T04:39:56Z

thanks for your interest in this!

@damageboy has many tests in his repo - https://github.com/damageboy/vxsort-cpp

I noticed that for more than 255 the program segfaults on both X64 and Arm64.

is 255 number of elements in the array? that'd be quite surprising because we don't even start invoking vxsort till we have at least 8k for avx2 and 128k for avx512.

src/coreclr/gc/vxsort/do_vxsort_neon.cpp

src/coreclr/gc/vxsort/introsort.cpp

a74nh · 2024-12-16T10:42:22Z

is 255 number of elements in the array? that'd be quite surprising because we don't even start invoking vxsort till we have at least 8k for avx2 and 128k for avx512.

Looks like that was a bug on my side. With that fixed, for 8000 elements:

AVX2 X64 (Gold 5120T):

vxsort: Time= 593 us
vxsort: Time= 576 us
vxsort: Time= 566 us
insertsort: Time= 1169 us
insertsort: Time= 1177 us
insertsort: Time= 1168 us

Cobalt 100:

vxsort: Time= 157 us
vxsort: Time= 153 us
vxsort: Time= 156 us
insertsort: Time= 233 us
insertsort: Time= 215 us
insertsort: Time= 220 us

kunalspathak · 2024-12-16T18:38:36Z

Thanks @a74nh . Can you confirm which of the above numbers are with vs. without your change on Cobalt 100?

kunalspathak · 2024-12-16T18:51:23Z

Thanks @a74nh . Can you confirm which of the above numbers are with vs. without your change on Cobalt 100?

ignore...so seems today we use insertsort on arm64, so with your numbers, seems like 30% improvement.

a74nh · 2024-12-17T09:47:03Z

Thanks @a74nh . Can you confirm which of the above numbers are with vs. without your change on Cobalt 100?

ignore...so seems today we use insertsort on arm64, so with your numbers, seems like 30% improvement.

Yes. I'm hoping to get more by porting both bitonic sort and packing to Arm64. In the above figures, I've disabled both of those on X86. When I re-enable them, X86 goes from ~576 us to ~162 us. So there's definitely some more performance to find.

Change-Id: I19e0fc293b67e28d1dd5491efd9b4e9b86c5c4d7

a74nh · 2024-12-20T16:42:18Z

I've added an implementation of Bitonic.

The plan was to do the bitonic sort using NEON. Unfortunately instructions like rev, min, max etc do not have variants that work on 64bit elements - they only have 8/16/32 variants. (A broken version showing what it would look like if those instructions existed is here).

For some of the bitonic functions, the missing operations can be done in NEON with a few extra instructions (e.g. cmgt+bit instead of max). For other functions the most efficient way is to move the values into GPR registers and use scalar code. That's very messy and loses performance in all the moves.
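As an illustration of the compare-plus-select idea, here is a minimal sketch of a 64-bit max/min pair. The function names are hypothetical and this is not code from the PR. The NEON path is only compiled on AArch64 and uses the `vcgtq_u64` and `vbslq_u64` intrinsics (for unsigned lanes the compare is emitted as cmhi rather than cmgt). The scalar functions model the same mask-and-select logic so the idea can be checked on any machine.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>

// There is no vmaxq_u64/vminq_u64, so build them from a compare and a
// bitwise select. vcgtq_u64 sets a lane to all-ones where a > b.
inline uint64x2_t max_u64x2(uint64x2_t a, uint64x2_t b) {
    uint64x2_t a_gt_b = vcgtq_u64(a, b);
    return vbslq_u64(a_gt_b, a, b);  // a where the mask is set, else b
}

inline uint64x2_t min_u64x2(uint64x2_t a, uint64x2_t b) {
    uint64x2_t a_gt_b = vcgtq_u64(a, b);
    return vbslq_u64(a_gt_b, b, a);  // b where the mask is set, else a
}
#endif

// Scalar model of the bitwise select: take bits of a where mask is 1,
// bits of b where mask is 0.
inline uint64_t select_bits(uint64_t mask, uint64_t a, uint64_t b) {
    return (a & mask) | (b & ~mask);
}

inline uint64_t max_u64_masked(uint64_t a, uint64_t b) {
    uint64_t a_gt_b = (a > b) ? ~uint64_t{0} : uint64_t{0};
    return select_bits(a_gt_b, a, b);
}

inline uint64_t min_u64_masked(uint64_t a, uint64_t b) {
    uint64_t a_gt_b = (a > b) ? ~uint64_t{0} : uint64_t{0};
    return select_bits(a_gt_b, b, a);
}
```

Each max or min costs two instructions here instead of one, which is part of why the 64-bit case is less attractive in NEON than the 8/16/32-bit cases.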

An alternative is simply to avoid NEON and use GPR registers throughout. This can be done by writing the code in plain C++ instead of intrinsics, allowing the compiler to optimise.

As a result, I've implemented the bitonic using scalar code. It's highly doubtful that a mix of NEON+scalar would give better performance. As a bonus it is architecture independent code.
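To show what a scalar bitonic sort looks like in plain C++, here is a minimal loop-based sketch. It is an assumption-laden illustration rather than the code in this PR: the PR uses generated, fixed-size functions, while this version is a generic network for power-of-two lengths, and `compare_exchange` and `bitonic_sort_scalar` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Compare-exchange: afterwards a <= b. Written with min/max so the
// compiler is free to use conditional selects instead of branches.
inline void compare_exchange(uint64_t& a, uint64_t& b) {
    uint64_t lo = std::min(a, b);
    uint64_t hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Iterative bitonic sorting network, ascending order.
// n must be a power of two.
inline void bitonic_sort_scalar(uint64_t* data, std::size_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    for (std::size_t k = 2; k <= n; k *= 2) {         // size of the runs being merged
        for (std::size_t j = k / 2; j > 0; j /= 2) {  // distance between compared elements
            for (std::size_t i = 0; i < n; ++i) {
                std::size_t partner = i ^ j;
                if (partner > i) {
                    if ((i & k) == 0) {
                        compare_exchange(data[i], data[partner]);  // ascending block
                    } else {
                        compare_exchange(data[partner], data[i]);  // descending block
                    }
                }
            }
        }
    }
}
```

The sequence of comparisons depends only on the indices and never on the data, which is what makes it possible to generate fully unrolled versions for fixed sizes.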

Note that for 8/16/32-bit values, NEON would be the preferred option. Also, SVE would give better performance on 256-bit machines (currently only Neoverse V1), but it's doubtful on 128-bit machines, although it would shorten the code size. I don't plan on implementing an SVE version in this PR.

Trying this code on Cobalt 100 shows quite an additional speedup (previously vxsort was running at ~150 us):

❯ ./simple_bench/Project_demo 8000
vxsort: Time= 113 us
vxsort: Time= 123 us
vxsort: Time= 117 us
insertsort: Time= 220 us
insertsort: Time= 221 us
insertsort: Time= 238 us

In the new year, I'll look at cleaning this up, sorting out the tests etc. I'll also look at the missing "packing" code in vxsort, to see if there's anything else to gain.

Copilot

Pull Request Overview

Adds Arm64 Neon implementation of vxsort to complement the existing x86/AMD64 AVX implementations. The PR extends the vectorized sorting algorithm to support Arm64 processors with NEON instruction set capabilities and includes a standalone testing framework for development and verification.

Adds Neon implementation of machine traits and sorting algorithm for Arm64 architecture
Creates standalone testing framework for vxsort development and benchmarking
Generates scalar bitonic sort implementations as fallback for small arrays

Reviewed Changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
src/coreclr/tools/aot/ILCompiler/reproNative/reproNative.vcxproj	Adds VxsortDisabled library reference for Arm64 AOT builds
src/coreclr/nativeaot/Runtime/Full/CMakeLists.txt	Extends VxSort library build to include Arm64 architecture
src/coreclr/nativeaot/Runtime/CMakeLists.txt	Adds Neon-specific source files for Arm64 VxSort implementation
src/coreclr/nativeaot/BuildIntegration/Microsoft.NETCore.Native.*.targets	Updates build conditions to include arm64 for VxSort library linking
src/coreclr/gc/vxsort/vxsort.h	Adds conditional compilation for AMD64 vs other architectures, includes introsort fallback
src/coreclr/gc/vxsort/standalone/*	New testing framework with demo applications and performance benchmarks
src/coreclr/gc/vxsort/smallsort/codegen/*	Adds scalar bitonic sort code generation for fallback sorting
src/coreclr/gc/vxsort/smallsort/bitonic_sort.scalar..generated.	Generated scalar sorting implementations for uint32_t and uint64_t
src/coreclr/gc/vxsort/machine_traits.neon.*	Neon-specific machine traits implementation for Arm64
src/coreclr/gc/vxsort/do_vxsort_neon.cpp	Entry point function for Neon-based vxsort
src/coreclr/gc/gc.cpp	Updates GC to support Neon vxsort on Arm64 and moves introsort to separate header
Various CMakeLists.txt files	Build system updates to include Arm64 vxsort support

Comments suppressed due to low confidence (1)

src/coreclr/tools/aot/ILCompiler/reproNative/reproNative.vcxproj:1

The XML formatting contains inconsistent spacing and line breaks. There are spaces before "bin\coreclr" in some paths but not others, and the line break placement is inconsistent. This could lead to build issues or make the configuration harder to maintain.

<Project DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">

a74nh · 2025-09-16T13:50:10Z

The new version of the PR now supports packing - uint64s are compressed to uint32 before sorting. This gives an additional boost.
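The idea behind packing can be sketched as follows. This is a simplified illustration under stated assumptions, not the packer in this PR: `sort_packed` is a hypothetical name, the keys are assumed to be aligned to `1 << shift` relative to the smallest key, and `std::sort` stands in for the vectorized 32-bit sort. If all keys fit in 32 bits once the minimum is subtracted and the alignment bits are shifted out, the 32-bit values can be sorted instead, which on a 128-bit vector means four lanes rather than two.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: sort 64-bit keys by sorting 32-bit offsets from
// the smallest key. Returns false and leaves data untouched when the
// keys cannot be represented that way.
inline bool sort_packed(std::vector<uint64_t>& data, unsigned shift) {
    if (data.empty()) return true;
    assert(shift < 32);

    auto mm = std::minmax_element(data.begin(), data.end());
    const uint64_t base = *mm.first;
    const uint64_t range = *mm.second - base;
    if ((range >> shift) > UINT32_MAX) return false;  // offsets do not fit in 32 bits

    // Every offset must be a multiple of (1 << shift), or unpacking would
    // not restore the original value.
    const uint64_t low_mask = (uint64_t{1} << shift) - 1;
    for (uint64_t v : data) {
        if (((v - base) & low_mask) != 0) return false;
    }

    // Pack: subtract the base and drop the alignment bits.
    std::vector<uint32_t> packed(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        packed[i] = static_cast<uint32_t>((data[i] - base) >> shift);
    }

    std::sort(packed.begin(), packed.end());  // stand-in for the 32-bit vectorized sort

    // Unpack back into the original 64-bit buffer.
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] = (static_cast<uint64_t>(packed[i]) << shift) + base;
    }
    return true;
}
```

When `sort_packed` returns false, the caller would fall back to sorting the 64-bit values directly.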

I could go further - the smallsort for uint64 has to use scalar code, but the uint32 could be rewritten to use NEON. Although it's quite a lot of additional auto-generated functions to support it.

There are some minor build errors I still need to fix.

And I still need to measure the size of AOT binaries.

a74nh · 2025-09-16T13:54:43Z

Running simple_bench with 500000 on an Nvidia Grace:

vxsort: Time= 5956 us
insertsort: Time= 18972 us

Older version of the PR (without packing):
vxsort: Time= 8502 us

Also, we should decide whether to keep the new test files or remove them.

a74nh · 2025-09-18T14:05:10Z

Added support for NEON based bitonic sort for uint32_t. Still uses scalar for uint64_t as that is the best way to do it (given there is no vminq_u64 / vmaxq_u64).

Performance is a little bit better again. For 500000 items on a Grace:

HEAD: 18977 us
Pre-Packing version: 8510 us
Pre-NEON bitonic version: 6198 us
Current version: 5834 us

I'm not planning on making any big changes after this now.

jkotas · 2025-09-18T17:08:41Z

Pre-NEON bitonic version: 6198 us
Current version: 5834 us

How much extra binary size are we paying for this improvement?

a74nh · 2025-09-19T09:46:43Z

Pre-NEON bitonic version: 6198 us
Current version: 5834 us

How much extra binary size are we paying for this improvement?

For jitted, the release version of libclrgc.so has gone from 775K to 854K. So, 79K increase.

For AOT, in artifacts/bin/coreclr/linux.arm64.Release/aotsdk I see:

-rw-r--r-- 1 alahay01 alahay01  23K Sep 18 16:43 libRuntime.VxsortDisabled.a
-rw-r--r-- 1 alahay01 alahay01 775K Sep 18 16:43 libRuntime.VxsortEnabled.a

A simple helloworld AOT binary is currently 1.4M, so I'd expect it to go up to 2.1M with vxsort - so 33% (!)

I'm having problems using vxsort with AOT in practice though.

As of #118633 vxsort is off by default in AOT. So, I should just be able to run:
dotnet publish -c Release -p:PublishAot=true helloworld.csproj
dotnet publish -c Release -p:PublishAot=true -p:IlcEnableVxsort=true helloworld.csproj
and compare the size difference in the built binary.

What I'm not sure how to do is to build with my CoreCLR libraries.
I've tried dotnet.sh, but that ends with everything coming from .dotnet/ during the build.

Curiously, when I look at the downloaded .dotnet/, I don't see an aotsdk directory in it, so I'm not sure what happens to these files in the release.

Maybe I can test this via the src/tests in some way?

VSadov · 2025-09-19T13:39:50Z

What I'm not sure how to do is to build with my CoreCLR libraries.
I've tried dotnet.sh, but that ends with everything coming from .dotnet/ during the build.

There is a way to use locally built packages with the SDK.
https://github.com/dotnet/runtime/blob/main/docs/workflow/testing/using-dev-shipping-packages.md

Last time I tried I was able to build either JIT or AOT apps with that approach.
(anything that says 10 in the instructions should probably mean 11 now)

a74nh · 2025-09-19T15:35:01Z

There is a way to use locally built packages with the SDK.

Thanks! In the end, I built the AOT tests.

Looking at src/tests/nativeaot/CustomMain, when built normally with this PR:
-rwxrwxr-x 1 alahay01 alahay01 1269496 Sep 19 15:17 CustomMain

I don't see any vxsort functions in the binary (checked by opening in gdb and breaking on all functions called sort).

I added <IlcEnableVxsort>true</IlcEnableVxsort> to CustomMain.csproj and rebuilt:
-rwxrwxr-x 1 alahay01 alahay01 1347200 Sep 19 15:18 CustomMain

Now I can see the vxsort functions in the binary (using gdb).

So that's only a 77K increase, going from 1.26MB to 1.35MB, or 5.76% of the new binary size.

a74nh · 2025-09-19T17:42:39Z

Re-ran Orchard on an Nvidia Grace using Egor's script.

HEAD:
11904.68
11222.52
12123.36
13348.58
13018.35
13016.68
12909.51

PR:
15112.48
16343.40
13833.41
16139.92
13190.81
14864.46
13158.77
14823.63
14757.19

The older version of the PR ranged from 12715.27 to 14681.96, so there is a definite improvement from that.

jkotas · 2025-09-19T17:47:07Z

So that's only a 77K increase

Ok, that looks reasonable to me.

Thanks! In the end, I built the AOT tests.
I don't see any vxsort functions in the binary (checked by opening in gdb and breaking on all functions called sort).

vxsort is not enabled for native AOT by default. It is only enabled for <OptimizationPreference>speed</OptimizationPreference> (or via the undocumented IlcEnableVxsort switch).

VSadov · 2025-09-20T01:48:42Z

I looked through the implementation and the reported perf/size diffs. This looks at least as good as the x64 implementation.
The self-consistency checks in chk/dbg should catch invalid sort results.

Since we are at the very beginning of net11, I think the best course of action from here is to merge this and see what happens.

VSadov

LGTM. Thanks!!!

jkotas

LGTM

vxsort: Add Arm64 Neon implementation and tests

44be149

ghost added the area-GC-coreclr label Dec 13, 2024

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Dec 13, 2024

am11 reviewed Dec 14, 2024

src/coreclr/gc/vxsort/do_vxsort_neon.cpp (outdated)

am11 reviewed Dec 14, 2024

src/coreclr/gc/vxsort/introsort.cpp (outdated)

allow more than 255 elements

8e9af9a

cleanup insertsort

5ce2c0b

a74nh added 3 commits December 19, 2024 10:17

Add Neon to bitonic

9c8151d

Add scalar version of bitonic

f1b591a

Change-Id: I19e0fc293b67e28d1dd5491efd9b4e9b86c5c4d7

remove Neon from bitonic

e4771c1

Remove TARGET_ checks

a60bae5

This was referenced Jan 6, 2025

slow macOS - "##[error]The job running on agent Azure Pipelines 9 ran longer than the maximum time of 60 minutes." dotnet/dnceng#1883

Open

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

a74nh added 9 commits January 6, 2025 16:18

Remove all SVE references

a39c6b5

Enable vxsort in GC for Arm64

bfddad9

Fix nativeaot build

a275ead

Fix ISA detection for windows

05dc17f

Add vxsort to windows Arm64 build

6614ed8

Merge main

57ed7a0

Add vxsort to nativeaot

a168d7c

Explict popcount implementation for msvc

db404c1

Remove reinterpret_cast for msvc

e5ea309

JulieLeeMSFT added this to the 11.0.0 milestone Sep 15, 2025

Add 32bit compressed sort

8fbb659

Copilot AI review requested due to automatic review settings September 16, 2025 09:24

Copilot AI reviewed Sep 16, 2025

a74nh added 3 commits September 16, 2025 13:16

Add N to AVX machine traits

390fb15

Fixup makefile

b7a03a0

fix constants for msvc

7c4e36c

a74nh added 8 commits September 16, 2025 15:04

better vector constants

47f0c16

update makefile

106ca7e

Add neon to dummy file

2d9e290

Add bitonic_sort.NEON.uint32_t

e7a9fe9

Fix mask for msvc

c3f53e3

fix idx and maxv for msvc

be10c0a

Implement merge sorter

3bade96

Add small sorting type to machine traits

0aeb9f0

VSadov approved these changes Sep 20, 2025

a74nh removed the request for review from Maoni0 September 20, 2025 14:04

jkotas approved these changes Sep 23, 2025

VSadov merged commit 2ccc38b into dotnet:main Sep 23, 2025
99 checks passed

agocke approved these changes Sep 23, 2025