NativeAOT compared with GraalVM/LLVM missing features. Any plans for them? #117454

ciplogic · 2025-07-09T06:54:00Z

Hello .NET team!

First of all, I appreciate all the great work on .NET in general and NativeAOT in particular.

My understanding (at least from how it is built today) is that NativeAOT works under a closed-world assumption. Looking from the outside (I have never read the codebases of .NET Core, GraalVM, or LLVM; I only know them through announcements), I see three areas where NativeAOT seems to be missing features:

1. Class Hierarchy Analysis: you can always devirtualize interfaces implemented once. It should definitely work better with a closed-world assumption. This sounds to me like a simple free optimization, as it will allow more opportunities for the inliner.
2. Greedy Register Allocator or similar. From what has been described, a better register allocation algorithm can give up to around a 10% speedup on the same machine:
https://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html
Or this:
https://llvm.org/devmtg/2011-11/Olesen_RegisterAllocation.pdf

To be clear, I am not suggesting that .NET must implement LLVM's register allocator. But an opt-in register allocator that is slower to run yet generates better code, enabled when optimizing for performance, sounds to me like a good win for everyone. In the future (say three years from now, once you are confident it has no bugs) it could become the default in NativeAOT.

3. Auto-Vectorization, at least for simple loops. This is the most complex item, and it might not offer much speedup given that at least some critical code is already hand-written with SIMD. Still, if there were a documented subset of code, perhaps opted into with an annotation on the method, that is guaranteed to be rewritten (even by the Roslyn compiler) into SIMD constructs, that would be amazing.

For me, the priority order is top to bottom as listed, because Class Hierarchy Analysis and automatic devirtualization mean that users would likely get free performance in large apps (I am thinking of something like Avalonia apps, which are 20-30 MB apiece, not including assets).

Thank you again for the work you have done!

Answered by huoyaoyuan

huoyaoyuan · 2025-07-09T07:20:31Z

Collaborator

> Class Hierarchy Analysis: you can always devirtualize interfaces implemented once. It should definitely work better with a closed-world assumption. This sounds to me like a simple free optimization, as it will allow more opportunities for the inliner.

This was implemented very early in NativeAOT's development. There have been follow-up improvement PRs such as #97867.
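
For readers unfamiliar with the pattern, here is a minimal sketch of the kind of code this optimization targets. The names `IShape`, `Circle`, and `TotalArea` are hypothetical and exist only for illustration. Since `Circle` is the only type in the whole program that implements `IShape`, a closed-world compiler can in principle resolve `shape.Area()` to `Circle.Area` and then consider inlining it; whether a particular build actually does so depends on the compiler version and settings.

```csharp
using System;

// Top-level statements: build a few shapes and sum their areas.
IShape[] shapes = { new Circle(1.0), new Circle(2.0), new Circle(3.0) };
Console.WriteLine(TotalArea(shapes));

// shape.Area() is an interface call in the source and in IL.
// With whole-program knowledge that Circle is the only IShape
// implementation, it is a candidate for devirtualization.
static double TotalArea(IShape[] shapes)
{
    double total = 0;
    foreach (IShape shape in shapes)
    {
        total += shape.Area();
    }
    return total;
}

// Hypothetical interface with exactly one implementation in the program.
interface IShape
{
    double Area();
}

// The single implementation; sealing it rules out subclasses as well.
sealed class Circle : IShape
{
    private readonly double _radius;

    public Circle(double radius) => _radius = radius;

    public double Area() => Math.PI * _radius * _radius;
}
```

The source code does not change in any way to benefit from this; the point of the sketch is only to show which call site is affected.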

> Greedy Register Allocator or similar.

This is not NativeAOT-specific. NativeAOT uses the same code generator and register allocator as the JIT. Improvements in either area are always welcome.

> Auto-Vectorization, at least for simple loops.

This is a very costly optimization with unknown benefit. In .NET we prefer to expose SIMD building blocks that simplify manual adoption (Vector256<T>, etc.), or primitive operations that are already SIMD-optimized (Span operations and TensorPrimitives) and remove the need to write loops.
This is also not NativeAOT-specific. Moreover, NativeAOT by default can only target the baseline instruction set, rather than every instruction available on the current machine as the JIT can, which leaves very little room for improvement.
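
As a rough illustration of what "manual adoption" of the SIMD building blocks looks like, here is a minimal sketch that sums an array with a scalar loop and with a hand-vectorized loop. It uses the width-agnostic `Vector<T>` from `System.Numerics` rather than `Vector256<T>` so that it runs on any hardware; the helper names `SumScalar` and `SumVectorized` are hypothetical, and `Vector.Sum` assumes .NET 6 or later.

```csharp
using System;
using System.Numerics;

int[] data = new int[1000];
for (int i = 0; i < data.Length; i++)
{
    data[i] = i + 1;
}

Console.WriteLine(SumScalar(data));
Console.WriteLine(SumVectorized(data));

// Plain scalar loop: one addition per element.
static int SumScalar(ReadOnlySpan<int> values)
{
    int sum = 0;
    foreach (int v in values)
    {
        sum += v;
    }
    return sum;
}

// Hand-vectorized loop: adds Vector<int>.Count elements per iteration,
// then finishes the remaining tail with scalar additions.
static int SumVectorized(ReadOnlySpan<int> values)
{
    int width = Vector<int>.Count;
    Vector<int> acc = Vector<int>.Zero;
    int i = 0;

    for (; i <= values.Length - width; i += width)
    {
        acc += new Vector<int>(values.Slice(i, width));
    }

    // Horizontal add of the lanes, then the scalar tail.
    int sum = Vector.Sum(acc);
    for (; i < values.Length; i++)
    {
        sum += values[i];
    }
    return sum;
}
```

Integer addition is used here so that both versions give bit-identical results; with floating-point data the vectorized version adds elements in a different order, which is one of the reasons an automatic rewrite is not always semantics-preserving.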

2 replies

ciplogic Jul 9, 2025
Author

IMHO the last two items (RA and SIMD) are NativeAOT topics: as I understand it, they hurt real-world startup time under a JIT, but AOT compilation can afford that trade-off.

I agree that for the 3rd item the impact is very likely small. My thinking was that if you write, for example, a "for" loop summing the data in an array, the tooling could suggest the LINQ version that is already SIMD-ed, and/or a rewriting step could make that substitution (at least in higher-performance configurations).

As for the register allocator, I have seen papers on allocators that increase compilation time a lot but might be worth it for some users (search Bing for "Register allocator using ant colony optimization"). As I understand it, such an allocator would never be worth integrating into the JIT except as dormant code.

The point is simply that, as far as I know, LSRA could be "traded" away by those who want the fastest app, whereas the slower alternative will never suffice for a JIT.

There are also a few implementations (of the LLVM RA) around that could perhaps be experimented with, like this one:
https://github.com/bytecodealliance/regalloc2

huoyaoyuan Jul 9, 2025
Collaborator

> I agree that for the 3rd item the impact is very likely small. My thinking was that if you write, for example, a "for" loop summing the data in an array, the tooling could suggest the LINQ version that is already SIMD-ed, and/or a rewriting step could make that substitution (at least in higher-performance configurations).

This has always been a hard problem. LINQ is FP-style code that is compiled into an OOP form, and it is not easy to optimize that compiled-down form. LINQ currently uses SIMD internally, but the base overhead of method invocation is still a problem.
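
To make the two forms under discussion concrete, here is a minimal sketch of a hand-written summing loop next to the LINQ call that a tool might suggest in its place. The variable names are hypothetical, and the sketch says nothing about which form is faster on a given runtime.

```csharp
using System;
using System.Linq;

int[] data = Enumerable.Range(1, 1000).ToArray();

// Hand-written loop: the form the question proposes to detect.
int loopSum = 0;
for (int i = 0; i < data.Length; i++)
{
    loopSum += data[i];
}

// LINQ form: a single library call that may use SIMD internally.
int linqSum = data.Sum();

Console.WriteLine(loopSum == linqSum);
```

Note that the two are not strictly equivalent: `Enumerable.Sum` over `int` throws an `OverflowException` on overflow, while the loop above wraps silently in the default unchecked context, so any automatic rewrite would have to account for that difference.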

> The point is simply that, as far as I know, LSRA could be "traded" away by those who want the fastest app, whereas the slower alternative will never suffice for a JIT.

With tiered compilation, the JIT can afford longer compilation times. However, the algorithm needs to demonstrate its benefit.

MichalStrehovsky · 2025-07-09T13:07:09Z

Collaborator

I wrote an article on some of the optimizations that are specific to native AOT: https://migeel.sk/blog/2023/11/22/top-3-whole-program-optimizations-for-aot-in-net-8/.

Of course, since native AOT uses the same code generator (RyuJIT) as the JIT-based CoreCLR version, native AOT also gets most of the optimizations that the JIT can do, minus dynamic optimizations that are based on program execution (things like dynamic PGO).

I don't have much else to add besides what Huo already covered above.

1 reply

ciplogic Jul 9, 2025
Author

Thank you, Michal! I did not know about this article, and it is great to have you here to explain it!
