JIT/AOT to interpreter calls support #116353

janvorli · 2025-06-05T22:33:22Z

We only had support for calling interpreted void returning method with no arguments by the JIT/AOT. This change enables support for passing all possible kinds of arguments and returning all types.

It reuses the CallStubGenerator that was implemented for the interpreter to native code usage and adds support for the other direction to it in mostly trivial manner.
In addition to that, assembler routines for storing argument values to the interpreter stack were added.
The CallStubGenerator generates a list of routines to copy the arguments from CPU registers and stack to the interpreter stack. The last one makes call to the ExecuteInterpretedMethod and then puts the result into appropriate registers.
For functions that return result via a return buffer, the buffer is passed to the ExecuteInterpretedMethod so that the IR opcode to return valuetype stores it directly to the return buffer.

The ARM64 for Apple OSes is the most optimized version, as it is the primary target where the performance matters the most. It eliminates argument registers saving to stack on the fast path when we already have the call stub.

Copilot

Pull Request Overview

This PR introduces comprehensive support for interpreter-to-JIT and JIT-to-interpreter calls, extending argument passing and return value handling to all kinds of method signatures. Key changes include new test methods in the interpreter tests, adjustments in call stub generation and assembly routines, and updates to thread-local and ABI-related constants.

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
src/tests/JIT/interpreter/Interpreter.cs	Added several new test methods for various calling conventions to exercise interpreter/JIT interop.
src/coreclr/vm/threads.*	Modified thread-local definitions and moved the interpreter thread context member from private to public.
src/coreclr/vm/prestub.cpp	Updated the signature and return value handling of ExecuteInterpretedMethod.
src/coreclr/vm/interpexec.cpp	Added CreateNativeToInterpreterCallStub with adjustments to generate call stubs and ensure thread context.
src/coreclr/vm/callstubgenerator.h	Extended the call stub generator with new routines and a parameter to distinguish call directions.
asmconstants.h, AsmHelpers.asm, and PAL assembly macros	Updated ABI constants and assembly routines to support the new calling conventions.
src/coreclr/interpreter/interpretershared.h	Updated the InterpMethod structure to include the call stub pointer.

Comments suppressed due to low confidence (1)

src/coreclr/vm/interpexec.cpp:66

[nitpick] The AllocMemTracker variable is redeclared here, shadowing the one declared on line 56. To improve clarity and avoid potential confusion, consider reusing the outer 'amTracker' instead of redeclaring it.

        AllocMemTracker amTracker;

dotnet-policy-service · 2025-06-05T22:34:08Z

Tagging subscribers to this area: @BrzVlad, @janvorli, @kg
See info in area-owners.md if you want to be subscribed.

src/coreclr/pal/inc/unixasmmacrosamd64.inc

jkotas · 2025-06-06T14:03:28Z

src/coreclr/vm/amd64/AsmHelpers.asm

+        ; Interpreter-TODO: consider not storing the argument registers in the PROLOG_WITH_TRANSITION_BLOCK and
+        ; store / restore them only around this call. We would still need to save RCX and RDX as they can contain
+        ; the return buffer address.
+        call            CreateNativeToInterpreterCallStub


I do not see how the arguments get reported to the GC while CreateNativeToInterpreterCallStub is executing. Is it a GC hole?

Would it be better to compile the transition helper in PreStubWorker? PreStubWorker gets called whenever we are about the execute a method from JIT/AOT code that is not compiled/prepared yet, and it has all the right infrastructure to protect arguments. It would be best to avoid duplicating it.

That's a good point and compiling it in the PreStubWorker would definitely be better.

kg · 2025-06-06T18:00:46Z

src/coreclr/vm/amd64/AsmHelpers.asm

+
+LEAF_ENTRY Store_XMM2_XMM3, _TEXT
+        movsd real8 ptr [r10], xmm2
+        movsd real8 ptr [r10 + 8], xmm3


aren't xmm registers 16 bytes?

I'm guessing what's going on here is there's an array of individual pointers at r10 for the location of the data for each xmm reg but wanted to make sure

In this case we are handling the XMM registers for the case where the 16 byte XMM register is just being used for an 8 byte floating point value. I'm a bit curious if this works well for floats as well as doubles, but the concept is sound. (I could see that perhaps our interpreter stack models all floating point values as doubles, which will introduce floating point differences from what the JIT does in certain cases, but it is pretty close.

More interestingly though I don't see handling for Vector64/Vector128 on Arm64 which should be treated as HFA values. I don't think this all needs to be handled before the PR is merged, but we should have it on our worklist.

The xmm registers are 16 bytes long, but for argument passing purposes, they can only carry either float or double arguments, so we use max 8 bytes out of it. See https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170.
For Unix, it is the same case. The calling convention would allow larger data types passed in all 16 bytes, but that's not used in .NET (In C, it can pass arguments of types __float128, _Decimal128 and __m128)

I would love to see one comment in here somewhere to the effect of 'we only use up to the first 8 bytes of the xmm registers due to the x64 calling convention' just so anyone looking at this can go 'oh, that makes sense' and look up the calling convention for more info (i.e. that vectors are passed by-reference)

Otherwise, looks good, thanks for the explanation.

davidwrighton

Overall, this looks good. I suspect we haven't fleshed out all the calling convention gremlins, but this should get most of it.

davidwrighton · 2025-06-06T18:53:04Z

src/coreclr/vm/amd64/AsmHelpers.asm

+
+LEAF_ENTRY Store_XMM2_XMM3, _TEXT
+        movsd real8 ptr [r10], xmm2
+        movsd real8 ptr [r10 + 8], xmm3


In this case we are handling the XMM registers for the case where the 16 byte XMM register is just being used for an 8 byte floating point value. I'm a bit curious if this works well for floats as well as doubles, but the concept is sound. (I could see that perhaps our interpreter stack models all floating point values as doubles, which will introduce floating point differences from what the JIT does in certain cases, but it is pretty close.

More interestingly though I don't see handling for Vector64/Vector128 on Arm64 which should be treated as HFA values. I don't think this all needs to be handled before the PR is merged, but we should have it on our worklist.

src/coreclr/interpreter/interpretershared.h

janvorli · 2025-06-06T19:05:34Z

I'm a bit curious if this works well for floats as well as doubles
@davidwrighton the interpreter distinguishes between floats and doubles, it knows what each slot contains. So in case the argument is float, we just write in 8 bytes, but the interpreter only ever uses the low 4 bytes.

janvorli · 2025-06-06T19:17:07Z

More interestingly though I don't see handling for Vector64/Vector128 on Arm64 which should be treated as HFA values.

Good point. I think that the only one that would not work is the Vector128. For that one, we will need to add routines for both storing and loading q0..q7. The Vector64 should be handled by the regular d0..d7 storing / loading routines.

src/coreclr/vm/prestub.cpp

src/coreclr/pal/inc/unixasmmacrosamd64.inc

src/coreclr/pal/inc/unixasmmacrosarm64.inc

Until now, we only had support for calling interpreted void returning method with no arguments by the JIT/AOT. This change enables support for passing all possible kinds of arguments and returning all types. It reuses the `CallStubGenerator` that was implemented for the interpreter to native code usage and adds support for the other direction to it in mostly trivial manner. In addition to that, assembler routines for storing argument values to the interpreter stack were added. The `CallStubGenerator` generates a list of routines to copy the arguments from CPU registers and stack to the interpreter stack. The last one makes call to the `ExecuteInterpretedMethod` and then puts the result into appropriate registers. For functions that return result via a return buffer, the buffer is passed to the `ExecuteInterpretedMethod` so that the IR opcode to return valuetype stores it directly to the return buffer. The ARM64 for Apple OSes is the most optimized version, as it is the primary target where the performance matters the most. It eliminates argument registers saving to stack on the fast path when we already have the call stub.

src/coreclr/pal/inc/unixasmmacrosamd64.inc

janvorli · 2025-06-09T20:29:17Z

@jkotas I guess I know why the OSX INLINE_GET_ALLOC_CONTEXT_BASE was using the same path as the emulated TLS. There is a known problem with OSX access to thread_local variables from assembler modules. I've learned that when adding access to the t_CurrentThreadInfo before. __thread marked variables work fine. So I had to switch the t_runtime_thread_locals to __thread too. Since it was just POD, it doesn't have any unexpected effects.
The reason why it doesn't work is that somehow the symbol name of the thread_local variables is not preserved and if you look at the compiled module, it is renamed to something like ttmp0 (I don't remember the exact name).

BrzVlad · 2025-06-10T06:48:21Z

src/coreclr/vm/interpexec.cpp

+        // the interpreter stub right after this call.
+        GetThread()->GetInterpThreadContext();
+
+        GCX_PREEMP();


aren't we already in preemptive mode here ?

Good catch, looks like a copy and paste issue. I'll remove the first GCX_PREEMP in a follow-up change.

am11 · 2025-06-10T07:50:09Z

@jkotas I guess I know why the OSX INLINE_GET_ALLOC_CONTEXT_BASE was using the same path as the emulated TLS. There is a known problem with OSX access to thread_local variables from assembler modules. I've learned that when adding access to the t_CurrentThreadInfo before. __thread marked variables work fine. So I had to switch the t_runtime_thread_locals to __thread too. Since it was just POD, it doesn't have any unexpected effects. The reason why it doesn't work is that somehow the symbol name of the thread_local variables is not preserved and if you look at the compiled module, it is renamed to something like ttmp0 (I don't remember the exact name).

cc @filipnavara (in case it's a bug on Apple's side)

filipnavara · 2025-06-10T08:00:10Z

Apple's implementation of thread_local is unlike any other platform. I think it even generates the actual accessors by the linker. It's specific to the C++ thread_local keyword which has some additional initialization requirements which __thread (and C11 _Thread_local) don't have. It's not a bug.

The CoreCLR code base was updated to thread_local quite recently (#114660) and it was followed by a bunch of reports of performance regressions specifically because it has different semantics and different performance characteristics.

I'm not in a position to propose not to use thread_local but I would definitely prefer if it was not used when not needed. The access from assembly is just one of those unfortunate consequences.

jkotas · 2025-06-10T23:39:46Z

The CoreCLR code base was updated to thread_local quite recently (#114660) and it was followed by a bunch of reports of performance regressions specifically because it has different semantics and different performance characteristics.

cc @AaronRobinsonMSFT It sounds like thread_local should be only used in code that's not perf critical.

AaronRobinsonMSFT · 2025-06-11T03:15:20Z

The CoreCLR code base was updated to thread_local quite recently (#114660) and it was followed by a bunch of reports of performance regressions specifically because it has different semantics and different performance characteristics.

cc @AaronRobinsonMSFT It sounds like thread_local should be only used in code that's not perf critical.

Given all the issues I've been seeing across performance and macOS implementation of thread_local, I think we should avoid it entirely. It doesn't seem to provide the assumed benefit and creates future work.

Related performance issue #115004

Reverting some changes in #116512

am11 · 2025-06-27T15:18:13Z

src/coreclr/vm/arm64/asmhelpers.S

+    stp x6, x7, [x9], #16
+    ldr x11, [x10], #8
+    EPILOG_BRANCH_REG x11
+LEAF_END Store_X0_X1_X2_X3_X4_X5_X6_X7


@janvorli, can we dedup these with macros? e.g.

.macro STORE_SEQ base_reg=0, count=1 .set reg, \base_reg .set remaining, \count // Construct entry name .macro regname start cnt name:req .set r, \start .set total, \cnt .set \name, Store_X\r .rept total - 1 .set r, r + 1 .set \name, \name\()_X\r .endr .endm regname \base_reg \count funcname LEAF_ENTRY \funcname // Emit store sequence .set reg, \base_reg .while remaining >= 2 stp x\reg, x\reg+1, [x9], #16 .set reg, reg + 2 .set remaining, remaining - 2 .endw .if remaining == 1 str x\reg, [x9], #8 .endif ldr x11, [x10], #8 EPILOG_BRANCH_REG x11 LEAF_END \funcname .endm .macro STORE_ALT_ENTRY base_reg=0, count=1 .set reg, \base_reg .set remaining, \count .macro altname start cnt name:req .set r, \start .set total, \cnt .set \name, Store_X\r .rept total - 1 .set r, r + 1 .set \name, \name\()_X\r .endr .endm altname \base_reg \count altname ALTERNATE_ENTRY \altname .while remaining >= 2 stp x\reg, x\reg+1, [x9], #16 .set reg, reg + 2 .set remaining, remaining - 2 .endw .if remaining == 1 str x\reg, [x9], #8 .endif .endm

Usage:

STORE_SEQ 0, 3 // Store_X0_X1_X2

It will make porting this to other archs a bit compact.

janvorli added this to the 10.0.0 milestone Jun 5, 2025

janvorli self-assigned this Jun 5, 2025

Copilot AI review requested due to automatic review settings June 5, 2025 22:33

janvorli added the area-CodeGen-Interpreter-coreclr label Jun 5, 2025

janvorli requested review from BrzVlad and kg as code owners June 5, 2025 22:33

Copilot AI reviewed Jun 5, 2025

View reviewed changes

janvorli force-pushed the jit-to-interpreter-calls branch from 8ed6861 to df8d547 Compare June 6, 2025 00:24

kg reviewed Jun 6, 2025

View reviewed changes

src/coreclr/pal/inc/unixasmmacrosamd64.inc Show resolved Hide resolved

build-analysis bot mentioned this pull request Jun 6, 2025

[android] Android.Device_Emulator.JIT.Test failing on emulators with CoreCLR #112633

Open

jkotas reviewed Jun 6, 2025

View reviewed changes

kg reviewed Jun 6, 2025

View reviewed changes

davidwrighton approved these changes Jun 6, 2025

View reviewed changes

build-analysis bot mentioned this pull request Jun 7, 2025

System.Diagnostics.Process.Tests failure on net10.0-linux-Debug-arm64-Mono_Minijit_Debug #110643

Open

jkotas reviewed Jun 7, 2025

View reviewed changes

src/coreclr/vm/prestub.cpp Outdated Show resolved Hide resolved

jkotas reviewed Jun 7, 2025

View reviewed changes

src/coreclr/pal/inc/unixasmmacrosamd64.inc Outdated Show resolved Hide resolved

jkotas reviewed Jun 7, 2025

View reviewed changes

src/coreclr/pal/inc/unixasmmacrosarm64.inc Outdated Show resolved Hide resolved

janvorli added 6 commits June 9, 2025 18:24

Fix m_pInterpThreadContext offset after rebase to main

12a27c0

Fix m_pInterpThreadContext for more targets

20be8e3

Move the CreateNativeToInterpreterCallStub to prestub

29be1cc

Fix build break and add a comment

99b9764

Create stub only when interp. method called from JIT/AOT code

dfadb1f

janvorli force-pushed the jit-to-interpreter-calls branch from 401bf92 to dfadb1f Compare June 9, 2025 17:03

Unify some asm macros

cdf1b0e

jkotas reviewed Jun 9, 2025

View reviewed changes

src/coreclr/pal/inc/unixasmmacrosamd64.inc Outdated Show resolved Hide resolved

Update src/coreclr/pal/inc/unixasmmacrosamd64.inc

3816fe2

Fix OSX x64 build

a367c51

davidwrighton mentioned this pull request Jun 9, 2025

Implementation of ldftn and calli #116449

Merged

janvorli merged commit dc8dd28 into dotnet:main Jun 9, 2025
102 checks passed

janvorli deleted the jit-to-interpreter-calls branch June 9, 2025 22:45

BrzVlad reviewed Jun 10, 2025

View reviewed changes

AaronRobinsonMSFT mentioned this pull request Jun 11, 2025

Revert partial changes for thread_local convergence. #116512

Merged

am11 reviewed Jun 27, 2025

View reviewed changes

github-actions bot locked and limited conversation to collaborators Jul 28, 2025

JIT/AOT to interpreter calls support #116353

JIT/AOT to interpreter calls support #116353

Uh oh!

Conversation

janvorli commented Jun 5, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

dotnet-policy-service bot commented Jun 5, 2025

Uh oh!

Uh oh!

jkotas Jun 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davidwrighton left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

janvorli commented Jun 6, 2025

Uh oh!

janvorli commented Jun 6, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

janvorli commented Jun 9, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

am11 commented Jun 10, 2025

Uh oh!

filipnavara commented Jun 10, 2025

Uh oh!

jkotas commented Jun 10, 2025

Uh oh!

AaronRobinsonMSFT commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

jkotas Jun 6, 2025 •

edited

Loading

AaronRobinsonMSFT commented Jun 11, 2025 •

edited

Loading