Skip to content

Conversation

@janvorli
Copy link
Member

@janvorli janvorli commented Jun 5, 2025

We only had support for calling interpreted void returning method with no arguments by the JIT/AOT. This change enables support for passing all possible kinds of arguments and returning all types.

It reuses the CallStubGenerator that was implemented for the interpreter to native code usage and adds support for the other direction to it in mostly trivial manner.
In addition to that, assembler routines for storing argument values to the interpreter stack were added.
The CallStubGenerator generates a list of routines to copy the arguments from CPU registers and stack to the interpreter stack. The last one makes call to the ExecuteInterpretedMethod and then puts the result into appropriate registers.
For functions that return result via a return buffer, the buffer is passed to the ExecuteInterpretedMethod so that the IR opcode to return valuetype stores it directly to the return buffer.

The ARM64 for Apple OSes is the most optimized version, as it is the primary target where the performance matters the most. It eliminates argument registers saving to stack on the fast path when we already have the call stub.

@janvorli janvorli added this to the 10.0.0 milestone Jun 5, 2025
@janvorli janvorli self-assigned this Jun 5, 2025
Copilot AI review requested due to automatic review settings June 5, 2025 22:33
@janvorli janvorli requested review from BrzVlad and kg as code owners June 5, 2025 22:33
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces comprehensive support for interpreter-to-JIT and JIT-to-interpreter calls, extending argument passing and return value handling to all kinds of method signatures. Key changes include new test methods in the interpreter tests, adjustments in call stub generation and assembly routines, and updates to thread-local and ABI-related constants.

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/tests/JIT/interpreter/Interpreter.cs Added several new test methods for various calling conventions to exercise interpreter/JIT interop.
src/coreclr/vm/threads.* Modified thread-local definitions and moved the interpreter thread context member from private to public.
src/coreclr/vm/prestub.cpp Updated the signature and return value handling of ExecuteInterpretedMethod.
src/coreclr/vm/interpexec.cpp Added CreateNativeToInterpreterCallStub with adjustments to generate call stubs and ensure thread context.
src/coreclr/vm/callstubgenerator.h Extended the call stub generator with new routines and a parameter to distinguish call directions.
asmconstants.h, AsmHelpers.asm, and PAL assembly macros Updated ABI constants and assembly routines to support the new calling conventions.
src/coreclr/interpreter/interpretershared.h Updated the InterpMethod structure to include the call stub pointer.
Comments suppressed due to low confidence (1)

src/coreclr/vm/interpexec.cpp:66

  • [nitpick] The AllocMemTracker variable is redeclared here, shadowing the one declared on line 56. To improve clarity and avoid potential confusion, consider reusing the outer 'amTracker' instead of redeclaring it.
        AllocMemTracker amTracker;

@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @BrzVlad, @janvorli, @kg
See info in area-owners.md if you want to be subscribed.

@janvorli janvorli force-pushed the jit-to-interpreter-calls branch from 8ed6861 to df8d547 Compare June 6, 2025 00:24
; Interpreter-TODO: consider not storing the argument registers in the PROLOG_WITH_TRANSITION_BLOCK and
; store / restore them only around this call. We would still need to save RCX and RDX as they can contain
; the return buffer address.
call CreateNativeToInterpreterCallStub
Copy link
Member

@jkotas jkotas Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see how the arguments get reported to the GC while CreateNativeToInterpreterCallStub is executing. Is it a GC hole?

Would it be better to compile the transition helper in PreStubWorker? PreStubWorker gets called whenever we are about the execute a method from JIT/AOT code that is not compiled/prepared yet, and it has all the right infrastructure to protect arguments. It would be best to avoid duplicating it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point and compiling it in the PreStubWorker would definitely be better.


LEAF_ENTRY Store_XMM2_XMM3, _TEXT
movsd real8 ptr [r10], xmm2
movsd real8 ptr [r10 + 8], xmm3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aren't xmm registers 16 bytes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing what's going on here is there's an array of individual pointers at r10 for the location of the data for each xmm reg but wanted to make sure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we are handling the XMM registers for the case where the 16 byte XMM register is just being used for an 8 byte floating point value. I'm a bit curious if this works well for floats as well as doubles, but the concept is sound. (I could see that perhaps our interpreter stack models all floating point values as doubles, which will introduce floating point differences from what the JIT does in certain cases, but it is pretty close.

More interestingly though I don't see handling for Vector64/Vector128 on Arm64 which should be treated as HFA values. I don't think this all needs to be handled before the PR is merged, but we should have it on our worklist.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The xmm registers are 16 bytes long, but for argument passing purposes, they can only carry either float or double arguments, so we use max 8 bytes out of it. See https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170.
For Unix, it is the same case. The calling convention would allow larger data types passed in all 16 bytes, but that's not used in .NET (In C, it can pass arguments of types __float128, _Decimal128 and __m128)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to see one comment in here somewhere to the effect of 'we only use up to the first 8 bytes of the xmm registers due to the x64 calling convention' just so anyone looking at this can go 'oh, that makes sense' and look up the calling convention for more info (i.e. that vectors are passed by-reference)

Otherwise, looks good, thanks for the explanation.

Copy link
Member

@davidwrighton davidwrighton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks good. I suspect we haven't fleshed out all the calling convention gremlins, but this should get most of it.


LEAF_ENTRY Store_XMM2_XMM3, _TEXT
movsd real8 ptr [r10], xmm2
movsd real8 ptr [r10 + 8], xmm3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we are handling the XMM registers for the case where the 16 byte XMM register is just being used for an 8 byte floating point value. I'm a bit curious if this works well for floats as well as doubles, but the concept is sound. (I could see that perhaps our interpreter stack models all floating point values as doubles, which will introduce floating point differences from what the JIT does in certain cases, but it is pretty close.

More interestingly though I don't see handling for Vector64/Vector128 on Arm64 which should be treated as HFA values. I don't think this all needs to be handled before the PR is merged, but we should have it on our worklist.

@janvorli
Copy link
Member Author

janvorli commented Jun 6, 2025

I'm a bit curious if this works well for floats as well as doubles
@davidwrighton the interpreter distinguishes between floats and doubles, it knows what each slot contains. So in case the argument is float, we just write in 8 bytes, but the interpreter only ever uses the low 4 bytes.

@janvorli
Copy link
Member Author

janvorli commented Jun 6, 2025

More interestingly though I don't see handling for Vector64/Vector128 on Arm64 which should be treated as HFA values.

Good point. I think that the only one that would not work is the Vector128. For that one, we will need to add routines for both storing and loading q0..q7. The Vector64 should be handled by the regular d0..d7 storing / loading routines.

janvorli added 6 commits June 9, 2025 18:24
Until now, we only had support for calling interpreted void returning
method with no arguments by the JIT/AOT. This change enables support for
passing all possible kinds of arguments and returning all types.

It reuses the `CallStubGenerator` that was implemented for the interpreter
to native code usage and adds support for the other direction to it in
mostly trivial manner.
In addition to that, assembler routines for storing argument values to
the interpreter stack were added.
The `CallStubGenerator` generates a list of routines to copy the
arguments from CPU registers and stack to the interpreter stack.
The last one makes call to the `ExecuteInterpretedMethod` and then puts
the result into appropriate registers.
For functions that return result via a return buffer, the buffer is
passed to the `ExecuteInterpretedMethod` so that the IR opcode to return
valuetype stores it directly to the return buffer.

The ARM64 for Apple OSes is the most optimized version, as it is the
primary target where the performance matters the most. It eliminates
argument registers saving to stack on the fast path when we already have
the call stub.
@janvorli janvorli force-pushed the jit-to-interpreter-calls branch from 401bf92 to dfadb1f Compare June 9, 2025 17:03
@janvorli
Copy link
Member Author

janvorli commented Jun 9, 2025

@jkotas I guess I know why the OSX INLINE_GET_ALLOC_CONTEXT_BASE was using the same path as the emulated TLS. There is a known problem with OSX access to thread_local variables from assembler modules. I've learned that when adding access to the t_CurrentThreadInfo before. __thread marked variables work fine. So I had to switch the t_runtime_thread_locals to __thread too. Since it was just POD, it doesn't have any unexpected effects.
The reason why it doesn't work is that somehow the symbol name of the thread_local variables is not preserved and if you look at the compiled module, it is renamed to something like ttmp0 (I don't remember the exact name).

@janvorli janvorli merged commit dc8dd28 into dotnet:main Jun 9, 2025
102 checks passed
@janvorli janvorli deleted the jit-to-interpreter-calls branch June 9, 2025 22:45
// the interpreter stub right after this call.
GetThread()->GetInterpThreadContext();

GCX_PREEMP();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aren't we already in preemptive mode here ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, looks like a copy and paste issue. I'll remove the first GCX_PREEMP in a follow-up change.

@am11
Copy link
Member

am11 commented Jun 10, 2025

@jkotas I guess I know why the OSX INLINE_GET_ALLOC_CONTEXT_BASE was using the same path as the emulated TLS. There is a known problem with OSX access to thread_local variables from assembler modules. I've learned that when adding access to the t_CurrentThreadInfo before. __thread marked variables work fine. So I had to switch the t_runtime_thread_locals to __thread too. Since it was just POD, it doesn't have any unexpected effects. The reason why it doesn't work is that somehow the symbol name of the thread_local variables is not preserved and if you look at the compiled module, it is renamed to something like ttmp0 (I don't remember the exact name).

cc @filipnavara (in case it's a bug on Apple's side)

@filipnavara
Copy link
Member

Apple's implementation of thread_local is unlike any other platform. I think it even generates the actual accessors by the linker. It's specific to the C++ thread_local keyword which has some additional initialization requirements which __thread (and C11 _Thread_local) don't have. It's not a bug.

The CoreCLR code base was updated to thread_local quite recently (#114660) and it was followed by a bunch of reports of performance regressions specifically because it has different semantics and different performance characteristics.

I'm not in a position to propose not to use thread_local but I would definitely prefer if it was not used when not needed. The access from assembly is just one of those unfortunate consequences.

@jkotas
Copy link
Member

jkotas commented Jun 10, 2025

The CoreCLR code base was updated to thread_local quite recently (#114660) and it was followed by a bunch of reports of performance regressions specifically because it has different semantics and different performance characteristics.

cc @AaronRobinsonMSFT It sounds like thread_local should be only used in code that's not perf critical.

@AaronRobinsonMSFT
Copy link
Member

AaronRobinsonMSFT commented Jun 11, 2025

The CoreCLR code base was updated to thread_local quite recently (#114660) and it was followed by a bunch of reports of performance regressions specifically because it has different semantics and different performance characteristics.

cc @AaronRobinsonMSFT It sounds like thread_local should be only used in code that's not perf critical.

Given all the issues I've been seeing across performance and macOS implementation of thread_local, I think we should avoid it entirely. It doesn't seem to provide the assumed benefit and creates future work.

Related performance issue #115004

Reverting some changes in #116512

stp x6, x7, [x9], #16
ldr x11, [x10], #8
EPILOG_BRANCH_REG x11
LEAF_END Store_X0_X1_X2_X3_X4_X5_X6_X7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@janvorli, can we dedup these with macros? e.g.

.macro STORE_SEQ base_reg=0, count=1

    .set reg, \base_reg
    .set remaining, \count

    // Construct entry name
    .macro regname start cnt name:req
        .set r, \start
        .set total, \cnt
        .set \name, Store_X\r
        .rept total - 1
            .set r, r + 1
            .set \name, \name\()_X\r
        .endr
    .endm

    regname \base_reg \count funcname
    LEAF_ENTRY \funcname

    // Emit store sequence
    .set reg, \base_reg
    .while remaining >= 2
        stp x\reg, x\reg+1, [x9], #16
        .set reg, reg + 2
        .set remaining, remaining - 2
    .endw

    .if remaining == 1
        str x\reg, [x9], #8
    .endif

    ldr x11, [x10], #8
    EPILOG_BRANCH_REG x11
    LEAF_END \funcname

.endm

.macro STORE_ALT_ENTRY base_reg=0, count=1
    .set reg, \base_reg
    .set remaining, \count
    .macro altname start cnt name:req
        .set r, \start
        .set total, \cnt
        .set \name, Store_X\r
        .rept total - 1
            .set r, r + 1
            .set \name, \name\()_X\r
        .endr
    .endm

    altname \base_reg \count altname
    ALTERNATE_ENTRY \altname

    .while remaining >= 2
        stp x\reg, x\reg+1, [x9], #16
        .set reg, reg + 2
        .set remaining, remaining - 2
    .endw
    .if remaining == 1
        str x\reg, [x9], #8
    .endif

.endm

Usage:

    STORE_SEQ 0, 3       // Store_X0_X1_X2

It will make porting this to other archs a bit compact.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 28, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants