Conversation

@jakobbotsch
Member

@jakobbotsch jakobbotsch commented Nov 22, 2023

This switches arm32 to generate call-finally in the same way as other targets, with a call to the finally funclet, instead of loading a different return address. Loading a different return address confuses the hardware's return address predictor, which has a large negative perf impact.
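
For illustration, a rough sketch of the two code shapes (illustrative only: the labels are made up and this is not the exact emitted code; see the codegen diff below for the real output):

```asm
@ Old arm32 scheme (sketch): materialize the continuation address into lr
@ and branch to the funclet. The funclet "returns" through lr without a
@ matching call, so the return-address predictor mispredicts this return
@ and the returns further up the stack.
        movw    lr, #:lower16:Continuation
        movt    lr, #:upper16:Continuation
        b       FinallyFunclet
Continuation:
        @ ...

@ New scheme (same as other targets): a real call, so the hardware's
@ return-address predictor sees a matched call/return pair.
        bl      FinallyFunclet
        @ ...
```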

Two microbenchmarks from my rpi, both run with `DOTNET_JitEnableFinallyCloning=0`, show the cost of messing up return address prediction. The first runs a loop that calls a finally funclet on every iteration. The second sets up a deeper stack and calls a funclet, which means that the old scheme then mispredicts the return address for all subsequent returns. The second benchmark is over 4 times faster with this change.

    public static void Main()
    {
        for (int i = 0; i < 4; i++)
        {
            for (int j = 0; j < 100; j++)
            {
                Run(100);
                Call1();
            }

            Thread.Sleep(100);
        }

        Stopwatch timer = Stopwatch.StartNew();
        Run(100_000_000);
        timer.Stop();
        Console.WriteLine("Elapsed: {0}ms", timer.ElapsedMilliseconds);
        timer.Restart();
        for (int i = 0; i < 100_000_000; i++)
            Call1();
        timer.Stop();
        Console.WriteLine("Elapsed: {0}ms", timer.ElapsedMilliseconds);
    }

    public static long Run(int iters)
    {
        long sum = 0;
        for (int i = 0; i < iters; i++)
        {
            try
            {
                sum += i;
            }
            finally
            {
                sum += 2 * i;
            }
        }

        return sum;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call1() => Call2() + 1;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call2() => Call3() + 2;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call3() => Call4() + 3;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call4() => Call5() + 4;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call5() => Call6() + 5;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call6() => Call7() + 6;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call7() => Call8() + 7;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call8() => Call9() + 8;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call9() => Call10() + 9;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call10() => Call11() + 10;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call11() => Call12() + 11;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call12() => Call13() + 12;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call13() => Finally();

    public static int Finally()
    {
        int result = 10;
        try
        {
            result += 5;
        }
        finally
        {
            result += 10;
        }

        return result;
    }

Codegen diff: https://www.diffchecker.com/zKoHLm4v/

Output before:

```
Elapsed: 916ms
Elapsed: 18322ms
```

Output after:

```
Elapsed: 856ms
Elapsed: 4010ms
```

Fix #59453
Fix #66578

Not sure yet if this handles unwinding correctly, and if this can be made to work easily... let's see what CI (and @BruceForstall) says.
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 22, 2023
@ghost ghost assigned jakobbotsch Nov 22, 2023
@ghost

ghost commented Nov 22, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@jakobbotsch
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-coreclr gcstress0x3-gcstress0xc

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@jakobbotsch
Member Author

/azp run runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jakobbotsch
Member Author

cc @dotnet/jit-contrib PTAL @BruceForstall

Diffs

No arm32 failures in any of the test pipelines (except for #94652), so it seems like not much else is needed here.

@kunalspathak
Contributor

Nice cleanup. Changes LGTM

@jakobbotsch
Member Author

I found #59453 that covers this and noticed the FEATURE_EH_CALLFINALLY_THUNKS it talks about, so I pushed a change to enable that. Will rerun testing once innerloop passes.
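
For context, a rough sketch of what enabling the thunks is expected to mean for the code shape (illustrative only; the labels are invented and this is not actual JIT output):

```asm
@ With FEATURE_EH_CALLFINALLY_THUNKS, the call to the finally is generated in
@ a small "thunk" placed outside the try region, in the enclosing region, so
@ the reported try range does not need to include the call itself.
TryBegin:
        @ ... try body ...
        b       CallFinallyThunk
TryEnd:
CallFinallyThunk:                 @ lives in the enclosing region, not the try
        bl      FinallyFunclet
        b       Continuation
Continuation:
        @ ... code after the try/finally ...
```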

@jakobbotsch
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-coreclr gcstress0x3-gcstress0xc, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@BruceForstall
Contributor

BruceForstall commented Nov 26, 2023

I found #59453 that covers this and noticed the FEATURE_EH_CALLFINALLY_THUNKS it talks about, so I pushed a change to enable that. Will rerun testing once innerloop passes.

Can you update the diffchecker diffs linked at the top after this change? That was one thing I didn't see in the diff: I expected the "bl" to the finally to be in its own IG so the EH reporting could reference it.

In the diff that is there, there is a nop injected before the continuation (finally return target). I would think that would no longer be necessary.

Well, the nop might need to remain because the return-from-finally address needs to be in the same EH region as the call to the finally.

@BruceForstall
Contributor

Diffs

Oddly large TP regression in some cases. Probably due to FEATURE_EH_CALLFINALLY_THUNKS handling?

Huge size improvements.

Contributor

@BruceForstall BruceForstall left a comment

The changes look great to me. Love to see all this ifdef code disappear, especially if it's a net perf and code size win.

@BruceForstall
Contributor

cc @janvorli -- this PR changes how the JIT generates calls to finally clauses. fyi, in case you know of a reason the VM has a dependency on the current model (seems unlikely to me).

@jakobbotsch
Member Author

jakobbotsch commented Nov 26, 2023

Oddly large TP regression in some cases. Probably due to FEATURE_EH_CALLFINALLY_THUNKS handling?

libraries.pmi (the collection with the large TP regression) has only 6 MinOpts contexts, so the results aren't very telling. The particular context that regresses is for this one:
https://github.com/dotnet/fsharp/blob/b4dc66c3d3cd4b45614d2b1b903890c9c846895a/src/FSharp.Core/Linq.fs#L429-L641

We switch to minopts because the function has over 60 KB of IL. The function has only 7 EH clauses but it has 2500 basic blocks. The detailed TP diff for this single context is:

```
Base: 185069812, Diff: 185844906, +0.4188%

552076 : NA        : 51.77% : +0.2983% : public: struct BasicBlock * __thiscall Compiler::fgFindInsertPoint(unsigned int, bool, struct BasicBlock *, struct BasicBlock *, struct BasicBlock *, struct BasicBlock *, bool)
131965 : +4.94%    : 12.37% : +0.0713% : public: void __thiscall GcInfoEncoder::Build(void)                                                                                                                              
98691  : NA        : 9.25%  : +0.0533% : public: bool __thiscall BasicBlock::isBBCallAlwaysPair(void) const                                                                                                              
44208  : NA        : 4.15%  : +0.0239% : private: void * __thiscall emitter::emitAddLabel(unsigned int *const &, unsigned __int64, unsigned __int64)                                                                     
29708  : +3636.23% : 2.79%  : +0.0161% : private: void __thiscall CodeGen::genReportEH(void)                                                                                                                             
13135  : +1.65%    : 1.23%  : +0.0071% : private: void __thiscall LinearScan::setBlockSequence(void)                                                                                                                     
12784  : +43.06%   : 1.20%  : +0.0069% : public: bool __thiscall BasicBlock::isBBCallAlwaysPairTail(void) const                                                                                                          
11847  : +0.83%    : 1.11%  : +0.0064% : protected: void __thiscall CodeGen::genCodeForBBlist(void)                                                                                                                      
5164   : +0.13%    : 0.48%  : +0.0028% : protected: unsigned int __thiscall emitter::emitOutputInstr(struct insGroup *, struct emitter::instrDesc *, unsigned char **)                                                   
2645   : +8.26%    : 0.25%  : +0.0014% : private: void __thiscall CodeGen::genMarkLabelsForCodegen(void)                                                                                                                 
2472   : +0.02%    : 0.23%  : +0.0013% : public: void __thiscall LinearScan::allocateRegisters(void)                                                                                                                     
2423   : +1.42%    : 0.23%  : +0.0013% : public: void __thiscall emitter::emitIns_J(enum instruction, struct BasicBlock *, int)                                                                                          
1140   : +0.24%    : 0.11%  : +0.0006% : public: static void __stdcall BitSetOps<unsigned int *, 1, class Compiler *, class BasicBlockBitSetTraits>::ClearD(class Compiler *, unsigned int *&)                           
-1923  : -0.40%    : 0.18%  : -0.0010% : public: unsigned int __thiscall emitter::emitEndCodeGen(class Compiler *, bool, bool, bool, unsigned int, unsigned int *, unsigned int *, void **, void **, void **)            
-2316  : -0.72%    : 0.22%  : -0.0013% : protected: bool __thiscall Compiler::fgRemoveDeadBlocks(void)                                                                                                                   
-5768  : -0.70%    : 0.54%  : -0.0031% : private: void __thiscall emitter::emitJumpDistBind(void)                                                                                                                        
-6405  : -1.46%    : 0.60%  : -0.0035% : private: void __thiscall GcInfoEncoder::WriteSlotStateVector(class BitStreamWriter &, class BitArray const &)                                                                   
-8759  : -0.71%    : 0.82%  : -0.0047% : public: void __thiscall BitStreamWriter::Write(unsigned int, unsigned int)                                                                                                      
-10567 : -1.43%    : 0.99%  : -0.0057% : private: void __thiscall GcInfoEncoder::SizeofSlotStateVarLengthVector(class BitArray const &, unsigned int, unsigned int, unsigned int *, unsigned int *, unsigned int *)      
-29120 : -100.00%  : 2.73%  : -0.0157% : public: void __thiscall emitter::emitStackPop(unsigned char *, bool, unsigned char, unsigned int)                                                                               
-31879 : -26.86%   : 2.99%  : -0.0172% : Compiler::fgRemoveUnreachableBlocks<`Compiler::fgRemoveDeadBlocks'::`2'::<lambda_1> >                                                                                           
-44208 : -100.00%  : 4.15%  : -0.0239% : private: void * __thiscall emitter::emitAddLabel(unsigned int *const &, unsigned __int64, unsigned __int64, bool)
```

@jakobbotsch
Member Author

Can you update the diffchecker diffs linked at the top after this change? That was one thing I didn't see in the diff: I expected the "bl" to the finally to be in its own IG so the EH reporting could reference it.

In the diff that is there, there is a nop injected before the continuation (finally return target). I would think that would no longer be necessary.

I updated the diff -- indeed, the bl is in its own IG now, but the nop is still there.

@jakobbotsch jakobbotsch merged commit 76acd04 into dotnet:main Nov 26, 2023
@jakobbotsch jakobbotsch deleted the arm32-callfinally-bl branch November 26, 2023 20:05
@janvorli
Member

cc @janvorli -- this PR changes how the JIT generates calls to finally clauses. fyi, in case you know of a reason the VM has a dependency on the current model (seems unlikely to me).

@BruceForstall if I understand it correctly, this only influences non-exceptionally invoked finallys, right? So the difference is that instead of `mov lr, <continuation>; jmp <finally>` the JIT now uses `call <finally>; jmp <continuation>`?
I am not aware of the runtime depending on that in any way.

@BruceForstall
Contributor

@BruceForstall if I understand it correctly, this only influences non-exceptionally invoked finallys, right? So the difference is that instead of `mov lr, <continuation>; jmp <finally>` the JIT now uses `call <finally>; jmp <continuation>`?
I am not aware of the runtime depending on that in any way.

Correct. Thanks for considering.

I updated the diff -- indeed, the bl is in its own IG now, but the nop is still there.

We should probably track down why that nop is there. At first I thought maybe it's due to the return address of the call-to-finally bl needing to be in the same EH region as the bl, for some stack walk purpose. But I think the VM during stack walk will "back up" by one byte so the address in the bl frame will be considered to be "inside" the bl instruction itself, which would be the appropriate EH region.

BruceForstall added a commit to BruceForstall/runtime that referenced this pull request Dec 4, 2023
BruceForstall added a commit that referenced this pull request Dec 4, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Dec 28, 2023