Conversation

@jakobbotsch
Member

@jakobbotsch jakobbotsch commented Nov 22, 2023

This switches arm32 to generate call-finally in the same way as other targets, with a call to the finally funclet, instead of loading a different return address. Loading a different return address confuses the hardware's return address predictor, which has a large negative perf impact.
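
For illustration, a rough sketch of the two code shapes (illustrative only: the labels are made up and this is not the exact emitted code; see the codegen diff below for the real output):

```asm
@ Old arm32 scheme (sketch): materialize the continuation address into lr
@ and branch to the funclet. The funclet "returns" through lr without a
@ matching call, so the return-address predictor mispredicts this return
@ and the returns further up the stack.
        movw    lr, #:lower16:Continuation
        movt    lr, #:upper16:Continuation
        b       FinallyFunclet
Continuation:
        @ ...

@ New scheme (same as other targets): a real call, so the hardware's
@ return-address predictor sees a matched call/return pair.
        bl      FinallyFunclet
        @ ...
```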

Two microbenchmarks from my rpi, both run with `DOTNET_JitEnableFinallyCloning=0`, show the cost of messing up return address prediction. The first runs a loop that calls a finally funclet on every iteration. The second sets up a deeper stack and calls a funclet, which means that the old scheme then mispredicts the return address for all subsequent returns. The second benchmark is over 4 times faster with this change.

    public static void Main()
    {
        for (int i = 0; i < 4; i++)
        {
            for (int j = 0; j < 100; j++)
            {
                Run(100);
                Call1();
            }

            Thread.Sleep(100);
        }

        Stopwatch timer = Stopwatch.StartNew();
        Run(100_000_000);
        timer.Stop();
        Console.WriteLine("Elapsed: {0}ms", timer.ElapsedMilliseconds);
        timer.Restart();
        for (int i = 0; i < 100_000_000; i++)
            Call1();
        timer.Stop();
        Console.WriteLine("Elapsed: {0}ms", timer.ElapsedMilliseconds);
    }

    public static long Run(int iters)
    {
        long sum = 0;
        for (int i = 0; i < iters; i++)
        {
            try
            {
                sum += i;
            }
            finally
            {
                sum += 2 * i;
            }
        }

        return sum;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call1() => Call2() + 1;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call2() => Call3() + 2;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call3() => Call4() + 3;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call4() => Call5() + 4;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call5() => Call6() + 5;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call6() => Call7() + 6;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call7() => Call8() + 7;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call8() => Call9() + 8;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call9() => Call10() + 9;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call10() => Call11() + 10;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call11() => Call12() + 11;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call12() => Call13() + 12;
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Call13() => Finally();

    public static int Finally()
    {
        int result = 10;
        try
        {
            result += 5;
        }
        finally
        {
            result += 10;
        }

        return result;
    }

Codegen diff: https://www.diffchecker.com/zKoHLm4v/

Output before:

```
Elapsed: 916ms
Elapsed: 18322ms
```

Output after:

```
Elapsed: 856ms
Elapsed: 4010ms
```

Fix #59453
Fix #66578

Not sure yet if this handles unwinding correctly, and if this can be made to work easily... let's see what CI (and @BruceForstall) says.
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 22, 2023
@ghost ghost assigned jakobbotsch Nov 22, 2023
@ghost

ghost commented Nov 22, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@jakobbotsch
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-coreclr gcstress0x3-gcstress0xc

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@jakobbotsch
Member Author

/azp run runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jakobbotsch
Member Author

cc @dotnet/jit-contrib PTAL @BruceForstall

Diffs

No arm32 failures in any of the test pipelines (except for #94652), so it seems like not much else is needed here.

@kunalspathak
Contributor

Nice cleanup. Changes LGTM

@jakobbotsch
Member Author

I found #59453 that covers this and noticed the FEATURE_EH_CALLFINALLY_THUNKS it talks about, so I pushed a change to enable that. Will rerun testing once innerloop passes.
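
For context, a rough sketch of what enabling the thunks is expected to mean for the code shape (illustrative only; the labels are invented and this is not actual JIT output):

```asm
@ With FEATURE_EH_CALLFINALLY_THUNKS, the call to the finally is generated in
@ a small "thunk" placed outside the try region, in the enclosing region, so
@ the reported try range does not need to include the call itself.
TryBegin:
        @ ... try body ...
        b       CallFinallyThunk
TryEnd:
CallFinallyThunk:                 @ lives in the enclosing region, not the try
        bl      FinallyFunclet
        b       Continuation
Continuation:
        @ ... code after the try/finally ...
```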

@jakobbotsch
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-coreclr gcstress0x3-gcstress0xc, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@BruceForstall
Contributor

BruceForstall commented Nov 26, 2023

I found #59453 that covers this and noticed the FEATURE_EH_CALLFINALLY_THUNKS it talks about, so I pushed a change to enable that. Will rerun testing once innerloop passes.

Can you update the diffchecker diffs linked at the top after this change? That was one thing I didn't see in the diff: I expected the "bl" to the finally to be in its own IG so the EH reporting could reference it.

In the diff that is there, there is a nop injected before the continuation (finally return target). I would think that would no longer be necessary.

Well, the nop might need to remain because the return-from-finally address needs to be in the same EH region as the call to the finally.

@BruceForstall
Contributor

Diffs

Oddly large TP regression in some cases. Probably due to FEATURE_EH_CALLFINALLY_THUNKS handling?

Huge size improvements.

Contributor

@BruceForstall BruceForstall left a comment

The changes look great to me. Love to see all this ifdef code disappear, especially if it's a net perf and code size win.

@BruceForstall
Contributor

cc @janvorli -- this PR changes how the JIT generates calls to finally clauses. fyi, in case you know of a reason the VM has a dependency on the current model (seems unlikely to me).

@jakobbotsch
Member Author

jakobbotsch commented Nov 26, 2023

Oddly large TP regression in some cases. Probably due to FEATURE_EH_CALLFINALLY_THUNKS handling?

libraries.pmi (the collection with the large TP regression) has only 6 MinOpts contexts, so the results aren't very telling. The particular context that regresses is for this one:
https://github.com/dotnet/fsharp/blob/b4dc66c3d3cd4b45614d2b1b903890c9c846895a/src/FSharp.Core/Linq.fs#L429-L641

We switch to minopts because the function has over 60 KB of IL. The function has only 7 EH clauses but it has 2500 basic blocks. The detailed TP diff for this single context is:

```
Base: 185069812, Diff: 185844906, +0.4188%

552076 : NA        : 51.77% : +0.2983% : public: struct BasicBlock * __thiscall Compiler::fgFindInsertPoint(unsigned int, bool, struct BasicBlock *, struct BasicBlock *, struct BasicBlock *, struct BasicBlock *, bool)
131965 : +4.94%    : 12.37% : +0.0713% : public: void __thiscall GcInfoEncoder::Build(void)                                                                                                                              
98691  : NA        : 9.25%  : +0.0533% : public: bool __thiscall BasicBlock::isBBCallAlwaysPair(void) const                                                                                                              
44208  : NA        : 4.15%  : +0.0239% : private: void * __thiscall emitter::emitAddLabel(unsigned int *const &, unsigned __int64, unsigned __int64)                                                                     
29708  : +3636.23% : 2.79%  : +0.0161% : private: void __thiscall CodeGen::genReportEH(void)                                                                                                                             
13135  : +1.65%    : 1.23%  : +0.0071% : private: void __thiscall LinearScan::setBlockSequence(void)                                                                                                                     
12784  : +43.06%   : 1.20%  : +0.0069% : public: bool __thiscall BasicBlock::isBBCallAlwaysPairTail(void) const                                                                                                          
11847  : +0.83%    : 1.11%  : +0.0064% : protected: void __thiscall CodeGen::genCodeForBBlist(void)                                                                                                                      
5164   : +0.13%    : 0.48%  : +0.0028% : protected: unsigned int __thiscall emitter::emitOutputInstr(struct insGroup *, struct emitter::instrDesc *, unsigned char **)                                                   
2645   : +8.26%    : 0.25%  : +0.0014% : private: void __thiscall CodeGen::genMarkLabelsForCodegen(void)                                                                                                                 
2472   : +0.02%    : 0.23%  : +0.0013% : public: void __thiscall LinearScan::allocateRegisters(void)                                                                                                                     
2423   : +1.42%    : 0.23%  : +0.0013% : public: void __thiscall emitter::emitIns_J(enum instruction, struct BasicBlock *, int)                                                                                          
1140   : +0.24%    : 0.11%  : +0.0006% : public: static void __stdcall BitSetOps<unsigned int *, 1, class Compiler *, class BasicBlockBitSetTraits>::ClearD(class Compiler *, unsigned int *&)                           
-1923  : -0.40%    : 0.18%  : -0.0010% : public: unsigned int __thiscall emitter::emitEndCodeGen(class Compiler *, bool, bool, bool, unsigned int, unsigned int *, unsigned int *, void **, void **, void **)            
-2316  : -0.72%    : 0.22%  : -0.0013% : protected: bool __thiscall Compiler::fgRemoveDeadBlocks(void)                                                                                                                   
-5768  : -0.70%    : 0.54%  : -0.0031% : private: void __thiscall emitter::emitJumpDistBind(void)                                                                                                                        
-6405  : -1.46%    : 0.60%  : -0.0035% : private: void __thiscall GcInfoEncoder::WriteSlotStateVector(class BitStreamWriter &, class BitArray const &)                                                                   
-8759  : -0.71%    : 0.82%  : -0.0047% : public: void __thiscall BitStreamWriter::Write(unsigned int, unsigned int)                                                                                                      
-10567 : -1.43%    : 0.99%  : -0.0057% : private: void __thiscall GcInfoEncoder::SizeofSlotStateVarLengthVector(class BitArray const &, unsigned int, unsigned int, unsigned int *, unsigned int *, unsigned int *)      
-29120 : -100.00%  : 2.73%  : -0.0157% : public: void __thiscall emitter::emitStackPop(unsigned char *, bool, unsigned char, unsigned int)                                                                               
-31879 : -26.86%   : 2.99%  : -0.0172% : Compiler::fgRemoveUnreachableBlocks<`Compiler::fgRemoveDeadBlocks'::`2'::<lambda_1> >                                                                                           
-44208 : -100.00%  : 4.15%  : -0.0239% : private: void * __thiscall emitter::emitAddLabel(unsigned int *const &, unsigned __int64, unsigned __int64, bool)
```

@jakobbotsch
Member Author

Can you update the diffchecker diffs linked at the top after this change? That was one thing I didn't see in the diff: I expected the "bl" to the finally to be in its own IG so the EH reporting could reference it.

In the diff that is there, there is a nop injected before the continuation (finally return target). I would think that would no longer be necessary.

I updated the diff -- indeed, the bl is in its own IG now, but the nop is still there.

@jakobbotsch jakobbotsch merged commit 76acd04 into dotnet:main Nov 26, 2023
@jakobbotsch jakobbotsch deleted the arm32-callfinally-bl branch November 26, 2023 20:05
@janvorli
Member

cc @janvorli -- this PR changes how the JIT generates calls to finally clauses. fyi, in case you know of a reason the VM has a dependency on the current model (seems unlikely to me).

@BruceForstall if I understand it correctly, this only influences non-exceptionally invoked finallys, right? So the difference is that instead of `mov lr, <continuation>; jmp <finally>` the JIT now uses `call <finally>; jmp <continuation>`?
I am not aware of the runtime depending on that in any way.

@BruceForstall
Contributor

@BruceForstall if I understand it correctly, this only influences non-exceptionally invoked finallys, right? So the difference is that instead of `mov lr, <continuation>; jmp <finally>` the JIT now uses `call <finally>; jmp <continuation>`?
I am not aware of the runtime depending on that in any way.

Correct. Thanks for considering.

I updated the diff -- indeed, the bl is in its own IG now, but the nop is still there.

We should probably track down why that nop is there. At first I thought maybe it's due to the return address of the call-to-finally bl needing to be in the same EH region as the bl, for some stack walk purpose. But I think the VM during stack walk will "back up" by one byte so the address in the bl frame will be considered to be "inside" the bl instruction itself, which would be the appropriate EH region.

BruceForstall added a commit to BruceForstall/runtime that referenced this pull request Dec 4, 2023
BruceForstall added a commit that referenced this pull request Dec 4, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Dec 28, 2023