JIT: Switch arm32 call-finally to be similar to other targets #95117
Conversation
Maybe this doesn't handle unwinding properly?
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

This switches arm32 to generate call-finally in the same way as other targets, with a call to the finally funclet, instead of loading a different return address. Loading a different return address confuses the hardware's return address predictor, which has a large negative perf impact. Not sure yet if this handles unwinding correctly, or whether this can be made to work easily; let's see what CI (and @BruceForstall) says.

Codegen diff: https://www.diffchecker.com/w69PBgox/
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-coreclr gcstress0x3-gcstress0xc

Azure Pipelines successfully started running 3 pipeline(s).

/azp run runtime-coreclr outerloop

Azure Pipelines successfully started running 1 pipeline(s).
cc @dotnet/jit-contrib

PTAL @BruceForstall. No arm32 failures in any of the test pipelines (except for #94652), so it seems not much else is needed here.
Nice cleanup. Changes LGTM.
This switches arm32 to generate call-finally in the same way as other
targets, with a call to the finally funclet, instead of loading a
different return address. Loading a different return address confuses
the hardware's return address predictor, which has a large negative perf
impact.
Two micro benchmarks from my rpi, both run with
`DOTNET_JitEnableFinallyCloning=0` (finally cloning would otherwise let the
non-exceptional path skip the call to the funclet), show the cost of messing
up return address prediction. The first runs a loop that calls a finally
funclet on every iteration. The second sets up a deeper stack and calls
a funclet, which means that the old scheme then mispredicts the return
for all subsequent returns. The second benchmark is over 4 times faster
with this change.
```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;

// The benchmark methods below are assumed to live together in a single static class.
public static void Main()
{
for (int i = 0; i < 4; i++)
{
for (int j = 0; j < 100; j++)
{
Run(100);
Call1();
}
Thread.Sleep(100);
}
Stopwatch timer = Stopwatch.StartNew();
Run(100_000_000);
timer.Stop();
Console.WriteLine("Elapsed: {0}ms", timer.ElapsedMilliseconds);
timer.Restart();
for (int i = 0; i < 100_000_000; i++)
Call1();
timer.Stop();
Console.WriteLine("Elapsed: {0}ms", timer.ElapsedMilliseconds);
}
public static long Run(int iters)
{
long sum = 0;
for (int i = 0; i < iters; i++)
{
try
{
sum += i;
}
finally
{
sum += 2 * i;
}
}
return sum;
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Call1() => Call2() + 1;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Call2() => Call3() + 2;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Call3() => Call4() + 3;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Call4() => Call5() + 4;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Call5() => Call6() + 5;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Call6() => Call7() + 6;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Call7() => Call8() + 7;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Call8() => Call9() + 8;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Call9() => Call10() + 9;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Call10() => Call11() + 10;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Call11() => Call12() + 11;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Call12() => Call13() + 12;
[MethodImpl(MethodImplOptions.NoInlining)]
public static int Call13() => Finally();
public static int Finally()
{
int result = 10;
try
{
result += 5;
}
finally
{
result += 10;
}
return result;
}
```
Output before:
```scala
Elapsed: 916ms
Elapsed: 18322ms
```
Output after:
```scala
Elapsed: 856ms
Elapsed: 4010ms
```
Fix dotnet#59453
Fix dotnet#66578
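As a side note for anyone reproducing these numbers: the runtime reads `DOTNET_*` JIT configuration at process start, so `DOTNET_JitEnableFinallyCloning=0` has to be set in the environment of the benchmark process itself. A minimal C# launcher sketch (the benchmark executable path below is a placeholder, not part of this PR):

```csharp
using System;
using System.Diagnostics;

// Launch the benchmark in a child process with finally cloning disabled,
// so the non-exceptional path really calls the finally funclet.
var psi = new ProcessStartInfo("./Benchmark")   // placeholder path to the published benchmark
{
    UseShellExecute = false,
};
psi.Environment["DOTNET_JitEnableFinallyCloning"] = "0";

using var proc = Process.Start(psi)!;
proc.WaitForExit();
Console.WriteLine($"Benchmark exited with code {proc.ExitCode}");
```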
I found #59453 that covers this and noticed the

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, runtime-coreclr gcstress0x3-gcstress0xc, runtime-coreclr outerloop

Azure Pipelines successfully started running 4 pipeline(s).
Can you update the diffchecker diffs linked at the top after this change?

That was one thing I didn't see in the diff: I expected the "bl" to the finally to be in its own IG so the EH reporting could reference it. In the diff that is there, there is a

Well, the
Oddly large TP regression in some cases. Probably due to FEATURE_EH_CALLFINALLY_THUNKS handling? Huge size improvements.
BruceForstall left a comment
The changes look great to me. Love to see all this ifdef code disappear, especially if it's a net perf and code size win.
cc @janvorli -- this PR changes how the JIT generates calls to finally clauses. fyi, in case you know of a reason the VM has a dependency on the current model (seems unlikely to me).
libraries.pmi (the collection with the large TP regression) has only 6 MinOpts contexts, so the results aren't very telling. The particular context that regresses is this one: we switch to MinOpts because the function has over 60 KB of IL. The function has only 7 EH clauses, but it has 2500 basic blocks. The detailed TP diff for this single context is:

```
Base: 185069812, Diff: 185844906, +0.4188%

552076 : NA : 51.77% : +0.2983% : public: struct BasicBlock * __thiscall Compiler::fgFindInsertPoint(unsigned int, bool, struct BasicBlock *, struct BasicBlock *, struct BasicBlock *, struct BasicBlock *, bool)
131965 : +4.94% : 12.37% : +0.0713% : public: void __thiscall GcInfoEncoder::Build(void)
98691 : NA : 9.25% : +0.0533% : public: bool __thiscall BasicBlock::isBBCallAlwaysPair(void) const
44208 : NA : 4.15% : +0.0239% : private: void * __thiscall emitter::emitAddLabel(unsigned int *const &, unsigned __int64, unsigned __int64)
29708 : +3636.23% : 2.79% : +0.0161% : private: void __thiscall CodeGen::genReportEH(void)
13135 : +1.65% : 1.23% : +0.0071% : private: void __thiscall LinearScan::setBlockSequence(void)
12784 : +43.06% : 1.20% : +0.0069% : public: bool __thiscall BasicBlock::isBBCallAlwaysPairTail(void) const
11847 : +0.83% : 1.11% : +0.0064% : protected: void __thiscall CodeGen::genCodeForBBlist(void)
5164 : +0.13% : 0.48% : +0.0028% : protected: unsigned int __thiscall emitter::emitOutputInstr(struct insGroup *, struct emitter::instrDesc *, unsigned char **)
2645 : +8.26% : 0.25% : +0.0014% : private: void __thiscall CodeGen::genMarkLabelsForCodegen(void)
2472 : +0.02% : 0.23% : +0.0013% : public: void __thiscall LinearScan::allocateRegisters(void)
2423 : +1.42% : 0.23% : +0.0013% : public: void __thiscall emitter::emitIns_J(enum instruction, struct BasicBlock *, int)
1140 : +0.24% : 0.11% : +0.0006% : public: static void __stdcall BitSetOps<unsigned int *, 1, class Compiler *, class BasicBlockBitSetTraits>::ClearD(class Compiler *, unsigned int *&)
-1923 : -0.40% : 0.18% : -0.0010% : public: unsigned int __thiscall emitter::emitEndCodeGen(class Compiler *, bool, bool, bool, unsigned int, unsigned int *, unsigned int *, void **, void **, void **)
-2316 : -0.72% : 0.22% : -0.0013% : protected: bool __thiscall Compiler::fgRemoveDeadBlocks(void)
-5768 : -0.70% : 0.54% : -0.0031% : private: void __thiscall emitter::emitJumpDistBind(void)
-6405 : -1.46% : 0.60% : -0.0035% : private: void __thiscall GcInfoEncoder::WriteSlotStateVector(class BitStreamWriter &, class BitArray const &)
-8759 : -0.71% : 0.82% : -0.0047% : public: void __thiscall BitStreamWriter::Write(unsigned int, unsigned int)
-10567 : -1.43% : 0.99% : -0.0057% : private: void __thiscall GcInfoEncoder::SizeofSlotStateVarLengthVector(class BitArray const &, unsigned int, unsigned int, unsigned int *, unsigned int *, unsigned int *)
-29120 : -100.00% : 2.73% : -0.0157% : public: void __thiscall emitter::emitStackPop(unsigned char *, bool, unsigned char, unsigned int)
-31879 : -26.86% : 2.99% : -0.0172% : Compiler::fgRemoveUnreachableBlocks<`Compiler::fgRemoveDeadBlocks'::`2'::<lambda_1> >
-44208 : -100.00% : 4.15% : -0.0239% : private: void * __thiscall emitter::emitAddLabel(unsigned int *const &, unsigned __int64, unsigned __int64, bool)
```
I updated the diff -- indeed, the |
@BruceForstall if I understand it correctly, this only influences non-exceptionally invoked finallys, right? So the difference is that instead of |
Correct. Thanks for considering.
We should probably track down why that |
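To make the non-exceptional vs. exceptional distinction from the exchange above concrete, here is a small, illustrative C# sketch (the type and method names are hypothetical, not from this PR): when the try body completes normally, the finally is invoked non-exceptionally and the generated code calls the finally funclet (the call-finally this PR changes on arm32); when the try body throws, the finally runs during exception unwinding and is invoked by the runtime's EH dispatch instead, which this PR does not affect.

```csharp
using System;

public static class FinallyPaths
{
    public static int Compute(bool throwInsideTry)
    {
        int result = 0;
        try
        {
            if (throwInsideTry)
                throw new InvalidOperationException("take the exceptional path");
            result += 1; // try body completes normally
        }
        finally
        {
            // Non-exceptional path: the JIT-generated code in Compute calls this
            // finally funclet directly (the call-finally changed by this PR on arm32).
            // Exceptional path: the runtime invokes the same funclet while unwinding.
            result += 10;
        }
        return result;
    }

    public static void Main()
    {
        // Finally invoked non-exceptionally; prints 11.
        Console.WriteLine(Compute(throwInsideTry: false));

        try
        {
            Compute(throwInsideTry: true);
        }
        catch (InvalidOperationException)
        {
            // The finally already ran during unwinding before this catch executed.
            Console.WriteLine("caught after exceptional finally");
        }
    }
}
```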
Due to change in dotnet#95117
This switches arm32 to generate call-finally in the same way as other targets, with a call to the finally funclet, instead of loading a different return address.

Codegen diff: https://www.diffchecker.com/zKoHLm4v/

Fix #59453
Fix #66578