JIT: Add a disabled-by-default loop peeling phase #97517
Factor the loop duplication code out of loop cloning and loop unrolling in anticipation of also using it in loop peeling.
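The shared duplication step can be shown on a toy control-flow graph. The sketch below is a hypothetical illustration, not the JIT's code. `Block` and `duplicateLoopBlocks` are invented names, blocks are identified by their index in a vector, and the real implementation works on the JIT's flowgraph with much more bookkeeping. It shows the general idea: every loop block gets a copy, edges that stay inside the loop are redirected to the copies, and edges that leave the loop keep their original targets. Each caller (cloning, unrolling, peeling) would then rewire the entry and back edges in its own way.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical toy block: its id equals its index in the block vector,
// and `succs` lists the ids of its successor blocks.
struct Block {
    int id;
    std::vector<int> succs;
};

// Appends a copy of every block in `loopBlocks` to `blocks`.
// Successor edges that target another loop block are redirected to that
// block's copy; edges that exit the loop keep their original target.
// Returns a map from each original block id to the id of its copy.
std::map<int, int> duplicateLoopBlocks(std::vector<Block>& blocks,
                                       const std::vector<int>& loopBlocks) {
    std::map<int, int> copyOf;
    int nextId = static_cast<int>(blocks.size());
    for (int id : loopBlocks) {
        copyOf[id] = nextId++;  // copy ids are assigned in the order copies are appended
    }

    for (int id : loopBlocks) {
        Block copy{copyOf[id], {}};
        for (int succ : blocks[id].succs) {
            auto it = copyOf.find(succ);
            copy.succs.push_back(it != copyOf.end() ? it->second : succ);
        }
        // Invariant of this toy model: a block's id is its index in `blocks`.
        assert(copy.id == static_cast<int>(blocks.size()));
        blocks.push_back(std::move(copy));
    }
    return copyOf;
}
```

For a loop made of a header `1` (successors `2` and exit `3`) and a body `2` (back edge to `1`), the copies are `4 -> {5, 3}` and `5 -> {4}`. To peel one iteration, a caller would then point the preheader at `4` and retarget the copy's back edge from `4` to the original header `1`.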
Add a phase that peels loops by duplicating their loop body once. No heuristics are yet included, so this is only going to be enabled under stress for now.
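At the source level, peeling a loop once amounts to placing a guarded copy of one iteration in front of the original loop. The sketch below is only a C++ analogy under that assumption: the JIT phase operates on its flowgraph rather than on source code, and `sumLoop` and `sumLoopPeeled` are hypothetical examples. Both functions compute the same result for every input; the peeled version simply executes the first iteration outside the loop.

```cpp
#include <cstddef>
#include <vector>

// Original form: a plain while loop.
int sumLoop(const std::vector<int>& v) {
    int sum = 0;
    std::size_t i = 0;
    while (i < v.size()) {
        sum += v[i];
        ++i;
    }
    return sum;
}

// The same loop peeled once: the loop test and body are duplicated ahead
// of the loop, so the first iteration runs outside it. The original loop
// then handles any remaining iterations.
int sumLoopPeeled(const std::vector<int>& v) {
    int sum = 0;
    std::size_t i = 0;
    if (i < v.size()) {         // copy of the loop test
        sum += v[i];            // copy of the loop body (the peeled iteration)
        ++i;
        while (i < v.size()) {  // original loop, entered after the peeled iteration
            sum += v[i];
            ++i;
        }
    }
    return sum;
}
```

Peeling by itself removes no work and increases code size; it usually pays off only when later optimizations can exploit the fact that the first iteration has already run.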
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue details: Add a phase that peels loops by duplicating their loop body once. No heuristics are yet included, so this is only going to be enabled under stress for now.

Based on #97506
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

Azure Pipelines successfully started running 3 pipeline(s).
Diff results for #97517

Assembly diffs
linux/arm64, ran on windows/x64: diffs based on 2,496,508 contexts (1,011,240 MinOpts, 1,485,268 FullOpts); MISSED contexts: base 6,580 (0.26%), diff 8,842 (0.35%); overall +83,870,420 bytes (FullOpts +83,870,420 bytes)
linux/x64, ran on windows/x64: diffs based on 2,504,130 contexts (977,766 MinOpts, 1,526,364 FullOpts); MISSED contexts: base 6,922 (0.28%), diff 8,132 (0.32%); overall +85,413,308 bytes (FullOpts +85,413,308 bytes)
osx/arm64, ran on windows/x64: diffs based on 2,228,167 contexts (927,360 MinOpts, 1,300,807 FullOpts); MISSED contexts: base 6,095 (0.27%), diff 7,850 (0.35%); overall +61,409,324 bytes (FullOpts +61,409,324 bytes)
windows/arm64, ran on windows/x64: diffs based on 2,306,322 contexts (929,692 MinOpts, 1,376,630 FullOpts); MISSED contexts: base 6,353 (0.27%), diff 8,476 (0.37%); overall +66,017,632 bytes (FullOpts +66,017,632 bytes)
windows/x64, ran on windows/x64: diffs based on 2,365,064 contexts (928,740 MinOpts, 1,436,324 FullOpts); MISSED contexts: base 6,816 (0.29%), diff 8,137 (0.34%); overall +64,751,990 bytes (FullOpts +64,751,990 bytes)
linux/arm, ran on windows/x86: diffs based on 2,228,746 contexts (825,130 MinOpts, 1,403,616 FullOpts); MISSED contexts: base 77,529 (3.36%), diff 79,285 (3.44%); overall +53,448,948 bytes (FullOpts +53,448,948 bytes)
windows/x86, ran on windows/x86: diffs based on 2,277,191 contexts (840,452 MinOpts, 1,436,739 FullOpts); MISSED contexts: base 7,010 (0.30%), diff 21,934 (0.95%); overall +46,134,613 bytes (FullOpts +46,134,613 bytes)

Throughput diffs
linux/arm64, ran on windows/x64: overall +4.02% to +18.80%; FullOpts +7.07% to +25.44%
linux/x64, ran on windows/x64: overall +4.46% to +20.19%; FullOpts +7.69% to +25.92%
osx/arm64, ran on windows/x64: overall +3.55% to +21.98%; FullOpts +6.27% to +27.66%
windows/arm64, ran on windows/x64: overall +3.62% to +21.25%; FullOpts +6.38% to +25.02%
windows/x64, ran on windows/x64: overall +3.87% to +23.68%; FullOpts +6.64% to +27.38%
linux/arm, ran on windows/x86: overall +4.64% to +15.62%; FullOpts +7.85% to +19.46%
windows/x86, ran on windows/x86: overall +4.73% to +17.44%; FullOpts +7.21% to +21.25%
linux/arm64, ran on linux/x64: overall +3.85% to +18.62%; FullOpts +6.99% to +25.14%
linux/x64, ran on linux/x64: overall +4.24% to +19.86%; FullOpts +7.59% to +25.54%
Diff results for #97517

Assembly diffs
linux/arm64, ran on windows/x64: diffs based on 2,496,508 contexts (1,011,240 MinOpts, 1,485,268 FullOpts); MISSED contexts: base 6,580 (0.26%), diff 8,842 (0.35%); overall +83,870,420 bytes (FullOpts +83,870,420 bytes)
linux/x64, ran on windows/x64: diffs based on 2,504,130 contexts (977,766 MinOpts, 1,526,364 FullOpts); MISSED contexts: base 6,922 (0.28%), diff 8,132 (0.32%); overall +85,413,308 bytes (FullOpts +85,413,308 bytes)
osx/arm64, ran on windows/x64: diffs based on 2,228,167 contexts (927,360 MinOpts, 1,300,807 FullOpts); MISSED contexts: base 6,095 (0.27%), diff 7,850 (0.35%); overall +61,409,324 bytes (FullOpts +61,409,324 bytes)
windows/arm64, ran on windows/x64: diffs based on 2,306,322 contexts (929,692 MinOpts, 1,376,630 FullOpts); MISSED contexts: base 6,353 (0.27%), diff 8,476 (0.37%); overall +66,017,632 bytes (FullOpts +66,017,632 bytes)
windows/x64, ran on windows/x64: diffs based on 2,478,673 contexts (976,915 MinOpts, 1,501,758 FullOpts); MISSED contexts: base 6,816 (0.27%), diff 8,236 (0.33%); overall +69,970,980 bytes (FullOpts +69,970,980 bytes)

Throughput diffs
linux/arm64, ran on windows/x64: overall +4.02% to +18.80%; FullOpts +7.07% to +25.44%
linux/x64, ran on windows/x64: overall +4.46% to +20.19%; FullOpts +7.69% to +25.92%
osx/arm64, ran on windows/x64: overall +3.54% to +21.98%; FullOpts +6.27% to +27.66%
windows/arm64, ran on windows/x64: overall +3.62% to +21.25%; FullOpts +6.37% to +25.02%
windows/x64, ran on windows/x64: overall +3.87% to +23.69%; FullOpts +6.63% to +27.39%
windows/x86, ran on windows/x86: overall +4.73% to +17.44%; FullOpts +7.21% to +21.25%
Diff results for #97517

Assembly diffs
linux/arm, ran on windows/x86: diffs based on 2,228,746 contexts (825,130 MinOpts, 1,403,616 FullOpts); MISSED contexts: base 77,529 (3.36%), diff 79,285 (3.44%); overall +53,448,948 bytes (FullOpts +53,448,948 bytes)
windows/x86, ran on windows/x86: diffs based on 2,277,191 contexts (840,452 MinOpts, 1,436,739 FullOpts); MISSED contexts: base 7,010 (0.30%), diff 21,934 (0.95%); overall +46,134,613 bytes (FullOpts +46,134,613 bytes)

Throughput diffs
linux/arm, ran on windows/x86: overall +4.64% to +15.60%; FullOpts +7.85% to +19.46%
linux/arm64, ran on linux/x64: overall +3.85% to +18.62%; FullOpts +6.99% to +25.14%
linux/x64, ran on linux/x64: overall +4.24% to +19.87%; FullOpts +7.59% to +25.55%
Hitting an assert on x86 with:

------------ BB96 [0297] [00C..03A) -> BB92,BB03,BB14,BB17,BB11,BB20,BB91,BB91,BB93,BB40 (switch), preds={BB02} succs={BB03,BB11,BB14,BB17,BB20,BB40,BB91,BB92,BB93}
N071 (???,???) [002516] ----------Z t2516 = LCL_VAR int V04 loc0 edx REG edx
N073 (???,???) [002517] ----------- t2517 = JMPTABLE int REG eax
N001 ( 1, 1) [002789] ----------z u2789 = LCL_VAR int V04 loc0 edi REG edi
┌──▌ t2516 int
├──▌ t2517 int
N075 (???,???) [002518] ----------- ▌ SWITCH_TABLE void REG NA
See runtime/src/coreclr/jit/codegenlinear.cpp, lines 923 to 930 in 0f70e9a.

@kunalspathak, any idea? The jitdump is attached.
TLDR: This seems to be an existing bug in the placement of the resolution move for a local that is an operand of a switch.

Details: At the end of BB96, … This gets added before the … (see runtime/src/coreclr/jit/lsra.cpp, lines 8430 to 8441 in fe51bd7).
During codegen, throughout …

Generating: N001 ( 1, 1) [002789] ----------z u2789 = LCL_VAR int V04 loc0 edi REG edi
IN000b: mov edi, edx <-- can be eliminated??
IN000c: mov edi, dword ptr [V04 ebp-0x10]

(I think the first …)

Next, when we generate code for …

IMO, the resolution should happen after the last use of …:

mov dword ptr [ebp-0x10], edx ; spill of [002516]
mov edi, dword ptr [ebp-0x10] ; resolution unspill of [002789]
// jump table code

I will come up with a fix.
I guess the problem is that we can't really make that happen, since the relative ordering of …

One thing that might be insightful is to figure out how the same situation gets handled for …

Initially I thought the handling was happening here: runtime/src/coreclr/jit/lsra.cpp, lines 8874 to 8887 in b76ef7f.

The …
Exactly.

Yes, I was wondering about that too last night. I tried a prototype, and it seems to solve the problem. I have sent out #97713 to see the superpmi-replay results.
Draft pull request was automatically closed after 30 days of inactivity. Please let us know if you'd like to reopen it.