-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Use SIMD operations in InitBlkUnroll/CopyBlkUnroll and increase unroll limit up to 128 bytes #61030
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Tagging subscribers to this area: @JulieLeeMSFT Issue Detailsbenchmarks.run.windows.arm64.checked.mch:Detail diffscoreclr_tests.pmi.windows.arm64.checked.mch:Detail diffslibraries.crossgen2.windows.arm64.checked.mch:Detail diffslibraries.pmi.windows.arm64.checked.mch:Detail diffslibraries_tests.pmi.windows.arm64.checked.mch:Detail diffs
|
|
Tagging subscribers to this area: @JulieLeeMSFT Issue Detailsbenchmarks.run.windows.arm64.checked.mch:Detail diffscoreclr_tests.pmi.windows.arm64.checked.mch:Detail diffslibraries.crossgen2.windows.arm64.checked.mch:Detail diffslibraries.pmi.windows.arm64.checked.mch:Detail diffslibraries_tests.pmi.windows.arm64.checked.mch:Detail diffs
|
|
I am also adding micro benchmarks to test performance impact of changes surrounding FWIW, the current change is mostly CQ improvements with little effect on the performance. As you can notice, I am not changing The following are the results of running the new micro benchmarks with the proposed increased
|
94eaaaa to
3c84b5a
Compare
929dc5d to
5bc4a6e
Compare
5bc4a6e to
653c8c0
Compare
653c8c0 to
7063255
Compare
7063255 to
c3566b1
Compare
|
/azp run runtime-coreclr outerloop |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@BruceForstall @EgorBo PTAL |
…lkUnroll() in src/coreclr/jit/codegenarmarch.cpp
…r/jit/codegenarmarch.cpp
…pyBlockUnrollHelper in src/coreclr/jit/codegenarmarch.cpp
…eg for InitBlock in src/coreclr/jit/lsraarmarch.cpp
…eg for CopyBlock in src/coreclr/jit/lsraarmarch.cpp
…adOrStorePairOffset() helper in src/coreclr/jit/emitarm64.h src/coreclr/jit/emitarm64.cpp
…lr/jit/codegenarmarch.cpp
|
This pull request has been automatically marked |
…r are contained and can potentially occupy up to 2 int registers
|
@BruceForstall Please take another look. I think I addressed your feedback. |
BruceForstall
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
/azp run runtime-coreclr outerloop |
|
Azure Pipelines successfully started running 1 pipeline(s). |
sandreenko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice solution, it looks like it was a lot of fun.
| static unsigned GetRegSizeAtLeastBytes(unsigned byteCount) | ||
| { | ||
| assert(byteCount != 0); | ||
| assert(byteCount < 16); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would not be assert(byteCount <= 16); less confusing?
|
|
||
| void StorePairRegs(int offset, unsigned regSizeBytes) | ||
| { | ||
| canEncodeAllStores = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit canEncodeAllStores &| ? It is safe for bool-s in C++.
| bool canEncodeAllStores; | ||
| }; | ||
|
|
||
| class ProducingStreamBaseInstrs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does base mean no-simd?
|
|
||
| private: | ||
| template <class InstructionStream> | ||
| void UnrollInitBlock(InstructionStream& instrStream, int initialRegSizeBytes) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code convention forbids non-const ref args https://github.com/dotnet/runtime/blob/main/docs/coding-guidelines/clr-jit-coding-conventions.md#1535-reference-arguments
| intReg1 = node->ExtractTempReg(RBM_ALLINT); | ||
| intReg2 = node->GetSingleTempReg(RBM_ALLINT); | ||
| break; | ||
| default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are this default and the next one for simdRegCount unreached()?
| { | ||
| public: | ||
| ProducingStreamBaseInstrs(regNumber intReg1, regNumber intReg2, regNumber addrReg, emitter* emitter) | ||
| : intReg1(intReg1), intReg2(intReg2), addrReg(addrReg), emitter(emitter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should all reg be not REG_NA or does it allow some to be REG_NA?
Currently,
CodeGen::genCodeForInitBlkUnroll()andCodeGen::genCodeForCpBlkUnroll()do not use SIMD instructions when generating code forInitBlock/CopyBlockoperations on Arm and Arm64.This PR adds support for that on Arm64.
In order to accurately estimate whether the use of SIMD instructions is needed and/or whether the base address of a block needs to computed and stored to a integer register the algorithm does multiple passes - (one that verifies whether all offsets can be encoded and another that counts how many instructions will be emitted).
The following are justifications why such approach was chosen:
In fact, that issue was already pointed out by a TODO comment (that is removed by the PR)
In other words it transforms a problem of code-generating of
InitBlock(largeOffset, byteCount)toInitBlock(0, byteCount). In order to do figure out that such transformation is required, the algorithm needs a pass to verify whether all the offsets are encodable.For example, the following snippet shows such case
InitBlockAllZeros(startOffset=8, byteCount=16)) since the JIT would need to initialize a SIMD register while it could just use astr xzr, xzr, [dstReg, #dstOffset]instead (assumingdstOffsetis encodable). In order to figure out such cases I've chose to use the same mechanism of multiple passes instead of trying to "encode" all corner cases.As an example, the following demonstrates a case where it is beneficial to "spill" that the base of address of destination in
CopyBlockand use SIMD instructions than use integer instructions without spilling.In addition to that, the instruction selection strategy that the current implementation uses can be characterized as greedy "best fit" and it doesn't always produce the optimal result (for example,
InitBlocksequence for 7 bytes isstr rInitValue, [rAddr]; strh rInitValue, [rAddr+4]; strb rInitValue, [rAddr+6]). I changed the strategy slightly and added support to allow overlapping stores (so that sequence would becomestr rInitValue, [rAddr]; stur rInitValue, [rAddr+3].The following is an example where using overlapping loads/stores result in shorter instruction sequence
ldp x1, x3, [x0] stp x1, x3, [x2,#16] - ldr w1, [x0,#16] - str w1, [x2,#32] - ldrh w1, [x0,#20] - strh w1, [x2,#36] + ldr x1, [x0,#14] + str x1, [x2,#30]