Arm64/Sve: Implement SVE Math *Multiply* APIs #102007
Conversation
@dotnet/arm64-contrib

```cpp
if ((intrin.id == NI_Sve_FusedMultiplyAddBySelectedScalar) ||
```
Why do these require special code here?
Because, per FMLA (indexed), Zm has to be in the lower vector registers.
We have similar code for AdvSimd too; if I see more patterns in the future, I will most likely combine this code with it.
runtime/src/coreclr/jit/lsraarm64.cpp
Lines 1586 to 1613 in 3fce4e7
```cpp
if ((intrin.category == HW_Category_SIMDByIndexedElement) && (genTypeSize(intrin.baseType) == 2))
{
    // Some "Advanced SIMD scalar x indexed element" and "Advanced SIMD vector x indexed element" instructions (e.g.
    // "MLA (by element)") have encoding that restricts what registers that can be used for the indexed element when
    // the element size is H (i.e. 2 bytes).
    assert(intrin.op2 != nullptr);
    if ((intrin.op4 != nullptr) || ((intrin.op3 != nullptr) && !hasImmediateOperand))
    {
        if (isRMW)
        {
            srcCount += BuildDelayFreeUses(intrin.op2, nullptr);
            srcCount += BuildDelayFreeUses(intrin.op3, nullptr, RBM_ASIMD_INDEXED_H_ELEMENT_ALLOWED_REGS);
        }
        else
        {
            srcCount += BuildOperandUses(intrin.op2);
            srcCount += BuildOperandUses(intrin.op3, RBM_ASIMD_INDEXED_H_ELEMENT_ALLOWED_REGS);
        }

        if (intrin.op4 != nullptr)
        {
            assert(hasImmediateOperand);
            assert(varTypeIsIntegral(intrin.op4));
            srcCount += BuildOperandUses(intrin.op4);
        }
    }
```
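For context, a minimal sketch of what the analogous restriction could look like on the SVE side for the check quoted at the top of this thread; the second intrinsic name and the `RBM_SVE_INDEXED_ELEMENT_ALLOWED_REGS` mask are placeholders, not necessarily what the PR uses:

```cpp
// Sketch only: FMLA/FMLS (indexed) require Zm to come from the low vector
// registers, so restrict the register candidates of the indexed operand,
// mirroring the AdvSimd handling above. op1 (the accumulator) is assumed to
// be handled elsewhere as the tgtPrefUse.
if ((intrin.id == NI_Sve_FusedMultiplyAddBySelectedScalar) ||
    (intrin.id == NI_Sve_FusedMultiplySubtractBySelectedScalar)) // placeholder second case
{
    assert(isRMW);
    srcCount += BuildDelayFreeUses(intrin.op2, nullptr);
    srcCount += BuildDelayFreeUses(intrin.op3, nullptr, RBM_SVE_INDEXED_ELEMENT_ALLOWED_REGS);
    srcCount += BuildOperandUses(intrin.op4); // the immediate index
}
```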
```cpp
HARDWARE_INTRINSIC(Sve, CreateWhileLessThanOrEqualMask64Bit, -1, 2, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_whilele, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_SpecialCodeGen|HW_Flag_ReturnsPerElementMask)
HARDWARE_INTRINSIC(Sve, CreateWhileLessThanOrEqualMask8Bit, -1, 2, false, {INS_invalid, INS_sve_whilele, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_SpecialCodeGen|HW_Flag_ReturnsPerElementMask)
HARDWARE_INTRINSIC(Sve, Divide, -1, 2, true, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_sdiv, INS_sve_udiv, INS_sve_sdiv, INS_sve_udiv, INS_sve_fdiv, INS_sve_fdiv}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_LowMaskedOperation)
HARDWARE_INTRINSIC(Sve, FusedMultiplyAdd, -1, -1, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_fmla, INS_sve_fmla}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_LowMaskedOperation)
```
You are always using FMLA for these. Will there be cases where FMAD might be more optimal based on register usage? If so, raise an issue to track it.
Currently, I am just setting op1 as the targetPrefUse; in other words, telling LSRA to use op1 as the targetReg and marking the registers of the other operands as delayFree. With that, using FMLA will always be optimal. @tannergooding - please correct me if I missed anything here.
Ok, that sounds reasonable.
There might be scenarios where FMAD is still optimal - those where op2 is never reused in the C#, but op1 is reused. Using FMAD would avoid having to mov op1 into a temp.
I would definitely expect us to have some logic around picking FMLA vs FMAD.
The x64 logic is even more complex because it has to handle the RMW consideration (should the tgtPrefUse be the addend or the multiplier), but it also needs to consider which memory operand should be contained (since it supports embedded loads). That logic is here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L9823 and you'll note that it uses node->GetResultOpNumForRmwIntrinsic to determine which of op1, op2, or op3 is both an input and an output, or otherwise which is the last use. It uses this to ensure the right containment choices are being made.
x64 then repeats this logic in LSRA to actually set the tgtPrefUse: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lsraxarch.cpp#L2432 and again in codegen to pick which instruction form it should use: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiccodegenxarch.cpp#L2947
I expect that Arm64 just needs to mirror the LSRA and codegen logic (ignoring any bits relevant to containment), picking FMLA vs FMAD (rather than 231 vs 213, respectively).
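For illustration, a minimal sketch of what that mirroring could look like in the Arm64 codegen; the locals (`op1Reg`, `maskReg`, `emitSize`, `opt`) are assumed, and the PR reuses GetResultOpNumForRmwIntrinsic (per the commit list below) rather than the raw register comparisons shown here:

```cpp
// Sketch: prefer FMLA when the destination holds the addend; prefer FMAD when
// it holds one of the multiplicands (mirroring the 231 vs 213 split on x64).
if ((targetReg == op2Reg) || (targetReg == op3Reg))
{
    // fmad Zdn, Pg/M, Zm, Za : accumulate into the multiplicand register,
    // avoiding an extra copy when a multiplicand is the last use.
    regNumber otherMulReg = (targetReg == op2Reg) ? op3Reg : op2Reg;
    GetEmitter()->emitIns_R_R_R_R(INS_sve_fmad, emitSize, targetReg, maskReg, otherMulReg, op1Reg, opt);
}
else
{
    if (targetReg != op1Reg)
    {
        // Destination is distinct from every source: copy the addend in first.
        GetEmitter()->emitIns_R_R(INS_sve_movprfx, EA_SCALABLE, targetReg, op1Reg);
    }
    // fmla Zda, Pg/M, Zn, Zm : accumulate into the addend register.
    GetEmitter()->emitIns_R_R_R_R(INS_sve_fmla, emitSize, targetReg, maskReg, op2Reg, op3Reg, opt);
}
```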
```cpp
// If the instruction just has "predicated" version, then move the "embMaskOp1Reg"
// into targetReg. Next, do the predicated operation on the targetReg and last,
// use "sel" to select the active lanes based on mask, and set inactive lanes
// to falseReg.

assert(HWIntrinsicInfo::IsEmbeddedMaskedOperation(intrinEmbMask.id));

if (targetReg != embMaskOp1Reg)
{
    GetEmitter()->emitIns_R_R(INS_sve_movprfx, EA_SCALABLE, targetReg, embMaskOp1Reg);
}

GetEmitter()->emitIns_R_R_R_R(insEmbMask, emitSize, targetReg, maskReg, embMaskOp2Reg,
                              embMaskOp3Reg, opt);

GetEmitter()->emitIns_R_R_R_R(INS_sve_sel, emitSize, targetReg, maskReg, targetReg, falseReg,
                              opt, INS_SCALABLE_OPTS_UNPREDICATED);
```
Is there an assumption being made about the instruction being RMW here?
FMLA encodes 4 registers (Zda, Pg, Zn, and Zm) where Zda is both the source and destination and the operation is functionally similar to Zda += (Zn * Zm) (with only a single rounding operation).
Given some `Zda = ConditionalSelect(Pg, FusedMultiplyAdd(Zda, Zn, Zm), Zda)` it can then be encoded as simply:

```asm
fmla Zda, Pg/M, Zn, Zm
```

Given some `Zda = ConditionalSelect(Pg, FusedMultiplyAdd(Zda, Zn, Zm), merge)` it can then be encoded as simply:

```asm
movprfx Zda, Pg/M, merge
fmla    Zda, Pg/M, Zn, Zm
```

Given some `Zda = ConditionalSelect(Pg, FusedMultiplyAdd(Zda, Zn, Zm), Zero)` it can then be encoded as simply:

```asm
movprfx Zda, Pg/Z, Zda
fmla    Zda, Pg/M, Zn, Zm
```

There are then similar versions possible using fmad when the multiplier is the source and destination (op2Reg == tgtReg or op3Reg == tgtReg).

We should actually never need sel for this case, but only need complex generation if tgtReg is unique from all input registers (including the merge) and we're merging with a non-zero value, such as `dest = ConditionalSelect(Pg, FusedMultiplyAdd(Zda, Zn, Zm), merge)`:

```asm
mov     dest, Zda
movprfx dest, Pg/M, merge
fmla    dest, Pg/M, Zn, Zm
```

This ends up being different from the other fallbacks that do use sel, specifically because it's RMW and requires predication (that is, there is no fmla (unpredicated)).
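To summarize the case split above as code, a purely illustrative sketch (names follow the quoted codegen snippet; `mergeIsZero` stands in for however the zero merge value is detected, and this is not the PR's actual code):

```cpp
// Illustrative case analysis for ConditionalSelect(Pg, fmla(Zda, Zn, Zm), falseReg).
bool mergeIsZero = false; // assumption: set from a contained zero-vector op3

if (targetReg == embMaskOp1Reg)
{
    if (falseReg == targetReg)
    {
        // Merging with the accumulator itself:
        //   fmla Zda, Pg/M, Zn, Zm
    }
    else if (mergeIsZero)
    {
        // Zero the inactive lanes, then do the predicated operation:
        //   movprfx Zda, Pg/Z, Zda
        //   fmla    Zda, Pg/M, Zn, Zm
    }
    else
    {
        // Merge the inactive lanes from falseReg first:
        //   movprfx Zda, Pg/M, merge
        //   fmla    Zda, Pg/M, Zn, Zm
    }
}
else
{
    // Destination is distinct from every input and merges with a non-zero
    // value: the mov/movprfx/fmla (and, later in this thread, sel + fmla)
    // sequences apply here.
}
```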
The main reason for using `ins (unpredicated); sel` in the other case is that it allows a 2-instruction sequence as the worst case.
In this case, we need at worst a 3-instruction sequence due to the required predication on the instruction. Thus, it becomes better to use `mov; movprfx (predicated); ins (predicated)` instead, as it allows the `mov` to be elided by the register renamer.
> such as `dest = ConditionalSelect(Pg, FusedMultiplyAdd(Zda, Zn, Zm), merge)`:

For similar reasoning to that mentioned in #100743 (comment) (where we should only movprfx the inactive lanes from merge -> dest), the code should be:

```asm
mov  dest, Zda
fmla dest, Pg/M, Zn, Zm
sel  dest, Pg/M, dest, merge
```
Actually, I misinterpreted the value of Pg/M as AllTrue. Spoke to @tannergooding offline and we would like to generate:

```asm
sel  dest, Pg/M, Zda, merge
fmla dest, Pg/M, Zn, Zm
```

`Sve.ConditionalSelect(op1, Sve.MultiplyBySelectedScalar(op1, op2, 0), op3);` was failing because we were trying to check whether `MultiplyBySelectedScalar` is contained, and we hit the assert because it is not containable. Added the check.

Also updated *SelectedScalar* tests for ConditionalSelect.
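The shape of that check is roughly as follows; this is a sketch under the assumption that the guard sits where the nested intrinsic gets contained during lowering, and the surrounding names (`node`, `op2`) are illustrative rather than the PR's exact code:

```cpp
// Sketch: only treat the nested intrinsic as an embedded (contained) operation
// when it actually supports embedded masking; MultiplyBySelectedScalar does not,
// so it must go through the regular (non-contained) path instead.
if (op2->OperIsHWIntrinsic() &&
    HWIntrinsicInfo::IsEmbeddedMaskedOperation(op2->AsHWIntrinsic()->GetHWIntrinsicId()))
{
    MakeSrcContained(node, op2);
}
```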
```cpp
    unreached();
}

if (intrin.op3->IsVectorZero())
```
Should this be asserting that intrin.op3 is contained?
Added.
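(A sketch of the kind of assert being discussed, assuming it sits alongside the `IsVectorZero()` check quoted above:)

```cpp
// The zero vector is expected to have been contained during lowering.
assert(intrin.op3->isContained());
```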
* Add *Fused* APIs
* fix an assert in morph
* Map APIs to instructions
* Add test cases
* handle fused* instructions
* jit format
* Added MultiplyAdd/MultiplySubtract
* Add mapping of API to instruction
* Add test cases
* Handle mov Z, Z instruction
* Reuse GetResultOpNumForRmwIntrinsic() for arm64
* Reuse HW_Flag_FmaIntrinsic for arm64
* Mark FMA APIs as HW_Flag_FmaIntrinsic
* Handle FMA in LSRA and codegen
* Remove the SpecialCodeGen flag from selectedScalar
* address some more scenarios
* jit format
* Add MultiplyBySelectedScalar
* Map the API to the instruction
* fix a bug where *Indexed API used with ConditionalSelect were failing: `Sve.ConditionalSelect(op1, Sve.MultiplyBySelectedScalar(op1, op2, 0), op3);` was failing because we were trying to check if `MultiplyBySelectedScalar` is contained and we hit the assert because it is not containable. Added the check.
* unpredicated movprfx should not send opt
* Add the missing flags for Subtract/Multiply
* Added tests for MultiplyBySelectedScalar. Also updated *SelectedScalar* tests for ConditionalSelect
* fixes to test cases
* fix the parameter for selectedScalar test
* jit format
* Contain(op3) of CndSel if op1 is AllTrueMask
* Handle FMA properly
* added assert

All tests are passing: https://gist.github.com/kunalspathak/511565b3fe4d830dec509d867b8e36b0
Edit: Updated to include `MultiplyAdd` and `MultiplySubtract`.

Contributes to #99957