Arm64/Sve: Implement SVE Math *Multiply* APIs #102007
Conversation
@dotnet/arm64-contrib

```cpp
if ((intrin.id == NI_Sve_FusedMultiplyAddBySelectedScalar) ||
```
Why do these require special code here?
Because, per FMLA (indexed), Zm has to be in the lower vector registers.
We have similar code for AdvSimd too; if I see more patterns in the future, I will most likely combine this code with it.
runtime/src/coreclr/jit/lsraarm64.cpp
Lines 1586 to 1613 in 3fce4e7
```cpp
if ((intrin.category == HW_Category_SIMDByIndexedElement) && (genTypeSize(intrin.baseType) == 2))
{
    // Some "Advanced SIMD scalar x indexed element" and "Advanced SIMD vector x indexed element" instructions (e.g.
    // "MLA (by element)") have encoding that restricts what registers that can be used for the indexed element when
    // the element size is H (i.e. 2 bytes).
    assert(intrin.op2 != nullptr);
    if ((intrin.op4 != nullptr) || ((intrin.op3 != nullptr) && !hasImmediateOperand))
    {
        if (isRMW)
        {
            srcCount += BuildDelayFreeUses(intrin.op2, nullptr);
            srcCount += BuildDelayFreeUses(intrin.op3, nullptr, RBM_ASIMD_INDEXED_H_ELEMENT_ALLOWED_REGS);
        }
        else
        {
            srcCount += BuildOperandUses(intrin.op2);
            srcCount += BuildOperandUses(intrin.op3, RBM_ASIMD_INDEXED_H_ELEMENT_ALLOWED_REGS);
        }

        if (intrin.op4 != nullptr)
        {
            assert(hasImmediateOperand);
            assert(varTypeIsIntegral(intrin.op4));
            srcCount += BuildOperandUses(intrin.op4);
        }
    }
```
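For context, a minimal sketch of what the analogous restriction could look like on the SVE side for the check quoted at the top of this thread; the second intrinsic name and the `RBM_SVE_INDEXED_ELEMENT_ALLOWED_REGS` mask are placeholders, not necessarily what the PR uses:

```cpp
// Sketch only: FMLA/FMLS (indexed) require Zm to come from the low vector
// registers, so restrict the register candidates of the indexed operand,
// mirroring the AdvSimd handling above. op1 (the accumulator) is assumed to
// be handled elsewhere as the tgtPrefUse.
if ((intrin.id == NI_Sve_FusedMultiplyAddBySelectedScalar) ||
    (intrin.id == NI_Sve_FusedMultiplySubtractBySelectedScalar)) // placeholder second case
{
    assert(isRMW);
    srcCount += BuildDelayFreeUses(intrin.op2, nullptr);
    srcCount += BuildDelayFreeUses(intrin.op3, nullptr, RBM_SVE_INDEXED_ELEMENT_ALLOWED_REGS);
    srcCount += BuildOperandUses(intrin.op4); // the immediate index
}
```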
```cpp
HARDWARE_INTRINSIC(Sve, CreateWhileLessThanOrEqualMask64Bit, -1, 2, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_whilele, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_SpecialCodeGen|HW_Flag_ReturnsPerElementMask)
HARDWARE_INTRINSIC(Sve, CreateWhileLessThanOrEqualMask8Bit, -1, 2, false, {INS_invalid, INS_sve_whilele, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_SpecialCodeGen|HW_Flag_ReturnsPerElementMask)
HARDWARE_INTRINSIC(Sve, Divide, -1, 2, true, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_sdiv, INS_sve_udiv, INS_sve_sdiv, INS_sve_udiv, INS_sve_fdiv, INS_sve_fdiv}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_LowMaskedOperation)
HARDWARE_INTRINSIC(Sve, FusedMultiplyAdd, -1, -1, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_fmla, INS_sve_fmla}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_LowMaskedOperation)
```
You are always using FMLA for these. Will there be cases where FMAD might be more optimal based on register usage? If so, raise an issue to track it.
Currently, I am just setting op1 as the targetPrefUse; in other words, telling LSRA to use op1 as the targetReg and marking the registers of the other operands as delayFree. With that, using FMLA will always be optimal. @tannergooding - please correct me if I missed anything here.
Ok, that sounds reasonable.
There might be scenarios where FMAD is still optimal - those where op2 is never reused in the C#, but op1 is reused. Using FMAD would avoid having to mov op1 into a temp.
I would definitely expect us to have some logic around picking FMLA vs FMAD.
The x64 logic is even more complex because it has to handle the RMW consideration (should the tgtPrefUse be the addend or the multiplier), but it also needs to consider which memory operand should be contained (since it supports embedded loads). That logic is here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L9823 and you'll note that it uses node->GetResultOpNumForRmwIntrinsic to determine which of op1, op2, or op3 is both an input and an output, or otherwise which is the last use. It uses this to ensure the right containment choices are being made.
x64 then repeats this logic in LSRA to actually set the tgtPrefUse: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lsraxarch.cpp#L2432 and again in codegen to pick which instruction form it should use: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiccodegenxarch.cpp#L2947
I expect that Arm64 just needs to mirror the LSRA and codegen logic (ignoring any bits relevant to containment), picking FMLA vs FMAD (rather than 231 vs 213, respectively).
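For illustration, a minimal sketch of what that mirroring could look like in the Arm64 codegen; the locals (`op1Reg`, `maskReg`, `emitSize`, `opt`) are assumed, and the PR reuses GetResultOpNumForRmwIntrinsic (per the commit list below) rather than the raw register comparisons shown here:

```cpp
// Sketch: prefer FMLA when the destination holds the addend; prefer FMAD when
// it holds one of the multiplicands (mirroring the 231 vs 213 split on x64).
if ((targetReg == op2Reg) || (targetReg == op3Reg))
{
    // fmad Zdn, Pg/M, Zm, Za : accumulate into the multiplicand register,
    // avoiding an extra copy when a multiplicand is the last use.
    regNumber otherMulReg = (targetReg == op2Reg) ? op3Reg : op2Reg;
    GetEmitter()->emitIns_R_R_R_R(INS_sve_fmad, emitSize, targetReg, maskReg, otherMulReg, op1Reg, opt);
}
else
{
    if (targetReg != op1Reg)
    {
        // Destination is distinct from every source: copy the addend in first.
        GetEmitter()->emitIns_R_R(INS_sve_movprfx, EA_SCALABLE, targetReg, op1Reg);
    }
    // fmla Zda, Pg/M, Zn, Zm : accumulate into the addend register.
    GetEmitter()->emitIns_R_R_R_R(INS_sve_fmla, emitSize, targetReg, maskReg, op2Reg, op3Reg, opt);
}
```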
```cpp
// If the instruction just has "predicated" version, then move the "embMaskOp1Reg"
// into targetReg. Next, do the predicated operation on the targetReg and last,
// use "sel" to select the active lanes based on mask, and set inactive lanes
// to falseReg.

assert(HWIntrinsicInfo::IsEmbeddedMaskedOperation(intrinEmbMask.id));

if (targetReg != embMaskOp1Reg)
{
    GetEmitter()->emitIns_R_R(INS_sve_movprfx, EA_SCALABLE, targetReg, embMaskOp1Reg);
}

GetEmitter()->emitIns_R_R_R_R(insEmbMask, emitSize, targetReg, maskReg, embMaskOp2Reg,
                              embMaskOp3Reg, opt);

GetEmitter()->emitIns_R_R_R_R(INS_sve_sel, emitSize, targetReg, maskReg, targetReg, falseReg,
                              opt, INS_SCALABLE_OPTS_UNPREDICATED);
```
Is there an assumption being made about the instruction being RMW here?
FMLA encodes 4 registers (Zda, Pg, Zn, and Zm) where Zda is both the source and destination and the operation is functionally similar to Zda += (Zn * Zm) (with only a single rounding operation).
Given some `Zda = ConditionalSelect(Pg, FusedMultiplyAdd(Zda, Zn, Zm), Zda)` it can then be encoded as simply:

```asm
fmla Zda, Pg/M, Zn, Zm
```

Given some `Zda = ConditionalSelect(Pg, FusedMultiplyAdd(Zda, Zn, Zm), merge)` it can then be encoded as simply:

```asm
movprfx Zda, Pg/M, merge
fmla    Zda, Pg/M, Zn, Zm
```

Given some `Zda = ConditionalSelect(Pg, FusedMultiplyAdd(Zda, Zn, Zm), Zero)` it can then be encoded as simply:

```asm
movprfx Zda, Pg/Z, Zda
fmla    Zda, Pg/M, Zn, Zm
```

There are then similar versions possible using fmad when the multiplier is the source and destination (op2Reg == tgtReg or op3Reg == tgtReg).

We should actually never need sel for this case, but only need complex generation if tgtReg is unique from all input registers (including the merge) and we're merging with a non-zero value, such as `dest = ConditionalSelect(Pg, FusedMultiplyAdd(Zda, Zn, Zm), merge)`:

```asm
mov     dest, Zda
movprfx dest, Pg/M, merge
fmla    dest, Pg/M, Zn, Zm
```

This ends up being different from the other fallbacks that do use sel, specifically because it's RMW and requires predication (that is, there is no fmla (unpredicated)).
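To summarize the case split above as code, a purely illustrative sketch (names follow the quoted codegen snippet; `mergeIsZero` stands in for however the zero merge value is detected, and this is not the PR's actual code):

```cpp
// Illustrative case analysis for ConditionalSelect(Pg, fmla(Zda, Zn, Zm), falseReg).
bool mergeIsZero = false; // assumption: set from a contained zero-vector op3

if (targetReg == embMaskOp1Reg)
{
    if (falseReg == targetReg)
    {
        // Merging with the accumulator itself:
        //   fmla Zda, Pg/M, Zn, Zm
    }
    else if (mergeIsZero)
    {
        // Zero the inactive lanes, then do the predicated operation:
        //   movprfx Zda, Pg/Z, Zda
        //   fmla    Zda, Pg/M, Zn, Zm
    }
    else
    {
        // Merge the inactive lanes from falseReg first:
        //   movprfx Zda, Pg/M, merge
        //   fmla    Zda, Pg/M, Zn, Zm
    }
}
else
{
    // Destination is distinct from every input and merges with a non-zero
    // value: the mov/movprfx/fmla (and, later in this thread, sel + fmla)
    // sequences apply here.
}
```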
The main reason for using `ins (unpredicated); sel` in the other case is that it allows a 2-instruction sequence as the worst case.
In this case, we need at worst a 3-instruction sequence due to the required predication on the instruction. Thus, it becomes better to use `mov; movprfx (predicated); ins (predicated)` instead, as it allows the `mov` to be elided by the register renamer.
> such as `dest = ConditionalSelect(Pg, FusedMultiplyAdd(Zda, Zn, Zm), merge)`:

For similar reasoning to that mentioned in #100743 (comment) (where we should only movprfx the inactive lanes from merge -> dest), the code should be:

```asm
mov  dest, Zda
fmla dest, Pg/M, Zn, Zm
sel  dest, Pg/M, dest, merge
```
Actually, I misinterpreted the value of Pg/M as AllTrue. Spoke to @tannergooding offline and we would like to generate:

```asm
sel  dest, Pg/M, Zda, merge
fmla dest, Pg/M, Zn, Zm
```

`Sve.ConditionalSelect(op1, Sve.MultiplyBySelectedScalar(op1, op2, 0), op3);` was failing because we were trying to check whether `MultiplyBySelectedScalar` is contained, and we hit the assert because it is not containable. Added the check.

Also updated *SelectedScalar* tests for ConditionalSelect.
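The shape of that check is roughly as follows; this is a sketch under the assumption that the guard sits where the nested intrinsic gets contained during lowering, and the surrounding names (`node`, `op2`) are illustrative rather than the PR's exact code:

```cpp
// Sketch: only treat the nested intrinsic as an embedded (contained) operation
// when it actually supports embedded masking; MultiplyBySelectedScalar does not,
// so it must go through the regular (non-contained) path instead.
if (op2->OperIsHWIntrinsic() &&
    HWIntrinsicInfo::IsEmbeddedMaskedOperation(op2->AsHWIntrinsic()->GetHWIntrinsicId()))
{
    MakeSrcContained(node, op2);
}
```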
```cpp
    unreached();
}

if (intrin.op3->IsVectorZero())
```
Should this be asserting that intrin.op3 is contained?
Added.
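(A sketch of the kind of assert being discussed, assuming it sits alongside the `IsVectorZero()` check quoted above:)

```cpp
// The zero vector is expected to have been contained during lowering.
assert(intrin.op3->isContained());
```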
* Add *Fused* APIs
* fix an assert in morph
* Map APIs to instructions
* Add test cases
* handle fused* instructions
* jit format
* Added MultiplyAdd/MultiplySubtract
* Add mapping of API to instruction
* Add test cases
* Handle mov Z, Z instruction
* Reuse GetResultOpNumForRmwIntrinsic() for arm64
* Reuse HW_Flag_FmaIntrinsic for arm64
* Mark FMA APIs as HW_Flag_FmaIntrinsic
* Handle FMA in LSRA and codegen
* Remove the SpecialCodeGen flag from selectedScalar
* address some more scenarios
* jit format
* Add MultiplyBySelectedScalar
* Map the API to the instruction
* fix a bug where *Indexed API used with ConditionalSelect were failing: `Sve.ConditionalSelect(op1, Sve.MultiplyBySelectedScalar(op1, op2, 0), op3);` was failing because we were trying to check if `MultiplyBySelectedScalar` is contained and we hit the assert because it is not containable. Added the check.
* unpredicated movprfx should not send opt
* Add the missing flags for Subtract/Multiply
* Added tests for MultiplyBySelectedScalar. Also updated *SelectedScalar* tests for ConditionalSelect
* fixes to test cases
* fix the parameter for selectedScalar test
* jit format
* Contain(op3) of CndSel if op1 is AllTrueMask
* Handle FMA properly
* added assert

All tests are passing: https://gist.github.com/kunalspathak/511565b3fe4d830dec509d867b8e36b0
Edit: Updated to include `MultiplyAdd` and `MultiplySubtract`.

Contributes to #99957