Use LSB of vector when converting from vector to mask #116991
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
The tests are failing because with this PR we only consider a lane active when the corresponding vector lane (from which we converted) has its LSB set, and I see that a lot of tests generate "random" values in the mask input rather than just 0 or 1. I remember we changed it to have just 0 or 1. For example, below is a failure of Sve_Abs_double. However, when you look at the C# code of the test, we do populate the mask properly: runtime/src/tests/JIT/HardwareIntrinsics/Arm/Shared/_SveUnaryOpTestTemplate.template Line 164 in 5392448
This makes me wonder if there is a bug that needs to be fixed (as part of this PR) in the code that converts the pattern to a mask or vice versa. I will run a round of Fuzzlyn to double check that, and after that, the validation logic. @a74nh - I will let you take this forward.
/azp run Fuzzlyn
Azure Pipelines successfully started running 1 pipeline(s).
From the Fuzzlyn run, it looks like most of the functional errors are gone. The only failures I see now are a couple of asserts.
Looks like most (all?) of the failures here are float or double. That's because for float/double we can't just read the mask lane; we have to do something like:
I'm working through all of these...
I don't have permission to push to this PR. So, I'll most likely end up raising my own version. |
I've checked with other people whether there is a better sequence. |
// PMOV would be ideal here, but it is in SVE2.1.
// Instead, use a compare: CMPNE <Pd>.<T>, <Pg>/Z, <Zn>.<T>, #0
GetEmitter()->emitIns_R_R_R_I(ins, emitSize, targetReg, op1Reg, op2Reg, 0, opt);
Not quite sure this is correct and will likely break real world code.
The general consideration is that what we expose to the end user is a "full vector", so if they do something like Vector.ConditionalSelect(mask, x, y) then they are expecting this to behave exactly as (x & mask) | (y & ~mask), which is that it selects x when the corresponding bit in mask is set and y when it is clear.
The user isn't thinking about or considering the fact that on xarch the blendv instruction only checks the msb or any other similar nuance. They are writing a general purpose cross-platform API where the contract is "bitwise" from the original vector.
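As a minimal C# sketch of that contract (the values are arbitrary and purely illustrative; per the documented bitwise behaviour the two results should match):
using System;
using System.Runtime.Intrinsics;

class ConditionalSelectDemo
{
    static void Main()
    {
        // A "mask" that is neither Zero nor AllBitsSet per element.
        Vector128<int> mask = Vector128.Create(0x2, -1, 0, 0x0000FFFF);
        Vector128<int> x = Vector128.Create(0x11111111);
        Vector128<int> y = Vector128.Create(0x22222222);

        // The cross-platform contract is a bitwise select.
        Vector128<int> expected = (x & mask) | (y & ~mask);
        Vector128<int> actual = Vector128.ConditionalSelect(mask, x, y);

        // Element 0 (mask = 0x2) becomes 0x22222220: only bit 1 is taken from x,
        // every other bit comes from y. Nothing here looks at just the msb or lsb.
        Console.WriteLine(expected == actual);
        Console.WriteLine(actual.GetElement(0).ToString("X8"));
    }
}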
Correspondingly, we've documented that APIs which take a mask input are expecting allbitsset for that element to indicate true and zero for that element to indicate false. There isn't a consideration for them that the underlying hardware happens to have a conversion to/from the internal mask/predicate register representation that can optimize it a particular way.
The reason we document this is because all the hardware that doesn't have a dedicated masking register concept has its comparisons return AllBitsSet or Zero per element, because it's the most effective thing to use when dealing purely with vector data and it makes the experience consistent regardless of where you are or what you're doing.
Now, there are some dedicated instructions like Sse41.BlendVariable which explicitly take a vector in hardware and which explicitly document that they only consider the MSB of each element, but these are special and require direct usage. They aren't something we always transform something like Vector128.ConditionalSelect into, because that would result in a different semantic for unknown vectors (anything that wasn't AllBitsSet vs Zero per element).
I believe the same consideration exists here. We cannot implicitly have the internal ConvertVectorToMask helper behaving as if it is (x & 1) != 0, because it will break the rest of the logic flow and normalization considerations we've taken throughout the JIT and laid down for users.
More broadly, I'm not understanding what is meant by "the predicate registers conceptually only represents a bit (LSB) of the corresponding vector lanes"
The manual simply documents
Each SVE predicate register holds one bit for each byte of an SVE scalar vector register.
It then covers:
A predicate-as-mask encoding is divided into 1-bit, 2-bit, 4-bit, or 8-bit predicate elements. Each predicate element corresponds to a vector element in an SVE scalable vector register
...
If the lowest-numbered bit of a predicate element is 0, the value of the predicate element is FALSE.
If the lowest-numbered bit of a predicate element is 1, the value of the predicate element is TRUE.
Which is to say that for Vector128<byte> you get 16x 1-bit predicate values. For Vector128<ushort> you get 8x 2-bit predicate values, V128<uint> is 4x 4-bit, and V128<ulong> is 2x 8-bit.
For the multi-bit predicates, you get the consideration that only the lsb of the predicate element is actually checked. That is, for V128<ushort> you get the following, showing that only the lsb is relevant:
- 00 - false
- 01 - true
- 10 - false
- 11 - true
Which would indicate that the conversion being done here isn't really "correct" in the first place. The existing logic was already producing a correct predicate, and moreover a more useful one, because it can freely be reused for any smaller T. That is, because we only convert to 00 or 11 predicates, the V128<short> predicate is equally reusable as a predicate for V128<byte> since it remains a valid bytewise predicate for the same comparison. If you normalized this to only 00 vs 01, we'd then need to normalize it again for use with V128<byte> (and so on for 4-bit and 8-bit predicate elements).
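A tiny scalar model of that reuse argument (purely an illustrative model of the bit layout, not JIT code; the helper names are made up): a ushort predicate element normalized to 00/11 still reads correctly when the same bits are consumed as two 1-bit byte predicate elements, while a 00/01 encoding does not.
using System;

class PredicateWidthDemo
{
    // Illustrative model only: a predicate holds one bit per byte of the vector,
    // so a ushort lane owns 2 predicate bits and only the lowest one is tested.
    static bool UShortLaneActive(uint predicateBits, int lane) =>
        ((predicateBits >> (lane * 2)) & 1) != 0;

    static bool ByteLaneActive(uint predicateBits, int lane) =>
        ((predicateBits >> lane) & 1) != 0;

    static void Main()
    {
        // Normalized encoding: a true ushort lane is written as 0b11.
        uint normalized = 0b11_00;     // ushort lane1 = true, lane0 = false
        // Minimal encoding: a true ushort lane is written as 0b01.
        uint minimal    = 0b01_00;

        // As a ushort predicate both encodings mean the same thing...
        Console.WriteLine(UShortLaneActive(normalized, 1)); // True
        Console.WriteLine(UShortLaneActive(minimal, 1));    // True

        // ...but reused as a byte predicate, only the 0b11 form still says
        // "both bytes of that ushort lane are active".
        Console.WriteLine(ByteLaneActive(normalized, 3));   // True
        Console.WriteLine(ByteLaneActive(minimal, 3));      // False
    }
}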
Given that, I'd think any "incorrect" behavior would potentially be in ConvertMaskToVector where something could've produced a Mask for ushort elements that was 00/01/10/11 and where we'd want to produce a Vector that was still Zero/AllBitsSet/Zero/AllBitsSet for those respective values; but I believe the conversion codegen is already handling that nuance due to how it uses the mask as part of the selection.
So I don't think we have any consideration where a managed API is taking a Vector<T> where we aren't expecting it to be precisely Zero vs Non-Zero for the mask elements for conversion purposes and where we aren't requiring that a mask producing node normalize to Zero vs AllBitsSet per element.
We only have considerations for what internal nodes might produce when directly consumed, but those are never producing vectors where only the LSB is considered as far as I can tell. There is only the case that a mask might only set the lsb of the predicate element and where we should already be handling that nuance.
Consider an XOR where the result is then used as a mask:
ConvertVectorToMask(Xor(veca, vecb)):
veca = {100, 111, 101, 000}
vecb = {100, 100, 101, 000}
xor = {000, 011, 000, 000}
to mask = {0, 1, 0, 0}
By the rules we've set up, this should be exactly the same as first converting the inputs to a mask, and then XORing using the predicate XOR instruction. But today this gives a different result:
XorPredicate(ConvertVectorToMask(veca), ConvertVectorToMask(vecb)):
veca = {100, 111, 101, 000}
vecb = {100, 100, 101, 000}
maska = {1, 1, 1, 0}
maskb = {1, 1, 1, 0}
xor = {0, 0, 0, 0}
Using the LSB for masks, the results now match:
ConvertVectorToMask(Xor(veca, vecb)):
veca = {100, 111, 101, 000}
vecb = {100, 100, 101, 000}
xor = {000, 011, 000, 000}
to mask = {0, 1, 0, 0}
XorPredicate(ConvertVectorToMask(veca), ConvertVectorToMask(vecb)):
veca = {100, 111, 101, 000}
vecb = {100, 100, 101, 000}
maska = {0, 1, 1, 0}
maskb = {0, 0, 1, 0}
xor = {0, 1, 0, 0}
In simd.h the code was using MSB when converting from Vector to Mask. As of my #115566, it's now using LSB. Either way, neither of these options match the idea of using all true for a mask.
Previously we got away with this because we kept masks as TrueVector nodes and so none of the constant folding optimisations ever got triggered. As of #115566 those optimisations are getting triggered by Fuzzlyn when passing all constant vectors into ConditionalSelect and other such oddities.
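Here is the same worked example as a scalar C# model (illustrative only; lanes are modelled as plain ints and the conversion uses the "non-zero is true" rule):
using System;
using System.Linq;

class XorMaskOrderingDemo
{
    // Scalar model of a "non-zero is true" vector-to-mask conversion.
    static bool[] ToMask(int[] v) => v.Select(lane => lane != 0).ToArray();

    static void Main()
    {
        int[] veca = { 0b100, 0b111, 0b101, 0b000 };
        int[] vecb = { 0b100, 0b100, 0b101, 0b000 };

        // Xor the vectors first, then convert the result to a mask.
        bool[] xorThenConvert = ToMask(veca.Zip(vecb, (a, b) => a ^ b).ToArray());

        // Convert each vector to a mask first, then xor the predicates.
        bool[] convertThenXor = ToMask(veca).Zip(ToMask(vecb), (a, b) => a ^ b).ToArray();

        Console.WriteLine(string.Join(",", xorThenConvert)); // False,True,False,False
        Console.WriteLine(string.Join(",", convertThenXor)); // False,False,False,False
        // With an LSB-based conversion (lane & 1) both orders would give False,True,False,False.
    }
}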
By the rules we've set up, this should be exactly the same
This shouldn't be part of the existing rules/logic. Is there a place that is doing this in the Arm64 code already? If so, I think that's a bug and what needs to be fixed.
This is the same general issue of bool ^ bool where C# defines it as 0 (literal false) vs 1 (literal true) where-as IL defines it as 0 (literal false) vs not-0 (conceptual true) and so simply doing (x ^ y) == true can be distinct from (x ^ y) != false, because doing a literal xor of booleans can produce something that isn't the same as the literal true.
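A managed sketch of that bool nuance (illustrative; it relies on Unsafe to manufacture a bool whose underlying byte is neither 0 nor 1, and the exact output depends on the compiler/JIT not normalizing the value):
using System;
using System.Runtime.CompilerServices;

class BoolXorDemo
{
    static void Main()
    {
        byte rawA = 2;                                // non-zero, so conceptually "true" in IL
        byte rawB = 1;                                // the literal true
        bool a = Unsafe.As<byte, bool>(ref rawA);
        bool b = Unsafe.As<byte, bool>(ref rawB);

        bool result = a ^ b;                          // xor of the underlying bytes: 2 ^ 1 = 3
        Console.WriteLine(Unsafe.As<bool, byte>(ref result)); // typically prints 3

        // So a check written as "(result == true)" compares 3 against 1, while
        // "(result != false)" compares 3 against 0 -- the two can disagree.
    }
}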
For anything like Sve.CompareEqual we know that it produces a predicate result. This will actually only set the lsb of the respective "predicate element". However, we guarantee that to the end user if they end up viewing it as a vector it will only be AllBitsSet (predicate was true for that element) or Zero (predicate was false for that element) and so the CvtMaskToVector will do the right thing for that platform (on both xarch and arm64 this is that each element has 1 bit to check per element, the difference is how those bits are packed throughout the predicate register).
The user then doesn't have an API that lets them directly do XorPredicate, we only light that up via internal pattern recognition and we know the base type of a given mask, so we know whether two masks are compatible. This is correspondingly why the morph logic that produces XorMask requires it to explicitly be in the shape of Xor(CvtMaskToVector(mask), CvtMaskToVector(mask)) and requires the two masks to be for compatible SIMD sizes/base types, because otherwise it could produce a difference in behavior. -- In general it isn't safe to do predicate1Bit ^ predicate2Bit because they aren't logically compatible. There are some special scenarios involving constants where you can treat them as compatible due to being able to observe the bit patterns, but otherwise you have 1-bit per byte vs 1-bit per 2 bytes, so you'd get a potentially incorrect result whether it was consumed as a 1-bit predicate element or a 2-bit predicate element value.
The user then doesn't have an API that lets them directly do XorPredicate, we only light that up via internal pattern recognition and we know the base type of a given mask
As I understood it, the intention was that we would be able to optimise Xor to XorPredicate, otherwise we'd be missing SVE instructions. #101294 (comment) and other places. Recently #114438 added support for this, but this was then turned off with #115566. #116854 attempts to enable it again, but it's stuck behind this PR.
This is correspondingly why the morph logic that produces XorMask requires it to explicitly be in the shape of Xor(CvtMaskToVector(mask), CvtMaskToVector(mask)) and requires the two masks to be for compatible SIMD sizes/base types,
Right. It should be fairly simple to update #116854 to work like that.
There are also the approved SVE APIs for AndNot and OrNot. The instructions for these are predicate-only. Using all-bits masks, I'm not sure whether we can safely implement them or not. #116628 adds those APIs, but is stuck behind all the above work.
How should a constant vector be turned into a mask if the user passes it into an API that uses a mask? EvaluateSimdCvtVectorToMask() in simd.h used to check MSB, and now checks LSB. What bit or bits should be turned into true? Only allbitsset?
Given we're getting into July now, I'm quite keen to get this all fixed up ASAP.
As I understood it, the intention was that we would be able to optimise Xor to XorPredicate
Yes, but only where the optimization is valid. We cannot do it when it would change semantics for the managed user.
In practice this hasn't been an issue for any existing code paths. Developers are not taking a mask for Vector<byte> and combining it with a mask for Vector<int> in the wild. -- They notably wouldn't have been able to combine them properly using raw predicates either, just due to the two masks being fundamentally incompatible due to representing different state.
Right. It should be fairly simple to update #116854 to work like that.
This seems like a fairly expensive/verbose approach. The xarch logic is far simpler, https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L10009-L10147, and it catches all the important scenarios for very little expense.
The existing logic just looks for BitwiseOp(CvtMaskToVec(op1), CvtMaskToVec(op2)) and transforms it to BitwiseOpMask(op1, op2) if the two inputs are compatible. This takes a handful of lines of code, and Arm64 should reasonably be able to light up with the same logic, just by adding the relevant maskIntrinsicId selections for Arm64 SVE. -- It and other paths that are handling the x64 KMASK setup were intentionally set up so that Arm64 predicates could be plugged in trivially when ready.
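A toy C# model of that compatibility check (this is not the JIT code; the record and helper names are invented just to show the shape of the decision):
using System;

// BitwiseOp(CvtMaskToVec(op1), CvtMaskToVec(op2)) can only become
// BitwiseOpMask(op1, op2) when both masks describe the same layout.
record MaskNode(int SimdSize, Type BaseType);

class MaskMorphModel
{
    static bool CanFuseToMaskOp(MaskNode op1, MaskNode op2) =>
        op1.SimdSize == op2.SimdSize && op1.BaseType == op2.BaseType;

    static void Main()
    {
        var byteMask  = new MaskNode(16, typeof(byte));
        var shortMask = new MaskNode(16, typeof(short));

        Console.WriteLine(CanFuseToMaskOp(byteMask, byteMask));  // True: same predicate layout
        Console.WriteLine(CanFuseToMaskOp(byteMask, shortMask)); // False: 1-bit vs 2-bit predicate elements
    }
}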
There are also the approved SVE APIs for AndNot and OrNot. The instruction for these are predicates only. Using allbit masks I'm not sure if we can safely implement them or not
If they are only for predicates, they likely shouldn't be exposed. There's not really a need to expose them either, and I think the names are potentially confusing due to them being nand/nor -- We already have AndNot APIs and it is expected they behave like BIC does, which is x & ~y, where-as the predicate versions are doing ~(x & y) and ~(x | y) instead.
x64 has a similar predicate-only concept with xnor which does ~(x ^ y), but it not having a managed API isn't problematic because the user is more likely to write the natural pattern (which can also be optimized for other platforms or ISAs, like AdvSimd). We simply recognize NOT(XOR(CvtMaskToVec(x), CvtMaskToVec(y))) and translate it to XnorMask(x, y) -- The normalization to NOT(XOR(...)) also makes it easier to light up constant folding, comparison inversions, and other basic optimizations; so we intentionally decompose things like V128.AndNot(x, y) into AND(x, NOT(y)) instead of importing it directly as AND_NOT(x, y). We then construct the actual AND_NOT much later, such as in lowering, when we know that no other optimization work is needed and we just want to optimize the instruction selection.
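For example, the natural managed pattern for xnor is just the two bitwise ops; whether a target then folds it into a single mask instruction is up to the recognition described above (a sketch with arbitrary values):
using System;
using System.Runtime.Intrinsics;

class XnorPatternDemo
{
    // The "natural" cross-platform pattern. Where masks are involved, the JIT can
    // pattern-match NOT(XOR(...)) into a dedicated mask instruction when available;
    // on other targets it simply stays as two cheap bitwise ops.
    static Vector128<int> Xnor(Vector128<int> x, Vector128<int> y) => ~(x ^ y);

    static void Main()
    {
        var a = Vector128.Create(0b1100);
        var b = Vector128.Create(0b1010);
        Console.WriteLine(Xnor(a, b).GetElement(0)); // ~(0b1100 ^ 0b1010) = ~0b0110 = -7
    }
}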
How should a constant vector be turned into a mask if the user passes it into an API that uses a mask?
We basically have 4 options here:
- non-zero is true, zero is false - This most closely corresponds to how bool works
- all-bits-set is true, anything else is false
- msb set is true, msb clear is false - This most closely corresponds to how vpmov*2m works on xarch
- lsb set is true, lsb clear is false - This most closely corresponds to how pmov (to predicate) works on arm64
The intended behavior is the first: non-zero is true, zero is false. This is the intended behavior because it corresponds to how bool works already, is the cheapest to emulate for unknown inputs, lets us trivially optimize for both xarch and arm64 for known inputs, and allows cross platform determinism (including simplistic mapping to hardware without specialized predicate/masking support).
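As a sketch of that first option (illustrative only), normalizing an arbitrary vector into the canonical mask vector under "non-zero is true" can be modelled as:
using System;
using System.Runtime.Intrinsics;

class MaskNormalizationDemo
{
    // A sketch of the "non-zero is true" rule: normalize an arbitrary vector into a
    // canonical mask vector (AllBitsSet for true lanes, Zero for false lanes).
    static Vector128<int> NormalizeToMask(Vector128<int> value) =>
        ~Vector128.Equals(value, Vector128<int>.Zero);

    static void Main()
    {
        // 0x2 has its LSB clear, but it is non-zero, so it is a "true" lane under this rule.
        var v = Vector128.Create(0x2, 0, -1, 7);
        Console.WriteLine(NormalizeToMask(v)); // <-1, 0, -1, -1>
    }
}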
In practice the need to use this conversion should be rare, because users should not be taking arbitrary (especially non-constant) vectors and trying to use them as inputs into scenarios that expect predicates/masks. Most typically code is using the output of some Compare or similar instruction which directly produced a mask, potentially doing some minor bitwise operations on it to combine it with other masks.
In practice the need to use this conversion should be rare, because users should not be taking arbitrary (especially non-constant) vectors and trying to use them as inputs into scenarios that expect predicates/masks.
Agreed, but sadly Fuzzlyn doesn't agree - currently this is where all the failures are coming from, with code I doubt we'd ever see in the wild.
I'll get this fixed up this week. I'll draft a new PR to fix the mask handling. And then when combined with #116852 that should fix all the current issues.
currently this is where all the failures are coming from, with code I doubt we'd ever see in the wild.
Right. It's important we still handle it correctly and it's great Fuzzlyn is catching them. Just it's not critical for it to be "peak efficiency" for codegen since it's not likely going to happen in the real world.
Rather we want to ensure that morph/folding/lowering catch the typical real world cases and get the good codegen there. From what I've seen of the existing x64 support (which has been in production a couple years now), we're getting most of the right stuff already and most of that is setup for Arm64 to hook in for the same benefits.
I've put the required changes to EvaluateSimdCvtVectorToMask() in #116852.
This PR should probably be closed now.
Today, when converting a vector to a mask, we just check if the lane value is non-zero and, if yes, we consider the lane to be active and mark it as such in the predicate register:
This actually is not true as per the Arm manual: the predicate registers conceptually only represent a bit (the LSB) of the corresponding vector lanes. As such, a vector lane with value 0x2 will be considered "active" when we convert to a predicate, but ideally should be "inactive", because the LSB of that value is 0. This was not a problem until now because we were not doing any major optimization around constant masks. However, with #115566, we have started seeing problems where the codegen we produce does not match how the constant masks are converted to a pattern internally during optimization. During optimization, we should only check the LSB and accordingly set the corresponding lane as active or inactive. This was also called out last year in the Engineering SVE blog post.
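For illustration, here is a managed sketch of the LSB rule described above (this models the per-lane semantics only; it is not the emitted SVE codegen):
using System;
using System.Runtime.Intrinsics;

class LsbMaskRuleDemo
{
    // A lane is active only if its lowest bit is set.
    static Vector128<int> LsbToMask(Vector128<int> value) =>
        ~Vector128.Equals(value & Vector128.Create(1), Vector128<int>.Zero);

    static void Main()
    {
        var v = Vector128.Create(0x2, 0x1, 0x0, 0x3);
        // 0x2 is non-zero but its LSB is 0, so under this rule its lane is inactive.
        Console.WriteLine(LsbToMask(v)); // <0, -1, 0, -1>
    }
}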
I have updated the codegen to now do something like:
This will increase the number of instructions needed to convert a vector to a mask from 2 to 3. Hence, I have also updated the heuristics to take that into account.