Use LSB of vector when converting from vector to mask #116991
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
The tests are failing because with this PR we only consider a lane active when the corresponding vector lane (from which we converted) has its LSB set, and I see that a lot of tests generate "random" values in the mask input rather than just 0 or 1. I remember we changed it to have just 0 or 1. For example, below is a failure of Sve_Abs_double. However, when you look at the C# code of the test, we do populate the mask properly: runtime/src/tests/JIT/HardwareIntrinsics/Arm/Shared/_SveUnaryOpTestTemplate.template Line 164 in 5392448
This makes me wonder if there is a bug that needs to be fixed (as part of this PR) in the code that converts the pattern to a mask or vice versa. I will run a round of Fuzzlyn to double check that, and after that, the validation logic. @a74nh - I will let you take this forward.
/azp run Fuzzlyn
Azure Pipelines successfully started running 1 pipeline(s).
From the Fuzzlyn run, it looks like most of the functional errors are gone. The only failures I see now are a couple of asserts.
Looks like most (all?) of the failures here are float or double. That's because for float/double we can't just read the mask lane; we have to do something like:
I'm working through all of these...
I don't have permission to push to this PR. So, I'll most likely end up raising my own version. |
I've checked with other people whether there is a better sequence. |
// PMOV would be ideal here, but it is in SVE2.1.
// Instead, use a compare: CMPNE <Pd>.<T>, <Pg>/Z, <Zn>.<T>, #0
GetEmitter()->emitIns_R_R_R_I(ins, emitSize, targetReg, op1Reg, op2Reg, 0, opt);
Not quite sure this is correct and will likely break real world code.
The general consideration is that what we expose to the end user is a "full vector", so if they do something like Vector.ConditionalSelect(mask, x, y) then they are expecting this to behave exactly as (x & mask) | (y & ~mask), which is that it selects x when the corresponding bit in mask is set and y when it is clear.
The user isn't thinking about or considering the fact that on xarch the blendv instruction only checks the msb or any other similar nuance. They are writing a general purpose cross-platform API where the contract is "bitwise" from the original vector.
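As a minimal C# sketch of that contract (the values are arbitrary and purely illustrative; per the documented bitwise behaviour the two results should match):
using System;
using System.Runtime.Intrinsics;

class ConditionalSelectDemo
{
    static void Main()
    {
        // A "mask" that is neither Zero nor AllBitsSet per element.
        Vector128<int> mask = Vector128.Create(0x2, -1, 0, 0x0000FFFF);
        Vector128<int> x = Vector128.Create(0x11111111);
        Vector128<int> y = Vector128.Create(0x22222222);

        // The cross-platform contract is a bitwise select.
        Vector128<int> expected = (x & mask) | (y & ~mask);
        Vector128<int> actual = Vector128.ConditionalSelect(mask, x, y);

        // Element 0 (mask = 0x2) becomes 0x22222220: only bit 1 is taken from x,
        // every other bit comes from y. Nothing here looks at just the msb or lsb.
        Console.WriteLine(expected == actual);
        Console.WriteLine(actual.GetElement(0).ToString("X8"));
    }
}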
Correspondingly, we've documented that APIs which take a mask input are expecting allbitsset for that element to indicate true and zero for that element to indicate false. There isn't a consideration for them that the underlying hardware happens to have a conversion to/from the internal mask/predicate register representation that can optimize it a particular way.
The reason we document this is because all the hardware that doesn't have a dedicated masking register concept has its comparisons return AllBitsSet or Zero per element, because it's the most effective thing to use when dealing purely with vector data and it makes the experience consistent regardless of where you are or what you're doing.
Now, there are some dedicated instructions like Sse41.BlendVariable which explicitly take a vector in hardware and which explicitly document that they only consider the MSB of each element, but these are special and require direct usage. They aren't something we always transform something like Vector128.ConditionalSelect into, because that would result in a different semantic for unknown vectors (anything that wasn't AllBitsSet vs Zero per element).
I believe the same consideration exists here. We cannot implicitly have the internal ConvertVectorToMask helper behaving as if it is (x & 1) != 0, because it will break the rest of the logic flow and normalization considerations we've taken throughout the JIT and laid down for users.
More broadly, I'm not understanding what is meant by "the predicate registers conceptually only represents a bit (LSB) of the corresponding vector lanes"
The manual simply documents
Each SVE predicate register holds one bit for each byte of an SVE scalar vector register.
It then covers:
A predicate-as-mask encoding is divided into 1-bit, 2-bit, 4-bit, or 8-bit predicate elements. Each predicate element corresponds to a vector element in an SVE scalable vector register
...
If the lowest-numbered bit of a predicate element is 0, the value of the predicate element is FALSE.
If the lowest-numbered bit of a predicate element is 1, the value of the predicate element is TRUE.
Which is to say that for Vector128<byte> you get 16x 1-bit predicate values. For Vector128<ushort> you get 8x 2-bit predicate values, V128<uint> is 4x 4-bit, and V128<ulong> is 2x 8-bit.
For the multi-bit predicates, you get the consideration that only the lsb of the predicate element is actually checked. That is, for V128<ushort> you get the following, showing that only the lsb is relevant:
- 00 - false
- 01 - true
- 10 - false
- 11 - true
Which would indicate that the conversion being done here isn't really "correct" in the first place. The existing logic was already producing a correct predicate, and moreover a more useful one, because it can freely be reused for any smaller T. That is, because we only convert to 00 or 11 predicates, the V128<short> predicate is equally reusable as a predicate for V128<byte> since it remains a valid bytewise predicate for the same comparison. If you normalized this to only 00 vs 01, we'd then need to normalize it again for use with V128<byte> (and so on for 4-bit and 8-bit predicate elements).
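A tiny scalar model of that reuse argument (purely an illustrative model of the bit layout, not JIT code; the helper names are made up): a ushort predicate element normalized to 00/11 still reads correctly when the same bits are consumed as two 1-bit byte predicate elements, while a 00/01 encoding does not.
using System;

class PredicateWidthDemo
{
    // Illustrative model only: a predicate holds one bit per byte of the vector,
    // so a ushort lane owns 2 predicate bits and only the lowest one is tested.
    static bool UShortLaneActive(uint predicateBits, int lane) =>
        ((predicateBits >> (lane * 2)) & 1) != 0;

    static bool ByteLaneActive(uint predicateBits, int lane) =>
        ((predicateBits >> lane) & 1) != 0;

    static void Main()
    {
        // Normalized encoding: a true ushort lane is written as 0b11.
        uint normalized = 0b11_00;     // ushort lane1 = true, lane0 = false
        // Minimal encoding: a true ushort lane is written as 0b01.
        uint minimal    = 0b01_00;

        // As a ushort predicate both encodings mean the same thing...
        Console.WriteLine(UShortLaneActive(normalized, 1)); // True
        Console.WriteLine(UShortLaneActive(minimal, 1));    // True

        // ...but reused as a byte predicate, only the 0b11 form still says
        // "both bytes of that ushort lane are active".
        Console.WriteLine(ByteLaneActive(normalized, 3));   // True
        Console.WriteLine(ByteLaneActive(minimal, 3));      // False
    }
}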
Given that, I'd think any "incorrect" behavior would potentially be in ConvertMaskToVector where something could've produced a Mask for ushort elements that was 00/01/10/11 and where we'd want to produce a Vector that was still Zero/AllBitsSet/Zero/AllBitsSet for those respective values; but I believe the conversion codegen is already handling that nuance due to how it uses the mask as part of the selection.
So I don't think we have any consideration where a managed API is taking a Vector<T> where we aren't expecting it to be precisely Zero vs Non-Zero for the mask elements for conversion purposes and where we aren't requiring that a mask producing node normalize to Zero vs AllBitsSet per element.
We only have considerations for what internal nodes might produce when directly consumed, but those are never producing vectors where only the LSB is considered as far as I can tell. There is only the case that a mask might only set the lsb of the predicate element and where we should already be handling that nuance.
Consider an XOR where the result is then used as a mask:
ConvertVectorToMask(Xor(veca, vecb)):
veca = {100, 111, 101, 000}
vecb = {100, 100, 101, 000}
xor = {000, 011, 000, 000}
to mask = {0, 1, 0, 0}
By the rules we've set up, this should be exactly the same as first converting the inputs to a mask, and then XORing using the predicate XOR instruction. But today this gives a different result:
XorPredicate(ConvertVectorToMask(veca), ConvertVectorToMask(vecb)):
veca = {100, 111, 101, 000}
vecb = {100, 100, 101, 000}
maska = {1, 1, 1, 0}
maskb = {1, 1, 1, 0}
xor = {0, 0, 0, 0}
Using the LSB for masks, the results now match:
ConvertVectorToMask(Xor(veca, vecb)):
veca = {100, 111, 101, 000}
vecb = {100, 100, 101, 000}
xor = {000, 011, 000, 000}
to mask = {0, 1, 0, 0}
XorPredicate(ConvertVectorToMask(veca), ConvertVectorToMask(vecb)):
veca = {100, 111, 101, 000}
vecb = {100, 100, 101, 000}
maska = {0, 1, 1, 0}
maskb = {0, 0, 1, 0}
xor = {0, 1, 0, 0}
In simd.h the code was using MSB when converting from Vector to Mask. As of my #115566, it's now using LSB. Either way, neither of these options match the idea of using all true for a mask.
Previously we got away with this because we kept masks as TrueVector nodes and so none of the constant folding optimisations ever got triggered. As of #115566 those optimisations are getting triggered by Fuzzlyn when passing all constant vectors into ConditionalSelect and other such oddities.
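Here is the same worked example as a scalar C# model (illustrative only; lanes are modelled as plain ints and the conversion uses the "non-zero is true" rule):
using System;
using System.Linq;

class XorMaskOrderingDemo
{
    // Scalar model of a "non-zero is true" vector-to-mask conversion.
    static bool[] ToMask(int[] v) => v.Select(lane => lane != 0).ToArray();

    static void Main()
    {
        int[] veca = { 0b100, 0b111, 0b101, 0b000 };
        int[] vecb = { 0b100, 0b100, 0b101, 0b000 };

        // Xor the vectors first, then convert the result to a mask.
        bool[] xorThenConvert = ToMask(veca.Zip(vecb, (a, b) => a ^ b).ToArray());

        // Convert each vector to a mask first, then xor the predicates.
        bool[] convertThenXor = ToMask(veca).Zip(ToMask(vecb), (a, b) => a ^ b).ToArray();

        Console.WriteLine(string.Join(",", xorThenConvert)); // False,True,False,False
        Console.WriteLine(string.Join(",", convertThenXor)); // False,False,False,False
        // With an LSB-based conversion (lane & 1) both orders would give False,True,False,False.
    }
}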
By the rules we've set up, this should be exactly the same
This shouldn't be part of the existing rules/logic. Is there a place that is doing this in the Arm64 code already? If so, I think that's a bug and what needs to be fixed.
This is the same general issue of bool ^ bool where C# defines it as 0 (literal false) vs 1 (literal true) where-as IL defines it as 0 (literal false) vs not-0 (conceptual true) and so simply doing (x ^ y) == true can be distinct from (x ^ y) != false, because doing a literal xor of booleans can produce something that isn't the same as the literal true.
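A managed sketch of that bool nuance (illustrative; it relies on Unsafe to manufacture a bool whose underlying byte is neither 0 nor 1, and the exact output depends on the compiler/JIT not normalizing the value):
using System;
using System.Runtime.CompilerServices;

class BoolXorDemo
{
    static void Main()
    {
        byte rawA = 2;                                // non-zero, so conceptually "true" in IL
        byte rawB = 1;                                // the literal true
        bool a = Unsafe.As<byte, bool>(ref rawA);
        bool b = Unsafe.As<byte, bool>(ref rawB);

        bool result = a ^ b;                          // xor of the underlying bytes: 2 ^ 1 = 3
        Console.WriteLine(Unsafe.As<bool, byte>(ref result)); // typically prints 3

        // So a check written as "(result == true)" compares 3 against 1, while
        // "(result != false)" compares 3 against 0 -- the two can disagree.
    }
}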
For anything like Sve.CompareEqual we know that it produces a predicate result. This will actually only set the lsb of the respective "predicate element". However, we guarantee that to the end user if they end up viewing it as a vector it will only be AllBitsSet (predicate was true for that element) or Zero (predicate was false for that element) and so the CvtMaskToVector will do the right thing for that platform (on both xarch and arm64 this is that each element has 1 bit to check per element, the difference is how those bits are packed throughout the predicate register).
The user then doesn't have an API that lets them directly do XorPredicate, we only light that up via internal pattern recognition and we know the base type of a given mask, so we know whether two masks are compatible. This is correspondingly why the morph logic that produces XorMask requires it to explicitly be in the shape of Xor(CvtMaskToVector(mask), CvtMaskToVector(mask)) and requires the two masks to be for compatible SIMD sizes/base types, because otherwise it could produce a difference in behavior. -- In general it isn't safe to do predicate1Bit ^ predicate2Bit because they aren't logically compatible. There are some special scenarios involving constants where you can treat them as compatible due to being able to observe the bit patterns, but otherwise you have 1-bit per byte vs 1-bit per 2 bytes, so you'd get a potentially incorrect result whether it was consumed as a 1-bit predicate element or a 2-bit predicate element value.
The user then doesn't have an API that lets them directly do XorPredicate, we only light that up via internal pattern recognition and we know the base type of a given mask
As I understood it, the intention was that we would be able to optimise Xor to XorPredicate, otherwise we'd be missing SVE instructions. #101294 (comment) and other places. Recently #114438 added support for this, but this was then turned off with #115566. #116854 attempts to enable it again, but it's stuck behind this PR.
This is correspondingly why the morph logic that produces XorMask requires it to explicitly be in the shape of Xor(CvtMaskToVector(mask), CvtMaskToVector(mask)) and requires the two masks to be for compatible SIMD sizes/base types,
Right. It should be fairly simple to update #116854 to work like that.
There are also the approved SVE APIs for AndNot and OrNot. The instructions for these are predicate-only. Using all-bits masks, I'm not sure whether we can safely implement them or not. #116628 adds those APIs, but is stuck behind all the above work.
How should a constant vector be turned into a mask if the user passes it into an API that uses a mask? EvaluateSimdCvtVectorToMask() in simd.h used to check MSB, and now checks LSB. What bit or bits should be turned into true? Only allbitsset?
Given we're getting into July now, I'm quite keen to get this all fixed up ASAP.
As I understood it, the intention was that we would be able to optimise Xor to XorPredicate
Yes, but only where the optimization is valid. We cannot do it when it would change semantics for the managed user.
In practice this hasn't been an issue for any existing code paths. Developers are not taking a mask for Vector<byte> and combining it with a mask for Vector<int> in the wild. -- They notably wouldn't have been able to combine them properly using raw predicates either, just due to the two masks being fundamentally incompatible due to representing different state.
Right. It should be fairly simple to update #116854 to work like that.
This seems like a fairly expensive/verbose approach. The xarch logic is far simpler, https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L10009-L10147, and it catches all the important scenarios for very little expense.
The existing logic just looks for BitwiseOp(CvtMaskToVec(op1), CvtMaskToVec(op2)) and transforms it to BitwiseOpMask(op1, op2) if the two inputs are compatible. This takes a handful of lines of code, and Arm64 should reasonably be able to light up with the same logic, just by adding the relevant maskIntrinsicId selections for Arm64 SVE. -- It and other paths that are handling the x64 KMASK setup were intentionally set up so that Arm64 predicates could be plugged in trivially when ready.
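A toy C# model of that compatibility check (this is not the JIT code; the record and helper names are invented just to show the shape of the decision):
using System;

// BitwiseOp(CvtMaskToVec(op1), CvtMaskToVec(op2)) can only become
// BitwiseOpMask(op1, op2) when both masks describe the same layout.
record MaskNode(int SimdSize, Type BaseType);

class MaskMorphModel
{
    static bool CanFuseToMaskOp(MaskNode op1, MaskNode op2) =>
        op1.SimdSize == op2.SimdSize && op1.BaseType == op2.BaseType;

    static void Main()
    {
        var byteMask  = new MaskNode(16, typeof(byte));
        var shortMask = new MaskNode(16, typeof(short));

        Console.WriteLine(CanFuseToMaskOp(byteMask, byteMask));  // True: same predicate layout
        Console.WriteLine(CanFuseToMaskOp(byteMask, shortMask)); // False: 1-bit vs 2-bit predicate elements
    }
}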
There are also the approved SVE APIs for AndNot and OrNot. The instruction for these are predicates only. Using allbit masks I'm not sure if we can safely implement them or not
If they are only for predicates, they likely shouldn't be exposed. There's not really a need to expose them either, and I think the names are potentially confusing due to them being nand/nor -- We already have AndNot APIs and it is expected they behave like BIC does, which is x & ~y, where-as the predicate versions are doing ~(x & y) and ~(x | y) instead.
x64 has a similar predicate-only concept with xnor which does ~(x ^ y), but it not having a managed API isn't problematic because the user is more likely to write the natural pattern (which can also be optimized for other platforms or ISAs, like AdvSimd). We simply recognize NOT(XOR(CvtMaskToVec(x), CvtMaskToVec(y))) and translate it to XnorMask(x, y) -- The normalization to NOT(XOR(...)) also makes it easier to light up constant folding, comparison inversions, and other basic optimizations; so we intentionally decompose things like V128.AndNot(x, y) into AND(x, NOT(y)) instead of importing it directly as AND_NOT(x, y). We then construct the actual AND_NOT much later, such as in lowering, when we know that no other optimization work is needed and we just want to optimize the instruction selection.
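For example, the natural managed pattern for xnor is just the two bitwise ops; whether a target then folds it into a single mask instruction is up to the recognition described above (a sketch with arbitrary values):
using System;
using System.Runtime.Intrinsics;

class XnorPatternDemo
{
    // The "natural" cross-platform pattern. Where masks are involved, the JIT can
    // pattern-match NOT(XOR(...)) into a dedicated mask instruction when available;
    // on other targets it simply stays as two cheap bitwise ops.
    static Vector128<int> Xnor(Vector128<int> x, Vector128<int> y) => ~(x ^ y);

    static void Main()
    {
        var a = Vector128.Create(0b1100);
        var b = Vector128.Create(0b1010);
        Console.WriteLine(Xnor(a, b).GetElement(0)); // ~(0b1100 ^ 0b1010) = ~0b0110 = -7
    }
}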
How should a constant vector be turned into a mask if the user passes it into an API that uses a mask?
We basically have 4 options here:
- non-zero is true, zero is false - This most closely corresponds to how bool works
- all-bits-set is true, anything else is false
- msb set is true, msb clear is false - This most closely corresponds to how vpmov*2m works on xarch
- lsb set is true, lsb clear is false - This most closely corresponds to how pmov (to predicate) works on arm64
The intended behavior is the first: non-zero is true, zero is false. This is the intended behavior because it corresponds to how bool works already, is the cheapest to emulate for unknown inputs, lets us trivially optimize for both xarch and arm64 for known inputs, and allows cross platform determinism (including simplistic mapping to hardware without specialized predicate/masking support).
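As a sketch of that first option (illustrative only), normalizing an arbitrary vector into the canonical mask vector under "non-zero is true" can be modelled as:
using System;
using System.Runtime.Intrinsics;

class MaskNormalizationDemo
{
    // A sketch of the "non-zero is true" rule: normalize an arbitrary vector into a
    // canonical mask vector (AllBitsSet for true lanes, Zero for false lanes).
    static Vector128<int> NormalizeToMask(Vector128<int> value) =>
        ~Vector128.Equals(value, Vector128<int>.Zero);

    static void Main()
    {
        // 0x2 has its LSB clear, but it is non-zero, so it is a "true" lane under this rule.
        var v = Vector128.Create(0x2, 0, -1, 7);
        Console.WriteLine(NormalizeToMask(v)); // <-1, 0, -1, -1>
    }
}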
In practice the need to use this conversion should be rare, because users should not be taking arbitrary (especially non-constant) vectors and trying to use them as inputs into scenarios that expect predicates/masks. Most typically code is using the output of some Compare or similar instruction which directly produced a mask, potentially doing some minor bitwise operations on it to combine it with other masks.
In practice the need to use this conversion should be rare, because users should not be taking arbitrary (especially non-constant) vectors and trying to use them as inputs into scenarios that expect predicates/masks.
Agreed, but sadly Fuzzlyn doesn't agree - currently this is where all the failures are coming from, with code I doubt we'd ever see in the wild.
I'll get this fixed up this week. I'll draft a new PR to fix the mask handling. And then when combined with #116852 that should fix all the current issues.
currently this is where all the failures are coming from, with code I doubt we'd ever see in the wild.
Right. It's important we still handle it correctly and it's great Fuzzlyn is catching them. Just it's not critical for it to be "peak efficiency" for codegen since it's not likely going to happen in the real world.
Rather we want to ensure that morph/folding/lowering catch the typical real world cases and get the good codegen there. From what I've seen of the existing x64 support (which has been in production a couple years now), we're getting most of the right stuff already and most of that is setup for Arm64 to hook in for the same benefits.
I've put the required changes to EvaluateSimdCvtVectorToMask() in #116852.
This PR should probably be closed now.
Today, when converting a vector to a mask, we just check if the lane value is non-zero and, if yes, we consider the lane to be active and mark it as such in the predicate register:
This actually is not true as per the Arm manual: the predicate registers conceptually only represent a bit (the LSB) of the corresponding vector lanes. As such, a vector lane with value 0x2 will be considered "active" when we convert to a predicate, but ideally should be "inactive", because the LSB of that value is 0. This was not a problem until now because we were not doing any major optimization around constant masks. However, with #115566, we have started seeing problems where the codegen we produce does not match how the constant masks are converted to a pattern internally during optimization. During optimization, we should only check the LSB and accordingly set the corresponding lane as active or inactive. This was also called out last year in the Engineering SVE blog post.
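For illustration, here is a managed sketch of the LSB rule described above (this models the per-lane semantics only; it is not the emitted SVE codegen):
using System;
using System.Runtime.Intrinsics;

class LsbMaskRuleDemo
{
    // A lane is active only if its lowest bit is set.
    static Vector128<int> LsbToMask(Vector128<int> value) =>
        ~Vector128.Equals(value & Vector128.Create(1), Vector128<int>.Zero);

    static void Main()
    {
        var v = Vector128.Create(0x2, 0x1, 0x0, 0x3);
        // 0x2 is non-zero but its LSB is 0, so under this rule its lane is inactive.
        Console.WriteLine(LsbToMask(v)); // <0, -1, 0, -1>
    }
}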
I have updated the codegen to now do something like:
This will increase the number of instructions needed to convert a vector to a mask from 2 to 3. Hence, I have also updated the heuristics to take that into account.