-
Couldn't load subscription status.
- Fork 5.2k
Description
Per discussion with @tannergooding on Discord
Vector128.ExtractMostSignificantBits is quite an important API that is typically used together with comparisons and TrailingZeroCount/LeadingZeroCount to detect positions of an element in a vector - typically used in various IndexOf-like algorithms, etc. Example:
static void PrintPostion()
{
Vector128<byte> src = Vector128.Create((byte)0, 0, 0, 42, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
Vector128<byte> val = Vector128.Create((byte)42);
// prints 3 as the index of 42 is "3" in src vector
Console.WriteLine(FirstMatch(src, val));
}
static int FirstMatch(Vector128<byte> src, Vector128<byte> val)
{
Vector128<byte> eq = Vector128.Equals(src, val);
return BitOperations.TrailingZeroCount(eq.ExtractMostSignificantBits());
}Codegen for FirstMatch on x64:
vpcmpeqb xmm0, xmm0, xmm1
vpmovmskb eax, xmm0
tzcnt eax, eaxCodegen for FirstMatch on arm64:
cmeq v16.16b, v0.16b, v1.16b
ldr q17, [@RWD00]
and v16.16b, v16.16b, v17.16b
ldr q17, [@RWD16]
ushl v16.16b, v16.16b, v17.16b
movi v17.4s, #0
ext v17.16b, v16.16b, v17.16b, #8
addv b17, v17.8b
umov w0, v17.b[0]
lsl w0, w0, #8
addv b16, v16.8b
umov w1, v16.b[0]
orr w0, w0, w1
rbit w0, w0
clz w0, w0Because arm64 doesn't have a direct equivalent of movmsk. However, this particular case can be optimized because we know that input of ExtractMostSignificantBits is a comparison's result with all elements being either zero or all-bits-set, in the best case we can perform this smart trick from arm blog by @danlark1
cmeq v16.16b, v0.16b, v1.16b
shrn v16.8b, v16.8h, #4
umov x0, v16.d[0]
rbit x0, x0
clz x0, x0
asr w0, w0, #2its C# equivalent:
static int FirstMatch(Vector128<byte> src, Vector128<byte> val)
{
Vector128<byte> eq = Vector128.Equals(vector, val);
ulong matches = AdvSimd.ShiftRightLogicalNarrowingLower(src.AsUInt16(), 4).AsUInt64().ToScalar();
return BitOperations.TrailingZeroCount(matches) >> 2;
}Performance impact
We expect a nice improvement for small inputs like what http parsing typically sees where we have to find positions of symbols like :, \n etc in relatively small inputs. For large inputs in most of our algorithms we use fast compare == Vector128<>.Zero checks to ignore chunks without matches.
category:cq
theme:vector-codegen
skill-level:intermediate
cost:small
impact:small