Skip to content

Improve Vector128.ExtractMostSignificantBits for arm64 #76047

@EgorBo

Description

@EgorBo

Per discussion with @tannergooding on Discord

Vector128.ExtractMostSignificantBits is quite an important API that is typically used together with comparisons and TrailingZeroCount/LeadingZeroCount to detect positions of an element in a vector - typically used in various IndexOf-like algorithms, etc. Example:

static void PrintPostion()
{
    Vector128<byte> src = Vector128.Create((byte)0, 0, 0, 42, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    Vector128<byte> val = Vector128.Create((byte)42);

    // prints 3 as the index of 42 is "3" in src vector
    Console.WriteLine(FirstMatch(src, val)); 
}

static int FirstMatch(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    return BitOperations.TrailingZeroCount(eq.ExtractMostSignificantBits());
}

Codegen for FirstMatch on x64:

       vpcmpeqb  xmm0, xmm0, xmm1
       vpmovmskb eax, xmm0
       tzcnt     eax, eax

Codegen for FirstMatch on arm64:

            cmeq    v16.16b, v0.16b, v1.16b
            ldr     q17, [@RWD00]
            and     v16.16b, v16.16b, v17.16b
            ldr     q17, [@RWD16]
            ushl    v16.16b, v16.16b, v17.16b
            movi    v17.4s, #0
            ext     v17.16b, v16.16b, v17.16b, #8
            addv    b17, v17.8b
            umov    w0, v17.b[0]
            lsl     w0, w0, #8
            addv    b16, v16.8b
            umov    w1, v16.b[0]
            orr     w0, w0, w1
            rbit    w0, w0
            clz     w0, w0

Because arm64 doesn't have a direct equivalent of movmsk. However, this particular case can be optimized because we know that input of ExtractMostSignificantBits is a comparison's result with all elements being either zero or all-bits-set, in the best case we can perform this smart trick from arm blog by @danlark1

            cmeq    v16.16b, v0.16b, v1.16b
            shrn    v16.8b, v16.8h, #4
            umov    x0, v16.d[0]
            rbit    x0, x0
            clz     x0, x0
            asr     w0, w0, #2

its C# equivalent:

static int FirstMatch(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(vector, val);
    ulong matches = AdvSimd.ShiftRightLogicalNarrowingLower(src.AsUInt16(), 4).AsUInt64().ToScalar();
    return BitOperations.TrailingZeroCount(matches) >> 2;
}

Performance impact

We expect a nice improvement for small inputs like what http parsing typically sees where we have to find positions of symbols like :, \n etc in relatively small inputs. For large inputs in most of our algorithms we use fast compare == Vector128<>.Zero checks to ignore chunks without matches.

category:cq
theme:vector-codegen
skill-level:intermediate
cost:small
impact:small

Metadata

Metadata

Assignees

Labels

area-CodeGen-coreclrCLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions