Improve Vector128.ExtractMostSignificantBits for arm64

_Per discussion with @tannergooding on Discord_

`Vector128.ExtractMostSignificantBits` is quite an important API that is typically used together with comparisons and `TrailingZeroCount/LeadingZeroCount` to detect positions of an element in a vector - typically used in various `IndexOf`-like algorithms, etc. Example:

```csharp
static void PrintPostion()
{
    Vector128<byte> src = Vector128.Create((byte)0, 0, 0, 42, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    Vector128<byte> val = Vector128.Create((byte)42);

    // prints 3 as the index of 42 is "3" in src vector
    Console.WriteLine(FirstMatch(src, val)); 
}

static int FirstMatch(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    return BitOperations.TrailingZeroCount(eq.ExtractMostSignificantBits());
}
```
Codegen for `FirstMatch` on x64:
```asm
       vpcmpeqb  xmm0, xmm0, xmm1
       vpmovmskb eax, xmm0
       tzcnt     eax, eax
```
Codegen for `FirstMatch` on arm64:
```asm
            cmeq    v16.16b, v0.16b, v1.16b
            ldr     q17, [@RWD00]
            and     v16.16b, v16.16b, v17.16b
            ldr     q17, [@RWD16]
            ushl    v16.16b, v16.16b, v17.16b
            movi    v17.4s, #0
            ext     v17.16b, v16.16b, v17.16b, #8
            addv    b17, v17.8b
            umov    w0, v17.b[0]
            lsl     w0, w0, #8
            addv    b16, v16.8b
            umov    w1, v16.b[0]
            orr     w0, w0, w1
            rbit    w0, w0
            clz     w0, w0
```
Because arm64 doesn't have a direct equivalent of `movmsk`. However, this particular case can be optimized because we know that input of `ExtractMostSignificantBits` is a comparison's result with all elements being either zero or all-bits-set, in the best case we can perform this smart trick from [arm blog](https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon) by @danlark1
```asm
            cmeq    v16.16b, v0.16b, v1.16b
            shrn    v16.8b, v16.8h, #4
            umov    x0, v16.d[0]
            rbit    x0, x0
            clz     x0, x0
            asr     w0, w0, #2
```
its C# equivalent:
```csharp
static int FirstMatch(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(vector, val);
    ulong matches = AdvSimd.ShiftRightLogicalNarrowingLower(src.AsUInt16(), 4).AsUInt64().ToScalar();
    return BitOperations.TrailingZeroCount(matches) >> 2;
}
```

### Performance impact

We expect a nice improvement for small inputs like what http parsing typically sees where we have to find positions of symbols like `:`, `\n` etc in relatively small inputs. For large inputs in most of our algorithms we use fast `compare == Vector128<>.Zero` checks to ignore chunks without matches.

category:cq
theme:vector-codegen
skill-level:intermediate
cost:small
impact:small

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Improve Vector128.ExtractMostSignificantBits for arm64 #76047

Performance impact

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Improve Vector128.ExtractMostSignificantBits for arm64 #76047

Description

Performance impact

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions