Optional SIMD str(c)spn #597

Open · ncruces wants to merge 1 commit into base: main

Conversation

@ncruces (Contributor) commented Jul 17, 2025

Continuing #580, this implements strspn and strcspn.

This one follows the same general structure as #586, #592 and #594, but uses a somewhat more complicated algorithm, described here.

I used Geoff Langdale's alternative implementation (the tweet has since disappeared), which is correctly described there but has a subtle bug in the implementation: WojciechMula/simd-byte-lookup#2

Since the __wasm_v128_bitmap256_t machinery is shared by both strspn and strcspn, I moved the implementation to a common file, used when SIMD is enabled.

The tests follow the same structure as the previous ones and cover the bug, which I found through fuzzing.
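
For readers new to the trick, here is a scalar model of the 256-bit bitmap the SIMD code queries 16 bytes at a time (a sketch for exposition only; the names are hypothetical, not the PR's):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
  uint8_t lo[16]; // bit h of lo[l]: byte (h << 4 | l), high nibble h in 0..7
  uint8_t hi[16]; // bit h-8 of hi[l]: byte (h << 4 | l), high nibble h in 8..15
} bitmap256;

static void bitmap256_set(bitmap256 *b, uint8_t c) {
  uint8_t lo_nibble = c & 0x0f, hi_nibble = c >> 4;
  if (hi_nibble < 8)
    b->lo[lo_nibble] |= (uint8_t)(1u << hi_nibble);
  else
    b->hi[lo_nibble] |= (uint8_t)(1u << (hi_nibble - 8));
}

static bool bitmap256_test(const bitmap256 *b, uint8_t c) {
  uint8_t lo_nibble = c & 0x0f, hi_nibble = c >> 4;
  uint8_t row = hi_nibble < 8 ? b->lo[lo_nibble] : b->hi[lo_nibble];
  return row & (uint8_t)(1u << (hi_nibble & 7));
}

The SIMD code answers bitmap256_test for 16 input bytes at once, swizzling the two 16-byte halves by the low nibbles.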

Comment on lines +33 to +34
v128_t bitmask_lookup = wasm_u8x16_const(1, 2, 4, 8, 16, 32, 64, 128, //
1, 2, 4, 8, 16, 32, 64, 128);
Collaborator

These are both -128 in the original algorithm, no?

@ncruces (Contributor Author) commented Jul 22, 2025

Yes. In two's complement, int8(-128) has the same bit pattern as uint8(128). When interpreting the algorithm I found unsigned clearer, so I used that.

I guess Intel lacks an unsigned version of the _mm_setr_epi8 intrinsic (which would map to the same instruction anyway…)?

What do you think is best, a comment explaining this, or wasm_i8x16_const?
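
(For illustration, an editorial aside not in the PR: the two spellings denote the same vector.)

#include <wasm_simd128.h>

// int8_t -128 and uint8_t 128 share the bit pattern 0x80,
// so a and b are the same 128 bits.
static int constants_match(void) {
  v128_t a = wasm_u8x16_const(1, 2, 4, 8, 16, 32, 64, 128, //
                              1, 2, 4, 8, 16, 32, 64, 128);
  v128_t b = wasm_i8x16_const(1, 2, 4, 8, 16, 32, 64, -128, //
                              1, 2, 4, 8, 16, 32, 64, -128);
  return wasm_i8x16_all_true(wasm_i8x16_eq(a, b)); // always 1
}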


for (; *c; c++) {
  // Terminator IS NOT on the bitmap.
  __wasm_v128_setbit(&bitmap, *c);
Collaborator

Just a note for future reference: I was initially a bit concerned here that we would incur startup costs too heavy for the "check a small string" use case. But of course it's better to loop over c once up front than at each character of s, as the scalar version does.

@ncruces (Contributor Author)

The scalar version does the same: it iterates over c once, building a bitmap:
https://github.com/WebAssembly/wasi-libc/blob/main/libc-top-half/musl/src/string/strspn.c

They use some "inscrutable" (but well-known) macros to build a more straightforward bitmap in stack memory.
I used this function to build our weird 256-bit bitmap "directly" into a pair of v128 vectors.
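
Paraphrased from the linked strspn.c (a sketch close to musl's code, trimmed for exposition):

#include <stddef.h>

// A 256-bit bitmap in stack memory, one machine word at a time.
#define BITOP(a, b, op) \
  ((a)[(size_t)(b) / (8 * sizeof *(a))] op \
   (size_t)1 << ((size_t)(b) % (8 * sizeof *(a))))

size_t strspn(const char *s, const char *c) {
  const char *a = s;
  size_t byteset[32 / sizeof(size_t)] = {0};
  if (!c[0]) return 0;
  if (!c[1]) { // single-byte set: plain scan, no bitmap needed
    for (; *s == *c; s++);
    return s - a;
  }
  // One pass over c to build the bitmap...
  for (; *c && BITOP(byteset, *(unsigned char *)c, |=); c++);
  // ...then one pass over s to consume it.
  for (; *s && BITOP(byteset, *(unsigned char *)s, &); s++);
  return s - a;
}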

Comment on lines +107 to +120
if (!wasm_i8x16_all_true(cmp)) {
  // Clear the bits corresponding to align (little-endian)
  // so we can count trailing zeros.
  int mask = (uint16_t)~wasm_i8x16_bitmask(cmp) >> align << align;
  // At least one bit will be set, unless align cleared them.
  // Knowing this helps the compiler if it unrolls the loop.
  __builtin_assume(mask || align);
  // If the mask became zero because of align,
  // it's as if we didn't find anything.
  if (mask) {
    // Find the offset of the first one bit (little-endian).
    return addr - (uintptr_t)s + __builtin_ctz(mask);
  }
}
Collaborator

We keep on using this pattern; is the logic repetitive enough that we should define a helper function somewhere?

@ncruces (Contributor Author)

Yeah, but it's always subtly different.

Sometimes it's (1) cmp, any_true and bitmask; other times (2) cmp, all_true and (uint16_t)~bitmask.

Then, strlen gets away with doing !all_true before the cmp and bitmask; memrchr does clz instead of ctz.

I also don't see how to get those two main patterns to inline without resorting to macros, and inlining is important. That __builtin_assume makes the compiler realize it's worth unrolling the first iteration, which makes the subsequent ones significantly simpler.
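
For concreteness, hypothetical helpers showing the two main patterns (assuming a 16-byte chunk already loaded; not code from this PR):

#include <wasm_simd128.h>

// Pattern (1): index of the first lane equal to byte b, or -1.
static int first_match(v128_t chunk, uint8_t b) {
  v128_t eq = wasm_i8x16_eq(chunk, wasm_u8x16_splat(b));
  if (wasm_v128_any_true(eq))
    return __builtin_ctz(wasm_i8x16_bitmask(eq));
  return -1;
}

// Pattern (2): index of the first lane that FAILED a test, or -1;
// each lane of ok is all-ones if that byte passed.
static int first_mismatch(v128_t ok) {
  if (!wasm_i8x16_all_true(ok))
    return __builtin_ctz((uint16_t)~wasm_i8x16_bitmask(ok));
  return -1;
}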

Comment on lines +26 to +27
bitmap->lo[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 0));
bitmap->hi[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 8));
Collaborator

I'm interested in understanding the codegen of this: so the ...[lo_nibble] |= is generating some i8x16.replace_lane but somehow also OR-ing the high nibble bits? What is emitted by LLVM here?

@ncruces (Contributor Author)

LLVM cheats and uses the stack; wasm-opt (and wasm-ctor-eval) then removes any trace of $__stack_pointer in this build:
https://github.com/ncruces/go-sqlite3/blob/b72fd5db/sqlite3/libc/libc.wat#L1646-L1726

} __wasm_v128_bitmap256_t;

__attribute__((always_inline))
static void __wasm_v128_setbit(__wasm_v128_bitmap256_t *bitmap, int i) {
Collaborator

Why not the following?

Suggested change:
-static void __wasm_v128_setbit(__wasm_v128_bitmap256_t *bitmap, int i) {
+static void __wasm_v128_setbit(__wasm_v128_bitmap256_t *bitmap, uint8_t i) {

@ncruces (Contributor Author)

Sure. It moves the cast out to the call site in the loop; best not to rely on the implicit char to uint8_t conversion.

Arguably, this bit would be more correct with unsigned char and unsigned.
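
(An editorial aside on why the explicit cast matters, assuming char is signed, as it is for clang's wasm targets:)

#include <stdint.h>

// With a signed char, bytes >= 0x80 are negative; passed to an int
// parameter they make a bogus shift/index. The cast recovers 0..255.
static void cast_example(void) {
  char c = (char)0xAB;          // -85 where char is signed
  int as_int = c;               // -85: wrong as a bit index
  uint8_t as_u8 = (uint8_t)c;   // 171: the intended value
  (void)as_int; (void)as_u8;
}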

@ncruces (Contributor Author) commented Jul 22, 2025

Just to document the decision to use Geoff Langdale's variant of the algorithm:

It trades (in the original by Wojciech Muła):

  • 1 wasm_v128_bitselect or wasm_i8x16_relaxed_laneselect
  • 2 wasm_i8x16_swizzle or wasm_i8x16_relaxed_swizzle
  • 1 wasm_i8x16_lt
  • 1 wasm_v128_and

for (in the variant with Geoff Langdale's input):

  • 2 wasm_i8x16_swizzle
  • 1 wasm_v128_xor
  • 1 wasm_v128_or
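
For reference, my reconstruction of the variant's membership test in wasm intrinsics (a sketch under my reading of the algorithm, with hypothetical names; not necessarily this PR's exact code):

#include <wasm_simd128.h>

// Lane i of the result is all-ones iff byte i of v is in the 256-bit
// set held in (bitmap_lo, bitmap_hi): row = low nibble, bit = high nibble.
static v128_t bitmap256_lookup(v128_t bitmap_lo, v128_t bitmap_hi, v128_t v) {
  v128_t bitmask_lookup = wasm_u8x16_const(1, 2, 4, 8, 16, 32, 64, 128, //
                                           1, 2, 4, 8, 16, 32, 64, 128);
  // Row index: the low nibble, keeping the top bit to knock the lane
  // out of swizzle range (indices >= 16 select zero).
  v128_t idx_lo = wasm_v128_and(v, wasm_u8x16_splat(0x8f));
  v128_t idx_hi = wasm_v128_xor(idx_lo, wasm_u8x16_splat(0x80));
  // Each lane is live in exactly one of the two swizzles; OR merges them.
  v128_t row = wasm_v128_or(wasm_i8x16_swizzle(bitmap_lo, idx_lo),
                            wasm_i8x16_swizzle(bitmap_hi, idx_hi));
  // Single bit within the row, selected by the high nibble.
  v128_t bit = wasm_i8x16_swizzle(bitmask_lookup, wasm_u8x16_shr(v, 4));
  return wasm_i8x16_eq(wasm_v128_and(row, bit), bit);
}

The XOR/OR pair replaces the original's signed compare and bitselect: out-of-range swizzle lanes come back as zero, which the OR then ignores.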
