Optional SIMD str(c)spn #597

Open · ncruces wants to merge 1 commit into base: main

Conversation

@ncruces (Contributor) commented Jul 17, 2025

Continuing #580, this implements strspn and strcspn.

This one follows the same general structure as #586, #592 and #594, but uses a somewhat more complicated algorithm, described here.

I used Geoff Langdale's alternative implementation (the tweet has since disappeared), which is correctly described there but has a subtle bug in the implementation: WojciechMula/simd-byte-lookup#2

Since the __wasm_v128_bitmap256_t machinery is shared by both strspn and strcspn, I moved the implementation to a common file, used when SIMD is enabled.

The tests follow the same structure as the previous ones and cover the bug, which I found through fuzzing.
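
For readers new to the trick, here is a scalar model of the 256-bit bitmap the SIMD code queries 16 bytes at a time (a sketch for exposition only; the names are hypothetical, not the PR's):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
  uint8_t lo[16]; // bit h of lo[l]: byte (h << 4 | l), high nibble h in 0..7
  uint8_t hi[16]; // bit h-8 of hi[l]: byte (h << 4 | l), high nibble h in 8..15
} bitmap256;

static void bitmap256_set(bitmap256 *b, uint8_t c) {
  uint8_t lo_nibble = c & 0x0f, hi_nibble = c >> 4;
  if (hi_nibble < 8)
    b->lo[lo_nibble] |= (uint8_t)(1u << hi_nibble);
  else
    b->hi[lo_nibble] |= (uint8_t)(1u << (hi_nibble - 8));
}

static bool bitmap256_test(const bitmap256 *b, uint8_t c) {
  uint8_t lo_nibble = c & 0x0f, hi_nibble = c >> 4;
  uint8_t row = hi_nibble < 8 ? b->lo[lo_nibble] : b->hi[lo_nibble];
  return row & (uint8_t)(1u << (hi_nibble & 7));
}

The SIMD code answers bitmap256_test for 16 input bytes at once, swizzling the two 16-byte halves by the low nibbles.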

Comment on lines +33 to +34
v128_t bitmask_lookup = wasm_u8x16_const(1, 2, 4, 8, 16, 32, 64, 128, //
1, 2, 4, 8, 16, 32, 64, 128);
Collaborator

These are both -128 in the original algorithm, no?

@ncruces (Contributor Author) commented Jul 22, 2025

Yes. In two's complement, int8(-128) has the same bit pattern as uint8(128). When interpreting the algorithm I found unsigned clearer, so I used that.

I guess Intel lacks an unsigned version of the _mm_setr_epi8 intrinsic (which would map to the same instruction anyway…)?

What do you think is best, a comment explaining this, or wasm_i8x16_const?
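
(For illustration, an editorial aside not in the PR: the two spellings denote the same vector.)

#include <wasm_simd128.h>

// int8_t -128 and uint8_t 128 share the bit pattern 0x80,
// so a and b are the same 128 bits.
static int constants_match(void) {
  v128_t a = wasm_u8x16_const(1, 2, 4, 8, 16, 32, 64, 128, //
                              1, 2, 4, 8, 16, 32, 64, 128);
  v128_t b = wasm_i8x16_const(1, 2, 4, 8, 16, 32, 64, -128, //
                              1, 2, 4, 8, 16, 32, 64, -128);
  return wasm_i8x16_all_true(wasm_i8x16_eq(a, b)); // always 1
}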


for (; *c; c++) {
  // Terminator IS NOT on the bitmap.
  __wasm_v128_setbit(&bitmap, *c);
Collaborator

Just a note for future reference: I was initially a bit concerned here that we would incur startup costs too heavy for the "check a small string" use case. But of course it's better to loop over c once up front than at each character of s, as the scalar version does.

@ncruces (Contributor Author)

The scalar version does the same: it iterates over c once, building a bitmap:
https://github.com/WebAssembly/wasi-libc/blob/main/libc-top-half/musl/src/string/strspn.c

They use some "inscrutable" (but well-known) macros to build a more straightforward bitmap in stack memory.
I used this function to build our weird 256-bit bitmap "directly" into a pair of v128 vectors.
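
Paraphrased from the linked strspn.c (a sketch close to musl's code, trimmed for exposition):

#include <stddef.h>

// A 256-bit bitmap in stack memory, one machine word at a time.
#define BITOP(a, b, op) \
  ((a)[(size_t)(b) / (8 * sizeof *(a))] op \
   (size_t)1 << ((size_t)(b) % (8 * sizeof *(a))))

size_t strspn(const char *s, const char *c) {
  const char *a = s;
  size_t byteset[32 / sizeof(size_t)] = {0};
  if (!c[0]) return 0;
  if (!c[1]) { // single-byte set: plain scan, no bitmap needed
    for (; *s == *c; s++);
    return s - a;
  }
  // One pass over c to build the bitmap...
  for (; *c && BITOP(byteset, *(unsigned char *)c, |=); c++);
  // ...then one pass over s to consume it.
  for (; *s && BITOP(byteset, *(unsigned char *)s, &); s++);
  return s - a;
}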

Comment on lines +107 to +120
if (!wasm_i8x16_all_true(cmp)) {
  // Clear the bits corresponding to align (little-endian)
  // so we can count trailing zeros.
  int mask = (uint16_t)~wasm_i8x16_bitmask(cmp) >> align << align;
  // At least one bit will be set, unless align cleared them.
  // Knowing this helps the compiler if it unrolls the loop.
  __builtin_assume(mask || align);
  // If the mask became zero because of align,
  // it's as if we didn't find anything.
  if (mask) {
    // Find the offset of the first one bit (little-endian).
    return addr - (uintptr_t)s + __builtin_ctz(mask);
  }
}
Collaborator

We keep on using this pattern; is the logic repetitive enough that we should define a helper function somewhere?

@ncruces (Contributor Author)

Yeah, but it's always subtly different.

Sometimes it's (1) cmp, any_true and bitmask; other times (2) cmp, all_true and (uint16_t)~bitmask.

Then, strlen gets away with doing !all_true before the cmp and bitmask; memrchr does clz instead of ctz.

I also don't see how to get those two main patterns to inline without resorting to macros, and inlining is important. That __builtin_assume makes the compiler realize it's worth unrolling the first iteration, which makes the subsequent ones significantly simpler.
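
For concreteness, hypothetical helpers showing the two main patterns (assuming a 16-byte chunk already loaded; not code from this PR):

#include <wasm_simd128.h>

// Pattern (1): index of the first lane equal to byte b, or -1.
static int first_match(v128_t chunk, uint8_t b) {
  v128_t eq = wasm_i8x16_eq(chunk, wasm_u8x16_splat(b));
  if (wasm_v128_any_true(eq))
    return __builtin_ctz(wasm_i8x16_bitmask(eq));
  return -1;
}

// Pattern (2): index of the first lane that FAILED a test, or -1;
// each lane of ok is all-ones if that byte passed.
static int first_mismatch(v128_t ok) {
  if (!wasm_i8x16_all_true(ok))
    return __builtin_ctz((uint16_t)~wasm_i8x16_bitmask(ok));
  return -1;
}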

Comment on lines +26 to +27
bitmap->lo[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 0));
bitmap->hi[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 8));
Collaborator

I'm interested in understanding the codegen of this: so the ...[lo_nibble] |= is generating some i8x16.replace_lane but somehow also OR-ing the high nibble bits? What is emitted by LLVM here?

@ncruces (Contributor Author)

LLVM cheats and uses the stack; wasm-opt (and wasm-ctor-eval) then removes any trace of $__stack_pointer in this build:
https://github.com/ncruces/go-sqlite3/blob/b72fd5db/sqlite3/libc/libc.wat#L1646-L1726

} __wasm_v128_bitmap256_t;

__attribute__((always_inline))
static void __wasm_v128_setbit(__wasm_v128_bitmap256_t *bitmap, int i) {
Collaborator

Why not the following?

Suggested change:
-static void __wasm_v128_setbit(__wasm_v128_bitmap256_t *bitmap, int i) {
+static void __wasm_v128_setbit(__wasm_v128_bitmap256_t *bitmap, uint8_t i) {

@ncruces (Contributor Author)

Sure. It moves the cast out to the call site in the loop; best not to rely on the implicit char to uint8_t conversion.

Arguably, this bit would be more correct with unsigned char and unsigned.
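
(An editorial aside on why the explicit cast matters, assuming char is signed, as it is for clang's wasm targets:)

#include <stdint.h>

// With a signed char, bytes >= 0x80 are negative; passed to an int
// parameter they make a bogus shift/index. The cast recovers 0..255.
static void cast_example(void) {
  char c = (char)0xAB;          // -85 where char is signed
  int as_int = c;               // -85: wrong as a bit index
  uint8_t as_u8 = (uint8_t)c;   // 171: the intended value
  (void)as_int; (void)as_u8;
}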

@ncruces (Contributor Author) commented Jul 22, 2025

Just to document the decision to use Geoff Langdale's variant of the algorithm:

It trades (in the original by Wojciech Muła):

  • 1 wasm_v128_bitselect or wasm_i8x16_relaxed_laneselect
  • 2 wasm_i8x16_swizzle or wasm_i8x16_relaxed_swizzle
  • 1 wasm_i8x16_lt
  • 1 wasm_v128_and

for (in the variant with Geoff Langdale's input):

  • 2 wasm_i8x16_swizzle
  • 1 wasm_v128_xor
  • 1 wasm_v128_or
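
For reference, my reconstruction of the variant's membership test in wasm intrinsics (a sketch under my reading of the algorithm, with hypothetical names; not necessarily this PR's exact code):

#include <wasm_simd128.h>

// Lane i of the result is all-ones iff byte i of v is in the 256-bit
// set held in (bitmap_lo, bitmap_hi): row = low nibble, bit = high nibble.
static v128_t bitmap256_lookup(v128_t bitmap_lo, v128_t bitmap_hi, v128_t v) {
  v128_t bitmask_lookup = wasm_u8x16_const(1, 2, 4, 8, 16, 32, 64, 128, //
                                           1, 2, 4, 8, 16, 32, 64, 128);
  // Row index: the low nibble, keeping the top bit to knock the lane
  // out of swizzle range (indices >= 16 select zero).
  v128_t idx_lo = wasm_v128_and(v, wasm_u8x16_splat(0x8f));
  v128_t idx_hi = wasm_v128_xor(idx_lo, wasm_u8x16_splat(0x80));
  // Each lane is live in exactly one of the two swizzles; OR merges them.
  v128_t row = wasm_v128_or(wasm_i8x16_swizzle(bitmap_lo, idx_lo),
                            wasm_i8x16_swizzle(bitmap_hi, idx_hi));
  // Single bit within the row, selected by the high nibble.
  v128_t bit = wasm_i8x16_swizzle(bitmask_lookup, wasm_u8x16_shr(v, 4));
  return wasm_i8x16_eq(wasm_v128_and(row, bit), bit);
}

The XOR/OR pair replaces the original's signed compare and bitselect: out-of-range swizzle lanes come back as zero, which the OR then ignores.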
