Optional SIMD str(c)spn #597
Conversation
v128_t bitmask_lookup = wasm_u8x16_const(1, 2, 4, 8, 16, 32, 64, 128,
                                         1, 2, 4, 8, 16, 32, 64, 128);
These are both -128 in the original algorithm, no?
Yes. In two's complement, int8(-128) is equal to uint8(128). In interpreting the algorithm, I found unsigned clearer, so used that. I guess Intel lacks the unsigned version of the _mm_setr_epi8 intrinsic (which would map to the same instruction anyway…)? What do you think is best: a comment explaining this, or wasm_i8x16_const?
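For illustration, the two spellings produce the identical v128.const instruction (a sketch; neither line is verbatim from the PR):

```c
#include <wasm_simd128.h>

// Unsigned spelling, as in the PR.
v128_t a = wasm_u8x16_const(1, 2, 4, 8, 16, 32, 64, 128,
                            1, 2, 4, 8, 16, 32, 64, 128);
// Signed spelling: int8(-128) has the same bit pattern as uint8(128).
v128_t b = wasm_i8x16_const(1, 2, 4, 8, 16, 32, 64, -128,
                            1, 2, 4, 8, 16, 32, 64, -128);
```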
for (; *c; c++) {
  // Terminator IS NOT on the bitmap.
  __wasm_v128_setbit(&bitmap, *c);
Just a note for future reference: I was initially a bit concerned here that we would incur startup costs too heavy for the "check a small string" use case. But of course it's better to loop over c once up front rather than at each character in s like the scalar version does.
The scalar version does the same: it iterates over c once, building a bitmap:
https://github.com/WebAssembly/wasi-libc/blob/main/libc-top-half/musl/src/string/strspn.c
They use some "inscrutable" (but well-known) macros to build a more straightforward bitmap in stack memory. I used this function to build our weird 256-bit bitmap "directly" into a pair of v128 vectors.
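For context, musl's macro (paraphrased from the strspn.c linked above, not quoted exactly) looks roughly like this:

```c
#include <stddef.h>

// BITOP(set, c, |=) sets bit c of the bitmap; BITOP(set, c, &) tests it.
#define BITOP(a,b,op) \
 ((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))

static void build_byteset(const char *c) {
  // 256 bits of bitmap in stack memory, one bit per byte value.
  size_t byteset[32/sizeof(size_t)] = { 0 };
  for (; *c; c++) BITOP(byteset, *(unsigned char *)c, |=);
}
```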
if (!wasm_i8x16_all_true(cmp)) {
  // Clear the bits corresponding to align (little-endian)
  // so we can count trailing zeros.
  int mask = (uint16_t)~wasm_i8x16_bitmask(cmp) >> align << align;
  // At least one bit will be set, unless align cleared them.
  // Knowing this helps the compiler if it unrolls the loop.
  __builtin_assume(mask || align);
  // If the mask became zero because of align,
  // it's as if we didn't find anything.
  if (mask) {
    // Find the offset of the first one bit (little-endian).
    return addr - (uintptr_t)s + __builtin_ctz(mask);
  }
}
We keep on using this pattern; is the logic repetitive enough that we should define a helper function somewhere?
Yeah, but it's always subtly different. Sometimes it's (1) cmp, any_true and bitmask; others (2) cmp, all_true and (uint16_t)~bitmask. Then, strlen gets away with doing !all_true before the cmp and bitmask; memrchr does clz instead of ctz.
I also don't see how to get those two main patterns to inline without going into macros, and inlining is important. That __builtin_assume makes the compiler figure out it's good to unroll the first iteration to make the subsequent ones significantly simpler.
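For example, pattern (2) as quoted above could be wrapped in a macro along these lines (a hypothetical sketch, not code from the PR; the name and parameters are invented):

```c
// Hypothetical wrapper for pattern (2): all_true plus negated bitmask.
// cmp is true where the byte is NOT a match; returns on the first match.
#define RETURN_ON_FIRST_MATCH(cmp, addr, s, align)                          \
  do {                                                                      \
    if (!wasm_i8x16_all_true(cmp)) {                                        \
      int mask = (uint16_t)~wasm_i8x16_bitmask(cmp) >> (align) << (align);  \
      __builtin_assume(mask || (align));                                    \
      if (mask) return (addr) - (uintptr_t)(s) + __builtin_ctz(mask);       \
    }                                                                       \
  } while (0)
```

Pattern (1) would need a second macro with any_true and the un-negated bitmask, which is part of why a single helper doesn't fall out naturally.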
bitmap->lo[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 0));
bitmap->hi[lo_nibble] |= (uint8_t)((uint32_t)1 << (hi_nibble - 8));
I'm interested in understanding the codegen of this: so the ...[lo_nibble] |= is generating some i8x16.replace_lane but somehow also OR-ing the high nibble bits? What is emitted by LLVM here?
LLVM cheats and uses the stack; wasm-opt (and wasm-ctor-eval) then removes any traces of $__stack_pointer in this build:
https://github.com/ncruces/go-sqlite3/blob/b72fd5db/sqlite3/libc/libc.wat#L1646-L1726
} __wasm_v128_bitmap256_t;

__attribute__((always_inline))
static void __wasm_v128_setbit(__wasm_v128_bitmap256_t *bitmap, int i) {
Why not the following?

- static void __wasm_v128_setbit(__wasm_v128_bitmap256_t *bitmap, int i) {
+ static void __wasm_v128_setbit(__wasm_v128_bitmap256_t *bitmap, uint8_t i) {
Sure. It moves the cast out to the loop; best not to rely on the implicit char to uint8_t conversion. Arguably, this bit would be more correct with unsigned char and unsigned.
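A sketch of that unsigned variant (hypothetical, not the PR's code; written so the shift counts stay in range, with the uint8_t cast dropping the bit that belongs to the other half of the bitmap):

```c
#include <stdint.h>

typedef struct {
  uint8_t lo[16];  // bits 0-7 of each row
  uint8_t hi[16];  // bits 8-15 of each row
} __wasm_v128_bitmap256_t;

__attribute__((always_inline))
static void __wasm_v128_setbit(__wasm_v128_bitmap256_t *bitmap,
                               unsigned char i) {
  unsigned lo_nibble = i & 0xf;
  unsigned hi_nibble = i >> 4;
  // Exactly one of these two stores a nonzero bit: the other shift
  // result lands above bit 7 and the uint8_t cast truncates it away.
  bitmap->lo[lo_nibble] |= (uint8_t)(1u << hi_nibble);
  bitmap->hi[lo_nibble] |= (uint8_t)((1u << hi_nibble) >> 8);
}
```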
Just to document the decision of using Geoff Langdale's variant of the algorithm: it trades (in the original by Wojciech Muła):

By (in the variant with Geoff Langdale's input):
Continuing #580, implements strspn and strcspn.

This one follows the same general structure as #586, #592 and #594, but uses a somewhat more complicated algorithm, described here.

I used the Geoff Langdale alternative implementation (the tweet has since disappeared), which is correctly described there but has a subtle bug in the implementation: WojciechMula/simd-byte-lookup#2
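For orientation, the underlying nibble-decomposition membership test, in Wojciech Muła's original formulation (with the compare-and-select step the Langdale variant trades away), looks roughly like this. This is a sketch, not the PR's code; the function name is invented, and bitmap_lo/bitmap_hi stand in for the two v128 halves of the 256-bit bitmap:

```c
#include <wasm_simd128.h>

// Returns a mask that is true (0xFF) where the input byte is NOT in
// the set, matching the cmp that the strcspn loop feeds to all_true.
static v128_t __not_in_set(v128_t input, v128_t bitmap_lo, v128_t bitmap_hi) {
  const v128_t bitmask_lookup = wasm_u8x16_const(1, 2, 4, 8, 16, 32, 64, 128,
                                                 1, 2, 4, 8, 16, 32, 64, 128);
  v128_t lo_nib = wasm_v128_and(input, wasm_i8x16_splat(0x0f));
  v128_t hi_nib = wasm_u8x16_shr(input, 4);
  // Row bytes: bits 0-7 of each row live in bitmap_lo, bits 8-15 in bitmap_hi.
  v128_t row_lo = wasm_i8x16_swizzle(bitmap_lo, lo_nib);
  v128_t row_hi = wasm_i8x16_swizzle(bitmap_hi, lo_nib);
  // Column bit: 1 << (hi_nibble % 8), courtesy of the repeated lookup table.
  v128_t bit = wasm_i8x16_swizzle(bitmask_lookup, hi_nib);
  // Select the right half of the row depending on hi_nibble < 8.
  v128_t row = wasm_v128_bitselect(row_lo, row_hi,
                                   wasm_u8x16_lt(hi_nib, wasm_i8x16_splat(8)));
  // A zero byte here means the corresponding input byte is not in the set.
  return wasm_i8x16_eq(wasm_v128_and(row, bit), wasm_i8x16_splat(0));
}
```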
Since the complexity needed for __wasm_v128_bitmap256_t is shared by both strspn and strcspn, I moved the implementation to a common file when SIMD is used.

The tests follow a similar structure to the previous ones, and cover the bug, which was found through fuzzing.
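For reference, the library semantics being implemented (standard C, independent of the SIMD path):

```c
#include <assert.h>
#include <string.h>

int main(void) {
  // strspn: length of the initial run made only of bytes from the set.
  assert(strspn("127.0.0.1:8080", ".0123456789") == 9);
  // strcspn: length of the initial run containing no byte from the set.
  assert(strcspn("key=value", "=") == 3);
  return 0;
}
```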