-
Notifications
You must be signed in to change notification settings - Fork 214
Optional SIMD memchr #592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optional SIMD memchr #592
Conversation
// Bitmask is slow on AArch64, any_true is much faster. | ||
if (wasm_v128_any_true(cmp)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this: using the wasm_i8x16_bitmask
directly here is a better lowering for x64 than wasm_v128_any_true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This is a mitigation for AArch64. We don't have the luxury of knowing the final architecture, and much less the CPU. Or how good the runtime is at (e.g. peephole) optimizing the final generated assembly.
But I'm pretty sure I measured, and at least for large buffers (and wazero) this was a significant improvement on AArch64, for a pretty insignificant cost on 3 different x86-64 CPUs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this again with bench.c
with wasmtime on my Xeon W-2135, and... it's hard to measure. I'm not saying it's not slower, it may, but it's close enough that between processors, VMs, lengths, I'm not sure which is better.
So, where do I put something like bench.c
, and how do we settle the matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this kind of thing could fit in sightglass, even though this is a bit of a micro-benchmark. You could take a look at this example of how the blake3 benchmark was added: benchmark.c
. In the Dockerfile that builds a benchmark you could add the special "build wasi-libc with SIMD enabled" logic.
But you don't have to put it there. I think we could probably settle this using the bench.c
you provided. I'd probably be comfortable merging this without the special aarch64 optimization now and then submitting that as a second PR once I have a chance to measure a few things. Let me make sure I understand what you're saying precisely: (a) you can't detect a difference using bitmask
or any_true
with the x64 CPUs you tested but (b) it still makes a very big difference for aarch64?
On hold due to #593. I don't understand the implications of the UB claim for the pointer arithmetic needed to implement Do feel free to update the PR as required. |
Updated with the inline assembly, and suggestions in #593 (comment), @alexcrichton please validate. Also tweaked naming and formatting in |
Continuing #580, followup to #586.
Chose
memchr
because it's somewhat similar tostrlen
, but also because it is the basis forstrnlen
(and in that capacity, forstrndup
andstrlcat
) and is also used bystrstr
,fnmatch
.