Skip to content

Optional SIMD memchr #592

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 11, 2025
Merged

Optional SIMD memchr #592

merged 2 commits into from
Jul 11, 2025

Conversation

ncruces
Copy link
Contributor

@ncruces ncruces commented Jul 1, 2025

Continuing #580, followup to #586.

Chose memchr because it's somewhat similar to strlen, but also because it is the basis for strnlen (and in that capacity, for strndup and strlcat) and is also used by strstr, fnmatch.

Comment on lines +36 to +47
// Bitmask is slow on AArch64, any_true is much faster.
if (wasm_v128_any_true(cmp)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this: using the wasm_i8x16_bitmask directly here is a better lowering for x64 than wasm_v128_any_true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is a mitigation for AArch64. We don't have the luxury of knowing the final architecture, and much less the CPU. Or how good the runtime is at (e.g. peephole) optimizing the final generated assembly.

But I'm pretty sure I measured, and at least for large buffers (and wazero) this was a significant improvement on AArch64, for a pretty insignificant cost on 3 different x86-64 CPUs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this again with bench.c with wasmtime on my Xeon W-2135, and... it's hard to measure. I'm not saying it's not slower, it may, but it's close enough that between processors, VMs, lengths, I'm not sure which is better.

So, where do I put something like bench.c, and how do we settle the matter?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this kind of thing could fit in sightglass, even though this is a bit of a micro-benchmark. You could take a look at this example of how the blake3 benchmark was added: benchmark.c. In the Dockerfile that builds a benchmark you could add the special "build wasi-libc with SIMD enabled" logic.

But you don't have to put it there. I think we could probably settle this using the bench.c you provided. I'd probably be comfortable merging this without the special aarch64 optimization now and then submitting that as a second PR once I have a chance to measure a few things. Let me make sure I understand what you're saying precisely: (a) you can't detect a difference using bitmask or any_true with the x64 CPUs you tested but (b) it still makes a very big difference for aarch64?

@ncruces ncruces marked this pull request as draft July 2, 2025 23:23
@ncruces
Copy link
Contributor Author

ncruces commented Jul 2, 2025

On hold due to #593.

I don't understand the implications of the UB claim for the pointer arithmetic needed to implement memchr.

Do feel free to update the PR as required.

@ncruces
Copy link
Contributor Author

ncruces commented Jul 3, 2025

Updated with the inline assembly, and suggestions in #593 (comment), @alexcrichton please validate.

Also tweaked naming and formatting in strlen, for consistency.

@ncruces ncruces marked this pull request as ready for review July 3, 2025 14:14
@abrown abrown merged commit 4ea6fdf into WebAssembly:main Jul 11, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants