chacha20: add an AVX-512 backend #477
Conversation
Benchmark results on Zen 4:

AVX2, Zen 4:

```
test chacha20_bench1_16b   ... bench:    23.71 ns/iter (+/- 0.89)   = 695 MB/s
test chacha20_bench2_256b  ... bench:    82.98 ns/iter (+/- 7.64)   = 3121 MB/s
test chacha20_bench3_1kib  ... bench:   302.03 ns/iter (+/- 3.59)   = 3390 MB/s
test chacha20_bench4_16kib ... bench: 4,677.58 ns/iter (+/- 161.42) = 3503 MB/s
```

AVX-512, Zen 4:

```
test chacha20_bench1_16b   ... bench:    25.07 ns/iter (+/- 0.90)  = 640 MB/s
test chacha20_bench2_256b  ... bench:    79.66 ns/iter (+/- 1.18)  = 3240 MB/s
test chacha20_bench3_1kib  ... bench:   275.32 ns/iter (+/- 4.13)  = 3723 MB/s
test chacha20_bench4_16kib ... bench: 4,201.84 ns/iter (+/- 24.18) = 3900 MB/s
```

Much greater speedups are achievable for long input sizes if we increase `PAR_BLOCKS` to 8, but this also causes a 2x slowdown for short inputs (< 512 bytes). The `StreamCipherBackend` API doesn't seem to have any way to support multiple degrees of parallelism depending on the input size.
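To make the trade-off concrete, here is a rough, purely illustrative model of why a single compile-time degree of parallelism penalizes short inputs. Only the 16-block batch width and the 64-byte ChaCha block size come from this discussion; the helper function is my own sketch, not code from the crate:

```rust
// Illustrative sketch only: with one fixed PAR_BLOCKS value, keystream is
// generated a whole batch at a time, so short messages pay for blocks they
// never use. This is NOT the actual backend code from the PR.

const PAR_BLOCKS: usize = 16; // batch width tuned in this PR
const BLOCK_SIZE: usize = 64; // ChaCha block size in bytes

/// Keystream blocks generated but left unused for a `len`-byte message
/// when blocks are always produced PAR_BLOCKS at a time.
fn unused_blocks(len: usize) -> usize {
    let needed = len.div_ceil(BLOCK_SIZE);
    let generated = needed.div_ceil(PAR_BLOCKS) * PAR_BLOCKS;
    generated - needed
}

fn main() {
    // A 256-byte message needs 4 blocks, but a 16-wide batch produces 16.
    assert_eq!(unused_blocks(256), 12);
    // A 16 KiB message is exactly 256 blocks: nothing is wasted.
    assert_eq!(unused_blocks(16 * 1024), 0);
    println!("ok");
}
```

This is why widening the batch helps bulk throughput while hurting the sub-batch sizes that dominate the short-input benchmarks.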
New throughput results on Zen 4:

```
test chacha20_bench1_16b   ... bench:    25.53 ns/iter (+/- 0.75)  = 640 MB/s
test chacha20_bench2_256b  ... bench:   255.88 ns/iter (+/- 4.16)  = 1003 MB/s
test chacha20_bench3_1kib  ... bench:   192.76 ns/iter (+/- 4.15)  = 5333 MB/s
test chacha20_bench4_16kib ... bench: 2,873.78 ns/iter (+/- 62.99) = 5702 MB/s
```

This is a 3x regression for the 256-byte case, since a minimum of 512 bytes is required to use the parallel path.
Also, this requires updating the MSRV to 1.89 for stable AVX-512 intrinsics support.
Added tests for this now.
… short inputs

This makes up about half of the performance loss for 16-byte output. I suspect the remaining loss is due to different inlining decisions and is probably insignificant.
…sm to make it worth the complexity
I would prefer we make a stable release for |
Yeah, I don't think this is something we should enable right away and it would be good to have an initial release with a 1.85 MSRV. Maybe it could be gated by a |
Force-pushed from 55328bd to efff21d
Sounds good. I've updated to gate the implementation under a |
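For reference, cfg-gating an opt-in backend can look roughly like the following sketch. The cfg name `chacha20_avx512` is hypothetical (the comment above is truncated, so the PR's actual flag name is unknown here); the selection logic is illustrative, not the crate's real dispatch code:

```rust
// Hypothetical sketch of cfg-gated backend selection. The cfg name
// `chacha20_avx512` is illustrative and may not match the PR's actual flag.
// Building with RUSTFLAGS='--cfg chacha20_avx512' would flip the selection.

fn backend_name() -> &'static str {
    if cfg!(all(target_arch = "x86_64", chacha20_avx512)) {
        "avx512" // opt-in path, only compiled in when the cfg is set
    } else {
        "avx2/soft" // default path when the opt-in cfg is not set
    }
}

fn main() {
    // Without the opt-in cfg, the default backend is selected.
    println!("selected backend: {}", backend_name());
}
```

Keeping the AVX-512 path behind an explicit `--cfg` opt-in matches the goal above: the initial stable release keeps its 1.85 MSRV, and only users who set the flag need Rust 1.89.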
This adds an AVX-512 backend to `chacha20`. There are major speedups for long input sizes at the cost of a ~5-20% performance loss for very short inputs; see the benchmarks below. It is largely based on the AVX2 backend, but with a bit of tuning to get better performance on medium-length inputs.
I spent some time tuning the `PAR_BLOCKS` parameter and found that a value of 16 (compared to 4 for AVX2) produced the highest throughput for large input sizes. This achieves the highest ILP without spilling, thanks to the larger register file in AVX-512. I also added special tail handling to get better performance on sizes below 1024 bytes.
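The chunking described above can be sketched as a simple block schedule. This is a rough model under my own assumptions: only the 16-block batch width comes from this PR; the 4-block and single-block tail widths are illustrative guesses, not the backend's actual tail strategy:

```rust
// Rough model of the chunking described above: consume the bulk of the
// input 16 blocks at a time, then finish with narrower batches so short
// and medium inputs avoid generating a full 16-block keystream batch.
// The 4-block and 1-block tail widths are assumptions for illustration.

const BLOCK_SIZE: usize = 64; // ChaCha block size in bytes

/// Returns (16-block batches, 4-block batches, single blocks) needed
/// to cover `len` bytes of input.
fn schedule(len: usize) -> (usize, usize, usize) {
    let mut blocks = len.div_ceil(BLOCK_SIZE);
    let wide = blocks / 16;
    blocks %= 16;
    let quad = blocks / 4;
    blocks %= 4;
    (wide, quad, blocks)
}

fn main() {
    assert_eq!(schedule(16 * 1024), (16, 0, 0)); // 256 blocks: all wide
    assert_eq!(schedule(1024), (1, 0, 0));       // exactly one wide batch
    assert_eq!(schedule(832), (0, 3, 1));        // 13 blocks: tail only
    println!("ok");
}
```

The point of the tail path is visible in the last case: a 13-block input never touches the 16-wide code at all, so it avoids generating three unused blocks of keystream.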
The performance loss on short inputs seems to be due to LLVM making different inlining decisions into the benchmark loop. I'm not sure if this matters much outside a microbenchmark context.
Benchmarks
On a Ryzen 7950X (Zen 4):
On a Xeon Gold 6530 (Emerald Rapids):