
Conversation

@caelunshun commented Nov 1, 2025

This adds an AVX-512 backend to chacha20. There are major speedups for long input sizes at the cost of a ~5-20% performance loss for very short inputs. See benchmarks below.

It is largely based on the AVX-2 backend, but with a bit of tuning to get better performance on medium-length inputs.
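For context (this is background, not code from the diff), the scalar quarter round that both backends vectorize is the one from RFC 8439; the SIMD paths run this same sequence over many state words in parallel:

```rust
/// ChaCha quarter round (RFC 8439, section 2.1) on four u32 words.
/// The AVX-2/AVX-512 backends apply this operation lane-wise across
/// several blocks of state at once.
fn quarter_round(a: &mut u32, b: &mut u32, c: &mut u32, d: &mut u32) {
    *a = a.wrapping_add(*b); *d ^= *a; *d = d.rotate_left(16);
    *c = c.wrapping_add(*d); *b ^= *c; *b = b.rotate_left(12);
    *a = a.wrapping_add(*b); *d ^= *a; *d = d.rotate_left(8);
    *c = c.wrapping_add(*d); *b ^= *c; *b = b.rotate_left(7);
}
```

The function can be checked against the test vector in RFC 8439 section 2.1.1.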

I spent some time tuning the PAR_BLOCKS parameter and found that a value of 16 (compared to 4 for AVX-2) produced the highest throughput for large inputs. This achieves the highest ILP without register spilling, thanks to AVX-512's larger register file (32 registers vs. 16 for AVX-2).

I added special tail handling to get better performance on sizes less than 1024 bytes.
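To make the split concrete, here is a hypothetical helper (names are mine, not the crate's) showing the arithmetic implied by PAR_BLOCKS = 16: full 1024-byte batches go through the wide keystream path, and anything shorter falls to the tail handling:

```rust
const BLOCK_SIZE: usize = 64;  // ChaCha block size in bytes
const PAR_BLOCKS: usize = 16;  // blocks per AVX-512 keystream batch in this PR

/// Hypothetical work split: number of full 16-block (1024-byte) batches
/// for the wide path, plus the leftover bytes handled by the tail path.
fn split_work(len: usize) -> (usize, usize) {
    let batch = BLOCK_SIZE * PAR_BLOCKS; // 1024 bytes
    (len / batch, len % batch)
}
```

Under this split, the 16-byte and 256-byte benchmarks never touch the wide path at all, which is why tail handling dominates their performance.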

The performance loss on short inputs seems to be due to LLVM making different inlining decisions in the benchmark loop. I'm not sure this matters much outside of a microbenchmark context.

Benchmarks

On a Ryzen 7950X (Zen 4):

| benchmark | AVX-2 throughput | AVX-512 throughput | speedup |
|---|---|---|---|
| chacha20_bench1_16b | 666 MB/s | 640 MB/s | 0.96x |
| chacha20_bench2_256b | 3011 MB/s | 3240 MB/s | 1.08x |
| chacha20_bench3_1kib | 3390 MB/s | 6243 MB/s | 1.84x |
| chacha20_bench4_16kib | 3488 MB/s | 6603 MB/s | 1.89x |
| chacha12_bench1_16b | 941 MB/s | 800 MB/s | 0.85x |
| chacha12_bench2_256b | 4491 MB/s | 4830 MB/s | 1.08x |
| chacha12_bench3_1kib | 5446 MB/s | 9142 MB/s | 1.68x |
| chacha12_bench4_16kib | 5746 MB/s | 10076 MB/s | 1.75x |
| chacha8_bench1_16b | 1066 MB/s | 1000 MB/s | 0.94x |
| chacha8_bench2_256b | 6243 MB/s | 6564 MB/s | 1.05x |
| chacha8_bench3_1kib | 7937 MB/s | 12190 MB/s | 1.54x |
| chacha8_bench4_16kib | 8458 MB/s | 13664 MB/s | 1.62x |

On a Xeon Gold 6530 (Emerald Rapids):

| benchmark | AVX-2 throughput | AVX-512 throughput | speedup |
|---|---|---|---|
| chacha20_bench1_16b | 333 MB/s | 280 MB/s | 0.84x |
| chacha20_bench2_256b | 1430 MB/s | 1802 MB/s | 1.26x |
| chacha20_bench3_1kib | 1587 MB/s | 2723 MB/s | 1.72x |
| chacha20_bench4_16kib | 1645 MB/s | 2925 MB/s | 1.78x |
| chacha12_bench1_16b | 444 MB/s | 355 MB/s | 0.80x |
| chacha12_bench2_256b | 2206 MB/s | 2694 MB/s | 1.22x |
| chacha12_bench3_1kib | 2566 MB/s | 3864 MB/s | 1.51x |
| chacha12_bench4_16kib | 2728 MB/s | 4573 MB/s | 1.68x |
| chacha8_bench1_16b | 484 MB/s | 421 MB/s | 0.87x |
| chacha8_bench2_256b | 2694 MB/s | 3555 MB/s | 1.32x |
| chacha8_bench3_1kib | 3543 MB/s | 5505 MB/s | 1.55x |
| chacha8_bench4_16kib | 3868 MB/s | 6425 MB/s | 1.66x |

Benchmark results on Zen 4:

AVX-2, Zen 4:

```
test chacha20_bench1_16b   ... bench:          23.71 ns/iter (+/- 0.89) = 695 MB/s
test chacha20_bench2_256b  ... bench:          82.98 ns/iter (+/- 7.64) = 3121 MB/s
test chacha20_bench3_1kib  ... bench:         302.03 ns/iter (+/- 3.59) = 3390 MB/s
test chacha20_bench4_16kib ... bench:       4,677.58 ns/iter (+/- 161.42) = 3503 MB/s
```

AVX-512, Zen 4:

```
test chacha20_bench1_16b   ... bench:          25.07 ns/iter (+/- 0.90) = 640 MB/s
test chacha20_bench2_256b  ... bench:          79.66 ns/iter (+/- 1.18) = 3240 MB/s
test chacha20_bench3_1kib  ... bench:         275.32 ns/iter (+/- 4.13) = 3723 MB/s
test chacha20_bench4_16kib ... bench:       4,201.84 ns/iter (+/- 24.18) = 3900 MB/s
```

Much greater speedups are achievable for long input sizes if we increase PAR_BLOCKS to 8,
but this also causes a 2x slowdown for short inputs (< 512 bytes). The StreamCipherBackend
API doesn't seem to have any way to support multiple degrees of parallelism depending on the input
size.
New throughput results on Zen 4:

```
test chacha20_bench1_16b   ... bench:          25.53 ns/iter (+/- 0.75) = 640 MB/s
test chacha20_bench2_256b  ... bench:         255.88 ns/iter (+/- 4.16) = 1003 MB/s
test chacha20_bench3_1kib  ... bench:         192.76 ns/iter (+/- 4.15) = 5333 MB/s
test chacha20_bench4_16kib ... bench:       2,873.78 ns/iter (+/- 62.99) = 5702 MB/s
```

This is a 3x regression for the 256-byte case, since a minimum of 512 bytes is required to take the parallel path.
@caelunshun (Author) commented Nov 1, 2025

Also, this requires updating the MSRV to 1.89 for stable AVX-512 intrinsics support.

caelunshun marked this pull request as draft on November 2, 2025, 19:08
@caelunshun (Author) commented Nov 2, 2025

Just realized the tests don't actually exercise the AVX-512 code path, since the test vectors are too short, so marking as draft while I add more tests.

Added tests for this now.

… short inputs

This makes up about half of the performance loss for 16-byte outputs. I suspect the remaining loss is due to different inlining decisions and is probably insignificant.
caelunshun marked this pull request as ready for review on November 3, 2025, 02:47
@dhardy (Contributor) commented Nov 3, 2025

> Also, requires updating MSRV to 1.89 for stable AVX-512 intrinsics support.

I would prefer we make a stable release for rand_core v0.10 before merging this.

@tarcieri (Member) commented Nov 3, 2025

Yeah, I don't think this is something we should enable right away and it would be good to have an initial release with a 1.85 MSRV.

Maybe it could be gated by a cfg, similar to how the AVX-512 functionality is gated in the aes crate?

@caelunshun (Author) commented
Sounds good. I've updated the PR to gate the implementation behind a chacha20_avx512 cfg and added a CI test for AVX-512 (copied from the aes crate's VAES-512 configuration).
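A gate like this typically looks as follows; the cfg name `chacha20_avx512` is the one named in this thread, but the module layout and contents here are illustrative placeholders, not the crate's actual code:

```rust
// Hypothetical sketch of a cfg-gated backend selection. The cfg name
// `chacha20_avx512` comes from this PR; the modules are placeholders.
#[cfg(chacha20_avx512)]
mod backend {
    pub fn name() -> &'static str { "avx512" }
}

#[cfg(not(chacha20_avx512))]
mod backend {
    pub fn name() -> &'static str { "avx2/soft" }
}
```

Users would opt in at build time with something like `RUSTFLAGS="--cfg chacha20_avx512"` (plus the required target features), mirroring how the aes crate gates its AVX-512 paths; without the flag, the existing backends are compiled.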
