
Conversation

@caelunshun commented Nov 1, 2025

This adds an AVX-512 backend to chacha20. There are major speedups for long input sizes at the cost of a ~5-20% performance loss for very short inputs. See benchmarks below.

It is largely based on the AVX-2 backend, but with a bit of tuning to get better performance on medium-length inputs.
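For context (this is background, not code from the diff), the scalar quarter round that both backends vectorize is the one from RFC 8439; the SIMD paths run this same sequence over many state words in parallel:

```rust
/// ChaCha quarter round (RFC 8439, section 2.1) on four u32 words.
/// The AVX-2/AVX-512 backends apply this operation lane-wise across
/// several blocks of state at once.
fn quarter_round(a: &mut u32, b: &mut u32, c: &mut u32, d: &mut u32) {
    *a = a.wrapping_add(*b); *d ^= *a; *d = d.rotate_left(16);
    *c = c.wrapping_add(*d); *b ^= *c; *b = b.rotate_left(12);
    *a = a.wrapping_add(*b); *d ^= *a; *d = d.rotate_left(8);
    *c = c.wrapping_add(*d); *b ^= *c; *b = b.rotate_left(7);
}
```

The function can be checked against the test vector in RFC 8439 section 2.1.1.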

I spent some time tuning the PAR_BLOCKS parameter and found that a value of 16 (compared to 4 for AVX-2) produced the highest throughput for large inputs. This achieves the highest ILP without register spilling, thanks to AVX-512's larger register file (32 registers vs. 16 for AVX-2).

I added special tail handling to get better performance on sizes less than 1024 bytes.
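To make the split concrete, here is a hypothetical helper (names are mine, not the crate's) showing the arithmetic implied by PAR_BLOCKS = 16: full 1024-byte batches go through the wide keystream path, and anything shorter falls to the tail handling:

```rust
const BLOCK_SIZE: usize = 64;  // ChaCha block size in bytes
const PAR_BLOCKS: usize = 16;  // blocks per AVX-512 keystream batch in this PR

/// Hypothetical work split: number of full 16-block (1024-byte) batches
/// for the wide path, plus the leftover bytes handled by the tail path.
fn split_work(len: usize) -> (usize, usize) {
    let batch = BLOCK_SIZE * PAR_BLOCKS; // 1024 bytes
    (len / batch, len % batch)
}
```

Under this split, the 16-byte and 256-byte benchmarks never touch the wide path at all, which is why tail handling dominates their performance.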

The performance loss on short inputs seems to be due to LLVM making different inlining decisions in the benchmark loop. I'm not sure this matters much outside of a microbenchmark context.

Benchmarks

On a Ryzen 7950X (Zen 4):

| benchmark | AVX-2 throughput | AVX-512 throughput | speedup |
|---|---|---|---|
| chacha20_bench1_16b | 666 MB/s | 640 MB/s | 0.96x |
| chacha20_bench2_256b | 3011 MB/s | 3240 MB/s | 1.08x |
| chacha20_bench3_1kib | 3390 MB/s | 6243 MB/s | 1.84x |
| chacha20_bench4_16kib | 3488 MB/s | 6603 MB/s | 1.89x |
| chacha12_bench1_16b | 941 MB/s | 800 MB/s | 0.85x |
| chacha12_bench2_256b | 4491 MB/s | 4830 MB/s | 1.08x |
| chacha12_bench3_1kib | 5446 MB/s | 9142 MB/s | 1.68x |
| chacha12_bench4_16kib | 5746 MB/s | 10076 MB/s | 1.75x |
| chacha8_bench1_16b | 1066 MB/s | 1000 MB/s | 0.94x |
| chacha8_bench2_256b | 6243 MB/s | 6564 MB/s | 1.05x |
| chacha8_bench3_1kib | 7937 MB/s | 12190 MB/s | 1.54x |
| chacha8_bench4_16kib | 8458 MB/s | 13664 MB/s | 1.62x |

On a Xeon Gold 6530 (Emerald Rapids):

| benchmark | AVX-2 throughput | AVX-512 throughput | speedup |
|---|---|---|---|
| chacha20_bench1_16b | 333 MB/s | 280 MB/s | 0.84x |
| chacha20_bench2_256b | 1430 MB/s | 1802 MB/s | 1.26x |
| chacha20_bench3_1kib | 1587 MB/s | 2723 MB/s | 1.72x |
| chacha20_bench4_16kib | 1645 MB/s | 2925 MB/s | 1.78x |
| chacha12_bench1_16b | 444 MB/s | 355 MB/s | 0.80x |
| chacha12_bench2_256b | 2206 MB/s | 2694 MB/s | 1.22x |
| chacha12_bench3_1kib | 2566 MB/s | 3864 MB/s | 1.51x |
| chacha12_bench4_16kib | 2728 MB/s | 4573 MB/s | 1.68x |
| chacha8_bench1_16b | 484 MB/s | 421 MB/s | 0.87x |
| chacha8_bench2_256b | 2694 MB/s | 3555 MB/s | 1.32x |
| chacha8_bench3_1kib | 3543 MB/s | 5505 MB/s | 1.55x |
| chacha8_bench4_16kib | 3868 MB/s | 6425 MB/s | 1.66x |

Benchmark results on Zen 4:

AVX-2, Zen 4:

```
test chacha20_bench1_16b   ... bench:          23.71 ns/iter (+/- 0.89) = 695 MB/s
test chacha20_bench2_256b  ... bench:          82.98 ns/iter (+/- 7.64) = 3121 MB/s
test chacha20_bench3_1kib  ... bench:         302.03 ns/iter (+/- 3.59) = 3390 MB/s
test chacha20_bench4_16kib ... bench:       4,677.58 ns/iter (+/- 161.42) = 3503 MB/s
```

AVX-512, Zen 4:

```
test chacha20_bench1_16b   ... bench:          25.07 ns/iter (+/- 0.90) = 640 MB/s
test chacha20_bench2_256b  ... bench:          79.66 ns/iter (+/- 1.18) = 3240 MB/s
test chacha20_bench3_1kib  ... bench:         275.32 ns/iter (+/- 4.13) = 3723 MB/s
test chacha20_bench4_16kib ... bench:       4,201.84 ns/iter (+/- 24.18) = 3900 MB/s
```

Much greater speedups are achievable for long input sizes if we increase PAR_BLOCKS to 8,
but this also causes a 2x slowdown for short inputs (< 512 bytes). The StreamCipherBackend
API doesn't seem to have any way to support multiple degrees of parallelism depending on the input
size.
New throughput results on Zen 4:

```
test chacha20_bench1_16b   ... bench:          25.53 ns/iter (+/- 0.75) = 640 MB/s
test chacha20_bench2_256b  ... bench:         255.88 ns/iter (+/- 4.16) = 1003 MB/s
test chacha20_bench3_1kib  ... bench:         192.76 ns/iter (+/- 4.15) = 5333 MB/s
test chacha20_bench4_16kib ... bench:       2,873.78 ns/iter (+/- 62.99) = 5702 MB/s
```

This is a 3x regression for the 256-byte case, since a minimum of 512 bytes is required to take the parallel path.
@caelunshun (Author) commented Nov 1, 2025

Also, this requires updating the MSRV to 1.89 for stable AVX-512 intrinsics support.

caelunshun marked this pull request as draft on November 2, 2025, 19:08
@caelunshun (Author) commented Nov 2, 2025

Just realized the tests don't actually exercise the AVX-512 code path, since the test vectors are too short, so marking as draft while I add more tests.

Added tests for this now.

… short inputs

This makes up about half of the performance loss for 16-byte outputs. I suspect the remaining loss is due to different inlining decisions and is probably insignificant.
caelunshun marked this pull request as ready for review on November 3, 2025, 02:47
@dhardy (Contributor) commented Nov 3, 2025

> Also, requires updating MSRV to 1.89 for stable AVX-512 intrinsics support.

I would prefer we make a stable release for rand_core v0.10 before merging this.

@tarcieri (Member) commented Nov 3, 2025

Yeah, I don't think this is something we should enable right away and it would be good to have an initial release with a 1.85 MSRV.

Maybe it could be gated by a cfg, similar to how the AVX-512 functionality is gated in the aes crate?

@caelunshun (Author) commented
Sounds good. I've updated the PR to gate the implementation behind a chacha20_avx512 cfg and added a CI test for AVX-512 (copied from the aes crate's VAES-512 configuration).
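A gate like this typically looks as follows; the cfg name `chacha20_avx512` is the one named in this thread, but the module layout and contents here are illustrative placeholders, not the crate's actual code:

```rust
// Hypothetical sketch of a cfg-gated backend selection. The cfg name
// `chacha20_avx512` comes from this PR; the modules are placeholders.
#[cfg(chacha20_avx512)]
mod backend {
    pub fn name() -> &'static str { "avx512" }
}

#[cfg(not(chacha20_avx512))]
mod backend {
    pub fn name() -> &'static str { "avx2/soft" }
}
```

Users would opt in at build time with something like `RUSTFLAGS="--cfg chacha20_avx512"` (plus the required target features), mirroring how the aes crate gates its AVX-512 paths; without the flag, the existing backends are compiled.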
