
@mkroening (Member) commented Oct 14, 2025

This PR stops overaligning every allocation to the cache line size by default.
This substantially reduces our memory usage in scenarios with many small allocations (such as deserializing JSON).
This also circumvents SFBdragon/talc#44, which is the cause of #1968.
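
To illustrate why overalignment is costly for small allocations, here is a minimal sketch. It assumes a 64-byte cache line and assumes the overaligning layer both raises the alignment and pads the size to it; `CACHE_LINE` and `overaligned` are hypothetical names, not taken from the kernel source.

```rust
use std::alloc::Layout;

/// Assumed cache line size, for illustration only; the real value is target-specific.
const CACHE_LINE: usize = 64;

/// Hypothetical helper: the layout an overaligning wrapper would pass to the
/// underlying allocator. The alignment is raised to the cache line size and
/// the size is padded up to a multiple of that alignment.
fn overaligned(layout: Layout) -> Layout {
    layout
        .align_to(CACHE_LINE)
        .expect("cache line size is a valid power-of-two alignment")
        .pad_to_align()
}

/// Number of bytes a request of `size` bytes (with 8-byte alignment)
/// occupies once it has been overaligned.
fn occupied_bytes(size: usize) -> usize {
    let requested = Layout::from_size_align(size, 8).expect("valid layout");
    overaligned(requested).size()
}
```

Under these assumptions, `occupied_bytes(24)` is 64 and `occupied_bytes(100)` is 128, so a workload dominated by allocations of a few dozen bytes would occupy several times the memory it asked for.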

That issue's JSON benchmark has this performance on my machine:

| alloc | time |
| --- | --- |
| main | 35 s |
| galloc | 200 ms |
| bump | 195 ms |
| galloc + this PR | 120 ms |
| bump + this PR | #GP |
| this PR | 90 ms |
| host | 10 ms |

The "bump + this PR" configuration is broken (it fails with #GP, a general protection fault) because the bump allocator aligns allocations incorrectly.
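
For reference, a bump allocator has to round its cursor up to the requested alignment before handing out an address. The following is a minimal sketch of that step, not the code of the bump allocator used here; `Bump`, `align_up`, and the address-based interface are hypothetical. The link to the fault is also an assumption: on x86-64, aligned SSE loads and stores raise #GP on misaligned addresses, which may be how the problem surfaces once the overaligning layer no longer masks it.

```rust
/// Rounds `addr` up to the next multiple of `align`.
/// `align` must be a power of two. Returns `None` on overflow.
fn align_up(addr: usize, align: usize) -> Option<usize> {
    debug_assert!(align.is_power_of_two());
    Some(addr.checked_add(align - 1)? & !(align - 1))
}

/// Hypothetical minimal bump allocator over the address range `next..end`.
struct Bump {
    next: usize,
    end: usize,
}

impl Bump {
    /// Returns the start address of a block of `size` bytes aligned to
    /// `align`, or `None` if the remaining range is too small.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        // Align the cursor first; skipping this step yields misaligned blocks.
        let start = align_up(self.next, align)?;
        let end = start.checked_add(size)?;
        if end > self.end {
            return None;
        }
        self.next = end;
        Some(start)
    }
}
```

With `Bump { next: 0x1001, end: 0x2000 }`, a request for 3 bytes at alignment 16 returns `0x1010`, and a following request for 8 bytes at alignment 8 returns `0x1018`.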

Depends on #1983.
Closes #1935.
Closes #1940.
Closes #1968.

@mkroening self-assigned this Oct 14, 2025
@github-actions bot (Contributor) left a comment

Benchmark Results

| Benchmark | Current: 8123de4 | Previous: 8e08755 | Performance Ratio |
| --- | --- | --- | --- |
| startup_benchmark Build Time | 110.72 s | 118.64 s | 0.93 |
| startup_benchmark File Size | 0.90 MB | 0.90 MB | 0.99 |
| Startup Time - 1 core | 0.89 s (±0.03 s) | 0.92 s (±0.03 s) | 0.98 |
| Startup Time - 2 cores | 0.91 s (±0.03 s) | 0.93 s (±0.03 s) | 0.98 |
| Startup Time - 4 cores | 0.91 s (±0.03 s) | 0.90 s (±0.03 s) | 1.01 |
| multithreaded_benchmark Build Time | 111.78 s | 115.52 s | 0.97 |
| multithreaded_benchmark File Size | 1.00 MB | 1.01 MB | 0.99 |
| Multithreaded Pi Efficiency - 2 Threads | 85.12 % (±9.57 %) | 89.14 % (±8.69 %) | 0.95 |
| Multithreaded Pi Efficiency - 4 Threads | 42.06 % (±3.67 %) | 43.33 % (±4.80 %) | 0.97 |
| Multithreaded Pi Efficiency - 8 Threads | 24.88 % (±1.56 %) | 24.89 % (±3.25 %) | 1.00 |
| micro_benchmarks Build Time | 111.64 s | 121.35 s | 0.92 |
| micro_benchmarks File Size | 1.00 MB | 1.01 MB | 0.99 |
| Scheduling time - 1 thread | 75.70 ticks (±5.38 ticks) | 64.78 ticks (±3.48 ticks) | 1.17 |
| Scheduling time - 2 threads | 43.43 ticks (±5.88 ticks) | 39.58 ticks (±5.81 ticks) | 1.10 |
| Micro - Time for syscall (getpid) | 3.54 ticks (±0.37 ticks) | 3.11 ticks (±0.35 ticks) | 1.14 |
| Memcpy speed - (built_in) block size 4096 | 68554.82 MByte/s (±48794.10 MByte/s) | 76892.26 MByte/s (±53208.68 MByte/s) | 0.89 |
| Memcpy speed - (built_in) block size 1048576 | 29459.00 MByte/s (±24153.57 MByte/s) | 42437.70 MByte/s (±29401.77 MByte/s) | 0.69 |
| Memcpy speed - (built_in) block size 16777216 | 27922.81 MByte/s (±23184.88 MByte/s) | 24216.95 MByte/s (±20131.45 MByte/s) | 1.15 |
| Memset speed - (built_in) block size 4096 | 69240.91 MByte/s (±49188.21 MByte/s) | 76746.14 MByte/s (±53110.30 MByte/s) | 0.90 |
| Memset speed - (built_in) block size 1048576 | 30241.21 MByte/s (±24590.79 MByte/s) | 42677.86 MByte/s (±29565.07 MByte/s) | 0.71 |
| Memset speed - (built_in) block size 16777216 | 28703.15 MByte/s (±23639.72 MByte/s) | 24944.26 MByte/s (±20616.34 MByte/s) | 1.15 |
| Memcpy speed - (rust) block size 4096 | 61490.56 MByte/s (±45358.39 MByte/s) | 72842.39 MByte/s (±51003.78 MByte/s) | 0.84 |
| Memcpy speed - (rust) block size 1048576 | 29444.81 MByte/s (±24209.84 MByte/s) | 42614.04 MByte/s (±29546.77 MByte/s) | 0.69 |
| Memcpy speed - (rust) block size 16777216 | 28129.09 MByte/s (±23368.74 MByte/s) | 25125.16 MByte/s (±20845.63 MByte/s) | 1.12 |
| Memset speed - (rust) block size 4096 | 62376.72 MByte/s (±45990.87 MByte/s) | 73212.71 MByte/s (±51231.34 MByte/s) | 0.85 |
| Memset speed - (rust) block size 1048576 | 30207.40 MByte/s (±24646.85 MByte/s) | 42865.24 MByte/s (±29720.96 MByte/s) | 0.70 |
| Memset speed - (rust) block size 16777216 | 28847.47 MByte/s (±23765.07 MByte/s) | 25840.21 MByte/s (±21299.14 MByte/s) | 1.12 |
| alloc_benchmarks Build Time | 105.91 s | 106.38 s | 1.00 |
| alloc_benchmarks File Size | 0.96 MB | 0.97 MB | 1.00 |
| Allocations - Allocation success | 100.00 % | 100.00 % | 1 |
| Allocations - Deallocation success | 70.05 % (±0.27 %) | 70.03 % (±0.28 %) | 1.00 |
| Allocations - Pre-fail Allocations | 100.00 % | 100.00 % | 1 |
| Allocations - Average Allocation time | 8576.21 Ticks (±445.05 Ticks) | 11224.94 Ticks (±674.74 Ticks) | 0.76 |
| Allocations - Average Allocation time (no fail) | 8576.21 Ticks (±445.05 Ticks) | 11224.94 Ticks (±674.74 Ticks) | 0.76 |
| Allocations - Average Deallocation time | 1106.42 Ticks (±328.69 Ticks) | 733.24 Ticks (±64.51 Ticks) | 1.51 |
| mutex_benchmark Build Time | 105.97 s | 111.63 s | 0.95 |
| mutex_benchmark File Size | 1.01 MB | 1.01 MB | 0.99 |
| Mutex Stress Test Average Time per Iteration - 1 Threads | 12.62 ns (±0.72 ns) | 13.10 ns (±0.85 ns) | 0.96 |
| Mutex Stress Test Average Time per Iteration - 2 Threads | 12.76 ns (±0.97 ns) | 14.56 ns (±0.90 ns) | 0.88 |

This comment was automatically generated by workflow using github-action-benchmark.

@jounathaen (Member) commented

Very nice.

I think we should not include the overalign feature at all, but instead be clever and thread-aware for small allocations. That would then also help solve #1984.

@mkroening force-pushed the feat-overalign branch 2 times, most recently from 6e4dfab to 0a3a6b8 on October 16, 2025 14:15
@mkroening changed the title from "perf(alloc): don't overalign allocations by default" to "perf(alloc): don't overalign allocations" on Oct 16, 2025
@mkroening (Member, Author) commented

> I think we should not include the overalign feature at all, but instead be clever and thread-aware for small allocations. That would then also help solve #1984.

I agree. I have completely removed our intermediate layer. We can reintroduce it if it turns out to be useful in the future.

@mkroening added this pull request to the merge queue Oct 20, 2025
Merged via the queue into main with commit 89a5d77 Oct 20, 2025
17 checks passed
@mkroening deleted the feat-overalign branch October 20, 2025 10:56

Development

Successfully merging this pull request may close these issues.

Poor single-threaded allocation performance

2 participants