perf(alloc): don't overalign allocations #1982

mkroening · 2025-10-14T17:45:57Z

This PR makes our overalignment of every allocation to cache line size non-default.
This heavily reduces our memory usage in scenarios with many small allocations (such as deserializing JSON).
This also circumvents SFBdragon/talc#44, which is the cause of #1968.
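
To illustrate why overalignment hurts small allocations, here is a minimal sketch, not the kernel's actual code. It assumes a 64-byte cache line, and `overalign` is a hypothetical helper name; only `Layout::align_to` and `Layout::pad_to_align` from the standard library are real APIs.

```rust
use std::alloc::Layout;

/// Assumed cache line size; 64 bytes is common on x86-64 but is an assumption here.
const CACHE_LINE_SIZE: usize = 64;

/// Hypothetical helper: raise a layout's alignment to the cache line size
/// and pad its size up to a multiple of that alignment.
fn overalign(layout: Layout) -> Layout {
    layout
        .align_to(CACHE_LINE_SIZE)
        .expect("cache line size is a valid power-of-two alignment")
        .pad_to_align()
}

fn main() {
    // Compare the requested size with the size reserved after overalignment.
    for size in [1usize, 8, 24, 64, 100] {
        let requested = Layout::from_size_align(size, 8).unwrap();
        let padded = overalign(requested);
        println!(
            "{:>3} bytes requested -> {:>3} bytes reserved",
            requested.size(),
            padded.size()
        );
    }
}
```

Under these assumptions a 24-byte request reserves a full 64 bytes, so workloads dominated by small objects pay for mostly unused padding.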

The JSON benchmark from #1968 performs as follows on my machine:

| alloc | time |
| --- | --- |
| main | 35 s |
| galloc | 200 ms |
| bump | 195 ms |
| galloc + this PR | 120 ms |
| bump + this PR | #GP |
| this PR | 90 ms |
| host | 10 ms |

bump + this PR is broken because it incorrectly aligns allocations.
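
For context, honoring a requested alignment in a bump allocator amounts to rounding the cursor up before handing out memory. The following is a generic sketch of that rounding, not the allocator in question; `Bump` and `align_up` are hypothetical names, and the allocator only hands out addresses rather than real memory.

```rust
use std::alloc::Layout;

/// Round `addr` up to the next multiple of `align`, which must be a power of two.
fn align_up(addr: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (addr + align - 1) & !(align - 1)
}

/// Hypothetical bump allocator over the address range `next..end`.
struct Bump {
    next: usize,
    end: usize,
}

impl Bump {
    /// Return an address that satisfies `layout`, or `None` if the range is exhausted.
    fn alloc(&mut self, layout: Layout) -> Option<usize> {
        // Align the cursor to what this particular allocation asks for.
        let start = align_up(self.next, layout.align());
        let end = start.checked_add(layout.size())?;
        if end > self.end {
            return None;
        }
        self.next = end;
        Some(start)
    }
}

fn main() {
    let mut bump = Bump { next: 0x1001, end: 0x2000 };
    let a = bump.alloc(Layout::from_size_align(3, 1).unwrap()).unwrap();
    let b = bump.alloc(Layout::from_size_align(16, 16).unwrap()).unwrap();
    // The second allocation must start on a 16-byte boundary.
    assert_eq!(b % 16, 0);
    println!("a = {a:#x}, b = {b:#x}");
}
```

When every request is forced to the same large alignment, a missing or wrong rounding step can go unnoticed; with per-allocation alignments it has to be correct for each layout.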

Depends on #1983.
Closes #1935.
Closes #1940.
Closes #1968.

github-actions

Benchmark Results

Benchmark	Current: `8123de4`	Previous: `8e08755`	Performance Ratio
startup_benchmark Build Time	`110.72` s	`118.64` s	`0.93`
startup_benchmark File Size	`0.90` MB	`0.90` MB	`0.99`
Startup Time - 1 core	`0.89` s (`±0.03` s)	`0.92` s (`±0.03` s)	`0.98`
Startup Time - 2 cores	`0.91` s (`±0.03` s)	`0.93` s (`±0.03` s)	`0.98`
Startup Time - 4 cores	`0.91` s (`±0.03` s)	`0.90` s (`±0.03` s)	`1.01`
multithreaded_benchmark Build Time	`111.78` s	`115.52` s	`0.97`
multithreaded_benchmark File Size	`1.00` MB	`1.01` MB	`0.99`
Multithreaded Pi Efficiency - 2 Threads	`85.12` % (`±9.57` %)	`89.14` % (`±8.69` %)	`0.95`
Multithreaded Pi Efficiency - 4 Threads	`42.06` % (`±3.67` %)	`43.33` % (`±4.80` %)	`0.97`
Multithreaded Pi Efficiency - 8 Threads	`24.88` % (`±1.56` %)	`24.89` % (`±3.25` %)	`1.00`
micro_benchmarks Build Time	`111.64` s	`121.35` s	`0.92`
micro_benchmarks File Size	`1.00` MB	`1.01` MB	`0.99`
Scheduling time - 1 thread	`75.70` ticks (`±5.38` ticks)	`64.78` ticks (`±3.48` ticks)	`1.17`
Scheduling time - 2 threads	`43.43` ticks (`±5.88` ticks)	`39.58` ticks (`±5.81` ticks)	`1.10`
Micro - Time for syscall (getpid)	`3.54` ticks (`±0.37` ticks)	`3.11` ticks (`±0.35` ticks)	`1.14`
Memcpy speed - (built_in) block size 4096	`68554.82` MByte/s (`±48794.10` MByte/s)	`76892.26` MByte/s (`±53208.68` MByte/s)	`0.89`
Memcpy speed - (built_in) block size 1048576	`29459.00` MByte/s (`±24153.57` MByte/s)	`42437.70` MByte/s (`±29401.77` MByte/s)	`0.69`
Memcpy speed - (built_in) block size 16777216	`27922.81` MByte/s (`±23184.88` MByte/s)	`24216.95` MByte/s (`±20131.45` MByte/s)	`1.15`
Memset speed - (built_in) block size 4096	`69240.91` MByte/s (`±49188.21` MByte/s)	`76746.14` MByte/s (`±53110.30` MByte/s)	`0.90`
Memset speed - (built_in) block size 1048576	`30241.21` MByte/s (`±24590.79` MByte/s)	`42677.86` MByte/s (`±29565.07` MByte/s)	`0.71`
Memset speed - (built_in) block size 16777216	`28703.15` MByte/s (`±23639.72` MByte/s)	`24944.26` MByte/s (`±20616.34` MByte/s)	`1.15`
Memcpy speed - (rust) block size 4096	`61490.56` MByte/s (`±45358.39` MByte/s)	`72842.39` MByte/s (`±51003.78` MByte/s)	`0.84`
Memcpy speed - (rust) block size 1048576	`29444.81` MByte/s (`±24209.84` MByte/s)	`42614.04` MByte/s (`±29546.77` MByte/s)	`0.69`
Memcpy speed - (rust) block size 16777216	`28129.09` MByte/s (`±23368.74` MByte/s)	`25125.16` MByte/s (`±20845.63` MByte/s)	`1.12`
Memset speed - (rust) block size 4096	`62376.72` MByte/s (`±45990.87` MByte/s)	`73212.71` MByte/s (`±51231.34` MByte/s)	`0.85`
Memset speed - (rust) block size 1048576	`30207.40` MByte/s (`±24646.85` MByte/s)	`42865.24` MByte/s (`±29720.96` MByte/s)	`0.70`
Memset speed - (rust) block size 16777216	`28847.47` MByte/s (`±23765.07` MByte/s)	`25840.21` MByte/s (`±21299.14` MByte/s)	`1.12`
alloc_benchmarks Build Time	`105.91` s	`106.38` s	`1.00`
alloc_benchmarks File Size	`0.96` MB	`0.97` MB	`1.00`
Allocations - Allocation success	`100.00` %	`100.00` %	`1`
Allocations - Deallocation success	`70.05` % (`±0.27` %)	`70.03` % (`±0.28` %)	`1.00`
Allocations - Pre-fail Allocations	`100.00` %	`100.00` %	`1`
Allocations - Average Allocation time	`8576.21` Ticks (`±445.05` Ticks)	`11224.94` Ticks (`±674.74` Ticks)	`0.76`
Allocations - Average Allocation time (no fail)	`8576.21` Ticks (`±445.05` Ticks)	`11224.94` Ticks (`±674.74` Ticks)	`0.76`
Allocations - Average Deallocation time	`1106.42` Ticks (`±328.69` Ticks)	`733.24` Ticks (`±64.51` Ticks)	`1.51`
mutex_benchmark Build Time	`105.97` s	`111.63` s	`0.95`
mutex_benchmark File Size	`1.01` MB	`1.01` MB	`0.99`
Mutex Stress Test Average Time per Iteration - 1 Threads	`12.62` ns (`±0.72` ns)	`13.10` ns (`±0.85` ns)	`0.96`
Mutex Stress Test Average Time per Iteration - 2 Threads	`12.76` ns (`±0.97` ns)	`14.56` ns (`±0.90` ns)	`0.88`

This comment was automatically generated by workflow using github-action-benchmark.

jounathaen · 2025-10-15T14:57:21Z

Very nice.

I do think we should not include the overalign feature, but instead be clever and thread-aware on small allocations. This would then also help with or solve #1984.

mkroening · 2025-10-16T14:20:02Z

> I do think we should not include the overalign feature, but instead be clever and thread-aware on small allocations. This would then also help with or solve #1984.

I agree. I have completely removed our intermediate layer. We can reintroduce it if it turns out to be useful in the future.

mkroening requested review from jounathaen and stlankes October 14, 2025 17:45

mkroening self-assigned this Oct 14, 2025

github-actions bot reviewed Oct 14, 2025

mkroening mentioned this pull request Oct 15, 2025

Poor multi-threaded allocation performance #1984

Open

mkroening force-pushed the feat-overalign branch 2 times, most recently from 6e4dfab to 0a3a6b8 Compare October 16, 2025 14:15

mkroening changed the title ~~perf(alloc): don't overalign allocations by default~~ perf(alloc): don't overalign allocations Oct 16, 2025

mkroening force-pushed the feat-overalign branch from 0a3a6b8 to 4e780f5 Compare October 16, 2025 14:30

perf(alloc): don't overalign allocations

8123de4

mkroening force-pushed the feat-overalign branch from 4e780f5 to 8123de4 Compare October 20, 2025 08:41

mkroening added this pull request to the merge queue Oct 20, 2025

Merged via the queue into main with commit 89a5d77 Oct 20, 2025
17 checks passed

mkroening deleted the feat-overalign branch October 20, 2025 10:56
