
Conversation

@NickBarnes
Contributor

NickBarnes commented Oct 29, 2024

This is the upstreaming of oxcaml/oxcaml#2348, oxcaml/oxcaml#2358 (minor), and oxcaml/oxcaml#3029 by @stedolan. It introduces a new sweep-only phase at the start of each major GC cycle. This reduces the "latent garbage delay" - the time between a block becoming unreachable and the block becoming available for allocation - by approximately half a major GC cycle.

Because marking, including root marking, doesn't take place until part-way through the GC cycle (when we move from sweep-only to mark-and-sweep), the allocation colour is not always MARKED but changes from UNMARKED to MARKED at that point. Effectively we switch from a grey mutator allocating white to a black mutator allocating black.
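As a rough illustration (a sketch with hypothetical names, not the actual runtime code), the phase/colour relationship described above is:

/* Sketch only: hypothetical names, not the real runtime API. */
typedef enum { Phase_sweep_only, Phase_mark_and_sweep } gc_phase;
typedef enum { UNMARKED, MARKED } colour;

static gc_phase phase = Phase_sweep_only;

/* Colour given to freshly allocated major-heap blocks. */
static colour allocation_colour(void)
{
  /* While only sweeping, allocate UNMARKED ("white"): these blocks
     will be marked along with everything else once root marking
     starts. After marking begins, allocate MARKED ("black") so that
     new blocks survive the current cycle. */
  return phase == Phase_sweep_only ? UNMARKED : MARKED;
}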

This PR is in draft because I've just done a fairly mechanical (although manual) patch application; I'm publishing it so that @stedolan and perhaps @kayceesrk can take a look. It passes the whole testsuite on my machine, including the new test (parallel/churn.ml) written by @stedolan for the flambda-backend mark-delay work.

@NickBarnes force-pushed the nick-markdelay branch 2 times, most recently from 31fc961 to 72500e9 on October 29, 2024 16:56
@kayceesrk
Contributor

kayceesrk commented Oct 29, 2024

Thanks! I’ll have a look tomorrow.

@NickBarnes added the run-multicoretests label (Makes the CI run the multicore testsuite) on Oct 29, 2024
@NickBarnes force-pushed the nick-markdelay branch 3 times, most recently from b341dd4 to 4a6c7af on October 31, 2024 09:58
@NickBarnes
Contributor Author

I've addressed @kayceesrk's review comments and rebased.

@NickBarnes marked this pull request as ready for review on October 31, 2024 10:27
@NickBarnes
Contributor Author

Thanks to @kayceesrk and @stedolan, I think this can come out of draft now.

@kayceesrk
Contributor

The code looks ok now.

MSVC 32-bit has a failing test. I'll wait until the test is fixed before I approve the PR.

@kayceesrk
Contributor

> MSVC 32-bit has a failing test. I'll wait until the test is fixed before I approve the PR.

Any insights on this so far?

@NickBarnes marked this pull request as draft on November 8, 2024 14:38
@NickBarnes
Contributor Author

The Win32 problem is due to an accounting error, observed across platforms, which causes the work_counter to trail further and further behind the alloc_counter. Eventually, on 32-bit platforms, the difference is enough to trigger an underflow, giving negative work budgets. At that point, progress on the current cycle stalls but allocation continues, causing the heap to balloon. On 32-bit Windows, this leads to out-of-memory at a heap of just under 2 GiB. On 32-bit Linux machines with enough memory, the heap grows past 2 GiB, the alloc_counter wraps around once more, and work budgets return to positive territory, so the GC is able to progress again, successfully collecting, until the problem repeats.
On 64-bit platforms the same accounting problem is observed, but it doesn't cause the same symptoms.
Taking this PR back into draft mode until I've got a fix.

@NickBarnes
Contributor Author

(note: this accounting problem is specific to this PR, and not observed on trunk).

@NickBarnes
Contributor Author

NickBarnes commented Nov 9, 2024

I have further diagnosed the accounting problem and put in a dirty hack to prove my hypothesis.
On the trunk, we maintain alloc_counter as a global estimate of the GC work necessary to keep up with allocation, and work_counter as a global record of the total amount of GC work done. Both counters are intnat-sized, and on 32-bit platforms they wrap around. For each slice, in each domain, we work until the work_counter exceeds the slice_target - the value that the alloc_counter had at the start of the slice - or until we run out of work to do.
In this way, the two counters remain approximately in sync.
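In outline, each slice then looks something like the following (a sketch with hypothetical names; the real code in runtime/major_gc.c is more involved):

typedef long intnat;

extern intnat do_gc_work(void);  /* hypothetical: returns work done, 0 if none left */

static void major_slice(intnat *work_counter, intnat alloc_counter)
{
  intnat slice_target = alloc_counter;      /* snapshot at slice start */
  /* Signed difference, so the comparison survives counter wraparound
     as long as the counters stay within half the intnat range. */
  while (*work_counter - slice_target < 0) {
    intnat done = do_gc_work();
    if (done == 0) break;                   /* ran out of work to do */
    *work_counter += done;
  }
}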
However, on this branch, we delay marking until after sweeping has finished on at least one domain, and we always start marking in a fresh slice (after a minor GC). So on the last sweeping slice (on the domain which reaches the end of sweeping first), roughly half of the work expected in the slice goes unused, and work_counter slips behind alloc_counter by that amount. At the end of the slice, instead of work_counter slightly exceeding slice_target (or not quite reaching it, if we got to the end of the collection), it falls behind by, on average, half a slice's worth of work. Depending on the workload, there may be too little marking to catch up on the next slice (for instance, testsuite/tests/regression/pr5757/pr5757.ml has very little marking to do).
So, as the process continues, work_counter gradually falls further and further behind alloc_counter. This may lead to larger slice budgets, and thus longer (but fewer) GC pauses, but should otherwise be harmless. However, as described in a previous comment, on 32-bit platforms the shortfall is eventually enough to trigger an underflow, giving negative work budgets. At that point, progress on the current cycle stalls but allocation continues, causing the heap to balloon. On 32-bit Windows we eventually hit the 2 GiB process memory limit, running out of memory with a heap of just under 2 GiB. On 32-bit Linux machines with enough memory (the process limit there is 3 GiB), we can continue past that point, advancing alloc_counter until it overtakes work_counter and pushes work budgets back into positive territory; the GC then progresses again, collecting successfully, until the problem repeats.
On 64-bit platforms the same accounting problem is observed, but it doesn't cause the same symptoms. Instead, the slice targets grow larger and larger, which may exacerbate the problem, since all the sweeping is easily completed in a single slice.
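The 32-bit failure mode can be demonstrated in isolation (again a sketch with hypothetical names; intnat here stands in for the 32-bit case):

#include <stdint.h>
#include <stdio.h>

typedef int32_t intnat;   /* intnat on a 32-bit platform */

/* Slice budget computed with wraparound arithmetic: unsigned
   subtraction, reinterpreted as signed. */
static intnat slice_budget(intnat slice_target, intnat work_counter)
{
  return (intnat)((uint32_t)slice_target - (uint32_t)work_counter);
}

int main(void)
{
  intnat work = 0;   /* work_counter lagging further behind each cycle */
  for (int64_t gap = 0; gap <= INT64_C(1) << 32; gap += INT64_C(1) << 29) {
    intnat target = (intnat)(uint32_t)gap;  /* alloc_counter keeps advancing */
    printf("gap=%11lld budget=%11d\n",
           (long long)gap, slice_budget(target, work));
  }
  return 0;
}

The budget is sensible while the lag is under half the counter range, then goes negative (work stalls), and only returns to non-negative once alloc_counter has lapped work_counter entirely - matching the behaviour described above.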

My hack, which should not be merged, artificially consumes the rest of the slice budget at the point at which sweeping is first completed; it demonstrates that addressing this accounting issue fixes the problem.
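Concretely, the hack amounts to something like this (a sketch with hypothetical names, reusing intnat from the sketches above), applied on the domain that finishes sweeping first:

/* When sweeping finishes with budget still left in the slice, charge
   the unused remainder to work_counter, so that it keeps pace with
   alloc_counter instead of slipping half a slice behind per cycle. */
static void on_sweeping_complete(intnat slice_target, intnat *work_counter)
{
  if (*work_counter - slice_target < 0)  /* budget remaining */
    *work_counter = slice_target;        /* consume it artificially */
}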
Paging @stedolan and @damiendoligez for thoughts.
I also have a sketch of a refactor of major_collection_slice to make this sort of thing easier to reason about. I hope to make a PR for that in the next week or two.

@NickBarnes
Contributor Author

Yes, as an interim fix, we shouldn't require available work to update the finaliser tables. This isn't ideal, and I've added a TODO about it.

@kayceesrk
Contributor

@jmid the multicore tests failure seems to be an internal error. Is that correct?

@kayceesrk
Contributor

Apart from the unrelated failure in multicore tests, the usual compiler testsuite passes cleanly. I'm satisfied with the PR. I've already left an approval.

@jmid
Member

jmid commented Jul 17, 2025

> @jmid the multicore tests failure seems to be an internal error. Is that correct?

Sort of; I explained it earlier here: #13580 (comment)

I kicked off a fresh test run here: https://github.com/ocaml-multicore/multicoretests/tree/target-mark-delay-13580
and still see it triggering a debug-runtime error. I can't tell which one, though, since the log is overrun by "Updating memprof" messages and cut off: https://github.com/ocaml-multicore/multicoretests/actions/runs/16350616902/job/46196141249

If you want to give this a go, I've previously shared instructions for building and running the test suite locally here: #13580 (comment)

@jmid
Member

jmid commented Jul 17, 2025

I've been able to narrow this down a bit - both in the CI, with v=61 to silence the excessive messaging:
https://github.com/ocaml-multicore/multicoretests/actions/runs/16356041532/job/46216244437#step:16:1707

random seed: 202969821
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin Ephemeron stress test with Domain
[02] file runtime/major_gc.c; line 491 ### Assertion failed: Has_status_val(v, status)
File "src/ephemeron/dune", line 20, characters 7-16:
20 |  (name lin_tests)
            ^^^^^^^^^
(cd _build/default/src/ephemeron && ./lin_tests.exe --verbose)
Command got signal ILL.

and locally, where I just commented out the debug-message line:

diff --git a/runtime/minor_gc.c b/runtime/minor_gc.c
index 247382c227..6e2b42f339 100644
--- a/runtime/minor_gc.c
+++ b/runtime/minor_gc.c
@@ -881,7 +881,7 @@ caml_stw_empty_minor_heap_no_major_slice(caml_domain_state* domain,
   }
 
   CAML_EV_BEGIN(EV_MINOR_MEMPROF_CLEAN);
-  CAML_GC_MESSAGE(MINOR, "Updating memprof.\n");
+  /* CAML_GC_MESSAGE(MINOR, "Updating memprof.\n"); */
   caml_memprof_after_minor_gc(domain);
   CAML_EV_END(EV_MINOR_MEMPROF_CLEAN); 

Here I was able to reproduce the assertion error a couple of times. It triggers once every 15 runs or so (at the beginning of a PBT run, so the count can be reduced):

$ OCAMLRUNPARAM='s=4096,V=1' dune exec --profile=debug-runtime src/ephemeron/lin_tests.exe -- -v
### OCaml runtime: debug mode ###   
### set OCAMLRUNPARAM=v=0 to silence this message
random seed: 361524320
generated error fail pass / total     time test name
[ ]    0    0    0    0 / 1000     0.0s Lin Ephemeron stress test with Domain[01] file runtime/major_gc.c; line 491 ### Assertion failed: Has_status_val(v, status)
Illegal instruction (core dumped)

Resulting in this stack trace:

Program terminated with signal SIGILL, Illegal instruction.
#0  caml_abort () at runtime/caml/misc.h:370
370	    __builtin_trap();
[Current thread is 1 (Thread 0x7f0f691bb640 (LWP 2935090))]
(gdb) thread apply all bt

Thread 6 (Thread 0x7f0f627fc640 (LWP 2935093)):
#0  futex_wait (private=0, expected=2, futex_word=0x559dec5c0000) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x559dec5c0000, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f0f69f41002 in lll_mutex_lock_optimized (mutex=0x559dec5c0000) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x559dec5c0000) at ./nptl/pthread_mutex_lock.c:93
#4  0x0000559db367462f in caml_plat_lock_blocking (m=0x559dec5c0000) at runtime/caml/platform.h:457
#5  backup_thread_func (v=0x559dec5bff60) at runtime/domain.c:1095
#6  0x00007f0f69f3dac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7  0x00007f0f69fcf850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 5 (Thread 0x7f0f63fff640 (LWP 2935091)):
#0  futex_wait (private=0, expected=2, futex_word=0x559dec5bfef0) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x559dec5bfef0, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f0f69f41002 in lll_mutex_lock_optimized (mutex=0x559dec5bfef0) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x559dec5bfef0) at ./nptl/pthread_mutex_lock.c:93
#4  0x0000559db367462f in caml_plat_lock_blocking (m=0x559dec5bfef0) at runtime/caml/platform.h:457
#5  backup_thread_func (v=0x559dec5bfe50) at runtime/domain.c:1095
#6  0x00007f0f69f3dac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7  0x00007f0f69fcf850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 4 (Thread 0x7f0f69ea6740 (LWP 2934676)):
#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x559dec623918) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x559dec623918) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x559dec623918, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007f0f69f3ca41 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x559dec623930, cond=0x559dec6238f0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=cond@entry=0x559dec6238f0, mutex=mutex@entry=0x559dec623930) at ./nptl/pthread_cond_wait.c:627
#5  0x0000559db369de23 in sync_condvar_wait (m=0x559dec623930, c=0x559dec6238f0) at runtime/sync_posix.h:116
#6  caml_ml_condition_wait (wcond=<optimized out>, wmut=<optimized out>) at runtime/sync.c:193
#7  <signal handler called>
#8  0x0000559db36266fe in camlStdlib__Domain$loop_769 () at domain.ml:294
#9  0x0000559db36257ee in camlStdlib__Mutex$protect_280 () at mutex.ml:28
#10 0x0000559db362669c in camlStdlib__Domain$join_766 () at domain.ml:299
#11 0x0000559db35a5d7c in camlLin_domain$run_parallel_788 () at lib/lin_domain.ml:24
#12 0x0000559db35a6147 in camlLin_domain$stress_prop_938 () at lib/lin_domain.ml:43
#13 0x0000559db35acfba in camlUtil$fun_2564 () at lib/util.ml:5
#14 0x0000559db35c43fd in camlQCheck2$loop_4296 () at src/core/QCheck2.ml:1805
#15 0x0000559db35c4344 in camlQCheck2$run_law_4291 () at src/core/QCheck2.ml:1810
#16 0x0000559db35c5104 in camlQCheck2$check_state_input_4359 () at src/core/QCheck2.ml:1933
#17 0x0000559db35c555e in camlQCheck2$check_cell_inner_11009 () at src/core/QCheck2.ml:2003
#18 0x0000559db35b53b7 in camlQCheck_base_runner$aux_map_1464 () at src/runner/QCheck_base_runner.ml:425
#19 0x0000559db360153d in camlStdlib__List$map_340 () at list.ml:85
#20 0x0000559db35b4c73 in camlQCheck_base_runner$run_tests_inner_2470 () at src/runner/QCheck_base_runner.ml:434
#21 0x0000559db35b5a4d in camlQCheck_base_runner$run_tests_main_inner_3102 () at src/runner/QCheck_base_runner.ml:478
#22 0x0000559db35a582f in camlDune__exe__Lin_tests$entry () at src/ephemeron/lin_tests.ml:35
#23 0x0000559db359f48c in caml_program ()
#24 <signal handler called>
#25 0x0000559db36a3b9a in caml_startup_common (pooling=<optimized out>, argv=0x7ffc2f6cc2a8) at runtime/startup_nat.c:127
#26 caml_startup_common (argv=0x7ffc2f6cc2a8, pooling=<optimized out>) at runtime/startup_nat.c:86
#27 0x0000559db36a3c2f in caml_startup_exn (argv=<optimized out>) at runtime/startup_nat.c:134
#28 caml_startup (argv=<optimized out>) at runtime/startup_nat.c:139
#29 caml_main (argv=<optimized out>) at runtime/startup_nat.c:146
#30 0x0000559db359f022 in main (argc=<optimized out>, argv=<optimized out>) at runtime/main.c:37

Thread 3 (Thread 0x7f0f689ba640 (LWP 2934691)):
#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x559dec5bfdb4) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x559dec5bfdb4) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x559dec5bfdb4, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007f0f69f3ca41 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x559dec5bfd60, cond=0x559dec5bfd88) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=cond@entry=0x559dec5bfd88, mutex=mutex@entry=0x559dec5bfd60) at ./nptl/pthread_cond_wait.c:627
#5  0x0000559db3695beb in caml_plat_wait (cond=cond@entry=0x559dec5bfd88, mut=mut@entry=0x559dec5bfd60) at runtime/platform.c:146
#6  0x0000559db367479a in backup_thread_func (v=0x559dec5bfd40) at runtime/domain.c:1087
#7  0x00007f0f69f3dac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#8  0x00007f0f69fcf850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 2 (Thread 0x7f0f62ffd640 (LWP 2935092)):
#0  0x0000559db36948ad in caml_gc_log (msg=msg@entry=0x559db36b0490 "requesting stw empty_minor_heap") at runtime/misc.c:95
#1  0x0000559db36943af in caml_try_empty_minor_heap_on_all_domains () at runtime/minor_gc.c:952
#2  0x0000559db3694445 in caml_empty_minor_heaps_once () at runtime/minor_gc.c:978
#3  0x0000559db36753fa in caml_domain_terminate (last=last@entry=false) at runtime/domain.c:2088
#4  0x0000559db3675a79 in domain_thread_func (v=<optimized out>) at runtime/domain.c:1274
#5  0x00007f0f69f3dac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6  0x00007f0f69fcf850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7f0f691bb640 (LWP 2935090)):
#0  caml_abort () at runtime/caml/misc.h:370
#1  caml_failed_assert (expr=expr@entry=0x559db36aeaec "Has_status_val(v, status)", file_os=file_os@entry=0x559db36ae991 "runtime/major_gc.c", line=line@entry=491) at runtime/misc.c:51
#2  0x0000559db3687c3c in orph_ephe_list_verify_status (status=<optimized out>) at runtime/major_gc.c:491
#3  0x0000559db368afd4 in caml_mark_roots_stw (participant_count=participant_count@entry=3, barrier_participants=barrier_participants@entry=0x559dec5c8550) at runtime/major_gc.c:1573
#4  0x0000559db3693bd6 in caml_stw_empty_minor_heap_no_major_slice (domain=<optimized out>, mark_requested_p=<optimized out>, participating_count=<optimized out>, participating=<optimized out>) at runtime/minor_gc.c:890
#5  0x0000559db36744d4 in stw_handler (domain=0x7f0f64002b80) at runtime/domain.c:1535
#6  handle_incoming (s=0x559dec5bfe68) at runtime/domain.c:360
#7  0x0000559db36750cc in caml_handle_incoming_interrupts () at runtime/domain.c:373
#8  caml_handle_gc_interrupt () at runtime/domain.c:1967
#9  0x0000559db369b983 in caml_do_pending_actions_res () at runtime/signals.c:344
#10 0x0000559db36944c5 in caml_alloc_small_dispatch (dom_st=0x7f0f64002b80, wosize=3, flags=3, nallocs=1, encoded_alloc_lens=0x559db3785047 <camlStdlib__Ephemeron$frametable+4431> "\002\070\v") at runtime/minor_gc.c:1003
#11 <signal handler called>
#12 0x0000559db365afc2 in camlStdlib__Ephemeron$replace_793 () at ephemeron.ml:307
#13 0x0000559db35abef6 in camlLin$run_2234 () at lib/lin.ml:515
#14 0x0000559db3607e96 in camlStdlib__Array$map_355 () at array.ml:126
#15 0x0000559db35a5b76 in camlLin_domain$interp_644 () at lib/lin_domain.ml:10
#16 0x0000559db35a5e9e in camlLin_domain$main_818 () at lib/lin_domain.ml:20
#17 0x0000559db36265c6 in camlStdlib__Domain$body_759 () at domain.ml:268
#18 <signal handler called>
#19 0x0000559db367037a in caml_callback_exn (closure=<optimized out>, closure@entry=139704173766104, arg=<optimized out>, arg@entry=1) at runtime/callback.c:206
#20 0x0000559db3670ced in caml_callback_res (closure=closure@entry=139704173766104, arg=arg@entry=1) at runtime/callback.c:321
#21 0x0000559db3675952 in domain_thread_func (v=0x7ffc2f6cc050) at runtime/domain.c:1269
#22 0x00007f0f69f3dac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#23 0x00007f0f69fcf850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) 

(I am supposed to be on holiday, so I'll let someone else debug from here 🙂 )

@kayceesrk
Contributor

kayceesrk commented Jul 18, 2025

I can't seem to recreate this failure on my machine :-/

The failure was surprisingly easy to recreate using rr, but normal runs would never fail for me. Investigating this.

@Octachron added this to the 5.5 features milestone on Jul 18, 2025
@kayceesrk
Contributor

@NickBarnes @jmid The fix for the multicoretests failure is here: NickBarnes#5.

@jmid
Member

jmid commented Jul 19, 2025

I've given your fix a spin on the CI (incl. repeating the above failing test) and it held up nicely 👍

gasche added a commit to gasche/ocaml that referenced this pull request Jul 19, 2025
In
  ocaml#13580 (comment)
jmid reports that he needed to tweak the GC verbosity setting to avoid
getting spammed by minor-gc messages when debugging an assertion
failure.

The other sub-phases of the minor GC all use `caml_gc_log` rather
than CAML_GC_MESSAGE, and do not seem to cause similar spamming
issues. Making the code consistent will avoid inconsistent
verbosity levels in end-user scripts.
Move the orphaned ephemerons GC colour check inside the barrier.
@kayceesrk
Contributor

kayceesrk commented Jul 24, 2025

The multicoretests run passes now: https://github.com/ocaml/ocaml/actions/runs/16492771907/job/46633963474?pr=13580. The other multicoretests failure is unrelated to this PR's changes; see the earlier comment: #13580 (comment).

I believe the PR is good to merge.

@kayceesrk
Contributor

Merging the PR now. Thanks @NickBarnes. Looking forward to the GC pacing improvement PRs next.

@kayceesrk merged commit 186ee81 into ocaml:trunk on Jul 29, 2025
23 of 27 checks passed
Sylvain78 pushed a commit to Sylvain78/ocaml that referenced this pull request Jul 29, 2025

Labels

gc · Performance (PR or issues affecting runtime performance of the compiled programs) · run-multicoretests (Makes the CI run the multicore testsuite)
