scx_chaos: use peek operation to optimise for empty delay dsq #2894
Conversation
Some comments but no blockers! Looks good.
```c
	 scx_bpf_pick_any_cpu_node(cpus_allowed, node, flags) : \
	 scx_bpf_pick_any_cpu(cpus_allowed, flags))

#define __COMPAT_scx_bpf_dsq_peek(dsq_id) \
```
This is different from the compat macro (actually an inline function) in the V5 that we ended up with and that TJ moved to his branch. It probably doesn't matter, but it does make me curious what the rules are for updating this file.

Shouldn't it be pulled automatically from the kernel somewhere? I don't see any other scripts in the repo that mention compat.bpf.h -- I was expecting some update script that copies it from the kernel.
Ah, this has landed upstream now. I grabbed this from #2675 and plan to merge after that goes in; I'll take whatever's there (I don't actually need the compat macro).
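For readers skimming the hunk above: compat.bpf.h wrappers of this kind gate a new kfunc on whether the running kernel exports it. Below is an illustrative reconstruction of the `__COMPAT_scx_bpf_pick_any_cpu_node` wrapper the visible fragment belongs to; the `bpf_ksym_exists` line is my assumption about the part cut off above, and the upstream file may word it differently:

```c
/*
 * Illustrative reconstruction, not a verbatim copy of compat.bpf.h:
 * use the NUMA-aware kfunc when the kernel exports it, otherwise fall
 * back to the older non-node variant visible in the hunk above.
 */
#define __COMPAT_scx_bpf_pick_any_cpu_node(cpus_allowed, node, flags)	\
	(bpf_ksym_exists(scx_bpf_pick_any_cpu_node) ?			\
	 scx_bpf_pick_any_cpu_node(cpus_allowed, node, flags) :	\
	 scx_bpf_pick_any_cpu(cpus_allowed, flags))
```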
```c
		return U64_MAX;
	}

	first_p = bpf_task_from_pid(first_p->pid);
```
Maybe you could help me understand if this is strictly required or just being extra cautious.
- We can access the field `first_p->pid` before the `bpf_task_from_pid` because it is simple scalar data.
- The field `first_p->scx.dsq_vtime` is in the `task_struct`, inside the nested `struct sched_ext_entity`, which is stored contiguously within the parent struct (not via a pointer).

It seems to me that this effectively makes `first_p->scx.dsq_vtime` accessible without the `bpf_task_from_pid`/`bpf_task_release` protocol, because it's effectively scalar data. But I take it there's some rule about not referencing "complex data" without a proper reference-counted handle, to leave open more implementation leeway (on different architectures/compilers or something) with these nested structs?
Yes! This is good. I did try removing it before and had a verifier error, but I must have messed it up. Seems fine now on my kernel.
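For context, here is a hedged reconstruction of the helper this hunk appears to come from, stitched together from the fragments in this PR (the `return U64_MAX` / `bpf_task_from_pid` lines above and the `delay_dsq_next_time()` call site below); the real code may differ in its details:

```c
static u64 delay_dsq_next_time(u64 dsq_id)
{
	struct task_struct *first_p;
	u64 vtime;

	/* Peek does not lock or consume the DSQ; NULL means it is empty. */
	first_p = scx_bpf_dsq_peek(dsq_id);
	if (!first_p)
		return U64_MAX;

	/*
	 * Reading first_p->pid is plain scalar data. The re-acquire below
	 * gives the verifier a reference-counted pointer before the nested
	 * scx.dsq_vtime read; per the thread above it may turn out to be
	 * unnecessary on newer kernels.
	 */
	first_p = bpf_task_from_pid(first_p->pid);
	if (!first_p)
		return U64_MAX;

	vtime = first_p->scx.dsq_vtime;
	bpf_task_release(first_p);

	return vtime;
}
```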
```c
	bpf_for_each(scx_dsq, p, get_cpu_delay_dsq(-1), 0) {
		// Check if we need to process the delay DSQ
		if (delay_dsq_next_time(dsq_id) > now)
			goto p2dq;
```
This makes sense as the short-circuit. I want to quickly make sure the races are benign. Normally we would expect a peek/vtime reading to be a lower bound -- other cores could race and asynchronously pop the task and increase the vtime after our observation (but not decrease it, because of monotonicity).

But here, because we only peek our own `delay_dsq`... maybe we don't even need to worry about that, since no one else will dispatch from it?

Do we need to rely on non-interference with the delay DSQ here? I think not: if the head of the delay DSQ was popped and its head-delayed-time moved even FURTHER into the future, then it is even more ahead of our fixed snapshot of `now`, and we remain justified in taking the fast-path `goto p2dq`.

Nevertheless, reasoning about the concurrency safety here seems a bit scary, because `bpf_ktime_get_ns` is real time that is constantly marching forward: our `now` snapshots are immediately stale, and we don't get to control the advance of time like in a simulation. If `now` and "earliest vtime in the delay queue" are both variables that change asynchronously in the background, then an atomic snapshot of just one of them can never guarantee that a moment in time exists where they have a particular ordering relationship.
Thanks for this. The two parts make a lot of sense: information loss between DSQ insertions/the check in `chaos_dispatch`, and the check in the timer being out of date.

`chaos_enqueue` and `chaos_dispatch` both only enqueue/dequeue from their local DSQ. There should be no possible interference there, and every enqueue should show up in the related dispatch. If we ever place the task onto a delay DSQ that doesn't belong to the CPU that triggered the enqueue, this would need a rethink, but I don't see any reason we'd need to do that.

The timer is a much more interesting case. I initially did this work on an Intel machine with a unified cache and didn't see many stalls after I fixed the logic. However, on my multi-CCX EPYC Rome machine I'm seeing fairly consistent stalls. I'll play around with this, but I believe it's the timer not triggering dispatch when it should, as I don't see how this would be a problem between enqueue/dispatch in isolation.
Force-pushed d34a00d to df8e559.
Thanks @rrnewton for the comprehensive review! I made a few changes to stat collection, which was wrong from an old refactor, and corrected a weird return from that too. I've also found some reliable stalls, so I'll need to debug those before landing.
Hey @JakeHillion @rrnewton maybe we can add the new …
Yeah, they're only in this PR so I could share a working version. This one is stalling so it will stay a draft for a while, but I think the LAVD one might be ready to go. Otherwise we can merge the header updates by themselves.
Use the new `scx_bpf_dsq_peek` in scx_chaos to optimise for the fast path. This avoids locking the DSQs and should be beneficial in the common case where the DSQ is empty or nothing in it is ready yet.

Add a few stats for tracking how successful peek is. This works really well on my local machine for skipping the hot path.

This mostly avoids contention with the crawling timer thread, as the insertion in `chaos_enqueue` and removal in `chaos_dispatch` are all local to one CPU and the locking overhead would be minimal.
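A minimal sketch of that fast path, using the helper names visible in the review hunks (`get_cpu_delay_dsq()`, `delay_dsq_next_time()`); the surrounding logic in the real `chaos_dispatch()` is simplified away here:

```c
	u64 now = bpf_ktime_get_ns();
	u64 dsq_id = get_cpu_delay_dsq(-1);

	/*
	 * Peek-based gate: if the delay DSQ is empty (reported as U64_MAX)
	 * or its earliest task is not due yet, jump straight to the normal
	 * p2dq dispatch path without locking or walking the delay DSQ.
	 */
	if (delay_dsq_next_time(dsq_id) > now)
		goto p2dq;

	/* Otherwise walk the delay DSQ and move tasks whose delay expired. */
```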
Test plan:
- CI
```
jake@merlin:/data/users/jake/repos/scx/ > cargo build --release -p scx_chaos && sudo target/release/scx_chaos --random-delay-frequency 0.01 --random-delay-min-us 100000 --random-delay-max-us 200000 --stats 10
...
    Finished `release` profile [optimized] target(s) in 1m 01s
11:28:59 [INFO] Running scx_chaos (build ID: 1.0.20-ga6134e95-dirty x86_64-unknown-linux-gnu)
11:28:59 [INFO] Builder { traits: [RandomDelays { frequency: 0.01, min_us: 100000, max_us: 200000 }], verbose: 0, kprobe_random_delays: None, p2dq_opts: SchedulerOpts { disable_kthreads_local: false, autoslice: false, interactive_ratio: 10, deadline: false, eager_load_balance: false, freq_control: false, greedy_idle_disable: true, interactive_sticky: false, interactive_fifo: false, dispatch_pick2_disable: false, dispatch_lb_busy: 75, dispatch_lb_interactive: true, keep_running: false, atq_enabled: false, cpu_priority: false, interactive_dsq: true, wakeup_lb_busy: 0, wakeup_llc_migrations: false, select_idle_in_enqueue: false, queued_wakeup: false, idle_resume_us: None, max_dsq_pick2: false, task_slice: false, min_slice_us: 100, lb_mode: Load, sched_mode: Default, lb_slack_factor: 5, min_llc_runs_pick2: 1, saturated_percent: 5, dsq_time_slices: [], dsq_shift: 4, llc_shards: 5, min_nr_queued_pick2: 0, dumb_queues: 3, init_dsq_index: 0, virt_llc_enabled: false, topo: TopologyArgs { virt_llc: None } }, requires_ppid: None }
11:28:59 [INFO] DSQ[0] slice_ns 100000
11:28:59 [INFO] DSQ[1] slice_ns 3200000
11:28:59 [INFO] DSQ[2] slice_ns 6400000
11:28:59 [WARN] libbpf: map 'chaos': BPF map skeleton link is uninitialized
chaos traits: random_delays/cpu_freq/degradation 0/0/0
        chaos excluded/skipped 0/0
        kprobe_random_delays 0
        timer kicks: 0
peek: empty/not_ready/needs_proc 1057/0/0
chaos traits: random_delays/cpu_freq/degradation 3/0/0
        chaos excluded/skipped 0/0
        kprobe_random_delays 0
        timer kicks: 3
peek: empty/not_ready/needs_proc 107168/309/9716
chaos traits: random_delays/cpu_freq/degradation 0/0/0
        chaos excluded/skipped 0/0
        kprobe_random_delays 0
        timer kicks: 0
peek: empty/not_ready/needs_proc 91787/0/15417
^C11:29:23 [INFO] Unregister scx_chaos scheduler
```
Force-pushed df8e559 to 25627ea.