scx_chaos: use peek operation to optimise for empty delay dsq #2894
Conversation
Some comments but no blockers! Looks good.
```c
	 scx_bpf_pick_any_cpu_node(cpus_allowed, node, flags) : \
	 scx_bpf_pick_any_cpu(cpus_allowed, flags))

#define __COMPAT_scx_bpf_dsq_peek(dsq_id) \
```
This is different from the compat macro (actually an inline function) in the V5 that we ended up with and that TJ moved to his branch. It probably doesn't matter, but it does make me curious what the rules are for updating this file.

Shouldn't it be pulled automatically from the kernel somewhere? I don't see any other scripts in the repo that mention compat.bpf.h -- I was expecting some update script that copies it from the kernel.
Ah, this has landed upstream now. I grabbed this from #2675 and plan to merge after that goes in; I'll take whatever's there (I don't actually need the compat macro).
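For readers skimming the hunk above: compat.bpf.h wrappers of this kind gate a new kfunc on whether the running kernel exports it. Below is an illustrative reconstruction of the `__COMPAT_scx_bpf_pick_any_cpu_node` wrapper the visible fragment belongs to; the `bpf_ksym_exists` line is my assumption about the part cut off above, and the upstream file may word it differently:

```c
/*
 * Illustrative reconstruction, not a verbatim copy of compat.bpf.h:
 * use the NUMA-aware kfunc when the kernel exports it, otherwise fall
 * back to the older non-node variant visible in the hunk above.
 */
#define __COMPAT_scx_bpf_pick_any_cpu_node(cpus_allowed, node, flags)	\
	(bpf_ksym_exists(scx_bpf_pick_any_cpu_node) ?			\
	 scx_bpf_pick_any_cpu_node(cpus_allowed, node, flags) :	\
	 scx_bpf_pick_any_cpu(cpus_allowed, flags))
```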
```c
		return U64_MAX;
	}

	first_p = bpf_task_from_pid(first_p->pid);
```
Maybe you could help me understand if this is strictly required or just being extra cautious.
- We can access the field `first_p->pid` before the `bpf_task_from_pid` because it is simple scalar data.
- The field `first_p->scx.dsq_vtime` is in the `task_struct`, inside the nested `struct sched_ext_entity`, which is stored contiguously within the parent struct (not via a pointer).

It seems to me that this effectively makes `first_p->scx.dsq_vtime` accessible without the `bpf_task_from_pid`/`bpf_task_release` protocol, because it's effectively scalar data. But I take it there's some rule about not referencing "complex data" without a proper reference-counted handle, to leave open more implementation leeway (on different architectures/compilers or something) with these nested structs?
Yes! This is good. I did try removing it before and had a verifier error, but I must have messed it up. Seems fine now on my kernel.
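For context, here is a hedged reconstruction of the helper this hunk appears to come from, stitched together from the fragments in this PR (the `return U64_MAX` / `bpf_task_from_pid` lines above and the `delay_dsq_next_time()` call site below); the real code may differ in its details:

```c
static u64 delay_dsq_next_time(u64 dsq_id)
{
	struct task_struct *first_p;
	u64 vtime;

	/* Peek does not lock or consume the DSQ; NULL means it is empty. */
	first_p = scx_bpf_dsq_peek(dsq_id);
	if (!first_p)
		return U64_MAX;

	/*
	 * Reading first_p->pid is plain scalar data. The re-acquire below
	 * gives the verifier a reference-counted pointer before the nested
	 * scx.dsq_vtime read; per the thread above it may turn out to be
	 * unnecessary on newer kernels.
	 */
	first_p = bpf_task_from_pid(first_p->pid);
	if (!first_p)
		return U64_MAX;

	vtime = first_p->scx.dsq_vtime;
	bpf_task_release(first_p);

	return vtime;
}
```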
```c
	bpf_for_each(scx_dsq, p, get_cpu_delay_dsq(-1), 0) {
		// Check if we need to process the delay DSQ
		if (delay_dsq_next_time(dsq_id) > now)
			goto p2dq;
```
This makes sense as the short-circuit. I want to quickly make sure the races are benign. Normally we would expect a peek/vtime reading to be a lower bound -- other cores could race and asynchronously pop the task and increase the vtime after our observation (but not decrease it, because of monotonicity).

But here, because we only peek our own `delay_dsq`... maybe we don't even need to worry about that, since no one else will dispatch from it?

Do we need to rely on non-interference with the delay DSQ here? I think not: if the head of the delay DSQ was popped and its head-delayed-time moved even FURTHER into the future, then it is even more ahead of our fixed snapshot of `now`, and we remain justified in taking the fast-path `goto p2dq`.

Nevertheless, reasoning about the concurrency safety here seems a bit scary, because `bpf_ktime_get_ns` is real time that is constantly marching forward: our `now` snapshots are immediately stale, and we don't get to control the advance of time like in a simulation. If `now` and "earliest vtime in the delay queue" are both variables that change asynchronously in the background, then an atomic snapshot of just one of them can never guarantee that a moment in time exists where they have a particular ordering relationship.
Thanks for this. The two parts make a lot of sense: information loss between DSQ insertions/the check in `chaos_dispatch`, and the check in the timer being out of date.

`chaos_enqueue` and `chaos_dispatch` both only enqueue/dequeue from their local DSQ. There should be no possible interference there, and every enqueue should show up in the related dispatch. If we ever place the task onto a delay DSQ that doesn't belong to the CPU that triggered the enqueue, this would need a rethink, but I don't see any reason we'd need to do that.

The timer is a much more interesting case. I initially did this work on an Intel machine with a unified cache and didn't see many stalls after I fixed the logic. However, on my multi-CCX EPYC Rome machine I'm seeing fairly consistent stalls. I'll play around with this, but I believe it's the timer not triggering dispatch when it should, as I don't see how this would be a problem between enqueue/dispatch in isolation.
Force-pushed d34a00d to df8e559.
Thanks @rrnewton for the comprehensive review! I made a few changes to stat collection, which was wrong from an old refactor, and corrected a weird return from that too. I've also found some reliable stalls, so I'll need to debug those before landing.
Hey @JakeHillion @rrnewton maybe we can add the new …
Yeah, they're only in this PR so I could share a working version. This one is stalling so it will stay a draft for a while, but I think the LAVD one might be ready to go. Otherwise we can merge the header updates by themselves.
Use the new `scx_bpf_dsq_peek` in scx_chaos to optimise for the fast path. This avoids locking the DSQs and should be beneficial in the common case where the DSQ is empty or nothing in it is ready yet.

Add a few stats for tracking how successful peek is. This works really well on my local machine for skipping the hot path.

This mostly avoids contention with the crawling timer thread, as the insertion in `chaos_enqueue` and removal in `chaos_dispatch` are all local to one CPU and the locking overhead would be minimal.
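A minimal sketch of that fast path, using the helper names visible in the review hunks (`get_cpu_delay_dsq()`, `delay_dsq_next_time()`); the surrounding logic in the real `chaos_dispatch()` is simplified away here:

```c
	u64 now = bpf_ktime_get_ns();
	u64 dsq_id = get_cpu_delay_dsq(-1);

	/*
	 * Peek-based gate: if the delay DSQ is empty (reported as U64_MAX)
	 * or its earliest task is not due yet, jump straight to the normal
	 * p2dq dispatch path without locking or walking the delay DSQ.
	 */
	if (delay_dsq_next_time(dsq_id) > now)
		goto p2dq;

	/* Otherwise walk the delay DSQ and move tasks whose delay expired. */
```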
Test plan:
- CI
```
jake@merlin:/data/users/jake/repos/scx/ > cargo build --release -p scx_chaos && sudo target/release/scx_chaos --random-delay-frequency 0.01 --random-delay-min-us 100000 --random-delay-max-us 200000 --stats 10
...
    Finished `release` profile [optimized] target(s) in 1m 01s
11:28:59 [INFO] Running scx_chaos (build ID: 1.0.20-ga6134e95-dirty x86_64-unknown-linux-gnu)
11:28:59 [INFO] Builder { traits: [RandomDelays { frequency: 0.01, min_us: 100000, max_us: 200000 }], verbose: 0, kprobe_random_delays: None, p2dq_opts: SchedulerOpts { disable_kthreads_local: false, autoslice: false, interactive_ratio: 10, deadline: false, eager_load_balance: false, freq_control: false, greedy_idle_disable: true, interactive_sticky: false, interactive_fifo: false, dispatch_pick2_disable: false, dispatch_lb_busy: 75, dispatch_lb_interactive: true, keep_running: false, atq_enabled: false, cpu_priority: false, interactive_dsq: true, wakeup_lb_busy: 0, wakeup_llc_migrations: false, select_idle_in_enqueue: false, queued_wakeup: false, idle_resume_us: None, max_dsq_pick2: false, task_slice: false, min_slice_us: 100, lb_mode: Load, sched_mode: Default, lb_slack_factor: 5, min_llc_runs_pick2: 1, saturated_percent: 5, dsq_time_slices: [], dsq_shift: 4, llc_shards: 5, min_nr_queued_pick2: 0, dumb_queues: 3, init_dsq_index: 0, virt_llc_enabled: false, topo: TopologyArgs { virt_llc: None } }, requires_ppid: None }
11:28:59 [INFO] DSQ[0] slice_ns 100000
11:28:59 [INFO] DSQ[1] slice_ns 3200000
11:28:59 [INFO] DSQ[2] slice_ns 6400000
11:28:59 [WARN] libbpf: map 'chaos': BPF map skeleton link is uninitialized
chaos traits: random_delays/cpu_freq/degradation 0/0/0
        chaos excluded/skipped 0/0
        kprobe_random_delays 0
        timer kicks: 0
peek: empty/not_ready/needs_proc 1057/0/0
chaos traits: random_delays/cpu_freq/degradation 3/0/0
        chaos excluded/skipped 0/0
        kprobe_random_delays 0
        timer kicks: 3
peek: empty/not_ready/needs_proc 107168/309/9716
chaos traits: random_delays/cpu_freq/degradation 0/0/0
        chaos excluded/skipped 0/0
        kprobe_random_delays 0
        timer kicks: 0
peek: empty/not_ready/needs_proc 91787/0/15417
^C11:29:23 [INFO] Unregister scx_chaos scheduler
```
Force-pushed df8e559 to 25627ea.