-
Notifications
You must be signed in to change notification settings - Fork 148
bpf: switch to memcg-based memory accounting #339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Master branch: c365387 Pull request is NOT updated. Failed to apply https://patchwork.kernel.org/project/netdevbpf/list/?series=383159 conflict: |
|
Master branch: c365387 |
b57c258 to
524d199
Compare
|
Master branch: 0a58a65 |
524d199 to
7f7233d
Compare
|
Master branch: 904709f |
7f7233d to
83f58da
Compare
Include memory used by bpf programs into the memcg-based accounting. This includes the memory used by programs itself, auxiliary data, statistics and bpf line info. A memory cgroup containing the process which loads the program is getting charged. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
In the absolute majority of cases if a process is making a kernel allocation, it's memory cgroup is getting charged. Bpf maps can be updated from an interrupt context and in such case there is no process which can be charged. It makes the memory accounting of bpf maps non-trivial. Fortunately, after commits 4127c65 ("mm: kmem: enable kernel memcg accounting from interrupt contexts") and b87d8ce ("mm, memcg: rework remote charging API to support nesting") it's finally possible. To do it, a pointer to the memory cgroup of the process which created the map is saved, and this cgroup is getting charged for all allocations made from an interrupt context. Allocations made from a process context will be accounted in a usual way. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
This patch enables memcg-based memory accounting for memory allocated by __bpf_map_area_alloc(), which is used by many types of bpf maps for large memory allocations. Following patches in the series will refine the accounting for some of the map types. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Include percpu arrays and auxiliary data into the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Include metadata and percpu data into the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Account memory used by cgroup storage maps including metadata structures. Account the percpu memory for the percpu flavor of cgroup storage. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Include map metadata and the node size (struct bpf_dtab_netdev) into the accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Include percpu objects and the size of map metadata into the accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]> Reviewed-by: Shakeel Butt <[email protected]>
Include lpm trie and lpm trie node objects into the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Enable the memcg-based memory accounting for the memory used by the bpf ringbuffer. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Account memory used by bpf local storage maps: per-socket and per-inode storages. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Include internal metadata into the memcg-based memory accounting. Also include the memory allocated on updating an element. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Extend xskmap memory accounting to include the memory taken by the xsk_map_node structure. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Do not use rlimit-based memory accounting for arraymap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Do not use rlimit-based memory accounting for bpf_struct_ops maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Do not use rlimit-based memory accounting for cpumap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Do not use rlimit-based memory accounting for cgroup storage maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Do not use rlimit-based memory accounting for devmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Do not use rlimit-based memory accounting for hashtab maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Do not use rlimit-based memory accounting for lpm_trie maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Do not use rlimit-based memory accounting for queue_stack maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Do not use rlimit-based memory accounting for reuseport_array maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Do not use rlimit-based memory accounting for bpf ringbuffer. It has been replaced with the memcg-based memory accounting. bpf_ringbuf_alloc() can't return anything except ERR_PTR(-ENOMEM) and a valid pointer, so to simplify the code make it return NULL in the first case. This allows to drop a couple of lines in ringbuf_map_alloc() and also makes it look similar to other memory allocating function like kmalloc(). Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]> Acked-by: Andrii Nakryiko <[email protected]>
…h maps Do not use rlimit-based memory accounting for sockmap and sockhash maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Do not use rlimit-based memory accounting for stackmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Do not use rlimit-based memory accounting for xskmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Roman Gushchin <[email protected]>
Do not use rlimit-based memory accounting for bpf local storage maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Remove rlimit-based accounting infrastructure code, which is not used anymore. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Do not use rlimit-based memory accounting for bpf progs. It has been replaced with memcg-based memory accounting. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
Since bpf is not using rlimit memlock for the memory accounting and control, do not change the limit in sample applications. Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Song Liu <[email protected]>
|
Master branch: c14d61f |
83f58da to
19564bd
Compare
|
Master branch: 024cd2c Pull request is NOT updated. Failed to apply https://patchwork.kernel.org/project/netdevbpf/list/?series=383159 conflict: |
|
At least one diff in series https://patchwork.kernel.org/project/linux-mm/list/?series=385629 irrelevant now for [{'archived': False, 'project': 399, 'delegate': 121173}] |
Add a test case which replaces an active ingress qdisc while keeping the miniq in-tact during the transition period to the new clsact qdisc. # ./vmtest.sh -- ./test_progs -t tc_link [...] ./test_progs -t tc_link [ 3.412871] bpf_testmod: loading out-of-tree module taints kernel. [ 3.413343] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel #332 tc_links_after:OK #333 tc_links_append:OK #334 tc_links_basic:OK #335 tc_links_before:OK #336 tc_links_chain_classic:OK #337 tc_links_chain_mixed:OK #338 tc_links_dev_chain0:OK #339 tc_links_dev_cleanup:OK #340 tc_links_dev_mixed:OK #341 tc_links_ingress:OK #342 tc_links_invalid:OK #343 tc_links_prepend:OK #344 tc_links_replace:OK #345 tc_links_revision:OK Summary: 14/0 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <[email protected]> Cc: Martin KaFai Lau <[email protected]>
Add a test case which replaces an active ingress qdisc while keeping the miniq in-tact during the transition period to the new clsact qdisc. # ./vmtest.sh -- ./test_progs -t tc_link [...] ./test_progs -t tc_link [ 3.412871] bpf_testmod: loading out-of-tree module taints kernel. [ 3.413343] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel #332 tc_links_after:OK #333 tc_links_append:OK #334 tc_links_basic:OK #335 tc_links_before:OK #336 tc_links_chain_classic:OK #337 tc_links_chain_mixed:OK #338 tc_links_dev_chain0:OK #339 tc_links_dev_cleanup:OK #340 tc_links_dev_mixed:OK #341 tc_links_ingress:OK #342 tc_links_invalid:OK #343 tc_links_prepend:OK #344 tc_links_replace:OK #345 tc_links_revision:OK Summary: 14/0 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <[email protected]> Cc: Martin KaFai Lau <[email protected]>
Add a test case which replaces an active ingress qdisc while keeping the miniq in-tact during the transition period to the new clsact qdisc. # ./vmtest.sh -- ./test_progs -t tc_link [...] ./test_progs -t tc_link [ 3.412871] bpf_testmod: loading out-of-tree module taints kernel. [ 3.413343] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel #332 tc_links_after:OK #333 tc_links_append:OK #334 tc_links_basic:OK #335 tc_links_before:OK #336 tc_links_chain_classic:OK #337 tc_links_chain_mixed:OK #338 tc_links_dev_chain0:OK #339 tc_links_dev_cleanup:OK #340 tc_links_dev_mixed:OK #341 tc_links_ingress:OK #342 tc_links_invalid:OK #343 tc_links_prepend:OK #344 tc_links_replace:OK #345 tc_links_revision:OK Summary: 14/0 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <[email protected]> Cc: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
Using mutex lock in IO hot path causes the kernel BUG sleeping while
atomic. Shinichiro[1], first encountered this issue while running blktest
nvme/052 shown below:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 996, name: (udev-worker)
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
2 locks held by (udev-worker)/996:
#0: ffff8881004570c8 (mapping.invalidate_lock){.+.+}-{3:3}, at: page_cache_ra_unbounded+0x155/0x5c0
#1: ffffffff8607eaa0 (rcu_read_lock){....}-{1:2}, at: blk_mq_flush_plug_list+0xa75/0x1950
CPU: 2 UID: 0 PID: 996 Comm: (udev-worker) Not tainted 6.12.0-rc3+ #339
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6a/0x90
__might_resched.cold+0x1f7/0x23d
? __pfx___might_resched+0x10/0x10
? vsnprintf+0xdeb/0x18f0
__mutex_lock+0xf4/0x1220
? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
? __pfx_vsnprintf+0x10/0x10
? __pfx___mutex_lock+0x10/0x10
? snprintf+0xa5/0xe0
? xas_load+0x1ce/0x3f0
? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
? __pfx_nvmet_subsys_nsid_exists+0x10/0x10 [nvmet]
nvmet_req_find_ns+0x24e/0x300 [nvmet]
nvmet_req_init+0x694/0xd40 [nvmet]
? blk_mq_start_request+0x11c/0x750
? nvme_setup_cmd+0x369/0x990 [nvme_core]
nvme_loop_queue_rq+0x2a7/0x7a0 [nvme_loop]
? __pfx___lock_acquire+0x10/0x10
? __pfx_nvme_loop_queue_rq+0x10/0x10 [nvme_loop]
__blk_mq_issue_directly+0xe2/0x1d0
? __pfx___blk_mq_issue_directly+0x10/0x10
? blk_mq_request_issue_directly+0xc2/0x140
blk_mq_plug_issue_direct+0x13f/0x630
? lock_acquire+0x2d/0xc0
? blk_mq_flush_plug_list+0xa75/0x1950
blk_mq_flush_plug_list+0xa9d/0x1950
? __pfx_blk_mq_flush_plug_list+0x10/0x10
? __pfx_mpage_readahead+0x10/0x10
__blk_flush_plug+0x278/0x4d0
? __pfx___blk_flush_plug+0x10/0x10
? lock_release+0x460/0x7a0
blk_finish_plug+0x4e/0x90
read_pages+0x51b/0xbc0
? __pfx_read_pages+0x10/0x10
? lock_release+0x460/0x7a0
page_cache_ra_unbounded+0x326/0x5c0
force_page_cache_ra+0x1ea/0x2f0
filemap_get_pages+0x59e/0x17b0
? __pfx_filemap_get_pages+0x10/0x10
? lock_is_held_type+0xd5/0x130
? __pfx___might_resched+0x10/0x10
? find_held_lock+0x2d/0x110
filemap_read+0x317/0xb70
? up_write+0x1ba/0x510
? __pfx_filemap_read+0x10/0x10
? inode_security+0x54/0xf0
? selinux_file_permission+0x36d/0x420
blkdev_read_iter+0x143/0x3b0
vfs_read+0x6ac/0xa20
? __pfx_vfs_read+0x10/0x10
? __pfx_vm_mmap_pgoff+0x10/0x10
? __pfx___seccomp_filter+0x10/0x10
ksys_read+0xf7/0x1d0
? __pfx_ksys_read+0x10/0x10
do_syscall_64+0x93/0x180
? lockdep_hardirqs_on_prepare+0x16d/0x400
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on+0x78/0x100
? do_syscall_64+0x9f/0x180
? lockdep_hardirqs_on_prepare+0x16d/0x400
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f565bd1ce11
Code: 00 48 8b 15 09 90 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 d0 ad 01 00 f3 0f 1e fa 80 3d 35 12 0e 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
RSP: 002b:00007ffd6e7a20c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f565bd1ce11
RDX: 0000000000001000 RSI: 00007f565babb000 RDI: 0000000000000014
RBP: 00007ffd6e7a2130 R08: 00000000ffffffff R09: 0000000000000000
R10: 0000556000bfa610 R11: 0000000000000246 R12: 000000003ffff000
R13: 0000556000bfa5b0 R14: 0000000000000e00 R15: 0000556000c07328
</TASK>
Apparently, the above issue is caused due to using mutex lock while
we're in IO hot path. It's a regression caused with commit 5053639
("nvmet: fix nvme status code when namespace is disabled"). The mutex
->su_mutex is used to find whether a disabled nsid exists in the config
group or not. This is to differentiate between a nsid that is disabled
vs non-existent.
To mitigate the above issue, we've worked upon a fix[2] where we now
insert nsid in subsys Xarray as soon as it's created under config group
and later when that nsid is enabled, we add an Xarray mark on it and set
ns->enabled to true. The Xarray mark is useful while we need to loop
through all enabled namepsaces under a subsystem using xa_for_each_marked()
API. If later a nsid is disabled then we clear Xarray mark from it and also
set ns->enabled to false. It's only when nsid is deleted from the config
group we delete it from the Xarray.
So with this change, now we could easily differentiate a nsid is disabled
(i.e. Xarray entry for ns exists but ns->enabled is set to false) vs non-
existent (i.e.Xarray entry for ns doesn't exist).
Link: https://lore.kernel.org/linux-nvme/[email protected]/ [2]
Reported-by: Shinichiro Kawasaki <[email protected]>
Closes: https://lore.kernel.org/linux-nvme/tqcy3sveity7p56v7ywp7ssyviwcb3w4623cnxj3knoobfcanq@yxgt2mjkbkam/ [1]
Fixes: 5053639 ("nvmet: fix nvme status code when namespace is disabled")
Fix-suggested-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Nilay Shroff <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
Pull request for series with
subject: bpf: switch to memcg-based memory accounting
version: 5
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=383159