Skip to content

Don't call dbuf_evict_one from dbuf_evict_notify for kswapd #17561

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

khoang98
Copy link

@khoang98 khoang98 commented Jul 21, 2025

Motivation and Context

In a high memory pressure environment using Lustre 2.15 + ZFS 2.1.7, we see a state where arc_read is holding the hash lock and waiting for a page of memory while kswapd is trying to evict a dbuf waiting on the same hash lock.

This change reduces the chance of this deadlock happening by checking if the hash lock is taken before trying to evict.

This proposed change removes the call to dbuf_evict_one() specifically when it's running in the kswapd context. This change aims to reduce unnecessary work and hash lock contention during high memory pressure situations, while still maintaining the performance benefit of dbuf eviction in normal operational contexts.

Thread 1

[372109.472754] Call trace:
[372109.472759]  __switch_to+0xbc/0xfc
[372109.472763]  __schedule+0x28c/0x718
[372109.472765]  _cond_resched+0x48/0x60
[372109.472768]  shrink_page_list+0x6c/0xc70
[372109.472769]  shrink_inactive_list+0x160/0x510
[372109.472771]  shrink_lruvec+0x26c/0x300
[372109.472772]  shrink_node_memcgs+0x1c0/0x230
[372109.472773]  shrink_node+0x150/0x5e0
[372109.472775]  shrink_zones+0x98/0x220
[372109.472776]  do_try_to_free_pages+0xac/0x2e0
[372109.472778]  try_to_free_pages+0x120/0x25c
[372109.472780]  __alloc_pages_slowpath.constprop.0+0x400/0x82c
[372109.472781]  __alloc_pages_nodemask+0x2b4/0x310
[372109.472831]  abd_alloc_chunks+0x184/0x470 [zfs]
[372109.472881]  abd_alloc+0x90/0x120 [zfs]
[372109.472924]  arc_hdr_alloc_abd+0x134/0x22c [zfs]
[372109.472966]  arc_read+0x4a8/0x10a0 [zfs]
[372109.473008]  dbuf_read_impl.constprop.0+0x23c/0x3dc [zfs]
[372109.473050]  dbuf_read+0xd4/0x65c [zfs]
[372109.473092]  dmu_buf_hold_by_dnode+0xa0/0x124 [zfs]
[372109.473136]  zap_get_leaf_byblk+0x68/0x154 [zfs]
[372109.473178]  zap_deref_leaf+0xb4/0x148 [zfs]
[372109.473221]  fzap_lookup+0x80/0x1b8 [zfs]
[372109.473263]  zap_lookup_impl+0x6c/0x1c8 [zfs]
[372109.473305]  zap_lookup+0xc8/0x10c [zfs]
[372109.473314]  osd_fid_lookup+0x27c/0x4d4 [osd_zfs]
[372109.473322]  osd_object_init+0x314/0xaec [osd_zfs]
[372109.473355]  lu_object_start+0x84/0x154 [obdclass]
[372109.473385]  lu_object_find_at+0x37c/0x72c [obdclass]
[372109.473415]  lu_object_find+0x1c/0x24 [obdclass]
[372109.473423]  ofd_object_find+0x6c/0x18c [ofd]
[372109.473430]  ofd_lvbo_init+0x294/0x938 [ofd]
[372109.473481]  ldlm_lvbo_init+0x70/0x2e0 [ptlrpc]
[372109.473529]  ldlm_handle_enqueue0+0x540/0x1b24 [ptlrpc]
[372109.473577]  tgt_enqueue+0x84/0x2c0 [ptlrpc]
[372109.473624]  tgt_handle_request0+0x2b4/0x658 [ptlrpc]
[372109.473671]  tgt_request_handle+0x268/0xaac [ptlrpc]
[372109.473718]  ptlrpc_server_handle_request.isra.0+0x460/0xf20 [ptlrpc]
[372109.473765]  ptlrpc_main+0xd24/0x15bc [ptlrpc]
[372109.473768]  kthread+0x118/0x120

Thread 2

[372073.633195] INFO: task kswapd0:188 blocked for more than 122 seconds.
[372073.634293]       Tainted: P           OE     5.10.236-228.935.amzn2.aarch64 #1
[372073.635482] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[372073.636758] task:kswapd0         state:D stack:    0 pid:  188 ppid:     2 flags:0x00000028
[372073.638118] Call trace:
[372073.638546]  __switch_to+0xbc/0xfc
[372073.639123]  __schedule+0x28c/0x718
[372073.639712]  schedule+0x4c/0xcc
[372073.640248]  schedule_preempt_disabled+0x14/0x1c
[372073.641015]  __mutex_lock.constprop.0+0x190/0x640
[372073.641796]  __mutex_lock_slowpath+0x18/0x20
[372073.642506]  mutex_lock+0x74/0x80
[372073.643112]  arc_buf_destroy+0x84/0x178 [zfs]
[372073.643884]  dbuf_destroy+0x38/0x3dc [zfs]
[372073.644607]  dbuf_evict_one+0x168/0x188 [zfs]
[372073.645376]  dbuf_evict_notify+0xe0/0xf0 [zfs]
[372073.646153]  dbuf_rele_and_unlock+0x61c/0x708 [zfs]
[372073.646998]  dmu_buf_rele+0x44/0x58 [zfs]
[372073.647710]  sa_handle_destroy+0x7c/0x120 [zfs]
[372073.648470]  osd_object_delete+0x64/0x17c [osd_zfs]
[372073.649312]  lu_object_free.isra.0+0x88/0x1fc [obdclass]
[372073.650215]  lu_site_purge_objects+0x338/0x47c [obdclass]
[372073.651127]  lu_cache_shrink_scan+0xa0/0x18c [obdclass]
[372073.651989]  do_shrink_slab+0x194/0x394
[372073.652635]  shrink_slab+0xbc/0x13c
[372073.653234]  shrink_node_memcgs+0x1d4/0x230
[372073.653943]  shrink_node+0x150/0x5e0
[372073.654554]  balance_pgdat+0x260/0x524
[372073.655192]  kswapd+0x124/0x208
[372073.655732]  kthread+0x118/0x120

Description

This change should help alleviate the blocking issue we're seeing with kswapd, as it prevents additional cache eviction work from being performed when the system is already under memory pressure.

How Has This Been Tested?

Tested running various IOR tests with Lustre file system and did not see any drop in performance.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label Jul 21, 2025
Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like a concept of "reduce the chance of this deadlock" -- it there is a deadlock, it must be fixed reliably. Also exposing internal kitchen of ARC to a whole public for that it not a clean way ether. What I think should be investigated, is dbuf_evict_one() calls from random dbuf_evict_notify() contexts, since it is not strictly required and is only a performance optimization.

@khoang98 khoang98 force-pushed the hashlock_lookup branch 2 times, most recently from 98bdc89 to 0b3c168 Compare July 21, 2025 19:59
@khoang98
Copy link
Author

I don't like a concept of "reduce the chance of this deadlock" -- it there is a deadlock, it must be fixed reliably. Also exposing internal kitchen of ARC to a whole public for that it not a clean way ether.

That makes sense to me. Ideally, I wanted the check higher up in in arc_buf_destroy to not block if the hashlock is locked, but at that point, we've already changed state in other method calls in the stack and clean up becomes a bit tricky...

What I think should be investigated, is dbuf_evict_one() calls from random dbuf_evict_notify() contexts, since it is not strictly required and is only a performance optimization.

Can you help me understand this a bit more? From what I can tell, dbuf_evict_notify() will wake up the eviction thread and call dbuf_evict_one() if the dbuf cache is at the high water mark. We also saw this eviction thread stuck trying to obtain the lock too - so just trying to find a way to unstuck these threads to not run into this in the future

[372073.656295] INFO: task dbuf_evict:677 blocked for more than 122 seconds.
[372073.657401]       Tainted: P           OE     5.10.236-228.935.amzn2.aarch64 #1
[372073.658602] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[372073.659888] task:dbuf_evict      state:D stack:    0 pid:  677 ppid:     2 flags:0x00000028
[372073.661256] Call trace:
[372073.661687]  __switch_to+0xbc/0xfc
[372073.662268]  __schedule+0x28c/0x718
[372073.662864]  schedule+0x4c/0xcc
[372073.663407]  schedule_preempt_disabled+0x14/0x1c
[372073.664181]  __mutex_lock.constprop.0+0x190/0x640
[372073.664969]  __mutex_lock_slowpath+0x18/0x20
[372073.665691]  mutex_lock+0x74/0x80
[372073.666298]  arc_buf_destroy+0x84/0x178 [zfs]
[372073.667073]  dbuf_destroy+0x38/0x3dc [zfs]
[372073.667808]  dbuf_evict_one+0x168/0x188 [zfs]
[372073.668583]  dbuf_evict_thread+0x154/0x288 [zfs]
[372073.669365]  thread_generic_wrapper+0x64/0x80 [spl]
[372073.670177]  kthread+0x118/0x120

@amotin
Copy link
Member

amotin commented Jul 21, 2025

We also saw this eviction thread stuck trying to obtain the lock too - so just trying to find a way to unstuck these threads to not run into this in the future

Yes, the dbuf eviction thread may stuck until the hash lock is released, but it shouldn't block anything else, since as I see it is carefully written to not hold any locks at that point. And dbuf cache size should be small enough to not matter that much if it is not evicting for some time.

@vandanarungta
Copy link
Contributor

Other options that we considered:
1/ arc_read times out if it cannot get a page - but there is no error handling in arc_read in this path - so seemed risky
2/ arc_buf_destroy use a try_lock - but as Kaitlyn mentioned dbuf states have changed and there is no error handling in that path - so seemed risky
3/ (proposed) check if the hash_lock is owned (fair point on not exposing ARC, we can change that to be cleaner), before pulling the dbuf off the linked list.

Would appreciate thoughts on this: db_mtx lock is held when we check if the hash lock is owned. Is it possible between the time frame " of checking the hash lock and getting the hash_lock in arc_buf_destroy " for a new arc_read to get the hash_lock. If it can, then that is the open deadlock window.

If removing calls to dbuf_evict from dbuf_rele_and_unlock (and any other similar places) is the option you are proposing - it sounds like a reasonable solution - we can explore that - at least it will prevent kswapd from getting stuck and hard hanging the system.

@amotin
Copy link
Member

amotin commented Jul 21, 2025

Is it possible between the time frame " of checking the hash lock and getting the hash_lock in arc_buf_destroy " for a new arc_read to get the hash_lock. If it can, then that is the open deadlock window.

The same ARC buffer can be used by several dbufs, and the same hash lock can be used by several ARC buffers. So the idea of checking and later taking makes little sense to me, since does not ensure anything.

If removing calls to dbuf_evict from dbuf_rele_and_unlock (and any other similar places) is the option you are proposing - it sounds like a reasonable solution - we can explore that - at least it will prevent kswapd from getting stuck and hard hanging the system.

That is not what I have said. Read again. What I am saying is that there are two ways to evict dbuf cache: the dedicated thread and any other helpers. The first one is required, the second are only a performance optimization, which you can skip if called from kswapd.

@khoang98 khoang98 changed the title Add hash_lock lookup to dbuf_evict_one to prevent deadlock in arc_buf_destroy Don't call dbuf_evict_one from dbuf_evict_notify for kswapd Jul 24, 2025
@khoang98
Copy link
Author

@amotin pushed a new change based on your comments and updated the PR description - let me know what you think, thanks

@amotin
Copy link
Member

amotin commented Jul 24, 2025

@khoang98 This is kind of what I've meant, except current_is_kswapd() is Linux-specific. It may need some wrapper, if anything. On FreeBSD it makes me think about something like if (curproc == pageproc), but I haven't looked whether it is needed there.

@khoang98
Copy link
Author

@amotin updated to limit this change for Linux only

@khoang98
Copy link
Author

@amotin moved the check to dbuf.h to avoid the definition checking in c file

@khoang98
Copy link
Author

@amotin made the check more generic with OS-specific implementations

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is the right way to do it.

@khoang98 khoang98 force-pushed the hashlock_lookup branch 2 times, most recently from 4831fad to bec99b7 Compare July 25, 2025 19:06
@khoang98 khoang98 marked this pull request as ready for review July 28, 2025 18:40
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Jul 28, 2025
@behlendorf behlendorf self-requested a review July 30, 2025 00:08
Avoid calling dbuf_evict_one() from memory reclaim contexts (e.g. Linux
kswapd, FreeBSD pagedaemon). This prevents deadlock caused by reclaim
threads waiting for the dbuf hash lock in the call sequence:
dbuf_evict_one -> dbuf_destroy -> arc_buf_destroy

Signed-off-by: Kaitlin Hoang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants