Don't call dbuf_evict_one from dbuf_evict_notify for kswapd #17561

khoang98 · 2025-07-21T17:50:20Z

Motivation and Context

In a high memory pressure environment using Lustre 2.15 + ZFS 2.1.7, we see a state where arc_read is holding the hash lock and waiting for a page of memory while kswapd is trying to evict a dbuf waiting on the same hash lock.

~~This change reduces the chance of this deadlock happening by checking if the hash lock is taken before trying to evict.~~

This proposed change removes the call to dbuf_evict_one() specifically when it's running in the kswapd context. This change aims to reduce unnecessary work and hash lock contention during high memory pressure situations, while still maintaining the performance benefit of dbuf eviction in normal operational contexts.

Thread 1

[372109.472754] Call trace:
[372109.472759]  __switch_to+0xbc/0xfc
[372109.472763]  __schedule+0x28c/0x718
[372109.472765]  _cond_resched+0x48/0x60
[372109.472768]  shrink_page_list+0x6c/0xc70
[372109.472769]  shrink_inactive_list+0x160/0x510
[372109.472771]  shrink_lruvec+0x26c/0x300
[372109.472772]  shrink_node_memcgs+0x1c0/0x230
[372109.472773]  shrink_node+0x150/0x5e0
[372109.472775]  shrink_zones+0x98/0x220
[372109.472776]  do_try_to_free_pages+0xac/0x2e0
[372109.472778]  try_to_free_pages+0x120/0x25c
[372109.472780]  __alloc_pages_slowpath.constprop.0+0x400/0x82c
[372109.472781]  __alloc_pages_nodemask+0x2b4/0x310
[372109.472831]  abd_alloc_chunks+0x184/0x470 [zfs]
[372109.472881]  abd_alloc+0x90/0x120 [zfs]
[372109.472924]  arc_hdr_alloc_abd+0x134/0x22c [zfs]
[372109.472966]  arc_read+0x4a8/0x10a0 [zfs]
[372109.473008]  dbuf_read_impl.constprop.0+0x23c/0x3dc [zfs]
[372109.473050]  dbuf_read+0xd4/0x65c [zfs]
[372109.473092]  dmu_buf_hold_by_dnode+0xa0/0x124 [zfs]
[372109.473136]  zap_get_leaf_byblk+0x68/0x154 [zfs]
[372109.473178]  zap_deref_leaf+0xb4/0x148 [zfs]
[372109.473221]  fzap_lookup+0x80/0x1b8 [zfs]
[372109.473263]  zap_lookup_impl+0x6c/0x1c8 [zfs]
[372109.473305]  zap_lookup+0xc8/0x10c [zfs]
[372109.473314]  osd_fid_lookup+0x27c/0x4d4 [osd_zfs]
[372109.473322]  osd_object_init+0x314/0xaec [osd_zfs]
[372109.473355]  lu_object_start+0x84/0x154 [obdclass]
[372109.473385]  lu_object_find_at+0x37c/0x72c [obdclass]
[372109.473415]  lu_object_find+0x1c/0x24 [obdclass]
[372109.473423]  ofd_object_find+0x6c/0x18c [ofd]
[372109.473430]  ofd_lvbo_init+0x294/0x938 [ofd]
[372109.473481]  ldlm_lvbo_init+0x70/0x2e0 [ptlrpc]
[372109.473529]  ldlm_handle_enqueue0+0x540/0x1b24 [ptlrpc]
[372109.473577]  tgt_enqueue+0x84/0x2c0 [ptlrpc]
[372109.473624]  tgt_handle_request0+0x2b4/0x658 [ptlrpc]
[372109.473671]  tgt_request_handle+0x268/0xaac [ptlrpc]
[372109.473718]  ptlrpc_server_handle_request.isra.0+0x460/0xf20 [ptlrpc]
[372109.473765]  ptlrpc_main+0xd24/0x15bc [ptlrpc]
[372109.473768]  kthread+0x118/0x120

Thread 2

[372073.633195] INFO: task kswapd0:188 blocked for more than 122 seconds.
[372073.634293]       Tainted: P           OE     5.10.236-228.935.amzn2.aarch64 #1
[372073.635482] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[372073.636758] task:kswapd0         state:D stack:    0 pid:  188 ppid:     2 flags:0x00000028
[372073.638118] Call trace:
[372073.638546]  __switch_to+0xbc/0xfc
[372073.639123]  __schedule+0x28c/0x718
[372073.639712]  schedule+0x4c/0xcc
[372073.640248]  schedule_preempt_disabled+0x14/0x1c
[372073.641015]  __mutex_lock.constprop.0+0x190/0x640
[372073.641796]  __mutex_lock_slowpath+0x18/0x20
[372073.642506]  mutex_lock+0x74/0x80
[372073.643112]  arc_buf_destroy+0x84/0x178 [zfs]
[372073.643884]  dbuf_destroy+0x38/0x3dc [zfs]
[372073.644607]  dbuf_evict_one+0x168/0x188 [zfs]
[372073.645376]  dbuf_evict_notify+0xe0/0xf0 [zfs]
[372073.646153]  dbuf_rele_and_unlock+0x61c/0x708 [zfs]
[372073.646998]  dmu_buf_rele+0x44/0x58 [zfs]
[372073.647710]  sa_handle_destroy+0x7c/0x120 [zfs]
[372073.648470]  osd_object_delete+0x64/0x17c [osd_zfs]
[372073.649312]  lu_object_free.isra.0+0x88/0x1fc [obdclass]
[372073.650215]  lu_site_purge_objects+0x338/0x47c [obdclass]
[372073.651127]  lu_cache_shrink_scan+0xa0/0x18c [obdclass]
[372073.651989]  do_shrink_slab+0x194/0x394
[372073.652635]  shrink_slab+0xbc/0x13c
[372073.653234]  shrink_node_memcgs+0x1d4/0x230
[372073.653943]  shrink_node+0x150/0x5e0
[372073.654554]  balance_pgdat+0x260/0x524
[372073.655192]  kswapd+0x124/0x208
[372073.655732]  kthread+0x118/0x120

Description

This change should help alleviate the blocking issue we're seeing with kswapd, as it prevents additional cache eviction work from being performed when the system is already under memory pressure.

How Has This Been Tested?

Tested running various IOR tests with Lustre file system and did not see any drop in performance.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

amotin

I don't like a concept of "reduce the chance of this deadlock" -- it there is a deadlock, it must be fixed reliably. Also exposing internal kitchen of ARC to a whole public for that it not a clean way ether. What I think should be investigated, is dbuf_evict_one() calls from random dbuf_evict_notify() contexts, since it is not strictly required and is only a performance optimization.

khoang98 · 2025-07-21T20:07:19Z

I don't like a concept of "reduce the chance of this deadlock" -- it there is a deadlock, it must be fixed reliably. Also exposing internal kitchen of ARC to a whole public for that it not a clean way ether.

That makes sense to me. Ideally, I wanted the check higher up in in arc_buf_destroy to not block if the hashlock is locked, but at that point, we've already changed state in other method calls in the stack and clean up becomes a bit tricky...

What I think should be investigated, is dbuf_evict_one() calls from random dbuf_evict_notify() contexts, since it is not strictly required and is only a performance optimization.

Can you help me understand this a bit more? From what I can tell, dbuf_evict_notify() will wake up the eviction thread and call dbuf_evict_one() if the dbuf cache is at the high water mark. We also saw this eviction thread stuck trying to obtain the lock too - so just trying to find a way to unstuck these threads to not run into this in the future

[372073.656295] INFO: task dbuf_evict:677 blocked for more than 122 seconds.
[372073.657401]       Tainted: P           OE     5.10.236-228.935.amzn2.aarch64 #1
[372073.658602] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[372073.659888] task:dbuf_evict      state:D stack:    0 pid:  677 ppid:     2 flags:0x00000028
[372073.661256] Call trace:
[372073.661687]  __switch_to+0xbc/0xfc
[372073.662268]  __schedule+0x28c/0x718
[372073.662864]  schedule+0x4c/0xcc
[372073.663407]  schedule_preempt_disabled+0x14/0x1c
[372073.664181]  __mutex_lock.constprop.0+0x190/0x640
[372073.664969]  __mutex_lock_slowpath+0x18/0x20
[372073.665691]  mutex_lock+0x74/0x80
[372073.666298]  arc_buf_destroy+0x84/0x178 [zfs]
[372073.667073]  dbuf_destroy+0x38/0x3dc [zfs]
[372073.667808]  dbuf_evict_one+0x168/0x188 [zfs]
[372073.668583]  dbuf_evict_thread+0x154/0x288 [zfs]
[372073.669365]  thread_generic_wrapper+0x64/0x80 [spl]
[372073.670177]  kthread+0x118/0x120

amotin · 2025-07-21T20:59:32Z

We also saw this eviction thread stuck trying to obtain the lock too - so just trying to find a way to unstuck these threads to not run into this in the future

Yes, the dbuf eviction thread may stuck until the hash lock is released, but it shouldn't block anything else, since as I see it is carefully written to not hold any locks at that point. And dbuf cache size should be small enough to not matter that much if it is not evicting for some time.

vandanarungta · 2025-07-21T21:13:39Z

Other options that we considered:
1/ arc_read times out if it cannot get a page - but there is no error handling in arc_read in this path - so seemed risky
2/ arc_buf_destroy use a try_lock - but as Kaitlyn mentioned dbuf states have changed and there is no error handling in that path - so seemed risky
3/ (proposed) check if the hash_lock is owned (fair point on not exposing ARC, we can change that to be cleaner), before pulling the dbuf off the linked list.

Would appreciate thoughts on this: db_mtx lock is held when we check if the hash lock is owned. Is it possible between the time frame " of checking the hash lock and getting the hash_lock in arc_buf_destroy " for a new arc_read to get the hash_lock. If it can, then that is the open deadlock window.

If removing calls to dbuf_evict from dbuf_rele_and_unlock (and any other similar places) is the option you are proposing - it sounds like a reasonable solution - we can explore that - at least it will prevent kswapd from getting stuck and hard hanging the system.

amotin · 2025-07-21T22:25:06Z

Is it possible between the time frame " of checking the hash lock and getting the hash_lock in arc_buf_destroy " for a new arc_read to get the hash_lock. If it can, then that is the open deadlock window.

The same ARC buffer can be used by several dbufs, and the same hash lock can be used by several ARC buffers. So the idea of checking and later taking makes little sense to me, since does not ensure anything.

If removing calls to dbuf_evict from dbuf_rele_and_unlock (and any other similar places) is the option you are proposing - it sounds like a reasonable solution - we can explore that - at least it will prevent kswapd from getting stuck and hard hanging the system.

That is not what I have said. Read again. What I am saying is that there are two ways to evict dbuf cache: the dedicated thread and any other helpers. The first one is required, the second are only a performance optimization, which you can skip if called from kswapd.

khoang98 · 2025-07-24T14:58:02Z

@amotin pushed a new change based on your comments and updated the PR description - let me know what you think, thanks

amotin · 2025-07-24T15:26:22Z

@khoang98 This is kind of what I've meant, except current_is_kswapd() is Linux-specific. It may need some wrapper, if anything. On FreeBSD it makes me think about something like if (curproc == pageproc), but I haven't looked whether it is needed there.

khoang98 · 2025-07-24T16:03:42Z

@amotin updated to limit this change for Linux only

module/zfs/dbuf.c

khoang98 · 2025-07-24T19:34:28Z

@amotin moved the check to dbuf.h to avoid the definition checking in c file

include/sys/dbuf.h

khoang98 · 2025-07-25T14:50:15Z

@amotin made the check more generic with OS-specific implementations

amotin

Yes, I think this is the right way to do it.

module/zfs/dbuf.c

include/os/freebsd/spl/sys/misc.h

Avoid calling dbuf_evict_one() from memory reclaim contexts (e.g. Linux kswapd, FreeBSD pagedaemon). This prevents deadlock caused by reclaim threads waiting for the dbuf hash lock in the call sequence: dbuf_evict_one -> dbuf_destroy -> arc_buf_destroy Signed-off-by: Kaitlin Hoang <[email protected]>

github-actions bot added the Status: Work in Progress Not yet ready for general review label Jul 21, 2025

amotin reviewed Jul 21, 2025

View reviewed changes

khoang98 force-pushed the hashlock_lookup branch 2 times, most recently from 98bdc89 to 0b3c168 Compare July 21, 2025 19:59

khoang98 force-pushed the hashlock_lookup branch from 0b3c168 to 8078c5c Compare July 24, 2025 14:51

khoang98 changed the title ~~Add hash_lock lookup to dbuf_evict_one to prevent deadlock in arc_buf_destroy~~ Don't call dbuf_evict_one from dbuf_evict_notify for kswapd Jul 24, 2025

khoang98 force-pushed the hashlock_lookup branch from 8078c5c to c248c07 Compare July 24, 2025 16:01

amotin reviewed Jul 24, 2025

View reviewed changes

module/zfs/dbuf.c Outdated Show resolved Hide resolved

khoang98 force-pushed the hashlock_lookup branch from c248c07 to 845722f Compare July 24, 2025 19:33

amotin reviewed Jul 24, 2025

View reviewed changes

include/sys/dbuf.h Outdated Show resolved Hide resolved

khoang98 force-pushed the hashlock_lookup branch from 845722f to 8ff19b8 Compare July 25, 2025 14:47

amotin reviewed Jul 25, 2025

View reviewed changes

module/zfs/dbuf.c Outdated Show resolved Hide resolved

include/os/freebsd/spl/sys/misc.h Outdated Show resolved Hide resolved

khoang98 force-pushed the hashlock_lookup branch 2 times, most recently from 4831fad to bec99b7 Compare July 25, 2025 19:06

khoang98 marked this pull request as ready for review July 28, 2025 18:40

github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Jul 28, 2025

behlendorf self-requested a review July 30, 2025 00:08

behlendorf force-pushed the hashlock_lookup branch from bec99b7 to dde05bd Compare July 30, 2025 03:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Don't call dbuf_evict_one from dbuf_evict_notify for kswapd #17561

Don't call dbuf_evict_one from dbuf_evict_notify for kswapd #17561

khoang98 commented Jul 21, 2025 •

edited

Loading

Uh oh!

amotin left a comment

Uh oh!

khoang98 commented Jul 21, 2025

Uh oh!

amotin commented Jul 21, 2025

Uh oh!

vandanarungta commented Jul 21, 2025

Uh oh!

amotin commented Jul 21, 2025

Uh oh!

khoang98 commented Jul 24, 2025

Uh oh!

amotin commented Jul 24, 2025 •

edited

Loading

Uh oh!

khoang98 commented Jul 24, 2025

Uh oh!

Uh oh!

khoang98 commented Jul 24, 2025

Uh oh!

Uh oh!

khoang98 commented Jul 25, 2025

Uh oh!

amotin left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Don't call dbuf_evict_one from dbuf_evict_notify for kswapd #17561

Are you sure you want to change the base?

Don't call dbuf_evict_one from dbuf_evict_notify for kswapd #17561

Conversation

khoang98 commented Jul 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

Uh oh!

amotin left a comment

Choose a reason for hiding this comment

Uh oh!

khoang98 commented Jul 21, 2025

Uh oh!

amotin commented Jul 21, 2025

Uh oh!

vandanarungta commented Jul 21, 2025

Uh oh!

amotin commented Jul 21, 2025

Uh oh!

khoang98 commented Jul 24, 2025

Uh oh!

amotin commented Jul 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

khoang98 commented Jul 24, 2025

Uh oh!

Uh oh!

khoang98 commented Jul 24, 2025

Uh oh!

Uh oh!

khoang98 commented Jul 25, 2025

Uh oh!

amotin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

khoang98 commented Jul 21, 2025 •

edited

Loading

amotin commented Jul 24, 2025 •

edited

Loading