
Conversation

@mheon (Member) commented Sep 3, 2025

The cleanupExecBundle function was only meant to be called on a locked container, as it does some state mutation operations. It also has a timed wait (if the directory is busy and can't be removed yet, give it a few milliseconds) in which it deliberately yields the lock to not block the container for that time.

The healthCheckExec() function calls cleanupExecBundle from a defer block registered after the defer c.lock.Unlock(); since deferred calls run in reverse order, our defer fires first and the unlock fires right after it when the function returns, so we're normally fine - the container is still locked when our defer runs. The problem is that healthCheckExec() also unlocks the container around the expensive exec operation, and can actually fail and return while not holding the lock - meaning our defer can fire on an unlocked container, leading to a potential double unlock in cleanupExecBundle.

We could, potentially, re-lock the container after the exec occurs, but we're actually waiting for a select to trigger to end the function, so that's not a good solution. Instead, just re-lock (if necessary) in the defer, before invoking cleanupExecBundle(). The defer c.lock.Unlock() will fire right after and unlock after us.
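
For reference, a minimal sketch of the resulting defer ordering (the unlock flag, session, and the logging call are illustrative; deferred calls run last-in, first-out):

defer c.lock.Unlock() // registered first, runs last

defer func() { // registered second, runs first
	// cleanupExecBundle MUST be called with the parent container locked.
	// If the exec path dropped the lock and we are returning without having
	// re-taken it, re-acquire it here; the Unlock defer above fires right
	// after us, so every Lock() is still paired with exactly one Unlock().
	if !unlock {
		c.lock.Lock()
	}
	if err := c.cleanupExecBundle(session.ID()); err != nil {
		logrus.Errorf("Container %s: cleaning up healthcheck exec session: %v", c.ID(), err)
	}
}()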

Fixes #26968

Does this PR introduce a user-facing change?

Fixed a bug where Podman would sometimes panic when executing health checks due to a locking issue

@openshift-ci bot added the release-note and approved (Indicates a PR has been approved by an approver from all required OWNERS files.) labels Sep 3, 2025
@mheon added the No New Tests (Allow PR to proceed without adding regression tests) label Sep 3, 2025
@mheon (Member, Author) commented Sep 3, 2025

Tagging No New Tests; we probably can't reliably reproduce the conditions required for the double unlock (a busy error when removing the exec session directory).

But don't merge just yet, I think there's some cleanup necessary in healthCheckExec() around exec session handling - I'll investigate tomorrow morning.

@packit-as-a-service

[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore.

1 similar comment

@mheon (Member, Author) commented Sep 3, 2025

Alright, the exec session patch breaks it, because we can't retrieve the exec session PID to stop it early if there's a timeout. Inconvenient. I'll think about this more in the morning.

defer func() {
	// cleanupExecBundle MUST be called with the parent container locked.
	if !unlock {
		c.lock.Lock()
Member:

you cannot ever lock without calling syncContainer() because the function will call c.save() with an outdated state otherwise which will cause really bad problems.

Member:

could we also add the !c.batched condition? Otherwise the block here depends on unlock being false only when !c.batched and that could change without us noticing the breakage of the dependency.
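
Taken together, the two suggestions amount to roughly the following (a sketch only; it assumes unlock is false exactly when the lock was dropped, as in the hunk above, and the logging is illustrative):

if !c.batched && !unlock {
	// Only re-lock when this function manages the lock itself (not batched)
	// and the exec path left the container unlocked.
	c.lock.Lock()
	// Never take the lock without re-syncing state: a later c.save() would
	// otherwise persist an outdated copy of c.state.
	if err := c.syncContainer(); err != nil {
		logrus.Errorf("Container %s: syncing state before exec cleanup: %v", c.ID(), err)
	}
}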

@Luap99 (Member) commented Sep 3, 2025

Why do we need to ignore ENOTEMPTY and EBUSY at all? I see this was added in e252b3b, which mentions NFS. Our VMs do not run on NFS, yet we still landed in that code path.

Doesn't ENOTEMPTY mean something is writing into the directory while we delete all files from it? How would that happen? And EBUSY is for mount points, which definitely should not be the case here.

Do we actually gain anything by unlocking instead of busy-looping while staying locked? How often would such an ENOTEMPTY/EBUSY error actually happen, and how long does it take to get resolved?
I'd much rather do the simple, safe thing, which is not to mess with locking in such a deep call stack; it is way too brittle IMO.

cc @giuseppe

@giuseppe (Member) commented Sep 3, 2025

that is/was needed because if any program (like conmon) has an open file in that directory, you can't really delete it on NFS.

NFS will rename it to a weird .nfs$HASH file, so the Rmdir will fail. The unlocking/locking was added to give conmon time to terminate: since conmon can trigger the podman cleanup, which in turn locks the container, we need to give it the chance to run. Once conmon has terminated, NFS takes care of really deleting these files and then we can do the Rmdir.
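
For context, the retry that yields the lock looks roughly like this (a sketch, not the actual podman code; the package, helper name, attempt count, and delay are invented for illustration):

package execcleanup

import (
	"errors"
	"fmt"
	"os"
	"sync"
	"syscall"
	"time"
)

// removeBusyDir tries to remove dir, retrying while the kernel reports it as
// busy or non-empty (e.g. NFS keeping silly-renamed .nfs* files alive because
// conmon still holds them open). The lock is yielded during each sleep so
// conmon's "podman container cleanup" can take the container lock and finish.
func removeBusyDir(dir string, lock sync.Locker) error {
	const attempts = 50
	for i := 0; i < attempts; i++ {
		err := os.RemoveAll(dir)
		if err == nil {
			return nil
		}
		if !errors.Is(err, syscall.ENOTEMPTY) && !errors.Is(err, syscall.EBUSY) {
			return err
		}
		// Yield the lock, wait briefly, then re-acquire and retry.
		lock.Unlock()
		time.Sleep(100 * time.Millisecond)
		lock.Lock()
	}
	return fmt.Errorf("removing %s: still busy after %d attempts", dir, attempts)
}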

Comment on lines 851 to 856
if c.state.ExecSessions == nil {
	c.state.ExecSessions = make(map[string]*ExecSession)
}
c.state.ExecSessions[session.ID()] = session
defer delete(c.state.ExecSessions, session.ID())

Member:

do we leak the session if podman crashes in healthCheckExec?

Member Author:

I'm attempting to completely remove the exec session from the database for healthchecks, so that ought to eliminate the risk.

Member:

how will the GC work for these sessions?

Member Author:

Hm. This does ensure the database is clean, but we'd potentially leak the directory used for the exec session. I wonder if we can do away with that entirely, it's not like healthchecks have high requirements...

Regardless, I'm going to drop the "remove session from DB" patch from here, this is proving to be much more complicated than I hoped.

Member:

Am I missing something? I thought the point of ad9839a was already that we do not commit the session to the db?

Member Author:

But we're still mutating c.state.State(). And then we sync that back to the database, potentially, in several places.

Member Author:

(Not intentionally, I mean - any c.syncContainer() after we mutate state is a potential save point.)
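
Roughly, the hazard being described (a sketch; the real syncContainer logic is more involved, but any path that ends in c.save() has the same effect):

// The session is added to in-memory state only, intending it to be transient.
c.state.ExecSessions[session.ID()] = session

// ...but any later sync performed by a helper we call can flush state back:
if err := c.syncContainer(); err != nil { // may internally call c.save()
	return err
}
// At this point the "transient" exec session may be persisted in the database.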


@Luap99 (Member) commented Sep 3, 2025

> that is/was needed because if any program (like conmon) has an open file in that directory, you can't really delete it on NFS.

Sure, that was the reason it was added, but that does not explain how the hell we hit that code path in our CI VMs, which AFAIK use a normal ext4 fs.

@mheon (Member, Author) commented Sep 3, 2025

Comments addressed, one flake rerun.

I'm going to look at the session bits separately - that is definitely a larger set of changes.

@giuseppe (Member) left a comment:

LGTM

@Luap99 (Member) left a comment:

/lgtm

@openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged.) label Sep 4, 2025
@openshift-ci bot (Contributor) commented Sep 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Luap99, mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Luap99,giuseppe,mheon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot bot merged commit a589f51 into containers:main Sep 4, 2025
77 of 81 checks passed

Labels

approved (Indicates a PR has been approved by an approver from all required OWNERS files.), lgtm (Indicates that a PR is ready to be merged.), No New Tests (Allow PR to proceed without adding regression tests), release-note


Development

Successfully merging this pull request may close these issues.

locking issue in podman healthcheck

3 participants