
Conversation

@mheon (Member) commented Sep 3, 2025

The cleanupExecBundle function was only meant to be called on a locked container, as it does some state mutation operations. It also has a timed wait (if the directory is busy and can't be removed yet, give it a few milliseconds) in which it deliberately yields the lock to not block the container for that time.

The healthCheckExec() function calls cleanupExecBundle from a defer block registered after the defer c.lock.Unlock(); since deferred calls run in reverse order, our defer fires first and the unlock fires right after it when the function returns, so we're normally fine - the container is still locked when our defer runs. The problem is that healthCheckExec() also unlocks the container around the expensive exec operation, and can actually fail and return while not holding the lock - meaning our defer can fire on an unlocked container, leading to a potential double unlock in cleanupExecBundle.

We could, potentially, re-lock the container after the exec occurs, but we're actually waiting for a select to trigger to end the function, so that's not a good solution. Instead, just re-lock (if necessary) in the defer, before invoking cleanupExecBundle(). The defer c.lock.Unlock() will fire right after and unlock after us.
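
For reference, a minimal sketch of the resulting defer ordering (the unlock flag, session, and the logging call are illustrative; deferred calls run last-in, first-out):

defer c.lock.Unlock() // registered first, runs last

defer func() { // registered second, runs first
	// cleanupExecBundle MUST be called with the parent container locked.
	// If the exec path dropped the lock and we are returning without having
	// re-taken it, re-acquire it here; the Unlock defer above fires right
	// after us, so every Lock() is still paired with exactly one Unlock().
	if !unlock {
		c.lock.Lock()
	}
	if err := c.cleanupExecBundle(session.ID()); err != nil {
		logrus.Errorf("Container %s: cleaning up healthcheck exec session: %v", c.ID(), err)
	}
}()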

Fixes #26968

Does this PR introduce a user-facing change?

Fixed a bug where Podman would sometimes panic when executing health checks due to a locking issue

@openshift-ci bot added the release-note and approved (Indicates a PR has been approved by an approver from all required OWNERS files.) labels Sep 3, 2025
@mheon added the No New Tests (Allow PR to proceed without adding regression tests) label Sep 3, 2025
@mheon (Member, Author) commented Sep 3, 2025

Tagging No New Tests; we probably can't reliably reproduce the conditions required for the double unlock (a busy error when removing the exec session directory).

But don't merge just yet, I think there's some cleanup necessary in healthCheckExec() around exec session handling - I'll investigate tomorrow morning.

@packit-as-a-service

[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore.

1 similar comment

@mheon (Member, Author) commented Sep 3, 2025

Alright, the exec session patch breaks it, because we can't retrieve the exec session PID to stop it early if there's a timeout. Inconvenient. I'll think about this more in the morning.

defer func() {
	// cleanupExecBundle MUST be called with the parent container locked.
	if !unlock {
		c.lock.Lock()
Member:

you cannot ever lock without calling syncContainer() because the function will call c.save() with an outdated state otherwise which will cause really bad problems.

Member:

could we also add the !c.batched condition? Otherwise the block here depends on unlock being false only when !c.batched and that could change without us noticing the breakage of the dependency.
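
Taken together, the two suggestions amount to roughly the following (a sketch only; it assumes unlock is false exactly when the lock was dropped, as in the hunk above, and the logging is illustrative):

if !c.batched && !unlock {
	// Only re-lock when this function manages the lock itself (not batched)
	// and the exec path left the container unlocked.
	c.lock.Lock()
	// Never take the lock without re-syncing state: a later c.save() would
	// otherwise persist an outdated copy of c.state.
	if err := c.syncContainer(); err != nil {
		logrus.Errorf("Container %s: syncing state before exec cleanup: %v", c.ID(), err)
	}
}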

@Luap99 (Member) commented Sep 3, 2025

Why do we need to ignore ENOTEMPTY and EBUSY at all? I see this was added in e252b3b, which mentions NFS. Our VMs do not run on NFS, yet we still landed in that code path.

Doesn't ENOTEMPTY mean something is writing into the directory while we delete all files from it? How would that happen? And EBUSY is for mount points, which definitely should not be the case here.

Do we actually gain anything by unlocking instead of busy-looping while staying locked? How often would such an ENOTEMPTY/EBUSY error actually happen, and how long does it take to get resolved?
I'd much rather do the simple, safe thing, which is not to mess with locking in such a deep call stack; it is way too brittle IMO.

cc @giuseppe

@giuseppe (Member) commented Sep 3, 2025

that is/was needed because if any program (like conmon) has an open file in that directory, you can't really delete it on NFS.

NFS will rename it to a weird .nfs$HASH file, so the Rmdir will fail. The unlocking/locking was added to give conmon time to terminate: since conmon can trigger the podman cleanup, which in turn locks the container, we need to give it the chance to run. Once conmon has terminated, NFS takes care of really deleting these files and then we can do the Rmdir.
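
For context, the retry that yields the lock looks roughly like this (a sketch, not the actual podman code; the package, helper name, attempt count, and delay are invented for illustration):

package execcleanup

import (
	"errors"
	"fmt"
	"os"
	"sync"
	"syscall"
	"time"
)

// removeBusyDir tries to remove dir, retrying while the kernel reports it as
// busy or non-empty (e.g. NFS keeping silly-renamed .nfs* files alive because
// conmon still holds them open). The lock is yielded during each sleep so
// conmon's "podman container cleanup" can take the container lock and finish.
func removeBusyDir(dir string, lock sync.Locker) error {
	const attempts = 50
	for i := 0; i < attempts; i++ {
		err := os.RemoveAll(dir)
		if err == nil {
			return nil
		}
		if !errors.Is(err, syscall.ENOTEMPTY) && !errors.Is(err, syscall.EBUSY) {
			return err
		}
		// Yield the lock, wait briefly, then re-acquire and retry.
		lock.Unlock()
		time.Sleep(100 * time.Millisecond)
		lock.Lock()
	}
	return fmt.Errorf("removing %s: still busy after %d attempts", dir, attempts)
}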

Comment on lines 851 to 856
if c.state.ExecSessions == nil {
	c.state.ExecSessions = make(map[string]*ExecSession)
}
c.state.ExecSessions[session.ID()] = session
defer delete(c.state.ExecSessions, session.ID())

Member:

do we leak the session if podman crashes in healthCheckExec?

Member Author:

I'm attempting to completely remove the exec session from the database for healthchecks, so that ought to eliminate the risk.

Member:

how will the GC work for these sessions?

Member Author:

Hm. This does ensure the database is clean, but we'd potentially leak the directory used for the exec session. I wonder if we can do away with that entirely, it's not like healthchecks have high requirements...

Regardless, I'm going to drop the "remove session from DB" patch from here, this is proving to be much more complicated than I hoped.

Member:

Am I missing something? I thought the point of ad9839a was already that we do not commit the session to the db?

Member Author:

But we're still mutating c.state.State(). And then we sync that back to the database, potentially, in several places.

Member Author:

(Not intentionally, I mean - any c.syncContainer() after we mutate state is a potential save point.)
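
Roughly, the hazard being described (a sketch; the real syncContainer logic is more involved, but any path that ends in c.save() has the same effect):

// The session is added to in-memory state only, intending it to be transient.
c.state.ExecSessions[session.ID()] = session

// ...but any later sync performed by a helper we call can flush state back:
if err := c.syncContainer(); err != nil { // may internally call c.save()
	return err
}
// At this point the "transient" exec session may be persisted in the database.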


@Luap99 (Member) commented Sep 3, 2025

> that is/was needed because if any program (like conmon) has an open file in that directory, you can't really delete it on NFS.

Sure, that was the reason it was added, but that does not explain how the hell we hit that code path in our CI VMs, which AFAIK use a normal ext4 fs.

@mheon (Member, Author) commented Sep 3, 2025

Comments addressed, one flake rerun.

I'm going to look at the session bits separately - that is definitely a larger set of changes.

@giuseppe (Member) left a comment:

LGTM

@Luap99 (Member) left a comment:

/lgtm

@openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged.) label Sep 4, 2025
@openshift-ci bot (Contributor) commented Sep 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Luap99, mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Luap99,giuseppe,mheon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot bot merged commit a589f51 into containers:main Sep 4, 2025
77 of 81 checks passed

Labels

approved (Indicates a PR has been approved by an approver from all required OWNERS files.), lgtm (Indicates that a PR is ready to be merged.), No New Tests (Allow PR to proceed without adding regression tests), release-note


Development

Successfully merging this pull request may close these issues.

locking issue in podman healthcheck

3 participants