@troglobit troglobit commented Sep 1, 2025

Description

This PR addresses a set of container issues discovered while troubleshooting #1105. It turns out that disabling or removing a container may, under certain circumstances, cause podman to deadlock and leave persistent file locks in /var/lib/containers.

The fix is to upgrade podman to v4.9.5 (the last release before CNI support was removed) and Finit to v4.14, which lets all services properly complete before the next "configuration generation" starts. See the commit messages for more information.

A regression test, container_enabled, has been added to ensure this particular issue never creeps back in. For improved test coverage, another test for verifying environment variables, container_environment, was also added.

Note

Also included in this PR is an updated logo and slightly refreshed README that's worth checking out 😃

Checklist

Tick relevant boxes, this PR is-a or has-a:

  • Bugfix
    • Regression tests
    • ChangeLog updates (for next release)
  • Feature
    • YANG model change => revision updated?
    • Regression tests added?
    • ChangeLog updates (for next release)
    • Documentation added?
  • Test changes
    • Checked in changed Readme.adoc (make test-spec)
    • Added new test to group Readme.adoc and yaml file
  • Code style update (formatting, renaming)
  • Refactoring (please detail in commit messages)
  • Build related changes
  • Documentation content changes
    • ChangeLog updated (for major changes)
  • Other (please describe):

From the documentation:

> 'podman image prune' removes all dangling images from local storage.
> With the all option, all unused images are deleted (i.e., images not
> in use by any container).
>
> The image prune command does not prune cache images that only use
> layers that are necessary for other images.

So, when the container script is called in the cleanup phase of the
lifetime of a container, we can use the '--all' option to ensure we also
remove this container's loaded image.  In the case this happens before
a reboot of the system, there will be no old version of the image loaded
to /var/lib/containers after boot.
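
As a rough illustration of the cleanup step described above, the sketch below builds and runs the prune command from Python. The helper names `prune_argv` and `prune_images` are hypothetical and not part of the container script; only the `podman image prune` flags `--all` and `--force` are taken from podman itself.

```python
import subprocess


def prune_argv(remove_unused=True):
    """Build the argv for pruning podman's local image store (hypothetical helper)."""
    argv = ["podman", "image", "prune", "--force"]  # --force skips the confirmation prompt
    if remove_unused:
        argv.append("--all")  # also remove images not in use by any container
    return argv


def prune_images(remove_unused=True, run=subprocess.run):
    """Run the prune; `run` is injectable so the sketch can be exercised without podman."""
    return run(prune_argv(remove_unused), check=False)
```

Without `--all` only dangling images are removed, which is why the container's own loaded image would otherwise survive the cleanup.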

Issue #1098

Signed-off-by: Joachim Wiberg <[email protected]>
Signed-off-by: Joachim Wiberg <[email protected]>

@jovatn jovatn left a comment

Only checked README updates, and they look great!
Found a minor typo, that's all.

As Infix matures as an operating system it is quickly becoming more and
more useful also for end-device use-cases.  The README should reflect
this change in focus.

Signed-off-by: Joachim Wiberg <[email protected]>
Highlights:
 - fixes to systemd and s6 type services
 - bare-bones libsystemd replacement with #include <systemd/sd-daemon.h>
 - new reload:script mimicking systemd ExecReload, and
 - new stop:script mimicking systemd ExecStop
 - exit status/signal info when a process dies
 - service kill:SEC now supports up to 300 sec.
 - the /tmp/norespawn trick now also covers service_retry()
 - the sysv 'stop' command process environment is now same as 'start'
 - State machine ordering issue: enter new config generation after
   services disabled in previous generation have been stopped

Full changelog at:
 - <https://github.com/troglobit/finit/releases/tag/4.13>
 - <https://github.com/troglobit/finit/releases/tag/4.14>

Fixes #1123

Signed-off-by: Joachim Wiberg <[email protected]>
This major upgrade, along with the upgrade to Finit v4.14, is what is
needed to fix #1123, which was caused by an odd futex locking bug in
Podman that left lingering issues in /var/lib/containers state files.
The root cause was already fixed in v4.7.x, but since CNI is supported
up to and including 4.9.5, going with a later release seemed prudent.

Full changelogs at:
 - <https://github.com/containers/podman/releases/tag/v4.5.1>
 - <https://github.com/containers/podman/releases/tag/v4.6.0>
 - <https://github.com/containers/podman/releases/tag/v4.6.1>
 - <https://github.com/containers/podman/releases/tag/v4.6.2>
 - <https://github.com/containers/podman/releases/tag/v4.7.0>
 - <https://github.com/containers/podman/releases/tag/v4.7.1>
 - <https://github.com/containers/podman/releases/tag/v4.7.2>
 - <https://github.com/containers/podman/releases/tag/v4.8.0>
 - <https://github.com/containers/podman/releases/tag/v4.8.1>
 - <https://github.com/containers/podman/releases/tag/v4.8.2>
 - <https://github.com/containers/podman/releases/tag/v4.8.3>
 - <https://github.com/containers/podman/releases/tag/v4.9.0>
 - <https://github.com/containers/podman/releases/tag/v4.9.1>
 - <https://github.com/containers/podman/releases/tag/v4.9.2>
 - <https://github.com/containers/podman/releases/tag/v4.9.3>
 - <https://github.com/containers/podman/releases/tag/v4.9.4>
 - <https://github.com/containers/podman/releases/tag/v4.9.5>

Fixes #1123

Signed-off-by: Joachim Wiberg <[email protected]>
@mattiaswal

Great work overall, you are the 🥇

@wkz wkz left a comment

Great work! 🥇

@mattiaswal mattiaswal self-requested a review September 1, 2025 11:47
The extended kill delay (10 sec) is sometimes not enough for complex
system containers.  Also, podman sometimes takes the opportunity to do
housekeeping tasks when stopping a container.  So, allow for up to 30
sec. grace period before we send SIGKILL.

With the latest image prune extension, set a 60 sec. timeout for the
cleanup task, in case podman gets stuck.  This is to prevent any future
mishaps.
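
The two timeouts above can be sketched in Python as follows. This only illustrates the semantics (SIGTERM, a grace period, then SIGKILL, plus a hard limit on the cleanup task); it is not how Finit or the container script implements them, and the helper names `stop_with_grace` and `run_cleanup` are hypothetical.

```python
import subprocess


def stop_with_grace(proc, grace=30.0):
    """Send SIGTERM, wait up to `grace` seconds, then fall back to SIGKILL."""
    proc.terminate()
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # grace period expired, force termination
        return proc.wait()


def run_cleanup(argv, limit=60.0):
    """Run a cleanup command, giving up after `limit` seconds.

    Returns the exit status, or None if the command was killed on timeout.
    """
    try:
        return subprocess.run(argv, timeout=limit, check=False).returncode
    except subprocess.TimeoutExpired:
        return None
```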

Signed-off-by: Joachim Wiberg <[email protected]>
When a container's image is on an inaccessible remote server, the
container wrapper script waits in the background for any network
changes to retry download of the image.

This change avoids the dangerous previous construct, and is also
easier to read: time out after 60 seconds unless ip monitor reads
at least one event before that.
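
The wait-with-timeout logic described above could look roughly like this in Python. The actual wrapper is a shell script and its construct is not shown here, so treat this as a sketch under assumptions: `wait_for_event` is a hypothetical name, and `select` on a pipe assumes a POSIX system.

```python
import os
import select
import subprocess


def wait_for_event(argv, timeout=60.0):
    """Return True if `argv` prints anything within `timeout` seconds."""
    with subprocess.Popen(argv, stdout=subprocess.PIPE) as proc:
        try:
            ready, _, _ = select.select([proc.stdout], [], [], timeout)
            # A readable pipe that yields no data means the monitor exited (EOF).
            return bool(ready) and bool(os.read(proc.stdout.fileno(), 4096))
        finally:
            proc.terminate()  # never leave the monitor running in the background
```

A caller would pass something like `["ip", "monitor"]` and retry the image download once the function returns. Terminating the child in all cases is the point of the construct: no monitor process outlives its wait.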

Fixes #1124

Signed-off-by: Joachim Wiberg <[email protected]>
 - the port-mapping plugin supports iptables or nftables
 - the firewall plugin supports only iptables or firewalld

Enforce use of iptables wrapper for nftables, for now, in both plugins.
This all needs to be refactored to run podman with "unmanaged" networks
in the future.

Related to issue #1125

Signed-off-by: Joachim Wiberg <[email protected]>
 - Drop redundant comments
 - Drop redundant imports
 - PEP-8 fixes

Signed-off-by: Joachim Wiberg <[email protected]>
@troglobit

Another minor change was added late to this PR, issue #1127, discussed with and approved by @mattiaswal

@troglobit troglobit force-pushed the misc branch 2 times, most recently from d68230f to 46cd249 September 2, 2025 06:45
Usually the CNI bridge plugin "takes care" of enabling IPv4 forwarding
on all interfaces, see issue #1125, but when the container tests are run
in a different order from the infix_containers.yaml, Infix may reset the
IPv4 forwarding on this critical interface.

This change is future-proof and also ensures the test works as
intended even if tests are run out of order.
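
For reference, the kernel exposes the per-interface IPv4 forwarding state under /proc/sys/net/ipv4/conf/<ifname>/forwarding. The sketch below shows one way to check it; the helper names and the `root` parameter (which only exists so the code can be tried against a scratch directory) are hypothetical and not taken from the test itself.

```python
from pathlib import Path


def forwarding_file(ifname, root="/"):
    """Path of the per-interface IPv4 forwarding sysctl for `ifname`."""
    return Path(root) / "proc/sys/net/ipv4/conf" / ifname / "forwarding"


def ipv4_forwarding_enabled(ifname, root="/"):
    """True if the kernel reports IPv4 forwarding enabled on `ifname`."""
    return forwarding_file(ifname, root).read_text().strip() == "1"
```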

Signed-off-by: Joachim Wiberg <[email protected]>
Regression test for issue #1123

Signed-off-by: Joachim Wiberg <[email protected]>
On a heavily loaded system, 10 seconds/retries is not enough time for
containers to have started up, particularly after the recent changes
to prune before and after a container is started.
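
A bounded polling loop of the kind described above might look like this. It is a generic sketch, not the test framework's own helper: `wait_until` and its parameters are hypothetical, and the retry count to use is left to the caller.

```python
import time


def wait_until(predicate, attempts, delay=1.0):
    """Poll `predicate` up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if predicate():
            return True
        time.sleep(delay)
    return False
```

Raising `attempts` only lengthens the worst case; a container that starts quickly still passes on the first successful poll.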

Signed-off-by: Joachim Wiberg <[email protected]>
@mattiaswal mattiaswal merged commit 6cdcd57 into main Sep 2, 2025
6 checks passed
@mattiaswal mattiaswal deleted the misc branch September 2, 2025 10:05
Labels

ci:main Build default defconfig, not minimal

Development

Successfully merging this pull request may close these issues.

Container setup with unreachable image spawns excessive ip monitor processes
Disabling or removing a container may cause podman to hang

5 participants