
Shutting down due to fatal error error=failed to read from netlink (resync): device or resource busy #10720

@linecolumn

Description


We are intermittently encountering a fatal error in calico-node on random Kubernetes nodes in our clusters, approximately once every couple of days. When this occurs, the affected node becomes completely unusable for workloads, and the only reliable way to recover is to reboot the host.

[ERROR][392] felix/daemon.go 411: Shutting down due to fatal error error=failed to read from netlink (resync): device or resource busy
[WARNING][392] felix/daemon.go 777: Felix is shutting down reason="fatal error"
[INFO][392] felix/daemon.go 832: Sleeping to avoid tight restart loop. reason="fatal error"
[FATAL][392] felix/daemon.go 845: Exiting. reason="fatal error"

We see this issue on the following on-prem clusters:

RKE v1 cluster: Kubernetes v1.32.5 + Calico v3.29.4 (Manifest install) + eBPF/DSR + RHEL v8.10
RKE v1 cluster: Kubernetes v1.30.13 + Calico v3.29.4 (Manifest install) + eBPF/DSR + RHEL v8.10

iptables-1.8.5-11.el8.x86_64
nftables-1.0.4-7.el8_10.x86_64

Calico settings:

  • IPIP never
  • VXLAN never
  • IP_AUTODETECTION_METHOD interface=em2
  • FELIX_MTUIFACEPATTERN=em2
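
For reference, a sketch of how these settings map onto the calico-node container environment in a manifest install (the variable names are standard Calico ones; the surrounding DaemonSet fields are elided, and CALICO_IPV4POOL_IPIP/CALICO_IPV4POOL_VXLAN are assumed here to be how "IPIP never" / "VXLAN never" were applied):

```yaml
# Illustrative excerpt of the calico-node container env (manifest install).
# The first two disable encapsulation on the default IP pool; the last two
# pin IP autodetection and Felix MTU detection to em2, as listed above.
- name: CALICO_IPV4POOL_IPIP
  value: "Never"
- name: CALICO_IPV4POOL_VXLAN
  value: "Never"
- name: IP_AUTODETECTION_METHOD
  value: "interface=em2"
- name: FELIX_MTUIFACEPATTERN
  value: "em2"
```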

When inspecting host-level logs (e.g., journalctl, dmesg) at the time of failure:

  • No network disruptions, interface changes, or link flaps are seen
  • No signs of packet drops, hardware errors, or system resource exhaustion are present
  • Interfaces remain up and healthy
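
For anyone wanting to run the same correlation checks, a rough read-only sketch (assumes iproute2 and dmesg are available; em2 is this report's uplink, and the lo default is only so the snippet runs on any host):

```shell
# Hedged diagnostic sketch. Set IFACE to the node uplink (em2 in this
# report); it defaults to lo only so the commands run anywhere.
IFACE=${IFACE:-lo}

# Kernel messages mentioning the interface or link-state changes:
dmesg -T 2>/dev/null | grep -iE "$IFACE|link is (up|down)" | tail -n 20

# Per-interface RX/TX error and drop counters (these stayed flat for us):
ip -s link show dev "$IFACE"

# Live netlink event stream; leave running until the next Felix crash to
# catch any link/addr/route churn Felix might be reacting to:
# ip monitor link addr route
```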

Please advise if you have encountered similar issues or have fixes.
