Description
We are intermittently encountering a fatal error in calico-node on random Kubernetes nodes in our clusters, approximately once every couple of days. When this occurs, the affected node becomes completely unusable for workloads, and the only reliable way to recover is to reboot the host.
[ERROR][392] felix/daemon.go 411: Shutting down due to fatal error error=failed to read from netlink (resync): device or resource busy
[WARNING][392] felix/daemon.go 777: Felix is shutting down reason="fatal error"
[INFO][392] felix/daemon.go 832: Sleeping to avoid tight restart loop. reason="fatal error"
[FATAL][392] felix/daemon.go 845: Exiting. reason="fatal error"
We see this issue on the following on-prem clusters:
RKE v1 cluster: Kubernetes v1.32.5 + Calico v3.29.4 (Manifest install) + eBPF/DSR + RHEL v8.10
RKE v1 cluster: Kubernetes v1.30.13 + Calico v3.29.4 (Manifest install) + eBPF/DSR + RHEL v8.10
iptables-1.8.5-11.el8.x86_64
nftables-1.0.4-7.el8_10.x86_64
Calico settings:
- IPIP never
- VXLAN never
- IP_AUTODETECTION_METHOD interface=em2
- FELIX_MTUIFACEPATTERN=em2
When inspecting host-level logs (e.g., journalctl, dmesg) at the time of failure, there are no events that could plausibly explain this:
- No network disruptions, interface changes, or link flaps are seen
- No signs of packet drops, hardware errors, or system resource exhaustion are present
- Interfaces remain up and healthy
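To try to narrow this down, below is a minimal Go sketch (assuming the github.com/vishvananda/netlink library, which felix also uses; this is our own diagnostic, not part of Calico) that repeatedly performs netlink link and route dumps on the host, so we can check whether the "device or resource busy" error is reproducible outside of calico-node:

```go
// netlink-dump-check: repeatedly performs netlink link and route dumps,
// roughly what a felix resync does, and logs any errors (e.g. EBUSY).
// Minimal diagnostic sketch only.
package main

import (
	"log"
	"time"

	"github.com/vishvananda/netlink"
)

func main() {
	for {
		// Dump all interfaces; a busy netlink socket would surface here
		// as "device or resource busy".
		if _, err := netlink.LinkList(); err != nil {
			log.Printf("link dump failed: %v", err)
		}

		// Dump all routes (IPv4 + IPv6) for the same reason.
		if _, err := netlink.RouteList(nil, netlink.FAMILY_ALL); err != nil {
			log.Printf("route dump failed: %v", err)
		}

		time.Sleep(5 * time.Second)
	}
}
```

We plan to leave this running on an affected node while waiting for the next occurrence.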
Please advise if you have encountered similar issues or know of fixes.