Description
We are intermittently encountering a fatal error in calico-node on random Kubernetes nodes in our clusters, approximately once every couple of days. When this occurs, the affected node becomes completely unusable for workloads, and the only reliable way to recover is to reboot the host.
[ERROR][392] felix/daemon.go 411: Shutting down due to fatal error error=failed to read from netlink (resync): device or resource busy
[WARNING][392] felix/daemon.go 777: Felix is shutting down reason="fatal error"
[INFO][392] felix/daemon.go 832: Sleeping to avoid tight restart loop. reason="fatal error"
[FATAL][392] felix/daemon.go 845: Exiting. reason="fatal error"
We see this issue on the following on-prem clusters:
RKE v1 cluster: Kubernetes v1.32.5 + Calico v3.29.4 (Manifest install) + eBPF/DSR + RHEL v8.10
RKE v1 cluster: Kubernetes v1.30.13 + Calico v3.29.4 (Manifest install) + eBPF/DSR + RHEL v8.10
iptables-1.8.5-11.el8.x86_64
nftables-1.0.4-7.el8_10.x86_64
Calico settings:
- IPIP never
- VXLAN never
- IP_AUTODETECTION_METHOD interface=em2
- FELIX_MTUIFACEPATTERN=em2
When inspecting host-level logs (e.g., journalctl, dmesg) at the time of failure, there are no events that could plausibly explain this:
- No network disruptions, interface changes, or link flaps are seen
- No signs of packet drops, hardware errors, or system resource exhaustion are present
- Interfaces remain up and healthy
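To try to narrow this down, below is a minimal Go sketch (assuming the github.com/vishvananda/netlink library, which felix also uses; this is our own diagnostic, not part of Calico) that repeatedly performs netlink link and route dumps on the host, so we can check whether the "device or resource busy" error is reproducible outside of calico-node:

```go
// netlink-dump-check: repeatedly performs netlink link and route dumps,
// roughly what a felix resync does, and logs any errors (e.g. EBUSY).
// Minimal diagnostic sketch only.
package main

import (
	"log"
	"time"

	"github.com/vishvananda/netlink"
)

func main() {
	for {
		// Dump all interfaces; a busy netlink socket would surface here
		// as "device or resource busy".
		if _, err := netlink.LinkList(); err != nil {
			log.Printf("link dump failed: %v", err)
		}

		// Dump all routes (IPv4 + IPv6) for the same reason.
		if _, err := netlink.RouteList(nil, netlink.FAMILY_ALL); err != nil {
			log.Printf("route dump failed: %v", err)
		}

		time.Sleep(5 * time.Second)
	}
}
```

We plan to leave this running on an affected node while waiting for the next occurrence.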
Please advise if you have encountered similar issues or know of fixes.