Expected Behavior
As usually observed with the iptables data plane, I'd expect a typha restart (as in a regular typha rolling redeployment) not to have a significant negative impact on the ebpf data plane.
Current Behavior
Using the ebpf data plane in a larger cluster (280 nodes, 1900 services, 18k pods), a typha restart causes hanging traffic. It seems that while a calico-node is reconnecting (even though its output looks similar to what it shows with the iptables data plane), the ebpf data plane is reconfigured and cannot properly forward traffic until that is done.
Specifically, I'm seeing timeouts: our ingress controllers / gateways (traefik, istio) are not able to forward traffic to the workload pods.
In my last production test, I observed effects for up to 20m after the typha redeployment. Old typha pods took up to 5m to terminate (which is our termination grace period, though it seems all connections were handed off before termination). Typha metrics show that all connections were back up ~6m after the start of the shutdown, yet I still saw application impact for up to 20m after the restart.
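For reference, something like the following can be used to watch Typha's connection count during the rollout. This is only a sketch: it assumes Typha's Prometheus metrics are enabled on the default port 9093 and that `typha_connections_active` is the metric of interest; adjust namespace and port to your install.

```bash
# Watch Typha's active connection count while the rollout is in progress.
# Assumes Typha's Prometheus metrics are enabled on the default port 9093.
kubectl -n kube-system port-forward deployment/calico-typha 9093:9093 &
watch -n 5 'curl -s http://localhost:9093/metrics | grep typha_connections_active'
```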
Also noteworthy: I see a significant drop in conntrack table size during the typha reconnects, which I would not expect. Maybe this is even the core of the problem, i.e. that conntrack state is (partially?) dropped?
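A rough way to sample the BPF conntrack table size on a node before/during/after the rollout, assuming the `calico-node -bpf conntrack dump` subcommand is available in this version (the line count is only a proxy for the number of entries):

```bash
# Sample the BPF conntrack table size on one calico-node pod.
NODE_POD=$(kubectl get pods -n kube-system -l k8s-app=calico-node -o name | head -n1)
kubectl exec -n kube-system "${NODE_POD#pod/}" -- calico-node -bpf conntrack dump | wc -l
```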
Steps to Reproduce (for bugs)
kubectl rollout restart -n kube-system deployment calico-typha
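To surface the hanging traffic while reproducing, the restart can be combined with a continuous probe through the ingress/gateway path; the URL below is hypothetical and should be replaced with a service reached via your ingress controllers:

```bash
# Continuously probe a workload service through the ingress path (hypothetical URL),
# then restart Typha and wait for the rollout to complete.
while true; do
  curl -s -o /dev/null -m 2 -w '%{http_code} %{time_total}s\n' https://my-app.example.com/healthz
  sleep 1
done &
kubectl rollout restart -n kube-system deployment calico-typha
kubectl rollout status -n kube-system deployment calico-typha
```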
Your Environment
- Calico version: 3.30.1
- Calico dataplane: ebpf
- Orchestrator version: Kubernetes v1.31.8
- Operating System and version: FlatCar ContainerLinux 4152.2.3