Skip to content

BPF: Fix that stale-NAT cleanup didn't wait for kube-proxy sync #10721

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 24, 2025

Conversation

fasaxc
Copy link
Member

@fasaxc fasaxc commented Jul 23, 2025

Description

Conntrack cleanup could fire before kube-proxy had loaded the services, resulting in every NAT entry looking stale.

Related issues/PRs

Should fix #10670
CORE-11639

Todos

  • Tests
  • Documentation
  • Release note

Release Note

eBPF: Fix race between loading kubernetes services and conntrack cleanup.  If conntrack cleanup ran before services were loaded, all service entries would look stale and get cleaned up.

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

Conntrack cleanup could fire before kube-proxy had loaded the services,
resulting in _every_ NAT entry looking stale.
@Copilot Copilot AI review requested due to automatic review settings July 23, 2025 13:59
@fasaxc fasaxc requested a review from a team as a code owner July 23, 2025 13:59
@marvin-tigera marvin-tigera added this to the Calico v3.31.0 milestone Jul 23, 2025
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jul 23, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a race condition where conntrack cleanup could execute before kube-proxy had completed its initial sync, causing all NAT entries to appear stale and be incorrectly removed.

  • Adds a HasSynced() method to the DPSyncer interface to track sync status
  • Updates conntrack checking logic to wait for syncer synchronization before proceeding
  • Includes comprehensive test coverage for the race condition scenarios

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
felix/bpf/proxy/proxy.go Adds HasSynced() method to DPSyncer interface
felix/bpf/proxy/syncer.go Implements HasSynced() method returning sync status
felix/bpf/proxy/kube-proxy.go Updates conntrack methods to check sync status before proceeding
felix/bpf/proxy/proxy_test.go Adds mock implementation of HasSynced() for existing tests
felix/bpf/proxy/kube-proxy_internal_test.go New test file covering the race condition scenarios
Comments suppressed due to low confidence (4)

felix/bpf/proxy/kube-proxy_internal_test.go:33

  • The test assertion logic appears inverted. When the syncer has not synced, ConntrackFrontendHasBackend should return true (conservative approach), but the test expects false and will fail.
	if !kp.ConntrackFrontendHasBackend(nil, 0, nil, 0, 0) {

felix/bpf/proxy/kube-proxy_internal_test.go:37

  • The test assertion logic appears inverted. When the syncer has synced, the method should delegate to the mock syncer which returns false, so the test should expect false but it expects true.
	if kp.ConntrackFrontendHasBackend(nil, 0, nil, 0, 0) {

felix/bpf/proxy/kube-proxy_internal_test.go:48

  • The test assertion logic appears inverted. When the syncer has not synced, ConntrackDestIsService should return false (conservative approach), but the test expects true and will fail.
	if kp.ConntrackDestIsService(nil, 0, 0) {

felix/bpf/proxy/kube-proxy_internal_test.go:52

  • The test assertion logic appears inverted. When the syncer has synced, the method should delegate to the mock syncer which returns true, so the test should expect true but it expects false.
	if !kp.ConntrackDestIsService(nil, 0, 0) {

@fasaxc fasaxc added docs-not-required Docs not required for this change cherry-pick-candidate and removed docs-pr-required Change is not yet documented labels Jul 23, 2025
Copy link
Member

@sridhartigera sridhartigera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fasaxc fasaxc merged commit 9bd69b9 into projectcalico:master Jul 24, 2025
6 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick-candidate docs-not-required Docs not required for this change release-note-required Change has user-facing impact (no matter how small)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[bpf] Connections dropped during felix restart
3 participants