2323 - [ Handling Node Updates] ( #handling-node-updates )
2424 - [ Future Expansion] ( #future-expansion )
2525 - [ Test Plan] ( #test-plan )
26- - [ Controller Unit Tests] ( #controller-unit-tests )
27- - [ Kube-Proxy Unit Tests] ( #kube-proxy-unit-tests )
28- - [ e2e Tests] ( #e2e-tests )
26+ - [ Unit tests] ( #unit-tests )
27+ - [ Controller Unit Tests] ( #controller-unit-tests )
28+ - [ Kube-Proxy Unit Tests] ( #kube-proxy-unit-tests )
29+ - [ Integration tests] ( #integration-tests )
30+ - [ e2e tests] ( #e2e-tests )
2931 - [ Observability] ( #observability )
32+ - [ Events] ( #events )
33+ - [ Logic] ( #logic )
34+ - [ Sample Events] ( #sample-events )
35+ - [ Documentation] ( #documentation )
36+ - [ Limitations] ( #limitations )
3037 - [ Graduation Criteria] ( #graduation-criteria )
3138 - [ Version Skew Strategy] ( #version-skew-strategy )
3239- [ Production Readiness Review Questionnaire] ( #production-readiness-review-questionnaire )
@@ -368,12 +375,18 @@ In the future we may expand this functionality if needed. This could include:
368375
369376- A new `RequireZone` algorithm that would keep endpoints in EndpointSlices for
370377 the same zone they are in.
371- - A new option to specify a minimum threshold for the `PreferZone` approach.
378+ - A new option to specify a minimum threshold for the `Auto` (PreferZone)
379+ approach.
372380- Support for region based hints.
373381
374382# ## Test Plan
375383
376- # ### Controller Unit Tests
384+ # ### Unit tests
385+
386+ - `k8s.io/pkg/controller/endpointslice` : ` 2022-10-05` - `73.1`
387+ - `k8s.io/pkg/controller/endpointslice/topologycache` : ` 2022-10-05` - `75.4`
388+
389+ # #### Controller Unit Tests
377390| Test Description | Expected Result |
378391| :--- | :--- |
379392| Feature On, 2+ zones | Hints set |
@@ -393,31 +406,29 @@ In the future we may expand this functionality if needed. This could include:
393406| Endpoint additions that require redistribution | Hints updated |
394407| Endpoint removals that require redistribution | Hints updated |
395408
396- # ### Kube-Proxy Unit Tests
409+ # #### Kube-Proxy Unit Tests
397410| Test Description | Expected Result |
398411| :--- | :--- |
399412| Feature On, hints matching zone | Endpoints filtered |
400413| Feature On, ExternalTrafficPolicy == 'Local', hints matching zone | Endpoints not filtered |
401414| Feature Off, hints matching zone | Endpoints not filtered |
402415| Feature On, no hints matching zone | Endpoints not filtered |
403416
404- # ## e2e Tests
405- This represents the largest and most uncertain part of the testing effort. We
406- need to find a way to run e2e tests on multizone clusters. To limit flakiness,
407- those clusters likely need to have a consistent distribution of nodes across
408- zones. This will enable us to write predictable tests for topology aware
409- routing.
417+ # ### Integration tests
418+
419+ N/A
410420
411- At a minimum, we likely want the following test :
421+ # ### e2e tests
412422
413- - 3 zone cluster, with 1 equivalent node per zone
414- - Deploy a single pod to each node with a daemonset
415- - Create a Service that targets that daemonset
416- - Make requests from each zone and ensure that the request is routed to a pod in
417- the same zone
423+ This feature has e2e test coverage with the ["Topology Hints"
424+ test](https://github.com/kubernetes/kubernetes/blob/fbb6ccc0c62d2431dc3a18a4130d7fbae5c05acd/test/e2e/network/topology_hints.go#L43).
425+ This is currently limited to a periodic run due to the nature of requiring a
426+ multizone cluster to run. It has been [remarkably
427+ stable](https://testgrid.k8s.io/sig-network-kind#sig-network-kind,%20multizone)
428+ with 100% green runs.
418429
419- We'll likely need more tests to properly vet this feature, but this one should
420- be straightforward to write and unlikely to be flaky .
430+ As a prerequisite for GA, we will ensure that this test runs as a presubmit
431+ if any code changes in kube-proxy or the EndpointSlice controller .
421432
422433# ## Observability
423434We can reuse some of the metrics of EndpointSlice Controller that we already
@@ -454,13 +465,74 @@ EndpointSliceSyncs = metrics.NewCounterVec(
454465
455466` ` `
456467
468+ # ## Events
469+ A common point of frustration among initial users of this feature was how
470+ difficult it was to tell if this feature was enabled and working as intended.
471+ Due to the nature of this design, even when a user opts in to the `Auto` mode
472+ that is no guarantee that the controller logic will determine that there are
473+ a sufficient number of endpoints to allocate them proportionally to each zone
474+ in the cluster.
475+
476+ To make this feature easier to understand and use, the EndpointSlice controller
477+ will publish events for a Service to describe if the feature has been enabled,
478+ and if not, why not.
479+
480+ # ### Logic
481+
482+ The EndpointSlice controller will track the known state of this feature for
483+ each Service. When that state or the reason for it changes, the EndpointSlice
484+ controller will publish a new Event to reflect the updated status of this
485+ feature.
486+
487+ # ### Sample Events
488+
489+ | Type | Reason | Message |
490+ |-|-|-|
491+ | Normal | TopologyAwareRoutingEnabled | Topology Aware Routing has been enabled |
492+ | Normal | TopologyAwareRoutingDisabled | Topology Aware Routing configuration was removed |
493+ | Warning | TopologyAwareRoutingDisabled | Insufficient number of Endpoints (n), impossible to safely allocate proportionally |
494+ | Warning | TopologyAwareRoutingDisabled | 1 or more Endpoints do not have a Zone specified |
495+ | Warning | TopologyAwareRoutingDisabled | 1 or more Nodes do not have allocatable CPU specified |
496+ | Warning | TopologyAwareRoutingDisabled | Nodes only ready in 1 zone |
497+
498+ # ### Documentation
499+
500+ The Topology Aware Hints documentation will be updated to describe the reason
501+ each of these events may have been triggered, along with steps that can be taken
502+ to recover from that state.
503+
504+ # ### Limitations
505+
506+ Although the events described above should dramatically simplify the use of this
507+ feature, there is a tiny edge case that will not be covered. If any
508+ EndpointSlices for a Service do not include Hints, Kube-Proxy will not implement
509+ this feature. This would happen if a user created custom EndpointSlices _and_
510+ enabled Topology Aware Hints _and_ failed to set Hints on their custom
511+ EndpointSlices. This seems very unlikely, but is mentioned here for the sake of
512+ completeness.
513+
457514# ## Graduation Criteria
458515**Alpha:**
459516- Basic functionality covered with unit tests described above.
460517
461518**Beta:**
462519- Tests expanded to include e2e coverage described above.
463520
521+ **GA:**
522+ - Feedback from real world usage shows that feature is working as intended
523+ - Events are triggered on each Service to provide users with clear information
524+ on when the feature transitioned between enabled and disabled states.
525+ - Test coverage in EndpointSlice strategy to ensure that the Hints field is
526+ dropped when the feature gate is not enabled.
527+ - Test coverage in EndpointSlice controller for the transition from enabled to
528+ disabled.
529+ - Ensure that existing Topology Hints e2e test runs as a presubmit if any code
530+ changes in kube-proxy or the EndpointSlice controller.
531+ - Topology Hints e2e tests will graduate to conformance tests.
532+ - Autoscaling and Scheduling SIGs have a plan to provide zone aware autoscaling
533+ (and scheduling) that allows users to proportionally distribute endpoints
534+ across zones.
535+
464536# ## Version Skew Strategy
465537This KEP requires updates to both the EndpointSlice Controller and kube-proxy.
466538Thus there could be two potential version skew scenarios :
@@ -499,8 +571,13 @@ enabled even if the annotation has been set on the Service.
499571 EndpointSlices for Services that have this feature enabled.
500572
501573* **Are there any tests for feature enablement/disablement?**
502- This feature is not yet implemented but per-Service enablement/disablement is
503- covered in depth as part of the test plan.
574+ Per Service enablement and disablement is covered in depth by unit tests. As a
575+ prerequisite for graduation to GA, we will also add the following :
576+
577+ - Test coverage in EndpointSlice strategy to ensure that the Hints field is
578+ dropped when the feature gate is not enabled.
579+ - Test coverage in EndpointSlice controller for the transition from enabled to
580+ disabled.
504581
505582# ## Rollout, Upgrade and Rollback Planning
506583
@@ -517,8 +594,10 @@ enabled even if the annotation has been set on the Service.
517594 with before the feature was enabled.
518595
519596* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
520- This feature is not yet implemented but per-Service enablement/disablement is
521- covered in depth as part of the test plan.
597+ Per-Service enablement/disablement is covered in depth and feature gate
598+ enablement and disablement will be covered before the feature graduates to GA.
599+ In addition, manual testing covering combinations of
600+ upgrade->downgrade->upgrade cycles will be completed prior to GA graduation.
522601
523602* **Is the rollout accompanied by any deprecations and/or removals of features,
524603 APIs, fields of API types, flags, etc.?**
@@ -567,6 +646,10 @@ enabled even if the annotation has been set on the Service.
567646 additional calls to the EndpointSlice API, but expect the increase to be
568647 minimal.
569648
649+ The EndpointSlice controller will begin publishing Events for each Service
650+ that has opted in to this feature when this transitions between enablement
651+ states.
652+
570653* **Will enabling / using this feature result in introducing new API types?**
571654 No.
572655
@@ -612,6 +695,13 @@ enabled even if the annotation has been set on the Service.
612695
613696- KEP Merged : February 2021
614697- Alpha release : Kubernetes 1.21
698+ - Beta Release : Kubernetes 1.23[^1]
699+ - Feature Gate on-by default, feature available by default : 1.24
700+
701+ [^1] : This was intended to also flip the feature gate to enabled by default, but
702+ unfortunately that part was missed in 1.23. See
703+ [#108747](https://github.com/kubernetes/kubernetes/pull/108747) for more
704+ information.
615705
616706# # Drawbacks
6177071. Increased complexity in EndpointSlice controller
0 commit comments