Conversation

@andrewsykim (Member)

Signed-off-by: Andrew Sy Kim [email protected]

  • One-line PR description:

Updates the ServiceInternalTrafficPolicy feature to GA in v1.26. The feature has been in Beta since v1.22 and was blocked on a metric that was added in v1.24.
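
For reference, a minimal Service manifest using the field (a sketch; the name and ports are illustrative, not taken from this PR):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: node-local-svc        # hypothetical name
spec:
  selector:
    app: node-local-svc
  ports:
  - port: 80
    targetPort: 8080
  # Local: route in-cluster traffic only to endpoints on the client's node;
  # if the node has no ready endpoint, the traffic is dropped.
  internalTrafficPolicy: Local
```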

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Sep 13, 2022
@k8s-ci-robot added the kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory.), sig/network (Categorizes an issue or PR as relevant to SIG Network.), and size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.) labels on Sep 13, 2022
@andrewsykim (Member Author)

/assign @thockin @danwinship @wojtek-t

@danwinship (Contributor)

I am still really really dubious about internal traffic policy.

The primary user story (DNS) is IMNSHO completely wrong and ought to be removed from the KEP, because nobody actually wants "DNS requests should either stay local or else fail". They want "DNS requests should stay local when possible, but should still work no matter what". And internal traffic policy can't do that. (Node-level topology can.)

The second user story is... weird. Is anybody actually using internal traffic policy that way?

Do we have any evidence of anyone using internal traffic policy in the real world in a way other than to make their DNS fast-but-fragile?

@andrewsykim (Member Author)

> The primary user story (DNS) is IMNSHO completely wrong and ought to be removed from the KEP, because nobody actually wants "DNS requests should either stay local or else fail". They want "DNS requests should stay local when possible, but should still work no matter what". And internal traffic policy can't do that. (#3293 can.)

The "PreferLocal" case was one of the main motivations for the field in the first place. I don't feel too strongly about how we achieve this (either internalTrafficPolicy: PreferLocal or topology based), but I agree it's something that should be addressed. I missed your KEP and will review it soon.

> The second user story is... weird. Is anybody actually using internal traffic policy that way?
>
> Do we have any evidence of anyone using internal traffic policy in the real world in a way other than to make their DNS fast-but-fragile?

The use-cases around "Local" are still valid, I think. The way I've been thinking about it is that in many cases where a Pod would need to use the downward API to get its host's IP, you can do this more elegantly with a static cluster IP that deterministically routes to the local host from any Pod. Pushing logs/metrics to a local daemon is just one example from the 2nd user story, but you can imagine other use-cases for this.
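
A hedged sketch of that pattern (the name, IP, and port are hypothetical, not from the KEP): a DaemonSet-backed Service pinned to a well-known cluster IP, which always resolves to the agent on the client's own node:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: log-agent            # hypothetical Service backed by a DaemonSet
spec:
  clusterIP: 10.96.0.50      # static, well-known IP chosen from the service CIDR
  internalTrafficPolicy: Local
  selector:
    app: log-agent
  ports:
  - port: 9880               # hypothetical log-forwarding port
```

Any Pod can then push to log-agent (or 10.96.0.50) and deterministically reach the agent on its own node, rather than having `status.hostIP` injected via the downward API into every client.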

Kubernetes also supports pod affinity rules that allow pods to be co-scheduled on the same node so it's not a stretch to think someone would also want those pods to strictly talk to each other locally via a Service. But even for this case, you could achieve that with node-level topology. At the time we went with a distinct field because the assumption was that topology was going to be strictly zone/regional (hence the addition of nodeName in v1). TBH I forget why we decided topology was zone/region only, maybe it was too expensive to generate hints at the granularity of nodes?

Anyways, I'll let others chime in on use-cases to avoid confirmation bias here :)

@avoltz commented Sep 22, 2022

> Pushing logs/metrics to a local daemon is just one example from the 2nd user story but you can imagine other use-cases for this.

I agree with this use case mentioned by @andrewsykim. This local-only routing has been useful in my development. PreferLocal sounds like a useful option, but I would rather leave that up to the user to choose and try to guide via documentation.

In my use case, the traffic is not relevant off the node.

@k8s-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Sep 22, 2022
@k8s-ci-robot removed the needs-rebase label on Sep 22, 2022
@danwinship (Contributor) commented Sep 26, 2022

> Pushing logs/metrics to a local daemon is just one example from the 2nd user story but you can imagine other use-cases for this.

The multi-cluster folks were talking about services where the same service IP resolves to different things in different clusters, and that sounds like pretty much the same thing. (But didn't we dislike that idea?) But then that makes me wonder if this should exist at the zone level too. (It's not quite the same as Topology Aware Hints because there the assumption is that the service is identical everywhere, not different in each zone.)

At any rate, this is very very different from what externalTrafficPolicy: Local means, so even if we want this feature, maybe it shouldn't be called internalTrafficPolicy: Local.

(Another example of how it is weird: it seems like any Service that has this "different instance on each node" behavior would have to be type: ClusterIP... because what would it mean to have a type: LoadBalancer, internalTrafficPolicy: Local service where the endpoints on different nodes did different things, but you can't control which endpoint you hit? So maybe it should really just be a new service type?)

> But even for this case, you could achieve that with node-level topology

Well, no, the node-level topology proposal (like the existing zone-level topology) has the rule that if you can't hit a node-level endpoint, you fall back to hitting a remote endpoint (because that's what you want for, eg, the DNS case). So it won't work for something where it's mandatory that you talk to a local endpoint.

@danwinship (Contributor)

> PreferLocal sounds like a useful option, but I would rather leave that up to the user to choose and try to guide via documentation.

Note that the features are not equivalent: internalTrafficPolicy: Local says "traffic should be sent to a local endpoint, and if there are no local endpoints then the traffic should be dropped". Node-local topology (previously proposed as internalTrafficPolicy: PreferLocal) says "traffic should be sent to a local endpoint, and if there are no local endpoints then the traffic should be sent to a remote endpoint".

> I agree with this use case mentioned by @andrewsykim. This local-only routing has been useful in my development.

Could you describe in more detail what you are doing?

Especially, I am concerned about the "if there are no local endpoints then the traffic should be dropped" part. It seems to me that this means internalTrafficPolicy: Local would only be useful if one of:

  • The service has at least 2 endpoints per node, so that you can roll out updates to it without causing an outage, OR
  • The cluster is basically "static", and the service will never be updated; if it was necessary to change something about the service, you would just deploy a new cluster instead, OR
  • The service is non-critical and so you don't care if it is sometimes unavailable to the clients.

@k8s-ci-robot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) and removed the size/S label on Sep 27, 2022
@andrewsykim force-pushed the kep-2086 branch 3 times, most recently from 8af5fc0 to 2bd8de4, on September 27, 2022 at 15:06
@avoltz commented Sep 27, 2022

> Note that the features are not equivalent: internalTrafficPolicy: Local says "traffic should be sent to a local endpoint, and if there are no local endpoints then the traffic should be dropped". Node-local topology (previously proposed as internalTrafficPolicy: PreferLocal) says "traffic should be sent to a local endpoint, and if there are no local endpoints then the traffic should be sent to a remote endpoint".

That makes sense. I think both behaviors can be useful.

> Could you describe in more detail what you are doing?

> Especially, I am concerned about the "if there are no local endpoints then the traffic should be dropped" part. It seems to me that this means internalTrafficPolicy: Local would only be useful if one of:
>
>   • The service is non-critical and so you don't care if it is sometimes unavailable to the clients.

I think this is closest to my case: the client will tolerate the outage on the node, but the state in the service currently requires the local connection.

I understand it's a less common mode, but I've found this beta feature useful as-is. The docs seemed clear about this availability effect.

@andrewsykim (Member Author)

I removed the use-case around DNS per the SIG Network call earlier today. I also updated the Test Plan section to use the new format, with links to existing tests, and added steps for (manual) rollback testing.

@danwinship @avoltz PTAL.

@avoltz commented Sep 30, 2022

Looks good @andrewsykim, thanks.

There are some integration tests related to ITP in test/integration/service/service_test.go.

Should these be mentioned in the integration tests section?

ExternalNameServiceStopsDefaultingInternalTrafficPolicy
ExternalNameServiceDropsInternalTrafficPolicy
ConvertingToExternalNameServiceDropsInternalTrafficPolicy

@andrewsykim (Member Author)

Good catch! Updated the test plan to include those tests.

@thockin (Member) commented Oct 3, 2022

> The service has at least 2 endpoints per node, so that you can roll out updates to it without causing an outage, OR

DaemonSet does support MaxSurge and MaxUnavailable, so you can effect this (unless there's some corner case I am missing?)
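
For illustration, a sketch of that update strategy (the names and image are placeholders): with maxSurge, the replacement Pod becomes ready before the old one is removed, so a node never loses its only local endpoint mid-rollout.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent              # hypothetical
spec:
  selector:
    matchLabels:
      app: node-agent
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1               # start the new Pod on each node first...
      maxUnavailable: 0         # ...and only then stop the old one
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
      - name: agent
        image: registry.example/agent:1.0   # placeholder image
```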

> The service is non-critical and so you don't care if it is sometimes unavailable to the clients.

For many applications, retry-until-it-works is OK.

> so even if we want this feature, maybe it shouldn't be called internalTrafficPolicy: Local

I am not in love with the name, for sure, but given the analogy of eTP it seems better than picking a whole new name, no? If there were no nodes, how would we describe this? Would we bend it to mean "same zone" or just make it not work?

I'm LGTM on this change, but I hate to feel like a steamroller - Dan do you want to debate more?

@andrewsykim (Member Author)

> I am not in love with the name, for sure, but given the analogy of eTP it seems better than picking a whole new name, no? If there were no nodes, how would we describe this? Would we bend it to mean "same zone" or just make it not work?

Agreed with this sentiment around naming, but I am also happy to debate better names further. Now is the last chance to change it :)

@wojtek-t (Member) left a comment

@andrewsykim - just some minor comments from me - other than that it LGTM

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**

No, but this will be manually tested prior to beta. Automated testing will be done if the test tooling is available.
#### API Enablement Rollback
@wojtek-t (Member)

Sorry, I can't comment on unchanged lines, so commenting here:

L248: Have those metrics been added? Can you please update this answer?
[I think it's sync_proxy_rules_no_local_endpoints_total - if so, please just update the answer]

@andrewsykim (Member Author)

Thanks, updated


* Check Service to see if `internalTrafficPolicy` is set to `Local`.
* A per-node "blackhole" metric will be added to kube-proxy which represent Services that are being intentionally dropped (internalTrafficPolicy=Local and no endpoints). The metric will be named `kubeproxy/sync_proxy_rules/blackhole_total` (subject to rename).
* A per-node "blackhole" metric will be added to kube-proxy which represent Services that are being intentionally dropped (internalTrafficPolicy=Local and no endpoints). The metric will be named `kubeproxy/sync_proxy_rules_no_endpoints_total` (subject to rename).
@wojtek-t (Member)

I think it's sync_proxy_rules_no_local_endpoints_total instead?
If so, can you please update (and remove the "subject to rename" note :) )

@wojtek-t (Member)

Also L570: I think this metric was already added, right?

@andrewsykim (Member Author)

Thanks, updated

@danwinship (Contributor)

> DaemonSet does support MaxSurge and MaxUnavailable, so you can effect this (unless there's some corner case I am missing?)

Ah, I hadn't seen MaxSurge before. (MaxUnavailable doesn't help AFAICT.)

> I am not in love with the name, for sure, but given the analogy of eTP it seems better than picking a whole new name, no?

No, it's a false analogy. That's why I don't like the name. The two features have the same name because they're the same to us (the people hacking on kube-proxy), but they are completely different and unrelated from the perspective of actual users of the feature. (eTP:Local = "preserve client IP", iTP:Local = "only send to local endpoints"). Of course, the problem is that "Local" is actually a good description of iTP:Local, and it's eTP:Local that has the bad name...

> I'm LGTM on this change, but I hate to feel like a steamroller - Dan do you want to debate more?

No, my primary objection was claiming that DNS was a good use case for the feature, and the KEP no longer claims that. So, OK.

@thockin (Member) commented Oct 5, 2022

> No, it's a false analogy. That's why I don't like the name. The two features have the same name because they're the same to us (the people hacking on kube-proxy), but they are completely different and unrelated from the perspective of actual users of the feature. (eTP:Local = "preserve client IP", iTP:Local = "only send to local endpoints"). Of course, the problem is that "Local" is actually a good description of iTP:Local, and it's eTP:Local that has the bad name...

A different read: "Local" means "avoid extra network hops". eTP:Local doesn't make sense at all for LBs that don't bounce through nodes (direct to pod), so yeah - the error lies there, I think :)

@thockin (Member) commented Oct 5, 2022

Thanks!

/lgtm
/approve

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Oct 5, 2022
@wojtek-t (Member) commented Oct 5, 2022

/label tide/merge-method-squash

/lgtm
/approve PRR

Thanks!

@k8s-ci-robot added the tide/merge-method-squash label (Denotes a PR that should be squashed by tide when it merges.) on Oct 5, 2022
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Oct 5, 2022
@k8s-ci-robot merged commit a8590f3 into kubernetes:master on Oct 5, 2022
@k8s-ci-robot added this to the v1.26 milestone on Oct 5, 2022
ahmedtd pushed a commit to ahmedtd/enhancements that referenced this pull request Feb 2, 2023
…etes#3509)

* KEP-2086: promote ServiceInternalTrafficPolicy to GA in v1.26

Signed-off-by: Andrew Sy Kim <[email protected]>

* KEP-2086: document manual validation for rollback

Signed-off-by: Andrew Sy Kim <[email protected]>

* KEP-2086: use new test plan format, including links to tests

Signed-off-by: Andrew Sy Kim <[email protected]>

* KEP-2086: remove use-case about DNS optimization

Signed-off-by: Andrew Sy Kim <[email protected]>

* KEP-2086: fix metric name for sync_proxy_rules_no_local_endpoints_total

Signed-off-by: Andrew Sy Kim <[email protected]>

Signed-off-by: Andrew Sy Kim <[email protected]>