
Conversation

@caesarxuchao
Contributor

@caesarxuchao caesarxuchao commented Oct 29, 2020

This PR enables the HTTP/2 connection health check for all clients that call SetTransportDefaults, which includes all clients created via k8s.io/client-go. Without the health check, a broken transport-layer connection can stay in the HTTP/2 connection pool for a long time, causing requests to the same host to fail. For example, a broken TCP connection can linger for 15 minutes before being closed.

The HTTP/2 health check exposes two parameters (see the sketch after this list):

  • ReadIdleTimeout: if the HTTP/2 client has not received any frame on the connection for this amount of time, it starts sending periodic pings to the server. The ping period is also ReadIdleTimeout.
  • PingTimeout: if the client does not receive an ACK to a ping within this amount of time, the connection is considered broken and the client closes it.
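
For illustration, here is a minimal sketch (assuming golang.org/x/net/http2; the exact wiring in SetTransportDefaults may differ) of how these two parameters map onto the HTTP/2 transport:

```go
package example

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newTransportWithHealthCheck returns an HTTP/1.1 transport upgraded to
// HTTP/2 with the connection health check enabled.
func newTransportWithHealthCheck() (*http.Transport, error) {
	t := &http.Transport{}
	// ConfigureTransports enables HTTP/2 on t and returns the underlying
	// http2.Transport so the health-check knobs can be set on it.
	t2, err := http2.ConfigureTransports(t)
	if err != nil {
		return nil, err
	}
	t2.ReadIdleTimeout = 30 * time.Second // start pinging after 30s with no frame read
	t2.PingTimeout = 15 * time.Second     // close the connection if a ping ACK takes longer than 15s
	return t, nil
}
```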

For Kubernetes, I default the ReadIdleTimeout to 30s, which I don't expect to cause performance issues even in large clusters, because

  • the pings are carried over existing, persistent connections,
  • the pings are handled at the HTTP/2 layer and never surface to the Kubernetes layer.

I'll wait for results from pull-kubernetes-e2e-gce-large-performance to confirm.

I default the PingTimeout to 15s, which should be lenient enough for most network conditions.

Although I think most users won't need to tune these two parameters, I added environment variables so that users can tune them, or even disable the health check entirely by setting HTTP2_READ_IDLE_TIMEOUT_SECONDS=0.
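
A rough sketch of how such an override could be read from the environment (the helper name and logging below are illustrative, not necessarily the exact code in this PR):

```go
package example

import (
	"log"
	"os"
	"strconv"
	"time"
)

// readIdleTimeoutSeconds returns the ReadIdleTimeout to use: 30s by default,
// overridable (including 0 to disable the health check) via the
// HTTP2_READ_IDLE_TIMEOUT_SECONDS environment variable.
func readIdleTimeoutSeconds() time.Duration {
	seconds := 30
	if s := os.Getenv("HTTP2_READ_IDLE_TIMEOUT_SECONDS"); s != "" {
		if v, err := strconv.Atoi(s); err == nil {
			seconds = v
		} else {
			log.Printf("ignoring invalid HTTP2_READ_IDLE_TIMEOUT_SECONDS=%q, keeping default %ds", s, seconds)
		}
	}
	return time.Duration(seconds) * time.Second
}
```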

Fixes https://github.com/kubernetes/client-go/issues/374 and
#87615.

Ref. #95898
#94844

/kind bug
/sig api-machinery

HTTP/2 connection health check is enabled by default in all Kubernetes clients. The feature should work out-of-the-box. If needed, users can tune the feature via the HTTP2_READ_IDLE_TIMEOUT_SECONDS and HTTP2_PING_TIMEOUT_SECONDS environment variables. The feature is disabled if HTTP2_READ_IDLE_TIMEOUT_SECONDS is set to 0.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/apiserver area/cloudprovider area/code-generation area/dependency Issues or PRs related to dependency changes area/kubectl sig/cli Categorizes an issue or PR as relevant to SIG CLI. labels Oct 29, 2020
@k8s-ci-robot k8s-ci-robot requested review from a team, andrewsykim and andyzhangx October 29, 2020 08:15
@k8s-ci-robot k8s-ci-robot added sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Oct 29, 2020
@caesarxuchao
Contributor Author

caesarxuchao commented Oct 29, 2020

I disagree with setting ReadIdleTimeout to 1/2 * IdleTimeout as proposed in #95898. A connection is considered "idle" if no stream is using it (see here), whereas it is considered "readIdle" if no frame has been read from it (see here). The two concepts are independent of each other, so we shouldn't derive one value from the other.

Update: I think the IdleTimeout in HTTP/1 behaves more like what's described in #95898: a connection is closed if there is no traffic for the IdleTimeout duration.
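
To make the distinction concrete, here is an illustrative sketch (assuming golang.org/x/net/http2, with made-up values): IdleConnTimeout lives on the HTTP/1 transport and closes connections that have no active streams, while ReadIdleTimeout lives on the HTTP/2 transport and only controls when health-check pings start:

```go
package example

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func configureTimeouts() (*http.Transport, error) {
	t1 := &http.Transport{
		// "idle": no stream (in-flight request) is using the connection.
		// Connections that stay idle this long are closed.
		IdleConnTimeout: 90 * time.Second,
	}
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		return nil, err
	}
	// "readIdle": no frame has been read on the connection, even though
	// streams may still be open. After this long the client starts sending
	// health-check pings rather than closing the connection outright.
	t2.ReadIdleTimeout = 30 * time.Second
	return t1, nil
}
```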

@caesarxuchao
Contributor Author

@cheftako, this health check is very similar to the gRPC health check we want to add to the apiserver-network-proxy to solve kubernetes-sigs/apiserver-network-proxy#152. The performance implications are also similar.

@caesarxuchao
Contributor Author

@cheftako
I'm not too worried about the performance cost of sending the periodic pings, because

  • the pings are carried over existing, persistent connections
  • the pings are handled at the HTTP/2 layer and never surface to the Kubernetes layer
  • a ping is only sent if there has been no traffic on a connection for the ReadIdleTimeout (30s) duration.

It would be more convincing if we could test the behavior in a large cluster. @kubernetes/sig-scalability, is there an easy way to test this PR in a large cluster?

@JensErat do you mind sharing the size of the clusters where you turned on the HTTP/2 health check feature? Thanks.

@BenTheElder
Member

/test
[intentionally triggering list of available tests]

@k8s-ci-robot
Contributor

@BenTheElder: The /test command needs one or more targets.
The following commands are available to trigger jobs:

  • /test pull-kubernetes-bazel-build
  • /test pull-kubernetes-bazel-test
  • /test pull-kubernetes-conformance-image-test
  • /test pull-kubernetes-conformance-kind-ipv6-parallel
  • /test pull-kubernetes-dependencies
  • /test pull-kubernetes-dependencies-canary
  • /test pull-kubernetes-e2e-ipvs-azure-dualstack
  • /test pull-kubernetes-e2e-iptables-azure-dualstack
  • /test pull-kubernetes-e2e-aws-eks-1-13-correctness
  • /test pull-kubernetes-files-remake
  • /test pull-kubernetes-e2e-gce
  • /test pull-kubernetes-e2e-gce-no-stage
  • /test pull-kubernetes-e2e-gce-kubetest2
  • /test pull-kubernetes-e2e-gce-canary
  • /test pull-kubernetes-e2e-gce-ubuntu
  • /test pull-kubernetes-e2e-gce-ubuntu-containerd
  • /test pull-kubernetes-e2e-gce-ubuntu-containerd-canary
  • /test pull-kubernetes-e2e-gce-rbe
  • /test pull-kubernetes-e2e-gce-alpha-features
  • /test pull-kubernetes-e2e-gce-device-plugin-gpu
  • /test pull-kubernetes-integration
  • /test pull-kubernetes-cross
  • /test pull-kubernetes-e2e-kind
  • /test pull-kubernetes-e2e-kind-canary
  • /test pull-kubernetes-e2e-kind-ipv6
  • /test pull-kubernetes-e2e-kind-ipv6-canary
  • /test pull-kubernetes-conformance-kind-ga-only
  • /test pull-kubernetes-conformance-kind-ga-only-parallel
  • /test pull-kubernetes-e2e-kops-aws
  • /test pull-kubernetes-bazel-build-canary
  • /test pull-kubernetes-bazel-test-canary
  • /test pull-kubernetes-bazel-test-integration-canary
  • /test pull-kubernetes-local-e2e
  • /test pull-publishing-bot-validate
  • /test pull-kubernetes-e2e-gce-network-proxy-http-connect
  • /test pull-kubernetes-e2e-gce-network-proxy-grpc
  • /test pull-kubernetes-e2e-gci-gce-autoscaling
  • /test pull-kubernetes-e2e-aks-engine-azure
  • /test pull-kubernetes-e2e-azure-disk
  • /test pull-kubernetes-e2e-azure-disk-vmss
  • /test pull-kubernetes-e2e-azure-file
  • /test pull-kubernetes-e2e-kind-dual-canary
  • /test pull-kubernetes-e2e-kind-ipvs-dual-canary
  • /test pull-kubernetes-e2e-gci-gce-ipvs
  • /test pull-kubernetes-node-e2e
  • /test pull-kubernetes-e2e-containerd-gce
  • /test pull-kubernetes-node-e2e-containerd
  • /test pull-kubernetes-node-e2e-alpha
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager
  • /test pull-kubernetes-node-kubelet-serial-topology-manager
  • /test pull-kubernetes-node-kubelet-serial-hugepages
  • /test pull-kubernetes-node-crio-cgrpv2-e2e
  • /test pull-kubernetes-node-crio-e2e
  • /test pull-kubernetes-e2e-gce-100-performance
  • /test pull-kubernetes-e2e-gce-big-performance
  • /test pull-kubernetes-e2e-gce-correctness
  • /test pull-kubernetes-e2e-gce-large-performance
  • /test pull-kubernetes-kubemark-e2e-gce-big
  • /test pull-kubernetes-kubemark-e2e-gce-scale
  • /test pull-kubernetes-e2e-gce-storage-slow
  • /test pull-kubernetes-e2e-gce-storage-snapshot
  • /test pull-kubernetes-e2e-gce-storage-slow-rbe
  • /test pull-kubernetes-e2e-gce-csi-serial
  • /test pull-kubernetes-e2e-gce-iscsi
  • /test pull-kubernetes-e2e-gce-iscsi-serial
  • /test pull-kubernetes-e2e-gce-storage-disruptive
  • /test pull-kubernetes-e2e-aks-engine-azure-windows
  • /test pull-kubernetes-e2e-azure-disk-windows
  • /test pull-kubernetes-e2e-azure-file-windows
  • /test pull-kubernetes-typecheck
  • /test pull-kubernetes-verify
  • /test pull-kubernetes-e2e-windows-gce

Use /test all to run the following jobs:

  • pull-kubernetes-bazel-build
  • pull-kubernetes-bazel-test
  • pull-kubernetes-dependencies
  • pull-kubernetes-e2e-gce-ubuntu-containerd
  • pull-kubernetes-integration
  • pull-kubernetes-e2e-kind
  • pull-kubernetes-e2e-kind-ipv6
  • pull-kubernetes-conformance-kind-ga-only-parallel
  • pull-kubernetes-node-e2e
  • pull-kubernetes-e2e-gce-100-performance
  • pull-kubernetes-e2e-azure-disk-windows
  • pull-kubernetes-e2e-azure-file-windows
  • pull-kubernetes-typecheck
  • pull-kubernetes-verify

In response to this:

/test
[intentionally triggering list of available tests]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@BenTheElder
Member

I think https://github.com/kubernetes/test-infra/blob/b40bcac24d0ec96cc7d440ca5ece690cce7a62fe/config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml#L67 is what you want; there are some optional scale presubmits.

On second thought though, even if that were the right one, we should check with someone from the scale team before using these resources 🙃

@caesarxuchao
Contributor Author

Thanks, @BenTheElder! The pull-kubernetes-e2e-gce-large-performance job uses 2000 nodes, so that seems to be the right test. I'll check with SIG Scalability.

@jkaniuk
Contributor

jkaniuk commented Oct 29, 2020

Seems relatively harmless from a scalability point of view.

I'm fine with running the 2k-node presubmit to verify, though our approval is not needed:
/test pull-kubernetes-e2e-gce-large-performance
(fingers crossed, it's been a while since I've run this)

@tosi3k - for visibility as next oncaller

@liggitt
Member

liggitt commented Mar 18, 2021

Backport to 1.18 is riskier, as it is using v0.0.0-20191004110552-13f9640d40b9, not sure if a backport will introduce incompatible changes.

I think we should backport to 1.18, given:

  • how unpredictable the stuck connection issue is
  • how many components it affects
  • how problematic a stuck connection is for informer clients
  • that we've successfully used these sys and net levels in 1.19 for O(months) now
  • that other consumers have successfully backported these changes to 1.18 and used them in production

I opened #100376 to pick this to 1.18

@deads2k
Contributor

deads2k commented Mar 24, 2021

Given the consequences when this fails, I'm in favor of backporting this change if we can get an ack (ideally from @caesarxuchao) that the backport is fairly safe. My knowledge of the golang stack at this depth is fairly limited.

Is there a particular minimum golang level that is required or was it just the golang.org/x/net level?

@liggitt
Member

liggitt commented Mar 24, 2021

Given the consequences when this fails, I'm in favor of backporting this change if we can get an ack (ideally from @caesarxuchao) that the backport is fairly safe.

@caesarxuchao acked the backport PR (#100376)

Is there a particular minimum golang level that is required or was it just the golang.org/x/net level?

It was contained to golang.org/x/{net,sys}

k8s-ci-robot added a commit that referenced this pull request Mar 25, 2021
Cherrypick #95981 to 1.18, Enables HTTP/2 health check
christarazi added a commit to christarazi/cilium that referenced this pull request Mar 31, 2022
Following up from cilium#19259, this
commit implements the same logic as kubelet [1] to close all idle
connections. We can rely on HTTP/2 to determine which connections are
idle (HTTP/2 pings), instead of closing all connections indiscriminately
(when the heartbeat fails) via the custom dialer in setDialer().

We can assume HTTP/2 support as it has been supported since K8s 1.20
[2]. Users can disable HTTP/2 by setting the DISABLE_HTTP2 environment
variable; however, this is only really meant for very advanced users,
hence it is not documented. Kubernetes has been running with HTTP/2 as
the default for a long time (~since Dec 2020) without major issues.

CloseIdleConnectionsFor() was introduced in K8s 1.23 [3], hence this commit
cannot be backported to any earlier version of Cilium which doesn't
support K8s 1.23.

[1]: kubernetes/kubernetes#104844
[2]: kubernetes/kubernetes#95981
[3]: kubernetes/kubernetes@b9d865a

Suggested-by: André Martins <[email protected]>
Signed-off-by: Chris Tarazi <[email protected]>
pchaigno pushed a commit to cilium/cilium that referenced this pull request Apr 4, 2022
pchaigno pushed a commit to pchaigno/cilium that referenced this pull request Apr 5, 2022
pchaigno pushed a commit to cilium/cilium that referenced this pull request Apr 6, 2022
serathius pushed a commit to serathius/kubernetes that referenced this pull request Mar 14, 2024
The env vars were implemented in kubernetes#95981.
Effectively no impacts if using the default values here.

BUG=255296578

Change-Id: I2441ef1a181a0fb202af7893c25cff5ff15ef17a
hoskeri pushed a commit to hoskeri/kubernetes that referenced this pull request Jul 23, 2024
The env vars were implemented in kubernetes#95981.
Effectively no impacts if using the default values here.

BUG=255296578

Change-Id: I2441ef1a181a0fb202af7893c25cff5ff15ef17a

Remove GODEBUG=x509sha1=1 workaround

Bug: 227456358a

Revert changes made in I5165e6c2fe73e8e1b2a617ced591133228b6d275

Change-Id: I7dd1ca8f9796c542c89074703e44e6b6f4b6edab
hoskeri pushed a commit to hoskeri/kubernetes that referenced this pull request Jul 23, 2024
aojea pushed a commit to aojea/kubernetes that referenced this pull request Jul 24, 2025
kube-apiserver: Change kube-apiserver liveness probe config setup
Prior OSS attempt: kubernetes#94076
Bug: b/164955956
Change-Id: Ia2f75ac5e97f4eb6aa4dc9b00fc2bb3112db9360

kube-apiserver: Set profiling false for kube-apiserver
Change-Id: I052a53d72acfc3c5aa8d03f695beb93cadc0be87

kube-apiserver: allow adding memory limit to kube-apiserver via the KUBE_APISERVER_MEMORY_LIMIT environment variable
bug: 181991030
Change-Id: I1a01766e6f2415eba31d6d865f7f39d7855f969e

kube-apiserver: Drop removed insecure --address and --insecure-port flags
GKE config script version of kubernetes#106859
Bug: 209962139
Change-Id: I551cfc12fd94a927383fcff89013a3a78fb61592

kube-apiserver: Readd KUBE_APISERVER_EVENT_TTL_SEC env for kube-apiserver --event-ttl
Bug: b/176806841
Change-Id: I5c651a555287af0e891e9592dfb7c172b6bb24a2

kube-apiserver: Re-enable the deprecated behavior of CommonName on X.509 as a diagnoser to detect the fleet level impact on the deprecation
Change-Id: If4962b6ecb3294878bfa37ab0ebcd4441fae7140

kube-apiserver: Update permissions of /etc/srv/kubernetes/abac-authz-policy.jsonl if kube-apiserver is running as non-root.
BUG=199669294
Change-Id: Ic9ae9add0a35b7df99781e470c917b2427bee289

kube-apiserver: Remove reference to deleted target-ram-mb flag
Change-Id: I57d0c80eb2285ec9d93b075b1845f1805e0fd748

kube-apiserver: Re-enable SHA1 signatures with GODEBUG=x509sha1=1
Bug: b/227456358
Bug: b/226424430
Change-Id: I5165e6c2fe73e8e1b2a617ced591133228b6d275

kube-apiserver: use --api-audiences as --service-account-api-audiences is deprecated

Copy of kubernetes#103078

Change-Id: I88c2f2eb8bde4378b115e01cbbe9700c27f03955

Expose UDS with profiling data for kube-apiserver.

Added in kubernetes#114191.

See go/no-gke-tcp-pprof for policy.

Refs b/273485199

Change-Id: Ic772c9249468fdf7f516d62bd2a0bfbfe933c1bb

Allow specifying terminationGracePeriodSeconds for kube-apiserver.

Refs b/252987333

Change-Id: I20699501f429630fe74531cb086091eb9ed3611c

Run kube-apiserver with cloud-provider=external

Bug: 299159412
Change-Id: I8db8e72377f1c63a3874ee9f0567be264c40ba58

Allow specifying timeout seconds for HTTP/2 health check

The env vars were implemented in kubernetes#95981.
Effectively no impacts if using the default values here.

BUG=255296578

Change-Id: I2441ef1a181a0fb202af7893c25cff5ff15ef17a

Remove GODEBUG=x509sha1=1 workaround

Bug: 227456358a

Revert changes made in I5165e6c2fe73e8e1b2a617ced591133228b6d275

Change-Id: I7dd1ca8f9796c542c89074703e44e6b6f4b6edab

Use structured authorization configuration

Bug: b/333034514
Bug: b/172290849
Bug: b/175273696
Bug: b/332956373
Change-Id: If16e5dd6f9a4da93fb8623e27a9b9fda959eea83

gke/cluster: make kube-apiserver explicitly set enable-logs-handler

Preserves existing implicit behavior with 1.31 default changing in kubernetes#125787

Change-Id: I6380b46cef32c7aba4ea7c0ea7e014e18c8f09d0