What’s happening
We run a PushProx client on a host behind a NAT whose public IP changes about twice a day. When the IP changes, the long-polling /poll request to the proxy stays open for almost 5 minutes before it finally errors out. During that time Prometheus receives no metrics, which leaves a gap of about 5 minutes and triggers the “down” alerts.
Root cause
The client’s transport sets a TCP keepalive period of 30 s, but Linux by default sends 9 probes (each 30 s apart) before declaring the socket dead. That’s 30 s + (8 × 30 s) = 4 m 30 s from the moment the connection breaks until the socket closes.
client/main.go
    transport := &http.Transport{
        Proxy: http.ProxyFromEnvironment,
        DialContext: (&net.Dialer{
            Timeout:   30 * time.Second,
            KeepAlive: 30 * time.Second,
            DualStack: true,
        }).DialContext,
        MaxIdleConns:          100,
        IdleConnTimeout:       90 * time.Second,
        TLSHandshakeTimeout:   10 * time.Second,
        ExpectContinueTimeout: 1 * time.Second,
        TLSClientConfig:       tlsConfig,
    }