Long-polling /poll hangs ~5 min on IP change, causes metrics gap #205

@SalahDevp

What’s happening
We run a PushProx client on a host behind a NAT whose public IP changes about twice a day. When the IP changes, the long-polling /poll request to the proxy stays open for almost 5 minutes before it finally errors out. During that time Prometheus receives no metrics, which leaves a 5 min gap and triggers the “down” alerts.

Root cause
The client’s transport sets a 30 s TCP keepalive, but Linux by default sends 9 probes (each 30 s apart) before declaring the socket dead. The ninth probe goes out 30 s + (8 × 30 s) = 4 m 30 s after the connection breaks, and the socket is only closed once that probe has also gone unanswered, which matches the almost 5 minutes we observe.
client/main.go

transport := &http.Transport{
    Proxy: http.ProxyFromEnvironment,
    DialContext: (&net.Dialer{
        Timeout:   30 * time.Second,
        KeepAlive: 30 * time.Second,
        DualStack: true,
    }).DialContext,
    MaxIdleConns:          100,
    IdleConnTimeout:       90 * time.Second,
    TLSHandshakeTimeout:   10 * time.Second,
    ExpectContinueTimeout: 1 * time.Second,
    TLSClientConfig:       tlsConfig,
}
