Max parallel connection limit 1024 on envoy proxy since diego 2.53.0 (HTTP/1.1, HTTP/2). #603

@renelehmann

Description

Summary:

Since diego 2.53.0, only a maximum of 1024 parallel connections through the envoy proxy (and therefore to the backend) is allowed.
No such limit was enforced before this diego release.

Expected Result

Diego, or the envoy proxy, should support several thousand parallel connections, well above 1024 (e.g. when using websocket connections).

Actual Result

We get a client error when using more than 1024 parallel connections:
ERROR: 502 Bad Gateway: Registered endpoint failed to handle the request.

Context

cf-deployment v16.25.0, diego v2.53.0, stemcell vsphere ubuntu-bionic 1.25, envoy proxy enabled.
With diego 2.52.0, the envoy proxy supports more than 1024 parallel connections to the container backend.

We hit the limit of 1024 connections when using diego 2.53.0 together with cf-deployment v16.25.0 (https://github.com/cloudfoundry/cf-deployment/releases/tag/v16.25.0).
Diego 2.53.0 was already introduced with cf-deployment v16.24.0 (which we never deployed/tested).
When using the older diego version 2.52.0 on the cell VMs, still based on cf-deployment v16.25.0, no limit of 1024 parallel connections is enforced and everything behaves as expected.

Since diego 2.53.0 contains a number of envoy version bumps (1.16.5 -> 1.19.1) and newly enables HTTP/2 by default, we suspected the envoy default property max_requests=1024 (which we believe plays a role when HTTP/2 is in use) and tried to work around it. max_connections, however, is set in code to roughly 4 billion.
Disabling the default HTTP/2 support on the envoy proxy, and additionally on the gorouter, did not help; we still hit the 1024 parallel connection limit.

Unfortunately we cannot easily adjust the envoy proxy configuration to experiment with higher values (via a curl POST or via a file on the cell VM).
It seems we would have to modify the diego codebase (executor/envoyproxy) itself to introduce such configuration changes and to get a better pointer to the problem.
It is therefore currently unclear to us whether an envoy proxy setting, another default-1024 setting somewhere else, or a combination of both causes this limitation.

We have verified that the values of the envoy proxy properties did not change recently.
Some property settings of the envoy proxy within the container:

curl http://127.0.0.1:61003/clusters
0-service-cluster::default_priority::max_connections::4294967295
0-service-cluster::default_priority::max_pending_requests::1024
0-service-cluster::default_priority::max_requests::1024
0-service-cluster::default_priority::max_retries::3

Setting for CircuitBreakers on max_connections in the executor sources (diego-release v2.53.0 references executor @ 1c8b2ff, see https://github.com/cloudfoundry/diego-release/tree/v2.53.0/src/code.cloudfoundry.org):
https://github.com/cloudfoundry/executor/blob/1c8b2ff139d98ec08f5d54e95bb3fddc7eafc9c2/depot/containerstore/proxy_config_handler.go#L334-L336

From the envoy documentation:
https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto.html

max_connections
(UInt32Value) The maximum number of connections that Envoy will make to the upstream cluster. If not specified, the default is 1024.

max_pending_requests
(UInt32Value) The maximum number of pending requests that Envoy will allow to the upstream cluster. If not specified, the default is 1024.

max_requests
(UInt32Value) The maximum number of parallel requests that Envoy will make to the upstream cluster. If not specified, the default is 1024.

Steps to Reproduce

Use a cloudfoundry foundation, e.g. on cf-deployment v16.25.0 (bundled with diego 2.53.0).
The issue occurs with HTTP/2 enabled (the default) as well as explicitly disabled on envoy and the gorouter.
We have seen this issue/limitation when using websocket connections, e.g. 4 app instances running on cloudfoundry with around 17'000 connections in total, which works fine with diego 2.52.0 but not with diego 2.53.0.
For this case, on diego 2.53.0, we had to work around the limit by scaling up the number of app instances (number of instances * 1024 connections).

Below is a test app in Go using the default HTTP/1.1 protocol, holding each connection open for 30 seconds.

mkdir hello_test
cd hello_test

#create two files go.mod and main.go
cat > go.mod <<HERE
module example.com/hello
go 1.13
HERE

cat > main.go <<HERE
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/", HelloServer)
	http.ListenAndServe(":8080", nil)
}

func HelloServer(w http.ResponseWriter, r *http.Request) {
	time.Sleep(30 * time.Second)
	fmt.Fprintf(w, "Hello, %s!", r.URL.Path[1:])
}
HERE


cf push helloapp

#Create more than 1024 parallel connections:
xargs -I % -P 1025 curl https://helloapp.DOMAIN < <(printf '%s\n' {1..1025})
#With diego 2.53.0 it reports an error after a few seconds (ERROR: 502 Bad Gateway: Registered endpoint failed to handle the request) and then prints "Hello" 1024 times after 30 seconds.
#A test case with the latest diego 2.53.1 showed the same 1024-limit behaviour.
#With diego 2.52.0 it prints "Hello" 1025 times after 30 seconds, without any error.
