feat: support percentage based db limits with reload support #2177
Merged
Conversation
**Summary**

Introduce a context-aware DB dial path, a new `ConnPercentage` knob to cap Auth's share of Postgres connections, and background wiring to apply pool changes on config reloads.

**Storage / DB**

- Add `DialContext(ctx, *conf.GlobalConfiguration)` and keep `Dial(...)` as a thin wrapper. `serve` now passes its cancelable context so startup can't hang indefinitely.
- `Connection` now keeps a handle to the underlying `*sql.DB` (via `popConnToStd`) when available.
- New helpers:
  - `newConnectionDetails` and `applyDBDriver` to build `pop.ConnectionDetails` and derive the driver when omitted.
  - `Connection.Copy()` to retain the `sqldb` reference; updated the locations that copy (`WithContext`, `Transaction`).
- Runtime tuning API: `(*Connection).ApplyConfig(ctx, cfg, le)` computes and applies connection limits to the underlying `*sql.DB`.
  - Fixed limits come from `MaxPoolSize`, `MaxIdlePoolSize`, `ConnMaxLifetime`, `ConnMaxIdleTime`.
  - If `ConnPercentage` is set (1-100), compute limits from `SHOW max_connections`, prefer percentage over fixed pool sizes, and set idle = open.
  - Retains previous behavior when `ConnPercentage` is `0`.
  - No-op (and error) if `*sql.DB` is unavailable.

**API worker**

- `apiworker.New` now accepts the DB connection.
- Split the worker into three goroutines (via `errgroup`):
  - `configNotifier` fans out reload signals,
  - `templateWorker` refreshes the template cache,
  - `dbWorker` applies DB connection limits on boot and on each reload.

**Serve**

- Use `storage.DialContext(ctx, cfg)` and then `db = db.WithContext(ctx)` so the DB handle participates in request/trace context and shutdown.

**Observability**

- Add `observability.NewLogEntry(*logrus.Entry)` to construct chi middleware log entries.
- Structured logs around applying DB limits.

**Configuration knobs** (`GOTRUE_DB_*`)

- `GOTRUE_DB_CONN_PERCENTAGE` (int, clamped to `[0,100]`):
  - `0` (default) disables percentage-based sizing.
  - `1-100` reserves that % of `max_connections` for the Auth server.

**Tests**

- `internal/storage/dial_test.go`
  - `DialContext` happy path and invalid driver/URL error path.
  - Reflection bridge to `*sql.DB` (`popConnToStd`), including `WithContext`-wrapped connection behavior.
  - `ApplyConfig` end-to-end: verify pool sizing and stats reflect limits.
  - Percentage math and precedence vs. fixed pools across edge cases.
- `internal/conf/configuration_test.go`
  - Validation clamps `ConnPercentage` to `[0,100]`.
**hf** approved these changes on Sep 23, 2025
How does it work with the existing DB_MAX_POOL_SIZE setting?
Pull Request Test Coverage Report for Build 17983722060 (Coveralls)
Include the configuration values, the server-reported max conns, and the applied limits, with a `limit_strategy` field describing whether `fixed` or `percentage` was used.
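As a rough illustration of the structured fields this commit describes (the field names here are illustrative, not necessarily the exact keys used in the PR):

```go
package main

import "fmt"

// limitFields builds the structured log fields for an ApplyConfig run:
// the configured percentage, the server-reported max_connections, the
// applied limits, and a limit_strategy of "fixed" or "percentage".
func limitFields(connPercentage, serverMaxConns, appliedOpen, appliedIdle int) map[string]any {
	strategy := "fixed"
	if connPercentage > 0 {
		strategy = "percentage"
	}
	return map[string]any{
		"conn_percentage":  connPercentage,
		"server_max_conns": serverMaxConns,
		"max_open_conns":   appliedOpen,
		"max_idle_conns":   appliedIdle,
		"limit_strategy":   strategy,
	}
}

func main() {
	fmt.Println(limitFields(50, 100, 50, 50)["limit_strategy"]) // prints percentage
}
```

In the real code these fields would be attached via `logrus.Entry.WithFields` before the limits are applied.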
Force-pushed from 4b5bc8d to 21a9928
@stojan Updated the PR description to include how the limits work. I also just made a commit with better logging as well.
**issuedat** pushed a commit that referenced this pull request on Sep 30, 2025
## How it works

In short: if `GOTRUE_DB_CONN_PERCENTAGE=0`, we use the fixed `GOTRUE_DB_{MAX,CONN}_*` limits. If it's in the range `[1, 100]`, we set percentage-based limits derived from `SHOW max_connections` and ignore the fixed pool sizes.

### Deep Dive

The startup sequence remains the same. Trying to set the limits _before_ returning from `DialContext` was a bit messy (chicken and egg: you need a connection to set up a connection), and I also didn't want to delay startup during failure scenarios (db is unavailable, db is blocking, etc.).

So after `DialContext` we have a connection which is configured initially with only the existing DB settings:

```bash
GOTRUE_DB_MAX_POOL_SIZE="50"
GOTRUE_DB_MAX_IDLE_POOL_SIZE="10"
GOTRUE_DB_CONN_MAX_IDLE_TIME="60s"
GOTRUE_DB_CONN_MAX_LIFETIME="0"
```

Next the server starts the `apiworker`, which immediately creates a new [dbWorker](https://github.com/supabase/auth/pull/2177/files#diff-b20c1e9d1c21d077494cf5ff490de301a864d3d1812538cf594a687f620a7175R122) goroutine while the rest of the startup sequence continues. Before entering the config update loop, the `dbWorker` calls the newly added [ApplyConfig(...)](https://github.com/supabase/auth/pull/2177/files#diff-5b7e4f0f03bfbc3a58168e58eb88386b9e683241c1ebcb57f6764c38308f2257R179) method on the `*storage.Connection`.
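The fixed-vs-percentage precedence can be sketched as a small pure function; `connLimits` and `deriveLimits` are illustrative stand-ins, not the actual types in `internal/storage`:

```go
package main

import "fmt"

// connLimits is a simplified stand-in for the PR's ConnLimits.
type connLimits struct {
	MaxOpenConns int
	MaxIdleConns int
}

// deriveLimits applies the precedence described above: when
// connPercentage is 0 the fixed pool sizes win; when it is 1-100 the
// limits are derived from the server-reported max_connections (from
// SHOW max_connections) and idle is set equal to open.
func deriveLimits(connPercentage, maxPoolSize, maxIdlePoolSize, maxConns int) (connLimits, error) {
	if connPercentage == 0 {
		// Fixed limits: exactly the pre-existing behavior.
		return connLimits{MaxOpenConns: maxPoolSize, MaxIdleConns: maxIdlePoolSize}, nil
	}
	if maxConns <= 0 {
		// Defensive: never clamp to 1 open conn on a bogus max_connections.
		return connLimits{}, fmt.Errorf("invalid max_connections: %d", maxConns)
	}
	open := int((float64(connPercentage) / 100) * float64(maxConns))
	if open < 1 {
		open = 1 // always allow at least one connection
	}
	return connLimits{MaxOpenConns: open, MaxIdleConns: open}, nil
}

func main() {
	cl, _ := deriveLimits(50, 10, 5, 100)
	fmt.Println(cl.MaxOpenConns, cl.MaxIdleConns) // prints 50 50
}
```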
The `ApplyConfig` method is where the logic lives for obtaining the best values to pass to the `sql.DB.Set*(...)` methods below:

* [SetConnMaxIdleTime(d time.Duration)](https://pkg.go.dev/database/sql#DB.SetConnMaxIdleTime)
* [SetConnMaxLifetime(d time.Duration)](https://pkg.go.dev/database/sql#DB.SetConnMaxLifetime)
* [SetMaxIdleConns(n int)](https://pkg.go.dev/database/sql#DB.SetMaxIdleConns)
* [SetMaxOpenConns(n int)](https://pkg.go.dev/database/sql#DB.SetMaxOpenConns)

Right now [ApplyConfig](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L179) works like this:

1. Checks that we were able to [reflect](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L140) a `*sql.DB` during `DialContext`; if not, we do nothing, since we can't call `sql.DB.Set*(...)`.
   * If we can't access `*sql.DB` or fetch `max_connections`, we leave the prior limits untouched and log a warning. It is always all-or-nothing: no partial application of limits.
2. Calls the new [getConnLimits](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L209) method.
3. `getConnLimits` calls [newConnLimitsFromConfig](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L287), which returns a `ConnLimits` set up with the `GOTRUE_DB_{MAX,CONN}_*` settings.
4. Checks if `GOTRUE_DB_CONN_PERCENTAGE` is zero; if so, [it returns](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L218) the `GOTRUE_DB_{MAX,CONN}_*` values from `newConnLimitsFromConfig`.
   * This means the limits are set exactly as they are today.
5. If the percentage config is non-zero, we call [showMaxConns](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L266), which simply returns an integer from `"SHOW max_connections;"`. In my testing this value always seems to be available to the auth server:
   * The value cannot change without a Postgres restart.
   * Postgres will not start if it is 0.
   * Being in recovery mode still shows the maximum connections.
6. As long as `showMaxConns` does not return an error, we attempt to apply percentage-based connection limits in [applyPercentageLimits](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L235).
7. If max conns is <= 0, we return [an error](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L245), which prevents any config changes from being applied and leaves the connection in its prior state.
   * `max_connections > 0` is guaranteed if Postgres is running; this is a defensive check to prevent clamping to 1 max conn when it is 0.
8. We perform a simple [bounds check](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L252) and then set [`MaxOpenConns` and `MaxIdleConns`](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L257) to the values derived from `ConnPercentage` and `maxConns`.
   * Note that we preserve the existing behavior of `MaxIdleConns == MaxOpenConns`. I believe the aim is to minimize connection churn (latency) at the cost of holding more Postgres slots when idle. It is worth considering making this more nuanced in the future, with something simple like `open/2` or more advanced heuristics using [sql.DBStats](https://pkg.go.dev/database/sql#DBStats).
```go
pct := float64(dbCfg.ConnPercentage)
cl.MaxOpenConns = int(max(1, (pct/100)*float64(maxConns)))
cl.MaxIdleConns = cl.MaxOpenConns
```

9. The values returned by `getConnLimits` are logged before being [applied via the `sql.DB.Set*(...)`](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L202) calls.

We fail strictly and quickly on derivation errors to keep the last known good settings. By supporting config reloading, my hope is that users under high load can rebalance this setting without taking down the auth server. This tight feedback loop should help rule out (or resolve) the auth server as a potential root cause of connection timeouts and similar downstream effects.

@stojan The `apiworker` approach also gives a good place for your [stats tracking](#2167) to live, by adding a simple ticker in the `dbWorker` to poll stats between config updates. This stats polling could be used to form additional heuristics in our connection limit tuning if we would like to explore that in the future, for example using the mean connection time as an additional weight to further increase the pool size.

## Some notes

I tested this extensively, but please give it a thorough review; I made some judgement calls on non-happy paths. I'm also not sure how reliable the `sqldb` reference is, as it seems the composition of the `*pop.Store` can change based on inputs, context, dialect, driver, etc. The entire feature will not work if I can't reflect out the `sqldb`.

---------

Co-authored-by: Chris Stockton <[email protected]>