-
Notifications
You must be signed in to change notification settings - Fork 549
feat: add advisor to notify you when to double the max connection pool #2167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Pull Request Test Coverage Report for Build 17978381415Details
💛 - Coveralls |
cstockton
approved these changes
Sep 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
cemalkilic
approved these changes
Sep 16, 2025
cstockton
added a commit
that referenced
this pull request
Sep 24, 2025
## Summary Introduce a context aware DB dial path, a new `ConnPercentage` knob to cap Auth's share of Postgres connections, and background wiring to apply pool changes on config reloads. **Storage / DB** - Add `DialContext(ctx, *conf.GlobalConfiguration)` and keep `Dial(...)` as a thin wrapper. `serve` now passes its cancelable context so startup can't hang indefinitely. - `Connection` now keeps a handle to the underlying `*sql.DB` (via `popConnToStd`) when available. - New helpers: - `newConnectionDetails` and `applyDBDriver` to build `pop.ConnectionDetails` and derive driver when omitted. - `Connection.Copy()` to retain `sqldb` reference and updated locations that copy (`WithContext, Transaction)`. - Runtime tuning API: `(*Connection).ApplyConfig(ctx, cfg, le)` computes and applies connection limits to the underlying `*sql.DB`. - Fixed limits come from `MaxPoolSize`, `MaxIdlePoolSize`, `ConnMaxLifetime`, `ConnMaxIdleTime`. - If `ConnPercentage` is set (1-100), compute limits from `SHOW max_connections`, prefer percentage over fixed pool sizes, and set idle = open. - Retains previous behavior when `ConnPercentage` is `0` - No-op (and error) if `*sql.DB` is unavailable. **API worker** - `apiworker.New` now accepts the DB connection. - Split worker into three goroutines (via `errgroup`): - `configNotifier` fans out reload signals, - `templateWorker` refreshes template cache, - `dbWorker` applies DB connection limits on boot and each reload. **Serve** - Use `storage.DialContext(ctx, cfg)` and then `db = db.WithContext(ctx)` so the DB handle participates in request/trace context and shutdown. **Observability** - Add `observability.NewLogEntry(*logrus.Entry)` to construct chi middleware log entries. - Structured logs around applying DB limits. **Configuration knobs** (`GOTRUE_DB_*`) - `GOTRUE_DB_CONN_PERCENTAGE` (int, clamped to `[0,100]`): - `0` (default) disables percentage-based sizing. - `1-100` reserves that % of `max_connections` for the Auth server. **Tests** - `internal/storage/dial_test.go` - `DialContext` happy path and invalid driver/URL error path. - Reflection bridge to `*sql.DB` (`popConnToStd`) including `WithContext`-wrapped connection behavior. - `ApplyConfig` end-to-end: verify pool sizing and stats reflect limits. - Percentage math and precedence vs fixed pools across edge cases. - `internal/conf/configuration_test.go` - Validation clamps `ConnPercentage` to `[0,100]`. ## How it works In short if `GOTRUE_DB_CONN_PERCENTAGE=0`, we use the fixed `GOTRUE_DB_{MAX,CONN}_*` limits. If it's in the range `[1, 100]` we set percentage based limits derived from `SHOW max_connections` and ignore the fixed pool sizes. ### Deep Dive The startup sequence remains the same, trying to set it _before_ we returned from `DialContext` was a bit messy (chicken / egg: need a conn to setup a conn). I also didn't want to delay startup time during failure scenarios (db is unavailable, db is blocking, etc). So after `DialContext` we have a connection which is configured initially with only the existing DB settings: ```bash GOTRUE_DB_MAX_POOL_SIZE="50" GOTRUE_DB_MAX_IDLE_POOL_SIZE="10" GOTRUE_DB_CONN_MAX_IDLE_TIME="60s" GOTRUE_DB_CONN_MAX_LIFETIME="0" ``` Next the server starts the `apiworker` which immediatelly creates a new [dbWorker](https://github.com/supabase/auth/pull/2177/files#diff-b20c1e9d1c21d077494cf5ff490de301a864d3d1812538cf594a687f620a7175R122) goroutine concurrently while the rest of the startup sequence continues. Before entering the config update loop the `dbWorker` will call the newly added [ApplyConfig(...)](https://github.com/supabase/auth/pull/2177/files#diff-5b7e4f0f03bfbc3a58168e58eb88386b9e683241c1ebcb57f6764c38308f2257R179) method on the `*storage.Connection`. The `ApplyConfig` method is where the logic for obtaining the best values to call the `sql.DB.Set*(...)` methods below lives: * [SetConnMaxIdleTime(d time.Duration)](https://pkg.go.dev/database/sql#DB.SetConnMaxIdleTime) * [SetConnMaxLifetime(d time.Duration)](https://pkg.go.dev/database/sql#DB.SetConnMaxLifetime) * [SetMaxIdleConns(n int)](https://pkg.go.dev/database/sql#DB.SetMaxIdleConns) * [SetMaxOpenConns(n int)](https://pkg.go.dev/database/sql#DB.SetMaxOpenConns) Right now [ApplyConfig](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L179) works like this: 1. Checks that we were able to [reflect](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L140) a `*sql.DB` during `DialContext`, if not we do nothing since we can't call `sql.DB.Set*(...)`. * If we can't access *sql.DB or fetch max_connections, we leave the prior limits untouched and log a warning. Always all-or-nothing, no partial application of limits. 2. Calls the new [getConnLimits](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L209) method. 3. `getConnLimits` calls [newConnLimitsFromConfig](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L287) which returns a `ConnLimits` setup with `GOTRUE_DB_{MAX,CONN}_*` settings. 4. Check if `GOTRUE_DB_CONN_PERCENTAGE` is zero, if so [it returns](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L218) the `GOTRUE_DB_{MAX,CONN}_*` from `newConnLimitsFromConfig`. * This means the limits are set exactly as they are today. 5. Percentage config is non-zero so we make a call to [showMaxConns](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L266) which just returns an integer from `"SHOW max_connections;"`. In my testing this value always seems to be available for the auth server: * This value cannot change without postgres restarts. * Postgres will not start if it is 0. * Being in recovery mode still shows the maximum connections. 6. As long as `showMaxConns` does not return an error we attempt to apply percentage based connection limits in [applyPercentageLimits](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L235). 7. If max conns is <= 0 we return [an error](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L245) which prevents any config changes from being applied. Leaving the connection in its prior state. * max_connections > 0 is guaranteed if postgres is running, this is a defensive check to prevent applying a clamp to 1 max conns on 0. 8. We perform a simple [bounds check](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L252) and then set the [`MaxOpenConns` and `MaxIdleConns`](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L257) to the values derived from the `ConnPercentage` and `maxConns`. * Note that we preserve the existing behavior of IdleConns == MaxConns. I believe the aim is to minimize connection churn (latency) at the cost of more Postgres slots when idle. It's worth thinking about making this a bit more considerate in the future, something simple like (open/2) or more advanced heuristics using [sql.DBStats](https://pkg.go.dev/database/sql#DBStats). ```Go pct := float64(dbCfg.ConnPercentage) cl.MaxOpenConns = int(max(1, (pct/100)*float64(maxConns))) cl.MaxIdleConns = cl.MaxOpenConns ``` 9. The values set from the call to `getConnLimits` are logged before being [applied via the `sql.DB.Set*(...)`](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L202) calls. We fail strictly and quickly on derivation errors to keep the last known good settings. By supporting config reloading my hope is that when under high load users may balance this setting without taking down the auth server. This tight feedback loop should help rule out (or resolve) the auth server as a potential root cause to connection timeouts and similar downstream effects. @stojan `apiworker` approach also gives a good place for your [stats tracking](#2167) to live, adding a simple ticker in the `dbWorker` to poll stats between config updates. This stats polling could be used to form additional heuristics in our connect limit tuning if we would like to explore that in the future. For example use the mean connection time as an additional weight to further increase the pool size. ## Some notes: I tested this extensively but please give a thorough review, I made some judgement calls on non-happy paths. I'm also not sure how reliable the sqldb reference is as it seems the composition of the *pop.Store can change based on inputs, context, dialect, driver, etc. The entire feature will not work if I can't reflect out the sqldb. --------- Co-authored-by: Chris Stockton <[email protected]>
issuedat
pushed a commit
that referenced
this pull request
Sep 30, 2025
#2167) Adds a simple advisor that emits a log message every hour, if the number of [DBStats](https://pkg.go.dev/database/sql#DBStats) samples exceeds: - 1/3 or more samples indicate more than 1ms wait for every millisecond in the sampling interval - 1/3 or more samples indicate 2 or more goroutines have been waiting to acquire a connection from the pool For instance, if you set `GOTRUE_DB_ADVISOR_SAMPLING_INTERVAL="100ms"` and `GOTRUE_DB_ADVISOR_OBSERVATION_INTERVAL="1s"` then: - There are 1000 / 100 = 10 samples - Every 100ms a sample is taken and the conditions are checked - Log message is emitted if either one of the conditions are met at most once per hour
issuedat
pushed a commit
that referenced
this pull request
Sep 30, 2025
## Summary Introduce a context aware DB dial path, a new `ConnPercentage` knob to cap Auth's share of Postgres connections, and background wiring to apply pool changes on config reloads. **Storage / DB** - Add `DialContext(ctx, *conf.GlobalConfiguration)` and keep `Dial(...)` as a thin wrapper. `serve` now passes its cancelable context so startup can't hang indefinitely. - `Connection` now keeps a handle to the underlying `*sql.DB` (via `popConnToStd`) when available. - New helpers: - `newConnectionDetails` and `applyDBDriver` to build `pop.ConnectionDetails` and derive driver when omitted. - `Connection.Copy()` to retain `sqldb` reference and updated locations that copy (`WithContext, Transaction)`. - Runtime tuning API: `(*Connection).ApplyConfig(ctx, cfg, le)` computes and applies connection limits to the underlying `*sql.DB`. - Fixed limits come from `MaxPoolSize`, `MaxIdlePoolSize`, `ConnMaxLifetime`, `ConnMaxIdleTime`. - If `ConnPercentage` is set (1-100), compute limits from `SHOW max_connections`, prefer percentage over fixed pool sizes, and set idle = open. - Retains previous behavior when `ConnPercentage` is `0` - No-op (and error) if `*sql.DB` is unavailable. **API worker** - `apiworker.New` now accepts the DB connection. - Split worker into three goroutines (via `errgroup`): - `configNotifier` fans out reload signals, - `templateWorker` refreshes template cache, - `dbWorker` applies DB connection limits on boot and each reload. **Serve** - Use `storage.DialContext(ctx, cfg)` and then `db = db.WithContext(ctx)` so the DB handle participates in request/trace context and shutdown. **Observability** - Add `observability.NewLogEntry(*logrus.Entry)` to construct chi middleware log entries. - Structured logs around applying DB limits. **Configuration knobs** (`GOTRUE_DB_*`) - `GOTRUE_DB_CONN_PERCENTAGE` (int, clamped to `[0,100]`): - `0` (default) disables percentage-based sizing. - `1-100` reserves that % of `max_connections` for the Auth server. **Tests** - `internal/storage/dial_test.go` - `DialContext` happy path and invalid driver/URL error path. - Reflection bridge to `*sql.DB` (`popConnToStd`) including `WithContext`-wrapped connection behavior. - `ApplyConfig` end-to-end: verify pool sizing and stats reflect limits. - Percentage math and precedence vs fixed pools across edge cases. - `internal/conf/configuration_test.go` - Validation clamps `ConnPercentage` to `[0,100]`. ## How it works In short if `GOTRUE_DB_CONN_PERCENTAGE=0`, we use the fixed `GOTRUE_DB_{MAX,CONN}_*` limits. If it's in the range `[1, 100]` we set percentage based limits derived from `SHOW max_connections` and ignore the fixed pool sizes. ### Deep Dive The startup sequence remains the same, trying to set it _before_ we returned from `DialContext` was a bit messy (chicken / egg: need a conn to setup a conn). I also didn't want to delay startup time during failure scenarios (db is unavailable, db is blocking, etc). So after `DialContext` we have a connection which is configured initially with only the existing DB settings: ```bash GOTRUE_DB_MAX_POOL_SIZE="50" GOTRUE_DB_MAX_IDLE_POOL_SIZE="10" GOTRUE_DB_CONN_MAX_IDLE_TIME="60s" GOTRUE_DB_CONN_MAX_LIFETIME="0" ``` Next the server starts the `apiworker` which immediatelly creates a new [dbWorker](https://github.com/supabase/auth/pull/2177/files#diff-b20c1e9d1c21d077494cf5ff490de301a864d3d1812538cf594a687f620a7175R122) goroutine concurrently while the rest of the startup sequence continues. Before entering the config update loop the `dbWorker` will call the newly added [ApplyConfig(...)](https://github.com/supabase/auth/pull/2177/files#diff-5b7e4f0f03bfbc3a58168e58eb88386b9e683241c1ebcb57f6764c38308f2257R179) method on the `*storage.Connection`. The `ApplyConfig` method is where the logic for obtaining the best values to call the `sql.DB.Set*(...)` methods below lives: * [SetConnMaxIdleTime(d time.Duration)](https://pkg.go.dev/database/sql#DB.SetConnMaxIdleTime) * [SetConnMaxLifetime(d time.Duration)](https://pkg.go.dev/database/sql#DB.SetConnMaxLifetime) * [SetMaxIdleConns(n int)](https://pkg.go.dev/database/sql#DB.SetMaxIdleConns) * [SetMaxOpenConns(n int)](https://pkg.go.dev/database/sql#DB.SetMaxOpenConns) Right now [ApplyConfig](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L179) works like this: 1. Checks that we were able to [reflect](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L140) a `*sql.DB` during `DialContext`, if not we do nothing since we can't call `sql.DB.Set*(...)`. * If we can't access *sql.DB or fetch max_connections, we leave the prior limits untouched and log a warning. Always all-or-nothing, no partial application of limits. 2. Calls the new [getConnLimits](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L209) method. 3. `getConnLimits` calls [newConnLimitsFromConfig](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L287) which returns a `ConnLimits` setup with `GOTRUE_DB_{MAX,CONN}_*` settings. 4. Check if `GOTRUE_DB_CONN_PERCENTAGE` is zero, if so [it returns](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L218) the `GOTRUE_DB_{MAX,CONN}_*` from `newConnLimitsFromConfig`. * This means the limits are set exactly as they are today. 5. Percentage config is non-zero so we make a call to [showMaxConns](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L266) which just returns an integer from `"SHOW max_connections;"`. In my testing this value always seems to be available for the auth server: * This value cannot change without postgres restarts. * Postgres will not start if it is 0. * Being in recovery mode still shows the maximum connections. 6. As long as `showMaxConns` does not return an error we attempt to apply percentage based connection limits in [applyPercentageLimits](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L235). 7. If max conns is <= 0 we return [an error](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L245) which prevents any config changes from being applied. Leaving the connection in its prior state. * max_connections > 0 is guaranteed if postgres is running, this is a defensive check to prevent applying a clamp to 1 max conns on 0. 8. We perform a simple [bounds check](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L252) and then set the [`MaxOpenConns` and `MaxIdleConns`](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L257) to the values derived from the `ConnPercentage` and `maxConns`. * Note that we preserve the existing behavior of IdleConns == MaxConns. I believe the aim is to minimize connection churn (latency) at the cost of more Postgres slots when idle. It's worth thinking about making this a bit more considerate in the future, something simple like (open/2) or more advanced heuristics using [sql.DBStats](https://pkg.go.dev/database/sql#DBStats). ```Go pct := float64(dbCfg.ConnPercentage) cl.MaxOpenConns = int(max(1, (pct/100)*float64(maxConns))) cl.MaxIdleConns = cl.MaxOpenConns ``` 9. The values set from the call to `getConnLimits` are logged before being [applied via the `sql.DB.Set*(...)`](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L202) calls. We fail strictly and quickly on derivation errors to keep the last known good settings. By supporting config reloading my hope is that when under high load users may balance this setting without taking down the auth server. This tight feedback loop should help rule out (or resolve) the auth server as a potential root cause to connection timeouts and similar downstream effects. @stojan `apiworker` approach also gives a good place for your [stats tracking](#2167) to live, adding a simple ticker in the `dbWorker` to poll stats between config updates. This stats polling could be used to form additional heuristics in our connect limit tuning if we would like to explore that in the future. For example use the mean connection time as an additional weight to further increase the pool size. ## Some notes: I tested this extensively but please give a thorough review, I made some judgement calls on non-happy paths. I'm also not sure how reliable the sqldb reference is as it seems the composition of the *pop.Store can change based on inputs, context, dialect, driver, etc. The entire feature will not work if I can't reflect out the sqldb. --------- Co-authored-by: Chris Stockton <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Adds a simple advisor that emits a log message every hour, if the number of DBStats samples exceeds:
For instance, if you set
GOTRUE_DB_ADVISOR_SAMPLING_INTERVAL="100ms"
andGOTRUE_DB_ADVISOR_OBSERVATION_INTERVAL="1s"
then: