Conversation

@cstockton cstockton commented Sep 22, 2025

Summary

Introduce a context-aware DB dial path, a new ConnPercentage knob to cap Auth's share of Postgres connections, and background wiring to apply pool changes on config reloads.

Storage / DB

  • Add DialContext(ctx, *conf.GlobalConfiguration) and keep Dial(...) as a thin wrapper (see the sketch after this list). serve now passes its cancelable context so startup can't hang indefinitely.
  • Connection now keeps a handle to the underlying *sql.DB (via popConnToStd) when available.
  • New helpers:
    • newConnectionDetails and applyDBDriver to build pop.ConnectionDetails and derive the driver when omitted.
    • Connection.Copy() to retain the sqldb reference; updated the locations that copy (WithContext, Transaction).
  • Runtime tuning API: (*Connection).ApplyConfig(ctx, cfg, le) computes and applies connection limits to the underlying *sql.DB.
    • Fixed limits come from MaxPoolSize, MaxIdlePoolSize, ConnMaxLifetime, ConnMaxIdleTime.
    • If ConnPercentage is set (1-100), compute limits from SHOW max_connections, prefer percentage over fixed pool sizes, and set idle = open.
    • Retains previous behavior when ConnPercentage is 0.
    • No-op (and error) if *sql.DB is unavailable.
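
A rough sketch of the Dial/DialContext relationship described above (the signatures are taken from this list; the body is illustrative only):

```go
// Sketch only: Dial keeps its old signature and delegates to DialContext,
// so existing callers are unchanged while serve can pass its own cancelable context.
func Dial(config *conf.GlobalConfiguration) (*Connection, error) {
	return DialContext(context.Background(), config)
}
```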

API worker

  • apiworker.New now accepts the DB connection.
  • Split worker into three goroutines (via errgroup; see the sketch after this list):
    • configNotifier fans out reload signals,
    • templateWorker refreshes template cache,
    • dbWorker applies DB connection limits on boot and each reload.
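
A minimal sketch of that layout, assuming an errgroup-driven Work method (the Worker type and the Work name are illustrative; configNotifier, templateWorker, and dbWorker are the goroutines listed above):

```go
func (w *Worker) Work(ctx context.Context) error {
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { return w.configNotifier(ctx) }) // fan out reload signals
	g.Go(func() error { return w.templateWorker(ctx) }) // refresh the template cache
	g.Go(func() error { return w.dbWorker(ctx) })       // apply DB limits on boot and on each reload
	return g.Wait()
}
```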

Serve

  • Use storage.DialContext(ctx, cfg) and then db = db.WithContext(ctx) so the DB handle participates in request/trace context and shutdown.
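
A minimal sketch of that wiring inside serve (variable names and error handling are illustrative):

```go
db, err := storage.DialContext(ctx, globalConfig)
if err != nil {
	logrus.WithError(err).Fatal("error opening database")
}
// Re-bind the handle to the server context so queries, traces and shutdown
// all share the same cancelable context.
db = db.WithContext(ctx)
```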

Observability

  • Add observability.NewLogEntry(*logrus.Entry) to construct chi middleware log entries.
  • Structured logs around applying DB limits.

Configuration knobs (GOTRUE_DB_*)

  • GOTRUE_DB_CONN_PERCENTAGE (int, clamped to [0,100]):
    • 0 (default) disables percentage-based sizing.
    • 1-100 reserves that % of max_connections for the Auth server.

Tests

  • internal/storage/dial_test.go
    • DialContext happy path and invalid driver/URL error path.
    • Reflection bridge to *sql.DB (popConnToStd) including WithContext-wrapped connection behavior.
    • ApplyConfig end-to-end: verify pool sizing and stats reflect limits.
    • Percentage math and precedence vs fixed pools across edge cases.
  • internal/conf/configuration_test.go
    • Validation clamps ConnPercentage to [0,100].

How it works

In short: if GOTRUE_DB_CONN_PERCENTAGE=0, we use the fixed GOTRUE_DB_{MAX,CONN}_* limits. If it's in the range [1, 100], we set percentage-based limits derived from SHOW max_connections and ignore the fixed pool sizes.
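
For example, with Postgres reporting max_connections = 100 and GOTRUE_DB_CONN_PERCENTAGE=30, the derived limits are MaxOpenConns = int(max(1, (30/100)*100)) = 30 and MaxIdleConns = 30, and the fixed GOTRUE_DB_MAX_POOL_SIZE / GOTRUE_DB_MAX_IDLE_POOL_SIZE values are ignored.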

Deep Dive

The startup sequence remains the same; trying to set it before we returned from DialContext was a bit messy (chicken / egg: you need a conn to set up a conn). I also didn't want to delay startup during failure scenarios (DB unavailable, DB blocking, etc.).

So after DialContext we have a connection which is configured initially with only the existing DB settings:

GOTRUE_DB_MAX_POOL_SIZE="50"
GOTRUE_DB_MAX_IDLE_POOL_SIZE="10"
GOTRUE_DB_CONN_MAX_IDLE_TIME="60s"
GOTRUE_DB_CONN_MAX_LIFETIME="0"

Next, the server starts the apiworker, which immediately creates a new dbWorker goroutine while the rest of the startup sequence continues. Before entering the config update loop, the dbWorker calls the newly added ApplyConfig(...) method on the *storage.Connection.

The ApplyConfig method is where the logic for deriving the best values to pass to the sql.DB.Set*(...) methods lives:

  • SetConnMaxIdleTime(d time.Duration): https://pkg.go.dev/database/sql#DB.SetConnMaxIdleTime
  • SetConnMaxLifetime(d time.Duration): https://pkg.go.dev/database/sql#DB.SetConnMaxLifetime
  • SetMaxIdleConns(n int): https://pkg.go.dev/database/sql#DB.SetMaxIdleConns
  • SetMaxOpenConns(n int): https://pkg.go.dev/database/sql#DB.SetMaxOpenConns
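
The application step itself is a direct mapping onto those setters. A minimal sketch, assuming a helper along these lines (the helper name and the duration fields on ConnLimits are assumptions; MaxOpenConns and MaxIdleConns are named in step 8 below):

```go
// Hypothetical helper: map the derived ConnLimits onto the database/sql setters.
func applyConnLimits(sqldb *sql.DB, cl ConnLimits) {
	sqldb.SetMaxOpenConns(cl.MaxOpenConns)       // hard cap on open connections
	sqldb.SetMaxIdleConns(cl.MaxIdleConns)       // idle pool (== open when percentage-based)
	sqldb.SetConnMaxLifetime(cl.ConnMaxLifetime) // from GOTRUE_DB_CONN_MAX_LIFETIME
	sqldb.SetConnMaxIdleTime(cl.ConnMaxIdleTime) // from GOTRUE_DB_CONN_MAX_IDLE_TIME
}
```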

Right now ApplyConfig works like this (a consolidated sketch follows this list):

  1. Checks that we were able to reflect a *sql.DB during DialContext; if not, we do nothing since we can't call sql.DB.Set*(...).

    • If we can't access *sql.DB or fetch max_connections, we leave the prior limits untouched and log a warning. It's always all-or-nothing; there is no partial application of limits.
  2. Calls the new getConnLimits method.

  3. getConnLimits calls newConnLimitsFromConfig, which returns a ConnLimits set up with the GOTRUE_DB_{MAX,CONN}_* settings.

  4. Checks whether GOTRUE_DB_CONN_PERCENTAGE is zero; if so, it returns the GOTRUE_DB_{MAX,CONN}_* limits from newConnLimitsFromConfig.

    • This means the limits are set exactly as they are today.
  5. The percentage config is non-zero, so we call showMaxConns, which simply returns an integer from "SHOW max_connections;". In my testing this value always seems to be available to the auth server:

    • This value cannot change without a Postgres restart.
    • Postgres will not start if it is 0.
    • Being in recovery mode still shows the maximum connections.
  6. As long as showMaxConns does not return an error, we attempt to apply percentage-based connection limits in applyPercentageLimits.

  7. If max conns is <= 0, we return an error, which prevents any config changes from being applied and leaves the connection in its prior state.

    • max_connections > 0 is guaranteed if Postgres is running; this is a defensive check to avoid clamping max conns to 1 when the reported value is 0.
  8. We perform a simple bounds check and then set the MaxOpenConns and MaxIdleConns to the values derived from the ConnPercentage and maxConns.

    • Note that we preserve the existing behavior of IdleConns == MaxConns. I believe the aim is to minimize connection churn (latency) at the cost of more Postgres slots when idle. It's worth revisiting this in the future, with something simple like (open/2) or more advanced heuristics using sql.DBStats.
    pct := float64(dbCfg.ConnPercentage)
    cl.MaxOpenConns = int(max(1, (pct/100)*float64(maxConns)))
    cl.MaxIdleConns = cl.MaxOpenConns
  9. The values returned from getConnLimits are logged before being applied via the sql.DB.Set*(...) calls.
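
Putting steps 3-8 together, the derivation reads roughly like this. This is a consolidated sketch, not the actual implementation: only getConnLimits, newConnLimitsFromConfig, showMaxConns, ConnPercentage, MaxOpenConns, MaxIdleConns, and the percentage math are taken from this description; the parameter types, error text, and everything else are assumptions.

```go
func getConnLimits(ctx context.Context, conn *Connection, dbCfg *conf.DBConfiguration) (ConnLimits, error) {
	// Start from the fixed GOTRUE_DB_{MAX,CONN}_* settings (step 3).
	cl := newConnLimitsFromConfig(dbCfg)

	// Percentage sizing disabled: keep today's fixed limits (step 4).
	if dbCfg.ConnPercentage == 0 {
		return cl, nil
	}

	// Ask Postgres how many slots exist: "SHOW max_connections;" (step 5).
	maxConns, err := showMaxConns(ctx, conn)
	if err != nil {
		return cl, err // fail strictly; the caller keeps the last known good limits
	}
	if maxConns <= 0 {
		return cl, fmt.Errorf("storage: invalid max_connections %d", maxConns) // step 7
	}

	// Percentage wins over the fixed pool sizes; idle == open to minimize churn (step 8).
	pct := float64(dbCfg.ConnPercentage)
	cl.MaxOpenConns = int(max(1, (pct/100)*float64(maxConns)))
	cl.MaxIdleConns = cl.MaxOpenConns
	return cl, nil
}
```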

We fail strictly and quickly on derivation errors to keep the last known good settings. By supporting config reloads, my hope is that users under high load can rebalance this setting without taking down the auth server. This tight feedback loop should help rule out (or resolve) the auth server as a potential root cause of connection timeouts and similar downstream effects.

@stojan the apiworker approach also gives a good place for your stats tracking (#2167) to live, e.g. adding a simple ticker in the dbWorker to poll stats between config updates. This stats polling could feed additional heuristics for our connection limit tuning if we'd like to explore that in the future, for example using the mean connection time as an additional weight to further increase the pool size.
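
Illustratively, that could be a select loop in dbWorker with a ticker alongside the reload signal; everything here beyond dbWorker, ApplyConfig, and sql.DBStats is a hypothetical sketch:

```go
func (w *Worker) dbWorker(ctx context.Context) error {
	ticker := time.NewTicker(30 * time.Second) // hypothetical polling interval
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case cfg := <-w.reloadCh: // hypothetical fan-out channel fed by configNotifier
			if err := w.db.ApplyConfig(ctx, cfg, w.le); err != nil {
				w.le.WithError(err).Warn("failed to apply DB connection limits")
			}
		case <-ticker.C:
			// e.g. read pool stats via sql.DB.Stats() and export OpenConnections,
			// WaitDuration, etc. as metrics to feed future limit-tuning heuristics.
		}
	}
}
```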

Some notes:

I tested this extensively, but please give a thorough review; I made some judgment calls on non-happy paths. I'm also not sure how reliable the sqldb reference is, as the composition of the *pop.Store can change based on inputs, context, dialect, driver, etc. The entire feature will not work if I can't reflect out the sqldb.

@cstockton cstockton requested a review from a team as a code owner September 22, 2025 23:27
@hf hf left a comment

How does it work with the existing DB_MAX_POOL_SIZE setting?

coveralls commented Sep 23, 2025

Pull Request Test Coverage Report for Build 17983722060

Details

  • 168 of 237 (70.89%) changed or added relevant lines in 4 files are covered.
  • 5 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.1%) to 67.738%

Changes Missing Coverage:

| File | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| internal/observability/request-logger.go | 0 | 3 | 0.0% |
| internal/storage/dial.go | 167 | 177 | 94.35% |
| internal/api/apiworker/apiworker.go | 0 | 56 | 0.0% |

Files with Coverage Reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| internal/api/apiworker/apiworker.go | 1 | 0.0% |
| internal/storage/dial.go | 2 | 89.01% |
| internal/tokens/service.go | 2 | 73.47% |

Totals Coverage Status
Change from base Build 17980540058: 0.1%
Covered Lines: 13240
Relevant Lines: 19546

💛 - Coveralls

Include the configuration values, the server-reported max conns, and
the applied limits, with a `limit_strategy` field describing whether
`fixed` or `percentage` was used.
@cstockton cstockton force-pushed the cs/feat-percentage-based-db-conn-limits branch from 4b5bc8d to 21a9928 on September 23, 2025 17:21
@cstockton (Contributor, Author) commented:

@stojan updated the PR description to include how the limits work. I also just made a commit with better logging as well.

@cstockton cstockton merged commit 1731466 into master Sep 24, 2025
5 checks passed
@cstockton cstockton deleted the cs/feat-percentage-based-db-conn-limits branch September 24, 2025 17:33
issuedat pushed a commit that referenced this pull request Sep 30, 2025