@hf hf commented Sep 16, 2025

Adds a simple advisor that emits a log message (at most once per hour) when, over an observation interval, either:

- 1/3 or more of the samples indicate more than 1ms of wait for every millisecond in the sampling interval
- 1/3 or more of the samples indicate 2 or more goroutines have been waiting to acquire a connection from the pool

For instance, if you set `GOTRUE_DB_ADVISOR_SAMPLING_INTERVAL="100ms"` and `GOTRUE_DB_ADVISOR_OBSERVATION_INTERVAL="1s"` then:

- There are 1000ms / 100ms = 10 samples per observation interval
- Every 100ms a sample is taken and the conditions are checked
- A log message is emitted, at most once per hour, if either condition is met

@hf hf requested a review from a team as a code owner September 16, 2025 13:41

coveralls commented Sep 16, 2025

Pull Request Test Coverage Report for Build 17978381415

Details

  • 40 of 70 (57.14%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.03%) to 67.635%

| File (changes missing coverage) | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| internal/storage/advisor.go | 40 | 55 | 72.73% |
| internal/storage/dial.go | 0 | 15 | 0.0% |

Totals:
- Change from base Build 17977940517: -0.03%
- Covered Lines: 13084
- Relevant Lines: 19345

💛 - Coveralls

@cstockton cstockton left a comment


Looks good to me.

@hf hf merged commit a72f5d9 into master Sep 24, 2025
5 of 6 checks passed
@hf hf deleted the hf/db-advisor branch September 24, 2025 14:50
cstockton added a commit that referenced this pull request Sep 24, 2025
## Summary
Introduce a context-aware DB dial path, a new `ConnPercentage` knob to
cap Auth's share of Postgres connections, and background wiring to apply
pool changes on config reloads.

**Storage / DB**
- Add `DialContext(ctx, *conf.GlobalConfiguration)` and keep `Dial(...)`
as a thin wrapper. `serve` now passes its cancelable context so startup
can't hang indefinitely.
- `Connection` now keeps a handle to the underlying `*sql.DB` (via
`popConnToStd`) when available.
- New helpers:
- `newConnectionDetails` and `applyDBDriver` to build
`pop.ConnectionDetails` and derive driver when omitted.
- `Connection.Copy()` to retain the `sqldb` reference, and updated
locations that copy (`WithContext`, `Transaction`).
- Runtime tuning API: `(*Connection).ApplyConfig(ctx, cfg, le)` computes
and applies connection limits to the underlying `*sql.DB`.
- Fixed limits come from `MaxPoolSize`, `MaxIdlePoolSize`,
`ConnMaxLifetime`, `ConnMaxIdleTime`.
- If `ConnPercentage` is set (1-100), compute limits from `SHOW
max_connections`, prefer percentage over fixed pool sizes, and set idle
= open.
  - Retains previous behavior when `ConnPercentage` is `0`.
  - No-op (and error) if `*sql.DB` is unavailable.

**API worker**
- `apiworker.New` now accepts the DB connection.
- Split worker into three goroutines (via `errgroup`):
  - `configNotifier` fans out reload signals,
  - `templateWorker` refreshes template cache,
  - `dbWorker` applies DB connection limits on boot and each reload.

**Serve**
- Use `storage.DialContext(ctx, cfg)` and then `db =
db.WithContext(ctx)` so the DB handle participates in request/trace
context and shutdown.

**Observability**
- Add `observability.NewLogEntry(*logrus.Entry)` to construct chi
middleware log entries.
- Structured logs around applying DB limits.

**Configuration knobs** (`GOTRUE_DB_*`)
- `GOTRUE_DB_CONN_PERCENTAGE` (int, clamped to `[0,100]`):
  - `0` (default) disables percentage-based sizing.
  - `1-100` reserves that % of `max_connections` for the Auth server.
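
The clamping described above can be sketched as follows; `clampConnPercentage` is a hypothetical helper name, and the real validation lives in `internal/conf`:

```go
package main

// clampConnPercentage clamps out-of-range values into [0,100] rather
// than rejecting them, mirroring the documented validation behavior.
func clampConnPercentage(v int) int {
	if v < 0 {
		return 0
	}
	if v > 100 {
		return 100
	}
	return v
}
```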

**Tests**
- `internal/storage/dial_test.go`
  - `DialContext` happy path and invalid driver/URL error path.
- Reflection bridge to `*sql.DB` (`popConnToStd`) including
`WithContext`-wrapped connection behavior.
- `ApplyConfig` end-to-end: verify pool sizing and stats reflect limits.
  - Percentage math and precedence vs fixed pools across edge cases.
- `internal/conf/configuration_test.go`
  - Validation clamps `ConnPercentage` to `[0,100]`.

## How it works
In short: if `GOTRUE_DB_CONN_PERCENTAGE=0`, we use the fixed
`GOTRUE_DB_{MAX,CONN}_*` limits. If it's in the range `[1, 100]`, we set
percentage-based limits derived from `SHOW max_connections` and ignore
the fixed pool sizes.
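
That precedence rule can be sketched as follows; `deriveLimits` and this trimmed-down `ConnLimits` struct are hypothetical names for illustration, not the PR's actual code:

```go
package main

import "fmt"

// ConnLimits here holds only the two pool-size limits relevant to the
// precedence rule (hypothetical shape, for illustration).
type ConnLimits struct {
	MaxOpenConns int
	MaxIdleConns int
}

// deriveLimits applies the documented precedence: a percentage of 0
// keeps the fixed GOTRUE_DB_* pool sizes; 1-100 derives both limits
// from max_connections, sets idle = open, and floors the result at 1.
func deriveLimits(fixed ConnLimits, connPercentage, maxConns int) (ConnLimits, error) {
	if connPercentage == 0 {
		return fixed, nil
	}
	if maxConns <= 0 {
		return ConnLimits{}, fmt.Errorf("invalid max_connections: %d", maxConns)
	}
	open := int(float64(connPercentage) / 100 * float64(maxConns))
	if open < 1 {
		open = 1
	}
	return ConnLimits{MaxOpenConns: open, MaxIdleConns: open}, nil
}
```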

### Deep Dive
The startup sequence remains the same; trying to set limits _before_
returning from `DialContext` was a bit messy (chicken and egg: you need
a connection to set up a connection). I also didn't want to delay
startup in failure scenarios (DB unavailable, DB blocking, etc.).

So after `DialContext` we have a connection which is configured
initially with only the existing DB settings:
```bash
GOTRUE_DB_MAX_POOL_SIZE="50"
GOTRUE_DB_MAX_IDLE_POOL_SIZE="10"
GOTRUE_DB_CONN_MAX_IDLE_TIME="60s"
GOTRUE_DB_CONN_MAX_LIFETIME="0"
```

Next the server starts the `apiworker`, which immediately creates a new
[dbWorker](https://github.com/supabase/auth/pull/2177/files#diff-b20c1e9d1c21d077494cf5ff490de301a864d3d1812538cf594a687f620a7175R122)
goroutine while the rest of the startup sequence continues.
Before entering the config update loop the `dbWorker` will call the
newly added
[ApplyConfig(...)](https://github.com/supabase/auth/pull/2177/files#diff-5b7e4f0f03bfbc3a58168e58eb88386b9e683241c1ebcb57f6764c38308f2257R179)
method on the `*storage.Connection`.

The `ApplyConfig` method holds the logic for deriving the best values
to pass to the `sql.DB.Set*(...)` methods below:
* [SetConnMaxIdleTime(d
time.Duration)](https://pkg.go.dev/database/sql#DB.SetConnMaxIdleTime)
* [SetConnMaxLifetime(d
time.Duration)](https://pkg.go.dev/database/sql#DB.SetConnMaxLifetime)
* [SetMaxIdleConns(n
int)](https://pkg.go.dev/database/sql#DB.SetMaxIdleConns)
* [SetMaxOpenConns(n
int)](https://pkg.go.dev/database/sql#DB.SetMaxOpenConns)

Right now
[ApplyConfig](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L179)
works like this:
1. Checks that we were able to
[reflect](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L140)
a `*sql.DB` during `DialContext`; if not, we do nothing since we can't
call `sql.DB.Set*(...)`.
* If we can't access `*sql.DB` or fetch `max_connections`, we leave the
prior limits untouched and log a warning. It's always all-or-nothing: no
partial application of limits.
2. Calls the new
[getConnLimits](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L209)
method.
3. `getConnLimits` calls
[newConnLimitsFromConfig](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L287)
which returns a `ConnLimits` set up with the `GOTRUE_DB_{MAX,CONN}_*`
settings.
4. Checks if `GOTRUE_DB_CONN_PERCENTAGE` is zero; if so, [it
returns](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L218)
the `GOTRUE_DB_{MAX,CONN}_*` limits from `newConnLimitsFromConfig`.
    * This means the limits are set exactly as they are today.
5. If the percentage config is non-zero, we call
[showMaxConns](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L266)
which simply returns an integer from `"SHOW max_connections;"`. In my
testing this value always seems to be available to the auth server:
	* This value cannot change without a Postgres restart.
	* Postgres will not start if it is 0.
	* Being in recovery mode still shows the maximum connections.
6. As long as `showMaxConns` does not return an error, we attempt to
apply percentage-based connection limits in
[applyPercentageLimits](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L235).
7. If max conns is <= 0 we return [an
error](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L245)
which prevents any config changes from being applied, leaving the
connection in its prior state.
* `max_connections > 0` is guaranteed if Postgres is running; this is a
defensive check to avoid clamping max conns to 1 when the reported
value is 0.
8. We perform a simple [bounds
check](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L252)
and then set the [`MaxOpenConns` and
`MaxIdleConns`](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L257)
to the values derived from the `ConnPercentage` and `maxConns`.
* Note that we preserve the existing behavior of `MaxIdleConns ==
MaxOpenConns`. I believe the aim is to minimize connection churn
(latency) at the cost of more Postgres slots when idle. It's worth
making this more adaptive in the future, with something simple like
(open/2) or more advanced heuristics using
[sql.DBStats](https://pkg.go.dev/database/sql#DBStats).

	```go
	pct := float64(dbCfg.ConnPercentage)
	cl.MaxOpenConns = int(max(1, (pct/100)*float64(maxConns)))
	cl.MaxIdleConns = cl.MaxOpenConns
	```

9. The values set from the call to `getConnLimits` are logged before
being [applied via the
`sql.DB.Set*(...)`](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L202)
calls.

We fail strictly and quickly on derivation errors to keep the last known
good settings. By supporting config reloads, my hope is that users under
high load can adjust this setting without taking down the auth server.
This tight feedback loop should help rule out (or resolve) the auth
server as a potential root cause of connection timeouts and similar
downstream effects.

@stojan the `apiworker` approach also gives your [stats
tracking](#2167) a good place to live, by adding a simple ticker in the
`dbWorker` to poll stats between config updates. This stats polling
could be used to form additional heuristics in our connection limit
tuning if we would like to explore that in the future, for example using
the mean connection time as an additional weight to further increase the
pool size.


## Some notes
I tested this extensively but please give a thorough review; I made some
judgement calls on non-happy paths. I'm also not sure how reliable the
`sqldb` reference is, as the composition of the `*pop.Store` can change
based on inputs, context, dialect, driver, etc. The entire feature will
not work if I can't reflect out the `sqldb`.

---------

Co-authored-by: Chris Stockton <[email protected]>
issuedat pushed a commit that referenced this pull request Sep 30, 2025
issuedat pushed a commit that referenced this pull request Sep 30, 2025