You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat: support percentage based db limits with reload support (#2177)
## Summary
Introduce a context aware DB dial path, a new `ConnPercentage` knob to
cap Auth's share of Postgres connections, and background wiring to apply
pool changes on config reloads.
**Storage / DB**
- Add `DialContext(ctx, *conf.GlobalConfiguration)` and keep `Dial(...)`
as a thin wrapper. `serve` now passes its cancelable context so startup
can't hang indefinitely.
- `Connection` now keeps a handle to the underlying `*sql.DB` (via
`popConnToStd`) when available.
- New helpers:
- `newConnectionDetails` and `applyDBDriver` to build
`pop.ConnectionDetails` and derive driver when omitted.
- `Connection.Copy()` to retain `sqldb` reference and updated locations
that copy (`WithContext, Transaction)`.
- Runtime tuning API: `(*Connection).ApplyConfig(ctx, cfg, le)` computes
and applies connection limits to the underlying `*sql.DB`.
- Fixed limits come from `MaxPoolSize`, `MaxIdlePoolSize`,
`ConnMaxLifetime`, `ConnMaxIdleTime`.
- If `ConnPercentage` is set (1-100), compute limits from `SHOW
max_connections`, prefer percentage over fixed pool sizes, and set idle
= open.
- Retains previous behavior when `ConnPercentage` is `0`
- No-op (and error) if `*sql.DB` is unavailable.
**API worker**
- `apiworker.New` now accepts the DB connection.
- Split worker into three goroutines (via `errgroup`):
- `configNotifier` fans out reload signals,
- `templateWorker` refreshes template cache,
- `dbWorker` applies DB connection limits on boot and each reload.
**Serve**
- Use `storage.DialContext(ctx, cfg)` and then `db =
db.WithContext(ctx)` so the DB handle participates in request/trace
context and shutdown.
**Observability**
- Add `observability.NewLogEntry(*logrus.Entry)` to construct chi
middleware log entries.
- Structured logs around applying DB limits.
**Configuration knobs** (`GOTRUE_DB_*`)
- `GOTRUE_DB_CONN_PERCENTAGE` (int, clamped to `[0,100]`):
- `0` (default) disables percentage-based sizing.
- `1-100` reserves that % of `max_connections` for the Auth server.
**Tests**
- `internal/storage/dial_test.go`
- `DialContext` happy path and invalid driver/URL error path.
- Reflection bridge to `*sql.DB` (`popConnToStd`) including
`WithContext`-wrapped connection behavior.
- `ApplyConfig` end-to-end: verify pool sizing and stats reflect limits.
- Percentage math and precedence vs fixed pools across edge cases.
- `internal/conf/configuration_test.go`
- Validation clamps `ConnPercentage` to `[0,100]`.
## How it works
In short if `GOTRUE_DB_CONN_PERCENTAGE=0`, we use the fixed
`GOTRUE_DB_{MAX,CONN}_*` limits. If it's in the range `[1, 100]` we set
percentage based limits derived from `SHOW max_connections` and ignore
the fixed pool sizes.
### Deep Dive
The startup sequence remains the same, trying to set it _before_ we
returned from `DialContext` was a bit messy (chicken / egg: need a conn
to setup a conn). I also didn't want to delay startup time during
failure scenarios (db is unavailable, db is blocking, etc).
So after `DialContext` we have a connection which is configured
initially with only the existing DB settings:
```bash
GOTRUE_DB_MAX_POOL_SIZE="50"
GOTRUE_DB_MAX_IDLE_POOL_SIZE="10"
GOTRUE_DB_CONN_MAX_IDLE_TIME="60s"
GOTRUE_DB_CONN_MAX_LIFETIME="0"
```
Next the server starts the `apiworker` which immediatelly creates a new
[dbWorker](https://github.com/supabase/auth/pull/2177/files#diff-b20c1e9d1c21d077494cf5ff490de301a864d3d1812538cf594a687f620a7175R122)
goroutine concurrently while the rest of the startup sequence continues.
Before entering the config update loop the `dbWorker` will call the
newly added
[ApplyConfig(...)](https://github.com/supabase/auth/pull/2177/files#diff-5b7e4f0f03bfbc3a58168e58eb88386b9e683241c1ebcb57f6764c38308f2257R179)
method on the `*storage.Connection`.
The `ApplyConfig` method is where the logic for obtaining the best
values to call the `sql.DB.Set*(...)` methods below lives:
* [SetConnMaxIdleTime(d
time.Duration)](https://pkg.go.dev/database/sql#DB.SetConnMaxIdleTime)
* [SetConnMaxLifetime(d
time.Duration)](https://pkg.go.dev/database/sql#DB.SetConnMaxLifetime)
* [SetMaxIdleConns(n
int)](https://pkg.go.dev/database/sql#DB.SetMaxIdleConns)
* [SetMaxOpenConns(n
int)](https://pkg.go.dev/database/sql#DB.SetMaxOpenConns)
Right now
[ApplyConfig](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L179)
works like this:
1. Checks that we were able to
[reflect](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L140)
a `*sql.DB` during `DialContext`, if not we do nothing since we can't
call `sql.DB.Set*(...)`.
* If we can't access *sql.DB or fetch max_connections, we leave the
prior limits untouched and log a warning. Always all-or-nothing, no
partial application of limits.
2. Calls the new
[getConnLimits](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L209)
method.
3. `getConnLimits` calls
[newConnLimitsFromConfig](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L287)
which returns a `ConnLimits` setup with `GOTRUE_DB_{MAX,CONN}_*`
settings.
4. Check if `GOTRUE_DB_CONN_PERCENTAGE` is zero, if so [it
returns](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L218)
the `GOTRUE_DB_{MAX,CONN}_*` from `newConnLimitsFromConfig`.
* This means the limits are set exactly as they are today.
5. Percentage config is non-zero so we make a call to
[showMaxConns](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L266)
which just returns an integer from `"SHOW max_connections;"`. In my
testing this value always seems to be available for the auth server:
* This value cannot change without postgres restarts.
* Postgres will not start if it is 0.
* Being in recovery mode still shows the maximum connections.
6. As long as `showMaxConns` does not return an error we attempt to
apply percentage based connection limits in
[applyPercentageLimits](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L235).
7. If max conns is <= 0 we return [an
error](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L245)
which prevents any config changes from being applied. Leaving the
connection in its prior state.
* max_connections > 0 is guaranteed if postgres is running, this is a
defensive check to prevent applying a clamp to 1 max conns on 0.
8. We perform a simple [bounds
check](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L252)
and then set the [`MaxOpenConns` and
`MaxIdleConns`](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L257)
to the values derived from the `ConnPercentage` and `maxConns`.
* Note that we preserve the existing behavior of IdleConns == MaxConns.
I believe the aim is to minimize connection churn (latency) at the cost
of more Postgres slots when idle. It's worth thinking about making this
a bit more considerate in the future, something simple like (open/2) or
more advanced heuristics using
[sql.DBStats](https://pkg.go.dev/database/sql#DBStats).
```Go
pct := float64(dbCfg.ConnPercentage)
cl.MaxOpenConns = int(max(1, (pct/100)*float64(maxConns)))
cl.MaxIdleConns = cl.MaxOpenConns
```
9. The values set from the call to `getConnLimits` are logged before
being [applied via the
`sql.DB.Set*(...)`](https://github.com/supabase/auth/blob/4b5bc8d08fb4fbdf778f504f226da29246d34e84/internal/storage/dial.go#L202)
calls.
We fail strictly and quickly on derivation errors to keep the last known
good settings. By supporting config reloading my hope is that when under
high load users may balance this setting without taking down the auth
server. This tight feedback loop should help rule out (or resolve) the
auth server as a potential root cause to connection timeouts and similar
downstream effects.
@stojan `apiworker` approach also gives a good place for your [stats
tracking](#2167) to live, adding a
simple ticker in the `dbWorker` to poll stats between config updates.
This stats polling could be used to form additional heuristics in our
connect limit tuning if we would like to explore that in the future. For
example use the mean connection time as an additional weight to further
increase the pool size.
## Some notes:
I tested this extensively but please give a thorough review, I made some
judgement calls on non-happy paths. I'm also not sure how reliable the
sqldb reference is as it seems the composition of the *pop.Store can
change based on inputs, context, dialect, driver, etc. The entire
feature will not work if I can't reflect out the sqldb.
---------
Co-authored-by: Chris Stockton <[email protected]>
0 commit comments