Skip to content

fix(workflows): gateway DDL + billing perm-error + smoke check + tools tables + CI/CD plan#18

Closed
thijs-hakkenberg wants to merge 6 commits into
databricks-solutions:mainfrom
thijs-hakkenberg:fix/sync-lakebase-gateway-ddl-transaction
Closed

fix(workflows): gateway DDL + billing perm-error + smoke check + tools tables + CI/CD plan#18
thijs-hakkenberg wants to merge 6 commits into
databricks-solutions:mainfrom
thijs-hakkenberg:fix/sync-lakebase-gateway-ddl-transaction

Conversation

@thijs-hakkenberg

@thijs-hakkenberg thijs-hakkenberg commented Jun 6, 2026

Copy link
Copy Markdown

Summary

Four stacked changes from a customer onboarding:

  1. Bug fix (commit 1): 02_sync_to_lakebase Phase 6 silently fails to create gateway_usage_daily / gateway_usage_hourly on a fresh DB while reporting result=SUCCESS. The deployed app then logs an unbounded stream of psycopg2.errors.UndefinedTable errors and renders the Gateway page empty. Root cause: psycopg2 transaction abort on an ALTER TABLE issued before the matching CREATE TABLE.
  2. Regression gate (commit 2): A new smoke_check_lakebase workflow task asserts that every Lakebase table the dashboard reads from exists with the expected row content. Failure flips the workflow's result_state to FAILED so CI can gate on it.
  3. Silent-failure fix (commit 3): _execute_sql in 09_discover_billing.py returned [] on any non-SUCCEEDED state — including INSUFFICIENT_PERMISSIONS. That masked a missing SELECT ON system.billing.list_prices grant for weeks: cost-related dashboards (Cost Overview, Endpoint Costs, All Products) rendered empty while token usage worked, and the discovery task still reported SUCCESS. Helper now raises with the SQLSTATE + message so the same misconfiguration fails the discovery task immediately.
  4. Tools tables fix (commit 4): tool_registry and request_logs were created lazily by the app's startup hook in a 120 s daemon thread. When the app SP lacked Lakebase DDL or the daemon ran past timeout, the tables never landed; /api/v1/tools/overview returned 500, the Tools page rendered empty across all four tabs, and /api/v1/tools/sync lied with HTTP 200. Moved DDL into 02_sync_to_lakebase Phase 7 (workflow run-as is databricks_superuser); also stopped refresh_tools from swallowing exception tracebacks.
  5. CI/CD plan: ADR + a starter GitHub Actions workflow for the sandbox → stage → prod promotion pipeline.

Why I'm opening from a fork

I'm a Databricks customer deploying this repo into a sandbox workspace. I don't have push access to databricks-solutions/agent-control-plane, so this PR is from a personal fork. I'd love to be added as an outside contributor — there are already three follow-up items the smoke check surfaced that warrant separate PRs, and round-tripping each through cross-fork PR is high friction. Happy to provide a Databricks-side reference offline.


Commit 1: gateway DDL transaction fix

What was wrong

workflows/02_sync_to_lakebase.py Phase 6 issued the gateway DDLs in a single transaction with this order:

[
  CREATE TABLE gateway_usage_daily (...),
  CREATE INDEX idx_gud_date,
  CREATE INDEX idx_gud_ep,
  ALTER TABLE gateway_usage_daily  ADD COLUMN IF NOT EXISTS rate_limited_count ...,
  ALTER TABLE gateway_usage_hourly ADD COLUMN IF NOT EXISTS rate_limited_count ...,  # <-- table does not exist yet
  CREATE TABLE gateway_usage_hourly (...),
]

On a fresh database the ALTER gateway_usage_hourly raises UndefinedTable. psycopg2 marks the entire transaction aborted on any error, so subsequent cur.execute(ddl) calls in the same transaction silently no-op until rollback. The per-statement try/except swallowed the warning print but did not rollback, so the trailing CREATE TABLE gateway_usage_hourly was a no-op, and the trailing gw_conn.commit() succeeded — leaving the database without either table.

The observability section a few hundred lines up uses PostgreSQL SAVEPOINT per DDL and contains failures correctly. The gateway section did not.

Fix

  1. Reorder DDLs so all CREATE TABLEs run before any CREATE INDEX / ALTER TABLE.
  2. Switch from one shared transaction to one transaction per DDL: each statement runs in its own cursor() context with its own commit(), and a failure does rollback() then continues. This isolates the "ALTER on a not-yet-existing table on first run" case from poisoning sibling DDLs.
for ddl in _gw_ddls:
    try:
        with gw_conn.cursor() as cur:
            cur.execute(ddl)
        gw_conn.commit()
    except Exception as e:
        gw_conn.rollback()
        print(f\"  DDL warning: {e}\")

Verified end-to-end

Check Before After
gateway_usage_daily exists in Lakebase
gateway_usage_hourly exists in Lakebase
Rows in gateway_usage_daily n/a 21,303
Rows in gateway_usage_hourly n/a 2,606
App UndefinedTable errors after fix run continuous none
Discovery job result SUCCESS (silently broken) SUCCESS (correct)

Commit 2: Lakebase smoke check + CI/CD plan

What this catches

The bug above passed every existing automated check. The discovery job reported SUCCESS, the app's startup logs were clean, and the Lakebase tables that were created had rows. The only failure signal lived in the deployed app's stderr — which nobody reads on a successful deploy. The smoke check is a regression test that compares Lakebase reality to the app's data contract.

What's in workflows/10_smoke_check_lakebase.py

The notebook connects to Lakebase, enumerates pg_tables, and asserts every table the deployed app reads from. Tables are sourced from the actual FROM <table> references in control-plane-app/backend/services/ and bucketed:

Bucket Behaviour
REQUIRED (15 tables) Must exist AND have ≥1 row. Failure raises and the workflow goes red.
EXPECTED (11 tables) Must exist; 0 rows allowed (legitimate in quiet workspaces) but logged as WARN.
OPTIONAL (6 tables) App-managed (created lazily on user action). Existence not asserted.

Wired as smoke_check_lakebase task in databricks.yml, depending on sync_to_lakebase.

Subtleties this had to solve

  • dbutils.notebook.exit() masks failures. Calling it before raising marks the task SUCCEEDED with the JSON as its return value, hiding the FAIL from the workflow-level result_state. The smoke notebook raises directly instead.
  • jobs/get-run-output does not surface notebook stdout for failed runs. Only error and error_trace round-trip. The smoke notebook inlines the breakdown into the RuntimeError message itself so a CI runner reading error sees the per-table failure list directly.

What it found end-to-end

Beyond the gateway tables (now ✅), the first smoke run flagged two real new findings worth a separate ticket:

RuntimeError: Lakebase smoke check FAILED:
  EMPTY required table: billing_user_cost_daily;
  EMPTY required table: billing_product_daily
Breakdown:
  [required] billing_user_cost_daily             EMPTY    rows=0
  [required] billing_product_daily               EMPTY    rows=0
  [expected] billing_serving_daily               EMPTY    rows=0
  [expected] gateway_inference_logs              EMPTY    rows=0
  [expected] vector_search_indexes               EMPTY    rows=0
  [expected] kb_billing_daily                    EMPTY    rows=0

Update: the two REQUIRED failures and billing_serving_daily are now root-caused and fixed in commit 3 (see below). The smoke check did its job — surfaced a silent permission gap that had been quietly producing 0-row dashboards.

The remaining EXPECTED warnings (gateway_inference_logs, vector_search_indexes, kb_billing_daily) are legitimately empty for this workspace (no AI Gateway inference logging enabled, the one Vector Search endpoint has 0 indexes).


Commit 3: surface SQL permission errors in discover_billing

What was wrong

workflows/09_discover_billing.py issues five SQL statements via _execute_sql. Three of them (queries 1, 3, 5 — billing_serving_daily, billing_product_daily, billing_user_cost_daily) LEFT JOIN system.billing.list_prices to compute cost in USD. On the sandbox, the deployer / workflow run-as identity had SELECT ON system.billing.usage but not on system.billing.list_prices. The JOIN failed with:

[INSUFFICIENT_PERMISSIONS] User does not have SELECT on Table 'system.billing.list_prices'. SQLSTATE: 42501

But the helper had this:

if status != "SUCCEEDED":
    err = resp.get("status", {}).get("error", {})
    print(f"  SQL {status}: {err.get('message', '')[:300]}")
    return []   # <-- silent zero-row path

So all three queries silently returned []. The notebook then overwrote Delta with empty DataFrames, sync-to-Lakebase truncated and re-inserted 0 rows, and the discovery task reported result=SUCCESS. Token-related dashboards worked because queries 2 and 4 hit system.serving.endpoint_usage only — no list_prices join, no permission gap. That's why "token usage shows but cost overview is empty" was the symptom.

Fix

workflows/09_discover_billing.py:170_execute_sql now raises a RuntimeError carrying the API error code and message when the statement does not succeed. Future permission gaps now fail the discovery task at the source with an actionable message, instead of producing a 0-row result that surfaces three steps downstream as "EMPTY required table" in the smoke check.

(The corresponding production fix is a one-line GRANT SELECT ON SCHEMA system.billing TO <identity> — that's an environmental config, not code, so it belongs in docs/installation.md rather than this PR.)

Verified end-to-end

Table Before grant + fix After
billing_serving_daily (Lakebase) 0 15,040
billing_product_daily (Lakebase) 0 9,806
billing_user_cost_daily (Lakebase) 0 4,472
billing_token_daily (Lakebase) 3,932 (was working) 3,932
Discovery result_state on missing grant SUCCESS (silent) FAILED with SQLSTATE in error

CI/CD pattern

docs/decisions/2026-06-06-cicd-deployment-pattern.md proposes a sandbox → stage → prod promotion pipeline driven by GitHub Actions, gated on the smoke check at every environment. .github/workflows/deploy-sandbox.yml is the concrete starting point: bundle deploy, app deploy, jobs run-now, poll until terminal, fail the CI job if result_state != SUCCESS.

This is a starting point, not a finished pipeline. Stage and prod workflows are deferred until those workspaces are provisioned.


Commit 4: app-managed table DDL belongs in the workflow, not the app

What was wrong

The Tools page in the deployed app rendered empty across all four tabs (Overview / MCP Servers / UC Functions / Usage). Symptoms:

HTTP/2 500
psycopg2.errors.UndefinedTable: relation "tool_registry" does not exist
LINE 1: SELECT * FROM tool_registry WHERE type = 'mcp_server' ...

The app's startup hook in backend/main.py runs _init_tools() inside a daemon thread fan-out wrapped with t.join(timeout=120). _init_tools calls ensure_tools_tables() (which CREATE TABLE IF NOT EXISTS tool_registry ...) and then maybe_refresh_async(). Three layered failures masked each other:

  1. tool_registry was never created. On the sandbox, the daemon thread either crashed before _init_tools ran, ran past the 120 s budget, or hit Lakebase auth pressure under cold-start load — the server boots regardless and the table is missing.

  2. ensure_tools_tables() swallowed real DDL failures. Per-statement try/except: logger.warning(...) was indistinguishable from the harmless "already exists" case.

  3. refresh_tools() swallowed the entire body with try/except Exception: logger.warning("Tools refresh failed: %s", exc). POST /api/v1/tools/sync returned {"status":"ok","message":"Tools refresh complete"} regardless of whether the underlying refresh blew up — the dashboard was the only externally visible signal.

The same architectural issue applied to request_logs (created lazily by the request-audit middleware, also fails silently if the app SP lacks DDL).

Fix (three parts)

  1. Move app-managed table DDL into the workflow. New Phase 7 in workflows/02_sync_to_lakebase.py creates tool_registry and request_logs from the workflow run-as identity. The workflow runs as databricks_superuser on the Lakebase PG instance — DDL here is always safe. The app no longer needs DDL privileges to function, only SELECT/INSERT/UPDATE/DELETE. This parallels the pattern already used for discovered_agents, gateway_usage_*, and 20+ other tables.

  2. Make tools_service.refresh_tools self-heal and stop hiding errors.

    • refresh_tools() now calls ensure_tools_tables() at the top, so a fresh deploy where the workflow hasn't run yet still recovers on the first read.
    • Catch-all switched from logger.warning to logger.exception — full tracebacks land in app logs.
    • Per-statement handler in ensure_tools_tables() also uses logger.exception with the offending DDL preview.
  3. Smoke check coverage. workflows/10_smoke_check_lakebase.py now asserts tool_registry and request_logs in the EXPECTED bucket — existence required, 0 rows allowed. Missing table → workflow goes red with an actionable message; 0 rows is a WARN (legitimate before any user activity).

Verified end-to-end

Tab Before After
Overview HTTP 500 (UndefinedTable: tool_registry) {total_tools:3, mcp_servers:3, uc_functions:0, managed_count:3}
MCP Servers HTTP 500 3 entries: Atlassian / Google Drive / SharePoint (system-managed)
UC Functions HTTP 500 Empty (legitimate — no UC functions in ai_control_plane.control_plane; SP only sees system.*)
Usage [] (worked because no Lakebase touch) [] (legitimate — no MLflow traces with TOOL/FUNCTION spans yet)

The empty UC Functions and Usage tabs are correct given the current sandbox state — both will populate naturally as agents using UC function tools are deployed.


Bonus: in-workspace SP-grant fallback (also in commit 1)

control-plane-app/grant_sp_lakebase_notebook.py + run_grant_sp_lakebase_job.sh: a drop-in replacement for the laptop-side grant_sp_lakebase.py invocation in deploy.sh, for workspaces where the network policy blocks public Lakebase ingress. Submits the grant as a one-shot serverless Databricks Job with notebook_task + base_parameters so it runs inside the workspace.

Subtleties this had to solve, all of which would bite the next deployer:

  • spark_python_task + environment_variables doesn't work on serverless — environments[].spec doesn't honour environment_variables. Switched to notebook_task + widget reads.
  • The default SDK on serverless runtime didn't have databricks.sdk.service.postgres. Pinned databricks-sdk>=0.40.0 in the env spec and ran %pip install --upgrade at the top of the wrapper notebook.
  • grant_sp_lakebase.py ends with sys.exit(main()). Notebooks treat SystemExit(0) as a task failure. The wrapper runpy.run_paths the script and catches SystemExit, only re-raising on non-zero codes.

deploy.sh is updated to print a pointer to the helper instead of running grant_sp_lakebase.py directly.

Bonus: deployment-blocking issues log

docs/rca/2026-06-06-deploy-non-trivial-fixes.md captures four deployment-blocking issues from the sandbox onboarding that aren't covered by docs/installation.md:

  1. CREATE CATALOG on metastore is required — workspace-admin alone is not enough.
  2. SSO session reuse silently authenticates the wrong identity during databricks auth login — profile name does not constrain identity. current-user me verification recipe.
  3. Public Lakebase Postgres connectivity is blocked by some workspace network policies — laptop-side grant_sp_lakebase.py cannot work; in-workspace job is the fix (this PR's helper scripts).
  4. The Phase 6 gateway DDL transaction bug (the fix in this PR), with full root-cause and verification trail.

Plus a permission-audit table for the privileged admin identity used for the deploy, showing exactly what's required to land this stack.

Happy to split any of these out into separate PRs if a single PR is too much surface area to review at once.

Test plan

  • Phase 6 fix: gateway_usage_daily and gateway_usage_hourly exist with non-zero rows after a full discovery run.
  • App's Gateway page no longer logs UndefinedTable for gateway_usage_daily.
  • Commit 3: with the schema grant in place, all five billing queries return rows (15,040 / 3,932 / 9,806 / 21,674 / 4,472) and 1:1 sync to Lakebase.
  • Commit 3: with the schema grant intentionally revoked, discovery task fails loud with SQLSTATE: 42501 in the error message instead of silently producing 0-row dashboards.
  • Commit 4: workflow Phase 7 creates tool_registry and request_logs on every run; /api/v1/tools/overview returns 200 with three MCP server entries instead of 500.
  • Commit 4: smoke check fails when tool_registry is dropped manually (existence assert in EXPECTED bucket).
  • Smoke check passes on a healthy Lakebase after both fixes — all REQUIRED tables populated.
  • Smoke check fails loud on a broken Lakebase: tested by leaving billing_user_cost_daily REQUIRED before the commit-3 fix — task result_state=FAILED, exception message includes the per-table breakdown.
  • (Reviewers) Run on a clean target where neither gateway table existed — most likely scenario the bug hides in.
  • (Reviewers) Confirm the grant_sp_lakebase_notebook.py wrapper handles a real grant on a workspace that does have public Lakebase reachable (current verification was on a workspace where it isn't).
  • (Reviewers) Sanity-check .github/workflows/deploy-sandbox.yml against your CI conventions; the expected secret names are DATABRICKS_HOST_SANDBOX, DATABRICKS_CLIENT_ID_SANDBOX, DATABRICKS_CLIENT_SECRET_SANDBOX.

🤖 Generated with Claude Code

thijshakkenbergecolab and others added 2 commits June 6, 2026 14:59
Phase 6 of `02_sync_to_lakebase.py` ran all gateway DDLs in a single
transaction, with the ALTER for `gateway_usage_hourly` issued *before*
its CREATE TABLE. The ALTER raises `UndefinedTable`; psycopg2 marks the
whole transaction aborted, and the per-statement try/except swallows
the warning while the trailing CREATE silently no-ops. Result on a
fresh database: the table never gets created, the discovery job
reports SUCCESS, and the deployed app's Gateway page logs a stream of
`UndefinedTable: relation "gateway_usage_daily" does not exist`.

Reorder DDLs so all CREATE TABLEs run before any CREATE INDEX/ALTER,
and run each DDL in its own commit/rollback scope so a single failure
cannot poison sibling DDLs. Verified on the sandbox: post-fix
discovery run materialises both `gateway_usage_daily` (21k rows) and
`gateway_usage_hourly` (2.6k rows) in Lakebase, app errors stop.

Also adds in-workspace SP-grant fallback (helper notebook + shell
runner) for environments where the workspace blocks public Lakebase
access from the laptop, and a docs/rca/ entry capturing this and the
other deployment-blocking issues we hit during the sandbox
onboarding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The sync bug fixed in the previous commit (silent gateway DDL skip)
shipped to production despite passing every existing automated check —
the discovery job reported SUCCESS, the app start-up logs were green,
and only the Gateway page rendering empty in the browser exposed the
problem. We need a regression test that compares Lakebase reality to
the dashboard's contract.

Adds `workflows/10_smoke_check_lakebase.py`, wired as
`smoke_check_lakebase` task in `databricks.yml` depending on
`sync_to_lakebase`. The notebook connects to Lakebase, enumerates
`pg_tables`, and asserts every table the deployed app reads from
(scraped from `FROM <table>` references in
`control-plane-app/backend/services/`):

- `REQUIRED` (15 tables): must exist AND have ≥1 row.
- `EXPECTED` (11 tables): must exist; 0 rows logged as WARN.
- `OPTIONAL` (6 tables): app-managed, existence not asserted.

A `REQUIRED` failure raises (not `dbutils.notebook.exit` — that masks
the failure as task SUCCESS) so the workflow's `result_state` flips
to `FAILED` and any CI runner watching `databricks jobs get-run`
sees a non-zero terminal state. The exception body inlines the
breakdown so the failure message itself round-trips through the API
without depending on notebook stdout (which `get-run-output` does
not surface for failed runs).

End-to-end verification on the sandbox correctly surfaced two
real findings: `billing_user_cost_daily` and `billing_product_daily`
are empty post-sync despite their `billing_user_endpoint_daily` and
`billing_token_daily` source tables being populated. Tracked as a
separate ticket; not the bug this PR fixes.

Also adds:
- `docs/decisions/2026-06-06-cicd-deployment-pattern.md` — proposed
  GitHub Actions pipeline (sandbox → stage → prod) gated on the smoke
  check, with environment matrix and SP grant requirements.
- `.github/workflows/deploy-sandbox.yml` — concrete starting point
  for the sandbox path: bundle deploy, app deploy, run-now, poll
  the workflow, fail the CI job if `result_state != SUCCESS`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@thijs-hakkenberg thijs-hakkenberg changed the title fix(workflows): create gateway_usage tables in isolated transactions fix(workflows): gateway DDL transaction bug + Lakebase smoke check + CI/CD plan Jun 6, 2026
The _execute_sql helper in 09_discover_billing.py returned [] on any
non-SUCCEEDED state and only printed the error. When <admin-profile> lacked
SELECT on system.billing.list_prices, the three queries that JOIN
list_prices (serving / product / user_cost) silently produced 0 rows;
the task still reported result=SUCCESS and the cost-related dashboard
sections rendered empty while token usage worked.

Root cause was a missing schema-level grant — fixed in the workspace
with GRANT SELECT ON SCHEMA system.billing. This commit makes the code
fail loud instead of silently zero on the next time it happens: the
helper now raises RuntimeError carrying the SQLSTATE and message.

Verified: post-grant + post-fix discovery run produced 15,040 serving /
9,806 product / 4,472 user_cost rows, all syncing 1:1 to Lakebase.

Captured as item 5 in the deployment findings RCA.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@thijs-hakkenberg thijs-hakkenberg changed the title fix(workflows): gateway DDL transaction bug + Lakebase smoke check + CI/CD plan fix(workflows): gateway DDL + billing silent-perm-error + smoke check + CI/CD plan Jun 6, 2026
…rrors

The Tools page (Overview / MCP Servers / UC Functions / Usage) rendered
empty in the deployed app with /api/v1/tools/overview returning 500
"relation tool_registry does not exist". Three layered failures masked
each other:

1. tool_registry was never created. App startup runs ensure_tools_tables
   in a daemon thread under t.join(timeout=120); if the thread crashed,
   timed out, or hit Lakebase auth pressure, the table never landed and
   the app booted regardless.
2. ensure_tools_tables swallowed real DDL failures with logger.warning,
   indistinguishable from "already exists" cases.
3. refresh_tools wrapped its entire body in try/except and POST /tools/sync
   returned 200 unconditionally, hiding the broken state from operators.

Same pattern applied to request_logs (created lazily by the audit
middleware, also failed silently if app SP lacked DDL).

Fix:
- workflows/02_sync_to_lakebase Phase 7 now creates tool_registry and
  request_logs from the workflow run-as identity (databricks_superuser),
  matching how every other Lakebase table in this stack is provisioned.
  The app no longer needs DDL privileges to function.
- tools_service.refresh_tools self-heals by calling ensure_tools_tables
  first, and switches catch-all to logger.exception for full tracebacks.
- workflows/10_smoke_check_lakebase asserts tool_registry and
  request_logs in the EXPECTED bucket — missing table fails the
  workflow, 0 rows is a WARN (legitimate before user activity).

Also adds RCA item 6 to docs/rca/2026-06-06-deploy-non-trivial-fixes.md.

Verified end-to-end on sandbox (run <run-id>):
overview now returns {total_tools:3, mcp_servers:3, uc_functions:0,
managed_count:3}; the three system-managed MCP connections render in
the UI. Empty UC Functions / Usage tabs are legitimate (no agent UC
functions or trace tool spans yet).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@thijs-hakkenberg thijs-hakkenberg changed the title fix(workflows): gateway DDL + billing silent-perm-error + smoke check + CI/CD plan fix(workflows): gateway DDL + billing silent-perm-error + smoke check + tools tables + CI/CD plan Jun 6, 2026
…verwrite job params

While verifying the item-6 fix end-to-end, a `databricks bundle deploy --target dev`
without explicit --var flags overwrote the running job's working catalog/warehouse_id
parameters with the literal placeholder strings (`<your-catalog>`, `<your-warehouse-id>`)
defined as defaults in workflows/databricks.yml's dev target. Every subsequent task
failed with a SQL parse error.

Captured the footgun, the recovery commands (the full --var= invocation that restores
the working job), and proposed durable fixes (sentinel defaults that fail loud, or a
gitignored .databricks-bundle.<target>.local.yml overlay).

Verified clean re-run on sandbox: 11/11 tasks SUCCESS, smoke check passed
12/12 REQUIRED tables OK, tool_registry and request_logs present and asserted in
the EXPECTED bucket.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@bkaankuguoglu

Copy link
Copy Markdown
Member

Thijs — first off, thank you. This is an exceptionally well-diagnosed PR: clear root-cause writeups, repro detail, and it's obvious you actually read the codebase (citing the SAVEPOINT-per-DDL pattern in the observability section was spot on). We reproduced and confirmed all three bugs against main:

  1. Gateway DDL transaction abort (02 Phase 6) — confirmed. The block runs in one non-autocommit transaction with ALTER gateway_usage_hourly before its CREATE, and our own observability / gateway-column sections already use savepoints to avoid exactly this. Real inconsistency, good catch.
  2. 09 _execute_sql swallowing INSUFFICIENT_PERMISSIONS — confirmed; cost dashboards can render empty while the task reports SUCCESS.
  3. Tools tables created lazily in the app's daemon thread — confirmed; moving DDL into the workflow is the right direction (it matches our "keep the app a thin reader; do the work in the workflows" principle).

We'd love to take these — with a few adjustments so the solution stays generic and doesn't encode one customer's context. That's the main thing we want to preserve in this shared/public solutions repo, so a couple of asks and a roadmap heads-up:

1. Keep it customer-agnostic. Could you drop the customer-specific pieces — the named deploy action, the environment-specific secret names, and the rollout-specific ADR/RCA framing? The patterns (a build gate + a post-deploy data-plane smoke check) are great and we want them; we'd just want them as a generic template any adopter can wire to their own workspace, rather than a named pipeline. Happy to take a generic CI/CD workflow as its own PR.

2. Heads-up on the sync layer (affects the gateway + tools DDL fixes). We're about to introduce automatic Lakebase sync to replace the manual 02_sync_to_lakebase step entirely — managed Delta→Lakebase syncing, for simplicity and performance (no hand-rolled psycopg2 TRUNCATE/INSERT/DDL, and the whole transaction-management bug class goes away). Because that file is on its way out, we'd prefer to land your DDL fixes as a minimal stopgap (just the reorder + per-statement savepoint, mirroring the existing observability block) rather than a larger restructure of 02 — so we don't invest heavily in code we're retiring. The tools-table DDL lands the same way for now, then moves to the managed sync.

3. On the billing fix (_execute_sql raising). We'd resolve this slightly differently: raise on INSUFFICIENT_PERMISSIONS/misconfig, but keep degrading gracefully when a system table is genuinely unavailable. The same helper shape elsewhere (e.g. 11_discover_ai_gateway_usage) intentionally returns [] so account-scoped discovery doesn't hard-fail in workspaces that legitimately can't see a given system table — that resilience is load-bearing for the cross-workspace story. A raise scoped to permission errors gets you the loud failure you want without breaking that.

4. Smoke check — yes please, this is a genuinely useful gate (it'd have caught more than just this bug). We'd just generalize the asserted table set so it isn't tied to one deployment.

5. Packaging — would you mind splitting this into (a) the three bug fixes + smoke check, and (b) the generic CI/CD? (a) can merge quickly; (b) can bake separately.

On contributing: we really appreciate this, and we're open to your contributions and PRs going forward — please keep them coming. Let's get (a) landed first; as the collaboration develops we're happy to look at smoother access for follow-ups down the line. Thanks again for digging in and writing it up so thoroughly.

@bkaankuguoglu

bkaankuguoglu commented Jun 9, 2026

Copy link
Copy Markdown
Member

Thanks for this @thijs-hakkenberg — thorough write-up and the smoke check already earned its keep by surfacing the billing perm gap.

To make review tractable, we split it into two focused PRs (rebased onto current main, so #17's workload-classification columns are preserved — your branch predated them):

We're picking up #20 first. This PR can stay open as the source of record until #20/#21 land, then we'll close it. On the outside-contributor ask — flagging that with the maintainers separately.

bkaankuguoglu added a commit that referenced this pull request Jun 9, 2026
…ebase smoke check (#20)

* fix(workflows): create gateway_usage tables in isolated transactions

Phase 6 of `02_sync_to_lakebase.py` ran all gateway DDLs in a single
transaction, with the ALTER for `gateway_usage_hourly` issued *before*
its CREATE TABLE. The ALTER raises `UndefinedTable`; psycopg2 marks the
whole transaction aborted, and the per-statement try/except swallows
the warning while the trailing CREATE silently no-ops. Result on a
fresh database: the table never gets created, the discovery job
reports SUCCESS, and the deployed app's Gateway page logs a stream of
`UndefinedTable: relation "gateway_usage_daily" does not exist`.

Reorder DDLs so all CREATE TABLEs run before any CREATE INDEX/ALTER,
and run each DDL in its own commit/rollback scope so a single failure
cannot poison sibling DDLs. Verified on the Ecolab sandbox: post-fix
discovery run materialises both `gateway_usage_daily` (21k rows) and
`gateway_usage_hourly` (2.6k rows) in Lakebase, app errors stop.

Also adds in-workspace SP-grant fallback (helper notebook + shell
runner) for environments where the workspace blocks public Lakebase
access from the laptop, and a docs/rca/ entry capturing this and the
other deployment-blocking issues we hit during the Ecolab sandbox
onboarding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(workflows): add Lakebase smoke check task

A new `smoke_check_lakebase` workflow task asserts that every Lakebase table
the dashboard reads from exists with the expected row content. REQUIRED tables
must exist with >=1 row; EXPECTED must exist (0 rows allowed, logged WARN);
OPTIONAL (app-managed) existence is not asserted. Failure raises directly
(not via dbutils.notebook.exit) so the workflow result_state flips to FAILED
and CI can gate on it. Wired as `smoke_check_lakebase` in databricks.yml,
depending on `sync_to_lakebase`.

Split out from #18; CI/CD deploy pattern is in a separate PR.

Co-authored-by: Isaac

* fix(workflows): surface SQL permission errors in discover_billing

The _execute_sql helper in 09_discover_billing.py returned [] on any
non-SUCCEEDED state and only printed the error. When a-hakketh lacked
SELECT on system.billing.list_prices, the three queries that JOIN
list_prices (serving / product / user_cost) silently produced 0 rows;
the task still reported result=SUCCESS and the cost-related dashboard
sections rendered empty while token usage worked.

Root cause was a missing schema-level grant — fixed in the workspace
with GRANT SELECT ON SCHEMA system.billing. This commit makes the code
fail loud instead of silently zero on the next time it happens: the
helper now raises RuntimeError carrying the SQLSTATE and message.

Verified: post-grant + post-fix discovery run produced 15,040 serving /
9,806 product / 4,472 user_cost rows, all syncing 1:1 to Lakebase.

Captured as item 5 in the deployment findings RCA.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(tools): create app-managed tables in workflow + stop swallowing errors

The Tools page (Overview / MCP Servers / UC Functions / Usage) rendered
empty in the deployed app with /api/v1/tools/overview returning 500
"relation tool_registry does not exist". Three layered failures masked
each other:

1. tool_registry was never created. App startup runs ensure_tools_tables
   in a daemon thread under t.join(timeout=120); if the thread crashed,
   timed out, or hit Lakebase auth pressure, the table never landed and
   the app booted regardless.
2. ensure_tools_tables swallowed real DDL failures with logger.warning,
   indistinguishable from "already exists" cases.
3. refresh_tools wrapped its entire body in try/except and POST /tools/sync
   returned 200 unconditionally, hiding the broken state from operators.

Same pattern applied to request_logs (created lazily by the audit
middleware, also failed silently if app SP lacked DDL).

Fix:
- workflows/02_sync_to_lakebase Phase 7 now creates tool_registry and
  request_logs from the workflow run-as identity (databricks_superuser),
  matching how every other Lakebase table in this stack is provisioned.
  The app no longer needs DDL privileges to function.
- tools_service.refresh_tools self-heals by calling ensure_tools_tables
  first, and switches catch-all to logger.exception for full tracebacks.
- workflows/10_smoke_check_lakebase asserts tool_registry and
  request_logs in the EXPECTED bucket — missing table fails the
  workflow, 0 rows is a WARN (legitimate before user activity).

Also adds RCA item 6 to docs/rca/2026-06-06-deploy-non-trivial-fixes.md.

Verified end-to-end on Ecolab sandbox (run 881046971716582):
overview now returns {total_tools:3, mcp_servers:3, uc_functions:0,
managed_count:3}; the three system-managed MCP connections render in
the UI. Empty UC Functions / Usage tabs are legitimate (no agent UC
functions or trace tool spans yet).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(rca): add item 7 — bundle deploy placeholder defaults silently overwrite job params

While verifying the item-6 fix end-to-end, a `databricks bundle deploy --target dev`
without explicit --var flags overwrote the running job's working catalog/warehouse_id
parameters with the literal placeholder strings (`<your-catalog>`, `<your-warehouse-id>`)
defined as defaults in workflows/databricks.yml's dev target. Every subsequent task
failed with a SQL parse error.

Captured the footgun, the recovery commands (the full --var= invocation that restores
the working job), and proposed durable fixes (sentinel defaults that fail loud, or a
gitignored .databricks-bundle.<target>.local.yml overlay).

Verified clean re-run on Ecolab sandbox: 11/11 tasks SUCCESS, smoke check passed
12/12 REQUIRED tables OK, tool_registry and request_logs present and asserted in
the EXPECTED bucket.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(installation): document system.billing / system.serving grants (Step 8)

The billing fail-loud fix in this PR turns a missing SELECT on
system.billing.list_prices into a hard discovery-task failure (SQLSTATE 42501)
instead of silent 0-row cost dashboards. Document the required schema-level
grants for the workflow run-as identity in Step 8 (grant the whole
system.billing / system.serving schemas so a future table addition can't
reintroduce the gap), and make the step non-optional. Closes the installation
docs TODO tracked in the deploy RCA for item 5.

Co-authored-by: Isaac

* docs(rca): soften CI/CD ADR link to plain text

The CI/CD deployment-pattern ADR ships in the separate CI/CD PR, so a relative
markdown link to it is dead on main until that PR merges. Reference the path as
plain text instead, leaving #20 with no cross-PR loose ends.

Co-authored-by: Isaac

---------

Co-authored-by: Hakkenberg <thijs.hakkenberg@ecolab.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…t, CI

Replaces every reference to a specific deploying organization, user, workspace
ID, account ID, warehouse ID, Lakebase DNS, app host, and service principal
client ID with generic placeholders (`<your-...>` / `<workspace-id>` / etc.).

Files touched:
- docs/rca/2026-06-06-deploy-non-trivial-fixes.md
- docs/decisions/2026-06-06-cicd-deployment-pattern.md
- control-plane-app/deploy.sh — comment generalised
- control-plane-app/backend/services/tools_service.py — code comment generalised
- .github/workflows/deploy-sandbox.yml — workflow display name generalised
- .gitignore — exclude editor/scratch files

The information itself was never sensitive (the workspace and warehouse IDs
appear in URLs visible to anyone with workspace access), but the document
narrative shouldn't pin to one customer. Future onboardings can fork this
doc and substitute their own values.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@thijs-hakkenberg thijs-hakkenberg changed the title fix(workflows): gateway DDL + billing silent-perm-error + smoke check + tools tables + CI/CD plan fix(workflows): gateway DDL + billing perm-error + smoke check + tools tables + CI/CD plan Jun 10, 2026
@thijs-hakkenberg thijs-hakkenberg force-pushed the fix/sync-lakebase-gateway-ddl-transaction branch 2 times, most recently from eed8161 to 26a016c Compare June 10, 2026 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants