fix(workflows): gateway DDL + billing perm-error + smoke check + tools tables + CI/CD plan by thijs-hakkenberg · Pull Request #18 · databricks-solutions/agent-control-plane

thijs-hakkenberg · 2026-06-06T13:18:25Z

Summary

Four stacked changes from a customer onboarding:

Bug fix (commit 1): 02_sync_to_lakebase Phase 6 silently fails to create gateway_usage_daily / gateway_usage_hourly on a fresh DB while reporting result=SUCCESS. The deployed app then logs an unbounded stream of psycopg2.errors.UndefinedTable errors and renders the Gateway page empty. Root cause: psycopg2 transaction abort on an ALTER TABLE issued before the matching CREATE TABLE.
Regression gate (commit 2): A new smoke_check_lakebase workflow task asserts that every Lakebase table the dashboard reads from exists with the expected row content. Failure flips the workflow's result_state to FAILED so CI can gate on it.
Silent-failure fix (commit 3): _execute_sql in 09_discover_billing.py returned [] on any non-SUCCEEDED state — including INSUFFICIENT_PERMISSIONS. That masked a missing SELECT ON system.billing.list_prices grant for weeks: cost-related dashboards (Cost Overview, Endpoint Costs, All Products) rendered empty while token usage worked, and the discovery task still reported SUCCESS. Helper now raises with the SQLSTATE + message so the same misconfiguration fails the discovery task immediately.
Tools tables fix (commit 4): tool_registry and request_logs were created lazily by the app's startup hook in a 120 s daemon thread. When the app SP lacked Lakebase DDL or the daemon ran past timeout, the tables never landed; /api/v1/tools/overview returned 500, the Tools page rendered empty across all four tabs, and /api/v1/tools/sync lied with HTTP 200. Moved DDL into 02_sync_to_lakebase Phase 7 (workflow run-as is databricks_superuser); also stopped refresh_tools from swallowing exception tracebacks.
CI/CD plan: ADR + a starter GitHub Actions workflow for the sandbox → stage → prod promotion pipeline.

Why I'm opening from a fork

I'm a Databricks customer deploying this repo into a sandbox workspace. I don't have push access to databricks-solutions/agent-control-plane, so this PR is from a personal fork. I'd love to be added as an outside contributor — there are already three follow-up items the smoke check surfaced that warrant separate PRs, and round-tripping each through cross-fork PR is high friction. Happy to provide a Databricks-side reference offline.

Commit 1: gateway DDL transaction fix

What was wrong

workflows/02_sync_to_lakebase.py Phase 6 issued the gateway DDLs in a single transaction with this order:

[
  CREATE TABLE gateway_usage_daily (...),
  CREATE INDEX idx_gud_date,
  CREATE INDEX idx_gud_ep,
  ALTER TABLE gateway_usage_daily  ADD COLUMN IF NOT EXISTS rate_limited_count ...,
  ALTER TABLE gateway_usage_hourly ADD COLUMN IF NOT EXISTS rate_limited_count ...,  # <-- table does not exist yet
  CREATE TABLE gateway_usage_hourly (...),
]

On a fresh database the ALTER gateway_usage_hourly raises UndefinedTable. psycopg2 marks the entire transaction aborted on any error, so subsequent cur.execute(ddl) calls in the same transaction silently no-op until rollback. The per-statement try/except swallowed the warning print but did not rollback, so the trailing CREATE TABLE gateway_usage_hourly was a no-op, and the trailing gw_conn.commit() succeeded — leaving the database without either table.

The observability section a few hundred lines up uses PostgreSQL SAVEPOINT per DDL and contains failures correctly. The gateway section did not.

Fix

Reorder DDLs so all CREATE TABLEs run before any CREATE INDEX / ALTER TABLE.
Switch from one shared transaction to one transaction per DDL: each statement runs in its own cursor() context with its own commit(), and a failure does rollback() then continues. This isolates the "ALTER on a not-yet-existing table on first run" case from poisoning sibling DDLs.

for ddl in _gw_ddls:
    try:
        with gw_conn.cursor() as cur:
            cur.execute(ddl)
        gw_conn.commit()
    except Exception as e:
        gw_conn.rollback()
        print(f\"  DDL warning: {e}\")

Verified end-to-end

Check	Before	After
`gateway_usage_daily` exists in Lakebase	❌	✅
`gateway_usage_hourly` exists in Lakebase	❌	✅
Rows in `gateway_usage_daily`	n/a	21,303
Rows in `gateway_usage_hourly`	n/a	2,606
App `UndefinedTable` errors after fix run	continuous	none
Discovery job result	`SUCCESS` (silently broken)	`SUCCESS` (correct)

Commit 2: Lakebase smoke check + CI/CD plan

What this catches

The bug above passed every existing automated check. The discovery job reported SUCCESS, the app's startup logs were clean, and the Lakebase tables that were created had rows. The only failure signal lived in the deployed app's stderr — which nobody reads on a successful deploy. The smoke check is a regression test that compares Lakebase reality to the app's data contract.

What's in `workflows/10_smoke_check_lakebase.py`

The notebook connects to Lakebase, enumerates pg_tables, and asserts every table the deployed app reads from. Tables are sourced from the actual FROM <table> references in control-plane-app/backend/services/ and bucketed:

Bucket	Behaviour
`REQUIRED` (15 tables)	Must exist AND have ≥1 row. Failure raises and the workflow goes red.
`EXPECTED` (11 tables)	Must exist; 0 rows allowed (legitimate in quiet workspaces) but logged as WARN.
`OPTIONAL` (6 tables)	App-managed (created lazily on user action). Existence not asserted.

Wired as smoke_check_lakebase task in databricks.yml, depending on sync_to_lakebase.

Subtleties this had to solve

dbutils.notebook.exit() masks failures. Calling it before raising marks the task SUCCEEDED with the JSON as its return value, hiding the FAIL from the workflow-level result_state. The smoke notebook raises directly instead.
jobs/get-run-output does not surface notebook stdout for failed runs. Only error and error_trace round-trip. The smoke notebook inlines the breakdown into the RuntimeError message itself so a CI runner reading error sees the per-table failure list directly.

What it found end-to-end

Beyond the gateway tables (now ✅), the first smoke run flagged two real new findings worth a separate ticket:

RuntimeError: Lakebase smoke check FAILED:
  EMPTY required table: billing_user_cost_daily;
  EMPTY required table: billing_product_daily
Breakdown:
  [required] billing_user_cost_daily             EMPTY    rows=0
  [required] billing_product_daily               EMPTY    rows=0
  [expected] billing_serving_daily               EMPTY    rows=0
  [expected] gateway_inference_logs              EMPTY    rows=0
  [expected] vector_search_indexes               EMPTY    rows=0
  [expected] kb_billing_daily                    EMPTY    rows=0

Update: the two REQUIRED failures and billing_serving_daily are now root-caused and fixed in commit 3 (see below). The smoke check did its job — surfaced a silent permission gap that had been quietly producing 0-row dashboards.

The remaining EXPECTED warnings (gateway_inference_logs, vector_search_indexes, kb_billing_daily) are legitimately empty for this workspace (no AI Gateway inference logging enabled, the one Vector Search endpoint has 0 indexes).

Commit 3: surface SQL permission errors in `discover_billing`

What was wrong

workflows/09_discover_billing.py issues five SQL statements via _execute_sql. Three of them (queries 1, 3, 5 — billing_serving_daily, billing_product_daily, billing_user_cost_daily) LEFT JOIN system.billing.list_prices to compute cost in USD. On the sandbox, the deployer / workflow run-as identity had SELECT ON system.billing.usage but not on system.billing.list_prices. The JOIN failed with:

[INSUFFICIENT_PERMISSIONS] User does not have SELECT on Table 'system.billing.list_prices'. SQLSTATE: 42501

But the helper had this:

if status != "SUCCEEDED":
    err = resp.get("status", {}).get("error", {})
    print(f"  SQL {status}: {err.get('message', '')[:300]}")
    return []   # <-- silent zero-row path

So all three queries silently returned []. The notebook then overwrote Delta with empty DataFrames, sync-to-Lakebase truncated and re-inserted 0 rows, and the discovery task reported result=SUCCESS. Token-related dashboards worked because queries 2 and 4 hit system.serving.endpoint_usage only — no list_prices join, no permission gap. That's why "token usage shows but cost overview is empty" was the symptom.

Fix

workflows/09_discover_billing.py:170 — _execute_sql now raises a RuntimeError carrying the API error code and message when the statement does not succeed. Future permission gaps now fail the discovery task at the source with an actionable message, instead of producing a 0-row result that surfaces three steps downstream as "EMPTY required table" in the smoke check.

(The corresponding production fix is a one-line GRANT SELECT ON SCHEMA system.billing TO <identity> — that's an environmental config, not code, so it belongs in docs/installation.md rather than this PR.)

Verified end-to-end

Table	Before grant + fix	After
`billing_serving_daily` (Lakebase)	0	15,040
`billing_product_daily` (Lakebase)	0	9,806
`billing_user_cost_daily` (Lakebase)	0	4,472
`billing_token_daily` (Lakebase)	3,932 (was working)	3,932
Discovery `result_state` on missing grant	SUCCESS (silent)	FAILED with SQLSTATE in error

CI/CD pattern

docs/decisions/2026-06-06-cicd-deployment-pattern.md proposes a sandbox → stage → prod promotion pipeline driven by GitHub Actions, gated on the smoke check at every environment. .github/workflows/deploy-sandbox.yml is the concrete starting point: bundle deploy, app deploy, jobs run-now, poll until terminal, fail the CI job if result_state != SUCCESS.

This is a starting point, not a finished pipeline. Stage and prod workflows are deferred until those workspaces are provisioned.

Commit 4: app-managed table DDL belongs in the workflow, not the app

What was wrong

The Tools page in the deployed app rendered empty across all four tabs (Overview / MCP Servers / UC Functions / Usage). Symptoms:

HTTP/2 500
psycopg2.errors.UndefinedTable: relation "tool_registry" does not exist
LINE 1: SELECT * FROM tool_registry WHERE type = 'mcp_server' ...

The app's startup hook in backend/main.py runs _init_tools() inside a daemon thread fan-out wrapped with t.join(timeout=120). _init_tools calls ensure_tools_tables() (which CREATE TABLE IF NOT EXISTS tool_registry ...) and then maybe_refresh_async(). Three layered failures masked each other:

tool_registry was never created. On the sandbox, the daemon thread either crashed before _init_tools ran, ran past the 120 s budget, or hit Lakebase auth pressure under cold-start load — the server boots regardless and the table is missing.
ensure_tools_tables() swallowed real DDL failures. Per-statement try/except: logger.warning(...) was indistinguishable from the harmless "already exists" case.
refresh_tools() swallowed the entire body with try/except Exception: logger.warning("Tools refresh failed: %s", exc). POST /api/v1/tools/sync returned {"status":"ok","message":"Tools refresh complete"} regardless of whether the underlying refresh blew up — the dashboard was the only externally visible signal.

The same architectural issue applied to request_logs (created lazily by the request-audit middleware, also fails silently if the app SP lacks DDL).

Fix (three parts)

Move app-managed table DDL into the workflow. New Phase 7 in workflows/02_sync_to_lakebase.py creates tool_registry and request_logs from the workflow run-as identity. The workflow runs as databricks_superuser on the Lakebase PG instance — DDL here is always safe. The app no longer needs DDL privileges to function, only SELECT/INSERT/UPDATE/DELETE. This parallels the pattern already used for discovered_agents, gateway_usage_*, and 20+ other tables.
Make tools_service.refresh_tools self-heal and stop hiding errors.
- refresh_tools() now calls ensure_tools_tables() at the top, so a fresh deploy where the workflow hasn't run yet still recovers on the first read.
- Catch-all switched from logger.warning to logger.exception — full tracebacks land in app logs.
- Per-statement handler in ensure_tools_tables() also uses logger.exception with the offending DDL preview.
Smoke check coverage. workflows/10_smoke_check_lakebase.py now asserts tool_registry and request_logs in the EXPECTED bucket — existence required, 0 rows allowed. Missing table → workflow goes red with an actionable message; 0 rows is a WARN (legitimate before any user activity).

Verified end-to-end

Tab	Before	After
Overview	HTTP 500 (`UndefinedTable: tool_registry`)	`{total_tools:3, mcp_servers:3, uc_functions:0, managed_count:3}`
MCP Servers	HTTP 500	3 entries: Atlassian / Google Drive / SharePoint (system-managed)
UC Functions	HTTP 500	Empty (legitimate — no UC functions in `ai_control_plane.control_plane`; SP only sees `system.`)*
Usage	`[]` (worked because no Lakebase touch)	`[]` (legitimate — no MLflow traces with TOOL/FUNCTION spans yet)

The empty UC Functions and Usage tabs are correct given the current sandbox state — both will populate naturally as agents using UC function tools are deployed.

Bonus: in-workspace SP-grant fallback (also in commit 1)

control-plane-app/grant_sp_lakebase_notebook.py + run_grant_sp_lakebase_job.sh: a drop-in replacement for the laptop-side grant_sp_lakebase.py invocation in deploy.sh, for workspaces where the network policy blocks public Lakebase ingress. Submits the grant as a one-shot serverless Databricks Job with notebook_task + base_parameters so it runs inside the workspace.

Subtleties this had to solve, all of which would bite the next deployer:

spark_python_task + environment_variables doesn't work on serverless — environments[].spec doesn't honour environment_variables. Switched to notebook_task + widget reads.
The default SDK on serverless runtime didn't have databricks.sdk.service.postgres. Pinned databricks-sdk>=0.40.0 in the env spec and ran %pip install --upgrade at the top of the wrapper notebook.
grant_sp_lakebase.py ends with sys.exit(main()). Notebooks treat SystemExit(0) as a task failure. The wrapper runpy.run_paths the script and catches SystemExit, only re-raising on non-zero codes.

deploy.sh is updated to print a pointer to the helper instead of running grant_sp_lakebase.py directly.

Bonus: deployment-blocking issues log

docs/rca/2026-06-06-deploy-non-trivial-fixes.md captures four deployment-blocking issues from the sandbox onboarding that aren't covered by docs/installation.md:

CREATE CATALOG on metastore is required — workspace-admin alone is not enough.
SSO session reuse silently authenticates the wrong identity during databricks auth login — profile name does not constrain identity. current-user me verification recipe.
Public Lakebase Postgres connectivity is blocked by some workspace network policies — laptop-side grant_sp_lakebase.py cannot work; in-workspace job is the fix (this PR's helper scripts).
The Phase 6 gateway DDL transaction bug (the fix in this PR), with full root-cause and verification trail.

Plus a permission-audit table for the privileged admin identity used for the deploy, showing exactly what's required to land this stack.

Happy to split any of these out into separate PRs if a single PR is too much surface area to review at once.

Test plan

🤖 Generated with Claude Code

Phase 6 of `02_sync_to_lakebase.py` ran all gateway DDLs in a single transaction, with the ALTER for `gateway_usage_hourly` issued *before* its CREATE TABLE. The ALTER raises `UndefinedTable`; psycopg2 marks the whole transaction aborted, and the per-statement try/except swallows the warning while the trailing CREATE silently no-ops. Result on a fresh database: the table never gets created, the discovery job reports SUCCESS, and the deployed app's Gateway page logs a stream of `UndefinedTable: relation "gateway_usage_daily" does not exist`. Reorder DDLs so all CREATE TABLEs run before any CREATE INDEX/ALTER, and run each DDL in its own commit/rollback scope so a single failure cannot poison sibling DDLs. Verified on the sandbox: post-fix discovery run materialises both `gateway_usage_daily` (21k rows) and `gateway_usage_hourly` (2.6k rows) in Lakebase, app errors stop. Also adds in-workspace SP-grant fallback (helper notebook + shell runner) for environments where the workspace blocks public Lakebase access from the laptop, and a docs/rca/ entry capturing this and the other deployment-blocking issues we hit during the sandbox onboarding. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The sync bug fixed in the previous commit (silent gateway DDL skip) shipped to production despite passing every existing automated check — the discovery job reported SUCCESS, the app start-up logs were green, and only the Gateway page rendering empty in the browser exposed the problem. We need a regression test that compares Lakebase reality to the dashboard's contract. Adds `workflows/10_smoke_check_lakebase.py`, wired as `smoke_check_lakebase` task in `databricks.yml` depending on `sync_to_lakebase`. The notebook connects to Lakebase, enumerates `pg_tables`, and asserts every table the deployed app reads from (scraped from `FROM <table>` references in `control-plane-app/backend/services/`): - `REQUIRED` (15 tables): must exist AND have ≥1 row. - `EXPECTED` (11 tables): must exist; 0 rows logged as WARN. - `OPTIONAL` (6 tables): app-managed, existence not asserted. A `REQUIRED` failure raises (not `dbutils.notebook.exit` — that masks the failure as task SUCCESS) so the workflow's `result_state` flips to `FAILED` and any CI runner watching `databricks jobs get-run` sees a non-zero terminal state. The exception body inlines the breakdown so the failure message itself round-trips through the API without depending on notebook stdout (which `get-run-output` does not surface for failed runs). End-to-end verification on the sandbox correctly surfaced two real findings: `billing_user_cost_daily` and `billing_product_daily` are empty post-sync despite their `billing_user_endpoint_daily` and `billing_token_daily` source tables being populated. Tracked as a separate ticket; not the bug this PR fixes. Also adds: - `docs/decisions/2026-06-06-cicd-deployment-pattern.md` — proposed GitHub Actions pipeline (sandbox → stage → prod) gated on the smoke check, with environment matrix and SP grant requirements. - `.github/workflows/deploy-sandbox.yml` — concrete starting point for the sandbox path: bundle deploy, app deploy, run-now, poll the workflow, fail the CI job if `result_state != SUCCESS`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The _execute_sql helper in 09_discover_billing.py returned [] on any non-SUCCEEDED state and only printed the error. When <admin-profile> lacked SELECT on system.billing.list_prices, the three queries that JOIN list_prices (serving / product / user_cost) silently produced 0 rows; the task still reported result=SUCCESS and the cost-related dashboard sections rendered empty while token usage worked. Root cause was a missing schema-level grant — fixed in the workspace with GRANT SELECT ON SCHEMA system.billing. This commit makes the code fail loud instead of silently zero on the next time it happens: the helper now raises RuntimeError carrying the SQLSTATE and message. Verified: post-grant + post-fix discovery run produced 15,040 serving / 9,806 product / 4,472 user_cost rows, all syncing 1:1 to Lakebase. Captured as item 5 in the deployment findings RCA. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…rrors The Tools page (Overview / MCP Servers / UC Functions / Usage) rendered empty in the deployed app with /api/v1/tools/overview returning 500 "relation tool_registry does not exist". Three layered failures masked each other: 1. tool_registry was never created. App startup runs ensure_tools_tables in a daemon thread under t.join(timeout=120); if the thread crashed, timed out, or hit Lakebase auth pressure, the table never landed and the app booted regardless. 2. ensure_tools_tables swallowed real DDL failures with logger.warning, indistinguishable from "already exists" cases. 3. refresh_tools wrapped its entire body in try/except and POST /tools/sync returned 200 unconditionally, hiding the broken state from operators. Same pattern applied to request_logs (created lazily by the audit middleware, also failed silently if app SP lacked DDL). Fix: - workflows/02_sync_to_lakebase Phase 7 now creates tool_registry and request_logs from the workflow run-as identity (databricks_superuser), matching how every other Lakebase table in this stack is provisioned. The app no longer needs DDL privileges to function. - tools_service.refresh_tools self-heals by calling ensure_tools_tables first, and switches catch-all to logger.exception for full tracebacks. - workflows/10_smoke_check_lakebase asserts tool_registry and request_logs in the EXPECTED bucket — missing table fails the workflow, 0 rows is a WARN (legitimate before user activity). Also adds RCA item 6 to docs/rca/2026-06-06-deploy-non-trivial-fixes.md. Verified end-to-end on sandbox (run <run-id>): overview now returns {total_tools:3, mcp_servers:3, uc_functions:0, managed_count:3}; the three system-managed MCP connections render in the UI. Empty UC Functions / Usage tabs are legitimate (no agent UC functions or trace tool spans yet). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…verwrite job params While verifying the item-6 fix end-to-end, a `databricks bundle deploy --target dev` without explicit --var flags overwrote the running job's working catalog/warehouse_id parameters with the literal placeholder strings (`<your-catalog>`, `<your-warehouse-id>`) defined as defaults in workflows/databricks.yml's dev target. Every subsequent task failed with a SQL parse error. Captured the footgun, the recovery commands (the full --var= invocation that restores the working job), and proposed durable fixes (sentinel defaults that fail loud, or a gitignored .databricks-bundle.<target>.local.yml overlay). Verified clean re-run on sandbox: 11/11 tasks SUCCESS, smoke check passed 12/12 REQUIRED tables OK, tool_registry and request_logs present and asserted in the EXPECTED bucket. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

bkaankuguoglu · 2026-06-07T14:55:27Z

Thijs — first off, thank you. This is an exceptionally well-diagnosed PR: clear root-cause writeups, repro detail, and it's obvious you actually read the codebase (citing the SAVEPOINT-per-DDL pattern in the observability section was spot on). We reproduced and confirmed all three bugs against main:

Gateway DDL transaction abort (02 Phase 6) — confirmed. The block runs in one non-autocommit transaction with ALTER gateway_usage_hourly before its CREATE, and our own observability / gateway-column sections already use savepoints to avoid exactly this. Real inconsistency, good catch.
09 _execute_sql swallowing INSUFFICIENT_PERMISSIONS — confirmed; cost dashboards can render empty while the task reports SUCCESS.
Tools tables created lazily in the app's daemon thread — confirmed; moving DDL into the workflow is the right direction (it matches our "keep the app a thin reader; do the work in the workflows" principle).

We'd love to take these — with a few adjustments so the solution stays generic and doesn't encode one customer's context. That's the main thing we want to preserve in this shared/public solutions repo, so a couple of asks and a roadmap heads-up:

1. Keep it customer-agnostic. Could you drop the customer-specific pieces — the named deploy action, the environment-specific secret names, and the rollout-specific ADR/RCA framing? The patterns (a build gate + a post-deploy data-plane smoke check) are great and we want them; we'd just want them as a generic template any adopter can wire to their own workspace, rather than a named pipeline. Happy to take a generic CI/CD workflow as its own PR.

2. Heads-up on the sync layer (affects the gateway + tools DDL fixes). We're about to introduce automatic Lakebase sync to replace the manual 02_sync_to_lakebase step entirely — managed Delta→Lakebase syncing, for simplicity and performance (no hand-rolled psycopg2 TRUNCATE/INSERT/DDL, and the whole transaction-management bug class goes away). Because that file is on its way out, we'd prefer to land your DDL fixes as a minimal stopgap (just the reorder + per-statement savepoint, mirroring the existing observability block) rather than a larger restructure of 02 — so we don't invest heavily in code we're retiring. The tools-table DDL lands the same way for now, then moves to the managed sync.

3. On the billing fix (_execute_sql raising). We'd resolve this slightly differently: raise on INSUFFICIENT_PERMISSIONS/misconfig, but keep degrading gracefully when a system table is genuinely unavailable. The same helper shape elsewhere (e.g. 11_discover_ai_gateway_usage) intentionally returns [] so account-scoped discovery doesn't hard-fail in workspaces that legitimately can't see a given system table — that resilience is load-bearing for the cross-workspace story. A raise scoped to permission errors gets you the loud failure you want without breaking that.

4. Smoke check — yes please, this is a genuinely useful gate (it'd have caught more than just this bug). We'd just generalize the asserted table set so it isn't tied to one deployment.

5. Packaging — would you mind splitting this into (a) the three bug fixes + smoke check, and (b) the generic CI/CD? (a) can merge quickly; (b) can bake separately.

On contributing: we really appreciate this, and we're open to your contributions and PRs going forward — please keep them coming. Let's get (a) landed first; as the collaboration develops we're happy to look at smoother access for follow-ups down the line. Thanks again for digging in and writing it up so thoroughly.

bkaankuguoglu · 2026-06-09T17:02:20Z

Thanks for this @thijs-hakkenberg — thorough write-up and the smoke check already earned its keep by surfacing the billing perm gap.

To make review tractable, we split it into two focused PRs (rebased onto current main, so #17's workload-classification columns are preserved — your branch predated them):

fix(workflows): gateway DDL + billing perm-error + tools tables + Lakebase smoke check #20 — the three bug fixes (gateway DDL transaction, billing silent perm-error, tools-table DDL → workflow) + the Lakebase smoke check + the SP-grant fallback helper + the RCA log. Your original commits are preserved with you as author.
ci: sandbox deploy pipeline + CI/CD deployment ADR #21 — the generic CI/CD pipeline (deploy-sandbox.yml + the ADR), to merge after fix(workflows): gateway DDL + billing perm-error + tools tables + Lakebase smoke check #20 so the smoke-check task it gates on exists on main.

We're picking up #20 first. This PR can stay open as the source of record until #20/#21 land, then we'll close it. On the outside-contributor ask — flagging that with the maintainers separately.

…ebase smoke check (#20) * fix(workflows): create gateway_usage tables in isolated transactions Phase 6 of `02_sync_to_lakebase.py` ran all gateway DDLs in a single transaction, with the ALTER for `gateway_usage_hourly` issued *before* its CREATE TABLE. The ALTER raises `UndefinedTable`; psycopg2 marks the whole transaction aborted, and the per-statement try/except swallows the warning while the trailing CREATE silently no-ops. Result on a fresh database: the table never gets created, the discovery job reports SUCCESS, and the deployed app's Gateway page logs a stream of `UndefinedTable: relation "gateway_usage_daily" does not exist`. Reorder DDLs so all CREATE TABLEs run before any CREATE INDEX/ALTER, and run each DDL in its own commit/rollback scope so a single failure cannot poison sibling DDLs. Verified on the Ecolab sandbox: post-fix discovery run materialises both `gateway_usage_daily` (21k rows) and `gateway_usage_hourly` (2.6k rows) in Lakebase, app errors stop. Also adds in-workspace SP-grant fallback (helper notebook + shell runner) for environments where the workspace blocks public Lakebase access from the laptop, and a docs/rca/ entry capturing this and the other deployment-blocking issues we hit during the Ecolab sandbox onboarding. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(workflows): add Lakebase smoke check task A new `smoke_check_lakebase` workflow task asserts that every Lakebase table the dashboard reads from exists with the expected row content. REQUIRED tables must exist with >=1 row; EXPECTED must exist (0 rows allowed, logged WARN); OPTIONAL (app-managed) existence is not asserted. Failure raises directly (not via dbutils.notebook.exit) so the workflow result_state flips to FAILED and CI can gate on it. Wired as `smoke_check_lakebase` in databricks.yml, depending on `sync_to_lakebase`. Split out from #18; CI/CD deploy pattern is in a separate PR. Co-authored-by: Isaac * fix(workflows): surface SQL permission errors in discover_billing The _execute_sql helper in 09_discover_billing.py returned [] on any non-SUCCEEDED state and only printed the error. When a-hakketh lacked SELECT on system.billing.list_prices, the three queries that JOIN list_prices (serving / product / user_cost) silently produced 0 rows; the task still reported result=SUCCESS and the cost-related dashboard sections rendered empty while token usage worked. Root cause was a missing schema-level grant — fixed in the workspace with GRANT SELECT ON SCHEMA system.billing. This commit makes the code fail loud instead of silently zero on the next time it happens: the helper now raises RuntimeError carrying the SQLSTATE and message. Verified: post-grant + post-fix discovery run produced 15,040 serving / 9,806 product / 4,472 user_cost rows, all syncing 1:1 to Lakebase. Captured as item 5 in the deployment findings RCA. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(tools): create app-managed tables in workflow + stop swallowing errors The Tools page (Overview / MCP Servers / UC Functions / Usage) rendered empty in the deployed app with /api/v1/tools/overview returning 500 "relation tool_registry does not exist". Three layered failures masked each other: 1. tool_registry was never created. App startup runs ensure_tools_tables in a daemon thread under t.join(timeout=120); if the thread crashed, timed out, or hit Lakebase auth pressure, the table never landed and the app booted regardless. 2. ensure_tools_tables swallowed real DDL failures with logger.warning, indistinguishable from "already exists" cases. 3. refresh_tools wrapped its entire body in try/except and POST /tools/sync returned 200 unconditionally, hiding the broken state from operators. Same pattern applied to request_logs (created lazily by the audit middleware, also failed silently if app SP lacked DDL). Fix: - workflows/02_sync_to_lakebase Phase 7 now creates tool_registry and request_logs from the workflow run-as identity (databricks_superuser), matching how every other Lakebase table in this stack is provisioned. The app no longer needs DDL privileges to function. - tools_service.refresh_tools self-heals by calling ensure_tools_tables first, and switches catch-all to logger.exception for full tracebacks. - workflows/10_smoke_check_lakebase asserts tool_registry and request_logs in the EXPECTED bucket — missing table fails the workflow, 0 rows is a WARN (legitimate before user activity). Also adds RCA item 6 to docs/rca/2026-06-06-deploy-non-trivial-fixes.md. Verified end-to-end on Ecolab sandbox (run 881046971716582): overview now returns {total_tools:3, mcp_servers:3, uc_functions:0, managed_count:3}; the three system-managed MCP connections render in the UI. Empty UC Functions / Usage tabs are legitimate (no agent UC functions or trace tool spans yet). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(rca): add item 7 — bundle deploy placeholder defaults silently overwrite job params While verifying the item-6 fix end-to-end, a `databricks bundle deploy --target dev` without explicit --var flags overwrote the running job's working catalog/warehouse_id parameters with the literal placeholder strings (`<your-catalog>`, `<your-warehouse-id>`) defined as defaults in workflows/databricks.yml's dev target. Every subsequent task failed with a SQL parse error. Captured the footgun, the recovery commands (the full --var= invocation that restores the working job), and proposed durable fixes (sentinel defaults that fail loud, or a gitignored .databricks-bundle.<target>.local.yml overlay). Verified clean re-run on Ecolab sandbox: 11/11 tasks SUCCESS, smoke check passed 12/12 REQUIRED tables OK, tool_registry and request_logs present and asserted in the EXPECTED bucket. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(installation): document system.billing / system.serving grants (Step 8) The billing fail-loud fix in this PR turns a missing SELECT on system.billing.list_prices into a hard discovery-task failure (SQLSTATE 42501) instead of silent 0-row cost dashboards. Document the required schema-level grants for the workflow run-as identity in Step 8 (grant the whole system.billing / system.serving schemas so a future table addition can't reintroduce the gap), and make the step non-optional. Closes the installation docs TODO tracked in the deploy RCA for item 5. Co-authored-by: Isaac * docs(rca): soften CI/CD ADR link to plain text The CI/CD deployment-pattern ADR ships in the separate CI/CD PR, so a relative markdown link to it is dead on main until that PR merges. Reference the path as plain text instead, leaving #20 with no cross-PR loose ends. Co-authored-by: Isaac --------- Co-authored-by: Hakkenberg <thijs.hakkenberg@ecolab.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…t, CI Replaces every reference to a specific deploying organization, user, workspace ID, account ID, warehouse ID, Lakebase DNS, app host, and service principal client ID with generic placeholders (`<your-...>` / `<workspace-id>` / etc.). Files touched: - docs/rca/2026-06-06-deploy-non-trivial-fixes.md - docs/decisions/2026-06-06-cicd-deployment-pattern.md - control-plane-app/deploy.sh — comment generalised - control-plane-app/backend/services/tools_service.py — code comment generalised - .github/workflows/deploy-sandbox.yml — workflow display name generalised - .gitignore — exclude editor/scratch files The information itself was never sensitive (the workspace and warehouse IDs appear in URLs visible to anyone with workspace access), but the document narrative shouldn't pin to one customer. Future onboardings can fork this doc and substitute their own values. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

thijshakkenbergecolab and others added 2 commits June 6, 2026 14:59

thijs-hakkenberg changed the title ~~fix(workflows): create gateway_usage tables in isolated transactions~~ fix(workflows): gateway DDL transaction bug + Lakebase smoke check + CI/CD plan Jun 6, 2026

thijs-hakkenberg changed the title ~~fix(workflows): gateway DDL transaction bug + Lakebase smoke check + CI/CD plan~~ fix(workflows): gateway DDL + billing silent-perm-error + smoke check + CI/CD plan Jun 6, 2026

thijs-hakkenberg changed the title ~~fix(workflows): gateway DDL + billing silent-perm-error + smoke check + CI/CD plan~~ fix(workflows): gateway DDL + billing silent-perm-error + smoke check + tools tables + CI/CD plan Jun 6, 2026

This was referenced Jun 9, 2026

fix(workflows): gateway DDL + billing perm-error + tools tables + Lakebase smoke check #20

Merged

ci: sandbox deploy pipeline + CI/CD deployment ADR #21

Open

thijs-hakkenberg changed the title ~~fix(workflows): gateway DDL + billing silent-perm-error + smoke check + tools tables + CI/CD plan~~ fix(workflows): gateway DDL + billing perm-error + smoke check + tools tables + CI/CD plan Jun 10, 2026

thijs-hakkenberg force-pushed the fix/sync-lakebase-gateway-ddl-transaction branch 2 times, most recently from eed8161 to 26a016c Compare June 10, 2026 19:35

bkaankuguoglu closed this Jun 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(workflows): gateway DDL + billing perm-error + smoke check + tools tables + CI/CD plan#18

fix(workflows): gateway DDL + billing perm-error + smoke check + tools tables + CI/CD plan#18
thijs-hakkenberg wants to merge 6 commits into
databricks-solutions:mainfrom
thijs-hakkenberg:fix/sync-lakebase-gateway-ddl-transaction

thijs-hakkenberg commented Jun 6, 2026 •

edited

Loading

Uh oh!

bkaankuguoglu commented Jun 7, 2026

Uh oh!

bkaankuguoglu commented Jun 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

thijs-hakkenberg commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why I'm opening from a fork

Commit 1: gateway DDL transaction fix

What was wrong

Fix

Verified end-to-end

Commit 2: Lakebase smoke check + CI/CD plan

What this catches

What's in workflows/10_smoke_check_lakebase.py

Subtleties this had to solve

What it found end-to-end

Commit 3: surface SQL permission errors in discover_billing

What was wrong

Fix

Verified end-to-end

CI/CD pattern

Commit 4: app-managed table DDL belongs in the workflow, not the app

What was wrong

Fix (three parts)

Verified end-to-end

Bonus: in-workspace SP-grant fallback (also in commit 1)

Bonus: deployment-blocking issues log

Test plan

Uh oh!

bkaankuguoglu commented Jun 7, 2026

Uh oh!

bkaankuguoglu commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

thijs-hakkenberg commented Jun 6, 2026 •

edited

Loading

What's in `workflows/10_smoke_check_lakebase.py`

Commit 3: surface SQL permission errors in `discover_billing`

bkaankuguoglu commented Jun 9, 2026 •

edited

Loading