fix(workflows): gateway DDL + billing perm-error + smoke check + tools tables + CI/CD plan #18
Phase 6 of `02_sync_to_lakebase.py` ran all gateway DDLs in a single transaction, with the ALTER for `gateway_usage_hourly` issued *before* its CREATE TABLE. The ALTER raises `UndefinedTable`; psycopg2 marks the whole transaction aborted, and the per-statement try/except swallows the warning while the trailing CREATE silently no-ops. Result on a fresh database: the table never gets created, the discovery job reports SUCCESS, and the deployed app's Gateway page logs a stream of `UndefinedTable: relation "gateway_usage_daily" does not exist`. Reorder DDLs so all CREATE TABLEs run before any CREATE INDEX/ALTER, and run each DDL in its own commit/rollback scope so a single failure cannot poison sibling DDLs. Verified on the sandbox: post-fix discovery run materialises both `gateway_usage_daily` (21k rows) and `gateway_usage_hourly` (2.6k rows) in Lakebase, app errors stop. Also adds in-workspace SP-grant fallback (helper notebook + shell runner) for environments where the workspace blocks public Lakebase access from the laptop, and a docs/rca/ entry capturing this and the other deployment-blocking issues we hit during the sandbox onboarding. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
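The reorder-and-isolate approach described in this commit can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `order_ddls` and `run_ddls_isolated` are hypothetical names, the column names are made up, and `sqlite3` stands in for psycopg2 only so the sketch runs without a database server (both expose the same DB-API `cursor()` / `commit()` / `rollback()` calls).

```python
import sqlite3


def order_ddls(ddls):
    """Return CREATE TABLE statements first, then everything else, order preserved."""
    def is_create(ddl):
        return ddl.lstrip().upper().startswith("CREATE TABLE")
    return [d for d in ddls if is_create(d)] + [d for d in ddls if not is_create(d)]


def run_ddls_isolated(conn, ddls):
    """Run each DDL in its own commit/rollback scope and return the failures."""
    failures = []
    for ddl in order_ddls(ddls):
        cur = conn.cursor()
        try:
            cur.execute(ddl)
            conn.commit()      # persist this statement on its own
        except Exception as exc:
            conn.rollback()    # clear the failed transaction before the next DDL
            failures.append((ddl, str(exc)))
        finally:
            cur.close()
    return failures


# Stand-in connection; column names below are illustrative only.
conn = sqlite3.connect(":memory:")
ddls = [
    "ALTER TABLE gateway_usage_hourly ADD COLUMN request_count INTEGER",
    "ALTER TABLE no_such_table ADD COLUMN x INTEGER",  # fails on purpose
    "CREATE TABLE IF NOT EXISTS gateway_usage_daily (usage_day TEXT)",
    "CREATE TABLE IF NOT EXISTS gateway_usage_hourly (usage_hour TEXT)",
]
failures = run_ddls_isolated(conn, ddls)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
print(failures)
```

The one deliberately broken `ALTER` is recorded as a failure while both gateway tables are still created, which is the property the original single-transaction version lacked.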
The sync bug fixed in the previous commit (silent gateway DDL skip) shipped to production despite passing every existing automated check: the discovery job reported SUCCESS, the app start-up logs were green, and only the Gateway page rendering empty in the browser exposed the problem. We need a regression test that compares Lakebase reality to the dashboard's contract.
Adds `workflows/10_smoke_check_lakebase.py`, wired as `smoke_check_lakebase` task in `databricks.yml` depending on `sync_to_lakebase`. The notebook connects to Lakebase, enumerates `pg_tables`, and asserts every table the deployed app reads from (scraped from `FROM <table>` references in `control-plane-app/backend/services/`):
- `REQUIRED` (15 tables): must exist AND have ≥1 row.
- `EXPECTED` (11 tables): must exist; 0 rows logged as WARN.
- `OPTIONAL` (6 tables): app-managed, existence not asserted.
A `REQUIRED` failure raises (not `dbutils.notebook.exit`, which masks the failure as task SUCCESS) so the workflow's `result_state` flips to `FAILED` and any CI runner watching `databricks jobs get-run` sees a non-zero terminal state. The exception body inlines the breakdown so the failure message itself round-trips through the API without depending on notebook stdout (which `get-run-output` does not surface for failed runs).
End-to-end verification on the sandbox correctly surfaced two real findings: `billing_user_cost_daily` and `billing_product_daily` are empty post-sync despite their `billing_user_endpoint_daily` and `billing_token_daily` source tables being populated. Tracked as a separate ticket; not the bug this PR fixes.
Also adds:
- `docs/decisions/2026-06-06-cicd-deployment-pattern.md`: proposed GitHub Actions pipeline (sandbox → stage → prod) gated on the smoke check, with environment matrix and SP grant requirements.
- `.github/workflows/deploy-sandbox.yml`: concrete starting point for the sandbox path: bundle deploy, app deploy, run-now, poll the workflow, fail the CI job if `result_state != SUCCESS`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
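The bucket logic and the "raise with the breakdown inlined" behaviour described in this commit could look roughly like the sketch below. It is an assumption-laden illustration, not the notebook itself: `smoke_check` is a hypothetical name, the row counts are made up, and the real notebook reads table existence and counts from Lakebase rather than from a dict.

```python
def smoke_check(row_counts, required, expected):
    """Check table buckets against observed row counts.

    row_counts maps table name -> row count; a missing key means the table
    does not exist. Raises RuntimeError with the per-table breakdown inlined.
    """
    failures, warnings = [], []
    for table in required:
        if table not in row_counts:
            failures.append(f"MISSING required table: {table}")
        elif row_counts[table] == 0:
            failures.append(f"EMPTY required table: {table}")
    for table in expected:
        if table not in row_counts:
            failures.append(f"MISSING expected table: {table}")
        elif row_counts[table] == 0:
            warnings.append(f"WARN empty expected table: {table}")
    for line in warnings:
        print(line)
    if failures:
        # Raising (instead of exiting the notebook) is what fails the task.
        raise RuntimeError("Lakebase smoke check failed:\n" + "\n".join(failures))
    return warnings


# Illustrative counts only; request_logs is deliberately absent.
counts = {"gateway_usage_daily": 21000, "billing_user_cost_daily": 0, "tool_registry": 0}
try:
    smoke_check(
        counts,
        required=["gateway_usage_daily", "billing_user_cost_daily"],
        expected=["tool_registry", "request_logs"],
    )
    message = ""
except RuntimeError as exc:
    message = str(exc)
print(message)
```

Putting the breakdown in the exception text means a caller that only sees the run's error field still gets the list of failing tables.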
The _execute_sql helper in 09_discover_billing.py returned [] on any non-SUCCEEDED state and only printed the error. When <admin-profile> lacked SELECT on system.billing.list_prices, the three queries that JOIN list_prices (serving / product / user_cost) silently produced 0 rows; the task still reported result=SUCCESS and the cost-related dashboard sections rendered empty while token usage worked. Root cause was a missing schema-level grant — fixed in the workspace with GRANT SELECT ON SCHEMA system.billing. This commit makes the code fail loud instead of silently zero on the next time it happens: the helper now raises RuntimeError carrying the SQLSTATE and message. Verified: post-grant + post-fix discovery run produced 15,040 serving / 9,806 product / 4,472 user_cost rows, all syncing 1:1 to Lakebase. Captured as item 5 in the deployment findings RCA. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
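A fail-loud result handler along the lines described here might look like the following sketch. The function name `rows_or_raise` is hypothetical, the response dicts are hand-written samples shaped like a SQL statement-execution response (`status.state`, `status.error`, `result.data_array`), and the error text is illustrative rather than a captured log.

```python
def rows_or_raise(response):
    """Return result rows, or raise if the statement did not succeed."""
    status = response.get("status", {})
    state = status.get("state")
    if state != "SUCCEEDED":
        error = status.get("error") or {}
        code = error.get("error_code", "UNKNOWN")
        detail = error.get("message", "no error message returned")
        # Raising here fails the task instead of handing back an empty list.
        raise RuntimeError(f"SQL statement ended in state {state} [{code}]: {detail}")
    return response.get("result", {}).get("data_array", [])


# Hand-written sample payloads (assumptions, not real API output).
ok_response = {"status": {"state": "SUCCEEDED"}, "result": {"data_array": [["serving", "1"]]}}
denied_response = {
    "status": {
        "state": "FAILED",
        "error": {
            "error_code": "INSUFFICIENT_PERMISSIONS",
            "message": "no SELECT on system.billing.list_prices. SQLSTATE: 42501",
        },
    }
}

rows = rows_or_raise(ok_response)
try:
    rows_or_raise(denied_response)
    error_text = ""
except RuntimeError as exc:
    error_text = str(exc)
print(rows)
print(error_text)
```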
…rrors
The Tools page (Overview / MCP Servers / UC Functions / Usage) rendered
empty in the deployed app with /api/v1/tools/overview returning 500
"relation tool_registry does not exist". Three layered failures masked
each other:
1. tool_registry was never created. App startup runs ensure_tools_tables
in a daemon thread under t.join(timeout=120); if the thread crashed,
timed out, or hit Lakebase auth pressure, the table never landed and
the app booted regardless.
2. ensure_tools_tables swallowed real DDL failures with logger.warning,
indistinguishable from "already exists" cases.
3. refresh_tools wrapped its entire body in try/except and POST /tools/sync
returned 200 unconditionally, hiding the broken state from operators.
Same pattern applied to request_logs (created lazily by the audit
middleware, also failed silently if app SP lacked DDL).
Fix:
- workflows/02_sync_to_lakebase Phase 7 now creates tool_registry and
request_logs from the workflow run-as identity (databricks_superuser),
matching how every other Lakebase table in this stack is provisioned.
The app no longer needs DDL privileges to function.
- tools_service.refresh_tools self-heals by calling ensure_tools_tables
first, and switches catch-all to logger.exception for full tracebacks.
- workflows/10_smoke_check_lakebase asserts tool_registry and
request_logs in the EXPECTED bucket — missing table fails the
workflow, 0 rows is a WARN (legitimate before user activity).
Also adds RCA item 6 to docs/rca/2026-06-06-deploy-non-trivial-fixes.md.
Verified end-to-end on sandbox (run <run-id>):
overview now returns {total_tools:3, mcp_servers:3, uc_functions:0,
managed_count:3}; the three system-managed MCP connections render in
the UI. Empty UC Functions / Usage tabs are legitimate (no agent UC
functions or trace tool spans yet).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
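The self-heal plus full-traceback behaviour described for `refresh_tools` can be sketched as below. This is a simplified stand-in, not the service code: the two callables are injected so the example runs on its own, and mapping a failed refresh to a 500 status is an assumption about how a caller might use the result, not something the commit states.

```python
import logging

logger = logging.getLogger("tools_service_sketch")


def refresh_tools(ensure_tables, do_refresh):
    """Ensure the tables exist, then refresh; log full tracebacks on failure."""
    try:
        ensure_tables()   # self-heal: create missing tables before touching them
        do_refresh()
        return True
    except Exception:
        # logger.exception records the traceback, unlike logger.warning("%s", exc)
        logger.exception("Tools refresh failed")
        return False


calls = []


def fake_ensure():
    calls.append("ensure")


def failing_refresh():
    calls.append("refresh")
    raise RuntimeError("simulated refresh failure")


ok = refresh_tools(fake_ensure, failing_refresh)
# Assumption: a caller could surface the failure instead of always answering 200.
status_code = 200 if ok else 500
print(calls, status_code)
```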
…verwrite job params
While verifying the item-6 fix end-to-end, a `databricks bundle deploy --target dev` without explicit --var flags overwrote the running job's working catalog/warehouse_id parameters with the literal placeholder strings (`<your-catalog>`, `<your-warehouse-id>`) defined as defaults in workflows/databricks.yml's dev target. Every subsequent task failed with a SQL parse error. Captured the footgun, the recovery commands (the full --var= invocation that restores the working job), and proposed durable fixes (sentinel defaults that fail loud, or a gitignored .databricks-bundle.<target>.local.yml overlay). Verified clean re-run on sandbox: 11/11 tasks SUCCESS, smoke check passed 12/12 REQUIRED tables OK, tool_registry and request_logs present and asserted in the EXPECTED bucket.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
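One way to make placeholder defaults fail loud, as proposed above, is a guard at the top of each task that rejects values still shaped like `<your-...>`. The sketch below is a hypothetical illustration: `require_real_value` and the sample warehouse value are not from the repo.

```python
import re

# Matches values that are still a template placeholder such as "<your-catalog>".
PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def require_real_value(name, value):
    """Return the stripped value, or raise if it is empty or an unfilled placeholder."""
    cleaned = (value or "").strip()
    if not cleaned or PLACEHOLDER.match(cleaned):
        raise ValueError(
            f"Job parameter '{name}' is unset or still a placeholder: {value!r}. "
            "Pass it explicitly with --var at deploy time."
        )
    return cleaned


warehouse_id = require_real_value("warehouse_id", " abc123 ")  # made-up value
try:
    require_real_value("catalog", "<your-catalog>")
    catalog_error = ""
except ValueError as exc:
    catalog_error = str(exc)
print(warehouse_id)
print(catalog_error)
```

Failing at parameter-read time replaces the downstream SQL parse error with a message that names the parameter.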
Thijs — first off, thank you. This is an exceptionally well-diagnosed PR: clear root-cause writeups, repro detail, and it's obvious you actually read the codebase (citing the
We'd love to take these, with a few adjustments so the solution stays generic and doesn't encode one customer's context. That's the main thing we want to preserve in this shared/public solutions repo, so a couple of asks and a roadmap heads-up:
1. Keep it customer-agnostic. Could you drop the customer-specific pieces: the named deploy action, the environment-specific secret names, and the rollout-specific ADR/RCA framing? The patterns (a build gate + a post-deploy data-plane smoke check) are great and we want them; we'd just want them as a generic template any adopter can wire to their own workspace, rather than a named pipeline. Happy to take a generic CI/CD workflow as its own PR.
2. Heads-up on the sync layer (affects the gateway + tools DDL fixes). We're about to introduce automatic Lakebase sync to replace the manual …
3. On the billing fix ( …
4. Smoke check: yes please, this is a genuinely useful gate (it'd have caught more than just this bug). We'd just generalize the asserted table set so it isn't tied to one deployment.
5. Packaging: would you mind splitting this into (a) the three bug fixes + smoke check, and (b) the generic CI/CD? (a) can merge quickly; (b) can bake separately.
On contributing: we really appreciate this, and we're open to your contributions and PRs going forward. Please keep them coming. Let's get (a) landed first; as the collaboration develops we're happy to look at smoother access for follow-ups down the line. Thanks again for digging in and writing it up so thoroughly.
Thanks for this @thijs-hakkenberg — thorough write-up and the smoke check already earned its keep by surfacing the billing perm gap. To make review tractable, we split it into two focused PRs (rebased onto current
We're picking up #20 first. This PR can stay open as the source of record until #20/#21 land, then we'll close it. On the outside-contributor ask — flagging that with the maintainers separately. |
…ebase smoke check (#20) * fix(workflows): create gateway_usage tables in isolated transactions Phase 6 of `02_sync_to_lakebase.py` ran all gateway DDLs in a single transaction, with the ALTER for `gateway_usage_hourly` issued *before* its CREATE TABLE. The ALTER raises `UndefinedTable`; psycopg2 marks the whole transaction aborted, and the per-statement try/except swallows the warning while the trailing CREATE silently no-ops. Result on a fresh database: the table never gets created, the discovery job reports SUCCESS, and the deployed app's Gateway page logs a stream of `UndefinedTable: relation "gateway_usage_daily" does not exist`. Reorder DDLs so all CREATE TABLEs run before any CREATE INDEX/ALTER, and run each DDL in its own commit/rollback scope so a single failure cannot poison sibling DDLs. Verified on the Ecolab sandbox: post-fix discovery run materialises both `gateway_usage_daily` (21k rows) and `gateway_usage_hourly` (2.6k rows) in Lakebase, app errors stop. Also adds in-workspace SP-grant fallback (helper notebook + shell runner) for environments where the workspace blocks public Lakebase access from the laptop, and a docs/rca/ entry capturing this and the other deployment-blocking issues we hit during the Ecolab sandbox onboarding. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(workflows): add Lakebase smoke check task A new `smoke_check_lakebase` workflow task asserts that every Lakebase table the dashboard reads from exists with the expected row content. REQUIRED tables must exist with >=1 row; EXPECTED must exist (0 rows allowed, logged WARN); OPTIONAL (app-managed) existence is not asserted. Failure raises directly (not via dbutils.notebook.exit) so the workflow result_state flips to FAILED and CI can gate on it. Wired as `smoke_check_lakebase` in databricks.yml, depending on `sync_to_lakebase`. Split out from #18; CI/CD deploy pattern is in a separate PR. 
Co-authored-by: Isaac * fix(workflows): surface SQL permission errors in discover_billing The _execute_sql helper in 09_discover_billing.py returned [] on any non-SUCCEEDED state and only printed the error. When a-hakketh lacked SELECT on system.billing.list_prices, the three queries that JOIN list_prices (serving / product / user_cost) silently produced 0 rows; the task still reported result=SUCCESS and the cost-related dashboard sections rendered empty while token usage worked. Root cause was a missing schema-level grant — fixed in the workspace with GRANT SELECT ON SCHEMA system.billing. This commit makes the code fail loud instead of silently zero on the next time it happens: the helper now raises RuntimeError carrying the SQLSTATE and message. Verified: post-grant + post-fix discovery run produced 15,040 serving / 9,806 product / 4,472 user_cost rows, all syncing 1:1 to Lakebase. Captured as item 5 in the deployment findings RCA. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(tools): create app-managed tables in workflow + stop swallowing errors The Tools page (Overview / MCP Servers / UC Functions / Usage) rendered empty in the deployed app with /api/v1/tools/overview returning 500 "relation tool_registry does not exist". Three layered failures masked each other: 1. tool_registry was never created. App startup runs ensure_tools_tables in a daemon thread under t.join(timeout=120); if the thread crashed, timed out, or hit Lakebase auth pressure, the table never landed and the app booted regardless. 2. ensure_tools_tables swallowed real DDL failures with logger.warning, indistinguishable from "already exists" cases. 3. refresh_tools wrapped its entire body in try/except and POST /tools/sync returned 200 unconditionally, hiding the broken state from operators. Same pattern applied to request_logs (created lazily by the audit middleware, also failed silently if app SP lacked DDL). 
Fix: - workflows/02_sync_to_lakebase Phase 7 now creates tool_registry and request_logs from the workflow run-as identity (databricks_superuser), matching how every other Lakebase table in this stack is provisioned. The app no longer needs DDL privileges to function. - tools_service.refresh_tools self-heals by calling ensure_tools_tables first, and switches catch-all to logger.exception for full tracebacks. - workflows/10_smoke_check_lakebase asserts tool_registry and request_logs in the EXPECTED bucket — missing table fails the workflow, 0 rows is a WARN (legitimate before user activity). Also adds RCA item 6 to docs/rca/2026-06-06-deploy-non-trivial-fixes.md. Verified end-to-end on Ecolab sandbox (run 881046971716582): overview now returns {total_tools:3, mcp_servers:3, uc_functions:0, managed_count:3}; the three system-managed MCP connections render in the UI. Empty UC Functions / Usage tabs are legitimate (no agent UC functions or trace tool spans yet). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(rca): add item 7 — bundle deploy placeholder defaults silently overwrite job params While verifying the item-6 fix end-to-end, a `databricks bundle deploy --target dev` without explicit --var flags overwrote the running job's working catalog/warehouse_id parameters with the literal placeholder strings (`<your-catalog>`, `<your-warehouse-id>`) defined as defaults in workflows/databricks.yml's dev target. Every subsequent task failed with a SQL parse error. Captured the footgun, the recovery commands (the full --var= invocation that restores the working job), and proposed durable fixes (sentinel defaults that fail loud, or a gitignored .databricks-bundle.<target>.local.yml overlay). Verified clean re-run on Ecolab sandbox: 11/11 tasks SUCCESS, smoke check passed 12/12 REQUIRED tables OK, tool_registry and request_logs present and asserted in the EXPECTED bucket. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(installation): document system.billing / system.serving grants (Step 8) The billing fail-loud fix in this PR turns a missing SELECT on system.billing.list_prices into a hard discovery-task failure (SQLSTATE 42501) instead of silent 0-row cost dashboards. Document the required schema-level grants for the workflow run-as identity in Step 8 (grant the whole system.billing / system.serving schemas so a future table addition can't reintroduce the gap), and make the step non-optional. Closes the installation docs TODO tracked in the deploy RCA for item 5. Co-authored-by: Isaac * docs(rca): soften CI/CD ADR link to plain text The CI/CD deployment-pattern ADR ships in the separate CI/CD PR, so a relative markdown link to it is dead on main until that PR merges. Reference the path as plain text instead, leaving #20 with no cross-PR loose ends. Co-authored-by: Isaac --------- Co-authored-by: Hakkenberg <thijs.hakkenberg@ecolab.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…t, CI
Replaces every reference to a specific deploying organization, user, workspace ID, account ID, warehouse ID, Lakebase DNS, app host, and service principal client ID with generic placeholders (`<your-...>` / `<workspace-id>` / etc.).
Files touched:
- docs/rca/2026-06-06-deploy-non-trivial-fixes.md
- docs/decisions/2026-06-06-cicd-deployment-pattern.md
- control-plane-app/deploy.sh: comment generalised
- control-plane-app/backend/services/tools_service.py: code comment generalised
- .github/workflows/deploy-sandbox.yml: workflow display name generalised
- .gitignore: exclude editor/scratch files
The information itself was never sensitive (the workspace and warehouse IDs appear in URLs visible to anyone with workspace access), but the document narrative shouldn't pin to one customer. Future onboardings can fork this doc and substitute their own values.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Four stacked changes from a customer onboarding:
1. `02_sync_to_lakebase` Phase 6 silently fails to create `gateway_usage_daily` / `gateway_usage_hourly` on a fresh DB while reporting `result=SUCCESS`. The deployed app then logs an unbounded stream of `psycopg2.errors.UndefinedTable` errors and renders the Gateway page empty. Root cause: psycopg2 transaction abort on an `ALTER TABLE` issued before the matching `CREATE TABLE`.
2. `smoke_check_lakebase` workflow task asserts that every Lakebase table the dashboard reads from exists with the expected row content. Failure flips the workflow's `result_state` to FAILED so CI can gate on it.
3. `_execute_sql` in `09_discover_billing.py` returned `[]` on any non-SUCCEEDED state, including `INSUFFICIENT_PERMISSIONS`. That masked a missing `SELECT ON system.billing.list_prices` grant for weeks: cost-related dashboards (Cost Overview, Endpoint Costs, All Products) rendered empty while token usage worked, and the discovery task still reported SUCCESS. Helper now `raise`s with the SQLSTATE + message so the same misconfiguration fails the discovery task immediately.
4. `tool_registry` and `request_logs` were created lazily by the app's startup hook in a 120 s daemon thread. When the app SP lacked Lakebase DDL or the daemon ran past timeout, the tables never landed; `/api/v1/tools/overview` returned 500, the Tools page rendered empty across all four tabs, and `/api/v1/tools/sync` lied with HTTP 200. Moved DDL into `02_sync_to_lakebase` Phase 7 (workflow run-as is `databricks_superuser`); also stopped `refresh_tools` from swallowing exception tracebacks.
Why I'm opening from a fork
I'm a Databricks customer deploying this repo into a sandbox workspace. I don't have push access to `databricks-solutions/agent-control-plane`, so this PR is from a personal fork. I'd love to be added as an outside contributor: there are already three follow-up items the smoke check surfaced that warrant separate PRs, and round-tripping each through cross-fork PR is high friction. Happy to provide a Databricks-side reference offline.
Commit 1: gateway DDL transaction fix
What was wrong
`workflows/02_sync_to_lakebase.py` Phase 6 issued the gateway DDLs in a single transaction, with the `ALTER` for `gateway_usage_hourly` ordered before its `CREATE TABLE`.
On a fresh database the `ALTER gateway_usage_hourly` raises `UndefinedTable`. PostgreSQL marks the entire transaction aborted on any error, so every subsequent `cur.execute(ddl)` in the same transaction is rejected (psycopg2 raises `InFailedSqlTransaction`) until a rollback. The per-statement `try/except` printed a warning and swallowed those errors but did not roll back, so the trailing `CREATE TABLE gateway_usage_hourly` never took effect. The trailing `gw_conn.commit()` still returned without error, because a commit on an aborted transaction ends it as a rollback, leaving the database without either table.
The observability section a few hundred lines up uses PostgreSQL `SAVEPOINT` per DDL and contains failures correctly. The gateway section did not.
Fix
- All `CREATE TABLE`s run before any `CREATE INDEX` / `ALTER TABLE`.
- Each DDL runs in its own `cursor()` context with its own `commit()`, and a failure does `rollback()` then continues. This isolates the "ALTER on a not-yet-existing table on first run" case from poisoning sibling DDLs.
Verified end-to-end
- `gateway_usage_daily` exists in Lakebase after the fix run
- `gateway_usage_hourly` exists in Lakebase after the fix run
- `gateway_usage_daily` rows: 21k post-fix
- `gateway_usage_hourly` rows: 2.6k post-fix
- `UndefinedTable` errors after fix run: app errors stop
- Discovery result: `SUCCESS` (silently broken) before, `SUCCESS` (correct) after
Commit 2: Lakebase smoke check + CI/CD plan
What this catches
The bug above passed every existing automated check. The discovery job reported SUCCESS, the app's startup logs were clean, and the Lakebase tables that were created had rows. The only failure signal lived in the deployed app's stderr — which nobody reads on a successful deploy. The smoke check is a regression test that compares Lakebase reality to the app's data contract.
What's in `workflows/10_smoke_check_lakebase.py`
The notebook connects to Lakebase, enumerates `pg_tables`, and asserts every table the deployed app reads from. Tables are sourced from the actual `FROM <table>` references in `control-plane-app/backend/services/` and bucketed:
- `REQUIRED` (15 tables): must exist and have ≥1 row.
- `EXPECTED` (11 tables): must exist; 0 rows is logged as WARN.
- `OPTIONAL` (6 tables): app-managed, existence not asserted.
Wired as `smoke_check_lakebase` task in `databricks.yml`, depending on `sync_to_lakebase`.
Subtleties this had to solve
- `dbutils.notebook.exit()` masks failures. Calling it before raising marks the task SUCCEEDED with the JSON as its return value, hiding the FAIL from the workflow-level `result_state`. The smoke notebook raises directly instead.
- `jobs/get-run-output` does not surface notebook stdout for failed runs. Only `error` and `error_trace` round-trip. The smoke notebook inlines the breakdown into the `RuntimeError` message itself so a CI runner reading `error` sees the per-table failure list directly.
What it found end-to-end
Beyond the gateway tables (now ✅), the first smoke run flagged two real new findings worth a separate ticket: `billing_user_cost_daily` and `billing_product_daily` were empty post-sync despite their `billing_user_endpoint_daily` and `billing_token_daily` source tables being populated.
Update: the two REQUIRED failures and `billing_serving_daily` are now root-caused and fixed in commit 3 (see below). The smoke check did its job: it surfaced a silent permission gap that had been quietly producing 0-row dashboards.
The remaining EXPECTED warnings (`gateway_inference_logs`, `vector_search_indexes`, `kb_billing_daily`) are legitimately empty for this workspace (no AI Gateway inference logging enabled, the one Vector Search endpoint has 0 indexes).
Commit 3: surface SQL permission errors in `discover_billing`
What was wrong
`workflows/09_discover_billing.py` issues five SQL statements via `_execute_sql`. Three of them (queries 1, 3, 5: `billing_serving_daily`, `billing_product_daily`, `billing_user_cost_daily`) `LEFT JOIN system.billing.list_prices` to compute cost in USD. On the sandbox, the deployer / workflow run-as identity had `SELECT ON system.billing.usage` but not on `system.billing.list_prices`. The JOIN failed with a permission error (`INSUFFICIENT_PERMISSIONS`, SQLSTATE 42501), but the helper returned `[]` on any non-SUCCEEDED state and only printed the error.
So all three queries silently returned `[]`. The notebook then overwrote Delta with empty DataFrames, sync-to-Lakebase truncated and re-inserted 0 rows, and the discovery task reported `result=SUCCESS`. Token-related dashboards worked because queries 2 and 4 hit `system.serving.endpoint_usage` only: no list_prices join, no permission gap. That's why "token usage shows but cost overview is empty" was the symptom.
Fix
`workflows/09_discover_billing.py:170`: `_execute_sql` now `raise`s a `RuntimeError` carrying the API error code and message when the statement does not succeed. Future permission gaps now fail the discovery task at the source with an actionable message, instead of producing a 0-row result that surfaces three steps downstream as "EMPTY required table" in the smoke check.
(The corresponding production fix is a one-line `GRANT SELECT ON SCHEMA system.billing TO <identity>`. That's an environmental config, not code, so it belongs in `docs/installation.md` rather than this PR.)
Verified end-to-end
- `billing_serving_daily` (Lakebase): 15,040 rows post-grant and post-fix
- `billing_product_daily` (Lakebase): 9,806 rows post-grant and post-fix
- `billing_user_cost_daily` (Lakebase): 4,472 rows post-grant and post-fix
- `billing_token_daily` (Lakebase)
- `result_state` on missing grant: the discovery task now fails instead of reporting `SUCCESS`
CI/CD pattern
`docs/decisions/2026-06-06-cicd-deployment-pattern.md` proposes a sandbox → stage → prod promotion pipeline driven by GitHub Actions, gated on the smoke check at every environment.
`.github/workflows/deploy-sandbox.yml` is the concrete starting point: bundle deploy, app deploy, `jobs run-now`, poll until terminal, fail the CI job if `result_state != SUCCESS`.
This is a starting point, not a finished pipeline. Stage and prod workflows are deferred until those workspaces are provisioned.
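The "poll until terminal, fail the CI job if `result_state != SUCCESS`" step can be sketched as a small loop. This is an illustrative outline only: `wait_for_run` is a hypothetical helper, `get_run` is an injected callable expected to return a dict shaped like a Jobs runs/get response (`state.life_cycle_state`, `state.result_state`), and the fake responses and run id are made up so the sketch runs without a workspace.

```python
import time

# Life-cycle states after which a run will not change any more.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def wait_for_run(get_run, run_id, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_run(run_id) until the run is terminal; return its result_state."""
    waited = 0
    while True:
        state = get_run(run_id).get("state", {})
        if state.get("life_cycle_state") in TERMINAL_STATES:
            return state.get("result_state")
        if waited >= timeout_seconds:
            raise TimeoutError(f"run {run_id} not terminal after {timeout_seconds}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Fake API responses standing in for a real client call.
responses = iter([
    {"state": {"life_cycle_state": "PENDING"}},
    {"state": {"life_cycle_state": "RUNNING"}},
    {"state": {"life_cycle_state": "TERMINATED", "result_state": "FAILED"}},
])


def fake_get_run(run_id):
    return next(responses)


result = wait_for_run(fake_get_run, run_id=123, sleep=lambda seconds: None)
exit_code = 0 if result == "SUCCESS" else 1  # what a CI step would exit with
print(result, exit_code)
```

Treating anything other than `SUCCESS` as a non-zero exit is what lets the smoke check gate the pipeline.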
Commit 4: app-managed table DDL belongs in the workflow, not the app
What was wrong
The Tools page in the deployed app rendered empty across all four tabs (Overview / MCP Servers / UC Functions / Usage), with `/api/v1/tools/overview` returning 500 (`relation tool_registry does not exist`).
The app's startup hook in `backend/main.py` runs `_init_tools()` inside a daemon thread fan-out wrapped with `t.join(timeout=120)`. `_init_tools` calls `ensure_tools_tables()` (which runs `CREATE TABLE IF NOT EXISTS tool_registry ...`) and then `maybe_refresh_async()`. Three layered failures masked each other:
1. `tool_registry` was never created. On the sandbox, the daemon thread either crashed before `_init_tools` ran, ran past the 120 s budget, or hit Lakebase auth pressure under cold-start load. The server boots regardless and the table is missing.
2. `ensure_tools_tables()` swallowed real DDL failures. Per-statement `try/except: logger.warning(...)` was indistinguishable from the harmless "already exists" case.
3. `refresh_tools()` swallowed the entire body with `try/except Exception: logger.warning("Tools refresh failed: %s", exc)`. `POST /api/v1/tools/sync` returned `{"status":"ok","message":"Tools refresh complete"}` regardless of whether the underlying refresh blew up, so the dashboard was the only externally visible signal.
The same architectural issue applied to `request_logs` (created lazily by the request-audit middleware, also fails silently if the app SP lacks DDL).
Fix (three parts)
1. Move app-managed table DDL into the workflow. New Phase 7 in `workflows/02_sync_to_lakebase.py` creates `tool_registry` and `request_logs` from the workflow run-as identity. The workflow runs as `databricks_superuser` on the Lakebase PG instance, so DDL here is always safe. The app no longer needs DDL privileges to function, only `SELECT/INSERT/UPDATE/DELETE`. This parallels the pattern already used for `discovered_agents`, `gateway_usage_*`, and 20+ other tables.
2. Make `tools_service.refresh_tools` self-heal and stop hiding errors.
   - `refresh_tools()` now calls `ensure_tools_tables()` at the top, so a fresh deploy where the workflow hasn't run yet still recovers on the first read.
   - The catch-all switches from `logger.warning` to `logger.exception`, so full tracebacks land in app logs.
   - `ensure_tools_tables()` also uses `logger.exception` with the offending DDL preview.
3. Smoke check coverage. `workflows/10_smoke_check_lakebase.py` now asserts `tool_registry` and `request_logs` in the `EXPECTED` bucket: existence required, 0 rows allowed. Missing table → workflow goes red with an actionable message; 0 rows is a WARN (legitimate before any user activity).
Verified end-to-end
- `/api/v1/tools/overview`: 500 (`UndefinedTable: tool_registry`) before; `{total_tools:3, mcp_servers:3, uc_functions:0, managed_count:3}` after
- … `ai_control_plane.control_plane`; SP only sees `system.*`)
- Usage: `[]` before (worked because no Lakebase touch); `[]` after (legitimate: no MLflow traces with TOOL/FUNCTION spans yet)
The empty UC Functions and Usage tabs are correct given the current sandbox state. Both will populate naturally as agents using UC function tools are deployed.
Bonus: in-workspace SP-grant fallback (also in commit 1)
`control-plane-app/grant_sp_lakebase_notebook.py` + `run_grant_sp_lakebase_job.sh`: a drop-in replacement for the laptop-side `grant_sp_lakebase.py` invocation in `deploy.sh`, for workspaces where the network policy blocks public Lakebase ingress. Submits the grant as a one-shot serverless Databricks Job with `notebook_task` + `base_parameters` so it runs inside the workspace.
Subtleties this had to solve, all of which would bite the next deployer:
- `spark_python_task + environment_variables` doesn't work on serverless: `environments[].spec` doesn't honour `environment_variables`. Switched to `notebook_task` + widget reads.
- … `databricks.sdk.service.postgres`. Pinned `databricks-sdk>=0.40.0` in the env spec and ran `%pip install --upgrade` at the top of the wrapper notebook.
- `grant_sp_lakebase.py` ends with `sys.exit(main())`. Notebooks treat `SystemExit(0)` as a task failure. The wrapper `runpy.run_path`s the script and catches `SystemExit`, only re-raising on non-zero codes.
`deploy.sh` is updated to print a pointer to the helper instead of running `grant_sp_lakebase.py` directly.
Bonus: deployment-blocking issues log
`docs/rca/2026-06-06-deploy-non-trivial-fixes.md` captures four deployment-blocking issues from the sandbox onboarding that aren't covered by `docs/installation.md`:
- `CREATE CATALOG` on metastore is required; workspace-admin alone is not enough.
- `databricks auth login`: profile name does not constrain identity. `current-user me` verification recipe.
- … `grant_sp_lakebase.py` cannot work; in-workspace job is the fix (this PR's helper scripts).
Plus a permission-audit table for the privileged admin identity used for the deploy, showing exactly what's required to land this stack.
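The `SystemExit` handling described for the SP-grant wrapper notebook can be sketched as follows. This is a minimal stand-alone version under assumptions: `run_script_tolerating_clean_exit` is a hypothetical name, and the two throwaway scripts written to a temp directory stand in for `grant_sp_lakebase.py`.

```python
import os
import runpy
import tempfile


def run_script_tolerating_clean_exit(path):
    """Run a script as __main__; swallow SystemExit(0/None), re-raise other codes."""
    try:
        runpy.run_path(path, run_name="__main__")
    except SystemExit as exc:
        if exc.code not in (None, 0):
            raise  # a real failure should still fail the task
    return 0


with tempfile.TemporaryDirectory() as tmp:
    ok_script = os.path.join(tmp, "ok.py")
    bad_script = os.path.join(tmp, "bad.py")
    with open(ok_script, "w") as handle:
        handle.write("import sys\nsys.exit(0)\n")
    with open(bad_script, "w") as handle:
        handle.write("import sys\nsys.exit(3)\n")

    clean_result = run_script_tolerating_clean_exit(ok_script)
    try:
        run_script_tolerating_clean_exit(bad_script)
        bad_code = None
    except SystemExit as exc:
        bad_code = exc.code

print(clean_result, bad_code)
```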
Happy to split any of these out into separate PRs if a single PR is too much surface area to review at once.
Test plan
- `gateway_usage_daily` and `gateway_usage_hourly` exist with non-zero rows after a full discovery run.
- … `UndefinedTable` for `gateway_usage_daily`.
- … `SQLSTATE: 42501` in the error message instead of silently producing 0-row dashboards.
- … `tool_registry` and `request_logs` on every run; `/api/v1/tools/overview` returns 200 with three MCP server entries instead of 500.
- … `tool_registry` is dropped manually (existence assert in EXPECTED bucket).
- … `billing_user_cost_daily` REQUIRED before the commit-3 fix: task `result_state=FAILED`, exception message includes the per-table breakdown.
- … `grant_sp_lakebase_notebook.py` wrapper handles a real grant on a workspace that does have public Lakebase reachable (current verification was on a workspace where it isn't).
- … `.github/workflows/deploy-sandbox.yml` against your CI conventions; the expected secret names are `DATABRICKS_HOST_SANDBOX`, `DATABRICKS_CLIENT_ID_SANDBOX`, `DATABRICKS_CLIENT_SECRET_SANDBOX`.
🤖 Generated with Claude Code