 Databricks Data Exploration Guide
 =====================================
 
-{{.WorkspaceInfo}}
-Default SQL Warehouse: {{.WarehouseName}} ({{.WarehouseID}}){{.ProfilesInfo}}
+{{.WorkspaceInfo}}{{if .WarehouseName}}
+Default SQL Warehouse: {{.WarehouseName}} ({{.WarehouseID}}){{else}}
+Note: No SQL warehouse detected. SQL queries will require warehouse_id to be specified manually.{{end}}{{.ProfilesInfo}}
 
 IMPORTANT: Use the invoke_databricks_cli tool to run all commands below!
 
 
 1. EXECUTING SQL QUERIES
-   Run SQL queries using the Statement Execution API with inline JSON:
-   invoke_databricks_cli 'api post /api/2.0/sql/statements --json {"warehouse_id":"<warehouse_id>","statement":"SELECT * FROM <catalog>.<schema>.<table> LIMIT 10","wait_timeout":"30s"}'
+   Run queries with auto-wait (max 50s):
+   invoke_databricks_cli 'api post /api/2.0/sql/statements --json {"warehouse_id":"{{if .WarehouseID}}{{.WarehouseID}}{{else}}<warehouse_id>{{end}}","statement":"SELECT * FROM <catalog>.<schema>.<table> LIMIT 10","wait_timeout":"50s"}'
 
-   Examples:
-   - Simple query: {"warehouse_id":"<id>","statement":"SELECT 42 as answer","wait_timeout":"10s"}
-   - Table query: {"warehouse_id":"<id>","statement":"SELECT * FROM catalog.schema.table LIMIT 10","wait_timeout":"30s"}
+   The response includes status.state:
+   - "SUCCEEDED" → Results are in result.data_array (you're done!)
+   - "PENDING" → The warehouse is starting or the query is still running. Poll with:
+     invoke_databricks_cli 'api get /api/2.0/sql/statements/<statement_id>'
+     Repeat every 5-10s until the state is "SUCCEEDED"
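+     A pending response looks roughly like this (abridged, illustrative):
+     {"statement_id":"<statement_id>","status":{"state":"PENDING"}}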
 
-   Note: Use the warehouse ID shown above. Results are returned in JSON format.
+   Note: The first query against a stopped warehouse takes 60-120s while it starts up
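+   For orientation, a successful response for a query like SELECT 42 AS answer
+   has roughly this shape (abridged; values are illustrative):
+   {"statement_id":"<statement_id>","status":{"state":"SUCCEEDED"},
+    "manifest":{"schema":{"columns":[{"name":"answer","type_text":"INT"}]}},
+    "result":{"data_array":[["42"]],"row_count":1}}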
 
 
 2. EXPLORING JOBS AND WORKFLOWS
@@ -74,3 +84,88 @@ Getting Started:
 - Use the commands above to explore what resources exist in the workspace
 - All commands support --output json for programmatic access
 - Remember to add --profile <name> when working with non-default workspaces
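+  Example (illustrative; replace <name> with one of your configured profiles):
+  invoke_databricks_cli 'jobs list --output json --profile <name>'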
+
+
+WORKFLOW PATTERNS FOR DATABRICKS PROJECTS
+==========================================
+
+Creating a New Databricks Project:
+  When to use: Building a new project from scratch, setting up deployment to multiple environments
+  Tools sequence:
+  1. init_project (creates proper project structure with templates)
+  2. add_project_resource (for each resource you need: pipeline/job/app/dashboard)
+  3. analyze_project (provides deployment commands)
+  4. invoke_databricks_cli 'bundle validate'
+  💡 Tip: Use init_project even if you know the YAML syntax; it applies templates and best practices
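+  For reference, a typical validate-then-deploy sequence looks like this (the
+  target name 'dev' is an example; use the targets defined in your databricks.yml):
+  invoke_databricks_cli 'bundle validate'
+  invoke_databricks_cli 'bundle deploy --target dev'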
+
+Working with an Existing Databricks Project:
+  When to use: a databricks.yml file already exists in the directory
+  Tools sequence:
+  1. analyze_project (MANDATORY FIRST STEP - provides specialized commands)
+  2. [Make your changes to project files]
+  3. invoke_databricks_cli 'bundle validate'
+  💡 Tip: ALWAYS call analyze_project before making changes; Databricks projects
+     require specialized commands that differ from standard Python/Node.js workflows
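+  To inspect what the bundle currently defines before changing it, you can also
+  run the following (availability depends on your CLI version):
+  invoke_databricks_cli 'bundle summary'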
+
+Adding Resources to an Existing Project:
+  When to use: Adding pipelines, jobs, apps, or dashboards to an existing project
+  Tools sequence:
+  1. add_project_resource (with type: 'pipeline', 'job', 'app', or 'dashboard')
+  2. analyze_project (to get updated deployment commands)
+  3. invoke_databricks_cli 'bundle validate'
+
+
+PATTERN MATCHING: If Your Task Mentions...
+===========================================
+
+"new project" / "create a project" / "Databricks project" / "project structure"
+  → Use init_project first (don't create files manually!)
+  → Then add_project_resource for each resource (pipeline/job/app/dashboard)
+
+"SQL pipeline" / "data pipeline" / "materialized views" / "ETL" / "DLT"
+  → Use add_project_resource with type='pipeline' or type='job'
+
+"Databricks app" / "application" / "build an app"
+  → Use add_project_resource with type='app'
+
+"dashboard" / "Lakeview dashboard" / "visualization"
+  → Use add_project_resource with type='dashboard'
+
+"Databricks job" / "scheduled job" / "workflow"
+  → Use add_project_resource with type='job'
+
+"deploy to dev and prod" / "multiple environments" / "dev/staging/prod"
+  → Use init_project (sets up multi-environment structure automatically)
+
+"databricks.yml" / "bundle configuration" / "Asset Bundle"
+  → If creating new: use init_project (don't create it manually!)
+  → If it exists already: use analyze_project FIRST before making changes
+
+
+ANTI-PATTERNS TO AVOID
+=======================
+
+❌ DON'T manually create databricks.yml files
+  ✅ DO use init_project instead
+
+❌ DON'T run bundle commands without calling analyze_project first
+  ✅ DO call analyze_project to get the correct specialized commands
+
+❌ DON'T use regular Bash to run databricks CLI commands
+  ✅ DO use invoke_databricks_cli (better for user allowlisting; see the example after this list)
+
+❌ DON'T skip explore when planning Databricks work
+  ✅ DO call explore during planning to get workflow recommendations
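+
+Example for the Bash anti-pattern above (illustrative):
+  Instead of running in Bash:  databricks jobs list
+  Use:                         invoke_databricks_cli 'jobs list'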