Commit e177d4a

fix: Reliable schema existence checking (#1261)
Resolves #1260

### Description

* Backport AGENTS.md, because it is frustrating to work with AI when it doesn't know things I've already told it
* Ensure check_schema_exists works (and implement via reliable mechanism)

### Checklist

- [ ] I have run this code in development and it appears to resolve the stated issue
- [ ] This PR includes tests, or tests are not required/relevant for this PR
- [ ] I have updated the `CHANGELOG.md` and added information about my change to the "dbt-databricks next" section.
1 parent b6df4c6 commit e177d4a

File tree

6 files changed, +221 -18 lines changed

AGENTS.md

Lines changed: 134 additions & 8 deletions
````diff
@@ -9,7 +9,7 @@ This guide helps AI agents quickly understand and work productively with the dbt
 - **What**: dbt adapter for Databricks Lakehouse platform
 - **Based on**: dbt-spark adapter with Databricks-specific enhancements
 - **Key Features**: Unity Catalog support, Delta Lake, Python models, streaming tables
-- **Language**: Python 3.9+ with Jinja2 SQL macros
+- **Language**: Python 3.10+ with Jinja2 SQL macros
 - **Architecture**: Inherits from Spark adapter, extends with Databricks-specific functionality
 
 ### Essential Files to Understand
@@ -20,6 +20,7 @@ dbt/adapters/databricks/
 ├── connections.py # Connection management and SQL execution
 ├── credentials.py # Authentication (token, OAuth, Azure AD)
 ├── relation.py # Databricks-specific relation handling
+├── dbr_capabilities.py # DBR version capability system
 ├── python_models/ # Python model execution on clusters
 ├── relation_configs/ # Table/view configuration management
 └── catalogs/ # Unity Catalog vs Hive Metastore logic
@@ -33,24 +34,37 @@ dbt/include/databricks/macros/ # Jinja2 SQL templates
 
 ## 🛠 Development Environment
 
-**Prerequisites**: Python 3.9+ installed on your system
+**Prerequisites**: Python 3.10+ installed on your system
 
 **Install Hatch** (recommended):
 
+For Linux:
+
 ```bash
-# Install Hatch globally - see https://hatch.pypa.io/dev/install/
-pip install hatch
+# Download and install standalone binary
+curl -Lo hatch.tar.gz https://github.com/pypa/hatch/releases/latest/download/hatch-x86_64-unknown-linux-gnu.tar.gz
+tar -xzf hatch.tar.gz
+mkdir -p $HOME/bin
+mv hatch $HOME/bin/hatch
+chmod +x $HOME/bin/hatch
+echo 'export PATH="$HOME/bin:$PATH"' >> ~/.zshrc
+export PATH="$HOME/bin:$PATH"
 
 # Create default environment (Hatch installs needed Python versions)
 hatch env create
 ```
 
+For other platforms: see https://hatch.pypa.io/latest/install/
+
 **Essential commands**:
 
 ```bash
 hatch run code-quality # Format, lint, type-check
 hatch run unit # Run unit tests
 hatch run cluster-e2e # Run functional tests
+
+# For specific tests, use pytest directly:
+hatch run pytest path/to/test_file.py::TestClass::test_method -v
 ```
 
 > 📖 **See [Development Guide](docs/dbt-databricks-dev.md)** for comprehensive setup documentation
@@ -113,17 +127,38 @@ class TestCreateTable(MacroTestBase):
 
 #### Functional Test Example
 
+**Important**: SQL models and YAML schemas should be defined in a `fixtures.py` file in the same directory as the test, not inline in the test class. This keeps tests clean and fixtures reusable.
+
+**fixtures.py:**
+
+```python
+my_model_sql = """
+{{ config(materialized='incremental', unique_key='id') }}
+select 1 as id, 'test' as name
+"""
+
+my_schema_yml = """
+version: 2
+models:
+  - name: my_model
+    columns:
+      - name: id
+        description: 'ID column'
+"""
+```
+
+**test_my_feature.py:**
+
 ```python
 from dbt.tests import util
+from tests.functional.adapter.my_feature import fixtures
 
 class TestIncrementalModel:
     @pytest.fixture(scope="class")
     def models(self):
         return {
-            "my_model.sql": """
-                {{ config(materialized='incremental', unique_key='id') }}
-                select 1 as id, 'test' as name
-            """
+            "my_model.sql": fixtures.my_model_sql,
+            "schema.yml": fixtures.my_schema_yml,
         }
 
     def test_incremental_run(self, project):
@@ -147,6 +182,46 @@ DatabricksAdapter (impl.py)
 
 ### Key Components
 
+#### DBR Capability System (`dbr_capabilities.py`)
+
+- **Purpose**: Centralized management of DBR version-dependent features
+- **Key Features**:
+  - Per-compute caching (different clusters can have different capabilities)
+  - Named capabilities instead of magic version numbers
+  - Automatic detection of DBR version and SQL warehouse environments
+- **Supported Capabilities**:
+  - `TIMESTAMPDIFF` (DBR 10.4+): Advanced date/time functions
+  - `INSERT_BY_NAME` (DBR 12.2+): Name-based column matching in INSERT
+  - `ICEBERG` (DBR 14.3+): Apache Iceberg table format
+  - `COMMENT_ON_COLUMN` (DBR 16.1+): Modern column comment syntax
+  - `JSON_COLUMN_METADATA` (DBR 16.2+): Efficient metadata retrieval
+- **Usage in Code**:
+
+```python
+# In Python code
+if adapter.has_capability(DBRCapability.ICEBERG):
+    # Use Iceberg features
+
+# In Jinja macros
+{% if adapter.has_dbr_capability('comment_on_column') %}
+  COMMENT ON COLUMN ...
+{% else %}
+  ALTER TABLE ... ALTER COLUMN ...
+{% endif %}
+
+{% if adapter.has_dbr_capability('insert_by_name') %}
+  INSERT INTO table BY NAME SELECT ...
+{% else %}
+  INSERT INTO table SELECT ... -- positional
+{% endif %}
+```
+
+- **Adding New Capabilities**:
+  1. Add to `DBRCapability` enum
+  2. Add `CapabilitySpec` with version requirements
+  3. Use `has_capability()` or `require_capability()` in code
+- **Important**: Each compute resource (identified by `http_path`) maintains its own capability cache
+
 #### Connection Management (`connections.py`)
 
 - Extends Spark connection manager for Databricks
@@ -184,6 +259,42 @@ DatabricksAdapter (impl.py)
 - Override Spark macros with Databricks-specific logic
 - Handle materializations (table, view, incremental, snapshot)
 - Implement Databricks features (liquid clustering, column masks, tags)
+- **Important**: To override a `spark__macro_name` macro, create `databricks__macro_name` (NOT `spark__macro_name`)
+
+#### Multi-Statement SQL Execution
+
+When a macro needs to execute multiple SQL statements (e.g., DELETE followed by INSERT), use the `execute_multiple_statements` helper:
+
+**Pattern for Multi-Statement Strategies:**
+```jinja
+{% macro my_multi_statement_strategy(args) %}
+  {%- set statements = [] -%}
+
+  {#-- Build first statement --#}
+  {%- set statement1 -%}
+    DELETE FROM {{ target_relation }}
+    WHERE some_condition
+  {%- endset -%}
+  {%- do statements.append(statement1) -%}
+
+  {#-- Build second statement --#}
+  {%- set statement2 -%}
+    INSERT INTO {{ target_relation }}
+    SELECT * FROM {{ source_relation }}
+  {%- endset -%}
+  {%- do statements.append(statement2) -%}
+
+  {{- return(statements) -}}
+{% endmacro %}
+```
+
+**How It Works:**
+- Return a **list of SQL strings** from your strategy macro
+- The incremental materialization automatically detects lists and calls `execute_multiple_statements()`
+- Each statement executes separately via `{% call statement('main') %}`
+- Used by: `delete+insert` incremental strategy (DBR < 17.1 fallback), materialized views, streaming tables
+
+**Note:** Databricks SQL connector does NOT support semicolon-separated statements in a single execute call. Always return a list.
 
 ### Configuration System
 
@@ -256,6 +367,7 @@ Models can be configured with Databricks-specific options:
 
 - **Development**: `docs/dbt-databricks-dev.md` - Setup and workflow
 - **Testing**: `docs/testing.md` - Comprehensive testing guide
+- **DBR Capabilities**: `docs/dbr-capability-system.md` - Version-dependent features
 - **Contributing**: `CONTRIBUTING.MD` - Code standards and PR process
 - **User Docs**: [docs.getdbt.com](https://docs.getdbt.com/reference/resource-configs/databricks-configs)
 
@@ -273,6 +385,11 @@ Models can be configured with Databricks-specific options:
 3. **SQL Generation**: Prefer macros over Python string manipulation
 4. **Testing**: Write both unit and functional tests for new features
 5. **Configuration**: Use dataclasses with validation for new config options
+6. **Imports**: Always import at the top of the file, never use local imports within functions or methods
+7. **Version Checks**: Use capability system instead of direct version comparisons:
+   - `if adapter.compare_dbr_version(16, 1) >= 0:`
+   - `if adapter.has_capability(DBRCapability.COMMENT_ON_COLUMN):`
+   - `{% if adapter.has_dbr_capability('comment_on_column') %}`
 
 ## 🚨 Common Pitfalls for Agents
 
@@ -284,12 +401,21 @@ Models can be configured with Databricks-specific options:
 6. **Follow SQL normalization** in test assertions with `assert_sql_equal()`
 7. **Handle Unity Catalog vs HMS differences** in feature implementations
 8. **Consider backward compatibility** when modifying existing behavior
+9. **Use capability system for version checks** - Never add new `compare_dbr_version()` calls
+10. **Remember per-compute caching** - Different clusters may have different capabilities in the same run
+11. **Multi-statement SQL**: Don't use semicolons to separate statements - return a list instead and let `execute_multiple_statements()` handle it
 
 ## 🎯 Success Metrics
 
 When working on this codebase, ensure:
 
 - [ ] All tests pass (`hatch run code-quality && hatch run unit`)
+- [ ] **CRITICAL: Run affected functional tests before declaring success**
+  - If you modified connection/capability logic: Run tests that use multiple computes or check capabilities
+  - If you modified incremental materializations: Run `tests/functional/adapter/incremental/`
+  - If you modified Python models: Run `tests/functional/adapter/python_model/`
+  - If you modified macros: Run tests that use those macros
+  - **NEVER declare "mission accomplished" without running functional tests for affected features**
 - [ ] New features have both unit and functional tests
 - [ ] SQL generation follows Databricks best practices
 - [ ] Changes maintain backward compatibility
````
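
The "Adding New Capabilities" steps in the AGENTS.md changes above are prose-only; the sketch below shows one way they could look in code. It is an illustrative assumption: the `CapabilitySpec` fields, the registry dict, and the `MY_NEW_FEATURE` capability are invented to demonstrate the pattern, not copied from the real `dbr_capabilities.py`.

```python
# Hypothetical sketch of the three "Adding New Capabilities" steps; names and
# field layout are assumptions, not the actual dbr_capabilities.py contents.
from enum import Enum


class DBRCapability(Enum):
    COMMENT_ON_COLUMN = "comment_on_column"
    MY_NEW_FEATURE = "my_new_feature"  # Step 1: add the enum member


class CapabilitySpec:
    """Minimum DBR version at which a capability becomes available."""

    def __init__(self, capability: DBRCapability, min_major: int, min_minor: int):
        self.capability = capability
        self.min_major = min_major
        self.min_minor = min_minor


# Step 2: register the version requirement (DBR 17.0+ here, purely illustrative)
CAPABILITY_SPECS = {
    DBRCapability.MY_NEW_FEATURE: CapabilitySpec(DBRCapability.MY_NEW_FEATURE, 17, 0),
}


def use_new_feature(adapter) -> None:
    # Step 3: gate code paths on the capability instead of a raw version check
    if adapter.has_capability(DBRCapability.MY_NEW_FEATURE):
        pass  # new code path
    else:
        pass  # fallback for older DBR
```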

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -1,5 +1,9 @@
 ## dbt-databricks 1.10.15 (TBD)
 
+### Fixes
+
+- Switch to a more reliable mechanism for checking schema existence ([1261](https://github.com/databricks/dbt-databricks/pull/1261))
+
 ### Under the hood
 
 - Allow for dbt-core 1.10.15 ([1254](https://github.com/databricks/dbt-databricks/pull/1254))
```

dbt/adapters/databricks/impl.py

Lines changed: 4 additions & 2 deletions
```diff
@@ -305,9 +305,11 @@ def list_schemas(self, database: Optional[str]) -> list[str]:
 
     def check_schema_exists(self, database: Optional[str], schema: str) -> bool:
         """Check if a schema exists."""
-        return schema.lower() in set(
-            s.lower() for s in self.connections.list_schemas(database or "hive_metastore", schema)
+        results = self.execute_macro(
+            "databricks__check_schema_exists",
+            kwargs={"database": database or "hive_metastore", "schema": schema},
         )
+        return len(results) > 0
 
     def execute(
         self,
```
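
A caller-side sketch of how the reworked method is meant to be used, assuming a standard adapter/relation pair; the `ensure_schema` helper below is hypothetical and only illustrates the contract (any row returned by the macro means the schema exists):

```python
# Hypothetical usage sketch, not adapter source code.
def ensure_schema(adapter, relation) -> None:
    exists = adapter.check_schema_exists(
        database=relation.database,  # impl.py falls back to "hive_metastore" when None
        schema=relation.schema,
    )
    if not exists:
        # Under the hood this check now renders the databricks__check_schema_exists
        # macro, i.e. SHOW SCHEMAS IN <database> LIKE '<schema>', and treats an
        # empty result set as "does not exist".
        adapter.create_schema(relation)
```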

dbt/include/databricks/macros/adapters/metadata.sql

Lines changed: 21 additions & 0 deletions
```diff
@@ -22,7 +22,28 @@ SHOW TABLE EXTENDED IN {{ schema_relation.without_identifier()|lower }} LIKE '{{
   {{ return(run_query_as(show_tables_sql(relation), 'show_tables')) }}
 {% endmacro %}
 
+{% macro databricks__list_schemas(database) -%}
+  {{ return(run_query_as(list_schemas_sql(database), 'list_schemas')) }}
+{% endmacro %}
+
+{% macro list_schemas_sql(database) %}
+  {% if database %}
+    SHOW SCHEMAS IN {{ database }}
+  {% else %}
+    SHOW SCHEMAS
+  {% endif %}
+{% endmacro %}
+
+{% macro databricks__check_schema_exists(database, schema) %}
+  {{ return(run_query_as(check_schema_exists_sql(database, schema), 'check_schema_exists')) }}
+{% endmacro %}
+
+{% macro check_schema_exists_sql(database, schema) %}
+  SHOW SCHEMAS IN {{ database }} LIKE '{{ schema }}'
+{% endmacro %}
+
 {% macro show_tables_sql(relation) %}
+
 SHOW TABLES IN {{ relation.render() }}
 {% endmacro %}
```
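
For a rough sense of the rendered output, with made-up inputs (`hive_metastore` / `my_schema`), the two SQL-builder macros should produce statements along these lines:

```python
# Approximate rendered SQL for illustrative inputs; whitespace may differ.
LIST_SCHEMAS_SQL = "SHOW SCHEMAS IN hive_metastore"  # list_schemas_sql("hive_metastore")
CHECK_SCHEMA_EXISTS_SQL = "SHOW SCHEMAS IN hive_metastore LIKE 'my_schema'"  # check_schema_exists_sql(...)

# databricks__check_schema_exists runs the second statement via run_query_as;
# impl.py then reports existence when the result contains at least one row.
```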

pyproject.toml

Lines changed: 1 addition & 8 deletions
```diff
@@ -63,14 +63,8 @@ check-sdist = [
 ]
 
 [tool.hatch.envs.default]
-pre-install-commands = [
-    "pip install git+https://github.com/dbt-labs/[email protected]",
-    "pip install git+https://github.com/dbt-labs/[email protected]#subdirectory=dbt-adapters",
-    "pip install git+https://github.com/dbt-labs/[email protected]#subdirectory=dbt-tests-adapter",
-    "pip install git+https://github.com/dbt-labs/[email protected]#subdirectory=core",
-]
 dependencies = [
-    "dbt-spark @ git+https://github.com/dbt-labs/[email protected]#subdirectory=dbt-spark",
+    "dbt-tests-adapter",
     "pytest",
     "pytest-xdist",
     "pytest-dotenv",
@@ -83,7 +77,6 @@ dependencies = [
     "pydantic>=1.10.0, <2",
     "pytest-cov",
 ]
-path = ".hatch"
 python = "3.9"
 
 [tool.hatch.envs.default.scripts]
```
Lines changed: 57 additions & 0 deletions
```diff
@@ -0,0 +1,57 @@
+import pytest
+
+
+class TestCheckSchemaExists:
+    """Test the check_schema_exists adapter method."""
+
+    @pytest.fixture(scope="class", autouse=True)
+    def setUp(self, project):
+        """Create a test schema and clean it up after tests."""
+        test_schema = f"{project.test_schema}_check_exists"
+
+        with project.adapter.connection_named("__test"):
+            relation = project.adapter.Relation.create(
+                database=project.database,
+                schema=test_schema,
+            )
+            # Drop if exists from previous run
+            project.adapter.drop_schema(relation)
+            # Create the test schema
+            project.adapter.create_schema(relation)
+
+        yield test_schema
+
+        # Cleanup
+        with project.adapter.connection_named("__test"):
+            project.adapter.drop_schema(relation)
+
+    def test_check_schema_exists(self, project, setUp):
+        """Test that check_schema_exists correctly identifies existing and non-existing schemas."""
+        test_schema = setUp
+
+        with project.adapter.connection_named("__test"):
+            # Test 1: Verify existing schema returns True
+            exists = project.adapter.check_schema_exists(
+                database=project.database, schema=test_schema
+            )
+            assert (
+                exists is True
+            ), f"Expected schema '{test_schema}' to exist but check returned False"
+
+            # Test 2: Verify non-existing schema returns False
+            non_existent_schema = "this_schema_definitely_does_not_exist_12345"
+            exists = project.adapter.check_schema_exists(
+                database=project.database, schema=non_existent_schema
+            )
+            assert (
+                exists is False
+            ), f"Expected schema '{non_existent_schema}' to not exist but check returned True"
+
+            # Test 3: Verify existing default schema returns True (should always exist)
+            exists = project.adapter.check_schema_exists(
+                database=project.database, schema=project.test_schema
+            )
+            assert exists is True, (
+                f"Expected default test schema '{project.test_schema}' "
+                "to exist but check returned False"
+            )
```
