# AGENTS.md - AI Agent Guide for dbt-databricks

This guide helps AI agents quickly understand and work productively with the dbt-databricks adapter codebase.

## 🚀 Quick Start for Agents

### Project Overview

- **What**: dbt adapter for the Databricks Lakehouse Platform
- **Based on**: dbt-spark adapter with Databricks-specific enhancements
- **Key Features**: Unity Catalog support, Delta Lake, Python models, streaming tables
- **Language**: Python 3.9+ with Jinja2 SQL macros
- **Architecture**: Inherits from the Spark adapter, extends with Databricks-specific functionality

### Essential Files to Understand

```
dbt/adapters/databricks/
├── impl.py              # Main adapter implementation (DatabricksAdapter class)
├── connections.py       # Connection management and SQL execution
├── credentials.py       # Authentication (token, OAuth, Azure AD)
├── relation.py          # Databricks-specific relation handling
├── python_models/       # Python model execution on clusters
├── relation_configs/    # Table/view configuration management
└── catalogs/            # Unity Catalog vs Hive Metastore logic

dbt/include/databricks/macros/   # Jinja2 SQL templates
├── adapters/            # Core adapter macros
├── materializations/    # Model materialization strategies
├── relations/           # Table/view creation and management
└── utils/               # Utility macros
```

## 🛠 Development Environment

**Prerequisites**: Python 3.9+ installed on your system

**Install Hatch** (recommended):

```bash
# Install Hatch globally - see https://hatch.pypa.io/dev/install/
pip install hatch

# Create default environment (Hatch installs needed Python versions)
hatch env create
```

**Essential commands**:

```bash
hatch run code-quality   # Format, lint, type-check
hatch run unit           # Run unit tests
hatch run cluster-e2e    # Run functional tests
```

> 📖 **See [Development Guide](docs/dbt-databricks-dev.md)** for comprehensive setup documentation
> 📖 **See [Testing Guide](docs/testing.md)** for comprehensive testing documentation

## 🧪 Testing Strategy

### Test Types & When to Use

1. **Unit Tests** (`tests/unit/`): Fast, isolated, no external dependencies

   - Test individual functions, utility methods, and SQL generation
   - Mock external dependencies (database calls, API calls)
   - Run with: `hatch run unit` (see below for running a subset)

2. **Functional Tests** (`tests/functional/`): End-to-end against real Databricks
   - Test complete dbt workflows (run, seed, test, snapshot)
   - Require a live Databricks workspace
   - Run with: `hatch run cluster-e2e` (or `uc-cluster-e2e`, `sqlw-e2e`)

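While iterating, it is often faster to run a single test module or test. A minimal sketch, assuming pytest is installed in the default Hatch environment (the path and `-k` filter below are illustrative):

```bash
# Run one unit test directory inside the default Hatch environment,
# filtering to tests whose names match the expression
hatch run pytest tests/unit/macros -k "create_table"
```
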
### Test Environments

- **HMS Cluster** (`databricks_cluster`): Legacy Hive Metastore
- **Unity Catalog Cluster** (`databricks_uc_cluster`): Modern UC features
- **SQL Warehouse** (`databricks_uc_sql_endpoint`): Serverless compute

Functional tests read workspace credentials from environment variables; a sketch follows, and `test.env.example` has the authoritative list.

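A hedged sketch of a local `test.env`; the variable names and values here are illustrative, so copy the real ones from `test.env.example`:

```bash
# Illustrative only -- the canonical variable names live in test.env.example
export DBT_DATABRICKS_HOST_NAME="my-workspace.cloud.databricks.com"  # hypothetical workspace host
export DBT_DATABRICKS_TOKEN="dapi..."                                # personal access token
export DBT_DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/abc123"         # hypothetical compute path
```
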
### Writing Tests

#### Unit Test Example

```python
from dbt.adapters.databricks.utils import redact_credentials


def test_redact_credentials():
    sql = "WITH (credential ('KEY' = 'SECRET_VALUE'))"
    expected = "WITH (credential ('KEY' = '[REDACTED]'))"
    assert redact_credentials(sql) == expected
```

#### Macro Test Example

```python
import pytest

from tests.unit.macros.base import MacroTestBase


class TestCreateTable(MacroTestBase):
    @pytest.fixture(scope="class")
    def template_name(self) -> str:
        return "create.sql"  # File in macros/relations/table/

    @pytest.fixture(scope="class")
    def macro_folders_to_load(self) -> list:
        return ["macros", "macros/relations/table"]

    def test_create_table_sql(self, template_bundle):
        result = self.run_macro(
            template_bundle.template, "create_table", template_bundle.relation, "select 1"
        )
        expected = "create table `database`.`schema`.`table` as (select 1)"
        self.assert_sql_equal(result, expected)
```

#### Functional Test Example

```python
import pytest

from dbt.tests import util


class TestIncrementalModel:
    @pytest.fixture(scope="class")
    def models(self):
        return {
            "my_model.sql": """
            {{ config(materialized='incremental', unique_key='id') }}
            select 1 as id, 'test' as name
            """
        }

    def test_incremental_run(self, project):
        results = util.run_dbt(["run"])
        assert len(results) == 1
        # Verify the table exists and has the expected data
        results = project.run_sql("select count(*) from my_model", fetch="all")
        assert results[0][0] == 1
```

## 🏗 Architecture Deep Dive

### Adapter Inheritance Chain

```
DatabricksAdapter (impl.py)
  ↳ SparkAdapter (from dbt-spark)
    ↳ SQLAdapter (from dbt-core)
      ↳ BaseAdapter (from dbt-core)
```

### Key Components

#### Connection Management (`connections.py`)

- Extends the Spark connection manager for Databricks
- Manages connection lifecycle and query execution
- Handles query comments and context tracking
- Integrates with `credentials.py` for authentication and `handle.py` for cursor operations

#### Authentication & Credentials (`credentials.py`)

- Defines the credential dataclass covering all auth methods (token, OAuth, Azure AD)
- Handles credential validation and session properties
- Manages compute resource configuration

#### SQL Execution (`handle.py`)

- Provides a cursor wrapper for the Databricks SQL connector
- Implements retry logic and connection pooling
- Handles SQL execution details and error handling

#### Relation Handling (`relation.py`)

- Extends Spark relations with Databricks features
- Handles the Unity Catalog three-level namespace (catalog.schema.table), as sketched below
- Manages relation metadata and configuration

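A minimal sketch of the three-level namespace, assuming the standard dbt relation API; the catalog, schema, and table names are hypothetical:

```python
from dbt.adapters.databricks.relation import DatabricksRelation

# In dbt, "database" maps to the Unity Catalog catalog
relation = DatabricksRelation.create(
    database="main",      # hypothetical catalog
    schema="silver",      # hypothetical schema within the catalog
    identifier="orders",  # hypothetical table name
)
print(relation.render())  # expected (with default quoting): `main`.`silver`.`orders`
```
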
#### Python Models (`python_models/`)

- Executes Python models on Databricks clusters (a minimal model is sketched below)
- Supports multiple submission methods (jobs, workflows, serverless)
- Handles dependency management and result collection

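A minimal sketch of a dbt Python model; `my_upstream_model` is a hypothetical ref, and the exact `submission_method` values accepted should be confirmed against the adapter docs:

```python
import pyspark.sql.functions as F


def model(dbt, session):
    dbt.config(
        materialized="table",
        submission_method="serverless",  # assumption: one of the supported methods
    )
    upstream = dbt.ref("my_upstream_model")  # hypothetical upstream model
    # Return a DataFrame; the adapter writes it to the target relation
    return upstream.withColumn("loaded_at", F.current_timestamp())
```
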
#### Macros (`dbt/include/databricks/macros/`)

- Jinja2 templates that generate SQL
- Override Spark macros with Databricks-specific logic via dbt's dispatch naming (example below)
- Handle materializations (table, view, incremental, snapshot)
- Implement Databricks features (liquid clustering, column masks, tags)

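dbt's `adapter.dispatch` resolves a macro to the most specific adapter prefix first (`databricks__`), falling back toward `spark__` and `default__`, so overriding is just a naming convention. An illustrative override (this particular macro may or may not exist in the repo):

```sql
{% macro databricks__current_timestamp() %}
    current_timestamp()
{% endmacro %}
```
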
### Configuration System

Models can be configured with Databricks-specific options:

```sql
{{ config(
    materialized='table',
    file_format='delta',
    liquid_clustered_by=['column1', 'column2'],
    tblproperties={'key': 'value'},
    column_tags={'pii_col': ['sensitive']},
    location_root='/mnt/external/'
) }}
```

## 🔧 Common Development Tasks

### Adding New Materialization

1. Create a macro in `macros/materializations/` (a minimal skeleton is sketched after this list)
2. Implement the SQL generation logic
3. Add configuration options to the relation configs
4. Write unit tests for the macro
5. Write functional tests for end-to-end behavior
6. Update documentation

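A hedged skeleton using dbt's standard materialization constructs; `my_feature` is a hypothetical name, and real materializations handle much more (existing relations, grants, intermediate relations):

```sql
{% materialization my_feature, adapter='databricks' %}
  {%- set target_relation = this.incorporate(type='table') -%}

  {{ run_hooks(pre_hooks) }}

  {# 'main' is the statement whose result dbt reports for the model #}
  {% call statement('main') -%}
    create or replace table {{ target_relation }} as ({{ sql }})
  {%- endcall %}

  {{ run_hooks(post_hooks) }}
  {{ return({'relations': [target_relation]}) }}
{% endmaterialization %}
```
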
### Adding New Adapter Method

1. Add the method to the `DatabricksAdapter` class in `impl.py` (see the sketch after this list)
2. Implement the database interaction logic
3. Add a corresponding macro if SQL generation is needed
4. Write unit tests with mocked database calls
5. Write functional tests against a real database

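A minimal sketch; `my_helper` is hypothetical, and `@available` is dbt's decorator for exposing an adapter method to the Jinja context (callable as `{{ adapter.my_helper(this) }}` in a macro):

```python
from dbt.adapters.base import available
from dbt.adapters.spark.impl import SparkAdapter


class DatabricksAdapter(SparkAdapter):  # heavily simplified; the real class lives in impl.py
    @available
    def my_helper(self, relation) -> str:
        """Hypothetical helper exposed to the Jinja context."""
        return f"describe extended {relation}"
```
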
### Modifying SQL Generation

1. Locate the relevant macro in the `macros/` directory
2. Pin down current behavior with unit tests
3. Modify the macro logic
4. Update unit tests to verify the new behavior
5. Run affected functional tests to ensure no regressions

### Adding Configuration Option

1. Add a field to the appropriate config class in `relation_configs/` (a generic sketch follows this list)
2. Update the macro to use the new configuration
3. Add validation logic if needed
4. Write tests for both valid and invalid configurations

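A generic, hypothetical sketch of the shape such a component takes; the real base classes and validation hooks live in `relation_configs/` and should be followed instead:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MyFeatureConfig:  # hypothetical config component
    """Carries a model-level option from the dbt config into SQL generation."""

    enabled: bool = False

    def validate(self) -> None:
        # Assumption: invalid combinations should fail loudly at parse time
        if not isinstance(self.enabled, bool):
            raise ValueError("my_feature must be a boolean")
```
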
## 🐛 Debugging Guide

### Common Issues

1. **SQL Generation**: Use macro unit tests with `assert_sql_equal()`
2. **Connection Problems**: Check credentials and environment variables
3. **Python Model Failures**: Check the cluster configuration and dependencies
4. **Test Failures**: Review the logs in the `logs/` directory; look for red text

### Debugging Tools

- **IDE Test Runner**: Set breakpoints and step through code
- **Log Analysis**: dbt generates detailed debug logs by default
- **SQL Inspection**: Print generated SQL from inside macros (see the snippet below)
- **Mock Inspection**: Verify mocked calls in unit tests

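dbt's built-in Jinja `log()` function can print from inside any macro; with `info=True` the message is echoed to the console as well as the log file. The macro name below is hypothetical:

```sql
{% macro debug_print_sql(compiled_sql) %}
    {# Log the SQL this macro received, then emit it unchanged #}
    {% do log("generated sql: " ~ compiled_sql, info=True) %}
    {{ compiled_sql }}
{% endmacro %}
```
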
## 📚 Key Resources

### Documentation

- **Development**: `docs/dbt-databricks-dev.md` - Setup and workflow
- **Testing**: `docs/testing.md` - Comprehensive testing guide
- **Contributing**: `CONTRIBUTING.MD` - Code standards and PR process
- **User Docs**: [docs.getdbt.com](https://docs.getdbt.com/reference/resource-configs/databricks-configs)

### Important Files for Agents

- `pyproject.toml` - Project configuration, dependencies, tool settings
- `test.env.example` - Template for test environment variables
- `tests/conftest.py` - Global test configuration
- `tests/profiles.py` - Test database profiles

### Code Patterns to Follow

1. **Error Handling**: Use dbt's exception classes and provide helpful messages
2. **Logging**: Use `logger` from `dbt.adapters.databricks.logging` (see the snippet below)
3. **SQL Generation**: Prefer macros over Python string manipulation
4. **Testing**: Write both unit and functional tests for new features
5. **Configuration**: Use dataclasses with validation for new config options

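The logging pattern in practice; the message text is illustrative:

```python
from dbt.adapters.databricks.logging import logger

# Debug-level detail lands in dbt's log output; use info/warning sparingly
logger.debug("Refreshing relation cache for catalog")
```
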
## 🚨 Common Pitfalls for Agents

1. **Don't modify dbt-spark behavior** without understanding the inheritance chain
2. **Always run `code-quality`** before committing changes
3. **Test on multiple environments** (HMS, UC cluster, SQL warehouse)
4. **Mock external dependencies properly** in unit tests
5. **Use appropriate test fixtures** from dbt-tests-adapter
6. **Follow SQL normalization** in test assertions with `assert_sql_equal()`
7. **Handle Unity Catalog vs HMS differences** in feature implementations
8. **Consider backward compatibility** when modifying existing behavior

## 🎯 Success Metrics

When working on this codebase, ensure:

- [ ] Code quality checks and unit tests pass (`hatch run code-quality && hatch run unit`)
- [ ] New features have both unit and functional tests
- [ ] SQL generation follows Databricks best practices
- [ ] Changes maintain backward compatibility
- [ ] Code follows project style guidelines

---

_This guide is maintained by the dbt-databricks team. When making significant architectural changes, update this guide to help future agents understand the codebase._