| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +This is a monorepo for OpenDataDiscovery (ODD) Collectors - a suite of services that extract metadata from various data sources and send it to the ODD Platform. The repository contains four collectors (odd-collector, odd-collector-aws, odd-collector-azure, odd-collector-gcp) and a shared SDK (odd-collector-sdk). |
| 8 | + |
| 9 | +## Common Commands |
| 10 | + |
| 11 | +### Development Setup |
| 12 | + |
| 13 | +```bash |
| 14 | +# Install dependencies for a specific collector |
| 15 | +cd odd-collector # or odd-collector-aws, odd-collector-azure, odd-collector-gcp |
| 16 | +poetry install |
| 17 | + |
| 18 | +# Activate virtual environment |
| 19 | +poetry shell |
| 20 | +``` |
| 21 | + |
| 22 | +### Linting |
| 23 | + |
| 24 | +```bash |
| 25 | +# From repository root - formats all code |
| 26 | +make lint |
| 27 | + |
| 28 | +# This runs: |
| 29 | +# - black . (code formatter) |
| 30 | +# - isort . (import sorter with black profile) |
| 31 | +``` |
| 32 | + |
| 33 | +### Testing |
| 34 | + |
| 35 | +```bash |
| 36 | +# Run all tests for a collector (must be in collector directory) |
| 37 | +cd odd-collector |
| 38 | +poetry shell |
| 39 | +pytest ./tests -v |
| 40 | + |
| 41 | +# Run tests for a specific adapter |
| 42 | +pytest ./tests/integration/test_postgres.py -v |
| 43 | + |
| 44 | +# Run tests without activating shell (useful for CI) |
| 45 | +poetry run pytest ./tests -v |
| 46 | + |
| 47 | +# Integration tests are marked with @pytest.mark.integration |
| 48 | +# Most integration tests use testcontainers |
| 49 | +``` |
| 50 | + |
| 51 | +### Running Collectors |
| 52 | + |
| 53 | +```bash |
| 54 | +# Run a collector locally (requires collector_config.yaml in current directory) |
| 55 | +cd odd-collector |
| 56 | +poetry run python -m odd_collector |
| 57 | + |
| 58 | +# Run with Docker |
| 59 | +docker run -v "$(pwd)/collector_config.yaml:/app/collector_config.yaml" ghcr.io/opendatadiscovery/odd-collector:latest |
| 60 | + |
| 61 | +# Set log level via environment variable |
| 62 | +export LOGLEVEL=DEBUG # Options: DEBUG, INFO, WARNING, ERROR |
| 63 | +``` |
| 64 | + |
| 65 | +### Versioning and Release |
| 66 | + |
| 67 | +Version numbers are stored in two places per collector: |
| 68 | +- `./<collector_package>/__version__.py` |
| 69 | +- `./pyproject.toml` |
| 70 | + |
| 71 | +Both must be updated manually before creating a release tag. |
| 72 | + |
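Because the two version strings are kept in sync by hand, a small pre-release check can catch a mismatch. The sketch below is only an illustration: the helper names are hypothetical, and it assumes `__version__.py` assigns the version to a variable named `VERSION` or `__version__`, and that the first top-level `version = "..."` line in `pyproject.toml` is the package version.

```python
import re
from typing import Optional, Pattern

# Assumption: __version__.py holds a line such as VERSION = "1.2.3" or __version__ = "1.2.3".
_VERSION_PY = re.compile(
    r'^\s*(?:VERSION|__version__)\s*=\s*["\']([^"\']+)["\']', re.MULTILINE
)
# Assumption: the first line starting with `version = "..."` in pyproject.toml is the package version.
_PYPROJECT = re.compile(r'^version\s*=\s*"([^"]+)"', re.MULTILINE)


def extract_version(pattern: Pattern[str], text: str) -> Optional[str]:
    """Return the first version string matched by `pattern`, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


def versions_match(version_py_text: str, pyproject_text: str) -> bool:
    """True only when both files declare a version and the two strings are identical."""
    from_py = extract_version(_VERSION_PY, version_py_text)
    from_toml = extract_version(_PYPROJECT, pyproject_text)
    return from_py is not None and from_py == from_toml
```

In practice the two arguments would be the contents of `./<collector_package>/__version__.py` and `./pyproject.toml`, read from disk before tagging.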
| 73 | +## Monorepo Architecture |
| 74 | + |
| 75 | +### Package Structure |
| 76 | + |
| 77 | +``` |
| 78 | +/ |
| 79 | +├── odd-collector/ # Generic collector (databases, BI tools, ML platforms) |
| 80 | +│ ├── odd_collector/ |
| 81 | +│ │ ├── adapters/ # 40+ adapters (postgresql, mysql, snowflake, etc.) |
| 82 | +│ │ ├── domain/ |
| 83 | +│ │ │ └── plugin.py # PLUGIN_FACTORY registry |
| 84 | +│ │ └── __main__.py # Entry point |
| 85 | +│ ├── config_examples/ # YAML configs for each adapter |
| 86 | +│ ├── tests/ |
| 87 | +│ └── pyproject.toml |
| 88 | +├── odd-collector-aws/ # AWS services (S3, Glue, Athena, etc.) |
| 89 | +├── odd-collector-azure/ # Azure services (Blob Storage, PowerBI, etc.) |
| 90 | +├── odd-collector-gcp/ # GCP services (BigQuery, GCS, etc.) |
| 91 | +├── odd-collector-sdk/ # Shared SDK library |
| 92 | +│ ├── odd_collector_sdk/ |
| 93 | +│ │ ├── domain/ |
| 94 | +│ │ │ ├── adapter.py # BaseAdapter, AbstractAdapter |
| 95 | +│ │ │ ├── plugin.py # Plugin base class |
| 96 | +│ │ │ └── filter.py # Include/exclude pattern matching |
| 97 | +│ │ ├── collector.py # Main orchestrator |
| 98 | +│ │ ├── job.py # SyncJob, AsyncJob, AsyncGeneratorJob |
| 99 | +│ │ └── api/ # ODD Platform API client |
| 100 | +│ └── pyproject.toml |
| 101 | +└── pyproject.toml # Root monorepo config |
| 102 | +``` |
| 103 | + |
| 104 | +### How Collectors Work |
| 105 | + |
| 106 | +1. **Configuration Loading**: Collector reads `collector_config.yaml` from current directory |
| 107 | +2. **Dynamic Adapter Loading**: Based on plugin `type`, imports `{root_package}.adapters.{type}.adapter.Adapter` |
| 108 | +3. **Job Scheduling**: APScheduler runs jobs based on `default_pulling_interval` (one-time if not set) |
| 109 | +4. **Metadata Extraction**: Each adapter's `get_data_entity_list()` returns `DataEntityList` |
| 110 | +5. **Platform Ingestion**: SDK chunks and sends entities to ODD Platform via REST API |
| 111 | + |
| 112 | +### Adapter Conventions |
| 113 | + |
| 114 | +Every adapter MUST follow these conventions: |
| 115 | + |
| 116 | +**Directory Structure:** |
| 117 | +``` |
| 118 | +adapters/{adapter_name}/ |
| 119 | +└── adapter.py # REQUIRED: Contains "Adapter" class |
| 120 | +``` |
| 121 | + |
| 122 | +**Adapter Class:** |
| 123 | +```python |
| 124 | +from odd_collector_sdk.domain.adapter import BaseAdapter |
| 125 | +from odd_models.models import DataEntityList |
| 126 | + |
| 127 | +class Adapter(BaseAdapter): |
| 128 | +    def __init__(self, config: YourPluginType) -> None: |
| 129 | +        super().__init__(config) |
| 130 | + |
| 131 | +    def create_generator(self) -> Generator: |
| 132 | +        # Return ODDRN generator for this data source |
| 133 | +        return YourOddrnGenerator(...) |
| 134 | + |
| 135 | +    def get_data_entity_list(self) -> DataEntityList: |
| 136 | +        # Extract metadata and return entities |
| 137 | +        return DataEntityList(...) |
| 138 | +``` |
| 139 | + |
| 140 | +**Plugin Registration** (`domain/plugin.py`): |
| 141 | +```python |
| 142 | +from typing import Literal |
| 143 | +from odd_collector_sdk.domain.plugin import Plugin |
| 144 | + |
| 145 | +class YourAdapterPlugin(Plugin): |
| 146 | +    type: Literal["adapter_name"] # MUST match adapter directory name |
| 147 | +    # Add adapter-specific config fields |
| 148 | +    host: str |
| 149 | +    port: int |
| 150 | +    # ... |
| 151 | + |
| 152 | +# Register in factory |
| 153 | +PLUGIN_FACTORY = { |
| 154 | +    "adapter_name": YourAdapterPlugin, |
| 155 | +    # ... |
| 156 | +} |
| 157 | +``` |
| 158 | + |
| 159 | +**Configuration Example** (`config_examples/adapter_name.yaml`): |
| 160 | +```yaml |
| 161 | +platform_host_url: http://localhost:8080 |
| 162 | +token: "" |
| 163 | +plugins: |
| 164 | +  - type: adapter_name # Maps to PLUGIN_FACTORY key |
| 165 | +    name: my_instance |
| 166 | +    host: localhost |
| 167 | +    port: 5432 |
| 168 | +``` |
| 169 | + |
| 170 | +### Common Adapter Patterns |
| 171 | + |
| 172 | +**Repository Pattern** (for databases): |
| 173 | +``` |
| 174 | +adapters/{adapter}/ |
| 175 | +├── adapter.py # Orchestrates metadata extraction |
| 176 | +├── repository.py # Executes queries, fetches raw data |
| 177 | +├── models.py # Domain models (Table, Column, Schema, etc.) |
| 178 | +└── mappers/ # Transform domain models to ODD entities |
| 179 | +    ├── tables.py |
| 180 | +    ├── columns.py |
| 181 | +    └── ... |
| 182 | +``` |
| 183 | + |
| 184 | +**Mapper Pattern** (all adapters): |
| 185 | +- Separate transformation logic from data access |
| 186 | +- Input: Source-specific models (e.g., PostgreSQL Table) |
| 187 | +- Output: ODD models (`DataEntity`, `DataSet`, `DataTransformer`, etc.) |
| 188 | + |
| 189 | +**Client Pattern** (for APIs): |
| 190 | +``` |
| 191 | +adapters/{adapter}/ |
| 192 | +├── adapter.py |
| 193 | +├── client.py # HTTP/SDK client for external API |
| 194 | +└── mapper/ # Transform API responses to ODD entities |
| 195 | +``` |
| 196 | + |
| 197 | +### SDK Base Classes |
| 198 | + |
| 199 | +**BaseAdapter** (most common): |
| 200 | +- Provides `generator` attribute via `create_generator()` |
| 201 | +- Implements `get_data_source_oddrn()` using generator |
| 202 | +- Subclass must implement: `create_generator()`, `get_data_entity_list()` |
| 203 | + |
| 204 | +**AsyncAbstractAdapter** (for async operations): |
| 205 | +- Same interface but with `async def get_data_entity_list()` |
| 206 | +- SDK automatically wraps in `AsyncJob` or `AsyncGeneratorJob` |
| 207 | + |
| 208 | +**AbstractAdapter** (rarely used directly): |
| 209 | +- Minimal interface when you need custom ODDRN handling |
| 210 | + |
| 211 | +### Configuration Patterns |
| 212 | + |
| 213 | +**Priority Order** (highest to lowest): |
| 214 | +1. AWS SSM Parameter Store (if `secrets_backend` configured) |
| 215 | +2. `collector_config.yaml` file |
| 216 | +3. Environment variables (top-level config only, not plugins) |
| 217 | +4. Default values |
| 218 | + |
| 219 | +**Environment Variable Substitution:** |
| 220 | +```yaml |
| 221 | +plugins: |
| 222 | +  - type: postgresql |
| 223 | +    password: !ENV ${POSTGRES_PASSWORD} |
| 224 | +``` |
| 225 | + |
| 226 | +**Filter Configuration** (reusable across adapters): |
| 227 | +```yaml |
| 228 | +plugins: |
| 229 | +  - type: postgresql |
| 230 | +    schemas_filter: |
| 231 | +      include: ["public.*", "sales.*"] # Regex patterns |
| 232 | +      exclude: ["temp.*"] |
| 233 | +      ignore_case: false |
| 234 | +``` |
| 235 | + |
| 236 | +Adapters using filters: PostgreSQL (`schemas_filter`), Snowflake (`schemas_filter`), S3 (`filename_filter`), BigQuery (`datasets_filter`), GCS (`filename_filter`), Azure Data Factory (`pipeline_filter`), Azure Blob (`file_filter`). |
| 237 | + |
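To make the include/exclude semantics concrete, here is a minimal sketch of how such a filter could behave. It is not the SDK's `filter.py`: the class name `PatternFilter` is hypothetical, and two behaviours are assumptions: patterns are matched from the start of the value with `re.match`, and an `exclude` match always overrides an `include` match.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatternFilter:
    """Hypothetical stand-in for an include/exclude regex filter."""

    include: List[str] = field(default_factory=lambda: [".*"])
    exclude: List[str] = field(default_factory=list)
    ignore_case: bool = False

    def _matches_any(self, patterns: List[str], value: str) -> bool:
        flags = re.IGNORECASE if self.ignore_case else 0
        # Assumption: re.match semantics, i.e. the pattern must match at the start of the value.
        return any(re.match(pattern, value, flags) for pattern in patterns)

    def is_allowed(self, value: str) -> bool:
        # Assumption: a value passes when it matches an include pattern and no exclude pattern.
        return self._matches_any(self.include, value) and not self._matches_any(
            self.exclude, value
        )


schemas_filter = PatternFilter(include=["public.*", "sales.*"], exclude=["temp.*"])
kept = [
    name
    for name in ["public", "sales_eu", "temp_load", "Public"]
    if schemas_filter.is_allowed(name)
]
```

With these assumptions, `kept` contains `public` and `sales_eu`; `temp_load` matches no include pattern, and `Public` is dropped because `ignore_case` is false.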
| 238 | +## Adding a New Adapter |
| 239 | + |
| 240 | +1. Create adapter directory: `mkdir -p {collector}/adapters/{name}` |
| 241 | +2. Implement adapter: `{collector}/adapters/{name}/adapter.py` with `Adapter` class |
| 242 | +3. Create plugin model in `{collector}/domain/plugin.py` with `type: Literal["{name}"]` |
| 243 | +4. Register in `PLUGIN_FACTORY` dictionary |
| 244 | +5. Add configuration example: `{collector}/config_examples/{name}.yaml` |
| 245 | +6. Write tests: `{collector}/tests/integration/test_{name}.py` |
| 246 | + |
| 247 | +## Key Dependencies |
| 248 | + |
| 249 | +- **odd-models**: ODD metadata model (`DataEntity`, `DataSet`, etc.) |
| 250 | +- **oddrn-generator**: Generates ODDRNs (Open Data Discovery Resource Names) |
| 251 | +- **pydantic**: Configuration validation and parsing |
| 252 | +- **APScheduler**: Job scheduling for periodic metadata collection |
| 253 | +- **funcy**: Functional programming utilities (used extensively) |
| 254 | +- **pyaml-env**: YAML parsing with environment variable support |
| 255 | + |
| 256 | +## Docker Build |
| 257 | + |
| 258 | +Each collector has its own Dockerfile: |
| 259 | +- Multi-stage build (build + runtime) |
| 260 | +- Python 3.9 base image (Debian Bullseye) |
| 261 | +- Poetry for dependency management |
| 262 | +- Specific system dependencies (ODBC drivers, Oracle client, etc.) |
| 263 | +- Entry point: `/bin/bash start.sh` → `python -m {collector_package}` |
| 264 | + |
| 265 | +## Relationships Feature |
| 266 | + |
| 267 | +PostgreSQL and Snowflake adapters support building ERD (Entity-Relationship Diagram) relationships: |
| 268 | +- Foreign key constraints → `ONE_TO_EXACTLY_ONE`, `ONE_TO_ZERO_OR_ONE`, etc. |
| 269 | +- Cross-schema relationships supported |
| 270 | +- Implementation: `adapters/{adapter}/mappers/relationships/` |
| 271 | + |
| 272 | +## Testing with Testcontainers |
| 273 | + |
| 274 | +Integration tests use testcontainers to spin up real databases/services: |
| 275 | +```python |
| 276 | +import pytest |
| 277 | +from testcontainers.postgres import PostgresContainer |
| 278 | + |
| 279 | +@pytest.mark.integration |
| 280 | +def test_postgres_adapter(): |
| 281 | +    with PostgresContainer("postgres:14") as postgres: |
| 282 | +        ...  # Test adapter against real Postgres |
| 283 | +``` |
| 284 | + |
| 285 | +Run integration tests: `pytest ./tests -v -m integration` |