Commit b6219e7

RamanDamayeu (Raman Damayeu) and claude authored
Add CLAUDE.md for Claude Code initialization (#129)
Initial setup documentation for Claude Code that provides:
- Common commands for building, linting, and testing
- Monorepo architecture and collector structure
- Adapter conventions and patterns
- Configuration guidance and SDK usage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Raman Damayeu <rdamayeu@provectus.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent a679448 commit b6219e7

1 file changed: +285 -0 lines changed
CLAUDE.md

Lines changed: 285 additions & 0 deletions
@@ -0,0 +1,285 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

This is a monorepo for OpenDataDiscovery (ODD) Collectors, a suite of services that extract metadata from various data sources and send it to the ODD Platform. The repository contains four collectors (odd-collector, odd-collector-aws, odd-collector-azure, odd-collector-gcp) and a shared SDK (odd-collector-sdk).

## Common Commands

### Development Setup

```bash
# Install dependencies for a specific collector
cd odd-collector  # or odd-collector-aws, odd-collector-azure, odd-collector-gcp
poetry install

# Activate virtual environment
poetry shell
```

### Linting

```bash
# From repository root - formats all code
make lint

# This runs:
# - black . (code formatter)
# - isort . (import sorter with black profile)
```

### Testing

```bash
# Run all tests for a collector (must be in collector directory)
cd odd-collector
poetry shell
pytest ./tests -v

# Run tests for a specific adapter
pytest ./tests/integration/test_postgres.py -v

# Run tests without activating shell (useful for CI)
poetry run pytest ./tests -v

# Integration tests are marked with @pytest.mark.integration
# Most integration tests use testcontainers
```

### Running Collectors

```bash
# Run a collector locally (requires collector_config.yaml in current directory)
cd odd-collector
poetry run python -m odd_collector

# Run with Docker (an absolute host path keeps the bind mount portable across Docker versions)
docker run -v "$(pwd)/collector_config.yaml:/app/collector_config.yaml" ghcr.io/opendatadiscovery/odd-collector:latest

# Set log level via environment variable
export LOGLEVEL=DEBUG  # Options: DEBUG, INFO, WARNING, ERROR
```

### Versioning and Release

Version numbers are stored in two places per collector:
- `./<collector_package>/__version__.py`
- `./pyproject.toml`

Both must be updated manually before creating a release tag.
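Because the two version strings are maintained by hand, a small pre-release check can catch drift. The following is a minimal sketch, not part of the repository: it assumes `__version__.py` assigns a quoted string to `VERSION` or `__version__`, and that `pyproject.toml` has a `version = "..."` line. Both the variable names and the helper names are assumptions made for illustration.

```python
import re

# Assumption: __version__.py assigns a quoted string to VERSION or __version__.
_PY_VERSION = re.compile(
    r"""^\s*(?:VERSION|__version__)\s*=\s*["']([^"']+)["']""", re.MULTILINE
)
# Assumption: pyproject.toml carries a `version = "..."` line at the start of a line.
_TOML_VERSION = re.compile(r"""^version\s*=\s*["']([^"']+)["']""", re.MULTILINE)


def _extract(pattern: re.Pattern, text: str, label: str) -> str:
    """Return the first version string found in `text`, or raise ValueError."""
    match = pattern.search(text)
    if match is None:
        raise ValueError(f"no version found in {label}")
    return match.group(1)


def versions_match(version_py_text: str, pyproject_text: str) -> bool:
    """Return True when both file contents declare the same version string."""
    py_version = _extract(_PY_VERSION, version_py_text, "__version__.py")
    toml_version = _extract(_TOML_VERSION, pyproject_text, "pyproject.toml")
    return py_version == toml_version
```

The function takes file contents rather than paths, so it can be called with `Path(...).read_text()` for whichever collector is being released.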

## Monorepo Architecture

### Package Structure

```
/
├── odd-collector/              # Generic collector (databases, BI tools, ML platforms)
│   ├── odd_collector/
│   │   ├── adapters/           # 40+ adapters (postgresql, mysql, snowflake, etc.)
│   │   ├── domain/
│   │   │   └── plugin.py       # PLUGIN_FACTORY registry
│   │   └── __main__.py         # Entry point
│   ├── config_examples/        # YAML configs for each adapter
│   ├── tests/
│   └── pyproject.toml
├── odd-collector-aws/          # AWS services (S3, Glue, Athena, etc.)
├── odd-collector-azure/        # Azure services (Blob Storage, PowerBI, etc.)
├── odd-collector-gcp/          # GCP services (BigQuery, GCS, etc.)
├── odd-collector-sdk/          # Shared SDK library
│   ├── odd_collector_sdk/
│   │   ├── domain/
│   │   │   ├── adapter.py      # BaseAdapter, AbstractAdapter
│   │   │   ├── plugin.py       # Plugin base class
│   │   │   └── filter.py       # Include/exclude pattern matching
│   │   ├── collector.py        # Main orchestrator
│   │   ├── job.py              # SyncJob, AsyncJob, AsyncGeneratorJob
│   │   └── api/                # ODD Platform API client
│   └── pyproject.toml
└── pyproject.toml              # Root monorepo config
```

### How Collectors Work

1. **Configuration Loading**: Collector reads `collector_config.yaml` from current directory
2. **Dynamic Adapter Loading**: Based on plugin `type`, imports `{root_package}.adapters.{type}.adapter.Adapter`
3. **Job Scheduling**: APScheduler runs jobs based on `default_pulling_interval` (one-time if not set)
4. **Metadata Extraction**: Each adapter's `get_data_entity_list()` returns `DataEntityList`
5. **Platform Ingestion**: SDK chunks and sends entities to ODD Platform via REST API
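Step 2 above can be illustrated with the standard library's `importlib`. This is a minimal sketch of the naming convention only, not the SDK's actual loader. The function names `adapter_module_path` and `load_adapter_class` are hypothetical.

```python
import importlib


def adapter_module_path(root_package: str, plugin_type: str) -> str:
    """Build the dotted module path from the convention in step 2."""
    return f"{root_package}.adapters.{plugin_type}.adapter"


def load_adapter_class(root_package: str, plugin_type: str) -> type:
    """Import the adapter module and return the class named `Adapter`."""
    module = importlib.import_module(adapter_module_path(root_package, plugin_type))
    try:
        return module.Adapter
    except AttributeError as exc:
        # The convention requires adapter.py to define a class called Adapter.
        raise ImportError(f"{module.__name__} does not define an Adapter class") from exc
```

Under this convention, a plugin with `type: postgresql` in the generic collector would resolve to `odd_collector.adapters.postgresql.adapter.Adapter`, which is why the plugin `type` must match the adapter directory name.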

### Adapter Conventions

Every adapter MUST follow these conventions:

**Directory Structure:**
```
adapters/{adapter_name}/
└── adapter.py          # REQUIRED: Contains "Adapter" class
```

**Adapter Class:**
```python
from oddrn_generator import Generator  # assumed import path for the generator base class
from odd_collector_sdk.domain.adapter import BaseAdapter
from odd_models.models import DataEntityList

# YourPluginType and YourOddrnGenerator are placeholders for your own classes.


class Adapter(BaseAdapter):
    def __init__(self, config: YourPluginType) -> None:
        super().__init__(config)

    def create_generator(self) -> Generator:
        # Return ODDRN generator for this data source
        return YourOddrnGenerator(...)

    def get_data_entity_list(self) -> DataEntityList:
        # Extract metadata and return entities
        return DataEntityList(...)
```

**Plugin Registration** (`domain/plugin.py`):
```python
from typing import Literal
from odd_collector_sdk.domain.plugin import Plugin


class YourAdapterPlugin(Plugin):
    type: Literal["adapter_name"]  # MUST match adapter directory name
    # Add adapter-specific config fields
    host: str
    port: int
    # ...


# Register in factory
PLUGIN_FACTORY = {
    "adapter_name": YourAdapterPlugin,
    # ...
}
```

**Configuration Example** (`config_examples/adapter_name.yaml`):
```yaml
platform_host_url: http://localhost:8080
token: ""
plugins:
  - type: adapter_name  # Maps to PLUGIN_FACTORY key
    name: my_instance
    host: localhost
    port: 5432
```

### Common Adapter Patterns

**Repository Pattern** (for databases):
```
adapters/{adapter}/
├── adapter.py          # Orchestrates metadata extraction
├── repository.py       # Executes queries, fetches raw data
├── models.py           # Domain models (Table, Column, Schema, etc.)
└── mappers/            # Transform domain models to ODD entities
    ├── tables.py
    ├── columns.py
    └── ...
```

**Mapper Pattern** (all adapters):
- Separate transformation logic from data access
- Input: Source-specific models (e.g., PostgreSQL Table)
- Output: ODD models (`DataEntity`, `DataSet`, `DataTransformer`, etc.)

**Client Pattern** (for APIs):
```
adapters/{adapter}/
├── adapter.py
├── client.py           # HTTP/SDK client for external API
└── mapper/             # Transform API responses to ODD entities
```

### SDK Base Classes

**BaseAdapter** (most common):
- Provides `generator` attribute via `create_generator()`
- Implements `get_data_source_oddrn()` using generator
- Subclass must implement: `create_generator()`, `get_data_entity_list()`

**AsyncAbstractAdapter** (for async operations):
- Same interface but with `async def get_data_entity_list()`
- SDK automatically wraps in `AsyncJob` or `AsyncGeneratorJob`

**AbstractAdapter** (rarely used directly):
- Minimal interface when you need custom ODDRN handling

### Configuration Patterns

**Priority Order** (highest to lowest):
1. AWS SSM Parameter Store (if `secrets_backend` configured)
2. `collector_config.yaml` file
3. Environment variables (top-level config only, not plugins)
4. Default values
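The priority order above behaves like a layered lookup in which the first source that defines a key wins. The sketch below models that with `collections.ChainMap`. It is an illustration of the ordering only, not how the SDK resolves settings, and the function name and sample values are hypothetical.

```python
from collections import ChainMap


def resolve_settings(ssm: dict, yaml_file: dict, env: dict, defaults: dict) -> ChainMap:
    """Layer the sources so that earlier mappings shadow later ones."""
    # Order mirrors the list above: SSM, then YAML file, then environment, then defaults.
    return ChainMap(ssm, yaml_file, env, defaults)


# Hypothetical values for illustration.
settings = resolve_settings(
    ssm={"token": "from-ssm"},
    yaml_file={"token": "from-yaml", "platform_host_url": "http://localhost:8080"},
    env={"platform_host_url": "http://env-host:8080", "default_pulling_interval": 10},
    defaults={"default_pulling_interval": None},
)
```

Here `settings["token"]` comes from the SSM layer, `settings["platform_host_url"]` from the YAML layer, and `settings["default_pulling_interval"]` from the environment layer, because each is the highest-priority source that defines the key.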

**Environment Variable Substitution:**
```yaml
plugins:
  - type: postgresql
    password: !ENV ${POSTGRES_PASSWORD}
```
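To make the `${VAR}` placeholder concrete, here is a minimal sketch of the substitution idea using only the standard library. It is not the implementation used by pyaml-env. The helper name `substitute_env` is hypothetical, and raising on a missing variable is a choice made for this sketch.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Replace every ${NAME} in `value` with the matching environment value."""
    env = os.environ if environ is None else environ

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in env:
            # Sketch-level choice: fail loudly instead of leaving the placeholder.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_REF.sub(replace, value)
```

Passing an explicit mapping, as in `substitute_env("${POSTGRES_PASSWORD}", {"POSTGRES_PASSWORD": "secret"})`, keeps the helper easy to test without touching the real environment.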

**Filter Configuration** (reusable across adapters):
```yaml
plugins:
  - type: postgresql
    schemas_filter:
      include: ["public.*", "sales.*"]  # Regex patterns
      exclude: ["temp.*"]
      ignore_case: false
```

Adapters using filters: PostgreSQL (`schemas_filter`), Snowflake (`schemas_filter`), S3 (`filename_filter`), BigQuery (`datasets_filter`), GCS (`filename_filter`), Azure Data Factory (`pipeline_filter`), Azure Blob (`file_filter`).
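The include/exclude semantics can be sketched as follows. This is an illustration, not the SDK's `filter.py`. It assumes patterns are matched from the start of the name with `re.match`, that `include` defaults to matching everything, and that an `exclude` match overrides an `include` match. The class name `PatternFilter` is hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatternFilter:
    # Assumption: with no include patterns configured, every name is a candidate.
    include: List[str] = field(default_factory=lambda: [".*"])
    exclude: List[str] = field(default_factory=list)
    ignore_case: bool = False

    def is_allowed(self, value: str) -> bool:
        """Return True if `value` matches an include pattern and no exclude pattern."""
        flags = re.IGNORECASE if self.ignore_case else 0

        def matches(patterns: List[str]) -> bool:
            return any(re.match(pattern, value, flags) for pattern in patterns)

        return matches(self.include) and not matches(self.exclude)
```

With the YAML above, `PatternFilter(include=["public.*", "sales.*"], exclude=["temp.*"])` would allow `public` and `sales_eu` and reject `hr`, under the assumptions stated.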

## Adding a New Adapter

1. Create adapter directory: `mkdir -p {collector}/adapters/{name}`
2. Implement adapter: `{collector}/adapters/{name}/adapter.py` with `Adapter` class
3. Create plugin model in `{collector}/domain/plugin.py` with `type: Literal["{name}"]`
4. Register in `PLUGIN_FACTORY` dictionary
5. Add configuration example: `{collector}/config_examples/{name}.yaml`
6. Write tests: `{collector}/tests/integration/test_{name}.py`
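The file-creating steps (1, 2, 5, and 6) can be scaffolded with a short script. This is a hypothetical helper, not something the repository ships. It assumes the layout shown under Package Structure, where the Python package sits inside the collector directory next to `config_examples/` and `tests/`. Steps 3 and 4 still need to be done by hand in `domain/plugin.py`.

```python
from pathlib import Path
from typing import List


def scaffold_adapter(collector_dir: Path, package: str, name: str) -> List[Path]:
    """Create empty placeholder files for a new adapter and return their paths."""
    targets = [
        # Steps 1-2: adapter directory and adapter.py
        collector_dir / package / "adapters" / name / "adapter.py",
        # Step 5: configuration example
        collector_dir / "config_examples" / f"{name}.yaml",
        # Step 6: integration test module
        collector_dir / "tests" / "integration" / f"test_{name}.py",
    ]
    for target in targets:
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch(exist_ok=True)  # leaves existing files untouched
    return targets
```

For example, `scaffold_adapter(Path("odd-collector"), "odd_collector", "my_source")` would create the three empty files for an adapter named `my_source` in the generic collector.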

## Key Dependencies

- **odd-models**: ODD metadata model (`DataEntity`, `DataSet`, etc.)
- **oddrn-generator**: Generates ODDRNs (Open Data Discovery Resource Names)
- **pydantic**: Configuration validation and parsing
- **APScheduler**: Job scheduling for periodic metadata collection
- **funcy**: Functional programming utilities (used extensively)
- **pyaml-env**: YAML parsing with environment variable support

## Docker Build

Each collector has its own Dockerfile:
- Multi-stage build (build + runtime)
- Python 3.9 base image (Debian Bullseye)
- Poetry for dependency management
- Specific system dependencies (ODBC drivers, Oracle client, etc.)
- Entry point: `/bin/bash start.sh` → `python -m {collector_package}`

## Relationships Feature

PostgreSQL and Snowflake adapters support building ERD (Entity-Relationship Diagram) relationships:
- Foreign key constraints → `ONE_TO_EXACTLY_ONE`, `ONE_TO_ZERO_OR_ONE`, etc.
- Cross-schema relationships supported
- Implementation: `adapters/{adapter}/mappers/relationships/`

## Testing with Testcontainers

Integration tests use testcontainers to spin up real databases/services:
```python
import pytest
from testcontainers.postgres import PostgresContainer


@pytest.mark.integration
def test_postgres_adapter():
    with PostgresContainer("postgres:14") as postgres:
        # Test adapter against real Postgres; the URL below points at the container
        connection_url = postgres.get_connection_url()
        assert connection_url
```

Run integration tests: `pytest ./tests -v -m integration`
