Commit 0d489fd

Update core functionality, add LLM integration, and improve rules
1 parent f1284a2 commit 0d489fd

49 files changed: +9377 -584 lines changed

AGENTS.md

Lines changed: 255 additions & 0 deletions
@@ -0,0 +1,255 @@
# AGENTS.md - Metacrafter

This document provides guidance for AI agents working with the Metacrafter codebase.

## Overview

Metacrafter is a Python command-line tool and library for labeling table fields and data files. It uses rule-based classification to identify:
- Personally Identifiable Information (PII)
- Person names, surnames, middle names
- Basic identifiers (UUID/GUID, email, phone, etc.)
- Country/language-specific identifiers
- Dates and times
- Various semantic data types

## Repository Structure

```
metacrafter/
├── metacrafter/              # Main package
│   ├── __init__.py           # Package initialization, exports exceptions
│   ├── __main__.py           # CLI entry point
│   ├── core.py               # Main CLI command handler (CrafterCmd)
│   ├── config.py             # Configuration file loader (.metacrafter)
│   ├── exceptions.py         # Custom exception classes
│   ├── classify/             # Core classification engine
│   │   ├── processor.py      # RulesProcessor - loads and applies rules
│   │   ├── stats.py          # Analyzer - field statistics and analysis
│   │   └── utils.py          # Utility functions
│   ├── core/                 # Core validation utilities
│   │   └── validators.py     # Validation functions
│   ├── registry/             # Registry client integration
│   │   └── client.py         # Client for metacrafter-registry
│   └── server/               # API server components
│       ├── api.py            # API endpoints
│       └── manager.py        # Server management
├── rules/                    # Default rule files (YAML)
│   ├── basic/                # Basic identifier rules
│   ├── common/               # Common rules (dates, internet, etc.)
│   ├── pii/                  # PII detection rules
│   ├── en/                   # English-specific rules
│   ├── ru/                   # Russian-specific rules
│   └── fr/                   # French-specific rules
├── tests/                    # Test suite
├── scripts/                  # Utility scripts
└── setup.py                  # Package setup
```

## Key Components

### 1. Core Engine (`metacrafter/core.py`)

The `CrafterCmd` class is the main entry point for all operations:
- `scan_file()` - Scan data files (CSV, JSON, Parquet, etc.)
- `scan_db()` - Scan SQL databases
- `scan_mongodb()` - Scan MongoDB databases
- `scan_data()` - Scan in-memory data (list of dicts)
- `scan_bulk()` - Scan multiple files in a directory

### 2. Rules Processor (`metacrafter/classify/processor.py`)

`RulesProcessor` handles:
- Loading YAML rule files from configured paths
- Compiling rules (text, PyParsing, function-based)
- Applying rules to field names and data values
- Filtering by context, language, country codes
- Confidence scoring

### 3. Statistics Analyzer (`metacrafter/classify/stats.py`)

`Analyzer` computes field statistics:
- Data type detection (str, int, float, dict, etc.)
- Uniqueness metrics
- Length statistics (min, max, avg)
- Character analysis (digits, alphas, special chars)
- Dictionary value detection

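The kinds of statistics listed above can be illustrated with a small standalone sketch. This is not the `Analyzer` implementation; the function name `field_stats` and the keys of the returned dict are hypothetical and chosen only for this example.

```python
from collections import Counter


def field_stats(values):
    """Compute simple per-field statistics (illustrative only, not Analyzer's code)."""
    present = [v for v in values if v is not None]
    texts = [str(v) for v in present]
    lengths = [len(t) for t in texts]
    total_chars = sum(lengths)
    # Count Python type names to find the most common value type.
    type_counts = Counter(type(v).__name__ for v in present)
    digits = sum(ch.isdigit() for t in texts for ch in t)
    alphas = sum(ch.isalpha() for t in texts for ch in t)
    return {
        "dominant_type": type_counts.most_common(1)[0][0] if type_counts else None,
        "unique_share": len(set(texts)) / len(texts) if texts else 0.0,
        "min_len": min(lengths, default=0),
        "max_len": max(lengths, default=0),
        "avg_len": total_chars / len(texts) if texts else 0.0,
        "digit_share": digits / total_chars if total_chars else 0.0,
        "alpha_share": alphas / total_chars if total_chars else 0.0,
    }


stats = field_stats(["a1", "b2", "a1", "c33"])
print(stats)
```

A low `unique_share` combined with short values is the sort of signal that could suggest a dictionary-like (categorical) field, though the thresholds Metacrafter actually uses are not described here.
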
### 4. Configuration (`metacrafter/config.py`)

`ConfigLoader` reads `.metacrafter` YAML config files from:
- Current working directory
- User home directory (`~/.metacrafter`)

Configuration options:
- `rulepath`: List of directories containing rule YAML files

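The lookup over the two locations above might look like the following sketch. It is not `ConfigLoader` itself: the helper name `find_config` is hypothetical, and the assumption that the working directory takes precedence over the home directory follows only from the order of the list, not from a documented rule.

```python
from pathlib import Path

CONFIG_NAME = ".metacrafter"


def find_config(cwd=None, home=None):
    """Return the first existing config file path, or None if neither exists."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    # Assumed precedence: current working directory first, then home directory.
    for candidate in (cwd / CONFIG_NAME, home / CONFIG_NAME):
        if candidate.is_file():
            return candidate
    return None


print(find_config())
```

The `cwd` and `home` parameters exist only so the lookup can be exercised against temporary directories in a test.
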
## Rule System

Rules are YAML files that define how to identify data types. There are three match types:

1. **text** - Exact text matching (for field names)
   ```yaml
   midname:
     key: person_midname
     match: text
     type: field
     rule: midname,secondname,middlename
   ```

2. **ppr** - PyParsing pattern matching (for data values)
   ```yaml
   rukadastr:
     key: rukadastr
     match: ppr
     type: data
     rule: Word(nums, min=1, max=2) + Literal(':')...
   ```

3. **func** - Python function validation
   ```yaml
   runpabyfunc:
     key: runpa
     match: func
     type: data
     rule: metacrafter.rules.ru.gov.is_ru_law
   ```

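As a rough illustration of the **text** match type above, the sketch below checks a field name against the comma-separated names in a `rule` string. The function name `matches_text_rule` is hypothetical, and the case-insensitive, whitespace-trimmed comparison is an assumption; `RulesProcessor` may normalize names differently.

```python
def matches_text_rule(field_name, rule):
    """Return True if field_name equals one of the comma-separated names in rule."""
    # Assumption: names are compared case-insensitively after trimming whitespace.
    candidates = {name.strip().lower() for name in rule.split(",") if name.strip()}
    return field_name.strip().lower() in candidates


MIDNAME_RULE = "midname,secondname,middlename"  # rule string from the `midname` rule

print(matches_text_rule("MiddleName", MIDNAME_RULE))
print(matches_text_rule("middle_name_id", MIDNAME_RULE))
```

Because the match is exact rather than substring-based, a field such as `middle_name_id` is not matched in this sketch.
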
## Supported File Formats

Metacrafter uses the `iterabledata` package for file format support:

**Text formats:** CSV, TSV, JSON, JSONL, XML
**Binary formats:** BSON, Parquet, Avro, ORC, Excel (XLS/XLSX), Pickle
**Compression:** gzip, bzip2, xz, lz4, zstandard, Brotli, Snappy, ZIP

Format detection is automatic based on file extension.

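Extension-based detection can be pictured with the sketch below. It is not how `iterabledata` works internally; the suffix tables, the function name `detect_format`, and the convention that a compression suffix follows the format suffix (as in `data.csv.gz`) are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical suffix tables covering a subset of the formats listed above.
COMPRESSION_SUFFIXES = {
    ".gz": "gzip", ".bz2": "bzip2", ".xz": "xz",
    ".lz4": "lz4", ".zst": "zstandard", ".br": "brotli", ".zip": "zip",
}
FORMAT_SUFFIXES = {
    ".csv": "csv", ".tsv": "tsv", ".json": "json", ".jsonl": "jsonl",
    ".xml": "xml", ".bson": "bson", ".parquet": "parquet", ".avro": "avro",
    ".orc": "orc", ".xls": "xls", ".xlsx": "xlsx", ".pickle": "pickle",
}


def detect_format(filename):
    """Return (format, compression) guessed from the file name, or None for each."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    compression = None
    # A trailing compression suffix is peeled off before looking up the format.
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        compression = COMPRESSION_SUFFIXES[suffixes.pop()]
    fmt = FORMAT_SUFFIXES.get(suffixes[-1]) if suffixes else None
    return fmt, compression


print(detect_format("data.csv.gz"))
print(detect_format("users.parquet"))
```
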
## Database Support

- **SQL databases:** Any database supported by SQLAlchemy (PostgreSQL, MySQL, SQLite, SQL Server, Oracle, DuckDB, etc.)
- **NoSQL:** MongoDB (via pymongo)

## Common Tasks

### Adding a New Rule

1. Create or edit a YAML file in the `rules/` directory (or a custom rulepath)
2. Define the rule with the appropriate match type (text/ppr/func)
3. Set metadata: key, name, type (field/data), priority, contexts, langs, country
4. Test with `metacrafter scan file test.csv`

### Extending Rule Validation Functions

1. Create a Python module in an appropriate location
2. Define a function that accepts a string/value and returns a bool
3. Reference it in the rule YAML: `rule: package.module.function_name`
4. Ensure the function is importable (it may need to be added to the package)

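The steps above can be sketched as follows. The validator `is_hex_color` and the helper `resolve_function` are hypothetical names invented for this example; how Metacrafter itself resolves the dotted path in a `func` rule is not shown in this document, so the `importlib` approach is an assumption.

```python
import importlib


def is_hex_color(value):
    """Hypothetical validator: accepts a value and returns a bool."""
    if not isinstance(value, str) or len(value) != 7 or value[0] != "#":
        return False
    return all(ch in "0123456789abcdefABCDEF" for ch in value[1:])


def resolve_function(dotted_path):
    """Import `package.module.function_name` and return the function object."""
    module_name, _, func_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# A standard-library function stands in for a real rule function here.
check = resolve_function("keyword.iskeyword")
print(check("for"), is_hex_color("#1a2B3c"))
```

If the import fails in a sketch like this, the usual cause is step 4: the module is not on the import path or not part of an installed package.
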
### Adding Database Support

1. Ensure the SQLAlchemy driver is available
2. Use the connection string format: `dialect+driver://user:pass@host:port/db`
3. For new NoSQL databases, extend the `scan_mongodb()` pattern in `core.py`

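To make the connection string format above concrete, the sketch below splits a URL into its parts using only the standard library. It is an inspection aid, not what SQLAlchemy does when it creates an engine; the helper name `describe_connection` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit


def describe_connection(url):
    """Split `dialect+driver://user:pass@host:port/db` into named parts."""
    parts = urlsplit(url)
    # The scheme carries both the dialect and the optional driver.
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/") or None,
    }


info = describe_connection("postgresql+psycopg2://user:pass@localhost:5432/db")
print(info)
```

The password is deliberately left out of the returned dict so that it does not end up in logs when the result is printed.
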
### Working with Registry Integration

The registry client (`metacrafter/registry/client.py`) connects to metacrafter-registry to:
- Fetch datatype metadata
- Resolve datatype URLs
- Get rule metadata

The registry URL defaults to `https://registry.apicrafter.io` but can be configured.

## CLI Usage Patterns

### Basic File Scan
```bash
metacrafter scan file data.csv --format full -o results.json
```

### Database Scan
```bash
metacrafter scan sql "postgresql://user:pass@localhost/db" --format full
```

### PII Detection
```bash
metacrafter scan file users.csv --contexts pii --langs en --confidence 20.0
```

### Server Mode
```bash
metacrafter server run --host 127.0.0.1 --port 10399
```

## Python API Usage

```python
from metacrafter.core import CrafterCmd

cmd = CrafterCmd()
# Scan in-memory records (a list of dicts), using PII rules for English
report = cmd.scan_data(
    items=[{"email": "test@example.com"}],
    contexts="pii",
    langs="en",
    confidence=20.0,  # same value as in the CLI `--confidence` example
)
```

## Important Files

- `metacrafter/core.py` - Main CLI handler (2246 lines)
- `metacrafter/classify/processor.py` - Rule processing engine
- `metacrafter/classify/stats.py` - Statistics computation
- `metacrafter/config.py` - Configuration management
- `metacrafter/server/api.py` - API server endpoints

## Dependencies

Key dependencies:
- `pyparsing` - Rule pattern matching
- `iterabledata` - File format support
- `sqlalchemy` - Database connectivity
- `pymongo` - MongoDB support
- `qddate` - Date/time pattern detection
- `typer` - CLI framework
- `pydantic` - Data validation
- `phonenumbers` - Phone number validation

## Testing

Tests are in the `tests/` directory. Run them with:
```bash
python setup.py test
# or
pytest tests/
```

## Error Handling

Custom exceptions in `metacrafter/exceptions.py`:
- `MetacrafterError` - Base exception
- `ConfigurationError` - Config file issues
- `RuleCompilationError` - Rule parsing/compilation failures
- `FileProcessingError` - File I/O issues
- `DatabaseError` - Database connection/query issues
- `ValidationError` - Data validation failures

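A hierarchy like the one listed above lets callers catch every Metacrafter-specific failure through the base class. The sketch below defines local stand-in classes rather than importing the real ones, and it assumes the specific exceptions subclass `MetacrafterError`; the helper `load_rows` and the file path are hypothetical.

```python
class MetacrafterError(Exception):
    """Stand-in for the base exception."""


class ConfigurationError(MetacrafterError):
    """Stand-in for config file issues."""


class FileProcessingError(MetacrafterError):
    """Stand-in for file I/O issues."""


def load_rows(path):
    """Hypothetical helper: wrap low-level I/O errors in a domain exception."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read().splitlines()
    except OSError as exc:
        # Chain the original error so the root cause stays visible in tracebacks.
        raise FileProcessingError(f"cannot read {path}: {exc}") from exc


try:
    load_rows("/nonexistent/metacrafter-example.csv")
except MetacrafterError as exc:  # catches any subclass of the base exception
    caught = type(exc).__name__
    print(caught)
```
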
## Contributing Guidelines

1. Follow existing code style
2. Add tests for new features
3. Update documentation (README.md) for user-facing changes
4. Ensure backward compatibility when possible
5. Use type hints where appropriate
6. Handle errors gracefully with appropriate exceptions

## Registry Integration

Metacrafter integrates with `metacrafter-registry` to:
- Link detected datatypes to registry entries
- Provide datatype URLs in output
- Fetch rule metadata

The registry is optional: Metacrafter works standalone but provides richer metadata when the registry is available.


CHANGELOG.md

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
# Changelog

All notable changes to Metacrafter will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added
- **LLM-Based Classification**: New LLM-powered classification using Retrieval-Augmented Generation (RAG)
  - Support for multiple LLM providers: OpenAI, OpenRouter, Ollama, LM Studio, and Perplexity
  - Three classification modes: rules-only (default), LLM-only, and hybrid (rules + LLM fallback)
  - Vector-based similarity search using ChromaDB for retrieving relevant registry entries
  - Automatic index building from registry on first use
  - Configurable confidence thresholds for LLM results
  - CLI options: `--classification-mode`, `--llm-provider`, `--llm-model`, `--llm-api-key`, `--llm-base-url`, etc.
  - Configuration support via `.metacrafter` config file
  - Optional dependencies: `openai`, `chromadb`, `requests`
  - See [README](README.md#llm-based-classification) for usage examples
- **Apache Atlas Integration**: New export command to push Metacrafter scan results to Apache Atlas metadata catalog
  - Export PII labels, datatypes, and confidence scores as classifications and custom attributes
  - CLI command: `metacrafter export atlas`
  - Configuration support via `.metacrafter` config file or environment variables
  - Optional dependency: `requests`
  - See [Implementation Guide](devdocs/ISSUE_ATLAS_IMPLEMENTATION.md)
- **DataHub Integration**: New export command to push Metacrafter scan results to DataHub metadata catalog
  - Export PII labels, datatypes, and confidence scores as tags, glossary terms, and custom properties
  - CLI command: `metacrafter export datahub`
  - Configuration support via `.metacrafter` config file or environment variables
  - Optional dependency: `acryl-datahub[datahub-rest]`
  - See [Issue #24](https://github.com/apicrafter/metacrafter/issues/24) and [Implementation Guide](devdocs/ISSUE_24_IMPLEMENTATION.md)
- **OpenMetadata Integration**: New export command to push Metacrafter scan results to OpenMetadata metadata catalog
  - Export PII labels, datatypes, and confidence scores as tags, glossary terms, and custom properties
  - CLI command: `metacrafter export openmetadata`
  - Configuration support via `.metacrafter` config file or environment variables
  - Optional dependency: `openmetadata-ingestion`
  - See [Implementation Guide](devdocs/ISSUE_OPENMETADATA_IMPLEMENTATION.md)
- **Rules Inspection Commands**: New commands for inspecting and managing classification rules
  - `metacrafter rules list`: List all loaded rules with metadata (ID, name, type, match method, language, country, contexts)
    - Supports multiple output formats: table (default), JSON, YAML, CSV
    - Filterable by rule path and country codes
  - `metacrafter rules stats`: Display aggregate statistics about loaded rules
    - Shows counts of field rules, data rules, languages, contexts, country codes, and date/time patterns

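The three classification modes listed above (rules-only, LLM-only, hybrid with LLM fallback) can be pictured with the sketch below. It is not the shipped implementation: the mode strings `"rules"`, `"llm"` and `"hybrid"`, the function `classify`, the `(datatype, confidence)` tuple shape, and the two stub classifiers are all assumptions made for illustration.

```python
def classify(values, rules_fn, llm_fn, mode="rules", llm_threshold=50.0):
    """Return (datatype, confidence, source), or None if nothing qualifies."""
    if mode not in ("rules", "llm", "hybrid"):
        raise ValueError(f"unknown mode: {mode}")
    if mode in ("rules", "hybrid"):
        hit = rules_fn(values)
        if hit is not None:
            return (*hit, "rules")
        if mode == "rules":
            return None
    # Reached for "llm", or for "hybrid" when the rules found nothing.
    hit = llm_fn(values)
    if hit is not None and hit[1] >= llm_threshold:
        return (*hit, "llm")
    return None


def fake_rules(values):
    """Stub rule classifier: recognizes only values containing '@'."""
    return ("email", 100.0) if all("@" in v for v in values) else None


def fake_llm(values):
    """Stub LLM classifier: always answers with a fixed guess."""
    return ("person_name", 72.0)


print(classify(["a@b.c"], fake_rules, fake_llm, mode="hybrid"))
print(classify(["Ann Lee"], fake_rules, fake_llm, mode="hybrid"))
```

Raising `llm_threshold` above the LLM's reported confidence makes the hybrid mode return nothing for fields the rules cannot classify, which is one way a configurable threshold could be used.
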
### Changed
- Updated configuration system to support LLM, DataHub, Apache Atlas, and OpenMetadata settings
- Enhanced CLI with new `export` command group, `rules` command group, and LLM classification options
- Extended `MetacrafterConfig` with LLM-related fields and validation
- Improved error handling for missing optional dependencies (graceful degradation)

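Graceful degradation for missing optional dependencies is commonly done by checking for a package before using it. The sketch below shows one such check with the standard library; the helper names `missing_packages` and `llm_available` are hypothetical, and Metacrafter's own mechanism may differ.

```python
import importlib.util


def missing_packages(import_names):
    """Return the top-level import names that cannot be found."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


def llm_available():
    """Report whether the optional LLM dependencies appear to be installed."""
    missing = missing_packages(["openai", "chromadb", "requests"])
    if missing:
        # Degrade gracefully: explain what is missing instead of raising.
        print("LLM classification disabled; missing: " + ", ".join(missing))
        return False
    return True


print(llm_available())
```
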
## [0.0.4] - Previous Release

### Added
- Support for multiple file formats (CSV, JSON, JSONL, XML, Parquet, Avro, ORC, Excel, BSON, Pickle)
- Support for compressed files (gzip, bzip2, xz, lz4, zstandard, Brotli, Snappy, ZIP)
- Database scanning for SQL databases (via SQLAlchemy) and MongoDB
- Rule-based classification system with 111+ rules
- Date detection with 312+ patterns
- Context and language filtering
- Built-in API server
- Statistics and field analysis

### Changed
- Improved error handling and logging
- Enhanced output formats (table, JSON, YAML, CSV)

## [0.0.3] - Earlier Release

Initial public release with core functionality.

---

## Notes

- **LLM Classification**: The LLM classification feature requires optional dependencies: `openai`, `chromadb`, and `requests`. Install them with `pip install openai chromadb requests` if you plan to use LLM-based classification. The feature gracefully degrades if dependencies are missing.
- **Apache Atlas Integration**: The Apache Atlas integration requires the optional `requests` package. Install it separately if you plan to use this feature.
- **DataHub Integration**: The DataHub integration requires the optional `acryl-datahub[datahub-rest]` package. Install it separately if you plan to use this feature.
- **OpenMetadata Integration**: The OpenMetadata integration requires the optional `openmetadata-ingestion` package. Install it separately if you plan to use this feature.
- **Configuration**: LLM, DataHub, Apache Atlas, and OpenMetadata settings can be configured in the `.metacrafter` config file or via environment variables.

