| 1 | +# AGENTS.md - Metacrafter |
| 2 | + |
| 3 | +This document provides guidance for AI agents working with the Metacrafter codebase. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +Metacrafter is a Python command-line tool and library for labeling table fields and data files. It uses rule-based classification to identify: |
| 8 | +- Personally Identifiable Information (PII) |
| 9 | +- Person names, surnames, middle names |
| 10 | +- Basic identifiers (UUID/GUID, email, phone, etc.) |
| 11 | +- Country/language-specific identifiers |
| 12 | +- Dates and times |
| 13 | +- Various semantic data types |
| 14 | + |
| 15 | +## Repository Structure |
| 16 | + |
| 17 | +``` |
| 18 | +metacrafter/ |
| 19 | +├── metacrafter/            # Main package |
| 20 | +│   ├── __init__.py         # Package initialization, exports exceptions |
| 21 | +│   ├── __main__.py         # CLI entry point |
| 22 | +│   ├── core.py             # Main CLI command handler (CrafterCmd) |
| 23 | +│   ├── config.py           # Configuration file loader (.metacrafter) |
| 24 | +│   ├── exceptions.py       # Custom exception classes |
| 25 | +│   ├── classify/           # Core classification engine |
| 26 | +│   │   ├── processor.py    # RulesProcessor - loads and applies rules |
| 27 | +│   │   ├── stats.py        # Analyzer - field statistics and analysis |
| 28 | +│   │   └── utils.py        # Utility functions |
| 29 | +│   ├── core/               # Core validation utilities |
| 30 | +│   │   └── validators.py   # Validation functions |
| 31 | +│   ├── registry/           # Registry client integration |
| 32 | +│   │   └── client.py       # Client for metacrafter-registry |
| 33 | +│   └── server/             # API server components |
| 34 | +│       ├── api.py          # API endpoints |
| 35 | +│       └── manager.py      # Server management |
| 36 | +├── rules/                  # Default rule files (YAML) |
| 37 | +│   ├── basic/              # Basic identifier rules |
| 38 | +│   ├── common/             # Common rules (dates, internet, etc.) |
| 39 | +│   ├── pii/                # PII detection rules |
| 40 | +│   ├── en/                 # English-specific rules |
| 41 | +│   ├── ru/                 # Russian-specific rules |
| 42 | +│   └── fr/                 # French-specific rules |
| 43 | +├── tests/                  # Test suite |
| 44 | +├── scripts/                # Utility scripts |
| 45 | +└── setup.py                # Package setup |
| 46 | +``` |
| 47 | + |
| 48 | +## Key Components |
| 49 | + |
| 50 | +### 1. Core Engine (`metacrafter/core.py`) |
| 51 | + |
| 52 | +The `CrafterCmd` class is the main entry point for all operations: |
| 53 | +- `scan_file()` - Scan data files (CSV, JSON, Parquet, etc.) |
| 54 | +- `scan_db()` - Scan SQL databases |
| 55 | +- `scan_mongodb()` - Scan MongoDB databases |
| 56 | +- `scan_data()` - Scan in-memory data (list of dicts) |
| 57 | +- `scan_bulk()` - Scan multiple files in a directory |
| 58 | + |
| 59 | +### 2. Rules Processor (`metacrafter/classify/processor.py`) |
| 60 | + |
| 61 | +`RulesProcessor` handles: |
| 62 | +- Loading YAML rule files from configured paths |
| 63 | +- Compiling rules (text, PyParsing, function-based) |
| 64 | +- Applying rules to field names and data values |
| 65 | +- Filtering by context, language, country codes |
| 66 | +- Confidence scoring |
| 67 | + |
| 68 | +### 3. Statistics Analyzer (`metacrafter/classify/stats.py`) |
| 69 | + |
| 70 | +`Analyzer` computes field statistics: |
| 71 | +- Data type detection (str, int, float, dict, etc.) |
| 72 | +- Uniqueness metrics |
| 73 | +- Length statistics (min, max, avg) |
| 74 | +- Character analysis (digits, alphas, special chars) |
| 75 | +- Dictionary value detection |
| 76 | + |
| 77 | +### 4. Configuration (`metacrafter/config.py`) |
| 78 | + |
| 79 | +`ConfigLoader` reads `.metacrafter` YAML config files from: |
| 80 | +- Current working directory |
| 81 | +- User home directory (`~/.metacrafter`) |
| 82 | + |
| 83 | +Configuration options: |
| 84 | +- `rulepath`: List of directories containing rule YAML files |
| 85 | + |
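The lookup locations listed above can be sketched as follows. This is a minimal illustration, not the actual `ConfigLoader`: the helper name `find_config_file` is hypothetical, it only locates the file (parsing the YAML is left out), and it assumes the current working directory takes precedence over the home directory, which the list order suggests but does not state.

```python
from pathlib import Path
from typing import Iterable, Optional

CONFIG_NAME = ".metacrafter"


def find_config_file(search_dirs: Optional[Iterable[Path]] = None) -> Optional[Path]:
    """Return the first existing .metacrafter file, or None if there is none."""
    if search_dirs is None:
        # Assumed precedence: working directory first, then the home directory.
        search_dirs = [Path.cwd(), Path.home()]
    for directory in search_dirs:
        candidate = Path(directory) / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None
```

Passing `search_dirs` explicitly keeps the helper easy to test against temporary directories.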
| 86 | +## Rule System |
| 87 | + |
| 88 | +Rules are YAML files that define how to identify data types. There are three match types: |
| 89 | + |
| 90 | +1. **text** - Exact text matching (for field names) |
| 91 | +   ```yaml |
| 92 | +   midname: |
| 93 | +     key: person_midname |
| 94 | +     match: text |
| 95 | +     type: field |
| 96 | +     rule: midname,secondname,middlename |
| 97 | +   ``` |
| 98 | + |
| 99 | +2. **ppr** - PyParsing pattern matching (for data values) |
| 100 | +   ```yaml |
| 101 | +   rukadastr: |
| 102 | +     key: rukadastr |
| 103 | +     match: ppr |
| 104 | +     type: data |
| 105 | +     rule: Word(nums, min=1, max=2) + Literal(':')... |
| 106 | +   ``` |
| 107 | + |
| 108 | +3. **func** - Python function validation |
| 109 | +   ```yaml |
| 110 | +   runpabyfunc: |
| 111 | +     key: runpa |
| 112 | +     match: func |
| 113 | +     type: data |
| 114 | +     rule: metacrafter.rules.ru.gov.is_ru_law |
| 115 | +   ``` |
| 116 | + |
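To make the `text` and `func` match types above more concrete, here is a minimal standard-library sketch. It is not the `RulesProcessor` implementation: the function names `match_text_rule` and `resolve_func_rule` are hypothetical, and the case-insensitive comparison is an assumption. The `ppr` type is omitted because it depends on `pyparsing`.

```python
import importlib
from typing import Callable


def match_text_rule(rule: str, field_name: str) -> bool:
    """Exact match of a field name against a comma-separated list of names.

    Case-insensitive comparison is an assumption made for this sketch.
    """
    names = {name.strip().lower() for name in rule.split(",")}
    return field_name.strip().lower() in names


def resolve_func_rule(dotted_path: str) -> Callable[[str], bool]:
    """Import `package.module.function` and return the function object."""
    module_path, _, func_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, func_name)


# A `text` rule is checked against the field name.
print(match_text_rule("midname,secondname,middlename", "MiddleName"))

# A `func` rule names an importable callable; `keyword.iskeyword` from the
# standard library stands in for a real validator here.
is_keyword = resolve_func_rule("keyword.iskeyword")
print(is_keyword("for"))
```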
| 117 | +## Supported File Formats |
| 118 | + |
| 119 | +Metacrafter uses the `iterabledata` package for file format support: |
| 120 | + |
| 121 | +- **Text formats:** CSV, TSV, JSON, JSONL, XML |
| 122 | +- **Binary formats:** BSON, Parquet, Avro, ORC, Excel (XLS/XLSX), Pickle |
| 123 | +- **Compression:** gzip, bzip2, xz, lz4, zstandard, Brotli, Snappy, ZIP |
| 124 | + |
| 125 | +Format detection is automatic based on file extension. |
| 126 | + |
| 127 | +## Database Support |
| 128 | + |
| 129 | +- **SQL databases:** Any database supported by SQLAlchemy (PostgreSQL, MySQL, SQLite, SQL Server, Oracle, DuckDB, etc.) |
| 130 | +- **NoSQL:** MongoDB (via pymongo) |
| 131 | + |
| 132 | +## Common Tasks |
| 133 | + |
| 134 | +### Adding a New Rule |
| 135 | + |
| 136 | +1. Create or edit a YAML file in the `rules/` directory (or a custom rulepath) |
| 137 | +2. Define the rule with the appropriate match type (text/ppr/func) |
| 138 | +3. Set metadata: key, name, type (field/data), priority, contexts, langs, country |
| 139 | +4. Test with `metacrafter scan file test.csv` |
| 140 | + |
| 141 | +### Extending Rule Validation Functions |
| 142 | + |
| 143 | +1. Create a Python module in an appropriate location |
| 144 | +2. Define a function that accepts a string/value and returns a bool |
| 145 | +3. Reference it in the rule YAML: `rule: package.module.function_name` |
| 146 | +4. Ensure the function is importable (you may need to add it to the package) |
| 147 | + |
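A validation function of the shape described in the steps above might look like this. It is a sketch only: `is_uuid` is a hypothetical example and is not taken from the Metacrafter codebase.

```python
import uuid


def is_uuid(value: str) -> bool:
    """Return True if `value` is a canonical hyphenated UUID string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        # Malformed strings raise ValueError; non-string input raises
        # AttributeError or TypeError. All of them mean "no match".
        return False
    # uuid.UUID also accepts braces, a urn: prefix and missing hyphens;
    # comparing against the canonical form rejects those variants.
    return str(parsed) == value.lower()
```

If the function lived in a module such as `mypackage/validators.py` (a hypothetical path), the rule would reference it as `rule: mypackage.validators.is_uuid`. Returning `False` instead of raising on bad input keeps the function safe to call on arbitrary column values.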
| 148 | +### Adding Database Support |
| 149 | + |
| 150 | +1. Ensure the SQLAlchemy driver for the target database is installed |
| 151 | +2. Use the connection string format: `dialect+driver://user:pass@host:port/db` |
| 152 | +3. For new NoSQL databases, follow the `scan_mongodb()` pattern in `core.py` |
| 153 | + |
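The connection string format above breaks down into the parts shown in this sketch. It uses only the standard library for illustration; the helper name `describe_connection_string` is hypothetical, and SQLAlchemy ships its own URL parser (`sqlalchemy.engine.make_url`), which is the better choice in real code.

```python
from urllib.parse import urlsplit


def describe_connection_string(url: str) -> dict:
    """Split a dialect+driver://user:pass@host:port/db URL into its parts."""
    parts = urlsplit(url)
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,  # the "+driver" part is optional
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL has no explicit port
        "database": parts.path.lstrip("/") or None,
    }


print(describe_connection_string("postgresql+psycopg2://user:pass@localhost:5432/db"))
```

The password is deliberately left out of the returned dictionary so that it does not end up in logs.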
| 154 | +### Working with Registry Integration |
| 155 | + |
| 156 | +The registry client (`metacrafter/registry/client.py`) connects to metacrafter-registry to: |
| 157 | +- Fetch datatype metadata |
| 158 | +- Resolve datatype URLs |
| 159 | +- Get rule metadata |
| 160 | + |
| 161 | +Registry URL defaults to `https://registry.apicrafter.io` but can be configured. |
| 162 | + |
| 163 | +## CLI Usage Patterns |
| 164 | + |
| 165 | +### Basic File Scan |
| 166 | +```bash |
| 167 | +metacrafter scan file data.csv --format full -o results.json |
| 168 | +``` |
| 169 | + |
| 170 | +### Database Scan |
| 171 | +```bash |
| 172 | +metacrafter scan sql "postgresql://user:pass@localhost/db" --format full |
| 173 | +``` |
| 174 | + |
| 175 | +### PII Detection |
| 176 | +```bash |
| 177 | +metacrafter scan file users.csv --contexts pii --langs en --confidence 20.0 |
| 178 | +``` |
| 179 | + |
| 180 | +### Server Mode |
| 181 | +```bash |
| 182 | +metacrafter server run --host 127.0.0.1 --port 10399 |
| 183 | +``` |
| 184 | + |
| 185 | +## Python API Usage |
| 186 | + |
| 187 | +```python |
| 188 | +from metacrafter.core import CrafterCmd |
| 189 | + |
| 190 | +cmd = CrafterCmd() |
| 191 | +report = cmd.scan_data( |
| 192 | +    items=[{"email": "test@example.com"}], |
| 193 | +    contexts="pii", |
| 194 | +    langs="en", |
| 195 | +    confidence=20.0 |
| 196 | +) |
| 197 | +``` |
| 198 | + |
| 199 | +## Important Files |
| 200 | + |
| 201 | +- `metacrafter/core.py` - Main CLI handler (2246 lines) |
| 202 | +- `metacrafter/classify/processor.py` - Rule processing engine |
| 203 | +- `metacrafter/classify/stats.py` - Statistics computation |
| 204 | +- `metacrafter/config.py` - Configuration management |
| 205 | +- `metacrafter/server/api.py` - API server endpoints |
| 206 | + |
| 207 | +## Dependencies |
| 208 | + |
| 209 | +Key dependencies: |
| 210 | +- `pyparsing` - Rule pattern matching |
| 211 | +- `iterabledata` - File format support |
| 212 | +- `sqlalchemy` - Database connectivity |
| 213 | +- `pymongo` - MongoDB support |
| 214 | +- `qddate` - Date/time pattern detection |
| 215 | +- `typer` - CLI framework |
| 216 | +- `pydantic` - Data validation |
| 217 | +- `phonenumbers` - Phone number validation |
| 218 | + |
| 219 | +## Testing |
| 220 | + |
| 221 | +Tests are in the `tests/` directory. Run them with: |
| 222 | +```bash |
| 223 | +python setup.py test |
| 224 | +# or |
| 225 | +pytest tests/ |
| 226 | +``` |
| 227 | + |
| 228 | +## Error Handling |
| 229 | + |
| 230 | +Custom exceptions in `metacrafter/exceptions.py`: |
| 231 | +- `MetacrafterError` - Base exception |
| 232 | +- `ConfigurationError` - Config file issues |
| 233 | +- `RuleCompilationError` - Rule parsing/compilation failures |
| 234 | +- `FileProcessingError` - File I/O issues |
| 235 | +- `DatabaseError` - Database connection/query issues |
| 236 | +- `ValidationError` - Data validation failures |
| 237 | + |
| 238 | +## Contributing Guidelines |
| 239 | + |
| 240 | +1. Follow existing code style |
| 241 | +2. Add tests for new features |
| 242 | +3. Update documentation (README.md) for user-facing changes |
| 243 | +4. Ensure backward compatibility when possible |
| 244 | +5. Use type hints where appropriate |
| 245 | +6. Handle errors gracefully with appropriate exceptions |
| 246 | + |
| 247 | +## Registry Integration |
| 248 | + |
| 249 | +Metacrafter integrates with `metacrafter-registry` to: |
| 250 | +- Link detected datatypes to registry entries |
| 251 | +- Provide datatype URLs in output |
| 252 | +- Fetch rule metadata |
| 253 | + |
| 254 | +The registry is optional: Metacrafter works standalone but provides richer metadata when the registry is available. |
| 255 | + |