| 1 | +# AGENTS.md - AI Agent & Contributor Guidelines |
| 2 | + |
| 3 | +This document provides essential context for AI agents (Claude, Copilot, etc.) and human contributors working on the Transferia codebase. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +**Transferia** is an open-source, cloud-native ELT (Extract, Load, Transform) ingestion engine built in Go. It enables seamless, high-performance data movement between diverse database systems at scale. |
| 8 | + |
| 9 | +### Key Capabilities |
| 10 | +- **Snapshot**: One-time bulk data transfer with table-level consistency |
| 11 | +- **Replication**: Continuous CDC (Change Data Capture) streaming |
| 12 | +- **Transformations**: Row-level transformations during transfer (rename, mask, filter, SQL) |
| 13 | +- **Multi-source**: PostgreSQL, MySQL, MongoDB, Kafka, S3, and more |
| 14 | +- **Multi-destination**: ClickHouse, PostgreSQL, Kafka, S3, and more |
| 15 | + |
| 16 | +## Repository Structure |
| 17 | + |
| 18 | +``` |
| 19 | +transferia/ |
| 20 | +├── cmd/trcli/ # CLI entry point (replicate, upload, check, validate) |
| 21 | +├── pkg/ |
| 22 | +│ ├── abstract/ # Core interfaces (Source, Sink, Storage, Transfer) |
| 23 | +│ ├── providers/ # Database adapters (postgres, mysql, mongo, clickhouse, kafka) |
| 24 | +│ ├── dataplane/ # Runtime execution engine |
| 25 | +│ ├── middlewares/ # Cross-cutting concerns (retry, filter, metrics) |
| 26 | +│ ├── transformer/ # Pluggable transformers |
| 27 | +│ ├── parsers/ # Data format parsers (JSON, Avro, Parquet) |
| 28 | +│ ├── coordinator/ # Multi-node coordination (memory, S3) |
| 29 | +│ └── connection/ # Connection management |
| 30 | +├── internal/ # Internal packages (logger, config, metrics) |
| 31 | +├── tests/ |
| 32 | +│ ├── e2e-core/ # End-to-end tests (pg2ch, mysql2ch, mongo2ch) |
| 33 | +│ ├── helpers/ # Test utilities and helpers |
| 34 | +│ ├── canon/ # Type/schema validation tests |
| 35 | +│ └── storage/ # Provider storage tests |
| 36 | +├── recipe/ # Test container recipes |
| 37 | +├── examples/ # Configuration examples |
| 38 | +└── docs/ # Documentation |
| 39 | +``` |
| 40 | + |
| 41 | +## Core Abstractions |
| 42 | + |
| 43 | +### Key Interfaces and Types (pkg/abstract/) |
| 44 | + |
| 45 | +1. **Storage** - One-time data reader for snapshots |
| 46 | +2. **Source** - Streaming data reader for CDC replication |
| 47 | +3. **Sink** - Data writer with async push semantics |
| 48 | +4. **Transformer** - Row-level data transformation |
| 49 | +5. **ChangeItem** - Core unit of data transfer (represents a row operation) |
| 50 | + |
| 51 | +### Provider Pattern |
| 52 | + |
| 53 | +All providers in `pkg/providers/` follow this structure: |
| 54 | +```go |
| 55 | +type Provider struct { |
| 56 | + logger log.Logger |
| 57 | + registry metrics.Registry |
| 58 | + cp coordinator.Coordinator |
| 59 | + transfer *model.Transfer |
| 60 | +} |
| 61 | +``` |
| 62 | + |
| 63 | +Providers register via `init()` with: |
| 64 | +- `providers.Register(ProviderType, New)` |
| 65 | +- `model.RegisterSource/RegisterDestination` |
| 66 | +- `abstract.RegisterProviderName` |
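The registration flow above can be sketched as a compile-time registry that each provider package fills from its own `init()`. This is a minimal illustration only: `ProviderType`, `Factory`, `Register`, and `Lookup` as written here are hypothetical stand-ins, and the real signatures in `pkg/providers/` may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// ProviderType identifies a provider (hypothetical stand-in).
type ProviderType string

// Provider is a placeholder for whatever a real provider exposes.
type Provider interface {
	Type() ProviderType
}

// Factory builds a provider instance (assumed signature, for illustration).
type Factory func() Provider

var (
	mu       sync.RWMutex
	registry = map[ProviderType]Factory{}
)

// Register stores a factory; registering the same type twice is treated
// as a programming error and panics at startup.
func Register(t ProviderType, f Factory) {
	mu.Lock()
	defer mu.Unlock()
	if _, exists := registry[t]; exists {
		panic(fmt.Sprintf("provider %q registered twice", t))
	}
	registry[t] = f
}

// Lookup returns the factory registered for t, if any.
func Lookup(t ProviderType) (Factory, bool) {
	mu.RLock()
	defer mu.RUnlock()
	f, ok := registry[t]
	return f, ok
}

type demoProvider struct{}

func (demoProvider) Type() ProviderType { return "demo" }

// init runs before main, so the provider is available without explicit wiring.
func init() {
	Register("demo", func() Provider { return demoProvider{} })
}

func main() {
	if f, ok := Lookup("demo"); ok {
		fmt.Println("found provider:", f().Type())
	}
}
```

Because registration happens in `init()`, a provider becomes available simply by being imported, which is what makes the plugin set a compile-time decision.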
| 67 | + |
| 68 | +## Coding Guidelines |
| 69 | + |
| 70 | +### Error Handling |
| 71 | + |
| 72 | +- Use `xerrors.Errorf("context: %w", err)` for wrapping |
| 73 | +- Domain-specific error types exist: |
| 74 | + - `FatalError` - Stops transfer, forbids restart |
| 75 | + - `RetriablePartUploadError` - Transient, eligible for retry |
| 76 | + - `TableUploadError` - Specific upload failures |
| 77 | +- Never ignore errors silently; if an error is not returned, log it |
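A small sketch of the wrapping rule, assuming a local `fatalError` type as a stand-in for the project's domain errors. To stay dependency-free it uses the standard library's `fmt.Errorf` with `%w` in place of `xerrors.Errorf`; `loadTable` and `runSnapshot` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// fatalError is a local stand-in for a domain-specific error type.
type fatalError struct{ cause error }

func (e *fatalError) Error() string { return "fatal: " + e.cause.Error() }

// Unwrap keeps the underlying cause reachable for errors.Is / errors.As.
func (e *fatalError) Unwrap() error { return e.cause }

var errNoTable = errors.New("table not found")

// loadTable is a hypothetical operation that always fails fatally.
func loadTable(name string) error {
	return &fatalError{cause: fmt.Errorf("load %s: %w", name, errNoTable)}
}

// runSnapshot adds context while preserving the chain via %w.
func runSnapshot() error {
	if err := loadTable("users"); err != nil {
		return fmt.Errorf("snapshot failed: %w", err)
	}
	return nil
}

// isFatal reports whether any error in the chain is a *fatalError.
func isFatal(err error) bool {
	var fe *fatalError
	return errors.As(err, &fe)
}

func main() {
	err := runSnapshot()
	fmt.Println(err)
	fmt.Println("fatal:", isFatal(err), "missing table:", errors.Is(err, errNoTable))
}
```

Wrapping with `%w` rather than `%v` is what lets callers several layers up decide whether a failure is retriable or fatal without parsing message strings.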
| 78 | + |
| 79 | +### Logging |
| 80 | + |
| 81 | +- Use structured logging via the `logger` package (Zap-based) |
| 82 | +- Prefer `logger.Log.Info("message", log.String("key", value))` over `Infof` |
| 83 | +- Log levels: DEBUG, INFO, WARNING, ERROR, FATAL |
| 84 | +- Never log credentials or sensitive data |
| 85 | + |
| 86 | +### Testing |
| 87 | + |
| 88 | +- Place tests in `tests/` directory, not alongside code |
| 89 | +- Use testcontainers via `recipe/` package for integration tests |
| 90 | +- Follow the pattern: `TestSnapshotAndIncrement`, `TestReplication` |
| 91 | +- Use helpers: `helpers.Activate()`, `helpers.CompareStorages()` |
| 92 | +- Wait helpers for async operations: `WaitEqualRowsCount()`, `WaitCond()` |
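For readers unfamiliar with wait helpers, the idea can be sketched as a polling loop bounded by a context. The `waitCond` function below is a hypothetical illustration written for this example; it is not the signature of `WaitCond()` in `tests/helpers/`.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// waitCond polls cond every interval until it returns true or ctx is done.
func waitCond(ctx context.Context, interval time.Duration, cond func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if cond() {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("condition not met: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// demoWait simulates an asynchronous writer and waits for three rows.
func demoWait() (int64, error) {
	var rows atomic.Int64
	go func() {
		for i := 0; i < 3; i++ {
			time.Sleep(10 * time.Millisecond)
			rows.Add(1)
		}
	}()
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := waitCond(ctx, 5*time.Millisecond, func() bool { return rows.Load() >= 3 })
	return rows.Load(), err
}

func main() {
	n, err := demoWait()
	fmt.Println("rows:", n, "err:", err)
}
```

Polling with a deadline avoids fixed `time.Sleep` calls in tests, which tend to be either flaky or slow.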
| 93 | + |
| 94 | +### Concurrency |
| 95 | + |
| 96 | +- Always use context for cancellation |
| 97 | +- Use `sync.WaitGroup` for goroutine lifecycle |
| 98 | +- Prefer buffered channels to avoid deadlocks |
| 99 | +- Use `sync.Once` for one-time cleanup operations |
| 100 | +- Avoid mutexes in hot paths; consider atomics |
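The guidelines above can be combined in one small worker pool. This is a sketch under simple assumptions (a fixed batch of integers, squared by each worker); `processAll` and `demoSum` are hypothetical names, not project functions.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// processAll fans items out to workers and returns the sum of their squares.
func processAll(ctx context.Context, items []int, workers int) (int, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	// Buffered to len(items): neither the producer nor the workers can block.
	jobs := make(chan int, len(items))
	results := make(chan int, len(items))

	// sync.Once makes closing jobs safe on both the normal and deferred path.
	var closeOnce sync.Once
	closeJobs := func() { closeOnce.Do(func() { close(jobs) }) }
	defer closeJobs()

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return
				case n, ok := <-jobs:
					if !ok {
						return
					}
					results <- n * n
				}
			}
		}()
	}

	for _, n := range items {
		jobs <- n
	}
	closeJobs()
	wg.Wait() // every worker has exited, so closing results is safe
	close(results)

	if err := ctx.Err(); err != nil {
		return 0, err
	}
	sum := 0
	for r := range results {
		sum += r
	}
	return sum, nil
}

// demoSum runs the pool over a fixed input with two workers.
func demoSum() (int, error) {
	return processAll(context.Background(), []int{1, 2, 3, 4}, 2)
}

func main() {
	sum, err := demoSum()
	fmt.Println("sum of squares:", sum, "err:", err)
}
```

Sizing both channels to the batch length is what rules out deadlock here; for unbounded streams a smaller buffer plus a dedicated consumer goroutine would be the usual alternative.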
| 101 | + |
| 102 | +## Security Considerations |
| 103 | + |
| 104 | +### Critical Rules |
| 105 | + |
| 106 | +1. **Never hardcode credentials** - Use environment variables |
| 107 | +2. **Always validate TLS certificates** - Don't set `InsecureSkipVerify: true` in production |
| 108 | +3. **Sanitize SQL inputs** - Use parameterized queries, never string concatenation |
| 109 | +4. **Redact secrets in logs** - Never log passwords, tokens, or keys |
| 110 | +5. **Validate database filters** - User-provided filters can be injection vectors |
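Rules 1 and 2 can be illustrated with standard-library calls only. The sketch below assumes a hypothetical environment variable name (`DEMO_DB_PASSWORD`) and hypothetical helpers `newTLSConfig` and `passwordFromEnv`; it never sets `InsecureSkipVerify`, so an empty CA bundle falls back to the system roots instead of disabling verification.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// newTLSConfig returns a config that always verifies the server certificate.
// With an empty caPEM it relies on the system root pool (RootCAs stays nil).
func newTLSConfig(caPEM []byte) (*tls.Config, error) {
	cfg := &tls.Config{MinVersion: tls.VersionTLS12}
	if len(caPEM) == 0 {
		return cfg, nil
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no valid certificates found in CA bundle")
	}
	cfg.RootCAs = pool
	return cfg, nil
}

// passwordFromEnv reads a secret from the environment instead of source code.
// The error names the variable but never echoes a value.
func passwordFromEnv(name string) (string, error) {
	v, ok := os.LookupEnv(name)
	if !ok || v == "" {
		return "", fmt.Errorf("environment variable %s is not set", name)
	}
	return v, nil
}

func main() {
	cfg, err := newTLSConfig(nil)
	fmt.Println("verification skipped:", cfg.InsecureSkipVerify, "err:", err)

	// DEMO_DB_PASSWORD is a made-up variable name for this example.
	if _, err := passwordFromEnv("DEMO_DB_PASSWORD"); err != nil {
		fmt.Println("password lookup:", err)
	}
}
```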
| 111 | + |
| 112 | +### Known Security Debt |
| 113 | + |
| 114 | +- Some TLS configurations default to `InsecureSkipVerify` when no certificate is provided |
| 115 | +- `SecretString` type alias provides no actual protection |
| 116 | +- Test credentials exist in recipe files (acceptable for tests only) |
| 117 | + |
| 118 | +## Provider-Specific Notes |
| 119 | + |
| 120 | +### PostgreSQL |
| 121 | +- Most complex provider with full replication support |
| 122 | +- Uses pgx library with custom type mapping |
| 123 | +- Has DBLog support for alternative loading |
| 124 | +- System tables: `__consumer_keeper`, `__data_transfer_lsn` |
| 125 | + |
| 126 | +### MySQL |
| 127 | +- Supports both file-based and GTID position tracking |
| 128 | +- Character set handling (`utf8mb3` vs `utf8mb4`) |
| 129 | +- System tables: `__table_transfer_progress`, `__tm_keeper` |
| 130 | + |
| 131 | +### MongoDB |
| 132 | +- Simpler, document-oriented provider |
| 133 | +- Uses cluster time instead of LSN for position |
| 134 | +- System collection: `__dt_cluster_time` |
| 135 | + |
| 136 | +### ClickHouse |
| 137 | +- **Snapshot-only** - No replication source support |
| 138 | +- Implements different interfaces (`Abstract2Provider`, `AsyncSinker`) |
| 139 | +- HTTP and Native protocol support |
| 140 | + |
| 141 | +## Common Tasks |
| 142 | + |
| 143 | +### Adding a New Provider |
| 144 | + |
| 145 | +1. Create package in `pkg/providers/newprovider/` |
| 146 | +2. Implement required interfaces (Storage, Source, Sink as needed) |
| 147 | +3. Register in `init()` function |
| 148 | +4. Add test recipes in `recipe/` |
| 149 | +5. Create e2e tests in `tests/e2e-core/` |
| 150 | + |
| 151 | +### Adding a Transformer |
| 152 | + |
| 153 | +1. Add implementation in `pkg/transformer/registry/` |
| 154 | +2. Register with transformer registry |
| 155 | +3. Update configuration model if needed |
| 156 | +4. Add tests |
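As a rough picture of what a row-level transformer does, the sketch below masks selected columns. The `Row` type, the `Transformer` interface, and `maskColumns` are simplified, hypothetical stand-ins; the real contract in `pkg/transformer/` operates on the project's own types and will differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Row is a simplified stand-in for a change item: column name to value.
type Row map[string]any

// Transformer is a hypothetical row-level transformation contract.
type Transformer interface {
	Apply(r Row) Row
}

// maskColumns replaces the listed columns with a SHA-256 hex digest.
// Unsalted hashing is used only to keep the illustration short.
type maskColumns struct{ columns []string }

// Apply returns a masked copy and leaves the input row untouched.
func (m maskColumns) Apply(r Row) Row {
	out := make(Row, len(r))
	for k, v := range r {
		out[k] = v
	}
	for _, c := range m.columns {
		if v, ok := out[c]; ok {
			sum := sha256.Sum256([]byte(fmt.Sprint(v)))
			out[c] = hex.EncodeToString(sum[:])
		}
	}
	return out
}

func main() {
	var t Transformer = maskColumns{columns: []string{"email"}}
	in := Row{"id": 1, "email": "a@example.com"}
	out := t.Apply(in)
	fmt.Println("id:", out["id"], "masked email length:", len(out["email"].(string)))
}
```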
| 157 | + |
| 158 | +### Running Tests |
| 159 | + |
| 160 | +```bash |
| 161 | +# Quick core tests |
| 162 | +make test-core |
| 163 | + |
| 164 | +# Full CDC test suite |
| 165 | +make test-cdc-full |
| 166 | + |
| 167 | +# Specific wave |
| 168 | +make test-cdc-wave WAVE=providers |
| 169 | + |
| 170 | +# Specific layer |
| 171 | +make test-layer LAYER=e2e-core DB=pg2ch |
| 172 | +``` |
| 173 | + |
| 174 | +## Build Commands |
| 175 | + |
| 176 | +```bash |
| 177 | +make build # Build trcli binary |
| 178 | +make docker # Build Docker image |
| 179 | +make clean # Remove artifacts |
| 180 | +make lint # Run linters |
| 181 | +``` |
| 182 | + |
| 183 | +## Important Files |
| 184 | + |
| 185 | +- `pkg/abstract/model/transfer.go` - Transfer model definition |
| 186 | +- `pkg/abstract/endpoint.go` - Provider interface definitions |
| 187 | +- `pkg/providers/provider.go` - Provider registration |
| 188 | +- `cmd/trcli/config/model.go` - CLI configuration model |
| 189 | +- `.golangci.yml` - Linter configuration |
| 190 | + |
| 191 | +## Code Style |
| 192 | + |
| 193 | +- Follow Go idioms and effective Go guidelines |
| 194 | +- Use `gofmt` and linters (`.golangci.yml` configured) |
| 195 | +- Interface names: clear role nouns (Source, Sink, Transformer) |
| 196 | +- File names: lowercase with underscores |
| 197 | +- Keep functions focused; extract when > 50 lines |
| 198 | +- Comment non-obvious logic; skip obvious comments |
| 199 | + |
| 200 | +## Architecture Decisions |
| 201 | + |
| 202 | +1. **Compile-time plugins** - No runtime plugin loading; all providers compiled in |
| 203 | +2. **Middleware pattern** - Cross-cutting concerns via composable middlewares |
| 204 | +3. **Marker interfaces** - Capability detection via `Is*()` marker methods |
| 205 | +4. **Coordinator abstraction** - Memory (single-node) or S3 (distributed) coordination |
| 206 | +5. **Wave-based testing** - Tests organized in dependency waves for efficient CI |
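Decision 3 (marker interfaces) can be pictured with a short sketch. The `Sink`, `Transactional`, and `IsTransactional` names below are hypothetical and chosen only to show the pattern; they are not taken from `pkg/abstract/`.

```go
package main

import "fmt"

// Sink is a minimal stand-in for a destination writer.
type Sink interface {
	Push(rows []string) error
}

// Transactional is a hypothetical marker interface: its method carries no
// logic and exists only so callers can detect the capability.
type Transactional interface {
	IsTransactional()
}

type plainSink struct{}

func (plainSink) Push(rows []string) error { return nil }

// txSink reuses plainSink's Push and opts in to the marker.
type txSink struct{ plainSink }

func (txSink) IsTransactional() {}

// describe detects the capability with a type assertion, without knowing
// the concrete sink type.
func describe(s Sink) string {
	if _, ok := s.(Transactional); ok {
		return "transactional"
	}
	return "plain"
}

func main() {
	fmt.Println(describe(plainSink{}))
	fmt.Println(describe(txSink{}))
}
```

The runtime can branch on such capabilities while the core interfaces stay small, at the cost that a capability is only discoverable by reading which marker methods a type declares.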
| 207 | + |
| 208 | +## Getting Help |
| 209 | + |
| 210 | +- See `/docs/` for detailed documentation |
| 211 | +- Check `/examples/` for configuration patterns |
| 212 | +- Review existing provider implementations for patterns |
| 213 | +- Test recipes in `/recipe/` show infrastructure setup |