|
| 1 | +# Implementation Plan: FTP Sync Overwrite Modes and Metadata-Aware Sync |
| 2 | + |
| 3 | +**Issue:** [#11 - `acquire ftp`: add overwrite mode and metadata-aware sync for filled-frame use cases](https://github.com/zyra-project/zyra/issues/11) |
| 4 | + |
| 5 | +**Branch:** `claude/github-issue-11-EESdo` |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## Problem Statement |
| 10 | + |
| 11 | +The `acquire ftp --sync-dir` feature prevents re-downloading existing files. However, when upstream FTP servers backfill missing data with real information (replacing interpolated/placeholder frames), the stale local data persists because filenames remain unchanged. |
| 12 | + |
| 13 | +--- |
| 14 | + |
| 15 | +## Key Files to Modify |
| 16 | + |
| 17 | +| File | Purpose | |
| 18 | +|------|---------| |
| 19 | +| `src/zyra/connectors/backends/ftp.py` | Core sync logic, add `SyncOptions` dataclass and decision function | |
| 20 | +| `src/zyra/connectors/ingest/__init__.py` | CLI argument registration and handler | |
| 21 | +| `src/zyra/api/schemas/domain_args.py` | API schema updates | |
| 22 | +| `src/zyra/connectors/clients.py` | Update `FTPConnector.sync_directory()` signature | |
| 23 | +| `tests/connectors/test_ftp_backend.py` | Unit tests for new functionality | |
| 24 | + |
| 25 | +--- |
| 26 | + |
| 27 | +## Architecture Decisions |
| 28 | + |
| 29 | +### 1. New `SyncOptions` Dataclass |
| 30 | + |
| 31 | +A frozen dataclass to encapsulate all sync behavior configuration, following existing patterns in the codebase: |
| 32 | + |
| 33 | +```python |
| 34 | +@dataclass(frozen=True) |
| 35 | +class SyncOptions: |
| 36 | + """Configuration for FTP sync file replacement behavior.""" |
| 37 | + overwrite_existing: bool = False |
| 38 | + recheck_existing: bool = False |
| 39 | + min_remote_size: int | str | None = None # bytes or percentage like "10%" |
| 40 | + prefer_remote: bool = False |
| 41 | + prefer_remote_if_meta_newer: bool = False |
| 42 | + skip_if_local_done: bool = False |
| 43 | + recheck_missing_meta: bool = False |
| 44 | + frames_meta_path: str | None = None |
| 45 | +``` |
| 46 | + |
| 47 | +### 2. New `should_download()` Function |
| 48 | + |
| 49 | +Central decision logic implementing this precedence order: |
| 50 | + |
| 51 | +1. `--skip-if-local-done`: Check for `.done` marker file first |
| 52 | +2. File doesn't exist locally → always download |
| 53 | +3. Zero-byte local file → always replace |
| 54 | +4. `--overwrite-existing`: Unconditional download |
| 55 | +5. `--prefer-remote`: Always download |
| 56 | +6. `--prefer-remote-if-meta-newer`: Check `frames-meta.json` timestamps |
| 57 | +7. `--recheck-missing-meta`: Download if companion metadata missing |
| 58 | +8. `--min-remote-size`: Size-based comparison |
| 59 | +9. `--recheck-existing`: Size comparison when mtime unavailable |
| 60 | +10. **Default**: MDTM-based timestamp comparison |
| 61 | + |
| 62 | +### 3. New `get_remote_mtime()` Function |
| 63 | + |
| 64 | +Uses FTP's `MDTM` command to get file modification time (fails gracefully when unsupported). |
| 65 | + |
| 66 | +--- |
| 67 | + |
| 68 | +## Implementation Phases |
| 69 | + |
| 70 | +### Phase 1: Core Backend (`ftp.py`) |
| 71 | + |
| 72 | +1. Add `SyncOptions` dataclass |
| 73 | +2. Implement `get_remote_mtime()` function |
| 74 | +3. Implement helper functions: |
| 75 | + - `_parse_min_size()` - Parse bytes or percentage size spec |
| 76 | + - `_load_frames_meta()` - Load frames-meta.json file |
| 77 | + - `_has_done_marker()` - Check for .done marker file |
| 78 | + - `_is_missing_companion_meta()` - Check if file is missing from metadata |
| 79 | +4. Implement `should_download()` logic |
| 80 | +5. Update `sync_directory()` signature and loop |
| 81 | +6. Write unit tests for new functions |
| 82 | + |
| 83 | +### Phase 2: CLI Integration (`ingest/__init__.py`) |
| 84 | + |
| 85 | +1. Add new CLI arguments to FTP parser |
| 86 | +2. Update `_cmd_ftp()` to construct `SyncOptions` |
| 87 | +3. Pass options to `sync_directory()` |
| 88 | +4. Write CLI integration tests |
| 89 | + |
| 90 | +### Phase 3: API Schema (`domain_args.py`) |
| 91 | + |
| 92 | +1. Update `AcquireFtpArgs` model with new fields |
| 93 | +2. Update `normalize_and_validate()` if needed |
| 94 | +3. Test API endpoint with new parameters |
| 95 | + |
| 96 | +### Phase 4: Connector Client (`clients.py`) |
| 97 | + |
| 98 | +1. Update `FTPConnector.sync_directory()` signature |
| 99 | +2. Pass through `sync_options` parameter |
| 100 | + |
| 101 | +### Phase 5: Documentation and Cleanup |
| 102 | + |
| 103 | +1. Update CLI help strings |
| 104 | +2. Add docstrings with examples |
| 105 | +3. Update any sample workflows using FTP sync |
| 106 | + |
| 107 | +--- |
| 108 | + |
| 109 | +## New CLI Flags |
| 110 | + |
| 111 | +| Flag | Type | Description | |
| 112 | +|------|------|-------------| |
| 113 | +| `--overwrite-existing` | bool | Unconditional replacement regardless of timestamps | |
| 114 | +| `--recheck-existing` | bool | Compare file sizes when timestamps unavailable | |
| 115 | +| `--min-remote-size` | str/int | Replace if remote larger (bytes or "10%") | |
| 116 | +| `--prefer-remote` | bool | Always prioritize remote versions | |
| 117 | +| `--prefer-remote-if-meta-newer` | bool | Use frames-meta.json timestamps | |
| 118 | +| `--skip-if-local-done` | bool | Respect .done marker files | |
| 119 | +| `--recheck-missing-meta` | bool | Re-download files lacking metadata | |
| 120 | +| `--frames-meta` | path | Path to frames-meta.json | |
| 121 | + |
| 122 | +--- |
| 123 | + |
| 124 | +## Potential Risks & Mitigations |
| 125 | + |
| 126 | +| Risk | Impact | Mitigation | |
| 127 | +|------|--------|------------| |
| 128 | +| MDTM not supported by all servers | Medium | Graceful fallback with logging; document behavior | |
| 129 | +| Performance from extra MDTM calls | High | Cache connection; batch metadata queries | |
| 130 | +| frames-meta.json format changes | Low | Validate JSON structure; log warnings | |
| 131 | +| Timezone handling in MDTM | Medium | Document UTC assumption | |
| 132 | + |
| 133 | +--- |
| 134 | + |
| 135 | +## Test Strategy |
| 136 | + |
| 137 | +### Unit Tests (`test_ftp_backend.py`) |
| 138 | + |
| 139 | +- `test_get_remote_mtime_success` - MDTM parsing when server supports it |
| 140 | +- `test_get_remote_mtime_not_supported` - Graceful handling when unsupported |
| 141 | +- `test_parse_min_size_bytes` - Parsing absolute byte values |
| 142 | +- `test_parse_min_size_percentage` - Parsing percentage values |
| 143 | +- `test_should_download_*` - All decision logic branches |
| 144 | +- `test_sync_directory_with_sync_options` - Integration test |
| 145 | + |
| 146 | +### CLI Tests |
| 147 | + |
| 148 | +- Verify new arguments are recognized |
| 149 | +- Test argument combinations |
| 150 | + |
| 151 | +--- |
| 152 | + |
| 153 | +## Usage Examples |
| 154 | + |
| 155 | +```bash |
| 156 | +# Default: use MDTM timestamps |
| 157 | +zyra acquire ftp ftp://server/data/ --sync-dir ./local/ |
| 158 | + |
| 159 | +# Force overwrite all files |
| 160 | +zyra acquire ftp ftp://server/data/ --sync-dir ./local/ --overwrite-existing |
| 161 | + |
| 162 | +# Only replace if remote is significantly larger |
| 163 | +zyra acquire ftp ftp://server/data/ --sync-dir ./local/ --min-remote-size "20%" |
| 164 | + |
| 165 | +# Use metadata-aware sync |
| 166 | +zyra acquire ftp ftp://server/frames/ --sync-dir ./frames/ \ |
| 167 | + --frames-meta ./frames-meta.json \ |
| 168 | + --prefer-remote-if-meta-newer |
| 169 | + |
| 170 | +# Skip completed files |
| 171 | +zyra acquire ftp ftp://server/data/ --sync-dir ./local/ --skip-if-local-done |
| 172 | +``` |
| 173 | + |
| 174 | +--- |
| 175 | + |
| 176 | +*Ready for implementation upon approval.* |
0 commit comments