|
| 1 | +# [EPIC] Project data mover (Files ↔ Blob Cool) with AzCopy |
| 2 | + |
1 | 3 | ## Context |
2 | | -PRD: Hot → Cool Project Data Management (Immutable Cool) |
3 | 4 |
|
4 | | -This epic implements the **project-level** data lifecycle described in the PRD. |
5 | | -The lifecycle applies to the **entire project data folder**, including runs, |
6 | | -documents, and any other project-scoped files. |
| 5 | +This EPIC is part of the **Hot → Cool Project Data Management (Immutable Cool)** initiative. |
| 6 | + |
| 7 | +With the **lifecycle state machine, immutability rules, eligibility logic, and admin visibility** delivered in EPIC 1, the system now needs a **reliable, auditable, and verifiable mechanism** to physically move project data between storage tiers.
| 8 | + |
| 9 | +Project data transitions must be: |
| 10 | + |
| 11 | +* **explicitly triggered** |
| 12 | +* **long-running** |
| 13 | +* **idempotent** |
| 14 | +* **verifiable** |
| 15 | +* **fully observable** |
| 16 | + |
| 17 | +Actual data movement is delegated to **euphrosyne-tools-api**, using **AzCopy** for storage-native, high-throughput transfers. |
7 | 18 |
|
8 | 19 | --- |
9 | 20 |
|
10 | 21 | ## Goal |
11 | | -Implement **project-level** lifecycle state machine, automation, immutability |
12 | | -enforcement, and UI visibility. |
13 | 22 |
|
14 | | -The lifecycle controls when a project’s data is stored in: |
15 | | -- Azure Files (HOT workspace), or |
16 | | -- Azure Blob Storage (COOL, immutable). |
| 23 | +Implement **project-level COOL and RESTORE operations** that: |
| 24 | + |
| 25 | +* move **entire project data** between: |
| 26 | + |
| 27 | + * Azure Files (HOT) |
| 28 | + * Azure Blob Storage (Cool tier) |
| 29 | +* are executed via **AzCopy** |
| 30 | +* are tracked as **long-running lifecycle operations** |
| 31 | +* are verified using **AzCopy transfer summaries** |
| 32 | +* drive project lifecycle state transitions in Euphrosyne |
| 33 | + |
| 34 | +This EPIC is the first one that **moves bytes**, not just states. |
17 | 35 |
|
18 | 36 | --- |
19 | 37 |
|
20 | | -## Success criteria |
| 38 | +## Scope |
| 39 | + |
| 40 | +### In scope |
| 41 | + |
| 42 | +* Project-level data copy: |
| 43 | + |
| 44 | + * Files → Blob Cool (COOL) |
| 45 | + * Blob Cool → Files (RESTORE) |
| 46 | +* Long-running operation tracking |
| 47 | +* Operation polling and status reconciliation |
| 48 | +* Verification based on: |
| 49 | + |
| 50 | + * expected file count |
| 51 | + * expected byte size |
| 52 | +* Automatic triggering for eligible projects |
| 53 | +* Manual triggering for restore |
| 54 | +* Full auditability |
| 55 | + |
| 56 | +### Out of scope (explicit) |
| 57 | + |
| 58 | +* Deletion of HOT data after cooling |
| 59 | +* Cold / Archive tier |
| 60 | +* Partial project tiering |
| 61 | +* Delta sync or incremental copy |
| 62 | +* Deduplication |
| 63 | +* Concurrent HOT + COOL writes |
| 64 | + |
| 65 | +--- |
| 66 | + |
| 67 | +## High-level design |
| 68 | + |
| 69 | +### Source of truth |
| 70 | + |
| 71 | +* **Euphrosyne DB** is the authoritative source for: |
| 72 | + |
| 73 | + * lifecycle state |
| 74 | + * storage class |
| 75 | + * eligibility |
| 76 | + * expected bytes / files |
| 77 | +* **tools-api** is the executor and reporter of physical copy operations |
| 78 | + |
| 79 | +State transitions **never happen based on AzCopy output alone**;
| 80 | +they occur only after **verified success** is reported back.
| 81 | + |
| 82 | +--- |
| 83 | + |
| 84 | +### Lifecycle operations |
| 85 | + |
| 86 | +Each COOL or RESTORE is modeled as a **single lifecycle operation** with: |
| 87 | + |
| 88 | +* a unique `operation_id` |
| 89 | +* a fixed direction (`COOL` or `RESTORE`) |
| 90 | +* immutable expectations (bytes/files) |
| 91 | +* monotonic status progression: |
| 92 | + |
| 93 | + ``` |
| 94 | + PENDING → RUNNING → SUCCEEDED | FAILED |
| 95 | + ``` |
| 96 | + |
| 97 | +Operations are: |
| 98 | + |
| 99 | +* idempotent per `(project_id, operation_id)` |
| 100 | +* never reused across retries: each retry is a new operation with a new `operation_id`
| 101 | +* fully auditable |
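
The monotonic status progression described above can be sketched as a small state machine. This is only an illustration under assumptions, not the project's implementation: `LifecycleOperation`, `OperationStatus`, and `advance` are hypothetical names, and only the four statuses, the two directions, and the expected bytes/files come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid


class OperationStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Forward-only transitions, read literally from the diagram above.
# SUCCEEDED and FAILED are terminal, so they allow no further moves.
_ALLOWED = {
    OperationStatus.PENDING: {OperationStatus.RUNNING},
    OperationStatus.RUNNING: {OperationStatus.SUCCEEDED, OperationStatus.FAILED},
    OperationStatus.SUCCEEDED: set(),
    OperationStatus.FAILED: set(),
}


@dataclass
class LifecycleOperation:
    """Hypothetical model of one COOL or RESTORE operation."""

    project_id: int
    direction: str  # "COOL" or "RESTORE", fixed at creation
    expected_files: int  # set once at creation, never changed by advance()
    expected_bytes: int
    operation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: OperationStatus = OperationStatus.PENDING

    def advance(self, new_status: OperationStatus) -> None:
        """Move to new_status, rejecting any backward or repeated transition."""
        if new_status not in _ALLOWED[self.status]:
            raise ValueError(
                f"illegal transition {self.status.value} -> {new_status.value}"
            )
        self.status = new_status
```

Because a terminal operation cannot be advanced again, a retry has to create a fresh `LifecycleOperation`, which yields a new `operation_id`.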
| 102 | + |
| 103 | +--- |
| 104 | + |
| 105 | +### Storage movement model |
| 106 | + |
| 107 | +**COOL** |
| 108 | + |
| 109 | +``` |
| 110 | +Azure Files (project folder) |
| 111 | + ↓ |
| 112 | +Azure Blob Storage (Cool tier, project prefix) |
| 113 | +``` |
| 114 | + |
| 115 | +**RESTORE** |
| 116 | + |
| 117 | +``` |
| 118 | +Azure Blob Storage (Cool tier) |
| 119 | + ↓ |
| 120 | +Azure Files (project folder) |
| 121 | +``` |
21 | 122 |
|
22 | | -- Projects automatically become eligible for cooling based on activity: |
23 | | - - Initial eligibility: `project.created + 6 months` |
24 | | - - Updated eligibility each time a new run is planned: |
25 | | - `run.end_date + 6 months` |
26 | | -- Entire project data (runs + documents) is cooled as a single unit. |
27 | | -- Restore works on demand and returns the project to HOT. |
28 | | -- Writes are blocked when the project is in `COOL` or `COOLING`: |
29 | | - - document uploads/edits/deletes |
30 | | - - run outputs |
31 | | - - any write under the project folder |
32 | | -- **New runs cannot be created** when the project is `COOL` or `COOLING`. |
33 | | -- Admins can see: |
34 | | - - lifecycle state |
35 | | - - cooling eligibility date |
36 | | - - last lifecycle operation and error (if any). |
| 123 | +Characteristics: |
| 124 | + |
| 125 | +* Entire project moves as a single unit |
| 126 | +* Directory structure is preserved |
| 127 | +* COOL storage is treated as **immutable** |
| 128 | +* HOT storage is treated as **ephemeral workspace** |
37 | 129 |
|
38 | 130 | --- |
39 | 131 |
|
40 | | -## Non-goals (explicit for this epic) |
| 132 | +## AzCopy integration model |
| 133 | + |
| 134 | +### Role of AzCopy |
| 135 | + |
| 136 | +AzCopy is the **only mechanism** used for data transfer. It provides:
| 137 | + |
| 138 | +* high throughput |
| 139 | +* resumable |
| 140 | +* storage-native verification |
| 141 | + |
| 142 | +tools-api is responsible for: |
41 | 143 |
|
42 | | -- No cold/archive tier |
43 | | -- No partial (per-run) cooling |
44 | | -- No deletion of hot data after cooling |
45 | | -- No detection of concurrent readers (existing VM mounts tolerated) |
| 144 | +* starting AzCopy jobs |
| 145 | +* polling job progress |
| 146 | +* parsing AzCopy summaries |
| 147 | +* translating AzCopy outcomes into lifecycle operation status |
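
As a rough sketch of the "starting AzCopy jobs" responsibility, the helper below assembles an `azcopy copy` command line. The function name and the URLs are hypothetical. It uses only the `--recursive` and `--output-type` flags of `azcopy copy`. Authentication (for example SAS tokens) and access-tier selection are deliberately left out and would need to be settled against the AzCopy documentation.

```python
from typing import List


def build_azcopy_copy_command(source_url: str, destination_url: str) -> List[str]:
    """Return an argv list for a recursive AzCopy copy (hypothetical helper).

    The same shape serves both directions: for COOL the source is the
    Azure Files project folder, for RESTORE it is the Blob project prefix.
    """
    return [
        "azcopy",
        "copy",
        source_url,
        destination_url,
        "--recursive",  # copy the whole project tree
        "--output-type",
        "json",  # machine-readable output for later parsing
    ]


# Placeholder URLs, for illustration only.
cool_cmd = build_azcopy_copy_command(
    "https://example.file.core.windows.net/share/projects/demo",
    "https://example.blob.core.windows.net/container/projects/demo",
)
```

Returning a list rather than a shell string lets the caller pass it to `subprocess.run` without shell quoting concerns.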
46 | 148 |
|
47 | 149 | --- |
48 | 150 |
|
49 | | -## Notes |
| 151 | +### Verification contract |
| 152 | + |
| 153 | +A lifecycle operation is considered **successful** if and only if: |
| 154 | + |
| 155 | +* AzCopy job completes successfully |
| 156 | +* `files_copied == expected_files` |
| 157 | +* `bytes_copied == expected_bytes` |
| 158 | + |
| 159 | +Any mismatch or execution error results in **FAILED**. |
| 160 | + |
| 161 | +There is no partial success. |
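
The contract above reduces to a single boolean check. The sketch below assumes the AzCopy summary has already been parsed into a small `TransferStats` structure. That type and the `verify_operation` function are hypothetical names; `files_copied`, `bytes_copied`, `expected_files`, and `expected_bytes` follow the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferStats:
    """Hypothetical, already-parsed view of an AzCopy transfer summary."""

    job_completed: bool  # True only if AzCopy reported a successful job
    files_copied: int
    bytes_copied: int


def verify_operation(
    stats: TransferStats, expected_files: int, expected_bytes: int
) -> str:
    """Return "SUCCEEDED" only if all three conditions hold, else "FAILED"."""
    verified = (
        stats.job_completed
        and stats.files_copied == expected_files
        and stats.bytes_copied == expected_bytes
    )
    return "SUCCEEDED" if verified else "FAILED"
```

For example, a completed job that copied 9 of 10 expected files yields `"FAILED"`, which matches the "no partial success" rule.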
| 162 | + |
| 163 | +--- |
| 164 | + |
| 165 | +## Execution flow (nominal) |
| 166 | + |
| 167 | +### COOL (automatic) |
| 168 | + |
| 169 | +1. Project becomes eligible for cooling |
| 170 | +2. Euphrosyne starts a COOL operation |
| 171 | +3. Project enters `COOLING` |
| 172 | +4. tools-api launches AzCopy (Files → Blob Cool) |
| 173 | +5. AzCopy completes |
| 174 | +6. tools-api reports verified success + stats |
| 175 | +7. Euphrosyne marks project `COOL` |
| 176 | + |
| 177 | +--- |
| 178 | + |
| 179 | +### RESTORE (manual) |
| 180 | + |
| 181 | +1. User or admin triggers restore |
| 182 | +2. Project enters `RESTORING` |
| 183 | +3. tools-api launches AzCopy (Blob Cool → Files) |
| 184 | +4. AzCopy completes |
| 185 | +5. tools-api reports verified success + stats |
| 186 | +6. Euphrosyne marks project `HOT` |
| 187 | + |
| 188 | +--- |
| 189 | + |
| 190 | +## Failure model |
| 191 | + |
| 192 | +* Any failure during copy or verification: |
| 193 | + |
| 194 | + * lifecycle operation → `FAILED` |
| 195 | + * project lifecycle → `ERROR` |
| 196 | +* Errors are: |
| 197 | + |
| 198 | + * persisted |
| 199 | + * visible to admins |
| 200 | + * retryable via a new operation |
| 201 | + |
| 202 | +Project state never flips on partial or unverified success. |
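
Combining the nominal flows with this failure model, the project lifecycle state can be derived from the operation's direction and status. The mapping below is a sketch: `resolve_project_state` is a hypothetical helper, while the state names (`HOT`, `COOLING`, `COOL`, `RESTORING`, `ERROR`) are the ones used in this document.

```python
# State the project sits in while an operation is PENDING or RUNNING.
IN_PROGRESS_STATE = {"COOL": "COOLING", "RESTORE": "RESTORING"}

# State the project reaches only after verified success.
TARGET_STATE = {"COOL": "COOL", "RESTORE": "HOT"}


def resolve_project_state(direction: str, operation_status: str) -> str:
    """Map (direction, operation status) to a project lifecycle state."""
    if direction not in TARGET_STATE:
        raise ValueError(f"unknown direction: {direction}")
    if operation_status == "SUCCEEDED":
        return TARGET_STATE[direction]
    if operation_status == "FAILED":
        return "ERROR"
    if operation_status in ("PENDING", "RUNNING"):
        # Unfinished work never flips the project to its target state.
        return IN_PROGRESS_STATE[direction]
    raise ValueError(f"unknown operation status: {operation_status}")
```

Unknown inputs raise instead of falling back to a default, so an unexpected status cannot silently change project state.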
| 203 | + |
| 204 | +--- |
| 205 | + |
| 206 | +## Observability & auditability |
| 207 | + |
| 208 | +For each operation, the system records: |
| 209 | + |
| 210 | +* type (COOL / RESTORE) |
| 211 | +* timestamps (start / finish) |
| 212 | +* status |
| 213 | +* expected vs actual bytes/files |
| 214 | +* error details (if any) |
| 215 | + |
| 216 | +Admins can: |
| 217 | + |
| 218 | +* inspect operation history |
| 219 | +* understand failures |
| 220 | +* retry safely |
| 221 | + |
| 222 | +--- |
| 223 | + |
| 224 | +## Success criteria |
| 225 | + |
| 226 | +This EPIC is considered complete when: |
50 | 227 |
|
51 | | -- Lifecycle state is tracked **at the project level**, not run level. |
52 | | -- Euphrosyne is the source of truth for lifecycle state and eligibility. |
53 | | -- Physical data movement is handled by euphrosyne-tools-api and is out of scope |
54 | | - for this epic. |
| 228 | +* COOL and RESTORE operations move full project data via AzCopy |
| 229 | +* Operations are fully tracked and observable |
| 230 | +* Verification gates lifecycle state transitions |
| 231 | +* Automatic cooling works end-to-end |
| 232 | +* Restore reliably returns projects to HOT |