**docs/adr/002-large-file-upload-chunking.md** (new file, +326 lines)
# ADR: Large File Upload Support with Storage-Level Chunking

## Status

Proposed

## Context

Twake Workplace (Cozy-Stack) lets users upload files via a single HTTP request; the content is stored in OpenStack Swift or on the local filesystem (via Afero).
Currently, there are practical limitations:

1. Swift has a ~5GB single object limit
2. Very large file uploads can stress server resources
3. Long uploads are more prone to network failures
4. We plan to add S3 API support, which has similar chunking needs (multipart uploads)

> **Member:** You mean as a storage backend, or do we provide this API to our users?
>
> **Contributor (author):** Mmm, yes, allow users to upload big files as well.
>
> **Contributor:** Today, at least as a storage backend, @taratatach. For some customers we may want to use an S3 backend instead of Swift (because of SecNumCloud certification or similar requirements). Having an S3-compatible API could be nice, but I don't know the cost of maintaining both APIs.


### Current Architecture

The VFS (Virtual File System) abstraction layer supports multiple backends:
- **VFSSwift** (`model/vfs/vfsswift/impl_v3.go`): OpenStack Swift storage
- **VFSAfero** (`model/vfs/vfsafero/impl.go`): Local filesystem storage

Both implement the same `vfs.VFS` interface, keeping the storage implementation transparent to consumers.

### Problem Statement

Users cannot upload files larger than 5GB to Swift storage. We need a solution that:
- Supports files larger than 5GB
- Works with existing single-request upload API
- Is extensible to local VFS and future S3 support
- Minimizes changes to CouchDB schema and existing code

## Proposal

### Storage-Level Chunking

Implement chunking at the storage backend level, keeping it transparent to the HTTP API layer. Each storage backend handles chunking internally using its native large object support:

**For Swift:** Use Static Large Objects (SLO)
- Swift automatically manages segments and manifests
- Downloads are transparently reassembled by Swift
- No CouchDB schema changes needed

**For Local VFS (Afero):** Already handles large files natively (no chunking needed)

**For Future S3:** Use S3 Multipart Upload API
- S3/MinIO provide native multipart upload support similar to Swift SLO (streaming parts directly to disk without buffering entire files)
- Same transparent approach can be used
- Observe S3 protocol limits: a single object maxes out at 5TB, uploads can have at most 10,000 parts, and each part must be between 5MB and 5GB (last part can be smaller)
- Pick a part size that is configurable and large enough that the 10,000-part limit still covers the largest supported file; anything larger than 5TB must be chunked at the application level because the S3 API itself forbids it
- **Note on checksums:** Like Swift SLO, S3 multipart ETags are not MD5 hashes of the content (they're a hash of part ETags). Application-side MD5 computation will be required for S3 multipart uploads, using the same pattern as Swift SLO

> **Member:** Isn't there a small mistake here? Suggested change: "(they're a hash of part ETags)" should read "(they're a hash of part content)". If not, I don't really understand.


### Implementation Details

#### 1. Configuration (Swift-Specific)

Add Swift-specific configuration under `fs.swift`:

```yaml
fs:
  url: swift://...
  swift:
    # Size of each segment for SLO uploads (default: 4GB)
    segment_size: 4294967296
    # Files larger than this use SLO (default: same as segment_size)
    # Set to 0 to always use SLO
    slo_threshold: 4294967296
```

**Test override**: Tests will be able to set a tiny segment size (e.g., 1KB) via config overrides to exercise the SLO code path without uploading gigabytes.

#### 2. Swift VFS Changes (`model/vfs/vfsswift/impl_v3.go`)

**CreateFile method modifications:**

The `CreateFile` method will be extended to (a sketch follows this list):
1. Read segment size and SLO threshold from configuration
2. Determine whether to use SLO based on file size (files larger than threshold) or streaming mode (unknown size, indicated by negative `ByteSize`)

> **Member:** Do we have to know the file size beforehand to use SLO?
>
> **Member:** Reading the "quota enforcement" paragraph it seems that we don't, but then I don't really understand this sentence.
>
> **Contributor (author):** At least we know it from the request header, but it's not 100% reliable. We can rely on it to choose which method to use, but not for quota; we need to calculate the size on the fly as well.

3. For SLO uploads: use Swift's `StaticLargeObjectCreateFile` API with the configured segment size, letting Swift generate collision-free segment prefixes automatically
4. Return a new `swiftLargeFileCreationV3` struct that wraps the SLO writer and maintains its own MD5 hasher
5. For regular uploads: continue using the existing `ObjectCreate` path unchanged
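
To make these steps concrete, here is a minimal sketch (not the final implementation) of the decision and of the wrapper struct. It assumes the existing `swiftVFSV3` fields `c` and `container`, the proposed `fs.swift` keys exposed as `config.GetConfig().Fs.Swift`, a hypothetical helper `newObjectWriter` factored out of `CreateFile` with the remaining quota passed in by the caller, and that the SLO writer returned by `ncw/swift` can be treated as an `io.WriteCloser`. Locking, CouchDB bookkeeping, and the initial disk-space check are omitted.

```go
package vfsswift

import (
	"context"
	"crypto/md5"
	"hash"
	"io"

	"github.com/cozy/cozy-stack/model/vfs"
	"github.com/cozy/cozy-stack/pkg/config/config"
	"github.com/ncw/swift/v2"
)

// swiftLargeFileCreationV3 wraps the SLO writer: it computes the MD5 on the
// fly and tracks the cumulative size for quota enforcement (see below).
type swiftLargeFileCreationV3 struct {
	c         *swift.Connection
	container string
	objName   string
	newdoc    *vfs.FileDoc
	w         io.WriteCloser // SLO writer from StaticLargeObjectCreateFile
	md5h      hash.Hash
	written   int64 // cumulative bytes written so far
	available int64 // quota available when the upload started; < 0 means unlimited
	aborted   bool
}

// newObjectWriter sketches the choice CreateFile will make between the
// existing ObjectCreate path and an SLO upload.
func (sfs *swiftVFSV3) newObjectWriter(ctx context.Context, newdoc *vfs.FileDoc, objName string, available int64) (io.WriteCloser, error) {
	cfg := config.GetConfig().Fs.Swift // proposed fs.swift section

	useSLO := newdoc.ByteSize < 0 || // streaming upload, size unknown
		cfg.SLOThreshold == 0 || // "always use SLO"
		newdoc.ByteSize > cfg.SLOThreshold

	if !useSLO {
		// Regular upload: the existing path, unchanged.
		return sfs.c.ObjectCreate(ctx, sfs.container, objName, true, "", newdoc.Mime, nil)
	}

	// SLO upload: Swift writes fixed-size segments plus a manifest named objName.
	w, err := sfs.c.StaticLargeObjectCreateFile(ctx, &swift.LargeObjectOpts{
		Container:   sfs.container,
		ObjectName:  objName,
		ContentType: newdoc.Mime,
		ChunkSize:   cfg.SegmentSize,
	})
	if err != nil {
		return nil, err
	}
	return &swiftLargeFileCreationV3{
		c:         sfs.c,
		container: sfs.container,
		objName:   objName,
		newdoc:    newdoc,
		w:         w,
		md5h:      md5.New(),
		available: available,
	}, nil
}
```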

**Quota enforcement for streaming uploads (unknown size):**

When `ByteSize < 0` (streaming/chunked transfer encoding), the total size is unknown upfront. Quota enforcement will work as follows (see the `Write` sketch after this list):
1. Before upload: check that instance has *some* available quota (reject if quota already exceeded)
2. During upload: the `swiftLargeFileCreationV3.Write()` method will track cumulative bytes written
3. On each write: compare cumulative bytes against instance quota; if exceeded, call `Abort()` to clean up segments and return a quota-exceeded error
4. The existing `vfs.CheckAvailableDiskSpace` check runs at file creation time; for streaming uploads, we add a runtime check that aborts mid-stream if the limit is hit
5. This mirrors the existing behavior for regular uploads where Swift/the VFS layer rejects writes that exceed quota
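
Building on the wrapper sketched above, the runtime check could look like this, with `vfs.ErrFileTooBig` standing in for whichever quota error the VFS layer already returns:

```go
// Write updates the running MD5, enforces the quota for streaming uploads,
// and forwards the data to the SLO writer.
func (f *swiftLargeFileCreationV3) Write(p []byte) (int, error) {
	if f.available >= 0 && f.written+int64(len(p)) > f.available {
		_ = f.Abort() // best-effort cleanup of the segments written so far
		return 0, vfs.ErrFileTooBig
	}
	f.md5h.Write(p) // hash before handing the data to Swift
	n, err := f.w.Write(p)
	f.written += int64(n)
	return n, err
}
```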

#### 3. MD5/Checksum Handling

**Important**: SLO manifests don't return a single MD5 hash like regular objects. The manifest's ETag is a hash of the segment ETags, not the content.

> **Member:** Again you're saying here that the manifest's ETag is a hash of ETags. If that's not a mistake (i.e. we should not read "hash of segments"), then what's the use of that manifest ETag?
>
> **Contributor (author):** It's MD5(segment1_etag + segment2_etag + ...).
>
> **Member:** OK, so definitely a hash of ETags. Is the purpose of that hash to quickly know whether a segment was changed? I don't see much use for it besides this.


We will implement application-side MD5 computation (see the `Close` sketch below):

1. Create a new `swiftLargeFileCreationV3` struct that holds an MD5 hasher alongside the Swift file writer
2. On each `Write()` call, update the MD5 hash before passing data to Swift
3. On `Close()`, finalize the MD5 hash and store it in `newdoc.MD5Sum` (ignoring Swift's manifest ETag)
4. Proceed with the normal close logic (CouchDB update, versioning) using our computed hash

This ensures:
- Antivirus scanning works (relies on MD5Sum)
- File versioning works (compares MD5Sum to detect changes)
- File integrity validation works
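
Continuing the same sketch, `Close` finalizes the manifest and then records the hash we computed ourselves, ignoring Swift's manifest ETag:

```go
// Close finalizes the SLO manifest, then stores the application-computed MD5
// in the file document (Swift's manifest ETag is not a content hash).
func (f *swiftLargeFileCreationV3) Close() error {
	if err := f.w.Close(); err != nil {
		_ = f.Abort() // best-effort segment cleanup, then surface the error
		return err
	}
	f.newdoc.MD5Sum = f.md5h.Sum(nil)
	// ... normal close logic follows: CouchDB update, versioning, antivirus ...
	return nil
}
```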

#### 4. Failure Handling and Cleanup

**Failure Scenarios:**

1. **Client disconnects mid-upload**: The `swiftLargeFileCreationV3.Close()` is never called; segments remain orphaned

> **Member:** Can't we call Close() in a defer function so it's called every time the handler finishes, whether the upload is done or the connection was closed?
>
> **Contributor (author):** Yes, and we have to do it.
>
> **Member:** Hum, you're saying in the document that Close() can never be called, though 🤔

2. **Server crash**: Same as above - partially written segments exist without a manifest
3. **Write error mid-stream**: Error returned from `Write()` or `Close()`, segments may exist

**Cleanup Strategy:**

**Best-effort cleanup on error:**

The `swiftLargeFileCreationV3` struct will implement cleanup behavior (an `Abort` sketch follows this list):
- On `Close()` error: attempt to delete any written segments using `LargeObjectDelete`, then return the original error
- New `Abort()` method: close the underlying writer and delete any segments that were written; called on context cancellation or explicit abort
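
A sketch of `Abort`, continuing the wrapper above. It relies on `LargeObjectDelete` removing a manifest together with its segments; when no manifest was ever created, the segments are left to the GC job described below:

```go
// Abort closes the underlying writer (if still open) and deletes whatever was
// written. Called on context cancellation, mid-stream quota errors, or any
// upload error.
func (f *swiftLargeFileCreationV3) Abort() error {
	if f.aborted {
		return nil
	}
	f.aborted = true

	// Closing finalizes the manifest when possible, so that LargeObjectDelete
	// can remove the manifest together with all of its segments.
	_ = f.w.Close() // may already be closed or broken; ignored on abort

	err := f.c.LargeObjectDelete(context.Background(), f.container, f.objName)
	if err == swift.ObjectNotFound {
		// No manifest was created: any segments already uploaded are orphans
		// and will be removed by the GC worker described below.
		return nil
	}
	return err
}
```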

**Periodic garbage collection of orphaned segments:**

Orphaned segments can accumulate from crashes or network failures. We will add a worker job (`worker/gc/slo_segments.go`), sketched after this list, that:
1. Lists all segment prefixes (objects matching `*_segments/*` pattern)
2. For each segment prefix, checks if the parent manifest exists
3. If the manifest is missing AND segments are older than a configurable threshold (default: 24h), deletes the orphaned segments
4. Logs all deletions for audit trail
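
A sketch of the core GC pass for `worker/gc/slo_segments.go`. It assumes orphan detection can rely solely on the `{objName}_segments/{timestamp}/{index}` naming convention shown in the SLO layout below, and that `maxAge` comes from `fs.swift.orphan_segment_max_age`; job scheduling, pagination of the listing, and the stack's logger (replaced here by the standard library's `log`) are left out.

```go
package gc

import (
	"context"
	"log"
	"strings"
	"time"

	"github.com/ncw/swift/v2"
)

// CollectOrphanSegments deletes SLO segments whose parent manifest no longer
// exists and which are older than maxAge.
func CollectOrphanSegments(ctx context.Context, c *swift.Connection, container string, maxAge time.Duration) error {
	objects, err := c.ObjectsAll(ctx, container, &swift.ObjectsOpts{})
	if err != nil {
		return err
	}
	cutoff := time.Now().Add(-maxAge)
	for _, obj := range objects {
		// Segments follow "<objName>_segments/<timestamp>/<index>".
		idx := strings.Index(obj.Name, "_segments/")
		if idx < 0 {
			continue // not a segment
		}
		if obj.LastModified.After(cutoff) {
			continue // too recent: an upload may still be in progress
		}
		manifest := obj.Name[:idx]
		if _, _, err := c.Object(ctx, container, manifest); err == nil {
			continue // parent manifest exists: not an orphan
		} else if err != swift.ObjectNotFound {
			return err
		}
		if err := c.ObjectDelete(ctx, container, obj.Name); err != nil {
			return err
		}
		log.Printf("gc: deleted orphan SLO segment %s/%s", container, obj.Name) // audit trail
	}
	return nil
}
```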

**Configuration:**
```yaml
fs:
  swift:
    # ... existing config ...
    # Age threshold for orphan cleanup (default: 24h)
    orphan_segment_max_age: 24h
```

**Triggering cleanup:**
- On upload error: immediate best-effort delete
- On server startup: schedule GC job
- Periodically: run GC worker (configurable interval)
- Manual: `cozy-stack swift gc-segments` CLI command

#### 5. Operations That Need SLO Awareness

The following methods currently use `ObjectDelete` or `ObjectCopy` and need updates:

| Method | Current | Change Needed |
|--------|---------|---------------|
| `destroyFileLocked` | `ObjectDelete` | Use `LargeObjectDelete` with fallback |
| `cleanOldVersion` | `ObjectDelete` | Use `LargeObjectDelete` with fallback |
| `EnsureErased` | `BulkDelete` / `ObjectDelete` | Use `LargeObjectDelete` for each |
| `CopyFile` | `ObjectCopy` | Copy manifest + segments or re-upload |
| `DissociateFile` | `ObjectCopy` + `ObjectDelete` | Handle SLO copy and delete |
| `CopyFileFromOtherFS` | `ObjectCopy` | Handle SLO source objects |

**Deletion pattern:**

We will implement a `deleteObject` helper method (sketched after this list) that:
1. First attempts `LargeObjectDelete` (which handles both SLO manifests with their segments and regular objects)

> **Member:** Do you think this could add a lot of overhead when deleting many files? Would it be better to use SLO for all files, even ones smaller than the threshold?
>
> **Contributor (author):** We could also use ObjectDelete for all objects; for an SLO it deletes only the manifest, and the content would be deleted later by GC. But I'm a little bit afraid of relying on GC for every file and would prefer to keep it for extraordinary cases, even half manual. There is also the option of storing a flag in the file doc.

2. If Swift returns `NotLargeObject` error, falls back to regular `ObjectDelete`
3. This unified approach ensures both SLO and regular objects are deleted correctly without needing to check the object type first
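
A sketch of that helper, reusing the `swiftVFSV3` fields (`c`, `container`) assumed in the earlier sketches and the `NotLargeObject` sentinel mentioned above:

```go
// deleteObject removes an object whether it is an SLO (manifest plus segments)
// or a plain object, without having to know its type in advance.
func (sfs *swiftVFSV3) deleteObject(ctx context.Context, objName string) error {
	err := sfs.c.LargeObjectDelete(ctx, sfs.container, objName)
	if err == swift.NotLargeObject {
		// Plain object: fall back to the regular deletion path.
		return sfs.c.ObjectDelete(ctx, sfs.container, objName)
	}
	return err
}
```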

**Copy pattern** (for SLO objects):

Swift doesn't support copying SLO manifests directly. Available options:
1. **Copy manifest content and update segment references** - Complex: requires parsing manifest JSON, copying each segment individually, updating references
2. **Download and re-upload** - Simple but slow: streams entire file through the server
3. **Copy segments individually then create new manifest** - Medium complexity: server-side segment copy + new manifest creation

> **Member:** Does Swift provide helpers to create manifests from a list of segments?
>
> **Contributor (author):** Yep, there is an API to create a manifest.


**Chosen approach: Segment copy with new manifest (Option 3)**

Rationale:
- Avoids streaming large files through the server (unlike Option 2)
- Keeps data server-side within Swift (efficient for same-region copies)
- Acceptable complexity since copy/dissociate of very large files is rare

Recommendation: For `CopyFile`/`DissociateFile`, detect if source is SLO and handle appropriately. For initial implementation, fall back to download/re-upload for SLO objects (rare case for very large files).
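
For context, an SLO manifest is a small JSON document listing the segments with their ETags and sizes; this is what Option 1 would have to rewrite and what Option 3 would recreate for the copied segments. Container, object, and hash values below are purely illustrative:

```json
[
  {"path": "/cozy-files/photo.raw_segments/1234567890.123456/00000000",
   "etag": "3f2a9c0d8b1e4a5f6c7d8e9f0a1b2c3d", "size_bytes": 4294967296},
  {"path": "/cozy-files/photo.raw_segments/1234567890.123456/00000001",
   "etag": "7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b", "size_bytes": 2147483648}
]
```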

### How Swift SLO Works

```
Upload 10GB file with 4GB segments:
├── {container}/{objName}_segments/1234567890.123456/00000000 (4GB)
├── {container}/{objName}_segments/1234567890.123456/00000001 (4GB)
├── {container}/{objName}_segments/1234567890.123456/00000002 (2GB)
└── {container}/{objName} (manifest JSON listing segments)

The segment prefix includes a timestamp to avoid collisions.

Download:
Client requests {objName} → Swift reads manifest → Streams segments in order
```

## Alternatives

### Alternative A: UI-Initiated Chunked Uploads

**Description:** Client sends multiple HTTP requests, each containing a chunk of the file. Server assembles chunks after all are received.

**Pros:**
- Client can resume uploads after failure
- Better progress tracking per chunk
- Works with any storage backend uniformly

**Cons:**
- Requires new HTTP API endpoints (`POST /files/chunks/start`, `PUT /files/chunks/{id}`, `POST /files/chunks/{id}/complete`)
- Significant UI/client changes required
- Server must track upload sessions and handle cleanup of incomplete uploads
- Adds complexity to CouchDB (need to track chunk metadata)

> **Member:** Couldn't we create an SLO manifest and create chunks by hand in this case? I'm just asking because I think adding this functionality later could be nice (for resuming uploads, parallel uploads…).
>
> **Contributor (author):** Yes, when we are ready to do it, we can create segments separately and then create a manifest from the uploaded chunks.

- More HTTP round trips
- State management for partial uploads

**Implementation complexity:** High

### Alternative B: Unified Chunking Layer in VFS

**Description:** Add a chunking abstraction in the VFS interface that all backends implement, with chunk metadata stored in CouchDB.

**Pros:**
- Consistent behavior across all storage backends
- Full control over chunk management
- CouchDB knows about file structure

**Cons:**
- CouchDB schema changes required (new `chunks` field in FileDoc)
- Must implement custom chunk assembly for downloads
- Duplicates functionality that Swift/S3 provide natively
- Increases complexity in all VFS operations
- Migration needed for existing files

**Implementation complexity:** High

### Alternative C: Swift Dynamic Large Objects (DLO) Instead of SLO

**Description:** Swift offers two large object mechanisms: Static Large Objects (SLO) and Dynamic Large Objects (DLO). Both are supported by the `ncw/swift` Go library used in this project. DLO uses a simpler manifest that only stores a segment prefix—Swift dynamically discovers segments matching that prefix at download time.

**Pros:**
- Simpler manifest structure (just a prefix, no segment list)
- Segments can be added or modified after manifest creation
- Slightly simpler upload logic (no need to track segment metadata)

**Cons:**
- **No integrity validation**: DLO manifests don't store segment ETags, so Swift cannot detect missing or corrupted segments. Downloads succeed with incomplete or wrong data rather than returning an error.
- **Race conditions**: Since segments are discovered dynamically, concurrent modifications can cause inconsistent reads (e.g., a download might see a partial set of segments if upload is in progress).
- **Unpredictable content**: The downloaded content depends on what segments exist at read time, not what was originally uploaded. If a segment is deleted (disk failure, bug, manual deletion), the file silently shrinks.
- **No size validation**: DLO cannot verify that the total size matches expectations.

**Why SLO is preferred:**

For a file storage system where data integrity matters, SLO provides critical guarantees that DLO lacks:

| Aspect | SLO | DLO |
|--------|-----|-----|
| Segment list | Explicit with ETags and sizes | Dynamic prefix matching |
| Integrity check | Swift validates each segment's ETag on download | None—trusts whatever exists |
| Missing segment | Detected—connection dropped, client receives partial results | **Silently succeeds**—"happily ignores" the missing segment |
| Modified segment | Detected—ETag mismatch causes failure | Silently returns different data |
| Consistency | Immutable after creation | Can change between reads |
| Client awareness | Failure is detectable (incomplete transfer) | No way to know data is missing |

**Note on UI chunked uploads:** DLO's flexibility (adding segments after manifest creation) might seem useful for resumable UI uploads, but it provides no real advantage. Both DLO and SLO support the same upload flow: client uploads segments, then server creates manifest on completion.

**Implementation complexity:** Low (similar to SLO), but unacceptable data integrity trade-offs.

## Decision

**Recommended approach: Storage-Level Chunking**

Rationale for recommendation:
1. **Minimal changes**: Only affects Swift VFS, no API or CouchDB changes
2. **Uses native features**: Swift SLO is battle-tested and efficient
3. **Transparent**: Existing code (downloads, file operations) works unchanged
4. **Extensible**: Same pattern applies to S3 multipart uploads
5. **No UI changes**: Works with existing single-request upload API
6. **Afero compatibility**: Local VFS already handles large files, no changes needed

## Consequences

### Positive
- Files larger than 5GB can be uploaded to Swift
- Downloads work transparently (Swift handles segment assembly)
- No CouchDB schema changes
- No HTTP API changes
- Easy to extend to S3 when needed
- Minimal code surface area to maintain
- Works with streaming uploads (unknown Content-Length)

### Negative
- Swift-specific implementation (though same pattern works for S3)
- MD5 must be computed application-side for SLO uploads
- Copy operations for SLO files are more complex
- Segments use additional storage namespace (though transparent to users)

### Neutral
- Configuration options are Swift-specific (`fs.swift.segment_size`)
- Deletion is slightly more complex (but library handles it)

### Security Considerations
- Switching to Swift SLO does not introduce new HTTP endpoints or long-lived upload sessions, so the DoS surface stays effectively the same as today's single-request uploads.
- Segment creation remains gated by the existing per-instance quota checks in `vfs.CheckAvailableDiskSpace`, so a malicious client cannot exceed its quota by partially uploading SLO files.
- **Streaming upload quota enforcement:** For uploads with unknown size (`ByteSize < 0`), the `swiftLargeFileCreationV3` writer tracks cumulative bytes and aborts the upload if quota is exceeded mid-stream. This prevents quota bypass via chunked transfer encoding.
- Large uploads remain inherently expensive (bandwidth/CPU). Existing protections—rate limiting (`pkg/limits`), request timeouts, and monitoring for long-running uploads—should continue to be enforced. No additional attack vectors are introduced by SLO itself.

## Testing Considerations

### Unit Tests
- Override `config.Fs.Swift.SegmentSize` to 1KB in tests to exercise SLO code path
- Test files smaller than threshold (regular upload)
- Test files larger than threshold (SLO upload)
- Test streaming upload with unknown size (should use SLO)
- Test MD5 computation matches expected value

### Integration Tests
- Test file deletion (verify segments are cleaned up)
- Test downloads of SLO files
- Test file versioning with SLO files
- Test copy/dissociate operations with SLO files

### Test Configuration Example

Tests will configure tiny segment sizes (e.g., 1KB segments, 512-byte threshold) to exercise the SLO code path with small test files.
For example, a 2KB test file would use 2 segments, allowing verification of upload, download, and delete operations without requiring gigabytes of test data.
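
For instance, a config override along these lines (values are illustrative and reuse the `fs.swift` keys proposed above):

```yaml
fs:
  swift:
    segment_size: 1024   # 1KB segments
    slo_threshold: 512   # anything above 512 bytes goes through the SLO path
```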