Building CloudSync required solving complex problems across networking, distributed consensus, and database management. Below is a log of the significant technical hurdles we encountered and resolved.
The Issue: The Frontend (Next.js) needed to send authentication cookies/headers to the Backend (Go). However, the browser blocked requests with the error:
"The value of the 'Access-Control-Allow-Origin' header in the response must not be the wildcard '*' when the request's credentials mode is 'include'."
The Root Cause:
We were using `AllowAllOrigins: true` in our Go CORS middleware. Browser security specifications forbid using a wildcard (`*`) when credentials (cookies/auth headers) are involved, to prevent CSRF attacks.
The Solution: We reconfigured the CORS middleware to explicitly whitelist the frontend origin:
```go
config.AllowOrigins = []string{"http://localhost:3000"} // Explicit allow-list
config.AllowCredentials = true
```

The Issue: How do we trust that an Agent is who it says it is? Initial designs used MAC addresses, but those can be spoofed.

The Solution: We implemented a JWT-based Handshake:
- On register, the Master generates a signed JWT containing the Agent's UUID and Owner ID.
- The Agent saves this token to disk (`agent_token.jwt`).
- All subsequent gRPC calls (Heartbeats, Replication) must include this token in the metadata header, which is verified by a global `AuthInterceptor` on the Master.
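The sign/verify flow of the handshake can be sketched with stdlib HMAC primitives. A real build would use a proper JWT library (e.g. golang-jwt), so the token shape and names below are illustrative, not the project's actual code:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// AgentClaims is what the Master embeds in the token at registration.
type AgentClaims struct {
	AgentUUID string `json:"agent_uuid"`
	OwnerID   string `json:"owner_id"`
}

// SignAgentToken produces "<base64 payload>.<base64 hmac>" — the shape of
// a JWT minus the header, enough to show the trust model.
func SignAgentToken(claims AgentClaims, secret []byte) string {
	payload, _ := json.Marshal(claims)
	p := base64.RawURLEncoding.EncodeToString(payload)
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(p))
	sig := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return p + "." + sig
}

// VerifyAgentToken is what an interceptor runs on every gRPC call:
// recompute the HMAC and compare before trusting any claim in the token.
func VerifyAgentToken(token string, secret []byte) (AgentClaims, error) {
	var claims AgentClaims
	parts := strings.SplitN(token, ".", 2)
	if len(parts) != 2 {
		return claims, errors.New("malformed token")
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(parts[0]))
	want := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	if !hmac.Equal([]byte(want), []byte(parts[1])) {
		return claims, errors.New("bad signature")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[0])
	if err != nil {
		return claims, err
	}
	err = json.Unmarshal(payload, &claims)
	return claims, err
}

func main() {
	secret := []byte("master-signing-key")
	tok := SignAgentToken(AgentClaims{AgentUUID: "a-123", OwnerID: "u-7"}, secret)
	claims, err := VerifyAgentToken(tok, secret)
	fmt.Println(claims.AgentUUID, err)
}
```

The key property: only the Master knows the secret, so an Agent cannot forge another Agent's identity the way it could spoof a MAC address.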
The Issue:
We initially wrote a `DeleteChunk` function in the Master that tried to delete files from the Master's local disk.
The Realization:
In a distributed system, the Master never holds the data. The data lives on remote Agents.
The Fix:
Refactored the logic so the Master acts as a Coordinator:
- Master looks up which Agent holds the chunk.
- Master sends a gRPC command (the `DeleteChunk` RPC) to that specific Agent.
- Master removes the metadata from RAM.
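The coordinator pattern above can be sketched as follows. The `AgentClient` interface stands in for the real gRPC stub, and all names are illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// AgentClient abstracts the gRPC call to an Agent; in the real system this
// wraps a generated stub.
type AgentClient interface {
	DeleteChunk(chunkID string) error
}

// Master keeps metadata only: which Agent holds which chunk.
type Master struct {
	mu       sync.Mutex
	chunkLoc map[string]string      // chunkID -> agentID
	agents   map[string]AgentClient // agentID -> connection
}

// DeleteChunk coordinates: look up the owner, tell that Agent to remove the
// bytes, then drop the in-memory metadata. The Master never touches chunk
// data on its own disk.
func (m *Master) DeleteChunk(chunkID string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	agentID, ok := m.chunkLoc[chunkID]
	if !ok {
		return errors.New("unknown chunk: " + chunkID)
	}
	agent, ok := m.agents[agentID]
	if !ok {
		return errors.New("no connection to agent: " + agentID)
	}
	if err := agent.DeleteChunk(chunkID); err != nil {
		return err // keep metadata so a retry can still find the chunk
	}
	delete(m.chunkLoc, chunkID)
	return nil
}

// fakeAgent stands in for a remote Agent in this sketch.
type fakeAgent struct{ deleted []string }

func (f *fakeAgent) DeleteChunk(id string) error {
	f.deleted = append(f.deleted, id)
	return nil
}

func main() {
	a := &fakeAgent{}
	m := &Master{
		chunkLoc: map[string]string{"chunk-1": "agent-A"},
		agents:   map[string]AgentClient{"agent-A": a},
	}
	fmt.Println(m.DeleteChunk("chunk-1"), a.deleted)
}
```

Note the ordering choice: metadata is only removed after the Agent acknowledges the delete, so a failed RPC leaves the system retryable rather than leaking orphaned chunks.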
The Issue: We needed to track high-speed file system operations (Files, Chunks) AND persistent business data (User Balance, Credits) simultaneously. The Solution: We adopted a Split-Storage Architecture:
- Business Logic (Slow, Durable): Users, Billings, and Logs go to PostgreSQL.
- File Logic (Fast, Volatile): The File Namespace and Chunk Map live purely in RAM (Go maps) protected by `RWMutex`. This mirrors the architecture of HDFS/GFS for maximum throughput.
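A minimal sketch of the RAM-side store, using `sync.RWMutex` so many concurrent lookups (downloads) don't serialize behind occasional writes (uploads/deletes). Struct and method names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// ChunkMap is the in-RAM file-logic store: chunkID -> agents holding a
// replica. Reads take the shared lock; writes take the exclusive one.
type ChunkMap struct {
	mu     sync.RWMutex
	chunks map[string][]string
}

func NewChunkMap() *ChunkMap {
	return &ChunkMap{chunks: make(map[string][]string)}
}

// Put records a replica location (write lock).
func (c *ChunkMap) Put(chunkID, agentID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.chunks[chunkID] = append(c.chunks[chunkID], agentID)
}

// Locations answers a lookup (read lock, so readers run in parallel).
func (c *ChunkMap) Locations(chunkID string) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	// Return a copy so callers can't mutate shared state.
	locs := make([]string, len(c.chunks[chunkID]))
	copy(locs, c.chunks[chunkID])
	return locs
}

func main() {
	cm := NewChunkMap()
	cm.Put("chunk-1", "agent-A")
	cm.Put("chunk-1", "agent-B")
	fmt.Println(cm.Locations("chunk-1"))
}
```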
The Issue:
The Master node would panic on startup with `relation "users" does not exist` when trying to register the first user.
The Cause:
We were initializing the HTTP Server (which accepts requests) simultaneously with the Database connection, but before running migrations.
The Fix:
Refactored main.go to enforce a strict sequential startup:
- Connect DB.
- Run `AutoMigrate`.
- Only then start the HTTP/gRPC listeners.
The Issue:
We encountered `FATAL: prepared statement name is already in use` when connecting to NeonDB (Postgres).
The Cause:
NeonDB's pooled connections (like PgBouncer in transaction mode) do not handle prepared statements reliably, but GORM caches them by default.
The Fix:
Disabled prepared statement caching in GORM config:
```go
gorm.Config{
	PrepareStmt: false,
}
```

- Metadata Persistence: If the Master crashes, the In-Memory file map is lost. We need to implement an Operation Log (WAL) and Snapshotting (serializing the map to disk on shutdown).
- NAT Traversal: Agents behind home routers currently cannot be reached by the Master easily. We need to implement a Reverse Tunnel or require Port Forwarding.
This is the right move. Trying to solve business logic (billing) when the underlying file system is fragile is a recipe for disaster.
In distributed systems, failures are not exceptions; they are the norm. You must design your system assuming Agents will crash, networks will disconnect, and disks will fail.
Here is the robust architecture to solve both problems using a "Fail-Fast & Client-Driven Recovery" strategy.
Problem 1: Agent Goes Offline During Browser Upload

Scenario: The User is 50% done uploading a 1GB file to Agent A. Suddenly, Agent A loses power.

Current Behavior: The browser request hangs, times out, and the upload fails. The user sees "Network Error."
The Solution: Resumable Client-Side Retry

The Browser (Client) is the "driver." It must be smart enough to detect failure and ask for a new route.
Problem 2: Pipeline Breaks (The "Chain" Problem)

Scenario: You have a replication pipeline: Agent A → Agent B → Agent C.

- User uploads to A.
- A forwards to B.
- B forwards to C.

The Failure: Agent B crashes.

- A tries to send data to B but gets a gRPC error (Connection Refused).
- If A does nothing, the data exists only on A (under-replicated).
The Solution: Pipeline Repair (Short-Circuiting)

Instead of failing the whole upload, we allow the pipeline to "skip" the dead node.
The Algorithm (HDFS Style):

- Setup: Master gives Agent A the full path: [A, B, C].
- Detection: A tries to send to B. B is unreachable.
- Recovery (Short Circuit):
  - A looks at the list [A, B, C].
  - A sees B failed. A checks the next node: C.
  - A attempts to open a stream directly to C, bypassing B.
  - New Path: A → C.
- Reporting:
  - A finishes writing to C.
  - A reports to Master: "Upload Success, but path was [A, C]. B failed."
- Healing:
  - Master records the file exists on A and C (2 replicas).
  - Master marks B as suspect.
  - Master schedules a background task: "File 123 is under-replicated (2/3). Replicate C → D to restore health."
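The short-circuit step reduces to a small pure function: walk the planned path and keep only reachable downstream nodes. A sketch, with the reachability probe abstracted out (in reality it would be a gRPC dial with a deadline):

```go
package main

import (
	"errors"
	"fmt"
)

// RepairPipeline returns the surviving path given the Master's planned
// pipeline and a reachability probe. The first node is the local agent
// (assumed alive). If every downstream node is dead, the caller still
// holds one replica and reports under-replication to the Master.
func RepairPipeline(path []string, reachable func(node string) bool) ([]string, error) {
	if len(path) == 0 {
		return nil, errors.New("empty pipeline")
	}
	repaired := []string{path[0]} // self
	for _, node := range path[1:] {
		if reachable(node) {
			repaired = append(repaired, node)
		}
		// Unreachable nodes are simply skipped — the short circuit.
	}
	return repaired, nil
}

func main() {
	alive := map[string]bool{"A": true, "B": false, "C": true}
	path, err := RepairPipeline([]string{"A", "B", "C"},
		func(n string) bool { return alive[n] })
	// A would report the actual path to the Master so it can heal B.
	fmt.Println(path, err)
}
```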
To build a truly production-grade Distributed File System, you need to handle the unhappy paths. Here is a comprehensive list of edge cases you can solve, categorized by where things usually break.
| Edge Case | Scenario | Solution Strategy |
|---|---|---|
| Client Disconnect | User closes laptop lid or WiFi drops while uploading 99% of a 5GB file. | Garbage Collection (GC): Master marks file as pending at start. If not confirmed active within 1 hour, Master orders Agent to delete the partial data. |
| The "Straggler" Node | You have a pipeline of 3 agents. Agent A and C are fast, but Agent B is on a slow connection. The whole upload crawls to 10kb/s. | Pipeline Optimization: Agents monitor write speed. If a neighbor is too slow, the Agent drops them from the pipeline and reports them as "Slow/Congested" to the Master. |
| Timeout Hell | Agent A sends data to Agent B. Agent B accepts the connection but hangs (doesn't write or ack). Agent A waits forever. | Strict Deadlines: Implement SetDeadline() on all TCP/gRPC connections. If a chunk isn't acked in 5 seconds, consider the node dead and retry or short-circuit. |
| Zombie Agent | Master thinks Agent A is dead (missed heartbeat). But Agent A is actually alive and Client is still uploading to it. | Epoch/Version Numbers: Master increments a ClusterVersion on every view change. If Agent A reconnects with an old Version, it must "Re-join" and potentially wipe invalid state. |
| Edge Case | Scenario | Solution Strategy |
|---|---|---|
| Disk Full Mid-Stream | Agent A accepts an upload, but at 50%, its hard drive hits 100% capacity. | Space Reservation: Before accepting the stream, Agent checks FreeSpace > FileSize. Better yet, reserve the space in a temp file immediately. |
| Bit Rot (Corruption) | A file sits on the disk for 6 months. A cosmic ray flips a bit. The user downloads it and gets a corrupted image. | Checksums (CRC32/SHA256): Calculate checksum during upload and store it in Metadata. On download (or periodically), Agent recalculates and verifies. If mismatch, Master fetches a replica from another Agent. |
| The "Lie" (Missing File) | Master thinks file is on Agent A. User asks Agent A for it. Agent A says "404 Not Found" (maybe an admin deleted it manually). | Report Missing Replicas: If a Read fails, Agent reports "Replica Missing" to Master. Master immediately triggers replication from Agent B to Agent A to fix it. |
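The bit-rot defense from the table is mechanically simple, which is why periodic scrubbing is cheap to add. A CRC32 sketch (the table also mentions SHA256, which trades speed for stronger guarantees):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// ChunkChecksum is computed once at upload time and stored in the
// Master's metadata next to the chunk entry.
func ChunkChecksum(data []byte) uint32 {
	return crc32.ChecksumIEEE(data)
}

// VerifyChunk recomputes the checksum on download or during a periodic
// scrub. A mismatch means bit rot: discard this replica and fetch a
// healthy one from another Agent.
func VerifyChunk(data []byte, want uint32) bool {
	return crc32.ChecksumIEEE(data) == want
}

func main() {
	chunk := []byte("chunk payload bytes")
	sum := ChunkChecksum(chunk)

	fmt.Println(VerifyChunk(chunk, sum)) // healthy replica

	rotted := append([]byte(nil), chunk...)
	rotted[3] ^= 0x01 // simulate a single flipped bit
	fmt.Println(VerifyChunk(rotted, sum)) // corruption detected
}
```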
| Edge Case | Scenario | Solution Strategy |
|---|---|---|
| Name Collision | User Alice uploads report.pdf. At the exact same millisecond, User Bob uploads report.pdf to the same folder. | Namespace Locking: Master puts a Mutex on the parent directory or uses "Compare-And-Swap" logic. Reject the second request or auto-rename to report (1).pdf. |
| Read while Write | User A is uploading a huge video. User B tries to download that video before it's finished. | Status Flags: Metadata has a state: UPLOADING vs READY. Master rejects download requests for files in UPLOADING state. |
| The "ABA" Problem | Agent A crashes. Master reassigns its work to Agent B. Agent A comes back online thinking it still owns the work. | Lease Mechanism: Master gives "Leases" (time-bound permissions) to Agents. If the lease expires, Agent A stops working until it renews with Master. |
| Edge Case | Scenario | Solution Strategy |
|---|---|---|
| Master Restart (Amnesia) | You restart the Master. Since metadata is in RAM, it forgets all 10,000 files existed. | Operation Log (WAL): Log every Create, Delete, Register event to a file on disk before updating RAM. On startup, replay the log to rebuild RAM state. |
| Agent Re-Registration | Master restarts. Agents are still running. Agents try to heartbeat, but Master says "Who are you?". | Block Report: On first heartbeat after connection loss, Agents send a full list of all chunks they hold. Master rebuilds its map based on these reports. |
I recommend solving "Disk Full Mid-Stream" and "Client Disconnect (Garbage Collection)" first.
- Why? They happen constantly in real usage.
- How? They are easy to simulate (just pull your ethernet cable or fill your disk with dummy data).
Would you like to start with the Garbage Collector logic for cleaning up failed uploads?