Building CloudSync required solving complex problems across networking, distributed consensus, and database management. Below is a log of the significant technical hurdles we encountered and resolved.
The Issue: The Frontend (Next.js) needed to send authentication cookies/headers to the Backend (Go). However, the browser blocked requests with the error:
"The value of the 'Access-Control-Allow-Origin' header in the response must not be the wildcard '*' when the request's credentials mode is 'include'."
The Root Cause:
We were using `AllowAllOrigins: true` in our Go CORS middleware. Browser security specifications forbid using a wildcard (`*`) when credentials (cookies/auth headers) are involved, to prevent CSRF attacks.
The Solution: We reconfigured the CORS middleware to explicitly whitelist the frontend origin:
```go
config.AllowOrigins = []string{"http://localhost:3000"} // Explicit allow-list
config.AllowCredentials = true
```

The Issue: How do we trust that an Agent is who it says it is? Initial designs used MAC addresses, but those can be spoofed.

The Solution: We implemented a JWT-based Handshake:
- On register, the Master generates a signed JWT containing the Agent's UUID and Owner ID.
- The Agent saves this token to disk (`agent_token.jwt`).
- All subsequent gRPC calls (Heartbeats, Replication) must include this token in the metadata header, which is verified by a global `AuthInterceptor` on the Master.
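The sign/verify flow of the handshake can be sketched with stdlib HMAC primitives. A real build would use a proper JWT library (e.g. golang-jwt), so the token shape and names below are illustrative, not the project's actual code:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// AgentClaims is what the Master embeds in the token at registration.
type AgentClaims struct {
	AgentUUID string `json:"agent_uuid"`
	OwnerID   string `json:"owner_id"`
}

// SignAgentToken produces "<base64 payload>.<base64 hmac>" — the shape of
// a JWT minus the header, enough to show the trust model.
func SignAgentToken(claims AgentClaims, secret []byte) string {
	payload, _ := json.Marshal(claims)
	p := base64.RawURLEncoding.EncodeToString(payload)
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(p))
	sig := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return p + "." + sig
}

// VerifyAgentToken is what an interceptor runs on every gRPC call:
// recompute the HMAC and compare before trusting any claim in the token.
func VerifyAgentToken(token string, secret []byte) (AgentClaims, error) {
	var claims AgentClaims
	parts := strings.SplitN(token, ".", 2)
	if len(parts) != 2 {
		return claims, errors.New("malformed token")
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(parts[0]))
	want := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	if !hmac.Equal([]byte(want), []byte(parts[1])) {
		return claims, errors.New("bad signature")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[0])
	if err != nil {
		return claims, err
	}
	err = json.Unmarshal(payload, &claims)
	return claims, err
}

func main() {
	secret := []byte("master-signing-key")
	tok := SignAgentToken(AgentClaims{AgentUUID: "a-123", OwnerID: "u-7"}, secret)
	claims, err := VerifyAgentToken(tok, secret)
	fmt.Println(claims.AgentUUID, err)
}
```

The key property: only the Master knows the secret, so an Agent cannot forge another Agent's identity the way it could spoof a MAC address.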
The Issue:
We initially wrote a `DeleteChunk` function in the Master that tried to delete files from the Master's local disk.
The Realization:
In a distributed system, the Master never holds the data. The data lives on remote Agents.
The Fix:
Refactored the logic so the Master acts as a Coordinator:
- Master looks up which Agent holds the chunk.
- Master sends a gRPC command (the `DeleteChunk` RPC) to that specific Agent.
- Master removes the metadata from RAM.
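The coordinator pattern above can be sketched as follows. The `AgentClient` interface stands in for the real gRPC stub, and all names are illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// AgentClient abstracts the gRPC call to an Agent; in the real system this
// wraps a generated stub.
type AgentClient interface {
	DeleteChunk(chunkID string) error
}

// Master keeps metadata only: which Agent holds which chunk.
type Master struct {
	mu       sync.Mutex
	chunkLoc map[string]string      // chunkID -> agentID
	agents   map[string]AgentClient // agentID -> connection
}

// DeleteChunk coordinates: look up the owner, tell that Agent to remove the
// bytes, then drop the in-memory metadata. The Master never touches chunk
// data on its own disk.
func (m *Master) DeleteChunk(chunkID string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	agentID, ok := m.chunkLoc[chunkID]
	if !ok {
		return errors.New("unknown chunk: " + chunkID)
	}
	agent, ok := m.agents[agentID]
	if !ok {
		return errors.New("no connection to agent: " + agentID)
	}
	if err := agent.DeleteChunk(chunkID); err != nil {
		return err // keep metadata so a retry can still find the chunk
	}
	delete(m.chunkLoc, chunkID)
	return nil
}

// fakeAgent stands in for a remote Agent in this sketch.
type fakeAgent struct{ deleted []string }

func (f *fakeAgent) DeleteChunk(id string) error {
	f.deleted = append(f.deleted, id)
	return nil
}

func main() {
	a := &fakeAgent{}
	m := &Master{
		chunkLoc: map[string]string{"chunk-1": "agent-A"},
		agents:   map[string]AgentClient{"agent-A": a},
	}
	fmt.Println(m.DeleteChunk("chunk-1"), a.deleted)
}
```

Note the ordering choice: metadata is only removed after the Agent acknowledges the delete, so a failed RPC leaves the system retryable rather than leaking orphaned chunks.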
The Issue: We needed to track high-speed file system operations (Files, Chunks) AND persistent business data (User Balance, Credits) simultaneously. The Solution: We adopted a Split-Storage Architecture:
- Business Logic (Slow, Durable): Users, Billings, and Logs go to PostgreSQL.
- File Logic (Fast, Volatile): The File Namespace and Chunk Map live purely in RAM (Go maps) protected by `RWMutex`. This mirrors the architecture of HDFS/GFS for maximum throughput.
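A minimal sketch of the RAM-side store, using `sync.RWMutex` so many concurrent lookups (downloads) don't serialize behind occasional writes (uploads/deletes). Struct and method names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// ChunkMap is the in-RAM file-logic store: chunkID -> agents holding a
// replica. Reads take the shared lock; writes take the exclusive one.
type ChunkMap struct {
	mu     sync.RWMutex
	chunks map[string][]string
}

func NewChunkMap() *ChunkMap {
	return &ChunkMap{chunks: make(map[string][]string)}
}

// Put records a replica location (write lock).
func (c *ChunkMap) Put(chunkID, agentID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.chunks[chunkID] = append(c.chunks[chunkID], agentID)
}

// Locations answers a lookup (read lock, so readers run in parallel).
func (c *ChunkMap) Locations(chunkID string) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	// Return a copy so callers can't mutate shared state.
	locs := make([]string, len(c.chunks[chunkID]))
	copy(locs, c.chunks[chunkID])
	return locs
}

func main() {
	cm := NewChunkMap()
	cm.Put("chunk-1", "agent-A")
	cm.Put("chunk-1", "agent-B")
	fmt.Println(cm.Locations("chunk-1"))
}
```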
The Issue:
The Master node would panic on startup with `relation "users" does not exist` when trying to register the first user.
The Cause:
We were initializing the HTTP Server (which accepts requests) simultaneously with the Database connection, but before running migrations.
The Fix:
Refactored main.go to enforce a strict sequential startup:
- Connect DB.
- Run `AutoMigrate`.
- Only then start the HTTP/gRPC listeners.
The Issue:
We encountered `FATAL: prepared statement name is already in use` when connecting to NeonDB (Postgres).
The Cause:
NeonDB's pooled connections (like PgBouncer in transaction mode) do not handle prepared statements reliably, but GORM caches them by default.
The Fix:
Disabled prepared statement caching in GORM config:
```go
gorm.Config{
	PrepareStmt: false,
}
```

- Metadata Persistence: If the Master crashes, the In-Memory file map is lost. We need to implement an Operation Log (WAL) and Snapshotting (serializing the map to disk on shutdown).
- NAT Traversal: Agents behind home routers currently cannot be reached by the Master easily. We need to implement a Reverse Tunnel or require Port Forwarding.
This is the right move. Trying to solve business logic (billing) when the underlying file system is fragile is a recipe for disaster.
In distributed systems, failures are not exceptions; they are the norm. You must design your system assuming Agents will crash, networks will disconnect, and disks will fail.
Here is the robust architecture to solve both problems using a "Fail-Fast & Client-Driven Recovery" strategy.
Problem 1: Agent Goes Offline During Browser Upload

Scenario: The User is 50% done uploading a 1GB file to Agent A. Suddenly, Agent A loses power.

Current Behavior: The browser request hangs, times out, and the upload fails. The user sees "Network Error."
The Solution: Resumable Client-Side Retry

The Browser (Client) is the "driver." It must be smart enough to detect failure and ask for a new route.
Problem 2: Pipeline Breaks (The "Chain" Problem)

Scenario: You have a replication pipeline: Agent A → Agent B → Agent C.

- User uploads to A.
- A forwards to B.
- B forwards to C.

The Failure: Agent B crashes.

- A tries to send data to B but gets a gRPC error (Connection Refused).
- If A does nothing, the data exists only on A (under-replicated).
The Solution: Pipeline Repair (Short-Circuiting)

Instead of failing the whole upload, we allow the pipeline to "skip" the dead node.
The Algorithm (HDFS Style):

- Setup: Master gives Agent A the full path: [A, B, C].
- Detection: A tries to send to B. B is unreachable.
- Recovery (Short Circuit):
  - A looks at the list [A, B, C].
  - A sees B failed. A checks the next node: C.
  - A attempts to open a stream directly to C, bypassing B.
  - New Path: A → C.
- Reporting:
  - A finishes writing to C.
  - A reports to Master: "Upload Success, but path was [A, C]. B failed."
- Healing:
  - Master records the file exists on A and C (2 replicas).
  - Master marks B as suspect.
  - Master schedules a background task: "File 123 is under-replicated (2/3). Replicate C → D to restore health."
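The short-circuit step reduces to a small pure function: walk the planned path and keep only reachable downstream nodes. A sketch, with the reachability probe abstracted out (in reality it would be a gRPC dial with a deadline):

```go
package main

import (
	"errors"
	"fmt"
)

// RepairPipeline returns the surviving path given the Master's planned
// pipeline and a reachability probe. The first node is the local agent
// (assumed alive). If every downstream node is dead, the caller still
// holds one replica and reports under-replication to the Master.
func RepairPipeline(path []string, reachable func(node string) bool) ([]string, error) {
	if len(path) == 0 {
		return nil, errors.New("empty pipeline")
	}
	repaired := []string{path[0]} // self
	for _, node := range path[1:] {
		if reachable(node) {
			repaired = append(repaired, node)
		}
		// Unreachable nodes are simply skipped — the short circuit.
	}
	return repaired, nil
}

func main() {
	alive := map[string]bool{"A": true, "B": false, "C": true}
	path, err := RepairPipeline([]string{"A", "B", "C"},
		func(n string) bool { return alive[n] })
	// A would report the actual path to the Master so it can heal B.
	fmt.Println(path, err)
}
```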
To build a truly production-grade Distributed File System, you need to handle the unhappy paths. Here is a comprehensive list of edge cases you can solve, categorized by where things usually break.
| Edge Case | Scenario | Solution Strategy |
|---|---|---|
| Client Disconnect | User closes laptop lid or WiFi drops while uploading 99% of a 5GB file. | Garbage Collection (GC): Master marks file as pending at start. If not confirmed active within 1 hour, Master orders Agent to delete the partial data. |
| The "Straggler" Node | You have a pipeline of 3 agents. Agent A and C are fast, but Agent B is on a slow connection. The whole upload crawls to 10kb/s. | Pipeline Optimization: Agents monitor write speed. If a neighbor is too slow, the Agent drops them from the pipeline and reports them as "Slow/Congested" to the Master. |
| Timeout Hell | Agent A sends data to Agent B. Agent B accepts the connection but hangs (doesn't write or ack). Agent A waits forever. | Strict Deadlines: Implement SetDeadline() on all TCP/gRPC connections. If a chunk isn't acked in 5 seconds, consider the node dead and retry or short-circuit. |
| Zombie Agent | Master thinks Agent A is dead (missed heartbeat). But Agent A is actually alive and Client is still uploading to it. | Epoch/Version Numbers: Master increments a ClusterVersion on every view change. If Agent A reconnects with an old Version, it must "Re-join" and potentially wipe invalid state. |
| Edge Case | Scenario | Solution Strategy |
|---|---|---|
| Disk Full Mid-Stream | Agent A accepts an upload, but at 50%, its hard drive hits 100% capacity. | Space Reservation: Before accepting the stream, Agent checks FreeSpace > FileSize. Better yet, reserve the space in a temp file immediately. |
| Bit Rot (Corruption) | A file sits on the disk for 6 months. A cosmic ray flips a bit. The user downloads it and gets a corrupted image. | Checksums (CRC32/SHA256): Calculate checksum during upload and store it in Metadata. On download (or periodically), Agent recalculates and verifies. If mismatch, Master fetches a replica from another Agent. |
| The "Lie" (Missing File) | Master thinks file is on Agent A. User asks Agent A for it. Agent A says "404 Not Found" (maybe an admin deleted it manually). | Report Missing Replicas: If a Read fails, Agent reports "Replica Missing" to Master. Master immediately triggers replication from Agent B to Agent A to fix it. |
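The bit-rot defense from the table is mechanically simple, which is why periodic scrubbing is cheap to add. A CRC32 sketch (the table also mentions SHA256, which trades speed for stronger guarantees):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// ChunkChecksum is computed once at upload time and stored in the
// Master's metadata next to the chunk entry.
func ChunkChecksum(data []byte) uint32 {
	return crc32.ChecksumIEEE(data)
}

// VerifyChunk recomputes the checksum on download or during a periodic
// scrub. A mismatch means bit rot: discard this replica and fetch a
// healthy one from another Agent.
func VerifyChunk(data []byte, want uint32) bool {
	return crc32.ChecksumIEEE(data) == want
}

func main() {
	chunk := []byte("chunk payload bytes")
	sum := ChunkChecksum(chunk)

	fmt.Println(VerifyChunk(chunk, sum)) // healthy replica

	rotted := append([]byte(nil), chunk...)
	rotted[3] ^= 0x01 // simulate a single flipped bit
	fmt.Println(VerifyChunk(rotted, sum)) // corruption detected
}
```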
| Edge Case | Scenario | Solution Strategy |
|---|---|---|
| Name Collision | User Alice uploads report.pdf. At the exact same millisecond, User Bob uploads report.pdf to the same folder. | Namespace Locking: Master puts a Mutex on the parent directory or uses "Compare-And-Swap" logic. Reject the second request or auto-rename to report (1).pdf. |
| Read while Write | User A is uploading a huge video. User B tries to download that video before it's finished. | Status Flags: Metadata has a state: UPLOADING vs READY. Master rejects download requests for files in UPLOADING state. |
| The "ABA" Problem | Agent A crashes. Master reassigns its work to Agent B. Agent A comes back online thinking it still owns the work. | Lease Mechanism: Master gives "Leases" (time-bound permissions) to Agents. If the lease expires, Agent A stops working until it renews with Master. |
| Edge Case | Scenario | Solution Strategy |
|---|---|---|
| Master Restart (Amnesia) | You restart the Master. Since metadata is in RAM, it forgets all 10,000 files existed. | Operation Log (WAL): Log every Create, Delete, Register event to a file on disk before updating RAM. On startup, replay the log to rebuild RAM state. |
| Agent Re-Registration | Master restarts. Agents are still running. Agents try to heartbeat, but Master says "Who are you?". | Block Report: On first heartbeat after connection loss, Agents send a full list of all chunks they hold. Master rebuilds its map based on these reports. |
I recommend solving "Disk Full Mid-Stream" and "Client Disconnect (Garbage Collection)" first.
- Why? They happen constantly in real usage.
- How? They are easy to simulate (just pull your ethernet cable or fill your disk with dummy data).
Would you like to start with the Garbage Collector logic for cleaning up failed uploads?