To build a truly production-grade Distributed File System, you need to handle the unhappy paths. Here is a comprehensive list of edge cases you can solve, categorized by where things usually break.
1. Network & Connectivity (The "Flaky Internet" Problem)
| Edge Case | Scenario | Solution Strategy |
| --- | --- | --- |
| Client Disconnect | User closes the laptop lid or WiFi drops at 99% of a 5 GB upload. | Garbage Collection (GC): Master marks the file as `pending` at upload start. If it is not confirmed `active` within 1 hour, the Master orders the Agent to delete the partial data. |
| The "Straggler" Node | A pipeline of 3 agents: Agents A and C are fast, but Agent B is on a slow link, so the whole upload crawls at 10 KB/s. | Pipeline Optimization: Agents monitor write speed. If a neighbor is too slow, the Agent drops it from the pipeline and reports it as "Slow/Congested" to the Master. |
| Timeout Hell | Agent A sends data to Agent B. Agent B accepts the connection but hangs (never writes or acks). Agent A waits forever. | Strict Deadlines: Call `SetDeadline()` on all TCP/gRPC connections. If a chunk isn't acked within 5 seconds, treat the node as dead and retry or short-circuit the pipeline. |
| Zombie Agent | Master thinks Agent A is dead (missed heartbeats), but Agent A is actually alive and a Client is still uploading to it. | Epoch/Version Numbers: Master increments a `ClusterVersion` on every view change. If Agent A reconnects with a stale version, it must re-join and potentially wipe invalid state. |
2. Storage & Hardware (The "Real World" Problem)
| Edge Case | Scenario | Solution Strategy |
| --- | --- | --- |
| Disk Full Mid-Stream | Agent A accepts an upload, but at 50% its hard drive hits 100% capacity. | Space Reservation: Before accepting the stream, the Agent checks `FreeSpace > FileSize`. Better yet, reserve the space in a temp file immediately. |
| Bit Rot (Corruption) | A file sits on disk for 6 months. A cosmic ray flips a bit. The user downloads it and gets a corrupted image. | Checksums (CRC32/SHA-256): Calculate the checksum during upload and store it in metadata. On download (or periodically), the Agent recalculates and verifies. On mismatch, the Master fetches a replica from another Agent. |
| The "Lie" (Missing File) | Master thinks the file is on Agent A. The user asks Agent A for it, and Agent A returns "404 Not Found" (perhaps an admin deleted it manually). | Report Missing Replicas: If a read fails, the Agent reports "Replica Missing" to the Master, which immediately triggers re-replication from Agent B to Agent A. |
3. Concurrency & Logic (The "Race Condition" Problem)
| Edge Case | Scenario | Solution Strategy |
| --- | --- | --- |
| Name Collision | User Alice uploads `report.pdf`. At the exact same millisecond, User Bob uploads `report.pdf` to the same folder. | Namespace Locking: Master takes a mutex on the parent directory or uses compare-and-swap logic, then rejects the second request or auto-renames it to `report (1).pdf`. |
| Read while Write | User A is uploading a huge video. User B tries to download that video before it's finished. | Status Flags: Metadata carries a state: `UPLOADING` vs. `READY`. Master rejects download requests for files still in `UPLOADING`. |
| The "ABA" Problem | Agent A crashes. Master reassigns its work to Agent B. Agent A comes back online believing it still owns the work. | Lease Mechanism: Master grants time-bound leases to Agents. When its lease expires, Agent A must stop working until it renews with the Master. |
4. Master Node Failures (The "Brain Dead" Problem)
| Edge Case | Scenario | Solution Strategy |
| --- | --- | --- |
| Master Restart (Amnesia) | You restart the Master. Since metadata lives in RAM, it forgets all 10,000 files ever existed. | Operation Log (WAL): Log every Create, Delete, and Register event to a file on disk before updating RAM. On startup, replay the log to rebuild the in-memory state. |
| Agent Re-Registration | Master restarts while the Agents are still running. Agents try to heartbeat, but the Master says "Who are you?" | Block Report: On the first heartbeat after a connection loss, each Agent sends a full list of the chunks it holds. Master rebuilds its chunk map from these reports. |
Which one should you tackle first?
I recommend solving "Disk Full Mid-Stream" and "Client Disconnect (Garbage Collection)" first.
- Why: they happen constantly in real usage.
- How: they are easy to simulate (just pull your ethernet cable or fill your disk with dummy data).
Would you like to start with the Garbage Collector logic for cleaning up failed uploads?