To build a truly production-grade Distributed File System, you need to handle the unhappy paths. Here is a comprehensive list of edge cases you can solve, categorized by where things usually break.
1. Network & Connectivity (The "Flaky Internet" Problem)
| Edge Case | Scenario | Solution Strategy |
| --- | --- | --- |
| Client Disconnect | User closes the laptop lid or WiFi drops at 99% of a 5 GB upload. | Garbage Collection (GC): Master marks the file as `pending` at upload start. If it is not confirmed `active` within 1 hour, the Master orders the Agent to delete the partial data. |
| The "Straggler" Node | A pipeline of 3 agents: Agents A and C are fast, but Agent B is on a slow link, so the whole upload crawls at 10 KB/s. | Pipeline Optimization: Agents monitor write speed. If a neighbor is too slow, the Agent drops it from the pipeline and reports it as "Slow/Congested" to the Master. |
| Timeout Hell | Agent A sends data to Agent B. Agent B accepts the connection but hangs (never writes or acks). Agent A waits forever. | Strict Deadlines: Call `SetDeadline()` on all TCP/gRPC connections. If a chunk isn't acked within 5 seconds, treat the node as dead and retry or short-circuit the pipeline. |
| Zombie Agent | Master thinks Agent A is dead (missed heartbeats), but Agent A is actually alive and a Client is still uploading to it. | Epoch/Version Numbers: Master increments a `ClusterVersion` on every view change. If Agent A reconnects with a stale version, it must re-join and potentially wipe invalid state. |
2. Storage & Hardware (The "Real World" Problem)
| Edge Case | Scenario | Solution Strategy |
| --- | --- | --- |
| Disk Full Mid-Stream | Agent A accepts an upload, but at 50% its hard drive hits 100% capacity. | Space Reservation: Before accepting the stream, the Agent checks `FreeSpace > FileSize`. Better yet, reserve the space in a temp file immediately. |
| Bit Rot (Corruption) | A file sits on disk for 6 months. A cosmic ray flips a bit. The user downloads it and gets a corrupted image. | Checksums (CRC32/SHA-256): Calculate the checksum during upload and store it in metadata. On download (or periodically), the Agent recalculates and verifies. On mismatch, the Master fetches a replica from another Agent. |
| The "Lie" (Missing File) | Master thinks the file is on Agent A. The user asks Agent A for it, and Agent A returns "404 Not Found" (perhaps an admin deleted it manually). | Report Missing Replicas: If a read fails, the Agent reports "Replica Missing" to the Master, which immediately triggers re-replication from Agent B to Agent A. |
3. Concurrency & Logic (The "Race Condition" Problem)
| Edge Case | Scenario | Solution Strategy |
| --- | --- | --- |
| Name Collision | User Alice uploads `report.pdf`. At the exact same millisecond, User Bob uploads `report.pdf` to the same folder. | Namespace Locking: Master takes a mutex on the parent directory or uses compare-and-swap logic, then rejects the second request or auto-renames it to `report (1).pdf`. |
| Read while Write | User A is uploading a huge video. User B tries to download that video before it's finished. | Status Flags: Metadata carries a state: `UPLOADING` vs. `READY`. Master rejects download requests for files still in `UPLOADING`. |
| The "ABA" Problem | Agent A crashes. Master reassigns its work to Agent B. Agent A comes back online believing it still owns the work. | Lease Mechanism: Master grants time-bound leases to Agents. When its lease expires, Agent A must stop working until it renews with the Master. |
4. Master Node Failures (The "Brain Dead" Problem)
| Edge Case | Scenario | Solution Strategy |
| --- | --- | --- |
| Master Restart (Amnesia) | You restart the Master. Since metadata lives in RAM, it forgets all 10,000 files ever existed. | Operation Log (WAL): Log every Create, Delete, and Register event to a file on disk before updating RAM. On startup, replay the log to rebuild the in-memory state. |
| Agent Re-Registration | Master restarts while the Agents are still running. Agents try to heartbeat, but the Master says "Who are you?" | Block Report: On the first heartbeat after a connection loss, each Agent sends a full list of the chunks it holds. Master rebuilds its chunk map from these reports. |
Which one should you tackle first?
I recommend solving "Disk Full Mid-Stream" and "Client Disconnect (Garbage Collection)" first.
- Why: they happen constantly in real usage.
- How: they are easy to simulate (just pull your ethernet cable or fill your disk with dummy data).
Would you like to start with the Garbage Collector logic for cleaning up failed uploads?