A simple Go implementation of a distributed MapReduce framework, based on the Google MapReduce paper.
For context, the MapReduce design simplifies large-scale data processing. The user
provides two functions: map and reduce.
- The map function reads data and emits a set of intermediate key-value pairs.
- The reduce function collects all intermediate values for a given key and "reduces" them to a final output.
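
As a rough illustration of the two user-supplied functions, here is a minimal word-count sketch. The `KeyValue` type and the exact `Map`/`Reduce` signatures are assumptions made for this example and may not match the ones in this repository; the grouping in `main` only mimics in-process what the framework does across workers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// KeyValue is one intermediate pair emitted by Map.
// The type name and fields are assumptions for this sketch.
type KeyValue struct {
	Key   string
	Value string
}

// Map splits the file contents into words and emits (word, "1") per word.
// The filename is unused here but kept to show a typical signature.
func Map(filename string, contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	kvs := make([]KeyValue, 0, len(words))
	for _, w := range words {
		kvs = append(kvs, KeyValue{Key: w, Value: "1"})
	}
	return kvs
}

// Reduce receives every value emitted for one key and returns the count.
func Reduce(key string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	// Group values by key in-process; the real framework does this
	// through intermediate files shared between map and reduce tasks.
	grouped := map[string][]string{}
	for _, kv := range Map("example.txt", "the quick fox jumps over the lazy dog the end") {
		grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
	}
	fmt.Println("the:", Reduce("the", grouped["the"])) // prints "the: 3"
}
```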
While this project is only a lab submission for the MIT Distributed Systems
course, its goal is ambitious: to provide a distributed system that
handles (1) task scheduling, (2) data partitioning, and (3) fault
tolerance, allowing the user to focus on the logic of the map and
reduce functions.
The system uses a coordinator/worker architecture communicating via RPC. The two main code files are mr/coordinator.go and mr/worker.go.
- Coordinator: A single master process responsible for:
  - Task Scheduling: Assigns map and reduce tasks to workers. Map tasks are created for each input file and run to completion before reduce tasks are scheduled.
  - Fault Tolerance: If a worker doesn't complete a task within a timeout (e.g., 10 seconds), the coordinator reassigns the task to another worker.
- Worker: Executes map and reduce tasks assigned by the coordinator.
- Data Flow:
  - map tasks read input files and partition their output into intermediate files (mr-X-Y) using a hash function on the keys. X is the map task ID, and Y is the reduce task ID.
  - reduce tasks read their corresponding intermediate files, process the data, and write final output to mr-out-Y files.
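
The partitioning step in the data flow can be sketched as follows. This is a minimal example, assuming an FNV-1a hash and helper names (`ihash`, `intermediateName`, `partition`) chosen for illustration; the hash function and helpers actually used in this repository may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ihash maps a key to a non-negative int using FNV-1a.
// The choice of FNV-1a is an assumption for this sketch.
func ihash(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() & 0x7fffffff)
}

// intermediateName builds the mr-X-Y file name for map task x
// and reduce task y.
func intermediateName(x, y int) string {
	return fmt.Sprintf("mr-%d-%d", x, y)
}

// partition assigns each key to a reduce task by hashing it modulo
// nReduce, so equal keys always land in the same bucket.
func partition(keys []string, nReduce int) map[int][]string {
	buckets := make(map[int][]string)
	for _, k := range keys {
		y := ihash(k) % nReduce
		buckets[y] = append(buckets[y], k)
	}
	return buckets
}

func main() {
	const mapTaskID, nReduce = 1, 3
	keys := []string{"apple", "banana", "apple", "cherry"}
	for y, ks := range partition(keys, nReduce) {
		fmt.Println(intermediateName(mapTaskID, y), ks)
	}
}
```

Because the bucket depends only on the key, every map task sends a given key to the same reduce task, which is what lets reduce task Y read only the mr-*-Y files.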
Communication is handled via RPC over Unix domain sockets, so this simple prototype runs on a single machine.
To run coordinator and workers on separate machines, as they would in practice, we need to (1) set up RPCs to communicate over TCP/IP instead of Unix sockets, and (2) read/write files using a shared file system.
To verify the implementation, including parallelism and fault tolerance, run the test script:

    # Navigate to the main directory
    cd main
    # Run the test script on all test cases
    sh ./test-mr.sh

A successful run will end with the output *** PASSED ALL TESTS.
Workers and the coordinator communicate via a single RPC method, Coordinator.RequestTask (see mr/rpc.go), which allows a worker to report its previous task's completion and request a new one in a single call.
- Worker -> Coordinator (RequestTaskArgs): A worker sends its completed task ID and a list of output files. A task ID of 0 indicates a new worker.
- Coordinator -> Worker (RequestTaskReply): The coordinator responds with a new map/reduce task or a signal to exit if the job is complete.