This repository showcases a basic implementation of external-merge-sort and its performance compared to regular sorting backed up swap-space / page-cache (MMAP).
Note: The code-quality, compilation / execution setup and documentation are exteremely basic (not expected the repository have grown like this, was supposed to be a few files).
The sorting is performed on a file of 8-byte values (LONGs) using different approaches with limited memory allocated to the processes.
Approaches:
- Dumb (user-space memory, double-buffering of file contents, sys-calls IO, using file-system cache): Reading all the data from the file into user-space, sorting it and writing it all back. The only reason why it works is due to the swap-space (requires it to be large enough to fit in all the user-space memory).
- Mapped (kernel-space memory, no double-buffering, memory-mapped IO): Mapping the entire contents of the file into user-space (the actual physical-pages belong to the page-cache) and operating on them. This method works well enough as a low-effort solution.
- External-Merge-Sort (ONLY user-space memory, direct IO, no file-system / page cache): Performing the actual external-merge-sort with minimum assistance of the OS (no page-cache / page-buffer, only using direct IO, allocating all the memory at once into user-space, not using swap-space).
See STATS.md for example stats of each approach under specific conditions.
Important: Unlike the other two approaches, the external-merge-sort can be optimized by A LOT, but only the basic implementation is used to showcase that even the most simple implementation still performs much better (see the "Future" section for the list of optimizations).
Requirements:
- Linux operating system (Ubuntu 24.04.03 was used).
- C-Groups V2: Configuration of the memory-limit and inspection of the IO / memory stats depends on it.
- Support of swap-space: Needed for running the "Dumb" approach.
- GCC C++ compiler (with version 20 support) and GLIBC.
- Admin-level permissions (performing
sudo): Needed for C-Groups, Swap-space and other setup.
Setup:
- C-Groups V2: Enabling the use of CPU, memory and IO controllers (
./setup-cg.sh). - Swap-Space: Creating a 1GB swap-space for the "Dumb" approach to work (
./setup-swap.sh).
Execution:
- Populating a data-file with 8-byte values (
./populate.sh): Generates LONG values from 1 to the specific limit, shuffles them into a random order and writes them out. Note: Separated from the actual sorting script to have no affect on the memory-usage / IO / file-system / etc. - Running the actual sorting (
./run.sh): Used for running sorting using all of the approaches (comment out to select the target one for compilation, the rest of the script is the same).- Compiles the target sorting executable
- Flushes / clears the page-cache and page-buffer (simulating complete cold-start / no data in memory).
- Initializes a C-Group for running the sorting with a specific memory-limit and joining it (therefore the sorting process is also going to join).
- Running the sorting and checking that it was successful.
- Outputting stats about CPU, memory and IO utilization.
- Checking that the sorting was done correclty (
./check.sh): Confirms that the LONG values are in order with no items missing.
Top-level:
build: Generated for building executables (no object-files, cause the compilation setup is very basic)src: Source-files for the utilities and sorting code.workspace: Directory with the data-file as well as temporary work-files (only generated by external-merge-sort).- Scripts (
run.sh/setup-cg.sh/ etc.): Setup and execution of sorting (see the details above).
Code:
common.hpp: Common definitions of items, their sizes and various minor helpers.data: Code for populating the data-file (populate.cpp) and checking it after sorting (check.cpp).sort: Different sorting approaches implementations (everything related to the external-merge-sort is underexternal).
See the actual code for implementation details (rough for now, but should be readable).
As mentione, external-merge-sort can be further optimized and improved:
- Performing double-buffering: Reading in a new block of data for sorting while performing sorting on a previous block of data (interleaving CPU work and IO). Utilizing just a single CPU by having the IO be performed in non-blocking / asynchronous way.
- Parallelizing: Performing sorting and merging in parallel using multiple threads / multiple CPUs.
- Tuning work-memory separation / distribution of buffer-pages amongst components of the algorithm.
- Optimizing data-structures / using more specialized algorithms.
- etc.