Machine: GCP e2-medium with pd-balanced SSD disk
- CPUs: 2 (Only a single core is used)
- Memory: 4GB (Only 10MB are actually used)
- Disk: 32GB SSD (1GB is used at most)
Input: 33M items -> ~250MB of data
Resources: 10MB memory-limit (at least 1MB of it goes just to the code / libraries / etc.)
Common:
- CPU: All implementations are single-threaded / use only a single CPU
- Memory: All implementations need at least 1MB for the code to run properly
- 0.5MB of stack space for the quick-sort / base sorting-algorithm -> can OOM during base-sorting otherwise
- 1MB for the code and libraries (everything is loaded from scratch) -> Only about 0.5MB actually needs to be present in memory
Note: When no memory-limit is set / processes have access to an unlimited amount of memory, all implementations perform about the same.
Gets OOMed: the allocated memory-buffer exceeds the memory-limit and it's impossible to swap it out.
Does NOT get OOMed, but causes insane swapping, which is MUCH slower than page-cache thrashing (Linux internals).
Time: ~2min45s, only 30s are spent on the CPU
- IO: ~2min (Swapping in and out on memory-pressure)
- CPU: ~30s (~15s in user-space / ~15s in kernel-space handling swap page-faults and file-system system-calls)
Memory: 10MB (Making full use)
- Swap-space: ~250MB (Keeping the entire memory-buffer in swap-space due to memory-constraints)
- File-system (Page-cache): ~0.5MB (Barely used, as its pages keep getting evicted under memory-pressure)
IO: Swapping incurs HUGE IO
- Reading: ~680K disk-reads, ~3GB in total (SWAPPING IN AND OUT INDIVIDUAL PAGES IS HORRIBLE)
- Writing: ~640K disk-writes, ~2.8GB in total (SAME)
Overall: Whether the program can run at all depends entirely on the machine setup (swap-space being big enough), and even then it performs HORRIBLY.
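The naive approach above boils down to holding the whole dataset in a single anonymous buffer. A minimal sketch (record values and sizes here are illustrative assumptions, not the actual implementation):

```python
import random

# Hypothetical sketch of the naive variant: the entire dataset lives in
# one anonymous in-memory buffer. At ~250MB against a 10MB memory-limit
# this either trips the OOM killer (no swap-space) or thrashes swap.
def sort_in_memory(items):
    buf = list(items)  # whole dataset resident at once
    buf.sort()         # single-threaded in-place sort over the buffer
    return buf

data = [random.randrange(1_000_000) for _ in range(100_000)]
assert sort_in_memory(data) == sorted(data)
```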
Completes successfully regardless of swap-space (does NOT use it, as all memory resides in the kernel-space page-cache).
Time: ~1min, most spent stalling / waiting on IO
- IO: ~40s (Paging in and out on memory-pressure)
- CPU: ~20s (~15s in user-space for actual sorting / ~5s in kernel-space for paging)
Memory: 10MB (Makes full use of the allocated memory under the limit)
- Swap-space: 0 bytes (Does NOT use, all the memory is used via the kernel page-cache)
- File-system (Page-cache): ~10MB (Uses it as much as possible)
IO: Page-cache is more efficient than swap-space when it comes to managing IOs (Same amount of data, but less operations)
- Reading: ~24K disk-reads, ~2.2GB in total (re-reading the same pages over and over while thrashing)
- Writing: ~25K disk-writes, ~2.3GB in total (same idea, plus the additional final-sync cost)
Overall: Performs extremely poorly, but at least it is guaranteed to work regardless of the machine (memory-limit / swap-space / etc.)
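The page-cache-backed variant above keeps the buffer in a file-backed mapping, so the kernel writes dirty pages back through the page-cache instead of swapping anonymous memory. A hedged sketch assuming fixed-width 8-byte records (the unpack/repack is only to keep it short; a real implementation would sort in place inside the mapping):

```python
import mmap, os, struct, tempfile

REC = struct.Struct("<q")  # fixed-width 8-byte records (an assumption)

# Hypothetical mmap-backed variant: the work buffer IS the file mapping,
# so under memory-pressure the kernel evicts clean pages and writes back
# dirty ones via the page-cache rather than using swap-space.
def sort_file_backed(path):
    size = os.path.getsize(path)
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), size) as m:
        vals = sorted(v for (v,) in REC.iter_unpack(m))
        for i, v in enumerate(vals):
            REC.pack_into(m, i * REC.size, v)
        m.flush()  # final write-back of dirty pages

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"".join(REC.pack(v) for v in [5, 1, 9, 3]))
sort_file_backed(tmp.name)
with open(tmp.name, "rb") as f:
    assert [v for (v,) in REC.iter_unpack(f.read())] == [1, 3, 5, 9]
os.unlink(tmp.name)
```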
Allows configuring a dedicated amount of work-memory for the algorithm, regardless of the memory-limits. The algorithm sorts individual chunks of data separately and then combines them via an N-way merge-sort, using the disk to store intermediate results.
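The chunk-then-merge scheme just described can be sketched as follows (record format and the tiny `chunk_records` budget are assumptions standing in for the real work-memory configuration):

```python
import heapq, itertools, struct, tempfile

REC = struct.Struct("<q")  # fixed-width 8-byte records (an assumption)

def read_records(f):
    while rec := f.read(REC.size):
        yield REC.unpack(rec)[0]

# Sketch of the external merge-sort described above: sort chunks that fit
# into the configured work-memory, spill each as a sorted run to disk,
# then combine all runs with an N-way merge.
def external_sort(src, dst, chunk_records=1024):
    runs = []
    with open(src, "rb") as f:
        records = read_records(f)
        while chunk := list(itertools.islice(records, chunk_records)):
            chunk.sort()  # in-memory sort of one run
            run = tempfile.TemporaryFile()
            run.write(b"".join(REC.pack(v) for v in chunk))
            run.seek(0)
            runs.append(run)
    with open(dst, "wb") as out:
        # heapq.merge streams the sorted runs, holding only one record
        # per run in memory at a time
        for v in heapq.merge(*(read_records(r) for r in runs)):
            out.write(REC.pack(v))
    for r in runs:
        r.close()
```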
Time: ~23s, most spent ACTUALLY WORKING
- IO: ~6s (Reading and writing individual sorted-runs)
- CPU: ~17s (Almost all in user-space, sorting individual runs and merging them)
Memory: <=10MB (only 9MB of the assigned work-memory is guaranteed to the algorithm; the remaining 1MB is used for loading code / libraries and quick-sort stack space)
- Swap-space: 0 bytes (Does NOT use, all the memory is used via the kernel page-cache)
- File-system (Page-cache): ~0.5MB (Only for loading code and libraries, as direct-IO is used for all the disk operations)
IO: Performs direct IO, bypassing the page-cache / page-buffer / file-system ENTIRELY - the internal work-memory serves as the IO buffer
- Reading: ~2.5K disk-reads, ~500MB in total (reading from the data-file and from the temporary work-file)
- Writing: ~2.5K disk-writes, ~500MB in total (writing to the data-file and to the temporary work-file)
Overall: The best approach of the three, even though the code itself is not yet fully optimized.
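The direct-IO idea above can be sketched with `O_DIRECT`: a page-aligned slice of our own work-memory is the transfer buffer, and reads/writes bypass the page-cache. The flag is Linux-specific and some filesystems (e.g. tmpfs) reject it, so this sketch falls back to buffered IO in that case:

```python
import mmap, os, tempfile

ALIGN = 4096  # O_DIRECT wants buffer address, file offset and size aligned

# Hypothetical sketch: open with O_DIRECT when the filesystem supports it,
# otherwise fall back to ordinary buffered IO.
def open_direct(path, flags):
    try:
        return os.open(path, flags | os.O_DIRECT)
    except OSError:
        return os.open(path, flags)

def roundtrip(path, payload):
    buf = mmap.mmap(-1, ALIGN)      # anonymous, page-aligned work buffer
    buf[: len(payload)] = payload
    fd = open_direct(path, os.O_WRONLY | os.O_CREAT)
    os.write(fd, buf)               # one aligned ALIGN-byte write
    os.close(fd)
    buf[:] = bytes(ALIGN)           # clear, then read straight back in
    fd = open_direct(path, os.O_RDONLY)
    os.preadv(fd, [buf], 0)         # read directly into the aligned buffer
    os.close(fd)
    return bytes(buf[: len(payload)])

with tempfile.NamedTemporaryFile(delete=False) as t:
    path = t.name
assert roundtrip(path, b"hello") == b"hello"
os.unlink(path)
```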
FUTURE: Optimizing the actual low-level logic as well as the high-level design
- Double-buffering (reading / writing data and operating on it at the same time using asynchronous IO)
- Improving the way work-memory is partitioned / split between individual parts
- Reusing sequences in the temporary-file instead of just appending (free-space management)
- etc.
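The double-buffering item can be illustrated with a minimal sketch; the real plan above calls for asynchronous IO, and a reader thread with a bounded queue is only the simplest stand-in for that:

```python
import queue, threading

# Hypothetical double-buffering sketch: a reader keeps up to `depth`
# chunks in flight while the main thread processes the previous one,
# overlapping IO with CPU work instead of alternating between them.
def process_stream(next_chunk, handle, depth=2):
    q = queue.Queue(maxsize=depth)

    def reader():
        while (chunk := next_chunk()) is not None:
            q.put(chunk)
        q.put(None)  # end-of-stream sentinel

    threading.Thread(target=reader, daemon=True).start()
    while (chunk := q.get()) is not None:
        handle(chunk)  # runs while the reader fetches the next chunk

chunks = iter([b"run1", b"run2", b"run3"])
seen = []
process_stream(lambda: next(chunks, None), seen.append)
assert seen == [b"run1", b"run2", b"run3"]
```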