Machine: GCP e2-medium with pd-balanced SSD disk
- CPUs: 2 (Only a single core is used)
- Memory: 4GB (Only 10MB are actually used)
- Disk: 32GB SSD (1GB is used at most)
Input: 33M items -> ~250MB of data
Resources: 10MB memory-limit (at least 1MB of it goes just to the code / libraries / etc.)
Common:
- CPU: All implementations are single-threaded / use only a single CPU
- Memory: All implementations need at least 1MB for the code to run properly
- 0.5MB of stack space for the quick-sort / base sorting-algorithm -> can OOM during base-sorting otherwise
- 1MB for the code and libraries (everything is loaded from scratch) -> Only about 0.5MB actually needs to be present in memory
Note: When no memory-limit is set / processes have access to an unlimited amount of memory, all implementations perform about the same.
Gets OOMed: the allocated memory-buffer exceeds the memory-limit and it's impossible to swap it out.
Does NOT get OOMed, but causes insane swapping, which is MUCH slower than page-cache thrashing (Linux internals).
Time: ~2min45s, only 30s are spent on the CPU
- IO: ~2min (Swapping in and out on memory-pressure)
- CPU: ~30s (~15s in user-space / ~15s in kernel-space handling swap page-faults and file-system system-calls)
Memory: 10MB (Making full use)
- Swap-space: ~250MB (Keeping the entire memory-buffer in swap-space due to memory-constraints)
- File-system (Page-cache): ~0.5MB (Barely used, as its pages keep getting evicted under memory-pressure)
IO: Swapping incurs HUGE IO
- Reading: ~680K disk-reads, ~3GB in total (SWAPPING IN AND OUT INDIVIDUAL PAGES IS HORRIBLE)
- Writing: ~640K disk-writes, ~2.8GB in total (SAME)
Overall: Whether the program can run at all depends entirely on the machine setup (swap-space being big enough), and even then it performs HORRIBLY.
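The naive approach above boils down to holding the whole dataset in a single anonymous buffer. A minimal sketch (record values and sizes here are illustrative assumptions, not the actual implementation):

```python
import random

# Hypothetical sketch of the naive variant: the entire dataset lives in
# one anonymous in-memory buffer. At ~250MB against a 10MB memory-limit
# this either trips the OOM killer (no swap-space) or thrashes swap.
def sort_in_memory(items):
    buf = list(items)  # whole dataset resident at once
    buf.sort()         # single-threaded in-place sort over the buffer
    return buf

data = [random.randrange(1_000_000) for _ in range(100_000)]
assert sort_in_memory(data) == sorted(data)
```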
Completes successfully regardless of swap-space (does NOT use it, as all memory resides in the kernel-space page-cache).
Time: ~1min, most spent stalling / waiting on IO
- IO: ~40s (Paging in and out on memory-pressure)
- CPU: ~20s (~15s in user-space for actual sorting / ~5s in kernel-space for paging)
Memory: 10MB (Makes full use of the allocated memory under the limit)
- Swap-space: 0 bytes (Does NOT use, all the memory is used via the kernel page-cache)
- File-system (Page-cache): ~10MB (Uses it as much as possible)
IO: Page-cache is more efficient than swap-space when it comes to managing IOs (Same amount of data, but less operations)
- Reading: ~24K disk-reads, ~2.2GB in total (re-reading the same pages over and over while thrashing)
- Writing: ~25K disk-writes, ~2.3GB in total (same idea, plus the additional final-sync cost)
Overall: Performs extremely poorly, but at least it is guaranteed to work regardless of the machine (memory-limit / swap-space / etc.)
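The page-cache-backed variant above keeps the buffer in a file-backed mapping, so the kernel writes dirty pages back through the page-cache instead of swapping anonymous memory. A hedged sketch assuming fixed-width 8-byte records (the unpack/repack is only to keep it short; a real implementation would sort in place inside the mapping):

```python
import mmap, os, struct, tempfile

REC = struct.Struct("<q")  # fixed-width 8-byte records (an assumption)

# Hypothetical mmap-backed variant: the work buffer IS the file mapping,
# so under memory-pressure the kernel evicts clean pages and writes back
# dirty ones via the page-cache rather than using swap-space.
def sort_file_backed(path):
    size = os.path.getsize(path)
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), size) as m:
        vals = sorted(v for (v,) in REC.iter_unpack(m))
        for i, v in enumerate(vals):
            REC.pack_into(m, i * REC.size, v)
        m.flush()  # final write-back of dirty pages

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"".join(REC.pack(v) for v in [5, 1, 9, 3]))
sort_file_backed(tmp.name)
with open(tmp.name, "rb") as f:
    assert [v for (v,) in REC.iter_unpack(f.read())] == [1, 3, 5, 9]
os.unlink(tmp.name)
```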
Allows configuring a dedicated amount of work-memory for the algorithm, regardless of the memory-limits. The algorithm sorts individual chunks of data separately and then combines them via an N-way merge-sort, using the disk to store intermediate results.
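The chunk-then-merge scheme just described can be sketched as follows (record format and the tiny `chunk_records` budget are assumptions standing in for the real work-memory configuration):

```python
import heapq, itertools, struct, tempfile

REC = struct.Struct("<q")  # fixed-width 8-byte records (an assumption)

def read_records(f):
    while rec := f.read(REC.size):
        yield REC.unpack(rec)[0]

# Sketch of the external merge-sort described above: sort chunks that fit
# into the configured work-memory, spill each as a sorted run to disk,
# then combine all runs with an N-way merge.
def external_sort(src, dst, chunk_records=1024):
    runs = []
    with open(src, "rb") as f:
        records = read_records(f)
        while chunk := list(itertools.islice(records, chunk_records)):
            chunk.sort()  # in-memory sort of one run
            run = tempfile.TemporaryFile()
            run.write(b"".join(REC.pack(v) for v in chunk))
            run.seek(0)
            runs.append(run)
    with open(dst, "wb") as out:
        # heapq.merge streams the sorted runs, holding only one record
        # per run in memory at a time
        for v in heapq.merge(*(read_records(r) for r in runs)):
            out.write(REC.pack(v))
    for r in runs:
        r.close()
```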
Time: ~23s, most spent ACTUALLY WORKING
- IO: ~6s (Reading and writing individual sorted-runs)
- CPU: ~17s (Almost all in user-space, sorting individual runs and merging them)
Memory: <=10MB (only 9MB of the assigned work-memory is guaranteed to the algorithm; the remaining 1MB is used for loading code / libraries and quick-sort stack space)
- Swap-space: 0 bytes (Does NOT use, all the memory is used via the kernel page-cache)
- File-system (Page-cache): ~0.5MB (Only for loading code and libraries, as direct-IO is used for all the disk operations)
IO: Performs direct IO, bypassing the page-cache / page-buffer / file-system ENTIRELY - the internal work-memory serves as the IO buffer
- Reading: ~2.5K disk-reads, ~500MB in total (reading from the data-file and from the temporary work-file)
- Writing: ~2.5K disk-writes, ~500MB in total (writing to the data-file and to the temporary work-file)
Overall: The best approach of the three, even though the code itself is not yet fully optimized.
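The direct-IO idea above can be sketched with `O_DIRECT`: a page-aligned slice of our own work-memory is the transfer buffer, and reads/writes bypass the page-cache. The flag is Linux-specific and some filesystems (e.g. tmpfs) reject it, so this sketch falls back to buffered IO in that case:

```python
import mmap, os, tempfile

ALIGN = 4096  # O_DIRECT wants buffer address, file offset and size aligned

# Hypothetical sketch: open with O_DIRECT when the filesystem supports it,
# otherwise fall back to ordinary buffered IO.
def open_direct(path, flags):
    try:
        return os.open(path, flags | os.O_DIRECT)
    except OSError:
        return os.open(path, flags)

def roundtrip(path, payload):
    buf = mmap.mmap(-1, ALIGN)      # anonymous, page-aligned work buffer
    buf[: len(payload)] = payload
    fd = open_direct(path, os.O_WRONLY | os.O_CREAT)
    os.write(fd, buf)               # one aligned ALIGN-byte write
    os.close(fd)
    buf[:] = bytes(ALIGN)           # clear, then read straight back in
    fd = open_direct(path, os.O_RDONLY)
    os.preadv(fd, [buf], 0)         # read directly into the aligned buffer
    os.close(fd)
    return bytes(buf[: len(payload)])

with tempfile.NamedTemporaryFile(delete=False) as t:
    path = t.name
assert roundtrip(path, b"hello") == b"hello"
os.unlink(path)
```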
FUTURE: Optimizing the actual low-level logic as well as the high-level design
- Double-buffering (reading / writing data and operating on it at the same time using asynchronous IO)
- Improving the way work-memory is partitioned / split between individual parts
- Reusing sequences in the temporary-file instead of just appending (free-space management)
- etc.
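The double-buffering item can be illustrated with a minimal sketch; the real plan above calls for asynchronous IO, and a reader thread with a bounded queue is only the simplest stand-in for that:

```python
import queue, threading

# Hypothetical double-buffering sketch: a reader keeps up to `depth`
# chunks in flight while the main thread processes the previous one,
# overlapping IO with CPU work instead of alternating between them.
def process_stream(next_chunk, handle, depth=2):
    q = queue.Queue(maxsize=depth)

    def reader():
        while (chunk := next_chunk()) is not None:
            q.put(chunk)
        q.put(None)  # end-of-stream sentinel

    threading.Thread(target=reader, daemon=True).start()
    while (chunk := q.get()) is not None:
        handle(chunk)  # runs while the reader fetches the next chunk

chunks = iter([b"run1", b"run2", b"run3"])
seen = []
process_stream(lambda: next(chunks, None), seen.append)
assert seen == [b"run1", b"run2", b"run3"]
```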