A Lock-Free Cache-Aware Pipeline for High Throughput Joins - SIGMOD 2025 Programming Contest

Implementation of a cache-efficient, multithreaded, lock-free, hash-based join pipeline utilizing a memory-efficient hash table optimized for joins. This project was created for the SIGMOD 2025 Programming Competition, where our undergraduate team achieved 4th place globally! Our implementation achieves a 630x speedup over the organizers' reference solution.

Our Team

Our undergraduate team, representing University of Athens, consists of the following members:

Preparation

Tip

Run all the following commands for this and the subsequent sections in the root directory of this project.

First, download the imdb dataset.

./download_imdb.sh

Second, prepare the DuckDB database for correctness checking.

./build/build_database imdb.db

Building

Build the project with the following commands:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -Wno-dev
cmake --build build -- -j $(nproc)

Execution

You can run some quick unit tests that verify the correct execution of the program with:

./build/unit_tests

Execute the full query set with:

./build/run plans.json

Run individual queries (e.g. queries 1a and 1b) using:

./build/run plans.json 1a 1b

Task Overview

Develop an efficient, in-memory join pipeline executor delivering high-performance across a wide range of hardware architectures and configurations.
Evaluated on completely anonymized data and queries sampled from the Join Order Benchmark (J.O.B. on the IMDB Dataset).
Input Data provided as in-memory, column-store, paginated tables. Produced output follows the same format.
Join Execution Plan provided and optimized by PostgreSQL as a directed tree of Scan Nodes - representing input tables - and Join Nodes.
Joins are performed only on integer fields, while Doubles and Strings can also appear in the result. Attributes can have null values.
Example of a Join Plan:

Implementation Details

Loading and Preprocessing

Analyze the Plan to classify columns as join fields, output-only attributes or unused.
Unused columns are discarded. Remaining columns require fast random access.
The vast majority of columns are fully packed without null values and can be accessed in place in constant time with a single division.
Iterators are used when such columns are accessed serially to eliminate the division on the hot path.
Remaining columns are copied in a contiguous array to enable random access.
Strings are copied lazily only when they actually appear in the result using fast hardware instructions.
Join execution can start early with output-only columns loaded lazily in the background.
99+% of columns never have to be loaded or are processed in the background, thereby approximating a completely zero-copy solution!
Do not spend any time collecting stats in an attempt to beat Postgres’ plan optimizer.

Unchained Hash Table

Hash table optimized for joins [1].
Since inserts and lookups are not mixed, separate building and probing into distinct phases.
Partition Data among workers to parallelize building.
All the entries of the table live in one large contiguous block, massively improving cache locality.
Use of CRC hardware instruction instead of off-the-shelf hash functions to hash in only a few cycles while maintaining adequate collision resistance.
Built-in bloom filter in the lower 16-bits of directory pointer and use of precalculated tags for containment check.

Join Data Flow & Materialization

Late Record Materialization to eliminate unnecessary data copies to reduce cache pollution.
Each join materializes the column needed for the next join in the pipeline to improve memory access patterns.
Joins at the root of the tree materilaze the final output.

Parallel Data Processing

Implement a Priority-Based Task Queue that dynamically assigns tasks on available hardware threads.
Prioritize tasks that enable processing of interdependent joins.
Morsel-driven Parallelism [2]: break down heavier jobs into self-contained tasks, each responsible for independently processing a smaller chunk of the data.
Implement a lock-free chunk feeder mechanism to efficiently handle skew: tasks that finish early can dynamically “steal” pages from other tasks to balance workload.
Each task independently produces paginated results and deposits them in dedicated per-worker slots without incurring any synchronization cost.
Dynamic Job Batching for very small tasks to reduce overhead.
Eliminate false sharing by properly aligning concurrently accessed data structures.

Pipeline

Intermediate results are chunked and immediately forwarded so the next join in the pipeline can also progress while the data are still hot in CPU’s Caches.
Completely lock-free pipeline implemented using atomic instructions to minimize synchronization overhead and enable finer granularity tasks.
Every completed task executes part of the pipeline control logic - appropriately advancing pipeline state and kickstarting tasks for next stages or joins.
Schedule further processing of the data to the same hardware thread that produced it - optimizing cache efficiency.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
images		images
include		include
job		job
src		src
tests		tests
.clang-format		.clang-format
.clangd		.clangd
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md
download_imdb.sh		download_imdb.sh
plans.json		plans.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

A Lock-Free Cache-Aware Pipeline for High Throughput Joins - SIGMOD 2025 Programming Contest

Contents

Our Team

Preparation

Building

Execution

Task Overview

Implementation Details

Loading and Preprocessing

Unchained Hash Table

Join Data Flow & Materialization

Parallel Data Processing

Pipeline

References

About

Uh oh!

Releases

Packages

Contributors 3

Uh oh!

Languages

sigmod25-teamOPT/optimized-join-pipeline

Folders and files

Latest commit

History

Repository files navigation

A Lock-Free Cache-Aware Pipeline for High Throughput Joins - SIGMOD 2025 Programming Contest

Contents

Our Team

Preparation

Building

Execution

Task Overview

Implementation Details

Loading and Preprocessing

Unchained Hash Table

Join Data Flow & Materialization

Parallel Data Processing

Pipeline

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages