Spark Connect C++

Overview

This repository provides a native C++ client for Apache Spark Connect, enabling C++ applications to execute Spark SQL workloads against remote Spark clusters without requiring JVM dependencies.

Spark Connect introduces a decoupled client–server architecture for Apache Spark, where client applications construct logical query plans that are executed remotely by a Spark server. This project delivers an idiomatic, high-performance C++ interface for building and submitting Spark queries, with efficient Apache Arrow–based columnar data exchange over gRPC.

The library is intended for environments where:

  • Native C++ integration is required
  • JVM runtimes are impractical or undesirable
  • High-throughput data movement is necessary
  • Tight control over memory and system resources is important
  • Existing performance-critical C++ systems need to integrate with Spark

Status: Work in Progress

Design Goals

  • Native-first Spark integration for C++ ecosystems
  • Efficient Arrow-based columnar data transfer
  • Clear separation between client logic and remote Spark execution
  • Compatibility with evolving Spark Connect protocols
  • Predictable performance characteristics suitable for production systems
  • Minimal runtime dependencies outside standard native infrastructure

Use Cases

The Spark Connect C++ client is designed for environments that require native system integration, efficient columnar data transfer, and low-overhead communication with remote Spark clusters.

Native Data Producers and High-Throughput Pipelines

Enable high-performance C++ services to submit ingestion workloads and push structured datasets to Spark using Apache Arrow serialization without JVM dependencies.

Integration with Legacy, Industrial, and Scientific Systems

Connect existing C++ infrastructure such as industrial control systems, financial engines, robotics platforms, HPC pipelines, or scientific instrumentation directly to Spark for large-scale analytics.

AI/ML and Inference Pipelines

Run high-speed inference using native libraries (e.g., TensorRT, ONNX Runtime, CUDA pipelines) and stream features, embeddings, or predictions into Spark for downstream analytics and training workflows.

Custom High-Performance Data Connectors

Build specialized connectors for proprietary data stores, binary protocols, in-memory engines, or high-throughput messaging systems that benefit from native execution and fine-grained memory control.

Interactive Analytics Clients

Develop native dashboards, monitoring tools, and backend services that execute Spark SQL queries through Spark Connect with minimal client-side overhead.

Edge Gateways and Data Aggregation Services

Use C++ aggregation services or gateway nodes to collect telemetry from distributed environments and forward structured datasets to centralized Spark clusters for processing.

Performance Testing and Systems Benchmarking

Evaluate Spark Connect performance characteristics such as serialization overhead, network latency, query planning costs, and concurrent client behavior using native workloads.


When Should You Use the C++ Client?

The Spark Connect C++ client is not a replacement for Python or Scala Spark APIs. Instead, it enables Spark integration in environments where native execution, performance constraints, or system-level integration are required.

Use the C++ Client When

  • You are integrating Spark into an existing C++ application or platform

  • Your environment cannot depend on a JVM runtime

  • You are building high-performance ingestion or producer services

  • You require integration with:

    • HPC systems
    • scientific computing pipelines
    • robotics or industrial platforms
    • embedded gateways or native backend services
  • You run AI/ML inference pipelines in C++ and need to forward structured outputs into Spark

  • You are implementing proprietary data connectors or binary protocols

  • You require precise control over memory layout and data transfer efficiency

Prefer Python or Scala Spark APIs When

  • You are building notebooks or data science workflows
  • Your team primarily consists of data engineers or analysts
  • You require the full Spark API surface immediately
  • Rapid prototyping is more important than native performance
  • Your applications already run comfortably in JVM or Python environments

API Implementation Status

Category      API Feature                        Status
Session       Databricks / Local Conn            Implemented
Reader        CSV                                Implemented
Reader        JSON                               Implemented
Reader        Parquet                            Implemented
Writer        Parquet (Overwrite/Gzip)           Implemented
Schema        JSON / Print / Column List         Implemented
Action        Show / Collect / Head / First      Implemented
Query         SQL / Range                        Implemented
Logic         Filter / Where / Distinct          Implemented
Relational    Joins (Inner, Outer, Expr)         Implemented
Group         GroupBy & Global Aggs              Implemented
Analytics     Window Functions                   Planned
Catalog       Table/Database Management          Planned
Streaming     Structured Streaming               Not Implemented
GraphFrames   Graph processing & analytics       Implemented

Architecture Deep Dive

The Spark Connect C++ client follows the Spark Connect execution model:

Client API Layer

Applications use a native C++ API to construct DataFrame operations and Spark SQL queries. These operations are translated into logical execution plans rather than executed locally.
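
A minimal sketch of this flow, reusing only the builder, sql(), and show() calls shown in the sample application later in this README, and assuming a local Spark Connect endpoint at sc://localhost:

#include <spark_connect_cpp/session.h>

int main()
{
    auto spark = &SparkSession::builder()
                     .master("sc://localhost")
                     .appName("plan-builder")
                     .getOrCreate();

    // Constructing the DataFrame only records a logical plan on the client;
    // nothing is executed locally.
    auto df = spark->sql("SELECT id, id * 2 AS doubled FROM range(10)");

    // The plan is submitted to the remote cluster only when an action such
    // as show() or collect() is invoked.
    df.show();
}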

Logical Plan Construction

The client builds Spark logical plans representing transformations, queries, and actions. No distributed computation occurs inside the client process.

Serialization Layer

Logical plans and data batches are serialized using:

  • Protobuf for query plans and RPC communication
  • Apache Arrow for efficient columnar data transfer (see the sketch after this list)
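
For readers unfamiliar with Arrow, the following standalone sketch (plain Arrow C++ API, independent of this client library) builds the kind of columnar record batch that travels over the wire:

#include <arrow/api.h>
#include <iostream>

int main()
{
    // One int64 column "id" with three rows -- the columnar layout used for
    // batches streamed between client and server.
    arrow::Int64Builder builder;
    if (!builder.AppendValues({1, 2, 3}).ok()) return 1;

    std::shared_ptr<arrow::Array> ids;
    if (!builder.Finish(&ids).ok()) return 1;

    auto schema = arrow::schema({arrow::field("id", arrow::int64())});
    auto batch  = arrow::RecordBatch::Make(schema, ids->length(), {ids});

    std::cout << batch->ToString() << std::endl;
    return 0;
}

Compile it against Arrow alone, e.g. g++ arrow_batch.cpp $(pkg-config --cflags --libs arrow) -o arrow_batch.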

Transport Layer

Communication with the Spark Connect server occurs via:

  • gRPC streaming RPCs
  • bidirectional execution channels
  • Arrow batch streaming

Spark Server Execution

The Spark Connect server:

  • receives logical plans
  • executes distributed workloads
  • performs query planning and optimization
  • returns Arrow-encoded results to the client

This separation enables native applications to leverage Spark’s distributed engine without embedding Spark or JVM runtimes locally.


Getting Started

1. Prerequisites

  • Apache Spark 3.5+ with Spark Connect enabled

  • C++17 or later

  • Required libraries:

    • gRPC
    • Protobuf
    • Apache Arrow
    • uuid

2. Build & Run Tests

Linux (Ubuntu/Debian)

# --------------------------------
# Install all required dependencies
# --------------------------------
chmod +x ./install_deps.sh
./install_deps.sh

mkdir build && cd build

# ----------------------------------
# Build the Spark Connect Client
# ----------------------------------
cmake ..
make -j$(nproc)

# --------------------------------
# Make sure Spark is running...
# --------------------------------
docker compose up spark --build

# ---------------------------
# Run Full Test Suite
# ---------------------------
ctest --output-on-failure --verbose
# ctest --verbose --test-arguments=--gtest_color=yes

# -----------------------------
# Run Single Test Suite
# -----------------------------
ctest -R test_dataframe_reader --verbose

# ------------------------------
# Run Single Test Case
# ------------------------------
ctest -R test_dataframe_writer --test-args --gtest_filter=SparkIntegrationTest.ParquetWrite

# --------------------------------
# Run Test Suite directly
# --------------------------------
./test_<suite_name>

# --------------------------------
# Run Single Test Case directly - show output
# --------------------------------
./test_dataframe --gtest_filter=SparkIntegrationTest.DropDuplicates

3. Memory Checks (Valgrind)

mkdir -p build && cd build

cmake .. \
  -DCMAKE_BUILD_TYPE=Debug \
  -DCTEST_MEMORYCHECK_COMMAND=/usr/bin/valgrind \
  -DCTEST_MEMORYCHECK_TYPE=Valgrind

valgrind --leak-check=full --show-leak-kinds=all ./test_dataframe

4. Code Coverage

This project uses gcovr for coverage reporting.

gcovr reports which parts of the compiled code were exercised while the test suite ran.

Installation:

sudo apt update
sudo apt install -y gcovr

Generate Coverage:

cmake -S . -B build -DENABLE_COVERAGE=ON
cmake --build build -j

gcovr -r src \
      --object-directory build \
      --exclude '.*\.pb\.cc' \
      --exclude '.*\.grpc\.pb\.cc' \
      --exclude '.*\.h' \
      --html-details coverage.html \
      --html-theme green \
      --print-summary \
      --fail-under-line 70

This will generate a coverage report in HTML format.

5. Installation & Usage

mkdir -p build && cd build
cmake ..
make -j$(nproc)
cpack

This generates a compressed archive under the build directory.

CPack: Create package using TGZ
CPack: Install projects
CPack: - Run preinstall target for: spark_connect_cpp
CPack: - Install project: spark_connect_cpp []
CPack: Create package
CPack: - package: /home/user/spark-connect-cpp/build/spark-connect-cpp-0.0.1-linux-amd64.tar.gz generated.

Extract the contents of the archive:

tar -xzf spark-connect-cpp-0.0.1-linux-amd64.tar.gz

You can verify the contents by running:

ls -F spark-connect-cpp-0.0.1-linux-amd64

Copy contents to global path:

sudo cp -r ~/spark-connect-cpp/build/spark-connect-cpp-0.0.1-linux-amd64/include/spark_connect_cpp /usr/local/include/
sudo cp ~/spark-connect-cpp/build/spark-connect-cpp-0.0.1-linux-amd64/lib/libspark_connect_cpp.a /usr/local/lib/

# --------------------------------------------------------------------------------------------
# Purpose of the pkgconfig directory
# --------------------------------------------------------------------------------------------
#
# The pkgconfig directory holds .pc metadata files for installed libraries. Each .pc file
# records where a library's headers and binaries live, along with the compiler and linker
# flags needed to build against it.
#
# The pkg-config tool reads these files and prints the required flags to standard output,
# so build systems can query host-specific library locations instead of hard-coding them
# into the codebase.
# --------------------------------------------------------------------------------------------
sudo mkdir -p /usr/local/lib/pkgconfig
sudo cp ~/spark-connect-cpp/build/spark-connect-cpp-0.0.1-linux-amd64/lib/pkgconfig/spark_connect_cpp.pc /usr/local/lib/pkgconfig/

Verify that the contents were copied successfully:

ls /usr/local/lib/pkgconfig

Update the Shared Library Cache:

sudo ldconfig

Verify Spark Connect C++ is accessible globally:

pkg-config --exists spark_connect_cpp && echo "Library is global" || echo "Library not found"

Running a sample application:

#include <spark_connect_cpp/session.h>

int main()
{
    // Connect to a local Spark Connect endpoint.
    auto spark = &SparkSession::builder()
                     .master("sc://localhost")
                     .appName("demo")
                     .getOrCreate();

    // The query is planned and executed on the remote Spark server.
    auto df = spark->sql("SELECT 1 AS id");

    df.show();
}

Compile the sample application:

g++ src/main.cpp $(pkg-config --cflags --libs spark_connect_cpp) -o spark_app

Run the sample application:

./spark_app

Output:

+----------------------+
| id                   |
+----------------------+
| 1                    |
+----------------------+

For detailed development setup instructions, see:


License

Apache 2.0
