TypeScript client for Apache Spark Connect.
Construct Spark logical plans entirely in TypeScript and run them against a Spark Connect server.
- Build Spark logical plans using a fluent, PySpark-style API in TypeScript
- Evaluate transformations locally or stream results via Arrow
- Tagless Final DSL design with support for multiple backends
- Composable, immutable, and strongly typed DataFrame operations
- Column expressions (`col`, `.gt`, `.alias`, `.and`, etc.), sketched right after this list
- Compatible with Spark Connect Protobuf and `spark-submit --class org.apache.spark.sql.connect.service.SparkConnectServer`
- Set operations (`UNION`, `INTERSECT`, `EXCEPT`) with `by_name`, `is_all`, and `allow_missing_columns`
- Spark-compatible joins with configurable join types
- Session-aware execution (no global singletons)
- Plan viz / AST dump: export client AST to JSON & Mermaid
- Ready-to-run examples in `examples/`
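A minimal sketch of how those column expressions compose, using only the helpers named above (`col`, `.gt`, `.eq`, `.and`, `.alias`); the column names are illustrative:

```ts
import { col } from "./src/engine/column";

// Illustrative columns; the expression helpers are the ones listed above.
const isBigRecent = col("amount").gt(100)
  .and(col("year").eq(2024))
  .alias("is_big_recent");

// Use it like any other column, e.g. df.select(isBigRecent).
```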
```bash
npm i ts-spark-connector
# or: yarn add / pnpm add
```

```ts
import { SparkSession, col } from "ts-spark-connector";
```

You need a running Spark Connect server. See `spark-server/README.md` for a ready-to-use Docker setup, or run your own server:

```bash
cd spark-server
docker compose up -d
```
```ts
import { SparkSession } from "./src/client/session";
import { col } from "./src/engine/column";

const session = SparkSession.builder()
  // optional: auth / TLS
  .getOrCreate();

const people = session.read
  .option("delimiter", "\t")
  .option("header", "true")
  .csv("/data/people.tsv");

const purchases = session.read
  .option("delimiter", "\t")
  .option("header", "true")
  .csv("/data/purchases.tsv");

await people
  .join(purchases, col("id").eq(col("user_id")), "left")
  .select("name", "product", "amount")
  .filter(col("amount").gt(100))
  .show();
```

```ts
const p2024 = purchases.filter(col("year").eq(2024));
const p2025 = purchases.filter(col("year").eq(2025));

await p2024.union(p2025, { is_all: true, by_name: false })
  .limit(5)
  .show();
```

Inspect the client-side plan (before server optimization):

```ts
const df = purchases
  .select("user_id", "product", "amount")
  .filter(col("amount").gt(100))
  .orderBy(col("user_id").descNullsLast());

console.log(df.toClientASTJSON());        // JSON AST
console.log(df.toClientASTMermaid());     // Mermaid diagram
console.log(df.toSparkLogicalPlanJSON()); // Client logical plan
console.log(df.toProtoJSON());            // Spark Connect proto
```

Tip: write these strings to disk (`.mmd`, `.json`) and publish them as CI artifacts.
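A minimal sketch of that tip, assuming a plain Node.js script (the file names here are arbitrary):

```ts
import { writeFileSync } from "node:fs";

// Persist the plan dumps produced above so CI can publish them as artifacts.
// `df` is the DataFrame from the previous snippet.
writeFileSync("plan.ast.json", df.toClientASTJSON());
writeFileSync("plan.mmd", df.toClientASTMermaid());
```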
```ts
const session = SparkSession.builder()
  .enableTLS({
    keyStorePath: "./certs/keystore.p12",
    keyStorePassword: "password",
    trustStorePath: "./certs/cert.crt",
    trustStorePassword: "password",
  })
  .getOrCreate();
```

| Component | Supported / Tested |
|---|---|
| Spark Connect | 3.5.x |
| Scala ABI (JAR) | 2.12 (spark-connect_2.12) |
| Node.js | 18, 20, 22 |
| OS | Linux (CI); macOS (local) |
Planned: add CI jobs for macOS/Windows; update table as coverage expands.

| Feature | Supported |
|---|---|
| CSV Reading | ✅ |
| Filtering | ✅ |
| Projection / Alias | ✅ |
| Arrow decoding (`.show()`) | ✅ |
| Column expressions (`col`, `.gt`, `.and`, `.alias`, etc.) | ✅ |
| DSL abstraction (Tagless Final) | ✅ |
| Joins (configurable types) | ✅ |
| Aggregation (`groupBy().agg({...})`) | ✅ |
| Distinct (`distinct()`, `dropDuplicates(...)`) | ✅ |
| Sorting (`orderBy(...)`, `sort(...)`) | ✅ |
| Limit (`limit(n)`) | ✅ |
| Set operations (`UNION`, `INTERSECT`, `EXCEPT`) | ✅ |
| Column renaming (`withColumnRenamed(...)`) | ✅ |
| Type declarations (`.d.ts`) | ✅ |
| Modular compiler core (backend-agnostic) | ✅ |
| Tests (Unit + Integration + E2E) | ✅ |
| `withColumn(...)` | ✅ |
| `when(...).otherwise(...)` | ✅ |
| Window functions | ✅ |
| Null handling (`isNull`, `na.drop/fill/replace`) | ✅ |
| Parquet Reading | ✅ |
| JSON Reading | ✅ |
| DataFrameWriter (CSV/JSON/Parquet/ORC/Avro) | ✅ |
| Write `partitionBy`, `bucketBy`, `sortBy` | ✅ |
| `describe()`, `summary()` | ✅ |
| `unionByName(...)` | ✅ |
| Complex types + `explode`/`posexplode` | ✅ |
| JSON helpers (`from_json`, `to_json`) | ✅ |
| `repartition(...)` / `coalesce(...)` | ✅ |
| `explain(...)` (simple/extended/formatted) | ✅ |
| `SparkSession.builder.config(...)` | ✅ |
| Auth/TLS for Spark Connect | ✅ |
| `spark.sql(...)` | ✅ |
| Temp views (`createOrReplaceTempView`) | ✅ |
| Catalog (`read.table`, `saveAsTable`) | ✅ |
| Plan viz / AST dump | ✅ |
| `cache()` / `persist()` / `unpersist()` | |
| Join hints (`broadcast`, etc.) | ✅ |
| `sample(...)`, `randomSplit(...)` | ✅ |
| UDF (scalar) | |
| UDAF / Vectorized UDF (Arrow) | |
| Structured Streaming | ✅ |
| Watermark / triggers / output modes | ✅ |
| Lakehouse: Delta/Iceberg/Hudi | ❌ |
| JDBC read/write | ❌ |
| MLlib | ❌ |
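As a hedged illustration of a few matrix entries (aggregation, column renaming, temp views and SQL), reusing the `purchases` DataFrame from the quick start: the method names below come from the table, but the exact argument shapes are assumptions, not documented signatures.

```ts
// Sketch only: names come from the feature matrix; argument shapes are assumed
// to follow PySpark conventions and may differ in the actual API.
const perUser = purchases
  .groupBy("user_id")
  .agg({ amount: "sum" })                      // assumed { column: aggregate } shape
  .withColumnRenamed("sum(amount)", "total");  // assumed default aggregate column name
await perUser.show();

// Temp view + SQL (the matrix lists createOrReplaceTempView and spark.sql(...)).
purchases.createOrReplaceTempView("purchases");
const top = session.sql(
  "SELECT user_id, SUM(amount) AS total FROM purchases GROUP BY user_id"
);
await top.limit(5).show();
```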
License: Apache-2.0