Deploying Apache Celeborn for Large-Scale Spark SQL Workloads #3191

sid-habu · 2025-04-01T00:19:33Z

We are exploring deploying Apache Celeborn as an external shuffle service for Spark on Kubernetes. Our Spark environment runs multiple concurrent, isolated Spark SQL jobs, each of which may shuffle anywhere from a few gigabytes to tens of terabytes of data. Given these workload characteristics, we are evaluating the best way to deploy Celeborn.

According to the Celeborn deployment guide, worker pods should be placed on nodes with local SSD volumes for optimal performance. A key question we have is whether to:

1. Deploy a single large Celeborn cluster with multiple workers that collectively provide sufficient disk space (e.g., a cluster with ~50-100 TB of local SSD storage), serving all concurrent jobs.

2. Deploy multiple smaller, job-specific Celeborn clusters, where each job or group of jobs gets a dedicated shuffle service instance.

A centralized Celeborn cluster could simplify resource management and improve disk utilization across workloads. However, it also introduces potential bottlenecks, a larger failure domain, and more cross-job interference, especially if a few large shuffle-heavy jobs dominate the cluster.

On the other hand, running multiple independent Celeborn clusters (e.g., per namespace or per workload category) could provide better isolation but may leave local SSD resources underutilized.
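To make the utilization side of this trade-off concrete, here is a rough back-of-the-envelope sizing sketch in Python. It is not based on any shuffle-service API; the workload groups, their footprints, the 20% headroom, and the assumption that at most one group hits its peak at a time are all hypothetical placeholders to be replaced with measured values.

```python
from dataclasses import dataclass


@dataclass
class WorkloadGroup:
    """Hypothetical description of one job group's shuffle footprint."""
    name: str
    peak_shuffle_tb: float      # largest on-disk shuffle footprint ever observed
    typical_shuffle_tb: float   # on-disk footprint during ordinary operation


def dedicated_capacity_tb(groups, headroom=1.2):
    """Disk needed when every group gets its own cluster.

    Each cluster must absorb its own group's peak, so the peaks add up.
    """
    return sum(g.peak_shuffle_tb for g in groups) * headroom


def shared_capacity_tb(groups, headroom=1.2):
    """Disk needed for one shared cluster.

    Assumption (not a guarantee): at most one group peaks at a time,
    while all other groups sit at their typical footprint.
    """
    if not groups:
        return 0.0
    typical_total = sum(g.typical_shuffle_tb for g in groups)
    # Extra space required by the single worst burst above typical load.
    worst_burst = max(g.peak_shuffle_tb - g.typical_shuffle_tb for g in groups)
    return (typical_total + worst_burst) * headroom


# Illustrative numbers only.
groups = [
    WorkloadGroup("etl", peak_shuffle_tb=40.0, typical_shuffle_tb=10.0),
    WorkloadGroup("adhoc", peak_shuffle_tb=10.0, typical_shuffle_tb=2.0),
    WorkloadGroup("reporting", peak_shuffle_tb=20.0, typical_shuffle_tb=5.0),
]

dedicated = dedicated_capacity_tb(groups)
shared = shared_capacity_tb(groups)
print(f"dedicated clusters: {dedicated:.1f} TB, shared cluster: {shared:.1f} TB")
```

With these made-up inputs the dedicated layout needs noticeably more SSD than the shared one, which is the underutilization cost of isolation. If several groups can peak simultaneously, the shared estimate grows toward the dedicated one and the utilization advantage shrinks.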

Has anyone deployed Celeborn at this scale? What trade-offs should we consider when deciding between a single large Celeborn cluster and multiple smaller clusters? Are there best practices for tuning Celeborn’s worker distribution and disk allocation?

Looking forward to insights from the community!

sid-habu · 2025-04-15T05:50:19Z

@SteNicholas Do you have any insights on the deployment question above?

1 reply

swapneshkumar-d11 Jun 30, 2025

Facing the same issue. Waiting for insights from the community.

ssharma · 2025-07-16T15:09:47Z

Appreciate any inputs from the community. The project looks very promising 🙏🏽
