@SteNicholas Do you have any insights on the deployment-related question above?
Appreciate any input from the community. The project looks very promising 🙏🏽
We are exploring the deployment of Apache Celeborn as an external shuffle service for Spark on Kubernetes. Our Spark environment runs multiple concurrent, isolated Spark SQL jobs, each potentially shuffling anywhere from a few gigabytes to tens of terabytes of data. Given these workload characteristics, we are evaluating the best approach for deploying Celeborn.
According to the Celeborn deployment guide, worker pods should be placed on nodes with local SSD volumes for optimal performance. Our key question is whether to:
1. Deploy a single large Celeborn cluster with multiple workers that collectively provide sufficient disk space (e.g., a cluster with ~50-100 TB of local SSD storage), serving all concurrent jobs.
2. Deploy multiple smaller, job-specific Celeborn clusters, where each job or group of jobs gets a dedicated shuffle service instance.
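To make the two options concrete, here is a minimal sketch of how we imagine routing a Spark job to one cluster or another. The property keys are the ones we understand from the Celeborn documentation for recent client versions and should be verified against the version in use. The endpoint hostnames, namespaces, and helper function names are hypothetical placeholders.

```python
from typing import Dict, List

# Hypothetical in-cluster master endpoints; service names and namespaces are placeholders.
SHARED_CLUSTER = "celeborn-master-0.celeborn-master-svc.shuffle-shared:9097"
ETL_CLUSTER = "celeborn-master-0.celeborn-master-svc.shuffle-etl:9097"


def celeborn_spark_conf(master_endpoints: str) -> Dict[str, str]:
    """Spark properties that send a job's shuffle data to one Celeborn cluster.

    Assumption: key names follow recent Celeborn client releases.
    """
    return {
        "spark.shuffle.manager": "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
        "spark.celeborn.master.endpoints": master_endpoints,
        # Spark's built-in external shuffle service is not used with Celeborn.
        "spark.shuffle.service.enabled": "false",
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten the properties into spark-submit style --conf arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args
```

Under this sketch, moving a job between the shared cluster and a dedicated one only changes the endpoint string, e.g. `to_submit_args(celeborn_spark_conf(ETL_CLUSTER))`, so the choice is mostly an operational one rather than a change to the jobs themselves.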
A centralized Celeborn cluster could simplify resource management and improve disk utilization across workloads. However, it also becomes a single shared failure domain and a potential bottleneck, and it increases cross-job interference, especially if a few large shuffle-heavy jobs dominate the cluster.
On the other hand, running multiple independent Celeborn clusters (e.g., per namespace or per workload category) could provide better isolation but may leave local SSD capacity underutilized.
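The utilization side of this trade-off can be sketched with a rough sizing model. A shared cluster has to cover the peak of the combined footprint, while dedicated clusters each have to cover their own job's peak, whether or not those peaks coincide. The function names, the 1.25 headroom factor, and the sample numbers are illustrative assumptions, not measurements from our environment, and the model ignores replication and throughput limits.

```python
from typing import Dict, List


def shared_capacity_tb(usage: Dict[str, List[float]], headroom: float = 1.25) -> float:
    """Disk needed by one shared cluster: peak of the summed footprint.

    `usage` maps a job name to its shuffle footprint (TB) per time interval;
    all series are assumed to cover the same intervals.
    """
    totals = [sum(step) for step in zip(*usage.values())]
    return max(totals) * headroom


def dedicated_capacity_tb(usage: Dict[str, List[float]], headroom: float = 1.25) -> float:
    """Disk needed by per-job clusters: sum of each job's own peak."""
    return sum(max(series) for series in usage.values()) * headroom


if __name__ == "__main__":
    # Hypothetical footprints (TB) for three jobs over three intervals.
    sample = {
        "job_a": [40.0, 5.0, 0.0],
        "job_b": [2.0, 30.0, 1.0],
        "job_c": [0.5, 0.5, 20.0],
    }
    print(shared_capacity_tb(sample))     # 53.125
    print(dedicated_capacity_tb(sample))  # 112.5
```

In this made-up example the dedicated layout needs roughly twice the SSD capacity because the jobs peak at different times. If the large jobs peak together, the two numbers converge and the isolation benefit of separate clusters costs much less.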
Has anyone deployed Celeborn at this scale? What trade-offs should we consider when deciding between a single large Celeborn cluster and multiple smaller clusters? Are there best practices for tuning Celeborn’s worker distribution and disk allocation?
Looking forward to insights from the community!