
Commit 9cd61dd (1 parent: 0a83d7b)

Authored by razvansbernauer and NickLarsenNZ

fix: ensure a spark application can only be submitted once (#460)

* fix: ensure a spark application can only be submitted once
* update changelog
* add doc page for app status
* add callout
* fix typos
* Update rust/operator-binary/src/spark_k8s_controller.rs (four review suggestions)
* implement review feedback

Co-authored-by: Sebastian Bernauer <[email protected]>
Co-authored-by: Nick <[email protected]>

File tree

5 files changed: +56 −1 lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions

@@ -16,6 +16,7 @@ All notable changes to this project will be documented in this file.
 ### Fixed
 
 - Fix `envOverrides` for SparkApplication and SparkHistoryServer ([#451]).
+- Ensure SparkApplications can only create a single submit Job. Fix for #457 ([#460]).
 
 ### Removed
 
@@ -24,6 +25,7 @@ All notable changes to this project will be documented in this file.
 [#450]: https://github.com/stackabletech/spark-k8s-operator/pull/450
 [#451]: https://github.com/stackabletech/spark-k8s-operator/pull/451
 [#459]: https://github.com/stackabletech/spark-k8s-operator/pull/459
+[#460]: https://github.com/stackabletech/spark-k8s-operator/pull/460
 
 ## [24.7.0] - 2024-07-24

docs/modules/spark-k8s/pages/usage-guide/operations/applications.adoc (new file)

Lines changed: 7 additions & 0 deletions

@@ -0,0 +1,7 @@
+= Spark Applications
+
+Spark applications are submitted to the Spark Operator as SparkApplication resources. These resources are used to define the configuration of the Spark job, including the image to use, the main application file, and the number of executors to start.
+
+Upon creation, the application's status is set to `Unknown`. As the operator creates the necessary resources, the status of the application transitions through different phases that reflect the phase of the driver Pod. A successful application will eventually reach the `Succeeded` phase.
+
+NOTE: The operator will never reconcile an application once it has been created. To resubmit an application, a new SparkApplication resource must be created.
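The lifecycle described in the new page can be sketched with simplified stand-in types (a hypothetical model for illustration, not the operator's actual CRD structs): before a driver Pod exists the status is `Unknown`, and afterwards the application's phase mirrors the driver Pod's phase.

```rust
/// Hypothetical, simplified model of the SparkApplication status lifecycle;
/// not the operator's real types.
#[derive(Debug, PartialEq)]
enum Phase {
    Unknown,
    Pending,
    Running,
    Succeeded,
    Failed,
}

/// Once a driver Pod exists, the application's phase reflects the Pod's phase;
/// before that, the status starts out as `Unknown`.
fn application_phase(driver_pod_phase: Option<&str>) -> Phase {
    match driver_pod_phase {
        None => Phase::Unknown,
        Some("Pending") => Phase::Pending,
        Some("Running") => Phase::Running,
        Some("Succeeded") => Phase::Succeeded,
        Some(_) => Phase::Failed,
    }
}
```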

docs/modules/spark-k8s/partials/nav.adoc

Lines changed: 1 addition & 0 deletions

@@ -11,6 +11,7 @@
 ** xref:spark-k8s:usage-guide/history-server.adoc[]
 ** xref:spark-k8s:usage-guide/examples.adoc[]
 ** xref:spark-k8s:usage-guide/operations/index.adoc[]
+*** xref:spark-k8s:usage-guide/operations/applications.adoc[]
 *** xref:spark-k8s:usage-guide/operations/pod-placement.adoc[]
 *** xref:spark-k8s:usage-guide/operations/pod-disruptions.adoc[]
 *** xref:spark-k8s:usage-guide/operations/graceful-shutdown.adoc[]

rust/crd/src/lib.rs

Lines changed: 15 additions & 0 deletions

@@ -230,6 +230,21 @@ pub struct JobDependencies {
 }
 
 impl SparkApplication {
+    /// Returns if this [`SparkApplication`] has already created a Kubernetes Job doing the actual `spark-submit`.
+    ///
+    /// This is needed because Kubernetes will remove the succeeded Job after some time. When the spark-k8s-operator is
+    /// restarted it would re-create the Job, resulting in the Spark job running multiple times. This function assumes
+    /// that the [`SparkApplication`]'s status will always be set when the Kubernetes Job is created. It therefore
+    /// checks if the status is set to determine if the Job was already created in the past.
+    ///
+    /// See the bug report [#457](https://github.com/stackabletech/spark-k8s-operator/issues/457) for details.
+    pub fn k8s_job_has_been_created(&self) -> bool {
+        self.status
+            .as_ref()
+            .map(|s| !s.phase.is_empty())
+            .unwrap_or_default()
+    }
+
     pub fn submit_job_config_map_name(&self) -> String {
         format!("{app_name}-submit-job", app_name = self.name_any())
     }
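Stripped of the CRD machinery, the check added here boils down to "status present with a non-empty phase ⇒ the submit Job was created at some point". A minimal self-contained sketch with stand-in structs (the real definitions live in rust/crd/src/lib.rs):

```rust
// Stand-in types for illustration only; not the operator's real CRD structs.
struct SparkApplicationStatus {
    phase: String,
}

struct SparkApplication {
    status: Option<SparkApplicationStatus>,
}

impl SparkApplication {
    /// A non-empty phase proves the submit Job was created at some point,
    /// even if Kubernetes has since garbage-collected the finished Job.
    fn k8s_job_has_been_created(&self) -> bool {
        self.status
            .as_ref()
            .map(|s| !s.phase.is_empty())
            .unwrap_or_default()
    }
}
```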

rust/operator-binary/src/spark_k8s_controller.rs

Lines changed: 31 additions & 1 deletion

@@ -10,7 +10,7 @@ use product_config::writer::to_java_properties_string;
 use stackable_operator::time::Duration;
 use stackable_spark_k8s_crd::{
     constants::*, s3logdir::S3LogDir, tlscerts, RoleConfig, SparkApplication, SparkApplicationRole,
-    SparkContainer, SubmitConfig,
+    SparkApplicationStatus, SparkContainer, SubmitConfig,
 };
 
 use crate::product_logging::{self, resolve_vector_aggregator_address};
@@ -155,6 +155,12 @@ pub enum Error {
     CreateVolumes {
         source: stackable_spark_k8s_crd::Error,
     },
+
+    #[snafu(display("Failed to update status for application {name:?}"))]
+    ApplySparkApplicationStatus {
+        source: stackable_operator::client::Error,
+        name: String,
+    },
 }
 
 type Result<T, E = Error> = std::result::Result<T, E>;
@@ -170,6 +176,14 @@ pub async fn reconcile(spark_application: Arc<SparkApplication>, ctx: Arc<Ctx>)
 
     let client = &ctx.client;
 
+    if spark_application.k8s_job_has_been_created() {
+        tracing::info!(
+            spark_application = spark_application.name_any(),
+            "Skipped reconciling SparkApplication with non empty status"
+        );
+        return Ok(Action::await_change());
+    }
+
     let opt_s3conn = match spark_application.spec.s3connection.as_ref() {
         Some(s3bd) => s3bd
             .resolve(
@@ -346,6 +360,22 @@ pub async fn reconcile(spark_application: Arc<SparkApplication>, ctx: Arc<Ctx>)
         .await
         .context(ApplyApplicationSnafu)?;
 
+    // Fix for #457
+    // Update the status of the SparkApplication immediately after creating the Job
+    // to ensure the Job is not created again after being recycled by Kubernetes.
+    client
+        .apply_patch_status(
+            CONTROLLER_NAME,
+            spark_application.as_ref(),
+            &SparkApplicationStatus {
+                phase: "Unknown".to_string(),
+            },
+        )
+        .await
+        .with_context(|_| ApplySparkApplicationStatusSnafu {
+            name: spark_application.name_any(),
+        })?;
+
     Ok(Action::await_change())
 }
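The interplay of the two changes above (early return on a set status, plus patching the status immediately after the Job is applied) can be illustrated with a toy reconcile loop. `FakeCluster` is a made-up stand-in for the cluster state, not part of the operator:

```rust
/// Made-up stand-in for the cluster state a reconcile run touches;
/// for illustration only, not the operator's real code.
#[derive(Default)]
struct FakeCluster {
    submit_jobs_created: u32,
    status_phase: Option<String>,
}

impl FakeCluster {
    /// Mirrors the ordering introduced by this commit: bail out if the status
    /// is already set, otherwise create the Job and patch the status at once.
    fn reconcile(&mut self) {
        if self.status_phase.is_some() {
            // The Job existed at some point, even if Kubernetes has since
            // recycled the finished Job object. Do not create it again.
            return;
        }
        // Create the spark-submit Job ...
        self.submit_jobs_created += 1;
        // ... and immediately record that fact in the status.
        self.status_phase = Some("Unknown".to_string());
    }
}
```

No matter how often reconcile runs afterwards (operator restarts, requeues), the submit Job is created exactly once.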
