[FLINK-28648][Kubernetes Operator] Allow session deletion to block on any running job #994
Conversation
@gyfora The build was failing due to missing documentation for the added configuration. I have added it in the new commit.
```java
var context = TestUtils.createContextWithReadyFlinkDeployment(kubernetesClient);
var resourceContext = getResourceContext(deployment, context);

// Use reflection to access the private getUnmanagedJobs method
```
We should not use reflection for this; we can make the method protected with the @VisibleForTesting annotation.
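A minimal, self-contained sketch of the suggestion, assuming the usual pattern: the annotation below is a hypothetical stand-in for Flink's `org.apache.flink.annotation.VisibleForTesting`, and the class/method are simplified placeholders.

```java
import java.util.Set;

// Hypothetical stand-in for org.apache.flink.annotation.VisibleForTesting.
@interface VisibleForTesting {}

public class SessionReconcilerSketch {
    // Widened from private to protected so tests can call it directly
    // instead of reaching in via reflection.
    @VisibleForTesting
    protected Set<String> getNonTerminalJobIds() {
        return Set.of(); // placeholder body
    }
}
```

With this, the test simply calls `getNonTerminalJobIds()` on the reconciler instance and the reflection scaffolding can be deleted.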
.../main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/SessionReconciler.java
```java
 * the Flink cluster but are not managed by FlinkSessionJob resources.
 */
private Set<JobID> getUnmanagedJobs(
        FlinkResourceContext<FlinkDeployment> ctx, Set<FlinkSessionJob> sessionJobs) {
```
I don't think we need to pass sessionJobs here. If the flag is enabled, any running job should simply block it. We can simplify this logic a lot.
We can simply replace this method with something like getNonTerminalJobIds() or a boolean anyNonTerminalJobs().
That would be enough for this feature.
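A self-contained sketch of what the suggested `getNonTerminalJobIds()` shape could look like. The `JobStatus` enum and `JobStatusMessage` record here are simplified stand-ins for the Flink client types, not the real API:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class NonTerminalJobsSketch {
    // Simplified stand-in for org.apache.flink.api.common.JobStatus.
    enum JobStatus {
        RUNNING, FINISHED, CANCELED, FAILED;

        boolean isGloballyTerminalState() {
            return this == FINISHED || this == CANCELED || this == FAILED;
        }
    }

    // Simplified stand-in for the job overview returned by the cluster.
    record JobStatusMessage(String jobId, JobStatus status) {}

    // Any job that is not in a terminal state blocks session cleanup,
    // regardless of whether a FlinkSessionJob resource manages it.
    static Set<String> getNonTerminalJobIds(List<JobStatusMessage> jobs) {
        return jobs.stream()
                .filter(j -> !j.status().isGloballyTerminalState())
                .map(JobStatusMessage::jobId)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        var jobs = List.of(
                new JobStatusMessage("a", JobStatus.RUNNING),
                new JobStatusMessage("b", JobStatus.FINISHED));
        System.out.println(getNonTerminalJobIds(jobs)); // prints [a]
    }
}
```

A `boolean anyNonTerminalJobs()` variant would just be `!getNonTerminalJobIds(jobs).isEmpty()`.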
Sure, will simplify this further. Thanks for the review
@gyfora I have pushed another commit to address the comments here. I kept getNonTerminalJobIds() so that the Event contains the list of non-terminated job IDs, for better observability for the user.
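A short sketch of why returning the IDs (rather than a boolean) helps observability: the IDs can be embedded directly in the warning event message. The helper name and message wording below are hypothetical, not taken from the PR:

```java
import java.util.Set;

public class CleanupMessageSketch {
    // Hypothetical helper: embed the non-terminal job IDs in the
    // CleanupFailed event message so the user can see exactly which
    // jobs are blocking session deletion.
    static String blockedMessage(Set<String> nonTerminalJobIds) {
        return "Session cluster deletion blocked, non-terminal jobs: "
                + nonTerminalJobIds;
    }

    public static void main(String[] args) {
        System.out.println(blockedMessage(Set.of("c3f0...")));
    }
}
```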
```java
LOG.info(
        "Starting unmanaged job detection for session cluster: {}",
        ctx.getResource().getMetadata().getName());
```
This should be at debug level; also, there is no need to include the resource name/info in the log message, it's in the MDC already.
```java
LOG.warn(error);
if (eventRecorder.triggerEvent(
        deployment,
        EventRecorder.Type.Warning,
        EventRecorder.Reason.CleanupFailed,
        EventRecorder.Component.Operator,
        error,
        ctx.getKubernetesClient())) {
    LOG.warn(error);
```
You are logging the error twice; you don't need to log it at all, as the event triggering already logs it.
…getNonTerminated + annotated @VisibleForTesting
@gyfora I ran these tests locally and they seem to pass. I am not sure what is causing the issue here. Can you have a look at it if possible?
gyfora left a comment:
Added 2 minor comments, otherwise looks good!
```java
if (eventRecorder.triggerEvent(
        deployment,
        EventRecorder.Type.Warning,
        EventRecorder.Reason.CleanupFailed,
        EventRecorder.Component.Operator,
        error,
        ctx.getKubernetesClient())) {
    LOG.warn(error);
}
```
We can remove the if branch and the logging; event triggering already creates logs, so I think we don't need both.
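A self-contained sketch of the suggested simplification: since triggering the event already produces a log line, the caller drops both the if branch and the duplicate `LOG.warn`. The recorder below is a simplified stand-in, not the operator's real `EventRecorder`:

```java
import java.util.logging.Logger;

public class EventLoggingSketch {
    private static final Logger LOG = Logger.getLogger("operator");

    // Simplified stand-in: triggering an event already logs the message,
    // so callers should not log it again.
    static boolean triggerEvent(String reason, String message) {
        LOG.warning(reason + ": " + message);
        return true; // true if a new event was created
    }

    static void reportCleanupFailed(String error) {
        // After the suggestion: no surrounding if branch, no extra LOG.warn(error).
        triggerEvent("CleanupFailed", error);
    }

    public static void main(String[] args) {
        reportCleanupFailed("Session cluster has running jobs");
    }
}
```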
```java
var conf = ctx.getDeployConfig(ctx.getResource().getSpec());
ctx.getFlinkService()
        .deleteClusterDeployment(
                deployment.getMetadata(), deployment.getStatus(), conf, true);
```
I think there is a bug in this existing logic: instead of getting the deploy config, we should just use ctx.getObserveConfig() here.
@gyfora Thank you for reviewing; I have made the changes. Please do trigger the workflows on your end. Also do let me know if we should also remove the

We can leave it as is for now :)
What is the purpose of the change
This pull request adds a configuration to allow session cluster cleanup to block while unmanaged jobs are present. That is, `FlinkDeployment` deletion is blocked in the presence of `FlinkSessionJob`s, as well as when jobs submitted through the CLI are in non-terminal states.

Brief change log

- `getUnmanagedJobs()` method to detect CLI-submitted jobs not managed by `FlinkSessionJob` resources
- `cleanupInternal()` to check for unmanaged jobs when blocking is enabled

Verifying this change

This change added tests and can be verified as follows:

- Added `testGetUnmanagedJobs` test validating `getUnmanagedJobs` for jobs that are not managed by a `SessionJob` and are in a non-terminal state.
- Manually verified the change by running a cluster with 2 JobManagers and submitting a `SessionJob` as well as CLI jobs.

Config: `session.block-on-unmanaged-jobs: true`

- Deleted `flinkDeployment` -> generates event `CleanupFailed` in `flinkDeployment` due to the presence of session jobs
- Deleted `flinkSessionJob` -> generates event `CleanupFailed` in `flinkDeployment` due to the presence of unmanaged jobs -> `flinkSessionJob` was deleted
- Cancelled the CLI-submitted job -> generates event `Cleanup` after `ReconcileInterval` -> the CLI job was cancelled and then the `flinkDeployment` was cleaned up

Config: `session.block-on-unmanaged-jobs: false`

- Deleted `flinkDeployment` -> generates event `CleanupFailed` in `flinkDeployment` due to the presence of session jobs
- Deleted `flinkSessionJob` -> generates event `Cleanup` in spite of running CLI jobs being present -> `SessionJob` is deleted, followed by `flinkDeployment`

Does this pull request potentially affect one of the following parts:

- `CustomResourceDescriptors`: no

Documentation