Skip to content

Conversation

@aIbrahiim
Copy link
Contributor

@aIbrahiim aIbrahiim commented Nov 5, 2025

Fixes: #32949
Successful Run: https://github.com/aIbrahiim/beam/actions/runs/19104037201/job/54582948548

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @aIbrahiim, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a known issue where Flink jobs running on Java 8 in attached mode could encounter an IllegalThreadStateException due to the premature destruction of ThreadGroup instances during Flink's internal cleanup. The solution introduces a targeted workaround that temporarily extends the lifecycle of the relevant ThreadGroup for a brief period, ensuring Flink's cleanup processes can complete without interruption and thereby stabilizing job execution on Java 8 environments.

Highlights

  • Fix for IllegalThreadStateException on Java 8: This pull request introduces a specific fix to prevent an IllegalThreadStateException that can occur when running Flink jobs on Java 8, particularly in attached execution mode.
  • Thread Group Protection Mechanism: A new mechanism has been implemented to temporarily protect the current ThreadGroup from premature destruction. This is achieved by adding it to a static protectedThreadGroups set and then releasing it after a short delay (2 seconds) via a dedicated cleanup thread.
  • Conditional Application: The ensureFlinkCleanupComplete method, which encapsulates this protection logic, is only invoked when the Java runtime version is detected as Java 8 and the Flink job is executed in attached mode.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@Amar3tto Amar3tto requested a review from damccorm November 5, 2025 15:18
@github-actions
Copy link
Contributor

github-actions bot commented Nov 5, 2025

Assigning reviewers:

R: @damccorm added as fallback since no labels match configuration

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

}

/** Prevents ThreadGroup destruction while Flink cleanup threads are still running. */
private void ensureFlinkCleanupComplete(Object executionEnv) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is an executionEnv but not used. What is the consideration here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh i added for any future checks like stream or batching, or local env and so on.

private void ensureFlinkCleanupComplete(Object executionEnv) {
String javaVersion = System.getProperty("java.version");
if (javaVersion == null || !javaVersion.startsWith("1.8")) {
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Busy wait is generally not desired but for band-aid fix I think it's fine. We should narrow the scenario that introduced 2 seconds latency

  • this only affects in-memory flink clusters, and also only affects batch jobs (':runners:flink:1.19:validatesRunnerBatch'.). Can we call wait only for batch?

  • Can we check whether FlinkPipelineOptions.getFlinkMaster is a "org.apache.flink.api.java.ExecutionEnvironment.LocalEnvironment" and only do the busy wait for local environment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I agree we can narrow the scope to wait for batch jobs and detect the environment and apply it for local env.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry realize this parameter is actually useful. We can do

if (executionEnv instanceof org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.LocalStreamEnvironment || executionEnv instanceof org.apache.flink.api.java.ExecutionEnvironment.LocalEnvironment)

this is more reliable than check getFlinkMaster below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the parameter

@aIbrahiim aIbrahiim requested a review from Abacn November 5, 2025 18:36
@Amar3tto
Copy link
Collaborator

Amar3tto commented Nov 5, 2025

Is it okay that PreCommit Java PVR Flink Batch previously took ~2 min, and now it takes >1 hour?
CC: @Abacn

@Abacn
Copy link
Contributor

Abacn commented Nov 5, 2025

I don't see a difference here, the last run on this PR was 31 min: https://github.com/apache/beam/actions/runs/19112042171, a (non-cache) run on master is 30 min https://github.com/apache/beam/actions/runs/19089630209

The 1 min run are due to no change in code and skipped on gradle cache.

@Abacn
Copy link
Contributor

Abacn commented Nov 5, 2025

I see, currently validatesPortableRunnerBatch is timing out.

@aIbrahiim
Copy link
Contributor Author

I see, currently validatesPortableRunnerBatch is timing out.

I guess the timeout related to some other issue as the validatesPortableRunnerBatch doesnt run on direct Flink runner

@aIbrahiim
Copy link
Contributor Author

aIbrahiim commented Nov 5, 2025

got a successful run for PreCommit Java PVR Flink Batch on my runner in 37 minutes https://github.com/aIbrahiim/beam/actions/runs/19117606320
cc: @Abacn @Amar3tto

@Abacn Abacn merged commit 1d33741 into apache:master Nov 6, 2025
21 of 22 checks passed
@aIbrahiim aIbrahiim deleted the 32949-PostCommit-Java-ValidatesRunner-Flink-Java8 branch November 17, 2025 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The PostCommit Java ValidatesRunner Flink Java8 job is flaky

3 participants