Implement pluggable Lineage in Java SDK #36781
    ? (JmsTextMessage message) -> {
        if (message == null) {
          return null;
        }
irrelevant change to fix flaky tests
      assertTrue(
          String.format("Too many unacknowledged messages: %d", unackRecords),
    -     unackRecords < OPTIONS.getNumberOfRecords() * 0.003);
    +     unackRecords < OPTIONS.getNumberOfRecords() * 0.005);
irrelevant change to fix flaky tests
Assigning reviewers:
R: @kennknowles for label java.
The PR bot will only process comments in the main thread (not review comments).
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## master #36781 +/- ##
=========================================
Coverage 55.15% 55.15%
Complexity 1676 1676
=========================================
Files 1067 1067
Lines 167149 167149
Branches 1208 1208
=========================================
Hits 92189 92189
Misses 72779 72779
Partials 2181 2181
Addresses #36790: "[Feature Request]: Make lineage tracking pluggable"
Key changes:
- org.apache.beam.sdk.lineage.LineageRegistrar as an entry point in the plugin system.
- org.apache.beam.sdk.metrics.MetricsLineage, which contains the extracted original metric-based implementation.
- org.apache.beam.sdk.metrics.Lineage:
- org.apache.beam.sdk.io.FileSystems:

Plugin Implementation

The Lineage class acts as an abstract base for pluggable implementations, defining these instance members:

An alternative approach would be to create a separate interface (e.g. interface LineageReporter) for plugin implementations, but that would require breaking changes to the main public API that is already in use:

Therefore, using Lineage as a base class is the best solution to maintain backward compatibility.

Initialization

Originally, Lineage instances were created in static field initialization without needing any external parameters. Plugins, however, rely on PipelineOptions by design, which is why Lineage.setDefaultPipelineOptions(options) must be called externally.

Lineage.setDefaultPipelineOptions(options) is called from FileSystems.setDefaultPipelineOptions() (at FileSystems.java:581), following the same pattern used by Metrics.setDefaultPipelineOptions() (line 580).

Rationale: FileSystems.setDefaultPipelineOptions() is called at 48+ locations across the codebase, covering all execution scenarios: pipeline construction, worker startup across all runners (Flink, Spark, Dataflow, etc.), and deserialization points. This single-line addition ensures Lineage is initialized everywhere without modifying 48+ call sites.

Known limitations: While FileSystems.setDefaultPipelineOptions() has known issues (see #18430 regarding race conditions and initialization semantics), this PR follows the established pattern rather than introducing a divergent approach. Any broader architectural improvements to subsystem initialization would naturally address Lineage initialization as part of that larger effort.

Thread Safety

Lineage.setDefaultPipelineOptions is expected to be called concurrently because it inherits the same execution context as FileSystems.setDefaultPipelineOptions and Metrics.setDefaultPipelineOptions. The implementation uses three AtomicReference<> fields:

- AtomicReference<KV<Long, Integer>> to track PipelineOptions identity via optionsId and revision
- AtomicReference<Lineage> for SOURCES and SINKS instances

The method follows the exact same concurrent resolution pattern as FileSystems, using an infinite loop with compareAndSet to handle race conditions during initialization.

Sample Custom Plugin Implementation - OpenLineage

OpenLineage integration is for demonstration purposes only:
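
Returning to the Thread Safety section: the compareAndSet loop it describes can be illustrated with a minimal, self-contained sketch. This is not the code from this PR. The class name LineageInitSketch, the OptionsKey pair (standing in for KV<Long, Integer>), the String values (standing in for Lineage instances), and the factory parameter (standing in for plugin resolution) are all hypothetical, and the sketch uses instance fields rather than static ones to keep it easy to exercise.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

/** Illustrative only: every name here is a hypothetical stand-in, not Beam's actual code. */
final class LineageInitSketch {

  /** Stand-in for KV<Long, Integer>: the (optionsId, revision) identity of PipelineOptions. */
  static final class OptionsKey {
    final long optionsId;
    final int revision;

    OptionsKey(long optionsId, int revision) {
      this.optionsId = optionsId;
      this.revision = revision;
    }
  }

  // Three atomic fields, mirroring the layout described in the PR text.
  private final AtomicReference<OptionsKey> key = new AtomicReference<>();
  private final AtomicReference<String> sources = new AtomicReference<>();
  private final AtomicReference<String> sinks = new AtomicReference<>();

  /**
   * Re-creates the SOURCES/SINKS stand-ins only when the options identity changed or the
   * revision increased. Returns true if this call performed the (re)initialization.
   */
  boolean setDefaultOptions(long optionsId, int revision, Function<String, String> factory) {
    OptionsKey next = new OptionsKey(optionsId, revision);
    while (true) {
      OptionsKey current = key.get();
      // Same options and nothing newer: another caller already initialized, so do nothing.
      if (current != null && current.optionsId == optionsId && current.revision >= revision) {
        return false;
      }
      // Only the thread whose compareAndSet succeeds rebuilds the instances.
      if (key.compareAndSet(current, next)) {
        sources.set(factory.apply("sources"));
        sinks.set(factory.apply("sinks"));
        return true;
      }
      // Lost the race: loop and re-read the key written by the winning thread.
    }
  }

  String sources() {
    return sources.get();
  }

  String sinks() {
    return sinks.get();
  }
}
```

In this sketch a concurrent reader can briefly observe the new key before the instances are swapped in; whether that window matters for the real implementation depends on details not shown here.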