Fix: Prevent race conditions in Python SDK monitoring info collection.#35049
liferoad wants to merge 6 commits into apache:master
Conversation
This change addresses Apache Beam GitHub Issue apache#24776, where race conditions could occur during the collection of monitoring information in the Python SDK Harness, leading to errors such as:

- `SystemError: returned NULL without setting an error`
- `RuntimeError: dictionary changed size during iteration`
- `AttributeError: 'bytes' object has no attribute 'payload'`
- `ValueError: non-UTF-8 strings`

The primary cause was concurrent access to metric data structures (specifically `MetricsContainer` and its underlying `MetricCell`s) by the DoFn execution thread (updating metrics) and the thread responsible for reporting bundle progress.

The fix introduces the following:

1. A `threading.Lock` is added to the `MetricsContainer` class. This lock is acquired before any access or modification of the internal dictionaries that store metric cells (`self.counters`, `self.distributions`, `self.gauges`). This protection is applied during metric cell retrieval/creation (`get_metric_cell`) and when all monitoring information is collected for reporting (`to_runner_api_monitoring_infos`).
2. The `MetricsContainer`'s lock is passed to individual `MetricCell` instances (`CounterCell`, `DistributionCell`, `GaugeCell`) upon their creation.
3. Metric update methods within `CounterCell`, `DistributionCell`, and `GaugeCell` (e.g., `update()`, `set()`, `add_data()`) now acquire this container-level lock before modifying their internal state. This ensures that updates are atomic with respect to the collection process in `MetricsContainer.to_runner_api_monitoring_infos`.

These changes ensure that metric data is read and updated in a thread-safe manner, preventing the previously observed errors caused by concurrent access and modification of shared metric state.
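The locking pattern described above can be sketched as follows. This is a simplified illustration, not Beam's actual code: `CounterCell`, `MetricsContainer`, `get_counter`, and `to_monitoring_infos` are stand-ins for the real classes and methods named in the description.

```python
import threading

class CounterCell:
  """Illustrative stand-in for Beam's CounterCell, not the real class."""
  def __init__(self, lock):
    self._lock = lock  # the container-level lock, shared by all cells
    self.value = 0

  def update(self, n):
    # Updates take the shared lock, so they are atomic with respect to
    # the progress-reporting thread collecting monitoring infos.
    with self._lock:
      self.value += n

class MetricsContainer:
  """Illustrative stand-in for Beam's MetricsContainer."""
  def __init__(self):
    self._lock = threading.Lock()
    self.counters = {}

  def get_counter(self, name):
    # Guard cell retrieval/creation; without this, concurrent inserts are
    # what can raise "dictionary changed size during iteration" elsewhere.
    with self._lock:
      if name not in self.counters:
        self.counters[name] = CounterCell(self._lock)
      return self.counters[name]

  def to_monitoring_infos(self):
    # Snapshot every cell under the same lock the update paths use.
    with self._lock:
      return {name: cell.value for name, cell in self.counters.items()}
```

Because the cells hold the container's own lock rather than private ones, an in-flight `update()` and a reporting snapshot can never interleave, which is the atomicity property point 3 above describes.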
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment
Assigning reviewers: R: @jrmccluskey for label python. Note: If you would like to opt out of this review, comment

Available commands:
The PR bot will only process comments in the main thread (not review comments).

stop reviewer notifications

Stopping reviewer notifications for this pull request: requested by reviewer. If you'd like to restart, comment
```python
      self.value += ivalue
  else:
    with self._lock:
      # If a container lock is provided, use it. Otherwise, use cell's own lock.
```
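The fallback that the comment above describes — use the container's lock if one was provided, otherwise a cell-local lock — might look like this minimal sketch. The names are illustrative stand-ins, not Beam's actual implementation:

```python
import threading

class GaugeCell:
  """Hypothetical sketch of the container-lock-or-own-lock fallback."""
  def __init__(self, container_lock=None):
    # Cells created by a MetricsContainer share its lock; standalone
    # cells fall back to a private lock so updates are still guarded.
    self._lock = container_lock if container_lock is not None else threading.Lock()
    self.value = None

  def set(self, value):
    with self._lock:
      self.value = value
```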
Are cells ever not managed by a metrics container?
https://github.com/apache/beam/pull/35049/files#diff-1293713097c11f7136aa69a2affccdd624a958dad8a10c68a4a31f39fa6c303dR281 makes sure the cells use the lock from the metrics container, from what I can tell. I think https://github.com/apache/beam/pull/35049/files#diff-1293713097c11f7136aa69a2affccdd624a958dad8a10c68a4a31f39fa6c303dR342 is probably the more interesting part to review.
I am not super confident about this PR in terms of testing this and the potential downside on the performance.
robertwb left a comment
Is there any concern about the move to a (significantly more) global lock causing performance regressions? (Are metrics containers at least per worker thread?)
I do have concerns about a global lock, and even about the lock used by this PR.
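One way the per-thread question could be explored — shown purely as an assumption, not what this PR or Beam actually does — is to keep a container per worker thread via `threading.local`, so hot-path updates never contend on a shared lock and only the registry of containers is synchronized:

```python
import threading

class PerThreadMetrics:
  """Hypothetical sketch: one metrics dict per worker thread, merged on read."""
  def __init__(self):
    self._local = threading.local()
    self._all = []  # every thread's container, for the reporting thread
    self._registry_lock = threading.Lock()

  def _container(self):
    c = getattr(self._local, "container", None)
    if c is None:
      c = {}
      self._local.container = c
      with self._registry_lock:  # taken only once per thread
        self._all.append(c)
    return c

  def inc(self, name, n=1):
    c = self._container()
    c[name] = c.get(name, 0) + n  # no lock on the hot path

  def snapshot(self):
    # The reporting thread merges all per-thread containers. Note: iterating
    # a dict another thread may mutate is only safe-ish under CPython's GIL;
    # a production version would need a stronger handoff than this sketch.
    with self._registry_lock:
      merged = {}
      for c in self._all:
        for k, v in list(c.items()):
          merged[k] = merged.get(k, 0) + v
      return merged
```

The trade-off is the usual one: uncontended writes in exchange for a more expensive, merge-style read on the reporting path.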
Closing this for now until we have a good way to test it.
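On the "good way to test it" point: one possible approach, sketched here with invented names (none of this is from Beam's test suite), is a stress test that updates metrics from several writer threads while a reader thread repeatedly snapshots them, recording any exception a race produces instead of crashing a thread:

```python
import threading

class LockedContainer:
  """Minimal thread-safe container, used only to exercise the stress test."""
  def __init__(self):
    self._lock = threading.Lock()
    self._data = {}

  def inc(self, name):
    with self._lock:
      self._data[name] = self._data.get(name, 0) + 1

  def snapshot(self):
    with self._lock:
      return dict(self._data)

def stress(container_factory, snapshots=200, writers=4):
  """Hammer a container from writer threads while a reader snapshots it.

  Races such as "RuntimeError: dictionary changed size during iteration"
  surface as entries in the returned error list.
  """
  container = container_factory()
  errors = []
  stop = threading.Event()

  def write():
    i = 0
    while not stop.is_set():
      try:
        container.inc("metric_%d" % (i % 50))
      except Exception as e:  # record any race instead of dying silently
        errors.append(e)
      i += 1

  def read():
    for _ in range(snapshots):
      try:
        container.snapshot()
      except Exception as e:
        errors.append(e)
    stop.set()

  threads = [threading.Thread(target=write) for _ in range(writers)]
  threads.append(threading.Thread(target=read))
  for t in threads:
    t.start()
  for t in threads:
    t.join()
  return errors
```

A caveat worth hedging: races are timing-dependent, so a clean run does not prove correctness — such a test can demonstrate a failure on the unpatched code far more reliably than it can certify the fix.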
Testing Jules
Fixes #24776
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
- Update `CHANGES.md` with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.