Fix reproducibility by ternaus · Pull Request #2561 · albumentations-team/albumentations

ternaus · 2025-06-17T01:31:53Z

Summary by Sourcery

Enable worker-aware random seeding in Compose transforms to ensure reproducible augmentations across single- and multi-process DataLoader contexts.

New Features:

Add _get_effective_seed to compute per-worker seed based on torch.initial_seed
Override Compose.set_random_seed and __setstate__ to incorporate worker context into seeding
Introduce _check_worker_seed to automatically synchronize RNG state in __call__ when inside DataLoader workers
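
The per-worker seed computation described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `combine_seeds`, `get_effective_seed`, and `SEED_MODULUS` are hypothetical names, and wrapping at 2**32 is an assumption made here to keep the result in a 32-bit range. The only torch calls used are `torch.utils.data.get_worker_info()` and `torch.initial_seed()`.

```python
from __future__ import annotations

try:  # torch is optional; without it the base seed is used unchanged
    import torch
    import torch.utils.data
except ImportError:
    torch = None

SEED_MODULUS = 2**32  # assumption for this sketch: keep seeds in 32-bit range


def combine_seeds(base_seed: int, worker_seed: int) -> int:
    """Mix the user's seed with a per-worker seed, wrapping on overflow."""
    return (base_seed + worker_seed) % SEED_MODULUS


def get_effective_seed(base_seed: int | None) -> int | None:
    """Hypothetical stand-in for _get_effective_seed (not the library code)."""
    if base_seed is None or torch is None:
        return base_seed
    if torch.utils.data.get_worker_info() is None:
        return base_seed  # main process: nothing to adjust
    # Inside a DataLoader worker, torch.initial_seed() differs per worker.
    return combine_seeds(base_seed, torch.initial_seed())
```

In the main process the helper returns the base seed untouched, so single-process runs stay reproducible; only code running inside a DataLoader worker sees an adjusted value.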

Bug Fixes:

Fix seed propagation to nested transforms for consistent reproducibility
Correct parameter passing in OneOf and ChannelShuffle super().__init__ calls

Enhancements:

Store original seed and include it in to_dict serialization
Propagate RNG state updates to all transforms via set_random_state or set_random_seed
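
To make the two enhancements concrete, here is a small toy model of a container that remembers the original seed for serialization and hands one shared generator to every nested child. `Leaf` and `Container` are hypothetical stand-ins, not albumentations classes, and only the standard library `random.Random` is used instead of the NumPy generator mentioned in the class diagram.

```python
from __future__ import annotations

import random


class Leaf:
    """Toy transform that only holds a random generator."""

    def __init__(self) -> None:
        self.py_random = random.Random()

    def set_random_state(self, py_random: random.Random) -> None:
        self.py_random = py_random


class Container:
    """Toy container: stores the original seed and shares one generator."""

    def __init__(self, children: list, seed: int | None = None) -> None:
        self.children = children
        self.set_random_seed(seed)

    def set_random_seed(self, seed: int | None) -> None:
        self.seed = seed  # original seed, kept for serialization
        self.set_random_state(random.Random(seed))

    def set_random_state(self, py_random: random.Random) -> None:
        self.py_random = py_random
        for child in self.children:  # reaches nested containers as well
            child.set_random_state(py_random)

    def to_dict(self) -> dict:
        return {
            "seed": self.seed,
            "children": [
                c.to_dict() if isinstance(c, Container) else type(c).__name__
                for c in self.children
            ],
        }


inner_leaf = Leaf()
outer_leaf = Leaf()
pipeline = Container([outer_leaf, Container([inner_leaf])], seed=137)
```

After construction, both leaves draw from the same generator as the outer container, and `pipeline.to_dict()` reports the seed that was originally passed in.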

Tests:

Add comprehensive tests for worker-aware seeding in single- and multi-worker DataLoader scenarios
Update serialization tests to verify seed inclusion
Include deterministic behavior and diversity tests across epochs and workers

sourcery-ai · 2025-06-17T01:31:57Z

Reviewer's Guide

This PR adds worker-aware seeding to Compose by computing an effective seed per DataLoader worker, synchronizing RNG state on each call, persisting the original seed for serialization, correcting API consistency in transform constructors and calls, and covering all cases with new comprehensive tests.

Sequence diagram for worker-aware seeding in Compose call

sequenceDiagram
    participant User
    participant DataLoaderWorker
    participant Compose
    participant Transform
    User->>DataLoaderWorker: Request batch
    DataLoaderWorker->>Compose: __call__(data)
    Compose->>Compose: _check_worker_seed()
    Compose->>Compose: Update RNG state if worker seed changed
    Compose->>Transform: set_random_state()/set_random_seed(effective_seed)
    Compose->>Compose: Apply transforms
    Compose-->>DataLoaderWorker: Return augmented data
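
The check-then-resync flow in the diagram can be approximated with a toy class. This is a sketch under stated assumptions: `SeededPipeline` is hypothetical, the worker seed is injected through a callable so the example runs without torch, and mixing the seeds by addition modulo 2**32 is an illustrative choice rather than a confirmed detail of the PR.

```python
from __future__ import annotations

import random
from typing import Callable


class SeededPipeline:
    """Toy stand-in for Compose; not the albumentations class."""

    def __init__(self, seed: int | None, current_worker_seed: Callable[[], int | None]) -> None:
        self.seed = seed
        self._current_worker_seed = current_worker_seed  # e.g. wraps torch.initial_seed
        self._last_torch_seed: int | None = None
        self.py_random = random.Random(seed)

    def _check_worker_seed(self) -> None:
        if self.seed is None:
            return
        worker_seed = self._current_worker_seed()
        if worker_seed is None or worker_seed == self._last_torch_seed:
            return  # not in a worker, or already synchronized
        self._last_torch_seed = worker_seed
        self.py_random = random.Random((self.seed + worker_seed) % 2**32)

    def __call__(self) -> float:
        self._check_worker_seed()  # runs before every application
        return self.py_random.random()


worker_a = SeededPipeline(137, lambda: 1000)
worker_b = SeededPipeline(137, lambda: 1001)
draws_a = [worker_a(), worker_a()]
draws_b = [worker_b(), worker_b()]
```

Because the generator is rebuilt only when the observed worker seed changes, repeated calls within one worker continue a single random stream instead of restarting it on every call.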

Class diagram for updated Compose and BaseCompose seeding logic

classDiagram
    class BaseCompose {
        +int|None seed
        +np.random.Generator random_generator
        +random.Random py_random
        +set_random_seed(seed: int|None)
        +_get_effective_seed(base_seed: int|None) int|None
    }
    class Compose {
        +int|None _last_torch_seed
        +_check_worker_seed()
        +__setstate__(state: dict)
        +set_random_seed(seed: int|None)
        +__call__(*args, force_apply: bool = False, **data) dict
    }
    BaseCompose <|-- Compose

Class diagram for transform constructor and call API consistency

classDiagram
    class OneOf {
        +__init__(transforms, p)
    }
    class ChannelDropout {
        +__init__(transforms, p, channels)
        +__call__(*args, force_apply: bool = False, **data) dict
    }
    OneOf --|> BaseCompose
    ChannelDropout --|> BaseCompose

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Implement worker-aware seeding and RNG synchronization in Compose | Added `_get_effective_seed` to adjust `base_seed` using `torch.initial_seed` in worker contexts; introduced `_check_worker_seed` to detect worker processes and reinitialize RNG state before each call; overrode `set_random_seed` and `__setstate__` to recalculate and apply effective seeds upon seeding and unpickling; updated `Compose.__init__` to compute and pass `effective_seed` instead of the raw seed, and `Compose.__call__` to invoke `_check_worker_seed` | `albumentations/core/composition.py` |
| Persist and serialize original seed in Compose | Stored the original seed in Compose instances (`self.seed`) during seeding routines; added a 'seed' field to `to_dict_private` for consistent pickling and inclusion in `to_dict` output | `albumentations/core/composition.py` |
| Fix `super().__init__` signatures and transform invocation details | Corrected `super().__init__` calls to use named parameters (`transforms=`, `p=`) in several transform subclasses; changed the channel-based transform call to use `t(**sub_data)["image"]` and updated the `_track_transform_params` invocation | `albumentations/core/composition.py` |
| Add comprehensive tests for worker-aware seeding and deterministic behavior | Created `tests/test_per_worker_seed.py` with unit tests across no-torch, multi-worker, epoch diversity, and seed overflow scenarios; updated `tests/test_transforms.py` and `tests/test_serialization.py` to include the seed field in the dict and cover the new FrequencyMasking transform; covered deterministic single-process behavior, multiple Compose instances, and per-worker sequence diversity | `tests/test_per_worker_seed.py` `tests/test_transforms.py` `tests/test_serialization.py` |

sourcery-ai

Hey @ternaus - I've reviewed your changes - here's some feedback:

You’re importing torch inside both _get_effective_seed and _check_worker_seed on every call, which can be expensive—consider doing the import (and worker_info retrieval) once at module load or caching the availability check.
There’s a lot of duplicated seed‐initialization logic between __init__, set_random_seed, and __setstate__; extracting a single helper to centralize effective seed generation would reduce maintenance overhead.
In the channel‐by‐channel loop you changed, _track_transform_params is now called with data instead of sub_data, which likely breaks correct param tracking for those sub-transforms.
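
One way to act on the first two points is to cache the torch availability check and route every seeding path through a single helper. The sketch below is illustrative only: `torch_available` and `current_worker_seed` are hypothetical names, and whether the repeated import is actually a measurable cost in the library has not been benchmarked here.

```python
from __future__ import annotations

import importlib.util
from functools import lru_cache


@lru_cache(maxsize=1)
def torch_available() -> bool:
    """Look for torch once; later calls reuse the cached answer."""
    return importlib.util.find_spec("torch") is not None


def current_worker_seed() -> int | None:
    """Single place that answers: which worker seed applies right now?"""
    if not torch_available():
        return None
    import torch  # already-imported modules are served from sys.modules
    import torch.utils.data

    if torch.utils.data.get_worker_info() is None:
        return None  # main process
    return torch.initial_seed() % 2**32


first = current_worker_seed()
second = current_worker_seed()
```

With a helper like this, `__init__`, `set_random_seed`, and `__setstate__` could all call the same function instead of repeating the try/except import logic.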

## Individual Comments

### Comment 1
<location> `albumentations/core/composition.py:903` </location>
<code_context>
+        except (ImportError, AttributeError):
+            pass
+
+    def __setstate__(self, state: dict[str, Any]) -> None:
+        """Set state from unpickling and handle worker seed."""
+        self.__dict__.update(state)
</code_context>

<issue_to_address>
Reset _last_torch_seed in __setstate__

Reset `_last_torch_seed` to `None` before calling `set_random_seed` to guarantee the worker-seed sync runs after unpickling, even if the unpickled value matches the current seed.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    def __setstate__(self, state: dict[str, Any]) -> None:
        """Set state from unpickling and handle worker seed."""
        self.__dict__.update(state)
        # If we have a seed, recalculate effective seed in worker context
        if hasattr(self, "seed") and self.seed is not None:
            # Recalculate effective seed in worker context
            self.set_random_seed(self.seed)
=======
    def __setstate__(self, state: dict[str, Any]) -> None:
        """Set state from unpickling and handle worker seed."""
        self.__dict__.update(state)
        # If we have a seed, recalculate effective seed in worker context
        if hasattr(self, "seed") and self.seed is not None:
            # Reset _last_torch_seed to ensure worker-seed sync runs after unpickling
            self._last_torch_seed = None
            # Recalculate effective seed in worker context
            self.set_random_seed(self.seed)
>>>>>>> REPLACE

</suggested_fix>
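
The effect of the suggested reset can be seen in a toy round trip. `PicklablePipeline` is a hypothetical stand-in for Compose, and `copy.deepcopy` is used here because it goes through the same `__reduce_ex__` / `__setstate__` protocol as pickling while keeping the example independent of how the module is loaded.

```python
from __future__ import annotations

import copy
import random


class PicklablePipeline:
    """Toy stand-in for Compose; not the albumentations class."""

    def __init__(self, seed: int | None) -> None:
        self._last_torch_seed: int | None = None
        self.set_random_seed(seed)

    def set_random_seed(self, seed: int | None) -> None:
        self.seed = seed
        self.py_random = random.Random(seed)

    def __setstate__(self, state: dict) -> None:
        self.__dict__.update(state)
        if getattr(self, "seed", None) is not None:
            # Forget the seed observed in the parent process so the next
            # worker-seed check cannot be skipped as "already synchronized".
            self._last_torch_seed = None
            self.set_random_seed(self.seed)


original = PicklablePipeline(137)
original._last_torch_seed = 999   # pretend a sync already happened
original.py_random.random()       # advance the parent's random stream

restored = copy.deepcopy(original)  # invokes __setstate__ on the copy
```

The restored object starts from a fresh generator seeded with the original seed and carries no stale `_last_torch_seed`, while the original object is left untouched.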

### Comment 2
<location> `albumentations/core/composition.py:828` </location>
<code_context>

     def __call__(self, *args: Any, force_apply: bool = False, **data: Any) -> dict[str, Any]:
-        """Apply transformations to data.
+        """Apply transformations to data with automatic worker seed synchronization.

         Args:
</code_context>

<issue_to_address>
Docstring no longer documents force_apply behavior

Please include a brief description of the `force_apply` parameter in the docstring to clarify its purpose and usage.
</issue_to_address>

Copilot

Pull Request Overview

This PR improves reproducibility in Compose transforms by incorporating worker-aware random seeding, ensuring deterministic behavior across single- and multi-process DataLoader contexts.

Introduces worker-aware seed calculation and synchronization through new helper methods (_get_effective_seed, _check_worker_seed, and updates to set_random_seed and __setstate__).
Updates tests to cover single- and multi-worker scenarios and adjusts serialization to include the original seed.
Corrects parameter passing in OneOf and ChannelShuffle transforms and aligns default seed values with project guidelines.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| tests/test_transforms.py | Adds configuration for FrequencyMasking and updates transform tests. |
| tests/test_serialization.py | Includes seed in serialization output. |
| tests/test_per_worker_seed.py | Adds comprehensive tests for worker-aware seed functionality. |
| albumentations/core/composition.py | Enhances seed handling with worker context and updates API usage. |
| .cursor/rules/albumentations-rules.mdc | Updates coding rules, including default seed value usage. |

Comments suppressed due to low confidence (3)

tests/test_transforms.py:1792

Consider adding a brief comment or reference for the new transform A.FrequencyMasking to clarify its intended usage and ensure future maintainability.

A.FrequencyMasking,

albumentations/core/composition.py:830

Since the type-check for force_apply was removed, please update the __call__ docstring to clearly describe the accepted types and behavior of the force_apply parameter.

def __call__(self, *args: Any, force_apply: bool = False, **data: Any) -> dict[str, Any]:

albumentations/core/composition.py:869

[nitpick] Expand the _check_worker_seed docstring to detail when the worker seed synchronization is triggered and how it propagates the updated seed to child transforms.

def _check_worker_seed(self) -> None:

ternaus added 2 commits June 16, 2025 18:24

Fix in random seed

9e02525

Fix

a215782

ternaus requested a review from Copilot June 17, 2025 01:31

sourcery-ai bot reviewed Jun 17, 2025

ternaus added 3 commits June 16, 2025 18:39

Fix

cce1b16

Fix

5db4700

Fix

99f6f80

ternaus requested a review from Copilot June 17, 2025 01:52

Copilot AI reviewed Jun 17, 2025

ternaus added 3 commits June 16, 2025 19:17

Fix in test

7b9d67c

Fix in tests

7431ad2

Fix in tests

3fc99f1

ternaus merged commit 76fef70 into main Jun 17, 2025
15 checks passed

ternaus deleted the fix_reproducibility branch June 17, 2025 02:44

This was referenced Jun 17, 2025

'persistent_workers' results in inconsistent behavior. #2559

Closed

Setting random seed with multiprocessing #2473

Closed

Fix reproducibility #2561
ternaus merged 8 commits into main from fix_reproducibility
