This repository was archived by the owner on Jul 10, 2025. It is now read-only.


Fix reproducibility #2561

Merged
ternaus merged 8 commits into main from fix_reproducibility
Jun 17, 2025

Conversation

Collaborator

@ternaus ternaus commented Jun 17, 2025

Summary by Sourcery

Enable worker-aware random seeding in Compose transforms to ensure reproducible augmentations across single- and multi-process DataLoader contexts.

New Features:

  • Add _get_effective_seed to compute per-worker seed based on torch.initial_seed
  • Override Compose.set_random_seed and __setstate__ to incorporate worker context into seeding
  • Introduce _check_worker_seed to automatically synchronize RNG state in __call__ when inside DataLoader workers
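
A minimal sketch of how a per-worker effective seed could be derived. The PR text only says that _get_effective_seed is based on torch.initial_seed; the additive combination, the 32-bit modulus, and the helper names `detect_worker_seed` and `get_effective_seed` (with its `worker_seed` parameter) are assumptions made here for illustration, not the merged implementation.

```python
from __future__ import annotations

SEED_MODULUS = 2**32  # assumed range; keeps the result a valid 32-bit seed


def detect_worker_seed() -> int | None:
    """Return torch's per-worker seed inside a DataLoader worker, else None."""
    try:
        import torch
        import torch.utils.data
    except ImportError:
        return None  # torch is optional in this sketch
    if torch.utils.data.get_worker_info() is None:
        return None  # main process: no worker context
    return torch.initial_seed()


def get_effective_seed(base_seed: int | None, worker_seed: int | None = None) -> int | None:
    """Combine the user's seed with a worker seed (hypothetical helper)."""
    if base_seed is None:
        return None  # unseeded pipelines stay unseeded
    if worker_seed is None:
        worker_seed = detect_worker_seed()
    if worker_seed is None:
        return base_seed  # single-process behavior is unchanged
    return (base_seed + worker_seed) % SEED_MODULUS
```

Because PyTorch assigns each DataLoader worker a distinct initial seed, two workers that share the same base seed would receive different effective seeds under this scheme, while the main process keeps using the base seed as-is.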

Bug Fixes:

  • Fix seed propagation to nested transforms for consistent reproducibility
  • Correct parameter passing in OneOf and ChannelShuffle super().__init__ calls
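
To illustrate what seed propagation to nested transforms means, here is a small stand-alone sketch. `Leaf` and `Container` are hypothetical stand-ins, not albumentations classes, and giving every child the same seed is an assumption of this sketch.

```python
from __future__ import annotations

import random


class Leaf:
    """Stand-in for a single transform that owns its own RNG."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.py_random = random.Random()

    def set_random_seed(self, seed: int | None) -> None:
        self.py_random = random.Random(seed)


class Container:
    """Stand-in for a composition that forwards seeds to all children."""

    def __init__(self, children: list, seed: int | None = None) -> None:
        self.children = children
        self.set_random_seed(seed)

    def set_random_seed(self, seed: int | None) -> None:
        self.seed = seed
        self.py_random = random.Random(seed)
        # Recurse: a child may itself be a Container with its own children.
        for child in self.children:
            child.set_random_seed(seed)
```

Seeding only the outermost container is then enough: the call walks the whole tree, so a transform buried two levels down ends up with a deterministic RNG too.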

Enhancements:

  • Store original seed and include it in to_dict serialization
  • Propagate RNG state updates to all transforms via set_random_state or set_random_seed
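
The idea of keeping the original seed (rather than a worker-adjusted one) in the serialized form can be sketched as follows. `SeededPipeline`, its `from_dict` method, and the `_last_worker_seed` attribute are hypothetical names used for illustration only.

```python
from __future__ import annotations

import random
from typing import Any


class SeededPipeline:
    """Toy pipeline that remembers the seed the user passed in."""

    def __init__(self, seed: int | None = None) -> None:
        self.seed = seed  # original seed, never the worker-adjusted one
        self._last_worker_seed: int | None = None
        self.py_random = random.Random(seed)

    def to_dict(self) -> dict[str, Any]:
        # Serializing the original seed lets a rebuilt pipeline reseed itself.
        return {"transform": "SeededPipeline", "seed": self.seed}

    @classmethod
    def from_dict(cls, config: dict[str, Any]) -> "SeededPipeline":
        return cls(seed=config.get("seed"))

    def __setstate__(self, state: dict[str, Any]) -> None:
        # Called on unpickling, e.g. when a DataLoader worker receives a copy.
        self.__dict__.update(state)
        self._last_worker_seed = None  # force a fresh worker-seed check
        if self.seed is not None:
            self.py_random = random.Random(self.seed)
```

Storing the original seed means a pipeline rebuilt from its dictionary, or unpickled in another process, can recompute whatever effective seed is appropriate for the context it lands in.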

Tests:

  • Add comprehensive tests for worker-aware seeding in single- and multi-worker DataLoader scenarios
  • Update serialization tests to verify seed inclusion
  • Include deterministic behavior and diversity tests across epochs and workers

@ternaus ternaus requested a review from Copilot June 17, 2025 01:31
Contributor

sourcery-ai bot commented Jun 17, 2025

Reviewer's Guide

This PR adds worker-aware seeding to Compose by computing an effective seed per DataLoader worker, synchronizing RNG state on each call, persisting the original seed for serialization, correcting API consistency in transform constructors and calls, and covering all cases with new comprehensive tests.

Sequence diagram for worker-aware seeding in Compose call

sequenceDiagram
    participant User
    participant DataLoaderWorker
    participant Compose
    participant Transform
    User->>DataLoaderWorker: Request batch
    DataLoaderWorker->>Compose: __call__(data)
    Compose->>Compose: _check_worker_seed()
    Compose->>Compose: Update RNG state if worker seed changed
    Compose->>Transform: set_random_state()/set_random_seed(effective_seed)
    Compose->>Compose: Apply transforms
    Compose-->>DataLoaderWorker: Return augmented data
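
The flow in the sequence diagram can be approximated with a short stand-alone sketch. `TinyTransform` and `TinyCompose` are hypothetical classes, the worker seed is injected through a `worker_seed_fn` callable instead of being read from torch, and the additive seed combination is an assumption.

```python
from __future__ import annotations

import random
from typing import Callable


class TinyTransform:
    """Stand-in for a transform that owns its own RNG."""

    def __init__(self) -> None:
        self.py_random = random.Random()

    def set_random_seed(self, seed: int) -> None:
        self.py_random.seed(seed)

    def __call__(self, value: float) -> float:
        return value + self.py_random.random()


class TinyCompose:
    """Re-seeds its children whenever the observed worker seed changes."""

    def __init__(
        self,
        transforms: list[TinyTransform],
        seed: int | None,
        worker_seed_fn: Callable[[], int],
    ) -> None:
        self.transforms = transforms
        self.seed = seed
        self._worker_seed_fn = worker_seed_fn
        self._last_worker_seed: int | None = None

    def _check_worker_seed(self) -> None:
        if self.seed is None:
            return  # nothing to synchronize for unseeded pipelines
        current = self._worker_seed_fn()
        if current == self._last_worker_seed:
            return  # already synchronized for this worker
        self._last_worker_seed = current
        effective = (self.seed + current) % 2**32
        for transform in self.transforms:
            transform.set_random_seed(effective)

    def __call__(self, value: float) -> float:
        self._check_worker_seed()
        for transform in self.transforms:
            value = transform(value)
        return value
```

The check is cheap on the common path: once the worker seed has been seen, subsequent calls compare two integers and go straight to applying the transforms.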

Class diagram for updated Compose and BaseCompose seeding logic

classDiagram
    class BaseCompose {
        +int|None seed
        +np.random.Generator random_generator
        +random.Random py_random
        +set_random_seed(seed: int|None)
        +_get_effective_seed(base_seed: int|None) int|None
    }
    class Compose {
        +int|None _last_torch_seed
        +_check_worker_seed()
        +__setstate__(state: dict)
        +set_random_seed(seed: int|None)
        +__call__(*args, force_apply: bool = False, **data) dict
    }
    BaseCompose <|-- Compose

Class diagram for transform constructor and call API consistency

classDiagram
    class OneOf {
        +__init__(transforms, p)
    }
    class ChannelDropout {
        +__init__(transforms, p, channels)
        +__call__(*args, force_apply: bool = False, **data) dict
    }
    OneOf --|> BaseCompose
    ChannelDropout --|> BaseCompose

File-Level Changes

Change Details Files
Implement worker-aware seeding and RNG synchronization in Compose
  • Added _get_effective_seed to adjust base_seed using torch.initial_seed in worker contexts
  • Introduced _check_worker_seed to detect worker processes and reinitialize RNG state before each call
  • Overrode set_random_seed and __setstate__ to recalculate and apply effective seeds upon seeding and unpickling
  • Updated Compose.__init__ to compute and pass effective_seed instead of raw seed, and Compose.__call__ to invoke _check_worker_seed
albumentations/core/composition.py
Persist and serialize original seed in Compose
  • Stored the original seed in compose instances (self.seed) during seeding routines
  • Added 'seed' field to to_dict_private for consistent pickling and inclusion in to_dict output
albumentations/core/composition.py
Fix super().__init__ signatures and transform invocation details
  • Corrected super().__init__ calls to use named parameters (transforms=, p=) in several transform subclasses
  • Changed channel-based transform call to use t(**sub_data)["image"] and updated _track_transform_params invocation
albumentations/core/composition.py
Add comprehensive tests for worker-aware seeding and deterministic behavior
  • Created tests/test_per_worker_seed.py with unit tests across no-torch, multi-worker, epoch diversity, and seed overflow scenarios
  • Updated tests/test_transforms.py and tests/test_serialization.py to include seed field in dict and cover new FrequencyMasking augment
  • Covered deterministic single-process behavior, multiple Compose instances, and per-worker sequence diversity
tests/test_per_worker_seed.py
tests/test_transforms.py
tests/test_serialization.py

This comment was marked as outdated.

Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey @ternaus - I've reviewed your changes - here's some feedback:

  • You’re importing torch inside both _get_effective_seed and _check_worker_seed on every call, which can be expensive—consider doing the import (and worker_info retrieval) once at module load or caching the availability check.
  • There’s a lot of duplicated seed-initialization logic between __init__, set_random_seed, and __setstate__; extracting a single helper to centralize effective seed generation would reduce maintenance overhead.
  • In the channel‐by‐channel loop you changed, _track_transform_params is now called with data instead of sub_data, which likely breaks correct param tracking for those sub-transforms.
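
One way to act on the first point is to resolve torch once and cache the result. This is a sketch of that idea only; `load_torch` and `current_worker_seed` are hypothetical helper names, not functions from the PR.

```python
from __future__ import annotations

import functools
import importlib
import importlib.util
from types import ModuleType


@functools.lru_cache(maxsize=1)
def load_torch() -> ModuleType | None:
    """Import torch at most once; return None when it is unavailable."""
    if importlib.util.find_spec("torch") is None:
        return None
    try:
        torch = importlib.import_module("torch")
        importlib.import_module("torch.utils.data")
    except (ImportError, OSError):
        return None
    return torch


def current_worker_seed() -> int | None:
    """Return the DataLoader worker seed, or None outside a worker."""
    torch = load_torch()
    if torch is None:
        return None
    if torch.utils.data.get_worker_info() is None:
        return None
    return torch.initial_seed()
```

The availability check is cached, but the worker lookup is not: `get_worker_info()` has to be queried at call time because the same pipeline object may be used in the main process and later inside a worker.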
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- You’re importing torch inside both _get_effective_seed and _check_worker_seed on every call, which can be expensive—consider doing the import (and worker_info retrieval) once at module load or caching the availability check.
- There’s a lot of duplicated seed‐initialization logic between __init__, set_random_seed, and __setstate__; extracting a single helper to centralize effective seed generation would reduce maintenance overhead.
- In the channel‐by‐channel loop you changed, _track_transform_params is now called with `data` instead of `sub_data`, which likely breaks correct param tracking for those sub-transforms.

## Individual Comments

### Comment 1
<location> `albumentations/core/composition.py:903` </location>
<code_context>
+        except (ImportError, AttributeError):
+            pass
+
+    def __setstate__(self, state: dict[str, Any]) -> None:
+        """Set state from unpickling and handle worker seed."""
+        self.__dict__.update(state)
</code_context>

<issue_to_address>
Reset _last_torch_seed in __setstate__

Reset `_last_torch_seed` to `None` before calling `set_random_seed` to guarantee the worker-seed sync runs after unpickling, even if the unpickled value matches the current seed.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
    def __setstate__(self, state: dict[str, Any]) -> None:
        """Set state from unpickling and handle worker seed."""
        self.__dict__.update(state)
        # If we have a seed, recalculate effective seed in worker context
        if hasattr(self, "seed") and self.seed is not None:
            # Recalculate effective seed in worker context
            self.set_random_seed(self.seed)
=======
    def __setstate__(self, state: dict[str, Any]) -> None:
        """Set state from unpickling and handle worker seed."""
        self.__dict__.update(state)
        # If we have a seed, recalculate effective seed in worker context
        if hasattr(self, "seed") and self.seed is not None:
            # Reset _last_torch_seed to ensure worker-seed sync runs after unpickling
            self._last_torch_seed = None
            # Recalculate effective seed in worker context
            self.set_random_seed(self.seed)
>>>>>>> REPLACE

</suggested_fix>

### Comment 2
<location> `albumentations/core/composition.py:828` </location>
<code_context>

     def __call__(self, *args: Any, force_apply: bool = False, **data: Any) -> dict[str, Any]:
-        """Apply transformations to data.
+        """Apply transformations to data with automatic worker seed synchronization.

         Args:
</code_context>

<issue_to_address>
Docstring no longer documents force_apply behavior

Please include a brief description of the `force_apply` parameter in the docstring to clarify its purpose and usage.
</issue_to_address>


@ternaus ternaus requested a review from Copilot June 17, 2025 01:52

Copilot AI left a comment


Pull Request Overview

This PR improves reproducibility in Compose transforms by incorporating worker-aware random seeding, ensuring deterministic behavior across single- and multi-process DataLoader contexts.

  • Introduces worker-aware seed calculation and synchronization through new helper methods (_get_effective_seed, _check_worker_seed, and updates to set_random_seed and __setstate__).
  • Updates tests to cover single- and multi-worker scenarios and adjusts serialization to include the original seed.
  • Corrects parameter passing in OneOf and ChannelShuffle transforms and aligns default seed values with project guidelines.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/test_transforms.py | Adds configuration for FrequencyMasking and updates transform tests. |
| tests/test_serialization.py | Includes seed in serialization output. |
| tests/test_per_worker_seed.py | Adds comprehensive tests for worker-aware seed functionality. |
| albumentations/core/composition.py | Enhances seed handling with worker context and updates API usage. |
| .cursor/rules/albumentations-rules.mdc | Updates coding rules, including default seed value usage. |

Comments suppressed due to low confidence (3)

tests/test_transforms.py:1792

  • Consider adding a brief comment or reference for the new transform A.FrequencyMasking to clarify its intended usage and ensure future maintainability.
A.FrequencyMasking,

albumentations/core/composition.py:830

  • Since the type-check for force_apply was removed, please update the __call__ docstring to clearly describe the accepted types and behavior of the force_apply parameter.
def __call__(self, *args: Any, force_apply: bool = False, **data: Any) -> dict[str, Any]:

albumentations/core/composition.py:869

  • [nitpick] Expand the _check_worker_seed docstring to detail when the worker seed synchronization is triggered and how it propagates the updated seed to child transforms.
def _check_worker_seed(self) -> None:

@ternaus ternaus merged commit 76fef70 into main Jun 17, 2025
15 checks passed
@ternaus ternaus deleted the fix_reproducibility branch June 17, 2025 02:44