@gertln (Contributor) commented Jun 10, 2025

PhysicsNeMo Pull Request

Description

Introduces InfiniteHashSampler, a memory-efficient infinite sampler for very large datasets (a billion samples or more). It randomizes the sample order with a hash function instead of storing a full index array.
Adds tests for both infinite samplers.

  • Hash-Based Randomization: Deterministic pseudo-random sampling using an efficient hash function
  • Distributed Training Support: Full compatibility with DistributedDataParallel (DDP)
  • Billion-Scale Ready: Tested with datasets up to 10 billion samples
  • Sequential Fallback: Option to disable randomization for sequential access
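
For readers unfamiliar with the idea, here is a minimal sketch of hash-based index generation. It is not the code in this PR: the names `mix64` and `hash_sample_indices`, the parameters, and the choice of the SplitMix64 finalizer as the hash are all assumptions made for illustration. Each index is computed from a step counter, so memory use stays constant regardless of dataset size, and each rank takes every `world_size`-th step so ranks do not repeat each other's steps.

```python
import itertools
from typing import Iterator

_MASK64 = (1 << 64) - 1


def mix64(x: int) -> int:
    """SplitMix64 finalizer (illustrative hash choice, not necessarily the PR's)."""
    x = (x + 0x9E3779B97F4A7C15) & _MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & _MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & _MASK64
    return x ^ (x >> 31)


def hash_sample_indices(
    dataset_size: int,
    rank: int = 0,
    world_size: int = 1,
    seed: int = 0,
    shuffle: bool = True,
) -> Iterator[int]:
    """Yield dataset indices forever without materializing an index array.

    Hypothetical helper: the signature is an assumption, not the PR's API.
    """
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")

    seed_mix = mix64(seed)
    # Rank r consumes global steps r, r + world_size, r + 2 * world_size, ...
    for step in itertools.count(rank, world_size):
        if shuffle:
            # Deterministic pseudo-random index derived from (seed, step).
            yield mix64(step ^ seed_mix) % dataset_size
        else:
            # Sequential fallback: walk the dataset in order, wrapping around.
            yield step % dataset_size


# Usage: first few indices for rank 1 of 4 on a 10-billion-sample dataset.
first = list(itertools.islice(hash_sample_indices(10**10, rank=1, world_size=4), 3))
```

Note that hashing a counter this way draws indices with replacement, so a sample may repeat before every other sample has been seen; whether InfiniteHashSampler makes the same trade-off is not stated in this description. In PyTorch, a generator like this would typically sit behind the `__iter__` method of a `torch.utils.data.Sampler` subclass.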

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies
