Support inserting based on batchsize into shuffler #1369

Open
ayushdg wants to merge 11 commits into NVIDIA-NeMo:main from ayushdg:batched-exact-dedup

Conversation

ayushdg (Contributor) commented Jan 13, 2026

Description

  • This PR adds support for inserting into a shuffler via a batched method when one is available, and wires that support into the ExactDuplicateIdentification stage.
  • Speeds up the exact dedup tests by ~25s on my machine (1:54 -> 1:29).
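
The "batched method if available" pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `ToyShuffler`, `insert_chunk`, `insert_chunks_batched`, `batched`, and `insert_all` are hypothetical names, and the real shuffler API may look quite different.

```python
from typing import Iterable, Iterator, List


class ToyShuffler:
    """Hypothetical stand-in for a shuffler; NOT the real shuffler API."""

    def __init__(self) -> None:
        # Records one entry per insert call so the batching is observable.
        self.calls: List[List[str]] = []

    def insert_chunk(self, chunk: str) -> None:
        self.calls.append([chunk])

    def insert_chunks_batched(self, chunks: List[str]) -> None:
        self.calls.append(list(chunks))


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most batch_size items, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def insert_all(shuffler, chunks: Iterable[str], batch_size: int = 1) -> None:
    """Use the batched insert when the shuffler offers one, else fall back."""
    insert_batched = getattr(shuffler, "insert_chunks_batched", None)
    if insert_batched is not None and batch_size > 1:
        for batch in batched(chunks, batch_size):
            insert_batched(batch)
    else:
        for chunk in chunks:
            shuffler.insert_chunk(chunk)
```

With `batch_size=2` and five chunks, this sketch makes three insert calls instead of five. Fewer, larger inserts are one plausible source of the speedup reported above.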

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
copy-pr-bot (bot) commented Jan 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

ayushdg (Contributor, Author) commented Jan 14, 2026

/ok to test 41ba9c0

  class TestExactDuplicates:
-     @pytest.mark.parametrize("assign_id", [True, False])
-     @pytest.mark.parametrize("total_nparts", [2, 4])
+     @pytest.mark.parametrize(("assign_id", "total_nparts", "batch_size"), [(False, 2, 1), (True, 4, 5)])

Test coverage is reduced from 4 combinations (2 assign_id × 2 total_nparts) to just 2 specific combinations. Consider restoring full combinatorial testing, or adding more edge cases such as (True, 2, 1) and (False, 4, 5).
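
To make the coverage gap concrete, here is a small sketch that enumerates the full grid and lists the combinations the new parametrization leaves out. It assumes `batch_size` only takes the two values seen in the diff (1 and 5). The variable names are illustrative and not taken from the test suite.

```python
from itertools import product

# Values taken from the diff; the batch_size values are an assumption
# based on the two tuples in the new parametrization.
assign_id_values = [True, False]
total_nparts_values = [2, 4]
batch_size_values = [1, 5]

# Full Cartesian product, which is what stacking one
# @pytest.mark.parametrize decorator per argument would generate.
full_grid = list(product(assign_id_values, total_nparts_values, batch_size_values))

# The explicit cases in the new parametrization.
current_cases = [(False, 2, 1), (True, 4, 5)]

# Combinations that are no longer exercised.
missing_cases = [case for case in full_grid if case not in current_cases]

print(len(full_grid))      # 8
print(len(missing_cases))  # 6
for case in missing_cases:
    print(case)
```

Passing `full_grid` as the second argument to `pytest.mark.parametrize(("assign_id", "total_nparts", "batch_size"), ...)` would restore combinatorial coverage, at the cost of running 8 cases instead of 2.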

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
