
fix(tests): stabilize flaky Hub LFS integration test #4

Open
The-Obstacle-Is-The-Way wants to merge 8 commits into main from fix/flaky-lfs-test

Conversation

@The-Obstacle-Is-The-Way (Owner) commented Nov 29, 2025

Summary

  • Add _wait_for_repo_ready helper to ensure Hub repository consistency between operations
  • Fix race condition in test_push_dataset_dict_to_hub_overwrite_files that caused intermittent LFS pointer errors

Problem

test_push_dataset_dict_to_hub_overwrite_files intermittently fails with:

BadRequestError: LFS pointer pointed to a file that does not exist

This occurs because rapid successive push_to_hub calls don't wait for Hub's LFS object propagation between pushes.

Solution

Add an explicit synchronization step between push operations so the Hub has fully processed LFS objects before the next push.
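For reference, a minimal standalone sketch of this synchronization step is shown below. It mirrors the _wait_for_repo_ready helper added in this PR (polling HfApi.list_repo_files with a bounded monotonic timeout), but the free-function form and argument passing here are illustrative rather than the exact test code.

import time

from huggingface_hub import HfApi
from huggingface_hub.errors import HfHubHTTPError


def wait_for_repo_ready(api: HfApi, repo_id: str, token: str, max_wait: float = 30) -> None:
    # Poll the repo file listing until the Hub answers consistently, then give
    # LFS objects a short extra window to propagate.
    start_time = time.monotonic()
    while (time.monotonic() - start_time) < max_wait:
        try:
            api.list_repo_files(repo_id, repo_type="dataset", token=token)
            time.sleep(1)  # small buffer for LFS propagation
            return
        except HfHubHTTPError:
            time.sleep(1)  # transient Hub error: retry until max_wait elapses
    raise TimeoutError(f"Repository {repo_id} not ready after {max_wait}s")


# Intended ordering in the flaky test (illustrative):
#   dataset_dict.push_to_hub(repo_id, token=token)    # first push
#   wait_for_repo_ready(api, repo_id, token)          # let Hub/LFS settle
#   dataset_dict.push_to_hub(repo_id, token=token)    # overwriting push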

Test plan

  • Ran the test locally; it passes consistently
  • CI passes on this PR
  • Verify the fix works on both Ubuntu and Windows

Related Issues


This fix is intended for upstream submission to huggingface/datasets after validation.

Summary by CodeRabbit

  • Tests
    • Improved test robustness by implementing explicit synchronization checks between repository operations to ensure proper propagation of changes, replacing timing-based sequencing with time-bounded synchronization points.


coderabbitai bot commented Nov 29, 2025

Warning

Rate limit exceeded

@The-Obstacle-Is-The-Way has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 9 minutes and 37 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 3ea8de7 and 8a242d8.

📒 Files selected for processing (13)
  • setup.py (1 hunks)
  • src/datasets/__init__.py (1 hunks)
  • src/datasets/arrow_dataset.py (2 hunks)
  • src/datasets/builder.py (2 hunks)
  • src/datasets/config.py (1 hunks)
  • src/datasets/dataset_dict.py (4 hunks)
  • src/datasets/features/features.py (3 hunks)
  • src/datasets/features/nifti.py (3 hunks)
  • src/datasets/fingerprint.py (1 hunks)
  • src/datasets/iterable_dataset.py (2 hunks)
  • tests/conftest.py (1 hunks)
  • tests/features/test_nifti.py (1 hunks)
  • tests/test_fingerprint.py (1 hunks)

Walkthrough

The change adds a _wait_for_repo_ready helper method to the TestPushToHub test class that polls repository file listings to verify Hub readiness, replacing timing-based delays with explicit, bounded synchronization points between sequential test operations.

Changes

Test synchronization helper (tests/test_upstream_hub.py)
Added _wait_for_repo_ready(self, repo_id, max_wait=30) method to poll repository readiness with monotonic timeout and 1-second intervals; integrated polling checks after pushes and file uploads to ensure Hub/LFS propagation between test operations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify the polling logic and timeout behavior in _wait_for_repo_ready method
  • Confirm all test call sites correctly invoke the helper after relevant operations
  • Check that polling intervals and max_wait defaults are appropriate for test execution speed

Poem

🐰 A rabbit waits, not by the clock,
But checks if files unlock,
No rushing through Hub's hidden dance—
Just poll and sync with second's glance!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 15.00%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly addresses the main change: adding synchronization to stabilize a flaky Hub LFS integration test by fixing a race condition.


coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
tests/test_upstream_hub.py (3)

270-281: Polling helper for Hub consistency looks solid; only minor nits possible

The retry loop around list_repo_files with a bounded wait and explicit TimeoutError is a good fit to make the test deterministic and fail loudly if the Hub is slow or inconsistent. Catching Exception is acceptable in this test-only helper, and the extra sleep(1) after a successful probe gives LFS a bit more buffer.

If you ever need finer control, a small optional refinement would be:

  • Track elapsed time via time.monotonic() so max_wait is strictly “seconds” rather than “iterations”.
  • Optionally narrow the except to hub-related errors to avoid masking unexpected issues, or at least log them.

Not blockers; current implementation is fine for the intended use.


294-313: Well-placed waits between first push, manual upload, and second push

Adding _wait_for_repo_ready(ds_name) after the initial push_to_hub and again after the upload_file call is a reasonable way to avoid the LFS propagation race that was causing the flaky BadRequestError. The helper is scoped to this test class and only used here, so the additional latency is limited to this one integration test.

If further flakes appear, a non-blocking follow-up could be to:

  • Also use _wait_for_repo_ready after the second push_to_hub before asserting on list_repo_files, to guard against any remaining backend lag.
    But given the current problem description, the added waits should already address the primary race.

345-364: Symmetric guarding of the second overwrite scenario looks appropriate

Mirroring the same wait pattern in the second half of test_push_dataset_dict_to_hub_overwrite_files (push → wait → upload → wait → push) keeps both branches consistent and should prevent the same LFS race when the second run changes the shard count in the opposite direction.

No issues from a correctness standpoint. As with the first block, you might consider adding a final readiness wait before the concluding load_dataset if you ever see residual flakiness, but it's not strictly necessary right now.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 004a5bf and b6490dc.

📒 Files selected for processing (1)
  • tests/test_upstream_hub.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: test (unit, windows-latest, deps-minimum)
  • GitHub Check: test (integration, windows-latest, deps-latest)
  • GitHub Check: test (unit, ubuntu-latest, deps-minimum)
  • GitHub Check: test (integration, windows-latest, deps-minimum)
  • GitHub Check: test (unit, windows-latest, deps-latest)
  • GitHub Check: test (integration, ubuntu-latest, deps-minimum)
  • GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
  • GitHub Check: test (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test (integration, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314 (unit, windows-latest, deps-latest)
  • GitHub Check: build / build_pr_documentation

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/test_upstream_hub.py (1)

270-296: Reasonable approach to stabilize flaky Hub tests.

The helper addresses the race condition with a polling + sleep strategy. A few observations:

  1. Exception specificity: Catching all HfHubHTTPError is broad. Consider logging transient errors for debugging intermittent CI failures.

  2. Import placement: Moving the import to the top of the file (alongside other huggingface_hub imports on line 15) would be more idiomatic.

  3. Delay tuning: The 1-second post-success sleep is a heuristic. If flakiness persists, this may need adjustment or a more robust verification (e.g., retrying the subsequent operation with backoff rather than pre-waiting; a sketch of that approach follows at the end of this comment).

Consider moving the import to the file header:

-from huggingface_hub import DatasetCard, HfApi
+from huggingface_hub import DatasetCard, HfApi
+from huggingface_hub.errors import HfHubHTTPError

Then remove line 284 from within the method.
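
As a sketch of the backoff alternative mentioned in point 3, one could wrap the subsequent Hub operation itself instead of pre-waiting. This is illustrative only (not code from this PR), and the attempt count and delays are arbitrary choices.

import time
from typing import Callable, TypeVar

from huggingface_hub.errors import HfHubHTTPError

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T], attempts: int = 5, base_delay: float = 1.0) -> T:
    # Retry the operation on transient Hub errors with exponential backoff.
    for attempt in range(attempts):
        try:
            return operation()
        except HfHubHTTPError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * 2**attempt)  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")


# Example: retry the second push instead of sleeping before it.
# retry_with_backoff(lambda: local_ds.push_to_hub(ds_name, token=self._token))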

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6490dc and fab2e85.

📒 Files selected for processing (1)
  • tests/test_upstream_hub.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: test (integration, ubuntu-latest, deps-minimum)
  • GitHub Check: test (unit, windows-latest, deps-latest)
  • GitHub Check: test (integration, windows-latest, deps-minimum)
  • GitHub Check: test (integration, windows-latest, deps-latest)
  • GitHub Check: test (unit, windows-latest, deps-minimum)
  • GitHub Check: test (integration, ubuntu-latest, deps-latest)
  • GitHub Check: test (unit, ubuntu-latest, deps-minimum)
  • GitHub Check: test (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314 (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
  • GitHub Check: build / build_pr_documentation
🔇 Additional comments (4)
tests/test_upstream_hub.py (4)

309-311: LGTM!

Appropriate placement after the initial push to ensure Hub consistency before the subsequent upload operation.


326-328: LGTM!

Correctly placed between the upload and the subsequent push operation.


360-362: LGTM!

Mirrors the synchronization pattern from the first test block.


377-379: LGTM!

Completes the synchronization pattern for the second test scenario.

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/test_upstream_hub.py (1)

270-296: Consider broadening exception handling to cover more transient network errors.

The helper only catches HfHubHTTPError, but transient connection issues (e.g., ConnectionError, requests.exceptions.Timeout) would propagate and fail the test. Consider catching a broader set of transient errors:

-            except HfHubHTTPError:
+            except (HfHubHTTPError, ConnectionError) as e:
+                # Retry on transient errors
                 time.sleep(1)

Also, moving the import to the top of the file with other huggingface_hub imports would be more idiomatic:

-from huggingface_hub import DatasetCard, HfApi
+from huggingface_hub import DatasetCard, HfApi
+from huggingface_hub.errors import HfHubHTTPError
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3c23371 and e98efc8.

📒 Files selected for processing (1)
  • tests/test_upstream_hub.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_upstream_hub.py (1)
tests/fixtures/hub.py (1)
  • temporary_repo (87-99)
🪛 Gitleaks (8.29.1)
tests/test_upstream_hub.py

[high] 358-358: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: test (unit, windows-latest, deps-latest)
  • GitHub Check: test (integration, ubuntu-latest, deps-minimum)
  • GitHub Check: test (integration, windows-latest, deps-latest)
  • GitHub Check: test (integration, ubuntu-latest, deps-latest)
  • GitHub Check: test (unit, windows-latest, deps-minimum)
  • GitHub Check: test (integration, windows-latest, deps-minimum)
  • GitHub Check: test (unit, ubuntu-latest, deps-minimum)
  • GitHub Check: test (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314 (unit, windows-latest, deps-latest)
  • GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
  • GitHub Check: build / build_pr_documentation
🔇 Additional comments (4)
tests/test_upstream_hub.py (4)

306-311: LGTM!

Correctly placed synchronization point after the first push ensures Hub has processed LFS objects before the subsequent file upload.


326-329: LGTM!

Synchronization before the second push ensures the manually uploaded file is fully propagated.


357-361: LGTM - variable rename and synchronization points are correct.

The rename to ds_name_2 for the second test block improves clarity. The static analysis warning about "Generic API Key" on line 358 is a false positive—self._token references the test fixture constant CI_HUB_USER_TOKEN, not a hardcoded secret.


369-395: LGTM!

All API calls in the second test block consistently reference ds_name_2, and the wait synchronization is correctly placed before the second push operation.

coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
src/datasets/packaged_modules/spark/spark.py (1)

264-268: Add missing shutil import for _prepare_split_single move logic.

shutil.move(file, dest) is used here but shutil isn’t imported at the top of the file, which will lead to a NameError when this branch is executed; please add the import (even though this predates the current diff).

-import os
+import os
+import shutil
src/datasets/packaged_modules/text/text.py (1)

104-109: Inconsistent key type: Line 109 uses plain tuple instead of Key.

The final batch yield in paragraph mode at line 109 uses a plain tuple (file_idx, batch_idx) while line 104 uses Key(file_idx, batch_idx). This inconsistency could cause issues downstream if consumers expect all keys to be Key instances.

                     if batch:
                         pa_table = pa.Table.from_arrays([pa.array([batch])], names=pa_table_names)
-                        yield (file_idx, batch_idx), self._cast_table(pa_table)
+                        yield Key(file_idx, batch_idx), self._cast_table(pa_table)
src/datasets/splits.py (1)

520-573: SplitDict.from_split_dict breaks when passed a list of SplitInfo objects

In from_split_dict, dataset name inference assumes dict elements:

if dataset_name is None:
    dataset_name = split_infos[0].get("dataset_name") if split_infos else None

If split_infos is a list[SplitInfo], split_infos[0] has no .get, raising AttributeError. This is likely in scope since the method advertises accepting both lists and dicts, and to_split_dict() returns list[SplitInfo].

Consider making the inference robust to both dicts and SplitInfo:

-        if dataset_name is None:
-            dataset_name = split_infos[0].get("dataset_name") if split_infos else None
+        if dataset_name is None and split_infos:
+            first = split_infos[0]
+            if isinstance(first, dict):
+                dataset_name = first.get("dataset_name")
+            elif isinstance(first, SplitInfo):
+                dataset_name = first.dataset_name
src/datasets/iterable_dataset.py (1)

409-425: ShuffledDataSourcesArrowExamplesIterable.shift_rngs uses a non-existent attribute

In ShuffledDataSourcesArrowExamplesIterable.shift_rngs:

return ShuffledDataSourcesArrowExamplesIterable(
    self.generate_examples_fn,
    self.kwargs,
    np.random.default_rng(seed=new_seed),
)

ShuffledDataSourcesArrowExamplesIterable (and its parent ArrowExamplesIterable) define self.generate_tables_fn, not self.generate_examples_fn. This will raise AttributeError when shift_rngs is called (e.g. via shift_ex_examples_rngs on an arrow-backed iterable).

You likely want to pass self.generate_tables_fn:

     def shift_rngs(self, value: int) -> "_BaseExamplesIterable":
         new_seed = self.generator.bit_generator.state["state"]["state"] + value
         return ShuffledDataSourcesArrowExamplesIterable(
-            self.generate_examples_fn,
+            self.generate_tables_fn,
             self.kwargs,
             np.random.default_rng(seed=new_seed),
         )
src/datasets/builder.py (1)

1-1893: CRITICAL: PR description does not match the actual code changes.

The PR objectives claim this fixes a flaky Hub LFS integration test by adding a _wait_for_repo_ready helper to address race conditions in test_push_dataset_dict_to_hub_overwrite_files. However, the actual changes in this file are:

  1. Introduction of a new Key dataclass (lines 1320-1327)
  2. Major refactoring of generator interfaces (_generate_examples, _generate_tables)
  3. Addition of original shard tracking metadata throughout the dataset generation pipeline
  4. No _wait_for_repo_ready helper or test-related changes visible

This is a significant architectural refactor affecting core dataset generation logic, not a targeted test fix. This inconsistency suggests either:

  • The wrong PR description was used
  • The wrong code changes were included
  • Multiple unrelated changes were combined

This must be clarified before proceeding with the review.

🧹 Nitpick comments (8)
tests/test_upstream_hub.py (1)

270-296: Repo readiness helper is well-structured; consider making polling behavior slightly more flexible

The bounded time.monotonic loop plus polling list_repo_files is a good, minimal way to synchronize with Hub/LFS propagation. As an optional refinement, you could (a) accept a poll_interval argument instead of hardcoding 1 and (b) optionally treat clearly-fatal HfHubHTTPError cases (e.g. auth/permission failures) as immediate failures rather than waiting the full max_wait, while still retrying on transient 5xx/404 conditions.
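
A hedged sketch of that fatal-vs-transient classification is shown below. The status-code buckets are assumptions for illustration, and HfHubHTTPError is assumed to expose the underlying HTTP response when one was received (it may be missing for connection-level failures).

from huggingface_hub.errors import HfHubHTTPError


def is_transient(err: HfHubHTTPError) -> bool:
    # Decide whether a Hub error is worth retrying inside the readiness loop.
    response = getattr(err, "response", None)  # may be None for connection-level errors
    status = response.status_code if response is not None else None
    if status is None:
        return True  # no HTTP response: retry
    if status in (401, 403):
        return False  # auth/permission failures won't resolve by waiting
    return status == 404 or status >= 500  # not-yet-propagated repo or server-side error: retry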

src/datasets/info.py (1)

121-123: Consider updating splits docstring to reflect SplitDict type.

The field type is now Optional[SplitDict] and __post_init__ already normalizes plain dicts, so the comment describing splits as dict is slightly misleading; consider mentioning SplitDict (a dict‑like mapping) there for clarity.

Also applies to: 152-152

src/datasets/packaged_modules/spark/spark.py (1)

49-75: Consider downstream impact of changing Spark example keys from str to (part_id, row_id) tuples.

_generate_iterable_examples now yields tuple keys, which may break callers that relied on the previous string format (e.g., persisting or pattern‑matching example IDs). If this is an intentional API change, it’s worth documenting and verifying that tests and any public examples using Dataset.from_spark don’t assume string keys.

tests/test_arrow_dataset.py (1)

4786-4795: New test_add_column correctly covers basic API behavior

The test verifies that add_column creates the new feature and preserves existing rows as expected; this is a good, minimal check on top of the more exhaustive test_dataset_add_column above.

Minor optional cleanup: you can drop the inner from datasets import Dataset and reuse the module-level Dataset import to keep this file consistent:

-def test_add_column():
-    from datasets import Dataset
-
-    ds = Dataset.from_dict({"a": [1, 2]})
+def test_add_column():
+    ds = Dataset.from_dict({"a": [1, 2]})
tests/test_builder.py (1)

71-73: Consider using Key for consistency across all test builders.

The simple dummy builders (DummyGeneratorBasedBuilder, DummyArrowBasedBuilder, etc.) yield plain tuples (0, i) while the shard-aware builders use Key(shard_idx, i). While this may work since both are tuple-like, using Key(0, i) consistently would better mirror the production code patterns and avoid potential type mismatches.

Example for DummyGeneratorBasedBuilder:

     def _generate_examples(self):
         for i in range(100):
-            yield (0, i), {"text": "foo"}
+            yield Key(0, i), {"text": "foo"}

Also applies to: 83-85, 95-97, 116-118

tests/test_arrow_writer.py (1)

142-143: ArrowWriter.write usage now matches the updated signature

The test correctly adapts to the new ArrowWriter.write(example, writer_batch_size=None) signature by passing only the example dict on Lines 142–143. The behavior checked by _check_output is preserved.

If keys are no longer part of the public API here, consider (optionally) renaming test_write_with_keys and/or updating its docstring/comment to avoid confusion for future readers.

src/datasets/arrow_writer.py (1)

461-467: Update ArrowWriter.write docstring and clean up unused key-related state

write no longer accepts a key argument and always appends (example, ""), but:

  • The docstring still says “(Example,Key) pair”.
  • current_examples’ second tuple element is now always an unused placeholder.
  • self.hkey_record appears unused after removing duplicate‑key logic.

Consider:

  • Updating the docstring/comment around write/write_examples_on_file to reflect that keys are no longer part of the public API.
  • Either dropping the dummy string and switching current_examples to store just examples (and updating write_examples_on_file accordingly), or clearly documenting that the second tuple element is deprecated.
  • Removing hkey_record if it’s no longer needed.

Also applies to: 582-593

src/datasets/splits.py (1)

523-551: Normalize SplitDict keys to strings to avoid subtle key-type mismatches

__setitem__ accepts Union[SplitBase, str] keys and writes them as‑is:

if key != value.name:
    raise ValueError(...)
super().__setitem__(key, value)

Elsewhere, lookups assume string keys via str(key) in self and super().__getitem__(str(key)). If some code ever does split_dict[Split.TRAIN] = info, the dict key becomes a NamedSplit, and 'train' in split_dict will be false because dict key comparison uses 'train' == NamedSplit('train'), not the reverse.

To keep behavior predictable, it would be safer to always store keys as strings:

-    def __setitem__(self, key: Union[SplitBase, str], value: SplitInfo):
-        if key != value.name:
-            raise ValueError(f"Cannot add elem. (key mismatch: '{key}' != '{value.name}')")
-        super().__setitem__(key, value)
+    def __setitem__(self, key: Union[SplitBase, str], value: SplitInfo):
+        key_str = str(key)
+        if key_str != value.name:
+            raise ValueError(f"Cannot add elem. (key mismatch: '{key}' != '{value.name}')")
+        super().__setitem__(key_str, value)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e98efc8 and 60963a0.

📒 Files selected for processing (29)
  • src/datasets/arrow_dataset.py (2 hunks)
  • src/datasets/arrow_writer.py (2 hunks)
  • src/datasets/builder.py (28 hunks)
  • src/datasets/info.py (1 hunks)
  • src/datasets/inspect.py (0 hunks)
  • src/datasets/iterable_dataset.py (5 hunks)
  • src/datasets/keyhash.py (0 hunks)
  • src/datasets/naming.py (1 hunks)
  • src/datasets/packaged_modules/arrow/arrow.py (2 hunks)
  • src/datasets/packaged_modules/cache/cache.py (2 hunks)
  • src/datasets/packaged_modules/csv/csv.py (2 hunks)
  • src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (4 hunks)
  • src/datasets/packaged_modules/generator/generator.py (2 hunks)
  • src/datasets/packaged_modules/hdf5/hdf5.py (2 hunks)
  • src/datasets/packaged_modules/json/json.py (3 hunks)
  • src/datasets/packaged_modules/pandas/pandas.py (2 hunks)
  • src/datasets/packaged_modules/parquet/parquet.py (2 hunks)
  • src/datasets/packaged_modules/spark/spark.py (1 hunks)
  • src/datasets/packaged_modules/sql/sql.py (2 hunks)
  • src/datasets/packaged_modules/text/text.py (4 hunks)
  • src/datasets/packaged_modules/webdataset/webdataset.py (2 hunks)
  • src/datasets/packaged_modules/xml/xml.py (1 hunks)
  • src/datasets/splits.py (3 hunks)
  • src/datasets/utils/info_utils.py (1 hunks)
  • tests/packaged_modules/test_spark.py (2 hunks)
  • tests/test_arrow_dataset.py (1 hunks)
  • tests/test_arrow_writer.py (1 hunks)
  • tests/test_builder.py (14 hunks)
  • tests/test_upstream_hub.py (6 hunks)
💤 Files with no reviewable changes (2)
  • src/datasets/keyhash.py
  • src/datasets/inspect.py
✅ Files skipped from review due to trivial changes (1)
  • src/datasets/utils/info_utils.py
🧰 Additional context used
🧬 Code graph analysis (18)
src/datasets/packaged_modules/sql/sql.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/hdf5/hdf5.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/webdataset/webdataset.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/info.py (1)
src/datasets/splits.py (1)
  • SplitDict (520-600)
tests/test_builder.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/parquet/parquet.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/builder.py (5)
src/datasets/packaged_modules/generator/generator.py (2)
  • _split_generators (31-32)
  • _generate_examples (34-38)
src/datasets/packaged_modules/webdataset/webdataset.py (1)
  • _generate_examples (110-129)
src/datasets/utils/sharding.py (1)
  • _number_of_shards_in_gen_kwargs (4-18)
src/datasets/packaged_modules/csv/csv.py (1)
  • _generate_tables (177-199)
src/datasets/packaged_modules/json/json.py (1)
  • _generate_tables (113-193)
tests/test_arrow_writer.py (1)
src/datasets/arrow_writer.py (1)
  • write (582-598)
src/datasets/packaged_modules/arrow/arrow.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
tests/test_arrow_dataset.py (1)
src/datasets/arrow_dataset.py (4)
  • Dataset (704-6467)
  • add_column (6078-6129)
  • features (206-207)
  • features (781-785)
src/datasets/packaged_modules/generator/generator.py (2)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/utils/sharding.py (2)
  • _number_of_shards_in_gen_kwargs (4-18)
  • _split_gen_kwargs (48-64)
src/datasets/packaged_modules/xml/xml.py (7)
src/datasets/packaged_modules/arrow/arrow.py (1)
  • _cast_table (56-61)
src/datasets/packaged_modules/csv/csv.py (1)
  • _cast_table (166-175)
src/datasets/packaged_modules/json/json.py (1)
  • _cast_table (88-111)
src/datasets/packaged_modules/pandas/pandas.py (1)
  • _cast_table (55-60)
src/datasets/packaged_modules/parquet/parquet.py (1)
  • _cast_table (144-149)
src/datasets/packaged_modules/sql/sql.py (1)
  • _cast_table (101-110)
src/datasets/packaged_modules/text/text.py (1)
  • _cast_table (53-64)
src/datasets/packaged_modules/json/json.py (8)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/arrow/arrow.py (1)
  • _cast_table (56-61)
src/datasets/packaged_modules/csv/csv.py (1)
  • _cast_table (166-175)
src/datasets/packaged_modules/pandas/pandas.py (1)
  • _cast_table (55-60)
src/datasets/packaged_modules/parquet/parquet.py (1)
  • _cast_table (144-149)
src/datasets/packaged_modules/sql/sql.py (1)
  • _cast_table (101-110)
src/datasets/packaged_modules/text/text.py (1)
  • _cast_table (53-64)
src/datasets/packaged_modules/xml/xml.py (1)
  • _cast_table (48-59)
src/datasets/packaged_modules/cache/cache.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/pandas/pandas.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (3)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/features/features.py (1)
  • to_pylist (824-830)
src/datasets/table.py (1)
  • to_pylist (234-241)
src/datasets/iterable_dataset.py (2)
src/datasets/builder.py (1)
  • Key (1321-1326)
src/datasets/packaged_modules/generator/generator.py (1)
  • Generator (25-38)
src/datasets/packaged_modules/text/text.py (1)
src/datasets/builder.py (1)
  • Key (1321-1326)
🪛 Gitleaks (8.30.0)
tests/test_upstream_hub.py

[high] 358-358: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: test (integration, windows-latest, deps-latest)
  • GitHub Check: test (integration, ubuntu-latest, deps-latest)
  • GitHub Check: test (integration, ubuntu-latest, deps-minimum)
  • GitHub Check: test (integration, windows-latest, deps-minimum)
  • GitHub Check: test (unit, windows-latest, deps-minimum)
  • GitHub Check: test (unit, windows-latest, deps-latest)
  • GitHub Check: test (unit, ubuntu-latest, deps-minimum)
  • GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
  • GitHub Check: test (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314 (unit, windows-latest, deps-latest)
  • GitHub Check: build / build_pr_documentation
🔇 Additional comments (34)
tests/test_upstream_hub.py (2)

309-327: Readiness waits around the first overwrite flow look correct and should deflake the test

Waiting once after the initial push_to_hub and again after the manual upload_file before the second push is exactly where the LFS propagation race would have bitten, so this should meaningfully stabilize the test without adding too much latency. Please keep an eye on CI duration; if this test ever becomes a bottleneck, you can tune max_wait or the polling cadence.


357-396: Mirroring the readiness pattern for ds_name_2 is clear and consistent

Using a separate ds_name_2 context and inserting the same two _wait_for_repo_ready calls around the first push and the manual upload_file should stabilize this second overwrite branch as well, and threading ds_name_2 through all subsequent Hub calls improves readability. The static-analysis “generic API key” warning here appears to be a false positive since _token is coming from fixtures/environment, so this change does not add any hard-coded secrets.

src/datasets/naming.py (1)

70-84: Confirm behavior change for single-shard shard_lengths in filenames_for_dataset_split.

Requiring len(shard_lengths) > 1 means a 1‑element shard_lengths will now use the non‑sharded filename (no -00000-of-00001 suffix). Please confirm this matches how shard files are actually written for single‑shard splits (e.g., by builders that populate SplitInfo.shard_lengths), especially for existing on‑disk caches that Cache._split_generators will read.

src/datasets/packaged_modules/pandas/pandas.py (1)

10-11: Use of Key in Pandas builder is consistent with composite key semantics.

Switching _generate_tables to yield Key(i, 0) (with i as file index) aligns this builder with the new Key(original_shard_id, item_or_batch_id) convention and keeps table generation logic unchanged.

Also applies to: 62-66

src/datasets/packaged_modules/xml/xml.py (1)

61-68: Composite tuple keys for XML tables look reasonable.

Emitting (file_idx, 0) here standardizes XML table keys to a (file, batch) shape and should remain compatible with consumers that treat keys as hashable/str‑able identifiers.

src/datasets/packaged_modules/csv/csv.py (1)

10-11: Key-based keys in CSV generator are consistent with the new scheme.

Yielding Key(file_idx, batch_idx) in _generate_tables cleanly encodes the (file, chunk) identity and aligns this module with other Arrow-based builders using Key.

Also applies to: 177-197

src/datasets/packaged_modules/json/json.py (1)

12-13: Key usage in JSON generator is consistent across all code paths.

All branches of _generate_tables now emit Key(file_idx, batch_idx) (or (file_idx, 0) when the whole file becomes a single table), which keeps key semantics uniform whether parsing via PyArrow JSON or falling back to pandas.

Also applies to: 113-125, 170-193

src/datasets/packaged_modules/cache/cache.py (1)

14-16: Key-based cache table keys look correct; verify split/shard metadata assumptions.

Using Key(file_idx, batch_idx) in _generate_tables aligns cache streaming with the new composite key model, and pairs well with filenames_for_dataset_split driven by split_info.shard_lengths; please just confirm that cached SplitInfo objects only set shard_lengths for truly multi-shard splits so single-shard caches still resolve the correct filenames.

Also applies to: 157-177, 179-190

src/datasets/packaged_modules/arrow/arrow.py (1)

8-8: Arrow batch keys now consistently use Key

Importing Key and yielding Key(file_idx, batch_idx) from _generate_tables aligns Arrow with the new structured key convention without changing control flow or error handling. Looks correct and consistent with other packaged modules.

Also applies to: 77-77

src/datasets/packaged_modules/generator/generator.py (1)

5-7: Sharding-aware generator keys via Key(shard_idx, sample_idx)

Using _number_of_shards_in_gen_kwargs + _split_gen_kwargs and yielding Key(shard_idx, sample_idx) gives stable, structured keys and brings the generator path in line with other Key-based builders.

One thing to double-check: for custom generators that previously received list-valued gen_kwargs (e.g. a list of input files) in a single call, this now invokes self.config.generator(**shard_gen_kwargs) once per shard with sliced lists instead. If any downstream usage relied on the old “single call with full lists” behavior, this is a semantic change and may need either documentation/tests or a migration note.

Functionally this looks sound; the main concern is just confirming intended semantics for list-valued gen_kwargs.

Also applies to: 35-38
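
To illustrate the per-shard dispatch described above, here is a hedged, standalone sketch. The real generator.py is a builder method; the free-function form, the positional _split_gen_kwargs call, and the parameter names are assumptions for illustration.

from datasets.builder import Key  # dataclass introduced in this refactor
from datasets.utils.sharding import _number_of_shards_in_gen_kwargs, _split_gen_kwargs


def generate_examples(generator, gen_kwargs):
    # Split list-valued gen_kwargs into one slice per original shard.
    num_shards = _number_of_shards_in_gen_kwargs(gen_kwargs)
    for shard_idx, shard_gen_kwargs in enumerate(_split_gen_kwargs(gen_kwargs, num_shards)):
        # One generator call per shard; keys record the (shard, sample) origin.
        for sample_idx, sample in enumerate(generator(**shard_gen_kwargs)):
            yield Key(shard_idx, sample_idx), sample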

src/datasets/packaged_modules/parquet/parquet.py (1)

10-10: Parquet batch keys correctly migrated to Key

Switching the yielded key from a string to Key(file_idx, batch_idx) (with the corresponding import) cleanly aligns Parquet with the new Key-based indexing scheme, without altering batching or error-handling behavior.

Also applies to: 182-182

src/datasets/packaged_modules/sql/sql.py (1)

10-10: SQL chunk keys now use Key(0, chunk_idx) consistently

Importing Key and yielding Key(0, chunk_idx) from _generate_tables provides a structured key while preserving the existing chunking and casting behavior. This matches the single-shard SQL use case and is consistent with other modules.

Also applies to: 120-120

tests/packaged_modules/test_spark.py (1)

28-28: LGTM! Test expectations updated for tuple-based keys.

The test changes correctly reflect the migration from string-based keys to tuple-based keys (part_id, row_idx), consistent with the broader Key migration across the codebase.

Also applies to: 78-79

src/datasets/packaged_modules/hdf5/hdf5.py (1)

9-9: LGTM! Consistent Key migration for HDF5 table generation.

The changes correctly:

  1. Import Key from datasets.builder
  2. Enumerate batches to derive batch_idx
  3. Yield Key(file_idx, batch_idx) instead of the previous string-based key

This aligns with the broader migration to structured Key-based identifiers.

Also applies to: 88-94

src/datasets/packaged_modules/webdataset/webdataset.py (1)

12-12: LGTM! Key migration for WebDataset example generation.

The change from string-formatted keys to Key(tar_idx, example_idx) is consistent with the codebase-wide migration to structured Key-based identifiers.

Also applies to: 129-129

src/datasets/packaged_modules/text/text.py (2)

9-9: LGTM! Key migration for line-based sampling.

The import and Key usage for line-based sampling mode is correct.

Also applies to: 86-87


110-113: LGTM! Key usage for document mode.

The Key usage for document mode is correct.

src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (2)

16-16: LGTM! Key migration for metadata-driven example generation.

The changes correctly:

  1. Import Key from datasets.builder
  2. Enumerate metadata files to derive shard_idx
  3. Enumerate samples within each metadata table to derive sample_idx
  4. Yield Key(shard_idx, sample_idx) for each sample

The indexing logic properly resets sample_idx for each shard.

Also applies to: 368-390


398-411: LGTM! Key migration for non-metadata file path.

The non-metadata path correctly mirrors the metadata path with proper enumeration for shard_idx and sample_idx.

tests/test_builder.py (4)

25-26: LGTM! Key import added for test builders.

The Key import is correctly added to support the shard-aware builders in tests.


154-156: LGTM! Shard-aware test builders correctly use Key.

The builders with explicit shards properly use Key(shard_idx, i) to emit composite keys, correctly exercising the shard-tracking functionality.

Also applies to: 167-169, 188-190, 209-211


1037-1038: LGTM! Shard metadata assertions are comprehensive.

The test assertions correctly verify:

  • shard_lengths contains per-shard row counts
  • original_shard_lengths is None when no parallel processing occurs
  • original_shard_lengths tracks original shard boundaries when using num_proc

Also applies to: 1087-1087, 1130-1131, 1179-1179


768-771: LGTM! Test parameter updates.

The added config_name="a" parameter aligns tests with the expected configuration requirements.

Also applies to: 777-777

src/datasets/arrow_dataset.py (2)

6079-6084: add_column signature change is consistent and backward‑compatible

Making new_fingerprint optional while keeping it in the same positional slot, and tightening the type to np.ndarray, aligns this method with other @fingerprint_transform-wrapped transforms and should not break existing callers.


6340-6390: Optional new_fingerprint for add_item matches the rest of the API

Allowing add_item to omit new_fingerprint brings it in line with other dataset transforms that default this parameter, improving ergonomics without affecting existing positional or keyword usages.

src/datasets/splits.py (2)

31-45: Addition of original_shard_lengths looks consistent

Adding original_shard_lengths to SplitInfo and excluding it from YAML via _to_yaml_list keeps wire format stable while allowing richer in‑memory metadata. No issues spotted here.


587-596: YAML stripping of shard metadata and deprecated fields is coherent

_to_yaml_list now drops shard_lengths, original_shard_lengths, and dataset_name from the serialized split dicts. This keeps YAML light while preserving full metadata in JSON and in-memory structures. Looks consistent with the surrounding comments.

src/datasets/iterable_dataset.py (3)

77-89: Key alias and BuilderKey import pattern look good

Using a TYPE_CHECKING-only import for BuilderKey and defining:

Key = Union[int, str, tuple[int, int], "BuilderKey"]

accurately reflects the new key shapes without incurring runtime imports. No behavioral concerns here.


257-288: Iterator-based type hints for ExamplesIterable and sharded variants now match actual usage

generate_examples_fn is now typed as returning an Iterator[tuple[Key, dict]], which aligns with how __iter__ consumes it (for key_example in islice(self.generate_examples_fn(**gen_kwags), ...)). No functional change, but the annotation is now accurate.


342-403: Arrow examples iterator type hints are consistent with generator behavior

ArrowExamplesIterable and ShuffledDataSourcesArrowExamplesIterable now type generate_tables_fn as Callable[..., Iterator[tuple[Key, pa.Table]]], matching their for key, pa_table in self.generate_tables_fn(**gen_kwags) usage. This tightens the public surface without changing behavior.

src/datasets/builder.py (4)

1339-1339: LGTM: Type annotations correctly updated.

The return type annotations for _generate_examples and _generate_tables have been properly updated to reflect the new Key-based interface, improving type safety and documentation.

Also applies to: 1628-1628


1427-1435: LGTM: Original shard tracking properly integrated into multiprocessing.

The unpacking and aggregation of original_shards_per_job and original_shard_lengths_per_job is consistent across both single-process and multi-process code paths in both GeneratorBasedBuilder and ArrowBasedBuilder.

Also applies to: 1705-1713


1388-1398: LGTM: num_proc adjustment respects original shard boundaries.

The logic correctly computes num_original_shards and adjusts num_proc downward to avoid creating more parallel jobs than there are original shards to process. This prevents inefficient over-parallelization.

Also applies to: 1667-1677


27-27: LGTM: Import additions are appropriate.

The added imports (Iterator, Mapping from collections.abc and pyarrow as pa) are necessary for the new type annotations and properly used throughout the file.

Also applies to: 35-35

Comment on lines 1320 to 1327
@dataclass
class Key:
    original_shard_id: int
    item_or_batch_id: int

    def __str__(self):
        return str((self.original_shard_id, self.item_or_batch_id))


🛠️ Refactor suggestion | 🟠 Major

Add documentation for the Key dataclass.

The Key dataclass is introduced without any docstring or explanation of its purpose, field semantics, or usage contract. Given that this changes the signature of core generator methods (_generate_examples and _generate_tables), comprehensive documentation is essential for maintainability and to guide custom builder implementations.

Apply this diff to add documentation:

 @dataclass
 class Key:
+    """Represents the origin of a generated example or table batch.
+    
+    Attributes:
+        original_shard_id: Zero-based index of the source shard from which this data originated.
+        item_or_batch_id: Zero-based index of the item (for examples) or batch (for tables) within that shard.
+    """
     original_shard_id: int
     item_or_batch_id: int
🤖 Prompt for AI Agents
In src/datasets/builder.py around lines 1320 to 1327, the Key dataclass lacks
any docstring explaining its purpose, the semantics of original_shard_id and
item_or_batch_id, and how generators should use it; add a clear docstring to the
Key class that (1) succinctly states its role as the stable identifier passed to
_generate_examples/_generate_tables, (2) explains that original_shard_id
identifies the source shard and item_or_batch_id is either an item index or
batch id (specify whether zero-based and when batching is used), (3) documents
any invariants/expectations (immutability, must be convertible to str, used for
sharding/ordering), and (4) gives a short usage example showing creation and
__str__ behavior so implementers of custom builders know how to construct and
interpret Keys.

Comment on lines 1517 to 1522
if total_original_shards > 1:
    split_generator.split_info.original_shard_lengths = [
        original_shard_length
        for original_shard_lengths in original_shard_lengths_per_job
        for original_shard_length in original_shard_lengths
    ]

⚠️ Potential issue | 🟡 Minor

Inconsistent population of original_shard_lengths metadata.

The split_info.original_shard_lengths is only populated when total_original_shards > 1, but the existing split_info.shard_lengths is always populated (even for single shard, see lines 1506-1508). This inconsistency could complicate downstream code that relies on this metadata.

Consider populating it consistently:

-        if total_original_shards > 1:
-            split_generator.split_info.original_shard_lengths = [
-                original_shard_length
-                for original_shard_lengths in original_shard_lengths_per_job
-                for original_shard_length in original_shard_lengths
-            ]
+        split_generator.split_info.original_shard_lengths = [
+            original_shard_length
+            for original_shard_lengths in original_shard_lengths_per_job
+            for original_shard_length in original_shard_lengths
+        ]

Or document why the conditional is necessary if there's a specific reason (e.g., storage optimization).

🤖 Prompt for AI Agents
In src/datasets/builder.py around lines 1517 to 1522,
split_info.original_shard_lengths is only set when total_original_shards > 1
which is inconsistent with split_info.shard_lengths (always populated); make
original_shard_lengths populated consistently: remove the conditional and always
assign the flattened list (or, if you prefer to preserve single-shard behavior,
add an else branch that sets original_shard_lengths to the single-shard
length(s) equivalent to shard_lengths), so downstream code can rely on this
metadata being present.

        check_duplicate_keys: bool,
        job_id: int,
-    ) -> Iterable[tuple[int, bool, Union[int, tuple]]]:
+    ) -> Iterator[tuple[int, bool, tuple[int, int, Features, int, int, int]]]:

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file structure and locate the method
wc -l src/datasets/builder.py

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 104


🏁 Script executed:

# View the code around line 1535 to see the return type annotation
sed -n '1530,1540p' src/datasets/builder.py

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 537


🏁 Script executed:

# View the code around line 1807 to see the second occurrence
sed -n '1802,1812p' src/datasets/builder.py

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 683


🏁 Script executed:

# Search for _prepare_split_single definitions in the codebase
rg -n 'def _prepare_split_single' --type=py

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 275


🏁 Script executed:

# Search for any overrides or implementations outside builder.py
rg -n '_prepare_split_single' --type=py -g '!src/datasets/builder.py'

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 300


🏁 Script executed:

# View the spark.py override implementation
sed -n '195,220p' src/datasets/packaged_modules/spark/spark.py

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 1267


🏁 Script executed:

# Check the full method signature and return pattern
sed -n '199,250p' src/datasets/packaged_modules/spark/spark.py

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 2582


🏁 Script executed:

# View the complete spark.py implementation to see what it yields
sed -n '199,300p' src/datasets/packaged_modules/spark/spark.py

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 4736


🏁 Script executed:

# Also check if there are any other subclasses that override this method
rg -n 'class.*Builder' --type=py src/datasets/builder.py | head -20

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 478


🏁 Script executed:

# Search for any other potential overrides in the codebase
find . -name '*.py' -type f -exec grep -l 'def _prepare_split_single' {} \; 2>/dev/null | grep -v __pycache__

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 148


🏁 Script executed:

# Check git history for changes to _prepare_split_single return type
git log --oneline -20 -- src/datasets/builder.py

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 137


🏁 Script executed:

# Look at the actual return statement to understand what's being yielded
rg -B5 -A15 'yield.*shard_id.*bool' src/datasets/builder.py | head -50

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 58


🏁 Script executed:

# Check around line 1535 to see the actual yield statements
sed -n '1550,1600p' src/datasets/builder.py

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 2936


🏁 Script executed:

# Continue reading to find the final yield
sed -n '1590,1610p' src/datasets/builder.py

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 963


🏁 Script executed:

# Search for the complete final yield statement in GeneratorBasedBuilder
sed -n '1527,1650p' src/datasets/builder.py | grep -A10 "yield ("

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 382


🏁 Script executed:

# Check ArrowBasedBuilder's final yield
sed -n '1805,1880p' src/datasets/builder.py

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 4179


🏁 Script executed:

# Look for the yield statement in ArrowBasedBuilder
sed -n '1805,1900p' src/datasets/builder.py | grep -B5 -A20 "yield ("

Repository: The-Obstacle-Is-The-Way/datasets

Length of output: 813


Fix type annotations for _prepare_split_single return signature to match implementation.

The return type annotations at lines 1535 and 1807 claim a 6-element inner tuple but the implementations return 7 elements: (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths, num_original_shards, original_shard_lengths). Update both signatures to Iterator[tuple[int, bool, tuple[int, int, Features, int, list, int, list]]] or use a more descriptive type alias.

This is a breaking change for custom builders overriding this method, including src/datasets/packaged_modules/spark/spark.py, which currently has an incompatible signature and needs updating.

Comment on lines 1557 to 1558
                        if isinstance(key, Key):  # old custom builders may not use Key
                            original_shard_id = key.original_shard_id

⚠️ Potential issue | 🟡 Minor

Clarify fallback behavior when key is not a Key instance.

The backward compatibility check allows old custom builders that don't yield Key objects, but the fallback behavior is implicit: original_shard_id retains its previous value (or 0 initially). This could lead to incorrect original shard attribution when mixing old-style keys with the new shard tracking.

Consider one of these approaches:

  1. Log a deprecation warning to alert users:
                 for key, record in generator:
                     if isinstance(key, Key):  # old custom builders may not use Key
                         original_shard_id = key.original_shard_id
+                    else:
+                        logger.warning(
+                            "Yielding non-Key objects from _generate_examples is deprecated. "
+                            "Please update to yield Key(original_shard_id, item_or_batch_id)."
+                        )
  2. Reset to 0 explicitly to make the behavior clear:
                 for key, record in generator:
                     if isinstance(key, Key):  # old custom builders may not use Key
                         original_shard_id = key.original_shard_id
+                    else:
+                        original_shard_id = 0  # fallback for legacy builders
🤖 Prompt for AI Agents
In src/datasets/builder.py around lines 1557-1558, the fallback when key is not
an instance of Key is implicit and can misattribute original_shard_id; change
the branch so that when not isinstance(key, Key) you explicitly set
original_shard_id = 0 and emit a clear deprecation warning (use the module
logger or warnings.warn) indicating old-style keys are deprecated and that
original_shard_id will default to 0 for compatibility; keep the existing
behavior for Key instances unchanged.

Comment on lines 1575 to 1578
                    if len(original_shard_lengths) == original_shard_id:
                        original_shard_lengths.append(1)
                    else:
                        original_shard_lengths[original_shard_id] += 1

⚠️ Potential issue | 🟠 Major

Validate that original_shard_id values are sequential.

The logic assumes original_shard_id values are contiguous integers starting from 0. If a generator yields non-sequential shard IDs (e.g., 0, 2, skipping 1), the condition len(original_shard_lengths) == original_shard_id will fail, causing an IndexError on the else branch.

Add validation to detect non-sequential shard IDs:

                     writer.write(example)
-                    if len(original_shard_lengths) == original_shard_id:
+                    if original_shard_id < 0:
+                        raise ValueError(f"original_shard_id must be non-negative, got {original_shard_id}")
+                    elif len(original_shard_lengths) == original_shard_id:
                         original_shard_lengths.append(1)
+                    elif len(original_shard_lengths) > original_shard_id:
+                        original_shard_lengths[original_shard_id] += 1
                     else:
-                        original_shard_lengths[original_shard_id] += 1
+                        raise ValueError(
+                            f"Non-sequential original_shard_id detected: expected {len(original_shard_lengths)}, got {original_shard_id}. "
+                            "Ensure your generator yields Keys with sequential original_shard_id values starting from 0."
+                        )

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/datasets/builder.py around lines 1575-1578, the code assumes
original_shard_id values are contiguous which can cause an IndexError if a
generator yields a non-sequential id; validate that original_shard_id is
non-negative and either equal to len(original_shard_lengths) (append a new
count) or less than len(original_shard_lengths) (increment the slot); if
original_shard_id is greater than len(original_shard_lengths) (i.e., a
gap/skip), raise a clear ValueError (or log and abort) indicating a
non-sequential shard id was received so the upstream generator can be fixed.
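
As a standalone reference, a sketch of that validated counter update outside the builder loop; the function name and bare-list bookkeeping are illustrative, not upstream API:

```python
def record_example_in_shard(original_shard_lengths, original_shard_id):
    """Append a new slot for the next expected shard id, increment an existing
    slot, and reject gaps so a skipped id fails loudly instead of raising an
    IndexError later."""
    if original_shard_id < 0:
        raise ValueError(f"original_shard_id must be non-negative, got {original_shard_id}")
    if original_shard_id == len(original_shard_lengths):
        original_shard_lengths.append(1)
    elif original_shard_id < len(original_shard_lengths):
        original_shard_lengths[original_shard_id] += 1
    else:
        raise ValueError(
            f"Non-sequential original_shard_id: expected at most "
            f"{len(original_shard_lengths)}, got {original_shard_id}"
        )


lengths = []
for shard_id in (0, 0, 1):
    record_example_in_shard(lengths, shard_id)
assert lengths == [2, 1]
```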

@The-Obstacle-Is-The-Way The-Obstacle-Is-The-Way force-pushed the fix/flaky-lfs-test branch 2 times, most recently from 8e00a44 to 3ea8de7 on December 9, 2025 at 15:37
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/test_upstream_hub.py (1)

270-296: Consider catching broader exceptions for more robust test stability.

The exception handling currently only catches HfHubHTTPError. Transient network errors (e.g., requests.RequestException, ConnectionError, timeouts) would cause immediate test failure rather than retry. Since the goal is to stabilize flaky tests, consider catching a broader set of exceptions.

Apply this diff to handle transient errors more gracefully:

-        from huggingface_hub.errors import HfHubHTTPError
+        from huggingface_hub.errors import HfHubHTTPError
+        import requests

         start_time = time.monotonic()
         while (time.monotonic() - start_time) < max_wait:
             try:
                 # Verify we can list files (repo is consistent)
                 self._api.list_repo_files(repo_id, repo_type="dataset", token=self._token)
                 # Small delay to ensure LFS objects are fully propagated
                 time.sleep(1)
                 return
-            except HfHubHTTPError:
+            except (HfHubHTTPError, requests.RequestException, ConnectionError, TimeoutError) as e:
+                # Retry on transient errors
                 time.sleep(1)
         raise TimeoutError(f"Repository {repo_id} not ready after {max_wait}s")

Optionally, enhance the timeout error message for easier debugging:

-        raise TimeoutError(f"Repository {repo_id} not ready after {max_wait}s")
+        raise TimeoutError(f"Repository {repo_id} not ready after {max_wait}s. Check Hub status and LFS propagation.")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60963a0 and 3ea8de7.

📒 Files selected for processing (1)
  • tests/test_upstream_hub.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_upstream_hub.py (4)
tests/fixtures/hub.py (1)
  • temporary_repo (87-99)
src/datasets/arrow_dataset.py (1)
  • push_to_hub (5658-6074)
src/datasets/iterable_dataset.py (1)
  • push_to_hub (4092-4508)
src/datasets/dataset_dict.py (2)
  • push_to_hub (1616-1983)
  • push_to_hub (2495-2852)
🪛 Gitleaks (8.30.0)
tests/test_upstream_hub.py

[high] 358-358: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: test (integration, windows-latest, deps-minimum)
  • GitHub Check: test (integration, windows-latest, deps-latest)
  • GitHub Check: test (integration, ubuntu-latest, deps-latest)
  • GitHub Check: test (unit, windows-latest, deps-latest)
  • GitHub Check: test (unit, windows-latest, deps-minimum)
  • GitHub Check: test (integration, ubuntu-latest, deps-minimum)
  • GitHub Check: test (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test (unit, ubuntu-latest, deps-minimum)
  • GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
  • GitHub Check: test_py314 (unit, windows-latest, deps-latest)
  • GitHub Check: build / build_pr_documentation
🔇 Additional comments (2)
tests/test_upstream_hub.py (2)

309-327: LGTM - proper synchronization between Hub operations.

The integration of _wait_for_repo_ready after push_to_hub (line 307) and upload_file (lines 318-324) correctly addresses the race condition described in the PR objectives, ensuring the Hub has fully processed LFS objects before the next operation.
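
For context, a rough sketch of the call pattern that comment refers to; local_ds and ds_name are placeholders, while self._api, self._token, and _wait_for_repo_ready come from the test class:

```python
# Sketch of the synchronized sequence (not the exact test body):
local_ds.push_to_hub(ds_name, token=self._token)
self._wait_for_repo_ready(ds_name)  # wait for LFS objects to propagate

self._api.upload_file(
    path_or_fileobj=b"# Dataset card",  # placeholder content
    path_in_repo="README.md",           # placeholder file
    repo_id=ds_name,
    repo_type="dataset",
    token=self._token,
)
self._wait_for_repo_ready(ds_name)

local_ds.push_to_hub(ds_name, token=self._token)  # overwrite without LFS pointer errors
```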


357-395: LGTM - improved test isolation with separate repositories.

The second test scenario now uses a dedicated repository (ds_name_2) instead of reusing the first repository. This improves test isolation and ensures clean state for each scenario. The _wait_for_repo_ready calls at lines 361 and 378 correctly follow the same synchronization pattern, ensuring Hub consistency between operations.

ada-ggf25 and others added 8 commits December 16, 2025 15:20
…ace#7877) (huggingface#7891)

* fix(fingerprint): improve TMPDIR environment variable handling in _TempCacheDir

Enhanced the _TempCacheDir.__init__ method to properly respect and handle
the TMPDIR environment variable when creating temporary cache directories.

Changes:
- Add TMPDIR environment variable detection and validation
- Normalise paths to handle path resolution issues
- Auto-create TMPDIR directory if it doesn't exist to prevent silent
  fallback to default temporary directory
- Validate that TMPDIR is actually a directory before use
- Explicitly pass directory to mkdtemp to ensure TMPDIR is respected even
  if tempfile.gettempdir() was already called and cached
- Add appropriate logging for directory creation and fallback scenarios

This ensures that when TMPDIR is set, the temporary cache files are
created in the specified directory rather than silently falling back to
the system default temporary directory.
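
A condensed sketch of the approach this commit message describes (later commits in this push replace the silent fallback paths with hard errors); the function name is illustrative rather than the real _TempCacheDir method:

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def make_temp_cache_dir(prefix="hf_datasets-"):
    """Create a temporary cache directory that respects TMPDIR."""
    tmpdir = os.environ.get("TMPDIR")
    if tmpdir:
        tmpdir = os.path.normpath(os.path.expanduser(tmpdir))
        if not os.path.exists(tmpdir):
            try:
                os.makedirs(tmpdir, exist_ok=True)
                logger.info(f"Created TMPDIR directory {tmpdir}")
            except OSError:
                logger.warning(f"Could not create TMPDIR {tmpdir}, falling back to the default temp dir")
                return tempfile.mkdtemp(prefix=prefix)
        if os.path.isdir(tmpdir):
            # Pass dir explicitly so a cached tempfile.gettempdir() value cannot
            # override the user's TMPDIR.
            return tempfile.mkdtemp(prefix=prefix, dir=tmpdir)
        logger.warning(f"TMPDIR {tmpdir} is not a directory, falling back to the default temp dir")
    return tempfile.mkdtemp(prefix=prefix)
```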

* test(fingerprint): add comprehensive tests for TMPDIR handling in _TempCacheDir

Add test coverage for the improved TMPDIR environment variable handling
in the _TempCacheDir class. These tests verify the various scenarios
for TMPDIR usage and error handling.

Changes:
- Refactor test_fingerprint_in_multiprocessing to use Pool.map for
  cleaner test implementation
- Add test_temp_cache_dir_with_tmpdir_nonexistent to verify TMPDIR
  auto-creation when directory doesn't exist
- Add test_temp_cache_dir_with_tmpdir_existing to verify correct
  behaviour when TMPDIR exists and is valid
- Add test_temp_cache_dir_without_tmpdir to verify fallback to default
  temporary directory when TMPDIR is not set
- Add test_temp_cache_dir_tmpdir_creation_failure to verify graceful
  error handling and fallback when TMPDIR creation fails

These tests ensure that the TMPDIR improvements work correctly across
all scenarios and edge cases, including proper logging and fallback
behaviour.
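
A sketch of how those scenarios could be exercised with pytest's tmp_path and monkeypatch fixtures, written against the illustrative make_temp_cache_dir helper above rather than the real _TempCacheDir:

```python
import os


def test_temp_cache_dir_with_tmpdir_existing(tmp_path, monkeypatch):
    monkeypatch.setenv("TMPDIR", str(tmp_path))
    created = make_temp_cache_dir()
    # The cache dir must live inside the directory TMPDIR points to.
    assert os.path.commonpath([created, str(tmp_path)]) == str(tmp_path)


def test_temp_cache_dir_with_tmpdir_nonexistent(tmp_path, monkeypatch):
    target = tmp_path / "does-not-exist-yet"
    monkeypatch.setenv("TMPDIR", str(target))
    created = make_temp_cache_dir()
    # TMPDIR is created on demand instead of silently falling back.
    assert target.is_dir()
    assert created.startswith(str(target))


def test_temp_cache_dir_without_tmpdir(monkeypatch):
    monkeypatch.delenv("TMPDIR", raising=False)
    created = make_temp_cache_dir()
    assert os.path.isdir(created)
```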

* test(fingerprint): tighten TMPDIR error-path tests for _TempCacheDir

Refine TMPDIR-related failure tests for _TempCacheDir to assert explicit
error conditions instead of fallback behaviour.

Changes:
- Update test_temp_cache_dir_tmpdir_creation_failure to use _TempCacheDir
  directly and assert that an OSError is raised with a clear TMPDIR
  context when directory creation fails
- Introduce test_temp_cache_dir_tmpdir_not_directory to verify that
  pointing TMPDIR at a non-directory raises an OSError with an
  informative error message

These tests better match the intended contract of _TempCacheDir by
ensuring invalid TMPDIR configurations fail loudly with descriptive
messages rather than silently falling back.

* fix(fingerprint): make TMPDIR misconfiguration in _TempCacheDir fail loudly

Tighten TMPDIR handling in _TempCacheDir so that invalid configurations
raise clear errors instead of silently falling back to the default
temporary directory.

Changes:
- When TMPDIR points to a non-existent directory, raise an OSError with
  explicit guidance to create it manually or unset TMPDIR
- When TMPDIR points to a non-directory path, raise an OSError with
  guidance to point TMPDIR to a writable directory or unset it
- Remove previous warning-and-fallback behaviour to avoid masking
  configuration issues

This ensures that TMPDIR misconfigurations are surfaced early and
clearly, aligning runtime behaviour with the stricter expectations
codified in the new tests.
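
A minimal sketch of the stricter contract this commit and the tightened tests describe; validate_tmpdir is an illustrative name, and the real change lives inside _TempCacheDir.__init__:

```python
import os

import pytest


def validate_tmpdir(tmpdir: str) -> str:
    """Fail loudly on TMPDIR misconfiguration instead of falling back."""
    tmpdir = os.path.normpath(os.path.expanduser(tmpdir))
    if not os.path.exists(tmpdir):
        raise OSError(
            f"TMPDIR is set to {tmpdir!r} but that directory does not exist. "
            "Create it manually or unset TMPDIR."
        )
    if not os.path.isdir(tmpdir):
        raise OSError(
            f"TMPDIR is set to {tmpdir!r}, which is not a directory. "
            "Point TMPDIR to a writable directory or unset it."
        )
    return tmpdir


def test_tmpdir_not_directory_raises(tmp_path):
    not_a_dir = tmp_path / "plain-file"
    not_a_dir.write_text("not a directory")
    with pytest.raises(OSError, match="not a directory"):
        validate_tmpdir(str(not_a_dir))
```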

* Update src/datasets/fingerprint.py

* Update fingerprint.py

* Fix formatting of TMPDIR retrieval line

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* encode nifti correctly when uploading lazily

* update docstring for nifti functionality
)

* fix(nifti): enable lazy loading for Nifti1ImageWrapper

- Change dataobj initialization from get_fdata() to .dataobj
- Preserves nibabel's ArrayProxy for lazy loading
- Prevents OOM errors on large 4D fMRI files
- Defers I/O errors to usage time for better error handling

Resolves #1
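
A small illustration of the difference those bullets describe, using nibabel directly; the file path is a placeholder:

```python
import nibabel as nib

img = nib.load("sub-01_task-rest_bold.nii.gz")  # placeholder path

# Eager: get_fdata() materialises the whole volume as a float64 array,
# which can exhaust memory on large 4D fMRI files.
eager = img.get_fdata()

# Lazy: .dataobj keeps nibabel's ArrayProxy, so voxels are only read from
# disk when sliced, and I/O errors surface at usage time.
proxy = img.dataobj
first_volume = proxy[..., 0]  # reads just this volume
```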

* chore: trigger CI
…gingface#7906)

don't save original_shard_lengths by default for backward compat
`test_push_dataset_dict_to_hub_overwrite_files` intermittently fails with:
```
BadRequestError: LFS pointer pointed to a file that does not exist
```

Root cause: Two race conditions in the test design:
1. Rapid successive `push_to_hub` calls don't wait for Hub's LFS object
   propagation between pushes
2. Second test scenario reused the same repo name, creating a race between
   repo deletion and recreation

Fix:
- Add `_wait_for_repo_ready()` helper that ensures Hub repository is in a
  consistent state before subsequent operations
- Use unique repo name (`ds_name_2`) for second scenario to eliminate the
  delete/create race entirely

Tested: All 4 integration test variants now pass consistently (ubuntu/windows,
deps-latest/deps-minimum).