fix(tests): stabilize flaky Hub LFS integration test #4
The-Obstacle-Is-The-Way wants to merge 8 commits into main from
Conversation
Walkthrough: The change adds a `_wait_for_repo_ready` polling helper to `tests/test_upstream_hub.py` and inserts readiness waits between Hub push and upload operations in `test_push_dataset_dict_to_hub_overwrite_files`.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/test_upstream_hub.py (3)
270-281: Polling helper for Hub consistency looks solid; only minor nits possible
The retry loop around `list_repo_files` with a bounded wait and explicit `TimeoutError` is a good fit to make the test deterministic and fail loudly if the Hub is slow or inconsistent. Catching `Exception` is acceptable in this test-only helper, and the extra `sleep(1)` after a successful probe gives LFS a bit more buffer.
If you ever need finer control, a small optional refinement would be:
- Track elapsed time via `time.monotonic()` so `max_wait` is strictly "seconds" rather than "iterations".
- Optionally narrow the `except` to hub-related errors to avoid masking unexpected issues, or at least log them.

Not blockers; the current implementation is fine for the intended use.
294-313: Well-placed waits between first push, manual upload, and second push
Adding `_wait_for_repo_ready(ds_name)` after the initial `push_to_hub` and again after the `upload_file` call is a reasonable way to avoid the LFS propagation race that was causing the flaky `BadRequestError`. The helper is scoped to this test class and only used here, so the additional latency is limited to this one integration test.
If further flakes appear, a non-blocking follow-up could be to:
- Also use `_wait_for_repo_ready` after the second `push_to_hub` before asserting on `list_repo_files`, to guard against any remaining backend lag.

But given the current problem description, the added waits should already address the primary race.
345-364: Symmetric guarding of the second overwrite scenario looks appropriate
Mirroring the same wait pattern in the second half of `test_push_dataset_dict_to_hub_overwrite_files` (push → wait → upload → wait → push) keeps both branches consistent and should prevent the same LFS race when the second run changes the shard count in the opposite direction.
No issues from a correctness standpoint. As with the first block, you might consider adding a final readiness wait before the concluding `load_dataset` if you ever see residual flakiness, but it's not strictly necessary right now.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test_upstream_hub.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: test (unit, windows-latest, deps-minimum)
- GitHub Check: test (integration, windows-latest, deps-latest)
- GitHub Check: test (unit, ubuntu-latest, deps-minimum)
- GitHub Check: test (integration, windows-latest, deps-minimum)
- GitHub Check: test (unit, windows-latest, deps-latest)
- GitHub Check: test (integration, ubuntu-latest, deps-minimum)
- GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
- GitHub Check: test (unit, ubuntu-latest, deps-latest)
- GitHub Check: test (integration, ubuntu-latest, deps-latest)
- GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314 (unit, windows-latest, deps-latest)
- GitHub Check: build / build_pr_documentation
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/test_upstream_hub.py (1)
270-296: Reasonable approach to stabilize flaky Hub tests.
The helper addresses the race condition with a polling + sleep strategy. A few observations:
- Exception specificity: Catching all `HfHubHTTPError` is broad. Consider logging transient errors for debugging intermittent CI failures.
- Import placement: Moving the import to the top of the file (alongside the other `huggingface_hub` imports on line 15) would be more idiomatic.
- Delay tuning: The 1-second post-success sleep is a heuristic. If flakiness persists, this may need adjustment or a more robust verification (e.g., retrying the subsequent operation with backoff rather than pre-waiting; a sketch of that alternative appears at the end of this comment).
Consider moving the import to the file header:
-from huggingface_hub import DatasetCard, HfApi
+from huggingface_hub import DatasetCard, HfApi
+from huggingface_hub.errors import HfHubHTTPError
Then remove line 284 from within the method.
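For the backoff alternative mentioned in the delay-tuning observation, here is a minimal sketch; it assumes a generic callable is passed in, and the names `retry_with_backoff`, `operation`, and `max_attempts` are illustrative rather than anything in the test suite:

```python
import time
from typing import Callable, TypeVar

from huggingface_hub.errors import HfHubHTTPError

T = TypeVar("T")


def retry_with_backoff(operation: Callable[[], T], max_attempts: int = 5) -> T:
    """Retry a flaky Hub call with exponential backoff instead of pre-waiting."""
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            return operation()
        except HfHubHTTPError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(delay)
            delay *= 2  # exponential backoff between retries
    raise RuntimeError("unreachable")  # satisfies type checkers; the loop always returns or raises


# Hypothetical usage: retry_with_backoff(lambda: ds.push_to_hub(ds_name, token=token))
```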
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test_upstream_hub.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: test (integration, ubuntu-latest, deps-minimum)
- GitHub Check: test (unit, windows-latest, deps-latest)
- GitHub Check: test (integration, windows-latest, deps-minimum)
- GitHub Check: test (integration, windows-latest, deps-latest)
- GitHub Check: test (unit, windows-latest, deps-minimum)
- GitHub Check: test (integration, ubuntu-latest, deps-latest)
- GitHub Check: test (unit, ubuntu-latest, deps-minimum)
- GitHub Check: test (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
- GitHub Check: test_py314 (unit, windows-latest, deps-latest)
- GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
- GitHub Check: build / build_pr_documentation
🔇 Additional comments (4)
tests/test_upstream_hub.py (4)
309-311: LGTM! Appropriate placement after the initial push to ensure Hub consistency before the subsequent upload operation.
326-328: LGTM! Correctly placed between the upload and the subsequent push operation.
360-362: LGTM! Mirrors the synchronization pattern from the first test block.
377-379: LGTM! Completes the synchronization pattern for the second test scenario.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/test_upstream_hub.py (1)
270-296: Consider broadening exception handling to cover more transient network errors.
The helper only catches `HfHubHTTPError`, but transient connection issues (e.g., `ConnectionError`, `requests.exceptions.Timeout`) would propagate and fail the test. Consider catching a broader set of transient errors:
-        except HfHubHTTPError:
+        except (HfHubHTTPError, ConnectionError) as e:
+            # Retry on transient errors
             time.sleep(1)
Also, moving the import to the top of the file with the other `huggingface_hub` imports would be more idiomatic:
-from huggingface_hub import DatasetCard, HfApi
+from huggingface_hub import DatasetCard, HfApi
+from huggingface_hub.errors import HfHubHTTPError
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test_upstream_hub.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_upstream_hub.py (1)
tests/fixtures/hub.py (1)
temporary_repo(87-99)
🪛 Gitleaks (8.29.1)
tests/test_upstream_hub.py
[high] 358-358: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: test (unit, windows-latest, deps-latest)
- GitHub Check: test (integration, ubuntu-latest, deps-minimum)
- GitHub Check: test (integration, windows-latest, deps-latest)
- GitHub Check: test (integration, ubuntu-latest, deps-latest)
- GitHub Check: test (unit, windows-latest, deps-minimum)
- GitHub Check: test (integration, windows-latest, deps-minimum)
- GitHub Check: test (unit, ubuntu-latest, deps-minimum)
- GitHub Check: test (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
- GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314 (unit, windows-latest, deps-latest)
- GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
- GitHub Check: build / build_pr_documentation
🔇 Additional comments (4)
tests/test_upstream_hub.py (4)
306-311: LGTM! Correctly placed synchronization point after the first push ensures the Hub has processed LFS objects before the subsequent file upload.
326-329: LGTM! Synchronization before the second push ensures the manually uploaded file is fully propagated.
357-361: LGTM - variable rename and synchronization points are correct. The rename to `ds_name_2` for the second test block improves clarity. The static analysis warning about a "Generic API Key" on line 358 is a false positive: `self._token` references the test fixture constant `CI_HUB_USER_TOKEN`, not a hardcoded secret.
369-395: LGTM! All API calls in the second test block consistently reference `ds_name_2`, and the wait synchronization is correctly placed before the second push operation.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
src/datasets/packaged_modules/spark/spark.py (1)
264-268: Add missing `shutil` import for `_prepare_split_single` move logic.
`shutil.move(file, dest)` is used here but `shutil` isn't imported at the top of the file, which will lead to a `NameError` when this branch is executed; please add the import (even though this predates the current diff).
-import os
+import os
+import shutil
src/datasets/packaged_modules/text/text.py (1)
104-109: Inconsistent key type: line 109 uses a plain tuple instead of `Key`.
The final batch yield in paragraph mode at line 109 uses a plain tuple `(file_idx, batch_idx)` while line 104 uses `Key(file_idx, batch_idx)`. This inconsistency could cause issues downstream if consumers expect all keys to be `Key` instances.
 if batch:
     pa_table = pa.Table.from_arrays([pa.array([batch])], names=pa_table_names)
-    yield (file_idx, batch_idx), self._cast_table(pa_table)
+    yield Key(file_idx, batch_idx), self._cast_table(pa_table)
520-573: `SplitDict.from_split_dict` breaks when passed a list of `SplitInfo` objects
In `from_split_dict`, dataset name inference assumes dict elements:
    if dataset_name is None:
        dataset_name = split_infos[0].get("dataset_name") if split_infos else None
If `split_infos` is a `list[SplitInfo]`, `split_infos[0]` has no `.get`, raising `AttributeError`. This is likely in scope since the method advertises accepting both lists and dicts, and `to_split_dict()` returns `list[SplitInfo]`.
Consider making the inference robust to both dicts and `SplitInfo`:
-        if dataset_name is None:
-            dataset_name = split_infos[0].get("dataset_name") if split_infos else None
+        if dataset_name is None and split_infos:
+            first = split_infos[0]
+            if isinstance(first, dict):
+                dataset_name = first.get("dataset_name")
+            elif isinstance(first, SplitInfo):
+                dataset_name = first.dataset_name
src/datasets/iterable_dataset.py (1)
409-425: `ShuffledDataSourcesArrowExamplesIterable.shift_rngs` uses a non-existent attribute
In `ShuffledDataSourcesArrowExamplesIterable.shift_rngs`:
    return ShuffledDataSourcesArrowExamplesIterable(
        self.generate_examples_fn,
        self.kwargs,
        np.random.default_rng(seed=new_seed),
    )
`ShuffledDataSourcesArrowExamplesIterable` (and its parent `ArrowExamplesIterable`) define `self.generate_tables_fn`, not `self.generate_examples_fn`. This will raise `AttributeError` when `shift_rngs` is called (e.g. via `shift_ex_examples_rngs` on an arrow-backed iterable).
You likely want to pass `self.generate_tables_fn`:
 def shift_rngs(self, value: int) -> "_BaseExamplesIterable":
     new_seed = self.generator.bit_generator.state["state"]["state"] + value
     return ShuffledDataSourcesArrowExamplesIterable(
-        self.generate_examples_fn,
+        self.generate_tables_fn,
         self.kwargs,
         np.random.default_rng(seed=new_seed),
     )
src/datasets/builder.py (1)
1-1893: CRITICAL: PR description does not match the actual code changes.
The PR objectives claim this fixes a flaky Hub LFS integration test by adding a `_wait_for_repo_ready` helper to address race conditions in `test_push_dataset_dict_to_hub_overwrite_files`. However, the actual changes in this file are:
- Introduction of a new `Key` dataclass (lines 1320-1327)
- Major refactoring of generator interfaces (`_generate_examples`, `_generate_tables`)
- Addition of original shard tracking metadata throughout the dataset generation pipeline
- No `_wait_for_repo_ready` helper or test-related changes visible

This is a significant architectural refactor affecting core dataset generation logic, not a targeted test fix. This inconsistency suggests either:
- The wrong PR description was used
- The wrong code changes were included
- Multiple unrelated changes were combined
This must be clarified before proceeding with the review.
🧹 Nitpick comments (8)
tests/test_upstream_hub.py (1)
270-296: Repo readiness helper is well-structured; consider making polling behavior slightly more flexible
The bounded `time.monotonic` loop plus polling `list_repo_files` is a good, minimal way to synchronize with Hub/LFS propagation. As an optional refinement, you could (a) accept a `poll_interval` argument instead of hardcoding `1` and (b) optionally treat clearly-fatal `HfHubHTTPError` cases (e.g. auth/permission failures) as immediate failures rather than waiting the full `max_wait`, while still retrying on transient 5xx/404 conditions; a small sketch of that error triage follows.
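A sketch of that triage, assuming the helper grows a `poll_interval` parameter (a hypothetical name) and that fatal cases are identified by HTTP status code; this is illustrative, not the helper's actual code:

```python
import time

from huggingface_hub.errors import HfHubHTTPError


def _handle_probe_error(err: HfHubHTTPError, poll_interval: float = 1.0) -> None:
    """Fail fast on clearly-fatal Hub errors; otherwise sleep so the caller's polling loop retries."""
    status = getattr(err.response, "status_code", None)
    if status in (401, 403):
        raise err  # auth/permission problems will not resolve by waiting
    time.sleep(poll_interval)  # 404 / 5xx: likely propagation lag, keep polling
```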
121-123: Consider updating the `splits` docstring to reflect the `SplitDict` type.
The field type is now `Optional[SplitDict]` and `__post_init__` already normalizes plain dicts, so the comment describing `splits` as `dict` is slightly misleading; consider mentioning `SplitDict` (a dict-like mapping) there for clarity.
Also applies to: 152-152
src/datasets/packaged_modules/spark/spark.py (1)
49-75: Consider the downstream impact of changing Spark example keys from `str` to `(part_id, row_id)` tuples.
`_generate_iterable_examples` now yields tuple keys, which may break callers that relied on the previous string format (e.g., persisting or pattern-matching example IDs). If this is an intentional API change, it's worth documenting and verifying that tests and any public examples using `Dataset.from_spark` don't assume string keys.
tests/test_arrow_dataset.py (1)
4786-4795: New `test_add_column` correctly covers basic API behavior
The test verifies that `add_column` creates the new feature and preserves existing rows as expected; this is a good, minimal check on top of the more exhaustive `test_dataset_add_column` above.
Minor optional cleanup: you can drop the inner `from datasets import Dataset` and reuse the module-level `Dataset` import to keep this file consistent:
-def test_add_column():
-    from datasets import Dataset
-
-    ds = Dataset.from_dict({"a": [1, 2]})
+def test_add_column():
+    ds = Dataset.from_dict({"a": [1, 2]})
71-73: Consider using `Key` for consistency across all test builders.
The simple dummy builders (`DummyGeneratorBasedBuilder`, `DummyArrowBasedBuilder`, etc.) yield plain tuples `(0, i)` while the shard-aware builders use `Key(shard_idx, i)`. While this may work since both are tuple-like, using `Key(0, i)` consistently would better mirror the production code patterns and avoid potential type mismatches.
Example for `DummyGeneratorBasedBuilder`:
 def _generate_examples(self):
     for i in range(100):
-        yield (0, i), {"text": "foo"}
+        yield Key(0, i), {"text": "foo"}
Also applies to: 83-85, 95-97, 116-118
tests/test_arrow_writer.py (1)
142-143: `ArrowWriter.write` usage now matches the updated signature
The test correctly adapts to the new `ArrowWriter.write(example, writer_batch_size=None)` signature by passing only the example dict on lines 142-143. The behavior checked by `_check_output` is preserved.
If keys are no longer part of the public API here, consider (optionally) renaming `test_write_with_keys` and/or updating its docstring/comment to avoid confusion for future readers.
461-467: Update the `ArrowWriter.write` docstring and clean up unused key-related state
`write` no longer accepts a `key` argument and always appends `(example, "")`, but:
- The docstring still says "(Example, Key) pair".
- `current_examples`' second tuple element is now always an unused placeholder.
- `self.hkey_record` appears unused after removing the duplicate-key logic.

Consider:
- Updating the docstring/comment around `write` / `write_examples_on_file` to reflect that keys are no longer part of the public API.
- Either dropping the dummy string and switching `current_examples` to store just examples (and updating `write_examples_on_file` accordingly), or clearly documenting that the second tuple element is deprecated.
- Removing `hkey_record` if it's no longer needed.

Also applies to: 582-593
src/datasets/splits.py (1)
523-551: Normalize `SplitDict` keys to strings to avoid subtle key-type mismatches
`__setitem__` accepts `Union[SplitBase, str]` keys and writes them as-is:
    if key != value.name:
        raise ValueError(...)
    super().__setitem__(key, value)
Elsewhere, lookups assume string keys via `str(key) in self` and `super().__getitem__(str(key))`. If some code ever does `split_dict[Split.TRAIN] = info`, the dict key becomes a `NamedSplit`, and `'train' in split_dict` will be false because dict key comparison uses `'train' == NamedSplit('train')`, not the reverse.
To keep behavior predictable, it would be safer to always store keys as strings:
-    def __setitem__(self, key: Union[SplitBase, str], value: SplitInfo):
-        if key != value.name:
-            raise ValueError(f"Cannot add elem. (key mismatch: '{key}' != '{value.name}')")
-        super().__setitem__(key, value)
+    def __setitem__(self, key: Union[SplitBase, str], value: SplitInfo):
+        key_str = str(key)
+        if key_str != value.name:
+            raise ValueError(f"Cannot add elem. (key mismatch: '{key}' != '{value.name}')")
+        super().__setitem__(key_str, value)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (29)
- src/datasets/arrow_dataset.py (2 hunks)
- src/datasets/arrow_writer.py (2 hunks)
- src/datasets/builder.py (28 hunks)
- src/datasets/info.py (1 hunks)
- src/datasets/inspect.py (0 hunks)
- src/datasets/iterable_dataset.py (5 hunks)
- src/datasets/keyhash.py (0 hunks)
- src/datasets/naming.py (1 hunks)
- src/datasets/packaged_modules/arrow/arrow.py (2 hunks)
- src/datasets/packaged_modules/cache/cache.py (2 hunks)
- src/datasets/packaged_modules/csv/csv.py (2 hunks)
- src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (4 hunks)
- src/datasets/packaged_modules/generator/generator.py (2 hunks)
- src/datasets/packaged_modules/hdf5/hdf5.py (2 hunks)
- src/datasets/packaged_modules/json/json.py (3 hunks)
- src/datasets/packaged_modules/pandas/pandas.py (2 hunks)
- src/datasets/packaged_modules/parquet/parquet.py (2 hunks)
- src/datasets/packaged_modules/spark/spark.py (1 hunks)
- src/datasets/packaged_modules/sql/sql.py (2 hunks)
- src/datasets/packaged_modules/text/text.py (4 hunks)
- src/datasets/packaged_modules/webdataset/webdataset.py (2 hunks)
- src/datasets/packaged_modules/xml/xml.py (1 hunks)
- src/datasets/splits.py (3 hunks)
- src/datasets/utils/info_utils.py (1 hunks)
- tests/packaged_modules/test_spark.py (2 hunks)
- tests/test_arrow_dataset.py (1 hunks)
- tests/test_arrow_writer.py (1 hunks)
- tests/test_builder.py (14 hunks)
- tests/test_upstream_hub.py (6 hunks)
💤 Files with no reviewable changes (2)
- src/datasets/keyhash.py
- src/datasets/inspect.py
✅ Files skipped from review due to trivial changes (1)
- src/datasets/utils/info_utils.py
🧰 Additional context used
🧬 Code graph analysis (18)
src/datasets/packaged_modules/sql/sql.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/packaged_modules/hdf5/hdf5.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/packaged_modules/webdataset/webdataset.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/info.py (1)
src/datasets/splits.py (1)
SplitDict(520-600)
tests/test_builder.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/packaged_modules/parquet/parquet.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/builder.py (5)
src/datasets/packaged_modules/generator/generator.py (2)
_split_generators(31-32)_generate_examples(34-38)src/datasets/packaged_modules/webdataset/webdataset.py (1)
_generate_examples(110-129)src/datasets/utils/sharding.py (1)
_number_of_shards_in_gen_kwargs(4-18)src/datasets/packaged_modules/csv/csv.py (1)
_generate_tables(177-199)src/datasets/packaged_modules/json/json.py (1)
_generate_tables(113-193)
tests/test_arrow_writer.py (1)
src/datasets/arrow_writer.py (1)
write(582-598)
src/datasets/packaged_modules/arrow/arrow.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
tests/test_arrow_dataset.py (1)
src/datasets/arrow_dataset.py (4)
Dataset(704-6467)add_column(6078-6129)features(206-207)features(781-785)
src/datasets/packaged_modules/generator/generator.py (2)
src/datasets/builder.py (1)
Key(1321-1326)src/datasets/utils/sharding.py (2)
_number_of_shards_in_gen_kwargs(4-18)_split_gen_kwargs(48-64)
src/datasets/packaged_modules/xml/xml.py (7)
src/datasets/packaged_modules/arrow/arrow.py (1)
_cast_table(56-61)src/datasets/packaged_modules/csv/csv.py (1)
_cast_table(166-175)src/datasets/packaged_modules/json/json.py (1)
_cast_table(88-111)src/datasets/packaged_modules/pandas/pandas.py (1)
_cast_table(55-60)src/datasets/packaged_modules/parquet/parquet.py (1)
_cast_table(144-149)src/datasets/packaged_modules/sql/sql.py (1)
_cast_table(101-110)src/datasets/packaged_modules/text/text.py (1)
_cast_table(53-64)
src/datasets/packaged_modules/json/json.py (8)
src/datasets/builder.py (1)
Key(1321-1326)src/datasets/packaged_modules/arrow/arrow.py (1)
_cast_table(56-61)src/datasets/packaged_modules/csv/csv.py (1)
_cast_table(166-175)src/datasets/packaged_modules/pandas/pandas.py (1)
_cast_table(55-60)src/datasets/packaged_modules/parquet/parquet.py (1)
_cast_table(144-149)src/datasets/packaged_modules/sql/sql.py (1)
_cast_table(101-110)src/datasets/packaged_modules/text/text.py (1)
_cast_table(53-64)src/datasets/packaged_modules/xml/xml.py (1)
_cast_table(48-59)
src/datasets/packaged_modules/cache/cache.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/packaged_modules/pandas/pandas.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (3)
src/datasets/builder.py (1)
Key(1321-1326)src/datasets/features/features.py (1)
to_pylist(824-830)src/datasets/table.py (1)
to_pylist(234-241)
src/datasets/iterable_dataset.py (2)
src/datasets/builder.py (1)
Key(1321-1326)src/datasets/packaged_modules/generator/generator.py (1)
Generator(25-38)
src/datasets/packaged_modules/text/text.py (1)
src/datasets/builder.py (1)
Key(1321-1326)
🪛 Gitleaks (8.30.0)
tests/test_upstream_hub.py
[high] 358-358: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: test (integration, windows-latest, deps-latest)
- GitHub Check: test (integration, ubuntu-latest, deps-latest)
- GitHub Check: test (integration, ubuntu-latest, deps-minimum)
- GitHub Check: test (integration, windows-latest, deps-minimum)
- GitHub Check: test (unit, windows-latest, deps-minimum)
- GitHub Check: test (unit, windows-latest, deps-latest)
- GitHub Check: test (unit, ubuntu-latest, deps-minimum)
- GitHub Check: test_py314_future (unit, windows-latest, deps-latest)
- GitHub Check: test (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314 (unit, windows-latest, deps-latest)
- GitHub Check: build / build_pr_documentation
🔇 Additional comments (34)
tests/test_upstream_hub.py (2)
309-327: Readiness waits around the first overwrite flow look correct and should deflake the test
Waiting once after the initial `push_to_hub` and again after the manual `upload_file` before the second push is exactly where the LFS propagation race would have bitten, so this should meaningfully stabilize the test without adding too much latency. Please keep an eye on CI duration; if this test ever becomes a bottleneck, you can tune `max_wait` or the polling cadence.
357-396: Mirroring the readiness pattern for `ds_name_2` is clear and consistent
Using a separate `ds_name_2` context and inserting the same two `_wait_for_repo_ready` calls around the first push and the manual `upload_file` should stabilize this second overwrite branch as well, and threading `ds_name_2` through all subsequent Hub calls improves readability. The static-analysis "generic API key" warning here appears to be a false positive since `_token` comes from fixtures/environment, so this change does not add any hard-coded secrets.
src/datasets/naming.py (1)
70-84: Confirm the behavior change for single-shard `shard_lengths` in `filenames_for_dataset_split`.
Requiring `len(shard_lengths) > 1` means a 1-element `shard_lengths` will now use the non-sharded filename (no `-00000-of-00001` suffix). Please confirm this matches how shard files are actually written for single-shard splits (e.g., by builders that populate `SplitInfo.shard_lengths`), especially for existing on-disk caches that `Cache._split_generators` will read.
10-11: Use of `Key` in the Pandas builder is consistent with composite key semantics.
Switching `_generate_tables` to yield `Key(i, 0)` (with `i` as the file index) aligns this builder with the new `Key(original_shard_id, item_or_batch_id)` convention and keeps table generation logic unchanged.
Also applies to: 62-66
src/datasets/packaged_modules/xml/xml.py (1)
61-68: Composite tuple keys for XML tables look reasonable.
Emitting `(file_idx, 0)` here standardizes XML table keys to a (file, batch) shape and should remain compatible with consumers that treat keys as hashable/str-able identifiers.
10-11: Key-based keys in the CSV generator are consistent with the new scheme.
Yielding `Key(file_idx, batch_idx)` in `_generate_tables` cleanly encodes the (file, chunk) identity and aligns this module with other Arrow-based builders using `Key`.
Also applies to: 177-197
src/datasets/packaged_modules/json/json.py (1)
12-13: `Key` usage in the JSON generator is consistent across all code paths.
All branches of `_generate_tables` now emit `Key(file_idx, batch_idx)` (or `(file_idx, 0)` when the whole file becomes a single table), which keeps key semantics uniform whether parsing via PyArrow JSON or falling back to pandas.
Also applies to: 113-125, 170-193
src/datasets/packaged_modules/cache/cache.py (1)
14-16: `Key`-based cache table keys look correct; verify split/shard metadata assumptions.
Using `Key(file_idx, batch_idx)` in `_generate_tables` aligns cache streaming with the new composite key model, and pairs well with `filenames_for_dataset_split` driven by `split_info.shard_lengths`; please just confirm that cached `SplitInfo` objects only set `shard_lengths` for truly multi-shard splits so single-shard caches still resolve the correct filenames.
Also applies to: 157-177, 179-190
src/datasets/packaged_modules/arrow/arrow.py (1)
8-8: Arrow batch keys now consistently use `Key`
Importing `Key` and yielding `Key(file_idx, batch_idx)` from `_generate_tables` aligns Arrow with the new structured key convention without changing control flow or error handling. Looks correct and consistent with other packaged modules.
Also applies to: 77-77
src/datasets/packaged_modules/generator/generator.py (1)
5-7: Sharding-aware generator keys via `Key(shard_idx, sample_idx)`
Using `_number_of_shards_in_gen_kwargs` + `_split_gen_kwargs` and yielding `Key(shard_idx, sample_idx)` gives stable, structured keys and brings the generator path in line with other Key-based builders.
One thing to double-check: for custom generators that previously received list-valued `gen_kwargs` (e.g. a list of input files) in a single call, this now invokes `self.config.generator(**shard_gen_kwargs)` once per shard with sliced lists instead. If any downstream usage relied on the old "single call with full lists" behavior, this is a semantic change and may need either documentation/tests or a migration note; a simplified illustration of the slicing follows.
Functionally this looks sound; the main concern is just confirming intended semantics for list-valued `gen_kwargs`.
Also applies to: 35-38
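A simplified illustration of the semantic change for list-valued `gen_kwargs`; this mimics what shard splitting does conceptually and is not the actual `_split_gen_kwargs` implementation:

```python
def split_list_kwargs(gen_kwargs: dict, num_shards: int) -> list[dict]:
    """Slice every list-valued kwarg into per-shard chunks; scalar kwargs are copied as-is."""
    shards = []
    for shard_idx in range(num_shards):
        shard_kwargs = {}
        for name, value in gen_kwargs.items():
            shard_kwargs[name] = value[shard_idx::num_shards] if isinstance(value, list) else value
        shards.append(shard_kwargs)
    return shards


# Old behavior: one generator call with files=["a.txt", "b.txt", "c.txt", "d.txt"].
# New behavior (conceptually): one call per shard with a slice of that list.
for shard in split_list_kwargs({"files": ["a.txt", "b.txt", "c.txt", "d.txt"]}, num_shards=2):
    print(shard)  # {'files': ['a.txt', 'c.txt']} then {'files': ['b.txt', 'd.txt']}
```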
src/datasets/packaged_modules/parquet/parquet.py (1)
10-10: Parquet batch keys correctly migrated to `Key`
Switching the yielded key from a string to `Key(file_idx, batch_idx)` (with the corresponding import) cleanly aligns Parquet with the new Key-based indexing scheme, without altering batching or error-handling behavior.
Also applies to: 182-182
src/datasets/packaged_modules/sql/sql.py (1)
10-10: SQL chunk keys now use `Key(0, chunk_idx)` consistently
Importing `Key` and yielding `Key(0, chunk_idx)` from `_generate_tables` provides a structured key while preserving the existing chunking and casting behavior. This matches the single-shard SQL use case and is consistent with other modules.
Also applies to: 120-120
tests/packaged_modules/test_spark.py (1)
28-28: LGTM! Test expectations updated for tuple-based keys.
The test changes correctly reflect the migration from string-based keys to tuple-based keys `(part_id, row_idx)`, consistent with the broader Key migration across the codebase.
Also applies to: 78-79
src/datasets/packaged_modules/hdf5/hdf5.py (1)
9-9: LGTM! Consistent Key migration for HDF5 table generation.
The changes correctly:
- Import `Key` from `datasets.builder`
- Enumerate batches to derive `batch_idx`
- Yield `Key(file_idx, batch_idx)` instead of the previous string-based key

This aligns with the broader migration to structured Key-based identifiers.
Also applies to: 88-94
src/datasets/packaged_modules/webdataset/webdataset.py (1)
12-12: LGTM! Key migration for WebDataset example generation.
The change from string-formatted keys to `Key(tar_idx, example_idx)` is consistent with the codebase-wide migration to structured Key-based identifiers.
Also applies to: 129-129
src/datasets/packaged_modules/text/text.py (2)
9-9: LGTM! Key migration for line-based sampling. The import and Key usage for line-based sampling mode is correct.
Also applies to: 86-87
110-113: LGTM! Key usage for document mode. The Key usage for document mode is correct.
src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py (2)
16-16: LGTM! Key migration for metadata-driven example generation.
The changes correctly:
- Import `Key` from `datasets.builder`
- Enumerate metadata files to derive `shard_idx`
- Enumerate samples within each metadata table to derive `sample_idx`
- Yield `Key(shard_idx, sample_idx)` for each sample

The indexing logic properly resets `sample_idx` for each shard.
Also applies to: 368-390
398-411: LGTM! Key migration for the non-metadata file path.
The non-metadata path correctly mirrors the metadata path with proper enumeration for `shard_idx` and `sample_idx`.
tests/test_builder.py (4)
25-26: LGTM! Key import added for test builders.
The `Key` import is correctly added to support the shard-aware builders in tests.
154-156: LGTM! Shard-aware test builders correctly use Key.
The builders with explicit shards properly use `Key(shard_idx, i)` to emit composite keys, correctly exercising the shard-tracking functionality.
Also applies to: 167-169, 188-190, 209-211
1037-1038: LGTM! Shard metadata assertions are comprehensive.
The test assertions correctly verify:
- `shard_lengths` contains per-shard row counts
- `original_shard_lengths` is `None` when no parallel processing occurs
- `original_shard_lengths` tracks original shard boundaries when using `num_proc`

Also applies to: 1087-1087, 1130-1131, 1179-1179
768-771: LGTM! Test parameter updates.
The added `config_name="a"` parameter aligns tests with the expected configuration requirements.
Also applies to: 777-777
src/datasets/arrow_dataset.py (2)
6079-6084: `add_column` signature change is consistent and backward-compatible
Making `new_fingerprint` optional while keeping it in the same positional slot, and tightening the type to `np.ndarray`, aligns this method with other `@fingerprint_transform`-wrapped transforms and should not break existing callers.
6340-6390: Optional `new_fingerprint` for `add_item` matches the rest of the API
Allowing `add_item` to omit `new_fingerprint` brings it in line with other dataset transforms that default this parameter, improving ergonomics without affecting existing positional or keyword usages.
src/datasets/splits.py (2)
31-45: Addition of `original_shard_lengths` looks consistent
Adding `original_shard_lengths` to `SplitInfo` and excluding it from YAML via `_to_yaml_list` keeps the wire format stable while allowing richer in-memory metadata. No issues spotted here.
587-596: YAML stripping of shard metadata and deprecated fields is coherent
`_to_yaml_list` now drops `shard_lengths`, `original_shard_lengths`, and `dataset_name` from the serialized split dicts. This keeps YAML light while preserving full metadata in JSON and in-memory structures. Looks consistent with the surrounding comments.
src/datasets/iterable_dataset.py (3)
77-89: `Key` alias and `BuilderKey` import pattern look good
Using a TYPE_CHECKING-only import for `BuilderKey` and defining `Key = Union[int, str, tuple[int, int], "BuilderKey"]` accurately reflects the new key shapes without incurring runtime imports. No behavioral concerns here.
257-288: Iterator-based type hints for `ExamplesIterable` and sharded variants now match actual usage
`generate_examples_fn` is now typed as returning an `Iterator[tuple[Key, dict]]`, which aligns with how `__iter__` consumes it (`for key_example in islice(self.generate_examples_fn(**gen_kwags), ...)`). No functional change, but the annotation is now accurate.
342-403: Arrow examples iterator type hints are consistent with generator behavior
`ArrowExamplesIterable` and `ShuffledDataSourcesArrowExamplesIterable` now type `generate_tables_fn` as `Callable[..., Iterator[tuple[Key, pa.Table]]]`, matching their `for key, pa_table in self.generate_tables_fn(**gen_kwags)` usage. This tightens the public surface without changing behavior.
src/datasets/builder.py (4)
1339-1339: LGTM: Type annotations correctly updated.
The return type annotations for `_generate_examples` and `_generate_tables` have been properly updated to reflect the new Key-based interface, improving type safety and documentation.
Also applies to: 1628-1628
1427-1435: LGTM: Original shard tracking properly integrated into multiprocessing.
The unpacking and aggregation of `original_shards_per_job` and `original_shard_lengths_per_job` is consistent across both single-process and multi-process code paths in both `GeneratorBasedBuilder` and `ArrowBasedBuilder`.
Also applies to: 1705-1713
1388-1398: LGTM: num_proc adjustment respects original shard boundaries.
The logic correctly computes `num_original_shards` and adjusts `num_proc` downward to avoid creating more parallel jobs than there are original shards to process. This prevents inefficient over-parallelization.
Also applies to: 1667-1677
27-27: LGTM: Import additions are appropriate.
The added imports (`Iterator`, `Mapping` from `collections.abc` and `pyarrow as pa`) are necessary for the new type annotations and are properly used throughout the file.
Also applies to: 35-35
@dataclass
class Key:
    original_shard_id: int
    item_or_batch_id: int

    def __str__(self):
        return str((self.original_shard_id, self.item_or_batch_id))
🛠️ Refactor suggestion | 🟠 Major
Add documentation for the Key dataclass.
The Key dataclass is introduced without any docstring or explanation of its purpose, field semantics, or usage contract. Given that this changes the signature of core generator methods (_generate_examples and _generate_tables), comprehensive documentation is essential for maintainability and to guide custom builder implementations.
Apply this diff to add documentation:
 @dataclass
 class Key:
+    """Represents the origin of a generated example or table batch.
+
+    Attributes:
+        original_shard_id: Zero-based index of the source shard from which this data originated.
+        item_or_batch_id: Zero-based index of the item (for examples) or batch (for tables) within that shard.
+    """
     original_shard_id: int
     item_or_batch_id: int
item_or_batch_id: int📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @dataclass | |
| class Key: | |
| original_shard_id: int | |
| item_or_batch_id: int | |
| def __str__(self): | |
| return str((self.original_shard_id, self.item_or_batch_id)) | |
| @dataclass | |
| class Key: | |
| """Represents the origin of a generated example or table batch. | |
| Attributes: | |
| original_shard_id: Zero-based index of the source shard from which this data originated. | |
| item_or_batch_id: Zero-based index of the item (for examples) or batch (for tables) within that shard. | |
| """ | |
| original_shard_id: int | |
| item_or_batch_id: int | |
| def __str__(self): | |
| return str((self.original_shard_id, self.item_or_batch_id)) |
🤖 Prompt for AI Agents
In src/datasets/builder.py around lines 1320 to 1327, the Key dataclass lacks
any docstring explaining its purpose, the semantics of original_shard_id and
item_or_batch_id, and how generators should use it; add a clear docstring to the
Key class that (1) succinctly states its role as the stable identifier passed to
_generate_examples/_generate_tables, (2) explains that original_shard_id
identifies the source shard and item_or_batch_id is either an item index or
batch id (specify whether zero-based and when batching is used), (3) documents
any invariants/expectations (immutability, must be convertible to str, used for
sharding/ordering), and (4) gives a short usage example showing creation and
__str__ behavior so implementers of custom builders know how to construct and
interpret Keys.
src/datasets/builder.py
Outdated
if total_original_shards > 1:
    split_generator.split_info.original_shard_lengths = [
        original_shard_length
        for original_shard_lengths in original_shard_lengths_per_job
        for original_shard_length in original_shard_lengths
    ]
Inconsistent population of original_shard_lengths metadata.
The split_info.original_shard_lengths is only populated when total_original_shards > 1, but the existing split_info.shard_lengths is always populated (even for single shard, see lines 1506-1508). This inconsistency could complicate downstream code that relies on this metadata.
Consider populating it consistently:
- if total_original_shards > 1:
- split_generator.split_info.original_shard_lengths = [
- original_shard_length
- for original_shard_lengths in original_shard_lengths_per_job
- for original_shard_length in original_shard_lengths
- ]
+ split_generator.split_info.original_shard_lengths = [
+ original_shard_length
+ for original_shard_lengths in original_shard_lengths_per_job
+ for original_shard_length in original_shard_lengths
+    ]
Or document why the conditional is necessary if there's a specific reason (e.g., storage optimization).
🤖 Prompt for AI Agents
In src/datasets/builder.py around lines 1517 to 1522,
split_info.original_shard_lengths is only set when total_original_shards > 1
which is inconsistent with split_info.shard_lengths (always populated); make
original_shard_lengths populated consistently: remove the conditional and always
assign the flattened list (or, if you prefer to preserve single-shard behavior,
add an else branch that sets original_shard_lengths to the single-shard
length(s) equivalent to shard_lengths), so downstream code can rely on this
metadata being present.
    check_duplicate_keys: bool,
    job_id: int,
-) -> Iterable[tuple[int, bool, Union[int, tuple]]]:
+) -> Iterator[tuple[int, bool, tuple[int, int, Features, int, int, int]]]:
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the file structure and locate the method
wc -l src/datasets/builder.pyRepository: The-Obstacle-Is-The-Way/datasets
Length of output: 104
🏁 Script executed:
# View the code around line 1535 to see the return type annotation
sed -n '1530,1540p' src/datasets/builder.pyRepository: The-Obstacle-Is-The-Way/datasets
Length of output: 537
🏁 Script executed:
# View the code around line 1807 to see the second occurrence
sed -n '1802,1812p' src/datasets/builder.pyRepository: The-Obstacle-Is-The-Way/datasets
Length of output: 683
🏁 Script executed:
# Search for _prepare_split_single definitions in the codebase
rg -n 'def _prepare_split_single' --type=pyRepository: The-Obstacle-Is-The-Way/datasets
Length of output: 275
🏁 Script executed:
# Search for any overrides or implementations outside builder.py
rg -n '_prepare_split_single' --type=py -g '!src/datasets/builder.py'Repository: The-Obstacle-Is-The-Way/datasets
Length of output: 300
🏁 Script executed:
# View the spark.py override implementation
sed -n '195,220p' src/datasets/packaged_modules/spark/spark.pyRepository: The-Obstacle-Is-The-Way/datasets
Length of output: 1267
🏁 Script executed:
# Check the full method signature and return pattern
sed -n '199,250p' src/datasets/packaged_modules/spark/spark.pyRepository: The-Obstacle-Is-The-Way/datasets
Length of output: 2582
🏁 Script executed:
# View the complete spark.py implementation to see what it yields
sed -n '199,300p' src/datasets/packaged_modules/spark/spark.pyRepository: The-Obstacle-Is-The-Way/datasets
Length of output: 4736
🏁 Script executed:
# Also check if there are any other subclasses that override this method
rg -n 'class.*Builder' --type=py src/datasets/builder.py | head -20Repository: The-Obstacle-Is-The-Way/datasets
Length of output: 478
🏁 Script executed:
# Search for any other potential overrides in the codebase
find . -name '*.py' -type f -exec grep -l 'def _prepare_split_single' {} \; 2>/dev/null | grep -v __pycache__Repository: The-Obstacle-Is-The-Way/datasets
Length of output: 148
🏁 Script executed:
# Check git history for changes to _prepare_split_single return type
git log --oneline -20 -- src/datasets/builder.pyRepository: The-Obstacle-Is-The-Way/datasets
Length of output: 137
🏁 Script executed:
# Look at the actual return statement to understand what's being yielded
rg -B5 -A15 'yield.*shard_id.*bool' src/datasets/builder.py | head -50Repository: The-Obstacle-Is-The-Way/datasets
Length of output: 58
🏁 Script executed:
# Check around line 1535 to see the actual yield statements
sed -n '1550,1600p' src/datasets/builder.pyRepository: The-Obstacle-Is-The-Way/datasets
Length of output: 2936
🏁 Script executed:
# Continue reading to find the final yield
sed -n '1590,1610p' src/datasets/builder.pyRepository: The-Obstacle-Is-The-Way/datasets
Length of output: 963
🏁 Script executed:
# Search for the complete final yield statement in GeneratorBasedBuilder
sed -n '1527,1650p' src/datasets/builder.py | grep -A10 "yield ("Repository: The-Obstacle-Is-The-Way/datasets
Length of output: 382
🏁 Script executed:
# Check ArrowBasedBuilder's final yield
sed -n '1805,1880p' src/datasets/builder.pyRepository: The-Obstacle-Is-The-Way/datasets
Length of output: 4179
🏁 Script executed:
# Look for the yield statement in ArrowBasedBuilder
sed -n '1805,1900p' src/datasets/builder.py | grep -B5 -A20 "yield ("Repository: The-Obstacle-Is-The-Way/datasets
Length of output: 813
Fix type annotations for _prepare_split_single return signature to match implementation.
The return type annotations at lines 1535 and 1807 claim a 6-element inner tuple but the implementations return 7 elements: (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths, num_original_shards, original_shard_lengths). Update both signatures to Iterator[tuple[int, bool, tuple[int, int, Features, int, int, int, int, list]]] or use a more descriptive type alias.
This is a breaking change for custom builders overriding this method, including src/datasets/packaged_modules/spark/spark.py, which currently has an incompatible signature and needs updating.
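If the descriptive-alias route is taken, a sketch of what it could look like; the `_SplitResult` name is hypothetical and does not exist in the codebase:

```python
from typing import NamedTuple, Optional

from datasets import Features


class _SplitResult(NamedTuple):
    total_num_examples: int
    total_num_bytes: int
    features: Features
    num_shards: int
    shard_lengths: Optional[list[int]]
    num_original_shards: int
    original_shard_lengths: Optional[list[int]]


# The per-job generator could then be annotated as:
#     Iterator[tuple[int, bool, _SplitResult]]
# which keeps the 7-element payload self-documenting.
```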
if isinstance(key, Key):  # old custom builders may not use Key
    original_shard_id = key.original_shard_id
Clarify fallback behavior when key is not a Key instance.
The backward compatibility check allows old custom builders that don't yield Key objects, but the fallback behavior is implicit: original_shard_id retains its previous value (or 0 initially). This could lead to incorrect original shard attribution when mixing old-style keys with the new shard tracking.
Consider one of these approaches:
- Log a deprecation warning to alert users:
for key, record in generator:
if isinstance(key, Key): # old custom builders may not use Key
original_shard_id = key.original_shard_id
+ else:
+ logger.warning(
+ "Yielding non-Key objects from _generate_examples is deprecated. "
+ "Please update to yield Key(original_shard_id, item_or_batch_id)."
+            )
- Reset to 0 explicitly to make the behavior clear:
for key, record in generator:
if isinstance(key, Key): # old custom builders may not use Key
original_shard_id = key.original_shard_id
+ else:
+ original_shard_id = 0 # fallback for legacy builders📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if isinstance(key, Key): # old custom builders may not use Key | |
| original_shard_id = key.original_shard_id | |
| if isinstance(key, Key): # old custom builders may not use Key | |
| original_shard_id = key.original_shard_id | |
| else: | |
| original_shard_id = 0 # fallback for legacy builders |
🤖 Prompt for AI Agents
In src/datasets/builder.py around lines 1557-1558, the fallback when key is not
an instance of Key is implicit and can misattribute original_shard_id; change
the branch so that when not isinstance(key, Key) you explicitly set
original_shard_id = 0 and emit a clear deprecation warning (use the module
logger or warnings.warn) indicating old-style keys are deprecated and that
original_shard_id will default to 0 for compatibility; keep the existing
behavior for Key instances unchanged.
if len(original_shard_lengths) == original_shard_id:
    original_shard_lengths.append(1)
else:
    original_shard_lengths[original_shard_id] += 1
Validate that original_shard_id values are sequential.
The logic assumes original_shard_id values are contiguous integers starting from 0. If a generator yields non-sequential shard IDs (e.g., 0, 2, skipping 1), the condition len(original_shard_lengths) == original_shard_id will fail, causing an IndexError on the else branch.
Add validation to detect non-sequential shard IDs:
writer.write(example)
- if len(original_shard_lengths) == original_shard_id:
+ if original_shard_id < 0:
+ raise ValueError(f"original_shard_id must be non-negative, got {original_shard_id}")
+ elif len(original_shard_lengths) == original_shard_id:
original_shard_lengths.append(1)
+ elif len(original_shard_lengths) > original_shard_id:
+ original_shard_lengths[original_shard_id] += 1
else:
- original_shard_lengths[original_shard_id] += 1
+ raise ValueError(
+ f"Non-sequential original_shard_id detected: expected {len(original_shard_lengths)}, got {original_shard_id}. "
+ "Ensure your generator yields Keys with sequential original_shard_id values starting from 0."
+            )
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/datasets/builder.py around lines 1575-1578, the code assumes
original_shard_id values are contiguous which can cause an IndexError if a
generator yields a non-sequential id; validate that original_shard_id is
non-negative and either equal to len(original_shard_lengths) (append a new
count) or less than len(original_shard_lengths) (increment the slot); if
original_shard_id is greater than len(original_shard_lengths) (i.e., a
gap/skip), raise a clear ValueError (or log and abort) indicating a
non-sequential shard id was received so the upstream generator can be fixed.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/test_upstream_hub.py (1)
270-296: Consider catching broader exceptions for more robust test stability.
The exception handling currently only catches `HfHubHTTPError`. Transient network errors (e.g., `requests.RequestException`, `ConnectionError`, timeouts) would cause immediate test failure rather than a retry. Since the goal is to stabilize flaky tests, consider catching a broader set of exceptions.
Apply this diff to handle transient errors more gracefully:
-        from huggingface_hub.errors import HfHubHTTPError
+        from huggingface_hub.errors import HfHubHTTPError
+        import requests

         start_time = time.monotonic()
         while (time.monotonic() - start_time) < max_wait:
             try:
                 # Verify we can list files (repo is consistent)
                 self._api.list_repo_files(repo_id, repo_type="dataset", token=self._token)
                 # Small delay to ensure LFS objects are fully propagated
                 time.sleep(1)
                 return
-            except HfHubHTTPError:
+            except (HfHubHTTPError, requests.RequestException, ConnectionError, TimeoutError) as e:
+                # Retry on transient errors
                 time.sleep(1)
         raise TimeoutError(f"Repository {repo_id} not ready after {max_wait}s")
Optionally, enhance the timeout error message for easier debugging:
-        raise TimeoutError(f"Repository {repo_id} not ready after {max_wait}s")
+        raise TimeoutError(f"Repository {repo_id} not ready after {max_wait}s. Check Hub status and LFS propagation.")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test_upstream_hub.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_upstream_hub.py (4)
tests/fixtures/hub.py (1)
- temporary_repo (87-99)
src/datasets/arrow_dataset.py (1)
- push_to_hub (5658-6074)
src/datasets/iterable_dataset.py (1)
- push_to_hub (4092-4508)
src/datasets/dataset_dict.py (2)
- push_to_hub (1616-1983)
- push_to_hub (2495-2852)
🪛 Gitleaks (8.30.0)
tests/test_upstream_hub.py
[high] 358-358: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: test (integration, windows-latest, deps-minimum)
- GitHub Check: test (integration, windows-latest, deps-latest)
- GitHub Check: test (integration, ubuntu-latest, deps-latest)
- GitHub Check: test (unit, windows-latest, deps-latest)
- GitHub Check: test (unit, windows-latest, deps-minimum)
- GitHub Check: test (integration, ubuntu-latest, deps-minimum)
- GitHub Check: test (unit, ubuntu-latest, deps-latest)
- GitHub Check: test (unit, ubuntu-latest, deps-minimum)
- GitHub Check: test_py314 (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314_future (unit, ubuntu-latest, deps-latest)
- GitHub Check: test_py314 (unit, windows-latest, deps-latest)
- GitHub Check: build / build_pr_documentation
🔇 Additional comments (2)
tests/test_upstream_hub.py (2)
309-327: LGTM - proper synchronization between Hub operations.
The integration of `_wait_for_repo_ready` after `push_to_hub` (line 307) and `upload_file` (lines 318-324) correctly addresses the race condition described in the PR objectives, ensuring the Hub has fully processed LFS objects before the next operation.
357-395: LGTM - improved test isolation with separate repositories.
The second test scenario now uses a dedicated repository (`ds_name_2`) instead of reusing the first repository. This improves test isolation and ensures a clean state for each scenario. The `_wait_for_repo_ready` calls at lines 361 and 378 correctly follow the same synchronization pattern, ensuring Hub consistency between operations.
…ace#7877) (huggingface#7891)
* fix(fingerprint): improve TMPDIR environment variable handling in _TempCacheDir
  Enhanced the _TempCacheDir.__init__ method to properly respect and handle the TMPDIR environment variable when creating temporary cache directories.
  Changes:
  - Add TMPDIR environment variable detection and validation
  - Normalise paths to handle path resolution issues
  - Auto-create TMPDIR directory if it doesn't exist to prevent silent fallback to default temporary directory
  - Validate that TMPDIR is actually a directory before use
  - Explicitly pass directory to mkdtemp to ensure TMPDIR is respected even if tempfile.gettempdir() was already called and cached
  - Add appropriate logging for directory creation and fallback scenarios
  This ensures that when TMPDIR is set, the temporary cache files are created in the specified directory rather than silently falling back to the system default temporary directory.
* test(fingerprint): add comprehensive tests for TMPDIR handling in _TempCacheDir
  Add test coverage for the improved TMPDIR environment variable handling in the _TempCacheDir class. These tests verify the various scenarios for TMPDIR usage and error handling.
  Changes:
  - Refactor test_fingerprint_in_multiprocessing to use Pool.map for cleaner test implementation
  - Add test_temp_cache_dir_with_tmpdir_nonexistent to verify TMPDIR auto-creation when directory doesn't exist
  - Add test_temp_cache_dir_with_tmpdir_existing to verify correct behaviour when TMPDIR exists and is valid
  - Add test_temp_cache_dir_without_tmpdir to verify fallback to default temporary directory when TMPDIR is not set
  - Add test_temp_cache_dir_tmpdir_creation_failure to verify graceful error handling and fallback when TMPDIR creation fails
  These tests ensure that the TMPDIR improvements work correctly across all scenarios and edge cases, including proper logging and fallback behaviour.
* test(fingerprint): tighten TMPDIR error-path tests for _TempCacheDir
  Refine TMPDIR-related failure tests for _TempCacheDir to assert explicit error conditions instead of fallback behaviour.
  Changes:
  - Update test_temp_cache_dir_tmpdir_creation_failure to use _TempCacheDir directly and assert that an OSError is raised with a clear TMPDIR context when directory creation fails
  - Introduce test_temp_cache_dir_tmpdir_not_directory to verify that pointing TMPDIR at a non-directory raises an OSError with an informative error message
  These tests better match the intended contract of _TempCacheDir by ensuring invalid TMPDIR configurations fail loudly with descriptive messages rather than silently falling back.
* fix(fingerprint): make TMPDIR misconfiguration in _TempCacheDir fail loudly
  Tighten TMPDIR handling in _TempCacheDir so that invalid configurations raise clear errors instead of silently falling back to the default temporary directory.
  Changes:
  - When TMPDIR points to a non-existent directory, raise an OSError with explicit guidance to create it manually or unset TMPDIR
  - When TMPDIR points to a non-directory path, raise an OSError with guidance to point TMPDIR to a writable directory or unset it
  - Remove previous warning-and-fallback behaviour to avoid masking configuration issues
  This ensures that TMPDIR misconfigurations are surfaced early and clearly, aligning runtime behaviour with the stricter expectations codified in the new tests.
* Update src/datasets/fingerprint.py
* Update fingerprint.py
* Fix formatting of TMPDIR retrieval line
--------
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
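A minimal sketch of the strict TMPDIR handling these commits describe (illustrative only; the real logic lives in `_TempCacheDir.__init__` and may differ in details):

```python
import os
import tempfile


def resolve_cache_tmpdir() -> str:
    """Directory to pass to tempfile.mkdtemp(dir=...), failing loudly on a bad TMPDIR."""
    tmpdir = os.environ.get("TMPDIR")
    if not tmpdir:
        return tempfile.gettempdir()  # TMPDIR unset: use the system default
    tmpdir = os.path.normpath(tmpdir)
    if not os.path.exists(tmpdir):
        raise OSError(f"TMPDIR {tmpdir!r} does not exist; create it manually or unset TMPDIR.")
    if not os.path.isdir(tmpdir):
        raise OSError(f"TMPDIR {tmpdir!r} is not a directory; point it to a writable directory or unset it.")
    return tmpdir


# Passing dir= explicitly ensures TMPDIR is respected even if tempfile.gettempdir()
# was already called and cached earlier in the process.
cache_dir = tempfile.mkdtemp(prefix="hf_datasets-", dir=resolve_cache_tmpdir())
```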
* encode nifti correctly when uploading lazily
* update docstring for nifti functionality
* fix(nifti): enable lazy loading for Nifti1ImageWrapper
  - Change dataobj initialization from get_fdata() to .dataobj
  - Preserves nibabel's ArrayProxy for lazy loading
  - Prevents OOM errors on large 4D fMRI files
  - Defers I/O errors to usage time for better error handling
  Resolves #1
* chore: trigger CI
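A small sketch of the difference this commit relies on; the file path is a hypothetical example and this is not the wrapper's actual code:

```python
import nibabel as nib
import numpy as np

img = nib.load("example_bold.nii.gz")  # hypothetical 4D fMRI file

# Lazy: `dataobj` is an ArrayProxy, so no voxel data is read until it is indexed.
first_volume = np.asarray(img.dataobj[..., 0])

# Eager: `get_fdata()` materializes the whole volume in memory up front,
# which is what caused OOM errors on large 4D files.
full_volume = img.get_fdata()
```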
…gingface#7906) don't save original_shard_lengths by default for backward compat
`test_push_dataset_dict_to_hub_overwrite_files` intermittently fails with:
```
BadRequestError: LFS pointer pointed to a file that does not exist
```
Root cause: Two race conditions in the test design:
1. Rapid successive `push_to_hub` calls don't wait for the Hub's LFS object propagation between pushes
2. The second test scenario reused the same repo name, creating a race between repo deletion and recreation

Fix:
- Add `_wait_for_repo_ready()` helper that ensures the Hub repository is in a consistent state before subsequent operations
- Use a unique repo name (`ds_name_2`) for the second scenario to eliminate the delete/create race entirely

Tested: All 4 integration test variants now pass consistently (ubuntu/windows, deps-latest/deps-minimum).
Summary
- Add `_wait_for_repo_ready` helper to ensure Hub repository consistency between operations
- Fix the race condition in `test_push_dataset_dict_to_hub_overwrite_files` that caused intermittent LFS pointer errors

Problem
`test_push_dataset_dict_to_hub_overwrite_files` intermittently fails with `BadRequestError: LFS pointer pointed to a file that does not exist`. This occurs because rapid successive `push_to_hub` calls don't wait for the Hub's LFS object propagation between pushes.

Solution
Add synchronization delay between push operations to ensure Hub has fully processed LFS objects before the next push.
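A hedged sketch of what such a helper can look like, written here as a standalone function; the actual helper is a method on the test class and uses its fixtures:

```python
import time

from huggingface_hub import HfApi
from huggingface_hub.errors import HfHubHTTPError


def wait_for_repo_ready(api: HfApi, repo_id: str, token: str, max_wait: float = 30.0) -> None:
    """Poll until the dataset repo is consistent, then give LFS a short extra buffer."""
    start = time.monotonic()
    while (time.monotonic() - start) < max_wait:
        try:
            api.list_repo_files(repo_id, repo_type="dataset", token=token)
            time.sleep(1)  # small buffer for LFS object propagation
            return
        except HfHubHTTPError:
            time.sleep(1)  # transient inconsistency: retry until the deadline
    raise TimeoutError(f"Repository {repo_id} not ready after {max_wait}s")
```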
Test plan
Related Issues
- `push_to_hub` is not concurrency safe (dataset schema corruption): huggingface/datasets#7600 (push_to_hub concurrency safety)
- `push_to_hub` is not robust to the Hub closing the connection: huggingface/datasets#6392 (push_to_hub connection robustness)

This fix is intended for upstream submission to huggingface/datasets after validation.