ArjunJagdale
Contributor

This PR builds on #6832 by @mariosasko.

May close - #4101, #2538

Discussion - #7648 (comment)


Note - This PR is a work in progress, and changes will be pushed frequently.

@ArjunJagdale
Contributor Author

ArjunJagdale commented Jul 22, 2025

I’ve continued the partial split download support from PR #6832, building on top of what Mario had already done. Here are some of my changes:

  • Restored support for writing multiple split shards:

  • In _prepare_split_single, we now correctly replace JJJJJ and SSSSS placeholders in the fpath for job/shard IDs before creating the writer.

  • Added os.makedirs(os.path.dirname(path), exist_ok=True) after placeholder substitution to prevent FileNotFoundError.

  • Applied the fix to both split writers:

     1] The self._generate_examples version (used by GeneratorBasedBuilder, which most modules rely on).
    
     2] The self._generate_tables version (used by ArrowBasedBuilder).
    
  • Confirmed 109/113 tests passing, meaning the general logic is working across the board.
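
The placeholder substitution and directory creation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper name resolve_shard_path, the zero-padded five-digit format, and the example file name are assumptions made for the sketch.

```python
import os
import tempfile


def resolve_shard_path(fpath: str, job_id: int, shard_id: int) -> str:
    """Hypothetical helper: fill in the JJJJJ/SSSSS placeholders, then
    make sure the parent directory exists before a writer is created."""
    # Assumed format: IDs are zero-padded to five digits to match the placeholder width.
    path = fpath.replace("JJJJJ", f"{job_id:05d}").replace("SSSSS", f"{shard_id:05d}")
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes this safe when several jobs share one directory.
        os.makedirs(parent, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    # Illustrative template only; the real file naming scheme may differ.
    template = os.path.join(tmp, "cache", "train", "data-JJJJJ-SSSSS.arrow")
    shard_path = resolve_shard_path(template, job_id=0, shard_id=2)
    print(os.path.basename(shard_path))
    print(os.path.isdir(os.path.dirname(shard_path)))
```

Doing the substitution first matters because the directory is derived from the final path, not from the template.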

What’s still failing
4 integration tests fail:

test_load_hub_dataset_with_single_config_in_metadata

test_load_hub_dataset_with_two_config_in_metadata

test_load_hub_dataset_with_metadata_config_in_parallel

test_reload_old_cache_from_2_15

All four fail with FileNotFoundError because the output directories are never created. I'm finalizing a fix that ensures os.makedirs() is called before every writer instantiation.
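
One way to guarantee this at every call site is to route writer creation through a single helper. The sketch below only illustrates the failure mode and the guard: open_shard_writer is a hypothetical name, and a plain binary file handle stands in for the library's real writer class.

```python
import os
import tempfile


def open_shard_writer(path: str):
    """Hypothetical stand-in for writer construction: create the parent
    directory first, then open the output file."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    return open(path, "wb")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "not", "yet", "created", "shard.arrow")

    # Opening the file directly fails because the parent directories are missing.
    try:
        open(target, "wb")
    except FileNotFoundError:
        failed_without_makedirs = True
    else:
        failed_without_makedirs = False

    # Going through the helper creates the directories and succeeds.
    with open_shard_writer(target) as f:
        f.write(b"placeholder bytes")
    wrote_file = os.path.isfile(target)
```

Centralizing the check means a new writer call site cannot forget the os.makedirs() step.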

I will post an update on these fixes after running the tests!

@ArjunJagdale
Contributor Author

@lhoestq this was just an update

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArjunJagdale
Contributor Author

ArjunJagdale commented Jul 28, 2025

My local directory got into a bad state, and I don't know exactly what happened. I will open a new PR! Sorry :)
