TransformAndMapHuggingFaceDatasetJob and ExtractTextFromHuggingFaceDatasetJob #627

albertz · 2025-10-04T02:29:30Z

(Related: #626)

albertz · 2025-10-04T02:30:12Z

Still a draft but maybe you are interested.

albertz · 2025-10-04T10:44:39Z

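Note, my current usage:

from functools import partial
from typing import Any, Dict, Sequence, Union

from datasets import DatasetDict
from sisyphus import tk, Path

# TransformAndMapHuggingFaceDatasetJob is the job added in this PR (datasets/huggingface.py).
# tools_paths is a setup-specific helper module that provides the ffmpeg binary.


def py():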
    for name in ["small", "medium", "large"]:
        tk.register_output(f"datasets/loquacious_hf_{name}_ogg", get_loquacious_hf_ogg(name))


def get_loquacious_hf_ogg(name: str = "large") -> Path:
    ffmpeg_binary = tools_paths.get_ffmpeg_binary()

    job = TransformAndMapHuggingFaceDatasetJob(
        "speechbrain/LoquaciousSet",
        name,
        transform=_transform_rename_columns,
        map_func=partial(_map_func_wav_to_ogg, ffmpeg_binary=ffmpeg_binary, quality_opts=["-q", "4"]),
        map_opts=_map_opts,
    )
    job.rqmt.update({"cpu": 32, "time": 24, "mem": 32})
    return job.out_dir


def _transform_rename_columns(ds: DatasetDict) -> DatasetDict:
    return ds.rename_columns({"ID": "id", "wav": "audio"})


def _map_func_wav_to_ogg(
    data: Dict[str, Any], *, ffmpeg_binary: Union[str, Path], quality_opts: Sequence[str]
) -> Dict[str, Any]:
    import subprocess

    proc_res = subprocess.run(
        [
            ffmpeg_binary.get_path() if isinstance(ffmpeg_binary, Path) else ffmpeg_binary,
            "-hide_banner",
            "-loglevel",
            "error",
            "-y",
            "-i",
            "pipe:0",
            "-ar",
            "16000",
            "-c:a",
            "libvorbis",
            *quality_opts,
            "-f",
            "ogg",
            "-",
        ],
        input=data["audio"]["bytes"],
        stdout=subprocess.PIPE,
        check=True,
    )
    data["audio"]["bytes"] = proc_res.stdout
    return data


def _map_opts(ds: DatasetDict) -> Dict[str, Any]:
    from datasets import Audio

    features = ds["train"].features.copy()
    audio_feat = features["audio"]
    assert isinstance(audio_feat, Audio)
    audio_feat.decode = True
    return {"features": features}
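The map function above feeds the audio bytes to ffmpeg on stdin and reads the converted bytes from stdout. A minimal sketch of that pipe pattern, runnable without ffmpeg: the helper name pipe_through and the stand-in command UPPER_CMD are hypothetical and only illustrate the subprocess.run call, not the job itself.

```python
import subprocess
import sys
from typing import Sequence


def pipe_through(cmd: Sequence[str], payload: bytes) -> bytes:
    """Send payload to the command's stdin and return what it writes to stdout.

    check=True raises subprocess.CalledProcessError on a non-zero exit code,
    so a failed conversion cannot silently produce an empty result.
    """
    proc = subprocess.run(cmd, input=payload, stdout=subprocess.PIPE, check=True)
    return proc.stdout


# Stand-in for the ffmpeg command line (assumption for illustration only):
# a Python one-liner that reads stdin and writes the upper-cased bytes to stdout.
UPPER_CMD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())",
]

if __name__ == "__main__":
    print(pipe_through(UPPER_CMD, b"abc"))
```

With ffmpeg, "pipe:0" as the input and "-" as the output play the roles of stdin and stdout here.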

michelwi · 2025-10-20T09:48:20Z

datasets/huggingface.py

+        kwargs.pop("non_hashed_load_dataset_opts")
+        kwargs.pop("non_hashed_map_opts")


As per the discussion in rwth-i6/sisyphus#274, should we make a copy here?

albertz · 2025-10-20

Well, my comment from there (rwth-i6/sisyphus#274 (comment)):

"But I see that there is a lot of code which currently does it that way, so I don't think we can change this now."

The (shallow) copy of kwargs would need to be done in JobSingleton.__call__ anyway if we want to fix rwth-i6/sisyphus#274. So modifying kwargs here is then safe, and consistent with most other code.
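To illustrate the point about the copy, here is a minimal sketch of why popping from kwargs matters and how a shallow copy at the call site makes it safe. This is not the actual sisyphus or job code: hash_kwargs, call_without_copy and call_with_copy are hypothetical names, and only the two popped keys are taken from the diff.

```python
from typing import Any, Dict


def hash_kwargs(kwargs: Dict[str, Any]) -> str:
    """Hypothetical stand-in for a job hash method that drops non-hashed options in place."""
    kwargs.pop("non_hashed_load_dataset_opts")
    kwargs.pop("non_hashed_map_opts")
    return repr(sorted(kwargs.items()))


def call_without_copy(kwargs: Dict[str, Any]) -> str:
    # The caller's dict is passed through directly, so the pops are visible to the caller.
    return hash_kwargs(kwargs)


def call_with_copy(kwargs: Dict[str, Any]) -> str:
    # A shallow copy is enough here: only top-level keys are removed, no values are mutated.
    return hash_kwargs(dict(kwargs))


if __name__ == "__main__":
    opts = {"path": "some/dataset", "non_hashed_load_dataset_opts": {}, "non_hashed_map_opts": {}}
    call_with_copy(opts)
    print(sorted(opts))  # all three keys are still present
    call_without_copy(opts)
    print(sorted(opts))  # only "path" is left
```

If the caller makes the copy, as suggested for JobSingleton.__call__, the in-place pops stay local to the hash computation.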

albertz · 2025-11-03T16:42:58Z

@Atticus1806 @robin-p-schmitt @NeoLegends @JackTemaki ping? anyone?

TransformAndMapHuggingFaceDatasetJob

7fe8057

albertz requested review from JackTemaki and robin-p-schmitt October 4, 2025 02:29

albertz added 2 commits October 4, 2025 12:04

better sharding

f7bee40

small fix

841b9e2

albertz mentioned this pull request Oct 4, 2025

HuggingFace datasets wrapper rwth-i6/returnn#1257

Closed

albertz added 3 commits October 5, 2025 16:48

ExtractTextFromHuggingFaceDatasetJob

3d9b51f

more

1750732

cleanup

d59b448

albertz marked this pull request as ready for review October 5, 2025 16:50

albertz changed the title ~~TransformAndMapHuggingFaceDatasetJob~~ TransformAndMapHuggingFaceDatasetJob and ExtractTextFromHuggingFaceDatasetJob Oct 5, 2025

albertz added 10 commits October 5, 2025 22:00

more

0a1482b

better

75425a2

better

db4f560

support loading existing dataset via load_from_disk

211cd49

support loading existing dataset via load_from_disk fix

1e50b2a

more

b846858

better

927091a

max_shard_size fixed default

0f2bde4

doc

cafaf2f

doc

5f88c54

michelwi reviewed Oct 20, 2025

comment on tmp dir

0c6ea2b

michelwi approved these changes Oct 20, 2025

albertz requested review from Atticus1806 and NeoLegends October 29, 2025 00:12
