@albertz
Member

albertz commented Oct 4, 2025

(Related: #626)

@albertz
Member Author

albertz commented Oct 4, 2025

Still a draft but maybe you are interested.

@albertz
Member Author

albertz commented Oct 4, 2025

Note, my current usage:

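from functools import partial
from typing import Any, Dict, Sequence, Union

from datasets import DatasetDict
from sisyphus import tk, Path  # presumably the Sisyphus toolkit and Path (get_path() is used below)

# TransformAndMapHuggingFaceDatasetJob is the job added in this PR;
# tools_paths is a local helper module. Their imports are omitted here.


def py():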
    for name in ["small", "medium", "large"]:
        tk.register_output(f"datasets/loquacious_hf_{name}_ogg", get_loquacious_hf_ogg(name))


def get_loquacious_hf_ogg(name: str = "large") -> Path:
    ffmpeg_binary = tools_paths.get_ffmpeg_binary()

    job = TransformAndMapHuggingFaceDatasetJob(
        "speechbrain/LoquaciousSet",
        name,
        transform=_transform_rename_columns,
        map_func=partial(_map_func_wav_to_ogg, ffmpeg_binary=ffmpeg_binary, quality_opts=["-q", "4"]),
        map_opts=_map_opts,
    )
    job.rqmt.update({"cpu": 32, "time": 24, "mem": 32})
    return job.out_dir


def _transform_rename_columns(ds: DatasetDict) -> DatasetDict:
    return ds.rename_columns({"ID": "id", "wav": "audio"})


def _map_func_wav_to_ogg(
    data: Dict[str, Any], *, ffmpeg_binary: Union[str, Path], quality_opts: Sequence[str]
) -> Dict[str, Any]:
    import subprocess

    proc_res = subprocess.run(
        [
            ffmpeg_binary.get_path() if isinstance(ffmpeg_binary, Path) else ffmpeg_binary,
            "-hide_banner",
            "-loglevel",
            "error",
            "-y",
            "-i",
            "pipe:0",
            "-ar",
            "16000",
            "-c:a",
            "libvorbis",
            *quality_opts,
            "-f",
            "ogg",
            "-",
        ],
        input=data["audio"]["bytes"],
        stdout=subprocess.PIPE,
        check=True,
    )
    data["audio"]["bytes"] = proc_res.stdout
    return data


def _map_opts(ds: DatasetDict) -> Dict[str, Any]:
    from datasets import Audio

    features = ds["train"].features.copy()
    audio_feat = features["audio"]
    assert isinstance(audio_feat, Audio)
    audio_feat.decode = True
    return {"features": features}
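In the usage above, `partial` binds `ffmpeg_binary` and `quality_opts`, so the job only ever calls the map function with the example dict. A minimal sketch of that binding, assuming a stand-in map function that records the ffmpeg command instead of running it (`build_ffmpeg_cmd` and `describe_example` are hypothetical names for illustration, not part of this PR):

```python
from functools import partial
from typing import Any, Dict, List, Sequence


def build_ffmpeg_cmd(ffmpeg_binary: str, quality_opts: Sequence[str]) -> List[str]:
    """Assemble the argument list: read audio from stdin, write Ogg Vorbis to stdout."""
    return [
        ffmpeg_binary,
        "-hide_banner",
        "-loglevel", "error",
        "-y",
        "-i", "pipe:0",  # input comes from stdin
        "-ar", "16000",  # resample to 16 kHz
        "-c:a", "libvorbis",
        *quality_opts,  # e.g. ["-q", "4"]
        "-f", "ogg",
        "-",  # output goes to stdout
    ]


def describe_example(
    data: Dict[str, Any], *, ffmpeg_binary: str, quality_opts: Sequence[str]
) -> Dict[str, Any]:
    """Hypothetical stand-in for the map function: stores the command, runs nothing."""
    data = dict(data)  # do not modify the caller's dict
    data["cmd"] = build_ffmpeg_cmd(ffmpeg_binary, quality_opts)
    return data


# Bind the keyword-only options once; the result takes only the example dict.
map_func = partial(describe_example, ffmpeg_binary="ffmpeg", quality_opts=["-q", "4"])
result = map_func({"id": "utt1"})
```

Keeping the options as keyword-only arguments means a different quality setting is just a different `partial`, without touching the map function itself.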

albertz marked this pull request as ready for review October 5, 2025 16:50
albertz changed the title from "TransformAndMapHuggingFaceDatasetJob" to "TransformAndMapHuggingFaceDatasetJob and ExtractTextFromHuggingFaceDatasetJob" Oct 5, 2025
Comment on lines +197 to +198
kwargs.pop("non_hashed_load_dataset_opts")
kwargs.pop("non_hashed_map_opts")
Contributor

As per the discussion in rwth-i6/sisyphus#274, should we make a copy of kwargs here before popping?

Member Author

Well, my comment from there (rwth-i6/sisyphus#274 (comment)):

> But I see that there is a lot of code which currently does it that way, so I don't think we can change this now.

The (shallow) copy of kwargs would need to be done in JobSingleton.__call__ anyway if we want to fix rwth-i6/sisyphus#274. Modifying kwargs here would then be safe, and it is consistent with most other code.
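To illustrate the concern, here is a minimal sketch of the difference between popping from the caller's dict and popping from a shallow copy. The two functions are hypothetical stand-ins and do not reproduce the actual Sisyphus hashing code; the `num_proc` value is illustrative only.

```python
from typing import Any, Dict


def hash_popping_in_place(kwargs: Dict[str, Any]) -> str:
    """Pops directly from the dict it was given, so the caller loses the key."""
    kwargs.pop("non_hashed_map_opts")
    return repr(sorted(kwargs.items()))


def hash_with_shallow_copy(kwargs: Dict[str, Any]) -> str:
    """Pops from a shallow copy, so the caller's dict keeps all its keys."""
    kwargs = dict(kwargs)
    kwargs.pop("non_hashed_map_opts")
    return repr(sorted(kwargs.items()))


original = {"path": "speechbrain/LoquaciousSet", "non_hashed_map_opts": {"num_proc": 8}}

h1 = hash_with_shallow_copy(original)
keys_after_copy = sorted(original)  # both keys are still present

h2 = hash_popping_in_place(original)
keys_after_in_place = sorted(original)  # "non_hashed_map_opts" is gone
```

Both variants produce the same hash input; only the in-place variant changes what the caller sees afterwards. A shallow copy protects the top-level keys only, and nested values such as the options dict are still shared.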

@albertz
Member Author

albertz commented Nov 3, 2025
