TransformAndMapHuggingFaceDatasetJob and ExtractTextFromHuggingFaceDatasetJob #627
base: main
Still a draft, but maybe you are interested.
Note, my current usage:

    from functools import partial
    from typing import Any, Dict, Sequence, Union

    # Not shown: imports for tk, Path, DatasetDict, tools_paths,
    # and TransformAndMapHuggingFaceDatasetJob.


    def py():
        for name in ["small", "medium", "large"]:
            tk.register_output(f"datasets/loquacious_hf_{name}_ogg", get_loquacious_hf_ogg(name))


    def get_loquacious_hf_ogg(name: str = "large") -> Path:
        ffmpeg_binary = tools_paths.get_ffmpeg_binary()

        job = TransformAndMapHuggingFaceDatasetJob(
            "speechbrain/LoquaciousSet",
            name,
            transform=_transform_rename_columns,
            map_func=partial(_map_func_wav_to_ogg, ffmpeg_binary=ffmpeg_binary, quality_opts=["-q", "4"]),
            map_opts=_map_opts,
        )
        job.rqmt.update({"cpu": 32, "time": 24, "mem": 32})
        return job.out_dir


    def _transform_rename_columns(ds: DatasetDict) -> DatasetDict:
        return ds.rename_columns({"ID": "id", "wav": "audio"})


    def _map_func_wav_to_ogg(
        data: Dict[str, Any], *, ffmpeg_binary: Union[str, Path], quality_opts: Sequence[str]
    ) -> Dict[str, Any]:
        import subprocess

        # Pipe the original audio bytes through ffmpeg and read the Ogg Vorbis result from stdout.
        proc_res = subprocess.run(
            [
                ffmpeg_binary.get_path() if isinstance(ffmpeg_binary, Path) else ffmpeg_binary,
                "-hide_banner",
                "-loglevel",
                "error",
                "-y",
                "-i",
                "pipe:0",
                "-ar",
                "16000",
                "-c:a",
                "libvorbis",
                *quality_opts,
                "-f",
                "ogg",
                "-",
            ],
            input=data["audio"]["bytes"],
            stdout=subprocess.PIPE,
            check=True,
        )
        data["audio"]["bytes"] = proc_res.stdout
        return data


    def _map_opts(ds: DatasetDict) -> Dict[str, Any]:
        from datasets import Audio

        # Pass the feature spec on to map(), with decoding enabled for the audio column.
        features = ds["train"].features.copy()
        audio_feat = features["audio"]
        assert isinstance(audio_feat, Audio)
        audio_feat.decode = True
        return {"features": features}
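The `partial(...)` call in `get_loquacious_hf_ogg` binds the keyword-only arguments up front, so the job receives a callable that takes only the example dict. A minimal sketch of that pattern follows; `_fake_convert` is a hypothetical stand-in for `_map_func_wav_to_ogg` that records the command instead of running ffmpeg, so it runs without external tools:

```python
from functools import partial
from typing import Any, Dict, Sequence


def _fake_convert(data: Dict[str, Any], *, tool: str, quality_opts: Sequence[str]) -> Dict[str, Any]:
    """Hypothetical stand-in: stores the command it would run instead of calling ffmpeg."""
    data = dict(data)  # work on a copy so the caller's dict is left untouched
    data["cmd"] = [tool, *quality_opts]
    return data


# Bind the keyword-only arguments; the result takes just the example dict.
map_func = partial(_fake_convert, tool="ffmpeg", quality_opts=["-q", "4"])
result = map_func({"id": "utt1"})
```

Here `result` carries the original `id` plus the bound options, which is the same shape of call a per-example map function would receive.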
    kwargs.pop("non_hashed_load_dataset_opts")
    kwargs.pop("non_hashed_map_opts")
As per the discussion in rwth-i6/sisyphus#274, should we make a copy here?
Well, my comment from there (rwth-i6/sisyphus#274 (comment)):

> But I see that there is a lot of code which currently does it that way, so I don't think we can change this now.

The (shallow) copy of kwargs would need to be done anyway in JobSingleton.__call__ if we want to fix rwth-i6/sisyphus#274. Modifying kwargs here would then be safe, and it is consistent with most other code.
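To make the concern concrete, here is a minimal sketch of why popping from kwargs is only safe when the caller passes a copy. It assumes the pops sit in a hash-style classmethod that receives the kwargs dict; `ToyJob`, `create_with_copy`, and `create_without_copy` are hypothetical names for illustration and not the actual Sisyphus implementation:

```python
from typing import Any, Dict


class ToyJob:
    """Hypothetical stand-in for a job class; not the real Sisyphus API."""

    @classmethod
    def hash(cls, kwargs: Dict[str, Any]) -> str:
        # Same pattern as in the diff: drop options that must not affect the hash.
        kwargs.pop("non_hashed_map_opts", None)
        return repr(sorted(kwargs.items()))


def create_without_copy(kwargs: Dict[str, Any]) -> str:
    # The pop inside hash() mutates the caller's dict.
    return ToyJob.hash(kwargs)


def create_with_copy(kwargs: Dict[str, Any]) -> str:
    # A shallow copy is enough here, because hash() only removes top-level keys.
    return ToyJob.hash(dict(kwargs))


args_kept = {"name": "large", "non_hashed_map_opts": {"num_proc": 8}}
hash_a = create_with_copy(args_kept)

args_mutated = {"name": "large", "non_hashed_map_opts": {"num_proc": 8}}
hash_b = create_without_copy(args_mutated)
```

Both calls yield the same hash, but only `args_kept` still contains `non_hashed_map_opts` afterwards. If the caller makes the shallow copy, the pop no longer leaks into the dict the user passed in.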
@Atticus1806 @robin-p-schmitt @NeoLegends @JackTemaki ping? anyone?
(Related: #626)