# Speed Perturbation #28
Adds speed perturbation.
This introduces non-trivial changes to the pre-processing pipeline, so it's worth giving a bit of background on why I've taken this approach. Some of this repeats an earlier Slack message.
TL;DR: it is necessary to use `torchaudio.sox_effect_chain`, which adds a large amount of complexity.

## 'Simpler' alternatives to using `sox`

I've tried two other ways of performing speed perturbation:

1. `librosa`. NVIDIA use this (https://github.com/ryanleary/mlperf-rnnt-ref/blob/fe0cc4145c240d4f8a8fe1814f397df63095e220/parts/perturb.py#L42).
2. `torchaudio.Resample`, applied directly to the input tensor.

Both of these are very slow: 1. converts to the frequency domain and back, while 2. is even slower when upsampling the signal (i.e. making it slower). For reference, on `copernicus` the methods are respectively x20 and x300 slower (!) than the `sox` implementation, and the dataloaders become the limiting factor during training.

For comparison, the `sox` version does add some overhead, but this is acceptable (+25% time per epoch, and this includes the fact that some sequences are 15% longer).

A third potential method (which NVIDIA also use: https://github.com/ryanleary/mlperf-rnnt-ref/blob/fe0cc4145c240d4f8a8fe1814f397df63095e220/utils/preprocessing_utils.py#L52) is performing the perturbation offline. This seems like a poor choice to me since (see the sketch after this list):
a) Each training sample has a fixed speed change - reducing augmentation effectiveness
b) This isn't scalable with training set size (to 60k/100k hrs) as the multiple dataset copies won't fit on the disk of a single machine.
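For concreteness, here's a minimal sketch of the online approach using the `SoxEffectsChain` API that torchaudio (0.x era) exposed; the function name and factor set are illustrative, and the actual wrapper in this PR may differ:

```python
import random

import torchaudio


def speed_perturb(filepath: str, sample_rate: int, factors=(0.9, 1.0, 1.1)):
    """Decode `filepath` with a randomly chosen speed factor applied by sox.

    sox's `speed` effect changes tempo and pitch; the trailing `rate`
    effect resamples the result back to the original sample rate.
    Assumes `torchaudio.initialize_sox()` has already been called in this
    process (see the `worker_init_fn` discussion below).
    """
    factor = random.choice(factors)
    chain = torchaudio.sox_effects.SoxEffectsChain()
    chain.set_input_file(filepath)  # effects run on the file, not a tensor
    chain.append_effect_to_chain("speed", [str(factor)])
    chain.append_effect_to_chain("rate", [str(sample_rate)])
    waveform, _ = chain.sox_build_flow_effects()  # decode + apply effects
    return waveform
```

Because a fresh factor is drawn on every call, each epoch sees a different mix of speeds, which is exactly what the offline approach can't give you.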
## Necessary changes

The complexity added by `sox_effects_chain` is that it must be applied to a filepath rather than a tensor. To deal with this I've split the audio transforms into two types:

1. `pre_load_transforms` - speed perturbation is of this type
2. `post_load_transforms` - all previous transforms are of this type

FYI, the high-level API treats `speed_perturbation` in exactly the same way as the other steps, but I've found it necessary for the `builders` + `dataset` to have knowledge of the two transform types, roughly as sketched below.
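A hypothetical sketch of that dataset-level dispatch (the class and argument names here are illustrative, not this PR's actual code):

```python
import torch.utils.data
import torchaudio


class AudioDataset(torch.utils.data.Dataset):
    """Map-style dataset that applies the two transform types in order."""

    def __init__(self, filepaths, pre_load_transforms=None,
                 post_load_transforms=None):
        self.filepaths = filepaths
        # filepath -> tensor (e.g. speed perturbation via a sox chain)
        self.pre_load_transforms = pre_load_transforms
        # tensor -> tensor (all of the previously existing transforms)
        self.post_load_transforms = post_load_transforms

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        path = self.filepaths[idx]
        if self.pre_load_transforms is not None:
            # pre-load transforms own the decode: they take the path
            waveform = self.pre_load_transforms(path)
        else:
            waveform, _ = torchaudio.load(path)
        if self.post_load_transforms is not None:
            waveform = self.post_load_transforms(waveform)
        return waveform
```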
It is also necessary to add a `worker_init_fn` to avoid seg-faults when sox is being used 😱 - I think the lack of this fn led samG to think that `sox` wasn't thread-safe.
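For reference, one plausible shape for that hook, assuming the fix is to initialise sox once per dataloader worker process using torchaudio's (0.x-era) `initialize_sox`; the exact body in this PR may differ:

```python
import torchaudio
from torch.utils.data import DataLoader


def sox_worker_init_fn(worker_id: int) -> None:
    # Each dataloader worker is a separate process; give each its own
    # sox initialisation rather than inheriting state from the parent.
    torchaudio.initialize_sox()


def make_loader(dataset, batch_size: int = 8) -> DataLoader:
    return DataLoader(dataset, batch_size=batch_size, num_workers=4,
                      worker_init_fn=sox_worker_init_fn)
```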