Skip to content

Commit 3e59c1c

Browse files
committed
Add neural audio codec examples and audio files for EnCodec in documentation
1 parent 4e08d4e commit 3e59c1c

File tree

4 files changed

+126
-1
lines changed

4 files changed

+126
-1
lines changed
469 KB
Binary file not shown.
469 KB
Binary file not shown.

docs/audio/neural_codec_input.wav

861 KB
Binary file not shown.

docs/audio_intelligence/neural_audio_codecs.md

Lines changed: 126 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -204,7 +204,132 @@ HILCodec matches or outperforms both traditional and leading neural codecs (like
204204
<figcaption>HILCodec Performance Results. Source: [3]</figcaption>
205205
</figure>
206206

207+
## Code
208+
209+
Let's walk through a hands-on example to see how to use one of these neural audio codecs in practice. We will use EnCodec [2].
210+
211+
```python linenums="1"
212+
# Load the required packages
213+
from audiotools import AudioSignal
214+
from transformers import EncodecModel, AutoProcessor
215+
216+
# Step 1: load the EnCodec model and processor
217+
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
218+
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
219+
# Note: you can also use "facebook/encodec_48khz" for higher sampling rates
220+
221+
# Step 2: load an audio file
222+
audio_file = "samples/neural_codec_input.wav" # replace with your audio file path
223+
audio = AudioSignal(audio_file)
224+
audio = audio.resample(24000) # resample to 24kHz
225+
audio = audio.to_mono() # convert to mono
226+
frame_rate = audio.sample_rate # get the sample rate
227+
audio_sample = audio.audio_data.reshape(-1) # get the reshaped audio samples
228+
229+
# Step 3: Preprocess the audio
230+
inputs = processor(raw_audio=audio_sample,
231+
sampling_rate=frame_rate,
232+
return_tensors="pt")
233+
234+
# Step 4: Encode the audio
235+
encoder_outputs = model.encode(inputs["input_values"],
236+
inputs["padding_mask"])
237+
238+
# Step 5: Decode the audio
239+
audio_values = model.decode(encoder_outputs.audio_codes,
240+
encoder_outputs.audio_scales,
241+
inputs["padding_mask"])[0]
242+
243+
# OR replace Step 4 and 5: Forward pass on the input
244+
audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values
245+
```
246+
Let's go through the important code steps:
247+
248+
### Load the EnCodec Model and Processor
249+
250+
We use the `facebook/encodec_24khz` model, optimized for 24 kHz audio. The processor handles audio preprocessing, including splitting audio for batch operations and creating the `padding_mask`, that indicates which positions in the input are real audio (1) and which are padding (0) to be ignored by the model.
251+
252+
!!! Hint
253+
You can also use `facebook/encodec_48khz` for higher sampling rates, but in this example, we will stick to 24 kHz for simplicity.
254+
255+
### Loading and Preprocessing the Audio
256+
257+
We load an audio file, resample it to 24 kHz, and convert it to mono. The audio data is reshaped into a 1D array for processing. In this example, we use a sample audio file named `neural_codec_input.wav` *(listen to it below)* that is 10 seconds long and after resampling, has the shape of `torch.Size([240000])` and looks like `tensor([0.0088, 0.0117, 0.0194, ..., 0.0390, 0.0460, 0.0213])`
258+
259+
Original Audio:
260+
<audio controls>
261+
<source src="../../audio/neural_codec_input.wav" type="audio/wav">
262+
Your browser does not support the audio element.
263+
</audio>
264+
265+
### Preprocessing the Audio
266+
267+
The audio is processed using the EnCodec processor, which prepares it for encoding by creating input tensors and a padding mask. The padding mask is crucial for handling variable-length audio inputs, ensuring that the model only processes valid audio samples. The `inputs` looks like `{'input_values': tensor([[ 0.0088, 0.0117, 0.0194, ..., 0.0390, 0.0460, 0.0213]]), 'padding_mask': tensor([[1, 1, 1, ..., 1, 1, 1]])}`. If you notice, the `input_values` tensor contains the audio samples exactly as we have loaded, and the `padding_mask` indicates that all positions are valid (1) since we have a single audio sample without padding.
268+
269+
### Encoding the Audio
270+
271+
The `encode` method processes the input audio, producing quantized latent representations (`audio_codes`) and scales for decoding. This step compresses the audio into a more compact form. One important aspect here is the bandwidth - this is how much data the compressed audio will use per second. Lower bandwidth means smaller files but lower audio quality; higher bandwidth means better quality but larger files. Bandwidth is correlated to the codebooks used in the quantization step, the relation is shown below, *(for Encodec 24kHz model)*:
272+
273+
| Bandwidth (kbps) | Number of Codebooks (n_q) |
274+
|------------------|--------------------------|
275+
| 1.5 | 2 |
276+
| 3 | 4 |
277+
| 6 | 8 |
278+
| 12 | 16 |
279+
| 24 | 32 |
280+
281+
So, `bandwidth = 1.5` will use 2 codebooks, while `bandwidth = 24` will use 32 codebooks. The number of codebooks directly affects the quality and size of the compressed audio. If we try with `bandwidth = 1.5`, the `audio_codes` will have shape `torch.Size([1, 1, 2, 750])` and looks like
282+
283+
```
284+
tensor([[[[727, 407, 906, ..., 561, 424, 925],
285+
[946, 734, 949, ..., 673, 769, 987]]]])
286+
```
287+
288+
But in case of `bandwidth = 24`, the `audio_codes` will have shape `torch.Size([1, 1, 32, 750])` and looks like
289+
290+
```
291+
tensor([[[[ 727, 407, 906, ..., 561, 424, 925],
292+
[ 946, 734, 949, ..., 673, 769, 987],
293+
[ 988, 21, 623, ..., 870, 1023, 452],
294+
...,
295+
[ 792, 792, 220, ..., 419, 1011, 422],
296+
[ 502, 550, 893, ..., 328, 832, 450],
297+
[ 681, 906, 872, ..., 820, 601, 658]]]])
298+
```
299+
300+
!!! Hint
301+
If you're wondering how bandwidth relates to the number of codebooks in EnCodec, here's how it works: The encoder produces 75 steps per second of audio. So, for a 10-second clip, there are 750 steps. At each step, the model outputs one code per codebook (`N_q`). For example, with `bandwidth = 1.5`, there are 2 codebooks, so you get a total of 2 × 750 = 1500 codes, which corresponds to 1.5 kbps. With `bandwidth = 24`, there are 32 codebooks, resulting in 32 × 750 = 24,000 codes, or 24 kbps. In summary: more codebooks mean higher bandwidth and better quality, but also larger compressed files.
302+
303+
Notice one thing, an audio of 10 seconds that used 240k samples is now compressed into (N_q, 750) where N_q is the number of codebooks used. For `bandwidth = 1.5`, the shape is (2, 750) and the compression ratio is 160x and for `bandwidth = 24`, the shape is (32, 750) and the compression ratio is 10x! Quite impressive, right?
304+
305+
### Decoding the Audio
306+
307+
The `decode` method reconstructs the audio from the quantized codes and scales. The output is a tensor of audio samples, which can be saved as a WAV file or played directly. The shape of the output audio tensor will be `(1, 240000)` for 10 seconds of audio at 24 kHz.
308+
309+
If you play the audio file `neural_codec_input.wav`, you will hear the original audio. After running the code, you can listen to the output generated by EnCodec. Both are presented below,
310+
311+
Original Audio: (48KHz)
312+
<audio controls>
313+
<source src="../../audio/neural_codec_input.wav" type="audio/wav">
314+
Your browser does not support the audio element.
315+
</audio>
316+
317+
Encodec Output: (24KHz; Bandwidth: 1.5 kbps)
318+
<audio controls>
319+
<source src="../../audio/neural_codec_encodec_output_bandwidth1.5.wav" type="audio/wav">
320+
Your browser does not support the audio element.
321+
</audio>
322+
323+
Encodec Output: (24KHz; Bandwidth: 24 kbps)
324+
<audio controls>
325+
<source src="../../audio/neural_codec_encodec_output_bandwidth24.wav" type="audio/wav">
326+
Your browser does not support the audio element.
327+
</audio>
328+
329+
As you can hear, while there are some distortions, the output is audible and is able to maintain the speech of the original audio. This demonstrates the effectiveness of neural codecs in audio compression.
330+
207331
## Comparative Analysis
332+
208333
The evolution of neural codecs reveals several key trends:
209334

210335
| Characteristic | SoundStream[1] | EnCodec[2] | HILCodec[3] |
@@ -242,6 +367,6 @@ Neural audio codecs represent a paradigm shift in audio compression, offering un
242367

243368
[1] SoundStream: An End-to-End Neural Audio Codec - [Paper](https://arxiv.org/abs/2107.03312) | [Video](https://www.youtube.com/watch?v=V4jj-yhiclk&ab_channel=RISEResearchInstitutesofSweden)
244369

245-
[2] [EnCodec: High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)
370+
[2] EnCodec: High Fidelity Neural Audio Compression - [Paper](https://arxiv.org/abs/2210.13438) | [Code](https://github.com/facebookresearch/encodec)
246371

247372
[3] [HILCodec: High Fidelity and Lightweight Neural Audio Codec](https://arxiv.org/pdf/2405.04752v1)

0 commit comments

Comments
 (0)