We use the `facebook/encodec_24khz` model, optimized for 24 kHz audio. The processor handles audio preprocessing, including splitting audio for batch operations and creating the `padding_mask`, which indicates which positions in the input are real audio (1) and which are padding (0) to be ignored by the model.
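A minimal sketch of loading both components with the Hugging Face `transformers` library (the checkpoint is the one named above):

```python
from transformers import AutoProcessor, EncodecModel

# Load the pre-trained 24 kHz EnCodec checkpoint and its matching processor
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
```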
!!! Hint
    You can also use `facebook/encodec_48khz` for higher sampling rates, but in this example we will stick to 24 kHz for simplicity.
### Loading and Preprocessing the Audio
We load an audio file, resample it to 24 kHz, and convert it to mono. The audio data is reshaped into a 1D array for processing. In this example, we use a sample audio file named `neural_codec_input.wav` *(listen to it below)* that is 10 seconds long. After resampling, it has the shape `torch.Size([240000])` and looks like `tensor([0.0088, 0.0117, 0.0194, ..., 0.0390, 0.0460, 0.0213])`.
The audio is then processed using the EnCodec processor, which prepares it for encoding by creating input tensors and a padding mask. The padding mask is crucial for handling variable-length audio inputs, ensuring that the model only processes valid audio samples. The `inputs` look like `{'input_values': tensor([[ 0.0088, 0.0117, 0.0194, ..., 0.0390, 0.0460, 0.0213]]), 'padding_mask': tensor([[1, 1, 1, ..., 1, 1, 1]])}`. Notice that the `input_values` tensor contains the audio samples exactly as we loaded them, and the `padding_mask` marks every position as valid (1), since we have a single audio sample without padding.
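As a rough sketch of these two steps, here we use `librosa` for loading and resampling (any resampling-capable loader works), with `processor` coming from the earlier snippet:

```python
import librosa
import torch

# Load the sample file at 24 kHz as a mono 1-D float array
audio, sr = librosa.load("neural_codec_input.wav", sr=24000, mono=True)
print(torch.tensor(audio).shape)  # torch.Size([240000]) for 10 s at 24 kHz

# The processor builds the input tensor and the matching padding mask
inputs = processor(raw_audio=audio, sampling_rate=sr, return_tensors="pt")
print(inputs.keys())  # dict_keys(['input_values', 'padding_mask'])
```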
### Encoding the Audio
The `encode` method processes the input audio, producing quantized latent representations (`audio_codes`) and scales for decoding. This step compresses the audio into a much more compact form. One important aspect here is the bandwidth: how much data the compressed audio uses per second. Lower bandwidth means smaller files but lower audio quality; higher bandwidth means better quality but larger files. Bandwidth is tied to the number of codebooks used in the quantization step; the relation is shown below *(for the EnCodec 24 kHz model)*:
| Bandwidth (kbps) | Number of Codebooks (n_q) |
|------------------|---------------------------|
| 1.5              | 2                         |
| 3                | 4                         |
| 6                | 8                         |
| 12               | 16                        |
| 24               | 32                        |
So, `bandwidth = 1.5` will use 2 codebooks, while `bandwidth = 24` will use 32 codebooks. The number of codebooks directly affects the quality and size of the compressed audio. If we encode with `bandwidth = 1.5`, the `audio_codes` will have shape `torch.Size([1, 1, 2, 750])` and look like:
```
tensor([[[[727, 407, 906,  ..., 561, 424, 925],
          [946, 734, 949,  ..., 673, 769, 987]]]])
```
With `bandwidth = 24`, on the other hand, the `audio_codes` will have shape `torch.Size([1, 1, 32, 750])` and look like:
```
tensor([[[[ 727,  407,  906,  ...,  561,  424,  925],
          [ 946,  734,  949,  ...,  673,  769,  987],
          [ 988,   21,  623,  ...,  870, 1023,  452],
          ...,
          [ 792,  792,  220,  ...,  419, 1011,  422],
          [ 502,  550,  893,  ...,  328,  832,  450],
          [ 681,  906,  872,  ...,  820,  601,  658]]]])
```
!!! Hint
    If you're wondering how bandwidth relates to the number of codebooks in EnCodec, here's how it works: the encoder produces 75 steps per second of audio, so a 10-second clip has 750 steps. At each step, the model outputs one code per codebook (`N_q`), and each code is an index into a 1024-entry codebook, so it costs 10 bits. With `bandwidth = 1.5`, there are 2 codebooks: 2 × 75 steps/s × 10 bits = 1,500 bits/s, i.e. 1.5 kbps. With `bandwidth = 24`, there are 32 codebooks: 32 × 75 × 10 = 24,000 bits/s, or 24 kbps. In summary: more codebooks mean higher bandwidth and better quality, but also larger compressed files.
Notice one thing: 10 seconds of audio that took 240,000 samples is now compressed into an (N_q, 750) grid of codes, where N_q is the number of codebooks used. For `bandwidth = 1.5`, the shape is (2, 750), a compression ratio of 160x, and even for `bandwidth = 24`, the shape is (32, 750), still a ratio of 10x! Quite impressive, right?
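Putting the encoding step into code, here is a minimal sketch that reuses the `model` and `inputs` from the earlier snippets; the `bandwidth` argument selects a row of the table above:

```python
# Encode at the lowest bandwidth: 1.5 kbps -> 2 codebooks
encoder_outputs = model.encode(
    inputs["input_values"], inputs["padding_mask"], bandwidth=1.5
)
print(encoder_outputs.audio_codes.shape)  # torch.Size([1, 1, 2, 750])
```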
### Decoding the Audio
The `decode` method reconstructs the audio from the quantized codes and scales. The output is a tensor of audio samples, which can be saved as a WAV file or played directly. The shape of the output audio tensor will be `(1, 240000)` for 10 seconds of audio at 24 kHz.
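As a minimal sketch of this step *(the output file name `neural_codec_output.wav` is just an illustrative choice, and `soundfile` is one of several ways to write a WAV file)*:

```python
import soundfile as sf

# Reconstruct the waveform from the quantized codes and their scales
audio_values = model.decode(
    encoder_outputs.audio_codes,
    encoder_outputs.audio_scales,
    inputs["padding_mask"],
)[0]

# Drop the batch/channel dimensions and save the reconstruction for listening
sf.write("neural_codec_output.wav", audio_values.squeeze().detach().numpy(), 24000)
```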
If you play the audio file `neural_codec_input.wav`, you will hear the original audio. After running the code, you can listen to the output generated by EnCodec. Both are presented below. As you can hear, while there are some distortions, the output is clearly audible and preserves the speech of the original audio. This demonstrates the effectiveness of neural codecs in audio compression.
## Comparative Analysis
The evolution of neural codecs reveals several key trends: