Commit 257eb22

Merge pull request #1 from wkzng/relaxation_fps_constraint
0.5.0 Release
2 parents a6bcd31 + b7ee6fa commit 257eb22

20 files changed, +252 -117 lines changed

README.md

Lines changed: 52 additions & 16 deletions
@@ -1,21 +1,45 @@
 # iSincNet (Lightweight Sincnet Spectrogram Vocoder)

-[[Blog]](https://gitlab.com/sonustech/sincnet) [[SincNet Paper]](https://arxiv.org/abs/1808.00158)
+[[Blog]](https://github.com/wkzng/iSincNet) [[Original SincNet Paper (M. Ravanelli, Y. Bengio)]](https://arxiv.org/abs/1808.00158)

 iSincNet is a fast and lightweight SincNet spectrogram vocoder: a neural network trained to reconstruct audio waveforms from their SincNet spectrogram (a real, signed 2D representation). We used the GTZAN dataset, the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). The files were collected in 2000-2001 from a variety of sources, including personal CDs, radio, and microphone recordings, to represent a variety of recording conditions (http://marsyas.info/downloads/datasets.html).

 <p align="center">
 <img src=illustrations/SincNet-Filterbank.png alt="Fast and Lightweight Sincnet Spectrogram Vocoder" width="80%"/>
 </p>

-# TODO: Benchmark
-colums: architecture or method | dataset | MSE | MAE | SNR | checkpoint
-
-datasets:
+Datasets used during development:
 - [GTZAN](https://github.com/chittalpatel/Music-Genre-Classification-GTZAN)
 - [MUSDB-18](https://sigsep.github.io/datasets/musdb.html)


+## Example Spectrogram
+The first 5 seconds of the audio file `audio/invertibility/15033000.mp3`:
+
+| | Non-causal Encoder | Causal Encoder |
+|:------:|:-------------------:|:--------------:|
+| signed values | <img src="illustrations/spec_noncausal_signed.jpeg" alt="non-causal 15033000" width="260"> | <img src="illustrations/spec_causal_signed.jpeg" alt="causal 15033000" width="260"> |
+| abs values | <img src="illustrations/spec_noncausal_abs.jpeg" alt="non-causal 15033000" width="260"> | <img src="illustrations/spec_causal_abs.jpeg" alt="causal 15033000" width="260"> |
+
+
+### 🎧 Pretrained Models
+The following table summarizes the key characteristics and access points for the available pretrained models.
+All models are open-source and stored in the `pretrained/` folder.
+
+| Sample Rate | Frame rate | Bins | Weights | Corpus | Causal Encoder | Scale | Open-Source |
+|:------------:|:---:|:-----:|:--------|:--------|:----------------:|:-------:|:------------:|
+| 16000 | 128 | 128 | [📦](pretrained/16000fs_128fps_128bins_lin_complex_ncausal.ckpt) | GTZAN || Linear ||
+| 16000 | 128 | 128 | [📦](pretrained/16000fs_128fps_128bins_lin_real_causal.ckpt) | GTZAN || Linear ||
+| 16000 | 128 | 256 | [📦](pretrained/16000fs_128fps_256bins_mel_complex_ncausal.ckpt) | GTZAN || Mel ||
+| 44100 | 210 | 256 | [📦](pretrained/44100fs_210fps_256bins_lin_complex_ncausal.ckpt) | GTZAN || Linear ||
+| 44100 | 210 | 512 | [📦](pretrained/44100fs_210fps_512bins_mel_complex_ncausal.ckpt) | GTZAN || Mel ||
+| 44100 | 350 | 128 | [📦](pretrained/44100fs_350fps_128bins_lin_real_causal.ckpt) | GTZAN || Linear ||
+| 44100 | 350 | 128 | [📦](pretrained/44100fs_350fps_128bins_lin_complex_ncausal.ckpt) | GTZAN || Linear ||
+| 44100 | 350 | 256 | [📦](pretrained/44100fs_350fps_256bins_mel_complex_ncausal.ckpt) | GTZAN || Mel ||
+| 44100 | 350 | 256 | [📦](pretrained/44100fs_350fps_256bins_mel_real_causal.ckpt) | GTZAN || Mel ||
+
+
+
 ## Quick Start
 ```bash
 pip install -r requirements.txt
@@ -30,13 +54,27 @@ import torch
 from sincnet.model import SincNet, Quantizer
 from datasets.utils.waveform import WaveformLoader

-# load the model
+
+SAMPLE_RATE = 16_000
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-model = SincNet().load_pretrained_weights().eval().to(device)
-processor = WaveformLoader(sample_rate=SAMPLE_RATE)
+audio_loader = WaveformLoader(sample_rate=SAMPLE_RATE)
+
+# load the model
+params = {
+    "fs": SAMPLE_RATE,
+    "fps": 128,
+    "scale": "lin",
+    "component": "complex"
+}
+
+model : SincNet = (
+    SincNet(**params)
+    .load_pretrained_weights(weights_folder="pretrained", verbose=False)
+    .eval()
+    .to(device)
+)

 # encode and decode an audio waveform
-sample_rate = 16_000
 duration = 5
 offset = 0
 audio_path = ...
@@ -49,11 +87,11 @@ with torch.no_grad():
     spectrogram = model.encode(audio_tensor.unsqueeze(0), scale="mel")
     reconstructed_audio_tensor = model.decode(spectrogram, scale="mel")

-#(optional) elementwise quantization into a vocabulary of size 2^{q_bits}
+#(optional) elementwise quantization into a discrete vocabulary of size 2^{q_bits}
 quantizer = Quantizer(q_bits=10).to(device)
 indices = quantizer(spectrogram)
-detokenized_spectrogram = tokenizer.inverse(indices)
-detokenized_audio = model.decode(detokenized_spectrogram)
+dequantized_spectrogram = quantizer.inverse(indices)
+dequantized_audio = model.decode(dequantized_spectrogram)
 ```


@@ -85,9 +123,7 @@ Related discussion about SincNet vs STFT https://github.com/mravanelli/SincNet/i


 ## Roadmap and projects status
-- [x] Added Automatic projection forward and backward to MEL scale
+- [x] Host weights in Github and add auto-download
 - [ ] Benchmark of inversion vs Griffin-Lim, iSTFTNet
-- [ ] Host weights in cloud and add auto-download

-## Contributions and acknowledgment
-Show your appreciation to those who have contributed to the project.
+## Contributions and acknowledgment (TODO)
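
The optional quantization step in the README's Quick Start maps each spectrogram value to one of 2^{q_bits} integer indices. The following is only a minimal sketch of one way such an elementwise mapping could work, assuming values are clipped to [-1, 1] and binned uniformly. The function names `quantize` and `dequantize`, and the value range, are hypothetical and are not taken from the repository's `Quantizer` class.

```python
def quantize(values, q_bits=10, lo=-1.0, hi=1.0):
    """Map each float to an integer index in [0, 2**q_bits - 1] (uniform bins).

    Assumption: inputs live in [lo, hi]; anything outside is clipped.
    """
    levels = 2 ** q_bits
    step = (hi - lo) / (levels - 1)
    indices = []
    for v in values:
        clipped = min(max(v, lo), hi)
        indices.append(round((clipped - lo) / step))
    return indices


def dequantize(indices, q_bits=10, lo=-1.0, hi=1.0):
    """Map integer indices back to the centre value of their bin."""
    levels = 2 ** q_bits
    step = (hi - lo) / (levels - 1)
    return [lo + i * step for i in indices]


if __name__ == "__main__":
    samples = [-1.0, -0.25, 0.0, 0.3, 1.0]
    idx = quantize(samples)
    restored = dequantize(idx)
    print(idx)
    print(restored)
```

With `q_bits=10` the round-trip error per element is at most half a bin width, i.e. about 0.001 on a [-1, 1] range. A signed range is used here because the README describes the spectrogram as real and signed.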

datasets/configs.py

Lines changed: 5 additions & 4 deletions
@@ -11,11 +11,12 @@

 class BaseDatasetConfig:
     id :str
-    sample_rate:int = int(os.getenv("SAMPLE_RATE"))
+    sample_rate:int
     audio_root: str = os.getenv("H5_DIRECTORY")

-    def __init__(self, id:str) -> None:
+    def __init__(self, id:str, sample_rate:int=None) -> None:
         super().__init__()
+        self.sample_rate = sample_rate or int(os.getenv("SAMPLE_RATE", 16000))
         self.id = id

     @property
@@ -31,5 +32,5 @@ def hdf5(self) -> str:

 class GTZANConfig(BaseDatasetConfig):
     """ traceability: https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification """
-    def __init__(self, id:str="gtzan") -> None:
-        super().__init__(id=id)
+    def __init__(self, sample_rate:int=None) -> None:
+        super().__init__(id="gtzan", sample_rate=sample_rate)
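
The change above moves the sample-rate lookup from class-definition time to construction time and adds a default. The resolution order can be illustrated with a reduced sketch; `DatasetConfig` here is a hypothetical stand-in that copies only the sample-rate line from the diff, not the full `BaseDatasetConfig`.

```python
import os
from typing import Optional


class DatasetConfig:
    """Hypothetical stand-in reduced to the sample-rate resolution logic."""

    def __init__(self, id: str, sample_rate: Optional[int] = None) -> None:
        # Explicit argument first, then the SAMPLE_RATE environment variable,
        # then the hard-coded default of 16000.
        self.sample_rate = sample_rate or int(os.getenv("SAMPLE_RATE", 16000))
        self.id = id


if __name__ == "__main__":
    os.environ.pop("SAMPLE_RATE", None)
    print(DatasetConfig("gtzan").sample_rate)         # 16000 (default)

    os.environ["SAMPLE_RATE"] = "44100"
    print(DatasetConfig("gtzan").sample_rate)         # 44100 (environment)
    print(DatasetConfig("gtzan", 22050).sample_rate)  # 22050 (explicit argument)
```

Because the fallback uses `or`, a falsy argument such as `0` is treated the same as `None` and falls through to the environment variable. Importing the module also no longer fails when `SAMPLE_RATE` is unset, which the previous class-level `int(os.getenv("SAMPLE_RATE"))` would have done.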

demo.ipynb

Lines changed: 51 additions & 49 deletions
Large diffs are not rendered by default.

illustrations/spec_causal_abs.jpeg

68.5 KB
55.3 KB
57 KB
36 KB
-1.59 MB
Binary file not shown.
-1.04 MB
Binary file not shown.
3.18 MB
Binary file not shown.
