  # iSincNet (Lightweight SincNet Spectrogram Vocoder)

- [[Blog]](https://gitlab.com/sonustech/sincnet) [[SincNet Paper]](https://arxiv.org/abs/1808.00158)
+ [[Blog]](https://github.com/wkzng/iSincNet) [[Original SincNet Paper (M. Ravanelli, Y. Bengio)]](https://arxiv.org/abs/1808.00158)

  iSincNet is a fast and lightweight SincNet spectrogram vocoder: a neural network trained to reconstruct audio waveforms from their SincNet spectrogram (a real-valued, signed 2D representation). We used the GTZAN dataset, the most widely used public dataset for evaluation in machine listening research on music genre recognition (MGR). The files were collected in 2000-2001 from a variety of sources, including personal CDs, radio, and microphone recordings, in order to represent a variety of recording conditions (http://marsyas.info/downloads/datasets.html).

  <p align="center">
    <img src="illustrations/SincNet-Filterbank.png" alt="Fast and Lightweight Sincnet Spectrogram Vocoder" width="80%"/>
  </p>

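For readers new to SincNet filterbanks: in the original SincNet paper, each filter is a band-pass filter defined only by its low and high cutoff frequencies, obtained as the difference of two sinc low-pass filters and smoothed with a window. The snippet below is a minimal pure-Python sketch of that construction. The function name `sinc_bandpass`, the kernel size, and the Hamming window are illustrative assumptions and are not taken from this repository's code.

```python
import math


def sinc_bandpass(f_low, f_high, kernel_size=101, sample_rate=16_000):
    """Taps of a Hamming-windowed sinc band-pass filter (cutoffs in Hz).

    Hypothetical helper for illustration; kernel_size is assumed odd and > 1.
    """
    if not 0 <= f_low < f_high <= sample_rate / 2:
        raise ValueError("need 0 <= f_low < f_high <= Nyquist frequency")
    f1 = f_low / sample_rate   # normalized low cutoff (cycles per sample)
    f2 = f_high / sample_rate  # normalized high cutoff (cycles per sample)
    half = (kernel_size - 1) / 2

    def lowpass(fc, n):
        # Ideal low-pass impulse response: 2*fc*sinc(2*fc*n),
        # with sinc(x) = sin(pi*x) / (pi*x) and sinc(0) = 1.
        x = 2 * fc * n
        if x == 0:
            return 2 * fc
        return 2 * fc * math.sin(math.pi * x) / (math.pi * x)

    taps = []
    for i in range(kernel_size):
        n = i - half  # time index centered on the middle tap
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        # Band-pass = (low-pass at f2) minus (low-pass at f1), then windowed.
        taps.append((lowpass(f2, n) - lowpass(f1, n)) * window)
    return taps


taps = sinc_bandpass(300, 3000)
print(len(taps))  # 101
```

Because only the two cutoffs are free parameters, a bank of such filters needs far fewer parameters than a bank of unconstrained convolution kernels of the same length.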
- # TODO: Benchmark
- columns: architecture or method | dataset | MSE | MAE | SNR | checkpoint
-
- datasets:
+ Datasets used during development:
  - [GTZAN](https://github.com/chittalpatel/Music-Genre-Classification-GTZAN)
  - [MUSDB-18](https://sigsep.github.io/datasets/musdb.html)


+ ## Example Spectrogram
+ The first 5 seconds of the audio file `audio/invertibility/15033000.mp3`:
+
+ | | Non-causal Encoder | Causal Encoder |
+ |:------:|:-------------------:|:--------------:|
+ | signed values | <img src="illustrations/spec_noncausal_signed.jpeg" alt="non-causal 15033000" width="260"> | <img src="illustrations/spec_causal_signed.jpeg" alt="causal 15033000" width="260"> |
+ | abs values | <img src="illustrations/spec_noncausal_abs.jpeg" alt="non-causal 15033000" width="260"> | <img src="illustrations/spec_causal_abs.jpeg" alt="causal 15033000" width="260"> |
+
+
+ ### 🎧 Pretrained Models
+ The following table summarizes the key characteristics and download links of the available pretrained models.
+ All models are open-source and stored in the `pretrained/` folder.
+
+ | Sample Rate (Hz) | Frame Rate (fps) | Bins | Weights | Corpus | Causal Encoder | Scale | Open-Source |
+ |:------------:|:---:|:-----:|:--------|:--------|:----------------:|:-------:|:------------:|
+ | 16000 | 128 | 128 | [📦](pretrained/16000fs_128fps_128bins_lin_complex_ncausal.ckpt) | GTZAN | ✗ | Linear | √ |
+ | 16000 | 128 | 128 | [📦](pretrained/16000fs_128fps_128bins_lin_real_causal.ckpt) | GTZAN | √ | Linear | √ |
+ | 16000 | 128 | 256 | [📦](pretrained/16000fs_128fps_256bins_mel_complex_ncausal.ckpt) | GTZAN | ✗ | Mel | √ |
+ | 44100 | 210 | 256 | [📦](pretrained/44100fs_210fps_256bins_lin_complex_ncausal.ckpt) | GTZAN | ✗ | Linear | √ |
+ | 44100 | 210 | 512 | [📦](pretrained/44100fs_210fps_512bins_mel_complex_ncausal.ckpt) | GTZAN | ✗ | Mel | √ |
+ | 44100 | 350 | 128 | [📦](pretrained/44100fs_350fps_128bins_lin_real_causal.ckpt) | GTZAN | √ | Linear | √ |
+ | 44100 | 350 | 128 | [📦](pretrained/44100fs_350fps_128bins_lin_complex_ncausal.ckpt) | GTZAN | ✗ | Linear | √ |
+ | 44100 | 350 | 256 | [📦](pretrained/44100fs_350fps_256bins_mel_complex_ncausal.ckpt) | GTZAN | ✗ | Mel | √ |
+ | 44100 | 350 | 256 | [📦](pretrained/44100fs_350fps_256bins_mel_real_causal.ckpt) | GTZAN | √ | Mel | √ |
+
+
+
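The checkpoint filenames in the table appear to follow the pattern `{sample rate}fs_{frame rate}fps_{bins}bins_{scale}_{component}_{causality}.ckpt`. As a rough sketch, the snippet below parses that pattern and derives the hop size in samples as sample rate divided by frame rate. The parser, the `CheckpointInfo` fields, and the assumption that the encoder's hop equals this ratio are all inferred from the filenames; none of this is an API of the repository.

```python
import re
from typing import NamedTuple


class CheckpointInfo(NamedTuple):
    """Fields inferred from a checkpoint filename (hypothetical structure)."""
    sample_rate: int
    frame_rate: int
    bins: int
    scale: str      # "lin" or "mel"
    component: str  # "real" or "complex"
    causal: bool


_PATTERN = re.compile(
    r"(?P<fs>\d+)fs_(?P<fps>\d+)fps_(?P<bins>\d+)bins_"
    r"(?P<scale>lin|mel)_(?P<component>real|complex)_(?P<causal>n?causal)\.ckpt$"
)


def parse_checkpoint_name(path: str) -> CheckpointInfo:
    """Extract the model settings encoded in a checkpoint path."""
    match = _PATTERN.search(path)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {path}")
    return CheckpointInfo(
        sample_rate=int(match["fs"]),
        frame_rate=int(match["fps"]),
        bins=int(match["bins"]),
        scale=match["scale"],
        component=match["component"],
        causal=match["causal"] == "causal",  # "ncausal" means non-causal
    )


def hop_length(info: CheckpointInfo) -> float:
    """Samples between consecutive frames, assuming hop = sample rate / frame rate."""
    return info.sample_rate / info.frame_rate


info = parse_checkpoint_name("pretrained/16000fs_128fps_128bins_lin_real_causal.ckpt")
print(info.causal, hop_length(info))  # True 125.0
```

Under the same assumption, the 44100 Hz models at 210 and 350 frames per second would advance by 210 and 126 samples per frame, respectively.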
  ## Quick Start
  ```bash
  pip install -r requirements.txt
  ```
```diff
@@ -30,13 +54,27 @@ import torch
  from sincnet.model import SincNet, Quantizer
  from datasets.utils.waveform import WaveformLoader

- # load the model
+
+ SAMPLE_RATE = 16_000
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
- model = SincNet().load_pretrained_weights().eval().to(device)
- processor = WaveformLoader(sample_rate=SAMPLE_RATE)
+ audio_loader = WaveformLoader(sample_rate=SAMPLE_RATE)
+
+ # load the model
+ params = {
+     "fs": SAMPLE_RATE,
+     "fps": 128,
+     "scale": "lin",
+     "component": "complex"
+ }
+
+ model: SincNet = (
+     SincNet(**params)
+     .load_pretrained_weights(weights_folder="pretrained", verbose=False)
+     .eval()
+     .to(device)
+ )

  # encode and decode an audio waveform
- sample_rate = 16_000
  duration = 5
  offset = 0
  audio_path = ...
@@ -49,11 +87,11 @@ with torch.no_grad():
      spectrogram = model.encode(audio_tensor.unsqueeze(0), scale="mel")
      reconstructed_audio_tensor = model.decode(spectrogram, scale="mel")

- # (optional) elementwise quantization into a vocabulary of size 2^{q_bits}
+ # (optional) elementwise quantization into a discrete vocabulary of size 2^{q_bits}
  quantizer = Quantizer(q_bits=10).to(device)
  indices = quantizer(spectrogram)
- detokenized_spectrogram = tokenizer.inverse(indices)
- detokenized_audio = model.decode(detokenized_spectrogram)
+ dequantized_spectrogram = quantizer.inverse(indices)
+ dequantized_audio = model.decode(dequantized_spectrogram)
```


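To make the optional quantization step more concrete, here is a minimal pure-Python sketch of elementwise uniform quantization into a vocabulary of size `2**q_bits`, together with its inverse. It assumes spectrogram values lie in `[-1, 1]` and are mapped on a uniform grid; the repository's `Quantizer` may use a different range or a non-uniform mapping, so treat `quantize` and `dequantize` as hypothetical helpers rather than the project's implementation.

```python
def quantize(values, q_bits=10, lo=-1.0, hi=1.0):
    """Map each value to an integer index in [0, 2**q_bits - 1] (uniform grid)."""
    levels = 2 ** q_bits
    step = (hi - lo) / (levels - 1)  # spacing between adjacent grid points
    indices = []
    for v in values:
        clipped = min(max(v, lo), hi)  # values outside [lo, hi] saturate
        indices.append(round((clipped - lo) / step))
    return indices


def dequantize(indices, q_bits=10, lo=-1.0, hi=1.0):
    """Map integer indices back to the grid values they represent."""
    step = (hi - lo) / (2 ** q_bits - 1)
    return [lo + i * step for i in indices]


values = [-1.0, -0.3, 0.0, 0.42, 1.0]
indices = quantize(values)
restored = dequantize(indices)
print(indices[0], indices[-1])  # 0 1023
```

With this scheme the round-trip error per element is at most half a grid step, which for `q_bits=10` on `[-1, 1]` is about 0.001.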
@@ -85,9 +123,7 @@ Related discussion about SincNet vs STFT https://github.com/mravanelli/SincNet/i


  ## Roadmap and project status
- - [x] Added Automatic projection forward and backward to MEL scale
+ - [x] Host weights on GitHub and add auto-download
  - [ ] Benchmark of inversion vs Griffin-Lim, iSTFTNet
- - [ ] Host weights in cloud and add auto-download

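For the planned inversion benchmark, reconstruction quality is commonly reported with sample-level metrics such as MSE, MAE, and SNR. The sketch below shows one way these could be computed for a reference waveform and its reconstruction, using plain Python lists. The function name and the returned dictionary keys are assumptions for illustration; the project does not yet define a benchmark protocol.

```python
import math


def reconstruction_metrics(reference, estimate):
    """Return MSE, MAE and SNR (in dB) between two equal-length waveforms."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("waveforms must be non-empty and of equal length")
    errors = [r - e for r, e in zip(reference, estimate)]
    count = len(errors)
    mse = sum(err * err for err in errors) / count
    mae = sum(abs(err) for err in errors) / count
    signal_power = sum(r * r for r in reference) / count
    if mse == 0:
        snr_db = math.inf    # perfect reconstruction
    elif signal_power == 0:
        snr_db = -math.inf   # silent reference with a non-silent error
    else:
        snr_db = 10 * math.log10(signal_power / mse)
    return {"mse": mse, "mae": mae, "snr_db": snr_db}


reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -1.1, 1.1, -0.9]
print(reconstruction_metrics(reference, estimate))
```

Sample-level metrics like these are sensitive to small time shifts, so a benchmark against Griffin-Lim or iSTFTNet would likely need aligned signals or additional spectral metrics.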
- ## Contributions and acknowledgment
- Show your appreciation to those who have contributed to the project.
+ ## Contributions and acknowledgment (TODO)