wkzng
diff --git a/README.md b/README.md
Lines changed: 52 additions & 16 deletions
diff --git a/datasets/configs.py b/datasets/configs.py
Lines changed: 5 additions & 4 deletions
diff --git a/demo.ipynb b/demo.ipynb
Lines changed: 51 additions & 49 deletions
diff --git a/illustrations/spec_causal_abs.jpeg b/illustrations/spec_causal_abs.jpeg
68.5 KB
diff --git a/illustrations/spec_causal_signed.jpeg b/illustrations/spec_causal_signed.jpeg
55.3 KB
diff --git a/illustrations/spec_noncausal_abs.jpeg b/illustrations/spec_noncausal_abs.jpeg
57 KB
diff --git a/illustrations/spec_noncausal_signed.jpeg b/illustrations/spec_noncausal_signed.jpeg
36 KB
diff --git a/pretrained/16000fs_128fps_128bins_mel_complex_ncausal.ckpt b/pretrained/16000fs_128fps_128bins_mel_complex_ncausal.ckpt
-1.59 MB
diff --git a/pretrained/16000fs_128fps_128bins_mel_real_causal.ckpt b/pretrained/16000fs_128fps_128bins_mel_real_causal.ckpt
-1.04 MB
diff --git a/pretrained/16000fs_128fps_256bins_mel_complex_ncausal.ckpt b/pretrained/16000fs_128fps_256bins_mel_complex_ncausal.ckpt
3.18 MB
@@ -1,21 +1,45 @@
 # iSincNet (Lightweight Sincnet Spectrogram Vocoder)
 
-[[Blog]](https://gitlab.com/sonustech/sincnet) [[SincNet Paper]](https://arxiv.org/abs/1808.00158)
+[[Blog]](https://github.com/wkzng/iSincNet) [[Original SincNet Paper (M. Ravanelli, Y. Bengio)]](https://arxiv.org/abs/1808.00158)
 
 iSincNet is a fast and lightweight SincNet spectrogram vocoder: a neural network trained to reconstruct audio waveforms from their SincNet spectrogram (a real, signed 2D representation). We used the GTZAN dataset, the most widely used public dataset for evaluating music genre recognition (MGR) in machine listening research. The files were collected in 2000-2001 from a variety of sources, including personal CDs, radio, and microphone recordings, in order to represent a variety of recording conditions (http://marsyas.info/downloads/datasets.html).
 
 <p align="center">
   <img src=illustrations/SincNet-Filterbank.png alt="Fast and Lightweight Sincnet Spectrogram Vocoder" width="80%"/>
 </p>
 
-# TODO: Benchmark
-colums: architecture or method | dataset | MSE | MAE | SNR | checkpoint
-
-datasets:
+Datasets used during development:
 - [GTZAN](https://github.com/chittalpatel/Music-Genre-Classification-GTZAN)
 - [MUSDB-18](https://sigsep.github.io/datasets/musdb.html)
 
 
+## Example Spectrogram 
+The first 5 seconds of the audio file `audio/invertibility/15033000.mp3`
+
+|  | Non-causal Encoder | Causal Encoder |
+|:------:|:-------------------:|:--------------:|
+| signed values | <img src="illustrations/spec_noncausal_signed.jpeg" alt="non-causal 15033000" width="260"> | <img src="illustrations/spec_causal_signed.jpeg" alt="causal 15033000" width="260"> |
+| abs values | <img src="illustrations/spec_noncausal_abs.jpeg" alt="non-causal 15033000" width="260"> | <img src="illustrations/spec_causal_abs.jpeg" alt="causal 15033000" width="260"> |
+
+
+### 🎧 Pretrained Models
+The following table summarizes the key characteristics and access points for the available pretrained models.
+All models are open-source and stored in the `pretrained/` folder.
+
+| Sample Rate | Frame rate | Bins | Weights | Corpus | Causal Encoder | Scale | Open-Source |
+|:------------:|:---:|:-----:|:--------|:--------|:----------------:|:-------:|:------------:|
+| 16000 | 128 | 128 | [📦](pretrained/16000fs_128fps_128bins_lin_complex_ncausal.ckpt) | GTZAN | ✗ | Linear | √ |
+| 16000 | 128 | 128 | [📦](pretrained/16000fs_128fps_128bins_lin_real_causal.ckpt) | GTZAN | √ | Linear | √ |
+| 16000 | 128 | 256 | [📦](pretrained/16000fs_128fps_256bins_mel_complex_ncausal.ckpt) | GTZAN | ✗ | Mel | √ |
+| 44100 | 210 | 256 | [📦](pretrained/44100fs_210fps_256bins_lin_complex_ncausal.ckpt) | GTZAN | ✗ | Linear | √ |
+| 44100 | 210 | 512 | [📦](pretrained/44100fs_210fps_512bins_mel_complex_ncausal.ckpt) | GTZAN | ✗ | Mel | √ |
+| 44100 | 350 | 128 | [📦](pretrained/44100fs_350fps_128bins_lin_real_causal.ckpt) | GTZAN | √ | Linear | √ |
+| 44100 | 350 | 128 | [📦](pretrained/44100fs_350fps_128bins_lin_complex_ncausal.ckpt) | GTZAN | ✗ | Linear | √ |
+| 44100 | 350 | 256 | [📦](pretrained/44100fs_350fps_256bins_mel_complex_ncausal.ckpt) | GTZAN | ✗ | Mel | √ |
+| 44100 | 350 | 256 | [📦](pretrained/44100fs_350fps_256bins_mel_real_causal.ckpt) | GTZAN | √ | Mel | √ |
+
+
+
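The checkpoint filenames in the table above appear to follow a fixed pattern: sample rate, frame rate, bin count, scale, component, and causality. The sketch below rebuilds a filename from those fields, assuming the pattern holds for every row. `checkpoint_name` is a hypothetical helper and not part of the repository; how `load_pretrained_weights` actually resolves files is not shown in this diff.

```python
def checkpoint_name(fs: int, fps: int, bins: int, scale: str,
                    component: str, causal: bool) -> str:
    """Build a checkpoint filename following the pattern seen in the table.

    Hypothetical helper for illustration only (not the repository's API).
    scale is "lin" or "mel"; component is "real" or "complex".
    """
    causal_tag = "causal" if causal else "ncausal"
    return f"{fs}fs_{fps}fps_{bins}bins_{scale}_{component}_{causal_tag}.ckpt"


# Example: the 16 kHz, 256-bin mel, non-causal model listed in the table.
name = checkpoint_name(16000, 128, 256, "mel", "complex", causal=False)
print(f"pretrained/{name}")
```

In the table, every causal model uses the `real` component and every non-causal model uses `complex`; the helper does not enforce that pairing.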
 ## Quick Start 
 ```bash
 pip install -r requirements.txt
@@ -30,13 +54,27 @@ import torch
 from sincnet.model import SincNet, Quantizer
 from datasets.utils.waveform import WaveformLoader 
 
-# load the model
+
+SAMPLE_RATE = 16_000
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-model = SincNet().load_pretrained_weights().eval().to(device)
-processor = WaveformLoader(sample_rate=SAMPLE_RATE) 
+audio_loader = WaveformLoader(sample_rate=SAMPLE_RATE) 
+
+# load the model
+params = {
+    "fs": SAMPLE_RATE,
+    "fps": 128,
+    "scale": "lin",
+    "component": "complex"
+}
+
+model : SincNet = (
+    SincNet(**params)
+    .load_pretrained_weights(weights_folder="pretrained", verbose=False)
+    .eval()
+    .to(device)
+)
 
 # encode and decode an audio waveform
-sample_rate = 16_000
 duration = 5
 offset = 0
 audio_path = ... 
@@ -49,11 +87,11 @@ with torch.no_grad():
   spectrogram = model.encode(audio_tensor.unsqueeze(0), scale="mel")
   reconstructed_audio_tensor = model.decode(spectrogram, scale="mel")
 
-#(optional) elementwise quantization into a vocabulary of size 2^{q_bits}
+#(optional) elementwise quantization into a discrete vocabulary of size 2^{q_bits}
 quantizer = Quantizer(q_bits=10).to(device)
 indices = quantizer(spectrogram)
-detokenized_spectrogram = tokenizer.inverse(indices)
-detokenized_audio = model.decode(detokenized_spectrogram)
+dequantized_spectrogram = quantizer.inverse(indices)
+dequantized_audio = model.decode(dequantized_spectrogram)
 ```
 
 
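The optional quantization step in the README snippet maps each spectrogram value to one of `2^{q_bits}` discrete indices and back. The sketch below illustrates the idea with plain uniform quantization over a fixed range. It is an assumption for illustration: `UniformQuantizer`, its `lo`/`hi` range, and the uniform spacing are hypothetical, and the repository's `Quantizer` may use a different scheme.

```python
from typing import List


class UniformQuantizer:
    """Hypothetical uniform quantizer (NOT the repository's Quantizer).

    Maps floats in [lo, hi] to integer indices in [0, 2**q_bits - 1].
    """

    def __init__(self, q_bits: int, lo: float = -1.0, hi: float = 1.0) -> None:
        self.levels = 2 ** q_bits          # vocabulary size
        self.lo = lo
        self.hi = hi
        self.step = (hi - lo) / (self.levels - 1)

    def __call__(self, values: List[float]) -> List[int]:
        indices = []
        for v in values:
            clipped = min(max(v, self.lo), self.hi)   # clamp out-of-range values
            indices.append(round((clipped - self.lo) / self.step))
        return indices

    def inverse(self, indices: List[int]) -> List[float]:
        # Map each index back to the centre of its quantization level.
        return [self.lo + i * self.step for i in indices]


quantizer = UniformQuantizer(q_bits=10)
values = [-1.0, -0.25, 0.0, 0.7, 1.0]
indices = quantizer(values)
restored = quantizer.inverse(indices)
```

For in-range inputs, the round-trip error of this scheme is at most half a quantization step, so increasing `q_bits` by one roughly halves the worst-case error.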
@@ -85,9 +123,7 @@ Related discussion about SincNet vs STFT https://github.com/mravanelli/SincNet/i
 
 
 ## Roadmap and projects status
-- [x] Added Automatic projection forward and backward to MEL scale
+- [x] Host weights on GitHub and add auto-download
 - [ ] Benchmark of inversion vs Griffin-Lim, iSTFTNet
-- [ ] Host weights in cloud and add auto-download
 
-## Contributions and acknowledgment
-Show your appreciation to those who have contributed to the project.
+## Contributions and acknowledgment (TODO)
@@ -11,11 +11,12 @@
 
 class BaseDatasetConfig:
     id :str
-    sample_rate:int = int(os.getenv("SAMPLE_RATE"))
+    sample_rate:int 
     audio_root: str = os.getenv("H5_DIRECTORY")
 
-    def __init__(self, id:str) -> None:
+    def __init__(self, id:str, sample_rate:int=None) -> None:
         super().__init__()
+        self.sample_rate = sample_rate or int(os.getenv("SAMPLE_RATE", 16000))
         self.id = id
 
     @property
@@ -31,5 +32,5 @@ def hdf5(self) -> str:
 
 class GTZANConfig(BaseDatasetConfig):
     """ traceability: https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification """
-    def __init__(self, id:str="gtzan") -> None:
-        super().__init__(id=id)
+    def __init__(self, sample_rate:int=None) -> None:
+        super().__init__(id="gtzan", sample_rate=sample_rate)
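The change to `BaseDatasetConfig` resolves the sample rate in this order: an explicit argument, then the `SAMPLE_RATE` environment variable, then a default of 16000. The standalone sketch below mirrors that logic so it can be tried without the dataset classes; `resolve_sample_rate` is a hypothetical function name used only for illustration.

```python
import os
from typing import Optional

DEFAULT_SAMPLE_RATE = 16_000


def resolve_sample_rate(sample_rate: Optional[int] = None) -> int:
    """Explicit argument first, then SAMPLE_RATE env var, then the default.

    Hypothetical helper mirroring the `or` fallback used in the diff.
    """
    return sample_rate or int(os.getenv("SAMPLE_RATE", str(DEFAULT_SAMPLE_RATE)))


os.environ.pop("SAMPLE_RATE", None)
print(resolve_sample_rate())        # falls back to the default
os.environ["SAMPLE_RATE"] = "44100"
print(resolve_sample_rate())        # taken from the environment
print(resolve_sample_rate(22050))   # explicit argument wins
```

Because the fallback uses `or`, any falsy argument (including `0`) is treated the same as `None`. Reading the environment inside `__init__` rather than at class-definition time also means the module can be imported when `SAMPLE_RATE` is unset.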