Releases: NeptuneHub/AudioMuse-AI-DCLAP
AudioMuse-AI-DCLAP v1 - First release val_cosine 0.884
This is the first release of AudioMuse-AI Distilled CLAP (DCLAP). It contains model_epoch_36.onnx and model_epoch_36.onnx.data, the distilled model for the AUDIO tower.
The text tower, clap_text_model.onnx, is still the original one from LAION CLAP, exported to ONNX.
The distilled audio model is around 7M and is 5-6x faster (tested on a Raspberry Pi 5 with 8 GB RAM and an NVMe SSD).
With this model we reached the following MSE and cosine similarity against the LAION CLAP teacher model (music_audioset_epoch_15_esc_90.14.pt):
- train_cos=0.886233021853105
- train_mse=0.00044440243531563233
- val_mse=0.00045189925003796816
- val_cosine=0.8843138813972473
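As a rough illustration of how figures like these can be computed, the sketch below averages per-sample cosine similarity and MSE between teacher and student embeddings. The function names are hypothetical and this is not the project's evaluation code. It also includes a small consistency check: for unit-normalized vectors of dimension d, the per-element MSE equals 2(1 - cos)/d. Assuming d = 512 and unit-normalized embeddings (an assumption, not stated in this release), the reported cosine values reproduce the reported MSE values closely.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mse(u, v):
    """Mean squared error over the elements of two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def mean_metrics(teacher_embs, student_embs):
    """Average cosine similarity and MSE over paired embeddings."""
    pairs = list(zip(teacher_embs, student_embs))
    mean_cos = sum(cosine_similarity(t, s) for t, s in pairs) / len(pairs)
    mean_mse = sum(mse(t, s) for t, s in pairs) / len(pairs)
    return mean_cos, mean_mse


def mse_from_cosine(cos, dim):
    """Per-element MSE implied by a cosine value, valid only for unit-norm vectors.

    For unit vectors t and s: ||t - s||^2 = 2 - 2 * cos(t, s).
    """
    return 2.0 * (1.0 - cos) / dim


# Consistency check against the reported validation numbers,
# assuming unit-norm embeddings with 512 dimensions (assumption).
implied_val_mse = mse_from_cosine(0.8843138813972473, 512)
```

Here `implied_val_mse` comes out at about 0.000452, in line with the reported val_mse. Under the same assumption the two metrics carry the same information, so cosine similarity is the easier one to read.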
On Music Information Retrieval (MIR) metrics, over a set of challenging queries, the results were as follows (Student vs Teacher):
Text query cosine similarities (mean across songs):
Query                          Teacher   Student   Delta
────────────────────────────── ───────── ───────── ─────────
Calm Piano song                +0.0191   +0.0226   +0.0035
Energetic POP song             +0.2005   +0.2268   +0.0263
Love Rock Song                 +0.2694   +0.3298   +0.0604
Happy Pop song                 +0.3236   +0.3664   +0.0428
POP song with Female vocalist  +0.2663   +0.3091   +0.0428
Instrumental song              +0.1253   +0.1543   +0.0290
Female Vocalist                +0.1694   +0.1984   +0.0291
Male Vocalist                  +0.1238   +0.1545   +0.0306
Ukulele POP song               +0.1190   +0.1486   +0.0296
Jazz Sax song                  +0.0980   +0.1229   +0.0249
Distorted Electric Guitar      -0.1099   -0.1059   +0.0039
Drum and Bass beat             +0.0878   +0.1213   +0.0335
Heavy Metal song               +0.0977   +0.1117   +0.0140
Ambient song                   +0.1594   +0.2066   +0.0471
────────────────────────────── ───────── ───────── ─────────
OVERALL MEAN                   +0.1392   +0.1691   +0.0298
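The sketch below shows one plausible way a single row of the table above could be produced: embed the text query once, take its cosine similarity with every song embedding from each audio tower, and average across songs. All names are hypothetical, and the embeddings are assumed to be plain lists of floats that were already computed elsewhere.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_query_similarity(text_emb, song_embs):
    """Mean cosine similarity between one text embedding and many song embeddings."""
    text_unit = normalize(text_emb)
    sims = [
        sum(t * s for t, s in zip(text_unit, normalize(song)))
        for song in song_embs
    ]
    return sum(sims) / len(sims)


def compare_towers(text_emb, teacher_song_embs, student_song_embs):
    """Return (teacher mean, student mean, delta) for one text query."""
    teacher = mean_query_similarity(text_emb, teacher_song_embs)
    student = mean_query_similarity(text_emb, student_song_embs)
    return teacher, student, student - teacher
```

Because both audio towers are scored against the same text tower, a positive delta only says that the student's song embeddings sit, on average, closer to the query text than the teacher's do. It says nothing about ranking quality, which the ranking metrics cover.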
MIR RANKING METRICS: R@1, R@5, mAP@10 (teacher top-5 as relevance)
Query                          R@1     R@5          mAP@10   Overlap10 Ordered10 MeanShift
------------------------------ ------- ------------ -------- --------- --------- --------
Calm Piano song                0/1     4/5 (80.0%)  0.967    7/10      2/10      2.20
Energetic POP song             1/1     2/5 (40.0%)  0.508    5/10      2/10      5.40
Love Rock Song                 0/1     3/5 (60.0%)  0.730    8/10      1/10      3.10
Happy Pop song                 0/1     2/5 (40.0%)  0.408    4/10      0/10      6.20
POP song with Female vocalist  0/1     2/5 (40.0%)  0.489    7/10      0/10      4.90
Instrumental song              1/1     3/5 (60.0%)  0.858    8/10      3/10      3.00
Female Vocalist                0/1     2/5 (40.0%)  0.408    5/10      0/10      9.80
Male Vocalist                  0/1     3/5 (60.0%)  0.858    8/10      2/10      2.50
Ukulele POP song               1/1     3/5 (60.0%)  0.680    6/10      1/10      5.40
Jazz Sax song                  0/1     4/5 (80.0%)  0.967    8/10      3/10      2.30
Distorted Electric Guitar      0/1     3/5 (60.0%)  0.876    9/10      0/10      2.80
Drum and Bass beat             0/1     3/5 (60.0%)  0.634    8/10      1/10      3.40
Heavy Metal song               1/1     5/5 (100.0%) 1.000    9/10      5/10      0.70
Ambient song                   1/1     4/5 (80.0%)  0.943    9/10      2/10      1.50
SUMMARY:
Mean R@1 (accuracy) : 35.7% (5/14)
Mean R@5 : 61.4% (mean overlap 3.07/5)
mAP@10 (mean) : 0.738
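For readers who want to reproduce this kind of comparison, here is a minimal sketch of R@k and AP@10 with the teacher's top-5 used as the relevant set. It follows one common definition of average precision (the sum of precision at each hit, divided by the number of relevant items). The release does not spell out the exact formulas, so treat the definitions and the function names as assumptions. Ordered10 and MeanShift are not covered because their definitions are not given here.

```python
def overlap_at_k(teacher_rank, student_rank, k):
    """Count items shared by the teacher's top-k and the student's top-k."""
    return len(set(teacher_rank[:k]) & set(student_rank[:k]))


def average_precision_at_k(relevant, student_rank, k=10):
    """AP@k of the student's ranking against a set of relevant items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(student_rank[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this hit
    return precision_sum / len(relevant)


def compare_rankings(teacher_rank, student_rank):
    """Summarize how closely the student's ranking follows the teacher's."""
    return {
        "r_at_1": overlap_at_k(teacher_rank, student_rank, 1),
        "r_at_5": overlap_at_k(teacher_rank, student_rank, 5),
        "ap_at_10": average_precision_at_k(teacher_rank[:5], student_rank, 10),
        "overlap_10": overlap_at_k(teacher_rank, student_rank, 10),
    }


# Toy rankings of song ids (made-up data, not taken from the evaluation).
teacher = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
student = ["b", "a", "c", "d", "x", "e", "f", "g", "h", "y"]
result = compare_rankings(teacher, student)
```

In this toy case the student's first result is relevant but is not the teacher's first result, so R@1 is 0 while AP@10 stays high: hits at positions 1, 2, 3, 4 and 6 give (1 + 1 + 1 + 1 + 5/6) / 5, which rounds to 0.967.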
These were the main training parameters:
model:
  dropout: 0.3
  segment_batch_size: 5
  fusion_backbone: "edgenext"
  edgenext_variant: "edgenext_xx_small"
training:
  use_amp: true
  augmentation_enabled: true
  batch_size: 64
  learning_rate: 0.003
  weight_decay: 0.1
  mixup_alpha: 0.1
  global_mixup: true
  loss_function: "cosine"
  use_logit_scale: true
  max_logit_scale_T: 50
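To give a feel for what a value like mixup_alpha: 0.1 usually controls, here is a minimal sketch of standard mixup, where two inputs are blended with a weight drawn from Beta(alpha, alpha). This is the textbook formulation, not code from this project. The helper name is hypothetical, and the release does not say how global_mixup changes the behaviour or how the teacher targets are handled for mixed inputs.

```python
import random


def mixup_pair(x_a, x_b, alpha=0.1, rng=random):
    """Blend two equal-length feature vectors with a Beta(alpha, alpha) weight.

    Returns the mixed vector and the sampled weight lam, so the caller can
    decide how to build the matching training target.
    """
    lam = rng.betavariate(alpha, alpha)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    return mixed, lam


# Usage with a seeded generator so the draw is reproducible.
rng = random.Random(0)
mixed, lam = mixup_pair([1.0, 0.0], [0.0, 1.0], alpha=0.1, rng=rng)
```

With a small alpha such as 0.1, the Beta distribution puts most of its mass near 0 and 1, so most mixed samples stay close to one of the two originals and only a few are blended evenly.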