Releases: NeptuneHub/AudioMuse-AI-DCLAP

AudioMuse-AI-DCLAP v1 - First release val_cosine 0.884

28 Feb 16:55

This is the first release of AudioMuse-AI Distilled CLAP (DCLAP). It contains model_epoch_36.onnx and model_epoch_36.onnx.data, the distilled model for the AUDIO tower.

The text tower clap_text_model.onnx is still the original one from LAION CLAP, exported to .onnx.

The distilled audio model has around 7M parameters and runs 5-6x faster (tested on a Raspberry Pi 5 with 8 GB RAM and an NVMe SSD).
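A minimal sketch of loading and running the distilled audio tower with onnxruntime. The input/output tensor names and the single-input layout are assumptions, so check `session.get_inputs()` against the released model before relying on them.

```python
import numpy as np

def load_audio_tower(path="model_epoch_36.onnx"):
    # onnxruntime resolves the external weights in model_epoch_36.onnx.data
    # automatically as long as the .data file sits next to the .onnx file.
    import onnxruntime as ort  # pip install onnxruntime
    return ort.InferenceSession(path, providers=["CPUExecutionProvider"])

def l2_normalize(vec, eps=1e-12):
    # Unit-normalize an embedding so dot products become cosine similarities.
    vec = np.asarray(vec, dtype=np.float32)
    return vec / max(float(np.linalg.norm(vec)), eps)

def embed_audio(session, features):
    # Feed precomputed audio features and return the normalized embedding.
    # Single-input / single-output model layout is an assumption here.
    name = session.get_inputs()[0].name
    (emb,) = session.run(None, {name: features})
    return l2_normalize(emb[0])
```

On the Raspberry Pi 5 setup mentioned above, the `CPUExecutionProvider` is the relevant backend; the same session API works unchanged on x86.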

With this model, the following MSE and cosine similarity were reached against the LAION CLAP teacher model (music_audioset_epoch_15_esc_90.14.pt):

  • train_cos=0.886233021853105
  • train_mse=0.00044440243531563233
  • val_mse=0.00045189925003796816
  • val_cosine=0.8843138813972473
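The numbers above can be reproduced from matched teacher/student embedding pairs with the standard formulas; a plain-Python sketch (the exact batching in the training code is an assumption):

```python
def mse(a, b):
    # Mean squared error between two equal-length embeddings.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine(a, b):
    # Cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def distill_metrics(teacher, student):
    # Average MSE and cosine similarity over a batch of embedding pairs,
    # i.e. the val_mse / val_cosine style of numbers reported above.
    n = len(teacher)
    return (sum(mse(t, s) for t, s in zip(teacher, student)) / n,
            sum(cosine(t, s) for t, s in zip(teacher, student)) / n)
```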

On Music Information Retrieval (MIR) metrics over a set of challenging queries, the results below were reached (Student vs Teacher):

  Text query cosine similarities (mean across songs):

  Query                             Teacher    Student      Delta
  ──────────────────────────────  ─────────  ─────────  ─────────
  Calm Piano song                   +0.0191    +0.0226    +0.0035
  Energetic POP song                +0.2005    +0.2268    +0.0263
  Love Rock Song                    +0.2694    +0.3298    +0.0604
  Happy Pop song                    +0.3236    +0.3664    +0.0428
  POP song with Female vocalist     +0.2663    +0.3091    +0.0428
  Instrumental song                 +0.1253    +0.1543    +0.0290
  Female Vocalist                   +0.1694    +0.1984    +0.0291
  Male Vocalist                     +0.1238    +0.1545    +0.0306
  Ukulele POP song                  +0.1190    +0.1486    +0.0296
  Jazz Sax song                     +0.0980    +0.1229    +0.0249
  Distorted Electric Guitar         -0.1099    -0.1059    +0.0039
  Drum and Bass beat                +0.0878    +0.1213    +0.0335
  Heavy Metal song                  +0.0977    +0.1117    +0.0140
  Ambient song                      +0.1594    +0.2066    +0.0471
  ──────────────────────────────  ─────────  ─────────  ─────────
  OVERALL MEAN                      +0.1392    +0.1691    +0.0298
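The "mean across songs" column above corresponds to scoring one text-query embedding against every song embedding and averaging; a small sketch (function names are illustrative, not from the AudioMuse-AI code):

```python
def cosine(a, b):
    # Cosine similarity between two embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) *
                  (sum(y * y for y in b) ** 0.5))

def mean_query_similarity(text_emb, song_embs):
    # Average cosine similarity of one text query over the song catalogue,
    # computed once with the teacher's and once with the student's audio
    # embeddings to fill the two columns of the table.
    return sum(cosine(text_emb, s) for s in song_embs) / len(song_embs)
```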

  MIR RANKING METRICS: R@1, R@5, mAP@10 (teacher top-5 as relevance)

  Query                             R@1        R@5        mAP@10   Overlap10  Ordered10  MeanShift
  ------------------------------  -------  ------------  --------  ---------  ---------  --------
  Calm Piano song                   0/1    4/5 (80.0%)    0.967      7/10       2/10       2.20  
  Energetic POP song                1/1    2/5 (40.0%)    0.508      5/10       2/10       5.40  
  Love Rock Song                    0/1    3/5 (60.0%)    0.730      8/10       1/10       3.10  
  Happy Pop song                    0/1    2/5 (40.0%)    0.408      4/10       0/10       6.20  
  POP song with Female vocalist     0/1    2/5 (40.0%)    0.489      7/10       0/10       4.90  
  Instrumental song                 1/1    3/5 (60.0%)    0.858      8/10       3/10       3.00  
  Female Vocalist                   0/1    2/5 (40.0%)    0.408      5/10       0/10       9.80  
  Male Vocalist                     0/1    3/5 (60.0%)    0.858      8/10       2/10       2.50  
  Ukulele POP song                  1/1    3/5 (60.0%)    0.680      6/10       1/10       5.40  
  Jazz Sax song                     0/1    4/5 (80.0%)    0.967      8/10       3/10       2.30  
  Distorted Electric Guitar         0/1    3/5 (60.0%)    0.876      9/10       0/10       2.80  
  Drum and Bass beat                0/1    3/5 (60.0%)    0.634      8/10       1/10       3.40  
  Heavy Metal song                  1/1    5/5 (100.0%)   1.000      9/10       5/10       0.70  
  Ambient song                      1/1    4/5 (80.0%)    0.943      9/10       2/10       1.50  

  SUMMARY:
    Mean R@1 (accuracy) : 35.7% (5/14)
    Mean R@5            : 61.4% (mean overlap 3.07/5)
    mAP@10 (mean)       : 0.738
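The ranking metrics above treat the teacher's top-5 results for each query as the relevant set and score the student's ranking against it. A sketch of R@k and AP@10 under that convention (the evaluation script itself is not published here, so this is an assumed but standard formulation):

```python
def recall_at_k(student_rank, relevant, k):
    # Fraction of the relevant set (teacher top-5) found in the
    # student's top-k results: R@1 uses k=1, R@5 uses k=5.
    hits = len(set(student_rank[:k]) & set(relevant))
    return hits / min(k, len(relevant))

def ap_at_k(student_rank, relevant, k=10):
    # Average precision over the student's top-k: precision is accumulated
    # at each rank where a relevant item appears.
    rel = set(relevant)
    hits, score = 0, 0.0
    for i, item in enumerate(student_rank[:k], start=1):
        if item in rel:
            hits += 1
            score += hits / i
    return score / min(k, len(rel))
```

mAP@10 in the summary is then the mean of `ap_at_k` over the 14 queries.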

These were the main parameters used for training:

model:
  dropout: 0.3
  segment_batch_size: 5
  fusion_backbone: "edgenext"
  edgenext_variant: "edgenext_xx_small"

training:
  use_amp: true
  augmentation_enabled: true
  batch_size: 64
  learning_rate: 0.003 
  weight_decay: 0.1
  mixup_alpha: 0.1
  global_mixup: true
  loss_function: "cosine"
  use_logit_scale: true
  max_logit_scale_T: 50
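The `mixup_alpha: 0.1` / `global_mixup: true` settings suggest the standard mixup recipe of blending pairs of examples with a Beta-sampled weight; the actual training code is not shown here, so this is an assumed sketch of that augmentation:

```python
import random

def mixup_pair(x1, x2, y1, y2, alpha=0.1, rng=random):
    # Standard mixup: lam ~ Beta(alpha, alpha). With alpha=0.1 the lambdas
    # land mostly near 0 or 1, i.e. mixes are dominated by one sample.
    lam = rng.betavariate(alpha, alpha) if alpha > 0 else 1.0
    mix = lambda a, b: [lam * u + (1 - lam) * v for u, v in zip(a, b)]
    # For distillation, y1/y2 would be the teacher embeddings (assumption).
    return mix(x1, x2), mix(y1, y2), lam
```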