V1 Large still working best for me compared to V2 or V3 for English #1836

FurkanGozukara · 2023-11-23T19:07:56Z

FurkanGozukara
Nov 23, 2023

This is so weird. I use the settings below, and V2 and V3 hallucinate a lot, but V1 performs best:

--model large-v1 --language en --initial_prompt "Welcome our Youtube channel." --best_of 10 --beam_size 10

Moreover, I still have the issue of punctuation being lost completely.

glangford · 2023-11-23T19:13:30Z

glangford
Nov 23, 2023

Can you share the audio or give a pointer to the YouTube content?

1 reply

FurkanGozukara Nov 24, 2023
Author

Sorry for the late reply.

I just published the video. The subtitles are manually fixed by me at the moment:

How To Do Stable Diffusion XL (SDXL) DreamBooth Training For Free - Utilizing Kaggle - Easy Tutorial (on YouTube)

glangford · 2023-11-24T12:28:50Z

glangford
Nov 24, 2023

I haven't tried large models yet, but these settings give good results:

--model medium.en --patience 2

[00:00.000 --> 00:08.000] In this tutorial video, I will guide you through setting up your Stable Diffusion XL SDXL Koia Training Notebook on a free Kaggle account.
[00:08.000 --> 00:10.000] Here is what you will learn.
[00:10.000 --> 00:14.000] How to select the correct Kaggle notebook settings and start your session.
[00:14.000 --> 00:20.000] Steps to install and initiate the Koia Graphical User Interface Stable Diffusion Trainer.
[00:20.000 --> 00:26.000] Setting best parameters and configurations for SDXL training with Koia on a free Kaggle notebook.
[00:26.000 --> 00:29.000] Utilizing dual T4 GPUs simultaneously.
[00:29.000 --> 00:34.000] Simply load my pre-shared configuration and click Prepare Data Set.
[00:34.000 --> 00:38.000] Adding new data to your Kaggle account as a data set for use in your session.
[00:38.000 --> 00:39.000] Like training images.
[00:39.000 --> 00:43.000] The types of training images to use in your data set.
[00:43.000 --> 00:45.000] A new training approach.
[00:45.000 --> 00:50.000] Instead of epochs, use a higher repetition count and save checkpoints based on step count.
[00:50.000 --> 00:53.000] How to calculate checkpoints saves every end step.
[00:53.000 --> 00:56.000] Estimating the total number of steps your training will take.
[00:56.000 --> 01:01.000] Downloading saved checkpoints or files directly from the Kaggle working directory.
[01:01.000 --> 01:04.000] Uploading generated checkpoints to Hugging Face from Kaggle.
[01:04.000 --> 01:09.000] Or from other cloud services such as Google Colab, RunPod, and AWS.
[01:09.000 --> 01:14.000] Quickly downloading checkpoints from Hugging Face using a browser or wget.
[01:14.000 --> 01:19.000] Switching your automatic 11.11 Stable Diffusion web UI to the development branch.
[01:19.000 --> 01:22.000] Finding and using amazing promptless PNGs.
[01:22.000 --> 01:29.000] Installing and effectively using the after detailer extension for automatic face-in painting to enhance image quality.

5 replies

glangford Nov 24, 2023

Testing solely on the first 90s of audio -

large-v2 with --language en makes a few changes over medium.en, and I don't get any hallucination. large-v3 gives the least accurate results (fails to capitalize "After Detailer", "face and painting" vs "face-in painting", STXL vs SDXL) but still no hallucination (for me).
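One way to make this kind of model-to-model comparison repeatable is to diff the transcripts word by word instead of reading them side by side. A minimal sketch, assuming both transcripts are already available as plain strings; the `word_diff` helper and the sample sentences are made up for illustration and are not part of whisper:

```python
import difflib


def word_diff(reference: str, candidate: str):
    """Return (tag, reference_words, candidate_words) for every differing span."""
    ref_words = reference.split()
    cand_words = candidate.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace / insert / delete spans
            changes.append((tag, " ".join(ref_words[i1:i2]), " ".join(cand_words[j1:j2])))
    return changes


# Illustrative sentences only, not the actual model output.
medium_en = "setting up your Stable Diffusion XL SDXL training notebook"
large_v3 = "setting up your Stable Diffusion XL STXL training notebook"
for change in word_diff(medium_en, large_v3):
    print(change)
```

The diff is case- and punctuation-sensitive on purpose, since capitalization and punctuation are exactly what differs between the models here.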

Did your transcription start off correctly (with punctuation and no hallucination) and then start hallucinating later?

FurkanGozukara Nov 24, 2023
Author

Hello. Testing only 2 minutes is a big mistake :) You should test the entire speech.

glangford Nov 24, 2023

That's why I asked if it started hallucinating later

FurkanGozukara Nov 24, 2023
Author

yep it starts later

glangford Nov 25, 2023

large-v2 seemed to work fine for me, sharing the .srt here. large-v3 seems to have issues in general so I didn't test it.

Maybe a problem with the audio quality of the file you used or perhaps the whisper options triggered hallucination.

At any rate, I used
whisper SDXL.m4a --model large-v2 --language en --verbose False --patience 2 --word_timestamps True --output_format srt --fp16 False

SDXL.srt.txt
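To repeat the same run across several models, the command line above can be generated rather than retyped. This is only a sketch: `build_whisper_command` is a hypothetical helper, the flags are the ones quoted in this thread, and the commands are printed instead of executed.

```python
import shlex


def build_whisper_command(audio_path, model, **options):
    """Build an argv list for the whisper CLI; option names are passed through unchanged."""
    command = ["whisper", audio_path, "--model", model]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


# Same options for every model, so only the model changes between runs.
shared_options = {
    "language": "en",
    "verbose": False,
    "patience": 2,
    "word_timestamps": True,
    "output_format": "srt",
    "fp16": False,
}
for model in ("large-v1", "large-v2", "large-v3"):
    print(shlex.join(build_whisper_command("SDXL.m4a", model, **shared_options)))
```

Each list could be passed to `subprocess.run` once the printed commands look right.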

FurkanGozukara · 2023-11-25T20:39:18Z

FurkanGozukara
Nov 25, 2023
Author

> large-v2 seemed to work fine for me, sharing the .srt here. large-v3 seems to have issues in general so I didn't test it.
>
> Maybe a problem with the audio quality of the file you used or perhaps the whisper options triggered hallucination.
>
> At any rate, I used whisper SDXL.m4a --model large-v2 --language en --verbose False --patience 2 --word_timestamps True --output_format srt --fp16 False
>
> SDXL.srt.txt

--best_of 10 --beam_size 10 makes a significant difference in terms of hallucination and quality.

What does --patience do?

7 replies

glangford Nov 26, 2023

I have used it in the past to improve transcriptions in non-English languages (going back to large-v1) and nowadays I just apply it automatically everywhere. I haven't compared with/without --patience recently!
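For context, as far as I can tell from the whisper source, --patience only matters during beam search: the decoder keeps collecting finished candidates until it has round(beam_size * patience) of them, and the default of 1.0 is ordinary beam search. The CLI default for --beam_size is 5, so --patience 2 on its own would mean 10 candidates. Also, to my understanding, --best_of is only used when sampling at a temperature above 0 (the fallback passes), while --beam_size applies at temperature 0. The sketch below only reproduces the arithmetic; the function name is made up and the beam search itself is not implemented.

```python
def max_finished_candidates(beam_size: int, patience: float = 1.0) -> int:
    """Number of finished candidates to collect before beam search stops (assumed rule)."""
    if beam_size <= 0:
        raise ValueError("beam_size must be positive")
    return round(beam_size * patience)


# Settings mentioned in this thread: CLI default beam, and --beam_size 10.
for beam_size, patience in ((5, 1.0), (5, 2.0), (10, 2.0)):
    print(beam_size, patience, max_finished_candidates(beam_size, patience))
```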

FurkanGozukara Nov 26, 2023
Author

Thanks a lot, I should test it.

glangford Dec 1, 2023

@FurkanGozukara --initial_prompt can be a trigger of this problem apparently, see

Why does using init_prompt break the sentencer/punctuator? #625

FurkanGozukara Dec 1, 2023
Author

Actually, I use it to improve punctuation. The model still loses punctuation, and they still haven't fixed it after a year.

ghost Apr 25, 2024

Punctuation loss in many fine-tuned Whisper models is due to basic text normalization:

transcription = batch["sentence"]
if do_lower_case:
    transcription = transcription.lower()
if do_remove_punctuation:
    # the normalizer strips punctuation from the training target
    transcription = normalizer(transcription).strip()

if do_remove_punctuation:
    print("Removing punctuation: ", punctuation_to_remove)

This won't normally apply outside of training (english) or end use of a multi-lingual model due to how the code works. One could implement language specific normalization "in vivo". For example, one could incorporate https://taku910.github.io/mecab/ into normalization for Japanese.
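To illustrate the effect described above: if punctuation is stripped from the reference text before fine-tuning, the model is trained on targets that contain no punctuation at all. A minimal sketch with a stand-in normalizer; the regex-based `strip_punctuation` and the `prepare_target` wrapper are assumptions for illustration, not the normalizer used in the linked notebook:

```python
import re

# Anything that is not a word character, whitespace or an apostrophe.
PUNCTUATION = re.compile(r"[^\w\s']")


def strip_punctuation(text: str) -> str:
    """Drop punctuation and collapse the whitespace left behind."""
    return re.sub(r"\s+", " ", PUNCTUATION.sub("", text)).strip()


def prepare_target(sentence: str, do_lower_case: bool = True,
                   do_remove_punctuation: bool = True) -> str:
    """Mimic the preprocessing flags from the snippet above on one sentence."""
    if do_lower_case:
        sentence = sentence.lower()
    if do_remove_punctuation:
        sentence = strip_punctuation(sentence)
    return sentence


print(prepare_target("Here is what you will learn."))
print(prepare_target("Here is what you will learn.", do_remove_punctuation=False))
```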

Be careful with initial prompting when it comes to whisper. Prompting with whisper does not behave the same way as prompting with other architectures.
https://github.com/sin2piusc/Whisper-trainer-with-multiple-streaming-datasets/blob/main/whisper-trainer-updated.ipynb

NielsMayer · 2023-11-26T19:40:42Z

NielsMayer
Nov 26, 2023

Granted, I'm using faster-whisper/whisper-ctranslate2, but I'm finding that the options discussed here (--patience 2, --best_of 10, --beam_size 10) result in worse performance.
(example use case showing good performance with my chosen parameters: https://rumble.com/v2nvr1w-trainspodder-helping-decode-reggae-lyrics.html )

So I'm back to my old parameters:

--condition_on_previous_text False --compression_ratio_threshold 1.8
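For reference, the value that --compression_ratio_threshold is compared against is, as far as I know, the UTF-8 length of the decoded text divided by its zlib-compressed length, so repetitive output scores high. In the original whisper the default threshold is 2.4; lowering it to 1.8 should make more segments count as failed and get retried at a higher temperature. A sketch of the metric only, with made-up sample strings (the retry logic around it is not reproduced):

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size; higher means more repetitive."""
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))


normal = "Here is what you will learn in this tutorial about training on Kaggle."
looping = "Thank you for watching. " * 10  # typical repetition loop

for label, text in (("normal", normal), ("looping", looping)):
    print(label, round(compression_ratio(text), 2))
```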

I should probably try this with the latest update to whisper itself to see if --patience performs differently in the original implementation. Likewise I haven't updated to the most recent version supporting large-v3, and given the reports I'm seeing, it seems wise to wait till the issues are resolved, especially since I'm hoping you guys figure out the AI issues and let me worry about all the other issues I'm having with Trainspodder. (see #233 )

6 replies

NielsMayer Nov 26, 2023

Is it really different???

https://www.youtube.com/watch?v=AFk5g7NJ1Ko

NielsMayer Nov 26, 2023

One thing I'm finding is that the hallucinations act as "labellings" for a musical segment, in that it often repeats the title or name or performer of the track (announced beforehand) throughout the music segment. Often the hallucinations are useful in this regard. I'm thinking of post-filtering these to visually indicate they're labels and not transcriptions.

(As part of my experiments, I'm basically running analyses of many of the BBC's radio shows 24/7, so I get to see how it's doing: the output of whisper is often scrolling by in a window on my workstation, so I can check on what's happening at random. This just scrolled by and illustrates the point.)

[13:44.700 --> 13:47.540]  Just search for Radio 3 and night tracks.
[13:49.480 --> 13:53.380]  Now to The Forest, where the bird is asleep on the bow,
[13:53.960 --> 13:57.380]  the moon is rising and the stars beginning to twinkle.
[13:58.220 --> 14:02.260]  Ravishing music for soprano and string quartet by Paul Hindemith.
[14:02.320 --> 14:04.440]  It's sung by Barbara Hunigan.
[14:44.220 --> 15:02.240]  It's sung by Barbara Hunigan.
[15:02.260 --> 15:32.240]  It's sung by Barbara Hunigan.
[15:32.260 --> 15:56.980]  It's sung by Barbara Hunigan.
[16:02.260 --> 16:32.240]  It's sung by Barbara Hunigan.
[16:32.260 --> 17:02.240]  It's sung by Barbara Hunigan.
[17:05.460 --> 17:22.560]  It's sung by Barbara Hunigan.
[17:32.260 --> 17:37.220]  It's sung by Barbara Hunigan.
[18:31.680 --> 18:31.860]  It's sung by Barbara Hunigan.
[18:48.360 --> 19:01.980]  It's sung by Barbara Hunigan.
[19:02.260 --> 19:32.200]  It's sung by Barbara Hunigan.
[19:32.260 --> 20:02.240]  It's sung by Barbara Hunigan.
[20:31.320 --> 20:31.940]  It's sung by Barbara Hunigan.
[20:32.260 --> 21:01.860]  It's sung by Barbara Hunigan.
[21:31.500 --> 21:31.960]  It's sung by Barbara Hunigan.
[21:32.260 --> 22:02.240]  It's sung by Barbara Hunigan.
[22:31.320 --> 22:31.940]  It's sung by Barbara Hunigan.
[22:32.260 --> 23:02.240]  It's sung by Barbara Hunigan.
[23:09.580 --> 23:24.880]  That was Dissolving Clouds, a track by Biosphere, and it followed on from the dream forest of the expressionist music of Paul Hindemith's Melancholy, with Barbara Hunigan joining the Emerson Quartet in their last ever recording.
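The post-filtering idea could start as simply as flagging any segment whose text is identical to the segment before it. A minimal sketch, assuming the console format shown above; `Segment`, `parse_segments` and `mark_repeats` are names invented here, and nothing in it looks at the audio:

```python
import re
from dataclasses import dataclass

# Matches lines like "[14:02.320 --> 14:04.440]  some text"
SEGMENT = re.compile(r"\[(?P<start>[\d:.]+) --> (?P<end>[\d:.]+)\]\s+(?P<text>.*)")


@dataclass
class Segment:
    start: str
    end: str
    text: str
    is_label: bool = False  # True when the text only repeats the previous segment


def parse_segments(lines):
    """Parse console lines into Segment objects, skipping lines that do not match."""
    segments = []
    for line in lines:
        match = SEGMENT.match(line.strip())
        if match:
            segments.append(Segment(match["start"], match["end"], match["text"].strip()))
    return segments


def mark_repeats(segments):
    """Keep the first occurrence as a transcription and flag exact repeats as labels."""
    previous_text = None
    for segment in segments:
        segment.is_label = segment.text == previous_text
        previous_text = segment.text
    return segments


sample = [
    "[14:02.320 --> 14:04.440]  It's sung by Barbara Hunigan.",
    "[14:44.220 --> 15:02.240]  It's sung by Barbara Hunigan.",
    "[15:02.260 --> 15:32.240]  It's sung by Barbara Hunigan.",
]
for segment in mark_repeats(parse_segments(sample)):
    print("LABEL" if segment.is_label else "TEXT ", segment.start, segment.text)
```

A UI could then render the flagged segments differently instead of dropping them.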

The repeat hallucinations seem to occur over segments that are "spectrally similar" (the segmentations are for spectrally similar segments of the media). So I'm considering using the spectrally similar segment analyses to help distinguish such repeated hallucinations occurring within spectrally-similar segments, as "labels".

FYI, spectral similarity segments can be used to "diarize" the output. For example, if you select a specific segment type in trainspodder, then skip to the next segment of the same exact type, it will often be the same speaker. Or if you select a particular segment type across the media and start playback over only those segments, then often the output will be a single speaker captured from the stream. These are the kind of experiments I've been trying with Trainspodder.

NielsMayer Nov 26, 2023

Now if I could just keep it from "counting" during hallucinations involving numbers. It correctly labelled the music being played during the hallucination as "Handel's Suite No. 2" but then started counting. I've seen whisper count to over 100 during a long hallucination. (Is it emergent intelligence that the model can count?? :-) )

[04:13:32.000 --> 04:13:38.200]  the stakes were around even back then. Handel's Suite No. 2 here from the pianist Christian
[04:14:08.200 --> 04:14:38.180]  Handel's Suite No. 2
[04:14:38.200 --> 04:15:08.180]  Handel's Suite No. 2
[04:15:08.200 --> 04:15:38.180]  Handel's Suite No. 2
[04:15:38.200 --> 04:16:08.180]  Handel's Suite No. 2
[04:16:08.200 --> 04:16:38.180]  Handel's Suite No. 2
[04:16:38.200 --> 04:17:08.180]  Handel's Suite No. 2
[04:17:08.200 --> 04:17:38.180]  Handel's Suite No. 2
[04:17:38.200 --> 04:18:08.180]  Handel's Suite No. 2
[04:18:08.200 --> 04:18:38.180]  Handel's Suite No. 3
[04:18:38.200 --> 04:19:08.180]  Handel's Suite No. 4
[04:19:08.200 --> 04:19:37.740]  Handel's Suite No. 5
[04:19:38.200 --> 04:20:08.180]  Handel's Suite No. 6
[04:20:08.200 --> 04:20:38.180]  Handel's Suite No. 6
[04:20:38.200 --> 04:21:08.180]  Handel's Suite No. 6
[04:21:08.200 --> 04:21:38.180]  Handel's Suite No. 6
[04:21:38.200 --> 04:22:08.180]  Handel's Suite No. 6
[04:22:35.340 --> 04:22:38.180]  Handel's Suite No. 2
[04:22:38.180 --> 04:22:40.260]  Joseph Haydn
[04:22:40.260 --> 04:22:44.460]  It's Joseph Haydn to follow on here on Radio 3 through the Knights Haydn's D minor string
[04:22:44.460 --> 04:22:50.780]  quartet, Opus 42. It's from the mid-1780s and probably written for a periodical series
[04:22:50.780 --> 04:22:56.060]  published by Mozart's friend Hofmeister. The players are the Pavel Haas Quartet.
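An exact-match check would miss the counting case, because each repeated line differs by a number. One possible workaround is to mask digits before comparing and then look for long runs. A rough sketch on plain segment texts; `mask_digits`, `collapse_runs` and the `min_run` cutoff are all assumptions, and real announcements that differ only by a number would be caught by it as well:

```python
import re
from itertools import groupby


def mask_digits(text: str) -> str:
    """Replace every run of digits so 'No. 2' and 'No. 3' compare equal."""
    return re.sub(r"\d+", "#", text)


def collapse_runs(texts, min_run=3):
    """Return (masked_text, run_length) for runs of consecutive look-alike segments."""
    runs = []
    for key, group in groupby(texts, key=mask_digits):
        length = sum(1 for _ in group)
        if length >= min_run:
            runs.append((key, length))
    return runs


texts = [
    "Handel's Suite No. 2",
    "Handel's Suite No. 2",
    "Handel's Suite No. 3",
    "Handel's Suite No. 4",
    "Joseph Haydn",
]
print(collapse_runs(texts))
```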

glangford Nov 26, 2023

@NielsMayer I am not using best_of or beam_size parameters with --patience, just FYI. Not sure it matters in your examples.

NielsMayer Nov 26, 2023

I tried various combos of patience plus the beam size and best-of suggestions. In my handful of trials they gave worse results (for example, missing segments where the words were understandable). To do it right I should set up multiple parallel analyses with different parameters and let it rip for a week or two, but then my power bills and 4080 would suffer even more. :-)

Replies: 4 comments · 19 replies
