Error while training Embedding: "No inf checks were recorded for this optimizer." #5280

DoughyInTheMiddle · 2022-12-01T07:11:24Z

DoughyInTheMiddle
Dec 1, 2022

I've done a little research, and everything here seems to point to an issue with pytorch, but as I've not modified anything there,

Training at rate of 5e-07 until step 100000
Preparing dataset...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:03<00:00, 18.99it/s]
  0%|                                                                                                                                                                              | 0/100000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "G:\GitHub\SDWebUI\modules\textual_inversion\textual_inversion.py", line 335, in train_embedding scaler.step(optimizer)
  File "G:\GitHub\SDWebUI\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 336, in step assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.

Embedding name is my name, middle initial, and suffix (I'm a junior), with an "_Em".
Vectors per token: left at default of 1
Embedding is selected on the training tab.
All images have prompt files with both CLIP and deepbooru captions, edited (and used in hypernetwork training previously)
Dataset directory is filled in properly
Template file is filled in (template only has [filewords]).
Max steps left at default
Image log set to 500
Embedding log save set to 457 (Only have a 2060 super, and I found offsetting the saves reduces crashes)
Save images in PNG chunks = true
Read parameters from prompt (these are pictures of me, using a very short Tagger-created "1boy, realistic, solo, looking at viewer, brown eyes")
Based on a write up, tried "Deterministic" for latent sampling method, but reverted to default (no change).

Steps attempted:
Restarted (closed terminal window completely) several times.
Unloaded all extensions (restarting each time)
Loaded with just xformers and as medvram (yeah, I know, might generate crap)
Loaded with my usual full args: --xformers --deepdanbooru --api --gradio-img2img-tool color-sketch

Any searching results in fairly hard core python / pytorch debugging which is outside of my wheelhouse. However, it's NOTHING I've changed in those files, so I don't know why they'd be an issue.

Searched AUTOMATIC1111's issue log, Reddit, as well as here...nada.

Thoughts?

DoughyInTheMiddle · 2022-12-01T23:59:45Z

DoughyInTheMiddle
Dec 1, 2022
Author

As an update, I tried making a hypernetwork using the same images and mostly the same settings, no isssues.

Definitely a pytorch / embeddings issue. Should I just push this to being a full-on bug issue?

0 replies

taylorharrison · 2022-12-02T11:00:46Z

taylorharrison
Dec 2, 2022

Not able to solve your problem, but I am able to say you're not alone. Same issue here.

0 replies

slashedstar · 2022-12-02T13:03:15Z

slashedstar
Dec 2, 2022

5 replies

pat-the-cat1 Dec 2, 2022

I have the same problem. Disabeling VAE did not help.

slashedstar Dec 2, 2022

Yeah, I think it is actually about the prompt template file, I get the error when its only [filewords], if I add back [name] to it, making it [filewords] [name] then it works

ajelliot Dec 29, 2022

Yeah, I think it is actually about the prompt template file, I get the error when its only [filewords], if I add back [name] to it, making it [filewords] [name] then it works

I was having the same problem as OP, and changing the prompt template file fixed it for me. I was using hypernetworks.txt, which only uses [filewords]. When I changed to subject.txt, which uses only [name], I no longer generated the error.

Thanks for the tip.

roberto-barrero Dec 30, 2022

Yeah, I think it is actually about the prompt template file, I get the error when its only [filewords], if I add back [name] to it, making it [filewords] [name] then it works

Thanks. I was also using a template file only with [filewords], and adding [name] to it solved it. But because the error occurred after training started and around 50 images were already processed, I wanted to know why adding [name] solved it.

After some trial and error, I found that one of my images didn't contain the embedding word in the filename (eg. instead of minions I wrote minoins). I made sure all the images contained the embedding word and it worked with only [filewords].

I guess that when a prompt doesn't contain the embedding, it crashes, and maybe that's why adding [name] solves the issue, because every prompt contains the embedding.

Ninja-Ferret Mar 1, 2023

thank you!!! this fixed it for me

DoughyInTheMiddle · 2022-12-02T19:14:20Z

DoughyInTheMiddle
Dec 2, 2022
Author

Honestly, that's the only thing I HAVE changed since the last time I ran an embedding.

…

On Fri, Dec 2, 2022, 10:45 AM slashedstar ***@***.***> wrote: Yeah, I think it is actually about the prompt template file, I get the error when its only [filewords], if I add back [name] to it, making it [filewords] [name] then it works — Reply to this email directly, view it on GitHub <#5280 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AKRPHRQ2UCFJ2TSS3U7QP6DWLIKRBANCNFSM6AAAAAASQOL67A> . You are receiving this because you authored the thread.Message ID: <AUTOMATIC1111/stable-diffusion-webui/repo-discussions/5280/comments/4293658 @github.com>

0 replies

sosojni · 2022-12-13T07:36:18Z

sosojni
Dec 13, 2022

I got this error when i wanted to resume embedding training but i did not set correct (original) number of gradient accumulation steps number. As soon as i changed it from 1 to 31 which was number that i set when i started original training for that embedding, error dissapeared and training continued.

0 replies

philliphooper · 2022-12-13T16:37:06Z

philliphooper
Dec 13, 2022

I fixed this error by removing the trailing line break from my textual_inversion_templates txt file that I had edited.

4 replies

JimmyTime Jan 13, 2023

This was it for me as well

Mike-Schvedov Jan 18, 2023

It fixed the problem for me as well

LanceGao97 Jun 4, 2023

Yes it works, why the hell so many people under here still talking all those meaning less things,
just remove any " " from the filename of the custom template file, it works.

sanjuhs Feb 23, 2024

yes can confirm this was the error !

Zyin055 · 2022-12-18T20:52:06Z

Zyin055
Dec 18, 2022

AssertionError: No inf checks were recorded for this optimizer.

I got this error when renaming an embedding so that it ended with a "-", and then tried to resume training on it with a different learning rate. Renaming it to "-0" fixed it for me and allowed me to keep training.

0 replies

99percenthan · 2022-12-24T15:39:09Z

99percenthan
Dec 24, 2022

I think this error may be due to “a wrong backward”. Can you adjust the "batch size" and "gradient accumulation steps" to make their product become "a factor of the image number of training set", and see whether the problem still occurs?
For example, if your training set contains 24 images, you can set the "batch size" to "1" and the "gradient accumulation steps" to "6", so their product "6" (1*6=6) is a factor of "24".

1 reply

Pn7Hao Feb 4, 2024

this also help me , noted that check again per_device_train_batch_size

DarkAlchy · 2023-01-05T21:23:23Z

DarkAlchy
Jan 5, 2023

This hit me last night and many many hours later I am no closer to fixing it.

0 replies

Slug-Cat · 2023-01-08T22:03:16Z

Slug-Cat
Jan 8, 2023

You should escalate this to a full bug report, it seems many people are having this issue. @DoughyInTheMiddle

0 replies

Kirby1997 · 2023-01-17T12:33:40Z

Kirby1997
Jan 17, 2023

Had this error when I had an embedding template containing "[name], [keywords]". Removing everything apart from "[name]" seems to work. Maybe one of the prompt files with the training images was wrong or something. Could definitely do with a more informative error here.

0 replies

disarticulate · 2023-01-19T01:06:31Z

disarticulate
Jan 19, 2023

it appears related to a prompt template that just as [filewords] in it.

I preprocessed all images with a associated txt file and put the embedding term in it.

Instead, I removed the embedding term and selected subject_filewords.txt which accomplishes the same thing: writing a prompt with the embedding word and it's associated tokens in the txt.

0 replies

JonathanDotCel · 2023-01-20T21:08:20Z

JonathanDotCel
Jan 20, 2023

Usually means I've forgotten to add [name] somewhere in the template file. (style_filewords_*).
A hypernetwork will happily train without [name] but an embedding will not.

Side note:
For embeddings, you can also use [name] as part of source image file names, if you want to train similar embeddings (with different trigger words) based on the same source images.... but you still need [name] in the template, meaning you need at least [name], [filewords], so it will duplicate the [name] term.

5 replies

Worrah Jan 21, 2023

Usually means I've forgotten to add [name] somewhere in the template file. (style_filewords_*). A hypernetwork will happily train without [name] but an embedding will not.

Side note: For embeddings, you can also use [name] as part of source image file names, if you want to train similar embeddings (with different trigger words) based on the same source images.... but you still need [name] in the template, meaning you need at least [name], [filewords], so it will duplicate the [name] term.

Same for me with hypernetwork. Old builds worked fine with default unchanged template files, but now it interrupting with this error after saving a first image

Laugur3 Jan 21, 2023

exactly the same for me. Worked fine the morning still akwardly as usual trying to train a hypernetwork (european timezone) stopped to work at the begining of the afternoon. i will try to check up the scripts, maybe they have been broken in the meantime, i tried every solution up there with the fileswords templates and nothing worked so far.

Worrah Jan 21, 2023

exactly the same for me. Worked fine the morning still akwardly as usual trying to train a hypernetwork (european timezone) stopped to work at the begining of the afternoon. i will try to check up the scripts, maybe they have been broken in the meantime, i tried every solution up there with the fileswords templates and nothing worked so far.

switching back to old build with git reset --hard dac59b9 in the bat file worked for me

Laugur3 Jan 21, 2023

indeed that was the logical solution and it worked, we just have to wait for another comit to fix this.

DoughyInTheMiddle Jan 25, 2023
Author

My apologies for not having come back to this in so long, but yes, it's 100% that I only had [filewords] in the template file. It HAS to be two words I guess.

nazsmith · 2023-04-01T22:14:22Z

nazsmith
Apr 1, 2023

Same here. The template file I used had empty rows between the prompts, I eliminated that and now it works.

1 reply

jdsaund Apr 5, 2023

Same here, I had a blank line in the file. Deleted, now working.

0x6f6f · 2023-05-15T07:43:34Z

0x6f6f
May 15, 2023

I met this issue when i am trying to run training in multiple threads..

0 replies

risjut · 2023-06-12T11:31:43Z

risjut
Jun 12, 2023

Got the same problem while trying to train an embedding, none of the fixes above worked for me, especially as I had not changed anything on the default configuration files.

I had used "_" in my hypernetwork and embedding names and removing it fixed the problem
myEmbedding_v1 -> myEmbeddingv1

0 replies

p0ss · 2023-06-23T06:35:47Z

p0ss
Jun 23, 2023

After doing all the above it was still failing for me, I tried removing underscores from filenames and text inputs, and tried other input datasets against other embedding prompts, and it all still failed.

I then changed the source model I was training against and it started working again. I tried a bunch of different models, most worked, a few didn't. Models that used a noise offset consistently didn't work.

As far as I can tell the error happens if the the model is returning NANs, and pyTorch can't handle the infinities. I'm not sure exactly what in the models causes this output, but I did see it happen on a model I'd trained and interrupted, and all models I tried that had a baked in noise offset consistently failed.

Hopefully this helps someone

0 replies

bonyphacy · 2023-07-23T17:43:15Z

bonyphacy
Jul 23, 2023

Guys, i solve this problem for me. All you need is (love :) an empty txt file with only one string inside: a [name]. In my case I'm wrongly used "a[name]" instead "a [name]" string. Drop it in yours ...webui\textual_inversion_templates folder and choose in "Prompt template" field.
FYI, "Number of vectors per token" on the first tab is just a summ of tokens (words and comas) from Initialization text. So when you leave it 1 by default, it takes only first token from all string.
Enjoy!

0 replies

dragonplus-wby · 2023-07-27T03:24:12Z

dragonplus-wby
Jul 27, 2023

Those who adjust the Prompt template has no effect can check whether the Number of vectors per token was not adjusted when Create embedding, this will also cause the same error

0 replies

evilcrusher2 · 2023-08-15T01:42:33Z

evilcrusher2
Aug 15, 2023

So, for me it was the [name] field in the txt file for stylize filewords. I had a previous embedding and template that worked so I changed things one by one, comparing the template to the files it was used to train. The the issue was that the name must match the file and that you cannot use - in the name to separate words in either the file or the template. flamingmoes2-brushed will not work but flamingmoes2_brushed will for the file names and the name placement in the template. I hope this helps anyone else looking for an answer.

0 replies

Error while training Embedding: "No inf checks were recorded for this optimizer." #5280

Uh oh!

Replies: 20 comments · 16 replies

Uh oh!

DoughyInTheMiddle Dec 1, 2022 Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

DoughyInTheMiddle Dec 2, 2022 Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

DoughyInTheMiddle Jan 25, 2023 Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Replies: 20 comments 16 replies

DoughyInTheMiddle
Dec 1, 2022
Author

DoughyInTheMiddle
Dec 2, 2022
Author

DoughyInTheMiddle Jan 25, 2023
Author