Error while training Embedding: "No inf checks were recorded for this optimizer." #5280
Replies: 20 comments 16 replies
-
As an update, I tried making a hypernetwork using the same images and mostly the same settings, no issues. Definitely a PyTorch / embeddings issue. Should I just push this to being a full-on bug issue?
-
Not able to solve your problem, but I am able to say you're not alone. Same issue here.
-
Honestly, that's the only thing I HAVE changed since the last time I ran an
embedding.
…On Fri, Dec 2, 2022, 10:45 AM slashedstar wrote:
Yeah, I think it is actually about the prompt template file. I get the error when it's only [filewords]; if I add [name] back to it, making it [filewords] [name], then it works.
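To make the difference between the two templates concrete, here is a hypothetical sketch of the placeholder substitution; the webui's actual textual inversion code supports more placeholders and edge cases than shown here:

```python
def expand_template(template: str, name: str, filewords: str) -> str:
    """Hypothetical illustration of prompt-template expansion:
    [name] becomes the embedding's token, [filewords] becomes the
    caption from the image's companion txt file."""
    return template.replace("[name]", name).replace("[filewords]", filewords)


# With "[filewords] [name]" the embedding token is present in every prompt;
# with just "[filewords]" it only appears if the caption file contains it.
print(expand_template("[filewords] [name]", "mytoken", "1boy, solo"))
```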
-
I got this error when I wanted to resume embedding training but did not set the correct (original) number of gradient accumulation steps. As soon as I changed it from 1 back to 31, the value I had set when I started the original training for that embedding, the error disappeared and training continued.
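A minimal stdlib sketch (not the webui's actual code) of why a too-large accumulation count can mean no optimizer step, and hence no inf check, is ever recorded:

```python
def optimizer_steps(num_batches: int, accum_steps: int) -> int:
    """Gradient accumulation defers the optimizer step: it only runs once
    every `accum_steps` batches. If `accum_steps` is never reached, no
    step -- and no AMP inf check -- is ever recorded, which roughly
    matches the failure mode described above."""
    steps = 0
    pending = 0
    for _ in range(num_batches):
        pending += 1
        if pending == accum_steps:
            steps += 1   # optimizer.step() / scaler.step() would run here
            pending = 0
    return steps
```

For example, 31 batches with 31 accumulation steps yields exactly one optimizer step, while 32 accumulation steps would yield zero.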
-
I fixed this error by removing the trailing line break from my textual_inversion_templates txt file that I had edited.
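A small hypothetical helper for this fix, which drops empty rows and trailing line breaks from a template file (adjust the path to your own textual_inversion_templates file):

```python
from pathlib import Path


def strip_blank_lines(template_path: str) -> None:
    """Remove empty rows and the trailing line break from a prompt
    template file, which several replies in this thread identify as
    the trigger for the 'No inf checks' error."""
    p = Path(template_path)
    lines = [ln.strip() for ln in p.read_text().splitlines()]
    p.write_text("\n".join(ln for ln in lines if ln))
```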
-
I got this error when renaming an embedding so that it ended with a "-", and then tried to resume training on it with a different learning rate. Renaming it to "-0" fixed it for me and allowed me to keep training.
-
I think this error may be due to a wrong backward pass. Can you adjust the batch size and gradient accumulation steps so that their product is a factor of the number of images in the training set, and see whether the problem still occurs?
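The divisibility condition suggested above can be sketched as a quick check (names are illustrative, not webui API):

```python
def full_batches_only(num_images: int, batch_size: int, accum_steps: int) -> bool:
    """Return True if batch_size * accum_steps divides the training-set
    size, so every optimizer step is built from complete batches."""
    effective = batch_size * accum_steps
    return num_images % effective == 0
```

For a 31-image set, batch size 1 with 31 accumulation steps passes, while batch size 4 with 2 accumulation steps does not.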
-
This hit me last night and many, many hours later I am no closer to fixing it.
-
You should escalate this to a full bug report; it seems many people are having this issue. @DoughyInTheMiddle
-
Had this error when I had an embedding template containing "[name], [keywords]". Removing everything apart from "[name]" seems to work. Maybe one of the prompt files with the training images was wrong or something. Could definitely do with a more informative error here.
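A heuristic linter for this failure mode, assuming only [name] and [filewords] are valid placeholders (the webui may recognise others, so this set is an assumption, not the official list):

```python
import re

# Assumed valid set -- check your webui version's documentation for the
# authoritative list of template placeholders.
KNOWN_PLACEHOLDERS = {"[name]", "[filewords]"}


def unknown_placeholders(template_text: str) -> list:
    """Flag bracketed tokens the webui is unlikely to substitute,
    e.g. the [keywords] typo reported above."""
    return [t for t in re.findall(r"\[[^\]]+\]", template_text)
            if t not in KNOWN_PLACEHOLDERS]
```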
-
It appears related to a prompt template that has just [filewords] in it. I preprocessed all images with an associated txt file and put the embedding term in it. Instead, I removed the embedding term and selected
-
Usually means I've forgotten to add Side note:
-
Same here. The template file I used had empty rows between the prompts; I eliminated those and now it works.
-
I met this issue when I was trying to run training in multiple threads.
-
Got the same problem while trying to train an embedding; none of the fixes above worked for me, especially as I had not changed anything in the default configuration files. I had used "_" in my hypernetwork and embedding names, and removing it fixed the problem.
-
After doing all the above it was still failing for me. I tried removing underscores from filenames and text inputs, and tried other input datasets against other embedding prompts, and it all still failed. I then changed the source model I was training against and it started working again. I tried a bunch of different models; most worked, a few didn't. Models that used a noise offset consistently didn't work. As far as I can tell, the error happens if the model is returning NaNs and PyTorch can't handle the infinities. I'm not sure exactly what in the models causes this output, but I did see it happen on a model I'd trained and interrupted, and all models I tried that had a baked-in noise offset consistently failed. Hopefully this helps someone.
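If the root cause is a non-finite loss, failing fast with a readable message beats the opaque AMP error. A stdlib sketch of the idea; with real tensors you would check `torch.isfinite(loss).all()` before `scaler.scale(loss).backward()`:

```python
import math


def guard_loss(loss_value: float, step: int) -> float:
    """Raise a clear error as soon as the loss goes NaN/inf, instead of
    letting the optimizer later complain that no inf checks were recorded.
    Hypothetical helper, not part of the webui."""
    if not math.isfinite(loss_value):
        raise RuntimeError(f"non-finite loss {loss_value!r} at step {step}")
    return loss_value
```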
-
Guys, I solved this problem for myself. All you need is (love :) an empty txt file with only one line inside: a [name]. In my case I had wrongly used "a[name]" instead of "a [name]". Drop it into your ...webui\textual_inversion_templates folder and choose it in the "Prompt template" field.
-
For those where adjusting the prompt template has no effect: check whether the Number of vectors per token was adjusted when you created the embedding; this will also cause the same error.
-
So, for me it was the [name] field in the txt file for stylize filewords. I had a previous embedding and template that worked, so I changed things one by one, comparing the template to the files it was used to train on. The issue was that the name must match the file, and that you cannot use "-" to separate words in either the file or the template. flamingmoes2-brushed will not work, but flamingmoes2_brushed will, for both the file names and the name placement in the template. I hope this helps anyone else looking for an answer.
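That observation suggests a trivial workaround; note this is an empirical report from this thread, not documented webui behaviour:

```python
def sanitize_embedding_name(name: str) -> str:
    """Replace hyphens with underscores in an embedding/file name, per
    the report above that hyphenated names failed while underscored
    ones trained fine."""
    return name.replace("-", "_")
```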
-
I've done a little research, and everything here seems to point to an issue with PyTorch, but as I've not modified anything there, I don't know why it would suddenly break. My setup:
Embedding name is my name, middle initial, and suffix (I'm a junior), with an "_Em".
Vectors per token: left at default of 1
Embedding is selected on the training tab.
All images have prompt files with both CLIP and deepbooru captions, edited (and used in hypernetwork training previously)
Dataset directory is filled in properly
Template file is filled in (template only has [filewords]).
Max steps left at default
Image log set to 500
Embedding log save set to 457 (Only have a 2060 super, and I found offsetting the saves reduces crashes)
Save images in PNG chunks = true
Read parameters from prompt (these are pictures of me, using a very short Tagger-created "1boy, realistic, solo, looking at viewer, brown eyes").
Based on a write-up, tried "Deterministic" for latent sampling method, but reverted to default (no change).
Steps attempted:
Restarted (closed terminal window completely) several times.
Unloaded all extensions (restarting each time)
Loaded with just xformers and as medvram (yeah, I know, might generate crap)
Loaded with my usual full args:
--xformers --deepdanbooru --api --gradio-img2img-tool color-sketch
Any searching turns up fairly hardcore Python / PyTorch debugging, which is outside of my wheelhouse. However, it's NOTHING I've changed in those files, so I don't know why they'd be an issue.
Searched AUTOMATIC1111's issue log, Reddit, as well as here...nada.
Thoughts?