[fix] Correct progress bar initialization in train_instruct_pix2pix.py #9791
Conversation
This isn't correct. It's supposed to resume from global_step.
Hi, thank you for the feedback! I believe there's still an issue: the progress bar is updated within the resume-skipping section of the code (L855). To resolve this, we could either remove the progress bar update in that section or adjust its range accordingly. This issue was already addressed in
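For reference, the problematic pattern looks roughly like this (a minimal sketch, not the verbatim script; names like resume_step and first_epoch follow the script's conventions):

```python
# Sketch of the resume-skip branch described above. The bar is advanced
# while skipping already-trained batches, even though its range was already
# shortened to start at global_step, so the skipped steps are counted twice.
for step, batch in enumerate(train_dataloader):
    if args.resume_from_checkpoint and epoch == first_epoch and step < resume_step:
        if step % args.gradient_accumulation_steps == 0:
            progress_bar.update(1)  # advances the bar during skipping
        continue
    # ... actual training step ...
```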
The other scripts, IIRC, simply jump straight to the resume range position.
It's debated whether stepping the dataloader is still needed at all, since the random states are resumed by accelerate. I believe that before accelerate was as robust and full-featured as it is now, stepping the loop N times was needed to continue training in a reproducible manner.
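For context, the "jump straight to the resume position" approach can be expressed with accelerate's skip_first_batches. A minimal sketch, assuming a recent accelerate release:

```python
from accelerate import Accelerator

def fast_forward(accelerator: Accelerator, dataloader, resume_step: int):
    """Skip already-trained batches without manually stepping the loop.

    `Accelerator.skip_first_batches` is available in recent accelerate
    releases; older scripts had to `continue` through the first N batches
    by hand to reach the same position.
    """
    return accelerator.skip_first_batches(dataloader, resume_step)
```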
Exactly. Using continue in this way is safer, but it can lead to unnecessary data loading. In this code, a shortened range and continue are used together, which results in an incorrect range. For example, if I want to train for 20k steps and resume from an 8k checkpoint, the range is set to range(8k, 20k). However, after all the continue statements, the iteration counter has already advanced to 16k, leaving only 4k steps instead of the expected 12k for training. So I think we should adjust the range.
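To make the arithmetic concrete, here is a toy reproduction of the double counting (standalone numbers for illustration, not the script itself):

```python
# 20k total steps, resuming from an 8k checkpoint.
max_train_steps = 20_000
global_step = 8_000  # restored from the checkpoint

bar_total = max_train_steps - global_step  # range(8k, 20k) -> a 12k-step bar
bar_position = 0

for step in range(max_train_steps):
    if step < global_step:   # resume-skip branch
        bar_position += 1    # bar advanced while merely skipping
        continue
    pass                     # actual training would happen here

print(bar_total - bar_position)  # 4000 -- only 4k of bar left for 12k real steps
```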
Well, no: remove the continue statement, so that this script becomes consistent with the project's other example scripts.
OK, that solves the problem too. I've removed the continue statement and set the progress bar's initial value to global_step instead.
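For illustration, the resulting setup would look roughly like this (a sketch assuming the script's existing args, accelerator, and global_step; the keyword arguments mirror the tqdm pattern used in train_instruct_pix2pix_sdxl.py):

```python
from tqdm.auto import tqdm

# Full-length bar seeded at the resumed step via `initial`; the manual
# progress_bar.update(1) inside the skip branch is removed.
progress_bar = tqdm(
    range(args.max_train_steps),
    initial=global_step,
    desc="Steps",
    disable=not accelerator.is_local_main_process,
)
```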
```python
if step % args.gradient_accumulation_steps == 0:
    progress_bar.update(1)
```
Why are we removing this?
Because we have set initial=global_step in the progress_bar, so it will be consistent with train_instruct_pix2pix_sdxl.py's progress bar setup, as proposed by bghira here. If we don't remove this statement, initial should be set to zero instead.
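Put differently, the bar's final position is initial plus the manual updates, so only one of the two should account for the skipped steps. A sketch of the two self-consistent options (again assuming the script's args and global_step):

```python
from tqdm.auto import tqdm

# Option A (this PR): seed the bar at global_step and do NOT update it
# inside the resume-skip branch.
progress_bar = tqdm(range(args.max_train_steps), initial=global_step)

# Option B: start at zero and let the skip branch advance the bar once per
# (gradient-accumulated) skipped step.
progress_bar = tqdm(range(args.max_train_steps), initial=0)
```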
Can we try to harmonize the changes by following what we do here? I think it's better to have consistency that way.
Yes, we can work towards harmonizing the changes. The main difference here lies in whether or not we skip the loop N times: this file does, while the SDXL file doesn't. However, based on our previous discussion, skipping the loop N times in this implementation is needed to maintain reproducibility in training. So, to ensure consistency, it may actually be the SDXL version that needs to be aligned with this approach. Or, if we believe accelerate is robust enough, we could remove the continue in this file to match the SDXL version. Thoughts?
Thanks for explaining further! Could you please provide a visual example comparing the changes you would like to see, if it's not too much trouble?
Gentle ping @GhostCai
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This PR ensures the progress bar starts from 0 when resuming from a checkpoint, fixing an issue where it previously started at global_step and caused incorrect progress tracking.