feat: Add Continue Training feature for completed jobs #698
Open
diodiogod wants to merge 1 commit into ostris:main
Adds a 'Continue Training' feature that allows users to continue training from completed jobs in two ways:
1. Resume Training - Continue from last checkpoint with same job name and step counter
2. Start Fresh from Weights - Clone job with new name using final checkpoint as pretrained weights

Changes:
- Added /api/jobs/[jobID]/continue endpoint supporting both resume and clone modes
- Added ContinueTrainingModal component with intuitive mode selection UI
- Updated JobActionBar to show Continue Training option for completed jobs
- Added continueJob() utility function for API calls
- Improved Modal component to use React Portal for consistent rendering
- Fixed Modal to prevent accidental close when dragging text selection
- Enhanced checkpoint detection in BaseSDTrainProcess to prioritize by step number

The checkpoint detection now intelligently sorts by:
1. Final files without step numbers (highest priority)
2. Checkpoints with highest step number
3. Most recently modified files (fallback)

This ensures correct checkpoint loading even when files are copied or moved.

Fixes issues where:
- Modal transparency varied depending on render location
- Modal closed when dragging text selection outside bounds
- Checkpoint detection failed with copied/moved files due to unreliable creation times
Summary
Adds a Continue Training feature that allows users to easily resume or continue training from completed jobs directly from the UI. This addresses a common workflow need where users want to extend training beyond the initial step count or fine-tune from existing weights.
Features
Two Training Modes
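1. Resume Training: continues from the last checkpoint with the same job name and step counter
2. Start Fresh from Weights: clones the job under a new name and uses the final checkpoint as pretrained weights via pretrained_lora_path
User Interface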
Changes
New Files
- ui/src/app/api/jobs/[jobID]/continue/route.ts - API endpoint for both modes
- ui/src/components/ContinueTrainingModal.tsx - Modal UI component
Modified Files
- ui/src/components/JobActionBar.tsx - Added Continue Training button
- ui/src/components/Modal.tsx - Improved rendering and UX
- ui/src/utils/jobs.ts - Added continueJob() and canContinue logic
- jobs/process/BaseSDTrainProcess.py - Enhanced checkpoint detection
Technical Improvements
Smart Checkpoint Detection
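The checkpoint finding logic now prioritizes by:
1. Final files without step numbers (e.g. model.safetensors)
2. Checkpoints with the highest step number (e.g. model_6000.safetensors)
3. Most recently modified files (fallback)
This fixes issues where copied/moved files had unreliable creation times, causing wrong checkpoints to be loaded.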
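The priority order above can be sketched as a sort key. This is a minimal illustration, not the code in BaseSDTrainProcess.py: the function names, the `<job_name>_<step>.safetensors` naming pattern, and the folder layout are assumptions made for the example.

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def checkpoint_sort_key(path: Path, job_name: str) -> Tuple[bool, int, float]:
    """Hypothetical sort key: larger tuples mean higher priority.

    Assumes checkpoints are named <job_name>.safetensors (final) or
    <job_name>_<step>.safetensors (intermediate).
    """
    suffix = path.stem[len(job_name):]       # "" for final, "_6000" for a step file
    match = re.fullmatch(r"_(\d+)", suffix)
    is_final = suffix == ""                  # 1. final file without a step number
    step = int(match.group(1)) if match else -1  # 2. highest step number
    mtime = path.stat().st_mtime             # 3. modification time as fallback
    return (is_final, step, mtime)


def pick_latest_checkpoint(folder: str, job_name: str) -> Optional[Path]:
    """Return the highest-priority checkpoint in folder, or None if none exist."""
    candidates = list(Path(folder).glob(f"{job_name}*.safetensors"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: checkpoint_sort_key(p, job_name))
```

Because the step number is compared before the modification time, a copied model_4000.safetensors with a newer timestamp would still rank below model_6000.safetensors in this sketch.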
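Modal Enhancements
- Renders through a React Portal for consistent rendering, which fixes modal transparency that varied depending on render location
- No longer closes accidentally when a text selection is dragged outside the modal bounds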
Testing
Tested with a Z-Image training job, successfully resuming from step 6000 to 10000.
Use Cases
This feature streamlines a common workflow that previously required manual database editing or complex configuration changes.