feat: Add Continue Training feature for completed jobs #698

Open
diodiogod wants to merge 1 commit into ostris:main from diodiogod:feature/continue-training

@diodiogod diodiogod commented Feb 4, 2026

Summary

Adds a Continue Training feature that lets users continue training from a completed job directly in the UI. It covers two common needs: extending training beyond the initial step count, and fine-tuning a new job from existing weights.

Features

Two Training Modes

1. Resume Training

  • Continues from the last checkpoint with the same job name
  • Preserves step counter and training state
  • Ideal for extending training (e.g., from 6000 to 10000 steps)

2. Start Fresh from Weights

  • Creates a new job with a different name
  • Uses final checkpoint as pretrained_lora_path
  • Resets step counter to 0
  • Ideal for fine-tuning or transfer learning
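As a rough illustration of how the two modes differ, the sketch below derives a new job config for each. The actual endpoint lives in ui/src/app/api/jobs/[jobID]/continue/route.ts and is not shown in this description; the function name `build_continue_config` and the flat keys `name` and `steps` are hypothetical, and only `pretrained_lora_path` is taken from the text above.

```python
from copy import deepcopy


def build_continue_config(config, mode, new_steps, final_checkpoint=None, new_name=None):
    """Return a new config dict for the requested continue mode.

    The flat keys "name" and "steps" are illustrative assumptions,
    not the project's actual config schema.
    """
    updated = deepcopy(config)  # never mutate the stored job config
    updated["steps"] = new_steps

    if mode == "resume":
        # Same job name, so the existing checkpoints are found and the
        # step counter continues from the last saved step.
        return updated

    if mode == "clone":
        if not new_name or new_name == config["name"]:
            raise ValueError("clone mode needs a job name different from the original")
        if not final_checkpoint:
            raise ValueError("clone mode needs a final checkpoint path")
        # New job name, so training starts at step 0; the old weights
        # only serve as the starting point.
        updated["name"] = new_name
        updated["pretrained_lora_path"] = final_checkpoint
        return updated

    raise ValueError(f"unknown mode: {mode!r}")
```

For example, `build_continue_config({"name": "my_lora", "steps": 6000}, "resume", 10000)` keeps the name and only raises the step target, while clone mode additionally sets a new name and `pretrained_lora_path`.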

User Interface

  • New Continue Training button in the gear menu for completed jobs
  • Modal dialog with clear explanations of each mode
  • Step count input with validation
  • Optional job renaming in clone mode (Start Fresh from Weights)
  • Prevents invalid configurations (e.g., steps less than current step)
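The step validation could look roughly like the sketch below. It is an assumption about the rule, not the code from ContinueTrainingModal.tsx: the helper name `validate_steps` and the error messages are hypothetical, and the description only states that a step count below the current step is rejected, so an equal value is allowed here.

```python
def validate_steps(mode, current_step, requested_steps):
    """Return a list of error messages; an empty list means the input is valid."""
    errors = []
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(requested_steps, bool) or not isinstance(requested_steps, int):
        errors.append("steps must be an integer")
    elif requested_steps <= 0:
        errors.append("steps must be greater than 0")
    elif mode == "resume" and requested_steps < current_step:
        # Resuming keeps the step counter, so the target cannot lie behind it.
        errors.append(
            f"steps ({requested_steps}) is below the current step ({current_step})"
        )
    # In clone mode the counter restarts at 0, so any positive target is fine.
    return errors
```

Returning a list instead of raising lets a UI show every problem next to the input field at once.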

Changes

New Files

  • ui/src/app/api/jobs/[jobID]/continue/route.ts - API endpoint for both modes
  • ui/src/components/ContinueTrainingModal.tsx - Modal UI component

Modified Files

  • ui/src/components/JobActionBar.tsx - Added Continue Training button
  • ui/src/components/Modal.tsx - Improved rendering and UX
  • ui/src/utils/jobs.ts - Added continueJob() and canContinue logic
  • jobs/process/BaseSDTrainProcess.py - Enhanced checkpoint detection

Technical Improvements

Smart Checkpoint Detection

The checkpoint finding logic now prioritizes by:

  1. Final files without step numbers (e.g., model.safetensors)
  2. Checkpoints with highest step number (e.g., model_6000.safetensors)
  3. Most recently modified files (fallback)

This fixes issues where copied/moved files had unreliable creation times, causing wrong checkpoints to be loaded.
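The priority order above can be expressed as a single sort key. The following is a minimal sketch, not the code in BaseSDTrainProcess.py: the names `checkpoint_sort_key` and `pick_latest_checkpoint` are hypothetical, and it assumes step numbers appear as a trailing `_<digits>` suffix, as in model_6000.safetensors.

```python
import os
import re

# Assumed naming scheme: a trailing "_<digits>" before the extension is the step.
_STEP_SUFFIX = re.compile(r"_(\d+)$")


def checkpoint_sort_key(path):
    """Rank a checkpoint path: final file > highest step > newest mtime."""
    stem = os.path.splitext(os.path.basename(path))[0]
    match = _STEP_SUFFIX.search(stem)
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        mtime = 0.0  # missing or unreadable file: lowest tie-breaker
    if match is None:
        # No step suffix: treat as the final file, which wins outright.
        return (1, 0, mtime)
    # Compare steps numerically so 10000 ranks above 6000.
    return (0, int(match.group(1)), mtime)


def pick_latest_checkpoint(paths):
    """Return the best checkpoint path, or None if there are no candidates."""
    paths = list(paths)
    return max(paths, key=checkpoint_sort_key) if paths else None
```

Because modification time is only the last element of the key, copying or moving files cannot reorder checkpoints that differ in step number. One limitation of this sketch: a job whose name itself ends in `_<digits>` would be misread as a step checkpoint.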

Modal Enhancements

  • Uses React Portal for consistent rendering across different DOM contexts
  • Fixes transparency issues when modal is rendered inside tables
  • Prevents accidental closure when dragging text selection outside modal bounds

Testing

Tested with a Z-Image training job, successfully resuming from step 6000 to 10000.

Use Cases

  • Extend training when initial steps weren't enough
  • Fine-tune from a completed model with different settings
  • Create variations of a model with different step counts
  • Transfer learning from completed training runs

This feature streamlines a common workflow that previously required manual database editing or complex configuration changes.

Adds a 'Continue Training' feature that allows users to continue training from completed jobs in two ways:

1. Resume Training - Continue from last checkpoint with same job name and step counter
2. Start Fresh from Weights - Clone job with new name using final checkpoint as pretrained weights

Changes:
- Added /api/jobs/[jobID]/continue endpoint supporting both resume and clone modes
- Added ContinueTrainingModal component with intuitive mode selection UI
- Updated JobActionBar to show Continue Training option for completed jobs
- Added continueJob() utility function for API calls
- Improved Modal component to use React Portal for consistent rendering
- Fixed Modal to prevent accidental close when dragging text selection
- Enhanced checkpoint detection in BaseSDTrainProcess to prioritize by step number

The checkpoint detection now intelligently sorts by:
1. Final files without step numbers (highest priority)
2. Checkpoints with highest step number
3. Most recently modified files (fallback)

This ensures correct checkpoint loading even when files are copied or moved.

Fixes issues where:
- Modal transparency varied depending on render location
- Modal closed when dragging text selection outside bounds
- Checkpoint detection failed with copied/moved files due to unreliable creation times