feat: Add Continue Training feature for completed jobs by diodiogod · Pull Request #698 · ostris/ai-toolkit

diodiogod · 2026-02-04T18:39:40Z

Summary

Adds a Continue Training feature that allows users to easily resume or continue training from completed jobs directly from the UI. This addresses a common workflow need where users want to extend training beyond the initial step count or fine-tune from existing weights.

Features

Two Training Modes

1. Resume Training

Continues from the last checkpoint with the same job name
Preserves step counter and training state
Ideal for extending training (e.g., from 6000 to 10000 steps)

2. Start Fresh from Weights

Creates a new job with a different name
Uses final checkpoint as pretrained_lora_path
Resets step counter to 0
Ideal for fine-tuning or transfer learning
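
The difference between the two modes can be sketched as two transformations of a job's saved configuration. This is only an illustration, written in Python for brevity although the actual endpoint is a TypeScript route. `JobConfig` and its `start_step` field are hypothetical stand-ins, not the toolkit's real config schema. Only `pretrained_lora_path` is a name taken from this PR.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class JobConfig:
    # Hypothetical, flattened stand-in for a job's saved configuration.
    name: str
    steps: int                                   # target total step count
    start_step: int = 0                          # hypothetical: step the counter starts from
    pretrained_lora_path: Optional[str] = None   # weights to initialise from, if any


def resume_job(job: JobConfig, current_step: int, new_steps: int) -> JobConfig:
    """Resume mode: same job name, step counter preserved, higher step target."""
    return replace(job, steps=new_steps, start_step=current_step)


def clone_from_weights(
    job: JobConfig, new_name: str, final_checkpoint: str, new_steps: int
) -> JobConfig:
    """Fresh-start mode: new job name, counter reset to 0, weights taken
    from the completed job's final checkpoint."""
    return replace(
        job,
        name=new_name,
        steps=new_steps,
        start_step=0,
        pretrained_lora_path=final_checkpoint,
    )
```

Under these assumptions, resuming a 6000-step job to 10000 steps keeps the name and continues counting from 6000, while cloning produces a separate job that starts at step 0 with the old weights loaded.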

User Interface

New Continue Training button in the gear menu for completed jobs
Modal dialog with clear explanations of each mode
Step count input with validation
Optional job renaming for clone mode
Prevents invalid configurations (e.g., steps less than current step)
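
The validation rules described above could look roughly like the following. This is a sketch under stated assumptions, not the PR's actual code: the function name, the mode strings `"resume"` and `"clone"`, and the exact rules are hypothetical, and it is written in Python for illustration even though the real checks live in the TypeScript UI. Treating a step count equal to the current step as invalid is also an assumption. The PR only mentions "less than".

```python
from typing import List, Optional


def validate_continue_request(
    mode: str,
    current_step: int,
    new_steps: int,
    original_name: str,
    new_name: Optional[str] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    if mode not in ("resume", "clone"):
        errors.append(f"unknown mode: {mode!r}")
        return errors
    if new_steps <= 0:
        errors.append("step count must be a positive integer")
    # Resuming only makes sense if there are steps left to train (assumption:
    # an equal step count is rejected too, since it would train nothing).
    if mode == "resume" and new_steps <= current_step:
        errors.append(
            f"step count ({new_steps}) must exceed the current step ({current_step})"
        )
    # Renaming is optional in clone mode, but an explicit name must differ.
    if mode == "clone" and new_name is not None and new_name.strip() == original_name:
        errors.append("a cloned job needs a different name")
    return errors
```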

Changes

New Files

ui/src/app/api/jobs/[jobID]/continue/route.ts - API endpoint for both modes
ui/src/components/ContinueTrainingModal.tsx - Modal UI component

Modified Files

ui/src/components/JobActionBar.tsx - Added Continue Training button
ui/src/components/Modal.tsx - Improved rendering and UX
ui/src/utils/jobs.ts - Added continueJob() and canContinue logic
jobs/process/BaseSDTrainProcess.py - Enhanced checkpoint detection

Technical Improvements

Smart Checkpoint Detection

The checkpoint finding logic now prioritizes by:

1. Final files without step numbers (e.g., model.safetensors)
2. Checkpoints with highest step number (e.g., model_6000.safetensors)
3. Most recently modified files (fallback)

This fixes issues where copied/moved files had unreliable creation times, causing wrong checkpoints to be loaded.
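
One way to express this priority order is a sort key that ranks final files first, then step number, then modification time. The sketch below is an illustration only and is not the code in BaseSDTrainProcess.py. The function names are hypothetical, and it assumes intermediate checkpoints end in `_<step>` before the extension, as in the examples above.

```python
import re
from pathlib import Path
from typing import Iterable, Optional, Tuple

# Assumed naming convention: "<name>_<step>.safetensors" for intermediate
# checkpoints and "<name>.safetensors" for the final file.
_STEP_SUFFIX = re.compile(r"_(\d+)$")


def checkpoint_sort_key(path: Path) -> Tuple[int, int, float]:
    """Larger keys win: final file > higher step > newer modification time."""
    mtime = path.stat().st_mtime
    match = _STEP_SUFFIX.search(path.stem)
    if match is None:
        # No step suffix: treat as the final file (highest priority).
        return (1, 0, mtime)
    return (0, int(match.group(1)), mtime)


def pick_checkpoint(paths: Iterable[Path]) -> Optional[Path]:
    """Return the preferred checkpoint, or None if there are no candidates."""
    candidates = list(paths)
    if not candidates:
        return None
    return max(candidates, key=checkpoint_sort_key)
```

Because the step number is parsed from the filename, a copied file with a fresh timestamp no longer outranks a later checkpoint. One caveat of this sketch: a job whose own name ends in `_<digits>` would have its final file mistaken for a step checkpoint, so a real implementation would need to strip the known job name first.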

Modal Enhancements

Uses React Portal for consistent rendering across different DOM contexts
Fixes transparency issues when modal is rendered inside tables
Prevents accidental closure when dragging text selection outside modal bounds

Testing

Tested with a Z-Image training job, successfully resuming from step 6000 to 10000.

Use Cases

Extend training when initial steps weren't enough
Fine-tune from a completed model with different settings
Create variations of a model with different step counts
Transfer learning from completed training runs

This feature streamlines a common workflow that previously required manual database editing or complex configuration changes.

Adds a 'Continue Training' feature that allows users to continue training from completed jobs in two ways:

1. Resume Training - Continue from last checkpoint with same job name and step counter
2. Start Fresh from Weights - Clone job with new name using final checkpoint as pretrained weights

Changes:

- Added /api/jobs/[jobID]/continue endpoint supporting both resume and clone modes
- Added ContinueTrainingModal component with intuitive mode selection UI
- Updated JobActionBar to show Continue Training option for completed jobs
- Added continueJob() utility function for API calls
- Improved Modal component to use React Portal for consistent rendering
- Fixed Modal to prevent accidental close when dragging text selection
- Enhanced checkpoint detection in BaseSDTrainProcess to prioritize by step number

The checkpoint detection now intelligently sorts by:

1. Final files without step numbers (highest priority)
2. Checkpoints with highest step number
3. Most recently modified files (fallback)

This ensures correct checkpoint loading even when files are copied or moved.

Fixes issues where:

- Modal transparency varied depending on render location
- Modal closed when dragging text selection outside bounds
- Checkpoint detection failed with copied/moved files due to unreliable creation times

diodiogod wants to merge 1 commit into ostris:main from diodiogod:feature/continue-training
