Welcome to the NextStep configuration system! This guide will help you quickly get started and master the configuration system.
## Table of Contents

- Introduction
- Quick Start
- Configuration System Structure
- Creating New Configurations
- Data Configuration Details
- Advanced Usage
- Troubleshooting
- Related Documentation
- Summary
## Introduction

The `configs/` directory manages all configuration files for NextStep training experiments. The configuration system adopts a "base configuration + derived configuration override" design pattern, enabling you to:
- ✅ Quickly create new experiment configurations
- ✅ Reuse existing configurations to avoid code duplication
- ✅ Flexibly override configuration parameters via command line
- ✅ Clearly manage configurations for different resolutions and datasets
## Quick Start

Start training with `smartrun`:

```bash
smartrun -m configs.nextstep_qwen14b_512px
```

Single-node training:

```bash
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 \
  -m configs.nextstep_qwen14b_512px
```

Multi-node training:

```bash
# Node 0 (master node)
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
  --master_addr=<master_node_ip> --master_port=29500 \
  -m configs.nextstep_qwen14b_512px

# Node 1
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
  --master_addr=<master_node_ip> --master_port=29500 \
  -m configs.nextstep_qwen14b_512px
```

Override parameters directly from the command line without modifying configuration files:

```bash
# Override a single parameter
smartrun -m configs.nextstep_qwen14b_512px training.max_steps=5000

# Override multiple parameters
smartrun -m configs.nextstep_qwen14b_512px \
  training.max_steps=5000 \
  training.learning_rate=2e-4 \
  data.image_size=512
```

💡 **Tip**: For more command-line override usage, refer to the Advanced Usage section.
## Configuration System Structure

```
configs/
├── README.md                      # This file (usage guide)
├── nextstep_qwen14b_256px.py      # Base configuration (256px, most complete)
├── nextstep_qwen14b_512px.py      # Derived configuration (512px, inherits from 256px)
└── data/                          # Data configuration directory
    ├── pretrain_data_256px.py     # 256px data source list
    ├── pretrain_data_512px.py     # 512px data source list
    └── pretrain_data.json         # Dataset registry (paths, metadata)
```
| File | Description |
|---|---|
| `nextstep_qwen14b_256px.py` | Main base configuration; contains complete model, data, and training parameters and serves as a template for creating new configurations |
| `nextstep_qwen14b_512px.py` | Derived configuration, created by importing and overriding the 256px configuration; only modifies resolution-related parameters |
| `data/pretrain_data_*.py` | Data source configuration; defines training and validation dataset lists, sampling ratios, filtering conditions, etc. |
| `data/pretrain_data.json` | Dataset registry; stores dataset paths, metadata, and other information |
Each configuration file contains three main parts:

```python
from omegaconf import OmegaConf

config = OmegaConf.create(flags={"allow_objects": True})

# 1. Model Configuration - Define model architecture and paths
config.model = dict(
    model_type="NextStepModel",
    lm_model_name_or_path="...",  # Language model path
    vae_name_or_path="...",       # VAE model path
    # ... more model parameters
)

# 2. Data Configuration - Define datasets and data processing methods
config.data = dict(
    datasets=data_info_list,          # Training dataset list
    val_datasets=val_data_info_list,  # Validation dataset list
    image_size=256,                   # Image resolution
    # ... more data parameters
)

# 3. Training Configuration - Define training hyperparameters and strategies
config.training = dict(
    output_dir="./outputs/...",  # Output directory
    learning_rate=1e-4,          # Learning rate
    max_steps=500_000,           # Maximum training steps
    # ... more training parameters
)
```

💡 **Tip**: Detailed descriptions of all parameters are given in the comments in the configuration files; view the files directly to understand the meaning of each parameter.
Derived configurations are created by importing and modifying base configurations to avoid code duplication:

```python
# Import base configuration in derived configuration
from .nextstep_qwen14b_256px import config

# Only override the configuration items that need to be modified
config.data.image_size = 512
config.training.per_device_train_batch_size = 1
config.training.max_steps = 10_000
```

## Creating New Configurations

If you need to create a completely different experiment configuration:
1. Copy the base configuration file

   ```bash
   cp configs/nextstep_qwen14b_256px.py configs/my_experiment.py
   ```

2. Modify the configuration

   - Update model paths (`LM_MODEL_PATH`, `VAE_MODEL_PATH`, etc.)
   - Update the output directory (`OUTPUT_DIR`)
   - Adjust training parameters (batch size, learning rate, training steps, etc.)
   - If you need to use different data configurations, create new data configuration files

3. Test run

   ```bash
   # Small-scale test run
   smartrun -m configs.my_experiment training.max_steps=100
   ```
If you only need to modify a few parameters (such as resolution or batch size):

1. Create a derived configuration file

   ```python
   # configs/my_experiment.py
   from .nextstep_qwen14b_256px import config
   from .data.pretrain_data_256px import data_info_list, val_data_info_list

   # Override configuration
   config.data.image_size = 512
   config.training.per_device_train_batch_size = 1
   config.training.max_steps = 10_000
   config.training.output_dir = "./outputs/my_experiment"

   # Launch code (must be included)
   if __name__ == "__main__":
       from nextstep.engine.train_nextstep_ds import Arguments, main
       from nextstep.lazy_config import LazyLaunch
       from nextstep.lazyrun import setup_config, setup_environ

       setup_config()
       setup_environ()
       LazyLaunch(main, Arguments)
   ```

2. Test run

   ```bash
   smartrun -m configs.my_experiment
   ```
When creating a new configuration, check the following items:

- All model paths are correct (`LM_MODEL_PATH`, `VAE_MODEL_PATH`, etc.)
- The output directory path is correct (`OUTPUT_DIR`)
- The data configuration is correct (dataset list, sampling ratios, etc.)
- The batch size fits in GPU memory (`per_device_train_batch_size`)
- The number of training steps is reasonable (`max_steps`)
- The learning rate and warmup steps match (`learning_rate`, `warmup_steps`)
- The checkpoint save interval is reasonable (`save_steps`)
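Parts of this checklist can be automated with a small pre-flight script. The sketch below is illustrative only: `check_config` and the exact fields it inspects are assumptions modeled on the config structure shown above, not part of the NextStep codebase.

```python
import os

def check_config(config: dict) -> list:
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    model = config.get("model", {})
    # Model paths should exist on disk before training starts
    for key in ("lm_model_name_or_path", "vae_name_or_path"):
        path = model.get(key)
        if path and not os.path.exists(path):
            problems.append(f"model.{key} does not exist: {path}")
    training = config.get("training", {})
    # Warmup should not exceed the total number of steps
    if training.get("warmup_steps", 0) > training.get("max_steps", 0):
        problems.append("training.warmup_steps exceeds training.max_steps")
    if training.get("per_device_train_batch_size", 1) < 1:
        problems.append("training.per_device_train_batch_size must be >= 1")
    return problems
```

Running `check_config` on a config with a missing model path and `warmup_steps` greater than `max_steps` reports both problems before any GPU time is spent.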
## Data Configuration Details

Before using any new data configuration, you must complete the following three steps.

**Step 1: Convert the data to WebDataset format.** All datasets that enter the data configuration must first be converted to WebDataset (tar) format using the `nextstep/data/build_wds.py` script.

Usage:

Refer to the `nextstep/data/build_wds.py` script and modify the `load_data` and `create_example` functions according to your data source format:

```python
# Modify the configuration in the script
EXAMPLE_PATH = "/path/to/your/data.json"  # Input data path
OUTPUT_DIR = "/path/to/output/tar/files"  # tar file output directory
MAX_SAMPLES_PER_TAR = 100  # Maximum number of samples per tar file
```

```bash
# Run the script
python nextstep/data/build_wds.py
```

Parameter Descriptions:
| Parameter | Description |
|---|---|
| `EXAMPLE_PATH` | Input data file path (can be JSON or JSONL format) |
| `OUTPUT_DIR` | tar file output directory |
| `MAX_SAMPLES_PER_TAR` | Maximum number of samples per tar file (recommended: 100-1000) |
Output Format:

Each tar file contains multiple samples, and each sample contains:

- `key.json`: contains a `caption` field (using `<image_n>` placeholders to mark image positions in the text) and optional `caption_all` and `metadata` fields
- `key-{i}.jpg`: image files, corresponding sequentially to the `<image_0>`, `<image_1>`, etc. placeholders

⚠️ Important Notes:

- You must modify the `load_data` and `create_example` functions according to your data source format
- The `<image_n>` placeholders in the `caption` field must correspond to the image file order
- Sample `key` cannot contain periods (`.`) or hyphens (`-`)
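To make the sample layout concrete, here is a minimal sketch that writes one sample in the format described above using only the standard library's `tarfile` module. It is not the actual `build_wds.py` implementation, and `write_sample` is a hypothetical helper:

```python
import io
import json
import tarfile

def write_sample(tar, key, caption, images):
    """Add one sample to an open tar: key.json plus key-{i}.jpg per image."""
    # Sample keys must not contain periods or hyphens (see notes above)
    assert "." not in key and "-" not in key, "key cannot contain '.' or '-'"
    payload = json.dumps({"caption": caption}).encode("utf-8")
    info = tarfile.TarInfo(name=f"{key}.json")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
    for i, img_bytes in enumerate(images):
        info = tarfile.TarInfo(name=f"{key}-{i}.jpg")  # matches <image_i>
        info.size = len(img_bytes)
        tar.addfile(info, io.BytesIO(img_bytes))

if __name__ == "__main__":
    import os, tempfile
    path = os.path.join(tempfile.mkdtemp(), "shard_000000.tar")
    with tarfile.open(path, "w") as tar:
        # "<image_0>" in the caption refers to sample0-0.jpg
        write_sample(tar, "sample0", "a cat on a mat <image_0>", [b"\xff\xd8..."])
```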
**Step 2: Generate `meta.json`.** All datasets that enter the data configuration must first use the `gen_meta` tool to generate a `meta.json` file. This file contains sample statistics for each tar file and is the key to indexing data during training.

Usage:

```bash
# Basic usage (meta.json will be saved in the dataset root directory)
gen_meta /path/to/your/dataset/root_dir

# Specify save path
gen_meta /path/to/your/dataset/root_dir --meta_save_path /path/to/meta.json

# Specify number of parallel jobs (to speed up processing)
gen_meta /path/to/your/dataset/root_dir --meta_save_path /path/to/meta.json --n_jobs 32
```

Parameter Descriptions:
| Parameter | Description |
|---|---|
| `dir` | Root directory where tar files are located (required) |
| `--meta_save_path` | Save path for `meta.json` (default: `{dir}/meta.json`) |
| `--n_jobs` | Number of parallel jobs (default: 64; set to 0 to auto-detect CPU cores) |
Generated Files:

- `meta.json`: contains sample counts, checksums, and other information for each tar file
- `meta_stats.json`: contains overall dataset statistics (total samples, total files, etc.)
**Step 3: Register the dataset.** After generating `meta.json`, register the dataset information in `configs/data/pretrain_data.json`.

Registration Format:

```json
{
  "dataset_name": {
    "data_type": "image_text|interleave|nlp|...",
    "dir": "/path/to/tar/files",
    "tar_meta_path": "/path/to/meta.json",
    "num_samples": 1000000,
    "description": "Dataset description information"
  }
}
```

Field Descriptions:
| Field | Description |
|---|---|
| `dataset_name` | Name used in data configuration files (e.g., `"text2image/BLIP3o-60k"`); must match the `name` field in data configuration files |
| `data_type` | Data type (`image_text`, `interleave`, `nlp`, etc.) |
| `dir` | Directory path where tar files are located |
| `tar_meta_path` | Path to the `meta.json` file (usually `{dir}/meta.json`) |
| `num_samples` | Total number of samples in the dataset (can be obtained from `meta_stats.json`) |
| `description` | Dataset description (optional; recommended to record the data source, etc.) |
Example:

```json
{
  "text2image/BLIP3o-60k": {
    "data_type": "image_text",
    "dir": "./nextstep_data/BLIP3o-60k",
    "tar_meta_path": "./nextstep_data/BLIP3o-60k/meta.json",
    "num_samples": 58859,
    "description": "filtered from https://huggingface.co/datasets/BLIP3o/BLIP3o-60k"
  }
}
```

Data Preview and Validation:

After completing dataset registration, you can use the data preview tool to check whether tar packages or registered datasets are intact:

```bash
streamlit run nextstep/service/_preview.py --server.port 8501
```

This tool can:
- Preview sample content in tar files (images, text, etc.)
- Check whether datasets registered in `pretrain_data.json` load correctly
- Verify that the data format meets requirements

⚠️ Important Notes:

- All datasets must complete these three steps before they can be used by data configuration files
- `tar_meta_path` must point to the correct `meta.json` file path
- `num_samples` should match `total_samples` in `meta_stats.json`
- Dataset names must match the names used in data configuration files (e.g., `pretrain_data_256px.py`)
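The `num_samples` consistency check from the notes above can be scripted. This is a minimal sketch: `check_registry` is hypothetical, and it assumes `meta_stats.json` sits next to `meta.json` and exposes a `total_samples` field, as described earlier.

```python
import json
import os

def check_registry(registry_path):
    """Cross-check each dataset registered in pretrain_data.json against its meta files."""
    with open(registry_path) as f:
        registry = json.load(f)
    problems = []
    for name, entry in registry.items():
        meta_path = entry.get("tar_meta_path", "")
        if not os.path.exists(meta_path):
            problems.append(f"{name}: tar_meta_path not found: {meta_path}")
            continue
        # meta_stats.json lives alongside meta.json and records total_samples
        stats_path = os.path.join(os.path.dirname(meta_path), "meta_stats.json")
        if os.path.exists(stats_path):
            with open(stats_path) as f:
                total = json.load(f).get("total_samples")
            if total is not None and total != entry.get("num_samples"):
                problems.append(
                    f"{name}: num_samples={entry.get('num_samples')} "
                    f"but meta_stats.json reports {total}"
                )
    return problems
```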
Data configuration files support multiple data source types:

| Data Source Type | Dataset Class | Purpose |
|---|---|---|
| NLP data (text-only) | `NLPITD` | Language model pretraining |
| Image generation data (text-to-image) | `ImageTextWDS` | Text-to-image generation training |
| Image editing data (image-to-image) | `ImageEditingInterleave` | Image editing task training |
| Interleaved multimodal data (video and text) | `VideoInterleave` | Video understanding task training |
Each data source is a dictionary containing the following fields:

```python
{
    "cls": _convert_target_to_string(ImageTextWDS),  # Dataset class
    "data_type": "image_text_pair",                  # Data type
    "name": "text2image/BLIP3o-60k",                 # Dataset name
    "caption_keys": ["caption"],                     # Caption field list
    "caption_ratio": [1],                            # Weight ratio for each caption field
    "filter": {                                      # Data filtering conditions
        "area": [256*256, 1024*1024],                # Image area range (pixels)
        "aspect_ratio": 6,                           # Maximum aspect ratio
    },
    "samples": LargeInt("58K"),                      # Number of samples
}
```

- `samples`: number of samples for each data source; uses `LargeInt` to support large numbers (e.g., `LargeInt("58K")`, `LargeInt("20M")`)
- `scale_to()`: function used to scale data source lists to a target total sample count while maintaining relative proportions
- `MixedDataset`: automatically normalizes `samples` into sampling probabilities
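To illustrate the arithmetic, here is a simplified re-implementation of the three helpers described above. The real `LargeInt`, `scale_to()`, and `MixedDataset` live in the NextStep codebase and may differ in detail:

```python
def parse_large_int(s):
    """Parse shorthand like '58K' or '20M' into an int (simplified LargeInt)."""
    units = {"K": 10**3, "M": 10**6, "B": 10**9}
    s = s.strip()
    if s and s[-1].upper() in units:
        return int(float(s[:-1]) * units[s[-1].upper()])
    return int(s)

def scale_to(sources, target_total):
    """Rescale each source's 'samples' so the list sums to target_total,
    preserving the relative proportions between sources."""
    current_total = sum(src["samples"] for src in sources)
    for src in sources:
        src["samples"] = round(src["samples"] * target_total / current_total)
    return sources

def sampling_probs(sources):
    """Normalize 'samples' into sampling probabilities, as MixedDataset does."""
    total = sum(src["samples"] for src in sources)
    return [src["samples"] / total for src in sources]
```

For example, scaling `[{"samples": 1000}, {"samples": 3000}]` to a target total of 400 yields sample counts `[100, 300]` and sampling probabilities `[0.25, 0.75]`.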
Run the data configuration file to view data statistics:
```bash
python configs/data/pretrain_data_256px.py
```

Output includes:
- Total number of samples
- Distribution statistics by dataset prefix
- Detailed statistics for each dataset
## Advanced Usage

You can override any configuration item via command-line parameters, without modifying the configuration file itself:
```bash
# Override a single parameter
smartrun -m configs.nextstep_qwen14b_256px training.max_steps=200

# Override multiple parameters
smartrun -m configs.nextstep_qwen14b_256px \
  training.max_steps=200 \
  training.learning_rate=2e-4 \
  data.image_size=512
```

Use dots (`.`) to address and override nested configuration items:

```bash
# Override training configuration
training.max_steps=10000
training.learning_rate=2e-4
training.per_device_train_batch_size=4

# Override model configuration
model.lm_loss_weight=0.02
model.freeze_lm_model=True

# Override data configuration
data.image_size=512
data.num_workers=8
```

For complex types (lists, dictionaries), use quotes:

```bash
# Override a list
training.report_to='["wandb"]'

# Override a dictionary
training.eval_kwargs='{"temperature": 0.8, "cfg": 5.0}'
```

## Troubleshooting

**Problem: GPU out of memory (OOM).** Solutions:
- Reduce `per_device_train_batch_size` to decrease the per-device batch size
- Reduce `model_max_length` to shorten the sequence length within each batch
- Increase `grad_accumulation_steps` to maintain the effective batch size
- Reduce `image_size` or use a smaller model
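The first and third bullets work together: the effective (global) batch size is the product of the per-device batch size, the gradient accumulation steps, and the number of processes, so halving one factor while doubling another leaves it unchanged. A quick sanity check:

```python
def effective_batch_size(per_device, grad_accum, world_size):
    """Global batch size seen by the optimizer at each update step."""
    return per_device * grad_accum * world_size

# Hypothetical 8-GPU run: halving the per-device batch while doubling
# gradient accumulation keeps the effective batch size at 64.
assert effective_batch_size(4, 2, 8) == 64
assert effective_batch_size(2, 4, 8) == 64
```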
**Problem: data loading is slow.** Solutions:

- Increase `num_workers` (but do not exceed CPU cores / number of GPUs)
- Increase `prefetch_factor`
- Enable `pin_memory` (if the environment supports it)
- Check where the data is stored (local SSD is faster than network storage)
**Problem: a command-line override does not take effect.** Checklist:

- The configuration path is correct (use dots as separators, e.g., `training.max_steps`)
- The configuration item name exactly matches the one in the configuration file (case-sensitive)
- Complex types (lists, dictionaries) are quoted
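For intuition on how a dotted override reaches a nested configuration item, here is a simplified, pure-Python sketch. The real system goes through its OmegaConf-based lazy config machinery; `apply_override` is purely illustrative:

```python
import ast

def apply_override(config, override):
    """Apply a 'path.to.key=value' override to a nested dict (simplified)."""
    dotted, raw = override.split("=", 1)
    *parents, leaf = dotted.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    try:
        # Parses numbers, booleans, lists, dicts: '["wandb"]' becomes a list
        node[leaf] = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        node[leaf] = raw  # fall back to a plain string
```

This is also why quoting matters for complex types: the shell must deliver the literal `["wandb"]` text intact so it can be parsed as a list rather than split into separate arguments.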
**Resuming training from a checkpoint:**

Method 1: Manually specify the checkpoint directory to resume from

```python
config.training.resume = "./outputs/checkpoint-10000"
```

Method 2: Enable automatic resume

```python
config.training.auto_resume = True  # Automatically resume from the latest checkpoint
```

Notes:

- Ensure `output_dir` matches the checkpoint directory
- You can also override via the command line: `training.resume=./outputs/checkpoint-10000`
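Automatic resume typically amounts to scanning `output_dir` for the highest-numbered `checkpoint-<step>` directory. Here is a sketch of that logic (illustrative, not necessarily NextStep's exact implementation):

```python
import os
import re
from typing import Optional

def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint-<step> subdirectory with the highest step, if any."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        m = pattern.match(name)
        if m and int(m.group(1)) > best_step:
            best_step, best_path = int(m.group(1)), os.path.join(output_dir, name)
    return best_path
```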
**Adjusting dataset sampling ratios.** Modification methods:

- Modify the `samples` field in the data configuration file
- Use the `scale_to()` function for uniform scaling
- `MixedDataset` automatically normalizes sampling probabilities
## Related Documentation

- Training script: `nextstep/engine/train_nextstep_ds.py`
- Training arguments definition: `nextstep/engine/training_args.py`
- Model definition: `nextstep/models/nextstep/modeling_nextstep.py`
- Dataset classes: `nextstep/datasets/`
- Configuration system: `nextstep/lazy_config/`
## Summary

Core concepts of the configuration system:

- **Base configuration**: `nextstep_qwen14b_256px.py` is a complete template containing all parameter descriptions
- **Derived configuration**: create new configurations by importing and overriding, avoiding code duplication
- **Command-line override**: modify parameters flexibly without editing configuration files
- **Data separation**: data configuration is managed independently for easy reuse and modification

If you have questions, check the detailed comments in the configuration files or refer to the related documentation.