This document tracks changes that affect configuration usage patterns (added/moved/removed/renamed fields, notable logic changes).
- `model.lora`: Moved from `model.experimental.lora` to `model.lora` (no longer experimental) (#1440, 2025-12-16)
- Auto-set `api_server_count=1` on inference when LoRA is enabled, because vLLM doesn't support hotloading for multiple API servers (#1422, 2025-12-17)
- `inference.model.rope_scaling`: Added RoPE scaling configuration passthrough to vLLM (#1447, 2025-12-17)
- `orchestrator.env_mix`: Deprecated in favor of `orchestrator.buffer.env_ratios` (#1450, 2025-12-18)
- `orchestrator.buffer.hash_keys`: Added hash keys configuration for buffer checkpointing (#1450, 2025-12-18)
- `orchestrator.buffer.env_ratios`: Added environment ratio configuration for buffer sampling (#1450, 2025-12-18)
- `orchestrator.buffer.skip_verification`: Added configuration to skip verification of rollouts using the environment's rubric. If True, rewards are always set to 0. Cannot be used with `online_difficulty_filtering=True` or when `easy_threshold`/`hard_threshold` are set (default: False)
- `orchestrator.ckpt.buffer_path`: Deprecated (#1450, 2025-12-18)
- `orchestrator.buffer.easy_fraction` and `orchestrator.buffer.hard_fraction`: The easy and hard fractions now define the fraction of easy and hard problems to convert to normal when resuming, whereas previously they were the ratio of easy/hard samples to sample per step (#1450, 2025-12-18)
- `orchestrator.teacher_model`: Added teacher model configuration for computing teacher logprobs (e.g. for distillation). Supports `TeacherModelConfig` (custom model) or `None` (disabled). Renamed from `reference_model` (2025-12-20)
- `seq_len`: Added root-level `seq_len` config that sets both `trainer.model.seq_len` and `orchestrator.seq_len`. Added validation that `trainer.model.seq_len >= orchestrator.seq_len` (2025-12-18)
- `trainer.loss.sequence_mask_ratio_low` and `trainer.loss.sequence_mask_ratio_high`: Renamed to `trainer.loss.sequence_mask_low` and `trainer.loss.sequence_mask_high` (2025-12-19)
- `trainer.loss.token_mask_high` and `trainer.loss.token_mask_low`: Added token-level importance ratio masking thresholds (2025-12-19)
- `trainer.loss.sequence_clip_high`: Added sequence-level importance ratio clipping threshold (2025-12-19)
- `trainer.loss.geo_mask_high` and `trainer.loss.geo_mask_low`: Added geometric importance ratio masking thresholds (2025-12-19)
- `trainer.loss.adv_tau`: Added tau parameter for advantages (default: 1.0)
- `trainer.loss.teacher_tau`: Added tau parameter for teacher logprobs (default: 0.0). Renamed from `ref_tau`
- `teacher_gpu_ids`: Added GPU IDs for the teacher inference server. When set, automatically starts a teacher inference server and configures `orchestrator.teacher_model`
- `teacher_inference`: Added optional teacher inference config. Defaults to copying from the `inference` config with port 8001
- `{orchestrator,trainer}.transport.zmq`: Added ZMQ transport for training batches and micro batches (#1446, 2025-12-22)
- `model.impl`: Changed default from `hf` to `auto`. With `auto`, the implementation automatically selects `custom` if supported for the model, otherwise falls back to `hf` (#1488, 2025-12-27)
- `orchestrator.eval.skip_eval_on_resume`: Added flag (default `True`) to skip the first, potentially redundant online eval immediately after resuming from a checkpoint (#1491, 2025-12-27)
- `trainer.weight_broadcast.adapter_only`: Removed. Adapter-only behavior is now automatically derived from the presence of LoRA configuration (2025-12-27)
- `ckpt.keep`: Renamed to `ckpt.keep_last`. Added `ckpt.keep_interval` to keep checkpoints at every N steps permanently (2025-12-31)
- `MultiLoRAMoE`: QwenMoE now supports training expert LoRAs; this is enabled by default in the `target_modules` (2026-01-01)
- `model.fused_lm_head_chunk_size`: Added chunk size configuration for the fused LM head to enable memory-efficient chunked logprob computation. When set, splits the vocabulary into chunks to avoid materializing the full [N, V] logit tensor (default: None) (#1525, 2026-01-03)
- `model.fused_lm_head_chunk_size`: RL training now auto-sets this to 2048 if not specified (except when `impl='liger_kernel'`). SFT training continues to use None (2026-01-05)
- `trainer.metrics_server`: Added optional Prometheus metrics server for trainer observability. Exposes a `/metrics` endpoint with step, loss, throughput, grad_norm, etc. Disabled by default (default: None) (#1547, 2026-01-06)
- `model.lora.alpha`: Changed default from 16.0 to 32.0 (2026-01-10)
- `orchestrator.env.log`: Added logging configuration for environment workers. If set, enables logging with `level` (str, default: "warn") and `vf_level` (str, default: "warn") fields. If None (default), logging is disabled (#1561, 2026-01-13)
- `eval.watcher`: Added flag (default `False`) to watch `weights_dir` for newly-created stable checkpoints and evaluate them as they appear (2026-01-14)
- `orchestrator.log.env_worker_logs`: Added flag (default `False`) to write env worker logs to `logs/env_workers/{env_name}.log` (2026-01-15)
- `orchestrator.env.log`: Removed. Use `orchestrator.log` for env worker logging instead (2026-01-15)
- `orchestrator.eval.retry.reraise`: Changed default from `True` to `False`. When `False`, raises `tenacity.RetryError` after retries are exhausted instead of the original exception, allowing failed eval environments to be skipped with a warning (#1586, 2026-01-14)
- `model.ep`: Expert parallelism is now supported (with auto/custom impl only); changed from the old behaviour, where `ep>1` was a no-op, to a proper parallelization of the MoE layers (#1595, 2026-01-15)
- `orchestrator.reload_weights_on_start`: Removed. The reload was a no-op in practice, since vLLM servers already start with base weights and LoRA runs skipped it (#1829, 2026-02-19)
- `orchestrator.client.elastic`: Added elastic inference pool with DNS-based service discovery. Supports dynamic server scaling via any DNS hostname with multiple A records (Kubernetes headless services, Consul, Route53, etc.). Automatically syncs LoRA adapters on new servers and only exposes ready servers to workers (#1617, 2026-01-19)
- `model.fused_lm_head_chunk_size`: Replaced the `int | None` chunk size setting with `int | Literal["auto", "disabled"]`. `auto` auto-sets to 2048 if possible; `disabled` explicitly disables chunked loss (uses the vanilla LM head). Default behaviour is `auto` for RL training and `disabled` for SFT training (not changed from the previous version) (#1649, 2026-01-23)
- `client.skip_model_check`: Added configuration to skip checking whether the model is available in the inference pool. Useful for external APIs or API keys that don't support the /models endpoint (default: False) (#1543, 2026-01-06)
- `orchestrator.sampling.temp_scheduler`: Added optional temperature schedule configuration with linear and cosine schedules. Set either `sampling.temperature` (constant) or `sampling.temp_scheduler` (schedule), not both. Default remains 1.0 if neither is set (2026-01-27)
- `orchestrator.trajectory_strategy`: Deprecated. Interleaving now automatically handles extension breaks by starting a new sample when the prefix doesn't match, achieving best-of-both behavior. The setting is ignored and interleaved mode is always used (2026-01-30)
- `model.impl`: Removed the `liger_kernel` model implementation from supported options. The Liger kernel dependency remains for SFT loss (2026-01-30)
- `log.json_logging`: Added JSON structured logging option for log aggregation systems (Loki, Grafana, etc.). Outputs flat newline-delimited JSON with `timestamp`, `level`, `message`, `module`, `function`, `line` fields. Available on root `log`, `trainer.log`, and `orchestrator.log` (default: False) (2026-01-28)
- `model.optim_cpu_offload`: Added flag to offload optimizer states to CPU without moving parameters (default: False) (2026-01-31)
- `orchestrator.tasks_per_minute`: Added optional rate limiting for sandbox tasks per environment worker, using a token bucket algorithm. If None (default), no rate limiting is applied (2026-02-02)
- `model.cp`: When `cp>1` with `attn="flash_attention_3"`, require `model.impl="custom"` (the FA3 ring-attention kernel exists only in the custom path) (2026-02-06)
- `model.attn`: Added `fa4` as an attention implementation option. Flash Attention 4 is only supported with the custom implementation (#1726, 2026-02-06)
- `inference.model.enable_prefix_caching`: Added flag to enable prefix caching in vLLM. Passed to vLLM as `--enable-prefix-caching` (default: None) (2026-02-08)
- `orchestrator.env.address`: Added address field on `EnvConfig`. If set, connect to an environment server at this address; if None, spawn a server in a subprocess (2026-02-06)
- `orchestrator.env.extra_env_kwargs`: Added on `EnvConfig`. Extra kwargs passed to the env (e.g. seq_len, interleaved_rollouts, score_rollouts). Auto-populated by the orchestrator for training envs; generally not recommended for user override. The main use case is matching these kwargs when running an env in an isolated environment server (default: {}) (2026-02-06)
- `OrchestratorConfig`: Removed `workers_per_env`, `max_env_worker_restarts`, and `mask_env_responses` (2026-02-06)
- `EvalSaveDiskConfig`, `EvalSaveConfig`, `RetryConfig`, `OnlineEvalConfig`: Removed (2026-02-06)
- `TemperatureScheduleConfig`: Renamed to `TemperatureSchedulerConfig` (2026-02-06)
- `optim.mu`: Added Muon momentum (mu) config field (default: 0.95). Previously hardcoded to the Muon class default. Also fixed `optim.betas1`/`optim.betas2` not being passed through to the Muon optimizer (2026-02-09)
- `dump_config`: Added `--dump-config <path>` flag to the `rl` command. When set, writes the resolved subconfigs (trainer, orchestrator, inference, teacher_inference) to the given directory and exits without starting any processes (2026-02-12)
- `client.api_key_var`: Changed default from "OPENAI_API_KEY" to "VLLM_API_KEY" (2026-02-12)
- `orchestrator.filters`: Added orchestrator-side rollout filters for detecting degenerate generations. Supports `[[filters]] type = "gibberish"` (rare tokens at high entropy) and `[[filters]] type = "repetition"` (high-confidence token streaks). Detected rollouts get their reward zeroed and completion mask cleared (2026-02-13)
- `inference.model.tool_call_parser`: Changed default from `"hermes"` to auto-detection from the model name. Uses the `MODEL_TOOL_CALL_PARSER` dict to infer the correct vLLM tool call parser (e.g. Qwen3→hermes, GLM-4.5→glm45, GLM-4.7→glm47, MiniMax-M2→minimax_m2, INTELLECT-3→hermes). Unknown models default to `None`. Explicit values still take priority (#1795, 2026-02-16)
- `orchestrator.eval.cancel_inflight_rollouts_on_eval`: Added flag to optionally cancel in-flight training rollouts before starting online evals. When enabled, avoids congestion by preventing training and eval rollouts from running simultaneously, but slows training as the rollout pipeline must refill after each eval (default: False) (2026-02-16)
- `orchestrator.use_token_client`: Added flag to use the token-in-token-out (TITO) client for training across all environments. When enabled, uses the `openai_chat_completions_token` client type instead of `openai_chat_completions`. Only use when environments have linear history and the chat template has the extension property (default: False) (2026-02-21)
- `model.cp` + AFMoE: Context parallelism now works with AFMoE models via a unified `substitute_ring_attn`, which patches `_compute_attention` on both `FlashAttention` and `AfmoeFlashAttention` to use ring attention. Sliding window layers automatically get a per-layer `window_size`; full attention layers default to `(-1, -1)`. Also plumbed `window_size` through the FA3 ring attention wrapper (`ring_fa3_varlen_func`) (2026-02-21)
- `orchestrator.token_batch_size` and `orchestrator.max_inflight_rollouts`: Added token-based batching via `token_batch_size` and explicit in-flight rollout control via `max_inflight_rollouts` (2026-02-23)
- `orchestrator.batch_size`: Now optional and mutually exclusive with `token_batch_size`. If neither is set, defaults to rollout mode with `batch_size=128` (2026-02-23)
- `inference.enable_expert_parallel`, `inference.all2all_backend`, and `inference.enable_eplb`: Added expert-parallel inference controls passed to vLLM as `--enable-expert-parallel`, `--all2all-backend`, and `--enable-eplb` (defaults: `False`, `"allgather_reducescatter"`, `False`) (2026-02-23)
- `rl_slurm`/`sft_slurm` entrypoints: Removed. SLURM submission is now handled by the unified `rl` and `sft` entrypoints. Add a `[slurm]` section to your config to submit via SLURM instead of running locally (2026-02-23)
- `inference_gpu_ids`/`trainer_gpu_ids`/`teacher_gpu_ids`: Removed from `RLConfig`. Replaced by a `[deployment]` section with `type = "single_node"` (fields: `num_train_gpus`, `num_infer_gpus`, `num_teacher_gpus`) or `type = "multi_node"` (fields: `num_train_nodes`, `num_infer_nodes`, `num_teacher_nodes`, `nodes_per_fsdp_group`). Default is `single_node` with 1 train GPU and 1 infer GPU (2026-02-23)
- `RLSLURMConfig`: Removed. Fields `job_name`, `num_train_nodes`, `num_infer_nodes`, `gpus_per_node`, `nodes_per_fsdp_group`, `project_dir`, `slurm_template`, `dry_run` are now under `[slurm]` and `[deployment]` in the unified `RLConfig` (2026-02-23)
- `SFTSLURMConfig`: Removed. Fields `job_name`, `num_nodes`, `gpus_per_node`, `nodes_per_fsdp_group`, `project_dir`, `slurm_template`, `dry_run` are now under `[slurm]` and `[deployment]` in `SFTConfig` (2026-02-23)
- `[deployment]` (RL): Added deployment configuration. `type = "single_node"` auto-derives contiguous GPU assignments from `num_infer_gpus`, `num_train_gpus`, `num_teacher_gpus`. `type = "multi_node"` requires `[slurm]` and uses `num_train_nodes`, `num_infer_nodes` (2026-02-23)
- `[deployment]` (SFT): Added deployment configuration. `type = "single_node"` with `num_gpus` (default: 1). `type = "multi_node"` with `num_nodes`, `nodes_per_fsdp_group`, `hf_hub_offline` (2026-02-23)
- `[slurm]` (RL): Added SLURM configuration with `job_name`, `project_dir`, `template_path`, `partition`, `dry_run`. When present, `uv run rl` generates and submits an sbatch script instead of running locally. The template is auto-selected based on the deployment type (2026-02-23)
- `[slurm]` (SFT): Added SLURM configuration with `job_name`, `project_dir`, `template_path`, `partition`, `dry_run`. When present, `uv run sft` generates and submits an sbatch script instead of running locally (2026-02-23)
- `hf_hub_offline` (RL/SFT SLURM): Removed. `HF_HUB_OFFLINE=1` is now hardcoded in the multi-node SLURM templates (2026-02-23)
- SLURM templates: Moved from `src/prime_rl/slurm/` to `src/prime_rl/templates/` and renamed to `single_node_rl.sbatch.j2`, `multi_node_rl.sbatch.j2`, `single_node_sft.sbatch.j2`, `multi_node_sft.sbatch.j2` (2026-02-23)
- Entrypoints: Moved the `rl` and `sft` entrypoints from `prime_rl.rl`/`prime_rl.sft` to `prime_rl.entrypoints.rl`/`prime_rl.entrypoints.sft`. No change to CLI usage (`uv run rl`, `uv run sft`) (2026-02-24)
- `output_dir` (RL): Changed from `Path | None` (default `None`) to `Path` (default `Path("outputs")`). The SLURM-specific validation that rejected the default has been removed; `output_dir` now works the same for local and SLURM runs (2026-02-24)
- `clean_output_dir`: Added to `RLConfig` and `SFTConfig` (default: `False`). Training now raises `FileExistsError` when `output_dir` contains checkpoints from a previous run and the run is not resuming. Set `clean_output_dir=true` to delete and start fresh, or set `ckpt.resume_step` to resume (2026-02-24)
- `clean`: Removed from `RLConfig`. The old `clean` flag (default: `True`) silently deleted logs, rollouts, and broadcasts on every local RL run. Superseded by the explicit `clean_output_dir` flag (2026-02-24)
- Config consolidation: All config modules moved into the `prime_rl.configs` subpackage. `utils/config.py` + `transport/config.py` → `configs/shared.py`; `trainer/config.py` + `trainer/rl/config.py` → `configs/trainer.py`; `trainer/sft/config.py` → `configs/sft.py`; `orchestrator/config.py` → `configs/orchestrator.py`; `inference/config.py` → `configs/inference.py`; `rl_config.py` → `configs/rl.py`. Class renames: `SFTTrainerConfig` → `SFTConfig`, `RLTrainerConfig` → `TrainerConfig`. Component prefixes dropped from orchestrator and inference config classes (e.g. `OrchestratorCheckpointConfig` → `CheckpointConfig`). TypeAlias renames: dropped the `Type` suffix (e.g. `LossConfigType` → `LossConfig`, `TransportConfigType` → `TransportConfig`); renamed the `LossConfig` class → `DefaultLossConfig`. No TOML key changes (2026-02-24)
- `trainer.enable_router_replay`: Added flag to enable router replay. If True, routed experts are returned in the batch. Requires `enable_return_routed_experts=True` in the inference config, or passing `--enable-return-routed-experts` to the vLLM server. Only supported for custom models (2026-02-22)
- `inference.enable_return_routed_experts`: Added flag to enable returning routed experts. Passed to vLLM as `--enable-return-routed-experts` (2026-02-22)
- `orchestrator.oversampling_factor`: Added rollout-only over-sampling config that resolves `max_inflight_rollouts = int(batch_size * oversampling_factor)` when `max_inflight_rollouts` is unset. Cannot be used with `token_batch_size` or together with an explicit `max_inflight_rollouts` (2026-02-25)
- `model.fused_lm_head_chunk_size`: Changed default value from 2048 to 8192 for RL training (2026-02-26)
- `inference.data_parallel_size_local` and `inference.data_parallel_rpc_port`: Added data-parallel node-local controls for vLLM, passed as `--data-parallel-size-local` and `--data-parallel-rpc-port` (defaults: `None`, `13345`) (2026-02-26)
- `dump_config`: Removed from `RLConfig`. Replaced by `dry_run` (see below) (2026-02-26)
- `slurm.dry_run`: Removed from `SlurmConfig`. Replaced by the top-level `dry_run` (see below) (2026-02-26)
- `dry_run`: Added to `RLConfig` and `SFTConfig` (default: `False`). When set, validates the config, writes resolved subconfigs to `output_dir/configs/`, and exits without starting any processes. Works the same for both local and SLURM runs (2026-02-26)
- Config output location: Resolved subconfigs are now always written to `output_dir/configs/` instead of `.pydantic_config/<uuid>/`. This applies to both local and SLURM entrypoints, and to both single-node and multi-node deployments (2026-02-26)
- SFT config filename: The resolved SFT trainer config is now written as `sft.toml` instead of `trainer.toml` (2026-02-26)
- `orchestrator.prime_monitor`: Extended `PrimeMonitorConfig` with `run_name` and `team_id` fields, plus auto-registration support: when `RUN_ID` is not set, the monitor now registers an external run on the platform automatically and streams live metrics, samples, and distributions. Authentication is read from `PRIME_API_KEY` or `~/.prime/config.json` (`prime login`) (2026-02-27)
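Several of the entries above introduce new TOML sections. As a rough illustration, a minimal RL config touching the `[deployment]`, `[slurm]`, and rollout-filter settings might look like the sketch below. This is a hedged sketch, not a tested config: the key names are taken from the changelog entries, while the exact nesting (notably the `[[orchestrator.filters]]` table path) and all values are assumptions.

```toml
# Hedged sketch based on the changelog entries above; values are illustrative.
dry_run = false           # validate, write resolved configs to output_dir/configs/, exit
clean_output_dir = false  # set to true to wipe stale checkpoints instead of raising FileExistsError
output_dir = "outputs"

[deployment]
type = "single_node"      # contiguous GPU assignments derived from the counts below
num_train_gpus = 1
num_infer_gpus = 1

[slurm]                   # if present, `uv run rl` submits an sbatch script instead of running locally
job_name = "example-run"
partition = "gpu"

# Orchestrator-side rollout filters for degenerate generations
[[orchestrator.filters]]
type = "gibberish"        # rare tokens at high entropy

[[orchestrator.filters]]
type = "repetition"       # high-confidence token streaks
```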