This document tracks changes that affect configuration usage patterns (added/moved/removed/renamed fields, notable logic changes).
- `model.lora`: Moved from `model.experimental.lora` to `model.lora` (no longer experimental) (#1440, 2025-12-16)
- Auto-set `api_server_count=1` on inference when LoRA is enabled, because vLLM doesn't support hotloading for multiple API servers (#1422, 2025-12-17)
- `inference.model.rope_scaling`: Added RoPE scaling configuration passthrough to vLLM (#1447, 2025-12-17)
- `orchestrator.env_mix`: Deprecated in favor of `orchestrator.buffer.env_ratios` (#1450, 2025-12-18)
- `orchestrator.buffer.hash_keys`: Added hash keys configuration for buffer checkpointing (#1450, 2025-12-18)
- `orchestrator.buffer.env_ratios`: Added environment ratio configuration for buffer sampling (#1450, 2025-12-18)
- `orchestrator.buffer.skip_verification`: Added configuration to skip verification of rollouts using the environment's rubric. If True, rewards are always set to 0. Cannot be used with `online_difficulty_filtering=True` or when `easy_threshold`/`hard_threshold` are set (default: False)
- `orchestrator.ckpt.buffer_path`: Deprecated (#1450, 2025-12-18)
- `orchestrator.buffer.easy_fraction` and `orchestrator.buffer.hard_fraction`: The easy and hard fractions now define the fraction of easy and hard problems to convert to normal when resuming, whereas previously they were the ratio of easy/hard samples to sample per step (#1450, 2025-12-18)
- `orchestrator.teacher_model`: Added teacher model configuration for computing teacher logprobs (e.g. for distillation). Supports `TeacherModelConfig` (custom model) or `None` (disabled). Renamed from `reference_model` (2025-12-20)
- `seq_len`: Added root-level `seq_len` config that sets both `trainer.model.seq_len` and `orchestrator.seq_len`. Added validation that `trainer.model.seq_len >= orchestrator.seq_len` (2025-12-18)
- `trainer.loss.sequence_mask_ratio_low` and `trainer.loss.sequence_mask_ratio_high`: Renamed to `trainer.loss.sequence_mask_low` and `trainer.loss.sequence_mask_high` (2025-12-19)
- `trainer.loss.token_mask_high` and `trainer.loss.token_mask_low`: Added token-level importance ratio masking thresholds (2025-12-19)
- `trainer.loss.sequence_clip_high`: Added sequence-level importance ratio clipping threshold (2025-12-19)
- `trainer.loss.geo_mask_high` and `trainer.loss.geo_mask_low`: Added geometric importance ratio masking thresholds (2025-12-19)
- `trainer.loss.adv_tau`: Added tau parameter for advantages (default: 1.0)
- `trainer.loss.teacher_tau`: Added tau parameter for teacher logprobs (default: 0.0). Renamed from `ref_tau`
- `teacher_gpu_ids`: Added GPU IDs for the teacher inference server. When set, automatically starts a teacher inference server and configures `orchestrator.teacher_model`
- `teacher_inference`: Added optional teacher inference config. Defaults to copying from the `inference` config with port 8001
- `{orchestrator,trainer}.transport.zmq`: Added ZMQ transport for training batches and micro batches (#1446, 2025-12-22)
- `model.impl`: Changed default from `hf` to `auto`. With `auto`, the implementation automatically selects `custom` if supported for the model, otherwise falls back to `hf` (#1488, 2025-12-27)
- `orchestrator.eval.skip_eval_on_resume`: Added flag (default `True`) to skip the first, potentially redundant online eval immediately after resuming from a checkpoint (#1491, 2025-12-27)
- `trainer.weight_broadcast.adapter_only`: Removed. Adapter-only behavior is now automatically derived from the presence of LoRA configuration (2025-12-27)
- `ckpt.keep`: Renamed to `ckpt.keep_last`. Added `ckpt.keep_interval` to keep checkpoints at every N steps permanently (2025-12-31)
- `MultiLoRAMoE`: QwenMoE now supports training expert LoRAs; this is enabled by default in the `target_modules` (2026-01-01)
- `model.fused_lm_head_chunk_size`: Added chunk size configuration for the fused LM head to enable memory-efficient chunked logprob computation. When set, splits the vocabulary into chunks to avoid materializing the full [N, V] logit tensor (default: None) (#1525, 2026-01-03)
- `model.fused_lm_head_chunk_size`: RL training now auto-sets this to 2048 if not specified (except when `impl='liger_kernel'`). SFT training continues to use None (2026-01-05)
- `trainer.metrics_server`: Added optional Prometheus metrics server for trainer observability. Exposes a `/metrics` endpoint with step, loss, throughput, grad_norm, etc. Disabled by default (default: None) (#1547, 2026-01-06)
- `model.lora.alpha`: Changed default from 16.0 to 32.0 (2026-01-10)
- `orchestrator.env.log`: Added logging configuration for environment workers. If set, enables logging with `level` (str, default: "warn") and `vf_level` (str, default: "warn") fields. If None (default), logging is disabled (#1561, 2026-01-13)
- `eval.watcher`: Added flag (default `False`) to watch `weights_dir` for newly-created stable checkpoints and evaluate them as they appear (2026-01-14)
- `orchestrator.log.env_worker_logs`: Added flag (default `False`) to write env worker logs to `logs/env_workers/{env_name}.log` (2026-01-15)
- `orchestrator.env.log`: Removed. Use `orchestrator.log` for env worker logging instead (2026-01-15)
- `orchestrator.eval.retry.reraise`: Changed default from `True` to `False`. When `False`, raises `tenacity.RetryError` after retries are exhausted instead of the original exception, allowing failed eval environments to be skipped with a warning (#1586, 2026-01-14)
- `model.ep`: Expert parallelism is now supported (with auto/custom impl only); changed from the old behaviour, where `ep>1` was a no-op, to a proper parallelization of the MoE layers (#1595, 2026-01-15)
- `orchestrator.reload_weights_on_start`: Removed. The reload was a no-op in practice, since vLLM servers already start with base weights and LoRA runs skipped it (#1829, 2026-02-19)
- `orchestrator.client.elastic`: Added elastic inference pool with DNS-based service discovery. Supports dynamic server scaling via any DNS hostname with multiple A records (Kubernetes headless services, Consul, Route53, etc.). Automatically syncs LoRA adapters on new servers and only exposes ready servers to workers (#1617, 2026-01-19)
- `model.fused_lm_head_chunk_size`: Replaced the `int | None` chunk size setting with `int | Literal["auto", "disabled"]`. `auto` auto-sets to 2048 if possible; `disabled` explicitly disables chunked loss (uses the vanilla LM head). Default behaviour is `auto` for RL training and `disabled` for SFT training (not changed from the previous version) (#1649, 2026-01-23)
- `client.skip_model_check`: Added configuration to skip checking whether the model is available in the inference pool. Useful for external APIs or API keys that don't support the /models endpoint (default: False) (#1543, 2026-01-06)
- `orchestrator.sampling.temp_scheduler`: Added optional temperature schedule configuration with linear and cosine schedules. Set either `sampling.temperature` (constant) or `sampling.temp_scheduler` (schedule), not both. Default remains 1.0 if neither is set (2026-01-27)
- `orchestrator.trajectory_strategy`: Deprecated. Interleaving now automatically handles extension breaks by starting a new sample when the prefix doesn't match, achieving best-of-both behavior. The setting is ignored and interleaved mode is always used (2026-01-30)
- `model.impl`: Removed the `liger_kernel` model implementation from supported options. The Liger kernel dependency remains for SFT loss (2026-01-30)
- `log.json_logging`: Added JSON structured logging option for log aggregation systems (Loki, Grafana, etc.). Outputs flat newline-delimited JSON with `timestamp`, `level`, `message`, `module`, `function`, `line` fields. Available on root `log`, `trainer.log`, and `orchestrator.log` (default: False) (2026-01-28)
- `model.optim_cpu_offload`: Added flag to offload optimizer states to CPU without moving parameters (default: False) (2026-01-31)
- `orchestrator.tasks_per_minute`: Added optional rate limiting for sandbox tasks per environment worker, using a token bucket algorithm. If None (default), no rate limiting is applied (2026-02-02)
- `model.cp`: When `cp>1` with `attn="flash_attention_3"`, require `model.impl="custom"` (the FA3 ring-attention kernel exists only in the custom path) (2026-02-06)
- `model.attn`: Added `fa4` as an attention implementation option. Flash Attention 4 is only supported with the custom implementation (#1726, 2026-02-06)
- `inference.model.enable_prefix_caching`: Added flag to enable prefix caching in vLLM. Passed to vLLM as `--enable-prefix-caching` (default: None) (2026-02-08)
- `orchestrator.env.address`: Added address field on `EnvConfig`. If set, connect to an environment server at this address; if None, spawn a server in a subprocess (2026-02-06)
- `orchestrator.env.extra_env_kwargs`: Added on `EnvConfig`. Extra kwargs passed to the env (e.g. seq_len, interleaved_rollouts, score_rollouts). Auto-populated by the orchestrator for training envs; generally not recommended for user override. The main use case is matching these kwargs when running an env in an isolated environment server (default: {}) (2026-02-06)
- `OrchestratorConfig`: Removed `workers_per_env`, `max_env_worker_restarts`, and `mask_env_responses` (2026-02-06)
- `EvalSaveDiskConfig`, `EvalSaveConfig`, `RetryConfig`, `OnlineEvalConfig`: Removed (2026-02-06)
- `TemperatureScheduleConfig`: Renamed to `TemperatureSchedulerConfig` (2026-02-06)
- `optim.mu`: Added Muon momentum (mu) config field (default: 0.95). Previously hardcoded to the Muon class default. Also fixed `optim.betas1`/`optim.betas2` not being passed through to the Muon optimizer (2026-02-09)
- `dump_config`: Added `--dump-config <path>` flag to the `rl` command. When set, writes the resolved subconfigs (trainer, orchestrator, inference, teacher_inference) to the given directory and exits without starting any processes (2026-02-12)
- `client.api_key_var`: Changed default from "OPENAI_API_KEY" to "VLLM_API_KEY" (2026-02-12)
- `orchestrator.filters`: Added orchestrator-side rollout filters for detecting degenerate generations. Supports `[[filters]] type = "gibberish"` (rare tokens at high entropy) and `[[filters]] type = "repetition"` (high-confidence token streaks). Detected rollouts get their reward zeroed and completion mask cleared (2026-02-13)
- `inference.model.tool_call_parser`: Changed default from `"hermes"` to auto-detection from the model name. Uses the `MODEL_TOOL_CALL_PARSER` dict to infer the correct vLLM tool call parser (e.g. Qwen3→hermes, GLM-4.5→glm45, GLM-4.7→glm47, MiniMax-M2→minimax_m2, INTELLECT-3→hermes). Unknown models default to `None`. Explicit values still take priority (#1795, 2026-02-16)
- `orchestrator.eval.cancel_inflight_rollouts_on_eval`: Added flag to optionally cancel in-flight training rollouts before starting online evals. When enabled, avoids congestion by preventing training and eval rollouts from running simultaneously, but slows training as the rollout pipeline must refill after each eval (default: False) (2026-02-16)
- `orchestrator.use_token_client`: Added flag to use the token-in-token-out (TITO) client for training across all environments. When enabled, uses the `openai_chat_completions_token` client type instead of `openai_chat_completions`. Only use when environments have linear history and the chat template has the extension property (default: False) (2026-02-21)
- `model.cp` + AFMoE: Context parallelism now works with AFMoE models via a unified `substitute_ring_attn`, which patches `_compute_attention` on both `FlashAttention` and `AfmoeFlashAttention` to use ring attention. Sliding window layers automatically get a per-layer `window_size`; full attention layers default to `(-1, -1)`. Also plumbed `window_size` through the FA3 ring attention wrapper (`ring_fa3_varlen_func`) (2026-02-21)
- `orchestrator.token_batch_size` and `orchestrator.max_inflight_rollouts`: Added token-based batching via `token_batch_size` and explicit in-flight rollout control via `max_inflight_rollouts` (2026-02-23)
- `orchestrator.batch_size`: Now optional and mutually exclusive with `token_batch_size`. If neither is set, defaults to rollout mode with `batch_size=128` (2026-02-23)
- `inference.enable_expert_parallel`, `inference.all2all_backend`, and `inference.enable_eplb`: Added expert-parallel inference controls passed to vLLM as `--enable-expert-parallel`, `--all2all-backend`, and `--enable-eplb` (defaults: `False`, `"allgather_reducescatter"`, `False`) (2026-02-23)
- `rl_slurm`/`sft_slurm` entrypoints: Removed. SLURM submission is now handled by the unified `rl` and `sft` entrypoints. Add a `[slurm]` section to your config to submit via SLURM instead of running locally (2026-02-23)
- `inference_gpu_ids`/`trainer_gpu_ids`/`teacher_gpu_ids`: Removed from `RLConfig`. Replaced by a `[deployment]` section with `type = "single_node"` (fields: `num_train_gpus`, `num_infer_gpus`, `num_teacher_gpus`) or `type = "multi_node"` (fields: `num_train_nodes`, `num_infer_nodes`, `num_teacher_nodes`, `nodes_per_fsdp_group`). Default is `single_node` with 1 train GPU and 1 infer GPU (2026-02-23)
- `RLSLURMConfig`: Removed. Fields `job_name`, `num_train_nodes`, `num_infer_nodes`, `gpus_per_node`, `nodes_per_fsdp_group`, `project_dir`, `slurm_template`, `dry_run` are now under `[slurm]` and `[deployment]` in the unified `RLConfig` (2026-02-23)
- `SFTSLURMConfig`: Removed. Fields `job_name`, `num_nodes`, `gpus_per_node`, `nodes_per_fsdp_group`, `project_dir`, `slurm_template`, `dry_run` are now under `[slurm]` and `[deployment]` in `SFTConfig` (2026-02-23)
- `[deployment]` (RL): Added deployment configuration. `type = "single_node"` auto-derives contiguous GPU assignments from `num_infer_gpus`, `num_train_gpus`, `num_teacher_gpus`. `type = "multi_node"` requires `[slurm]` and uses `num_train_nodes`, `num_infer_nodes` (2026-02-23)
- `[deployment]` (SFT): Added deployment configuration. `type = "single_node"` with `num_gpus` (default: 1). `type = "multi_node"` with `num_nodes`, `nodes_per_fsdp_group`, `hf_hub_offline` (2026-02-23)
- `[slurm]` (RL): Added SLURM configuration with `job_name`, `project_dir`, `template_path`, `partition`, `dry_run`. When present, `uv run rl` generates and submits an sbatch script instead of running locally. The template is auto-selected based on the deployment type (2026-02-23)
- `[slurm]` (SFT): Added SLURM configuration with `job_name`, `project_dir`, `template_path`, `partition`, `dry_run`. When present, `uv run sft` generates and submits an sbatch script instead of running locally (2026-02-23)
- `hf_hub_offline` (RL/SFT SLURM): Removed. `HF_HUB_OFFLINE=1` is now hardcoded in the multi-node SLURM templates (2026-02-23)
- SLURM templates: Moved from `src/prime_rl/slurm/` to `src/prime_rl/templates/` and renamed to `single_node_rl.sbatch.j2`, `multi_node_rl.sbatch.j2`, `single_node_sft.sbatch.j2`, `multi_node_sft.sbatch.j2` (2026-02-23)
- Entrypoints: Moved the `rl` and `sft` entrypoints from `prime_rl.rl`/`prime_rl.sft` to `prime_rl.entrypoints.rl`/`prime_rl.entrypoints.sft`. No change to CLI usage (`uv run rl`, `uv run sft`) (2026-02-24)
- `output_dir` (RL): Changed from `Path | None` (default `None`) to `Path` (default `Path("outputs")`). The SLURM-specific validation that rejected the default has been removed; `output_dir` now works the same for local and SLURM runs (2026-02-24)
- `clean_output_dir`: Added to `RLConfig` and `SFTConfig` (default: `False`). Training now raises `FileExistsError` when `output_dir` contains checkpoints from a previous run and the run is not resuming. Set `clean_output_dir=true` to delete and start fresh, or set `ckpt.resume_step` to resume (2026-02-24)
- `clean`: Removed from `RLConfig`. The old `clean` flag (default: `True`) silently deleted logs, rollouts, and broadcasts on every local RL run. Superseded by the explicit `clean_output_dir` flag (2026-02-24)
- Config consolidation: All config modules moved into the `prime_rl.configs` subpackage. `utils/config.py` + `transport/config.py` → `configs/shared.py`; `trainer/config.py` + `trainer/rl/config.py` → `configs/trainer.py`; `trainer/sft/config.py` → `configs/sft.py`; `orchestrator/config.py` → `configs/orchestrator.py`; `inference/config.py` → `configs/inference.py`; `rl_config.py` → `configs/rl.py`. Class renames: `SFTTrainerConfig` → `SFTConfig`, `RLTrainerConfig` → `TrainerConfig`. Component prefixes dropped from orchestrator and inference config classes (e.g. `OrchestratorCheckpointConfig` → `CheckpointConfig`). TypeAlias renames: dropped the `Type` suffix (e.g. `LossConfigType` → `LossConfig`, `TransportConfigType` → `TransportConfig`); renamed the `LossConfig` class → `DefaultLossConfig`. No TOML key changes (2026-02-24)
- `trainer.enable_router_replay`: Added flag to enable router replay. If True, routed experts are returned in the batch. Requires `enable_return_routed_experts=True` in the inference config, or passing `--enable-return-routed-experts` to the vLLM server. Only supported for custom models (2026-02-22)
- `inference.enable_return_routed_experts`: Added flag to enable returning routed experts. Passed to vLLM as `--enable-return-routed-experts` (2026-02-22)
- `orchestrator.oversampling_factor`: Added rollout-only over-sampling config that resolves `max_inflight_rollouts = int(batch_size * oversampling_factor)` when `max_inflight_rollouts` is unset. Cannot be used with `token_batch_size` or together with an explicit `max_inflight_rollouts` (2026-02-25)
- `model.fused_lm_head_chunk_size`: Changed default value from 2048 to 8192 for RL training (2026-02-26)
- `inference.data_parallel_size_local` and `inference.data_parallel_rpc_port`: Added data-parallel node-local controls for vLLM, passed as `--data-parallel-size-local` and `--data-parallel-rpc-port` (defaults: `None`, `13345`) (2026-02-26)
- `dump_config`: Removed from `RLConfig`. Replaced by `dry_run` (see below) (2026-02-26)
- `slurm.dry_run`: Removed from `SlurmConfig`. Replaced by the top-level `dry_run` (see below) (2026-02-26)
- `dry_run`: Added to `RLConfig` and `SFTConfig` (default: `False`). When set, validates the config, writes resolved subconfigs to `output_dir/configs/`, and exits without starting any processes. Works the same for both local and SLURM runs (2026-02-26)
- Config output location: Resolved subconfigs are now always written to `output_dir/configs/` instead of `.pydantic_config/<uuid>/`. This applies to both local and SLURM entrypoints, and to both single-node and multi-node deployments (2026-02-26)
- SFT config filename: The resolved SFT trainer config is now written as `sft.toml` instead of `trainer.toml` (2026-02-26)
- `orchestrator.prime_monitor`: Extended `PrimeMonitorConfig` with `run_name` and `team_id` fields, plus auto-registration support: when `RUN_ID` is not set, the monitor now registers an external run on the platform automatically and streams live metrics, samples, and distributions. Authentication is read from `PRIME_API_KEY` or `~/.prime/config.json` (`prime login`) (2026-02-27)
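Several of the entries above introduce new TOML sections. As a rough illustration, a minimal RL config touching the `[deployment]`, `[slurm]`, and rollout-filter settings might look like the sketch below. This is a hedged sketch, not a tested config: the key names are taken from the changelog entries, while the exact nesting (notably the `[[orchestrator.filters]]` table path) and all values are assumptions.

```toml
# Hedged sketch based on the changelog entries above; values are illustrative.
dry_run = false           # validate, write resolved configs to output_dir/configs/, exit
clean_output_dir = false  # set to true to wipe stale checkpoints instead of raising FileExistsError
output_dir = "outputs"

[deployment]
type = "single_node"      # contiguous GPU assignments derived from the counts below
num_train_gpus = 1
num_infer_gpus = 1

[slurm]                   # if present, `uv run rl` submits an sbatch script instead of running locally
job_name = "example-run"
partition = "gpu"

# Orchestrator-side rollout filters for degenerate generations
[[orchestrator.filters]]
type = "gibberish"        # rare tokens at high entropy

[[orchestrator.filters]]
type = "repetition"       # high-confidence token streaks
```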