Commit 8ec8afa

add doc for save strategy

1 parent 4fac567 commit 8ec8afa

File tree

2 files changed: +14, -2 lines changed

docs/sphinx_doc/source/tutorial/trinity_configs.md

Lines changed: 7 additions & 1 deletion
@@ -160,7 +160,7 @@ model:
 - `model_path`: Path to the model being trained.
 - `critic_model_path`: Optional path to a separate critic model. If empty, defaults to `model_path`.
-- `max_model_len`: Maximum number of tokens in a sequence. It is recommended to set this value manually. If not set, it will default to `max_prompt_tokens` + `max_response_tokens`. However, if either `max_prompt_tokens` or `max_response_tokens` is not set, we will raise an error.
+- `max_model_len`: Maximum number of tokens in a sequence. It is recommended to set this value manually. If not specified, the system will attempt to set it to `max_prompt_tokens` + `max_response_tokens`. However, this requires both values to be already set; otherwise, an error will be raised.
 - `max_response_tokens`: Maximum number of tokens allowed in generated responses. Only for `chat` and `generate` methods in `InferenceModel`.
 - `max_prompt_tokens`: Maximum number of tokens allowed in prompts. Only for `chat` and `generate` methods in `InferenceModel`.
 - `min_response_tokens`: Minimum number of tokens allowed in generated responses. Only for `chat` and `generate` methods in `InferenceModel`. Default is `1`. It must be less than `max_response_tokens`.
@@ -405,6 +405,7 @@ trainer:
   trainer_type: 'verl'
   save_interval: 100
   total_steps: 1000
+  save_strategy: "unrestricted"
   trainer_config: null
   trainer_config_path: ''
 ```
@@ -413,6 +414,11 @@ trainer:
 - `trainer_type`: Trainer backend implementation. Currently only supports `verl`.
 - `save_interval`: Frequency (in steps) at which to save model checkpoints.
 - `total_steps`: Total number of training steps.
+- `save_strategy`: The parallel strategy used when saving the model. Defaults to `unrestricted`. The available options are as follows:
+  - `single_thread`: Only one thread across the entire system is allowed to save the model; saving tasks from different threads are executed sequentially.
+  - `single_process`: Only one process across the entire system is allowed to perform saving; multiple threads within that process can handle saving tasks in parallel, while saving operations across different processes are executed sequentially.
+  - `single_node`: Only one compute node across the entire system is allowed to perform saving; processes and threads within that node can work in parallel, while saving operations across different nodes are executed sequentially.
+  - `unrestricted`: No restrictions on saving operations; multiple nodes, processes, or threads are allowed to save the model simultaneously.
 - `trainer_config`: The trainer configuration provided inline.
 - `trainer_config_path`: The path to the trainer configuration file. Only one of `trainer_config_path` and `trainer_config` should be specified.

docs/sphinx_doc/source_zh/tutorial/trinity_configs.md

Lines changed: 7 additions & 1 deletion
@@ -160,7 +160,7 @@ model:
 - `model_path`: Path to the model being trained.
 - `critic_model_path`: Optional path to a separate critic model. If empty, defaults to `model_path`.
-- `max_model_len`: Maximum number of tokens in a single sequence supported by the model. If not set, it will attempt to default to `max_prompt_tokens + max_response_tokens`. However, if either `max_prompt_tokens` or `max_response_tokens` is not set, the code will raise an error.
+- `max_model_len`: Maximum number of tokens in a single sequence supported by the model. If not specified, the system will attempt to set it to `max_prompt_tokens` + `max_response_tokens`. This requires both values to be already set; otherwise, an error will be raised.
 - `max_prompt_tokens`: Maximum number of tokens allowed in the input prompt. Only effective for the `chat` and `generate` methods in `InferenceModel`.
 - `max_response_tokens`: Maximum number of tokens allowed in generated responses. Only effective for the `chat` and `generate` methods in `InferenceModel`.
 - `min_response_tokens`: Minimum number of tokens allowed in generated responses. Only effective for the `chat` and `generate` methods in `InferenceModel`.
@@ -405,6 +405,7 @@ trainer:
   trainer_type: 'verl'
   save_interval: 100
   total_steps: 1000
+  save_strategy: "unrestricted"
   trainer_config: null
   trainer_config_path: ''
 ```
@@ -413,6 +414,11 @@ trainer:
 - `trainer_type`: Trainer backend implementation. Currently only supports `verl`.
 - `save_interval`: Frequency (in steps) at which model checkpoints are saved.
 - `total_steps`: Total number of training steps.
+- `save_strategy`: The parallel strategy used when saving the model. Defaults to `unrestricted`. The available options are:
+  - `single_thread`: Only one thread in the entire system is allowed to save the model; saves from different threads are executed sequentially.
+  - `single_process`: Only one process in the entire system is allowed to perform saving; multiple threads within that process can save in parallel, while saves from different processes are executed sequentially.
+  - `single_node`: Only one compute node in the entire system is allowed to perform saving; processes and threads within that node can work in parallel, while saves from different nodes are executed sequentially.
+  - `unrestricted`: Saving is unrestricted; multiple nodes, processes, or threads may save the model simultaneously.
 - `trainer_config`: The trainer configuration provided inline.
 - `trainer_config_path`: Path to the trainer configuration file. Only one of `trainer_config_path` and `trainer_config` may be specified.
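
Putting the new field in context, a complete `trainer` section using a non-default strategy might look like the following (values illustrative, mirroring the snippet in the diff above):

```yaml
trainer:
  trainer_type: 'verl'
  save_interval: 100          # checkpoint every 100 steps
  total_steps: 1000
  save_strategy: "single_node"  # serialize saves across nodes; parallel within a node
  trainer_config: null
  trainer_config_path: ''
```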
418424
