docs/source/tutorials/config/config.md (77 additions, 18 deletions)
The `main_config` dictionary contains the main parameter settings for running the algorithm.
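Before going through the individual parameters, it may help to see the overall shape of such a configuration in code. The following is a minimal sketch, assuming the usual pattern of building `main_config` as an `easydict.EasyDict` with separate `env` and `policy` sub-dictionaries; it is illustrative only, and the two subsections below describe the fields that typically go into each part.

```python
from easydict import EasyDict

# Minimal illustrative skeleton of a main_config: the `env` and `policy`
# sub-dictionaries hold the parameters described in Sections 1.1 and 1.2.
main_config = EasyDict(dict(
    env=dict(
        # environment-related settings (Section 1.1)
    ),
    policy=dict(
        # policy, model, and training settings (Section 1.2)
    ),
))
```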
### 1.1 Main Parameters in the `env` Part
- `env_id`: Specifies the environment to be used.
- `observation_shape`: The dimensions of the environment's observations.
- `collector_env_num`: The number of parallel environments used to collect data in the collector.
- `evaluator_env_num`: The number of parallel environments used to evaluate policy performance in the evaluator.
- `n_evaluator_episode`: The total number of episodes run across all environments in the evaluator.
- `collect_max_episode_steps`: The maximum number of steps allowed per episode during data collection.
- `eval_max_episode_steps`: The maximum number of steps allowed per episode during evaluation.
- `frame_stack_num`: The number of consecutive frames stacked together as input.
- `gray_scale`: Whether to use grayscale images.
- `scale`: Whether to scale the input data.
- `clip_rewards`: Whether to clip reward values.
- `episode_life`: If True, an episode ends when the agent loses a life; otherwise, the episode only ends when all lives are lost.
- `env_type`: The type of environment.
- `frame_skip`: The number of frames for which the same action is repeated.
- `stop_value`: The target score at which training stops.
- `replay_path`: The path where replays are stored.
- `save_replay`: Whether to save replay videos.
- `channel_last`: Whether to place the channel dimension last in the input data.
- `warp_frame`: Whether to crop each image frame.
- `manager`: Specifies the type of environment manager, mainly used to control how environments are parallelized.
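To make the `env` parameters above concrete, here is a hypothetical Atari-style `env` block; every value (the `env_id`, the observation shape, the step limits, and so on) is a placeholder chosen for illustration, not a recommended setting.

```python
# Hypothetical `env` block of main_config for an Atari game.
# All values are illustrative placeholders, not tuned settings.
env_cfg = dict(
    env_id='PongNoFrameskip-v4',        # which environment to run
    observation_shape=(4, 96, 96),      # stacked grayscale frames
    collector_env_num=8,                # parallel envs for data collection
    evaluator_env_num=3,                # parallel envs for evaluation
    n_evaluator_episode=3,              # total episodes across evaluator envs
    collect_max_episode_steps=int(1.08e5),
    eval_max_episode_steps=int(1.08e5),
    frame_stack_num=4,
    gray_scale=True,
    scale=True,
    clip_rewards=True,
    episode_life=True,
    frame_skip=4,
    stop_value=int(1e6),                # stop training at this score
    save_replay=False,
    channel_last=False,
    warp_frame=True,
    manager=dict(shared_memory=False),  # how envs are parallelized
)
```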
### 1.2 Main Parameters in the `policy` Part
- `model`: Specifies the neural network model used by the policy.
- `model_type`: The type of model to use.
- `observation_shape`: The dimensions of the observation space.
- `action_space_size`: The size of the action space.
- `continuous_action_space`: Whether the action space is continuous.
- `num_res_blocks`: The number of residual blocks in the model.
- `downsample`: Whether to downsample the input.
- `norm_type`: The type of normalization used.
- `num_channels`: The number of channels in the convolutional layers (the number of features extracted).
- `support_scale`: The range of the value support set (from `-support_scale` to `support_scale`).
- `bias`: Whether to use bias terms in the layers.
- `discrete_action_encoding_type`: How discrete actions are encoded.
- `self_supervised_learning_loss`: Whether to use a self-supervised learning loss (as in EfficientZero).
- `image_channel`: The number of channels in the input image.
- `frame_stack_num`: The number of frames stacked together.
- `gray_scale`: Whether to use grayscale images.
- `use_sim_norm`: Whether to apply SimNorm to the latent state.
- `use_sim_norm_kl_loss`: Whether the observation loss on the SimNorm-normalized latent state uses a KL-divergence loss; typically used together with SimNorm.
- `res_connection_in_dynamics`: Whether to use residual connections in the dynamics model.
- `learn`: Configuration for the learning process.
- `learner`: Learner configuration (dictionary type), including training iterations and checkpoint saving.
- `resume_training`: Whether to resume training.
- `collect`: Configuration for the collection process.
- `collector`: Collector configuration (dictionary type), including its type and print frequency.
- `eval`: Configuration for the evaluation process.
- `evaluator`: Evaluator configuration (dictionary type), including evaluation frequency, the number of episodes to evaluate, and the path for saving images.
- `other`: Other configurations.
- `replay_buffer`: Replay buffer configuration (dictionary type), including buffer size, the maximum usage and staleness of experiences, and parameters for throughput control and monitoring.
- `cuda`: Whether to use CUDA (GPU) for training.
- `multi_gpu`: Whether to enable multi-GPU training.
- `use_wandb`: Whether to use Weights & Biases (wandb) for logging.
- `mcts_ctree`: Whether to use the C++ implementation of Monte Carlo Tree Search.
- `collector_env_num`: The number of collection environments.
- `evaluator_env_num`: The number of evaluation environments.
- `env_type`: The type of environment (board game or non-board game).
- `action_type`: The type of action space (fixed or other).
- `game_segment_length`: The length of a game segment, the basic unit used during collection.
- `cal_dormant_ratio`: Whether to calculate the ratio of dormant neurons.
- `use_augmentation`: Whether to use data augmentation.
- `augmentation`: The data augmentation methods to use.
- `update_per_collect`: The number of model updates after each data collection phase.
- `batch_size`: The batch size used for training updates.
- `optim_type`: The type of optimizer.
- `reanalyze_ratio`: The reanalyze ratio, which controls the probability of performing reanalysis.
- `reanalyze_noise`: Whether to introduce noise during MCTS reanalysis (for exploration).
- `reanalyze_batch_size`: The batch size used for reanalysis.
- `reanalyze_partition`: The portion of the replay buffer used for reanalysis. For example, 1 means reanalyze batches are sampled from the whole buffer, while 0.5 means they are sampled from the first half of the buffer.
- `random_collect_episode_num`: The number of randomly collected episodes, which provides initial data for exploration.
- `eps`: Parameters for epsilon-greedy exploration control, including whether to use epsilon-greedy, the update schedule, start/end values, and the decay rate.
- `piecewise_decay_lr_scheduler`: Whether to use piecewise constant learning rate decay.
- `learning_rate`: The initial learning rate.
- `num_simulations`: The number of simulations used in the MCTS algorithm.
- `reward_loss_weight`: The weight of the reward loss.
- `policy_loss_weight`: The weight of the policy loss.
- `value_loss_weight`: The weight of the value loss.
- `ssl_loss_weight`: The weight of the self-supervised learning loss.
- `n_episode`: The number of episodes collected by the parallel collector.
- `eval_freq`: The frequency of policy evaluation (in terms of training steps).
- `replay_buffer_size`: The capacity of the replay buffer.
- `target_update_freq`: How often the target network is updated.
- `grad_clip_value`: The value at which gradients are clipped.
- `discount_factor`: The discount factor.
- `td_steps`: The number of temporal-difference (TD) steps.
- `num_unroll_steps`: The number of unroll (rollout) steps during MuZero training.
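Similarly, the sketch below shows a hypothetical `policy` block covering a representative subset of the fields listed above. The numeric values and string options are placeholders for illustration; consult the example configuration files shipped with the repository for settings that are actually used in experiments.

```python
# Hypothetical `policy` block covering a subset of the fields above.
# Every value is an illustrative placeholder, not a recommended setting.
policy_cfg = dict(
    model=dict(
        observation_shape=(4, 96, 96),
        action_space_size=6,
        frame_stack_num=4,
        downsample=True,
        num_res_blocks=1,
        num_channels=64,
        norm_type='BN',
        discrete_action_encoding_type='one_hot',
        self_supervised_learning_loss=True,   # EfficientZero-style SSL loss
        support_scale=300,
    ),
    cuda=True,
    mcts_ctree=True,                  # C++ MCTS implementation
    env_type='not_board_games',
    action_type='fixed_action_space',
    game_segment_length=400,
    use_augmentation=True,
    update_per_collect=1000,
    batch_size=256,
    optim_type='SGD',
    piecewise_decay_lr_scheduler=True,
    learning_rate=0.2,
    num_simulations=50,               # MCTS simulations per move
    reanalyze_ratio=0.0,
    ssl_loss_weight=2,
    n_episode=8,
    eval_freq=int(2e3),
    replay_buffer_size=int(1e6),
    td_steps=5,
    num_unroll_steps=5,
    discount_factor=0.997,
)
```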
Two groups of frequently adjusted parameters are also specifically highlighted here, annotated with comments: