Commit 75c9247
[Misc.] Roll back the initial std range. Use larger learning rate by default
1 parent e87d766 commit 75c9247

File tree

10 files changed: +12 −12 lines changed


README.md

Lines changed: 3 additions & 3 deletions

````diff
@@ -58,12 +58,12 @@ Here's an example of training a 340M FLA Transformer model with a LLaMA-like arc
 ```sh
 bash train.sh \
 --job.config_file flame/models/fla.toml \
---job.dump_folder exp/transformer-340M-4K-10B/batch1.seqlen65536.context4096.warmup1024.update1.steps20480.lr3e-4.cosine \
+--job.dump_folder exp/transformer-340M-4K-10B/batch1.seqlen65536.context4096.warmup1024.update1.steps20480.lr1e-3.cosine \
 --model.config configs/transformer_340M.json \
 --model.tokenizer_path fla-hub/transformer-1.3B-100B \
 --optimizer.name AdamW \
 --optimizer.eps 1e-15 \
---optimizer.lr 3e-4 \
+--optimizer.lr 1e-3 \
 --lr_scheduler.warmup_steps 1024 \
 --lr_scheduler.lr_min 0.1 \
 --lr_scheduler.decay_type cosine \
@@ -92,7 +92,7 @@ You can specify the number of GPUs by setting the environment variable `NGPU`, w
 **For single-GPU debugging, set `NGPU=1`.**
 
 We provide several [config files](https://github.com/fla-org/flame/tree/main/configs) for different models.
-By default, the learning rate is set to 3e-4 with a cosine scheduler. Other schedulers, such as WSD (wsd), are also supported.
+By default, the learning rate is set to 1e-3 with a cosine scheduler. Other schedulers, such as WSD (wsd), are also supported.
 
 **Key parameters:**
 - `--lr_scheduler.decay_ratio`: The proportion of the steps allocated to the decay phase. The learning rate will remain stable after the warmup period and only start decaying during the last `decay_ratio` portion of the total training steps, which is known as the Warmup-Stable-Decay (WSD) schedule.
````
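The `decay_ratio` semantics described in the README hunk above can be sketched as a small schedule function. This is a minimal illustration assuming linear warmup, a constant stable phase, and linear decay to `lr_min * base_lr` over the final `decay_ratio` fraction of steps; the function name and the exact decay shape are assumptions, not flame's actual implementation:

```python
def wsd_lr(step: int, total_steps: int, base_lr: float = 1e-3,
           warmup_steps: int = 1024, decay_ratio: float = 0.1,
           lr_min_ratio: float = 0.1) -> float:
    """Warmup-Stable-Decay sketch: warm up, hold, then decay during the
    last `decay_ratio` portion of training (not flame's exact code)."""
    decay_start = int(total_steps * (1 - decay_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps   # linear warmup
    if step < decay_start:
        return base_lr                         # stable phase
    # linear decay from base_lr down to lr_min_ratio * base_lr
    progress = (step - decay_start) / (total_steps - decay_start)
    return base_lr * (1 - progress * (1 - lr_min_ratio))
```

With the defaults above (matching the `train.sh` example: 20480 steps, 1024 warmup steps, `lr_min 0.1`), the rate holds at 1e-3 for 90% of training and only then decays.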

configs/delta_net_1B.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -11,7 +11,7 @@
 "hidden_act": "swish",
 "hidden_ratio": 4,
 "hidden_size": 2048,
-"initializer_range": 0.006,
+"initializer_range": 0.02,
 "intermediate_size": null,
 "model_type": "delta_net",
 "norm_eps": 1e-06,
```
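In Hugging Face-style configs, `initializer_range` is conventionally the standard deviation of the normal distribution used to initialize weight matrices, which is what the commit title's "initial std range" refers to; this commit rolls it back from 0.006 to the common default of 0.02. A minimal NumPy sketch of that convention (the exact init code in fla may differ):

```python
import numpy as np

def init_weight(shape, initializer_range: float = 0.02, seed: int = 0):
    """Sample a weight matrix from N(0, initializer_range^2), mirroring the
    usual HF-style _init_weights convention (a sketch, not fla's code)."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=initializer_range, size=shape)

w = init_weight((1024, 1024))  # empirical std lands near 0.02
```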

configs/delta_net_340M.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -9,7 +9,7 @@
 "hidden_act": "swish",
 "hidden_ratio": 4,
 "hidden_size": 1024,
-"initializer_range": 0.006,
+"initializer_range": 0.02,
 "intermediate_size": null,
 "model_type": "delta_net",
 "norm_eps": 1e-06,
```

configs/gla_340M.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -10,7 +10,7 @@
 "hidden_act": "swish",
 "hidden_ratio": 4,
 "hidden_size": 1024,
-"initializer_range": 0.006,
+"initializer_range": 0.02,
 "intermediate_size": null,
 "model_type": "gla",
 "num_heads": 4,
```

configs/gla_7B.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -10,7 +10,7 @@
 "hidden_act": "swish",
 "hidden_ratio": 4,
 "hidden_size": 4096,
-"initializer_range": 0.006,
+"initializer_range": 0.02,
 "intermediate_size": 11008,
 "model_type": "gla",
 "norm_eps": 1e-06,
```

configs/gsa_340M.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -12,7 +12,7 @@
 "hidden_act": "swish",
 "hidden_ratio": 4,
 "hidden_size": 1024,
-"initializer_range": 0.006,
+"initializer_range": 0.02,
 "intermediate_size": null,
 "model_type": "gsa",
 "num_heads": 4,
```

configs/hgrn2_340M.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -8,7 +8,7 @@
 "hidden_act": "swish",
 "hidden_ratio": 4,
 "hidden_size": 1024,
-"initializer_range": 0.006,
+"initializer_range": 0.02,
 "intermediate_size": null,
 "model_type": "hgrn2",
 "num_heads": 8,
```

configs/transformer_1B.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -8,7 +8,7 @@
 "hidden_act": "swish",
 "hidden_ratio": 4,
 "hidden_size": 2048,
-"initializer_range": 0.006,
+"initializer_range": 0.02,
 "intermediate_size": null,
 "max_position_embeddings": 8192,
 "model_type": "transformer",
```

configs/transformer_340M.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -6,7 +6,7 @@
 "fuse_norm": true,
 "hidden_act": "swish",
 "hidden_size": 1024,
-"initializer_range": 0.006,
+"initializer_range": 0.02,
 "max_position_embeddings": 8192,
 "model_type": "transformer",
 "num_heads": 16,
```

configs/transformer_7B.json

Lines changed: 1 addition & 1 deletion

```diff
@@ -7,7 +7,7 @@
 "hidden_act": "swish",
 "hidden_ratio": 4,
 "hidden_size": 4096,
-"initializer_range": 0.006,
+"initializer_range": 0.02,
 "intermediate_size": 14336,
 "model_type": "transformer",
 "norm_eps": 1e-06,
```
