
[Module] Refactor init_weights to config-based param_init system#2633

Open
fegin wants to merge 17 commits into gh/fegin/103/base from gh/fegin/103/head

Conversation

@fegin
Contributor

@fegin fegin commented Mar 19, 2026

Stack from ghstack (oldest at bottom):

Motivation and design decisions

Parameter initialization should be configurable without modifying module code. Users reuse provided helpers (e.g., init_trunc_normal, init_zeros) to customize how parameters are initialized. Parameters are identified by their FQNs, and init_by_regex lets users map regex patterns to initializers — so a single declarative dict at the model config level controls all parameter init.
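A minimal sketch of what such a regex-keyed dispatch could look like (the signatures and error behavior here are illustrative assumptions, not the PR's actual helpers):

```python
import re
from typing import Any, Callable

# A simple initializer takes just the parameter; names here are illustrative.
ParamInitializer = Callable[[Any], None]

def init_by_regex(
    patterns: dict[str, ParamInitializer],
) -> Callable[[str, Any], None]:
    """Build an FQN-keyed initializer: the first matching pattern wins."""
    compiled = [(re.compile(p), fn) for p, fn in patterns.items()]

    def named_init(fqn: str, param: Any) -> None:
        for pat, fn in compiled:
            if pat.search(fqn):
                fn(param)
                return
        raise KeyError(f"no initializer pattern matched {fqn!r}")

    return named_init
```

A model config could then declare a single dict such as `{r"\.weight$": init_trunc_normal, r"\.bias$": init_zeros}` once at the top level, and every parameter is dispatched by its FQN.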

In theory, we could just apply param_init at the root module and be done — no recursion needed. So why does init_states recurse? The main motivation is that buffer initialization is usually inherently coupled with module internals and not user-configurable. Since buffers require per-module logic via init_self_buffers, we provide a recursive init_states that handles both: it walks the module tree, initializing parameters via the configurable param_init and buffers via module-specific overrides, in a single pass.
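The single-pass walk described above can be sketched with a toy module (hypothetical shape: real modules hold tensors and configs, not strings):

```python
class Module:
    """Toy stand-in for the framework Module: named params plus children."""
    param_init = None  # optional per-module override of the configurable init

    def __init__(self):
        self._params: dict[str, object] = {}
        self._children: dict[str, "Module"] = {}

    def init_self_buffers(self) -> None:
        """Per-module buffer logic; overridden where a module owns buffers."""

    def init_states(self, *, param_init=None, prefix: str = "") -> None:
        # A module's own param_init (if any) takes precedence; otherwise the
        # one inherited from an ancestor is used, so init logic can be
        # declared once at the model config level.
        param_init = self.param_init or param_init
        for name, param in self._params.items():
            param_init(prefix + name, param)  # configurable, FQN-keyed
        self.init_self_buffers()              # module-specific, not configurable
        for name, child in self._children.items():
            child.init_states(param_init=param_init, prefix=f"{prefix}{name}.")
```

One recursion initializes parameters (via the configurable, FQN-keyed callable) and buffers (via per-module overrides) together.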

Summary

  • Remove init_weights.
  • init_states auto-recurses the module tree, then calls init_self_parameters and init_self_buffers on each module.
  • param_init (a NamedInitializer callable on Module.Config) uses regex-matched FQN patterns to map parameters to initializers. Child modules without their own param_init delegate up the parent chain, so init logic is defined once at the model config level.
  • make_decoder_param_init provides shared patterns for decoder-based models, reused by Llama3, Llama4, Qwen3, and DeepSeek V3. Model-specific extensions (Flux DiT-style, GPT-OSS MoE biases) are composed via dict merge.
  • Removes init_mean/init_std fields from Linear.Config and Embedding.Config — init parameters are now expressed entirely through param_init patterns.
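The dict-merge composition mentioned in the summary could look like this (illustrative pattern strings and initializer names, not the PR's actual dict):

```python
def make_decoder_param_init(std: float = 0.02) -> dict[str, str]:
    """Shared FQN-pattern -> initializer mapping for decoder-style models.

    Initializer names stand in for the real callables (e.g. init_trunc_normal).
    """
    return {
        r"tok_embeddings\.weight$": f"trunc_normal(std={std})",
        r"attention\..*\.weight$":  f"trunc_normal(std={std})",
        r"norm\.weight$":           "ones",
    }

# A model-specific config extends the shared dict via merge; later keys win,
# so a model can both add new patterns and override shared ones.
gpt_oss_param_init = {
    **make_decoder_param_init(),
    r"experts\..*\.bias$": "zeros",  # e.g. MoE biases, model-specific
}
```

Because the merge is a plain dict operation, each model's deviation from the shared decoder patterns stays a one-liner in its config.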

Verification

Using loss_compare.py, with the code changed so that the initialization order is exactly the same as on the main branch, we get Llama3 and Qwen3 loss parity:

  ┌─────────────┬───────┬───────────────────┐
  │    Model    │ Steps │      Result       │
  ├─────────────┼───────┼───────────────────┤
  │ Llama3      │ 100   │ Bitwise identical │
  ├─────────────┼───────┼───────────────────┤
  │ Qwen3       │ 100   │ Bitwise identical │
  └─────────────┴───────┴───────────────────┘

However, the current code does not pin that fixed order, because doing so would make the code ugly. A different loss is expected if the initialization order changes.

[ghstack-poisoned]
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 19, 2026
fegin added a commit that referenced this pull request Mar 19, 2026
**NOT READY TO REVIEW**
**NOT READY TO REVIEW**

**Motivation and design decisions**

Parameter initialization should be configurable without modifying module code.  Users reuse provided helpers (e.g., `init_trunc_normal`, `init_zeros`) to customize how parameters are initialized. Parameters are identified by their FQNs, and `init_by_regex` lets users map regex patterns to initializers — so a single declarative dict at the model config level controls all parameter init.

In theory, we could just apply param_init at the root module and be done — no recursion needed. So why does `init_states` recurse? The main motivation is that buffer initialization is usually inherently coupled with module internals and not user-configurable. Since buffers require per-module logic via `init_self_buffers`, we provide a recursive init_states that handles both: it walks the module tree, initializing parameters via the configurable param_init and buffers via module-specific overrides, in a single pass.

**Summary**
- Remove `init_weights`.
- `init_states` auto-recurses the module tree, then calls `init_self_parameters` and `init_self_buffers` on each module.
- `param_init` (a NamedInitializer callable on Module.Config) uses regex-matched FQN patterns to map parameters to initializers. Child modules without their own `param_init` delegate up the parent chain, so init logic is defined once at the model config level.
- `make_decoder_param_init` provides shared patterns for decoder-based models, reused by Llama3, Llama4, Qwen3, and DeepSeek V3. Model-specific extensions (Flux DiT-style, GPT-OSS MoE biases) are composed via dict merge.
- Removes init_mean/init_std fields from Linear.Config and Embedding.Config — init parameters are now expressed entirely through param_init patterns.


ghstack-source-id: 024899c
Pull-Request: #2633
@fegin fegin marked this pull request as draft March 19, 2026 18:04
[ghstack-poisoned]
fegin added a commit that referenced this pull request Mar 19, 2026
ghstack-source-id: 11120b6
Pull-Request: #2633
[ghstack-poisoned]
fegin added a commit that referenced this pull request Mar 19, 2026
ghstack-source-id: 6871155
Pull-Request: #2633
[ghstack-poisoned]
fegin added a commit that referenced this pull request Mar 19, 2026
ghstack-source-id: 803f279
Pull-Request: #2633
[ghstack-poisoned]
fegin added a commit that referenced this pull request Mar 19, 2026
**Verification**

With loss_compare.py:

  ┌─────────────┬───────┬───────────────────┐
  │    Model    │ Steps │      Result       │
  ├─────────────┼───────┼───────────────────┤
  │ Llama3      │ 10    │ Bitwise identical │
  ├─────────────┼───────┼───────────────────┤
  │ Qwen3       │ 10    │ Bitwise identical │
  ├─────────────┼───────┼───────────────────┤
  │ DeepSeek V3 │ 10    │ Bitwise identical │
  ├─────────────┼───────┼───────────────────┤
  │ GPT-OSS     │ 10    │ Bitwise identical │
  └─────────────┴───────┴───────────────────┘

ghstack-source-id: e6704e8
Pull-Request: #2633
[ghstack-poisoned]
fegin added a commit that referenced this pull request Mar 20, 2026

ghstack-source-id: 440e49d
Pull-Request: #2633
[ghstack-poisoned]
fegin added a commit that referenced this pull request Mar 20, 2026

ghstack-source-id: 1bcce2b
Pull-Request: #2633
@fegin fegin changed the title [WIP][Module] Refactor init_weights to config-based param_init system [Module] Refactor init_weights to config-based param_init system Mar 20, 2026
@fegin fegin marked this pull request as ready for review March 20, 2026 02:33
[ghstack-poisoned]
fegin added a commit that referenced this pull request Mar 20, 2026

ghstack-source-id: 884689c
Pull-Request: #2633
except AttributeError:
# field(init=False) not yet set, ignore this field.
continue
if callable(val) and not dataclasses.is_dataclass(val):
Contributor

oh, can a dataclass ever be callable?

Contributor

btw, https://fburl.com/code/8t7xhgna is making Config.param_init a config, not a function, so that it should still be serializable. I think we should do this too, to make it reproducible (given a code version).

Right now parallelize_fn and other functions in ModelSpec are not serializable, which should be fixed later.

Contributor Author

The config is a dictionary that maps from regex to function. So from a serializability perspective, the requirement is the same?

buffer_device = buffer_device or self.freqs_cis.device
if self.rope is not None:
self.rope.init_weights(buffer_device=buffer_device)
# RoPE's _init_self_buffers was already called by auto-recursion
Contributor

Not sure about this, but it's possible that it is set on meta device, when buffer_device here is cuda? Shall we add an assertion for device?

Contributor Author

Not quite sure I understand the question. If the meta device is used, then everything is on the meta device. buffer_device is only used when the trainer parallelizes the model and re-initializes the states. The comment here just explains the order, not the device.

An assertion is good, but I want to understand whether I'm missing something.

# Type alias for simple parameter initializers: (param) -> Any
# Uses Any return type because nn.init.* functions return Tensor,
# but the return value is always ignored by the dispatch layer.
ParamInitializer = Callable[[nn.Parameter], Any]
Contributor

If it's only used in common/param_init.py, can we put it there for now?

Contributor Author

@fegin fegin Mar 23, 2026

That would be inconsistent: we would have ParamInitializer in common/param_init.py but NamedParamInitializer here. Ideally, we should put both in common/param_init.py, but I cannot do that due to the circular import.

# and the prefix resets (FQNs are relative to the owner).
if self.param_init is not None:
param_init = self.param_init
param_prefix = ""
Contributor

Not sure about this.
E.g. let's say I created a new module to plug in each layer, for which I want to use depth init.

  • If I provide a config.param_init to this module, I would lose the layer id because of this line
  • Of course, I can change the overall decoder init. But if that's the case, what's the point of supporting param_init on each module? -- we could always call get_default_decoder_init and modify the dictionary.

Comment on lines +71 to +72
if self.param_init is not None:
instance.param_init = self.param_init
Contributor

we could do this in constructor, curious why we need to do it here

Contributor Author

Because we don't have a common Module.__init__, users who define a new Module need to remember to do the assignment themselves. An alternative is to define a common Module.__init__; I thought about that case, but it requires some rework of the module inheritance hierarchy, and users would then have to call super().__init__(cfg, *args, **kwargs). The latter constraint is okay, I guess, but the former needs some time to verify. If we do want to change this, we can do another round of changes to the Module hierarchy later.

def init_states(
self,
*,
param_init: NamedParamInitializer | None = None,
Contributor

This seems different from https://fburl.com/code/ia9w7w5i

  • over there, if the param_init is missing, the lookup relies on the parent
  • here, the parent's param_init is sent to all children. I think this leaks information from global to local modules, so it should be avoided in general.

Contributor Author

I'm not sure, though. Parent modules already hold the param_init; is that considered leaking from global to local? I didn't do the parent walk because, unlike the internal code, where a Node has a parent, our Module has no such implementation, so I deliberately didn't include it. Adding a parent reference would also create a circular dependency and make GC slower. The latter shouldn't be a real concern, since we never GC modules; I'm just noting this minor issue.

[ghstack-poisoned]
fegin added a commit that referenced this pull request Mar 23, 2026

ghstack-source-id: ccd5db5
Pull-Request: #2633
@fegin
Contributor Author

fegin commented Mar 23, 2026

I changed the parameter init to an ancestor-lookup style. But the comment reflects an issue caused by the circular reference, so the original concern about the parent reference is legitimate. This needs some careful handling, but the code looks cleaner for each inherited init_states.

The typing of param_init is still unchanged. As long as there is an anonymous function (which the internal code also uses), serialization is not sound. DepthScaledTruncNormal and RegexInitializer are serializable if we correctly implement to_dict and from_dict, and I think this holds even if we change param_init to a config. Correct me if I'm wrong.

@tianyu-l

[ghstack-poisoned]
fegin added a commit that referenced this pull request Mar 23, 2026

ghstack-source-id: 4427985
Pull-Request: #2633
fegin added a commit that referenced this pull request Mar 23, 2026
ghstack-source-id: ccfa49b
Pull-Request: #2633
fegin added a commit that referenced this pull request Mar 24, 2026
ghstack-source-id: 8ec94ee
Pull-Request: #2633
fegin added a commit that referenced this pull request Mar 24, 2026
ghstack-source-id: d3cf966
Pull-Request: #2633
fegin added a commit that referenced this pull request Mar 25, 2026
ghstack-source-id: 8006560
Pull-Request: #2633
fegin added a commit that referenced this pull request Mar 25, 2026
ghstack-source-id: 8e8a34a
Pull-Request: #2633
@tianyu-l
Contributor

@claude review this pull request. In particular, please comment on where is best to expand general transformer layer config into per-layer configs. I think in general there are three options:

  1. There is no per-layer config. The TransformerBlock constructor has the logic to create submodules. This is the current behavior before this PR.
  2. The model config class has a method to expand into per-layer config. This is the behavior in the current commit of the PR.
  3. Alternatively, the model constructor only accepts already-expanded per-layer config. The (optional) logic of "creating the per-layer config based on a general config using some per-model logic" can be put outside the model code, e.g. in config_registry.py.

@claude

claude bot commented Mar 26, 2026

Claude finished @tianyu-l's task in 2m 50s


PR Review: Refactor init_weights to config-based param_init system

  • Read core protocol changes (protocols/module.py, protocols/model.py)
  • Read param_init infrastructure (models/common/param_init.py, config/configurable.py)
  • Read decoder changes (models/common/decoder.py)
  • Read model-specific changes (llama3, llama4, qwen3, deepseek_v3, flux, gpt_oss)
  • Read model files (model.py for each)
  • Read test changes
  • Read experiment changes
  • Provide overall review and analysis of per-layer config expansion approaches

Overall Assessment

This is a substantial and well-motivated refactor that centralizes parameter initialization into a declarative, config-based system. The design cleanly separates "what to initialize" (config-level `param_init` dicts) from "how to recurse" (the `init_states` mechanism). The elimination of `init_weights` from individual modules and the removal of `init_mean`/`init_std` from `Linear.Config`/`Embedding.Config` are good simplifications.

The numeric verification (bitwise-identical loss for Llama3 and Qwen3 over 100 steps) is reassuring, though only 2 of the 6+ models were verified. The Flux and GPT-OSS models have quite different init patterns, so numeric parity for those would strengthen confidence.


Analysis: Where to Expand Per-Layer Config

This is the core design question. Here is my analysis of the three options:

Option 1: No per-layer config; TransformerBlock.__init__ has the construction logic.
This is the status quo on main. Layer-level decisions (MoE interleaving, iRoPE `use_rope=False`, depth-scaled init) are computed inside `TransformerBlock.__init__` at build time. The upside is simplicity — one template config, no expansion pass. The downside is that the config object doesn't fully describe what was built: you can't serialize or inspect the per-layer config after construction, and `param_init` must embed `PerLayer` / depth-dependent logic that gets resolved lazily. This also means the constructor mixes structural branching logic (e.g., branching on `moe_enabled` to build either FFN or MoE) with parameter concerns.

Option 2 (current PR): The model config class has an expand() method that produces per-layer configs.
Decoder.Config.expand() deep-copies the template layer config N times, calls _expand_layer() for structural changes (MoE interleaving, iRoPE), and resolves PerLayer markers. This happens between update_from_config() and build(), so by the time the model constructor runs, each layer has a fully-resolved, self-contained config. The model constructor simply iterates config.layers and calls layer_config.build(...).

Advantages:

  • Inspectability: After expand(), the full per-layer config tree is materialized. You could serialize it, compare it, or log it.
  • Clean constructors: TransformerBlock.__init__ receives a config that already has the correct feed_forward=None or moe=None — no structural branching needed in the constructor itself (though the current code still has both paths in __init__ for backwards compat, e.g. Llama4TransformerBlock.__init__ still checks self.moe_enabled).
  • Composability: _expand_layer() is a small, overridable hook. Each model only overrides what's different (Llama3 is identity, Llama4 handles iRoPE + MoE interleaving, DeepSeek handles dense vs MoE layers).
  • param_init resolution: PerLayer markers are resolved to concrete dicts during expand(), so by build time every sub-config has a plain dict[str, Callable]. No lazy resolution needed at init time.

Disadvantages:

  • Deep-copy cost: N deep copies of a potentially large config tree. For a 126-layer 405B model, this is 126 copies. The cost is trivial compared to model construction, but the memory footprint of the config list is non-zero.
  • Config list in config: Decoder.Config.layers: list | None = None is an unusual pattern — it's a field(init=False)-like slot populated by expand(). The to_dict() method explicitly skips lists to avoid serializing this.
  • Dual-path construction: Decoder.__init__ has both if config.layers is not None and else paths (decoder.py:128-137), which is a code smell. If expand() is always called, the fallback path should probably be removed or guarded with an assertion.

Option 3: The model constructor only accepts already-expanded per-layer configs; expansion logic lives in config_registry.py or a standalone function.
This pushes the expansion entirely outside the model code. The model config would always have a layers: list[TransformerBlock.Config] (required), and the config_registry.py would call an expansion utility to produce it.

Advantages:

  • Strict separation: Model code never knows about expansion. It receives a list and iterates it.
  • Flexibility: Different expansion strategies (or no expansion at all, e.g., manually specifying each layer) can be plugged in without touching model code.
  • No expand() on config: The config class is purely declarative data with no behavior.

Disadvantages:

  • Loss of encapsulation: Model-specific expansion logic (e.g., Llama4's iRoPE + MoE interleaving) would live in config_registry.py rather than in the model's config class. This breaks the principle that model-specific knowledge should be co-located with the model. The config_registry.py files would grow substantially.
  • Template convenience: You lose the "specify one template, get N layers" ergonomic. Every config registration would need an explicit expansion call.
  • Boilerplate: Each model's config_registry.py would need to import and call an expansion function, adding repetitive setup code.

My recommendation: Option 2 is the best fit for this codebase, with a minor refinement. The expand() method on the config class is the right place for per-model structural expansion because:

  1. It co-locates model-specific knowledge (which layers are MoE, which use RoPE) with the model definition, respecting the CLAUDE.md guideline "Model-specific code → the specific model folder."
  2. It gives you a clean materialized config tree before build(), enabling inspection and serialization.
  3. The _expand_layer() hook pattern scales well — Llama3 (identity), Llama4 (iRoPE + MoE interleaving), DeepSeek (dense-to-MoE transition), Qwen3 (MoE flag) all override it cleanly.

The refinement I'd suggest: consider making expand() mandatory (remove the fallback path in Decoder.__init__ that builds from config.layer directly). This eliminates the dual-path construction and makes the contract explicit — build() always operates on expanded configs.


Specific Code Review Findings

1. Dual-path construction in Decoder.__init__ should be consolidated

torchtitan/models/common/decoder.py:128-137:

if config.layers is not None:
    for i, layer_config in enumerate(config.layers):
        self.layers[str(i)] = layer_config.build(...)
else:
    for layer_id in range(config.n_layers):
        self.layers[str(layer_id)] = config.layer.build(...)

If expand() is always called before build(), the else branch is dead code. If it's intentionally kept for backwards compatibility (e.g., experiments that don't call expand()), this should have a comment or emit a warning.

2. Llama4TransformerBlock.__init__ still has structural branching that duplicates _expand_layer

torchtitan/models/llama4/model.py:80-107: The constructor still computes self.moe_enabled and does if moe_enabled: branching, even though _expand_layer() already resolves feed_forward=None or moe=None. After expansion, exactly one of config.feed_forward / config.moe is None, so the constructor could simply check which is not None. The same applies to Qwen3TransformerBlock.__init__ and DeepSeekV3TransformerBlock.__init__.

This is a consequence of still supporting the un-expanded path. If expand() becomes mandatory, these constructors simplify significantly.

3. _expand_layer in Llama4 duplicates runtime logic

torchtitan/models/llama4/model.py:144-161: The _expand_layer method recomputes moe_enabled and iRoPE decisions that are also computed in __init__. If expansion becomes the canonical path, the __init__ logic should defer to the config rather than recomputing.

4. PerLayer + lambdas prevent config serialization

The PerLayer(lambda layer_id: {...}) pattern used throughout (e.g., llama3/__init__.py:43-48) captures closures that aren't serializable. The to_dict() method in configurable.py:85-97 works around this by skipping callables, PerLayer, and lists. This means the expanded config can't be round-tripped through serialization. tianyu-l's review comment about making param_init a config rather than a function is well-taken — this is a real limitation for reproducibility.
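The serialization limitation is easy to demonstrate in isolation; this snippet shows a generic Python fact about closures, not code from the PR:

```python
import pickle

# A PerLayer-style closure: per-layer init hyperparameters derived from layer_id.
per_layer_std = lambda layer_id: {"std": 0.02 / (layer_id + 1)}

try:
    pickle.dumps(per_layer_std)
    serializable = True
except (pickle.PicklingError, AttributeError, TypeError):
    serializable = False
# Lambdas (and closures generally) can't be pickled by reference, so a config
# tree holding PerLayer(lambda ...) markers can't be round-tripped with pickle.
```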

5. to_dict() is overly aggressive about skipping

torchtitan/config/configurable.py:85-106: The method skips all lists (isinstance(val, list)), any dict with callable values, and anything with a .resolve attribute. This is very broad — it would skip legitimate list-typed config fields that happen to exist. A more targeted approach (e.g., only skip layers or fields of known non-serializable types) would be safer.

6. Flux expand() is very different from decoder expand()

torchtitan/models/flux/model/model.py:140-227: Flux's expand() manually walks every sub-config and assigns param_init using dataclasses.replace. This is verbose but works. The contrast with the decoder pattern (which uses PerLayer + resolve_per_layer) is jarring. It would be worth a comment explaining why Flux uses a different pattern (DiT-style init is role-based rather than depth-based, so PerLayer isn't the right abstraction).

7. VLM experiment _set_param_init_recursive is a useful escape hatch

torchtitan/experiments/vlm/model/model.py:55-74: This function walks the tree and sets _param_init on any Module that doesn't have one. It's a pragmatic solution for experiment modules that don't use the config system. Worth noting in the docstring that this is an experiment-only pattern.

8. Weight tying + skip_param_init interaction

torchtitan/models/llama3/model.py:137-149: The init_states override re-ties weights before calling super().init_states(). The tok_embeddings config uses skip_param_init and the output config has a real initializer. This means output.weight gets initialized, and tok_embeddings.weight (which is the same tensor) is skipped. The ordering dependency here is subtle — if auto-recursion visits tok_embeddings after output, the skip is correct; if before, the skip runs first and then output's init overwrites. The current BFS order in init_states (queue.pop() pops from the end, so it's LIFO / DFS on children) means output comes after tok_embeddings in iteration, which is correct. But this ordering is fragile — it depends on module registration order in __init__. Consider adding an assertion or comment about this dependency.
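The ordering concern can be shown with plain Python; a list stands in for the tied tensor, the FQNs mirror the ones above, and the loop is a toy assumption about visit order:

```python
shared = [0.0, 0.0]  # one storage reachable under two FQNs (weight tying)
params = {"tok_embeddings.weight": shared, "output.weight": shared}

def skip_param_init(p):   # tok_embeddings: explicitly skipped
    pass

def real_init(p):         # output: the actual initializer
    p[:] = [1.0, 1.0]

inits = {"tok_embeddings.weight": skip_param_init, "output.weight": real_init}

# Iteration order = module registration order. A pure skip is a no-op either
# way, so here output's init always lands; the fragility appears once both
# FQNs carry real initializers, where whichever is visited last silently wins.
for fqn, p in params.items():
    inits[fqn](p)
```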

9. Missing expand() override for Llama3

torchtitan/models/llama3/model.py: Llama3Model.Config doesn't override expand() at all — it inherits from Decoder.Config directly. But Decoder.Config.expand() calls self._expand_layer() which is a no-op for Llama3 (identity return). This is clean, but the expand() base implementation resolves PerLayer markers for all models, which is good.

10. buffer_device assertion in Decoder._init_self_buffers

torchtitan/models/common/decoder.py:156-159:

assert buffer_device is None or buffer_device.type != "meta", (...)

This is an internal invariant check (programmer error), so assert is correct here. But buffer_device is None would pass the assertion — is that intentional? If buffer_device=None reaches this point (e.g., init_states called without buffer_device and freqs_cis is on meta device), the assertion passes but the RoPE rebuild below would create buffers on an unintended device.


Summary

This is a well-designed refactor that makes parameter initialization more declarative and configurable. The expand() pattern (Option 2) is the right architectural choice — it co-locates model-specific config expansion with the model definition while providing clean separation between config resolution and module construction.

Key areas for improvement:

  1. Make expand() mandatory and remove the fallback construction path
  2. Eliminate duplicated structural logic between _expand_layer() and TransformerBlock.__init__
  3. Address serialization limitations of PerLayer lambdas (longer term)
  4. Add safety comments around weight-tying + init ordering assumptions

@tianyu-l
Contributor

@claude Let me clarify.

For option 2, yes we could just remove the logic in TransformerBlock.__init__() and only use expand() in the config.

For option 3, I didn't mean putting the expansion logic in config_registry.py; instead I meant that we can put it in model/__init__.py (e.g. in the model_registry(flavor) function) where we construct model config.

  • This still couples model with expansion logic. Each model has its own expansion logic.
  • The logic is flexible and optional. One could have written expanded logic directly.
  • I would argue that this provides better encapsulation between model core logic, because what each model looks like is completely determined by the expanded per-layer config, and it doesn't matter how this config is provided, via expansion or not. Putting in each model subclass would prescribe how users can provide the final per-layer configs.
  • I disagree with "This breaks the principle that model-specific knowledge should be co-located with the model." Although model.py holds model-specific components, it doesn't know how to construct the model by combining those components with common ones; the actual model composition logic has largely lived in model/__init__.py. So I would say option 3 (moving expansion to model/__init__.py) gives better encapsulation: model.py then holds only model-specific components that aren't defined for other models (and thus not in common/), and model/__init__.py holds the composition logic.
  • Also if we go with option 3, we can remove classes such as Llama4TransformerBlock because the base class TransformerBlock can hold general logic.

@claude

claude bot commented Mar 26, 2026

Claude finished @tianyu-l's task in 1m 47s


  • Read model __init__.py files to understand current composition logic
  • Read model .py files to understand TransformerBlock subclasses
  • Read Decoder and TransformerBlock to understand current expand() design
  • Provide updated analysis of Option 2 vs clarified Option 3

Revised Analysis: Option 2 vs Clarified Option 3

Thank you for the clarification. The distinction is meaningful — placing expansion logic in model/__init__.py alongside the existing composition logic (where model_registry() and config dicts already live) is quite different from putting it in config_registry.py. Let me revise my analysis.

Understanding the Current Boundaries

Looking at the code as it stands in this PR, the responsibilities are already split:

| File | Responsibility |
| --- | --- |
| `model/__init__.py` | Config construction (component wiring, init patterns, hyperparameters), `model_registry()` |
| `model/model.py` | Model-specific components (subclassed `TransformerBlock`, subclassed `Decoder`), `_expand_layer()` (on `Config`) |
| `common/decoder.py` | Shared `Decoder.__init__`, shared `expand()` loop, shared `forward()` |

The key observation is that the model-specific TransformerBlock subclasses exist primarily to handle the structural branching (FFN vs MoE, iRoPE vs RoPE). Looking at the actual code:

  • Llama3TransformerBlock.__init__ — just builds attention + FFN + norms. No branching. Could be the base TransformerBlock.
  • Llama4TransformerBlock.__init__ (model.py:80-111) — computes moe_enabled and attn_use_rope, branches on both. But _expand_layer (model.py:144-161) already resolves these same decisions onto the config.
  • Qwen3TransformerBlock.__init__ (model.py:37-51) — branches on moe_enabled. But _expand_layer (model.py:87-93) already sets feed_forward=None or moe=None.
  • DeepSeekV3TransformerBlock.__init__ (model.py:179-193) — branches on layer_id >= n_dense_layers. But _expand_layer (model.py:280-286) already resolves this.

This duplication is exactly the problem. Both _expand_layer and __init__ compute the same structural decisions. The question is: which should be the canonical path?

Revised Assessment of Option 3

With the clarification that expansion lives in model/__init__.py, Option 3 becomes:

# In model/__init__.py (e.g., llama4/__init__.py)

def _expand_layers(config: Decoder.Config) -> list[TransformerBlock.Config]:
    """Llama4-specific layer expansion: iRoPE + MoE interleaving."""
    layers = []
    for layer_id in range(config.n_layers):
        layer_cfg = copy.deepcopy(config.layer)
        resolve_per_layer(layer_cfg, layer_id)
        # iRoPE
        if layer_cfg.every_n_layers_nope and layer_id % layer_cfg.every_n_layers_nope == 0:
            layer_cfg = replace(layer_cfg, attention=replace(layer_cfg.attention, use_rope=False))
        # MoE interleaving
        if (layer_id + 1) % layer_cfg.interleave_moe_layer_step == 0:
            layer_cfg = replace(layer_cfg, feed_forward=None)
        else:
            layer_cfg = replace(layer_cfg, moe=None)
        layers.append(layer_cfg)
    return layers

def model_registry(flavor: str) -> ModelSpec:
    config = llama4_configs[flavor]
    config.layers = _expand_layers(config)  # expansion happens here
    return ModelSpec(model=config, ...)

And the TransformerBlock.__init__ becomes:

# In common/decoder.py — ONE implementation, no subclasses needed
class TransformerBlock(Module):
    def __init__(self, config, *, layer_id, dim, n_layers):
        super().__init__()
        self.attention = config.attention.build(dim=dim)
        self.attention_norm = config.attention_norm.build(normalized_shape=dim)
        self.ffn_norm = config.ffn_norm.build(normalized_shape=dim)
        # Config already has exactly one of feed_forward/moe set
        if config.feed_forward is not None:
            self.feed_forward = config.feed_forward.build(dim=dim)
        if config.moe is not None:
            self.moe = config.moe.build(dim=dim)

Where I Think Option 3 (Clarified) Is Stronger

1. Eliminates the duplicated structural logic.
This is the strongest argument. Right now, Llama4TransformerBlock.__init__ and Llama4Model.Config._expand_layer compute the same decisions. With Option 3, the expansion logic in __init__.py is the single source of truth, and the constructor just reads what's in the config. No duplication.

2. Eliminates model-specific TransformerBlock subclasses.
Looking at what each subclass actually contributes:

  • Llama3TransformerBlock — adds nothing over a base class that handles FFN-only.
  • Qwen3TransformerBlock — adds moe_enabled flag and branching. With expanded configs, the base class handles this.
  • DeepSeekV3TransformerBlock — adds n_dense_layers and branching. With expanded configs, the base class handles this.
  • Llama4TransformerBlock — adds iRoPE + MoE interleaving. With expanded configs, the base class handles this.

All four model-specific `forward()` implementations are structurally identical: `x + attn(norm(x)); x + ffn_or_moe(norm(x))`. A single base `TransformerBlock` with `if self.moe is not None` / `else self.feed_forward` covers all of them. This aligns with the CLAUDE.md guideline: "Don't create per-model wrappers for the same functionality."
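The shared residual pattern can be sketched with plain-Python stand-ins (toy lambdas replace the real attention/norm/FFN modules; only the `feed_forward`-vs-`moe` dispatch is the point, not the actual codebase API):

```python
class TransformerBlockSketch:
    """Toy sketch of the unified pattern: x + attn(norm(x)); x + mixer(norm(x))."""

    def __init__(self, use_moe: bool):
        self.attention = lambda x: x * 2          # stand-in for real attention
        self.norm = lambda x: x                   # stand-in for RMSNorm
        # Exactly one of the two mixers is set, mirroring the expanded config.
        self.moe = (lambda x: x + 100) if use_moe else None
        self.feed_forward = None if use_moe else (lambda x: x + 1)

    def forward(self, x: float) -> float:
        x = x + self.attention(self.norm(x))      # attention residual
        mixer = self.moe if self.moe is not None else self.feed_forward
        return x + mixer(self.norm(x))            # FFN-or-MoE residual
```

Because the dispatch is a single `is not None` check, both the FFN-only and MoE variants flow through the same `forward`.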

3. Better separation of "what" vs "how".
Your point about encapsulation is well-taken. model.py defines components (what a layer looks like when built); __init__.py defines composition (how to assemble N layers with the right structural variations). The TransformerBlock doesn't need to know why it has an MoE — it just builds whatever the config says.

4. model/__init__.py already holds composition logic.
This is factually correct. Looking at llama4/__init__.py, it already decides interleave steps, iRoPE periods, MoE hidden dims, etc. — these are composition decisions. The expansion function is a natural extension of this existing role.

Where Option 2 Still Has Merit

1. Discoverability. With Option 2, the expansion is co-located with the config class definition (Llama4Model.Config._expand_layer). A reader looking at the Config class immediately sees how layers differ. With Option 3, you have to look in __init__.py for a separate function.

2. The expand() loop is shared. In Option 2, Decoder.Config.expand() provides the deep-copy + resolve_per_layer loop, and models only override _expand_layer(). In Option 3, each model's __init__.py must implement (or call a helper for) the full expansion loop, including deep-copy and resolve_per_layer. There's a risk of inconsistency if models implement this differently. This can be mitigated with a shared utility function, though.

3. Flux already uses expand() on Config. FluxModel.Config.expand() (flux/model/model.py:140-227) uses the Config method approach heavily. Moving decoder expansion to __init__.py would create two patterns: Flux does it on Config, decoders do it in __init__.py. Though one could argue Flux should be updated too if Option 3 is chosen.

Updated Recommendation

I now agree that Option 3 (with expansion in model/__init__.py) is the better design, primarily because:

  1. It eliminates the duplicated structural logic between _expand_layer() and TransformerBlock.__init__() — this is the most compelling practical benefit.
  2. It enables removing all four model-specific TransformerBlock subclasses (Llama3, Llama4, Qwen3, DeepSeek), which are currently near-identical. This directly satisfies the CLAUDE.md principle "unify across models."
  3. It follows the existing responsibility split where __init__.py already handles composition.
  4. The TransformerBlock.forward() can become a single, shared implementation.

The refinements I'd suggest for Option 3:

  • Provide a shared utility function (in common/) for the deep-copy + resolve_per_layer loop, so models don't each reimplement the mechanics.
  • For Llama3, which has no structural variation, the expansion could be as simple as calling the shared utility with no per-layer modifications.
  • The TransformerBlock.Config would still carry model-specific fields (every_n_layers_nope, n_dense_layers, etc.) as "hints" for the expansion function, or those fields could move to the model-level Config (e.g., Llama4Model.Config.every_n_layers_nope) since they're composition-level concerns, not layer-level concerns.
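A minimal sketch of such a shared utility, with a hypothetical `LayerCfg` dataclass standing in for `TransformerBlock.Config` (`resolve_per_layer` and all real fields are omitted; the interleave period below is illustrative):

```python
import copy
from dataclasses import dataclass, replace
from typing import Callable, Optional

@dataclass
class LayerCfg:
    """Minimal stand-in for TransformerBlock.Config (hypothetical fields)."""
    feed_forward: Optional[str] = "ffn"
    moe: Optional[str] = None

def expand_layers(base: LayerCfg, n_layers: int,
                  per_layer: Callable[[LayerCfg, int], LayerCfg] = lambda c, i: c):
    """Shared deep-copy loop; each model passes only its structural tweak."""
    return [per_layer(copy.deepcopy(base), i) for i in range(n_layers)]

# Llama4-style tweak: every 2nd layer becomes MoE (illustrative period)
def llama4_tweak(cfg: LayerCfg, layer_id: int) -> LayerCfg:
    if (layer_id + 1) % 2 == 0:
        return replace(cfg, feed_forward=None, moe="moe")
    return replace(cfg, moe=None)

layers = expand_layers(LayerCfg(), 4, llama4_tweak)
```

Llama3, with no structural variation, would call `expand_layers(base, n_layers)` with the default identity tweak.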

One caveat: for the forward() method, the Llama4 model currently distinguishes self.moe_enabled for the FFN/MoE branch. With expanded configs, this could be simplified to checking attribute existence (hasattr(self, 'moe')) or checking config.moe is not None at build time and setting a flag. Either way, the forward method unifies cleanly.


[ghstack-poisoned]
fegin added a commit that referenced this pull request Mar 26, 2026
**NOT READY TO REVIEW**

**Motivation and design decisions**

Parameter initialization should be configurable without modifying module code. Users reuse provided helpers (e.g., `init_trunc_normal`, `init_zeros`) to customize how parameters are initialized. Parameters are identified by their FQNs, and `init_by_regex` lets users map regex patterns to initializers — so a single declarative dict at the model config level controls all parameter init.

In theory, we could just apply `param_init` at the root module and be done — no recursion needed. So why does `init_states` recurse? The main motivation is that buffer initialization is usually inherently coupled with module internals and not user-configurable. Since buffers require per-module logic via `init_self_buffers`, we provide a recursive `init_states` that handles both: it walks the module tree, initializing parameters via the configurable `param_init` and buffers via module-specific overrides, in a single pass.
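The two-hook recursion can be sketched without any framework (toy `ModuleSketch` class; the `init_self_parameters`/`init_self_buffers`/`init_states` names follow the description above, everything else is illustrative):

```python
class ModuleSketch:
    """Toy module: params are user-configurable, buffers are module-private."""

    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)
        self.params, self.buffers = {}, {}

    def init_self_parameters(self, param_init):
        # Parameters go through the configurable initializer, keyed by FQN.
        self.params[f"{self.name}.weight"] = param_init(f"{self.name}.weight")

    def init_self_buffers(self):
        # Buffers use module-specific logic, not exposed through the config.
        self.buffers[f"{self.name}.freqs"] = "computed-internally"

    def init_states(self, param_init):
        # Single pass over the tree: both hooks fire on every module.
        self.init_self_parameters(param_init)
        self.init_self_buffers()
        for child in self.children:
            child.init_states(param_init)

root = ModuleSketch("model", [ModuleSketch("model.layers")])
root.init_states(lambda fqn: f"init({fqn})")
```

One `init_states` call at the root thus covers configurable parameters and module-internal buffers together.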

**Summary**
- Remove `init_weights`.
- `init_states` auto-recurses the module tree, then calls `init_self_parameters` and `init_self_buffers` on each module.
- `param_init` (a NamedInitializer callable on Module.Config) uses regex-matched FQN patterns to map parameters to initializers. Child modules without their own `param_init` delegate up the parent chain, so init logic is defined once at the model config level.
- `make_decoder_param_init` provides shared patterns for decoder-based models, reused by Llama3, Llama4, Qwen3, and DeepSeek V3. Model-specific extensions (Flux DiT-style, GPT-OSS MoE biases) are composed via dict merge.
- Removes `init_mean`/`init_std` fields from `Linear.Config` and `Embedding.Config` — init parameters are now expressed entirely through `param_init` patterns.
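A minimal sketch of the regex-to-initializer mapping (the `init_by_regex` name comes from the description above; first-match-wins ordering and the specific patterns are assumptions, not the PR's actual patterns):

```python
import re

def init_by_regex(patterns):
    """Map regex-on-FQN -> initializer; first matching pattern wins (assumed)."""
    compiled = [(re.compile(p), fn) for p, fn in patterns.items()]

    def named_initializer(fqn):
        for pat, fn in compiled:
            if pat.search(fqn):
                return fn(fqn)
        raise KeyError(f"no initializer matched {fqn!r}")

    return named_initializer

# Hypothetical shared decoder patterns, extended via dict merge as described.
base = {r"\.attention\..*weight$": lambda f: "trunc_normal",
        r"norm\.weight$": lambda f: "ones"}
extra = {r"\.moe\..*bias$": lambda f: "zeros"}      # model-specific extension
param_init = init_by_regex({**base, **extra})
```

The `{**base, **extra}` merge is what lets a model extend the shared patterns without touching them.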

**Verification**

With `loss_compare.py`:

  ┌─────────────┬───────┬───────────────────┐
  │    Model    │ Steps │      Result       │
  ├─────────────┼───────┼───────────────────┤
  │ Llama3      │ 10    │ Bitwise identical │
  ├─────────────┼───────┼───────────────────┤
  │ Qwen3       │ 10    │ Bitwise identical │
  ├─────────────┼───────┼───────────────────┤
  │ DeepSeek V3 │ 10    │ Bitwise identical │
  ├─────────────┼───────┼───────────────────┤
  │ GPT-OSS     │ 10    │ Bitwise identical │
  └─────────────┴───────┴───────────────────┘

ghstack-source-id: c721621
Pull-Request: #2633
[ghstack-poisoned]
fegin added a commit that referenced this pull request Mar 26, 2026
ghstack-source-id: cff1807
Pull-Request: #2633