Conversation

ebsmothers (Contributor)

This PR adds support for full bfloat16 training. In SFT it is pretty common to store everything in bfloat16 to save memory, with select tensors (logits, RoPE buffers, and activations) maintained in higher precision to preserve numerical accuracy. Separately, I think having this supported more generally would be useful for faster iteration -- e.g. it allows me to run Llama3 70B on a single node of H100s, which is otherwise not possible with the default config.

Assuming this is generally useful, I would like feedback on:

  1. Acceptable loss convergence: in the first 100 steps on Llama3 8B, full bf16 training brings the loss from 12.25 -> 8, as opposed to 12.25 -> 7 with fp32 training. Is this a concern? (As mentioned, for SFT this is less of an issue; happy to validate that statement if that's helpful.)
  2. Interaction with mixed precision training -- where is the right place to validate that these are not both set at once?
  3. Where to put the set_default_dtype API
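
For context on (3), here is a minimal sketch in plain PyTorch of what full bf16 training with a default-dtype switch plus an fp32 logits cast could look like. This is not the torchtitan trainer; `model`, `inputs`, and `targets` are placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Newly constructed params/buffers default to bf16, and the optimizer states
    # created from them are bf16 as well -- no extra fp32 copy of the weights.
    torch.set_default_dtype(torch.bfloat16)

    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 32000))
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    def training_step(inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        logits = model(inputs.to(torch.bfloat16))
        # Upcast the logits before the loss to preserve numerical accuracy,
        # mirroring the .float() call in the diff below.
        loss = F.cross_entropy(logits.float().reshape(-1, logits.size(-1)), targets.reshape(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss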

@meta-cla (bot) added the CLA Signed label on Aug 27, 2025
@@ -421,5 +421,5 @@ def forward(
         h = layer(h, self.freqs_cis)

         h = self.norm(h) if self.norm else h
-        output = self.output(h) if self.output else h
+        output = self.output(h).float() if self.output else h
Contributor:
If we set the training dtype during the training initialization, why not also do the output conversion in the trainer (train loop)?
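
(A rough sketch of what that could look like, with placeholder names rather than the actual torchtitan trainer code:)

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def train_step(model: nn.Module, input_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # The model forward stays dtype-agnostic; the upcast to fp32 happens
        # here, right before the loss, instead of inside Transformer.forward.
        logits = model(input_ids)
        loss = F.cross_entropy(logits.float().reshape(-1, logits.size(-1)), labels.reshape(-1))
        loss.backward()
        return loss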

Contributor Author:
Thanks, just removed.

put all parameters, gradients, and optimizer states in bfloat16, without an extra copy of fp32 weights.
In the case of full bf16 training, RoPE calculations and logits will still be in fp32.
"""

mixed_precision_param: Literal["bfloat16", "float32"] = "bfloat16"
Contributor:
What if mixed_precision_param is float32 but dtype is bfloat16? Should there be a check?

Contributor Author:
Yeah agreed. Do we want to do this somewhere in train.py? Lmk if you think there's a better place

Contributor:
mixed_precision_param comes from FSDP2. I think if FSDP2 can work with that, it's the user's responsibility to configure these options properly.

Contributor:
We also made it work with DDP/single device: #1303. I think at least a warning is required.

Contributor Author:
Sounds good. In that case I will leave this as is.

Contributor:

@fegin
autocast is not well supported in torchtitan anyway; I'm not sure if it is still maintained. See other issues like #1525.

But sure, having a warning sounds good.
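
For reference, such a warning could look roughly like this; the check location and field names are assumed for illustration, not taken from the final patch:

    import logging

    logger = logging.getLogger(__name__)

    def check_dtype_settings(dtype: str, mixed_precision_param: str) -> None:
        # Full bf16 training already stores parameters in bfloat16, so an fp32
        # mixed-precision param dtype is likely a misconfiguration.
        if dtype == "bfloat16" and mixed_precision_param == "float32":
            logger.warning(
                "Training dtype is bfloat16 but mixed_precision_param is float32; "
                "parameters are already stored in bf16, so check that this "
                "combination is intentional."
            )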
