This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

Contributor

@iseeyuan iseeyuan commented Sep 25, 2024

As titled. The Flamingo model was initialized on the meta device. Then:

  • the checkpoint is loaded, using 1x the weight size
  • the model is initialized again (probably to populate the buffers in rotary embedding), using another 1x the weight size.
    So for a model with 21 GB of weights, memory usage is 42 GB after model loading.

Context:
Buffers in rotary embedding are not included in the checkpoint.
Instead, they are computed during initialization. Since buffers on the meta device
do not hold any actual values, they need to be reinitialized on the actual
device.
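A quick illustration of why the reinitialization is needed: tensors created under the meta device carry only shape and dtype metadata, with no backing storage to read values from.

```python
import torch

# Factory calls under the meta device allocate no real memory.
with torch.device("meta"):
    t = torch.arange(4).float()

print(t.is_meta)   # the tensor is metadata-only
print(t.shape)     # shape is tracked, but t has no actual values;
                   # accessing its data (e.g. t.tolist()) would raise an error
```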

Fix:
Only reinitialize those buffers, without reinitializing the entire model.
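The pattern above can be sketched as follows. This is a minimal, hypothetical example (not the actual torchchat code): the `RotaryEmbedding` module and its `rope_init` method are stand-ins for the real model, and the checkpoint-loading step is elided. The point is that only the computed buffers are rebuilt on the real device, rather than running the whole `__init__` a second time.

```python
import torch
import torch.nn as nn

class RotaryEmbedding(nn.Module):
    # Hypothetical simplified module: its buffer is computed at init time
    # and is NOT stored in the checkpoint.
    def __init__(self, dim: int, max_seq_len: int):
        super().__init__()
        self.dim, self.max_seq_len = dim, max_seq_len
        self.register_buffer("theta", torch.empty(dim // 2), persistent=False)
        self.rope_init()

    def rope_init(self):
        # Recompute buffer values; safe to call again after leaving meta.
        theta = 1.0 / (10000 ** (torch.arange(0, self.dim, 2).float() / self.dim))
        self.theta = theta

# 1) Build the model on the meta device: no real memory is allocated.
with torch.device("meta"):
    rope = RotaryEmbedding(dim=64, max_seq_len=128)

# 2) Load the checkpoint weights here (1x weight memory). The rotary
#    buffers are not in the checkpoint, so they stay as meta tensors.

# 3) Instead of reinitializing the whole model (another 1x of memory),
#    materialize storage on the real device and recompute only the buffers.
rope = rope.to_empty(device="cpu")
rope.rope_init()
print(rope.theta.is_meta)  # False: buffer now holds real values on CPU
```

With `torch.nn.Module.to_empty` the parameters get uninitialized storage (to be overwritten by the checkpoint load), so the per-buffer `rope_init` call is the only extra compute needed.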

Peak memory:
max_seq_len = 128: before: 43.57 GB, after: 28.55 GB
max_seq_len = 8192: before: 45.47 GB, after: 30.5 GB

@pytorch-bot

pytorch-bot bot commented Sep 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1201

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit eb68319 with merge base e27e162 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 25, 2024
@Jack-Khuu Jack-Khuu merged commit f0a03a7 into main Sep 25, 2024
51 checks passed

5 participants