Conversation

@mykytahordia

The model barely fits on a standard consumer GPU because the parameters are held in memory twice during loading. I did the following:

1. Load the `state_dict` onto `idle_device`.
2. Release memory after generation.

This reduced peak memory usage from 11 GB to 6.5 GB with no speed tradeoff, and it makes the `to_idle` function actually work.
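A minimal sketch of both steps in plain PyTorch; the helper names and signatures here are illustrative, not this repository's actual code:

```python
import gc
import torch

def load_to_idle(model: torch.nn.Module, ckpt_path: str,
                 idle_device: str = "cpu") -> torch.nn.Module:
    # map_location materializes the checkpoint tensors on idle_device,
    # so the weights never exist twice on the GPU (once from torch.load,
    # once inside the module) during loading.
    state_dict = torch.load(ckpt_path, map_location=idle_device)
    model.load_state_dict(state_dict)
    del state_dict  # drop the extra copy as soon as it has been consumed
    return model.to(idle_device)

def to_idle(model: torch.nn.Module, idle_device: str = "cpu") -> None:
    # Park a finished model on idle_device and hand cached blocks
    # back to the CUDA driver so the freed memory is actually reclaimed.
    model.to(idle_device)
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

The main saving comes from `map_location`: without it, `torch.load` can materialize every tensor on the GPU before `load_state_dict` copies it again. `empty_cache()` then returns freed blocks to the driver, since the CUDA caching allocator otherwise keeps them reserved after the model has been moved off the GPU.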
