Hi Expert,
Currently, if name_model = "70B" is configured and torchrun in the prime framework is launched on a single server with 8 GPUs, the launch on peer2 fails with a random rank failure: sometimes it reports CUDA out of memory, and sometimes it fails without any specific error message.
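For context, the launch I mean is a standard two-node torchrun rendezvous, roughly like the sketch below (the flags are torchrun's standard ones; the training entrypoint, config path, and host name are placeholders, not the actual prime-framework invocation):

```shell
# Run on peer2 (the second server); use --node_rank=0 on the first server.
# <master_host>, train.py, and configs/70B.toml are illustrative placeholders.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=1 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=<master_host>:29500 \
  train.py --config configs/70B.toml
```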
Is this expected? And what configuration and model parameters should be used for a larger model (for example, 70B)?
Thanks!!
Regards,
Kun