Hi,
For checkpointing, is there an equation or table for the minimum RAM required to run 10 checkpointing writes and 10 checkpointing reads?
I ran checkpointing with model=llama8b on a single 128GB client using the default behavior, and it hung during the read phase due to RAM limitations. I was able to run the checkpointing workload with 2 clients at 128GB each (256GB RAM total).
Is there any resource listing the RAM needed for the llama70b, 405b, and 1T models?
I know others have been able to run llama70b with 2TB RAM (4 clients with 512GB each) and llama405b with 4TB RAM (16 clients with 256GB each).
Is there some equation for it? Would having more clients decrease the RAM needed on each client? For example, could I have used 2 or 3 64GB clients to run the llama8b workload?
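To make the numbers above easier to compare, here is a small Python sketch that totals the RAM for each configuration mentioned in this post. The helper name `total_ram_gb` is only for illustration, and the listed setups are ones that ran (or that I am asking about), not measured minimums.

```python
# Configurations reported in this post: (model, clients, GB of RAM per client).
# These are setups that completed the workload, not measured minimums.
reported = [
    ("llama8b", 2, 128),
    ("llama70b", 4, 512),
    ("llama405b", 16, 256),
]

# Hypothetical llama8b setups I am asking about (not tested).
candidates = [
    ("llama8b", 2, 64),
    ("llama8b", 3, 64),
]


def total_ram_gb(clients, gb_per_client):
    """Aggregate RAM across all clients, in GB."""
    return clients * gb_per_client


for label, configs in (("reported", reported), ("candidate", candidates)):
    for model, clients, gb in configs:
        total = total_ram_gb(clients, gb)
        print(f"{label}: {model} {clients} x {gb} GB = {total} GB total")
```

If the requirement depends mainly on total RAM across clients (an assumption I have not verified), 2 x 64GB would only match the 128GB total of the single-client run that hung, and 3 x 64GB would still be below the 256GB total that worked.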