Training LLaMA 3 on TPUs | How To Scale Your Model #8
Replies: 6 comments 7 replies
- Thank you for this section, and thanks again for this wonderful book!
- In the answer to the 4th question, the denominator should be 225.
- What does "number of gradient checkpoints" mean -- is it the factor by which the total size of the checkpointed residuals exceeds the size of a single layer input?
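For readers puzzling over the same question, here is a minimal back-of-the-envelope sketch of the interpretation the comment proposes: if each layer saves some number of [B, T, D]-shaped activations for the backward pass, then the "number of gradient checkpoints" acts as the ratio of total checkpointed bytes to the bytes of one layer input. The layer count, checkpoint count, and dimensions below are illustrative, not the book's exact numbers.

```python
# Rough sketch, assuming each layer checkpoints n_ckpt_per_layer activation
# tensors of shape [B, T, D] in bf16. Under that reading, the "number of
# gradient checkpoints" relates total checkpointed memory to the size of a
# single layer input. Numbers are LLaMA 3-70B-like but illustrative.

B = 1024                 # sequences per batch
T = 4096                 # sequence length
D = 8192                 # residual (model) dimension for LLaMA 3-70B
n_layers = 80            # LLaMA 3-70B layer count
n_ckpt_per_layer = 4     # hypothetical checkpoints saved per layer
bytes_per_value = 2      # bf16 activations

layer_input_bytes = B * T * D * bytes_per_value
total_ckpt_bytes = n_ckpt_per_layer * n_layers * layer_input_bytes

print(f"one layer input (whole batch): {layer_input_bytes / 1e9:.1f} GB")
print(f"all checkpointed activations:  {total_ckpt_bytes / 1e12:.1f} TB")
print(f"per layer, that is {n_ckpt_per_layer}x the size of the layer input")
```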
- When computing T_math, should we, in practice, also take TPU FLOPs utilization into account? For example, for LLaMA 3-70B we assumed 40% FLOPs utilization -- should we instead compute T_math with that utilization factored in?
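A small sketch of the adjustment being asked about: divide the ideal math time by an assumed model FLOPs utilization (MFU). The 40% figure and the 1024 x 4096 batch come from this thread; the chip count and per-chip peak FLOPs below are assumptions for illustration (a TPU v5p-like chip), not the book's exact setup.

```python
# Sketch of folding an assumed FLOPs utilization into the T_math estimate.
# The 40% MFU and the 1024 x 4096 batch are quoted in the thread; the chip
# count and per-chip peak are illustrative assumptions.

num_params = 70e9                  # LLaMA 3-70B parameter count
tokens_per_batch = 1024 * 4096     # 1024 sequences of length 4096
num_chips = 8960                   # assumed TPU v5p slice size
peak_flops_per_chip = 4.59e14      # assumed bf16 peak FLOPs/s per chip
mfu = 0.40                         # FLOPs utilization from the comment

# Standard 6 * params * tokens estimate of training FLOPs (fwd + bwd) per batch.
flops_per_batch = 6 * num_params * tokens_per_batch

t_math_ideal = flops_per_batch / (num_chips * peak_flops_per_chip)
t_math_at_mfu = t_math_ideal / mfu   # the utilization-adjusted estimate

print(f"ideal T_math: {t_math_ideal:.2f} s/step, at 40% MFU: {t_math_at_mfu:.2f} s/step")
```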
- Thanks for making an amazing resource! I just noticed two small typos:
  - It should read "4.8 years".
  - It should read "1024 sequences of length 4096 per batch", per the LLaMA 3 paper.
- In the last question of "How to shard LLaMA 3-70B for training", the answer mentions "4-way model parallelism", but the takeaway specifies 4-way tensor parallelism. Should this be updated to "tensor parallelism" for consistency, given the sharding along F?
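For context, a minimal JAX sketch of what sharding along F means in this setting: the MLP weights are split 4 ways over a "tensor" mesh axis along the feed-forward dimension. The mesh layout and the assumption of 4 available devices are illustrative, not the book's exact configuration.

```python
# Minimal sketch of 4-way tensor parallelism along the feed-forward dimension F,
# assuming at least 4 local devices (with fewer, the sharding simply degrades).
# Dimensions match LLaMA 3-70B (D=8192, F=28672); the mesh layout is illustrative.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

D, F = 8192, 28672
devices = np.array(jax.devices()[:4])          # up to 4-way "tensor" axis
mesh = Mesh(devices, axis_names=("tensor",))

# W_in: [D, F] sharded along F; W_out: [F, D] sharded along F.
w_in = jax.device_put(jnp.zeros((D, F), jnp.bfloat16), NamedSharding(mesh, P(None, "tensor")))
w_out = jax.device_put(jnp.zeros((F, D), jnp.bfloat16), NamedSharding(mesh, P("tensor", None)))

print(w_in.sharding)   # each device holds a [D, F/4] slice of w_in
```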
- Training LLaMA on TPUs!