Sharded Matrices and How to Multiply Them | How To Scale Your Model #5
Replies: 37 comments 59 replies
-
In the solution for pop quiz 2, the bidirectional ICI bandwidth for a TPU v5e is given as 9e10 bytes/s, which doesn't quite match the value of 1e11 bytes/s given in the table in part 2. Looking at https://cloud.google.com/tpu/docs/v5e, it appears that the value in the table is the correct one.
-
In the section "A quick aside: how would we describe this in code?", the text says: "For instance, in the above example, the local shape of A is [4, 1024] and for B is [2048, 4096]". I think the local shape of A is [2, 1024]?
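If it helps, here is a minimal sketch I used to check this (simulated CPU devices; I'm assuming the example means a 4x2 mesh with A: f32[8, 2048] sharded as A[I_X, J_Y], which is my reading, not necessarily the chapter's exact setup):

```python
# Minimal sketch: check global vs. local (per-device) shapes under a 4x2 mesh.
import os
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"  # simulate 8 devices on CPU

import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()).reshape(4, 2), ("X", "Y"))  # 8 devices as a 4x2 grid
A = jax.device_put(np.zeros((8, 2048), np.float32), NamedSharding(mesh, P("X", "Y")))  # A[I_X, J_Y]

print(A.shape)                             # global shape: (8, 2048)
print(A.addressable_shards[0].data.shape)  # local shape per device: (2, 1024) = (8/4, 2048/2)
```

This prints (8, 2048) globally and (2, 1024) per device, which is where my [2, 1024] comes from.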
-
On the first picture, you state that the shape of matrix A before sharding is [ . For me, this would mean that A is sharded across its rows, and B is sharded across its columns, so we have everything we need to calculate a single element of the result C, because the contracting dimensions are not sharded. But because you reversed the meaning of , could you clarify this with an image? I think I get the point, but visually it would help a lot to see exactly what you mean by these.
-
Some issues with question 2: In part 2's solution, I think you mean for X to be in the denominator. The result is the same because X = Y in this case. In part 3's solution, you mention TPU v5e, but the question asks about v4p. In part 4, I'm not sure what AllGather with a {U_Z} dimension means. I believe this is not addressed in the text of the chapter. Also, the solution again mentions v5e.
-
The code below it and the in the code really say: 8 TPUs in a 4x2 grid.
-
I believe question 4 may have miscalculated the comms overhead for
-
The flow in this chapter is a little jarring when it drops into the four cases without defining the term "contracting dimensions" or doing other setup to smooth the transition. Maybe an external reference or a bit more connective flow would help?
-
In the solution to question 4, I believe it should be D < C / Wici instead of F < C / Wici when determining when we are comms-bound in strategy 1. The wording is also a bit confusing: it says "In the second case (baseline)", but it appears to be talking about strategy 1, if I'm not mistaken? Also, a small grammatical error at the end of the solution: "we'll shard our parameters" instead of "we'll sharded our parameters".
-
The text says "For example, A[IX,J]⋅B[J,K]→C[IX,K] can be multiplied without any communication because the contracting dimension (J, the one we’re actually summing over) is unsharded. However, if we wanted the output unsharded (i.e. A[IX,J]⋅B[J,K]→C[IX,K]), we would need to copy A or C to every device.". Presumably the last "C[IX,K]" should actually be "C[I,K]".
-
As someone who is fairly familiar with sharding and JAX, I think the flow of this chapter can be refined and the details (along with the notation) can be improved a lot. I am happy to contribute if you are open to contributions. I mean it when I say this is confusing and can be simplified.
-
Can you explain more about AllReduce? Because I think I misunderstand what it actually does in Question 2, Part 3. In my opinion, after we do , because there is no communication between X and Y?
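To make my confusion concrete, here is a toy sketch of what I understand an AllReduce to be (simulated CPU devices, shapes of my own choosing): each device contributes a partial sum, and after the psum every device holds the identical full sum.

```python
# Toy AllReduce: each device holds one partial sum; psum over the mesh axis
# sums them and leaves the full result replicated on every device.
import functools
import os
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=4"  # simulate 4 devices

import jax
import numpy as np
from jax import lax
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

mesh = Mesh(np.array(jax.devices()), ("X",))

@functools.partial(shard_map, mesh=mesh, in_specs=P("X"), out_specs=P())
def allreduce(partials):
    # per-device block has shape (1, 2); psum sums across the 4 devices
    return lax.psum(partials, axis_name="X")[0]

partials = np.arange(8, dtype=np.float32).reshape(4, 2)  # one (2,)-vector per device
print(allreduce(partials))  # [0+2+4+6, 1+3+5+7] = [12. 16.]
```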
-
In Question 3, why does the answer say "Since we have an axis of size 4 on a TPU v4p, we have a wraparound link, so we can do the AllGather by sending half the bytes in each direction"? In the GIF above, I think each device sends the full bytes in each direction. Is there any difference?
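For reference, here is the accounting as I understand it (taking $W_{ici}$ to be the full bidirectional bandwidth, as in the chapter):

$$T_{\text{uni}} \approx (N-1)\cdot\frac{\text{bytes}/N}{W_{ici}/2} \approx \frac{2\,\text{bytes}}{W_{ici}}, \qquad T_{\text{bi}} \approx \frac{N}{2}\cdot\frac{2\cdot\text{bytes}/N}{W_{ici}} \approx \frac{\text{bytes}}{W_{ici}}.$$

So in the GIF each shard does travel in both directions, but each copy only goes about halfway around the ring, which is why each direction of every link ends up carrying roughly half the array rather than all of it. Is that the right way to read it?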
-
Thanks for the great work! I have a question about the bi-directional all-gather case: since each hop sends
-
This is a fantastic book! Kudos to the authors and a big THANK YOU! I think this section is super critical for appreciating TPU differentiation vs. GPUs, but it needs quite substantial rework:
I hope my feedback is not misconstrued. I feel this book overall is phenomenal in its objectives and style, and definitely stands out in the crowd of similar efforts. Thank you again!
-
In question 4, I believe some math + reasoning for All-Gather being the preferred strategy is incorrect. At the beginning,
So for reasonably common batch sizes, we're ICI-bound for strategy 1, as we are for strategy 2. In that case, we need to compare the ICI times for both strategies to decide which one is best. Strategy 2 is best when:
So basically, for reasonable batch sizes (~1-2K) and D (~4K), strategy 2 is better than strategy 1. I also built a bunch of plots in this Colab, which showed that for certain large values of D & F it's never even beneficial to do strategy 1 (for example, when D=8K, F=16K), while for other values (D=4K, F=16K) it's better to do strategy 2 for B<2K and then slightly better to do strategy 1 for larger values of B.

Unless I screwed up my math above, I believe the recommendation that the "All-Gather" strategy is better for Case 2 should be reconsidered. At smaller batches, the "All-Reduce" strategy seems to be much better. It also makes sense when reasoning about it at a high level: when you have a giant weight matrix (i.e.

A small nit re. the same question: it never mentions that we want to do everything in bfloat16; it would be great to add that info.

Thank you for reading, and also thank you for providing such a great learning resource for the community!
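To spell out the comparison I'm making (my own notation; I'm assuming unsharded activations x[B, D] multiplying weights W[D_X, F] in bfloat16):

$$T^{\text{AllGather}}_{\text{ICI}} \approx \frac{2DF}{W_{ici}} \;(\text{gather } W \text{ over } X), \qquad T^{\text{AllReduce}}_{\text{ICI}} \approx \frac{2\cdot 2BF}{W_{ici}} \;(\text{reduce the partial } [B, F] \text{ output}),$$

so when both strategies are ICI-bound, the AllReduce strategy wins whenever $4BF < 2DF$, i.e. $B < D/2$ (for $D = 4096$ that's $B < 2048$, which matches my plots).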
-
In question 10.1, why is the number of floats communicated by a ReduceScatter the same as that of AllGather? Doesn't ReduceScatter need to communicate less since the partial sums remain scattered and don't need to be gathered?
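To state the counting behind my question (V is the full array size in bytes, ring algorithm as I understand it from the chapter):

$$T_{\text{AllGather}} \approx T_{\text{ReduceScatter}} \approx \frac{V}{W_{ici}},$$

since both move a V/N chunk per device per hop for about N hops: in the AllGather the chunks are copies being spread out, while in the ReduceScatter they are partial sums being passed along and accumulated. Even though the result stays sharded, every contribution still has to traverse the ring. Is that the intended reading?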
-
In Pop quiz 2 Part 1, I wonder if we should use the unidirectional bandwidth (which is 4.5e10) because the Y axis size is smaller than 16. IIUC, the answer should be Tcomms = 34e6 / 4.5e10 ≈ 756μs. I'm curious if I'm missing something.
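Writing out both versions of the arithmetic (34e6 bytes is the number from the quiz; 4.5e10 vs. 9e10 bytes/s are the uni- and bidirectional bandwidths):

$$T_{\text{comms}} = \frac{3.4\times10^{7}}{4.5\times10^{10}} \approx 756\,\mu\text{s} \qquad \text{vs.} \qquad T_{\text{comms}} = \frac{3.4\times10^{7}}{9\times10^{10}} \approx 378\,\mu\text{s}.$$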
-
Hi, thank you for the great explanation. I have a question regarding 10.2. Why is the data size considered to be
-
Could you clarify the I also don't have a great intuition of how an
-
In Case 3 (both multiplicands have sharded contracting dimensions), is the ReduceScatter done in bf16 or f32? Matmul accumulation typically needs to be done in f32, so does that mean the communication cost for the ReduceScatter will be higher?
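To make the cost question concrete (my own arithmetic, just counting bytes per element):

$$T_{\text{RS}} \approx \frac{\text{bytes/elem} \times n_{\text{elems}}}{W_{ici}},$$

so communicating the partial sums in f32 (4 bytes/elem) would cost about twice the bf16 (2 bytes/elem) ReduceScatter, if I'm reading the model correctly.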
-
For question 7, I believe I might be misunderstanding the notation. We want to multiply matrices C and B, and then take the result and multiply by matrix x, correct? In this case, it appears the shapes are incompatible: the result of C * B is [F, F], which is incompatible with the shape of x, [B, D].
-
In the first pop quiz, you write that
but this is wrong because 128 * 2048 * 2 = 524,288 bytes = 512 KiB (about 524 kB).
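Spelling out my conversion (assuming 2 bytes per element, i.e. bfloat16):

$$128 \times 2048 \times 2\,\text{B} = 524{,}288\,\text{B} = \frac{524{,}288}{1024}\,\text{KiB} = 512\,\text{KiB} \approx 524\,\text{kB}.$$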
-
This isn't working in the Colab:
Should be changed to this:
-
Hello, I am confused about how to judge whether a TPU has a wraparound connection. Is it related to the TPU's type?
-
I can't understand "In 2D, the cost actually scales down with the size of the smallest axis" in the article. Why? Can you give an example? Thank you very much.
-
In question 2, the first question: shouldn't the formula be 2BD/(9e10 * X)? Why is Y in the denominator?
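Here is where my X comes from (assuming the array is sharded over both axes, e.g. A[B_X, D_Y], and the AllGather is over Y): each Y-ring only needs to gather the 1/X slice that lives on its row of the mesh, so

$$T_{\text{comms}} \approx \frac{2BD/X}{W_{ici}} = \frac{2BD}{9\times10^{10}\cdot X}.$$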
-
In the "What happens when we AllGather over multiple axes" example, why does the latency-bound component of the total time depend on the sum of the length of each mesh axis, rather than their product? It seems to me that in each round of communication, each device cannot receive more than
-
Could I get some advice on how to approach question 5? I feel like I just went through the cases mentioned in this chapter and tried to derive when each is compute-bound and comms-bound. It feels like I should either: Also, can I make use of the 3rd axis Z? It seems like as of now I'm only sharding with X and Y.
-
Small typo in Q6. I think the first bullet should have J_Y, not J_X.
-
In the description of the bidirectional All-Gather in Case 2, it says "If we do two directions, we have ceil(N/2) hops of size 2⋅bytes/N". Should the number of hops not be floor(N/2) instead? When N is odd, e.g. N=7, we need only 3 hops, and when N is even, e.g. N=8, we need 4 hops.
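My counting: on a bidirectional ring every device already holds its own shard, and the farthest shard is only ⌊N/2⌋ hops away in the nearer direction, so

$$\#\text{hops} = \Big\lfloor\frac{N}{2}\Big\rfloor \quad (N=7 \Rightarrow 3,\; N=8 \Rightarrow 4), \qquad T \approx \Big\lfloor\frac{N}{2}\Big\rfloor \cdot \frac{2\cdot\text{bytes}/N}{W_{ici}} \approx \frac{\text{bytes}}{W_{ici}},$$

so the bandwidth-bound estimate is the same either way; it's just the hop count that I think should be floor rather than ceil.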
-
Sharded matrix multiplications galore!