Description
Thanks for the great project!
I'm profiling the EPLB algo. I used several tools: one is Scalene (I also tried its AI optimization tool, pretty cool), another is py-spy. However, the results are a bit confusing.
The code is basically:

```python
indices = weight.float().sort(-1, descending=True).indices.cpu()
pack_index = torch.full_like(weight, fill_value=-1, dtype=torch.int64, device='cpu')
rank_in_pack = torch.full_like(pack_index, fill_value=-1)
for i in range(num_layers):
    pack_weights = [0] * num_packs
    pack_items = [0] * num_packs
    for group in indices[i]:
        pack = min((i for i in range(num_packs) if pack_items[i] < groups_per_pack),
                   key=pack_weights.__getitem__)
        assert pack_items[pack] < groups_per_pack
        pack_index[i, group] = pack
        rank_in_pack[i, group] = pack_items[pack]
        pack_weights[pack] += weight[i, group]
        pack_items[pack] += 1
```

Most of these are tensor ops. I then wrote a simulator to run the EPLB algo and got the profiling results:
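For reference, such a simulator can be sketched in pure Python, without torch, so the loop itself can be profiled in isolation. The sizes below are made up for illustration; this is not the original harness:

```python
import random

# Pure-Python stand-in for the EPLB balanced-packing loop.
# Hypothetical sizes, chosen only so the example runs quickly.
num_layers, num_groups, num_packs = 8, 64, 16
groups_per_pack = num_groups // num_packs

random.seed(0)
weight = [[random.random() for _ in range(num_groups)] for _ in range(num_layers)]

pack_index = [[-1] * num_groups for _ in range(num_layers)]
rank_in_pack = [[-1] * num_groups for _ in range(num_layers)]

for i in range(num_layers):
    # Visit groups in descending weight order, like sort(-1, descending=True).
    order = sorted(range(num_groups), key=lambda g: weight[i][g], reverse=True)
    pack_weights = [0.0] * num_packs
    pack_items = [0] * num_packs
    for group in order:
        # Greedy step: lightest pack that still has room.
        pack = min((p for p in range(num_packs) if pack_items[p] < groups_per_pack),
                   key=pack_weights.__getitem__)
        pack_index[i][group] = pack
        rank_in_pack[i][group] = pack_items[pack]
        pack_weights[pack] += weight[i][group]
        pack_items[pack] += 1
```

Running a line profiler on this version removes the torch scalar-indexing cost from the picture, which helps separate "the algorithm is slow here" from "the tensor op on this line is slow".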
For Scalene, I ran the command `scalene run <python-file>`.
The result looks like the diagram below; the bottlenecks it reports are the `min()` call and `pack_items[pack] += 1`:
For py-spy, the result is a bit different:
Based on my understanding, `min()` being a bottleneck makes sense: the algorithm is greedy, so for every group it scans the whole candidate set of packs.
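To make the "scans the whole candidate set" point concrete, here is a small sketch (hypothetical sizes, pure Python) that counts how many pack candidates `min()` examines over one layer:

```python
# Count how many candidate packs min() examines over one layer's groups.
# Hypothetical sizes; min() breaks ties toward the lowest pack index.
num_packs, groups_per_pack = 16, 4
num_groups = num_packs * groups_per_pack  # one layer's worth of groups

pack_weights = [0.0] * num_packs
pack_items = [0] * num_packs
scanned = 0  # total candidates fed into min()

def candidates():
    global scanned
    for p in range(num_packs):
        if pack_items[p] < groups_per_pack:
            scanned += 1
            yield p

for _ in range(num_groups):
    pack = min(candidates(), key=pack_weights.__getitem__)
    pack_weights[pack] += 1.0  # uniform weights: packs fill round-robin
    pack_items[pack] += 1

print(scanned)
```

With uniform weights the scan only shortens in the final round, so the total is close to `num_groups * num_packs` candidate checks per layer, i.e. the per-group cost of `min()` is O(num_packs) rather than O(1).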
However, for the other bottleneck, if we look at the code, the py-spy result seems right: `pack_items[pack] += 1` is much more lightweight than the three preceding lines of tensor ops.
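One way to sanity-check that intuition is to micro-benchmark the two suspect statements in isolation. The sketch below uses plain Python ints and floats as stand-ins for the torch scalars (so the real tensor lines would be slower still); sizes and iteration counts are arbitrary:

```python
import timeit

# Hypothetical sizes; pure-Python stand-ins for the loop's state.
num_packs = 64
pack_items = [0] * num_packs
pack_weights = [0.0] * num_packs

# The line Scalene flags as a bottleneck: a single int increment.
t_inc = timeit.timeit("pack_items[3] += 1",
                      globals=globals(), number=100_000)

# The greedy min() scan over all candidate packs.
t_min = timeit.timeit(
    "min((i for i in range(num_packs) if pack_items[i] < 10**9),"
    " key=pack_weights.__getitem__)",
    globals=globals(), number=100_000)

print(f"increment: {t_inc:.4f}s  min-scan: {t_min:.4f}s")
```

If the increment is orders of magnitude cheaper here, a profiler attributing comparable time to it is likely mis-attributing cost from adjacent lines (sampling granularity and line-attribution behavior differ between Scalene and py-spy).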
I'd like to hear some explanations here if possible, thanks.