Conversation

@whitneywhtsang (Contributor) commented Oct 21, 2024:

Please do not squash and merge this PR.

@whitneywhtsang self-assigned this Oct 21, 2024
@whitneywhtsang changed the title from "Merge OpenAI Triton commit fa229d1" to "Merge OpenAI Triton commit f9688ab" Oct 21, 2024
Review thread on the following code:

    }
    if (auto dotLayout = dyn_cast<DotOperandEncodingAttr>(layout)) {
      auto rank = getWarpsPerCTA(dotLayout.getParent()).size();
      if (dyn_cast<intel::DpasEncodingAttr>(dotLayout.getParent())) {
Contributor:
@whitneywhtsang this is the only change needed to fix the lit tests. In the new code the swap occurs conditionally (if (opIdx == 1)), which apparently did not work for DPAS, so I restored the unconditional swap.
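
For context, here is a minimal self-contained sketch of the two behaviors under discussion (hypothetical helper names, not the actual Triton code; the rank == 2 comments show the concrete effect):

    #include <numeric>
    #include <utility>
    #include <vector>

    // New upstream behavior (sketch): start from the descending, row-major
    // order [rank-1, ..., 1, 0] and swap the two fastest-varying dimensions
    // only for the B operand (opIdx == 1), i.e. treat B as column-major.
    std::vector<unsigned> orderConditionalSwap(unsigned opIdx, unsigned rank) {
      std::vector<unsigned> order(rank);
      std::iota(order.rbegin(), order.rend(), 0); // rank == 2 -> [1, 0]
      if (opIdx == 1)
        std::swap(order[0], order[1]);            // rank == 2 -> [0, 1]
      return order;
    }

    // Restored behavior for DPAS (sketch): the descending order is used
    // unconditionally, so A and B share the same row-major register order.
    std::vector<unsigned> orderUnconditional(unsigned /*opIdx*/, unsigned rank) {
      std::vector<unsigned> order(rank);
      std::iota(order.rbegin(), order.rend(), 0); // rank == 2 -> [1, 0]
      return order;
    }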

@whitneywhtsang (Contributor Author):
Thanks for the quick update. I wonder if the logic can be added in Intel-specific files instead. Do you have any suggestions, @chengjunlu?

@chengjunlu (Contributor) commented Oct 22, 2024:

I think we can propose changes to make getOrderForDotOperand an interface of the MmaTraits.

I found that the AMD engineers refactored this code in PR ff02a46, but their comments about the order are not general and do not make sense for Intel GPUs. The order should be overridable by the parent layout of the DotOp layout.
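
As a rough illustration of that proposal, here is a hedged sketch in plain C++ (hypothetical type and method names; the real change would hang off MmaEncodingTrait as an MLIR attribute interface, not a virtual base class):

    #include <numeric>
    #include <utility>
    #include <vector>

    // Hypothetical stand-in for the MMA trait interface: each parent layout
    // decides the register order of its DotOperand children.
    struct MmaTraitsLike {
      virtual ~MmaTraitsLike() = default;
      // Default: descending (row-major) order for both operands.
      virtual std::vector<unsigned> getOrderForDotOperand(unsigned opIdx,
                                                          unsigned rank) const {
        std::vector<unsigned> order(rank);
        std::iota(order.rbegin(), order.rend(), 0);
        return order;
      }
    };

    // An AMD-style parent could override the hook to treat B as column-major,
    // while a DPAS-style parent would simply keep the row-major default.
    struct AmdMfmaLike : MmaTraitsLike {
      std::vector<unsigned> getOrderForDotOperand(unsigned opIdx,
                                                  unsigned rank) const override {
        auto order = MmaTraitsLike::getOrderForDotOperand(opIdx, rank);
        if (opIdx == 1 && rank >= 2)
          std::swap(order[0], order[1]);
        return order;
      }
    };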

@anmyachev (Contributor) commented Oct 23, 2024:

> I think we can propose changes to make getOrderForDotOperand an interface of the MmaTraits.

@chengjunlu with this approach, the BlockedEncodingAttr type needs to be handled separately, since it does not inherit the MmaTraits interface. Tests in CI are now failing because of this (should be fixed in the last commit).

What can be done about this?

Contributor:

There is no easy way to unify BlockedEncodingAttr and the MmaTraits unless we let it inherit the MmaTraits as well. But I am not sure whether it is worth doing so for now.

For the simplicity of the changes, we can handle it separately and check the feedback from public Triton.

@whitneywhtsang linked an issue ("Reland upstream commit f9688ab") Oct 22, 2024 that may be closed by this pull request
@whitneywhtsang changed the title from "Merge OpenAI Triton commit f9688ab" to "Reland upstream commit f9688ab" Oct 22, 2024
@whitneywhtsang (Contributor Author):

FYI @anmyachev, I squashed the last few commits, so it is easier to isolate the changes that need review.

@anmyachev marked this pull request as ready for review October 24, 2024 11:58
@etiotto requested a review from @victor-eds October 25, 2024 15:14
@victor-eds (Contributor) left a comment:

Changes look sensible to me. Just a question: why does our encoding have a different order compared to the rest? What would be the cost of modifying it so the order matches other dot encodings?

@victor-eds (Contributor) left a comment:

As I said, changes LGTM. I am just revoking approval until we get an assessment of costs: changing our order vs. upstreaming this change (we may just be asked to change our order instead).

@victor-eds self-requested a review October 28, 2024 13:06
@anmyachev force-pushed the whitneywhtsang/merge branch from 4d53f8e to 284b824 October 28, 2024 14:53
@anmyachev (Contributor) commented Oct 28, 2024:

> Just a question: why does our encoding have a different order compared to the rest? What would be the cost of modifying it so the order matches other dot encodings?

Good question. There is a chance that this was discussed earlier, during the initial implementation. Let's ask Triton developers more experienced than me :) @chengjunlu @whitneywhtsang do you have an answer to this question?

> I am just revoking approval until we get an assessment of costs: changing our order vs. upstreaming this change (we may just be asked to change our order instead).

If there is no answer to this question, then I can research the implementation history myself and look for the reasons why it was done this way and not differently (any pointers and links to code would speed up the process). However, this is not fast; wouldn't it be better to merge this pull request to simplify merging subsequent commits? @whitneywhtsang

P.S. After the rebase the tests don't pass; I'll have a look. (Fixed: my changes were lost after the rebase, and I have restored them.)

@anmyachev force-pushed the whitneywhtsang/merge branch 2 times, most recently from a9e6384 to b00713d October 28, 2024 18:46
@chengjunlu (Contributor) commented Oct 29, 2024:

> Changes look sensible to me. Just a question: why does our encoding have a different order compared to the rest? What would be the cost of modifying it so the order matches other dot encodings?

Actually, all the DotOp layouts had the same order before PR ff02a46. (The linear ID of the register dimension maps to coordinates in row-major order.)

The code before AMD's change:

    else if (auto dotLayout = dyn_cast<DotOperandEncodingAttr>(layout)) {
      auto rank = getWarpsPerCTA(dotLayout.getParent()).size();
      SmallVector<unsigned> order(rank);
      for (auto i = 0; i < rank; ++i)
        order[i] = rank - 1 - i;
      return order;
    }
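
For example, with rank == 2 this loop yields order = [1, 0] for both A and B, and with rank == 3 it yields [2, 1, 0]: the same descending, row-major order regardless of opIdx.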

I think the AMD engineers have a different interpretation, based on their comments in the new code:

    // The 'order' field typically represents a descending sorted array of
    // dimensions based on contiguity. For instance, in axisInfo utilities that
    // retrieve tensor contiguity, it's assumed that the dimension with the
    // highest contiguity corresponds to order[0].
    //
    // The relation between contiguity and order is only relevant if the layout
    // interfaces with HBM, as is the case when we load tensor from HBM to
    // registers in the dot layout to bypass LDS. When bypassing LDS, we make the
    // following assumptions about tensor layouts:
    // - Tensor A (opIdx == 0) is considered to be row-major.
    // - Tensor B (opIdx == 1) is considered to be column-major.
    //
    // Based on these assumptions, we define the following orders:
    // - For opIdx == 0, we assume an order of [1, 0].
    // - For opIdx == 1, we assume an order of [0, 1].

For Intel GPUs, the layout is only used to describe the layout of values in registers, and matrices A and B are both row-major in registers.
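
Concretely, for rank == 2: under the AMD assumptions the order is [1, 0] for A (opIdx == 0) and [0, 1] for B (opIdx == 1), whereas the Intel DPAS layout keeps the row-major [1, 0] for both operands.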

Review thread on the following diff context:

      return getOrderForDotOperand(dotLayout.getOpIdx(), rank);
    } else {
      std::iota(order.rbegin(), order.rend(), 0);
    if (auto mmaParent = dyn_cast<MmaEncodingTrait>(dotLayout.getParent())) {
Contributor:

I didn't do a careful review the first time. If the original code change was only made for AMD, then we can keep all the DotOp register layout orders unchanged.

Contributor:

On the other hand, it is good that a third-party extension can override getOrder for the DotOp layout via MmaEncodingTrait. I am neutral on this change.

@anmyachev force-pushed the whitneywhtsang/merge branch from b00713d to be7965c November 3, 2024 14:42
@anmyachev (Contributor) commented Nov 3, 2024:

I suggest speeding up the merge of this pull request, leaving only the workaround for now. We can try to upstream the interface function getOrderForDotOperand separately.

@victor-eds @whitneywhtsang @chengjunlu please approve if this makes sense.

@whitneywhtsang (Contributor Author):

> I suggest speeding up the merge of this pull request, leaving only the workaround for now. We can try to upstream the interface function getOrderForDotOperand separately.
>
> @victor-eds @whitneywhtsang @chengjunlu please approve if this makes sense.

I think it makes sense; let's remove the last two commits and add a FIXME comment in f9ccfeb.

@anmyachev force-pushed the whitneywhtsang/merge branch from a7ba67a to e33df88 November 3, 2024 17:55
@anmyachev (Contributor):

>> I suggest speeding up the merge of this pull request, leaving only the workaround for now. We can try to upstream the interface function getOrderForDotOperand separately.
>> @victor-eds @whitneywhtsang @chengjunlu please approve if this makes sense.
>
> I think it makes sense; let's remove the last two commits and add a FIXME comment in f9ccfeb.

Done.

Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev force-pushed the whitneywhtsang/merge branch from e33df88 to c637c07 November 3, 2024 18:13
@whitneywhtsang merged commit c637c07 into main Nov 4, 2024
4 checks passed
@whitneywhtsang deleted the whitneywhtsang/merge branch November 4, 2024 00:13
@victor-eds (Contributor):

I'm fine with this merge 👍
