Skip to content

Conversation

@geyyer
Copy link
Contributor

@geyyer geyyer commented Feb 13, 2025

Proposed changes

This PR adds MX FP4 tests and fixes MX FP4 functionality.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

This PR updates FP4 elements layout in a vector of 2 to comply with other vector types, see discussion below.

@geyyer
Copy link
Contributor Author

geyyer commented Feb 21, 2025

Patch was merged llvm/llvm-project#127464, so repro is not needed anymore.

@geyyer geyyer marked this pull request as ready for review February 21, 2025 21:00
return __builtin_amdgcn_cvt_scalef32_pk_f32_fp4(value.bitwise, type_convert<float>(scale), 0);
float2_t tmp =
__builtin_amdgcn_cvt_scalef32_pk_f32_fp4(value.bitwise, type_convert<float>(scale), 0);
// permute high bits and low bits to match the order of the original vector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these changes because we modified packing order in the fp4 storage?

    __host__ __device__ inline type pack(const type x0, const type x1)
    {
        return (x0 << 4) | (x1 & 0b00001111);
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to provide more context for both comments here. In the CK we have several ways of packing elements into vectors: llvm clang vectors, our custom non_native_vector_base and custom types which we pack manually. In llvm clang vectors 0th element is stored in the highest bits and Nth element is in the lowest bits. Same layout is used in the non_native_vector_base, which makes sense as we use llvm clang vector under the hood. So I decided to update the f4x2_pk_t type to have a consistent layout with other vectors. I believe the issue with native conversion instructions is that they swap high and low bits, so we have to swap either input or output vector elements. I believe keeping old f4x2_pk_t layout would help with this issue, but have to be well documented and considered in the tests. @andriy-ca what is your perspective on it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having elements in bytes aligned consistently with the other data types makes sense.

__host__ __device__ inline type pack(const type x0, const type x1)
{
return (x1 << 4) | (x0 & 0b00001111);
return (x0 << 4) | (x1 & 0b00001111);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the original order incorrect?

andriy-ca
andriy-ca previously approved these changes Feb 24, 2025
Copy link
Contributor

@andriy-ca andriy-ca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@andriy-ca andriy-ca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@geyyer geyyer merged commit 441343a into develop Mar 27, 2025
33 of 39 checks passed
@geyyer geyyer deleted the lwpck-2836 branch May 12, 2025 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants