@amritahs-ibm
Owner

PPC MMA implementation for llamafile_sgemm API

Signed-off-by: Amrita H S <[email protected]>
__builtin_mma_disassemble_acc(vec_C, ACC); \
for (int I = 0; I < 4; I++) { \
for (int J = 0; J < 4; J++) { \
*((float*)(C+ii+((jj+J)*ldc)+I)) = *((float*)&vec_C[I]+J); \

It's probably better to do a 4 vector transpose here or invert the MMA inputs. That way you can write vectors at a time instead of scalar elements.
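
To illustrate the suggestion, here is a minimal portable sketch of the transpose variant. It is not the PR's code: `vec4f`, `store_scalar`, and `store_transposed` are hypothetical names, and the `memcpy` of four floats stands in for a single vector store. On POWER10 the transpose would presumably be done with vector merge/permute intrinsics rather than a scalar loop.

```c
#include <string.h>

/* Hypothetical stand-in for one 4-float vector register. */
typedef struct { float v[4]; } vec4f;

/* Element-by-element store, same indexing as the quoted macro:
   acc[I].v[J] lands at row ii+I of column jj+J in column-major C. */
static void store_scalar(float *C, int ii, int jj, int ldc,
                         const vec4f acc[4]) {
    for (int I = 0; I < 4; I++)
        for (int J = 0; J < 4; J++)
            C[ii + (jj + J) * ldc + I] = acc[I].v[J];
}

/* Transpose the 4x4 tile first, then write one whole column
   (four contiguous floats) per store. */
static void store_transposed(float *C, int ii, int jj, int ldc,
                             const vec4f acc[4]) {
    vec4f t[4];
    for (int I = 0; I < 4; I++)
        for (int J = 0; J < 4; J++)
            t[J].v[I] = acc[I].v[J];
    for (int J = 0; J < 4; J++)
        memcpy(C + ii + (jj + J) * ldc, t[J].v, sizeof t[J].v);
}
```

Both functions leave the same values in C; the second issues four contiguous 16-byte writes instead of sixteen scalar writes.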

aoffset1 += 8*lda;
aoffset2 += 8*lda;
aoffset3 += 8*lda;
aoffset4 += 8*lda;

@ChipKerchner ChipKerchner Oct 29, 2024

Why are aoffset5 through aoffset8 not updated here? Could this be the reason it only works for multiples of 8?
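
One way to rule this kind of omission out, sketched under the assumption that all eight row pointers should advance by the same stride: keep them in an array and update them in a loop. `advance_rows` is a hypothetical helper, not something in the PR.

```c
#include <stddef.h>

/* Hypothetical helper: advance every row pointer by the same stride,
   so no pointer can be skipped by accident. */
static void advance_rows(const float *rows[], size_t nrows, size_t stride) {
    for (size_t r = 0; r < nrows; r++)
        rows[r] += stride;
}
```

With the pointers stored as `rows[0]` through `rows[7]`, the four quoted updates would become a single call such as `advance_rows(rows, 8, 8 * lda)`.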

string(FIND ${POWER10_M} "POWER10" substring_index)
if(${substring_index} GREATER_EQUAL 0)
list(APPEND ARCH_FLAGS -mcpu=power10)
elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "ppc64le")

Is it possible for CMAKE_SYSTEM_PROCESSOR to match both ppc64 and ppc64le?

vector float t1, t2, t3, t4;
c1 = vec_xl(0, aoffset1);
c2 = vec_xl(0, aoffset2);
c3 = vec_xl(0, aoffset3);

Where is c4 loaded here?


3 participants