PPC MMA implementation #1
Signed-off-by: Amrita H S <[email protected]>
    __builtin_mma_disassemble_acc(vec_C, ACC); \
    for (int I = 0; I < 4; I++) { \
      for (int J = 0; J < 4; J++) { \
        *((float*)(C+ii+((jj+J)*ldc)+I)) = *((float*)&vec_C[I]+J); \
It's probably better to do a 4 vector transpose here or invert the MMA inputs. That way you can write vectors at a time instead of scalar elements.
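To illustrate the suggestion, here is a minimal portable sketch, not the PR's code. It assumes the indexing of the quoted macro (element `(I, J)` of the tile is written to `C[ii + (jj+J)*ldc + I]`) and uses a hypothetical `vec4f` struct in place of `vector float`. On POWER the contiguous copy would presumably be a single vector store such as `vec_xst`. The helper names `transpose4x4`, `store_tile_scalar`, and `store_tile_vector` are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a 4-lane float vector (assumption for this sketch). */
typedef struct { float lane[4]; } vec4f;

/* Transpose a 4x4 tile held as four row vectors. */
static void transpose4x4(const vec4f in[4], vec4f out[4]) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            out[j].lane[i] = in[i].lane[j];
}

/* Scalar store with the same indexing as the quoted macro: 16 single-float writes. */
static void store_tile_scalar(float *C, size_t ldc, size_t ii, size_t jj,
                              const vec4f tile[4]) {
    assert(C != NULL);
    for (int I = 0; I < 4; I++)
        for (int J = 0; J < 4; J++)
            C[ii + (jj + J) * ldc + I] = tile[I].lane[J];
}

/* After the transpose, vector J holds four consecutive elements of one
   column of C, so each column needs only one contiguous 4-float write. */
static void store_tile_vector(float *C, size_t ldc, size_t ii, size_t jj,
                              const vec4f tile[4]) {
    assert(C != NULL);
    vec4f t[4];
    transpose4x4(tile, t);
    for (int J = 0; J < 4; J++)
        memcpy(&C[ii + (jj + J) * ldc], t[J].lane, sizeof t[J].lane);
}
```

Both store functions write the same 16 values; the second does it in four contiguous writes instead of sixteen scalar ones. Swapping the MMA inputs, the other option mentioned in the comment, would aim to produce the accumulator already in this layout and avoid the transpose.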
    aoffset1 += 8*lda;
    aoffset2 += 8*lda;
    aoffset3 += 8*lda;
    aoffset4 += 8*lda;
Why are aoffset5 through aoffset8 not updated here? Could this be the reason it only works for multiples of 8?
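A small sketch of the concern, under stated assumptions: the hypothetical `pack_rows8` below is not the PR's packing routine. It walks a matrix eight rows at a time and keeps the eight row offsets in an array, so that one loop advances all of them and none can be left pointing at the previous block. Offsets are indices here rather than pointers, and the row count is assumed to be a multiple of 8 to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packer (assumption for this sketch): copies `rows` rows of
   `cols` floats from a matrix with row stride `lda` into a dense buffer. */
static void pack_rows8(const float *a, size_t lda, size_t rows, size_t cols,
                       float *out) {
    assert(rows % 8 == 0);   /* simplification: no remainder handling */
    assert(cols <= lda);

    size_t aoffset[8];       /* start index of each of the 8 active rows */
    for (int r = 0; r < 8; r++)
        aoffset[r] = (size_t)r * lda;

    for (size_t blk = 0; blk < rows / 8; blk++) {
        for (int r = 0; r < 8; r++)
            for (size_t c = 0; c < cols; c++)
                *out++ = a[aoffset[r] + c];

        /* Advance all eight offsets. If only the first four moved, rows 4..7
           of every later block would be re-read from the first block. */
        for (int r = 0; r < 8; r++)
            aoffset[r] += 8 * lda;
    }
}
```

If the real routine recomputes `aoffset5` through `aoffset8` from the first four at the top of each iteration, the separate updates would not be needed; the quoted hunk alone does not show which is the case.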
    string(FIND ${POWER10_M} "POWER10" substring_index)
    if(${substring_index} GREATER_EQUAL 0)
      list(APPEND ARCH_FLAGS -mcpu=power10)
    elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "ppc64le")
Is it possible for CMAKE_SYSTEM_PROCESSOR to match both ppc64 and ppc64le?
    vector float t1, t2, t3, t4;
    c1 = vec_xl(0, aoffset1);
    c2 = vec_xl(0, aoffset2);
    c3 = vec_xl(0, aoffset3);
Where is c4 loaded here?
PPC MMA implementation for llamafile_sgemm API