Improving 4-bit quant matmul performance by shifting the position of the -8 operation. #15436
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15436. Note: links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure as of commit f0c9c4d with merge base 3485495.

NEW FAILURES: the following jobs have failed.

FLAKY: the following job failed but was likely due to flakiness present on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@trivedivivek has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85721578.
Force-pushed 45154bf to 65157ab (Compare)
Force-pushed 1d800aa to a54693c (Compare)
Force-pushed a54693c to deca607 (Compare)
Force-pushed deca607 to f0c9c4d (Compare)
Improving 4-bit quant matmul performance by shifting the position of the -8 operation. (pytorch#15436)

Summary: This diff improves the performance of the 4-bit quantized matrix multiplication operation by adjusting the position of the -8 operation, resulting in an overall reduction in the math operations performed during shader runtime.

The thinking here is as follows:

* The 4-bit integer weights are unsigned, ranging from 0 to 15, so 8 is subtracted from each weight to recover its signed value.
* Assume WS[] is the array of unsigned 4-bit weights, M[] is the matrix, and S is the sum. The main loop essentially performs: S += (WS[i] - 8) * M[i], for i in [0, N).
* This equation can be rewritten as: S += WS[i] * M[i] - 8 * M[i], for i in [0, N).
* The 8 * M[i] term need not be computed in the main loop: accumulating A += M[i], for i in [0, N), and then multiplying A by 8 yields the same contribution, which can be subtracted once at the end.
* Splitting the equation this way significantly reduces the math ops in the main loop while producing the same result.

Reviewed By: SS-JIA

Differential Revision: D85721578
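To make the rewrite concrete, here is a minimal scalar sketch in C++. The actual kernel is a Vulkan compute shader, and the function and parameter names below are hypothetical, not taken from the ExecuTorch sources. The first function applies the -8 offset per element, as in the original loop; the second hoists the offset into a single correction after the loop.

```cpp
#include <cstddef>
#include <cstdint>

// Baseline form: apply the -8 offset to every unsigned 4-bit weight inside the loop.
//   S += (WS[i] - 8) * M[i], for i in [0, N)
float dot_offset_in_loop(const std::uint8_t* ws, const float* m, std::size_t n) {
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    s += static_cast<float>(static_cast<int>(ws[i]) - 8) * m[i];
  }
  return s;
}

// Rewritten form: accumulate WS[i] * M[i] and a running sum A of M[i],
// then apply the offset once at the end:
//   S = sum(WS[i] * M[i]) - 8 * sum(M[i])
float dot_offset_hoisted(const std::uint8_t* ws, const float* m, std::size_t n) {
  float s = 0.0f;  // sum of WS[i] * M[i]
  float a = 0.0f;  // sum of M[i]
  for (std::size_t i = 0; i < n; ++i) {
    s += static_cast<float>(ws[i]) * m[i];
    a += m[i];
  }
  return s - 8.0f * a;  // one correction replaces N per-element subtractions
}
```

Both forms compute the same result. The hoisted form removes the per-element subtraction from the hot loop and applies the correction once per accumulated sum; since the running sum of M[i] depends only on the input values, it can presumably also be shared across the weight columns that consume the same inputs, which is where the overall reduction in runtime math operations comes from.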
Summary: This diff improves the performance of the 4-bit quantized matrix multiplication operation by adjusting the position of the -8 operation, resulting in an overall reduction in the math operations performed during shader runtime.
Differential Revision: D85721578