Pull in 16x4 kernel for QP8 #5827

mcr229 · 2024-10-02T21:13:34Z

We are pulling in and testing the new 16x4 Kleidi kernels. We see significant performance improvements from them.

pytorch-bot · 2024-10-02T21:13:37Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5827

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 48fef5e with merge base fbcd332:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mcr229 · 2024-10-03T18:07:24Z

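We see a significant speedup in prefill performance for Llama models. Prompt evaluation goes from about 292 to about 358 tokens/second (roughly 22% faster), and decode also improves, from about 46 to about 55 tokens/second: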

Before:

I 00:00:05.587790 executorch:stats.h:84] 	Prompt Tokens: 64    Generated Tokens: 63
I 00:00:05.587793 executorch:stats.h:90] 	Model Load Time:		3.999000 (seconds)
I 00:00:05.587796 executorch:stats.h:100] 	Total inference time:		1.579000 (seconds)		 Rate: 	39.898670 (tokens/second)
I 00:00:05.587806 executorch:stats.h:108] 		Prompt evaluation:	0.219000 (seconds)		 Rate: 	292.237443 (tokens/second)
I 00:00:05.587809 executorch:stats.h:119] 		Generated 63 tokens:	1.360000 (seconds)		 Rate: 	46.323529 (tokens/second)
I 00:00:05.587812 executorch:stats.h:127] 	Time to first generated token:	0.219000 (seconds)
I 00:00:05.587816 executorch:stats.h:134] 	Sampling time over 127 tokens:	0.014000 (seconds)

After:

I 00:00:05.917623 executorch:stats.h:97] 	Prompt Tokens: 64    Generated Tokens: 63
I 00:00:05.917626 executorch:stats.h:103] 	Model Load Time:		0.000000 (seconds)
I 00:00:05.917628 executorch:stats.h:113] 	Total inference time:		1.326000 (seconds)		 Rate: 	47.511312 (tokens/second)
I 00:00:05.917632 executorch:stats.h:121] 		Prompt evaluation:	0.179000 (seconds)		 Rate: 	357.541899 (tokens/second)
I 00:00:05.917635 executorch:stats.h:132] 		Generated 63 tokens:	1.147000 (seconds)		 Rate: 	54.925894 (tokens/second)
I 00:00:05.917639 executorch:stats.h:140] 	Time to first generated token:	0.179000 (seconds)
I 00:00:05.917641 executorch:stats.h:147] 	Sampling time over 127 tokens:	0.009000 (seconds)
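
For reference, the relative speedups can be worked out from the timings in the two logs above. This is a minimal sketch with the values copied by hand from those logs; the dictionary keys and the `speedup` helper are illustrative labels, not ExecuTorch names.

```python
# Timings in seconds, copied by hand from the "Before" and "After" logs.
# The key names are illustrative labels, not ExecuTorch identifiers.
BEFORE = {"prompt_eval": 0.219, "generate_63_tokens": 1.360, "total_inference": 1.579}
AFTER = {"prompt_eval": 0.179, "generate_63_tokens": 1.147, "total_inference": 1.326}


def speedup(before_s: float, after_s: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before_s / after_s


for stage in BEFORE:
    print(f"{stage}: {speedup(BEFORE[stage], AFTER[stage]):.2f}x")
```

This works out to about 1.22x for prompt evaluation and about 1.19x for both token generation and total inference time. These are single-run numbers, so they should be read as indicative rather than as a rigorous benchmark.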

Pull in 16x4 kernel

a628b96

facebook-github-bot added the CLA Signed label Oct 2, 2024

mcr229 requested a review from digantdesai October 2, 2024 21:17

Update to semi-final xnnpack branch

48fef5e

mcr229 closed this Jul 25, 2025

mcr229 deleted the qp8++ branch July 25, 2025 22:43
