@vkuzo vkuzo commented Oct 17, 2025

Summary:

Convenient for analyzing differences between roofline and observed performance.
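
As a rough illustration of what "roofline" means here, the sketch below estimates a lower bound on gemm time from peak compute and peak memory bandwidth, then compares it with an observed time. The `Hardware` numbers and all names are hypothetical and chosen for illustration; they are not taken from `float8_inference_roofline.py`, and the traffic estimate ignores scale tensors for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    # Hypothetical peak numbers; substitute the values for your GPU.
    peak_flops_per_s: float
    peak_bytes_per_s: float


def gemm_roofline_seconds(m, k, n, in_bytes, out_bytes, hw):
    """Lower bound on the time of an (m x k) @ (k x n) gemm."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    # Read both inputs once and write the output once (scales ignored).
    traffic = (m * k + k * n) * in_bytes + m * n * out_bytes
    compute_s = flops / hw.peak_flops_per_s
    memory_s = traffic / hw.peak_bytes_per_s
    return max(compute_s, memory_s)


def roofline_gap(observed_s, roofline_s):
    """How many times slower the observed run is than the bound."""
    return observed_s / roofline_s


hw = Hardware(peak_flops_per_s=1e15, peak_bytes_per_s=1e12)
bound = gemm_roofline_seconds(1024, 1024, 1024, in_bytes=1, out_bytes=2, hw=hw)
print(f"roofline bound: {bound * 1e6:.3f} us")
print(f"gap at 10 us observed: {roofline_gap(10e-6, bound):.2f}x")
```

A gap well above 1x points at overhead the bound does not model, such as extra kernels around the gemm.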

tl;dr of findings:

mxfp8

  1. need to pre-swizzle weights
  2. torch.compile gives us two kernels; we will repurpose the manual
    training kernel for this, which will need pre-swizzling added. Longer
    term, we can check whether the fbgemm_gpu kernel is faster.

mxfp4

  1. need to pre-swizzle weights
  2. need a faster gemm (can use fbgemm_gpu)
  3. need a fused activation quant kernel (can use fbgemm_gpu)
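
The "pre-swizzle weights" items in the mxfp8 and mxfp4 lists rest on the fact that inference weights do not change between calls, so a layout transformation can be applied once at load time rather than on every forward pass. The sketch below shows only that caching pattern. `toy_swizzle` is a stand-in that reorders a flat list into tile-major order and is not the layout any real kernel expects; `PreSwizzledWeight` is likewise a hypothetical name.

```python
def toy_swizzle(values, rows, cols, tile):
    """Placeholder layout change: row-major -> tile-major order.

    Not a real hardware layout; it only stands in for "some reordering
    the gemm kernel wants".
    """
    assert rows % tile == 0 and cols % tile == 0
    out = []
    for tile_row in range(0, rows, tile):
        for tile_col in range(0, cols, tile):
            for r in range(tile_row, tile_row + tile):
                for c in range(tile_col, tile_col + tile):
                    out.append(values[r * cols + c])
    return out


class PreSwizzledWeight:
    """Applies the layout change once, at construction time."""

    def __init__(self, values, rows, cols, tile):
        self.data = toy_swizzle(values, rows, cols, tile)

    def forward_input(self):
        # Every call reuses the cached layout; there is no per-call swizzle.
        return self.data


weight = PreSwizzledWeight(list(range(16)), rows=4, cols=4, tile=2)
for _ in range(3):
    layout = weight.forward_input()
print(layout)
```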

nvfp4

  1. need to speed up the existing Triton activation quant kernel; it
    currently doesn't autotune anything, so there are probably some easy
    wins here. Longer term, can also benchmark against fbgemm_gpu.
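
The findings above fall into two buckets: gemm speed and activation quantization overhead. One simple way to see why both matter is to model the low-precision linear as quantization time plus gemm time and compare that against a bf16 gemm. The sketch below does that arithmetic; the timings are made-up placeholders rather than measurements from this PR, and `lowp_speedup` is a hypothetical name.

```python
def lowp_speedup(bf16_gemm_s, lowp_gemm_s, act_quant_s):
    """Speedup of (activation quant + low-precision gemm) over a bf16 gemm."""
    return bf16_gemm_s / (act_quant_s + lowp_gemm_s)


# Placeholder timings in seconds, for illustration only.
bf16_gemm_s = 100e-6
lowp_gemm_s = 50e-6  # the gemm alone is 2x faster than bf16

for act_quant_s in (0.0, 25e-6, 50e-6):
    speedup = lowp_speedup(bf16_gemm_s, lowp_gemm_s, act_quant_s)
    print(f"act quant {act_quant_s * 1e6:5.1f} us -> speedup {speedup:.2f}x")
```

With these placeholder numbers, a quantization kernel that takes as long as the low-precision gemm cancels the gemm's 2x advantage entirely.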

Test Plan:

CUDA_VISIBLE_DEVICES=5 python benchmarks/float8/float8_inference_roofline.py ~/local/tmp/20251016_inference_nvfp4.csv --recipe_name nvfp4 --save_profile_traces True


vkuzo added 4 commits October 16, 2025 07:41
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]

vkuzo added a commit that referenced this pull request Oct 17, 2025
Summary:

Convenient for analyzing differences between roofline and observed performance.

tl;dr of findings:

mxfp8
1. need to pre-swizzle weights
2. torch.compile gives us two kernels; we will repurpose the manual
   training kernel for this, which will need pre-swizzling added. Longer
   term, we can check whether the fbgemm_gpu kernel is faster.

mxfp4
1. need a faster gemm (can use fbgemm_gpu)
2. need a fused activation quant kernel (can use fbgemm_gpu)

nvfp4
1. need to speed up the existing Triton activation quant kernel; it
   currently doesn't autotune anything, so there are probably some easy
   wins here. Longer term, can also benchmark against fbgemm_gpu.

Test Plan:

```bash
CUDA_VISIBLE_DEVICES=5 python benchmarks/float8/float8_inference_roofline.py ~/local/tmp/20251016_inference_nvfp4.csv --recipe_name nvfp4 --save_profile_traces True
```

ghstack-source-id: a942de7
ghstack-comment-id: 3413384438
Pull-Request: #3196

pytorch-bot bot commented Oct 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3196

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 438b35e with merge base d1a7fbc:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the "CLA Signed" label Oct 17, 2025
@vkuzo vkuzo added the "topic: not user facing" label Oct 17, 2025
vkuzo added 3 commits October 17, 2025 05:01
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
@vkuzo vkuzo changed the base branch from gh/vkuzo/148/head to main October 17, 2025 15:06
@vkuzo vkuzo merged commit b50e37a into main Oct 17, 2025
47 of 50 checks passed