feat: Add Windows support with triton-windows and PyTorch fallback#237
Open
tori29umai0123 wants to merge 2 commits into Tencent:main from
Conversation
When `native_fp8_support=False` and `quant_type="fp8-per-block"`, 3D tensors (e.g., `[batch, seq_len, hidden]` from attention layers such as `txt_attn_q/k/v`) were passed directly to `fp8_per_block_quant`, which expects 2D input.

Changes:
- Reshape 3D input to 2D before calling `fp8_per_block_quant`
- Store the original shape and restore the output to 3D after the GEMM operation

This fixes assertion errors when using FP8 quantization with transformer attention layers that produce 3D tensors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
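A minimal sketch of the reshape/restore pattern this commit describes. The names `fp8_per_block_quant` and the FP8 GEMM come from the PR, but their exact signatures are assumptions here; only the flatten-before-quant and reshape-after-GEMM steps are the point:

```python
def fp8_linear_3d(x, w_fp8, w_scale, fp8_per_block_quant, fp8_gemm):
    # fp8_per_block_quant expects 2D input, so flatten
    # [batch, seq_len, hidden] to [batch * seq_len, hidden] first.
    orig_shape = x.shape
    if x.dim() == 3:
        x = x.reshape(-1, orig_shape[-1])
    x_fp8, x_scale = fp8_per_block_quant(x)         # signature assumed
    out = fp8_gemm(x_fp8, x_scale, w_fp8, w_scale)  # signature assumed
    # Restore the original leading dimensions after the GEMM.
    if len(orig_shape) == 3:
        out = out.reshape(orig_shape[0], orig_shape[1], -1)
    return out
```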
yghstill (Collaborator) reviewed on Feb 2, 2026
> For more detailed installation instructions, please refer to the [Installation Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/installation.html).
>
> #### Windows Installation (with FP8 Triton Support)
Please mv to docs/source/getting_started/installation.md
StromNoNo approved these changes on Feb 2, 2026
Summary

Add Windows support for AngelSlim with FP8 Triton kernels using triton-windows.

Changes

New Files
- `angelslim/compressor/_platform.py` - Platform detection and automatic backend selection
- `angelslim/compressor/quant/core/quant_func_torch.py` - PyTorch fallback for weight quantization
- `angelslim/compressor/diffusion/kernels/python/quantizers/fp8_per_block_torch.py` - PyTorch fallback for FP8 per-block quantization (see the sketch after this list)
- `angelslim/compressor/diffusion/kernels/python/quantizers/fp8_per_token_group_torch.py` - PyTorch fallback for FP8 per-token-group quantization
- `angelslim/compressor/diffusion/kernels/python/gemm/fp8_gemm_torch.py` - PyTorch fallback for FP8 GEMM
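For illustration, a minimal pure-PyTorch per-block FP8 quantizer can look like the sketch below. This is not the PR's actual `fp8_per_block_torch.py`; the block size, square-block layout, and padding behavior are all assumptions:

```python
import torch
import torch.nn.functional as F

def fp8_per_block_quant_torch(x, block_size=128):
    # Quantize a 2D tensor to float8_e4m3fn with one scale per
    # (block_size x block_size) block. Pure PyTorch, no Triton.
    assert x.dim() == 2, "expects 2D input"
    m, n = x.shape
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    # Pad so both dims divide evenly into blocks.
    pad_m = (block_size - m % block_size) % block_size
    pad_n = (block_size - n % block_size) % block_size
    xp = F.pad(x, (0, pad_n, 0, pad_m))
    mb, nb = xp.shape[0] // block_size, xp.shape[1] // block_size
    blocks = xp.reshape(mb, block_size, nb, block_size)
    # One scale per block, mapping the block's absmax onto the FP8 max.
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / fp8_max
    q = (blocks / scale).to(torch.float8_e4m3fn)
    q = q.reshape(xp.shape)[:m, :n].contiguous()
    return q, scale.reshape(mb, nb)
```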
Modified Files

- `requirements/requirements.txt` - Platform-specific triton dependency (triton for Linux, triton-windows for Windows)
- `setup.py` - Version string includes the CUDA and PyTorch versions (e.g., `0.0.0_dev+cu128.torch2.10`)
- `angelslim/compressor/diffusion/cache/taylorcache_helper.py` - Conditional `torch.compile` (disabled on Windows; see the sketch after this list)
- `angelslim/compressor/quant/core/quant_func.py` - Lazy Triton import with automatic backend selection
- `angelslim/compressor/diffusion/quant/quant_func.py` - Remove the CUDA requirement for per-token-group quantization
- `angelslim/compressor/diffusion/kernels/python/quantizers/__init__.py` - Conditional imports based on backend
- `angelslim/compressor/diffusion/kernels/python/gemm/__init__.py` - Conditional imports based on backend
- `README.md` - Windows build instructions
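A sketch of the kind of guard the `taylorcache_helper.py` change implies. The helper name and exact condition are assumptions; the point is simply that `torch.compile` is skipped on Windows, where the Inductor/Triton toolchain is not reliably available:

```python
import sys

import torch

def maybe_compile(fn):
    # torch.compile goes through Inductor, which relies on Triton;
    # that path is not dependable on Windows, so stay in eager mode.
    if sys.platform == "win32":
        return fn
    return torch.compile(fn)
```

Call sites would then wrap once, e.g. `forward = maybe_compile(forward)`, and behave identically on both platforms apart from compilation.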
Features

- Automatic backend selection: Triton kernels when Triton is available, PyTorch fallbacks if not (see the sketch after this list)
- `ANGELSLIM_BACKEND=pytorch|triton` - Force backend selection
- `ANGELSLIM_TORCH_COMPILE=0|1` - Enable/disable `torch.compile`
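A minimal sketch of how these controls could resolve to a backend, assuming the override-then-detect order described above (the function name and resolution logic are illustrative, not the PR's actual `_platform.py`):

```python
import os

def select_backend():
    # An explicit override takes priority over detection.
    forced = os.environ.get("ANGELSLIM_BACKEND")
    if forced in ("pytorch", "triton"):
        return forced
    # Otherwise prefer Triton when it imports (triton-windows also
    # installs under the `triton` module name).
    try:
        import triton  # noqa: F401
        return "triton"
    except ImportError:
        return "pytorch"
```

For example, `ANGELSLIM_BACKEND=pytorch python your_script.py` would force the PyTorch fallback path even when Triton is installed.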
Test