
feat: Add Windows support with triton-windows and PyTorch fallback #237

Open

tori29umai0123 wants to merge 2 commits into Tencent:main from tori29umai0123:feature/windows-support

Conversation

@tori29umai0123

Summary

Add Windows support for AngelSlim with FP8 Triton kernels using triton-windows.

Changes

New Files

  • angelslim/compressor/_platform.py - Platform detection and automatic backend selection
  • angelslim/compressor/quant/core/quant_func_torch.py - PyTorch fallback for weight quantization
  • angelslim/compressor/diffusion/kernels/python/quantizers/fp8_per_block_torch.py - PyTorch fallback for FP8 per-block quantization (see the sketch after this list)
  • angelslim/compressor/diffusion/kernels/python/quantizers/fp8_per_token_group_torch.py - PyTorch fallback for FP8 per-token-group quantization
  • angelslim/compressor/diffusion/kernels/python/gemm/fp8_gemm_torch.py - PyTorch fallback for FP8 GEMM
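
The per-block fallback can be pictured as follows. This is a minimal sketch of the idea only; the function name, block size, scale layout, and return convention are assumptions, not the actual API of fp8_per_block_torch.py.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def fp8_per_block_quant_torch(x: torch.Tensor, block_size: int = 128):
    """Hypothetical pure-PyTorch per-block FP8 quantizer (no Triton required)."""
    assert x.dim() == 2, "expects a 2D tensor"
    m, n = x.shape

    # Pad both dimensions up to a multiple of block_size.
    pad_m = (block_size - m % block_size) % block_size
    pad_n = (block_size - n % block_size) % block_size
    x_pad = torch.nn.functional.pad(x, (0, pad_n, 0, pad_m))

    # View as (row_blocks, col_blocks, block_size, block_size).
    mb, nb = x_pad.shape[0] // block_size, x_pad.shape[1] // block_size
    blocks = x_pad.view(mb, block_size, nb, block_size).permute(0, 2, 1, 3)

    # One scale per block: amax / FP8 max, guarded against all-zero blocks.
    scales = blocks.abs().amax(dim=(-2, -1)).clamp(min=1e-12) / FP8_E4M3_MAX

    # Quantize, cast to FP8, and restore the original (unpadded) layout.
    q = (blocks / scales[..., None, None]).to(torch.float8_e4m3fn)
    q = q.permute(0, 2, 1, 3).reshape(x_pad.shape)[:m, :n]
    return q, scales
```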

Modified Files

  • requirements/requirements.txt - Platform-specific triton dependency (triton for Linux, triton-windows for Windows)
  • setup.py - Version string includes CUDA and PyTorch version (e.g., 0.0.0_dev+cu128.torch2.10)
  • angelslim/compressor/diffusion/cache/taylorcache_helper.py - Conditional torch.compile (disabled on Windows; see the sketch after this list)
  • angelslim/compressor/quant/core/quant_func.py - Lazy Triton import with automatic backend selection
  • angelslim/compressor/diffusion/quant/quant_func.py - Remove CUDA requirement for per-token-group quantization
  • angelslim/compressor/diffusion/kernels/python/quantizers/__init__.py - Conditional imports based on backend
  • angelslim/compressor/diffusion/kernels/python/gemm/__init__.py - Conditional imports based on backend
  • README.md - Windows build instructions
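
The conditional torch.compile change can be sketched as below; the helper name and the exact defaults are illustrative assumptions, not the code in taylorcache_helper.py.

```python
import os
import platform

import torch


def maybe_compile(fn):
    """Apply torch.compile only where it is expected to work.

    Hypothetical helper: ANGELSLIM_TORCH_COMPILE=0|1 overrides the default;
    otherwise compilation is skipped on Windows, where the Inductor backend
    may not have a working Triton toolchain.
    """
    flag = os.environ.get("ANGELSLIM_TORCH_COMPILE")
    if flag is not None:
        use_compile = flag == "1"
    else:
        use_compile = platform.system() != "Windows"
    return torch.compile(fn) if use_compile else fn


@maybe_compile
def taylor_step(x: torch.Tensor) -> torch.Tensor:
    # Placeholder computation standing in for the cached Taylor update.
    return x + 0.5 * x * x
```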

Features

  • Automatic Backend Selection: Automatically detects whether Triton is available and functional; falls back to PyTorch if not (see the sketch after this list)
  • Environment Variables:
    • ANGELSLIM_BACKEND=pytorch|triton - Force backend selection
    • ANGELSLIM_TORCH_COMPILE=0|1 - Enable/disable torch.compile
  • Cross-Platform: Works on both Linux (with triton) and Windows (with triton-windows or PyTorch fallback)
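
A sketch of how the automatic selection might be wired up; select_backend and its probing strategy are assumptions, not necessarily what angelslim/compressor/_platform.py exposes.

```python
import os


def select_backend() -> str:
    """Return "triton" or "pytorch" (hypothetical helper).

    An explicit ANGELSLIM_BACKEND setting wins; otherwise Triton is used
    only if it actually imports (triton-windows also installs itself as
    the importable `triton` module).
    """
    forced = os.environ.get("ANGELSLIM_BACKEND", "").lower()
    if forced in ("triton", "pytorch"):
        return forced
    try:
        import triton  # noqa: F401
        import triton.language as tl  # noqa: F401
        return "triton"
    except Exception:
        return "pytorch"


BACKEND = select_backend()  # e.g. consulted by the conditional __init__ imports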

Test

python -c "import torch; from angelslim.compressor.diffusion.kernels.python.quantizers import
fp8_per_block_quant_triton; from angelslim.compressor.diffusion.kernels.python.gemm import fp8_gemm_triton_block;
a,b=torch.randn(128,256,device='cuda'),torch.randn(512,256,device='cuda'); aq,a_s=fp8_per_block_quant_triton(a);
bq,b_s=fp8_per_block_quant_triton(b); c=fp8_gemm_triton_block(aq,a_s,bq,b_s); print(f'FP8 GEMM OK: {c.shape},
{c.dtype}')"

Expected output:
FP8 GEMM OK: torch.Size([128, 512]), torch.bfloat16

Tested Environment

- Windows 11 + NVIDIA RTX A6000
- CUDA 12.8 + PyTorch 2.10.0 + triton-windows

tori29umai0123 and others added 2 commits February 2, 2026 19:09
When native_fp8_support=False and quant_type="fp8-per-block", 3D tensors
(e.g., [batch, seq_len, hidden] from attention layers like txt_attn_q/k/v)
were passed directly to fp8_per_block_quant which expects 2D input.

Changes:
- Reshape 3D input to 2D before calling fp8_per_block_quant
- Store original shape and restore output to 3D after GEMM operation

This fixes assertion errors when using FP8 quantization with transformer
attention layers that produce 3D tensors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
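
A sketch of the reshape pattern this commit describes; the wrapper name is hypothetical, and quant / gemm stand in for fp8_per_block_quant and the FP8 GEMM kernel.

```python
import torch


def fp8_linear_any_rank(x: torch.Tensor, wq, w_scale, quant, gemm):
    """Flatten [batch, seq_len, hidden] to 2D for per-block FP8 quantization,
    run the FP8 GEMM, then restore the leading dimensions (hypothetical sketch)."""
    orig_shape = x.shape
    if x.dim() == 3:                        # e.g. txt_attn_q/k/v activations
        x = x.reshape(-1, orig_shape[-1])   # [batch * seq_len, hidden]

    xq, x_scale = quant(x)                  # per-block quantization expects 2D input
    out = gemm(xq, x_scale, wq, w_scale)    # [batch * seq_len, out_features]

    if len(orig_shape) == 3:                # restore [batch, seq_len, out_features]
        out = out.reshape(orig_shape[0], orig_shape[1], -1)
    return out
```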

For more detailed installation instructions, please refer to the [Installation Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/installation.html).

#### Windows Installation (with FP8 Triton Support)
Collaborator


Please mv to docs/source/getting_started/installation.md
