Releases: tile-ai/tilelang

v0.1.8

16 Feb 14:05
41b2552

What's Changed

  • [Bugfix][Build] Update CMake configuration to remove project root injection for sys.path by @LeiWang1999 in #1385
  • [BugFix] Fix split kernel layout bug of GQA decode by @tzj-fxz in #1386
  • [Feat] Add better repr print for Layout and Fragment by @kurisu6912 in #1392
  • [Doc] Logging docs for Tilelang/TVM by @SiriusNEO in #1395
  • [Enhancement] Refactor inflight computing to support dynamic pipeline extents by @LeiWang1999 in #1399
  • [AMD] Fix 3 bugs when building Docker on AMD MI3x GPUs by @danielhua23 in #1401
  • [Typo] Fix tilelang link in README.md by @senlyu163 in #1402
  • [Dependency] Update apache-tvm-ffi version to >=0.1.2 by @LeiWang1999 in #1400
  • [AMD] Enable FA2 fwd on AMD MI300X by @danielhua23 in #1406
  • [Typo] fix typo for SM120 by @Cunxiao2002 in #1408
  • [Doc] Minor documentation update by @LeiWang1999 in #1410
  • [Dependency] Add torch-c-dlpack-ext to project requirements by @LeiWang1999 in #1403
  • [Bugfix] Allow allocating T.make_tensor not at the top of prim_func by @LeiWang1999 in #1412
  • [Enhancement] Introduce T.__ldg by @LeiWang1999 in #1414
  • [Enhancement] Improve vectorization invariant check by @LJC00118 in #1398
  • [Lint] Phaseout Yapf format and embrace ruff format by @LeiWang1999 in #1417
  • [Atomic] Use ptr for atomicAdd dst instead of reference by @LeiWang1999 in #1425
  • [CUDA] Add read-only parameter annotation for CUDA codegen by @LeiWang1999 in #1416
  • [Refactor] Phase out the primitives folder since its design has been merged into tileop by @LeiWang1999 in #1429
  • [CI]: Bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in #1431
  • [CI]: Bump actions/download-artifact from 6 to 7 by @dependabot[bot] in #1432
  • [Bugfix] Convey compile_flags to ffi compilation path with pass_configs by @LeiWang1999 in #1434
  • [Enhancement] Improve buffer usage tracking in MakePackedAPI by @LeiWang1999 in #1435
  • [Enhancement] Improve InjectAssumes logic and make assumes work after SplitHostDevice by @SiriusNEO in #1405
  • [Enhancement] Include PrimFunc name in memory cache logs for better debugging by @LeiWang1999 in #1437
  • [CI] Update lint dependencies and fix lint on trunk by @XuehaiPan in #1433
  • [Enhancement] Refactor vectorization checks in loop_vectorize by @LeiWang1999 in #1440
  • [Enhancement] Implement vectorized FP8 to FP32 cast by @LJC00118 in #1438
  • [Feature] Support region as input of T.cumsum by @Dayuxiaoshui in #1426
  • [Fix] Fix analyzer bind conflicting bug in #1442 by @kurisu6912 in #1446
  • [Refactor] Reduce direct dependency on PyTorch due to its limited type support by @LeiWang1999 in #1444
  • [Refactor] Use pytest.mark.parametrize to speed up parallel testing by @kurisu6912 in #1447
  • [Docs] Improve installation instructions for developers by @SiriusNEO in #1450
  • [Feat] Integrate Z3 in TVM Arith Analyzer by @kurisu6912 in #1367
  • [Bugfix] Improve autotune from elementwise_add function in examples by @senlyu163 in #1445
  • [Language] Introduce T.annotate_restrict_buffers by @LeiWang1999 in #1428
  • [Analyzer] Require loop extent > 0 when entering loop (#1012) by @kurisu6912 in #1451
  • [BugFix] Update CI to ROCm-7.1 by @Gongen-Ali in #1449
  • [Enhancement] Update examples and tests for improved type handling functionality by @LeiWang1999 in #1448
  • [Issue Template] Enable blank issues in GitHub issue template by @LeiWang1999 in #1453
  • [CI] Moved the clang-tidy step to after pip install by @LeiWang1999 in #1456
  • [Bug] Fix tvm build script when patchelf is not found by @kurisu6912 in #1459
  • [Analyzer] Fix floordiv & floormod bug in z3 prover by @kurisu6912 in #1458
  • [Cache] Rename sparse compress cache directory by @LeiWang1999 in #1460
  • [Language] Add random number generation capability through curand_kernel by @silentCoder-dev in #1461
  • remove unused duplicated type check by @sgjzfzzf in #1462
  • feat(cutedsl): add CuTeDSL backend by @lucifer1004 in #1421
  • [Refactor] Rename test for curand & add triton baseline in test_tilelang_language_rand.py by @silentCoder-dev in #1464
  • [ArgBinder] Enhance shape variable handling and assertions by @LeiWang1999 in #1467
  • [Language] Make TL scripts friendly to Python syntax highlights by @SiriusNEO in #1466
  • [Refactor] Remove triton dependence in testing & move triton baseline into examples by @silentCoder-dev in #1470
  • [Language] Enhance T.dtype.as_torch conversion for compatibility by @LeiWang1999 in #1473
  • [News] update with latest news by @LeiWang1999 in #1475
  • [Enhancement] Use static Z3 context by @LeiWang1999 in #1482
  • [Enhancement] Enhance let binding handling in layout inference and warp specialized pass by @LeiWang1999 in #1484
  • [Refactor] Phaseout PassConfig kDisableDynamicTailSplit and kDynamicAlignment as they are legacy by @LeiWang1999 in #1486
  • [Enhancement] Optimize the time cost of critical path for IntervalSetEvaluator by @LeiWang1999 in #1491
  • [CI] Add performance regression test script by @xwhzz in #1489
  • Pin nvidia-cutlass-dsl to 4.3.3 by @lucifer1004 in #1497
  • [Language] Remove ConstIf Frame for Better Meta-Programming by @kurisu6912 in #1496
  • [Bugfix][CI] Fix concurrency bug in regression test workflow by @xwhzz in #1500
  • [Refactor] Phaseout legacy alloc_local statement in examples and introduce processing for floating fragment buffers by @LeiWang1999 in #1495
  • [Enhancement] Optimize MHA varlen fwd and support autotune by @Rachmanino in #1499
  • [Enhancement] Refactor CUDA vectorized cast generation and remove unsupported FP8 type by @LJC00118 in #1474
  • [Dependency] Update apache-tvm-ffi to >=0.1.6 for memory safety when gc is not enabled by @LeiWang1999 in #1502
  • Update cutedsl docs and version check by @lucifer1004 in #1503
  • [Misc] configure pymarkdown by @lucifer1004 in #1505
  • [Language] Fix gemm syntax highlight by @SiriusNEO in #1476
  • [Fix] Fix TL_ENABLE_PTXAS_VERBOSE_OUTPUT has no effect in tvm-ffi by @kurisu6912 in #1511
  • [Refactor] Phaseout execution_backend ctypes by @LeiWang1999 in #1510
  • [Testing] Add Memory Leak Test by @kurisu6912 in #1516
  • [Refactor] Support auto swizzling for tma store and phaseout related layout annotations by @LeiWang1999 in #1509
  • [CuTeDSL][Fix] thread safety + context safety by @lucifer1004 in #1513
  • [BugFix] Phaseout unused tests for gqa decode kernels and add the kernels to CI by @tzj-fxz in #1515
  • [Cleanup] Remove unnecessary macros in tilelang examples by @Rachmanino in #1514
  • Fix ramp_lanes calculation in CUDA codegen by @LJC00118 in #1518
  • [Misc] add env for default target/backend/verbose by @lucifer1004 in #1512
  • [Dtype] Improve host codegen handling for subtype by @LeiWang1999 in #1517
  • [Bugfix] Fallback to a Linear Layout instead of raising errors by @LeiWang1999 in #1521
  • Use TargetIsCuda for all cuda target by @oraluben in https:...
Read more
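Among the language changes above, #1426 extends T.cumsum to accept a region as input. A plain-Python sketch of the semantics (the name cumsum_region and the (start, stop) pair are illustrative assumptions, not the tilelang API): only the selected slice is scanned, and elements outside it are left untouched.

```python
def cumsum_region(buf, region):
    """Illustrative sketch of a region-scoped cumulative sum.

    `region` is a hypothetical (start, stop) pair; the real T.cumsum
    signature may differ. Only buf[start:stop] is accumulated.
    """
    start, stop = region
    out = list(buf)
    acc = 0
    for i in range(start, stop):
        acc += buf[i]
        out[i] = acc  # running total, written back inside the region only
    return out
```

For example, cumsum_region([1, 2, 3, 4, 5], (1, 4)) scans only indices 1 through 3, leaving the first and last elements unchanged.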

v0.1.7.post3

18 Jan 09:32
62b8505

What's Changed

  • [Pipeline] Refactor buffer allocation in Inject Pipeline Pass by @LeiWang1999 in #1525
  • [Dev] Fix when build local version with isolated build by @oraluben in #1487
  • [Bugfix] Skip stride check for subtype by @LeiWang1999 in #1531
  • [Lint] Enable whitespace and permission bit hooks by @XuehaiPan in #1439
  • [Enhancement][Tool] Tree-style pretty ASTPrinter by @SiriusNEO in #1468
  • [Fix] Add support for non-var complement arithmetic computation (#1374) by @kurisu6912 in #1533
  • [BugFix] Complete vectorized loading for common dtypes by @SiriusNEO in #1536
  • [Compat] Add CUDA version check for __nv_fp8_e8m0 type by @LeiWang1999 in #1537
  • [BugFix] Fix bugs of varlen attention forward examples caused by S_q != S_kv by @hukongyi in #1530
  • [Bug] Fix hanging from reduction on sm120 by @PannenetsF in #1540
  • [example] use T.dynamic instead of tvm.te.var by @botbw in #1538
  • [Enhancement] Refactor KernelCache to use inheritance-based design by @sgjzfzzf in #1483
  • [Bugfix] Avoid considering local.var buffer as local by @LeiWang1999 in #1541
  • [Bugfix] Fix of T.Fill for local.var by @LeiWang1999 in #1543
  • [Z3] Change z3 timeout to rlimit for deterministic prove behavior by @kurisu6912 in #1542
  • [Feat] Adapt gemm v2 for cutedsl backend by @lucifer1004 in #1544
  • [Enhancement] Support larger H in deepseek sparse mla backward via split-H by @Rachmanino in #1548
  • [Bugfix] Fix regression test to use installed package instead of source directory by @xwhzz in #1550
  • [Refactor] Introduce layout annotations for ParallelOPNode and CopyNode by @LeiWang1999 in #1539
  • [Script] Provide regression test script to help benchmark regression in local env by @LeiWang1999 in #1551
  • [Typing] Update Kernel signature and add type hints for buffer operations by @clouds56 in #1545
  • [CI]: Bump actions/upload-artifact from 4 to 6 by @dependabot[bot] in #1555
  • [Refactor] Use cuda capability from torch to be more generic by @oraluben in #1557
  • [CI]: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #1556
  • [Host] Provide post process to customize host code and enhance nullable check by @LeiWang1999 in #1562
  • [Release] Build tilelang against CUDA 13.1 in CI by @oraluben in #1532
  • [LazyJIT] Move Type Annotations to Function Body by @kurisu6912 in #1480
  • [bugfix] fix missing clear_accum logic for gemm_sp_v2 by @botbw in #1563
  • [Misc] Remove unused tl_pipeline_sync. by @c8ef in #1566
  • [Refactor] Improve scalarization handling in Pass VectorizeLoop by @LeiWang1999 in #1565
  • [Refactor] Simplify do_bench calls by using default warmup and rep parameters by @LeiWang1999 in #1568
  • [CI] Refactor PR regression test job conditions by @xwhzz in #1569
  • [Parallel][Infer] Free-mode chooses minimal replication between buffer-based and PlanLoopPartition by @LeiWang1999 in #1559
  • [Refactor] Enhance deterministic ordering in shared memory allocation merge. by @LeiWang1999 in #1570
  • [Enhancement] Improve equality checks in layout nodes and fragment validation by @LeiWang1999 in #1573
  • [Feature] add kUseCooperativeLaunch tag for tvm_ffi by @silentCoder-dev in #1572
  • [Refactor] Remove unnecessary logging configuration in Analyzer.py by @LeiWang1999 in #1574
  • [Release] Bump version to 0.1.7.post2 by @LeiWang1999 in #1575
  • [BugFix] Change default rounding mode for fp4 conversions by @LJC00118 in #1580
  • [CI] Add CUDA-aware pytest scheduler + auto workers by @LeiWang1999 in #1584
  • [Enhancement] Improve performance regression output with timing and streaming by @xwhzz in #1585
  • [Bugfix] Add kernel_global_source property to TVMFFIKernelAdapter by @haok1402 in #1589
  • [BugFix] Add PrimExpr substitution support for AttrStmt nodes by @LJC00118 in #1583
  • [BugFix] fix tcgen5mma example by @Rachmanino in #1577
  • [Refactor] Use access_ptr instead of buffer and offsets for cp async params by @LeiWang1999 in #1590
  • [Layout] Support annotating loop layout in frontend by @LeiWang1999 in #1579
  • [Typo] Rename loop layout annotation test by @LeiWang1999 in #1596
  • [Fix] Add register to read A ptr in test_tilelang_language_cooperative.py by @silentCoder-dev in #1593
  • [Feat] PDL Support by @w169q169 in #1494
  • [Enhancement][Subtype] Enhance symbolic shape/stride handling for subtype by @LeiWang1999 in #1599
  • [Fix][CuteDSL] add support for tanh/tanhf (fixes #1595) by @lucifer1004 in #1597
  • [Release] Fix race condition when publishing by @oraluben in #1578
  • Add conversion from cutlass::float_e4m3/e5m2 to tl::float_e4m3/e5m2 by @LJC00118 in #1600
  • [Enhancement][AMD] Add preshuffle fp8 gemm example on amd. by @Gongen-Ali in #1605
  • [Bugfix] Mangle Single Precision Mathematical Functions of cuda math api by @silentCoder-dev in #1602
  • [Bugfix] Enable ROCm CI tests and fix some bugs by @Gongen-Ali in #1443
  • [Feature] Add more curand operations & support vectorization by @silentCoder-dev in #1582
  • [Enhancement] Allow import tilelang on CPU-only machines without CUDA libraries by @XuehaiPan in #1481
  • [BugFix] Add pre-commit to requirements-dev.txt by @asaadkhaja99 in #1611
  • [BugFix] Fix some bugs in lowering ParallelOp and VectorizeLoop by @SiriusNEO in #1607
  • [Feat] Add strong checker to detect data racing in T.Parallel by @kurisu6912 in #1615
  • [Feature] add T.sync_warp & T.shfl_sync; change extern pdl into intrin by @silentCoder-dev in #1614
  • [RaceChecker] RaceChecker report warning rather than error for backward compatibility by @kurisu6912 in #1620
  • [BugFix] Fix ForwardRef usage in v2 frontend (#1619) by @kurisu6912 in #1621
  • [Refactor] Move ConstrVisitor to src/transform/common/constr_visitor.h for reuse by @silentCoder-dev in #1622
  • [Feat] Improve T.reduce_absmax to use less abs call by @kurisu6912 in #1626
  • [Bugfix] Do not consider local.var as local buffer during LowerTileOP by @LeiWang1999 in #1628
  • [Feature] Add hoist_broadcast_values pass by @silentCoder-dev in #1606
  • [Enhancement][CUDA] Support nvidia-cuda-nvcc as nvcc by @clouds56 in #1528
  • [Bugfix] Fallback into full region when dynamic buffer read region cannot be proved by @LeiWang1999 in #1618
  • [Feat] Allow print macro call stack in device assert by @kurisu6912 in #1616
  • [BugFix] Correct index_map selection for transposed A matrix in MFMA Layout with k_dim==4 and open rocm-ci for gemmsr by @benenzhu in #1627
  • [Example] Add Seesaw Sparse MLA Forward Kernel for DeepSeek-V3.2 by @hammersam in #1636
  • [Bugfix] Introduce a flag to avoid unnecessary broadcast hoist and enable for let stmt by @LeiWang1999 in #1638
  • [Refactor][CI] Reduce sparse related test time by @LeiWang1999 in #1637
  • [Refactor] Unify @jit and @lazy_jit into a single @jit decorator by @LeiWang1999 in #1632
  • [Bugfix] Fix pdl related intrin handling to avoid strict annotation codegen by @LeiWang1999 in #1650
  • [Bugfix] reverted unexpected tvm changes by @LEI...
Read more
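Among the CI changes above, #1584 adds a CUDA-aware pytest scheduler with automatic worker counts. A minimal sketch of the core idea, assuming a simple round-robin policy (the function name and the pinning-via-CUDA_VISIBLE_DEVICES detail are illustrative, not taken from the actual CI code): spread concurrent test workers across the visible GPUs so they do not all contend for device 0.

```python
def assign_workers_to_gpus(num_workers: int, num_gpus: int) -> dict:
    """Map each pytest worker index to a GPU index, round-robin.

    In a real harness each worker would export
    CUDA_VISIBLE_DEVICES=<gpu> before initializing CUDA; that detail
    is an assumption of this sketch, not the tilelang CI.
    """
    if num_gpus <= 0:
        raise ValueError("need at least one visible GPU")
    return {w: w % num_gpus for w in range(num_workers)}
```

With 5 workers and 2 GPUs this yields the mapping {0: 0, 1: 1, 2: 0, 3: 1, 4: 0}, so no single device hosts more than ceil(5/2) workers.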

v0.1.7.post2

31 Dec 06:12
2d8d367

What's Changed

  • [Pipeline] Refactor buffer allocation in Inject Pipeline Pass by @LeiWang1999 in #1525
  • [Dev] Fix when build local version with isolated build by @oraluben in #1487
  • [Bugfix] Skip stride check for subtype by @LeiWang1999 in #1531
  • [Lint] Enable whitespace and permission bit hooks by @XuehaiPan in #1439
  • [Enhancement][Tool] Tree-style pretty ASTPrinter by @SiriusNEO in #1468
  • [Fix] Add support for non-var complement arithmetic computation (#1374) by @kurisu6912 in #1533
  • [BugFix] Complete vectorized loading for common dtypes by @SiriusNEO in #1536
  • [Compat] Add CUDA version check for __nv_fp8_e8m0 type by @LeiWang1999 in #1537
  • [BugFix] Fix bugs of varlen attention forward examples caused by S_q != S_kv by @hukongyi in #1530
  • [Bug] Fix hanging from reduction on sm120 by @PannenetsF in #1540
  • [example] use T.dynamic instead of tvm.te.var by @botbw in #1538
  • [Enhancement] Refactor KernelCache to use inheritance-based design by @sgjzfzzf in #1483
  • [Bugfix] Avoid considering local.var buffer as local by @LeiWang1999 in #1541
  • [Bugfix] Fix of T.Fill for local.var by @LeiWang1999 in #1543
  • [Z3] Change z3 timeout to rlimit for deterministic prove behavior by @kurisu6912 in #1542
  • [Feat] Adapt gemm v2 for cutedsl backend by @lucifer1004 in #1544
  • [Enhancement] Support larger H in deepseek sparse mla backward via split-H by @Rachmanino in #1548
  • [Bugfix] Fix regression test to use installed package instead of source directory by @xwhzz in #1550
  • [Refactor] Introduce layout annotations for ParallelOPNode and CopyNode by @LeiWang1999 in #1539
  • [Script] Provide regression test script to help benchmark regression in local env by @LeiWang1999 in #1551
  • [Typing] Update Kernel signature and add type hints for buffer operations by @clouds56 in #1545
  • [CI]: Bump actions/upload-artifact from 4 to 6 by @dependabot[bot] in #1555
  • [Refactor] Use cuda capability from torch to be more generic by @oraluben in #1557
  • [CI]: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #1556
  • [Host] Provide post process to customize host code and enhance nullable check by @LeiWang1999 in #1562
  • [Release] Build tilelang against CUDA 13.1 in CI by @oraluben in #1532
  • [LazyJIT] Move Type Annotations to Function Body by @kurisu6912 in #1480
  • [bugfix] fix missing clear_accum logic for gemm_sp_v2 by @botbw in #1563
  • [Misc] Remove unused tl_pipeline_sync. by @c8ef in #1566
  • [Refactor] Improve scalarization handling in Pass VectorizeLoop by @LeiWang1999 in #1565
  • [Refactor] Simplify do_bench calls by using default warmup and rep parameters by @LeiWang1999 in #1568
  • [CI] Refactor PR regression test job conditions by @xwhzz in #1569
  • [Parallel][Infer] Free-mode chooses minimal replication between buffer-based and PlanLoopPartition by @LeiWang1999 in #1559
  • [Refactor] Enhance deterministic ordering in shared memory allocation merge. by @LeiWang1999 in #1570
  • [Enhancement] Improve equality checks in layout nodes and fragment validation by @LeiWang1999 in #1573
  • [Feature] add kUseCooperativeLaunch tag for tvm_ffi by @silentCoder-dev in #1572
  • [Refactor] Remove unnecessary logging configuration in Analyzer.py by @LeiWang1999 in #1574
  • [Release] Bump version to 0.1.7.post2 by @LeiWang1999 in #1575

Full Changelog: v0.1.7.post1...0.1.7.post2

0.1.7.post1

24 Dec 15:16
3c11823

What's Changed

  • [Bugfix][Build] Update CMake configuration to remove project root injection for sys.path by @LeiWang1999 in #1385
  • [BugFix] Fix split kernel layout bug of GQA decode by @tzj-fxz in #1386
  • [Feat] Add better repr print for Layout and Fragment by @kurisu6912 in #1392
  • [Doc] Logging docs for Tilelang/TVM by @SiriusNEO in #1395
  • [Enhancement] Refactor inflight computing to support dynamic pipeline extents by @LeiWang1999 in #1399
  • [AMD] Fix 3 bugs when building Docker on AMD MI3x GPUs by @danielhua23 in #1401
  • [Typo] Fix tilelang link in README.md by @senlyu163 in #1402
  • [Dependency] Update apache-tvm-ffi version to >=0.1.2 by @LeiWang1999 in #1400
  • [AMD] Enable FA2 fwd on AMD MI300X by @danielhua23 in #1406
  • [Typo] fix typo for SM120 by @Cunxiao2002 in #1408
  • [Doc] Minor documentation update by @LeiWang1999 in #1410
  • [Dependency] Add torch-c-dlpack-ext to project requirements by @LeiWang1999 in #1403
  • [Bugfix] Allow allocating T.make_tensor not at the top of prim_func by @LeiWang1999 in #1412
  • [Enhancement] Introduce T.__ldg by @LeiWang1999 in #1414
  • [Enhancement] Improve vectorization invariant check by @LJC00118 in #1398
  • [Lint] Phaseout Yapf format and embrace ruff format by @LeiWang1999 in #1417
  • [Atomic] Use ptr for atomicAdd dst instead of reference by @LeiWang1999 in #1425
  • [CUDA] Add read-only parameter annotation for CUDA codegen by @LeiWang1999 in #1416
  • [Refactor] Phase out the primitives folder since its design has been merged into tileop by @LeiWang1999 in #1429
  • [CI]: Bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in #1431
  • [CI]: Bump actions/download-artifact from 6 to 7 by @dependabot[bot] in #1432
  • [Bugfix] Convey compile_flags to ffi compilation path with pass_configs by @LeiWang1999 in #1434
  • [Enhancement] Improve buffer usage tracking in MakePackedAPI by @LeiWang1999 in #1435
  • [Enhancement] Improve InjectAssumes logic and make assumes work after SplitHostDevice by @SiriusNEO in #1405
  • [Enhancement] Include PrimFunc name in memory cache logs for better debugging by @LeiWang1999 in #1437
  • [CI] Update lint dependencies and fix lint on trunk by @XuehaiPan in #1433
  • [Enhancement] Refactor vectorization checks in loop_vectorize by @LeiWang1999 in #1440
  • [Enhancement] Implement vectorized FP8 to FP32 cast by @LJC00118 in #1438
  • [Feature] Support region as input of T.cumsum by @Dayuxiaoshui in #1426
  • [Fix] Fix analyzer bind conflicting bug in #1442 by @kurisu6912 in #1446
  • [Refactor] Reduce direct dependency on PyTorch due to its limited type support by @LeiWang1999 in #1444
  • [Refactor] Use pytest.mark.parametrize to speed up parallel testing by @kurisu6912 in #1447
  • [Docs] Improve installation instructions for developers by @SiriusNEO in #1450
  • [Feat] Integrate Z3 in TVM Arith Analyzer by @kurisu6912 in #1367
  • [Bugfix] Improve autotune from elementwise_add function in examples by @senlyu163 in #1445
  • [Language] Introduce T.annotate_restrict_buffers by @LeiWang1999 in #1428
  • [Analyzer] Require loop extent > 0 when entering loop (#1012) by @kurisu6912 in #1451
  • [BugFix] Update CI to ROCm-7.1 by @Gongen-Ali in #1449
  • [Enhancement] Update examples and tests for improved type handling functionality by @LeiWang1999 in #1448
  • [Issue Template] Enable blank issues in GitHub issue template by @LeiWang1999 in #1453
  • [CI] Moved the clang-tidy step to after pip install by @LeiWang1999 in #1456
  • [Bug] Fix tvm build script when patchelf is not found by @kurisu6912 in #1459
  • [Analyzer] Fix floordiv & floormod bug in z3 prover by @kurisu6912 in #1458
  • [Cache] Rename sparse compress cache directory by @LeiWang1999 in #1460
  • [Language] Add random number generation capability through curand_kernel by @silentCoder-dev in #1461
  • remove unused duplicated type check by @sgjzfzzf in #1462
  • feat(cutedsl): add CuTeDSL backend by @lucifer1004 in #1421
  • [Refactor] Rename test for curand & add triton baseline in test_tilelang_language_rand.py by @silentCoder-dev in #1464
  • [ArgBinder] Enhance shape variable handling and assertions by @LeiWang1999 in #1467
  • [Language] Make TL scripts friendly to Python syntax highlights by @SiriusNEO in #1466
  • [Refactor] Remove triton dependence in testing & move triton baseline into examples by @silentCoder-dev in #1470
  • [Language] Enhance T.dtype.as_torch conversion for compatibility by @LeiWang1999 in #1473
  • [News] update with latest news by @LeiWang1999 in #1475
  • [Enhancement] Use static Z3 context by @LeiWang1999 in #1482
  • [Enhancement] Enhance let binding handling in layout inference and warp specialized pass by @LeiWang1999 in #1484
  • [Refactor] Phaseout PassConfig kDisableDynamicTailSplit and kDynamicAlignment as they are legacy by @LeiWang1999 in #1486
  • [Enhancement] Optimize the time cost of critical path for IntervalSetEvaluator by @LeiWang1999 in #1491
  • [CI] Add performance regression test script by @xwhzz in #1489
  • Pin nvidia-cutlass-dsl to 4.3.3 by @lucifer1004 in #1497
  • [Language] Remove ConstIf Frame for Better Meta-Programming by @kurisu6912 in #1496
  • [Bugfix][CI] Fix concurrency bug in regression test workflow by @xwhzz in #1500
  • [Refactor] Phaseout legacy alloc_local statement in examples and introduce processing for floating fragment buffers by @LeiWang1999 in #1495
  • [Enhancement] Optimize MHA varlen fwd and support autotune by @Rachmanino in #1499
  • [Enhancement] Refactor CUDA vectorized cast generation and remove unsupported FP8 type by @LJC00118 in #1474
  • [Dependency] Update apache-tvm-ffi to >=0.1.6 for memory safety when gc is not enabled by @LeiWang1999 in #1502
  • Update cutedsl docs and version check by @lucifer1004 in #1503
  • [Misc] configure pymarkdown by @lucifer1004 in #1505
  • [Language] Fix gemm syntax highlight by @SiriusNEO in #1476
  • [Fix] Fix TL_ENABLE_PTXAS_VERBOSE_OUTPUT has no effect in tvm-ffi by @kurisu6912 in #1511
  • [Refactor] Phaseout execution_backend ctypes by @LeiWang1999 in #1510
  • [Testing] Add Memory Leak Test by @kurisu6912 in #1516
  • [Refactor] Support auto swizzling for tma store and phaseout related layout annotations by @LeiWang1999 in #1509
  • [CuTeDSL][Fix] thread safety + context safety by @lucifer1004 in #1513
  • [BugFix] Phaseout unused tests for gqa decode kernels and add the kernels to CI by @tzj-fxz in #1515
  • [Cleanup] Remove unnecessary macros in tilelang examples by @Rachmanino in #1514
  • Fix ramp_lanes calculation in CUDA codegen by @LJC00118 in #1518
  • [Misc] add env for default target/backend/verbose by @lucifer1004 in #1512
  • [Dtype] Improve host codegen handling for subtype by @LeiWang1999 in #1517
  • [Bugfix] Fallback to a Linear Layout instead of raising errors by @LeiWang1999 in #1521
  • Use TargetIsCuda for all cuda target by @oraluben in https:...
Read more

v0.1.7

07 Dec 03:09
305c854

What's Changed

  • [PATCH] Static libg++ linking fix by @LeiWang1999 in #854
  • [Analyzer] Enhance ConstIntBoundAnalyzer and IntervalSet with modular set analysis by @LeiWang1999 in #856
  • [Doc] Optimize the quickstart guide for clarity and not just for CUDA by @LeiWang1999 in #858
  • [TMA] Bugfix when a shared buffer is both issued with tma store and tma load by @LeiWang1999 in #857
  • [AMD][MLA] Fix mla autotune for rocm by @LeiWang1999 in #861
  • [Bugfix] Ensure correct handling for cases where seq_q<seq_kv in flash attention examples by @Rachmanino in #864
  • [AMD] refactor MatrixCoreIntrinEmitter by @Paran0idy in #860
  • [Feat] Add fast sine and cosine definitions in CUDA templates by @Rachmanino in #865
  • [Layout] Support layout forward with multi dimension by @LeiWang1999 in #867
  • [Autotune][Conv] optimize convolution examples to use autotune by @LeiWang1999 in #866
  • [Example] Add examples to support efficient attention sink forward process by @Rachmanino in #853
  • [Parser] Adapt Parser to work with Python 3.8 in some cases by @LeiWang1999 in #869
  • [Fix] Fix bug 0905: tilelang doesn't vectorize B[i,j] = c[i] + A[i,j] by @kurisu6912 in #798
  • [Language] Support sequence comparisons by @LeiWang1999 in #872
  • [Language] Support loop_break primitive by @chengyupku in #873
  • [Bugfix] Use ExprDeepEqual instead of StructuralEqual when merge consecutive If stmt by @LeiWang1999 in #876
  • [Language] Support atomic add with ret by @LeiWang1999 in #870
  • [Cython] Remove an incorrect check by @LJC00118 in #880
  • Update amd_ci.yml by @Alex4210987 in #881
  • [FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit fastmath op to invoke by @LeiWang1999 in #875
  • [Example] Add efficient attention sink backward implementations and tests by @Rachmanino in #877
  • [Precision] Introduce T.ieee_rsqrt and related high precision op by @LeiWang1999 in #882
  • [Dist] Provide an option to include commit ID in version by @LeiWang1999 in #884
  • [Example] Optimize sink attention forward via swizzled layout and report benchmark results by @Rachmanino in #885
  • [Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop by @LeiWang1999 in #844
  • [Bugfix][Enhancement] Fix a bug in previous commit and enhance cuda backend by @Hamerlate in #887
  • [Bugfix] Fix CopyNode Lower method to include disable_tma flag in GetCopyInst by @Rachmanino in #888
  • [Layout] Fix plot layout by @Paran0idy in #890
  • [Example] Add example by @LeiWang1999 in #894
  • [News] Add announcement of support for Huawei Ascend chips by @xwhzz in #895
  • [Example] Add sparse mla examples by @LeiWang1999 in #896
  • [Typo] Fix backend name for Huawei Ascend by @xwhzz in #898
  • [CI] Legalize math related test by @LeiWang1999 in #899
  • [Bugfix] Fix FLOPs computation and softmax scale in MLA by @Edenzzzz in #900
  • [Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples by @LeiWang1999 in #913
  • [CI] optimize CI time for sparse gemm by @botbw in #906
  • [Enhancement] Include compile flags into the hash key of cached kernels by @Rachmanino in #911
  • [Bugfix] Fix saving kernel source code where JITKernel.artifact is None by @zjudmd1015 in #921
  • [CI] Refactor import paths in dequantization examples to use dequantize_utils by @LeiWang1999 in #914
  • [Example] Add MLA decode ws example by @chengyupku in #928
  • [CI] Fix documentation runner by adding 'nvidia' tag by @xwhzz in #927
  • [Layout] Strict annotate completed replicated layout for fragment with constant index by @LeiWang1999 in #929
  • [Bugfix] Fix tensor memory copy layout by @Hamerlate in #933
  • [Example] Optimize online_softmax example by @lijinpei in #934
  • [Example] Add correctness assert into dsa example by @LeiWang1999 in #937
  • [Enhancement] Enhance and add new GQA backward examples for Hopper by @Rachmanino in #930
  • [Enhancement] Fix lint to improve grouped GEMM performance with TMA by @Cunxiao2002 in #938
  • [Example] Introduce split+sum template, and optimize atomic_add performance for bwd examples by @LeiWang1999 in #940
  • [Example] Disable TMA and enable FastMath for NSA Examples (#941) by @LeiWang1999 in #941
  • [Example] Revert the atomic/split&sum templates in MHA backward examples by @Rachmanino in #943
  • [Example] Add sparse mla bwd example for deepseek_v32 by @Zhichenzzz in #919
  • [Profiler] Add CUPTI profiler support by @Cunxiao2002 in #936
  • [Enhancement] Support Copy for Buffer Load with scalar indices by @LeiWang1999 in #946
  • [Code Style] Refine nvrtc compile related check style by @BBuf in #945
  • [Backend] Add metal backend by @oraluben in #799
  • [CI] enable dependabot for GHA workflows by @XuehaiPan in #950
  • Modify the SM architecture number to support Thor’s sm110. by @iloveai8086 in #957
  • [CI] auto-cancel in-progress PR CI when new commits are pushed by @XuehaiPan in #956
  • [bug] fix type object is not subscriptable in py38 by @BBuf in #959
  • [Bugfix][Doc] Add astroid version constraint to requirements.txt by @xwhzz in #958
  • [CI]: Bump actions/setup-python from 2 to 6 by @dependabot[bot] in #951
  • [CI]: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #952
  • [CI]: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #954
  • [CI]: Bump actions/checkout from 2 to 5 by @dependabot[bot] in #953
  • [TileOp] Implement WGMMA for T.gemm_v2 by @LeiWang1999 in #813
  • [Docs] add CODE_OF_CONDUCT.md by @XuehaiPan in #965
  • [Example] Add support for bfloat16 and user-defined sm_scale in attention sink examples by @Rachmanino in #924
  • [Bugfix] Do not force inline let stmt by @LeiWang1999 in #947
  • [CI] add pre-commit integration by @XuehaiPan in #955
  • [Doc] Install docs add docker install method by @BBuf in #961
  • [Bugfix] Fix dummy kernel compilation by @SiriusNEO in #962
  • [CI][Refactor] Refactor non-test CI workflow files by @XuehaiPan in #971
  • [TileOp] Implement CumSum1D by @LeiWang1999 in #978
  • [Language] Enhance T.alloc_var for AugAssign and AnnAsign by @LeiWang1999 in #979
  • [Refactor] Refactor Pass InjectFenceProxy and expose some warp group primitives in frontend by @LeiWang1999 in #977
  • [Typo] Remove debug print by @LeiWang1999 in #980
  • [Bugfix] Use access_ptr("r") instead of access_ptr("w") for correct pipeline analysis by @LeiWang1999 in #983
  • [Feature][Example] Support TMA reduce operation and update GQA bwd example by @chengyupku in #969
  • [Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974) by @Degeneracy-Evil in #976
  • [BugFix] Robust gemm policy for sparse_mla_fwd in Hopper and Ada Lovelace architectures by @tzj-fxz in #984
  • [Bugfix] Fallback torch.accelerator.synchronize() to torch.cuda.synchronize() by @yyttt6 in https...
Read more
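One fix from the list above worth illustrating: #911 includes compile flags in the hash key of cached kernels, so the same source built with different flags no longer collides in the cache. A hedged sketch of that idea (the names and hashing scheme are illustrative assumptions, not the tilelang implementation):

```python
import hashlib

def kernel_cache_key(source: str, compile_flags) -> str:
    """Derive a cache key from kernel source *and* its compile flags."""
    h = hashlib.sha256()
    h.update(source.encode("utf-8"))
    # Sort so flag order alone doesn't cause spurious cache misses;
    # whether the real cache normalizes order is an assumption here.
    for flag in sorted(compile_flags):
        h.update(b"\0")  # delimiter so ["-ab"] hashes differently from ["-a", "-b"]
        h.update(flag.encode("utf-8"))
    return h.hexdigest()
```

Two builds of the same source with, say, -O3 versus -O0 now get distinct keys, and therefore distinct cache entries.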

v0.1.6.post2

31 Oct 01:00
c37621c

The Last Release for Python 3.8 (without tvm-ffi) 🚀

What's Changed

  • [Analyzer] Enhance ConstIntBoundAnalyzer and IntervalSet with modular set analysis by @LeiWang1999 in #856
  • [Doc] Optimize the quickstart guide for clarity and not just for CUDA by @LeiWang1999 in #858
  • [TMA] Bugfix when a shared buffer is both issued with tma store and tma load by @LeiWang1999 in #857
  • [AMD][MLA] Fix mla autotune for rocm by @LeiWang1999 in #861
  • [Bugfix] Ensure correct handling for cases where seq_q<seq_kv in flash attention examples by @Rachmanino in #864
  • [AMD] refactor MatrixCoreIntrinEmitter by @Paran0idy in #860
  • [Feat] Add fast sine and cosine definitions in CUDA templates by @Rachmanino in #865
  • [Layout] Support layout forward with multi dimension by @LeiWang1999 in #867
  • [Autotune][Conv] optimize convolution examples to use autotune by @LeiWang1999 in #866
  • [Example] Add examples to support efficient attention sink forward process by @Rachmanino in #853
  • [Parser] Adapt Parser to work with Python 3.8 in some cases by @LeiWang1999 in #869
  • [Fix] Fix bug 0905: tilelang doesn't vectorize B[i,j] = c[i] + A[i,j] by @kurisu6912 in #798
  • [Language] Support sequence comparisons by @LeiWang1999 in #872
  • [Language] Support loop_break primitive by @chengyupku in #873
  • [Bugfix] Use ExprDeepEqual instead of StructuralEqual when merge consecutive If stmt by @LeiWang1999 in #876
  • [Language] Support atomic add with ret by @LeiWang1999 in #870
  • [Cython] Remove an incorrect check by @LJC00118 in #880
  • Update amd_ci.yml by @Alex4210987 in #881
  • [FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit fastmath op to invoke by @LeiWang1999 in #875
  • [Example] Add efficient attention sink backward implementations and tests by @Rachmanino in #877
  • [Precision] Introduce T.ieee_rsqrt and related high precision op by @LeiWang1999 in #882
  • [Dist] Provide an option to include commit ID in version by @LeiWang1999 in #884
  • [Example] Optimize sink attention forward via swizzled layout and report benchmark results by @Rachmanino in #885
  • [Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop by @LeiWang1999 in #844
  • [Bugfix][Enhancement] Fix a bug in previous commit and enhance cuda backend by @Hamerlate in #887
  • [Bugfix] Fix CopyNode Lower method to include disable_tma flag in GetCopyInst by @Rachmanino in #888
  • [Layout] Fix plot layout by @Paran0idy in #890
  • [Example] Add example by @LeiWang1999 in #894
  • [News] Add announcement of support for Huawei Ascend chips by @xwhzz in #895
  • [Example] Add sparse mla examples by @LeiWang1999 in #896
  • [Typo] Fix backend name for Huawei Ascend by @xwhzz in #898
  • [CI] Legalize math related test by @LeiWang1999 in #899
  • [Bugfix] Fix flops comp and softmax scale in mla by @Edenzzzz in #900
  • [Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples by @LeiWang1999 in #913
  • [CI] optimize CI time for sparse gemm by @botbw in #906
  • [Enhancement] Include compile flags into the hash key of cached kernels by @Rachmanino in #911
  • [Bugfix] Fix saving kernel source code where JITKernel.artifact is None by @zjudmd1015 in #921
  • [CI] Refactor import paths in dequantization examples to use dequantize_utils by @LeiWang1999 in #914
  • [Example] Add MLA decode ws example by @chengyupku in #928
  • [CI] Fix documentation runner by adding 'nvidia' tag by @xwhzz in #927
  • [Layout] Strict annotate completed replicated layout for fragment with constant index by @LeiWang1999 in #929
  • [Bugfix] Fix tensor memory copy layout by @Hamerlate in #933
  • [Example] Optimize online_softmax example by @lijinpei in #934
  • [Example] Add correctness assert into dsa example by @LeiWang1999 in #937
  • [Enhancement] Enhance and add new GQA backward examples for Hopper by @Rachmanino in #930
  • [Enhancement] Fix lint to improve grouped GEMM performance with TMA by @Cunxiao2002 in #938
  • [Example] Introduce split+sum template, and optimize atomic_add performance for bwd examples by @LeiWang1999 in #940
  • [Example] Disable TMA and enable FastMath for NSA Examples (#941) by @LeiWang1999 in #941
  • [Example] Revert the atomic/split&sum templates in MHA backward examples by @Rachmanino in #943
  • [Example] Add sparse mla bwd example for deepseek_v32 by @Zhichenzzz in #919
  • [Profiler] Adds CUPTI profiler support by @Cunxiao2002 in #936
  • [Enhancement] Support Copy for Buffer Load with scalar indices by @LeiWang1999 in #946
  • [Code Style] Refine nvrtc compile related check style by @BBuf in #945
  • [Backend] Add metal backend by @oraluben in #799
  • [CI] enable dependabot for GHA workflows by @XuehaiPan in #950
  • Modify the SM architecture number to support Thor’s sm110. by @iloveai8086 in #957
  • [CI] auto-cancel in-progress PR CI when new commits are pushed by @XuehaiPan in #956
  • [bug] fix type object is not subscriptable in py38 by @BBuf in #959
  • [Bugfix][Doc] Add astroid version constraint to requirements.txt by @xwhzz in #958
  • [CI]: Bump actions/setup-python from 2 to 6 by @dependabot[bot] in #951
  • [CI]: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #952
  • [CI]: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #954
  • [CI]: Bump actions/checkout from 2 to 5 by @dependabot[bot] in #953
  • [TileOp] Implement WGMMA for T.gemm_v2 by @LeiWang1999 in #813
  • [Docs] add CODE_OF_CONDUCT.md by @XuehaiPan in #965
  • [Example] Add support for bfloat16 and user-defined sm_scale in attention sink examples by @Rachmanino in #924
  • [Bugfix] Do not force inline let stmt by @LeiWang1999 in #947
  • [CI] add pre-commit integration by @XuehaiPan in #955
  • [Doc] Install docs add docker install method by @BBuf in #961
  • [Bugfix] Fix dummy kernel compilation by @SiriusNEO in #962
  • [CI][Refactor] Refactor non-test CI workflow files by @XuehaiPan in #971
  • [TileOp] Implement CumSum1D by @LeiWang1999 in #978
  • [Language] Enhance T.alloc_var for AugAssign and AnnAssign by @LeiWang1999 in #979
  • [Refactor] Refactor Pass InjectFenceProxy and expose some warp group primitives in frontend by @LeiWang1999 in #977
  • [Typo] Remove debug print by @LeiWang1999 in #980
  • [Bugfix] Use access_ptr("r") instead of access_ptr("w") for correct pipeline analysis by @LeiWang1999 in #983
  • [Feature][Example] Support TMA reduce operation and update GQA bwd example by @chengyupku in #969
  • [Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974) by @Degeneracy-Evil in #976
  • [BugFix] Robust gemm policy for sparse_mla_fwd in Hopper and Ada Lovelace architectures by @tzj-fxz in #984
  • [Bugfix] Fallback torch.accelerator.synchronize() to torch.cuda.synchronize() by @yyttt6 in #987
    ...
Read more

v0.1.6.post1

21 Sep 19:43
a3497eb


In version 0.1.6, libgcc and libg++ were statically linked to improve version compatibility; however, this could introduce unpredictable risks in some programs.
In post1, the build process was reworked to follow the PyTorch build workflow, eliminating those risks while retaining the compatibility benefits. This is the reason for releasing version 0.1.6.post1.

v0.1.6

19 Sep 15:28
1ad6e46


What's Changed

  • [Bugfix] Added missing thread offsets and other information to reduce by @LeiWang1999 in #646
  • [Bugfix] Adjust role assignment in warp specialization based on read access by @chengyupku in #647
  • Fix/jit kernel use target by @meinie0826 in #648
  • [Bugfix] Remove small array reuse condition in shared memory allocation merging by @LeiWang1999 in #654
  • [Enhancement] Add role assignment for AllocateNode in warp specialization by @chengyupku in #657
  • [Bugfix][CI] Bug fixing and migrate CI from ada to hopper by @xwhzz in #652
  • [CI] Enable cache for virtual env and parallelize pytest via xdist by @LeiWang1999 in #660
  • [Cache] Support shared cache directories for multiple process by @LeiWang1999 in #649
  • [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control by @xwhzz in #656
  • add the support of rocm arch detecting by @zhangnju in #661
  • [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking by @tzj-fxz in #653
  • [Bugfix][Docs] Update documentation build process and configurations for autoapi support by @xwhzz in #663
  • [Enhancement] Improve buffer conflict detection in thread storage synchronization by @LeiWang1999 in #658
  • [Bugfix] Consider buffer data type into indices provably disjoint analysis by @LeiWang1999 in #664
  • [Bugfix] Remove redundant T.fill to fix precision issue by @xuchangtolearn in #667
  • [Enhancement] Refactor buffer index handling for improved precision a… by @Alex4210987 in #671
  • Reverts #671 by @LeiWang1999 in #672
  • [Bugfix] Passing correct nvcc to cmake by @chenyang78 in #670
  • [CI] Improve format check output and automate commit of changes by @xwhzz in #669
  • [Bugfix][CI] Use valid runner labels in workflow by @xwhzz in #674
  • [Enhancement] passing verbose to LibraryGenerator by @chenyang78 in #673
  • [Enhancement] Enhance lint error messaging in CI by @xwhzz in #675
  • Refactor to support upstream tvm by @Hzfengsy in #595
  • Do not check for short variables by @oraluben in #676
  • [Refactor] Phaseout version with commit id in editable model by @LeiWang1999 in #677
  • [CI] Update CI workflow to use Python 3.12 by @LeiWang1999 in #679
  • [Enhancement] Output cache-file-related messages with verbose=True by @chenyang78 in #683
  • [Enhancement] Enhance warp specialization logic by @chengyupku in #680
  • Add Flash Attn example on amd mi300 series by @Alex4210987 in #682
  • [Enhancement] Refactored buffer detection logic in warp_specialized_rewriter.cc by @chengyupku in #685
  • [Fix] fix some issues with JIT decorators existing in the examples by @Cunxiao2002 in #681
  • [Enhancement] Add --ptxas-options=--register-usage-level=10 option by @LeiWang1999 in #684
  • [Feature]: Add auto vectorize for atomic add by @yyttt6 in #686
  • [Refactor] Rebase pipeline injector from upstream tvm by @LeiWang1999 in #687
  • [Refactor] Introduce GemmInst for different targets handling by @LeiWang1999 in #688
  • [Enhancement] Optimize BF16 casting performance by @xwhzz in #689
  • [Smem Reuse] Optimize to do memory alignment on identical buffers. by @LeiWang1999 in #693
  • [Version] Keep local commit id as it somehow help with debugging by @LeiWang1999 in #697
  • [Example] Optimize warp specialize flashmla example by @LeiWang1999 in #698
  • Bump transformers from 4.52.1 to 4.53.0 in /examples/bitnet-1.58b by @dependabot[bot] in #700
  • Gated Delta Net(GDN) kernel implementation in TileLang by @tzj-fxz in #695
  • Trivial update to calculate target arch by @oraluben in #702
  • [CI] Remove Flash Attention dependency by @LeiWang1999 in #705
  • [Layout] Introduce a new layout inference mechanism by @LeiWang1999 in #699
  • [Pipeline] Optimize inject software pipeline and pipeline planing pass by @LeiWang1999 in #706
  • Low-bit kernels fix and implementation by @tzj-fxz in #704
  • [Feat] Support gemm with stride by @smallscientist1 in #701
  • [Enhancement] Add eviction policy support for TMA operations, enhance CUDA codegen, and introduce new pass config by @xwhzz in #690
  • [Enhancement] Enhance the robustness and generality of MLA examples by @Rachmanino in #709
  • [Refactor] MergeAnnotations function to accept Map<Any, Any> instead of Map<String, Any> by @LeiWang1999 in #710
  • [Pipeline] Phaseout fragment and double buffer info from pipeline pass by @LeiWang1999 in #711
  • [Pipeline] Skip condition expression analysis for global reading by @LeiWang1999 in #713
  • [Index] Relocate Int64 Auto Promoter to ConfigBitWidth Pass, removing it from FlattenBuffer by @LeiWang1999 in #714
  • [CI] Bind build-test CI to NVIDIA as AMD runners are being introduced by @LeiWang1999 in #718
  • fix: NVRTC backend by @lucifer1004 in #717
  • [CUDA] Init support for sm_120 by @oraluben in #716
  • [Bugfix] Correct git configuration in docs CI by @xwhzz in #720
  • [Chore] fix typos by @lucifer1004 in #719
  • [CI][AMD] Add AMD GPU CI and fix some related bugs by @Alex4210987 in #694
  • [Carver][Bugfix] Correct score function for warp tile selection in tensorcore policy by @NaOHCC in #724
  • [Refactor] Refactor CUDA code generation to simplify eviction policy handling by @LeiWang1999 in #721
  • [Language] Introduce StridedTensor to support non-contiguous torch inputs by @LeiWang1999 in #722
  • [Enhancement][Bugfix] Fix bug in warp specialized pass and add gemm_sr fallback support for Hopper by @xwhzz in #712
  • 📝 Add docstrings to fix by @coderabbitai[bot] in #726
  • fix amd ci&add examples by @Alex4210987 in #729
  • [Feature] Low-bit twiddling dequantization and FP4 GEMM by @tzj-fxz in #725
  • 📝 Add docstrings to mxfp4 by @coderabbitai[bot] in #732
  • [Refactor] Refactor env into a more flexible version by @LeiWang1999 in #740
  • [Bugfix] Align stride index validation with torch in CythonKernelWrapper by @LeiWang1999 in #743
  • [Bugfix]: Fix atomic add auto vectorize memory access out of bound error by @yyttt6 in #742
  • 📝 Add docstrings to main by @coderabbitai[bot] in #745
  • [Refactor] Refactor barrier management by @LeiWang1999 in #744
  • [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy by @LeiWang1999 in #746
  • [Refactor] Merge ThreadPartialSync and ThreadStorageSync by @LeiWang1999 in #741
  • [Enhancement] Optimize loop body handling in IR by @chengyupku in #749
  • [MXFP4] Fix bugs and optimize exponential operation by @tzj-fxz in #750
  • [Enhancement] Add DispatchInstruction specialization for fp8 types in gemm_sm90.h by @LeiWang1999 in #751
  • [Enhancement] Add shape checking for reduce options by @kurisu6912 in #748
  • [Bugfix] Add missing FP8 header include by @LeiWang1999 in #752
  • [MXFP4] Add bias to MXFP4 GEMM kernel by @tzj-fxz in #753
  • [Bugfix][WS] Consider loop min e...
Read more

v0.1.5

05 Jun 08:32
a32009b


What's Changed

  • [Release] Bump version from 0.1.3 into 0.1.4 by @LeiWang1999 in #375
  • [Enhancement] Remove redundant recursive rewrite rule for FloorDiv in RewriteSimplifier by @LeiWang1999 in #408
  • [Docker] cu128 Support by @andyluo03 in #410
  • [Refactor] Phaseout python dependency attrs and decorator by @LeiWang1999 in #411
  • [Language] make linter and type checker happy with mocking by @YouJiacheng in #407
  • [Bugfix] Support larger than 256 box size tma copy by @LeiWang1999 in #413
  • [Enhancement] Add get_nvcc_compiler function to retrieve nvcc path by @LeiWang1999 in #414
  • Update lower.py to set default value for params by @Alex4210987 in #416
  • [Enhancement] Support Auto Layout Inference and Parallelism with variable constraint by @LeiWang1999 in #417
  • [Enhancement] Support to find Cython path more automatically by @FrozenGene in #418
  • [Refactor] Enhance layout inference logic in ParallelOp by @chengyupku in #420
  • [BugFix] Fix tvm simplify pass by @smallscientist1 in #421
  • [Enhancement] Add TMA+WS support in pipeline planning logic by @chengyupku in #422
  • [Language] Support tile operator T.cumsum by @LeiWang1999 in #423
  • Delete testing/python/language/test_tilelang_language_reduce_sum.py by @LeiWang1999 in #424
  • [Bugfix] Fix a bug for simplifier by @LeiWang1999 in #425
  • [Layout] Enhance layout inference pass by @LeiWang1999 in #427
  • [Enhancement] Remove DeReplicate during parallel loop layout inference by @LeiWang1999 in #430
  • [Bugfix] Fix the test data distribution of cumsum by @LeiWang1999 in #432
  • [Enhancement] Support cute mma tile mxn8ky by @LeiWang1999 in #434
  • [Bugfix] Removed the behavior that treated global -> local as a copy operation. by @LeiWang1999 in #435
  • [Language] Support accumulative T.reduce_sum by @LeiWang1999 in #436
  • [Bugfix] fix the unexpected keyword error of autotune by @yyttt6 in #438
  • [Testing] Add atomic add test by @LeiWang1999 in #439
  • [Typo] Rename warp_source to wrap_source by @lucifer1004 in #440
  • [Refactor] Update KernelLaunch to clarify block name by @LeiWang1999 in #441
  • [Enhancement] Reduce CPU overhead during kernel execution by @Cunxiao2002 in #437
  • [Enhancement] Improve layout inference accuracy in ParallelOp by @LeiWang1999 in #442
  • [Bugfix] Fix layout inference for free fragment buffer by @LeiWang1999 in #443
  • Bump transformers from 4.48.0 to 4.50.0 in /examples/bitnet-1.58b by @dependabot in #444
  • [Language] Support explicit programming for identified warp groups by @LeiWang1999 in #445
  • [Bugfix] Fix safe memory legalization for fragment store by @LeiWang1999 in #446
  • [Refactor] Separate warp specialize rewriter and tma barrier injector pass by @LeiWang1999 in #447
  • [Enhancement] Add new examples for warp specialization and TMA integration by @LeiWang1999 in #448
  • [Refactor] Phaseout torch>=2.2.0 dependency by @LeiWang1999 in #451
  • [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP by @LeiWang1999 in #450
  • [Enhancement] Introduce pass_configs parameter for kernel Caching by @LeiWang1999 in #452
  • [Feature] Add cache directory management functions in tilelang.cache by @LeiWang1999 in #453
  • [Bugfix] Fix get_swizzle_layout implementation. by @cherichy in #455
  • [Refactor] Update barrier functions and add new example for GEMM with warp specialization by @LeiWang1999 in #456
  • [Refactor] Include examples in CI by @LeiWang1999 in #457
  • docs: add llvm version info to installation.md. by @AsakusaRinne in #459
  • [CI] Add elementwise and gemv examples to CI. by @Cunxiao2002 in #458
  • [Bugfix] Fix for T.copy with dynamic range by @LeiWang1999 in #462
  • [Bugfix] Fix copy region automation for dynamic extent by @LeiWang1999 in #465
  • [Feature] Implement fast integer power operation and related API by @LeiWang1999 in #466
  • [Typo] Rename power_of_int with pow_of_int for consistency by @LeiWang1999 in #468
  • [CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI by @tzj-fxz in #467
  • [Refactor] Update set_compile_args to allow None for out_idx parameter by @LeiWang1999 in #469
  • [Refactor] Simplify buffer_region_to_tile_region function in copy.py by @LeiWang1999 in #470
  • [CI] Add Convolution example to CI by @xwhzz in #473
  • [BugFix] Correct argparse for example_convolution test by @xwhzz in #474
  • [Refactor] set USE_LLVM to optional. by @hyx1999 in #476
  • [CI] Add Analyzer and blocksparse_attention examples to CI by @yyttt6 in #472
  • [Refactor] Skip patchelf if not installed by @LeiWang1999 in #477
  • [Refactor] Improve layout equality checks and error messaging by @LeiWang1999 in #471
  • [Doc] Update version retrieval in conf.py to read from VERSION file by @xwhzz in #478
  • Fix Device Consistency in Autotuner Threads and Add Manual Profiler Check by @yuanjypku in #481
  • [Bugfix] Check CUDA target before checking for TMA by @gau-nernst in #482
  • [Bugfix] Use AutoTune cache_input_tensors properly by @yyttt6 in #483
  • Revert "[Bugfix] Use AutoTune cache_input_tensors properly" by @LeiWang1999 in #488
  • [Enhancement] Support register input for gemm when trans_a or trans_b is true by @LeiWang1999 in #490
  • [CI] Add flash_decoding example to CI by @xuchangtolearn in #487
  • [CI] Add Reminder Bot for pull request contributions by @xwhzz in #491
  • [Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example by @LeiWang1999 in #494
  • [Enhancement] Introduce flag to visualize shared memory merge plan by @LeiWang1999 in #496
  • [Refactor] Update main function structure in example scripts and add tests by @chengyupku in #475
  • [Bugfix] Fix Hopper GEMM layout for small tile size by @LeiWang1999 in #497
  • [Enhancement] Fallback transposed_ldmatrix into SM75_U16x4_LDSM_N when warp_n is 8 by @LeiWang1999 in #498
  • [Bugfix] Rename SM75_U16x8_LDSM_N to SM75_U16x8_LDSM_T to reflect correct matrix type by @LeiWang1999 in #499
  • [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility by @LeiWang1999 in #500
  • [Refactor] Update JIT kernel functions and streamline GEMM tests by @LeiWang1999 in #501
  • Fix AMD Docker issues related to conda environment setup by @Hamerlate in #503
  • [Refactor] Refactor jit to _JitImplementation to support @tilelang.jit by @LeiWang1999 in #502
  • [Refactor] Adjust in fragment GEMM layout by @LeiWang1999 in #504
  • [Refactor] Update GlobalMemChecker to Detect Lower Bound illegal memory access automatically by @LeiWang1999 in #505
  • [Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling and initialization by @LeiWang1999 in #507
  • [Refactor] Update buffer handling in layout transformation to support layout on T.view by @LeiWang1999 in #509
  • [Bugfix] Enhance smem copy selector for uncommon shape by @LeiWang1999 in https://github.com/tile-ai/tilelang...
Read more

v0.1.4

18 Apr 09:14
a41a473


What's Changed

  • [Bugfix] Support T.clear for let binding by @LeiWang1999 in #268
  • [Bugfix] Add TMA and Producer Buffer Analysis in Warp Specialized Rewriter by @chengyupku in #269
  • [Refactor] Improve flash attention example and layout comparison logic by @LeiWang1999 in #270
  • [Bugfix] Add CUDA availability check in CtypesKernelAdapter by @XueSongTap in #267
  • [CI] Add gemm performance test by @xwhzz in #274
  • [Language] Introduce T.ptr and T.Tensor by @LeiWang1999 in #276
  • [Refactor] Enhance Autotune by @yyttt6 in #266
  • [Refactor] Update cache key generation in KernelCache by @LeiWang1999 in #283
  • [Docs][Tutorial] Add tutorial for auto-tuning by @yyttt6 in #285
  • [Refactor] Deprecated T.Buffer as arguments and rename related calls into T.Tensor by @LeiWang1999 in #281
  • [Doc] Update README.md to correct documentation link for TileLang debug tools by @chengyupku in #286
  • [Feature] Introduce NoSetMaxNReg for warp specialization by @chengyupku in #289
  • [Language] Proxy tvm ir to make linter happy by @LeiWang1999 in #287
  • [Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 by @LeiWang1999 in #291
  • [Doc] Update Python API docs generation by @xwhzz in #278
  • [Doc] Remove citation page by @LeiWang1999 in #292
  • [Dev] Correcting cxx compiler by @penguin-wwy in #294
  • [doc/example] add gemv doc and examples by @botbw in #293
  • [Feature] Implement ParallelLoopTransformer for enhanced loop analysis by @LeiWang1999 in #295
  • [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h by @LeiWang1999 in #297
  • [Refactor] Improve documentation and add detailed docstrings across multiple modules by @LeiWang1999 in #298
  • [Bugfix] Correct method call for block reduction check when analyzing memory footprint by @NaOHCC in #299
  • [Dynamic Symbolic] Refactor passes with dynamic symbolic and check shape bound precisely by @tzj-fxz in #302
  • Add autotune to conv example by @yyttt6 in #301
  • [Bugfix] Resolve autotuner bugs for blocksparse GEMM example by @tth37 in #300
  • [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError by @LeslinD in #305
  • [Enhancement] Add support for CUDA architecture 8.9 in GEMM template by @LeiWang1999 in #304
  • [BugFix] Fix unintended Git config overrides in CI runners by @xwhzz in #306
  • [Cache] Implement in-memory cache by @LeiWang1999 in #308
  • [Bugfix] Updated autotune usage in the examples to align with the latest changes by @LeiWang1999 in #309
  • [Bugfix] Fix dynamic axis with variable extent by @LeiWang1999 in #311
  • [Bugfix] Fix layout conflict issue for gqa decoding examples by @LeiWang1999 in #314
  • [Bugfix] Fixed the handling logic of IfThenElseNode in if_stmt_binding by @chengyupku in #315
  • [Bugfix] Fix logic error in ReduceOp when handling CUDA architecture by @chengyupku in #316
  • [CostModel] Introduce cuda driver api to get precise shared memory capacity by @LeiWang1999 in #317
  • [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support by @chengyupku in #320
  • [Tools] Summarize TFLOPS Information from a tilelang program by @yyttt6 in #321
  • Support block_N sizes that are 2^n in deepgemm example by @zcnrex in #319
  • [Feat] Enhance CUDA Property Handling by @LeiWang1999 in #322
  • [Bugfix] add a patch to fix T.abs on float16 by @botbw in #325
  • [AMD] Adapt rocm and support T.gemm with transpose_b=False for amd backend by @LeiWang1999 in #327
  • [Dynamic Symbolic] Adaptively vectorize with different condition expressions by @tzj-fxz in #326
  • [Bugfix] Fix fragment layout annotation in example gqa decode by @LeiWang1999 in #329
  • [AMD] Support Transpose_A=True and GEMM_RS for hip backend by @LeiWang1999 in #331
  • [Refactor] Optimize RMS normalization kernel in rms_norm.py by @chengyupku in #333
  • [AMD] Fix for missing composable kernel include path when compile kernels on amd gpus by @LeiWang1999 in #334
  • [Example] Add sparse gqa decode example by @xiayuqing0622 in #332
  • [Enhancement] Enhance FP8/FP4 type handling in CUDA codegen by @LeiWang1999 in #323
  • [Doc] Fix typo and heading level in GEMV tutorial by @yeh-sudo in #337
  • [Dev] Add Group Cast FP8 Example by @chengyupku in #338
  • [Enhancement] Support region padding when convert buffer load to buffer region by @LeiWang1999 in #342
  • [Example] Add triton block sparse gqa decode by @YizhaoGao in #341
  • [Enhancement] Support index bit width configuration by @LeiWang1999 in #343
  • [Bugfix] Fix X_amax Correctness Issue in Group Cast FP8 by @chengyupku in #345
  • [Bugfix] Fix Transposed Fragment Layout for amd GEMM_RS matrix core by @LeiWang1999 in #346
  • [AutoTune] Refactor AutoTuneArtifact to utilize kernel as context instead of profiler by @LeiWang1999 in #344
  • [Bugfix] Compile/"cached" still not loading cached kernel for example in example_mha_bwd by @Alex4210987 in #339
  • [Refactor] Implement thread-local storage for FrameStack in frame.py and kernel.py by @LeiWang1999 in #352
  • [Typo] Replace kernel.func with kernel in mla benchmark scripts by @LeiWang1999 in #354
  • [AMD][Docker] Create Dockerfile for ROCm environment setup by @LeiWang1999 in #355
  • [Enhancement] Update group_per_split_token_cast_to_fp8 to support multiple data types by @chengyupku in #356
  • [Enhancement] Support pass config disable_warp_specialize to disable auto specialization on hopper by @LeiWang1999 in #357
  • [Example] Introduce autotuning example for GEMM with enhanced configuration options by @chengyupku in #360
  • [Example] Handle Scenarios in Which a Threadblock is Assigned Only Invalid Block Indices for Sparse Attention by @xiayuqing0622 in #361
  • [Bugfix] Correct dynamic shared memory size error handling in HIP by @LeiWang1999 in #362
  • [AMD] Implement Deepseek MLA for AMD by @LeiWang1999 in #363
  • [Bugfix] Fix compilation issues for amd cdna element size check by @LeiWang1999 in #364
  • [AMD] Support FlashMLA with num split template for AMD gpus by @LeiWang1999 in #366
  • [MLA][AMD] Add amd mla benchmarking by @LeiWang1999 in #367
  • [Bugfix] Adjust Autotuner threadpool max_workers limit to available CPUs by @tth37 in #368
  • [Language] Introduce T.any_of and T.all_of to reduce a bool array by @LeiWang1999 in #371
  • [AMD][Setup] Support HIP in setup.py by @zhhangBian in #369
  • [Typo] Remove debug print by @LeiWang1999 in #373
  • [Docs] Add AMD Flash MLA Documentation to Tutorials Section by @LeiWang1999 in #376
  • [Bugfix] Add filelock for cython build by @LeiWang1999 in #377
  • [Typo] Remove unused comments generated by copilot by @LeiWang1999 in #379
  • [Doc] Add deepseek_mla to documentation index by @LeiWang1999 in #380
  • [Refactor] Remove debug message in pass legalize_safe_memory_access by @LeiWang1999 in #381
  • [Enhancement][Pipeline] More precise copy code block detection in pipeline by ...
Read more