v0.1.8 · tile-ai/tilelang · Discussion #1854

LeiWang1999
Feb 16, 2026
Maintainer

What's Changed

[Bugfix][Build] Update CMake configuration to remove project root injection for sys.path by @LeiWang1999 in [Bugfix][Build] Update CMake configuration to remove project root injection for sys.path #1385
[BugFix] Fix split kernel layout bug of GQA decode by @tzj-fxz in [BugFix] Fix split kernel layout bug of GQA decode #1386
[Feat] Add better repr print for Layout and Fragment by @kurisu6912 in [Feat] Add better repr print for Layout and Fragment #1392
[Doc] Logging docs for Tilelang/TVM by @SiriusNEO in [Doc] Logging docs for Tilelang/TVM #1395
[Enhancement] Refactor inflight computing to support dynamic pipeline extents by @LeiWang1999 in [Enhancement] Refactor inflight computing to support dynamic pipeline extents #1399
[AMD] Fix 3 bugs when build docker on amd mi3x gpu by @danielhua23 in [AMD] Fix 3 bugs when build docker on amd mi3x gpu #1401
[Typo] Fix tilelang link in README.md by @senlyu163 in [Typo] Fix tilelang link in README.md #1402
[Dependency] Update apache-tvm-ffi version to >=0.1.2 by @LeiWang1999 in [Dependency] Update apache-tvm-ffi version to >=0.1.2 #1400
[AMD] Enable FA2 fwd on AMD MI300X by @danielhua23 in [AMD] Enable FA2 fwd on AMD MI300X #1406
[Typo] fix typo for SM120 by @Cunxiao2002 in [Typo] fix typo for SM120 #1408
[Doc] Minor documentation update by @LeiWang1999 in [Doc] Minor documentation update #1410
[Dependency] Add torch-c-dlpack-ext to project requirements by @LeiWang1999 in [Dependency] Add torch-c-dlpack-ext to project requirements #1403
[Bugfix] Alloc T.make_tensor not on the top of prim_func by @LeiWang1999 in [Bugfix] Alloc T.make_tensor not on the top of prim_func #1412
[Enhancement] Introduce T.__ldg by @LeiWang1999 in [Enhancement] Introduce T.__ldg #1414
[Enhancement] Improve vectorization invariant check by @LJC00118 in [Enhancement] Improve vectorization invariant check #1398
[Lint] Phaseout Yapf format and embrace ruff format by @LeiWang1999 in [Lint] Phaseout Yapf format and embrace ruff format #1417
[Atomic] Use ptr for atomicAdd dst instead of reference by @LeiWang1999 in [Atomic] Use ptr for atomicAdd dst instead of reference #1425
[CUDA] Add read-only parameter annotation for CUDA codegen by @LeiWang1999 in [CUDA] Add read-only parameter annotation for CUDA codegen #1416
[Refactor] Phase out the primitives folder since its design has been merged into tileop by @LeiWang1999 in [Refactor] Phase out the primitives folder since its design has been merged into tileop #1429
[CI]: Bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in [CI]: Bump actions/upload-artifact from 5 to 6 #1431
[CI]: Bump actions/download-artifact from 6 to 7 by @dependabot[bot] in [CI]: Bump actions/download-artifact from 6 to 7 #1432
[Bugfix] Convey compile_flags to ffi compilation path with pass_configs by @LeiWang1999 in [Bugfix] Convey compile_flags to ffi compilation path with pass_configs #1434
[Enhancement] Improve buffer usage tracking in MakePackedAPI by @LeiWang1999 in [Enhancement] Improve buffer usage tracking in MakePackedAPI #1435
[Enhancement] Improve InjectAssumes logic and make assumes work after SplitHostDevice by @SiriusNEO in [Enhancement] Improve InjectAssumes logic and make assumes work after SplitHostDevice #1405
[Enhancement] Include PrimFunc name in memory cache logs for better debugging by @LeiWang1999 in [Enhancement] Include PrimFunc name in memory cache logs for better debugging #1437
[CI] Update lint dependencies and fix lint on trunk by @XuehaiPan in [CI] Update lint dependencies and fix lint on trunk #1433
[Enhancement] Refactor vectorization checks in loop_vectorize by @LeiWang1999 in [Enhancement] Refactor vectorization checks in loop_vectorize #1440
[Enhancement] Implement vectorized FP8 to FP32 cast by @LJC00118 in [Enhancement] Implement vectorized FP8 to FP32 cast #1438
[Feature] Support region as input of T.cumsum by @Dayuxiaoshui in [Feature] Support region as input of T.cumsum #1426
[Fix] Fix analyzer bind conflicting bug in #1442 by @kurisu6912 in [Fix] Fix analyzer bind conflicting bug in #1442 #1446
[Refactor] Reduce direct dependency on PyTorch due to its limited type support by @LeiWang1999 in [Refactor] Reduce direct dependency on PyTorch due to its limited type support #1444
[Refactor] Use pytest.mark.parameterize to speedup parallel testing by @kurisu6912 in [Refactor] Use pytest.mark.parameterize to speedup parallel testing #1447
[Docs] Improve installation instructions for developers by @SiriusNEO in [Docs] Improve installation instructions for developers #1450
[Feat] Integrate Z3 in TVM Arith Analyzer by @kurisu6912 in [Feat] Integrate Z3 in TVM Arith Analyzer #1367
[Bugfix] Improve autotune from elementwise_add function in examples by @senlyu163 in [Bugfix] Improve autotune from elementwise_add function in examples #1445
[Language] Introduce T.annotate_restrict_buffers by @LeiWang1999 in [Language] Introduce T.annotate_restrict_buffers #1428
[Analyzer] Require loop extent > 0 when entering loop (#1012) by @kurisu6912 in [Analyzer] Require loop extent > 0 when entering loop (#1012) #1451
[BugFix] Update CI to ROCm-7.1 by @Gongen-Ali in [BugFix] Update CI to ROCm-7.1 #1449
[Enhancement] Update examples and tests for improved type handling functionality by @LeiWang1999 in [Enhancement] Update examples and tests for improved type handling functionality #1448
[Issue Template] Enable blank issues in GitHub issue template by @LeiWang1999 in [Issue Template] Enable blank issues in GitHub issue template #1453
[CI] Moved the clang-tidy step to after pip install by @LeiWang1999 in [CI] Moved the clang-tidy step to after pip install #1456
[Bug] Fix tvm build script when patchelf is not found by @kurisu6912 in [Bug] Fix tvm build script when patchelf is not found #1459
[Analyzer] Fix floordiv & floormod bug in z3 prover by @kurisu6912 in [Analyzer] Fix floordiv & floormod bug in z3 prover #1458
[Cache] Rename sparse compress cache directory by @LeiWang1999 in [Cache] Rename sparse compress cache directory #1460
[Language]Adds a random number generation capability through curand_kernel by @silentCoder-dev in [Language]Adds a random number generation capability through curand_kernel #1461
remove unused duplicated type check by @sgjzfzzf in remove unused duplicated type check #1462
feat(cutedsl): add CuTeDSL backend by @lucifer1004 in feat(cutedsl): add CuTeDSL backend #1421
[Refactor] Rename test for curand & add triton baseline in test_tilelang_language_rand.py by @silentCoder-dev in [Refactor] Rename test for curand & add triton baseline in test_tilelang_language_rand.py #1464
[ArgBinder] Enhance shape variable handling and assertions by @LeiWang1999 in [ArgBinder] Enhance shape variable handling and assertions #1467
[Language] Make TL scripts friendly to Python syntax highlights by @SiriusNEO in [Language] Make TL scripts friendly to Python syntax highlights #1466
[Refactor] Remove triton dependence in testing & move triton baseline into examples by @silentCoder-dev in [Refactor] Remove triton dependence in testing & move triton baseline into examples #1470
[Language] Enhance T.dtype.as_torch conversion for compatibility by @LeiWang1999 in [Language] Enhance T.dtype.as_torch conversion for compatibility #1473
[News] update with latest news by @LeiWang1999 in [News] update with latest news #1475
[Enhancement] Use static Z3 context by @LeiWang1999 in [Enhancement] Use static Z3 context #1482
[Enhancement] Enhance let binding handling in layout inference and warp specialized pass by @LeiWang1999 in [Enhancement] Enhance let binding handling in layout inference and warp specialized pass #1484
[Refactor] Phaseout PassConfig kDisableDynamicTailSplit and kDynamicAlignment as they are legacy by @LeiWang1999 in [Refactor] Phaseout PassConfig kDisableDynamicTailSplit and kDynamicAlignment as they are legacy #1486
[Enhancement] Optimize the time cost of critical path for IntervalSetEvaluator by @LeiWang1999 in [Enhancement] Optimize the time cost of critical path for IntervalSetEvaluator #1491
[CI] Add performance regression test script by @xwhzz in [CI] Add performance regression test script #1489
Pin nvidia-cutlass-dsl to 4.3.3 by @lucifer1004 in Pin nvidia-cutlass-dsl to 4.3.3 #1497
[Language] Remove ConstIf Frame for Better Meta-Programming by @kurisu6912 in [Language] Remove ConstIf Frame for Better Meta-Programming #1496
[Bugfix][CI] Fix concurrency bug in regression test workflow by @xwhzz in [Bugfix][CI] Fix concurrency bug in regression test workflow #1500
[Refactor] Phaseout legacy alloc_local statement in examples and introduce processing for floating fragment buffers by @LeiWang1999 in [Refactor] Phaseout legacy alloc_local statement in examples and introduce processing for floating fragment buffers #1495
[Enhancement] Optimize MHA varlen fwd and support autotune by @Rachmanino in [Enhancement] Optimize MHA varlen fwd and support autotune #1499
[Enhancement] Refactor CUDA vectorized cast generation and remove unsupported FP8 type by @LJC00118 in [Enhancement] Refactor CUDA vectorized cast generation and remove unsupported FP8 type #1474
[Dependency] Update apache-tvm-ffi to >=0.1.6 for memory safety when gc is not enabled by @LeiWang1999 in [Dependency] Update apache-tvm-ffi to >=0.1.6 for memory safety when gc is not enabled #1502
Update cutedsl docs and version check by @lucifer1004 in Update cutedsl docs and version check #1503
[Misc] configure pymarkdown by @lucifer1004 in [Misc] configure pymarkdown #1505
[Language] Fix gemm syntax highlight by @SiriusNEO in [Language] Fix gemm syntax highlight #1476
[Fix] Fix TL_ENABLE_PTXAS_VERBOSE_OUTPUT has no effect in tvm-ffi by @kurisu6912 in [Fix] Fix TL_ENABLE_PTXAS_VERBOSE_OUTPUT has no effect in tvm-ffi #1511
[Refactor] Phaseout execution_backend ctypes by @LeiWang1999 in [Refactor] Phaseout execution_backend ctypes #1510
[Testing] Add Memory Leak Test by @kurisu6912 in [Testing] Add Memory Leak Test #1516
[Refactor] Support auto swizzling for tma store and phaseout related layout annotations by @LeiWang1999 in [Refactor] Support auto swizzling for tma store and phaseout related layout annotations #1509
[CuTeDSL][Fix] thread safety + context safety by @lucifer1004 in [CuTeDSL][Fix] thread safety + context safety #1513
[BugFix] Phaseout unused tests for gqa decode kernels and add the kernels to CI by @tzj-fxz in [BugFix] Phaseout unused tests for gqa decode kernels and add the kernels to CI #1515
[Cleanup] Remove unnecessary macros in tilelang examples by @Rachmanino in [Cleanup] Remove unnecessary macros in tilelang examples #1514
Fix ramp_lanes calculation in CUDA codegen by @LJC00118 in Fix ramp_lanes calculation in CUDA codegen #1518
[Misc] add env for default target/backend/verbose by @lucifer1004 in [Misc] add env for default target/backend/verbose #1512
[Dtype] Improve host codegen handling for subtype by @LeiWang1999 in [Dtype] Improve host codegen handling for subtype #1517
[Bugfix] Fallback to a Linear Layout instead of raising errors by @LeiWang1999 in [Bugfix] Fallback to a Linear Layout instead of raising errors #1521
Use TargetIsCuda for all cuda target by @oraluben in Use TargetIsCuda for all cuda target #1522
Fix fp4 pointer arithmetic in CUDA codegen by @LJC00118 in Fix fp4 pointer arithmetic in CUDA codegen #1524
[Enhancement] Improve GitHub Actions permissions check and refine performance regression testing by @xwhzz in [Enhancement] Improve GitHub Actions permissions check and refine performance regression testing #1519
[Release] Bump version into 0.1.7.post1 by @LeiWang1999 in [Release] Bump version into 0.1.7.post1 #1506
[Pipeline] Refactor buffer allocation in Inject Pipeline Pass by @LeiWang1999 in [Pipeline] Refactor buffer allocation in Inject Pipeline Pass #1525
[Dev] Fix when build local version with isolated build by @oraluben in [Dev] Fix when build local version with isolated build #1487
[Bugfix] Skip stride check for subtype by @LeiWang1999 in [Bugfix] Skip stride check for subtype #1531
[Lint] Enable whitespace and permission bit hooks by @XuehaiPan in [Lint] Enable whitespace and permission bit hooks #1439
[Enhancement][Tool] Tree-style pretty ASTPrinter by @SiriusNEO in [Enhancement][Tool] Tree-style pretty ASTPrinter #1468
[Fix] Add support for non-var complement arithmetic computation (#1374) by @kurisu6912 in [Fix] Add support for non-var complement arithmetic computation (#1374) #1533
[BugFix] Complete vectorized loading for common dtypes by @SiriusNEO in [BugFix] Complete vectorized loading for common dtypes #1536
[Compat] Add CUDA version check for __nv_fp8_e8m0 type by @LeiWang1999 in [Compat] Add CUDA version check for __nv_fp8_e8m0 type #1537
[BugFix] Fix bugs of varlen attention forward examples caused by S_q != S_kv by @hukongyi in [BugFix] Fix bugs of varlen attention forward examples caused by S_q != S_kv #1530
[Bug] Fix hanging from reduction on sm120 by @PannenetsF in [Bug] Fix hanging from reduction on sm120 #1540
[example] use T.dynamic instead of tvm.te.var by @botbw in [example] use T.dynamic instead of tvm.te.var #1538
[Enhancement] Refactor KernelCache to use inheritance-based design by @sgjzfzzf in [Enhancement] Refactor KernelCache to use inheritance-based design #1483
[Bugfix] Avoid considering local.var buffer as local by @LeiWang1999 in [Bugfix] Avoid considering local.var buffer as local #1541
[Bugfix] Fix of T.Fill for local.var by @LeiWang1999 in [Bugfix] Fix of T.Fill for local.var #1543
[Z3] Change z3 timeout to rlimit for deterministic prove behavior by @kurisu6912 in [Z3] Change z3 timeout to rlimit for deterministic prove behavior #1542
[Feat] Adapt gemm v2 for cutedsl backend by @lucifer1004 in [Feat] Adapt gemm v2 for cutedsl backend #1544
[Enhancement] Support larger H in deepseek sparse mla backward via split-H by @Rachmanino in [Enhancement] Support larger H in deepseek sparse mla backward via split-H #1548
[Bugfix] Fix regression test to use installed package instead of source directory by @xwhzz in [Bugfix] Fix regression test to use installed package instead of source directory #1550
[Refactor] Introduce layout annotations for ParallelOPNode and CopyNode by @LeiWang1999 in [Refactor] Introduce layout annotations for ParallelOPNode and CopyNode #1539
[Script] Provide regression test script to help benchmark regression in local env by @LeiWang1999 in [Script] Provide regression test script to help benchmark regression in local env #1551
[Typing] Update Kernel signature and add type hints for buffer operations by @clouds56 in [Typing] Update Kernel signature and add type hints for buffer operations #1545
[CI]: Bump actions/upload-artifact from 4 to 6 by @dependabot[bot] in [CI]: Bump actions/upload-artifact from 4 to 6 #1555
[Refactor] Use cuda capability from torch to be more generic by @oraluben in [Refactor] Use cuda capability from torch to be more generic #1557
[CI]: Bump actions/github-script from 7 to 8 by @dependabot[bot] in [CI]: Bump actions/github-script from 7 to 8 #1556
[Host] Provide post process to customize host code and enhance nullable check by @LeiWang1999 in [Host] Provide post process to customize host code and enhance nullable check #1562
[Release] Build tilelang against CUDA 13.1 in CI by @oraluben in [Release] Build tilelang against CUDA 13.1 in CI #1532
[LazyJIT] Move Type Annotations to Function Body by @kurisu6912 in [LazyJIT] Move Type Annotations to Function Body #1480
[bugfix] fix missing clear_accum logic for gemm_sp_v2 by @botbw in [bugfix] fix missing clear_accum logic for gemm_sp_v2 #1563
[Misc] Remove unused tl_pipeline_sync. by @c8ef in [Misc] Remove unused tl_pipeline_sync. #1566
[Refactor] Improve scalarization handling in Pass VectorizeLoop by @LeiWang1999 in [Refactor] Improve scalarization handling in Pass VectorizeLoop #1565
[Refactor] Simplify do_bench calls by using default warmup and rep parameters by @LeiWang1999 in [Refactor] Simplify do_bench calls by using default warmup and rep parameters #1568
[CI] Refactor PR regression test job conditions by @xwhzz in [CI] Refactor PR regression test job conditions #1569
[Parallel][Infer] Free-mode chooses minimal replication between buffer-based and PlanLoopPartition by @LeiWang1999 in [Parallel][Infer] Free-mode chooses minimal replication between buffer-based and PlanLoopPartition #1559
[Refactor] Enhance deterministic ordering in shared memory allocation merge. by @LeiWang1999 in [Refactor] Enhance deterministic ordering in shared memory allocation merge. #1570
[Enhancement] Improve equality checks in layout nodes and fragment validation by @LeiWang1999 in [Enhancement] Improve equality checks in layout nodes and fragment validation #1573
[Feature] add kUseCooperativeLaunch tag for tvm_ffi by @silentCoder-dev in [Feature] add kUseCooperativeLaunch tag for tvm_ffi #1572
[Refactor] Remove unnecessary logging configuration in Analyzer.py by @LeiWang1999 in [Refactor] Remove unnecessary logging configuration in Analyzer.py #1574
[Release] Bump version to 0.1.7.post2 by @LeiWang1999 in [Release] Bump version to 0.1.7.post2 #1575
[BugFix] Change default rounding mode for fp4 conversions by @LJC00118 in [BugFix] Change default rounding mode for fp4 conversions #1580
[CI] Add CUDA-aware pytest scheduler + auto workers by @LeiWang1999 in [CI] Add CUDA-aware pytest scheduler + auto workers #1584
[Enhancement] Improve performance regression output with timing and streaming by @xwhzz in [Enhancement] Improve performance regression output with timing and streaming #1585
[Bugfix] Add kernel_global_source property to TVMFFIKernelAdapter by @haok1402 in [Bugfix] Add kernel_global_source property to TVMFFIKernelAdapter #1589
[BugFix] Add PrimExpr substitution support for AttrStmt nodes by @LJC00118 in [BugFix] Add PrimExpr substitution support for AttrStmt nodes #1583
[BugFix] fix tcgen5mma example by @Rachmanino in [BugFix] fix tcgen5mma example #1577
[Refactor] Use access_ptr instead of buffer and offsets for cp async params by @LeiWang1999 in [Refactor] Use access_ptr instead of buffer and offsets for cp async params #1590
[Layout] Support annotating loop layout in frontend by @LeiWang1999 in [Layout] Support annotating loop layout in frontend #1579
[Typo] Rename loop layout annotation test by @LeiWang1999 in [Typo] Rename loop layout annotation test #1596
[Fix] Add register to read A ptr in test_tilelang_language_cooperative.py by @silentCoder-dev in [Fix] Add register to read A ptr in test_tilelang_language_cooperative.py #1593
[Feat] PDL Support by @w169q169 in [Feat] PDL Support #1494
[Enhancement][Subtype] Enhance symbolic shape/stride handling for subtype by @LeiWang1999 in [Enhancement][Subtype] Enhance symbolic shape/stride handling for subtype #1599
[Fix][CuteDSL] add support for tanh/tanhf (fixes #1595) by @lucifer1004 in [Fix][CuteDSL] add support for tanh/tanhf (fixes #1595) #1597
[Release] Fix race condition when publishing by @oraluben in [Release] Fix race condition when publishing #1578
Add conversion from cutlass::float_e4m3/e5m2 to tl::float_e4m3/e5m2 by @LJC00118 in Add conversion from cutlass::float_e4m3/e5m2 to tl::float_e4m3/e5m2 #1600
[Enhancement][AMD] Add preshuffle fp8 gemm example on amd. by @Gongen-Ali in [Enhancement][AMD] Add preshuffle fp8 gemm example on amd. #1605
[Bugfix] Mangle Single Precision Mathematical Functions of cuda math api by @silentCoder-dev in [Bugfix] Mangle Single Precision Mathematical Functions of cuda math api #1602
[Bugfix] Open Rocm ci test and fix some bugs. by @Gongen-Ali in [Bugfix] Open Rocm ci test and fix some bugs. #1443
[Feature] Add more curand operations & support vectorization by @silentCoder-dev in [Feature] Add more curand operations & support vectorization #1582
[Enhancement] Allow import tilelang on CPU-only machines without CUDA libraries by @XuehaiPan in [Enhancement] Allow import tilelang on CPU-only machines without CUDA libraries #1481
[BugFix] Add pre-commit to requirements-dev.txt by @asaadkhaja99 in [BugFix] Add pre-commit to requirements-dev.txt #1611
[BugFix] Fix some bugs in lowering ParallelOp and VectorizeLoop by @SiriusNEO in [BugFix] Fix some bugs in lowering ParallelOp and VectorizeLoop #1607
[Feat] Add strong checker to detect data racing in T.Parallel by @kurisu6912 in [Feat] Add strong checker to detect data racing in T.Parallel #1615
[Feature] add T.sync_warp & T.shfl_sync; change extern pdl into intrin by @silentCoder-dev in [Feature] add T.sync_warp & T.shfl_sync; change extern pdl into intrin #1614
[RaceChecker] RaceChecker report warning rather than error for backward compatibility by @kurisu6912 in [RaceChecker] RaceChecker report warning rather than error for backward compatibility #1620
[BugFix] Fix ForwardRef usage in v2 frontend (#1619) by @kurisu6912 in [BugFix] Fix ForwardRef usage in v2 frontend (#1619) #1621
[Refactor] Move ConstrVisitor to src/transform/common/constr_visitor.h for reuse by @silentCoder-dev in [Refactor] Move ConstrVisitor to src/transform/common/constr_visitor.h for reuse #1622
[Feat] Improve T.reduce_absmax to use less abs call by @kurisu6912 in [Feat] Improve T.reduce_absmax to use less abs call #1626
[Bugfix] Do not consider local.var as local buffer during LowerTileOP by @LeiWang1999 in [Bugfix] Do not consider local.var as local buffer during LowerTileOP #1628
[Feature] Add hoist_broadcast_values pass by @silentCoder-dev in [Feature] Add hoist_broadcast_values pass #1606
[Enhancement][CUDA] Support nvidia-cuda-nvcc as nvcc by @clouds56 in [Enhancement][CUDA] Support nvidia-cuda-nvcc as nvcc #1528
[Bugfix] Fallback into full region when dynamic buffer read region cannot be proved by @LeiWang1999 in [Bugfix] Fallback into full region when dynamic buffer read region cannot be proved #1618
[Feat] Allow print macro call stack in device assert by @kurisu6912 in [Feat] Allow print macro call stack in device assert #1616
[BugFix] Correct index_map selection for transposed A matrix in MFMA Layout with k_dim==4 and open rocm-ci for gemmsr by @benenzhu in [BugFix] Correct index_map selection for transposed A matrix in MFMA Layout with k_dim==4 and open rocm-ci for gemmsr #1627
[Example] Add Seesaw Sparse MLA Forward Kernel for DeepSeek-V3.2 by @hammersam in [Example] Add Seesaw Sparse MLA Forward Kernel for DeepSeek-V3.2 #1636
[Bugfix] Introduce a flag to avoid unnecessary broadcast hoist and enable for let stmt by @LeiWang1999 in [Bugfix] Introduce a flag to avoid unnecessary broadcast hoist and enable for let stmt #1638
[Refactor][CI] Reduce sparse related test time by @LeiWang1999 in [Refactor][CI] Reduce sparse related test time #1637
[Refactor] Unify @jit and @lazy_jit into a single @jit decorator by @LeiWang1999 in [Refactor] Unify @jit and @lazy_jit into a single @jit decorator #1632
[Bugfix] Fix pdl related intrin handling to avoid strict annotation codegen by @LeiWang1999 in [Bugfix] Fix pdl related intrin handling to avoid strict annotation codegen #1650
[Bugfix] reverted unexpected tvm changes by @LeiWang1999 in [Bugfix] reverted unexpected tvm changes #1651
[Bugfix] reverted unexpected tvm changes by @LeiWang1999 in [Bugfix] reverted unexpected tvm changes #1652
[Refactor] Move dtypes.py from eager to language and add bits/bytes properties by @LeiWang1999 in [Refactor] Move dtypes.py from eager to language and add bits/bytes properties #1646
[Feat] Allow dangling producer in wasp pipeline planning (#1263) by @kurisu6912 in [Feat] Allow dangling producer in wasp pipeline planning (#1263) #1647
[bugfix] fix smem alloc for single warp reduce by @botbw in [bugfix] fix smem alloc for single warp reduce #1643
[Example] Add attention sink varlen examples by @Rachmanino in [Example] Add attention sink varlen examples #1645
[ASTPrinter] Fix IfThenElse printing and some format problems by @SiriusNEO in [ASTPrinter] Fix IfThenElse printing and some format problems #1640
[CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in [CI] [pre-commit.ci] autoupdate #1610
[Enhancement] Update LetStmtNode handling in loop vectorization to support variable binding overrides by @Rachmanino in [Enhancement] Update LetStmtNode handling in loop vectorization to support variable binding overrides #1649
[Example] Remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py by @GoldenStain in [Example] Remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py #1634
[CUDA] Introduce simulated load/store 256bits access for CUDA compatibility by @LeiWang1999 in [CUDA] Introduce simulated load/store 256bits access for CUDA compatibility #1656
[Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case by @LeiWang1999 in [Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case #1654
[Bugfix] Fix missing annotations for default CallNode Visitor by @LeiWang1999 in [Bugfix] Fix missing annotations for default CallNode Visitor #1659
[Clean] Remove unnecessary debug print by @LeiWang1999 in [Clean] Remove unnecessary debug print #1661
[Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies by @LeiWang1999 in [Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies #1657
[Refactor] Improve CallNode handling to include annotations in various operations by @LeiWang1999 in [Refactor] Improve CallNode handling to include annotations in various operations #1663
[EagerJIT] Add Support for Parameter Only Kernel Compilation by @kurisu6912 in [EagerJIT] Add Support for Parameter Only Kernel Compilation #1664
[AutoDD] Add Tilelang AutoDD to Reduce Buggy Program by @KEKE046 in [AutoDD] Add Tilelang AutoDD to Reduce Buggy Program #1639
[Feature] Support cp.reduce.async.bulk.tensor by @Rachmanino in [Feature] Support cp.reduce.async.bulk.tensor #1667
chore: update CI cutedsl version to 4.3.5 by @lucifer1004 in chore: update CI cutedsl version to 4.3.5 #1665
[CUDA] Enhance Broadcast Codegen for Symbolic Value by @LeiWang1999 in [CUDA] Enhance Broadcast Codegen for Symbolic Value #1669
[EagerJIT] Fix bug in handling of positional arguments by @kurisu6912 in [EagerJIT] Fix bug in handling of positional arguments #1675
[Feature] Reimplement Threadsync with ConstrVisitor by @silentCoder-dev in [Feature] Reimplement Threadsync with ConstrVisitor #1631
[Clean][Refactor] Phaseout Legacy Pass ParallelLoopTransformer by @LeiWang1999 in [Clean][Refactor] Phaseout Legacy Pass ParallelLoopTransformer #1672
[Feature] Atomic Reduction Operations and Vectorization Enhancement by @LeiWang1999 in [Feature] Atomic Reduction Operations and Vectorization Enhancement #1676
[Refactor] Move AtomicAdd Vectorization to VectorizeLoop Pass by @LeiWang1999 in [Refactor] Move AtomicAdd Vectorization to VectorizeLoop Pass #1677
[Bugfix] Relax region analysis for complex expression by @LeiWang1999 in [Bugfix] Relax region analysis for complex expression #1679
[Example] Add example for mHC inference kernels. by @Elevator14B in [Example] Add example for mHC inference kernels. #1684
[Analyzer] Fix missing assume in tvm analyzer by @kurisu6912 in [Analyzer] Fix missing assume in tvm analyzer #1680
Refactor: Use centralized do_bench from tilelang.profiler by @LeiWang1999 in Refactor: Use centralized do_bench from tilelang.profiler #1670
[Feature] Introduce DecoupleTypeCast pass for mixed-precision vectorization by @LeiWang1999 in [Feature] Introduce DecoupleTypeCast pass for mixed-precision vectorization #1644
[Release] Bump Version into v0.1.7.post3 by @LeiWang1999 in [Release] Bump Version into v0.1.7.post3 #1685
[Release] Fix release wheels by @oraluben in [Release] Fix release wheels #1687
[BUG] Fix dsa_sparse_finetune/sparse_mla_bwd.py bug by @xiuhu17 in [BUG] Fix dsa_sparse_finetune/sparse_mla_bwd.py bug #1588
[Bugfix] Reorganize pass for thread_sync by @silentCoder-dev in [Bugfix] Reorganize pass for thread_sync #1682
[BugFix] fix warning on deepseek_v32 topk_selector.py by @sgjzfzzf in [BugFix] fix warning on deepseek_v32 topk_selector.py #1681
[tvm-ffi] Enable tvm-ffi for metal backend by @oraluben in [tvm-ffi] Enable tvm-ffi for metal backend #1289
[Analyzer] Fix missing assume in tvm analyzer by @LJC00118 in [Analyzer] Fix missing assume in tvm analyzer #1695
[Chore] Use python-side control flow keywords in examples for consistency by @Rachmanino in [Chore] Use python-side control flow keywords in examples for consistency #1692
[Bugfix][Refactor] Always disable light storage reuse by @LeiWang1999 in [Bugfix][Refactor] Always disable light storage reuse #1691
[Enhancement] Log warnings for OOB accesses to non-global buffers by @SiriusNEO in [Enhancement] Log warnings for OOB accesses to non-global buffers #1693
Enhance loop vectorization logic for CallNode handling by @LeiWang1999 in Enhance loop vectorization logic for CallNode handling #1696
[BugFix] Fix JITKernel export_library bug by @chengyupku in [BugFix] Fix JITKernel export_library bug #1699
[Enhancement] Handle vectorizable calls by @LeiWang1999 in [Enhancement] Handle vectorizable calls #1700
[BugFix] Fix unsafe visit else case under WarpSpecializationScope by @SiriusNEO in [BugFix] Fix unsafe visit else case under WarpSpecializationScope #1702
[Enhancement] Use cute::elect_one_sync() for slightly better performance by @Rachmanino in [Enhancement] Use cute::elect_one_sync() for slightly better performance #1703
[Enhancement] Remove RewriteUnsafeSelect Pass by @LJC00118 in [Enhancement] Remove RewriteUnsafeSelect Pass #1705
[BugFix] Corrected when proving loop layout contains a fragment buffer layout by @LeiWang1999 in [BugFix] Corrected when proving loop layout contains a fragment buffer layout #1708
[Bugfix] Improve robustness of ProveFragmentContains with fully replicated layout by @LeiWang1999 in [Bugfix] Improve robustness of ProveFragmentContains with fully replicated layout #1709
[BugFix] Add int64_t support for AtomicAdd by @LeiWang1999 in [BugFix] Add int64_t support for AtomicAdd #1716
[Refactor] Introduce GemmInst enumeration and update warp partitioning logic by @Rachmanino in [Refactor] Introduce GemmInst enumeration and update warp partitioning logic #1707
[Refactor] Phaseout unnecessary checks for pr #1707 by @LeiWang1999 in [Refactor] Phaseout unnecessary checks for pr #1707 #1721
[Refactor] re-implement vector subtype and its access method by @LeiWang1999 in [Refactor] re-implement vector subtype and its access method #1722
[EagerJIT] Lazy Evaluation of Kernel Body in Eager JIT (#1690) by @kurisu6912 in [EagerJIT] Lazy Evaluation of Kernel Body in Eager JIT (#1690) #1694
[Enhancement] Legalize subtype access by @LeiWang1999 in [Enhancement] Legalize subtype access #1724
[EagerJIT] Enhance auto inference of lazyjit and eager jit by @kurisu6912 in [EagerJIT] Enhance auto inference of lazyjit and eager jit #1704
[Refactor] Enhance variable substitution in device function generation by @LeiWang1999 in [Refactor] Enhance variable substitution in device function generation #1723
[Bugfix] Fix incorrect alignment of vectorized subtype by @LeiWang1999 in [Bugfix] Fix incorrect alignment of vectorized subtype #1726
[Enhancement] Add explicit global memory load/store intrinsics (ldg/stg 32/64/128) by @LeiWang1999 in [Enhancement] Add explicit global memory load/store intrinsics (ldg/stg 32/64/128) #1717
[Refactor] Remove external buffer conflict check in pipeline injection by @LeiWang1999 in [Refactor] Remove external buffer conflict check in pipeline injection #1727
[Refactor] Relocate layout transformation of ptx_stmatrix by @LeiWang1999 in [Refactor] Relocate layout transformation of ptx_stmatrix #1689
[AMD] Add MI350/MI355 FP8 support by @hubertlu-tw in [AMD] Add MI350/MI355 FP8 support #1718
[Bugfix] revert incorrect fast path for parallel layout inference by @LeiWang1999 in [Bugfix] revert incorrect fast path for parallel layout inference #1730
[Example] Add KDA algorithm implementation in tilelang by @wfloveiu in [Example] Add KDA algorithm implementation in tilelang #1660
[Feature] Support E8M0 related type conversion and vectorized cast by @SiriusNEO in [Feature] Support E8M0 related type conversion and vectorized cast #1731
[BugFix] Remove unnecessary binding in loop variable analysis and add test for issue 1728 by @kurisu6912 in [BugFix] Remove unnecessary binding in loop variable analysis and add test for issue 1728 #1735
Add swizzle layout detection and automatic merging for layout conflicts by @LeiWang1999 in Add swizzle layout detection and automatic merging for layout conflicts #1736
[Bugfix] Handle offset handling for subtype ptr by @LeiWang1999 in [Bugfix] Handle offset handling for subtype ptr #1738
[EagerJIT] Allow dummy parameter in jit kernel by @kurisu6912 in [EagerJIT] Allow dummy parameter in jit kernel #1737
[Feature] Add build date to version metadata by @LeiWang1999 in [Feature] Add build date to version metadata #1742
[BugFix] Fix FP4 related vectorized cast by @chaospointer in [BugFix] Fix FP4 related vectorized cast #1741
[Refactor] Disable Predicated LDG PTX Lowering by default by @LeiWang1999 in [Refactor] Disable Predicated LDG PTX Lowering by default #1739
[Layout] Fix Layout Bugs in Parallel and Reduce by @kurisu6912 in [Layout] Fix Layout Bugs in Parallel and Reduce #1713
[fix]: fix deepseek_mla amd example and add aiter mla compare test by @ZiguanWang in [fix]: fix deepseek_mla amd example and add aiter mla compare test #1740
[Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics by @Rachmanino in [Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics #1733
[BugFix] Fix several bugs in CodeGen for CuTeDSL backend by @Rachmanino in [BugFix] Fix several bugs in CodeGen for CuTeDSL backend #1746
Update import for compare_tensors from test_utils_kda by @pmixer in Update import for compare_tensors from test_utils_kda #1748
[Lint] Remove diff arguments in Ruff and sync some versions by @SiriusNEO in [Lint] Remove diff arguments in Ruff and sync some versions #1751
[Refactor] Rename EagerJIT examples to avoid confusion by @SiriusNEO in [Refactor] Rename EagerJIT examples to avoid confusion #1750
[AMD] Fix ROCm FP8 dtype selection and MFMA support on gfx942/gfx950 by @hubertlu-tw in [AMD] Fix ROCm FP8 dtype selection and MFMA support on gfx942/gfx950 #1743
[Feature] Support message-only debug print by @Rachmanino in [Feature] Support message-only debug print #1755
[EagerJIT] Update README example to eager jit by @kurisu6912 in [EagerJIT] Update README example to eager jit #1752
[BugFix] Stride check and fix for tensors with zero-stride argument by @tzj-fxz in [BugFix] Stride check and fix for tensors with zero-stride argument #1749
[BugFix] Always build guard in loop partitioning to prevent out-of-bounds access by @LeiWang1999 in [BugFix] Always build guard in loop partitioning to prevent out-of-bounds access #1756
[Tool] Add tool to print fragment in thread value view by @kurisu6912 in [Tool] Add tool to print fragment in thread value view #1759
[Enhancement] Add dynamic symbolic constraints support for Profiler benchmarking by @LeiWang1999 in [Enhancement] Add dynamic symbolic constraints support for Profiler benchmarking #1753
[ThreadSync] Use Z3 for constraint equivalence checking by @LeiWang1999 in [ThreadSync] Use Z3 for constraint equivalence checking #1760
[Feature] Implement LoopUnswitching Pass by @chengyupku in [Feature] Implement LoopUnswitching Pass #1747
[Chore] Remove unnecessary log from z3 by @Rachmanino in [Chore] Remove unnecessary log from z3 #1763
[Bugfix] Revert the initial value of Z3 SetRLimit by @LeiWang1999 in [Bugfix] Revert the initial value of Z3 SetRLimit #1765
[Feature] Enhance Loop Unswitching with Let Binding and Condition Handling by @LeiWang1999 in [Feature] Enhance Loop Unswitching with Let Binding and Condition Handling #1766
[Bugfix] Add predicate to loads inside predicated stores in LowerLDGSTG pass by @LeiWang1999 in [Bugfix] Add predicate to loads inside predicated stores in LowerLDGSTG pass #1767
[Feature] Add PassConfig for Controlling Let Statement Inlining in Simplify Pass by @LeiWang1999 in [Feature] Add PassConfig for Controlling Let Statement Inlining in Simplify Pass #1769
[Fix] Change ue8m0 default round mode to cudaRoundPosInf by @SiriusNEO in [Fix] Change ue8m0 default round mode to cudaRoundPosInf #1770
[Feature] Support tcgen5mma lowering for .kind::i8 by @Rachmanino in [Feature] Support tcgen5mma lowering for .kind::i8 #1764
[Refactor] Unify the usage of cast-related operators by @SiriusNEO in [Refactor] Unify the usage of cast-related operators #1757
[Bugfix] Copy pass_configs dict to prevent mutation across multiple JIT compilations by @LeiWang1999 in [Bugfix] Copy pass_configs dict to prevent mutation across multiple JIT compilations #1776
[CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in [CI] [pre-commit.ci] autoupdate #1775
[Refactor] Improve type annotations and reduce some lint errors in frontend by @SiriusNEO in [Refactor] Improve type annotations and reduce some lint errors in frontend #1777
Update TVM: fix select/if_then_else out-of-bounds access by @LeiWang1999 in Update TVM: fix select/if_then_else out-of-bounds access #1783
[Feature] Add fully replicated layout interface in annotation layout by @tzj-fxz in [Feature] Add fully replicated layout interface in annotation layout #1772
[Example][BugFix] Fix arguements override in deepseek_v32 topk_selector by @ljwljwljwljw in [Example][BugFix] Fix arguements override in deepseek_v32 topk_selector #1784
[BugFix] Fix reduce_sum with clear=False not accumulating correctly by @ShaobinChen-AH in [BugFix] Fix reduce_sum with clear=False not accumulating correctly #1778
fix(intrinsics): add missing _legalize_to_buffer_region in SM70 emitter by @Coloured-glaze in fix(intrinsics): add missing _legalize_to_buffer_region in SM70 emitter #1786
[Enhancement] Enhance register vectorize inference by @LeiWang1999 in [Enhancement] Enhance register vectorize inference #1785
[Bugfix] Fix thread storage sync conflict detection for loop carry write-after-read by @LeiWang1999 in [Bugfix] Fix thread storage sync conflict detection for loop carry write-after-read #1781
[Fix] cython 3.0 generates incorrect code for python stable api by @oraluben in [Fix] cython 3.0 generates incorrect code for python stable api #1789
[BugFix] Update buffer access in TensorCoreIntrinEmitter to handle variable dimensions correctly by @xwhzz in [BugFix] Update buffer access in TensorCoreIntrinEmitter to handle variable dimensions correctly #1794
[ThreadSync] Skip (tx1 != tx2) checking for loop carry analysis by @LeiWang1999 in [ThreadSync] Skip (tx1 != tx2) checking for loop carry analysis #1795
[Feature] Add option to disable out-of-bound access warnings in safe memory access legalization by @kurisu6912 in [Feature] Add option to disable out-of-bound access warnings in safe memory access legalization #1797
[Docs] Add Python Compatibility document of TileLang by @LeiWang1999 in [Docs] Add Python Compatibility document of TileLang #1745
[Refactor] Reorganize ParallelOp code structure and move ProveFragmentContains to layout utils by @LeiWang1999 in [Refactor] Reorganize ParallelOp code structure and move ProveFragmentContains to layout utils #1779
[Feature] Support passing PrimExpr value in tile-level atomic operation by @SiriusNEO in [Feature] Support passing PrimExpr value in tile-level atomic operation #1796
[Bugfix] Support loop-dependent conditions in IfThenElse within T.Pipelined by @ljwljwljwljw in [Bugfix] Support loop-dependent conditions in IfThenElse within T.Pipelined #1799
[BugFix] Missing Recursive Loop Var Checking in Loop Unswitching by @kurisu6912 in [BugFix] Missing Recursive Loop Var Checking in Loop Unswitching #1801
Fix a 3.9 issue. add _typing.py to dist check by @oraluben in Fix a 3.9 issue. add _typing.py to dist check #1803
[Docs][Puzzles] Add TileLang puzzles in README by @SiriusNEO in [Docs][Puzzles] Add TileLang puzzles in README #1806
[Docs] Hotfix wrong link by @SiriusNEO in [Docs] Hotfix wrong link #1807
[Enhancement] Improve plot_layout visualization for Layouts by @LeiWang1999 in [Enhancement] Improve plot_layout visualization for Layouts #1811
[Feat] profiler support cudagraph backend by @cscyuge in [Feat] profiler support cudagraph backend #1658
Handle staled autotune state with tvm-ffi adapter. by @haok1402 in Handle staled autotune state with tvm-ffi adapter. #1812
[BugFix] LoopUnswitching: gate non-trivial else behind PassConfig by @LeiWang1999 in [BugFix] LoopUnswitching: gate non-trivial else behind PassConfig #1816
[Release] Update dependencies to resolve several issues by @oraluben in [Release] Update dependencies to resolve several issues #1817
[BugFix] Fix fp16 annotate_l2_hit_ratio host stub compilation (issue #1810) by @LeiWang1999 in [BugFix] Fix fp16 annotate_l2_hit_ratio host stub compilation (issue #1810) #1818
[Bugfix] Remove mistaken coalesced_width parameter in regression test of fusedmoe kernel by @xwhzz in [Bugfix] Remove mistaken coalesced_width parameter in regression test of fusedmoe kernel #1820
[Release] Add build for python 3.14t by @oraluben in [Release] Add build for python 3.14t #1805
Fix: treat kParallel as serial when vectorizing by @LeiWang1999 in Fix: treat kParallel as serial when vectorizing #1819
[Dist] Add lazy-loading stubs for CUDART + NVRTC (CUDA 11/12/13 compatible wheels) by @LeiWang1999 in [Dist] Add lazy-loading stubs for CUDART + NVRTC (CUDA 11/12/13 compatible wheels) #1821
[Analyzer] Add SideEffect Checking in ConstIntBound Analyzer by @kurisu6912 in [Analyzer] Add SideEffect Checking in ConstIntBound Analyzer #1824
[Bugfix] Fix ast builder error for value -= 1 by @LeiWang1999 in [Bugfix] Fix ast builder error for value -= 1 #1825
[Release][Build] Merge libtilelang and libtilelang_modules by @oraluben in [Release][Build] Merge libtilelang and libtilelang_modules #1814
[Bugfix] Fix threadIdx variable lookup by thread_tag instead of position in ThreadSync by @LeiWang1999 in [Bugfix] Fix threadIdx variable lookup by thread_tag instead of position in ThreadSync #1829
[Docs] Update nightly build installation instructions in README and Installation guide by @xwhzz in [Docs] Update nightly build installation instructions in README and Installation guide #1830
[BugFix] Reset cur_expect_idx_ correctly for multi-kernel TMA barrier injection by @ColmaLiu in [BugFix] Reset cur_expect_idx_ correctly for multi-kernel TMA barrier injection #1828
[Refactor] Treat local.var as local buffers when deciding vectorization for stable actions by @LeiWang1999 in [Refactor] Treat local.var as local buffers when deciding vectorization for stable actions #1835
Fix tilelang global load/store template by @LJC00118 in Fix tilelang global load/store template #1837
[Refactor] Introduce T.access_of to combine T.address_of and access_ptr by @LeiWang1999 in [Refactor] Introduce T.access_of to combine T.address_of and access_ptr #1827
[CUDA][Feature] Add packed FP32x2 math intrinsics and auto vectorized support by @LeiWang1999 in [CUDA][Feature] Add packed FP32x2 math intrinsics and auto vectorized support #1839
[Example][BugFix] 1SM GEMM example on Blackwell and fix handling of mbar by @Rachmanino in [Example][BugFix] 1SM GEMM example on Blackwell and fix handling of mbar #1774
[Feature] Hierarchical reduction and warp reduction intrinsics support by @tzj-fxz in [Feature] Hierarchical reduction and warp reduction intrinsics support #1762
[Dist][Release] Use one wheel for different CUDA version by @oraluben in [Dist][Release] Use one wheel for different CUDA version #1826
[Enhancement] Optimize templates for half/bfloat16 by @LJC00118 in [Enhancement] Optimize templates for half/bfloat16 #1845
ThreadSync: avoid barriers between atomic ops by @LeiWang1999 in ThreadSync: avoid barriers between atomic ops #1852
[BugFix] Fix eager mode where there is no tensor args by @Rachmanino in [BugFix] Fix eager mode where there is no tensor args #1851
[AMD] Fix bugs about AMD FA kernel by @danielhua23 in [AMD] Fix bugs about AMD FA kernel #1701
Add an example: mHC residual projection backward by @Da1sypetals in Add an example: mHC residual projection backward #1758
[Release] Bump version into v0.1.8 by @LeiWang1999 in [Release] Bump version into v0.1.8 #1853

New Contributors

@danielhua23 made their first contribution in [AMD] Fix 3 bugs when build docker on amd mi3x gpu #1401
@senlyu163 made their first contribution in [Typo] Fix tilelang link in README.md #1402
@Dayuxiaoshui made their first contribution in [Feature] Support region as input of T.cumsum #1426
@silentCoder-dev made their first contribution in [Language]Adds a random number generation capability through curand_kernel #1461
@sgjzfzzf made their first contribution in remove unused duplicated type check #1462
@hukongyi made their first contribution in [BugFix] Fix bugs of varlen attention forward examples caused by S_q != S_kv #1530
@clouds56 made their first contribution in [Typing] Update Kernel signature and add type hints for buffer operations #1545
@c8ef made their first contribution in [Misc] Remove unused tl_pipeline_sync. #1566
@haok1402 made their first contribution in [Bugfix] Add kernel_global_source property to TVMFFIKernelAdapter #1589
@w169q169 made their first contribution in [Feat] PDL Support #1494
@asaadkhaja99 made their first contribution in [BugFix] Add pre-commit to requirements-dev.txt #1611
@hammersam made their first contribution in [Example] Add Seesaw Sparse MLA Forward Kernel for DeepSeek-V3.2 #1636
@GoldenStain made their first contribution in [Example] Remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py #1634
@KEKE046 made their first contribution in [AutoDD] Add Tilelang AutoDD to Reduce Buggy Program #1639
@xiuhu17 made their first contribution in [BUG] Fix dsa_sparse_finetune/sparse_mla_bwd.py bug #1588
@hubertlu-tw made their first contribution in [AMD] Add MI350/MI355 FP8 support #1718
@wfloveiu made their first contribution in [Example] Add KDA algorithm implementation in tilelang #1660
@chaospointer made their first contribution in [BugFix] Fix FP4 related vectorized cast #1741
@ZiguanWang made their first contribution in [fix]: fix deepseek_mla amd example and add aiter mla compare test #1740
@pmixer made their first contribution in Update import for compare_tensors from test_utils_kda #1748
@ljwljwljwljw made their first contribution in [Example][BugFix] Fix arguements override in deepseek_v32 topk_selector #1784
@ShaobinChen-AH made their first contribution in [BugFix] Fix reduce_sum with clear=False not accumulating correctly #1778
@Coloured-glaze made their first contribution in fix(intrinsics): add missing _legalize_to_buffer_region in SM70 emitter #1786
@cscyuge made their first contribution in [Feat] profiler support cudagraph backend #1658
@ColmaLiu made their first contribution in [BugFix] Reset cur_expect_idx_ correctly for multi-kernel TMA barrier injection #1828
@Da1sypetals made their first contribution in Add an example: mHC residual projection backward #1758

Full Changelog: v0.1.7...v0.1.8

This discussion was created from the release v0.1.8.
