v0.1.8 #1854
LeiWang1999 announced in Announcements
What's Changed
- [Bugfix] Alloc `T.make_tensor` not on the top of prim_func by @LeiWang1999 in #1412
- [Enhancement] Introduce `T.__ldg` by @LeiWang1999 in #1414
- [Bugfix] Convey `compile_flags` to ffi compilation path with pass_configs by @LeiWang1999 in #1434
- [Fix] Fix analyzer bind conflicting bug in #1442 (`example/dsa_sparse_finetune/indexer_topk_reducesum.py`) by @kurisu6912 in #1446
- [Refactor] Use `pytest.mark.parameterize` to speedup parallel testing by @kurisu6912 in #1447
- [Language] Introduce `T.annotate_restrict_buffers` by @LeiWang1999 in #1428
- [Refactor] Rename test for curand & add triton baseline in `test_tilelang_language_rand.py` by @silentCoder-dev in #1464
- [Refactor] Phaseout PassConfig `kDisableDynamicTailSplit` and `kDynamicAlignment` as they are legacy by @LeiWang1999 in #1486
- [Refactor] Phaseout legacy `alloc_local` statement in examples and introduce processing for floating fragment buffers by @LeiWang1999 in #1495
- [Refactor] Phaseout execution_backend `ctypes` by @LeiWang1999 in #1510
- Use `TargetIsCuda` for all cuda target by @oraluben in #1522
- [BugFix] Fix bugs of varlen attention forward examples caused by `S_q != S_kv` by @hukongyi in #1530
- [Bugfix] Avoid considering `local.var` buffer as `local` by @LeiWang1999 in #1541
- [Bugfix] Fix of `T.Fill` for local.var by @LeiWang1999 in #1543
- [Enhancement] Support larger `H` in deepseek sparse mla backward via split-H by @Rachmanino in #1548
- [Refactor] Introduce layout annotations for `ParallelOPNode` and `CopyNode` by @LeiWang1999 in #1539
- [Misc] Remove unused `tl_pipeline_sync`. by @c8ef in #1566
- [Fix] Add register to read A ptr in `test_tilelang_language_cooperative.py` by @silentCoder-dev in #1593
- [Enhancement] Allow `import tilelang` on CPU-only machines without CUDA libraries by @XuehaiPan in #1481
- [Feature] add `T.sync_warp` & `T.shfl_sync`; change extern pdl into intrin by @silentCoder-dev in #1614
- [BugFix] Fix `ForwardRef` usage in v2 frontend (#1619) by @kurisu6912 in #1621
- [Refactor] Move `ConstrVisitor` to `src/transform/common/constr_visitor.h` for reuse by @silentCoder-dev in #1622
- [Feat] Improve `T.reduce_absmax` to use less abs call by @kurisu6912 in #1626
- [Enhancement][CUDA] Support `nvidia-cuda-nvcc` as `nvcc` by @clouds56 in #1528
- [BugFix] Correct index_map selection for transposed A matrix in MFMA Layout with `k_dim==4` and open rocm-ci for gemmsr by @benenzhu in #1627
- [Feat] Allow dangling producer in wasp pipeline planning (#1263, `T.Pipelined`) by @kurisu6912 in #1647
- [Example] Remove redundant T.copy in `examples/deepseek_v32/sparse_mla_fwd.py` by @GoldenStain in #1634
- [Feature] Support `cp.reduce.async.bulk.tensor` by @Rachmanino in #1667
- [Feature] Reimplement `Threadsync` with `ConstrVisitor` by @silentCoder-dev in #1631
- [Clean][Refactor] Phaseout Legacy Pass `ParallelLoopTransformer` by @LeiWang1999 in #1672
- [Bugfix] Reorganize pass for `thread_sync` by @silentCoder-dev in #1682
- [Enhancement] Use `cute::elect_one_sync()` for slightly better performance by @Rachmanino in #1703
- [Enhancement] Remove `RewriteUnsafeSelect` Pass by @LJC00118 in #1705
- [Refactor] Relocate layout transformation of `ptx_stmatrix` by @LeiWang1999 in #1689
- [Refactor] Enhance `T.alloc_barrier` with new features and deprecate legacy mbarrier related intrinsics by @Rachmanino in #1733
- [Feature] Support tcgen5mma lowering for `.kind::i8` by @Rachmanino in #1764
- Fix a 3.9 issue. add `_typing.py` to dist check by @oraluben in #1803
- [Bugfix] Fix ast builder error for `value -= 1` by @LeiWang1999 in #1825
- [Refactor] Treat `local.var` as `local` buffers when deciding vectorization for stable actions by @LeiWang1999 in #1835
- [Refactor] Introduce `T.access_of` to combine `T.address_of` and `access_ptr` by @LeiWang1999 in #1827
- [Example][BugFix] 1SM GEMM example on Blackwell and fix handling of `mbar` by @Rachmanino in #1774

New Contributors
- @hukongyi made their first contribution in #1530
- @c8ef made their first contribution in #1566
- @GoldenStain made their first contribution in #1634

Full Changelog: v0.1.7...v0.1.8
This discussion was created from the release v0.1.8.