v0.2.2
What's Changed
- kernel: added flash infer attention impl by @guocuimi in #327
- refactor: flatten block tables to a 1D tensor by @guocuimi in #328
- kernel: added script to generate instantiations for flashinfer kernels by @guocuimi in #329
- refactor: move flash attn and flash infer into attention folder by @guocuimi in #330
- kernel: port flash infer handler + wrapper logic by @guocuimi in #331
- ut: added unit tests for flash infer kernels by @guocuimi in #332
- refactor: replaced last_page_len with kv_indptr for flash infer kernel by @guocuimi in #333 (see the indexing sketch after this list)
- feat: added pass-in ALiBi slopes support for flash infer kernel by @guocuimi in #334
- refactor: move paged kv related logic into paged_kv_t by @guocuimi in #335
- ut: added fp8 kv unit tests for flash infer kernel by @guocuimi in #336
- ci: added pip cache to avoid redownloading by @guocuimi in #337
- upgrade PyTorch to 2.4.1 by @guocuimi in #341
- ci: run package test in docker by @guocuimi in #345
- ci: build CUDA 12.4 for scalellm cpp images by @guocuimi in #346
- upgrade PyTorch to 2.5.0 by @guocuimi in #347
- ut: add more tests for different warp layouts by @guocuimi in #340
- misc: attention kernel refactoring by @guocuimi in #339
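The paged-KV changes above (#328, #333, #335) revolve around describing each sequence's KV-cache pages with a flat page-id array plus CSR-style offsets, rather than a padded 2D block table. Below is a minimal sketch of that indexing idea only; the names `kv_indices` and `kv_indptr` and the page numbers are illustrative assumptions, not the actual ScaleLLM/FlashInfer API.

```cpp
// Illustrative only: CSR-style paged-KV indexing, not ScaleLLM's real types.
#include <cstdio>
#include <vector>

int main() {
  // Page ids for 3 sequences, concatenated into one flat 1D array
  // (instead of a padded 2D [num_seqs, max_pages] block table).
  std::vector<int> kv_indices = {7, 2, 9, /* seq 1 */ 4, 1, /* seq 2 */ 5, 8, 0, 3};
  // kv_indptr[i]..kv_indptr[i+1] delimits sequence i's slice of kv_indices.
  std::vector<int> kv_indptr = {0, 3, 5, 9};

  for (size_t i = 0; i + 1 < kv_indptr.size(); ++i) {
    std::printf("seq %zu pages:", i);
    for (int j = kv_indptr[i]; j < kv_indptr[i + 1]; ++j) {
      std::printf(" %d", kv_indices[j]);
    }
    std::printf("\n");
  }
  return 0;
}
```

With this layout, sequence i owns kv_indices[kv_indptr[i] .. kv_indptr[i+1]), so sequences with different page counts need no padding to a common maximum.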
Full Changelog: v0.2.1...v0.2.2