v0.30.4
Highlights
- Metal: Much faster vector fused grouped-query attention for long context
- CUDA: Several improvements that speed up LLM inference on the CUDA backend
- CUDA: Support for dense MoEs
- CUDA: Better support for consumer GPUs (4090, 5090, RTX 6000, ...)
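For readers unfamiliar with the term in the Metal highlight, grouped-query attention lets several query heads share one key/value head. The following is a minimal pure-Python reference sketch for one query position per head. It is not the Metal kernel or MLX's API; the function name `grouped_query_attention` and the nested-list layout are assumptions made for illustration.

```python
import math


def grouped_query_attention(q, k, v):
    """Reference grouped-query attention for one query position per head.

    Hypothetical helper for illustration, not part of MLX.
    q: list of n_q_heads query vectors, each of length d
    k: k[h][t] is the key vector of KV head h at position t (length d)
    v: v[h][t] is the value vector of KV head h at position t
    """
    n_q_heads, n_kv_heads = len(q), len(k)
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly"
    group = n_q_heads // n_kv_heads

    out = []
    for h, q_h in enumerate(q):
        kv = h // group  # consecutive query heads share one KV head
        scale = 1.0 / math.sqrt(len(q_h))
        scores = [scale * sum(a * b for a, b in zip(q_h, k_t)) for k_t in k[kv]]

        # Numerically stable softmax over the context positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)

        d_v = len(v[kv][0])
        out.append([
            sum(w * v_t[j] for w, v_t in zip(weights, v[kv])) / total
            for j in range(d_v)
        ])
    return out
```

With 4 query heads and 2 KV heads, query heads 0 and 1 read KV head 0 and query heads 2 and 3 read KV head 1, so the key/value cache is half the size it would be with one KV head per query head.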
What's Changed
- patch bump for next release by @awni in #2991
- Fix fence by @awni in #2998
- Reverts changing the MLX_IBV_DEVICES to MLX_JACCL_DEVICES by @angeloskath in #2999
- fix distributed all_to_sharded bias shard axis from -2 to -1 by @gufengc in #2987
- Fix sharding of quantized models with non-power-of-2 bits by @kernelpool in #3006
- Update CCCL to v3.1.3 by @zcbenz in #3012
- Fix python package install path in stubgen by @zcbenz in #3009
- Type Enhancement for Func Transforms and Bug Fix by @XXXXRT666 in #3003
- Do not clear disk space in setup-linux by @zcbenz in #3013
- Do not give workflow boolean inputs default values by @zcbenz in #3014
- Fix negative dim indexing by @MillaFleurs in #2994
- Windows CI by @zcbenz in #3021
- Optimize erf function with expm1f in Metal backend by @bjornefisk in #3025
- [CUDA] Faster grouped mm by @zcbenz in #3011
- PR 3007 Fix Seg Fault by @MillaFleurs in #3008
- Use higher precision for linspace with double by @awni in #3029
- Handle data smaller than BUFFER_SIZE in jaccl recv by @rltakashige in #3033
- build 26.0 release in actions by @awni in #3035
- Remove xmlrunner from macOS CI by @zcbenz in #3032
- Columnwise quantize by @nastya236 in #2989
- Turn nccl_stub into a normal target by @zcbenz in #3037
- Use cuda::std for math ops by @zcbenz in #3041
- win: symbol exports and minor fixes by @dhiltgen in #3024
- CUDA gather mv by @angeloskath in #3039
- Link with prebuilt OpenBLAS and fix shared libs build on Windows by @zcbenz in #3036
- Allow take on empty array when it makes sense by @awni in #3046
- Add missing include to buffer_cache.h by @Anri-Lombard in #3053
- Build and test python package on Windows CI by @zcbenz in #3049
- Fix some MSVC compilation errors by @zcbenz in #3048
- Use C++20 by @zcbenz in #3050
- Faster two pass sdpa by @awni in #3023
- Find system-installed cuDNN on Windows by @zcbenz in #3052
- Fix some NVCC warnings when building CUDA backend with MSVC by @zcbenz in #3038
- Hide symbols by default for mac/linux by @zcbenz in #3057
- [CUDA] Fast sorting by @awni in #3060
- Fix flaky macOS test by @awni in #3063
- Update pre-commit hooks and versions for clang-format, black, and isort by @NripeshN in #3059
- GPU discovery by @dhiltgen in #3055
- Add NAX Split-K GEMM for large-K matmuls to improve performance by @hxu296 in #3018
- Improve CPU discovery by @dhiltgen in #3068
- Fix long cache file path on Windows by @zcbenz in #3065
- Better support consumer CUDA GPUs by @jessegross in #3056
- Delay load CUDA libs and resolve DLL paths at runtime by @zcbenz in #3061
- Do not require ConcurrentManagedAccess when not used by @zcbenz in #3062
- Fp qmv by @awni in #2984
- remove thrust by @awni in #3067
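As background for the Split-K GEMM entry above, the general split-K idea is to partition the shared K dimension of a matrix product into chunks, compute a partial product per chunk, and then sum the partials. The sketch below shows that idea in pure Python. It makes no claim about how #3018 implements it; `split_k_matmul`, `matmul`, and `num_splits` are hypothetical names chosen for illustration.

```python
def matmul(a, b):
    """Naive (M x K) @ (K x N) product on nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


def split_k_matmul(a, b, num_splits):
    """Same product, computed as a sum of partial products over K chunks.

    Hypothetical helper for illustration, not part of MLX.
    """
    assert num_splits >= 1
    m, k, n = len(a), len(b), len(b[0])
    chunk = -(-k // num_splits)  # ceiling division

    # Each chunk of K yields an independent M x N partial result. On a GPU
    # these partials could be computed in parallel.
    partials = []
    for start in range(0, k, chunk):
        stop = min(start + chunk, k)
        partials.append([
            [sum(a[i][p] * b[p][j] for p in range(start, stop)) for j in range(n)]
            for i in range(m)
        ])

    # Reduction step: add the partial results elementwise.
    return [
        [sum(part[i][j] for part in partials) for j in range(n)]
        for i in range(m)
    ]
```

The split pays off when K is much larger than M and N: the output tile count alone gives too little parallel work, while splitting K creates more independent pieces at the cost of one extra reduction.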
New Contributors
- @gufengc made their first contribution in #2987
- @kernelpool made their first contribution in #3006
- @bjornefisk made their first contribution in #3025
- @rltakashige made their first contribution in #3033
- @dhiltgen made their first contribution in #3024
- @hxu296 made their first contribution in #3018
- @jessegross made their first contribution in #3056
Full Changelog: v0.30.3...v0.30.4