Release v0.30.4 · ml-explore/mlx

Highlights

Metal: Much faster vector fused grouped-query attention for long context
CUDA: Several improvements to speed up LLM inference for CUDA backend
CUDA: Support for dense MoEs
CUDA: Better support for consumer GPUs (4090, 5090, RTX 6000, ...)

What's Changed

patch bump for next release by @awni in #2991
Fix fence by @awni in #2998
Reverts changing the MLX_IBV_DEVICES to MLX_JACCL_DEVICES by @angeloskath in #2999
fix distributed all_to_sharded bias shard axis from -2 to -1 by @gufengc in #2987
Fix sharding of quantized models with non-power-of-2 bits by @kernelpool in #3006
Update CCCL to v3.1.3 by @zcbenz in #3012
Fix python package install path in stubgen by @zcbenz in #3009
Type Enhancement for Func Transforms and Bug Fix by @XXXXRT666 in #3003
Do not clear disk space in setup-linux by @zcbenz in #3013
Do not give workflow boolean inputs default values by @zcbenz in #3014
Fix negative dim indexing by @MillaFleurs in #2994
Windows CI by @zcbenz in #3021
Optimize erf function with expm1f in Metal backend by @bjornefisk in #3025
[CUDA] Faster grouped mm by @zcbenz in #3011
PR 3007 Fix Seg Fault by @MillaFleurs in #3008
Use higher precision for linspace with double by @awni in #3029
Handle data smaller than BUFFER_SIZE in jaccl recv by @rltakashige in #3033
build 26.0 release in actions by @awni in #3035
Remove xmlrunner from macOS CI by @zcbenz in #3032
Columnwise quantize by @nastya236 in #2989
Turn nccl_stub into a normal target by @zcbenz in #3037
Use cuda::std for math ops by @zcbenz in #3041
win: symbol exports and minor fixes by @dhiltgen in #3024
CUDA gather mv by @angeloskath in #3039
Link with prebuilt OpenBLAS and fix shared libs build on Windows by @zcbenz in #3036
Allow take on empty array when it makes sense by @awni in #3046
Add missing include to buffer_cache.h by @Anri-Lombard in #3053
Build and test python package on Windows CI by @zcbenz in #3049
Fix some MSVC compilation errors by @zcbenz in #3048
Use C++20 by @zcbenz in #3050
Faster two pass sdpa by @awni in #3023
Find system-installed cuDNN on Windows by @zcbenz in #3052
Fix some NVCC warnings when building CUDA backend with MSVC by @zcbenz in #3038
Hide symbols by default for mac/linux by @zcbenz in #3057
[CUDA] Fast sorting by @awni in #3060
Fix flaky macOS test by @awni in #3063
Update pre-commit hooks and versions for clang-format, black, and isort by @NripeshN in #3059
GPU discovery by @dhiltgen in #3055
Add NAX Split-K GEMM for large-K matmuls to improve performance by @hxu296 in #3018
Improve CPU discovery by @dhiltgen in #3068
Fix long cache file path on Windows by @zcbenz in #3065
Better support consumer CUDA GPUs by @jessegross in #3056
Delay load CUDA libs and resolve DLL paths at runtime by @zcbenz in #3061
Do not require ConcurrentManagedAccess when not used by @zcbenz in #3062
Fp qmv by @awni in #2984
remove thrust by @awni in #3067
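
As a rough illustration of the Split-K idea named in #3018 above, the sketch below partitions the shared K dimension of a matrix product into chunks, computes one partial product per chunk, and sums the partials. It is plain Python on lists of lists and is not the Metal/NAX kernel from the PR. The names `matmul`, `split_k_matmul`, and `num_splits` are hypothetical and chosen only for this example. On a GPU the partial products would run in parallel, which is where the benefit for large K with small output tiles would come from.

```python
def matmul(a, b):
    """Naive reference product of an (M x K) and a (K x N) list-of-lists matrix."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_k_matmul(a, b, num_splits):
    """Split-K sketch: slice the K dimension, multiply each slice, then reduce.

    Assumes non-empty matrices and num_splits >= 1 (illustrative only).
    """
    m, k, n = len(a), len(b), len(b[0])
    chunk = max(1, -(-k // num_splits))  # ceiling division, at least 1

    # Each partial product covers one contiguous slice of K.
    partials = []
    for start in range(0, k, chunk):
        end = min(start + chunk, k)
        a_slice = [row[start:end] for row in a]  # M x (end - start)
        b_slice = b[start:end]                   # (end - start) x N
        partials.append(matmul(a_slice, b_slice))

    # Reduction step: sum the partial M x N results elementwise.
    out = [[0] * n for _ in range(m)]
    for part in partials:
        for i in range(m):
            for j in range(n):
                out[i][j] += part[i][j]
    return out
```

With integer inputs the split result matches the reference product exactly. With floating-point inputs the different summation order can change the last few bits.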

New Contributors

@gufengc made their first contribution in #2987
@kernelpool made their first contribution in #3006
@bjornefisk made their first contribution in #3025
@rltakashige made their first contribution in #3033
@dhiltgen made their first contribution in #3024
@hxu296 made their first contribution in #3018
@jessegross made their first contribution in #3056

Full Changelog: v0.30.3...v0.30.4
