Releases: ml-explore/mlx
v0.30.6
Highlights
- Much higher bandwidth with JACCL on macOS >= 26.3 (some numbers); a minimal usage sketch follows below
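For context, a minimal distributed sketch in Python. Selecting the backend explicitly with backend="jaccl" and launching the script with the mlx.launch helper are assumptions here; by default mx.distributed.init() picks whichever backend is available.

```python
import mlx.core as mx

# Assumption: "jaccl" is the backend name; mx.distributed.init() with no
# arguments falls back to any available backend.
group = mx.distributed.init(backend="jaccl")

x = mx.ones((4,))
# Sum the array across all ranks in the group.
y = mx.distributed.all_sum(x, group=group)
print(group.rank(), y)
```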
What's Changed
- patch by @awni in #3093
- Disable managed memory on WSL when concurrentManagedAccess is not supported by @jessegross in #3095
- Fix non simd f16 build by @awni in #3097
- Fix 2pass sdpa on < M2 by @awni in #3099
- JACCL update by @angeloskath in #3094
- Fix qmv_impl for small N by @manuelcandales in #3096
- Patch for multi device CUDA by @awni in #3100
New Contributors
- @manuelcandales made their first contribution in #3096
Full Changelog: v0.30.5...v0.30.6
v0.30.5
What's Changed
- patch by @awni in #3074
- [CUDA] Fallback Event impl when there is no hardware cpu/gpu coherency by @zcbenz in #3070
- Tune CUDA graph sizes on B200 and H100 by @awni in #3077
- [Docs] Simple example of using MLX distributed by @stefpi in #2973
- Use lower-right causal mask alignment consistently by @Anri-Lombard in #2967
- Fix ALiBi slopes for non-power-of-2 num_heads by @vovw in #3071
- More useful error for large indices by @awni in #3079
- Fix NAX condition for iPhone by @awni in #3083
- Fallback to pinned host memory when managed memory is not supported by @zcbenz in #3075
- Fix failing python tests on Windows by @zcbenz in #3076
- [Metal] Tune splitk gemm dispatch conditions and partition sizes by @awni in #3087
- Fix for NAX overflow. by @awni in #3092
Full Changelog: v0.30.4...v0.30.5
v0.30.4
Highlights
- Metal: Much faster vector fused grouped-query attention for long context (see the sketch after this list)
- CUDA: Several improvements to speed up LLM inference for CUDA backend
- CUDA: Support for dense MoEs
- CUDA: Better support for consumer GPUs (4090, 5090, RTX 6000, ...)
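For reference, the grouped-query attention highlight above goes through mx.fast.scaled_dot_product_attention; a minimal sketch with arbitrary sizes (8 KV heads broadcast over 32 query heads, one decode token against a long KV cache):

```python
import mlx.core as mx

B, n_q_heads, n_kv_heads, L, D = 1, 32, 8, 4096, 128
q = mx.random.normal((B, n_q_heads, 1, D))   # single decode-step query
k = mx.random.normal((B, n_kv_heads, L, D))  # long key cache
v = mx.random.normal((B, n_kv_heads, L, D))  # long value cache

# Grouped-query attention: fewer KV heads than query heads.
out = mx.fast.scaled_dot_product_attention(q, k, v, scale=D**-0.5)
print(out.shape)  # (1, 32, 1, 128)
```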
What's Changed
- patch bump for next release by @awni in #2991
- Fix fence by @awni in #2998
- Reverts changing the MLX_IBV_DEVICES to MLX_JACCL_DEVICES by @angeloskath in #2999
- fix distributed all_to_sharded bias shard axis from -2 to -1 by @gufengc in #2987
- Fix sharding of quantized models with non-power-of-2 bits by @kernelpool in #3006
- Update CCCL to v3.1.3 by @zcbenz in #3012
- Fix python package install path in stubgen by @zcbenz in #3009
- Type Enhancement for Func Transforms and Bug Fix by @XXXXRT666 in #3003
- Do not clear disk space in setup-linux by @zcbenz in #3013
- Do not give workflow boolean inputs default values by @zcbenz in #3014
- Fix negative dim indexing by @MillaFleurs in #2994
- Windows CI by @zcbenz in #3021
- Optimize erf function with expm1f in Metal backend by @bjornefisk in #3025
- [CUDA] Faster grouped mm by @zcbenz in #3011
- PR 3007 Fix Seg Fault by @MillaFleurs in #3008
- Use higher precision for linspace with double by @awni in #3029
- Handle data smaller than BUFFER_SIZE in jaccl recv by @rltakashige in #3033
- build 26.0 release in actions by @awni in #3035
- Remove xmlrunner from macOS CI by @zcbenz in #3032
- Columnwise quantize by @nastya236 in #2989
- Turn nccl_stub into a normal target by @zcbenz in #3037
- Use cuda::std for math ops by @zcbenz in #3041
- win: symbol exports and minor fixes by @dhiltgen in #3024
- CUDA gather mv by @angeloskath in #3039
- Link with prebuilt OpenBLAS and fix shared libs build on Windows by @zcbenz in #3036
- Allow take on empty array when it makes sense by @awni in #3046
- Add missing include to buffer_cache.h by @Anri-Lombard in #3053
- Build and test python package on Windows CI by @zcbenz in #3049
- Fix some MSVC compilation errors by @zcbenz in #3048
- Use C++20 by @zcbenz in #3050
- Faster two pass sdpa by @awni in #3023
- Find system-installed cuDNN on Windows by @zcbenz in #3052
- Fix some NVCC warnings when building CUDA backend with MSVC by @zcbenz in #3038
- Hide symbols by default for mac/linux by @zcbenz in #3057
- [CUDA] Fast sorting by @awni in #3060
- Fix flaky macOS test by @awni in #3063
- Update pre-commit hooks and versions for clang-format, black, and isort by @NripeshN in #3059
- GPU discovery by @dhiltgen in #3055
- Add NAX Split-K GEMM for large-K matmuls to improve performance by @hxu296 in #3018
- Improve CPU discovery by @dhiltgen in #3068
- Fix long cache file path on Windows by @zcbenz in #3065
- Better support consumer CUDA GPUs by @jessegross in #3056
- Delay load CUDA libs and resolve DLL paths at runtime by @zcbenz in #3061
- Do not require ConcurrentManagedAccess when not used by @zcbenz in #3062
- Fp qmv by @awni in #2984
- remove thrust by @awni in #3067
New Contributors
- @gufengc made their first contribution in #2987
- @kernelpool made their first contribution in #3006
- @bjornefisk made their first contribution in #3025
- @rltakashige made their first contribution in #3033
- @dhiltgen made their first contribution in #3024
- @hxu296 made their first contribution in #3018
- @jessegross made their first contribution in #3056
Full Changelog: v0.30.3...v0.30.4
v0.30.3
Highlights
- Support nvfp4 and mxfp8 quantized ops on Metal (see the sketch after this list)
- Support nvfp4 and mxfp8 quantized-quantized matrix-matrix multiplication on CUDA
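A hedged sketch of the quantization entry point these ops extend. The affine call below is the long-standing mx.quantize signature; the commented mxfp8 line is an assumption about how the new modes are selected, so check mx.quantize in your installed version.

```python
import mlx.core as mx

w = mx.random.normal((4096, 4096))

# Affine quantization: packed weights, per-group scales, and biases.
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=4)

# Assumption: the nvfp4 / mxfp8 paths are chosen via a mode argument,
# e.g. mx.quantize(w, mode="mxfp8"); verify against your version's docs.
```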
What's Changed
- Bump the patch version by @angeloskath in #2922
- Faster copy for col contig to row contig by @awni in #2917
- Fix cuda release by @awni in #2925
- Metal logging by @CC-Yeh in #2904
- fix cuda release part 2 by @awni in #2926
- new[CI]: add linux sanitizer tests by @incertum in #2860
- patch bump by @awni in #2927
- Fix CUDA pypi release by @awni in #2929
- Move allocate_workspace to cuda/utils.h by @zcbenz in #2923
- Allow dry run for PyPI release workflow by @zcbenz in #2928
- Set rpath with cmake for CUDA build by @zcbenz in #2932
- Fix nightly build by @zcbenz in #2933
- Set install rpath of python bindings with cmake by @zcbenz in #2934
- Fix pid in local launch by @angeloskath in #2936
- Make CUDA CI run faster by @zcbenz in #2939
- refactor: use perf_counter for accurate benchmarking by @Satyam12singh in #2940
- Fix for non row-contig scales by @awni in #2941
- Fix stubgen by @zcbenz in #2942
- ci: add macOS 26 target by @madrob in #2937
- Fix float64 size in data_types.rst by @pdevine in #2948
- Fixes in mlx.distributed_config by @angeloskath in #2947
- Metal/CPU nvfp4 and mxfp8 by @awni in #2946
- [CUDA] Implement gather_mm_rhs by @zcbenz in #2902
- Fetch nanobind with cmake by @zcbenz in #2949
- refactor: use time.perf_counter for consistent and accurate benchmarking by @Satyam12singh in #2943
- BUG FIX - Addition of missing parameter in random::uniform by @hwiesmann in #2963
- Fix doc issues in mlx.nn.init.he_normal and mlx.nn.hard_tanh by @Redempt1onzzZZ in #2968
- fix numpy dtype bug by @awni in #2960
- QQ linear by @nastya236 in #2931
- fix array allocator with user buffer and deleter by @andresy in #2971
- Swizzle scales by @nastya236 in #2979
- Fix grid_dim_x calculations by @CC-Yeh in #2980
- Add asarray to array_namespace by @Anri-Lombard in #2966
- fix doc by @CC-Yeh in #2988
- replace MLX_IBV_COORDINATOR with MLX_JACCL_COORDINATOR by @Evanev7 in #2986
- Fix RandomBits::is_equivalent to include width by @MillaFleurs in #2978
- Don't try to use NAX at run-time if kernels aren't there by @awni in #2982
- Expose to/from fp8 in Python and don't auto-convert fp8 when loading from safetensors by @awni in #2985
- Allow some non 2D inputs in qqmm by @awni in #2981
New Contributors
- @pdevine made their first contribution in #2948
- @hwiesmann made their first contribution in #2963
- @Anri-Lombard made their first contribution in #2966
- @Evanev7 made their first contribution in #2986
- @MillaFleurs made their first contribution in #2978
Full Changelog: v0.30.1...v0.30.3
v0.30.1
Highlights
- RDMA over Thunderbolt with the JACCL backend (macOS >= 26.2) (some numbers)
- NAX kernels with JIT so they can be used in MLX Swift
- CUDA improvements
- Many improvements to SDPA (masking, T_q != T_kv); see the sketch after this list
- Faster quantize/dequantize
- QQMM to make use of faster tensor cores
- Fix in col reduce speeds up training
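A minimal sketch of the SDPA masking improvements noted above: a causal mask with different query and key/value lengths (lower-right aligned), using arbitrary shapes:

```python
import mlx.core as mx

B, H, T_q, T_kv, D = 1, 16, 4, 512, 64
q = mx.random.normal((B, H, T_q, D))
k = mx.random.normal((B, H, T_kv, D))
v = mx.random.normal((B, H, T_kv, D))

# "causal" applies a lower-right aligned causal mask when T_q != T_kv.
out = mx.fast.scaled_dot_product_attention(
    q, k, v, scale=D**-0.5, mask="causal"
)
print(out.shape)  # (1, 16, 4, 64)
```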
What's Changed
- patch + fix docs build by @awni in #2799
- Fix macos release target and linux arm release by @awni in #2802
- Fix cuda allocator copy condition by @awni in #2800
- [CUDA] Partly fix random for large sizes by @awni in #2798
- patch bump for future version by @awni in #2804
- Centralize NAX condition by @awni in #2811
- Tolerance for some ops tests on cuda by @awni in #2815
- Fix typo: refs/head/main => refs/heads/main by @zcbenz in #2818
- Add float64 Eig and complex64 SVD/Eig support (Fixes #2708) by @harsh-sutariya in #2737
- Fix mx.core.load type annotation by @CC-Yeh in #2819
- Force cudaGraphExec reinstantiation when clusters are used by @andportnoy in #2813
- Bump actions/checkout from 5 to 6 by @dependabot[bot] in #2828
- Fix mx.core.linspace type annotation by @CC-Yeh in #2820
- [CUDA] Exit on crash and more helpful errors by @awni in #2830
- [CUDA] Add debug env to save cuda graphs to dot files by @zcbenz in #2825
- [CUDA] Output of SDPA should have same layout with inputs by @zcbenz in #2826
- Merge build-cuda and build-linux actions by @zcbenz in #2783
- [CUDA] Support array mask in SDPA by @zcbenz in #2822
- [CUDA] Faster rms norm for small dimension by @awni in #2838
- Added clarification to apply_fn parameter of apply_to_modules by @yuchaoran2011 in #2831
- [CUDA] Use cuDNN attention when T_q != T_kv by @zcbenz in #2843
- [CUDA] Migrate conv code to new cuDNN APIs by @zcbenz in #2847
- Support more Numpy interfaces for masked_scatter by @CC-Yeh in #2832
- use thread local capture mode by @awni in #2850
- Fix export scatters by @awni in #2852
- Reduce JVP by @awni in #2854
- Fix graph updating by @awni in #2857
- Fix init from double by @awni in #2861
- Update gumbel function signature parameters by @tianenchong in #2868
- Added support for pytree types that inherit from tuple and typing.namedtuple by @romanoneg in #2845
- Layer norm throws on dimension mismatch by @awni in #2870
- fix compile copying by @awni in #2871
- Do a PyPi release for cuda on arm by @awni in #2866
- Add a 2-pass col reduce for CUDA by @angeloskath in #2863
- [CUDA] Faster general copy by @awni in #2873
- [CUDA] Release build for cuda 13 by @awni in #2872
- Make allocator::malloc throw on allocation failure by @zcbenz in #2874
- [Metal] No copy array init by @awni in #2875
- Try not to fail when there should be memory available by @awni in #2869
- [CUDA] Enable more graphs to be updatable by @awni in #2883
- Fix docs: replace nonexistent mx.random.randn with mx.random.normal by @Satyam12singh in #2890
- Allow events in sub graph to be updatable by @awni in #2886
- bump minimum required Python version by @ngoldbaum in #2891
- do not use simd neon intrinsics on x86 by @davidkoski in #2893
- Fix input buffer donation in compile by @CC-Yeh in #2897
- Update nanobind pin to most recent version by @ngoldbaum in #2896
- fp quantize by @nastya236 in #2892
- Fix grad in place updates by @awni in #2899
- [CUDA] Add host nodes to subgraph types for graph update by @awni in #2901
- fix: possible heap-buffer-overflow in RandomBits::eval_cpu (follow for new ASAN CI tests) by @incertum in #2877
- Fix ccache getting disabled by @zcbenz in #2905
- Fix attention for large sizes by @awni in #2903
- No VJP for mask or sinks in attention by @awni in #2909
- Bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in #2911
- Bump actions/download-artifact from 6 to 7 by @dependabot[bot] in #2912
- Use CUDA runtime headers from local python package by @zcbenz in #2906
- DOC : Add compile state example by @Satyam12singh in #2910
- qqmm by @nastya236 in #2789
- Thunderbolt RDMA communications backend by @angeloskath in #2808
- Add JIT support for NAX kernels by @jagrit06 in #2916
- Fix warnings for the NAX build by @angeloskath in #2921
New Contributors
- @dependabot[bot] made their first contribution in #2828
- @yuchaoran2011 made their first contribution in #2831
- @tianenchong made their first contribution in #2868
- @romanoneg made their first contribution in #2845
- @Satyam12singh made their first contribution in #2890
- @ngoldbaum made their first contribution in #2891
Full Changelog: v0.30.0...v0.30.1
v0.30.0
Highlights
- Support for Neural Accelerators on M5 (macOS >= 26.2)
What's Changed
- Fix AdamW weight_decay default value in docstring by @goingreen in #2557
- Fix dequantize python sig by @wrmsr in #2562
- fix copies in sdpa by @awni in #2563
- chore: Update Docs With Slice Copy Example by @krishi-saripalli in #2559
- Fixed several type annotations in the MLX stubs which degraded to Unknown/Any by @Maalvi14 in #2560
- typing: add type hints to mlx.core.array, linalg, and random by @XXXXRT666 in #2565
- Set ccache size before building by @zcbenz in #2570
- Faster fully depthwise-separable 1D conv by @awni in #2567
- Fix a few ccache cache miss by @zcbenz in #2573
- Some tweaks in cmake files by @zcbenz in #2574
- Add batch offsets for mx.fast.rope by @awni in #2564
- [CUDA] Use GEMM with epilogue instead of AddMM by @zcbenz in #2569
- [CUDA] Fix alpha not respected when using bias epilogue by @zcbenz in #2578
- Fix flaky addmm tests by @zcbenz in #2581
- Adding Relu2 by @Goekdeniz-Guelmez in #2582
- Add sdpa with sinks by @awni in #2558
- [CUDA] Set bias as input when using bias epilogue by @zcbenz in #2584
- [CUDA] Fix NCCL stub for release build by @awni in #2587
- patch bump by @awni in #2588
- Refactor code examples to use 'gelu' by @umbertomig in #2592
- Fix metal scan by @awni in #2591
- Fix typo in average_gradients function call by @umbertomig in #2594
- No copy batch rope by @awni in #2595
- Update export function example for array input by @umbertomig in #2598
- Expose mx.depends to Python by @awni in #2606
- fix: library loading for swift dynamic frameworks by @bilousoleksandr in #2568
- Detect cache thrashing in LRUCache by @zcbenz in #2600
- Lower sorted QMM gather threshold by @awni in #2609
- implement Convolution::output_shape by @josharian in #2601
- Avoid producing NaN in attention by @awni in #2608
- [CUDA] Recycle CUDA events by @zcbenz in #2604
- [CUDA] fix cudaGraphLaunch by @CC-Yeh in #2613
- Support pickling array for bfloat16 by @CC-Yeh in #2586
- New tuning for small K gemv by @jagrit06 in #2620
- Allow None input to compiled functions by @awni in #2621
- Compiled should not end in broadcast by @angeloskath in #2622
- Bump the version by @angeloskath in #2627
- [CUDA] Make CudaEvent work with multi-device by @zcbenz in #2614
- Fix incorrect path and typos by @aisk in #2630
- Fix for max block dim by @awni in #2631
- Compile now can attach arbitrary data to an entry by @angeloskath in #2634
- [CUDA] Wait for tasks in cuda by @awni in #2636
- Fix status message by @angeloskath in #2638
- fix cross entropy axis param by @awni in #2641
- Faster triu, tril, where with scalar by @awni in #2644
- [CUDA] Add a small column specialization to reduce by @angeloskath in #2642
- [CUDA] Fix flaky test by @awni in #2646
- Configure CMake to export compile_commands.json by @andportnoy in #2645
- Faster complex matmul by @CC-Yeh in #2571
- Fix compile when outputs change by @awni in #2648
- Speed up compile for node with many parents by @awni in #2649
- Fix and refactor row-reduce by @angeloskath in #2650
- [CUDA] Fix jit file cache for large kernel names by @angeloskath in #2656
- Fix all_gather vjp by @awni in #2654
- Fix fast synch when fence is waited before a command buffer is created by @awni in #2657
- Fix cumulative operations when axis=None by @aisk in #2653
- Export with callback by @awni in #2612
- bump patch by @awni in #2658
- Enable addmm low-precision cpu by @awni in #2661
- Precise sigmoid by @awni in #2659
- Debug cuda conv by @awni in #2662
- Speed up scalars part 2 by @awni in #2669
- Normalize README bullet formatting and other Markdown small fixes by @Mistobaan in #2671
- Modified sort behavior when running CPU or Metal to match NumPy/JAX by @Maalvi14 in #2667
- remove unused unary file by @awni in #2672
- Nccl timeout by @nastya236 in #2673
- suppress gcc 10.1 warnings by @awni in #2679
- patch bump by @awni in #2680
- Improved mx.split() docs by @Maalvi14 in #2689
- fix warnings showing up with -Wall by @andresy in #2692
- Einsum error msg improvement by @Maalvi14 in #2690
- optionally load metallib from framework by @davidkoski in #2702
- Fix addmm cpu for beta != 1.0 by @awni in #2699
- Add mx.median op by @awni in #2705
- bump python by @awni in #2694
- Fp8 conversion by @awni in #2686
- fix: linux-{fedora}x86_64-build by @incertum in #2707
- Add quantize/dequantize for mxfp8 and nvfp4 by @awni in #2688
- Migrate CircleCI to GitHub Actions by @madrob in #2716
- Fix KeyError for missing domain_uuid_key in Thunderbolt setup by @thechriswebb in #2682
- fix memory count bug by @awni in #2717
- Fix the order of hosts in the ring by @angeloskath in #2718
- Fix docs path by @madrob in #2719
- Use faster dequant for fp4 by @awni in #2720
- update: add linux fedora container CI - CPP build test only by @incertum in #2722
- add null check -- the bundleIdentifier is optional by @davidkoski in #2709
- Fix compile multi capture by @awni in #2678
- Set up publishing to PyPI and Test-PyPI by @madrob in #2721
- Check isnan in maximum / minimum with CPU backend by @aisk in #2652
- Fix addmm with empty matrices and beta != 1.0 by @harsh-sutariya in #2715
- skip self-hosted runners on forks by @madrob in #2730
- only build for macos 14 and up by @awni in #2731
- don't test when doing release by @awni in #2734
- Make cpu binary_op easily accessible by @angeloskath in #2733
- fix property name by @madrob in #2736
- Nccl reduce scatter, all gather by @nastya236 in #2727
- [CUDA] Reduce use of managed memory by @awni in #2725
- Shapeless support for zeros/ones_like by @CC-Yeh in #2726
- Compatibility with pip-installed openmpi by @pcuenca in #2741
- Fix release builds by @awni in #2746
- patch bump by @awni in #2750
- Fix dequantize python sig (dtype default) by @wrmsr in #2752
- remove circle by @awni in #2753
- Fix irregular_strides benchmark shape type by @wrmsr in #2754
- Linux on arm by @awni ...
v0.29.4
v0.29.3
v0.29.2