Releases: ml-explore/mlx
v0.30.6
Highlights
- Much higher bandwidth with JACCL on macOS >= 26.3 (some numbers); a minimal usage sketch follows below
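For context, a minimal distributed sketch in Python. Selecting the backend explicitly with backend="jaccl" and launching the script with the mlx.launch helper are assumptions here; by default mx.distributed.init() picks whichever backend is available.

```python
import mlx.core as mx

# Assumption: "jaccl" is the backend name; mx.distributed.init() with no
# arguments falls back to any available backend.
group = mx.distributed.init(backend="jaccl")

x = mx.ones((4,))
# Sum the array across all ranks in the group.
y = mx.distributed.all_sum(x, group=group)
print(group.rank(), y)
```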
What's Changed
- patch by @awni in #3093
- Disable managed memory on WSL when concurrentManagedAccess is not supported by @jessegross in #3095
- Fix non simd f16 build by @awni in #3097
- Fix 2pass sdpa on < M2 by @awni in #3099
- JACCL update by @angeloskath in #3094
- Fix qmv_impl for small N by @manuelcandales in #3096
- Patch for multi device CUDA by @awni in #3100
New Contributors
- @manuelcandales made their first contribution in #3096
Full Changelog: v0.30.5...v0.30.6
v0.30.5
What's Changed
- patch by @awni in #3074
- [CUDA] Fallback Event impl when there is no hardware cpu/gpu coherency by @zcbenz in #3070
- Tune CUDA graph sizes on B200 and H100 by @awni in #3077
- [Docs] Simple example of using MLX distributed by @stefpi in #2973
- Use lower-right causal mask alignment consistently by @Anri-Lombard in #2967
- Fix ALiBi slopes for non-power-of-2 num_heads by @vovw in #3071
- More useful error for large indices by @awni in #3079
- Fix NAX condition for iPhone by @awni in #3083
- Fallback to pinned host memory when managed memory is not supported by @zcbenz in #3075
- Fix failing python tests on Windows by @zcbenz in #3076
- [Metal] Tune splitk gemm dispatch conditions and partition sizes by @awni in #3087
- Fix for NAX overflow. by @awni in #3092
Full Changelog: v0.30.4...v0.30.5
v0.30.4
Highlights
- Metal: Much faster vector fused grouped-query attention for long context (see the sketch after this list)
- CUDA: Several improvements to speed up LLM inference for CUDA backend
- CUDA: Support for dense MoEs
- CUDA: Better support for consumer GPUs (4090, 5090, RTX 6000, ...)
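For reference, the grouped-query attention highlight above goes through mx.fast.scaled_dot_product_attention; a minimal sketch with arbitrary sizes (8 KV heads broadcast over 32 query heads, one decode token against a long KV cache):

```python
import mlx.core as mx

B, n_q_heads, n_kv_heads, L, D = 1, 32, 8, 4096, 128
q = mx.random.normal((B, n_q_heads, 1, D))   # single decode-step query
k = mx.random.normal((B, n_kv_heads, L, D))  # long key cache
v = mx.random.normal((B, n_kv_heads, L, D))  # long value cache

# Grouped-query attention: fewer KV heads than query heads.
out = mx.fast.scaled_dot_product_attention(q, k, v, scale=D**-0.5)
print(out.shape)  # (1, 32, 1, 128)
```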
What's Changed
- patch bump for next release by @awni in #2991
- Fix fence by @awni in #2998
- Reverts changing the MLX_IBV_DEVICES to MLX_JACCL_DEVICES by @angeloskath in #2999
- fix distributed all_to_sharded bias shard axis from -2 to -1 by @gufengc in #2987
- Fix sharding of quantized models with non-power-of-2 bits by @kernelpool in #3006
- Update CCCL to v3.1.3 by @zcbenz in #3012
- Fix python package install path in stubgen by @zcbenz in #3009
- Type Enhancement for Func Transforms and Bug Fix by @XXXXRT666 in #3003
- Do not clear disk space in setup-linux by @zcbenz in #3013
- Do not give workflow boolean inputs default values by @zcbenz in #3014
- Fix negative dim indexing by @MillaFleurs in #2994
- Windows CI by @zcbenz in #3021
- Optimize erf function with expm1f in Metal backend by @bjornefisk in #3025
- [CUDA] Faster grouped mm by @zcbenz in #3011
- PR 3007 Fix Seg Fault by @MillaFleurs in #3008
- Use higher precision for linspace with double by @awni in #3029
- Handle data smaller than BUFFER_SIZE in jaccl recv by @rltakashige in #3033
- build 26.0 release in actions by @awni in #3035
- Remove xmlrunner from macOS CI by @zcbenz in #3032
- Columnwise quantize by @nastya236 in #2989
- Turn nccl_stub into a normal target by @zcbenz in #3037
- Use cuda::std for math ops by @zcbenz in #3041
- win: symbol exports and minor fixes by @dhiltgen in #3024
- CUDA gather mv by @angeloskath in #3039
- Link with prebuilt OpenBLAS and fix shared libs build on Windows by @zcbenz in #3036
- Allow take on empty array when it makes sense by @awni in #3046
- Add missing include to buffer_cache.h by @Anri-Lombard in #3053
- Build and test python package on Windows CI by @zcbenz in #3049
- Fix some MSVC compilation errors by @zcbenz in #3048
- Use C++20 by @zcbenz in #3050
- Faster two pass sdpa by @awni in #3023
- Find system-installed cuDNN on Windows by @zcbenz in #3052
- Fix some NVCC warnings when building CUDA backend with MSVC by @zcbenz in #3038
- Hide symbols by default for mac/linux by @zcbenz in #3057
- [CUDA] Fast sorting by @awni in #3060
- Fix flaky macOS test by @awni in #3063
- Update pre-commit hooks and versions for clang-format, black, and isort by @NripeshN in #3059
- GPU discovery by @dhiltgen in #3055
- Add NAX Split-K GEMM for large-K matmuls to improve performance by @hxu296 in #3018
- Improve CPU discovery by @dhiltgen in #3068
- Fix long cache file path on Windows by @zcbenz in #3065
- Better support consumer CUDA GPUs by @jessegross in #3056
- Delay load CUDA libs and resolve DLL paths at runtime by @zcbenz in #3061
- Do not require ConcurrentManagedAccess when not used by @zcbenz in #3062
- Fp qmv by @awni in #2984
- remove thrust by @awni in #3067
New Contributors
- @gufengc made their first contribution in #2987
- @kernelpool made their first contribution in #3006
- @bjornefisk made their first contribution in #3025
- @rltakashige made their first contribution in #3033
- @dhiltgen made their first contribution in #3024
- @hxu296 made their first contribution in #3018
- @jessegross made their first contribution in #3056
Full Changelog: v0.30.3...v0.30.4
v0.30.3
Highlights
- Support nvfp4 and mxfp8 quantized ops on Metal (see the sketch after this list)
- Support nvfp4 and mxfp8 quantized-quantized matrix-matrix multiplication on CUDA
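A hedged sketch of the quantization entry point these ops extend. The affine call below is the long-standing mx.quantize signature; the commented mxfp8 line is an assumption about how the new modes are selected, so check mx.quantize in your installed version.

```python
import mlx.core as mx

w = mx.random.normal((4096, 4096))

# Affine quantization: packed weights, per-group scales, and biases.
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=4)

# Assumption: the nvfp4 / mxfp8 paths are chosen via a mode argument,
# e.g. mx.quantize(w, mode="mxfp8"); verify against your version's docs.
```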
What's Changed
- Bump the patch version by @angeloskath in #2922
- Faster copy for col contig to row contig by @awni in #2917
- Fix cuda release by @awni in #2925
- Metal logging by @CC-Yeh in #2904
- fix cuda release part 2 by @awni in #2926
- new[CI]: add linux sanitizer tests by @incertum in #2860
- patch bump by @awni in #2927
- Fix CUDA pypi release by @awni in #2929
- Move allocate_workspace to cuda/utils.h by @zcbenz in #2923
- Allow dry run for PyPI release workflow by @zcbenz in #2928
- Set rpath with cmake for CUDA build by @zcbenz in #2932
- Fix nightly build by @zcbenz in #2933
- Set install rpath of python bindings with cmake by @zcbenz in #2934
- Fix pid in local launch by @angeloskath in #2936
- Make CUDA CI run faster by @zcbenz in #2939
- refactor: use perf_counter for accurate benchmarking by @Satyam12singh in #2940
- Fix for non row-contig scales by @awni in #2941
- Fix stubgen by @zcbenz in #2942
- ci: add macOS 26 target by @madrob in #2937
- Fix float64 size in data_types.rst by @pdevine in #2948
- Fixes in mlx.distributed_config by @angeloskath in #2947
- Metal/CPU nvfp4 and mxfp8 by @awni in #2946
- [CUDA] Implement gather_mm_rhs by @zcbenz in #2902
- Fetch nanobind with cmake by @zcbenz in #2949
- refactor: use time.perf_counter for consistent and accurate benchmarking by @Satyam12singh in #2943
- BUG FIX - Addition of missing parameter in random::uniform by @hwiesmann in #2963
- Fix doc issues in mlx.nn.init.he_normal and mlx.nn.hard_tanh by @Redempt1onzzZZ in #2968
- fix numpy dtype bug by @awni in #2960
- QQ linear by @nastya236 in #2931
- fix array allocator with user buffer and deleter by @andresy in #2971
- Swizzle scales by @nastya236 in #2979
- Fix grid_dim_x calculations by @CC-Yeh in #2980
- Add asarray to array_namespace by @Anri-Lombard in #2966
- fix doc by @CC-Yeh in #2988
- replace MLX_IBV_COORDINATOR with MLX_JACCL_COORDINATOR by @Evanev7 in #2986
- Fix RandomBits::is_equivalent to include width by @MillaFleurs in #2978
- Don't try to use NAX at run-time if kernels aren't there by @awni in #2982
- Expose to/from fp8 in Python and don't auto-convert fp8 when loading from safetensors by @awni in #2985
- Allow some non 2D inputs in qqmm by @awni in #2981
New Contributors
- @pdevine made their first contribution in #2948
- @hwiesmann made their first contribution in #2963
- @Anri-Lombard made their first contribution in #2966
- @Evanev7 made their first contribution in #2986
- @MillaFleurs made their first contribution in #2978
Full Changelog: v0.30.1...v0.30.3
v0.30.1
Highlights
- RDMA over Thunderbolt with the JACCL backend (macOS >= 26.2) (some numbers)
- NAX kernels with JIT so they can be used in MLX Swift
- CUDA improvements
- Many improvements to SDPA (masking, T_q != T_kv); see the sketch after this list
- Faster quantize/dequantize
- QQMM to make use of faster tensor cores
- Fix in col reduce speeds up training
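A minimal sketch of the SDPA masking improvements noted above: a causal mask with different query and key/value lengths (lower-right aligned), using arbitrary shapes:

```python
import mlx.core as mx

B, H, T_q, T_kv, D = 1, 16, 4, 512, 64
q = mx.random.normal((B, H, T_q, D))
k = mx.random.normal((B, H, T_kv, D))
v = mx.random.normal((B, H, T_kv, D))

# "causal" applies a lower-right aligned causal mask when T_q != T_kv.
out = mx.fast.scaled_dot_product_attention(
    q, k, v, scale=D**-0.5, mask="causal"
)
print(out.shape)  # (1, 16, 4, 64)
```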
What's Changed
- patch + fix docs build by @awni in #2799
- Fix macos release target and linux arm release by @awni in #2802
- Fix cuda allocator copy condition by @awni in #2800
- [CUDA] Partly fix random for large sizes by @awni in #2798
- patch bump for future version by @awni in #2804
- Centralize NAX condition by @awni in #2811
- Tolerance for some ops tests on cuda by @awni in #2815
- Fix typo: refs/head/main => refs/heads/main by @zcbenz in #2818
- Add float64 Eig and complex64 SVD/Eig support (Fixes #2708) by @harsh-sutariya in #2737
- Fix mx.core.load type annotation by @CC-Yeh in #2819
- Force cudaGraphExec reinstantiation when clusters are used by @andportnoy in #2813
- Bump actions/checkout from 5 to 6 by @dependabot[bot] in #2828
- Fix mx.core.linspace type annotation by @CC-Yeh in #2820
- [CUDA] Exit on crash and more helpful errors by @awni in #2830
- [CUDA] Add debug env to save cuda graphs to dot files by @zcbenz in #2825
- [CUDA] Output of SDPA should have same layout with inputs by @zcbenz in #2826
- Merge build-cuda and build-linux actions by @zcbenz in #2783
- [CUDA] Support array mask in SDPA by @zcbenz in #2822
- [CUDA] Faster rms norm for small dimension by @awni in #2838
- Added clarification to apply_fn parameter of apply_to_modules by @yuchaoran2011 in #2831
- [CUDA] Use cuDNN attention when T_q != T_kv by @zcbenz in #2843
- [CUDA] Migrate conv code to new cuDNN APIs by @zcbenz in #2847
- Support more Numpy interfaces for masked_scatter by @CC-Yeh in #2832
- use thread local capture mode by @awni in #2850
- Fix export scatters by @awni in #2852
- Reduce JVP by @awni in #2854
- Fix graph updating by @awni in #2857
- Fix init from double by @awni in #2861
- Update gumbel function signature parameters by @tianenchong in #2868
- Added support for pytree types that inherit from tuple and typing.namedtuple by @romanoneg in #2845
- Layer norm throws on dimension mismatch by @awni in #2870
- fix compile copying by @awni in #2871
- Do a PyPi release for cuda on arm by @awni in #2866
- Add a 2-pass col reduce for CUDA by @angeloskath in #2863
- [CUDA] Faster general copy by @awni in #2873
- [CUDA] Release build for cuda 13 by @awni in #2872
- Make allocator::malloc throw on allocation failure by @zcbenz in #2874
- [Metal] No copy array init by @awni in #2875
- Try not to fail when there should be memory available by @awni in #2869
- [CUDA] Enable more graphs to be updatable by @awni in #2883
- Fix docs: replace nonexistent mx.random.randn with mx.random.normal by @Satyam12singh in #2890
- Allow events in sub graph to be updatable by @awni in #2886
- bump minimum required Python version by @ngoldbaum in #2891
- do not use simd neon intrinsics on x86 by @davidkoski in #2893
- Fix input buffer donation in compile by @CC-Yeh in #2897
- Update nanobind pin to most recent version by @ngoldbaum in #2896
- fp quantize by @nastya236 in #2892
- Fix grad in place updates by @awni in #2899
- [CUDA] Add host nodes to subgraph types for graph update by @awni in #2901
- fix: possible heap-buffer-overflow in RandomBits::eval_cpu (follow for new ASAN CI tests) by @incertum in #2877
- Fix ccache getting disabled by @zcbenz in #2905
- Fix attention for large sizes by @awni in #2903
- No VJP for mask or sinks in attention by @awni in #2909
- Bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in #2911
- Bump actions/download-artifact from 6 to 7 by @dependabot[bot] in #2912
- Use CUDA runtime headers from local python package by @zcbenz in #2906
- DOC : Add compile state example by @Satyam12singh in #2910
- qqmm by @nastya236 in #2789
- Thunderbolt RDMA communications backend by @angeloskath in #2808
- Add JIT support for NAX kernels by @jagrit06 in #2916
- Fix warnings for the NAX build by @angeloskath in #2921
New Contributors
- @dependabot[bot] made their first contribution in #2828
- @yuchaoran2011 made their first contribution in #2831
- @tianenchong made their first contribution in #2868
- @romanoneg made their first contribution in #2845
- @Satyam12singh made their first contribution in #2890
- @ngoldbaum made their first contribution in #2891
Full Changelog: v0.30.0...v0.30.1
v0.30.0
Highlights
- Support for Neural Accelerators on M5 (macOS >= 26.2)
What's Changed
- Fix AdamW weight_decay default value in docstring by @goingreen in #2557
- Fix dequantize python sig by @wrmsr in #2562
- fix copies in sdpa by @awni in #2563
- chore: Update Docs With Slice Copy Example by @krishi-saripalli in #2559
- Fixed several type annotations in the MLX stubs which degraded to Unknown/Any by @Maalvi14 in #2560
- typing: add type hints to mlx.core.array, linalg, and random by @XXXXRT666 in #2565
- Set ccache size before building by @zcbenz in #2570
- Faster fully depthwise-separable 1D conv by @awni in #2567
- Fix a few ccache cache miss by @zcbenz in #2573
- Some tweaks in cmake files by @zcbenz in #2574
- Add batch offsets for mx.fast.rope by @awni in #2564
- [CUDA] Use GEMM with epilogue instead of AddMM by @zcbenz in #2569
- [CUDA] Fix alpha not respected when using bias epilogue by @zcbenz in #2578
- Fix flaky addmm tests by @zcbenz in #2581
- Adding Relu2 by @Goekdeniz-Guelmez in #2582
- Add sdpa with sinks by @awni in #2558
- [CUDA] Set bias as input when using bias epilogue by @zcbenz in #2584
- [CUDA] Fix NCCL stub for release build by @awni in #2587
- patch bump by @awni in #2588
- Refactor code examples to use 'gelu' by @umbertomig in #2592
- Fix metal scan by @awni in #2591
- Fix typo in average_gradients function call by @umbertomig in #2594
- No copy batch rope by @awni in #2595
- Update export function example for array input by @umbertomig in #2598
- Expose mx.depends to Python by @awni in #2606
- fix: library loading for swift dynamic frameworks by @bilousoleksandr in #2568
- Detect cache thrashing in LRUCache by @zcbenz in #2600
- Lower sorted QMM gather threshold by @awni in #2609
- implement Convolution::output_shape by @josharian in #2601
- Avoid producing NaN in attention by @awni in #2608
- [CUDA] Recycle CUDA events by @zcbenz in #2604
- [CUDA] fix cudaGraphLaunch by @CC-Yeh in #2613
- Support pickling array for bfloat16 by @CC-Yeh in #2586
- New tuning for small K gemv by @jagrit06 in #2620
- Allow None input to compiled functions by @awni in #2621
- Compiled should not end in broadcast by @angeloskath in #2622
- Bump the version by @angeloskath in #2627
- [CUDA] Make CudaEvent work with multi-device by @zcbenz in #2614
- Fix incorrect path and typos by @aisk in #2630
- Fix for max block dim by @awni in #2631
- Compile now can attach arbitrary data to an entry by @angeloskath in #2634
- [CUDA] Wait for tasks in cuda by @awni in #2636
- Fix status message by @angeloskath in #2638
- fix cross entropy axis param by @awni in #2641
- Faster triu, tril, where with scalar by @awni in #2644
- [CUDA] Add a small column specialization to reduce by @angeloskath in #2642
- [CUDA] Fix flaky test by @awni in #2646
- Configure CMake to export compile_commands.json by @andportnoy in #2645
- Faster complex matmul by @CC-Yeh in #2571
- Fix compile when outputs change by @awni in #2648
- Speed up compile for node with many parents by @awni in #2649
- Fix and refactor row-reduce by @angeloskath in #2650
- [CUDA] Fix jit file cache for large kernel names by @angeloskath in #2656
- Fix all_gather vjp by @awni in #2654
- Fix fast synch when fence is waited before a command buffer is created by @awni in #2657
- Fix cumulative operations when axis=None by @aisk in #2653
- Export with callback by @awni in #2612
- bump patch by @awni in #2658
- Enable addmm low-precision cpu by @awni in #2661
- Precise sigmoid by @awni in #2659
- Debug cuda conv by @awni in #2662
- Speed up scalars part 2 by @awni in #2669
- Normalize README bullet formatting and other Markdown small fixes by @Mistobaan in #2671
- Modified sort behavior when running CPU or Metal to match NumPy/JAX by @Maalvi14 in #2667
- remove unused unary file by @awni in #2672
- Nccl timeout by @nastya236 in #2673
- suppress gcc 10.1 warnings by @awni in #2679
- patch bump by @awni in #2680
- Improved mx.split() docs by @Maalvi14 in #2689
- fix warnings showing up with -Wall by @andresy in #2692
- Einsum error msg improvement by @Maalvi14 in #2690
- optionally load metallib from framework by @davidkoski in #2702
- Fix addmm cpu for beta != 1.0 by @awni in #2699
- Add mx.median op by @awni in #2705
- bump python by @awni in #2694
- Fp8 conversion by @awni in #2686
- fix: linux-{fedora}x86_64-build by @incertum in #2707
- Add quantize/dequantize for mxfp8 and nvfp4 by @awni in #2688
- Migrate CircleCI to GitHub Actions by @madrob in #2716
- Fix KeyError for missing domain_uuid_key in Thunderbolt setup by @thechriswebb in #2682
- fix memory count bug by @awni in #2717
- Fix the order of hosts in the ring by @angeloskath in #2718
- Fix docs path by @madrob in #2719
- Use faster dequant for fp4 by @awni in #2720
- update: add linux fedora container CI - CPP build test only by @incertum in #2722
- add null check -- the bundleIdentifier is optional by @davidkoski in #2709
- Fix compile multi capture by @awni in #2678
- Set up publishing to PyPI and Test-PyPI by @madrob in #2721
- Check isnan in maximum / minimum with CPU backend by @aisk in #2652
- Fix addmm with empty matrices and beta != 1.0 by @harsh-sutariya in #2715
- skip self-hosted runners on forks by @madrob in #2730
- only build for macos 14 and up by @awni in #2731
- don't test when doing release by @awni in #2734
- Make cpu binary_op easily accessible by @angeloskath in #2733
- fix property name by @madrob in #2736
- Nccl reduce scatter, all gather by @nastya236 in #2727
- [CUDA] Reduce use of managed memory by @awni in #2725
- Shapeless support for zeros/ones_like by @CC-Yeh in #2726
- Compatibility with pip-installed openmpi by @pcuenca in #2741
- Fix release builds by @awni in #2746
- patch bump by @awni in #2750
- Fix dequantize python sig (dtype default) by @wrmsr in #2752
- remove circle by @awni in #2753
- Fix irregular_strides benchmark shape type by @wrmsr in #2754
- Linux on arm by @awni ...
v0.29.4
v0.29.3
v0.29.2