Releases: ml-explore/mlx
v0.14.0
Highlights
- Small-size build that JIT compiles kernels and omits the CPU backend, resulting in a binary under 4 MB
- `mx.gather_qmm`: quantized equivalent of `mx.gather_mm`, which speeds up MoE inference by ~2x
- Grouped 2D convolutions
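Conceptually, the gathered matmul routes each token through the weight matrix of its assigned expert. A minimal pure-Python sketch of the idea (illustrative only; the real `mx.gather_mm`/`mx.gather_qmm` operate on MLX arrays with fused kernels):

```python
def matmul(a, b):
    """Naive matrix multiply for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gather_mm(tokens, expert_weights, expert_ids):
    """For each token (a 1 x d row), multiply by the weight matrix of its
    routed expert -- the per-token operation inside a MoE layer."""
    out = []
    for row, e in zip(tokens, expert_ids):
        out.append(matmul([row], expert_weights[e])[0])
    return out

# Two experts with 2x2 weight matrices; two tokens routed to different experts.
experts = [
    [[1, 0], [0, 1]],   # expert 0: identity
    [[2, 0], [0, 2]],   # expert 1: doubles the input
]
tokens = [[3, 4], [3, 4]]
print(gather_mm(tokens, experts, [0, 1]))  # [[3, 4], [6, 8]]
```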
Core
- `mx.conjugate`
- `mx.conv3d` and `nn.Conv3d`
- List-based indexing
- Started `mx.distributed`, which uses MPI (if installed) for communication across machines:
  - `mx.distributed.init`
  - `mx.distributed.all_gather`
  - `mx.distributed.all_reduce_sum`
- Support conversion to and from DLPack
- `mx.linalg.cholesky` on CPU
- `mx.quantized_matmul` sped up for vector-matrix products
- `mx.trace`
- `mx.block_masked_mm` now supports floating-point masks!
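Floating-point masks let a block-masked matmul scale whole output tiles rather than only keep or drop them. A toy pure-Python sketch of the idea (not MLX's kernel; skipping zero tiles is where the sparsity speedup comes from):

```python
def block_masked_mm(a, b, mask, block=2):
    """Matmul where each (block x block) tile of the output is scaled by a
    mask entry; a 0 mask skips the tile entirely, and a float mask scales it."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for bi in range(0, m, block):
        for bj in range(0, n, block):
            w = mask[bi // block][bj // block]
            if w == 0:
                continue  # masked-out tile: never computed
            for i in range(bi, min(bi + block, m)):
                for j in range(bj, min(bj + block, n)):
                    c[i][j] = w * sum(a[i][t] * b[t][j] for t in range(k))
    return c

a = [[1, 2], [3, 4]]
b = [[1, 0], [0, 1]]            # identity
mask = [[1.0, 0.0],
        [0.5, 1.0]]             # per-tile scale factors
print(block_masked_mm(a, b, mask, block=1))  # [[1.0, 0.0], [1.5, 4.0]]
```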
Fixes
- Error messaging in eval
- Add some missing docs
- Scatter index bug
- The extensions example now compiles and runs
- CPU copy bug with many dimensions
v0.13.1
v0.13.0
Highlights
- Block sparse matrix multiply speeds up MoEs by >2x
- Improved quantization algorithm should work well for all networks
- Improved gpu command submission speeds up training and inference
Core
- Bitwise ops added: `mx.bitwise_[or|and|xor]`, `mx.[left|right]_shift`, and operator overloads
- Groups added to Conv1d
- Added `mx.metal.device_info` to get better informed memory limits
- Added resettable memory stats
- `mlx.optimizers.clip_grad_norm` and `mlx.utils.tree_reduce` added
- Added `mx.arctan2`
- Unary ops now accept array-like inputs, i.e. one can do `mx.sqrt(2)`
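The idea behind `mlx.optimizers.clip_grad_norm` is global-norm clipping: compute one L2 norm over all gradients and rescale everything if it exceeds the threshold. A flat-dict pure-Python sketch (the real function works on MLX arrays and nested parameter trees):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale a dict of gradient vectors so their global L2 norm is at
    most max_norm; returns (clipped_grads, total_norm)."""
    total = math.sqrt(sum(g * g for vec in grads.values() for g in vec))
    scale = max_norm / total if total > max_norm else 1.0
    clipped = {k: [g * scale for g in vec] for k, vec in grads.items()}
    return clipped, total

grads = {"w": [3.0, 4.0], "b": [0.0]}              # global norm = 5.0
clipped, norm = clip_grad_norm(grads, max_norm=2.5)
print(norm, clipped["w"])  # 5.0 [1.5, 2.0]
```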
Bugfixes
- Fixed shape for slice update
- Bugfix in quantize that used slightly wrong scales/biases
- Fixed memory leak for multi-output primitives encountered with gradient checkpointing
- Fixed conversion from other frameworks for all datatypes
- Fixed index overflow for matmul with large batch size
- Fixed initialization ordering that occasionally caused segfaults
v0.12.2
v0.12.0
Highlights
- Faster quantized matmul
- Up to 40% faster QLoRA or prompt processing, some numbers
Core
- `mx.synchronize` to wait for computation dispatched with `mx.async_eval`
- `mx.radians` and `mx.degrees`
- `mx.metal.clear_cache` to return to the OS the memory held by MLX as a cache for future allocations
- Changed quantization to always represent 0 exactly (relevant issue)
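Representing 0 exactly means the integer grid contains a code that dequantizes back to exactly 0.0, which matters for padding and zero weights. A hypothetical affine-quantization sketch in plain Python (not MLX's group-wise scheme; the names here are illustrative) showing the zero point snapped to an integer code:

```python
def quantize(xs, bits=8):
    """Affine quantization that forces 0.0 to map to an exact integer code,
    so dequantize(quantize(0.0)) == 0.0."""
    lo, hi = min(xs + [0.0]), max(xs + [0.0])      # keep 0 inside the range
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0
    zero_point = round(-lo / scale)                # integer code for 0.0
    q = [min(levels, max(0, round(x / scale) + zero_point)) for x in xs]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(code - zero_point) * scale for code in q]

q, s, z = quantize([-1.0, 0.0, 2.0])
x = dequantize(q, s, z)
print(x[1])  # 0.0 -- zero is reconstructed exactly
```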
Bugfixes
- Fixed quantization of a block with all 0s that produced NaNs
- Fixed the `len` field in the buffer protocol implementation
v0.11.0
v0.10.0
Highlights
- Improvements for LLM generation
- Reshapeless quantized matmul/matvec
- `mx.async_eval`
- Async command encoding
Core
- Slightly faster reshapeless quantized gemms
- Option for precise softmax
- `mx.metal.start_capture` and `mx.metal.stop_capture` for GPU debug/profile
- `mx.expm1`
- `mx.std`
- `mx.meshgrid`
- CPU-only `mx.random.multivariate_normal`
- `mx.cumsum` (and other scans) for `bfloat`
- Async command encoder with explicit barriers / dependency management
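Softmax is only usable on large logits with the standard max-subtraction trick; the new precise option additionally accumulates the reduction in higher precision. The stability trick itself, in a plain-Python sketch:

```python
import math

def softmax(xs):
    """Numerically stable softmax: subtracting the max keeps exp() from
    overflowing on large inputs without changing the result."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1000.0])   # a naive exp(1000) would overflow
print(probs)                         # [0.5, 0.5]
```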
NN
- `nn.Upsample` supports bicubic interpolation
Misc
- Updated MLX Extension to work with nanobind
Bugfixes
- Fix buffer donation in softmax and fast ops
- Bug in layer norm vjp
- Bug initializing from lists with scalar
- Bug in indexing
- CPU compilation bug
- Multi-output compilation bug
- Fix stack overflow issues in eval and array destruction
v0.9.0
Highlights
- Fast partial RoPE (used by Phi-2)
- Fast gradients for RoPE, RMSNorm, and LayerNorm
- Up to 7x faster, benchmarks
Core
- More overhead reductions
- Partial fast RoPE (fast Phi-2)
- Better buffer donation for copy
- Type hierarchy and `issubdtype`
- Fast VJPs for RoPE, RMSNorm, and LayerNorm
NN
- `Module.set_dtype`
- Chaining in `nn.Module` (`model.freeze().update(…)`)
Bugfixes
- Fix set item bugs
- Fix scatter vjp
- Check shape integer overflow on array construction
- Fix bug with module attributes
- Fix two bugs for odd shaped QMV
- Fix GPU sort for large sizes
- Fix bug in negative padding for convolutions
- Fix bug in multi-stream race condition for graph evaluation
- Fix random normal generation for half precision
v0.8.0
Highlights
- More perf!
- `mx.fast.rms_norm` and `mx.fast.layer_norm`
- Switch to nanobind substantially reduces overhead
- Up to 4x faster `__setitem__` (e.g. `a[...] = b`)
Core
- `mx.inverse`, CPU only
- vmap over `mx.matmul` and `mx.addmm`
- Switch to nanobind from pybind11
- Faster setitem indexing
- `mx.fast.rms_norm`, token generation benchmark
- `mx.fast.layer_norm`, token generation benchmark
- vmap for inverse and svd
- Faster non-overlapping pooling
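The computation that `mx.fast.rms_norm` fuses into a single kernel is the RMSNorm formula; a plain-Python sketch for reference (the real op runs as a fused Metal kernel over MLX arrays):

```python
import math

def rms_norm(xs, weight, eps=1e-5):
    """RMSNorm: divide by the root-mean-square of the vector, then apply a
    learned per-element weight."""
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [w * x / rms for w, x in zip(weight, xs)]

print(rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0))
# with unit weights, the output vector has root-mean-square 1
```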
Optimizers
- Set minimum value in cosine decay scheduler
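With the new minimum value, the cosine schedule anneals to a floor instead of zero. A plain-Python sketch of the schedule (illustrative; parameter names here are not MLX's API):

```python
import math

def cosine_decay(step, init_lr, decay_steps, min_lr=0.0):
    """Cosine learning-rate decay with a floor: anneal from init_lr down to
    min_lr over decay_steps, then hold at min_lr."""
    t = min(step, decay_steps) / decay_steps
    return min_lr + 0.5 * (init_lr - min_lr) * (1 + math.cos(math.pi * t))

print(cosine_decay(0, 1.0, 100, min_lr=0.5))    # 1.0
print(cosine_decay(100, 1.0, 100, min_lr=0.5))  # 0.5
```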
Bugfixes
- Fix bug in multi-dimensional reduction
v0.7.0
Highlights
- Perf improvements for attention ops:
- No-copy broadcast matmul (benchmarks)
- Fewer copies in reshape
Core
- Faster broadcast + gemm
- `mx.linalg.svd` (CPU only)
- Fewer copies in reshape
- Faster small reductions
NN
- `nn.RNN`, `nn.LSTM`, `nn.GRU`
Bugfixes
- Fix bug in depth traversal ordering
- Fix two edge case bugs in compilation
- Fix bug with modules with dictionaries of weights
- Fix bug with scatter which broke MoE training
- Fix bug with compilation kernel collision