Add contiguity check to try_as_slice() and contiguous() function #310

Open

ymote wants to merge 19 commits into oxiglade:main from ymote:fix/as-slice-contiguity-check

Conversation

@ymote ymote commented Jan 25, 2026

Summary

  • Add is_contiguous() method to Array to check if the array has contiguous memory layout
  • Add NotContiguous error variant to AsSliceError
  • Update try_as_slice() to check contiguity before returning slice (BREAKING CHANGE)
  • Add contiguous() function to ops module to make non-contiguous arrays contiguous

Motivation

Previously, as_slice() would return raw memory ignoring strides, causing silent data corruption when the array was not contiguous. This was a significant source of bugs when working with indexed or transposed arrays.

For example:

let arr = Array::from_slice(&[1i32, 2, 3, 4, 5, 6], &[2, 3]);
let transposed = ops::transpose(&arr)?;
let slice = transposed.as_slice::<i32>();  // WRONG DATA!

The slice would contain elements in the wrong order because transpose() creates a strided view, but as_slice() ignored the strides.

Changes

Array::is_contiguous()

Checks if the array is contiguous (row-major/C-style) by verifying strides match expected values.
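
The check itself is mechanical: a row-major (C-order) array is contiguous when each axis's stride equals the product of all trailing dimensions. A minimal standalone sketch of that logic with illustrative names (not the actual mlx-rs implementation):

```rust
/// Returns true if `strides` (in elements) describe a dense row-major
/// layout for `shape`. Size-1 axes are skipped, since their stride
/// never affects addressing. Illustrative sketch only.
fn is_contiguous(shape: &[usize], strides: &[usize]) -> bool {
    let mut expected = 1;
    // Walk axes from innermost to outermost, accumulating the
    // stride a dense layout would have at each axis.
    for (&dim, &stride) in shape.iter().rev().zip(strides.iter().rev()) {
        if dim > 1 && stride != expected {
            return false;
        }
        expected *= dim;
    }
    true
}
```

For the PR's example, a [2, 3] array has strides [3, 1] and passes; its transpose is a [3, 2] view with strides [1, 3] and fails, which is exactly the case try_as_slice() now rejects.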

AsSliceError::NotContiguous

New error variant returned when calling try_as_slice() on a non-contiguous array.

try_as_slice() update

Now checks contiguity after the array is evaluated and returns a NotContiguous error before handing out a slice of a non-contiguous array.
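
Callers now have to handle the new error. Below is a hedged sketch of the calling pattern, using mocked-up stand-ins for the mlx-rs types (the real Array and AsSliceError are richer than this; the comment marks where ops::contiguous() would go):

```rust
// Mock stand-ins; illustrative only, not the mlx-rs definitions.
#[derive(Debug, PartialEq)]
enum AsSliceError {
    NotContiguous,
}

struct Array {
    data: Vec<f32>,
    contiguous: bool, // stands in for the real stride check
}

impl Array {
    fn try_as_slice(&self) -> Result<&[f32], AsSliceError> {
        if self.contiguous {
            Ok(&self.data)
        } else {
            Err(AsSliceError::NotContiguous)
        }
    }
}

/// Post-PR calling pattern: take the zero-copy path when possible,
/// and only fall back to a copy when the layout forces it.
fn read_values(arr: &Array) -> Result<Vec<f32>, AsSliceError> {
    match arr.try_as_slice() {
        Ok(slice) => Ok(slice.to_vec()),
        Err(AsSliceError::NotContiguous) => {
            // Real code would call ops::contiguous(arr) here and
            // read the slice from the dense copy instead.
            Err(AsSliceError::NotContiguous)
        }
    }
}
```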

ops::contiguous()

New function that returns a contiguous copy of the array if needed.
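
Conceptually, it gathers elements through the strides into a fresh row-major buffer (and can hand back the input unchanged when it is already contiguous). A standalone 2-D sketch of that gather, with illustrative names:

```rust
/// Copies a 2-D strided view over `data` into a dense row-major Vec.
/// Strides are in elements. Illustrative sketch; the real function
/// handles arbitrary rank and dtypes.
fn to_contiguous(data: &[i32], shape: [usize; 2], strides: [usize; 2]) -> Vec<i32> {
    let mut out = Vec::with_capacity(shape[0] * shape[1]);
    for r in 0..shape[0] {
        for c in 0..shape[1] {
            // Strided address: row stride plus column stride.
            out.push(data[r * strides[0] + c * strides[1]]);
        }
    }
    out
}
```

For the transposed example above, the [3, 2] view over [1, 2, 3, 4, 5, 6] with strides [1, 3] gathers to [1, 4, 2, 5, 3, 6], which is the order the old as_slice() silently got wrong.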

Migration Guide

If your code used as_slice() on indexed/transposed arrays, it was likely producing incorrect results. To fix:

// Option 1: Use contiguous()
let c = ops::contiguous(&arr)?;
let slice = c.as_slice::<f32>();

// Option 2: Reshape to force contiguity (reshaping a strided
// view materializes a row-major copy)
let n = arr.size() as i32;
let c = arr.reshape(&[n])?.reshape(arr.shape())?;
let slice = c.as_slice::<f32>();

Test Plan

  • Unit tests for is_contiguous() with various array shapes
  • Unit tests for try_as_slice() returning NotContiguous error
  • Unit tests for contiguous() function
  • Verified fix in real-world use case (image generation pipeline that was producing tiled images due to this bug)

🤖 Generated with Claude Code

minghuaw and others added 19 commits October 5, 2025 00:20
Major changes:
- Rename mlx-lm to mlx-rs-lm to avoid confusion with official Python package
- Update mlx-c submodule from v0.3.0 to v0.4.1 (MLX v0.30.1)
- Fix 128-token performance threshold bug from MLX v0.29.1

Performance optimizations:
- Add async evaluation pipelining in Generate iterator (main speedup)
- Fix KVCache dtype bug: use input dtype instead of hardcoded f32
- Add periodic mlx_clear_cache() every 256 tokens

API changes for MLX v0.30.1:
- Update SDPA to use single Array mask instead of VectorArray
- Add optional int/dtype helpers for quantization functions
- Patch Metal 3.2 for macOS Tahoe beta compatibility
- Disable NAX feature to avoid __isPlatformVersionAtLeast link error

Results: Rust now achieves ~42.3 tok/s vs Python's ~42.9 tok/s (1.4% gap)
Previously: ~40.2 tok/s vs ~42.9 tok/s (6.8% gap)
- Remove unused 'y' field from GenerateState::Decode (was cloned but never used)
- Add explicit eval() on first token to match Python behavior

Results: Rust now ~2% faster than Python (42.2 vs 41.4 tok/s)
The Rust nn::Rope implementation was reshaping input from [B, H, L, D]
to [B*H, L, D] before calling fast::rope. This caused the underlying
MLX rope kernel to only process the first "sequence" correctly when
L=1 (decode phase), producing zeros for all other attention heads.

Python's nn.RoPE does not reshape and passes input directly to
mx.fast.rope. This fix aligns Rust behavior with Python.

This resolves the Mixtral generation producing garbled output during
autoregressive decoding.
Two bugs fixed:

1. Remove duplicate eval() in try_item() (mlx-rs/src/array/mod.rs)
   - try_item() was calling eval() twice due to copy-paste error
   - This caused 15% overhead on every .item() call

2. Fix StreamOrDevice::default() to use task-local stream (mlx-rs/src/stream.rs)
   - StreamOrDevice::default() was using Stream::new() which ignores task-local streams
   - Changed to use Stream::task_local_or_default()
   - This enables proper async pipelining with with_new_default_stream()

Performance impact (Mixtral-8x7B-4bit):
- Before: ~35.6 tok/s
- After:  ~45.1 tok/s
- Improvement: ~27%

Rust now achieves parity with Python mlx-lm (~45.6 tok/s, <1% gap).
Add complete voice cloning pipeline without Python dependencies:

- Voice cloning API with zero-shot and few-shot modes
- Text processing: Chinese G2P (pypinyin), English G2P (CMU dict)
- Mixed Chinese/English text support with language segmentation
- Models: BERT, HuBERT, T2S (text-to-semantic), VITS vocoder
- Audio: WAV I/O, mel spectrogram, resampling
- Architecture documentation with SSML roadmap

Performance: ~27x faster than Python for mixed text synthesis

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Implement complete Paraformer-large (220M) model port:
  - SAN-M encoder (50 layers with FSMN memory)
  - CIF predictor for non-autoregressive length prediction
  - Bidirectional decoder (16 layers)
- Add MelFrontend for FunASR-compatible feature extraction:
  - 80-bin mel spectrogram with pre-emphasis (0.97)
  - LFR stacking (7 frames, stride 6)
  - CMVN normalization with robust multi-line parser
- Deprecate duplicate frontend in audio.rs
- Achieve 18.9x real-time on Apple Silicon (RTF 0.053)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix Mistral example to use pre-quantized models (41% faster)
- Add proper async pipelining to Mistral Generate iterator
- Add GLM4-MoE benchmark example with async pipelining
- Add comprehensive performance comparison documentation

Benchmark results vs Python mlx-lm:
- Qwen3-30B-A3B-4bit (MoE): +0.5% (98.3 vs 97.8 tok/s)
- GLM-4.5-Air-3bit (MoE): +5.8% (45.3 vs 42.8 tok/s)
- Mixtral-8x7B-4bit (MoE): -3.5% (44.5 vs 46.1 tok/s)
- Mistral-7B-4bit: -11% (74.2 vs 83.5 tok/s)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move Mistral from examples/mistral to a proper mistral-mlx crate
that shares infrastructure with other -mlx crates via mlx-lm-core.

- Add mistral-mlx crate with pre-quantized model support
- Add async pipelining in Generate iterator
- Add benchmark example achieving 82.8 tok/s (matches Python's 83.5)
- Add InvalidConfig error variant to mlx-lm-core

Performance improvement: 11% gap -> 0.8% gap vs Python mlx-lm

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updated benchmark results showing Mistral-7B-4bit achieving 82.8 tok/s
with the new mistral-mlx crate using mlx-lm-core shared infrastructure,
reducing the gap from -38% to just -0.8% vs Python mlx-lm.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement G2PW (Grapheme-to-Phoneme for Written Chinese) using ONNX
Runtime for context-aware polyphonic character pronunciation.

Key changes:
- Add g2pw.rs module with ONNX Runtime integration
- Use BERT-based model for context-aware disambiguation
- Integrate G2PW into chinese_g2p() preprocessing pipeline
- Add lingua-rs for ML-based language detection

The polyphonic character "行" is now correctly pronounced based on context:
- 几行代码 → háng (lines of code)
- 银行存款 → háng (bank)
- 行走江湖 → xíng (walk)
- 举行会议 → xíng (hold/conduct)

Dependencies:
- ort (ONNX Runtime) for model inference
- lingua for language detection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, unknown English words like "Rashbass" were spelled out
letter by letter (R-A-S-H-B-A-S-S), producing unintelligible speech.

Now uses rule-based grapheme-to-phoneme conversion that handles:
- Common digraphs: ch, sh, th, ph, ng, etc.
- Double consonants: ss, tt, ll, etc.
- Vowel patterns: ai, ea, ee, oo, ou, etc.
- Magic-e pattern: a_e, i_e, o_e
- Context-dependent consonants: c before e/i/y → S

Example: "Rashbass" → R AE1 SH B AE1 S (sounds like "RASH-BASS")

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable CoreML execution provider for the G2PW ONNX model, allowing
it to run on Apple Silicon GPU and Neural Engine (ANE) instead of CPU.

Changes:
- Add `coreml` feature to ort dependency
- Configure CoreML with ComputeUnits::All (GPU + ANE + CPU)
- Add model cache directory for faster subsequent loads
- Use NeuralNetwork format for better compatibility

This significantly improves G2PW inference performance on Apple Silicon.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add number_to_english_phonemes() for converting numbers like "2050" to
  spoken English phonemes (e.g., "twenty fifty")
- Add number_to_english_word() for number-to-word conversion
- Update english_g2p() to handle digits by accumulating and converting them
- Update segment_by_language() to include digits in English segments

This fixes an issue where English numbers in mixed Chinese-English text
(e.g., "The World in 2050") were being silently dropped during TTS.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Created MODEL_USER_GUIDE.md covering all available models:
- LLM models: Mistral, Mixtral, Qwen3, Qwen3 MoE, GLM-4, GLM-4.5 MoE
- Image generation: FLUX.2 Klein, Qwen-Image
- Speech/Audio: FunASR Paraformer (ASR), GPT-SoVITS (TTS)

Includes usage examples, recommended models, performance tips,
and API usage patterns for each model.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace custom number_to_english_word() with num2en crate, which is
the Rust equivalent of Python's inflect library used by g2p_en.

This ensures number pronunciation matches Python exactly:
- 2050 → "two thousand fifty" (was "twenty fifty")
- 2001 → "two thousand one" (was "twenty one")
- 123 → "one hundred twenty-three"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Text normalization fixes for TTS:

1. Strip leading/trailing punctuation in voice_clone.rs
   - Trailing commas caused BERT feature misalignment
   - Beginning of audio was being skipped (e.g., "从区域上" missed)
   - Now strips: , . ! ? , 。 ! ? 、 ; :

2. Improved segment_by_language() in preprocessor.rs
   - Digits are now context-dependent:
     - "126.4亿斤" stays together as Chinese
     - "Room 404" stays together as English
   - Looks ahead to determine if number is followed by CJK

3. Decimal and percentage conversion already in place:
   - "163.6" → "一百六十三点六"
   - "70%" → "百分之七十"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Performance comparison between Rust MLX and Python MPS:
- Rust is 2x faster at synthesis (4.9s vs 9.6s for 91 chars)
- Rust runs at 4x realtime, Python at 2.3x realtime
- Model loading 79x faster in Rust (49ms vs 3.9s)

Updated TTS skill with:
- Performance benchmark results and commands
- Voice options (doubao, luoxiang)
- New fixes: boundary punctuation, number segmentation, num2en

Added voice_clone.rs example with multi-voice support.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
BREAKING CHANGE: try_as_slice() now returns NotContiguous error for
non-contiguous arrays (e.g., after index() or transpose_axes()).

Previously, as_slice() would return raw memory ignoring strides, causing
silent data corruption when the array was not contiguous. This was a
significant source of bugs when working with indexed or transposed arrays.

Changes:
- Add is_contiguous() method to Array to check memory layout
- Add NotContiguous error variant to AsSliceError
- Update try_as_slice() to check contiguity before returning slice
- Add contiguous() function to make non-contiguous arrays contiguous

Migration guide:
If your code used as_slice() on indexed/transposed arrays, it was likely
producing incorrect results. To fix:

1. Call contiguous() first:
   let c = ops::contiguous(&arr)?;
   let slice = c.as_slice::<f32>();

2. Or use reshape to force contiguity:
   let n = arr.size() as i32;
   let c = arr.reshape(&[n])?.reshape(arr.shape())?;
   let slice = c.as_slice::<f32>();

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dcvz dcvz (Collaborator) commented Jan 25, 2026

@ymote this seems to do more than just what the title suggests
