Add contiguity check to try_as_slice() and contiguous() function #310

Open

ymote wants to merge 19 commits into oxiglade:main from ymote:fix/as-slice-contiguity-check

Conversation

@ymote ymote commented Jan 25, 2026

Summary

  • Add is_contiguous() method to Array to check if the array has contiguous memory layout
  • Add NotContiguous error variant to AsSliceError
  • Update try_as_slice() to check contiguity before returning slice (BREAKING CHANGE)
  • Add contiguous() function to ops module to make non-contiguous arrays contiguous

Motivation

Previously, as_slice() would return raw memory ignoring strides, causing silent data corruption when the array was not contiguous. This was a significant source of bugs when working with indexed or transposed arrays.

For example:

let arr = Array::from_slice(&[1i32, 2, 3, 4, 5, 6], &[2, 3]);
let transposed = ops::transpose(&arr)?;
let slice = transposed.as_slice::<i32>();  // WRONG DATA!

The slice would contain elements in the wrong order because transpose() creates a strided view, but as_slice() ignored the strides.

Changes

Array::is_contiguous()

Checks if the array is contiguous (row-major/C-style) by verifying strides match expected values.
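
The check itself is mechanical: a row-major (C-order) array is contiguous when each axis's stride equals the product of all trailing dimensions. A minimal standalone sketch of that logic with illustrative names (not the actual mlx-rs implementation):

```rust
/// Returns true if `strides` (in elements) describe a dense row-major
/// layout for `shape`. Size-1 axes are skipped, since their stride
/// never affects addressing. Illustrative sketch only.
fn is_contiguous(shape: &[usize], strides: &[usize]) -> bool {
    let mut expected = 1;
    // Walk axes from innermost to outermost, accumulating the
    // stride a dense layout would have at each axis.
    for (&dim, &stride) in shape.iter().rev().zip(strides.iter().rev()) {
        if dim > 1 && stride != expected {
            return false;
        }
        expected *= dim;
    }
    true
}
```

For the PR's example, a [2, 3] array has strides [3, 1] and passes; its transpose is a [3, 2] view with strides [1, 3] and fails, which is exactly the case try_as_slice() now rejects.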

AsSliceError::NotContiguous

New error variant returned when calling try_as_slice() on a non-contiguous array.

try_as_slice() update

Now checks contiguity after the array is evaluated and returns a NotContiguous error before handing out a slice of a non-contiguous array.
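
Callers now have to handle the new error. Below is a hedged sketch of the calling pattern, using mocked-up stand-ins for the mlx-rs types (the real Array and AsSliceError are richer than this; the comment marks where ops::contiguous() would go):

```rust
// Mock stand-ins; illustrative only, not the mlx-rs definitions.
#[derive(Debug, PartialEq)]
enum AsSliceError {
    NotContiguous,
}

struct Array {
    data: Vec<f32>,
    contiguous: bool, // stands in for the real stride check
}

impl Array {
    fn try_as_slice(&self) -> Result<&[f32], AsSliceError> {
        if self.contiguous {
            Ok(&self.data)
        } else {
            Err(AsSliceError::NotContiguous)
        }
    }
}

/// Post-PR calling pattern: take the zero-copy path when possible,
/// and only fall back to a copy when the layout forces it.
fn read_values(arr: &Array) -> Result<Vec<f32>, AsSliceError> {
    match arr.try_as_slice() {
        Ok(slice) => Ok(slice.to_vec()),
        Err(AsSliceError::NotContiguous) => {
            // Real code would call ops::contiguous(arr) here and
            // read the slice from the dense copy instead.
            Err(AsSliceError::NotContiguous)
        }
    }
}
```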

ops::contiguous()

New function that returns a contiguous copy of the array if needed.
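
Conceptually, it gathers elements through the strides into a fresh row-major buffer (and can hand back the input unchanged when it is already contiguous). A standalone 2-D sketch of that gather, with illustrative names:

```rust
/// Copies a 2-D strided view over `data` into a dense row-major Vec.
/// Strides are in elements. Illustrative sketch; the real function
/// handles arbitrary rank and dtypes.
fn to_contiguous(data: &[i32], shape: [usize; 2], strides: [usize; 2]) -> Vec<i32> {
    let mut out = Vec::with_capacity(shape[0] * shape[1]);
    for r in 0..shape[0] {
        for c in 0..shape[1] {
            // Strided address: row stride plus column stride.
            out.push(data[r * strides[0] + c * strides[1]]);
        }
    }
    out
}
```

For the transposed example above, the [3, 2] view over [1, 2, 3, 4, 5, 6] with strides [1, 3] gathers to [1, 4, 2, 5, 3, 6], which is the order the old as_slice() silently got wrong.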

Migration Guide

If your code used as_slice() on indexed/transposed arrays, it was likely producing incorrect results. To fix:

// Option 1: Use contiguous()
let c = ops::contiguous(&arr)?;
let slice = c.as_slice::<f32>();

// Option 2: Reshape to force contiguity (reshaping a strided
// view materializes a row-major copy)
let n = arr.size() as i32;
let c = arr.reshape(&[n])?.reshape(arr.shape())?;
let slice = c.as_slice::<f32>();

Test Plan

  • Unit tests for is_contiguous() with various array shapes
  • Unit tests for try_as_slice() returning NotContiguous error
  • Unit tests for contiguous() function
  • Verified fix in real-world use case (image generation pipeline that was producing tiled images due to this bug)

🤖 Generated with Claude Code

minghuaw and others added 19 commits October 5, 2025 00:20
Major changes:
- Rename mlx-lm to mlx-rs-lm to avoid confusion with official Python package
- Update mlx-c submodule from v0.3.0 to v0.4.1 (MLX v0.30.1)
- Fix 128-token performance threshold bug from MLX v0.29.1

Performance optimizations:
- Add async evaluation pipelining in Generate iterator (main speedup)
- Fix KVCache dtype bug: use input dtype instead of hardcoded f32
- Add periodic mlx_clear_cache() every 256 tokens

API changes for MLX v0.30.1:
- Update SDPA to use single Array mask instead of VectorArray
- Add optional int/dtype helpers for quantization functions
- Patch Metal 3.2 for macOS Tahoe beta compatibility
- Disable NAX feature to avoid __isPlatformVersionAtLeast link error

Results: Rust now achieves ~42.3 tok/s vs Python's ~42.9 tok/s (1.4% gap)
Previously: ~40.2 tok/s vs ~42.9 tok/s (6.8% gap)
- Remove unused 'y' field from GenerateState::Decode (was cloned but never used)
- Add explicit eval() on first token to match Python behavior

Results: Rust now ~2% faster than Python (42.2 vs 41.4 tok/s)
The Rust nn::Rope implementation was reshaping input from [B, H, L, D]
to [B*H, L, D] before calling fast::rope. This caused the underlying
MLX rope kernel to only process the first "sequence" correctly when
L=1 (decode phase), producing zeros for all other attention heads.

Python's nn.RoPE does not reshape and passes input directly to
mx.fast.rope. This fix aligns Rust behavior with Python.

This resolves the Mixtral generation producing garbled output during
autoregressive decoding.
Two bugs fixed:

1. Remove duplicate eval() in try_item() (mlx-rs/src/array/mod.rs)
   - try_item() was calling eval() twice due to copy-paste error
   - This caused 15% overhead on every .item() call

2. Fix StreamOrDevice::default() to use task-local stream (mlx-rs/src/stream.rs)
   - StreamOrDevice::default() was using Stream::new() which ignores task-local streams
   - Changed to use Stream::task_local_or_default()
   - This enables proper async pipelining with with_new_default_stream()

Performance impact (Mixtral-8x7B-4bit):
- Before: ~35.6 tok/s
- After:  ~45.1 tok/s
- Improvement: ~27%

Rust now achieves parity with Python mlx-lm (~45.6 tok/s, <1% gap).
Add complete voice cloning pipeline without Python dependencies:

- Voice cloning API with zero-shot and few-shot modes
- Text processing: Chinese G2P (pypinyin), English G2P (CMU dict)
- Mixed Chinese/English text support with language segmentation
- Models: BERT, HuBERT, T2S (text-to-semantic), VITS vocoder
- Audio: WAV I/O, mel spectrogram, resampling
- Architecture documentation with SSML roadmap

Performance: ~27x faster than Python for mixed text synthesis

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Implement complete Paraformer-large (220M) model port:
  - SAN-M encoder (50 layers with FSMN memory)
  - CIF predictor for non-autoregressive length prediction
  - Bidirectional decoder (16 layers)
- Add MelFrontend for FunASR-compatible feature extraction:
  - 80-bin mel spectrogram with pre-emphasis (0.97)
  - LFR stacking (7 frames, stride 6)
  - CMVN normalization with robust multi-line parser
- Deprecate duplicate frontend in audio.rs
- Achieve 18.9x real-time on Apple Silicon (RTF 0.053)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix Mistral example to use pre-quantized models (41% faster)
- Add proper async pipelining to Mistral Generate iterator
- Add GLM4-MoE benchmark example with async pipelining
- Add comprehensive performance comparison documentation

Benchmark results vs Python mlx-lm:
- Qwen3-30B-A3B-4bit (MoE): +0.5% (98.3 vs 97.8 tok/s)
- GLM-4.5-Air-3bit (MoE): +5.8% (45.3 vs 42.8 tok/s)
- Mixtral-8x7B-4bit (MoE): -3.5% (44.5 vs 46.1 tok/s)
- Mistral-7B-4bit: -11% (74.2 vs 83.5 tok/s)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move Mistral from examples/mistral to a proper mistral-mlx crate
that shares infrastructure with other -mlx crates via mlx-lm-core.

- Add mistral-mlx crate with pre-quantized model support
- Add async pipelining in Generate iterator
- Add benchmark example achieving 82.8 tok/s (matches Python's 83.5)
- Add InvalidConfig error variant to mlx-lm-core

Performance improvement: 11% gap -> 0.8% gap vs Python mlx-lm

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updated benchmark results showing Mistral-7B-4bit achieving 82.8 tok/s
with the new mistral-mlx crate using mlx-lm-core shared infrastructure,
reducing the gap from -38% to just -0.8% vs Python mlx-lm.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement G2PW (Grapheme-to-Phoneme for Written Chinese) using ONNX
Runtime for context-aware polyphonic character pronunciation.

Key changes:
- Add g2pw.rs module with ONNX Runtime integration
- Use BERT-based model for context-aware disambiguation
- Integrate G2PW into chinese_g2p() preprocessing pipeline
- Add lingua-rs for ML-based language detection

The polyphonic character "行" is now correctly pronounced based on context:
- 几行代码 → háng (lines of code)
- 银行存款 → háng (bank)
- 行走江湖 → xíng (walk)
- 举行会议 → xíng (hold/conduct)

Dependencies:
- ort (ONNX Runtime) for model inference
- lingua for language detection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, unknown English words like "Rashbass" were spelled out
letter by letter (R-A-S-H-B-A-S-S), producing unintelligible speech.

Now uses rule-based grapheme-to-phoneme conversion that handles:
- Common digraphs: ch, sh, th, ph, ng, etc.
- Double consonants: ss, tt, ll, etc.
- Vowel patterns: ai, ea, ee, oo, ou, etc.
- Magic-e pattern: a_e, i_e, o_e
- Context-dependent consonants: c before e/i/y → S

Example: "Rashbass" → R AE1 SH B AE1 S (sounds like "RASH-BASS")

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Enable CoreML execution provider for the G2PW ONNX model, allowing
it to run on Apple Silicon GPU and Neural Engine (ANE) instead of CPU.

Changes:
- Add `coreml` feature to ort dependency
- Configure CoreML with ComputeUnits::All (GPU + ANE + CPU)
- Add model cache directory for faster subsequent loads
- Use NeuralNetwork format for better compatibility

This significantly improves G2PW inference performance on Apple Silicon.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add number_to_english_phonemes() for converting numbers like "2050" to
  spoken English phonemes (e.g., "twenty fifty")
- Add number_to_english_word() for number-to-word conversion
- Update english_g2p() to handle digits by accumulating and converting them
- Update segment_by_language() to include digits in English segments

This fixes an issue where English numbers in mixed Chinese-English text
(e.g., "The World in 2050") were being silently dropped during TTS.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Created MODEL_USER_GUIDE.md covering all available models:
- LLM models: Mistral, Mixtral, Qwen3, Qwen3 MoE, GLM-4, GLM-4.5 MoE
- Image generation: FLUX.2 Klein, Qwen-Image
- Speech/Audio: FunASR Paraformer (ASR), GPT-SoVITS (TTS)

Includes usage examples, recommended models, performance tips,
and API usage patterns for each model.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace custom number_to_english_word() with num2en crate, which is
the Rust equivalent of Python's inflect library used by g2p_en.

This ensures number pronunciation matches Python exactly:
- 2050 → "two thousand fifty" (was "twenty fifty")
- 2001 → "two thousand one" (was "twenty one")
- 123 → "one hundred twenty-three"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Text normalization fixes for TTS:

1. Strip leading/trailing punctuation in voice_clone.rs
   - Trailing commas caused BERT feature misalignment
   - Beginning of audio was being skipped (e.g., "从区域上" missed)
   - Now strips: , . ! ? , 。 ! ? 、 ; :

2. Improved segment_by_language() in preprocessor.rs
   - Digits are now context-dependent:
     - "126.4亿斤" stays together as Chinese
     - "Room 404" stays together as English
   - Looks ahead to determine if number is followed by CJK

3. Decimal and percentage conversion already in place:
   - "163.6" → "一百六十三点六"
   - "70%" → "百分之七十"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Performance comparison between Rust MLX and Python MPS:
- Rust is 2x faster at synthesis (4.9s vs 9.6s for 91 chars)
- Rust runs at 4x realtime, Python at 2.3x realtime
- Model loading 79x faster in Rust (49ms vs 3.9s)

Updated TTS skill with:
- Performance benchmark results and commands
- Voice options (doubao, luoxiang)
- New fixes: boundary punctuation, number segmentation, num2en

Added voice_clone.rs example with multi-voice support.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
BREAKING CHANGE: try_as_slice() now returns NotContiguous error for
non-contiguous arrays (e.g., after index() or transpose_axes()).

Previously, as_slice() would return raw memory ignoring strides, causing
silent data corruption when the array was not contiguous. This was a
significant source of bugs when working with indexed or transposed arrays.

Changes:
- Add is_contiguous() method to Array to check memory layout
- Add NotContiguous error variant to AsSliceError
- Update try_as_slice() to check contiguity before returning slice
- Add contiguous() function to make non-contiguous arrays contiguous

Migration guide:
If your code used as_slice() on indexed/transposed arrays, it was likely
producing incorrect results. To fix:

1. Call contiguous() first:
   let c = ops::contiguous(&arr)?;
   let slice = c.as_slice::<f32>();

2. Or use reshape to force contiguity:
   let n = arr.size() as i32;
   let c = arr.reshape(&[n])?.reshape(arr.shape())?;
   let slice = c.as_slice::<f32>();

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dcvz dcvz (Collaborator) commented Jan 25, 2026

@ymote this seems to do more than just what the title suggests
