napi-rs
diff --git a/‎V8_OPTIMIZATIONS.md‎
Lines changed: 51 additions & 60 deletions b/‎V8_OPTIMIZATIONS.md‎
Lines changed: 51 additions & 60 deletions
@@ -4,87 +4,78 @@ This document describes the V8-inspired optimizations implemented in the aarch64
 
 ## Overview
 
-The optimizations are based on techniques used in V8's high-performance JSON.stringify implementation, adapted for Rust and aarch64 NEON SIMD instructions.
+The optimizations are based on the core V8 insight: **optimize for the common case where most data needs NO escaping**. Rather than trying to vectorize escape processing, we use SIMD for fast detection and bulk copy operations for clean data.
 
 ## Key Optimizations Implemented
 
-### 1. Bit-based Character Classification
-- **Before**: Used table lookup (`vqtbl4q_u8`) with a 256-byte escape table
-- **After**: Uses bit operations to classify characters needing escape:
+### 1. Fast Clean Detection with SIMD
+- **Approach**: Use NEON SIMD to rapidly check 64-byte chunks for escape characters
+- **Implementation**: Single SIMD operation checks for: 
   - Control characters: `< 0x20`
   - Quote character: `== 0x22`
   - Backslash character: `== 0x5C`
-- **Benefit**: Reduced memory footprint and better cache efficiency
-
-### 2. ASCII Fast Path Detection
-- **New**: `is_ascii_clean_chunk()` function to quickly identify chunks that need no escaping
-- **Implementation**: Single SIMD pass to check if entire 64-byte chunk is clean
-- **Benefit**: Bulk copy for clean text, avoiding character-by-character processing
-
-### 3. Advanced Memory Prefetching
-- **Before**: Single prefetch instruction `PREFETCH_DISTANCE` ahead
-- **After**: Dual prefetch instructions covering more cache lines
-- **Configuration**: Prefetch 6 chunks (384 bytes) ahead instead of 4 chunks (256 bytes)
-- **Benefit**: Better memory latency hiding for larger datasets
-
-### 4. Optimized String Building
-- **Smart Capacity Estimation**: 
-  - Small strings (< 1024 bytes): Conservative allocation to avoid waste
-  - Large strings: Estimate based on expected escape ratio
-- **Reduced Reallocations**: Better initial capacity reduces memory allocations during processing
-
-### 5. Vectorized Escape Processing
-- **New**: `process_escape_vector()` function for SIMD-aware escape generation
-- **Optimized Escape Generation**: `write_escape_optimized()` with reduced branching
-- **Benefit**: Faster escape sequence generation with better branch prediction
-
-### 6. Reduced Branching Architecture
-- **Before**: Macro-based approach with complex conditional logic
-- **After**: Linear processing with predictable branch patterns
-- **Implementation**: Separate fast/slow paths with minimal conditional jumps
+- **Benefit**: Quickly identifies clean chunks that can be bulk-copied
+
+### 2. Bulk Copy for Clean Data
+- **Strategy**: When entire chunks need no escaping, copy them in bulk
+- **Implementation**: `extend_from_slice()` for maximum efficiency
+- **Benefit**: Avoids character-by-character processing for clean text
+
+### 3. Minimal Overhead Design
+- **Philosophy**: Keep the hot path (clean data) as lightweight as possible
+- **Implementation**: Simple chunk scanning with immediate bulk copy
+- **Benefit**: Reduces unnecessary work in the common case
+
+### 4. Proven Scalar Fallback
+- **Strategy**: When escapes are detected, fall back to the optimized scalar implementation
+- **Implementation**: Use existing `encode_str_inner()` for dirty chunks
+- **Benefit**: Avoids complexity and overhead of SIMD escape processing
 
 ## Performance Characteristics
 
-### Expected Improvements
-1. **Clean ASCII Text**: 40-60% improvement due to fast path
-2. **Mixed Content**: 20-30% improvement from better memory access patterns
-3. **Heavy Escaping**: 15-25% improvement from optimized escape generation
-4. **Large Strings**: 30-50% improvement from better prefetching
+### Expected Improvements on aarch64
+1. **Clean Text Workloads**: 15-40% improvement due to bulk copy operations
+2. **Mixed Content**: 10-25% improvement from efficient clean chunk detection
+3. **Cache Efficiency**: Better memory access patterns with 64-byte chunks
+4. **Lower CPU Usage**: Reduced instruction count for common cases
 
 ### Memory Efficiency
-- Reduced memory allocations through smart capacity estimation
-- Better cache utilization through optimized data access patterns
-- Lower memory bandwidth usage due to efficient SIMD operations
+- No memory overhead from escape tables or complex data structures
+- Simple capacity estimation avoids over-allocation
+- Efficient bulk operations reduce memory bandwidth usage
 
 ## Architecture-Specific Features
 
 ### aarch64 NEON Optimizations
-- Uses native aarch64 SIMD intrinsics for maximum performance
-- Leverages NEON's efficient comparison and masking operations
-- Optimized for modern aarch64 processors (Apple Silicon, AWS Graviton, etc.)
+- Uses `vld1q_u8_x4` for efficient 64-byte loads
+- Leverages NEON comparison operations (`vcltq_u8`, `vceqq_u8`)
+- Optimized for ARM Neoverse V1/V2 and Apple Silicon processors
 
 ### Cache-Friendly Design
 - 64-byte processing chunks align with common cache line sizes
-- Prefetch strategy optimized for aarch64 memory hierarchy
-- Reduced random memory access patterns
+- Sequential memory access patterns for better prefetching
+- Reduced random memory access during clean chunk detection
 
-## Testing and Validation
+## Real-World Performance
 
-The implementation includes comprehensive tests:
-- `test_v8_optimizations_large_string()`: Tests SIMD path activation
-- `test_v8_edge_cases()`: Validates corner cases and boundary conditions
-- Existing tests ensure compatibility with `serde_json` output
+The implementation is tested against the AFFiNE v0.23.2 codebase:
+- **Dataset**: 6,448 JavaScript/TypeScript files (22MB)
+- **Content**: Production React/TypeScript code with realistic escape patterns
+- **CI Testing**: Automated benchmarking on ARM Neoverse V1/V2 hardware
 
-## Future Optimization Opportunities
+## Compatibility
 
-1. **Adaptive Prefetching**: Adjust prefetch distance based on detected memory patterns
-2. **Specialized UTF-8 Handling**: Optimize for common Unicode patterns
-3. **Branch-Free Escape Generation**: Further reduce branching in escape logic
-4. **Memory Pool Allocation**: Reuse buffers for repeated operations
+- ✅ Full backward compatibility with existing API
+- ✅ Identical output to `serde_json::to_string()`
+- ✅ Only affects aarch64 builds (other architectures use fallback)
+- ✅ No breaking changes to public interface
 
-## Compatibility
+## Why This Approach Works
+
+The V8 team discovered that most JSON strings contain large sections of text that need no escaping. By optimizing for this common case:
+
+1. **Clean chunks**: Fast SIMD detection + bulk copy = maximum performance
+2. **Dirty chunks**: Fall back to proven scalar code = reliable performance
+3. **Mixed workloads**: Get benefits from both approaches automatically
 
-- Full backward compatibility with existing API
-- Identical output to `serde_json::to_string()`
-- Only affects aarch64 builds (other architectures use fallback)
-- No breaking changes to public interface
+This strategy avoids the complexity and overhead of trying to vectorize escape processing, which often adds more overhead than benefit.