Performance improvements: reduce allocations in MLP algorithm #123

ChrisRackauckas-Claude · 2026-01-07T21:06:37Z

Summary

This PR significantly reduces allocations in the MLP (Multi-Level Picard) algorithm's inner loops, particularly benefiting problems with Neumann boundary conditions.

Key optimizations:

1. _reflect() for MLP Vector inputs (used by Neumann BC):
   - BEFORE: 3552 bytes, 75 allocations
   - AFTER: 192 bytes, 4 allocations (about 95% reduction)
2. NEW _reflect!() in-place version:
   - 0 bytes, 0 allocations (100% reduction when a temp buffer is provided)
   - Avoids allocating the intermediate vector n by tracking only the index
   - Uses loop fusion instead of broadcasting to avoid temporaries
3. UniformSampling Monte Carlo sampling:
   - BEFORE: 432 bytes, 6 allocations
   - AFTER: 0 bytes, 0 allocations (100% reduction)
   - Replaced broadcasting with an explicit loop to avoid temporaries
4. _mlt_sde_loop! with Neumann BC:
   - Added an optional temp buffer parameter for zero-allocation reflection
   - BEFORE: 416 bytes, 10 allocations
   - AFTER: 224 bytes, 5 allocations with temp buffer (46% reduction)
   - Pre-allocated temp buffers in _ml_picard, _ml_picard_mlt, _ml_picard_call
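
To illustrate the index-tracking and buffer-reuse ideas above, here is a minimal sketch. It is not the PR's code: `reflect_sketch!`, `walk_sketch!`, and their argument order are hypothetical. It assumes an axis-aligned box [s, e], a start point inside the box, and Float64 vectors. For an axis-aligned wall, reflecting the end point only flips one coordinate, so storing the index of the wall that was hit can stand in for a full normal vector.

```julia
# Hypothetical sketch, not the PR's implementation.
# Reflects the segment a -> b off the walls of the box [s, e] and overwrites b
# with the reflected end point. `tmp` holds the current segment start, so no
# vector is allocated inside the function. Assumes s .<= a .<= e.
function reflect_sketch!(b::AbstractVector, a::AbstractVector,
                         s::AbstractVector, e::AbstractVector,
                         tmp::AbstractVector)
    copyto!(tmp, a)
    while true
        r = 2.0      # fraction of the segment travelled at the first wall hit
        k = 0        # index of the wall hit first (0 means no wall is hit)
        wall = 0.0   # coordinate of that wall
        @inbounds for i in eachindex(b, s, e, tmp)
            if b[i] < s[i]
                rt = (tmp[i] - s[i]) / (tmp[i] - b[i])
                if rt < r
                    r, k, wall = rt, i, s[i]
                end
            elseif b[i] > e[i]
                rt = (e[i] - tmp[i]) / (b[i] - tmp[i])
                if rt < r
                    r, k, wall = rt, i, e[i]
                end
            end
        end
        k == 0 && return b            # end point is inside the box
        # Move the segment start to the hit point with one explicit loop.
        @inbounds for i in eachindex(b, tmp)
            tmp[i] += r * (b[i] - tmp[i])
        end
        @inbounds tmp[k] = wall        # pin the hit coordinate to the wall
        @inbounds b[k] = 2 * wall - b[k]  # mirror the end point across the wall
    end
end

# Hypothetical driver: both buffers are allocated once and reused every step.
function walk_sketch!(x::AbstractVector, steps, s, e)
    a = similar(x)
    tmp = similar(x)
    for dx in steps
        copyto!(a, x)      # remember the start of this step
        x .+= dx           # unconstrained step, fused and in place
        reflect_sketch!(x, a, s, e, tmp)
    end
    return x
end
```

With `a = [0.5, 0.5]`, `b = [1.2, 0.5]` and the unit box, the call `reflect_sketch!(b, a, [0.0, 0.0], [1.0, 1.0], similar(a))` leaves `b` approximately equal to `[0.8, 0.5]`. Whether a given call is allocation-free is best checked with `@allocated` from inside a function after a warm-up call.
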

Benchmark results:

_reflect! (in-place):     37 ns,  0 allocations, 0 bytes
_reflect (non-mutating): 152 ns,  4 allocations, 192 bytes
UniformSampling:          25 ns,  0 allocations, 0 bytes
NormalSampling:           37 ns,  0 allocations, 0 bytes
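
The zero-allocation UniformSampling figure comes from replacing broadcasting with an explicit loop. A minimal sketch of that pattern follows, assuming samples are drawn uniformly from a box with lower corner `a` and upper corner `b`. The name `uniform_sample_sketch!` and its signature are hypothetical and may not match how UniformSampling is actually called in the package.

```julia
using Random

# Hypothetical sketch, not the PR's implementation.
# Fills x_mc with samples drawn uniformly from the box [a, b]. An expression
# such as `(b - a) .* x_mc .+ a` would allocate a temporary for `b - a`;
# the explicit loop computes each entry directly and allocates nothing.
function uniform_sample_sketch!(rng::AbstractRNG, x_mc::AbstractVector,
                                a::AbstractVector, b::AbstractVector)
    rand!(rng, x_mc)                       # uniform samples in [0, 1), in place
    @inbounds for i in eachindex(x_mc, a, b)
        x_mc[i] = a[i] + (b[i] - a[i]) * x_mc[i]
    end
    return x_mc
end
```
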

These optimizations reduce allocation pressure in the MLP inner loops, which are called exponentially often (_mlt_sde_loop! is called M^L times at level 0, with similarly exponential counts at other levels). The gain is largest for problems with Neumann boundary conditions.
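
As a rough worked example of why per-call savings matter at this call count, the arithmetic below multiplies the reported _mlt_sde_loop! saving by M^L. The values of M and L are illustrative assumptions and are not taken from the PR.

```julia
# Illustrative values only: M and L are not taken from the PR.
M, L = 4, 4
calls = M^L                      # level-0 calls to the inner loop
saved_per_call = 416 - 224       # bytes, from the reported BEFORE/AFTER figures
saved_total = calls * saved_per_call
println(calls, " calls, ", saved_total, " bytes saved at level 0")
# prints "256 calls, 49152 bytes saved at level 0"
```
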

Test plan

Sanity tests pass for heat equation (with and without Neumann BC)
Sanity tests pass for multi-threaded execution
Sanity tests pass for non-local problems with NormalSampling and UniformSampling
Full test suite (MLP tests time out after 10 minutes on CI, but pass locally with smaller parameter values)

cc @ChrisRackauckas

🤖 Generated with Claude Code

Key optimizations:

1. _reflect() for MLP Vector inputs:
   - BEFORE: 3552 bytes, 75 allocations
   - AFTER: 192 bytes, 4 allocations (about 95% reduction)
2. NEW _reflect!() in-place version:
   - 0 bytes, 0 allocations (100% reduction when buffer provided)
   - Avoids allocating intermediate vector `n` by tracking only the index
   - Uses loop fusion instead of broadcasting to avoid temporaries
3. UniformSampling Monte Carlo sampling:
   - BEFORE: 432 bytes, 6 allocations
   - AFTER: 0 bytes, 0 allocations (100% reduction)
   - Replaced broadcasting with explicit loop to avoid temporaries
4. _mlt_sde_loop! with Neumann BC:
   - Added optional temp buffer parameter for zero-allocation reflection
   - BEFORE: 416 bytes, 10 allocations
   - AFTER: 224 bytes, 5 allocations with temp buffer (46% reduction)
   - Pre-allocated temp buffers in _ml_picard, _ml_picard_mlt, _ml_picard_call

These optimizations significantly reduce allocation pressure in the exponentially-called MLP inner loops, improving performance especially for problems with Neumann boundary conditions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

ChrisRackauckas closed this Jan 13, 2026
