@ChrisRackauckas-Claude
Contributor

Summary

This PR significantly reduces allocations in the MLP (Multi-Level Picard) algorithm's inner loops, particularly benefiting problems with Neumann boundary conditions.

Key optimizations:

  1. _reflect() for MLP Vector inputs (used by Neumann BC):

    • BEFORE: 3552 bytes, 75 allocations
    • AFTER: 192 bytes, 4 allocations (about 95% reduction)
  2. NEW _reflect!() in-place version:

    • 0 bytes, 0 allocations (100% reduction when temp buffer provided)
    • Avoids allocating intermediate vector n by tracking only the index
    • Uses loop fusion instead of broadcasting to avoid temporaries
  3. UniformSampling Monte Carlo sampling:

    • BEFORE: 432 bytes, 6 allocations
    • AFTER: 0 bytes, 0 allocations (100% reduction)
    • Replaced broadcasting with explicit loop to avoid temporaries
  4. _mlt_sde_loop! with Neumann BC:

    • Added optional temp buffer parameter for zero-allocation reflection
    • BEFORE: 416 bytes, 10 allocations
    • AFTER: 224 bytes, 5 allocations with temp buffer (46% fewer bytes, 50% fewer allocations)
    • Pre-allocated temp buffers in _ml_picard, _ml_picard_mlt, _ml_picard_call
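The in-place pattern described in items 1 and 2 can be sketched as follows. This is a simplified stand-in, not the PR's `_reflect!`: the names `reflect_into_box!` and `reflect_into_box` are hypothetical, and the sketch folds each coordinate back into the box independently instead of tracking which face is hit first. It assumes `s[i] < e[i]` for every coordinate.

```julia
# Hypothetical sketch: reflect the point `b` back into the box [s, e],
# writing the result into a caller-supplied buffer `out`.
# Assumes s[i] < e[i]; otherwise the inner loop would not terminate.
function reflect_into_box!(out, b, s, e)
    @inbounds for i in eachindex(out, b, s, e)
        x = b[i]
        # Mirror across the violated face until the coordinate is inside.
        while x < s[i] || x > e[i]
            x = x < s[i] ? 2 * s[i] - x : 2 * e[i] - x
        end
        out[i] = x
    end
    return out
end

# Non-mutating wrapper: its only allocation is the output vector.
reflect_into_box(b, s, e) = reflect_into_box!(similar(b), b, s, e)

s, e = zeros(3), ones(3)
b = [1.5, -0.25, 0.3]
buf = similar(b)              # allocated once, reused across calls
reflect_into_box!(buf, b, s, e)
```

Because the loop writes scalars directly into `buf`, no broadcast temporaries are created, which is the same idea the PR applies by passing a pre-allocated temp buffer down from the callers.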

Benchmark results:

_reflect! (in-place):     37 ns,  0 allocations, 0 bytes
_reflect (non-mutating): 152 ns,  4 allocations, 192 bytes
UniformSampling:          25 ns,  0 allocations, 0 bytes
NormalSampling:           37 ns,  0 allocations, 0 bytes
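Allocation figures like these can be checked with Base's `@allocated` macro. The sketch below is an assumption about how one might measure them, not the benchmark script used for this PR: `sample_broadcast!`, `sample_loop!`, and `allocated_bytes` are hypothetical names, and the samplers are stand-ins for a box-uniform draw. It uses `Xoshiro`, so it assumes Julia 1.7 or later.

```julia
using Random

# Hypothetical stand-in: the broadcast form creates a temporary vector
# through `rand(rng, n)` on every call.
function sample_broadcast!(out, rng, lo, hi)
    out .= lo .+ (hi .- lo) .* rand(rng, length(out))
    return out
end

# Hypothetical stand-in: the loop form draws one scalar at a time and
# writes it straight into `out`.
function sample_loop!(out, rng, lo, hi)
    @inbounds for i in eachindex(out, lo, hi)
        out[i] = lo[i] + (hi[i] - lo[i]) * rand(rng)
    end
    return out
end

# Measure inside a function and after a warm-up call, so that compilation
# and global-variable overhead are not counted.
function allocated_bytes(f!, out, rng, lo, hi)
    f!(out, rng, lo, hi)
    return @allocated f!(out, rng, lo, hi)
end

rng = Xoshiro(42)
lo, hi = fill(-1.0, 10), fill(1.0, 10)
out = similar(lo)
bytes_broadcast = allocated_bytes(sample_broadcast!, out, rng, lo, hi)
bytes_loop = allocated_bytes(sample_loop!, out, rng, lo, hi)
println("broadcast: ", bytes_broadcast, " bytes, loop: ", bytes_loop, " bytes")
```

Timings such as the nanosecond figures above need a dedicated benchmarking harness; this sketch only covers the allocation side.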

The MLP inner loops run an exponential number of times: _mlt_sde_loop! is called M^L times at level 0, with similarly exponential counts at the other levels. Cutting per-call allocations there reduces overall allocation pressure, and the gain is largest for problems with Neumann boundary conditions.
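To make the scaling concrete, here is a back-of-the-envelope calculation. The values of `M` and `L` are hypothetical, picked only for illustration; the per-call byte counts are the _mlt_sde_loop! figures from the summary, and only the level-0 calls are counted.

```julia
# Hypothetical parameters, chosen only for illustration.
M, L = 4, 4

# Level-0 call count as stated in the summary.
calls = M^L

# Per-call allocation figures for _mlt_sde_loop! with Neumann BC.
bytes_before, bytes_after = 416, 224

# Bytes no longer allocated across all level-0 calls.
saved = (bytes_before - bytes_after) * calls
println(calls, " calls, ", saved, " bytes saved")
```

Since the call count grows as M^L, the same per-call saving is multiplied by a rapidly growing factor as either parameter increases.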

Test plan

  • Sanity tests pass for heat equation (with and without Neumann BC)
  • Sanity tests pass for multi-threaded execution
  • Sanity tests pass for non-local problems with NormalSampling and UniformSampling
  • Full test suite: MLP tests time out after 10 minutes on CI, but pass locally with smaller parameter values

cc @ChrisRackauckas

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>