
Performance Guide

Optimization strategies and benchmarks for Atoms_Rust.


Performance Targets

| Binary | Particles | Target FPS | Actual FPS* | Memory |
|-----------|------------|------------|-------------|---------|
| realtime | 100,000 | 60 | 60-80 | ~200 MB |
| realtime | 500,000 | 15 | 12-18 | ~500 MB |
| realtime | 1,000,000 | 8 | 6-10 | ~900 MB |
| raytracer | 10,000 | 30 | 30-45 | ~50 MB |
| raytracer | 50,000 | 8 | 8-12 | ~150 MB |
| atom_2d | 20 atoms | 60 | 60+ | ~20 MB |
| wave_2d | 1 electron | 60 | 60+ | ~10 MB |

*On a modern desktop (Ryzen 7, RTX 3060, 16GB RAM)


Performance Architecture

flowchart TD
    subgraph "CPU Bound"
        P1[Particle Updates<br/>Parallel with rayon]
        P2[CDF Sampling<br/>Binary search, O(log N)]
        P3[Color Calculation<br/>Polynomial evaluation]
    end

    subgraph "GPU Bound"
        G1[Draw Calls<br/>One per particle]
        G2[Vertex Processing<br/>Sphere geometry]
        G3[Fragment Shading<br/>Lighting calculation]
    end

    subgraph "Memory Bound"
        M1[Particle Data<br/>48 bytes each]
        M2[CDF Tables<br/>~50KB per n,l combo]
        M3[Cache Efficiency<br/>Sequential access]
    end

    P1 --> PERF[Frame Time]
    P2 --> PERF
    P3 --> PERF
    G1 --> PERF
    G2 --> PERF
    G3 --> PERF
    M1 --> PERF
    M2 --> PERF
    M3 --> PERF
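The O(log N) CDF sampling in the diagram is inverse-transform sampling: draw a uniform variate, then binary-search the cumulative table for the first entry that reaches it. A minimal sketch using `partition_point` (the grid and CDF values here are illustrative, not the crate's actual tables):

```rust
// Inverse-transform sampling: draw u ~ U(0,1), then binary-search the CDF.
fn sample_from_cdf(r_grid: &[f64], cdf: &[f64], u: f64) -> f64 {
    // partition_point is a binary search over the sorted CDF: O(log N) per sample.
    let idx = cdf.partition_point(|&c| c < u).min(r_grid.len() - 1);
    r_grid[idx]
}

fn main() {
    // Illustrative uniform CDF over a 100-point radial grid.
    let r_grid: Vec<f64> = (0..100).map(|i| i as f64 * 0.1).collect();
    let cdf: Vec<f64> = (0..100).map(|i| (i + 1) as f64 / 100.0).collect();
    let r = sample_from_cdf(&r_grid, &cdf, 0.5);
    assert!((r - 4.9).abs() < 1e-9);
}
```

Because the CDF is monotone, the lookup cost stays logarithmic no matter how fine the radial grid is.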

Frame Time Breakdown

pie title Typical Frame Time (100k particles, 60 FPS)
    "GPU Draw Calls" : 55
    "Particle Position Updates" : 25
    "CDF Sampling" : 8
    "Color Calculation" : 7
    "Input & Camera" : 5

Optimization Techniques

1. Parallel Processing

Particle updates use rayon for parallel processing:

// Parallel iteration across all CPU cores
particles.par_iter_mut().for_each(|p| {
    // Calculate probability flow (independent per particle)
    p.vel = calculate_probability_flow(p.pos, n, l, m);

    // Update position
    let temp_pos = p.pos + p.vel * dt;
    let new_phi = temp_pos.z.atan2(temp_pos.x);
    p.pos = spherical_to_cartesian(p.r, p.theta, new_phi);
});

Performance gain: ~4x on a 4-core CPU, ~8x on an 8-core CPU

2. CDF Caching

The Sampler caches CDF tables between frames:

sequenceDiagram
    participant Code
    participant Sampler
    participant Cache

    Note over Code,Cache: First call for (n=2, l=1, m=0)
    Code->>Sampler: prepare(2, 1, 0)
    Sampler->>Sampler: Build CDF tables
    Sampler->>Cache: Store tables

    Note over Code,Cache: Subsequent calls
    Code->>Sampler: prepare(2, 1, 0)
    Sampler->>Cache: Check cache
    Cache-->>Sampler: Cache hit!
    Note right of Sampler: Skip rebuild

Performance gain: skips rebuilding the ~6,000-entry CDF tables every frame while the quantum numbers are unchanged
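The caching flow above can be sketched with a `HashMap` keyed by the quantum numbers. This is an illustrative shape, not the crate's actual `Sampler` API, and `build_cdf_table` is a placeholder for the real integration:

```rust
use std::collections::HashMap;

// Illustrative CDF table: cumulative probability at each radial grid point.
struct CdfTable {
    r_grid: Vec<f64>,
    cdf: Vec<f64>,
}

struct Sampler {
    cache: HashMap<(i32, i32, i32), CdfTable>,
}

impl Sampler {
    fn new() -> Self {
        Sampler { cache: HashMap::new() }
    }

    // Rebuilds the tables only on a cache miss; a hit skips the whole build.
    fn prepare(&mut self, n: i32, l: i32, m: i32) -> &CdfTable {
        self.cache
            .entry((n, l, m))
            .or_insert_with(|| build_cdf_table(n, l))
    }
}

// Placeholder build: a real implementation integrates |R_nl(r)|^2 r^2 dr.
fn build_cdf_table(_n: i32, _l: i32) -> CdfTable {
    let r_grid: Vec<f64> = (0..6000).map(|i| i as f64 * 0.01).collect();
    let cdf: Vec<f64> = (0..6000).map(|i| (i + 1) as f64 / 6000.0).collect();
    CdfTable { r_grid, cdf }
}

fn main() {
    let mut sampler = Sampler::new();
    sampler.prepare(2, 1, 0); // miss: builds the tables
    sampler.prepare(2, 1, 0); // hit: no rebuild
    assert_eq!(sampler.cache.len(), 1);
}
```

`HashMap::entry` makes the miss/hit branch a single lookup, so repeated frames with the same (n, l, m) pay only the hash cost.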

3. Memory Pre-allocation

// GOOD: Pre-allocate exact capacity
let mut particles: Vec<Particle> = Vec::with_capacity(num_particles);

// BAD: Repeated reallocations
let mut particles: Vec<Particle> = Vec::new();
for _ in 0..100_000 {
    particles.push(...); // Triggers multiple reallocations
}

Performance gain: Eliminates memory reallocation during runtime

4. Efficient Polynomial Evaluation

Use recurrence relations instead of closed-form formulas:

// Recurrence for the associated Laguerre polynomial L_k^alpha: O(k) where k = n - l - 1
fn laguerre(n: i32, l: i32, rho: f64) -> f64 {
    let k = n - l - 1;
    let alpha = 2 * l + 1;

    // Base cases
    if k == 0 { return 1.0; }
    if k == 1 { return 1.0 + alpha as f64 - rho; }

    // Recurrence: j * L_j = (2j - 1 + alpha - rho) * L_{j-1} - (j - 1 + alpha) * L_{j-2}
    // (iterative, no recursion overhead)
    let mut l_m1 = 1.0 + alpha as f64 - rho; // L_1
    let mut l_m2 = 1.0;                      // L_0
    for j in 2..=k {
        let l_val = (((2 * j - 1 + alpha) as f64 - rho) * l_m1
                  - ((j - 1 + alpha) as f64) * l_m2)
                  / j as f64;
        l_m2 = l_m1;
        l_m1 = l_val;
    }
    l_m1
}

GPU Optimization Tips

Current Bottleneck

The main GPU bottleneck is individual draw calls per particle:

// Current approach: O(P) draw calls
for p in &particles {
    draw_sphere(p.pos, radius, None, p.color);
}

Future Optimization: Instanced Rendering

// Future: Single draw call for all particles
let mesh = create_sphere_mesh(radius, 10, 10);
let instances: Vec<ParticleInstance> = particles.iter()
    .map(|p| ParticleInstance::new(p.pos, p.color))
    .collect();

draw_mesh_instanced(&mesh, &instances);

Expected performance gain: 5-10x for large particle counts

Point Sprite Optimization

For distant particles, use point sprites instead of spheres:

// Vertex shader
gl_PointSize = base_size / gl_Position.w;

// Fragment shader: fake sphere lighting
vec2 coord = gl_PointCoord - vec2(0.5);
float dist_sq = dot(coord, coord);
if (dist_sq > 0.25) discard;

float z = sqrt(0.25 - dist_sq);
vec3 normal = normalize(vec3(coord, z));
float diffuse = max(dot(normal, light_dir), 0.0);

Memory Usage

Per-Particle Memory

| Field | Type | Size |
|-------|-------|----------|
| pos | Vec3 | 12 bytes |
| vel | Vec3 | 12 bytes |
| color | Color | 16 bytes |
| r | f32 | 4 bytes |
| theta | f32 | 4 bytes |
| **Total** | | **48 bytes** |
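The 48-byte total can be checked with `std::mem::size_of`. The `Vec3` and `Color` definitions below are hypothetical stand-ins for the engine's types (each is plain f32 data), so the layout shown is a sketch of the table, not the crate's actual structs:

```rust
use std::mem::size_of;

// Hypothetical stand-ins for the engine's Vec3 and Color types.
#[repr(C)]
struct Vec3 { x: f32, y: f32, z: f32 }              // 12 bytes
#[repr(C)]
struct Color { r: f32, g: f32, b: f32, a: f32 }     // 16 bytes

#[repr(C)]
struct Particle {
    pos: Vec3,    // 12 bytes
    vel: Vec3,    // 12 bytes
    color: Color, // 16 bytes
    r: f32,       //  4 bytes
    theta: f32,   //  4 bytes
}

fn main() {
    // Every field is a 4-byte-aligned f32, so there is no padding.
    assert_eq!(size_of::<Particle>(), 48);
}
```

Because all fields share 4-byte alignment, the struct packs with no padding, which also helps the sequential-access cache efficiency noted earlier.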

Memory Calculation

For $N$ particles:

$$\text{Memory} = N \times 48 \text{ bytes} + \text{CDF tables}$$

| Particles | Particle Data | CDF Tables | Total |
|-----------|---------------|------------|--------|
| 100,000 | 4.8 MB | ~50 KB | ~5 MB |
| 500,000 | 24 MB | ~50 KB | ~25 MB |
| 1,000,000 | 48 MB | ~50 KB | ~50 MB |

The actual runtime memory is higher due to macroquad's rendering buffers.
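The formula above translates directly into a back-of-the-envelope estimator (the ~50 KB table size is taken from the table above; `estimate_bytes` is an illustrative helper, not part of the crate):

```rust
// Rough estimate: N particles * 48 bytes + ~50 KB of cached CDF tables.
// Ignores macroquad's rendering buffers, so real usage is higher.
fn estimate_bytes(num_particles: u64) -> u64 {
    const PARTICLE_SIZE: u64 = 48;
    const CDF_TABLES: u64 = 50 * 1024;
    num_particles * PARTICLE_SIZE + CDF_TABLES
}

fn main() {
    // 100k particles: 4.8 MB of particle data plus ~50 KB of tables.
    assert_eq!(estimate_bytes(100_000), 4_851_200);
    println!("{:.1} MB", estimate_bytes(100_000) as f64 / 1_000_000.0);
}
```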


Benchmarking

Built-in FPS Counter

All binaries display FPS in the UI. Monitor this for performance issues.

Profiling with perf (Linux)

# Record performance data
perf record -g cargo run --bin realtime --release

# View report
perf report

# Flame graph (requires FlameGraph tools)
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

Benchmarking with Criterion

// benches/sampling_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use atoms_rust::Sampler;

fn sampling_benchmark(c: &mut Criterion) {
    let mut sampler = Sampler::new();
    let mut rng = rand::thread_rng();

    c.bench_function("sample_r", |b| {
        b.iter(|| sampler.sample_r(2, 1, &mut rng))
    });
}

criterion_group!(benches, sampling_benchmark);
criterion_main!(benches);

Run with:

cargo bench
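For the bench file to be picked up, Criterion also has to be declared in Cargo.toml with the default test harness disabled (the version number here is illustrative):

```toml
[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "sampling_bench"
harness = false
```

Without `harness = false`, cargo's built-in bench harness runs instead of Criterion's and the `criterion_main!` entry point is never invoked.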

Performance Checklist

Before Running

  • Build with --release flag
  • Close other GPU-intensive applications
  • Ensure GPU drivers are up to date

For Best Performance

  • Use rayon for parallel updates
  • Pre-allocate vectors with with_capacity()
  • Cache CDF tables between frames
  • Match particle count to hardware capabilities

For Troubleshooting

  • Monitor FPS counter
  • Check CPU/GPU usage with task manager
  • Reduce particle count if FPS < 30
  • Profile with perf to identify bottlenecks

Platform-Specific Notes

Linux

  • Ensure hardware OpenGL drivers are installed
  • Check with: glxinfo | grep "OpenGL renderer"
  • May need: sudo apt install mesa-utils

macOS

  • Uses Metal backend automatically
  • Apple Silicon: Native performance
  • Intel Mac: Rosetta may affect performance

Windows

  • Requires Visual Studio Build Tools
  • DirectX or OpenGL backend
  • Check Task Manager for GPU usage

WebAssembly

  • Limited to ~50,000 particles for 30 FPS
  • WebGL 2.0 required
  • Browser DevTools for profiling