
Performance Guide

Optimization strategies and benchmarks for Atoms_Rust.


Performance Targets

| Binary | Particles | Target FPS | Actual FPS* | Memory |
|-----------|------------|------------|-------------|---------|
| realtime | 100,000 | 60 | 60-80 | ~200 MB |
| realtime | 500,000 | 15 | 12-18 | ~500 MB |
| realtime | 1,000,000 | 8 | 6-10 | ~900 MB |
| raytracer | 10,000 | 30 | 30-45 | ~50 MB |
| raytracer | 50,000 | 8 | 8-12 | ~150 MB |
| atom_2d | 20 atoms | 60 | 60+ | ~20 MB |
| wave_2d | 1 electron | 60 | 60+ | ~10 MB |

*On a modern desktop (Ryzen 7, RTX 3060, 16GB RAM)


Performance Architecture

flowchart TD
    subgraph "CPU Bound"
        P1[Particle Updates<br/>Parallel with rayon]
        P2[CDF Sampling<br/>Binary search, O(log N)]
        P3[Color Calculation<br/>Polynomial evaluation]
    end

    subgraph "GPU Bound"
        G1[Draw Calls<br/>One per particle]
        G2[Vertex Processing<br/>Sphere geometry]
        G3[Fragment Shading<br/>Lighting calculation]
    end

    subgraph "Memory Bound"
        M1[Particle Data<br/>48 bytes each]
        M2[CDF Tables<br/>~50KB per n,l combo]
        M3[Cache Efficiency<br/>Sequential access]
    end

    P1 --> PERF[Frame Time]
    P2 --> PERF
    P3 --> PERF
    G1 --> PERF
    G2 --> PERF
    G3 --> PERF
    M1 --> PERF
    M2 --> PERF
    M3 --> PERF
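The O(log N) CDF sampling in the diagram is inverse-transform sampling: draw a uniform variate, then binary-search the cumulative table for the first entry that reaches it. A minimal sketch using `partition_point` (the grid and CDF values here are illustrative, not the crate's actual tables):

```rust
// Inverse-transform sampling: draw u ~ U(0,1), then binary-search the CDF.
fn sample_from_cdf(r_grid: &[f64], cdf: &[f64], u: f64) -> f64 {
    // partition_point is a binary search over the sorted CDF: O(log N) per sample.
    let idx = cdf.partition_point(|&c| c < u).min(r_grid.len() - 1);
    r_grid[idx]
}

fn main() {
    // Illustrative uniform CDF over a 100-point radial grid.
    let r_grid: Vec<f64> = (0..100).map(|i| i as f64 * 0.1).collect();
    let cdf: Vec<f64> = (0..100).map(|i| (i + 1) as f64 / 100.0).collect();
    let r = sample_from_cdf(&r_grid, &cdf, 0.5);
    assert!((r - 4.9).abs() < 1e-9);
}
```

Because the CDF is monotone, the lookup cost stays logarithmic no matter how fine the radial grid is.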

Frame Time Breakdown

pie title Typical Frame Time (100k particles, 60 FPS)
    "GPU Draw Calls" : 55
    "Particle Position Updates" : 25
    "CDF Sampling" : 8
    "Color Calculation" : 7
    "Input & Camera" : 5

Optimization Techniques

1. Parallel Processing

Particle updates use rayon for parallel processing:

// Parallel iteration across all CPU cores
particles.par_iter_mut().for_each(|p| {
    // Calculate probability flow (independent per particle)
    p.vel = calculate_probability_flow(p.pos, n, l, m);

    // Update position
    let temp_pos = p.pos + p.vel * dt;
    let new_phi = temp_pos.z.atan2(temp_pos.x);
    p.pos = spherical_to_cartesian(p.r, p.theta, new_phi);
});

Performance gain: ~4x on a 4-core CPU, ~8x on an 8-core CPU

2. CDF Caching

The Sampler caches CDF tables between frames:

sequenceDiagram
    participant Code
    participant Sampler
    participant Cache

    Note over Code,Cache: First call for (n=2, l=1, m=0)
    Code->>Sampler: prepare(2, 1, 0)
    Sampler->>Sampler: Build CDF tables
    Sampler->>Cache: Store tables

    Note over Code,Cache: Subsequent calls
    Code->>Sampler: prepare(2, 1, 0)
    Sampler->>Cache: Check cache
    Cache-->>Sampler: Cache hit!
    Note right of Sampler: Skip rebuild

Performance gain: skips rebuilding the ~6,000-entry CDF tables every frame while the quantum numbers are unchanged
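The caching flow above can be sketched with a `HashMap` keyed by the quantum numbers. This is an illustrative shape, not the crate's actual `Sampler` API, and `build_cdf_table` is a placeholder for the real integration:

```rust
use std::collections::HashMap;

// Illustrative CDF table: cumulative probability at each radial grid point.
struct CdfTable {
    r_grid: Vec<f64>,
    cdf: Vec<f64>,
}

struct Sampler {
    cache: HashMap<(i32, i32, i32), CdfTable>,
}

impl Sampler {
    fn new() -> Self {
        Sampler { cache: HashMap::new() }
    }

    // Rebuilds the tables only on a cache miss; a hit skips the whole build.
    fn prepare(&mut self, n: i32, l: i32, m: i32) -> &CdfTable {
        self.cache
            .entry((n, l, m))
            .or_insert_with(|| build_cdf_table(n, l))
    }
}

// Placeholder build: a real implementation integrates |R_nl(r)|^2 r^2 dr.
fn build_cdf_table(_n: i32, _l: i32) -> CdfTable {
    let r_grid: Vec<f64> = (0..6000).map(|i| i as f64 * 0.01).collect();
    let cdf: Vec<f64> = (0..6000).map(|i| (i + 1) as f64 / 6000.0).collect();
    CdfTable { r_grid, cdf }
}

fn main() {
    let mut sampler = Sampler::new();
    sampler.prepare(2, 1, 0); // miss: builds the tables
    sampler.prepare(2, 1, 0); // hit: no rebuild
    assert_eq!(sampler.cache.len(), 1);
}
```

`HashMap::entry` makes the miss/hit branch a single lookup, so repeated frames with the same (n, l, m) pay only the hash cost.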

3. Memory Pre-allocation

// GOOD: Pre-allocate exact capacity
let mut particles: Vec<Particle> = Vec::with_capacity(num_particles);

// BAD: Repeated reallocations
let mut particles: Vec<Particle> = Vec::new();
for _ in 0..100_000 {
    particles.push(...); // Triggers multiple reallocations
}

Performance gain: Eliminates memory reallocation during runtime

4. Efficient Polynomial Evaluation

Use recurrence relations instead of closed-form formulas:

// Recurrence for the associated Laguerre polynomial L_k^alpha: O(k) where k = n - l - 1
fn laguerre(n: i32, l: i32, rho: f64) -> f64 {
    let k = n - l - 1;
    let alpha = 2 * l + 1;

    // Base cases
    if k == 0 { return 1.0; }
    if k == 1 { return 1.0 + alpha as f64 - rho; }

    // Recurrence: j * L_j = (2j - 1 + alpha - rho) * L_{j-1} - (j - 1 + alpha) * L_{j-2}
    // (iterative, no recursion overhead)
    let mut l_m1 = 1.0 + alpha as f64 - rho; // L_1
    let mut l_m2 = 1.0;                      // L_0
    for j in 2..=k {
        let l_val = (((2 * j - 1 + alpha) as f64 - rho) * l_m1
                  - ((j - 1 + alpha) as f64) * l_m2)
                  / j as f64;
        l_m2 = l_m1;
        l_m1 = l_val;
    }
    l_m1
}

GPU Optimization Tips

Current Bottleneck

The main GPU bottleneck is individual draw calls per particle:

// Current approach: O(P) draw calls
for p in &particles {
    draw_sphere(p.pos, radius, None, p.color);
}

Future Optimization: Instanced Rendering

// Future: Single draw call for all particles
let mesh = create_sphere_mesh(radius, 10, 10);
let instances: Vec<ParticleInstance> = particles.iter()
    .map(|p| ParticleInstance::new(p.pos, p.color))
    .collect();

draw_mesh_instanced(&mesh, &instances);

Expected performance gain: 5-10x for large particle counts

Point Sprite Optimization

For distant particles, use point sprites instead of spheres:

// Vertex shader
gl_PointSize = base_size / gl_Position.w;

// Fragment shader: fake sphere lighting
vec2 coord = gl_PointCoord - vec2(0.5);
float dist_sq = dot(coord, coord);
if (dist_sq > 0.25) discard;

float z = sqrt(0.25 - dist_sq);
vec3 normal = normalize(vec3(coord, z));
float diffuse = max(dot(normal, light_dir), 0.0);

Memory Usage

Per-Particle Memory

| Field | Type | Size |
|-------|-------|----------|
| pos | Vec3 | 12 bytes |
| vel | Vec3 | 12 bytes |
| color | Color | 16 bytes |
| r | f32 | 4 bytes |
| theta | f32 | 4 bytes |
| **Total** | | **48 bytes** |
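The 48-byte total can be checked with `std::mem::size_of`. The `Vec3` and `Color` definitions below are hypothetical stand-ins for the engine's types (each is plain f32 data), so the layout shown is a sketch of the table, not the crate's actual structs:

```rust
use std::mem::size_of;

// Hypothetical stand-ins for the engine's Vec3 and Color types.
#[repr(C)]
struct Vec3 { x: f32, y: f32, z: f32 }              // 12 bytes
#[repr(C)]
struct Color { r: f32, g: f32, b: f32, a: f32 }     // 16 bytes

#[repr(C)]
struct Particle {
    pos: Vec3,    // 12 bytes
    vel: Vec3,    // 12 bytes
    color: Color, // 16 bytes
    r: f32,       //  4 bytes
    theta: f32,   //  4 bytes
}

fn main() {
    // Every field is a 4-byte-aligned f32, so there is no padding.
    assert_eq!(size_of::<Particle>(), 48);
}
```

Because all fields share 4-byte alignment, the struct packs with no padding, which also helps the sequential-access cache efficiency noted earlier.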

Memory Calculation

For $N$ particles:

$$\text{Memory} = N \times 48 \text{ bytes} + \text{CDF tables}$$

| Particles | Particle Data | CDF Tables | Total |
|-----------|---------------|------------|--------|
| 100,000 | 4.8 MB | ~50 KB | ~5 MB |
| 500,000 | 24 MB | ~50 KB | ~25 MB |
| 1,000,000 | 48 MB | ~50 KB | ~50 MB |

The actual runtime memory is higher due to macroquad's rendering buffers.
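The formula above translates directly into a back-of-the-envelope estimator (the ~50 KB table size is taken from the table above; `estimate_bytes` is an illustrative helper, not part of the crate):

```rust
// Rough estimate: N particles * 48 bytes + ~50 KB of cached CDF tables.
// Ignores macroquad's rendering buffers, so real usage is higher.
fn estimate_bytes(num_particles: u64) -> u64 {
    const PARTICLE_SIZE: u64 = 48;
    const CDF_TABLES: u64 = 50 * 1024;
    num_particles * PARTICLE_SIZE + CDF_TABLES
}

fn main() {
    // 100k particles: 4.8 MB of particle data plus ~50 KB of tables.
    assert_eq!(estimate_bytes(100_000), 4_851_200);
    println!("{:.1} MB", estimate_bytes(100_000) as f64 / 1_000_000.0);
}
```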


Benchmarking

Built-in FPS Counter

All binaries display FPS in the UI. Monitor this for performance issues.

Profiling with perf (Linux)

# Record performance data
perf record -g cargo run --bin realtime --release

# View report
perf report

# Flame graph (requires FlameGraph tools)
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

Benchmarking with Criterion

// benches/sampling_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use atoms_rust::Sampler;

fn sampling_benchmark(c: &mut Criterion) {
    let mut sampler = Sampler::new();
    let mut rng = rand::thread_rng();

    c.bench_function("sample_r", |b| {
        b.iter(|| sampler.sample_r(2, 1, &mut rng))
    });
}

criterion_group!(benches, sampling_benchmark);
criterion_main!(benches);

Run with:

cargo bench
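For the bench file to be picked up, Criterion also has to be declared in Cargo.toml with the default test harness disabled (the version number here is illustrative):

```toml
[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "sampling_bench"
harness = false
```

Without `harness = false`, cargo's built-in bench harness runs instead of Criterion's and the `criterion_main!` entry point is never invoked.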

Performance Checklist

Before Running

  • Build with --release flag
  • Close other GPU-intensive applications
  • Ensure GPU drivers are up to date

For Best Performance

  • Use rayon for parallel updates
  • Pre-allocate vectors with with_capacity()
  • Cache CDF tables between frames
  • Match particle count to hardware capabilities

For Troubleshooting

  • Monitor FPS counter
  • Check CPU/GPU usage with task manager
  • Reduce particle count if FPS < 30
  • Profile with perf to identify bottlenecks

Platform-Specific Notes

Linux

  • Ensure hardware OpenGL drivers are installed
  • Check with: glxinfo | grep "OpenGL renderer"
  • May need: sudo apt install mesa-utils

macOS

  • Uses Metal backend automatically
  • Apple Silicon: Native performance
  • Intel Mac: Rosetta may affect performance

Windows

  • Requires Visual Studio Build Tools
  • DirectX or OpenGL backend
  • Check Task Manager for GPU usage

WebAssembly

  • Limited to ~50,000 particles for 30 FPS
  • WebGL 2.0 required
  • Browser DevTools for profiling