Optimization strategies and benchmarks for Atoms_Rust.
| Binary | Particles | Target FPS | Actual FPS* | Memory |
|---|---|---|---|---|
| realtime | 100,000 | 60 | 60-80 | ~200 MB |
| realtime | 500,000 | 15 | 12-18 | ~500 MB |
| realtime | 1,000,000 | 8 | 6-10 | ~900 MB |
| raytracer | 10,000 | 30 | 30-45 | ~50 MB |
| raytracer | 50,000 | 8 | 8-12 | ~150 MB |
| atom_2d | 20 atoms | 60 | 60+ | ~20 MB |
| wave_2d | 1 electron | 60 | 60+ | ~10 MB |
*On a modern desktop (Ryzen 7, RTX 3060, 16GB RAM)
```mermaid
flowchart TD
    subgraph "CPU Bound"
        P1[Particle Updates<br/>Parallel with rayon]
        P2[CDF Sampling<br/>Binary search O log N]
        P3[Color Calculation<br/>Polynomial evaluation]
    end
    subgraph "GPU Bound"
        G1[Draw Calls<br/>One per particle]
        G2[Vertex Processing<br/>Sphere geometry]
        G3[Fragment Shading<br/>Lighting calculation]
    end
    subgraph "Memory Bound"
        M1[Particle Data<br/>48 bytes each]
        M2[CDF Tables<br/>~50KB per n,l combo]
        M3[Cache Efficiency<br/>Sequential access]
    end
    P1 --> PERF[Frame Time]
    P2 --> PERF
    P3 --> PERF
    G1 --> PERF
    G2 --> PERF
    G3 --> PERF
    M1 --> PERF
    M2 --> PERF
    M3 --> PERF
```
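The binary-search CDF sampling noted above can be sketched as an inverse-transform lookup. This is illustrative only; the actual `Sampler` may interpolate between bins, and the function name here is hypothetical:

```rust
/// Inverse-transform sampling: draw u ~ U(0,1), then binary-search the
/// precomputed cumulative table for the first bin whose CDF value >= u.
/// `partition_point` performs the O(log N) binary search.
fn sample_index(cdf: &[f64], u: f64) -> usize {
    cdf.partition_point(|&c| c < u)
}
```

With a 4-entry table `[0.1, 0.4, 0.8, 1.0]`, `u = 0.5` lands in bin 2.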
```mermaid
pie title Typical Frame Time (100k particles, 60 FPS)
    "GPU Draw Calls" : 55
    "Particle Position Updates" : 25
    "CDF Sampling" : 8
    "Color Calculation" : 7
    "Input & Camera" : 5
```
Particle updates use rayon for parallel processing:
```rust
// Parallel iteration across all CPU cores
particles.par_iter_mut().for_each(|p| {
    // Calculate probability flow (independent per particle)
    p.vel = calculate_probability_flow(p.pos, n, l, m);

    // Update position
    let temp_pos = p.pos + p.vel * dt;
    let new_phi = temp_pos.z.atan2(temp_pos.x);
    p.pos = spherical_to_cartesian(p.r, p.theta, new_phi);
});
```

Performance gain: ~4x on a 4-core CPU, ~8x on an 8-core CPU.
The Sampler caches CDF tables between frames:
```mermaid
sequenceDiagram
    participant Code
    participant Sampler
    participant Cache

    Note over Code,Cache: First call for (n=2, l=1, m=0)
    Code->>Sampler: prepare(2, 1, 0)
    Sampler->>Sampler: Build CDF tables
    Sampler->>Cache: Store tables

    Note over Code,Cache: Subsequent calls
    Code->>Sampler: prepare(2, 1, 0)
    Sampler->>Cache: Check cache
    Cache-->>Sampler: Cache hit!
    Note right of Sampler: Skip rebuild
```
Performance gain: avoids rebuilding the ~6,000-entry CDF tables every frame while the quantum numbers stay unchanged.
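The memoization pattern behind this can be sketched as a map keyed by the quantum numbers. The field and helper names here are assumptions; the real `Sampler` internals may differ:

```rust
use std::collections::HashMap;

// Hypothetical cache keyed by quantum numbers (n, l, m).
struct Sampler {
    cdf_cache: HashMap<(i32, i32, i32), Vec<f64>>,
}

impl Sampler {
    /// Returns the CDF table, building it only on a cache miss.
    fn prepare(&mut self, n: i32, l: i32, m: i32) -> &Vec<f64> {
        self.cdf_cache
            .entry((n, l, m))
            .or_insert_with(|| build_cdf_table(n, l, m))
    }
}

// Placeholder for the expensive table construction (~6,000 evaluations).
fn build_cdf_table(_n: i32, _l: i32, _m: i32) -> Vec<f64> {
    vec![0.0; 6000]
}
```

`entry().or_insert_with()` keeps the hot path to a single hash lookup when the table already exists.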
```rust
// GOOD: Pre-allocate exact capacity
let mut particles: Vec<Particle> = Vec::with_capacity(num_particles);

// BAD: Repeated reallocations
let mut particles: Vec<Particle> = Vec::new();
for _ in 0..100_000 {
    particles.push(...); // Triggers multiple reallocations
}
```

Performance gain: eliminates memory reallocation during runtime.
Use recurrence relations instead of closed-form formulas:
```rust
// Recurrence for Laguerre: O(k) where k = n - l - 1
fn laguerre(n: i32, l: i32, rho: f64) -> f64 {
    let k = n - l - 1;
    let alpha = 2 * l + 1;

    // Base cases
    if k == 0 { return 1.0; }
    if k == 1 { return 1.0 + alpha as f64 - rho; }

    // Three-term recurrence (iterative, no recursion overhead)
    let mut l_m1 = 1.0 + alpha as f64 - rho;
    let mut l_m2 = 1.0;
    for j in 2..=k {
        let l_val = (((2 * j - 1 + alpha) as f64 - rho) * l_m1
            - ((j - 1 + alpha) as f64) * l_m2)
            / j as f64;
        l_m2 = l_m1;
        l_m1 = l_val;
    }
    l_m1
}
```

The main GPU bottleneck is individual draw calls per particle:
```rust
// Current approach: O(P) draw calls
for p in &particles {
    draw_sphere(p.pos, radius, None, p.color);
}
```

```rust
// Future: single draw call for all particles
let mesh = create_sphere_mesh(radius, 10, 10);
let instances: Vec<ParticleInstance> = particles.iter()
    .map(|p| ParticleInstance::new(p.pos, p.color))
    .collect();
draw_mesh_instanced(&mesh, &instances);
```

Expected performance gain: 5-10x for large particle counts.
For distant particles, use point sprites instead of spheres:
```glsl
// Vertex shader
gl_PointSize = base_size / gl_Position.w;
```

```glsl
// Fragment shader: fake sphere lighting
vec2 coord = gl_PointCoord - vec2(0.5);
float dist_sq = dot(coord, coord);
if (dist_sq > 0.25) discard;
float z = sqrt(0.25 - dist_sq);
vec3 normal = normalize(vec3(coord, z));
float diffuse = max(dot(normal, light_dir), 0.0);
```

Each particle occupies 48 bytes:

| Field | Type | Size |
|---|---|---|
| pos | Vec3 | 12 bytes |
| vel | Vec3 | 12 bytes |
| color | Color | 16 bytes |
| r | f32 | 4 bytes |
| theta | f32 | 4 bytes |
| Total | | 48 bytes |
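The layout above corresponds to a struct along these lines. The field types are modeled here as plain float arrays of the same sizes; the actual code uses macroquad's `Vec3` and `Color`:

```rust
// Hypothetical mirror of the Particle layout; macroquad's Vec3 (3 x f32)
// and Color (4 x f32) are represented as plain arrays of equal size.
#[repr(C)]
struct Particle {
    pos: [f32; 3],   // 12 bytes
    vel: [f32; 3],   // 12 bytes
    color: [f32; 4], // 16 bytes (RGBA)
    r: f32,          // 4 bytes
    theta: f32,      // 4 bytes
}
// All fields are 4-byte aligned, so the 48 bytes contain no padding.
```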
Estimated memory usage by particle count:
| Particles | Particle Data | CDF Tables | Total |
|---|---|---|---|
| 100,000 | 4.8 MB | ~50 KB | ~5 MB |
| 500,000 | 24 MB | ~50 KB | ~25 MB |
| 1,000,000 | 48 MB | ~50 KB | ~50 MB |
The actual runtime memory is higher due to macroquad's rendering buffers.
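The particle-data column is simply count × 48 bytes; a quick sanity check (helper name is illustrative):

```rust
// Particle buffer size in decimal MB, matching the table above.
fn particle_data_mb(count: usize) -> f64 {
    (count * 48) as f64 / 1_000_000.0
}
```

For example, 100,000 particles come to 4.8 MB, as in the table.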
All binaries display FPS in the UI. Monitor this for performance issues.
```bash
# Record performance data
perf record -g cargo run --bin realtime --release

# View report
perf report

# Flame graph (requires FlameGraph tools)
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
```

```rust
// benches/sampling_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use atoms_rust::Sampler;

fn sampling_benchmark(c: &mut Criterion) {
    let mut sampler = Sampler::new();
    let mut rng = rand::thread_rng();
    c.bench_function("sample_r", |b| {
        b.iter(|| sampler.sample_r(black_box(2), black_box(1), &mut rng))
    });
}

criterion_group!(benches, sampling_benchmark);
criterion_main!(benches);
```

Run with:

```bash
cargo bench
```

- Build with the `--release` flag
- Close other GPU-intensive applications
- Ensure GPU drivers are up to date
- Use `rayon` for parallel updates
- Pre-allocate vectors with `with_capacity()`
- Match particle count to hardware capabilities
- Monitor FPS counter
- Check CPU/GPU usage with task manager
- Reduce particle count if FPS < 30
- Profile with `perf` to identify bottlenecks
**Linux:**
- Ensure hardware OpenGL drivers are installed
- Check with: `glxinfo | grep "OpenGL renderer"`
- May need: `sudo apt install mesa-utils`
**macOS:**
- Uses the Metal backend automatically
- Apple Silicon: native performance
- Intel Mac: Rosetta may affect performance
**Windows:**
- Requires Visual Studio Build Tools
- DirectX or OpenGL backend
- Check Task Manager for GPU usage
**Web (WASM):**
- Limited to ~50,000 particles at 30 FPS
- WebGL 2.0 required
- Use browser DevTools for profiling