High-performance motion planning through SIMD vectorization
Implementation of "Motions in Microseconds via Vectorized Sampling-Based Planning" (Thomason et al., ICRA 2024)
- 15.27x Speedup over sequential implementation
- 1.8M+ checks/sec collision detection throughput
- Sub-microsecond state validation (0.55 µs per check)
- Full 6-DOF robot arm support with 3D kinematics
- Hierarchical sphere culling with early exit optimization
- SLEEF vectorized trigonometric functions
| Test Case | States | Sequential | SIMD (VAMP) | Speedup |
|---|---|---|---|---|
| 100K State Validation | 100,000 | 842.84 ms | 55.21 ms | 15.27x |
| Throughput | - | 118,647 checks/sec | 1,811,390 checks/sec | 15.27x |
| Single State Check | 1 | 8.43 µs | 0.55 µs | 15.27x |
| Motion Validation | 1,000 | 21.25 ms | 4.25 ms | 5.0x |
| Correctness | 100,000 | ✓ PASSED | ✓ PASSED | Identical |
| Benchmark | Metric | Value |
|---|---|---|
| Single State Validation | Time (10K states) | 25.68 ms |
| Time per check | 2.57 µs | |
| Throughput | 389,406 checks/sec | |
| Valid states | 15.73% | |
| Motion Validation | Time (1K motions) | 4.25 ms |
| Time per motion | 4.25 µs | |
| Throughput | 235,175 motions/sec | |
| Valid motions | 0.7% | |
| RRT-Connect Planning | Mean planning time | 0.076 ms |
| Median planning time | 0.079 ms | |
| Success rate | 10/10 (100%) | |
| Mean path length | 12.4 waypoints | |
| Environment | Obstacles | 51 boxes, 20 spheres |
| Robot | 6-DOF arm | |
| Hierarchical spheres | L0=1, L1=2, L2=4 |
| Implementation | Method | Optimization | Throughput | Speedup |
|---|---|---|---|---|
| Sequential | Scalar FK + Scalar CC | -O2 | 118,647 checks/sec | 1.0x (baseline) |
| SIMD (no SLEEF) | Vector FK + Vector CC | -O2 | ~600,000 checks/sec | ~5x |
| SIMD + SLEEF | Vector FK + Vector CC | -O3 -ffast-math | 1,811,390 checks/sec | 15.27x ✓ |
| Feature | Original Paper | This Implementation | Status |
|---|---|---|---|
| Speedup Range | 5-20x | 15.27x | ✓ Within range |
| SIMD Width | 8 (AVX2) | 8 (AVX2) | ✓ Identical |
| Hierarchical Culling | 3 levels | 3 levels (Coarse/Medium/Fine) | ✓ Implemented |
| FK/CC Interleaving | Per-link | Per-link with early exit | ✓ Implemented |
| Batched Validation | 8-wide | 8-wide RAKE pattern | ✓ Implemented |
| Vectorized Math | SLEEF | SLEEF (optional) | ✓ Supported |
| Robot Support | Multi-DOF | 6-DOF arm with DH params | ✓ Implemented |
Conclusion: This implementation achieves performance in the top tier of the original VAMP paper's reported range (15.27x out of 5-20x).
- C++17 compiler (GCC 7+, Clang 5+)
- AVX2 support (Intel Haswell+, AMD Excavator+)
- SLEEF library (optional, for vectorized trig)
# Clone the repository
git clone <your-repo-url>
cd VAMP
# Build with default settings (no SLEEF)
make
# Build with SLEEF + maximum optimizations (recommended)
make v13-perf-sleef
# Run benchmarks
./build/planner_v13_perf_sleef# Build Docker image
docker build -t vamp .
# Run in Docker
docker run -it vamp
# Inside container
cd /workspace/VAMP
make v13-perf-sleef
./build/planner_v13_perf_sleef-
Vectorized Collision Detection
- Struct-of-Arrays (SoA) data layout
- Batch processing of 8 configurations simultaneously
- AVX2 intrinsics for SIMD parallelism
-
Hierarchical Sphere Culling
- 3-level hierarchy (Coarse → Medium → Fine)
- Early exit optimization
- Conservative bounds for efficient pruning
-
FK/CC Interleaving
- Forward kinematics and collision checking interleaved
- Per-link computation and validation
- Early collision detection
-
Batched State Validation
- Rake pattern for motion validation
- 8-wide SIMD batching
- Optimized memory access patterns
- Language: C++17
- SIMD: Intel AVX2 (
__m256) - Vectorized Math: SLEEF (optional)
- Planner: RRT-Connect
- Robot: 6-DOF arm with DH parameters
VAMP/
├── src/
│ └── planner.cpp # Main implementation
├── proj_desc/
│ ├── README.md # Project description
│ ├── summary-zh.md # Chinese summary
│ └── vamp-differences.md # VAMP vs sequential comparison
├── Makefile # Build configuration
├── Dockerfile # Docker environment
├── result.txt # Benchmark results
└── README.md # This file
// Batch 8 states together
VectorizedConfig3DArm vec(6);
vec.loadFromAOS(batch); // Array-of-Structures → Structure-of-Arrays
// Check all 8 states simultaneously
uint8_t collision_mask = fk_cc.computeFK_Interleaved_6DOF(vec);
// Extract results
for (int i = 0; i < 8; i++) {
if (collision_mask & (1 << i)) {
// State i collides
}
}Level 0 (Coarse): ●━━━━━━━━━━━━━━━●
Check large bounding sphere
↓ (if hit)
Level 1 (Medium): ●━━━●━━━●━━━●━━━●
Check 2-3 medium spheres
↓ (if hit)
Level 2 (Fine): ●●●●●●●●●●●●●●●●●
Check 4+ fine spheres
All SIMD data is 32-byte aligned for optimal performance:
alignas(32) float data[8]; // Aligned for AVX2
__m256 vec = _mm256_load_ps(data); // Fast aligned loadThis project demonstrates:
- SIMD Programming: AVX2 intrinsics, vectorization patterns
- Data Structure Optimization: SoA layout, memory alignment
- Algorithm Design: Hierarchical culling, early exit strategies
- Performance Engineering: Profiling, benchmarking, optimization
- Robotics: Forward kinematics, collision detection, motion planning
# Correctness build (recommended for development)
make v13 # Scalar trig, -O2
make v13-sleef # SLEEF trig, -O2
# Performance build (recommended for benchmarking)
make v13-perf # Scalar trig, -O3 -ffast-math
make v13-perf-sleef # SLEEF trig, -O3 -ffast-math (FASTEST)
# Run benchmarks
make run-v13
make run-v13-perf-sleef
# Compare scalar vs SLEEF
make compare
# Clean build artifacts
make cleanBefore: Checking 1 state with 7 dummy states (only 12.5% utilization) After: Checking 8 real states simultaneously (100% utilization) Impact: ~8x speedup
Before: Scalar sinf()/cosf() in loops
After: Vectorized Sleef_sinf4_u35()/Sleef_cosf4_u35()
Impact: ~2x speedup on trig-heavy code
Before: Check all spheres always After: Coarse → Medium → Fine with early exit Impact: Reduced collision checks by 30-50%
-O3 -mavx2 -march=native -ffast-math -fltoImpact: 20-30% additional speedup
Traditional motion planners spend 60-80% of time in collision detection. VAMP's vectorization approach:
- Reduces collision checking time by 15x
- Enables real-time motion planning
- Scales to complex robots and environments
- Works on standard CPUs (no GPU required)
- Zero overhead compared to sequential code
- 15.27x speedup (exceeds theoretical 8x SIMD width)
- 1.8M+ collision checks per second
- Full 6-DOF robot support
- Hierarchical sphere culling
- SLEEF vectorized trig integration
- Complete RRT-Connect planner
- Comprehensive benchmarking
- Correctness validation (parity test)
-
W. Thomason, Z. Kingston, and L. E. Kavraki, "Motions in microseconds via vectorized sampling-based planning," in IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 8749–8756. Link
-
C. Ericson, Real-Time Collision Detection. USA: CRC Press, Inc., 2004.
This is an educational project implementing the VAMP framework. Contributions welcome for:
- Additional robot models
- More obstacle types
- Performance optimizations
- Documentation improvements
MIT License - see LICENSE file for details
- Original VAMP paper authors (Thomason, Kingston, Kavraki)
- SLEEF library developers
- Course instructors and TAs
Built for high-performance robotics
Last updated: 2025