Schedulers in C
Claude's analysis of PyTorch usage in the schedulers used in diffusers, to understand if they can be ported to C++ without major dependencies.
The vast majority of schedulers use only small tensor operations that can be easily reimplemented in C++ without any major dependencies.
Torch operations used:
- Element-wise arithmetic: `+`, `-`, `*`, `/`, `**` (power)
- Math functions on scalars/small tensors: `sqrt()`, `exp()`, `log()`, `clamp()`
- Indexing: array indexing like `alphas_cumprod[timestep]`
- Shape manipulation: `unsqueeze()`, `flatten()`, `reshape()`
- Tensor creation: `torch.tensor()`, `torch.linspace()`, `torch.from_numpy()`
Operations on large tensors (the actual image latents):

```python
# Typical pattern in step() functions:
prev_sample = alpha_prod_t_prev ** (0.5) * pred_original_sample + pred_sample_direction
noisy_samples = sqrt_alpha_prod * original_samples + sqrt_one_minus_alpha_prod * noise
```

These are pure element-wise operations - no matrix multiplications, no convolutions, no complex algorithms.
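For illustration, the add-noise pattern above maps to a plain loop in C++. This is only a sketch; the function name and signature are not taken from diffusers:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: noisy = sqrt(alpha_prod) * original + sqrt(1 - alpha_prod) * noise
std::vector<float> add_noise(const std::vector<float>& original_samples,
                             const std::vector<float>& noise,
                             float alpha_prod_t) {
    const float sqrt_alpha_prod = std::sqrt(alpha_prod_t);
    const float sqrt_one_minus_alpha_prod = std::sqrt(1.0f - alpha_prod_t);
    std::vector<float> noisy(original_samples.size());
    for (std::size_t i = 0; i < original_samples.size(); ++i) {
        noisy[i] = sqrt_alpha_prod * original_samples[i]
                 + sqrt_one_minus_alpha_prod * noise[i];
    }
    return noisy;
}
```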
Similar to above, with additional operations:
- `torch.expm1()` (exp(x) - 1, for numerical stability)
- More complex formulas, but still element-wise on scalars and image tensors
- Example from DPMSolver:

```python
x_t = (sigma_t / sigma_s) * sample - (alpha_t * (torch.exp(-h) - 1.0)) * model_output
```
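A rough C++ translation of that update might look like the sketch below; `sigma_t`, `sigma_s`, `alpha_t`, and `h` are assumed to be precomputed scalars, and `std::expm1` stands in for the exp(-h) - 1 term:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of a DPMSolver-style first-order update:
// x_t = (sigma_t / sigma_s) * sample - (alpha_t * (exp(-h) - 1.0)) * model_output
void dpm_first_order_update(const std::vector<float>& sample,
                            const std::vector<float>& model_output,
                            float sigma_t, float sigma_s, float alpha_t, float h,
                            std::vector<float>& x_t) {
    const float sample_coeff = sigma_t / sigma_s;
    const float output_coeff = alpha_t * static_cast<float>(std::expm1(-h));  // exp(-h) - 1
    x_t.resize(sample.size());
    for (std::size_t i = 0; i < sample.size(); ++i) {
        x_t[i] = sample_coeff * sample[i] - output_coeff * model_output[i];
    }
}
```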
Even simpler:

```python
prev_sample = sample + dt * model_output  # Flow matching is just ODE integration
```
UniPC uses small matrix operations for high-order polynomial solvers:
- `torch.linalg.solve(R[:-1, :-1], b[:-1])` - Solving small linear systems (order 2-4)
- `torch.einsum("k,bkc...->bc...", rhos_p, D1s)` - Weighted sum across solver orders (see the sketch below)

Matrix sizes: typically 2x2 to 4x4 (for solver orders 2-4). Complexity: O(n³) where n ≤ 4, completely negligible.
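That einsum is just a weighted sum over the solver-order axis. A C++ sketch (assuming each entry of `D1s` is stored as a flat vector of equal length, which is an illustrative layout, not the diffusers one):

```cpp
#include <cstddef>
#include <vector>

// Sketch of einsum("k,bkc...->bc...", rhos_p, D1s) for one batch element:
// result[i] = sum_k rhos_p[k] * D1s[k][i]
std::vector<float> weighted_sum_over_orders(const std::vector<float>& rhos_p,
                                            const std::vector<std::vector<float>>& D1s) {
    std::vector<float> result(D1s.front().size(), 0.0f);
    for (std::size_t k = 0; k < D1s.size(); ++k) {
        for (std::size_t i = 0; i < result.size(); ++i) {
            result[i] += rhos_p[k] * D1s[k][i];
        }
    }
    return result;
}
```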
- `torch.quantile()` - Used in dynamic thresholding (optional feature, rarely enabled)
- `torch.cumprod()` - Only used in initialization, not per-step
Initialization operations (run once, not per step):
- `torch.linspace()` - Creating timestep schedules
- `torch.cumprod()` - Computing cumulative products of alphas
- `torch.log()`, `torch.sqrt()` - Computing noise schedules
- NumPy operations for schedule generation
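A sketch of how this initialization could look in C++; a linear beta schedule is assumed here purely for illustration (diffusers supports several schedule types):

```cpp
#include <cstddef>
#include <vector>

// Sketch: linear beta schedule (linspace) plus cumulative product of alphas (cumprod).
// Assumes num_train_timesteps >= 2.
std::vector<float> compute_alphas_cumprod(float beta_start, float beta_end,
                                          std::size_t num_train_timesteps) {
    std::vector<float> alphas_cumprod(num_train_timesteps);
    float cumprod = 1.0f;
    for (std::size_t i = 0; i < num_train_timesteps; ++i) {
        // Equivalent of torch.linspace(beta_start, beta_end, num_train_timesteps)
        const float beta = beta_start + (beta_end - beta_start) * static_cast<float>(i)
                                          / static_cast<float>(num_train_timesteps - 1);
        cumprod *= 1.0f - beta;  // alpha_t = 1 - beta_t, accumulated like torch.cumprod
        alphas_cumprod[i] = cumprod;
    }
    return alphas_cumprod;
}
```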
- Scalar operations: indexing `self.alphas_cumprod[timestep]`
- Element-wise ops: `**0.5`, `*`, `+`, `-`, `/` on image tensors
- Broadcasting: scalar * tensor operations
- Random noise: `randn_tensor()` (calls `torch.randn()`)
Schedulers: DDIM, DDPM, Euler, Euler Ancestral, DPMSolver++, DPMSolver, Heun, LCM, TCD, Flow Match, SASolver, K-DPM
C++ Implementation needs:
- Basic math library (std::sqrt, std::exp, std::log, std::pow)
- Random number generation (std::normal_distribution; see the sketch after this list)
- Array operations (can use raw arrays or std::vector)
- No external dependencies needed!
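For the noise generation piece (the `randn_tensor()` equivalent), a minimal standard-library sketch:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Sketch of a randn_tensor() equivalent: fill a buffer with standard Gaussian noise.
std::vector<float> randn_vector(std::size_t size, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> noise(size);
    for (float& value : noise) {
        value = dist(rng);
    }
    return noise;
}
```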
Example translation:

```python
# PyTorch:
prev_sample = alpha_prod_t_prev ** 0.5 * pred_original_sample + pred_sample_direction
```

```cpp
// C++:
for (size_t i = 0; i < sample_size; ++i) {
    prev_sample[i] = std::sqrt(alpha_prod_t_prev) * pred_original_sample[i]
                   + pred_sample_direction[i];
}
```

UniPC requires:
- Small matrix linear system solver (can use simple Gaussian elimination)
- Small einsum (just a weighted sum, trivial to implement)
Complexity: Still easily doable in C++. You could:
- Implement a tiny 4x4 matrix solver (a few dozen lines; see the sketch after this list)
- Use a lightweight library like Eigen (header-only)
- Or skip UniPC entirely (not commonly used)
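For reference, a tiny Gaussian elimination routine sized for these n ≤ 4 systems might look like the following sketch (with partial pivoting; not taken from diffusers):

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch: solve A x = b by Gaussian elimination with partial pivoting.
// Intended for the tiny (n <= 4) systems used by high-order solvers like UniPC.
std::vector<float> solve_linear_system(std::vector<std::vector<float>> A, std::vector<float> b) {
    const std::size_t n = b.size();
    for (std::size_t col = 0; col < n; ++col) {
        // Pick the row with the largest pivot in this column.
        std::size_t pivot = col;
        for (std::size_t row = col + 1; row < n; ++row) {
            if (std::fabs(A[row][col]) > std::fabs(A[pivot][col])) pivot = row;
        }
        std::swap(A[col], A[pivot]);
        std::swap(b[col], b[pivot]);
        // Eliminate entries below the pivot.
        for (std::size_t row = col + 1; row < n; ++row) {
            const float factor = A[row][col] / A[col][col];
            for (std::size_t k = col; k < n; ++k) {
                A[row][k] -= factor * A[col][k];
            }
            b[row] -= factor * b[col];
        }
    }
    // Back substitution.
    std::vector<float> x(n);
    for (std::size_t i = n; i-- > 0;) {
        float sum = b[i];
        for (std::size_t k = i + 1; k < n; ++k) {
            sum -= A[i][k] * x[k];
        }
        x[i] = sum / A[i][i];
    }
    return x;
}
```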
Does this need a full tensor library? No! The tensor operations are on:
- Scalars: Single float values (sigma, alpha, timestep)
- Small tensors: Arrays of ~50-1000 timesteps/sigmas
- Image latents: The actual computation per pixel is trivial (multiply-add)
For a 512x512 image with 4 channels (latent space):
- Tensor size: 4 × 64 × 64 = 16,384 floats
- Operations: Simple multiply-adds
- This is negligible compared to the UNet inference (which runs the actual neural network)
- 99%+ of inference time: Running the UNet model
- <0.1%: Scheduler step() operations
Recommended porting order:
- Start with the simplest: Euler, DDIM
  - ~50-100 lines of C++ code
  - Just arithmetic operations
- Add DPMSolver++ next: the most popular advanced scheduler
  - ~200-300 lines of C++
  - Still just math functions
- Skip or do last: UniPC
  - Rarely used in practice
  - Requires small matrix solver
Required headers:

```cpp
#include <cmath>      // sqrt, exp, log, pow
#include <random>     // normal_distribution
#include <vector>     // or use raw arrays
#include <algorithm>  // optional, for utility functions
```

No PyTorch/LibTorch needed! These are trivial operations.
A scheduler class can be as simple as:

```cpp
class Scheduler {
    std::vector<float> alphas_cumprod;
    std::vector<float> timesteps;

    void step(float* sample, float* model_output,
              int timestep, float* output);
};
```
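A sketch of how step() might be filled in for a deterministic DDIM-style update (eta = 0), assuming the model output is the predicted noise and alphas_cumprod is indexed by timestep; this is illustrative, not the exact diffusers implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of a deterministic DDIM-style step (eta = 0) with an epsilon-prediction model.
void ddim_step(const std::vector<float>& alphas_cumprod,
               const float* sample, const float* model_output,
               int timestep, int prev_timestep,
               float* prev_sample, std::size_t sample_size) {
    const float alpha_prod_t = alphas_cumprod[timestep];
    const float alpha_prod_t_prev = prev_timestep >= 0 ? alphas_cumprod[prev_timestep] : 1.0f;

    const float sqrt_alpha_t = std::sqrt(alpha_prod_t);
    const float sqrt_one_minus_alpha_t = std::sqrt(1.0f - alpha_prod_t);
    const float sqrt_alpha_prev = std::sqrt(alpha_prod_t_prev);
    const float sqrt_one_minus_alpha_prev = std::sqrt(1.0f - alpha_prod_t_prev);

    for (std::size_t i = 0; i < sample_size; ++i) {
        // Reconstruct the predicted x_0, then re-noise it toward the previous timestep.
        const float pred_original =
            (sample[i] - sqrt_one_minus_alpha_t * model_output[i]) / sqrt_alpha_t;
        prev_sample[i] = sqrt_alpha_prev * pred_original
                       + sqrt_one_minus_alpha_prev * model_output[i];
    }
}
```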
The diffusers schedulers are PERFECT candidates for C++ porting! They use torch primarily for:
- Convenient array syntax
- Automatic GPU tensor management (not needed for schedulers)
- Element-wise operations that are trivial in C++
The "torch" dependency is more for convenience than necessity. You can reimplement everything with basic C++ standard library, and the performance will be identical (or better, without Python overhead).
The only scheduler with any complexity is UniPC (with its small matrix solve), and even that is manageable with ~50 lines of C++ or a tiny dependency like Eigen.