Author: Tugbars Heptaskin
Last Updated: 2026-01-28
This is a highly-optimized implementation of the Savitzky-Golay filter for data smoothing and differentiation. The implementation achieves 4-6x speedup over naive approaches through a comprehensive optimization strategy while maintaining bit-exact compatibility with MATLAB's sgolayfilt function (differences on the order of 10⁻⁶).
- ✅ Production-ready performance: Multiple layers of optimization for real-world applications
- ✅ Embedded-safe: No heap allocation during filtering, static buffers only
- ✅ MATLAB-validated: Output matches MATLAB's Savitzky-Golay filter to floating-point precision
- ✅ Thread-safe: Filter objects are read-only after creation, safe for concurrent use
- ✅ Flexible: Batch processing, real-time streaming, and coefficient export modes
- ✅ Multiple boundary modes: Polynomial, reflect, periodic, and constant edge handling
| File | Purpose |
|---|---|
savgolFilter.c/h |
Batch processing — Create filter once, apply to arrays. Supports multiple boundary modes, strided access, in-place operation. |
savgol_stream.c/h |
Online streaming — Real-time sample-by-sample processing with fixed latency of half_window samples. |
savgol_export.c |
Coefficient export — CLI tool to generate C headers with precomputed weights for MCUs/FPGAs where runtime computation is too expensive. |
The Savitzky-Golay filter performs polynomial least-squares fitting on a sliding window of data points. For each window position, it fits a polynomial of degree m through 2n+1 points (where n is the half-window size), then evaluates the fitted polynomial (or its derivative) at a target point.
-
Gram Polynomials F(k,d)
- Orthogonal polynomials over discrete points [-n, ..., +n]
- Computed using a three-term recurrence relation
- The derivative order d allows computing smoothed derivatives
-
Generalized Factorial GenFact(a,b)
- Product: a × (a-1) × (a-2) × ... × (a-b+1)
- Used for normalization factors in Gram polynomials
-
Filter Weights
- Each data point gets a weight determined by Gram polynomial projections
- Weights are precomputed at filter creation and reused via convolution
-
Convolution
- Filtered value: y[j] = Σ(i=0 to 2n) w[i] × x[j-n+i]
This implementation employs five major optimization techniques for a cumulative 4-6x total speedup:
- Problem: Repeated expensive factorial-like computations
- Solution: Precompute all GenFact values into a 2D table at startup
- Trade-off: ~23KB memory for O(1) lookups instead of O(n) computation
- Impact: Eliminates repeated multiplications in weight calculation
- Problem: VLAs and heap allocation problematic for embedded systems
- Solution: Fixed-size stack buffers with compile-time known maximum sizes
- Benefits:
- No stack probing overhead
- Better cache locality
- Safe for resource-constrained environments
Multiple sub-optimizations:
- Rolling buffer approach: O(d) space instead of O(k×d)
- Strength reduction: Hoist divisions (1/n) out of loops
- Branch elimination: Separate d=0 case to avoid conditionals in hot paths
- Problem: Edge weights recomputed for every filter application
- Solution: All edge weights computed once at filter creation
- Key insight: Leading and trailing edges use same weights (different data traversal order)
- Problem: Serial dependency chains prevent parallel execution
- Solution: Four independent accumulation chains exploit CPU superscalar execution
- Benefits:
- Intel/AMD CPUs: 2 FMA units → 2 parallel operations
- Apple M-series: 4 FMA units → 4 parallel operations
- Pairwise reduction for numerical stability
| Parameter | Description | Constraints |
|---|---|---|
half_window |
Half-window size n. Window spans 2n+1 points. | 1 ≤ n ≤ 32 |
poly_order |
Polynomial degree m for least-squares fit. Higher values preserve more signal features. | m < 2n+1, typically 2-4 |
derivative |
Derivative order. 0=smoothing, 1=velocity, 2=acceleration. | d ≤ m |
time_step |
Time interval between samples (Δt). Scales derivative output by 1/Δt^d. | > 0 |
boundary |
Edge handling mode (see below). | — |
When the filter window extends beyond data boundaries, these modes determine how missing samples are handled:
| Mode | Description | Best For |
|---|---|---|
SAVGOL_BOUNDARY_POLYNOMIAL |
Fits asymmetric polynomials at edges. Default and highest quality. | General use, preserves signal features |
SAVGOL_BOUNDARY_REFLECT |
Mirrors data at boundary: [d₁,d₀ | d₀,d₁,d₂...] | Signals with zero slope at edges |
SAVGOL_BOUNDARY_PERIODIC |
Wraps data around: [...d_{n-1} | d₀,d₁...] | Periodic signals (e.g., circular data) |
SAVGOL_BOUNDARY_CONSTANT |
Extends edge value: [d₀,d₀ | d₀,d₁,d₂...] | Signals with constant edges |
The filter can compute smoothed derivatives by setting the derivative parameter:
| Value | Output | Typical Use |
|---|---|---|
| 0 | Smoothed signal | Noise reduction |
| 1 | First derivative × (1/Δt) | Velocity, rate of change |
| 2 | Second derivative × (1/Δt²) | Acceleration, curvature |
The time_step parameter ensures correct physical units in the output.
The batch API uses a create/apply/destroy pattern:
- Create:
savgol_create()allocates filter and precomputes all weights - Apply:
savgol_apply()filters data using precomputed weights (reusable) - Destroy:
savgol_destroy()frees resources
Additional functions:
savgol_apply_strided()— Filter a field within an array of structssavgol_apply_valid()— Output only where full window fits (no edge handling, shorter output)
Thread Safety: After creation, the filter object is read-only. Multiple threads can safely share one filter for concurrent filtering of different data.
For real-time sample-by-sample processing:
- Create:
savgol_stream_create()initializes streaming state - Push:
savgol_stream_push()orsavgol_stream_push_full()processes one sample - Flush:
savgol_stream_flush()outputs remaining samples at end of stream - Destroy:
savgol_stream_destroy()frees resources
Latency: Fixed delay of half_window samples after buffer fills.
Output Modes:
push()— Simple interface, skips leading edge samplespush_full()— Complete output including edge handling, total outputs = total inputs
Command-line tool for generating C headers with precomputed weights:
savgol_export -n <half_window> -m <poly_order> [-d <derivative>] [-o <output.h>] [-p <prefix>]
| Option | Description |
|---|---|
-n |
Half-window size (required) |
-m |
Polynomial order (required) |
-d |
Derivative order (default: 0) |
-o |
Output file (default: stdout) |
-p |
Symbol prefix (default: SAVGOL) |
Generated header includes:
- Configuration macros (
*_HALF_WINDOW,*_POLY_ORDER, etc.) - Center weight array (
*_CENTER_WEIGHTS[]) - Edge weight array (
*_EDGE_WEIGHTS[][]) - Inline apply function (
*_apply())
The filter applies different strategies for three regions:
-
Leading edge (first n points):
- Uses asymmetric windows with shifted target points
- Weights applied in reverse order to available data
- Ensures smooth transition to central region
-
Central region (points n to N-n-1):
- Full symmetric windows
- Optimal smoothing/differentiation
- Most computationally efficient
-
Trailing edge (last n points):
- Reuses leading edge weights (mirrored)
- Weights applied in forward order
- Maintains consistency with leading edge
The implementation preserves numerical accuracy through:
- Careful operation ordering in floating-point arithmetic
- Pairwise reduction in accumulation chains
- No premature rounding or truncation
- Validated against MATLAB reference implementation
| Component | Size |
|---|---|
| Filter object | ~8 KB (center + edge weights) |
| GenFact table | ~23 KB (shared, initialized once) |
| Stack per apply | < 1 KB |
| Streaming state | ~300 bytes + filter reference |
include/iterative/
├── savgolFilter.h # Batch filter API
└── savgol_stream.h # Streaming filter API
src/iterative/
├── savgolFilter.c # Batch implementation
├── savgol_stream.c # Streaming implementation
├── savgol_export.c # Coefficient export tool
└── CMakeLists.txt
test/iterative/
├── test_savgol.c # Batch filter tests
├── test_savgol_stream.c # Streaming tests
├── test_savgol_main.c # Benchmark
└── CMakeLists.txt
mkdir build && cd build
cmake ..
makeOptional: -DUSE_PARALLEL_SAVGOL=ON builds OpenMP-parallelized version.
- Signal smoothing: Remove noise while preserving signal features
- Numerical differentiation: Compute derivatives with noise suppression
- Feature extraction: Detect peaks, valleys, inflection points
- Trend analysis: Extract underlying trends from noisy data
- Real-time filtering: Streaming mode for online data processing
- Embedded systems: Export mode for resource-constrained targets
The implementation has been validated against MATLAB's sgolayfilt function:
- Accuracy: Differences on the order of 10⁻⁶ (floating-point precision limits)
- Test case: 350-point dataset, halfWindow=12, polynomialOrder=4
- Original paper: Savitzky, A.; Golay, M.J.E. (1964). "Smoothing and Differentiation of Data by Simplified Least Squares Procedures". Analytical Chemistry 36 (8): 1627–1639.
- MATLAB documentation: sgolayfilt
MIT License