Skip to content

High-performance Savitzky-Golay filter in C: batch, streaming, and 2D image processing. Embedded-friendly with coefficient export for MCUs. MATLAB-validated.

License

Notifications You must be signed in to change notification settings

Tugbars/Savitzky-Golay-Filter

Repository files navigation

Savitzky-Golay Filter Implementation in C

Author: Tugbars Heptaskin
Last Updated: 2026-01-28

Overview

This is a highly-optimized implementation of the Savitzky-Golay filter for data smoothing and differentiation. The implementation achieves 4-6x speedup over naive approaches through a comprehensive optimization strategy while maintaining bit-exact compatibility with MATLAB's sgolayfilt function (differences on the order of 10⁻⁶).

Key Features

  • Production-ready performance: Multiple layers of optimization for real-world applications
  • Embedded-safe: No heap allocation during filtering, static buffers only
  • MATLAB-validated: Output matches MATLAB's Savitzky-Golay filter to floating-point precision
  • Thread-safe: Filter objects are read-only after creation, safe for concurrent use
  • Flexible: Batch processing, real-time streaming, and coefficient export modes
  • Multiple boundary modes: Polynomial, reflect, periodic, and constant edge handling

Components

File Purpose
savgolFilter.c/h Batch processing — Create filter once, apply to arrays. Supports multiple boundary modes, strided access, in-place operation.
savgol_stream.c/h Online streaming — Real-time sample-by-sample processing with fixed latency of half_window samples.
savgol_export.c Coefficient export — CLI tool to generate C headers with precomputed weights for MCUs/FPGAs where runtime computation is too expensive.

Mathematical Foundation

The Savitzky-Golay filter performs polynomial least-squares fitting on a sliding window of data points. For each window position, it fits a polynomial of degree m through 2n+1 points (where n is the half-window size), then evaluates the fitted polynomial (or its derivative) at a target point.

Core Components

  1. Gram Polynomials F(k,d)

    • Orthogonal polynomials over discrete points [-n, ..., +n]
    • Computed using a three-term recurrence relation
    • The derivative order d allows computing smoothed derivatives
  2. Generalized Factorial GenFact(a,b)

    • Product: a × (a-1) × (a-2) × ... × (a-b+1)
    • Used for normalization factors in Gram polynomials
  3. Filter Weights

    • Each data point gets a weight determined by Gram polynomial projections
    • Weights are precomputed at filter creation and reused via convolution
  4. Convolution

    • Filtered value: y[j] = Σ(i=0 to 2n) w[i] × x[j-n+i]

Optimization Strategy

This implementation employs five major optimization techniques for a cumulative 4-6x total speedup:

1. GenFact Lookup Table (1.5-2x speedup)

  • Problem: Repeated expensive factorial-like computations
  • Solution: Precompute all GenFact values into a 2D table at startup
  • Trade-off: ~23KB memory for O(1) lookups instead of O(n) computation
  • Impact: Eliminates repeated multiplications in weight calculation

2. Static Allocation (embedded-safe)

  • Problem: VLAs and heap allocation problematic for embedded systems
  • Solution: Fixed-size stack buffers with compile-time known maximum sizes
  • Benefits:
    • No stack probing overhead
    • Better cache locality
    • Safe for resource-constrained environments

3. Gram Polynomial Optimization (2-3x speedup)

Multiple sub-optimizations:

  • Rolling buffer approach: O(d) space instead of O(k×d)
  • Strength reduction: Hoist divisions (1/n) out of loops
  • Branch elimination: Separate d=0 case to avoid conditionals in hot paths

4. Precomputed Edge Weights (1.5x amortized speedup)

  • Problem: Edge weights recomputed for every filter application
  • Solution: All edge weights computed once at filter creation
  • Key insight: Leading and trailing edges use same weights (different data traversal order)

5. ILP-Optimized Convolution (1.2-1.3x speedup)

  • Problem: Serial dependency chains prevent parallel execution
  • Solution: Four independent accumulation chains exploit CPU superscalar execution
  • Benefits:
    • Intel/AMD CPUs: 2 FMA units → 2 parallel operations
    • Apple M-series: 4 FMA units → 4 parallel operations
    • Pairwise reduction for numerical stability

Configuration Options

Filter Parameters

Parameter Description Constraints
half_window Half-window size n. Window spans 2n+1 points. 1 ≤ n ≤ 32
poly_order Polynomial degree m for least-squares fit. Higher values preserve more signal features. m < 2n+1, typically 2-4
derivative Derivative order. 0=smoothing, 1=velocity, 2=acceleration. d ≤ m
time_step Time interval between samples (Δt). Scales derivative output by 1/Δt^d. > 0
boundary Edge handling mode (see below).

Boundary Handling Modes

When the filter window extends beyond data boundaries, these modes determine how missing samples are handled:

Mode Description Best For
SAVGOL_BOUNDARY_POLYNOMIAL Fits asymmetric polynomials at edges. Default and highest quality. General use, preserves signal features
SAVGOL_BOUNDARY_REFLECT Mirrors data at boundary: [d₁,d₀ | d₀,d₁,d₂...] Signals with zero slope at edges
SAVGOL_BOUNDARY_PERIODIC Wraps data around: [...d_{n-1} | d₀,d₁...] Periodic signals (e.g., circular data)
SAVGOL_BOUNDARY_CONSTANT Extends edge value: [d₀,d₀ | d₀,d₁,d₂...] Signals with constant edges

Derivative Computation

The filter can compute smoothed derivatives by setting the derivative parameter:

Value Output Typical Use
0 Smoothed signal Noise reduction
1 First derivative × (1/Δt) Velocity, rate of change
2 Second derivative × (1/Δt²) Acceleration, curvature

The time_step parameter ensures correct physical units in the output.

API Overview

Batch Processing (savgolFilter.c/h)

The batch API uses a create/apply/destroy pattern:

  1. Create: savgol_create() allocates filter and precomputes all weights
  2. Apply: savgol_apply() filters data using precomputed weights (reusable)
  3. Destroy: savgol_destroy() frees resources

Additional functions:

  • savgol_apply_strided() — Filter a field within an array of structs
  • savgol_apply_valid() — Output only where full window fits (no edge handling, shorter output)

Thread Safety: After creation, the filter object is read-only. Multiple threads can safely share one filter for concurrent filtering of different data.

Streaming Processing (savgol_stream.c/h)

For real-time sample-by-sample processing:

  1. Create: savgol_stream_create() initializes streaming state
  2. Push: savgol_stream_push() or savgol_stream_push_full() processes one sample
  3. Flush: savgol_stream_flush() outputs remaining samples at end of stream
  4. Destroy: savgol_stream_destroy() frees resources

Latency: Fixed delay of half_window samples after buffer fills.

Output Modes:

  • push() — Simple interface, skips leading edge samples
  • push_full() — Complete output including edge handling, total outputs = total inputs

Coefficient Export (savgol_export.c)

Command-line tool for generating C headers with precomputed weights:

savgol_export -n <half_window> -m <poly_order> [-d <derivative>] [-o <output.h>] [-p <prefix>]
Option Description
-n Half-window size (required)
-m Polynomial order (required)
-d Derivative order (default: 0)
-o Output file (default: stdout)
-p Symbol prefix (default: SAVGOL)

Generated header includes:

  • Configuration macros (*_HALF_WINDOW, *_POLY_ORDER, etc.)
  • Center weight array (*_CENTER_WEIGHTS[])
  • Edge weight array (*_EDGE_WEIGHTS[][])
  • Inline apply function (*_apply())

Implementation Details

Edge Handling

The filter applies different strategies for three regions:

  1. Leading edge (first n points):

    • Uses asymmetric windows with shifted target points
    • Weights applied in reverse order to available data
    • Ensures smooth transition to central region
  2. Central region (points n to N-n-1):

    • Full symmetric windows
    • Optimal smoothing/differentiation
    • Most computationally efficient
  3. Trailing edge (last n points):

    • Reuses leading edge weights (mirrored)
    • Weights applied in forward order
    • Maintains consistency with leading edge

Numerical Stability

The implementation preserves numerical accuracy through:

  • Careful operation ordering in floating-point arithmetic
  • Pairwise reduction in accumulation chains
  • No premature rounding or truncation
  • Validated against MATLAB reference implementation

Memory Requirements

Component Size
Filter object ~8 KB (center + edge weights)
GenFact table ~23 KB (shared, initialized once)
Stack per apply < 1 KB
Streaming state ~300 bytes + filter reference

File Layout

include/iterative/
├── savgolFilter.h      # Batch filter API
└── savgol_stream.h     # Streaming filter API

src/iterative/
├── savgolFilter.c      # Batch implementation
├── savgol_stream.c     # Streaming implementation
├── savgol_export.c     # Coefficient export tool
└── CMakeLists.txt

test/iterative/
├── test_savgol.c       # Batch filter tests
├── test_savgol_stream.c # Streaming tests
├── test_savgol_main.c  # Benchmark
└── CMakeLists.txt

Build

mkdir build && cd build
cmake ..
make

Optional: -DUSE_PARALLEL_SAVGOL=ON builds OpenMP-parallelized version.

Typical Applications

  • Signal smoothing: Remove noise while preserving signal features
  • Numerical differentiation: Compute derivatives with noise suppression
  • Feature extraction: Detect peaks, valleys, inflection points
  • Trend analysis: Extract underlying trends from noisy data
  • Real-time filtering: Streaming mode for online data processing
  • Embedded systems: Export mode for resource-constrained targets

Performance Validation

The implementation has been validated against MATLAB's sgolayfilt function:

  • Accuracy: Differences on the order of 10⁻⁶ (floating-point precision limits)
  • Test case: 350-point dataset, halfWindow=12, polynomialOrder=4

References

  • Original paper: Savitzky, A.; Golay, M.J.E. (1964). "Smoothing and Differentiation of Data by Simplified Least Squares Procedures". Analytical Chemistry 36 (8): 1627–1639.
  • MATLAB documentation: sgolayfilt

License

MIT License

About

High-performance Savitzky-Golay filter in C: batch, streaming, and 2D image processing. Embedded-friendly with coefficient export for MCUs. MATLAB-validated.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published