Savitzky-Golay Filter Implementation in C

Author: Tugbars Heptaskin
Last Updated: 2026-01-28

Overview

This is a highly-optimized implementation of the Savitzky-Golay filter for data smoothing and differentiation. The implementation achieves 4-6x speedup over naive approaches through a comprehensive optimization strategy while maintaining bit-exact compatibility with MATLAB's sgolayfilt function (differences on the order of 10⁻⁶).

Key Features

✅ Production-ready performance: Multiple layers of optimization for real-world applications
✅ Embedded-safe: No heap allocation during filtering, static buffers only
✅ MATLAB-validated: Output matches MATLAB's Savitzky-Golay filter to floating-point precision
✅ Thread-safe: Filter objects are read-only after creation, safe for concurrent use
✅ Flexible: Batch processing, real-time streaming, and coefficient export modes
✅ Multiple boundary modes: Polynomial, reflect, periodic, and constant edge handling

Components

File	Purpose
`savgolFilter.c/h`	Batch processing — Create filter once, apply to arrays. Supports multiple boundary modes, strided access, in-place operation.
`savgol_stream.c/h`	Online streaming — Real-time sample-by-sample processing with fixed latency of `half_window` samples.
`savgol_export.c`	Coefficient export — CLI tool to generate C headers with precomputed weights for MCUs/FPGAs where runtime computation is too expensive.

Mathematical Foundation

The Savitzky-Golay filter performs polynomial least-squares fitting on a sliding window of data points. For each window position, it fits a polynomial of degree m through 2n+1 points (where n is the half-window size), then evaluates the fitted polynomial (or its derivative) at a target point.

Core Components

Gram Polynomials F(k,d)
- Orthogonal polynomials over discrete points [-n, ..., +n]
- Computed using a three-term recurrence relation
- The derivative order d allows computing smoothed derivatives
Generalized Factorial GenFact(a,b)
- Product: a × (a-1) × (a-2) × ... × (a-b+1)
- Used for normalization factors in Gram polynomials
Filter Weights
- Each data point gets a weight determined by Gram polynomial projections
- Weights are precomputed at filter creation and reused via convolution
Convolution
- Filtered value: y[j] = Σ(i=0 to 2n) w[i] × x[j-n+i]

Optimization Strategy

This implementation employs five major optimization techniques for a cumulative 4-6x total speedup:

1. GenFact Lookup Table (1.5-2x speedup)

Problem: Repeated expensive factorial-like computations
Solution: Precompute all GenFact values into a 2D table at startup
Trade-off: ~23KB memory for O(1) lookups instead of O(n) computation
Impact: Eliminates repeated multiplications in weight calculation

2. Static Allocation (embedded-safe)

Problem: VLAs and heap allocation problematic for embedded systems
Solution: Fixed-size stack buffers with compile-time known maximum sizes
Benefits:
- No stack probing overhead
- Better cache locality
- Safe for resource-constrained environments

3. Gram Polynomial Optimization (2-3x speedup)

Multiple sub-optimizations:

Rolling buffer approach: O(d) space instead of O(k×d)
Strength reduction: Hoist divisions (1/n) out of loops
Branch elimination: Separate d=0 case to avoid conditionals in hot paths

4. Precomputed Edge Weights (1.5x amortized speedup)

Problem: Edge weights recomputed for every filter application
Solution: All edge weights computed once at filter creation
Key insight: Leading and trailing edges use same weights (different data traversal order)

5. ILP-Optimized Convolution (1.2-1.3x speedup)

Problem: Serial dependency chains prevent parallel execution
Solution: Four independent accumulation chains exploit CPU superscalar execution
Benefits:
- Intel/AMD CPUs: 2 FMA units → 2 parallel operations
- Apple M-series: 4 FMA units → 4 parallel operations
- Pairwise reduction for numerical stability

Configuration Options

Filter Parameters

Parameter	Description	Constraints
`half_window`	Half-window size n. Window spans 2n+1 points.	1 ≤ n ≤ 32
`poly_order`	Polynomial degree m for least-squares fit. Higher values preserve more signal features.	m < 2n+1, typically 2-4
`derivative`	Derivative order. 0=smoothing, 1=velocity, 2=acceleration.	d ≤ m
`time_step`	Time interval between samples (Δt). Scales derivative output by 1/Δt^d.	> 0
`boundary`	Edge handling mode (see below).	—

Boundary Handling Modes

When the filter window extends beyond data boundaries, these modes determine how missing samples are handled:

Mode	Description	Best For
`SAVGOL_BOUNDARY_POLYNOMIAL`	Fits asymmetric polynomials at edges. Default and highest quality.	General use, preserves signal features
`SAVGOL_BOUNDARY_REFLECT`	Mirrors data at boundary: [d₁,d₀ \| d₀,d₁,d₂...]	Signals with zero slope at edges
`SAVGOL_BOUNDARY_PERIODIC`	Wraps data around: [...d_{n-1} \| d₀,d₁...]	Periodic signals (e.g., circular data)
`SAVGOL_BOUNDARY_CONSTANT`	Extends edge value: [d₀,d₀ \| d₀,d₁,d₂...]	Signals with constant edges

Derivative Computation

The filter can compute smoothed derivatives by setting the derivative parameter:

Value	Output	Typical Use
0	Smoothed signal	Noise reduction
1	First derivative × (1/Δt)	Velocity, rate of change
2	Second derivative × (1/Δt²)	Acceleration, curvature

The time_step parameter ensures correct physical units in the output.

API Overview

Batch Processing (`savgolFilter.c/h`)

The batch API uses a create/apply/destroy pattern:

Create: savgol_create() allocates filter and precomputes all weights
Apply: savgol_apply() filters data using precomputed weights (reusable)
Destroy: savgol_destroy() frees resources

Additional functions:

savgol_apply_strided() — Filter a field within an array of structs
savgol_apply_valid() — Output only where full window fits (no edge handling, shorter output)

Thread Safety: After creation, the filter object is read-only. Multiple threads can safely share one filter for concurrent filtering of different data.

Streaming Processing (`savgol_stream.c/h`)

For real-time sample-by-sample processing:

Create: savgol_stream_create() initializes streaming state
Push: savgol_stream_push() or savgol_stream_push_full() processes one sample
Flush: savgol_stream_flush() outputs remaining samples at end of stream
Destroy: savgol_stream_destroy() frees resources

Latency: Fixed delay of half_window samples after buffer fills.

Output Modes:

push() — Simple interface, skips leading edge samples
push_full() — Complete output including edge handling, total outputs = total inputs

Coefficient Export (`savgol_export.c`)

Command-line tool for generating C headers with precomputed weights:

savgol_export -n <half_window> -m <poly_order> [-d <derivative>] [-o <output.h>] [-p <prefix>]

Option	Description
`-n`	Half-window size (required)
`-m`	Polynomial order (required)
`-d`	Derivative order (default: 0)
`-o`	Output file (default: stdout)
`-p`	Symbol prefix (default: SAVGOL)

Generated header includes:

Configuration macros (*_HALF_WINDOW, *_POLY_ORDER, etc.)
Center weight array (*_CENTER_WEIGHTS[])
Edge weight array (*_EDGE_WEIGHTS[][])
Inline apply function (*_apply())

Implementation Details

Edge Handling

The filter applies different strategies for three regions:

Leading edge (first n points):
- Uses asymmetric windows with shifted target points
- Weights applied in reverse order to available data
- Ensures smooth transition to central region
Central region (points n to N-n-1):
- Full symmetric windows
- Optimal smoothing/differentiation
- Most computationally efficient
Trailing edge (last n points):
- Reuses leading edge weights (mirrored)
- Weights applied in forward order
- Maintains consistency with leading edge

Numerical Stability

The implementation preserves numerical accuracy through:

Careful operation ordering in floating-point arithmetic
Pairwise reduction in accumulation chains
No premature rounding or truncation
Validated against MATLAB reference implementation

Memory Requirements

Component	Size
Filter object	~8 KB (center + edge weights)
GenFact table	~23 KB (shared, initialized once)
Stack per apply	< 1 KB
Streaming state	~300 bytes + filter reference

File Layout

include/iterative/
├── savgolFilter.h      # Batch filter API
└── savgol_stream.h     # Streaming filter API

src/iterative/
├── savgolFilter.c      # Batch implementation
├── savgol_stream.c     # Streaming implementation
├── savgol_export.c     # Coefficient export tool
└── CMakeLists.txt

test/iterative/
├── test_savgol.c       # Batch filter tests
├── test_savgol_stream.c # Streaming tests
├── test_savgol_main.c  # Benchmark
└── CMakeLists.txt

Build

mkdir build && cd build
cmake ..
make

Optional: -DUSE_PARALLEL_SAVGOL=ON builds OpenMP-parallelized version.

Typical Applications

Signal smoothing: Remove noise while preserving signal features
Numerical differentiation: Compute derivatives with noise suppression
Feature extraction: Detect peaks, valleys, inflection points
Trend analysis: Extract underlying trends from noisy data
Real-time filtering: Streaming mode for online data processing
Embedded systems: Export mode for resource-constrained targets

Performance Validation

The implementation has been validated against MATLAB's sgolayfilt function:

Accuracy: Differences on the order of 10⁻⁶ (floating-point precision limits)
Test case: 350-point dataset, halfWindow=12, polynomialOrder=4

References

Original paper: Savitzky, A.; Golay, M.J.E. (1964). "Smoothing and Differentiation of Data by Simplified Least Squares Procedures". Analytical Chemistry 36 (8): 1627–1639.
MATLAB documentation: sgolayfilt

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 75 Commits
.vscode		.vscode
include/iterative		include/iterative
src		src
test/iterative		test/iterative
tool for matlab comparisons		tool for matlab comparisons
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Savitzky-Golay Filter Implementation in C

Overview

Key Features

Components

Mathematical Foundation

Core Components

Optimization Strategy

1. GenFact Lookup Table (1.5-2x speedup)

2. Static Allocation (embedded-safe)

3. Gram Polynomial Optimization (2-3x speedup)

4. Precomputed Edge Weights (1.5x amortized speedup)

5. ILP-Optimized Convolution (1.2-1.3x speedup)

Configuration Options

Filter Parameters

Boundary Handling Modes

Derivative Computation

API Overview

Batch Processing (`savgolFilter.c/h`)

Streaming Processing (`savgol_stream.c/h`)

Coefficient Export (`savgol_export.c`)

Implementation Details

Edge Handling

Numerical Stability

Memory Requirements

File Layout

Build

Typical Applications

Performance Validation

References

License

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

Tugbars/Savitzky-Golay-Filter

Folders and files

Latest commit

History

Repository files navigation

Savitzky-Golay Filter Implementation in C

Overview

Key Features

Components

Mathematical Foundation

Core Components

Optimization Strategy

1. GenFact Lookup Table (1.5-2x speedup)

2. Static Allocation (embedded-safe)

3. Gram Polynomial Optimization (2-3x speedup)

4. Precomputed Edge Weights (1.5x amortized speedup)

5. ILP-Optimized Convolution (1.2-1.3x speedup)

Configuration Options

Filter Parameters

Boundary Handling Modes

Derivative Computation

API Overview

Batch Processing (savgolFilter.c/h)

Streaming Processing (savgol_stream.c/h)

Coefficient Export (savgol_export.c)

Implementation Details

Edge Handling

Numerical Stability

Memory Requirements

File Layout

Build

Typical Applications

Performance Validation

References

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Batch Processing (`savgolFilter.c/h`)

Streaming Processing (`savgol_stream.c/h`)

Coefficient Export (`savgol_export.c`)

Packages