
Commit 2fba7db

Add dataset augmentation and reduction utilities for SINDY workflow
This PR implements the feature requested in #332, adding support for:

1. **Delay embedding** (`delay_embedding`): Create time-delay coordinates from data, useful for Takens' embedding theorem and HAVOK analysis
2. **Hankel matrix construction** (`hankel_matrix`): Build Hankel matrices from 1D time series, enabling Hankel DMD and related methods
3. **Dimensionality reduction** (`reduce_dimension`, `truncated_svd`): Project data onto principal components with automatic or manual rank selection

These utilities enable HAVOK-style workflows where:

- Time series data is embedded via delay coordinates or Hankel matrices
- SVD is used to identify dominant modes
- Sparse regression discovers dynamics in the reduced space

References:

- Brunton et al. (2017), "Chaos as an intermittently forced linear system"
- Arbabi & Mezic (2017), "Ergodic theory, dynamic mode decomposition, and computation of spectral properties of the Koopman operator"

Fixes #332

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent e10e728 commit 2fba7db
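
For orientation, here is a minimal, hypothetical sketch of the HAVOK-style workflow described in the commit message, using only the utilities exported by this commit; the signal, delay depth, and rank below are illustrative placeholders rather than values taken from this PR.

```julia
using DataDrivenDiffEq

# Placeholder scalar measurement, e.g. one observable of a chaotic system.
x = rand(1000)

# 1. Embed the time series as a 100-row Hankel matrix.
H = hankel_matrix(x, 100)      # 100×901

# 2. Identify the dominant SVD modes and project onto them.
V = reduce_dimension(H, 15)    # 15×901 reduced coordinates

# 3. Sparse regression (e.g. via a DataDrivenProblem) is then run on the
#    reduced coordinates to discover the dynamics.
```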

File tree

3 files changed: +396 -0 lines changed


src/DataDrivenDiffEq.jl

Lines changed: 4 additions & 0 deletions
@@ -119,6 +119,10 @@ export collocate_data
include("./utils/utils.jl")
export optimal_shrinkage, optimal_shrinkage!

+include("./utils/data_augmentation.jl")
+export hankel_matrix, delay_embedding
+export truncated_svd, reduce_dimension
+
include("./problem/type.jl")

export DataDrivenProblem

src/utils/data_augmentation.jl

Lines changed: 271 additions & 0 deletions
@@ -0,0 +1,271 @@
"""
Data augmentation and reduction utilities for the SINDY workflow.

This module provides functions for:
- Delay embedding (time-delay coordinates)
- Hankel matrix construction
- Dimensionality reduction via truncated SVD

These are useful for HAVOK analysis and handling chaotic systems where
time-delay embeddings can reveal underlying structure.

# References

- Brunton, S. L., Brunton, B. W., Proctor, J. L., Kaiser, E., & Kutz, J. N. (2017).
  Chaos as an intermittently forced linear system. Nature Communications, 8(1), 19.
  https://doi.org/10.1038/s41467-017-00030-8

- Arbabi, H., & Mezic, I. (2017). Ergodic theory, dynamic mode decomposition,
  and computation of spectral properties of the Koopman operator.
  SIAM Journal on Applied Dynamical Systems, 16(4), 2096-2126.
"""

"""
$(SIGNATURES)

Construct a Hankel matrix from a 1D time series vector.

Given a vector `x = [x₁, x₂, ..., xₙ]` and `num_rows` rows, constructs the Hankel matrix:

```
H = [x₁  x₂    x₃    ...  xₙ₋ₘ₊₁]
    [x₂  x₃    x₄    ...  xₙ₋ₘ₊₂]
    [x₃  x₄    x₅    ...  xₙ₋ₘ₊₃]
    [⋮   ⋮     ⋮     ⋱    ⋮     ]
    [xₘ  xₘ₊₁  xₘ₊₂  ...  xₙ    ]
```

where `m = num_rows`.

# Arguments
- `x`: Input vector (1D time series)
- `num_rows`: Number of rows in the Hankel matrix (number of delays + 1)

# Returns
- Hankel matrix of size `(num_rows, length(x) - num_rows + 1)`

# Example
```julia
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
H = hankel_matrix(x, 4)
# Returns:
# 4×7 Matrix:
#  1.0  2.0  3.0  4.0  5.0  6.0   7.0
#  2.0  3.0  4.0  5.0  6.0  7.0   8.0
#  3.0  4.0  5.0  6.0  7.0  8.0   9.0
#  4.0  5.0  6.0  7.0  8.0  9.0  10.0
```
"""
function hankel_matrix(x::AbstractVector, num_rows::Integer)
    n = length(x)
    @assert num_rows >= 1 "num_rows must be at least 1"
    @assert num_rows <= n "num_rows ($num_rows) cannot exceed length of x ($n)"

    num_cols = n - num_rows + 1
    H = similar(x, num_rows, num_cols)

    for j in 1:num_cols
        for i in 1:num_rows
            H[i, j] = x[i + j - 1]
        end
    end

    return H
end

"""
$(SIGNATURES)

Create delay-embedded coordinates from a data matrix.

Given data matrix `X` with states as rows and time samples as columns,
creates new coordinates by stacking time-delayed versions of each state.

For a single state `x = [x(t₁), x(t₂), ...]` with `num_delays` delays and
delay interval `τ`, the embedded coordinates are:
```
[x(t),      x(t+τ),       ...]
[x(t-τ),    x(t),         ...]
[x(t-2τ),   x(t-τ),       ...]
 ⋮          ⋮             ⋱
[x(t-mτ),   x(t-(m-1)τ),  ...]
```

where `m = num_delays`.

# Arguments
- `X`: Data matrix of size `(n_states, n_samples)`
- `num_delays`: Number of time delays to add (resulting in `num_delays + 1`
  copies of each state at different delays)

# Keyword Arguments
- `τ::Integer = 1`: Delay interval in samples (default: 1 sample)

# Returns
- Embedded data matrix of size `(n_states * (num_delays + 1), n_samples - num_delays * τ)`

# Example
```julia
# Single state variable
X = reshape(1.0:10.0, 1, 10)  # 1×10 matrix
X_embed = delay_embedding(X, 2)
# Creates 3×8 matrix with original and 2 delayed copies

# Multiple state variables
X = randn(3, 100)  # 3 states, 100 samples
X_embed = delay_embedding(X, 3, τ = 2)
# Creates 12×94 matrix (3 states × 4 time copies)
```

See also: [`hankel_matrix`](@ref), [`reduce_dimension`](@ref)
"""
function delay_embedding(X::AbstractMatrix, num_delays::Integer; τ::Integer = 1)
    n_states, n_samples = size(X)
    @assert num_delays >= 0 "num_delays must be non-negative"
    @assert τ >= 1 "delay interval τ must be at least 1"
    @assert n_samples > num_delays * τ "Not enough samples for requested delays"

    new_n_states = n_states * (num_delays + 1)
    new_n_samples = n_samples - num_delays * τ

    X_embedded = similar(X, new_n_states, new_n_samples)

    for d in 0:num_delays
        offset = num_delays * τ - d * τ
        for i in 1:n_states
            row_idx = d * n_states + i
            for j in 1:new_n_samples
                X_embedded[row_idx, j] = X[i, j + offset]
            end
        end
    end

    return X_embedded
end

"""
$(SIGNATURES)

Delay embedding for a 1D time series vector.

Convenience method that converts a vector to a 1-row matrix, applies
delay embedding, and returns the result.

# Arguments
- `x`: Input vector (1D time series)
- `num_delays`: Number of time delays to add

# Keyword Arguments
- `τ::Integer = 1`: Delay interval in samples

# Returns
- Embedded data matrix of size `(num_delays + 1, length(x) - num_delays * τ)`

# Example
```julia
x = collect(1.0:10.0)
X_embed = delay_embedding(x, 2)
# Returns 3×8 matrix
```
"""
function delay_embedding(x::AbstractVector, num_delays::Integer; τ::Integer = 1)
    X = reshape(x, 1, length(x))
    return delay_embedding(X, num_delays; τ = τ)
end

"""
$(SIGNATURES)

Compute a truncated singular value decomposition of data matrix `X`.

# Arguments
- `X`: Data matrix
- `rank`: Number of singular values/vectors to retain (values larger than
  `min(size(X))` are clamped)

# Returns
- Named tuple `(U = U_r, S = S_r, V = V_r)` where:
  - `U_r` is the first `rank` columns of U
  - `S_r` is a vector of the first `rank` singular values
  - `V_r` is the first `rank` columns of V

# Example
```julia
X = randn(100, 50)
result = truncated_svd(X, 5)
X_approx = result.U * Diagonal(result.S) * result.V'
```

See also: [`reduce_dimension`](@ref), [`optimal_shrinkage`](@ref)
"""
function truncated_svd(X::AbstractMatrix, rank::Integer)
    @assert rank >= 1 "rank must be at least 1"
    max_rank = minimum(size(X))
    rank = min(rank, max_rank)

    U, S, V = svd(X)
    return (U = U[:, 1:rank], S = S[1:rank], V = V[:, 1:rank])
end

"""
$(SIGNATURES)

Reduce the dimensionality of data by projecting onto the top singular vectors.

Projects data matrix `X` onto its first `rank` left singular vectors,
reducing the number of effective dimensions while preserving the most
important features.

# Arguments
- `X`: Data matrix of size `(n_features, n_samples)`
- `rank`: Number of dimensions to retain

# Returns
- Reduced data matrix of size `(rank, n_samples)`: the reduced coordinates
  `Y = U_r' * X`, where `U_r` contains the first `rank` left singular vectors

# Example
```julia
X = randn(100, 50)  # 100 features, 50 samples
X_reduced = reduce_dimension(X, 5)  # Reduce to 5 dimensions
# X_reduced is 5×50
```

See also: [`truncated_svd`](@ref), [`optimal_shrinkage`](@ref)
"""
function reduce_dimension(X::AbstractMatrix, rank::Integer)
    result = truncated_svd(X, rank)
    return result.U' * X
end

"""
$(SIGNATURES)

Reduce the dimensionality of data using automatic rank selection.

Uses the optimal singular value hard threshold from Gavish & Donoho (2014)
to automatically determine the number of significant singular values,
then projects onto those dimensions.

# Arguments
- `X`: Data matrix of size `(n_features, n_samples)`

# Returns
- Reduced data matrix with automatically selected dimensionality

# Example
```julia
X = randn(100, 3) * randn(3, 50) + 0.1 * randn(100, 50)  # Low-rank signal + noise
X_reduced = reduce_dimension(X)  # Automatically selects rank
```

See also: [`reduce_dimension(X, rank)`](@ref), [`optimal_shrinkage`](@ref)
"""
function reduce_dimension(X::AbstractMatrix)
    m, n = minimum(size(X)), maximum(size(X))
    U, S, V = svd(X)
    τ = optimal_svht(m, n)
    threshold = τ * median(S)
    rank = count(s -> s >= threshold, S)
    rank = max(rank, 1)  # Keep at least one dimension
    return U[:, 1:rank]' * X
end
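
As a quick cross-check (not part of the committed code, and assuming only the definitions above plus the standard library `LinearAlgebra`): the Hankel matrix and the scalar delay embedding contain the same shifted copies of the signal, just stacked in opposite row order, and either one can be fed to `truncated_svd` or `reduce_dimension`.

```julia
using DataDrivenDiffEq, LinearAlgebra

x = collect(1.0:10.0)

# Row 1 of H starts at the earliest sample; row 1 of E holds the most
# recent sample per column, so the two matrices are row-reversed copies.
H = hankel_matrix(x, 4)            # 4×7
E = delay_embedding(x, 3)          # 4×7
@assert H == reverse(E, dims = 1)

# Rank-2 factors of H and the corresponding reduced coordinates.
F = truncated_svd(H, 2)
H2 = F.U * Diagonal(F.S) * F.V'    # rank-2 reconstruction of H
Y = reduce_dimension(H, 2)         # 2×7 projection onto the leading modes
```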

0 commit comments
