
Commit 07fed4e

Ken Lippold authored and committed
Updated ETL README to include aggregation step
1 parent 88ebb61 commit 07fed4e

1 file changed: 76 additions, 5 deletions

src/hydroserverpy/etl/README.md

@@ -115,8 +115,8 @@ transformer = CSVTransformer(
 
 | `timezone_type` | Behaviour | Requires |
 |---|---|---|
-| `"utc"` (default) | Treats all timestamps as UTC. ||
-| `"embedded"` | Reads timezone offset from the timestamp string itself. Falls back to UTC if the timestamps are naive. ||
+| `None` (default) | Reads timezone offset from the timestamp string itself. Falls back to UTC if the timestamps are naive. ||
+| `"utc"` | Treats all timestamps as UTC. ||
 | `"offset"` | Treats timestamps as naive and applies a fixed UTC offset. Strips any embedded offset if present. | `timezone` in `±HHMM` or `±HH:MM` format |
 | `"iana"` | Treats timestamps as naive and applies a named IANA timezone. Strips any embedded offset if present. | `timezone` as a valid IANA name |
 
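The four `timezone_type` modes in the table above can be illustrated with a small standard-library sketch. This is not hydroserverpy's implementation: `parse_offset` and `interpret_timestamp` are hypothetical helpers written only to mirror the documented behaviour, and the `"utc"` branch assumes that an embedded offset is discarded rather than converted, which the table does not state explicitly.

```python
import re
from datetime import datetime, timedelta, timezone as dt_timezone
from zoneinfo import ZoneInfo

# Accepts "+HHMM", "-HHMM", "+HH:MM" or "-HH:MM".
_OFFSET_RE = re.compile(r"^([+-])(\d{2}):?(\d{2})$")


def parse_offset(text):
    """Turn a fixed-offset string into a datetime.timezone (hypothetical helper)."""
    match = _OFFSET_RE.match(text)
    if match is None:
        raise ValueError(f"expected +/-HHMM or +/-HH:MM, got {text!r}")
    sign = 1 if match.group(1) == "+" else -1
    delta = timedelta(hours=int(match.group(2)), minutes=int(match.group(3)))
    return dt_timezone(sign * delta)


def interpret_timestamp(text, timezone_type=None, timezone=None):
    """Return an aware datetime following the documented table (sketch only)."""
    parsed = datetime.fromisoformat(text)
    if timezone_type is None:
        # Keep an embedded offset; fall back to UTC for naive timestamps.
        if parsed.tzinfo is not None:
            return parsed
        return parsed.replace(tzinfo=dt_timezone.utc)
    # Every other mode treats the wall-clock value as naive.
    naive = parsed.replace(tzinfo=None)
    if timezone_type == "utc":
        # Assumption: any embedded offset is dropped, not converted.
        return naive.replace(tzinfo=dt_timezone.utc)
    if timezone_type == "offset":
        return naive.replace(tzinfo=parse_offset(timezone))
    if timezone_type == "iana":
        return naive.replace(tzinfo=ZoneInfo(timezone))
    raise ValueError(f"unknown timezone_type: {timezone_type!r}")
```

For example, `interpret_timestamp("2024-01-15T08:30:00-07:00", "offset", "+05:30")` keeps the wall-clock time `08:30` and attaches `+05:30`, while the same call without a `timezone_type` keeps the embedded `-07:00`.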

@@ -135,10 +135,10 @@ transformer = CSVTransformer(
     timezone="America/Denver",
 )
 
-# Embedded offset: timestamps include their own offset, e.g. "2024-01-15T08:30:00-07:00"
+# Embedded offsets: timestamps include their own offset, e.g. "2024-01-15T08:30:00-07:00"
+# Omit timezone_type (or set it to None) to read offsets from the timestamps directly.
 transformer = CSVTransformer(
     timestamp_key="datetime",
-    timezone_type="embedded",
 )
 ```
 

@@ -209,6 +209,77 @@ ETLTargetPath(
 
 Operations are applied in order. The output of each operation becomes the input of the next.
 
+### Temporal Aggregation
+
+Temporal aggregation is an optional step that reduces the per-observation DataFrame produced by the transformer into period-level summaries before loading. When configured, the same aggregation is applied uniformly to every target series in the pipeline.
+
+```python
+from hydroserverpy.etl.models import TemporalAggregation
+
+aggregation = TemporalAggregation(
+    aggregation_statistic="simple_mean",
+    aggregation_interval=1,
+    aggregation_interval_unit="day",
+)
+```
+
+Pass it to the transformer at construction time:
+
+```python
+transformer = CSVTransformer(
+    timestamp_key="datetime",
+    temporal_aggregation=aggregation,
+)
+```
+
+#### Aggregation statistic
+
+| `aggregation_statistic` | Behaviour |
+|---|---|
+| `"simple_mean"` | Arithmetic mean of all observations within the window. |
+| `"time_weighted_mean"` | Mean weighted by the time between observations, computed via trapezoidal integration. Values at window boundaries are estimated by linear interpolation from the nearest surrounding observations. |
+| `"last_value_of_period"` | The last observation within the window. |
+
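To make the difference between `"simple_mean"` and `"time_weighted_mean"` concrete, here is a minimal pure-Python sketch of a trapezoidal, time-weighted mean with linearly interpolated boundary values. It is an illustration of the technique described in the table, not the library's code: the function names are hypothetical, timestamps are simplified to plain numbers (hours), and the input is assumed to be sorted with at least one observation inside the window.

```python
def interpolate(t, p0, p1):
    """Linearly interpolate the value at time t between points p0 and p1."""
    (t0, v0), (t1, v1) = p0, p1
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def time_weighted_mean(points, start, end):
    """Trapezoidal mean of sorted (time, value) points over [start, end]."""
    inside = [p for p in points if start <= p[0] <= end]
    if not inside:
        raise ValueError("no observations inside the window")
    before = [p for p in points if p[0] < start]
    after = [p for p in points if p[0] > end]

    path = list(inside)
    # Estimate the value at each window boundary from the surrounding points.
    if before and path[0][0] > start:
        path.insert(0, (start, interpolate(start, before[-1], inside[0])))
    if after and path[-1][0] < end:
        path.append((end, interpolate(end, inside[-1], after[0])))

    if len(path) == 1:
        return path[0][1]
    # Sum the trapezoid areas, then divide by the covered duration.
    area = sum(
        (t1 - t0) * (v0 + v1) / 2
        for (t0, v0), (t1, v1) in zip(path, path[1:])
    )
    return area / (path[-1][0] - path[0][0])


# Observations at hours -6, 6, 18 and 30; the window covers hours 0 to 24.
observations = [(-6, 0.0), (6, 12.0), (18, 12.0), (30, 0.0)]
print(time_weighted_mean(observations, 0, 24))  # 10.5
```

In this example the two observations inside the window are both `12.0`, so a simple mean would report `12.0`. The time-weighted mean is lower (`10.5`) because the interpolated boundary values of `6.0` pull the integral down.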
+#### Aggregation interval
+
+`aggregation_interval` (integer, default `1`) and `aggregation_interval_unit` (currently `"day"`) together define the window width. An `aggregation_interval` of `3` with unit `"day"` produces 3-day windows.
+
+#### Timezone
+
+Window boundaries are aligned to local midnight in the configured timezone. The timezone fields follow the same conventions as the transformer timestamp configuration, with `None` (the default) falling back to UTC-day boundaries.
+
+| `timezone_type` | Window boundary alignment | Requires |
+|---|---|---|
+| `None` (default) | UTC midnight ||
+| `"utc"` | UTC midnight ||
+| `"offset"` | Local midnight at a fixed UTC offset | `timezone` in `±HHMM` or `±HH:MM` format |
+| `"iana"` | Local midnight in a named timezone, handling DST automatically | `timezone` as a valid IANA name |
+
+```python
+# Daily windows aligned to US Mountain Time (UTC-7 standard, UTC-6 during DST)
+aggregation = TemporalAggregation(
+    aggregation_statistic="simple_mean",
+    aggregation_interval=1,
+    aggregation_interval_unit="day",
+    timezone_type="iana",
+    timezone="America/Denver",
+)
+
+# Daily windows at a fixed offset (no DST adjustment)
+aggregation = TemporalAggregation(
+    aggregation_statistic="time_weighted_mean",
+    aggregation_interval=1,
+    aggregation_interval_unit="day",
+    timezone_type="offset",
+    timezone="-0700",
+)
+```
+
+**Window boundary semantics:** Windows run from the local midnight that starts the day of the first observation to the local midnight that starts the day of the last observation. That final midnight is an exclusive upper boundary, so observations on the last local day are not aggregated. To include a given period, make sure your source data extends past it: the last observation must fall on a later local day than the final window you want in the output.
+
+Days with no observations are omitted from the output rather than filled with null values.
+
+
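The boundary rules above can be sketched with the standard library alone. The helper below is a hypothetical illustration of the documented behaviour for one-day windows and `"simple_mean"`, not the code hydroserverpy runs. It uses a fixed `-07:00` offset so the example does not depend on an installed timezone database; a `zoneinfo.ZoneInfo` object could be passed as `tz` instead.

```python
from datetime import datetime, timedelta, timezone
from statistics import fmean


def daily_means(observations, tz):
    """Mean per local day for sorted (aware datetime, value) pairs (sketch)."""
    local = [(ts.astimezone(tz), value) for ts, value in observations]

    def midnight(ts):
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)

    start = midnight(local[0][0])
    # Exclusive upper boundary: the midnight that starts the last observed day.
    stop = midnight(local[-1][0])

    results = []
    while start < stop:
        end = start + timedelta(days=1)
        values = [value for ts, value in local if start <= ts < end]
        if values:  # days without observations are omitted, not null-filled
            results.append((start, fmean(values)))
        start = end
    return results


mountain_standard = timezone(timedelta(hours=-7))
utc = timezone.utc
observations = [
    (datetime(2024, 1, 15, 8, 0, tzinfo=utc), 1.0),   # 01:00 local, Jan 15
    (datetime(2024, 1, 15, 20, 0, tzinfo=utc), 3.0),  # 13:00 local, Jan 15
    (datetime(2024, 1, 17, 6, 30, tzinfo=utc), 5.0),  # 23:30 local, Jan 16
    (datetime(2024, 1, 18, 10, 0, tzinfo=utc), 9.0),  # 03:00 local, Jan 18
]
for window_start, mean in daily_means(observations, mountain_standard):
    print(window_start.isoformat(), mean)
```

With this input the sketch yields windows for January 15 (mean `2.0`) and January 16 (mean `5.0`). January 17 has no observations and is omitted, and the January 18 observation only marks the exclusive upper boundary, so its value is not aggregated.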
 ### Loader
 
 ```python
@@ -349,4 +420,4 @@ for target_id, target in context.results.target_results.items():
 | Error | Likely cause |
 |---|---|
 | `Missing datastream IDs: ...` | One or more target datastream UUIDs don't exist on the HydroServer instance |
-| `HydroServer loader failed to retrieve datastream` | A network or authentication error occurred while looking up a datastream |
+| `HydroServer loader failed to retrieve datastream` | A network or authentication error occurred while looking up a datastream |
