-
Notifications
You must be signed in to change notification settings - Fork 20
Improve resampler structure and performance #1242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tests should try to be as close as possible as real code. Signed-off-by: Leandro Lucarella <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR reorganizes the resampling module into smaller submodules, switches internal sample storage from Sample objects to (timestamp, value) tuples for performance, and updates the default resampling function to use statistics.fmean. It also exposes new types, converts ResamplingFunction to a Protocol, and adjusts imports, tests, docs, and release notes to match the refactoring.
- Split
_resamplinginto_base_types,_config,_exceptions, and_resamplersubmodules - Changed internal buffer to store
(datetime, float)tuples and updated tests accordingly - Replaced custom
averagewithstatistics.fmean, exposedResamplingFunctionandSourceProperties, and updated documentation
Reviewed Changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/timeseries/test_resampling.py | Updated imports, added astuple helper, adapted tests to tuple samples |
| tests/timeseries/test_periodic_feature_extractor.py | Changed UNIX_EPOCH import to frequenz.core.datetime |
| tests/timeseries/test_moving_window.py | Adjusted imports for UNIX_EPOCH, MovingWindow, ResamplerConfig |
| tests/microgrid/test_datapipeline.py | Updated ResamplerConfig import path |
| src/frequenz/sdk/timeseries/_ringbuffer/buffer.py | Removed base_types UNIX_EPOCH, added core import |
| src/frequenz/sdk/timeseries/_resampling/_resampler.py | Buffer now holds tuples; methods updated |
| src/frequenz/sdk/timeseries/_resampling/_exceptions.py | New module for resampling errors |
| src/frequenz/sdk/timeseries/_resampling/_config.py | Config moved to its own submodule, uses Protocol |
| src/frequenz/sdk/timeseries/_resampling/_base_types.py | Defined Sink, Source, SourceProperties types |
| src/frequenz/sdk/timeseries/_resampling/init.py | Added package init |
| src/frequenz/sdk/timeseries/_moving_window.py | Updated imports to new submodule structure |
| src/frequenz/sdk/timeseries/_base_types.py | Removed deprecated UNIX_EPOCH constant |
| src/frequenz/sdk/timeseries/init.py | Revised exports and doc string |
| src/frequenz/sdk/microgrid/_resampling.py | Adjusted imports for resampler components |
| src/frequenz/sdk/actor/init.py | Updated ResamplerConfig import path |
| mkdocs.yml | Bumped external doc URLs to latest versions |
| benchmarks/timeseries/resampling.py | Benchmarks updated for tuple-based samples |
| RELEASE_NOTES.md | Added notes on value type change and UNIX_EPOCH removal |
Future changes require adding more code to this module, which is already quite big. This commit makes it a package and split the module into a few sub-modules to make it more manageable. Signed-off-by: Leandro Lucarella <[email protected]>
`ResamplingFunction` is used by `ResamplerConfig`, which is public, and `SourceProperties` is used by `ResamplingFunction`, so both should be public too. Signed-off-by: Leandro Lucarella <[email protected]>
Using a type alias forces us to use strings to specify types because Python can't use forward declarations for statements. Also using a protocol allows for more control, as even when we don't use it for now, it will also check function argument names if `/` is not used. Signed-off-by: Leandro Lucarella <[email protected]>
ca7af9d to
a1aff5c
Compare
Also use `disable-next` instead of disabling blocks, as it is less error prone. Signed-off-by: Leandro Lucarella <[email protected]>
`Sample` can have `None` as value, but the resampler buffer (and function) never get any `None` value stored/passed. The reason is saving a sample with no value provides no extra information and would make the resampling function implementation more complicated, as it would need to check values for `None`. Currently, the resampling function will never receive a `None` value, but it still needs to check for them for type-checking reasons. This commit uses a `tuple[datetime, Quantity]` to store samples in the buffer and pass it to the resampling function. This way it is trivially clear that values can't be `None` in this context. Signed-off-by: Leandro Lucarella <[email protected]>
a1aff5c to
be4311e
Compare
benchmarks/timeseries/resampling.py
Outdated
|
|
||
| def nop( # pylint: disable=unused-argument | ||
| samples: Sequence[Sample[Quantity]], | ||
| samples: Sequence[tuple[datetime, Quantity]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if a NoNoneSample type or so might be a nice abstraction here. I can imagine there are actually many places that would enjoy not having to check for None after they initially received the data. I can certainly use it in FCR as well.
For now its fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, in any case here we want to get rid of even the Quantity, we shouldn't really use Sample either, at least until we can use Sample[float].
If we go with the None/not None sample, maybe it can be OptionalSample (None values allowed) and Sample (can't have a None value) to follow Python's old Optional as an alias of | None. Another option, which I'm not sure if it is doable or not, is Sample[Power] vs Sample[Power | None]. I have the feeling we tried to make that work and failed, but maybe I'm just remembering wrongly.
tests/timeseries/test_resampling.py
Outdated
| await chan.close() | ||
|
|
||
|
|
||
| def astuple(sample: Sample[Quantity]) -> tuple[datetime, float]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
astuple splitted into the wrong commit I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this appears to be an unused implementation. you're using it from dataclasses somewhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, damn, I will have a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yeah, the commits are correct. I first used the "standard" dataclasses.astuple and then changed the implementation to convert to float too. At first I used a different name for the function, but then thought re-using astuple could make the diff smaller and, at the end, we want to convert to a tuple too, but maybe it is just too confusing, so I can rename the custom astuple to something less confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. astuple is added one commit later than the one that uses it, but not very important imo.
tests/timeseries/test_resampling.py
Outdated
| await chan.close() | ||
|
|
||
|
|
||
| def astuple(sample: Sample[Quantity]) -> tuple[datetime, float]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this appears to be an unused implementation. you're using it from dataclasses somewhere else.
|
|
||
| resampling_function: ResamplingFunction = average | ||
| resampling_function: ResamplingFunction = lambda samples, _, __: statistics.fmean( | ||
| [s[1] for s in samples] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to the docs this can be an iterable: https://docs.python.org/3/library/statistics.html#statistics.fmean
| [s[1] for s in samples] | |
| s[1] for s in samples |
|
Updated with fixup commits. |
When resampling we know we are working always with the same units, so using a `Quantity` doesn't add any advantages to plain `float`s. Instead, it makes the resampler slower, as it needs to do an extra lookup each time a buffer value is needed and potentially avoids the need to create `Quantity`s from the inputs. The performance can be improved by about 25% (for resamples=1000 samples=1000, before 3.7s, now 2.8s) according to the resampling benchmark. Signed-off-by: Leandro Lucarella <[email protected]>
The `average` function is just a mean function and Python already offers a function to calculate a mean, so we use that instead. We also wrap `fmean` to match the `ResamplingFunction` signature inline so it is clear in the documentation what the default resampling function does. Signed-off-by: Leandro Lucarella <[email protected]>
This reduces the benchmark time by about 4x. Signed-off-by: Leandro Lucarella <[email protected]>
Also add the missing docs cross-reference for the core library. Signed-off-by: Leandro Lucarella <[email protected]>
We know the latest version will still have all the features we need, so there is no need to target a particular minor. Signed-off-by: Leandro Lucarella <[email protected]>
Signed-off-by: Leandro Lucarella <[email protected]>
Signed-off-by: Leandro Lucarella <[email protected]>
5dc64e8 to
da45456
Compare
| """ | ||
|
|
||
| resampling_function: ResamplingFunction = average | ||
| resampling_function: ResamplingFunction = lambda samples, _, __: statistics.fmean( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure there are no NaNs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, no NaNs, no Nones, no . That's filtered out before they are fed to the resampling function.infs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, inf is still allowed:
frequenz-sdk-python/src/frequenz/sdk/timeseries/_resampling/_resampler.py
Lines 512 to 514 in c1bc917
| async for sample in self._source: | |
| if sample.value is not None and not sample.value.isnan(): | |
| self._helper.add_sample((sample.timestamp, sample.value.base_value)) |
This pull request splits the
resamplingmodule into many submodules as it was getting very largeIf also uses tuples internally to store and calculate samples, which improves performance and simplifies the logic, as
Samples can haveNonevalues, but the values in the internal buffer can never beNone, and we don't need to use aQuantityas we know we are always working with the same units.It also exposes the
ResamplingFunctionandSourcePropertiestypes, convertsResamplingFunctionto aProtocol, and replacesaverageas the default resampling function with Python'sstatistics.fmean, among other small improvements and fixes.