Time Weighted Average API/Window Function? #52
Replies: 9 comments 8 replies
-
I'd love thoughts on this formulation from @inselbuch, and whoever else has thoughts. I'd especially love ideas around naming things.
-
Note that this works nicely with time_bucket_gapfill:

WITH t as (
    SELECT time_bucket_gapfill('5 min', ts) as bucket,
           tw_agg(ts, value, method=>'locf') -- produces nulls in gaps
    FROM foo
    WHERE ts > '2020-10-01' and ts <= '2020-10-02'
    GROUP BY 1 )
SELECT bucket,
       time_weighted_average(tw_agg,
           prev=>(SELECT ts, value FROM foo WHERE ts < '2020-10-01' ORDER BY ts DESC LIMIT 1))
       OVER (ORDER BY bucket ASC)
FROM t;

The window function would fill in any null buckets and do lookback across them, so it could fill values back across the gaps.
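As a rough illustration of the lookback behavior described above (a sketch, not the actual window-function implementation), filling null gapfilled buckets by carrying the previous bucket's aggregate forward might look like:

```python
def fill_gaps_locf(buckets):
    """Carry the last non-null aggregate forward into null buckets.

    `buckets` is a list of (bucket_start, aggregate) pairs in ascending
    order, where gapfilled buckets have aggregate None. Leading nulls
    (no earlier aggregate to look back to) stay None.
    """
    filled, last = [], None
    for bucket, agg in buckets:
        if agg is None:
            agg = last  # lookback: reuse the previous bucket's aggregate
        else:
            last = agg
        filled.append((bucket, agg))
    return filled
```

The same idea applies whether the carried value is a raw number or an aggregate summary; the point is that lookback happens once, across already-bucketed rows, rather than per-bucket subqueries.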
-
If there is no previous value before the requested time range:
advance the start time of the range forward to match the timestamp of the first value within the range.
So basically you are reducing the total amount of weight to distribute by a little,
which is accurate.
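A minimal Python sketch of this rule for locf weighting (illustrative only; function and argument names are assumptions, not the extension's API). When no value precedes the range, the effective start advances to the first in-range timestamp, shrinking the total weight instead of inventing a value:

```python
from datetime import datetime

def twa_locf(points, range_start, range_end):
    """Time-weighted average with last-observation-carried-forward weighting.

    `points` is a list of (timestamp, value) pairs sorted by timestamp,
    all falling inside [range_start, range_end]. With no prior value,
    weighting starts at the first observed point rather than range_start.
    """
    if not points:
        return None
    # No prior value: advance the start to the first in-range timestamp.
    start = max(range_start, points[0][0])
    weighted_sum = 0.0
    for (ts, val), (next_ts, _) in zip(points, points[1:]):
        weighted_sum += val * (next_ts - ts).total_seconds()
    # The last value carries forward to the end of the range.
    last_ts, last_val = points[-1]
    weighted_sum += last_val * (range_end - last_ts).total_seconds()
    total = (range_end - start).total_seconds()
    return weighted_sum / total if total > 0 else points[0][1]
```

For example, with values 10 at 00:10 and 20 at 00:40 over a 00:00-01:00 range, only the 50 minutes from 00:10 onward carry weight, giving (10*30 + 20*20) / 50 = 14.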
-
twa()
-
What we dearly want to avoid is generating a set of evenly-spaced values.
If you are computing a twa over a month and there are only three values in there:
a) you would be generating potentially millions of duplicate values, which is slow and memory-intensive
b) your calculation would, by definition, not be accurate to the precision of the timestamps (consider three values with subsecond timestamps, e.g. 10:01:05.2: to be accurate you would have to generate a set of values 0.2 seconds apart, which is not good)
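To illustrate why no resampling grid is needed (a sketch under the linear-interpolation weighting mentioned elsewhere in the thread; the function name is made up): the exact answer is just the sum of trapezoid areas over the actual observations, O(n) in the number of points regardless of timestamp precision.

```python
def twa_linear(points):
    """Exact time-weighted average under linear interpolation.

    `points` is a sorted list of (epoch_seconds, value) pairs. Each
    adjacent pair contributes its trapezoid area; the cost depends only
    on the number of actual observations, not on timestamp resolution.
    """
    if len(points) < 2:
        return points[0][1] if points else None
    area = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        area += 0.5 * (v0 + v1) * (t1 - t0)
    return area / (points[-1][0] - points[0][0])
```

Three points with subsecond spacing are handled exactly, with no need to materialize values 0.2 seconds apart.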
-
great
you are smart
-
@davidkohn88 This design looks great! One suggestion: flesh out the query with a more complicated GROUP BY (for example, group by time and device_id). It may make the lookback more complicated, which would be important to know.
-
This was implemented by another time-series database using a derived table named "AGGREGATES" with the following column definitions:
NAME
FIELD_ID – The name of the field on which aggregates are to be calculated.
TS – The timestamp associated with the aggregates. This column is used to specify the time period for the calculation. The default range is 1 hour from the current time.
TS_MIDDLE – The timestamp of the middle of the period.
TS_END – The last timestamp in the period.
PERIOD – The time period in seconds for the calculation.
REQUEST – This is an integer field with 6 possible values. The two most common are:
· 0 – actual data points are used
· 1 – The default. The calculation is an integration.
STEPPED – This is an integer field. 0 is the default referring to interpolated data. 1 treats the data as stepped.
AVG – The average value
GOOD – The number of good values
MAX – The maximum value
MIN – The minimum value
NG – The number of not good values
RNG – The difference between the MIN and MAX (range)
STD – The standard deviation
SUM – The sum of the values
VAR – The variance
Here's a sample query:

SELECT
    MIN,
    MAX,
    AVG
FROM
    AGGREGATES
WHERE
    Name LIKE 'L20%'
    AND DEFINITION = 'IP_AnalogDef'
    AND TS BETWEEN '20-Nov-00 07:00:00' AND '20-Nov-00 15:00:00'
    AND PERIOD = 8:00;
That is a lot easier to understand and express than:
WITH t as (
    SELECT time_bucket('5 min', ts) as bucket, id,
           tw_agg(ts, value, method=>'locf')
    FROM foo
    WHERE ts > '2020-10-01' and ts <= '2020-10-02'
    GROUP BY 1, 2 ),
lasts as (
    SELECT DISTINCT ON (id) id,
           (SELECT tspoint(ts, value) FROM foo f
            WHERE f.id = t.id AND f.ts < '2020-10-01'
            ORDER BY ts DESC LIMIT 1) as last_point
    FROM t )
SELECT bucket, id,
       time_weighted_average(tw_agg,
           prev=>(SELECT last_point FROM lasts l WHERE l.id = t.id))
       OVER (PARTITION BY id ORDER BY bucket ASC)
FROM t;
-
Okay, so we've had some internal discussions and we think there are a few more options here that we want to consider as a general API for these types of things, because the same sort of problem comes up in counter aggregates and other bits. I'm going to close this discussion and open a new one based on that.
-
In planning for the work on #46 we've run into a few different issues around how lookback/lookahead would work as we think about the API design.
As a thought on the initial API, we were thinking of something like tw_agg(ts, value, method=>'locf'), as in the queries above, where the method parameter determines how to weight observations. For now, I think the possible values are locf (last observation carried forward) and linear, which would do linear interpolation.
The thing is that this works as long as we have values at the starts of buckets, but dealing with values outside of the range doesn't work very well in the aggregate context: we'd have to do subqueries within each bucket in order to get the previous value. It would also make integrating this into continuous aggregates either impossible or weirdly error-prone, because an invalidation could then very easily have effects outside of its bucket, which is not supported by continuous aggregates and would break the invalidation framework (where invalidations may only have effects inside their time buckets).
The way we've come up with to solve this is to break the calculation into two parts: first an aggregate, then a window function.
ts_point(timestamptz, double)

The tw_agg aggregate would produce a summary of the things needed to calculate the time-weighted average: namely, the first and last (time, value) pairs, as well as the time-weighted sum of everything in between. To finalize the aggregate, the sum would be divided by the overall time. I think you'd likely need to store the method as well, as a bitmask, or potentially pass it in to the time_weighted_average function. The interface for prev is a subselect that provides the value from outside the time range to carry forward.

This allows you to store these values in a continuous aggregate without problems, as all of the out-of-range data is dealt with in the window function afterwards, and it is much more efficient. The window function basically has the ability to look forward and back depending on what type of weighting it's doing (it would need to look forward in the case of linear interpolation, and the window clause would need to become ORDER BY bucket ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).

If the prev value is not supplied or no value exists, the weighted average would be calculated using the earliest value in the window, assuming that that is the first value that exists. This means we could also use the time_weighted_average function over a single bucket without providing a window clause, by just passing a tw_agg (or whatever better name we come up with), as long as we were willing to take the inaccuracy of not having the initial value.
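A minimal Python sketch of the two-part design described above, assuming locf weighting (the names TwSummary, tw_agg, and time_weighted_average here are illustrative stand-ins, not the actual extension API): the aggregate step produces the first point, last point, and the weighted sum between them; the finalize step optionally splices in a prev point from before the range.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, float]  # (epoch_seconds, value)

@dataclass
class TwSummary:
    """Partial state for a locf time-weighted average: the first and
    last points plus the weighted sum of everything between them."""
    first: Point
    last: Point
    weighted_sum: float  # integral of value dt between first and last

def tw_agg(points):
    """Aggregate step: build a summary from sorted (ts, value) pairs."""
    s = TwSummary(points[0], points[0], 0.0)
    for p in points[1:]:
        ts, _ = p
        s.weighted_sum += s.last[1] * (ts - s.last[0])  # locf weighting
        s.last = p
    return s

def time_weighted_average(summary, prev: Optional[Point] = None):
    """Finalize step: divide the weighted sum by the covered time.

    `prev` is a point from before the range, carried forward to the
    first in-range point; without it, the covered time effectively
    starts at the first in-range point (slightly less total weight).
    """
    start_ts = prev[0] if prev else summary.first[0]
    ws = summary.weighted_sum
    if prev:
        ws += prev[1] * (summary.first[0] - prev[0])
    total = summary.last[0] - start_ts
    return ws / total if total > 0 else summary.first[1]
```

Because all out-of-range data enters only at the finalize/window stage, the per-bucket summaries themselves never depend on data outside their bucket, which is what keeps the continuous-aggregate invalidation story clean.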