Time Weighted Average API/Window Function? #52
Replies: 9 comments 8 replies
-
I'd love thoughts on this formulation from @inselbuch, and whoever else has thoughts. I'd especially love ideas around naming things.
-
Note that this works nicely with time_bucket_gapfill:

WITH t as (
    SELECT time_bucket_gapfill('5 min', ts) as bucket,
           tw_agg(ts, value, method=>'locf') -- produces nulls in gaps
    FROM foo
    WHERE ts > '2020-10-01' and ts <= '2020-10-02'
    GROUP BY 1 )
SELECT bucket,
       time_weighted_average(tw_agg,
           prev=>(SELECT ts, value FROM foo WHERE ts < '2020-10-01' ORDER BY ts DESC LIMIT 1))
       OVER (ORDER BY bucket ASC)
FROM t;

The window function would fill in any null buckets and do lookback across them, so it could fill values back across the gaps.
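As a rough illustration of the lookback behavior described above (a sketch, not the actual window-function implementation), filling null gapfilled buckets by carrying the previous bucket's aggregate forward might look like:

```python
def fill_gaps_locf(buckets):
    """Carry the last non-null aggregate forward into null buckets.

    `buckets` is a list of (bucket_start, aggregate) pairs in ascending
    order, where gapfilled buckets have aggregate None. Leading nulls
    (no earlier aggregate to look back to) stay None.
    """
    filled, last = [], None
    for bucket, agg in buckets:
        if agg is None:
            agg = last  # lookback: reuse the previous bucket's aggregate
        else:
            last = agg
        filled.append((bucket, agg))
    return filled
```

The same idea applies whether the carried value is a raw number or an aggregate summary; the point is that lookback happens once, across already-bucketed rows, rather than per-bucket subqueries.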
-
If there is no previous value before the requested time range:
advance the start time of the range forward to match the timestamp of the first value within the range.
So basically you are reducing the total amount of weight to distribute by a little,
which is accurate.
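A minimal Python sketch of this rule for locf weighting (illustrative only; function and argument names are assumptions, not the extension's API). When no value precedes the range, the effective start advances to the first in-range timestamp, shrinking the total weight instead of inventing a value:

```python
from datetime import datetime

def twa_locf(points, range_start, range_end):
    """Time-weighted average with last-observation-carried-forward weighting.

    `points` is a list of (timestamp, value) pairs sorted by timestamp,
    all falling inside [range_start, range_end]. With no prior value,
    weighting starts at the first observed point rather than range_start.
    """
    if not points:
        return None
    # No prior value: advance the start to the first in-range timestamp.
    start = max(range_start, points[0][0])
    weighted_sum = 0.0
    for (ts, val), (next_ts, _) in zip(points, points[1:]):
        weighted_sum += val * (next_ts - ts).total_seconds()
    # The last value carries forward to the end of the range.
    last_ts, last_val = points[-1]
    weighted_sum += last_val * (range_end - last_ts).total_seconds()
    total = (range_end - start).total_seconds()
    return weighted_sum / total if total > 0 else points[0][1]
```

For example, with values 10 at 00:10 and 20 at 00:40 over a 00:00-01:00 range, only the 50 minutes from 00:10 onward carry weight, giving (10*30 + 20*20) / 50 = 14.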
-
twa()
-
What we dearly want to avoid is generating a set of evenly-spaced values.
If you are computing a twa over a month and there are only three values in there:
a) you would be generating potentially millions of duplicate values, which is slow and memory-intensive
b) your calculation would, by definition, not be accurate to the precision of the timestamps (consider three values with subsecond timestamps, e.g. 10:01:05.2: to be accurate you would have to generate a set of values 0.2 seconds apart, which is not good)
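To illustrate why no resampling grid is needed (a sketch under the linear-interpolation weighting mentioned elsewhere in the thread; the function name is made up): the exact answer is just the sum of trapezoid areas over the actual observations, O(n) in the number of points regardless of timestamp precision.

```python
def twa_linear(points):
    """Exact time-weighted average under linear interpolation.

    `points` is a sorted list of (epoch_seconds, value) pairs. Each
    adjacent pair contributes its trapezoid area; the cost depends only
    on the number of actual observations, not on timestamp resolution.
    """
    if len(points) < 2:
        return points[0][1] if points else None
    area = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        area += 0.5 * (v0 + v1) * (t1 - t0)
    return area / (points[-1][0] - points[0][0])
```

Three points with subsecond spacing are handled exactly, with no need to materialize values 0.2 seconds apart.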
-
great
you are smart
-
@davidkohn88 This design looks great! One suggestion: flesh out the query with a more complicated GROUP BY (for example, group by time and device_id). It may make the lookback more complicated, which would be important to know.
-
This was implemented by another time-series database using a derived table named "AGGREGATES" with the following column definitions:
NAME
FIELD_ID – The name of the field on which aggregates are to be calculated.
TS – The timestamp associated with the aggregates. This column is used to specify the time period for the calculation. The default range is 1 hour from the current time.
TS_MIDDLE – The timestamp of the middle of the period.
TS_END – The last timestamp in the period.
PERIOD – The time period in seconds for the calculation.
REQUEST – This is an integer field with 6 possible values. The two most common are:
· 0 – actual data points are used
· 1 – The default. The calculation is an integration.
STEPPED – This is an integer field. 0 is the default referring to interpolated data. 1 treats the data as stepped.
AVG – The average value
GOOD – The number of good values
MAX – The maximum value
MIN – The minimum value
NG – The number of not good values
RNG – The difference between the MIN and MAX (range)
STD – The standard deviation
SUM – The sum of the values
VAR – The variance
Here's a sample query:

SELECT
    MIN,
    MAX,
    AVG
FROM
    AGGREGATES
WHERE
    Name LIKE 'L20%'
    AND DEFINITION = 'IP_AnalogDef'
    AND TS BETWEEN '20-Nov-00 07:00:00' AND '20-Nov-00 15:00:00'
    AND PERIOD = 8:00;
That is a lot easier to understand and express than:
WITH t as (
    SELECT time_bucket('5 min', ts) as bucket, id,
           tw_agg(ts, value, method=>'locf')
    FROM foo
    WHERE ts > '2020-10-01' and ts <= '2020-10-02'
    GROUP BY 1, 2 ),
lasts as (
    SELECT DISTINCT ON (id) id,
           (SELECT tspoint(ts, value) FROM foo f
            WHERE f.id = t.id AND f.ts < '2020-10-01'
            ORDER BY ts DESC LIMIT 1) as last_point
    FROM t )
SELECT bucket, id,
       time_weighted_average(tw_agg,
           prev=>(SELECT last_point FROM lasts l WHERE l.id = t.id))
       OVER (PARTITION BY id ORDER BY bucket ASC)
FROM t;
-
Okay, so we've had some internal discussions and we think there are a few more options here that we want to consider as a general API for these types of things, because the same sort of problem comes up in counter aggregates and other bits. I'm going to close this discussion and open a new one based on that.
-
In planning for the work on #46 we've run into a few different issues around how lookback/lookahead would work as we think about the API design.
As a thought on the initial API, we were thinking of something like tw_agg(ts, value, method=>'locf'), as in the queries above, where the method parameter determines how to weight observations. For now, I think the possible values are locf (last observation carried forward) and linear, which would do linear interpolation.
The thing is that this works as long as we have values at the starts of buckets, but dealing with values outside of the range doesn't work very well in the aggregate context: we'd have to do subqueries within each bucket in order to get the previous value. It would also make integrating this into continuous aggregates either impossible or weirdly error-prone, because an invalidation could then very easily have effects outside of its bucket, which is not supported by continuous aggregates and would break the invalidation framework (where invalidations may only have effects inside their time buckets).
The way we've come up with to solve this is to break the calculation into two parts: first an aggregate, then a window function.
ts_point(timestamptz, double)

The tw_agg aggregate would produce a summary of the things needed to calculate the time-weighted average: namely, the first and last (time, value) pairs, as well as the time-weighted sum of everything in between. To finalize the aggregate, the sum would be divided by the overall time. I think you'd likely need to store the method as well, as a bitmask, or potentially pass it in to the time_weighted_average function. The interface for prev is a subselect that provides the value from outside the time range to carry forward.

This allows you to store these values in a continuous aggregate without problems, as all of the out-of-range data is dealt with in the window function afterwards, and it is much more efficient. The window function basically has the ability to look forward and back depending on what type of weighting it's doing (it would need to look forward in the case of linear interpolation, and the window clause would need to become ORDER BY bucket ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).

If the prev value is not supplied or no value exists, the weighted average would be calculated using the earliest value in the window, assuming that that is the first value that exists. This means we could also use the time_weighted_average function over a single bucket without providing a window clause, by just passing a tw_agg (or whatever better name we come up with), as long as we were willing to take the inaccuracy of not having the initial value.
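A minimal Python sketch of the two-part design described above, assuming locf weighting (the names TwSummary, tw_agg, and time_weighted_average here are illustrative stand-ins, not the actual extension API): the aggregate step produces the first point, last point, and the weighted sum between them; the finalize step optionally splices in a prev point from before the range.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, float]  # (epoch_seconds, value)

@dataclass
class TwSummary:
    """Partial state for a locf time-weighted average: the first and
    last points plus the weighted sum of everything between them."""
    first: Point
    last: Point
    weighted_sum: float  # integral of value dt between first and last

def tw_agg(points):
    """Aggregate step: build a summary from sorted (ts, value) pairs."""
    s = TwSummary(points[0], points[0], 0.0)
    for p in points[1:]:
        ts, _ = p
        s.weighted_sum += s.last[1] * (ts - s.last[0])  # locf weighting
        s.last = p
    return s

def time_weighted_average(summary, prev: Optional[Point] = None):
    """Finalize step: divide the weighted sum by the covered time.

    `prev` is a point from before the range, carried forward to the
    first in-range point; without it, the covered time effectively
    starts at the first in-range point (slightly less total weight).
    """
    start_ts = prev[0] if prev else summary.first[0]
    ws = summary.weighted_sum
    if prev:
        ws += prev[1] * (summary.first[0] - prev[0])
    total = summary.last[0] - start_ts
    return ws / total if total > 0 else summary.first[1]
```

Because all out-of-range data enters only at the finalize/window stage, the per-bucket summaries themselves never depend on data outside their bucket, which is what keeps the continuous-aggregate invalidation story clean.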