Change RemoteEndpoints interface to support remote engines pruning#680
Change RemoteEndpoints interface to support remote engines pruning#680SuperPaintman wants to merge 7 commits intothanos-io:mainfrom
RemoteEndpoints interface to support remote engines pruning#680Conversation
Signed-off-by: Aleksandr Krivoshchekov <SuperPaintmanDeveloper@gmail.com>
d1885ac to
63ff685
Compare
| } | ||
|
|
||
| func (m staticEndpoints) EnginesV2(mint, maxt int64) []RemoteEngine { | ||
| return m.engines |
There was a problem hiding this comment.
Should we apply filtering here?
There was a problem hiding this comment.
I tried, but since pruning is optional, I decided to not change DistributedExecutionOptimizer implementation, because it assumes that some engines might not have the requested range and fall back.
Some some unit tests check for that, and if we apply filter in static endpoints, they fail.
But now when I'm thinking about that again, I'm not sure that's an issue anymore.
There was a problem hiding this comment.
I think yes, the API contract suggest that we only should return intersecting engines.
There was a problem hiding this comment.
Are you suggesting something like this?
func (m staticEndpoints) Engines(mint, maxt int64) []RemoteEngine {
var engines []RemoteEngine
for _, e := range m.engines {
if e.MaxT() < mint || e.MinT() > maxt {
continue
}
engines = append(engines, e)
}
return engines
}There was a problem hiding this comment.
If so, I'm a bit hesitant about filtering engines themselves here (Maybe I don't fully understand the original implementation).
The optimizers seem to expect all engines to be available, especially for handling absent() (in DistributedExecutionOptimizer.distributeAbsent). If the engine is removed from query (e.g. if it has a data gap for the requested range), dist engine will return incorrect results.
Also, filtering engines breaks TestDistributedExecutionWithLongSelectorRanges/skip_distributing_queries_with_timestamps_outside_of_the_range_of_an_engine:
--- Expected
+++ Actual
@@ -1 +1 @@
-sum(sum by (region) (metric @ 18000.000))
+sum(dedup(remote(sum by (region) (metric @ 18000.000))))Let me know if I'm mistaken, but it seems like an invalid result.
The current "real" implementation also assumes all engines are return.
The intent of the mint/maxt is to prune internal metadata (like TSDBInfos) within each engine, to reduce unnecessary computations later.
Does that make sense, or am I missing something?
There was a problem hiding this comment.
Gotcha, this indeed makes sense, especially since you brought up absent. I think we can leave this as it is and optimize later.
RemoteEndpointsV2/V3 that requests pruned remote enginesRemoteEndpointsV2/V3 that returns pruned remote engines
|
I think the contention is between having API parameters vs hints. Our experience with Prometheus hints hasn't been too positive since they complicate everything downstream. Each change subsequent has to take into account that the hint might not be enforced. I would vote for the V2 implementation which treats parameters as part of a hard API. I would also be fine with adding a new method to the interface instead of introducing a new one. |
Signed-off-by: Aleksandr Krivoshchekov <SuperPaintmanDeveloper@gmail.com>
Signed-off-by: Aleksandr Krivoshchekov <SuperPaintmanDeveloper@gmail.com>
Signed-off-by: Aleksandr Krivoshchekov <SuperPaintmanDeveloper@gmail.com>
Signed-off-by: Aleksandr Krivoshchekov <SuperPaintmanDeveloper@gmail.com>
Signed-off-by: Aleksandr Krivoshchekov <SuperPaintmanDeveloper@gmail.com>
|
I've added a few unit tests for the caching behavior in this PR, but I think the main test suite and benchmarking should happen in the Also, I removed So now this PR introduces a breaking change to the |
RemoteEndpointsV2/V3 that returns pruned remote enginesRemoteEndpoints interface to support remote engines pruning
fpetkovski
left a comment
There was a problem hiding this comment.
This seems good to me, still has to be tested through Thanos but I am okay with merging the change.
| } | ||
|
|
||
| func (m staticEndpoints) EnginesV2(mint, maxt int64) []RemoteEngine { | ||
| return m.engines |
There was a problem hiding this comment.
Gotcha, this indeed makes sense, especially since you brought up absent. I think we can leave this as it is and optimize later.
|
Prepared a small PR to Thanos, where we only update the function signature: thanos-io/thanos#8653 @fpetkovski / @MichaHoffmann could you merge this PR. It seems I don't have permissions. |
This PR changes the
RemoteEndpointsinterface. Now it accepts 'hints' from the caller when selecting remote engines. These hints will be used to prune TSDBInfos (as described in this thanos-io/thanos#8599).Notes
Note 1:I added two interfaces:RemoteEndpointsV2(as suggested here thanos-io/thanos#8599 (comment)), andRemoteEndpointsV3which can be extended without making any new breaking changes.Let me know which one you prefer, and I'll remove the other one.Verification
I'm going to add a benchmark for dist-engine optimizers with synthetic data (as described in the main issue: query: distributed query mode is too slow when the number of external labels is high (~1kk) thanos#8597).thanosrepo.--
Issue: thanos-io/thanos#8597
CC: @MichaHoffmann