Description
Context
When query-schedulers shut down, the query-frontends log errors like this:
ts=2025-09-04T03:40:59.450600734Z caller=handler.go:430 level=info msg="query stats" component=query-frontend method=POST path=/prometheus/api/v1/query route_name=prometheus_api_v1_query user_agent=Go-http-client/1.1 status_code=499 response_time=17.158379ms response_size_bytes=0 query_wall_time_seconds=0.010860358 fetched_series_count=0 fetched_chunk_bytes=0 fetched_chunks_count=0 fetched_index_bytes=0 sharded_queries=0 split_queries=0 spun_off_subqueries=0 estimated_series_count=0 queue_time_seconds=3.2402e-05 encode_time_seconds=0 samples_processed=0 samples_processed_cache_adjusted=0 param_query="<redacted>" param_time=2025-09-04T03:40:50Z length=34m59.999s time_since_min_time=35m9.432405588s time_since_max_time=9.433405588s results_cache_hit_bytes=0 results_cache_miss_bytes=0
status=failed
err="context canceled: query cancelled: rpc error: code = Canceled desc = context canceled: frontend disconnected"
This is confusing because the client did not cancel the query, yet it received the HTTP 499 error. For example, this is Grafana alerting reporting the error:
2025-09-04 14:40:15.816,"[sse.dataQueryError] failed to execute query [A]: unexpected response with status code 499: {""status"":""error"",""errorType"":""canceled"",""error"":""context canceled: query cancelled: rpc error: code = Canceled desc = context canceled: frontend disconnected""}"
Problem
Today the query-scheduler disconnects the query-frontends without waiting for active queries to finish. Although it waits for all in-flight queries to be flushed and for all querier workers to disconnect, it closes its connections to the query-frontends immediately, which in turn cancels all of the frontends' in-flight queries.
mimir/pkg/scheduler/scheduler.go
Lines 249 to 250 in 51cc661
// We stop accepting new queries in Stopping state. By returning quickly, we disconnect frontends, which in turns
// cancels all their queries.
Shutdown
There are two places which are responsible for shutting down:
A. where the scheduler communicates to the queriers that the frontend has disconnected and they can discard their queries
mimir/pkg/scheduler/scheduler.go
Line 240 in 51cc661
defer s.frontendDisconnected(frontendAddress)
B. where the frontend closes the loop with the scheduler (which will in turn trigger A.)
mimir/pkg/frontend/v2/frontend_scheduler_worker.go
Lines 322 to 325 in 699a122
loopErr = w.schedulerLoop(loop)
if closeErr := util.CloseAndExhaust[*schedulerpb.SchedulerToFrontend](loop); closeErr != nil {
	level.Debug(w.log).Log("msg", "failed to close frontend loop", "err", closeErr, "addr", w.schedulerAddr)
}
This explains the correlation between the scheduler shutting down and the query-frontend rejecting queries with HTTP 499.
Proposal
We should change both A. and B. above.
A: Scheduler: cancel the queries' contexts and close the gRPC stream only after we know that all queries have been answered and/or all querier workers have disconnected.
B: Frontend: stop sending new queries to the same scheduler (e.g. don't read from requestsCh) but do not disconnect from the scheduler (so that the frontend can still send e.g. cancellation notices via the scheduler to the queriers).