scheduler: wait for inflight queries before shutting down #12605

@dimitarvdimitrov

Context

When query-schedulers shut down, the query-frontends log errors like this:

ts=2025-09-04T03:40:59.450600734Z caller=handler.go:430 level=info msg="query stats" component=query-frontend method=POST path=/prometheus/api/v1/query route_name=prometheus_api_v1_query user_agent=Go-http-client/1.1 status_code=499 response_time=17.158379ms response_size_bytes=0 query_wall_time_seconds=0.010860358 fetched_series_count=0 fetched_chunk_bytes=0 fetched_chunks_count=0 fetched_index_bytes=0 sharded_queries=0 split_queries=0 spun_off_subqueries=0 estimated_series_count=0 queue_time_seconds=3.2402e-05 encode_time_seconds=0 samples_processed=0 samples_processed_cache_adjusted=0 param_query="<redacted>" param_time=2025-09-04T03:40:50Z length=34m59.999s time_since_min_time=35m9.432405588s time_since_max_time=9.433405588s results_cache_hit_bytes=0 results_cache_miss_bytes=0 
status=failed 
err="context canceled: query cancelled: rpc error: code = Canceled desc = context canceled: frontend disconnected"

This is confusing because the client didn't cancel the query, yet it observed an HTTP 499 error. For example, this is Grafana alerting reporting the error:

2025-09-04 14:40:15.816,"[sse.dataQueryError] failed to execute query [A]: unexpected response with status code 499: {""status"":""error"",""errorType"":""canceled"",""error"":""context canceled: query cancelled: rpc error: code = Canceled desc = context canceled: frontend disconnected""}"

Problem

Today the query-scheduler disconnects frontends without waiting for active queries to finish. Although it waits for all in-flight queries to be flushed and for all querier workers to disconnect, it also immediately closes the connections to the query-frontends, which in turn cancels all of the frontends' in-flight queries.

// We stop accepting new queries in Stopping state. By returning quickly, we disconnect frontends, which in turns
// cancels all their queries.

Shutdown

There are two places which are responsible for shutting down:

A. where the scheduler communicates to the queriers that the frontend has disconnected and they can discard their queries

defer s.frontendDisconnected(frontendAddress)

B. where the frontend closes the loop with the scheduler (which will in turn trigger A.)

loopErr = w.schedulerLoop(loop)
if closeErr := util.CloseAndExhaust[*schedulerpb.SchedulerToFrontend](loop); closeErr != nil {
	level.Debug(w.log).Log("msg", "failed to close frontend loop", "err", closeErr, "addr", w.schedulerAddr)
}

This is the correlation between the scheduler shutting down and the query-frontend rejecting queries with HTTP 499:

Image

Proposal

We should change both A. and B. above.

A: Scheduler: cancel the queries' contexts and close the gRPC stream only after we know that all queries have been answered and/or all querier workers have disconnected.

B: Frontend: stop sending new queries to the same scheduler (e.g. don't read from requestsCh), but do not disconnect from the scheduler, so that the frontend can still send e.g. cancellation notices via the scheduler to the queriers.
