Trino Gateway returns 500 for existing queries when routing group has no active backends during graceful shutdown #943

@hebestreit

Environment

  • Kubernetes v1.33
  • trinodb/trino-gateway:17
  • trinodb/trino:476

Problem

We've implemented an automated graceful shutdown of the Trino coordinator using a preStop lifecycle hook. The hook deactivates the cluster and then waits until no queries are running before the pod terminates. During testing we noticed that Trino Gateway returns an internal server error, "Number of active backends found zero", as soon as no backend in a routing group is "Active" anymore. This can happen when the group contains only one cluster, or when all clusters in the group are inactive for some other reason.

From the docs we understood that Trino Gateway keeps proxying requests for already submitted queries to the terminating coordinator, and that only clients starting a new query should receive a 500 error. The relevant section reads:

Trino Gateway supports graceful shutdown of Trino clusters. Even when a cluster is deactivated, any submitted query states can still be retrieved based on the Query ID.

To graceful shutdown a trino cluster without query losses, the steps are:

  1. Deactivate the cluster by turning off the 'Active' switch. This ensures that no new incoming queries are routed to the cluster.
  2. Poll the Trino cluster coordinator URL until the queued query count and the running query count are both zero.
  3. Terminate the Trino coordinator and worker Java processes.

To gracefully shutdown a single worker process, refer to the Trino documentation for more details.

https://trinodb.github.io/trino-gateway/operation/#graceful-shutdown
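
For reference, step 2 of the documented procedure (the check our preStop hook performs) can be sketched roughly as follows. This is a minimal sketch, not our actual hook: it assumes the coordinator exposes its query list at GET /v1/query with a "state" field per query, and that FINISHED and FAILED are the only terminal states. The function names, the "admin" user and the timing defaults are hypothetical.

```python
import json
import time
import urllib.request

# Assumption: FINISHED and FAILED are the terminal query states;
# every other state counts as queued or running.
TERMINAL_STATES = {"FINISHED", "FAILED"}


def count_in_flight(queries):
    """Count queries that have not reached a terminal state yet."""
    return sum(1 for q in queries if q.get("state") not in TERMINAL_STATES)


def fetch_queries(coordinator_url, user="admin"):
    """Fetch the query list from the coordinator (assumed endpoint: GET /v1/query)."""
    req = urllib.request.Request(
        f"{coordinator_url}/v1/query", headers={"X-Trino-User": user}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def wait_until_drained(fetch, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll fetch() until no query is in flight; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if count_in_flight(fetch()) == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A hook would call something like wait_until_drained(lambda: fetch_queries("http://localhost:8080")) after deactivating the cluster and only then let the pod terminate. The clients of those remaining queries keep polling through the gateway during that window, which is where the error below appears.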

Steps to reproduce

  1. register a cluster in a single routing group
  2. POST to /v1/statement to start a query
  3. GET to the nextUri
  4. deactivate the cluster by turning off the 'Active' switch (cluster status is still HEALTHY)
  5. GET to the nextUri
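
The steps above can be scripted roughly like this. It is a sketch under assumptions rather than the exact script we ran: the deactivate call assumes the gateway API path POST /gateway/backend/deactivate/{name}, the user name "test" and the helper names are hypothetical, and any authentication the gateway requires is left out. Step 1 (registering the cluster) is expected to be done beforehand.

```python
import json
import urllib.error
import urllib.request


def http(method, url, body=None, headers=None):
    """Send one request; return (status, parsed JSON body or None)."""
    data = body.encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method, headers=headers or {})
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, None
    try:
        return status, json.loads(raw)
    except ValueError:
        return status, None


def reproduce(gateway_url, backend_name, send=http):
    """Run steps 2-5 and return the poll status before and after deactivation."""
    headers = {"X-Trino-User": "test"}
    # Step 2: submit a query through the gateway.
    _, submitted = send("POST", f"{gateway_url}/v1/statement", "SELECT 1", headers)
    next_uri = submitted["nextUri"]
    # Step 3: poll once while the backend is still active.
    status_before, polled = send("GET", next_uri, None, headers)
    next_uri = (polled or {}).get("nextUri", next_uri)
    # Step 4: deactivate the backend (assumed gateway API path).
    send("POST", f"{gateway_url}/gateway/backend/deactivate/{backend_name}", None, {})
    # Step 5: poll again; this is the request that fails.
    status_after, _ = send("GET", next_uri, None, headers)
    return status_before, status_after
```

The send parameter only exists so the flow can be exercised without a running gateway. A longer-running statement than SELECT 1 makes the window between steps 3 and 5 easier to hit.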

Observed Behavior

  • The client receives a 500 Internal Server Error.
  • Trino Gateway logs "Number of active backends found zero" with the following stack trace:

jakarta.servlet.ServletException: java.lang.IllegalStateException: Number of active backends found zero
	at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:437)
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:374)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:355)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:309)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:202)
	at org.eclipse.jetty.ee11.servlet.ServletHolder.handle(ServletHolder.java:750)
	at org.eclipse.jetty.ee11.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1668)
	at io.airlift.http.server.tracing.TracingServletFilter.doFilter(TracingServletFilter.java:115)
	at org.eclipse.jetty.ee11.servlet.FilterHolder.doFilter(FilterHolder.java:205)
	at org.eclipse.jetty.ee11.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1640)
	at org.eclipse.jetty.ee11.servlet.ServletHandler$MappedServlet.handle(ServletHandler.java:1602)
	at org.eclipse.jetty.ee11.servlet.ServletChannel.dispatch(ServletChannel.java:868)
	at org.eclipse.jetty.ee11.servlet.ServletChannel.handle(ServletChannel.java:449)
	at org.eclipse.jetty.ee11.servlet.ServletHandler.handle(ServletHandler.java:470)
	at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1224)
	at org.eclipse.jetty.compression.server.CompressionHandler.handle(CompressionHandler.java:346)
	at org.eclipse.jetty.server.Handler$Wrapper.handle(Handler.java:794)
	at org.eclipse.jetty.server.handler.EventsHandler.handle(EventsHandler.java:81)
	at org.eclipse.jetty.server.Handler$Wrapper.handle(Handler.java:794)
	at org.eclipse.jetty.server.handler.GracefulHandler.handle(GracefulHandler.java:112)
	at org.eclipse.jetty.server.Server.handle(Server.java:197)
	at org.eclipse.jetty.server.internal.HttpChannelState$HandlerInvoker.run(HttpChannelState.java:720)
	at org.eclipse.jetty.server.internal.HttpConnection.onFillable(HttpConnection.java:412)
	at org.eclipse.jetty.server.internal.HttpConnection$FillableCallback.succeeded(HttpConnection.java:1810)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:54)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:492)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.epcRunTask(AdaptiveExecutionStrategy.java:428)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:401)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:255)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:204)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:317)
	at org.eclipse.jetty.util.thread.MonitoredQueuedThreadPool$1.run(MonitoredQueuedThreadPool.java:73)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:1009)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1239)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1194)
	at java.base/java.lang.Thread.run(Thread.java:1474)
Caused by: java.lang.IllegalStateException: Number of active backends found zero
	at io.trino.gateway.ha.router.BaseRoutingManager.lambda$provideDefaultBackendConfiguration$1(BaseRoutingManager.java:98)
	at java.base/java.util.Optional.orElseThrow(Optional.java:403)
	at io.trino.gateway.ha.router.BaseRoutingManager.provideDefaultBackendConfiguration(BaseRoutingManager.java:98)
	at io.trino.gateway.ha.router.BaseRoutingManager.lambda$provideBackendConfiguration$1(BaseRoutingManager.java:111)
	at java.base/java.util.Optional.orElseGet(Optional.java:364)
	at io.trino.gateway.ha.router.BaseRoutingManager.provideBackendConfiguration(BaseRoutingManager.java:111)
	at io.trino.gateway.ha.handler.RoutingTargetHandler.getRoutingTargetResponse(RoutingTargetHandler.java:98)
	at io.trino.gateway.ha.handler.RoutingTargetHandler.resolveRouting(RoutingTargetHandler.java:83)
	at io.trino.gateway.proxyserver.RouteToBackendResource.getHandler(RouteToBackendResource.java:76)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:565)
	at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189)
	at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:159)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:470)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:394)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:80)
	at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:274)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:266)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:253)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:703)
	at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:422)
	... 36 more

Expected Behavior

  • Polling an already submitted query (nextUri) should continue to work as long as the coordinator is still running and the status is HEALTHY.
  • Deactivating the cluster should only prevent new queries from being routed.
  • A 500 error should only be returned for new query submissions when no active backends exist.
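
To make the expectation concrete, the routing decision we have in mind looks roughly like the sketch below. It is purely illustrative and not the gateway's actual BaseRoutingManager logic: the Backend type, resolve_backend and the query_backend parameter are hypothetical names, and the sketch assumes the gateway already knows which backend owns a given query ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    routing_group: str
    active: bool
    healthy: bool


class NoActiveBackendError(Exception):
    """Raised when a new query cannot be routed anywhere."""


def resolve_backend(backends, routing_group, query_backend=None):
    """Pick the backend for one request.

    query_backend is the name of the backend that owns the query ID in the
    request, or None when the request starts a new query.
    """
    if query_backend is not None:
        # Existing query: keep routing to its backend while it is healthy,
        # even if the 'Active' switch has been turned off.
        for backend in backends:
            if backend.name == query_backend and backend.healthy:
                return backend
    # New query (or owner unavailable): only active, healthy backends qualify.
    candidates = [
        b for b in backends
        if b.routing_group == routing_group and b.active and b.healthy
    ]
    if not candidates:
        raise NoActiveBackendError("Number of active backends found zero")
    return candidates[0]
```

With a single deactivated but healthy backend, a nextUri poll for a query it owns still resolves to that backend, while a new submission to the same routing group raises the error.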
