Description
Environment
- Kubernetes v1.33
- trinodb/trino-gateway:17
- trinodb/trino:476
Problem
We've implemented an automated graceful shutdown of the Trino coordinator using a preStop lifecycle hook that deactivates the cluster and then checks whether any queries are still running before terminating the pod. During testing we noticed that Trino Gateway throws an internal server error, "Number of active backends found zero", when none of the backends in a routing group are "Active" anymore. This can happen if the group contains only one cluster or if, for any reason, all clusters in the group are inactive.
From the docs we understand that Trino Gateway can still proxy requests for already submitted queries to the terminating coordinator, and that only clients initiating a new query receive a 500 error. The relevant section of the docs reads:
Trino Gateway supports graceful shutdown of Trino clusters. Even when a cluster is deactivated, any submitted query states can still be retrieved based on the Query ID.
To graceful shutdown a trino cluster without query losses, the steps are:
- Deactivate the cluster by turning off the 'Active' switch. This ensures that no new incoming queries are routed to the cluster.
- Poll the Trino cluster coordinator URL until the queued query count and the running query count are both zero.
- Terminate the Trino coordinator and worker Java processes.
To gracefully shutdown a single worker process, refer to the Trino documentation for more details.
https://trinodb.github.io/trino-gateway/operation/#graceful-shutdown
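For illustration, the deactivate-then-poll sequence from the docs can be sketched roughly as below. This is a minimal sketch, not our actual hook: `wait_until_drained`, `deactivate` and `get_query_counts` are hypothetical names, and how the two callables talk to the gateway and the coordinator is left to the caller.

```python
import time
from typing import Callable, Tuple


def wait_until_drained(
    deactivate: Callable[[], None],
    get_query_counts: Callable[[], Tuple[int, int]],
    poll_interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Deactivate the cluster, then poll until nothing is queued or running.

    `deactivate` and `get_query_counts` are hypothetical callables supplied
    by the caller; `get_query_counts` returns (queued, running).
    Returns True once drained, False if the timeout expires first.
    """
    # Step 1: stop new queries from being routed to this cluster.
    deactivate()

    # Step 2: poll the coordinator until both counts reach zero.
    deadline = clock() + timeout_s
    while True:
        queued, running = get_query_counts()
        if queued == 0 and running == 0:
            return True  # Step 3 (terminating the processes) is now safe.
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

The polling step only helps if clients can keep fetching results through the gateway while the cluster is inactive, which is exactly what fails in our case.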
Steps to reproduce
- register a cluster in a single routing group
- POST to /v1/statement to start a query
- GET to the nextUri
- deactivate the cluster by turning off the 'Active' switch (cluster status is still HEALTHY)
- GET to the nextUri
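The steps above could be scripted roughly as follows. This is a sketch under assumptions: the deactivate path `/gateway/backend/deactivate/<name>`, the `X-Trino-User` value and the helper names `http_request` and `reproduce` are illustrative, and a real setup may need authentication and a longer-running query than `SELECT 1`.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional, Tuple

Response = Tuple[int, dict]
Transport = Callable[[str, str, Optional[bytes]], Response]


def http_request(method: str, url: str, body: Optional[bytes] = None) -> Response:
    """Send one HTTP request and return (status, parsed JSON body or {})."""
    req = urllib.request.Request(
        url, data=body, method=method, headers={"X-Trino-User": "repro"}
    )
    try:
        with urllib.request.urlopen(req) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # Error responses (such as the 500 in this report) are returned, not raised.
        status, raw = err.code, err.read()
    try:
        return status, json.loads(raw)
    except ValueError:
        return status, {}


def reproduce(gateway: str, cluster: str, request: Transport = http_request) -> Tuple[int, int]:
    """Return the nextUri poll status before and after deactivating `cluster`."""
    # Start a query through the gateway.
    _, started = request("POST", f"{gateway}/v1/statement", b"SELECT 1")
    next_uri = started["nextUri"]

    # Poll once while the cluster is still active.
    before, _ = request("GET", next_uri, None)

    # Assumed admin endpoint for turning off the 'Active' switch.
    request("POST", f"{gateway}/gateway/backend/deactivate/{cluster}", None)

    # Poll the same nextUri again after deactivation.
    after, _ = request("GET", next_uri, None)
    return before, after
```

With the behavior described in this issue, a call such as `reproduce("http://localhost:8080", "trino-1")` would be expected to return a 500 as the second status.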
Observed Behavior
The client receives:
- 500 Internal Server Error
- Trino Gateway logs "Number of active backends found zero"
jakarta.servlet.ServletException: java.lang.IllegalStateException: Number of active backends found zero
at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:437)
at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:374)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:355)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:309)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:202)
at org.eclipse.jetty.ee11.servlet.ServletHolder.handle(ServletHolder.java:750)
at org.eclipse.jetty.ee11.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1668)
at io.airlift.http.server.tracing.TracingServletFilter.doFilter(TracingServletFilter.java:115)
at org.eclipse.jetty.ee11.servlet.FilterHolder.doFilter(FilterHolder.java:205)
at org.eclipse.jetty.ee11.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1640)
at org.eclipse.jetty.ee11.servlet.ServletHandler$MappedServlet.handle(ServletHandler.java:1602)
at org.eclipse.jetty.ee11.servlet.ServletChannel.dispatch(ServletChannel.java:868)
at org.eclipse.jetty.ee11.servlet.ServletChannel.handle(ServletChannel.java:449)
at org.eclipse.jetty.ee11.servlet.ServletHandler.handle(ServletHandler.java:470)
at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1224)
at org.eclipse.jetty.compression.server.CompressionHandler.handle(CompressionHandler.java:346)
at org.eclipse.jetty.server.Handler$Wrapper.handle(Handler.java:794)
at org.eclipse.jetty.server.handler.EventsHandler.handle(EventsHandler.java:81)
at org.eclipse.jetty.server.Handler$Wrapper.handle(Handler.java:794)
at org.eclipse.jetty.server.handler.GracefulHandler.handle(GracefulHandler.java:112)
at org.eclipse.jetty.server.Server.handle(Server.java:197)
at org.eclipse.jetty.server.internal.HttpChannelState$HandlerInvoker.run(HttpChannelState.java:720)
at org.eclipse.jetty.server.internal.HttpConnection.onFillable(HttpConnection.java:412)
at org.eclipse.jetty.server.internal.HttpConnection$FillableCallback.succeeded(HttpConnection.java:1810)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:54)
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:492)
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.epcRunTask(AdaptiveExecutionStrategy.java:428)
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:401)
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:255)
at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:204)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:317)
at org.eclipse.jetty.util.thread.MonitoredQueuedThreadPool$1.run(MonitoredQueuedThreadPool.java:73)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:1009)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1239)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1194)
at java.base/java.lang.Thread.run(Thread.java:1474)
Caused by: java.lang.IllegalStateException: Number of active backends found zero
at io.trino.gateway.ha.router.BaseRoutingManager.lambda$provideDefaultBackendConfiguration$1(BaseRoutingManager.java:98)
at java.base/java.util.Optional.orElseThrow(Optional.java:403)
at io.trino.gateway.ha.router.BaseRoutingManager.provideDefaultBackendConfiguration(BaseRoutingManager.java:98)
at io.trino.gateway.ha.router.BaseRoutingManager.lambda$provideBackendConfiguration$1(BaseRoutingManager.java:111)
at java.base/java.util.Optional.orElseGet(Optional.java:364)
at io.trino.gateway.ha.router.BaseRoutingManager.provideBackendConfiguration(BaseRoutingManager.java:111)
at io.trino.gateway.ha.handler.RoutingTargetHandler.getRoutingTargetResponse(RoutingTargetHandler.java:98)
at io.trino.gateway.ha.handler.RoutingTargetHandler.resolveRouting(RoutingTargetHandler.java:83)
at io.trino.gateway.proxyserver.RouteToBackendResource.getHandler(RouteToBackendResource.java:76)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
at java.base/java.lang.reflect.Method.invoke(Method.java:565)
at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189)
at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:159)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93)
at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:470)
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:394)
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:80)
at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:274)
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:266)
at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:253)
at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:703)
at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:422)
... 36 more
Expected Behavior
- Polling an already submitted query (nextUri) should continue to work as long as the coordinator is still running and the status is HEALTHY.
- Deactivating the cluster should only prevent new queries from being routed.
- A 500 error should only be returned for new query submissions when no active backends exist.
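To make the expected routing rule concrete, here is a rough sketch of the decision we have in mind. It is not the gateway's actual code: `Backend`, `choose_backend`, `query_owner` and `NoActiveBackendError` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Backend:
    name: str
    active: bool
    healthy: bool


class NoActiveBackendError(Exception):
    """Raised when a new query cannot be routed to any backend."""


def choose_backend(
    query_id: Optional[str],
    backends: List[Backend],
    query_owner: Dict[str, str],
) -> Backend:
    """Pick a backend for a request (hypothetical sketch of the expected rule).

    `query_owner` maps a known query ID to the name of the backend that
    accepted it; `query_id` is None for a new query submission.
    """
    by_name = {b.name: b for b in backends}

    # Requests for an already submitted query go to the backend that owns it,
    # even if that backend has been deactivated, as long as it is healthy.
    if query_id is not None:
        owner = by_name.get(query_owner.get(query_id, ""))
        if owner is not None and owner.healthy:
            return owner

    # New queries are only routed to active, healthy backends.
    for backend in backends:
        if backend.active and backend.healthy:
            return backend

    raise NoActiveBackendError("Number of active backends found zero")
```

Under this rule, deactivating the only cluster in a group would still fail new submissions, but polling a known query ID would keep working until the coordinator goes away.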