
Random and frequent usePermission timeouts causing 404s  #603

@marsea24

Description


Issue submitter TODO list

  • I've looked up my issue in FAQ
  • I've searched for already existing issues here
  • I've tried running main-labeled docker image and the issue still persists there
  • I'm running a supported version of the application which is listed here

Describe the bug (actual behavior)

Users are encountering pages that refuse to load (spinning) and eventually time out with a 404. This happens during normal browsing through pages like Topics and Messages for a particular cluster. Additionally, some users are seeing timeouts for random frontend assets such as other js or css files.

The usual culprit when this happens is usePermission.js, which seems to time out after a while. The full request URL for this specific asset, as we've encountered it, is http://kafka-ui.example.io/assets/usePermission-D1xwE96v.js, and if we curl the asset repeatedly from the CLI, we can usually reproduce a hanging timeout somewhere around the 3rd or 4th try.
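
For reference, here is a rough Python equivalent of that curl loop (a sketch only; the asset URL is the example from above, and the timeout/attempt values are arbitrary):

import time
import urllib.request

# Example asset from above; any of the failing js/css assets can be substituted.
ASSET_URL = "http://kafka-ui.example.io/assets/usePermission-D1xwE96v.js"
TIMEOUT_S = 10    # treat anything slower than this as a hang
ATTEMPTS = 20

for i in range(1, ATTEMPTS + 1):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(ASSET_URL, timeout=TIMEOUT_S) as resp:
            body = resp.read()
            status = resp.status
        print(f"attempt {i}: HTTP {status}, {len(body)} bytes in {time.monotonic() - start:.2f}s")
    except Exception as exc:  # socket timeouts, empty responses, connection resets, ...
        print(f"attempt {i}: FAILED after {time.monotonic() - start:.2f}s -> {exc!r}")
    time.sleep(1)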

In our kafka-ui logs, we see the following printed that correlates with the failure:

2024-10-09 19:34:18,224 INFO  [kafka-admin-client-thread | kafbat-ui-admin-1728494355-1] o.a.k.c.NetworkClient: [AdminClient clientId=kafbat-ui-admin-1728494355-1] Cancelled in-flight METADATA request with correlation id 351 due to node 5 being disconnected (elapsed time since creation: 1ms, elapsed time since send: 1ms, request timeout: 30000ms)
2024-10-10 15:29:40,715 INFO [kafka-admin-client-thread | kafbat-ui-admin-1728494355-1] o.a.k.c.NetworkClient: [AdminClient clientId=kafbat-ui-admin-1728494355-1] Cancelled in-flight METADATA request with correlation id 2479 due to node 14 being disconnected (elapsed time since creation: 1ms, elapsed time since send: 1ms, request timeout: 29797ms)

However, we've also seen errors like these:

org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:535)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:296)
	at org.apache.kafka.clients.consumer.internals.AbstractFetch.maybeCloseFetchSessions(AbstractFetch.java:756)
	at org.apache.kafka.clients.consumer.internals.AbstractFetch.close(AbstractFetch.java:777)
	at org.apache.kafka.clients.consumer.internals.Fetcher.close(Fetcher.java:110)
	at org.apache.kafka.clients.consumer.KafkaConsumer.lambda$close$3(KafkaConsumer.java:2472)
	at org.apache.kafka.common.utils.Utils.swallow(Utils.java:1025)
	at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2472)
	at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2415)
	at io.kafbat.ui.emitter.EnhancedConsumer.close(EnhancedConsumer.java:75)
	at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:2388)
	at io.kafbat.ui.emitter.RangePollingEmitter.accept(RangePollingEmitter.java:51)
	at io.kafbat.ui.emitter.ForwardEmitter.accept(ForwardEmitter.java:14)
	at io.kafbat.ui.emitter.RangePollingEmitter.accept(RangePollingEmitter.java:18)
	at reactor.core.publisher.FluxCreate.subscribe(FluxCreate.java:95)
	at reactor.core.publisher.Flux.subscribe(Flux.java:8773)
	at reactor.core.publisher.FluxFlatMap$FlatMapMain.onNext(FluxFlatMap.java:427)
	at reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.runAsync(FluxPublishOn.java:446)
	at reactor.core.publisher.FluxPublishOn$PublishOnSubscriber.run(FluxPublishOn.java:533)
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84)
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.InterruptedException: null
	... 26 common frames omitted
kafka-ui-9d87748df-v7gxr 2024-10-10 15:51:56,826 ERROR [parallel-1] o.s.b.a.w.r.e.AbstractErrorWebExceptionHandler: [91c18eb1-6690] 500 Server Error for HTTP GET "/api/clusters/herp-derpnet/topics/assets.herpderp.evm_1.dlq.v1/messages/v2?limit=100&mode="
java.lang.NullPointerException: null
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:903)
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
*__checkpoint ⇢ io.kafbat.ui.config.CorsGlobalConfiguration$$Lambda$1013/0x00007ff51a62a728 [DefaultWebFilterChain]
*__checkpoint ⇢ io.kafbat.ui.config.CustomWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ io.kafbat.ui.config.ReadOnlyModeFilter [DefaultWebFilterChain]
*__checkpoint ⇢ AuthorizationWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ExceptionTranslationWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ LogoutWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ServerRequestCacheWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ SecurityContextServerWebExchangeWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ReactorContextWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ HttpHeaderWriterWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ ServerWebExchangeReactorContextWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ org.springframework.security.web.server.WebFilterChainProxy [DefaultWebFilterChain]
*__checkpoint ⇢ org.springframework.web.filter.reactive.ServerHttpObservationFilter [DefaultWebFilterChain]
*__checkpoint ⇢ HTTP GET "/api/clusters/herp-derpnet/topics/assets.herpderptopic.evm_1.dlq.v1/messages/v2?limit=100&mode=" [ExceptionHandlingWebHandler]

Expected behavior

Expected pages to load on refresh with full content within a few seconds, as usually happens after one or more refreshes once the problem is encountered. Note that the same page may have an issue one time but not another.

Your installation details

  1. We've seen this issue with 91ed167 and 273e64c.
  2. We've seen this issue with both Helm Chart versions 1.4.2 and 1.4.6.
  3. Here are the Helm and application env configs: https://gist.github.com/marsea24/01fe5002eb4b363e35b8c5166da95797
  4. See the above gist for what should be all relevant kafka-ui configuration; happy to provide any other specifics needed.

Steps to reproduce

Clicking around the UI under any cluster, particularly diving into Topics pages and even Messages, is usually where we see the issue happen (although this observation is biased; other pages might actually exhibit it more often). Note that there is no authentication configured in our kafka-ui instance.

Screenshots

No response

Logs

No response

Additional context

  1. We've tried having other users reproduce the problem, different Helm chart and image tags of kafka-ui, granting our Confluent Cloud service account full permissions, and running kafka-ui locally. None of these have made any difference; the issue still persists.
  2. Yes, we have one particular user who gets random timeouts of kafka-ui frontend assets, which cause him to see more frequent page-load spinning. In many cases when this happens, it takes multiple refreshes, or reloading the site in another tab, before he can get anything to load.
  3. All logs are provided in the sections above; happy to provide more upon request.
  4. The impact on end users here is a really difficult and drawn-out process of debugging and developing our platform. With all the issues, sometimes they just have to give up and move on to other things. Watching users deal with the issue directly, we see them spamming the refresh button, opening multiple tabs, trying to refresh VPN connections, and continually coming to the Infra team looking for help (which we can't provide).

Finally, we've encountered other UI weirdness like ERR_EMPTY_RESPONSE with assets such as http://kafka-ui.example.io/assets/Indicator-BUTjfyDu.js or http://kafka-ui.example.io/assets/Input-BPtTPA5k.js, as well as timeouts with other random css or js files, which cause the UI to spin. It's about a 50/50 chance whether a single refresh solves the problem or multiple refreshes / a full reload are required to even get the UI to load successfully.
