[Bug]: TaskRuntime.stateLock and TableOptimizingProcess.lock can cause deadlocks. #4001

Description

@wardlican

What happened?

TaskRuntime.stateLock and TableOptimizingProcess.lock can be acquired in conflicting orders, causing a deadlock. When this happens, the tables in the affected resource group stop working properly, and OptimizerKeeper also fails to function.

Thread 1: holding 0x000000059a732858, waiting for 0x000000059a739970

stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000059a739970> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:48)
at org.apache.amoro.server.optimizing.TaskRuntime.tryCanceling(TaskRuntime.java:148)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$2143/1314391570.accept(Unknown Source)
at java.util.HashMap$Values.forEach(HashMap.java:982)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.cancelTasks(OptimizingQueue.java:768)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.lambda$persistAndSetCompleted$13(OptimizingQueue.java:748)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$2139/44375492.run(Unknown Source)
at org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown Source)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
at org.apache.amoro.server.optimizing.OptimizingQueue.access$500(OptimizingQueue.java:74)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.persistAndSetCompleted(OptimizingQueue.java:745)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:539)
at org.apache.amoro.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:104)
at org.apache.amoro.server.optimizing.TaskRuntime$$Lambda$2004/1538729933.run(Unknown Source)
at org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown Source)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
at org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:51)
at org.apache.amoro.server.optimizing.TaskRuntime.complete(TaskRuntime.java:77)
at org.apache.amoro.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:266)
at sun.reflect.GeneratedMethodAccessor354.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.amoro.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
at com.sun.proxy.$Proxy60.completeTask(Unknown Source)
at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:629)
at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:605)
at org.apache.amoro.shade.thrift.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:40)
at org.apache.amoro.shade.thrift.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:40)
at org.apache.amoro.shade.thrift.org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:147)
at org.apache.amoro.shade.thrift.org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:492)
at org.apache.amoro.shade.thrift.org.apache.thrift.server.Invocation.run(Invocation.java:19)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Locked ownable synchronizers:
- <0x000000059a732858> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
- <0x000000059a79be88> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
- <0x000000059bbf71e8> (a java.util.concurrent.ThreadPoolExecutor$Worker)

Thread 2: holding 0x000000059a739970, waiting for 0x000000059a732858

stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000059a732858> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at org.apache.amoro.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:487)
at org.apache.amoro.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:104)
at org.apache.amoro.server.optimizing.TaskRuntime$$Lambda$2004/1538729933.run(Unknown Source)
at org.apache.amoro.server.persistence.PersistentBase$$Lambda$573/1461930169.accept(Unknown Source)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.apache.amoro.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:77)
at org.apache.amoro.server.persistence.StatedPersistentBase.invokeConsistency(StatedPersistentBase.java:51)
at org.apache.amoro.server.optimizing.TaskRuntime.complete(TaskRuntime.java:77)
at org.apache.amoro.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:266)
at sun.reflect.GeneratedMethodAccessor354.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.amoro.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
at com.sun.proxy.$Proxy60.completeTask(Unknown Source)
at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:629)
at org.apache.amoro.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:605)
at org.apache.amoro.shade.thrift.org.apache.thrift.ProcessFunction.process(ProcessFunction.java:40)
at org.apache.amoro.shade.thrift.org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:40)
at org.apache.amoro.shade.thrift.org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:147)
at org.apache.amoro.shade.thrift.org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:492)
at org.apache.amoro.shade.thrift.org.apache.thrift.server.Invocation.run(Invocation.java:19)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Locked ownable synchronizers:
- <0x000000059a1e0228> (a java.util.concurrent.ThreadPoolExecutor$Worker)
- <0x000000059a739970> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

The root cause of the deadlock

Nested lock acquisition leading to a circular wait

  • Thread 1 holds TableOptimizingProcess.lock (0x000000059a732858), plus its own task's stateLock, and is waiting for another task's TaskRuntime.stateLock (0x000000059a739970).
  • Thread 2 holds that task's TaskRuntime.stateLock (0x000000059a739970) and is waiting for TableOptimizingProcess.lock (0x000000059a732858). A minimal runnable sketch of this circular wait follows this list.
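
The circular wait can be reproduced outside Amoro with two plain ReentrantLocks acquired in opposite orders. The sketch below is illustrative only: processLock and stateLock are stand-ins for TableOptimizingProcess.lock and TaskRuntime.stateLock, and the class and method names are made up for this example.

```java
import java.util.concurrent.locks.ReentrantLock;

public class CircularWaitSketch {
  // Stand-ins for TableOptimizingProcess.lock and TaskRuntime.stateLock.
  private static final ReentrantLock processLock = new ReentrantLock();
  private static final ReentrantLock stateLock = new ReentrantLock();

  public static void main(String[] args) {
    // Like Thread 1: already inside the process, then needs a task's state lock.
    Thread t1 = new Thread(() -> {
      processLock.lock();
      try {
        sleepQuietly(100);       // widen the race window
        stateLock.lock();        // like tryCanceling(): needs the task's stateLock
        try {
          // cancel the task ...
        } finally {
          stateLock.unlock();
        }
      } finally {
        processLock.unlock();
      }
    }, "thread-1");

    // Like Thread 2: already inside the task, then needs the process lock.
    Thread t2 = new Thread(() -> {
      stateLock.lock();
      try {
        sleepQuietly(100);
        processLock.lock();      // like acceptResult(): needs the process lock
        try {
          // accept the result ...
        } finally {
          processLock.unlock();
        }
      } finally {
        stateLock.unlock();
      }
    }, "thread-2");

    t1.start();
    t2.start();
  }

  private static void sleepQuietly(long ms) {
    try {
      Thread.sleep(ms);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

Running this, both threads park indefinitely, and a jstack dump shows the same "locked / waiting to lock" cycle as the traces above; the only way out is to never acquire the two locks in opposite orders.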

Specific problem

In acceptResult(), persistAndSetCompleted(false) is called while TableOptimizingProcess.lock is held. Inside doAsTransaction(), persistAndSetCompleted() calls cancelTasks(), which iterates over all tasks and calls TaskRuntime.tryCanceling() on each one, and tryCanceling() must acquire that task's stateLock (this is Thread 1 in the trace above).

If another thread (Thread 2 in the trace above) already holds one of those tasks' stateLock, having entered via TaskRuntime.complete(), and is itself waiting for TableOptimizingProcess.lock inside acceptResult(), the two threads deadlock.
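
For illustration, here is a simplified structural sketch of that call pattern. The class and method names mirror the real ones (complete(), tryCanceling(), acceptResult(), cancelTasks()), but the bodies are stand-ins: persistence, transactions, and state handling are omitted, and this is not the actual Amoro implementation.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Simplified stand-in for TaskRuntime; only the lock-acquisition order is modeled.
class TaskRuntimeSketch {
  final ReentrantLock stateLock = new ReentrantLock();   // mirrors TaskRuntime.stateLock
  ProcessSketch process;

  // Mirrors complete(): runs under this task's stateLock, then calls into the
  // process, which in turn needs TableOptimizingProcess.lock.
  void complete() {
    stateLock.lock();
    try {
      process.acceptResult(this);        // needs process.lock while stateLock is held
    } finally {
      stateLock.unlock();
    }
  }

  // Mirrors tryCanceling(): needs this task's stateLock.
  void tryCanceling() {
    stateLock.lock();
    try {
      // mark the task canceled ...
    } finally {
      stateLock.unlock();
    }
  }
}

// Simplified stand-in for TableOptimizingProcess.
class ProcessSketch {
  final ReentrantLock lock = new ReentrantLock();        // mirrors TableOptimizingProcess.lock
  List<TaskRuntimeSketch> tasks;

  // Mirrors acceptResult() -> persistAndSetCompleted(false) -> cancelTasks():
  // while holding the process lock, it tries to take every task's stateLock.
  void acceptResult(TaskRuntimeSketch completed) {
    lock.lock();
    try {
      for (TaskRuntimeSketch task : tasks) {
        task.tryCanceling();             // blocks if another thread is inside that task's complete()
      }
    } finally {
      lock.unlock();
    }
  }
}
```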

Affects Versions

0.7.1

What table formats are you seeing the problem on?

Iceberg

What engines are you seeing the problem on?

AMS

How to reproduce

No response

Relevant log output

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Assignees: No one assigned
Labels: type:bug (Something isn't working)
