Commit f63ccce
[SPARK-54818][SQL] TaskMemoryManager allocate failed should log errorstack to help check memory usage
### What changes were proposed in this pull request?
While investigating applications on our cluster that failed with OOM, I found that a failed allocation in TaskMemoryManager does not log the error stack, which makes it hard for users to diagnose their applications.
For example, a user set a 200M broadcast threshold, yet the task tried to allocate 4G of memory:
```
25/12/21 07:08:13 WARN [broadcast-exchange-4] TaskMemoryManager: Failed to allocate a page (4294967296 bytes), try again.
25/12/21 07:08:58 WARN [broadcast-exchange-4] TaskMemoryManager: Failed to allocate a page (4294967296 bytes), try again.
```
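The idea behind the change can be sketched as follows. This is a minimal, hypothetical illustration and not Spark's actual `TaskMemoryManager` code: the class `AllocationLoggingSketch`, its methods, and the fixed-limit allocator are assumptions made for the example, and the warning is built as a string rather than sent through a real logger.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical sketch; the names here are illustrative, not Spark APIs. */
public class AllocationLoggingSketch {

    /** Stand-in allocator: fails for any request above a fixed limit. */
    static byte[] allocate(int sizeBytes, int limitBytes) {
        if (sizeBytes > limitBytes) {
            throw new OutOfMemoryError("Failed to allocate " + sizeBytes);
        }
        return new byte[sizeBytes];
    }

    /** Before: only the message is reported, so the failing call path is lost. */
    static String warnWithoutStack(long sizeBytes) {
        return "Failed to allocate a page (" + sizeBytes + " bytes), try again.";
    }

    /** After: the same message, followed by the stack trace of the error. */
    static String warnWithStack(long sizeBytes, Throwable error) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        out.println(warnWithoutStack(sizeBytes));
        error.printStackTrace(out);
        out.flush();
        return buffer.toString();
    }

    /** Returns the warning text if the allocation fails, or null if it succeeds. */
    static String tryAllocate(int sizeBytes, int limitBytes) {
        try {
            allocate(sizeBytes, limitBytes);
            return null;
        } catch (OutOfMemoryError e) {
            // Passing the throwable along is what exposes the failing call path.
            return warnWithStack(sizeBytes, e);
        }
    }

    public static void main(String[] args) {
        System.out.print(tryAllocate(67108848, 1024));
    }
}
```

With a real logging API, the equivalent step is to pass the caught throwable to the warn call in addition to the message, so the logger prints the stack trace itself.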
### Why are the changes needed?
To help users debug memory allocation failures.
### Does this PR introduce _any_ user-facing change?
Users can now see which memory allocation failed, because the warning includes the error stack.
### How was this patch tested?
**Before**
```
11:45:10.693 WARN org.apache.spark.memory.TaskMemoryManager: Failed to allocate a page (67108848 bytes), try again.
```
**After**
```
11:45:10.693 WARN org.apache.spark.memory.TaskMemoryManager: Failed to allocate a page (67108848 bytes), try again.
java.lang.OutOfMemoryError: Failed to allocate 67108848
at org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:49)
at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:398)
at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:359)
at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:118)
at org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch.acquirePage(RowBasedKeyValueBatch.java:129)
at org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch.<init>(RowBasedKeyValueBatch.java:108)
at org.apache.spark.sql.catalyst.expressions.FixedLengthRowBasedKeyValueBatch.<init>(FixedLengthRowBasedKeyValueBatch.java:1
at org.apache.spark.scheduler.Task.run(Task.scala:147)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:716)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:86)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:83)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:97)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:719)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:842)
java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:49)
at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:398)
at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:359)
at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:118)
at org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch.acquirePage(RowBasedKeyValueBatch.java:129)
at org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch.<init>(RowBasedKeyValueBatch.java:108)
at org.apache.spark.sql.catalyst.expressions.FixedLengthRowBasedKeyValueBatch.<init>(FixedLengthRowBasedKeyValueBatch.java:169)
at org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch.allocate(RowBasedKeyValueBatch.java:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1$hashAgg_FastHashMap_0.<init>(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:593)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:153)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:57)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:111)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:180)
at org.apache.spark.scheduler.Task.run(Task.scala:147)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:716)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:86)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:83)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:97)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:719)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:842)
```
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #53578 from AngersZhuuuu/SPARK-54818.
Authored-by: Angerszhuuuu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Parent commit: 24207e2
1 file changed in core/src/main/java/org/apache/spark/memory, with 18 additions and 4 deletions.