Conversation

@andygrove andygrove commented Oct 3, 2025

Which issue does this PR close?

Closes #.

Rationale for this change

Debugging. With the new config enabled, every native memory pool interaction is logged, for example:

```text
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256232960) returning Ok
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256375168) returning Ok
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256899456) returning Ok
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(257296128) returning Ok
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(257820416) returning Err
[Task 486] MemoryPool[ExternalSorterMerge[6]].shrink(10485760)
[Task 486] MemoryPool[ExternalSorter[6]].shrink(150464)
[Task 486] MemoryPool[ExternalSorter[6]].shrink(146688)
[Task 486] MemoryPool[ExternalSorter[6]].shrink(137856)
[Task 486] MemoryPool[ExternalSorter[6]].shrink(141952)
[Task 486] MemoryPool[ExternalSorterMerge[6]].try_grow(0) returning Ok
[Task 486] MemoryPool[ExternalSorterMerge[6]].try_grow(0) returning Ok
[Task 486] MemoryPool[ExternalSorter[6]].shrink(524288)
[Task 486] MemoryPool[ExternalSorterMerge[6]].try_grow(0) returning Ok
[Task 486] MemoryPool[ExternalSorterMerge[6]].try_grow(68928) returning Ok
```

From this, we can make pretty charts to help with comprehension:

[chart generated from the log output above]
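
As a rough illustration of how such charts can be derived (a hedged sketch, not the actual dev/scripts code; the regex and field names are assumptions based on the sample output above), the log lines can be parsed into structured records:

```python
import re

# Matches lines like:
#   [Task 486] MemoryPool[ExternalSorter[6]].try_grow(257820416) returning Err
#   [Task 486] MemoryPool[ExternalSorter[6]].shrink(150464)
PATTERN = re.compile(
    r"\[Task (\d+)\] MemoryPool\[(.+)\]\.(try_grow|grow|shrink)\((\d+)\)"
    r"(?: returning (Ok|Err))?"
)

def parse_line(line):
    """Return (task, consumer, method, size, result) or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    task, consumer, method, size, result = m.groups()
    return int(task), consumer, method, int(size), result

print(parse_line("[Task 486] MemoryPool[ExternalSorter[6]].try_grow(257820416) returning Err"))
# (486, 'ExternalSorter[6]', 'try_grow', 257820416, 'Err')
```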

What changes are included in this PR?

  • Add new config spark.comet.debug.memory (see the example of enabling it after this list)
  • Add new LoggingPool that is enabled when the new config is set
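
For example, enabling the config from PySpark might look like this (a minimal sketch; assumes Comet is already on the classpath):

```python
from pyspark.sql import SparkSession

# Build a session with Comet memory-pool debug logging enabled.
spark = (
    SparkSession.builder
    .config("spark.comet.debug.memory", "true")
    .getOrCreate()
)
```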

How are these changes tested?

@andygrove andygrove changed the title chore: Add memory pool trace logging [WIP] chore: Add memory pool trace logging [WIP] [skip-ci] Oct 3, 2025
@andygrove andygrove changed the title chore: Add memory pool trace logging [WIP] [skip-ci] chore: Add memory pool trace logging [WIP] [skip ci] Oct 3, 2025
@andygrove andygrove changed the title chore: Add memory pool trace logging [WIP] [skip ci] chore: Add memory pool trace logging [WIP] Oct 3, 2025
@andygrove andygrove marked this pull request as ready for review October 3, 2025 17:36
@andygrove andygrove changed the title chore: Add memory pool trace logging [WIP] chore: Add memory pool trace logging Oct 3, 2025
Comment on lines 170 to 173
debug_native: jboolean,
explain_native: jboolean,
tracing_enabled: jboolean,
Member Author:

Rather than adding yet another flag to this API call, I am now using the already available Spark config map in native code.

Contributor:

+1. The config map should be the preferred method.


codecov-commenter commented Oct 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.93%. Comparing base (f09f8af) to head (2884ed3).
⚠️ Report is 585 commits behind head on main.

Additional details and impacted files
```diff
@@             Coverage Diff              @@
##               main    #2521      +/-   ##
============================================
+ Coverage     56.12%   58.93%   +2.80%     
- Complexity      976     1449     +473     
============================================
  Files           119      147      +28     
  Lines         11743    13649    +1906     
  Branches       2251     2369     +118     
============================================
+ Hits           6591     8044    +1453     
- Misses         4012     4382     +370     
- Partials       1140     1223      +83
```

☔ View full report in Codecov by Sentry.

impl MemoryPool for LoggingPool {
    fn grow(&self, reservation: &MemoryReservation, additional: usize) {
        println!(
Contributor:

Should the println! be info! or trace!?

Member Author:

I guess info! would be OK. I pushed that change. If we used trace!, then we would have to set spark.comet.debug.memory=true and also configure trace logging for this one file, which seems like overkill for a debug feature.

@andygrove andygrove marked this pull request as draft October 3, 2025 20:53
@andygrove (Member Author):

Moving to draft while I work on the Python scripts.

@andygrove (Member Author):

Still experimenting...

[mem_chart image]

@andygrove (Member Author):

The chart now shows when try_grow failed:

[mem_chart image]

@andygrove andygrove changed the title chore: Add memory pool trace logging chore: Add memory reservation debug logging and visualization Oct 4, 2025
@andygrove andygrove marked this pull request as ready for review October 10, 2025 15:42
Next, generate a chart from the CSV file for a specific Spark task:

```shell
python3 dev/scripts/plot_memory_usage.py /tmp/mem.csv --task 1234
```
Member:

Suggested change
python3 dev/scripts/plot_memory_usage.py /tmp/mem.csv --task 1234
python3 dev/scripts/plot_memory_usage.py /tmp/mem.csv

plot_memory_usage.py does not accept a --task argument.

if __name__ == "__main__":
    ap = argparse.ArgumentParser(description="Generate CSV From memory debug output")
    ap.add_argument("--task", default=None, help="Task ID.")
    ap.add_argument("--file", default=None, help="Spark log containing memory debug output")
Member:

"Guide (https://datafusion.apache.org/comet/user-guide/tracing.html)"

private val DEBUGGING_GUIDE = "For more information, refer to the Comet Debugging " +
"Guide (https://datafusion.apache.org/comet/contributor-guide/debugging.html"
Member:

Suggested change
"Guide (https://datafusion.apache.org/comet/contributor-guide/debugging.html"
"Guide (https://datafusion.apache.org/comet/contributor-guide/debugging.html)"

| spark.comet.convert.json.enabled | When enabled, data from Spark (non-native) JSON v1 and v2 scans will be converted to Arrow format. Note that to enable native vectorized execution, both this config and 'spark.comet.exec.enabled' need to be enabled. | false |
| spark.comet.convert.parquet.enabled | When enabled, data from Spark (non-native) Parquet v1 and v2 scans will be converted to Arrow format. Note that to enable native vectorized execution, both this config and 'spark.comet.exec.enabled' need to be enabled. | false |
| spark.comet.debug.enabled | Whether to enable debug mode for Comet. When enabled, Comet will do additional checks for debugging purpose. For example, validating array when importing arrays from JVM at native side. Note that these checks may be expensive in performance and should only be enabled for debugging purpose. | false |
| spark.comet.debug.memory | When enabled, log all native memory pool interactions. For more information, refer to the Comet Debugging Guide (https://datafusion.apache.org/comet/contributor-guide/debugging.html. | false |
Member:

Suggested change
| spark.comet.debug.memory | When enabled, log all native memory pool interactions. For more information, refer to the Comet Debugging Guide (https://datafusion.apache.org/comet/contributor-guide/debugging.html. | false |
| spark.comet.debug.memory | When enabled, log all native memory pool interactions. For more information, refer to the Comet Debugging Guide (https://datafusion.apache.org/comet/contributor-guide/debugging.html). | false |

ap.add_argument("--task", default=None, help="Task ID.")
ap.add_argument("--file", default=None, help="Spark log containing memory debug output")
args = ap.parse_args()
main(args.file, int(args.task))
Member:

The task is an optional parameter. Calling int(None) will fail with a TypeError.
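
One way to keep the task optional would be to guard the conversion, e.g. (a sketch, not necessarily the fix adopted in the PR):

```python
args = ap.parse_args()
# Only convert when --task was actually supplied; otherwise pass None through.
task = int(args.task) if args.task is not None else None
main(args.file, task)
```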

size = int(re_match.group(4))

if alloc.get(consumer) is None:
    alloc[consumer] = size
Member:

Would it be possible for the method to be shrink on the first occurrence?
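
If a shrink can indeed be the first event seen for a consumer, one defensive option is to start every consumer at zero, e.g. (a hedged sketch using the variable names from the snippet above; the real script's branching may differ, and success/failure of try_grow is ignored for brevity):

```python
# Start unseen consumers at 0 so a shrink-first log line adjusts a
# zero balance instead of raising a KeyError or storing a wrong value.
alloc.setdefault(consumer, 0)
if method in ("grow", "try_grow"):
    alloc[consumer] += size
elif method == "shrink":
    alloc[consumer] -= size
```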

elif method == "shrink":
    alloc[consumer] = alloc[consumer] - size

print(consumer, ",", alloc[consumer])
Member:

Suggested change
print(consumer, ",", alloc[consumer])
print(f"{consumer},{alloc[consumer]}")

nit: to avoid the extra spaces around each item


# Pivot the data to have consumers as columns
pivot_df = df.pivot(index='time', columns='name', values='size')
pivot_df = pivot_df.fillna(method='ffill').fillna(0)
Member:

Suggested change
pivot_df = pivot_df.fillna(method='ffill').fillna(0)
pivot_df = pivot_df.ffill().fillna(0)

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html - Deprecated since version 2.1.0: Use ffill or bfill instead.
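
For reference, a self-contained sketch of the pivot-plus-forward-fill pattern with hypothetical data (not taken from the PR):

```python
import pandas as pd

# Each consumer reports sizes at different times, so the pivoted frame
# has NaN gaps; ffill() carries each consumer's last known size forward.
df = pd.DataFrame({
    "time": [1, 2, 3],
    "name": ["ExternalSorter[6]", "ExternalSorterMerge[6]", "ExternalSorter[6]"],
    "size": [100, 50, 120],
})
pivot_df = df.pivot(index="time", columns="name", values="size")
pivot_df = pivot_df.ffill().fillna(0)
print(pivot_df)
```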

let tracing_enabled = spark_config.get_bool(COMET_TRACING_ENABLED);
let max_temp_directory_size =
spark_config.get_u64(COMET_MAX_TEMP_DIRECTORY_SIZE, 100 * 1024 * 1024 * 1024);
let logging_memory_pool = spark_config.get_bool(COMET_DEBUG_MEMORY);
Member:

Suggested change
let logging_memory_pool = spark_config.get_bool(COMET_DEBUG_MEMORY);
let debug_memory_enabled = spark_config.get_bool(COMET_DEBUG_MEMORY);

let memory_pool =
    create_memory_pool(&memory_pool_config, task_memory_manager, task_attempt_id);

let memory_pool = if logging_memory_pool {
Member:

Suggested change
let memory_pool = if logging_memory_pool {
let memory_pool = if debug_memory_enabled {
