Conversation

@gaogaotiantian gaogaotiantian commented Jan 7, 2026

What changes were proposed in this pull request?

Add an optional capability to dump thread info for all PySpark processes. It is intentionally hidden for now because it is not fully polished. It can be invoked as python -m pyspark.threaddump -p <pid>, and it requires pystack and psutil; without these libraries the command fails.
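The module's source is not shown in this conversation, but the approach described (walk the driver's process tree with psutil, attach the pystack CLI to each PID) can be sketched roughly as below. dump_all_threads and pystack_cmd are illustrative names for this sketch, not the actual API:

```python
import subprocess
import sys


def pystack_cmd(pid: int) -> list[str]:
    # "pystack remote <pid>" attaches to a live CPython process (Linux only).
    return ["pystack", "remote", str(pid)]


def dump_all_threads(root_pid: int) -> None:
    try:
        import psutil  # third-party: used to walk the driver's process tree
    except ImportError:
        sys.exit("psutil is required: pip install psutil")
    root = psutil.Process(root_pid)
    # Dump the driver first, then every daemon/worker child it spawned.
    for proc in [root, *root.children(recursive=True)]:
        print(f"Dumping threads for process {proc.pid}")
        result = subprocess.run(
            pystack_cmd(proc.pid), capture_output=True, text=True
        )
        print(result.stdout)
```

Passing the driver PID as root_pid is what lets the tool reach the daemon and worker subprocesses, which faulthandler-based approaches cannot do.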

For now it is only used when a test hangs. The result looks like:

Thread dump:
Dumping threads for process 1904
Traceback for thread 2175 (python3.12) [] (most recent call last):
    (Python) File "/usr/lib/python3.12/threading.py", line 1032, in _bootstrap
        self._bootstrap_inner()
    (Python) File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
        self.run()
    (Python) File "/usr/lib/python3.12/threading.py", line 1012, in run
        self._target(*self._args, **self._kwargs)
    (Python) File "/usr/lib/python3.12/socketserver.py", line 240, in serve_forever
        self._handle_request_noblock()
    (Python) File "/usr/lib/python3.12/socketserver.py", line 318, in _handle_request_noblock
        self.process_request(request, client_address)
    (Python) File "/usr/lib/python3.12/socketserver.py", line 349, in process_request
        self.finish_request(request, client_address)
    (Python) File "/usr/lib/python3.12/socketserver.py", line 362, in finish_request
        self.RequestHandlerClass(request, client_address, self)
    (Python) File "/usr/lib/python3.12/socketserver.py", line 766, in __init__
        self.handle()
    (Python) File "/workspaces/spark/python/pyspark/accumulators.py", line 327, in handle
        poll(accum_updates)
    (Python) File "/workspaces/spark/python/pyspark/accumulators.py", line 281, in poll
        for fd, event in poller.poll(1000):

Traceback for thread 2034 (python3.12) [] (most recent call last):
    (Python) File "/usr/lib/python3.12/threading.py", line 1032, in _bootstrap
        self._bootstrap_inner()
    (Python) File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
        self.run()
    (Python) File "/workspaces/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/clientserver.py", line 58, in run

Traceback for thread 1904 (python3.12) [] (most recent call last):
    (Python) File "<frozen runpy>", line 198, in _run_module_as_main
    (Python) File "<frozen runpy>", line 88, in _run_code
    (Python) File "/workspaces/spark/python/pyspark/sql/tests/test_udf.py", line 1790, in <module>
        unittest.main(testRunner=testRunner, verbosity=2)
    (Python) File "/workspaces/spark/python/pyspark/testing/__init__.py", line 30, in unittest_main
        res = _unittest_main(*args, **kwargs)
    (Python) File "/usr/lib/python3.12/unittest/main.py", line 105, in __init__
        self.runTests()
    (Python) File "/usr/lib/python3.12/unittest/main.py", line 281, in runTests
        self.result = testRunner.run(self.test)
    (Python) File "/usr/local/lib/python3.12/dist-packages/xmlrunner/runner.py", line 67, in run
        test(result)
    (Python) File "/usr/lib/python3.12/unittest/suite.py", line 84, in __call__
        return self.run(*args, **kwds)
    (Python) File "/usr/lib/python3.12/unittest/suite.py", line 122, in run
        test(result)
    (Python) File "/usr/lib/python3.12/unittest/suite.py", line 84, in __call__
        return self.run(*args, **kwds)
    (Python) File "/usr/lib/python3.12/unittest/suite.py", line 122, in run
        test(result)
    (Python) File "/usr/lib/python3.12/unittest/case.py", line 690, in __call__
        return self.run(*args, **kwds)
    (Python) File "/usr/lib/python3.12/unittest/case.py", line 634, in run
        self._callTestMethod(testMethod)
    (Python) File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
        if method() is not None:
    (Python) File "/workspaces/spark/python/pyspark/sql/tests/test_udf.py", line 212, in test_chained_udf
        [row] = self.spark.sql("SELECT double_int(double_int(1) + 1)").collect()
    (Python) File "/workspaces/spark/python/pyspark/sql/classic/dataframe.py", line 469, in collect
        sock_info = self._jdf.collectToPython()
    (Python) File "/workspaces/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1361, in __call__
    (Python) File "/workspaces/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1038, in send_command
    (Python) File "/workspaces/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/clientserver.py", line 535, in send_command
    (Python) File "/usr/lib/python3.12/socket.py", line 720, in readinto
        return self._sock.recv_into(b)

Dumping threads for process 2191
Traceback for thread 2191 (python3.12) [] (most recent call last):
    (Python) File "<frozen runpy>", line 198, in _run_module_as_main
    (Python) File "<frozen runpy>", line 88, in _run_code
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 287, in <module>
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 180, in manager

Dumping threads for process 2198
Traceback for thread 2198 (python3.12) [] (most recent call last):
    (Python) File "<frozen runpy>", line 198, in _run_module_as_main
    (Python) File "<frozen runpy>", line 88, in _run_code
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 287, in <module>
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 259, in manager
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 88, in worker
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/util.py", line 981, in wrapper
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/worker.py", line 3439, in main
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 583, in read_int

Dumping threads for process 2257
Traceback for thread 2257 (python3.12) [] (most recent call last):
    (Python) File "<frozen runpy>", line 198, in _run_module_as_main
    (Python) File "<frozen runpy>", line 88, in _run_code
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 287, in <module>
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 259, in manager
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 88, in worker
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/util.py", line 981, in wrapper
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/worker.py", line 3439, in main
    (Python) File "/workspaces/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 583, in read_int

Notice that it includes not only the driver process but also the daemon and worker processes. The plan is to incorporate this into our existing debug framework so that the thread dump button returns both JVM executor threads and Python worker threads.

Why are the changes needed?

We need insight into the Python worker and daemon processes.

We have some thread dump capability in our tests, but it is not stable: SIGTERM is sometimes hooked elsewhere, so faulthandler cannot work properly. It also cannot dump subprocesses.
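For context, the existing mechanism is essentially the stdlib faulthandler pattern below (a generic sketch, not the exact test code). It only covers the current process, and it breaks whenever something else installs its own handler on the same signal:

```python
import faulthandler
import signal

# Dump tracebacks of all threads in *this* process when SIGTERM arrives.
# Limitations noted above: a SIGTERM handler installed later replaces this
# one, and daemon/worker subprocesses are never covered.
faulthandler.register(signal.SIGTERM, all_threads=True)
```

Attaching from the outside with pystack sidesteps both problems: no signal handler in the target process is needed, and every PID in the tree can be inspected.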

Does this PR introduce any user-facing change?

Yes, but it is hidden for now: a new command entry point is introduced.

How was this patch tested?

It works locally.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot commented Jan 7, 2026

JIRA Issue Information

=== New Feature SPARK-54925 ===
Summary: Add the capability in pyspark to dump thread info from all processes
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

flameprof==0.4
viztracer
debugpy
pystack>=1.5.1; python_version!='3.13' and sys_platform=='linux' # no 3.13t wheels
Contributor:

it is only available on linux?

Contributor Author:

Unfortunately yes, pystack only supports Linux. The good news is that covers most of our users (and most of our CI). We can't use it locally on macOS, though.

Python 3.14 comes with a remote exec capability that can do similar things (and supports all platforms). My goal is to use the native Python mechanism when available and eventually drop pystack.
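Under that plan, backend selection might look like the sketch below. traceback_backend is a hypothetical helper for illustration; sys.remote_exec is the remote-execution API added in Python 3.14 (PEP 768):

```python
import sys


def traceback_backend() -> str:
    """Pick a mechanism for attaching to a remote interpreter."""
    # Python 3.14+ ships sys.remote_exec (PEP 768), which can inject a
    # script into another running CPython process on any supported platform.
    if hasattr(sys, "remote_exec"):
        return "remote_exec"
    # Older interpreters fall back to the pystack CLI, which is Linux-only.
    if sys.platform == "linux":
        return "pystack"
    raise RuntimeError("no thread-dump backend available on this platform")
```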

@zhengruifeng
Contributor

thanks, merged to master

@gaogaotiantian gaogaotiantian deleted the python-threaddump branch January 8, 2026 01:32
