Closed
Labels: bug (Something isn't working), free threading (Issues found in free threading builds)
Description
The TSAN CI 3.13 job (https://github.com/jax-ml/jax/actions/runs/14437900284/job/40482078527) reports a segfault in //tests:shard_map_test_cpu:
2025-04-14T05:56:14.4946356Z FAIL: //tests:shard_map_test_cpu (shard 29 of 50) (see /__w/.cache/bazel/bazel/_bazel_root/a40625930fdef4f0a3483cd60aa1bb86/execroot/__main__/bazel-out/k8-opt/testlogs/tests/shard_map_test_cpu/shard_29_of_50/test.log)
2025-04-14T05:56:14.4985132Z ==================== Test output for //tests:shard_map_test_cpu (shard 29 of 50):
2025-04-14T05:56:14.4989472Z Running tests under Python 3.13.3: /__w/.cache/bazel/bazel/_bazel_root/a40625930fdef4f0a3483cd60aa1bb86/execroot/__main__/bazel-out/k8-opt/bin/tests/shard_map_test_cpu.runfiles/python_x86_64-unknown-linux-gnu-freethreaded/bin/python3
2025-04-14T05:56:14.4991543Z INFO:2025-04-14 05:56:10,685:jax._src.test_loader:131: Test start: __main__.ShardMapTest.test_check_rep_false_doesnt_hit_rep_rules
2025-04-14T05:56:14.4992851Z INFO:2025-04-14 05:56:10,687:jax._src.test_loader:131: Test start: __main__.ShardMapTest.test_forwarding_correctness27 (1, 2, 3)
2025-04-14T05:56:14.4994104Z I0414 05:56:10.685487 137516109289152 test_loader.py:131] Test start: __main__.ShardMapTest.test_check_rep_false_doesnt_hit_rep_rules
2025-04-14T05:56:14.4995648Z INFO: From Testing //tests:shard_map_test_cpu (shard 29 of 50):
2025-04-14T05:56:14.4997181Z INFO:2025-04-14 05:56:10,689:jax._src.test_loader:131: Test start: __main__.ShardMapTest.test_identity
2025-04-14T05:56:14.4998305Z I0414 05:56:10.687771 137516100896448 test_loader.py:131] Test start: __main__.ShardMapTest.test_forwarding_correctness27 (1, 2, 3)
2025-04-14T05:56:14.4999450Z I0414 05:56:10.689798 137516092503744 test_loader.py:131] Test start: __main__.ShardMapTest.test_identity
2025-04-14T05:56:14.5001073Z INFO:2025-04-14 05:56:10,693:jax._src.test_loader:131: Test start: __main__.ShardMapTest.test_reduce_scatter_with_axis_index_groups
2025-04-14T05:56:14.5002420Z I0414 05:56:10.693300 137516084111040 test_loader.py:131] Test start: __main__.ShardMapTest.test_reduce_scatter_with_axis_index_groups
2025-04-14T05:56:14.5003683Z INFO:2025-04-14 05:56:10,695:jax._src.test_loader:131: Test start: __main__.ShardMapTest.test_vmap_grad_shmap_spmd_axis_name_residuals
2025-04-14T05:56:14.5004935Z I0414 05:56:10.695018 137516075718336 test_loader.py:131] Test start: __main__.ShardMapTest.test_vmap_grad_shmap_spmd_axis_name_residuals
2025-04-14T05:56:14.5006552Z INFO:2025-04-14 05:56:10,968:jax._src.xla_bridge:867: Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
2025-04-14T05:56:14.5008307Z I0414 05:56:10.968123 137516109289152 xla_bridge.py:867] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
2025-04-14T05:56:14.5009428Z Fatal Python error: Segmentation fault
2025-04-14T05:56:14.5009712Z
2025-04-14T05:56:14.5009905Z Thread 0x00007d11f4d816c0 (most recent call first):
2025-04-14T05:56:14.5011076Z File "/__w/.cache/bazel/bazel/_bazel_root/a40625930fdef4f0a3483cd60aa1bb86/execroot/__main__/bazel-out/k8-opt/bin/tests/shard_map_test_cpu.runfiles/__main__/jax/_src/mesh.py", line 335 in <genexpr>
2025-04-14T05:56:14.5012207Z
2025-04-14T05:56:14.5012408Z Thread 0x00007d11f55826c0 (most recent call first):
2025-04-14T05:56:14.5013677Z File "/__w/.cache/bazel/bazel/_bazel_root/a40625930fdef4f0a3483cd60aa1bb86/external/python_x86_64-unknown-linux-gnu-freethreaded/lib/python3.13t/functools.py", line 397 in _unwrap_partial
2025-04-14T05:56:14.5015925Z File "/__w/.cache/bazel/bazel/_bazel_root/a40625930fdef4f0a3483cd60aa1bb86/external/python_x86_64-unknown-linux-gnu-freethreaded/lib/python3.13t/inspect.py", line 410 in _has_code_flag
2025-04-14T05:56:14.5017287Z File ================================================================================
The crash is not always reproducible locally, but the test has been seen consuming more than 100 GB of RAM at peak.
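Because the failure is intermittent, a small stress harness can help when trying to trigger it locally. The following is only a sketch under stated assumptions, not the project's test setup: `stress`, `peak_rss_mib`, and the `touch_dict` workload are hypothetical names introduced here, and the dictionary workload is a stand-in for the real test body rather than a confirmed reproducer. It runs a callable from several threads at once, enables `faulthandler` so a fatal signal dumps all thread stacks, and reports the peak resident set size.

```python
import faulthandler
import resource
import sys
import threading


def stress(fn, num_threads=8, iterations=100):
    """Call fn concurrently from num_threads threads, iterations times each.

    Returns the list of Python-level exceptions raised by workers.
    A segfault will not be captured here; it terminates the process.
    """
    errors = []
    barrier = threading.Barrier(num_threads)

    def worker():
        # Release all threads together to widen potential race windows.
        barrier.wait()
        try:
            for _ in range(iterations):
                fn()
        except Exception as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


def peak_rss_mib():
    """Peak resident set size of this process in MiB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    if sys.platform.startswith("linux"):
        return rss / 1024
    return rss / (1024 * 1024)


if __name__ == "__main__":
    # Dump the stacks of all threads if the process receives a fatal signal.
    faulthandler.enable(all_threads=True)

    shared = {}

    def touch_dict():
        # Hypothetical workload: concurrent reads and writes on a shared dict.
        shared[threading.get_ident()] = len(shared)
        shared.get(0)

    errs = stress(touch_dict)
    print(f"python errors: {len(errs)}, peak RSS: {peak_rss_mib():.1f} MiB")
```

To exercise the actual failing code path, `touch_dict` would be replaced with a call into the code under test, and the script would be run with a free-threaded interpreter.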
System info (python version, jaxlib version, accelerator, etc.)
TSAN CI 3.13 Job
Linux
CPU
- Related to python/cpython#132869 (Crash due to racy read in dictobject do_lookup under free threading)
- The traceback was suppressed by an entry in the TSAN suppressions list; see #28245 (Updated TSAN suppressions files to get traceback of crashed process)