Segfault in //tests:shard_map_test_cpu under TSAN CI 3.13-ft job #27995

@vfdev-5

Description

The TSAN CI 3.13 job (https://github.com/jax-ml/jax/actions/runs/14437900284/job/40482078527) reports a segfault in //tests:shard_map_test_cpu:

2025-04-14T05:56:14.4946356Z FAIL: //tests:shard_map_test_cpu (shard 29 of 50) (see /__w/.cache/bazel/bazel/_bazel_root/a40625930fdef4f0a3483cd60aa1bb86/execroot/__main__/bazel-out/k8-opt/testlogs/tests/shard_map_test_cpu/shard_29_of_50/test.log)
2025-04-14T05:56:14.4985132Z ==================== Test output for //tests:shard_map_test_cpu (shard 29 of 50):
2025-04-14T05:56:14.4989472Z Running tests under Python 3.13.3: /__w/.cache/bazel/bazel/_bazel_root/a40625930fdef4f0a3483cd60aa1bb86/execroot/__main__/bazel-out/k8-opt/bin/tests/shard_map_test_cpu.runfiles/python_x86_64-unknown-linux-gnu-freethreaded/bin/python3
2025-04-14T05:56:14.4991543Z INFO:2025-04-14 05:56:10,685:jax._src.test_loader:131: Test start: __main__.ShardMapTest.test_check_rep_false_doesnt_hit_rep_rules
2025-04-14T05:56:14.4992851Z INFO:2025-04-14 05:56:10,687:jax._src.test_loader:131: Test start: __main__.ShardMapTest.test_forwarding_correctness27 (1, 2, 3)
2025-04-14T05:56:14.4994104Z I0414 05:56:10.685487 137516109289152 test_loader.py:131] Test start: __main__.ShardMapTest.test_check_rep_false_doesnt_hit_rep_rules
2025-04-14T05:56:14.4995648Z INFO: From Testing //tests:shard_map_test_cpu (shard 29 of 50):
2025-04-14T05:56:14.4997181Z INFO:2025-04-14 05:56:10,689:jax._src.test_loader:131: Test start: __main__.ShardMapTest.test_identity
2025-04-14T05:56:14.4998305Z I0414 05:56:10.687771 137516100896448 test_loader.py:131] Test start: __main__.ShardMapTest.test_forwarding_correctness27 (1, 2, 3)
2025-04-14T05:56:14.4999450Z I0414 05:56:10.689798 137516092503744 test_loader.py:131] Test start: __main__.ShardMapTest.test_identity
2025-04-14T05:56:14.5001073Z INFO:2025-04-14 05:56:10,693:jax._src.test_loader:131: Test start: __main__.ShardMapTest.test_reduce_scatter_with_axis_index_groups
2025-04-14T05:56:14.5002420Z I0414 05:56:10.693300 137516084111040 test_loader.py:131] Test start: __main__.ShardMapTest.test_reduce_scatter_with_axis_index_groups
2025-04-14T05:56:14.5003683Z INFO:2025-04-14 05:56:10,695:jax._src.test_loader:131: Test start: __main__.ShardMapTest.test_vmap_grad_shmap_spmd_axis_name_residuals
2025-04-14T05:56:14.5004935Z I0414 05:56:10.695018 137516075718336 test_loader.py:131] Test start: __main__.ShardMapTest.test_vmap_grad_shmap_spmd_axis_name_residuals
2025-04-14T05:56:14.5006552Z INFO:2025-04-14 05:56:10,968:jax._src.xla_bridge:867: Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
2025-04-14T05:56:14.5008307Z I0414 05:56:10.968123 137516109289152 xla_bridge.py:867] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
2025-04-14T05:56:14.5009428Z Fatal Python error: Segmentation fault
2025-04-14T05:56:14.5009712Z 
2025-04-14T05:56:14.5009905Z Thread 0x00007d11f4d816c0 (most recent call first):
2025-04-14T05:56:14.5011076Z   File "/__w/.cache/bazel/bazel/_bazel_root/a40625930fdef4f0a3483cd60aa1bb86/execroot/__main__/bazel-out/k8-opt/bin/tests/shard_map_test_cpu.runfiles/__main__/jax/_src/mesh.py", line 335 in <genexpr>
2025-04-14T05:56:14.5012207Z 
2025-04-14T05:56:14.5012408Z Thread 0x00007d11f55826c0 (most recent call first):
2025-04-14T05:56:14.5013677Z   File "/__w/.cache/bazel/bazel/_bazel_root/a40625930fdef4f0a3483cd60aa1bb86/external/python_x86_64-unknown-linux-gnu-freethreaded/lib/python3.13t/functools.py", line 397 in _unwrap_partial
2025-04-14T05:56:14.5015925Z   File "/__w/.cache/bazel/bazel/_bazel_root/a40625930fdef4f0a3483cd60aa1bb86/external/python_x86_64-unknown-linux-gnu-freethreaded/lib/python3.13t/inspect.py", line 410 in _has_code_flag
2025-04-14T05:56:14.5017287Z   File ================================================================================

Locally, the segfault is not always reproducible, but the test has been seen consuming more than 100 GB of RAM at peak.
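Since the crash is intermittent, one way to raise the odds of hitting it locally is to run the suspect code from several threads at once, many times over. The helper below is only a minimal sketch of that idea using the standard library: `stress`, `num_threads`, and `iterations` are hypothetical names, and this is not how the JAX test loader schedules tests. It can only collect Python exceptions; a real segfault would still terminate the process.

```python
import threading


def stress(fn, num_threads=8, iterations=100):
    """Call fn() concurrently from num_threads threads, iterations times each.

    Hypothetical helper for illustration only. Returns the list of Python
    exceptions raised by fn. A segfault cannot be caught and would kill
    the interpreter.
    """
    barrier = threading.Barrier(num_threads)
    errors = []
    errors_lock = threading.Lock()

    def worker():
        # Release all threads together so their calls overlap as much as possible.
        barrier.wait()
        for _ in range(iterations):
            try:
                fn()
            except Exception as exc:
                with errors_lock:
                    errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

To use it, pass a zero-argument callable that wraps the body of the test under suspicion, and run it with the free-threaded interpreter so that calls can actually execute in parallel.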

System info (python version, jaxlib version, accelerator, etc.)

TSAN CI 3.13 Job
Linux
CPU

Metadata

Labels

bug (Something isn't working), free threading (Issues found in free threading builds)