Commit a5f36a8
[DTensor] Fix deadlock after fast cache clear (pytorch#168069)
This is the necessary fix for meta-pytorch/autoparallel#256.
### Issue:
when we call `_clear_fast_path_sharding_prop_cache()`, and then `get_thread_local_native_sharding_propagator_cache()`, the code will stuck due to deadlock.
### Cause:
When you assign to a Python dict key that already exists:
```C++
thread_dict["__DTensor_fastpath_thread_cache_cleanup"] = old_capsule // capsule #1 stored
...
clear_DTensor_sharding_propagator_cache() // call to clean up the cache
...
get_thread_local_native_sharding_propagator_cache() {
std::lock_guard<std::mutex> lock(
native_sharding_propagator_cache_cleanup_mutex); // FIRST claims the lock!
if (!native_sharding_propagator_cache_DO_NOT_USE.has_value()) { // enter this again because we have cleared the cache.
...
// Destroys old_capsule FIRST then stores new_capsule. However, where we destroy the old_capsule,
// it will trigger the destructor to claim `native_sharding_propagator_cache_cleanup_mutex` again!
thread_dict["__DTensor_fastpath_thread_cache_cleanup"] = new_capsule // SECOND claims the lock before FIRST releases
}
}
```
Pull Request resolved: pytorch#168069
Approved by: https://github.com/ezyang1 parent 1c0bf2a commit a5f36a8
File tree
2 files changed
+49
-20
lines changed- test/distributed/tensor
- torch/csrc/autograd
2 files changed
+49
-20
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
34 | 34 | | |
35 | 35 | | |
36 | 36 | | |
37 | | - | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
38 | 42 | | |
39 | 43 | | |
40 | 44 | | |
| |||
479 | 483 | | |
480 | 484 | | |
481 | 485 | | |
482 | | - | |
| 486 | + | |
| 487 | + | |
483 | 488 | | |
484 | 489 | | |
485 | 490 | | |
| |||
645 | 650 | | |
646 | 651 | | |
647 | 652 | | |
| 653 | + | |
| 654 | + | |
| 655 | + | |
| 656 | + | |
| 657 | + | |
| 658 | + | |
| 659 | + | |
| 660 | + | |
| 661 | + | |
| 662 | + | |
| 663 | + | |
| 664 | + | |
| 665 | + | |
| 666 | + | |
| 667 | + | |
| 668 | + | |
| 669 | + | |
| 670 | + | |
| 671 | + | |
| 672 | + | |
| 673 | + | |
| 674 | + | |
648 | 675 | | |
649 | 676 | | |
650 | 677 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1200 | 1200 | | |
1201 | 1201 | | |
1202 | 1202 | | |
1203 | | - | |
1204 | | - | |
1205 | | - | |
1206 | | - | |
1207 | | - | |
1208 | | - | |
1209 | | - | |
1210 | | - | |
1211 | | - | |
1212 | | - | |
1213 | | - | |
1214 | | - | |
1215 | | - | |
1216 | | - | |
1217 | | - | |
| 1203 | + | |
| 1204 | + | |
| 1205 | + | |
| 1206 | + | |
| 1207 | + | |
| 1208 | + | |
| 1209 | + | |
| 1210 | + | |
| 1211 | + | |
| 1212 | + | |
| 1213 | + | |
| 1214 | + | |
| 1215 | + | |
| 1216 | + | |
| 1217 | + | |
| 1218 | + | |
| 1219 | + | |
1218 | 1220 | | |
1219 | | - | |
1220 | | - | |
1221 | | - | |
| 1221 | + | |
| 1222 | + | |
| 1223 | + | |
1222 | 1224 | | |
1223 | 1225 | | |
1224 | 1226 | | |
| |||
0 commit comments