Commit ae605d1

TL/CUDA: fix oob_req leak and double-free in NVLS init cleanup
Two bugs in the BARRIER state error path and the cleanup section:

1. When req_test returns a negative status (barrier failure), nvls->barrier_data was freed but req_free was never called on team->oob_req, leaking the OOB transport request handle. Call req_free and set the pointer to NULL before goto cleanup.

2. nvls->mc_va, nvls->uc_va, and nvls->mc_memhandle are stored at the end of STATE_ADD_DEVICE before falling through to STATE_BARRIER. If the barrier then fails and jumps to cleanup, the cleanup block frees these resources via local variables but leaves the nvls struct fields nonzero. A subsequent ucc_tl_cuda_nvls_destroy call then unmaps or releases them again, causing a double free and CUDA resource corruption. Zero the nvls fields immediately after the local-variable cleanup blocks.
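The first fix (release the OOB request on the barrier failure path) can be sketched in isolation. This is a minimal illustration, not the UCC code: `mock_oob_t`, `mock_team_t`, `mock_req_alloc`, `mock_req_test_fail`, `mock_barrier_step`, and the `live_requests` counter are all hypothetical stand-ins, and the real `req_test`/`req_free` callbacks are assumed to behave roughly like the mocks below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the OOB callbacks and the team; not UCC types. */
typedef struct mock_oob {
    int  (*req_test)(void *req); /* assumed: <0 error, 0 done, >0 in progress */
    void (*req_free)(void *req);
} mock_oob_t;

typedef struct mock_team {
    mock_oob_t oob;
    void      *oob_req;
    void      *barrier_data;
} mock_team_t;

/* Counts outstanding requests so a leak is observable. */
static int live_requests = 0;

static void *mock_req_alloc(void)
{
    void *req = malloc(1);
    if (req != NULL) {
        live_requests++;
    }
    return req;
}

static void mock_req_free(void *req)
{
    if (req != NULL) {
        free(req);
        live_requests--;
    }
}

/* Simulates a barrier that always fails. */
static int mock_req_test_fail(void *req)
{
    (void)req;
    return -1;
}

/* Returns 0 on success and -1 on failure. On failure, both the request
 * handle and the barrier buffer are released and their pointers cleared. */
static int mock_barrier_step(mock_team_t *team)
{
    int status;

    assert(team != NULL);
    status = team->oob.req_test(team->oob_req);
    if (status < 0) {
        team->oob.req_free(team->oob_req); /* the call the fix adds */
        team->oob_req = NULL;
        free(team->barrier_data);
        team->barrier_data = NULL;
        return -1;
    }
    return 0;
}
```

Clearing `oob_req` after freeing it means any later code that checks the pointer sees that no request is pending.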
1 parent 2551467 commit ae605d1

1 file changed, with 5 additions and 0 deletions

src/components/tl/cuda/tl_cuda_nvls.c

Lines changed: 5 additions & 0 deletions
@@ -695,6 +695,8 @@ ucc_status_t ucc_tl_cuda_nvls_init(
         }
         if (status < 0) {
             tl_error(UCC_TL_TEAM_LIB(team), "NVLS barrier failed");
+            team->oob.req_free(team->oob_req);
+            team->oob_req = NULL;
             ucc_free(nvls->barrier_data);
             nvls->barrier_data = NULL;
             goto cleanup;
@@ -738,6 +740,7 @@ ucc_status_t ucc_tl_cuda_nvls_init(
             tl_error(UCC_TL_TEAM_LIB(team),
                      "failed to free mc_va during cleanup");
         }
+        nvls->mc_va = 0;
     }
 
     // Unmap and free unicast VA if it was reserved/mapped
@@ -750,6 +753,7 @@ ucc_status_t ucc_tl_cuda_nvls_init(
             tl_error(UCC_TL_TEAM_LIB(team),
                      "failed to free uc_va during cleanup");
         }
+        nvls->uc_va = 0;
     }
 
     // Release memory handle if it was created
@@ -758,6 +762,7 @@ ucc_status_t ucc_tl_cuda_nvls_init(
             tl_error(UCC_TL_TEAM_LIB(team),
                      "failed to release mem_handle during cleanup");
         }
+        nvls->mc_memhandle = 0;
     }
 
     // Release multicast handle if it was created or imported
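The three zeroing hunks above follow a common pattern: once a resource has been released through a local copy, the struct field that still refers to it is cleared so a later destroy call skips it. The sketch below shows that pattern with mock types only. `mock_nvls_t`, `mock_release`, `mock_init_cleanup`, and `mock_destroy` are hypothetical, and it is an assumption here that the real `ucc_tl_cuda_nvls_destroy` guards each release on a nonzero field, as the commit message implies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the three fields touched by the diff. */
typedef struct mock_nvls {
    uintptr_t mc_va;
    uintptr_t uc_va;
    uint64_t  mc_memhandle;
} mock_nvls_t;

/* Counts releases; a double release would push the count past 3. */
static int release_calls = 0;

static void mock_release(uint64_t handle)
{
    assert(handle != 0); /* releasing a zero handle would be a bug */
    release_calls++;
}

/* Failure path of init: resources are released through local copies,
 * then the matching struct fields are zeroed (the lines the fix adds). */
static void mock_init_cleanup(mock_nvls_t *nvls, uintptr_t mc_va,
                              uintptr_t uc_va, uint64_t mc_memhandle)
{
    if (mc_va) {
        mock_release(mc_va);
        nvls->mc_va = 0;
    }
    if (uc_va) {
        mock_release(uc_va);
        nvls->uc_va = 0;
    }
    if (mc_memhandle) {
        mock_release(mc_memhandle);
        nvls->mc_memhandle = 0;
    }
}

/* Assumed shape of destroy: release only what the struct still owns. */
static void mock_destroy(mock_nvls_t *nvls)
{
    if (nvls->mc_va) {
        mock_release(nvls->mc_va);
        nvls->mc_va = 0;
    }
    if (nvls->uc_va) {
        mock_release(nvls->uc_va);
        nvls->uc_va = 0;
    }
    if (nvls->mc_memhandle) {
        mock_release(nvls->mc_memhandle);
        nvls->mc_memhandle = 0;
    }
}
```

Without the zeroing in `mock_init_cleanup`, a following `mock_destroy` would release each resource a second time, which is the double free the commit message describes.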
