Releases: openucx/ucc
v1.7.0-rc1
New Features and Enhancements
Core
- Ported UCS logger from libucs to UCC, enabling file filtering and log-to-file features {PR #1191}
TL/CUDA
- Added multinode NVLS support using CUDA fabric handles for cross-node allreduce {PR #1185}
- Added NVLS reduce_scatterv with BF16 datatype support and kernel-based synchronization {PR #1211}
- Added ptrace permissions for NVLS POSIX handle sharing via pidfd_getfd {PR #1218}
- Added NVLS allgatherv using multimem.st instructions with 16-byte alignment {PR #1240}
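The 16-byte alignment mentioned for the NVLS allgatherv entry can be illustrated with a small helper. This is a minimal sketch under stated assumptions, not code from UCC: the names `is_aligned` and `align_up` are hypothetical, and these notes do not describe how PR #1240 actually checks or enforces alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of the UCC API. */

/* Returns 1 if ptr is a multiple of alignment, 0 otherwise.
 * alignment must be a non-zero power of two (for example 16). */
static inline int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Rounds size up to the next multiple of alignment.
 * alignment must be a non-zero power of two (for example 16). */
static inline size_t align_up(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}
```

A caller could, for instance, pad a buffer length with `align_up(len, 16)` and fall back to a different code path when `is_aligned(buf, 16)` returns 0; whether UCC does either is an assumption here.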
TL/UCP
- Added memory type parameter to tl_ucp_put/get for GPU memory in onesided collectives {PR #1253}
- Fixed crashes in inplace mode for allgather, alltoall, and alltoallv {PR #1254}
- Fixed onesided alltoall algorithm selection to default to PUT for 1 PPN {PR #1247}
TL/NCCL
- Added native ncclAlltoAll support for NCCL 2.28.3+ {PR #1244}
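The "NCCL 2.28.3+" condition implies a version gate. The sketch below shows one way such a gate could be expressed; it is not taken from UCC. The helper names `encode_version` and `has_native_alltoall` are hypothetical, and the integer encoding (major * 10000 + minor * 100 + patch) is an assumption modeled on the version code that recent `nccl.h` headers expose.

```c
#include <assert.h>

/* Hypothetical helper: packs a version triple into one comparable integer.
 * The encoding is an assumption modeled on recent NCCL version codes. */
static inline int encode_version(int major, int minor, int patch)
{
    assert(major >= 0);
    assert(minor >= 0 && minor < 100);
    assert(patch >= 0 && patch < 100);
    return major * 10000 + minor * 100 + patch;
}

/* Hypothetical helper: returns 1 if the given version code is at least
 * 2.28.3, the minimum named in the release note, and 0 otherwise. */
static inline int has_native_alltoall(int version_code)
{
    return version_code >= encode_version(2, 28, 3);
}
```

In a real build the comparison would more likely be done at compile time against the macros provided by the installed NCCL header, so that the native path is only compiled when the symbol exists.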
Build and Test
- Bumped version to v1.7 {PR #1225}
- Updated clang-format rules for function wrapping and comment reflow {PR #1192}
- Added Greptile AI code review configuration {PR #1208}
- Improved configure status reporting for CUDA/NVML detection {PR #1239}
- Fixed m4 configuration syntax for CUDA {PR #1252}
- Fixed uninitialized variable warning in MLX5 UMR WQE test {PR #1195}
- Added multinode NVLS tests on GB300 Slurm clusters {PR #1235}
- Added timeout to MPI and DLRM tests to prevent hung jobs {PR #1226}
- Added 90-minute timeout to torch UCC tests {PR #1204}
- Added Blossom CI Jenkins dispatcher job for /build trigger {PR #1229}
- Added GitHub Action workflow for Blossom pipeline initialization {PR #1227}
- Migrated Jenkins credentials to swx-hpcx service account for SSH key rotation {PR #1233}
- Added separate GitHub UI checks for each Jenkins job via Blossom {PR #1237}
- Added Blossom CI separated checks and job output upload to GitHub {PR #1238}
- Fixed Jenkins job folder name and email in CI configuration {PR #1236}
- Fixed clang-format command to use git-clang-format-21 for Ubuntu 22.04 {PR #1212}
- Migrated hpcsdk build from GitHub workflow to Jenkins + CI-DEMO {PR #1215}
- Set Coverity aggressiveness level to medium for better issue detection {PR #1207}
- Fixed parallel GPU tests with CUDA context creation and IB port validation {PR #1209}
- Enabled parallel UCC test execution in CI {PR #1206}
- Fixed Jenkins JJB YAML variable syntax for check separation {PR #1246}
Documentation
- Fixed various typos throughout comments and outputs {PR #1228}
Tools
v1.6.0
New Features and Enhancements
Core
- Added UCC_DEBUGGER_WAIT environment variable {PR #1130}
CL/HIER
- Fixed -Wlto-type-mismatch warning {PR #1179}
TL/CUDA
- Fixed printing of device PCI id {PR #1053}
- Added NVLS improvements and bfloat16 data type support {PR #1162}
- Added NVLS barrier {PR #1180}
- Added Alltoall(v) copy engine {PR #1138}
TL/UCP
- Removed a debug print statement {PR #1177}
- Added knomial allgather with mapped buffers {PR #1176}
- Added node local id config {PR #1189}
- Enabled knomial allgatherv {PR #1188}
- Added congestion avoidant onesided Alltoall {PR #1096}
EC/CUDA
- Fixed cuctx creation in EC CUDA {PR #1219}
Build and Test
- Added check to see if target exists in CMAKE {PR #1173}
- Fixed build with GCC 14 {PR #1190}
- Added gtest and mpi test for ucc_mem_map and ucc_mem_unmap {PR #1165}
- Check for CX7 in wait_on_data gtest {PR #1127}
Tools
v1.6.0-rc2
v1.6.0-rc1
New Features and Enhancements
Core
- Added UCC_DEBUGGER_WAIT environment variable {PR #1130}
CL/HIER
- Fixed -Wlto-type-mismatch warning {PR #1179}
TL/CUDA
- Fixed printing of device PCI id {PR #1053}
- Added NVLS improvements and bfloat16 data type support {PR #1162}
- Added NVLS barrier {PR #1180}
- Added Alltoall(v) copy engine {PR #1138}
TL/UCP
- Removed a debug print statement {PR #1177}
- Added knomial allgather with mapped buffers {PR #1176}
- Added node local id config {PR #1189}
- Enabled knomial allgatherv {PR #1188}
- Added congestion avoidant onesided Alltoall {PR #1096}
Build and Test
- Added check to see if target exists in CMAKE {PR #1173}
- Fixed build with GCC 14 {PR #1190}
- Added gtest and mpi test for ucc_mem_map and ucc_mem_unmap {PR #1165}
Tools
v1.5.1
v1.5.0
New Features and Enhancements
Core
- Added support for ucc_mem_map and ucc_mem_unmap {PR #1070}
- Enhanced error logs in context creation {PR #1135}
- Enhanced error log in collective init {PR #1104}
- Added ucc net devices config {PR #1141}
- EC/CUDA: Link with stdc++ {PR #1168}
CL/HIER
TL/UCP
- Fixed allreduce knomial data consistency {PR #1145}
- Fixed ag oneshot {PR #1134}
- Added Allgather linear implementation {PR #1122}
- Fall back if memh not passed {PR #1136}
TL/MLX5
- Added HCA-assisted copy & CUDA scratch design {PR #1154}
- Added logging for mcast FORCE/TRY modes {PR #1156}
- Fixed segfault in multicast team creation {PR #1150}
- Recover from ipoib issue in mcast init {PR #1140}
- Added configuration to set IB QP SL {PR #1057}
- Added ctx global status check {PR #1113}
- Added cuda support for zcopy mcast {PR #1118}
- Add reliability-init improvements {PR #1163}
TL/CUDA
- Added NVLink SHARP (NVLS) Allreduce {PR #1148}
- Added Topology Cache {PR #1137}
- Added NVLink SHARP (NVLS) Reduce Scatter {PR #1144}
EC/ROCM
- Include stdbool.h for new versions of ROCM {PR #1146}
TOPO
- Node ldr ordered by team {PR #1129}
Build and Test
- Fixed coverity issues {PR #1152}
- Updated cuda arch {PR #1143}
- Changed to CUDA 12.9 {PR #1155}
- Added buffers for onesided tests {PR #1100}
- Added perftest generator {PR #1147}
- Added missing progress calls in UCC_PERFTEST {PR #1151}
- Updated versions in CI {PR #1115}
- Bumped version to v1.5 {PR #1121}
Documentation
- Updated component image 1.4.4 {PR #1153}
Tools
v1.5.0-rc1
New Features and Enhancements
Core
- Added support for ucc_mem_map and ucc_mem_unmap {PR #1070}
- Enhanced error logs in context creation {PR #1135}
- Enhanced error log in collective init {PR #1104}
- Added ucc net devices config {PR #1141}
CL/HIER
TL/UCP
- Fixed allreduce knomial data consistency {PR #1145}
- Fixed ag oneshot {PR #1134}
- Added Allgather linear implementation {PR #1122}
- Fall back if memh not passed {PR #1136}
TL/MLX5
- Added HCA-assisted copy & CUDA scratch design {PR #1154}
- Added logging for mcast FORCE/TRY modes {PR #1156}
- Fixed segfault in multicast team creation {PR #1150}
- Recover from ipoib issue in mcast init {PR #1140}
- Added configuration to set IB QP SL {PR #1057}
- Added ctx global status check {PR #1113}
- Added cuda support for zcopy mcast {PR #1118}
TL/CUDA
- Added NVLink SHARP (NVLS) Allreduce {PR #1148}
- Added Topology Cache {PR #1137}
- Added NVLink SHARP (NVLS) Reduce Scatter {PR #1144}
EC/ROCM
- Include stdbool.h for new versions of ROCM {PR #1146}
TOPO
- Node ldr ordered by team {PR #1129}
Build and Test
- Fixed coverity issues {PR #1152}
- Updated cuda arch {PR #1143}
- Changed to CUDA 12.9 {PR #1155}
- Added buffers for onesided tests {PR #1100}
- Added perftest generator {PR #1147}
- Added missing progress calls in UCC_PERFTEST {PR #1151}
- Updated versions in CI {PR #1115}
- Bumped version to v1.5 {PR #1121}
Documentation
- Updated component image 1.4.4 {PR #1153}
Tools
1.4.4
New Features and Enhancements
Core
- Implemented asymmetric memory support {PR #1000}
- Enhanced error handling and resource cleanup {PR #960, #951}
- Improved service team handling {PR #1046}
- Fixed triggered post for zero size collectives {PR #960}
CL/HIER
- Added allgatherv support {PR #1111}
- Implemented node subgroup unpacking {PR #1103}
- Added reduce to supported collectives {PR #997}
- Fixed integer overflow in alltoall {PR #944}
TL/UCP
- Split single and multithreaded send/receive operations {PR #1109}
- Added knomial allgather with CUDA memory support {PR #1095}
- Implemented reduce SRG knomial algorithm {PR #1058}
- Added radix selection to knomial operations {PR #1072}
- Added sliding window allreduce implementation {PR #958}
- Added knomial allgatherv support {PR #1008}
- Added sparbit algorithm for allgather {PR #940}
- Extended broadcast active set support for size > 2 {PR #926}
- Added knomial algorithm for reduce-scatter {PR #970}
TL/MLX5
- Added multicast-based zero-copy broadcast {PR #1087}
- Implemented mcast multi-group support {PR #1060}
- Added non-blocking CUDA memory copy support {PR #1040}
- Added device memory multicast broadcast {PR #989}
- Enhanced mcast allgather staging-based algorithm {PR #994}
- Improved one-sided mcast reliability initialization {PR #980}
- Various performance optimizations in alltoall {PR #1067}
- Fixed fences in all-to-all WQEs {PR #1069}
- Added context option to disable all-to-all operations {PR #1062}
- Improved error handling and device checks {PR #1102}
- Disabled mcast for thread multiple mode {PR #961}
TL/SHARP
- Added support for allgather operation {PR #1081}
- Enabled reduce-scatter with SAT support {PR #1084}
- Added SHARP multi-channel support {PR #1049}
- Fixed service team OOB handling {PR #1001}
- Improved internal OOB usage {PR #986}
CUDA
- Added linear broadcast implementation {PR #948}
- Batched CUDA stream memory operations, reducing CPU and GPU execution overhead {PR #1093}
- Enhanced error handling for CUDA context operations {PR #1025}
- Fixed context cleanup in CUDA operations {PR #954}
Build and Test
- Added support for specific GPU architectures with ROCM {PR #987}
- Added UCC pkg-config support {PR #1036}
- Fixed build compatibility with NVC compiler {PR #1052}
- Enhanced config parser functionality {PR #1092}
- Enhanced ASAN/LSAN memory leak detection {PR #1074}
- Added error checking and exit handling in gtests {PR #1083}
Documentation
- Updated README with UCC publication information {PR #1028}
- Added DOCA_UROM documentation {PR #999}
- Fixed Doxygen documentation issues {PR #1038}
- Enhanced code style consistency {PR #1020}
CL/DOCA_UROM
- Implemented new DOCA UROM plugin {PR #978}
- Added support for offloading collective operations to DPUs
- Implemented allreduce collective
v1.4.4-rc1
New Features and Enhancements
Core
- Implemented asymmetric memory support {PR #1000}
- Enhanced error handling and resource cleanup {PR #960, #951}
- Improved service team handling {PR #1046}
- Fixed triggered post for zero size collectives {PR #960}
CL/HIER
- Added allgatherv support {PR #1111}
- Implemented node subgroup unpacking {PR #1103}
- Added reduce to supported collectives {PR #997}
- Fixed integer overflow in alltoall {PR #944}
TL/UCP
- Split single and multithreaded send/receive operations {PR #1109}
- Added knomial allgather with CUDA memory support {PR #1095}
- Implemented reduce SRG knomial algorithm {PR #1058}
- Added radix selection to knomial operations {PR #1072}
- Added sliding window allreduce implementation {PR #958}
- Added knomial allgatherv support {PR #1008}
- Added sparbit algorithm for allgather {PR #940}
- Extended broadcast active set support for size > 2 {PR #926}
- Added knomial algorithm for reduce-scatter {PR #970}
TL/MLX5
- Added multicast-based zero-copy broadcast {PR #1087}
- Implemented mcast multi-group support {PR #1060}
- Added non-blocking CUDA memory copy support {PR #1040}
- Added device memory multicast broadcast {PR #989}
- Enhanced mcast allgather staging-based algorithm {PR #994}
- Improved one-sided mcast reliability initialization {PR #980}
- Various performance optimizations in alltoall {PR #1067}
- Fixed fences in all-to-all WQEs {PR #1069}
- Added context option to disable all-to-all operations {PR #1062}
- Improved error handling and device checks {PR #1102}
- Disabled mcast for thread multiple mode {PR #961}
TL/SHARP
- Added support for allgather operation {PR #1081}
- Enabled reduce-scatter with SAT support {PR #1084}
- Added SHARP multi-channel support {PR #1049}
- Fixed service team OOB handling {PR #1001}
- Improved internal OOB usage {PR #986}
CUDA
- Added linear broadcast implementation {PR #948}
- Batched CUDA stream memory operations, reducing CPU and GPU execution overhead {PR #1093}
- Enhanced error handling for CUDA context operations {PR #1025}
- Fixed context cleanup in CUDA operations {PR #954}
Build and Test
- Added support for specific GPU architectures with ROCM {PR #987}
- Added UCC pkg-config support {PR #1036}
- Fixed build compatibility with NVC compiler {PR #1052}
- Enhanced config parser functionality {PR #1092}
- Enhanced ASAN/LSAN memory leak detection {PR #1074}
- Added error checking and exit handling in gtests {PR #1083}
Documentation
- Updated README with UCC publication information {PR #1028}
- Added DOCA_UROM documentation {PR #999}
- Fixed Doxygen documentation issues {PR #1038}
- Enhanced code style consistency {PR #1020}
CL/DOCA_UROM
- Implemented new DOCA UROM plugin {PR #978}
- Added support for offloading collective operations to DPUs
- Implemented allreduce collective
1.3.0 (April 18th, 2024)
New Features and Enhancements
CL/HIER
- Disable onesided alltoallv {PR #875}
TL/CUDA
- Initialize remote CUDA scratch to NULL {PR #911}
TL/UCP
- Enable hybrid alltoallv {PR #781}
- Avoid copy in knomial scatter {PR #771}
- Enable reorder ranks to reduce_scatter, Knomial Allreduce, Ring Allgather/v {PR #819}
- Remove memcpy in last SRA step {PR #743}
- Fix sparse pack in hybrid a2av {PR #825}
- Fix recycle in hybrid a2av {PR #827}
- Reorder ranks for SRA {PR #834}
- Use ring allgather when reordering needed {PR #879}
- Use pipelining in SRA allreduce for CUDA {PR #873}
- Poll for onesided alltoall completion {PR #876}
- Add support for non-host buffers in bruck alltoall {PR #852}
- Added Neighbor Exchange Allgather {PR #822}
TL/SHARP
- Enable bcast for any predefined dt {PR #774}
- Don't print team create error {PR #777}
- Check datasize supported {PR #776}
- Fix sharp context cleanup {PR #843}
API
- Remove duplicate get_version_string {PR #933}
TL/NCCL
- Make team init non-blocking {PR #772}
- Add CUDA managed to score {PR #793}
- Make ncclGroupEnd nb {PR #798}
- Lazy init nccl comm {PR #851}
TL/MLX5
- Share ib_ctx and pd {PR #749}
- Rcache {PR #753}
- Device memory and topo init {PR #780}
- Adding mcast interface {PR #784}
- A2A part 1 -- coll init {PR #790}
- A2A part 2 -- full collective {PR #802}
- Revisit team and ctx init {PR #815}
- Fix context create hang {PR #887}
- Add librdmacm linkage {PR #910}
CORE
- Fix score update when only score given {PR #779}
- Coverity fixes {PR #809}
- Additional Coverity fixes {PR #813}
- Fix error handling for ctx create epilog {PR #818}
- Skip zero size collectives {PR #787}