Releases: openucx/ucc

v1.7.0-rc1

20 Jan 16:57
ec0bc8a

v1.7.0-rc1 Pre-release

New Features and Enhancements

Core

  • Ported UCS logger from libucs to UCC, enabling file filtering and log-to-file features {PR #1191}

TL/CUDA

  • Added multinode NVLS support using CUDA fabric handles for cross-node allreduce {PR #1185}
  • Added NVLS reduce_scatterv with BF16 datatype support and kernel-based synchronization {PR #1211}
  • Added ptrace permissions for NVLS POSIX handle sharing via pidfd_getfd {PR #1218}
  • Added NVLS allgatherv using multimem.st instructions with 16-byte alignment {PR #1240}

TL/UCP

  • Added memory type parameter to tl_ucp_put/get for GPU memory in onesided collectives {PR #1253}
  • Fixed crashes in inplace mode for allgather, alltoall, and alltoallv {PR #1254}
  • Fixed onesided alltoall algorithm selection to default to PUT for 1 PPN {PR #1247}

TL/NCCL

  • Added native ncclAlltoAll support for NCCL 2.28.3+ {PR #1244}
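
A feature tied to a minimum NCCL release is usually gated by comparing version codes. The sketch below mirrors the arithmetic of the NCCL_VERSION(X,Y,Z) macro from nccl.h in Python. The helper names are hypothetical, and how TL/NCCL actually performs this check is an assumption not confirmed by these notes.

```python
def nccl_version_code(major: int, minor: int, patch: int) -> int:
    """Encode a version the way the NCCL_VERSION(X, Y, Z) macro does."""
    # Releases up to 2.8.x use a 4-digit style code; later ones use 5 digits.
    if major <= 2 and minor <= 8:
        return major * 1000 + minor * 100 + patch
    return major * 10000 + minor * 100 + patch


# Minimum release named in the changelog entry above.
MIN_NATIVE_ALLTOALL = nccl_version_code(2, 28, 3)


def has_native_alltoall(major: int, minor: int, patch: int) -> bool:
    """Hypothetical helper: True if the given NCCL release is 2.28.3 or newer."""
    return nccl_version_code(major, minor, patch) >= MIN_NATIVE_ALLTOALL
```

With older NCCL releases a caller would have to fall back to a non-native alltoall path; which path that is falls outside what these notes state.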

Build and Test

  • Bumped version to v1.7 {PR #1225}
  • Updated clang-format rules for function wrapping and comment reflow {PR #1192}
  • Added Greptile AI code review configuration {PR #1208}
  • Improved configure status reporting for CUDA/NVML detection {PR #1239}
  • Fixed m4 configuration syntax for CUDA {PR #1252}
  • Fixed uninitialized variable warning in MLX5 UMR WQE test {PR #1195}
  • Added multinode NVLS tests on GB300 Slurm clusters {PR #1235}
  • Added timeout to MPI and DLRM tests to prevent hung jobs {PR #1226}
  • Added 90-minute timeout to torch UCC tests {PR #1204}
  • Added Blossom CI Jenkins dispatcher job for /build trigger {PR #1229}
  • Added GitHub Action workflow for Blossom pipeline initialization {PR #1227}
  • Migrated Jenkins credentials to swx-hpcx service account for SSH key rotation {PR #1233}
  • Added separate GitHub UI checks for each Jenkins job via Blossom {PR #1237}
  • Added Blossom CI separated checks and job output upload to GitHub {PR #1238}
  • Fixed Jenkins job folder name and email in CI configuration {PR #1236}
  • Fixed clang-format command to use git-clang-format-21 for Ubuntu 22.04 {PR #1212}
  • Migrated hpcsdk build from GitHub workflow to Jenkins + CI-DEMO {PR #1215}
  • Set Coverity aggressiveness level to medium for better issue detection {PR #1207}
  • Fixed parallel GPU tests with CUDA context creation and IB port validation {PR #1209}
  • Enabled parallel UCC test execution in CI {PR #1206}
  • Fixed Jenkins JJB YAML variable syntax for check separation {PR #1246}

Documentation

  • Fixed various typos throughout comments and outputs {PR #1228}

Tools

  • Added matrix generator for alltoallv traffic patterns (uniform, biased, random) {PR #1220}
  • Fixed segfault in scatterv perftest inplace mode due to early memory free {PR #1234}
  • Optimized perftest traffic matrix to reuse displacements for same-size messages {PR #1250}
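
To make the three alltoallv traffic patterns concrete, here is a minimal Python sketch of a count matrix generator plus the displacement computation alltoallv needs. It is an illustration only: the function names, the meaning of "biased" (one hot destination rank), and all parameters are assumptions, not the interface of the UCC tool.

```python
import random


def make_traffic_matrix(n_ranks, pattern, msg_size=1024, bias=4, seed=0):
    """Return an n_ranks x n_ranks matrix; entry [src][dst] is the
    number of elements rank src sends to rank dst."""
    if pattern == "uniform":
        # Every pair exchanges the same amount of data.
        return [[msg_size] * n_ranks for _ in range(n_ranks)]
    if pattern == "biased":
        # Assumed meaning: destination rank 0 receives `bias` times more.
        return [[msg_size * bias if dst == 0 else msg_size
                 for dst in range(n_ranks)] for _ in range(n_ranks)]
    if pattern == "random":
        rng = random.Random(seed)  # seeded so runs are reproducible
        return [[rng.randint(0, msg_size) for _ in range(n_ranks)]
                for _ in range(n_ranks)]
    raise ValueError(f"unknown pattern: {pattern}")


def displacements(counts):
    """Exclusive prefix sum: offset of each peer's block in a packed buffer."""
    displs, offset = [], 0
    for count in counts:
        displs.append(offset)
        offset += count
    return displs


def recv_counts(matrix, rank):
    """Column `rank` of the matrix: what each source sends to this rank."""
    return [row[rank] for row in matrix]
```

For the uniform pattern every row yields the same displacement vector, so a single vector can be shared by all ranks. That is one plausible reading of the displacement reuse mentioned in PR #1250, not a description of its implementation.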

v1.6.0

14 Nov 18:23
87ee888

New Features and Enhancements

Core

  • Added UCC_DEBUGGER_WAIT environment variable {PR #1130}

CL/HIER

  • Fixed -Wlto-type-mismatch warning {PR #1179}

TL/CUDA

  • Fixed printing of device PCI id {PR #1053}
  • Added NVLS improvements and bfloat16 data type support {PR #1162}
  • Added NVLS barrier {PR #1180}
  • Added Alltoall(v) copy engine {PR #1138}

TL/UCP

  • Removed a debug print statement {PR #1177}
  • Added knomial allgather with mapped buffers {PR #1176}
  • Added node local id config {PR #1189}
  • Enabled knomial allgatherv {PR #1188}
  • Added congestion avoidant onesided Alltoall {PR #1096}

EC/CUDA

  • Fixed cuctx creation in EC CUDA {PR #1219}

Build and Test

  • Added check to see if target exists in CMAKE {PR #1173}
  • Fixed build with GCC 14 {PR #1190}
  • Added gtest and mpi test for ucc_mem_map and ucc_mem_unmap {PR #1165}
  • Check for CX7 in wait_on_data gtest {PR #1127}

Tools

  • Updated perftest to print BusBW {PR #1186}
  • Added support for onesided alltoall in perftest {PR #1194}
  • Added CUDA managed memory type to ucc_perftest {PR #1199}
  • Fixes for onesided alltoall in perftest {PR #1216}
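
Bus bandwidth (BusBW) is commonly derived from the measured algorithm bandwidth by a per-collective correction factor, following the convention documented for nccl-tests. The sketch below shows that arithmetic. Whether ucc_perftest uses exactly these factors and this definition of message size is an assumption, and all names here are hypothetical.

```python
# Correction factors per the nccl-tests convention; n is the number of ranks.
BUS_FACTORS = {
    "allreduce": lambda n: 2.0 * (n - 1) / n,
    "allgather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "alltoall": lambda n: (n - 1) / n,
    "bcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def alg_bw_gbps(size_bytes, seconds):
    """Algorithm bandwidth in GB/s: reported message size over elapsed time."""
    return size_bytes / seconds / 1e9


def bus_bw_gbps(collective, n_ranks, size_bytes, seconds):
    """Bus bandwidth in GB/s: algorithm bandwidth scaled by the factor."""
    factor = BUS_FACTORS[collective](n_ranks)
    return alg_bw_gbps(size_bytes, seconds) * factor
```

For example, an allreduce of 1 GB across 8 ranks that completes in one second has an algorithm bandwidth of 1 GB/s and, under this convention, a bus bandwidth of 1.75 GB/s.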

v1.6.0-rc2

21 Oct 15:46
9093b01

v1.6.0-rc2 Pre-release

What's Changed

Build and Test

  • Check for CX7 in wait_on_data gtest {PR #1127}

Tools

  • Add CUDA managed memory type to ucc_perftest {PR #1199}

v1.6.0-rc1

13 Oct 17:13
6575f83

v1.6.0-rc1 Pre-release

New Features and Enhancements

Core

  • Added UCC_DEBUGGER_WAIT environment variable {PR #1130}

CL/HIER

  • Fixed -Wlto-type-mismatch warning {PR #1179}

TL/CUDA

  • Fixed printing of device PCI id {PR #1053}
  • Added NVLS improvements and bfloat16 data type support {PR #1162}
  • Added NVLS barrier {PR #1180}
  • Added Alltoall(v) copy engine {PR #1138}

TL/UCP

  • Removed a debug print statement {PR #1177}
  • Added knomial allgather with mapped buffers {PR #1176}
  • Added node local id config {PR #1189}
  • Enabled knomial allgatherv {PR #1188}
  • Added congestion avoidant onesided Alltoall {PR #1096}

Build and Test

  • Added check to see if target exists in CMAKE {PR #1173}
  • Fixed build with GCC 14 {PR #1190}
  • Added gtest and mpi test for ucc_mem_map and ucc_mem_unmap {PR #1165}

Tools

  • Updated perftest to print BusBW {PR #1186}
  • Added support for onesided alltoall in perftest {PR #1194}

v1.5.1

10 Sep 14:15
16ec7ab

What's Changed

CL/HIER

  • Fixed -Wlto-type-mismatch warning {PR #1179}

Build and Test

  • Adjusted gfx targets for ROCm {PR #1183}

Documentation

  • v1.5.x: update NEWS {PR #1184}

Full Changelog: v1.5.0...v1.5.1

v1.5.0

07 Aug 19:17
430e241

New Features and Enhancements

Core

  • Added support for ucc_mem_map and ucc_mem_unmap {PR #1070}
  • Enhanced error logs in context creation {PR #1135}
  • Enhanced error log in collective init {PR #1104}
  • Added ucc net devices config {PR #1141}
  • EC/CUDA: Link with stdc++ {PR #1168}

CL/HIER

  • Added flag for nonroot info {PR #1123}
  • Removed per node leader, fix double free {PR #1126}

TL/UCP

  • Fixed allreduce knomial data consistency {PR #1145}
  • Fixed ag oneshot {PR #1134}
  • Added Allgather linear implementation {PR #1122}
  • Fall back if memh not passed {PR #1136}

TL/MLX5

  • Added HCA-assisted copy & CUDA scratch design {PR #1154}
  • Added logging for mcast FORCE/TRY modes {PR #1156}
  • Fixed segfault in multicast team creation {PR #1150}
  • Recover from ipoib issue in mcast init {PR #1140}
  • Added configuration to set IB QP SL {PR #1057}
  • Added ctx global status check {PR #1113}
  • Added cuda support for zcopy mcast {PR #1118}
  • Add reliability-init improvements {PR #1163}

TL/CUDA

  • Added NVLink SHARP (NVLS) Allreduce {PR #1148}
  • Added Topology Cache {PR #1137}
  • Added NVLink SHARP (NVLS) Reduce Scatter {PR #1144}

EC/ROCM

  • Include stdbool.h for new versions of ROCM {PR #1146}

TOPO

  • Node ldr ordered by team {PR #1129}

Build and Test

  • Fixed coverity issues {PR #1152}
  • Updated cuda arch {PR #1143}
  • Changed to CUDA 12.9 {PR #1155}
  • Added buffers for onesided tests {PR #1100}
  • Added perftest generator {PR #1147}
  • Added missing progress calls in UCC_PERFTEST {PR #1151}
  • Updated versions in CI {PR #1115}
  • Bumped version to v1.5 {PR #1121}

Documentation

  • Updated component image 1.4.4 {PR #1153}

Tools

  • Added perftest generator {PR #1147}
  • Added missing progress calls in UCC_PERFTEST {PR #1151}

v1.5.0-rc1

17 Jul 06:18
91a5549

v1.5.0-rc1 Pre-release

New Features and Enhancements

Core

  • Added support for ucc_mem_map and ucc_mem_unmap {PR #1070}
  • Enhanced error logs in context creation {PR #1135}
  • Enhanced error log in collective init {PR #1104}
  • Added ucc net devices config {PR #1141}

CL/HIER

  • Added flag for nonroot info {PR #1123}
  • Removed per node leader, fix double free {PR #1126}

TL/UCP

  • Fixed allreduce knomial data consistency {PR #1145}
  • Fixed ag oneshot {PR #1134}
  • Added Allgather linear implementation {PR #1122}
  • Fall back if memh not passed {PR #1136}

TL/MLX5

  • Added HCA-assisted copy & CUDA scratch design {PR #1154}
  • Added logging for mcast FORCE/TRY modes {PR #1156}
  • Fixed segfault in multicast team creation {PR #1150}
  • Recover from ipoib issue in mcast init {PR #1140}
  • Added configuration to set IB QP SL {PR #1057}
  • Added ctx global status check {PR #1113}
  • Added cuda support for zcopy mcast {PR #1118}

TL/CUDA

  • Added NVLink SHARP (NVLS) Allreduce {PR #1148}
  • Added Topology Cache {PR #1137}
  • Added NVLink SHARP (NVLS) Reduce Scatter {PR #1144}

EC/ROCM

  • Include stdbool.h for new versions of ROCM {PR #1146}

TOPO

  • Node ldr ordered by team {PR #1129}

Build and Test

  • Fixed coverity issues {PR #1152}
  • Updated cuda arch {PR #1143}
  • Changed to CUDA 12.9 {PR #1155}
  • Added buffers for onesided tests {PR #1100}
  • Added perftest generator {PR #1147}
  • Added missing progress calls in UCC_PERFTEST {PR #1151}
  • Updated versions in CI {PR #1115}
  • Bumped version to v1.5 {PR #1121}

Documentation

  • Updated component image 1.4.4 {PR #1153}

Tools

  • Added perftest generator {PR #1147}
  • Added missing progress calls in UCC_PERFTEST {PR #1151}

1.4.4

09 May 07:55
2c77074

New Features and Enhancements

Core

  • Implemented asymmetric memory support {PR #1000}
  • Enhanced error handling and resource cleanup {PR #960, #951}
  • Improved service team handling {PR #1046}
  • Fixed triggered post for zero size collectives {PR #960}

CL/HIER

  • Added allgatherv support {PR #1111}
  • Implemented node subgroup unpacking {PR #1103}
  • Added reduce to supported collectives {PR #997}
  • Fixed integer overflow in alltoall {PR #944}

TL/UCP

  • Split single and multithreaded send/receive operations {PR #1109}
  • Added knomial allgather with CUDA memory support {PR #1095}
  • Implemented reduce SRG knomial algorithm {PR #1058}
  • Added radix selection to knomial operations {PR #1072}
  • Added sliding window allreduce implementation {PR #958}
  • Added knomial allgatherv support {PR #1008}
  • Added sparbit algorithm for allgather {PR #940}
  • Extended broadcast active set support for size > 2 {PR #926}
  • Added knomial algorithm for reduce-scatter {PR #970}
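
Several entries above refer to knomial algorithms and radix selection. As a generic illustration, the Python sketch below builds a k-nomial tree rooted at rank 0 and shows how the radix changes its shape. It is a textbook construction under stated assumptions, not TL/UCP's implementation, and the function names are hypothetical.

```python
def _receive_distance(rank, radix):
    """Largest power of radix that divides a non-zero rank."""
    dist = 1
    while rank % (dist * radix) == 0:
        dist *= radix
    return dist


def knomial_parent(rank, radix):
    """Parent of `rank` in a k-nomial tree rooted at rank 0 (None for root)."""
    if radix < 2:
        raise ValueError("radix must be at least 2")
    if rank == 0:
        return None
    dist = _receive_distance(rank, radix)
    digit = (rank // dist) % radix  # lowest non-zero base-radix digit
    return rank - digit * dist


def knomial_children(rank, size, radix):
    """Children of `rank`, listed from the farthest level to the nearest."""
    if radix < 2:
        raise ValueError("radix must be at least 2")
    if rank == 0:
        dist = 1
        while dist < size:  # smallest power of radix >= size
            dist *= radix
    else:
        dist = _receive_distance(rank, radix)
    children = []
    step = dist // radix
    while step >= 1:
        for i in range(1, radix):
            child = rank + i * step
            if child < size:
                children.append(child)
        step //= radix
    return children
```

With 8 ranks and radix 2 this yields the familiar binomial tree (rank 0 sends to 4, 2, and 1). A larger radix gives a shallower tree with more children per node, which is the trade-off a radix selection has to balance.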

TL/MLX5

  • Added multicast-based zero-copy broadcast {PR #1087}
  • Implemented mcast multi-group support {PR #1060}
  • Added non-blocking CUDA memory copy support {PR #1040}
  • Added device memory multicast broadcast {PR #989}
  • Enhanced mcast allgather staging-based algorithm {PR #994}
  • Improved one-sided mcast reliability initialization {PR #980}
  • Various performance optimizations in alltoall {PR #1067}
  • Fixed fences in all-to-all WQEs {PR #1069}
  • Added context option to disable all-to-all operations {PR #1062}
  • Improved error handling and device checks {PR #1102}
  • Disabled mcast for thread multiple mode {PR #961}

TL/SHARP

  • Added support for allgather operation {PR #1081}
  • Enabled reduce-scatter with SAT support {PR #1084}
  • Added SHARP multi-channel support {PR #1049}
  • Fixed service team OOB handling {PR #1001}
  • Improved internal OOB usage {PR #986}

CUDA

  • Added linear broadcast implementation {PR #948}
  • Batched CUDA stream memory operations to reduce CPU and GPU execution overhead {PR #1093}
  • Enhanced error handling for CUDA context operations {PR #1025}
  • Fixed context cleanup in CUDA operations {PR #954}

Build and Test

  • Added support for specific GPU architectures with ROCM {PR #987}
  • Added UCC pkg-config support {PR #1036}
  • Fixed build compatibility with NVC compiler {PR #1052}
  • Enhanced config parser functionality {PR #1092}
  • Enhanced ASAN/LSAN memory leak detection {PR #1074}
  • Added error checking and exit handling in gtests {PR #1083}

Documentation

  • Updated README with UCC publication information {PR #1028}
  • Added DOCA_UROM documentation {PR #999}
  • Fixed Doxygen documentation issues {PR #1038}
  • Enhanced code style consistency {PR #1020}

CL/DOCA_UROM

  • Implemented new DOCA UROM plugin {PR #978}
  • Added support for offloading collective operations to DPUs
  • Implemented allreduce collective

v1.4.4-rc1

15 Apr 08:06

v1.4.4-rc1 Pre-release

New Features and Enhancements

Core

  • Implemented asymmetric memory support {PR #1000}
  • Enhanced error handling and resource cleanup {PR #960, #951}
  • Improved service team handling {PR #1046}
  • Fixed triggered post for zero size collectives {PR #960}

CL/HIER

  • Added allgatherv support {PR #1111}
  • Implemented node subgroup unpacking {PR #1103}
  • Added reduce to supported collectives {PR #997}
  • Fixed integer overflow in alltoall {PR #944}

TL/UCP

  • Split single and multithreaded send/receive operations {PR #1109}
  • Added knomial allgather with CUDA memory support {PR #1095}
  • Implemented reduce SRG knomial algorithm {PR #1058}
  • Added radix selection to knomial operations {PR #1072}
  • Added sliding window allreduce implementation {PR #958}
  • Added knomial allgatherv support {PR #1008}
  • Added sparbit algorithm for allgather {PR #940}
  • Extended broadcast active set support for size > 2 {PR #926}
  • Added knomial algorithm for reduce-scatter {PR #970}

TL/MLX5

  • Added multicast-based zero-copy broadcast {PR #1087}
  • Implemented mcast multi-group support {PR #1060}
  • Added non-blocking CUDA memory copy support {PR #1040}
  • Added device memory multicast broadcast {PR #989}
  • Enhanced mcast allgather staging-based algorithm {PR #994}
  • Improved one-sided mcast reliability initialization {PR #980}
  • Various performance optimizations in alltoall {PR #1067}
  • Fixed fences in all-to-all WQEs {PR #1069}
  • Added context option to disable all-to-all operations {PR #1062}
  • Improved error handling and device checks {PR #1102}
  • Disabled mcast for thread multiple mode {PR #961}

TL/SHARP

  • Added support for allgather operation {PR #1081}
  • Enabled reduce-scatter with SAT support {PR #1084}
  • Added SHARP multi-channel support {PR #1049}
  • Fixed service team OOB handling {PR #1001}
  • Improved internal OOB usage {PR #986}

CUDA

  • Added linear broadcast implementation {PR #948}
  • Batched CUDA stream memory operations to reduce CPU and GPU execution overhead {PR #1093}
  • Enhanced error handling for CUDA context operations {PR #1025}
  • Fixed context cleanup in CUDA operations {PR #954}

Build and Test

  • Added support for specific GPU architectures with ROCM {PR #987}
  • Added UCC pkg-config support {PR #1036}
  • Fixed build compatibility with NVC compiler {PR #1052}
  • Enhanced config parser functionality {PR #1092}
  • Enhanced ASAN/LSAN memory leak detection {PR #1074}
  • Added error checking and exit handling in gtests {PR #1083}

Documentation

  • Updated README with UCC publication information {PR #1028}
  • Added DOCA_UROM documentation {PR #999}
  • Fixed Doxygen documentation issues {PR #1038}
  • Enhanced code style consistency {PR #1020}

CL/DOCA_UROM

  • Implemented new DOCA UROM plugin {PR #978}
  • Added support for offloading collective operations to DPUs
  • Implemented allreduce collective

1.3.0 (April 18th, 2024)

18 Apr 18:10
1522ccf

New Features and Enhancements

CL/HIER

  • Disable onesided alltoallv {PR #875}

TL/CUDA

  • Initialize remote CUDA scratch to NULL {PR #911}

TL/UCP

  • Enable hybrid alltoallv {PR #781}
  • Avoid copy in knomial scatter {PR #771}
  • Enable rank reordering for reduce_scatter, knomial allreduce, and ring allgather(v) {PR #819}
  • Remove memcpy in last SRA step {PR #743}
  • Fix sparse pack in hybrid a2av {PR #825}
  • Fix recycle in hybrid a2av {PR #827}
  • Reorder ranks for SRA {PR #834}
  • Use ring allgather when reordering needed {PR #879}
  • Use pipelining in SRA allreduce for CUDA {PR #873}
  • Poll for onesided alltoall completion {PR #876}
  • Add support for non-host buffers in bruck alltoall {PR #852}
  • Added Neighbor Exchange Allgather {PR #822}

TL/SHARP

  • Enable bcast for any predefined dt {PR #774}
  • Don't print team create error {PR #777}
  • Check datasize supported {PR #776}
  • Fix sharp context cleanup {PR #843}

API

  • Remove duplicate get_version_string {PR #933}

TL/NCCL

  • Make team init non-blocking {PR #772}
  • Add CUDA managed to score {PR #793}
  • Make ncclGroupEnd nb {PR #798}
  • Lazy init nccl comm {PR #851}

TL/MLX5

  • Share ib_ctx and pd {PR #749}
  • Rcache {PR #753}
  • Device memory and topo init {PR #780}
  • Adding mcast interface {PR #784}
  • A2A part 1 -- coll init {PR #790}
  • A2A part 2 -- full collective {PR #802}
  • Revisit team and ctx init {PR #815}
  • Fix context create hang {PR #887}
  • Add librdmacm linkage {PR #910}

CORE

  • Fix score update when only score given {PR #779}
  • Coverity fixes {PR #809}
  • Additional Coverity fixes {PR #813}
  • Fix error handling for ctx create epilog {PR #818}
  • Skip zero size collectives {PR #787}

DOCS

  • Updating NEWS for v1.2 {PR #791}
  • Updating NEWS for v1.3 {PR #937}

BUILD and TEST

  • Updated build system to enable UCC with ROCm 6.x {PR #906, #917}
  • Check op and dt compatibility {PR #773}
  • Fix barrier test {PR #799}
  • Propagate HIP_CXXFLAGS to gtest and mpi {PR #803}