Commit 92fb93f

rdma: Re-enable eager messages
While the eager protocol did not have an impact on nccl-tests performance, it did have a sizeable impact on applications (a 30% difference in step time for Maxtext Llama2 70B on P5). So re-enable the eager protocol, and make the early completion detection account for eager enablement automatically.

We need to come back to the early completion protocol and find a way to have early completion and eager co-exist in the general case, but that's future work.

Signed-off-by: Brian Barrett <[email protected]>
1 parent d809a88 commit 92fb93f

2 files changed: +12 -2 lines changed


include/nccl_ofi_param.h

Lines changed: 1 addition & 1 deletion
@@ -342,7 +342,7 @@ OFI_NCCL_PARAM_INT(net_latency, "NET_LATENCY", -1);
  * Eager message size limit when using RDMA protocol. Message sizes greater than
  * this limit will always be sent using RDMA write instead of eagerly.
  */
-OFI_NCCL_PARAM_INT(eager_max_size, "EAGER_MAX_SIZE", -1);
+OFI_NCCL_PARAM_INT(eager_max_size, "EAGER_MAX_SIZE", 8192);
 
 /*
  * Decide whether or not mutexes should default to errorcheck mode.
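
The effect of the new default can be illustrated with a minimal sketch. The helper `eager_eligible` below is hypothetical and does not exist in the plugin. It only models the rule stated in the parameter comment (messages larger than the limit always use RDMA write) and assumes that a negative limit means eager is disabled, which is how the `ofi_nccl_eager_max_size() >= 0` check in this commit treats it. Whether a message exactly at the limit is actually sent eagerly is decided by the real send path, which is not shown in this diff.

```cpp
#include <cstddef>

// Hypothetical helper, for illustration only (not part of the plugin).
// Returns true when a message of msg_size bytes is small enough to be a
// candidate for the eager protocol under the given limit.
// Assumption: a negative limit (the old default of -1) disables eager.
constexpr bool eager_eligible(long eager_max_size, std::size_t msg_size)
{
    return eager_max_size >= 0 &&
           msg_size <= static_cast<std::size_t>(eager_max_size);
}

// New default (8192): messages up to 8 KiB are eager candidates.
static_assert(eager_eligible(8192, 4096), "4 KiB is within the 8192-byte limit");
static_assert(eager_eligible(8192, 8192), "a message exactly at the limit is not greater than it");
static_assert(!eager_eligible(8192, 8193), "larger messages always use RDMA write");

// Old default (-1): nothing is eligible, so every message uses RDMA write.
static_assert(!eager_eligible(-1, 1), "a negative limit disables eager");
```

Under these assumptions, moving the default from -1 to 8192 changes small messages from "never eager" to "eager candidate", which is the behaviour the commit message describes as re-enabling eager.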

src/nccl_ofi_rdma.cpp

Lines changed: 11 additions & 1 deletion
@@ -8109,7 +8109,17 @@ int nccl_net_ofi_rdma_init(const char *provider_filter,
 	 * - Provider must use FI_PROGRESS_AUTO data progress model
 	 */
 	if (ofi_nccl_early_completion() < 0) {
-		early_completion = data_progress_auto;
+		if (!data_progress_auto) {
+			NCCL_OFI_TRACE(NCCL_INIT | NCCL_NET,
+				       "Early completion disabled due to progress model");
+			early_completion = false;
+		} else if (ofi_nccl_eager_max_size() >= 0) {
+			NCCL_OFI_TRACE(NCCL_INIT | NCCL_NET,
+				       "Early completion disabled because eager is enabled");
+			early_completion = false;
+		} else {
+			early_completion = true;
+		}
 	} else if (ofi_nccl_early_completion() == 0) {
 		early_completion = false;
 	} else {
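
The auto-detection branch added above reduces to a two-input truth table, sketched below as a standalone function. The name `auto_early_completion` is hypothetical. It covers only the case where `ofi_nccl_early_completion()` is negative (no explicit user setting) and leaves out the trace messages. The explicit-enable branch is cut off in this hunk, so it is not modelled here.

```cpp
// Hypothetical stand-in for the auto-detect branch in
// nccl_net_ofi_rdma_init(); illustration only, not the plugin's API.
//
// data_progress_auto: provider uses the FI_PROGRESS_AUTO data progress model.
// eager_max_size:     value of the EAGER_MAX_SIZE parameter; >= 0 means
//                     eager is enabled.
constexpr bool auto_early_completion(bool data_progress_auto, int eager_max_size)
{
    // Early completion is chosen only when the progress model allows it
    // and eager is disabled.
    return data_progress_auto && eager_max_size < 0;
}

static_assert(auto_early_completion(true, -1), "auto progress, eager off: enabled");
static_assert(!auto_early_completion(true, 8192), "eager on (new default): disabled");
static_assert(!auto_early_completion(true, 0), "a limit of 0 still counts as eager enabled");
static_assert(!auto_early_completion(false, -1), "progress model not auto: disabled");
```

One consequence of the two hunks taken together: with the new default of 8192, the auto-detect path turns early completion off unless the eager limit is explicitly set to a negative value.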
