Skip to content

UCP/CORE: Allow specifying mlx devices without a port in UCX_NET_DEVICES#11142

Merged
brminich merged 14 commits intoopenucx:masterfrom
guy-ealey-morag:mlx-default-port
Feb 4, 2026
Merged

UCP/CORE: Allow specifying mlx devices without a port in UCX_NET_DEVICES#11142
brminich merged 14 commits intoopenucx:masterfrom
guy-ealey-morag:mlx-default-port

Conversation

@guy-ealey-morag
Copy link
Contributor

@guy-ealey-morag guy-ealey-morag commented Jan 23, 2026

What?

Allow specifying mlx devices in UCX_NET_DEVICES without a port, and as ranges.

Why?

Currently the only valid format is UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1

This PR now allows specifying mlx devices without a port mlx5_0,mlx5_1,mlx5_2

@guy-ealey-morag guy-ealey-morag force-pushed the mlx-default-port branch 5 times, most recently from 66145ae to b3000e6 Compare January 23, 2026 17:02
@guy-ealey-morag
Copy link
Contributor Author

I added unit tests to ensure the correct behavior of the different device name formats

Comment on lines 225 to 226
EXPECT_GT(count, 0) << "Expected at least one device with base name '"
<< test_base_name << "' to be selected";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we expect exactly one device?

Comment on lines 1116 to 1117
ucs_assert(dev_name_suffix > resource->dev_name &&
dev_name_suffix <= resource->dev_name + strlen(resource->dev_name));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd move this assert inside ucp_get_dev_name_suffix if really needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the assert

Copy link
Contributor

@evgeny-leksikov evgeny-leksikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • pls reword "colon" -> "delimeter" in all places
  • Add a negative test with duplicates like UCX_NET_DEVICES=mlx5_1:1,mlx5_1,mlx5_2
  • the rest LGTM 👍

@guy-ealey-morag
Copy link
Contributor Author

Due to some flaws I found in the current implementation of devices ranges (like mlx5_[0-2]),
I decided to remove that logic from this PR and leave only the base name support for now.

Copy link
Contributor

@evgeny-leksikov evgeny-leksikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls avoid force push during review

Comment on lines 221 to 244
EXPECT_EQ(m_warnings.size() - warn_count, 1)
<< "Expected exactly one warning";

// print all warnings
for (const std::string &warn : m_warnings) {
std::cout << warn << std::endl;
}

/* Check that a warning about duplicate device was printed */
std::string expected_warn = "device '" + duplicate_dev_name +
"' was selected multiple times";
bool found_warning = false;
for (size_t i = warn_count; i < m_warnings.size(); ++i) {
if (m_warnings[i].find(expected_warn) != std::string::npos) {
found_warning = true;
break;
}
}

EXPECT_TRUE(found_warning)
<< "Expected warning about duplicate device '"
<< duplicate_dev_name << "' with config '" << devices_config
<< "'";
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about to add a wrapper for hide_warns_logger and handle all these checks just in time?

Copy link
Contributor Author

@guy-ealey-morag guy-ealey-morag Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to to match the warning string with the device name duplicate_dev_name.
It's not trivial to pass it to a warpper and the solutions I found make it too complicated for what it is.


// print all warnings
for (const std::string &warn : m_warnings) {
std::cout << warn << std::endl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEST_MESSAGE is preferable

Comment on lines 1042 to 1043
memcpy(dev_basename_p, dev_name, basename_len);
dev_basename_p[basename_len] = '\0';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why memcpy? maybe ucs_strncpy_zero?

Comment on lines +1037 to +1040
const char *delimiter = strchr(dev_name, ':');
size_t basename_len;

if (delimiter != NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: how about to use ucs_string_split? (probably overkill in this case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little overkill, and also ucs_string_split modifies the string in-place so I prefer avoiding it here

/* for network devices, also search for the base name (before the delimiter) */
if (dev_type == UCT_DEVICE_TYPE_NET) {
ucp_get_dev_basename(resource->dev_name, dev_basename);
if (*dev_basename != '\0') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ucs_string_is_empty

tl_cfg_mask));
}

static UCS_F_ALWAYS_INLINE void
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why UCS_F_ALWAYS_INLINE?

}

static UCS_F_ALWAYS_INLINE void
ucp_get_dev_basename(const char *dev_name, char *dev_basename_p)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to move it to ucp_device.<h|c> files + rename to ucp_device_name_get_base

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls revert this change. ucp_device is about GPU initiated APIs and in that context device stands for GPU.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also it is not API

}

/* Get all mlx5 network device names from the context */
static std::set<std::string> get_mlx5_device_names(const entity &e) {
Copy link
Contributor

@brminich brminich Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe now we can search for any device name containing : delimeter?
it does not have to be just mlx anymore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generalized the tests to run on any device with : delimiter.
I only kept the specific "mlx" references in the warning tests because it's simpler to create the different cases this way.

Copy link
Contributor

@brminich brminich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 besides comment from Thomas

@guy-ealey-morag
Copy link
Contributor Author

👍 besides comment from Thomas

Fixed

@brminich brminich merged commit f3a1ae1 into openucx:master Feb 4, 2026
147 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants