UCP/CORE: Allow specifying mlx devices without a port in UCX_NET_DEVICES#11142
UCP/CORE: Allow specifying mlx devices without a port in UCX_NET_DEVICES#11142brminich merged 14 commits intoopenucx:masterfrom
Conversation
66145ae to
b3000e6
Compare
|
I added unit tests to ensure the correct behavior of the different device name formats |
4110644 to
77411b4
Compare
test/gtest/ucp/test_ucp_context.cc
Outdated
| EXPECT_GT(count, 0) << "Expected at least one device with base name '" | ||
| << test_base_name << "' to be selected"; |
There was a problem hiding this comment.
can we expect exactly one device?
src/ucp/core/ucp_context.c
Outdated
| ucs_assert(dev_name_suffix > resource->dev_name && | ||
| dev_name_suffix <= resource->dev_name + strlen(resource->dev_name)); |
There was a problem hiding this comment.
i'd move this assert inside ucp_get_dev_name_suffix if really needed
There was a problem hiding this comment.
Removed the assert
7dc8adb to
e3be8e9
Compare
1ace8ff to
838ec0f
Compare
evgeny-leksikov
left a comment
There was a problem hiding this comment.
- pls reword "colon" -> "delimeter" in all places
- Add a negative test with duplicates like UCX_NET_DEVICES=mlx5_1:1,mlx5_1,mlx5_2
- the rest LGTM 👍
|
Due to some flaws I found in the current implementation of devices ranges (like |
ea7be81 to
0335732
Compare
0335732 to
cd8fdcf
Compare
evgeny-leksikov
left a comment
There was a problem hiding this comment.
pls avoid force push during review
| EXPECT_EQ(m_warnings.size() - warn_count, 1) | ||
| << "Expected exactly one warning"; | ||
|
|
||
| // print all warnings | ||
| for (const std::string &warn : m_warnings) { | ||
| std::cout << warn << std::endl; | ||
| } | ||
|
|
||
| /* Check that a warning about duplicate device was printed */ | ||
| std::string expected_warn = "device '" + duplicate_dev_name + | ||
| "' was selected multiple times"; | ||
| bool found_warning = false; | ||
| for (size_t i = warn_count; i < m_warnings.size(); ++i) { | ||
| if (m_warnings[i].find(expected_warn) != std::string::npos) { | ||
| found_warning = true; | ||
| break; | ||
| } | ||
| } | ||
|
|
||
| EXPECT_TRUE(found_warning) | ||
| << "Expected warning about duplicate device '" | ||
| << duplicate_dev_name << "' with config '" << devices_config | ||
| << "'"; | ||
| } |
There was a problem hiding this comment.
how about to add a wrapper for hide_warns_logger and handle all these checks just in time?
There was a problem hiding this comment.
I want to to match the warning string with the device name duplicate_dev_name.
It's not trivial to pass it to a warpper and the solutions I found make it too complicated for what it is.
test/gtest/ucp/test_ucp_context.cc
Outdated
|
|
||
| // print all warnings | ||
| for (const std::string &warn : m_warnings) { | ||
| std::cout << warn << std::endl; |
There was a problem hiding this comment.
TEST_MESSAGE is preferable
src/ucp/core/ucp_context.c
Outdated
| memcpy(dev_basename_p, dev_name, basename_len); | ||
| dev_basename_p[basename_len] = '\0'; |
There was a problem hiding this comment.
why memcpy? maybe ucs_strncpy_zero?
| const char *delimiter = strchr(dev_name, ':'); | ||
| size_t basename_len; | ||
|
|
||
| if (delimiter != NULL) { |
There was a problem hiding this comment.
minor: how about to use ucs_string_split? (probably overkill in this case)
There was a problem hiding this comment.
It's a little overkill, and also ucs_string_split modifies the string in-place so I prefer avoiding it here
src/ucp/core/ucp_context.c
Outdated
| /* for network devices, also search for the base name (before the delimiter) */ | ||
| if (dev_type == UCT_DEVICE_TYPE_NET) { | ||
| ucp_get_dev_basename(resource->dev_name, dev_basename); | ||
| if (*dev_basename != '\0') { |
There was a problem hiding this comment.
ucs_string_is_empty
src/ucp/core/ucp_context.c
Outdated
| tl_cfg_mask)); | ||
| } | ||
|
|
||
| static UCS_F_ALWAYS_INLINE void |
There was a problem hiding this comment.
why UCS_F_ALWAYS_INLINE?
src/ucp/core/ucp_context.c
Outdated
| } | ||
|
|
||
| static UCS_F_ALWAYS_INLINE void | ||
| ucp_get_dev_basename(const char *dev_name, char *dev_basename_p) |
There was a problem hiding this comment.
I would suggest to move it to ucp_device.<h|c> files + rename to ucp_device_name_get_base
There was a problem hiding this comment.
pls revert this change. ucp_device is about GPU initiated APIs and in that context device stands for GPU.
test/gtest/ucp/test_ucp_context.cc
Outdated
| } | ||
|
|
||
| /* Get all mlx5 network device names from the context */ | ||
| static std::set<std::string> get_mlx5_device_names(const entity &e) { |
There was a problem hiding this comment.
maybe now we can search for any device name containing : delimeter?
it does not have to be just mlx anymore
There was a problem hiding this comment.
I generalized the tests to run on any device with : delimiter.
I only kept the specific "mlx" references in the warning tests because it's simpler to create the different cases this way.
brminich
left a comment
There was a problem hiding this comment.
👍 besides comment from Thomas
Fixed |
What?
Allow specifying mlx devices in
UCX_NET_DEVICESwithout a port, and as ranges.Why?
Currently the only valid format is
UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1This PR now allows specifying mlx devices without a port
mlx5_0,mlx5_1,mlx5_2