Description
The NIXL_LIBFABRIC_NUM_RAILS environment variable is not enforced during EFA device discovery, causing all available devices to be used regardless of the setting.
Environment
- NIXL 0.8.0
- AWS P5.48xlarge (32 EFA devices)
- libfabric backend
File
src/utils/libfabric/libfabric_rail_manager.cpp
Symptoms
export NIXL_LIBFABRIC_NUM_RAILS=8
# Still initializes all 32 rails
Proposed Fix
Enforce rail limit during discovery loop:
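// Requires <cstdlib>, <cstdint>, <string>, and <stdexcept>.
const char* num_rails_env = std::getenv("NIXL_LIBFABRIC_NUM_RAILS");
size_t max_rails = SIZE_MAX;  // default: no limit
if (num_rails_env) {
    try {
        max_rails = std::stoul(num_rails_env);
    } catch (const std::exception&) {
        max_rails = SIZE_MAX;  // ignore an unparsable value instead of crashing
    }
}
// In discovery loop, stop once the configured number of rails is reached:
if (rail_count >= max_rails) break;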
Also support NIXL_LIBFABRIC_MAX_RAILS as an alternative name for consistency.
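A self-contained sketch of how both variable names and the loop cap could fit together is below. The helper names (parseRailLimit, resolveMaxRails, selectRails) are hypothetical and do not come from the NIXL source, and representing devices as strings is only for illustration. Giving NIXL_LIBFABRIC_NUM_RAILS precedence over NIXL_LIBFABRIC_MAX_RAILS and treating invalid or zero values as "no limit" are assumptions, not confirmed behavior.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <initializer_list>
#include <limits>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: parse a positive rail count from one environment variable.
// Returns std::nullopt if the variable is unset, empty, non-numeric, zero, or out of range.
std::optional<size_t> parseRailLimit(const char *name) {
    const char *value = std::getenv(name);
    if (value == nullptr || *value == '\0') return std::nullopt;
    // strtoull accepts a leading sign or whitespace, so require a leading digit.
    if (*value < '0' || *value > '9') return std::nullopt;
    errno = 0;
    char *end = nullptr;
    const unsigned long long parsed = std::strtoull(value, &end, 10);
    if (errno == ERANGE || *end != '\0' || parsed == 0) return std::nullopt;
    if (parsed > std::numeric_limits<size_t>::max()) return std::nullopt;
    return static_cast<size_t>(parsed);
}

// Hypothetical helper: the first valid variable wins (assumed precedence);
// with no valid value, the limit is effectively "unlimited".
size_t resolveMaxRails() {
    for (const char *name : {"NIXL_LIBFABRIC_NUM_RAILS", "NIXL_LIBFABRIC_MAX_RAILS"}) {
        if (auto limit = parseRailLimit(name)) return *limit;
    }
    return std::numeric_limits<size_t>::max();
}

// Stand-in for the discovery loop: keep at most max_rails of the discovered devices.
std::vector<std::string> selectRails(const std::vector<std::string> &discovered,
                                     size_t max_rails) {
    std::vector<std::string> selected;
    for (const auto &device : discovered) {
        if (selected.size() >= max_rails) break;  // enforce the cap during discovery
        selected.push_back(device);
    }
    return selected;
}
```

With NIXL_LIBFABRIC_NUM_RAILS=8 and 32 discovered devices, selectRails keeps the first 8. Whether the first N devices are the right ones to keep (for example with respect to GPU or NUMA locality) is a separate question this sketch does not address.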
Impact
The number of rails cannot be limited for testing or resource management without modifying the code.