Conversation

Contributor

@tvegas1 tvegas1 commented Oct 28, 2024

What

Each process has an _end symbol whose address depends on how the program was built. Try negotiation first, assuming that a symmetric layout leads to the same available memory areas on every rank. If not all ranks can create the segment at the same position, fall back on the current hardcoded method.

We need to keep the mmap() as a reservation in all cases, so that intermediate library calls do not consume the address range in between. If that happens, the UCX module overwrites the foreign mapping, causing corruption later.
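
A minimal sketch of the reservation idea, not the code from this PR. It assumes the reservation is an anonymous PROT_NONE mapping that a module later replaces in place with MAP_FIXED; the helper names reserve_range, commit_range and reservation_demo are hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Reserve `size` bytes of address space without making it usable.
 * While this mapping exists, the kernel will not place an unrelated
 * mmap(NULL, ...) inside the range. */
void *reserve_range(void *hint, size_t size)
{
    void *p = mmap(hint, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Turn the reservation into usable memory in place. MAP_FIXED discards
 * whatever is mapped at `base`, so it must only target our own range. */
void *commit_range(void *base, size_t size)
{
    void *p = mmap(base, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Hypothetical demo. Returns 0 if an unrelated mapping created between
 * reserve and commit landed outside the reserved range and the commit
 * succeeded, -1 if the initial mappings could not be created, and 1
 * otherwise. */
int reservation_demo(void)
{
    const size_t size = (size_t)1 << 20;
    const size_t other_size = 4096;

    void *base = reserve_range(NULL, size);
    if (base == NULL) return -1;

    /* Stand-in for an intermediate library call that maps memory. */
    void *other = mmap(NULL, other_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (other == MAP_FAILED) { munmap(base, size); return -1; }

    uintptr_t b = (uintptr_t)base, o = (uintptr_t)other;
    int disjoint = (o + other_size <= b) || (o >= b + size);

    char *heap = commit_range(base, size);
    int ok = disjoint && (heap == (char *)base);
    if (ok) heap[0] = 1; /* the range is now readable and writable */

    munmap(other, other_size);
    munmap(base, size);
    return ok ? 0 : 1;
}
```

Without the reservation, the stand-in mmap() could be handed the very range that was picked for the heap, which is the overwrite scenario described above.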

Tested

  1. -mca sshmem_base_start_address 0xffffffffffffffff or no option: negotiation takes place, mmap reservation
  2. -mca sshmem_base_start_address 0x7f.....: no negotiation, mmap reservation, a failure to allocate is detected
  3. when one or more ranks fail to negotiate, all of them fall back on the hardcoded method with mmap reservation

Static segment creation always skips the module-created segment. Segments found in /proc/self/maps are always at least as large as the module-allocated one.
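
To illustrate the "at least as large" point, here is a rough sketch (Linux only, not the code from this PR) of a containment test against /proc/self/maps. The function names are hypothetical. The test is "entry contains the module range" rather than "entry equals it", since the kernel may merge adjacent mappings with identical attributes into one entry.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

/* Returns 1 if one /proc/self/maps entry fully contains
 * [base, base + size), 0 if none does, -1 if the file cannot be read. */
int maps_entry_covers(const void *base, size_t size)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL) return -1;

    uintptr_t lo = (uintptr_t)base, hi = lo + size;
    char line[512];
    int found = 0;

    while (fgets(line, sizeof(line), f) != NULL) {
        unsigned long start, end;
        /* Each entry begins with "start-end" in hexadecimal. */
        if (sscanf(line, "%lx-%lx", &start, &end) != 2) continue;
        if (start <= lo && hi <= end) { found = 1; break; }
    }
    fclose(f);
    return found;
}

/* Hypothetical demo: map a stand-in for the module segment and look it
 * up. Returns the result of maps_entry_covers(), or -1 if mmap fails. */
int maps_demo(void)
{
    const size_t size = (size_t)1 << 20;
    void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (seg == MAP_FAILED) return -1;

    int r = maps_entry_covers(seg, size);
    munmap(seg, size);
    return r;
}
```

An entry for which this returns 1 is the one a static-segment scan would skip.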

Misc

Configure: ./configure --prefix=rfs --enable-debug --with-ucx=rfs
Options: -mca memheap_base_verbose 100, -mca sshmem sysv/mmap/ucx

#endif
}

if (mca_sshmem_base_start_address != memheap_mmap_get(
Contributor Author

This relies on mmap() always creating the vma at the hint position when possible. If that is not always true (across kernel versions, for example), this could regress existing behavior and even fail to honor the command line parameter.

Shall we remove that confirmation check and proceed regardless? Or maybe only ignore that check when address was passed from command line?
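
For reference, a sketch of the confirmation check being discussed, under the assumption that a plain hint (no MAP_FIXED) is used. The helper name map_at_hint_or_fail is hypothetical. MAP_FIXED_NOREPLACE (Linux 4.17 and later) is shown as one possible way to make the request strict; it is not something this PR is known to use, and the returned address still has to be compared because older kernels treat the address as a plain hint.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Map `size` bytes at exactly `hint`, or fail without side effects.
 * A NULL hint accepts any address the kernel picks. */
void *map_at_hint_or_fail(void *hint, size_t size)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
#ifdef MAP_FIXED_NOREPLACE
    /* Fail instead of relocating the mapping or clobbering an
     * existing one. Kernels that predate the flag ignore it. */
    if (hint != NULL) flags |= MAP_FIXED_NOREPLACE;
#endif
    void *p = mmap(hint, size, PROT_NONE, flags, -1, 0);
    if (p == MAP_FAILED) return NULL;

    if (hint != NULL && p != hint) {
        /* The kernel placed the mapping elsewhere: undo and report. */
        munmap(p, size);
        return NULL;
    }
    return p;
}
```

With this shape, a rank that cannot honor the address reports failure cleanly, whether the address came from negotiation or from the command line.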

/* init the contents of map_segment_t */
shmem_ds_reset(ds_buf);

(void)munmap(mca_sshmem_base_start_address, size);
Contributor Author

probably not needed

Member

why added then?

Contributor Author

We now "reserve" that area by holding an mmap() on it, since there seems to be no randomization between an mmap/munmap pair and the following mmap, so the area could be consumed by an unrelated mmap() in between.

Then in the modules we "overwrite" it (with ucp_mem_map() / mmap() / shmat()). The munmap() is an attempt to make that explicit, although it opens a race window, and mmap() replaces the mapping anyway with MAP_FIXED.

Will remove. I need to check that shmat() overwrites an existing area too.

Contributor Author

removed for mmap module, kept for sysv module as it is needed
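
A sketch of why the two modules differ, assuming Linux behavior: shmat() at an explicit address fails with EINVAL when the range is already mapped, unless the Linux-specific SHM_REMAP flag is passed, whereas mmap() with MAP_FIXED replaces the mapping silently. The demo below is hypothetical and follows the munmap-then-attach order; it is not the module code.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>

/* Returns 0 if attaching over a live reservation is refused and
 * attaching after munmap() succeeds, -1 if the mapping or the SysV
 * segment cannot be created, 1 otherwise. */
int sysv_over_reservation_demo(void)
{
    const size_t size = (size_t)1 << 20;
    /* Over-allocate so that an SHMLBA-aligned base fits inside. */
    const size_t span = size + (size_t)SHMLBA;

    void *raw = mmap(NULL, span, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (raw == MAP_FAILED) return -1;

    uintptr_t mask = (uintptr_t)SHMLBA - 1;
    void *base = (void *)(((uintptr_t)raw + mask) & ~mask);

    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id < 0) { munmap(raw, span); return -1; }

    /* The reservation still occupies the range: expected to fail. */
    void *p = shmat(id, base, 0);
    int refused = (p == (void *)-1);
    if (!refused) shmdt(p);

    /* Release only the target range, then attach there. */
    munmap(base, size);
    p = shmat(id, base, 0);
    int attached = (p == base);
    if (p != (void *)-1) shmdt(p);

    shmctl(id, IPC_RMID, NULL);
    munmap(raw, span);
    return (refused && attached) ? 0 : 1;
}
```

The window between munmap() and shmat() is the race mentioned earlier in this thread. Passing SHM_REMAP would be one way to close it on Linux, at the cost of portability.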

Contributor Author

tvegas1 commented Oct 28, 2024

@brminich

Member

@brminich brminich left a comment

seems like negotiation is not done by default, as default value of sshmem_base_start_address remains the same

Comment on lines 157 to 162
rc = oshmem_shmem_allgather(&ptr, bases, sizeof(ptr));
if (OSHMEM_SUCCESS != rc) {
    MEMHEAP_ERROR("Failed to exchange selected vma for base segment "
                  "(error %d)", rc);
    goto out;
}
Member

maybe we can also introduce an option without fallback to the original behavior? Then allgatherv will not be needed.

Contributor Author

ok, in that case we could depend on the mca_sshmem_base_start_address value:
1- if 0: bcast the pointer value; any rank unable to create the segment there fails on its side, which leads to a global failure
2- if UINTPTR_MAX: bcast the pointer value, then allgather the results so that all ranks fall back on the default value if any of them failed

default could be point 2-
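
The two modes proposed in this comment can be sketched as a pure decision function. This is a single-process simulation with hypothetical names, not the PR code: `mapped` stands for what an allgather would collect, and in mode 1 the real ranks would each fail locally instead of seeing the global view.

```c
#include <stddef.h>
#include <stdint.h>

enum base_outcome {
    BASE_USER,        /* explicit address given on the command line */
    BASE_NEGOTIATED,  /* every rank mapped the broadcast address */
    BASE_FALLBACK,    /* some rank failed: all use the default address */
    BASE_FAILURE      /* some rank failed: the job terminates */
};

/* start_address: value of the start-address parameter.
 * bcast_base:    address chosen and broadcast by rank 0.
 * mapped[i]:     address rank i actually obtained, 0 on failure. */
enum base_outcome choose_base(uintptr_t start_address, uintptr_t bcast_base,
                              const uintptr_t *mapped, size_t nranks)
{
    if (start_address != 0 && start_address != UINTPTR_MAX) {
        return BASE_USER;
    }

    int all_ok = (bcast_base != 0) && (nranks > 0);
    for (size_t i = 0; i < nranks; i++) {
        if (mapped[i] != bcast_base) all_ok = 0;
    }
    if (all_ok) return BASE_NEGOTIATED;

    /* Mode 1 (0): no fallback. Mode 2 (UINTPTR_MAX): common fallback. */
    return (start_address == 0) ? BASE_FAILURE : BASE_FALLBACK;
}
```

Only the fallback mode needs the gathered results, which is why dropping the fallback also drops the extra collective.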

Contributor Author

fixed

base = ptr;
}

rc = oshmem_shmem_bcast(&base, sizeof(base), 0);
Contributor Author

@brminich, I tried the patch below, where every rank calls mmap() itself. The address returned by mmap() is randomized per process, as the log shows, so we need some form of synchronization of the base address.

memheap_exchange_base_address() #1: exchange base address: base 0x7fa7d9dff000: ok
memheap_exchange_base_address() #3: exchange base address: base 0x7fdc5a15b000: ok
memheap_exchange_base_address() #2: exchange base address: base 0x7fe8aa56a000: ok
memheap_exchange_base_address() #0: exchange base address: base 0x7f3d1736b000: ok
diff --git a/oshmem/mca/memheap/base/memheap_base_select.c b/oshmem/mca/memheap/base/memheap_base_select.c
index 0ec74de6aa..0b0cfe4bee 100644
--- a/oshmem/mca/memheap/base/memheap_base_select.c
+++ b/oshmem/mca/memheap/base/memheap_base_select.c
@@ -134,21 +134,8 @@ static int memheap_exchange_base_address(size_t size, void **address)
         return OSHMEM_ERROR;
     }

-    if (oshmem_my_proc_id() == 0) {
-        ptr = memheap_mmap_get(NULL, size);
-        base = ptr;
-    }
-
-    rc = oshmem_shmem_bcast(&base, sizeof(base), 0);
-    if (OSHMEM_SUCCESS != rc) {
-        MEMHEAP_ERROR("Failed to exchange allocated vma for base segment "
-                      "(error %d)", rc);
-        goto out;
-    }
-
-    if (oshmem_my_proc_id() != 0) {
-        ptr = memheap_mmap_get(base, size);
-    }
+    ptr = memheap_mmap_get(NULL, size);
+    base = ptr;

     MEMHEAP_VERBOSE(100, "#%d: exchange base address: base %p: %s",
                     oshmem_my_proc_id(), base,

Contributor Author

tvegas1 commented Nov 5, 2024

seems like negotiation is not done by default, as default value of sshmem_base_start_address remains the same

I do not understand that comment, since the new default address is ~0 and rank 0 allocates and broadcasts the pointer value. Agreed, though, that it is not a full negotiation.

Comment on lines +174 to +177
} else if (ptr != base) {
/* Any failure terminates the rank and others start teardown */
rc = OSHMEM_ERROR;
}
Member

maybe use this as default flow (i mean setting mca_sshmem_base_start_address = NULL by default)

Contributor Author

fixed

Member

@yosefe

Member

brminich commented Jan 8, 2025

@tvegas1 can you pls squash?

Contributor Author

tvegas1 commented Jan 8, 2025

Retested on two nodes with 7 ranks, for sysv, mmap and ucx:

  • specifying an address: that address is used on all ranks
  • no argument: the same address is used on all ranks; if at least one rank fails, the run terminates
  • specifying address 0: the same address is used on all ranks; if any rank fails, all fall back on 0xff000000

@tvegas1 tvegas1 force-pushed the oshmem_base_exchange branch from afb0775 to 52b907c Compare January 8, 2025 18:59
github-actions bot commented Jan 8, 2025

Hello! The Git Commit Checker CI bot found a few problems with this PR:

52b907c: oshmem/shmem: Allocate and exchange base segment a...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
