Pr/v3.0.x OMPIO Crash on Large Data #6427
Closed
Topic/get element fix
…contiguous_flag io/romio314: mark datatypes of size 0 as contiguous
Allowing MPI_PROC_NULL as a neighbor in any topology allows us to add gaps in the send and recv buffers. This gives the traditional neighbor collectives behavior similar to the V versions, but at the same time it lets users skip the step where they prepare the counts and displacements arrays. For more info, please take a look at issue open-mpi#4675.
Signed-off-by: George Bosilca <[email protected]>
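A minimal sketch of the buffer layout described above, written in plain C without MPI so that it stays self-contained. `SKETCH_PROC_NULL` and `sketch_neighbor_gather` are hypothetical names, not Open MPI symbols. The sketch only models the idea that slot `i` of the receive buffer belongs to neighbor `i` and is left untouched when that neighbor is a null process.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for MPI_PROC_NULL; the real value is
 * implementation-defined. */
#define SKETCH_PROC_NULL (-1)

/*
 * Model of a fixed-count neighbor gather: slot i of recvbuf belongs to
 * neighbors[i]. Slots whose neighbor is SKETCH_PROC_NULL are skipped, which
 * leaves a gap in recvbuf instead of requiring per-neighbor counts and
 * displacements. rank_values[r] is the value rank r would contribute.
 * Returns the number of slots that were filled.
 */
size_t sketch_neighbor_gather(const int *neighbors, size_t n_neighbors,
                              const int *rank_values, size_t n_ranks,
                              int *recvbuf)
{
    size_t filled = 0;

    assert(neighbors != NULL && rank_values != NULL && recvbuf != NULL);
    for (size_t i = 0; i < n_neighbors; ++i) {
        if (neighbors[i] == SKETCH_PROC_NULL) {
            continue; /* gap: recvbuf[i] keeps its previous contents */
        }
        assert(neighbors[i] >= 0 && (size_t)neighbors[i] < n_ranks);
        recvbuf[i] = rank_values[neighbors[i]];
        ++filled;
    }
    return filled;
}
```

With neighbors `{SKETCH_PROC_NULL, 1, 2, SKETCH_PROC_NULL}`, only the two middle slots of the receive buffer are written, and the caller never builds a counts or displacements array.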
* The `MPIR_PROCDESC` structure needs to be visible even in optimized
builds so that debuggers can attach to `mpirun` and properly read the
`MPIR_proctable`.
* In the v2.0.x and v2.x series this structure resided in the `orterun`
directory and included the `CFLAGS` fix included here. This code
moved in the v3.x series and the `CFLAGS` did not move causing this
issue.
- Instead of applying the debug `CFLAGS` globally to libopen-rte,
only apply them to the `orted_submit.c` compile which contains the
MPIR symbols.
Signed-off-by: Joshua Hursey <[email protected]>
Data sieving has to occur for any provided offset that is greater than or equal to zero for this implementation to work correctly.
Signed-off-by: Edgar Gabriel <[email protected]>
…ta_sieving_fix fcoll/two_phase: data sieving has to occur at offset 0 as well
Fix MPIR_proctable structure visibility
Signed-off-by: Benoît Legat <[email protected]>
Fix typo in MPI_Cart_shift doc
999de13 accidentally reset opal_cuda_verbose's default value. This commit puts it back. Signed-off-by: Jeff Squyres <[email protected]>
…bose-value opal_datatype_module.c: reset opal_cuda_verbose
Flush out the DVM ready notice on stdout Signed-off-by: Aurelien Bouteiller <[email protected]>
This commit is a large update to the osc/rdma component. Included in this commit:
- Add support for using hardware atomics for fetch-and-op and single count accumulate when using the accumulate lock. This will improve the performance of these operations even when not setting the single intrinsic info key.
- Rework how large accumulates are done. They now block on the get operation to fix some bugs discovered by an IBM one-sided test. I may roll back some of the changes if the underlying bug in the original design is discovered. There appears to be no real difference in performance (on the hardware this was tested with), so it's probably a non-issue. References open-mpi#2530.
- Add support for an additional lock-all algorithm: on-demand. The on-demand algorithm will attempt to acquire the peer lock when starting an RMA operation. The lock algorithm default has not changed. The algorithm can be selected by setting the osc_rdma_locking_mode MCA variable. The valid values are two_level and on_demand.
- Make use of the btl_flush function if available. This can improve performance with some btls.
- When using btl_flush do not keep track of the number of put operations. This reduces the number of atomic operations in the critical path.
- Make the window buffers more friendly to multi-threaded applications. This was done by dropping support for multiple buffers per MPI window. I intend to re-add that support once the underlying performance bug under the old buffering scheme is fixed.
- Fix a bug in request completion in the accumulate, get, and put paths. This also helps with open-mpi#2530.
- General code cleanup and fixes.
Signed-off-by: Nathan Hjelm <[email protected]>
Scaling.pl: Fix Srun options and wait for DVM launch
Improve the range and accuracy of MPI_Wtime.
Signed-off-by: Brian Barrett <[email protected]>
dist: Sync 2.1.3 NEWS items into master
This commit fixes the case where a local client asks for a key from a process on a remote node. The local server doesn't have the commit count for remote ranks; it is maintained by another PMIx server, so the commit count should be ignored for remote requests.
Signed-off-by: Boris Karasev <[email protected]>
Sync to PMIx master PR openpmix/openpmix#697
Signed-off-by: Jeff Squyres <[email protected]>
Signed-off-by: Boris Karasev <[email protected]>
mpool/memkind: fix typo in partition page sizes
We have a small number of requirements for contributions (e.g., "Signed-off-by"), so let's make sure that people have an easy way of knowing these things. Signed-off-by: Jeff Squyres <[email protected]>
…iens CONTRIBUTING.md: add Github contribution guidelines
Signed-off-by: Boris Karasev <[email protected]>
pmix: dstore returned for direct modex
plfs components are at this point not utilized by anybody as far as I know. Easy to bring back if we want to. Signed-off-by: Edgar Gabriel <[email protected]>
never got to move this sharedfp component into anything usable. Can easily be restored if necessary. Signed-off-by: Edgar Gabriel <[email protected]>
Somehow the flag indicating whether to gather performance data on collective I/O operations was accidentally changed to 1. It should be 0 (false) by default.
Signed-off-by: Edgar Gabriel <[email protected]>
Remove the MXM MTL, which has been deprecated in preference for the Yalla PML. This was discussed at the last developers meeting and somehow I ended up with the action item to do the removal. Signed-off-by: Brian Barrett <[email protected]>
- supported 4 or 8 bytes only Signed-off-by: Sergey Oblomov <[email protected]>
…ng for C11 features to prevent e.g. _Static_assert being treated as an implicitly-defined function. Signed-off-by: Ben Menadue <[email protected]>
mtl: remove MXM MTL
configure: use AC_LINK_IFELSE instead of AC_COMPILE_IFELSE for C11 tests
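A hedged illustration of why a link test matters for this check. The function below is a hypothetical probe body, not the test program that configure actually generates; `sketch_c11_probe` is an invented name.

```c
/*
 * Body of a hypothetical C11 probe. On a C11 compiler the _Static_assert
 * line is a compile-time assertion. On a pre-C11 compiler that accepts
 * implicit function declarations, the same line parses as a call to an
 * undeclared function named _Static_assert: the object file still
 * compiles, and the problem only surfaces at link time as an undefined
 * symbol. A compile-only check therefore cannot tell the two cases apart.
 */
int sketch_c11_probe(void)
{
    _Static_assert(sizeof(int) >= 2, "int is at least 16 bits");
    return 0;
}
```

Placing such a probe in a program that is linked, not just compiled, is what lets the check reject the implicit-declaration case.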
Fix the logic that decides which aggregator selection algorithm to use.
Signed-off-by: Edgar Gabriel <[email protected]>
Signed-off-by: Sylvain Jeaugey <[email protected]>
io/ompio: fix an erroneous condition when selecting aggregator selection algorithm
enable_oshmem holds the result of a customer decision and, like most user options, can have the values "yes" (user wants us to build feature), "no" (user wants us not to build feature), "" (user wants us to figure it out), and "<something>" (user wants us to build feature, with <something> turned on). This change updates oshmem to not lose this data by not overwriting enable_oshmem with a yes/no and leaving the original customer intent in place. Aside from fixing one bug (below) there are no customer visible changes in this patch, but it makes it possible to do the right thing in the upcoming work to allow oshmem to be disabled based on test results. There was a cosmetic bug in the existing code where specifying a feature argument (like --enable-oshmem=awesome) would result in the "checking if want oshmem" test reporting no, but oshmem being built anyway. With this cleanup, the "checking if want oshmem" test, the final output summary, and what actually happens will all match. Signed-off-by: Brian Barrett <[email protected]>
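The four possible values of `enable_oshmem` can be modelled as follows. This is a sketch in C for illustration only; the real logic lives in configure's m4/shell code, and `sketch_intent_t` and `sketch_classify_enable` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of the user intent carried by an --enable-FEATURE
 * option value. */
typedef enum {
    SKETCH_WANT_AUTO,    /* ""    : user wants configure to figure it out */
    SKETCH_WANT_NO,      /* "no"  : user does not want the feature built */
    SKETCH_WANT_YES,     /* "yes" : user wants the feature built */
    SKETCH_WANT_YES_WITH /* other : build the feature, with that argument */
} sketch_intent_t;

/* Classify the raw option value without overwriting it, so the original
 * intent stays available to later decisions. */
sketch_intent_t sketch_classify_enable(const char *value)
{
    assert(value != NULL);
    if (value[0] == '\0') {
        return SKETCH_WANT_AUTO;
    }
    if (strcmp(value, "no") == 0) {
        return SKETCH_WANT_NO;
    }
    if (strcmp(value, "yes") == 0) {
        return SKETCH_WANT_YES;
    }
    return SKETCH_WANT_YES_WITH;
}
```

Under this model, `--enable-oshmem=awesome` classifies as `SKETCH_WANT_YES_WITH` and not as "no", which matches the behavior the commit message describes after the cleanup.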
Two related changes to allow projects to not build based on configure test results, as opposed to only reacting to user configure options today. Use case is disabling a project like oshmem because no communication channels can be built. First, move PROJECT_* AM_CONDITIONALs from the top of configure to the bottom, so that we can change the results during configure. Second, add a DIST_SUBDIRS to Makefile.am (and populate it in opal_mca) so that "make dist" will work even when a project is disabled.
Signed-off-by: Brian Barrett <[email protected]>
This patch disables the oshmem layer if there are no SPMLs that will build. With the limited set of SPMLs available to support oshmem, many builds end up installing an oshmem library that we know will not work. There has been a bit of customer confusion over oshmem; hopefully this will lead customers in the right direction.
Signed-off-by: Brian Barrett <[email protected]>
+ Add quiet method to SPML, so it can have a different implementation from fence.
+ Use ucp_worker_fence for spml_fence method of UCX SPML
Signed-off-by: Mikhail Brinskii <[email protected]>
cuda: add option to remove warning about missing libcuda.
oshmem: remove `shmem_put/get` when not the C11 case in accordance with the spec v1.3
Implements butterfly algorithm for MPI_Reduce_scatter_block. The algorithm can be used both by commutative and non-commutative operations, for power-of-two and non-power-of-two number of processes. Signed-off-by: Mikhail Kurnosov <[email protected]>
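The butterfly exchange pattern can be sketched as a single-process simulation. This is not the code from the commit: it is a simplified recursive-halving variant that assumes a commutative sum and a power-of-two number of ranks, whereas the commit message says the real algorithm also covers non-commutative operations and other process counts. `sketch_reduce_scatter_block_sum` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simulate reduce_scatter_block (sum) over p "ranks" held in one address
 * space. data is a p x (p * block) row-major array; row r is the input
 * vector of rank r. On return, elements [r*block, (r+1)*block) of row r
 * hold the element-wise sum of that block over all ranks; the remaining
 * elements of each row are scratch.
 *
 * Assumptions of this sketch: p is a power of two and the operation is
 * commutative.
 */
void sketch_reduce_scatter_block_sum(int *data, size_t p, size_t block)
{
    assert(data != NULL && p > 0 && (p & (p - 1)) == 0);
    const size_t row_len = p * block;

    /* Butterfly steps: partners differ in exactly one bit of the rank. */
    for (size_t dist = p / 2; dist >= 1; dist /= 2) {
        for (size_t r = 0; r < p; ++r) {
            if (r & dist) {
                continue; /* each pair is handled once, from its lower rank */
            }
            size_t partner = r | dist;
            /* Both ranks currently own the same 2*dist blocks. */
            size_t lo = r & ~(2 * dist - 1);
            size_t mid = lo + dist;
            int *low = data + r * row_len;
            int *high = data + partner * row_len;

            /* Lower rank keeps the lower half and absorbs the partner's. */
            for (size_t i = lo * block; i < mid * block; ++i) {
                low[i] += high[i];
            }
            /* Upper rank keeps the upper half and absorbs the partner's. */
            for (size_t i = mid * block; i < (mid + dist) * block; ++i) {
                high[i] += low[i];
            }
        }
    }
}
```

Each step halves the range of blocks a rank is responsible for, so after log2(p) steps every rank holds only its own fully reduced block.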
OSHMEM/SMPL/UCX: Add real fence support
…butterfly coll: reduce_scatter_block: add butterfly algorithm
MCA/UCX: fixed error messages for incorrect msg size
opal/bitmap: fix opal_bitmap_set_bit()
Per discussion at open-mpi#2614 (comment), do not allow for selection of the OSC PT2PT when creating an MPI RMA window when THREAD_MULTIPLE is active. Print a helpful message and return a not-supported error. Signed-off-by: Howard Pritchard <[email protected]> Signed-off-by: Jeff Squyres <[email protected]> (cherry picked from commit d0ffd66) Signed-off-by: Jeff Squyres <[email protected]>
…or-thread-multiple osc/pt2pt: disable when THREAD_MULITPLE
Signed-off-by: Jeff Squyres <[email protected]>
Signed-off-by: Jeff Squyres <[email protected]>
Signed-off-by: Jeff Squyres <[email protected]>
This fix was already included in pmix upstream (openpmix/openpmix@fb7af8af2). Signed-off-by: Jeff Squyres <[email protected]>
Minor compiler warning stomps
- Improve descriptions
- Fix some typos
- Remove MPI-1 functions and replace them with MPI-2 functions
Signed-off-by: Kurita, Takehiro <[email protected]>
java: Improve descriptions of `javadoc`
Can one of the admins verify this patch?
Same as #6344, but applied to v3.0.x.
cc @edgargabriel @jsquyres