- 
                Notifications
    You must be signed in to change notification settings 
- Fork 928
Fix singleton tmp files cleanup #13261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| The btl cleanup looks good; looks like that was a miss that we never noticed because pmix cleaned it up for us. I'm a little dubious of the session directory cleanup: 
 | 
| 
 Yes - you remove proc, then job, then session (i.e., work your way upwards) 
 Can't say - however, PMIx and PRRTE always set that flag to "true". I suppose that in this one special case (a singleton), you could technically just call dirpath_destroy on the top-level session directory with recursive set to "true" since no other job can share the tree. 🤷♂️ 
 The top session directory's name includes the pid of the singleton process - this is necessary to ensure that some other process doesn't attempt to remove it from underneath you. Accordingly, no other process can use it, so it is safe to remove it. | 
| 
 Is this code specific to the singleton case? If so, I missed that. | 
| 
 Yeah, it's a little convoluted, but it still is restricted to singleton case. You only set the "destroy_xxx" flag if we are a singleton and therefore created the session directory ourselves. So you only execute the destruct code if that flag is set, which means you have to be a singleton. You might want to double-check that the logic behind that didn't get changed and is correct. Just glancing, it looked like it was still okay. | 
| Thank you @rhc54 for the detailed explanation — I completely agree with your points and appreciate the context you provided. 
 Yes. And here is some additional information. When PRRTE is present, it is responsible for both the creation and cleanup of the session directory. In  If PRRTE is running correctly,  In the  In this case, Open MPI should create the session directory itself, handle its cleanup, and use the PID to distinguish the top-level session directory. These responsibilities were not fully handled in the original implementation. In the original implementation, the  This structure requires additional directory parsing to perform a full cleanup, which the original implementation does not handle. As a result, the directory  | 
| 
 In addition, based on my observation, starting from Open MPI 5.0.x, the singleton no longer launches a thread that runs pmix server. This is why singleton processes can no longer clean up their paths upon abnormal termination. In contrast, for non-singleton programs, PRRTE or OpenPMIx takes care of path cleanup after an abnormal termination. So, what should we do? Adding a signal handler for singleton programs may not be a good idea, since it would be registered into the "user program" during MPI_Init. On the other hand, we would need an additional "registration system" to track and manage files created by sm. Personally, I lean toward documenting this behavior—indicating that when a singleton process terminates abnormally, we cannot guarantee the cleanup of temporary files. | 
| 
 I’ve taken a closer look and have now understood it myself — it turns out to be just a variable initialization. There’s no need to follow up on this further. | 
| 
 Errr...that's not true. You initialized PMIx, and so the PMIx library is indeed running its progress thread. Technically, you could follow the same code as in  Bit of work, so up to you guys - suspect users will complain if it doesn't clean up, but... 🤷♂️ | 
| 
 I hadn't noticed that before — I'll go take a look. My earlier question was based on the assumption that the PMIx library is not running at all in singleton mode in 5.0.x. If what you said is correct, then following the same code as in prte.c to handle file cleanup — rather than just documenting the behavior — would be the best solution. | 
| Just a thought: why is the sm BTL component creating a backing file when operating as a singleton? There's nobody you can communicate with over that transport - so maybe one step would be to just not create the backing file if the "singleton" flag is set? Edited: or I guess just have the btl/sm component disqualify itself in singleton mode? | 
| Let me summarize what I’ve observed so far to clarify the current behavior and my remaining questions: 1.Singleton mode in OMPI 5.0.xStarting from the user's call to  2.Non-singleton modeIn non-singleton mode, the application is launched via  3.A special case:  | 
| Just a couple of comments: 
 Not exactly, or at least not in a way that impacts this discussion. The  
 I agree that having the MPI library trap the signal is probably not a good idea. I think you could argue that the session directory is (in this circumstance) the responsibility of the user. Greater concern is the backing file when placed in a non-obvious location like  What I'd suggest is modifying the btl/sm component to disqualify itself when in a singleton so the backing file never gets created in the first place. After all, there is nobody to communicate with over that channel. This is true even if they call comm_spawn. So there really isn't any point in creating this backing file, and no obvious need for the sm component itself in this case. | 
4bfc636    to
    40ee5d2      
    Compare
  
    | 
 thanks to @rhc54 , that's a very good point. I spent some time reading the code and was pleasantly surprised to find that  However, in  btl/sm segment file in /dev/shm will never be created in singleton mode now. | 
40ee5d2    to
    7e18deb      
    Compare
  
    | Let me restate the purpose of this PR: In singleton mode, directory cleaning needs to be done by the  This commit fixes some of these issues. 
 After fixing this, we can clean up the segment file created by  
 After this, we can cleanup the session dir. (when singletons normally terminated) 
 After this, the  Currently, only the session directory from a singleton process that terminates abnormally fails to be cleaned up. | 
| Just to be clear: The value returned in ompi_rte.c includes the proc itself, and thus would be 1 for a singleton. Only reason it matters is when a non-singleton proc is alone on a node - in which case, the value would be 1 and it still makes no sense to include the sm components. All depends on how OMPI wants to interpret its internal "num_local_peers" variable. 🤷♂️ | 
7e18deb    to
    ae35d75      
    Compare
  
    | I forgot the sign-off-by. Nothing else changed. | 
| 
 It appears that MPI consistently uses a value that is decremented by one, which is why singleton mode is represented as 0. | 
| One other little refinement you could consider: if the singleton does a comm_spawn, it will start a "prte" to shepherd the launch of the child job. Down towards the bottom of ompi/dpm/dpm.c, you could use  Only helps for the case where a comm_spawn was done - but that might prove to cover a majority of singleton executions. | 
| I agree with your opinion. Additionally, I am wondering who might be using the session dir in singleton mode? Is it possible to postpone session directory creation to  Perhaps some work needs to be done to ensure that components do not generate files in singleton mode. I am not very familiar with how session dir works in PMIX and PRRTE. Is it possible to not use session dir at all during singleton execution (before spawn)? | 
| The only things going into that directory tree (prior to spawn) would be files OMPI puts into it. I can think of just two things that would be possibilities: (a) shared memory backing files, and (b) opal_output that had been directed to go to files. You're taking care of the first. IIRC, there is an MCA param that controls the second, so maybe that could trigger the session directory creation as well as spawn? PMIx only creates session directories if the host is a server, which the MPI app is definitely not - so you won't see anything from PMIx here. PRRTE isn't involved in the MPI app, so nothing to be concerned about there - daemons will take care of themselves. | 
ae35d75    to
    0aab3e9      
    Compare
  
    | 
 Registered the cleanup directory after PRTE startup. | 
| Just leaving some notes: I checked OMPI’s usage of the session directory — please note that I might have missed some cases. The following components may use the job session directory: 
 The following components may use the proc session directory: 
 Using singleton mode can be helpful for checking errors that are unrelated to MPI. I think abnormal termination in singleton mode is also quite common. Despite the many challenges, I think it would be better to avoid creating  Once we decide not to create  
 These checks and adjustments would involve a fairly wide range of code, and this is not something I intend to implement in this PR. The points above are just ideas for discussion. | 
| Hi, @jsquyres this PR is ready for review. | 
| Hi, @jsquyres this PR is ready for review. | 
        
          
                ompi/runtime/ompi_rte.c
              
                Outdated
          
        
      | opal_process_info.top_session_dir = NULL; | ||
| return OPAL_ERR_OUT_OF_RESOURCE; | ||
| } | ||
| rc = opal_os_dirpath_create(opal_process_info.top_session_dir, 0755); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why we add the nodename to a file that is supposed to be purely local ?
Why 0755 ? This directory should only be accessed by the same user so 0700 should be a safer access permission value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nodename was added many years ago to avoid conflict when the top session dir is located on a shared file system. I know we warn when we detect that (due to shared memory concerns), but I believe we still allow it?
The permissions should indeed be set to 0700 - that's what PRRTE does, and we should follow suit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Placing the temporary directories on a shared file system was always a bad idea, and we had enough troubles with this over the years. If we don't encourage it then we don't need the node name for node-only directories, and the topdir would be shorter, potentially solving other issues we had as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True - but do you want to discourage it (which we do now with the warning) or prohibit it? I'm okay either way, just need some direction so we can make PRRTE consistent with whatever we do here. If we remove the nodename, then we effectively prohibit temp on shared file systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't get me wrong, I understand the discrepancy between PRTE and OMPI needs regarding the topdir file. From PRTE perspective this is just a folder for storing information, access performance is not a critical criterion. OMPI is at the opposite, we depend on the access (and fstat) performance in this folder.
With this in mind, I would strongly discourage it (like PRTE does for allow_root) and remove the hostname from the path. This is as if we are basically prohibiting it. It could work assuming pids never collide, but that's not how OMPI should be run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change it to 0700.
| I'd rather be safe here and support what the user does even if it is not fast. I don't buy the argument that performance trumps correctness (for most users it does not). A strongly worded warning is sufficient. An abort (or god forbid crash) is the worst response. | 
| A strongly worded warning didn't work for the last decade, but I like your optimism. There is no safety is allowing bad behavior to persist. An abort is however a solid solution, because then they will reach out either to read the documentation or to our mailing lists, and fix both safety and performance issues. | 
| How do we know it hasn't worked for the last decade? I don't remember seeing complaints about low performance on shared FS. I do remember working on a machine with a ridiculously small TMP and running on a shared fs was the only way to run at all. Changing this now is a change in behavior that may break people's workflows. It is annoying to find what used to work doesn't work anymore just because I (or my admins) upgraded OMPI. We should be nicer to our users, unless we really don't care having them. | 
| I can offer a potential compromise for you consideration. We share the nidmap prior to creating the session directories, so every daemon knows the name of every node in the system. Longest node name we've seen (at least, as reported) so far was the 70+ character one that recently broke ORTE in OMPI v4.x - had to set an MCA param to work around it. Usually, they aren't that bad. Note the daemons don't need to know the session dir names for other daemons - we just need to avoid collisions. So here are some things we could do: 
 Any of those make sense? | 
| FWIW: I prototyped option 3 and it works fine. If there is a shared filesystem for tmpdir, then the top session directory is  If the tmpdir is not a shared filesystem, then the top session directory is just  This is the shortest I can make the name while preserving uniqueness. It also avoids blocking users with tmpdir on shared filesystems. | 
| Regarding the nodename discussion, I'd like to share my perspective specifically for singleton processes: In singleton mode, abnormal termination (Ctrl-C, SIGKILL, crashes, etc.) will inevitably leave session directories behind since we cannot implement comprehensive signal handling in user applications. When this happens on shared filesystems, we need to manually clean up these leftover directories. Singletons are more likely to terminate abnormally during development/testing. They don't have PRRTE daemons to handle cleanup. For this PR focused on singleton cleanup, I think we should keep the full nodename. For singletons specifically, the nodename provides essential diagnostic information when cleanup fails. What do you think about retaining the nodename at least for the singleton case ? | 
0aab3e9    to
    1905fba      
    Compare
  
    Only in singleton mode, directory cleaning needs to be done by the program itself. There are some problems with these parts of the code that cause the directory to not be cleaned. This commit fixes *some of* these issues. 1. btl/sm will not unlink its segments file. We never noticed this in non-singleton mode because pmix cleaned it up for us. After this, we can clean up the segment file created by sm in /dev/shm.(when singletons normally terminated) 2. Modified the singleton session directory structure and enabled recursive deletion. After this, we can cleanup the session dir. (when singletons normally terminated) 3. Fix a bug - local peer number of a singleton should be 0, not 1. After this, the btl/sm and btl/smcuda components will return NULL during their init process and will be automatically closed. btl/sm segment file in /dev/shm will never be created in singleton mode now. 4. If the singleton does a comm_spawn, register the singleton's session directory for cleanup by the "prte". Signed-off-by: xbw <[email protected]>
1905fba    to
    7a5a0de      
    Compare
  
    | @bosilca Please take a look when you have time, thanks! | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective we can bend over as much as we want, temporary file systems mounted over any shared file system is a horrible idea. Anywhere and anytime, because performance degradation will happen randomly (whenever the OS decide to flush your backend file). I would rather have a deterministic behavior even if that means we have to educate the users on such topics.
That being said, correctly configured platforms, aka. the overwhelming majority of users, will have none of these issues because the temporary file systems are local and because the SM backend will be located on /dev/shmem. So this might not be as critical as it sounds.
In singleton mode, directory cleaning needs to be done by the ompi library.
However, there are issues in the current implementation that prevent complete cleanup of the directory.
Singleton Mode Cleanup Behavior
When the program terminates normally
1. The session dir is only partially cleaned.(Fixed)
The directory
/tmp/ompi.<username>.<uid>will still be left behind, along with some subdirectories.I modified the directory structure and enabled recursive deletion.
2. Segment files created by btl/sm are not properly removed.(Fixed)
In
ompi/opal/mca/btl/sm/btl_sm_module.c: function sm_finalize, unlink will never be done, singleton or not. However, in non-singleton mode, the file will be cleaned by pmix afteropal_pmix_register_cleanupsucceeds.I registered the segment cleanup with OPAL, which is similar to
mca_btl_smcuda_component_init.Question : Although not directly related to this pr. In
opal/mca/btl/sm/btl_sm_component.caround line 421, is this correct? is this need to be removed?When the program terminates by ctrl-C (see issue #13002)
Question : No file cleaning up is possible in this case. Should we explicitly document this behavior or add a signal handling mechanism? It is not easy to add a signal handling mechanism in
btl/sm.