Skip to content

Conversation

@Lucccyo
Copy link
Contributor

@Lucccyo Lucccyo commented Sep 12, 2022

No description provided.

@jmid jmid changed the title Add STM tests for Sys module [WIP] Add STM tests for Sys module Sep 30, 2022
@jmid
Copy link
Collaborator

jmid commented Nov 17, 2022

I've taken a first quick pass over the code.

There's duplicate code:

  • a duplicate p function in line 188 + 203. I suggest pulling it out as a top-level function and to reuse it instead
  • there's duplicate logic in postcond line 216 + 235:
    (match find_opt fs complete_path with
                | None -> true
                | Some target_fs -> not (is_a_dir target_fs)))
    rather than copy it, it would make sense to reuse it
  • I think you can reuse the code in is_existing_is_a_dir in line 221 and is_a_dir in line 241 + 265 if we turn it into a function accepting a path argument?

There may be other occurrences, so take it as an exercise to review your own code critically and see if you can spot other occurrences or possibilities to simplify, shorten, and reuse.

Since this is a multicoretests PR it should run the parallel tests too now that the model is OK.
This will probably fail, because the Sys functions were not designed to be used in parallel, so it is fine to leave it as a negative test (expected to fail). We hope that parallel testing will not cause a segmentation fault though 🤞

In the harder end, the underlying generators could be adjusted:

Finally, as of last night we now have Windows CI tests. This means some of the Unix-specific Sys.command operations (rm -rf, touch, ...) need adjustments. Checking Sys.os_type may help here.

@jmid
Copy link
Collaborator

jmid commented Nov 21, 2022

I found out that the CI workflow will not run when there are conflicts:

Note: Workflows will not run on pull_request activity if the pull request has a merge conflict. The merge conflict must be resolved first.
-- from https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request

So I guess it is easiest to rebase on the latest main
Note: #204 that removes src/kcas has been merged and might cause a conflict.

(* | "Win32" ->
ignore (Sys.command (
"powershell -Command \"Remove-Item -Path " ^ (static_path / "sandbox_root") ^ " -Recurse -Force -ErrorAction Ignore \"
& mkdir " ^ (static_path / "sandbox_root"))) *)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can safely remove these commented out lines now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the CI clean everything after a run? Imagine if a test crashes (with an assert false, for example), the filesys will not be removed. I can remove it locally, but does the CI automatically reset things?

@Lucccyo Lucccyo force-pushed the sys-stm-tests branch 3 times, most recently from a629d12 to 4f584e4 Compare December 21, 2022 10:00
@jmid
Copy link
Collaborator

jmid commented Dec 21, 2022

It seems there is a problem with Windows giving different error messages in Sys_error.

For example, Rmdir and Readdir can yield "Invalid argument":

Messages for test STM Sys test sequential:

  Results incompatible with model

   (Touch ([], "aaa", 511)) : 0
   (Rmdir ([], "aaa")) : Error (Sys_error("D:\\a\\multicoretests\\multicoretests\\_build\\default\\src\\sys\\sandbox_root\\aaa: Invalid argument"))
Messages for test STM Sys test sequential:

  Results incompatible with model

   (Touch ([], "eee", 511)) : 0
   (Readdir ["eee"]) : Error (Sys_error("D:\\a\\multicoretests\\multicoretests\\_build\\default\\src\\sys\\sandbox_root\\eee: Invalid argument"))

@jmid
Copy link
Collaborator

jmid commented Dec 22, 2022

There's also a problem with Mkdir it seems (when path is a file - not a directory?):
https://github.com/ocaml-multicore/multicoretests/actions/runs/3756940216/jobs/6383585814#step:8:9981

@Lucccyo
Copy link
Contributor Author

Lucccyo commented Dec 22, 2022

(Touch ([], "eee", 511)) : 0
(Mkdir (["eee"], "ccc", 511)) : 
     Error(Sys_error("D:\\a\\multicoretests\\multicoretests\\_build\\default\\src\\sys\\sandbox_root\\eee\\ccc: No such file or 
     directory"))

I think the problem here is that on Windows, a Mkdir will recursively create the directories if they don't exist.

See https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/mkdir

@jmid
Copy link
Collaborator

jmid commented Dec 22, 2022

The current error case in postcond reads

(s = (p complete_path) ^ ": No such file or directory" && not (mem_model fs path)) ||

As I understand it, this will only be true when "eee" is not a member of the file-system model.
But here "eee" is a member of the file-system model - but it is a freshly touched file - not a directory as expected.

@Lucccyo
Copy link
Contributor Author

Lucccyo commented Dec 22, 2022

As I understand it, this will only be true when "eee" is not a member of the file-system model.
But here "eee" is a member of the file-system model - but it is a freshly touched file - not a directory as expected.

Yes, of course, you are right! I fix that mistake.

@Lucccyo
Copy link
Contributor Author

Lucccyo commented Dec 22, 2022

Can I conclude that on Windows, there is no separation between Not a directory and No such file or directory ?
I just tested both windows and Linux the succession of the touch and the mkdir command, and I handled a different error.
If that is the case, I will separate the two OS in the code to be more readable.
Instead of

(s = (p complete_path) ^ ": No such file or directory" &&
       (not (mem_model fs path) || not (path_is_a_dir fs complete_path) )) ||

@jmid
Copy link
Collaborator

jmid commented Dec 22, 2022

I'm largely clueless about Windows unfortunately. Overall, I would expect that the error messages in Sys_error might be different on Windows compared to Linux/macOS.

It is good that you are doing experiments! 👍 I suggest first getting something to work on Windows by experiments as you are doing now - and then afterwards we can consider cleaning up the code if it is needed.

@jmid
Copy link
Collaborator

jmid commented Dec 22, 2022

I think the next problem is that Touch works differently across Linux/macOS and Windows:
https://github.com/ocaml-multicore/multicoretests/actions/runs/3757700936/jobs/6385193698#step:8:21520

I believe that touch of an existing file or directory just updates the last modification time, whereas the "Touch implementation" on Windows will fail when called with an existing file or directory...

@Lucccyo
Copy link
Contributor Author

Lucccyo commented Dec 22, 2022

I did some tests on my PC. The two things seem to behave the same, except touching an existing dir_name, where windows complain.

I miss a case in the touch postcond. If I get back n = 1 (the file can't be created), I check either:

  • the path (without the file_name) not exists
  • or the path (without the file_name) doesn't reach a directory

But it can also fail if the full path ( with the file's name) already exists.

jmid and others added 15 commits January 17, 2023 09:51
This could allow for some more sharing, when k is already mapped to v
Use Sys.mkdir instead of Sys.command "mkdir"... to create the sandbox to
avoid discrepancies between OSes
This also allows to set safer permissions, by restricting access to the
sandbox to the user running the test
Use Filename.quote to ensure that the sandbox is properly quoted before
invoking Sys.command
On windows, use the simple 'rd' command instead of calling powershell
(which would require double-quoting, we should probably avoid that)
On unix, use 'rm -r' instead of 'rm -rf' since, at the moment, '-f'
should not be required
Since the generated permissions were always the same and that it is
probably an aspect that is quite different between Unix and Windows,
remove them for now
Change the implementation of Touch so that the behaviours are similar on
Unix and Windows: by using an 'echo . > file' on all platforms, it will
uniformly fail when Touch is attempted on an existing directory
Choose a better name for the command that makes a new regular file
Use `Out_channel` instead of a shell command to create such a file
Take into account the fact that the creation can fail in postcond, which
makes Mkfile fairly similar to Mkdir
Instead of using the full absolute path, which makes the logs with
counter-examples harder to read, use a simple relative path for the
sandbox
Change its name from `sandbox` to `_sandbox`, to follow the common usage
for generated directories (such as `_build` etc.)
The missing `close_process_in` did leave a zombie around for the
duration of the test
@jmid jmid marked this pull request as ready for review January 17, 2023 08:51
@jmid
Copy link
Collaborator

jmid commented Jan 17, 2023

This CI run was all green, so I've removed the two commits focusing tests on Sys and squashed the remaining commits a bit. Suggestions for more squashing are welcome, otherwise I suggest we merge this (assuming the CI turns green 🤞)

@jmid
Copy link
Collaborator

jmid commented Jan 17, 2023

I just spotted some left-over commented-out code which f932eda removes.

@jmid
Copy link
Collaborator

jmid commented Jan 17, 2023

We've been over this repeatedly - so the time has come: merging 😃

@jmid jmid merged commit 1dc53af into ocaml-multicore:main Jan 17, 2023
@shym
Copy link
Collaborator

shym commented Jan 18, 2023

Even if that PR is already merged, it’s still the best place to store the fact that I could reproduce the behaviour of two simultaneous rmdir of the same directory under Windows/Cygwin, using the same small C program. As it was observed in CI (so Windows+MingW), I don’t think it’s necessary to try and port the small test program to MingW, it certainly must be a behaviour of the underlying Windows syscalls.

@shym
Copy link
Collaborator

shym commented Jan 18, 2023

Another follow-up: I wrote a small test to track down the weird behaviour we saw where a mkdir fails to create a subdirectory in a given directory X but calling file_exists afterwards nevertheless states that X exists. We had observed this behaviour only in the bytecode version on linux, but this must be by chance since the CI (and local) runs of that test confirm that it exists also on native code. It appears only a couple of times in million, though, see this log.
(This suggests that removing a directory is not an atomic operation as far as the kernel is concerned)

@jmid
Copy link
Collaborator

jmid commented Jan 18, 2023

Interesting!
Could the observed behaviour be explained by Sys simply not being Thread safe - and hence acting weirdly?
Looking at the documentation just now https://v2.ocaml.org/releases/5.0/api/Sys.html I spotted the following which indicates that we may be coloring outside the lines of the spec (on top of the parallel usage): 🤔

val rmdir : string -> unit
  Remove an empty directory.

  Since 4.12.0

@shym
Copy link
Collaborator

shym commented Jan 19, 2023

All the functions in Sys I’ve looked at are very very thin layers over the actual syscalls. Sys.rmdir documentation is pretty much rmdir() documentation.
It seems that once again we’re just observing the behaviour of the underlying syscalls / OS: I could reproduce exactly that same behaviour with a small C reproducer: https://github.com/shym/multicoretests/blob/parallel-rmdir/rmmkdir.c (Sys.file_exists is doing a stat, so I did just that).
I can easily imagine why the rmdir is not atomic to avoid big locks in the filesystem, but that makes it tricky to see how we could / should handle them in our tests. (When not looking specifically for them, they seem to appear fairly rarely, too rarely I think to tune our generator to run into them by chance reliably)

@shym
Copy link
Collaborator

shym commented Jan 20, 2023

According to this article and the detailed PhD thesis behind, POSIX doesn’t mandate directory operations to be atomic:

POSIX does not require directory operations to have an atomic “effect” in order to allow for such implementations. [...] Note that POSIX specifies an empty directory as one that contains at most “.” and “..” ([7],XBD,3.144), thus intermediate states in which an empty directory is even emptier are allowed.

@jmid
Copy link
Collaborator

jmid commented Jan 20, 2023

Good detective work, sir! 👍

@shym shym mentioned this pull request Feb 6, 2023
@shym
Copy link
Collaborator

shym commented Feb 6, 2023

As the case of having an observable intermediate state during the removal of a directory cropped up again in a CI run (see that comment), I’m wondering whether we should reimplement file_exists as a stat succeeding and such that st_nlink is strictly positive. 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants