[WIP] Add STM tests for Sys module #129

Lucccyo · 2022-09-12T07:08:29Z

No description provided.

src/sys/dune

src/sys/stm_tests.ml

jmid · 2022-11-17T08:57:56Z

I've taken a first quick pass over the code.

There's duplicate code:

a duplicate p function in line 188 + 203. I suggest pulling it out as a top-level function and to reuse it instead

there's duplicate logic in postcond line 216 + 235:

(match find_opt fs complete_path with
            | None -> true
            | Some target_fs -> not (is_a_dir target_fs)))

rather than copy it, it would make sense to reuse it

I think you can reuse the code in is_existing_is_a_dir in line 221 and is_a_dir in line 241 + 265 if we turn it into a function accepting a path argument?

There may be other occurrences, so take it as an exercise to review your own code critically and see if you can spot other occurrences or possibilities to simplify, shorten, and reuse.

Since this is a multicoretests PR it should run the parallel tests too now that the model is OK.
This will probably fail, because the Sys functions were not designed to be used in parallel, so it is fine to leave it as a negative test (expected to fail). We hope that parallel testing will not cause a segmentation fault though 🤞

In the harder end, the underlying generators could be adjusted:

name_gen: For inspiration here's how QCheck generates printable_chars and numeral characters: https://github.com/c-cube/qcheck/blob/063c1d74795a24eb77fa661d218c4715382df566/src/core/QCheck.ml#L356-L366 We would like to keep the ability to generate duplicate names though.
perm_gen: Currently this is just a constant. Since all operations are run as the same user, generating an owner permission may be sufficient

Finally, as of last night we now have Windows CI tests. This means some of the Unix-specific Sys.command operations (rm -rf, touch, ...) need adjustments. Checking Sys.os_type may help here.

src/sys/stm_tests.ml

jmid · 2022-11-21T14:16:48Z

I found out that the CI workflow will not run when there are conflicts:

Note: Workflows will not run on pull_request activity if the pull request has a merge conflict. The merge conflict must be resolved first.
-- from https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request

So I guess it is easiest to rebase on the latest main
Note: #204 that removes src/kcas has been merged and might cause a conflict.

src/sys/dune

src/sys/stm_tests.ml

jmid · 2022-12-20T14:57:28Z

src/sys/stm_tests.ml

+    (* | "Win32" ->
+      ignore (Sys.command (
+      "powershell -Command \"Remove-Item -Path " ^ (static_path / "sandbox_root") ^ " -Recurse -Force -ErrorAction Ignore \"
+       & mkdir " ^ (static_path / "sandbox_root"))) *)


You can safely remove these commented out lines now

Do the CI clean everything after a run? Imagine if a test crashes (with an assert false, for example), the filesys will not be removed. I can remove it locally, but does the CI automatically reset things?

src/sys/stm_tests.ml

jmid · 2022-12-21T17:09:40Z

It seems there is a problem with Windows giving different error messages in Sys_error.

For example, Rmdir and Readdir can yield "Invalid argument":

Messages for test STM Sys test sequential:

  Results incompatible with model

   (Touch ([], "aaa", 511)) : 0
   (Rmdir ([], "aaa")) : Error (Sys_error("D:\\a\\multicoretests\\multicoretests\\_build\\default\\src\\sys\\sandbox_root\\aaa: Invalid argument"))

Messages for test STM Sys test sequential:

  Results incompatible with model

   (Touch ([], "eee", 511)) : 0
   (Readdir ["eee"]) : Error (Sys_error("D:\\a\\multicoretests\\multicoretests\\_build\\default\\src\\sys\\sandbox_root\\eee: Invalid argument"))

jmid · 2022-12-22T11:39:35Z

There's also a problem with Mkdir it seems (when path is a file - not a directory?):
https://github.com/ocaml-multicore/multicoretests/actions/runs/3756940216/jobs/6383585814#step:8:9981

Lucccyo · 2022-12-22T11:45:14Z

(Touch ([], "eee", 511)) : 0
(Mkdir (["eee"], "ccc", 511)) : 
     Error(Sys_error("D:\\a\\multicoretests\\multicoretests\\_build\\default\\src\\sys\\sandbox_root\\eee\\ccc: No such file or 
     directory"))

I think the problem here is that on Windows, a Mkdir will recursively create the directories if they don't exist.

See https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/mkdir

jmid · 2022-12-22T11:53:41Z

The current error case in postcond reads

(s = (p complete_path) ^ ": No such file or directory" && not (mem_model fs path)) ||

As I understand it, this will only be true when "eee" is not a member of the file-system model.
But here "eee" is a member of the file-system model - but it is a freshly touched file - not a directory as expected.

Lucccyo · 2022-12-22T11:56:43Z

As I understand it, this will only be true when "eee" is not a member of the file-system model.
But here "eee" is a member of the file-system model - but it is a freshly touched file - not a directory as expected.

Yes, of course, you are right! I fix that mistake.

Lucccyo · 2022-12-22T12:09:12Z

Can I conclude that on Windows, there is no separation between Not a directory and No such file or directory ?
I just tested both windows and Linux the succession of the touch and the mkdir command, and I handled a different error.
If that is the case, I will separate the two OS in the code to be more readable.
Instead of

(s = (p complete_path) ^ ": No such file or directory" &&
       (not (mem_model fs path) || not (path_is_a_dir fs complete_path) )) ||

jmid · 2022-12-22T12:26:57Z

I'm largely clueless about Windows unfortunately. Overall, I would expect that the error messages in Sys_error might be different on Windows compared to Linux/macOS.

It is good that you are doing experiments! 👍 I suggest first getting something to work on Windows by experiments as you are doing now - and then afterwards we can consider cleaning up the code if it is needed.

jmid · 2022-12-22T14:18:33Z

I think the next problem is that Touch works differently across Linux/macOS and Windows:
https://github.com/ocaml-multicore/multicoretests/actions/runs/3757700936/jobs/6385193698#step:8:21520

I believe that touch of an existing file or directory just updates the last modification time, whereas the "Touch implementation" on Windows will fail when called with an existing file or directory...

Lucccyo · 2022-12-22T14:42:13Z

I did some tests on my PC. The two things seem to behave the same, except touching an existing dir_name, where windows complain.

I miss a case in the touch postcond. If I get back n = 1 (the file can't be created), I check either:

the path (without the file_name) not exists
or the path (without the file_name) doesn't reach a directory

But it can also fail if the full path ( with the file's name) already exists.

This could allow for some more sharing, when k is already mapped to v

Use Sys.mkdir instead of Sys.command "mkdir"... to create the sandbox to avoid discrepancies between OSes This also allows to set safer permissions, by restricting access to the sandbox to the user running the test

Use Filename.quote to ensure that the sandbox is properly quoted before invoking Sys.command On windows, use the simple 'rd' command instead of calling powershell (which would require double-quoting, we should probably avoid that) On unix, use 'rm -r' instead of 'rm -rf' since, at the moment, '-f' should not be required

Since the generated permissions were always the same and that it is probably an aspect that is quite different between Unix and Windows, remove them for now

Change the implementation of Touch so that the behaviours are similar on Unix and Windows: by using an 'echo . > file' on all platforms, it will uniformly fail when Touch is attempted on an existing directory

Choose a better name for the command that makes a new regular file Use `Out_channel` instead of a shell command to create such a file Take into account the fact that the creation can fail in postcond, which makes Mkfile fairly similar to Mkdir

Instead of using the full absolute path, which makes the logs with counter-examples harder to read, use a simple relative path for the sandbox Change its name from `sandbox` to `_sandbox`, to follow the common usage for generated directories (such as `_build` etc.)

The missing `close_process_in` did leave a zombie around for the duration of the test

jmid · 2023-01-17T08:54:07Z

This CI run was all green, so I've removed the two commits focusing tests on Sys and squashed the remaining commits a bit. Suggestions for more squashing are welcome, otherwise I suggest we merge this (assuming the CI turns green 🤞)

jmid · 2023-01-17T16:41:15Z

I just spotted some left-over commented-out code which f932eda removes.

jmid · 2023-01-17T21:56:18Z

We've been over this repeatedly - so the time has come: merging 😃

shym · 2023-01-18T13:24:39Z

Even if that PR is already merged, it’s still the best place to store the fact that I could reproduce the behaviour of two simultaneous rmdir of the same directory under Windows/Cygwin, using the same small C program. As it was observed in CI (so Windows+MingW), I don’t think it’s necessary to try and port the small test program to MingW, it certainly must be a behaviour of the underlying Windows syscalls.

shym · 2023-01-18T18:14:10Z

Another follow-up: I wrote a small test to track down the weird behaviour we saw where a mkdir fails to create a subdirectory in a given directory X but calling file_exists afterwards nevertheless states that X exists. We had observed this behaviour only in the bytecode version on linux, but this must be by chance since the CI (and local) runs of that test confirm that it exists also on native code. It appears only a couple of times in million, though, see this log.
(This suggests that removing a directory is not an atomic operation as far as the kernel is concerned)

jmid · 2023-01-18T21:42:04Z

Interesting!
Could the observed behaviour be explained by Sys simply not being Thread safe - and hence acting weirdly?
Looking at the documentation just now https://v2.ocaml.org/releases/5.0/api/Sys.html I spotted the following which indicates that we may be coloring outside the lines of the spec (on top of the parallel usage): 🤔

val rmdir : string -> unit
  Remove an empty directory.

  Since 4.12.0

shym · 2023-01-19T09:05:57Z

All the functions in Sys I’ve looked at are very very thin layers over the actual syscalls. Sys.rmdir documentation is pretty much rmdir() documentation.
It seems that once again we’re just observing the behaviour of the underlying syscalls / OS: I could reproduce exactly that same behaviour with a small C reproducer: https://github.com/shym/multicoretests/blob/parallel-rmdir/rmmkdir.c (Sys.file_exists is doing a stat, so I did just that).
I can easily imagine why the rmdir is not atomic to avoid big locks in the filesystem, but that makes it tricky to see how we could / should handle them in our tests. (When not looking specifically for them, they seem to appear fairly rarely, too rarely I think to tune our generator to run into them by chance reliably)

shym · 2023-01-20T08:27:26Z

According to this article and the detailed PhD thesis behind, POSIX doesn’t mandate directory operations to be atomic:

POSIX does not require directory operations to have an atomic “effect” in order to allow for such implementations. [...] Note that POSIX specifies an empty directory as one that contains at most “.” and “..” ([7],XBD,3.144), thus intermediate states in which an empty directory is even emptier are allowed.

jmid · 2023-01-20T09:41:06Z

Good detective work, sir! 👍

shym · 2023-02-06T14:11:16Z

As the case of having an observable intermediate state during the removal of a directory cropped up again in a CI run (see that comment), I’m wondering whether we should reimplement file_exists as a stat succeeding and such that st_nlink is strictly positive. 🤔

jmid reviewed Sep 20, 2022

View reviewed changes

src/sys/dune Outdated Show resolved Hide resolved

jmid reviewed Sep 20, 2022

View reviewed changes

src/sys/dune Outdated Show resolved Hide resolved

jmid changed the title ~~Add STM tests for Sys module~~ [WIP] Add STM tests for Sys module Sep 30, 2022

Lucccyo force-pushed the sys-stm-tests branch from bf90b9e to d12f0e8 Compare October 24, 2022 15:10

jmid reviewed Oct 27, 2022

View reviewed changes

src/sys/dune Outdated Show resolved Hide resolved

Lucccyo commented Nov 9, 2022

View reviewed changes