
Conversation

@shanedsnyder commented Oct 13, 2022

Changes already included in the PR:

  • add struct timespec start/end times to job struct
  • add custom MPI reduction ops for max/min values of timespecs (see the sketch just after this list)
  • add logutils code for extracting start/end timespecs
  • update some analysis utilities to use new timespec values to calculate floating point app runtimes
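
As a rough illustration of the reduction-op bullet above (a minimal sketch only, not the actual Darshan implementation; the function and variable names are made up), a custom MPI reduction over struct timespec values could look like:

    /* keep the earliest struct timespec across ranks; a matching "max"
     * version would be used for end times */
    #include <mpi.h>
    #include <time.h>

    static void timespec_min_op(void *in, void *inout, int *len,
                                MPI_Datatype *dt)
    {
        struct timespec *a = in;
        struct timespec *b = inout;
        for (int i = 0; i < *len; i++) {
            /* compare seconds first, nanoseconds as the tie-breaker */
            if (a[i].tv_sec < b[i].tv_sec ||
                (a[i].tv_sec == b[i].tv_sec && a[i].tv_nsec < b[i].tv_nsec))
                b[i] = a[i];
        }
    }

    void reduce_start_time(struct timespec *local, struct timespec *global)
    {
        MPI_Datatype ts_type;
        MPI_Op min_op;

        /* treat the timespec as an opaque byte blob for the reduction */
        MPI_Type_contiguous(sizeof(struct timespec), MPI_BYTE, &ts_type);
        MPI_Type_commit(&ts_type);
        MPI_Op_create(timespec_min_op, 1 /* commutative */, &min_op);
        MPI_Reduce(local, global, 1, ts_type, min_op, 0, MPI_COMM_WORLD);
        MPI_Op_free(&min_op);
        MPI_Type_free(&ts_type);
    }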

TODO items:

  • remove duplicate calculation of start_time/end_time in job header, and rely solely on start_ts/end_ts
  • update PyDarshan and CFFI bindings for changes
  • update darshan-merge utility
  • update darshan-diff utility
  • update darshan-util docs to show new output format

@shanedsnyder (Author)

Here's what the new output looks like for darshan-parser on an example log:

# compression method: ZLIB
# exe: ./mpi-test
# uid: 1000
# jobid: 839408
# start_time: 1665678206
# start_time_asci: Thu Oct 13 11:23:26 2022
# end_time: 1665678209
# end_time_asci: Thu Oct 13 11:23:29 2022
# nprocs: 4
# run time: 3.0021
# metadata: lib_ver = 3.4.0
# metadata: h = romio_no_indep_rw=true;cb_nodes=4

We still just display integer start/end times (and the corresponding time strings), but runtime is now floating point (4 decimal places shown).
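
Purely for illustration (this is not the actual darshan-parser code, and the parameter names are hypothetical), the printed values can be derived from the new job timestamps roughly like this:

    #include <inttypes.h>
    #include <stdio.h>

    /* illustrative only -- not the actual darshan-parser implementation */
    static void print_job_times(int64_t start_time, int64_t end_time,
                                double run_time)
    {
        /* start/end are still printed as integer seconds... */
        printf("# start_time: %" PRId64 "\n", start_time);
        printf("# end_time: %" PRId64 "\n", end_time);
        /* ...but run time is now a float with 4 decimal places */
        printf("# run time: %.4f\n", run_time);
    }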

@shanedsnyder (Author)

Note that we currently use the following definition in the Darshan job header for these new values:

    struct timespec start_ts;
    struct timespec end_ts;

Where a timespec is defined as follows:

    struct timespec {
        time_t   tv_sec;        /* seconds */
        long     tv_nsec;       /* nanoseconds */
    };

I think we probably want to change from the struct timespec definitions to something better defined (that we can simply cast the timespecs into), for a couple of reasons:

  • General log portability if the log is generated on a system with different time_t (or long) sizes than the system where logs are analyzed.
  • CFFI bindings are probably complicated to get right if we pull in a struct timespec, instead of something we define ourselves? My attempt to pull the existing definitions into the bindings predictably caused things to explode, but maybe it's still somehow workable.

Given that, I guess we should update to a fixed-width definition like:

    int64_t start_sec;
    int32_t start_nsec;
    int64_t end_sec;
    int32_t end_nsec;

Make sense?
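
As a rough sketch of that idea (the struct and field names here are placeholders, not the final Darshan definitions), the in-memory timespecs would simply be assigned into fixed-width fields before the job record is written out, so the on-disk format no longer depends on the platform's time_t or long sizes:

    #include <stdint.h>
    #include <time.h>

    /* hypothetical fixed-width job timestamp fields */
    struct job_times {
        int64_t start_sec;
        int32_t start_nsec;
        int64_t end_sec;
        int32_t end_nsec;
    };

    /* copy from the runtime timespecs into the portable on-disk fields */
    static void job_times_from_timespec(struct job_times *out,
                                        const struct timespec *start,
                                        const struct timespec *end)
    {
        out->start_sec  = (int64_t)start->tv_sec;
        out->start_nsec = (int32_t)start->tv_nsec;
        out->end_sec    = (int64_t)end->tv_sec;
        out->end_nsec   = (int32_t)end->tv_nsec;
    }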

@carns (Contributor) commented Oct 19, 2022

That makes sense to me @shanedsnyder, and the code looks good so far (I know it's not done, but I went ahead and looked over what's there).

I'd probably be inclined to just make the nsec fields 64 bit too in the log format for simplicity, but I doubt it matters much.

@shanedsnyder (Author)

Okay, I have added code for a more portable representation of the time seconds/nanoseconds in the new job structure, and have also added backwards compatibility code to up-convert logs with old job description formats.

One of the main outstanding issues is how to best calculate the application runtime in Darshan utilities (and PyDarshan) consistently across logs with different versions. The new method for calculating runtime will always produce a non-zero runtime value for newer log versions. However, this method does not properly handle old log versions, which traditionally had their runtime rounded up (i.e., runtime = end_time - start_time + 1) so it was never reported as 0.

To avoid having version-specific handling in each Darshan utility, I'd be inclined to add a new logutils function for calculating a floating point runtime value. For new versions of logs, it uses the new method introduced in this PR. For old versions, it uses the old method that produces a rounded up integer value (converted to float when returning). Any objections or alternatives to that approach that I should consider?
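
Roughly, such a helper would look like the sketch below (a sketch only; the parameter names and the version check are simplified assumptions, not the final logutils API):

    #include <stdint.h>

    /* returns app runtime in seconds; exact for new-format logs, rounded up
     * to the nearest integer for old-format logs so it is never reported as 0 */
    double job_runtime_sketch(int64_t start_sec, int32_t start_nsec,
                              int64_t end_sec, int32_t end_nsec,
                              int log_is_old_format)
    {
        if (log_is_old_format)
            return (double)(end_sec - start_sec + 1);

        return (double)(end_sec - start_sec) +
               (double)(end_nsec - start_nsec) / 1e9;
    }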

@shanedsnyder (Author)

> To avoid having version-specific handling in each Darshan utility, I'd be inclined to add a new logutils function for calculating a floating point runtime value. For new versions of logs, it uses the new method introduced in this PR. For old versions, it uses the old method that produces a rounded up integer value (converted to float when returning). Any objections or alternatives to that approach that I should consider?

I implemented this in 19fe161; it seems to work as expected for new logs and for old logs from the darshan-logs repo.

carns previously approved these changes Nov 8, 2022

@carns (Contributor) left a comment:

LGTM. I like the helper utility function for handling both cases.

Shane Snyder added 8 commits November 8, 2022 17:19
* add struct timespec start/end times to job struct
* add custom MPI reduction ops for max/min values of
  struct timespecs
* add logutils code for extracting start/end timespecs
* update some analysis utilities to use new timespec values
* new logutils function (`darshan_log_get_job_runtime`) to
  calculate job runtime as a floating point value; the method
  for calculating this depends on log format version, so a
  helper routine seems useful for utilities
  * for pre-3.4.1 logs, this is end-start+1 (i.e., runtime
    rounded up to the nearest integer)
  * for 3.4.1+ logs, this is the exact runtime to nsec precision
* CFFI definition updates of job types, type sizes
* CFFI bindings updates to extract new job timers and runtime
* updates to summary and report interfaces to use new job timers
  and runtime values
* updates to plotting code to use new job timers and runtime
  values
  * `plot_dxt_heatmap.py` in particular had some code removed
    that tried to correct for a calculated runtime of 0
* updates to tests to accommodate new changes to various
  interfaces and record formats
  * `test_plot_dxt_heatmap.py` has changes to avoid need for
    rounding up calculated runtime
  * new ior-PNETCDF/ior-HDF5 log files are included as the previous
    ones are no longer valid due to log format changes
@shanedsnyder (Author)

Just rebased onto main and added a big commit to get pytest passing for me locally. We'll see how it works with CI more generally.

@shanedsnyder changed the title from "WIP, ENH: improved precision of darshan job start/end times" to "ENH: improved precision of darshan job start/end times" Nov 9, 2022
@shanedsnyder reopened this Nov 9, 2022
@shanedsnyder (Author)

Not sure what's going on, but just the ubuntu-latest, 3.10 tests are now hanging. I tried re-running them specifically and took a closer look, but they hang essentially immediately upon starting the "Test with pytest" step...

@shanedsnyder reopened this Nov 9, 2022

@tylerjereddy (Collaborator) left a comment:

One concern I have is with the log file turnover here--if we don't add the new log files to the logs repo first, they aren't getting flushed by pytest for report generation in the CI here? And why remove the old ones? They're not being used by any tests or?

create_dxttimeline.py may not really be under test/in active usage?

I suppose I might feel more comfortable if the new log files were in the logs repo so that summary report generation was automatically tested for them.

As for the CI hanging, it may be a memory issue (?)--I believe the MacOS runners have more physical memory available than the Ubuntu runners: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources

Beyond any specific comments I made, the Python changes seem sensible enough, and it looks like Phil has reviewed the C.

    - char mnt_type[3031];
    - char mnt_path[3031];
    + char mnt_type[3015];
    + char mnt_path[3015];

Collaborator:

Is this correcting an error unrelated to the primary thrust of this PR or?

Author:

Kind of convoluted, but these array sizes are ultimately derived from a macro calculation based on the size of the Darshan job structure (the job structure grew by 16 bytes in this PR, and as you can see above, that reduces these array sizes by 16 bytes).

Another argument for somehow automating this, as the only way I realized this needed to be fixed was by observing weird failures in testing.
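
To illustrate the kind of macro dependency being described (purely illustrative; these are not Darshan's actual macro names, sizes, or struct layout), the string sizes fall out of whatever space a fixed-size record budget leaves after the job struct:

    #include <stdint.h>

    /* stand-in for the real struct darshan_job, just to make the point */
    struct fake_darshan_job {
        int64_t start_sec; int32_t start_nsec;
        int64_t end_sec;   int32_t end_nsec;
        /* ...other job fields... */
    };

    /* if the job struct and the mount/exe strings share one fixed-size
     * record, the string sizes are whatever the job struct leaves over */
    #define JOB_RECORD_SIZE 4096   /* made-up fixed budget */
    #define MNT_STR_LEN (JOB_RECORD_SIZE - sizeof(struct fake_darshan_job))

    char mnt_type[MNT_STR_LEN];    /* shrinks as the job struct grows */
    char mnt_path[MNT_STR_LEN];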

      cols = ["length", "start_time", "end_time", "rank"]
      agg_df = pd.DataFrame(data=data, columns=cols)
    - runtime = 1
    + runtime = 2

Collaborator:

what's going on here? changing the behavior for the synthetic data set because you always round up under the hood now or?

Author:

That's exactly right.

We are always rounding up under the covers for older logs, so the calculated runtime in those cases should always be greater than any of the DXT timestamp values. Since the DXT timestamps in the synthetic data above reference timestamps greater than 1, the runtime needs to be at least 2 (or, more accurately, greater than the largest DXT timestamp value, but we use whole numbers for simplicity).

      ),
      (
    - "shane_ior-PNETCDF_id864223-864223_10-27-46849-11258636277699483231_1.darshan",
    + "shane_ior-PNETCDF_id438100-438100_11-9-41525-10280033558448664385_1.darshan",

Collaborator:

is there much benefit to the turnover of log files here and elsewhere vs. adding to logs repo and flushing through the new logs that way?

Author:

The only reason I switched log files again is that the old ones were generated by a previous development branch (pnetcdf) that we will no longer be able to parse correctly on this branch. Essentially, the old logs have the same embedded format versions as the logs in this PR (we only really bump them on release), but they do have slightly different underlying formats. This means the logutils code can't properly parse the old versions, unfortunately.

This is one of the drawbacks of testing with logs that were not generated by stable darshan releases (as opposed to dev branches), as we can't reliably guarantee backwards compatibility between releases (well, I guess we could, but we won't).

@shanedsnyder (Author)

> One concern I have is with the log file turnover here--if we don't add the new log files to the logs repo first, they aren't getting flushed by pytest for report generation in the CI here? And why remove the old ones? They're not being used by any tests or?

The new logs should be under test by pytest; they are found in the same locations as the various other test logs kept locally in Darshan. Maybe I'm missing the point.

I mentioned this above, but I'm not sure there's an easy answer here. Pushing development logs into darshan-logs can cause other issues, particularly if subsequent feature branches further modify log format between releases, breaking ability to parse prior development logs -- in this case, the logs will need to be regenerated. Also, if outside folks wanted to mine through our logs, it might be confusing if they would need specific Darshan branches to parse them.

It might be cleaner just to keep test logs locally in Darshan as needed, then flush them all out to darshan-logs at release time once we are sure they have a stable format. I didn't say anything about it in this PR or elsewhere, but that's basically what I was planning to do with these 2 logs. I'm open to other suggestions though.

> create_dxttimeline.py may not really be under test/in active usage?

Ah yeah, looks like it isn't. I just fixed things I noticed when grepping around for job field keys that are no longer around (i.e. start_sec and end_sec), but wasted time on this one probably :)

> I suppose I might feel more comfortable if the new log files were in the logs repo so that summary report generation was automatically tested for them.

> As for the CI hanging, it may be a memory issue (?)--I believe the MacOS runners have more physical memory available than the Ubuntu runners: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources

Not sure; I don't really see any output at all from pytest after I kill the tests and view the raw logs. It appears to be hanging before it even collects the list of tests or anything. I don't understand what is going on though -- these tests work fine on other PRs, which does make it seem related to this PR.

@tylerjereddy (Collaborator) commented Nov 10, 2022

The advantage of the logs repo is that test_main_all_logs_repo_files() will use _provide_logs_repo_filepaths to automatically pull in all logs from over there and check the HTML reports. There is no such equivalent for files you place directly in the repo at this time. I suppose you could add to test_main_without_args() manually for the two new files you have here, then delete them when you do the migration to the logs repo later.

The backwards compatibility business is pretty confusing, too bad. I think backwards compatibility in general (log files, semi-public struct member ordering) is one of the pain points on the development side, at least. Note also that some log files in the logs repo were generated from development hashes/"latest main", I think -- it should even say so in their README -- so that may need to be checked.

We've seen the CI flush through perfectly today in gh-849, so maybe try again tomorrow if you genuinely think it is a fluke here. On the other hand, I suspect we've been scraping close to the limit for some time, so it may not take much to push us over. GitHub Universe conference is running today and tomorrow, maybe some extra stress from that on the resources. I guess you could open a dummy/test PR that basically changes something innocuous to see if that also fails CI for same reason.

@carns (Contributor) commented Nov 10, 2022

> I mentioned this above, but I'm not sure there's an easy answer here. Pushing development logs into darshan-logs can cause other issues, particularly if subsequent feature branches further modify log format between releases, breaking ability to parse prior development logs -- in this case, the logs will need to be regenerated. Also, if outside folks wanted to mine through our logs, it might be confusing if they would need specific Darshan branches to parse them.

Maybe make a top-level subdir for unstable version logs in the darshan-logs repo? They could be available for specific CI tests as things land, but omitted from "let's try this on all the example logs" CI cases, and deleted once their immediate purpose is served?

Or maybe they shouldn't be in darshan-logs at all, just pull from box or ftp or whatever during dev if we need unstable version logs to test features as they are developed?

Just throwing ideas out for discussion. I agree that dev logs in darshan-logs are bad news, at least unless they are clearly categorized differently.

@shanedsnyder (Author)

> We've seen the CI flush through perfectly today in gh-849, so maybe try again tomorrow if you genuinely think it is a fluke here. On the other hand, I suspect we've been scraping close to the limit for some time, so it may not take much to push us over. GitHub Universe conference is running today and tomorrow, maybe some extra stress from that on the resources. I guess you could open a dummy/test PR that basically changes something innocuous to see if that also fails CI for same reason.

Yeah, I'm convinced it's something specific to this PR, as #850 passes tests fine now -- ubuntu-latest, python 3.10 is passing in 11 minutes there, and it's just hung for over 20 minutes on this PR.

The changeset here isn't trivial, obviously, but I really don't get what specifically could cause a hiccup in the tests (particularly in only one of the CI environments). I didn't add any new tests, just updated expected values for old ones and added a couple of new logs. I'd be a little more comfortable if the tests were clearly failing, or at least making progress until stalling at some test with larger memory requirements, but I'm not seeing any initial clues about what's going on. Any ideas on how to test whether we're close to memory limits or something? Not really sure what else to try here.

@tylerjereddy (Collaborator)

One thing to try might be https://pypi.org/project/pytest-monitor/ -- then we could get the per-test memory usage into a SQL database -> pandas and sort by memory footprint. If nothing else, we could then try disabling the worst offenders (@pytest.mark.xfail(run=False)), or perhaps most appropriately check for available memory with psutil and then skip if there isn't enough (like: https://github.com/scipy/scipy/blob/main/scipy/spatial/tests/test_hausdorff.py#L163).

One thing we've known for some time is stuff like gh-784, where even an average laptop can't handle some of our DXT processing until we get those data structure adjustments decided upon, so something tiny might have just pushed it over the edge when things run at the same time.

I'm on vacation til Sunday night, but use your discretion I guess if you need to move forward (maybe open an issue if you temporarily skip/disable some problem tests).

@tylerjereddy (Collaborator) commented Nov 14, 2022

I tried xfailing (with execution blocked) one of the more expensive test cases (just one case, not the entire test) when comparing memory footprints against main using pytest-monitor:

df_main:
        MEM_USAGE                                     ITEM_PATH                           ITEM                                       ITEM_VARIANT
400  1900.062500                    darshan.tests.test_summary  test_main_all_logs_repo_files  test_main_all_logs_repo_files[/home/tyler/pyth...
463  1647.609375                    darshan.tests.test_summary              test_module_table  test_module_table[partial_data_dxt.darshan, ex...
38   1388.152344  darshan.tests.test_data_access_by_filesystem           test_vertical_resize  test_vertical_resize[snyder_acme.exe_id1253318...
399  1354.300781                    darshan.tests.test_summary  test_main_all_logs_repo_files  test_main_all_logs_repo_files[/home/tyler/pyth...
401  1307.378906                    darshan.tests.test_summary  test_main_all_logs_repo_files  test_main_all_logs_repo_files[/home/tyler/pyth...
..           ...                                           ...                            ...                                                ...
4     113.292969                  darshan.tests.test_cffi_misc    test_log_get_generic_record                 test_log_get_generic_record[numpy]
3     113.164062                  darshan.tests.test_cffi_misc            test_file_hash_type   test_file_hash_type[sample, dxt, simple.darshan]
2     111.937500                  darshan.tests.test_cffi_misc            test_file_hash_type       test_file_hash_type[sample, goodost.darshan]
1     110.257812                  darshan.tests.test_cffi_misc            test_file_hash_type                test_file_hash_type[sample.darshan]
0     107.796875                  darshan.tests.test_cffi_misc           test_get_lib_version                               test_get_lib_version

[479 rows x 4 columns]
df_shane:
        MEM_USAGE                                     ITEM_PATH                           ITEM                                       ITEM_VARIANT
400  1851.195312                    darshan.tests.test_summary  test_main_all_logs_repo_files  test_main_all_logs_repo_files[/home/tyler/pyth...
463  1605.031250                    darshan.tests.test_summary              test_module_table  test_module_table[partial_data_dxt.darshan, ex...
38   1535.367188  darshan.tests.test_data_access_by_filesystem           test_vertical_resize  test_vertical_resize[snyder_acme.exe_id1253318...
399  1372.820312                    darshan.tests.test_summary  test_main_all_logs_repo_files  test_main_all_logs_repo_files[/home/tyler/pyth...
401  1307.511719                    darshan.tests.test_summary  test_main_all_logs_repo_files  test_main_all_logs_repo_files[/home/tyler/pyth...
..           ...                                           ...                            ...                                                ...
4     114.128906                  darshan.tests.test_cffi_misc    test_log_get_generic_record                 test_log_get_generic_record[numpy]
3     114.000000                  darshan.tests.test_cffi_misc            test_file_hash_type   test_file_hash_type[sample, dxt, simple.darshan]
2     112.964844                  darshan.tests.test_cffi_misc            test_file_hash_type       test_file_hash_type[sample, goodost.darshan]
1     112.503906                  darshan.tests.test_cffi_misc            test_file_hash_type                test_file_hash_type[sample.darshan]
0     107.988281                  darshan.tests.test_cffi_misc           test_get_lib_version                               test_get_lib_version

[479 rows x 4 columns]
df_shane - df_main:
 38     147.214844
399     18.519531
34       3.761719
1        2.246094
2        1.027344
          ...    
424   -121.417969
427   -121.417969
426   -121.417969
425   -121.417969
429   -121.417969

I wasn't entirely sure of the effect, since memory_profiler was not super convincing when I repeated that test many times on each branch. That said, memory_profiler did seem to show a slight increase in footprint for the overall test suite, though I'm again not sure of the statistical significance given the challenges with memory profiling in general (and it takes a while to run the profiling with the serial test suite):

main:

[memory_profiler plot for main]

here:

[memory_profiler plot for this branch]

Even if the measured changes above are largely stochastic, we may still see some benefit from suppressing one of the most expensive test cases in the suite (note that we don't suppress the entire test, just the case with the large data file). If this still gets us nowhere, we can try adding the verbose (-v) flag to pytest in CI here to see the last test that gets printed before the output is stopped/cancelled.

@tylerjereddy (Collaborator)

ubuntu-latest 3.10 is taking forever at the pytest step... I'll let that run until it somehow cancels/dies if possible. If there is no info after that, I'll try testing with act locally for that scenario.

@github-actions bot added the CI continuous integration label Nov 15, 2022
@tylerjereddy (Collaborator)

Even with debug mode and verbose pytest, the ubuntu-latest 3.10 job still doesn't show much test output after 8 minutes:

[screenshot of CI log output]

It should be reporting the first several completed tests by then. I'll try a few more things. It works fine in act locally for this ubuntu/python combo in verbose mode.

@tylerjereddy (Collaborator)

argh, let's see if I can use the ssh debug approach to get right on the problem virtual machine per https://github.com/marketplace/actions/debugging-with-ssh

@tylerjereddy (Collaborator)

Ok, now we can log right into the affected runner/job using output from the Actions log:

[screenshot of SSH connection details from the Actions log]

I'll see if I can reproduce the problem.

@tylerjereddy (Collaborator)

I confirmed that in the ubuntu-latest 3.10 machine/runner, when logged in via ssh by the mechanism noted above, following the exact steps of the "Test with pytest" step allows the test suite to pass as expected. So, that's very confusing!

Next steps will probably be:

  • branch off of this branch and debug in my fork a bit
  • probably just try a bunch of different things, like pinning the pytest version, the python patch version, and so on, to see if any of those stick

* use Ubuntu 22.04 LTS for Python 3.10
testing to avoid mysterious issues
described in gh-851
@tylerjereddy force-pushed the snyder/dev-precise-job-timestamps branch from 0ce81a4 to 5410a73 on November 15, 2022 20:36
@tylerjereddy (Collaborator)

Ok, let's see if the approach described in gh-851 works to get CI passing then. It looks like using Ubuntu 22.04 LTS for the Python 3.10 CI job may do the trick, but we'll see. ubuntu-latest currently means 20.04, which confusingly is an older version, so testing the latest LTS is probably helpful for other reasons anyway.

@tylerjereddy (Collaborator) commented Nov 15, 2022

Ok, the CI is finally passing here. I'd suggest the next steps are:

  • confirm that you are "ok" with my course of action re: the CI issue
  • confirm that I haven't introduced any undesired changes during the debugging/history editing process
  • confirm that reviewer comments have been addressed -- it sounds like Phil may be happy with the C changes, and not many of my comments can be acted on given that we allow C structs to be internally edited (vs. append-only) in our development process
  • merge the PR

@carns (Contributor) left a comment:

I'm good with this (including the CI platform changes), thanks.

@shanedsnyder (Author)

All looks good to me, I'll go ahead and merge.

FWIW, I agree this is a pretty bizarre bug... it is clearly only triggered by changes introduced on this branch (or their side effects), as other recent PRs (as well as the dummy PR I tried last week) run without issue. The fact that none of the tests ever seem to start is very odd, and given we can only trigger this via CI workflows, debugging it sounds like a nightmare. I agree it's probably not worth sinking more time into.

@shanedsnyder merged commit f806e22 into main Nov 18, 2022
