How to handle performance tests #7970
Replies: 34 comments
-
Depending on the answers to (1) and (2), it may not be possible for a test to be both a baseline and performance test. If that's the case, then it raises the question of whether the new test-suite flag
-
At least in CESM, and at least last time I checked, the current TPUTCOMP is just based on the overall run times given in the model log files. This often isn't very useful for understanding differences in timings between runs. What I was referring to in ESMCI/cime#2918 was saving the full contents of the timing directory (i.e., the detailed timing results that give min/max/etc. times spent in different blocks of code surrounded by timers). Is this possibly related to what you already do with SAVE_TIMING?
-
|
Thanks @billsacks , that does sound more useful than the TPUTCOMP check and possibly addresses point (1) above, since it presumably would allow us to compare times that don't include I/O.
-
|
Yes, if we examine the timing output more closely, we can look at timers that don't include I/O. Note that turning off I/O would also break any compareTwo test.

Currently, the fail condition is "time taken is outside tolerance from the last saved run". But if repeated PRs are each just under the threshold, the model will gradually slow down and never fail a test. The threshold can't be too small, because machine variability can lead to false positives. So we need something outside of CIME that tracks performance trends over a few days or a week.

Does the new capability in (6) require separate blesses for both performance fails AND baseline fails? When v3 went in, it slowed down the model and changed answers from baselines. The bless for the baselines also blessed the throughput change. In that case, we knew about the slowdown and were going to bless it, but other baseline-change blesses could sneak in a performance degradation unless the slowdown is caught and requires its own bless.
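The gradual-slowdown concern above can be made concrete. Here is an illustrative sketch (not CIME code; all names and numbers are made up): every run passes a per-run tolerance check against its predecessor, yet a least-squares slope over a window of recent runs exposes the drift.

```python
# Illustrative sketch of why per-run tolerance checks miss gradual slowdowns.
# Throughput is in simulated-years-per-day (higher is better).

def within_tolerance(current, baseline, tolerance=0.10):
    """Per-run check: fail only if throughput drops more than
    `tolerance` below the last saved baseline."""
    return current >= baseline * (1.0 - tolerance)

def trend_slope(samples):
    """Least-squares slope of throughput vs. run index over a window."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var if var else 0.0

# Each run is within 10% of its predecessor, so no single run fails...
history = [10.0, 9.5, 9.1, 8.7, 8.3, 8.0]
assert all(within_tolerance(b, a) for a, b in zip(history, history[1:]))
# ...yet the model has slowed ~20% overall; a negative trend flags it.
assert trend_slope(history) < 0
```

A longitudinal monitor (outside CIME, e.g. in PACE) could run exactly this kind of windowed check nightly.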
-
@rljacob , that's a good point regarding (6). The clear intent from my meetings with @sarats is that we want to keep hist and performance baselines conceptually separate.
-
|
@jasonb5 , see above. This may require a bit more work on bless_test_results than we initially thought. We may also want to take a look at system_tests_common.py and the timing output to see if there are more useful comparisons that can be implemented.
-
|
I still think we need a nightly test to see if there's a slowdown above the threshold, and then a separate bless for that. But we'll need another monitoring system watching for a gradual slowdown over days or weeks. Maybe that was part of the original PACE plan. @sarats ?
-
|
@rljacob Yes, long-term longitudinal tracking was previously done using the E3SM standardized performance benchmarks; now, proper PFS_* tests along with PACE would facilitate a better process. I've been distracted by a conference and am trying to catch up on what the remaining questions are to get the initial system in place.
-
|
I have the CIME baseline update started (ESMCI/cime#4491). Right now it still parses and stores throughput and memory usage from the coupler log. The code for generating the performance baselines lives in CIME's repo right now; I can also update the design so that the code lives with the coupler. This way each coupler/model can define its own methods of generating/comparing performance metrics.
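For context, parsing throughput and memory out of a coupler log might look something like the sketch below. The line formats assumed in the regexes are stand-ins, not the actual cpl.log patterns:

```python
# Hedged sketch: extract throughput and peak memory from a coupler log.
# The two line formats below are illustrative assumptions, not the real
# cpl.log layout.
import re

THROUGHPUT_RE = re.compile(r"# simulated years / cmp-day =\s*([\d.]+)")
MEMORY_RE = re.compile(r"max memory highwater \(MB\) =\s*([\d.]+)")

def parse_perf_metrics(log_text):
    metrics = {}
    m = THROUGHPUT_RE.search(log_text)
    if m:
        metrics["throughput"] = float(m.group(1))  # simulated years per day
    m = MEMORY_RE.search(log_text)
    if m:
        metrics["memory_mb"] = float(m.group(1))   # peak resident memory
    return metrics

sample = """\
(seq_mct_drv): === done ===
# simulated years / cmp-day = 19.82
max memory highwater (MB) = 3124.5
"""
print(parse_perf_metrics(sample))
```

Moving this kind of parser out of CIME and into each coupler would let every model define its own patterns and metrics, as described above.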
-
@jasonb5 , nice!
-
|
The component high-level timers that are used to compute throughput would include a certain amount of I/O as well if I/O isn't turned off. Regarding fine-grain timers: we parse and capture these within PACE. However, there isn't sufficient consistency in patterns across components to make capturing relevant "performance" from a baseline run straightforward. As it stands, we have to use PFS and baseline tests separately to start with, until we have a clear definition of every component's run-loop timer without I/O.
-
|
There must be something in model_timing_stats we could look at.
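As a rough illustration of what inspecting model_timing_stats could involve, here is a sketch that assumes a simplified whitespace-separated layout (timer name, call count, max/min wall time); the real GPTL stats files have more columns, and the timer names below are placeholders.

```python
# Hedged sketch: look up one named timer in a model_timing_stats-style
# file. Assumed columns: name, ncalls, wallmax, wallmin (the real GPTL
# format has additional fields such as process/thread indices).
import os
import tempfile

def read_timer(stats_path, timer_name):
    """Return call count and max/min wall time for one timer, or None."""
    with open(stats_path) as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] == timer_name:
                return {"ncalls": int(parts[1]),
                        "wallmax": float(parts[2]),
                        "wallmin": float(parts[3])}
    return None

# Demo against a made-up stats file (timer names are illustrative).
demo = "CPL:RUN_LOOP 120 845.2 812.9\nCPL:ATM_RUN 120 402.7 398.1\n"
with tempfile.NamedTemporaryFile("w", suffix=".stats", delete=False) as tf:
    tf.write(demo)
    path = tf.name
print(read_timer(path, "CPL:ATM_RUN"))
os.remove(path)
```

Comparing a run-loop timer like this (rather than total wall time) is one way to get a throughput metric that excludes I/O, per the discussion above.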
-
|
@jasonb5 Curious about the status of your CIME updates.
-
|
@sarats Just finishing up testing on the CIME side; I want to ensure we have good unit test coverage before it's merged and we do a CIME update. Should be merged by the end of this week.
-
|
Thanks @jasonb5 for getting ESMCI/cime#4491 done. Looks like we should be set to start performance tests now.
-
|
@sarats That PR doesn't impact creation or comparison of perf baselines.
-
|
@jasonb5 If there are no Also, please add the commit-hash to
-
|
@amametjanov and @jasonb5 Can you two get together to iron out any other missing bits and minimize the iterations with the CIME PR, merge, E3SM upstream pull, etc.? Perhaps we can test some of this locally/manually before issuing the final PRs.
-
|
@sarats @amametjanov I've fixed generating the performance baselines by
-
|
@amametjanov I've added the commit hash and date to the baseline files; you can use the branch from ESMCI/cime#4508 to test.
-
|
@jgfouca I guess we need to do a final CIME update including ESMCI/cime#4508 to close this, correct?
-
|
We need a couple more updates:
-
@jasonb5 Will you be able to look into the above two tasks?
-
|
@amametjanov , @sarats , (1) is easy. I'm confused about (2). Is
-
|
@jgfouca timing is getting archived into
-
|
@amametjanov , I did some debugging, and it looks to me like it is working as intended. I'm wondering if maybe you are getting tripped up by the timing_dir gatekeeping code that checks the project: ... so, only if PROJECT matches SAVE_TIMING_DIR_PROJECTS, which is a comma-separated list of regular expressions, will it copy the timing data to SAVE_TIMING_DIR.
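That gatekeeping check can be sketched in isolation. This mirrors the behavior described above (PROJECT matched against a comma-separated list of regexes), not CIME's actual code:

```python
# Standalone sketch of the SAVE_TIMING_DIR gatekeeping described above:
# PROJECT must match one of the comma-separated regular expressions in
# SAVE_TIMING_DIR_PROJECTS before timing data is copied to SAVE_TIMING_DIR.
import re

def project_allows_save_timing(project, save_timing_dir_projects):
    patterns = [p.strip() for p in save_timing_dir_projects.split(",") if p.strip()]
    return any(re.match(pat, project) for pat in patterns)

# Example values are made up; substitute your machine's actual settings.
allowed = r"e3sm.*,cli\d+"
assert project_allows_save_timing("e3sm", allowed)
assert project_allows_save_timing("cli115", allowed)
assert not project_allows_save_timing("myproj", allowed)
```

If PROJECT doesn't match any pattern, the copy to SAVE_TIMING_DIR is silently skipped, which would explain the symptom.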
-
|
@sarats @amametjanov @jgfouca Would it be better to bypass both history and namelists when blessing performance, keeping the two blessing paths completely separate?
-
|
I'm fine with a separate and distinct pathway for performance blesses, bypassing history and namelists.
-
|
@amametjanov CIME's master branch now has the fix to separate hist/nml and performance blessing.
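A minimal sketch of what such a separation could look like, using illustrative flag names (--bless-perf etc.) rather than CIME's actual bless_test_results CLI: a perf bless skips hist/nml comparison entirely, so a throughput regression always needs its own explicit bless.

```python
# Hypothetical sketch of separate bless pathways; flag names are
# illustrative, not CIME's real bless_test_results options.
import argparse

def build_parser():
    p = argparse.ArgumentParser(prog="bless_test_results")
    p.add_argument("--hist-only", action="store_true")
    p.add_argument("--namelists-only", action="store_true")
    p.add_argument("--bless-perf", action="store_true",
                   help="bless throughput/memory baselines only")
    return p

def bless_kinds(args):
    """Decide which baseline families this invocation may update."""
    if args.bless_perf:
        return {"perf"}                 # never touches hist or namelists
    kinds = set()
    if args.hist_only or not args.namelists_only:
        kinds.add("hist")
    if args.namelists_only or not args.hist_only:
        kinds.add("nml")
    return kinds

assert bless_kinds(build_parser().parse_args(["--bless-perf"])) == {"perf"}
assert bless_kinds(build_parser().parse_args([])) == {"hist", "nml"}
```

Keeping the perf pathway disjoint addresses the earlier concern that a baseline-change bless could silently sneak in a performance degradation.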
-
(1): pending |
-
Based on the discussion during the performance team meeting, notes here: https://acme-climate.atlassian.net/wiki/spaces/EP/pages/3922133143/2023-09-18+Performance+Infrastructure+Meeting+Notes
I am still not 100% sure what our approach should be, so I thought it might be helpful to list the features/behaviors that we want for performance tests:

Passing --check-throughput --check-memory to the test script (in the E3SM_test_scripts repo) works, but this is very easy to forget, and it looks like almost no jobs are currently using these flags. Again, in the interest of making life easy for users, I think both these checks should be on by default for tests that have SAVE_TIMING on.

Feel free to edit to add features.
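The "on by default" idea could be sketched like this; the helper and its arguments are hypothetical, not existing CIME options:

```python
# Hypothetical sketch: derive defaults for the throughput/memory checks
# from SAVE_TIMING, while letting explicit command-line choices win.
def resolve_perf_checks(save_timing, check_throughput=None, check_memory=None):
    """None means the flag was not specified on the command line."""
    default = bool(save_timing)
    return (default if check_throughput is None else check_throughput,
            default if check_memory is None else check_memory)

# SAVE_TIMING on, nothing specified -> both checks enabled
assert resolve_perf_checks(True) == (True, True)
# Explicit opt-out still wins
assert resolve_perf_checks(True, check_throughput=False) == (False, True)
# SAVE_TIMING off -> checks stay off unless requested
assert resolve_perf_checks(False) == (False, False)
```

Tying the defaults to SAVE_TIMING means users who already save timing data get the checks for free, without remembering extra flags.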
Related issues:
#5885