Commit 2195f68

Author: Vasileios Karakasis
Merge pull request #1325 from ekouts/feat/time_profiling
[feat] Add execution time profiling for the framework's pipeline
2 parents f8bde58 + b30e760 commit 2195f68

File tree

5 files changed

+196
-75
lines changed

docs/pipeline.rst

Lines changed: 16 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -132,3 +132,19 @@ ReFrame tries to keep concurrency high by maintaining as many test cases as poss
132132
When the `concurrency limit <config_reference.html#.systems[].partitions[].max_jobs>`__ is reached, ReFrame will first try to free up execution slots by checking if any of the spawned jobs have finished, and it will fill those slots before throttling execution.
133133

134134
ReFrame uses polling to check the status of the spawned jobs, but it adjusts the polling interval dynamically, so as to remain responsive without overloading the system job scheduler with excessive polling.
135+
136+
Timing the Test Pipeline
137+
------------------------
138+
139+
.. versionadded:: 3.0
140+
141+
ReFrame keeps track of the time a test spends in every pipeline stage and reports that after each test finishes.
142+
However, it does so from its own perspective and not from that of the scheduler backend used.
143+
This has some practical implications:
144+
As soon as a test enters the "run" phase, ReFrame's timer for that phase starts ticking, regardless of whether the associated job is still pending in the queue.
145+
Similarly, the "run" phase ends as soon as ReFrame realizes it.
146+
This will happen after the associated job has finished.
147+
For this reason, the time spent in the pipeline's "run" phase should *not* be interpreted as the actual runtime of the test, especially if a non-local scheduler backend is used.
148+
149+
Finally, the execution time of the "cleanup" phase is not reported when a test finishes, since it may be deferred if other tests depend on that one.
150+
See :doc:`dependencies` for more information on how ReFrame treats tests with dependencies.
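The framework-side timing semantics described above can be illustrated with a minimal standalone sketch (the names below are illustrative, not ReFrame's API): a phase's timer starts when the framework enters it and stops when the framework detects completion, so queue wait is included in the "run" time.

```python
import time

# Timestamps recorded from the framework's perspective, keyed by
# '<phase>_start' and '<phase>_finish'
timestamps = {}

def enter(phase):
    timestamps[f'{phase}_start'] = time.time()

def leave(phase):
    timestamps[f'{phase}_finish'] = time.time()

def duration(phase):
    # A phase that was never entered (e.g. a deferred cleanup) has no
    # duration to report
    start = timestamps.get(f'{phase}_start')
    finish = timestamps.get(f'{phase}_finish')
    if start is None or finish is None:
        return None

    return finish - start

enter('run')       # job submitted; it may still sit pending in the queue
time.sleep(0.01)   # stands in for queue wait plus the actual runtime
leave('run')       # framework notices that the job has finished

assert duration('run') > 0
assert duration('cleanup') is None
```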

docs/tutorial_basic.rst

Lines changed: 45 additions & 44 deletions
@@ -177,50 +177,51 @@ If everything is configured correctly for your system, you should get an output
177177

178178
.. code-block:: none
179179
180-
[ReFrame Setup]
181-
version: 3.0-dev6 (rev: 89d50861)
182-
command: './bin/reframe -C tutorial/config/settings.py -c tutorial/example1.py -r'
183-
launched by: user@daint101
184-
working directory: '/path/to/reframe'
185-
check search path: (R) '/path/to/reframe/tutorial/example1.py'
186-
stage directory: '/path/to/reframe/stage'
187-
output directory: '/path/to/reframe/output'
188-
189-
[==========] Running 1 check(s)
190-
[==========] Started on Sat May 9 22:10:16 2020
191-
192-
[----------] started processing Example1Test (Simple matrix-vector multiplication example)
193-
[ RUN ] Example1Test on daint:login using PrgEnv-cray
194-
[ RUN ] Example1Test on daint:login using PrgEnv-gnu
195-
[ RUN ] Example1Test on daint:login using PrgEnv-intel
196-
[ RUN ] Example1Test on daint:login using PrgEnv-pgi
197-
[ RUN ] Example1Test on daint:gpu using PrgEnv-cray
198-
[ RUN ] Example1Test on daint:gpu using PrgEnv-gnu
199-
[ RUN ] Example1Test on daint:gpu using PrgEnv-intel
200-
[ RUN ] Example1Test on daint:gpu using PrgEnv-pgi
201-
[ RUN ] Example1Test on daint:mc using PrgEnv-cray
202-
[ RUN ] Example1Test on daint:mc using PrgEnv-gnu
203-
[ RUN ] Example1Test on daint:mc using PrgEnv-intel
204-
[ RUN ] Example1Test on daint:mc using PrgEnv-pgi
205-
[----------] finished processing Example1Test (Simple matrix-vector multiplication example)
206-
207-
[----------] waiting for spawned checks to finish
208-
[ OK ] ( 1/12) Example1Test on daint:mc using PrgEnv-cray
209-
[ OK ] ( 2/12) Example1Test on daint:gpu using PrgEnv-intel
210-
[ OK ] ( 3/12) Example1Test on daint:gpu using PrgEnv-cray
211-
[ OK ] ( 4/12) Example1Test on daint:login using PrgEnv-intel
212-
[ OK ] ( 5/12) Example1Test on daint:login using PrgEnv-cray
213-
[ OK ] ( 6/12) Example1Test on daint:mc using PrgEnv-gnu
214-
[ OK ] ( 7/12) Example1Test on daint:gpu using PrgEnv-gnu
215-
[ OK ] ( 8/12) Example1Test on daint:login using PrgEnv-gnu
216-
[ OK ] ( 9/12) Example1Test on daint:login using PrgEnv-pgi
217-
[ OK ] (10/12) Example1Test on daint:gpu using PrgEnv-pgi
218-
[ OK ] (11/12) Example1Test on daint:mc using PrgEnv-intel
219-
[ OK ] (12/12) Example1Test on daint:mc using PrgEnv-pgi
220-
[----------] all spawned checks have finished
221-
222-
[ PASSED ] Ran 12 test case(s) from 1 check(s) (0 failure(s))
223-
[==========] Finished on Sat May 9 22:10:46 2020
180+
[ReFrame Setup]
181+
version: 3.0-dev7 (rev: 85ca676f)
182+
command: './bin/reframe -C tutorial/config/settings.py -c tutorial/example1.py -r'
183+
launched by: user@daint104
184+
working directory: '/path/to/reframe'
185+
settings file: 'tutorial/config/settings.py'
186+
check search path: (R) '/path/to/reframe/tutorial/example1.py'
187+
stage directory: '/path/to/reframe/stage'
188+
output directory: '/path/to/reframe/output'
189+
190+
[==========] Running 1 check(s)
191+
[==========] Started on Wed Jun 3 08:50:46 2020
192+
193+
[----------] started processing Example1Test (Simple matrix-vector multiplication example)
194+
[ RUN ] Example1Test on daint:login using PrgEnv-cray
195+
[ RUN ] Example1Test on daint:login using PrgEnv-gnu
196+
[ RUN ] Example1Test on daint:login using PrgEnv-intel
197+
[ RUN ] Example1Test on daint:login using PrgEnv-pgi
198+
[ RUN ] Example1Test on daint:gpu using PrgEnv-cray
199+
[ RUN ] Example1Test on daint:gpu using PrgEnv-gnu
200+
[ RUN ] Example1Test on daint:gpu using PrgEnv-intel
201+
[ RUN ] Example1Test on daint:gpu using PrgEnv-pgi
202+
[ RUN ] Example1Test on daint:mc using PrgEnv-cray
203+
[ RUN ] Example1Test on daint:mc using PrgEnv-gnu
204+
[ RUN ] Example1Test on daint:mc using PrgEnv-intel
205+
[ RUN ] Example1Test on daint:mc using PrgEnv-pgi
206+
[----------] finished processing Example1Test (Simple matrix-vector multiplication example)
207+
208+
[----------] waiting for spawned checks to finish
209+
[ OK ] ( 1/12) Example1Test on daint:login using PrgEnv-intel [compile: 1.940s run: 20.747s total: 23.778s]
210+
[ OK ] ( 2/12) Example1Test on daint:login using PrgEnv-cray [compile: 0.347s run: 25.096s total: 26.591s]
211+
[ OK ] ( 3/12) Example1Test on daint:mc using PrgEnv-intel [compile: 2.019s run: 6.286s total: 8.357s]
212+
[ OK ] ( 4/12) Example1Test on daint:mc using PrgEnv-cray [compile: 0.506s run: 11.056s total: 11.744s]
213+
[ OK ] ( 5/12) Example1Test on daint:gpu using PrgEnv-cray [compile: 0.435s run: 19.499s total: 20.483s]
214+
[ OK ] ( 6/12) Example1Test on daint:login using PrgEnv-gnu [compile: 1.648s run: 25.631s total: 27.964s]
215+
[ OK ] ( 7/12) Example1Test on daint:mc using PrgEnv-gnu [compile: 1.825s run: 10.434s total: 12.301s]
216+
[ OK ] ( 8/12) Example1Test on daint:gpu using PrgEnv-intel [compile: 2.018s run: 16.316s total: 18.529s]
217+
[ OK ] ( 9/12) Example1Test on daint:login using PrgEnv-pgi [compile: 1.643s run: 22.118s total: 24.090s]
218+
[ OK ] (10/12) Example1Test on daint:gpu using PrgEnv-gnu [compile: 1.729s run: 20.028s total: 21.901s]
219+
[ OK ] (11/12) Example1Test on daint:gpu using PrgEnv-pgi [compile: 1.753s run: 15.128s total: 16.923s]
220+
[ OK ] (12/12) Example1Test on daint:mc using PrgEnv-pgi [compile: 1.732s run: 35.556s total: 37.330s]
221+
[----------] all spawned checks have finished
222+
223+
[ PASSED ] Ran 12 test case(s) from 1 check(s) (0 failure(s))
224+
[==========] Finished on Wed Jun 3 08:51:46 2020
224225
225226
226227
Notice how our regression test is run on every partition of the configured system and for every programming environment.

docs/tutorial_deps.rst

Lines changed: 23 additions & 23 deletions
@@ -77,7 +77,7 @@ Here is the output when running the OSU tests with the asynchronous execution po
7777
.. code-block:: none
7878
7979
[==========] Running 7 check(s)
80-
[==========] Started on Wed Mar 25 13:51:06 2020
80+
[==========] Started on Wed Jun 3 09:00:40 2020
8181
8282
[----------] started processing OSUBuildTest (OSU benchmarks build test)
8383
[ RUN ] OSUBuildTest on daint:gpu using PrgEnv-gnu
@@ -140,31 +140,31 @@ Here is the output when running the OSU tests with the asynchronous execution po
140140
[----------] finished processing OSUAllreduceTest_16 (OSU Allreduce test)
141141
142142
[----------] waiting for spawned checks to finish
143-
[ OK ] ( 1/21) OSUBuildTest on daint:gpu using PrgEnv-pgi
144-
[ OK ] ( 2/21) OSUBuildTest on daint:gpu using PrgEnv-gnu
145-
[ OK ] ( 3/21) OSUBuildTest on daint:gpu using PrgEnv-intel
146-
[ OK ] ( 4/21) OSUAllreduceTest_2 on daint:gpu using PrgEnv-pgi
147-
[ OK ] ( 5/21) OSUAllreduceTest_4 on daint:gpu using PrgEnv-pgi
148-
[ OK ] ( 6/21) OSUAllreduceTest_8 on daint:gpu using PrgEnv-pgi
149-
[ OK ] ( 7/21) OSUAllreduceTest_16 on daint:gpu using PrgEnv-pgi
150-
[ OK ] ( 8/21) OSUAllreduceTest_4 on daint:gpu using PrgEnv-gnu
151-
[ OK ] ( 9/21) OSUAllreduceTest_16 on daint:gpu using PrgEnv-gnu
152-
[ OK ] (10/21) OSUAllreduceTest_8 on daint:gpu using PrgEnv-gnu
153-
[ OK ] (11/21) OSUAllreduceTest_16 on daint:gpu using PrgEnv-intel
154-
[ OK ] (12/21) OSULatencyTest on daint:gpu using PrgEnv-pgi
155-
[ OK ] (13/21) OSUAllreduceTest_2 on daint:gpu using PrgEnv-gnu
156-
[ OK ] (14/21) OSULatencyTest on daint:gpu using PrgEnv-gnu
157-
[ OK ] (15/21) OSUBandwidthTest on daint:gpu using PrgEnv-pgi
158-
[ OK ] (16/21) OSUBandwidthTest on daint:gpu using PrgEnv-gnu
159-
[ OK ] (17/21) OSUAllreduceTest_8 on daint:gpu using PrgEnv-intel
160-
[ OK ] (18/21) OSUAllreduceTest_4 on daint:gpu using PrgEnv-intel
161-
[ OK ] (19/21) OSULatencyTest on daint:gpu using PrgEnv-intel
162-
[ OK ] (20/21) OSUAllreduceTest_2 on daint:gpu using PrgEnv-intel
163-
[ OK ] (21/21) OSUBandwidthTest on daint:gpu using PrgEnv-intel
143+
[ OK ] ( 1/21) OSUBuildTest on daint:gpu using PrgEnv-pgi [compile: 29.581s run: 0.086s total: 29.708s]
144+
[ OK ] ( 2/21) OSUBuildTest on daint:gpu using PrgEnv-gnu [compile: 26.250s run: 69.120s total: 95.437s]
145+
[ OK ] ( 3/21) OSUBuildTest on daint:gpu using PrgEnv-intel [compile: 39.385s run: 89.213s total: 129.871s]
146+
[ OK ] ( 4/21) OSULatencyTest on daint:gpu using PrgEnv-pgi [compile: 0.012s run: 145.355s total: 154.504s]
147+
[ OK ] ( 5/21) OSUAllreduceTest_2 on daint:gpu using PrgEnv-pgi [compile: 0.014s run: 148.276s total: 154.433s]
148+
[ OK ] ( 6/21) OSUAllreduceTest_4 on daint:gpu using PrgEnv-pgi [compile: 0.011s run: 149.763s total: 154.407s]
149+
[ OK ] ( 7/21) OSUAllreduceTest_8 on daint:gpu using PrgEnv-pgi [compile: 0.013s run: 151.262s total: 154.378s]
150+
[ OK ] ( 8/21) OSUAllreduceTest_16 on daint:gpu using PrgEnv-pgi [compile: 0.010s run: 152.716s total: 154.360s]
151+
[ OK ] ( 9/21) OSULatencyTest on daint:gpu using PrgEnv-gnu [compile: 0.014s run: 210.952s total: 220.847s]
152+
[ OK ] (10/21) OSUBandwidthTest on daint:gpu using PrgEnv-pgi [compile: 0.015s run: 213.285s total: 220.758s]
153+
[ OK ] (11/21) OSUAllreduceTest_4 on daint:gpu using PrgEnv-gnu [compile: 0.011s run: 215.596s total: 220.717s]
154+
[ OK ] (12/21) OSUAllreduceTest_16 on daint:gpu using PrgEnv-gnu [compile: 0.011s run: 218.742s total: 220.651s]
155+
[ OK ] (13/21) OSUAllreduceTest_2 on daint:gpu using PrgEnv-intel [compile: 0.013s run: 203.214s total: 206.115s]
156+
[ OK ] (14/21) OSUAllreduceTest_8 on daint:gpu using PrgEnv-intel [compile: 0.016s run: 204.819s total: 206.078s]
157+
[ OK ] (15/21) OSUBandwidthTest on daint:gpu using PrgEnv-gnu [compile: 0.012s run: 258.772s total: 266.873s]
158+
[ OK ] (16/21) OSUAllreduceTest_8 on daint:gpu using PrgEnv-gnu [compile: 0.014s run: 263.576s total: 266.752s]
159+
[ OK ] (17/21) OSULatencyTest on daint:gpu using PrgEnv-intel [compile: 0.011s run: 227.234s total: 231.789s]
160+
[ OK ] (18/21) OSUAllreduceTest_4 on daint:gpu using PrgEnv-intel [compile: 0.013s run: 229.729s total: 231.724s]
161+
[ OK ] (19/21) OSUAllreduceTest_2 on daint:gpu using PrgEnv-gnu [compile: 0.013s run: 286.203s total: 292.444s]
162+
[ OK ] (20/21) OSUAllreduceTest_16 on daint:gpu using PrgEnv-intel [compile: 0.028s run: 242.030s total: 242.091s]
163+
[ OK ] (21/21) OSUBandwidthTest on daint:gpu using PrgEnv-intel [compile: 0.013s run: 243.719s total: 247.384s]
164164
[----------] all spawned checks have finished
165165
166166
[ PASSED ] Ran 21 test case(s) from 7 check(s) (0 failure(s))
167-
[==========] Finished on Wed Mar 25 14:37:53 2020
167+
[==========] Finished on Wed Jun 3 09:07:24 2020
168168
169169
Before running the tests, ReFrame topologically sorts them based on their dependencies and schedules them using the selected execution policy.
170170
With the serial execution policy, ReFrame simply executes the tests to completion as they "arrive", since the tests are already topologically sorted.
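The topological sorting mentioned above can be sketched with a small Kahn-style sort over a hypothetical dependency map (the test names are taken from the output above for illustration; this is not ReFrame's internal implementation):

```python
from collections import deque

def toposort(deps):
    """Order tests so that each test runs after its dependencies.

    `deps` maps a test name to the set of tests it depends on.
    """
    # Count unmet dependencies and build the reverse adjacency
    indegree = {t: len(d) for t, d in deps.items()}
    dependents = {t: [] for t in deps}
    for t, d in deps.items():
        for u in d:
            dependents[u].append(t)

    ready = deque(t for t, n in indegree.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for v in dependents[t]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    if len(order) != len(deps):
        raise ValueError('cyclic dependencies')

    return order

deps = {
    'OSUBuildTest': set(),
    'OSULatencyTest': {'OSUBuildTest'},
    'OSUBandwidthTest': {'OSUBuildTest'},
}
order = toposort(deps)
assert order[0] == 'OSUBuildTest'
```

This is why the build test always finishes before any of the benchmark tests that depend on it are reported.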

reframe/frontend/executors/__init__.py

Lines changed: 78 additions & 2 deletions
@@ -7,6 +7,7 @@
77
import copy
88
import signal
99
import sys
10+
import time
1011
import weakref
1112

1213
import reframe.core.debug as debug
@@ -130,6 +131,63 @@ def __init__(self, case, listeners=[]):
130131
# Test case has finished, but has not been waited for yet
131132
self.zombie = False
132133

134+
# Timestamps for the start and finish phases of the pipeline
135+
self._timestamps = {}
136+
137+
def duration(self, phase):
138+
# Treat pseudo-phases first
139+
if phase == 'compile_complete':
140+
t_start = 'compile_start'
141+
t_finish = 'compile_wait_finish'
142+
elif phase == 'run_complete':
143+
t_start = 'run_start'
144+
t_finish = 'wait_finish'
145+
elif phase == 'total':
146+
t_start = 'setup_start'
147+
t_finish = 'pipeline_end'
148+
else:
149+
t_start = f'{phase}_start'
150+
t_finish = f'{phase}_finish'
151+
152+
start = self._timestamps.get(t_start)
153+
if not start:
154+
return None
155+
156+
finish = self._timestamps.get(t_finish)
157+
if not finish:
158+
finish = self._timestamps.get('pipeline_end')
159+
160+
return finish - start
161+
162+
def pipeline_timings(self, phases):
163+
def _tf(t):
164+
return f'{t:.3f}s' if t else 'n/a'
165+
166+
msg = ''
167+
for phase in phases:
168+
if phase == 'compile_complete':
169+
msg += f"compile: {_tf(self.duration('compile_complete'))} "
170+
elif phase == 'run_complete':
171+
msg += f"run: {_tf(self.duration('run_complete'))} "
172+
else:
173+
msg += f"{phase}: {_tf(self.duration(phase))} "
174+
175+
if msg:
176+
msg = msg[:-1]
177+
178+
return msg
179+
180+
def pipeline_timings_all(self):
181+
return self.pipeline_timings([
182+
'setup', 'compile_complete', 'run_complete',
183+
'sanity', 'performance', 'total'
184+
])
185+
186+
def pipeline_timings_basic(self):
187+
return self.pipeline_timings([
188+
'compile_complete', 'run_complete', 'total'
189+
])
190+
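The timing-string formatting used by `pipeline_timings` can be exercised in isolation with a simplified stand-in (a sketch for illustration, not the `RegressionTask` class itself): durations render as `<phase>: <seconds>s` and a phase that never ran renders as `n/a`.

```python
def _tf(t):
    # Format a duration in seconds, or 'n/a' when the phase never ran
    return f'{t:.3f}s' if t else 'n/a'

def format_timings(durations, phases):
    """Render '<phase>: <time>' pairs like ReFrame's per-test summary."""
    return ' '.join(f'{p}: {_tf(durations.get(p))}' for p in phases)

durations = {'compile': 1.940, 'run': 20.747, 'total': 23.778}
line = format_timings(durations, ['compile', 'run', 'total'])
assert line == 'compile: 1.940s run: 20.747s total: 23.778s'
assert format_timings({}, ['cleanup']) == 'cleanup: n/a'
```

Joining with `' '.join` also avoids the trailing-space trim (`msg[:-1]`) that the committed code performs after its accumulation loop.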
133191
@property
134192
def testcase(self):
135193
return self._case
@@ -160,13 +218,31 @@ def _notify_listeners(self, callback_name):
160218
callback(self)
161219

162220
def _safe_call(self, fn, *args, **kwargs):
221+
class update_timestamps:
222+
'''Context manager to set the start and finish timestamps.'''
223+
224+
# We use `this` to refer to the update_timestamps object, because
225+
# we don't want to shadow the self argument of our containing
226+
# function
227+
def __enter__(this):
228+
if fn.__name__ != 'poll':
229+
stage = self._current_stage
230+
self._timestamps[f'{stage}_start'] = time.time()
231+
232+
def __exit__(this, exc_type, exc_value, traceback):
233+
stage = self._current_stage
234+
self._timestamps[f'{stage}_finish'] = time.time()
235+
self._timestamps['pipeline_end'] = time.time()
236+
163237
if fn.__name__ != 'poll':
164238
self._current_stage = fn.__name__
165239

166240
try:
167241
with logging.logging_context(self.check) as logger:
168-
logger.debug('entering stage: %s' % self._current_stage)
169-
return fn(*args, **kwargs)
242+
logger.debug(f'entering stage: {self._current_stage}')
243+
with update_timestamps():
244+
return fn(*args, **kwargs)
245+
170246
except ABORT_REASONS:
171247
self.fail()
172248
raise
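The `update_timestamps` pattern above — a context manager that records start and finish timestamps around a stage, even when the stage raises — can be sketched as a self-contained class (hypothetical names, not the committed code):

```python
import time

class timed:
    '''Record start/finish timestamps for a named stage into a dict.'''

    def __init__(self, stage, timestamps):
        self.stage = stage
        self.timestamps = timestamps

    def __enter__(self):
        self.timestamps[f'{self.stage}_start'] = time.time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # The finish timestamp is recorded even if the stage raised
        self.timestamps[f'{self.stage}_finish'] = time.time()
        return False   # never suppress exceptions

ts = {}
with timed('compile', ts):
    time.sleep(0.01)

assert ts['compile_finish'] >= ts['compile_start']
```

Wrapping the stage body in a `with` block guarantees the finish timestamp is written on both the success and the failure paths, which is what makes the reported durations reliable for failed tests too.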

reframe/frontend/executors/policies.py

Lines changed: 34 additions & 6 deletions
@@ -94,13 +94,36 @@ def on_task_exit(self, task):
9494
pass
9595

9696
def on_task_failure(self, task):
97+
timings = task.pipeline_timings(['compile_complete',
98+
'run_complete',
99+
'total'])
100+
msg = f'{task.check.info()} [{timings}]'
97101
if task.failed_stage == 'cleanup':
98-
self.printer.status('ERROR', task.check.info(), just='right')
102+
self.printer.status('ERROR', msg, just='right')
99103
else:
100-
self.printer.status('FAIL', task.check.info(), just='right')
104+
self.printer.status('FAIL', msg, just='right')
105+
106+
timings = task.pipeline_timings(['setup',
107+
'compile_complete',
108+
'run_complete',
109+
'sanity',
110+
'performance',
111+
'total'])
112+
getlogger().verbose(f"==> {timings}")
101113

102114
def on_task_success(self, task):
103-
self.printer.status('OK', task.check.info(), just='right')
115+
timings = task.pipeline_timings(['compile_complete',
116+
'run_complete',
117+
'total'])
118+
msg = f'{task.check.info()} [{timings}]'
119+
self.printer.status('OK', msg, just='right')
120+
timings = task.pipeline_timings(['setup',
121+
'compile_complete',
122+
'run_complete',
123+
'sanity',
124+
'performance',
125+
'total'])
126+
getlogger().verbose(f"==> {timings}")
104127
# update reference count of dependencies
105128
for c in task.testcase.deps:
106129
self._task_index[c].ref_count -= 1
@@ -204,14 +227,19 @@ def on_task_run(self, task):
204227
self._running_tasks.append(task)
205228

206229
def on_task_failure(self, task):
230+
msg = f'{task.check.info()} [{task.pipeline_timings_basic()}]'
207231
if task.failed_stage == 'cleanup':
208-
self.printer.status('ERROR', task.check.info(), just='right')
232+
self.printer.status('ERROR', msg, just='right')
209233
else:
210234
self._remove_from_running(task)
211-
self.printer.status('FAIL', task.check.info(), just='right')
235+
self.printer.status('FAIL', msg, just='right')
236+
237+
getlogger().verbose(f"==> {task.pipeline_timings_all()}")
212238

213239
def on_task_success(self, task):
214-
self.printer.status('OK', task.check.info(), just='right')
240+
msg = f'{task.check.info()} [{task.pipeline_timings_basic()}]'
241+
self.printer.status('OK', msg, just='right')
242+
getlogger().verbose(f"==> {task.pipeline_timings_all()}")
215243
# update reference count of dependencies
216244
for c in task.testcase.deps:
217245
self._task_index[c].ref_count -= 1
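The status lines composed in `on_task_success` and `on_task_failure` can be mimicked with plain strings (illustrative stand-ins for `task.check.info()` and the timings string; the real printer handles justification differently):

```python
def status_line(status, info, timings):
    # Mirror the '[ OK ] <info> [<timings>]' lines in the run report
    return f'[ {status:^8} ] {info} [{timings}]'

line = status_line('OK', 'Example1Test on daint:mc using PrgEnv-pgi',
                   'compile: 1.732s run: 35.556s total: 37.330s')
assert 'Example1Test' in line
assert line.endswith('total: 37.330s]')
```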
