PSI/J provides a common interface for obtaining allocations on compute resources.
Usually, those compute resources will already have some batch scheduler in place (for example, SLURM).
A PSI/J executor is the code that tells the core of PSI/J how to interact with
such a batch scheduler so that it can provide a common interface to applications.
A PSI/J executor needs to implement the abstract methods defined on the :class:`psij.job_executor.JobExecutor` base class.
The documentation for that class contains reference material for each of these methods, which won't be repeated here.
For batch scheduler systems, the :class:`.BatchSchedulerExecutor` subclass provides much of the necessary functionality.
This tutorial will focus on using BatchSchedulerExecutor as a base, rather than implementing JobExecutor directly.
The batch scheduler executor is based around a model where interactions with a local resource manager happen via command line invocations.
For example, with PBS, the ``qsub`` and ``qstat`` commands are used to submit a request and to see status.

To use BatchSchedulerExecutor for a new local resource manager that uses this command line interface, subclass BatchSchedulerExecutor and add code that understands how to form the command lines necessary to submit a request for an allocation and to get allocation status. This tutorial will do that for PBS Pro.

First, set up a directory structure::

    mkdir project/
    cd project/
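    # (the rest of this listing was elided in this view; presumably the two
    # subdirectories used below are created here)
    mkdir psijpbs/
    mkdir psij-descriptors/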
We're going to create three source files in this directory structure:

* ``psijpbs/pbspro.py`` - This will contain the bulk of the code.

* ``psijpbs/pbspro.mustache`` - This will contain a template for a PBS Pro job submission file.

* ``psij-descriptors/pbspro_descriptor.py`` - This file tells the PSI/J core what this package implements.

First, we'll build a skeleton that won't work, and see that it doesn't work in the test suite. Then we'll build up to the full functionality.
Prerequisites:

* You have the psij-python package installed already and are able to run whatever basic verification you think is necessary.

* You are able to submit to PBS Pro on a local system.


A Not-Implemented Stub
----------------------
Add the project directory to the Python path::

    export PYTHONPATH=$(pwd):$PYTHONPATH

Create a simple BatchSchedulerExecutor subclass that does nothing new, in ``psijpbs/pbspro.py``::

    from psij.executors.batch.batch_scheduler_executor import BatchSchedulerExecutor

    class PBSProJobExecutor(BatchSchedulerExecutor):
        pass

and create a descriptor file to tell PSI/J about this, ``psij-descriptors/pbspro_descriptor.py``::

    from distutils.version import StrictVersion
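    # (remainder elided in this view; the lines below follow the descriptor
    # pattern used by other PSI/J executors -- the exact import path and
    # Descriptor arguments may differ in your PSI/J version)
    from psij.descriptor import Descriptor

    __PSI_J_EXECUTORS__ = [Descriptor(name='pbspro', version=StrictVersion('0.0.1'),
                                      cls='psijpbs.pbspro.PBSProJobExecutor')]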

Now, run the test suite. It should fail with an error reporting that the resource manager executor cannot be instantiated because of unimplemented abstract methods.

That error message tells us what we need to implement. There are three broad pieces of functionality:

* Submitting a job::

    generate_submit_script
    get_submit_command
    job_id_from_submit_output

* Requesting job status::

    get_status_command
    parse_status_output

* Cancelling a job::

    get_cancel_command
    process_cancel_command_output

Let's implement all of these with stubs that raise ``NotImplementedError``, which we will then flesh out::

    class PBSProJobExecutor(BatchSchedulerExecutor):

        # ... stubs for the other methods, following the same pattern, elided ...

        def parse_status_output(*args, **kwargs):
            raise NotImplementedError

Now running the same pytest command will give a different error, further along into attempting to submit a job::

    >       assert config
    E       AssertionError

This default BatchSchedulerExecutor code needs a configuration object, and none was supplied.

A configuration object can contain configuration specific to this particular executor. However,
for now we are not going to specify a custom configuration object and instead will re-use an existing one.

Running pytest again, we get as far as seeing that PSI/J is trying to do submit-related operations.
You can read the docstrings for each of these methods for more information, but briefly, the submission process is:

1. ``generate_submit_script`` should generate a submit script specific to the batch scheduler.

2. ``get_submit_command`` should return the command line necessary to submit that script to the batch scheduler.

3. The output of that command should be interpreted by ``job_id_from_submit_output`` to extract a batch-scheduler-specific job ID, which can be used later when cancelling a job or getting job status.

So let's implement those.

In line with other PSI/J executors, we're going to delegate script generation to a template-based helper. So add a line to initialize a :py:class:`.TemplatedScriptGenerator` in the executor initializer, pointing at an (as yet non-existent) template file, and replace ``generate_submit_script`` with a delegated call to ``TemplatedScriptGenerator``::

    from pathlib import Path
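
    # (the rest of this listing was elided in this view; in outline, and with
    # signatures that may differ from the real PSI/J API, the changes look
    # like this)
    class PBSProJobExecutor(BatchSchedulerExecutor):
        def __init__(self, url=None, config=None):
            super().__init__(url=url, config=config)
            # point the generator at a template file shipped next to this module
            self.generator = TemplatedScriptGenerator(
                config, Path(__file__).parent / 'pbspro.mustache')

        def generate_submit_script(self, job, context, submit_file):
            # delegate entirely to the template-based generator
            self.generator.generate_submit_script(job, context, submit_file)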

In the PBS Pro case, as shown in the example above, that is pretty straightforward::

        return out.strip()
273
271
274
272
275
-

That's enough to get jobs submitted using PSI/J, but not enough to run the test suite. Instead, the test suite will appear to hang, because the PSI/J core code gets a bit upset by status monitoring methods raising ``NotImplementedError``.


Implementing Status
-------------------
PSI/J needs to ask the batch scheduler for the status of jobs that it has submitted. This can be done with ``BatchSchedulerExecutor`` by overriding these two methods, which we stubbed out as not-implemented earlier on:

282
280
283
-
* :py:meth:`.BatchSchedulerExecutor.get_status_command` - Like ``get_submit_command``, this should return a batch-scheduler-specific command line, this time to output job status.

* :py:meth:`.BatchSchedulerExecutor.parse_status_output` - This will interpret the output of the above status command, a bit like ``job_id_from_submit_output``.

Here's an implementation for ``get_status_command``::
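
    # Sketch only: the original listing was elided here. In the real executor
    # this is a method on PBSProJobExecutor, and PSI/J's exact signature may
    # differ; it is shown standalone for clarity.
    from typing import Collection, List

    def get_status_command(native_ids: Collection[str]) -> List[str]:
        # -f: full output; -F json: JSON formatting; -x: include finished jobs
        return ['qstat', '-f', '-F', 'json', '-x'] + list(native_ids)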

This constructs a command line which looks something like this::

    qstat -f -F json -x 2154.edtb-01.mcp.alcf.anl.gov

The parameters change the default behavior of ``qstat`` to something more useful for parsing: ``-f`` asks for full output, ``-x`` includes information for completed jobs (which is normally suppressed), and ``-F json`` asks for the output to be formatted as JSON (rather than the default tabular text view).

This JSON output, which is passed to ``parse_status_output``, looks something like this (with a lot of detail removed)::

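    {
        "Jobs": {
            "2154.edtb-01.mcp.alcf.anl.gov": {
                "job_state": "F",
                "Exit_status": 0
            }
        }
    }

(The structure above is a sketch of ``qstat -F json`` output, reconstructed from memory since the original sample was elided here; real output contains many more fields per job.)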

We still haven't implemented the cancel methods, though. That will be revealed by the test suite.
The two methods to implement for cancellation follow the same pattern as for submission and status:

* :py:meth:`.BatchSchedulerExecutor.get_cancel_command` - This should form a command for cancelling a job.

* :py:meth:`.BatchSchedulerExecutor.process_cancel_command_output` - This should interpret the output from the cancel command.

It looks like you don't actually need to implement ``process_cancel_command_output`` beyond the stub we already have in order to keep the abstract class mechanism happy. Maybe that's something that should change in PSI/J?

Here's an implementation of ``get_cancel_command``::

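    # Sketch only: the original listing was elided here. ``qdel`` is PBS Pro's
    # cancel command. In the real executor this is a method receiving the
    # native job id recorded at submit time; PSI/J's exact signature may differ.
    from typing import List

    def get_cancel_command(native_id: str) -> List[str]:
        return ['qdel', native_id]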

That's enough to tell PBS Pro how to cancel a job, but it isn't enough for PSI/J to know that a job was actually cancelled: the JobState from ``parse_status_output`` will still be COMPLETED, when we actually want CANCELED. That's because the existing code marks a job as COMPLETED whenever it reaches PBS Pro state ``F``, no matter how the job finished.

So here's an updated ``parse_status_output`` which checks the ``Exit_status`` field in the qstat JSON to see whether the job exited with status code 265, meaning it was killed with signal 9, and if so marks the job as CANCELED instead of COMPLETED::

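    import json

    # Sketch only: the original listing was elided here. This is standalone
    # rather than a method, and returns plain state names instead of PSI/J
    # JobStatus objects. The state map is a fragment; a real executor must
    # cover every PBS Pro state.
    _STATE_MAP = {'Q': 'QUEUED', 'R': 'ACTIVE', 'F': 'COMPLETED'}

    def parse_status_output(out):
        states = {}
        for native_id, job in json.loads(out).get('Jobs', {}).items():
            state = _STATE_MAP[job['job_state']]
            # 265 means the job was killed with signal 9 on this install
            if state == 'COMPLETED' and job.get('Exit_status') == 265:
                state = 'CANCELED'
            states[native_id] = state
        return states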

This isn't necessarily the right thing to do: some PBS installs will use 128+9 = 137 as the exit code for a job killed by signal 9.


What's Missing?
---------------
The biggest thing that was omitted was in the mustache template. A :py:class:`psij.Job` object contains lots of options which could be transcribed into the template (otherwise they will be ignored). Have a look at the docstrings for ``Job`` and at other templates in the PSI/J source code for examples.
The ``_STATE_MAP`` given here is also not exhaustive: if PBS Pro ``qstat`` returns a job state that is not in the map, this will break. So make sure you deal with all the states of your batch scheduler, not just the few that seem obvious.

How to Distribute Your Executor
-------------------------------
If you want to share your executor with others, here are two ways:

1. You can make a Python package and distribute that as an add-on without needing to interact with the PSI/J project.

2. You can make a pull request against the PSI/J repo.