
Commit 12f8293

Merge pull request #288 from ExaWorks/feature/user_guide

Feature/user guide

2 parents b4a0918 + 4d975e6 commit 12f8293

File tree

6 files changed: +443, -218 lines

docs/getting_started.rst

Lines changed: 22 additions & 218 deletions
@@ -31,58 +31,28 @@ Install from source
     pip install .
 
 
-Who PSI/J is for
-----------------
-
-PSI/J is a library intended to make your job-launching code portable, or at
-least easier to port, across
-HPC centers. If you want your project to be able to request resources
-from one or more of Slurm, LSF, Flux, Cobalt, PBS, and your local machine,
-we think you will find that PSI/J simplifies your work considerably.
-
-
-Who PSI/J is (probably) not for
--------------------------------
-
-If you are sure that you will only *ever* be launching jobs on ORNL's Summit
-system, and you don't care about any other cluster or machine, then you may as well
-interact with LSF (the resource manager on Summit) directly, rather than
-indirectly through PSI/J. In that case PSI/J would not really be adding much
-other than complexity.
-
-If you write application code that is meant to run on various HPC clusters, but
-which never makes calls to the underlying resource manager (e.g. by calling into
-Flux's client library, or executing ``srun``/``jsrun``/``aprun``, etc.), then
-PSI/J will not help you. This is likely your situation if you are a developer working
-on an MPI-based science simulation, since we have observed that it is often the users'
-responsibility to actually launch the simulation through the resource manager.
-However, PSI/J is more likely to help with various tools
-associated with your simulation--for instance, your test suite.
-
-Terminology and "job" vs ``Job``
---------------------------------
-
-In PSI/J's terminology, a resource manager job is an executable
-plus a bunch of attributes. Generally, when jobs are submitted, they will
-need to sit in the queue for a variable amount of time, depending on how
-busy the cluster is. Then the job will be started, the executable will
-run to completion, and the job will be marked as completed.
-
-PSI/J's :class:`Job <psij.job.Job>` objects are representations of underlying
-resource manager jobs. One :class:`Job <psij.job.Job>` instance might represent a Slurm
-job running on an LLNL cluster, another a Cobalt job running on ALCF's Theta, another a
-Flux job in the cloud, and so on.
-
-However, a newly-created :class:`Job <psij.job.Job>` object does not represent
-any resource manager job; it is a kind of free agent.
-To convert it to a resource manager job, the
-:class:`Job <psij.job.Job>` needs to be submitted to a
-:class:`JobExecutor <psij.job_executor.JobExecutor>` instance. That action
-creates a new resource manager job and permanently binds the
-:class:`Job <psij.job.Job>` to it. Alternatively, a :class:`Job <psij.job.Job>`
-can be bound to an *existing* resource manager job by
-calling :meth:`JobExecutor.attach <psij.job_executor.JobExecutor.attach>`, passing in a
-:class:`Job <psij.job.Job>` and the ID of the underlying resource manager job.
+
+Overview
+--------
+
+In PSI/J's terminology, a :class:`Job <psij.job.Job>` represents an executable
+plus a bunch of attributes. Static job attributes, such as resource requirements,
+are defined by the :class:`JobSpec <psij.job_spec.JobSpec>` at
+creation. Dynamic job attributes, such as the :class:`JobState
+<psij.job_state.JobState>`, are modified by :class:`JobExecutors
+<psij.job_executor.JobExecutor>` as the :class:`Job <psij.job.Job>`
+progresses through its lifecycle.
+
+A :class:`JobExecutor <psij.job_executor.JobExecutor>` represents a specific
+resource manager, e.g. Slurm, on which the job is executed. Generally,
+when jobs are submitted, they will be queued for a variable period of time,
+depending on how busy the target machine is. Once the job is started, its
+executable is launched and runs to completion.
+
+In PSI/J, a job is submitted by calling :meth:`JobExecutor.submit(Job)
+<psij.job_executor.JobExecutor.submit>`, which permanently binds the job to that
+executor and submits it to the underlying resource manager.
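
For illustration, a minimal sketch of this lifecycle, assuming the ``local``
executor that ships with PSI/J (any other installed executor name would work
the same way):

.. code-block:: python

    from psij import Job, JobExecutor, JobSpec

    # Static attributes are fixed in the JobSpec at creation time.
    job = Job(JobSpec(executable="/bin/date"))

    # The executor binds the job to a resource manager and, from here on,
    # updates its dynamic state (the JobState) as the job progresses.
    ex = JobExecutor.get_instance("local")
    ex.submit(job)
    job.wait()
    print(job.status)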
 
 Basic Usage
 -----------

@@ -119,172 +89,6 @@ the ``/bin/date`` command. Once that command has finished executing
 (which should be almost as soon as the job starts, since ``date`` does very little work)
 the resource manager will mark the job as complete, triggering PSI/J to do the same.
 
-Adding Complexity
------------------
-
-Multiple Jobs
-^^^^^^^^^^^^^
-
-In the last section we submitted a single job, and didn't check
-whether it succeeded or failed.
-
-Submitting multiple jobs is as simple as adding a loop:
-
-.. rst-class:: executor-type-selector selector-mode-tabs
-
-    Local // Slurm // LSF // PBS // Cobalt
-
-.. code-block:: python
-
-    from psij import Job, JobExecutor, JobSpec
-
-    ex = JobExecutor.get_instance("<&executor-type>")
-    for _ in range(10):
-        job = Job(JobSpec(executable="/bin/date"))
-        ex.submit(job)
-
-Every :class:`JobExecutor <psij.job_executor.JobExecutor>` can handle arbitrary
-numbers of jobs. Most of the functionality provided by
-:class:`JobExecutor <psij.job_executor.JobExecutor>` is
-contained in the :meth:`JobExecutor.submit <psij.job_executor.JobExecutor.submit>` and
-:meth:`JobExecutor.attach <psij.job_executor.JobExecutor.attach>` methods.
-
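
As a hedged sketch, binding a :class:`Job <psij.job.Job>` to an *existing*
resource manager job with :meth:`attach <psij.job_executor.JobExecutor.attach>`
might look like this (``"12345"`` stands in for a real Slurm job ID):

.. code-block:: python

    from psij import Job, JobExecutor

    ex = JobExecutor.get_instance("slurm")
    job = Job()                  # no JobSpec needed when attaching
    ex.attach(job, "12345")      # bind to the existing scheduler job
    job.wait()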
-Checking Job Completion
-^^^^^^^^^^^^^^^^^^^^^^^
-
-In all the above examples, we have submitted jobs without
-checking on what happened to them.
-
-To wait for a job to complete once it has been submitted, it suffices
-to call the :meth:`wait <psij.job.Job.wait>` method with no arguments:
-
-.. code-block:: python
-
-    from psij import Job, JobExecutor, JobSpec
-
-    ex = JobExecutor.get_instance("local")
-    job = Job(JobSpec(executable="/bin/date"))
-    ex.submit(job)
-    job.wait()
-
-The :meth:`wait <psij.job.Job.wait>` call will return once the job has reached
-a terminal state, which almost always means that it finished or was
-cancelled.
-
-To distinguish jobs that complete successfully from ones that fail or
-are cancelled, fetch the status of the job after calling
-:meth:`wait <psij.job.Job.wait>`:
-
-.. code-block:: python
-
-    job.wait()
-    print(job.status)
-
-
-Canceling your job
-^^^^^^^^^^^^^^^^^^
-If supported by the underlying job scheduler, PSI/J jobs can be canceled by
-invoking the :meth:`cancel <psij.job.Job.cancel>` method.
-
-
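
As a minimal sketch of cancellation, assuming the ``local`` executor
(``/bin/sleep`` is used only to give the job time to be canceled):

.. code-block:: python

    from psij import Job, JobExecutor, JobSpec

    ex = JobExecutor.get_instance("local")
    job = Job(JobSpec(executable="/bin/sleep", arguments=["60"]))
    ex.submit(job)
    job.cancel()    # request cancellation from the underlying scheduler
    job.wait()      # returns once the job reaches a terminal state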
-Status Callbacks
-^^^^^^^^^^^^^^^^
-
-Waiting for jobs to complete with :meth:`wait <psij.job.Job.wait>` is fine if you don't
-mind blocking while you wait for a single job to complete. However,
-if you want to wait on multiple jobs without blocking, or you want
-to get updates when jobs start running, you can attach a callback
-to a :class:`JobExecutor <psij.job_executor.JobExecutor>`, which will
-fire whenever any job submitted to that executor changes status.
-
-To wait on multiple jobs at once:
-
-.. rst-class:: executor-type-selector selector-mode-tabs
-
-    Local // Slurm // LSF // PBS // Cobalt
-
-.. code-block:: python

-    import time
-    from psij import Job, JobExecutor, JobSpec
-
-    count = 10
-
-    def callback(job, status):
-        global count
-
-        if status.final:
-            print(f"Job {job} completed with status {status}")
-            count -= 1
-
-    ex = JobExecutor.get_instance("<&executor-type>")
-    ex.set_job_status_callback(callback)
-    for _ in range(count):
-        job = Job(JobSpec(executable="/bin/date"))
-        ex.submit(job)
-
-    while count > 0:
-        time.sleep(0.01)
-
-Job Information
----------------
-
-So far we have been assuming that your job is very simple--you just want to
-run ``/bin/date`` and there is no mention of node, MPI rank, or GPU counts,
-or of different partitions/queues, and all the other resource manager
-concepts you may be familiar with.
-
-However, much of what you may wish to specify is supported (and we hope
-eventually all of it will be).
-
-Resources
-^^^^^^^^^
-To specify your job's resources, like GPUs and nodes, create a
-:class:`ResourceSpecV1 <psij.resource_spec.ResourceSpecV1>` and set it
-with ``JobSpec(..., resources=my_spec_v1)``.
-
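
For instance, a resource specification might look like the following sketch
(the parameter names shown follow :class:`ResourceSpecV1
<psij.resource_spec.ResourceSpecV1>`; check the API reference for the full
list):

.. code-block:: python

    from psij import Job, JobSpec, ResourceSpecV1

    # Request two nodes with four processes on each one.
    job = Job(
        JobSpec(
            executable="/bin/date",
            resources=ResourceSpecV1(node_count=2, processes_per_node=4),
        )
    )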
-Launching Methods
-^^^^^^^^^^^^^^^^^
-To specify how the processes in your job should be started once resources have been
-allocated for it, pass the name of a launcher (e.g. ``"mpirun"``, ``"srun"``, etc.)
-like so: ``JobSpec(..., launcher="srun")``.
-
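
As a sketch, launching a hypothetical MPI application with ``mpirun``
(the path ``/path/to/mpi_app`` is a placeholder, not a real binary):

.. code-block:: python

    from psij import Job, JobSpec, ResourceSpecV1

    # The launcher controls how the job's processes are started once the
    # allocation is granted; "mpirun" must be available on the system.
    job = Job(
        JobSpec(
            executable="/path/to/mpi_app",
            resources=ResourceSpecV1(process_count=4),
            launcher="mpirun",
        )
    )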
-Scheduling Information
-^^^^^^^^^^^^^^^^^^^^^^
-To specify resource-manager-specific information, like queues/partitions,
-runtime, and so on, create a
-:class:`JobAttributes <psij.job_attributes.JobAttributes>` and set it with
-``JobSpec(..., attributes=my_job_attributes)``.
-
-Example of Adding Job Information
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Below we add resource and scheduling information to a job before submitting it.
-
-
-.. rst-class:: executor-type-selector selector-mode-tabs
-
-    Local // Slurm // LSF // PBS // Cobalt
-
-.. code-block:: python
-
-    from psij import Job, JobExecutor, JobSpec, JobAttributes, ResourceSpecV1
-
-    executor = JobExecutor.get_instance("<&executor-type>")
-
-    job = Job(
-        JobSpec(
-            executable="/bin/date",
-            resources=ResourceSpecV1(node_count=1),
-            attributes=JobAttributes(
-                queue_name="<QUEUE_NAME>", project_name="<ALLOCATION>"
-            ),
-        )
-    )
-
-    executor.submit(job)
-
-The ``<QUEUE_NAME>`` and ``<ALLOCATION>`` values will depend on the
-system you are running on.
-
 
 Examples
 --------

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -35,5 +35,6 @@ with HPC centers.
    :maxdepth: 3
 
    getting_started
+   user_guide
    api
    development/index.rst

docs/psij_arch.png

136 KB

docs/states.png

92.9 KB

docs/states_alt.png

61.9 KB
