@@ -31,58 +31,28 @@ Install from source
3131 pip install .
3232
3333
34- Who PSI/J is for
35- ----------------
36-
37- PSI/J is a library intended to make your job-launching code portable, or at
38- least easier to port, across
39- HPC centers. If you want your project to be able to request resources
40- from one or more of Slurm, LSF, Flux, Cobalt, PBS, and your local machine,
41- we think you will find that PSI/J simplifies your work considerably.
42-
43-
44- Who PSI/J is (probably) not for
45- -------------------------------
46-
47- If you are sure that you will only *ever* be launching jobs on ORNL's Summit
48- system, and you don't care about any other cluster or machine, then you may as well
49- interact with LSF (the resource manager on Summit) directly, rather than
50- indirectly through PSI/J. In that case PSI/J would not really be adding much
51- other than complexity.
52-
53- If you write application code that is meant to run on various HPC clusters, but
54- which never makes calls to the underlying resource manager (e.g. by calling into
55- Flux's client library, or executing ``srun``/``jsrun``/``aprun``, etc.), then
56- PSI/J will not help you. This is likely your situation if you are a developer working
57- on an MPI-based science simulation, since we have observed that it is often the users'
58- responsibility to actually launch the simulation through the resource manager.
59- However, PSI/J is more likely to help with various tools
60- associated with your simulation, such as your test suite.
61-
62- Terminology and "job" vs ``Job``
63- --------------------------------
64-
65- In PSI/J's terminology, a resource manager job is an executable
66- plus a bunch of attributes. Generally, when jobs are submitted, they will
67- need to sit in the queue for a variable amount of time, depending on how
68- busy the cluster is. Then the job will be started, the executable will
69- run to completion, and the job will be marked as completed.
70-
71- PSI/J's :class:`Job <psij.job.Job>` objects are representations of underlying
72- resource manager jobs. One :class:`Job <psij.job.Job>` instance might represent a Slurm
73- job running on an LLNL cluster, another a Cobalt job running on ALCF's Theta, another a
74- Flux job in the cloud, and so on.
75-
76- However, a newly-created :class:`Job <psij.job.Job>` object does not represent
77- any resource manager job; it is a kind of free agent.
78- To convert it to a resource manager job, the
79- :class:`Job <psij.job.Job>` needs to be submitted to a
80- :class:`JobExecutor <psij.job_executor.JobExecutor>` instance. That action
81- creates a new resource manager job and permanently binds the
82- :class:`Job <psij.job.Job>` to it. Alternatively, a :class:`Job <psij.job.Job>`
83- can be bound to an *existing* resource manager job by
84- calling :meth:`JobExecutor.attach <psij.job_executor.JobExecutor.attach>`, passing in a
85- :class:`Job <psij.job.Job>` and the ID of the underlying resource manager job.
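
As a minimal sketch of attaching (the Slurm executor and the native ID
``12345`` are made-up placeholders; use the ID reported by your resource
manager):

.. code-block:: python

    from psij import Job, JobExecutor

    ex = JobExecutor.get_instance("slurm")
    job = Job()
    # Bind this Job to an already-existing Slurm job via its native ID.
    ex.attach(job, "12345")
    job.wait()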
34+
35+ Overview
36+ --------
37+
38+ In PSI/J's terminology, a :class:`Job <psij.job.Job>` represents an executable
39+ plus a bunch of attributes. Static job attributes, such as resource requirements,
40+ are defined by the :class:`JobSpec <psij.job_spec.JobSpec>` at
41+ creation. Dynamic job attributes, such as the :class:`JobState
42+ <psij.job_state.JobState>`, are modified by :class:`JobExecutors
43+ <psij.job_executor.JobExecutor>` as the :class:`Job <psij.job.Job>`
44+ progresses through its lifecycle.
45+
46+ A :class:`JobExecutor <psij.job_executor.JobExecutor>` represents a specific
47+ resource manager, e.g. Slurm, on which the job is executed. Generally,
48+ when jobs are submitted, they will be queued for a variable period of time,
49+ depending on how busy the target machine is. Once the job is started, its
50+ executable is launched and runs to completion.
51+
52+ In PSI/J, a job is submitted by calling :meth:`JobExecutor.submit(Job)
53+ <psij.job_executor.JobExecutor.submit>`, which permanently binds the Job to that
54+ executor and submits it to the underlying resource manager.
55+
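A minimal sketch of how these pieces fit together, using the local executor
(the state names shown in the comments are illustrative):

.. code-block:: python

    from psij import Job, JobExecutor, JobSpec

    # Static attributes are captured in the JobSpec when the Job is created.
    job = Job(JobSpec(executable="/bin/date"))
    print(job.status.state)    # dynamic state; NEW before submission

    ex = JobExecutor.get_instance("local")
    ex.submit(job)             # binds the Job to this executor
    job.wait()
    print(job.status.state)    # a terminal state, e.g. COMPLETED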
8656
8757Basic Usage
8858-----------
@@ -119,172 +89,6 @@ the ``/bin/date`` command. Once that command has finished executing
11989(which should be almost as soon as the job starts, since ``date`` does very little work)
12090the resource manager will mark the job as complete, triggering PSI/J to do the same.
12191
122- Adding Complexity
123- -----------------
124-
125- Multiple Jobs
126- ^^^^^^^^^^^^^
127-
128- In the last section we submitted a single job, and didn't check
129- whether it succeeded or failed.
130-
131- Submitting multiple jobs is as simple as adding a loop:
132-
133- .. rst-class:: executor-type-selector selector-mode-tabs
134-
135-    Local // Slurm // LSF // PBS // Cobalt
136-
137- .. code-block:: python
138-
139-    from psij import Job, JobExecutor, JobSpec
140-
141-    ex = JobExecutor.get_instance("<&executor-type>")
142-    for _ in range(10):
143-        job = Job(JobSpec(executable="/bin/date"))
144-        ex.submit(job)
145-
146- Every :class:`JobExecutor <psij.job_executor.JobExecutor>` can handle arbitrary
147- numbers of jobs. Most of the functionality provided by
148- :class:`JobExecutor <psij.job_executor.JobExecutor>` is
149- contained in the :meth:`JobExecutor.submit <psij.job_executor.JobExecutor.submit>` and
150- :meth:`JobExecutor.attach <psij.job_executor.JobExecutor.attach>` methods.
151-
152- Checking Job Completion
153- ^^^^^^^^^^^^^^^^^^^^^^^
154-
155- In all the above examples, we have submitted jobs without
156- checking on what happened to them.
157-
158- To wait for a job to complete once it has been submitted, it suffices
159- to call the :meth:`wait <psij.job.Job.wait>` method with no arguments:
160-
161- .. code-block:: python
162-
163-    from psij import Job, JobExecutor, JobSpec
164-    ex = JobExecutor.get_instance("local")  # any executor type shown above works
165-    job = Job(JobSpec(executable="/bin/date"))
166-    ex.submit(job)
167-    job.wait()
168-
169- The :meth:`wait <psij.job.Job.wait>` call will return once the job has reached
170- a terminal state, which almost always means that it finished or was
171- canceled.
172-
173- To distinguish jobs that complete successfully from ones that fail or
174- are canceled, fetch the status of the job after calling
175- :meth:`wait <psij.job.Job.wait>`:
176-
177- .. code-block:: python
178-
179-    job.wait()
180-    print(str(job.status))
181-
182-
183- Canceling your job
184- ^^^^^^^^^^^^^^^^^^
185- If supported by the underlying job scheduler, PSI/J jobs can be canceled by
186- invoking the :meth:`cancel <psij.job.Job.cancel>` method.
187-
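A minimal sketch, assuming the local executor and a ``/bin/sleep`` command that
runs long enough to be canceled:

.. code-block:: python

    from psij import Job, JobExecutor, JobSpec

    ex = JobExecutor.get_instance("local")
    job = Job(JobSpec(executable="/bin/sleep", arguments=["60"]))
    ex.submit(job)

    job.cancel()   # ask the underlying scheduler to cancel the job
    job.wait()     # returns once the job reaches a terminal (canceled) state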
188-
189- Status Callbacks
190- ^^^^^^^^^^^^^^^^
191-
192- Waiting for jobs to complete with :meth:`wait <psij.job.Job.wait>` is fine if you don't
193- mind blocking while you wait for a single job to complete. However,
194- if you want to wait on multiple jobs without blocking, or you want
195- to get updates when jobs start running, you can attach a callback
196- to a :class:`JobExecutor <psij.job_executor.JobExecutor>`, which will
197- fire whenever any job submitted to that executor changes status.
198-
199- To wait on multiple jobs at once:
200-
201- .. rst-class:: executor-type-selector selector-mode-tabs
202-
203-    Local // Slurm // LSF // PBS // Cobalt
204-
205- .. code-block:: python
206-
207-    import time
208-    from psij import Job, JobExecutor, JobSpec
209-
210-    count = 10
211-
212-    def callback(job, status):
213-        global count
214-
215-        if status.final:
216-            print(f"Job {job} completed with status {status}")
217-            count -= 1
218-
219-    ex = JobExecutor.get_instance("<&executor-type>")
220-    ex.set_job_status_callback(callback)
221-    for _ in range(count):
222-        job = Job(JobSpec(executable="/bin/date"))
223-        ex.submit(job)
224-
225-    while count > 0:
226-        time.sleep(0.01)
227-
228- Job Information
229- ---------------
230-
231- So far we have been assuming that your job is very simple: you just want to
232- run ``/bin/date``, with no mention of node, MPI rank, or GPU counts,
233- of different partitions/queues, or of the other resource manager
234- concepts you may be familiar with.
235-
236- However, much of what you may wish to specify is supported (and we hope all of it is).
237-
238- Resources
239- ^^^^^^^^^
240- To specify your job's resources, like GPUs and nodes, create a
241- :class:`ResourceSpecV1 <psij.resource_spec.ResourceSpecV1>` and set it
242- with ``JobSpec(..., resources=my_spec_v1)``.
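
For instance, a sketch of a spec that asks for two nodes with four processes on
each (``processes_per_node`` is our best guess at the field name; see the
:class:`ResourceSpecV1 <psij.resource_spec.ResourceSpecV1>` documentation for
the exact fields):

.. code-block:: python

    from psij import Job, JobSpec, ResourceSpecV1

    job = Job(
        JobSpec(
            executable="/bin/date",
            resources=ResourceSpecV1(node_count=2, processes_per_node=4),
        )
    )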
243-
244- Launching Methods
245- ^^^^^^^^^^^^^^^^^
246- To specify how the processes in your job should be started once resources have been
247- allocated for it, pass the name of a launcher (e.g. ``"mpirun"``, ``"srun"``, etc.)
248- like so: ``JobSpec(..., launcher='srun')``.
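
For example, a sketch that starts the job's processes with ``srun`` on a Slurm
system:

.. code-block:: python

    from psij import Job, JobSpec

    # Launch the executable via srun once the allocation has been granted.
    job = Job(JobSpec(executable="/bin/date", launcher="srun"))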
249-
250- Scheduling Information
251- ^^^^^^^^^^^^^^^^^^^^^^
252- To specify resource-manager-specific information, like queues/partitions,
253- runtime, and so on, create a
254- :class:`JobAttributes <psij.job_attributes.JobAttributes>` and set it with
255- ``JobSpec(..., attributes=my_job_attributes)``.
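
For example, a sketch requesting a particular queue and a ten-minute runtime
limit (the ``duration`` field and the ``debug`` queue name are assumptions;
substitute values appropriate for your system):

.. code-block:: python

    from datetime import timedelta
    from psij import Job, JobAttributes, JobSpec

    job = Job(
        JobSpec(
            executable="/bin/date",
            attributes=JobAttributes(queue_name="debug", duration=timedelta(minutes=10)),
        )
    )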
256-
257- Example of Adding Job Information
258- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
259-
260- Below we add resource and scheduling information to a job before submitting it.
261-
262-
263- .. rst-class:: executor-type-selector selector-mode-tabs
264-
265-    Local // Slurm // LSF // PBS // Cobalt
266-
267- .. code-block:: python
268-
269-    from psij import Job, JobExecutor, JobSpec, JobAttributes, ResourceSpecV1
270-
271-    executor = JobExecutor.get_instance("<&executor-type>")
272-
273-    job = Job(
274-        JobSpec(
275-            executable="/bin/date",
276-            resources=ResourceSpecV1(node_count=1),
277-            attributes=JobAttributes(
278-                queue_name="<QUEUE_NAME>", project_name="<ALLOCATION>"
279-            ),
280-        )
281-    )
282-
283-    executor.submit(job)
284-
285- The ``<QUEUE_NAME>`` and ``<ALLOCATION>`` values will depend on the
286- system you are running on.
287-
28892
28993Examples
29094--------