User Guide
==========

Who PSI/J Is for
----------------

PSI/J is a Python library for submitting and managing HPC jobs via arbitrary
resource managers (RMs). If you need to run jobs on more than one of Slurm,
LSF, Flux, Cobalt, PBS, and your local machine, we think you will find that
PSI/J simplifies your work considerably.


Who PSI/J Is (Probably) Not for
-------------------------------

If you are sure that you will only *ever* be launching jobs on ORNL's Summit
system, and you don't care about any other cluster or machine, then you may as
well interact with LSF (the resource manager on Summit) directly, rather than
indirectly through PSI/J. In that case PSI/J would not really be adding much
other than complexity.

If you write application code that is meant to run on various HPC clusters, but
which never makes calls to the underlying resource manager (e.g. by calling into
Flux's client library, or executing ``srun``/``jsrun``/``aprun``, etc.), then
PSI/J will not help you. This is likely your situation if you are a developer
working on an MPI-based science simulation, since we have observed that it is


What is a JobExecutor?
----------------------

A :class:`JobExecutor <psij.job_executor.JobExecutor>` represents a specific RM,
e.g. Slurm, on which the job is being executed. Generally, when jobs are
submitted, they will be queued for a variable period of time, depending on how
busy the target machine is. Once the job is started, its executable is
launched and runs to completion, at which point the job is marked as completed.

PSI/J currently provides executors for the following backends:

- `pbspro`: `Altair's PBS-Professional <https://www.altair.com/pbs-professional>`_
- `cobalt`: `ALCF's Cobalt job scheduler <https://www.alcf.anl.gov/support/user-guides/theta/queueing-and-running-jobs/job-and-queue-scheduling/index.html>`_

We encourage the contribution of executors for additional backends; please
reference the `developer documentation
<development/tutorial_add_executor.html>`_ for details.


*Slurm // Local // LSF // PBS // Cobalt*

.. code-block:: python

    ex.submit(job)

And by way of comparison, other backends can be selected with the tabs above.
Note that the only difference is the argument to the ``get_instance`` method.
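
Put together, a minimal end-to-end submission for the Slurm case might look
like the following sketch; it assumes the ``Job``, ``JobSpec``, and
``JobExecutor`` classes exported by the ``psij`` package and uses ``/bin/date``
as a placeholder executable:

.. code-block:: python

    from psij import Job, JobExecutor, JobSpec

    # Pick the backend; 'slurm' could be replaced by 'local', 'lsf',
    # 'pbspro', or 'cobalt' without changing anything else.
    ex = JobExecutor.get_instance('slurm')

    # Describe the job: run /bin/date with no arguments.
    job = Job(JobSpec(executable='/bin/date'))

    # Hand the job over to the resource manager's queue.
    ex.submit(job)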

The ``JobExecutor`` implementation will translate all PSI/J API activities into the
respective backend commands and run them on the backend, while at the same time
monitoring the backend jobs for failure, completion, or other state updates.

Assuming there are no errors, you should see a new entry in your resource
manager's queue after running the example above.


Multiple Jobs
-------------

Every :class:`JobExecutor <psij.job_executor.JobExecutor>` can handle arbitrary
numbers of jobs (tested with up to 64k jobs).
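
For example, a sketch that submits a small batch of jobs and then waits for all
of them, reusing the classes from the example above (the job count and
executable are arbitrary):

.. code-block:: python

    from psij import Job, JobExecutor, JobSpec

    ex = JobExecutor.get_instance('local')  # or 'slurm', 'lsf', etc.

    # Create and submit ten independent jobs.
    jobs = [Job(JobSpec(executable='/bin/date')) for _ in range(10)]
    for job in jobs:
        ex.submit(job)

    # Wait for each job to reach a terminal state.
    for job in jobs:
        job.wait()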


Configuring Your Job
--------------------

In the example above, the ``executable='/bin/date'`` part tells PSI/J that we want
the job to run the ``/bin/date`` command. But there are other parts to the job
which can be configured:

- Arguments for the job executable
- Environment the job is running in
- Destination for standard output and error streams
- Resource requirements for the job's execution
- Accounting details to be used

That information is encoded in the ``JobSpec``, which is used to create the
``Job`` instance.
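
For instance, a ``JobSpec`` that sets several of these at once might look like
the following sketch; the parameter names match the attributes discussed in the
rest of this section, while the values themselves are placeholders:

.. code-block:: python

    from psij import Job, JobSpec

    spec = JobSpec(
        executable='/bin/echo',
        arguments=['hello', 'world'],
        environment={'GREETING': 'hello'},
        stdout_path='/tmp/echo.out',
        stderr_path='/tmp/echo.err',
    )
    job = Job(spec)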

The job's standard output and error streams can be redirected to files by
setting the ``stdout_path`` and ``stderr_path`` attributes:

.. code-block:: python

    spec.stdout_path = '/tmp/date.out'
    spec.stderr_path = '/tmp/date.err'

A job's standard input stream can also be redirected to read from a file by
setting the ``spec.stdin_path`` attribute.
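
For example, reusing the ``spec`` object from above (the path is a placeholder):

.. code-block:: python

    spec.stdin_path = '/tmp/date.in'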


Job Resources
^^^^^^^^^^^^^

A job submitted to a cluster is allocated a specific set of resources to run on.
The number and type of resources are defined by a resource specification,
``ResourceSpec``, which becomes part of the job specification. The resource
specification supports the following attributes:

- ``node_count``: Allocate that number of compute nodes to the job. All
  cpu-cores and gpu-cores on the allocated nodes can be exclusively used by the
  submitted job.
- ``processes_per_node``: On the allocated nodes, execute the given number of
  processes.
- ``process_count``: The total number of processes (MPI ranks) to be started.
- ``cpu_cores_per_process``: The number of cpu cores allocated to each launched
  process. PSI/J uses the system definition of a cpu core, which may refer to
  a physical cpu core or to a virtual cpu core (also known as a hardware thread).
- ``gpu_cores_per_process``: The number of gpu cores allocated to each launched
  process. The system definition of a gpu core is used, but usually refers
  to a full physical GPU.
- ``exclusive_node_use``: When this boolean flag is set to ``True``, PSI/J
  will ensure that no other jobs, whether from the same user or from other users
  of the same system, run on any of the compute nodes on which processes
  for this job are launched.
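
A resource specification becomes part of the job through the ``JobSpec``; a
minimal sketch, assuming the ``ResourceSpecV1`` class and the ``resources``
parameter of ``JobSpec`` provided by the ``psij`` package:

.. code-block:: python

    from psij import Job, JobSpec, ResourceSpecV1

    # Ask for 2 nodes with 4 processes on each (8 processes in total).
    spec = JobSpec(
        executable='/bin/date',
        resources=ResourceSpecV1(node_count=2, processes_per_node=4),
    )
    job = Job(spec)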

An inconsistent resource specification is rejected with a
``psij.InvalidJobException``, for example when the requested node count
contradicts the value of ``process_count / processes_per_node``.
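
Continuing the sketch above with made-up numbers, a contradictory combination
might look like this:

.. code-block:: python

    spec.resources = ResourceSpecV1(
        node_count=4,           # asks for four nodes...
        process_count=4,
        processes_per_node=4,   # ...but four processes at four per node fit on one node
    )
    # this combination should be rejected with a 'psij.InvalidJobException'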


Processes Versus Ranks
""""""""""""""""""""""

All processes of a job will share a single MPI communicator
(`MPI_COMM_WORLD`), independent of their placement, and the term `rank` (which
usually refers to an MPI rank) is thus equivalent to `process` in this context.
However, jobs started with a single process instance may, depending on the
executor implementation, not get an MPI communicator. How jobs are launched can
be specified by the `launcher` attribute of the ``JobSpec``, as documented below.

A specific launcher can be requested like so: ``JobSpec(..., launcher='srun')``.


Scheduling Information
^^^^^^^^^^^^^^^^^^^^^^

To specify resource-manager-specific information, like queues/partitions,
runtime, and so on, create a :class:`JobAttributes
<psij.job_attributes.JobAttributes>` and set it with ``JobSpec(...,
attributes=my_job_attributes)``.
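
A minimal sketch, assuming ``JobAttributes`` accepts a queue name and a
``timedelta``-valued duration (the queue name and runtime below are
placeholders):

.. code-block:: python

    from datetime import timedelta

    from psij import Job, JobAttributes, JobSpec

    my_job_attributes = JobAttributes(queue_name='debug',
                                      duration=timedelta(minutes=10))
    job = Job(JobSpec(executable='/bin/date', attributes=my_job_attributes))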

The simplest way to wait for a job to finish is to call the
:meth:`wait <psij.job.Job.wait>` method with no arguments.
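
A minimal sketch, assuming ``job`` is the ``Job`` instance submitted earlier:

.. code-block:: python

    # Block until the job reaches a terminal state.
    job.wait()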

The :meth:`wait <psij.job.Job.wait>` call will return once the job has reached
a terminal state, which almost always means that it finished or was
canceled.

To distinguish jobs that complete successfully from ones that fail or
are canceled, fetch the status of the job after calling
:meth:`wait <psij.job.Job.wait>`:

.. code-block:: python

    job.wait()
    print(str(job.status))


Canceling Your Job
^^^^^^^^^^^^^^^^^^

If supported by the underlying job scheduler, PSI/J jobs can be canceled by
calling the job's :meth:`cancel <psij.job.Job.cancel>` method.
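
A minimal sketch, assuming the executor ``ex`` from the earlier examples and a
job that runs long enough to be canceled:

.. code-block:: python

    from psij import Job, JobSpec

    job = Job(JobSpec(executable='/bin/sleep', arguments=['60']))
    ex.submit(job)

    # Ask the scheduler to cancel the job, then wait for the cancellation to
    # take effect before inspecting the final status.
    job.cancel()
    job.wait()
    print(str(job.status))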


Status Callbacks
^^^^^^^^^^^^^^^^

Waiting for jobs to complete with :meth:`wait <psij.job.Job.wait>` is fine if
you don't mind blocking while you wait for a single job to complete. However, if
you want to wait on multiple jobs without blocking, or you want to get updates
when jobs start running, you can attach a callback to a :class:`JobExecutor
<psij.job_executor.JobExecutor>` which will fire whenever any job submitted to
that executor changes status.
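
A minimal sketch, assuming ``set_job_status_callback`` accepts a plain function
that receives the job and its new status:

.. code-block:: python

    from psij import Job, JobExecutor, JobSpec, JobStatus

    def status_changed(job: Job, status: JobStatus) -> None:
        # Called by the executor whenever one of its jobs changes state.
        print(f'{job.id}: {status.state}')

    ex = JobExecutor.get_instance('local')
    ex.set_job_status_callback(status_changed)
    ex.submit(Job(JobSpec(executable='/bin/date')))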