Skip to content

Commit 4fa491e

Browse files
committed
rfc33: add new RFC for job execution module protocol
Problem: The distributed protocol between Flux job execution modules is not designed or documented. Add RFC 33 to cover a high-level design of a distributed job execution protocol, used by the job execution system to launch, monitor, and control the job shells of a Flux job.
1 parent b451bfe commit 4fa491e

File tree

3 files changed

+338
-0
lines changed

3 files changed

+338
-0
lines changed

README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,7 @@ Table of Contents
4343
- [30/Job Urgency](spec_30.rst)
4444
- [31/Job Constraints Specification](spec_31.rst)
4545
- [32/Flux Job Execution Protocol Version 1](spec_32.rst)
46+
- [33/Flux Job Execution Module Protocol Version 1](spec_33.rst)
4647

4748
Build Instructions
4849
------------------

index.rst

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -228,6 +228,12 @@ job constraints.
228228
This specification describes Version 1 of the Flux Execution Protocol
229229
implemented by the job manager and job execution system.
230230

231+
:doc:`33/Flux Job Execution Module Protocol Version 1 <spec_33>`
232+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
233+
234+
This specification describes Version 1 of the Flux Execution Module Protocol,
235+
a distributed protocol used by job execution broker modules.
236+
231237
.. Each file must appear in a toctree
232238
.. toctree::
233239
:hidden:
@@ -263,3 +269,4 @@ implemented by the job manager and job execution system.
263269
spec_30
264270
spec_31
265271
spec_32
272+
spec_33

spec_33.rst

Lines changed: 330 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,330 @@
1+
.. github display
2+
GitHub is NOT the preferred viewer for this file. Please visit
3+
https://flux-framework.rtfd.io/projects/flux-rfc/en/latest/spec_32.html
4+
5+
33/Flux Job Execution Module Protocol Version 1
6+
===============================================
7+
8+
This specification describes the distributed protocol that the job
9+
execution service uses to launch, monitor, and control job shells
10+
in a Flux job.
11+
12+
- Name: github.com/flux-framework/rfc/spec_32.rst
13+
14+
- Editor: Mark A. Grondona <[email protected]>
15+
16+
- State: raw
17+
18+
19+
Language
20+
--------
21+
22+
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
23+
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to
24+
be interpreted as described in `RFC 2119 <https://tools.ietf.org/html/rfc2119>`__.
25+
26+
27+
Related Standards
28+
-----------------
29+
30+
- :doc:`21/Job States and Events <spec_21>`
31+
32+
- :doc:`22/Idset String Representation <spec_22>`
33+
34+
- :doc:`32/Flux Job Execution Protocol Version 1 <spec_32>`
35+
36+
Background
37+
----------
38+
39+
RFC 32 describes the protocol between the execution service and job manager
40+
used to initiate and control jobs during the execution phase. Upon receipt
41+
of a ``start`` request, the execution service is responsible for the launch,
42+
monitoring, and control of job shells on all execution targets involved
43+
in the job. Therefore, the execution service is necessarily distributed
44+
among all ranks of a Flux instance.
45+
46+
The Flux Job Execution Module Protocol Version 1 describes how a set of
47+
execution service broker modules interact in distributed fashion to meet
48+
the requirements of executing job shells on behalf of the job manager.
49+
50+
Design Criteria
51+
---------------
52+
53+
The job execution module protocol must adhere to the following criteria:
54+
55+
- Avoid global distributed operations which would require all ranks to
56+
be online before the service is ready to execute work.
57+
58+
- Avoid presenting obstacles to the scaling of job size, the number of jobs
59+
running concurrently, or job throughput.
60+
61+
- Support execution module reload.
62+
63+
- Support recovery of running jobs after instance restart or execution
64+
module reload.
65+
66+
- Support execution of a job prolog and/or epilog.
67+
68+
- Support for collecting stdout and stderr from IMP and/or job shells
69+
70+
- Support for a barrier implementation used by the job shells, so that
71+
the execution service may determine if shells exit early due to error.
72+
73+
- Support partial release of allocated resources.
74+
75+
- Support for job termination on job exceptions, job time limit, and other
76+
error conditions.
77+
78+
- Support delivery of signals to jobs.
79+
80+
Implementation
81+
--------------
82+
83+
Job execution modules SHALL be loaded on all ranks in an instance, and are
84+
organized in a hierarchy with rank 0 at the root. Each module SHALL track
85+
the set of all running jobs for itself and all of its children. This state
86+
SHALL include at a minimum the jobid, userid, job state, and the idset of
87+
execution targets on which the job has an allocation.
88+
89+
All job execution modules register a ``job-exec.hello`` service endpoint.
90+
Downstream execution modules send a ``hello`` request to their upstream
91+
peer to initiate the execution module protocol. An execution module SHALL
92+
wait to send a ``hello`` response to its downstream peers until n an initial
93+
``hello`` response from upstream has been received. In the case of rank 0,
94+
the job execution module SHALL wait to send ``hello`` responses until the
95+
initial RFC 32 ``hello`` response is received from the job manager.
96+
97+
Responses to the ``job-exec.hello`` request are used to distribute job state
98+
and other events downstream through the job execution module hierarchy.
99+
These responses have a JSON object payload including the REQUIRED keys
100+
``type``, ``idset``, and ``data``.
101+
102+
Supported types of ``job-exec.hello`` responses SHALL include at a minimum
103+
the following:
104+
105+
state-update
106+
A ``state-update`` response is used to update the distributed state of
107+
jobs. The ``data`` object SHALL have a single key, ``jobs``, which SHALL
108+
be an array of (id, userid, type, idset) tuples. The ``type`` entry of the
109+
tuple SHALL indicate how the state is to be resolved on ranks in ``idset``.
110+
Possible values for ``type`` MAY be *add*, *remove*, or *check*, to *add*
111+
a new job, *remove* an inactive job, or *check* that an existing job is
112+
active as expected.
113+
114+
When a job execution module receives a ``state-update`` response from
115+
upstream, it SHALL take the following actions, depending on the value of
116+
the ``type`` key:
117+
118+
add
119+
If the jobid already exists in the local module's state, then do nothing.
120+
121+
Otherwise, if the provided ``idset`` intersects any child idset, then
122+
the module SHALL send a ``state`` response to matching children of type
123+
``add``. Then, the local module SHALL determine if the provided ``idset``
124+
contains its rank, and if so, the module SHALL execute the job locally
125+
using the currently selected execution implementation.
126+
127+
remove
128+
If the provided ``idset`` intersects any child idset, then the job exec
129+
module SHALL send a ``state`` response to matching children with type
130+
``remove``. Then, the the referenced ``jobid`` SHALL be purged from the
131+
local module's state.
132+
133+
check
134+
If the provided ``idset`` intersects any child idset, then the job exec
135+
module SHALL send a ``state`` response to matching children with type
136+
``check``.
137+
138+
If the provided ``idset`` contains the local module's rank, then the
139+
module SHALL check if the referenced ``jobid`` exists locally. If not,
140+
then a job exception SHALL be raised.
141+
142+
The first response to ``job-exec.hello`` SHALL be of type ``state-update``.
143+
The included ``jobs`` tuples SHALL all be of ``type=check`` and MUST
144+
include the entire set of jobs which are expected to be currently running
145+
on the execution targets of the current module and its children. If a job
146+
execution module discovers a locally running job which is not in the initial
147+
``state-update`` list, then the module SHALL terminate the job processes
148+
and log an error.
149+
150+
When the rank 0 job execution module receives an RFC 32 ``start`` request
151+
from the job manager, it SHALL determine the idset associated with the
152+
job from *R*, and then locally issue a state update of type ``add``,
153+
following the specification for ``add`` listed above.
154+
155+
While job execution is in progress, execution modules SHALL update their
156+
upstream peer with the following status changes:
157+
158+
start
159+
when the local job shell has started
160+
barrier
161+
the local job shell has entered a barrier
162+
finish
163+
the local job shell has exited
164+
exception
165+
a job exception has occurred
166+
release
167+
all local work is completed, the resources on this rank may be released
168+
(e.g. after job epilog is complete)
169+
170+
Upon receiving one of the requests above, a job execution module MAY
171+
attempt a reduction and SHALL forward the request upstream. On rank 0, the
172+
job exec module SHALL collect and translate job execution module requests
173+
to job-manager ``start`` response payloads including:
174+
175+
start
176+
after job exec ``start`` has been received from all ranks
177+
finish
178+
after all job exec ``finish`` requests have been received from all ranks
179+
exception
180+
forwarded immediately to job-manager
181+
release
182+
release requests may be aggregated and forwarded in chunks to the job
183+
manager to allow for partial release.
184+
185+
Each job exec module SHALL subscribe to ``job-exception`` events and MUST
186+
handle exceptions locally. For fatal job exceptions, the default behavior
187+
SHALL be to kill the local job shell and its children.
188+
189+
After receiving the final ``release`` request from a downstream module,
190+
the rank 0 job execution module SHALL perform the following final steps:
191+
192+
- post a terminating event to the exec eventlog
193+
- copy guest namespace to primary namespace
194+
- issue a ``release`` response with final=true to the job manager
195+
- remove local state entry for the job
196+
- update distributed state so job is removed from all children
197+
198+
Job-Exec Hello Request
199+
^^^^^^^^^^^^^^^^^^^^^^
200+
The ``job-exec.hello`` request has no payload.
201+
202+
Job-Exec Hello Response
203+
^^^^^^^^^^^^^^^^^^^^^^^
204+
205+
A ``job-exec.hello`` response payload SHALL be a JSON object containing
206+
the following REQUIRED keys:
207+
208+
type
209+
(string) The response type
210+
211+
idset
212+
(string) RFC 22 Idset string indicating the ranks to which this response
213+
should be delivered
214+
215+
data
216+
(object) type-specific data
217+
218+
State-update
219+
~~~~~~~~~~~~
220+
221+
The ``state-update`` ``hello`` response ``data`` object SHALL contain the
222+
following REQUIRED keys:
223+
224+
jobs
225+
A list of job tuples where a tuple is an array ``[ id, userid, type, idset]``.
226+
227+
Where
228+
229+
id
230+
(integer) the job ID
231+
232+
userid
233+
(integer) the job user ID
234+
235+
idset
236+
(string) An RFC 22 idset string denoting all ranks which are included
237+
in the assigned resources for job ``id``.
238+
239+
type
240+
(string) The type of state update. One of ``add``, ``remove``, or ``check``.
241+
242+
Job-Exec Start Request
243+
^^^^^^^^^^^^^^^^^^^^^^
244+
245+
A ``job-exec.start`` request SHALL be sent upstream by an execution module
246+
once the job shell or IMP has been started. The payload SHALL be a JSON
247+
object containing the following REQUIRED keys:
248+
249+
id
250+
(integer) the job ID
251+
252+
ranks
253+
(string) an RFC 22 Idset string of ranks on which the job shell has started
254+
255+
256+
Job-Exec Barrier Request
257+
^^^^^^^^^^^^^^^^^^^^^^^^
258+
259+
A ``job-exec.barrier`` request SHALL be sent upstream from a execution
260+
module when the locally executed job shell enters a barrier. The payload
261+
SHALL be a JSON object containing the following REQUIRED keys:
262+
263+
id
264+
(integer) the job ID
265+
266+
ranks
267+
(string) an RFC 22 Idset string of execution targets on which the shell
268+
barrier has been started.
269+
270+
seq
271+
(integer) a shell barrier sequence number
272+
273+
The upstream module SHALL respond to a ``job-exec.barrier`` request
274+
once all job shells have entered the barrier with a matching sequence
275+
number.
276+
277+
278+
Job-Exec Finish Request
279+
^^^^^^^^^^^^^^^^^^^^^^^
280+
281+
A ``job-exec.finish`` request SHALL be sent upstream by an execution
282+
module once the job shell has exited. The payload SHALL be a JSON object
283+
containing the following REQUIRED keys:
284+
285+
id
286+
(integer) the job ID
287+
288+
ranks
289+
(string) an RFC 22 idset string of execution targets on which the job
290+
shell has exited.
291+
292+
status
293+
(integer) the greatest job shell wait status among ``ranks``
294+
295+
296+
Job-Exec Exception Request
297+
^^^^^^^^^^^^^^^^^^^^^^^^^^
298+
299+
A ``job-exec.execption`` request SHALL be sent upstream by an execution
300+
module when the module wishes to raise a execution related job exception. The
301+
payload SHALL be a JSON object containing the following REQUIRED keys:
302+
303+
id
304+
(integer) the job ID
305+
306+
severity
307+
(integer) the exception severity
308+
309+
type
310+
(string) the exception type
311+
312+
note
313+
(string) a human readable description of the job exception
314+
315+
316+
Job-Exec Release Request
317+
^^^^^^^^^^^^^^^^^^^^^^^^
318+
319+
A ``job-exec.release`` request SHALL be sent upstream by an execution
320+
module after the job shell has exited and any job epilog or other work
321+
associated with the job has completed. The payload SHALL be a JSON object
322+
with the following REQUIRED keys:
323+
324+
id
325+
(integer) the job ID
326+
327+
ranks
328+
(string) an RFC 22 Idset including the execution target ranks on which
329+
resources should be released
330+

0 commit comments

Comments
 (0)