Commit 49f2e10

Render a root-cause exception for dependency and join errors (#3717)
# Description

This PR reworks two exception types, DependencyError and JoinError.

Both of these exceptions report that a task failed because some other task/future failed - in the dependency case, because a task dependency failed, and in the join case because one of the tasks/futures being joined failed.

This PR introduces a common superclass `PropagatedException` to acknowledge that the meaning and behaviour of these two exceptions are very similar.

`PropagatedException` has a new implementation for reporting the failures that are being propagated. Parsl has tried a couple of ways to do this in the past:

* The implementation immediately before this PR reports only the immediate task IDs (or future reprs, for non-tasks) in the exception message. For details of the chain of exceptions and the original/non-propagated exception, the user can examine the exception object via the `dependent_exceptions_tids` attribute.
* Prior to PR #1802, the repr/str (and so the printed form) of dependency exceptions rendered the entire exception. In the case of deep dependency chains, or where a dependency graph has many paths to a root cause, this resulted in extremely voluminous output with a lot of boilerplate dependency exception text.

The approach introduced by this current PR attempts a fusion of these two approaches:

* The user will often be waiting only on the final task of a dependency chain (because the DFK will be managing everything in between) - so they will often get a dependency exception.
* When they get a dependency exception, they are likely to actually be interested in the root cause at the earliest part of the chain. So this PR makes dependency exceptions traverse the chain and discover a root cause.
* When there are multiple root causes, or multiple paths to the same root cause, the user should not be overwhelmed with output. So this PR picks a single root cause exception to report fully, and when there are other causes/paths adds a small annotation `(+ others)`.
* The user is sometimes interested in the path from that root cause exception to the current failure, but often not. That path is rendered roughly the same as immediately before this PR, as a sequence of task IDs (or Future reprs for non-tasks).
* Python has a native mechanism for indicating that an exception is caused by another exception: the `__cause__` magic attribute, which is usually populated by `raise e1 from e2`. This PR populates that magic attribute at construction so that displaying the exception will show the cause using Python's native format.
* The user may want to ask other Parsl-relevant questions about the exception chain, so this PR keeps the `dependent_exceptions_tids` attribute for such introspection.

A dependency or join error is now rendered by Python as exactly two exceptions next to each other:

```
Traceback (most recent call last):
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 922, in _unwrap_futures
    new_args.extend([self.dependency_resolver.traverse_to_unwrap(dep)])
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/functools.py", line 907, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/functools.py", line 907, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dependency_resolvers.py", line 48, in _
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 339, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 603, in _unwrap_remote_exception_wrapper
    result.reraise()
  File "/home/benc/parsl/src/parsl/parsl/app/errors.py", line 114, in reraise
    raise v
  File "/home/benc/parsl/src/parsl/parsl/app/errors.py", line 138, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/taskchain.py", line 13, in failer
    raise RuntimeError("example root failure")
    ^^^^^^^^^^^^^^^^^
RuntimeError: example root failure

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/benc/parsl/src/parsl/taskchain.py", line 16, in <module>
    inter(inter(inter(inter(inter(failer()))))).result()
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 339, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benc/parsl/src/parsl/parsl/dataflow/dflow.py", line 601, in _unwrap_remote_exception_wrapper
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
parsl.dataflow.errors.DependencyError: Dependency failure for task 5. The representative cause is via task 4 <- task 3 <- task 2 <- task 1 <- task 0
```

# Changed Behaviour

DependencyErrors and JoinErrors will render differently.

## Type of change

- Update to human readable text: Documentation/error messages/comments
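The traceback in the description shows a single linear chain. As a rough illustration of how the `(+ others)` annotation could appear when a task has more than one failed dependency, here is a minimal sketch that mirrors the traversal described in the list of design points. `PropagatedSketch` is a hypothetical stand-in written for this example, not the Parsl class, and the task labels are made up.

```python
from typing import List, Sequence, Tuple


class PropagatedSketch(Exception):
    """Hypothetical stand-in for the PR's PropagatedException (not Parsl code)."""

    def __init__(self, deps: Sequence[Tuple[BaseException, str]], task_id: int) -> None:
        super().__init__()
        self.dependent_exceptions_tids = deps
        self.task_id = task_id
        cause, path = self._find_any_root_cause()
        # Setting __cause__ lets Python render the root cause natively.
        self.__cause__ = cause
        self._cause_sequence = path

    def __str__(self) -> str:
        return (f"Dependency failure for task {self.task_id}. "
                f"The representative cause is via {' <- '.join(self._cause_sequence)}")

    def _find_any_root_cause(self) -> Tuple[BaseException, List[str]]:
        # Always follow the first recorded dependency until reaching an
        # exception that is not itself a propagated one.
        e: BaseException = self
        path: List[str] = []
        while isinstance(e, PropagatedSketch) and len(e.dependent_exceptions_tids) >= 1:
            label = e.dependent_exceptions_tids[0][1]
            if len(e.dependent_exceptions_tids) > 1:
                label += " (+ others)"
            path.append(label)
            e = e.dependent_exceptions_tids[0][0]
        return e, path


# Task 2 depends on two failed tasks; task 3 depends on task 2.
root_a = RuntimeError("example root failure")
root_b = ValueError("a second root failure that is not reported in full")
t2 = PropagatedSketch([(root_a, "task 0"), (root_b, "task 1")], task_id=2)
t3 = PropagatedSketch([(t2, "task 2")], task_id=3)

message = str(t3)
print(message)
```

In this sketch only `root_a` becomes `__cause__`; `root_b` is reachable only through `dependent_exceptions_tids`, which is why the annotation is attached to the step where the alternatives were dropped.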
1 parent 47e60f0 commit 49f2e10

2 files changed: +66 -18 lines changed

parsl/dataflow/errors.py

Lines changed: 60 additions & 18 deletions
@@ -1,4 +1,4 @@
-from typing import Optional, Sequence, Tuple
+from typing import List, Sequence, Tuple
 
 from parsl.errors import ParslError
 
@@ -29,35 +29,77 @@ def __str__(self) -> str:
         return self.reason
 
 
-class DependencyError(DataFlowException):
-    """Error raised if an app cannot run because there was an error
-       in a dependency.
+class PropagatedException(DataFlowException):
+    """Error raised if an app fails because there was an error
+       in a related task. This is intended to be subclassed for
+       dependency and join_app errors.
 
     Args:
-         - dependent_exceptions_tids: List of exceptions and identifiers for
-           dependencies which failed. The identifier might be a task ID or
-           the repr of a non-DFK Future.
+         - dependent_exceptions_tids: List of exceptions and brief descriptions
+           for dependencies which failed. The description might be a task ID or
+           the repr of a non-AppFuture.
          - task_id: Task ID of the task that failed because of the dependency error
     """
 
-    def __init__(self, dependent_exceptions_tids: Sequence[Tuple[Exception, str]], task_id: int) -> None:
+    def __init__(self,
+                 dependent_exceptions_tids: Sequence[Tuple[BaseException, str]],
+                 task_id: int,
+                 *,
+                 failure_description: str) -> None:
         self.dependent_exceptions_tids = dependent_exceptions_tids
         self.task_id = task_id
+        self._failure_description = failure_description
+
+        (cause, cause_sequence) = self._find_any_root_cause()
+        self.__cause__ = cause
+        self._cause_sequence = cause_sequence
 
     def __str__(self) -> str:
-        deps = ", ".join(tid for _exc, tid in self.dependent_exceptions_tids)
-        return f"Dependency failure for task {self.task_id} with failed dependencies from {deps}"
+        sequence_text = " <- ".join(self._cause_sequence)
+        return f"{self._failure_description} for task {self.task_id}. " \
+               f"The representative cause is via {sequence_text}"
+
+    def _find_any_root_cause(self) -> Tuple[BaseException, List[str]]:
+        """Looks recursively through self.dependent_exceptions_tids to find
+        an exception that caused this propagated error, that is not itself
+        a propagated error.
+        """
+        e: BaseException = self
+        dep_ids = []
+        while isinstance(e, PropagatedException) and len(e.dependent_exceptions_tids) >= 1:
+            id_txt = e.dependent_exceptions_tids[0][1]
+            assert isinstance(id_txt, str)
+            # if there are several causes for this exception, label that
+            # there are more so that we know that the representative fail
+            # sequence is not the full story.
+            if len(e.dependent_exceptions_tids) > 1:
+                id_txt += " (+ others)"
+            dep_ids.append(id_txt)
+            e = e.dependent_exceptions_tids[0][0]
+        return e, dep_ids
+
+
+class DependencyError(PropagatedException):
+    """Error raised if an app cannot run because there was an error
+       in a dependency. There can be several exceptions (one from each
+       dependency) and DependencyError collects them all together.
 
+    Args:
+         - dependent_exceptions_tids: List of exceptions and brief descriptions
+           for dependencies which failed. The description might be a task ID or
+           the repr of a non-AppFuture.
+         - task_id: Task ID of the task that failed because of the dependency error
+    """
+    def __init__(self, dependent_exceptions_tids: Sequence[Tuple[BaseException, str]], task_id: int) -> None:
+        super().__init__(dependent_exceptions_tids, task_id,
+                         failure_description="Dependency failure")
 
-class JoinError(DataFlowException):
+
+class JoinError(PropagatedException):
     """Error raised if apps joining into a join_app raise exceptions.
        There can be several exceptions (one from each joining app),
        and JoinError collects them all together.
     """
-    def __init__(self, dependent_exceptions_tids: Sequence[Tuple[BaseException, Optional[str]]], task_id: int) -> None:
-        self.dependent_exceptions_tids = dependent_exceptions_tids
-        self.task_id = task_id
-
-    def __str__(self) -> str:
-        dep_tids = [tid for (exception, tid) in self.dependent_exceptions_tids]
-        return "Join failure for task {} with failed join dependencies from tasks {}".format(self.task_id, dep_tids)
+    def __init__(self, dependent_exceptions_tids: Sequence[Tuple[BaseException, str]], task_id: int) -> None:
+        super().__init__(dependent_exceptions_tids, task_id,
+                         failure_description="Join failure")
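
Because `_find_any_root_cause` follows only the first entry at each level, other root causes stay reachable only through `dependent_exceptions_tids`. As a hedged sketch of the kind of introspection the commit message alludes to, the helper below walks every branch and returns each distinct non-propagated exception once. `all_root_causes` and `FakePropagated` are hypothetical names invented for this example; neither is part of Parsl, and the helper relies only on the presence of the `dependent_exceptions_tids` attribute.

```python
from typing import List


def all_root_causes(exc: BaseException) -> List[BaseException]:
    """Depth-first walk over dependent_exceptions_tids.

    Returns every distinct exception that has no (non-empty)
    dependent_exceptions_tids of its own, in first-visited order.
    """
    roots: List[BaseException] = []
    seen = set()
    stack = [exc]
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue  # already reached via another path
        seen.add(id(current))
        deps = getattr(current, "dependent_exceptions_tids", None)
        if deps:
            # Push in reverse so the first dependency is visited first.
            stack.extend(dep_exc for dep_exc, _label in reversed(deps))
        else:
            roots.append(current)
    return roots


class FakePropagated(Exception):
    """Hypothetical stand-in carrying the same attribute as the PR's classes."""

    def __init__(self, deps, task_id):
        super().__init__()
        self.dependent_exceptions_tids = deps
        self.task_id = task_id


# A diamond: both branches share root_a, and one branch also has root_b.
root_a = RuntimeError("a")
root_b = ValueError("b")
left = FakePropagated([(root_a, "task 0")], 2)
right = FakePropagated([(root_a, "task 0"), (root_b, "task 1")], 3)
top = FakePropagated([(left, "task 2"), (right, "task 3")], 4)

roots = all_root_causes(top)
```

The `seen` set is what keeps a root that is reachable by several paths from being listed more than once.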

parsl/tests/test_python_apps/test_fail.py

Lines changed: 6 additions & 0 deletions
@@ -39,6 +39,9 @@ def test_fail_sequence_first():
     assert isinstance(t_final.exception().dependent_exceptions_tids[0][0], DependencyError)
     assert t_final.exception().dependent_exceptions_tids[0][1].startswith("task ")
 
+    assert hasattr(t_final.exception(), '__cause__')
+    assert t_final.exception().__cause__ == t1.exception()
+
 
 def test_fail_sequence_middle():
     t1 = random_fail(fail_prob=0)
@@ -50,3 +53,6 @@ def test_fail_sequence_middle():
 
     assert len(t_final.exception().dependent_exceptions_tids) == 1
     assert isinstance(t_final.exception().dependent_exceptions_tids[0][0], ManufacturedTestFailure)
+
+    assert hasattr(t_final.exception(), '__cause__')
+    assert t_final.exception().__cause__ == t2.exception()
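
The added tests check the `__cause__` attribute directly. A complementary check on the rendered output might look roughly like the sketch below, which uses only the standard library `traceback` module and plain exceptions as stand-ins (no Parsl objects are involved, and the message text is copied from the commit description purely for illustration).

```python
import traceback

# Plain exceptions standing in for a root failure and a propagated error.
root = RuntimeError("example root failure")
wrapper = Exception(
    "Dependency failure for task 1. The representative cause is via task 0"
)
# Equivalent in effect to `raise wrapper from root`, without raising.
wrapper.__cause__ = root

# format_exception follows __cause__ by default and renders the cause first.
rendered = "".join(
    traceback.format_exception(type(wrapper), wrapper, wrapper.__traceback__)
)
print(rendered)
```

The cause appears before the propagated error, separated by Python's standard "direct cause" line, which is the two-exception layout the commit message describes.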
