|
6 | 6 | "source": [ |
7 | 7 | "# Troubleshooting\n", |
8 | 8 | "\n", |
9 | | - "This tutorial steps through tecnhiques to identify errors and pipeline failures, and\n", |
10 | | - "avoid common pitfalls." |
| 9 | + "This tutorial steps through tecnhiques to identify errors and pipeline failures, as well\n", |
| 10 | + "as avoid common pitfalls setting up executing over multiple processes." |
11 | 11 | ] |
12 | 12 | }, |
13 | 13 | { |
|
45 | 45 | "source": [ |
46 | 46 | "### Enclosing multi-process code within `if __name__ == \"__main__\"`\n", |
47 | 47 | "\n", |
48 | | - "If running a script that executes a workflow with the concurrent futures worker\n", |
49 | | - "(i.e. `worker=\"cf\"`) on macOS or Windows, then the submissing/execution call needs to\n", |
50 | | - "be enclosed within a `if __name__ == \"__main__\"` blocks, e.g." |
| 48 | + "When running multi-process Python code on macOS or Windows, as is the case when the \n", |
| 49 | + "concurrent futures worker is selected (i.e. `worker=\"cf\"`), then scripts that execute\n", |
| 50 | + "the forking code need to be enclosed within an `if __name__ == \"__main__\"` block, e.g." |
51 | 51 | ] |
52 | 52 | }, |
53 | 53 | { |
|
71 | 71 | "cell_type": "markdown", |
72 | 72 | "metadata": {}, |
73 | 73 | "source": [ |
74 | | - "### Remove stray lockfiles\n", |
| 74 | + "This allows the secondary processes to import the script without executing it. Without\n", |
| 75 | + "such a block Pydra will lock up and not process the workflow. On Linux this is not an\n", |
| 76 | + "issue due to the way that processes are forked, but is good practice in any case for\n", |
| 77 | + "code portability." |
| 78 | + ] |
| 79 | + }, |
| 80 | + { |
| 81 | + "cell_type": "markdown", |
| 82 | + "metadata": {}, |
| 83 | + "source": [ |
| 84 | + "### Removing stray lockfiles\n", |
75 | 85 | "\n", |
76 | | - "During the execution of a task, a lockfile is generated to signify that a task is running.\n", |
77 | | - "These lockfiles are released after a task completes, either successfully or with an error,\n", |
78 | | - "within a *try/finally* block. However, if a task/workflow is terminated by an interactive\n", |
79 | | - "debugger the finally block may not be executed causing stray lockfiles to hang around. This\n", |
| 86 | + "When a Pydra task is executed, a lockfile is generated to signify that the task is running.\n", |
| 87 | + "Other processes will wait for this lock to be released before attempting to access the\n", |
| 88 | + "tasks results. The lockfiles are automatically deleted after a task completes, either\n", |
| 89 | + "successfully or with an error, within a *try/finally* block so should run most of the time.\n", |
| 90 | + "However, if a task/workflow is terminated by an interactive\n", |
| 91 | + "debugger, the finally block may not be executed, leaving stray lockfiles. This\n", |
80 | 92 | "can cause the Pydra to hang waiting for the lock to be released. If you suspect this to be\n", |
81 | 93 | "an issue, and there are no other jobs running, then simply remove all lock files from your\n", |
82 | 94 | "cache directory (e.g. `rm <your-run-cache-dir>/*.lock`) and re-submit your job.\n", |
|
91 | 103 | "cell_type": "markdown", |
92 | 104 | "metadata": {}, |
93 | 105 | "source": [ |
94 | | - "## Finding errors\n", |
| 106 | + "## Inspecting errors\n", |
95 | 107 | "\n", |
96 | 108 | "### Running in *debug* mode\n", |
97 | 109 | "\n", |
|
116 | 128 | "# This workflow will fail because we are trying to divide by 0\n", |
117 | 129 | "wf = UnsafeDivisionWorkflow(a=10, b=5).split(denominator=[3, 2 ,0])\n", |
118 | 130 | "\n", |
119 | | - "with Submitter(worker=\"cf\") as sub:\n", |
120 | | - " result = sub(wf)\n", |
| 131 | + "if __name__ == \"__main__\":\n", |
| 132 | + " with Submitter(worker=\"cf\") as sub:\n", |
| 133 | + " result = sub(wf)\n", |
121 | 134 | " \n", |
122 | 135 | "if result.errored:\n", |
123 | 136 | " print(\"Workflow failed with errors:\\n\" + str(result.errors))\n", |
|
129 | 142 | "cell_type": "markdown", |
130 | 143 | "metadata": {}, |
131 | 144 | "source": [ |
132 | | - "Work in progress..." |
| 145 | + "The error pickle files can be loaded using the `cloudpickle` library, noting that it is\n", |
| 146 | + "important to use the same Python version to load the files that was used to run the Pydra\n", |
| 147 | + "workflow" |
| 148 | + ] |
| 149 | + }, |
| 150 | + { |
| 151 | + "cell_type": "code", |
| 152 | + "execution_count": null, |
| 153 | + "metadata": {}, |
| 154 | + "outputs": [], |
| 155 | + "source": [ |
| 156 | + "import cloudpickle as cp\n", |
| 157 | + "\n", |
| 158 | + "with open(\"<your-cache-root>/<task-cache-dir/_error.pklz\", \"rb\") as f:\n", |
| 159 | + " error = cp.load(f)\n", |
| 160 | + "\n", |
| 161 | + "print(error)" |
133 | 162 | ] |
134 | 163 | }, |
135 | 164 | { |
|
147 | 176 | "Currently in Pydra you need to step backwards through the tasks of the workflow, load\n", |
148 | 177 | "the saved task object and inspect its inputs to find the preceding nodes. If any of the\n", |
149 | 178 | "inputs that have been generated by previous nodes are not ok, then you should check the\n", |
150 | | - "tasks that generated them in turn.\n", |
| 179 | + "tasks that generated them in turn. For file-based inputs, you should be able to find\n", |
| 180 | + "the path of the preceding task's cache directory from the provided file path. However,\n", |
| 181 | + "for non-file inputs you may need to exhaustively iterate through all the task dirs\n", |
| 182 | + "in your cache root to find the issue.\n", |
151 | 183 | "\n", |
152 | | - "For example, in the following example if we are not happy with the mask brain that has\n", |
153 | | - "been generated, we can check the mask to see whether it looks sensible by first loading\n", |
154 | | - "the apply mask task and then inspecting its inputs." |
| 184 | + "For example, in the following example workflow, if a divide by 0 occurs within the division\n", |
| 185 | + "node of the workflow, then an `float('inf')` will be returned, which will then propagate\n", |
| 186 | + "through the workflow." |
155 | 187 | ] |
156 | 188 | }, |
157 | 189 | { |
158 | 190 | "cell_type": "code", |
159 | | - "execution_count": null, |
| 191 | + "execution_count": 2, |
160 | 192 | "metadata": {}, |
161 | | - "outputs": [], |
162 | | - "source": [] |
| 193 | + "outputs": [ |
| 194 | + { |
| 195 | + "ename": "NameError", |
| 196 | + "evalue": "name 'Submitter' is not defined", |
| 197 | + "output_type": "error", |
| 198 | + "traceback": [ |
| 199 | + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", |
| 200 | + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", |
| 201 | + "Cell \u001b[0;32mIn[2], line 5\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mpydra\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mtasks\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mtesting\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m SafeDivisionWorkflow\n\u001b[1;32m 3\u001b[0m wf \u001b[38;5;241m=\u001b[39m SafeDivisionWorkflow(a\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m10\u001b[39m, b\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m5\u001b[39m)\u001b[38;5;241m.\u001b[39msplit(denominator\u001b[38;5;241m=\u001b[39m[\u001b[38;5;241m3\u001b[39m, \u001b[38;5;241m2\u001b[39m ,\u001b[38;5;241m0\u001b[39m])\n\u001b[0;32m----> 5\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[43mSubmitter\u001b[49m(worker\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcf\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01mas\u001b[39;00m sub:\n\u001b[1;32m 6\u001b[0m result \u001b[38;5;241m=\u001b[39m sub(wf)\n\u001b[1;32m 8\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mWorkflow completed successfully, results saved in: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mresult\u001b[38;5;241m.\u001b[39moutput_dir\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n", |
| 202 | + "\u001b[0;31mNameError\u001b[0m: name 'Submitter' is not defined" |
| 203 | + ] |
| 204 | + } |
| 205 | + ], |
| 206 | + "source": [ |
| 207 | + "from pydra.tasks.testing import SafeDivisionWorkflow\n", |
| 208 | + "\n", |
| 209 | + "wf = SafeDivisionWorkflow(a=10, b=5).split(denominator=[3, 2 ,0])\n", |
| 210 | + "\n", |
| 211 | + "with Submitter(worker=\"cf\") as sub:\n", |
| 212 | + " result = sub(wf)\n", |
| 213 | + " \n", |
| 214 | + "print(f\"Workflow completed successfully, results saved in: {result.output_dir}\")" |
| 215 | + ] |
163 | 216 | }, |
164 | 217 | { |
165 | 218 | "cell_type": "markdown", |
166 | 219 | "metadata": {}, |
167 | 220 | "source": [ |
168 | | - "Work in progress..." |
| 221 | + "To find the task directory where the issue first surfaced, iterate through every task\n", |
| 222 | + "cache dir and check the results for `float(\"inf\")`s" |
169 | 223 | ] |
170 | 224 | }, |
171 | 225 | { |
172 | | - "cell_type": "markdown", |
| 226 | + "cell_type": "code", |
| 227 | + "execution_count": null, |
173 | 228 | "metadata": {}, |
174 | | - "source": [] |
| 229 | + "outputs": [], |
| 230 | + "source": [ |
| 231 | + "import cloudpickle as cp\n", |
| 232 | + "from pydra.utils import user_cache_dir\n", |
| 233 | + "\n", |
| 234 | + "run_cache = user_cache_dir / \"run-cache\"\n", |
| 235 | + "\n", |
| 236 | + "for task_cache_dir in run_cache.iterdir():\n", |
| 237 | + " with open(task_cache_dir / \"_result.pklz\", \"rb\") as f:\n", |
| 238 | + " error = cp.load(f)\n", |
| 239 | + " for \n", |
| 240 | + " " |
| 241 | + ] |
175 | 242 | } |
176 | 243 | ], |
177 | 244 | "metadata": { |
|