Commit bc964b2

start of troubleshooting section
1 parent eb97348 commit bc964b2

File tree

3 files changed: +207 −0 lines changed

src/index.rst

Lines changed: 1 addition & 0 deletions
@@ -23,3 +23,4 @@ indefinitely.
    workflow-design-guide/index
    reference/index
    glossary
+   troubleshooting

src/installation.rst

Lines changed: 2 additions & 0 deletions
@@ -97,6 +97,8 @@ These dependencies are not installed by Conda or pip:

 * ``bash``
 * GNU `coreutils`_
+* ``ssh``
+* ``rsync``
 * ``mail`` (optional - for automated email functionality)

 These dependencies are installed by Conda but not by pip:

src/troubleshooting.rst

Lines changed: 204 additions & 0 deletions
@@ -0,0 +1,204 @@
Troubleshooting
===============

If things have gone wrong and you're not sure why, there are a few files which
should contain the information required to work out what's going on.

``log/scheduler/log``
   There's a log file for each workflow in
   ``~/cylc-run/<workflow>/log/scheduler/log``.

   You can view this in the GUI or on the command line using
   ``cylc cat-log <workflow>``.

``job.err``
   This contains the stderr captured when the job ran. It's useful for
   debugging job failures.

   You can view this in the GUI or on the command line using
   ``cylc cat-log <workflow>//<cycle>/<task>/<job> -f e``.

``job-activity.log``
   This file records the interaction Cylc has had with a job, namely submission
   and polling. It can be useful in determining the cause of job submission
   failures.

   You can view this in the GUI or on the command line using
   ``cylc cat-log <workflow>//<cycle>/<task>/<job> -f a``.


Problems
--------


Job Status Isn't Updating
^^^^^^^^^^^^^^^^^^^^^^^^^

Cylc keeps track of a job's progress in one of two ways (according to how
the platform the job was submitted to is configured):

* Jobs send messages to the scheduler (push).
* The scheduler polls jobs (pull).

In either case, the job will also write its updates to the ``job.status`` file.

This is what the ``job.status`` file should look like for a successful job;
note the ``SUCCEEDED`` line:

.. code-block::

   CYLC_JOB_RUNNER_NAME=background
   CYLC_JOB_ID=12345
   CYLC_JOB_RUNNER_SUBMIT_TIME=2000-01-01T00:00:00
   CYLC_JOB_PID=108179
   CYLC_JOB_INIT_TIME=2000-01-01T00:10:00
   CYLC_JOB_EXIT=SUCCEEDED
   CYLC_JOB_EXIT_TIME=2000-01-01T01:30:00

If the ``job.status`` file is showing something different to what the GUI or
Tui is showing, then the job's updates are not reaching the scheduler.

.. rubric:: If your platform uses push communication:

If messages aren't getting back to the scheduler, there should be some
evidence of this in the ``job.err`` file, likely either an error or a
traceback.

Likely causes:

* There is a network issue.
* TCP ports are not open (zmq communications).
* Non-interactive SSH has not been correctly configured (ssh communications).

.. rubric:: If your platform uses pull communication:

Firstly, check the polling interval; it's possible that the scheduler has been
configured to poll infrequently and you need to wait for the next poll, or use
the ``cylc poll`` command (also available in the GUI).

Use the ``cylc config`` command to inspect the platform's configuration and
determine the configured polling schedule.

Then check the ``job-activity.log`` file; there may have been a problem polling
the remote platform, e.g. a network or configuration error.

Likely causes:

* The platform is down (e.g. all login nodes are offline).
* There is a network issue.
* Non-interactive SSH has not been correctly configured.


My Job Submit-Failed
^^^^^^^^^^^^^^^^^^^^

A submit-failed job means one of three things:

1. There is a Bash syntax error in the task configuration.

   E.g. the following ``script`` has a syntax error; it is missing a
   ``"`` character:

   .. code-block:: cylc

      [runtime]
          [[foo]]
              script = echo "Hello $WORLD

   This will result in a submission failure which should appear in the
   ``job-activity.log`` file (and also the scheduler log) looking something
   like this:

   .. code-block::

      /path/to/job.tmp: line 46: unexpected EOF while looking for matching `"'
      /path/to/job.tmp: line 50: syntax error: unexpected end of file

2. There was an error submitting the job to the specified platform (including
   network issues).

   See the ``job-activity.log`` and the scheduler log. The error should be in
   one or both of those files.

3. The platform is not correctly configured.
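
Since job scripts are run by Bash, the first class of error can be reproduced
(or checked for in advance) with ``bash -n``, which parses a script without
executing it. A small sketch using the broken line from the example above:

.. code-block:: bash

   # Write the broken fragment from the example to a temporary file.
   script="$(mktemp)"
   printf '%s\n' 'echo "Hello $WORLD' > "$script"

   # `bash -n` parses without executing; it fails on the unmatched quote.
   if ! bash -n "$script" 2>/dev/null; then
       echo "syntax error detected"
   fi

   rm -f "$script"
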


My Job Failed
^^^^^^^^^^^^^

This means something went wrong executing the job.

To find out more, see the ``job.err`` file.

If you're struggling to track down the error, you might want to restart the
workflow in debug mode and run the task again:

.. code-block:: console

   # shut the workflow down (leave any active jobs running)
   $ cylc stop --now --now <workflow>
   # restart the workflow in debug mode
   $ cylc play <workflow> --debug
   # re-run all failed task(s)
   $ cylc trigger '<workflow>//*:failed'

When a workflow is running in debug mode, all jobs will create a ``job.xtrace``
file which can help you to locate the error within the job script.
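
The ``job.xtrace`` file contains Bash xtrace (``set -x``) output: each command
is echoed before it runs, so the end of the trace points at the failing
command. A standalone illustration of what that output looks like (the script
here is a stand-in, not a real Cylc job):

.. code-block:: bash

   # Run a stand-in script with xtrace enabled; stderr carries the trace.
   trace="$(mktemp)"
   bash -x -c 'msg="hello"; echo "$msg"' 2>"$trace"

   # Each executed command appears prefixed with "+".
   cat "$trace"

   rm -f "$trace"
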


My Workflow Shut Down Unexpectedly
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When a Cylc scheduler shuts down, it should leave behind a log message
explaining why.

E.g. this message means that a workflow shut down because it was told to:

.. code-block::

   Workflow shutting down - REQUEST(CLEAN)

If a workflow shut down due to a critical problem, you should find some
traceback in this log. If this traceback doesn't look like it comes from your
system, please report it to the Cylc developers for investigation (on
GitHub or Discourse).
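
To find the shutdown reason quickly, you can grep the scheduler log for the
shutdown message. A sketch using a sample log line (in practice, point grep
at ``~/cylc-run/<workflow>/log/scheduler/log``):

.. code-block:: bash

   # A sample log line standing in for a real scheduler log.
   log="$(mktemp)"
   echo 'INFO - Workflow shutting down - REQUEST(CLEAN)' > "$log"

   # The same grep works on a real log/scheduler/log file.
   grep 'Workflow shutting down' "$log"

   rm -f "$log"
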
161+
162+
In some extreme cases, Cylc might not be able to write a log message e.g:
163+
164+
* There's not enough disk space for Cylc to write a log message.
165+
* If the scheduler will killed in a nasty way e.g. ``kill -9``.
166+
* If the scheduler host goes down (e.g. power off).
167+
168+
169+
Error Messages
170+
--------------
171+
172+
173+
FileNotFoundError: No such file or directory
174+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
175+
176+
This is the error message Python gives when you try to call an exectuable which
177+
does not exist in the ``$PATH``. It means there's something wrong with the Cylc
178+
installation.
179+
180+
E.G. the following error:
181+
182+
.. code-block::
183+
184+
FileNotFoundError: [Errno 2] No such file or directory: 'ssh'
185+
186+
Means that ``ssh`` is not installed.
187+
188+
See :ref:`non-python-requirements` for details on system requirements.
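
A quick way to check for this class of problem is to confirm the required
system executables can be found on ``$PATH`` (``bash``, ``ssh`` and ``rsync``
are among the non-Python dependencies listed in the installation
instructions):

.. code-block:: bash

   # Report whether each required executable can be found on $PATH.
   for cmd in bash ssh rsync; do
       if command -v "$cmd" >/dev/null 2>&1; then
           echo "$cmd: found"
       else
           echo "$cmd: MISSING"
       fi
   done
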


platform: <name> - initialisation did not complete
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This means that Cylc was unable to install the required workflow files onto
a remote platform.

This means either that:

1. The platform is down (e.g. all login nodes are offline).
2. There is a network problem (e.g. you cannot connect to the login nodes).
3. The platform is not correctly configured.

Check the scheduler log; you might find some stderr associated with this
message.
