How to troubleshoot failed jobs #393
leoll2 started this conversation in Show and tell
If you're reading this, you've probably encountered an error while trying to train a model or import a dataset. This guide helps you gather the logs related to the failed job, so that you can share them with us for effective troubleshooting.
In the jobs panel, the error looks something like this:
If you're lucky, the error message contains detailed information about the problem; in other cases, like the example above, the cause is more complex and the details are only available in the logs.
There are two ways to view logs: the simplest is through Grafana, assuming you have it installed, while the other requires interacting with the K8s cluster.
Get the job id
Each job has a unique identifier. Most often it is included in the error message, so you can copy it directly from there.
Alternatively, there is a small button to copy the job id in the top-right corner.
Browse logs via Grafana
Grafana is accessible at <host>/api/v1/grafana, where <host> is the URL that you normally use to access the Geti UI. Make sure you're logged in before opening Grafana.
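For example, if you normally reach the Geti UI at https://geti.example.com (a made-up hostname for illustration), Grafana would be available at https://geti.example.com/api/v1/grafana.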
The initial screen looks like this; open the "Logs" dashboard. Now configure the filters:
- jobID is the job identifier (see the previous section for how to get it).
- suffix denotes the step within the job; for simplicity, start typing ex- and scroll through the auto-completion suggestions, pick flyte-workflow, or leave it blank.

If the configuration is correct, the logs will be immediately displayed in reverse chronological order (top = latest).
Feel free to apply additional filters if it helps to highlight the relevant ones.
To download the logs, you have two options:
Browse logs via K8s
If Grafana is not installed or not available, an alternative way to get the job logs is through K8s directly. The downside, however, is that job pods are deleted along with their logs as soon as they terminate, so you may not be able to retrieve the logs of previously failed jobs. What you can do is try to replicate the failing scenario again and watch the logs in real time.
The list of job pods can be displayed with the following command; when no job is running, the list is empty.
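A sketch of such a command, assuming you have kubectl access to the cluster; <jobs_namespace> is a placeholder for the namespace where your Geti installation runs its job pods, which may differ between setups:
kubectl get pods -n <jobs_namespace>
If you're unsure of the namespace, kubectl get pods --all-namespaces lists the pods across every namespace.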
You can then show the logs for a specific job with:
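For example, assuming the same placeholder namespace as above (the -f flag streams the logs in real time):
kubectl logs -f <pod_name> -n <jobs_namespace>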
where <pod_name> is obtained from the previous command.