How to troubleshoot failed jobs #393
leoll2 started this conversation in Show and tell
If you're reading this, you've probably encountered an error while trying to train a model or import a dataset. This guide helps you gather the logs related to the failed job, so that you can share them with us for effective troubleshooting.
In the jobs panel, the error looks something like this:
If you're lucky, the error message contains detailed information about the problem; in other cases, like the example above, the cause is more complex and the details are only available in the logs.
There are two ways to view logs: the simplest is through Grafana, assuming you have it installed, while the other requires interacting with the K8s cluster.
Get the job id
Each job has a unique identifier. Most often it is included in the error message, so you can copy it directly from there.
Alternatively, there is a small button to copy the job id in the top-right corner.
Browse logs via Grafana
Grafana is accessible at <host>/api/v1/grafana, where <host> is the URL that you normally use to access the Geti UI. Make sure you're logged in before opening Grafana.
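For example, if you normally reach the Geti UI at https://geti.example.com (a made-up hostname for illustration), Grafana would be available at https://geti.example.com/api/v1/grafana.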
The initial screen looks like this; open the "Logs" dashboard. Now configure the filters:
- jobID is the job identifier (see the previous section for how to get it).
- suffix denotes the step within the job; for simplicity, start typing ex- and scroll through the auto-completion suggestions, pick flyte-workflow, or leave it blank.

If the configuration is correct, the logs will be immediately displayed in reverse chronological order (top = latest).
Feel free to apply additional filters if it helps to highlight the relevant ones.
To download the logs, you have two options:
Browse logs via K8s
If Grafana is not installed or not available, an alternative way to get the job logs is through K8s directly. The downside, however, is that job pods are deleted along with their logs as soon as they terminate, so you may not be able to retrieve the logs of previously failed jobs. What you can do is try to replicate the failing scenario again and watch the logs in real time.
The list of job pods can be displayed with the following command; when no job is running, the list is empty.
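A sketch of such a command, assuming you have kubectl access to the cluster; <jobs_namespace> is a placeholder for the namespace where your Geti installation runs its job pods, which may differ between setups:
kubectl get pods -n <jobs_namespace>
If you're unsure of the namespace, kubectl get pods --all-namespaces lists the pods across every namespace.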
You can then show the logs for a specific job with:
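For example, assuming the same placeholder namespace as above (the -f flag streams the logs in real time):
kubectl logs -f <pod_name> -n <jobs_namespace>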
where <pod_name> is obtained from the previous command.