prometheus: better resiliency: consider add --continue-on-error #236

@Dentrax

The Prometheus analysis process always crashes at a random query after running for a long time. There are a few ways to handle this situation better:

  • improve resiliency by applying a retry/timeout per query
  • continue to the next query when an error is returned

The $ cortextool analyse prometheus ... command exits with an error during the analysis, for example:

...
DEBU[0218] additional repository_duration_seconds_bucket 900
DEBU[0218] additional repository_duration_seconds_count 75
DEBU[0218] additional repository_duration_seconds_sum 75
cortextool: error: error querying count by (job) (request_duration_seconds_bucket): server_error: server error: 503, try --help

It reports a 503 error, but querying the API directly returns a 200 response:

$ curl <ADDR>/api/v1/query?query=count%20by%20(job)%20(consul_k8s_p_beholder_p2_1venus_worker_64_runtime_sys_bytes)

# 200 OK

As with the $ cortextool analyse grafana ... command, we could continue querying Prometheus and collect the errors in a custom field such as query_errors, as the grafana analysis already does with its parse_errors field.

$ cortextool analyse grafana --address <ADDR> --key <KEY>
unmarshal board: json: cannot unmarshal object into Go struct field Current.templating.list.current.text of type []string for MJvznCp7z Prometheus / Remote Write

cc @developer-guy @eminaktas @yasintahaerol
