prometheus: better resiliency: consider adding `--continue-on-error` #236

The Prometheus analysis process keeps crashing at a random query during long runs. A few workarounds would handle this situation more gracefully:

* improve resiliency by applying a retry and a timeout _per_ query
* continue to the next query when an error is thrown

The `$ cortextool analyse prometheus ...` command fails partway through the analysis with an error such as:
```
...
DEBU[0218] additional repository_duration_seconds_bucket 900
DEBU[0218] additional repository_duration_seconds_count 75
DEBU[0218] additional repository_duration_seconds_sum 75
cortextool: error: error querying count by (job) (request_duration_seconds_bucket): server_error: server error: 503, try --help
```

cortextool reports a `503` error, but querying the endpoint directly returns a `200` response:

```
$ curl <ADDR>/api/v1/query?query=count%20by%20(job)%20(consul_k8s_p_beholder_p2_1venus_worker_64_runtime_sys_bytes)

# 200 OK
```

Similar to the `$ cortextool analyse grafana ...` command, we could continue querying Prometheus and collect the errors in a dedicated field such as `query_errors`, as the _grafana_ analysis already does with its `parse_errors` field.

```bash
$ cortextool analyse grafana --address <ADDR> --key <KEY>
unmarshal board: json: cannot unmarshal object into Go struct field Current.templating.list.current.text of type []string for MJvznCp7z Prometheus / Remote Write
```
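
For illustration, the proposed `query_errors` field might be added to the output along these lines. The `metricCount` and `analysisOutput` types and their JSON layout are assumptions made for this sketch, not cortextool's real output format; only the `query_errors` name comes from the proposal above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricCount and analysisOutput are hypothetical types for this sketch.
type metricCount struct {
	Metric string `json:"metric"`
	Count  int    `json:"count"`
}

type analysisOutput struct {
	Metrics []metricCount `json:"metrics"`
	// QueryErrors is omitted from the JSON when no query failed.
	QueryErrors []string `json:"query_errors,omitempty"`
}

// render serializes the analysis result, including any collected query errors.
func render(out analysisOutput) (string, error) {
	b, err := json.Marshal(out)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out := analysisOutput{
		Metrics: []metricCount{{Metric: "repository_duration_seconds_count", Count: 75}},
		QueryErrors: []string{
			"error querying count by (job) (request_duration_seconds_bucket): server_error: server error: 503",
		},
	}
	s, err := render(out)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Keeping the failures in the output file would let a long run finish and still show which queries need to be retried.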

cc @developer-guy @eminaktas @yasintahaerol
