
Commit 87efcf8

k6-operator: polish troubleshooting webpage (#1978)
* feat(k6-operator): polish troubleshooting webpage
* Apply suggestions from code review

  Co-authored-by: Heitor Tashiro Sergent <[email protected]>

* feat(k6-operator): mirror troubleshooting changes to past k6 versions

---------

Co-authored-by: Heitor Tashiro Sergent <[email protected]>
1 parent 88c4e57 commit 87efcf8

File tree: 9 files changed, +326 −144 lines changed

docs/sources/k6/next/set-up/set-up-distributed-k6/troubleshooting.md

Lines changed: 50 additions & 7 deletions
````diff
@@ -15,7 +15,7 @@ If you’re using Private Load Zones in Grafana Cloud k6, refer to [Troubleshoot

 {{< docs/shared source="k6" lookup="k6-operator/troubleshooting-how-to.md" version="<K6_VERSION>" >}}

-## Common scenarios
+## Common errors

 ### Issues with environment variables
````

````diff
@@ -39,20 +39,63 @@ time="2024-01-11T11:11:27Z" level=error msg="invalid argument \"product_id=\\\"T

 This is a common problem with escaping the characters. You can find an [issue](https://github.com/grafana/k6-operator/issues/211) in the k6 Operator repository that can be upvoted.

-### Initializer logs an error but it's not about tags
+### An error on reading output of the initializer Pod

-This can happen because of lack of attention to the [preparation](#preparation) step. One command that you can use to help diagnose issues with your script is the following:
+The k6 runners fail to start, and in the k6 Operator logs, you see the `unable to marshal` error. This can happen for several reasons:
+
+1. Your Kubernetes setup includes some tool that implicitly adds symbols to the log output of Pods. You can verify this case by checking the logs of the initializer Pod: they should contain valid JSON, generated by k6. Currently, to fix this, the tool adding symbols must be switched off for the k6 Operator workloads.
+
+2. A multi-file script includes many files, all of which must be fully accessible from the runner Pod. You can verify this case by checking the logs of the initializer Pod: there will be an error about some file not being found. To fix this, refer to [Multi-file tests](https://grafana.com/docs/k6/latest/set-up/set-up-distributed-k6/usage/executing-k6-scripts-with-testrun-crd/#multi-file-tests) on how to configure multi-file tests in `TestRun`.
+
+3. There are problems with environment variables or with importing an extension. Following the steps in [testing locally](#test-your-script-locally) can help debug this issue. One additional command that you can use to help diagnose issues with your script is the following:

 ```bash
 k6 inspect --execution-requirements script.js
 ```

 That command is a shortened version of what the initializer Pod is executing. If the command produces an error, there's a problem with the script itself, and it should be solved outside of the k6 Operator. The error itself may contain a hint to what's wrong, such as a syntax error.

-If the standalone `k6 inspect --execution-requirements` executes successfully, then it's likely a problem with `TestRun` deployment specific to your Kubernetes setup. A couple of recommendations here are:
+If the standalone `k6 inspect --execution-requirements` executes successfully, then it's likely a problem with the `TestRun` deployment specific to your Kubernetes setup.
+
+### An issue with `volumeClaim`
+
+Storing k6 scripts on a persistent volume is one approach to [multi-file tests](https://grafana.com/docs/k6/latest/set-up/set-up-distributed-k6/usage/executing-k6-scripts-with-testrun-crd/#multi-file-tests). However, errors can occur due to misconfiguration of the volume. These errors are not within the purview of the k6 Operator; they are inherent to the Kubernetes setup itself, as the k6 Operator only mounts volumes to the Pods. Still, here are some general recommendations to help debug such errors.
+
+The `volumeClaim` option expects a persistent volume claim, so first, check the [Kubernetes documentation](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) and your infrastructure provider's documentation to confirm that the volume is set up correctly and can be mounted by Kubernetes Pods.
+
+Then, if the volume appears to be correct and is mounted to the k6 Pods without an issue, yet the `TestRun` fails with an error like the following:
+
+```bash
+The moduleSpecifier "/test/utils.js" couldn't be found on local disk.
+```
+
+This error implies that either the file was not written successfully to the volume, or there is a misconfiguration with a path. So it makes sense to create a separate debug Pod, for example, with the [`busybox` image](https://hub.docker.com/_/busybox), to confirm that the volume contains the script and all its dependencies. Such a Pod should have a configuration similar to this one:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: busybox
+spec:
+  volumes:
+    - name: test-volume
+      persistentVolumeClaim:
+        claimName: test-pvc
+        readOnly: false
+  containers:
+    - image: busybox
+      name: busybox
+      imagePullPolicy: IfNotPresent
+      command:
+        - sleep
+        - "3600"
+      volumeMounts:
+        - mountPath: /test
+          name: test-volume
+  restartPolicy: Always
+```

-- Review the output of the initializer Pod: is it logged by the k6 process or by something else?
-  - k6 Operator expects the initializer logs to contain only the output of `k6 inspect`. If there are any other log lines present, then the k6 Operator will fail to parse it and the test won't start. Refer to this [issue](https://github.com/grafana/k6-operator/issues/193) for more details.
-- Check events in the initializer Job and Pod as they may contain another hint about what's wrong.
+Then execute `ls /test` on this debug Pod to see which files are present.

 {{< docs/shared source="k6" lookup="k6-operator/troubleshooting-common-scenarios.md" version="<K6_VERSION>" >}}
````

docs/sources/k6/next/shared/k6-operator/troubleshooting-common-scenarios.md

Lines changed: 24 additions & 25 deletions
````diff
@@ -2,6 +2,21 @@
 title: Shared scenarios for troubleshooting k6 Operator
 ---

+### k6 runners do not start
+
+The k6 runners fail to start, and in the k6 Operator logs, you see the error `Waiting for initializing pod to finish`.
+
+In this case, it's most likely that an initializer Pod was not able to start for some reason.
+
+#### How to fix
+
+Refer to [The Jobs and Pods](#the-jobs-and-pods) section to see how to:
+
+1. Check if the initializer Pod has started and finished.
+1. See an issue in the initializer Job's description that prevents a Pod from being scheduled.
+
+Once the error preventing the initializer Pod from starting and completing is resolved, redeploy the `TestRun` or, in the case of a `PrivateLoadZone` test, restart the k6 process.
+
 ### Non-existent ServiceAccount

 A ServiceAccount can be defined as `serviceAccountName` in a PrivateLoadZone, and as `runner.serviceAccountName` in a TestRun CRD. If the specified ServiceAccount doesn't exist, k6 Operator will successfully create Jobs but corresponding Pods will fail to be deployed, and the k6 Operator will wait indefinitely for Pods to be `Ready`. This error can be best seen in the events of the Job:
@@ -17,7 +32,7 @@ k6 Operator doesn't try to analyze such scenarios on its own, but you can refer

 #### How to fix

-To fix this issue, the incorrect `serviceAccountName` must be corrected, and the TestRun or PrivateLoadZone resource must be re-deployed.
+To fix this issue, the incorrect `serviceAccountName` must be corrected, and the `TestRun` or `PrivateLoadZone` resource must be re-deployed.

 ### Non-existent `nodeSelector`

@@ -34,7 +49,7 @@ Events:

 #### How to fix

-To fix this issue, the incorrect `nodeSelector` must be corrected and the TestRun or PrivateLoadZone resource must be re-deployed.
+To fix this issue, the incorrect `nodeSelector` must be corrected, and the `TestRun` or `PrivateLoadZone` resource must be re-deployed.

 ### Insufficient resources

@@ -47,35 +62,19 @@ This case is somewhat similar to the previous two: the k6 Operator will wait ind
 If there's at least one runner Pod that OOM-ed, the whole test will be [stuck](https://github.com/grafana/k6-operator/issues/251) and will have to be deleted manually:

 ```bash
-kubectl -f my-test.yaml delete
-# or
 kubectl delete testrun my-test
 ```

-In case of OOM, it makes sense to review the k6 script to understand what kind of resource usage this script requires. It may be that the k6 script can be improved to be more performant. Then, set the `spec.runner.resources` in the TestRun CRD, or `spec.resources` in the PrivateLoadZone CRD accordingly.
+A `PrivateLoadZone` test or a `TestRun` [with cloud output](https://grafana.com/docs/k6/latest/set-up/set-up-distributed-k6/usage/k6-operator-to-gck6/#cloud-output) will be aborted by Grafana Cloud k6 after its expected duration is up.

-### PrivateLoadZone: subscription error
+#### How to fix

-If there's an issue with your Grafana Cloud k6 subscription, there will be a 400 error in the logs with the message detailing the problem. For example:
+In the case of OOM, review your k6 script to understand what kind of resource usage the script requires. It may be that the k6 script can be improved to be more performant. Then, set the `spec.runner.resources` in the `TestRun` CRD, or `spec.resources` in the `PrivateLoadZone` CRD accordingly.

-```bash
-"Received error `(400) You have reached the maximum Number of private load zones your organization is allowed to have. Please contact support if you want to create more.`. Message from server ``"
-```
+### Disruption of the k6 runners

-To fix this issue, check your organization settings in Grafana Cloud k6 or contact Support.
+A k6 test can be executed for a long time. But depending on the Kubernetes setup, it's possible that the Pods running k6 are disrupted and moved elsewhere during execution. This will skew the test results. In the case of a `PrivateLoadZone` test or a `TestRun` [with cloud output](https://grafana.com/docs/k6/latest/set-up/set-up-distributed-k6/usage/k6-operator-to-gck6/#cloud-output), the test run may additionally be aborted by Grafana Cloud k6 once its expected duration is up, regardless of the exact state of the k6 processes.

-### PrivateLoadZone: Wrong token
+#### How to fix

-There can be two major problems with the authentication token:
-
-1. If the token wasn't created, or was created in a wrong location, the logs will show the following error:
-
-```bash
-Failed to load k6 Cloud token {"namespace": "plz-ns", "name": "my-plz", "reconcileID": "67c8bc73-f45b-4c7f-a9ad-4fd0ffb4d5f6", "name": "token-with-wrong-name", "secretNamespace": "plz-ns", "error": "Secret \"token-with-wrong-name\" not found"}
-```
-
-2. If the token contains a corrupted value, or it's not an organizational token, the logs will show the following error:
-
-```bash
-"Received error `(403) Authentication token incorrect or expired`. Message from server ``"
-```
+Ensure that k6 Pods can't be disrupted by the Kubernetes setup, for example, with a [PodDisruptionBudget](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/) and a less aggressive configuration of the autoscaler.
````
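To make the disruption advice above concrete: assuming your k6 runner Pods carry an `app: k6` label (verify with `kubectl get pods --show-labels`, since labels depend on your setup), a sketch of a PodDisruptionBudget that blocks voluntary evictions might look like:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: k6-runners-pdb
spec:
  # Forbid voluntary disruptions (node drains, autoscaler evictions) entirely.
  maxUnavailable: 0
  selector:
    matchLabels:
      app: k6 # adjust to the labels your k6 Pods actually carry
```

Note that a PodDisruptionBudget only guards against voluntary disruptions; involuntary ones, like node failures or OOM kills, still require the resource-sizing steps described above.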

docs/sources/k6/next/shared/k6-operator/troubleshooting-how-to.md

Lines changed: 34 additions & 16 deletions
````diff
@@ -20,7 +20,7 @@ That ensures that the script has correct syntax and can be parsed with k6 in the

 ### `TestRun` deployment

-#### The pods
+#### The Jobs and Pods

 In case of one `TestRun` Custom Resource (CR) creation with `parallelism: n`, there are certain repeating patterns:

@@ -38,21 +38,21 @@ In case of one `TestRun` Custom Resource (CR) creation with `parallelism: n`, th
 kubectl logs mytest-initializer-xxxxx
 ```

-If the Pods seem to be working but not producing an expected result and there's not enough information in the logs, you can use the k6 [verbose option](https://grafana.com/docs/k6/<K6_VERSION>/using-k6/k6-options/#options) in the `TestRun` spec:
+#### `TestRun` with `cleanup` option

-```yaml
-apiVersion: k6.io/v1alpha1
-kind: TestRun
-metadata:
-  name: k6-sample
-spec:
-  parallelism: 2
-  script:
-    configMap:
-      name: 'test'
-      file: 'test.js'
-  arguments: --verbose
-```
+If a `TestRun` has the [`spec.cleanup` option](https://grafana.com/docs/k6/latest/set-up/set-up-distributed-k6/usage/executing-k6-scripts-with-testrun-crd/#clean-up-resources) set, as [`PrivateLoadZone`](https://grafana.com/docs/grafana-cloud/testing/k6/author-run/private-load-zone/) tests always do, for example, it may be harder to locate and analyze the Pod before it's deleted.
+
+In that case, we recommend using observability solutions, like Prometheus and Loki, to store metrics and logs for later analysis.
+
+As an alternative, it's also possible to watch the resources manually with the following commands:
+
+```bash
+kubectl get jobs -n my-namespace -w
+kubectl get pods -n my-namespace -w
+
+# To get detailed information (this one is quite verbose, so use with caution):
+kubectl get pods -n my-namespace -w -o yaml
+```

 #### k6 Operator

@@ -64,7 +64,7 @@ kubectl -n k6-operator-system -c manager logs k6-operator-controller-manager-xxx

 #### Inspect `TestRun` resource

-After you deploy a `TestRun` CR, you can inspect it the same way as any other resource:
+After you or a `PrivateLoadZone` deploys a `TestRun` CR, you can inspect it the same way as any other resource:

 ```bash
 kubectl describe testrun my-testrun
@@ -101,3 +101,21 @@ Status:
 If `Stage` is equal to `error`, you can check the logs of k6 Operator.

 Conditions can be used as a source of info as well, but it's a more advanced troubleshooting option that should be used if the previous steps weren't enough to diagnose the issue. Note that conditions that start with the `Cloud` prefix only matter in the setting of k6 Cloud test runs, for example, for cloud output and PLZ test runs.
+
+#### Debugging the k6 process
+
+If the script works locally as expected, and the previous steps show no errors either, yet you don't see the expected result of a test and suspect the k6 process is at fault, you can use the k6 [verbose option](https://grafana.com/docs/k6/<K6_VERSION>/using-k6/k6-options/#options) in the `TestRun` spec:
+
+```yaml
+apiVersion: k6.io/v1alpha1
+kind: TestRun
+metadata:
+  name: k6-sample
+spec:
+  parallelism: 2
+  script:
+    configMap:
+      name: 'test'
+      file: 'test.js'
+  arguments: --verbose
+```
````

docs/sources/k6/v1.0.x/set-up/set-up-distributed-k6/troubleshooting.md

Lines changed: 52 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -7,11 +7,13 @@ title: Troubleshooting
77

88
This topic includes instructions to help you troubleshoot common issues with the k6 Operator.
99

10+
If you’re using Private Load Zones in Grafana Cloud k6, refer to [Troubleshoot Private Load Zones](https://grafana.com/docs/grafana-cloud/testing/k6/author-run/private-load-zone/troubleshoot/).
11+
1012
## How to troubleshoot
1113

1214
{{< docs/shared source="k6" lookup="k6-operator/troubleshooting-how-to.md" version="<K6_VERSION>" >}}
1315

14-
## Common scenarios
16+
## Common errors
1517

1618
### Issues with environment variables
1719

@@ -35,20 +37,63 @@ time="2024-01-11T11:11:27Z" level=error msg="invalid argument \"product_id=\\\"T
3537
3638
This is a common problem with escaping the characters. You can find an [issue](https://github.com/grafana/k6-operator/issues/211) in the k6 Operator repository that can be upvoted.
3739
38-
### Initializer logs an error but it's not about tags
40+
### An error on reading output of the initializer Pod
41+
42+
The k6 runners fail to start, and in the k6 Operator logs, you see the `unable to marshal` error. This can happen for several reasons:
43+
44+
1. Your Kubernetes setup includes some tool that is implicitly adding symbols to the log output of Pods. You can verify this case by checking the logs of the initializer Pod: they should contain valid JSON, generated by k6. Currently, to fix this, the tool adding symbols must be switched off for the k6 Operator workloads.
45+
46+
2. Multi-file script includes many files which all must be fully accessible from the runner Pod. You can verify this case by checking the logs of the initializer Pod: there will be an error about some file not being found. To fix this, refer to [Multi-file tests](https://grafana.com/docs/k6/latest/set-up/set-up-distributed-k6/usage/executing-k6-scripts-with-testrun-crd/#multi-file-tests) on how to configure multi-file tests in `TestRun`.
3947

40-
This can happen because of lack of attention to the [preparation](#preparation) step. One command that you can use to help diagnose issues with your script is the following:
48+
3. There are problems with environment variables or with importing an extension. Following the steps found in [testing locally](#test-your-script-locally) can help debug this issue. One additional command that you can use to help diagnose issues with your script is the following:
4149

4250
```bash
4351
k6 inspect --execution-requirements script.js
4452
```
4553

4654
That command is a shortened version of what the initializer Pod is executing. If the command produces an error, there's a problem with the script itself and it should be solved outside of the k6 Operator. The error itself may contain a hint to what's wrong, such as a syntax error.
4755

48-
If the standalone `k6 inspect --execution-requirements` executes successfully, then it's likely a problem with `TestRun` deployment specific to your Kubernetes setup. A couple of recommendations here are:
56+
If the standalone `k6 inspect --execution-requirements` executes successfully, then it's likely a problem with `TestRun` deployment specific to your Kubernetes setup.
57+
58+
### An issue with `volumeClaim`
59+
60+
Storing k6 scripts on a persistent volume is one approach to [multi-file tests](https://grafana.com/docs/k6/latest/set-up/set-up-distributed-k6/usage/executing-k6-scripts-with-testrun-crd/#multi-file-tests). However, errors can occur due to misconfiguration of the volume. These errors are not within the purview of the k6 Operator; they are inherent to the Kubernetes setup itself, as the k6 Operator only mounts volumes to the Pods. However, here are some general recommendations to help debug such errors.
61+
62+
The `volumeClaim` option is expecting a persistent volume claim, so first, check the [Kubernetes documentation](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) and your infrastructure provider documentation to confirm if the volume is indeed set up correctly and can be mounted by Kubernetes pods.
63+
64+
Then, if the volume appears to be correct and is mounted to the k6 Pods without an issue, yet the `TestRun` fails with an error like the following:
65+
66+
```bash
67+
The moduleSpecifier \"/test/utils.js\" couldn't be found on local disk.
68+
```
69+
70+
This error implies that either the file was not written successfully to the Volume or there is a misconfiguration with a path. So it makes sense to create a separate debug Pod, for example, with the [`busybox` image](https://hub.docker.com/_/busybox) to confirm that the Volume contains the script and all its dependencies. Such a Pod should have a configuration similar to this one:
71+
72+
```yaml
73+
apiVersion: v1
74+
kind: Pod
75+
metadata:
76+
name: busybox
77+
spec:
78+
volumes:
79+
- name: test-volume
80+
volumeSource:
81+
persistentVolumeClaim:
82+
claimName: test-pvc
83+
readOnly: false
84+
containers:
85+
- image: busybox
86+
name: busybox
87+
imagePullPolicy: IfNotPresent
88+
command:
89+
- sleep
90+
- "3600"
91+
volumeMounts:
92+
- mountPath: /test
93+
name: test-volume
94+
restartPolicy: Always
95+
```
4996

50-
- Review the output of the initializer Pod: is it logged by the k6 process or by something else?
51-
- k6 Operator expects the initializer logs to contain only the output of `k6 inspect`. If there are any other log lines present, then the k6 Operator will fail to parse it and the test won't start. Refer to this [issue](https://github.com/grafana/k6-operator/issues/193) for more details.
52-
- Check events in the initializer Job and Pod as they may contain another hint about what's wrong.
97+
Then execute `ls /test` on this debug Pod to see which files are present.
5398

5499
{{< docs/shared source="k6" lookup="k6-operator/troubleshooting-common-scenarios.md" version="<K6_VERSION>" >}}
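Tying the debug-Pod recipe from the `volumeClaim` section together: once the busybox Pod is running, the check itself is just `kubectl exec busybox -- ls /test`. The following is a local stand-in for that verification step, with hypothetical file names:

```shell
# Simulated check: does the mounted volume contain the script and all its imports?
# In-cluster equivalent: kubectl exec busybox -- ls /test
mount_path=$(mktemp -d)                              # stands in for the /test mount
touch "$mount_path/script.js" "$mount_path/utils.js" # hypothetical test files

ls "$mount_path"
# If utils.js were missing here, k6 would raise the moduleSpecifier error shown earlier.
test -f "$mount_path/utils.js" && echo "utils.js present"
```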
