
Commit 1e1b65d

Merge pull request #1474 from splunk/repo-sync
Pulling refs/heads/main into main
2 parents 36e8279 + 3b9e914 commit 1e1b65d

8 files changed: +216 −72 lines changed

gdi/opentelemetry/collector-kubernetes/collector-kubernetes-intro.rst

Lines changed: 2 additions & 4 deletions
@@ -22,8 +22,7 @@ Get started with the Collector for Kubernetes
2222
Default Kubernetes metrics <metrics-ootb-k8s.rst>
2323
Upgrade <kubernetes-upgrade.rst>
2424
Uninstall <kubernetes-uninstall.rst>
25-
Troubleshoot <troubleshoot-k8s.rst>
26-
Troubleshoot containers <troubleshoot-k8s-container.rst>
25+
Troubleshoot <k8s-troubleshooting/troubleshoot-k8s-landing.rst>
2726
Support <kubernetes-support.rst>
2827
Tutorial: Monitor your Kubernetes environment <k8s-infrastructure-tutorial/about-k8s-tutorial.rst>
2928
Tutorial: Configure the Collector for Kubernetes <collector-configuration-tutorial-k8s/about-collector-config-tutorial.rst>
@@ -75,8 +74,7 @@ To upgrade or uninstall, see:
7574

7675
If you have any installation or configuration issues, refer to:
7776

78-
* :ref:`otel-troubleshooting`
79-
* :ref:`troubleshoot-k8s`
77+
* :ref:`troubleshoot-k8s-landing`
8078
* :ref:`kubernetes-support`
8179

8280
.. raw:: html

gdi/opentelemetry/collector-kubernetes/k8s-infrastructure-tutorial/about-k8s-tutorial.rst

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ Tutorial: Monitor your Kubernetes environment in Splunk Observability Cloud
1616
k8s-monitor-with-navigators
1717
k8s-activate-detector
1818

19-
Deploy the Splunk Distribution of OpenTelemetry Collector in a Kubernetes cluster and start monitoring your Kubernetes platform using Splunk Observability Cloud.
19+
Deploy the Splunk Distribution of the OpenTelemetry Collector in a Kubernetes cluster and start monitoring your Kubernetes platform using Splunk Observability Cloud.
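As an illustration, the following is a minimal sketch of that deployment using the Splunk OpenTelemetry Collector Helm chart. The realm, access token, and cluster name values are placeholders that you replace for your environment.

.. code-block:: bash

   # Add the Splunk OpenTelemetry Collector chart repository
   helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart
   helm repo update

   # Install the chart; replace the placeholder realm, token, and cluster name
   helm install splunk-otel-collector \
     --set="splunkObservability.realm=us0" \
     --set="splunkObservability.accessToken=CHANGEME" \
     --set="clusterName=my-cluster" \
     splunk-otel-collector-chart/splunk-otel-collector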
2020

2121
.. raw:: html
2222

gdi/opentelemetry/collector-kubernetes/troubleshoot-k8s-container.rst renamed to gdi/opentelemetry/collector-kubernetes/k8s-troubleshooting/troubleshoot-k8s-container.rst

Lines changed: 17 additions & 39 deletions
@@ -1,41 +1,19 @@
11
.. _troubleshoot-k8s-container:
22

33
***************************************************************
4-
Troubleshoot the Collector for Kubernetes containers
4+
Troubleshoot Kubernetes and container runtime compatibility
55
***************************************************************
66

77
.. meta::
8-
:description: Describes troubleshooting specific to the Collector for Kubernetes containers.
8+
:description: Describes troubleshooting specific to Kubernetes and container runtime compatibility.
99

10-
.. note:: For general troubleshooting, see :ref:`otel-troubleshooting` and :ref:`troubleshoot-k8s`.
10+
.. note::
11+
12+
See also:
1113

12-
Verify if your container is running out of memory
13-
=======================================================================
14-
15-
Even if you didn't provide enough resources for the Collector containers, under normal circumstances the Collector doesn't run out of memory (OOM). This can only happen if the Collector is heavily throttled by the backend and exporter sending queue growing faster than collector can control memory utilization. In that case you should see ``429`` errors for metrics and traces or ``503`` errors for logs.
16-
17-
For example:
18-
19-
.. code-block::
20-
21-
2021-11-12T00:22:32.172Z info exporterhelper/queued_retry.go:325 Exporting failed. Will retry the request after interval. {"kind": "exporter", "name": "sapm", "error": "server responded with 429", "interval": "4.4850027s"}
22-
2021-11-12T00:22:38.087Z error exporterhelper/queued_retry.go:190 Dropping data because sending_queue is full. Try increasing queue_size. {"kind": "exporter", "name": "sapm", "dropped_items": 1348}
23-
24-
If you can't fix throttling by bumping limits on the backend or reducing amount of data sent through the Collector, you can avoid OOMs by reducing the sending queue of the failing exporter. For example, you can reduce ``sending_queue`` for the ``sapm`` exporter:
25-
26-
.. code-block:: yaml
27-
28-
agent:
29-
config:
30-
exporters:
31-
sapm:
32-
sending_queue:
33-
queue_size: 512
34-
35-
You can apply a similar configuration to any other failing exporter.
36-
37-
Kubernetes and container runtime compatibility
38-
=============================================================================================
14+
* :ref:`troubleshoot-k8s-general`
15+
* :ref:`troubleshoot-k8s-sizing`
16+
* :ref:`troubleshoot-k8s-missing-metrics`
3917

4018
Kubernetes requires you to install a container runtime on each node in the cluster so that pods can run there. The Splunk Distribution of the Collector for Kubernetes supports container runtimes such as containerd, CRI-O, Docker, and Mirantis Kubernetes Engine (formerly Docker Enterprise/UCP).
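To see which runtime and version each node reports, one quick check is the wide output of ``kubectl get nodes`` (a sketch; it assumes ``kubectl`` access to the cluster):

.. code-block:: bash

   # The CONTAINER-RUNTIME column lists the runtime and version for each node,
   # for example containerd://1.6.21 or cri-o://1.26.3
   kubectl get nodes -o wide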
4119

@@ -52,7 +30,7 @@ For more information about runtimes, see :new-page:`Container runtime <https://k
5230
.. _check-runtimes:
5331

5432
Troubleshoot the container runtime compatibility
55-
--------------------------------------------------------------------
33+
=============================================================================================
5634

5735
To check if you're having compatibility issues with Kubernetes and the container runtime, follow these steps:
5836

@@ -77,7 +55,7 @@ To check if you're having compatibility issues with Kubernets and the container
7755
.. _ts-k8s-stats:
7856

7957
Check the integrity of your container stats
80-
--------------------------------------------------------------------
58+
=============================================================================================
8159

8260
Use the Kubelet Summary API to verify container, pod, and node stats. The Kubelet provides the Summary API to discover and retrieve per-node summarized stats available through the ``/stats`` endpoint.
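For example, one way to pull the Summary API output for a node is through the Kubernetes API server proxy (a sketch; the node selection and the optional ``jq`` filter are only illustrative):

.. code-block:: bash

   # Pick a node and fetch its Kubelet Summary API output through the API server
   NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
   kubectl get --raw "/api/v1/nodes/${NODE_NAME}/proxy/stats/summary" | jq '.node'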
8361

@@ -88,7 +66,7 @@ All of the stats shown in these examples should be present unless otherwise note
8866
.. _verify-node-stats:
8967

9068
Verify a node's stats
91-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
69+
--------------------------------------------------------------------
9270

9371
To verify a node's stats:
9472

@@ -176,7 +154,7 @@ For reference, the following table shows the mapping for the node stat names to
176154
.. _verify-pod-stats:
177155

178156
Verify a pod's stats
179-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
157+
--------------------------------------------------------------------
180158

181159
.. note::
182160

@@ -268,7 +246,7 @@ For reference, the following table shows the mapping for the pod stat names to t
268246
.. _verify-container-stats:
269247

270248
Verify a container's stats
271-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
249+
--------------------------------------------------------------------
272250

273251
.. note:: Carry out steps 1 and 2 in both :ref:`verify-node-stats` and :ref:`verify-pod-stats` before completing this section.
274252

@@ -340,14 +318,14 @@ For reference, the following table shows the mappings for the container stat nam
340318
- ``container.memory.major_page_faults``
341319

342320
Reported incompatible Kubernetes and container runtime issues
343-
--------------------------------------------------------------------
321+
=============================================================================================
344322

345323
.. note:: Managed Kubernetes services might use a modified container runtime, and the service provider might have applied custom patches or bug fixes that are not present within an unmodified container runtime.
346324

347325
This section describes known incompatibilities and container runtime issues.
348326

349327
containerd with Kubernetes 1.21.0 to 1.21.11
350-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
328+
--------------------------------------------------------------------
351329

352330
When using Kubernetes 1.21.0 to 1.21.11 with containerd, memory and network stats or metrics might be missing. The following is a list of affected metrics:
353331

@@ -367,7 +345,7 @@ Try one of the following workarounds to resolve the issue:
367345
- Upgrade containerd to version 1.4.x or 1.5.x.
368346

369347
containerd 1.4.0 to 1.4.12 with Kubernetes 1.22.0 to 1.22.8
370-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
348+
--------------------------------------------------------------------
371349

372350
When using Kubernetes 1.22.0 to 1.22.8 with containerd 1.4.0 to 1.4.12, memory and network stats or metrics can be missing. The following is a list of affected metrics:
373351

@@ -388,7 +366,7 @@ Try one of the following workarounds to resolve the issue:
388366
- Upgrade containerd to at least version 1.4.13 or 1.5.0 to fix the missing pod memory metrics.
389367

390368
containerd with Kubernetes 1.23.0 to 1.23.6
391-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
369+
--------------------------------------------------------------------
392370

393371
When using Kubernetes versions 1.23.0 to 1.23.6 with containerd, memory stats or metrics can be missing. The following is a list of affected metrics:
394372

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
.. _troubleshoot-k8s-landing:
2+
3+
*****************************************************************************************
4+
Troubleshoot the Collector for Kubernetes
5+
*****************************************************************************************
6+
7+
.. meta::
8+
:description: Learn how to troubleshoot the Splunk Distribution of the OpenTelemetry Collector for Kubernetes.
9+
10+
.. toctree::
11+
:hidden:
12+
:maxdepth: 3
13+
14+
Debugging and logs <troubleshoot-k8s>
15+
Sizing <troubleshoot-k8s-sizing>
16+
Missing metrics <troubleshoot-k8s-missing-metrics>
17+
Container runtime compatibility <troubleshoot-k8s-container>
18+
19+
20+
To troubleshoot the Splunk Distribution of the OpenTelemetry Collector for Kubernetes, see:
21+
22+
* :ref:`troubleshoot-k8s`
23+
* :ref:`troubleshoot-k8s-sizing`
24+
* :ref:`troubleshoot-k8s-missing-metrics`
25+
* :ref:`troubleshoot-k8s-container`
26+
27+
Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
1+
.. _troubleshoot-k8s-missing-metrics:
2+
3+
***************************************************************
4+
Troubleshoot missing metrics
5+
***************************************************************
6+
7+
.. meta::
8+
:description: Describes troubleshooting specific to missing metrics in the Collector for Kubernetes.
9+
10+
.. note::
11+
12+
See also:
13+
14+
* :ref:`troubleshoot-k8s-general`
15+
* :ref:`troubleshoot-k8s-sizing`
16+
* :ref:`troubleshoot-k8s-container`
17+
18+
The Splunk Collector for Kubernetes is missing metrics starting with ``k8s.pod.*`` and ``k8s.node.*``
19+
========================================================================================================
20+
21+
After deploying the Splunk Distribution of the OpenTelemetry Collector for Kubernetes Helm chart version 0.87.0 or higher, as either a new installation or an upgrade, the following pod and node metrics are not collected:
22+
23+
* ``k8s.(pod/node).cpu.time``
24+
* ``k8s.(pod/node).cpu.utilization``
25+
* ``k8s.(pod/node).filesystem.available``
26+
* ``k8s.(pod/node).filesystem.capacity``
27+
* ``k8s.(pod/node).filesystem.usage``
28+
* ``k8s.(pod/node).memory.available``
29+
* ``k8s.(pod/node).memory.major_page_faults``
30+
* ``k8s.(pod/node).memory.page_faults``
31+
* ``k8s.(pod/node).memory.rss``
32+
* ``k8s.(pod/node).memory.usage``
33+
* ``k8s.(pod/node).memory.working_set``
34+
* ``k8s.(pod/node).network.errors``
35+
* ``k8s.(pod/node).network.io``
36+
37+
Confirm the metrics are missing
38+
--------------------------------------------------------------------
39+
40+
To confirm that these metrics are missing, perform the following steps:
41+
42+
1. Confirm that the metrics are missing with the following Splunk Search Processing Language (SPL) command:
43+
44+
.. code-block::
45+
46+
| mstats count(_value) as "Val" where index="otel_metrics_0_93_3" AND metric_name IN (k8s.pod.*, k8s.node.*) by metric_name
47+
48+
2. Check the Collector's pod logs from the CLI of the Kubernetes node with this command:
49+
50+
.. code-block::
51+
52+
kubectl -n {namespace} logs {collector-agent-pod-name}
53+
54+
Note: Update ``namespace`` and ``collector-agent-pod-name`` based on your environment.
55+
56+
3. You will see a ``tls: failed to verify certificate`` error similar to the following in the agent pod logs:
57+
58+
.. code-block::
59+
60+
2024-02-28T01:11:24.614Z error scraperhelper/scrapercontroller.go:200 Error scraping metrics {"kind": "receiver", "name": "kubeletstats", "data_type": "metrics", "error": "Get \"https://10.202.38.255:10250/stats/summary\": tls: failed to verify certificate: x509: cannot validate certificate for 10.202.38.255 because it doesn't contain any IP SANs", "scraper": "kubeletstats"}
61+
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
62+
go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:200
63+
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
64+
go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:176
65+
66+
Resolution
67+
--------------------------------------------------------------------
68+
69+
The :ref:`kubelet-stats-receiver` collects ``k8s.pod.*`` and ``k8s.node.*`` metrics from the Kubernetes ``/stats/summary`` endpoint. As of version 0.87.0 of the Splunk OpenTelemetry Collector, the kubelet certificate is verified during this process to confirm that it's valid. If you are using a self-signed or invalid certificate, the Kubelet Stats receiver can't collect the metrics.
70+
71+
You have two alternatives to resolve this error:
72+
73+
1. Add a valid certificate to your Kubernetes cluster. To learn how, see :ref:`otel-kubernetes-config`. After updating the ``values.yaml`` file, use the Helm upgrade command to upgrade your Collector deployment (a sketch of this command appears at the end of this section).
74+
75+
2. Disable certificate verification in the OTel agent's Kubelet Stats receiver by setting ``insecure_skip_verify: true`` for the receiver in the ``agent.config`` section of the ``values.yaml`` file.
76+
77+
For example, use the following configuration to disable certificate verification:
78+
79+
.. code-block::
80+
81+
agent:
82+
config:
83+
receivers:
84+
kubeletstats:
85+
insecure_skip_verify: true
86+
87+
.. caution:: Keep in mind your security requirements before disabling certificate verification.
88+
89+
90+
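In either case, after editing ``values.yaml``, the following is a sketch of the Helm upgrade command, assuming a release named ``splunk-otel-collector`` installed from the Splunk chart repository:

.. code-block:: bash

   # Apply the updated values.yaml to the existing release; the release name
   # and namespace are placeholders for your environment
   helm upgrade splunk-otel-collector splunk-otel-collector-chart/splunk-otel-collector \
     --namespace default \
     --values values.yaml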
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
1+
.. _troubleshoot-k8s-sizing:
2+
3+
***************************************************************
4+
Troubleshoot sizing for the Collector for Kubernetes
5+
***************************************************************
6+
7+
.. meta::
8+
:description: Describes troubleshooting specific to sizing the Collector for Kubernetes containers.
9+
10+
.. note::
11+
12+
See also:
13+
14+
* :ref:`troubleshoot-k8s-general`
15+
* :ref:`troubleshoot-k8s-missing-metrics`
16+
* :ref:`troubleshoot-k8s-container`
17+
18+
Size your Collector instance
19+
=============================================================================================
20+
21+
Set the resources allocated to your Collector instance based on the amount of data you expect to handle. For more information, see :ref:`otel-sizing`.
22+
23+
Use the following configuration to increase the resource limits for the agent:
24+
25+
.. code-block:: yaml
26+
27+
agent:
28+
resources:
29+
limits:
30+
cpu: 500m
31+
memory: 1Gi
32+
33+
Set the resources allocated to your cluster receiver deployment based on the cluster size. For example, for a cluster with 100 nodes, allocate these resources:
34+
35+
.. code-block:: yaml
36+
37+
clusterReceiver:
38+
resources:
39+
limits:
40+
cpu: 1
41+
memory: 2Gi
42+
43+
44+
Verify if your container is running out of memory
45+
=======================================================================
46+
47+
Even if you didn't provide enough resources for the Collector containers, under normal circumstances the Collector doesn't run out of memory (OOM). OOM can happen only if the Collector is heavily throttled by the backend and the exporter's sending queue grows faster than the Collector can control its memory utilization. In that case, you see ``429`` errors for metrics and traces or ``503`` errors for logs.
48+
49+
For example:
50+
51+
.. code-block::
52+
53+
2021-11-12T00:22:32.172Z info exporterhelper/queued_retry.go:325 Exporting failed. Will retry the request after interval. {"kind": "exporter", "name": "sapm", "error": "server responded with 429", "interval": "4.4850027s"}
54+
2021-11-12T00:22:38.087Z error exporterhelper/queued_retry.go:190 Dropping data because sending_queue is full. Try increasing queue_size. {"kind": "exporter", "name": "sapm", "dropped_items": 1348}
55+
56+
If you can't fix the throttling by raising limits on the backend or reducing the amount of data sent through the Collector, you can avoid OOMs by reducing the sending queue size of the failing exporter. For example, you can reduce ``sending_queue`` for the ``sapm`` exporter:
57+
58+
.. code-block:: yaml
59+
60+
agent:
61+
config:
62+
exporters:
63+
sapm:
64+
sending_queue:
65+
queue_size: 512
66+
67+
You can apply a similar configuration to any other failing exporter.
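As a sketch, the same reduction applied to another exporter; the ``signalfx`` exporter here is only an example of a different exporter that supports ``sending_queue``:

.. code-block:: yaml

   agent:
     config:
       exporters:
         signalfx:
           sending_queue:
             queue_size: 512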
68+
