From ee9920e69a2521c95d7386b4bb0443531caf5bf8 Mon Sep 17 00:00:00 2001
From: Anna Urbiztondo
Date: Wed, 13 Nov 2024 12:40:13 +0100
Subject: [PATCH] Fixes, updates

---
 gdi/get-data-in/connect/aws/aws-prereqs.rst |   3 +-
 gdi/opentelemetry/exposed-endpoints.rst     |   2 +-
 .../splunk-collector-troubleshooting.rst    | 163 +++++++++---------
 gdi/opentelemetry/troubleshoot-logs.rst     |   2 +-
 4 files changed, 87 insertions(+), 83 deletions(-)

diff --git a/gdi/get-data-in/connect/aws/aws-prereqs.rst b/gdi/get-data-in/connect/aws/aws-prereqs.rst
index f1e2928d9..7ae0c69e8 100644
--- a/gdi/get-data-in/connect/aws/aws-prereqs.rst
+++ b/gdi/get-data-in/connect/aws/aws-prereqs.rst
@@ -424,7 +424,8 @@ Read more at the official AWS documentation:

* :new-page:`AWS Organization Service Control Policies `
* :new-page:`Permissions boundaries for IAM entities `
-* :new-page:`Troubleshooting IAM permission access denied or unauthorized errors `
+
+.. tip:: Search the AWS Knowledge Center for specific troubleshooting guidance.

.. _aws-regions:

diff --git a/gdi/opentelemetry/exposed-endpoints.rst b/gdi/opentelemetry/exposed-endpoints.rst
index cd396c8fb..19cafc084 100644
--- a/gdi/opentelemetry/exposed-endpoints.rst
+++ b/gdi/opentelemetry/exposed-endpoints.rst
@@ -35,7 +35,7 @@ See the table for a complete list of exposed ports and endpoints:
   * - ``http(s)://0.0.0.0:7276``
     - SAPM trace receiver
   * - ``http://localhost:8888/metrics``
-    - :new-page:`Internal Prometheus metrics `
+    - :new-page:`Internal Prometheus metrics `
   * - ``http(s)://localhost:8006``
     - Fluent forward receiver
   * - ``http(s)://0.0.0.0:9080``

diff --git a/gdi/opentelemetry/splunk-collector-troubleshooting.rst b/gdi/opentelemetry/splunk-collector-troubleshooting.rst
index cc1be7185..3cf5e22a1 100644
--- a/gdi/opentelemetry/splunk-collector-troubleshooting.rst
+++ b/gdi/opentelemetry/splunk-collector-troubleshooting.rst
@@ -9,14 +9,16 @@ Troubleshoot the Splunk OpenTelemetry Collector

See the following issues and workarounds for the Splunk Distribution of the OpenTelemetry Collector.

-.. note:: See also the :new-page:`OpenTelemetry Project troublehooting docs in GitHub `.
+.. note:: See also the :new-page:`OpenTelemetry Project troubleshooting docs `.

-Collector isn't behaving as expected
-=========================================
+.. caution:: Splunk only provides best-effort support for the upstream OpenTelemetry Collector.
+
+The Collector isn't behaving as expected
+=================================================

The Collector might experience the issues described in this section.

-Collector or td-agent service isn't working
+The Collector or td-agent service isn't working
--------------------------------------------------

If either the Collector or td-agent services are not installed and configured, check these things to fix the issue:

* Check that your platform is not running in a containerized environment
* Check the installation logs for more details

-Collector exits or restarts
+The Collector exits or restarts
-----------------------------------------

-The collector might exit or restart for the following reasons:
+The Collector might exit or restart for the following reasons:

* Memory pressure due to a missing or misconfigured ``memory_limiter`` processor (see the sketch at the end of this section)
* Improper sizing for the load
* Improper configuration. For example, a queue size set higher than the available memory
* Infrastructure resource limits. For example, Kubernetes

-Restart the Splunk Distribution of OpenTelemetry Collector and check the configuration.
+Restart the Splunk Distribution of the OpenTelemetry Collector and check the configuration.
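
If memory pressure from a missing ``memory_limiter`` is the suspect, the following minimal sketch shows the processor placed first in a pipeline, which is where it must run. The limit values and the component names in the pipeline are illustrative assumptions; size and name them for your own deployment.

.. code-block:: yaml

   processors:
     memory_limiter:
       check_interval: 2s
       # Illustrative caps; size them to the memory available on the host
       limit_mib: 512
       spike_limit_mib: 128

   service:
     pipelines:
       metrics:
         receivers: [hostmetrics]
         # memory_limiter must be the first processor in the pipeline
         processors: [memory_limiter, batch]
         exporters: [signalfx]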

-Collector doesn't start in Windows Docker containers
+The Collector doesn't start in Windows Docker containers
-----------------------------------------------------------

The process might fail to start in a custom-built, Windows-based Docker container, resulting in a "The service process could not connect to the service controller" error message. In this case, set the ``NO_WINDOWS_SERVICE=1`` environment variable to force the Splunk Distribution of the OpenTelemetry Collector to start as if it were running in an interactive terminal, without attempting to run as a Windows service.

-Collector is experiencing data issues
+Extract a running configuration
+=========================================
+
+Extracting a running configuration saves the contents of the active configuration to logs that you can use to troubleshoot issues. You can extract a running configuration by accessing these endpoints:
+
+* ``http://localhost:55554/debug/configz/initial``
+* ``http://localhost:55554/debug/configz/effective``
+
+For Linux, the support bundle script captures this information. See :ref:`otel-install-linux` for the installer script. This capability is primarily useful if you use remote configuration options, such as Zookeeper, where the startup configuration can change during operation.
+
+The Collector is experiencing data issues
============================================

-You can monitor internal Collector metrics tracking parameters such as data loss or CPU resources in Splunk Observability Cloud's default dashboards at :guilabel:`Dashboards > OpenTelemetry Collector > OpenTelemetry Collector`. To learn more about these metrics, see :new-page:`Monitoring ` in the OpenTelemetry GitHub repo.
+You can monitor internal Collector metrics that track parameters such as data loss or CPU resources in Splunk Observability Cloud's default dashboards at :guilabel:`Dashboards > OpenTelemetry Collector > OpenTelemetry Collector`.

-The Collector might experience the issues described in this section.
+To learn more, see:
+
+* :ref:`metrics-internal-collector`
+* :new-page:`Internal telemetry ` in the OpenTelemetry project documentation

-Collector is dropping data
+The Collector is dropping data
--------------------------------

Data might drop for a variety of reasons, most commonly the following:

-* The collector is improperly sized, resulting in the Splunk Distribution of OpenTelemetry Collector being unable to process and export the data as fast as it is received. See :ref:`otel-sizing` for sizing guidelines.
+* The Collector is improperly sized, resulting in the Splunk Distribution of the OpenTelemetry Collector being unable to process and export the data as fast as it is received. See :ref:`otel-sizing` for sizing guidelines.
* The exporter destination is unavailable or accepting the data too slowly.

To mitigate drops, configure the ``batch`` processor. You might also need to configure the queued retry options on activated exporters.
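
For reference, here is a minimal sketch that combines the ``batch`` processor with an exporter's queue and retry settings. The ``otlp`` component names, the endpoint, and the queue size are illustrative assumptions; tune them to your deployment.

.. code-block:: yaml

   receivers:
     otlp:
       protocols:
         grpc:

   processors:
     batch:

   exporters:
     otlp:
       # Hypothetical destination; use your own gateway or ingest endpoint
       endpoint: "gateway.example.com:4317"
       sending_queue:
         queue_size: 5000
       retry_on_failure:
         enabled: true

   service:
     pipelines:
       traces:
         receivers: [otlp]
         processors: [batch]
         exporters: [otlp]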

-Collector isn't receiving data
+The Collector isn't receiving data
-------------------------------------

-The collector might not receive data for the following reasons:
+The Collector might not receive data for the following reasons:

* Network configuration issues
* Receiver configuration issues
* The receiver is defined in the ``receivers`` section, but not activated in any pipelines (see the sketch at the end of this section)
* The client configuration is incorrect

-Check the logs and :new-page:`Troubleshooting zPages ` in the OpenTelemetry project GitHub repositories for more information. Note that Splunk only provides best-effort support for the upstream OpenTelemetry Collector.
-
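
For the third reason in the list, the following minimal sketch shows a receiver both defined and activated in a pipeline. The ``otlp`` receiver, the ``sapm`` exporter, the realm, and the token variable are assumptions for illustration; substitute the components and values your configuration actually uses.

.. code-block:: yaml

   receivers:
     otlp:
       protocols:
         grpc:
           endpoint: "0.0.0.0:4317"

   exporters:
     # Hypothetical trace exporter; replace with your own destination
     sapm:
       access_token: "${SPLUNK_ACCESS_TOKEN}"
       endpoint: "https://ingest.us0.signalfx.com/v2/trace"

   service:
     pipelines:
       traces:
         # A receiver only takes effect when it is listed in a pipeline
         receivers: [otlp]
         exporters: [sapm]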

-Collector can't process data
+The Collector can't process data
-----------------------------------

-The collector might not process data for the following reasons:
+The Collector might not process data for the following reasons:

* The attributes processors work only for "tags" on spans. The span name is handled by the span processor.
* Processors for trace data (except tail sampling) only work on individual spans. Make sure your Collector is configured properly.

-Collector can't export data
+The Collector can't export data
------------------------------------

-The collector might be unable to export data for the following reasons:
+The Collector might be unable to export data for the following reasons:

* Network configuration issues, such as firewall, DNS, or proxy support
* Incorrect exporter configuration
* Credential issues

@@ -92,8 +105,6 @@ Collector can't export data

If you need to use a proxy, see :ref:`configure-proxy-collector`.

-Check the logs and :new-page:`Troubleshooting zPages ` in the OpenTelemetry project GitHub repositories for more information. Note that Splunk only provides best-effort support for the upstream OpenTelemetry Collector.
-
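
As a quick check for exporter configuration and credential issues, compare your exporter block against a known-good minimal shape such as the following sketch. The ``signalfx`` exporter, the ``us0`` realm, and the token variable are assumptions for illustration.

.. code-block:: yaml

   exporters:
     signalfx:
       # Hypothetical values; use your own realm and access token
       access_token: "${SPLUNK_ACCESS_TOKEN}"
       realm: us0

   service:
     pipelines:
       metrics:
         receivers: [hostmetrics]
         exporters: [signalfx]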
-Test the Collector by sending synthetic data ------------------------------------------------------------- +You can modify any of these endpoints or ports: -You can test the Collector to make sure it can receive spans without instrumenting an application. By default, the Collector activates the Zipkin receiver, which is capable of receiving trace data over JSON. +* Receiver endpoint +* Extensions endpoint +* Metrics address (if port 8888) -To test the UI, you can submit a POST request or paste JSON in this directory, as shown in the following example. +Conflicts with port 8888 +----------------------------------- -.. code-block:: bash +If you encounter a conflict with port 8888, you will need to change to port 8889, making adjustments in these two areas: - curl -OL https://raw.githubusercontent.com/openzipkin/zipkin/master/zipkin-lens/testdata/yelp.json - curl -X POST localhost:9411/api/v2/spans -H'Content-Type: application/json' -d @yelp.json +1. Add telemetry configuration under the service section: -.. note:: +.. code-block:: yaml - Update the ``localhost`` field as appropriate to reach the Collector. -No response means the request was sent successfully. You can also pass ``-v`` to the curl command to confirm. + service: + telemetry: + metrics: + address: ":8889" + + +2. Update the port for ``receivers.prometheus/internal`` from 8888 to 8889: + +.. code-block:: yaml + + + receivers: + prometheus/internal: + config: + scrape_configs: + - job_name: 'otel-collector' + scrape_interval: 10s + static_configs: + - targets: ['0.0.0.0:8889'] + +If you see this error message on Kubernetes and you're using Helm charts, modify the configuration by updating the chart values for both configuration and exposed ports. diff --git a/gdi/opentelemetry/troubleshoot-logs.rst b/gdi/opentelemetry/troubleshoot-logs.rst index 39325e715..4f7ca8788 100644 --- a/gdi/opentelemetry/troubleshoot-logs.rst +++ b/gdi/opentelemetry/troubleshoot-logs.rst @@ -8,7 +8,7 @@ Troubleshoot Collector logs :description: Describes known issues when collecting logs with the Splunk Distribution of OpenTelemetry Collector. -.. note:: To activate the Collector's debug logging, see the :new-page:`OpenTelemetry project documentation in GitHub `. +.. note:: See also the :new-page:`OpenTelemetry Project troublehooting docs ` for more information about debugging. Here are some common issues related to log collection on the Collector.