Conversation

@dheeraj-vanamala

Description

This PR fixes issue #4517, where the OTLP gRPC exporter fails to reconnect to the collector after it restarts and keeps returning StatusCode.UNAVAILABLE.

Changes:

  • Detected StatusCode.UNAVAILABLE in the export loop.
  • Added logic to close the existing channel and re-initialize it before retrying (see the sketch after this list).
  • Added a regression test test_unavailable_reconnects to verify the reconnection behavior.
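
A minimal sketch of the shape of this change (illustrative only; names like _export, _translate_data and _reinitialize_channel are assumptions for this method-level sketch, not the exact code in the PR):

from grpc import RpcError, StatusCode

def _export(self, data):
    try:
        return self._client.Export(self._translate_data(data), timeout=self._timeout)
    except RpcError as error:
        if error.code() == StatusCode.UNAVAILABLE:
            # The receiver went away (for example a collector restart); rebuild
            # the channel so the next retry uses a fresh connection.
            self._reinitialize_channel()
        raise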

Fixes #4517

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

I added a new regression test case test_unavailable_reconnects in exporter/opentelemetry-exporter-otlp-proto-grpc/tests/test_otlp_exporter_mixin.py.

  • test_unavailable_reconnects: Verifies that the exporter closes and re-initializes the gRPC channel when the server returns StatusCode.UNAVAILABLE (a simplified sketch follows).
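
A simplified sketch of what such a test can look like (make_exporter and the _reinitialize_channel hook are hypothetical stand-ins for the real fixtures in test_otlp_exporter_mixin.py):

import grpc
from unittest import mock

class _FakeUnavailable(grpc.RpcError):
    # Minimal stand-in for the error a restarting server produces.
    def code(self):
        return grpc.StatusCode.UNAVAILABLE

def test_unavailable_reconnects():
    exporter = make_exporter()  # hypothetical fixture building the exporter under test
    # First export attempt fails with UNAVAILABLE, the retry succeeds.
    exporter._client.Export = mock.Mock(side_effect=[_FakeUnavailable(), mock.Mock()])
    with mock.patch.object(exporter, "_reinitialize_channel") as reinit:
        exporter.export(["span"])
    # The channel should have been rebuilt exactly once.
    reinit.assert_called_once()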

Does This PR Require a Contrib Repo Change?

  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@dheeraj-vanamala dheeraj-vanamala requested a review from a team as a code owner November 30, 2025 15:26
@linux-foundation-easycla

linux-foundation-easycla bot commented Nov 30, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@dheeraj-vanamala dheeraj-vanamala force-pushed the issue-4517/grpc-reconnection branch from c670f77 to b7620d0 on November 30, 2025 16:00
@dheeraj-vanamala dheeraj-vanamala force-pushed the issue-4517/grpc-reconnection branch from b7620d0 to 436ecc9 on November 30, 2025 16:13
@dheeraj-vanamala
Author

dheeraj-vanamala commented Nov 30, 2025

I understand this issue is related to the upstream gRPC bug (grpc/grpc#38290).

I've analyzed that issue in depth, and the root cause appears to be a regression in the gRPC 'backup poller' (introduced in grpcio>=1.68.0) which fails to recover connections when the primary EventEngine is disabled (common in Python for fork safety).

While upstream fixes are being explored (e.g., grpc/grpc#38480), the issue has persisted for months, leaving exporters stuck in an UNAVAILABLE state indefinitely after collector restarts.

This PR implements a robust mitigation: detecting the persistent UNAVAILABLE state and forcing a channel re-initialization. This effectively resets the underlying poller state, allowing the exporter to recover immediately without requiring a full application restart. This approach provides stability for users while the complex upstream fix is finalized.
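
For context, forcing a channel re-initialization amounts to something like the following (a hedged sketch; the real code also carries over credentials, compression, and channel options, and insecure_channel is used here only for brevity):

import grpc

def _reinitialize_channel(self):
    # Closing the old channel tears down the stuck poller state; creating a new
    # one lets gRPC establish a fresh connection to the receiver.
    old_channel = self._channel
    self._channel = grpc.insecure_channel(self._endpoint)
    self._client = self._stub(self._channel)
    old_channel.close()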

"""OTLP Exporter
This module provides a mixin class for OTLP exporters that send telemetry data
to an OpenTelemetry Collector via gRPC. It includes a configurable reconnection
Contributor

The collector is only one of many places the OTLP exporters can send data to; any server that implements the OTLP RPC interface can receive it.

Author

Okay. I've generalized the docstring to say 'OTLP-compatible receiver' instead of 'OpenTelemetry Collector'.

OTEL_EXPORTER_OTLP_RETRY_INTERVAL: Base retry interval in seconds (default: 2.0).
OTEL_EXPORTER_OTLP_MAX_RETRIES: Maximum number of retry attempts (default: 20).
OTEL_EXPORTER_OTLP_RETRY_TIMEOUT: Total retry timeout in seconds (default: 300).
OTEL_EXPORTER_OTLP_RETRY_MAX_DELAY: Maximum delay between retries in seconds (default: 60.0).
Contributor

Are these collector env vars? I don't see OTEL_EXPORTER_OTLP_RETRY_INTERVAL defined in this repo.

Author

Thanks for catching that. I think I pulled those from the general spec or another SDK by mistake. I've removed the section entirely since they aren't used here.

)

# For UNAVAILABLE errors, reinitialize the channel to force reconnection
if error.code() == StatusCode.UNAVAILABLE: # type: ignore
Contributor

Maybe add and retry_num == 0? I assume we don't need to do this again if we get another UNAVAILABLE error.

Author

That makes sense. We really only need to force the re-init once to unstick the channel. If it fails again, standard backoff should handle it. Added the retry_num == 0 check.
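
Sketch of the guarded check (variable names follow the discussion above and are illustrative):

if error.code() == StatusCode.UNAVAILABLE and retry_num == 0:
    # Rebuild the channel only on the first UNAVAILABLE of an export attempt;
    # subsequent failures fall through to the normal exponential backoff.
    self._reinitialize_channel()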

# Add channel options for better reconnection behavior
# Only add these if we're dealing with reconnection scenarios
channel_options = []
if hasattr(self, "_channel_reconnection_enabled"):
Contributor

Why not default this to False, then just access it directly via self._channel_reconnection_enabled?

Author

Yes, that's much cleaner. I've moved the initialization to __init__ so we can drop the hasattr check.

]

# Merge reconnection options with existing channel options
current_options = list(self._channel_options)
Contributor

Why do we need to cast this to a list?

Author

I added the list casting to inject aggressive keepalive settings (30s ping) for the new channel. My reasoning was to prevent silent connection drops after a recovery.

However, this is a policy decision. If you prefer we stick to the standard gRPC defaults (2 hours) to minimize traffic/complexity, I am happy to remove this entire block (and the list casting). The core fix (re-initialization) works without it.

What is your preference? Should we keep the aggressive settings or revert to defaults?
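
For reference, the "aggressive keepalive settings" under discussion look roughly like this (the values are illustrative; grpc.keepalive_time_ms and grpc.keepalive_timeout_ms are standard gRPC channel arguments):

reconnection_options = [
    # Ping the server every 30s instead of the multi-hour gRPC default so
    # dead connections are detected and re-established quickly.
    ("grpc.keepalive_time_ms", 30000),
    # Treat the connection as dead if a ping is not acknowledged within 10s.
    ("grpc.keepalive_timeout_ms", 10000),
]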

Contributor

I really don't know; I haven't come across this issue, and I don't know too much about how these params work.

# Merge reconnection options with existing channel options
current_options = list(self._channel_options)
# Filter out options that we are about to override
reconnection_keys = {opt[0] for opt in channel_options}
Contributor

Maybe use tuple unpacking to make it clearer, e.g. for opt, value in channel_options (here and below).
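
The suggestion, sketched:

# Index-based version from the diff:
reconnection_keys = {opt[0] for opt in channel_options}

# Tuple-unpacking version, which makes it explicit that each option is a (name, value) pair:
reconnection_keys = {name for name, _value in channel_options}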

"""OTLP gRPC exporter mixin.
This class provides the base functionality for OTLP exporters that send
telemetry data (spans or metrics) to an OpenTelemetry Collector via gRPC.
Contributor

nit: can we say the same thing here as above (OTLP-compatible receiver)?

("grpc.max_reconnect_backoff_ms", 30000),
]

# Merge reconnection options with existing channel options
Contributor

Can you clarify the merge precedence here? Which one will win?
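
One way the precedence can be made explicit (a sketch; here the new reconnection options win over any duplicates already present in self._channel_options):

reconnection_keys = {name for name, _value in channel_options}
merged_options = [
    (name, value)
    for name, value in self._channel_options
    if name not in reconnection_keys  # drop entries we are about to override
] + channel_options  # reconnection options replace the filtered-out user values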


# Add channel options for better reconnection behavior
# Only add these if we're dealing with reconnection scenarios
channel_options = []
if self._channel_reconnection_enabled:
Contributor

On second thought, I think it makes the most sense to pass this in as a param to this function.

self._shutdown = False
if not hasattr(self, "_shutdown_in_progress"):
self._shutdown_in_progress = threading.Event()
self._shutdown = False
Contributor

Why is this here? Just leave it in the initializer?

Comment on lines 405 to 406
self._credentials = _get_credentials(
credentials,
self._credentials,
Contributor

Why is this here? Just leave it in the initializer?

Author

I’ve significantly simplified the approach based on your feedback.

I removed the custom keepalive and retry settings entirely, so we’re just relying on gRPC defaults now. This should resolve the concerns about those specific values and merge precedence. I also refactored the channel initialization to be stateless and moved the shutdown and credentials logic back to __init__ as you suggested.

Also updated the docs to use "OTLP-compatible receiver".

…mments

- Remove aggressive gRPC keepalive and retry settings to rely on defaults.
- Fix compression precedence logic to correctly handle NoCompression (0).
- Refactor channel initialization to be stateless (remove _channel_reconnection_enabled).
- Update documentation to refer to 'OTLP-compatible receiver'.