Commit a7ab0d2

Implement exponential backoff + killswitch + sdkstats feature control (#43147)
1 parent 49e6e88 commit a7ab0d2

17 files changed

+797
-199
lines changed

sdk/monitor/azure-monitor-opentelemetry-exporter/CHANGELOG.md

Lines changed: 2 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -7,6 +7,8 @@
77
([#43032](https://github.com/Azure/azure-sdk-for-python/pull/43032))
88
- Adding customer sdkstats to feature statsbeat
99
([#43066](https://github.com/Azure/azure-sdk-for-python/pull/43066))
10+
- OneSettings control plane: Add killswitch + exponential backoff + sdkstats feature control
11+
([#43147](https://github.com/Azure/azure-sdk-for-python/pull/43147))
1012

1113
### Breaking Changes
1214

@@ -41,7 +43,6 @@
4143
([#42655](https://github.com/Azure/azure-sdk-for-python/pull/42655))
4244
- Customer Facing SDKStats: Added telemetry_success field to dropped items as per [Spec] - https://github.com/aep-health-and-standards/Telemetry-Collection-Spec/pull/606
4345
([#42846](https://github.com/Azure/azure-sdk-for-python/pull/42846))
44-
### Breaking Changes
4546

4647
### Bugs Fixed
4748
- Customer Facing SDKStats: Refactor to use `Manager` and `Singleton` pattern

sdk/monitor/azure-monitor-opentelemetry-exporter/azure/monitor/opentelemetry/exporter/_configuration/__init__.py

Lines changed: 57 additions & 11 deletions
@@ -9,11 +9,14 @@
99
_ONE_SETTINGS_DEFAULT_REFRESH_INTERVAL_SECONDS,
1010
_ONE_SETTINGS_CHANGE_URL,
1111
_ONE_SETTINGS_CONFIG_URL,
12+
_ONE_SETTINGS_MAX_REFRESH_INTERVAL_SECONDS,
13+
_RETRYABLE_STATUS_CODES,
1214
)
13-
from azure.monitor.opentelemetry.exporter._configuration._utils import _ConfigurationProfile
15+
from azure.monitor.opentelemetry.exporter._configuration._utils import _ConfigurationProfile, OneSettingsResponse
1416
from azure.monitor.opentelemetry.exporter._configuration._utils import make_onesettings_request
1517
from azure.monitor.opentelemetry.exporter._utils import Singleton
1618

19+
1720
# Set up logger
1821
logger = logging.getLogger(__name__)
1922

@@ -79,29 +82,49 @@ def _notify_callbacks(self, settings: Dict[str, str]):
7982
except Exception as ex: # pylint: disable=broad-except
8083
logger.warning("Callback failed: %s", ex)
8184

82-
# pylint: disable=too-many-statements
85+
def _is_transient_error(self, response: OneSettingsResponse) -> bool:
86+
"""Check if the response indicates a transient error.
87+
88+
:param response: OneSettingsResponse object from OneSettings request
89+
:type response: OneSettingsResponse
90+
:return: True if the error is transient and refresh interval should be increased
91+
:rtype: bool
92+
"""
93+
# Check for exception indicator or retryable HTTP status codes
94+
return response.has_exception or response.status_code in _RETRYABLE_STATUS_CODES
95+
96+
# pylint: disable=too-many-statements, too-many-branches
8397
def get_configuration_and_refresh_interval(self, query_dict: Optional[Dict[str, str]] = None) -> int:
8498
"""Fetch configuration from OneSettings and update local cache atomically.
8599
86100
This method performs a conditional HTTP request to OneSettings using the
87101
current ETag for efficient caching. It atomically updates the local configuration
88102
state with any new settings and manages version tracking for change detection.
89103
104+
When transient errors are encountered (timeouts, network exceptions, or HTTP status
105+
codes 429, 500-504) from the CHANGE endpoint, the method doubles the current refresh
106+
interval to reduce load on the failing service and returns immediately. The refresh
107+
interval is capped at 24 hours (86,400 seconds) to prevent excessively long delays.
108+
90109
The method implements a check-and-set pattern for thread safety:
91110
1. Reads current state atomically to prepare request headers
92111
2. Makes HTTP request to OneSettings CHANGE endpoint outside locks
93-
3. Re-reads current state to make version comparison decisions
94-
4. Conditionally fetches from CONFIG endpoint if version increased
95-
5. Updates all state fields atomically in a single operation
112+
3. If transient error (including timeouts/exceptions), doubles refresh interval
113+
(capped at 24 hours) and returns immediately
114+
4. Re-reads current state to make version comparison decisions
115+
5. Conditionally fetches from CONFIG endpoint if version increased
116+
6. Updates all state fields atomically in a single operation
96117
97118
Version comparison logic:
98119
- Version increase: New configuration available, fetches and caches new settings
99120
- Version same: No changes detected, ETag and refresh interval updated safely
100121
- Version decrease: Unexpected rollback state, logged as warning, no updates applied
101122
102123
Error handling:
124+
- Transient errors (timeouts, exceptions, retryable HTTP codes) from CHANGE endpoint:
125+
Refresh interval doubled (capped), immediate return
103126
- CONFIG endpoint failure: ETag not updated to preserve retry capability on next call
104-
- Network failures: Handled by make_onesettings_request, returns default values
127+
- Network failures: Handled by make_onesettings_request with error indicators
105128
- Missing settings/version: Logged as warning, only ETag and refresh interval updated
106129
107130
:param query_dict: Optional query parameters to include in the OneSettings request.
@@ -110,8 +133,9 @@ def get_configuration_and_refresh_interval(self, query_dict: Optional[Dict[str,
110133
:type query_dict: Optional[Dict[str, str]]
111134
112135
:return: Updated refresh interval in seconds for the next configuration check.
113-
This value comes from the OneSettings response and determines how frequently
114-
the background worker should call this method.
136+
This value comes from the OneSettings response or is doubled (capped at 24 hours)
137+
if transient errors are encountered from the CHANGE endpoint, determining how
138+
frequently the background worker should call this method.
115139
:rtype: int
116140
117141
Thread Safety:
@@ -134,6 +158,12 @@ def get_configuration_and_refresh_interval(self, query_dict: Optional[Dict[str,
134158
All configuration state (ETag, refresh interval, version, settings) is updated
135159
atomically using immutable state objects. This prevents race conditions where
136160
different threads might observe inconsistent combinations of these values.
161+
162+
Transient Error Handling:
163+
When transient errors are detected from the CHANGE endpoint (including timeouts,
164+
network exceptions, or retryable HTTP status codes), the refresh interval is
165+
doubled and the method returns immediately, preserving current state for retry.
166+
The refresh interval is capped at 24 hours to ensure eventual recovery attempts.
137167
"""
138168
query_dict = query_dict or {}
139169
headers = {}
@@ -149,11 +179,27 @@ def get_configuration_and_refresh_interval(self, query_dict: Optional[Dict[str,
149179
# Make the OneSettings request
150180
response = make_onesettings_request(_ONE_SETTINGS_CHANGE_URL, query_dict, headers)
151181

182+
# Check for transient errors from CHANGE endpoint - return immediately if found
183+
if self._is_transient_error(response):
184+
with self._state_lock:
185+
# Double the refresh interval and cap it at 24 hours
186+
doubled_interval = self._current_state.refresh_interval * 2
187+
current_refresh_interval = min(doubled_interval, _ONE_SETTINGS_MAX_REFRESH_INTERVAL_SECONDS)
188+
189+
# Create appropriate log message based on error type
190+
if response.has_exception:
191+
error_description = "network error"
192+
else:
193+
error_description = f"HTTP {response.status_code}"
194+
195+
logger.warning("OneSettings CHANGE request failed with transient error (%s). Retrying. ", error_description)
196+
return current_refresh_interval # type: ignore
197+
152198
# Prepare new state updates
153199
new_state_updates = {}
154200
if response.etag is not None:
155201
new_state_updates['etag'] = response.etag
156-
if response.refresh_interval and response.refresh_interval > 0:
202+
if response.refresh_interval and response.refresh_interval > 0: # type: ignore
157203
new_state_updates['refresh_interval'] = response.refresh_interval # type: ignore
158204

159205
if response.status_code == 304:
@@ -217,14 +263,14 @@ def get_configuration_and_refresh_interval(self, query_dict: Optional[Dict[str,
217263
if notify_callbacks and state_for_callbacks is not None and state_for_callbacks.settings_cache:
218264
self._notify_callbacks(state_for_callbacks.settings_cache)
219265

220-
return current_refresh_interval
266+
return current_refresh_interval # type: ignore
221267

222268
def get_settings(self) -> Dict[str, str]: # pylint: disable=C4741,C4742
223269
"""Get current settings cache."""
224270
with self._state_lock:
225271
return self._current_state.settings_cache.copy() # type: ignore
226272

227-
def get_current_version(self) -> int: # pylint: disable=C4741,C4742
273+
def get_current_version(self) -> int: # type: ignore # pylint: disable=C4741,C4742
228274
"""Get current version."""
229275
with self._state_lock:
230276
return self._current_state.version_cache # type: ignore
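
The backoff rule described in the docstring above can be illustrated in isolation. This is a minimal sketch, not the exporter's code: `next_refresh_interval` is a hypothetical helper, and the contents of `RETRYABLE_STATUS_CODES` are an assumption taken from the docstring's "429, 500-504", since the diff does not show the actual value of `_RETRYABLE_STATUS_CODES`.

```python
# Cap named in the diff: 24 hours in seconds.
MAX_REFRESH_INTERVAL_SECONDS = 24 * 60 * 60

# Assumption: mirrors the docstring ("429, 500-504"); the real constant is not shown.
RETRYABLE_STATUS_CODES = frozenset({429, 500, 501, 502, 503, 504})


def next_refresh_interval(current_interval: int, status_code: int, has_exception: bool) -> int:
    """Return the wait before the next poll (hypothetical helper).

    Transient failures double the interval, capped at 24 hours.
    Anything else leaves the interval unchanged.
    """
    transient = has_exception or status_code in RETRYABLE_STATUS_CODES
    if not transient:
        return current_interval
    return min(current_interval * 2, MAX_REFRESH_INTERVAL_SECONDS)


# Simulate six consecutive HTTP 503 responses, starting from a one-hour interval.
interval = 3600
history = []
for _ in range(6):
    interval = next_refresh_interval(interval, status_code=503, has_exception=False)
    history.append(interval)

print(history)  # [7200, 14400, 28800, 57600, 86400, 86400]
```

Starting from one hour, the fifth consecutive failure reaches the 24-hour cap. The sketch feeds each returned interval back in as the next current interval; the doubling only compounds if the caller keeps the returned value.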

sdk/monitor/azure-monitor-opentelemetry-exporter/azure/monitor/opentelemetry/exporter/_configuration/_state.py

Lines changed: 10 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -4,23 +4,29 @@
44
55
This module provides global access functions for the Configuration Manager singleton.
66
"""
7+
import os
8+
from typing import Optional, TYPE_CHECKING
79

8-
from typing import TYPE_CHECKING
10+
from azure.monitor.opentelemetry.exporter._constants import _APPLICATIONINSIGHTS_CONTROLPLANE_DISABLED
911

1012
if TYPE_CHECKING:
1113
from azure.monitor.opentelemetry.exporter._configuration import _ConfigurationManager
1214

1315
# Global singleton instance for easy access throughout the codebase
1416
_configuration_manager = None
1517

16-
def get_configuration_manager() -> "_ConfigurationManager":
18+
def get_configuration_manager() -> Optional["_ConfigurationManager"]:
1719
"""Get the global Configuration Manager singleton instance.
1820
1921
This provides a single access point to the manager and handles lazy initialization.
22+
Returns None if control plane functionality is disabled via environment variable.
2023
21-
:return: The singleton Configuration Manager instance
22-
:rtype: _ConfigurationManager
24+
:return: The singleton Configuration Manager instance, or None if disabled
25+
:rtype: Optional[_ConfigurationManager]
2326
"""
27+
disabled = os.environ.get(_APPLICATIONINSIGHTS_CONTROLPLANE_DISABLED)
28+
if disabled is not None and disabled.lower() == "true":
29+
return None
2430
global _configuration_manager # pylint: disable=global-statement
2531
if _configuration_manager is None:
2632
from azure.monitor.opentelemetry.exporter._configuration import _ConfigurationManager
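
The killswitch in this file changes the accessor's return type to `Optional`, so every caller has to handle `None`. The sketch below shows the pattern on its own; `ConfigManager` and `get_manager` are hypothetical stand-ins rather than the exporter's classes, and only the environment variable name and the case-insensitive `"true"` comparison are taken from the diff.

```python
import os
from typing import Callable, List, Optional

# Environment variable name as defined in _constants.py in this commit.
CONTROLPLANE_DISABLED_ENV = "APPLICATIONINSIGHTS_CONTROLPLANE_DISABLED"


class ConfigManager:
    """Hypothetical stand-in for the real configuration manager."""

    def __init__(self) -> None:
        self.callbacks: List[Callable[..., None]] = []

    def register_callback(self, callback: Callable[..., None]) -> None:
        self.callbacks.append(callback)


_manager: Optional[ConfigManager] = None


def get_manager() -> Optional[ConfigManager]:
    """Return the lazily created singleton, or None when the killswitch is set."""
    global _manager
    disabled = os.environ.get(CONTROLPLANE_DISABLED_ENV)
    if disabled is not None and disabled.lower() == "true":
        return None
    if _manager is None:
        _manager = ConfigManager()
    return _manager


# Callers guard against None, as the live metrics and exporter changes do.
manager = get_manager()
if manager is not None:
    manager.register_callback(print)
```

Because the variable is read on every call, only the string `true` (in any letter case) disables the control plane in this sketch; a value such as `1` does not.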

sdk/monitor/azure-monitor-opentelemetry-exporter/azure/monitor/opentelemetry/exporter/_configuration/_utils.py

Lines changed: 16 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -46,14 +46,15 @@ class OneSettingsResponse:
4646
"""Response object containing OneSettings API response data.
4747
4848
This class encapsulates the parsed response from a OneSettings API call,
49-
including configuration settings, version information, and metadata.
49+
including configuration settings, version information, error indicators and metadata.
5050
5151
Attributes:
5252
etag (Optional[str]): ETag header value for caching and conditional requests
5353
refresh_interval (int): Interval in seconds for the next configuration refresh
5454
settings (Dict[str, str]): Dictionary of configuration key-value pairs
5555
version (Optional[int]): Configuration version number for change tracking
5656
status_code (int): HTTP status code from the response
57+
has_exception (bool): True if the request resulted in a transient error (network error, timeout, etc.)
5758
"""
5859

5960
def __init__(
@@ -62,7 +63,8 @@ def __init__(
6263
refresh_interval: int = _ONE_SETTINGS_DEFAULT_REFRESH_INTERVAL_SECONDS,
6364
settings: Optional[Dict[str, str]] = None,
6465
version: Optional[int] = None,
65-
status_code: int = 200
66+
status_code: int = 200,
67+
has_exception: bool = False
6668
):
6769
"""Initialize OneSettingsResponse with configuration data.
6870
@@ -74,12 +76,14 @@ def __init__(
7476
Defaults to empty dict if None.
7577
version (Optional[int], optional): Configuration version number. Defaults to None.
7678
status_code (int, optional): HTTP status code. Defaults to 200.
79+
has_exception (bool, optional): Indicates if request failed with a transient error. Defaults to False.
7780
"""
7881
self.etag = etag
7982
self.refresh_interval = refresh_interval
8083
self.settings = settings or {}
8184
self.version = version
8285
self.status_code = status_code
86+
self.has_exception = has_exception
8387

8488

8589
def make_onesettings_request(url: str, query_dict: Optional[Dict[str, str]] = None,
@@ -88,7 +92,7 @@ def make_onesettings_request(url: str, query_dict: Optional[Dict[str, str]] = No
8892
8993
This function handles the complete OneSettings request lifecycle including:
9094
- Making the HTTP GET request with optional query parameters and headers
91-
- Error handling for network, HTTP, and JSON parsing errors
95+
- Error handling for network, HTTP, timeout, and JSON parsing errors
9296
- Parsing the response into a structured OneSettingsResponse object
9397
9498
:param url: The OneSettings API endpoint URL to request
@@ -100,13 +104,13 @@ def make_onesettings_request(url: str, query_dict: Optional[Dict[str, str]] = No
100104
Common headers include 'If-None-Match' for ETag caching. Defaults to None.
101105
:type headers: Optional[Dict[str, str]]
102106
103-
:return: Parsed response containing configuration data and metadata.
104-
Returns a default response object if the request fails.
107+
:return: Parsed response containing configuration data and metadata, including
108+
error indicators for exceptions and timeouts.
105109
:rtype: OneSettingsResponse
106110
107111
Raises:
108112
Does not raise exceptions - all errors are caught and logged, returning a
109-
default OneSettingsResponse object.
113+
OneSettingsResponse object with appropriate error indicators set.
110114
"""
111115
query_dict = query_dict or {}
112116
headers = headers or {}
@@ -116,15 +120,18 @@ def make_onesettings_request(url: str, query_dict: Optional[Dict[str, str]] = No
116120
result.raise_for_status() # Raises an exception for 4XX/5XX responses
117121

118122
return _parse_onesettings_response(result)
123+
except requests.exceptions.Timeout as ex:
124+
logger.warning("OneSettings request timed out: %s", str(ex))
125+
return OneSettingsResponse(has_exception=True)
119126
except requests.exceptions.RequestException as ex:
120127
logger.warning("Failed to fetch configuration from OneSettings: %s", str(ex))
121-
return OneSettingsResponse()
128+
return OneSettingsResponse(has_exception=True)
122129
except json.JSONDecodeError as ex:
123130
logger.warning("Failed to parse OneSettings response: %s", str(ex))
124-
return OneSettingsResponse()
131+
return OneSettingsResponse(has_exception=True)
125132
except Exception as ex: # pylint: disable=broad-exception-caught
126133
logger.warning("Unexpected error while fetching configuration: %s", str(ex))
127-
return OneSettingsResponse()
134+
return OneSettingsResponse(has_exception=True)
128135

129136

130137
def _parse_onesettings_response(response: requests.Response) -> OneSettingsResponse:
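
The reason for the new `has_exception` flag is visible in the removed lines: a failed request used to return a default response whose `status_code` is 200, which a caller cannot tell apart from a successful one. The following sketch reproduces that situation with the standard library only. `Response`, `safe_request`, and the `fetch` callable are hypothetical names for illustration, not the exporter's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Response:
    """Hypothetical, reduced version of the response object in the diff."""

    etag: Optional[str] = None
    settings: Dict[str, str] = field(default_factory=dict)
    status_code: int = 200       # default stays 200, as in the diff
    has_exception: bool = False  # explicit failure indicator


def safe_request(fetch: Callable[[], Response]) -> Response:
    """Call fetch() and never raise; failures are reported through has_exception."""
    try:
        return fetch()
    except Exception:  # broad on purpose: this wrapper must not propagate errors
        return Response(has_exception=True)


def failing_fetch() -> Response:
    raise TimeoutError("simulated timeout")


failed = safe_request(failing_fetch)
succeeded = safe_request(lambda: Response(etag="abc"))

# Both responses carry status_code 200; only the flag distinguishes them.
print(failed.status_code, failed.has_exception)
print(succeeded.status_code, succeeded.has_exception)
```

A caller that checks only `status_code` would treat the failed response as a success, which is why the transient-error check in the diff tests `has_exception` before the status code.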

sdk/monitor/azure-monitor-opentelemetry-exporter/azure/monitor/opentelemetry/exporter/_constants.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -79,6 +79,7 @@
7979
_MICROSOFT_CUSTOM_EVENT_NAME = "microsoft.custom_event.name"
8080

8181
# ONE SETTINGS
82+
_APPLICATIONINSIGHTS_CONTROLPLANE_DISABLED = "APPLICATIONINSIGHTS_CONTROLPLANE_DISABLED"
8283
_ONE_SETTINGS_PYTHON_KEY = "python"
8384
_ONE_SETTINGS_PYTHON_TARGETING = {"namespaces": _ONE_SETTINGS_PYTHON_KEY}
8485
_ONE_SETTINGS_CHANGE_VERSION_KEY = "CHANGE_VERSION"
@@ -94,6 +95,9 @@
9495
_ONE_SETTINGS_SUPPORTED_DATA_BOUNDARIES_KEY = "SUPPORTED_DATA_BOUNDARIES"
9596
_ONE_SETTINGS_FEATURE_LOCAL_STORAGE = "FEATURE_LOCAL_STORAGE"
9697
_ONE_SETTINGS_FEATURE_LIVE_METRICS = "FEATURE_LIVE_METRICS"
98+
_ONE_SETTINGS_FEATURE_SDK_STATS = "FEATURE_SDK_STATS"
99+
# Maximum refresh interval cap (24 hours in seconds)
100+
_ONE_SETTINGS_MAX_REFRESH_INTERVAL_SECONDS = 24 * 60 * 60 # 86,400 seconds
97101

98102
# Statsbeat
99103
# (OpenTelemetry metric name, Statsbeat metric name)

sdk/monitor/azure-monitor-opentelemetry-exporter/azure/monitor/opentelemetry/exporter/_quickpulse/_live_metrics.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -41,7 +41,9 @@ def enable_live_metrics(**kwargs: Any) -> None: # pylint: disable=C4758
4141
# Will only be added if QuickpulseManager is initialized successfully
4242
# Is a NoOp if _ConfigurationManager not initialized
4343
config_manager = get_configuration_manager()
44-
config_manager.register_callback(get_quickpulse_configuration_callback)
44+
# config_manager would be `None` if control plane is disabled
45+
if config_manager:
46+
config_manager.register_callback(get_quickpulse_configuration_callback)
4547

4648
# We can detect feature usage for statsbeat since we are in an opt-in model currently
4749
# Once we move to live metrics on-by-default, we will have to check for both explicit usage

sdk/monitor/azure-monitor-opentelemetry-exporter/azure/monitor/opentelemetry/exporter/_quickpulse/_manager.py

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -274,9 +274,6 @@ def shutdown(self) -> bool:
274274

275275
if shutdown_success:
276276
_set_global_quickpulse_state(_QuickpulseState.OFFLINE)
277-
# Clear the singleton instance from the metaclass
278-
if self.__class__ in _QuickpulseManager._instances: # pylint: disable=protected-access
279-
del _QuickpulseManager._instances[self.__class__] # pylint: disable=protected-access
280277

281278
return shutdown_success
282279

sdk/monitor/azure-monitor-opentelemetry-exporter/azure/monitor/opentelemetry/exporter/export/_base.py

Lines changed: 9 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -173,14 +173,15 @@ def __init__(self, **kwargs: Any) -> None:
173173
self.client: AzureMonitorClient = AzureMonitorClient(
174174
host=self._endpoint, connection_timeout=self._timeout, policies=policies, **kwargs
175175
)
176-
self._configuration_manager.initialize(
177-
os=_get_os(),
178-
rp=_get_rp(),
179-
attach=_get_attach_type(),
180-
component="ext",
181-
version=ext_version,
182-
region=self._region,
183-
)
176+
if self._configuration_manager:
177+
self._configuration_manager.initialize(
178+
os=_get_os(),
179+
rp=_get_rp(),
180+
attach=_get_attach_type(),
181+
component="ext",
182+
version=ext_version,
183+
region=self._region,
184+
)
184185
self.storage: Optional[LocalFileStorage] = None
185186
if not self._disable_offline_storage:
186187
self.storage = LocalFileStorage( # pyright: ignore
