Fix configuration caching and telemetry exporter flags (#150)
Walkthrough

Custom settings caching with webhook secret invalidation, an OpenTelemetry instrumentation initialization framework, Prometheus HTTP metrics integration, and metric data model updates to support both Prometheus and OpenTelemetry telemetry backends.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant FastAPI as FastAPI Application
    participant Bootstrap as instrumentation_boot.bootstrap()
    participant OTEL as OpenTelemetry
    participant Prometheus as Prometheus
    FastAPI->>Bootstrap: on_startup (try/except guarded)
    activate Bootstrap
    Bootstrap->>OTEL: import(otel packages)
    alt OTel packages available
        Bootstrap->>OTEL: parse env vars (endpoints, headers)
        Bootstrap->>OTEL: create Resource(service.name, service.namespace)
        Bootstrap->>OTEL: initialize TracerProvider + MeterProvider
        Bootstrap->>OTEL: register BatchSpanProcessor
        Bootstrap->>OTEL: register PeriodicExportingMetricReader
        Bootstrap->>OTEL: set_tracer_provider() / set_meter_provider()
        rect rgb(200, 240, 200)
            Note over Bootstrap,OTEL: _BOOTSTRAPPED flag prevents re-init
        end
    else ImportError (packages unavailable)
        Bootstrap-->>FastAPI: log debug message, gracefully exit
    end
    deactivate Bootstrap
    FastAPI->>Prometheus: Instrumentator().instrument(app)
    activate Prometheus
    rect rgb(240, 200, 200)
        Note over FastAPI,Prometheus: HTTP metrics collection enabled<br/>(status_code labeling)
    end
    deactivate Prometheus
```
```mermaid
sequenceDiagram
    participant Client as HTTP Client
    participant App as FastAPI (main.py)
    participant Metrics as metrics.observe_http_request()
    participant Prom as Prometheus
    participant OTEL as OpenTelemetry
    Client->>App: HTTP Request
    App->>App: call_next(request)
    alt Response received
        rect rgb(200, 240, 200)
            Note over App: Response safety check<br/>(raise if None)
        end
        App->>Metrics: observe_http_request(status_code=int)
    else Exception occurred
        App->>Metrics: observe_http_request(status_code=int)
    end
    activate Metrics
    Metrics->>Prom: record http_requests_total<br/>record http_request_duration_seconds
    alt OTEL initialized
        Metrics->>OTEL: record Counter + Histogram<br/>with attributes (service.name,<br/>http.route, http.method,<br/>http.status_code)
    end
    deactivate Metrics
    App->>Client: HTTP Response
```
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

This PR introduces multiple interconnected observability systems (OpenTelemetry bootstrap, Prometheus integration, metrics refactoring) with heterogeneous changes spanning new modules, API modifications, and runtime behavior changes. The custom settings cache invalidation logic and optional dependency guarding add cognitive load during verification.

Possibly related PRs
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
- backend/api/config.py (2 hunks)
- backend/api/instrumentation_boot.py (1 hunk)
- backend/api/main.py (4 hunks)
- backend/api/metrics.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
backend/api/metrics.py (1)
- backend/api/main.py (1)
  - metrics (167-169)

backend/api/main.py (2)
- backend/api/metrics.py (1)
  - observe_http_request (89-122)
- backend/api/instrumentation_boot.py (1)
  - bootstrap (87-132)
🔍 Remote MCP Sentry
Based on my research, I have found sufficient context to provide a focused review summary. The Linear workspace appears to be new (onboarding setup tasks), and the DeepWiki repository wasn't indexed. However, the Sentry documentation provides relevant baseline information about OpenTelemetry patterns. Let me compile my findings:
Additional Context Found for PR #150 Review
OpenTelemetry Implementation Patterns
The PR's introduction of instrumentation_boot.py with OpenTelemetry resource creation, TracerProvider/MeterProvider initialization, and OTLP exporter configuration follows standard OpenTelemetry patterns. Key considerations:
- **Span Processor Pattern:** The PR registers a `BatchSpanProcessor` for spans and a `PeriodicExportingMetricReader` for metrics; these are standard OpenTelemetry async collection patterns that batch telemetry before export, reducing network overhead.
- **Idempotent Bootstrap:** The use of a `_BOOTSTRAPPED` flag ensures OpenTelemetry initialization occurs only once, preventing duplicate processors and resource leaks, a critical pattern for production code.
- **OTEL Metrics Addition:** The PR extends `observe_http_request()` to emit OTEL counters and histograms with integer status codes (vs. Prometheus strings), aligning with OpenTelemetry semantic conventions for HTTP metrics.
Configuration Caching Strategy
The PR replaces @lru_cache with custom module-level cache variables (_SETTINGS_CACHE, _LAST_SECRET) that invalidate when HYGRAPH_WEBHOOK_SECRET changes. This is a secret-based cache invalidation pattern—appropriate for detecting webhook secret rotations without requiring explicit cache flush calls. The comparison logic checks if the current secret differs from the cached _LAST_SECRET and rebuilds the Settings() object accordingly.
Middleware Safety Enhancement
The PR adds a runtime guard (if response is None: raise RuntimeError) in main.py's request dispatch path. While FastAPI typically guarantees a Response object, this defensive pattern provides explicit failure detection should unexpected middleware behavior occur, improving observability of integration failures.
Status Code Label Consistency
The shift from string status to integer status_code in metrics (with safe fallback handling for invalid codes) improves OpenTelemetry compliance and prevents label cardinality explosion from malformed status values.
Testing Directive
The PR specifies Python bytecode compilation testing (python -m compileall), which verifies import statements and syntax validity across the four modified modules, providing baseline coverage verification.
🔇 Additional comments (10)
backend/api/main.py (3)
25-29: Good optional import guard for OTel bootstrapDefensive import with graceful no-op fallback looks solid.
107-109: LGTM: success-path status code emissionUsing response.status_code with OTEL/Prom labels is correct.
166-169: LGTM: explicit metrics endpoint responseExplicit CONTENT_TYPE_LATEST with generate_latest() is correct.
backend/api/instrumentation_boot.py (3)
15-26: LGTM: tolerant boolean env parsingRobust truthy set and whitespace handling look good.
28-41: LGTM: header parsing with safe skipsGracefully ignores malformed entries; uses first '=' only. Solid.
87-133: LGTM: idempotent, guarded bootstrap
- Optional deps guarded
- Resource configured with service.name/namespace
- BatchSpanProcessor + PeriodicExportingMetricReader are standard
backend/api/metrics.py (3)
12-33: LGTM: optional OTEL metrics initializationGraceful ImportError handling with conditional setup is correct.
88-106: LGTM: status_code handling and OTEL attribute emission
- status_code accepted as int|str; label coerced once
- Safe coercion for OTEL http.status_code attribute
- Dual-record to Prometheus and OTEL stays consistent
Also applies to: 107-123
125-131: LGTM: LCP recorded to both Prometheus and OTELMirrors dual path with correct attribute mapping.
backend/api/config.py (1)
76-85: Add type-safety assertion; confirm cache invalidation scope is intentionalThe assertion suggestion is valid—strict type-checkers need it since
_SETTINGS_CACHE: Settings | Nonecoupled withreturn _SETTINGS_CACHEtriggers nullable return warnings, even though your logic guarantees it's set.The cache invalidation behavior (tracking only
HYGRAPH_WEBHOOK_SECRETchanges) is intentional per the docstring and documented design. However, confirm whether this is desired: if other env vars likeDATABASE_URL,HYGRAPH_TOKEN, or.env.developmentcontents change at runtime, they won't trigger a refresh. If runtime env changes are expected, consider adding a reset mechanism or broadening the invalidation key.
backend/api/instrumentation_boot.py
Outdated
| def _span_exporter_kwargs() -> dict[str, Any]: | ||
| """Build keyword arguments for the OTLP span exporter.""" | ||
|
|
||
| kwargs: dict[str, Any] = { | ||
| "insecure": _env_flag("OTEL_EXPORTER_OTLP_INSECURE", default=False) | ||
| } | ||
|
|
||
| endpoint = os.getenv("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT") or os.getenv( | ||
| "OTEL_EXPORTER_OTLP_ENDPOINT" | ||
| ) | ||
| if endpoint: | ||
| kwargs["endpoint"] = endpoint | ||
|
|
||
| headers = os.getenv("OTEL_EXPORTER_OTLP_TRACES_HEADERS") or os.getenv( | ||
| "OTEL_EXPORTER_OTLP_HEADERS" | ||
| ) | ||
| if headers: | ||
| kwargs["headers"] = _parse_headers(headers) | ||
|
|
||
| return kwargs | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider exporter timeouts and TLS options
Exporters can hang on network issues. Provide a bounded timeout and, if applicable, certificate configuration via env to avoid stalls.
Example:
- kwargs: dict[str, Any] = {
- "insecure": _env_flag("OTEL_EXPORTER_OTLP_INSECURE", default=False)
- }
+ kwargs: dict[str, Any] = {
+ "insecure": _env_flag("OTEL_EXPORTER_OTLP_INSECURE", default=False),
+ "timeout": float(os.getenv("OTEL_EXPORTER_OTLP_TIMEOUT", "10")),
+ }Repeat analogously for metrics kwargs.
Also applies to: 65-85
🤖 Prompt for AI Agents
In backend/api/instrumentation_boot.py around lines 43 to 63 (and similarly
update lines ~65-85 for metrics), the span exporter kwargs lack a bounded
network timeout and TLS/certificate configuration; add reading of environment
variables (e.g. OTEL_EXPORTER_OTLP_TRACES_TIMEOUT or OTEL_EXPORTER_OTLP_TIMEOUT
for spans, and OTEL_EXPORTER_OTLP_METRICS_TIMEOUT for metrics) to parse a
numeric timeout and include it as a "timeout" value in the kwargs, and read
certificate-related env vars (e.g. OTEL_EXPORTER_OTLP_CERTIFICATE or
OTEL_EXPORTER_OTLP_TRACES_CERTIFICATE and their metrics equivalents) to include
the appropriate TLS/certificate config key(s) expected by the OTLP exporter
(pass certificate path or credentials as the exporter expects) so exporters
won’t hang and can be configured for TLS; apply the same pattern to the metrics
kwargs block at the referenced lines.
| from api.routes_sync import router as sync_router | ||
| from api.routes_observability import router as observability_router | ||
| from prometheus_client import CONTENT_TYPE_LATEST, generate_latest | ||
| from prometheus_fastapi_instrumentator import Instrumentator |
There was a problem hiding this comment.
🧩 Analysis chain
Guard optional dependency and prevent metric duplication/conflicts
- Importing Instrumentator unguarded can crash when the package isn’t installed.
- Using Instrumentator alongside your custom Prometheus middleware/counters risks double-counting and potential metric-name collisions (e.g., http_requests_total).
Recommended changes:
-from prometheus_fastapi_instrumentator import Instrumentator
+try: # pragma: no cover - optional dependency guard
+ from prometheus_fastapi_instrumentator import Instrumentator
+except ImportError: # pragma: no cover
+ Instrumentator = None # type: ignore[assignment]-# Register Prometheus instrumentation so HTTP metrics are emitted for alerts/dashboards
-Instrumentator().instrument(app)
+# Register Prometheus instrumentation only if available; exclude /metrics
+if Instrumentator is not None:
+ Instrumentator().instrument(app, excluded_handlers=["/metrics"])Also decide on a single HTTP request metric source (either Instrumentator or your custom middleware + metrics.py) to avoid duplicate series.
Run a quick web check to confirm default metric names from Instrumentator:
Also applies to: 53-53
🌐 Web query:
Does prometheus_fastapi_instrumentator export a metric named http_requests_total by default, and can it conflict with an existing custom Counter of the same name?
💡 Result:
Short answer: Yes — prometheus_fastapi_instrumentator registers a Counter named http_requests_total by default, and that can conflict with an existing metric of the same name (you'll get duplicate/registration errors).
Sources:
- The package docs state the default Counter http_requests_total is created. [1][2]
- The instrumentator registers metrics in the global prometheus_client registry, so registering the same metric twice produces duplicate-timeseries/ValueError errors (reports and issues show "Duplicated timeseries" when metrics are registered more than once). [3][4]
If you want, I can show ways to avoid the conflict (use a custom registry/namespace/subsystem or disable the default metric).
---
🏁 Script executed:
```shell
#!/bin/bash
# Check the current state of lines 22 and 53 in backend/api/main.py
# and search for custom metrics or custom Prometheus middleware
echo "=== Lines 20-60 in backend/api/main.py ==="
sed -n '20,60p' backend/api/main.py
echo -e "\n=== Search for http_requests_total in the codebase ==="
rg -n "http_requests_total" --type python
echo -e "\n=== Search for prometheus_client Counter or Gauge in main.py ==="
rg -n "prometheus_client|Counter|Gauge" backend/api/main.py -A 2 -B 2
Length of output: 1984
---
🏁 Script executed:
```shell
#!/bin/bash
# Inspect api/metrics.py to check for custom prometheus metrics
echo "=== Content of api/metrics.py ==="
cat backend/api/metrics.py
echo -e "\n=== Check for prometheus_client usage in api/metrics.py ==="
rg -n "Counter|Gauge|Histogram|Summary" backend/api/metrics.py -A 2
Length of output: 4730
---
🏁 Script executed:
```shell
#!/bin/bash
# Get exact line numbers for Instrumentator usage
rg -n "Instrumentator" backend/api/main.py
Length of output: 99
Guard optional dependency and fix metric duplication conflict
Verification confirms both critical issues:
-
Unguarded import (line 22):
prometheus_fastapi_instrumentatoris imported without try-except, risking crash if the package isn't installed. The codebase already uses this pattern for optional dependencies (see line 26-29 withinstrumentation_boot). -
Metric name collision (line 53):
prometheus_fastapi_instrumentatorregistershttp_requests_totalby default, which conflicts with your customhttp_requests_totalCounter defined inapi/metrics.py(line 68). Both register to the same globalprometheus_clientregistry, causing a duplicate-timeseriesValueErrorat runtime.
Required changes:
Guard the import:
-from prometheus_fastapi_instrumentator import Instrumentator
+try: # pragma: no cover - optional dependency guard
+ from prometheus_fastapi_instrumentator import Instrumentator
+except ImportError: # pragma: no cover
+ Instrumentator = None # type: ignore[assignment]Conditionally register and avoid metric collision:
-Instrumentator().instrument(app)
+if Instrumentator is not None:
+ Instrumentator().instrument(app, excluded_handlers=["/metrics"])Additionally, remove http_requests_total from api/metrics.py (line 68–72) since you already capture HTTP metrics via observe_http_request() in your custom middleware. This avoids the dual instrumentation.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from prometheus_fastapi_instrumentator import Instrumentator | |
| try: # pragma: no cover - optional dependency guard | |
| from prometheus_fastapi_instrumentator import Instrumentator | |
| except ImportError: # pragma: no cover | |
| Instrumentator = None # type: ignore[assignment] |
🤖 Prompt for AI Agents
backend/api/main.py around line 22: the import of
prometheus_fastapi_instrumentator must be guarded with a try/except like the
existing instrumentation_boot pattern (lines ~26-29) so the app doesn't crash
when the optional package is absent; only attempt to instantiate/register
Instrumentator when the import succeeds and configure it to not register its
default http_requests_total metric (or disable that metric) to avoid duplication
with our own metrics; remove the http_requests_total Counter in api/metrics.py
(lines ~68-72) and rely on observe_http_request() in our middleware instead, and
ensure the Instrumentator registration is conditional to prevent
double-registration and runtime ValueError.
| if bootstrap_instrumentation is not None: # pragma: no branch - simple guard | ||
| try: | ||
| bootstrap_instrumentation() | ||
| except Exception: # pragma: no cover - defensive guard | ||
| logger.exception("Failed to bootstrap OpenTelemetry instrumentation") | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Bootstrap timing is fine; avoid re-instantiating Settings later
Bootstrapping at import-time with a try/except is acceptable. Note: below, code constructs Settings again via _Settings(); prefer the cached settings variable to avoid double parsing and to keep behavior consistent with get_settings().
Would you like a follow-up diff to switch to the existing settings instance for the SQLite check?
🤖 Prompt for AI Agents
In backend/api/main.py around lines 36 to 41, the bootstrap try/except is fine
but later code re-instantiates Settings via _Settings(), causing double parsing
and potential inconsistency; change that code to reuse the existing cached
settings instance (call get_settings() or reference the module-level settings
variable) for the SQLite check so the same parsed config is used everywhere, and
remove the direct _Settings() construction.
backend/api/main.py
Outdated
| if response is None: # pragma: no cover - FastAPI guarantees a Response | ||
| raise RuntimeError("ASGI call_next returned None response") | ||
| except Exception: |
There was a problem hiding this comment.
Record accurate status codes on exceptions
Currently exceptions are always recorded as 500, which misrepresents handled errors (e.g., RequestValidationError returns 422). Map known exceptions to their final status before recording.
Apply:
- except Exception:
+ except Exception as exc:
duration = perf_counter() - start
- observe_http_request(
+ # Derive best-effort status code
+ from fastapi import HTTPException as FastAPIHTTPException # local import to avoid top-level import churn
+ status = 500
+ if isinstance(exc, RequestValidationError):
+ status = 422
+ elif isinstance(exc, FastAPIHTTPException):
+ status = int(getattr(exc, "status_code", 500))
+ observe_http_request(
service="backend",
route=_resolve_route_label(request),
method=request.method,
- status_code=500,
+ status_code=status,
duration_seconds=duration,
)
raiseAlso applies to: 97-99
|
Note Docstrings generation - SUCCESS |
Docstrings generation was requested by @shayancoin. * #150 (comment) The following files were modified: * `backend/api/config.py` * `backend/api/instrumentation_boot.py` * `backend/api/main.py` * `backend/api/metrics.py`
Docstrings generation was requested by @shayancoin. * #150 (comment) The following files were modified: * `backend/api/config.py` * `backend/api/instrumentation_boot.py` * `backend/api/main.py` * `backend/api/metrics.py` Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Shayan <shayan@coin.link>
Summary
Testing
https://chatgpt.com/codex/tasks/task_e_68f158b020908330b101a254cf93c03e
Summary by CodeRabbit
New Features
Improvements