Add OpenTelemetry instrumentation for API#82
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough: This PR replaces Prometheus instrumentation with OpenTelemetry (OTLP), adds JSON structured logging with trace/span context, introduces request tracing middleware and metrics recording, and changes settings caching to refresh when the webhook secret changes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant FastAPIApp as FastAPI (app)
    participant Middleware as telemetry_middleware
    participant Tracer
    participant SpanProc as Span Processor
    participant OTLP as OTLP Exporter
    participant Logger as JSON Logger
    Client->>FastAPIApp: HTTP Request
    FastAPIApp->>Middleware: enter middleware
    Middleware->>Tracer: start_as_current_span()
    Tracer->>Tracer: create span (trace_id, span_id)
    Middleware->>FastAPIApp: call_next(request)
    FastAPIApp->>FastAPIApp: route handler, DB ops (spans)
    FastAPIApp-->>Middleware: response
    Middleware->>Middleware: compute duration, set attributes/status
    Middleware->>Logger: emit JSON log (with trace/span ids)
    Tracer->>SpanProc: export spans (batch)
    SpanProc->>OTLP: send traces/metrics to OTLP endpoint
    Middleware-->>Client: response
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks: ❌ Failed checks (2 warnings), ✅ Passed checks (1 passed)
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
@app.middleware("http")
async def telemetry_middleware(request: Request, call_next):
    method = request.method.upper()
    route_path = request.scope.get("route")
    path_template = getattr(route_path, "path", request.url.path)
    start = time.perf_counter()

    current_span = trace.get_current_span()
    if current_span.get_span_context().is_valid:
        span_context = nullcontext(current_span)
    else:
        span_context = _http_tracer.start_as_current_span(f"HTTP {method}")

    status_code = 500
    response: Optional[Response] = None
    with span_context as span:
        span.set_attribute("http.route", path_template)
        span.set_attribute("http.method", method)
        span.set_attribute("deployment.environment", settings.environment)
        try:
            response = await call_next(request)
            status_code = response.status_code
            span.set_attribute("http.status_code", status_code)
            if status_code >= 500:
                span.set_status(Status(StatusCode.ERROR))
        except Exception as exc:  # noqa: BLE001
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            record_http_request(method, path_template, status_code, duration_ms)
```
Derive `http.route` after router match

The telemetry middleware reads `request.scope.get("route")` before calling `call_next`, so the variable is still `None` for most requests and `path_template` falls back to `request.url.path`. As a result, both the span attribute `http.route` and the metrics emitted via `record_http_request` are populated with the concrete request path rather than the route template (e.g. `/api/modules/123` instead of `/api/modules/{id}`). This produces high-cardinality metrics and overwrites the normalized `http.route` that `FastAPIInstrumentor` sets later in the pipeline. Consider fetching `request.scope["route"].path` after `call_next` to ensure attributes and metrics stay bounded.
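The lookup this suggestion describes can be sketched as a small pure helper (a hypothetical extraction for illustration; `resolve_route_template` is not a name from the PR). The key point is that it must run after `call_next`, once the router has populated `scope["route"]`:

```python
def resolve_route_template(scope: dict, raw_path: str) -> str:
    """Prefer the matched route's bounded template over the concrete request path.

    scope["route"] is only populated once routing has run (i.e. after call_next),
    so calling this in the middleware's finally block yields "/api/modules/{id}"
    instead of the high-cardinality "/api/modules/123".
    """
    route = scope.get("route")
    return getattr(route, "path", raw_path)
```

In the middleware above, this would replace the early `path_template` computation, so both the span attribute and `record_http_request` receive the bounded template.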
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (1)
`backend/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (7)
- backend/api/config.py (2 hunks)
- backend/api/instrumentation_boot.py (1 hunks)
- backend/api/main.py (4 hunks)
- backend/api/metrics.py (1 hunks)
- backend/api/routes_sync.py (4 hunks)
- backend/api/security.py (1 hunks)
- backend/pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)

backend/api/instrumentation_boot.py (1)
- backend/api/config.py (1): `get_settings` (69-78)

backend/api/routes_sync.py (2)
- backend/api/metrics.py (4): `record_sync_duration` (59-60), `record_sync_failure` (50-51), `record_sync_records` (54-56), `record_sync_success` (46-47)
- backend/api/models.py (1): `SyncEvent` (118-133)

backend/api/metrics.py (1)
- backend/api/main.py (1): `metrics` (146-151)

backend/api/main.py (3)
- backend/api/config.py (1): `get_settings` (69-78)
- backend/api/instrumentation_boot.py (1): `setup_instrumentation` (74-110)
- backend/api/metrics.py (1): `record_http_request` (63-70)

backend/api/security.py (1)
- backend/api/config.py (1): `Settings` (10-62)
```python
def get_settings() -> Settings:
    """Return a cached instance of :class:`Settings`, refreshing on secret changes."""
    global _SETTINGS_CACHE, _LAST_SECRET
    current_secret = os.getenv("HYGRAPH_WEBHOOK_SECRET")
    if _SETTINGS_CACHE is None or _LAST_SECRET != current_secret:
        settings = Settings()
        _SETTINGS_CACHE = settings
        _LAST_SECRET = settings.hygraph_webhook_secret
    return _SETTINGS_CACHE
```
Fix settings cache so it actually caches.

`_LAST_SECRET` stores the trimmed secret from `Settings`, but you compare it to the raw environment value on every call. When `HYGRAPH_WEBHOOK_SECRET` is not set (the common case in local/dev where the value comes from the .env file), `_LAST_SECRET` holds the non-empty secret while `current_secret` is `None`, so the condition is always true and `Settings()` is rebuilt on every call. This regresses the previous `@lru_cache` behavior and needlessly re-reads the environment and validates settings on every dependency resolution. Please restore a stable cache key based on the raw env value.
Suggested fix:

```diff
-_SETTINGS_CACHE: Optional[Settings] = None
-_LAST_SECRET: Optional[str] = None
+_SETTINGS_CACHE: Optional[Settings] = None
+_LAST_SECRET_ENV: Optional[str] = None
 @@
-    global _SETTINGS_CACHE, _LAST_SECRET
-    current_secret = os.getenv("HYGRAPH_WEBHOOK_SECRET")
-    if _SETTINGS_CACHE is None or _LAST_SECRET != current_secret:
-        settings = Settings()
-        _SETTINGS_CACHE = settings
-        _LAST_SECRET = settings.hygraph_webhook_secret
+    global _SETTINGS_CACHE, _LAST_SECRET_ENV
+    current_secret = os.getenv("HYGRAPH_WEBHOOK_SECRET")
+    if _SETTINGS_CACHE is None or _LAST_SECRET_ENV != current_secret:
+        settings = Settings()
+        _SETTINGS_CACHE = settings
+        _LAST_SECRET_ENV = current_secret
```
```python
tracer_provider = TracerProvider(resource=resource)
span_exporter = OTLPSpanExporter(
    endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "otel-collector:4317"),
    insecure=True,
)
tracer_provider.add_span_processor(BatchSpanProcessor(span_exporter))
trace.set_tracer_provider(tracer_provider)

metric_exporter = OTLPMetricExporter(
    endpoint=os.getenv("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT", "otel-collector:4317"),
    insecure=True,
)
reader = PeriodicExportingMetricReader(metric_exporter)
meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(meter_provider)
```
Allow secure OTLP connections instead of forcing plaintext.

Both exporters hard-code `insecure=True`, so the process can never talk to a TLS-enabled collector, even if `OTEL_EXPORTER_OTLP_ENDPOINT`/`..._METRICS_ENDPOINT` points to an HTTPS target. This breaks the standard OTLP deployment pattern where collectors require TLS. Please make the insecure flag configurable (e.g., via the existing `OTEL_EXPORTER_OTLP_INSECURE` and `OTEL_EXPORTER_OTLP_METRICS_INSECURE` env vars) and default to `False` unless explicitly requested.
Example adjustment:

```diff
+def _env_flag(name: str, default: bool = False) -> bool:
+    raw = os.getenv(name)
+    if raw is None:
+        return default
+    return raw.strip().lower() in {"1", "true", "yes", "on"}
 @@
-    span_exporter = OTLPSpanExporter(endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "otel-collector:4317"), insecure=True)
+    span_exporter = OTLPSpanExporter(
+        endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "otel-collector:4317"),
+        insecure=_env_flag("OTEL_EXPORTER_OTLP_INSECURE", default=True),
+    )
 @@
-    metric_exporter = OTLPMetricExporter(
-        endpoint=os.getenv("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT", "otel-collector:4317"),
-        insecure=True,
-    )
+    metric_exporter = OTLPMetricExporter(
+        endpoint=os.getenv("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT", "otel-collector:4317"),
+        insecure=_env_flag("OTEL_EXPORTER_OTLP_METRICS_INSECURE", default=True),
+    )
```

Committable suggestion skipped: line range outside the PR's diff.
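As a standalone sketch, the flag parser proposed in that adjustment might look like the following (same truthy-string semantics; the top-level name `env_flag` is illustrative, not taken from the PR):

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Parse a boolean environment variable.

    Unset variables fall back to the default; set values are trimmed,
    lowercased, and matched against common truthy spellings.
    """
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```

Note the tension in the suggestion itself: the prose recommends defaulting to secure (`False`), while the diff passes `default=True` to preserve current plaintext behavior; which default to ship is a deployment decision.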
backend/api/metrics.py (outdated)
```python
def record_http_request(method: str, route: str, status_code: int, duration_ms: float) -> None:
    attributes = {
        "http.method": method.upper(),
        "http.route": route or "unknown",
        "http.status_code": str(status_code),
    }
    _http_request_counter.add(1, attributes)
    _http_request_duration_histogram.record(float(duration_ms), attributes)
```
Keep `http.status_code` as an integer attribute.

OpenTelemetry's semantic conventions define `http.status_code` as an integer. Casting it to a string produces a separate attribute namespace and breaks parity with other SDKs/exporters, making dashboards harder to merge. Please emit it as `int(status_code)` instead of a string.
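The corrected attribute construction can be sketched as a pure function (extracted for illustration; `http_metric_attributes` is not a name from the PR, and the dict would feed the counter and histogram shown above):

```python
def http_metric_attributes(method: str, route: str, status_code: int) -> dict:
    """Build OTel metric attributes, keeping http.status_code an int per semconv."""
    return {
        "http.method": method.upper(),
        "http.route": route or "unknown",
        "http.status_code": int(status_code),  # int, not str(status_code)
    }
```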
Actionable comments posted: 2
♻️ Duplicate comments (1)
backend/api/main.py (1)
91-94: HTTPException responses still misreport status codes.

This is the same issue flagged in the past review: when `call_next` raises an `HTTPException` (e.g., 404, 422), the generic `except Exception` block catches it, but `status_code` remains 500. The span is marked as an error and metrics record a 500 instead of the actual 4xx status. This misreports telemetry for expected client errors.

Handle `HTTPException` separately before the generic exception handler:

```diff
-from fastapi import FastAPI, Request, Response
+from fastapi import FastAPI, HTTPException, Request, Response
 @@
     try:
         response = await call_next(request)
         status_code = response.status_code
         span.set_attribute("http.status_code", status_code)
         if status_code >= 500:
             span.set_status(Status(StatusCode.ERROR))
+    except HTTPException as exc:
+        status_code = exc.status_code
+        span.set_attribute("http.status_code", status_code)
+        if status_code >= 500:
+            span.set_status(Status(StatusCode.ERROR))
+        raise
     except Exception as exc:  # noqa: BLE001
         span.record_exception(exc)
         span.set_status(Status(StatusCode.ERROR))
         raise
```
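The status selection in that diff reduces to logic like the following dependency-free sketch. FastAPI's `HTTPException` carries a `status_code` attribute, which the fake class in the test below only mimics; `resolve_error_status` is an illustrative name, not part of the PR:

```python
def resolve_error_status(exc: BaseException) -> int:
    """Map an exception raised during request handling to the status to record.

    HTTPException-style errors carry their own status; anything else is a 500.
    """
    return getattr(exc, "status_code", 500)
```

In the middleware, this would set `status_code` inside the except blocks so that the span attribute and `record_http_request` report 4xx client errors as 4xx rather than 500.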
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
- backend/api/main.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/api/main.py (3)
- backend/api/config.py (1): `get_settings` (69-78)
- backend/api/instrumentation_boot.py (1): `setup_instrumentation` (74-110)
- backend/api/metrics.py (1): `record_http_request` (63-70)
🔇 Additional comments (5)

backend/api/main.py (5)

3-12: LGTM! Imports are appropriate for OpenTelemetry instrumentation. The imports correctly bring in timing, context management, and OpenTelemetry primitives needed for the custom telemetry middleware.

65-65: LGTM! HTTP tracer serves as a fallback. The tracer is used in the middleware (line 78) only when no valid span exists. With `FastAPIInstrumentor` active, this acts as a safety net.

74-78: LGTM! Span reuse pattern prevents duplicate spans. The `nullcontext` pattern correctly reuses the span created by `FastAPIInstrumentor`, avoiding nested duplicate spans while allowing the middleware to add custom attributes and metrics.

96-100: LGTM! Route extraction in finally block avoids high-cardinality metrics. Moving `http.route` extraction to the finally block (after `call_next`) ensures the route template is available and avoids the high-cardinality issue mentioned in past reviews where concrete paths were recorded instead of templates.

16-17: Avoid redundant HTTP attribute assignments. `FastAPIInstrumentor.instrument_app` and your `telemetry_middleware` both set standard HTTP attributes (`http.method`, `http.status_code`, `http.route`). The middleware reuses spans but duplicates these assignments. `deployment.environment` and `record_http_request` are valuable additions; remove or document the duplicate HTTP attribute sets and verify which attributes your FastAPI instrumentation library version already populates.
```python
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            record_http_request(method, path_template, status_code, duration_ms)

    return response if response is not None else Response(status_code=status_code)
```
🛠️ Refactor suggestion | 🟠 Major
Remove unnecessary response fallback.

The `else Response(status_code=status_code)` fallback is unreachable. If an exception occurs, it's re-raised at line 94, so execution never reaches line 102. If no exception occurs, `response` is always set at line 86. The fallback obscures this guarantee and could hide bugs if `call_next` unexpectedly returns `None`.

```diff
-    return response if response is not None else Response(status_code=status_code)
+    return response
```
+ return response📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| return response if response is not None else Response(status_code=status_code) | |
| return response |
```python
async def metrics() -> Dict[str, str]:
    """Compatibility endpoint for legacy Prometheus scrapes."""
    return {
        "detail": "Metrics are exported via OpenTelemetry OTLP; no local payload is available.",
    }
```
🧹 Nitpick | 🔵 Trivial
Confirm the Prometheus scrape deprecation.

The `/metrics` endpoint now returns a JSON message instead of Prometheus-formatted metrics. This breaks Prometheus scrapers that expect the standard format. If Prometheus scraping is no longer needed (as the "legacy" docstring suggests), consider returning a 410 Gone status or adding a deprecation timeline to help clients migrate.
If Prometheus is still required during a transition period, you could expose both formats by checking the Accept header:

```python
@app.get("/metrics")
async def metrics(request: Request) -> Response:
    accept = request.headers.get("accept", "")
    if "application/openmetrics-text" in accept or "text/plain" in accept:
        # Return 410 Gone or a deprecation notice in Prometheus format
        return Response(
            content="# Metrics moved to OTLP export\n",
            status_code=410,
            media_type="text/plain",
        )
    return JSONResponse({
        "detail": "Metrics are exported via OpenTelemetry OTLP; no local payload is available.",
    })
```
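The Accept-header branch in that suggestion reduces to a check like this (`wants_prometheus_format` is an illustrative name, not part of the PR; the substring test mirrors the suggestion rather than performing full content negotiation):

```python
def wants_prometheus_format(accept_header: str) -> bool:
    """True when the client asks for a Prometheus/OpenMetrics text payload."""
    accept = accept_header.lower()
    return "application/openmetrics-text" in accept or "text/plain" in accept
```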
Note: Docstrings generation - SUCCESS

Docstrings generation was requested by @shayancoin in #82 (comment). The following files were modified:
- `backend/api/config.py`
- `backend/api/instrumentation_boot.py`
- `backend/api/main.py`
- `backend/api/metrics.py`
- `backend/api/routes_sync.py`
- `backend/api/security.py`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Shayan <shayan@coin.link>
Summary
Testing
https://chatgpt.com/codex/tasks/task_e_68f12ad981288330ae4b2060a1340571