Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 30d477ea81
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
Updates Fastly VCL + Terraform logging configuration so origin 5xx events are still logged even when Fastly serves stale or restarts (which can turn the delivered response into a non-5xx).
Changes:
- Set persistent request headers in
vcl_fetchwhen the origin responds 5xx to carry origin error context through stale/restart. - Switch the Datadog logging endpoint to trigger on a new
Origin 5xx Errorresponse condition and includeorigin_statusin the log payload. - Add a new Fastly
conditionto detect origin 5xx events via the request header flag.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| terraform/warehouse/vcl/main.vcl | Flags origin 5xx responses before stale/restart so later logging can still detect them. |
| terraform/warehouse/main.tf | Updates Datadog logging to use a new origin-5xx condition and logs the origin status code. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| condition { | ||
| name = "Origin 5xx Error" | ||
| type = "RESPONSE" | ||
| statement = "req.http.X-Origin-5xx == \"true\"" |
There was a problem hiding this comment.
🟡 Two logging gaps around the new Origin 5xx Error condition: (1) Fastly-generated 503s (backend unreachable, DNS failure) bypass vcl_fetch entirely, so X-Origin-5xx is never set and these client-facing 503s silently disappear from Datadog — a minor regression from the old resp.status >= 500 condition which did catch them. (2) The S3 Error Logs (~line 162) still use the old 5xx Error response condition, which has the same stale-masking problem this PR fixes for Datadog; consider updating it to use Origin 5xx Error with placement = "none" as well.
Extended reasoning...
Fastly-generated 503s no longer logged to Datadog
The new Origin 5xx Error condition checks req.http.X-Origin-5xx == "true", which is only set in vcl_fetch (line ~379). When the backend is completely unreachable (connection timeout, DNS failure, connection refused), Fastly generates a 503 internally and jumps directly to vcl_error, bypassing vcl_fetch entirely. In this path, X-Origin-5xx is never set.
Step-by-step proof:
- Client sends GET request to PyPI.
- Backend is unreachable (e.g., DNS failure).
- Fastly cannot establish a connection —
vcl_fetchis never entered. - Fastly generates a synthetic 503 and enters
vcl_error. - In
vcl_error(line ~487), ifstale.existsis false, a synthetic HTML 503 is served to the client. - At log time,
req.http.X-Origin-5xxis unset, so theOrigin 5xx Errorcondition evaluates to false. - The Datadog log entry is not emitted, even though the client received a real 503.
With the old condition (resp.status >= 500), the synthetic 503 response status would have matched, and the error would have been logged. This is a behavioral regression, though it only affects the narrow case of backend-unreachable + no stale content available.
S3 Error Logs still use the broken condition
The S3 Error Logs at line ~162 still reference response_condition = "5xx Error", which checks resp.status >= 500 && resp.status < 600. This has the exact same stale-masking problem the PR correctly identifies and fixes for Datadog: when the origin returns a 5xx but stale content is served instead, resp.status becomes 200 and the S3 log condition never fires. The new Origin 5xx Error condition is already defined and could be referenced here too.
Impact assessment
The Fastly-generated 503 regression is narrow in scope: it requires both a completely unreachable backend AND no stale content. Backend outages are typically caught by health checks and other monitoring. The S3 Error Logs issue is pre-existing — this PR does not make it worse. The S3 logs with the old condition do still provide partial coverage for Fastly-generated 503s. The naming choice of "Origin 5xx Error" suggests the narrowing to origin-only errors may be intentional.
Suggested fix
For the Fastly-generated 503 gap, consider also setting req.http.X-Origin-5xx in vcl_error when obj.status >= 500, or adding a secondary Datadog logging condition that catches this case. For S3 Error Logs, update the response_condition from "5xx Error" to "Origin 5xx Error" and add placement = "none" to match the Datadog logger pattern.
There was a problem hiding this comment.
worthh folowing up on both
The VCL's 5xx handler was replacing error responses with stale/synthetic 2xx responses before logging fired, so the
status >= 5000log condition never matched.