Skip to content

Dont dump error responses before logging#224

Open
JacobCoffee wants to merge 3 commits intomainfrom
datadog-fastly-errors
Open

Dont dump error responses before logging#224
JacobCoffee wants to merge 3 commits intomainfrom
datadog-fastly-errors

Conversation

@JacobCoffee
Copy link
Copy Markdown
Member

The VCL's 5xx handler was replacing error responses with stale/synthetic 2xx responses before logging fired, so the status >= 5000 log condition never matched.

@JacobCoffee JacobCoffee requested a review from Copilot March 9, 2026 23:03
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30d477ea81

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates Fastly VCL + Terraform logging configuration so origin 5xx events are still logged even when Fastly serves stale or restarts (which can turn the delivered response into a non-5xx).

Changes:

  • Set persistent request headers in vcl_fetch when the origin responds 5xx to carry origin error context through stale/restart.
  • Switch the Datadog logging endpoint to trigger on a new Origin 5xx Error response condition and include origin_status in the log payload.
  • Add a new Fastly condition to detect origin 5xx events via the request header flag.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
terraform/warehouse/vcl/main.vcl Flags origin 5xx responses before stale/restart so later logging can still detect them.
terraform/warehouse/main.tf Updates Datadog logging to use a new origin-5xx condition and logs the origin status code.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 228 to +232

condition {
name = "Origin 5xx Error"
type = "RESPONSE"
statement = "req.http.X-Origin-5xx == \"true\""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Two logging gaps around the new Origin 5xx Error condition: (1) Fastly-generated 503s (backend unreachable, DNS failure) bypass vcl_fetch entirely, so X-Origin-5xx is never set and these client-facing 503s silently disappear from Datadog — a minor regression from the old resp.status >= 500 condition which did catch them. (2) The S3 Error Logs (~line 162) still use the old 5xx Error response condition, which has the same stale-masking problem this PR fixes for Datadog; consider updating it to use Origin 5xx Error with placement = "none" as well.

Extended reasoning...

Fastly-generated 503s no longer logged to Datadog

The new Origin 5xx Error condition checks req.http.X-Origin-5xx == "true", which is only set in vcl_fetch (line ~379). When the backend is completely unreachable (connection timeout, DNS failure, connection refused), Fastly generates a 503 internally and jumps directly to vcl_error, bypassing vcl_fetch entirely. In this path, X-Origin-5xx is never set.

Step-by-step proof:

  1. Client sends GET request to PyPI.
  2. Backend is unreachable (e.g., DNS failure).
  3. Fastly cannot establish a connection — vcl_fetch is never entered.
  4. Fastly generates a synthetic 503 and enters vcl_error.
  5. In vcl_error (line ~487), if stale.exists is false, a synthetic HTML 503 is served to the client.
  6. At log time, req.http.X-Origin-5xx is unset, so the Origin 5xx Error condition evaluates to false.
  7. The Datadog log entry is not emitted, even though the client received a real 503.

With the old condition (resp.status >= 500), the synthetic 503 response status would have matched, and the error would have been logged. This is a behavioral regression, though it only affects the narrow case of backend-unreachable + no stale content available.

S3 Error Logs still use the broken condition

The S3 Error Logs at line ~162 still reference response_condition = "5xx Error", which checks resp.status >= 500 && resp.status < 600. This has the exact same stale-masking problem the PR correctly identifies and fixes for Datadog: when the origin returns a 5xx but stale content is served instead, resp.status becomes 200 and the S3 log condition never fires. The new Origin 5xx Error condition is already defined and could be referenced here too.

Impact assessment

The Fastly-generated 503 regression is narrow in scope: it requires both a completely unreachable backend AND no stale content. Backend outages are typically caught by health checks and other monitoring. The S3 Error Logs issue is pre-existing — this PR does not make it worse. The S3 logs with the old condition do still provide partial coverage for Fastly-generated 503s. The naming choice of "Origin 5xx Error" suggests the narrowing to origin-only errors may be intentional.

Suggested fix

For the Fastly-generated 503 gap, consider also setting req.http.X-Origin-5xx in vcl_error when obj.status >= 500, or adding a secondary Datadog logging condition that catches this case. For S3 Error Logs, update the response_condition from "5xx Error" to "Origin 5xx Error" and add placement = "none" to match the Datadog logger pattern.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worthh folowing up on both

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants