Skip to content

Commit 5d5dc97

Browse files
committed
docs: add plan to fix a silent DNS error
1 parent 18b64d2 commit 5d5dc97

File tree

1 file changed

+335
-0
lines changed

1 file changed

+335
-0
lines changed

docs/dns-error-handling-fix.md

Lines changed: 335 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,335 @@
1+
# DNS Error Handling Implementation Plan
2+
3+
## Problem Statement
4+
5+
DNS resolution errors are not consistently propagated to the response channel when HTTP tasks execute. This results in missing error responses for certain DNS policy configurations, making it difficult for clients to detect and respond to DNS failures.
6+
7+
## Observed Behavior
8+
9+
### Working Cases
10+
- **DNSRefreshStatic**: Errors are properly reported when targeting non-existing resources
11+
- Static mode bypasses DNS entirely
12+
- Connection failures at TCP level bubble up through `client.Do()`
13+
- Error responses are sent to response channel via `task_http.go:197`
14+
15+
### Failing Cases
16+
- **DNSRefreshTTL**: DNS lookup failures may not generate error responses
17+
- **DNSRefreshCadence**: DNS lookup failures may not generate error responses
18+
- **DNSRefreshDefault** (suspected): May inherit problematic behavior in certain configurations
19+
20+
## Root Cause Analysis
21+
22+
### DNS Fallback Logic Swallows Errors
23+
24+
When DNS lookups fail in TTL or Cadence modes and fallback is enabled, the error is recorded in the `DNSDecision` struct but the request continues with a cached address. This prevents error propagation to the response channel.
25+
26+
#### Code Location: dns_policy_manager.go:234-249 (resolveTTL)
27+
28+
```go
29+
addrs, ttl, lookupDur, err := m.lookup(ctx, host)
30+
if err != nil {
31+
if len(cached.addrs) > 0 && m.policy.fallbackEnabled() {
32+
addr := net.JoinHostPort(cached.addrs[0].String(), port)
33+
ttl := nonNegativeDuration(cached.expiresAt.Sub(cached.lastLookup))
34+
decision := DNSDecision{
35+
Host: host,
36+
Mode: m.policy.Mode,
37+
ResolvedAddrs: cached.addrs,
38+
TTL: ttl,
39+
ExpiresAt: cached.expiresAt,
40+
Err: err, // ← Error recorded here
41+
}
42+
return dnsResolveOutcome{address: addr, decision: &decision}, nil // ← Returns SUCCESS
43+
}
44+
return dnsResolveOutcome{}, err
45+
}
46+
```
47+
48+
**What happens:**
49+
1. DNS lookup fails (`err != nil`)
50+
2. Code checks if cached address exists and fallback is enabled
51+
3. Returns cached address with `nil` error
52+
4. DNS error only stored in `DNSDecision.Err` field
53+
5. Request proceeds with potentially stale/incorrect address
54+
6. **No error response sent to response channel**
55+
56+
#### Code Location: dns_policy_manager.go:303-316 (resolveCadence)
57+
58+
Same pattern exists in cadence mode:
59+
60+
```go
61+
addrs, ttl, lookupDur, err := m.lookup(ctx, host)
62+
if err != nil {
63+
if len(cached.addrs) > 0 && m.policy.fallbackEnabled() {
64+
addr := net.JoinHostPort(cached.addrs[0].String(), port)
65+
decision := DNSDecision{
66+
Host: host,
67+
Mode: m.policy.Mode,
68+
ResolvedAddrs: cached.addrs,
69+
ExpiresAt: cadenceNext,
70+
Err: err, // ← Error recorded but not propagated
71+
}
72+
return dnsResolveOutcome{address: addr, decision: &decision}, nil // ← Returns SUCCESS
73+
}
74+
return dnsResolveOutcome{}, err
75+
}
76+
```
77+
78+
### Error Flow Analysis
79+
80+
#### Normal Error Path (Working)
81+
1. DNS/connection error occurs
82+
2. `client.Do()` returns error (`task_http.go:192`)
83+
3. `dnsMgr.observeResult()` called with error (`task_http.go:195`)
84+
4. Error response sent to channel (`task_http.go:197`)
85+
5. Early return with error (`task_http.go:198`)
86+
87+
#### Broken Error Path (Fallback Enabled)
88+
1. DNS lookup fails in `prepareRequest()``resolve()`
89+
2. Fallback logic returns `nil` error with cached address
90+
3. `RoundTrip()` proceeds with cached address (`dns_policy_manager.go:495`)
91+
4. Request may succeed with stale address OR fail later at connection level
92+
5. Original DNS error never reaches response channel
93+
94+
## Affected Components
95+
96+
### Primary Files
97+
- `dns_policy_manager.go`: Contains fallback logic that swallows errors
98+
- Lines 234-249: `resolveTTL()` fallback path
99+
- Lines 303-316: `resolveCadence()` fallback path
100+
101+
### Secondary Files
102+
- `task_http.go`: Error handling in `httpRequest.Execute()`
103+
- Lines 192-198: Normal error path (working correctly)
104+
- `dns_policy.go`: Policy configuration and validation
105+
- Lines 147-150: `fallbackEnabled()` method
106+
107+
## Impact Assessment
108+
109+
### Critical
110+
- Monitoring systems cannot detect DNS resolution failures
111+
- Failed requests may appear successful if cached address is reused
112+
- Guard rail mechanisms may not trigger appropriately
113+
- Silent failures violate principle of least surprise
114+
115+
### Moderate
116+
- DNS decision hooks receive error information but most clients don't use them
117+
- Fallback behavior may be intentional for resilience, but should be explicit
118+
119+
### Low
120+
- Static mode works correctly (no DNS involved)
121+
- Single lookup mode works correctly for initial lookup failure
122+
123+
## Implementation Options
124+
125+
### Option 1: Always Propagate DNS Errors (Strictest)
126+
127+
**Change:** Remove fallback behavior entirely or make it opt-in via explicit flag
128+
129+
**Pros:**
130+
- Clear error semantics
131+
- Monitoring systems see all failures
132+
- Predictable behavior
133+
134+
**Cons:**
135+
- May break existing resilience assumptions
136+
- Could increase error volume in transient DNS issues
137+
- Breaking change for users relying on fallback
138+
139+
### Option 2: Report DNS Errors While Using Fallback (Balanced)
140+
141+
**Change:** Send error response to channel even when falling back to cached address
142+
143+
**Implementation:**
144+
```go
145+
if err != nil {
146+
if len(cached.addrs) > 0 && m.policy.fallbackEnabled() {
147+
// Create decision with error
148+
decision := DNSDecision{
149+
Host: host,
150+
Mode: m.policy.Mode,
151+
ResolvedAddrs: cached.addrs,
152+
Err: err,
153+
FallbackUsed: true, // New field
154+
}
155+
// Return outcome but preserve error for reporting
156+
return dnsResolveOutcome{
157+
address: addr,
158+
decision: &decision,
159+
shouldReport: true, // New field to trigger error response
160+
}, err // Return error instead of nil
161+
}
162+
return dnsResolveOutcome{}, err
163+
}
164+
```
165+
166+
**Pros:**
167+
- Maintains resilience of fallback behavior
168+
- Errors visible to monitoring systems
169+
- Clear indication of degraded state
170+
171+
**Cons:**
172+
- Requires changes to `dnsResolveOutcome` struct
173+
- Needs new logic to send error while continuing request
174+
- More complex error handling
175+
176+
### Option 3: Explicit Fallback Mode with Warnings (Most Flexible)
177+
178+
**Change:** Add `DNSFallbackBehavior` enum to policy configuration
179+
180+
```go
181+
type DNSFallbackBehavior uint8
182+
183+
const (
184+
DNSFallbackError DNSFallbackBehavior = iota // Fail immediately on DNS error
185+
DNSFallbackWarn // Use cache but send warning response
186+
DNSFallbackSilent // Current behavior (use cache, no error)
187+
)
188+
189+
type DNSPolicy struct {
190+
// ... existing fields ...
191+
FallbackBehavior DNSFallbackBehavior // Replace DisableFallback bool
192+
}
193+
```
194+
195+
**Pros:**
196+
- Maximum flexibility for different use cases
197+
- Non-breaking (default to Silent for backward compat)
198+
- Explicit control over error reporting
199+
200+
**Cons:**
201+
- More configuration complexity
202+
- Requires additional testing scenarios
203+
- Documentation burden
204+
205+
## Recommended Solution
206+
207+
**Option 2** (Report DNS Errors While Using Fallback) provides the best balance:
208+
209+
1. **Preserve resilience**: Fallback to cached addresses still works
210+
2. **Improve observability**: DNS errors always reported to response channel
211+
3. **Clear semantics**: Clients can distinguish between clean success and degraded fallback
212+
4. **Non-breaking**: Existing fallback behavior continues, just with better error visibility
213+
214+
### Implementation Steps
215+
216+
1. **Modify `dnsResolveOutcome` struct** (dns_policy_manager.go):
217+
```go
218+
type dnsResolveOutcome struct {
219+
address string
220+
forceNewConn bool
221+
decision *DNSDecision
222+
fallbackErr error // New: DNS error when fallback is used
223+
}
224+
```
225+
226+
2. **Update `resolveTTL()` fallback path** (dns_policy_manager.go:234-249):
227+
```go
228+
if err != nil {
229+
if len(cached.addrs) > 0 && m.policy.fallbackEnabled() {
230+
// ... create decision ...
231+
return dnsResolveOutcome{
232+
address: addr,
233+
decision: &decision,
234+
fallbackErr: err, // Preserve error for reporting
235+
}, nil
236+
}
237+
return dnsResolveOutcome{}, err
238+
}
239+
```
240+
241+
3. **Update `resolveCadence()` fallback path** (dns_policy_manager.go:303-316):
242+
Same pattern as `resolveTTL()`
243+
244+
4. **Modify `prepareRequest()` to check for fallback errors** (dns_policy_manager.go:82-133):
245+
```go
246+
outcome, err := m.resolve(ctx, host, port)
247+
if err != nil {
248+
// ... existing error handling ...
249+
}
250+
251+
// Check if fallback was used (new)
252+
if outcome.fallbackErr != nil && outcome.decision != nil {
253+
outcome.decision.FallbackUsed = true
254+
// Return error context for caller to handle
255+
return ctx, outcome.forceNewConn, outcome.fallbackErr
256+
}
257+
```
258+
259+
5. **Add error handling in `policyTransport.RoundTrip()`** (dns_policy_manager.go:485-496):
260+
```go
261+
func (p *policyTransport) RoundTrip(req *http.Request) (*http.Response, error) {
262+
ctx, _, err := p.mgr.prepareRequest(req.Context(), req.URL, p.base)
263+
if err != nil {
264+
// DNS error occurred (either hard failure or fallback scenario)
265+
return nil, err
266+
}
267+
268+
newReq := req.Clone(ctx)
269+
return p.base.RoundTrip(newReq)
270+
}
271+
```
272+
273+
6. **Add `FallbackUsed` field to `DNSDecision`** (dns_policy.go):
274+
```go
275+
type DNSDecision struct {
276+
// ... existing fields ...
277+
FallbackUsed bool // New: indicates cached address used due to DNS error
278+
}
279+
```
280+
281+
7. **Add comprehensive tests**:
282+
- Test DNS failure with fallback enabled (should send error response)
283+
- Test DNS failure with fallback disabled (should send error response)
284+
- Test DNS failure without cached address (should send error response)
285+
- Verify decision hook receives correct fallback state
286+
287+
### Testing Strategy
288+
289+
1. **Unit tests** for each `resolve*()` method:
290+
- DNS failure with cached address and fallback enabled
291+
- DNS failure with cached address and fallback disabled
292+
- DNS failure without cached address
293+
294+
2. **Integration tests** for end-to-end flow:
295+
- HTTP task execution with DNS failure scenarios
296+
- Verify error responses sent to response channel
297+
- Verify guard rail mechanisms trigger appropriately
298+
299+
3. **Regression tests**:
300+
- Ensure static mode continues to work
301+
- Ensure default mode continues to work
302+
- Ensure normal success paths unchanged
303+
304+
## Migration Considerations
305+
306+
### Backward Compatibility
307+
- Existing fallback behavior preserved (requests still use cached addresses)
308+
- New error reporting may increase error response volume
309+
- Clients should be prepared to handle errors that were previously silent
310+
311+
### Documentation Updates
312+
- Update DNS policy documentation to explain fallback error reporting
313+
- Add examples showing how to handle DNS fallback errors
314+
- Document difference between hard DNS failures and fallback scenarios
315+
316+
### Monitoring Impact
317+
- Error dashboards may show increased error counts (this is expected and correct)
318+
- Add metrics to distinguish between:
319+
- Hard DNS failures (no cached address)
320+
- Soft DNS failures (fallback to cached address)
321+
- Guard rail triggers due to DNS issues
322+
323+
## Future Enhancements
324+
325+
1. **Fallback address TTL/expiration**: Don't fall back to very stale cached addresses
326+
2. **Fallback metrics**: Track how often fallback is used per endpoint
327+
3. **Configurable fallback policy**: Per-endpoint control over fallback behavior
328+
4. **DNS error classification**: Distinguish temporary vs permanent DNS failures
329+
5. **Circuit breaker integration**: Use DNS error patterns to trigger circuit breakers
330+
331+
## References
332+
333+
- Issue discovered: DNS resolution failures not producing error responses
334+
- Affected files: `dns_policy_manager.go`, `task_http.go`, `dns_policy.go`
335+
- Related guard rail mechanism: `DNSGuardRailPolicy` in `dns_policy.go`

0 commit comments

Comments
 (0)