|
| 1 | +# DNS Error Handling Implementation Plan |
| 2 | + |
| 3 | +## Problem Statement |
| 4 | + |
| 5 | +DNS resolution errors are not consistently propagated to the response channel when HTTP tasks execute. This results in missing error responses for certain DNS policy configurations, making it difficult for clients to detect and respond to DNS failures. |
| 6 | + |
| 7 | +## Observed Behavior |
| 8 | + |
| 9 | +### Working Cases |
| 10 | +- **DNSRefreshStatic**: Errors are properly reported when targeting non-existing resources |
| 11 | + - Static mode bypasses DNS entirely |
| 12 | + - Connection failures at TCP level bubble up through `client.Do()` |
| 13 | + - Error responses are sent to response channel via `task_http.go:197` |
| 14 | + |
| 15 | +### Failing Cases |
| 16 | +- **DNSRefreshTTL**: DNS lookup failures may not generate error responses |
| 17 | +- **DNSRefreshCadence**: DNS lookup failures may not generate error responses |
| 18 | +- **DNSRefreshDefault** (suspected): May inherit problematic behavior in certain configurations |
| 19 | + |
| 20 | +## Root Cause Analysis |
| 21 | + |
| 22 | +### DNS Fallback Logic Swallows Errors |
| 23 | + |
| 24 | +When DNS lookups fail in TTL or Cadence modes and fallback is enabled, the error is recorded in the `DNSDecision` struct but the request continues with a cached address. This prevents error propagation to the response channel. |
| 25 | + |
| 26 | +#### Code Location: dns_policy_manager.go:234-249 (resolveTTL) |
| 27 | + |
| 28 | +```go |
| 29 | +addrs, ttl, lookupDur, err := m.lookup(ctx, host) |
| 30 | +if err != nil { |
| 31 | + if len(cached.addrs) > 0 && m.policy.fallbackEnabled() { |
| 32 | + addr := net.JoinHostPort(cached.addrs[0].String(), port) |
| 33 | + ttl := nonNegativeDuration(cached.expiresAt.Sub(cached.lastLookup)) |
| 34 | + decision := DNSDecision{ |
| 35 | + Host: host, |
| 36 | + Mode: m.policy.Mode, |
| 37 | + ResolvedAddrs: cached.addrs, |
| 38 | + TTL: ttl, |
| 39 | + ExpiresAt: cached.expiresAt, |
| 40 | + Err: err, // ← Error recorded here |
| 41 | + } |
| 42 | + return dnsResolveOutcome{address: addr, decision: &decision}, nil // ← Returns SUCCESS |
| 43 | + } |
| 44 | + return dnsResolveOutcome{}, err |
| 45 | +} |
| 46 | +``` |
| 47 | + |
| 48 | +**What happens:** |
| 49 | +1. DNS lookup fails (`err != nil`) |
| 50 | +2. Code checks if cached address exists and fallback is enabled |
| 51 | +3. Returns cached address with `nil` error |
| 52 | +4. DNS error only stored in `DNSDecision.Err` field |
| 53 | +5. Request proceeds with potentially stale/incorrect address |
| 54 | +6. **No error response sent to response channel** |
| 55 | + |
| 56 | +#### Code Location: dns_policy_manager.go:303-316 (resolveCadence) |
| 57 | + |
| 58 | +Same pattern exists in cadence mode: |
| 59 | + |
| 60 | +```go |
| 61 | +addrs, ttl, lookupDur, err := m.lookup(ctx, host) |
| 62 | +if err != nil { |
| 63 | + if len(cached.addrs) > 0 && m.policy.fallbackEnabled() { |
| 64 | + addr := net.JoinHostPort(cached.addrs[0].String(), port) |
| 65 | + decision := DNSDecision{ |
| 66 | + Host: host, |
| 67 | + Mode: m.policy.Mode, |
| 68 | + ResolvedAddrs: cached.addrs, |
| 69 | + ExpiresAt: cadenceNext, |
| 70 | + Err: err, // ← Error recorded but not propagated |
| 71 | + } |
| 72 | + return dnsResolveOutcome{address: addr, decision: &decision}, nil // ← Returns SUCCESS |
| 73 | + } |
| 74 | + return dnsResolveOutcome{}, err |
| 75 | +} |
| 76 | +``` |
| 77 | + |
| 78 | +### Error Flow Analysis |
| 79 | + |
| 80 | +#### Normal Error Path (Working) |
| 81 | +1. DNS/connection error occurs |
| 82 | +2. `client.Do()` returns error (`task_http.go:192`) |
| 83 | +3. `dnsMgr.observeResult()` called with error (`task_http.go:195`) |
| 84 | +4. Error response sent to channel (`task_http.go:197`) |
| 85 | +5. Early return with error (`task_http.go:198`) |
| 86 | + |
| 87 | +#### Broken Error Path (Fallback Enabled) |
| 88 | +1. DNS lookup fails in `prepareRequest()` → `resolve()` |
| 89 | +2. Fallback logic returns `nil` error with cached address |
| 90 | +3. `RoundTrip()` proceeds with cached address (`dns_policy_manager.go:495`) |
| 91 | +4. Request may succeed with stale address OR fail later at connection level |
| 92 | +5. Original DNS error never reaches response channel |
| 93 | + |
| 94 | +## Affected Components |
| 95 | + |
| 96 | +### Primary Files |
| 97 | +- `dns_policy_manager.go`: Contains fallback logic that swallows errors |
| 98 | + - Lines 234-249: `resolveTTL()` fallback path |
| 99 | + - Lines 303-316: `resolveCadence()` fallback path |
| 100 | + |
| 101 | +### Secondary Files |
| 102 | +- `task_http.go`: Error handling in `httpRequest.Execute()` |
| 103 | + - Lines 192-198: Normal error path (working correctly) |
| 104 | +- `dns_policy.go`: Policy configuration and validation |
| 105 | + - Lines 147-150: `fallbackEnabled()` method |
| 106 | + |
| 107 | +## Impact Assessment |
| 108 | + |
| 109 | +### Critical |
| 110 | +- Monitoring systems cannot detect DNS resolution failures |
| 111 | +- Failed requests may appear successful if cached address is reused |
| 112 | +- Guard rail mechanisms may not trigger appropriately |
| 113 | +- Silent failures violate principle of least surprise |
| 114 | + |
| 115 | +### Moderate |
| 116 | +- DNS decision hooks receive error information but most clients don't use them |
| 117 | +- Fallback behavior may be intentional for resilience, but should be explicit |
| 118 | + |
| 119 | +### Low |
| 120 | +- Static mode works correctly (no DNS involved) |
| 121 | +- Single lookup mode works correctly for initial lookup failure |
| 122 | + |
| 123 | +## Implementation Options |
| 124 | + |
| 125 | +### Option 1: Always Propagate DNS Errors (Strictest) |
| 126 | + |
| 127 | +**Change:** Remove fallback behavior entirely or make it opt-in via explicit flag |
| 128 | + |
| 129 | +**Pros:** |
| 130 | +- Clear error semantics |
| 131 | +- Monitoring systems see all failures |
| 132 | +- Predictable behavior |
| 133 | + |
| 134 | +**Cons:** |
| 135 | +- May break existing resilience assumptions |
| 136 | +- Could increase error volume in transient DNS issues |
| 137 | +- Breaking change for users relying on fallback |
| 138 | + |
| 139 | +### Option 2: Report DNS Errors While Using Fallback (Balanced) |
| 140 | + |
| 141 | +**Change:** Send error response to channel even when falling back to cached address |
| 142 | + |
| 143 | +**Implementation:** |
| 144 | +```go |
| 145 | +if err != nil { |
| 146 | + if len(cached.addrs) > 0 && m.policy.fallbackEnabled() { |
| 147 | + // Create decision with error |
| 148 | + decision := DNSDecision{ |
| 149 | + Host: host, |
| 150 | + Mode: m.policy.Mode, |
| 151 | + ResolvedAddrs: cached.addrs, |
| 152 | + Err: err, |
| 153 | + FallbackUsed: true, // New field |
| 154 | + } |
| 155 | + // Return outcome but preserve error for reporting |
| 156 | + return dnsResolveOutcome{ |
| 157 | + address: addr, |
| 158 | + decision: &decision, |
| 159 | + shouldReport: true, // New field to trigger error response |
| 160 | + }, err // Return error instead of nil |
| 161 | + } |
| 162 | + return dnsResolveOutcome{}, err |
| 163 | +} |
| 164 | +``` |
| 165 | + |
| 166 | +**Pros:** |
| 167 | +- Maintains resilience of fallback behavior |
| 168 | +- Errors visible to monitoring systems |
| 169 | +- Clear indication of degraded state |
| 170 | + |
| 171 | +**Cons:** |
| 172 | +- Requires changes to `dnsResolveOutcome` struct |
| 173 | +- Needs new logic to send error while continuing request |
| 174 | +- More complex error handling |
| 175 | + |
| 176 | +### Option 3: Explicit Fallback Mode with Warnings (Most Flexible) |
| 177 | + |
| 178 | +**Change:** Add `DNSFallbackBehavior` enum to policy configuration |
| 179 | + |
| 180 | +```go |
| 181 | +type DNSFallbackBehavior uint8 |
| 182 | + |
| 183 | +const ( |
| 184 | + DNSFallbackError DNSFallbackBehavior = iota // Fail immediately on DNS error |
| 185 | + DNSFallbackWarn // Use cache but send warning response |
| 186 | + DNSFallbackSilent // Current behavior (use cache, no error) |
| 187 | +) |
| 188 | + |
| 189 | +type DNSPolicy struct { |
| 190 | + // ... existing fields ... |
| 191 | + FallbackBehavior DNSFallbackBehavior // Replace DisableFallback bool |
| 192 | +} |
| 193 | +``` |
| 194 | + |
| 195 | +**Pros:** |
| 196 | +- Maximum flexibility for different use cases |
| 197 | +- Non-breaking (default to Silent for backward compat) |
| 198 | +- Explicit control over error reporting |
| 199 | + |
| 200 | +**Cons:** |
| 201 | +- More configuration complexity |
| 202 | +- Requires additional testing scenarios |
| 203 | +- Documentation burden |
| 204 | + |
| 205 | +## Recommended Solution |
| 206 | + |
| 207 | +**Option 2** (Report DNS Errors While Using Fallback) provides the best balance: |
| 208 | + |
| 209 | +1. **Preserve resilience**: Fallback to cached addresses still works |
| 210 | +2. **Improve observability**: DNS errors always reported to response channel |
| 211 | +3. **Clear semantics**: Clients can distinguish between clean success and degraded fallback |
| 212 | +4. **Non-breaking**: Existing fallback behavior continues, just with better error visibility |
| 213 | + |
| 214 | +### Implementation Steps |
| 215 | + |
| 216 | +1. **Modify `dnsResolveOutcome` struct** (dns_policy_manager.go): |
| 217 | + ```go |
| 218 | + type dnsResolveOutcome struct { |
| 219 | + address string |
| 220 | + forceNewConn bool |
| 221 | + decision *DNSDecision |
| 222 | + fallbackErr error // New: DNS error when fallback is used |
| 223 | + } |
| 224 | + ``` |
| 225 | + |
| 226 | +2. **Update `resolveTTL()` fallback path** (dns_policy_manager.go:234-249): |
| 227 | + ```go |
| 228 | + if err != nil { |
| 229 | + if len(cached.addrs) > 0 && m.policy.fallbackEnabled() { |
| 230 | + // ... create decision ... |
| 231 | + return dnsResolveOutcome{ |
| 232 | + address: addr, |
| 233 | + decision: &decision, |
| 234 | + fallbackErr: err, // Preserve error for reporting |
| 235 | + }, nil |
| 236 | + } |
| 237 | + return dnsResolveOutcome{}, err |
| 238 | + } |
| 239 | + ``` |
| 240 | + |
| 241 | +3. **Update `resolveCadence()` fallback path** (dns_policy_manager.go:303-316): |
| 242 | + Same pattern as `resolveTTL()` |
| 243 | + |
| 244 | +4. **Modify `prepareRequest()` to check for fallback errors** (dns_policy_manager.go:82-133): |
| 245 | + ```go |
| 246 | + outcome, err := m.resolve(ctx, host, port) |
| 247 | + if err != nil { |
| 248 | + // ... existing error handling ... |
| 249 | + } |
| 250 | + |
| 251 | + // Check if fallback was used (new) |
| 252 | + if outcome.fallbackErr != nil && outcome.decision != nil { |
| 253 | + outcome.decision.FallbackUsed = true |
| 254 | + // Return error context for caller to handle |
| 255 | + return ctx, outcome.forceNewConn, outcome.fallbackErr |
| 256 | + } |
| 257 | + ``` |
| 258 | + |
| 259 | +5. **Add error handling in `policyTransport.RoundTrip()`** (dns_policy_manager.go:485-496): |
| 260 | + ```go |
| 261 | + func (p *policyTransport) RoundTrip(req *http.Request) (*http.Response, error) { |
| 262 | + ctx, _, err := p.mgr.prepareRequest(req.Context(), req.URL, p.base) |
| 263 | + if err != nil { |
| 264 | + // DNS error occurred (either hard failure or fallback scenario) |
| 265 | + return nil, err |
| 266 | + } |
| 267 | + |
| 268 | + newReq := req.Clone(ctx) |
| 269 | + return p.base.RoundTrip(newReq) |
| 270 | + } |
| 271 | + ``` |
| 272 | + |
| 273 | +6. **Add `FallbackUsed` field to `DNSDecision`** (dns_policy.go): |
| 274 | + ```go |
| 275 | + type DNSDecision struct { |
| 276 | + // ... existing fields ... |
| 277 | + FallbackUsed bool // New: indicates cached address used due to DNS error |
| 278 | + } |
| 279 | + ``` |
| 280 | + |
| 281 | +7. **Add comprehensive tests**: |
| 282 | + - Test DNS failure with fallback enabled (should send error response) |
| 283 | + - Test DNS failure with fallback disabled (should send error response) |
| 284 | + - Test DNS failure without cached address (should send error response) |
| 285 | + - Verify decision hook receives correct fallback state |
| 286 | + |
| 287 | +### Testing Strategy |
| 288 | + |
| 289 | +1. **Unit tests** for each `resolve*()` method: |
| 290 | + - DNS failure with cached address and fallback enabled |
| 291 | + - DNS failure with cached address and fallback disabled |
| 292 | + - DNS failure without cached address |
| 293 | + |
| 294 | +2. **Integration tests** for end-to-end flow: |
| 295 | + - HTTP task execution with DNS failure scenarios |
| 296 | + - Verify error responses sent to response channel |
| 297 | + - Verify guard rail mechanisms trigger appropriately |
| 298 | + |
| 299 | +3. **Regression tests**: |
| 300 | + - Ensure static mode continues to work |
| 301 | + - Ensure default mode continues to work |
| 302 | + - Ensure normal success paths unchanged |
| 303 | + |
| 304 | +## Migration Considerations |
| 305 | + |
| 306 | +### Backward Compatibility |
| 307 | +- Existing fallback behavior preserved (requests still use cached addresses) |
| 308 | +- New error reporting may increase error response volume |
| 309 | +- Clients should be prepared to handle errors that were previously silent |
| 310 | + |
| 311 | +### Documentation Updates |
| 312 | +- Update DNS policy documentation to explain fallback error reporting |
| 313 | +- Add examples showing how to handle DNS fallback errors |
| 314 | +- Document difference between hard DNS failures and fallback scenarios |
| 315 | + |
| 316 | +### Monitoring Impact |
| 317 | +- Error dashboards may show increased error counts (this is expected and correct) |
| 318 | +- Add metrics to distinguish between: |
| 319 | + - Hard DNS failures (no cached address) |
| 320 | + - Soft DNS failures (fallback to cached address) |
| 321 | + - Guard rail triggers due to DNS issues |
| 322 | + |
| 323 | +## Future Enhancements |
| 324 | + |
| 325 | +1. **Fallback address TTL/expiration**: Don't fall back to very stale cached addresses |
| 326 | +2. **Fallback metrics**: Track how often fallback is used per endpoint |
| 327 | +3. **Configurable fallback policy**: Per-endpoint control over fallback behavior |
| 328 | +4. **DNS error classification**: Distinguish temporary vs permanent DNS failures |
| 329 | +5. **Circuit breaker integration**: Use DNS error patterns to trigger circuit breakers |
| 330 | + |
| 331 | +## References |
| 332 | + |
| 333 | +- Issue discovered: DNS resolution failures not producing error responses |
| 334 | +- Affected files: `dns_policy_manager.go`, `task_http.go`, `dns_policy.go` |
| 335 | +- Related guard rail mechanism: `DNSGuardRailPolicy` in `dns_policy.go` |
0 commit comments