185 changes: 185 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,185 @@
# Changelog

Notable changes to this project are documented in this file. To keep it lightweight, entries for releases two or more minor versions back will be pruned regularly.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [0.10.x] - Unreleased

### Changed

#### DNS Errors Now Reported With Fallback (Breaking Change)

Previously, when DNS resolution failed but a cached address was available for fallback, the error was silently swallowed and `WatcherResponse.Err` would be `nil`. This made it impossible to detect DNS degradation in monitoring systems.

**Now**, DNS errors are always reported in `WatcherResponse.Err`, even when fallback succeeds. This provides visibility into DNS issues while maintaining resilience.

**Before:**
```go
// DNS fails but fallback uses cached address → resp.Err == nil
resp := <-manager.Responses()
if resp.Err != nil {
	// Only triggered for hard failures
	log.Error("Request failed", "error", resp.Err)
}
```

**After:**
```go
// DNS fails with fallback → resp.Err != nil (contains DNS error)
resp := <-manager.Responses()
if resp.Err != nil {
	// This now triggers for DNS failures too!
	// But the request may have completed using a cached address
	log.Error("Request failed", "error", resp.Err)
}
```
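One way to read the new contract is as a three-state outcome: healthy, degraded (fallback succeeded), or failed. The sketch below illustrates that reading with local stand-in types. `response` and `dnsInfo` are hypothetical and are not wadjit types; only the `Err` and `FallbackUsed` names are taken from this changelog.

```go
package main

import (
	"errors"
	"fmt"
)

// dnsInfo and response are hypothetical stand-ins for illustration only.
// They mirror just the Err and FallbackUsed names described in this changelog.
type dnsInfo struct {
	FallbackUsed bool
	Err          error
}

type response struct {
	Err error
	DNS *dnsInfo
}

// Sample errors used only to exercise classify.
var (
	errNoSuchHost = errors.New("lookup example.test: no such host")
	errRefused    = errors.New("connection refused")
)

// classify maps a response onto one of three states.
func classify(r response) string {
	switch {
	case r.Err == nil:
		return "healthy"
	case r.DNS != nil && r.DNS.FallbackUsed:
		return "degraded" // error reported, but a cached address was used
	default:
		return "failed"
	}
}

func main() {
	fmt.Println(classify(response{}))
	fmt.Println(classify(response{
		Err: errNoSuchHost,
		DNS: &dnsInfo{FallbackUsed: true, Err: errNoSuchHost},
	}))
	fmt.Println(classify(response{Err: errRefused}))
}
```

Keeping the classification in one function means logging, metrics, and retry code can all branch on the same three states instead of re-checking the fields separately.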

### Added

#### New DNS Metadata Fields

Two new fields are available on `DNSMetadata` and `DNSDecision`:

- `FallbackUsed bool`: Indicates whether a cached address was used due to DNS lookup failure
- `Err error`: Contains the DNS resolution error, if any occurred

**Access DNS error information:**
```go
resp := <-manager.Responses()
md := resp.Metadata()
if md.DNS != nil {
	if md.DNS.FallbackUsed {
		log.Warn("Using cached DNS entry", "error", md.DNS.Err)
	}
	if md.DNS.Err != nil {
		log.Warn("DNS error occurred", "error", md.DNS.Err)
	}
}
```
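If the value in `Err` wraps a standard-library `*net.DNSError`, its flags can be used to tell a missing host from a timeout. Whether wadjit's resolver actually returns or wraps `*net.DNSError` is an assumption here, not something this changelog states; the sketch falls back to a generic label when it does not. `describeDNSError` and `sampleNotFound` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// describeDNSError returns a coarse category for err. It assumes, without
// confirmation from the library, that DNS failures wrap a *net.DNSError.
func describeDNSError(err error) string {
	var dnsErr *net.DNSError
	if !errors.As(err, &dnsErr) {
		return "not a DNS error"
	}
	switch {
	case dnsErr.IsNotFound:
		return "host not found"
	case dnsErr.IsTimeout:
		return "lookup timed out"
	case dnsErr.IsTemporary:
		return "temporary failure"
	default:
		return "other DNS failure"
	}
}

// sampleNotFound is a hand-built error used only to exercise the function.
var sampleNotFound error = fmt.Errorf("resolve: %w", &net.DNSError{
	Err:        "no such host",
	Name:       "example.test",
	IsNotFound: true,
})

func main() {
	fmt.Println(describeDNSError(sampleNotFound))
	fmt.Println(describeDNSError(errors.New("connection refused")))
}
```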

#### Enhanced DNS Decision Callbacks

DNS decision hooks now receive fallback and error information:

```go
policy := wadjit.DNSPolicy{
	Mode:   wadjit.DNSRefreshTTL,
	TTLMin: time.Minute,
	DecisionCallback: func(ctx context.Context, decision wadjit.DNSDecision) {
		if decision.FallbackUsed {
			log.Warn("DNS fallback triggered", "error", decision.Err)
		}
		if decision.Err != nil {
			metrics.RecordDNSError(decision.Host, decision.Err)
		}
	},
}
```
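The `metrics.RecordDNSError` call above is left undefined. As a minimal sketch of what such a recorder could look like, the following per-host counter is guarded by a mutex, on the assumption (not confirmed by this changelog) that callbacks may be invoked from several goroutines at once. `dnsErrorCounter` and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// dnsErrorCounter is a hypothetical stand-in for a metrics client.
type dnsErrorCounter struct {
	mu     sync.Mutex
	byHost map[string]int
}

func newDNSErrorCounter() *dnsErrorCounter {
	return &dnsErrorCounter{byHost: make(map[string]int)}
}

// Record increments the DNS error count for host.
func (c *dnsErrorCounter) Record(host string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.byHost[host]++
}

// Count returns the number of errors recorded for host.
func (c *dnsErrorCounter) Count(host string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.byHost[host]
}

func main() {
	counter := newDNSErrorCounter()

	// Simulate concurrent callbacks reporting errors for the same host.
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			counter.Record("example.test")
		}()
	}
	wg.Wait()

	fmt.Println(counter.Count("example.test"))
}
```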

### Migration Guide

#### 1. Distinguish Between Hard Failures and Fallback Scenarios

Update error handling to check if fallback was used:

```go
resp := <-manager.Responses()
if resp.Err != nil {
	md := resp.Metadata()
	if md.DNS != nil && md.DNS.FallbackUsed {
		// DNS failed but request completed with cached address
		log.Warn("DNS fallback used",
			"error", md.DNS.Err,
			"cached_addr", md.DNS.ResolvedAddrs)
		// Maybe alert but don't fail completely
	} else {
		// Hard failure - request did not complete
		log.Error("Request failed completely", "error", resp.Err)
		// Definitely alert/retry
	}
}
```

#### 2. Update Monitoring and Alerting

Expect error counts in dashboards to increase; this is correct behavior. These DNS failures were already happening, but they were not visible before.

Distinguish between error types in metrics:

```go
resp := <-manager.Responses()
if resp.Err != nil {
	md := resp.Metadata()
	if md.DNS != nil {
		if md.DNS.FallbackUsed {
			metrics.IncrCounter("dns.fallback.used") // Degraded but working
			metrics.IncrCounter("dns.errors.soft")
		} else if md.DNS.Err != nil {
			metrics.IncrCounter("dns.errors.hard") // Complete failure
		}
	}
	metrics.IncrCounter("requests.errors.total")
}
```

#### 3. Refine Retry Logic

Consider different retry strategies for fallback vs hard failures:

```go
func shouldRetry(resp wadjit.WatcherResponse) bool {
	if resp.Err == nil {
		return false // Success
	}

	md := resp.Metadata()
	if md.DNS != nil && md.DNS.FallbackUsed {
		// DNS failed but got cached data - maybe don't retry immediately
		// as the DNS issue might be temporary
		return false
	}

	// Hard failure or non-DNS error - retry
	return true
}
```
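For the hard-failure branch, a common companion to a retry decision is a capped exponential backoff between attempts. This is a generic sketch and not part of wadjit; `backoffDelay`, `baseDelay`, and `maxDelay` are hypothetical names and the values are arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only; tune them for the monitored endpoints.
const (
	baseDelay = 100 * time.Millisecond
	maxDelay  = 2 * time.Second
)

// backoffDelay returns base doubled once per attempt, capped at limit.
// attempt starts at 0; negative attempts are treated as 0.
func backoffDelay(attempt int, base, limit time.Duration) time.Duration {
	delay := base
	for i := 0; i < attempt; i++ {
		delay *= 2
		if delay >= limit {
			// Stop early so repeated doubling cannot overflow.
			return limit
		}
	}
	if delay > limit {
		return limit
	}
	return delay
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, baseDelay, maxDelay))
	}
}
```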

#### 4. Update Error Propagation Logic

If your code assumed that `Err != nil` always means the request failed completely:

```go
// OLD - May need updating
if resp.Err != nil {
	return fmt.Errorf("health check failed: %w", resp.Err)
}

// NEW - More nuanced
if resp.Err != nil {
	md := resp.Metadata()
	if md.DNS != nil && md.DNS.FallbackUsed {
		// Degrade gracefully - log warning but continue
		metrics.IncrDNSFallback(resp.URL.Host)
	} else {
		// Hard failure - propagate error
		return fmt.Errorf("health check failed: %w", resp.Err)
	}
}
```

### Recommended Migration Steps

1. **Update error handling** to check for `DNS.FallbackUsed` before treating errors as hard failures
2. **Update monitoring** to distinguish soft (fallback) vs hard DNS failures
3. **Review retry logic** - consider not retrying immediately on fallback errors
4. **Test in staging** - watch for increased error counts (expected and correct)
5. **Update runbooks** - DNS fallback errors may now trigger alerts that need different responses

### Technical Details

- Modified `dnsResolveOutcome` to include `fallbackErr` field for error propagation
- Updated `resolveTTL()`, `resolveCadence()`, and `resolveSingle()` to return DNS decisions even on error
- Added `lastDecision` field to `dnsPolicyManager` for error reporting in HTTP responses
- Created `httpTaskResponseError` type to attach DNS metadata to error responses
- Enhanced decision context propagation to include errors and fallback state
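The core pattern behind these changes, returning a usable address together with the error that was survived, can be modeled in a few lines. The sketch below is a simplified illustration and not the library's implementation: `resolveOutcome`, `lookupFunc`, and `resolveWithFallback` are hypothetical names, and the cache is reduced to a single string.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveOutcome is a simplified, hypothetical model of the pattern. It is
// not the library's dnsResolveOutcome type.
type resolveOutcome struct {
	address     string // address to dial, possibly taken from the cache
	fallbackErr error  // lookup error that was survived by using the cache
}

// lookupFunc stands in for a real DNS lookup.
type lookupFunc func(host string) (string, error)

// resolveWithFallback prefers a fresh lookup. If the lookup fails and a cached
// address exists, it returns that address with the lookup error recorded in
// fallbackErr. Without a cached address, the lookup error is a hard failure.
func resolveWithFallback(lookup lookupFunc, host, cached string) (resolveOutcome, error) {
	addr, err := lookup(host)
	if err == nil {
		return resolveOutcome{address: addr}, nil
	}
	if cached != "" {
		return resolveOutcome{address: cached, fallbackErr: err}, nil
	}
	return resolveOutcome{}, err
}

// Sample lookups used only to exercise the function.
var errLookup = errors.New("lookup failed")

func failingLookup(string) (string, error) { return "", errLookup }
func workingLookup(string) (string, error) { return "192.0.2.10", nil }

func main() {
	out, err := resolveWithFallback(failingLookup, "example.test", "192.0.2.1")
	fmt.Println(out.address, out.fallbackErr, err)

	out, err = resolveWithFallback(workingLookup, "example.test", "192.0.2.1")
	fmt.Println(out.address, out.fallbackErr, err)

	_, err = resolveWithFallback(failingLookup, "example.test", "")
	fmt.Println(err)
}
```

Keeping `fallbackErr` separate from the returned `error` lets the caller proceed with the cached address while still deciding how loudly to report the failed lookup.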
26 changes: 26 additions & 0 deletions README.md
@@ -5,6 +5,8 @@

Wadjit (pronounced /ˈwɒdʒɪt/, or "watch it") is a program for endpoint monitoring and analysis.

> **Note**: Version 0.10.x introduces breaking changes to DNS error handling. See [CHANGELOG.md](CHANGELOG.md#010x) for migration details.

> Wadjet is the ancient Egyptian goddess of protection and royal authority. She is sometimes shown as the Eye of Ra, acting as a protector of the country and the king, and her vigilant eye would watch over the land. - [Wikipedia](https://en.wikipedia.org/wiki/Wadjet)

`wadjit.New()` creates a manager for an arbitrary number of watchers. The watchers monitor pre-defined endpoints according to their configuration, and feed the results back to the manager. A single Wadjit manager can hold watchers for many different tasks, as responses on the response channel are separated by watcher ID, or you may choose to create several managers as a more strict separation of concerns.
@@ -23,6 +25,10 @@ go get github.com/jkbrsn/wadjit@latest
- Buffered responses: Non-blocking channel with watcher IDs and metadata.
- Metrics: Access scheduler metrics via `Metrics()`.

## Changelog

The [CHANGELOG.md](CHANGELOG.md) lists recent changes and describes the migration steps that may be needed when moving from one version to the next.

## Quick Start

Minimal example with one HTTP and one WS task and basic response handling:
@@ -142,6 +148,26 @@ Guard rails add safety nets on top of any mode. Configure `DNSGuardRailPolicy` w

Set a global default for every watcher by supplying `wadjit.WithDefaultDNSPolicy(...)` when creating the Wadjit. Endpoints that do not call `WithDNSPolicy` inherit this default automatically, while explicit endpoint policies still win.

##### Error Handling

DNS errors are always reported in `WatcherResponse.Err`, even when fallback to cached addresses succeeds. This provides visibility into DNS degradation while maintaining resilience:

```go
resp := <-manager.Responses()
if resp.Err != nil {
	md := resp.Metadata()
	if md.DNS != nil && md.DNS.FallbackUsed {
		// DNS failed but request completed with cached address
		log.Warn("DNS fallback used", "error", md.DNS.Err)
	} else {
		// Hard failure - request did not complete
		log.Error("Request failed", "error", resp.Err)
	}
}
```

**New in 0.10.x**: `DNSMetadata` includes `FallbackUsed bool` and `Err error` fields to distinguish between degraded (fallback) and failed states. See [CHANGELOG.md](CHANGELOG.md#010x) for migration guidance.

##### Examples

Assuming a parsed target URL:
1 change: 1 addition & 0 deletions dns_policy.go
@@ -102,6 +102,7 @@ type DNSDecision struct {
 	LookupDuration     time.Duration
 	GuardRailTriggered bool
 	Err                error
+	FallbackUsed       bool
 }

 // Validate ensures the DNS policy fields are coherent.
78 changes: 66 additions & 12 deletions dns_policy_manager.go
@@ -32,11 +32,12 @@ type dnsPolicyManager struct {
 	resolver TTLResolver
 	hook     DNSDecisionCallback

-	mu          sync.Mutex
-	cache       dnsCacheEntry
-	cadenceNext time.Time
-	forceLookup bool
-	guard       guardRailState
+	mu           sync.Mutex
+	cache        dnsCacheEntry
+	cadenceNext  time.Time
+	forceLookup  bool
+	guard        guardRailState
+	lastDecision *DNSDecision // Last DNS decision for error reporting

 	lookupGroup singleflight.Group
 }
@@ -66,6 +67,7 @@ type dnsResolveOutcome struct {
 	address      string
 	forceNewConn bool
 	decision     *DNSDecision
+	fallbackErr  error
 }

 const serverErrorThreshold = 500
@@ -95,12 +97,36 @@ func (m *dnsPolicyManager) prepareRequest(

 	outcome, err := m.resolve(ctx, host, port)
 	if err != nil {
-		if outcome.decision != nil && m.hook != nil {
+		if outcome.decision != nil {
 			outcome.decision.Err = err
-			m.hook(ctx, *outcome.decision)
+			// Store decision for retrieval after error
+			m.mu.Lock()
+			m.lastDecision = outcome.decision
+			m.mu.Unlock()
+			if m.hook != nil {
+				m.hook(ctx, *outcome.decision)
+			}
 		}
 		return ctx, false, err
 	}
+
+	if outcome.fallbackErr != nil && outcome.decision != nil {
+		outcome.decision.FallbackUsed = true
+		// Store decision for retrieval after error
+		m.mu.Lock()
+		m.lastDecision = outcome.decision
+		m.mu.Unlock()
+		// Add decision to context even when returning error
+		resultCtx := context.WithValue(ctx, dnsDecisionKey{}, *outcome.decision)
+		if outcome.address != "" {
+			plan := dnsDialPlan{target: outcome.address}
+			resultCtx = context.WithValue(resultCtx, dnsPlanKey{}, plan)
+		}
+		if m.hook != nil {
+			m.hook(resultCtx, *outcome.decision)
+		}
+		return resultCtx, outcome.forceNewConn, outcome.fallbackErr
+	}
 	guardTriggered := m.consumeGuardTrigger()
 	decision := outcome.decision
 	if decision == nil && (m.hook != nil || guardTriggered) {
@@ -184,7 +210,14 @@ func (m *dnsPolicyManager) resolveSingle(

 	addrs, ttl, lookupDur, err := m.lookup(ctx, host)
 	if err != nil {
-		return dnsResolveOutcome{}, err
+		// No cache available - return error with decision for observability
+		decision := DNSDecision{
+			Host:           host,
+			Mode:           m.policy.Mode,
+			LookupDuration: lookupDur,
+			Err:            err,
+		}
+		return dnsResolveOutcome{decision: &decision}, err
 	}
 	entry := dnsCacheEntry{addrs: addrs, lastLookup: time.Now()}

@@ -244,9 +277,16 @@ func (m *dnsPolicyManager) resolveTTL(
 				ExpiresAt: cached.expiresAt,
 				Err: err,
 			}
-			return dnsResolveOutcome{address: addr, decision: &decision}, nil
+			return dnsResolveOutcome{address: addr, decision: &decision, fallbackErr: err}, nil
 		}
-		return dnsResolveOutcome{}, err
+		// No cache available - return error with decision for observability
+		decision := DNSDecision{
+			Host:           host,
+			Mode:           m.policy.Mode,
+			LookupDuration: lookupDur,
+			Err:            err,
+		}
+		return dnsResolveOutcome{decision: &decision}, err
 	}

 	ttl = m.policy.normalizeTTL(ttl)
@@ -311,9 +351,16 @@ func (m *dnsPolicyManager) resolveCadence(
 				ExpiresAt: cadenceNext,
 				Err: err,
 			}
-			return dnsResolveOutcome{address: addr, decision: &decision}, nil
+			return dnsResolveOutcome{address: addr, decision: &decision, fallbackErr: err}, nil
 		}
+		// No cache available - return error with decision for observability
+		decision := DNSDecision{
+			Host:           host,
+			Mode:           m.policy.Mode,
+			LookupDuration: lookupDur,
+			Err:            err,
+		}
-		return dnsResolveOutcome{}, err
+		return dnsResolveOutcome{decision: &decision}, err
 	}

 	entry := dnsCacheEntry{addrs: addrs, lastLookup: now}
@@ -408,6 +455,13 @@ func (*dnsPolicyManager) dialContext(
 	}
 }

+// getLastDecision retrieves the last DNS decision, if any. This is useful for error reporting.
+func (m *dnsPolicyManager) getLastDecision() *DNSDecision {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	return m.lastDecision
+}
+
 // observeResult records request outcomes to drive guard-rail thresholds.
 func (m *dnsPolicyManager) observeResult(statusCode int, resultErr error) {
 	if !m.guard.policy.Enabled() {