|
| 1 | +# Modern Go Error Handling Strategy |
| 2 | + |
| 3 | +* Status: accepted |
| 4 | + |
| 5 | +## Context and Problem Statement |
| 6 | + |
| 7 | +Kuma currently has **inconsistent error handling** across the codebase: |
| 8 | + |
| 9 | +1. **MADR 073** (accepted 2023) recommends sentinel errors + `fmt.Errorf` with `%w` |
| 10 | +2. **Codebase reality**: Extensive use of `github.com/pkg/errors` (v0.9.1) |
| 11 | + - 30+ usages of `errors.Wrap/Wrapf/Errorf` in `pkg/` |
| 12 | + - Dependency in `go.mod` |
| 13 | +3. **New code** (e.g., `pkg/core/resources/store/error.go`) follows MADR 073 (stdlib) |
| 14 | + |
| 15 | +**Key developments since MADR 073:** |
| 16 | +- `github.com/pkg/errors` **archived in 2024** - no longer maintained |
| 17 | +- Security/maintenance risk for frozen dependency |
| 18 | +- Modern alternatives emerged with better performance |
| 19 | +- Stack traces proven valuable for Kuma debugging |
| 20 | + |
| 21 | +**Requirements:** |
| 22 | +- Stack traces for production debugging |
| 23 | +- Best performance (high-throughput control plane, many data planes) |
| 24 | +- Type-safe error checking (`errors.Is/As`) |
| 25 | +- Error wrapping with context |
| 26 | +- No migration timeline pressure |
| 27 | + |
| 28 | +## Design |
| 29 | + |
| 30 | +### Option 1: Standard Library Only (`fmt.Errorf` + `%w`) |
| 31 | + |
| 32 | +**Current MADR 073 recommendation** |
| 33 | + |
| 34 | +```go |
| 35 | +var ErrNotFound = errors.New("not found") |
| 36 | + |
| 37 | +func ErrorResourceNotFound(rt, name, mesh string) error { |
| 38 | + return fmt.Errorf("resource %w: type=%q name=%q mesh=%q", ErrNotFound, rt, name, mesh) |
| 39 | +} |
| 40 | + |
| 41 | +func IsNotFound(err error) bool { |
| 42 | + return errors.Is(err, ErrNotFound) |
| 43 | +} |
| 44 | +``` |
| 45 | + |
| 46 | +#### Pros |
| 47 | + |
| 48 | +- Zero external dependencies |
| 49 | +- Actively maintained by Go team |
| 50 | +- Consistent with Go 1.13+ idioms |
| 51 | +- Works with `errors.Is/As` |
| 52 | + |
| 53 | +#### Cons |
| 54 | + |
| 55 | +- No stack traces - dealbreaker for debugging requirements |
| 56 | +- Limited error context extraction |
| 57 | +- Performance overhead with `errors.Is()` (up to 5x slower per benchmarks) |
| 58 | + |
| 59 | +### Option 2: Keep `github.com/pkg/errors` |
| 60 | + |
| 61 | +**Status quo** |
| 62 | + |
| 63 | +```go |
| 64 | +return errors.Wrapf(err, "failed to sync: zone=%s", zone) |
| 65 | +``` |
| 66 | + |
| 67 | +#### Pros |
| 68 | + |
| 69 | +- Already in use - no migration |
| 70 | +- Stack traces included |
| 71 | +- Familiar to team |
| 72 | + |
| 73 | +#### Cons |
| 74 | + |
| 75 | +- Archived dependency - security/maintenance risk |
| 76 | +- Performance: 500% slowdown with `errors.Is()` benchmarks |
| 77 | +- Inconsistent with MADR 073 |
| 78 | +- Dave Cheney (author): "I no longer use this package" |
| 79 | + |
| 80 | +### Option 3: Migrate to `github.com/cockroachdb/errors` |
| 81 | + |
| 82 | +**Modern, actively maintained alternative** |
| 83 | + |
| 84 | +```go |
| 85 | +import "github.com/cockroachdb/errors" |
| 86 | + |
| 87 | +// Drop-in replacement |
| 88 | +return errors.Wrapf(err, "failed to sync: zone=%s", zone) |
| 89 | + |
| 90 | +// Stack traces automatic |
| 91 | +errors.WithStack(ErrNotFound) |
| 92 | + |
| 93 | +// Works with stdlib |
| 94 | +errors.Is(err, ErrNotFound) // ✓ |
| 95 | +errors.As(err, &targetErr) // ✓ |
| 96 | +``` |
| 97 | + |
| 98 | +#### Pros |
| 99 | + |
| 100 | +- Stack traces with full call context |
| 101 | +- Better performance than pkg/errors |
| 102 | +- Drop-in replacement - minimal code changes |
| 103 | +- Actively maintained (CockroachDB production use) |
| 104 | +- Stdlib compatible (`errors.Is/As` work correctly) |
| 105 | +- Rich features: error formatting, PII redaction, error barriers |
| 106 | +- Structured error metadata |
| 107 | +- Production-proven at scale |
| 108 | + |
| 109 | +#### Cons |
| 110 | + |
| 111 | +- External dependency (but from reputable source) |
| 112 | +- Migration effort for existing `fmt.Errorf` usage |
| 113 | +- Slightly larger binary size |
| 114 | + |
| 115 | +### Option 4: Hybrid Approach (Custom Types + Sentinels) |
| 116 | + |
| 117 | +**Enhanced structured errors** |
| 118 | + |
| 119 | +```go |
| 120 | +type ResourceError struct { |
| 121 | + sentinel error |
| 122 | + Type model.ResourceType |
| 123 | + Name string |
| 124 | + Mesh string |
| 125 | + stack []uintptr // manual stack capture |
| 126 | +} |
| 127 | + |
| 128 | +func (e *ResourceError) Error() string { |
| 129 | + return fmt.Sprintf("resource %s: type=%q name=%q mesh=%q", |
| 130 | + e.sentinel, e.Type, e.Name, e.Mesh) |
| 131 | +} |
| 132 | +``` |
| 133 | + |
| 134 | +#### Pros |
| 135 | + |
| 136 | +- Full control over error structure |
| 137 | +- Extractable metadata |
| 138 | + |
| 139 | +#### Cons |
| 140 | + |
| 141 | +- Significant implementation effort (stack capture, formatting) |
| 142 | +- Reinventing the wheel |
| 143 | +- Maintenance burden |
| 144 | +- Won't match cockroachdb/errors features |
| 145 | + |
| 146 | +## Security implications and review |
| 147 | + |
| 148 | +**Archived dependency risk:** |
| 149 | +- `pkg/errors` receives no security updates |
| 150 | +- Future CVEs won't be patched |
| 151 | +- Recommendation: Migrate away |
| 152 | + |
| 153 | +**cockroachdb/errors:** |
| 154 | +- Active security maintenance |
| 155 | +- Used in production by CockroachDB (security-conscious) |
| 156 | +- Regular updates and CVE monitoring |
| 157 | + |
| 158 | +## Reliability implications |
| 159 | + |
| 160 | +**Performance (Critical for Kuma CP):** |
| 161 | +- Control plane handles many data planes |
| 162 | +- Error handling in hot paths (xDS updates) |
| 163 | +- Benchmarks show `pkg/errors` + `errors.Is()` = 5x slowdown |
| 164 | +- `cockroachdb/errors` optimized for performance |
| 165 | + |
| 166 | +**Debugging:** |
| 167 | +- Stack traces essential for multi-zone issues |
| 168 | +- Error context helps diagnose cross-component failures |
| 169 | +- Structured errors improve observability/telemetry |
| 170 | + |
| 171 | +**Error propagation:** |
| 172 | +- KDS (multi-zone sync) propagates errors between CPs |
| 173 | +- Stack traces help debug distributed failures |
| 174 | +- Consistent error wrapping improves troubleshooting |
| 175 | + |
| 176 | +## Implications for Kong Mesh |
| 177 | + |
| 178 | +Kong Mesh inherits Kuma's error handling: |
| 179 | +- Stack traces valuable for enterprise debugging |
| 180 | +- Performance improvements benefit enterprise deployments |
| 181 | +- Migration can be synchronized with Kuma changes |
| 182 | +- Error telemetry integration potential |
| 183 | + |
| 184 | +## Decision |
| 185 | + |
| 186 | +**Adopt Option 3: Migrate to `github.com/cockroachdb/errors`** |
| 187 | + |
| 188 | +### Rationale |
| 189 | + |
| 190 | +Meets all requirements: |
| 191 | +1. Stack traces (required) |
| 192 | +2. Better performance than pkg/errors (required) |
| 193 | +3. Drop-in replacement for existing code (low migration cost) |
| 194 | +4. Actively maintained (eliminates security risk) |
| 195 | +5. Stdlib compatible (future-proof) |
| 196 | + |
| 197 | +### Migration Strategy |
| 198 | + |
| 199 | +**Phase 1: Add dependency** |
| 200 | +```bash |
| 201 | +go get github.com/cockroachdb/errors |
| 202 | +``` |
| 203 | + |
| 204 | +**Phase 2: Update imports (gradual)** |
| 205 | +```go |
| 206 | +// Old |
| 207 | +import "github.com/pkg/errors" |
| 208 | + |
| 209 | +// New |
| 210 | +import "github.com/cockroachdb/errors" |
| 211 | +``` |
| 212 | + |
| 213 | +**Phase 3: Update MADR 073** |
| 214 | +- Document cockroachdb/errors as standard |
| 215 | +- Update examples |
| 216 | +- Keep sentinel error pattern |
| 217 | + |
| 218 | +**Phase 4: Convert fmt.Errorf usage** |
| 219 | +```go |
| 220 | +// Old (no stack trace) |
| 221 | +return fmt.Errorf("sync failed: %w", err) |
| 222 | + |
| 223 | +// New (with stack trace) |
| 224 | +return errors.Wrap(err, "sync failed") |
| 225 | +``` |
| 226 | + |
| 227 | +**No timeline pressure** - can be done incrementally across releases. |
| 228 | + |
| 229 | +### Usage Convention |
| 230 | + |
| 231 | +```go |
| 232 | +// Sentinel errors (preserved from MADR 073) |
| 233 | +var ErrNotFound = errors.New("not found") |
| 234 | + |
| 235 | +// Error construction with context |
| 236 | +func ErrorResourceNotFound(rt model.ResourceType, name, mesh string) error { |
| 237 | + return errors.Wrapf(ErrNotFound, |
| 238 | + "resource: type=%q name=%q mesh=%q", rt, name, mesh) |
| 239 | +} |
| 240 | + |
| 241 | +// Error checking (unchanged) |
| 242 | +if errors.Is(err, ErrNotFound) { |
| 243 | + // handle |
| 244 | +} |
| 245 | + |
| 246 | +// Extract structured errors |
| 247 | +var resourceErr *ResourceError |
| 248 | +if errors.As(err, &resourceErr) { |
| 249 | + log.Error("failed", "type", resourceErr.Type) |
| 250 | +} |
| 251 | +``` |
| 252 | + |
| 253 | +## Notes |
| 254 | + |
| 255 | +**Alternative considered:** |
| 256 | +- `gitlab.com/tozd/go/errors` - similar features, less adoption |
| 257 | +- CockroachDB chosen for production track record |
| 258 | + |
| 259 | +**Stack trace overhead:** |
| 260 | +- Only captured on error creation (not propagation) |
| 261 | +- Negligible for error-case performance |
| 262 | +- Can be disabled in production if needed (env var) |
| 263 | + |
| 264 | +## References |
| 265 | + |
| 266 | +### Go Error Handling Best Practices |
| 267 | +- [Mastering Go Error Handling: A Practical Guide - DEV Community](https://dev.to/leapcell/mastering-go-error-handling-a-practical-guide-3411) |
| 268 | +- [Best Practices for Error Handling in Go - JetBrains Guide](https://www.jetbrains.com/guide/go/tutorials/handle_errors_in_go/best_practices/) |
| 269 | +- [Mastering Error Handling in Go: Best Practices and Patterns - SkyCode](https://vickychhetri.com/2024/11/19/mastering-error-handling-in-go-best-practices-and-patterns/) |
| 270 | +- [A practical guide to error handling in Go - Datadog](https://www.datadoghq.com/blog/go-error-handling/) |
| 271 | + |
| 272 | +### Sentinel Errors vs Custom Types |
| 273 | +- [Error Values vs Sentinel Errors in Go - Medium](https://medium.com/@AlexanderObregon/error-values-vs-sentinel-errors-in-go-a377c3d831d1) |
| 274 | +- [Go Error Handling Techniques: Exploring Sentinel Errors, Custom Types, and Client-Facing Errors](https://arashtaher.wordpress.com/2024/09/05/go-error-handling-techniques-exploring-sentinel-errors-custom-types-and-client-facing-errors/) |
| 275 | +- [Go Error Handling: Sentinel vs Custom Types - alesr](https://alesr.github.io/posts/go-errors/) |
| 276 | + |
| 277 | +### Error Wrapping |
| 278 | +- [Working with Errors in Go 1.13 - Official Go Blog](https://go.dev/blog/go1.13-errors) |
| 279 | +- [Error Wrapping with Go's Standard Library - Medium](https://medium.com/@AlexanderObregon/error-wrapping-with-gos-standard-library-0a345eeea019) |
| 280 | +- [Error wrapping in Go - Bitfield Consulting](https://bitfieldconsulting.com/posts/wrapping-errors) |
| 281 | +- [Wrapping errors in Go - barney's tech blog](https://southcla.ws/wrapping-errors-in-go-a-new-approach) |
| 282 | + |
| 283 | +### fmt.Errorf vs pkg/errors |
| 284 | +- [What's the difference between errors.Wrapf(), errors.Errorf(), and fmt.Errorf()? - Stack Overflow](https://stackoverflow.com/questions/61933650/whats-the-difference-between-errors-wrapf-errors-errorf-and-fmt-errorf) |
| 285 | +- [Can new Go errors wrapper replace pkg/errors? - Dmitry Harnitski's Blog](https://blog.dharnitski.com/2019/09/09/go-errors-are-not-pkg-errors/) |
| 286 | +- [decide on the future use of errors, pkg/errors - moby/moby Discussion](https://github.com/moby/moby/discussions/46358) |
| 287 | + |
| 288 | +### Performance |
| 289 | +- [Sentinel errors and errors.Is() slow your code down by 500% - DoltHub Blog](https://www.dolthub.com/blog/2024-05-31-benchmarking-go-error-handling/) |
| 290 | + |
| 291 | +### Modern Alternatives |
| 292 | +- [Beyond fmt.Errorf() - CockroachDB errors library](https://dr-knz.net/cockroachdb-errors-everyday.html) |
| 293 | +- [cockroachdb/errors - Go Packages](https://pkg.go.dev/github.com/cockroachdb/errors) |
0 commit comments