|
| 1 | +# Implementation Plan: TAPI Backoff & Rate Limiting for React Native SDK |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +Implement an exponential backoff and 429 rate-limiting strategy, per the TAPI Backoff SDD, to handle API overload during high-traffic events (holidays, the Super Bowl, etc.). The implementation adds: |
| 6 | + |
| 7 | +1. **Global rate limiting** for 429 responses (blocks entire upload pipeline) |
| 8 | +2. **Per-batch exponential backoff** for transient errors (5xx, 408, etc.) |
| 9 | +3. **Upload gate pattern** (no timers, state-based flow control) |
| 10 | +4. **Configurable via Settings CDN** (dynamic updates without deployments) |
| 11 | +5. **Persistent state** across app restarts (AsyncStorage) |
| 12 | + |
| 13 | +## Key Architectural Decisions |
| 14 | + |
| 15 | +### Decision 1: Two-Component Architecture |
| 16 | + |
| 17 | +- **UploadStateMachine**: Manages global READY/WAITING states for 429 rate limiting |
| 18 | +- **BatchUploadManager**: Handles per-batch retry metadata and exponential backoff |
| 19 | +- Both use Sovran stores for persistence and integrate into `SegmentDestination` |
| 20 | + |
| 21 | +### Decision 2: Upload Gate Pattern |
| 22 | + |
| 23 | +- No timers/schedulers to check state |
| 24 | +- Check `canUpload()` at flush start, return early if in WAITING state |
| 25 | +- State transitions on response (429 → WAITING, success → READY) |
| 26 | + |
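The gate described in Decision 2 can be sketched as a small class. This is a minimal in-memory illustration, not the SDK's `UploadStateMachine`: the class name `UploadGate`, its method signatures, and the injectable `now` parameter are assumptions made for the example, and Sovran persistence is left out.

```typescript
// Hypothetical sketch of the upload gate. State lives in memory here;
// the plan persists it in a Sovran store instead.
type GateState = {
  state: 'READY' | 'WAITING';
  waitUntilTime: number; // timestamp ms
  globalRetryCount: number;
};

class UploadGate {
  private data: GateState = {
    state: 'READY',
    waitUntilTime: 0,
    globalRetryCount: 0,
  };

  // Called at flush start. No timer fires when the wait ends; the next
  // flush simply observes that the deadline has passed.
  canUpload(now: number = Date.now()): boolean {
    if (this.data.state === 'WAITING' && now >= this.data.waitUntilTime) {
      this.data = { ...this.data, state: 'READY' };
    }
    return this.data.state === 'READY';
  }

  // 429 -> WAITING until now + retryAfterSeconds.
  handle429(retryAfterSeconds: number, now: number = Date.now()): void {
    this.data = {
      state: 'WAITING',
      waitUntilTime: now + retryAfterSeconds * 1000,
      globalRetryCount: this.data.globalRetryCount + 1,
    };
  }

  // Success -> READY with counters reset.
  handleSuccess(): void {
    this.data = { state: 'READY', waitUntilTime: 0, globalRetryCount: 0 };
  }

  getRetryCount(): number {
    return this.data.globalRetryCount;
  }
}
```

A flush would call `canUpload()` first and return early on `false`. Because the check compares against the clock on demand, nothing has to be scheduled to leave the WAITING state.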
| 27 | +### Decision 3: Sequential Batch Processing |
| 28 | + |
| 29 | +- Change from `Promise.all()` (parallel) to `for...of` loop (sequential) |
| 30 | +- Required by SDD: "429 responses cause immediate halt of upload loop" |
| 31 | +- Transient errors (5xx) don't block remaining batches |
| 32 | + |
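The control flow in Decision 3 might look roughly like the loop below. It is a sketch under stated assumptions: `Batch`, `UploadFn`, `LoopResult`, and `uploadSequentially` are hypothetical names, and the transient check is hard-coded to 408 and 5xx for brevity, whereas the plan drives it from `retryableStatusCodes`.

```typescript
// Hypothetical types and names for illustration; not the SDK's actual API.
type Batch = { batchId: string };
type UploadFn = (batch: Batch) => Promise<{ status: number }>;

type LoopResult = {
  uploaded: string[]; // batch IDs accepted by the server
  deferred: string[]; // batch IDs that hit a transient error (retried later)
  dropped: string[]; // batch IDs that hit a permanent error
  halted: boolean; // true when a 429 stopped the loop
};

async function uploadSequentially(
  batches: Batch[],
  upload: UploadFn
): Promise<LoopResult> {
  const result: LoopResult = {
    uploaded: [],
    deferred: [],
    dropped: [],
    halted: false,
  };

  for (const batch of batches) {
    const { status } = await upload(batch);

    if (status === 429) {
      // Rate limited: stop immediately. This batch and all remaining
      // batches stay queued and are not attempted in this flush.
      result.halted = true;
      break;
    }

    if (status >= 200 && status < 300) {
      result.uploaded.push(batch.batchId);
    } else if (status === 408 || status >= 500) {
      // Simplified transient check; the loop continues with the next batch.
      result.deferred.push(batch.batchId);
    } else {
      result.dropped.push(batch.batchId);
    }
  }

  return result;
}
```

With `Promise.all()` every request is already in flight by the time the first 429 arrives, which is why the halt requirement implies awaiting each upload in turn.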
| 33 | +### Decision 4: Authentication & Headers |
| 34 | + |
| 35 | +- **Authorization header**: Add `Basic ${base64(writeKey + ':')}` header |
| 36 | +- **Keep writeKey in body**: Backwards compatibility with TAPI |
| 37 | +- **X-Retry-Count header**: Send per-batch count when available, global count for 429 |
| 38 | + |
| 39 | +### Decision 5: Logging Strategy |
| 40 | + |
| 41 | +- Verbose logging for all retry events (state transitions, backoff delays, drops) |
| 42 | +- Use existing `analytics.logger.info()` and `.warn()` infrastructure |
| 43 | +- Include retry count, backoff duration, and error codes in logs |
| 44 | + |
| 45 | +## Implementation Steps |
| 46 | + |
| 47 | +### Step 1: Add Type Definitions |
| 48 | + |
| 49 | +**File**: `/packages/core/src/types.ts` |
| 50 | + |
| 51 | +Add new interfaces to existing types: |
| 52 | + |
| 53 | +```typescript |
| 54 | +// HTTP Configuration from Settings CDN |
| 55 | +export type HttpConfig = { |
| 56 | + rateLimitConfig?: RateLimitConfig; |
| 57 | + backoffConfig?: BackoffConfig; |
| 58 | +}; |
| 59 | + |
| 60 | +export type RateLimitConfig = { |
| 61 | + enabled: boolean; |
| 62 | + maxRetryCount: number; |
| 63 | + maxRetryInterval: number; // seconds |
| 64 | + maxTotalBackoffDuration: number; // seconds |
| 65 | +}; |
| 66 | + |
| 67 | +export type BackoffConfig = { |
| 68 | + enabled: boolean; |
| 69 | + maxRetryCount: number; |
| 70 | + baseBackoffInterval: number; // seconds |
| 71 | + maxBackoffInterval: number; // seconds |
| 72 | + maxTotalBackoffDuration: number; // seconds |
| 73 | + jitterPercent: number; // 0-100 |
| 74 | + retryableStatusCodes: number[]; |
| 75 | +}; |
| 76 | + |
| 77 | +// Update SegmentAPISettings to include httpConfig |
| 78 | +export type SegmentAPISettings = { |
| 79 | + integrations: SegmentAPIIntegrations; |
| 80 | + edgeFunction?: EdgeFunctionSettings; |
| 81 | + middlewareSettings?: { |
| 82 | + routingRules: RoutingRule[]; |
| 83 | + }; |
| 84 | + metrics?: MetricsOptions; |
| 85 | + consentSettings?: SegmentAPIConsentSettings; |
| 86 | + httpConfig?: HttpConfig; // NEW |
| 87 | +}; |
| 88 | + |
| 89 | +// State machine persistence |
| 90 | +export type UploadStateData = { |
| 91 | + state: 'READY' | 'WAITING'; |
| 92 | + waitUntilTime: number; // timestamp ms |
| 93 | + globalRetryCount: number; |
| 94 | + firstFailureTime: number | null; // timestamp ms |
| 95 | +}; |
| 96 | + |
| 97 | +// Per-batch retry metadata |
| 98 | +export type BatchMetadata = { |
| 99 | + batchId: string; |
| 100 | + events: SegmentEvent[]; // Store events to match batches |
| 101 | + retryCount: number; |
| 102 | + nextRetryTime: number; // timestamp ms |
| 103 | + firstFailureTime: number; // timestamp ms |
| 104 | +}; |
| 105 | + |
| 106 | +// Error classification result |
| 107 | +export type ErrorClassification = { |
| 108 | + isRetryable: boolean; |
| 109 | + errorType: 'rate_limit' | 'transient' | 'permanent'; |
| 110 | + retryAfterSeconds?: number; |
| 111 | +}; |
| 112 | +``` |
| 113 | + |
| 114 | +### Step 2: Add Default Configuration |
| 115 | + |
| 116 | +**File**: `/packages/core/src/constants.ts` |
| 117 | + |
| 118 | +Add default httpConfig: |
| 119 | + |
| 120 | +```typescript |
| 121 | +export const defaultHttpConfig: HttpConfig = { |
| 122 | + rateLimitConfig: { |
| 123 | + enabled: true, |
| 124 | + maxRetryCount: 100, |
| 125 | + maxRetryInterval: 300, |
| 126 | + maxTotalBackoffDuration: 43200, // 12 hours |
| 127 | + }, |
| 128 | + backoffConfig: { |
| 129 | + enabled: true, |
| 130 | + maxRetryCount: 100, |
| 131 | + baseBackoffInterval: 0.5, |
| 132 | + maxBackoffInterval: 300, |
| 133 | + maxTotalBackoffDuration: 43200, |
| 134 | + jitterPercent: 10, |
| 135 | + retryableStatusCodes: [408, 410, 429, 460, 500, 502, 503, 504, 508], |
| 136 | + }, |
| 137 | +}; |
| 138 | +``` |
| 139 | + |
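Since the Settings CDN may deliver only some of these fields, one way to apply the defaults is a per-section merge. The sketch below is an assumption about how that could work, not the plan's actual code: `RemoteHttpConfig`, `ResolvedHttpConfig`, and `resolveHttpConfig` are hypothetical names, and the types are repeated only to keep the example self-contained.

```typescript
type RateLimitConfig = {
  enabled: boolean;
  maxRetryCount: number;
  maxRetryInterval: number; // seconds
  maxTotalBackoffDuration: number; // seconds
};

type BackoffConfig = {
  enabled: boolean;
  maxRetryCount: number;
  baseBackoffInterval: number; // seconds
  maxBackoffInterval: number; // seconds
  maxTotalBackoffDuration: number; // seconds
  jitterPercent: number; // 0-100
  retryableStatusCodes: number[];
};

// Hypothetical: what the CDN might send (any field may be absent).
type RemoteHttpConfig = {
  rateLimitConfig?: Partial<RateLimitConfig>;
  backoffConfig?: Partial<BackoffConfig>;
};

// Hypothetical: fully populated config used at runtime.
type ResolvedHttpConfig = {
  rateLimitConfig: RateLimitConfig;
  backoffConfig: BackoffConfig;
};

const defaultHttpConfig: ResolvedHttpConfig = {
  rateLimitConfig: {
    enabled: true,
    maxRetryCount: 100,
    maxRetryInterval: 300,
    maxTotalBackoffDuration: 43200, // 12 hours
  },
  backoffConfig: {
    enabled: true,
    maxRetryCount: 100,
    baseBackoffInterval: 0.5,
    maxBackoffInterval: 300,
    maxTotalBackoffDuration: 43200,
    jitterPercent: 10,
    retryableStatusCodes: [408, 410, 429, 460, 500, 502, 503, 504, 508],
  },
};

// Remote values win field by field; anything missing falls back to defaults.
function resolveHttpConfig(remote?: RemoteHttpConfig): ResolvedHttpConfig {
  return {
    rateLimitConfig: {
      ...defaultHttpConfig.rateLimitConfig,
      ...remote?.rateLimitConfig,
    },
    backoffConfig: {
      ...defaultHttpConfig.backoffConfig,
      ...remote?.backoffConfig,
    },
  };
}
```

For example, `resolveHttpConfig({ backoffConfig: { maxRetryCount: 5 } })` would lower the per-batch retry limit while keeping every other default.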
| 140 | +### Step 3: Enhance Error Classification |
| 141 | + |
| 142 | +**File**: `/packages/core/src/errors.ts` |
| 143 | + |
| 144 | +Add new functions to existing error handling: |
| 145 | + |
| 146 | +```typescript |
| 147 | +/** |
| 148 | + * Classifies HTTP errors per TAPI SDD tables |
| 149 | + */ |
| 150 | +export const classifyError = ( |
| 151 | + statusCode: number, |
| 152 | + retryableStatusCodes: number[] = [408, 410, 429, 460, 500, 502, 503, 504, 508] |
| 153 | +): ErrorClassification => { |
| 154 | + // 429 rate limiting |
| 155 | + if (statusCode === 429) { |
| 156 | + return { |
| 157 | + isRetryable: true, |
| 158 | + errorType: 'rate_limit', |
| 159 | + }; |
| 160 | + } |
| 161 | + |
| 162 | + // Retryable transient errors |
| 163 | + if (retryableStatusCodes.includes(statusCode)) { |
| 164 | + return { |
| 165 | + isRetryable: true, |
| 166 | + errorType: 'transient', |
| 167 | + }; |
| 168 | + } |
| 169 | + |
| 170 | + // Non-retryable (400, 401, 403, 404, 413, 422, 501, 505, etc.) |
| 171 | + return { |
| 172 | + isRetryable: false, |
| 173 | + errorType: 'permanent', |
| 174 | + }; |
| 175 | +}; |
| 176 | + |
| 177 | +/** |
| 178 | + * Parses Retry-After header value |
| 179 | + * Supports both seconds (number) and HTTP date format |
| 180 | + */ |
| 181 | +export const parseRetryAfter = ( |
| 182 | + retryAfterValue: string | null, |
| 183 | + maxRetryInterval: number = 300 |
| 184 | +): number | undefined => { |
| 185 | + if (!retryAfterValue) return undefined; |
| 186 | + |
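| 187 | + // Try parsing as delay-seconds: digits only, so "-5" and "12abc" are not accepted as a delay |
| 188 | + const seconds = /^\d+$/.test(retryAfterValue.trim()) ? parseInt(retryAfterValue, 10) : NaN; |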
| 189 | + if (!isNaN(seconds)) { |
| 190 | + return Math.min(seconds, maxRetryInterval); |
| 191 | + } |
| 192 | + |
| 193 | + // Try parsing as HTTP date |
| 194 | + const retryDate = new Date(retryAfterValue); |
| 195 | + if (!isNaN(retryDate.getTime())) { |
| 196 | + const secondsUntil = Math.ceil((retryDate.getTime() - Date.now()) / 1000); |
| 197 | + return Math.min(Math.max(secondsUntil, 0), maxRetryInterval); |
| 198 | + } |
| 199 | + |
| 200 | + return undefined; |
| 201 | +}; |
| 202 | +``` |
| 203 | + |
| 204 | +### Step 4: Update HTTP API Layer |
| 205 | + |
| 206 | +**File**: `/packages/core/src/api.ts` |
| 207 | + |
| 208 | +Modify uploadEvents to support retry headers and return full Response: |
| 209 | + |
| 210 | +```typescript |
| 211 | +export const uploadEvents = async ({ |
| 212 | + writeKey, |
| 213 | + url, |
| 214 | + events, |
| 215 | + retryCount = 0, // NEW: for X-Retry-Count header |
| 216 | +}: { |
| 217 | + writeKey: string; |
| 218 | + url: string; |
| 219 | + events: SegmentEvent[]; |
| 220 | + retryCount?: number; // NEW |
| 221 | +}): Promise<Response> => { |
| 222 | + // Return type changed from Promise<void> to Promise<Response> |
| 223 | + // Create Authorization header (Basic auth format) |
| 224 | + const authHeader = 'Basic ' + btoa(writeKey + ':'); |
| 225 | + |
| 226 | + const response = await fetch(url, { |
| 227 | + method: 'POST', |
| 228 | + body: JSON.stringify({ |
| 229 | + batch: events, |
| 230 | + sentAt: new Date().toISOString(), |
| 231 | + writeKey: writeKey, // Keep in body for backwards compatibility |
| 232 | + }), |
| 233 | + headers: { |
| 234 | + 'Content-Type': 'application/json; charset=utf-8', |
| 235 | + 'Authorization': authHeader, // NEW |
| 236 | + 'X-Retry-Count': retryCount.toString(), // NEW |
| 237 | + }, |
| 238 | + }); |
| 239 | + |
| 240 | + return response; // Return full response (not just void) |
| 241 | +}; |
| 242 | +``` |
| 243 | + |
| 244 | +### Step 5: Create Upload State Machine |
| 245 | + |
| 246 | +**New File**: `/packages/core/src/backoff/UploadStateMachine.ts` |
| 247 | + |
| 248 | +(See full implementation in code) |
| 249 | + |
| 250 | +### Step 6: Create Batch Upload Manager |
| 251 | + |
| 252 | +**New File**: `/packages/core/src/backoff/BatchUploadManager.ts` |
| 253 | + |
| 254 | +(See full implementation in code) |
| 255 | + |
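The per-batch delay calculation is not spelled out in this plan, so the following is only a sketch of one common form of capped exponential backoff with jitter. The formula, the function signature, and the injectable `random` parameter are assumptions; the field names mirror `BackoffConfig` from Step 1.

```typescript
// Hypothetical helper; field names mirror BackoffConfig from Step 1.
type BackoffParams = {
  baseBackoffInterval: number; // seconds
  maxBackoffInterval: number; // seconds
  jitterPercent: number; // 0-100
};

// Assumed formula: min(base * 2^retryCount, max), then +/- jitterPercent,
// with the final value clamped to [0, max].
// `random` is injectable so the result can be made deterministic in tests.
function calculateBackoff(
  retryCount: number,
  config: BackoffParams,
  random: () => number = Math.random
): number {
  const exponential = config.baseBackoffInterval * Math.pow(2, retryCount);
  const capped = Math.min(exponential, config.maxBackoffInterval);

  // Map random() in [0, 1) to a factor in [-1, 1), scaled by jitterPercent.
  const jitter = capped * (config.jitterPercent / 100) * (random() * 2 - 1);

  return Math.min(config.maxBackoffInterval, Math.max(0, capped + jitter));
}
```

With the default values from Step 2 (base 0.5 s, cap 300 s), this form gives 0.5 s, 1 s, 2 s, and 4 s for the first four retries before jitter, and reaches the 300 s cap at the tenth retry (0.5 × 2^10 = 512 s, capped to 300 s).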
| 256 | +### Step 7: Create Barrel Export |
| 257 | + |
| 258 | +**New File**: `/packages/core/src/backoff/index.ts` |
| 259 | + |
| 260 | +```typescript |
| 261 | +export { UploadStateMachine } from './UploadStateMachine'; |
| 262 | +export { BatchUploadManager } from './BatchUploadManager'; |
| 263 | +``` |
| 264 | + |
| 265 | +### Step 8: Integrate into SegmentDestination |
| 266 | + |
| 267 | +**File**: `/packages/core/src/plugins/SegmentDestination.ts` |
| 268 | + |
| 269 | +Major modifications to integrate state machine and batch manager: |
| 270 | + |
| 271 | +(See full implementation in code) |
| 272 | + |
| 273 | +## Testing Strategy |
| 274 | + |
| 275 | +### Unit Tests |
| 276 | + |
| 277 | +1. **Error Classification** (`/packages/core/src/__tests__/errors.test.ts`) |
| 278 | + |
| 279 | + - classifyError() for all status codes in SDD tables |
| 280 | + - parseRetryAfter() with seconds, HTTP dates, invalid values |
| 281 | + |
| 282 | +2. **Upload State Machine** (`/packages/core/src/backoff/__tests__/UploadStateMachine.test.ts`) |
| 283 | + |
| 284 | + - canUpload() returns true/false based on state and time |
| 285 | + - handle429() sets waitUntilTime and increments counter |
| 286 | + - Max retry count enforcement |
| 287 | + - Max total backoff duration enforcement |
| 288 | + - State persistence across restarts |
| 289 | + |
| 290 | +3. **Batch Upload Manager** (`/packages/core/src/backoff/__tests__/BatchUploadManager.test.ts`) |
| 291 | + - calculateBackoff() produces correct exponential values with jitter |
| 292 | + - handleRetry() increments retry count and schedules next retry |
| 293 | + - Max retry count enforcement |
| 294 | + - Max total backoff duration enforcement |
| 295 | + - Batch metadata persistence |
| 296 | + |
| 297 | +### Integration Tests |
| 298 | + |
| 299 | +**File**: `/packages/core/src/plugins/__tests__/SegmentDestination.test.ts` |
| 300 | + |
| 301 | +Add test cases for: |
| 302 | + |
| 303 | +- 429 response halts upload loop (remaining batches not processed) |
| 304 | +- 429 response blocks future flush() calls until waitUntilTime |
| 305 | +- Successful upload after 429 resets state machine |
| 306 | +- Transient error (500) retries per-batch without blocking other batches |
| 307 | +- Non-retryable error (400) drops batch immediately |
| 308 | +- X-Retry-Count header sent with correct value |
| 309 | +- Authorization header contains base64-encoded writeKey |
| 310 | +- Sequential batch processing (not parallel) |
| 311 | +- Legacy behavior when `enabled` is `false` in both `rateLimitConfig` and `backoffConfig` |
| 312 | + |
| 313 | +## Verification Steps |
| 314 | + |
| 315 | +### End-to-End Testing |
| 316 | + |
| 317 | +1. **Mock TAPI responses** in test environment |
| 318 | +2. **Verify state persistence**: Trigger 429, close app, reopen → should still be in WAITING state |
| 319 | +3. **Verify headers**: Intercept HTTP requests and check for Authorization and X-Retry-Count headers |
| 320 | +4. **Verify sequential processing**: Queue 3 batches, return 429 on first → only 1 fetch call should occur |
| 321 | +5. **Verify logging**: Check logs for "Rate limited", "Batch uploaded successfully", "retry scheduled" messages |
| 322 | + |
| 323 | +### Manual Testing Checklist |
| 324 | + |
| 325 | +- [ ] Test with real TAPI endpoint during low-load period |
| 326 | +- [ ] Trigger 429 by sending many events quickly |
| 327 | +- [ ] Verify retry happens after Retry-After period |
| 328 | +- [ ] Verify batches are dropped after max retry count |
| 329 | +- [ ] Verify batches are dropped after max total backoff duration (12 hours) |
| 330 | +- [ ] Test app restart during WAITING state (should persist) |
| 331 | +- [ ] Test legacy behavior with `enabled: false` in both `rateLimitConfig` and `backoffConfig` |
| 332 | +- [ ] Verify no breaking changes to existing event tracking |
| 333 | + |
| 334 | +## Critical Files |
| 335 | + |
| 336 | +### New Files (3) |
| 337 | + |
| 338 | +1. `/packages/core/src/backoff/UploadStateMachine.ts` - Global rate limiting state machine |
| 339 | +2. `/packages/core/src/backoff/BatchUploadManager.ts` - Per-batch retry and backoff |
| 340 | +3. `/packages/core/src/backoff/index.ts` - Barrel export |
| 341 | + |
| 342 | +### Modified Files (5) |
| 343 | + |
| 344 | +1. `/packages/core/src/types.ts` - Add HttpConfig, UploadStateData, BatchMetadata types |
| 345 | +2. `/packages/core/src/errors.ts` - Add classifyError() and parseRetryAfter() |
| 346 | +3. `/packages/core/src/api.ts` - Add retryCount param and Authorization header |
| 347 | +4. `/packages/core/src/plugins/SegmentDestination.ts` - Integrate state machine and batch manager |
| 348 | +5. `/packages/core/src/constants.ts` - Add defaultHttpConfig |
| 349 | + |
| 350 | +### Test Files (3 new + 1 modified) |
| 351 | + |
| 352 | +1. `/packages/core/src/backoff/__tests__/UploadStateMachine.test.ts` |
| 353 | +2. `/packages/core/src/backoff/__tests__/BatchUploadManager.test.ts` |
| 354 | +3. `/packages/core/src/__tests__/errors.test.ts` - Add classification tests |
| 355 | +4. `/packages/core/src/plugins/__tests__/SegmentDestination.test.ts` - Add integration tests |
| 356 | + |
| 357 | +## Rollout Strategy |
| 358 | + |
| 359 | +1. **Phase 1**: Implement and test in development with `enabled: false` in both `rateLimitConfig` and `backoffConfig` (falls back to legacy behavior) |
| 360 | +2. **Phase 2**: Enable in staging with verbose logging, monitor for issues |
| 361 | +3. **Phase 3**: Update Settings CDN to include httpConfig with `enabled: true` |
| 362 | +4. **Phase 4**: Monitor production metrics (retry count, drop rate, 429 frequency) |
| 363 | +5. **Phase 5**: Tune configuration parameters based on real-world data |
| 364 | + |
| 365 | +## Success Metrics |
| 366 | + |
| 367 | +- Reduction in 429 responses from TAPI during high-load events |
| 368 | +- Reduction in repeated failed upload attempts from slow clients |
| 369 | +- No increase in event loss rate (should be same or better) |
| 370 | +- Successful state persistence across app restarts |
| 371 | +- No performance degradation in normal operation |
| 372 | + |
| 373 | +## Implementation Status |
| 374 | + |
| 375 | +**Status**: ✅ COMPLETE (2026-02-09) |
| 376 | + |
| 377 | +All implementation steps have been completed: |
| 378 | + |
| 379 | +- ✅ Type definitions added |
| 380 | +- ✅ Default configuration added |
| 381 | +- ✅ Error classification functions implemented |
| 382 | +- ✅ API layer updated with retry headers |
| 383 | +- ✅ Upload State Machine implemented |
| 384 | +- ✅ Batch Upload Manager implemented |
| 385 | +- ✅ Integration into SegmentDestination complete |
| 386 | +- ✅ TypeScript compilation successful |
| 387 | + |
| 388 | +**Next Steps**: |
| 389 | + |
| 390 | +1. Write unit tests for error classification functions |
| 391 | +2. Write unit tests for UploadStateMachine |
| 392 | +3. Write unit tests for BatchUploadManager |
| 393 | +4. Add integration tests to SegmentDestination.test.ts |
| 394 | +5. Perform end-to-end testing |
| 395 | +6. Update Settings CDN configuration |