# Pagination Performance Analysis for PR #33

## Summary

After investigating the pagination changes in PR #33, I've identified three performance issues that slow down syncs, most noticeably in the scenarios described below.

## Issues Identified

### 1. Cache Invalidation (First Sync After Update) ⚠️

**Impact**: Severe slowdown on the first sync after updating

**Root Cause**: The pagination implementation changed all API endpoint URLs by adding query parameters:
- **Before**: `/user/following`, `/users/bob/starred`
- **After**: `/user/following?page=1&per_page=100`, `/users/bob/starred?page=1&per_page=100`

**Problem**: The GitHub client uses path-based ETag caching (github/client.go:168-173). When paths change, all cached ETags become useless. This means:
- The first sync after updating to PR #33 hits GitHub's API for **every endpoint** instead of returning 304 Not Modified
- No benefit from conditional requests on the first sync
- For a user following 100 people, that's 1 + (3 × 100) = **301 API calls** instead of potentially just a few

**Evidence**: See the cache lookup in github/client.go:168-173
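
To make the mechanism concrete, here is a minimal sketch of a path-keyed ETag cache of the kind described above. The `Client` fields and helper shapes are assumptions for illustration; the real code in github/client.go:168-173 may differ.

```go
package github

import (
	"context"
	"encoding/json"
	"io"
	"net/http"
)

// Assumed cache layout: both maps are keyed by the request path
// *including* the query string.
type Client struct {
	baseURL    string
	httpClient *http.Client
	etagCache  map[string]string
	bodyCache  map[string][]byte
}

func (c *Client) get(ctx context.Context, path string, v interface{}) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.baseURL+path, nil)
	if err != nil {
		return err
	}
	// "/user/following" and "/user/following?page=1&per_page=100" are two
	// unrelated keys here, which is why PR #33's URL change misses the
	// cache on every endpoint during the first sync.
	if etag, ok := c.etagCache[path]; ok {
		req.Header.Set("If-None-Match", etag)
	}
	resp, err := c.httpClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNotModified {
		return json.Unmarshal(c.bodyCache[path], v) // serve the cached payload
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	c.etagCache[path] = resp.Header.Get("ETag")
	c.bodyCache[path] = body
	return json.Unmarshal(body, v)
}
```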

### 2. Extra API Call for Exactly 100 Items ⚠️

**Impact**: Moderate slowdown for users with exactly 100 items in any category

**Root Cause**: The pagination logic in `getPaginated()` (github/client.go:301-333) keeps fetching whenever it gets exactly `perPage` items:

```go
// If we got fewer results than per_page, this is the last page
if pageSlice.Len() < perPage {
    break
}
```

**Problem**: When a result set contains exactly 100 items (or any multiple of 100), the code (see the sketch below):
1. Fetches page 1 → gets 100 items
2. Sees 100 == 100 (perPage), so continues
3. Fetches page 2 → gets 0 items
4. Only then stops
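
A condensed sketch of this loop makes the wasted round trip visible. Concrete types and names are assumed (the real `getPaginated()` is reflection-based and generic), it reuses the `Client.get` sketch from issue #1, and it presumes `import "fmt"` and a `Repository` type:

```go
// Simplified model of getPaginated(): fetch pages until a short page.
func (c *Client) getPaginated(ctx context.Context, path string, out *[]Repository) error {
	const perPage = 100
	for page := 1; ; page++ {
		var pageItems []Repository
		url := fmt.Sprintf("%s?page=%d&per_page=%d", path, page, perPage)
		if err := c.get(ctx, url, &pageItems); err != nil {
			return err
		}
		*out = append(*out, pageItems...)
		// With exactly 100 total items, page 1 returns a full page, this
		// condition is false, and the loop issues a second request that
		// comes back empty before finally stopping.
		if len(pageItems) < perPage {
			break
		}
	}
	return nil
}
```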

**Example Scenario**: A user who follows exactly 100 people:
- Expected: 1 API call
- Actual: 2 API calls (100% increase)

**Per-User Impact**: For each followed user, we call 3 methods:
- `GetStarredReposByUsername`
- `GetOwnedReposByUsername`
- `GetRecentEvents`

If any of these returns exactly 100 items, that's an extra API call. For 100 followed users, this could add up to 300 extra API calls in the worst case.

**Evidence**: The test in github/client_test.go:637-690 explicitly expects 2 requests for exactly 100 items (see the comment on line 687)

### 3. No Incremental Caching for Multi-Page Results

**Impact**: Minor but cumulative

**Root Cause**: GitHub signals pagination through Link headers; ETags are per-URL and not stable across page boundaries, so each page must be cached independently.

**Problem**: Users with > 100 items in any category can't benefit from conditional requests for pages beyond the first. Each page is a new URL that has never been cached.

**Example**: A user following 250 people:
- Page 1: `/user/following?page=1&per_page=100` (can be cached)
- Page 2: `/user/following?page=2&per_page=100` (new path, no cache benefit)
- Page 3: `/user/following?page=3&per_page=100` (new path, no cache benefit)

## Real-World Impact

### Single-User Scenario (following 5 people)

**Before PR #33**:
- Fetch followed users: 1 API call
- For each of 5 users: 3 API calls
- Total: 16 API calls
- With ETag caching on subsequent syncs: potentially all 16 are lightweight 304 responses

**After PR #33** (first sync):
- Fetch followed users: 1 API call (cache miss due to path change)
- For each of 5 users: 3 API calls (all cache misses)
- Total: 16 API calls **with no cache hits** → full data transfers

**After PR #33** (subsequent syncs):
- Same as before (16 calls), but the new paths are now cached again
- Performance returns to normal after the first sync

### User with Exactly 100 Followed Users

**Before PR #33**:
- Total API calls: ~301 (1 + 3 × 100)

**After PR #33**:
- Followed-users endpoint: 2 calls (100 users, then a check of page 2)
- Per user: up to 2 calls each for starred repos, owned repos, and events, whenever one of those counts is exactly 100
- Worst case: 2 + (6 × 100) = **602 API calls** (100% increase!)
- Typical case: 2 + (4 × 100) = **402 API calls** (33% increase)

## Proposed Fixes

### Fix #1: Preserve Cache Keys (Migration Strategy)

**Option A - Maintain Backward Compatibility** (sketched below):
- Store cache entries for both old and new path formats during the transition
- Check both the `/user/following` and `/user/following?page=1&per_page=100` cache entries
- Gradually migrate over time
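
A hypothetical shim for Option A might fall back to the legacy key and migrate the entry on a hit. Cache field names here are assumptions, and note one more argument for Option B: GitHub computes ETags per response representation, so a legacy ETag may well fail to match the new paginated URL anyway.

```go
import "net/url"

// cachedETag looks up the new key first, then falls back to the
// pre-PR-#33 key for page-1 requests, migrating the entry on a hit.
func (c *Client) cachedETag(path string) (string, bool) {
	if etag, ok := c.etagCache[path]; ok {
		return etag, true
	}
	u, err := url.Parse(path)
	if err != nil || u.Query().Get("page") != "1" {
		return "", false // only page 1 maps onto a legacy cache key
	}
	if etag, ok := c.etagCache[u.Path]; ok {
		c.etagCache[path] = etag // migrate to the new key
		delete(c.etagCache, u.Path)
		return etag, true
	}
	return "", false
}
```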

**Option B - Accept a One-Time Cache Miss**:
- Document that the first sync after updating will be slower
- This is a one-time cost that resolves itself automatically

**Recommendation**: Option B, with documentation. The cache miss is a one-time event, and fixing it adds complexity.

### Fix #2: Optimize Exactly-100 Detection ⭐ PRIORITY

**Current Logic**:
```go
if pageSlice.Len() < perPage {
    break
}
```

**Proposed Fix**: Use GitHub's Link header to detect whether there is a next page, instead of assuming there is one whenever we get exactly `perPage` items.

**Alternative Fix** (a placeholder for the special case; as the comments note, it ultimately needs the Link header anyway):
```go
// If we got fewer than per_page items, this is the last page
if pageSlice.Len() < perPage {
    break
}

// We got exactly per_page items: the next page may or may not be empty.
// Without consulting the Link header (Fix #3), the only way to find out
// is another request, so this branch alone cannot avoid the extra call.
// GitHub provides: Link: <url?page=2>; rel="next"
```

**Impact**: Reduces API calls by up to 50% for users with exactly 100 items in any category.

### Fix #3: Implement Link Header Pagination ⭐ BEST SOLUTION

**Proper Approach**: Parse GitHub's `Link` header to determine if there are more pages:

```go
// Check for a Link header with rel="next"
linkHeader := resp.Header.Get("Link")
hasNext := strings.Contains(linkHeader, `rel="next"`)

if !hasNext || pageSlice.Len() == 0 {
    break
}
```
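
`strings.Contains` is a serviceable first cut, though a `rel="next"` substring could in principle appear inside a URL. A slightly more careful sketch (a new helper, not existing code) parses the header's comma-separated entries and also recovers the next page's URL:

```go
import "strings"

// nextPageURL extracts the rel="next" target from a GitHub Link header,
// e.g. `<https://api.github.com/user/following?page=2&per_page=100>; rel="next",
// <https://api.github.com/user/following?page=3&per_page=100>; rel="last"`.
// It returns "" when there is no next page.
func nextPageURL(linkHeader string) string {
	for _, entry := range strings.Split(linkHeader, ",") {
		parts := strings.Split(entry, ";")
		if len(parts) < 2 {
			continue
		}
		target := strings.Trim(strings.TrimSpace(parts[0]), "<>")
		for _, param := range parts[1:] {
			if strings.TrimSpace(param) == `rel="next"` {
				return target
			}
		}
	}
	return ""
}
```

The loop can then terminate when `nextPageURL(resp.Header.Get("Link")) == ""`, and could even follow the returned URL directly instead of reconstructing it from a page counter.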

**Benefits**:
- Eliminates the extra API call for exactly 100 items
- More reliable than guessing based on the result count
- Standard GitHub API practice
- No assumptions about page size

**Impact**: Fixes issue #2 completely, cutting up to ~300 unnecessary API calls for users following many people.

## Recommendation

1. **Immediate**: Implement Link header pagination (Fix #3) - this is the proper solution and fixes issue #2 completely
2. **Document**: Note that the first sync after updating will be slower due to cache invalidation (issue #1)
3. **Consider**: Future optimization for multi-page caching (issue #3), only if it becomes a bottleneck

## Test Case to Add

```go
func TestGetPaginatedNoExtraCallForExactly100Items(t *testing.T) {
	// Test that we don't make an extra API call when exactly 100 items exist.
	// The client should use the Link header to detect the end of pagination.
	repos := make([]Repository, 100)
	for i := 0; i < 100; i++ {
		repos[i] = Repository{ID: int64(i), Name: fmt.Sprintf("repo%d", i)}
	}

	requestCount := 0
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		requestCount++
		page := r.URL.Query().Get("page")

		if page == "1" {
			w.Header().Set("Content-Type", "application/json")
			// No Link header with rel="next" means this is the last page
			json.NewEncoder(w).Encode(repos)
		} else {
			t.Errorf("should not request page %s when page 1 has no next link", page)
		}
	}))
	defer server.Close()

	c := NewClient("token", WithBaseURL(server.URL))
	result, err := c.GetStarredRepos(context.Background())

	require.NoError(t, err)
	assert.Equal(t, 100, len(result))
	assert.Equal(t, 1, requestCount, "should only make 1 request when Link header indicates no next page")
}
```

## Files to Modify

1. **github/client.go**: Update `getPaginated()` to check the Link header (lines 286-338)
2. **github/client.go**: Update the `get()` method to return or parse the Link header (see the sketch below)
3. **github/client_test.go**: Update test expectations (lines 687-689)
4. **github/client_test.go**: Add a new test for Link header pagination
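
For items 1 and 2 together, one hypothetical shape: `get()` gains a `(string, error)` return carrying the Link header, and `nextPageURL()` (sketched under Fix #3) turns it into the next request URL. The current `get()` signature is an assumption and may differ.

```go
// Hypothetical revised getPaginated(): the loop follows Link headers and
// never probes an empty page, so exactly-100 result sets cost one request.
func (c *Client) getPaginated(ctx context.Context, path string, out *[]Repository) error {
	// Assumes get() accepts the absolute URLs GitHub returns in Link headers.
	next := c.baseURL + path + "?page=1&per_page=100"
	for next != "" {
		var pageItems []Repository
		link, err := c.get(ctx, next, &pageItems) // hypothetical (string, error) signature
		if err != nil {
			return err
		}
		*out = append(*out, pageItems...)
		next = nextPageURL(link) // "" when GitHub reports no next page
	}
	return nil
}
```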

## API Call Comparison

| Scenario | Before PR #33 | After PR #33 (current) | After Fix |
|----------|---------------|------------------------|-----------|
| 5 followed users (first sync) | 16 | 16 | 16 |
| 5 followed users (cached) | 16 (mostly 304s) | 16 (mostly 304s after first) | 16 (mostly 304s) |
| 100 followed users, all with ~50 items each | 301 | 301 | 301 |
| 100 followed users, all with exactly 100 items | 301 | 602 | 301 |
| **100 followed users, 50 with exactly 100 items** | **301** | **451** | **301** |

**Bottom line**: In the bolded scenario, the current implementation adds **150 extra API calls** (a 50% increase); in the all-exactly-100 scenario it doubles the call count.