Commit 24a6aaf
[WIP] Implement smart retry for structured message validation - investigating buffer corruption issues (#47146)
* Initial plan
* Add smart retry tests for StorageContentValidationDecoderPolicy
Add three new tests to BlobMessageDecoderDownloadTests.java:
1. downloadStreamWithResponseContentValidationSmartRetry - Tests basic smart retry with network interruptions
2. downloadStreamWithResponseContentValidationSmartRetryMultipleSegments - Tests retry with multiple segments
3. downloadStreamWithResponseContentValidationSmartRetryLargeBlob - Tests retry with larger blobs
These tests use MockPartialResponsePolicy to simulate network interruptions and verify that:
- Decoder validates checksums for all received data before retry
- Decoder state is preserved across retries
- SDK continues from the correct offset after interruption
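To make the simulated failure concrete, here is a minimal, self-contained sketch in plain java.io — not the actual MockPartialResponsePolicy, whose internals are test-only — of how an interruption-simulating wrapper can cut a download after a fixed number of bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Fails the stream after a fixed number of bytes, the way a dropped
    connection interrupts a download mid-stream. Illustrative only. */
final class InterruptingInputStream extends InputStream {
    private final InputStream delegate;
    private final long failAfter; // bytes to deliver before the simulated drop
    private long delivered;

    InterruptingInputStream(InputStream delegate, long failAfter) {
        this.delegate = delegate;
        this.failAfter = failAfter;
    }

    @Override
    public int read() throws IOException {
        if (delivered >= failAfter) {
            throw new IOException("Simulated connection reset"); // what triggers smart retry
        }
        int b = delegate.read();
        if (b >= 0) {
            delivered++;
        }
        return b;
    }

    public static void main(String[] args) throws IOException {
        byte[] blob = new byte[32];
        try (InputStream in = new InterruptingInputStream(new ByteArrayInputStream(blob), 10)) {
            in.readAllBytes(); // throws IOException after delivering 10 bytes
        }
    }
}
```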
Co-authored-by: gunjansingh-msft <[email protected]>
* Fix smart retry tests to include StorageContentValidationDecoderPolicy
The tests were missing the StorageContentValidationDecoderPolicy, which is required to actually decode and validate structured messages. Without this policy, the tests exercised only the retry mechanism, not the structured message validation.
Changes:
- Added StorageContentValidationDecoderPolicy import
- Updated all three smart retry tests to create clients with BOTH MockPartialResponsePolicy (for simulating interruptions) AND StorageContentValidationDecoderPolicy (for decoding/validating)
- Added explanatory comments about why both policies are needed
Co-authored-by: gunjansingh-msft <[email protected]>
* Fix smart retry for structured message validation to restart from beginning
The issue was that structured messages cannot be decoded from arbitrary offsets - they must be decoded sequentially from the start. When a network interruption occurs during a download with structured message validation, the retry must restart from offset 0, not from the interrupted offset.
Changes to BlobAsyncClientBase:
- Modified retry logic to check if structured message validation is enabled
- When enabled, retries restart from the initial offset (beginning of download)
- When disabled, normal smart retry behavior resumes from interrupted offset
- This fixes "Unexpected segment number" errors that occurred when trying to decode from middle offsets
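A minimal sketch of the branch described above, with illustrative names rather than the actual BlobAsyncClientBase fields (note that later commits in this series refine the structured-message case to resume from the decoder's encoded offset instead of always restarting):

```java
final class RetryOffsets {
    /** Hypothetical helper mirroring the retry branch described above. */
    static long nextRequestOffset(boolean structuredValidation, long initialOffset, long interruptedOffset) {
        // Structured messages must be decoded sequentially from the start,
        // so a retry restarts the whole download; plain downloads resume in place.
        return structuredValidation ? initialOffset : interruptedOffset;
    }

    public static void main(String[] args) {
        System.out.println(nextRequestOffset(true, 0, 4096));  // 0: restart from the beginning
        System.out.println(nextRequestOffset(false, 0, 4096)); // 4096: resume where interrupted
    }
}
```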
Changes to tests:
- Updated test assertions to verify all requests start from offset 0 with structured message validation
- Updated test comments to reflect correct behavior (retries restart from beginning)
- Tests now properly validate that structured message downloads succeed despite interruptions
Co-authored-by: gunjansingh-msft <[email protected]>
* Implement proper smart retry using encoded byte offset for structured messages
Changed the retry logic to use the encoded byte offset from the decoder state instead of restarting from the beginning. This enables true smart retry where downloads resume from the interrupted encoded offset.
Changes:
- Modified BlobAsyncClientBase to extract the encoded offset from DecoderState
- On retry, request range starts from the encoded offset (totalEncodedBytesProcessed)
- Decoder state is preserved across retries to maintain validation state
- Updated test assertions to verify smart retry behavior (resume from encoded offset)
- Updated test comments to reflect correct smart retry behavior
Note: This implementation resumes from the encoded offset where the interruption occurred. The decoder will need to handle resuming from mid-stream data, which may require additional enhancements to support arbitrary offset resumption within structured message segments.
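A sketch of the resulting range request, assuming the download began at encoded offset 0; the helper name is illustrative, not the SDK's API:

```java
final class EncodedOffsetResume {
    /** Illustrative helper: the retry's Range header starts at the decoder's
        encoded-byte position (totalEncodedBytesProcessed), assuming the
        download began at encoded offset 0. */
    static String retryRangeHeader(long totalEncodedBytesProcessed) {
        return "bytes=" + totalEncodedBytesProcessed + "-";
    }

    public static void main(String[] args) {
        // 1,024 encoded bytes already received and validated; ask for the rest.
        System.out.println(retryRangeHeader(1024)); // bytes=1024-
    }
}
```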
Co-authored-by: gunjansingh-msft <[email protected]>
* Implement smart retry with proper duplicate byte skipping for structured messages
Enhanced StorageContentValidationDecoderPolicy to properly handle retries from arbitrary mid-segment offsets by detecting and skipping duplicate bytes.
Changes to StorageContentValidationDecoderPolicy:
- Added getResponseStartOffset() to parse Content-Range header and determine retry offset
- Enhanced DecoderState to track currentResponseStartOffset and currentResponseBytesRead
- Modified decodeStream() to calculate absolute buffer offsets and skip duplicate bytes
- When a retry occurs, the policy now:
1. Detects the starting offset from Content-Range header
2. Calculates which bytes in each buffer are duplicates
3. Skips duplicate bytes before feeding to decoder
4. Only processes new bytes, maintaining continuous decoder state
This implementation allows the decoder to receive a continuous stream of non-duplicate encoded bytes, even when retries occur from arbitrary offsets. The decoder doesn't need to be aware of retries - the policy handles all offset calculations and duplicate filtering.
Compatible with range downloads and works for arbitrary mid-segment offset resumption.
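A sketch of the getResponseStartOffset()-style parsing and the duplicate-count arithmetic described above, using only the standard library (the real policy's signatures may differ):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentRangeOffset {
    // Matches e.g. "bytes 1024-2047/8192"; group 1 is the start offset.
    private static final Pattern CONTENT_RANGE = Pattern.compile("bytes\\s+(\\d+)-(\\d+)/(\\d+|\\*)");

    /** Returns the first byte position a (retried) response covers,
        or 0 when there is no Content-Range header (a full response). */
    static long responseStartOffset(String contentRangeHeader) {
        if (contentRangeHeader == null) {
            return 0;
        }
        Matcher m = CONTENT_RANGE.matcher(contentRangeHeader);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected Content-Range: " + contentRangeHeader);
        }
        return Long.parseLong(m.group(1));
    }

    /** How many leading bytes of this response were already fed to the decoder. */
    static long duplicateByteCount(long startOffset, long totalEncodedBytesProcessed) {
        return Math.max(0, totalEncodedBytesProcessed - startOffset);
    }

    public static void main(String[] args) {
        System.out.println(responseStartOffset("bytes 1024-2047/8192")); // 1024
        System.out.println(duplicateByteCount(1024, 1500));              // 476 bytes to skip
    }
}
```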
Co-authored-by: gunjansingh-msft <[email protected]>
* Optimize import - use short class name for StorageContentValidationDecoderPolicy
Added import for StorageContentValidationDecoderPolicy.DecoderState and replaced fully qualified class names with short names for better readability.
Addresses comment #2498919462.
Co-authored-by: gunjansingh-msft <[email protected]>
* Remove duplicate byte skipping logic - use simpler approach
Reverted to a simpler implementation that doesn't try to skip duplicate bytes. The duplicate-skipping logic was causing "Unexpected segment number" errors because it was interfering with the decoder's state management.
The correct approach:
1. BlobAsyncClientBase requests retry from correct encoded offset (via totalEncodedBytesProcessed)
2. Server sends bytes starting from that offset (no duplicates)
3. Policy feeds all received bytes to decoder with pending buffer handling
4. Decoder maintains state and processes bytes sequentially
Removed:
- getResponseStartOffset() method
- currentResponseStartOffset and currentResponseBytesRead fields from DecoderState
- Duplicate byte detection and skipping logic in decodeStream()
This simpler approach relies on correct offset calculation in the retry logic (which we have) rather than trying to detect and skip duplicates at the policy level.
Addresses comment #2499104452 - investigating test failures.
Co-authored-by: gunjansingh-msft <[email protected]>
* Fix smart retry by preserving partial data in pending buffer on decoder exceptions
The root cause of "Unexpected segment number" errors was that when the decoder threw exceptions due to insufficient data (e.g., "Content not long enough"), the partial data was not saved to the pending buffer. This caused retries to lose accumulated bytes and start over.
Key changes:
1. Moved buffer combination and byte tracking outside try block to execute regardless of decoder success/failure
2. Added specific handling for IllegalArgumentException with "not long enough" message
3. When insufficient data error occurs, save all accumulated data to pending buffer and return empty (don't fail)
4. This allows the stream to continue accumulating bytes across retries until enough data is available for decoding
How it works now:
- Request bytes=0-: Get byte 0, insufficient for header, save to pending, return empty
- Stream error triggers retry
- Request bytes=1-: Get byte 1, combine with pending [byte 0], still insufficient, save [0,1] to pending
- Continue until 13+ bytes accumulated in pending
- Eventually enough bytes available, decoder succeeds and processes the header
- Smart retry resumes from correct encoded offset
This fix enables true smart retry where partial data is preserved across network interruptions.
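A self-contained sketch of the combine-then-catch flow described above; the 13-byte header check stands in for the real decoder, and all names are illustrative:

```java
import java.nio.ByteBuffer;

final class PendingBufferSketch {
    private ByteBuffer pending; // bytes received but not yet decodable

    /** Stand-in for the real decoder: insists on a full 13-byte message header. */
    private static void decode(ByteBuffer data) {
        if (data.remaining() < 13) {
            throw new IllegalArgumentException("Content not long enough");
        }
        data.position(data.position() + 13); // "parse" the header
    }

    /** Combine pending + new bytes before decoding, so nothing is lost on failure. */
    ByteBuffer onBuffer(ByteBuffer incoming) {
        ByteBuffer combined = combine(pending, incoming);
        try {
            decode(combined.duplicate());
            pending = null;                    // success: nothing carried over
            return combined;                   // simplified: the real policy emits decoded payload
        } catch (IllegalArgumentException e) {
            if (e.getMessage() != null && e.getMessage().contains("not long enough")) {
                pending = combined;            // preserve everything for the next buffer/retry
                return ByteBuffer.allocate(0); // emit nothing; do not fail the stream
            }
            throw e;                           // genuine validation failure
        }
    }

    private static ByteBuffer combine(ByteBuffer a, ByteBuffer b) {
        if (a == null) {
            return b;
        }
        ByteBuffer out = ByteBuffer.allocate(a.remaining() + b.remaining());
        out.put(a.duplicate());
        out.put(b.duplicate());
        out.flip();
        return out;
    }

    public static void main(String[] args) {
        PendingBufferSketch p = new PendingBufferSketch();
        System.out.println(p.onBuffer(ByteBuffer.allocate(5)).remaining()); // 0 (saved to pending)
        System.out.println(p.onBuffer(ByteBuffer.allocate(8)).remaining()); // 13 (header now decodable)
    }
}
```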
Co-authored-by: gunjansingh-msft <[email protected]>
* Fix incomplete data detection to handle all decoder exceptions
Extended the incomplete data detection logic to catch ALL types of insufficient data errors from the decoder, not just "not long enough" messages.
The decoder throws different incomplete data errors:
- "Content not long enough" - when parsing message header (13 bytes needed)
- "Segment header is incomplete" - when parsing segment header (10 bytes needed)
- "Segment footer is incomplete" - when parsing segment footer (8 bytes for CRC64)
- "Message footer is incomplete" - when parsing message footer (8 bytes for CRC64)
Previous fix only caught "not long enough", causing failures when interruptions occurred while parsing segments/footers.
Updated exception handling to check for both:
- "not long enough" (header parsing)
- "is incomplete" (segment/footer parsing)
This ensures partial data is preserved in pending buffer for ALL incomplete data scenarios, allowing smart retry to properly accumulate bytes across interruptions regardless of where in the message structure the interruption occurs.
Addresses the "Unexpected segment number" error that occurred when segments were split across retries.
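A sketch of the widened exception check, using the message texts quoted above (the matching strategy is as described in this log, not a documented API):

```java
final class IncompleteDataErrors {
    /** Treats any of the decoder's "need more bytes" messages as recoverable. */
    static boolean isIncompleteDataError(Throwable t) {
        String msg = t.getMessage();
        return t instanceof IllegalArgumentException && msg != null
            && (msg.contains("not long enough")    // message header (13 bytes)
                || msg.contains("is incomplete")); // segment header/footer, message footer
    }

    public static void main(String[] args) {
        System.out.println(isIncompleteDataError(new IllegalArgumentException("Content not long enough")));      // true
        System.out.println(isIncompleteDataError(new IllegalArgumentException("Segment footer is incomplete"))); // true
        System.out.println(isIncompleteDataError(new IllegalArgumentException("Unexpected segment number")));    // false
    }
}
```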
Co-authored-by: gunjansingh-msft <[email protected]>
* Fix smart retry offset calculation to exclude pending buffer bytes
The root cause of "Unexpected segment number" errors was incorrect retry offset calculation. The pending buffer contains bytes that have already been counted in totalEncodedBytesProcessed but haven't been successfully processed by the decoder yet. When retrying, we were requesting bytes AFTER the pending buffer, causing gaps in the data stream.
Key issue:
- totalEncodedBytesProcessed tracks ALL bytes received (including those in pending)
- Pending buffer holds bytes waiting for more data to complete a structure (header/segment/footer)
- Retry offset was set to totalEncodedBytesProcessed, skipping pending bytes
- This caused decoder to receive segment N when expecting segment N-1
Solution:
1. Added getRetryOffset() method to DecoderState that returns: totalEncodedBytesProcessed - pendingBufferSize
2. Updated BlobAsyncClientBase to use getRetryOffset() instead of getTotalEncodedBytesProcessed()
3. Added import for DecoderState inner class
Example flow:
- Receive bytes 0-4, add to totalEncodedBytesProcessed (=5), insufficient for 13-byte header, store in pending
- IOException occurs
- Retry requests from offset 0 (5 - 5 pending bytes = 0) ✓
- Get byte 5, combine with pending [0-4] = [0-5], still insufficient, store in pending
- totalEncodedBytesProcessed now = 6, pending = 6 bytes
- Retry requests from offset 0 (6 - 6 = 0) ✓
- Continue until enough bytes accumulated
This ensures continuous byte stream to decoder with no gaps or duplicates.
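A hypothetical mirror of the DecoderState fields involved, showing the getRetryOffset() arithmetic this commit introduces (note the next commit revises which value the retry actually requests):

```java
import java.nio.ByteBuffer;

final class DecoderStateSketch {
    long totalEncodedBytesProcessed; // every encoded byte ever received
    ByteBuffer pending;              // received but not yet decoded

    /** The calculation described above: re-request the bytes still sitting
        in the pending buffer rather than skipping past them. */
    long getRetryOffset() {
        long pendingSize = pending == null ? 0 : pending.remaining();
        return totalEncodedBytesProcessed - pendingSize;
    }

    public static void main(String[] args) {
        DecoderStateSketch s = new DecoderStateSketch();
        s.totalEncodedBytesProcessed = 5;
        s.pending = ByteBuffer.allocate(5); // bytes 0-4 awaiting a full header
        System.out.println(s.getRetryOffset()); // 0: re-request from the start
    }
}
```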
Co-authored-by: gunjansingh-msft <[email protected]>
* Fix retry offset calculation - use totalEncodedBytesProcessed to request new bytes
The root cause was using getRetryOffset() (which subtracts pending bytes) to determine what to request from the server. This caused us to request bytes we already had in the pending buffer, creating duplicates.
The correct logic:
- totalEncodedBytesProcessed tracks ALL bytes we've received from the server (including those in pending)
- pending buffer holds bytes waiting for more data to complete a structure
- When retrying, we need to request NEW bytes starting from totalEncodedBytesProcessed
- The pending buffer will be combined with new bytes in the policy's decodeStream method
Example:
1. Request bytes=0-, get byte 0, totalEncodedBytesProcessed=1, pending=[0]
2. IOException, retry requesting bytes=1- (starting from totalEncodedBytesProcessed=1)
3. Get byte 1, combine with pending [0] + [1] = [0,1], totalEncodedBytesProcessed=2
4. Still insufficient, pending=[0,1], continue...
This ensures:
- No duplicate bytes from server
- Pending buffer accumulates across retries
- Decoder receives continuous byte stream
- Smart retry resumes from correct offset
Addresses comment #2499104452.
Co-authored-by: gunjansingh-msft <[email protected]>
* Fix decoder duplicate processing by tracking consumed bytes correctly
The root cause was that after calling decoder.decode(), we were saving the ENTIRE combined buffer (pending + new) to the pending buffer, including bytes already consumed by the decoder. This caused the decoder to see duplicate segment headers on subsequent iterations, leading to "Unexpected segment number" errors.
The fix:
1. Changed decoder.decode() call from using dataToProcess.duplicate() to using dataToProcess directly
2. Track how many bytes were consumed by comparing buffer size before and after decode
3. Only save UNCONSUMED bytes to the pending buffer
4. This ensures the decoder receives a continuous, non-duplicate stream of bytes
Example flow:
- Iteration 1: pending=null, new=[bytes 0-4], combine=[bytes 0-4], decoder consumes 0 (not enough), pending=[bytes 0-4]
- Iteration 2: pending=[bytes 0-4], new=[byte 5], combine=[bytes 0-5], decoder consumes 0 (not enough), pending=[bytes 0-5]
- ...
- Iteration 9: pending=[bytes 0-11], new=[byte 12], combine=[bytes 0-12] (13 bytes), decoder consumes all 13 (header parsed!), pending=null
- Iteration 10: pending=null, new=[byte 13], decoder continues from where it left off
Addresses comments #2499104452 and #3447938815.
Co-authored-by: gunjansingh-msft <[email protected]>
* Revert to using duplicate() - decoder consumes all data or throws exception
The root cause of "Invalid segment size" was buffer position corruption from passing dataToProcess directly to the decoder. The decoder advances the buffer position, and when we saved that modified buffer to pending, we lost the initial bytes.
The fix:
1. Revert to using dataToProcess.duplicate() for decoder
2. Decoder works on a copy, doesn't modify original
3. If decoder succeeds (no exception), it consumed ALL available data → clear pending
4. If decoder throws incomplete data exception, ALL data is unconsumed → save ALL to pending
5. No partial consumption tracking needed - it's all-or-nothing
This matches the decoder's actual behavior: it either successfully processes a complete structure (header, segment, footer) or throws an exception if there's insufficient data. There's no partial consumption of a structure.
Example:
- dataToProcess = [bytes 0-12], 13 bytes total
- decoder.decode(dataToProcess.duplicate(), 13)
- Decoder reads 13-byte header successfully
- No exception → consumed all 13 bytes → pending = null ✓
- dataToProcess = [bytes 0-10], 11 bytes total
- decoder.decode(dataToProcess.duplicate(), 11)
- Decoder tries to read 13-byte header, only has 11
- Throws "not long enough" exception
- We catch it → save ALL 11 bytes to pending ✓
Addresses comment #2499104452 - fixes "Invalid segment size" error.
Co-authored-by: gunjansingh-msft <[email protected]>
* Fix decoder consumption tracking using buffer position after decode
The root cause of "Invalid segment size" was assuming decode() either consumes all data or throws an exception. But the decoder CAN partially consume data from the buffer.
The fix:
1. Pass a duplicate buffer to decoder.decode()
2. The decoder advances the duplicate's position as it reads
3. After decode(), check duplicate.position() to see how much was consumed
4. Calculate unconsumed bytes: availableSize - duplicate.position()
5. Save only unconsumed bytes to pending by positioning and slicing the original buffer
Example flow:
- dataToProcess=[bytes 0-22], position=0
- dup = dataToProcess.duplicate(), dup.position()=0
- decoder.decode(dup, 23) reads header (13 bytes)
- After decode: dup.position()=13
- consumed = 13, remaining = 23 - 13 = 10
- dataToProcess.position(13), slice() gives [bytes 13-22]
- Save [bytes 13-22] to pending ✓
Next iteration:
- pending=[bytes 13-22], new=[byte 23]
- combine=[bytes 13-23]
- dup.position()=0, decoder.decode(dup, 11)
- decoder continues from messageOffset=13, reads segment header
- dup.position()=10 after decode
- consumed=10, save the 1 remaining byte (combined index 10 = message byte 23) to pending ✓
Addresses comment #2499104452 - fixes "Invalid segment size" error.
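A self-contained sketch of this final position-based flow; the one-shot 13-byte header parse stands in for the real decoder, which loops over headers, segments, and footers:

```java
import java.nio.ByteBuffer;

final class ConsumptionTracking {
    private ByteBuffer pending; // unconsumed encoded bytes carried between buffers

    /** Stand-in for the real decoder: parses a 13-byte header when enough
        data is available, advancing the buffer position; otherwise throws. */
    private static void decode(ByteBuffer buf) {
        if (buf.remaining() < 13) {
            throw new IllegalArgumentException("Content not long enough");
        }
        buf.position(buf.position() + 13); // header consumed
    }

    /** Decode on a duplicate, read the duplicate's position to learn how
        much was consumed, and keep exactly the unconsumed tail as pending. */
    void process(ByteBuffer incoming) {
        ByteBuffer dataToProcess = combine(pending, incoming);
        int available = dataToProcess.remaining();
        ByteBuffer dup = dataToProcess.duplicate();
        try {
            decode(dup);
        } catch (IllegalArgumentException e) {
            // Incomplete data: fall through; dup.position() shows 0 bytes consumed.
        }
        int consumed = dup.position() - dataToProcess.position();
        if (consumed < available) {
            dataToProcess.position(dataToProcess.position() + consumed);
            pending = dataToProcess.slice(); // save only the unconsumed bytes
        } else {
            pending = null;                  // decoder consumed everything
        }
    }

    private static ByteBuffer combine(ByteBuffer a, ByteBuffer b) {
        if (a == null) {
            return b.duplicate();
        }
        ByteBuffer out = ByteBuffer.allocate(a.remaining() + b.remaining());
        out.put(a.duplicate());
        out.put(b.duplicate());
        out.flip();
        return out;
    }

    public static void main(String[] args) {
        ConsumptionTracking t = new ConsumptionTracking();
        t.process(ByteBuffer.allocate(23));        // header (13) consumed
        System.out.println(t.pending.remaining()); // 10 bytes carried as pending
    }
}
```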
Co-authored-by: gunjansingh-msft <[email protected]>
---------
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: gunjansingh-msft <[email protected]>

1 parent b0c3390 · commit 24a6aaf
File tree (4 files changed: +279, -55)
- sdk/storage
  - azure-storage-blob/src
    - main/java/com/azure/storage/blob/specialized
    - test/java/com/azure/storage/blob/specialized
  - azure-storage-common/src/main/java/com/azure/storage/common
    - implementation
    - policy

Lines changed: 36 additions & 19 deletions
Lines changed: 185 additions & 18 deletions
Lines changed: 4 additions & 4 deletions