| 1 | +# Spike Investigation: StreamThreadException in Bing Ads Source |
| 2 | + |
| 3 | +## Issue Summary |
| 4 | +- **Issue**: [#8301](https://github.com/airbytehq/oncall/issues/8301) - StreamThreadException in Bing Ads source |
| 5 | +- **Error**: `'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte` |
| 6 | +- **Stream**: `campaign_labels` |
| 7 | +- **Root Cause**: GZIP-compressed data being treated as UTF-8 text |
| 8 | + |
| 9 | +## Analysis |
| 10 | + |
| 11 | +### Error Context |
| 12 | +From Christo's clarification in the issue: |
| 13 | +``` |
| 14 | +Exception while syncing stream campaign_labels: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte |
| 15 | +``` |
| 16 | + |
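| 17 | +The byte `0x8b` is the second byte of the GZIP magic number (`0x1f 0x8b`). The first byte, `0x1f`, is valid ASCII, so decoding only fails at position 1. This indicates that compressed data is being passed to a UTF-8 decoder.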
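The error message can be reproduced with the standard library alone. The following is a minimal sketch, not connector code; the CSV content is a made-up placeholder for a bulk download.

```python
import gzip

# Hypothetical payload standing in for a GZIP-compressed bulk CSV download.
payload = gzip.compress("Id,Name\n1,label\n".encode("utf-8"))

# Every GZIP stream begins with the two magic bytes 0x1f 0x8b.
magic = payload[:2]

try:
    payload.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # 0x1f is a valid ASCII control character, so decoding fails at index 1.
    error_message = str(exc)

print(magic.hex())
print(error_message)
```

The captured message matches the one reported for `campaign_labels`, which supports the reading that raw compressed bytes reached the text decoder.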
| 18 | + |
| 19 | +### Technical Investigation |
| 20 | + |
| 21 | +#### 1. Bing Ads Connector Configuration |
| 22 | +- Uses `GzipDecoder` with `CsvDecoder` for bulk streams |
| 23 | +- Encoding: `utf-8-sig` |
| 24 | +- Stream: `campaign_labels` with `DownloadEntities: ["CampaignLabels"]` |
| 25 | + |
| 26 | +#### 2. Concurrent Source Framework |
| 27 | +- `StreamThreadException` wraps exceptions from concurrent processing |
| 28 | +- `CompositeRawDecoder` handles response decoding with multiple parsers |
| 29 | +- `GzipParser` decompresses GZIP data before passing to inner parsers |
| 30 | + |
| 31 | +#### 3. Root Cause Analysis |
| 32 | +The issue occurs in the concurrent source framework when: |
| 33 | +1. GZIP-compressed response is received |
| 34 | +2. Parser selection logic fails to detect GZIP content-encoding |
| 35 | +3. Compressed data (starting with the GZIP magic bytes `0x1f 0x8b`) is passed directly to the UTF-8 decoder
| 36 | +4. UTF-8 decoder fails with the observed error |
| 37 | +5. Exception is wrapped in `StreamThreadException` |
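The failure sequence above can be simulated outside the CDK. This is a sketch under stated assumptions: `select_parser`, `parse_csv`, and `parse_gzip_csv` are hypothetical helpers written for illustration and do not mirror the actual `CompositeRawDecoder` implementation. The sketch assumes that selection depends only on the `Content-Encoding` header and that the fallback treats the body as plain text.

```python
import csv
import gzip
import io
from typing import Callable, Dict, List, Mapping

Rows = List[Dict[str, str]]


def parse_csv(body: bytes, encoding: str = "utf-8-sig") -> Rows:
    """Decode bytes as text and parse them as CSV."""
    return list(csv.DictReader(io.StringIO(body.decode(encoding))))


def parse_gzip_csv(body: bytes) -> Rows:
    """Decompress first, then hand the plain bytes to the CSV parser."""
    return parse_csv(gzip.decompress(body))


def select_parser(headers: Mapping[str, str]) -> Callable[[bytes], Rows]:
    """Hypothetical header-based selection: GZIP only when the header says so."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return parse_gzip_csv
    # Fallback assumes the body is already uncompressed text.
    return parse_csv


# Made-up compressed CSV body with a UTF-8 byte order mark.
body = gzip.compress("Id,Name\n1,label\n".encode("utf-8-sig"))

# Header present: the GZIP path is chosen and parsing succeeds.
rows_ok = select_parser({"Content-Encoding": "gzip"})(body)

# Header missing: compressed bytes reach the text decoder and fail.
try:
    select_parser({})(body)
    failure = None
except UnicodeDecodeError as exc:
    failure = str(exc)
```

If the real response for `campaign_labels` arrives without a `Content-Encoding: gzip` header, a selection rule of this shape would produce exactly the observed error.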
| 38 | + |
| 39 | +## Proposed Investigation Areas |
| 40 | + |
| 41 | +### 1. Parser Selection Logic |
| 42 | +- Examine `CompositeRawDecoder._select_parser()` method |
| 43 | +- Check header-based parser selection for GZIP content |
| 44 | +- Investigate concurrent source integration with declarative decoders |
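One direction worth evaluating while examining parser selection is to sniff the GZIP magic bytes instead of relying on headers alone. The helpers below are hypothetical and operate on a fully buffered body; a streaming decoder would need to peek at the first two bytes without consuming them.

```python
import gzip

# The two leading bytes of every GZIP stream.
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(body: bytes) -> bool:
    """Return True when the body starts with the GZIP magic bytes."""
    return body[:2] == GZIP_MAGIC


def maybe_decompress(body: bytes) -> bytes:
    """Decompress only when the content itself identifies as GZIP."""
    return gzip.decompress(body) if looks_like_gzip(body) else body


compressed = gzip.compress(b"Id,Name\n1,label\n")
print(maybe_decompress(compressed))           # compressed input is inflated
print(maybe_decompress(b"Id,Name\n1,label\n"))  # plain input passes through
```

Content sniffing makes the decoder tolerant of missing or incorrect headers, at the cost of treating any body that happens to start with `0x1f 0x8b` as compressed.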
| 45 | + |
| 46 | +### 2. Error Handling |
| 47 | +- Review exception propagation in concurrent processing |
| 48 | +- Check if GZIP decompression errors are properly handled |
| 49 | +- Examine fallback mechanisms for parser failures |
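When reviewing exception propagation, the key property to verify is that the wrapper keeps the original exception reachable. The sketch below uses a hypothetical `StreamWorkerError` class as a stand-in for a wrapper such as `StreamThreadException`; it is not the CDK's class and only illustrates chaining with `raise ... from`.

```python
from concurrent.futures import ThreadPoolExecutor


class StreamWorkerError(Exception):
    """Hypothetical stand-in for a wrapper such as StreamThreadException."""

    def __init__(self, stream_name: str, cause: BaseException) -> None:
        super().__init__(f"Exception while syncing stream {stream_name}: {cause}")
        self.stream_name = stream_name


def read_stream(stream_name: str, body: bytes) -> str:
    """Decode a response body, wrapping any failure with the stream name."""
    try:
        return body.decode("utf-8")
    except Exception as exc:
        # "from exc" stores the original error on __cause__ for later inspection.
        raise StreamWorkerError(stream_name, exc) from exc


with ThreadPoolExecutor(max_workers=1) as pool:
    # The first three bytes of a GZIP header are enough to trigger the failure.
    future = pool.submit(read_stream, "campaign_labels", b"\x1f\x8b\x08")
    error = future.exception()

print(error)
print(type(error.__cause__).__name__)
```

With the cause preserved, logs can show both the stream name and the underlying `UnicodeDecodeError`, which makes decoding failures easier to tell apart from other worker errors.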
| 50 | + |
| 51 | +### 3. Integration Points |
| 52 | +- Analyze how `ConcurrentDeclarativeSource` handles bulk streams |
| 53 | +- Check if declarative decoders are properly integrated with concurrent framework |
| 54 | +- Investigate state management during concurrent processing |
| 55 | + |
| 56 | +## Next Steps |
| 57 | + |
| 58 | +1. Create test cases to reproduce the issue |
| 59 | +2. Implement parser selection improvements |
| 60 | +3. Add better error handling for GZIP decompression |
| 61 | +4. Test with the Bing Ads `campaign_labels` stream
| 62 | +5. Validate that the fix doesn't break other connectors
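As a starting point for the first step, a regression test could pin down the expected decode order (decompress, then decode as `utf-8-sig`, then parse CSV). The sketch below is pytest-style but standalone; `decode_bulk_report` is a hypothetical helper and the column names are placeholders, not the real bulk file schema.

```python
import csv
import gzip
import io
from typing import Dict, List


def decode_bulk_report(body: bytes, encoding: str = "utf-8-sig") -> List[Dict[str, str]]:
    """Hypothetical helper: decompress, decode, then parse as CSV."""
    text = gzip.decompress(body).decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


def make_body() -> bytes:
    """Build a made-up GZIP-compressed CSV that starts with a UTF-8 BOM."""
    return gzip.compress("Id,Name\n1,label\n".encode("utf-8-sig"))


def test_utf8_sig_strips_bom_from_first_header() -> None:
    rows = decode_bulk_report(make_body())
    assert list(rows[0].keys()) == ["Id", "Name"]


def test_plain_utf8_keeps_bom_in_first_header() -> None:
    rows = decode_bulk_report(make_body(), encoding="utf-8")
    assert list(rows[0].keys()) == ["\ufeffId", "Name"]


if __name__ == "__main__":
    test_utf8_sig_strips_bom_from_first_header()
    test_plain_utf8_keeps_bom_in_first_header()
```

The second test documents why the connector's `utf-8-sig` setting matters: with plain `utf-8`, the byte order mark ends up inside the first column name.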
| 63 | + |
| 64 | +## Files to Investigate |
| 65 | +- `airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py` |
| 66 | +- `airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py` |
| 67 | +- `airbyte_cdk/sources/declarative/concurrent_declarative_source.py` |
| 68 | +- Bing Ads manifest configuration for bulk streams |