-
Notifications
You must be signed in to change notification settings - Fork 32
fix(OffsetIncrement Pagination Strategy): Fix bug where streams that use record filtering do not paginate correctly #484
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(OffsetIncrement Pagination Strategy): Fix bug where streams that use record filtering do not paginate correctly #484
Conversation
…ctly get page record size
📝 WalkthroughWalkthroughThe changes introduce the ability for pagination strategies, specifically the Changes
Sequence Diagram(s)sequenceDiagram
participant Factory as ModelToComponentFactory
participant ExtractorModel
participant Paginator
participant OffsetIncrement
participant Extractor
Factory->>Paginator: create_default_paginator(extractor_model)
Paginator->>OffsetIncrement: create_offset_increment(extractor_model)
OffsetIncrement->>Extractor: instantiate Extractor from extractor_model
OffsetIncrement->>OffsetIncrement: next_page_token(response)
OffsetIncrement->>Extractor: extract records from response
Extractor-->>OffsetIncrement: return records
OffsetIncrement-->>Paginator: determine next page token
Suggested labels
Suggested reviewers
Would you like to consider adding more tests for edge cases where the extractor returns unexpected types or empty lists, just to further solidify the new logic—wdyt? Tip ⚡💬 Agentic Chat (Pro Plan, General Availability)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2599-2610: Thoughtful implementation with clear explanationThe detailed comment explains the design decision to create separate extractors with identical behavior. This maintains existing behavior while enabling the fix.
Would a future refactoring to reuse the same component (rather than create identical ones) be beneficial for maintenance, wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py(5 hunks)airbyte_cdk/sources/declarative/requesters/paginators/strategies/offset_increment.py(3 hunks)unit_tests/sources/declarative/parsers/test_model_to_component_factory.py(4 hunks)unit_tests/sources/declarative/requesters/paginators/test_default_paginator.py(5 hunks)unit_tests/sources/declarative/requesters/paginators/test_offset_increment.py(3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
airbyte_cdk/sources/declarative/requesters/paginators/strategies/offset_increment.py (3)
airbyte_cdk/sources/declarative/extractors/record_extractor.py (2)
RecordExtractor(12-27)extract_records(18-27)unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (1)
extract_records(3612-3616)airbyte_cdk/sources/declarative/extractors/dpath_extractor.py (1)
extract_records(71-86)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Analyze (python)
🔇 Additional comments (21)
airbyte_cdk/sources/declarative/requesters/paginators/strategies/offset_increment.py (3)
15-15: Import looks good.The added import for RecordExtractor from dpath_extractor matches the new functionality being introduced.
50-50: Nice addition of the optional extractor parameter.This parameter will allow the OffsetIncrement strategy to get record counts directly from responses. Good call making it optional for backward compatibility, wdyt?
80-86: Good implementation of record extraction.This implementation correctly uses the extractor (when provided) to get the raw count of records from the response before any filtering occurs. I like how you've made the page_size_from_response fallback to the incoming last_page_size to handle edge cases. This should fix the issue with pagination when record filtering is applied.
unit_tests/sources/declarative/parsers/test_model_to_component_factory.py (3)
69-69: Good addition of the DpathExtractorModel import.This import is needed for the updated test cases that now use extractors with pagination strategies.
1835-1835: Test properly updated with extractor model.The test_create_default_paginator function now correctly includes an extractor_model parameter with a field path of ["results"], which aligns with the new functionality.
2648-2670: Great test update for OffsetIncrement with extractor.This test is thoroughly updated to validate the new extractor functionality:
- Created the expected_extractor with the correct field_path
- Created the extractor_model to pass to the factory
- Added the extractor to the expected_strategy
- Updated the factory.create_offset_increment call to include the extractor_model
- Added assertions to verify the extractor is correctly set
This gives good coverage for the new functionality, especially with the field_path assertions.
unit_tests/sources/declarative/requesters/paginators/test_default_paginator.py (5)
12-12: Added DpathExtractor import.This import is necessary for the updated tests that now use extractor instances.
331-337: Test updated for OffsetIncrement with extractor.The test_initial_token_with_offset_pagination test now correctly includes a DpathExtractor when creating the OffsetIncrement strategy. The empty field_path should be fine for this test since it's just validating the initial token behavior.
358-364: Good test update with realistic field_path.The parameterized test now includes a DpathExtractor with field_path=["results"], which matches the realistic test response data added further down. This ensures we're properly testing the extractor functionality in pagination.
389-402: Nice addition of realistic test data.You've updated the test_no_inject_on_first_request_offset_pagination test to include a realistic JSON response with a "results" array containing 10 records. This matches the field_path in the extractor and the expected page size, making this test much more representative of real-world usage. Great improvement!
459-464: Updated test for OffsetIncrement with extractor.The test_paginator_with_page_option_no_page_size test now correctly includes a DpathExtractor when creating the OffsetIncrement strategy. The empty field_path is fine for this edge case test that verifies error handling for missing page size.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)
2057-2058: Good addition of extractor_model parameterThis parameter allows the
OffsetIncrementstrategy to use the same extractor as the record selector, which is essential for accurate pagination when record filtering is involved.
2579-2585: Nice refactoring to incorporate extractor-based paginationThe addition of the extractor_model parameter here enables the pagination strategy to access the raw API response before filtering, addressing the core issue in the PR.
2612-2617: Key fix: adding extractor to OffsetIncrementThis is the critical change that enables the pagination strategy to use the extracted records before filtering to determine if more pages remain.
2980-2981: Important connection pointThis change connects the dots by passing the record selector's extractor model to the paginator creation, ensuring consistent extraction logic throughout the pipeline.
unit_tests/sources/declarative/requesters/paginators/test_offset_increment.py (6)
11-11: Appropriate import for the DpathExtractorThe import aligns with the new functionality being tested.
17-19: Test signature updated to include response_resultsGood update to the test parameters to include the actual response results, which are now crucial for pagination decisions.
72-75: Key test update: using DpathExtractorThe tests now correctly instantiate the OffsetIncrement with a DpathExtractor that looks for records in the "results" path, matching real-world API responses.
80-84: More realistic test data structureThis change creates a more realistic API response structure with results nested under a "results" key, better representing real-world scenarios.
99-115: Excellent additional test caseThis new test verifies the paginator's behavior when the response lacks the expected record path, ensuring robust error handling.
139-146: Consistent test updatesThe initial token test has been updated to maintain consistency with the new constructor signature, using an empty string field path.
What
While migrating
source-hubspot@tolik0 identified a bug in how we perform OffsetIncrement pagination. When we perform record filtering, we can potentially end up with fewer records than the amount returned in the API response. The paginator then falsely thinks the paging is complete because received records is smaller than the max page size.How
This PR fixes the said problem above by utilizing the existing
RecordExtractormodel defined by the developer and utilizes it in theOffsetIncrementpagination strategy. While determine if there are more pages to read innext_page_token()we have the extractor parse theresponseparameter (which will not perform filtering) to get the accurate count.We opted for this solution because reusing an extractor is more accurate than recreating the dpath parsing behavior.
Summary by CodeRabbit
New Features
Bug Fixes
Tests