fix(source-greenhouse): remove custom cursors#54702
fix(source-greenhouse): remove custom cursors#54702Daryna Ishchenko (darynaishchenko) merged 14 commits intomasterfrom
Conversation
|
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
Summary
🧪 3 passed🟢 Test Sub-stream State Migration for source-greenhouse 🟢 Test Legacy State Migration for Greenhouse Connector - Jira Issue #11527 🟢 Test Millisecond Precision Datetime Format for DatetimeBasedCursor Implementation Do you want to test every PR like this? 👉 make Fume a regular reviewer at Airbyte |
Regression tests:test_catalog_are_the_same[failed]is_resumable now true(due to cdk update) TestDataIntegrity.test_record_schema_match_with_state [failed]Null fields are not in records(due to cdk update) Record count mismatchsome inctemental streams have more records in target version, data could be added between reads, we should be fine with this. |
There was a problem hiding this comment.
The code changes look good to me, granted I'm still not fully an expert in the area. Given the CDK change landed, I'd love to merge this, do a single regression test (if we haven't already) and start a progressive rollout.
The only ask I have is to bump pytest to 8, deps hygiene.
Daryna Ishchenko (@darynaishchenko) take this away, let's get this shipped.
airbyte-integrations/connectors/source-greenhouse/unit_tests/conftest.py
Outdated
Show resolved
Hide resolved
| def run(): | ||
| source = SourceGreenhouse() | ||
| launch(source, sys.argv[1:]) | ||
| def _get_source(args: List[str]): |
There was a problem hiding this comment.
Nit, non-blocking: I think I've seen conversations about our approach to run.py — this seems like boilerplate, feels off. Is there a reason we have to have this code here, as opposed to having this inside a generic source init?
There was a problem hiding this comment.
Probably cc Maxime Carbonneau-Leclerc (@maxi297) and Brian Lai (@brianjlai)
There was a problem hiding this comment.
Natik Gadzhi (@natikgadzhi) I agree that we should probably want to ideally have this generified (is that a word?) into the CDK implementation, but the issue i think when we first build this was that run.py isn't even a thing in the CDK itself. So we were left with a bit of a manual process.
And while this is annoying, ultimately all these individual run.py definitions should ultimately get removed as we move to manifest-only connectors where all of this is already addressed within source-declarative-manifest https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-declarative-manifest/source_declarative_manifest/run.py#L152-L160.
I think unless we find free cycles to make this easier in the CDK, I don't think this is a super high prioirty because it will ultimately be throwaway work once we treat everything as manifest-only.
airbyte-integrations/connectors/source-greenhouse/pyproject.toml
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-greenhouse/pyproject.toml
Outdated
Show resolved
Hide resolved
Brian Lai (brianjlai)
left a comment
There was a problem hiding this comment.
overall looks good. if you have a chance can you rebase this w/ the latest master and re-run regression tests and I can do a final approval
| def is_greater_than_or_equal(self, first: Record, second: Record) -> bool: | ||
| LegacyToPerPartitionStateMigration migrates partition keys as string, while real type of id in greenhouse is integer, | ||
| which leads to partition mismatch. | ||
| To prevent this type casting for partition key was added. |
There was a problem hiding this comment.
thanks for adding the comment for why we needed the custom component
Can you explain to me a little more about why this is needed. I think the explanation mostly makes sense, but I noticed that in the previous state format, it was stored as a string being the key, followed by that partition's state value. Why didn't run into the state key string vs. greenhouse id string in the old state format when we were using strings?
Not strictly a blocker, but it just seems interesting that we now have to convert state to integers when strings used to work
There was a problem hiding this comment.
It was working before because custom StreamSlicer overrides stream_slices method:
so it has right partition where parent primary key(id) is integer: here
and converts parent primary key to string to access the cursor value from state: here
There was a problem hiding this comment.
that makes things more clear. I missed the part where the old custom component was casting. thank you
| name: "applications" | ||
| path: "applications" | ||
| cursor_field: "applied_at" | ||
| cursor_request_option: "created_after" |
There was a problem hiding this comment.
small suggestion, instead of adding these two arbitrary parameters that we then need to account for using field_name: "{{ parameters.get('cursor_request_option', 'updated_after'), I would suggest that we instead just define a new:
incremental_sync:
type: DatetimeBasedCursor
...
start_time_option:
type: RequestOption
inject_into: request_parameter
field_name: "created_after"
I think that reads more easily than the extra parameter injection because at first I thought it was named wrong as cursor_request_option. wdyt?
There was a problem hiding this comment.
removed cursor_request_option and cursor_field from parameters. updated incremental_sync to use $ref to basic incremental_sync instead and overrided start_time_option and cursor_field
Brian Lai (brianjlai)
left a comment
There was a problem hiding this comment.
just a few small suggestions and tweaks, but after those are fixed, this should be ready to go. approved ✅
| def is_greater_than_or_equal(self, first: Record, second: Record) -> bool: | ||
| LegacyToPerPartitionStateMigration migrates partition keys as string, while real type of id in greenhouse is integer, | ||
| which leads to partition mismatch. | ||
| To prevent this type casting for partition key was added. |
There was a problem hiding this comment.
that makes things more clear. I missed the part where the old custom component was casting. thank you
airbyte-integrations/connectors/source-greenhouse/source_greenhouse/manifest.yaml
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-greenhouse/source_greenhouse/manifest.yaml
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-greenhouse/source_greenhouse/manifest.yaml
Outdated
Show resolved
Hide resolved
Brian Lai (brianjlai)
left a comment
There was a problem hiding this comment.
just a few small suggestions and tweaks, but after those are fixed, this should be ready to go. approved ✅
| def is_greater_than_or_equal(self, first: Record, second: Record) -> bool: | ||
| LegacyToPerPartitionStateMigration migrates partition keys as string, while real type of id in greenhouse is integer, | ||
| which leads to partition mismatch. | ||
| To prevent this type casting for partition key was added. |
There was a problem hiding this comment.
that makes things more clear. I missed the part where the old custom component was casting. thank you
What
https://github.com/airbytehq/airbyte-internal-issues/issues/11527
How
Based on cdk changes in airbytehq/airbyte-python-cdk#369 we now able to use
%_msas milliseconds identifier.Removed custom cursors and added incremental sync based on
DatetimeBasedCursor.State migrations from legacy to per partition were added to make state change non-breaking.
Review guide
User Impact
Can this PR be safely reverted and rolled back?