feat(cdk): add option to make empty cells None when parsing csv with CsvParser#575
Conversation
📝 WalkthroughWalkthroughA new boolean property, Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CsvDecoder
participant CsvParser
User->>CsvDecoder: Configure set_empty_cell_to_none (True/False)
CsvDecoder->>CsvParser: Instantiate with set_empty_cell_to_none
CsvParser->>CsvParser: Parse CSV rows
alt set_empty_cell_to_none is True
CsvParser->>CsvParser: Convert empty strings to None
else set_empty_cell_to_none is False
CsvParser->>CsvParser: Keep empty strings as is
end
CsvParser-->>User: Yield parsed rows
Would you like me to create a similar sequence diagram illustrating the test flow as well, or does this cover your needs? Wdyt? ✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)
54-62:⚠️ Potential issueFix the CSV generation logic to include the gender field.
When
add_empty_strings=True, thegenderfield is added to the data but not to thefieldnamesin the CSV writer, causing aValueErrorduring CSV writing. Would you like to update the fieldnames dynamically? wdyt?- output = StringIO() - writer = csv.DictWriter(output, fieldnames=["id", "name", "age"], delimiter=delimiter) - writer.writeheader() - for row in data: - writer.writerow(row) + output = StringIO() + fieldnames = ["id", "name", "age"] + if add_empty_strings: + fieldnames.append("gender") + writer = csv.DictWriter(output, fieldnames=fieldnames, delimiter=delimiter) + writer.writeheader() + for row in data: + writer.writerow(row)🧰 Tools
🪛 GitHub Actions: Pytest (Fast)
[error] 62-62: ValueError: dict contains fields not in fieldnames: 'gender' during CSV writing in test_composite_raw_decoder_parse_empty_strings
🧹 Nitpick comments (3)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
3633-3635: Add description forset_empty_cell_to_noneand update CsvDecoder description
I noticed the new flag doesn’t have its owndescriptionand isn’t mentioned in the top-levelCsvDecoderdescription. Could we add both for clarity? wdyt?Example diff:
@@ -3620,7 +3620,8 @@ description: "Select 'CSV' for response data that is formatted as CSV (comma-separated values). Can specify an encoding (default: 'utf-8') and a delimiter (default: ',')." - type: object + description: "Select 'CSV' for response data that is formatted as CSV (comma-separated values). Can specify an encoding (default: 'utf-8'), a delimiter (default: ','), and optionally treat empty cells as None when `set_empty_cell_to_none` is enabled." + type: object required: - type @@ -3632,3 +3632,6 @@ delimiter: type: string default: "," + set_empty_cell_to_none: + type: boolean + description: Interpret empty CSV cells as null values instead of empty strings. + default: falseairbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2649-2653: Great integration ofset_empty_cell_to_noneflag!The new boolean is correctly passed from the
CsvDecoderModeltoCsvParser, enabling the option to treat empty CSV cells asNone.Would you consider adding a brief docstring or inline comment above this branch in
_get_parserto highlight the new behavior for future maintainers? wdyt?unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)
274-274: Consider adding timeout to requests.get call.Static analysis flagged that the
requests.getcall is missing a timeout argument, which could cause the test to hang. Would you like to add a reasonable timeout for test stability? wdyt?- response = requests.get("https://airbyte.io/", stream=True) + response = requests.get("https://airbyte.io/", stream=True, timeout=10)🧰 Tools
🪛 Pylint (3.3.7)
[warning] 274-274: Missing timeout argument for method 'requests.get' can cause your program to hang indefinitely
(W3101)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml(1 hunks)airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py(2 hunks)airbyte_cdk/sources/declarative/models/declarative_component_schema.py(1 hunks)airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py(1 hunks)unit_tests/sources/declarative/decoders/test_composite_decoder.py(2 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
unit_tests/sources/declarative/decoders/test_composite_decoder.py
[convention] 268-268: Missing function or method docstring
(C0116)
[warning] 274-274: Missing timeout argument for method 'requests.get' can cause your program to hang indefinitely
(W3101)
[error] 285-285: Undefined variable 'expected_recordp'
(E0602)
[warning] 284-284: Unused variable 'expected_record'
(W0612)
🪛 GitHub Actions: Pytest (Fast)
unit_tests/sources/declarative/decoders/test_composite_decoder.py
[error] 62-62: ValueError: dict contains fields not in fieldnames: 'gender' during CSV writing in test_composite_raw_decoder_parse_empty_strings
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
1386-1386:Details
✅ Verification successful
Ensure persistence of generated models via YAML schema update
Since this file is generated bydatamodel-codegen, manual edits here will be overwritten on the next generation. Have we updated the sourcedeclarative_component_schema.yamlto include the newset_empty_cell_to_noneproperty and re-run the generator? wdyt?
🏁 Script executed:
#!/bin/bash set -eo pipefail echo "=== Python file header (first 20 lines) ===" head -n 20 airbyte_cdk/sources/declarative/models/declarative_component_schema.py || true echo echo "=== Locate YAML schema files named declarative_component_schema.yaml/yml ===" find airbyte_cdk -type f \( -iname "declarative_component_schema.yaml" -o -iname "declarative_component_schema.yml" \) || echo "No declarative_component_schema YAML files found" echo echo "=== Searching for 'set_empty_cell_to_none' in all YAML files ===" rg -n "set_empty_cell_to_none" -g "*.yaml" -g "*.yml" || echo "Property 'set_empty_cell_to_none' not found in any YAML schemas"Length of output: 1313
🏁 Script executed:
#!/bin/bash set -eo pipefail echo "=== Context around 'set_empty_cell_to_none' in YAML schema ===" rg -n -C5 "set_empty_cell_to_none" airbyte_cdk/sources/declarative/declarative_component_schema.yamlLength of output: 596
No further action needed – schema already updated
Thedeclarative_component_schema.yaml(around line 3633) already definesset_empty_cell_to_none, so regenerating the models will persist this change.airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2)
106-106: LGTM! Good backward compatibility consideration.The new
set_empty_cell_to_noneattribute is well-designed with a sensible default value ofFalseto maintain backward compatibility. Nice work!
125-127: Clean and efficient implementation!The conditional logic to convert empty strings to
Noneis well-implemented using a dictionary comprehension. The approach is both readable and efficient. The logic correctly preserves all non-empty values while converting only empty strings toNonewhen the flag is enabled.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)
269-291: Great test coverage for the new feature! Could you address a couple of minor improvements?The test logic is well-structured and correctly validates both behaviors of the
set_empty_cell_to_noneflag. I noticed the syntax error from the previous review has been fixed - nice work!However, there are a couple of improvements that would be helpful:
Could you add a docstring explaining what this test does? Something like documenting that it tests whether empty CSV cells are correctly handled based on the flag, wdyt?
The static analysis tool suggests adding a timeout to the
requests.getcall to prevent hanging - would you mind adding that for robustness?@pytest.mark.parametrize("set_empty_cell_to_none", [True, False]) def test_composite_raw_decoder_parse_empty_strings(requests_mock, set_empty_cell_to_none: bool): + """Test that empty CSV cells are correctly handled based on the set_empty_cell_to_none flag. + + When set_empty_cell_to_none is True, empty cells should be parsed as None. + When False, they should remain as empty strings. + """ requests_mock.register_uri( "GET", "https://airbyte.io/", content=generate_csv(should_compress=False, add_empty_strings=True), ) - response = requests.get("https://airbyte.io/", stream=True) + response = requests.get("https://airbyte.io/", stream=True, timeout=30)🧰 Tools
🪛 Pylint (3.3.7)
[convention] 270-270: Missing function or method docstring
(C0116)
[warning] 276-276: Missing timeout argument for method 'requests.get' can cause your program to hang indefinitely
(W3101)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
unit_tests/sources/declarative/decoders/test_composite_decoder.py(2 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
unit_tests/sources/declarative/decoders/test_composite_decoder.py
[convention] 270-270: Missing function or method docstring
(C0116)
[warning] 276-276: Missing timeout argument for method 'requests.get' can cause your program to hang indefinitely
(W3101)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: Build Python Package
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
🔇 Additional comments (1)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)
44-61: Nice implementation of the CSV generation enhancement!The addition of the
add_empty_stringsparameter and its implementation looks solid. The logic correctly adds empty "gender" fields to the data and updates the fieldnames accordingly. Good use of a sensible default to maintain backward compatibility.
Christo Grabowski (ChristoGrab)
left a comment
There was a problem hiding this comment.
![]()
Option to default values to None is an empty cell, when migrating to manifest and validating records, it is just easier to work with
Rather:
Resolves https://github.com/airbytehq/airbyte-internal-issues/issues/13199
Summary by CodeRabbit