feat(csv-parser): update option to provide an array of values that should be set to None in the csv file#581
Conversation
📝 WalkthroughWalkthroughThe changes replace the boolean Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CsvDecoder
participant CsvParser
User->>CsvDecoder: Configure with set_values_to_none = ["", "--"]
CsvDecoder->>CsvParser: Instantiate with set_values_to_none
CsvParser->>CsvParser: Parse CSV rows
CsvParser->>CsvParser: For each cell, if value in set_values_to_none, set to None
CsvParser-->>User: Return parsed data with specified values as None
Possibly related PRs
Suggested labels
Suggested reviewers
Would you like to add a migration note or documentation update for users upgrading from the boolean to the new list-based configuration, wdyt? ✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
3633-3636: Add description and default for the newset_values_to_noneproperty
The newset_values_to_noneentry currently only defines its type. Could we include adescription(e.g. “List of string values that should be treated as null when parsing CSV cells”) and a sensible default (likedefault: []) to guide users? wdyt?airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (1)
125-126: Logic looks correct, but consider a performance optimization?The implementation correctly converts any cell value found in
set_values_to_noneto None. For better performance with larger lists, wdyt about converting the list to a set for O(1) lookups?def parse(self, data: BufferedIOBase) -> PARSER_OUTPUT_TYPE: """ Parse CSV data from decompressed bytes. """ text_data = TextIOWrapper(data, encoding=self.encoding) # type: ignore reader = csv.DictReader(text_data, delimiter=self._get_delimiter() or ",") + values_to_none_set = set(self.set_values_to_none) if self.set_values_to_none else None for row in reader: - if self.set_values_to_none: - row = {k: (None if v in self.set_values_to_none else v) for k, v in row.items()} + if values_to_none_set: + row = {k: (None if v in values_to_none_set else v) for k, v in row.items()} yield row
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml(1 hunks)airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py(3 hunks)airbyte_cdk/sources/declarative/models/declarative_component_schema.py(1 hunks)airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py(1 hunks)unit_tests/sources/declarative/decoders/test_composite_decoder.py(4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (1)
CsvParser(102-127)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Analyze (python)
🔇 Additional comments (11)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2651-2656: Align parser instantiation with updated model field
TheCsvParsercall now usesset_values_to_noneas expected to reflect the updated CSV decoder configuration. wdyt?airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
1382-1387: LGTM! Nice generalization from boolean to list of values.The addition of
set_values_to_noneas anOptional[List[str]]is a solid improvement that provides much more flexibility than the previous boolean flag. The type annotation and default value are spot-on.Just curious - might it be worth adding a docstring or field description to help users understand what string values they can specify here? Something like examples of common null representations ("", "NULL", "N/A", etc.)? wdyt?
🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 1382-1382: Too few public methods (0/2)
(R0903)
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2)
12-12: LGTM!The import addition supports the new
Optional[List[str]]type annotation.
106-106: Nice enhancement from boolean to list of strings!This change aligns perfectly with the PR objective to allow multiple values to be converted to None, not just empty strings. The type annotation is correct and the default value is appropriate.
unit_tests/sources/declarative/decoders/test_composite_decoder.py (7)
11-11: LGTM!The import addition supports the new type annotations in the test.
47-48: Great refactor of the helper function!Replacing the boolean
add_empty_stringswithadd_extra_columnandextra_column_valuemakes the function much more flexible and better aligned with the new functionality.
56-58: Logic update looks correct!The implementation properly uses the new parameters to add the specified value to the extra column.
270-273: Excellent test coverage for the new feature!The parametrization covers the key scenarios: None (no conversion), empty string, and a custom value. The type annotation is also correct.
277-281: Smart test data generation!Using
set_values_to_none[0]when available and falling back to "random_value" ensures the test data matches the expected conversion behavior.
285-285: Correct parser instantiation!The update to use
set_values_to_noneproperly tests the new parameter.
294-294: Test expectation logic is spot on!The conditional logic correctly sets the expected gender value to None when
set_values_to_noneis provided, matching the parser's behavior.
Summary by CodeRabbit
New Features
Tests