feat: Support reading CSV files with inconsistent column counts (#17553)
* feat: Support CSV files with inconsistent column counts
Enable DataFusion to read directories containing CSV files with different
numbers of columns by implementing schema union during inference.
Changes:
- Modified CSV schema inference to create union schema from all files
- Extended infer_schema_from_stream to handle varying column counts
- Added tests for schema building logic and integration scenarios
Requires CsvReadOptions::new().truncated_rows(true) to handle files
with fewer columns than the inferred schema.
Fixes #17516
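The schema union described above can be pictured with a minimal std-only sketch. It assumes columns are matched by position, so a later file only contributes names for positions the union does not cover yet. The helper name `union_column_names` is hypothetical and is not part of the DataFusion API.

```rust
/// Hypothetical helper (not DataFusion API): build a positional union of
/// column names from the headers of several CSV files.
fn union_column_names(headers: &[Vec<&str>]) -> Vec<String> {
    let mut union: Vec<String> = Vec::new();
    for header in headers {
        // Positions already covered keep their first-seen name; only
        // columns beyond the current width are appended.
        for name in header.iter().skip(union.len()) {
            union.push(name.to_string());
        }
    }
    union
}

fn main() {
    let headers = vec![
        vec!["id", "name"],
        vec!["id", "name", "email"],
        vec!["id"],
    ];
    let union = union_column_names(&headers);
    // The widest file determines the final column list.
    assert_eq!(union, vec!["id", "name", "email"]);
    println!("{:?}", union);
}
```

Files narrower than the union add nothing here. Reading their rows is the part that needs truncated_rows(true).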
* refactor: Address review comments for CSV union schema feature
Addresses all review feedback from PR #17553 to improve the CSV schema
union implementation that allows reading CSV files with different column counts.
Changes based on review:
- Moved unit tests from separate tests.rs to bottom of file_format.rs
- Updated documentation wording from "now supports" to "can handle"
- Removed all println statements from integration test
- Added comprehensive assertions for actual row content verification
- Simplified HashSet initialization using HashSet::from([...]) syntax
- Updated truncated_rows config documentation to reflect expanded purpose
- Removed unnecessary min() calculation in column processing loop
- Fixed clippy warnings by using enumerate() instead of range loop
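Two of the review items above are small idiom changes. The sketch below shows them on made-up data: HashSet::from([...]) in place of creating an empty set and inserting into it, and enumerate() in place of indexing through a range. The function names are hypothetical and chosen only for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical helper: one single-element set per column, built with
/// `HashSet::from([...])` rather than `HashSet::new()` plus `insert`.
fn initial_possibilities(types: &[&str]) -> Vec<HashSet<String>> {
    types
        .iter()
        .map(|ty| HashSet::from([ty.to_string()]))
        .collect()
}

/// Hypothetical helper: label each column with its index, using
/// `enumerate()` instead of `for i in 0..types.len()`.
fn label_columns(types: &[&str]) -> Vec<String> {
    let mut labels = Vec::with_capacity(types.len());
    for (i, ty) in types.iter().enumerate() {
        labels.push(format!("col {i}: {ty}"));
    }
    labels
}

fn main() {
    let types = ["Int64", "Utf8"];
    let sets = initial_possibilities(&types);
    assert_eq!(sets.len(), 2);
    assert!(sets[0].contains("Int64"));
    assert_eq!(label_columns(&types), vec!["col 0: Int64", "col 1: Utf8"]);
}
```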
Technical improvements:
- Tests now verify null patterns correctly across union schema
- Cleaner iteration logic without redundant bounds checking
- Better documentation explaining union schema behavior
The feature continues to work as designed:
- Creates union schema from all CSV files in a directory
- Files with fewer columns have nulls for missing fields
- Requires explicit opt-in via truncated_rows(true)
- Maintains full backward compatibility
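The null-filling behavior listed above can be sketched without any DataFusion types. This is an illustration under assumptions, not the reader's actual code: `pad_row` is a hypothetical helper, rows are plain strings, and the `truncated_rows` parameter stands in for the opt-in flag.

```rust
/// Hypothetical helper (not DataFusion API): widen a parsed row to the
/// union schema width, filling missing trailing fields with `None`.
fn pad_row(
    row: Vec<String>,
    width: usize,
    truncated_rows: bool,
) -> Result<Vec<Option<String>>, String> {
    if row.len() > width {
        return Err(format!("expected {width} columns, found {}", row.len()));
    }
    // Short rows are only accepted when the caller opted in.
    if row.len() < width && !truncated_rows {
        return Err(format!("expected {width} columns, found {}", row.len()));
    }
    let mut padded: Vec<Option<String>> = row.into_iter().map(Some).collect();
    padded.resize(width, None);
    Ok(padded)
}

fn main() {
    let short = vec!["1".to_string(), "alice".to_string()];
    let padded = pad_row(short.clone(), 3, true).unwrap();
    assert_eq!(
        padded,
        vec![Some("1".to_string()), Some("alice".to_string()), None]
    );
    // Without the opt-in, the same row is rejected.
    assert!(pad_row(short, 3, false).is_err());
}
```

Rejecting short rows by default is what keeps the feature backward compatible in this sketch: callers who never set the flag still get an error for a short row.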
* Apply cargo fmt formatting fixes
* refactor: Address PR review comments for CSV union schema feature
- Remove pub(crate) visibility from build_schema_helper function
- Refactor column type processing to use zip iterator before extension logic
- Add missing error handling for truncated_rows=false case
- Improve truncated_rows documentation to clarify dual purpose
- Replace manual testing with assert_snapshot for better test coverage
- Fix clippy warnings and ensure all tests pass
Addresses all reviewer feedback from PR #17553 while maintaining
backward compatibility and the original CSV union schema functionality.
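As a rough illustration of the "zip before extension" ordering and the truncated_rows=false error path, the following std-only sketch merges per-file column types into one set of possibilities per column. `ColType` and `union_type_possibilities` are hypothetical stand-ins. The real code works on Arrow fields, and its exact shape is not shown in this message.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an Arrow data type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ColType {
    Int,
    Float,
    Text,
}

/// Hypothetical helper: collect the types seen at each column position
/// across files. Differing column counts are an error unless
/// `truncated_rows` is set.
fn union_type_possibilities(
    files: &[Vec<ColType>],
    truncated_rows: bool,
) -> Result<Vec<HashSet<ColType>>, String> {
    let mut possibilities: Vec<HashSet<ColType>> = Vec::new();
    for (file_idx, types) in files.iter().enumerate() {
        let mismatch = file_idx > 0 && types.len() != possibilities.len();
        if mismatch && !truncated_rows {
            return Err(format!(
                "file {file_idx}: expected {} columns, found {}",
                possibilities.len(),
                types.len()
            ));
        }
        // Step 1: update columns both sides share; zip stops at the shorter.
        for (seen, ty) in possibilities.iter_mut().zip(types) {
            seen.insert(*ty);
        }
        // Step 2: extend the schema with columns only this file has.
        let known = possibilities.len();
        for ty in types.iter().skip(known) {
            possibilities.push(HashSet::from([*ty]));
        }
    }
    Ok(possibilities)
}

fn main() {
    let files = vec![
        vec![ColType::Int, ColType::Int],
        vec![ColType::Int, ColType::Float, ColType::Text],
    ];
    let merged = union_type_possibilities(&files, true).unwrap();
    assert_eq!(merged.len(), 3);
    assert_eq!(merged[1], HashSet::from([ColType::Int, ColType::Float]));
    assert_eq!(merged[2], HashSet::from([ColType::Text]));
    // The same input fails when the opt-in flag is off.
    assert!(union_type_possibilities(&files, false).is_err());
}
```

Because zip stops at the shorter side, the shared columns need no explicit bounds check, and the extension step only runs for files wider than the schema seen so far.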