
Support non-UTF-8 encoded CSV files #20626

Open
Rafferty97 wants to merge 11 commits into apache:main from Rafferty97:non-utf8-csv2


Conversation

@Rafferty97
Contributor

Which issue does this PR close?

Closes #20473

Rationale for this change

CSV is a ubiquitous file format, and many CSV files are encoded in Windows-1252 or other non-UTF-8 encodings. It would be useful to have the option to read them in DataFusion.

What changes are included in this PR?

  • Adds a configuration option to the CSV reader to specify an encoding
  • Adds an optional dependency on encoding_rs to do the actual decoding
  • Refactors CsvSource somewhat to aid the implementation
  • Removes the return value from DecoderDeserializer::digest, as it was misleading (call sites were ignoring it)
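
To illustrate the general idea behind the decoding step (transcode the raw bytes to UTF-8, then parse as CSV), here is a minimal std-only sketch. It is not the code in this PR: the PR delegates decoding to encoding_rs, whereas this sketch hand-rolls ISO-8859-1 (Latin-1), which is close to but not the same as Windows-1252 (they differ in the 0x80 to 0x9F range). The function names `decode_latin1` and `read_latin1_csv` are hypothetical and exist only for this example.

```rust
// Illustrative only: this does not use DataFusion or encoding_rs.
// It shows why a non-UTF-8 file must be transcoded before CSV parsing.

/// Decode ISO-8859-1 (Latin-1) bytes into a UTF-8 `String`.
/// Every Latin-1 byte maps to the Unicode code point with the same value,
/// so `u8 as char` is an exact conversion for this one encoding.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Hypothetical helper: transcode first, then split into records and fields.
/// The splitting is deliberately naive (no quoting or escaping support).
fn read_latin1_csv(bytes: &[u8]) -> Vec<Vec<String>> {
    decode_latin1(bytes)
        .lines()
        .map(|line| line.split(',').map(str::to_owned).collect())
        .collect()
}
```

Calling `read_latin1_csv(b"name,city\nJos\xE9,M\xE1laga\n")` yields two records, while passing the same bytes to `String::from_utf8` fails, which is the situation a UTF-8-only reader runs into. Multi-byte encodings such as Shift-JIS need a real decoder, which is presumably why an existing crate is the more practical choice.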

Are these changes tested?

I have added one unit test that attempts to read a SHIFT-JIS encoded CSV file. More tests are probably needed, but I may need some guidance on this. I'm also running into issues getting the test suite to run locally on my Windows machine.

Are there any user-facing changes?

Adds a new field to CsvOptions.

@github-actions bot added the core, common, proto, and datasource labels on Mar 1, 2026
@github-actions bot added the documentation label on Mar 2, 2026

Labels

  • common: Related to common crate
  • core: Core DataFusion crate
  • datasource: Changes to the datasource crate
  • documentation: Improvements or additions to documentation
  • proto: Related to proto crate
