`GetColumnsFromCsvFileJob`: use join/split instead of unicde_escape encoding by Icemole · Pull Request #649 · rwth-i6/i6_core

Icemole · 2026-03-16T08:09:03Z

I was too optimistic/naive when encoding a string with unicode_escape. This doesn't work as intended for text in many languages and scripts (CJK, Spanish, Polish, etc.) that contain non-ASCII characters.

The intended behavior is that only newlines are escaped in the string.

The actual behavior is that all non-ASCII characters are escaped as well. This is a problem when other steps of the pipeline, such as LM training or SPM application, depend on the output of the job. Some examples:

>>> "哔哩哔哩".encode("unicode_escape").decode("utf-8")
'\\u54d4\\u54e9\\u54d4\\u54e9'
>>> "un dólar".encode("unicode_escape").decode("utf-8")
'un d\\xf3lar'

In this PR I propose an easy method to remove extra spaces/newlines, which was the main point of the original discussion about this job (source).
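A minimal sketch of the join/split idea named in the PR title, assuming the goal is to collapse every run of whitespace (including newlines) into a single space. `normalize_whitespace` is a hypothetical helper name used for illustration, not the job's actual code:

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    # str.split() without arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with " " yields one clean line.
    return " ".join(text.split())


examples = ["哔哩哔哩\n哔哩", "un  dólar \n más"]
normalized = [normalize_whitespace(s) for s in examples]
```

Unlike unicode_escape, this leaves non-ASCII characters untouched. It is lossy, though: the original newlines cannot be recovered from the output.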

I assume that nobody has used it yet except for me, which is why I'm freely modifying the job. If you have any issues with this, please let me know.

michelwi · 2026-03-16T08:15:51Z

> I assume that nobody has used it yet except for me, which is why I'm freely modifying the job.

No objection, especially since it fixes a bug.

How about adding some unit tests for this job to verify it works correctly? ;)

albertz

See my comments.

Icemole · 2026-03-18T10:24:56Z

I've now used json.dumps successfully.

There was a slight issue, which is that since it encloses the final string within double quotes, it escapes the double quotes internally. I've added some basic string postprocessing to handle that.

Thanks to both of you for the feedback!

albertz · 2026-03-18T10:28:22Z

> There was a slight issue, which is that since it encloses the final string within double quotes, it escapes the double quotes internally. I've added some basic string postprocessing to handle that.

But that's exactly what I mean. Do not do that. Just leave it like it is. The output format should be as simple as possible. The less custom it is, the better. Ideally not custom at all. Ideally it should be trivial to parse it, without any chance to get it wrong, no edge cases.
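To illustrate the "leave it like it is" position, here is a small sketch of a round trip through unmodified json.dumps output. The helper names and the use of `ensure_ascii=False` are assumptions made for this example; the thread does not say which arguments the job passes:

```python
import json


def encode_line(text: str) -> str:
    """Encode an arbitrary string as one JSON string literal."""
    # ensure_ascii=False keeps non-ASCII characters readable (an assumption
    # here; the default would emit \uXXXX escapes instead).
    return json.dumps(text, ensure_ascii=False)


def decode_line(line: str) -> str:
    """Recover the original string with the standard JSON parser."""
    return json.loads(line)


original = 'He said "哔哩",\nthen left'
encoded = encode_line(original)
restored = decode_line(encoded)
```

The encoded value contains no literal newline, inner double quotes appear as `\"`, and `json.loads` recovers the original string without any custom postprocessing.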

Icemole · 2026-03-18T11:04:56Z

> Do not do that. Just leave it like it is.

I hate having additional escape tokens in the output. However, my head hurts from thinking about escaping, so I guess I'll write a disclaimer in the job's docstring and let the user address this.

michelwi · 2026-03-18T12:28:28Z

Why are we doing such custom format changes anyway?

> The output format should be as simple as possible.

I agree. Therefore my suggestion for the GetColumnsFromCsvFileJob: get columns from the CSV file (and do not change the format), i.e. use csv.reader(path, delimiter=self.delimiter) to read the file and then use a couple of csv.writer(out_path, delimiter=self.delimiter) to write the columns.

This will not solve your problem with formatting of newlines, but it will be the least surprise for anyone using the Job in the future. You may then have a custom ChangeSingleColumnCsvToEscapedJsonWithoutQuotesJob that does whatever escaping/processing you deem necessary.
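The suggestion above could look roughly like the following sketch. `split_columns` and its parameters are hypothetical names, not the job's interface. Note that `csv.reader` and `csv.writer` take open file objects rather than paths, so the files are opened explicitly:

```python
import csv
from typing import Sequence


def split_columns(in_path: str, out_paths: Sequence[str], delimiter: str = ",") -> None:
    """Write column i of the CSV at in_path to out_paths[i], still as CSV."""
    # newline="" is what the csv module documentation recommends, so that
    # newlines embedded in quoted fields are handled by the csv module itself.
    with open(in_path, newline="", encoding="utf-8") as in_file:
        rows = list(csv.reader(in_file, delimiter=delimiter))
    for index, out_path in enumerate(out_paths):
        with open(out_path, "w", newline="", encoding="utf-8") as out_file:
            writer = csv.writer(out_file, delimiter=delimiter)
            # Each output row holds exactly one field: the selected column.
            writer.writerows([row[index]] for row in rows)
```

Because the output is still CSV, fields with embedded newlines or quote characters stay multi-line and quoted, and any CSV parser can read them back. This sketch loads all rows into memory, which may not suit very large files.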

albertz · 2026-03-18T12:35:30Z

> > Do not do that. Just leave it like it is.
>
> I hate having additional escape tokens in the output. However, my head hurts from thinking about escaping, so I guess I'll write a disclaimer in the job's docstring and let the user address this.

This is what I summarized before as: "this is just any kind of data with a new custom format, but figure out yourself what it looks like"

I would say we should avoid it. It should be possible to somehow clearly define the data format that you have in the end. If you cannot even define that, then let's not do it this way. Do the formatting in a way that the output format is totally clear, and there is a very simple way to parse it, and you can be 100% sure that there will not be any edge cases which are not handled properly.

I suggested one possible variant, via the JSON string but without the leading/trailing quotes, which is already a bit weird but close to what you have right now. The problem you have is that you need some format which can encode any arbitrary string into a single line. That's what you need.
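A sketch of the variant described above (JSON string without the surrounding quotes), under the assumption that `ensure_ascii=False` is wanted so that non-ASCII text stays readable. The function names are hypothetical:

```python
import json


def encode_unquoted(text: str) -> str:
    """JSON-encode text and drop the enclosing double quotes."""
    # json.dumps of a str always starts and ends with '"', so slicing off
    # the first and last character removes exactly those two quotes.
    return json.dumps(text, ensure_ascii=False)[1:-1]


def decode_unquoted(line: str) -> str:
    """Inverse of encode_unquoted: re-add the quotes and parse as JSON."""
    return json.loads(f'"{line}"')


samples = ["哔哩哔哩", "un dólar", 'quote " and \\ backslash', "two\nlines", ""]
encoded = [encode_unquoted(s) for s in samples]
decoded = [decode_unquoted(e) for e in encoded]
```

The format stays fully defined ("a JSON string literal minus its outer quotes"), each encoded value contains no literal `\n`, and parsing needs only the standard JSON decoder.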

Use join/split instead of unicde_escape encoding

65af9e8

Icemole requested review from albertz, curufinwe, michelwi and sarahberanek March 16, 2026 08:09

Icemole self-assigned this Mar 16, 2026

michelwi reviewed Mar 16, 2026

text/csv.py (outdated)

Remove obsolete comment

3f09d24

albertz reviewed Mar 16, 2026

text/csv.py (outdated)

Icemole added 3 commits March 16, 2026 06:15

Escape csv newline

e9a6266

Add unit test

53e91fd

Remove redundant module from testing

8c62996

albertz reviewed Mar 16, 2026

tests/job_tests/text/files/csv/input_file.csv (outdated)

albertz reviewed Mar 16, 2026

text/csv.py (outdated)

Icemole added 3 commits March 16, 2026 07:42

Add newline at end of file

ceafd98

Add more information concerning default behavior

6d4ba0f

Add test which contains csv quotechar

7407bc4

albertz requested changes Mar 16, 2026

Icemole added 2 commits March 18, 2026 06:21

Add more thorough test

bbf933c

More unicode characters (Chinese) and single quote + double quote + newline

Use json.dumps

a970845

Icemole requested review from albertz and michelwi March 18, 2026 10:22

albertz reviewed Mar 18, 2026

tests/job_tests/text/files/csv/input_file.csv

Icemole added 3 commits March 18, 2026 06:52

Add more test cases

4581c65

Don't do quote postprocessing

a8408e1

Add disclaimer

122b32e

Icemole requested a review from albertz March 18, 2026 12:01

Icemole wants to merge 13 commits into main from get-columns-from-csv-job-unicode-fix
