Skip to content

GetColumnsFromCsvFileJob: use join/split instead of unicde_escape encoding#649

Open
Icemole wants to merge 13 commits intomainfrom
get-columns-from-csv-job-unicode-fix
Open

GetColumnsFromCsvFileJob: use join/split instead of unicde_escape encoding#649
Icemole wants to merge 13 commits intomainfrom
get-columns-from-csv-job-unicode-fix

Conversation

@Icemole
Copy link
Collaborator

@Icemole Icemole commented Mar 16, 2026

I was too optimistic/naive when encoding a string with unicode_escape. This doesn't work as intended for many languages such as CJK, Spanish, Polish, etc etc, which have Unicode tokens.

The intended way is that only newlines are actually kept in the string.

The actual way is that many other Unicode tokens are also escaped in the final string. This is not good when the output of the job depends on other steps of the pipeline, such as LM training or SPM application. Some examples:

>>> "哔哩哔哩".encode("unicode_escape").decode("utf-8")
'\\u54d4\\u54e9\\u54d4\\u54e9'
>>> "un dólar".encode("unicode_escape").decode("utf-8")
'un d\\xf3lar'

In this PR I propose an easy method to remove extra spaces/newlines, which was the main discussion that this job originally had (source).

I assume that nobody has used it yet except for me, which is why I'm freely modifying the job. If you have any issues with this, please let me know.

@Icemole Icemole self-assigned this Mar 16, 2026
@michelwi
Copy link
Contributor

I assume that nobody has used it yet except for me, which is why I'm freely modifying the job.

no objection. especially since it fixes a bug.

How about adding some unittests for this job to verify it works correctly? ;)

Copy link
Member

@albertz albertz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments.

Icemole added 2 commits March 18, 2026 06:21
More unicode characters (Chinese) and single quote + double quote + newline
@Icemole Icemole requested review from albertz and michelwi March 18, 2026 10:22
@Icemole
Copy link
Collaborator Author

Icemole commented Mar 18, 2026

I've now used json.dumps successfully.

There was a slight issue, which is that since it encloses the final string within double quotes, it escapes the double quotes internally. I've added some basic string postprocessing to handle that.

Thanks for the feedback to both!

@albertz
Copy link
Member

albertz commented Mar 18, 2026

There was a slight issue, which is that since it encloses the final string within double quotes, it escapes the double quotes internally. I've added some basic string postprocessing to handle that.

But that's exactly what I mean. Do not do that. Just leave it like it is. The output format should be as simple as possible. The less custom it is, the better. Ideally not custom at all. Ideally it should be trivial to parse it, without any chance to get it wrong, no edge cases.

@Icemole
Copy link
Collaborator Author

Icemole commented Mar 18, 2026

Do not do that. Just leave it like it is.

I hate having additional escape tokens in the output. However, my head hurts from thinking about escaping, so I guess I'll write a disclaimer in the job's docstring and let the user address this.

@Icemole Icemole requested a review from albertz March 18, 2026 12:01
@michelwi
Copy link
Contributor

michelwi commented Mar 18, 2026

Why are we doing such custom format changes anyway?

The output format should be as simple as possible.

I agree. Therefore my suggestion for the GetColumnsFromCsvFileJob: get columns from csv file (and do not change the format). i.e. use csv.reader(path, delimiter=self.delimiter) to read the file and then use a couple of csv.writer(out_path, delimiter=self.delimiter) to write the columns.

This will not solve your problem with formatting of newlines, but it will be the least surprise for anyone using the Job in the future. You may then have a custom ChangeSingleColumnCsvToEscapedJsonWithoutQuotesJob that does whatever escaping/processing you deem necessary.

@albertz
Copy link
Member

albertz commented Mar 18, 2026

Do not do that. Just leave it like it is.

I hate having additional escape tokens in the output. However, my head hurts from thinking about escaping, so I guess I'll write a disclaimer in the job's docstring and let the user address this.

This is what I summarized before as: "this is just any kind of data with a new custom format but figure out yourself how it looks like"

I would say we should avoid it. It should be possible to somehow clearly define the data format that you have in the end. If you cannot even define that, then let's not do it this way. Do the formatting in a way that the output format is totally clear, and there is a very simple way to parse it, and you can be 100% sure that there will not be any edge cases which are not handled properly.

I suggested one possible variant, via the JSON string, but without the leading/trailing quotes. Which is already a bit weird, but close to what you have right now. The problem you have is that you need some format which can encode any random string into a single line. That's what you need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants