GetColumnsFromCsvFileJob: use join/split instead of unicode_escape encoding #649

Conversation
No objection, especially since it fixes a bug. How about adding some unit tests for this job to verify it works correctly? ;)
More Unicode characters (Chinese), and single quote + double quote + newline.
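A minimal test along those lines might look as follows. This is only a sketch: `escape_newlines` is a hypothetical stand-in for whatever escaping the job ends up using, not the job's real API, and it covers the suggested cases (Chinese characters, quotes, newlines).

```python
import unittest


def escape_newlines(text: str) -> str:
    # Hypothetical stand-in for the job's escaping: encode each newline as
    # the two characters '\' + 'n' via split/join, leaving Unicode untouched.
    return "\\n".join(text.split("\n"))


class TestEscapeNewlines(unittest.TestCase):
    def test_unicode_kept(self):
        # Chinese characters must pass through unchanged.
        self.assertEqual(escape_newlines("中文"), "中文")

    def test_newline_escaped(self):
        self.assertEqual(escape_newlines("a\nb"), "a\\nb")

    def test_quotes_untouched(self):
        # Single and double quotes need no extra escaping in this scheme.
        self.assertEqual(escape_newlines("'\"\n"), "'\"\\n")
```

Run with `python -m unittest` in the directory containing the test file.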
I've now used it. There was a slight issue: since it encloses the final string within double quotes, it escapes any double quotes internally. I've added some basic string postprocessing to handle that. Thanks to both of you for the feedback!
But that's exactly what I mean: do not do that. Just leave it as it is. The output format should be as simple as possible. The less custom it is, the better; ideally not custom at all. Ideally it should be trivial to parse, with no chance of getting it wrong and no edge cases.
I hate having additional escape tokens in the output. However, my head hurts from thinking about escaping, so I guess I'll write a disclaimer in the job's docstring and let the user address this.
Why are we doing such custom format changes anyway?
I agree. Hence my suggestion for the GetColumnsFromCsvFileJob: get the columns from the CSV file, and do not change the format, i.e. use [...]. This will not solve your problem with the formatting of newlines, but it will be the least surprising for anyone using the job in the future. You may then have a custom ChangeSingleColumnCsvToEscapedJsonWithoutQuotesJob that does whatever escaping/processing you deem necessary.
This is what I summarized before as: "this is just any kind of data with a new custom format but figure out yourself how it looks like". I would say we should avoid that. It should be possible to clearly define the data format that you end up with; if you cannot even define that, then let's not do it this way. Do the formatting in such a way that the output format is totally clear, there is a very simple way to parse it, and you can be 100% sure that there are no unhandled edge cases. I suggested one possible variant, via the JSON string but without the leading/trailing quotes. That is already a bit weird, but close to what you have right now. The core problem is that you need a format which can encode any arbitrary string into a single line.
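The JSON-string variant described above could be sketched like this. Note this is only an illustration of the idea, not the actual job code; `encode_line`/`decode_line` are hypothetical names.

```python
import json


def encode_line(text: str) -> str:
    """Encode an arbitrary string into a single line: take its JSON string
    representation and strip the leading/trailing double quotes."""
    quoted = json.dumps(text, ensure_ascii=False)  # e.g. '"a\\nb"'
    return quoted[1:-1]  # drop the surrounding double quotes


def decode_line(line: str) -> str:
    """Inverse: re-add the quotes and parse as a JSON string."""
    return json.loads('"%s"' % line)


# Newlines become the two characters '\' + 'n'; non-ASCII text stays intact.
original = '中文 "quoted"\nsecond line'
encoded = encode_line(original)
assert "\n" not in encoded            # guaranteed to be a single line
assert decode_line(encoded) == original  # lossless round trip
```

The round trip is lossless by construction, since JSON string escaping is fully specified; the only custom part is dropping and re-adding the outer quotes.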
I was too optimistic/naive when encoding a string with unicode_escape. This does not work as intended for many languages such as CJK, Spanish, Polish, etc., which contain non-ASCII Unicode code points. The intended behavior is that only the newlines are escaped in the string; in practice, many other Unicode characters are also escaped in the final string. This is not good when the output of the job feeds into other steps of the pipeline, such as LM training or SPM application. Some examples:
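To illustrate the problem (a reconstruction, not the original examples from the PR): `unicode_escape` escapes every non-ASCII code point, whereas a simple split/join on newlines leaves the text intact.

```python
# unicode_escape escapes *all* non-ASCII characters, not just newlines:
escaped = "中文\nabc".encode("unicode_escape").decode("ascii")
print(escaped)  # \u4e2d\u6587\nabc  -- the Chinese characters are gone

# The split/join approach from this PR only touches the newlines:
joined = "\\n".join("中文\nabc".split("\n"))
print(joined)  # 中文\nabc  -- Unicode preserved, newline encoded as '\' + 'n'
```

The escaped form would feed literal `\u4e2d`-style tokens into downstream LM/SPM steps, which is exactly the failure mode described above.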
In this PR I propose a simple method that only removes extra spaces/newlines, which was the main point of the original discussion around this job (source).
I assume that nobody has used it yet except for me, which is why I'm freely modifying the job. If you have any issues with this, please let me know.