
Description
Continuing here from #2104, because this is related but independent.
I altered c-parser so that parser.pyx decodes all strings into unicode objects with utf-8 (b16b24b).
Since any source encoding can be transcoded into utf-8, this is a general solution for getting c-parser to always return unicode. Tested on py2.
Data used:
- 1M-latin.csv was generated from unicode_series.csv (in tests/data) by replicating it to yield 1 million lines (roughly 32MB). Each row contains a number and a longish string, so the overhead of encoding is most pronounced in this case.
- zeros.csv (10MB) and matrix.csv (150MB) are from the recent blog post benchmarking c-parser; they contain just integer/float numbers.
Code versions tested:
- c-parser: current c-parser branch [380f6e6]
- c-parser-t: same, but the file is routed through codecs and encoded into utf-8 as part of the test (1M-latin using latin-1->utf-8, the other two ascii->utf-8, just for uniformity).
- c-parser-u-from-ascii: with b16b24b, tested against utf-8/ascii files, so no transcoding is needed; all strings (actually none for the files tested) are decoded with utf-8 to yield unicode objects.
- c-parser-u: with b16b24b; the files are transcoded, and the parser decodes all strings using utf-8 to yield unicode objects.
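The transcoding step used by the -t and -u variants can be sketched like this; the file names and sample content are hypothetical, but the codecs-based latin-1 -> utf-8 rewrite is the idea:

```python
import codecs
import io
import os
import tempfile

# Hypothetical paths; the real benchmark transcoded 1M-latin.csv.
latin_path = os.path.join(tempfile.gettempdir(), "latin.csv")
utf8_path = os.path.join(tempfile.gettempdir(), "latin-utf8.csv")

with io.open(latin_path, "w", encoding="latin-1") as f:
    f.write(u"1,caf\xe9\n2,na\xefve\n")

# Route the file through codecs: read as latin-1, write back as utf-8,
# so the parser can then decode every string with utf-8 unconditionally.
with codecs.open(latin_path, "r", encoding="latin-1") as src_f, \
     codecs.open(utf8_path, "w", encoding="utf-8") as dst_f:
    for line in src_f:
        dst_f.write(line)
```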
How:
Using IPython's %timeit, which gives the best of 3 runs.
Results:
| | c-parser | c-parser-t | c-parser-u-from-ascii | c-parser-u |
|---|---|---|---|---|
| zeros.csv | 717 ms | 768 ms | 724 ms | 777 ms |
| matrix.csv | 2.36 sec | 3.17 sec | 2.37 sec | 3.21 sec |
| 1M-latin.csv | 427 ms | 558 ms | N/A | 570 ms |
Conclusions:
- There's a performance hit, but the result is still very respectable, and much better than the 10X hit in Unicode III : revenge of the character planes #2104, which is forced to traverse the entire dataset, checking the type of each element.
- Most of the performance hit is due to the transcoding process, not the decoding into utf-8. Since that step is unnecessary when the data file is already encoded in utf-8 (and pure ascii fits into that category), the extra work is somewhat justified, and performance is still very competitive even with large files.
- The cost of returning unicode by default is virtually nil when transcoding isn't needed, even when the file contains mostly strings.
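The reason pure ascii needs no transcoding is that ASCII is a strict byte-level subset of UTF-8, so ascii-encoded bytes decode with utf-8 unchanged. A minimal illustration:

```python
# ASCII bytes are valid UTF-8 as-is, so decoding with utf-8 yields the same
# text with no transcoding step; this is why the ascii test files skip it.
raw = b"foo,bar\n1,2\n"
decoded = raw.decode("utf-8")
```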