Benchmarks: Modifying c-parser to return unicode #2130

@ghost

Description

Continuing here from #2104, because this is related but independent.
I altered c-parser so that parser.pyx decodes all strings into unicode objects using utf-8 (b16b24b).
Since any source encoding can be transcoded into utf-8, this is a general solution for
getting c-parser to always return unicode. Tested on py2.

Data Used:

1M-latin.csv was generated from unicode_series.csv (in tests/data) by replicating it to yield 1 million lines (roughly 32MB). Each row contains a number and a longish string, so the overhead of encoding is most pronounced in this case.
zeros.csv (10MB) and matrix.csv (150MB) are from the recent blog post benchmarking c-parser. They contain just integer/float numbers.
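
For anyone wanting to rebuild a comparable test file, here is a minimal sketch of the replication step. The function name `replicate_csv`, the default `latin-1` encoding, and the decision to repeat every line (including any header row) are assumptions for illustration, not the script actually used to produce 1M-latin.csv.

```python
import io


def replicate_csv(src_path, dst_path, target_rows, encoding="latin-1"):
    """Repeat the lines of src_path until dst_path holds target_rows lines.

    Hypothetical helper: every line of the source is repeated as-is,
    so a header row (if any) would be repeated too.
    """
    # newline="" keeps the original line endings untouched.
    with io.open(src_path, "r", encoding=encoding, newline="") as src:
        rows = src.readlines()
    if not rows:
        raise ValueError("source file is empty")
    # Make sure the last source line is terminated before repeating it.
    if not rows[-1].endswith("\n"):
        rows[-1] += "\n"

    written = 0
    with io.open(dst_path, "w", encoding=encoding, newline="") as dst:
        while written < target_rows:
            for row in rows:
                if written >= target_rows:
                    break
                dst.write(row)
                written += 1
    return written
```

Calling `replicate_csv("unicode_series.csv", "1M-latin.csv", 1000000)` would then produce a file with one million lines in the same encoding as the source.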

Code versions tested:

  • c-parser: current c-parser branch [380f6e6]
  • c-parser-t: same, but the file is routed through codecs and transcoded into utf-8
    as part of the test (1M-latin using latin-1->utf-8, the other two ascii->utf-8, just for uniformity).
  • c-parser-u-from-ascii: with b16b24b, tested against utf-8/ascii files, so no transcoding is needed. All strings (actually none for the files tested) are decoded with utf-8 to yield unicode objects.
  • c-parser-u: with b16b24b, the files are transcoded, and the parser decodes all strings using
    utf-8 to yield unicode objects.
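
The two steps that distinguish these variants, transcoding the file to utf-8 and then decoding each string field into a unicode object, can be sketched with the standard library alone. This is an illustration under assumptions: `utf8_stream` and `decode_field` are hypothetical names, and the actual benchmark may have routed the file through codecs differently.

```python
import codecs
import io


def utf8_stream(fileobj, source_encoding):
    """Wrap a binary file object so that reads yield UTF-8 bytes.

    codecs.EncodedFile decodes what it reads using file_encoding and
    re-encodes it using data_encoding, i.e. it transcodes on the fly.
    """
    return codecs.EncodedFile(
        fileobj, data_encoding="utf-8", file_encoding=source_encoding
    )


def decode_field(raw_field):
    """Stand-in for the parser step: UTF-8 bytes -> unicode object."""
    return raw_field.decode("utf-8")


# A single latin-1 encoded row with a non-ASCII character.
latin1_bytes = u"1,caf\u00e9\n".encode("latin-1")

# Step 1 (the "-t" variants): transcode latin-1 -> utf-8.
utf8_bytes = utf8_stream(io.BytesIO(latin1_bytes), "latin-1").read()

# Step 2 (the "-u" variants): decode every field into unicode.
fields = [decode_field(f) for f in utf8_bytes.strip().split(b",")]
```

For a file that is already utf-8 or pure ascii, step 1 can be skipped and only the per-field decode remains, which is the c-parser-u-from-ascii case.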

How:
Using IPython's %timeit, which reported the best of 3 runs.
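
Outside of IPython, roughly the same measurement can be taken with the standard `timeit` module. The helper below is a sketch: `best_of` is a hypothetical name, and `number=1` is an assumption that suits parses taking hundreds of milliseconds or more (%timeit picks its own loop count).

```python
import timeit


def best_of(func, repeat=3, number=1):
    """Return the best (minimum) time per call, in seconds.

    func is any zero-argument callable, e.g. a lambda that parses one
    of the benchmark files.
    """
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


# Stand-in workload; replace with the actual parse call being measured.
elapsed = best_of(lambda: sum(range(100000)))
```

Taking the minimum rather than the mean follows the usual reasoning that slower runs reflect interference from other processes and not the code under test.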

Results:

              c-parser  c-parser-t  c-parser-u-from-ascii  c-parser-u
zeros.csv     717 ms    768 ms      724 ms                 777 ms
matrix.csv    2.36 sec  3.17 sec    2.37 sec               3.21 sec
1M-latin.csv  427 ms    558 ms      N/A                    570 ms
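
To make the relative cost easier to read off, the timings above can be converted into percentage overhead against the plain c-parser column. The numbers are copied from the results (converted to milliseconds); the helper name `overhead_pct` is only for illustration.

```python
def overhead_pct(baseline_ms, value_ms):
    """Slowdown of value_ms relative to baseline_ms, in percent."""
    return (value_ms - baseline_ms) / float(baseline_ms) * 100.0


# (c-parser, c-parser-t, c-parser-u-from-ascii, c-parser-u), in ms.
results = {
    "zeros.csv": (717, 768, 724, 777),
    "matrix.csv": (2360, 3170, 2370, 3210),
    "1M-latin.csv": (427, 558, None, 570),  # None marks the N/A cell
}

overheads = {}
for name, (base, transcoded, u_ascii, u_full) in results.items():
    overheads[name] = {
        "c-parser-t": round(overhead_pct(base, transcoded), 1),
        "c-parser-u-from-ascii": (
            None if u_ascii is None else round(overhead_pct(base, u_ascii), 1)
        ),
        "c-parser-u": round(overhead_pct(base, u_full), 1),
    }
```

This gives about 7.1% / 1.0% / 8.4% for zeros.csv, 34.3% / 0.4% / 36.0% for matrix.csv, and 30.7% / N/A / 33.5% for 1M-latin.csv: nearly all of the slowdown already appears in the transcode-only column.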

Conclusions:

  • There's a performance hit, but the result is still very respectable,
    and much better than the 10X hit in #2104 (Unicode III : revenge of the character planes), which is forced to traverse the
    entire dataset, checking the type of each element.
  • Most of the performance hit is due to the transcoding process, not
    the decoding of utf-8 into unicode objects. Since the transcoding step is unnecessary when the data file
    is already encoded in utf-8 (and pure ascii fits into that category),
    that extra work is somewhat justified, and performance is still very competitive even
    with large files.
  • The cost of returning unicode by default is virtually nil when transcoding isn't needed,
    as the files tested here show (note that they contain only numbers, so no strings were actually decoded in that case).

Metadata

Labels: IO Data (IO issues that don't fit into a more specific label), Unicode (Unicode strings)
