
Description
Continuing here from #2104, because this is related but independent.
I altered c-parser so that parser.pyx decodes all strings into unicode objects with utf-8 (b16b24b).
Since any source encoding can be transcoded into utf-8, this is a general solution for getting c-parser to always return unicode. Tested on py2.
Data used:
- 1M-latin.csv was generated from unicode_series.csv (in tests/data) by replicating it to yield 1 million lines (roughly 32MB). Each row contains a number and a longish string, so the overhead of encoding is most pronounced in this case.
- zeros.csv (10MB) and matrix.csv (150MB) are from the recent blog post benchmarking c-parser; they contain just integer/float numbers.
Code versions tested:
- c-parser: current c-parser branch [380f6e6]
- c-parser-t: same, but the file is routed through codecs and encoded into utf-8 as part of the test (1M-latin using latin-1->utf-8, the other two ascii->utf-8, just for uniformity).
- c-parser-u-from-ascii: with b16b24b, tested against utf-8/ascii files, so no transcoding is needed; all strings (actually none for the files tested) are decoded with utf-8 to yield unicode objects.
- c-parser-u: with b16b24b; the files are transcoded, and the parser decodes all strings using utf-8 to yield unicode objects.
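The transcoding step used by the -t and -u variants can be sketched like this; the file names and sample content are hypothetical, but the codecs-based latin-1 -> utf-8 rewrite is the idea:

```python
import codecs
import io
import os
import tempfile

# Hypothetical paths; the real benchmark transcoded 1M-latin.csv.
latin_path = os.path.join(tempfile.gettempdir(), "latin.csv")
utf8_path = os.path.join(tempfile.gettempdir(), "latin-utf8.csv")

with io.open(latin_path, "w", encoding="latin-1") as f:
    f.write(u"1,caf\xe9\n2,na\xefve\n")

# Route the file through codecs: read as latin-1, write back as utf-8,
# so the parser can then decode every string with utf-8 unconditionally.
with codecs.open(latin_path, "r", encoding="latin-1") as src_f, \
     codecs.open(utf8_path, "w", encoding="utf-8") as dst_f:
    for line in src_f:
        dst_f.write(line)
```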
How:
Using IPython's %timeit, which gives the best of 3 runs.
Results:
| | c-parser | c-parser-t | c-parser-u-from-ascii | c-parser-u |
|---|---|---|---|---|
| zeros.csv | 717 ms | 768 ms | 724 ms | 777 ms |
| matrix.csv | 2.36 sec | 3.17 sec | 2.37 sec | 3.21 sec |
| 1M-latin.csv | 427 ms | 558 ms | N/A | 570 ms |
Conclusions:
- There's a performance hit, but the result is still very respectable, and much better than the 10X hit in Unicode III : revenge of the character planes #2104, which is forced to traverse the entire dataset, checking the type of each element.
- Most of the performance hit is due to the transcoding process, not the decoding into utf-8. Since that step is unnecessary when the data file is already encoded in utf-8 (and pure ascii fits into that category), the extra work is somewhat justified, and performance is still very competitive even with large files.
- The cost of returning unicode by default is virtually nil when transcoding isn't needed, even when the file contains mostly strings.
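The reason pure ascii needs no transcoding is that ASCII is a strict byte-level subset of UTF-8, so ascii-encoded bytes decode with utf-8 unchanged. A minimal illustration:

```python
# ASCII bytes are valid UTF-8 as-is, so decoding with utf-8 yields the same
# text with no transcoding step; this is why the ascii test files skip it.
raw = b"foo,bar\n1,2\n"
decoded = raw.decode("utf-8")
```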