
Error reading non UTF-8 headers even after specifying encoding #298

@alipatti


pyreadstat raises a UnicodeDecodeError when reading this file or this file. Haven reads both successfully, which makes me think the problem is in pyreadstat rather than in readstat itself.

It appears that there are Windows-1252 encoded "smart quotes" in the header that pyreadstat is trying to decode as UTF-8. Passing encoding="..." to read_dta has no effect.
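To illustrate the mismatch described above, here is a minimal standard-library sketch. The sample string is made up for illustration; only the 0x93 byte value comes from the report. It shows that smart quotes encoded as Windows-1252 become the single bytes 0x93 and 0x94, which cannot start a UTF-8 sequence.

```python
# Smart quotes: U+201C (left) and U+201D (right).
left, right = "\u201c", "\u201d"

# Illustrative sample text, not taken from the file.
raw = f"{left}Quantity Discounts{right}".encode("cp1252")
print(raw)

error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x93 is a UTF-8 continuation byte, so it is rejected as a start byte.
    error = exc
    print(exc)

# Decoding with the encoding that was actually used round-trips cleanly.
decoded = raw.decode("cp1252")
print(decoded)
```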

To reproduce:

# get file
curl -O http://www.principlesofeconometrics.com/stata/cocaine.dta

# fails to read
python -c '
import pyreadstat
# try the default and both encoding spellings; report each failure
for kwargs in ({}, {"encoding": "WINDOWS-1252"}, {"encoding": "CP1252"}):
    try:
        pyreadstat.read_dta("cocaine.dta", **kwargs)
    except UnicodeDecodeError as exc:
        print(kwargs, exc)
'

# successfully reads
python -c 'import pandas as pd; pd.read_stata("cocaine.dta")'
R -e 'haven::read_dta("cocaine.dta")'

Full stack trace:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import pyreadstat; pyreadstat.read_dta("cocaine.dta")
                       ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
  File "pyreadstat/pyreadstat.pyx", line 301, in pyreadstat.pyreadstat.read_dta
  File "pyreadstat/_readstat_parser.pyx", line 1176, in pyreadstat._readstat_parser.run_conversion
  File "pyreadstat/_readstat_parser.pyx", line 796, in pyreadstat._readstat_parser.handle_note
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 88: invalid start byte
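The trace ends in a strict UTF-8 decode of the note text. As a rough sketch of what a more tolerant decode could look like, the helper below tries a list of encodings in order. `decode_with_fallback` is a hypothetical name for illustration and is not part of pyreadstat or ReadStat. The sample bytes are abbreviated from the hex dump in this report.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Hypothetical helper: try each encoding in turn, then degrade gracefully."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the text readable and mark undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace")


# Bytes abbreviated from the hex dump in this report.
note = b"Caulkins, J.P. and R. Padman (1993), \x93Quantity Discounts"
result = decode_with_fallback(note)
print(result)
```

Trying UTF-8 first keeps valid UTF-8 files unchanged. The Windows-1252 fallback is an assumption that fits this file and may not fit others.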

The relevant part of cocaine.dta is shown below. Byte 0x93 is the Windows-1252 left double quotation mark (“); it is not valid UTF-8 on its own, but pyreadstat tries to decode it as UTF-8.

00000370  20 43 61 75 6c 6b 69 6e  73 2c 20 4a 2e 50 2e 20  | Caulkins, J.P. |
00000380  61 6e 64 20 52 2e 20 50  61 64 6d 61 6e 20 28 31  |and R. Padman (1|
00000390  39 39 33 29 2c 20 93 51  75 61 6e 74 69 74 79 20  |993), .Quantity | <- BAD LINE
000003a0  44 69 73 63 6f 75 6e 74  73 20 61 6e 64 20 51 75  |Discounts and Qu|
000003b0  61 6c 69 74 79 20 50 72  65 6d 69 61 20 66 6f 72  |ality Premia for|
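The flagged row of the dump can be checked directly with the standard library. This sketch copies the 16 hex bytes at offset 0x390 from the dump above, locates the 0x93 byte, and decodes the row as Windows-1252. The variable names are illustrative.

```python
# Hex bytes copied from the row at file offset 0x390 in the dump above.
row_offset = 0x390
row = "39 39 33 29 2c 20 93 51  75 61 6e 74 69 74 79 20"

raw = bytes.fromhex(row)       # whitespace between byte pairs is ignored
index = raw.index(0x93)        # position of the offending byte within the row
file_offset = row_offset + index

text = raw.decode("cp1252")    # decodes cleanly as Windows-1252
print(hex(file_offset), repr(text))
```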

Environment: installed from pip in a virtualenv, 64-bit Mac, Python 3.13, pyreadstat 1.30


Labels: bug (Something isn't working); requires changes in Readstat (waiting for changes in the C library Readstat)
